Building a 24/7 error handling system: what happens when your automation breaks at 3am.
Operations 9 min read · 551 words

Your automation will break. The question is whether it breaks silently and damages your business, or breaks loudly and gets fixed before anyone notices.

Purist

March 2026

Every automation will fail. This is not pessimism; it is statistics. The external APIs you depend on will return unexpected errors. The data you process will occasionally arrive malformed. The webhook you expect will not arrive. The n8n instance will restart during a long-running execution. The question is not whether your automation will fail; it is whether, when it fails, you know about it before your business does.

Silent failures are the most expensive failures in workflow automation. A silent failure is one where the automation stops working (stops creating records, stops sending confirmations, stops routing leads) but generates no alert, no error notification, and no visible indication that anything has gone wrong. The business continues operating under the assumption that the automation is running. Days or weeks later, someone notices that the CRM has not been updated, or that clients have not received their onboarding emails, or that a payment has not been followed up. By then, the damage (lost leads, unhappy clients, missed revenue) has already occurred.

The PURIST approach to error handling in n8n is layered. The first layer is node-level error routing: every node in every production workflow has an error route defined. When a node fails, the error route triggers rather than terminating the workflow silently. The error route captures the error message, the node name, the input data that caused the failure, and the workflow execution ID, and routes this information to a centralized error logging workflow.
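To make the first layer concrete, here is a minimal sketch in plain Python rather than n8n itself. It shows the shape of the pattern: a step runs, and if it fails, a structured error event goes to a central sink instead of the failure disappearing. The names `ErrorEvent`, `run_node_with_error_route`, and `error_sink` are hypothetical and chosen for illustration; in n8n the equivalent wiring is done with the platform's own error-handling features.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, Callable


@dataclass
class ErrorEvent:
    """The fields the error route captures, per the description above."""
    workflow_name: str
    node_name: str
    execution_id: str
    error_message: str
    input_payload: Any
    timestamp: str


def run_node_with_error_route(
    workflow_name: str,
    node_name: str,
    execution_id: str,
    node_fn: Callable[[Any], Any],
    input_payload: Any,
    error_sink: Callable[[ErrorEvent], None],
) -> Any:
    """Run one step; on failure, forward an ErrorEvent to the central sink.

    Returns the step's result, or None if the step failed and was routed.
    """
    try:
        return node_fn(input_payload)
    except Exception as exc:  # any failure goes to the error route
        error_sink(
            ErrorEvent(
                workflow_name=workflow_name,
                node_name=node_name,
                execution_id=execution_id,
                error_message=str(exc),
                input_payload=input_payload,
                timestamp=datetime.now(timezone.utc).isoformat(),
            )
        )
        return None
```

The point of the design is that the failing input travels with the error, so whoever picks up the alert can reproduce the problem without digging through execution history.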

The second layer is the centralized error logging workflow. This single workflow receives error events from every other workflow in the system, writes them to a structured error log (we use a Postgres table with columns for workflow_name, node_name, error_type, error_message, input_payload, timestamp, and resolved status), and sends a Slack notification to the operations channel with the relevant details. The Slack notification uses a structured format: workflow name, failure time, error type, and a direct link to the n8n execution log for debugging. A P1 error (one affecting a payment flow, a patient booking, or a client-facing communication) triggers both a Slack alert and an SMS to the on-call engineer.
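The routing decision in this layer can be sketched in a few lines of Python. This is an illustration under assumptions, not our production workflow: the tag names in `P1_TAGS`, the "P2" label for everything that is not P1, and the function names are all hypothetical. The execution link is passed in as a parameter rather than constructed, since its format depends on your n8n instance.

```python
from __future__ import annotations

# Hypothetical workflow tags that mark a flow as P1 (payment flows,
# patient bookings, client-facing communication).
P1_TAGS = {"payment", "patient-booking", "client-communication"}


def classify_severity(workflow_tags: set[str]) -> str:
    """Return "P1" if the workflow carries any P1 tag, else "P2" (assumed label)."""
    return "P1" if workflow_tags & P1_TAGS else "P2"


def alert_channels(severity: str) -> list[str]:
    """P1 errors go to Slack and SMS; everything else goes to Slack only."""
    return ["slack", "sms"] if severity == "P1" else ["slack"]


def format_slack_message(
    workflow_name: str,
    failed_at: str,
    error_type: str,
    execution_url: str,
    severity: str,
) -> str:
    """Build the structured notification: name, time, error type, log link."""
    return "\n".join(
        [
            f"[{severity}] Workflow failed: {workflow_name}",
            f"Failure time: {failed_at}",
            f"Error type: {error_type}",
            f"Execution log: {execution_url}",
        ]
    )
```

Keeping the severity rule in one place means that adding a new P1 category is a one-line change instead of an edit to every workflow.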

The third layer is health monitoring: proactive checks that detect not just failures but degraded performance. A monitoring workflow runs every 15 minutes and queries the error log database for any workflow that has had an error rate above 5% in the past hour, any workflow that has not executed in its expected interval (flagging stuck or stalled workflows), and any workflow whose processing time has exceeded twice its historical average (flagging performance degradation). These health checks catch the failure modes that node-level error routing misses: the workflow that is running but slowly, and the workflow that is not running at all because its trigger stopped firing.
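The three checks reduce to simple comparisons once the numbers are in hand. The sketch below assumes the per-workflow statistics have already been pulled from the error log and execution history; the `WorkflowStats` structure, its field names, and the flag strings are hypothetical. The thresholds (5% error rate, twice the historical average) are the ones described above.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timedelta

ERROR_RATE_THRESHOLD = 0.05  # more than 5% errors in the past hour
SLOWDOWN_FACTOR = 2.0        # more than twice the historical average


@dataclass
class WorkflowStats:
    """Hypothetical per-workflow snapshot assembled by the monitoring run."""
    name: str
    executions_last_hour: int
    errors_last_hour: int
    last_executed_at: datetime
    expected_interval: timedelta
    recent_avg_seconds: float
    historical_avg_seconds: float


def health_flags(stats: WorkflowStats, now: datetime) -> list[str]:
    """Return the health problems detected for one workflow."""
    flags = []

    # Check 1: error rate over the past hour (skip if nothing ran).
    if stats.executions_last_hour > 0:
        rate = stats.errors_last_hour / stats.executions_last_hour
        if rate > ERROR_RATE_THRESHOLD:
            flags.append("high_error_rate")

    # Check 2: the workflow has not run within its expected interval.
    if now - stats.last_executed_at > stats.expected_interval:
        flags.append("stalled")

    # Check 3: recent processing time exceeds twice the historical average.
    if (
        stats.historical_avg_seconds > 0
        and stats.recent_avg_seconds > SLOWDOWN_FACTOR * stats.historical_avg_seconds
    ):
        flags.append("slow")

    return flags
```

Note that a workflow with zero executions in the past hour produces no error-rate flag at all, which is exactly why the stalled check exists alongside it.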

The fourth layer is the recovery runbook. Every production workflow has a documented recovery procedure: what does a human do when this workflow fails? Where is the data that needs to be manually processed? What is the rollback procedure? This documentation is not theoretical; we test recovery procedures quarterly against intentionally triggered failures. The automation that can be recovered in 10 minutes by anyone on the team is the automation that survives a 3am failure without consequences. The automation that only the original builder knows how to fix is an operational liability.
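One way to keep runbooks honest is to store them as structured records and check which ones are overdue for a drill. The sketch below is an assumption about how that could look, not a description of our tooling: the `Runbook` fields mirror the three questions above, and the 92-day window is a rough stand-in for "quarterly".

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date, timedelta

# Assumed approximation of one quarter.
QUARTER = timedelta(days=92)


@dataclass
class Runbook:
    """Hypothetical runbook record for one production workflow."""
    workflow_name: str
    manual_steps: list[str]      # what a human does when the workflow fails
    pending_data_location: str   # where unprocessed data can be found
    rollback_procedure: str      # how to undo a partial run
    last_tested: date            # date of the last recovery drill


def overdue_for_test(runbook: Runbook, today: date) -> bool:
    """True if the last recovery drill is more than a quarter old."""
    return today - runbook.last_tested > QUARTER
```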

Tags

error handling, monitoring, n8n, automation reliability, alerting, production, devops
The PURIST editorial team covers automation, AI agents, and operations strategy for businesses scaling with n8n, Make, and Claude AI.
