Nine times out of ten, when a client tells us "the integration broke last week," the integration didn't break. A webhook broke. The sender fired its payload, got a 500 or a timeout, retried twice, gave up. No one was watching the receiving end until a report came out wrong or a customer called.
Webhooks are the connective tissue between your CRM, payment processor, document platform, and portal. When they fail quietly, your data drifts and every downstream process inherits the rot. Here's how to design endpoints that surface problems early instead of hiding them.
Why Webhook Failures Hide
Webhooks are fire-and-forget from the sender's perspective. Stripe, HubSpot, DocuSign, Plaid: they all send the event, check for a 2xx response, and log the result on their side. If your endpoint returns a 500, most providers will retry with exponential backoff for a few hours to a few days, then give up and mark the event as permanently failed.
The critical gap: almost nobody audits the "failed webhook deliveries" tab. The sender knows it gave up. The receiver never saw the event. The operator has no idea anything is wrong until reconciliation breaks weeks later.
Treat every inbound webhook as a durable message, not a function call. Accept fast, acknowledge immediately, then process asynchronously with full visibility into what happened after the 200 was sent.
The Three Failure Modes
Every webhook outage we've diagnosed falls into one of three buckets.
Endpoint down. Your server is unreachable, your app crashed, your database is locked, or a deploy is in progress. The sender gets a timeout or connection refused and starts its retry loop. This is the easiest failure to detect because the sender sees it, but most teams never check the sender's dashboard.
Signature mismatch. The sender signs payloads with an HMAC or shared secret, and your verification fails. Usually this is because someone rotated the secret on one side but not the other, or because a proxy mutated the raw body before you computed the hash. You return a 401 or 403, many senders treat that as an intentional rejection and stop retrying, and the events vanish.
Body parse error. The payload shape changed. The provider added a field, changed a type from integer to string, or started sending null where they used to send an empty string. Your handler throws, you return a 500, and you keep returning 500 on every retry because the payload doesn't change. The sender eventually gives up after burning its retry budget on the same broken message.
A good endpoint distinguishes these in its logs and alerts on them differently. Signature mismatches and parse errors are usually bugs on your side; endpoint-down is usually infrastructure.
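To make that concrete, here's a minimal sketch of a Flask endpoint that verifies the signature against the raw request body and logs the two receiver-side failures separately. The header name, environment variable, and logging format are our illustration, not any specific provider's scheme; Stripe, HubSpot, and the rest each document their own signing details, so adapt the HMAC step to the provider you're integrating.

```python
import hashlib
import hmac
import json
import logging
import os

from flask import Flask, request

app = Flask(__name__)
log = logging.getLogger("webhooks")

# Assumption: the provider sends hex(HMAC-SHA256(raw body)) in this header.
SIGNATURE_HEADER = "X-Webhook-Signature"
SECRET = os.environ["WEBHOOK_SECRET"].encode()

@app.post("/webhooks/provider")
def receive():
    raw = request.get_data()  # the raw bytes, before any parsing or decoding
    expected = hmac.new(SECRET, raw, hashlib.sha256).hexdigest()
    provided = request.headers.get(SIGNATURE_HEADER, "")

    if not hmac.compare_digest(expected, provided):
        # Signature mismatch: almost always a rotated secret or a proxy that
        # rewrote the body. Log and alert on this separately from 5xx noise.
        log.warning("webhook signature_mismatch source=provider")
        return "", 401

    try:
        event = json.loads(raw)
    except ValueError:
        # Body parse error: the payload shape changed. Keep the raw body in
        # the log so the event isn't silently lost.
        log.error("webhook parse_error source=provider body=%r", raw[:500])
        return "", 400

    # ... hand `event` off for processing ...
    return "", 200
```

Note that this sketch still parses and rejects inside the endpoint; the two-stage pattern below moves that work out of the request path entirely.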
Idempotency and Retry Semantics
Every webhook sender will eventually deliver the same event twice. Network blips, timeouts where your 200 didn't make it back, retry storms after an outage: duplicates are guaranteed. Your handler has to be idempotent or you'll get duplicate invoices, duplicate contacts, duplicate journal entries.
The pattern is simple: the sender includes an event ID (Stripe calls it id, HubSpot calls it eventId, others call it delivery_id). Before you process, you check whether you've already handled that ID. If yes, return 200 immediately and do nothing. If no, process it and record the ID in a processed_webhook_events table with a unique constraint.
Two things to get right. First, record the ID in the same transaction as the processing work, because otherwise you can process successfully, crash before recording, and reprocess on retry. Second, return 200 fast on the duplicate check; don't make the sender wait while you re-validate everything.
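Here's a minimal sketch of that dedup, using SQLite to keep it self-contained. The table and column names are ours, not a standard; use whatever database already backs the integration, and the same unique-constraint-plus-single-transaction idea carries over.

```python
import sqlite3

conn = sqlite3.connect("webhooks.db")
conn.execute(
    """CREATE TABLE IF NOT EXISTS processed_webhook_events (
           event_id TEXT PRIMARY KEY,   -- the unique constraint does the dedup
           processed_at TEXT DEFAULT CURRENT_TIMESTAMP
       )"""
)

def handle_event(event: dict) -> None:
    event_id = event["id"]  # Stripe: id, HubSpot: eventId, others: delivery_id
    try:
        with conn:  # one transaction: the dedup insert and the work commit together
            conn.execute(
                "INSERT INTO processed_webhook_events (event_id) VALUES (?)",
                (event_id,),
            )
            apply_business_logic(event)
    except sqlite3.IntegrityError:
        # Already processed: the unique constraint fired. The caller returns
        # 200 immediately and nothing is reprocessed.
        return

def apply_business_logic(event: dict) -> None:
    ...  # create the invoice, update the contact, post the journal entry
```

Because the insert and the business logic share a transaction, a crash mid-processing rolls back the ID record too, and the retry gets a clean attempt.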
Dead-Letter Queues and Operator Alerts
The highest-leverage pattern we deploy is a two-stage receive. The endpoint does three things and three things only: verify the signature, write the raw payload to a webhook_events table, return 200. That's it. No parsing, no business logic, no external calls.
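A sketch of that receive-only endpoint, again in Flask with SQLite standing in for your real database. The webhook_events schema, the header name, and the verify_signature helper are illustrative:

```python
import hashlib
import hmac
import os
import sqlite3

from flask import Flask, request

app = Flask(__name__)
# A module-level SQLite connection keeps the sketch short; in production this
# would be your application database and connection pool.
db = sqlite3.connect("webhooks.db", check_same_thread=False)
db.execute(
    """CREATE TABLE IF NOT EXISTS webhook_events (
           id INTEGER PRIMARY KEY AUTOINCREMENT,
           source TEXT NOT NULL,
           raw_body TEXT NOT NULL,
           received_at TEXT DEFAULT CURRENT_TIMESTAMP,
           status TEXT DEFAULT 'pending'   -- pending | processed | dead
       )"""
)

def verify_signature(source: str, raw: bytes, headers) -> bool:
    # Same HMAC check as the earlier sketch, with a per-source secret.
    secret = os.environ[f"{source.upper()}_WEBHOOK_SECRET"].encode()
    expected = hmac.new(secret, raw, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, headers.get("X-Webhook-Signature", ""))

@app.post("/webhooks/<source>")
def receive(source: str):
    raw = request.get_data()
    if not verify_signature(source, raw, request.headers):
        return "", 401

    # No parsing, no business logic, no external calls: persist and acknowledge.
    with db:
        db.execute(
            "INSERT INTO webhook_events (source, raw_body) VALUES (?, ?)",
            (source, raw.decode("utf-8", errors="replace")),
        )
    return "", 200
```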
A separate worker reads from that table and does the actual processing. When the worker fails (parse error, downstream API down, business rule violation), it moves the event to a dead_letter_queue table with the error, the stack trace, and a retry count. An operator gets an alert. The worker keeps going on the next event.
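And a sketch of that worker loop, with an illustrative dead_letter_queue table alongside the webhook_events table above. The broad except is deliberate: anything the worker can't handle gets parked with its error and stack trace instead of crashing the loop.

```python
import json
import sqlite3
import time
import traceback

db = sqlite3.connect("webhooks.db")
db.execute(
    """CREATE TABLE IF NOT EXISTS dead_letter_queue (
           event_id INTEGER PRIMARY KEY,   -- points back at webhook_events.id
           error TEXT,
           stack_trace TEXT,
           retry_count INTEGER DEFAULT 0,
           failed_at TEXT DEFAULT CURRENT_TIMESTAMP
       )"""
)

def run_worker() -> None:
    while True:
        row = db.execute(
            "SELECT id, source, raw_body FROM webhook_events "
            "WHERE status = 'pending' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            time.sleep(1)
            continue

        event_id, source, raw_body = row
        try:
            event = json.loads(raw_body)
            process(source, event)  # the actual integration work
            with db:
                db.execute(
                    "UPDATE webhook_events SET status = 'processed' WHERE id = ?",
                    (event_id,),
                )
        except Exception as exc:
            # Parse error, downstream API down, business rule violation:
            # park the event, record why, alert, and move on to the next one.
            with db:
                db.execute(
                    "UPDATE webhook_events SET status = 'dead' WHERE id = ?",
                    (event_id,),
                )
                db.execute(
                    "INSERT INTO dead_letter_queue (event_id, error, stack_trace) "
                    "VALUES (?, ?, ?)",
                    (event_id, str(exc), traceback.format_exc()),
                )
            alert_operator(f"webhook {event_id} from {source} dead-lettered: {exc}")

def process(source: str, event: dict) -> None:
    ...  # parse into your domain objects and apply the business logic

def alert_operator(message: str) -> None:
    ...  # email, SMS, PagerDuty: something that interrupts
```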
This inverts the failure mode. Instead of the sender giving up and the event disappearing, you keep the raw event forever, you know exactly why it failed, and you can replay it after fixing the bug. Debugging a "what happened to event evt_1M8x..." question becomes a one-line SQL query instead of a trip through the provider's support team.
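Under the same illustrative schema, both the lookup and the replay stay small (matching the provider's ID against the stored raw JSON is crude but works for an investigation):

```python
import sqlite3

db = sqlite3.connect("webhooks.db")

def find_by_provider_id(provider_event_id: str):
    # Map the provider's ID (the evt_... in the support ticket) to our row.
    return db.execute(
        "SELECT id, status, received_at FROM webhook_events WHERE raw_body LIKE ?",
        (f"%{provider_event_id}%",),
    ).fetchone()

def replay(event_id: int) -> None:
    # After the bug is fixed, put the event back in front of the worker.
    with db:
        db.execute("DELETE FROM dead_letter_queue WHERE event_id = ?", (event_id,))
        db.execute("UPDATE webhook_events SET status = 'pending' WHERE id = ?", (event_id,))
```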
A Minimum Monitoring Surface
You don't need Datadog to know whether your webhooks are healthy. You need three metrics and one alert rule.
- Events received per hour, per source. A sudden drop (Stripe goes from 200/hr to 0) means either your endpoint is unreachable or the sender disabled it. Either way you want to know within the hour.
- Dead-letter count. Any non-zero count should page someone. Every event in the DLQ is a data integrity bug waiting to bite.
- Signature failure count. More than a handful per day means a secret is out of sync or someone's probing your endpoint. Both warrant investigation.
The alert rule: if DLQ count increases or received-events drops more than 80% below the rolling hourly average, send an email or SMS to whoever owns the integration. Not Slack. Not a dashboard. Something that interrupts.
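Here's a sketch of that check as a cron job against the same illustrative tables. The addresses, SMTP relay, and 24-hour baseline are placeholders, and it alerts on any non-empty dead-letter queue rather than tracking deltas, which matches the "any non-zero count should page someone" rule above.

```python
import smtplib
import sqlite3
from email.message import EmailMessage

db = sqlite3.connect("webhooks.db")

def check_and_alert() -> None:
    alerts = []

    # Dead-letter count: any non-zero value is a data integrity bug in waiting.
    (dlq_count,) = db.execute("SELECT COUNT(*) FROM dead_letter_queue").fetchone()
    if dlq_count > 0:
        alerts.append(f"{dlq_count} event(s) in the dead-letter queue")

    # Received-events drop: last hour vs the trailing 24-hour hourly average.
    for source, last_hour, hourly_avg in db.execute(
        """SELECT source,
                  SUM(CASE WHEN received_at >= datetime('now', '-1 hour') THEN 1 ELSE 0 END),
                  COUNT(*) / 24.0
           FROM webhook_events
           WHERE received_at >= datetime('now', '-24 hours')
           GROUP BY source"""
    ):
        if hourly_avg > 0 and last_hour < 0.2 * hourly_avg:
            alerts.append(
                f"{source}: {last_hour} events in the last hour vs ~{hourly_avg:.0f}/hr average"
            )

    if alerts:
        msg = EmailMessage()
        msg["Subject"] = "Webhook health alert"
        msg["From"] = "alerts@example.com"            # placeholder
        msg["To"] = "integration-owner@example.com"   # placeholder: whoever owns it
        msg.set_content("\n".join(alerts))
        with smtplib.SMTP("localhost") as smtp:       # placeholder SMTP relay
            smtp.send_message(msg)

if __name__ == "__main__":
    check_and_alert()
```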
What to Do This Week
Three things are worth checking today.
- Log into every provider dashboard and look at the failed-delivery tab. You'll probably find events you didn't know were missing.
- Confirm your endpoint returns 200 on duplicate event IDs without reprocessing.
- Verify at least one human gets paged when a webhook fails. Not a log line, an actual notification.
Want help retrofitting this pattern onto an existing integration? Schedule a consultation.