Webhook Dispatcher — Failure Modes
Status: populated; Owner: Platform Engineering; Last updated: 2026-04-18; Companion: APPLICATION_LOGIC · OBSERVABILITY
1. Failure Catalogue
FM-HOOK-01: Customer Endpoint Unreachable
| Attribute | Detail |
|---|---|
| Trigger | Customer webhook URL returns a non-2xx response, times out, or fails DNS resolution |
| Impact | Delivery delayed; retries scheduled up to 10.5 hours |
| Detection | hook_delivery_failures_total counter; elevated FAILED_RETRY rows in DB |
| Recovery | Automatic retry with exponential backoff; dead-letter after 5 attempts |
| Customer notification | Dead-letter event published; platform dashboard shows DEAD_LETTER status |
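
The retry policy in FM-HOOK-01 can be sketched as follows. This is a minimal illustration, not the dispatcher's actual code: `next_retry`, `base_seconds`, and `factor` are hypothetical names, and the default values are chosen only so that the four waits between five attempts add up to the 10.5 hours quoted above. The real delay schedule is not stated in this document.

```python
from dataclasses import dataclass
from typing import Optional

MAX_ATTEMPTS = 5  # from the Recovery row: dead-letter after 5 attempts


@dataclass(frozen=True)
class RetryDecision:
    dead_letter: bool
    delay_seconds: Optional[float]  # None when the event is dead-lettered


def next_retry(attempts_made: int,
               base_seconds: float = 2520.0,  # assumption: 42 min first wait
               factor: float = 2.0) -> RetryDecision:  # assumption: doubling
    """Decide what to do after `attempts_made` failed delivery attempts."""
    if attempts_made < 1:
        raise ValueError("attempts_made must be >= 1")
    if attempts_made >= MAX_ATTEMPTS:
        # Retries exhausted: publish the dead-letter event instead of rescheduling.
        return RetryDecision(dead_letter=True, delay_seconds=None)
    # Exponential backoff: each wait is `factor` times the previous one.
    delay = base_seconds * factor ** (attempts_made - 1)
    return RetryDecision(dead_letter=False, delay_seconds=delay)


# Total time spent waiting across the four retries, in hours.
total_wait_hours = sum(next_retry(n).delay_seconds for n in range(1, MAX_ATTEMPTS)) / 3600
```

With these assumed defaults the waits are 42 min, 84 min, 168 min, and 336 min, so `total_wait_hours` is 10.5. The fifth failure returns a dead-letter decision rather than a delay.
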
FM-HOOK-02: NATS Consumer Disconnected
| Attribute | Detail |
|---|---|
| Trigger | NATS server unreachable |
| Impact | New webhook.dispatch events queue in JetStream; existing retry poller continues |
| Detection | hook_nats_consumer_status gauge = 0; /ready returns 503 |
| Recovery | Automatic NATS reconnect; JetStream replays; retry poller unaffected |
FM-HOOK-03: PostgreSQL Unavailable
| Attribute | Detail |
|---|---|
| Trigger | PG primary failover or partition |
| Impact | NATS Acks blocked; new dispatch events Nak'd; retry poller stalls |
| Detection | DB error logs; hook_db_errors_total spike |
| Recovery | PgBouncer failover ~30 s; both NATS consumer and retry poller resume automatically |
FM-HOOK-04: KMS Unreachable (Secret Decryption Failure)
| Attribute | Detail |
|---|---|
| Trigger | KMS service unavailable at delivery time |
| Impact | Cannot decrypt webhook secret; delivery attempt fails with internal error |
| Detection | hook_kms_errors_total counter; ERROR log hook.kms_error |
| Recovery | Treat as transient delivery failure; retry scheduled; KMS typically recovers within seconds |
| Mitigation | Cache decrypted secret in-memory for 5 min TTL per webhook (reduces KMS calls) |
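
The mitigation in FM-HOOK-04 amounts to a small per-webhook TTL cache in front of KMS. The sketch below assumes a hypothetical `decrypt` callable that takes a webhook ID and returns the plaintext secret (raising when KMS is unreachable); the class and parameter names are illustrative, and only the 5 min TTL comes from the table.

```python
import time
from typing import Callable, Dict, Tuple

SECRET_TTL_SECONDS = 300.0  # 5 min TTL per webhook, from the Mitigation row


class SecretCache:
    """In-memory cache of decrypted webhook secrets (illustrative sketch)."""

    def __init__(self,
                 decrypt: Callable[[str], bytes],  # hypothetical KMS wrapper
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._decrypt = decrypt
        self._clock = clock
        # webhook_id -> (expiry time on the clock, decrypted secret)
        self._entries: Dict[str, Tuple[float, bytes]] = {}

    def get(self, webhook_id: str) -> bytes:
        now = self._clock()
        entry = self._entries.get(webhook_id)
        if entry is not None and now < entry[0]:
            return entry[1]  # still fresh: no KMS call
        # Expired or missing: this call raises if KMS is unreachable, and the
        # caller treats that as a transient delivery failure and schedules a retry.
        secret = self._decrypt(webhook_id)
        self._entries[webhook_id] = (now + SECRET_TTL_SECONDS, secret)
        return secret
```

A cache like this only covers KMS outages shorter than the remaining TTL of an entry. Webhooks whose secret was never cached, or whose entry has expired, still fail and fall back to the retry path.
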
FM-HOOK-05: Retry Poller Falling Behind
| Attribute | Detail |
|---|---|
| Trigger | High volume of FAILED_RETRY rows with next_retry_at in the past; poller batch size too small |
| Detection | hook_retry_poller_lag_seconds gauge > 60 |
| Recovery | Increase the poller batch size or lower RETRY_WORKER_INTERVAL_MS so the poller runs more often; scale out pods (each pod runs its own poller with SKIP LOCKED) |
FM-HOOK-06: Redirect Loop at Customer Endpoint
| Attribute | Detail |
|---|---|
| Trigger | Customer URL returns 301/302 redirect |
| Impact | Delivery attempt fails (redirect not followed); retries exhaust to dead-letter |
| Detection | httpStatusCode = 301 or 302 in delivery_attempts; customer reports missing events |
| Recovery | Customer must update webhook URL to non-redirecting endpoint |
| Prevention | Validate URL format at registration; document no-redirect behaviour |
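
Because redirects are not followed, a 3xx response is a failed attempt rather than a hop to a new URL. A minimal sketch of that classification, assuming hypothetical outcome names and treating 303, 307, and 308 the same way as the 301/302 named above:

```python
from enum import Enum


class Outcome(Enum):
    DELIVERED = "delivered"
    REDIRECT_NOT_FOLLOWED = "redirect_not_followed"
    FAILED = "failed"


# 301/302 come from the Trigger row; the other codes are an assumption.
REDIRECT_CODES = frozenset({301, 302, 303, 307, 308})


def classify_status(http_status_code: int) -> Outcome:
    """Map a customer endpoint's HTTP status to a delivery outcome."""
    if 200 <= http_status_code < 300:
        return Outcome.DELIVERED
    if http_status_code in REDIRECT_CODES:
        # Counted as a failed attempt; the Location header is ignored.
        return Outcome.REDIRECT_NOT_FOLLOWED
    return Outcome.FAILED
```

Keeping a distinct redirect outcome makes it easier to tell customers that the fix is on their side (update the registered URL), since retrying the same URL will keep returning the redirect.
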
FM-HOOK-07: Dead-Letter Flood (Mass Customer Endpoint Outage)
| Attribute | Detail |
|---|---|
| Trigger | Large customer endpoint outage causes all 5 retry attempts to fail for thousands of events |
| Impact | Flood of webhook.dispatch.deadletter events; billing alerts may fire if billing depends on delivery confirmation |
| Detection | hook_deliveries_dead_lettered_total spike; alert threshold 100/min |
| Recovery | Events are permanently dead-lettered; customer must request manual replay via support |
| Mitigation | Platform replay tooling (future feature) to re-dispatch from delivery_attempts.payload_snapshot |
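
The 100/min alert threshold is a rate over a monotonically increasing counter. The sketch below derives a per-minute rate from two samples of `hook_deliveries_dead_lettered_total`. The function names, the strict `>` comparison, and the reset handling (a counter that drops is assumed to have restarted from zero) are assumptions, not the platform's actual alert rule.

```python
DEAD_LETTER_ALERT_PER_MIN = 100.0  # alert threshold from the Detection row


def dead_letter_rate_per_min(prev_count: float, prev_ts: float,
                             curr_count: float, curr_ts: float) -> float:
    """Per-minute increase of the dead-letter counter between two samples.

    Timestamps are in seconds.
    """
    elapsed = curr_ts - prev_ts
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    if curr_count >= prev_count:
        increase = curr_count - prev_count
    else:
        # Counter went down: assume a reset (e.g. pod restart) and count from zero.
        increase = curr_count
    return increase * 60.0 / elapsed


def should_alert(rate_per_min: float) -> bool:
    # Assumption: the alert fires strictly above the threshold.
    return rate_per_min > DEAD_LETTER_ALERT_PER_MIN
```

For example, 250 new dead-lettered events across a 60 s window is a rate of 250/min and would alert, while 50 in the same window would not.
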
2. Dependency Failure Matrix
| Dependency | Failure | Behaviour |
|---|---|---|
| NATS | Down | Consumer disconnects; retry poller continues; /ready 503 |
| PostgreSQL | Down | All operations stall; automatic recovery on PgBouncer failover |
| KMS | Down | Delivery fails and is retried; the 5 min in-memory secret cache mitigates brief outages |
| Customer endpoint | Down | Retried with backoff; dead-lettered after 5 attempts |
| dlr-processor (upstream) | Down | No new events; existing retry queue continues |