Webhook Dispatcher — Failure Modes

Status: populated | Owner: Platform Engineering | Last updated: 2026-04-18 | Companion: APPLICATION_LOGIC · OBSERVABILITY

1. Failure Catalogue

FM-HOOK-01: Customer Endpoint Unreachable

| Attribute | Detail |
| --- | --- |
| Trigger | Customer webhook URL returns a non-2xx response, times out, or fails DNS resolution |
| Impact | Delivery delayed; retries are scheduled over a window of up to 10.5 hours |
| Detection | hook_delivery_failures_total counter; elevated FAILED_RETRY rows in DB |
| Recovery | Automatic retry with exponential backoff; dead-letter after 5 attempts |
| Customer notification | Dead-letter event published; platform dashboard shows DEAD_LETTER status |
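
The retry policy above can be sketched as a small scheduling function. This is a minimal illustration, not the dispatcher's implementation: the catalogue states only the 5-attempt limit and the 10.5-hour window, so BASE_DELAY and BACKOFF_FACTOR are placeholder values and the function names are hypothetical.

```python
from datetime import timedelta
from typing import Optional

MAX_ATTEMPTS = 5  # from the catalogue: dead-letter after 5 attempts

# Placeholder tuning values (assumptions), NOT the dispatcher's real schedule.
BASE_DELAY = timedelta(minutes=1)
BACKOFF_FACTOR = 4


def next_retry_delay(attempts_made: int) -> Optional[timedelta]:
    """Return the wait before the next attempt, or None to dead-letter."""
    if attempts_made < 1:
        raise ValueError("attempts_made must be >= 1")
    if attempts_made >= MAX_ATTEMPTS:
        return None  # retries exhausted: the event goes to dead-letter
    # Delay grows geometrically with each failed attempt.
    return BASE_DELAY * (BACKOFF_FACTOR ** (attempts_made - 1))


def retry_window() -> timedelta:
    """Total time between the first attempt and the final attempt."""
    delays = (next_retry_delay(n) for n in range(1, MAX_ATTEMPTS))
    return sum(delays, timedelta())
```

With these placeholder values the delays are 1, 4, 16 and 64 minutes, a window of 85 minutes, so the real base delay and multiplier behind the 10.5-hour window must be different.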

FM-HOOK-02: NATS Consumer Disconnected

| Attribute | Detail |
| --- | --- |
| Trigger | NATS server unreachable |
| Impact | New webhook.dispatch events queue in JetStream; existing retry poller continues |
| Detection | hook_nats_consumer_status gauge = 0; /ready returns 503 |
| Recovery | Automatic NATS reconnect; JetStream replays; retry poller unaffected |

FM-HOOK-03: PostgreSQL Unavailable

| Attribute | Detail |
| --- | --- |
| Trigger | PG primary failover or partition |
| Impact | NATS Acks blocked; new dispatch events Nak'd; retry poller stalls |
| Detection | DB error logs; hook_db_errors_total spike |
| Recovery | PgBouncer failover ~30 s; both NATS consumer and retry poller resume automatically |
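
The Ack/Nak behaviour described above can be sketched as follows. This is an illustration under assumptions: the msg object is a synchronous stand-in with ack() and nak() methods (real NATS clients usually expose these as asynchronous calls), and DatabaseUnavailable and persist are hypothetical names for the persistence layer.

```python
import logging

log = logging.getLogger("hook")


class DatabaseUnavailable(Exception):
    """Hypothetical error raised by the persistence layer when PG is down."""


def handle_dispatch(msg, persist) -> bool:
    """Persist a webhook.dispatch event, then Ack; Nak if the write fails.

    `msg` is any object with data, ack() and nak() (assumed interface);
    `persist` is a callable that writes the event to PostgreSQL.
    """
    try:
        persist(msg.data)
    except DatabaseUnavailable:
        log.error("database unavailable, requesting redelivery")
        msg.nak()  # leave the event in JetStream for redelivery
        return False
    msg.ack()  # acknowledge only after the row has been written
    return True
```

Acknowledging only after the write succeeds means an event that could not be persisted stays in JetStream and is redelivered once the database is back.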

FM-HOOK-04: KMS Unreachable (Secret Decryption Failure)

| Attribute | Detail |
| --- | --- |
| Trigger | KMS service unavailable at delivery time |
| Impact | Cannot decrypt webhook secret; delivery attempt fails with internal error |
| Detection | hook_kms_errors_total counter; ERROR log hook.kms_error |
| Recovery | Treat as transient delivery failure; retry scheduled; KMS typically recovers within seconds |
| Mitigation | Cache decrypted secret in-memory for 5 min TTL per webhook (reduces KMS calls) |
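
The mitigation above amounts to a per-webhook TTL cache in front of the KMS call. A minimal sketch, assuming a decrypt callable that takes a webhook ID and returns the secret; the class name, the callable, and the injectable clock are illustrative and not taken from the dispatcher's code.

```python
import time
from typing import Callable, Dict, Tuple

SECRET_TTL_SECONDS = 5 * 60  # 5 min TTL per webhook, as in the Mitigation row


class SecretCache:
    """In-memory cache of decrypted webhook secrets (illustrative only)."""

    def __init__(
        self,
        decrypt: Callable[[str], bytes],
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._decrypt = decrypt  # hypothetical wrapper around the KMS call
        self._clock = clock
        self._entries: Dict[str, Tuple[float, bytes]] = {}

    def get(self, webhook_id: str) -> bytes:
        now = self._clock()
        hit = self._entries.get(webhook_id)
        if hit is not None and now < hit[0]:
            return hit[1]  # still fresh: no KMS call needed
        # Miss or expired entry: this call may raise if KMS is unreachable.
        secret = self._decrypt(webhook_id)
        self._entries[webhook_id] = (now + SECRET_TTL_SECONDS, secret)
        return secret
```

In this sketch an expired entry is never served, so the cache only covers KMS outages shorter than the remaining TTL; longer outages still fall through to the retry path.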

FM-HOOK-05: Retry Poller Falling Behind

| Attribute | Detail |
| --- | --- |
| Trigger | High volume of FAILED_RETRY rows with next_retry_at in the past; poller batch size too small |
| Detection | hook_retry_poller_lag_seconds gauge > 60 |
| Recovery | Increase the retry poller batch size (and adjust RETRY_WORKER_INTERVAL_MS if needed); scale out pods (each pod runs its own poller and claims rows with SKIP LOCKED) |
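
To illustrate why scaling out pods is safe, here is a sketch of a claim query using SKIP LOCKED, plus one plausible way to derive the lag gauge. Only FAILED_RETRY, next_retry_at and SKIP LOCKED come from the catalogue; the table name, the id column, the batch_size parameter and the lag definition are assumptions.

```python
from datetime import datetime, timezone
from typing import Optional

# Assumed table and column names. Rows locked by another pod's poller are
# skipped rather than waited on, so pods never claim the same row twice.
CLAIM_DUE_RETRIES_SQL = """
SELECT id
FROM deliveries
WHERE status = 'FAILED_RETRY'
  AND next_retry_at <= now()
ORDER BY next_retry_at
LIMIT %(batch_size)s
FOR UPDATE SKIP LOCKED
"""


def poller_lag_seconds(oldest_due: Optional[datetime], now: datetime) -> float:
    """Seconds the oldest overdue retry has waited; 0.0 when nothing is due."""
    if oldest_due is None or oldest_due >= now:
        return 0.0
    return (now - oldest_due).total_seconds()
```

For example, if the oldest overdue row was due at 00:00:30 and the time is now 00:02:00, the lag is 90 seconds, which is above the threshold of 60 in the Detection row.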

FM-HOOK-06: Redirect Loop at Customer Endpoint

| Attribute | Detail |
| --- | --- |
| Trigger | Customer URL returns a 301/302 redirect |
| Impact | Delivery attempt fails (redirect not followed); retries exhaust to dead-letter |
| Detection | httpStatusCode = 301 or 302 in delivery_attempts; customer reports missing events |
| Recovery | Customer must update webhook URL to a non-redirecting endpoint |
| Prevention | Validate URL format at registration; document no-redirect behaviour |
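
A registration-time format check might look like the sketch below. The specific rules (accepted schemes, no embedded credentials, no fragment) are assumptions chosen for illustration; the catalogue does not say which checks the platform applies.

```python
from urllib.parse import urlsplit


def validate_webhook_url(url: str) -> list:
    """Return a list of problems; an empty list means the format is acceptable."""
    try:
        parts = urlsplit(url)
    except ValueError:
        return ["URL is malformed"]
    problems = []
    if parts.scheme not in ("http", "https"):
        problems.append("scheme must be http or https")
    if not parts.hostname:
        problems.append("URL must include a host")
    if parts.username or parts.password:
        problems.append("credentials in the URL are not allowed")
    if parts.fragment:
        problems.append("fragments are never sent to the server")
    return problems
```

Format validation alone cannot tell whether an endpoint will redirect, which is why the Prevention row pairs it with documenting the no-redirect behaviour.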

FM-HOOK-07: Dead-Letter Flood (Mass Customer Endpoint Outage)

| Attribute | Detail |
| --- | --- |
| Trigger | Large customer endpoint outage causes all 5 delivery attempts to fail for thousands of events |
| Impact | webhook.dispatch.deadletter flood; billing alert if billing depends on delivery confirmation |
| Detection | hook_deliveries_dead_lettered_total spike; alert threshold 100/min |
| Recovery | Events are permanently dead-lettered; customer must request manual replay via support |
| Mitigation | Platform replay tooling (future feature) to re-dispatch from delivery_attempts.payload_snapshot |
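
The detection rule above compares the growth of a counter against a per-minute threshold. The sketch below shows that arithmetic for two samples of hook_deliveries_dead_lettered_total. It is illustrative only: in practice this would more likely be an alerting rule in the metrics system, and whether the threshold is inclusive is an assumption.

```python
DEAD_LETTER_ALERT_PER_MIN = 100  # alert threshold from the Detection row


def dead_letter_rate_per_min(
    prev_count: float, curr_count: float, elapsed_seconds: float
) -> float:
    """Per-minute increase of the dead-letter counter between two samples."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    delta = curr_count - prev_count
    if delta < 0:
        # Counter went backwards (e.g. a pod restart): count from zero.
        delta = curr_count
    return delta * 60.0 / elapsed_seconds


def should_alert(prev_count: float, curr_count: float, elapsed_seconds: float) -> bool:
    """True when the rate exceeds the threshold (strictly greater: assumed)."""
    rate = dead_letter_rate_per_min(prev_count, curr_count, elapsed_seconds)
    return rate > DEAD_LETTER_ALERT_PER_MIN
```

For example, a counter that moves from 1000 to 1250 in 60 seconds is a rate of 250 per minute and would trigger the alert, while 1000 to 1040 in 30 seconds is 80 per minute and would not.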

2. Dependency Failure Matrix

| Dependency | Failure | Behaviour |
| --- | --- | --- |
| NATS | Down | Consumer disconnects; retry poller continues; /ready 503 |
| PostgreSQL | Down | All operations stall; automatic recovery on PgBouncer failover |
| KMS | Down | Delivery fails; retried; short cache mitigates brief outages |
| Customer endpoint | Down | Retried with backoff; dead-lettered after 5 attempts |
| dlr-processor (upstream) | Down | No new events; existing retry queue continues |