Webhook Dispatcher — Observability

Status: populated
Owner: Platform Engineering
Last updated: 2026-04-18
Companion: FAILURE_MODES · DEPLOYMENT_TOPOLOGY

1. Metrics (Prometheus)

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| hook_dispatch_events_received_total | Counter | dlr_status | Inbound webhook.dispatch events |
| hook_delivery_attempts_total | Counter | attempt_number, outcome | HTTP delivery attempts by outcome |
| hook_delivery_success_total | Counter | attempt_number | Successful deliveries (2xx) |
| hook_delivery_failures_total | Counter | http_status_code | Failed deliveries by HTTP status |
| hook_deliveries_dead_lettered_total | Counter | - | Deliveries reaching DEAD_LETTER |
| hook_delivery_duration_seconds | Histogram | - | HTTP delivery round-trip time |
| hook_retry_poller_lag_seconds | Gauge | - | Time since oldest due retry not yet processed |
| hook_pending_retries_count | Gauge | - | FAILED_RETRY rows with next_retry_at <= now() |
| hook_db_errors_total | Counter | operation | DB errors by operation |
| hook_kms_errors_total | Counter | - | KMS decryption failures |
| hook_nats_consumer_status | Gauge | - | 1 = active, 0 = disconnected |
| hook_webhooks_active_total | Gauge | - | Total active webhook_configs |
| hook_api_requests_total | Counter | method, path, status | REST API requests |
| hook_api_duration_seconds | Histogram | method, path | REST API latency |
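
The two retry gauges are defined in terms of FAILED_RETRY rows and next_retry_at, so their values can be derived from the same set of rows. The following is a minimal in-memory sketch of that derivation, not the dispatcher's actual code: the DeliveryRow shape, the computeRetryGauges name, and the choice to measure lag from next_retry_at are all assumptions made for illustration. In the real service this would more likely be a single SQL query run by the poller.

```typescript
// Hypothetical row shape: only the FAILED_RETRY status and next_retry_at
// column are taken from the metrics table; everything else is assumed.
interface DeliveryRow {
  status: string;       // e.g. "FAILED_RETRY", "DEAD_LETTER"
  nextRetryAt: number;  // next_retry_at as epoch milliseconds
}

interface RetryGauges {
  pendingRetries: number;   // candidate value for hook_pending_retries_count
  pollerLagSeconds: number; // candidate value for hook_retry_poller_lag_seconds
}

function computeRetryGauges(rows: DeliveryRow[], nowMs: number): RetryGauges {
  // A retry is "due" when it is FAILED_RETRY and next_retry_at <= now.
  const due = rows.filter(
    (r) => r.status === "FAILED_RETRY" && r.nextRetryAt <= nowMs
  );
  if (due.length === 0) {
    return { pendingRetries: 0, pollerLagSeconds: 0 };
  }
  // Assumption: lag is measured from the oldest due next_retry_at.
  const oldestDueMs = Math.min(...due.map((r) => r.nextRetryAt));
  return {
    pendingRetries: due.length,
    pollerLagSeconds: (nowMs - oldestDueMs) / 1000,
  };
}
```

Rows scheduled in the future and rows in any other status do not contribute to either gauge, which keeps the lag at zero while the poller is keeping up.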

2. Structured Logs (Pino / JSON)

| Level | event field | When |
| --- | --- | --- |
| INFO | hook.dispatch_received | Inbound NATS event (no phone numbers logged) |
| INFO | hook.delivery_attempt | Each HTTP attempt (webhookId, attemptNumber, httpStatus) |
| INFO | hook.delivery_success | 2xx response (webhookId, deliveryId, durationMs) |
| WARN | hook.delivery_failed | Non-2xx or timeout (webhookId, attemptNumber, httpStatus, error) |
| WARN | hook.dead_lettered | Max retries exhausted (webhookId, deliveryId, accountId) |
| WARN | hook.redirect_rejected | 3xx received (webhookId, httpStatus, location header) |
| ERROR | hook.db_error | Database error (operation, sanitised message) |
| ERROR | hook.kms_error | KMS decryption failure (webhookId, redacted) |

No log line contains a "to" (phone number) field.
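
One way to enforce the "to" rule is to strip the field before anything is serialised. The sketch below is dependency-free and illustrative only: stripPhoneFields and buildLogLine are hypothetical helper names, and the output shape is an assumption rather than the service's actual log schema. Pino also ships a built-in redact option that could serve the same purpose.

```typescript
type LogFields = Record<string, unknown>;

// Recursively drop any key named "to" so phone numbers never reach the logger.
function stripPhoneFields(value: unknown): unknown {
  if (Array.isArray(value)) {
    return value.map(stripPhoneFields);
  }
  if (value !== null && typeof value === "object") {
    const source = value as LogFields;
    const out: LogFields = {};
    for (const key of Object.keys(source)) {
      if (key === "to") continue; // phone number field, never logged
      out[key] = stripPhoneFields(source[key]);
    }
    return out;
  }
  return value;
}

// Build one JSON log line with the level and event name from the table above.
function buildLogLine(level: string, event: string, fields: LogFields): string {
  const safe = stripPhoneFields(fields) as LogFields;
  return JSON.stringify({ level, event, ...safe });
}
```

For example, buildLogLine("INFO", "hook.dispatch_received", { webhookId: "wh_1", to: "+15550100" }) yields a line that carries webhookId but no "to" key, including when "to" is nested inside a payload object.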

3. Distributed Tracing (OpenTelemetry)

Trace spans:

  • hook.dispatch.process — root span per NATS message
  • hook.dispatch.lookup_webhooks — PG query for active configs
  • hook.delivery.attempt — child span per HTTP attempt (attributes include http.url and http.status_code; the URL is redacted to host only)
  • hook.delivery.sign — HMAC signing
  • hook.retry.schedule — PG update for retry
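
The host-only redaction on hook.delivery.attempt can be sketched with the standard URL parser. This is an illustration under assumptions, not the dispatcher's code: redactUrlToHost, deliveryAttemptAttributes, and the "invalid-url" placeholder are hypothetical, and the real span attributes would be set through the OpenTelemetry API rather than returned as a plain object.

```typescript
// Reduce a webhook target URL to its host (hostname plus any non-default port),
// dropping path, query string, and credentials, which may contain secrets.
function redactUrlToHost(rawUrl: string): string {
  try {
    return new URL(rawUrl).host;
  } catch {
    // Assumption: unparseable URLs are replaced with a fixed placeholder.
    return "invalid-url";
  }
}

// Attributes for one hook.delivery.attempt span, using the names listed above.
function deliveryAttemptAttributes(
  targetUrl: string,
  statusCode: number
): Record<string, string | number> {
  return {
    "http.url": redactUrlToHost(targetUrl),
    "http.status_code": statusCode,
  };
}
```

With this approach a target such as https://hooks.example.com/v1/abc?token=secret is recorded as hooks.example.com, so customer paths and tokens stay out of the tracing backend.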

4. Alerting Rules

- alert: HookHighDeadLetterRate
  expr: rate(hook_deliveries_dead_lettered_total[5m]) > 1.67  # >100/min
  for: 2m
  labels: { severity: warning }
  annotations:
    summary: "Webhook dead-letter rate > 100/min"

- alert: HookRetryPollerLag
  expr: hook_retry_poller_lag_seconds > 120
  for: 5m
  labels: { severity: warning }

- alert: HookConsumerDisconnected
  expr: hook_nats_consumer_status == 0
  for: 1m
  labels: { severity: critical }

- alert: HookHighDeliveryLatency
  expr: histogram_quantile(0.99, rate(hook_delivery_duration_seconds_bucket[5m])) > 4.5
  for: 5m
  labels: { severity: warning }
  annotations:
    summary: "Webhook delivery p99 latency approaching 5s timeout"
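
The HookHighDeliveryLatency rule relies on histogram_quantile, which estimates a quantile by linear interpolation inside the cumulative bucket that contains the target rank. The sketch below reproduces that idea in simplified form to show how bucket boundaries shape the p99 value; it is not Prometheus's implementation, and the bucket boundaries in the usage note are hypothetical because this document does not list the ones configured for hook_delivery_duration_seconds.

```typescript
// One cumulative histogram bucket: observations with value <= upperBound.
interface Bucket {
  upperBound: number;      // the "le" label; Infinity for the +Inf bucket
  cumulativeCount: number; // count (or per-second rate) of observations
}

// Estimate quantile q (0..1) from buckets sorted by ascending upperBound.
function estimateQuantile(q: number, buckets: Bucket[]): number {
  if (buckets.length === 0) return NaN;
  const total = buckets[buckets.length - 1].cumulativeCount;
  if (total === 0) return NaN;
  const rank = q * total;

  // Find the first bucket whose cumulative count reaches the target rank.
  let idx = 0;
  while (idx < buckets.length - 1 && buckets[idx].cumulativeCount < rank) {
    idx++;
  }

  // If the rank falls in the +Inf bucket, report the highest finite bound.
  if (buckets[idx].upperBound === Infinity) {
    return idx > 0 ? buckets[idx - 1].upperBound : NaN;
  }

  // The lowest bucket is assumed to start at 0.
  const lowerBound = idx === 0 ? 0 : buckets[idx - 1].upperBound;
  const countBelow = idx === 0 ? 0 : buckets[idx - 1].cumulativeCount;
  const inBucket = buckets[idx].cumulativeCount - countBelow;
  if (inBucket === 0) return lowerBound;

  // Linear interpolation within the selected bucket.
  const fraction = (rank - countBelow) / inBucket;
  return lowerBound + (buckets[idx].upperBound - lowerBound) * fraction;
}
```

With hypothetical buckets at 0.5, 1, 2.5, 5 and +Inf holding cumulative counts 600, 900, 980, 998 and 1000, the p99 rank of 990 lands in the 2.5 to 5 bucket and interpolates to roughly 3.89 s, below the 4.5 s threshold. If only 985 observations finished within 5 s, the rank falls in the +Inf bucket and the estimate is pinned at 5 s. The practical point is that the 4.5 s threshold is only meaningful if the histogram has reasonably fine bucket boundaries between about 2.5 s and 5 s.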

5. Dashboards

Grafana dashboard webhook-dispatcher-overview:

  • Delivery success/failure rate — stacked time series
  • Dead-letter rate — time series with alert threshold overlay
  • Delivery latency p50/p95/p99 — time series
  • Retry queue depth — gauge + trend
  • HTTP status code distribution — heatmap
  • Active webhook configs — gauge
  • REST API latency + error rate — time series