Webhook Dispatcher — Observability
Status: populated | Owner: Platform Engineering | Last updated: 2026-04-18 | Companion: FAILURE_MODES · DEPLOYMENT_TOPOLOGY
1. Metrics (Prometheus)
| Metric | Type | Labels | Description |
|---|---|---|---|
| hook_dispatch_events_received_total | Counter | dlr_status | Inbound webhook.dispatch events |
| hook_delivery_attempts_total | Counter | attempt_number, outcome | HTTP delivery attempts by outcome |
| hook_delivery_success_total | Counter | attempt_number | Successful deliveries (2xx) |
| hook_delivery_failures_total | Counter | http_status_code | Failed deliveries by HTTP status |
| hook_deliveries_dead_lettered_total | Counter | (none) | Deliveries reaching DEAD_LETTER |
| hook_delivery_duration_seconds | Histogram | (none) | HTTP delivery round-trip time |
| hook_retry_poller_lag_seconds | Gauge | (none) | Time since oldest due retry not yet processed |
| hook_pending_retries_count | Gauge | (none) | FAILED_RETRY rows with next_retry_at <= now() |
| hook_db_errors_total | Counter | operation | DB errors by operation |
| hook_kms_errors_total | Counter | (none) | KMS decryption failures |
| hook_nats_consumer_status | Gauge | (none) | 1 = active, 0 = disconnected |
| hook_webhooks_active_total | Gauge | (none) | Total active webhook_configs |
| hook_api_requests_total | Counter | method, path, status | REST API requests |
| hook_api_duration_seconds | Histogram | method, path | REST API latency |
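
The two retry gauges above are both derived from the same set of due rows, so a small sketch may help show how they relate. This is an illustration only: the `DeliveryRow` shape, the `DELIVERED` status, and the `computeRetryGauges` helper are hypothetical, and it assumes "lag" means the time elapsed since the oldest due row's `next_retry_at`.

```typescript
// Hypothetical row shape. FAILED_RETRY and DEAD_LETTER appear in this document;
// DELIVERED is an assumed name for the success state.
interface DeliveryRow {
  status: "FAILED_RETRY" | "DELIVERED" | "DEAD_LETTER";
  nextRetryAt: Date | null;
}

interface RetryGauges {
  pendingRetriesCount: number; // candidate value for hook_pending_retries_count
  pollerLagSeconds: number; // candidate value for hook_retry_poller_lag_seconds
}

// Computes both gauge values from rows already loaded in memory.
// A real poller would more likely run an aggregate query instead.
function computeRetryGauges(rows: DeliveryRow[], now: Date): RetryGauges {
  const nowMs = now.getTime();
  const dueTimes: number[] = [];
  for (const row of rows) {
    if (row.status !== "FAILED_RETRY" || row.nextRetryAt === null) continue;
    const dueAt = row.nextRetryAt.getTime();
    if (dueAt <= nowMs) dueTimes.push(dueAt); // next_retry_at <= now()
  }
  if (dueTimes.length === 0) {
    return { pendingRetriesCount: 0, pollerLagSeconds: 0 };
  }
  const oldestDue = Math.min(...dueTimes);
  return {
    pendingRetriesCount: dueTimes.length,
    pollerLagSeconds: (nowMs - oldestDue) / 1000,
  };
}
```

Under this reading, a lag of zero with a non-zero pending count cannot occur, which can serve as a quick sanity check on a dashboard.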
2. Structured Logs (Pino / JSON)
| Level | event field | When |
|---|---|---|
| INFO | hook.dispatch_received | Inbound NATS event (no phone numbers logged) |
| INFO | hook.delivery_attempt | Each HTTP attempt (webhookId, attemptNumber, httpStatus) |
| INFO | hook.delivery_success | 2xx response (webhookId, deliveryId, durationMs) |
| WARN | hook.delivery_failed | Non-2xx or timeout (webhookId, attemptNumber, httpStatus, error) |
| WARN | hook.dead_lettered | Max retries exhausted (webhookId, deliveryId, accountId) |
| WARN | hook.redirect_rejected | 3xx received (webhookId, httpStatus, location header) |
| ERROR | hook.db_error | Database error (operation, sanitised message) |
| ERROR | hook.kms_error | KMS decryption failure (webhookId, redacted) |

No log line contains a `to` (phone number) field.
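
One way to enforce the phone-number rule is to strip the field before a log line is serialised. The sketch below is a dependency-free illustration, not the service's actual logging code: `stripForbiddenFields`, `buildLogLine`, and `FORBIDDEN_KEYS` are hypothetical names, and the line shape (`level`, `event`, then fields) is assumed from the table above.

```typescript
type LogFields = Record<string, unknown>;

// Keys that must never reach a log line. "to" comes from the rule above.
const FORBIDDEN_KEYS: ReadonlySet<string> = new Set(["to"]);

function isPlainObject(value: unknown): value is LogFields {
  return (
    typeof value === "object" &&
    value !== null &&
    Object.getPrototypeOf(value) === Object.prototype
  );
}

// Recursively drops forbidden keys from plain objects and arrays.
// Other values (strings, numbers, Dates, ...) are returned unchanged.
function stripForbiddenFields(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(stripForbiddenFields);
  if (!isPlainObject(value)) return value;
  const out: LogFields = {};
  for (const [key, inner] of Object.entries(value)) {
    if (!FORBIDDEN_KEYS.has(key)) out[key] = stripForbiddenFields(inner);
  }
  return out;
}

// Builds one JSON log line; level and event are written last so that
// caller-supplied fields cannot overwrite them.
function buildLogLine(
  level: "info" | "warn" | "error",
  event: string,
  fields: LogFields,
): string {
  const safe = stripForbiddenFields(fields) as LogFields;
  return JSON.stringify({ ...safe, level, event });
}
```

In a Pino-based service, the logger's built-in `redact` option may be a better fit than a hand-rolled filter; the sketch only makes the intended behaviour explicit.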
3. Distributed Tracing (OpenTelemetry)
Trace spans:
- `hook.dispatch.process`: root span per NATS message
- `hook.dispatch.lookup_webhooks`: PG query for active configs
- `hook.delivery.attempt`: child span per HTTP attempt (includes `http.url` and `http.status_code`, with the URL redacted to host only)
- `hook.delivery.sign`: HMAC signing
- `hook.retry.schedule`: PG update for retry
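
The host-only redaction on `hook.delivery.attempt` could look roughly like the following. This is a sketch under assumptions: `redactUrlToHost` and `deliveryAttemptAttributes` are hypothetical helpers, the `"invalid-url"` placeholder is an arbitrary choice, and the result would still need to be attached to a span through whatever tracing SDK the service uses.

```typescript
// Reduces a webhook target URL to its host (hostname plus any non-default port),
// dropping the path, query string, fragment, and embedded credentials.
function redactUrlToHost(rawUrl: string): string {
  try {
    return new URL(rawUrl).host;
  } catch {
    // Unparseable input: avoid echoing the raw value into a span.
    return "invalid-url";
  }
}

// Builds the attribute map for a hook.delivery.attempt span.
function deliveryAttemptAttributes(
  targetUrl: string,
  statusCode: number,
): Record<string, string | number> {
  return {
    "http.url": redactUrlToHost(targetUrl),
    "http.status_code": statusCode,
  };
}
```

For example, `deliveryAttemptAttributes("https://hooks.example.com/a/b?sig=1", 200)` yields an `http.url` of `hooks.example.com`, so per-customer paths and signed query parameters stay out of the trace backend.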
4. Alerting Rules
- alert: HookHighDeadLetterRate
  expr: rate(hook_deliveries_dead_lettered_total[5m]) > 1.67  # >100/min
  for: 2m
  labels: { severity: warning }
  annotations:
    summary: "Webhook dead-letter rate > 100/min"
- alert: HookRetryPollerLag
  expr: hook_retry_poller_lag_seconds > 120
  for: 5m
  labels: { severity: warning }
- alert: HookConsumerDisconnected
  expr: hook_nats_consumer_status == 0
  for: 1m
  labels: { severity: critical }
- alert: HookHighDeliveryLatency
  expr: histogram_quantile(0.99, rate(hook_delivery_duration_seconds_bucket[5m])) > 4.5
  for: 5m
  labels: { severity: warning }
  annotations:
    summary: "Webhook delivery p99 latency approaching 5s timeout"
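
The latency alert relies on `histogram_quantile`, which estimates a quantile by linear interpolation inside the bucket that contains the target rank. The simplified sketch below shows why bucket boundaries matter for a 4.5 s threshold. The `Bucket` type, the `histogramQuantile` function, and the bucket bounds in the usage note are assumptions for illustration, and edge cases that Prometheus handles (such as `q = 0`) are left out.

```typescript
// One cumulative histogram bucket: `count` observations were <= `le` seconds.
// The last bucket is expected to have le = Infinity.
interface Bucket {
  le: number;
  count: number;
}

// Simplified quantile estimate for 0 < q <= 1 over cumulative buckets.
function histogramQuantile(q: number, buckets: Bucket[]): number {
  if (!(q > 0 && q <= 1) || buckets.length < 2) return NaN;
  const sorted = [...buckets].sort((a, b) => a.le - b.le);
  const total = sorted[sorted.length - 1].count;
  if (total === 0) return NaN;

  const rank = q * total;
  const i = sorted.findIndex((b) => b.count >= rank);

  // Rank falls in the +Inf bucket: the estimate is capped at the
  // highest finite upper bound.
  if (i === sorted.length - 1) return sorted[sorted.length - 2].le;

  const upper = sorted[i].le;
  const lower = i === 0 ? 0 : sorted[i - 1].le;
  const below = i === 0 ? 0 : sorted[i - 1].count;
  const inBucket = sorted[i].count - below;
  // Assume observations are spread evenly within the bucket.
  return lower + (upper - lower) * ((rank - below) / inBucket);
}
```

With assumed bounds of 0.5, 1, 2.5, 5, and +Inf and cumulative counts of 900, 950, 980, 995, and 1000, the p99 estimate lands between 2.5 and 5 depending on how many observations fall in that bucket. Because the estimate cannot exceed the highest finite bound, the `> 4.5` comparison is only meaningful if the histogram defines a bucket boundary at or above 5 seconds.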
5. Dashboards
Grafana dashboard webhook-dispatcher-overview:
- Delivery success/failure rate — stacked time series
- Dead-letter rate — time series with alert threshold overlay
- Delivery latency p50/p95/p99 — time series
- Retry queue depth — gauge + trend
- HTTP status code distribution — heatmap
- Active webhook configs — gauge
- REST API latency + error rate — time series