DLR Processor — Failure Modes
Status: populated
Owner: Platform Engineering
Last updated: 2026-04-18
Companion: APPLICATION_LOGIC · OBSERVABILITY
1. Failure Catalogue
FM-DLR-01: NATS Consumer Disconnected
| Attribute | Detail |
|---|---|
| Trigger | NATS server unreachable or network partition |
| Impact | DLRs queue in JetStream; no processing; billing + webhook delayed |
| Detection | dlr_nats_consumer_status gauge = 0; /ready returns 503 |
| Recovery | Automatic NATS reconnect with exponential backoff (max 30 s); JetStream replays pending messages |
| SLO Impact | Latency SLO breached after ~30 s; no data loss |
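
The capped exponential backoff named in the Recovery row can be illustrated with a minimal sketch. The 30 s ceiling comes from the table; the 0.5 s base delay, the doubling factor, and the function name `reconnect_delays` are assumptions chosen for illustration, not the service's actual settings.

```python
def reconnect_delays(base=0.5, factor=2.0, cap=30.0, attempts=10):
    """Yield the wait (in seconds) before each reconnect attempt.

    The delay grows geometrically and is clamped at `cap`.
    `base` and `factor` are illustrative assumptions; only the 30 s cap
    is taken from the failure catalogue.
    """
    for attempt in range(attempts):
        yield min(cap, base * factor ** attempt)


# First eight waits under the assumed parameters.
schedule = list(reconnect_delays(attempts=8))
print(schedule)
```

With these assumed parameters the delay reaches the 30 s ceiling on the seventh attempt and stays there, which is consistent with the latency SLO being breached after roughly 30 s of disconnection.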
FM-DLR-02: PostgreSQL Primary Unavailable
| Attribute | Detail |
|---|---|
| Trigger | PG primary failover or network partition |
| Impact | Processing blocked; messages Nak'd and retried |
| Detection | DB error logs; dlr_db_errors_total counter spike; PgBouncer health alerts |
| Recovery | Automatic failover to PG replica via PgBouncer; service resumes within ~30 s |
| SLO Impact | Latency SLO breached during failover; at-least-once delivery maintained |
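
The Nak-and-retry behaviour described above might look roughly like the following sketch. `FakeMessage`, `DatabaseUnavailable`, and `handle` are hypothetical stand-ins written for this example; they are not the NATS client API or the service's real handler. The point is only the ordering: acknowledge after the write succeeds, Nak when the database is unreachable so the broker redelivers.

```python
class DatabaseUnavailable(Exception):
    """Raised by the (hypothetical) persistence layer when the primary is unreachable."""


class FakeMessage:
    """Stand-in for a JetStream message; records which acknowledgement was sent."""

    def __init__(self, data):
        self.data = data
        self.outcome = None

    def ack(self):
        self.outcome = "ack"

    def nak(self):
        self.outcome = "nak"


def handle(msg, persist):
    """Ack only after the write succeeds; Nak on DB failure so the message is redelivered."""
    try:
        persist(msg.data)
    except DatabaseUnavailable:
        msg.nak()
        return False
    msg.ack()
    return True


def failing_persist(_data):
    # Simulates the primary being unavailable during failover.
    raise DatabaseUnavailable("primary unavailable")


stored = []
ok_msg, failed_msg = FakeMessage(b"dlr-1"), FakeMessage(b"dlr-2")
handle(ok_msg, stored.append)
handle(failed_msg, failing_persist)
print(ok_msg.outcome, failed_msg.outcome)
```

Because a message is never acknowledged before its write succeeds, a failover can cause redelivery but not loss, which is the at-least-once guarantee noted in the SLO Impact row.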
FM-DLR-03: High Orphan Rate
| Attribute | Detail |
|---|---|
| Trigger | operatorMessageId not found in orch.sms_messages at > 0.5% rate |
| Impact | Billing + webhook not triggered for affected messages |
| Detection | dlr_orphan_rate Prometheus gauge; alert threshold 0.5% over 5 min |
| Recovery | Investigate operator ID mapping; reconciliation job re-processes orphans post-fix |
| Root Causes | smpp-connector not storing operatorMessageId; race condition (DLR arrives before SENT update) |
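
As a worked example of the 0.5% threshold, the sketch below computes the orphan rate for a single 5-minute window. The function names and the sample counts are illustrative assumptions; the real alert is evaluated against the `dlr_orphan_rate` gauge.

```python
ORPHAN_RATE_THRESHOLD = 0.005  # 0.5%, from the Trigger and Detection rows


def orphan_rate(orphans, total):
    """Fraction of DLRs in the window whose operatorMessageId had no match."""
    return orphans / total if total else 0.0


def should_alert(orphans, total, threshold=ORPHAN_RATE_THRESHOLD):
    """True when the window's orphan rate strictly exceeds the threshold."""
    return orphan_rate(orphans, total) > threshold


# Hypothetical 5-minute windows of 10 000 DLRs each.
print(should_alert(60, 10_000))  # 0.6% of DLRs orphaned
print(should_alert(50, 10_000))  # exactly 0.5%
```

The comparison here is strict, so a window sitting exactly at 0.5% does not fire. Whether the production alert rule is strict or inclusive is not stated in this document.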
FM-DLR-04: Duplicate DLR Flood
| Attribute | Detail |
|---|---|
| Trigger | Operator re-delivers large batch of already-processed DLRs |
| Impact | CPU/DB pressure; no data corruption (idempotency guard) |
| Detection | dlr_duplicates_total counter spike; elevated PG SELECT load |
| Recovery | Idempotency check exits immediately on duplicate; no action required |
| Mitigation | Bloom filter cache in Redis for operatorMessageId lookup (reduces PG reads) — planned |
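
The idempotency guard in the Recovery row can be sketched as follows. This is an in-memory illustration only: the `set` stands in for the PostgreSQL lookup, the class and method names are hypothetical, and keying on `operatorMessageId` alone is an assumption (the real key may also include the delivery status).

```python
class DlrProcessor:
    """Toy processor showing an early exit on duplicate DLRs."""

    def __init__(self):
        self.processed = set()     # stand-in for the PG lookup (assumed key: operatorMessageId)
        self.duplicates_total = 0  # mirrors the dlr_duplicates_total counter
        self.side_effects = []     # stand-in for billing and webhook outbox writes

    def process(self, operator_message_id, status):
        """Return True if the DLR was processed, False if it was a duplicate."""
        if operator_message_id in self.processed:
            # Duplicate: count it and exit before any side effect runs.
            self.duplicates_total += 1
            return False
        self.processed.add(operator_message_id)
        self.side_effects.append((operator_message_id, status))
        return True


processor = DlrProcessor()
results = [processor.process(mid, "DELIVRD") for mid in ("op-1", "op-2", "op-1")]
print(results, processor.duplicates_total)
```

A redelivered DLR costs one lookup and one counter increment but triggers no further side effects, which is why a flood produces load rather than corruption.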
FM-DLR-05: Outbox Relay Lag
| Attribute | Detail |
|---|---|
| Trigger | NATS publish failures or outbox table growth |
| Impact | Delayed billing events and webhook dispatches; DB table grows |
| Detection | dlr_outbox_pending_count gauge > 1000; outbox relay error logs |
| Recovery | Outbox relay retries with exponential backoff; resumes automatically when NATS recovers |
| Escalation | Page on-call if outbox count > 10 000 or lag > 5 min |
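
The detection and escalation thresholds above combine into a simple severity decision. The sketch below encodes them; the threshold values come from the table, while the function name and the `"ok"`/`"warn"`/`"page"` labels are assumptions made for this example.

```python
WARN_PENDING = 1_000       # dlr_outbox_pending_count detection threshold
PAGE_PENDING = 10_000      # escalation threshold on pending count
PAGE_LAG_SECONDS = 5 * 60  # escalation threshold on relay lag


def outbox_severity(pending_count, lag_seconds):
    """Map outbox depth and relay lag to an illustrative severity label."""
    if pending_count > PAGE_PENDING or lag_seconds > PAGE_LAG_SECONDS:
        return "page"
    if pending_count > WARN_PENDING:
        return "warn"
    return "ok"


print(outbox_severity(500, 20))     # healthy
print(outbox_severity(2_500, 90))   # above detection threshold only
print(outbox_severity(2_500, 400))  # lag alone is enough to page
```

Either condition alone pages on-call, so a small but stalled outbox still escalates once its lag passes 5 minutes.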
FM-DLR-06: Schema Validation Failures
| Attribute | Detail |
|---|---|
| Trigger | smpp-connector publishes malformed DLR event |
| Impact | Individual DLR discarded; message silently dropped |
| Detection | dlr_validation_errors_total counter; DLW (dead-letter watch) alerts |
| Recovery | Fix the schema in the producer; affected DLRs cannot be recovered, because messages that fail validation are not replayed |
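
A minimal sketch of the validate-or-discard step follows. The real DLR schema is not given in this document, so the required field list is hypothetical: `operatorMessageId` appears elsewhere in this catalogue, while `status` is assumed purely for illustration.

```python
import json

# Hypothetical required fields; only operatorMessageId is named in this document.
REQUIRED_FIELDS = ("operatorMessageId", "status")


def validate_dlr(raw):
    """Return (event, None) when valid, or (None, reason) when the DLR must be discarded."""
    try:
        event = json.loads(raw)
    except (ValueError, TypeError):
        return None, "not valid JSON"
    if not isinstance(event, dict):
        return None, "payload is not an object"
    missing = [name for name in REQUIRED_FIELDS if not event.get(name)]
    if missing:
        return None, "missing fields: " + ", ".join(missing)
    return event, None


validation_errors_total = 0  # mirrors the dlr_validation_errors_total counter
samples = [
    '{"operatorMessageId": "op-1", "status": "DELIVRD"}',
    '{"status": "DELIVRD"}',
    "not-json",
]
for raw in samples:
    _event, reason = validate_dlr(raw)
    if reason is not None:
        validation_errors_total += 1
print(validation_errors_total)
```

Returning a reason string alongside the rejection gives the error counter and logs something specific to report, which matters here because a discarded DLR cannot be replayed later.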
2. Dependency Failure Matrix
| Dependency | Failure | Service Behaviour |
|---|---|---|
| NATS | Down | Consumer disconnects; readiness probe fails, so Kubernetes stops routing traffic to the pod where applicable |
| PostgreSQL dlr schema | Down | Nak all messages; backpressure to NATS |
| PostgreSQL orch schema | Down | Same as above |
| billing-service (downstream) | Down | Outbox accumulates; billing events delivered when recovered |
| webhook-dispatcher (downstream) | Down | Outbox accumulates; dispatched when recovered |
3. Runbooks
Runbook links (internal Confluence):
- [RB-DLR-01] High orphan rate investigation
- [RB-DLR-02] Outbox relay lag remediation
- [RB-DLR-03] NATS consumer restart procedure