DLR Processor — Failure Modes

Status: populated | Owner: Platform Engineering | Last updated: 2026-04-18 | Companion: APPLICATION_LOGIC · OBSERVABILITY

1. Failure Catalogue

FM-DLR-01: NATS Consumer Disconnected

| Attribute | Detail |
| --- | --- |
| Trigger | NATS server unreachable or network partition |
| Impact | DLRs queue in JetStream; no processing; billing + webhook delayed |
| Detection | dlr_nats_consumer_status gauge = 0; /ready returns 503 |
| Recovery | Automatic NATS reconnect with exponential backoff (max 30 s); JetStream replays pending messages |
| SLO Impact | Latency SLO breached after ~30 s; no data loss |
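
The reconnect schedule described under Recovery can be sketched as a capped exponential backoff. This is a minimal illustration, not the service's actual reconnect code: only the 30 s cap comes from the table above, while the 0.5 s starting delay, the doubling factor, and the name `reconnect_delays` are assumptions.

```python
from itertools import islice


def reconnect_delays(base: float = 0.5, cap: float = 30.0, factor: float = 2.0):
    """Yield wait times in seconds, growing geometrically up to `cap`.

    `base` and `factor` are assumed values; only the 30 s cap is documented.
    """
    delay = min(base, cap)
    while True:
        yield delay
        # Grow the delay, but never wait longer than the cap.
        delay = min(cap, delay * factor)


# First eight waits between reconnect attempts.
schedule = list(islice(reconnect_delays(), 8))
print(schedule)  # [0.5, 1.0, 2.0, 4.0, 8.0, 16.0, 30.0, 30.0]
```

In practice a small random jitter is often added to each delay so that several replicas do not reconnect at the same instant.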

FM-DLR-02: PostgreSQL Primary Unavailable

| Attribute | Detail |
| --- | --- |
| Trigger | PG primary failover or network partition |
| Impact | Processing blocked; messages Nak'd and retried |
| Detection | DB error logs; dlr_db_errors_total counter spike; PgBouncer health alerts |
| Recovery | Automatic failover to PG replica via PgBouncer; service resumes within ~30 s |
| SLO Impact | Latency SLO breached during failover; at-least-once delivery maintained |
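
The Nak-and-retry behaviour is what preserves at-least-once delivery: a message is acknowledged only after the database write succeeds. The sketch below shows that ordering with a synchronous stand-in for a JetStream message. `FakeMsg`, `DatabaseUnavailable`, `handle_dlr`, and `flaky_persist` are hypothetical names for illustration and are not the service's real handler or the NATS client API.

```python
class DatabaseUnavailable(Exception):
    """Hypothetical error for an unreachable PostgreSQL primary."""


class FakeMsg:
    """Synchronous stand-in for a JetStream message (illustration only)."""

    def __init__(self, data: bytes):
        self.data = data
        self.acked = False
        self.nak_count = 0

    def ack(self) -> None:
        self.acked = True

    def nak(self) -> None:
        self.nak_count += 1


def handle_dlr(msg, persist) -> bool:
    """Ack only after a successful write; Nak so the message is redelivered."""
    try:
        persist(msg.data)
    except DatabaseUnavailable:
        msg.nak()
        return False
    msg.ack()
    return True


# Simulate a failover: the first write fails, the redelivery succeeds.
attempts = []


def flaky_persist(data: bytes) -> None:
    attempts.append(data)
    if len(attempts) == 1:
        raise DatabaseUnavailable("primary unavailable")


msg = FakeMsg(b"dlr-1")
first = handle_dlr(msg, flaky_persist)   # Nak'd during the outage
second = handle_dlr(msg, flaky_persist)  # Ack'd on redelivery
```

Because the same DLR may be delivered more than once under this scheme, the idempotency guard described in FM-DLR-04 is what keeps redeliveries from causing duplicate side effects.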

FM-DLR-03: High Orphan Rate

| Attribute | Detail |
| --- | --- |
| Trigger | operatorMessageId not found in orch.sms_messages at > 0.5% rate |
| Impact | Billing + webhook not triggered for affected messages |
| Detection | dlr_orphan_rate Prometheus gauge; alert threshold 0.5% over 5 min |
| Recovery | Investigate operator ID mapping; reconciliation job re-processes orphans post-fix |
| Root Causes | smpp-connector not storing operatorMessageId; race condition (DLR arrives before SENT update) |
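
To make the alert condition concrete, the sketch below computes an orphan rate over a 5-minute sliding window and compares it with the 0.5% threshold. It is an in-process illustration only: in production the rate presumably comes from the dlr_orphan_rate gauge and a Prometheus alert rule, and the class name `OrphanRateWindow` is an assumption.

```python
from collections import deque


class OrphanRateWindow:
    """Track the share of orphaned DLRs over a sliding time window."""

    def __init__(self, window_s: float = 300.0, threshold: float = 0.005):
        self.window_s = window_s      # 5 minutes, per the alert definition
        self.threshold = threshold    # 0.5%
        self._events = deque()        # (timestamp, is_orphan) pairs
        self._orphans = 0

    def record(self, ts: float, is_orphan: bool) -> None:
        self._events.append((ts, is_orphan))
        self._orphans += int(is_orphan)
        self._evict(ts)

    def _evict(self, now: float) -> None:
        # Drop events that have fallen out of the window.
        while self._events and self._events[0][0] <= now - self.window_s:
            _, was_orphan = self._events.popleft()
            self._orphans -= int(was_orphan)

    def rate(self) -> float:
        return self._orphans / len(self._events) if self._events else 0.0

    def breached(self) -> bool:
        return self.rate() > self.threshold


window = OrphanRateWindow()
# 1000 DLRs within 100 s, of which the first 6 are orphans (0.6%).
for i in range(1000):
    window.record(i * 0.1, is_orphan=i < 6)
breached_during_burst = window.breached()

# One healthy DLR much later pushes the burst out of the window.
window.record(400.0, is_orphan=False)
breached_after_recovery = window.breached()
```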

FM-DLR-04: Duplicate DLR Flood

| Attribute | Detail |
| --- | --- |
| Trigger | Operator re-delivers large batch of already-processed DLRs |
| Impact | CPU/DB pressure; no data corruption (idempotency guard) |
| Detection | dlr_duplicates_total counter spike; elevated PG SELECT load |
| Recovery | Idempotency check exits immediately on duplicate; no action required |
| Mitigation | Bloom filter cache in Redis for operatorMessageId lookup (reduces PG reads); planned |
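
For readers unfamiliar with the planned mitigation, here is a minimal in-memory Bloom filter keyed on operatorMessageId. It is a sketch of the data structure only: the planned version is Redis-backed and its design is not specified here, so the class name, the sizing parameters, and the SHA-256 double-hashing scheme are all assumptions.

```python
import hashlib


class BloomFilter:
    """Probabilistic set: no false negatives, tunable false-positive rate."""

    def __init__(self, size_bits: int = 1 << 20, num_hashes: int = 7):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self._bits = bytearray(size_bits // 8 + 1)

    def _positions(self, key: str):
        # Double hashing: derive k bit positions from two halves of one digest.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd step
        return [(h1 + i * h2) % self.size_bits for i in range(self.num_hashes)]

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self._bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        return all(
            self._bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


seen = BloomFilter(size_bits=1024, num_hashes=3)
seen.add("op-123")  # hypothetical operatorMessageId
known = seen.might_contain("op-123")
```

A negative answer from `might_contain` is definitive, so the corresponding PostgreSQL lookup can be skipped. A positive answer may be a false positive and still needs to be confirmed against the database.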

FM-DLR-05: Outbox Relay Lag

| Attribute | Detail |
| --- | --- |
| Trigger | NATS publish failures or outbox table growth |
| Impact | Delayed billing events and webhook dispatches; DB table grows |
| Detection | dlr_outbox_pending_count gauge > 1000; outbox relay error logs |
| Recovery | Outbox relay retries with exponential backoff; resumes automatically when NATS recovers |
| Escalation | Page on-call if outbox count > 10 000 or lag > 5 min |
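
The detection and escalation thresholds above combine into a simple two-tier decision. The sketch below encodes them as a function. The thresholds come from the table, while the `Severity` enum and the function name are illustrative assumptions rather than part of the service.

```python
from enum import Enum


class Severity(Enum):
    OK = "ok"
    WARN = "warn"
    PAGE = "page"


def outbox_severity(pending_count: int, lag_seconds: float) -> Severity:
    """Map outbox backlog and relay lag to an alert tier."""
    # Escalation: more than 10 000 pending rows or more than 5 min of lag.
    if pending_count > 10_000 or lag_seconds > 5 * 60:
        return Severity.PAGE
    # Detection: more than 1000 pending rows.
    if pending_count > 1_000:
        return Severity.WARN
    return Severity.OK


print(outbox_severity(500, 10))      # Severity.OK
print(outbox_severity(2_500, 60))    # Severity.WARN
print(outbox_severity(2_500, 301))   # Severity.PAGE
```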

FM-DLR-06: Schema Validation Failures

| Attribute | Detail |
| --- | --- |
| Trigger | smpp-connector publishes malformed DLR event |
| Impact | The individual DLR is discarded; the message is silently dropped |
| Detection | dlr_validation_errors_total counter; DLW (dead-letter watch) alerts |
| Recovery | Fix schema in producer; affected DLRs cannot be recovered (messages that fail validation are not replayed) |

2. Dependency Failure Matrix

| Dependency | Failure | Service Behaviour |
| --- | --- | --- |
| NATS | Down | Consumer disconnects; Kubernetes readiness probe fails; K8s stops routing traffic to the pod, if applicable |
| PostgreSQL dlr schema | Down | Nak all messages; backpressure to NATS |
| PostgreSQL orch schema | Down | Same as PostgreSQL dlr schema |
| billing-service (downstream) | Down | Outbox accumulates; billing events delivered when recovered |
| webhook-dispatcher (downstream) | Down | Outbox accumulates; webhooks dispatched when recovered |

3. Runbooks

Runbook links (internal Confluence):

  • [RB-DLR-01] High orphan rate investigation
  • [RB-DLR-02] Outbox relay lag remediation
  • [RB-DLR-03] NATS consumer restart procedure