SMS Orchestrator — Failure Modes
Status: populated · Owner: Platform Engineering + SRE · Last updated: 2026-04-18
| # | Failure | User impact | Detection | Mitigation |
|---|---|---|---|---|
| 1 | routing-engine gRPC down | Pipeline stalls; messages retry, then go to the DLQ after 3 attempts | OrchRoutingEngineDown alert | Retry + DLQ; runbook reroutes to a static operator if the outage is prolonged |
| 2 | routing-engine slow (P99 > 500 ms) | Pipeline backpressure; submit latency unaffected | Grafana dashboard | Circuit breaker → retry on timeout |
| 3 | PG primary down | Submit returns 503; pipeline writes fail | OrchPgErrors alert | K8s deployment pauses new submits; pipeline NAKs NATS (redelivery) |
| 4 | PG replica lag | Stale GET /v1/sms/{id} reads | Replica lag alert | Route reads to primary when lag > 2 s |
| 5 | Redis cluster down | Submit idempotency fails closed (503); pipeline idempotency fails open | OrchRedisErrors alert | Documented fail-open on pipeline; NATS AckWait bounds duplicate processing |
| 6 | NATS cluster degraded | Submit can't publish → 503; pipeline stalls | OrchNatsPublishErrors alert | Publish retries in-process; escalate to NATS ops |
| 7 | NATS stream near-full | Backpressure; publish latency spikes | NATS JetStream metrics | Scale stream storage; if growth is unbounded, reject at pre-accept with 503 |
| 8 | Operator queue has no consumer | Messages accumulate in smpp.operator.* | NATS stream depth alert | Indicates the smpp-connector bind is failing; follow the runbook |
| 9 | DLQ publish fails | Message NAKed; eventually MaxDeliver exceeded → lost | OrchDlqBurst alert | Secondary write to PG dead_letters table before ACK |
| 10 | Kong down | No new traffic accepted | Cloudflare 5xx + Kong health alert | Active-passive Kong; see api-gateway runbook |
| 11 | Zod validation passes but routing rejects (NO_ROUTE_FOUND) | Message FAILED; customer sees failure | orch_dlq_total{reason='no_route'} | Customer-facing error; ops investigates routing config |
| 12 | Idempotency replay with different body | 409 to caller | Expected behavior | Caller fixes client bug |
| 13 | Clock skew across pods | Out-of-order statusUpdatedAt | NTP monitoring | All pods use NTP; tolerate skew in SLO computation |
| 14 | PII leak in logs | Compliance incident | Log scanner (Loki regex) | Pino transport masks body/MSISDN |
| 15 | attempt_count desync between PG and in-memory state after restart | Potential extra retry | None (rare; low impact) | On restart, pipeline reads PG state before processing redelivery |
| 16 | Kong JWT key rotation with clock skew | Spike of 401s at Kong (not this service) | Kong auth metrics | Kong caches JWKS; overlap keys 10 min |
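
Rows 5 and 12 describe three idempotency outcomes: a replay with a different body gets a 409, a store outage fails closed (503) on submit, and the same outage fails open on the pipeline. The sketch below shows one way that decision could be structured. It is not the orchestrator's code: `decideIdempotency`, `IdempotencyStore`, `MemoryStore`, and the fingerprint strings are hypothetical names, and the store is synchronous and in-memory purely to keep the example self-contained (a Redis-backed store would be asynchronous and would need an atomic set-if-absent).

```typescript
// Hypothetical types for illustration; none of these names come from the service.
type Decision =
  | { kind: "proceed" }
  | { kind: "replay" }
  | { kind: "reject"; status: 409 | 503 };

type StoreFailurePolicy = "fail-closed" | "fail-open";

interface IdempotencyStore {
  // Returns the body fingerprint stored for the key, or undefined if unseen.
  get(key: string): string | undefined;
  set(key: string, fingerprint: string): void;
}

function decideIdempotency(
  store: IdempotencyStore,
  key: string,
  fingerprint: string,
  policy: StoreFailurePolicy,
): Decision {
  try {
    const seen = store.get(key);
    if (seen === undefined) {
      store.set(key, fingerprint);
      return { kind: "proceed" };
    }
    // Same key, same body: safe replay. Same key, different body: conflict.
    return seen === fingerprint
      ? { kind: "replay" }
      : { kind: "reject", status: 409 };
  } catch {
    // Store unavailable: submit rejects with 503, pipeline carries on.
    return policy === "fail-closed"
      ? { kind: "reject", status: 503 }
      : { kind: "proceed" };
  }
}

// In-memory stand-in for the real store (assumption: sync API for brevity).
class MemoryStore implements IdempotencyStore {
  private readonly entries = new Map<string, string>();
  get(key: string): string | undefined {
    return this.entries.get(key);
  }
  set(key: string, fingerprint: string): void {
    this.entries.set(key, fingerprint);
  }
}

// A store that always throws, to simulate the outage in row 5.
const unavailableStore: IdempotencyStore = {
  get() {
    throw new Error("store unavailable");
  },
  set() {
    throw new Error("store unavailable");
  },
};
```

Under these assumptions the submit path would call `decideIdempotency(store, key, fp, "fail-closed")` and the pipeline would pass `"fail-open"`, which matches the split in row 5: duplicates are possible on the pipeline during an outage, bounded by NATS AckWait.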
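
The mitigation in row 4 (route reads to primary when replica lag exceeds 2 s) amounts to a small routing decision. A minimal sketch follows, assuming the lag is available as a number of seconds; `pickReadTarget` and `MAX_REPLICA_LAG_SECONDS` are hypothetical names, and treating an unknown lag as "use primary" is an assumption rather than documented behavior.

```typescript
type ReadTarget = "primary" | "replica";

// Threshold taken from row 4 of the table above.
const MAX_REPLICA_LAG_SECONDS = 2;

// replicaLagSeconds is null when the lag could not be measured.
function pickReadTarget(replicaLagSeconds: number | null): ReadTarget {
  if (replicaLagSeconds === null || isNaN(replicaLagSeconds)) {
    // Assumption: prefer consistency when the lag is unknown.
    return "primary";
  }
  return replicaLagSeconds > MAX_REPLICA_LAG_SECONDS ? "primary" : "replica";
}
```

The comparison is strict, so a lag of exactly 2 s still reads from the replica, mirroring the "lag > 2 s" wording.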
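
Row 9 mitigates a failed DLQ publish with a secondary write to the PG `dead_letters` table before the ACK. The sketch below shows one possible reading of that: try the DLQ first, fall back to PG, and ACK only if one of the two sinks accepted the message. Whether the PG write is a fallback or always performed is not stated in the table, so the fallback ordering is an assumption, as are the names `deadLetter`, `DeadLetterDeps`, `publishToDlq`, and `insertDeadLetterRow`.

```typescript
// Hypothetical dependency interface; real NATS and PG clients would sit behind it.
interface DeadLetterDeps<M> {
  publishToDlq(msg: M): Promise<void>;
  insertDeadLetterRow(msg: M): Promise<void>; // write to the PG dead_letters table
  ack(): void;
  nak(): void;
}

type DeadLetterOutcome = "dlq" | "pg-fallback" | "nak";

async function deadLetter<M>(
  msg: M,
  deps: DeadLetterDeps<M>,
): Promise<DeadLetterOutcome> {
  let outcome: DeadLetterOutcome = "nak";
  try {
    await deps.publishToDlq(msg);
    outcome = "dlq";
  } catch {
    try {
      // DLQ publish failed: persist to PG so the message is not lost.
      await deps.insertDeadLetterRow(msg);
      outcome = "pg-fallback";
    } catch {
      // Both sinks failed; leave outcome as "nak" so the message is redelivered.
    }
  }
  // ACK only after the message is durably stored somewhere.
  if (outcome === "nak") {
    deps.nak();
  } else {
    deps.ack();
  }
  return outcome;
}
```

With this shape, a message is lost to MaxDeliver only if both the DLQ publish and the PG write keep failing across every redelivery.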
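
Row 14 relies on the Pino transport masking the message body and MSISDN. The sketch below illustrates the kind of masking such a step could apply; it does not use Pino's API and is not the service's transport. The field names `body` and `msisdn`, the helper names, and the choice to keep the first three and last two characters are all assumptions for illustration.

```typescript
// Keeps the first 3 and last 2 characters and stars out the rest.
function maskMsisdn(msisdn: string): string {
  const keepStart = 3;
  const keepEnd = 2;
  if (msisdn.length <= keepStart + keepEnd) {
    // Too short to partially reveal: mask everything.
    return "*".repeat(msisdn.length);
  }
  const hidden = msisdn.length - keepStart - keepEnd;
  return (
    msisdn.slice(0, keepStart) +
    "*".repeat(hidden) +
    msisdn.slice(msisdn.length - keepEnd)
  );
}

// Returns a copy of the record with body removed and msisdn masked.
function redactLogRecord(
  record: Record<string, unknown>,
): Record<string, unknown> {
  const out: Record<string, unknown> = { ...record };
  if ("body" in out) {
    out["body"] = "[REDACTED]";
  }
  const msisdn = out["msisdn"];
  if (typeof msisdn === "string") {
    out["msisdn"] = maskMsisdn(msisdn);
  }
  return out;
}
```

For example, `maskMsisdn("+14155552671")` returns `"+14*******71"`. The log scanner in the Detection column remains useful as a backstop, since masking by field name cannot catch a number embedded in free text.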