SMS Orchestrator — Failure Modes

Status: populated; Owner: Platform Engineering + SRE; Last updated: 2026-04-18

| # | Failure | User impact | Detection | Mitigation |
|---|---------|-------------|-----------|------------|
| 1 | routing-engine gRPC down | Pipeline stalls; messages retry, then go to the DLQ after 3 attempts | `OrchRoutingEngineDown` alert | Retry + DLQ; if the outage is prolonged, the runbook reroutes to a static operator |
| 2 | routing-engine slow (P99 > 500 ms) | Pipeline backpressure; submit latency unaffected | Grafana dashboard | Circuit breaker → retry on timeout |
| 3 | PG primary down | Submit returns 503; pipeline writes fail | `OrchPgErrors` alert | K8s deployment pauses new submits; pipeline NAKs NATS messages for redelivery |
| 4 | PG replica lag | Stale `GET /v1/sms/{id}` reads | Replica lag alert | Route reads to the primary when lag > 2 s |
| 5 | Redis cluster down | Submit idempotency fails closed (503); pipeline idempotency fails open | `OrchRedisErrors` alert | Documented fail-open on pipeline; NATS AckWait bounds duplicate processing |
| 6 | NATS cluster degraded | Submit can't publish → 503; pipeline stalls | `OrchNatsPublishErrors` alert | Publish retries in-process; escalate to NATS ops |
| 7 | NATS stream near-full | Backpressure; publish latency spikes | NATS JetStream metrics | Scale stream storage; reject submits with 503 before acceptance if growth is unbounded |
| 8 | Operator queue has no consumer | Messages accumulate in `smpp.operator.*` | NATS stream depth alert | Indicates the smpp-connector bind is failing; see runbook |
| 9 | DLQ publish fails | Message is NAKed; once `MaxDeliver` is exceeded it is lost | `OrchDlqBurst` alert | Secondary write to PG `dead_letters` table before ACK |
| 10 | Kong down | No new traffic accepted | Cloudflare 5xx + Kong health alert | Active-passive Kong; see api-gateway runbook |
| 11 | Zod validation passes but routing rejects (`NO_ROUTE_FOUND`) | Message marked FAILED; customer sees failure | `orch_dlq_total{reason='no_route'}` | Customer-facing error; ops investigates routing config |
| 12 | Idempotency replay with different body | 409 to caller | None (expected behavior) | Caller fixes client bug |
| 13 | Clock skew across pods | Out-of-order `statusUpdatedAt` | NTP monitoring | All pods use NTP; tolerate skew in SLO computation |
| 14 | PII leak in logs | Compliance incident | Log scanner (Loki regex) | Pino transport masks body/MSISDN |
| 15 | `attempt_count` desync between PG and in-memory state after restart | Potential extra retry | None (rare; low impact) | On restart, pipeline reads PG state before processing redeliveries |
| 16 | Kong JWT key rotation with clock skew | Spike of 401s at Kong (not this service) | Kong auth metrics | Kong caches JWKS; old and new keys overlap for 10 min |
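
Row 5 distinguishes two idempotency policies when Redis is unreachable: the submit path fails closed (503) and the pipeline fails open. The sketch below shows one way that decision could be expressed as a pure function. Every name in it (`Path`, `StoreOutcome`, `Decision`, `idempotencyDecision`) is hypothetical and chosen for illustration; none is taken from the orchestrator's code.

```typescript
// Hypothetical names throughout; only the fail-closed / fail-open split comes from row 5.
type Path = "submit" | "pipeline";

// Result of trying to claim an idempotency key in the store (Redis in row 5).
type StoreOutcome = "claimed" | "already-seen" | "store-error";

type Decision = "process" | "duplicate" | "reject-503";

function idempotencyDecision(path: Path, outcome: StoreOutcome): Decision {
  switch (outcome) {
    case "claimed":
      // First time this key is seen: handle the message normally.
      return "process";
    case "already-seen":
      // Key already claimed: the caller handles it as a duplicate.
      return "duplicate";
    case "store-error":
      // Submit fails closed: reject rather than risk accepting a duplicate.
      // Pipeline fails open: keep processing and accept possible duplicates.
      return path === "submit" ? "reject-503" : "process";
  }
}
```

With this sketch, `idempotencyDecision("submit", "store-error")` yields `"reject-503"`, while `idempotencyDecision("pipeline", "store-error")` yields `"process"`. Keeping the policy in a single pure function makes the asymmetry between the two paths easy to unit-test without a live Redis.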
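
Row 4 mitigates replica lag by routing reads to the primary once lag exceeds 2 s. A minimal sketch of that rule follows. The names `MAX_REPLICA_LAG_MS`, `ReadTarget`, and `chooseReadTarget` are hypothetical, and the behavior when no lag measurement is available is an assumption, since the table does not specify it.

```typescript
// Hypothetical names; the 2 s threshold is the only value taken from row 4.
const MAX_REPLICA_LAG_MS = 2000;

type ReadTarget = "primary" | "replica";

// lagMs: latest replica lag measurement in milliseconds, or undefined if unknown.
function chooseReadTarget(lagMs: number | undefined): ReadTarget {
  // Assumption (not in the table): with no measurement, prefer fresh reads.
  if (lagMs === undefined) return "primary";
  // Strictly greater than the threshold, matching "lag > 2 s".
  return lagMs > MAX_REPLICA_LAG_MS ? "primary" : "replica";
}
```

Here `chooseReadTarget(2500)` returns `"primary"` and `chooseReadTarget(2000)` returns `"replica"`, because the comparison is strict.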