SMS Orchestrator — Failure Modes

Status: populated
Owner: Platform Engineering + SRE
Last updated: 2026-04-18

| # | Failure | User impact | Detection | Mitigation |
|---|---------|-------------|-----------|------------|
| 1 | routing-engine gRPC down | Pipeline stalls; messages retry, then go to the DLQ after 3 attempts | OrchRoutingEngineDown alert | Retry + DLQ; runbook reroutes to a static operator if the outage is prolonged |
| 2 | routing-engine slow (P99 > 500 ms) | Pipeline backpressure; submit latency unaffected | Grafana dashboard | Circuit breaker → retry on timeout |
| 3 | PG primary down | Submit returns 503; pipeline writes fail | OrchPgErrors alert | K8s deployment pauses new submits; pipeline NAKs NATS (redelivery) |
| 4 | PG replica lag | Stale GET /v1/sms/{id} reads | Replica lag alert | Route reads to primary when lag > 2 s |
| 5 | Redis cluster down | Submit idempotency fails closed (503); pipeline idempotency fails open | OrchRedisErrors alert | Documented fail-open on pipeline; NATS AckWait bounds duplicate processing |
| 6 | NATS cluster degraded | Submit can't publish → 503; pipeline stalls | OrchNatsPublishErrors alert | Publish retries in-process; escalate to NATS ops |
| 7 | NATS stream near-full | Backpressure; publish latency spikes | NATS JetStream metrics | Scale stream storage; drop pre-accept to 503 if unbounded |
| 8 | Operator queue has no consumer | Messages accumulate in smpp.operator.* | NATS stream depth alert | smpp-connector bind failing; see runbook |
| 9 | DLQ publish fails | Message NAKed; eventually MaxDeliver exceeded → lost | OrchDlqBurst alert | Secondary write to PG dead_letters table before ACK |
| 10 | Kong down | No new traffic accepted | Cloudflare 5xx + Kong health alert | Active-passive Kong; see api-gateway runbook |
| 11 | Zod validation passes but routing rejects (NO_ROUTE_FOUND) | Message FAILED; customer sees failure | orch_dlq_total{reason='no_route'} | Customer-facing error; ops investigates routing config |
| 12 | Idempotency replay with different body | 409 to caller | Expected behavior | Caller fixes client bug |
| 13 | Clock skew across pods | Out-of-order statusUpdatedAt | NTP monitoring | All pods use NTP; tolerate skew in SLO computation |
| 14 | PII leak in logs | Compliance incident | Log scanner (Loki regex) | Pino transport masks body/MSISDN |
| 15 | attempt_count desync PG vs in-memory after restart | Potential extra retry | Rare; low impact | On restart, pipeline reads PG state before processing redelivery |
| 16 | Kong JWT key rotation with clock skew | Spike of 401s at Kong (not this service) | Kong auth metrics | Kong caches JWKS; overlap keys 10 min |
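
The mitigation for failure 4 (route reads to primary when replica lag exceeds 2 s) might look roughly like the sketch below. Only the 2 s threshold comes from the table; `chooseReadTarget`, `ReadTarget`, and the decision to treat a missing lag measurement as "lagging" are assumptions made for illustration, not the service's actual code.

```typescript
// Hypothetical helper: decides which PG node serves GET /v1/sms/{id}.
type ReadTarget = "primary" | "replica";

// Threshold taken from the table: reads move to primary when lag > 2 s.
const MAX_REPLICA_LAG_MS = 2000;

// lagMs is the latest replica lag measurement in milliseconds.
// Assumption: an unavailable or invalid measurement is treated as lagging,
// so the caller falls back to the primary rather than risk a stale read.
function chooseReadTarget(lagMs: number | undefined): ReadTarget {
  if (lagMs === undefined || Number.isNaN(lagMs)) {
    return "primary";
  }
  return lagMs > MAX_REPLICA_LAG_MS ? "primary" : "replica";
}
```

Falling back to primary on an unknown lag trades extra primary load for read freshness; a deployment that prefers to protect the primary could invert that default.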
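
Failures 5 and 12 both concern the idempotency check: the submit path fails closed (503) when the store is unreachable, the pipeline fails open, and a replay with a different body yields 409. The sketch below shows one way those three rules could fit together. Everything named here (`KeyStore`, `checkIdempotency`, the outcome shapes, comparing a body hash) is hypothetical, and a plain object stands in for Redis; applying the body comparison on the pipeline path is also an assumption.

```typescript
// Hypothetical outcome of an idempotency lookup.
type IdempotencyOutcome =
  | { kind: "proceed" }                     // first time this key is seen
  | { kind: "replay" }                      // same key, same body
  | { kind: "conflict"; status: 409 }       // same key, different body
  | { kind: "unavailable"; status: 503 };   // store down, fail closed

// Minimal stand-in for the Redis client; both methods may throw when the
// store is down.
interface KeyStore {
  get(key: string): string | undefined;
  set(key: string, value: string): void;
}

type Mode = "submit" | "pipeline";

function checkIdempotency(
  store: KeyStore,
  key: string,
  bodyHash: string,
  mode: Mode,
): IdempotencyOutcome {
  try {
    const seen = store.get(key);
    if (seen === undefined) {
      store.set(key, bodyHash);
      return { kind: "proceed" };
    }
    return seen === bodyHash
      ? { kind: "replay" }
      : { kind: "conflict", status: 409 };
  } catch {
    // Store unreachable: submit fails closed, pipeline fails open.
    return mode === "submit"
      ? { kind: "unavailable", status: 503 }
      : { kind: "proceed" };
  }
}
```

With fail-open on the pipeline, a message can be processed twice while the store is down; per the table, NATS AckWait is what bounds that duplicate processing.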
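
Failures 1 and 9 together describe what happens when processing keeps failing: retry up to 3 attempts, then dead-letter, with a PG dead_letters write so the message is not lost if the DLQ publish itself fails. The following is a minimal synchronous sketch of that decision. It reads "secondary write" as a fallback used only when the DLQ publish throws, which is an interpretation; `Sinks`, `handleFailure`, and the 1-based `attempt` counter are hypothetical, and a real handler would be asynchronous.

```typescript
// What the consumer tells NATS after a failed processing attempt.
type Decision = "ack" | "nak";

// From the table: messages go to the DLQ after 3 attempts.
const MAX_ATTEMPTS = 3;

// Hypothetical side-effect hooks; either may throw when its backend is down.
interface Sinks {
  publishDlq(msgId: string, reason: string): void;
  insertDeadLetter(msgId: string, reason: string): void; // PG dead_letters
}

// attempt is 1-based: 1 for the first delivery, 2 for the first redelivery.
function handleFailure(
  msgId: string,
  attempt: number,
  reason: string,
  sinks: Sinks,
): Decision {
  if (attempt < MAX_ATTEMPTS) {
    return "nak"; // ask NATS to redeliver, which acts as the retry
  }
  try {
    sinks.publishDlq(msgId, reason);
    return "ack";
  } catch {
    try {
      // Fallback: persist to PG so the message survives a DLQ outage.
      sinks.insertDeadLetter(msgId, reason);
      return "ack";
    } catch {
      // Neither sink accepted the message; do not ACK, so it is not dropped
      // by this handler (MaxDeliver still applies on the NATS side).
      return "nak";
    }
  }
}
```

The key property is that ACK is returned only after the message has been recorded somewhere durable.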
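
For failure 14, the table says a Pino transport masks the message body and MSISDN. The masking step on its own could be sketched as a pure function like the one below. It does not use Pino's API; the field names `body` and `msisdn`, the `[REDACTED]` placeholder, and keeping the last two digits are all assumptions chosen for illustration.

```typescript
type LogRecord = Record<string, unknown>;

const REDACTED = "[REDACTED]";
const VISIBLE_SUFFIX = 2; // assumption: keep the last two characters

// Replaces all but the last VISIBLE_SUFFIX characters with "*".
function maskMsisdn(msisdn: string): string {
  if (msisdn.length <= VISIBLE_SUFFIX) {
    return "*".repeat(msisdn.length);
  }
  return (
    "*".repeat(msisdn.length - VISIBLE_SUFFIX) + msisdn.slice(-VISIBLE_SUFFIX)
  );
}

// Returns a shallow copy with top-level PII fields masked; the input record
// is left untouched. Nested fields are not handled in this sketch.
function maskPii(record: LogRecord): LogRecord {
  const out: LogRecord = { ...record };
  if ("body" in out) {
    out.body = REDACTED;
  }
  if (typeof out.msisdn === "string") {
    out.msisdn = maskMsisdn(out.msisdn);
  }
  return out;
}
```

Masking before the record reaches any log sink keeps the Loki regex scanner as a second line of defense rather than the only one.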