routing-engine — Failure Modes

Status: populated | Last updated: 2026-04-18

Failure Scenarios

#	Failure scenario	Detection method	Impact	Mitigation
F1	PostgreSQL read replica unavailable	`/ready` probe fails; DB query errors logged	Cache misses cannot be resolved; gRPC returns `UNAVAILABLE`	Route traffic to pods with warm cache; sms-orchestrator retries with backoff; alert fires
F2	Redis unavailable	Redis PING fails in `/ready`; GET/SET errors logged	Cache hits unavailable; all requests go to DB (increased latency, DB load spike)	Fallback: serve from DB without caching; alert `RedisDown`; auto-heal via replica promotion
F3	All operators UNBOUND simultaneously	`operators_healthy_total == 0` alert	All `SelectOperator` calls return gRPC `UNAVAILABLE`	sms-orchestrator queues messages; platform on-call alerted; smpp-connector reconnects in background
F4	NATS JetStream consumer disconnects	Health event processing halts; logged at `warn`	Operator health cache becomes stale; entries auto-expire after the 60 s TTL; until then routing may use outdated health data	NATS consumer auto-reconnects with exponential backoff; stale entries expire safely
F5	Stale routing decision cache after rule change	`route:decision` TTL has not yet expired	Wrong operator selected for up to 300 s	operator-management-service must send a cache-bust signal (future enhancement); emergency procedure: delete the Redis keys manually
F6	gRPC service OOMKilled	Kubernetes detects pod crash	In-flight gRPC requests on the pod fail; Kubernetes restarts the container	Memory limits set conservatively; readiness probe blocks traffic until the replacement is ready
F7	mTLS certificate expiry	gRPC handshake fails; cert-manager renewal alert	All `SelectOperator` calls fail with TLS error	cert-manager renews 30 days before expiry; alert at 14 days remaining
F8	Prefix cache refresh failure	Background job logs `error`; stale entries served	New prefixes not available until next successful refresh	In-process cache retains last good state; alert fires after 3 consecutive failures
F9	No routing rule for a new destination country	`SelectOperator` returns `NOT_FOUND`	Messages to that country fail immediately	sms-orchestrator moves to DLQ; ops team adds routing rule via operator-management-service UI
F10	Hot prefix — single prefix drives 100% traffic to one operator	No specific alert; high TPS on one operator visible in metrics	Operator TPS limit breach; messages queued or dropped	PRIORITY/FAILOVER strategies distribute load; future: weighted routing support
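
Several mitigations above (F1, F4) rely on retrying with exponential backoff. The following is a minimal sketch of that pattern, not the actual sms-orchestrator or NATS client code. The names `Unavailable` and `call_with_backoff`, and the default delays, are assumptions made for illustration.

```python
import random
import time


class Unavailable(Exception):
    """Hypothetical stand-in for a gRPC UNAVAILABLE error."""


def call_with_backoff(call, attempts=5, base=0.1, cap=5.0, sleep=time.sleep):
    """Call `call()` until it succeeds or `attempts` tries are used up.

    The wait before retry n is drawn uniformly from
    [0, min(cap, base * 2**n)] ("full jitter"), so that many clients
    retrying at once do not hit a recovering dependency in lockstep.
    All defaults are illustrative values, not production settings.
    """
    for attempt in range(attempts):
        try:
            return call()
        except Unavailable:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error to the caller
            ceiling = min(cap, base * 2 ** attempt)
            sleep(random.uniform(0, ceiling))
```

A caller would wrap the failing operation, for example `call_with_backoff(lambda: client.select_operator(request))`, where `client` is whatever stub the caller already holds. Once the attempts are exhausted the error propagates, and the caller decides whether to queue the message or move it to the DLQ.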

Degraded Mode Behaviour

When Redis is unavailable, routing-engine falls back to a cache-bypass mode:

Every `SelectOperator` call queries PostgreSQL directly.
No results are written to cache.
A warn-level log event `routing.degraded_mode.redis_unavailable` is emitted on every call.
The `/ready` endpoint returns 503 (Kubernetes stops routing new traffic to the pod).
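
The lookup path described above can be sketched roughly as follows. This is an illustrative outline only, not the routing-engine implementation: `CacheUnavailable`, `select_operator`, and the `cache` and `db` objects (with `get`, `set`, and `lookup` methods) are hypothetical names, and the 300 s TTL is taken from scenario F5.

```python
import logging

log = logging.getLogger("routing-engine")

DECISION_TTL_SECONDS = 300  # assumed from the 300 s window in scenario F5


class CacheUnavailable(Exception):
    """Hypothetical error raised by the cache client when Redis is down."""


def select_operator(prefix, cache, db):
    """Return the operator for `prefix`, bypassing the cache if it is down."""
    try:
        cached = cache.get(prefix)
    except CacheUnavailable:
        # Degraded mode: log on every call, query the database directly,
        # and do not attempt to write the result back to the cache.
        log.warning("routing.degraded_mode.redis_unavailable")
        return db.lookup(prefix)

    if cached is not None:
        return cached

    operator = db.lookup(prefix)
    try:
        cache.set(prefix, operator, ttl=DECISION_TTL_SECONDS)
    except CacheUnavailable:
        # Redis went away between the read and the write: still serve the result.
        log.warning("routing.degraded_mode.redis_unavailable")
    return operator
```

In this sketch a cache failure never fails the request; the cost is that every call reaches the database until Redis returns.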

Note: Cache-bypass mode works but increases PostgreSQL load significantly. RDS auto-scaling or read replica promotion should be triggered automatically by the database alarm.
