routing-engine — Failure Modes
Status: populated | Last updated: 2026-04-18
Failure Scenarios
| # | Failure scenario | Detection method | Impact | Mitigation |
|---|---|---|---|---|
| F1 | PostgreSQL read replica unavailable | /ready probe fails; DB query errors logged | Cache misses cannot be resolved; gRPC returns UNAVAILABLE | Route traffic to pods with warm cache; sms-orchestrator retries with backoff; alert fires |
| F2 | Redis unavailable | Redis PING fails in /ready; GET/SET errors logged | Cache hits unavailable; all requests go to DB (increased latency, DB load spike) | Fallback: serve from DB without caching; alert RedisDown; auto-heal via replica promotion |
| F3 | All operators UNBOUND simultaneously | operators_healthy_total == 0 alert | All SelectOperator calls return gRPC UNAVAILABLE | sms-orchestrator queues messages; platform on-call alerted; smpp-connector reconnects in background |
| F4 | NATS JetStream consumer disconnects | Health event processing halts; logged at warn | Operator health cache goes stale; routing may use outdated health data until entries expire (TTL 60 s) | NATS consumer auto-reconnects with exponential backoff; stale entries expire safely |
| F5 | Stale routing decision cache after rule change | Route:decision TTL has not expired | Wrong operator selected for up to 300 s | operator-management-service must send a cache-bust signal (future enhancement); manual Redis key deletion as emergency procedure |
| F6 | gRPC service OOMKilled | Kubernetes detects pod crash | In-flight gRPC requests on the pod fail; Kubernetes restarts the container | Memory limits set conservatively; readiness probe blocks traffic until new pod is ready |
| F7 | mTLS certificate expiry | gRPC handshake fails; cert-manager renewal alert | All SelectOperator calls fail with TLS error | cert-manager renews 30 days before expiry; alert at 14 days remaining |
| F8 | Prefix cache refresh failure | Background job logs error; stale entries served | New prefixes not available until next successful refresh | In-process cache retains last good state; alert fires after 3 consecutive failures |
| F9 | No routing rule for a new destination country | SelectOperator returns NOT_FOUND | Messages to that country fail immediately | sms-orchestrator moves to DLQ; ops team adds routing rule via operator-management-service UI |
| F10 | Hot prefix: a single prefix drives 100% of traffic to one operator | No specific alert; high TPS on one operator visible in metrics | Operator TPS limit breach; messages queued or dropped | PRIORITY/FAILOVER strategies distribute load; future: weighted routing support |
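The F8 mitigation (retain the last good state, alert after 3 consecutive failures) could look roughly like the sketch below. It is illustrative only: `PrefixCache`, `loader`, and `alert` are hypothetical names, and the real refresh job and alerting hook may be structured differently.

```python
from typing import Callable, Dict, Optional


class PrefixCache:
    """In-process prefix cache that keeps its last good snapshot when a refresh fails.

    Hypothetical sketch: loader and alert stand in for the real DB query
    and alerting integration, which this document does not describe.
    """

    def __init__(
        self,
        loader: Callable[[], Dict[str, str]],
        alert: Callable[[str], None],
        failure_threshold: int = 3,  # F8: alert after 3 consecutive failures
    ) -> None:
        self._loader = loader
        self._alert = alert
        self._failure_threshold = failure_threshold
        self._snapshot: Dict[str, str] = {}
        self._consecutive_failures = 0

    def refresh(self) -> bool:
        """Try to reload prefixes; on failure, leave the old snapshot untouched."""
        try:
            fresh = self._loader()
        except Exception as exc:
            self._consecutive_failures += 1
            # Fire once when the threshold is reached, not on every later failure.
            if self._consecutive_failures == self._failure_threshold:
                self._alert(f"prefix cache refresh failed {self._consecutive_failures} times: {exc}")
            return False
        # Swap in a copy so later mutation of the loader's dict cannot leak in.
        self._snapshot = dict(fresh)
        self._consecutive_failures = 0
        return True

    def lookup(self, prefix: str) -> Optional[str]:
        """Return the operator for a prefix from the last good snapshot, if any."""
        return self._snapshot.get(prefix)
```

Under this design, stale entries keep being served during an outage, and a successful refresh resets the failure counter, so the alert can fire again on a later outage.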
Degraded Mode Behaviour
When Redis is unavailable, routing-engine falls back to a cache-bypass mode:
- Every SelectOperator call queries PostgreSQL directly.
- No results are written to cache.
- A warn-level log event routing.degraded_mode.redis_unavailable is emitted on every call.
- The /ready endpoint returns 503 (Kubernetes stops routing new traffic to the pod).
Note: Cache-bypass mode works but increases PostgreSQL load significantly. RDS auto-scaling or read replica promotion should be triggered automatically by the database alarm.
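The cache-bypass behaviour described above can be sketched as follows. This is a minimal illustration under assumptions, not the service's actual code: `CacheUnavailable`, `DecisionCache`, `query_db`, and the `route:decision:<destination>` key layout are hypothetical, and the 300 s TTL is taken from scenario F5.

```python
import logging
from typing import Callable, Optional, Protocol

logger = logging.getLogger("routing-engine")


class CacheUnavailable(Exception):
    """Raised by the (hypothetical) cache client when Redis cannot be reached."""


class DecisionCache(Protocol):
    def get(self, key: str) -> Optional[str]: ...
    def set(self, key: str, value: str, ttl_seconds: int) -> None: ...


def select_operator(
    destination: str,
    cache: DecisionCache,
    query_db: Callable[[str], str],
    ttl_seconds: int = 300,  # assumed to match the route decision TTL in F5
) -> str:
    """Return the operator for a destination, bypassing the cache if Redis is down."""
    key = f"route:decision:{destination}"  # assumed key layout
    try:
        cached = cache.get(key)
    except CacheUnavailable:
        # Degraded mode: log on every call, query the DB, and skip the cache write.
        logger.warning("routing.degraded_mode.redis_unavailable")
        return query_db(destination)

    if cached is not None:
        return cached

    operator = query_db(destination)
    try:
        cache.set(key, operator, ttl_seconds)
    except CacheUnavailable:
        # Redis went away between the read and the write; still serve the result.
        logger.warning("routing.degraded_mode.redis_unavailable")
    return operator
```

The readiness change (/ready returning 503) is left out of the sketch; it would live in the health endpoint rather than in the request path.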