routing-engine — Failure Modes

Status: populated | Last updated: 2026-04-18

Failure Scenarios

| # | Failure scenario | Detection method | Impact | Mitigation |
|---|---|---|---|---|
| F1 | PostgreSQL read replica unavailable | /ready probe fails; DB query errors logged | Cache misses cannot be resolved; gRPC returns UNAVAILABLE | Route traffic to pods with warm cache; sms-orchestrator retries with backoff; alert fires |
| F2 | Redis unavailable | Redis PING fails in /ready; GET/SET errors logged | Cache hits unavailable; all requests go to DB (increased latency, DB load spike) | Fallback: serve from DB without caching; alert RedisDown; auto-heal via replica promotion |
| F3 | All operators UNBOUND simultaneously | operators_healthy_total == 0 alert | All SelectOperator calls return gRPC UNAVAILABLE | sms-orchestrator queues messages; platform on-call alerted; smpp-connector reconnects in background |
| F4 | NATS JetStream consumer disconnects | Health event processing halts; logged at warn | Operator health cache becomes stale; TTL 60 s auto-expires; routing may use outdated health | NATS consumer auto-reconnects with exponential backoff; stale entries expire safely |
| F5 | Stale routing decision cache after rule change | Route:decision TTL has not expired | Wrong operator selected for up to 300 s | Operator-management-service must send a cache-bust signal (future enhancement); manual Redis key deletion as emergency procedure |
| F6 | gRPC service OOMKilled | Kubernetes detects pod crash | gRPC requests fail; HPA spins up replacement | Memory limits set conservatively; readiness probe blocks traffic until new pod is ready |
| F7 | mTLS certificate expiry | gRPC handshake fails; cert-manager renewal alert | All SelectOperator calls fail with TLS error | cert-manager renews 30 days before expiry; alert at 14 days remaining |
| F8 | Prefix cache refresh failure | Background job logs error; stale entries served | New prefixes not available until next successful refresh | In-process cache retains last good state; alert fires after 3 consecutive failures |
| F9 | No routing rule for a new destination country | SelectOperator returns NOT_FOUND | Messages to that country fail immediately | sms-orchestrator moves to DLQ; ops team adds routing rule via operator-management-service UI |
| F10 | Hot prefix: single prefix drives 100% traffic to one operator | No specific alert; high TPS on one operator visible in metrics | Operator TPS limit breach; messages queued or dropped | PRIORITY/FAILOVER strategies distribute load; future: weighted routing support |
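
The F8 mitigation (keep the last good state, alert after 3 consecutive failures) can be illustrated with a small sketch. This is not the routing-engine implementation: the `PrefixCache` class, the `loader` callable, and the `alert_active` flag are hypothetical names chosen for illustration, and a real service would presumably raise the alert through its metrics pipeline rather than a boolean.

```python
from typing import Callable, Dict, Optional


class PrefixCache:
    """In-process cache that keeps its last good snapshot when a refresh fails."""

    def __init__(self, loader: Callable[[], Dict[str, str]], alert_threshold: int = 3):
        self._loader = loader                    # hypothetical: returns prefix -> operator
        self._alert_threshold = alert_threshold  # 3 consecutive failures, as in F8
        self._snapshot: Dict[str, str] = {}
        self.consecutive_failures = 0
        self.alert_active = False

    def refresh(self) -> bool:
        """Reload the snapshot; on failure keep the old one and count the failure."""
        try:
            fresh = self._loader()
        except Exception:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self._alert_threshold:
                self.alert_active = True
            return False
        # Success: swap in the new snapshot and clear the failure state.
        self._snapshot = dict(fresh)
        self.consecutive_failures = 0
        self.alert_active = False
        return True

    def lookup(self, prefix: str) -> Optional[str]:
        """Serve from the last good snapshot, which may be stale after failures."""
        return self._snapshot.get(prefix)
```

Because `refresh` replaces the snapshot only after the loader returns, a failed refresh never leaves the cache empty or half-populated; lookups continue against the previous data until the next successful refresh.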

Degraded Mode Behaviour

When Redis is unavailable, routing-engine falls back to a cache-bypass mode:

  1. Every SelectOperator call queries PostgreSQL directly.
  2. No results are written to cache.
  3. A warn-level log event routing.degraded_mode.redis_unavailable is emitted on every call.
  4. The /ready endpoint returns 503 (Kubernetes stops routing new traffic to the pod).
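
The four steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the service's actual code: `Router`, `CacheUnavailable`, the `cache` object (with `get`, `set`, and `ping` methods), and the `db_lookup` callable are all hypothetical stand-ins for the real Redis client and PostgreSQL query layer.

```python
import logging
from typing import Callable

log = logging.getLogger("routing-engine")


class CacheUnavailable(Exception):
    """Hypothetical error raised by the cache client when Redis is unreachable."""


class Router:
    def __init__(self, cache, db_lookup: Callable[[str], str]):
        self._cache = cache          # assumed to expose get(), set() and ping()
        self._db_lookup = db_lookup  # assumed to query PostgreSQL for the operator

    def select_operator(self, destination: str) -> str:
        try:
            cached = self._cache.get(destination)
        except CacheUnavailable:
            # Steps 1-3: query the DB directly, skip the cache write, log at warn.
            log.warning("routing.degraded_mode.redis_unavailable")
            return self._db_lookup(destination)
        if cached is not None:
            return cached
        operator = self._db_lookup(destination)
        try:
            self._cache.set(destination, operator)
        except CacheUnavailable:
            # Redis went away between the read and the write; still serve the result.
            log.warning("routing.degraded_mode.redis_unavailable")
        return operator

    def ready_status(self) -> int:
        # Step 4: the readiness check reports 503 while the Redis PING fails.
        try:
            self._cache.ping()
        except CacheUnavailable:
            return 503
        return 200
```

In this sketch the degraded path never touches the cache after the first failure within a call, so a Redis outage adds at most one failed cache operation per request before the database answers.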

Note: Cache-bypass mode keeps routing functional but significantly increases PostgreSQL load, because every SelectOperator call becomes a direct database query. The database alarm should automatically trigger RDS auto-scaling or read replica promotion.