routing-engine — Service Risk Register
Status: populated | Last updated: 2026-04-18
Risk Register
| ID | Risk | Likelihood | Impact | Severity | Mitigation | Owner |
|---|---|---|---|---|---|---|
| R1 | Latency breach: routing-engine P95 latency exceeds 50 ms under peak load, causing sms-orchestrator to time out and fail message dispatch | Medium | Critical | Critical | Redis caching absorbs DB load; HPA scales pods; latency alert fires at >40 ms (warning) and >50 ms (critical); circuit breaker in sms-orchestrator | Platform Engineering |
| R2 | All operators simultaneously UNBOUND: a NATS partition or mass operator outage leaves zero healthy operators, so all SelectOperator calls return UNAVAILABLE | Low | Critical | High | Operator-level circuit breakers in smpp-connector; sms-orchestrator queues to DLQ; PagerDuty alert fires within 30 s; operations team runbook defines escalation | Platform Engineering / Operations |
| R3 | Stale routing cache after rule change: a routing rule is updated in ops_routing but cached decisions are not invalidated (TTL up to 300 s) | High | Medium | High | Short-term (planned): operator-management-service posts a cache-bust webhook; interim: manual Redis SCAN+DEL procedure in runbook; the TTL bounds the blast radius to 300 s (5 minutes) | Platform Engineering |
| R4 | PostgreSQL read replica lag: replication lag causes routing-engine to read outdated routing rules from the replica | Low | Medium | Medium | Monitor replica lag metric; alert at >5 s lag; switch over to primary in an emergency; the replica serves reads only, so the risk is limited to rule freshness | Platform Engineering / DBA |
| R5 | mTLS certificate expiry: cert-manager fails to renew the gRPC server or client certificate before expiry, causing all gRPC connections to fail | Low | Critical | High | cert-manager renews automatically 30 days before expiry; alerts fire at 14 days remaining and at 7 days remaining (critical); manual renewal runbook documented | Platform Engineering / Security |
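
The alert thresholds quoted in the Mitigation column for R1, R4, and R5 can be summarised as a small classifier. This is a minimal sketch for illustration only: the function names are hypothetical, the real alerts presumably live in the monitoring stack, and two details are assumptions not stated in the register (the 14-day certificate alert has no named severity, so it is labelled "alert" here, and the day thresholds are treated as inclusive).

```python
from typing import Optional

# Thresholds copied from the Mitigation column of the register.
LATENCY_WARNING_MS = 40.0    # R1: warning above 40 ms
LATENCY_CRITICAL_MS = 50.0   # R1: critical above 50 ms
REPLICA_LAG_ALERT_S = 5.0    # R4: alert above 5 s of lag
CERT_ALERT_DAYS = 14         # R5: alert at 14 days remaining
CERT_CRITICAL_DAYS = 7       # R5: critical at 7 days remaining


def latency_alert(p95_ms: float) -> Optional[str]:
    """R1: classify a routing-engine P95 latency sample (hypothetical helper)."""
    if p95_ms > LATENCY_CRITICAL_MS:
        return "critical"
    if p95_ms > LATENCY_WARNING_MS:
        return "warning"
    return None


def replica_lag_alert(lag_s: float) -> bool:
    """R4: True when replica lag exceeds the 5 s alert threshold."""
    return lag_s > REPLICA_LAG_ALERT_S


def cert_alert(days_remaining: int) -> Optional[str]:
    """R5: classify remaining certificate lifetime.

    Assumption: thresholds are inclusive, and the 14-day level is
    called "alert" because the register does not name its severity.
    """
    if days_remaining <= CERT_CRITICAL_DAYS:
        return "critical"
    if days_remaining <= CERT_ALERT_DAYS:
        return "alert"
    return None
```

Because the R1 thresholds are strict (`>`), a P95 of exactly 50 ms still classifies as a warning rather than critical under this reading.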
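
The interim mitigation for R3 (manual Redis SCAN+DEL) could look roughly like the sketch below. It is not the runbook procedure itself: the key pattern `routing:decision:*`, the function name, and the batch size are hypothetical. The function accepts any client exposing `scan_iter(match=..., count=...)` and `delete(*names)`, which is the shape redis-py's `Redis` client provides; an in-memory stand-in is included so the example runs without a Redis server.

```python
import fnmatch
from typing import Dict, Iterator, List


class InMemoryClient:
    """Stand-in for a Redis client, for demonstration only."""

    def __init__(self, data: Dict[str, str]) -> None:
        self.data = dict(data)

    def scan_iter(self, match: str, count: int) -> Iterator[str]:
        # Snapshot matching keys so deletes during iteration are safe.
        return iter([k for k in list(self.data) if fnmatch.fnmatchcase(k, match)])

    def delete(self, *names: str) -> int:
        removed = 0
        for name in names:
            if name in self.data:
                del self.data[name]
                removed += 1
        return removed


def bust_routing_cache(client, pattern: str, batch_size: int = 500) -> int:
    """Delete every key matching `pattern`; return how many were removed.

    SCAN walks the keyspace incrementally instead of blocking the server
    the way KEYS would; deletes are sent in batches to limit round trips.
    """
    deleted = 0
    batch: List[str] = []
    for key in client.scan_iter(match=pattern, count=batch_size):
        batch.append(key)
        if len(batch) >= batch_size:
            deleted += client.delete(*batch)
            batch.clear()
    if batch:  # flush the final partial batch
        deleted += client.delete(*batch)
    return deleted


if __name__ == "__main__":
    client = InMemoryClient({
        "routing:decision:a": "op1",   # hypothetical cached decisions
        "routing:decision:b": "op2",
        "session:x": "keep",           # unrelated key, must survive
    })
    print(bust_routing_cache(client, "routing:decision:*"))
```

Against a real deployment the same function would be called with a redis-py client in place of `InMemoryClient`, and the pattern would need to match whatever key scheme routing-engine actually uses.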