
routing-engine — Service Risk Register

Status: populated | Last updated: 2026-04-18

Risk Register

ID	Risk	Likelihood	Impact	Severity	Mitigation	Owner
R1	Latency breach — routing-engine P95 exceeds 50 ms under peak load, causing sms-orchestrator to time out and fail message dispatch	Medium	Critical	Critical	Redis caching absorbs DB load; HPA scales pods; latency alert fires at >40 ms (warning) and >50 ms (critical); circuit breaker in sms-orchestrator	Platform Engineering
R2	All operators simultaneously UNBOUND — a NATS partition or mass operator outage results in zero healthy operators; all SelectOperator calls return UNAVAILABLE	Low	Critical	High	Operator-level circuit breakers in smpp-connector; sms-orchestrator queues to DLQ; PagerDuty alert fires within 30 s; operations team runbook defines escalation	Platform Engineering / Operations
R3	Stale routing cache after rule change — a routing rule is updated in ops_routing but cached decisions are not invalidated (TTL up to 300 s)	High	Medium	High	Short-term: operator-management-service posts a cache-bust webhook (planned); interim: manual Redis SCAN+DEL procedure in runbook; TTL bounds the blast radius to 5 minutes	Platform Engineering
R4	PostgreSQL read replica lag — replication lag causes routing-engine to read outdated routing rules from the replica	Low	Medium	Medium	Monitor replica lag metric; alert at >5 s lag; switch over to primary in an emergency; the replica serves reads only, so the risk is limited to rule freshness	Platform Engineering / DBA
R5	mTLS certificate expiry — cert-manager fails to renew the gRPC server or client certificate before expiry, causing all gRPC connections to fail	Low	Critical	High	cert-manager automated renewal at 30 days before expiry; alerts at 14 days remaining and at 7 days remaining (critical); manual renewal runbook documented	Platform Engineering / Security
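
The interim mitigation for R3 (manual Redis SCAN+DEL) could be scripted roughly as follows. This is a minimal sketch, not the runbook procedure itself. The key pattern routing:decision:*, the function name bust_routing_cache, and the batch size are assumptions for illustration; the real key layout is not stated in this register. The client is assumed to be any object with redis-py style scan_iter and delete methods.

```python
def bust_routing_cache(client, pattern="routing:decision:*", batch_size=500):
    """Delete every key matching `pattern` and return how many were removed.

    Uses incremental SCAN (via scan_iter) instead of KEYS so the Redis
    server is not blocked while a large keyspace is walked.
    NOTE: the default pattern is a hypothetical key layout.
    """
    deleted = 0
    batch = []
    for key in client.scan_iter(match=pattern, count=batch_size):
        batch.append(key)
        # Flush in batches to keep each DEL command small.
        if len(batch) >= batch_size:
            deleted += client.delete(*batch)
            batch.clear()
    # Flush whatever is left after the scan completes.
    if batch:
        deleted += client.delete(*batch)
    return deleted
```

With redis-py installed, the function would be called with a redis.Redis instance pointed at the routing cache. Until the planned cache-bust webhook exists, a run of this kind after each rule change would shorten the stale window below the 300 s TTL.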
