routing-engine — Service Risk Register

Status: populated | Last updated: 2026-04-18

Risk Register

| ID | Risk | Likelihood | Impact | Severity | Mitigation | Owner |
| --- | --- | --- | --- | --- | --- | --- |
| R1 | Latency breach: routing-engine P95 exceeds 50 ms under peak load, causing sms-orchestrator to time out and fail message dispatch | Medium | Critical | Critical | Redis caching absorbs DB load; HPA scales pods; latency alert fires at >40 ms (warning) and >50 ms (critical); circuit breaker in sms-orchestrator | Platform Engineering |
| R2 | All operators simultaneously UNBOUND: a NATS partition or mass operator outage leaves zero healthy operators, and all SelectOperator calls return UNAVAILABLE | Low | Critical | High | Operator-level circuit breakers in smpp-connector; sms-orchestrator queues to DLQ; PagerDuty alert fires within 30 s; operations team runbook defines escalation | Platform Engineering / Operations |
| R3 | Stale routing cache after rule change: a routing rule is updated in ops_routing but cached decisions are not invalidated (TTL up to 300 s) | High | Medium | High | Short-term: operator-management-service posts a cache-bust webhook (planned); interim: manual Redis SCAN+DEL procedure in runbook; TTL bounds the blast radius to 5 minutes | Platform Engineering |
| R4 | PostgreSQL read replica lag: replication lag causes routing-engine to read outdated routing rules from the replica | Low | Medium | Medium | Monitor replica lag metric; alert at >5 s lag; switch over to primary in an emergency; the replica serves reads only, so the risk is limited to rule freshness | Platform Engineering / DBA |
| R5 | mTLS certificate expiry: cert-manager fails to renew the gRPC server or client certificate before expiry, causing all gRPC connections to fail | Low | Critical | High | cert-manager automated renewal at 30 days before expiry; alert at 14 days remaining; critical alert at 7 days remaining; manual renewal runbook documented | Platform Engineering / Security |
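
The interim mitigation for R3 is a manual Redis SCAN+DEL procedure. The runbook itself is not reproduced here, so the following is only a minimal sketch of what such a procedure might look like. The key pattern `routing:decision:*`, the function name `bust_routing_cache`, and the batch size are assumptions for illustration, not values taken from the routing-engine codebase. The function accepts any client object exposing `scan_iter` and `delete` (redis-py's `Redis` client provides both), and it uses SCAN rather than KEYS so the server is not blocked while the keyspace is walked.

```python
def bust_routing_cache(client, pattern="routing:decision:*", batch_size=500):
    """Delete all keys matching `pattern` and return how many were removed.

    `client` is any object with redis-py style `scan_iter(match=..., count=...)`
    and `delete(*names)` methods. The default pattern is a hypothetical key
    layout; substitute the prefix the cache actually uses.
    """
    deleted = 0
    batch = []
    # SCAN walks the keyspace incrementally instead of blocking like KEYS.
    for key in client.scan_iter(match=pattern, count=batch_size):
        batch.append(key)
        if len(batch) >= batch_size:
            # DEL returns the number of keys that were actually removed.
            deleted += client.delete(*batch)
            batch.clear()
    if batch:
        # Flush the final partial batch.
        deleted += client.delete(*batch)
    return deleted
```

With redis-py, this could be called as `bust_routing_cache(redis.Redis(host=...))` against the cache instance. Until the procedure has run, entries written before the rule change can still be served for up to the 300 s TTL.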