Operator Management Service — Failure Modes

Status: populated
Owner: Platform Engineering + SRE
Last updated: 2026-04-18

| # | Failure | User/Platform Impact | Detection | Mitigation |
|---|---------|----------------------|-----------|------------|
| 1 | Vault unavailable during operator CREATE | Create fails with 503; new operator not usable | OpsVaultErrors alert | Compensating rollback of PG row; admin retries when Vault recovers |
| 2 | Vault unavailable during credential READ | smpp-connector cannot rebind; uses in-memory cache for up to 30 min | OpsVaultReadErrors alert | smpp-connector caches credentials at bind time; short-lived outage transparent |
| 3 | PG primary down | Admin API writes fail; internal reads fail | OpsPgErrors alert | Reads fall to replica; writes return 503 |
| 4 | Redis down | Health cache miss; routing-engine falls back to internal REST API | OpsRedisErrors alert | routing-engine fallback path tested; increased latency on route decisions |
| 5 | NATS publish failure on config change | Config change saved to PG but downstream not notified | OpsNatsPublishErrors alert | Retry in-process (3×); if all fail, mark event as pending_publish in PG outbox; reconciler re-publishes |
| 6 | Health inbound event from smpp-connector lost | Health state stale; routing-engine may use degraded operator | Health cache TTL expires (60 s) | smpp-connector publishes health heartbeat every 10 s; one miss is tolerated |
| 7 | Admin API Kong JWT rotated with overlap gap | Spike of 401s for admins | Kong auth metrics | 10-min key overlap on JWT rotation; ops team notified before rotation |
| 8 | Duplicate operator created via race condition | Two concurrent creates with same (host, port, systemId) | PG unique constraint violation | Serializable transaction + unique index; second request gets 409 |
| 9 | Vault Kubernetes auth token expired on pod restart | Credential reads fail until Vault Agent sidecar renews | Vault Agent logs; OpsVaultAuthErrors | Vault Agent sidecar renews at 50% TTL; pod readiness probe fails if Vault unreachable |
| 10 | Routing rule prefix conflict not detected | Overlapping rules cause non-deterministic routing | RoutingRuleConflictChecker unit test | Conflict checker runs synchronously before INSERT; DB unique index on (prefix, operator_id) as backstop |
| 11 | Health log partition not pre-created | INSERTs to ops.operator_health_log fail at partition boundary | PG error logs | pg_partman auto-creates monthly partitions; alert on partition age |
| 12 | mTLS certificate expired on smpp-connector | Internal credentials endpoint rejects smpp-connector | TLS handshake error in logs | cert-manager auto-renews 30 days before expiry; alert at 14 days remaining |
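
The mitigation for failure 5 (three in-process retries, then a pending_publish outbox entry that a reconciler re-publishes) might look roughly like the Python sketch below. This is an illustration under stated assumptions, not the service's actual code: the `Outbox` class is an in-memory stand-in for the PG outbox table, and `publish_config_change`, `reconcile`, and the injected `publish` callable are hypothetical names. A real implementation would likely add backoff between attempts.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

MAX_ATTEMPTS = 3  # "Retry in-process (3x)" from the table


@dataclass
class Outbox:
    """In-memory stand-in for the PG outbox table (hypothetical schema)."""
    status: Dict[str, str] = field(default_factory=dict)
    payload: Dict[str, bytes] = field(default_factory=dict)


def publish_config_change(event_id: str, payload: bytes,
                          publish: Callable[[bytes], None],
                          outbox: Outbox) -> bool:
    """Try to publish up to MAX_ATTEMPTS times; park the event on total failure."""
    outbox.payload[event_id] = payload
    for _ in range(MAX_ATTEMPTS):
        try:
            publish(payload)
        except ConnectionError:
            continue  # assumption: publish failures surface as ConnectionError
        outbox.status[event_id] = "published"
        return True
    # All attempts failed: leave the event for the reconciler.
    outbox.status[event_id] = "pending_publish"
    return False


def reconcile(publish: Callable[[bytes], None], outbox: Outbox) -> int:
    """Re-publish every pending event; return how many were delivered."""
    delivered = 0
    for event_id, state in list(outbox.status.items()):
        if state != "pending_publish":
            continue
        try:
            publish(outbox.payload[event_id])
        except ConnectionError:
            continue  # still down; the next reconciler run tries again
        outbox.status[event_id] = "published"
        delivered += 1
    return delivered
```

Because the reconciler may re-send an event whose earlier attempt was actually delivered, downstream consumers in a design like this generally need to handle duplicates.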
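
For failure 2, the table says smpp-connector keeps serving cached credentials for up to 30 minutes while Vault reads fail. A minimal Python sketch of that behaviour follows. The class name `CredentialCache`, the `VaultUnavailable` exception, and the injected `read_from_vault` callable are hypothetical; the sketch also refreshes the cache on every successful read, which is an assumption rather than the documented "at bind time" behaviour.

```python
import time
from typing import Callable, Dict, Tuple

MAX_CACHE_AGE_S = 30 * 60  # "up to 30 min" from the table


class VaultUnavailable(Exception):
    """Hypothetical error raised by the injected Vault reader."""


class CredentialCache:
    def __init__(self, read_from_vault: Callable[[str], str],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._read = read_from_vault
        self._clock = clock
        # operator_id -> (secret, time the secret was cached)
        self._entries: Dict[str, Tuple[str, float]] = {}

    def get(self, operator_id: str) -> str:
        """Prefer a fresh Vault read; fall back to a cached value if young enough."""
        try:
            secret = self._read(operator_id)
        except VaultUnavailable:
            cached = self._entries.get(operator_id)
            if cached is not None and self._clock() - cached[1] <= MAX_CACHE_AGE_S:
                return cached[0]
            raise  # no cached value, or it is older than the allowed window
        self._entries[operator_id] = (secret, self._clock())
        return secret
```

Injecting the clock keeps the 30-minute window testable without sleeping.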
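
Failure 8 relies on a unique index so that the second of two concurrent creates is rejected with 409. The sketch below shows the idea using Python's built-in `sqlite3` as a stand-in for PG, so it does not exercise serializable isolation. The table name, column names, index name, and the 201 success status are assumptions made for illustration; only the (host, port, systemId) key and the 409 response come from the table.

```python
import sqlite3


def open_db() -> sqlite3.Connection:
    """Create an in-memory database with a unique index on the operator endpoint."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE operators ("
        " id INTEGER PRIMARY KEY,"
        " host TEXT NOT NULL,"
        " port INTEGER NOT NULL,"
        " system_id TEXT NOT NULL)"
    )
    db.execute(
        "CREATE UNIQUE INDEX operators_endpoint_uq"
        " ON operators (host, port, system_id)"
    )
    return db


def create_operator(db: sqlite3.Connection, host: str, port: int,
                    system_id: str) -> int:
    """Insert an operator and return the HTTP status the API would send."""
    try:
        with db:  # commits on success, rolls back if the INSERT raises
            db.execute(
                "INSERT INTO operators (host, port, system_id) VALUES (?, ?, ?)",
                (host, port, system_id),
            )
    except sqlite3.IntegrityError:
        return 409  # duplicate (host, port, system_id)
    return 201  # assumed success status; the source only specifies 409
```

The point of the index is that correctness does not depend on an application-level check: whichever request commits second hits the constraint, regardless of timing.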