Operator Management Service — Failure Modes
Status: populated · Owner: Platform Engineering + SRE · Last updated: 2026-04-18

| # | Failure | User/Platform Impact | Detection | Mitigation |
|---|---|---|---|---|
| 1 | Vault unavailable during operator CREATE | Create fails with 503; new operator not usable | OpsVaultErrors alert | Compensating rollback of PG row; admin retries when Vault recovers |
| 2 | Vault unavailable during credential READ | smpp-connector cannot rebind; falls back to its in-memory credential cache for up to 30 min | OpsVaultReadErrors alert | smpp-connector caches credentials at bind time; outages shorter than the 30-min cache window are transparent |
| 3 | PG primary down | Admin API writes fail with 503; internal reads fail until they fall back to the replica | OpsPgErrors alert | Reads fall back to replica; writes return 503 |
| 4 | Redis down | Health cache miss; routing-engine falls back to internal REST API | OpsRedisErrors alert | routing-engine fallback path tested; increased latency on route decisions |
| 5 | NATS publish failure on config change | Config change saved to PG but downstream not notified | OpsNatsPublishErrors alert | Retry in-process (3×); if all fail, mark event as pending_publish in PG outbox; reconciler re-publishes |
| 6 | Inbound health event from smpp-connector lost | Health state stale; routing-engine may route to a degraded operator | Health cache TTL expires (60 s) | smpp-connector publishes health heartbeat every 10 s; one miss is tolerated |
| 7 | Kong JWT signing key for the Admin API rotated without sufficient overlap | Spike of 401s for admins | Kong auth metrics | 10-min key overlap on JWT rotation; ops team notified before rotation |
| 8 | Duplicate operator created via race condition | Two concurrent creates with the same (host, port, systemId) could yield duplicate operators | PG unique constraint violation | Serializable transaction + unique index; second request gets 409 |
| 9 | Vault Kubernetes auth token expired on pod restart | Credential reads fail until Vault Agent sidecar renews | Vault Agent logs; OpsVaultAuthErrors | Vault Agent sidecar renews at 50% TTL; pod readiness probe fails if Vault unreachable |
| 10 | Routing rule prefix conflict not detected | Overlapping rules cause non-deterministic routing | RoutingRuleConflictChecker unit test | Conflict checker runs synchronously before INSERT; DB unique index on (prefix, operator_id) as backstop (catches exact duplicates only) |
| 11 | Health log partition not pre-created | INSERTs to ops.operator_health_log fail at partition boundary | PG error logs | pg_partman auto-creates monthly partitions; alert on partition age |
| 12 | mTLS certificate expired on smpp-connector | Internal credentials endpoint rejects smpp-connector | TLS handshake error in logs | cert-manager auto-renews 30 days before expiry; alert at 14 days remaining |
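
The mitigation in row 5 (bounded in-process retry, then an outbox row marked `pending_publish`, then a reconciler) can be sketched as below. This is a minimal illustration under stated assumptions, not the service's actual code: `InMemoryOutbox`, `OutboxRow`, `publish_config_change`, and `reconcile` are hypothetical names, the outbox is an in-memory stand-in for the PG table, and `publish` is any callable that raises on failure. Backoff between attempts is omitted for brevity.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

MAX_PUBLISH_ATTEMPTS = 3  # "Retry in-process (3x)" from row 5


@dataclass
class OutboxRow:
    """Hypothetical stand-in for one row of the PG outbox table."""
    event_id: str
    payload: dict
    status: str = "pending_publish"


@dataclass
class InMemoryOutbox:
    """Hypothetical in-memory substitute for the PG outbox (illustration only)."""
    rows: Dict[str, OutboxRow] = field(default_factory=dict)

    def mark_pending(self, event_id: str, payload: dict) -> None:
        self.rows[event_id] = OutboxRow(event_id, payload)

    def pending(self) -> List[OutboxRow]:
        return [r for r in self.rows.values() if r.status == "pending_publish"]

    def mark_published(self, event_id: str) -> None:
        self.rows[event_id].status = "published"


def publish_config_change(publish: Callable[[dict], None],
                          outbox: InMemoryOutbox,
                          event_id: str,
                          payload: dict) -> bool:
    """Try to publish up to MAX_PUBLISH_ATTEMPTS times; park the event on failure.

    Returns True if the event was published, False if it was left in the outbox.
    """
    for _ in range(MAX_PUBLISH_ATTEMPTS):
        try:
            publish(payload)
            return True
        except Exception:  # a real client would catch its specific error types
            continue
    outbox.mark_pending(event_id, payload)
    return False


def reconcile(publish: Callable[[dict], None], outbox: InMemoryOutbox) -> int:
    """Re-publish every pending outbox row; return how many succeeded."""
    republished = 0
    for row in outbox.pending():
        try:
            publish(row.payload)
        except Exception:
            continue  # leave the row pending for the next reconciler pass
        outbox.mark_published(row.event_id)
        republished += 1
    return republished
```

Because the reconciler may re-publish an event whose earlier attempt actually reached the broker, downstream consumers in a design like this generally need to treat config-change events as idempotent.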