| FM-PLTADM-01 | Redis | Cache unavailable | Flag evaluation falls back to DB; latency spikes to 200+ ms; rate limits loosen | Redis error rate alert; cache hit rate drops | Fail-open (evaluate from DB); alert SRE; Redis HA cluster |
| FM-PLTADM-02 | PostgreSQL | Primary unavailable | Config CRUD fails; flag evaluation falls back to stale cache | Health probe; DB error spike | DB failover; stale 60 s cache extends evaluation availability |
| FM-PLTADM-03 | Health poller | Poller job crashes | Health aggregate becomes stale; incident blind spot | Poller absent from metrics; last_heartbeat stale | Kubernetes CronJob restart; alert on poll gap > 2× interval |
| FM-PLTADM-04 | NATS outbox | Events not published | Downstream services don't receive config/flag change notifications | Outbox age > 60 s alert | Outbox relay retries; manual replay; local cache prevents immediate user impact |
| FM-PLTADM-05 | Flag cache stale | Cache not invalidated after flag archive | Downstream service sees flag as enabled after archival | Cache TTL expires in ≤ 60 s; event-driven invalidation | Event-driven cache invalidation on platform_admin.flag.archived.v1; 60 s TTL as safety net |
| FM-PLTADM-06 | Config allow-list | Application code allow-list out of sync with deployed config | New config keys rejected; operators blocked | ADM_CONFIG_KEY_UNKNOWN 400 errors spike | Allow-list defined in code (not DB); rolling deploy updates list; alert on 400 spike |
| FM-PLTADM-07 | Health source registration | Service fails to register health endpoint at startup | Source missing from aggregate; health appears incomplete | Missing source alert; aggregate = degraded | Services retry registration with backoff; alert after 3 failed attempts |
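
The fallback behaviour in FM-PLTADM-01 and FM-PLTADM-02 and the poll-gap rule in FM-PLTADM-03 can be sketched as follows. This is a minimal illustration, not the service's actual code: `FlagEvaluator`, `CacheUnavailable`, `DatabaseUnavailable`, `poll_gap_exceeded`, and the injected `cache_get`/`db_get` callables are all hypothetical names. It also assumes that evaluation fails once the local copy is older than 60 s, which the table does not specify.

```python
import time
from typing import Callable, Dict, Optional, Tuple

STALE_TTL_SECONDS = 60.0  # safety-net TTL taken from the table


class CacheUnavailable(Exception):
    """Hypothetical error raised when the shared cache (e.g. Redis) is down."""


class DatabaseUnavailable(Exception):
    """Hypothetical error raised when the primary database is down."""


class FlagEvaluator:
    """Evaluates flags with cache -> DB -> stale local copy fallback."""

    def __init__(
        self,
        cache_get: Callable[[str], Optional[bool]],
        db_get: Callable[[str], bool],
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._cache_get = cache_get
        self._db_get = db_get
        self._clock = clock
        # key -> (value, time the value was last read from a live source)
        self._local: Dict[str, Tuple[bool, float]] = {}

    def evaluate(self, key: str) -> bool:
        try:
            value = self._cache_get(key)
        except CacheUnavailable:
            value = None  # FM-PLTADM-01: fail open and evaluate from the DB
        if value is None:
            try:
                value = self._db_get(key)
            except DatabaseUnavailable:
                return self._stale(key)  # FM-PLTADM-02: serve the stale copy
        self._local[key] = (value, self._clock())
        return value

    def _stale(self, key: str) -> bool:
        entry = self._local.get(key)
        if entry is None or self._clock() - entry[1] > STALE_TTL_SECONDS:
            # Assumption: past the TTL there is nothing safe left to serve.
            raise DatabaseUnavailable(key)
        return entry[0]


def poll_gap_exceeded(last_heartbeat: float, now: float, interval: float) -> bool:
    """FM-PLTADM-03: alert when the gap since the last poll exceeds 2x the interval."""
    return now - last_heartbeat > 2 * interval
```

Injecting the clock and the two lookups keeps the sketch testable without Redis or PostgreSQL; a real implementation would presumably wrap the actual clients and emit the alerts listed in the detection column.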