Skip to main content

Platform Admin Service — Failure Modes

Status: populated Owner: TBD Last updated: 2026-04-18 Companion: OBSERVABILITY · SERVICE_RISK_REGISTER

1. Failure catalog

ID	Component	Failure	User impact	Detection	Mitigation
FM-PLTADM-01	Redis	Cache unavailable	Flag evaluate falls back to DB; latency spike to 200+ ms; rate limits loose	Redis error rate alert; cache hit rate drops	Fail-open (evaluate from DB); alert SRE; Redis HA cluster
FM-PLTADM-02	PostgreSQL	Primary unavailable	Config CRUD fails; evaluate falls back to stale cache	Health probe; DB error spike	DB failover; stale 60s cache extends evaluate availability
FM-PLTADM-03	Health poller	Poller job crashes	Health aggregate becomes stale; incident blind spot	Poller absent from metrics; last_heartbeat stale	Kubernetes CronJob restart; alert on poll gap > 2× interval
FM-PLTADM-04	NATS outbox	Events not published	Downstream services don't receive config/flag change notifications	Outbox age > 60 s alert	Outbox relay retries; manual replay; local cache prevents immediate user impact
FM-PLTADM-05	Flag cache stale	Cache not invalidated after flag archive	Downstream service sees flag as enabled after archival	Cache TTL expires in ≤ 60 s; event-driven invalidation	Event-driven cache invalidation on `platform_admin.flag.archived.v1`; 60 s TTL as safety net
FM-PLTADM-06	Config allow-list	Application code allow-list out of sync with deployed config	New config keys rejected; operators blocked	`ADM_CONFIG_KEY_UNKNOWN` 400 errors spike	Allow-list defined in code (not DB); rolling deploy updates list; alert on 400 spike
FM-PLTADM-07	Health source registration	Service fails to register health endpoint at startup	Source missing from aggregate; health appears incomplete	Missing source alert; aggregate = degraded	Services retry registration with backoff; alert after 3 failed attempts

2. Dependency failure impact

Dependency	Degraded mode	Mitigation
Redis unavailable	Evaluate hits DB (acceptable p95 ~200 ms)	Fail-open; alert
PostgreSQL unavailable	Evaluate uses stale cache (60 s); config writes fail	DB failover; cache extends evaluate
NATS unavailable	Events queued; downstream notification delayed	Outbox relay; alert
Any platform service `/health` unavailable	That service shown as unhealthy in aggregate	Health poller timeout; aggregate reflects reality

3. Runbooks

Runbook	Trigger
`runbooks/platform-admin/redis-failover.md`	FM-PLTADM-01
`runbooks/platform-admin/db-failover.md`	FM-PLTADM-02
`runbooks/platform-admin/health-poller-restart.md`	FM-PLTADM-03
`runbooks/platform-admin/outbox-replay.md`	FM-PLTADM-04

1. Failure catalog
2. Dependency failure impact
3. Runbooks