Platform Admin Service — Failure Modes

Status: populated
Owner: TBD
Last updated: 2026-04-18
Companion: OBSERVABILITY · SERVICE_RISK_REGISTER

1. Failure catalog

| ID | Component | Failure | User impact | Detection | Mitigation |
|---|---|---|---|---|---|
| FM-PLTADM-01 | Redis | Cache unavailable | Flag evaluate falls back to DB; latency spikes to 200+ ms; rate limits loosen | Redis error rate alert; cache hit rate drops | Fail-open (evaluate from DB); alert SRE; Redis HA cluster |
| FM-PLTADM-02 | PostgreSQL | Primary unavailable | Config CRUD fails; evaluate falls back to stale cache | Health probe; DB error spike | DB failover; stale 60 s cache extends evaluate availability |
| FM-PLTADM-03 | Health poller | Poller job crashes | Health aggregate becomes stale; incident blind spot | Poller absent from metrics; last_heartbeat stale | Kubernetes CronJob restart; alert on poll gap > 2× interval |
| FM-PLTADM-04 | NATS outbox | Events not published | Downstream services don't receive config/flag change notifications | Outbox age > 60 s alert | Outbox relay retries; manual replay; local cache prevents immediate user impact |
| FM-PLTADM-05 | Flag cache | Stale entry: cache not invalidated after flag archive | Downstream service sees flag as enabled after archival | Cache TTL expires in ≤ 60 s; event-driven invalidation | Event-driven cache invalidation on platform_admin.flag.archived.v1; 60 s TTL as safety net |
| FM-PLTADM-06 | Config allow-list | Application code allow-list out of sync with deployed config | New config keys rejected; operators blocked | Spike in ADM_CONFIG_KEY_UNKNOWN 400 errors | Allow-list defined in code (not DB); rolling deploy updates list; alert on 400 spike |
| FM-PLTADM-07 | Health source registration | Service fails to register health endpoint at startup | Source missing from aggregate; health appears incomplete | Missing source alert; aggregate = degraded | Services retry registration with backoff; alert after 3 failed attempts |
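
The two staleness detections in the catalog (FM-PLTADM-03 and FM-PLTADM-04) come down to threshold comparisons. The following is a minimal sketch, assuming timestamps are available as seconds on a common clock. The function and constant names are hypothetical and do not come from the service itself.

```python
from typing import Optional

OUTBOX_MAX_AGE_S = 60.0  # "Outbox age > 60 s" from FM-PLTADM-04
POLL_GAP_FACTOR = 2.0    # "poll gap > 2x interval" from FM-PLTADM-03


def poller_gap_exceeded(last_heartbeat_s: float, now_s: float, poll_interval_s: float) -> bool:
    """Return True when the time since the last heartbeat exceeds 2x the poll interval."""
    return (now_s - last_heartbeat_s) > POLL_GAP_FACTOR * poll_interval_s


def outbox_too_old(oldest_unpublished_s: Optional[float], now_s: float) -> bool:
    """Return True when the oldest unpublished outbox row is older than 60 s.

    None is taken to mean the outbox is empty, so nothing is overdue.
    """
    if oldest_unpublished_s is None:
        return False
    return (now_s - oldest_unpublished_s) > OUTBOX_MAX_AGE_S
```

Both comparisons are strict, matching the ">" in the catalog: with a hypothetical 30 s poll interval, a gap of exactly 60 s does not alert, while 61 s does.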

2. Dependency failure impact

| Dependency | Degraded mode | Mitigation |
|---|---|---|
| Redis unavailable | Evaluate hits DB (acceptable p95 ~200 ms) | Fail-open; alert |
| PostgreSQL unavailable | Evaluate uses stale cache (60 s); config writes fail | DB failover; cache extends evaluate |
| NATS unavailable | Events queued; downstream notification delayed | Outbox relay; alert |
| Any platform service /health unavailable | That service shown as unhealthy in aggregate | Health poller timeout; aggregate reflects reality |
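
The Redis and PostgreSQL rows together describe a fallback chain for flag evaluation. One way to picture it is the sketch below, which assumes the stale copy is a process-local cache and that outages surface as `ConnectionError`. The class `FlagEvaluator`, its callbacks, and the error handling are hypothetical illustrations, not the service's actual implementation.

```python
import time
from typing import Callable, Dict, Optional, Tuple

STALE_TTL_S = 60.0  # stale-cache window stated in the table above


class FlagEvaluator:
    """Hypothetical evaluator: shared cache first, then DB, then a local stale copy."""

    def __init__(
        self,
        read_cache: Callable[[str], Optional[bool]],
        read_db: Callable[[str], bool],
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._read_cache = read_cache
        self._read_db = read_db
        self._clock = clock
        self._local: Dict[str, Tuple[bool, float]] = {}  # key -> (value, stored_at)

    def _remember(self, key: str, value: bool) -> None:
        self._local[key] = (value, self._clock())

    def evaluate(self, key: str) -> bool:
        # 1. Shared cache; a connection error is treated like a miss (fail-open).
        try:
            cached = self._read_cache(key)
            if cached is not None:
                self._remember(key, cached)
                return cached
        except ConnectionError:
            pass
        # 2. Database fallback (slower, but authoritative).
        try:
            value = self._read_db(key)
            self._remember(key, value)
            return value
        except ConnectionError:
            pass
        # 3. Both sources down: serve the local copy if it is at most 60 s old.
        entry = self._local.get(key)
        if entry is not None and self._clock() - entry[1] <= STALE_TTL_S:
            return entry[0]
        raise RuntimeError(f"flag {key!r} unavailable: no source and no usable stale copy")
```

In this sketch the local copy is refreshed on every successful read, so the 60 s window starts at the last successful evaluation rather than at the start of the outage.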

3. Runbooks

| Runbook | Trigger |
|---|---|
| runbooks/platform-admin/redis-failover.md | FM-PLTADM-01 |
| runbooks/platform-admin/db-failover.md | FM-PLTADM-02 |
| runbooks/platform-admin/health-poller-restart.md | FM-PLTADM-03 |
| runbooks/platform-admin/outbox-replay.md | FM-PLTADM-04 |