
Config Service — Failure Modes

Status: populated
Owner: TBD
Last updated: 2026-04-18
Companion: Service Template · 03 platform-services


1. Failure Catalog

| # | Failure | User / Platform Impact | Detection | Mitigation |
|---|---------|------------------------|-----------|------------|
| FM-01 | Redis unavailable | Resolution falls back to full DB pipeline; latency increases 3–5x; SLO at risk | Redis connection error metric; alert if > 30 s | Run full pipeline without cache; alert on-call; scale DB read replicas; restore Redis from snapshot |
| FM-02 | PostgreSQL unavailable | All resolution requests fail with 503; no config mutations possible | DB connection error metric; readiness probe fails | K8s removes pod from LB; failover to read replica; page on-call immediately |
| FM-03 | facility-service unavailable | Resolution Step 1 fails; all resolves return DENY (DEPENDENCY_UNAVAILABLE) | HTTP 5xx rate on upstream call; alert | Fail closed (deny); circuit breaker trips after 5 consecutive failures; serve from cache if unexpired |
| FM-04 | access-policy unavailable | Resolution Step 6 fails; all resolves return DENY (DEPENDENCY_UNAVAILABLE) | HTTP 5xx on AP call | Fail closed; circuit breaker; cached results (TTL 60 s) served for non-expired keys |
| FM-05 | platform-admin-service unavailable | Steps 2 + 3 fail; resolutions return DENY (DEPENDENCY_UNAVAILABLE) | HTTP 5xx on PAS calls | Fail closed; circuit breaker; short-TTL cache for license/feature data (60 s) |
| FM-06 | NATS unavailable | Mutation events not published; cache not invalidated; audit gap | NATS connection error; outbox relay failure alert | Outbox pattern: events queued in DB; relay retries with backoff; audit gap flagged in DLQ |
| FM-07 | Resolution timeout > 500 ms | Caller receives 504 RESOLUTION_TIMEOUT; UI may block | p95/p99 latency alert; RESOLUTION_TIMEOUT counter | Circuit breaker triggers; investigate slow step (BFS depth, upstream latency) |
| FM-08 | BFS cycle in production data | Role graph BFS terminates early; effective_permissions may be incomplete | Cycle detection log entry; alert on CIRCULAR_ROLE_INHERITANCE in production | BFS visited-set prevents infinite loop; alert for manual review and graph repair |
| FM-09 | Redis cache poisoning (stale deny) | User denied access after legitimate grant; permissions appear wrong | User support tickets; resolution audit sampling shows unexpected deny | NATS eviction event missed → evict full tenant cache pattern; DLQ alert triggers recheck |
| FM-10 | Database RLS misconfiguration | Cross-tenant data leakage | Integration test failure; cross-tenant audit alert | Automated tenant-isolation test in CI blocks deployment; RLS policy enforced via Terraform |
| FM-11 | Outbox relay backlog | Event delivery delayed; downstream cache stale; audit lag | Outbox undelivered row count gauge; alert if > 100 rows | Scale relay workers; check NATS connectivity; manual DLQ replay if needed |
| FM-12 | Keycloak JWT validation failure | All authenticated requests fail with 401 | 401 rate spike on all endpoints | Verify Keycloak public key rotation; check IDENTITY_JWT_ISSUER env var |
| FM-13 | User override expires mid-session | User's ExplicitAllow expires; next resolution returns deny | Monitoring on expired override count; UX shows "access changed" | Expiry is by design; notify admin in advance via scheduled job; user re-authentication triggers re-resolution |
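
FM-08 relies on a visited set to keep the role-graph traversal finite even when the data contains a cycle. The following is a minimal sketch of that idea, not the service's actual code: the function name `collect_roles` and the `parents` mapping (role name to inherited roles) are assumptions made for illustration. Edges that point at an already-visited role are skipped and returned so they can be logged for review; a skipped edge may be a genuine cycle or just a role reached by two paths, so a real cycle check would need an extra step.

```python
from collections import deque


def collect_roles(start_role, parents):
    """Breadth-first walk over role inheritance that never revisits a role.

    `parents` is a hypothetical dict: role name -> list of inherited roles.
    Returns (roles in visit order, edges skipped because the target was
    already visited).
    """
    visited = {start_role}
    order = [start_role]
    skipped_edges = []
    queue = deque([start_role])

    while queue:
        role = queue.popleft()
        for parent in parents.get(role, []):
            if parent in visited:
                # Already seen: skipping keeps the traversal finite.
                skipped_edges.append((role, parent))
                continue
            visited.add(parent)
            order.append(parent)
            queue.append(parent)

    return order, skipped_edges


# Illustrative data containing a cycle: admin -> editor -> viewer -> admin
graph = {"admin": ["editor"], "editor": ["viewer"], "viewer": ["admin"]}
roles, skipped = collect_roles("admin", graph)
```

Here the walk stops after three roles, and the skipped edge from `viewer` back to `admin` is the kind of signal that could feed the CIRCULAR_ROLE_INHERITANCE alert.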

2. Circuit Breaker Configuration

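| Upstream | Failure threshold | Open duration | Half-open probes |
|----------|-------------------|---------------|------------------|
| facility-service | 5 consecutive 5xx | 30 s | 1 probe |
| platform-admin-service | 5 consecutive 5xx | 30 s | 1 probe |
| access-policy | 5 consecutive 5xx | 30 s | 1 probe |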

When a circuit is open, the affected resolution step returns the applicable DEPENDENCY_UNAVAILABLE deny immediately, without making the upstream call.
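
The thresholds in this section (5 consecutive failures, 30 s open, 1 half-open probe) can be illustrated with a small state machine. This is a sketch under stated assumptions rather than the service's implementation: the class name, method names, and the injectable `clock` parameter are hypothetical, and the sketch is not thread-safe.

```python
import time


class CircuitBreaker:
    """Consecutive-failure breaker that admits a single half-open probe."""

    def __init__(self, failure_threshold=5, open_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.open_seconds = open_seconds
        self.clock = clock                # injectable so tests can fake time
        self.consecutive_failures = 0
        self.opened_at = None             # None means the circuit is closed
        self.probe_in_flight = False

    def allow_request(self):
        """Return True if the upstream call may be attempted."""
        if self.opened_at is None:
            return True                   # closed: normal traffic
        if self.clock() - self.opened_at < self.open_seconds:
            return False                  # open: caller denies immediately
        if self.probe_in_flight:
            return False                  # half-open: one probe at a time
        self.probe_in_flight = True
        return True

    def record_success(self):
        """A successful call (or probe) closes the circuit."""
        self.consecutive_failures = 0
        self.opened_at = None
        self.probe_in_flight = False

    def record_failure(self):
        """Count a 5xx; open at the threshold, or re-open after a failed probe."""
        self.probe_in_flight = False
        self.consecutive_failures += 1
        if self.opened_at is not None or self.consecutive_failures >= self.failure_threshold:
            self.opened_at = self.clock()
```

A caller would check `allow_request()` before each upstream call and return the DEPENDENCY_UNAVAILABLE deny when it is False, then report the outcome of any call it does make through `record_success()` or `record_failure()`.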


3. Degraded Mode Behaviour

| Mode | Config service behaviour |
|------|--------------------------|
| Redis down only | Full DB pipeline; latency SLO breached; resolution continues |
| Single upstream down (any) | Fail closed on affected step; other calls proceed if not involved |
| All upstreams down | All resolutions return deny; mutations still accepted (writes to DB + outbox) |
| NATS down | Mutations succeed (DB committed); events queued in outbox; cache not invalidated until NATS recovers |
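
The NATS-down behaviour follows from the outbox pattern: the mutation and its event are committed in one database transaction, and a separate relay publishes queued events once the broker is reachable. The sketch below illustrates this with an in-memory SQLite database standing in for PostgreSQL. The table names, column names, and the `publish` callable are assumptions for illustration only, and the retry-with-backoff loop around the relay is left out.

```python
import json
import sqlite3


def init_db(conn):
    """Create the hypothetical config and outbox tables."""
    conn.execute("CREATE TABLE config (key TEXT PRIMARY KEY, value TEXT)")
    conn.execute(
        "CREATE TABLE outbox ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "payload TEXT NOT NULL, "
        "delivered INTEGER NOT NULL DEFAULT 0)"
    )


def apply_mutation(conn, key, value):
    """Write the config change and its event in a single transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute(
            "INSERT OR REPLACE INTO config (key, value) VALUES (?, ?)",
            (key, value),
        )
        conn.execute(
            "INSERT INTO outbox (payload) VALUES (?)",
            (json.dumps({"key": key, "value": value}),),
        )


def relay_once(conn, publish):
    """Publish undelivered events in order; stop at the first failure.

    `publish` is a hypothetical callable that raises ConnectionError while
    the broker is down. Returns the number of events delivered.
    """
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE delivered = 0 ORDER BY id"
    ).fetchall()
    sent = 0
    for row_id, payload in rows:
        try:
            publish(payload)
        except ConnectionError:
            break  # broker still down; remaining rows stay queued
        with conn:
            conn.execute("UPDATE outbox SET delivered = 1 WHERE id = ?", (row_id,))
        sent += 1
    return sent
```

While `publish` keeps failing, mutations still commit and the count of undelivered rows grows, which is the quantity the FM-11 backlog gauge watches.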