| FM-01 | Redis unavailable | Resolution falls back to full DB pipeline; latency increases 3–5x; SLO at risk | Redis connection error metric; alert if errors persist > 30 s | Run full pipeline without cache; alert on-call; scale DB read replicas; restore Redis from snapshot |
| FM-02 | PostgreSQL unavailable | All resolution requests fail with 503; no config mutations possible | DB connection error metric; readiness probe fails | K8s removes pod from LB; failover to read replica; page on-call immediately |
| FM-03 | facility-service unavailable | Resolution Step 1 fails; all resolves return DENY (DEPENDENCY_UNAVAILABLE) | HTTP 5xx rate on upstream call; alert | Serve cached result if unexpired, otherwise fail closed (deny); circuit breaker trips after 5 consecutive failures |
| FM-04 | access-policy unavailable | Resolution Step 6 fails; all resolves return DENY (DEPENDENCY_UNAVAILABLE) | HTTP 5xx on AP call | Fail closed; circuit breaker; cached results (TTL 60 s) served for non-expired keys |
| FM-05 | platform-admin-service unavailable | Steps 2 + 3 fail; resolutions return DENY (DEPENDENCY_UNAVAILABLE) | HTTP 5xx on PAS calls | Fail closed; circuit breaker; short-TTL cache for license/feature data (60 s) |
| FM-06 | NATS unavailable | Mutation events not published; cache not invalidated; audit gap | NATS connection error; outbox relay failure alert | Outbox pattern: events queued in DB; relay retries with backoff; audit gap flagged in DLQ |
| FM-07 | Resolution timeout > 500 ms | Caller receives 504 RESOLUTION_TIMEOUT; UI may block | p95/p99 latency alert; RESOLUTION_TIMEOUT counter | Circuit breaker triggers; investigate slow step (BFS depth, upstream latency) |
| FM-08 | BFS cycle in production data | Role graph BFS terminates early; effective_permissions may be incomplete | Cycle detection log entry; alert on CIRCULAR_ROLE_INHERITANCE in production | BFS visited-set prevents infinite loop; alert for manual review and graph repair |
| FM-09 | Redis cache poisoning (stale deny) | User denied access after legitimate grant; permissions appear wrong | User support tickets; resolution audit sampling shows unexpected deny | If a NATS eviction event was missed, evict all cache keys matching the tenant pattern; DLQ alert triggers recheck |
| FM-10 | Database RLS misconfiguration | Cross-tenant data leakage | Integration test failure; cross-tenant audit alert | Automated tenant-isolation test in CI blocks deployment; RLS policy enforced via Terraform |
| FM-11 | Outbox relay backlog | Event delivery delayed; downstream cache stale; audit lag | Outbox undelivered row count gauge; alert if > 100 rows | Scale relay workers; check NATS connectivity; manual DLQ replay if needed |
| FM-12 | Keycloak JWT validation failure | All authenticated requests fail with 401 | 401 rate spike on all endpoints | Verify Keycloak public key rotation; check IDENTITY_JWT_ISSUER env var |
| FM-13 | User override expires mid-session | User's ExplicitAllow expires; next resolution returns deny | Monitoring on expired override count; UX shows "access changed" | Expiry is by design; notify admin in advance via scheduled job; user re-authentication triggers re-resolution |
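
The fail-closed behaviour in FM-03 through FM-05 can be sketched as follows. This is a minimal illustration, not the service's actual implementation: `FailClosedBreaker`, its `call` method, and the shape of the `DENY` result are hypothetical names chosen for the example. The only values taken from the table are the threshold of 5 consecutive failures and the `DEPENDENCY_UNAVAILABLE` reason.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical shape of a fail-closed result; only the reason code comes from the table.
DENY = {"decision": "DENY", "reason": "DEPENDENCY_UNAVAILABLE"}


@dataclass
class FailClosedBreaker:
    """Illustrative breaker that opens after `threshold` consecutive failures."""

    threshold: int = 5   # FM-03: trips after 5 consecutive failures
    failures: int = 0    # current run of consecutive upstream failures

    @property
    def is_open(self) -> bool:
        return self.failures >= self.threshold

    def call(self, upstream: Callable[[], dict],
             cached: Optional[dict] = None) -> dict:
        """Try the upstream unless the breaker is open; never fail open.

        `cached` stands in for an unexpired cache entry (None on a miss).
        """
        if not self.is_open:
            try:
                result = upstream()
                self.failures = 0      # any success resets the failure run
                return result
            except Exception:
                self.failures += 1     # count the failure, then fall through
        # Upstream failed or breaker is open: use cache if present, else deny.
        return cached if cached is not None else dict(DENY)
```

Once open, this sketch skips the upstream call entirely and never closes again; a production breaker would also need a half-open probe or a reset timer, which is omitted here for brevity.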