| F1 | Postgres primary down | All writes fail; reads degrade to reader replicas | /readyz fails; CloudWatch RDS | Fail over reader → writer; circuit breaker; return 5xx with Retry-After |
| F2 | Redis context cache outage | Every downstream context lookup hits DB; latency spikes | Cache miss rate → 100%, p99 > 50 ms | Fall back to DB with rate limiting; shed non-critical reads |
| F3 | NATS JetStream outage | Outbox backs up; downstream services see staleness | facility_outbox_lag_seconds rises | Accept writes; backlog drains on recovery; warn ops |
| F4 | Access-policy outage | All writes fail with 503 | Timeout metrics on access-policy.evaluate | Short-term fail-closed; admin override token for emergencies |
| F5 | Licensing service outage | Writes blocked with MODULE_NOT_ACTIVE | Timeout on licensing check | Fail-closed for writes; reads unaffected |
| F6 | Identity JWKS unavailable | Edge calls fail auth | JWKS refresh errors | Cached JWKS TTL extended to 24h with warning |
| F7 | Cycle introduced by buggy client | Storm of `contains`-cycle attempts | facility_cycle_rejections_total | Reject at handler; alert if > 50/min/tenant |
| F8 | Bed status race (double OCCUPIED) | Clinical hazard | Optimistic lock mismatch | Serializable TX; invariant check; alert clinical ops |
| F9 | Outbox-relay crash loop | Event publish halts | Relay pod restart count | Auto-restart; circuit breaker; page SRE |
| F10 | Wrong tenant context set in app | Cross-tenant leak risk | Integration tenant-isolation spec | RLS catches; fail-fast 500; incident review |
| F11 | Profile update breaks existing nodes | Admin UX regression | Contract tests; validation warnings | Profile updates are never applied retroactively; existing nodes only get warnings |
| F12 | Hierarchy snapshot import corruption | Tenant onboarding blocked | Dry-run errors | Dry-run required; transactional import; rollback on failure |
| F13 | Edge snapshot stale > 24h | Field clinic reads outdated hierarchy | Edge telemetry heartbeat | Forced re-sync; alert tenant admin |
| F14 | Subtree query timeout (>1000 nodes) | UI slow / 504 | hierarchy_read_latency_p95_ms | Enforce maxDepth; paginate subtree API |
| F15 | Recursive CTE plan regression | Cycle-check latency spike | Postgres slow log | Add index hints (via an extension such as pg_hint_plan; core Postgres has no hint syntax); refresh plan; performance runbook |
| F16 | Outbound FHIR projection failure | Interop lag; no core facility impact | Error rate on FHIR projector | Retry with DLQ; manual replay |
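
The F7 mitigation in the table (reject at handler; alert if > 50/min/tenant) could look roughly like the sketch below. It is an illustration only: `would_create_cycle`, `CycleRejectionMonitor`, and the in-memory `parent_of` map are hypothetical names, and a real handler would presumably run the check against the database (for example with a recursive CTE) and export the count through `facility_cycle_rejections_total` rather than keep it in process memory.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Optional

# Threshold taken from the F7 row: alert if > 50 rejections/min/tenant.
ALERT_THRESHOLD_PER_MIN = 50
WINDOW_SECONDS = 60.0


def would_create_cycle(
    parent_of: Dict[str, Optional[str]], node: str, new_parent: str
) -> bool:
    """Return True if making `new_parent` the parent of `node` closes a cycle.

    `parent_of` is a hypothetical in-memory view of the hierarchy
    (child id -> parent id, None for a root).
    """
    seen = set()
    current: Optional[str] = new_parent
    # Walk up from the proposed parent; reaching `node` means a cycle.
    while current is not None and current not in seen:
        if current == node:
            return True
        seen.add(current)  # guards against pre-existing corrupt loops
        current = parent_of.get(current)
    return False


class CycleRejectionMonitor:
    """Sliding one-minute window of cycle rejections per tenant (assumed design)."""

    def __init__(self) -> None:
        self._events: Dict[str, Deque[float]] = defaultdict(deque)

    def record(self, tenant_id: str, now: float) -> bool:
        """Record one rejection; return True if the tenant is over threshold."""
        window = self._events[tenant_id]
        window.append(now)
        # Drop rejections that are a full window old or older.
        while window and now - window[0] >= WINDOW_SECONDS:
            window.popleft()
        return len(window) > ALERT_THRESHOLD_PER_MIN
```

A handler would call `would_create_cycle` before persisting a re-parent request and, on rejection, pass the tenant id and a monotonic timestamp to `record`, raising the alert when it returns `True`.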