| FM-01 | Postgres primary down | 503 on all writes | readyz failing, k8s restarts, DB monitor page | Fail over to replica promoted by Patroni; retry with exponential backoff; outbox unaffected |
| FM-02 | NATS JetStream unreachable | Events queue in outbox; downstream consumers drift | billing_outbox_pending climbs, billing_outbox_lag_seconds climbs | Outbox relay retries; consumers catch up after recovery; no data loss |
| FM-03 | tenant-service down | Licensing check fails; writes blocked | Timeouts on entitlement check | Use cached entitlement for up to 5 min; beyond that, fail closed for new tenants, open for known tenants |
| FM-04 | terminology-service down | Charge code validation fails | Timeouts | Fail closed on charge capture (400 with reason); allow draft charges; queue for validation on service return |
| FM-05 | fhir-gateway down | FHIR writes retry | Retry metric climbs | Async retry queue; publish internal event regardless of FHIR outcome |
| FM-06 | Payment gateway down | Card/mobile-money payment fails | 503 from adapter | Circuit breaker per tenant adapter; fallback to manual/cash method by cashier |
| FM-07 | Ledger integrity drift | Balance ≠ sum(entries) | Nightly integrity job | P1 page; freeze writes on affected account; run reconciliation script; create reversing adjustment |
| FM-08 | Duplicate charge capture | Double-charging patient | Dedup miss on CloudEvents id | Inbox unique index rejects duplicates; on a miss, run the automated reversal script and open an incident |
| FM-09 | Idempotency TTL misconfiguration | Long retries cause duplicate payments | Monitoring shows same-key duplicates | Enforce 24 h default TTL; alert on duplicate posting for same account |
| FM-10 | Refund approval stuck | Refund never processed | Gap between billing_refunds_requested_total and approved refunds > 50 for > 24 h | Ops review queue; auto-escalate to ops channel; page clinician/supervisor |
| FM-11 | Statement run timeout | Statements not generated | Worker job metric | Retry by partition; fail open per account so one failure does not block the run; report failed count in run detail |
| FM-12 | Clock skew across pods | Ordering / idempotency TTL drift | OTel detects time anomaly | NTP enforced; alert on skew > 500 ms |
| FM-13 | Currency misconfiguration for new tenant | Payments rejected | 400 spike on /payments | Onboarding checklist: currency default set; alert on tenant onboarding without currency |
| FM-14 | Price list lapse | Charge capture fails after effective_to | PRICE_NOT_FOUND rate spikes | Alert 7 days before effective_to expiry per list |
| FM-15 | RLS disabled by migration | Cross-tenant leak | Tenant-isolation integration test fails | Block release; revert migration; rotate JWT signing key |
| FM-16 | PDF renderer OOM on RTL + large statement | Statement fails | Worker restarts | Partition by account batch size; memory limit 512 MiB; fall back to basic PDF |
| FM-17 | Outbox relay behind by > 30 s | Downstream drift | Alert on billing_outbox_lag_seconds > 30 | Scale relay; inspect for poison message and move it to DLQ |
| FM-18 | Poison event in inbox | Consumer retries forever | Consumer lag + logs | Dedicated billing.dlq.* subject after 5 redelivery attempts; quarantine and page ops |
| FM-19 | DB connection pool exhaustion | 503 | p95 latency climbs | Increase pool size; audit open queries; put PgBouncer in front |
| FM-20 | JWT signing key rotation without JWKS refresh | 401 across all routes | Auth failure rate spike | 10-min overlap window during rotation; preload JWKS in readiness probe |
| FM-21 | Cross-border data transfer violation | Compliance breach | Egress monitoring by region | Block egress beyond tenant residency allowlist at service mesh; alert ops + compliance |
| FM-22 | Data-residency drift during DR | PHI in wrong region | DR runbook gating | Region-tagged backups; DR only to region-compatible cluster |
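
The FM-07 detection rule (balance ≠ sum(entries)) is simple enough to sketch. The following is a minimal illustration in Python, not the actual nightly job: the function name `find_drift`, the in-memory inputs, and the convention that entries carry signed amounts (debits positive, credits negative, or the reverse) are all assumptions made for the example. A real job would presumably read from the ledger tables and feed the result into the P1 page and write freeze described in the mitigation column.

```python
from decimal import Decimal
from typing import Dict, Iterable, Tuple


def find_drift(
    balances: Dict[str, Decimal],
    entries: Iterable[Tuple[str, Decimal]],
) -> Dict[str, Decimal]:
    """Return {account_id: stored_balance - sum(entries)} for drifting accounts.

    Assumption: `balances` maps account id to the stored balance and
    `entries` yields (account_id, signed_amount) pairs. Both shapes are
    hypothetical stand-ins for the real ledger schema.
    """
    # Recompute each account's balance from its ledger entries.
    totals: Dict[str, Decimal] = {}
    for account_id, amount in entries:
        totals[account_id] = totals.get(account_id, Decimal("0")) + amount

    # Compare stored and recomputed balances, including accounts that
    # appear on only one side (missing balance row or missing entries).
    drift: Dict[str, Decimal] = {}
    for account_id in balances.keys() | totals.keys():
        stored = balances.get(account_id, Decimal("0"))
        recomputed = totals.get(account_id, Decimal("0"))
        if stored != recomputed:
            drift[account_id] = stored - recomputed
    return drift


if __name__ == "__main__":
    stored_balances = {
        "acct-1": Decimal("100.00"),
        "acct-2": Decimal("50.00"),
    }
    ledger_entries = [
        ("acct-1", Decimal("60.00")),
        ("acct-1", Decimal("40.00")),
        ("acct-2", Decimal("45.00")),
    ]
    for account_id, delta in find_drift(stored_balances, ledger_entries).items():
        print(f"drift on {account_id}: {delta}")
```

`Decimal` is used rather than `float` so that currency amounts compare exactly; any non-empty result corresponds to an account that the FM-07 mitigation would freeze and reconcile.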