Skip to main content

Facility Service — Failure Modes

Status: populated Owner: TBD Last updated: 2026-04-17 Companion: SERVICE_TEMPLATE §13

1. Catalog

#FailureUser impactDetectionMitigation
F1Postgres primary downAll writes fail; reads degraded to reader replicas/readyz fails; CloudWatch RDSFailover reader → writer; circuit-breaker; 5xx with retry-after
F2Redis context cache outageEvery downstream context lookup hits DB; latency spikesCache miss rate → 100 %, p99 > 50 msFall back to DB with rate-limit; shed non-critical reads
F3NATS JetStream outageOutbox backs up; downstream services see stalenessfacility_outbox_lag_seconds risesAccept writes; backlog drains on recovery; warn ops
F4Access-policy outageAll writes fail 503Timeout metrics on access-policy.evaluateShort-term fail-closed; admin override token for emergency
F5Licensing service outageWrites blocked with MODULE_NOT_ACTIVETimeout on licensing checkFail-closed for writes; reads unaffected
F6Identity JWKS unavailableEdge calls fail authJWKS refresh errorsCached JWKS TTL extended to 24h with warning
F7Cycle introduced by buggy clientcontains cycle attempt stormfacility_cycle_rejections_totalReject at handler; alert if > 50/min/tenant
F8Bed status race (double OCCUPIED)Clinical hazardOptimistic lock mismatchSerializable TX; invariant check; alert clinical ops
F9Outbox-relay crash loopEvent publish haltsRelay pod restart countAuto-restart; circuit breaker; page SRE
F10Wrong tenant context set in appCross-tenant leak riskIntegration tenant-isolation specRLS catches; fail-fast 500; incident review
F11Profile update breaks existing nodesAdmin UX regressionContract test, validation warnProfile updates never retroactive; only warn
F12Hierarchy snapshot import corruptionTenant onboarding blockedDry-run errorsDry-run required; transactional import; rollback on failure
F13Edge snapshot stale > 24hField clinic reads outdated hierarchyEdge telemetry heartbeatForced re-sync; alert tenant admin
F14Subtree query timeout (>1000 nodes)UI slow / 504hierarchy_read_latency_p95_msEnforce maxDepth; paginate subtree API
F15Recursive CTE plan regressionCycle-check latency spikePostgres slow logAdd index hints; refresh plan; performance runbook
F16Outbound FHIR projection failureInterop lag; no core facility impactError rate on FHIR projectorRetry with DLQ; manual replay

2. Blast Radius Summary

Dep outageReadsWritesDownstream impact
Postgres writerdegradedfailAll mutations blocked
Postgres readerfailokReads fail over to writer
RedisdegradedokLatency doubles; Licensing timeouts rise
NATSokokDownstream staleness; outbox backfills on recovery
access-policyokfailWrites blocked
LicensingokfailWrites blocked

3. Chaos drills (M3+)

  • Kill one NATS node → assert no data loss.
  • 30% Redis eviction → verify DB fallback remains within p99 ≤ 50ms.
  • Postgres writer failover → verify circuit breaker + retry.
  • Bed status flood (5k/s) → verify outbox drain and no duplicate NATS delivery.