Patient Chart Service — Failure Modes

Status: populated | Owner: TBD | Last updated: 2026-04-18 | Companion: Service Template · OBSERVABILITY

1. Failure catalog

| ID | Failure | User impact | Detection | Mitigation |
|---|---|---|---|---|
| FM-CHART-01 | PostgreSQL primary unavailable | Chart reads and writes fail; clinicians cannot access or record data | Health probe fails; patient_chart_db_connection_errors_total metric fires | Retry up to 3x (transient); pod fails readiness probe; Kubernetes routes traffic to healthy replicas; DR failover triggered by SRE if sustained > 5 min |
| FM-CHART-02 | NATS JetStream unavailable | Domain events not published; outbox rows accumulate; no downstream fan-out | patient_chart_outbox_lag metric > threshold; NATS health check fails | Outbox relay continues to retry; events delivered once NATS recovers; at-least-once guarantee maintained; no data loss |
| FM-CHART-03 | terminology-service unavailable | Coded value rendering fails; problem/allergy add with auto-code fails | HTTP circuit breaker opens; 503 returned to caller | Fail-open for free-text entry: allow record with codingPending=true flag; alert SRE; terminology retry on next edit |
| FM-CHART-04 | ai-gateway-service unavailable | AI-assist for clinical notes unavailable | HTTP 503 from ai-gateway; circuit breaker opens | Graceful degradation: note authoring continues manually; AI assist button grayed out; alert fires |
| FM-CHART-05 | registration-service unavailable | Patient exists check fails; encounter context unavailable | HTTP circuit breaker opens | Fail-open for existing chart reads (patient ID already known); new chart open with cached patient data; alert fires |
| FM-CHART-06 | Allergy advisory sync call fails (orders-service / medication-service) | Advisory check may be skipped at order entry | HTTP 503; circuit breaker | Callers (orders, medication) implement fail-open per contract; chart does not block; alert logged |
| FM-CHART-07 | Document-service unavailable | Signed note PDF upload fails; note stays in signed state without DocumentReference | HTTP circuit breaker | Note signing completes; PDF upload retried via outbox; DocumentReference written once doc-service recovers |
| FM-CHART-08 | Optimistic lock conflict on ClinicalNote | Concurrent editor receives 409 CHART_INVALID_VERSION | HTTP 409 returned | Client must re-fetch and reapply changes; UX shows "note was updated by another user"; non-destructive |
| FM-CHART-09 | Break-glass invocation without reason | Chart access blocked | CHART_BREAKGLASS_REASON_MISSING 422 | Enforced in domain layer; no workaround; SRE cannot override |
| FM-CHART-10 | NATS duplicate event delivery (at-least-once) | Duplicate audit entries or double allergy events | source_event_id UNIQUE index in consumers | Inbox deduplication on eventId; idempotent handlers; harmless duplicate ACK |
| FM-CHART-11 | Note signed without cosign when policy requires | Signing blocked; CHART_NOTE_COSIGN_REQUIRED returned | 422 from SignNote use case | Domain invariant enforced; note moves to pending_cosign; cosign request emitted |
| FM-CHART-12 | Vitals range validation hard-stop fires | Recording blocked | CHART_VITALS_RANGE_REJECTED 422 | Configurable warn vs reject policy per facility; warn path records with flag |
| FM-CHART-13 | Memory pressure / pod OOM | Service restarts; in-flight requests lost | Kubernetes OOM kill event; alert | Stateless pod restarts quickly; in-flight HTTP returns 503 to caller (client retries); outbox ensures event safety |
| FM-CHART-14 | Data migration error during five-module consolidation | Orphaned records in legacy schema | Migration job exit code ≠ 0; CI gate | Migration is idempotent; run in dry-run mode first; rollback script restores legacy schema from snapshot |
| FM-CHART-15 | RLS misconfiguration: tenant data leak | Cross-tenant chart data visible | RLS integration test fails in CI | Mandatory tenant-isolation.spec.ts must pass before deploy; RLS policy version-controlled |
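
The inbox deduplication named for FM-CHART-10 can be illustrated with a minimal sketch. It is not the service's actual consumer code: `InMemoryInbox`, `makeIdempotent`, the `DomainEvent` shape, and the `allergy.recorded` event type are hypothetical names, and an in-memory `Set` stands in for the `source_event_id` UNIQUE index. Only `eventId` and the deduplicate-then-ACK behavior come from the catalog.

```typescript
// Hypothetical event envelope; only eventId is taken from the catalog.
interface DomainEvent {
  eventId: string;
  type: string;
  payload: unknown;
}

// Stand-in for a table with a UNIQUE index on source_event_id.
class InMemoryInbox {
  private readonly seen = new Set<string>();

  // Returns true if the id was newly recorded, false if it was already present
  // (the equivalent of a unique-constraint violation).
  tryRecord(sourceEventId: string): boolean {
    if (this.seen.has(sourceEventId)) return false;
    this.seen.add(sourceEventId);
    return true;
  }
}

type HandleResult = "processed" | "duplicate";

// Wraps a handler so a redelivered event is acknowledged without re-running side effects.
function makeIdempotent(
  inbox: InMemoryInbox,
  handler: (event: DomainEvent) => void,
): (event: DomainEvent) => HandleResult {
  return (event) => {
    if (!inbox.tryRecord(event.eventId)) {
      return "duplicate"; // safe to ACK; the side effect already happened
    }
    handler(event);
    return "processed";
  };
}

// Usage: the second delivery of evt-1 does not create a second audit entry.
const auditLog: string[] = [];
const handle = makeIdempotent(new InMemoryInbox(), (e) => auditLog.push(e.type));
const first = handle({ eventId: "evt-1", type: "allergy.recorded", payload: {} });
const second = handle({ eventId: "evt-1", type: "allergy.recorded", payload: {} });
```

In a database-backed inbox, the insert into the inbox table and the handler's writes would normally share one transaction, so a handler failure does not leave the event marked as seen.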

2. Dependency failure impact matrix

| Dependency | Failure mode | Chart impact | Fail behavior |
|---|---|---|---|
| PostgreSQL | Unavailable | Total write/read failure | Fail closed; readiness probe down |
| NATS | Unavailable | Event delivery deferred | Outbox accumulates; recover on reconnect |
| terminology-service | Unavailable | Code lookup fails | Fail-open: free-text with codingPending |
| registration-service | Unavailable | Patient/encounter context stale | Fail-open: cached data used |
| ai-gateway-service | Unavailable | AI assist blocked | Graceful: manual note entry only |
| document-service | Unavailable | PDF upload deferred | Outbox retry; note signing still completes |
| medication-service | Unavailable | Summary widget empty | Partial summary displayed; widget shows error |
| laboratory-service | Unavailable | Lab summary widget empty | Same as medication |
| identity-service | Unavailable | All authenticated calls fail | Total failure; no workaround |
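
The fail-open behavior for terminology-service can be sketched as a small circuit breaker wrapped around the code lookup. This is an illustration under stated assumptions, not the service's implementation: `CircuitBreaker`, `addProblem`, `CodeLookup`, the threshold of 2 failures, and the 30 s reset window are all hypothetical, and the lookup is synchronous for brevity where a real HTTP client would be asynchronous. Only the `codingPending` flag comes from the matrix.

```typescript
// Opens after `failureThreshold` consecutive failures; allows a trial request
// once `resetAfterMs` has elapsed since opening (half-open).
class CircuitBreaker {
  private readonly failureThreshold: number;
  private readonly resetAfterMs: number;
  private consecutiveFailures = 0;
  private openedAt: number | null = null;

  constructor(failureThreshold: number, resetAfterMs: number) {
    this.failureThreshold = failureThreshold;
    this.resetAfterMs = resetAfterMs;
  }

  allowRequest(now: number): boolean {
    if (this.openedAt === null) return true;
    return now - this.openedAt >= this.resetAfterMs;
  }

  recordSuccess(): void {
    this.consecutiveFailures = 0;
    this.openedAt = null;
  }

  recordFailure(now: number): void {
    this.consecutiveFailures += 1;
    if (this.consecutiveFailures >= this.failureThreshold) this.openedAt = now;
  }
}

interface ProblemEntry {
  text: string;
  code: string | null;
  codingPending: boolean;
}

// Hypothetical lookup signature; throws when terminology-service is unreachable.
type CodeLookup = (text: string) => string;

function addProblem(
  text: string,
  lookup: CodeLookup,
  breaker: CircuitBreaker,
  now: number,
): ProblemEntry {
  if (breaker.allowRequest(now)) {
    try {
      const code = lookup(text);
      breaker.recordSuccess();
      return { text, code, codingPending: false };
    } catch {
      breaker.recordFailure(now);
    }
  }
  // Fail-open: keep the free-text entry and mark it for coding on a later edit.
  return { text, code: null, codingPending: true };
}

// Usage: two failures open the breaker; the third call skips the lookup entirely.
let lookupCalls = 0;
const failingLookup: CodeLookup = () => {
  lookupCalls += 1;
  throw new Error("terminology-service returned 503");
};
const breaker = new CircuitBreaker(2, 30000);
const firstEntry = addProblem("penicillin allergy", failingLookup, breaker, 0);
const secondEntry = addProblem("penicillin allergy", failingLookup, breaker, 1);
const whileOpen = addProblem("penicillin allergy", failingLookup, breaker, 2);
const recovered = addProblem("penicillin allergy", () => "EXAMPLE-CODE", breaker, 30001);
```

The clock is passed in as `now` so the open and half-open transitions can be exercised without real waiting.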

3. Blast radius — five-module consolidation

Because five legacy modules are consolidated into one service, a single service outage impacts all five clinical areas simultaneously. Mitigations:

| Risk | Mitigation |
|---|---|
| Single point of failure for problem + allergy + vitals + notes + chart | ≥ 3 replicas; pod disruption budget minAvailable=2; Kubernetes multi-AZ scheduling |
| Migration failure corrupts all five datasets | Migration is a separate job; prod data behind snapshot; rollback restores original five schemas |
| Deploy regression across all five areas | Feature-flag gating per area; canary deploy validates all five aggregate types |
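
As a worked check of the replica and disruption-budget figures above: a pod disruption budget with `minAvailable` permits voluntary evictions only while the number of healthy pods exceeds that minimum. The helper below is a hypothetical illustration of that arithmetic, not a Kubernetes API.

```typescript
// Voluntary evictions allowed = healthy pods - minAvailable, never below zero.
function allowedDisruptions(healthyPods: number, minAvailable: number): number {
  return Math.max(0, healthyPods - minAvailable);
}

// With 3 healthy replicas and minAvailable=2, one pod may be drained at a time.
const atFullStrength = allowedDisruptions(3, 2);

// If one replica has already crashed, further voluntary evictions are held back.
const afterOneCrash = allowedDisruptions(2, 2);
```

The budget only limits voluntary disruptions such as node drains; involuntary failures like an OOM kill (FM-CHART-13) are not prevented by it, which is why the replica count and multi-AZ scheduling are listed alongside it.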

4. SLO impact

| SLO | Target | Breach triggers |
|---|---|---|
| Chart read availability | 99.9 % | P95 > 500 ms or error rate > 0.1 % |
| Note sign latency (P95) | < 1 s | P95 > 1.5 s |
| Vitals record latency (P95) | < 500 ms | P95 > 800 ms |
| Allergy advisory latency (P95) | < 300 ms | P95 > 600 ms |
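
The breach triggers above can be evaluated over a window of latency samples, as in the sketch below. The function names and the nearest-rank percentile method are assumptions made for illustration; the actual detection rules are defined in the observability document, and only the numeric thresholds come from the table.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of samples are <= it.
function percentile(samplesMs: number[], p: number): number {
  if (samplesMs.length === 0) throw new Error("no samples");
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  const value = sorted[rank - 1];
  if (value === undefined) throw new Error("percentile out of range");
  return value;
}

// Latency SLOs breach when the window's P95 exceeds the breach threshold.
function latencyBreached(samplesMs: number[], breachMs: number): boolean {
  return percentile(samplesMs, 95) > breachMs;
}

// Chart read availability breaches on P95 > 500 ms or error rate > 0.1 %.
function chartReadBreached(samplesMs: number[], failed: number, total: number): boolean {
  const errorRate = total === 0 ? 0 : failed / total;
  return percentile(samplesMs, 95) > 500 || errorRate > 0.001;
}

// Usage: 20 note-sign samples, checked against the 1.5 s breach threshold.
const oneSlowRequest = [...new Array<number>(19).fill(100), 2000];
const twoSlowRequests = [...new Array<number>(18).fill(100), 2000, 2000];
const noteSignOk = latencyBreached(oneSlowRequest, 1500);
const noteSignBreached = latencyBreached(twoSlowRequests, 1500);
```

With 20 samples the nearest-rank P95 is the 19th smallest value, so a single slow request does not trip the trigger but two do.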

Breach detection is defined in OBSERVABILITY.md; runbook links are in §5 of that document.