Medication Service — Failure Modes
Status: populated | Owner: TBD | Last updated: 2026-04-17 | Companion: Service Template
1. Catalog
| ID | Failure | User impact | Detection | Mitigation |
|---|---|---|---|---|
| MED-F-001 | Drug KB unavailable | Signing blocked (safety-critical) | medication_kb_check_total{outcome=error} > 2% of checks | Return 503; fall back to cached formulary for non-safety-critical checks; queue drafts; auto-resume on KB heartbeat |
| MED-F-002 | Postgres write latency spike | Sign/dispense slow or 5xx | p95 write latency > 3s | Tune connection pool; shed load via Kong rate-limit; fail over to standby |
| MED-F-003 | Inventory decrement conflict (two dispenses same lot) | One fails with 409 | Metric medication_dispense_total{outcome=conflict} | Version-lock retry once; if still conflict, user picks different lot |
| MED-F-004 | Negative stock observed | Data integrity issue | Monitor stock_items.quantity_on_hand < 0 | P1 alert; review transactions; likely cause is a replay bug, so audit consumer isolation |
| MED-F-005 | Outbox stall | Downstream (billing, gateway) out of sync | medication_outbox_undelivered growing | Restart relay; check NATS connectivity; manual replay via outbox.replay CLI |
| MED-F-006 | Gateway POST MR/MD failure | External pharmacies don't receive Rx/dispense | HTTP 5xx from gateway; outbox retries increasing | Exponential backoff (max 5 retries, 2h ceiling), then DLQ; pharmacist notified to transmit manually |
| MED-F-007 | Gateway inbound event duplicate | Risk of duplicate dispenses | Inbox dedup metric | Inbox table prevents reprocessing; alert if duplicate count exceeds baseline |
| MED-F-008 | NCPDP SCRIPT endpoint down (optional) | External e-Rx fails | Adapter error rate | Messages retained in adapter queue; 3 retries plus prescriber notification |
| MED-F-009 | Counter-sign bottleneck (one licensed pharmacist) | CS dispense delayed | Age of dispense-queue items with isControlled=true | Coverage rota alert; supervisor escalation policy |
| MED-F-010 | Offline portal 4h limit exceeded | Queue becomes stale | Client telemetry offline duration | Banner warning at 3h; reject new dispense entries at 4h; user must reconnect |
| MED-F-011 | Sync conflict on dispense (offline → online) | Dispense rejected if stock unavailable | Client conflict event | Server-wins; user requeues; idempotency prevents duplicate |
| MED-F-012 | RxNorm/terminology lookup failure | Drafting blocked for coded med | Term service 5xx rate | Allow free-text drug name with requiresReview=true flag per BR-MEDS |
| MED-F-013 | Controlled-substance MFA step-up failure | Prescriber cannot sign CS | Identity MFA challenge | Fallback: prescriber re-auths; retry sign; audit every failed step-up |
| MED-F-014 | Mass recall event | Spike in recall-triggered dispense blocks | medication_inventory_recall_total | Batch-process recall lot list; block dispense; notify affected patients via communication-service |
| MED-F-015 | Drug-KB snapshot mismatch during audit | Override record references missing version | CI audit | Retain KB snapshots indefinitely in S3; restore-on-demand |
| MED-F-016 | Gateway MR ingestion lag > 30s | Pharmacy queue stale | Consumer lag | Scale consumer replicas; investigate NATS stream partition |
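
The retry policy in MED-F-006 (exponential backoff, max 5 retries, 2h ceiling, then DLQ) can be sketched as a delay schedule. This is a minimal illustration, not the relay's actual code: the base delay and growth factor are hypothetical values chosen for the example, and the 2h ceiling is assumed to cap each individual delay rather than the total retry window.

```python
from typing import Optional

MAX_RETRIES = 5                # from MED-F-006
DELAY_CEILING_S = 2 * 60 * 60  # 2h ceiling from MED-F-006 (assumed to be a per-delay cap)
BASE_DELAY_S = 5 * 60          # hypothetical base delay, not specified in the catalog
GROWTH_FACTOR = 4              # hypothetical growth factor, not specified in the catalog


def retry_delay_s(attempt: int) -> Optional[int]:
    """Return the wait in seconds before retry `attempt` (1-based).

    Returns None once retries are exhausted, which signals the caller to
    move the message to the DLQ and notify the pharmacist.
    """
    if attempt < 1:
        raise ValueError("attempt is 1-based")
    if attempt > MAX_RETRIES:
        return None
    # Grow the delay geometrically, but never beyond the ceiling.
    return min(BASE_DELAY_S * GROWTH_FACTOR ** (attempt - 1), DELAY_CEILING_S)


schedule = [retry_delay_s(n) for n in range(1, MAX_RETRIES + 1)]
print(schedule)          # [300, 1200, 4800, 7200, 7200]
print(retry_delay_s(6))  # None, so route to the DLQ
```

With these example values the ceiling takes effect from the fourth retry onward. A production relay would typically also add jitter so that retries from many outbox rows do not line up.
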
2. Safety-Critical Fail-Closed Behaviors
- Prescription signing never proceeds unless both the KB check and the allergy check succeed, or an override is recorded. If the checks time out, signing is refused with 503 plus retry guidance.
- Dispense never proceeds against an expired or recalled lot. The system blocks it even when a pharmacist attempts an override: the only emergency override is for insufficient stock, never for expired or recalled lots.
- A Schedule II dispense never proceeds without a counter-sign.
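
The fail-closed rules above can be expressed as two small decision functions. This is a sketch under stated assumptions: all type and function names (`Check`, `SignOutcome`, `decide_sign`, `DispenseRequest`, `may_dispense`) are hypothetical, and it assumes a timeout refuses signing even when an override has been recorded, since the source does not say which takes precedence.

```python
from dataclasses import dataclass
from enum import Enum


class Check(Enum):
    PASS = "pass"
    FAIL = "fail"
    TIMEOUT = "timeout"


class SignOutcome(Enum):
    ALLOW = "allow"
    REFUSE_503_RETRY = "refuse_503_retry"            # checks timed out
    REFUSE_NEEDS_OVERRIDE = "refuse_needs_override"  # check failed, no override


def decide_sign(kb: Check, allergy: Check, override_recorded: bool) -> SignOutcome:
    # Assumption: a timeout is refused regardless of override, because
    # there is no completed check result to override.
    if Check.TIMEOUT in (kb, allergy):
        return SignOutcome.REFUSE_503_RETRY
    if kb is Check.PASS and allergy is Check.PASS:
        return SignOutcome.ALLOW
    # A check failed: only a recorded override lets signing proceed.
    if override_recorded:
        return SignOutcome.ALLOW
    return SignOutcome.REFUSE_NEEDS_OVERRIDE


@dataclass(frozen=True)
class DispenseRequest:
    lot_expired: bool
    lot_recalled: bool
    schedule_ii: bool
    counter_signed: bool
    stock_sufficient: bool
    emergency_override: bool


def may_dispense(req: DispenseRequest) -> bool:
    # Expired or recalled lots are blocked unconditionally.
    if req.lot_expired or req.lot_recalled:
        return False
    # Schedule II requires a counter-sign; no override applies.
    if req.schedule_ii and not req.counter_signed:
        return False
    # Emergency override covers insufficient stock only.
    if not req.stock_sufficient:
        return req.emergency_override
    return True
```

The hard blocks are evaluated before the override is ever consulted, so an emergency override cannot reach an expired lot, a recalled lot, or an uncountersigned Schedule II dispense.
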
3. Non-Safety-Critical Degrade-Open Behaviors
- Reorder alert generation may lag up to 5 minutes with no patient impact.
- Expiry alert batch may run once every 6h.
- AI-advisory features disable silently when ai-gateway-service is unavailable.
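
In contrast to the fail-closed rules, a degrade-open feature swallows the dependency failure and returns an empty result. A minimal sketch for the AI-advisory case follows; `advisories_or_empty` and the `fetch` callable are hypothetical names, and the set of exceptions treated as "unavailable" is an assumption.

```python
from typing import Callable, List


def advisories_or_empty(fetch: Callable[[], List[str]]) -> List[str]:
    """Return advisories from ai-gateway-service, or an empty list if it is unreachable.

    `fetch` is a hypothetical callable wrapping the call to the gateway.
    Only connectivity failures are swallowed; other errors still surface.
    """
    try:
        return fetch()
    except (ConnectionError, TimeoutError):
        # Degrade open: the feature disappears, the workflow continues.
        return []


def gateway_down() -> List[str]:
    raise ConnectionError("ai-gateway-service unavailable")


print(advisories_or_empty(gateway_down))                     # []
print(advisories_or_empty(lambda: ["consider renal dose"]))  # ['consider renal dose']
```

The same wrapper must not be applied to the KB or allergy checks, which are fail-closed.
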