Orders Service — Failure Modes
Status: populated
Owner: TBD
Last updated: 2026-04-18
Companion: Service Template · OBSERVABILITY · SERVICE_RISK_REGISTER
1. Overview
The orders-service is a patient-safety-critical component. Failures in order creation or CDS checks can directly impact clinical care. This catalog identifies failure modes, their detection mechanisms, and mitigation strategies.
2. Failure Catalog
F-01: PostgreSQL Unavailable
| Attribute | Detail |
|---|---|
| Failure | Database connection pool exhausted or PostgreSQL unreachable |
| User impact | All order creation, activation, and reads fail — clinicians cannot enter orders |
| Severity | Critical |
| Detection | GET /health/ready returns unhealthy; orders_db_pool_exhausted_total counter; alert OrdersApiErrorRate fires |
| Mitigation | Connection pool retry with exponential backoff (max 5 attempts, 30 s). Service enters graceful shutdown after 3 failed health-check cycles. On-call runbook: verify PG primary reachability; failover to standby replica if warranted |
| FHIR impact | All FHIR ServiceRequest / MedicationRequest reads via interop-service fail |
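
The retry policy in the Mitigation row can be sketched as follows. This is a minimal illustration, not the service's actual code: `connectWithRetry`, `retryDelayMs`, and the 1 s base delay are hypothetical names and values, and the "30 s" from the table is assumed here to be a cap on any single delay.

```typescript
// Hypothetical sketch of connection retry with exponential backoff.
// Assumptions (not from the source): 1 s base delay; "30 s" is read as a per-delay cap.
type Sleep = (ms: number) => Promise<void>;

// Delay before the next attempt: base * 2^(attempt - 1), capped at capMs.
function retryDelayMs(attempt: number, baseMs = 1000, capMs = 30000): number {
  return Math.min(capMs, baseMs * Math.pow(2, attempt - 1));
}

// Calls `connect` up to maxAttempts times, sleeping between failed attempts.
// Rethrows the last error once all attempts are used up.
async function connectWithRetry<T>(
  connect: () => Promise<T>,
  sleep: Sleep,
  maxAttempts = 5,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await connect();
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts) {
        await sleep(retryDelayMs(attempt));
      }
    }
  }
  throw lastError;
}
```

Injecting `sleep` keeps the helper testable without real timers. A caller that exhausts all attempts would then report the readiness probe as unhealthy.
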
F-02: NATS JetStream Unavailable
| Attribute | Detail |
|---|---|
| Failure | NATS cluster unreachable; outbox relay cannot publish events |
| User impact | Orders can still be created and activated (data writes succeed); downstream services (laboratory-service, pharmacy-service) do not receive routing events |
| Severity | High |
| Detection | orders_outbox_pending_gauge grows; OrdersOutboxDepthHigh alert fires after 5 min |
| Mitigation | Outbox pattern: events buffered in orders.outbox table; relay retries with backoff when NATS recovers. No order data lost. On recovery, outbox drains automatically |
| FHIR impact | None immediate; downstream order routing delayed |
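
The outbox behaviour described above can be modelled in memory as a rough sketch. `OutboxRow` and `drainOutbox` are hypothetical names, the `publish` callback stands in for the NATS client, and stopping at the first failure to preserve event order is an assumption rather than documented relay behaviour.

```typescript
// Hypothetical in-memory model of the outbox relay; the real rows live in orders.outbox.
interface OutboxRow {
  id: number;
  subject: string;
  payload: string;
  published: boolean;
}

// Publishes pending rows in id order. Stops at the first failure so that
// later events are not sent ahead of an earlier one; unsent rows stay pending.
function drainOutbox(rows: OutboxRow[], publish: (row: OutboxRow) => boolean): number {
  let sent = 0;
  const pending = rows.filter((r) => !r.published).sort((a, b) => a.id - b.id);
  for (const row of pending) {
    if (!publish(row)) {
      break; // broker unreachable: retry on the next relay tick
    }
    row.published = true;
    sent++;
  }
  return sent;
}
```

Because a row is marked published only after a successful publish, a crash between the two steps can resend an event. Consumers should therefore tolerate duplicates.
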
F-03: CDS Engine Unavailable
| Attribute | Detail |
|---|---|
| Failure | CDS engine (terminology-service or dedicated CDS endpoint) unreachable |
| User impact | CDS checks cannot be performed; hard-stop and warning alerts cannot be generated |
| Severity | High (patient-safety risk if medication orders proceed unchecked) |
| Detection | OrdersCdsCheckTimeout alert; orders_cds_check_duration_seconds p95 > threshold; health check cds probe unhealthy |
| Mitigation | CDS degraded mode: order creation records CDS_DEGRADED soft alert; medication orders blocked from immediate activation until CDS recovers unless ADMIN overrides with documented clinical reason. All CDS-degraded activations logged with CDS_DEGRADED_OVERRIDE audit event |
| FHIR impact | None |
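
The degraded-mode rule for medication orders could be expressed as a small decision function. The sketch below is illustrative only: `decideMedicationActivation` and its types are hypothetical, and it covers just the medication path described in the Mitigation row.

```typescript
// Hypothetical decision function for medication-order activation while CDS is down.
type Role = "CLINICIAN" | "ADMIN";

interface MedicationActivationRequest {
  role: Role;
  overrideReason?: string; // documented clinical reason, required for an override
}

interface ActivationDecision {
  allowed: boolean;
  auditEvent?: "CDS_DEGRADED_OVERRIDE";
}

function decideMedicationActivation(
  req: MedicationActivationRequest,
  cdsAvailable: boolean,
): ActivationDecision {
  if (cdsAvailable) {
    return { allowed: true }; // normal path: CDS checks run before activation
  }
  const reason = (req.overrideReason ?? "").trim();
  if (req.role === "ADMIN" && reason.length > 0) {
    // Override is permitted but must leave an audit trail.
    return { allowed: true, auditEvent: "CDS_DEGRADED_OVERRIDE" };
  }
  return { allowed: false }; // blocked until CDS recovers
}
```

Treating a whitespace-only reason as missing is a design choice in this sketch, so an override cannot be recorded without real text.
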
F-04: Redis Unavailable (Allergy Cache / Idempotency)
| Attribute | Detail |
|---|---|
| Failure | Redis cluster unreachable |
| User impact | Allergy cache misses force DB fallback (slower). Idempotency key lookup falls back to DB. Outbox relay distributed lock unavailable — possible duplicate relay across pods |
| Severity | Medium |
| Detection | Redis health probe in /health/ready; increased DB query latency |
| Mitigation | Fallback to PostgreSQL for allergy data. Idempotency falls back to DB client_mutation_id unique constraint. Outbox relay uses optimistic DB lock as fallback. REDIS_DEGRADED log event emitted |
| FHIR impact | None |
F-05: CDS Hard-Stop Incorrectly Applied (False Positive)
| Attribute | Detail |
|---|---|
| Failure | CDS rule fires hard-stop on a clinically appropriate order due to stale rule data or bug in CDS rule engine |
| User impact | Clinician cannot activate a needed order; potential treatment delay |
| Severity | High (patient-safety) |
| Detection | Spike in orders_cds_hard_stop_total; clinical complaint; OrdersCdsHardStopSpike alert |
| Mitigation | ADMIN escalation path: ADMIN role can override hard-stop with mandatory reason (dual-sign for controlled substances). Override logged immutably. terminology-service team notified for rule review |
| FHIR impact | None direct |
F-06: Duplicate Order Created (Idempotency Failure)
| Attribute | Detail |
|---|---|
| Failure | Network retry causes the same order to be submitted twice; clientMutationId / Idempotency-Key check fails |
| User impact | Patient may receive duplicate medication administration orders or duplicate lab requisitions |
| Severity | High (patient-safety) |
| Detection | Duplicate order detection CDS warning; DUPLICATE_MUTATION error code in logs; orders_created_total spike for patient |
| Mitigation | A unique constraint on client_mutation_id rejects duplicate rows at the DB level. The controller handles the HTTP Idempotency-Key header and returns the cached response for 24 h. The CDS duplicate check fires a warning for identical orders within 24 h |
| FHIR impact | Downstream FHIR resources may contain duplicate ServiceRequests if detection fires after NATS publish |
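
The controller-level Idempotency-Key handling might look roughly like the sketch below. `IdempotencyCache` is a hypothetical in-memory stand-in for whatever store the service uses; only the 24 h retention comes from the table above.

```typescript
// Hypothetical in-memory stand-in for the Idempotency-Key response cache.
interface CachedResponse<T> {
  response: T;
  expiresAt: number; // epoch milliseconds
}

const TWENTY_FOUR_HOURS_MS = 24 * 60 * 60 * 1000;

class IdempotencyCache<T> {
  private readonly entries = new Map<string, CachedResponse<T>>();
  private readonly ttlMs: number;

  constructor(ttlMs: number = TWENTY_FOUR_HOURS_MS) {
    this.ttlMs = ttlMs;
  }

  // Runs `handler` once per key within the TTL; repeats get the cached response.
  execute(key: string, nowMs: number, handler: () => T): T {
    const hit = this.entries.get(key);
    if (hit !== undefined && hit.expiresAt > nowMs) {
      return hit.response;
    }
    const response = handler();
    this.entries.set(key, { response, expiresAt: nowMs + this.ttlMs });
    return response;
  }
}
```

A cache like this does not protect against two requests racing on the same key across pods, which is why the DB unique constraint remains the backstop.
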
F-07: Downstream Service (lab / pharmacy / radiology) Not Consuming Events
| Attribute | Detail |
|---|---|
| Failure | clinical.orders.order.activated event published to NATS but consuming service is down or consumer group lagging |
| User impact | Lab order not sent to lab worklist; medication order not sent to pharmacy — clinical workflow stall |
| Severity | High |
| Detection | NATS consumer lag metric; OrdersOutboxDepthHigh for consumer group; downstream service health alerts |
| Mitigation | NATS JetStream at-least-once delivery with configurable ACK timeout and retry. Downstream consumers reconnect and replay pending events on recovery. The Orders Overview monitoring dashboard shows events pending per consumer group |
| FHIR impact | None |
F-08: Optimistic Lock Conflict Storm
| Attribute | Detail |
|---|---|
| Failure | High-concurrency scenario where multiple clinicians update the same order simultaneously; many 409 OPTIMISTIC_LOCK_CONFLICT responses |
| User impact | Repeated submission failures frustrate clinicians; orders not activated |
| Severity | Medium |
| Detection | orders_lock_conflict_total counter; increase in 409 error rate on activation endpoint |
| Mitigation | Client retries after a fresh GET of the order. Server suggests an exponential-backoff retry delay via the Retry-After header on the 409 response. UI is designed to fetch the latest version before each submit |
| FHIR impact | None |
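
The client-side half of this mitigation can be sketched as a fetch-then-update loop. `activateWithRetry`, `OrderSnapshot`, and the three-attempt limit are hypothetical. A real client would be asynchronous and would wait for the Retry-After interval between attempts, which this synchronous sketch omits.

```typescript
// Hypothetical client-side retry for 409 OPTIMISTIC_LOCK_CONFLICT.
interface OrderSnapshot {
  id: string;
  version: number;
}

type UpdateResult = "OK" | "CONFLICT";

// Re-reads the order before every attempt so the update always carries
// the latest version seen by this client.
function activateWithRetry(
  fetchOrder: () => OrderSnapshot,
  activate: (id: string, expectedVersion: number) => UpdateResult,
  maxAttempts = 3,
): boolean {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const fresh = fetchOrder();
    if (activate(fresh.id, fresh.version) === "OK") {
      return true;
    }
  }
  return false; // surface the conflict to the clinician after maxAttempts
}
```

Bounding the attempts matters during a conflict storm: unbounded retries from many clients would add load to the same contended row.
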
F-09: Order Set Partial Instantiation Failure
| Attribute | Detail |
|---|---|
| Failure | InstantiateOrderSetCommand creates some orders but fails mid-batch (DB error, CDS timeout) |
| User impact | Partial set of orders created; clinician may not notice missing orders from the set |
| Severity | High |
| Detection | Response payload includes failedTemplates[] array; ORDER_SET_PARTIAL_FAILURE log event |
| Mitigation | Each order is created in an independent transaction. Response always lists which templates succeeded and which failed. Clinician is prompted to review and retry failed items individually. Failed items do not leave orphaned partial orders |
| FHIR impact | FHIR ServiceRequests not created for failed items |
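
The per-template isolation can be illustrated with the sketch below. `instantiateOrderSet` and `createOrder` are hypothetical names; `createOrder` is assumed to wrap its own transaction and roll back on error, so a thrown error means no order exists for that template. Only the `failedTemplates[]` field name comes from the table above.

```typescript
// Hypothetical sketch: each template is created independently and failures are collected.
interface FailedTemplate {
  templateId: string;
  reason: string;
}

interface OrderSetResult {
  createdOrderIds: string[];
  failedTemplates: FailedTemplate[];
}

function instantiateOrderSet(
  templateIds: string[],
  createOrder: (templateId: string) => string, // returns the new order id or throws
): OrderSetResult {
  const result: OrderSetResult = { createdOrderIds: [], failedTemplates: [] };
  for (const templateId of templateIds) {
    try {
      result.createdOrderIds.push(createOrder(templateId));
    } catch (err) {
      // One failure must not abort the rest of the set.
      const reason = err instanceof Error ? err.message : String(err);
      result.failedTemplates.push({ templateId, reason });
    }
  }
  return result;
}
```

A caller can treat a non-empty `failedTemplates` as the trigger for the review-and-retry prompt.
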
F-10: Referral Routing Event Dropped
| Attribute | Detail |
|---|---|
| Failure | clinical.orders.referral.created.v1 event is published but the scheduling-service consumer never receives it, for example when NATS is down at publish time and the consumer has not yet subscribed once NATS recovers |
| User impact | Referral order active in orders-service but no appointment ever proposed by scheduling-service |
| Severity | High |
| Detection | orders_referrals_pending_gauge stays high; OrdersReferralOverdue alert fires after 72 h |
| Mitigation | NATS JetStream durable consumer with replay ensures delivery on reconnect. Operations alert after 72 h prompts manual follow-up. UI shows pending_scheduling referrals with age indicator |
| FHIR impact | ServiceRequest status does not progress to active in interop FHIR surface |
F-11: Allergy Cache Stale
| Attribute | Detail |
|---|---|
| Failure | Redis allergy cache not updated when REGISTRATION.allergy.recorded.v1 event delivery is delayed |
| User impact | CDS allergy check uses stale data; a new allergy may not be caught for the cache TTL period |
| Severity | High (patient-safety) |
| Detection | NATS consumer lag on REGISTRATION.allergy.recorded.v1; cache TTL monitoring |
| Mitigation | Cache TTL set to 15 minutes maximum. On cache miss, always fall back to live query to registration-service. NATS durable consumer ensures event is delivered within SLA even after transient outage |
| FHIR impact | None |
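
The bound on staleness follows from the TTL: an entry written at time t is served until t + 15 min at the latest, unless an allergy event arrives first. The read-through sketch below illustrates this. `AllergyCache` and `fetchLive` are hypothetical names, and evicting the entry on an event (rather than rewriting it) is an assumption made for brevity.

```typescript
// Hypothetical read-through allergy cache with a 15-minute TTL.
const ALLERGY_TTL_MS = 15 * 60 * 1000;

interface AllergyEntry {
  allergies: string[];
  expiresAt: number; // epoch milliseconds
}

class AllergyCache {
  private readonly entries = new Map<string, AllergyEntry>();
  private readonly fetchLive: (patientId: string) => string[];

  constructor(fetchLive: (patientId: string) => string[]) {
    this.fetchLive = fetchLive;
  }

  // Missing or expired entries trigger a live query to the source of truth.
  get(patientId: string, nowMs: number): string[] {
    const hit = this.entries.get(patientId);
    if (hit !== undefined && hit.expiresAt > nowMs) {
      return hit.allergies;
    }
    const allergies = this.fetchLive(patientId);
    this.entries.set(patientId, { allergies, expiresAt: nowMs + ALLERGY_TTL_MS });
    return allergies;
  }

  // Called when an allergy event arrives; the next read goes to the live source.
  onAllergyRecorded(patientId: string): void {
    this.entries.delete(patientId);
  }
}
```
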
F-12: Tenant RLS Misconfiguration
| Attribute | Detail |
|---|---|
| Failure | PostgreSQL session variable app.tenant_id not set correctly, causing RLS bypass or wrong-tenant data return |
| User impact | Cross-tenant data leak — critical compliance failure |
| Severity | Critical |
| Detection | tenant-isolation.integration.spec.ts gate in CI prevents merge. Runtime: unexpected data volume anomalies; audit log cross-tenant access pattern alert |
| Mitigation | NestJS middleware sets app.tenant_id from validated JWT tenantId claim on every request before any DB query. Integration test mandatory and blocking in CI. Any CI failure on this test halts deployment |
| FHIR impact | FHIR resources could cross tenant boundaries |
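
One way to make the tenant context fail closed is to build the `app.tenant_id` statement from the validated claim and refuse to proceed when the claim is absent. The sketch below is an assumption-laden illustration: `tenantContextStatement` and `JwtClaims` are hypothetical, and using the transaction-local form of PostgreSQL's `set_config` is a choice made here, since the source only says the variable is set on every request.

```typescript
// Hypothetical helper that turns a validated JWT claim into the tenant-context statement.
interface JwtClaims {
  tenantId?: string;
}

interface SqlStatement {
  text: string;
  values: string[];
}

// Throws when the claim is missing so the request fails closed instead of
// running queries without a tenant context.
function tenantContextStatement(claims: JwtClaims): SqlStatement {
  const tenantId = (claims.tenantId ?? "").trim();
  if (tenantId.length === 0) {
    throw new Error("Missing tenantId claim: refusing to query without tenant context");
  }
  // set_config(name, value, is_local = true) scopes the setting to the current
  // transaction, so it must be executed inside the request's transaction.
  return {
    text: "SELECT set_config('app.tenant_id', $1, true)",
    values: [tenantId],
  };
}
```

Passing the tenant id as a bound parameter avoids string interpolation into SQL. A transaction-local setting also cannot leak to the next request that reuses the same pooled connection.
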
3. Severity Summary
| ID | Failure | Severity | Patient safety |
|---|---|---|---|
| F-01 | PostgreSQL unavailable | Critical | Yes — no orders possible |
| F-02 | NATS unavailable | High | Partial — orders entered but not routed |
| F-03 | CDS engine unavailable | High | Yes — allergy/interaction checks skipped |
| F-04 | Redis unavailable | Medium | No |
| F-05 | False positive hard-stop | High | Yes — treatment delay |
| F-06 | Duplicate order | High | Yes — duplicate medication/lab |
| F-07 | Downstream not consuming | High | Partial — workflow stall |
| F-08 | Optimistic lock storm | Medium | No |
| F-09 | Order set partial failure | High | Yes — incomplete order entry |
| F-10 | Referral event dropped | High | Partial — referral not scheduled |
| F-11 | Allergy cache stale | High | Yes — missed allergy alert |
| F-12 | RLS misconfiguration | Critical | Compliance |