Orders Service — Failure Modes

Status: populated
Owner: TBD
Last updated: 2026-04-18
Companion: Service Template · OBSERVABILITY · SERVICE_RISK_REGISTER

1. Overview

The orders-service is a patient-safety-critical component. Failures in order creation or CDS checks can directly impact clinical care. This catalog identifies failure modes, their detection mechanisms, and mitigation strategies.

2. Failure Catalog

F-01: PostgreSQL Unavailable

| Attribute | Detail |
| --- | --- |
| Failure | Database connection pool exhausted or PostgreSQL unreachable |
| User impact | All order creation, activation, and reads fail; clinicians cannot enter orders |
| Severity | Critical |
| Detection | GET /health/ready returns unhealthy; orders_db_pool_exhausted_total counter; alert OrdersApiErrorRate fires |
| Mitigation | Connection pool retry with exponential backoff (max 5 attempts, 30 s). Service enters graceful shutdown after 3 failed health-check cycles. On-call runbook: verify PG primary reachability; failover to standby replica if warranted |
| FHIR impact | All FHIR ServiceRequest / MedicationRequest reads via interop-service fail |
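
The retry policy named in the F-01 mitigation could be sketched roughly as follows. This is a minimal illustration, not the service's actual code: the function names and the 1 s base delay are assumptions, and "30 s" is read here as a cap on any single delay (the catalog does not say whether it is a per-delay cap or a total budget). The `sleep` function is injected so the helper can be exercised without real waiting.

```typescript
// Hypothetical retry helper; names and the base delay are illustrative assumptions.
const MAX_ATTEMPTS = 5;       // "max 5 attempts" from the mitigation
const MAX_DELAY_MS = 30_000;  // "30 s", interpreted as a per-delay cap (assumption)
const BASE_DELAY_MS = 1_000;  // assumption: not stated in the catalog

// Delay before retry number `retry` (1-based): base * 2^(retry - 1), capped.
function backoffDelayMs(retry: number): number {
  return Math.min(BASE_DELAY_MS * 2 ** (retry - 1), MAX_DELAY_MS);
}

// Runs `op` up to MAX_ATTEMPTS times, sleeping between failed attempts.
async function withRetry<T>(
  op: () => Promise<T>,
  sleep: (ms: number) => Promise<void>,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
    try {
      return await op();
    } catch (err) {
      lastError = err;
      // No sleep after the final attempt; the error is rethrown instead.
      if (attempt < MAX_ATTEMPTS) await sleep(backoffDelayMs(attempt));
    }
  }
  throw lastError;
}
```

With these assumed values the four waits between five attempts are 1 s, 2 s, 4 s, and 8 s, so the cap only matters if the base delay or attempt count is raised.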

F-02: NATS JetStream Unavailable

| Attribute | Detail |
| --- | --- |
| Failure | NATS cluster unreachable; outbox relay cannot publish events |
| User impact | Orders can still be created and activated (data writes succeed); downstream services (laboratory-service, pharmacy-service) do not receive routing events |
| Severity | High |
| Detection | orders_outbox_pending_gauge grows; OrdersOutboxDepthHigh alert fires after 5 min |
| Mitigation | Outbox pattern: events buffered in orders.outbox table; relay retries with backoff when NATS recovers. No order data lost. On recovery, outbox drains automatically |
| FHIR impact | None immediate; downstream order routing delayed |
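
As an illustration of why the outbox keeps order data safe during a NATS outage, here is a small in-memory sketch of one relay pass. The row shape, the `relayOnce` name, and the `publish` callback are assumptions made for the example; the real relay would read from the orders.outbox table and call a JetStream client.

```typescript
// In-memory stand-in for rows of the orders.outbox table; field names are assumptions.
interface OutboxRow {
  id: number;
  subject: string;
  payload: string;
  publishedAt: number | null; // null while the event is still pending
}

// Stand-in for a JetStream publish call; expected to throw while NATS is unreachable.
type Publish = (subject: string, payload: string) => void;

// One relay pass: publish pending rows in id order and stop at the first failure,
// so later events stay buffered and ordering is preserved.
function relayOnce(outbox: OutboxRow[], publish: Publish, now: number): number {
  let sent = 0;
  const pending = outbox
    .filter((row) => row.publishedAt === null)
    .sort((a, b) => a.id - b.id);
  for (const row of pending) {
    try {
      publish(row.subject, row.payload);
    } catch {
      break; // NATS still down: leave the row pending for the next pass
    }
    row.publishedAt = now;
    sent++;
  }
  return sent;
}
```

A relay like this gives at-least-once delivery: if a publish succeeds but the row is not marked before a crash, the event is sent again on the next pass, so consumers should tolerate duplicates.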

F-03: CDS Engine Unavailable

| Attribute | Detail |
| --- | --- |
| Failure | CDS engine (terminology-service or dedicated CDS endpoint) unreachable |
| User impact | CDS checks cannot be performed; hard-stop and warning alerts cannot be generated |
| Severity | High (patient-safety risk if medication orders proceed unchecked) |
| Detection | OrdersCdsCheckTimeout alert; orders_cds_check_duration_seconds p95 > threshold; health check cds probe unhealthy |
| Mitigation | CDS degraded mode: order creation records CDS_DEGRADED soft alert; medication orders blocked from immediate activation until CDS recovers unless ADMIN overrides with documented clinical reason. All CDS-degraded activations logged with CDS_DEGRADED_OVERRIDE audit event |
| FHIR impact | None |
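
The degraded-mode rule in the F-03 mitigation can be read as a small decision function. The sketch below is one possible reading, not the service's implementation: the type and function names are hypothetical, it assumes only medication orders are held back, and it assumes the CDS_DEGRADED_OVERRIDE audit event is attached to ADMIN-overridden medication activations.

```typescript
// All type, field, and function names here are illustrative assumptions.
type OrderKind = "medication" | "lab" | "imaging" | "referral";

interface ActivationRequest {
  kind: OrderKind;
  actorRole: string;             // e.g. "ADMIN" or "CLINICIAN"
  overrideReason: string | null; // documented clinical reason, if any
}

interface ActivationDecision {
  allowed: boolean;
  auditEvent: "CDS_DEGRADED_OVERRIDE" | null;
  message: string;
}

function decideActivation(req: ActivationRequest, cdsHealthy: boolean): ActivationDecision {
  if (cdsHealthy) {
    return { allowed: true, auditEvent: null, message: "CDS available; normal checks apply" };
  }
  if (req.kind !== "medication") {
    // Assumption: non-medication orders proceed with only the soft alert.
    return { allowed: true, auditEvent: null, message: "CDS_DEGRADED soft alert recorded" };
  }
  const reason = (req.overrideReason ?? "").trim();
  if (req.actorRole === "ADMIN" && reason.length > 0) {
    // Override path: a non-empty documented reason is mandatory.
    return { allowed: true, auditEvent: "CDS_DEGRADED_OVERRIDE", message: reason };
  }
  return { allowed: false, auditEvent: null, message: "Medication activation blocked until CDS recovers" };
}
```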

F-04: Redis Unavailable (Allergy Cache / Idempotency)

| Attribute | Detail |
| --- | --- |
| Failure | Redis cluster unreachable |
| User impact | Allergy cache misses force DB fallback (slower). Idempotency key lookup falls back to DB. Outbox relay distributed lock unavailable; possible duplicate relay across pods |
| Severity | Medium |
| Detection | Redis health probe in /health/ready; increased DB query latency |
| Mitigation | Fallback to PostgreSQL for allergy data. Idempotency falls back to DB client_mutation_id unique constraint. Outbox relay uses optimistic DB lock as fallback. REDIS_DEGRADED log event emitted |
| FHIR impact | None |
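
The allergy-data fallback described for F-04 might look like the following sketch. The function and parameter names are hypothetical, and the lookups are synchronous only to keep the example short; real Redis and PostgreSQL clients are asynchronous.

```typescript
// Returns cached allergies, `undefined` on a cache miss, and throws if Redis is unreachable.
type CacheLookup = (patientId: string) => string[] | undefined;

// Hypothetical read path: try the cache first, fall back to the database on a
// miss or on any cache error, and emit REDIS_DEGRADED when the cache call fails.
function readAllergies(
  patientId: string,
  fromCache: CacheLookup,
  fromDb: (patientId: string) => string[],
  log: (event: string) => void,
): string[] {
  try {
    const cached = fromCache(patientId);
    if (cached !== undefined) return cached;
  } catch {
    log("REDIS_DEGRADED"); // cache outage must not fail the allergy check
  }
  return fromDb(patientId);
}
```

The point of the shape is that a Redis outage degrades latency only: the caller always gets an answer from the database when the cache cannot provide one.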

F-05: CDS Hard-Stop Incorrectly Applied (False Positive)

| Attribute | Detail |
| --- | --- |
| Failure | CDS rule fires hard-stop on a clinically appropriate order due to stale rule data or bug in CDS rule engine |
| User impact | Clinician cannot activate a needed order; potential treatment delay |
| Severity | High (patient-safety) |
| Detection | Spike in orders_cds_hard_stop_total; clinical complaint; OrdersCdsHardStopSpike alert |
| Mitigation | ADMIN escalation path: ADMIN role can override hard-stop with mandatory reason (dual-sign for controlled substances). Override logged immutably. terminology-service team notified for rule review |
| FHIR impact | None direct |

F-06: Duplicate Order Created (Idempotency Failure)

| Attribute | Detail |
| --- | --- |
| Failure | Network retry causes the same order to be submitted twice; clientMutationId / Idempotency-Key check fails |
| User impact | Patient may receive duplicate medication administration orders or duplicate lab requisitions |
| Severity | High (patient-safety) |
| Detection | Duplicate order detection CDS warning; DUPLICATE_MUTATION error code in logs; orders_created_total spike for patient |
| Mitigation | Unique constraint on client_mutation_id prevents duplicate rows at the DB level. HTTP Idempotency-Key header handled at controller, which returns the cached response for 24 h. CDS duplicate check fires warning for identical orders within 24 h |
| FHIR impact | Downstream FHIR resources may contain duplicate ServiceRequests if detection fires after NATS publish |
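
To make the controller-level Idempotency-Key behaviour concrete, here is a minimal in-memory sketch. The class and method names are hypothetical, and a real deployment would keep these entries in a shared store rather than in process memory. It does not cover two concurrent first requests with the same key; the DB unique constraint on client_mutation_id remains the backstop for that case.

```typescript
const IDEMPOTENCY_TTL_MS = 24 * 60 * 60 * 1000; // 24 h replay window from the mitigation

interface StoredResponse {
  status: number;
  body: string;
  storedAt: number; // epoch milliseconds
}

// Hypothetical store: replays the first response for a key seen within the TTL.
class IdempotencyStore {
  private readonly entries = new Map<string, StoredResponse>();

  handle(
    key: string,
    now: number,
    createOrder: () => { status: number; body: string },
  ): { status: number; body: string; replayed: boolean } {
    const hit = this.entries.get(key);
    if (hit !== undefined && now - hit.storedAt < IDEMPOTENCY_TTL_MS) {
      // Retry of a request already processed: return the cached response.
      return { status: hit.status, body: hit.body, replayed: true };
    }
    const fresh = createOrder();
    this.entries.set(key, { status: fresh.status, body: fresh.body, storedAt: now });
    return { status: fresh.status, body: fresh.body, replayed: false };
  }
}
```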

F-07: Downstream Service (lab / pharmacy / radiology) Not Consuming Events

| Attribute | Detail |
| --- | --- |
| Failure | clinical.orders.order.activated event published to NATS but consuming service is down or consumer group lagging |
| User impact | Lab order not sent to lab worklist; medication order not sent to pharmacy; clinical workflow stalls |
| Severity | High |
| Detection | NATS consumer lag metric; OrdersOutboxDepthHigh for consumer group; downstream service health alerts |
| Mitigation | NATS JetStream at-least-once delivery with configurable ACK timeout and retry. Downstream consumers reconnect and replay pending events on recovery. Monitoring dashboard Orders Overview shows events pending per consumer group |
| FHIR impact | None |

F-08: Optimistic Lock Conflict Storm

| Attribute | Detail |
| --- | --- |
| Failure | High-concurrency scenario where multiple clinicians update the same order simultaneously; many 409 OPTIMISTIC_LOCK_CONFLICT responses |
| User impact | Repeated submission failures frustrate clinicians; orders not activated |
| Severity | Medium |
| Detection | orders_lock_conflict_total counter; increase in 409 error rate on activation endpoint |
| Mitigation | Client-side retry with fresh GET before re-attempting. The 409 response suggests an exponentially increasing retry delay (Retry-After header). UI designed to fetch latest version before each submit |
| FHIR impact | None |
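
The client-side half of the F-08 mitigation (re-read, then re-submit) could be sketched like this. The `Order` shape, the function names, and the limit of three attempts are assumptions for illustration; the calls are synchronous for brevity, and waiting for the Retry-After interval between attempts is left out.

```typescript
// Hypothetical shapes; the real API payloads are not described in this catalog.
interface Order {
  id: string;
  version: number;
  status: string;
}

type SubmitResult = { status: 200; order: Order } | { status: 409 };

// Re-fetches the order before every attempt so the submitted version is never
// older than the latest read, and gives up after `maxAttempts` conflicts.
function activateWithRefetch(
  getOrder: () => Order,
  submit: (expectedVersion: number) => SubmitResult,
  maxAttempts: number = 3,
): Order {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const current = getOrder(); // fresh GET on every attempt
    const result = submit(current.version);
    if (result.status === 200) return result.order;
    // 409 OPTIMISTIC_LOCK_CONFLICT: loop and read the new version
  }
  throw new Error("OPTIMISTIC_LOCK_CONFLICT: gave up after " + maxAttempts + " attempts");
}
```

Bounding the attempts matters in a conflict storm: unbounded immediate retries from many clients would keep the conflict rate high instead of letting it settle.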

F-09: Order Set Partial Instantiation Failure

| Attribute | Detail |
| --- | --- |
| Failure | InstantiateOrderSetCommand creates some orders but fails mid-batch (DB error, CDS timeout) |
| User impact | Partial set of orders created; clinician may not notice missing orders from the set |
| Severity | High |
| Detection | Response payload includes failedTemplates[] array; ORDER_SET_PARTIAL_FAILURE log event |
| Mitigation | Each order is created in an independent transaction. Response always lists which templates succeeded and which failed. Clinician is prompted to review and retry failed items individually. Failed items do not leave orphaned partial orders |
| FHIR impact | FHIR ServiceRequests not created for failed items |
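
The per-template isolation described in the F-09 mitigation can be sketched as a loop that catches each failure separately. Apart from `failedTemplates` and the ORDER_SET_PARTIAL_FAILURE event name, which come from the table above, the names here are hypothetical; `createOrderInOwnTransaction` stands for a call that either commits one order fully or rolls it back and throws.

```typescript
interface FailedTemplate {
  templateId: string;
  error: string;
}

interface OrderSetResult {
  createdOrderIds: string[];
  failedTemplates: FailedTemplate[];
  logEvent: "ORDER_SET_PARTIAL_FAILURE" | null;
}

// Instantiates every template independently so one failure cannot undo or
// block the others, and always reports both outcomes to the caller.
function instantiateOrderSet(
  templateIds: string[],
  createOrderInOwnTransaction: (templateId: string) => string,
): OrderSetResult {
  const createdOrderIds: string[] = [];
  const failedTemplates: FailedTemplate[] = [];
  for (const templateId of templateIds) {
    try {
      createdOrderIds.push(createOrderInOwnTransaction(templateId));
    } catch (err) {
      const error = err instanceof Error ? err.message : String(err);
      failedTemplates.push({ templateId, error });
    }
  }
  const logEvent = failedTemplates.length > 0 ? "ORDER_SET_PARTIAL_FAILURE" : null;
  return { createdOrderIds, failedTemplates, logEvent };
}
```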

F-10: Referral Routing Event Dropped

| Attribute | Detail |
| --- | --- |
| Failure | clinical.orders.referral.created.v1 event published but scheduling-service consumer does not receive it (NATS outage at time of publish, then NATS recovers but consumer was not yet subscribed) |
| User impact | Referral order active in orders-service but no appointment ever proposed by scheduling-service |
| Severity | High |
| Detection | orders_referrals_pending_gauge stays high; OrdersReferralOverdue alert fires after 72 h |
| Mitigation | NATS JetStream durable consumer with replay ensures delivery on reconnect. Operations alert after 72 h prompts manual follow-up. UI shows pending_scheduling referrals with age indicator |
| FHIR impact | ServiceRequest status does not progress to active in interop FHIR surface |
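
The 72 h overdue check and the UI age indicator from F-10 both reduce to simple time arithmetic. The sketch below assumes a hypothetical `Referral` shape with a `createdAt` timestamp in epoch milliseconds; only the pending_scheduling status and the 72 h threshold come from the table above.

```typescript
const REFERRAL_OVERDUE_MS = 72 * 60 * 60 * 1000; // 72 h threshold from the alert

// Hypothetical referral shape for illustration.
interface Referral {
  id: string;
  status: string;
  createdAt: number; // epoch milliseconds
}

// Whole hours a referral has been waiting, for an age indicator.
function ageHours(referral: Referral, now: number): number {
  return Math.floor((now - referral.createdAt) / (60 * 60 * 1000));
}

// Referrals still pending scheduling once the threshold has been reached.
function overdueReferrals(referrals: Referral[], now: number): Referral[] {
  return referrals.filter(
    (r) => r.status === "pending_scheduling" && now - r.createdAt >= REFERRAL_OVERDUE_MS,
  );
}
```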

F-11: Allergy Cache Stale

| Attribute | Detail |
| --- | --- |
| Failure | Redis allergy cache not updated when REGISTRATION.allergy.recorded.v1 event delivery is delayed |
| User impact | CDS allergy check uses stale data; a new allergy may not be caught for the cache TTL period |
| Severity | High (patient-safety) |
| Detection | NATS consumer lag on REGISTRATION.allergy.recorded.v1; cache TTL monitoring |
| Mitigation | Cache TTL set to 15 minutes maximum. On cache miss, always fall back to live query to registration-service. NATS durable consumer ensures event is delivered within SLA even after transient outage |
| FHIR impact | None |
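
The 15-minute bound on staleness in F-11 follows directly from how a TTL cache treats expired entries as misses. Here is an in-memory sketch with an injected clock; the class name, the `loadLive` callback (standing in for the live query to registration-service), and the `invalidate` method are assumptions, and whether the event handler updates or invalidates an entry is not stated in the catalog.

```typescript
const ALLERGY_CACHE_TTL_MS = 15 * 60 * 1000; // 15-minute maximum from the mitigation

interface CacheEntry {
  allergies: string[];
  cachedAt: number; // epoch milliseconds
}

// Hypothetical cache: an expired or missing entry always triggers a live load.
class AllergyCache {
  private readonly entries = new Map<string, CacheEntry>();
  private readonly loadLive: (patientId: string) => string[];

  constructor(loadLive: (patientId: string) => string[]) {
    this.loadLive = loadLive;
  }

  get(patientId: string, now: number): string[] {
    const entry = this.entries.get(patientId);
    if (entry !== undefined && now - entry.cachedAt < ALLERGY_CACHE_TTL_MS) {
      return entry.allergies; // may be up to one TTL behind the source
    }
    const allergies = this.loadLive(patientId);
    this.entries.set(patientId, { allergies, cachedAt: now });
    return allergies;
  }

  // Assumed handler for an allergy-recorded event: drop the entry so the next read goes live.
  invalidate(patientId: string): void {
    this.entries.delete(patientId);
  }
}
```

In this model a delayed event costs at most one TTL of staleness, which is why the TTL ceiling, and not event delivery alone, bounds the patient-safety exposure.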

F-12: Tenant RLS Misconfiguration

| Attribute | Detail |
| --- | --- |
| Failure | PostgreSQL session variable app.tenant_id not set correctly, causing RLS bypass or wrong-tenant data return |
| User impact | Cross-tenant data leak; critical compliance failure |
| Severity | Critical |
| Detection | tenant-isolation.integration.spec.ts gate in CI prevents merge. Runtime: unexpected data volume anomalies; audit log cross-tenant access pattern alert |
| Mitigation | NestJS middleware sets app.tenant_id from validated JWT tenantId claim on every request before any DB query. Integration test mandatory and blocking in CI. Any CI failure on this test halts deployment |
| FHIR impact | FHIR resources could cross tenant boundaries |
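
One way the middleware step in F-12 could fail closed is to build the tenant-context statement only from a well-formed claim and refuse the request otherwise. In this sketch the function name and the assumption that tenant IDs are UUIDs are hypothetical; `set_config(name, value, is_local)` is a built-in PostgreSQL function, and passing `true` as the third argument scopes the setting to the current transaction.

```typescript
// Assumption for illustration: tenant IDs are UUIDs. The catalog does not state the format.
const TENANT_ID_RE = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

interface ParameterizedQuery {
  text: string;
  values: string[];
}

// Builds the statement that sets app.tenant_id for the current transaction.
// Throws instead of returning a query when the claim is absent or malformed,
// so no DB work can run without a tenant context.
function tenantContextQuery(claims: { tenantId?: unknown }): ParameterizedQuery {
  const tenantId = claims.tenantId;
  if (typeof tenantId !== "string" || !TENANT_ID_RE.test(tenantId)) {
    throw new Error("Missing or malformed tenantId claim");
  }
  return {
    // The value is passed as a bind parameter, never interpolated into SQL.
    text: "SELECT set_config('app.tenant_id', $1, true)",
    values: [tenantId],
  };
}
```

Because the setting is transaction-local in this form, it has to be executed inside the same transaction as the queries it protects; otherwise a pooled connection could carry no tenant context, or one left over from a previous request.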

3. Severity Summary

| ID | Failure | Severity | Patient safety |
| --- | --- | --- | --- |
| F-01 | PostgreSQL unavailable | Critical | Yes: no orders possible |
| F-02 | NATS unavailable | High | Partial: orders entered but not routed |
| F-03 | CDS engine unavailable | High | Yes: allergy/interaction checks skipped |
| F-04 | Redis unavailable | Medium | No |
| F-05 | False positive hard-stop | High | Yes: treatment delay |
| F-06 | Duplicate order | High | Yes: duplicate medication/lab |
| F-07 | Downstream not consuming | High | Partial: workflow stall |
| F-08 | Optimistic lock storm | Medium | No |
| F-09 | Order set partial failure | High | Yes: incomplete order entry |
| F-10 | Referral event dropped | High | Partial: referral not scheduled |
| F-11 | Allergy cache stale | High | Yes: missed allergy alert |
| F-12 | RLS misconfiguration | Critical | Compliance |