Orders Service — Observability

Status: populated
Owner: TBD
Last updated: 2026-04-18
Companion: Service Template

1. SLIs and SLOs

| SLI | SLO target | Measurement window |
|---|---|---|
| orders.create.availability | ≥ 99.5% of POST /v1/orders requests succeed (2xx) | 30-day rolling |
| orders.create.latency | p95 order creation < 800 ms (includes CDS check) | 30-day rolling |
| orders.activate.latency | p95 order activation < 500 ms | 30-day rolling |
| orders.query.latency | p95 order list query < 600 ms | 30-day rolling |
| cds.check.latency | p95 CDS check < 300 ms | 30-day rolling |
| outbox.delivery.lag | p99 outbox events delivered to NATS within 30 s | 30-day rolling |
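
To make the availability target concrete: a 99.5% target means 0.5% of POST /v1/orders requests in the 30-day window may fail before the SLO is breached. The sketch below shows that error-budget arithmetic. The function names and the request counts in the usage note are illustrative assumptions, not part of the service.

```python
def error_budget(total_requests: int, slo_target: float) -> float:
    """Number of requests that may fail in the window without breaching the SLO."""
    return total_requests * (1.0 - slo_target)


def budget_remaining(total_requests: int, failed_requests: int, slo_target: float) -> float:
    """Fraction of the error budget still unspent; negative once the SLO is breached."""
    budget = error_budget(total_requests, slo_target)
    if budget == 0:
        # No traffic (or a 100% target): any failure exhausts the budget.
        return 0.0 if failed_requests == 0 else float("-inf")
    return 1.0 - failed_requests / budget
```

For a hypothetical window of 1,000,000 requests, the budget is about 5,000 failed requests; 1,250 failures would leave roughly 75% of the budget unspent.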

2. OpenTelemetry Spans

| Span name | Attributes | Purpose |
|---|---|---|
| orders.create | tenantId, patientId, orderType, hasCdsAlerts | Trace order creation |
| orders.activate | tenantId, orderId, orderType, cdsAlertCount | Trace activation with CDS resolution |
| orders.cds.allergy_check | tenantId, patientId, allergenCode | Trace allergy CDS check |
| orders.cds.drug_interaction | tenantId, patientId, medicationCode | Trace DDI check |
| orders.cds.duplicate | tenantId, patientId, orderCode | Trace duplicate check |
| orders.outbox.relay | tenantId, subject, eventId | Trace outbox relay |
| orders.referral.create | tenantId, patientId, referToSpecialty | Trace referral creation |
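
One way to keep instrumentation in line with the span table is a small check that a span carries every listed attribute. The sketch below is a standalone illustration: it does not use the OpenTelemetry SDK, and the helper name `missing_span_attributes` is hypothetical. Only the span and attribute names come from the table.

```python
# Required attributes per span, copied from the span table.
REQUIRED_SPAN_ATTRIBUTES = {
    "orders.create": {"tenantId", "patientId", "orderType", "hasCdsAlerts"},
    "orders.activate": {"tenantId", "orderId", "orderType", "cdsAlertCount"},
    "orders.cds.allergy_check": {"tenantId", "patientId", "allergenCode"},
    "orders.cds.drug_interaction": {"tenantId", "patientId", "medicationCode"},
    "orders.cds.duplicate": {"tenantId", "patientId", "orderCode"},
    "orders.outbox.relay": {"tenantId", "subject", "eventId"},
    "orders.referral.create": {"tenantId", "patientId", "referToSpecialty"},
}


def missing_span_attributes(span_name: str, attributes: dict) -> list:
    """Return the sorted attribute names the table requires but the span lacks."""
    required = REQUIRED_SPAN_ATTRIBUTES.get(span_name)
    if required is None:
        raise KeyError(f"unknown span name: {span_name}")
    return sorted(required - set(attributes))
```

A check like this could run in a unit test against recorded spans, so that a missing tenantId is caught before it reaches production traces.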

3. Metrics (Prometheus)

| Metric | Type | Labels |
|---|---|---|
| orders_created_total | Counter | tenant_id, order_type, priority |
| orders_activated_total | Counter | tenant_id, order_type |
| orders_cancelled_total | Counter | tenant_id, order_type |
| orders_cds_hard_stop_total | Counter | tenant_id, rule_id |
| orders_cds_warning_total | Counter | tenant_id, rule_id |
| orders_cds_check_duration_seconds | Histogram | tenant_id, check_type |
| orders_referrals_pending_gauge | Gauge | tenant_id, facility_id |
| orders_outbox_pending_gauge | Gauge | tenant_id |
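
The p95 figures used elsewhere in this document can be derived from a histogram such as orders_cds_check_duration_seconds. The sketch below estimates a quantile from cumulative buckets by linear interpolation inside the matching bucket. It is a simplified illustration in the spirit of PromQL's histogram_quantile, not a reimplementation of it, and the bucket boundaries in the usage note are assumed rather than taken from the service.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound_seconds, cumulative_count), sorted by
    upper bound and ending with (math.inf, total_count).
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations yet
    rank = q * total
    prev_bound, prev_count = 0.0, 0  # lowest bucket is assumed to start at 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                # Cannot interpolate into +Inf or into an empty bucket.
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound
```

With hypothetical buckets `[(0.1, 50), (0.3, 90), (1.0, 98), (math.inf, 100)]`, `estimate_quantile(0.95, buckets)` gives roughly 0.74 s. The estimate is only as fine as the bucket layout, so boundaries near the 300 ms and 1 s thresholds matter more than the total number of buckets.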

4. Structured Logs

All logs are emitted as JSON with the following mandatory fields: level, timestamp, traceId, spanId, tenantId, and service (always "orders-service").

Key log events:

  • ORDER_CREATED — info
  • ORDER_ACTIVATED — info
  • ORDER_CANCELLED — info
  • CDS_HARD_STOP_BLOCKED — warn (activation blocked)
  • CDS_WARNING_ACKNOWLEDGED — info with reason
  • OUTBOX_RELAY_FAILED — error
  • ALLERGY_CACHE_UPDATED — info
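
A minimal sketch of a formatter that produces log lines with the mandatory fields, using only the Python standard library. The class name, the `event` and `reason` keys, and the ISO 8601 timestamp format are assumptions made for illustration; the service's actual logging stack is not described here.

```python
import json
import logging
from datetime import datetime, timezone

# Map stdlib level numbers to the level names used in this document.
LEVEL_NAMES = {logging.INFO: "info", logging.WARNING: "warn", logging.ERROR: "error"}


class OrdersJsonFormatter(logging.Formatter):
    """Render each record as one JSON object carrying the mandatory fields."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": LEVEL_NAMES.get(record.levelno, record.levelname.lower()),
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            # Trace context and tenant are expected via logging's `extra` argument.
            "traceId": getattr(record, "traceId", None),
            "spanId": getattr(record, "spanId", None),
            "tenantId": getattr(record, "tenantId", None),
            "service": "orders-service",
            "event": record.getMessage(),  # assumed key for the event name
        }
        reason = getattr(record, "reason", None)
        if reason is not None:
            payload["reason"] = reason  # e.g. for CDS_WARNING_ACKNOWLEDGED
        return json.dumps(payload)
```

With this formatter attached to a handler, a call such as `logger.warning("CDS_HARD_STOP_BLOCKED", extra={"traceId": "...", "spanId": "...", "tenantId": "..."})` would yield a single JSON line with level "warn".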

5. Dashboards

| Dashboard | Key panels |
|---|---|
| Orders Overview | Orders by type/status, creation rate, cancellation rate |
| CDS Activity | Hard-stop rate, warning rate, override rate by rule |
| Referral Tracking | Pending referrals, overdue referrals, acceptance rate |
| Service Health | API latency p50/p95/p99, error rate, outbox depth |
| Order Set Usage | Most-used order sets, instantiation frequency |

6. Alerts

| Alert | Threshold | Runbook |
|---|---|---|
| OrdersApiErrorRate | > 1% 5xx over 5 min | Check DB connectivity, RLS policy |
| OrdersCdsCheckTimeout | CDS check p95 > 1 s for 5 min | Check terminology-service health |
| OrdersOutboxDepthHigh | Outbox pending > 200 for > 5 min | Check NATS connectivity, relay worker |
| OrdersCdsHardStopSpike | Hard-stop rate increases > 3× baseline | Possible new drug/allergy combination alert; review rules |
| OrdersReferralOverdue | Referrals pending > 72 h without scheduling | Alert clinical operations |
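
Several thresholds above only fire when a condition holds for a sustained period, for example OrdersOutboxDepthHigh (pending > 200 for > 5 min). The sketch below illustrates that "for" semantics on a list of samples. It assumes evenly collected (timestamp, value) pairs and is not how the alerting backend evaluates rules; the function name is hypothetical.

```python
def sustained_breach(samples, threshold, min_duration_s):
    """True if the latest unbroken run of samples above threshold has lasted
    longer than min_duration_s.

    samples: list of (unix_seconds, value), ascending by time.
    """
    breach_start = None
    for ts, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = ts  # first sample of a new breach run
        else:
            breach_start = None  # any recovery resets the timer
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start > min_duration_s
```

A single sample at or below the threshold resets the timer, so a brief dip in outbox depth delays the alert rather than letting it fire on accumulated time.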

7. Health Endpoints

| Endpoint | Purpose |
|---|---|
| GET /health/live | Liveness probe |
| GET /health/ready | Readiness probe: DB, NATS, Redis, CDS connectivity |
| GET /health/startup | Startup probe: migrations complete, allergy cache seeded |
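
A readiness probe of this kind typically aggregates one check per dependency. The sketch below shows one possible aggregation, assuming each dependency exposes a callable that returns True when reachable. The 200/503 status mapping and the "up"/"down" response body are common conventions assumed here, not confirmed behavior of this service.

```python
from typing import Callable, Dict, Tuple


def readiness(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, Dict[str, str]]:
    """Run every dependency check; return (HTTP status, per-dependency result)."""
    results = {}
    for name, check in checks.items():
        try:
            ok = bool(check())
        except Exception:
            ok = False  # a check that raises counts as a failed dependency
        results[name] = "up" if ok else "down"
    status = 200 if all(state == "up" for state in results.values()) else 503
    return status, results
```

For /health/ready the mapping would hold four entries (DB, NATS, Redis, CDS). Running every check instead of stopping at the first failure keeps the response body useful for diagnosing which dependency is down.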