
Laboratory Service — Observability

Status: populated
Owner: TBD
Last updated: 2026-04-18
Companion: Service Template · 03 platform-services · 02 DDD


1. SLIs and SLOs

| SLI | Target SLO | Measurement |
|---|---|---|
| API p95 latency (GET worklist) | < 2 s | Histogram http_request_duration_seconds p95 |
| API p95 latency (POST result entry) | < 1 s | Same histogram |
| Result release success rate | ≥ 99.5% | lab_result_release_total{outcome="success"} / total |
| Critical alert publish latency | < 5 s from trigger to NATS ack | lab_critical_publish_duration_seconds |
| Outbox relay lag | < 30 s | lab_outbox_unpublished_age_seconds max |
| Ingestion pipeline throughput | ≥ 50 results/s sustained | lab_results_ingested_total rate |
| FHIR publish success rate | ≥ 99% | lab_fhir_publish_total{outcome="success"} / total |
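
The p95 latency SLIs above are read from a bucketed histogram, so the reported value is an estimate rather than an exact percentile. The following is a minimal Python sketch of how such an estimate can be derived from cumulative bucket counts, using the same linear-interpolation idea as Prometheus's histogram_quantile(). The function name, bucket boundaries, and sample counts are illustrative assumptions, not values taken from this service.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound_seconds, cumulative_count) pairs, sorted by
    upper bound and ending with the +Inf bucket.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be in [0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations in the window
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        # The quantile falls in the first non-empty bucket whose cumulative
        # count reaches the target rank.
        if count >= rank and count > prev_count:
            if math.isinf(bound):
                # Cannot interpolate into +Inf; fall back to the highest
                # finite bound.
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical bucket counts for GET worklist requests over one window.
sample = [(0.5, 800), (1.0, 900), (2.0, 980), (5.0, 1000), (math.inf, 1000)]
p95 = histogram_quantile(0.95, sample)
print(f"p95 = {p95:.3f} s, within 2 s target: {p95 < 2.0}")
```

With these sample counts the estimate lands inside the 1 s to 2 s bucket, so its accuracy depends on how finely the buckets are spaced around the SLO threshold.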

2. Key Metrics

| Metric name | Type | Labels | Description |
|---|---|---|---|
| lab_accessions_created_total | Counter | tenant_id, priority | New accessions |
| lab_results_entered_total | Counter | tenant_id, test_code | Results entered |
| lab_results_released_total | Counter | tenant_id | Results released to chart |
| lab_critical_alerts_total | Counter | tenant_id, test_code | Critical value triggers |
| lab_critical_ack_latency_seconds | Histogram | tenant_id | Time from trigger to ack |
| lab_outbox_unpublished_count | Gauge | tenant_id | Unpublished outbox events |
| lab_tat_seconds | Histogram | tenant_id, priority | Accession TAT |
| lab_fhir_publish_duration_seconds | Histogram | tenant_id | FHIR publish latency |
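
To illustrate how the label sets above turn one metric name into several time series, here is a small standard-library Python sketch of a labeled counter that renders lines in the Prometheus text exposition style. The LabeledCounter class and the priority values "stat" and "routine" are hypothetical and only for illustration; a real service would more likely use an existing metrics client library, and label-value escaping is omitted here.

```python
from collections import defaultdict


class LabeledCounter:
    """Toy counter keyed by label values (illustrative, not a client library)."""

    def __init__(self, name, label_names):
        self.name = name
        self.label_names = tuple(label_names)
        self._values = defaultdict(float)

    def inc(self, amount=1.0, **labels):
        if set(labels) != set(self.label_names):
            raise ValueError(f"expected labels {self.label_names}")
        if amount < 0:
            raise ValueError("counters can only increase")
        key = tuple(str(labels[n]) for n in self.label_names)
        self._values[key] += amount

    def render(self):
        # One output line per distinct label combination.
        lines = [f"# TYPE {self.name} counter"]
        for key, value in sorted(self._values.items()):
            pairs = ",".join(f'{n}="{v}"' for n, v in zip(self.label_names, key))
            lines.append(f"{self.name}{{{pairs}}} {value:g}")
        return "\n".join(lines)


accessions = LabeledCounter("lab_accessions_created_total", ["tenant_id", "priority"])
accessions.inc(tenant_id="t1", priority="stat")
accessions.inc(tenant_id="t1", priority="stat")
accessions.inc(tenant_id="t1", priority="routine")
print(accessions.render())
```

Each distinct combination of tenant_id and priority becomes its own series, which is why high-cardinality values such as patient or accession identifiers are usually kept out of metric labels.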

3. Dashboards

| Dashboard | Key panels |
|---|---|
| Lab Throughput | Accessions/hr, results/hr, release rate, critical alerts/hr |
| TAT Overview | Median + p95 TAT by priority and bench |
| Critical Value Monitoring | Unacknowledged criticals count, ack latency distribution |
| Outbox Health | Unpublished events age, relay failures |
| FHIR Integration | Publish success rate, latency, error breakdown |

4. Alerts

| Alert | Condition | Severity | Runbook |
|---|---|---|---|
| Critical result unacknowledged | lab_critical_alerts_total with no matching ack after 60 min | P1 | runbooks/lab-critical-ack-escalation.md |
| Outbox lag high | lab_outbox_unpublished_age_seconds > 300 | P2 | runbooks/lab-outbox-lag.md |
| FHIR publish error rate | Error rate > 1% over 5 min | P2 | runbooks/lab-fhir-publish.md |
| Result entry latency | p95 > 2 s sustained 5 min | P3 | runbooks/lab-api-latency.md |
| No accessions in 4 hours (business hours) | rate(lab_accessions_created_total[4h]) == 0 | P3 | runbooks/lab-no-activity.md |
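
The P1 condition pairs each critical trigger with its acknowledgement, which a plain counter cannot express on its own. The sketch below shows one way the check could be evaluated, assuming trigger timestamps and acknowledged result IDs are available from some store. The function name, the data shapes, and the choice to treat exactly 60 minutes as overdue are assumptions for illustration, not the service's actual implementation.

```python
from datetime import datetime, timedelta, timezone

ACK_DEADLINE = timedelta(minutes=60)  # threshold from the alert table


def overdue_criticals(triggered_at, acked_ids, now):
    """Return result IDs whose critical alert is still unacknowledged.

    triggered_at: dict mapping result_id -> timezone-aware trigger time
    acked_ids:    set of result_ids that have been acknowledged
    now:          timezone-aware evaluation time
    """
    return sorted(
        result_id
        for result_id, ts in triggered_at.items()
        if result_id not in acked_ids and now - ts >= ACK_DEADLINE
    )


now = datetime(2026, 4, 18, 12, 0, tzinfo=timezone.utc)
triggered = {
    "r1": now - timedelta(minutes=90),  # overdue, never acknowledged
    "r2": now - timedelta(minutes=10),  # still within the deadline
    "r3": now - timedelta(minutes=61),  # overdue but acknowledged
}
print(overdue_criticals(triggered, {"r3"}, now))
```

Any non-empty result would correspond to the P1 alert firing and the escalation runbook being followed.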

5. Tracing

All inbound HTTP requests and outbound FHIR/NATS calls are instrumented with OpenTelemetry spans. Trace propagation:

  • Inbound: extract from traceparent header
  • NATS publish: inject into CloudEvents traceparent header
  • FHIR calls: inject into HTTP traceparent
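
The propagation steps above can be sketched with the W3C Trace Context traceparent format (version-traceid-parentid-flags). This standard-library Python example only illustrates the header handling; in practice the OpenTelemetry SDK's propagators would normally do this. The function names are hypothetical, only version 00 is handled, and the carrier is assumed to be a plain dict with lowercase keys.

```python
import re
import secrets

# version 00: 32-hex trace-id, 16-hex parent-id, 2-hex flags
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def extract(headers):
    """Return (trace_id, parent_span_id, flags) or None if absent/invalid."""
    value = headers.get("traceparent")
    if value is None:
        return None
    match = _TRACEPARENT.match(value.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero IDs are invalid per the Trace Context specification.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return trace_id, parent_id, flags


def inject(carrier, trace_id, flags):
    """Write a traceparent for a new child span into carrier; return its span ID."""
    span_id = secrets.token_hex(8)  # 16 hex characters
    carrier["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"
    return span_id


inbound = {"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}
context = extract(inbound)
if context is not None:
    trace_id, _parent, flags = context
    fhir_headers = {}    # outbound HTTP headers for a FHIR call
    event_attrs = {}     # CloudEvents attributes for a NATS publish
    inject(fhir_headers, trace_id, flags)
    inject(event_attrs, trace_id, flags)
    print(fhir_headers["traceparent"], event_attrs["traceparent"])
```

Both outbound carriers keep the inbound trace ID while receiving fresh span IDs, which is what lets the HTTP request, the NATS event, and the FHIR call appear in one trace.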

Span attributes: tenant_id, patient_id (masked in non-PHI contexts), accession_id, result_id.
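
The source does not say how patient_id is masked. As one possible approach, the sketch below replaces it with a truncated keyed hash (HMAC-SHA256) whenever the span is exported to a non-PHI context, so spans for the same patient can still be correlated without exposing the raw identifier. The function name, the phi_allowed flag, the mask_key parameter, and the "masked:" prefix are all assumptions for illustration.

```python
import hashlib
import hmac


def span_attributes(tenant_id, patient_id, accession_id, result_id,
                    *, phi_allowed, mask_key):
    """Build the span attribute dict, masking patient_id when PHI is not allowed.

    mask_key: secret bytes used for the keyed hash (hypothetical parameter).
    """
    if not phi_allowed:
        digest = hmac.new(mask_key, patient_id.encode("utf-8"),
                          hashlib.sha256).hexdigest()
        patient_id = "masked:" + digest[:16]
    return {
        "tenant_id": tenant_id,
        "patient_id": patient_id,
        "accession_id": accession_id,
        "result_id": result_id,
    }


attrs = span_attributes(
    "t1", "patient-123", "acc-42", "res-7",
    phi_allowed=False, mask_key=b"example-key-not-for-production",
)
print(attrs["patient_id"])
```

A keyed hash is used here rather than a plain hash so that identifiers cannot be recovered by hashing candidate values without the key; whether that matches the service's actual masking policy is not stated in this document.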