Laboratory Service — Observability
Status: populated
Owner: TBD
Last updated: 2026-04-18
Companion: Service Template · 03 platform-services · 02 DDD
1. SLIs and SLOs
| SLI | Target SLO | Measurement |
|---|---|---|
| API p95 latency (GET worklist) | < 2 s | Histogram http_request_duration_seconds p95 |
| API p95 latency (POST result entry) | < 1 s | Same histogram |
| Result release success rate | ≥ 99.5% | lab_result_release_total{outcome="success"} / lab_result_release_total (all outcomes) |
| Critical alert publish latency | < 5 s from trigger to NATS ack | lab_critical_publish_duration_seconds |
| Outbox relay lag | < 30 s | lab_outbox_unpublished_age_seconds max |
| Ingestion pipeline throughput | ≥ 50 results/s sustained | lab_results_ingested_total rate |
| FHIR publish success rate | ≥ 99% | lab_fhir_publish_total{outcome="success"} / lab_fhir_publish_total (all outcomes) |
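
The latency and success-rate SLIs above can be checked offline against scraped samples. The following is a minimal sketch, not the service's actual evaluation code: the function names and the sample bucket values are hypothetical, and the quantile estimate uses linear interpolation within a bucket, similar in spirit to PromQL's histogram_quantile.

```python
from math import inf, nan


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Assumes Prometheus-style cumulative buckets ending with an +Inf bucket.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return nan  # no observations, nothing to estimate
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == inf or count == prev_count:
                # Cannot interpolate into +Inf or an empty bucket.
                return prev_bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


def success_ratio(success, total):
    """Return success / total, or None when there was no traffic (assumption)."""
    return success / total if total else None


# Hypothetical sample: cumulative bucket counts for GET worklist requests.
worklist_buckets = [(0.5, 50), (1.0, 90), (2.0, 98), (inf, 100)]
p95 = estimate_quantile(0.95, worklist_buckets)
print(f"worklist p95 = {p95:.3f} s, meets < 2 s SLO: {p95 < 2.0}")
print("release SLO met:", success_ratio(9960, 10000) >= 0.995)
```

Treating a window with no traffic as "no data" rather than as a pass or a fail is a design choice made here for illustration; the document does not specify it.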
2. Key Metrics
| Metric name | Type | Labels | Description |
|---|---|---|---|
| lab_accessions_created_total | Counter | tenant_id, priority | New accessions |
| lab_results_entered_total | Counter | tenant_id, test_code | Results entered |
| lab_results_released_total | Counter | tenant_id | Results released to chart |
| lab_critical_alerts_total | Counter | tenant_id, test_code | Critical value triggers |
| lab_critical_ack_latency_seconds | Histogram | tenant_id | Time from trigger to ack |
| lab_outbox_unpublished_count | Gauge | tenant_id | Unpublished outbox events |
| lab_tat_seconds | Histogram | tenant_id, priority | Accession TAT |
| lab_fhir_publish_duration_seconds | Histogram | tenant_id | FHIR publish latency |
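
One way to keep emitted samples consistent with this table is to hold the catalogue as data and check label sets against it, for example in a unit test. The sketch below is illustrative only: CATALOGUE and validate_sample are hypothetical names, and the entries are transcribed from the table above. A check like this would also catch an identifier such as patient_id leaking into metric labels.

```python
# Catalogue transcribed from the Key Metrics table: name -> (type, label names).
CATALOGUE = {
    "lab_accessions_created_total": ("counter", {"tenant_id", "priority"}),
    "lab_results_entered_total": ("counter", {"tenant_id", "test_code"}),
    "lab_results_released_total": ("counter", {"tenant_id"}),
    "lab_critical_alerts_total": ("counter", {"tenant_id", "test_code"}),
    "lab_critical_ack_latency_seconds": ("histogram", {"tenant_id"}),
    "lab_outbox_unpublished_count": ("gauge", {"tenant_id"}),
    "lab_tat_seconds": ("histogram", {"tenant_id", "priority"}),
    "lab_fhir_publish_duration_seconds": ("histogram", {"tenant_id"}),
}


def validate_sample(name, labels):
    """Return a list of problems for one emitted sample; empty means OK."""
    if name not in CATALOGUE:
        return [f"unknown metric: {name}"]
    _, allowed = CATALOGUE[name]
    present = set(labels)
    problems = []
    if allowed - present:
        problems.append(f"missing labels: {sorted(allowed - present)}")
    if present - allowed:
        problems.append(f"unexpected labels: {sorted(present - allowed)}")
    return problems


# Hypothetical samples: the second one carries a label the table does not allow.
print(validate_sample("lab_tat_seconds", {"tenant_id": "t1", "priority": "stat"}))
print(validate_sample("lab_tat_seconds", {"tenant_id": "t1", "patient_id": "p1"}))
```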
3. Dashboards
| Dashboard | Key panels |
|---|---|
| Lab Throughput | Accessions/hr, results/hr, release rate, critical alerts/hr |
| TAT Overview | Median + p95 TAT by priority and bench |
| Critical Value Monitoring | Unacknowledged criticals count, ack latency distribution |
| Outbox Health | Unpublished events age, relay failures |
| FHIR Integration | Publish success rate, latency, error breakdown |
4. Alerts
| Alert | Condition | Severity | Runbook |
|---|---|---|---|
| Critical result unacknowledged | lab_critical_alerts_total with no matching ack after 60 min | P1 | runbooks/lab-critical-ack-escalation.md |
| Outbox lag high | lab_outbox_unpublished_age_seconds > 300 | P2 | runbooks/lab-outbox-lag.md |
| FHIR publish error rate | Error rate > 1% over 5 min | P2 | runbooks/lab-fhir-publish.md |
| Result entry latency | p95 > 2 s sustained 5 min | P3 | runbooks/lab-api-latency.md |
| No accessions in 4 hours (business hours) | rate(lab_accessions_created_total[4h]) == 0 | P3 | runbooks/lab-no-activity.md |
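
The "Critical result unacknowledged" condition pairs triggers with acknowledgements, which a bare counter cannot express on its own. As a rough sketch of the intended logic, assuming the service can list trigger timestamps and acknowledged alert ids (both data shapes and all function names here are hypothetical), the first two conditions could be evaluated like this:

```python
from datetime import datetime, timedelta, timezone

ACK_DEADLINE = timedelta(minutes=60)  # "Critical result unacknowledged" row
FHIR_ERROR_THRESHOLD = 0.01           # "FHIR publish error rate" row (> 1%)


def overdue_criticals(triggered_at, acked_ids, now):
    """Return ids of critical alerts with no ack once the deadline has passed.

    triggered_at: dict of alert id -> trigger time (timezone-aware datetime).
    acked_ids: set of alert ids that have been acknowledged.
    """
    return sorted(
        alert_id
        for alert_id, ts in triggered_at.items()
        if alert_id not in acked_ids and now - ts >= ACK_DEADLINE
    )


def fhir_error_rate_firing(errors, total):
    """errors and total are publish counts over the trailing 5-minute window."""
    return total > 0 and errors / total > FHIR_ERROR_THRESHOLD


# Hypothetical data: a1 is overdue, a2 was acknowledged, a3 is still in time.
now = datetime(2026, 4, 18, 12, 0, tzinfo=timezone.utc)
triggered = {
    "a1": now - timedelta(minutes=61),
    "a2": now - timedelta(minutes=61),
    "a3": now - timedelta(minutes=10),
}
print(overdue_criticals(triggered, {"a2"}, now))
print(fhir_error_rate_firing(errors=2, total=100))
```

Whether the deadline is inclusive (at exactly 60 minutes) is not stated in the table; the sketch assumes it is.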
5. Tracing
All inbound HTTP requests and outbound FHIR/NATS calls are instrumented with OpenTelemetry spans. Trace propagation:
- Inbound: extract from the traceparent header
- NATS publish: inject into the CloudEvents traceparent header
- FHIR calls: inject into the HTTP traceparent header
Span attributes: tenant_id, patient_id (masked in non-PHI contexts), accession_id, result_id.
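
In practice an OpenTelemetry propagator would normally handle this, but the extract and inject steps can be sketched by hand to show what travels in the traceparent value. This is a simplified illustration under stated assumptions: the function names are hypothetical, carriers are plain dicts with lowercase keys, and only the four-field W3C Trace Context layout (version, trace-id, parent-id, flags) is accepted.

```python
import re
import secrets

# version(2 hex)-trace_id(32 hex)-parent_id(16 hex)-flags(2 hex), lowercase.
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def extract(carrier):
    """Return (trace_id, parent_id, flags) from a carrier dict, or None."""
    match = _TRACEPARENT.match(carrier.get("traceparent", "").strip())
    if not match:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero ids are invalid per W3C Trace Context.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return trace_id, parent_id, flags


def inject(carrier, trace_id, flags="01"):
    """Write a traceparent for a new child span into carrier; return its span id."""
    span_id = secrets.token_hex(8)  # 16 hex characters
    carrier["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"
    return span_id


# Inbound HTTP headers (example value taken from the W3C specification).
inbound = {"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}
context = extract(inbound)

# Outbound carriers: CloudEvents attributes for NATS, HTTP headers for FHIR.
cloudevent_attrs, fhir_headers = {}, {}
if context is not None:
    inject(cloudevent_attrs, context[0], context[2])
    inject(fhir_headers, context[0], context[2])
print(cloudevent_attrs, fhir_headers)
```

Both outbound carriers share the inbound trace id while each gets a fresh span id, which is what lets the NATS and FHIR hops appear under the same trace.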