Patient Chart Service — Observability
Status: populated; Owner: TBD; Last updated: 2026-04-18; Companion: Service Template · 12 observability-telemetry
1. SLIs and SLOs
| SLI | Measurement | SLO target | Window |
|---|---|---|---|
| Chart read availability | 1 - (5xx responses / total requests) on GET /v1/chart/*, /v1/problems/*, /v1/allergies/*, /v1/vitals/*, /v1/clinical-notes/* | 99.9 % | 30-day rolling |
| Chart write availability | 1 - (5xx responses / total requests) on POST/PUT/PATCH chart write endpoints | 99.9 % | 30-day rolling |
| Problem list read latency (P95) | patient_chart_http_request_duration_ms{operation="list_problems"} | < 500 ms | 5-min window |
| Note sign latency (P95) | patient_chart_http_request_duration_ms{operation="sign_note"} | < 1 000 ms | 5-min window |
| Vitals record latency (P95) | patient_chart_http_request_duration_ms{operation="record_vitals"} | < 500 ms | 5-min window |
| Allergy advisory latency (P95) | patient_chart_http_request_duration_ms{operation="allergy_advisory"} | < 300 ms | 5-min window |
| NATS outbox lag | patient_chart_outbox_unpublished_rows | < 100 rows (breach if exceeded for > 30 s) | Continuous |
| Tenant isolation | Automated tenant-isolation.spec.ts in CI | 100 % pass | Every deploy |
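The two availability rows reduce to a ratio of request counts. The sketch below shows one way to compute the SLI and the share of error budget still unspent. It assumes the counts are summed from patient_chart_http_request_total over the 30-day window; the type and function names are hypothetical and not part of the service.

```typescript
// Hypothetical helper types and functions, for illustration only.
interface RequestCounts {
  total: number;        // all requests to the endpoints in scope over the window
  serverErrors: number; // responses with a 5xx status code
}

// Availability SLI: 1 - (5xx / total). A window with no traffic counts as fully available.
function availability(counts: RequestCounts): number {
  if (counts.total === 0) return 1;
  return 1 - counts.serverErrors / counts.total;
}

// Fraction of the error budget still unspent for an SLO target such as 0.999.
// A negative result means the budget for the window is exhausted.
function errorBudgetRemaining(counts: RequestCounts, sloTarget: number): number {
  const allowedErrors = counts.total * (1 - sloTarget);
  if (allowedErrors === 0) return counts.serverErrors === 0 ? 1 : -Infinity;
  return 1 - counts.serverErrors / allowedErrors;
}

// Made-up numbers: 2,000,000 requests with 500 server errors in the window.
const window30d: RequestCounts = { total: 2000000, serverErrors: 500 };
console.log("availability:", availability(window30d));
console.log("budget remaining:", errorBudgetRemaining(window30d, 0.999));
```

With a 99.9 % target, roughly 1 request in 1 000 may fail before the budget is spent, so the budget view is often easier to act on than the raw availability figure.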
2. Key metrics (Prometheus)
| Metric | Type | Labels | Description |
|---|---|---|---|
| patient_chart_http_request_duration_ms | Histogram | operation, status_code, tenant_id | HTTP endpoint latency |
| patient_chart_http_request_total | Counter | operation, status_code | Request count |
| patient_chart_domain_event_published_total | Counter | event_type | Events published to NATS |
| patient_chart_outbox_unpublished_rows | Gauge | — | Pending outbox rows |
| patient_chart_problem_created_total | Counter | tenant_id, clinical_status | Problem creation rate |
| patient_chart_allergy_created_total | Counter | tenant_id, category | Allergy creation rate |
| patient_chart_vitals_recorded_total | Counter | tenant_id | VitalsSet creation rate |
| patient_chart_vitals_abnormal_total | Counter | tenant_id, code, severity | Abnormal vitals flagged |
| patient_chart_note_signed_total | Counter | tenant_id, note_type | Signed notes |
| patient_chart_note_ai_accepted_total | Counter | tenant_id | AI-assist chunks accepted |
| patient_chart_breakglass_invoked_total | Counter | tenant_id | Break-glass events |
| patient_chart_db_query_duration_ms | Histogram | query_name | DB query latency |
| patient_chart_downstream_http_duration_ms | Histogram | dependency, status_code | Fan-out call latency |
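The latency SLIs are P95 values derived from histogram metrics, so they are estimates bounded by the bucket layout. The sketch below estimates a quantile from cumulative buckets by linear interpolation, similar in spirit to what Prometheus does for histogram_quantile. The bucket boundaries and counts are made up, and the function name is hypothetical.

```typescript
// Cumulative histogram bucket: `count` observations were <= `le` milliseconds.
interface Bucket {
  le: number;    // upper bound in ms; Infinity for the overflow bucket
  count: number; // cumulative observation count
}

// Estimate quantile q (0..1) by interpolating linearly inside the bucket
// that contains the target rank.
function estimateQuantile(q: number, buckets: Bucket[]): number {
  const sorted = [...buckets].sort((a, b) => a.le - b.le);
  const last = sorted[sorted.length - 1];
  const total = last ? last.count : 0;
  if (total === 0) return NaN;

  const rank = q * total;
  let lowerBound = 0;
  let lowerCount = 0;
  for (const b of sorted) {
    if (b.count >= rank) {
      // The overflow bucket has no finite upper bound: report the highest finite bound.
      if (!Number.isFinite(b.le)) return lowerBound;
      const inBucket = b.count - lowerCount;
      if (inBucket === 0) return b.le;
      return lowerBound + ((rank - lowerCount) / inBucket) * (b.le - lowerBound);
    }
    lowerBound = b.le;
    lowerCount = b.count;
  }
  return lowerBound;
}

// Made-up bucket layout for one operation.
const listProblemsBuckets: Bucket[] = [
  { le: 100, count: 50 },
  { le: 250, count: 80 },
  { le: 500, count: 95 },
  { le: 1000, count: 99 },
  { le: Infinity, count: 100 },
];
console.log("estimated P90 (ms):", estimateQuantile(0.9, listProblemsBuckets));
```

Because the estimate can never be more precise than the bucket it falls in, it helps to have bucket boundaries at or near each SLO threshold (300 ms, 500 ms, 1 000 ms).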
3. Distributed tracing (OpenTelemetry)
All handlers emit OTEL spans. Key span hierarchy per request:
HTTP /v1/problems (POST)
patient_chart.policy_check
patient_chart.add_problem (use case)
patient_chart.terminology_lookup
patient_chart.db.problem.insert
patient_chart.outbox.write
patient_chart.event.publish
Span attributes (never include PHI):
tenant_id, patient_id, aggregate_type, aggregate_id, operation, correlation_id
OTEL exporter: OTLP → Grafana Tempo.
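One way to enforce the "never include PHI" rule is to pass span attributes through an allow-list before they reach the tracer. The sketch below is a minimal illustration under that assumption. The helper name and the example values are hypothetical, and it deliberately avoids importing the OpenTelemetry SDK so that it stays self-contained.

```typescript
// Allow-list taken from the span attributes listed above.
const ALLOWED_SPAN_ATTRIBUTES = [
  "tenant_id",
  "patient_id",
  "aggregate_type",
  "aggregate_id",
  "operation",
  "correlation_id",
] as const;

type AllowedKey = (typeof ALLOWED_SPAN_ATTRIBUTES)[number];
type SpanAttributes = Partial<Record<AllowedKey, string>>;

// Hypothetical helper: copies only allow-listed string values and drops
// everything else, so a stray field such as a patient name is never exported.
function toSafeSpanAttributes(input: Record<string, unknown>): SpanAttributes {
  const safe: SpanAttributes = {};
  for (const key of ALLOWED_SPAN_ATTRIBUTES) {
    const value = input[key];
    if (typeof value === "string") safe[key] = value;
  }
  return safe;
}

// Made-up request context that accidentally carries a PHI field.
const attrs = toSafeSpanAttributes({
  tenant_id: "tenant-a",
  patient_id: "pat_123",
  operation: "add_problem",
  patient_name: "should never be exported",
});
console.log(attrs);
```

In a handler, the result could be handed to the active span (for example through the `setAttributes` method of the OpenTelemetry API `Span`). An allow-list fails closed: a newly added field stays out of traces until someone explicitly approves it.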
4. Structured logging
Log format: JSON via pino. Mandatory fields:
| Field | Description |
|---|---|
| level | info / warn / error |
| service | patient-chart-service |
| traceId | OTEL trace id |
| spanId | OTEL span id |
| tenantId | Current tenant (from JWT) |
| correlationId | Request correlation |
| operation | Handler name |
| msg | Human-readable message |
PHI-safe logging: Patient names, dates of birth, and free-text clinical content MUST NOT appear in log output. IDs (pat_*, prb_*, etc.) are permitted.
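The mandatory fields and the PHI rule can be combined in a single log-record builder. The sketch below is a dependency-free illustration of the record shape, not the service's pino configuration. The deny-list keys (`patientName`, `dateOfBirth`, `noteText`) and the function name are hypothetical.

```typescript
type Level = "info" | "warn" | "error";

// Per-request context carrying the mandatory correlation fields.
interface LogContext {
  traceId: string;
  spanId: string;
  tenantId: string;
  correlationId: string;
  operation: string;
}

// Hypothetical deny-list of keys that could carry PHI.
const PHI_KEYS = new Set(["patientName", "dateOfBirth", "noteText"]);

// Builds one JSON log line with all mandatory fields and PHI keys removed.
function buildLogLine(
  level: Level,
  ctx: LogContext,
  msg: string,
  extra: Record<string, unknown> = {},
): string {
  const safeExtra: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(extra)) {
    if (!PHI_KEYS.has(key)) safeExtra[key] = value;
  }
  // Mandatory fields are spread last so `extra` cannot overwrite them.
  return JSON.stringify({
    ...safeExtra,
    level,
    service: "patient-chart-service",
    ...ctx,
    msg,
  });
}

// Made-up example values.
console.log(
  buildLogLine(
    "info",
    { traceId: "t-1", spanId: "s-1", tenantId: "tenant-a", correlationId: "c-1", operation: "sign_note" },
    "note signed",
    { patientId: "pat_123", patientName: "must be dropped" },
  ),
);
```

A key-based deny-list only covers structured fields. It does not catch PHI interpolated into `msg`, so message strings should be built from IDs only. With pino itself, the `redact` option offers a comparable path-based mechanism.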
5. Dashboards
| Dashboard | Location | Key panels |
|---|---|---|
| Patient Chart — Overview | Grafana ghasi/patient-chart | Request rate, error rate, P95 latency per operation, outbox lag |
| Patient Chart — Clinical Activity | Grafana ghasi/patient-chart-clinical | Problems/allergies/vitals/notes created per hour, abnormal vitals rate, break-glass events |
| Patient Chart — Downstream Deps | Grafana ghasi/patient-chart-deps | Latency and error rates for each downstream dependency |
| Patient Chart — SLO Burn | Grafana ghasi/patient-chart-slo | Multi-window burn rate for availability and latency SLOs |
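The burn rate shown on the SLO Burn dashboard is the observed error ratio divided by the error ratio the SLO allows. A burn rate of 1 spends the budget exactly over the SLO window; higher values spend it proportionally faster. The sketch below illustrates a two-window check. The window lengths and the threshold of 14.4 are commonly cited defaults assumed here for illustration, not values taken from this service, and the function names are hypothetical.

```typescript
// Request and error counts observed in one lookback window.
interface WindowSample {
  total: number;
  errors: number;
}

// Burn rate = observed error ratio / allowed error ratio (1 - SLO target).
function burnRate(sample: WindowSample, sloTarget: number): number {
  if (sample.total === 0) return 0;
  const errorRatio = sample.errors / sample.total;
  return errorRatio / (1 - sloTarget);
}

// Multi-window check: both the long and the short window must exceed the
// threshold, so a burst that has already ended does not keep the alert firing.
function isBurningFast(
  longWindow: WindowSample,
  shortWindow: WindowSample,
  sloTarget: number,
  threshold: number,
): boolean {
  return (
    burnRate(longWindow, sloTarget) > threshold &&
    burnRate(shortWindow, sloTarget) > threshold
  );
}

// Made-up samples: a 1-hour window and a 5-minute window against a 99.9 % SLO.
const lastHour: WindowSample = { total: 60000, errors: 1200 };
const lastFiveMinutes: WindowSample = { total: 5000, errors: 110 };
console.log("fast burn:", isBurningFast(lastHour, lastFiveMinutes, 0.999, 14.4));
```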
6. Alerts
| Alert | Condition | Severity | Runbook |
|---|---|---|---|
| ChartHighErrorRate | Error rate > 1 % for > 5 min | Critical | /runbooks/patient-chart/high-error-rate |
| ChartP95LatencyHigh | P95 > 1.5 s for > 10 min | Warning | /runbooks/patient-chart/high-latency |
| ChartOutboxLag | Outbox unpublished > 100 rows for > 2 min | Warning | /runbooks/patient-chart/outbox-lag |
| ChartDBConnectionError | DB connection error rate > 0 for > 1 min | Critical | /runbooks/patient-chart/db-failure |
| ChartBreakGlassSpike | Break-glass events > 10 in 1 min for same tenant | Warning | /runbooks/patient-chart/breakglass-spike |
| ChartAbnormalVitalsUnreviewed | patient_chart_vitals_abnormal_total high rate, no cosign | Info | /runbooks/patient-chart/abnormal-vitals |
| ChartTenantIsolationTestFailed | CI tenant-isolation.spec.ts failed in last deploy | Critical | Block deploy |
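Most conditions above pair a threshold with a hold duration ("> 100 rows for > 2 min"), which keeps short spikes from paging anyone. The sketch below shows that evaluation over a series of samples. It assumes samples arrive in time order, and the function name and sample data are hypothetical; a real deployment would normally leave this to the alerting system.

```typescript
// One scraped value of a metric at a point in time.
interface Sample {
  timestampMs: number;
  value: number;
}

// Returns true when the most recent run of samples all exceed `threshold`
// and that run spans more than `forMs` milliseconds.
function firesFor(samples: Sample[], threshold: number, forMs: number): boolean {
  let breachStart: number | null = null;
  for (const s of samples) {
    if (s.value > threshold) {
      if (breachStart === null) breachStart = s.timestampMs;
    } else {
      breachStart = null; // any recovery resets the hold timer
    }
  }
  const last = samples[samples.length - 1];
  return (
    breachStart !== null &&
    last !== undefined &&
    last.timestampMs - breachStart > forMs
  );
}

// ChartOutboxLag shape: > 100 rows for > 2 min, sampled every 30 s (made-up data).
const outboxRows: Sample[] = [0, 30, 60, 90, 120, 150].map((sec) => ({
  timestampMs: sec * 1000,
  value: 150,
}));
console.log("ChartOutboxLag fires:", firesFor(outboxRows, 100, 2 * 60 * 1000));
```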
7. On-call runbook index
| Runbook | Trigger |
|---|---|
| /runbooks/patient-chart/high-error-rate.md | ChartHighErrorRate |
| /runbooks/patient-chart/db-failure.md | ChartDBConnectionError |
| /runbooks/patient-chart/outbox-lag.md | ChartOutboxLag |
| /runbooks/patient-chart/breakglass-spike.md | ChartBreakGlassSpike |
| /runbooks/patient-chart/migration-failure.md | Migration job exit code ≠ 0 |