Population Health Service — Observability
Status: populated | Owner: TBD | Last updated: 2026-04-18 | Companion: Service Template · 12 observability-telemetry
1. SLIs and SLOs
| SLI | SLO | Measurement |
|---|---|---|
| Dashboard aggregate response time (p95) | ≤ 2500 ms | Histogram http_request_duration_ms{route="/dashboard"} |
| Dashboard availability | ≥ 99.5% over a 30-day window | Success rate on GET /dashboard |
| Cohort refresh job success rate | ≥ 98% over a 7-day window | cohort_refresh_jobs_succeeded / cohort_refresh_jobs_total |
| HMIS export success rate | ≥ 99% per scheduled run | hmis_export_completed / hmis_export_triggered |
| HMIS export latency (scheduled push) | ≤ 5 minutes from scheduled time | hmis_export_lag_seconds |
| API error rate (5xx) | ≤ 0.5% | http_requests_total{status=~"5.."} / http_requests_total |
| De-identification job completion | ≤ 10 minutes p95 for cohorts ≤ 50k rows | deident_job_duration_seconds |
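
The ratio-based SLIs in the table (cohort refresh and HMIS export success rates) reduce to a good/total comparison against a target. The following is a minimal sketch of that check, assuming the two counter values have already been read for the relevant window. The names `RatioSample`, `SloResult`, and `evaluateRatioSlo` are hypothetical and not part of @ghasi/telemetry, and treating an empty window as compliant is an assumption.

```typescript
// Hypothetical helper for illustration; not part of @ghasi/telemetry.
interface RatioSample {
  good: number;  // e.g. cohort_refresh_jobs_succeeded over the window
  total: number; // e.g. cohort_refresh_jobs_total over the window
}

interface SloResult {
  sli: number;                  // observed success ratio in [0, 1]
  met: boolean;                 // true when sli >= target
  errorBudgetRemaining: number; // share of allowed failures left; negative when overspent
}

function evaluateRatioSlo(sample: RatioSample, target: number): SloResult {
  if (target <= 0 || target >= 1) {
    throw new RangeError("target must be strictly between 0 and 1");
  }
  if (sample.good < 0 || sample.good > sample.total) {
    throw new RangeError("good must be between 0 and total");
  }
  // Assumption: a window with no events counts as meeting the SLO.
  if (sample.total === 0) {
    return { sli: 1, met: true, errorBudgetRemaining: 1 };
  }
  const sli = sample.good / sample.total;
  const allowedFailures = (1 - target) * sample.total;
  const actualFailures = sample.total - sample.good;
  return {
    sli,
    met: sli >= target,
    errorBudgetRemaining: 1 - actualFailures / allowedFailures,
  };
}
```

For the cohort refresh SLO, the call would be `evaluateRatioSlo({ good, total }, 0.98)` with counts taken over the 7-day window.
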
2. Key Metrics
| Metric name | Type | Labels | Description |
|---|---|---|---|
| pophealth_dashboard_requests_total | Counter | tenant_id, status | Dashboard API calls |
| pophealth_cohort_refresh_duration_seconds | Histogram | tenant_id, status | Cohort refresh job duration |
| pophealth_cohort_membership_count | Gauge | tenant_id, cohort_id | Last evaluated membership |
| pophealth_risk_jobs_total | Counter | tenant_id, model_key, status | Risk scoring jobs |
| pophealth_hmis_export_lag_seconds | Gauge | tenant_id, indicator_family | Seconds behind scheduled push time |
| pophealth_hmis_export_total | Counter | tenant_id, indicator_family, status | HMIS push outcomes |
| pophealth_deident_jobs_total | Counter | tenant_id, status | De-id export outcomes |
| pophealth_outreach_items_by_status | Gauge | tenant_id, status | Live outreach funnel |
| pophealth_quality_snapshots_computed_total | Counter | tenant_id, program | QM snapshot completions |
| pophealth_outbox_lag_seconds | Gauge | — | Age of oldest unpublished outbox message |
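
As an illustration of how a gauge such as pophealth_outbox_lag_seconds could be derived, here is a minimal sketch that computes the age of the oldest unpublished outbox message. The `OutboxRow` shape and the `outboxLagSeconds` function are assumptions made for the example; the actual outbox schema is not described in this document.

```typescript
// Hypothetical row shape; the real outbox table is not specified here.
interface OutboxRow {
  createdAt: Date;
  publishedAt: Date | null; // null while the message is still unpublished
}

// Returns the age in seconds of the oldest unpublished row, or 0 when
// everything has been published.
function outboxLagSeconds(rows: OutboxRow[], now: Date): number {
  let oldestMs: number | null = null;
  for (const row of rows) {
    if (row.publishedAt !== null) continue; // already published, ignore
    const createdMs = row.createdAt.getTime();
    if (oldestMs === null || createdMs < oldestMs) {
      oldestMs = createdMs;
    }
  }
  if (oldestMs === null) return 0;
  // Clamp at 0 so clock skew never reports a negative lag.
  return Math.max(0, (now.getTime() - oldestMs) / 1000);
}
```

The result would be written to the gauge on each collection cycle. In practice the oldest unpublished row would more likely be selected by a database query than by scanning rows in memory.
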
3. Traces
All use cases emit OpenTelemetry traces via @ghasi/telemetry, which is initialized before NestFactory creates the application. Key span names:
| Span | Attributes |
|---|---|
| pophealth.dashboard.query | tenant_id, facility_id, result_count |
| pophealth.cohort.refresh | cohort_id, version, member_delta |
| pophealth.risk.score | model_key, scope, tier_counts |
| pophealth.hmis.push | indicator_family, period, dhis2_import_count |
| pophealth.deident.pipeline | cohort_id, k_value, epsilon, row_count |
| pophealth.outreach.status_change | item_id, prev_status, new_status |
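
The span names and attributes listed here could be applied through a small wrapper like the sketch below. `TracerLike` and `SpanLike` are hypothetical interfaces that cover only the calls the example needs; they are modelled on the general shape of an OpenTelemetry tracer and span, and whether @ghasi/telemetry exposes a compatible tracer is an assumption. The wrapper is synchronous for brevity, so asynchronous work would need to be awaited before the span is ended.

```typescript
// Hypothetical minimal interfaces; not the @ghasi/telemetry API.
type AttrValue = string | number;

interface SpanLike {
  setAttribute(key: string, value: AttrValue): unknown;
  end(): void;
}

interface TracerLike {
  startSpan(name: string): SpanLike;
}

// Runs `work` inside a named span, sets the given attributes up front,
// and always ends the span, even when `work` throws.
function withSpan<T>(
  tracer: TracerLike,
  name: string,
  attributes: { [key: string]: AttrValue },
  work: (span: SpanLike) => T,
): T {
  const span = tracer.startSpan(name);
  for (const key of Object.keys(attributes)) {
    span.setAttribute(key, attributes[key]);
  }
  try {
    return work(span);
  } finally {
    span.end();
  }
}
```

A cohort refresh could then be wrapped as `withSpan(tracer, "pophealth.cohort.refresh", { cohort_id, version }, run)`, with result-dependent attributes such as member_delta set on the span inside `run` once the value is known.
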
4. Dashboards
| Dashboard | Key panels |
|---|---|
| Population Health Overview | Active patients by tenant/facility, registry counts, compliance rates |
| Cohort Engine | Refresh job queue depth, success/failure rate, p95 duration |
| HMIS Export Pipeline | Export lag per indicator family, DHIS2 response times, failure rate |
| Quality Metrics | Snapshot freshness, rate trends by program |
| Outreach Funnel | Items by status, conversion rate (pending→completed) |
| Security / Audit | PHI access events, export requests, consent violations |
5. Alerts
| Alert | Condition | Severity | Runbook |
|---|---|---|---|
| Dashboard p95 latency breach | > 2500 ms for 5 min | Warning | runbooks/pophealth-dashboard-latency.md |
| HMIS export lag critical | > 30 min behind schedule | Critical | runbooks/pophealth-hmis-lag.md |
| HMIS export failure | 3 consecutive failures | Critical | runbooks/pophealth-hmis-failure.md |
| Cohort refresh queue depth | > 50 pending jobs for > 10 min | Warning | runbooks/pophealth-cohort-queue.md |
| Outbox lag | Oldest unpublished message > 5 min | Warning | runbooks/pophealth-outbox-lag.md |
| De-ident k-threshold violation rate | > 5% of export jobs suppressed | Warning | runbooks/pophealth-deident.md |
| 5xx error rate spike | > 1% for 5 min | Critical | runbooks/pophealth-errors.md |
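
Several conditions in the table combine a threshold with a duration, for example "> 2500 ms for 5 min". The sketch below shows one way to evaluate such a sustained breach over a series of samples. The `Sample` type and `breachedFor` function are hypothetical, and the sketch assumes samples are sorted by ascending timestamp with no gaps. How the alerting backend actually evaluates these rules is not specified in this document.

```typescript
// Hypothetical types for illustration only.
interface Sample {
  timestampMs: number;
  value: number;
}

// Returns true when the most recent run of consecutive samples above
// `threshold` started at least `forMs` milliseconds before `nowMs`.
// Assumes `samples` is sorted by ascending timestamp.
function breachedFor(
  samples: Sample[],
  threshold: number,
  forMs: number,
  nowMs: number,
): boolean {
  let streakStartMs: number | null = null;
  // Walk backwards from the newest sample until the breach streak ends.
  for (let i = samples.length - 1; i >= 0; i--) {
    if (samples[i].value > threshold) {
      streakStartMs = samples[i].timestampMs;
    } else {
      break;
    }
  }
  return streakStartMs !== null && nowMs - streakStartMs >= forMs;
}
```

For the dashboard latency alert this would be called as `breachedFor(p95Samples, 2500, 5 * 60 * 1000, Date.now())`. A single sample at or below the threshold resets the streak, so brief spikes do not fire the alert.
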
6. Logs
Logs are emitted as structured JSON via the @ghasi/telemetry logger. Every log line includes traceId, spanId, tenantId, and actorId.
PHI must never appear in log lines. Patient identifiers are replaced with patientId=<redacted> in debug logs.
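
The redaction rule can be illustrated with a small recursive scrubber applied to a log payload before it is serialized. This is a sketch only: `PATIENT_ID_KEYS` and `redactPatientIds` are hypothetical names, the key list is an assumption, and the actual @ghasi/telemetry logger may enforce redaction differently. Key-based redaction also does not catch identifiers embedded in free-text messages, which need separate handling.

```typescript
// Hypothetical key list; the real set of patient-identifying fields is
// not given in this document.
const PATIENT_ID_KEYS = ["patientId"];

type Json =
  | string
  | number
  | boolean
  | null
  | Json[]
  | { [key: string]: Json };

// Returns a deep copy of `value` in which every property whose key is in
// PATIENT_ID_KEYS is replaced by the literal "<redacted>".
function redactPatientIds(value: Json): Json {
  if (Array.isArray(value)) {
    return value.map((item) => redactPatientIds(item));
  }
  if (value !== null && typeof value === "object") {
    const out: { [key: string]: Json } = {};
    for (const key of Object.keys(value)) {
      out[key] =
        PATIENT_ID_KEYS.indexOf(key) >= 0
          ? "<redacted>"
          : redactPatientIds(value[key]);
    }
    return out;
  }
  return value; // primitives pass through unchanged
}
```

The function returns a copy rather than mutating its input, so the original object can still be used by the calling code after the log line has been written.
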