Population Health Service — Observability

Status: populated
Owner: TBD
Last updated: 2026-04-18
Companion: Service Template · 12 observability-telemetry

1. SLIs and SLOs

| SLI | SLO | Measurement |
|---|---|---|
| Dashboard aggregate response time (p95) | ≤ 2500 ms | Histogram http_request_duration_ms{route="/dashboard"} |
| Dashboard availability | ≥ 99.5% / 30-day window | Success rate on GET /dashboard |
| Cohort refresh job success rate | ≥ 98% over 7-day window | cohort_refresh_jobs_succeeded / cohort_refresh_jobs_total |
| HMIS export success rate | ≥ 99% per scheduled run | hmis_export_completed / hmis_export_triggered |
| HMIS export latency (scheduled push) | ≤ 5 minutes from scheduled time | hmis_export_lag_seconds |
| API error rate (5xx) | ≤ 0.5% | http_requests_total{status=~"5.."} |
| De-identification job completion | ≤ 10 minutes p95 for cohorts ≤ 50k rows | deident_job_duration_seconds |
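
The ratio-based SLOs in this table (cohort refresh and HMIS export success rates) come down to dividing a success counter by a total counter over the window. The following is a minimal TypeScript sketch of that check, assuming the two counter values have already been read for the window. The `evaluateRatioSlo` helper and its field names are hypothetical and not part of the service.

```typescript
// Hypothetical helper: names and shapes are illustrative, not the service's API.
interface RatioSloInput {
  good: number;   // e.g. cohort_refresh_jobs_succeeded over the window
  total: number;  // e.g. cohort_refresh_jobs_total over the window
  target: number; // SLO target as a fraction, e.g. 0.98
}

interface RatioSloResult {
  ratio: number;           // observed success ratio (1 when there were no events)
  met: boolean;            // true when ratio >= target
  budgetRemaining: number; // fraction of error budget left; negative when overspent
}

function evaluateRatioSlo({ good, total, target }: RatioSloInput): RatioSloResult {
  if (total < 0 || good < 0 || good > total) {
    throw new RangeError("expected 0 <= good <= total");
  }
  if (target <= 0 || target >= 1) {
    throw new RangeError("target must be strictly between 0 and 1");
  }
  // No events in the window: treat the SLO as met rather than dividing by zero.
  const ratio = total === 0 ? 1 : good / total;
  const allowedFailures = (1 - target) * total;
  const actualFailures = total - good;
  const budgetRemaining =
    allowedFailures === 0 ? 1 : 1 - actualFailures / allowedFailures;
  return { ratio, met: ratio >= target, budgetRemaining };
}
```

For example, 990 successful refreshes out of 1000 against the 98% target leaves roughly half of the error budget, while 970 out of 1000 overspends it.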

2. Key Metrics

| Metric name | Type | Labels | Description |
|---|---|---|---|
| pophealth_dashboard_requests_total | Counter | tenant_id, status | Dashboard API calls |
| pophealth_cohort_refresh_duration_seconds | Histogram | tenant_id, status | Cohort refresh job duration |
| pophealth_cohort_membership_count | Gauge | tenant_id, cohort_id | Last evaluated membership |
| pophealth_risk_jobs_total | Counter | tenant_id, model_key, status | Risk scoring jobs |
| pophealth_hmis_export_lag_seconds | Gauge | tenant_id, indicator_family | Seconds behind scheduled push time |
| pophealth_hmis_export_total | Counter | tenant_id, indicator_family, status | HMIS push outcomes |
| pophealth_deident_jobs_total | Counter | tenant_id, status | De-id export outcomes |
| pophealth_outreach_items_by_status | Gauge | tenant_id, status | Live outreach funnel |
| pophealth_quality_snapshots_computed_total | Counter | tenant_id, program | QM snapshot completions |
| pophealth_outbox_lag_seconds | Gauge | (none) | Age of oldest unpublished outbox message |
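
The `pophealth_outbox_lag_seconds` gauge is described as the age of the oldest unpublished outbox message. The following sketch shows one way that value could be derived from a set of outbox rows. The `OutboxRow` shape and the `outboxLagSeconds` function are assumptions for illustration; the real outbox schema and the code that sets the gauge are not shown in this document.

```typescript
// Hypothetical row shape; the real outbox schema is not shown in this document.
interface OutboxRow {
  id: string;
  createdAt: Date;
  publishedAt: Date | null; // null while the message is still unpublished
}

// Returns the age in seconds of the oldest unpublished row,
// or 0 when every row has been published (or there are no rows).
function outboxLagSeconds(rows: OutboxRow[], now: Date): number {
  let oldest: number | null = null;
  for (const row of rows) {
    if (row.publishedAt !== null) continue; // already published, ignore
    const created = row.createdAt.getTime();
    if (oldest === null || created < oldest) oldest = created;
  }
  if (oldest === null) return 0;
  // Clamp at 0 so clock skew never produces a negative gauge value.
  return Math.max(0, (now.getTime() - oldest) / 1000);
}
```

Returning 0 for a drained outbox keeps the gauge continuous, so a threshold comparison does not need to handle a missing series.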

3. Traces

All use cases emit OpenTelemetry traces via @ghasi/telemetry, which is initialized before NestFactory. Key span names:

| Span | Attributes |
|---|---|
| pophealth.dashboard.query | tenant_id, facility_id, result_count |
| pophealth.cohort.refresh | cohort_id, version, member_delta |
| pophealth.risk.score | model_key, scope, tier_counts |
| pophealth.hmis.push | indicator_family, period, dhis2_import_count |
| pophealth.deident.pipeline | cohort_id, k_value, epsilon, row_count |
| pophealth.outreach.status_change | item_id, prev_status, new_status |
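
One way to keep emitted spans aligned with this list is a typed map from span name to expected attribute keys, checked in tests or at span creation. The following sketch assumes nothing about the `@ghasi/telemetry` API, which is not shown here. `SPAN_ATTRIBUTES` and `missingAttributes` are hypothetical names introduced for illustration.

```typescript
// Span names and attribute keys copied from the span list in this section.
const SPAN_ATTRIBUTES = {
  "pophealth.dashboard.query": ["tenant_id", "facility_id", "result_count"],
  "pophealth.cohort.refresh": ["cohort_id", "version", "member_delta"],
  "pophealth.risk.score": ["model_key", "scope", "tier_counts"],
  "pophealth.hmis.push": ["indicator_family", "period", "dhis2_import_count"],
  "pophealth.deident.pipeline": ["cohort_id", "k_value", "epsilon", "row_count"],
  "pophealth.outreach.status_change": ["item_id", "prev_status", "new_status"],
} as const;

// Only the documented span names are accepted at compile time.
type SpanName = keyof typeof SPAN_ATTRIBUTES;

// Returns the documented attribute keys that are absent from `attrs`.
// Hypothetical helper, e.g. for a unit test around span creation.
function missingAttributes(
  span: SpanName,
  attrs: Record<string, unknown>,
): string[] {
  const expected: readonly string[] = SPAN_ATTRIBUTES[span];
  return expected.filter((key) => !(key in attrs));
}
```

Because `SpanName` is derived from the map, a typo such as `"pophealth.cohort.refesh"` fails type checking instead of silently producing an undocumented span.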

4. Dashboards

| Dashboard | Key panels |
|---|---|
| Population Health Overview | Active patients by tenant/facility, registry counts, compliance rates |
| Cohort Engine | Refresh job queue depth, success/failure rate, p95 duration |
| HMIS Export Pipeline | Export lag per indicator family, DHIS2 response times, failure rate |
| Quality Metrics | Snapshot freshness, rate trends by program |
| Outreach Funnel | Items by status, conversion rate (pending→completed) |
| Security / Audit | PHI access events, export requests, consent violations |

5. Alerts

| Alert | Condition | Severity | Runbook |
|---|---|---|---|
| Dashboard p95 latency breach | > 2500 ms for 5 min | Warning | runbooks/pophealth-dashboard-latency.md |
| HMIS export lag critical | > 30 min behind schedule | Critical | runbooks/pophealth-hmis-lag.md |
| HMIS export failure | 3 consecutive failures | Critical | runbooks/pophealth-hmis-failure.md |
| Cohort refresh queue depth | > 50 pending jobs for > 10 min | Warning | runbooks/pophealth-cohort-queue.md |
| Outbox lag | Oldest unpublished message > 5 min | Warning | runbooks/pophealth-outbox-lag.md |
| De-ident k-threshold violation rate | > 5% of export jobs suppressed | Warning | runbooks/pophealth-deident.md |
| 5xx error rate spike | > 1% for 5 min | Critical | runbooks/pophealth-errors.md |
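
Two condition shapes recur in these alerts: a threshold that must hold for a sustained period (for example "> 2500 ms for 5 min") and a count of consecutive failures. The sketch below evaluates both over in-memory samples to make the semantics concrete. It is an illustration under stated assumptions, not the alerting implementation: the alerting system and its rule language are not named in this document, and `breachedFor` and `trailingFailures` are hypothetical helpers.

```typescript
interface Sample {
  timestampMs: number; // sample time in epoch milliseconds, ascending order
  value: number;
}

// True when the most recent unbroken run of samples above `threshold`
// started at least `windowMs` before `nowMs`.
function breachedFor(
  samples: Sample[],
  threshold: number,
  windowMs: number,
  nowMs: number,
): boolean {
  let runStart: number | null = null;
  // Walk backwards from the newest sample until the breach is interrupted.
  for (let i = samples.length - 1; i >= 0; i--) {
    const sample = samples[i];
    if (!sample || sample.value <= threshold) break;
    runStart = sample.timestampMs;
  }
  return runStart !== null && nowMs - runStart >= windowMs;
}

type Outcome = "success" | "failure";

// Counts failures at the end of the list, stopping at the latest success.
function trailingFailures(outcomes: ReadonlyArray<Outcome>): number {
  let count = 0;
  for (let i = outcomes.length - 1; i >= 0 && outcomes[i] === "failure"; i--) {
    count++;
  }
  return count;
}
```

Under these assumptions, the dashboard latency condition would be `breachedFor(p95Samples, 2500, 5 * 60_000, Date.now())` and the HMIS export failure condition would be `trailingFailures(runs) >= 3`, where `p95Samples` and `runs` are placeholder inputs.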

6. Logs

Structured JSON logs via @ghasi/telemetry logger. All log lines include traceId, spanId, tenantId, actorId.

PHI must never appear in log lines. Patient identifiers are replaced with patientId=<redacted> in debug logs.
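
As a sketch of the two rules above, the following builds a structured JSON log line that always carries `traceId`, `spanId`, `tenantId`, and `actorId`, and replaces patient identifiers with `<redacted>`. It does not use the `@ghasi/telemetry` logger, whose API is not shown in this document. `buildLogLine` and `redactMessage` are hypothetical, and the redaction covers only a `patientId` field and `patientId=...` text; other PHI would need its own handling.

```typescript
interface LogContext {
  traceId: string;
  spanId: string;
  tenantId: string;
  actorId: string;
}

const REDACTED = "<redacted>";

// Replaces `patientId=<value>` occurrences in free text with the redacted marker.
function redactMessage(message: string): string {
  return message.replace(/patientId=[^\s,;]+/g, `patientId=${REDACTED}`);
}

// Hypothetical builder: returns one JSON log line with the required context fields.
function buildLogLine(
  level: "debug" | "info" | "warn" | "error",
  message: string,
  ctx: LogContext,
  fields: Record<string, unknown> = {},
): string {
  const safeFields: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(fields)) {
    // A top-level patientId field is never written with its real value.
    safeFields[key] = key === "patientId" ? REDACTED : value;
  }
  // Context is spread last so extra fields cannot overwrite trace or tenant identifiers.
  return JSON.stringify({
    ...safeFields,
    level,
    message: redactMessage(message),
    ...ctx,
  });
}
```

Redacting inside the builder, rather than at each call site, means a caller that forgets the rule still produces a compliant line for the cases this sketch covers.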