Population Health Service — Failure Modes

Status: populated | Owner: TBD | Last updated: 2026-04-18 | Companion: Service Template · 03 platform-services

1. Failure Catalog

#	Failure	User / System Impact	Detection	Mitigation
F01	DHIS2 API unreachable at scheduled export time	HMIS indicators not delivered to MoPH on schedule	`pophealth_hmis_export_lag_seconds` alert; HTTP 503 from adapter	Retry with exponential backoff (max 3 attempts); emit `hmis_export.failed` event; page on-call; manual re-trigger via API
F02	DHIS2 accepts push but returns import errors (partial failure)	Some indicators silently not imported	Parse DHIS2 `importSummary` in response; alert if `ignored > 0`	Log import summary; emit event with counts; alert analyst; schedule re-push for ignored rows
F03	Cohort refresh job hangs (worker crash mid-run)	Cohort membership stale; dashboard shows old data	`cohort_refresh_duration_seconds` histogram; job status timeout alert (> 10 min)	Job TTL: mark as failed after timeout; re-enqueue; `dataFreshness` metadata exposed in dashboard response
F04	PostgreSQL connection pool exhausted under analytics load	API timeouts; 503 responses	Connection pool metrics; spike in `http_requests_total{status=~"5.."}`	Pool size tuned per environment; read-only replicas for heavy analytics queries; circuit breaker on connection timeout
F05	k-anonymity threshold violation during de-identification	Export suppressed in full; researcher blocked until the cohort meets the threshold	`DEIDENT_K_THRESHOLD_VIOLATION` returned to caller; metric incremented	Return structured error with suppression count; no partial release; analyst can adjust cohort to meet threshold
F06	Consent check service (access-policy) unavailable	All secondary-use exports blocked	HTTP 5xx from access-policy adapter	Fail-closed: deny export if consent check unavailable; return 503 `CONSENT_SERVICE_UNAVAILABLE`
F07	NATS JetStream partition unavailable	Domain events not published; downstream consumers miss updates	Outbox lag alert (oldest unpublished > 5 min)	Transactional outbox: events accumulate in DB; replay when NATS recovers; no data loss
F08	Upstream clinical feed delay (e.g., patient-chart-service slow)	Dashboard `dataFreshness` timestamp becomes stale	Freshness metadata staleness alert (> 2h behind)	Serve stale aggregates with explicit `dataFreshness` in response; alert; do not return 503 for stale data
F09	Duplicate cohort refresh jobs submitted (race condition)	Wasted compute; potentially inconsistent membership	Partial unique index on `cohort_id` where `status = 'running'`; second insert fails with a constraint violation	Coalesce logic in use case: check for active job before enqueue; return existing `jobId`
F10	Object storage unavailable for export file write	De-ident export job fails; researcher receives error	HTTP 5xx from storage adapter	Retry 3 times; if persistent failure, mark job `failed`; emit alert; researcher can re-trigger
F11	HMIS export duplicate for same period	Double-counting in DHIS2	Unique constraint on `(tenant_id, indicator_family, period, status)`	Reject duplicate job creation with `EXPORT_JOB_ACTIVE`; operator can force-override with `forceResubmit: true`
F12	Memory pressure during large cohort refresh (> 500k rows)	OOM kill; job fails	Container memory limit alert	Stream cohort computation in batches of 10k; worker resource limits enforced in Kubernetes
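
The retry mitigation in F01 and F10 (exponential backoff, at most 3 attempts) could be sketched as below. This is a minimal illustration, not the service's implementation: `TransientError`, `retry_with_backoff`, and the 1-second base delay are hypothetical names and values chosen for the example.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for retryable failures (e.g. HTTP 503 from an adapter)."""


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 3,          # the catalog specifies a maximum of 3 attempts
    base_delay_s: float = 1.0,      # assumed base delay; not stated in the catalog
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation`, retrying transient failures with doubling delays."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                # Out of attempts: re-raise so the caller can mark the job
                # failed, emit the failure event, and alert.
                raise
            # Wait base, 2*base, 4*base, ... between attempts.
            sleep(base_delay_s * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps the helper testable without real waiting. Non-transient errors propagate immediately, so only failures the adapter explicitly classifies as retryable consume attempts.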

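The partial-failure check in F02 (alert if `ignored > 0`) might look like the following sketch. It assumes the DHIS2 response carries an `importCount` object with `imported`, `updated`, and `ignored` counters, optionally nested under a `response` key; the exact payload shape depends on the DHIS2 version and endpoint, so verify it against the target instance. `summarize_import` is a hypothetical helper name.

```python
from typing import Any, Mapping


def summarize_import(payload: Mapping[str, Any]) -> dict[str, Any]:
    """Extract import counters and flag partial failures (ignored > 0)."""
    # Assumption: some DHIS2 versions nest the summary under "response";
    # accept both the nested and the flat shape.
    summary = payload.get("response", payload)
    counts = summary.get("importCount", {})
    result: dict[str, Any] = {
        "status": summary.get("status", "UNKNOWN"),
        "imported": int(counts.get("imported", 0)),
        "updated": int(counts.get("updated", 0)),
        "ignored": int(counts.get("ignored", 0)),
        "conflicts": list(summary.get("conflicts", [])),
    }
    # A push can be accepted overall and still drop rows, so the alert
    # condition is the ignored counter, not the HTTP status.
    result["partial_failure"] = result["ignored"] > 0
    return result
```

The returned counts can feed both the logged import summary and the event payload, and `conflicts` gives the analyst a starting point for the re-push of ignored rows.
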
2. Degraded Mode Behavior

Condition	Degraded Behavior
Offline (facility-level)	Facility aggregate reports generated locally; queued for sync; API returns offline mode indicator
DHIS2 unreachable	Accumulate export data; push when connectivity restored; alert if > 24h lag
NATS unavailable	Writes commit to the DB and events accumulate in the transactional outbox; reads served normally; events replayed when NATS recovers
Analytics DB read replica lag > 30s	Fall through to primary; log replica lag metric
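
The replica-lag rule in the last row (fall through to primary above 30s, log the lag) can be expressed as a small routing decision. The sketch below is illustrative only: `choose_read_target`, `RoutingDecision`, and the injected `get_replica_lag_s` / `record_lag` callables are hypothetical, and how lag is actually measured is left to the caller.

```python
from dataclasses import dataclass
from typing import Callable

MAX_REPLICA_LAG_S = 30.0  # threshold taken from the degraded-mode table


@dataclass(frozen=True)
class RoutingDecision:
    target: str            # "replica" or "primary"
    replica_lag_s: float


def choose_read_target(
    get_replica_lag_s: Callable[[], float],
    record_lag: Callable[[float], None] = lambda lag: None,
) -> RoutingDecision:
    """Route analytics reads to the replica unless its lag exceeds the threshold."""
    lag = get_replica_lag_s()
    # Always record the observed lag so the metric exists in both modes.
    record_lag(lag)
    target = "primary" if lag > MAX_REPLICA_LAG_S else "replica"
    return RoutingDecision(target=target, replica_lag_s=lag)
```

Falling through to the primary trades analytics isolation for freshness, so it interacts with the pool-exhaustion risk in F04; a production version would likely cap how much analytics traffic the primary accepts.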

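The transactional outbox behavior described for NATS outages (F07 and the "NATS unavailable" row) can be illustrated with an in-memory SQLite database standing in for PostgreSQL. Table names, the `cohort.refreshed` subject, and the `publish` callable are assumptions made for the example; no real NATS client is used.

```python
import json
import sqlite3
from typing import Callable


def init_schema(conn: sqlite3.Connection) -> None:
    conn.executescript(
        """
        CREATE TABLE cohort_refresh (cohort_id TEXT PRIMARY KEY, refreshed_at TEXT);
        CREATE TABLE outbox (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            subject TEXT NOT NULL,
            payload TEXT NOT NULL,
            published INTEGER NOT NULL DEFAULT 0
        );
        """
    )


def record_refresh(conn: sqlite3.Connection, cohort_id: str, refreshed_at: str) -> None:
    """Write the state change and its event in one transaction."""
    payload = json.dumps({"cohortId": cohort_id, "refreshedAt": refreshed_at})
    # "with conn" commits both inserts together or rolls both back.
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO cohort_refresh VALUES (?, ?)",
            (cohort_id, refreshed_at),
        )
        conn.execute(
            "INSERT INTO outbox (subject, payload) VALUES (?, ?)",
            ("cohort.refreshed", payload),
        )


def relay_outbox(conn: sqlite3.Connection, publish: Callable[[str, str], None]) -> int:
    """Publish unpublished events in insertion order; return how many were sent."""
    rows = conn.execute(
        "SELECT id, subject, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    sent = 0
    for row_id, subject, payload in rows:
        try:
            publish(subject, payload)
        except ConnectionError:
            break  # broker still unavailable; remaining rows wait for the next pass
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
        sent += 1
    return sent
```

Because a row is marked published only after `publish` returns, a crash between the two steps re-sends the event on the next pass. Delivery is therefore at-least-once, and downstream consumers should tolerate duplicates.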