Population Health Service — Failure Modes

Status: populated
Owner: TBD
Last updated: 2026-04-18
Companion: Service Template · 03 platform-services

1. Failure Catalog

| # | Failure | User / System Impact | Detection | Mitigation |
|---|---------|----------------------|-----------|------------|
| F01 | DHIS2 API unreachable at scheduled export time | HMIS indicators not delivered to MoPH on schedule | pophealth_hmis_export_lag_seconds alert; HTTP 503 from adapter | Retry with exponential backoff (max 3 attempts); emit hmis_export.failed event; page on-call; manual re-trigger via API |
| F02 | DHIS2 accepts push but returns import errors (partial failure) | Some indicators silently not imported | Parse DHIS2 importSummary in response; alert if ignored > 0 | Log import summary; emit event with counts; alert analyst; schedule re-push for ignored rows |
| F03 | Cohort refresh job hangs (worker crash mid-run) | Cohort membership stale; dashboard shows old data | cohort_refresh_duration_seconds histogram; job status timeout alert (> 10 min) | Job TTL: mark as failed after timeout; re-enqueue; dataFreshness metadata exposed in dashboard response |
| F04 | PostgreSQL connection pool exhausted under analytics load | API timeouts; 503 responses | Connection pool metrics; http_requests_total{status=5xx} spike | Pool size tuned per environment; read-only replicas for heavy analytics queries; circuit breaker on connection timeout |
| F05 | k-anonymity threshold violation during de-identification | Export silently suppressed; researcher blocked | DEIDENT_K_THRESHOLD_VIOLATION returned to caller; metric incremented | Return structured error with suppression count; no partial release; analyst can adjust cohort to meet threshold |
| F06 | Consent check service (access-policy) unavailable | All secondary-use exports blocked | HTTP 5xx from access-policy adapter | Fail-closed: deny export if consent check unavailable; return 503 CONSENT_SERVICE_UNAVAILABLE |
| F07 | NATS JetStream partition unavailable | Domain events not published; downstream consumers miss updates | Outbox lag alert (oldest unpublished > 5 min) | Transactional outbox: events accumulate in DB; replay when NATS recovers; no data loss |
| F08 | Upstream clinical feed delay (e.g., patient-chart-service slow) | Dashboard dataFreshness timestamp becomes stale | Freshness metadata staleness alert (> 2h behind) | Serve stale aggregates with explicit dataFreshness in response; alert; do not return 503 for stale data |
| F09 | Duplicate cohort refresh jobs submitted (race condition) | Wasted compute; potential inconsistent membership | Unique constraint on (cohort_id, status='running') in DB | Coalesce logic in use case: check for active job before enqueue; return existing jobId |
| F10 | Object storage unavailable for export file write | De-ident export job fails; researcher receives error | HTTP 5xx from storage adapter | Retry 3 times; if persistent failure, mark job failed; emit alert; researcher can re-trigger |
| F11 | HMIS export duplicate for same period | Double-counting in DHIS2 | Unique constraint on (tenant_id, indicator_family, period, status) | Reject duplicate job creation with EXPORT_JOB_ACTIVE; operator can force-override with forceResubmit: true |
| F12 | Memory pressure during large cohort refresh (> 500k rows) | OOM kill; job fails | Container memory limit alert | Stream cohort computation in batches of 10k; worker resource limits enforced in Kubernetes |
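
The retry mitigation for F01 (and the similar three-attempt retry in F10) could be sketched roughly as follows. This is a minimal illustration, not the service's actual code: `ExportUnavailable`, `retry_with_backoff`, the 1-second base delay, and the injectable `sleep` parameter are all assumptions introduced here. Only the cap of 3 attempts comes from the catalog.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ExportUnavailable(Exception):
    """Hypothetical error the adapter raises on HTTP 503 or a connection failure."""


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 3,        # from F01: max 3 attempts
    base_delay: float = 1.0,      # assumed value; not specified in the catalog
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `operation`, doubling the wait after each failed attempt."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ExportUnavailable:
            if attempt == max_attempts:
                # Attempts exhausted: re-raise so the caller can emit the
                # failure event and page on-call.
                raise
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")
```

With the defaults above, the waits between attempts are 1 s and then 2 s. Re-raising on the final attempt leaves event emission and paging to the caller, which keeps the helper reusable for both the DHIS2 push and the object storage write.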
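
The fail-closed rule in F06 can be illustrated with a small decision function. This is a sketch under stated assumptions: `ConsentServiceError`, `ExportDecision`, `authorize_export`, and the `403 CONSENT_DENIED` outcome are hypothetical names introduced for the example. Only the `503 CONSENT_SERVICE_UNAVAILABLE` response comes from the catalog.

```python
from dataclasses import dataclass
from typing import Callable


class ConsentServiceError(Exception):
    """Hypothetical error raised by the access-policy adapter on HTTP 5xx or timeout."""


@dataclass(frozen=True)
class ExportDecision:
    allowed: bool
    http_status: int
    code: str


def authorize_export(check_consent: Callable[[str], bool], cohort_id: str) -> ExportDecision:
    """Allow an export only when the consent check positively succeeds."""
    try:
        consented = check_consent(cohort_id)
    except ConsentServiceError:
        # Fail closed: an unavailable consent service is treated as a denial.
        return ExportDecision(False, 503, "CONSENT_SERVICE_UNAVAILABLE")
    if not consented:
        # Assumed status and code for an explicit denial; not defined in the catalog.
        return ExportDecision(False, 403, "CONSENT_DENIED")
    return ExportDecision(True, 200, "OK")
```

The key property is that no code path returns `allowed=True` unless the consent check returned a positive answer, so an outage can block exports but cannot release data.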
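
The coalescing behavior in F09 pairs an application-level check with a database constraint as the backstop. The sketch below uses an in-memory SQLite database purely as a stand-in for PostgreSQL so that it runs on its own. The table name `cohort_refresh_jobs`, the index name, and the function names are assumptions, and the constraint is expressed as a partial unique index, which is one possible way to enforce "at most one running job per cohort".

```python
import sqlite3
import uuid
from typing import Tuple

FIND_RUNNING = (
    "SELECT job_id FROM cohort_refresh_jobs "
    "WHERE cohort_id = ? AND status = 'running'"
)


def open_job_store() -> sqlite3.Connection:
    """Create an in-memory job table (stand-in for the real PostgreSQL schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE cohort_refresh_jobs ("
        "job_id TEXT PRIMARY KEY, cohort_id TEXT NOT NULL, status TEXT NOT NULL)"
    )
    # Partial unique index: at most one 'running' job per cohort.
    conn.execute(
        "CREATE UNIQUE INDEX one_running_job_per_cohort "
        "ON cohort_refresh_jobs (cohort_id) WHERE status = 'running'"
    )
    return conn


def enqueue_refresh(conn: sqlite3.Connection, cohort_id: str) -> Tuple[str, bool]:
    """Return (job_id, created). An active job is reused instead of duplicated."""
    row = conn.execute(FIND_RUNNING, (cohort_id,)).fetchone()
    if row:
        return row[0], False
    job_id = str(uuid.uuid4())
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO cohort_refresh_jobs (job_id, cohort_id, status) "
                "VALUES (?, ?, 'running')",
                (job_id, cohort_id),
            )
    except sqlite3.IntegrityError:
        # Lost the race to a concurrent submitter: return that job instead.
        row = conn.execute(FIND_RUNNING, (cohort_id,)).fetchone()
        if row is None:
            raise
        return row[0], False
    return job_id, True
```

The pre-check handles the common case cheaply, while the unique index covers the window between the check and the insert, where two submitters could otherwise both enqueue.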

2. Degraded Mode Behavior

| Condition | Degraded Behavior |
|-----------|-------------------|
| Offline (facility-level) | Facility aggregate reports generated locally; queued for sync; API returns offline mode indicator |
| DHIS2 unreachable | Accumulate export data; push when connectivity restored; alert if > 24h lag |
| NATS unavailable | Use transactional outbox; serve reads normally; writes committed to DB |
| Analytics DB read replica lag > 30s | Fall through to primary; log replica lag metric |
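
The replica-lag rule in the last row could be expressed as a small routing function. This is a sketch only: `choose_read_target`, the logger name, and the choice to treat an unknown lag as a reason to use the primary are assumptions made for the example. The 30-second threshold comes from the table.

```python
import logging
from typing import Optional

logger = logging.getLogger("pophealth.analytics")  # hypothetical logger name

REPLICA_LAG_THRESHOLD_SECONDS = 30.0  # from the degraded-mode table


def choose_read_target(
    replica_lag_seconds: Optional[float],
    threshold: float = REPLICA_LAG_THRESHOLD_SECONDS,
) -> str:
    """Return 'replica' while lag is within the threshold, otherwise 'primary'."""
    if replica_lag_seconds is None:
        # Assumption: if lag cannot be measured, prefer the primary.
        logger.warning("replica lag unknown; routing read to primary")
        return "primary"
    if replica_lag_seconds > threshold:
        logger.warning(
            "replica lag %.1fs exceeds %.1fs; routing read to primary",
            replica_lag_seconds,
            threshold,
        )
        return "primary"
    return "replica"
```

Routing on a strict "greater than" comparison means a lag of exactly 30 s still reads from the replica, which matches the "> 30s" wording of the condition.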