| F01 | DHIS2 API unreachable at scheduled export time | HMIS indicators not delivered to MoPH on schedule | pophealth_hmis_export_lag_seconds alert; HTTP 503 from adapter | Retry with exponential backoff (max 3 attempts); emit hmis_export.failed event; page on-call; manual re-trigger via API |
| F02 | DHIS2 accepts push but returns import errors (partial failure) | Some indicators silently not imported | Parse DHIS2 importSummary in response; alert if its ignored count > 0 | Log import summary; emit event with counts; alert analyst; schedule re-push for ignored rows |
| F03 | Cohort refresh job hangs (worker crash mid-run) | Cohort membership stale; dashboard shows old data | cohort_refresh_duration_seconds histogram; job status timeout alert (> 10 min) | Job TTL: mark as failed after timeout; re-enqueue; dataFreshness metadata exposed in dashboard response |
| F04 | PostgreSQL connection pool exhausted under analytics load | API timeouts; 503 responses | Connection pool metrics; http_requests_total{status=~"5.."} spike | Pool size tuned per environment; read-only replicas for heavy analytics queries; circuit breaker on connection timeout |
| F05 | k-anonymity threshold violation during de-identification | Export suppressed; researcher blocked | DEIDENT_K_THRESHOLD_VIOLATION returned to caller; metric incremented | Return structured error with suppression count; no partial release; analyst can adjust cohort to meet threshold |
| F06 | Consent check service (access-policy) unavailable | All secondary-use exports blocked | HTTP 5xx from access-policy adapter | Fail-closed: deny export if consent check unavailable; return 503 CONSENT_SERVICE_UNAVAILABLE |
| F07 | NATS JetStream partition unavailable | Domain events not published; downstream consumers miss updates | Outbox lag alert (oldest unpublished > 5 min) | Transactional outbox: events accumulate in DB; replay when NATS recovers; no data loss |
| F08 | Upstream clinical feed delay (e.g., patient-chart-service slow) | Dashboard dataFreshness timestamp becomes stale | Freshness metadata staleness alert (> 2h behind) | Serve stale aggregates with explicit dataFreshness in response; alert; do not return 503 for stale data |
| F09 | Duplicate cohort refresh jobs submitted (race condition) | Wasted compute; potentially inconsistent membership | Partial unique index on cohort_id where status = 'running' in DB | Coalesce logic in use case: check for active job before enqueue; return existing jobId |
| F10 | Object storage unavailable for export file write | De-ident export job fails; researcher receives error | HTTP 5xx from storage adapter | Retry up to 3 times; if the failure persists, mark job failed; emit alert; researcher can re-trigger |
| F11 | Duplicate HMIS export submitted for the same period | Double-counting in DHIS2 | Unique constraint on (tenant_id, indicator_family, period, status) | Reject duplicate job creation with EXPORT_JOB_ACTIVE; operator can force-override with forceResubmit: true |
| F12 | Memory pressure during large cohort refresh (> 500k rows) | OOM kill; job fails | Container memory limit alert | Stream cohort computation in batches of 10k; worker resource limits enforced in Kubernetes |
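
Two of the mitigations above, the bounded retry in F01 and the batched streaming in F12, can be sketched in a few lines. This is a minimal illustration and not the service's actual code: `TransientError`, `retry_with_backoff`, `batched`, the 1-second base delay, and the 10% jitter are all assumptions introduced here; only the limit of 3 attempts and the 10k batch size come from the table.

```python
import random
import time
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for retryable failures (e.g. HTTP 503 from an adapter)."""


def retry_with_backoff(
    call: Callable[[], T],
    max_attempts: int = 3,      # F01: max 3 attempts
    base_delay: float = 1.0,    # assumed base delay in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, retrying transient failures with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except TransientError:
            if attempt == max_attempts:
                # Out of attempts: re-raise so the caller can emit the
                # failure event and alert.
                raise
            # Delay doubles per attempt (base, 2*base, ...) plus up to 10% jitter.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay * 0.1))
    raise AssertionError("unreachable")


def batched(rows: Iterable[T], size: int = 10_000) -> Iterator[List[T]]:
    """Yield rows in lists of at most `size` so memory use stays bounded (F12)."""
    if size < 1:
        raise ValueError("size must be at least 1")
    batch: List[T] = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        # Final partial batch.
        yield batch
```

Passing `sleep` as a parameter keeps the retry helper testable without real waiting. `batched` accepts any iterable, so it can wrap a server-side database cursor and hold only one batch in memory at a time.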