Document Service — Failure Modes

Status: populated | Owner: TBD | Last updated: 2026-04-18 | Companion: Service Template · 03 platform-services


1. Failure Catalog

| # | Failure | User / Platform Impact | Detection | Mitigation |
|---|---|---|---|---|
| FM-01 | FHIR gateway unavailable | Synchronous generation returns 503; async jobs fail at binding step; patients cannot get new documents | HTTP 5xx rate on FHIR calls; render job failure rate alert | Circuit breaker; retry async jobs with exponential backoff; surface error to clinician; mark job failed after 3 retries |
| FM-02 | Object storage (S3/MinIO) unavailable | Document storage fails; generation returns 503; downloads unavailable | Storage PUT/GET error metric; alert | Fail generation with structured error; cached PDFs on client may still be viewable; page on-call immediately |
| FM-03 | ClamAV unavailable | Uploads fail; no documents can be ingested via upload path | ClamAV connection error; GET /health/ready returns not ready | Reject uploads with 503 until ClamAV recovered; never skip scan |
| FM-04 | PDF renderer timeout | Single document generation exceeds 5 s SLO; 504 returned | Generation duration histogram; timeout counter | Retry via async render job; investigate template complexity; alert if p95 > 8 s |
| FM-05 | PostgreSQL unavailable | All APIs fail; render workers cannot update job status | DB connection error; readiness probe fails | K8s removes pods from LB; failover to read replica; page on-call |
| FM-06 | NATS unavailable | Events not published; audit gap; render workers may lose job coordination | NATS connection error; outbox relay failure alert | Outbox pattern: events queued in DB; relay retries on recovery; audit gap flagged |
| FM-07 | Render worker crash mid-job | Job stuck in running status; PDF not delivered | Job age in running status > 5 min alert | Watchdog process marks stale running jobs as failed after 5 min; client retries |
| FM-08 | Virus scanner false positive | Legitimate document quarantined; upload fails with VIRUS_DETECTED | User complaint; quarantine rate anomaly alert | Operator reviews quarantine bucket; manual re-scan with updated definitions; resubmit if clean |
| FM-09 | FHIR binding missing data | PDF generation fails with 422 BINDING_RESOLUTION_FAILED | 422 rate on generate endpoint | Return binding path in error; clinician resolves data gap in FHIR; retry generation |
| FM-10 | Presigned URL expiry | User clicks download link after TTL; 403 from object storage | Support ticket; 403 rate on object storage | Refresh presigned URL on next document list / download API call; short TTL by design |
| FM-11 | Template version not found at generation time | Generation fails; clinician cannot produce document | 422 rate; alert if > 1 % | Validate templateVersionId at request intake; surface clear error to clinician |
| FM-12 | config-service unavailable | Tenant branding tokens not resolved; PDF uses default theme | HTTP 5xx on config call | Fallback to default platform design tokens; generate PDF with platform defaults; log warning |
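
The FM-01 mitigation (retry async jobs with exponential backoff, mark the job failed after 3 retries) could be sketched roughly as follows. This is a minimal illustration, not the service's actual worker code: `RenderJob`, `FhirUnavailable`, `run_with_retries`, the `pending` and `succeeded` status names, and the 1 s base / 30 s cap delays are all assumptions made for the example. Only the 3-retry limit and the `failed` status come from the catalog. Jitter and the circuit breaker are left out for brevity.

```python
import time
from dataclasses import dataclass
from typing import Callable

MAX_RETRIES = 3      # from FM-01: mark job failed after 3 retries
BASE_DELAY_S = 1.0   # assumption: first retry waits 1 s
MAX_DELAY_S = 30.0   # assumption: cap on any single wait


class FhirUnavailable(Exception):
    """Hypothetical error raised when a FHIR gateway call returns 5xx."""


@dataclass
class RenderJob:
    job_id: str
    status: str = "pending"  # hypothetical initial status
    attempts: int = 0


def backoff_delay(retry: int) -> float:
    """Delay before the given retry (1-based): base * 2^(retry-1), capped."""
    return min(MAX_DELAY_S, BASE_DELAY_S * 2 ** (retry - 1))


def run_with_retries(
    job: RenderJob,
    bind_and_render: Callable[[RenderJob], None],
    sleep: Callable[[float], None] = time.sleep,
) -> RenderJob:
    """Run one initial attempt plus up to MAX_RETRIES retries."""
    retries_used = 0
    while True:
        job.attempts += 1
        try:
            bind_and_render(job)
        except FhirUnavailable:
            if retries_used == MAX_RETRIES:
                # Retries exhausted: give up and let the caller surface the error.
                job.status = "failed"
                return job
            retries_used += 1
            sleep(backoff_delay(retries_used))
        else:
            job.status = "succeeded"
            return job
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting; a production worker would more likely reschedule the job on its queue than block a thread.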

2. Degraded Mode Behaviour

| Mode | Document service behaviour |
|---|---|
| FHIR gateway down | All generation fails; uploaded scans still accepted (no binding resolution needed) |
| Object storage down | All generation + download fails; template management continues (no object storage needed) |
| ClamAV down | Uploads blocked entirely; generation unaffected |
| NATS down | All APIs continue; events queued in outbox; audit gap until NATS recovers |
| config-service down | PDF generation uses platform default tokens; template authoring continues |
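
As an illustration of the config-service row, the fallback to platform default tokens might look like the sketch below. The token names and values in `DEFAULT_TOKENS`, the `ConfigServiceError` exception, and the `fetch_tokens` callable are hypothetical stand-ins; the real client and token schema are not described in this document.

```python
import logging
from typing import Callable, Dict

logger = logging.getLogger("document_service.branding")

# Hypothetical platform defaults; the real token set is not specified here.
DEFAULT_TOKENS: Dict[str, str] = {
    "primaryColor": "#000000",
    "fontFamily": "sans-serif",
}


class ConfigServiceError(Exception):
    """Hypothetical error for a failed or 5xx config-service call."""


def resolve_design_tokens(
    tenant_id: str,
    fetch_tokens: Callable[[str], Dict[str, str]],
) -> Dict[str, str]:
    """Return tenant tokens layered over defaults, or defaults alone on failure."""
    try:
        tenant_tokens = fetch_tokens(tenant_id)
    except ConfigServiceError as exc:
        # Degraded mode: keep generating PDFs, but record that branding was skipped.
        logger.warning(
            "config-service unavailable for tenant %s; using platform defaults: %s",
            tenant_id,
            exc,
        )
        return dict(DEFAULT_TOKENS)
    # Tenant values override defaults; any missing keys fall back to defaults.
    return {**DEFAULT_TOKENS, **tenant_tokens}
```

Layering tenant tokens over the defaults means a partial tenant configuration still yields a complete token set, which keeps the healthy and degraded paths structurally identical for the renderer.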

3. Recovery Procedures

| Scenario | Recovery steps |
|---|---|
| Stale running render jobs | Watchdog marks as failed after 5 min; clients retry |
| Quarantined file review | Operator inspects quarantine-infected/ prefix; manual re-scan; contact uploader |
| Outbox replay after NATS recovery | Relay automatically processes undelivered outbox rows on reconnect |
| Object storage partition healing | No data loss; PUT retried; no duplicate if idempotency key used |
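
The stale-job recovery step can be sketched as a single watchdog pass. This is an in-memory illustration under stated assumptions: `RenderJob`, its `started_at` field, and `mark_stale_jobs` are hypothetical names, and a real watchdog would presumably run an equivalent update against the jobs table in PostgreSQL. The 5-minute threshold and the `running` and `failed` statuses come from the table above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable, List

STALE_AFTER = timedelta(minutes=5)  # from the table: failed after 5 min


@dataclass
class RenderJob:
    job_id: str
    status: str
    started_at: datetime  # assumption: when the job entered "running"


def mark_stale_jobs(jobs: Iterable[RenderJob], now: datetime) -> List[str]:
    """Mark jobs that have been running longer than STALE_AFTER as failed.

    Returns the IDs of the jobs that were changed, in input order.
    """
    failed_ids: List[str] = []
    for job in jobs:
        if job.status == "running" and now - job.started_at > STALE_AFTER:
            job.status = "failed"
            failed_ids.append(job.job_id)
    return failed_ids


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    demo = [
        RenderJob("stuck", "running", now - timedelta(minutes=6)),
        RenderJob("fresh", "running", now - timedelta(minutes=1)),
    ]
    print(mark_stale_jobs(demo, now))
```

Taking `now` as an argument rather than reading the clock inside the function keeps the pass deterministic and easy to test.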