Communication Service — Failure Modes

Status: populated | Owner: TBD | Last updated: 2026-04-17 | Companion: Service Template

1. Failure catalog

| # | Failure | Trigger | User impact | Detection | Mitigation |
|---|---------|---------|-------------|-----------|------------|
| 1 | SMS provider outage | Ghasi-SMS-Gateway 5xx or timeout | Reminders / critical alerts delayed | per-adapter error rate > 20% 5 min | Failover to secondary provider; retry with backoff; alert |
| 2 | Push feedback stall | FCM / APNs feedback not flowing | dispatched never settles to delivered | feedback age > 10 min | Force-refresh feedback; alert; mark stale dispatches as unknown |
| 3 | Email bounce spike | Bad template / domain reputation | Legitimate notifications marked undeliverable | bounce ratio > 5% | Throttle category; page on-call; auto-disable category for tenant |
| 4 | Virtual room unreachable | Jitsi node down | Patients see "session failed"; fallback thread spawned | virtual_session.failed count > 5 in 5 min | Auto-spawn fallback thread; page on-call; failover Jitsi pool |
| 5 | Database unavailable | Primary DB failover | 503 on writes; reads degrade to replica | DB conn errors | Replica read mode; queue writes to outbox preflight; alert |
| 6 | NATS unavailable | JetStream cluster partition | Outbox lag grows; events not emitted | lag > 30 s | Outbox retains; replay after recovery; alert |
| 7 | Redis cache miss flood | Cache eviction | Idempotency keys lost; duplicate 2xx possible | cache hit ratio < 90% | Failure-open idempotency using DB fallback; warm cache |
| 8 | Attachment scan stuck | AV scanner queue backed up | Uploads pending indefinitely | scan queue > 1000 for 15 min | Scale scanner; block uploads > N; alert |
| 9 | Outbox relay crash loop | Schema drift or poison event | Events halt | relay liveness fails | Identify poison event; move to DLQ; redeploy |
| 10 | Join-token forged | KMS key leak | Unauthorized VC entry | token verify fail rate spike | Rotate key; revoke sessions; forensic review |
| 11 | Thread PHI leak in push | Bug in dispatch payload | Privacy breach | Pre-send assertion (code) + periodic PII scanner in logs | Halt affected category; rotate template; incident report |
| 12 | Cross-tenant participant added | Bug bypasses tenant check | Data exposure | Integration test + runtime RLS | Reject at DB level (RLS) + application check; alert |
| 13 | DLR callback flood | Provider retries excessive | CPU saturation on worker-dlr | RPS > 10x baseline | Per-provider rate limit at Kong; scale workers |
| 14 | Fallback-loop | Repeated VC failures all spawn threads | Thread proliferation | fallback_initiated rate > 20x | Circuit breaker per patient/day; alert |
| 15 | GDPR erase partial | Event stream missed an event | Residual PII in dispatch log | Periodic reconciliation job | Replay erasure saga; alert |
| 16 | Clock skew | NTP drift on host | Read-receipt ordering wrong | skew > 500 ms | NTP enforcement; alert |
| 17 | Large-attachment OOM | > budget upload | 413 to client | Pre-signed URL size cap | Enforce maxUploadSize at Kong and service |
| 18 | Scheduling event re-delivery | JetStream ack loss | Duplicate sessions | Inbox dedupe | Unique (tenant, appointment_id) + dedupe |
| 19 | Template not found | Deploy mismatch | Dispatch fails | template_not_found count | Fail-closed; alert; show internal-ops notice |
| 20 | Recording ingest failure | Blob path denied | Recording missing | recording.failed event | Retry with backoff; alert; no user action |
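
The mitigation for failure 1 (failover to a secondary provider, retry with backoff) could be sketched roughly as follows. This is an illustrative sketch only, not the service's actual adapter code: `ProviderError`, `send_with_failover`, the retry count, and the base delay are all hypothetical names and values chosen for the example.

```python
import time
from typing import Callable, Optional, Sequence


class ProviderError(Exception):
    """Hypothetical error an adapter raises on a 5xx response or timeout."""


def send_with_failover(
    providers: Sequence[Callable[[str], str]],
    message: str,
    max_attempts: int = 3,        # assumed value, not from the catalog
    base_delay_s: float = 0.5,    # assumed value, not from the catalog
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Try providers in order; retry each with exponential backoff."""
    last_error: Optional[ProviderError] = None
    for send in providers:
        for attempt in range(max_attempts):
            try:
                return send(message)
            except ProviderError as exc:
                last_error = exc
                # Back off before retrying the same provider; skip the
                # wait after the final attempt and fail over instead.
                if attempt < max_attempts - 1:
                    sleep(base_delay_s * 2 ** attempt)
    raise ProviderError("all providers failed") from last_error
```

Passing `sleep` as a parameter keeps the backoff testable without real waiting. Alerting, which the catalog also lists for this failure, is left out of the sketch.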

2. Error budget policy

  • If either SLO (send latency or dispatch success) burns more than 50% of its 30-day error budget in any 7-day window, place a feature freeze on new adapter changes.
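
One way to read the freeze rule above is as a sliding-window check over daily request counts. The sketch below assumes that interpretation: the error budget is taken as the allowed failures over the whole 30-day period, and a freeze triggers when any consecutive 7 days consume more than half of it. The function name `should_freeze`, the input shape, and the example SLO target are hypothetical, not taken from the service.

```python
from typing import Sequence, Tuple


def should_freeze(
    daily: Sequence[Tuple[int, int]],  # (total_requests, failed_requests) per day
    slo_target: float,                 # e.g. 0.99 (assumed example value)
    window_days: int = 7,
    burn_threshold: float = 0.5,
) -> bool:
    """Return True if any window burns more than the threshold share of budget."""
    total = sum(t for t, _ in daily)
    # Allowed failures over the full period (the 30-day error budget).
    budget = (1.0 - slo_target) * total
    if budget <= 0:
        # No budget at all: any failure counts as an overrun.
        return any(f > 0 for _, f in daily)
    for start in range(len(daily) - window_days + 1):
        window_failed = sum(f for _, f in daily[start:start + window_days])
        if window_failed / budget > burn_threshold:
            return True
    return False
```

For example, with 1000 requests per day and a 0.99 target, the 30-day budget is about 300 failed requests, so a week with 25 failures per day (175 in total) exceeds the 50% threshold.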