Communication Service — Failure Modes
Status: populated · Owner: TBD · Last updated: 2026-04-17 · Companion: Service Template
1. Failure catalog
| # | Failure | Trigger | User impact | Detection | Mitigation |
|---|---|---|---|---|---|
| 1 | SMS provider outage | Ghasi-SMS-Gateway 5xx or timeout | Reminders / critical alerts delayed | per-adapter error rate > 20% over 5 min | Failover to secondary provider; retry with backoff; alert |
| 2 | Push feedback stall | FCM / APNs feedback not flowing | Dispatches stay in dispatched and never settle to delivered | feedback age > 10 min | Force-refresh feedback; alert; mark stale dispatches as unknown |
| 3 | Email bounce spike | Bad template / domain reputation | Legitimate notifications marked undeliverable | bounce ratio > 5% | Throttle category; page on-call; auto-disable category for tenant |
| 4 | Virtual room unreachable | Jitsi node down | Patients see "session failed"; fallback thread spawned | virtual_session.failed count > 5 in 5 min | Auto-spawn fallback thread; page on-call; failover Jitsi pool |
| 5 | Database unavailable | Primary DB failover | 503 on writes; reads degrade to replica | DB connection errors | Replica read mode; queue writes to outbox preflight; alert |
| 6 | NATS unavailable | JetStream cluster partition | Outbox lag grows; events not emitted | lag > 30 s | Outbox retains; replay after recovery; alert |
| 7 | Redis cache miss flood | Cache eviction | Idempotency keys lost; duplicate 2xx responses possible | cache hit ratio < 90% | Fail-open idempotency using DB fallback; warm cache |
| 8 | Attachment scan stuck | AV scanner queue backed up | Uploads pending indefinitely | scan queue > 1000 for 15 min | Scale scanner; block uploads > N; alert |
| 9 | Outbox relay crash loop | Schema drift or poison event | Events halt | relay liveness fails | Identify poison event; move to DLQ; redeploy |
| 10 | Join-token forged | KMS key leak | Unauthorized VC entry | token verify fail rate spike | Rotate key; revoke sessions; forensic review |
| 11 | Thread PHI leak in push | Bug in dispatch payload | Privacy breach | Pre-send assertion (code) + periodic PII scanner in logs | Halt affected category; rotate template; incident report |
| 12 | Cross-tenant participant added | Bug bypasses tenant check | Data exposure | Integration test + runtime RLS | Reject at DB level (RLS) + application check; alert |
| 13 | DLR callback flood | Provider retries excessive | CPU saturation on worker-dlr | RPS > 10x baseline | Per-provider rate limit at Kong; scale workers |
| 14 | Fallback loop | Repeated VC failures all spawn threads | Thread proliferation | fallback_initiated rate > 20x baseline | Circuit breaker per patient/day; alert |
| 15 | GDPR erase partial | Event stream missed an event | Residual PII in dispatch log | Periodic reconciliation job | Replay erasure saga; alert |
| 16 | Clock skew | NTP drift on host | Read-receipt ordering wrong | skew > 500 ms | NTP enforcement; alert |
| 17 | Large-attachment OOM | Upload exceeds size budget | 413 to client | Pre-signed URL size cap | Enforce maxUploadSize at Kong and service |
| 18 | Scheduling event re-delivery | JetStream ack loss | Duplicate sessions | Inbox dedupe | Unique (tenant, appointment_id) + dedupe |
| 19 | Template not found | Deploy mismatch | Dispatch fails | template_not_found count | Fail-closed; alert; show internal-ops notice |
| 20 | Recording ingest failure | Blob path denied | Recording missing | recording.failed event | Retry with backoff; alert; no user action |
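
The mitigation for failure 1 (failover to a secondary provider, retry with backoff) could look roughly like the sketch below. It is illustrative only and not the service's actual adapter code: `ProviderError`, `backoff_delays`, `send_with_failover`, the three-attempt limit, and the 0.5 s base / 8 s cap are all assumptions made for the example, and alerting is left out.

```python
import time
from typing import Callable, Optional, Sequence


class ProviderError(Exception):
    """Hypothetical error an adapter raises on a provider 5xx or timeout."""


def backoff_delays(attempts: int, base: float = 0.5, cap: float = 8.0) -> list[float]:
    """Exponential backoff schedule (no jitter): base, 2*base, 4*base, ... up to cap."""
    return [min(cap, base * (2 ** i)) for i in range(attempts)]


def send_with_failover(
    message: str,
    providers: Sequence[Callable[[str], str]],
    attempts_per_provider: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Try each provider in order; retry with backoff before failing over."""
    last_error: Optional[ProviderError] = None
    for provider in providers:
        delays = backoff_delays(attempts_per_provider)
        for attempt, delay in enumerate(delays):
            try:
                return provider(message)
            except ProviderError as exc:
                last_error = exc
                # Wait between retries, but not before moving to the next provider.
                if attempt < len(delays) - 1:
                    sleep(delay)
    raise ProviderError("all providers failed") from last_error
```

Passing `sleep` as a parameter keeps the retry schedule testable without real delays. A production version would likely add jitter to the delays and emit the alert named in the catalog when failover occurs.
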
2. Error budget policy
- If either SLO (send latency or dispatch success) burns more than 50% of its 30-day error budget in any 7-day window, freeze new adapter changes.
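
As a worked illustration of the freeze rule, the sketch below computes how much of a 30-day budget a single 7-day window has consumed. It assumes an event-based SLO where the budget is `(1 - target) * total events in 30 days`. The names `SloWindow`, `budget_burned`, and `should_freeze` and the example target of 0.99 are hypothetical and do not come from the service. The policy applies to any 7-day window, so a real check would run this over every rolling window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloWindow:
    slo_target: float      # e.g. 0.99 dispatch success (assumed example value)
    total_events_30d: int  # events counted over the 30-day SLO period
    bad_events_7d: int     # SLO-violating events inside one 7-day window


def budget_burned(window: SloWindow) -> float:
    """Fraction of the 30-day error budget consumed within the 7-day window."""
    allowed_bad = (1.0 - window.slo_target) * window.total_events_30d
    if allowed_bad <= 0:
        raise ValueError("SLO target leaves no error budget")
    return window.bad_events_7d / allowed_bad


def should_freeze(window: SloWindow, threshold: float = 0.5) -> bool:
    """True when the window burned more than `threshold` of the 30-day budget."""
    return budget_burned(window) > threshold
```

For example, with a 0.99 target and 1,000,000 events in 30 days, the budget is about 10,000 bad events; a 7-day window with 6,000 bad events has burned roughly 60% of it and would trigger the freeze, while 4,000 bad events (roughly 40%) would not.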