Care Plan Service — Failure Modes
Status: populated; Owner: TBD; Last updated: 2026-04-18; Companion: Service Template · 03 platform-services · 02 DDD
Failure Catalog
| ID | Failure | User impact | Detection | Mitigation |
|---|---|---|---|---|
| FM-CP-001 | Postgres unavailable | All reads and writes fail with 503 | Health check /health/ready fails; alert CarePlanServiceDown | Retry with exponential backoff; circuit breaker; read replica fallback for FHIR reads |
| FM-CP-002 | NATS JetStream unavailable | Domain events accumulate in outbox; no immediate user impact | Outbox pending count alert CarePlanOutboxLagHigh | Outbox relay retries; events delivered once NATS recovers |
| FM-CP-003 | Outbox relay stuck | Events unpublished for more than 15 min; downstream services go stale | Alert CarePlanOutboxStuck on outbox age | Manual replay trigger; on-call investigation of relay process |
| FM-CP-004 | Terminology service unavailable | Coding validation skipped (degraded mode) | HTTP timeout on terminology call; logged at WARN | Graceful degradation: accept request but skip coding validation; alert operator |
| FM-CP-005 | Provider directory unavailable | Care team practitioner validation skipped | HTTP timeout on provider directory call; logged at WARN | Accept care team update; async validation job checks later |
| FM-CP-006 | JWT validation failure (Keycloak down) | All authenticated requests fail with 401 | Health check; spike in 401 errors | Cached JWKS (short TTL); circuit breaker on Keycloak calls |
| FM-CP-007 | Concurrent version conflict storm | Multiple clients retry simultaneously; each receives 409 | High version_conflicts_total metric | Inform users (no data loss occurs); include a Retry-After hint in the 409 response |
| FM-CP-008 | Large care plan with many goals/activities | Reads slower than the SLO threshold | p95 latency alert | Pagination on goals/activities lists; lazy load sub-resources |
| FM-CP-009 | RLS policy misconfiguration | Cross-tenant data leakage | RLS integration test fails in CI; adversarial test | CI gate: tenant-isolation spec must pass; RLS policy reviewed in security audit |
| FM-CP-010 | Module entitlement check failure | All writes blocked even for licensed tenants | 403 MODULE_NOT_LICENSED errors | Cache entitlement checks; fallback to allow if entitlement service unreachable (configurable) |
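
The mitigation for FM-CP-001 (and the circuit breaker on Keycloak calls in FM-CP-006) pairs retries with exponential backoff and a circuit breaker. The following is a minimal sketch of how the two could fit together, not the service's actual implementation. The names `CircuitBreaker`, `CircuitOpenError`, and `call_with_retry`, the thresholds, and the use of full-jitter backoff are all illustrative assumptions.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and the call is short-circuited."""


class CircuitBreaker:
    """Hypothetical breaker: opens after N consecutive failures."""

    def __init__(self, failure_threshold: int = 5, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: half-open, so let one trial call through.
        try:
            result = fn()
        except Exception:
            self._failures += 1
            # A failed trial call, or reaching the threshold, (re)opens the circuit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result


def call_with_retry(fn: Callable[[], T], breaker: CircuitBreaker, attempts: int = 4,
                    base_s: float = 0.1, cap_s: float = 5.0,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry fn through the breaker with full-jitter exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return breaker.call(fn)
        except CircuitOpenError:
            raise  # never retry while the breaker is open
        except Exception:
            if attempt == attempts - 1:
                raise  # retries exhausted: surface the last error
            # Delay drawn uniformly from [0, min(cap, base * 2**attempt)].
            sleep(random.uniform(0, min(cap_s, base_s * 2 ** attempt)))
    raise AssertionError("unreachable")
```

In this sketch, a caller would wrap each Postgres query (or JWKS fetch) in `call_with_retry` and map `CircuitOpenError` to the 503 described in the catalog. Injecting `clock` and `sleep` keeps the behaviour testable without real waiting.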
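
FM-CP-004 describes a degraded mode in which a terminology timeout does not reject the request. One way this could look is sketched below. `validate_coding`, `ValidationOutcome`, and the `lookup` callable are hypothetical names, and recording a `coding_validated` flag for later re-checking is an assumption borrowed from the async validation idea in FM-CP-005.

```python
import logging
from dataclasses import dataclass
from typing import Callable

logger = logging.getLogger("care_plan.terminology")


@dataclass
class ValidationOutcome:
    accepted: bool          # whether the write may proceed
    coding_validated: bool  # False when validation was skipped (degraded mode)
    reason: str = ""


def validate_coding(code: str, lookup: Callable[[str], bool]) -> ValidationOutcome:
    """Validate a code via `lookup`; degrade gracefully if it times out."""
    try:
        known = lookup(code)
    except TimeoutError:
        # Degraded mode: accept the request, skip validation, log at WARN.
        logger.warning("terminology lookup timed out; skipping validation for %s", code)
        return ValidationOutcome(True, False, "terminology_unavailable")
    if not known:
        # The service answered and the code is unknown: reject as usual.
        return ValidationOutcome(False, True, "unknown_code")
    return ValidationOutcome(True, True)
```

Only a timeout triggers the degraded path here. A definite "unknown code" answer still rejects the request, so degraded mode does not weaken validation while the terminology service is healthy.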
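
For FM-CP-007, the catalog calls for a retry-after hint in the 409 response. The sketch below shows one possible shape for such a response under optimistic concurrency. The `Response` type, the `update_care_plan` function, the error code `VERSION_CONFLICT`, and the randomized 1 to 5 second delay are assumptions for illustration; only the 409 status and the retry hint come from the catalog.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Response:
    status: int
    headers: Dict[str, str] = field(default_factory=dict)
    body: Dict[str, object] = field(default_factory=dict)


def update_care_plan(stored_version: int, if_match_version: int,
                     pick_delay: Callable[[int, int], int] = random.randint) -> Response:
    """Apply an update only if the client's version matches the stored one."""
    if if_match_version != stored_version:
        # Randomizing the hint spreads out clients that conflicted together,
        # so they do not all retry at the same instant.
        retry_after_s = pick_delay(1, 5)
        return Response(
            status=409,
            headers={"Retry-After": str(retry_after_s)},
            body={"error": "VERSION_CONFLICT", "currentVersion": stored_version},
        )
    # Versions match: accept the write and bump the version.
    return Response(status=200, body={"version": stored_version + 1})
```

Returning the current version in the body is a design choice in this sketch. It lets a client decide whether to refetch and merge rather than blindly resubmitting.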
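
The FM-CP-010 mitigation (cached entitlement checks with a configurable fallback to allow) can be sketched as follows. `EntitlementGate`, its `fetch` callable, and the TTL value are hypothetical. Serving a stale cached answer when the entitlement service is unreachable is an additional assumption that the catalog does not state.

```python
import time
from typing import Callable, Dict, Tuple


class EntitlementGate:
    """Hypothetical cache in front of an entitlement service."""

    def __init__(self, fetch: Callable[[str, str], bool], ttl_s: float = 300.0,
                 fail_open: bool = False,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch
        self._ttl_s = ttl_s
        self._fail_open = fail_open  # configurable: allow writes when unreachable
        self._clock = clock
        self._cache: Dict[Tuple[str, str], Tuple[bool, float]] = {}

    def is_licensed(self, tenant_id: str, module: str) -> bool:
        key = (tenant_id, module)
        now = self._clock()
        cached = self._cache.get(key)
        if cached is not None and now - cached[1] < self._ttl_s:
            return cached[0]  # fresh cache hit: no remote call
        try:
            allowed = self._fetch(tenant_id, module)
        except (ConnectionError, TimeoutError):
            if cached is not None:
                return cached[0]  # assumption: prefer a stale answer over blocking
            return self._fail_open  # nothing cached: apply the configured fallback
        self._cache[key] = (allowed, now)
        return allowed
```

Whether `fail_open` should default to allow or deny is a policy decision. This sketch defaults to deny, so that failing open is always an explicit choice.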