Patient Portal Service — Failure Modes
Status: populated | Owner: TBD | Last updated: 2026-04-18 | Companion: Service Template · 03 platform-services · 02 DDD
1. Failure Catalog
| # | Failure | User impact | Detection | Mitigation |
|---|---|---|---|---|
| F-01 | PostgreSQL unavailable | Portal login fails; account reads fail | Health check probe fails; portal.upstream.calls.total DB errors spike | Pod restarts; Postgres HA failover (streaming replication); circuit breaker opens after 3 consecutive failures |
| F-02 | Keycloak patient realm unreachable | Login impossible; JWT validation fails once the cached JWKS expires, after which all portal endpoints return 401 | Health check JWT validation probe fails; login failure rate spikes | Patient realm HA (Keycloak active-passive); JWKS cached for 5 min to ride out brief outages |
| F-03 | registration-service unavailable | Patient summary / demographics unavailable | Upstream adapter circuit breaker triggers | Return cached last-known Patient resource (Redis, 5 min stale TTL); display degraded banner |
| F-04 | laboratory-service unavailable | Lab results section empty | Circuit breaker open; upstream error counter | Return cached last-known results bundle; display "Results may not be current" message |
| F-05 | scheduling-service unavailable | Appointment list unavailable; new appointment requests fail | Circuit breaker; upstream error counter | Cache last-known appointment list (Redis); queue appointment requests for retry via outbox |
| F-06 | radiology-service unavailable | Imaging results section empty | Circuit breaker; upstream error counter | Cached results + degraded banner |
| F-07 | claims-service unavailable | Coverage and EOB unavailable | Circuit breaker | Cached claims data + degraded banner |
| F-08 | ai-gateway-service unavailable | Navigation assistant returns 503 | Circuit breaker; feature-specific health flag | Graceful degradation: navigation assistant disabled; core portal fully operational |
| F-09 | Redis cache unavailable | Increased latency to upstream services; all requests bypass cache | Redis health probe fails | Fall through to upstream services; performance SLO breach alert fires; no data loss |
| F-10 | NATS JetStream unavailable | Outbox events not delivered; push notifications delayed | Outbox relay worker fails to publish; outbox_unpublished_count rises | Outbox persists in PostgreSQL; events are delivered when NATS recovers, preserving at-least-once delivery |
| F-11 | Push notification gateway (FCM/APNs) failure | Mobile push notifications not delivered | FCM/APNs adapter error rate spikes | Silent failure (push is best-effort); patient can still see results by opening portal |
| F-12 | Export job worker crash mid-export | Export job stuck in in_progress | Job status age-out monitor; job exceeds 10 min in in_progress | Kubernetes restarts pod; job is idempotent and resumes from checkpoint; jobs that cannot complete are set to failed with errorDetail |
| F-13 | Upstream returns unreleased result | Risk of premature result disclosure | Release policy enforcement in BFF code path; unit tests cover policy | Server-side policy check (?releasePolicy=patient-visible) is mandatory on every upstream call; result excluded if policy not met |
| F-14 | Proxy delegation record stale / expired | Proxy access continues past expiry | validTo checked on every request | validTo evaluated server-side per request; status = expired auto-set by cron job; immediate revocation supported |
| F-15 | Downstream tenant isolation breach | Patient sees another tenant's data | RLS policy check failure; audit alert | PostgreSQL RLS policy on all tables; app.tenant_id set from JWT tid claim before every query; audit-service anomaly detection |
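
Several rows above rely on a circuit breaker, and F-01 states that it opens after 3 consecutive failures. The following is a minimal sketch of that behaviour, not the service's actual implementation. The class name, the `reset_timeout_s` value of 30 seconds, and the half-open trial call are assumptions made for illustration; only the threshold of 3 comes from the catalog.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    """Hypothetical breaker: opens after N consecutive failures (N = 3 in F-01)."""

    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout_s: float = 30.0,  # assumed value, not from the catalog
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self._clock = clock
        self._consecutive_failures = 0
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout_s:
            return "half_open"  # one trial call is allowed through
        return "open"

    def call(self, fn: Callable[[], T]) -> T:
        if self.state == "open":
            raise CircuitOpenError("upstream calls suspended")
        try:
            result = fn()
        except Exception:
            self._consecutive_failures += 1
            # A failed trial call in half-open state re-opens immediately.
            if (
                self._opened_at is not None
                or self._consecutive_failures >= self.failure_threshold
            ):
                self._opened_at = self._clock()
            raise
        # Any success resets the failure count and closes the breaker.
        self._consecutive_failures = 0
        self._opened_at = None
        return result
```

Counting consecutive failures, rather than a failure rate over a window, keeps the sketch close to the wording of F-01; a single success resets the counter.
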
2. Degradation Mode Summary
The portal degrades gracefully: when an upstream service fails, the affected section shows a cached or empty state with a user-facing banner instead of failing the entire portal session. Only authentication failures (F-02) and database failures (F-01) make the portal completely unavailable.
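
The rule in the paragraph above, where only F-01 and F-02 take the whole portal down, can be sketched as a small classification over per-dependency health flags. The function name, the dependency keys, and the three status strings are hypothetical and chosen only for illustration.

```python
from typing import Mapping

# Dependencies whose failure makes the whole portal unavailable (F-01, F-02).
# The key names are assumptions for this sketch.
CRITICAL = frozenset({"postgresql", "keycloak"})


def portal_status(health: Mapping[str, bool]) -> str:
    """Return 'unavailable', 'degraded', or 'ok' from per-dependency health flags.

    Dependencies missing from the mapping are treated as healthy.
    """
    down = {name for name, healthy in health.items() if not healthy}
    if down & CRITICAL:
        return "unavailable"
    return "degraded" if down else "ok"
```

For example, `portal_status({"postgresql": True, "keycloak": True, "laboratory-service": False})` classifies the portal as degraded rather than unavailable.
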
| Upstream | Degraded behaviour |
|---|---|
| registration-service down | Patient summary shows cached demographics + warning |
| laboratory-service / radiology-service down | Results section shows cached results + "may not be current" |
| scheduling-service down | Appointments show cached list; new booking disabled with message |
| ai-gateway-service down | Navigation assistant disabled; rest of portal unaffected |
| Redis down | All requests served live from upstream; latency increases |
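
The degraded behaviours summarised above share one shape: try the upstream, fall back to the last-known cached copy with a banner flag, and keep serving live data if the cache itself is down. The following is a minimal sketch of that fallback path only. It always tries the upstream first and omits serving fresh cache hits, TTL handling, and circuit breaking. `Cache`, `SectionResult`, and `load_section` are hypothetical names, not the service's API.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional, Protocol


class Cache(Protocol):
    """Assumed minimal cache interface (for example, a thin Redis wrapper)."""

    def get(self, key: str) -> Optional[Any]: ...
    def set(self, key: str, value: Any) -> None: ...


@dataclass
class SectionResult:
    data: Optional[Any]
    degraded: bool  # True means the UI should show the warning banner
    source: str     # "upstream", "cache", or "none"


def load_section(key: str, fetch: Callable[[], Any], cache: Cache) -> SectionResult:
    try:
        data = fetch()
    except Exception:
        # Upstream is down: fall back to the last-known copy, if any.
        try:
            cached = cache.get(key)
        except Exception:
            cached = None  # the cache is unavailable as well
        if cached is not None:
            return SectionResult(cached, degraded=True, source="cache")
        return SectionResult(None, degraded=True, source="none")
    try:
        cache.set(key, data)  # remember the last-known good copy
    except Exception:
        pass  # cache down: still serve the live result
    return SectionResult(data, degraded=False, source="upstream")
```

Returning a `degraded` flag alongside the data lets the caller decide which banner text to show for each section, instead of encoding user-facing messages in the adapter.
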