Patient Portal Service — Failure Modes

Status: populated | Owner: TBD | Last updated: 2026-04-18 | Companion: Service Template · 03 platform-services · 02 DDD

1. Failure Catalog

| # | Failure | User impact | Detection | Mitigation |
|---|---|---|---|---|
| F-01 | PostgreSQL unavailable | Portal login fails; account reads fail | Health check probe fails; portal.upstream.calls.total DB errors spike | Pod restarts; Postgres HA failover (streaming replication); circuit breaker opens after 3 consecutive failures |
| F-02 | Keycloak patient realm unreachable | Login impossible; all JWT validation fails; all portal endpoints return 401 | Health check JWT validation probe fails; login failure rate spikes | Patient realm HA (Keycloak active-passive); cached JWKS for 5 min to handle brief outages |
| F-03 | registration-service unavailable | Patient summary / demographics unavailable | Upstream adapter circuit breaker triggers | Return cached last-known Patient resource (Redis, 5 min stale TTL); display degraded banner |
| F-04 | laboratory-service unavailable | Lab results section empty | Circuit breaker open; upstream error counter | Return cached last-known results bundle; display "Results may not be current" message |
| F-05 | scheduling-service unavailable | Appointment list unavailable; new appointment requests fail | Circuit breaker; upstream error counter | Cache last-known appointment list (Redis); queue appointment requests for retry via outbox |
| F-06 | radiology-service unavailable | Imaging results section empty | Circuit breaker; upstream error counter | Cached results + degraded banner |
| F-07 | claims-service unavailable | Coverage and EOB unavailable | Circuit breaker | Cached claims data + degraded banner |
| F-08 | ai-gateway-service unavailable | Navigation assistant returns 503 | Circuit breaker; feature-specific health flag | Graceful degradation: navigation assistant disabled; core portal fully operational |
| F-09 | Redis cache unavailable | Increased latency to upstream services; all requests bypass cache | Redis health probe fails | Fall through to upstream services; performance SLO breach alert fires; no data loss |
| F-10 | NATS JetStream unavailable | Outbox events not delivered; push notifications delayed | Outbox relay worker fails to publish; outbox_unpublished_count rises | Outbox persists in PostgreSQL; events delivered when NATS recovers; at-least-once semantics guarantee |
| F-11 | Push notification gateway (FCM/APNs) failure | Mobile push notifications not delivered | FCM/APNs adapter error rate spikes | Silent failure (push is best-effort); patient can still see results by opening portal |
| F-12 | Export job worker crash mid-export | Export job stuck in in_progress | Job status age-out monitor; expjob exceeds 10 min in-progress | Kubernetes restarts pod; job is idempotent and restarts from checkpoint; failed jobs set to failed with errorDetail |
| F-13 | Upstream returns unreleased result | Risk of premature result disclosure | Release policy enforcement in BFF code path; unit tests cover policy | Server-side policy check (?releasePolicy=patient-visible) is mandatory on every upstream call; result excluded if policy not met |
| F-14 | Proxy delegation record stale / expired | Proxy access continues past expiry | validTo checked on every request | validTo evaluated server-side per request; status = expired auto-set by cron job; immediate revocation supported |
| F-15 | Downstream tenant isolation breach | Patient sees another tenant's data | RLS policy check failure; audit alert | PostgreSQL RLS policy on all tables; app.tenant_id set from JWT tid claim before every query; audit-service anomaly detection |
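
Several rows above rely on a circuit breaker, and F-01 states that it opens after 3 consecutive failures. The following is a minimal sketch of that counting rule, not the portal's actual implementation. The class name CircuitBreaker, the CircuitOpenError exception, the 30-second reset window, and the injectable clock are all assumptions made for illustration; only the threshold of 3 comes from the catalog.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    """Hypothetical breaker: opens after N consecutive failures."""

    def __init__(
        self,
        failure_threshold: int = 3,      # from F-01: 3 consecutive failures
        reset_after_s: float = 30.0,     # assumed value, not in the catalog
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._consecutive_failures = 0
        self._opened_at: Optional[float] = None

    @property
    def is_open(self) -> bool:
        # The breaker counts as open only while the reset window is running.
        if self._opened_at is None:
            return False
        return self._clock() - self._opened_at < self.reset_after_s

    def call(self, fn: Callable[[], T]) -> T:
        if self.is_open:
            raise CircuitOpenError("upstream circuit is open")
        try:
            result = fn()
        except Exception:
            self._consecutive_failures += 1
            if self._consecutive_failures >= self.failure_threshold:
                # Open (or re-open after a failed trial call).
                self._opened_at = self._clock()
            raise
        # Any success resets the streak and closes the breaker.
        self._consecutive_failures = 0
        self._opened_at = None
        return result
```

In this sketch, once the reset window has elapsed the next call is let through as a trial: a success closes the breaker, while a failure re-opens it immediately because the failure streak was never reset.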

2. Degradation Mode Summary

The portal implements graceful degradation: upstream failures cause sections to show cached or empty states with user-facing banners, rather than failing the entire portal session. Only authentication failures (F-02) and database failures (F-01) result in complete portal unavailability.

| Upstream | Degraded behaviour |
|---|---|
| registration-service down | Patient summary shows cached demographics + warning |
| lab/radiology down | Results section shows cached results + "may not be current" |
| scheduling-service down | Appointments show cached list; new booking disabled with message |
| ai-gateway down | Navigation assistant disabled; rest of portal unaffected |
| Redis down | All requests served live from upstream; latency increases |
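
The per-section behaviour summarised above can be sketched as follows: each section is loaded independently, and an upstream failure yields the last cached value plus a banner instead of failing the whole page. This is an illustrative sketch only. The names Section, load_section, build_portal_page, and DEFAULT_BANNER are hypothetical, a plain dict stands in for Redis, and the 5 min stale TTL from F-03 is left out for brevity. Only the "Results may not be current" text is taken from the catalog.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, MutableMapping, Optional

# Banner text for lab results comes from F-04; the default is an assumption.
BANNERS: Dict[str, str] = {"lab": "Results may not be current"}
DEFAULT_BANNER = "Some information may be out of date"


@dataclass
class Section:
    data: Optional[Any]           # fresh data, cached data, or None (empty state)
    degraded: bool                # True when the upstream call failed
    banner: Optional[str] = None  # user-facing message for degraded sections


def load_section(
    name: str,
    fetch: Callable[[], Any],
    cache: MutableMapping[str, Any],
) -> Section:
    """Fetch one section; fall back to the last cached value on failure."""
    try:
        fresh = fetch()
    except Exception:
        # Upstream is down or its circuit is open: serve stale data if any.
        banner = BANNERS.get(name, DEFAULT_BANNER)
        return Section(data=cache.get(name), degraded=True, banner=banner)
    cache[name] = fresh  # remember the last-known good value
    return Section(data=fresh, degraded=False)


def build_portal_page(
    fetchers: Dict[str, Callable[[], Any]],
    cache: MutableMapping[str, Any],
) -> Dict[str, Section]:
    """Load every section independently so one failure cannot fail the page."""
    return {name: load_section(name, fetch, cache) for name, fetch in fetchers.items()}
```

A caller would pass one fetch function per upstream (for example, a closure wrapping the laboratory-service client) and render each returned Section, showing the banner wherever degraded is true. Failures of the cache itself (the Redis down row) are not modelled here.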