Ghasi e-Prescribing Gateway Service — Failure Modes
Status: populated Owner: TBD Last updated: 2026-04-18 Companion: Service Template · 03 platform-services · 02 DDD
Failure Catalog
| ID | Failure | User impact | Detection | Mitigation |
|---|---|---|---|---|
| FM-EPRX-001 | Postgres unavailable | All MedicationRequest/MedicationDispense (MR/MD) create/read operations fail with 503 | Health check fails; alert EprescribingServiceDown | Circuit breaker; retry; read replica for GETs; idempotency keys enable safe retry on restore |
| FM-EPRX-002 | NATS unavailable | Prescriptions and dispenses persist but events not published; no immediate user impact | Outbox pending count alert; EprescribingOutboxStuck | Outbox accumulates safely; relay retries on NATS recovery |
| FM-EPRX-003 | Subscription notification endpoint down | Pharmacy or EHR misses notification; manual or pull workaround needed | DLQ depth alert EprescribingDLQDepthHigh; delivery_failed event | Retry with backoff (5 attempts, 15 min max); DLQ enqueue; on-call manual replay tool (AC-RX-005) |
| FM-EPRX-004 | IG validator unavailable (Zod/HAPI) | Strict mode: writes rejected with 503 VALIDATOR_UNAVAILABLE; permissive mode: profile validation skipped and writes accepted with a warning | Health endpoint returns validator: degraded; alert | Configurable degraded mode per tenant policy: block writes (strict) or accept with warning (permissive) |
| FM-EPRX-005 | Pharmacy routing resolver unavailable | Target pharmacy cannot be determined; MR blocked | HTTP timeout on provider-directory call; logged ERROR | Cache last known routing table (TTL: 5 min); fallback to default pharmacy if configured |
| FM-EPRX-006 | ETag conflict storm (concurrent updates) | Multiple actors get 412; must retry with refresh | Spike in eprescribing_etag_conflicts_total | Inform callers via 412 response; client exponential backoff; no data loss |
| FM-EPRX-007 | JWT validation failure (Keycloak down) | All authenticated requests fail with 401 | Spike in 401 errors; health alert | Cached JWKS (5 min TTL); circuit breaker on Keycloak |
| FM-EPRX-008 | Idempotency store unavailable (Postgres/Redis) | Risk of duplicate MR/MD creation on network retry | Phase 1: Postgres failure cascades; Phase 2: Redis failure → fallback to Postgres | Phase 2: Redis failure falls back to Postgres check; alert on Redis degradation |
| FM-EPRX-009 | RLS policy regression | Cross-tenant MR/MD exposure | Adversarial test fails in CI; security audit | CI gate: tenant-isolation.spec mandatory; RLS reviewed on every migration |
| FM-EPRX-010 | Subscription DLQ grows without processing | Pharmacy/EHR stale; prescription status diverges | EprescribingDLQDepthHigh alert persists | On-call runbook: investigate endpoint; repair and replay; escalate to tenant admin |
| FM-EPRX-011 | Outbox relay process crash mid-publish | Event may have been published once or not at all when the relay crashes, so a republish on restart can produce a duplicate | Relay restart time; NATS duplicate detection | NATS deduplication on message ID; consumer idempotency (BR-RX-003) |
| FM-EPRX-012 | Rate limit hit by legitimate high-volume tenant | Tenant receives 429; prescriptions delayed | Rate limit metric spike | Per-tenant rate limit configuration; emergency queue option per tenant policy (BR-RX-004) |
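
The retry-then-DLQ mitigation listed for FM-EPRX-003 could look roughly like the sketch below. It is illustrative only and not taken from the gateway's code: `RetryPolicy`, `backoffDelayMs`, `deliverWithRetry`, the `send` and `enqueueDlq` callbacks, and the 30-second base delay are all hypothetical. It also assumes that "15 min max" caps a single backoff delay; the catalog does not say whether the limit applies per delay or to the whole retry window.

```typescript
// Hypothetical sketch of the FM-EPRX-003 mitigation: bounded retries with
// exponential backoff, then hand-off to the dead-letter queue (DLQ).
interface RetryPolicy {
  maxAttempts: number; // 5 attempts, per the catalog
  baseDelayMs: number; // assumed value; not stated in the catalog
  maxDelayMs: number;  // assumes "15 min max" caps one delay
}

const DEFAULT_POLICY: RetryPolicy = {
  maxAttempts: 5,
  baseDelayMs: 30_000,
  maxDelayMs: 15 * 60_000,
};

// Delay to wait after failed attempt number `attempt` (1-based).
// Doubles on every failure and never exceeds policy.maxDelayMs.
function backoffDelayMs(attempt: number, policy: RetryPolicy = DEFAULT_POLICY): number {
  const exponential = policy.baseDelayMs * 2 ** (attempt - 1);
  return Math.min(exponential, policy.maxDelayMs);
}

type DeliveryOutcome =
  | { status: "delivered"; attempts: number }
  | { status: "dead_lettered"; attempts: number; lastError: string };

// Calls `send` until it succeeds or the attempt budget is spent.
// When every attempt fails, the last error is passed to `enqueueDlq`.
async function deliverWithRetry(
  send: () => Promise<void>,
  enqueueDlq: (reason: string) => Promise<void>,
  policy: RetryPolicy = DEFAULT_POLICY,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<DeliveryOutcome> {
  let lastError = "unknown";
  for (let attempt = 1; attempt <= policy.maxAttempts; attempt++) {
    try {
      await send();
      return { status: "delivered", attempts: attempt };
    } catch (err) {
      lastError = err instanceof Error ? err.message : String(err);
      // No sleep after the final attempt; go straight to the DLQ.
      if (attempt < policy.maxAttempts) {
        await sleep(backoffDelayMs(attempt, policy));
      }
    }
  }
  await enqueueDlq(lastError);
  return { status: "dead_lettered", attempts: policy.maxAttempts, lastError };
}
```

With these assumed defaults, the waits between the five attempts are 30 s, 60 s, 120 s, and 240 s, so the 15-minute cap only takes effect if a tenant policy raises `maxAttempts` or `baseDelayMs`. Passing `sleep` as a parameter lets tests run the loop without real timers.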