Registration Service — Failure Modes
Status: populated · Owner: TBD · Last updated: 2026-04-17 · Companion: Service Template
1. Failure Catalog
| ID | Failure | User Impact | Detection | Mitigation |
|---|---|---|---|---|
| FM-REG-01 | PostgreSQL primary unavailable | All registration operations fail (503) | Healthcheck readiness probe fails; alert on pod not ready | Replica promotion; circuit breaker on DB port; K8s pod restart |
| FM-REG-02 | NATS JetStream unavailable | Patient creates/updates succeed but events not published (data persisted; events queued in outbox) | Outbox lag metric > threshold | Outbox relay retries with exponential backoff; events published when NATS recovers |
| FM-REG-03 | MPI scoring slow / timeout | Patient create is delayed until the MPI check times out at the configured threshold | registration.mpi.score.duration_p95 alert | Fail-safe: if MPI times out, create proceeds with a warning log and a NATS duplicate-review event |
| FM-REG-04 | Redis (idempotency store) unavailable | Idempotent retries may create duplicates for the duration of the outage | Redis healthcheck failure | Service degrades gracefully: proceeds without idempotency cache; logs a warning; MPI acts as safety net |
| FM-REG-05 | Keycloak JWKS endpoint unavailable | Authenticated requests fail (401) once the cached JWKS expires | Auth guard exception spike | Short-term JWKS cache (5 min) in JWT guard; retry backoff on JWKS refresh |
| FM-REG-06 | Portrait object store unavailable | Portrait upload/download fails; core registration unaffected | 503 on portrait endpoints | Return 503 with PORTRAIT_STORAGE_UNAVAILABLE; core create/update proceeds without portrait |
| FM-REG-07 | config-service unavailable | Required-field config cannot be retrieved; service falls back to cached, then default, config | Config client error metric | Cached config with 30-min TTL; fall back to empty required-fields list (accept all creates) |
| FM-REG-08 | Disk space exhaustion (DB) | Writes fail with Postgres error | Disk usage alert | Pre-emptive alerting at 75%; automatic archiving of inactive encounter records |
| FM-REG-09 | Optimistic lock storm | Multiple clients update same patient concurrently; most get 409 | High rate of OPTIMISTIC_LOCK_CONFLICT errors | UI must implement reload-and-retry pattern; alert if 409 rate > 5% sustained |
| FM-REG-10 | MPI false-positive spike | Valid patients blocked by duplicate detection | registration_mpi_duplicates_total spike | Configurable threshold; admin can temporarily lower sensitivity; runbook for MPI recalibration |
| FM-REG-11 | Unmerge of unknown-source merge | Unmerge rejected (UNMERGE_INVALID_STATE) for legacy merges lacking tracking data | User-visible 400 error | Supervisor workflow for manual correction; audit trail for escalation |
| FM-REG-12 | HL7 ADT inbound message parse failure | ADT event not processed; patient not updated | interop-service dead-letter queue | Dead-letter queue alert; manual reprocessing runbook |
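
The FM-REG-02 mitigation (outbox relay retrying with exponential backoff) could look roughly like the sketch below. It is illustrative only: `relay_outbox`, `backoff_delays`, the injected `publish` and `sleep` callables, the jitter, and all numeric defaults are assumptions, not the service's actual relay.

```python
import random
from typing import Callable, Iterator


def backoff_delays(base: float = 0.5, factor: float = 2.0, cap: float = 60.0) -> Iterator[float]:
    """Yield exponentially growing delays in seconds, capped at `cap`."""
    delay = base
    while True:
        yield min(delay, cap)
        delay = min(delay * factor, cap)


def relay_outbox(
    pending: list[dict],
    publish: Callable[[dict], None],
    sleep: Callable[[float], None],
    max_attempts: int = 8,
) -> list[dict]:
    """Publish pending events in order; return the events still unpublished.

    Stops at the first event that exhausts its attempts so ordering is preserved.
    """
    remaining = list(pending)
    while remaining:
        event = remaining[0]
        delays = backoff_delays()
        for attempt in range(1, max_attempts + 1):
            try:
                publish(event)  # hypothetical stand-in for the NATS publish call
                remaining.pop(0)
                break
            except ConnectionError:
                if attempt == max_attempts:
                    return remaining  # rows stay in the outbox for the next relay run
                # Random jitter (an assumption) spreads retries across relay instances.
                sleep(random.uniform(0, next(delays)))
    return remaining
```

Because unpublished rows are returned rather than dropped, the outbox lag metric keeps growing while NATS is down, which is what the detection column relies on.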
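
For FM-REG-03, the behaviour "if MPI times out, create proceeds with a warning and a duplicate-review event" can be sketched as follows. The function names, the `duplicate-review` subject string, the 2-second timeout, and the 0.9 blocking threshold are hypothetical placeholders; the catalog only says the threshold is configured.

```python
import logging
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from typing import Callable

log = logging.getLogger("registration.mpi")
_pool = ThreadPoolExecutor(max_workers=4)


def create_patient(
    patient: dict,
    score_duplicates: Callable[[dict], float],
    save: Callable[[dict], None],
    enqueue_event: Callable[[str, dict], None],
    timeout_s: float = 2.0,        # assumed value
    block_threshold: float = 0.9,  # assumed value
) -> str:
    """Return "created", "created_unchecked", or "blocked_duplicate"."""
    future = _pool.submit(score_duplicates, patient)
    try:
        score = future.result(timeout=timeout_s)
    except FutureTimeout:
        # Timeout path: the create goes ahead, flagged for later human review.
        future.cancel()
        log.warning("MPI scoring timed out after %.1fs; creating without duplicate check", timeout_s)
        save(patient)
        enqueue_event("duplicate-review", patient)
        return "created_unchecked"
    if score >= block_threshold:
        return "blocked_duplicate"
    save(patient)
    return "created"
```

The trade-off is that a slow MPI never blocks registration, at the cost of possible duplicates that the review event is meant to surface.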
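
The FM-REG-07 fallback chain (fresh cache, then refetch, then empty required-fields list) might be implemented along these lines. `RequiredFieldsCache`, the injected `fetch` callable, and the injectable clock are assumptions made for the sketch; only the 30-minute TTL and the empty-list fallback come from the catalog.

```python
import time
from typing import Callable, Optional


class RequiredFieldsCache:
    """Caches the required-fields list and degrades to [] when config-service is down."""

    def __init__(
        self,
        fetch: Callable[[], list[str]],            # hypothetical config-service client call
        ttl_s: float = 30 * 60,                    # 30-min TTL from the catalog
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl_s = ttl_s
        self._clock = clock
        self._value: Optional[list[str]] = None
        self._fetched_at = 0.0

    def get(self) -> list[str]:
        now = self._clock()
        if self._value is not None and now - self._fetched_at < self._ttl_s:
            return self._value  # cached copy is still within its TTL
        try:
            self._value = self._fetch()
            self._fetched_at = now
            return self._value
        except Exception:
            # config-service unreachable and no fresh cached copy:
            # an empty list means no field is required, so all creates are accepted.
            return []
```

A failed refresh does not overwrite the cache timestamp, so the next call retries config-service instead of serving the empty list for another full TTL.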
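
FM-REG-09 asks clients to implement a reload-and-retry pattern on 409. A minimal client-side sketch, assuming the record carries a `version` field and that `load` and `save` wrap the real API calls (both hypothetical here), could be:

```python
from typing import Callable


class OptimisticLockConflict(Exception):
    """Stands in for a 409 OPTIMISTIC_LOCK_CONFLICT response."""


def update_with_reload(
    load: Callable[[], dict],
    change: Callable[[dict], dict],
    save: Callable[[dict, int], dict],
    max_attempts: int = 3,
) -> dict:
    """Reload the latest record, reapply the change, and save against that version."""
    for _ in range(max_attempts):
        current = load()                    # always start from the latest version
        candidate = change(dict(current))   # reapply the user's edit to fresh data
        try:
            return save(candidate, current["version"])
        except OptimisticLockConflict:
            continue                        # another writer won the race; reload and retry
    raise OptimisticLockConflict(f"gave up after {max_attempts} attempts")
```

Bounding the attempts matters during a lock storm: unbounded retries from many clients would keep the 409 rate above the 5% alert threshold.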