
Terminology Service — Failure Modes

Status: populated
Owner: TBD
Last updated: 2026-04-18
Companion: Service Template · 03 platform-services · 02 DDD

1. Failure Catalog

| # | Failure | User impact | Detection | Mitigation |
|---|---|---|---|---|
| F-01 | PostgreSQL unavailable | All concept queries fail; 503 returned to consumers | Health check probe fails; DB error rate spikes | Postgres HA (streaming replication + automatic failover); circuit breaker opens after 3 consecutive failures; Redis cache serves warm hits during brief outage |
| F-02 | Redis unavailable | All requests go to PostgreSQL; latency increases significantly; risk of DB overload | Redis health probe fails; cache miss rate → 100% | Fall through to PostgreSQL directly; alert fires on high DB query rate; Redis is not in the critical path, so the service degrades gracefully |
| F-03 | NATS JetStream unavailable | Concept mutation events not delivered to consumers; no query impact | Outbox relay worker fails; outbox_unpublished_count rises | Outbox persists in PostgreSQL; events delivered when NATS recovers; at-least-once delivery maintained |
| F-04 | Bulk import job failure (partial) | Incomplete terminology dataset after import; some codes missing | ETL job status check; import response includes errors array | Import is idempotent (upsert); re-run import after correcting bad rows; failed rows logged without aborting batch |
| F-05 | Redis cache stampede | Simultaneous expiry of many hot keys under load causes DB spike | DB connection count spike; query latency spike | Cache key jitter on write (±5% TTL variance); Redis SETNX-based locking for $expand result computation |
| F-06 | Full-text search index bloat | Search queries slow after large import | p95 latency breach on /v1/terminology/search | Schedule REINDEX CONCURRENTLY on idx_concepts_fts after bulk imports; alert on search latency |
| F-07 | Licensed data not loaded at deployment | All SNOMED/LOINC/RxNorm lookups return 404 or empty | Health check returns terminology_data: empty; custom alert | Readiness probe checks concept count > 0; service will not serve traffic until dataset loaded; ETL job must complete before rollout |
| F-08 | Tenant isolation breach in concept scope | Tenant B concepts visible to Tenant A | RLS policy failure; audit anomaly | PostgreSQL RLS policy enforced; app.tenant_id set from JWT tid claim per request; integration test tenant-isolation.spec.ts runs on every PR |
| F-09 | Drug interaction data staleness | Outdated interaction pairs; missed severity changes | No automated detection (data quality issue) | Drug interaction data version tracked in drug_interactions metadata table; ETL updates trigger TERMINOLOGY.dataset.updated event; SRE reviews update cadence monthly |
| F-10 | Keycloak unavailable | All /v1/terminology/* and /fhir/R4/* requests fail with 401 | JWT validation failure; 401 rate spike | Internal route /internal/terminology/* uses shared-secret fallback (not JWT); service-to-service calls (via internal route) continue; patient-facing FHIR ops blocked until Keycloak recovers |
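
The F-01 mitigation mentions a circuit breaker that opens after 3 consecutive failures. The following is a minimal sketch of that behaviour, not the service's actual implementation. The class and method names, the 10-second reset timeout, and the half-open trial behaviour are assumptions made for illustration; only the threshold of 3 comes from the catalog.

```typescript
// Illustrative consecutive-failure circuit breaker. All names are hypothetical.
type BreakerState = "closed" | "open" | "half-open";

class CircuitOpenError extends Error {
  constructor() {
    super("circuit open: dependency calls are short-circuited");
    this.name = "CircuitOpenError";
  }
}

class CircuitBreaker {
  private consecutiveFailures = 0;
  private openedAt: number | null = null;
  private readonly failureThreshold: number;
  private readonly resetTimeoutMs: number;
  private readonly now: () => number;

  constructor(
    failureThreshold = 3, // from F-01: opens after 3 consecutive failures
    resetTimeoutMs = 10000, // assumption: the catalog does not state a reset timeout
    now: () => number = Date.now, // injectable clock, useful in tests
  ) {
    this.failureThreshold = failureThreshold;
    this.resetTimeoutMs = resetTimeoutMs;
    this.now = now;
  }

  state(): BreakerState {
    if (this.openedAt === null) return "closed";
    return this.now() - this.openedAt >= this.resetTimeoutMs ? "half-open" : "open";
  }

  /** Calls are allowed when closed, and as trial calls when half-open. */
  canRequest(): boolean {
    return this.state() !== "open";
  }

  /** Any success, including a half-open trial, closes the breaker. */
  recordSuccess(): void {
    this.consecutiveFailures = 0;
    this.openedAt = null;
  }

  /** Opens the breaker at the threshold; a failed half-open trial re-opens it. */
  recordFailure(): void {
    this.consecutiveFailures += 1;
    if (this.consecutiveFailures >= this.failureThreshold) {
      this.openedAt = this.now();
    }
  }

  /** Wraps an async dependency call, for example a PostgreSQL query. */
  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (!this.canRequest()) throw new CircuitOpenError();
    try {
      const result = await fn();
      this.recordSuccess();
      return result;
    } catch (err) {
      this.recordFailure();
      throw err;
    }
  }
}
```

A caller would typically map `CircuitOpenError` to the 503 response described in F-01 instead of waiting on a database connection that is known to be failing. This sketch lets every caller through while half-open; a stricter variant would admit a single trial call.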
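
The two F-05 mitigations (TTL jitter on write and a SETNX-style lock around expensive computations) can be sketched as follows. This is an illustration under stated assumptions, not the service's code: `jitteredTtlSeconds`, `CacheStore`, `getOrCompute`, the lock key prefix, and the lock and poll timings are all hypothetical. With Redis, `setIfAbsent` would correspond to `SET key value NX PX ttlMs`.

```typescript
// Sketch of the F-05 stampede mitigations. All names here are hypothetical.

/** Returns baseTtlSeconds varied by up to ±5%, so keys written together expire apart. */
function jitteredTtlSeconds(
  baseTtlSeconds: number,
  random: () => number = Math.random, // injectable for deterministic tests
): number {
  const variance = 0.05; // ±5% TTL variance, from the failure catalog
  const factor = 1 + (random() * 2 - 1) * variance; // in [0.95, 1.05)
  return Math.max(1, Math.round(baseTtlSeconds * factor));
}

/** Minimal cache surface assumed by this sketch. */
interface CacheStore {
  get(key: string): Promise<string | null>;
  set(key: string, value: string, ttlSeconds: number): Promise<void>;
  /** Sets the key only if it does not exist; resolves to true when the key was set. */
  setIfAbsent(key: string, value: string, ttlMs: number): Promise<boolean>;
  del(key: string): Promise<void>;
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

/** Only the lock holder recomputes; other callers poll the cache until the value appears. */
async function getOrCompute(
  store: CacheStore,
  key: string,
  baseTtlSeconds: number,
  compute: () => Promise<string>,
  lockTtlMs = 5000, // assumption: upper bound on one computation
  pollMs = 50, // assumption: polling interval for waiting callers
): Promise<string> {
  const lockKey = `lock:${key}`;
  const deadline = Date.now() + lockTtlMs;
  for (;;) {
    const cached = await store.get(key);
    if (cached !== null) return cached;

    if (await store.setIfAbsent(lockKey, "1", lockTtlMs)) {
      try {
        const value = await compute();
        await store.set(key, value, jitteredTtlSeconds(baseTtlSeconds));
        return value;
      } finally {
        // Simplification: a production lock would store a unique token and
        // delete the key only if the token still matches.
        await store.del(lockKey);
      }
    }
    // Waited as long as the lock could live: compute without caching.
    if (Date.now() >= deadline) return compute();
    await sleep(pollMs);
  }
}
```

For a one-hour base TTL, the jitter spreads expiries over roughly 3420 to 3780 seconds, so keys written in the same bulk import do not all expire in the same second.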

2. Degradation Policy

Terminology-service is a shared read-mostly service. The degradation strategy prioritizes query availability over consistency:

| Scenario | Behaviour |
|---|---|
| Redis down | Serve live from PostgreSQL; emit terminology.cache.miss metric |
| DB down briefly (< failover time) | Serve Redis-warm hits for cached keys; return 503 for uncached queries |
| DB down persistently | 503 for all queries; Keycloak Realm shows dependency failure |
| Import disabled (IMPORT_ENABLED=false) | POST /internal/terminology/import returns 405 METHOD_NOT_ALLOWED |
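
The first three scenarios above follow from a cache-first read path, sketched below. This is an assumption about how such a policy could be implemented, not a description of the actual handler. The interfaces, `lookupConcept`, and the 404 result for an unknown concept are hypothetical; the metric name and the 503 behaviour come from the table.

```typescript
// Cache-first read path illustrating the degradation table. Names are hypothetical.
interface ConceptCache {
  get(key: string): Promise<string | null>;
  set(key: string, value: string): Promise<void>;
}

interface ConceptDb {
  findConcept(key: string): Promise<string | null>;
}

type LookupResult =
  | { status: 200; body: string; source: "cache" | "db" }
  | { status: 404 } // assumption: an unknown concept maps to 404
  | { status: 503 };

async function lookupConcept(
  key: string,
  cache: ConceptCache,
  db: ConceptDb,
  emitMetric: (name: string) => void,
): Promise<LookupResult> {
  let cached: string | null = null;
  try {
    cached = await cache.get(key);
  } catch {
    // Redis down: treat the failure as a cache miss and fall through to PostgreSQL.
  }
  if (cached !== null) {
    // Warm hits are served even if the database is briefly unavailable.
    return { status: 200, body: cached, source: "cache" };
  }

  emitMetric("terminology.cache.miss");

  let row: string | null;
  try {
    row = await db.findConcept(key);
  } catch {
    // Uncached query while the database is down.
    return { status: 503 };
  }
  if (row === null) return { status: 404 };

  // Best-effort repopulation; a cache write failure must not fail the request.
  void cache.set(key, row).catch(() => undefined);
  return { status: 200, body: row, source: "db" };
}
```

Under this sketch, a persistent database outage converges on 503 for all queries once the cached entries expire, which is consistent with the "DB down persistently" row.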