Terminology Service — Failure Modes
Status: populated
Owner: TBD
Last updated: 2026-04-18
Companion: Service Template · 03 platform-services · 02 DDD
1. Failure Catalog
| # | Failure | User impact | Detection | Mitigation |
|---|---|---|---|---|
| F-01 | PostgreSQL unavailable | All concept queries fail; 503 returned to consumers | Health check probe fails; DB error rate spikes | Postgres HA (streaming replication + automatic failover); circuit breaker opens after 3 consecutive failures; Redis cache serves warm hits during brief outage |
| F-02 | Redis unavailable | All requests go to PostgreSQL; latency increases significantly; risk of DB overload | Redis health probe fails; cache miss rate → 100% | Fall through to PostgreSQL directly; alert fires on high DB query rate; Redis is not in the critical path — service degrades gracefully |
| F-03 | NATS JetStream unavailable | Concept mutation events not delivered to consumers; no query impact | Outbox relay worker fails; outbox_unpublished_count rises | Outbox persists in PostgreSQL; events delivered when NATS recovers; at-least-once delivery maintained |
| F-04 | Bulk import job failure (partial) | Incomplete terminology dataset after import; some codes missing | ETL job status check; import response includes errors array | Import is idempotent (upsert); re-run import after correcting bad rows; failed rows logged without aborting batch |
| F-05 | Redis cache stampede | DB load spike and slower queries when many hot keys expire simultaneously under load | DB connection count spike; query latency spike | Cache key jitter on write (±5% TTL variance); Redis SETNX-based locking for $expand result computation |
| F-06 | Full-text search index bloat | Search queries slow after large import | p95 latency breach on /v1/terminology/search | Schedule REINDEX CONCURRENTLY on idx_concepts_fts after bulk imports; alert on search latency |
| F-07 | Licensed data not loaded at deployment | All SNOMED/LOINC/RxNorm lookups return 404 or empty | Health check returns terminology_data: empty; custom alert | Readiness probe checks concept count > 0; service will not serve traffic until dataset loaded; ETL job must complete before rollout |
| F-08 | Tenant isolation breach in concept scope | Tenant B concepts visible to Tenant A | RLS policy failure; audit anomaly | PostgreSQL RLS policy enforced; app.tenant_id set from JWT tid claim per request; integration test tenant-isolation.spec.ts runs on every PR |
| F-09 | Drug interaction data staleness | Outdated interaction pairs; missed severity changes | No automated detection (data quality issue) | Drug interaction data version tracked in drug_interactions metadata table; ETL updates trigger TERMINOLOGY.dataset.updated event; SRE reviews update cadence monthly |
| F-10 | Keycloak unavailable | All /v1/terminology/* and /fhir/R4/* requests fail with 401 | JWT validation failure; 401 rate spike | Internal route /internal/terminology/* uses shared-secret fallback (not JWT); service-to-service calls (via internal route) continue; patient-facing FHIR ops blocked until Keycloak recovers |
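
Two of the mitigations above, the F-01 circuit breaker that opens after 3 consecutive failures and the F-05 ±5% TTL jitter, can be sketched as follows. This is a minimal illustration, not the service's actual code: the class name `ConsecutiveFailureBreaker`, the function `jitteredTtlSeconds`, and the manual `reset()` hook are hypothetical, and half-open probing is left out.

```typescript
// Hypothetical sketch; names and structure are assumptions, not the service's code.

// Opens after `threshold` consecutive failures (F-01 uses 3).
class ConsecutiveFailureBreaker {
  private readonly threshold: number;
  private consecutiveFailures = 0;
  private open = false;

  constructor(threshold: number = 3) {
    this.threshold = threshold;
  }

  // Callers check this before issuing a PostgreSQL query.
  allowRequest(): boolean {
    return !this.open;
  }

  // Any success resets the consecutive-failure count.
  recordSuccess(): void {
    this.consecutiveFailures = 0;
  }

  // The breaker opens once the count reaches the threshold.
  recordFailure(): void {
    this.consecutiveFailures += 1;
    if (this.consecutiveFailures >= this.threshold) {
      this.open = true;
    }
  }

  // Assumed recovery hook, e.g. called after a health probe succeeds.
  reset(): void {
    this.consecutiveFailures = 0;
    this.open = false;
  }
}

// Spreads a base TTL by up to ±5% so keys written together do not expire together.
// `rand` is injectable so the function can be exercised deterministically.
function jitteredTtlSeconds(
  baseTtlSeconds: number,
  rand: () => number = Math.random,
): number {
  const factor = 1 + (rand() * 2 - 1) * 0.05; // within [0.95, 1.05)
  return Math.max(1, Math.round(baseTtlSeconds * factor));
}
```

Keeping the breaker as a small state machine with explicit `recordSuccess` and `recordFailure` calls makes the "3 consecutive failures" rule easy to unit test without a database connection.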
2. Degradation Policy
Terminology-service is a shared read-mostly service. The degradation strategy prioritizes query availability over consistency:
| Scenario | Behaviour |
|---|---|
| Redis down | Serve live from PostgreSQL; emit terminology.cache.miss metric |
| DB down briefly (< failover time) | Serve Redis-warm hits for cached keys; return 503 for uncached queries |
| DB down persistently | 503 for all queries; Keycloak Realm shows dependency failure |
| Import disabled (IMPORT_ENABLED=false) | POST /internal/terminology/import returns 405 METHOD_NOT_ALLOWED |
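
The read-path rows of the table above could be implemented roughly as follows. This is a sketch under stated assumptions: the `ConceptStore` interface, the `readConcept` function, and the `recordCacheMiss` callback are hypothetical stand-ins for the real Redis and PostgreSQL clients and the `terminology.cache.miss` metric, and the 404 for a concept missing from a healthy database is an assumption that the table does not specify.

```typescript
// Hypothetical interface; the real cache and repository clients are not shown here.
interface ConceptStore<T> {
  // Resolves to undefined on a miss; rejects when the backend is unreachable.
  get(key: string): Promise<T | undefined>;
}

type ReadResult<T> =
  | { status: 200; body: T; source: "cache" | "db" }
  | { status: 404 } // assumption: concept absent from a healthy database
  | { status: 503 };

async function readConcept<T>(
  key: string,
  cache: ConceptStore<T>,
  db: ConceptStore<T>,
  recordCacheMiss: () => void,
): Promise<ReadResult<T>> {
  // 1. Try Redis first. A warm hit is served even if PostgreSQL is down.
  try {
    const cached = await cache.get(key);
    if (cached !== undefined) {
      return { status: 200, body: cached, source: "cache" };
    }
  } catch {
    // Redis down: treat as a miss and fall through to PostgreSQL.
  }
  recordCacheMiss(); // stands in for emitting terminology.cache.miss

  // 2. Serve live from PostgreSQL. If it is unreachable, the query is uncached: 503.
  try {
    const row = await db.get(key);
    if (row === undefined) {
      return { status: 404 };
    }
    return { status: 200, body: row, source: "db" };
  } catch {
    return { status: 503 };
  }
}
```

Because the cache is consulted before the database, warm keys stay available during a brief failover, which matches the stated preference for query availability over consistency.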