
Terminology Service — Observability

Status: populated
Owner: TBD
Last updated: 2026-04-18
Companion: Service Template · 12 observability-telemetry · 02 DDD

1. Service-Level Indicators (SLIs)

| SLI | Description | Measurement |
| --- | --- | --- |
| Search latency p95 | 95th percentile of GET /v1/terminology/search | OTEL histogram http.server.duration filtered by route |
| Lookup / validate latency p95 | 95th percentile for concept lookup + validation endpoints | OTEL histogram |
| CDS query latency p95 | 95th percentile for drug interaction / duplicate / contraindication endpoints | OTEL histogram |
| Availability | % of requests returning non-5xx | 1 - (5xx / total) per 1-min window |
| Cache hit rate | % of lookup/validate/expand requests served from Redis | cache_hits / (cache_hits + cache_misses) |
| Concept dataset freshness | Age of last successful terminology dataset load | terminology_etl_last_success_age_seconds |
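The two ratio SLIs in the table follow directly from raw counters. A minimal Python sketch of the availability and cache hit rate formulas is shown here; the function names, the sample counts, and the handling of empty windows are assumptions made for illustration and are not taken from the service code.

```python
def availability(total_requests: int, server_errors: int) -> float:
    """Fraction of requests that did not return 5xx: 1 - (5xx / total)."""
    if total_requests == 0:
        # Assumption: a window with no traffic counts as fully available.
        return 1.0
    return 1.0 - server_errors / total_requests


def cache_hit_rate(cache_hits: int, cache_misses: int) -> float:
    """cache_hits / (cache_hits + cache_misses)."""
    lookups = cache_hits + cache_misses
    if lookups == 0:
        # Assumption: with no cache lookups there is no hit rate to report.
        return 0.0
    return cache_hits / lookups


# Hypothetical counts for a single 1-minute window.
window_availability = availability(total_requests=12_000, server_errors=6)
window_hit_rate = cache_hit_rate(cache_hits=850, cache_misses=150)
print(f"availability={window_availability:.4%} hit_rate={window_hit_rate:.1%}")
```

With these sample counts the window is 99.95% available and 85% of lookups are served from the cache.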

2. Service-Level Objectives (SLOs)

| SLO | Target | Measurement window |
| --- | --- | --- |
| Search latency p95 | ≤ 200 ms | 5-min rolling |
| Lookup / validate latency p95 | ≤ 100 ms | 5-min rolling |
| CDS query latency p95 | ≤ 300 ms | 5-min rolling |
| Availability | ≥ 99.9% | 30-day rolling |
| Cache hit rate | ≥ 80% | 1-hour rolling |
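The availability target implies an error budget. The sketch below works through that arithmetic in Python for the 99.9% / 30-day objective; the helper names and the request counts in the example are hypothetical, and treating the budget as either time or failed requests is an illustrative choice rather than a documented policy.

```python
from datetime import timedelta


def error_budget(target: float, window: timedelta) -> timedelta:
    """Time the service may be unavailable within the window at this target."""
    return window * (1.0 - target)


def budget_remaining(target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the request-based error budget still unspent (can go negative)."""
    if total_requests == 0:
        return 1.0
    allowed_failure_ratio = 1.0 - target
    observed_failure_ratio = failed_requests / total_requests
    return 1.0 - observed_failure_ratio / allowed_failure_ratio


budget = error_budget(target=0.999, window=timedelta(days=30))
print(f"30-day budget: {budget.total_seconds() / 60:.1f} minutes")

# Hypothetical traffic: 400 failures out of 1,000,000 requests in the window.
remaining = budget_remaining(target=0.999, total_requests=1_000_000, failed_requests=400)
print(f"budget remaining: {remaining:.0%}")
```

A 99.9% target over 30 days leaves roughly 43.2 minutes of full unavailability. In the hypothetical traffic sample, about 60% of the budget is still available.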

3. OpenTelemetry Instrumentation

Traces:

| Span | Attributes |
| --- | --- |
| terminology.search | system, query_length, result_count, tenant_id |
| terminology.lookup | system, code, cache_hit, active |
| terminology.validate | system, code, valid |
| terminology.expand | value_set_url, result_count, cache_hit |
| terminology.cds.interactions | drug_count, interaction_count |
| terminology.cds.duplicate_therapy | drug_count, pair_count |
| terminology.cds.contraindications | icd10_count, alert_count |
| terminology.import | system, imported, skipped, errors |

Metrics:

| Metric | Type | Labels |
| --- | --- | --- |
| terminology.requests.total | Counter | operation, system, status_code |
| terminology.request.duration | Histogram | operation, system |
| terminology.cache.hits.total | Counter | operation |
| terminology.cache.misses.total | Counter | operation |
| terminology.concept.total | Gauge | system, tenant_id |
| terminology.import.rows.total | Counter | system, result (imported/skipped/error) |
| terminology.etl.last_success_age | Gauge | system |
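The latency SLIs are read from a histogram such as terminology.request.duration, which stores bucket counts rather than individual samples. A p95 is therefore an estimate. The Python sketch below shows one common estimation approach: linear interpolation inside the bucket that contains the target rank, similar in spirit to what Prometheus's histogram_quantile does. The bucket boundaries and counts are invented for illustration, and the actual backend may compute percentiles differently.

```python
from typing import Sequence


def estimate_quantile(q: float, upper_bounds_ms: Sequence[float], counts: Sequence[int]) -> float:
    """Estimate a quantile from per-bucket counts.

    upper_bounds_ms: ascending finite upper bounds of the buckets.
    counts: one count per finite bucket plus a final count for the overflow bucket.
    """
    if len(counts) != len(upper_bounds_ms) + 1:
        raise ValueError("counts needs one entry per bound plus an overflow entry")
    total = sum(counts)
    if total == 0:
        raise ValueError("histogram is empty")

    rank = q * total
    cumulative = 0
    for index, count in enumerate(counts):
        if count == 0:
            continue
        if cumulative + count >= rank:
            if index == len(upper_bounds_ms):
                # Overflow bucket has no upper bound; report the last finite bound.
                return upper_bounds_ms[-1]
            lower = upper_bounds_ms[index - 1] if index > 0 else 0.0
            upper = upper_bounds_ms[index]
            # Assume samples are spread evenly inside the bucket.
            return lower + (upper - lower) * (rank - cumulative) / count
        cumulative += count
    return upper_bounds_ms[-1]


# Hypothetical search-latency histogram (bounds in milliseconds).
bounds = [25.0, 50.0, 100.0, 200.0, 500.0]
bucket_counts = [400, 300, 200, 80, 15, 5]
p95 = estimate_quantile(0.95, bounds, bucket_counts)
print(f"estimated p95: {p95:.1f} ms")
```

For this sample the estimate lands at about 162.5 ms, inside the 100 to 200 ms bucket. The precision of the estimate depends on bucket width, so boundaries near the SLO thresholds give the most useful readings.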

Logs: Structured JSON via @ghasi/logger. Each record includes traceId, spanId, tenantId, and service: terminology-service. Log fields must not contain clinical data.
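The API of @ghasi/logger is not shown in this document, so the following Python sketch does not use it. It only illustrates the shape of a log record that carries the correlation fields listed above and keeps clinical data out by accepting extra fields from an allow-list. The level and message keys, the allow-list contents, and the function name are all assumptions.

```python
import json

SERVICE_NAME = "terminology-service"

# Hypothetical allow-list: only non-clinical, low-cardinality fields pass through.
ALLOWED_FIELDS = frozenset(
    {"operation", "system", "status_code", "cache_hit", "query_length", "result_count"}
)


def build_log_line(level: str, message: str, trace_id: str, span_id: str,
                   tenant_id: str, **fields: object) -> str:
    """Serialize one structured log record, dropping fields not on the allow-list."""
    record = {
        "level": level,
        "message": message,
        "service": SERVICE_NAME,
        "traceId": trace_id,
        "spanId": span_id,
        "tenantId": tenant_id,
    }
    record.update({key: value for key, value in fields.items() if key in ALLOWED_FIELDS})
    return json.dumps(record, sort_keys=True)


line = build_log_line(
    "info",
    "search completed",
    trace_id="4bf92f3577b34da6a3ce929d0e0e4736",
    span_id="00f067aa0ba902b7",
    tenant_id="tenant-a",
    operation="search",
    query_length=9,
    query="metformin",  # free-text query is not on the allow-list, so it is dropped
)
print(line)
```

Using an allow-list instead of a deny-list means a newly added field stays out of the logs until someone decides it is safe to record.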


4. Dashboards

| Dashboard | Key panels |
| --- | --- |
| Terminology Overview | Request rate by operation, error rate, p95 latency by operation |
| Cache Performance | Hit rate, miss rate, Redis latency, cache eviction events |
| CDS Query Traffic | Interaction check volume, contraindication volume, p95 latency |
| Dataset Health | Concept counts by system, ETL last run age, import error rate |
| Tenant Concept Usage | Top tenants by concept query volume |

5. Alerts

| Alert | Condition | Severity | Action |
| --- | --- | --- | --- |
| Search latency breach | search_p95 > 500ms for 5 min | P2 | Check PostgreSQL full-text index; check Redis |
| High error rate | 5xx_rate > 2% for 3 min | P1 | Check DB + Redis health; page on-call |
| Cache hit rate low | cache_hit_rate < 50% for 30 min | P3 | Redis eviction check; TTL configuration review |
| ETL data stale | etl_last_success_age > 7 days for any system | P2 | Review ETL pipeline; check licensed data availability |
| Concept count drop | Concept count drops > 10% in 1h | P1 | Accidental deactivation or migration error; investigate immediately |
| DB connection exhaustion | DB connection pool > 90% for 5 min | P2 | Scale pods; check for connection leaks |
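Most conditions above combine a threshold with a "for N min" duration, so a single bad sample does not fire the alert. The Python sketch below shows one way to express that rule, along with the relative drop used by the concept count alert. It assumes one sample per minute; the helper names and sample values are hypothetical, and the real alerting backend is not specified in this document.

```python
from typing import Sequence


def breached_for(samples: Sequence[float], threshold: float, minutes: int,
                 below: bool = False) -> bool:
    """True when each of the last `minutes` per-minute samples violates the threshold."""
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    if len(samples) < minutes:
        return False  # not enough history to satisfy the duration
    recent = samples[-minutes:]
    if below:
        return all(value < threshold for value in recent)
    return all(value > threshold for value in recent)


def relative_drop(previous: float, current: float) -> float:
    """Fractional decrease from previous to current (0.0 when there is no decrease)."""
    if previous <= 0:
        return 0.0
    return max(0.0, (previous - current) / previous)


# Hypothetical per-minute samples.
search_p95_ms = [180, 520, 540, 610, 505, 530]
error_rate = [0.01, 0.03, 0.025]

print(breached_for(search_p95_ms, threshold=500, minutes=5))  # search latency breach
print(breached_for(error_rate, threshold=0.02, minutes=3))    # high error rate
print(relative_drop(previous=120_000, current=105_000) > 0.10)  # concept count drop
```

Here the latency alert fires because the last five samples all exceed 500 ms. The error-rate alert does not, because only two of the last three samples exceed 2%. The concept count fell by 12.5%, which crosses the 10% threshold.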

| Scenario | Runbook |
| --- | --- |
| Terminology service high latency | docs/runbooks/terminology-high-latency.md |
| ETL import failure | docs/runbooks/terminology-etl-failure.md |
| Redis cache eviction | docs/runbooks/redis-cache-eviction.md |
| Concept count anomaly | docs/runbooks/terminology-concept-count-drop.md |