Terminology Service — Observability
Status: populated | Owner: TBD | Last updated: 2026-04-18 | Companion: Service Template · 12 observability-telemetry · 02 DDD
1. Service-Level Indicators (SLIs)
| SLI | Description | Measurement |
|---|---|---|
| Search latency p95 | 95th percentile of GET /v1/terminology/search | OTEL histogram http.server.duration filtered by route |
| Lookup / validate latency p95 | 95th percentile for concept lookup + validation endpoints | OTEL histogram |
| CDS query latency p95 | 95th percentile for drug interaction / duplicate / contraindication endpoints | OTEL histogram |
| Availability | % of requests returning non-5xx | 1 - (5xx / total) per 1-min window |
| Cache hit rate | % of lookup/validate/expand requests served from Redis | cache_hits / (cache_hits + cache_misses) |
| Concept dataset freshness | Age of last successful terminology dataset load | terminology_etl_last_success_age_seconds |
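The two ratio SLIs in the table reduce to simple arithmetic over counters. The sketch below shows that arithmetic only; the function names are hypothetical, and the handling of empty windows is an assumption rather than documented service behaviour.

```typescript
// Hypothetical helpers illustrating the ratio SLIs; not part of the service code.

/** Availability for one window: 1 - (5xx / total). */
function availability(totalRequests: number, serverErrors: number): number {
  // Assumption: a window with no traffic has no failures, so count it as available.
  if (totalRequests === 0) return 1;
  return 1 - serverErrors / totalRequests;
}

/** Cache hit rate: cache_hits / (cache_hits + cache_misses). */
function cacheHitRate(hits: number, misses: number): number | null {
  const lookups = hits + misses;
  // With no lookups the ratio is undefined; return null instead of guessing.
  return lookups === 0 ? null : hits / lookups;
}
```

Returning `null` for an empty cache window keeps a quiet period from being read as a 0% hit rate.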
2. Service-Level Objectives (SLOs)
| SLO | Target | Measurement window |
|---|---|---|
| Search latency p95 | ≤ 200 ms | 5-min rolling |
| Lookup / validate latency p95 | ≤ 100 ms | 5-min rolling |
| CDS query latency p95 | ≤ 300 ms | 5-min rolling |
| Availability | ≥ 99.9% | 30-day rolling |
| Cache hit rate | ≥ 80% | 1-hour rolling |
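As an illustration, the targets above could be encoded as data and checked against observed SLI values. This is a minimal sketch: the identifiers (`SLO_TARGETS`, `meetsSlo`, `violatedSlos`) and the snake_case SLO keys are assumptions, and only the target values and windows come from the table.

```typescript
// Targets and windows copied from the SLO table; all names here are hypothetical.
interface SloTarget {
  name: string;
  // "max": observed value must be at or below target; "min": at or above.
  comparison: "max" | "min";
  target: number; // milliseconds for latency, ratio (0..1) otherwise
  window: string;
}

const SLO_TARGETS: SloTarget[] = [
  { name: "search_latency_p95", comparison: "max", target: 200, window: "5m" },
  { name: "lookup_validate_latency_p95", comparison: "max", target: 100, window: "5m" },
  { name: "cds_query_latency_p95", comparison: "max", target: 300, window: "5m" },
  { name: "availability", comparison: "min", target: 0.999, window: "30d" },
  { name: "cache_hit_rate", comparison: "min", target: 0.8, window: "1h" },
];

function meetsSlo(slo: SloTarget, observed: number): boolean {
  return slo.comparison === "max" ? observed <= slo.target : observed >= slo.target;
}

/** Returns the names of SLOs whose observed value misses the target. */
function violatedSlos(observed: Record<string, number>): string[] {
  return SLO_TARGETS.filter((slo) => {
    const value = observed[slo.name];
    // SLIs without an observation are skipped rather than counted as violations.
    return value !== undefined && !meetsSlo(slo, value);
  }).map((slo) => slo.name);
}
```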
3. OpenTelemetry Instrumentation
Traces:
| Span | Attributes |
|---|---|
| terminology.search | system, query_length, result_count, tenant_id |
| terminology.lookup | system, code, cache_hit, active |
| terminology.validate | system, code, valid |
| terminology.expand | value_set_url, result_count, cache_hit |
| terminology.cds.interactions | drug_count, interaction_count |
| terminology.cds.duplicate_therapy | drug_count, pair_count |
| terminology.cds.contraindications | icd10_count, alert_count |
| terminology.import | system, imported, skipped, errors |
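To show how the attribute lists above might be applied, here is a dependency-free sketch for the `terminology.lookup` span. `AttributeSink` is a hypothetical stand-in for the `setAttribute` method of a tracing SDK span, and `annotateLookupSpan`, `LookupOutcome`, and `RecordingSpan` are illustrative names. The service's actual tracing setup is not described in this document.

```typescript
// Minimal structural stand-in for a span's setAttribute method (assumption:
// the real service would pass a span from its tracing SDK instead).
interface AttributeSink {
  setAttribute(key: string, value: string | number | boolean): unknown;
}

interface LookupOutcome {
  system: string;
  code: string;
  cacheHit: boolean;
  active: boolean;
}

// Sets the four attributes listed for the terminology.lookup span.
function annotateLookupSpan(span: AttributeSink, outcome: LookupOutcome): void {
  span.setAttribute("system", outcome.system);
  span.setAttribute("code", outcome.code);
  span.setAttribute("cache_hit", outcome.cacheHit);
  span.setAttribute("active", outcome.active);
}

// In-memory sink that records attributes, useful in unit tests.
class RecordingSpan implements AttributeSink {
  readonly attributes: Record<string, string | number | boolean> = {};

  setAttribute(key: string, value: string | number | boolean): this {
    this.attributes[key] = value;
    return this;
  }
}
```

Keeping the attribute names in one helper per span makes it harder for call sites to drift from the table.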
Metrics:
| Metric | Type | Labels |
|---|---|---|
| terminology.requests.total | Counter | operation, system, status_code |
| terminology.request.duration | Histogram | operation, system |
| terminology.cache.hits.total | Counter | operation |
| terminology.cache.misses.total | Counter | operation |
| terminology.concept.total | Gauge | system, tenant_id |
| terminology.import.rows.total | Counter | system, result (imported/skipped/error) |
| terminology.etl.last_success_age | Gauge | system |
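The sketch below illustrates how a labelled counter such as `terminology.import.rows.total` accumulates one series per label combination. `LabeledCounter` is a hypothetical in-memory class written for illustration, not the metrics SDK the service uses, and the `icd10` label value is only an example.

```typescript
// The three values allowed for the `result` label, per the metrics table.
type ImportResult = "imported" | "skipped" | "error";

// Hypothetical in-memory counter; a real service would use its metrics SDK.
class LabeledCounter<L extends Record<string, string>> {
  readonly name: string;
  private readonly values = new Map<string, number>();

  constructor(name: string) {
    this.name = name;
  }

  add(labels: L, amount = 1): void {
    // Counters are monotonic, so negative increments are rejected.
    if (amount < 0) throw new RangeError("counters only increase");
    const key = LabeledCounter.keyOf(labels);
    this.values.set(key, (this.values.get(key) ?? 0) + amount);
  }

  get(labels: L): number {
    return this.values.get(LabeledCounter.keyOf(labels)) ?? 0;
  }

  // Sort label names so the same label set always maps to the same series.
  private static keyOf(labels: Record<string, string>): string {
    return Object.keys(labels)
      .sort()
      .map((k) => `${k}=${labels[k]}`)
      .join(",");
  }
}

const importRows = new LabeledCounter<{ system: string; result: ImportResult }>(
  "terminology.import.rows.total",
);
```

Typing `result` as a closed union keeps the label's cardinality fixed at the three values the table lists.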
Logs: Structured JSON via @ghasi/logger. Each record includes traceId, spanId, tenantId, and service: terminology-service. Log fields must not contain clinical data.
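One way to enforce the no-clinical-data rule is an allowlist applied before a record reaches the logger. The sketch below is an assumption-heavy illustration: the API of @ghasi/logger is not shown in this document, so `buildLogRecord`, `LogContext`, and the contents of `ALLOWED_FIELDS` are all hypothetical.

```typescript
// Hypothetical record builder; the real @ghasi/logger API is not documented here.
interface LogContext {
  traceId: string;
  spanId: string;
  tenantId: string;
}

// Assumed allowlist of operational fields; anything else is dropped.
const ALLOWED_FIELDS = new Set([
  "operation",
  "system",
  "status_code",
  "cache_hit",
  "duration_ms",
]);

function buildLogRecord(
  message: string,
  ctx: LogContext,
  fields: Record<string, unknown> = {},
): Record<string, unknown> {
  const safe: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(fields)) {
    // Drop any field not explicitly allowed, so clinical data cannot leak in.
    if (ALLOWED_FIELDS.has(key)) safe[key] = value;
  }
  return {
    ...safe,
    message,
    service: "terminology-service",
    traceId: ctx.traceId,
    spanId: ctx.spanId,
    tenantId: ctx.tenantId,
  };
}
```

An allowlist fails closed: a newly added field stays out of the logs until someone reviews it and adds it to the list.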
4. Dashboards
| Dashboard | Key panels |
|---|---|
| Terminology Overview | Request rate by operation, error rate, p95 latency by operation |
| Cache Performance | Hit rate, miss rate, Redis latency, cache eviction events |
| CDS Query Traffic | Interaction check volume, contraindication volume, p95 latency |
| Dataset Health | Concept counts by system, ETL last run age, import error rate |
| Tenant Concept Usage | Top tenants by concept query volume |
5. Alerts
| Alert | Condition | Severity | Action |
|---|---|---|---|
| Search latency breach | search_p95 > 500ms for 5 min | P2 | Check PostgreSQL full-text index; check Redis |
| High error rate | 5xx_rate > 2% for 3 min | P1 | Check DB + Redis health; page on-call |
| Cache hit rate low | cache_hit_rate < 50% for 30 min | P3 | Check Redis evictions; review TTL configuration |
| ETL data stale | etl_last_success_age > 7 days for any system | P2 | Review ETL pipeline; check licensed data availability |
| Concept count drop | Concept count drops > 10% in 1h | P1 | Suspect accidental deactivation or migration error; investigate immediately |
| DB connection exhaustion | DB connection pool > 90% for 5 min | P2 | Scale pods; check for connection leaks |
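Most conditions above have the form "threshold breached for N minutes". The sketch below shows one possible reading of that rule, where the alert fires only if every sample in the most recent unbroken run breaches and the run spans at least the stated duration. `sustainedBreach` and `Sample` are hypothetical names, and the alerting backend's actual evaluation semantics may differ.

```typescript
interface Sample {
  timestampMs: number;
  value: number;
}

/**
 * Returns true when the latest sample breaches and the unbroken run of
 * breaching samples ending at it covers at least durationMs.
 */
function sustainedBreach(
  samples: Sample[],
  breaches: (value: number) => boolean,
  durationMs: number,
): boolean {
  // Copy before sorting so the caller's array is left untouched.
  const sorted = [...samples].sort((a, b) => a.timestampMs - b.timestampMs);
  let runStart: number | null = null;
  for (const sample of sorted) {
    if (breaches(sample.value)) {
      if (runStart === null) runStart = sample.timestampMs;
    } else {
      // Any healthy sample resets the run.
      runStart = null;
    }
  }
  const last = sorted[sorted.length - 1];
  if (runStart === null || last === undefined) return false;
  return last.timestampMs - runStart >= durationMs;
}
```

For the high error rate alert, the predicate would be `(v) => v > 0.02` with a duration of three minutes; for the cache hit rate alert, `(v) => v < 0.5` with thirty minutes.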
6. Runbook Links
| Scenario | Runbook |
|---|---|
| Terminology service high latency | docs/runbooks/terminology-high-latency.md |
| ETL import failure | docs/runbooks/terminology-etl-failure.md |
| Redis cache eviction | docs/runbooks/redis-cache-eviction.md |
| Concept count anomaly | docs/runbooks/terminology-concept-count-drop.md |