Config Service — Observability

Status: populated
Owner: TBD
Last updated: 2026-04-18
Companion: Service Template · 12 observability-telemetry


1. SLIs and SLOs

| SLI | SLO | Measurement window |
| --- | --- | --- |
| GET /internal/config/resolve p95 latency | < 100 ms | 5-minute rolling |
| GET /internal/config/resolve p99 latency | < 300 ms | 5-minute rolling |
| Resolution pipeline timeout rate (RESOLUTION_TIMEOUT) | < 0.1 % of requests | 1-hour rolling |
| Service availability (non-5xx rate) | ≥ 99.9 % | 30-day rolling |
| Cache hit rate | ≥ 85 % | 1-hour rolling |
| NATS event delivery success rate | ≥ 99.5 % | 1-hour rolling |
| Cross-tenant denial correctness | 100 % (zero cross-tenant data leaks) | Continuous |
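
The availability SLO above implies an error budget. The following is a minimal sketch of that arithmetic, not part of the service itself; the function names `error_budget_minutes` and `budget_remaining` are hypothetical and chosen only for illustration.

```python
def error_budget_minutes(slo_percent: float, window_days: int) -> float:
    """Minutes of complete unavailability an availability SLO allows over a window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - slo_percent) / 100.0


def budget_remaining(total_requests: int, failed_requests: int, slo_percent: float) -> float:
    """Fraction of the request-based error budget still unspent (negative if overspent)."""
    allowed_failures = total_requests * (100.0 - slo_percent) / 100.0
    if allowed_failures == 0:
        return 0.0  # a 100 % SLO or zero traffic leaves no budget to spend
    return 1.0 - failed_requests / allowed_failures


if __name__ == "__main__":
    # 99.9 % over 30 days allows roughly 43.2 minutes of downtime.
    print(round(error_budget_minutes(99.9, 30), 1))
    # 250 failed requests out of 1,000,000 spends about a quarter of the budget.
    print(round(budget_remaining(1_000_000, 250, 99.9), 2))
```

Under these assumptions, a 30-day window at 99.9 % leaves about 43 minutes of total 5xx time before the SLO is breached.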

2. Key Metrics (OpenTelemetry)

| Metric name | Type | Labels |
| --- | --- | --- |
| config.resolve.duration_ms | Histogram | tenantId, effect, reason, cached |
| config.resolve.total | Counter | tenantId, effect, reason |
| config.cache.hit_total | Counter | tenantId, cache_key_type |
| config.cache.miss_total | Counter | tenantId, cache_key_type |
| config.role_bfs.depth | Histogram | tenantId, roleKey |
| config.event.published_total | Counter | event_type |
| config.event.dlq_depth | Gauge | stream |
| config.upstream.call_duration_ms | Histogram | upstream (facility, platform_admin, access_policy) |
| config.upstream.error_total | Counter | upstream, error_code |
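
The cache hit rate SLI in section 1 can be derived from config.cache.hit_total and config.cache.miss_total. The sketch below shows one way to compute it from two samples of each counter; the helper names are hypothetical, and the reset handling (treating a lower later value as a counter restart) is an assumption rather than documented behaviour of this service.

```python
from typing import Optional


def counter_delta(start: int, end: int) -> int:
    """Increase of a monotonic counter between two samples.

    If the later sample is lower, the counter is assumed to have reset to zero
    in between, so the later value is the best available lower bound.
    """
    return end - start if end >= start else end


def cache_hit_rate(hit_delta: int, miss_delta: int) -> Optional[float]:
    """Hit ratio over a window, from increases in the hit and miss counters."""
    total = hit_delta + miss_delta
    if total == 0:
        return None  # no cache traffic in the window, so the ratio is undefined
    return hit_delta / total


if __name__ == "__main__":
    hits = counter_delta(1000, 1900)    # 900 hits in the window
    misses = counter_delta(200, 300)    # 100 misses in the window
    print(cache_hit_rate(hits, misses))  # 0.9, above the 85 % SLO
```

Returning `None` for an idle window avoids reporting a misleading 0 % hit rate when there was simply no traffic.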

3. Distributed Tracing

All inbound and outbound HTTP calls propagate W3C Trace Context via the traceparent header. Key spans:

| Span | Attributes |
| --- | --- |
| config.resolve | tenant_id, user_id, feature_key, action, effect |
| config.role_bfs_expand | tenant_id, root_role_key, depth_reached |
| config.upstream.hierarchy | node_id, duration_ms |
| config.upstream.license_check | module_key, result |
| config.upstream.feature_flag | feature_key, result |
| config.upstream.abac_evaluate | policy_id, effect |
| config.cache.get | cache_key, hit |
| config.cache.set | cache_key, ttl_s |
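
To illustrate the traceparent header mentioned at the start of this section, here is a minimal sketch that parses a version-00 header and builds the header for an outbound call. It uses only the standard library; in practice an OpenTelemetry SDK propagator would normally handle this. `TraceContext`, `parse_traceparent`, and `child_traceparent` are hypothetical names, and the sketch deliberately rejects any version other than 00.

```python
import re
import secrets
from typing import NamedTuple, Optional

# Version-00 layout: version, 32-hex trace-id, 16-hex parent-id, 2-hex trace-flags.
_TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceContext(NamedTuple):
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Parse a version-00 traceparent value; return None if it is malformed."""
    match = _TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero identifiers are invalid in W3C Trace Context.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return TraceContext(trace_id, parent_id, bool(int(flags, 16) & 0x01))


def child_traceparent(ctx: TraceContext) -> str:
    """Build the header for an outbound call: same trace, fresh span id."""
    span_id = secrets.token_hex(8)  # 8 random bytes, rendered as 16 hex characters
    flags = "01" if ctx.sampled else "00"
    return f"00-{ctx.trace_id}-{span_id}-{flags}"


if __name__ == "__main__":
    inbound = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
    ctx = parse_traceparent(inbound)
    if ctx is not None:
        print(child_traceparent(ctx))
```

The trace id is carried through unchanged so that upstream spans such as config.upstream.hierarchy join the same trace as config.resolve.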

4. Dashboards

| Dashboard | Contents |
| --- | --- |
| Config Service Overview | Request rate, p95/p99 latency, error rate, cache hit ratio |
| Resolution Pipeline | Per-step denial breakdown (module, feature, role, ABAC, override) |
| Role Graph Health | BFS depth distribution, circular reference rejections |
| Cache Performance | Hit/miss ratio per key type, eviction counts |
| Upstream Dependencies | Latency and error rate for facility-service, platform-admin, access-policy |
| NATS Events | Published event rate per type, DLQ depth |

5. Alerts

| Alert condition | Window / trigger | Severity | Runbook |
| --- | --- | --- | --- |
| Resolve p95 > 200 ms | 5-minute window | Warning | runbooks/config-slow-resolution.md |
| Resolve p95 > 500 ms | 5-minute window | Critical | runbooks/config-slow-resolution.md |
| Availability < 99.5 % | 15-minute window | Critical | runbooks/config-service-down.md |
| Cache hit rate < 70 % | 30-minute window | Warning | runbooks/config-redis-degraded.md |
| Redis unavailable | Any error connecting to Redis | Critical | runbooks/config-redis-degraded.md |
| NATS DLQ depth > 10 | Any | Warning | runbooks/config-nats-dlq.md |
| Cross-tenant access attempt | Any CROSS_TENANT deny | Info (audit) | runbooks/config-security-incident.md |
| Upstream dependency 503 rate > 1 % | 5-minute window | Critical | runbooks/config-upstream-failure.md |
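
As an illustration of the two resolve-latency alerts above, the sketch below computes a p95 from raw latency samples and maps it to a severity. It assumes a nearest-rank percentile over in-memory samples, which is a simplification; a real alerting backend would typically estimate the percentile from config.resolve.duration_ms histogram buckets. The function names are hypothetical.

```python
import math
from typing import Optional, Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct % of samples at or below it."""
    if not samples:
        raise ValueError("percentile of an empty window is undefined")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100.0))
    return ordered[rank - 1]


def resolve_latency_severity(p95_ms: float) -> Optional[str]:
    """Map a 5-minute p95 to the severities in the alert table (strictly greater than)."""
    if p95_ms > 500:
        return "critical"
    if p95_ms > 200:
        return "warning"
    return None


if __name__ == "__main__":
    window = [100.0] * 90 + [600.0] * 10  # 10 % of requests were slow
    p95 = percentile(window, 95)
    print(p95, resolve_latency_severity(p95))
```

Checking the critical threshold first ensures a p95 above 500 ms is reported once, at the higher severity, rather than as a warning.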

6. Structured Logging

All log entries include: traceId, spanId, tenantId (where known), service: "config-service", level, timestamp.

Resolution log sample (DEBUG, 1 % sampling):

{
  "level": "debug",
  "message": "resolution.completed",
  "traceId": "...",
  "tenantId": "ten_afg_moph_001",
  "userId": "usr_...",
  "featureKey": "ViewMedications",
  "action": "medication:read",
  "effect": "allow",
  "reason": "ROLE_GRANT",
  "durationMs": 42,
  "cacheHit": true
}
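
The common fields and the 1 % DEBUG sampling described in this section could be implemented along the following lines. This is a sketch under stated assumptions, not the service's actual logger: `build_log_entry`, `should_emit`, and `DEBUG_SAMPLE_RATE` are hypothetical names, the timestamp format (ISO 8601, UTC) is assumed, and sampling is assumed to apply to DEBUG entries only.

```python
import json
import random
from datetime import datetime, timezone
from typing import Any, Callable, Dict, Optional

DEBUG_SAMPLE_RATE = 0.01  # the 1 % DEBUG sampling stated above


def build_log_entry(level: str, message: str, trace_id: str, span_id: str,
                    tenant_id: Optional[str] = None, **fields: Any) -> Dict[str, Any]:
    """Assemble a log entry carrying the fields every entry is required to include."""
    entry: Dict[str, Any] = {
        "level": level,
        "message": message,
        "timestamp": datetime.now(timezone.utc).isoformat(),  # assumed format
        "service": "config-service",
        "traceId": trace_id,
        "spanId": span_id,
    }
    if tenant_id is not None:  # tenantId is included only where known
        entry["tenantId"] = tenant_id
    entry.update(fields)
    return entry


def should_emit(level: str, rng: Callable[[], float] = random.random) -> bool:
    """Emit every non-debug entry; emit debug entries at DEBUG_SAMPLE_RATE."""
    if level != "debug":
        return True
    return rng() < DEBUG_SAMPLE_RATE


if __name__ == "__main__":
    entry = build_log_entry(
        "debug", "resolution.completed", trace_id="...", span_id="...",
        tenant_id="ten_afg_moph_001", effect="allow", durationMs=42, cacheHit=True,
    )
    if should_emit(entry["level"]):
        print(json.dumps(entry))
```

Passing the random source as a parameter keeps the sampling decision deterministic in tests.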