Config Service — Observability
Status: populated | Owner: TBD | Last updated: 2026-04-18 | Companion: Service Template · 12 observability-telemetry
1. SLIs and SLOs
| SLI | SLO | Measurement window |
|---|---|---|
| GET /internal/config/resolve p95 latency | < 100 ms | 5-minute rolling |
| GET /internal/config/resolve p99 latency | < 300 ms | 5-minute rolling |
| Resolution pipeline timeout rate (RESOLUTION_TIMEOUT) | < 0.1 % of requests | 1-hour rolling |
| Service availability (non-5xx rate) | ≥ 99.9 % | 30-day rolling |
| Cache hit rate | ≥ 85 % | 1-hour rolling |
| NATS event delivery success rate | ≥ 99.5 % | 1-hour rolling |
| Cross-tenant denial correctness | 100 % — zero cross-tenant data leaks | Continuous |
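
The percentage targets above are easier to reason about as error budgets. The following is a minimal sketch of that arithmetic; the function names are illustrative and not part of the service, and the request volume of 1,000,000 per hour is an assumed figure used only for the example.

```python
MINUTES_PER_DAY = 24 * 60


def downtime_budget_minutes(slo_percent: float, window_days: int) -> float:
    """Minutes of full unavailability an availability SLO tolerates over the window."""
    return (1 - slo_percent / 100) * window_days * MINUTES_PER_DAY


def failure_ceiling(limit_percent: float, total_requests: int) -> int:
    """Request count at which a rate limit such as '< 0.1 %' is reached.

    The SLO is a strict inequality, so the observed failures must stay
    below this number.
    """
    return int(total_requests * limit_percent / 100)


# Availability SLO: >= 99.9 % over a 30-day rolling window.
availability_budget = downtime_budget_minutes(99.9, 30)

# Timeout SLO: < 0.1 % of requests, with an assumed 1,000,000 requests in the hour.
timeout_ceiling = failure_ceiling(0.1, 1_000_000)

print(f"availability budget: {availability_budget:.1f} min per 30 days")
print(f"timeout ceiling: {timeout_ceiling} requests per hour")
```

Under these assumptions the availability target leaves roughly 43 minutes of downtime per 30-day window, which is the quantity the availability alert in section 5 is protecting.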
2. Key Metrics (OpenTelemetry)
| Metric name | Type | Labels |
|---|---|---|
| config.resolve.duration_ms | Histogram | tenantId, effect, reason, cached |
| config.resolve.total | Counter | tenantId, effect, reason |
| config.cache.hit_total | Counter | tenantId, cache_key_type |
| config.cache.miss_total | Counter | tenantId, cache_key_type |
| config.role_bfs.depth | Histogram | tenantId, roleKey |
| config.event.published_total | Counter | event_type |
| config.event.dlq_depth | Gauge | stream |
| config.upstream.call_duration_ms | Histogram | upstream (facility, platform_admin, access_policy) |
| config.upstream.error_total | Counter | upstream, error_code |
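
The cache hit rate SLI from section 1 is not emitted directly; it can be derived from the two cache counters. A minimal sketch, assuming the inputs are the increases of config.cache.hit_total and config.cache.miss_total over the 1-hour window (the helper names are hypothetical, and the real calculation would normally live in the metrics backend's query language):

```python
from typing import Optional

CACHE_HIT_RATE_SLO = 0.85  # ">= 85 %" from the SLO table


def cache_hit_rate(hits: int, misses: int) -> Optional[float]:
    """Hit ratio for a window, or None when there was no cache traffic."""
    total = hits + misses
    if total == 0:
        return None  # no lookups: the ratio is undefined, not zero
    return hits / total


def meets_cache_slo(hits: int, misses: int, target: float = CACHE_HIT_RATE_SLO) -> bool:
    """True when the window satisfies the SLO; an idle window counts as compliant."""
    rate = cache_hit_rate(hits, misses)
    return rate is None or rate >= target


print(cache_hit_rate(850, 150))
print(meets_cache_slo(600, 400))
```

Treating an idle window as compliant is a design choice of this sketch; a window without lookups could equally be reported as "no data".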
3. Distributed Tracing
All inbound and outbound HTTP calls use W3C trace context (traceparent header). Key spans:
| Span | Attributes |
|---|---|
| config.resolve | tenant_id, user_id, feature_key, action, effect |
| config.role_bfs_expand | tenant_id, root_role_key, depth_reached |
| config.upstream.hierarchy | node_id, duration_ms |
| config.upstream.license_check | module_key, result |
| config.upstream.feature_flag | feature_key, result |
| config.upstream.abac_evaluate | policy_id, effect |
| config.cache.get | cache_key, hit |
| config.cache.set | cache_key, ttl_s |
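
To illustrate what propagating the traceparent header involves, here is a small stdlib-only sketch that parses an inbound version-00 header and builds the header for an outbound upstream call. The names TraceContext, parse_traceparent, and child_traceparent are hypothetical; a real service would more likely rely on an OpenTelemetry propagator than on hand-written parsing, and this sketch deliberately rejects anything that is not exactly the four-field version-00 layout.

```python
import re
import secrets
from typing import NamedTuple, Optional

# version "-" trace-id (32 hex) "-" parent-id (16 hex) "-" flags (2 hex)
_TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceContext(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    flags: str


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Return the parsed context, or None if the header is malformed."""
    match = _TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    ctx = TraceContext(*match.groups())
    # Version ff and all-zero ids are invalid per the W3C Trace Context spec.
    if ctx.version == "ff" or ctx.trace_id == "0" * 32 or ctx.parent_id == "0" * 16:
        return None
    return ctx


def child_traceparent(ctx: TraceContext) -> str:
    """Header for an outbound call: same trace id, freshly generated span id."""
    return f"00-{ctx.trace_id}-{secrets.token_hex(8)}-{ctx.flags}"


inbound = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
ctx = parse_traceparent(inbound)
outbound = child_traceparent(ctx)
print(ctx.trace_id)
print(outbound)
```

The trace id is carried through unchanged, which is what lets spans such as config.resolve and config.upstream.hierarchy appear in the same trace.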
4. Dashboards
| Dashboard | Contents |
|---|---|
| Config Service Overview | Request rate, p95/p99 latency, error rate, cache hit ratio |
| Resolution Pipeline | Per-step denial breakdown (module, feature, role, ABAC, override) |
| Role Graph Health | BFS depth distribution, circular reference rejections |
| Cache Performance | Hit/miss ratio per key type, eviction counts |
| Upstream Dependencies | Latency and error rate for facility-service, platform-admin, access-policy |
| NATS Events | Published event rate per type, DLQ depth |
5. Alerts
| Alert condition | Evaluation window / trigger | Severity | Runbook |
|---|---|---|---|
| Resolve p95 > 200 ms | 5-minute window | Warning | runbooks/config-slow-resolution.md |
| Resolve p95 > 500 ms | 5-minute window | Critical | runbooks/config-slow-resolution.md |
| Availability < 99.5 % | 15-minute window | Critical | runbooks/config-service-down.md |
| Cache hit rate < 70 % | 30-minute window | Warning | runbooks/config-redis-degraded.md |
| Redis unavailable | Any error connecting to Redis | Critical | runbooks/config-redis-degraded.md |
| NATS DLQ depth > 10 | Any | Warning | runbooks/config-nats-dlq.md |
| Cross-tenant access attempt | Any CROSS_TENANT deny | Info (audit) | runbooks/config-security-incident.md |
| Upstream dependency 503 rate > 1 % | 5-minute window | Critical | runbooks/config-upstream-failure.md |
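
The two resolve-latency alerts form a simple severity ladder. The sketch below shows that logic over a 5-minute window of latency samples. It is an illustration only: the function names are hypothetical, the percentile uses the nearest-rank method (the actual alerting backend may interpolate or work from histogram buckets), and the sample values are made up.

```python
from typing import Optional, Sequence

# Thresholds copied from the alert table, in milliseconds.
P95_WARNING_MS = 200
P95_CRITICAL_MS = 500


def p95_nearest_rank(samples_ms: Sequence[float]) -> float:
    """95th percentile by the nearest-rank method."""
    if not samples_ms:
        raise ValueError("at least one latency sample is required")
    ordered = sorted(samples_ms)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer arithmetic
    return ordered[rank - 1]


def resolve_latency_severity(p95_ms: float) -> Optional[str]:
    """Map a p95 latency to the alert severity it triggers, if any."""
    if p95_ms > P95_CRITICAL_MS:
        return "critical"
    if p95_ms > P95_WARNING_MS:
        return "warning"
    return None


# Made-up window: 20 samples, 10 ms to 200 ms in steps of 10.
window = list(range(10, 210, 10))
window_p95 = p95_nearest_rank(window)
print(window_p95, resolve_latency_severity(window_p95))
```

Checking the critical threshold first ensures a window that breaches both thresholds raises a single Critical alert rather than a Warning.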
6. Structured Logging
All log entries include: traceId, spanId, tenantId (where known), service: "config-service", level, timestamp.
Resolution log sample (DEBUG, 1 % sampling):
{
  "level": "debug",
  "message": "resolution.completed",
  "traceId": "...",
  "tenantId": "ten_afg_moph_001",
  "userId": "usr_...",
  "featureKey": "ViewMedications",
  "action": "medication:read",
  "effect": "allow",
  "reason": "ROLE_GRANT",
  "durationMs": 42,
  "cacheHit": true
}
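
A log entry of this shape, with the common fields listed above and 1 % sampling for DEBUG resolution logs, could be produced along the following lines. This is a sketch under assumptions: build_log_entry, should_sample, and log_resolution are hypothetical helpers, sampling is done with a plain random draw (the service may instead sample per trace), and the ISO 8601 UTC timestamp format is a guess, since the source does not specify one.

```python
import json
import random
from datetime import datetime, timezone
from typing import Any, Callable, Dict, Optional

RESOLUTION_DEBUG_SAMPLE_RATE = 0.01  # "1 % sampling" from the text above


def build_log_entry(level: str, message: str, trace_id: str, span_id: str,
                    tenant_id: Optional[str] = None, **fields: Any) -> Dict[str, Any]:
    """Assemble one structured log entry with the fields every entry carries."""
    entry: Dict[str, Any] = {
        "level": level,
        "message": message,
        "timestamp": datetime.now(timezone.utc).isoformat(),  # assumed format
        "service": "config-service",
        "traceId": trace_id,
        "spanId": span_id,
    }
    if tenant_id is not None:  # tenantId is included only where known
        entry["tenantId"] = tenant_id
    entry.update(fields)
    return entry


def should_sample(rate: float, rng: Callable[[], float] = random.random) -> bool:
    """Keep roughly `rate` of events; rng is injectable so tests are deterministic."""
    return rng() < rate


def log_resolution(emit: Callable[[str], None] = print,
                   rng: Callable[[], float] = random.random, **kwargs: Any) -> None:
    """Emit a sampled DEBUG resolution.completed entry as one JSON line."""
    if should_sample(RESOLUTION_DEBUG_SAMPLE_RATE, rng):
        emit(json.dumps(build_log_entry("debug", "resolution.completed", **kwargs)))


# Placeholder ids for illustration; rng is forced low so the entry is emitted.
log_resolution(rng=lambda: 0.0, trace_id="trace-placeholder", span_id="span-placeholder",
               tenant_id="ten_afg_moph_001", featureKey="ViewMedications",
               effect="allow", reason="ROLE_GRANT", durationMs=42, cacheHit=True)
```

Writing one JSON object per line keeps entries easy to ingest, and passing the sampler's random source as a parameter makes the 1 % path testable without relying on chance.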