Facility Service — Observability
Status: populated
Owner: TBD
Last updated: 2026-04-17
Companion: SERVICE_TEMPLATE §10 · 03 Platform Services
1. SLIs
| SLI | Definition |
|---|---|
| context_lookup_latency_p99_ms | P99 latency of GET /internal/providers/:id/context and GET /internal/nodes/:id/context |
| hierarchy_read_latency_p95_ms | P95 latency of subtree/ancestor queries |
| bed_status_change_publish_lag_ms | Time from commit → NATS ack |
| outbox_lag_seconds | Age of the oldest unpublished outbox row |
| cache_hit_rate | Redis hit rate on node context |
| cycle_detection_rejections_per_min | Incidence of cycle violations |
| service_availability | Successful / total internal API requests |
2. SLOs
| SLO | Target |
|---|---|
| context_lookup_latency_p99 | ≤ 20 ms |
| hierarchy_read_latency_p95 | ≤ 100 ms (up to 1000 nodes) |
| bed_status publish lag p95 | ≤ 2 s |
| outbox lag p99 | ≤ 10 s |
| cache hit rate | ≥ 95 % |
| service_availability | ≥ 99.9 % monthly |
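To make the availability target concrete, the sketch below converts an SLO into an error budget. It is illustrative only: the function names are hypothetical, and the 30-day month is an assumption, since this document does not define the measurement window more precisely than "monthly".

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed full-outage time, in minutes, for an availability SLO.

    window_days=30 is an assumed definition of "monthly".
    """
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be in (0, 1]")
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, total: int, failed: int) -> float:
    """Fraction of the request-based error budget still unspent.

    Matches the service_availability SLI (successful / total requests).
    A negative result means the budget is overspent.
    """
    allowed_failures = (1.0 - slo) * total
    if allowed_failures == 0:
        return 0.0 if failed == 0 else float("-inf")
    return 1.0 - failed / allowed_failures


if __name__ == "__main__":
    print(error_budget_minutes(0.999))            # minutes of budget per 30 days
    print(budget_remaining(0.999, 1_000_000, 250))  # share of budget left
```

Under these assumptions, 99.9 % over 30 days leaves roughly 43.2 minutes of full outage, or about 1,000 failed requests per million.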
3. Metrics (OpenTelemetry)
| Metric | Type | Labels |
|---|---|---|
| facility_http_request_duration_seconds | histogram | route, status_code, tenant |
| facility_context_cache_hits_total | counter | cache_key_class |
| facility_bed_status_transitions_total | counter | from, to, tenant |
| facility_outbox_lag_seconds | gauge | — |
| facility_cycle_rejections_total | counter | tenant |
| facility_node_count | gauge | tenant, type |
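Because facility_http_request_duration_seconds is a histogram, the latency SLIs have to be estimated from bucket counts rather than read directly. The sketch below shows one common approach, linear interpolation within cumulative buckets (the same idea as Prometheus-style quantile estimation). The bucket boundaries and counts are made up for illustration, and the function name is hypothetical.

```python
from typing import Sequence, Tuple


def histogram_quantile(q: float, buckets: Sequence[Tuple[float, int]]) -> float:
    """Estimate quantile q from cumulative (upper_bound_seconds, count) buckets.

    Buckets must be sorted by upper bound, with cumulative counts.
    The lowest bucket is assumed to start at 0 seconds.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be in [0, 1]")
    if not buckets:
        raise ValueError("at least one bucket is required")
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                # Cannot interpolate into the open-ended bucket.
                return prev_bound
            in_bucket = count - prev_count
            if in_bucket == 0:
                return bound
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return prev_bound


if __name__ == "__main__":
    # Hypothetical bucket layout: 5 ms, 10 ms, 20 ms, 50 ms, +Inf.
    sample = [(0.005, 900), (0.01, 980), (0.02, 995), (0.05, 1000), (float("inf"), 1000)]
    print(histogram_quantile(0.99, sample) * 1000, "ms")
```

The estimate is only as fine as the bucket layout, so a boundary at or near 20 ms (0.02 s) keeps the p99 SLO comparison meaningful.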
4. Dashboards
| Dashboard | Panels |
|---|---|
| Facility — Hot Path | context lookup p50/p95/p99, cache hit rate, RPS |
| Facility — Bed Operations | status transitions/min, occupancy by location, rejections |
| Facility — Outbox | unpublished rows, relay lag, errors |
| Facility — Tenant Health | per-tenant error rate, write volume, active node count |
5. Alerts
| Condition | Severity | Notify |
|---|---|---|
| context_lookup_p99 > 50ms 5m | warn | on-call facility |
| context_lookup_p99 > 100ms 5m | page | on-call facility |
| outbox_lag > 30s 10m | page | on-call facility |
| cache_hit_rate < 85% 15m | warn | on-call facility |
| bed double-book rejections > 20/min | page | clinical ops + on-call |
| cycle rejections > 50/min/tenant | warn | tenant admin notified |
| error_rate > 1% 5m | page | on-call facility |
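The sketch below shows one way to read a rule such as "context_lookup_p99 > 50ms 5m": fire only after the value has stayed above the threshold for the whole duration. That interpretation of the trailing duration is an assumption, and the class and field names are hypothetical. A real deployment would normally express this in the alerting system's own rule language.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SustainedAlert:
    """Fires once a value has stayed above threshold for hold_seconds."""

    threshold: float
    hold_seconds: float
    breach_started: Optional[float] = None  # timestamp of first breaching sample

    def observe(self, ts: float, value: float) -> bool:
        """Feed one sample (ts in seconds); return True while the alert is firing."""
        if value > self.threshold:
            if self.breach_started is None:
                self.breach_started = ts
            return ts - self.breach_started >= self.hold_seconds
        # Any sample at or below the threshold resets the hold timer.
        self.breach_started = None
        return False


if __name__ == "__main__":
    warn = SustainedAlert(threshold=50.0, hold_seconds=300.0)
    for minute in range(7):
        firing = warn.observe(ts=minute * 60.0, value=62.0)
        print(minute, firing)
```

Resetting on a single healthy sample keeps brief spikes from paging, at the cost of delaying alerts for flapping signals.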
6. Tracing
All inbound HTTP/gRPC calls carry traceparent context. Spans:
facility.hierarchy.create_node, facility.hierarchy.cycle_check
facility.location.create
facility.bed.transition
facility.cache.context_lookup
facility.outbox.publish
Trace sampling: 100% for 4xx/5xx; 10% for 2xx; 100% for internal:* endpoints during M0.
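The sampling policy above can be sketched as a keep/drop decision made once the response status is known, which implies tail-based sampling, since the status code does not exist when the trace starts. The sketch rests on several assumptions: "internal:*" is mapped to routes under the /internal/ path prefix, responses outside 2xx and 4xx/5xx (which the policy does not mention) get the same 10 % rate as 2xx, and the function name and m0 flag are hypothetical.

```python
def should_keep_trace(trace_id: str, status_code: int, route: str, m0: bool = True) -> bool:
    """Decide whether a finished trace is retained.

    trace_id is the 32-hex-character W3C trace id. Deriving the decision
    from it keeps the outcome stable for every span of the same trace.
    """
    if status_code >= 400:
        return True  # 100% of 4xx/5xx
    if m0 and route.startswith("/internal/"):
        return True  # 100% of internal endpoints during M0 (assumed prefix)
    # 10% of everything else; 1xx/3xx handling is an assumption.
    return int(trace_id, 16) % 100 < 10


if __name__ == "__main__":
    tid = "4bf92f3577b34da6a3ce929d0e0e4736"
    print(should_keep_trace(tid, 500, "/facility/nodes"))
    print(should_keep_trace(tid, 200, "/internal/nodes/42/context"))
    print(should_keep_trace(tid, 200, "/internal/nodes/42/context", m0=False))
```

Hashing on the trace id rather than drawing a random number means independent collectors reach the same decision for the same trace.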
7. Runbooks
(Runbook files are created alongside SRE onboarding; owner: SRE.)