Facility Service — Observability

Status: populated | Owner: TBD | Last updated: 2026-04-17 | Companion: SERVICE_TEMPLATE §10 · 03 Platform Services

1. SLIs

| SLI | Definition |
|---|---|
| context_lookup_latency_p99_ms | P99 latency of GET /internal/providers/:id/context and /internal/nodes/:id/context |
| hierarchy_read_latency_p95_ms | P95 latency of subtree/ancestor queries |
| bed_status_change_publish_lag_ms | Time from commit → NATS ack |
| outbox_lag_seconds | Oldest unpublished outbox row age |
| cache_hit_rate | Redis hit rate on node context |
| cycle_detection_rejections_per_min | Incidence of cycle violations |
| service_availability | Successful / total internal API requests |
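
As a rough illustration of how the percentile, ratio and lag SLIs above could be derived from raw samples, here is a minimal Python sketch. The function names, the sample values and the nearest-rank percentile method are assumptions for illustration only; the real aggregation (for example histogram-based quantiles in the metrics backend) may differ.

```python
import math
from datetime import datetime, timezone


def percentile_nearest_rank(samples, pct):
    """Return the pct-th percentile (0 < pct <= 100) using the nearest-rank method."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def ratio(numerator, denominator):
    """Ratio SLI such as cache_hit_rate or service_availability."""
    if denominator == 0:
        return None  # no traffic: the SLI is undefined rather than 0 or 1
    return numerator / denominator


def outbox_lag_seconds(oldest_unpublished_at, now):
    """Age of the oldest unpublished outbox row; 0 when the outbox is empty."""
    if oldest_unpublished_at is None:
        return 0.0
    return (now - oldest_unpublished_at).total_seconds()


# Hypothetical latency samples in milliseconds.
latencies_ms = [4, 5, 5, 6, 7, 8, 9, 12, 15, 40]
p99 = percentile_nearest_rank(latencies_ms, 99)
availability = ratio(9990, 10000)
```

With only ten samples the P99 is simply the slowest request, which is why tail-latency SLIs are usually evaluated over windows with enough traffic to be meaningful.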

2. SLOs

| SLO | Target |
|---|---|
| context_lookup_latency_p99 | ≤ 20 ms |
| hierarchy_read_latency_p95 | ≤ 100 ms (up to 1000 nodes) |
| bed_status publish lag p95 | ≤ 2 s |
| outbox lag p99 | ≤ 10 s |
| cache hit rate | ≥ 95 % |
| service_availability | ≥ 99.9 % monthly |
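
To make the availability target concrete: 99.9 % over a 30-day month leaves an error budget of 0.1 % of 43,200 minutes, or about 43.2 minutes. The sketch below works that out and checks measured values against the targets in the table. The dictionary keys, the units in those keys and the 30-day month are assumptions for illustration, not an existing config schema.

```python
import operator

# Targets copied from the SLO table; key names are illustrative only.
SLO_TARGETS = {
    "context_lookup_latency_p99_ms": (operator.le, 20),
    "hierarchy_read_latency_p95_ms": (operator.le, 100),
    "bed_status_publish_lag_p95_s": (operator.le, 2),
    "outbox_lag_p99_s": (operator.le, 10),
    "cache_hit_rate": (operator.ge, 0.95),
    "service_availability": (operator.ge, 0.999),
}


def error_budget_minutes(target, days=30):
    """Minutes of allowed unavailability for an availability target over `days`."""
    return (1 - target) * days * 24 * 60


def slo_violations(measured):
    """Return the sorted names of SLOs whose measured value misses its target."""
    return sorted(
        name
        for name, (compare, target) in SLO_TARGETS.items()
        if name in measured and not compare(measured[name], target)
    )


budget = error_budget_minutes(0.999)  # roughly 43.2 minutes in a 30-day month
missed = slo_violations({"context_lookup_latency_p99_ms": 25, "cache_hit_rate": 0.97})
```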

3. Metrics (OpenTelemetry)

| Metric | Type | Labels |
|---|---|---|
| facility_http_request_duration_seconds | histogram | route, status_code, tenant |
| facility_context_cache_hits_total | counter | cache_key_class |
| facility_bed_status_transitions_total | counter | from, to, tenant |
| facility_outbox_lag_seconds | gauge | |
| facility_cycle_rejections_total | counter | tenant |
| facility_node_count | gauge | tenant, type |
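
One way to keep instrumentation aligned with the table above is to encode it as data and check label sets against it before recording. The sketch below is plain Python, not the OpenTelemetry SDK; `METRIC_SPECS` and `check_labels` are hypothetical names, the label values are made up, and the empty label set for `facility_outbox_lag_seconds` reflects the blank Labels cell.

```python
# Metric name -> (instrument type, allowed label keys), taken from the table above.
METRIC_SPECS = {
    "facility_http_request_duration_seconds": ("histogram", {"route", "status_code", "tenant"}),
    "facility_context_cache_hits_total": ("counter", {"cache_key_class"}),
    "facility_bed_status_transitions_total": ("counter", {"from", "to", "tenant"}),
    "facility_outbox_lag_seconds": ("gauge", set()),
    "facility_cycle_rejections_total": ("counter", {"tenant"}),
    "facility_node_count": ("gauge", {"tenant", "type"}),
}


def check_labels(metric, labels):
    """Raise ValueError if `labels` does not match the metric's declared label keys."""
    if metric not in METRIC_SPECS:
        raise ValueError(f"unknown metric: {metric}")
    _, expected = METRIC_SPECS[metric]
    actual = set(labels)
    if actual != expected:
        missing = sorted(expected - actual)
        extra = sorted(actual - expected)
        raise ValueError(f"{metric}: missing={missing} extra={extra}")


# Passes silently: all three declared labels are present (values are hypothetical).
check_labels(
    "facility_bed_status_transitions_total",
    {"from": "available", "to": "occupied", "tenant": "t-1"},
)
```

A check like this could run in unit tests so that a new label (and the cardinality it brings) cannot be added without updating the spec.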

4. Dashboards

| Dashboard | Panels |
|---|---|
| Facility — Hot Path | context lookup p50/p95/p99, cache hit rate, RPS |
| Facility — Bed Operations | status transitions/min, occupancy by location, rejections |
| Facility — Outbox | unpublished rows, relay lag, errors |
| Facility — Tenant Health | per-tenant error rate, write volume, active node count |

5. Alerts

| Alert condition | Severity | Notify |
|---|---|---|
| context_lookup_p99 > 50ms 5m | warn | on-call facility |
| context_lookup_p99 > 100ms 5m | page | on-call facility |
| outbox_lag > 30s 10m | page | on-call facility |
| cache_hit_rate < 85% 15m | warn | on-call facility |
| bed double-book rejections > 20/min | page | clinical ops + on-call |
| cycle rejections > 50/min/tenant | warn | tenant admin notified |
| error_rate > 1% 5m | page | on-call facility |
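
The trailing durations in the alert conditions (5m, 10m, 15m) are read here as "sustained for" windows, which is an interpretation rather than something this page states. Under that assumption, a minimal sketch of the evaluation logic could look like the following; the function names and the sample format are hypothetical.

```python
def sustained_breach(samples, threshold, window_s, now, above=True):
    """True if every sample in the last `window_s` seconds breaches the threshold
    and the samples cover the whole window.

    samples: list of (unix_timestamp_seconds, value), in any order.
    above:   True for '>' alerts (latency, lag), False for '<' alerts (hit rate).
    """
    ordered = sorted(samples)
    # The window counts as covered only if some sample is at or before its start.
    covered = any(ts <= now - window_s for ts, _ in ordered)
    recent = [(ts, v) for ts, v in ordered if now - window_s <= ts <= now]
    if not covered or not recent:
        return False
    if above:
        return all(v > threshold for _, v in recent)
    return all(v < threshold for _, v in recent)


def context_lookup_severity(samples, now):
    """Map p99 samples (ms) to the two context_lookup_p99 rules; page beats warn."""
    if sustained_breach(samples, 100, 300, now):
        return "page"
    if sustained_breach(samples, 50, 300, now):
        return "warn"
    return None
```

Requiring the window to be fully covered avoids firing on a single bad sample right after a restart, at the cost of delaying the first alert by one window.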

6. Tracing

All inbound HTTP/gRPC calls carry a traceparent header (W3C Trace Context). Spans:

  • facility.hierarchy.create_node, facility.hierarchy.cycle_check
  • facility.location.create
  • facility.bed.transition
  • facility.cache.context_lookup
  • facility.outbox.publish
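
For reference, a version-00 traceparent value has four lowercase-hex fields: version, trace ID (32 hex), parent span ID (16 hex) and flags (2 hex). The sketch below parses one by hand to show the shape of the header; in practice an OpenTelemetry propagator would normally handle this, and `parse_traceparent` and `TraceParent` are hypothetical names.

```python
import re
from typing import NamedTuple, Optional

# Version 00 of the W3C Trace Context traceparent header: four lowercase-hex fields.
_TRACEPARENT_RE = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class TraceParent(NamedTuple):
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Parse a version-00 traceparent value; return None if it is malformed."""
    match = _TRACEPARENT_RE.fullmatch(header.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero IDs are invalid per the spec.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    # Bit 0 of the flags field is the "sampled" flag.
    return TraceParent(trace_id, parent_id, bool(int(flags, 16) & 0x01))


ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
```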

Trace sampling: 100% for 4xx/5xx; 10% for 2xx; 100% for internal:* endpoints during M0.
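
Because the sampling rule depends on the response status, the keep/drop decision presumably has to be made once the status code is known. A minimal sketch of that decision, under stated assumptions: `keep_trace` and its parameters are hypothetical, the caller decides what counts as an internal endpoint and whether M0 is active, and the rate for status classes the rule does not mention (1xx, 3xx) is an assumed default.

```python
import random


def keep_trace(status_code, is_internal, m0_active, rng=random.random, default_rate=0.10):
    """Decide whether to keep a finished trace under the sampling rule above.

    default_rate covers status classes the rule does not mention (1xx, 3xx);
    using 10% for them is an assumption, not something this document specifies.
    """
    if is_internal and m0_active:
        return True                  # 100% for internal endpoints during M0
    if 400 <= status_code <= 599:
        return True                  # 100% for 4xx/5xx
    if 200 <= status_code <= 299:
        return rng() < 0.10          # 10% for 2xx
    return rng() < default_rate      # unspecified classes: assumed default


# Injecting rng makes the probabilistic branch deterministic in tests.
kept = keep_trace(200, is_internal=False, m0_active=False, rng=lambda: 0.05)
```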

7. Runbooks

(Runbook files are created alongside SRE onboarding; owner: SRE.)