Care Plan Service — Observability

Status: populated
Owner: TBD
Last updated: 2026-04-18
Companion: Service Template · 03 platform-services · 02 DDD

SLIs and SLOs

SLI | Target (SLO) | Measurement window
API availability (all endpoints) | 99.9% | 30-day rolling
Care plan read p95 latency | < 500 ms | 1-hour rolling
Care plan write p95 latency | < 800 ms | 1-hour rolling
FHIR CarePlan read p95 latency | < 1000 ms | 1-hour rolling
Outbox relay lag (event publish) | < 5 s (p95) | 5-min rolling
Error rate (5xx) | < 0.1% | 30-day rolling
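
To make the 30-day targets concrete, the availability and error-rate SLOs can be read as an error budget. The sketch below is illustrative arithmetic only: the function names and the traffic figures are hypothetical and are not part of the service.

```python
from datetime import timedelta


def _check_slo(slo: float) -> None:
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")


def error_budget(slo: float, window: timedelta) -> timedelta:
    """Time the SLI may spend out of target within the window."""
    _check_slo(slo)
    return window * (1.0 - slo)


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the request error budget still unspent (negative if overspent)."""
    _check_slo(slo)
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = (1.0 - slo) * total_requests
    return 1.0 - failed_requests / allowed_failures


# 99.9% availability over a 30-day rolling window.
downtime_budget = error_budget(0.999, timedelta(days=30))
print(f"Downtime budget: {downtime_budget.total_seconds() / 60:.1f} minutes")

# Hypothetical traffic: 1,000,000 requests in the window, 250 of them 5xx.
# An error rate target of < 0.1% corresponds to a 99.9% success objective.
remaining = budget_remaining(0.999, 1_000_000, 250)
print(f"Error budget remaining: {remaining:.0%}")
```

A budget expressed this way gives a simple yardstick for how much of the 30-day allowance an incident has consumed.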

OpenTelemetry Instrumentation

  • Traces: Every request is traced end to end, from Kong → controller → use case → repository. Span names follow the pattern care-plan-service.{operation}.
  • Metrics: Prometheus-compatible metrics are emitted via the OpenTelemetry SDK and exported to Grafana via OTLP.
  • Logs: Structured JSON logs carry tenantId, carePlanId, actorId, traceId, and spanId. Log messages contain no PHI (IDs only).
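
As an illustration of the log shape described above, here is a minimal sketch using only Python's standard logging module. It is not the service's actual logger: the formatter class, the logger setup, the timestamp, level, and message keys, and all ID values are assumptions made for the example. Only the five context field names come from this page.

```python
import io
import json
import logging
from datetime import datetime, timezone

# Context fields named in this document; everything else here is hypothetical.
CONTEXT_FIELDS = ("tenantId", "carePlanId", "actorId", "traceId", "spanId")


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Values passed via `extra=` become attributes on the record.
        for field in CONTEXT_FIELDS:
            entry[field] = getattr(record, field, None)
        return json.dumps(entry)


def build_logger(stream) -> logging.Logger:
    logger = logging.getLogger("care-plan-service")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False
    return logger


buffer = io.StringIO()
log = build_logger(buffer)

# The message names the event; identifiers travel as fields, never PHI.
log.info(
    "care plan updated",
    extra={
        "tenantId": "tenant-001",
        "carePlanId": "cp-42",
        "actorId": "user-7",
        "traceId": "4bf92f3577b34da6a3ce929d0e0e4736",
        "spanId": "00f067aa0ba902b7",
    },
)
print(buffer.getvalue(), end="")
```

Keeping identifiers in dedicated fields rather than interpolating them into the message makes it easier to audit that no patient data reaches the log text.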

Key Metrics

Metric | Type | Labels
care_plan_requests_total | Counter | method, route, status_code, tenant_id
care_plan_request_duration_seconds | Histogram | method, route, tenant_id
care_plan_outbox_pending_count | Gauge | tenant_id
care_plan_outbox_publish_lag_seconds | Histogram | event_type
care_plan_version_conflicts_total | Counter | tenant_id
care_plan_status_transitions_total | Counter | from_status, to_status, tenant_id
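
The p95 latency SLIs are typically derived from a histogram such as care_plan_request_duration_seconds. The sketch below shows one common way to estimate a quantile from cumulative bucket counts, using linear interpolation within the bucket that contains the target rank. The bucket boundaries and counts are invented for the example; this page does not specify the service's bucket layout.

```python
import math
from typing import List, Tuple

Bucket = Tuple[float, int]  # (upper bound in seconds, cumulative count)


def estimate_quantile(q: float, buckets: List[Bucket]) -> float:
    """Estimate the q-quantile from cumulative buckets sorted by upper bound.

    The last bucket is expected to have an upper bound of +inf.
    """
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, cumulative in buckets:
        if cumulative >= rank:
            if math.isinf(upper_bound):
                # No upper edge to interpolate towards; report the last finite bound.
                return lower_bound
            in_bucket = cumulative - lower_count
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, cumulative
    return lower_bound


# Hypothetical read-latency buckets for one 1-hour window.
read_buckets = [(0.1, 600), (0.25, 850), (0.5, 960), (1.0, 995), (math.inf, 1000)]
p95 = estimate_quantile(0.95, read_buckets)
print(f"Estimated read p95: {p95 * 1000:.0f} ms (target < 500 ms)")
```

Because the estimate is interpolated, its accuracy depends on how closely the bucket boundaries bracket the SLO thresholds (500 ms, 800 ms, 1000 ms).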

Dashboards

Dashboard | Purpose
care-plan-service/overview | Request rate, error rate, latency percentiles by tenant
care-plan-service/outbox | Outbox pending count, publish lag, DLQ age
care-plan-service/domain | Status transitions, overdue goals count, review compliance

Alerts

Alert | Condition | Severity | Runbook
CarePlanServiceHighErrorRate | 5xx rate > 1% over 5 min | Critical | /runbooks/care-plan-service/high-error-rate.md
CarePlanServiceLatencyHigh | p95 write > 2 s over 10 min | Warning | /runbooks/care-plan-service/latency-high.md
CarePlanOutboxLagHigh | Outbox pending > 100 rows for > 5 min | Warning | /runbooks/care-plan-service/outbox-lag.md
CarePlanOutboxStuck | Outbox pending > 500 rows for > 15 min | Critical | /runbooks/care-plan-service/outbox-stuck.md
CarePlanServiceDown | No successful health checks for > 2 min | Critical | /runbooks/care-plan-service/service-down.md
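
Several conditions above have the form "value above a threshold for longer than some duration". The sketch below illustrates that semantics, assuming the condition must hold continuously and that any sample at or below the threshold resets the timer. The class name and sample values are hypothetical; the real rules presumably live in the alerting system rather than in application code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SustainedThreshold:
    """Fires once a value has stayed above `threshold` for more than `hold_seconds`."""

    threshold: float
    hold_seconds: float
    breached_since: Optional[float] = None

    def observe(self, timestamp: float, value: float) -> bool:
        if value <= self.threshold:
            # Condition cleared: forget the earlier breach.
            self.breached_since = None
            return False
        if self.breached_since is None:
            self.breached_since = timestamp
        return timestamp - self.breached_since > self.hold_seconds


# CarePlanOutboxLagHigh: pending > 100 rows for > 5 min (300 s).
outbox_lag_high = SustainedThreshold(threshold=100, hold_seconds=300)

# Hypothetical (timestamp in seconds, pending row count) samples.
samples = [(0, 150), (180, 120), (301, 130), (360, 90), (400, 200)]
firing = [outbox_lag_high.observe(ts, pending) for ts, pending in samples]
print(firing)
```

In this run the alert fires only at the third sample, clears when the pending count drops to 90, and does not re-fire immediately on the next breach because the hold period starts over.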

Health Endpoints

Endpoint | Returns | Purpose
GET /health/live | { "status": "ok" } | Liveness probe
GET /health/ready | { "status": "ok", "db": "ok", "nats": "ok" } | Readiness probe
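
One way the readiness payload could be assembled is to run a check per dependency and report "ok" overall only when every check passes. The following is a sketch under assumptions rather than the service's implementation: the "error" value, the 200/503 status codes, and the check callables are hypothetical, since this page documents only the healthy responses.

```python
from typing import Callable, Dict, Tuple

Check = Callable[[], bool]


def liveness() -> Tuple[int, dict]:
    """The process is up; no dependencies are consulted."""
    return 200, {"status": "ok"}


def readiness(checks: Dict[str, Check]) -> Tuple[int, dict]:
    """Run each dependency check and build the readiness response."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = "ok" if check() else "error"
        except Exception:
            # A check that raises is treated as a failed dependency.
            results[name] = "error"
    healthy = all(state == "ok" for state in results.values())
    body = {"status": "ok" if healthy else "error", **results}
    return (200 if healthy else 503), body


def failing_nats_check() -> bool:
    # Stand-in for a real connectivity probe (assumption).
    raise ConnectionError("nats unreachable")


print(readiness({"db": lambda: True, "nats": lambda: True}))
print(readiness({"db": lambda: True, "nats": failing_nats_check}))
```

Keeping the liveness handler free of dependency checks helps avoid restarting a healthy process merely because the database or NATS is temporarily unavailable.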