Care Plan Service — Observability
Status: populated | Owner: TBD | Last updated: 2026-04-18 | Companion: Service Template · 03 platform-services · 02 DDD
SLIs and SLOs
| SLI | Target (SLO) | Measurement window |
|---|---|---|
| API availability (all endpoints) | 99.9% | 30-day rolling |
| Care plan read p95 latency | < 500 ms | 1-hour rolling |
| Care plan write p95 latency | < 800 ms | 1-hour rolling |
| FHIR CarePlan read p95 latency | < 1000 ms | 1-hour rolling |
| Outbox relay lag p95 (event publish) | < 5 s | 5-min rolling |
| Error rate (5xx) | < 0.1% | 30-day rolling |
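
To make the availability target concrete, the sketch below turns an SLO into an error budget. It is illustrative only: the `Slo` class and both helper functions are hypothetical names, not part of the service, and the request counts in the usage lines are made-up inputs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    name: str
    target: float      # e.g. 0.999 for 99.9%
    window_days: int   # length of the rolling window


def error_budget_minutes(slo: Slo) -> float:
    """Minutes of full unavailability the window tolerates."""
    window_minutes = slo.window_days * 24 * 60
    return window_minutes * (1.0 - slo.target)


def budget_remaining(slo: Slo, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative means overspent)."""
    if total_requests == 0:
        return 1.0
    allowed_failures = total_requests * (1.0 - slo.target)
    return 1.0 - failed_requests / allowed_failures


# Values taken from the availability row of the table above.
availability = Slo("API availability", target=0.999, window_days=30)
budget = error_budget_minutes(availability)

# Hypothetical traffic figures, purely for illustration.
remaining = budget_remaining(availability, total_requests=1_000_000, failed_requests=250)
```

With a 99.9% target over 30 days, the budget works out to roughly 43.2 minutes of full downtime per window.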
OpenTelemetry Instrumentation
- Traces: Every request is traced end to end: Kong → controller → use case → repository. Span names follow care-plan-service.{operation}.
- Metrics: Prometheus-compatible, emitted via the OpenTelemetry SDK and exported to Grafana via OTLP.
- Logs: Structured JSON logs with tenantId, carePlanId, actorId, traceId, spanId. No PHI in log messages (IDs only).
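
One way to enforce the IDs-only logging rule is an allowlist at the formatter level. The sketch below uses only the Python standard library; the class name `IdOnlyJsonFormatter`, the `build_logger` helper, and the `level` and `message` keys are assumptions for illustration, not the service's actual logging setup.

```python
import json
import logging

# Correlation fields named in the logging bullet above; any other extra field is dropped.
ALLOWED_FIELDS = ("tenantId", "carePlanId", "actorId", "traceId", "spanId")


class IdOnlyJsonFormatter(logging.Formatter):
    """Render a record as one JSON object carrying only allowlisted ID fields."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
        }
        for field in ALLOWED_FIELDS:
            value = getattr(record, field, None)
            if value is not None:
                payload[field] = value
        return json.dumps(payload)


def build_logger(name: str = "care-plan-service") -> logging.Logger:
    """Hypothetical helper: attach the formatter to a stream handler once."""
    logger = logging.getLogger(name)
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(IdOnlyJsonFormatter())
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


# Usage: fields outside the allowlist (here, patientName) never reach the output.
log = build_logger()
log.info(
    "care plan updated",
    extra={"tenantId": "t-1", "carePlanId": "cp-42", "patientName": "dropped"},
)
```

The allowlist only filters structured fields. It does not inspect the message string, so callers still have to keep PHI out of the message text itself.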
Key Metrics
| Metric | Type | Labels |
|---|---|---|
| care_plan_requests_total | Counter | method, route, status_code, tenant_id |
| care_plan_request_duration_seconds | Histogram | method, route, tenant_id |
| care_plan_outbox_pending_count | Gauge | tenant_id |
| care_plan_outbox_publish_lag_seconds | Histogram | event_type |
| care_plan_version_conflicts_total | Counter | tenant_id |
| care_plan_status_transitions_total | Counter | from_status, to_status, tenant_id |
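
The p95 latency SLIs are derived from histogram buckets such as those of care_plan_request_duration_seconds. The sketch below shows, in simplified form, how a quantile can be estimated from cumulative buckets by linear interpolation, similar in spirit to Prometheus's `histogram_quantile`. The function name, bucket boundaries, and counts are all assumptions chosen for illustration.

```python
from typing import Sequence, Tuple

# Each bucket is (upper_bound_seconds, cumulative_count), sorted by upper bound,
# mirroring the cumulative "le" buckets of a Prometheus-style histogram.
Bucket = Tuple[float, int]


def estimate_quantile(q: float, buckets: Sequence[Bucket]) -> float:
    """Estimate the q-quantile by linear interpolation inside the matching bucket."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be within [0, 1]")
    if not buckets or buckets[-1][1] == 0:
        raise ValueError("no observations")

    total = buckets[-1][1]
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, cumulative in buckets:
        if cumulative >= rank:
            if upper_bound == float("inf"):
                # Cannot interpolate into the open-ended bucket;
                # report the highest finite bound instead.
                return lower_bound
            in_bucket = cumulative - lower_count
            if in_bucket == 0:
                return upper_bound
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, cumulative
    return lower_bound


# Hypothetical bucket layout and counts for care plan reads.
read_buckets = [(0.1, 600), (0.25, 900), (0.5, 980), (1.0, 1000), (float("inf"), 1000)]
p95_seconds = estimate_quantile(0.95, read_buckets)
meets_read_slo = p95_seconds < 0.5  # read SLO: p95 < 500 ms
```

Because the estimate interpolates within a bucket, its accuracy depends on bucket boundaries sitting close to the SLO thresholds (0.5 s, 0.8 s, 1 s).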
Dashboards
| Dashboard | Purpose |
|---|---|
| care-plan-service/overview | Request rate, error rate, latency percentiles by tenant |
| care-plan-service/outbox | Outbox pending count, publish lag, DLQ age |
| care-plan-service/domain | Status transitions, overdue goals count, review compliance |
Alerts
| Alert | Condition | Severity | Runbook |
|---|---|---|---|
| CarePlanServiceHighErrorRate | 5xx rate > 1% over 5 min | Critical | /runbooks/care-plan-service/high-error-rate.md |
| CarePlanServiceLatencyHigh | p95 write > 2 s over 10 min | Warning | /runbooks/care-plan-service/latency-high.md |
| CarePlanOutboxLagHigh | Outbox pending > 100 rows for > 5 min | Warning | /runbooks/care-plan-service/outbox-lag.md |
| CarePlanOutboxStuck | Outbox pending > 500 rows for > 15 min | Critical | /runbooks/care-plan-service/outbox-stuck.md |
| CarePlanServiceDown | No successful health checks for > 2 min | Critical | /runbooks/care-plan-service/service-down.md |
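
The outbox alerts combine a threshold with a hold duration ("> 100 rows for > 5 min"). The sketch below illustrates that semantics over a series of gauge samples. It is not how the alerting backend evaluates rules; the function name `firing_since` and the sample data are hypothetical, and a 60-second scrape interval is assumed.

```python
from typing import Iterable, Optional, Tuple

Sample = Tuple[float, float]  # (timestamp_seconds, value)


def firing_since(
    samples: Iterable[Sample], threshold: float, hold_seconds: float
) -> Optional[float]:
    """Return the timestamp at which the alert would start firing, or None.

    The value must stay above the threshold, without interruption,
    for longer than hold_seconds.
    """
    breach_start: Optional[float] = None
    for ts, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = ts
            if ts - breach_start > hold_seconds:
                return ts
        else:
            # Any sample at or below the threshold resets the hold timer.
            breach_start = None
    return None


# Hypothetical care_plan_outbox_pending_count samples, one per minute.
pending = [
    (0, 50), (60, 120), (120, 130), (180, 90), (240, 150), (300, 160),
    (360, 170), (420, 180), (480, 190), (540, 200), (600, 210),
]

lag_high = firing_since(pending, threshold=100, hold_seconds=5 * 60)   # CarePlanOutboxLagHigh
stuck = firing_since(pending, threshold=500, hold_seconds=15 * 60)     # CarePlanOutboxStuck
```

In this series the dip to 90 rows at t=180 resets the timer, so the warning fires only after the second breach has lasted more than five minutes, and the critical alert never fires.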
Health Endpoints
| Endpoint | Returns |
|---|---|
| GET /health/live | { "status": "ok" } (liveness probe) |
| GET /health/ready | { "status": "ok", "db": "ok", "nats": "ok" } (readiness probe) |
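
A readiness response of this shape can be assembled from one check per dependency. The sketch below is framework-agnostic and illustrative: the `readiness` function, the `"fail"` value, and the 200/503 status codes are assumptions, since the source documents only the healthy payload.

```python
from typing import Callable, Dict, Tuple

Check = Callable[[], bool]


def readiness(checks: Dict[str, Check]) -> Tuple[int, Dict[str, str]]:
    """Run each dependency check and build the /health/ready response."""
    body: Dict[str, str] = {}
    for name, check in checks.items():
        try:
            healthy = check()
        except Exception:
            # A check that raises is treated the same as one that returns False.
            healthy = False
        body[name] = "ok" if healthy else "fail"  # "fail" is an assumed value

    all_ok = all(state == "ok" for state in body.values())
    # "status" goes first to match the documented payload shape.
    payload = {"status": "ok" if all_ok else "fail", **body}
    # Assumption: 200 when ready, 503 otherwise, so a probe can act on the code alone.
    return (200 if all_ok else 503), payload


def nats_down() -> bool:
    """Stand-in for a real connectivity check that fails."""
    raise ConnectionError("nats unreachable")


ready_code, ready_body = readiness({"db": lambda: True, "nats": lambda: True})
degraded_code, degraded_body = readiness({"db": lambda: True, "nats": nats_down})
```

Keeping the liveness probe free of dependency checks, as the table above does, avoids restarting the service when only the database or NATS is unavailable.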