
Scheduling Service — Observability

Status: populated | Owner: TBD | Last updated: 2026-04-17 | Companion: Service Template · 12 observability

1. SLIs and SLOs

SLI	SLO Target	Metric
Availability search p95 latency	< 1 000 ms	`scheduling.slot.availability.duration_p95`
Appointment create p95 latency	< 500 ms	`scheduling.appointment.create.duration_p95`
Reminder dispatch success rate	≥ 99%	`scheduling.reminder.dispatch_success_rate`
Service availability	≥ 99.9% (30-day)	Uptime probe
Outbox publish success rate	≥ 99.5%	`scheduling.outbox.publish_success_rate`
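
To make the 99.9% (30-day) availability target concrete, it helps to convert it into an error budget. The following is a minimal sketch of that arithmetic; the function names are illustrative and are not part of the service.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - slo_percent) / 100.0


def budget_remaining(slo_percent: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative once exhausted)."""
    budget = error_budget_minutes(slo_percent, window_days)
    return (budget - downtime_minutes) / budget


# 30 days is 43 200 minutes, so a 99.9% target allows about 43.2 minutes down.
print(round(error_budget_minutes(99.9), 1))
```

Under this reading, a single 20-minute outage would consume a little under half of the monthly budget.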

2. OpenTelemetry Instrumentation

Signal	Key names
Traces	`scheduling.bookAppointment`, `scheduling.searchAvailability`, `scheduling.cancelAppointment`, `scheduling.dispatchReminder`
Metrics	`scheduling_appointments_created_total`, `scheduling_cancellations_total`, `scheduling_noshows_total`, `scheduling_reminders_sent_total`, `scheduling_outbox_lag_seconds`
Logs	Structured JSON; `appointmentId`, `tenantId`, `actorId`; no PHI in log message bodies
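
The logging row above can be illustrated with a small sketch using only the Python standard library. It assumes the three context fields are attached through `extra=` and that the message body carries only a generic event description, never PHI. The `JsonFormatter` class and `build_logger` helper are hypothetical names, not the service's actual logging setup.

```python
import json
import logging

# Context fields named in the table above; anything else is dropped on purpose.
CONTEXT_FIELDS = ("appointmentId", "tenantId", "actorId")


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with the allowed context fields."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),  # keep this free of PHI
        }
        for field in CONTEXT_FIELDS:
            value = getattr(record, field, None)
            if value is not None:
                payload[field] = value
        return json.dumps(payload)


def build_logger(name: str = "scheduling") -> logging.Logger:
    """Hypothetical helper: a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


logger = build_logger()
logger.info(
    "appointment created",
    extra={"appointmentId": "apt-123", "tenantId": "t-1", "actorId": "u-9"},
)
```

Restricting output to an allow-list of fields is one way to reduce the chance of PHI leaking through ad hoc attributes; it does not inspect the message text itself.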

3. Dashboards

Dashboard	Panels
Scheduling Overview	Booking rate, cancellation rate, no-show rate, waitlist size
Performance	p50/p95/p99 for availability search, booking
Reminder Pipeline	Dispatch rate, retry rate, failure rate by channel
Event Health	Outbox lag, publish success/failure

4. Alerts

Alert	Threshold	Severity	Runbook
Availability search p95 > 1 000 ms	5-min sustained	Warning	`runbooks/scheduling-slow-search.md`
Service error rate > 1%	5-min window	Critical	`runbooks/scheduling-error-spike.md`
Reminder dispatch failure rate > 5%	10-min window	Warning	`runbooks/scheduling-reminder-failure.md`
Outbox lag > 30 s	Any time	Warning	`runbooks/scheduling-outbox-lag.md`
Pod crash loop	2 restarts / 10 min	Critical	`runbooks/scheduling-pod-crash.md`
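
One possible reading of a "5-min sustained" threshold is that the alert fires only after the metric has stayed above the limit continuously for five minutes, and resets as soon as one sample falls back under it. The sketch below illustrates that interpretation; the `SustainedThreshold` class is hypothetical, and a real deployment would normally express this in the alerting system's own rule language.

```python
from typing import Optional


class SustainedThreshold:
    """Fire once a value has exceeded `threshold` for `sustain_seconds`."""

    def __init__(self, threshold: float, sustain_seconds: float) -> None:
        self.threshold = threshold
        self.sustain_seconds = sustain_seconds
        self._breach_started: Optional[float] = None

    def observe(self, timestamp: float, value: float) -> bool:
        """Record one sample; return True if the alert should be firing."""
        if value <= self.threshold:
            # Any sample at or below the limit ends the breach.
            self._breach_started = None
            return False
        if self._breach_started is None:
            self._breach_started = timestamp
        return timestamp - self._breach_started >= self.sustain_seconds


# Availability search p95 sampled once a minute, always above 1 000 ms.
slow_search = SustainedThreshold(threshold=1000.0, sustain_seconds=300.0)
states = [slow_search.observe(t, 1200.0) for t in range(0, 301, 60)]
print(states)  # the last sample, 300 s after the breach began, is the first True
```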

5. Health Endpoints

Endpoint	Purpose
`GET /health/live`	Liveness probe
`GET /health/ready`	Readiness probe: passes when the DB and NATS connections are up
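
As an illustration of how the two probes differ, here is a minimal standard-library sketch. It assumes liveness always returns 200 while the process is running, and readiness returns 503 when any dependency check fails. The status codes, the response body shape, the port, and the `db_connected` / `nats_connected` stubs are all assumptions rather than the service's actual behaviour.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable, Dict, Tuple


def db_connected() -> bool:
    """Hypothetical stub; a real check would ping the database."""
    return True


def nats_connected() -> bool:
    """Hypothetical stub; a real check would inspect the NATS client state."""
    return True


DEFAULT_CHECKS: Dict[str, Callable[[], bool]] = {
    "db": db_connected,
    "nats": nats_connected,
}


def health_response(path: str,
                    checks: Dict[str, Callable[[], bool]] = DEFAULT_CHECKS
                    ) -> Tuple[int, dict]:
    """Map a request path to an HTTP status code and a JSON-serialisable body."""
    if path == "/health/live":
        return 200, {"status": "ok"}
    if path == "/health/ready":
        results = {name: check() for name, check in checks.items()}
        ready = all(results.values())
        status = "ok" if ready else "unavailable"
        return (200 if ready else 503), {"status": status, "checks": results}
    return 404, {"status": "not_found"}


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        status, body = health_response(self.path)
        payload = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port: int = 8080) -> None:
    """Blocking helper; call explicitly to start the probe server."""
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

Keeping dependency checks out of the liveness path means a database or NATS outage removes the pod from rotation without triggering a restart.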
