Scheduling Service — Observability

Status: populated
Owner: TBD
Last updated: 2026-04-17
Companion: Service Template · 12 observability

1. SLIs and SLOs

| SLI | SLO Target | Metric |
| --- | --- | --- |
| Availability search p95 latency | < 1 000 ms | scheduling.slot.availability.duration_p95 |
| Appointment create p95 latency | < 500 ms | scheduling.appointment.create.duration_p95 |
| Reminder dispatch success rate | ≥ 99% | scheduling.reminder.dispatch_success_rate |
| Service availability | ≥ 99.9% (30-day) | Uptime probe |
| Outbox publish success rate | ≥ 99.5% | scheduling.outbox.publish_success_rate |
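
To make the availability target concrete, the sketch below converts an SLO into an error budget. It assumes the 30-day window is a rolling window measured in wall-clock minutes; the `Slo` class and both helper functions are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    name: str
    target: float      # e.g. 0.999 for 99.9%
    window_days: int   # assumed rolling window length


def error_budget_minutes(slo: Slo) -> float:
    """Minutes of allowed unavailability over the SLO window."""
    window_minutes = slo.window_days * 24 * 60
    return (1.0 - slo.target) * window_minutes


def budget_remaining(slo: Slo, downtime_minutes: float) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo)
    return (budget - downtime_minutes) / budget


# The availability SLO from the table above: 99.9% over 30 days.
availability = Slo("Service availability", target=0.999, window_days=30)
budget = error_budget_minutes(availability)
```

Under these assumptions the 99.9% target allows roughly 43.2 minutes of downtime per 30 days, so a single 22-minute outage would consume about half of the budget.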

2. OpenTelemetry Instrumentation

| Signal | Key names |
| --- | --- |
| Traces | scheduling.bookAppointment, scheduling.searchAvailability, scheduling.cancelAppointment, scheduling.dispatchReminder |
| Metrics | scheduling_appointments_created_total, scheduling_cancellations_total, scheduling_noshows_total, scheduling_reminders_sent_total, scheduling_outbox_lag_seconds |
| Logs | Structured JSON; appointmentId, tenantId, actorId; no PHI in log message bodies |
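
One way to enforce the logging rule is an allow-list: only the three context fields named above are copied into the JSON output, and anything else attached to a record is dropped. The sketch below uses only the Python standard library. `JsonFormatter`, `build_logger`, and the output field names (`level`, `logger`, `message`) are assumptions for illustration, not the service's actual logging setup. The message string itself is not inspected, so callers still have to keep PHI out of it.

```python
import json
import logging

# Context fields named in the table above; everything else is dropped.
ALLOWED_CONTEXT_FIELDS = ("appointmentId", "tenantId", "actorId")


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with allow-listed context fields."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        for field in ALLOWED_CONTEXT_FIELDS:
            value = getattr(record, field, None)
            if value is not None:
                payload[field] = value
        return json.dumps(payload)


def build_logger(name: str = "scheduling") -> logging.Logger:
    """Return a logger that writes one JSON line per record to stderr."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


logger = build_logger()
logger.info(
    "appointment booked",
    extra={
        "appointmentId": "appt-123",
        "tenantId": "tenant-9",
        "actorId": "user-42",
        "patientName": "not on the allow-list, so never serialized",
    },
)
```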

3. Dashboards

| Dashboard | Panels |
| --- | --- |
| Scheduling Overview | Booking rate, cancellation rate, no-show rate, waitlist size |
| Performance | p50/p95/p99 for availability search, booking |
| Reminder Pipeline | Dispatch rate, retry rate, failure rate by channel |
| Event Health | Outbox lag, publish success/failure |
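
For readers less familiar with the percentile panels, the sketch below computes p50/p95/p99 from raw duration samples using the nearest-rank method. It is illustrative only: a metrics backend typically derives percentiles from histogram buckets rather than raw samples, and the `percentile` helper and the sample data are hypothetical.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical availability-search durations in milliseconds: 10, 20, ..., 200.
durations_ms = list(range(10, 210, 10))
summary = {f"p{p}": percentile(durations_ms, p) for p in (50, 95, 99)}
```

With these 20 samples the summary is p50 = 100 ms, p95 = 190 ms, and p99 = 200 ms, which shows how the tail percentiles sit close to the slowest requests.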

4. Alerts

| Alert | Threshold | Severity | Runbook |
| --- | --- | --- | --- |
| Availability search p95 > 1 000 ms | 5-min sustained | Warning | runbooks/scheduling-slow-search.md |
| Service error rate > 1% | 5-min window | Critical | runbooks/scheduling-error-spike.md |
| Reminder dispatch failure rate > 5% | 10-min window | Warning | runbooks/scheduling-reminder-failure.md |
| Outbox lag > 30 s | Any time | Warning | runbooks/scheduling-outbox-lag.md |
| Pod crash loop | 2 restarts / 10 min | Critical | runbooks/scheduling-pod-crash.md |
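
The "5-min sustained" condition can be read in more than one way. The sketch below assumes it means that every sample in the trailing five minutes exceeds the threshold, and that an empty window does not fire. `Sample` and `sustained_breach` are hypothetical names; in practice the alerting backend evaluates such rules, not application code.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Sample:
    timestamp_s: float  # seconds since an arbitrary epoch
    value: float        # e.g. p95 latency in milliseconds


def sustained_breach(
    samples: Sequence[Sample], threshold: float, window_s: float, now_s: float
) -> bool:
    """True when every sample in the trailing window exceeds the threshold.

    Returns False for an empty window so that missing data does not fire the alert.
    """
    recent = [s for s in samples if now_s - window_s <= s.timestamp_s <= now_s]
    return bool(recent) and all(s.value > threshold for s in recent)


# One sample per minute over five minutes, all above the 1 000 ms threshold.
slow = [Sample(float(t), 1200.0) for t in range(0, 301, 60)]
fires = sustained_breach(slow, threshold=1000.0, window_s=300.0, now_s=300.0)
```

An "Any time" rule such as outbox lag > 30 s is the degenerate case: a single sample above the threshold is enough, so no window is needed.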

5. Health Endpoints

| Endpoint | Purpose |
| --- | --- |
| GET /health/live | Liveness probe |
| GET /health/ready | Readiness probe (DB + NATS connected) |
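
A minimal sketch of the two endpoints, using only the Python standard library, is shown below. The response bodies, the 503 status for a failed readiness check, and the `db_connected` / `nats_connected` stubs are assumptions for illustration; the real service would ping its database and NATS connection instead.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable, Dict, Tuple


# Hypothetical dependency checks; real ones would ping the database and NATS.
def db_connected() -> bool:
    return True


def nats_connected() -> bool:
    return True


READINESS_CHECKS: Dict[str, Callable[[], bool]] = {
    "db": db_connected,
    "nats": nats_connected,
}


def health_response(path: str, checks: Dict[str, Callable[[], bool]]) -> Tuple[int, dict]:
    """Map a request path to an HTTP status code and a JSON-serializable body."""
    if path == "/health/live":
        # Liveness only says the process is up; it checks no dependencies.
        return 200, {"status": "ok"}
    if path == "/health/ready":
        results = {name: bool(check()) for name, check in checks.items()}
        ready = all(results.values())
        return (200 if ready else 503), {
            "status": "ok" if ready else "unavailable",
            "checks": results,
        }
    return 404, {"status": "not found"}


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        status, body = health_response(self.path, READINESS_CHECKS)
        payload = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port: int = 8080) -> None:
    """Block and serve health endpoints on localhost (port is an assumption)."""
    HTTPServer(("127.0.0.1", port), HealthHandler).serve_forever()
```

Calling `serve()` starts the server. Keeping liveness free of dependency checks means a database or NATS outage marks the pod as not ready without causing it to be restarted.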