Communication Service — Observability

Status: populated | Owner: TBD | Last updated: 2026-04-17 | Companion: Service Template

1. SLIs / SLOs

| SLI | Definition | SLO (rolling 30d) |
| --- | --- | --- |
| Message send latency | p95 of POST /threads/{id}/messages | < 1 s |
| Message send availability | successful 2xx over all non-4xx | ≥ 99.9% |
| Notification dispatch latency (critical category) | time from intent enqueue to provider accept, p95 | < 5 s |
| Notification dispatch latency (routine) | same, p95 | < 60 s |
| Delivery success (SMS) | delivered / dispatched | ≥ 97% per tenant |
| Delivery success (push) | delivered / dispatched | ≥ 95% per tenant |
| Delivery success (email) | delivered / dispatched | ≥ 98% per tenant |
| Virtual-session start success | started / created | ≥ 98% |
| Virtual-session fallback rate | fallback / failed | ≥ 95% (a fallback should spawn whenever infra fails) |
| Outbox lag | seconds between insert and publish, p95 | < 2 s |
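
As a rough illustration of how two of the SLIs above could be evaluated from raw samples, the sketch below computes send availability (2xx over all non-4xx) and a p95 latency. The function names, the sample data, and the nearest-rank percentile method are assumptions made for the example; a real deployment would more likely derive these from histograms in the metrics backend.

```python
def availability(status_codes):
    """Successful 2xx responses over all non-4xx responses.

    Returns None when there are no eligible (non-4xx) responses.
    """
    eligible = [c for c in status_codes if not 400 <= c < 500]
    if not eligible:
        return None
    ok = sum(1 for c in eligible if 200 <= c < 300)
    return ok / len(eligible)


def p95(samples):
    """Nearest-rank 95th percentile (an assumed method, not the backend's)."""
    if not samples:
        raise ValueError("p95 of an empty sample set is undefined")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


# Hypothetical samples for one evaluation window.
codes = [200, 201, 404, 500, 200]
latencies_s = [0.2, 0.3, 0.25, 0.9, 1.4]
print(availability(codes))     # 3 of the 4 non-4xx responses succeeded
print(p95(latencies_s) < 1.0)  # compare against the < 1 s SLO
```

Excluding 4xx responses from the denominator keeps client errors from counting against the service's availability.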

2. Key metrics (OpenTelemetry)

| Metric | Type | Labels |
| --- | --- | --- |
| communication.messages.sent.count | counter | tenant_id, urgency, patient_linked |
| communication.messages.send.duration | histogram | tenant_id, urgency |
| communication.notifications.dispatched.count | counter | tenant_id, channel, provider, category |
| communication.notifications.outcome.count | counter | tenant_id, channel, outcome |
| communication.notifications.dispatch.duration | histogram | tenant_id, channel, category |
| communication.virtual_sessions.state.transition | counter | tenant_id, from, to |
| communication.outbox.lag.seconds | gauge | |
| communication.inbox.dedupe.count | counter | tenant_id, subject |
| communication.adapter.health | gauge (0/1) | adapter (sms-ghasi, sms-twilio, email-ses, push-fcm, etc.) |
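
One way to keep label sets from drifting away from the table above is a small allow-list check before recording a measurement. The sketch below is an assumption, not part of the service: `ALLOWED_LABELS` and `check_labels` are hypothetical names, and the actual recording call (for example through an OpenTelemetry meter) is left out.

```python
# Allowed labels per metric, copied from the table above.
ALLOWED_LABELS = {
    "communication.messages.sent.count": {"tenant_id", "urgency", "patient_linked"},
    "communication.messages.send.duration": {"tenant_id", "urgency"},
    "communication.notifications.dispatched.count": {"tenant_id", "channel", "provider", "category"},
    "communication.notifications.outcome.count": {"tenant_id", "channel", "outcome"},
    "communication.notifications.dispatch.duration": {"tenant_id", "channel", "category"},
    "communication.virtual_sessions.state.transition": {"tenant_id", "from", "to"},
    "communication.outbox.lag.seconds": set(),
    "communication.inbox.dedupe.count": {"tenant_id", "subject"},
    "communication.adapter.health": {"adapter"},
}


def check_labels(metric, labels):
    """Return the labels unchanged, or raise if any label is not in the table."""
    allowed = ALLOWED_LABELS[metric]  # KeyError for an unknown metric name
    unexpected = set(labels) - allowed
    if unexpected:
        raise ValueError(f"{metric}: unexpected labels {sorted(unexpected)}")
    return labels


# Hypothetical usage: validate before handing the labels to the metrics API.
labels = check_labels(
    "communication.messages.sent.count",
    {"tenant_id": "t-1", "urgency": "routine"},
)
print(labels)
```

Rejecting unknown labels at the call site is a cheap guard against accidentally introducing a high-cardinality label such as a message or recipient identifier.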

3. Traces

Spans: CreateThread, SendMessage, DispatchIntent, AdapterCall.{provider}, CreateVirtualSession, SpawnFallback. Always propagate traceparent; attach tenant.id, urgency, channel, category attributes (low cardinality only).
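
To make the traceparent requirement concrete, the sketch below shows what propagation means at the header level, using the W3C Trace Context version 00 layout (version, trace-id, parent-id, flags). It is only an illustration: `child_traceparent` is a hypothetical helper, and in practice the OpenTelemetry SDK's propagators would handle this instead of hand-written code.

```python
import re
import secrets

# W3C Trace Context, version 00: version-traceid-parentid-flags, lowercase hex.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def child_traceparent(incoming):
    """Build the header for an outgoing call, keeping the trace id and flags.

    Starts a new trace (with the sampled flag set, an assumption) when the
    incoming header is missing, malformed, or carries an all-zero id.
    """
    match = _TRACEPARENT.match(incoming or "")
    if match and int(match.group(1), 16) != 0 and int(match.group(2), 16) != 0:
        trace_id, flags = match.group(1), match.group(3)
    else:
        trace_id, flags = secrets.token_hex(16), "01"
    span_id = secrets.token_hex(8)  # new parent-id for the downstream span
    return f"00-{trace_id}-{span_id}-{flags}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
print(child_traceparent(incoming))  # same trace id, fresh span id
print(child_traceparent(None))      # no incoming context: new trace
```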

4. Logs

Structured JSON. Never log message body, attachment content, variable values, or recipient PII. Log intentId, dispatchId, providerMessageId, outcome, error code.
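
The logging rule above is easier to enforce with an allow-list than with a block-list, since anything not explicitly permitted is dropped. The following is a minimal sketch under that assumption; `log_line`, the `event` field, and the `errorCode` key name are hypothetical and not taken from the service.

```python
import json

# Fields this section allows in logs; "errorCode" is an assumed key name.
LOGGABLE_FIELDS = ("intentId", "dispatchId", "providerMessageId", "outcome", "errorCode")


def log_line(event, **fields):
    """Serialize one structured log line, dropping anything not allow-listed."""
    record = {"event": event}
    for key in LOGGABLE_FIELDS:
        if key in fields:
            record[key] = fields[key]
    return json.dumps(record, sort_keys=True)


print(log_line(
    "notification.dispatched",
    intentId="int-123",
    dispatchId="dsp-456",
    outcome="failed",
    errorCode="PROVIDER_TIMEOUT",
    body="never logged",    # dropped: message body
    recipient="+15550100",  # dropped: recipient PII
))
```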

5. Dashboards

| Dashboard | Panels |
| --- | --- |
| Communication overview | sends/min, dispatch success by channel, virtual session funnel |
| Notifications deep-dive | per-provider latency, DLR success, failures by error code |
| Virtual care | state funnel, fallback rate, recording ingest lag |
| Operator alerts | anomalies on failure ratio, outbox lag |

6. Alerts & runbooks

| Alert | Condition | Runbook |
| --- | --- | --- |
| SMS failure ratio high | failed / dispatched > 0.2 for 5 min | runbooks/comms/sms-provider-degraded.md |
| Push delivery stalled | FCM feedback gap > 10 min | runbooks/comms/push-dlr-stall.md |
| Outbox lag > 10 s | gauge > 10 for 2 min | runbooks/comms/outbox-lag.md |
| Virtual fallback surge | fallback_initiated rate > 10x baseline | runbooks/comms/vc-infra-failure.md |
| Attachment scan queue stuck | pending > 1000 for 15 min | runbooks/comms/attachment-scan-stuck.md |
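
Several conditions above have the form "value above a threshold for N minutes". The sketch below shows one way to read that rule, assuming one sample per minute and that every sample in the window must exceed the threshold. The helper names and the sample data are hypothetical, and a real alerting system would express this in its own rule language.

```python
def sustained(samples, threshold, minutes):
    """True when the last `minutes` per-minute samples all exceed `threshold`."""
    if minutes <= 0 or len(samples) < minutes:
        return False
    return all(value > threshold for value in samples[-minutes:])


def failure_ratio(failed, dispatched):
    """failed / dispatched, treating a minute with no dispatches as 0.0."""
    return failed / dispatched if dispatched else 0.0


# Hypothetical per-minute (failed, dispatched) counts for one SMS provider.
per_minute = [(1, 50), (12, 50), (15, 50), (20, 60), (18, 60), (30, 70)]
ratios = [failure_ratio(f, d) for f, d in per_minute]
print(sustained(ratios, 0.2, 5))      # SMS failure ratio high: > 0.2 for 5 min
print(sustained([3, 11, 12], 10, 2))  # Outbox lag: gauge > 10 for 2 min
```

Requiring the whole window to breach the threshold avoids paging on a single noisy minute, at the cost of firing a few minutes later.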

Runbooks will live in docs/runbooks/comms/ (still to be added to the ops repo).