Communication Service — Observability
Status: populated · Owner: TBD · Last updated: 2026-04-17 · Companion: Service Template
1. SLIs / SLOs
| SLI | Definition | SLO (rolling 30d) |
|---|---|---|
| Message send latency | p95 POST /threads/{id}/messages | < 1 s |
| Message send availability | 2xx responses / all responses excluding 4xx | ≥ 99.9% |
| Notification dispatch latency (critical category) | time from intent enqueue to provider accept, p95 | < 5 s |
| Notification dispatch latency (routine) | same, p95 | < 60 s |
| Delivery success (SMS) | delivered / dispatched | ≥ 97% per tenant |
| Delivery success (push) | delivered / dispatched | ≥ 95% per tenant |
| Delivery success (email) | delivered / dispatched | ≥ 98% per tenant |
| Virtual-session start success | started / created | ≥ 98% |
| Virtual-session fallback rate | fallback / failed | ≥ 95% (a fallback is expected to spawn whenever infra fails) |
| Outbox lag | seconds between insert and publish, p95 | < 2 s |
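The availability SLI and its 30-day error budget can be illustrated with a small worked calculation. This is a minimal sketch, not the service's actual implementation: the function names `send_availability` and `error_budget_remaining` and the sample status counts are hypothetical, and it assumes that "non-4xx" means every response whose status code is outside 400-499.

```python
def send_availability(status_counts):
    """Availability = 2xx responses / all responses excluding 4xx."""
    ok = sum(n for code, n in status_counts.items() if 200 <= code < 300)
    eligible = sum(n for code, n in status_counts.items() if not 400 <= code < 500)
    # With no eligible traffic there is nothing to fail, so report 1.0 (assumption).
    return ok / eligible if eligible else 1.0


def error_budget_remaining(availability, slo=0.999):
    """Fraction of the error budget left; a negative value means the SLO is breached."""
    allowed = 1.0 - slo           # tolerated failure ratio
    spent = 1.0 - availability    # observed failure ratio
    return 1.0 - spent / allowed


# Hypothetical status-code counts for one rolling window.
counts = {200: 99_950, 400: 1_200, 404: 300, 500: 40, 503: 10}
availability = send_availability(counts)
print(availability)                                      # 0.9995
print(round(error_budget_remaining(availability), 3))    # 0.5
```

In this example the 4xx responses are excluded from the denominator, so 50 server-side failures out of 100,000 eligible requests consume half of the 0.1% budget.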
2. Key metrics (OpenTelemetry)
| Metric | Type | Labels |
|---|---|---|
| communication.messages.sent.count | counter | tenant_id, urgency, patient_linked |
| communication.messages.send.duration | histogram | tenant_id, urgency |
| communication.notifications.dispatched.count | counter | tenant_id, channel, provider, category |
| communication.notifications.outcome.count | counter | tenant_id, channel, outcome |
| communication.notifications.dispatch.duration | histogram | tenant_id, channel, category |
| communication.virtual_sessions.state.transition | counter | tenant_id, from, to |
| communication.outbox.lag.seconds | gauge | (none) |
| communication.inbox.dedupe.count | counter | tenant_id, subject |
| communication.adapter.health | gauge (0/1) | adapter (sms-ghasi, sms-twilio, email-ses, push-fcm, etc.) |
3. Traces
Spans: CreateThread, SendMessage, DispatchIntent, AdapterCall.{provider}, CreateVirtualSession, SpawnFallback. Always propagate traceparent; attach tenant.id, urgency, channel, category attributes (low cardinality only).
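To make the propagation rule concrete, the sketch below shows how a W3C `traceparent` header could be continued for a child span and how span attributes could be restricted to the four listed above. In practice an OpenTelemetry SDK propagator would normally do the header handling; the helper names `child_traceparent` and `span_attributes` are hypothetical, and starting a new sampled trace when the incoming header is missing or malformed is an assumption of this example.

```python
import re
import secrets

# Version 00 layout: 00-<32 hex trace-id>-<16 hex parent-id>-<2 hex flags>
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

# Attribute names taken from the text above; everything else is dropped.
ALLOWED_ATTRIBUTES = ("tenant.id", "urgency", "channel", "category")


def child_traceparent(incoming):
    """Build the traceparent for a child span, continuing the incoming trace if it is valid."""
    match = _TRACEPARENT.match(incoming or "")
    # All-zero trace or parent ids are invalid and are treated like a missing header.
    if match and match.group(1) != "0" * 32 and match.group(2) != "0" * 16:
        trace_id, flags = match.group(1), match.group(3)
    else:
        # Assumption: start a fresh, sampled trace when nothing usable arrived.
        trace_id, flags = secrets.token_hex(16), "01"
    span_id = secrets.token_hex(8)  # new 16-hex-character id for this span
    return f"00-{trace_id}-{span_id}-{flags}"


def span_attributes(candidates):
    """Keep only the allowlisted low-cardinality attributes."""
    return {k: v for k, v in candidates.items() if k in ALLOWED_ATTRIBUTES}


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
outgoing = child_traceparent(incoming)
attrs = span_attributes({"tenant.id": "t-1", "channel": "sms", "recipient": "+15550100"})
```

Here `outgoing` keeps the incoming trace id and flags but carries a new span id, and `attrs` contains only `tenant.id` and `channel`.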
4. Logs
Structured JSON. Never log message body, attachment content, variable values, or recipient PII. Log intentId, dispatchId, providerMessageId, outcome, error code.
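One way to enforce this rule is an allowlist at the formatter level, so that a field is only emitted if it has been named explicitly. The following is a minimal sketch using Python's standard `logging` module; the class name `AllowlistJsonFormatter`, the `errorCode` key spelling, and the `ts`/`level`/`event` keys are assumptions made for illustration.

```python
import json
import logging

# Only these record attributes are ever serialized (names follow the text above;
# the exact key for "error code" is an assumption).
ALLOWED_FIELDS = ("intentId", "dispatchId", "providerMessageId", "outcome", "errorCode")


class AllowlistJsonFormatter(logging.Formatter):
    """Emit one JSON object per record, dropping any field not on the allowlist."""

    def format(self, record):
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            # The message should be a fixed event name, never free-form user text.
            "event": record.getMessage(),
        }
        for field in ALLOWED_FIELDS:
            value = getattr(record, field, None)
            if value is not None:
                payload[field] = value
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(AllowlistJsonFormatter())
logger = logging.getLogger("communication.dispatch")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

# "body" is passed by mistake here; the formatter silently drops it.
logger.info(
    "dispatch.completed",
    extra={"intentId": "i-123", "dispatchId": "d-456", "outcome": "delivered", "body": "secret"},
)
```

An allowlist fails closed: a new field added to a log call stays out of the output until someone adds it to `ALLOWED_FIELDS` deliberately.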
5. Dashboards
| Dashboard | Panels |
|---|---|
| Communication overview | sends/min, dispatch success by channel, virtual session funnel |
| Notifications deep-dive | per-provider latency, DLR success, failures by error code |
| Virtual care | state funnel, fallback rate, recording ingest lag |
| Operator alerts | anomalies on failure ratio, outbox lag |
6. Alerts & runbooks
| Alert | Condition | Runbook |
|---|---|---|
| SMS failure ratio high | failed / dispatched > 0.2 for 5 min | runbooks/comms/sms-provider-degraded.md |
| Push delivery stalled | FCM feedback gap > 10 min | runbooks/comms/push-dlr-stall.md |
| Outbox lag > 10 s | gauge > 10 for 2 min | runbooks/comms/outbox-lag.md |
| Virtual fallback surge | fallback_initiated rate > 10× baseline | runbooks/comms/vc-infra-failure.md |
| Attachment scan queue stuck | pending > 1000 for 15 min | runbooks/comms/attachment-scan-stuck.md |
Runbooks live in docs/runbooks/comms/ (to be added to ops repo).
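Conditions of the form "ratio above a threshold for N minutes" (for example, the SMS failure ratio rule) can be sketched as a check over recent samples. This is an illustration under stated assumptions rather than the alerting system's actual rule syntax: the `Sample` type and `ratio_alert_firing` function are hypothetical, samples are assumed to arrive at regular intervals, and a window with zero dispatches is treated as healthy.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    ts: float        # sample time in seconds
    failed: int      # failed dispatches in the sample interval
    dispatched: int  # total dispatches in the sample interval


def ratio_alert_firing(samples, threshold=0.2, hold_seconds=300):
    """True if failed/dispatched has stayed above threshold for at least
    hold_seconds, counting back from the most recent sample."""
    ordered = sorted(samples, key=lambda s: s.ts)
    if not ordered:
        return False
    breach_start = None
    # Walk backwards until the first healthy sample ends the breaching streak.
    for s in reversed(ordered):
        ratio = s.failed / s.dispatched if s.dispatched else 0.0
        if ratio <= threshold:
            break
        breach_start = s.ts
    return breach_start is not None and ordered[-1].ts - breach_start >= hold_seconds


# Six one-minute samples at a 30% failure ratio span 300 s, so the alert fires.
breaching = [Sample(ts=60 * i, failed=30, dispatched=100) for i in range(6)]
print(ratio_alert_firing(breaching))  # True

# A healthy sample two minutes ago resets the streak, so the alert does not fire.
recovered = breaching[:3] + [Sample(ts=180, failed=5, dispatched=100)] + breaching[4:]
print(ratio_alert_firing(recovered))  # False
```

Requiring the condition to hold for the full window avoids paging on a single bad scrape, at the cost of delaying detection by the hold time.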