Observability
:::info Source
Sourced from services/notification-service/OBSERVABILITY.md in the documentation repo.
:::
1. Logs
Events: notification.queued|sending|sent|delivered|failed|bounced|suppressed, notification.template.created|updated, notification.digest.sent, notification.webhook.received.
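The pipe notation above is shorthand for one event name per lifecycle state. As a minimal sketch, assuming events are emitted as single-line JSON, the following expands the shorthand and formats one event. The helper `format_event` and the fields `ts`, `channel`, and `notification_id` are illustrative assumptions and are not taken from the service.

```python
import json
from datetime import datetime, timezone

# Expand the "notification.queued|sending|..." shorthand into full event names.
LIFECYCLE_EVENTS = {
    f"notification.{state}"
    for state in (
        "queued", "sending", "sent", "delivered",
        "failed", "bounced", "suppressed",
    )
}


def format_event(event: str, **fields) -> str:
    """Serialize one log event as a single JSON line (hypothetical schema)."""
    record = {"ts": datetime.now(timezone.utc).isoformat(), "event": event, **fields}
    return json.dumps(record, sort_keys=True)


# Field names here are assumptions for illustration only.
line = format_event("notification.sent", channel="email", notification_id="n-123")
print(line)
```

Keeping the event name in a dedicated `event` field makes it straightforward to filter or count by lifecycle state in a log aggregator.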
2. Metrics
RED
| Metric | Type |
|---|---|
| notif_api_requests_total{endpoint,status} | counter |
| notif_api_duration_seconds{endpoint} | histogram |
Domain
| Metric | Type |
|---|---|
| notif_sends_total{channel,template,outcome} | counter |
| notif_delivery_duration_seconds{channel} | histogram |
| notif_bounce_rate{channel} | gauge |
| notif_open_rate{channel,template} | gauge |
| notif_click_rate{channel,template} | gauge |
| notif_suppression_total{reason} | counter |
| notif_webhook_events_total{provider,kind} | counter |
Cost
| Metric | Type |
|---|---|
| notif_provider_cost_micro_usd_total{channel,tenant_id} | counter |
| notif_ai_cost_micro_usd_total{tenant_id} | counter |
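The cost counters are denominated in micro-USD, where 1 USD equals 1,000,000 micro-USD, so totals can accumulate as integers without floating-point drift. The sketch below shows the arithmetic only; the in-memory dictionary, the `add_cost` helper, and the per-message price are hypothetical and stand in for whatever metrics client the service actually uses.

```python
from collections import defaultdict
from decimal import Decimal

MICRO_USD_PER_USD = 1_000_000

# Stand-in for notif_provider_cost_micro_usd_total, keyed by (channel, tenant_id).
provider_cost_micro_usd = defaultdict(int)


def add_cost(channel: str, tenant_id: str, usd: Decimal) -> None:
    """Add a provider charge, converting USD to integer micro-USD."""
    provider_cost_micro_usd[(channel, tenant_id)] += int(usd * MICRO_USD_PER_USD)


# Hypothetical price of 0.0075 USD per SMS, recorded twice.
add_cost("sms", "tenant-a", Decimal("0.0075"))
add_cost("sms", "tenant-a", Decimal("0.0075"))

total_micro = provider_cost_micro_usd[("sms", "tenant-a")]
total_usd = Decimal(total_micro) / MICRO_USD_PER_USD
print(total_micro, total_usd)
```

Converting back to USD only at display time keeps the counter itself monotonic and integer-valued.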
3. Traces
Spans: notif.send.email, notif.send.sms, notif.send.push, notif.template.render, notif.digest.batch.
4. Dashboards
- Send volume by channel + template.
- Delivery rate + bounce.
- Open/click for email.
- Provider cost per tenant.
5. Alerts
| Alert | Threshold | Severity |
|---|---|---|
| bounce-rate-high | > 5% daily | P2 |
| send-failure-spike | > 3% in 10min | P2 |
| webhook-lag | > 30s p99 | P2 |
| ai-budget-exhausted | tenant at 100% of AI budget | P3 |
| sms-toll-fraud-suspected | unusual destinations | P1 |
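The two percentage thresholds in the table can be read as strict "greater than" comparisons over a window of counts. The following is a minimal sketch of that check, assuming the failed and total counts for the window are already available. The function name `exceeds_percent` and the sample counts are illustrative, and real alert rules would normally live in the alerting system, not in application code.

```python
SEND_FAILURE_THRESHOLD_PCT = 3  # send-failure-spike: > 3% in 10 min
BOUNCE_THRESHOLD_PCT = 5        # bounce-rate-high: > 5% daily


def exceeds_percent(bad: int, total: int, threshold_pct: int) -> bool:
    """Return True when bad/total is strictly above threshold_pct percent.

    Integer arithmetic avoids float rounding at the exact boundary.
    An empty window never fires.
    """
    if total <= 0:
        return False
    return bad * 100 > threshold_pct * total


# Hypothetical 10-minute window: 40 failures out of 1000 sends (4%).
print(exceeds_percent(40, 1000, SEND_FAILURE_THRESHOLD_PCT))
# Exactly 3% does not fire because the threshold is strict.
print(exceeds_percent(30, 1000, SEND_FAILURE_THRESHOLD_PCT))
```

Treating an empty window as "not firing" is a design choice in this sketch; a separate alert on missing traffic may be more appropriate in practice.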
6. SLOs
| SLI | Target |
|---|---|
| Queue-to-send p95 | < 30s |
| Email delivery p95 | < 2 min |
| SMS delivery p95 | < 30s |
| Push delivery p95 | < 10s |
| API availability | 99.9% |
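To make the latency targets concrete, here is a small sketch that computes a p95 with the nearest-rank method and compares it against the targets from the table, plus the downtime budget implied by 99.9% availability. The 30-day window, the nearest-rank choice, and the sample latencies are assumptions; the source does not state how the SLIs are measured.

```python
# Targets from the SLO table, in seconds.
P95_TARGETS_SECONDS = {
    "queue_to_send": 30,
    "email_delivery": 120,
    "sms_delivery": 30,
    "push_delivery": 10,
}


def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile: the value at rank ceil(0.95 * n)."""
    if not samples:
        raise ValueError("p95 needs at least one sample")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # integer ceiling of 0.95 * n
    return ordered[rank - 1]


def meets_target(sli: str, samples: list[float]) -> bool:
    """True when the observed p95 is strictly below the target."""
    return p95(samples) < P95_TARGETS_SECONDS[sli]


def downtime_budget_minutes(availability_pct: float, window_days: int = 30) -> float:
    """Allowed downtime for an availability target over an assumed window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100 - availability_pct) / 100


# Hypothetical push delivery latencies in seconds: 1, 2, ..., 20.
push_latencies = [float(s) for s in range(1, 21)]
print(p95(push_latencies), meets_target("push_delivery", push_latencies))
print(downtime_budget_minutes(99.9))
```

With these samples the p95 is 19 seconds, which misses the 10-second push target; over an assumed 30-day window, 99.9% availability allows roughly 43.2 minutes of downtime.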