Notification Service — Observability
Status: populated | Owner: Platform Engineering + SRE | Last updated: 2026-04-18 | Companion: 12 Observability
1. SLIs / SLOs
| SLI | SLO | Window |
|---|---|---|
| Email delivery success rate | ≥ 98% | 7 d |
| CRITICAL alert delivery time (event → email sent) | ≤ 60s P95 | 30 d |
| Invoice notification delivery time | ≤ 5 min P95 | 30 d |
| NATS consumer lag | ≤ 10s P95 | 30 d |
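
The email delivery SLO can be read as an error budget: at a 98% target, up to 2% of attempts in the 7-day window may fail. A minimal sketch of that arithmetic follows. The helper name `emailSloStatus` and its input shape are hypothetical, and it assumes the success rate is SENT / (SENT + FAILED) with SUPPRESSED excluded, which the table does not specify.

```typescript
// Hypothetical helper, not part of the service: illustrates the SLO arithmetic only.
interface SloStatus {
  successRate: number;     // fraction of attempts that succeeded (0..1)
  allowedFailures: number; // failures the target tolerates at this volume
  remainingBudget: number; // allowed failures not yet consumed (negative if overspent)
  met: boolean;            // true when successRate >= target
}

// `sent` and `failed` are counts over the SLO window (7 d for email delivery).
// Assumption: SUPPRESSED notifications are excluded from both counts.
function emailSloStatus(sent: number, failed: number, target = 0.98): SloStatus {
  const total = sent + failed;
  if (total === 0) {
    // No traffic in the window: treat the SLO as met with no budget consumed.
    return { successRate: 1, allowedFailures: 0, remainingBudget: 0, met: true };
  }
  const successRate = sent / total;
  const allowedFailures = Math.floor(total * (1 - target));
  return {
    successRate,
    allowedFailures,
    remainingBudget: allowedFailures - failed,
    met: successRate >= target,
  };
}

// Usage: 10,000 attempts with 150 failures leaves 50 failures of budget.
const status = emailSloStatus(9850, 150);
console.log(status.met, status.remainingBudget);
```
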
2. Metrics
- `notif_events_consumed_total{subject="auth.events|billing.events|operator.health|system.alerts", result="ok|error"}`
- `notif_notifications_dispatched_total{channel="EMAIL|SMS", category=..., status="SENT|FAILED|SUPPRESSED"}`
- `notif_delivery_latency_seconds_bucket{channel=...}`
- `notif_template_render_errors_total{type=...}`
- `notif_recipient_resolve_errors_total`
- `notif_sendgrid_errors_total{http_status=...}`
- `notif_sms_delivery_errors_total`
- `notif_nats_consumer_lag{subject=...}` (gauge)
- `notif_duplicate_suppressed_total`
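
The delivery-time SLOs are P95 targets, and `notif_delivery_latency_seconds_bucket` exposes cumulative histogram buckets. The sketch below estimates a quantile from such buckets by linear interpolation, similar in spirit to Prometheus's `histogram_quantile`. The `Bucket` type, the function name, and the bucket bounds in the usage example are assumptions for illustration; the service's actual bucket layout is not stated here.

```typescript
// One cumulative histogram bucket: number of observations with value <= le.
interface Bucket {
  le: number;    // upper bound in seconds; Infinity for the +Inf bucket
  count: number; // cumulative count
}

// Estimate quantile q (0..1) by linear interpolation inside the bucket that
// contains the target rank. Buckets must be sorted by `le` ascending and end
// with the +Inf bucket. The first bucket's lower bound is taken as 0.
function estimateQuantile(q: number, buckets: Bucket[]): number {
  const last = buckets[buckets.length - 1];
  if (!last || last.count === 0) return NaN;
  const rank = q * last.count;
  let prevLe = 0;
  let prevCount = 0;
  for (const b of buckets) {
    if (b.count >= rank) {
      // Rank falls in the +Inf bucket: report the highest finite bound.
      if (!Number.isFinite(b.le)) return prevLe;
      const inBucket = b.count - prevCount;
      if (inBucket === 0) return b.le;
      return prevLe + (b.le - prevLe) * ((rank - prevCount) / inBucket);
    }
    prevLe = b.le;
    prevCount = b.count;
  }
  return NaN;
}

// Usage with made-up bucket bounds (seconds) and counts.
const sampleBuckets: Bucket[] = [
  { le: 1, count: 50 },
  { le: 5, count: 80 },
  { le: 30, count: 90 },
  { le: 60, count: 100 },
  { le: Infinity, count: 100 },
];
const p95Seconds = estimateQuantile(0.95, sampleBuckets);
console.log(p95Seconds <= 60 ? "within CRITICAL P95 target" : "over target");
```

The estimate is only as precise as the bucket boundaries, so a bound placed exactly at each SLO threshold (60 s, 300 s) makes the compliance check unambiguous.
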
3. Traces
OpenTelemetry spans (propagated from NATS traceparent header):
- `notif.dispatch` (root from NATS consumer)
  - `notif.resolve_recipients`
  - `notif.preference_check`
  - `notif.render_template`
  - `notif.deliver.email` (or `notif.deliver.sms`)
  - `notif.pg.insert_log`
Attributes: notif.notification_id, notif.channel, notif.category, notif.source_event_type, notif.account_id.
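
To make the propagation step concrete, the sketch below parses a W3C `traceparent` value such as the one carried in the NATS message header. In a real service an OpenTelemetry propagator would normally do this; the function and type names here are hypothetical, and only the version `00` layout (`version-traceid-parentid-flags`) is handled.

```typescript
interface TraceParent {
  version: string;
  traceId: string;  // 32 lowercase hex characters
  parentId: string; // 16 lowercase hex characters
  sampled: boolean; // lowest bit of the trace-flags field
}

// Returns null for a missing or malformed header, so the consumer can start
// a fresh root span instead of failing the message.
function parseTraceparent(header: string | undefined): TraceParent | null {
  if (!header) return null;
  const parts = header.trim().split("-");
  if (parts.length !== 4) return null;
  const [version, traceId, parentId, flags] = parts as [string, string, string, string];
  if (!/^[0-9a-f]{2}$/.test(version) || version === "ff") return null;
  if (!/^[0-9a-f]{32}$/.test(traceId) || /^0+$/.test(traceId)) return null;
  if (!/^[0-9a-f]{16}$/.test(parentId) || /^0+$/.test(parentId)) return null;
  if (!/^[0-9a-f]{2}$/.test(flags)) return null;
  return {
    version,
    traceId,
    parentId,
    sampled: (parseInt(flags, 16) & 0x01) === 1,
  };
}

// Usage: the parsed traceId is what would appear as `traceId` in log lines.
const parsed = parseTraceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01");
console.log(parsed?.traceId, parsed?.sampled);
```
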
4. Logs (Pino → Loki)
Fields: level, ts, service=notification-service, notificationId, channel, category, sourceEventType, status, durationMs, traceId.
- `recipientAddress` stored as masked form only: `***@domain.tld` or `+44***0123`.
- No full email content in logs.
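
One way to produce the masked forms shown above is sketched below. The function name `maskRecipient` is hypothetical. The phone rule (keep the first three characters and the last four digits) is an assumption inferred from the `+44***0123` example; country codes vary in length, so a real implementation may need a smarter prefix rule.

```typescript
// Mask a recipient address before it is written to a log line.
// Assumptions: emails keep only the domain part; phone numbers are in
// E.164-like form and keep the first three characters plus the last four digits.
function maskRecipient(address: string): string {
  const trimmed = address.trim();
  const at = trimmed.lastIndexOf("@");
  if (at > 0 && at < trimmed.length - 1) {
    return "***" + trimmed.slice(at); // e.g. "***@domain.tld"
  }
  if (/^\+\d{8,15}$/.test(trimmed)) {
    return trimmed.slice(0, 3) + "***" + trimmed.slice(-4); // e.g. "+44***0123"
  }
  return "***"; // unrecognised shape: reveal nothing
}

// Usage: mask before building the structured log fields.
console.log(maskRecipient("jane.doe@example.com"));
console.log(maskRecipient("+447700900123"));
```

Falling back to `***` for unrecognised input keeps a malformed address from leaking into Loki in full.
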
5. Dashboards (Grafana)
- Notification Overview — dispatch rate by channel/category, SENT/FAILED/SUPPRESSED split
- Delivery Health — SendGrid success rate, SMS delivery rate, retry rate
- Consumer Lag — per-subject lag, event processing latency histogram
- Template Health — render error rate by template type
6. Alerts
| Alert | Condition | Runbook |
|---|---|---|
| NotifEmailDeliveryFailed | email FAILED rate > 5% for 5m | runbooks/notif/email-delivery-failed.md |
| NotifSmsDeliveryFailed | SMS FAILED rate > 10% for 5m | runbooks/notif/sms-delivery-failed.md |
| NotifSystemAlertFailed | system.alerts CRITICAL event status=FAILED | runbooks/notif/system-alert-failed.md |
| NotifTemplateMissing | template_render_errors > 0 with reason=NOT_FOUND | runbooks/notif/template-missing.md |
| NotifNatsLag | consumer lag > 5000 on any subject | runbooks/notif/nats-lag.md |
| NotifPgErrors | PG errors > 5/min | runbooks/notif/pg-down.md |
| NotifAuthServiceErrors | recipient resolve errors > 10/min | runbooks/notif/auth-service-down.md |
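
Conditions such as "FAILED rate > 5% for 5m" combine a threshold with a hold duration: the alert should fire only if the rate stays above the threshold for the whole period. The sketch below illustrates that semantics over a series of rate samples. It loosely mirrors how a `for` clause behaves in a rule engine, but the types and the function name are hypothetical and this is not the actual rule definition.

```typescript
interface RateSample {
  tsMs: number;       // sample timestamp, epoch milliseconds
  failedRate: number; // FAILED / total for the sample interval (0..1)
}

// Returns true if the rate has stayed strictly above `threshold` continuously
// for at least `forMs`, judged at the timestamp of the last sample.
// Samples are assumed to be sorted by timestamp ascending.
function shouldFire(samples: RateSample[], threshold: number, forMs: number): boolean {
  let breachStart: number | null = null;
  let lastTs: number | null = null;
  for (const s of samples) {
    if (s.failedRate > threshold) {
      if (breachStart === null) breachStart = s.tsMs; // breach begins
    } else {
      breachStart = null; // any dip below the threshold resets the timer
    }
    lastTs = s.tsMs;
  }
  return breachStart !== null && lastTs !== null && lastTs - breachStart >= forMs;
}

// Usage: one sample per minute, all at 8% failures, checked against 5% for 5m.
const minute = 60_000;
const steady: RateSample[] = [0, 1, 2, 3, 4, 5].map((i) => ({ tsMs: i * minute, failedRate: 0.08 }));
console.log(shouldFire(steady, 0.05, 5 * minute));
```

The reset on any dip is what keeps a brief SendGrid blip from paging anyone, at the cost of a delay of at least the hold duration before a sustained failure alerts.
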