
Notification Service — Observability

Status: populated
Owner: Platform Engineering + SRE
Last updated: 2026-04-18
Companion: 12 Observability

1. SLIs / SLOs

| SLI | SLO | Window |
| --- | --- | --- |
| Email delivery success rate | ≥ 98% | 7 d |
| CRITICAL alert delivery time (event → email sent) | P95 ≤ 60s | 30 d |
| Invoice notification delivery time | P95 ≤ 5 min | 30 d |
| NATS consumer lag | P95 ≤ 10s | 30 d |
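
To make the targets above concrete, here is a minimal sketch of how a P95 latency and a success rate could be checked against them. The function names, the sample values, and the choice of a nearest-rank percentile are all assumptions for illustration. The production SLIs are presumably derived from the metrics in section 2 rather than from raw arrays, and whether SUPPRESSED notifications count toward the success rate is not stated here (this sketch ignores them).

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of
// all samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

// Success-rate SLI as sent / (sent + failed).
// Assumption: SUPPRESSED notifications are excluded from the ratio.
function successRate(sent: number, failed: number): number {
  const total = sent + failed;
  return total === 0 ? 1 : sent / total;
}

// Hypothetical event-to-email latencies (seconds) for CRITICAL alerts.
const criticalLatenciesSec = [12, 8, 41, 15, 9, 22, 58, 11, 17, 30];
const p95 = percentile(criticalLatenciesSec, 95);

const criticalOk = p95 <= 60;                      // SLO: P95 <= 60s
const emailOk = successRate(9850, 150) >= 0.98;    // SLO: >= 98%
```

With ten samples the nearest-rank P95 is simply the largest value, which shows why a P95 target needs a reasonably large sample count per window to be meaningful.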

2. Metrics

notif_events_consumed_total{subject="auth.events|billing.events|operator.health|system.alerts", result="ok|error"}
notif_notifications_dispatched_total{channel="EMAIL|SMS", category=..., status="SENT|FAILED|SUPPRESSED"}
notif_delivery_latency_seconds_bucket{channel=...}
notif_template_render_errors_total{type=...}
notif_recipient_resolve_errors_total
notif_sendgrid_errors_total{http_status=...}
notif_sms_delivery_errors_total
notif_nats_consumer_lag{subject=...} (gauge)
notif_duplicate_suppressed_total
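
As an illustration of how one of the labelled counters above maps onto Prometheus text exposition lines, the following dependency-free sketch keeps one value per label combination. The `LabeledCounter` class is hypothetical, and the `category` values (`invoice`, `security`) are placeholders because the real value set is not listed here. A real service would more likely rely on a Prometheus client library than on hand-rolled code like this.

```typescript
type Labels = Record<string, string>;

// Escape label values per the Prometheus text exposition format.
function escapeLabelValue(v: string): string {
  return v.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");
}

// Hypothetical in-memory counter with a fixed, ordered set of label names.
class LabeledCounter {
  private readonly name: string;
  private readonly labelNames: string[];
  private readonly values = new Map<string, number>();

  constructor(name: string, labelNames: string[]) {
    this.name = name;
    this.labelNames = labelNames;
  }

  // Increment the series identified by the given label values.
  inc(labels: Labels, by = 1): void {
    const key = this.labelNames
      .map((n) => {
        if (!(n in labels)) throw new Error(`missing label ${n}`);
        return `${n}="${escapeLabelValue(labels[n])}"`;
      })
      .join(",");
    this.values.set(key, (this.values.get(key) ?? 0) + by);
  }

  // Render one exposition line per series, in first-seen order.
  render(): string[] {
    return [...this.values.entries()].map(([key, v]) => `${this.name}{${key}} ${v}`);
  }
}

const dispatched = new LabeledCounter("notif_notifications_dispatched_total", [
  "channel",
  "category",
  "status",
]);
dispatched.inc({ channel: "EMAIL", category: "invoice", status: "SENT" });
dispatched.inc({ channel: "EMAIL", category: "invoice", status: "SENT" });
dispatched.inc({ channel: "SMS", category: "security", status: "FAILED" });
const lines = dispatched.render();
```

Fixing the label names at construction time keeps every series for a metric on the same label set, which is what the metric definitions above imply.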

3. Traces

OpenTelemetry spans (propagated from NATS traceparent header):

  • notif.dispatch (root from NATS consumer)
    • notif.resolve_recipients
    • notif.preference_check
    • notif.render_template
    • notif.deliver.email (or notif.deliver.sms)
    • notif.pg.insert_log

Attributes: notif.notification_id, notif.channel, notif.category, notif.source_event_type, notif.account_id.
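
Since the root span continues the trace carried in the NATS `traceparent` header, a small sketch of parsing that header may help when debugging broken trace links. It follows the W3C Trace Context version `00` layout (`version-traceid-parentid-flags`). The `parseTraceparent` function and `TraceParent` type are hypothetical names, and in practice the OpenTelemetry propagator would normally do this work rather than application code.

```typescript
interface TraceParent {
  version: string;
  traceId: string;   // 32 lowercase hex chars
  parentId: string;  // 16 lowercase hex chars
  sampled: boolean;  // lowest bit of the trace-flags byte
}

// version "-" trace-id "-" parent-id "-" trace-flags, all lowercase hex.
// This sketch only accepts the four-field layout used by version 00.
const TRACEPARENT_RE = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(header: string | undefined): TraceParent | null {
  if (!header) return null;
  const m = TRACEPARENT_RE.exec(header.trim());
  if (!m) return null;
  const [, version, traceId, parentId, flags] = m;
  // Version ff is forbidden; all-zero trace or parent ids are invalid.
  if (version === "ff") return null;
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  return {
    version,
    traceId,
    parentId,
    sampled: (parseInt(flags, 16) & 0x01) === 1,
  };
}

// Example header value taken from the W3C Trace Context specification.
const parsed = parseTraceparent(
  "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
);
```

A `null` result would mean `notif.dispatch` starts a fresh trace instead of joining the producer's trace.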

4. Logs (Pino → Loki)

Fields: level, ts, service=notification-service, notificationId, channel, category, sourceEventType, status, durationMs, traceId.

  • recipientAddress is logged in masked form only: ***@domain.tld or +44***0123.
  • No full email content in logs.
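
The masking rule above could be implemented along these lines. This is only a sketch: `maskRecipient` is a hypothetical helper, and the phone branch assumes that "+" plus the first two digits and the last four digits are kept, which matches the `+44***0123` example but may not suit country codes of other lengths.

```typescript
// Mask a recipient address before it is written to a log line.
function maskRecipient(address: string): string {
  const at = address.lastIndexOf("@");
  if (at > 0) {
    // Email: drop the local part, keep the domain.
    return `***${address.slice(at)}`;
  }
  const digits = address.replace(/[^\d]/g, "");
  if (address.startsWith("+") && digits.length >= 7) {
    // Phone (assumption): keep "+", the first two digits, and the last four.
    return `+${digits.slice(0, 2)}***${digits.slice(-4)}`;
  }
  // Unknown shape: reveal nothing.
  return "***";
}

const maskedEmail = maskRecipient("jane.doe@example.com");
const maskedPhone = maskRecipient("+44 7700 900123");
```

Falling back to `***` for unrecognized input keeps a malformed address from leaking into Loki unmasked.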

5. Dashboards (Grafana)

  • Notification Overview — dispatch rate by channel/category, SENT/FAILED/SUPPRESSED split
  • Delivery Health — SendGrid success rate, SMS delivery rate, retry rate
  • Consumer Lag — per-subject lag, event processing latency histogram
  • Template Health — render error rate by template type

6. Alerts

| Alert | Condition | Runbook |
| --- | --- | --- |
| NotifEmailDeliveryFailed | email FAILED rate > 5% for 5m | runbooks/notif/email-delivery-failed.md |
| NotifSmsDeliveryFailed | SMS FAILED rate > 10% for 5m | runbooks/notif/sms-delivery-failed.md |
| NotifSystemAlertFailed | system.alerts CRITICAL event status=FAILED | runbooks/notif/system-alert-failed.md |
| NotifTemplateMissing | template_render_errors > 0 with reason=NOT_FOUND | runbooks/notif/template-missing.md |
| NotifNatsLag | consumer lag > 5000 on any subject | runbooks/notif/nats-lag.md |
| NotifPgErrors | PG errors > 5/min | runbooks/notif/pg-down.md |
| NotifAuthServiceErrors | recipient resolve errors > 10/min | runbooks/notif/auth-service-down.md |
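
Conditions such as "FAILED rate > 5% for 5m" require the threshold to be exceeded continuously for the whole duration, not just at one evaluation. The sketch below illustrates that semantics over a series of rate samples. The `Sample` type, the `firesAt` function, and the one-minute evaluation interval are assumptions for illustration. The actual rules are evaluated by whichever alerting engine backs these alerts, not by service code.

```typescript
interface Sample {
  tsMs: number;   // evaluation timestamp in milliseconds
  value: number;  // e.g. FAILED / total over the rate window
}

// Returns the timestamp at which the alert would fire, or null if the
// condition never holds continuously for at least forMs.
// Samples are expected in ascending timestamp order.
function firesAt(samples: Sample[], threshold: number, forMs: number): number | null {
  let pendingSince: number | null = null;
  for (const s of samples) {
    if (s.value > threshold) {
      if (pendingSince === null) pendingSince = s.tsMs;
      if (s.tsMs - pendingSince >= forMs) return s.tsMs;
    } else {
      // Any dip to or below the threshold resets the pending timer.
      pendingSince = null;
    }
  }
  return null;
}

const MINUTE = 60_000;
const toSamples = (rates: number[]): Sample[] =>
  rates.map((value, i) => ({ tsMs: i * MINUTE, value }));

// Hypothetical email FAILED rates, one sample per minute.
const sustained = firesAt(toSamples([0.02, 0.06, 0.07, 0.08, 0.06, 0.09, 0.1]), 0.05, 5 * MINUTE);
const interrupted = firesAt(toSamples([0.06, 0.06, 0.04, 0.06, 0.06, 0.06]), 0.05, 5 * MINUTE);
```

In the first series the rate stays above 5% from minute 1 through minute 6, so the alert fires at minute 6. In the second the dip at minute 2 resets the timer and nothing fires.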