OBSERVABILITY — notification-service
Sibling: APPLICATION_LOGIC · FAILURE_MODES · SERVICE_READINESS · DEPLOYMENT_TOPOLOGY
Strategic anchors: 02 Enterprise Architecture §12 Observability · 04 Event-Driven §10
We export OpenTelemetry signals (traces, metrics, logs) to the platform OTLP collector, which fans out to:
- Cloud Trace for traces (sampled).
- Cloud Monitoring for metrics (Prometheus exposition + OTLP push).
- Cloud Logging for logs (JSON-structured), with a parallel sink to Loki for engineer-friendly querying.
- BigQuery for long-term aggregates and SLI calculations.
All signals carry the standard ECC attributes (tenant.id, service.name='notification-service', service.version, service.instance.id, deployment.environment, gcp.region) plus notification-specific dimensions documented below.
1. Service Level Objectives (SLOs)
| SLI | Definition | Target | Window | Error budget |
|---|---|---|---|---|
| Enqueue latency | p95 of EnqueueNotificationUseCase end-to-end, measured from request receipt (HTTP 2xx responses only) | ≤ 250 ms | 30 d | 5 % |
| Dispatch latency (transactional) | p95 of queued → dispatched for category∈{transactional,security} | ≤ 5 s | 30 d | 1 % |
| Dispatch latency (operational/reminder) | p95 of queued → dispatched for category∈{operational,reminder} | ≤ 30 s | 30 d | 5 % |
| Vendor-acknowledged delivery rate (per channel) | delivered.v1 / dispatched.v1 per channel × tenant rolling | ≥ 95 % email; ≥ 92 % sms; ≥ 95 % whatsapp; ≥ 90 % push | 7 d | n/a (SLO informational) |
| Webhook ingestion success | webhook_inbound.status='applied' / received | ≥ 99.9 % | 30 d | 0.1 % |
| WebSocket feed availability | uptime + connection success rate | ≥ 99.5 % | 30 d | 0.5 % |
| Outbox publish lag | p95 of enqueued_at → published_at | ≤ 1 s | 30 d | 1 % |
| Pub/Sub consumer lag | p95 of producedAt → ackedAt per subscription | ≤ 5 s | 30 d | 5 % |
| Template render success rate | render success / render attempts | ≥ 99.95 % | 30 d | 0.05 % |
| API availability | (2xx + 4xx responses) / total responses; only 5xx count as failures | ≥ 99.9 % | 30 d | 0.1 % |
SLOs are computed in BigQuery from event tables and exposed via Cloud Monitoring custom metrics (notif.slo.<name>).
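As a rough illustration of how the "Error budget" column can be read, the sketch below turns a budget percentage and observed event counts into the share of budget remaining and a burn rate. The function names and sample counts are hypothetical; the real computation runs in BigQuery and may differ.

```python
def budget_remaining(budget_pct: float, total: int, bad: int) -> float:
    """Share of the error budget still unspent (1.0 = untouched, < 0 = overspent).

    budget_pct is the "Error budget" column value, e.g. 0.1 for 0.1 %.
    """
    allowed_bad = (budget_pct / 100.0) * total
    if allowed_bad == 0:
        return 1.0 if bad == 0 else 0.0
    return 1.0 - bad / allowed_bad


def burn_rate(budget_pct: float, total: int, bad: int) -> float:
    """Observed bad-event ratio divided by the budgeted ratio.

    A value of 1.0 means the budget would be spent exactly at the end of the
    window if the current ratio held throughout.
    """
    if total == 0:
        return 0.0
    return (bad / total) / (budget_pct / 100.0)


# Hypothetical month for "Webhook ingestion success" (budget 0.1 %):
# 1,000,000 webhooks received, 400 not applied.
remaining = budget_remaining(0.1, 1_000_000, 400)
rate = burn_rate(0.1, 1_000_000, 400)
```

With these hypothetical counts roughly 60 % of the budget is left and the burn rate is about 0.4.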
2. RED metrics per route/handler
For each REST route, gRPC method, Pub/Sub subscription, and worker:
| Metric | Type | Labels |
|---|---|---|
| http.server.duration_seconds | histogram | route, method, status_class, tenant |
| http.server.requests_total | counter | route, method, status_code, tenant |
| pubsub.consumer.duration_seconds | histogram | subscription, outcome |
| pubsub.consumer.messages_total | counter | subscription, outcome (ack, nack, dlq) |
| worker.tick.duration_seconds | histogram | worker |
| worker.tick.outcome_total | counter | worker, outcome |
Buckets follow the platform default histogram (5 ms … 30 s, 14 buckets).
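The table labels the duration histogram by status_class but the request counter by status_code. A minimal sketch of that distinction, using in-memory counters as stand-ins for the real instruments (the `observe` helper and both counter variables are hypothetical):

```python
from collections import Counter


def status_class(status_code: int) -> str:
    """Collapse an HTTP status code into its class label, e.g. 202 -> '2xx'."""
    if not 100 <= status_code <= 599:
        raise ValueError(f"not an HTTP status code: {status_code}")
    return f"{status_code // 100}xx"


# Stand-ins for http.server.requests_total and for the observation count of
# http.server.duration_seconds; a real service would use its metrics SDK.
requests_total: Counter = Counter()
duration_observations: Counter = Counter()


def observe(route: str, method: str, status_code: int, tenant: str) -> None:
    """Record one request under both label sets."""
    requests_total[(route, method, status_code, tenant)] += 1
    duration_observations[(route, method, status_class(status_code), tenant)] += 1
```

Grouping the histogram by class rather than by exact code keeps the number of histogram series per route small, which is one plausible reason for the split.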
3. Domain metrics
Counters, histograms, and gauges covering the notification lifecycle:
| Metric | Type | Labels |
|---|---|---|
| notif.requested_total | counter | tenant, channel, category, source (event\|api\|scheduler\|batch) |
| notif.scheduled_total | counter | tenant, channel, reason |
| notif.dispatched_total | counter | tenant, channel, vendor, attempt_number |
| notif.delivered_total | counter | tenant, channel, vendor |
| notif.failed_total | counter | tenant, channel, vendor, reason |
| notif.bounced_total | counter | tenant, channel, vendor, bounce_type |
| notif.opened_total | counter | tenant, channel, vendor |
| notif.clicked_total | counter | tenant, channel, vendor |
| notif.suppressed_total | counter | tenant, channel, reason |
| notif.opted_out_total | counter | tenant, channel, source |
| notif.preferences_updated_total | counter | tenant, source |
| notif.template_publish_total | counter | tenant, key, source, hitl |
| notif.template_archive_total | counter | tenant, key |
| notif.channel_health_changed_total | counter | tenant, channel, vendor, status |
| notif.dispatch.latency_seconds | histogram | tenant, channel, vendor (queued→dispatched) |
| notif.delivery.latency_seconds | histogram | tenant, channel, vendor (queued→delivered) |
| notif.render.duration_seconds | histogram | tenant, channel, renderer_profile, locale |
| notif.render.errors_total | counter | tenant, channel, renderer_profile, error_kind |
| notif.rate_limit_hits_total | counter | tenant, scope (tenant\|recipient) |
| notif.webhook.received_total | counter | vendor, signature_valid |
| notif.webhook.applied_total | counter | vendor, type |
| notif.webhook.late_correlation_total | counter | vendor |
| notif.outbox.lag_seconds | gauge | partition (publish lag for oldest unpublished) |
| notif.feed.ws.active_connections | gauge | tenant |
| notif.batch.completion_seconds | histogram | tenant, channel |
| notif.ai.* | various | per AI_INTEGRATION §11 |
4. Tracing
4.1 Span taxonomy
| Span name | Where | Notable attributes |
|---|---|---|
| notification.enqueue | EnqueueNotificationUseCase root | notification.id, notification.channel, notification.category, template.key, template.semver, recipient.id, source_event.id? |
| notification.preference_gate | child | decision (send\|suppress\|defer) |
| notification.render | child | renderer.profile, locale, body.format, body.size_bytes, checksum |
| notification.sender_resolve | child | vendor, sender.kind, sender.id_hash |
| notification.rate_limit | child | scope, allowed |
| notification.dispatch.<channel> | dispatcher worker root | vendor, attempt.number, vendor.message_id?, outcome, http.status |
| notification.webhook_ingest.<vendor> | webhook root | vendor, events.count, signature_valid, status |
| notification.scheduler.tick | scheduler worker | processed.count, batch.size |
| notification.outbox.relay | relay worker | published.count, lag.seconds |
| notification.ai.draft.fetch | AIClient call | capability.key, draft.id, latency.ms, fallback |
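Sampling: 100 % for failed, bounced, suspicious_login, and anything in the security category; 10 % for transactional and operational; 1 % for marketing. Sampling is tail-based at the collector, so the keep/drop decision is made after a trace completes and outcome-dependent rules (failed, bounced) can be applied.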
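The sampling rules can be sketched as a small policy function. This is an illustration only: the `markers` argument (a set of tags such as failed or suspicious_login attached to a trace), the hash-based keep decision, and all function names are assumptions, and the real policy lives in the collector configuration.

```python
import hashlib
from typing import Iterable

# Markers that force a trace to be kept, per the sampling rules above.
ALWAYS_KEEP_MARKERS = {"failed", "bounced", "suspicious_login"}

# Category rates from the sampling rules. Categories the section does not
# list (for example reminder) are deliberately absent and raise KeyError.
RATE_BY_CATEGORY = {
    "security": 1.0,
    "transactional": 0.10,
    "operational": 0.10,
    "marketing": 0.01,
}


def sample_rate(category: str, markers: Iterable[str] = ()) -> float:
    """Return the fraction of traces to keep for this category and marker set."""
    if set(markers) & ALWAYS_KEEP_MARKERS:
        return 1.0
    return RATE_BY_CATEGORY[category]


def keep_trace(trace_id: str, rate: float) -> bool:
    """Deterministic keep/drop: hash the trace id into [0, 1) and compare."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x1_0000_0000
    return bucket < rate
```

Hashing the trace id instead of drawing a random number means every collector instance reaches the same decision for the same trace.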
4.2 Trace context propagation
- Inbound HTTP: read traceparent/tracestate; never trust client b3.
- Pub/Sub: trace context in message attributes (traceparent); the consumer creates a child span.
- Outbound to vendors: inject traceparent only when vendor accepts (most do not). Always tag vendor, route for correlation.
- Outbound to ai-orchestrator-service: inject traceparent; correlate via draftId.
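For the Pub/Sub case, a minimal sketch of reading traceparent from message attributes and deriving the child span's header. It follows the W3C Trace Context version 00 layout (version, 32-hex trace id, 16-hex parent id, 2-hex flags); the helper names are hypothetical, and a real service would normally rely on the OpenTelemetry propagators instead of hand-parsing.

```python
import re
import secrets
from typing import Mapping, Optional

# W3C Trace Context, version 00: version-traceid-parentid-flags, lowercase hex.
# Simplification: headers with extra fields (possible in future versions)
# are rejected here.
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(header: str) -> Optional[dict]:
    """Return the parsed fields, or None if the header is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # The spec forbids version ff and all-zero trace/parent ids.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {"trace_id": trace_id, "parent_id": parent_id, "flags": flags}


def extract_from_pubsub(attributes: Mapping[str, str]) -> Optional[dict]:
    """Read trace context from Pub/Sub message attributes; b3 is ignored."""
    raw = attributes.get("traceparent")
    return parse_traceparent(raw) if raw else None


def child_traceparent(parent: dict) -> str:
    """Build the header for a child span: same trace id, fresh span id."""
    span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    return f"00-{parent['trace_id']}-{span_id}-{parent['flags']}"
```

When `extract_from_pubsub` returns None, the consumer would start a new root trace rather than fail the message.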
4.3 Cross-service correlation
correlationId (the booking saga id, the user session id, etc.) is preserved end-to-end on every event envelope and span; engineers can pivot in Cloud Trace by correlation.id.
5. Logging
JSON-structured logs (one event per line) with the platform schema:
{
"ts": "2026-04-22T15:32:18.231Z",
"level": "info",
"service": "notification-service",
"version": "1.18.3",
"env": "prod",
"region": "asia-south1",
"msg": "notification.dispatched",
"tenant.id": "tnt_01H…",
"notification.id": "ntf_01J4A…",
"notification.channel": "email",
"notification.category": "transactional",
"template.key": "reservation.confirmed.email",
"template.semver": "1.4.2",
"vendor": "sendgrid",
"vendor.message_id": "smg_qz…",
"attempt.number": 1,
"outcome": "accepted",
"latency_ms": 833,
"trace_id": "01H3Z4WK7…",
"span_id": "abc1234…",
"correlation.id": "01J3Z…BOOKING"
}
Levels:
- error: terminal failures, programmer bugs, security violations.
- warn: HITL rejections, vendor degraded fallback, late webhook correlations, AI fallback to deterministic.
- info: lifecycle (requested, scheduled, dispatched, delivered, failed, bounced, suppressed, etc.).
- debug: only in non-prod by default; hot-path debug fields gated by feature flag.
PII: addresses are NEVER logged in plaintext (see SECURITY_MODEL §5). Only recipient.id and address.kind_hash are logged.
Sampling: info is fully captured for failed, bounced, suppressed, security-category; sampled at 25 % for high-volume delivered rows.
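The logging rules above can be sketched as a sampling predicate plus a one-line JSON emitter. Everything here is illustrative: the function names, the set of forbidden key names used as a plaintext-address guard, and the choice to keep lifecycle events that the section gives no rate for are all assumptions.

```python
import json
import random
from datetime import datetime, timezone

ALWAYS_LOGGED = {"failed", "bounced", "suppressed"}
DELIVERED_SAMPLE_RATE = 0.25
# Hypothetical key names that would carry a plaintext address.
FORBIDDEN_KEYS = {"address", "email", "phone"}


def should_log_info(lifecycle: str, category: str, rng=random.random) -> bool:
    """Decide whether an info-level lifecycle event is written."""
    if lifecycle in ALWAYS_LOGGED or category == "security":
        return True
    if lifecycle == "delivered":
        return rng() < DELIVERED_SAMPLE_RATE
    # Assumption: other lifecycle events are kept; the section gives no rate.
    return True


def log_line(msg: str, level: str = "info", **fields) -> str:
    """Render one JSON log event on a single line, refusing plaintext addresses."""
    leaked = FORBIDDEN_KEYS & fields.keys()
    if leaked:
        raise ValueError(f"plaintext address fields not allowed: {sorted(leaked)}")
    now = datetime.now(timezone.utc)
    ts = now.isoformat(timespec="milliseconds").replace("+00:00", "Z")
    record = {"ts": ts, "level": level, "service": "notification-service",
              "msg": msg, **fields}
    return json.dumps(record, ensure_ascii=False)


if should_log_info("dispatched", "transactional"):
    print(log_line("notification.dispatched",
                   **{"notification.channel": "email", "attempt.number": 1}))
```

Dotted keys such as notification.channel are passed through a dict because they are not valid Python keyword names.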
6. Dashboards
Six platform-managed Grafana/Cloud Monitoring dashboards (sources are Cloud Monitoring metrics + BigQuery):
- Overview: RED for HTTP/Pub/Sub/workers; SLO burn-down; queue backlogs; outbox lag; recent errors top-N.
- Per-channel funnel: requested → scheduled → dispatched → delivered/failed/bounced per channel × tenant; vendor breakdown; latency p50/p95/p99.
- Vendor health: per-vendor success rate, latency, retry distribution, channel health-flip timeline.
- Templates: published / archived per tenant; per-template send volume; render error rate; locale fallback count.
- AI: draft requests, HITL queue depth, approve/reject/edit/expire rates, fallback-to-deterministic %, cost per tenant per capability.
- Compliance: opt-out rate, suppression rate by reason, marketing consent coverage, data-residency violations (should be zero).
Each dashboard has tenant-id and time selectors. Per-tenant slices are available to staff via the BFF-rendered "Notification analytics" view.
7. Alerts
| Alert | Condition | Severity | Action |
|---|---|---|---|
| NotifApiHighErrorRate | 5xx rate > 1 % for 5 min on /api/v1/notifications* | page (P1) | on-call engineer |
| NotifEnqueueLatencyBreach | enqueue p95 > 500 ms for 10 min | page (P1) | on-call |
| NotifDispatchBacklog | status='queued' AND queued_at older than 60s count > 1000 | page (P1) | on-call |
| NotifVendorDegraded | per-channel delivered rate drops > 10 pct vs 24 h baseline for 30 min | ticket (P2) | platform |
| NotifChannelHealthFlip | any channel.status='down' | ticket (P2) | tenant ops |
| NotifWebhookHmacFailures | > 50 invalid HMACs / 5 min on any vendor | page (P1) | on-call (possible abuse or rotation gap) |
| NotifWebhookIngestionStalled | webhook_inbound.status='received' rows > 10 min unprocessed | page (P1) | on-call |
| NotifSuppressionRate | suppression rate > 5 % of dispatched for 1 h on any tenant | ticket (P2) | tenant ops |
| NotifAIHitlQueueDepth | per-tenant queue > 200 for > 1 h | ticket (P2) | tenant ops |
| NotifAIFallbackHigh | AI fallback > 25 % per capability for > 1 h | ticket (P2) | platform AI |
| NotifPostgresReplicationLag | replica lag > 10 s for 5 min | ticket (P2) | DBA |
| NotifOutboxLag | notif.outbox.lag_seconds > 5 for 5 min | page (P1) | on-call |
| NotifPubSubDLQGrowing | DLQ growth rate > 10/min for 10 min | page (P1) | on-call |
| NotifWSConnectionsSurge | active WS connections > 2× baseline for 10 min | ticket (P2) | platform |
| NotifBudgetExhausted | notif.failed_total{reason='budget_exhausted'} > 0 in 5 min | ticket (P3) | tenant ops |
Each alert references a runbook in FAILURE_MODES.
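Most conditions in the table have the shape "value above threshold for N minutes". A minimal sketch of that evaluation, assuming one sample per minute and that "for N min" means every sample in the window breaches; the actual alerting policy semantics may differ, and the function names and sample counts are hypothetical.

```python
from typing import Sequence


def error_rate(errors: int, total: int) -> float:
    """Fraction of failed requests in one sample; 0.0 when there was no traffic."""
    return errors / total if total else 0.0


def sustained_breach(samples: Sequence[float], threshold: float, window: int) -> bool:
    """True when each of the last `window` samples is strictly above threshold."""
    if window <= 0 or len(samples) < window:
        return False
    return all(value > threshold for value in samples[-window:])


# NotifApiHighErrorRate: 5xx rate > 1 % for 5 min, with made-up per-minute
# (5xx count, total requests) pairs.
per_minute = [(2, 1000), (20, 1000), (30, 1000), (15, 1000), (20, 1000), (11, 1000)]
rates = [error_rate(errors, total) for errors, total in per_minute]
should_page = sustained_breach(rates, threshold=0.01, window=5)
```

Requiring every sample in the window to breach avoids paging on a single spike, at the cost of reacting one full window later.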
8. Audit and compliance signals
Independent of operational telemetry, the following audit events are emitted to audit-service (consumed via the platform audit topic):
- Every template.published.v1 and template.archived.v1 (with publishedBy / archivedBy).
- Every channel.write and channel.credentials.rotate (actor + before/after diff with secret refs only).
- Every suppression.release (actor + reason + 4-eyes ticket reference if applicable).
- Every preferences.write.any (staff acting on behalf of a guest; actor + consent record id).
- Every internal/dlq/*/retry (actor + DLQ id).
- Every Secret Manager AccessSecretVersion (via Cloud Audit Logs).
Audit logs are immutable (Cloud Logging storage immutability) and retained for 7 years.
9. Synthetic monitoring
Cloud Monitoring synthetic checks run against staging and a thin slice of prod every 60 s:
- POST /api/v1/notifications with a fixed test tenant and template that resolves to a sandbox channel; expect 202 within 250 ms.
- The synthetic "send" produces a dispatched.v1 and a delivered webhook within 60 s; the check fails if not.
- WS connect to feed, receive a heartbeat within 5 s.
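The three checks can be sketched as one ordered probe that stops at the first failing step. The three callables are hypothetical stand-ins for the real HTTP, event, and WebSocket probes, injected so that the sequencing logic can be exercised without a network.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class SyntheticResult:
    ok: bool
    failed_step: Optional[str] = None


def run_synthetic(
    post_notification: Callable[[], Tuple[int, float]],  # -> (status, elapsed ms)
    wait_for_delivery: Callable[[float], bool],  # timeout in seconds -> seen?
    ws_heartbeat: Callable[[float], bool],  # timeout in seconds -> received?
) -> SyntheticResult:
    """Run the three synthetic steps in order and report the first failure."""
    status, elapsed_ms = post_notification()
    if status != 202 or elapsed_ms > 250:
        return SyntheticResult(False, "enqueue")
    if not wait_for_delivery(60.0):
        return SyntheticResult(False, "delivery")
    if not ws_heartbeat(5.0):
        return SyntheticResult(False, "ws_heartbeat")
    return SyntheticResult(True)


# Usage with stubbed probes: enqueue succeeds, delivery is never observed.
result = run_synthetic(lambda: (202, 120.0), lambda timeout: False, lambda timeout: True)
```

Reporting the failed step lets the resulting alert name the stage that broke, not just that the probe failed.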
Synthetic test tenant tnt_synth_notif_* is excluded from analytics dashboards via tenant-tag.
10. Cost observability
- Per-tenant per-channel cost is computed daily by joining delivery_attempts_facts with the vendor invoice import (cost_model_v1); written to notif.cost_per_send_usd_micro{tenant,channel,vendor}.
- Tenant admins see a cost panel; platform admins see top-N spenders.
- Anomaly detection runs on weekly cost using an STL decomposition; a deviation > 3σ triggers a ticket.
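To show only the 3σ threshold step, the sketch below uses a plain mean and standard deviation over trailing weeks. This is a simplified stand-in and not STL: an STL-based version would first remove trend and seasonality and apply the threshold to the residual. The function name and the cost figures are hypothetical.

```python
from statistics import mean, stdev
from typing import Sequence


def is_cost_anomaly(history: Sequence[float], current: float, sigmas: float = 3.0) -> bool:
    """Flag `current` when it deviates from the history mean by more than `sigmas` standard deviations."""
    if len(history) < 2:
        return False  # not enough data to estimate a spread
    mu = mean(history)
    sd = stdev(history)
    if sd == 0:
        return current != mu
    return abs(current - mu) > sigmas * sd


# Made-up weekly costs (USD) for one tenant and channel.
weekly_cost = [100.0, 102.0, 98.0, 101.0, 99.0]
opens_ticket = is_cost_anomaly(weekly_cost, 130.0)
within_band = is_cost_anomaly(weekly_cost, 103.0)
```

With this history the spread is small, so 130.0 is flagged while 103.0 is not.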
11. Local development observability
docker compose up brings up Jaeger, Prometheus, and Grafana with pre-provisioned dashboards (see LOCAL_DEV_SETUP). Engineers see traces and metrics for their local sends, including the AI orchestrator stub's draft latency.