
OBSERVABILITY — notification-service

Sibling: APPLICATION_LOGIC · FAILURE_MODES · SERVICE_READINESS · DEPLOYMENT_TOPOLOGY

Strategic anchors: 02 Enterprise Architecture §12 Observability · 04 Event-Driven §10

We export OpenTelemetry signals (traces, metrics, logs) to the platform OTLP collector, which fans out to:

  • Cloud Trace for traces (sampled).
  • Cloud Monitoring for metrics (Prometheus exposition + OTLP push).
  • Cloud Logging for logs (JSON-structured), with a parallel sink to Loki for engineer-friendly querying.
  • BigQuery for long-term aggregates and SLI calculations.

All signals carry the standard ECC attributes (tenant.id, service.name='notification-service', service.version, service.instance.id, deployment.environment, gcp.region) plus notification-specific dimensions documented below.


1. Service Level Objectives (SLOs)

| SLI | Definition | Target | Window | Error budget |
|---|---|---|---|---|
| Enqueue latency | p95 of EnqueueNotificationUseCase end-to-end (HTTP 2xx, since recv) | ≤ 250 ms | 30 d | 5 % |
| Dispatch latency (transactional) | p95 of queued → dispatched for category∈{transactional,security} | ≤ 5 s | 30 d | 1 % |
| Dispatch latency (operational/reminder) | p95 | ≤ 30 s | 30 d | 5 % |
| Vendor-acknowledged delivery rate (per channel) | delivered.v1 / dispatched.v1 per channel × tenant rolling | ≥ 95 % email; ≥ 92 % sms; ≥ 95 % whatsapp; ≥ 90 % push | 7 d | n/a (SLO informational) |
| Webhook ingestion success | webhook_inbound.status='applied' / received | ≥ 99.9 % | 30 d | 0.1 % |
| WebSocket feed availability | uptime + connection success rate | ≥ 99.5 % | 30 d | 0.5 % |
| Outbox publish lag | p95 of enqueued_at → published_at | ≤ 1 s | 30 d | 1 % |
| Pub/Sub consumer lag | p95 of producedAt → ackedAt per subscription | ≤ 5 s | 30 d | 5 % |
| Template render success rate | render success / render attempts | ≥ 99.95 % | 30 d | 0.05 % |
| API availability | 2xx + 4xx (excluding 5xx) over total | ≥ 99.9 % | 30 d | 0.1 % |

SLOs are computed in BigQuery from event tables and exposed via Cloud Monitoring custom metrics (notif.slo.<name>).
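
To make the Error budget column concrete, here is a minimal sketch of the arithmetic, assuming a ratio-style SLI such as API availability. The function names are illustrative and are not part of the service or of the BigQuery jobs.

```python
def error_budget_fraction(target_percent: float) -> float:
    """Share of events allowed to miss the objective, e.g. 99.9 -> 0.001."""
    return (100.0 - target_percent) / 100.0


def allowed_bad_minutes(target_percent: float, window_days: int) -> float:
    """Budget expressed as minutes of full outage across the window."""
    return error_budget_fraction(target_percent) * window_days * 24 * 60


def budget_remaining(target_percent: float, good: int, total: int) -> float:
    """Unspent share of the budget: 1.0 is untouched, below 0 is overspent."""
    allowed_bad = error_budget_fraction(target_percent) * total
    if allowed_bad == 0:
        # No budget at all (100 % target or no traffic yet).
        return 1.0 if good == total else float("-inf")
    return 1.0 - (total - good) / allowed_bad


if __name__ == "__main__":
    # API availability: 99.9 % over 30 d leaves roughly 43.2 minutes of budget.
    print(round(allowed_bad_minutes(99.9, 30), 1))
    # 500 failures out of 1,000,000 requests spends about half of the budget.
    print(round(budget_remaining(99.9, 999_500, 1_000_000), 3))
```

The same ratio applies to event-based SLIs such as webhook ingestion success, where "bad" counts rows that never reach status='applied'.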


2. RED metrics per route/handler

For each REST route, gRPC method, Pub/Sub subscription, and worker:

| Metric | Type | Labels |
|---|---|---|
| http.server.duration_seconds | histogram | route, method, status_class, tenant |
| http.server.requests_total | counter | route, method, status_code, tenant |
| pubsub.consumer.duration_seconds | histogram | subscription, outcome |
| pubsub.consumer.messages_total | counter | subscription, outcome (ack, nack, dlq) |
| worker.tick.duration_seconds | histogram | worker |
| worker.tick.outcome_total | counter | worker, outcome |

Buckets follow the platform default histogram (5 ms … 30 s, 14 buckets).
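
The two HTTP metrics differ in one label: the histogram is keyed by status_class and the counter by status_code. The sketch below illustrates that split with in-memory stand-ins for a metrics SDK; record_request and the two containers are hypothetical names used only for illustration.

```python
from collections import Counter, defaultdict

# In-memory stand-ins for a metrics SDK (assumption: illustration only).
requests_total = Counter()
duration_seconds = defaultdict(list)


def status_class(status_code: int) -> str:
    """Collapse a status code to its class, e.g. 503 -> '5xx'."""
    return f"{status_code // 100}xx"


def record_request(route: str, method: str, status_code: int,
                   tenant: str, seconds: float) -> None:
    """Record one request against both RED metrics."""
    # Counter keeps the exact status code.
    requests_total[(route, method, status_code, tenant)] += 1
    # Histogram groups by status class, so 500 and 503 share one series.
    key = (route, method, status_class(status_code), tenant)
    duration_seconds[key].append(seconds)
```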


3. Domain metrics

Counters & histograms for the things we care about:

| Metric | Type | Labels |
|---|---|---|
| notif.requested_total | counter | tenant, channel, category, source (event, api, scheduler, or batch) |
| notif.scheduled_total | counter | tenant, channel, reason |
| notif.dispatched_total | counter | tenant, channel, vendor, attempt_number |
| notif.delivered_total | counter | tenant, channel, vendor |
| notif.failed_total | counter | tenant, channel, vendor, reason |
| notif.bounced_total | counter | tenant, channel, vendor, bounce_type |
| notif.opened_total | counter | tenant, channel, vendor |
| notif.clicked_total | counter | tenant, channel, vendor |
| notif.suppressed_total | counter | tenant, channel, reason |
| notif.opted_out_total | counter | tenant, channel, source |
| notif.preferences_updated_total | counter | tenant, source |
| notif.template_publish_total | counter | tenant, key, source, hitl |
| notif.template_archive_total | counter | tenant, key |
| notif.channel_health_changed_total | counter | tenant, channel, vendor, status |
| notif.dispatch.latency_seconds | histogram | tenant, channel, vendor (queued→dispatched) |
| notif.delivery.latency_seconds | histogram | tenant, channel, vendor (queued→delivered) |
| notif.render.duration_seconds | histogram | tenant, channel, renderer_profile, locale |
| notif.render.errors_total | counter | tenant, channel, renderer_profile, error_kind |
| notif.rate_limit_hits_total | counter | tenant, scope (tenant or recipient) |
| notif.webhook.received_total | counter | vendor, signature_valid |
| notif.webhook.applied_total | counter | vendor, type |
| notif.webhook.late_correlation_total | counter | vendor |
| notif.outbox.lag_seconds | gauge | partition (publish lag for oldest unpublished) |
| notif.feed.ws.active_connections | gauge | tenant |
| notif.batch.completion_seconds | histogram | tenant, channel |
| notif.ai.* | various | per AI_INTEGRATION §11 |

4. Tracing

4.1 Span taxonomy

| Span name | Where | Notable attributes |
|---|---|---|
| notification.enqueue | EnqueueNotificationUseCase root | notification.id, notification.channel, notification.category, template.key, template.semver, recipient.id, source_event.id? |
| notification.preference_gate | child | decision (send, suppress, or defer) |
| notification.render | child | renderer.profile, locale, body.format, body.size_bytes, checksum |
| notification.sender_resolve | child | vendor, sender.kind, sender.id_hash |
| notification.rate_limit | child | scope, allowed |
| notification.dispatch.<channel> | dispatcher worker root | vendor, attempt.number, vendor.message_id?, outcome, http.status |
| notification.webhook_ingest.<vendor> | webhook root | vendor, events.count, signature_valid, status |
| notification.scheduler.tick | scheduler worker | processed.count, batch.size |
| notification.outbox.relay | relay worker | published.count, lag.seconds |
| notification.ai.draft.fetch | AIClient call | capability.key, draft.id, latency.ms, fallback |

Sampling: 100 % for failed, bounced, suspicious_login, and anything in the security category; 10 % for transactional and operational; 1 % for marketing. Tail-based sampling at the collector retains the always-keep traces, because outcomes such as failed or bounced are known only once the trace has completed.
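
The sampling rules can be read as a small decision function. The sketch below is one possible encoding, not the collector's actual configuration. It makes three assumptions: suspicious_login is recognised from the template key, categories without a stated rate fall back to 10 %, and the keep/drop decision is derived from the trace id so that every hop agrees.

```python
from typing import Optional

ALWAYS_KEEP_OUTCOMES = {"failed", "bounced"}
CATEGORY_RATES = {"transactional": 0.10, "operational": 0.10, "marketing": 0.01}
DEFAULT_RATE = 0.10  # assumption: the text gives no rate for other categories


def sample_rate(category: str, outcome: Optional[str] = None,
                template_key: Optional[str] = None) -> float:
    """Return the fraction of traces to keep for this notification."""
    if category == "security" or outcome in ALWAYS_KEEP_OUTCOMES:
        return 1.0
    # Assumption: suspicious_login is identified by the template key.
    if template_key is not None and "suspicious_login" in template_key:
        return 1.0
    return CATEGORY_RATES.get(category, DEFAULT_RATE)


def keep_trace(trace_id_hex: str, rate: float) -> bool:
    """Deterministic decision from the trace id, stable across services."""
    if rate >= 1.0:
        return True
    # Map the low 64 bits of the trace id onto the unit interval.
    bucket = int(trace_id_hex[-16:], 16) / 2**64
    return bucket < rate
```

Because outcome is only known at the end of a trace, a rule like this would have to run in a tail-based sampler rather than at span creation.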

4.2 Trace context propagation

  • Inbound HTTP: read traceparent/tracestate; never trust client b3.
  • Pub/Sub: trace context in message attributes (traceparent); the consumer creates a child span.
  • Outbound to vendors: inject traceparent only when the vendor accepts it (most do not). Always tag the span with vendor and route for correlation.
  • Outbound to ai-orchestrator-service: inject traceparent; correlate via draftId.
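
As an illustration of the Pub/Sub hop, the sketch below parses a W3C traceparent value from message attributes and derives a child span context on the consumer side. It handles only version 00 of the header, the helper names are hypothetical, and a real service would normally rely on the OpenTelemetry propagator rather than hand-rolled parsing.

```python
import re
import secrets
from typing import Optional

# W3C Trace Context, version 00: version-traceid-parentid-flags, lowercase hex.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(value: str) -> Optional[dict]:
    """Return the parsed context, or None if the header is unusable."""
    match = _TRACEPARENT.match(value.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None  # all-zero ids are invalid
    return {"trace_id": trace_id, "parent_id": parent_id,
            "sampled": bool(int(flags, 16) & 0x01)}


def start_child_span(attributes: dict) -> dict:
    """Consumer side: continue the producer's trace with a new span id."""
    parent = parse_traceparent(attributes.get("traceparent", ""))
    span_id = secrets.token_hex(8)
    if parent is None:
        # No usable context: start a fresh trace (assumption: unsampled).
        return {"trace_id": secrets.token_hex(16), "parent_span_id": None,
                "span_id": span_id, "sampled": False}
    return {"trace_id": parent["trace_id"], "parent_span_id": parent["parent_id"],
            "span_id": span_id, "sampled": parent["sampled"]}


def to_traceparent(span: dict) -> str:
    """Serialize a span context for injection into an outbound call."""
    flags = "01" if span["sampled"] else "00"
    return f"00-{span['trace_id']}-{span['span_id']}-{flags}"
```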

4.3 Cross-service correlation

correlationId (the booking saga id, the user session id, etc.) is preserved end-to-end on every event envelope and span; engineers can pivot in Cloud Trace by correlation.id.


5. Logging

JSON-structured logs (one event per line) with the platform schema:

{
"ts": "2026-04-22T15:32:18.231Z",
"level": "info",
"service": "notification-service",
"version": "1.18.3",
"env": "prod",
"region": "asia-south1",
"msg": "notification.dispatched",
"tenant.id": "tnt_01H…",
"notification.id": "ntf_01J4A…",
"notification.channel": "email",
"notification.category": "transactional",
"template.key": "reservation.confirmed.email",
"template.semver": "1.4.2",
"vendor": "sendgrid",
"vendor.message_id": "smg_qz…",
"attempt.number": 1,
"outcome": "accepted",
"latency_ms": 833,
"trace_id": "01H3Z4WK7…",
"span_id": "abc1234…",
"correlation.id": "01J3Z…BOOKING"
}

Levels:

  • error: terminal failures, programmer bugs, security violations.
  • warn: HITL rejections, vendor degraded fallback, late webhook correlations, AI fallback to deterministic.
  • info: lifecycle (requested, scheduled, dispatched, delivered, failed, bounced, suppressed, etc.).
  • debug: only in non-prod by default; hot-path debug fields gated by feature flag.

PII: recipient addresses are NEVER logged in plaintext (see SECURITY_MODEL §5); logs carry recipient.id and address.kind_hash only.

Sampling: info logs are captured in full for failed, bounced, suppressed, and security-category events; high-volume delivered rows are sampled at 25 %.
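
A minimal sketch of the info-level sampling rule and the one-event-per-line serialization follows. The helper names are hypothetical, and treating lifecycle events other than delivered as unsampled is an assumption, since the text only states a rate for delivered rows.

```python
import json

ALWAYS_LOGGED_EVENTS = {"failed", "bounced", "suppressed"}
DELIVERED_SAMPLE_RATE = 0.25


def should_log_info(event: str, category: str, draw: float) -> bool:
    """Decide whether to emit an info log.

    `draw` is a uniform random number in [0, 1), e.g. random.random().
    """
    if event in ALWAYS_LOGGED_EVENTS or category == "security":
        return True
    if event == "delivered":
        return draw < DELIVERED_SAMPLE_RATE
    return True  # assumption: other lifecycle events are not sampled


def to_log_line(fields: dict) -> str:
    """Serialize one event as a single JSON line (newlines are escaped)."""
    return json.dumps(fields, ensure_ascii=False, separators=(",", ":"))
```

Passing the random draw in as an argument keeps the rule deterministic under test.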


6. Dashboards

Six platform-managed Grafana/Cloud Monitoring dashboards (sources are Cloud Monitoring metrics + BigQuery):

  1. Overview — RED for HTTP/Pub/Sub/workers; SLO burn-down; queue backlogs; outbox lag; recent errors top-N.
  2. Per-channel funnel — requested → scheduled → dispatched → delivered/failed/bounced per channel × tenant; vendor breakdown; latency p50/p95/p99.
  3. Vendor health — per-vendor success rate, latency, retry distribution, channel health-flip timeline.
  4. Templates — published / archived per tenant; per-template send volume; render error rate; locale fallback count.
  5. AI — draft requests, HITL queue depth, approve/reject/edit/expire rates, fallback-to-deterministic %, cost per tenant per capability.
  6. Compliance — opt-out rate, suppression rate by reason, marketing consent coverage, data-residency violations (should be zero).

Each dashboard has tenant-id and time selectors. Per-tenant slices are available to staff via the BFF-rendered "Notification analytics" view.


7. Alerts

| Alert | Condition | Severity | Action |
|---|---|---|---|
| NotifApiHighErrorRate | 5xx rate > 1 % for 5 min on /api/v1/notifications* | page (P1) | on-call engineer |
| NotifEnqueueLatencyBreach | enqueue p95 > 500 ms for 10 min | page (P1) | on-call |
| NotifDispatchBacklog | status='queued' AND queued_at older than 60s count > 1000 | page (P1) | on-call |
| NotifVendorDegraded | per-channel delivered rate drops > 10 pct vs 24 h baseline for 30 min | ticket (P2) | platform |
| NotifChannelHealthFlip | any channel.status='down' | ticket (P2) | tenant ops |
| NotifWebhookHmacFailures | > 50 invalid HMACs / 5 min on any vendor | page (P1) | on-call (possible abuse or rotation gap) |
| NotifWebhookIngestionStalled | webhook_inbound.status='received' rows > 10 min unprocessed | page (P1) | on-call |
| NotifSuppressionRate | suppression rate > 5 % of dispatched for 1 h on any tenant | ticket (P2) | tenant ops |
| NotifAIHitlQueueDepth | per-tenant queue > 200 for > 1 h | ticket (P2) | tenant ops |
| NotifAIFallbackHigh | AI fallback > 25 % per capability for > 1 h | ticket (P2) | platform AI |
| NotifPostgresReplicationLag | replica lag > 10 s for 5 min | ticket (P2) | DBA |
| NotifOutboxLag | notif.outbox.lag_seconds > 5 for 5 min | page (P1) | on-call |
| NotifPubSubDLQGrowing | DLQ growth rate > 10/min for 10 min | page (P1) | on-call |
| NotifWSConnectionsSurge | active WS connections > 2× baseline for 10 min | ticket (P2) | platform |
| NotifBudgetExhausted | notif.failed_total{reason='budget_exhausted'} > 0 in 5 min | ticket (P3) | tenant ops |

Each alert references a runbook in FAILURE_MODES.
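
Most conditions above have the shape "value above a threshold for a sustained period", for example NotifOutboxLag (notif.outbox.lag_seconds > 5 for 5 min). The sketch below shows one way to evaluate that shape over scraped samples. It is an illustration rather than the Cloud Monitoring policy itself, and the 60 s scrape interval is an assumption.

```python
from typing import Iterable, Tuple


def breached_for(samples: Iterable[Tuple[float, float]], threshold: float,
                 duration_s: float, now: float,
                 scrape_interval_s: float = 60.0) -> bool:
    """True if every sample in the trailing window exceeds the threshold.

    `samples` are (unix_seconds, value) pairs. The window must also be
    covered from its start, so a metric that only just appeared cannot fire.
    """
    window = [(t, v) for t, v in samples if now - duration_s <= t <= now]
    if not window:
        return False
    oldest = min(t for t, _ in window)
    covered = oldest <= now - duration_s + scrape_interval_s
    return covered and all(v > threshold for _, v in window)
```

For NotifOutboxLag this would be called with threshold=5 and duration_s=300.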


8. Audit and compliance signals

Independent of operational telemetry, the following audit events are emitted to audit-service (consumed via the platform audit topic):

  • Every template.published.v1 and template.archived.v1 (with publishedBy / archivedBy).
  • Every channel.write and channel.credentials.rotate (actor + before/after diff with secret refs only).
  • Every suppression.release (actor + reason + 4-eyes ticket reference if applicable).
  • Every preferences.write.any (staff acting on behalf of a guest; actor + consent record id).
  • Every internal/dlq/*/retry (actor + DLQ id).
  • Every Secret Manager AccessSecretVersion (via Cloud Audit Logs).

Audit logs are immutable (Cloud Logging storage immutability) and retained for 7 years.


9. Synthetic monitoring

Cloud Monitoring synthetic checks run against staging and a thin slice of prod every 60 s:

  • POST /api/v1/notifications with a fixed test tenant and template that resolves to a sandbox channel; expect 202 within 250 ms.
  • The synthetic "send" produces a dispatched.v1 and a delivered webhook within 60 s; the check fails if not.
  • WS connect to feed, receive a heartbeat within 5 s.

Synthetic test tenant tnt_synth_notif_* is excluded from analytics dashboards via tenant-tag.
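
The first check (expect 202 within 250 ms) can be sketched as below. The HTTP call and the clock are injected so the logic can be exercised without a network. The payload field names, the tenant id tnt_synth_notif_example, and the template key are hypothetical placeholders, not the real request schema.

```python
import time
from typing import Callable, NamedTuple


class CheckResult(NamedTuple):
    ok: bool
    status: int
    elapsed_s: float


def check_enqueue(post: Callable[[str, dict], int],
                  clock: Callable[[], float] = time.monotonic,
                  budget_s: float = 0.250) -> CheckResult:
    """POST a synthetic notification and verify status and latency.

    `post(path, payload)` must return the HTTP status code.
    """
    payload = {
        "tenantId": "tnt_synth_notif_example",  # hypothetical test tenant id
        "templateKey": "synthetic.sandbox",     # hypothetical sandbox template
    }
    started = clock()
    status = post("/api/v1/notifications", payload)
    elapsed = clock() - started
    return CheckResult(status == 202 and elapsed <= budget_s, status, elapsed)
```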


10. Cost observability

  • Per-tenant per-channel cost is computed daily by joining delivery_attempts_facts with the vendor invoice import (cost_model_v1); written to notif.cost_per_send_usd_micro{tenant,channel,vendor}.
  • Tenant admins see a cost panel; platform admins see top-N spenders.
  • Anomaly detection on weekly cost using an STL decomposition; > 3σ triggers a ticket.
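
The 3σ rule can be sketched as follows, assuming the inputs are the residuals left after the STL step has removed trend and seasonality. The decomposition itself is not shown, and is_cost_anomaly is a hypothetical helper name.

```python
from statistics import mean, pstdev
from typing import Sequence


def is_cost_anomaly(history: Sequence[float], latest: float,
                    k: float = 3.0) -> bool:
    """Flag `latest` if it lies more than k standard deviations from the
    mean of the historical residuals."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        return latest != mu  # flat history: any deviation stands out
    return abs(latest - mu) > k * sigma
```

Estimating σ from history alone, rather than including the point under test, keeps a single large spike from inflating its own threshold.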

11. Local development observability

docker compose up brings up Jaeger, Prometheus, and Grafana with pre-provisioned dashboards (see LOCAL_DEV_SETUP). Engineers see traces and metrics for their local sends, including the AI orchestrator stub's draft latency.