OBSERVABILITY — notification-service

Sibling: APPLICATION_LOGIC · FAILURE_MODES · SERVICE_READINESS · DEPLOYMENT_TOPOLOGY

Strategic anchors: 02 Enterprise Architecture §12 Observability · 04 Event-Driven §10

We export OpenTelemetry signals (traces, metrics, logs) to the platform OTLP collector, which fans out to:

Cloud Trace for traces (sampled).
Cloud Monitoring for metrics (Prometheus exposition + OTLP push).
Cloud Logging for logs (JSON-structured), with a parallel sink to Loki for engineer-friendly querying.
BigQuery for long-term aggregates and SLI calculations.

All signals carry the standard ECC attributes (tenant.id, service.name='notification-service', service.version, service.instance.id, deployment.environment, gcp.region) plus notification-specific dimensions documented below.

1. Service Level Objectives (SLOs)

SLI	Definition	Target	Window	Error budget
Enqueue latency	p95 of `EnqueueNotificationUseCase` end-to-end (HTTP 2xx responses, measured from request receipt)	≤ 250 ms	30 d	5 %
Dispatch latency (transactional)	p95 of `queued → dispatched` for category∈{transactional,security}	≤ 5 s	30 d	1 %
Dispatch latency (operational/reminder)	p95 of `queued → dispatched` for category∈{operational,reminder}	≤ 30 s	30 d	5 %
Vendor-acknowledged delivery rate (per channel)	`delivered.v1 / dispatched.v1` per channel × tenant rolling	≥ 95 % email; ≥ 92 % sms; ≥ 95 % whatsapp; ≥ 90 % push	7 d	n/a (SLO informational)
Webhook ingestion success	`webhook_inbound.status='applied' / received`	≥ 99.9 %	30 d	0.1 %
WebSocket feed availability	uptime + connection success rate	≥ 99.5 %	30 d	0.5 %
Outbox publish lag	p95 of `enqueued_at → published_at`	≤ 1 s	30 d	1 %
Pub/Sub consumer lag	p95 of `producedAt → ackedAt` per subscription	≤ 5 s	30 d	5 %
Template render success rate	`render success / render attempts`	≥ 99.95 %	30 d	0.05 %
API availability	2xx + 4xx (excluding 5xx) over total	≥ 99.9 %	30 d	0.1 %

SLOs are computed in BigQuery from event tables and exposed via Cloud Monitoring custom metrics (notif.slo.<name>).
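As a worked example of the "Error budget" column, the sketch below turns a target and window into an allowed amount of badness and computes a burn rate. The function names are illustrative only; the production calculation runs in BigQuery and its exact queries are not shown in this document.

```python
def error_budget_fraction(target_pct: float) -> float:
    """Fraction of events (or time) that may be bad for a given SLO target."""
    return (100.0 - target_pct) / 100.0


def allowed_bad_minutes(target_pct: float, window_days: int) -> float:
    """Time-based budget: minutes of unavailability tolerated in the window."""
    return error_budget_fraction(target_pct) * window_days * 24 * 60


def allowed_bad_events(target_pct: float, total_events: int) -> float:
    """Event-based budget: number of bad events tolerated out of total_events."""
    return error_budget_fraction(target_pct) * total_events


def burn_rate(bad: int, total: int, target_pct: float) -> float:
    """Observed bad ratio divided by the budgeted ratio.

    1.0 means the budget is being consumed exactly on pace for the window;
    values above 1.0 mean it will be exhausted early.
    """
    if total == 0:
        return 0.0
    return (bad / total) / error_budget_fraction(target_pct)


# A 99.9 % target over 30 d tolerates roughly 43.2 minutes of badness.
budget_minutes = allowed_bad_minutes(99.9, 30)
```

For a request-based SLI such as API availability, `burn_rate(5, 1000, 99.9)` is about 5, meaning the 0.1 % budget is being consumed five times faster than the window allows.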

2. RED metrics per route/handler

For each REST route, gRPC method, Pub/Sub subscription, and worker:

Metric	Type	Labels
`http.server.duration_seconds`	histogram	`route, method, status_class, tenant`
`http.server.requests_total`	counter	`route, method, status_code, tenant`
`pubsub.consumer.duration_seconds`	histogram	`subscription, outcome`
`pubsub.consumer.messages_total`	counter	`subscription, outcome` (`ack`,`nack`,`dlq`)
`worker.tick.duration_seconds`	histogram	`worker`
`worker.tick.outcome_total`	counter	`worker, outcome`

Buckets follow the platform default histogram (5 ms … 30 s, 14 buckets).
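The exact bucket boundaries of the platform default are not listed here. As an illustration only, the sketch below assumes geometric (log-spaced) boundaries, which is one common way to cover 5 ms to 30 s with 14 upper bounds, and shows how an observation maps to a bucket.

```python
import bisect
from typing import List


def geometric_buckets(lo: float, hi: float, n: int) -> List[float]:
    """Return n log-spaced upper bounds from lo to hi (inclusive).

    Assumption: geometric spacing. The real platform default may differ.
    """
    ratio = (hi / lo) ** (1.0 / (n - 1))
    bounds = [lo * ratio ** i for i in range(n)]
    bounds[-1] = hi  # pin the last bound to avoid floating-point drift
    return bounds


BUCKETS = geometric_buckets(0.005, 30.0, 14)


def bucket_index(seconds: float) -> int:
    """Index of the first bucket whose upper bound is >= seconds.

    A return value of len(BUCKETS) means the observation falls into
    the implicit +Inf overflow bucket.
    """
    return bisect.bisect_left(BUCKETS, seconds)
```

With this spacing each bound is a little under twice the previous one, so resolution is finest at the low-latency end where the enqueue SLO sits.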

3. Domain metrics

Counters, histograms, and gauges covering the notification lifecycle:

Metric	Type	Labels
`notif.requested_total`	counter	`tenant, channel, category, source` (`event|api|scheduler|batch`)
`notif.scheduled_total`	counter	`tenant, channel, reason`
`notif.dispatched_total`	counter	`tenant, channel, vendor, attempt_number`
`notif.delivered_total`	counter	`tenant, channel, vendor`
`notif.failed_total`	counter	`tenant, channel, vendor, reason`
`notif.bounced_total`	counter	`tenant, channel, vendor, bounce_type`
`notif.opened_total`	counter	`tenant, channel, vendor`
`notif.clicked_total`	counter	`tenant, channel, vendor`
`notif.suppressed_total`	counter	`tenant, channel, reason`
`notif.opted_out_total`	counter	`tenant, channel, source`
`notif.preferences_updated_total`	counter	`tenant, source`
`notif.template_publish_total`	counter	`tenant, key, source, hitl`
`notif.template_archive_total`	counter	`tenant, key`
`notif.channel_health_changed_total`	counter	`tenant, channel, vendor, status`
`notif.dispatch.latency_seconds`	histogram	`tenant, channel, vendor` (queued→dispatched)
`notif.delivery.latency_seconds`	histogram	`tenant, channel, vendor` (queued→delivered)
`notif.render.duration_seconds`	histogram	`tenant, channel, renderer_profile, locale`
`notif.render.errors_total`	counter	`tenant, channel, renderer_profile, error_kind`
`notif.rate_limit_hits_total`	counter	`tenant, scope` (`tenant|recipient`)
`notif.webhook.received_total`	counter	`vendor, signature_valid`
`notif.webhook.applied_total`	counter	`vendor, type`
`notif.webhook.late_correlation_total`	counter	`vendor`
`notif.outbox.lag_seconds`	gauge	`partition` (publish lag for oldest unpublished)
`notif.feed.ws.active_connections`	gauge	`tenant`
`notif.batch.completion_seconds`	histogram	`tenant, channel`
`notif.ai.*`	various	per AI_INTEGRATION §11
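To make the label contract in the table concrete, here is a minimal in-memory sketch of a counter that rejects any increment whose label set differs from the declared one. `LabeledCounter` is a hypothetical stand-in for illustration; the service itself exports through OpenTelemetry, not through this class.

```python
from collections import Counter
from typing import Iterable, Tuple


class LabeledCounter:
    """Toy counter keyed by a fixed, ordered set of label names."""

    def __init__(self, name: str, label_names: Iterable[str]) -> None:
        self.name = name
        self.label_names: Tuple[str, ...] = tuple(label_names)
        self._values: Counter = Counter()

    def _key(self, labels: dict) -> Tuple[str, ...]:
        # Reject missing or unexpected labels so series stay comparable.
        if set(labels) != set(self.label_names):
            raise ValueError(
                f"{self.name} expects labels {self.label_names}, got {sorted(labels)}"
            )
        return tuple(str(labels[n]) for n in self.label_names)

    def inc(self, amount: int = 1, **labels: object) -> None:
        self._values[self._key(labels)] += amount

    def value(self, **labels: object) -> int:
        return self._values[self._key(labels)]


dispatched = LabeledCounter(
    "notif.dispatched_total", ["tenant", "channel", "vendor", "attempt_number"]
)
dispatched.inc(tenant="tnt_example", channel="email", vendor="sendgrid", attempt_number=1)
```

Enforcing the label set at the call site keeps cardinality predictable: a stray label such as a recipient id fails immediately instead of creating a new time series.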

4. Tracing

4.1 Span taxonomy

Span name	Where	Notable attributes
`notification.enqueue`	`EnqueueNotificationUseCase` root	`notification.id, notification.channel, notification.category, template.key, template.semver, recipient.id, source_event.id?`
`notification.preference_gate`	child	`decision` (`send|suppress|defer`)
`notification.render`	child	`renderer.profile, locale, body.format, body.size_bytes, checksum`
`notification.sender_resolve`	child	`vendor, sender.kind, sender.id_hash`
`notification.rate_limit`	child	`scope, allowed`
`notification.dispatch.<channel>`	dispatcher worker root	`vendor, attempt.number, vendor.message_id?, outcome, http.status`
`notification.webhook_ingest.<vendor>`	webhook root	`vendor, events.count, signature_valid, status`
`notification.scheduler.tick`	scheduler worker	`processed.count, batch.size`
`notification.outbox.relay`	relay worker	`published.count, lag.seconds`
`notification.ai.draft.fetch`	AIClient call	`capability.key, draft.id, latency.ms, fallback`

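Sampling: 100 % for failed and bounced notifications, suspicious_login, and anything in the security category; 10 % for transactional and operational; 1 % for marketing. Tail-based sampling at the collector makes the keep/drop decision after a trace completes, so traces that end in failure are retained regardless of their category rate.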
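The sampling rules can be read as a small decision function. The sketch below is an illustration, not the collector configuration: the function and constant names are hypothetical, and treating suspicious_login as an event name is an assumption. Categories with no documented rate (for example reminder) raise an error instead of guessing.

```python
import random
from typing import Optional

ALWAYS_KEEP_OUTCOMES = frozenset({"failed", "bounced"})
ALWAYS_KEEP_EVENTS = frozenset({"suspicious_login"})  # assumption: an event name
CATEGORY_RATES = {
    "security": 1.0,
    "transactional": 0.10,
    "operational": 0.10,
    "marketing": 0.01,
}


def sample_rate(
    category: str, outcome: Optional[str] = None, event: Optional[str] = None
) -> float:
    """Return the probability of keeping a trace under the documented rules."""
    if outcome in ALWAYS_KEEP_OUTCOMES or event in ALWAYS_KEEP_EVENTS:
        return 1.0
    try:
        return CATEGORY_RATES[category]
    except KeyError:
        raise ValueError(
            f"no documented sampling rate for category {category!r}"
        ) from None


def should_sample(rate: float, rng: random.Random) -> bool:
    """Keep the trace with probability `rate` (rng.random() is in [0, 1))."""
    return rng.random() < rate
```

Because the outcome is only known once the dispatch span has finished, the first branch corresponds to the tail-based decision; the category lookup could equally run as a head-based sampler.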

4.2 Trace context propagation

Inbound HTTP: read traceparent/tracestate; never trust client b3.
Pub/Sub: trace context in message attributes (traceparent); the consumer creates a child span.
Outbound to vendors: inject traceparent only when the vendor accepts it (most do not). Always tag vendor and route for correlation.
Outbound to ai-orchestrator-service: inject traceparent; correlate via draftId.
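For reference, a W3C traceparent value has the form `00-<32 hex trace id>-<16 hex parent span id>-<2 hex flags>`. The sketch below validates an inbound header and derives the value a child span would propagate, for example in Pub/Sub message attributes. It handles only version `00`, and the helper names are hypothetical; in the service this is normally done by the OpenTelemetry propagator, not by hand.

```python
import re
import secrets
from typing import Dict, Optional, Tuple

_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header: str) -> Optional[Tuple[str, str, str]]:
    """Return (trace_id, parent_id, flags), or None if the header is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero ids are explicitly invalid in the W3C Trace Context spec.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return trace_id, parent_id, flags


def child_traceparent(header: str) -> Optional[str]:
    """Keep the trace id and flags, replace the parent id with a new span id."""
    parsed = parse_traceparent(header)
    if parsed is None:
        return None
    trace_id, _, flags = parsed
    span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    while span_id == "0" * 16:
        span_id = secrets.token_hex(8)
    return f"00-{trace_id}-{span_id}-{flags}"


def pubsub_attributes(header: str) -> Dict[str, str]:
    """Message attributes carrying trace context; empty if the header is invalid."""
    child = child_traceparent(header)
    return {"traceparent": child} if child else {}
```

An invalid or missing header yields no attribute, in which case the consumer would start a new root trace instead of linking to a bogus parent.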

4.3 Cross-service correlation

correlationId (the booking saga id, the user session id, etc.) is preserved end-to-end on every event envelope and span; engineers can pivot in Cloud Trace by correlation.id.

5. Logging

JSON-structured logs (one event per line) with the platform schema:

{
  "ts": "2026-04-22T15:32:18.231Z",
  "level": "info",
  "service": "notification-service",
  "version": "1.18.3",
  "env": "prod",
  "region": "asia-south1",
  "msg": "notification.dispatched",
  "tenant.id": "tnt_01H…",
  "notification.id": "ntf_01J4A…",
  "notification.channel": "email",
  "notification.category": "transactional",
  "template.key": "reservation.confirmed.email",
  "template.semver": "1.4.2",
  "vendor": "sendgrid",
  "vendor.message_id": "smg_qz…",
  "attempt.number": 1,
  "outcome": "accepted",
  "latency_ms": 833,
  "trace_id": "01H3Z4WK7…",
  "span_id": "abc1234…",
  "correlation.id": "01J3Z…BOOKING"
}

Levels:

error: terminal failures, programmer bugs, security violations.
warn: HITL rejections, vendor degraded fallback, late webhook correlations, AI fallback to deterministic.
info: lifecycle (requested, scheduled, dispatched, delivered, failed, bounced, suppressed, etc.).
debug: only in non-prod by default; hot-path debug fields gated by feature flag.

PII: addresses are NEVER logged in plaintext (see SECURITY_MODEL §5). Only recipient.id and address.kind_hash are logged.

Sampling: info is fully captured for failed, bounced, suppressed, security-category; sampled at 25 % for high-volume delivered rows.
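A minimal sketch of how the PII and sampling rules above could be enforced at the logging call site. Everything here is illustrative: the function names and the list of plaintext address keys are hypothetical, and lifecycle events whose sampling is not specified (requested, scheduled, dispatched) are assumed to be kept.

```python
import json
import random
from datetime import datetime, timezone
from typing import Optional

ALWAYS_KEEP = frozenset({"failed", "bounced", "suppressed"})
DELIVERED_KEEP_RATE = 0.25
# Hypothetical field names that would carry a plaintext address.
PLAINTEXT_ADDRESS_KEYS = frozenset(
    {"recipient.email", "recipient.phone", "recipient.address"}
)


def keep_info_log(lifecycle: str, category: str, rng: random.Random) -> bool:
    """Decide whether an info-level lifecycle log line is emitted."""
    if lifecycle in ALWAYS_KEEP or category == "security":
        return True
    if lifecycle == "delivered":
        return rng.random() < DELIVERED_KEEP_RATE
    return True  # assumption: other lifecycle events are not sampled


def render_log_line(msg: str, fields: dict, now: Optional[datetime] = None) -> str:
    """Render one JSON log line, refusing fields that hold plaintext addresses."""
    leaked = PLAINTEXT_ADDRESS_KEYS.intersection(fields)
    if leaked:
        raise ValueError(f"plaintext address fields are not loggable: {sorted(leaked)}")
    now = now or datetime.now(timezone.utc)
    ts = now.astimezone(timezone.utc).isoformat(timespec="milliseconds")
    record = {
        "ts": ts.replace("+00:00", "Z"),
        "level": "info",
        "service": "notification-service",
        "msg": msg,
    }
    record.update(fields)
    return json.dumps(record, ensure_ascii=False)
```

Raising on a forbidden key, instead of silently dropping it, surfaces the mistake in tests before it can reach production logs.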

6. Dashboards

Six platform-managed Grafana/Cloud Monitoring dashboards (sources are Cloud Monitoring metrics + BigQuery):

Overview — RED for HTTP/Pub/Sub/workers; SLO burn-down; queue backlogs; outbox lag; recent errors top-N.
Per-channel funnel — requested → scheduled → dispatched → delivered/failed/bounced per channel × tenant; vendor breakdown; latency p50/p95/p99.
Vendor health — per-vendor success rate, latency, retry distribution, channel health-flip timeline.
Templates — published / archived per tenant; per-template send volume; render error rate; locale fallback count.
AI — draft requests, HITL queue depth, approve/reject/edit/expire rates, fallback-to-deterministic %, cost per tenant per capability.
Compliance — opt-out rate, suppression rate by reason, marketing consent coverage, data-residency violations (should be zero).

Each dashboard has tenant-id and time selectors. Per-tenant slices are available to staff via the BFF-rendered "Notification analytics" view.

7. Alerts

Alert	Condition	Severity	Action
`NotifApiHighErrorRate`	`5xx rate > 1 %` for 5 min on `/api/v1/notifications*`	page (P1)	on-call engineer
`NotifEnqueueLatencyBreach`	enqueue p95 > 500 ms for 10 min	page (P1)	on-call
`NotifDispatchBacklog`	`status='queued' AND queued_at older than 60s` count > 1000	page (P1)	on-call
`NotifVendorDegraded`	per-channel delivered rate drops > 10 pct vs 24 h baseline for 30 min	ticket (P2)	platform
`NotifChannelHealthFlip`	any `channel.status='down'`	ticket (P2)	tenant ops
`NotifWebhookHmacFailures`	> 50 invalid HMACs / 5 min on any vendor	page (P1)	on-call (possible abuse or rotation gap)
`NotifWebhookIngestionStalled`	`webhook_inbound.status='received'` rows > 10 min unprocessed	page (P1)	on-call
`NotifSuppressionRate`	suppression rate > 5 % of dispatched for 1 h on any tenant	ticket (P2)	tenant ops
`NotifAIHitlQueueDepth`	per-tenant queue > 200 for > 1 h	ticket (P2)	tenant ops
`NotifAIFallbackHigh`	AI fallback > 25 % per capability for > 1 h	ticket (P2)	platform AI
`NotifPostgresReplicationLag`	replica lag > 10 s for 5 min	ticket (P2)	DBA
`NotifOutboxLag`	`notif.outbox.lag_seconds > 5` for 5 min	page (P1)	on-call
`NotifPubSubDLQGrowing`	DLQ growth rate > 10/min for 10 min	page (P1)	on-call
`NotifWSConnectionsSurge`	active WS connections > 2× baseline for 10 min	ticket (P2)	platform
`NotifBudgetExhausted`	`notif.failed_total{reason='budget_exhausted'}` > 0 in 5 min	ticket (P3)	tenant ops

Each alert references a runbook in FAILURE_MODES.
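Most conditions in the table have the shape "value above threshold for N minutes". The sketch below shows that evaluation for `NotifApiHighErrorRate`, assuming one (5xx count, total count) sample per minute. It only illustrates the semantics; the real alert policies live in Cloud Monitoring, and the function names are hypothetical.

```python
from typing import Sequence, Tuple


def error_ratio(errors: int, total: int) -> float:
    """5xx share of traffic; an idle minute counts as healthy."""
    return errors / total if total else 0.0


def sustained_breach(values: Sequence[float], threshold: float, points: int) -> bool:
    """True only if each of the last `points` values exceeds the threshold."""
    if points <= 0 or len(values) < points:
        return False
    return all(v > threshold for v in values[-points:])


def api_high_error_rate(per_minute: Sequence[Tuple[int, int]]) -> bool:
    """5xx rate > 1 % for 5 consecutive one-minute samples."""
    ratios = [error_ratio(errors, total) for errors, total in per_minute]
    return sustained_breach(ratios, threshold=0.01, points=5)
```

Requiring every sample in the window to breach means a single healthy minute resets the alert, which is what keeps short spikes from paging.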

8. Audit and compliance signals

Independent of operational telemetry, the following audit events are emitted to audit-service (consumed via the platform audit topic):

Every template.published.v1 and template.archived.v1 (with publishedBy / archivedBy).
Every channel.write and channel.credentials.rotate (actor + before/after diff with secret refs only).
Every suppression.release (actor + reason + 4-eyes ticket reference if applicable).
Every preferences.write.any (staff acting on behalf of a guest: actor + consent record id).
Every internal/dlq/*/retry (actor + DLQ id).
Every Secret Manager AccessSecretVersion (via Cloud Audit Logs).

Audit logs are immutable (Cloud Logging storage immutability) and retained for 7 years.

9. Synthetic monitoring

Cloud Monitoring synthetic checks run against staging and a thin slice of prod every 60 s:

POST /api/v1/notifications with a fixed test tenant and template that resolves to a sandbox channel; expect 202 within 250 ms.
The synthetic "send" produces a dispatched.v1 and a delivered webhook within 60 s; the check fails if not.
WS connect to feed, receive a heartbeat within 5 s.

The synthetic test tenant tnt_synth_notif_* is excluded from analytics dashboards via a tenant tag.
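The pass/fail criteria of the three checks can be summarised as a pure evaluation step over the measurements of one run. This is a sketch under assumptions: `SyntheticRun` and its field names are hypothetical, and how the measurements are collected (HTTP client, webhook listener, WS client) is out of scope.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SyntheticRun:
    http_status: int
    enqueue_latency_ms: float
    dispatched_after_s: Optional[float]  # None if no dispatched.v1 was seen
    delivered_after_s: Optional[float]   # None if no delivered webhook was seen
    heartbeat_after_s: Optional[float]   # None if the WS sent no heartbeat


def evaluate(run: SyntheticRun) -> List[str]:
    """Return the list of failed expectations; empty means the run passed."""
    failures: List[str] = []
    if run.http_status != 202:
        failures.append("enqueue: expected HTTP 202")
    if run.enqueue_latency_ms > 250:
        failures.append("enqueue: slower than 250 ms")
    if run.dispatched_after_s is None or run.dispatched_after_s > 60:
        failures.append("send: no dispatched.v1 within 60 s")
    if run.delivered_after_s is None or run.delivered_after_s > 60:
        failures.append("send: no delivered webhook within 60 s")
    if run.heartbeat_after_s is None or run.heartbeat_after_s > 5:
        failures.append("feed: no heartbeat within 5 s")
    return failures
```

Returning every failed expectation, not just the first, lets a single run show whether a problem is isolated to one stage or spans enqueue, dispatch, and feed.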

10. Cost observability

Per-tenant per-channel cost is computed daily by joining delivery_attempts_facts with the vendor invoice import (cost_model_v1); the result is written to notif.cost_per_send_usd_micro{tenant,channel,vendor}.
Tenant admins see a cost panel; platform admins see top-N spenders.
Anomaly detection on weekly cost uses an STL decomposition; a deviation > 3σ triggers a ticket.
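The 3σ rule itself is simple once the decomposition has been done. The sketch below does not perform STL; it assumes the residual component has already been extracted elsewhere and only flags residuals more than three standard deviations from their mean. The function name is hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def three_sigma_anomalies(residuals: Sequence[float], k: float = 3.0) -> List[int]:
    """Return indexes of residuals further than k standard deviations from the mean.

    Input is assumed to be the residual series left after removing trend
    and seasonality (for example by STL); this function does not decompose.
    """
    if len(residuals) < 2:
        return []
    mu = mean(residuals)
    sigma = pstdev(residuals)
    if sigma == 0:
        return []  # a flat series has no outliers
    return [i for i, r in enumerate(residuals) if abs(r - mu) > k * sigma]
```

Note that with a population standard deviation a single outlier among n points can reach at most (n − 1)/√n deviations, so a 3σ rule cannot fire on series shorter than about a dozen points; a weekly series needs a few months of history before this check is meaningful.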

11. Local development observability

docker compose up brings up Jaeger, Prometheus, and Grafana with pre-provisioned dashboards (see LOCAL_DEV_SETUP). Engineers see traces and metrics for their local sends, including the AI orchestrator stub's draft latency.
