OBSERVABILITY — analytics-service

Sibling: SECURITY_MODEL · FAILURE_MODES · platform anchor: docs/02 §13 Observability

OpenTelemetry → OTel Collector → SigNoz (traces, metrics, logs) and Cloud Monitoring (managed metrics + uptime + SLO budgets).

1. Service identity

Resource attribute	Value
`service.name`	`analytics-service`
`service.version`	semver from CI (`x.y.z`) + git sha
`deployment.environment`	`dev` | `stg` | `prod`
`cloud.region`	residency-bound region
`melmastoon.role`	`api` | `etl-worker` | `pubsub-sink` | `looker-broker`

Every span, log, and metric carries melmastoon.tenant_id, melmastoon.user_id?, melmastoon.correlation_id, and melmastoon.causation_id? (a trailing ? marks an optional attribute). Before export, the PII redactor strips values matching email, phone, and nationalId patterns.
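The redaction step can be sketched as a regex pass over string attribute values. This is a minimal illustration, not the service's actual redactor: the patterns, the replacement tokens, and the `redact` / `redact_attributes` helpers are all assumptions. The phone pattern assumes international format with a leading `+` so that ISO dates are not matched, and the national ID pattern is a placeholder because the real format is not specified here.

```python
import re

# Illustrative patterns only; the real redactor's rules are not given in this document.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Assumes phone numbers arrive in international format with a leading "+".
PHONE_RE = re.compile(r"\+\d[\d\s().-]{6,}\d")
# Placeholder: a hypothetical national ID shaped like "AB-1234567".
NATIONAL_ID_RE = re.compile(r"\b[A-Z]{2}-\d{7}\b")


def redact(value: str) -> str:
    """Replace every PII match in a single string with a fixed token."""
    value = EMAIL_RE.sub("[REDACTED_EMAIL]", value)
    value = NATIONAL_ID_RE.sub("[REDACTED_NATIONAL_ID]", value)
    value = PHONE_RE.sub("[REDACTED_PHONE]", value)
    return value


def redact_attributes(attrs: dict) -> dict:
    """Redact string values; leave numbers, booleans, and other types untouched."""
    return {k: redact(v) if isinstance(v, str) else v for k, v in attrs.items()}
```

Running the pass in the exporter path rather than at each call site keeps a single choke point, which is one way to make sure nothing leaves the process unredacted.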

2. Required span attributes

Span	Attributes
`http.server`	`http.method`, `http.route`, `http.status_code`, `melmastoon.tenant_id`, `melmastoon.user_id`, `melmastoon.idempotency_key?`
`db.postgres`	`db.system="postgresql"`, `db.statement.template`, `db.rows_affected`, `melmastoon.tenant_id`
`bigquery.query`	`bq.job_id`, `bq.bytes_processed`, `bq.bytes_billed`, `bq.cache_hit`, `bq.location`, `melmastoon.metric_id?`, `melmastoon.dashboard_id?`, `melmastoon.widget_id?`, `melmastoon.tenant_id`
`etl.run`	`melmastoon.etl.job_id`, `melmastoon.etl.run_id`, `melmastoon.etl.projection_id`, `melmastoon.etl.rows_in`, `melmastoon.etl.rows_out`, `melmastoon.etl.duration_ms`, `melmastoon.etl.bytes_billed`
`pubsub.publish` / `pubsub.consume`	`messaging.system="gcp-pubsub"`, `messaging.destination`, `messaging.message_id`, `melmastoon.event.subject`, `melmastoon.event.version`, `melmastoon.tenant_id`
`dq.check`	`melmastoon.dq.check_id`, `melmastoon.dq.severity`, `melmastoon.dq.passed`
`ai.invoke`	`melmastoon.ai.capability`, `melmastoon.ai.model_version`, `melmastoon.ai.tokens_in`, `melmastoon.ai.tokens_out`, `melmastoon.ai.cost_usd_micro`

Trace context propagates via the W3C traceparent header along gateway → BFF → analytics-service → BigQuery / Pub/Sub.
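For reference, a version-00 traceparent header has the shape `00-<32 hex trace-id>-<16 hex parent-id>-<2 hex flags>`. The sketch below parses an inbound header and derives the header for an outbound call. It only illustrates the wire format: `parse_traceparent` and `child_traceparent` are hypothetical helper names, and in practice the OpenTelemetry SDK's propagator would normally handle this.

```python
import re
import secrets
from typing import NamedTuple, Optional

# Version 00 of the W3C Trace Context header: lowercase hex fields separated by dashes.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceContext(NamedTuple):
    trace_id: str
    parent_id: str
    flags: str


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Return the parsed context, or None if the header is malformed."""
    match = TRACEPARENT_RE.match(header.strip())
    if not match:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero IDs are invalid per the specification.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return TraceContext(trace_id, parent_id, flags)


def child_traceparent(ctx: TraceContext) -> str:
    """Build the outbound header: same trace_id and flags, fresh span id."""
    span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    return f"00-{ctx.trace_id}-{span_id}-{ctx.flags}"
```

When the inbound header is missing or invalid, the usual choice is to start a new trace rather than reject the request.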

3. Structured logs

JSON to stdout, ingested by Cloud Logging + forwarded to SigNoz. Mandatory fields:

{
  "ts": "2026-04-22T08:13:01.214Z",
  "level": "info",
  "msg": "widget data served",
  "service": "analytics-service",
  "env": "prod",
  "region": "europe-west3",
  "trace_id": "...",
  "span_id": "...",
  "tenant_id": "tnt_01H...",
  "user_id": "usr_01H...",
  "correlation_id": "corr_...",
  "widget_id": "wid_01H...",
  "bytes_billed": 8388608,
  "cache_hit": true,
  "duration_ms": 142
}

Forbidden in logs: full SQL strings (templates only), guest names/emails, raw event payloads, JWTs, embed tokens.
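A minimal sketch of a log builder that enforces these rules before anything reaches stdout. The split between mandatory and contextual fields is an assumption (the example above mixes both), and the `FORBIDDEN_KEYS` names and the `build_log` helper are hypothetical. Key-based checks like this catch only the obvious cases, so they complement rather than replace the PII redactor.

```python
import json
from datetime import datetime, timezone

# Assumed mandatory set; widget_id, bytes_billed, etc. are treated as contextual.
MANDATORY = ("service", "env", "region", "trace_id", "span_id",
             "tenant_id", "correlation_id")
# Hypothetical field names that would carry forbidden content.
FORBIDDEN_KEYS = {"sql", "jwt", "embed_token", "guest_name", "guest_email", "payload"}


def build_log(level: str, msg: str, context: dict, **fields) -> str:
    """Return one JSON log line, or raise if the record breaks the logging rules."""
    ts = datetime.now(timezone.utc).isoformat(timespec="milliseconds")
    record = {"ts": ts.replace("+00:00", "Z"), "level": level, "msg": msg,
              **context, **fields}
    missing = [key for key in MANDATORY if key not in record]
    if missing:
        raise ValueError(f"missing mandatory log fields: {missing}")
    forbidden = FORBIDDEN_KEYS.intersection(record)
    if forbidden:
        raise ValueError(f"forbidden log fields: {sorted(forbidden)}")
    return json.dumps(record)
```

A caller would typically build `context` once per request (service, env, region, trace and tenant identifiers) and pass per-event values such as `widget_id` or `duration_ms` as keyword arguments, then `print` the returned line.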

4. SLIs / SLOs

SLI	Target (30 d)
Widget query availability (HTTP 5xx rate < 1 %)	99.9 %
Widget query latency p95 (cached)	≤ 500 ms
Widget query latency p95 (uncached, ≤ 1 GB)	≤ 4 s
Curated freshness (event landed → curated row visible) p95	≤ 15 min
Critical metric freshness (occupancy, RevPAR) p95	≤ 5 min
ETL job success rate	≥ 99.5 %
DQ critical alert MTTR	≤ 1 h
Forecast writeback success rate	≥ 99.9 %
Pub/Sub sink lag p95	≤ 60 s

Error budgets are exposed via Cloud Monitoring SLOs and dashboarded in SigNoz.
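To make the targets concrete, the sketch below turns a 30-day target into an error budget in two ways: as allowed downtime minutes (a time-based reading, which is an assumption here) and as the fraction of a request-based budget still unspent. The function names are illustrative.

```python
def error_budget_minutes(target: float, window_days: int = 30) -> float:
    """Allowed 'bad' minutes in the window for a time-based SLO."""
    return (1.0 - target) * window_days * 24 * 60


def budget_remaining(target: float, good: int, total: int) -> float:
    """Fraction of a request-based error budget still unspent (negative if overspent)."""
    if total <= 0:
        return 1.0  # no traffic, nothing spent
    allowed_bad = (1.0 - target) * total
    bad = total - good
    if allowed_bad == 0:
        return 1.0 if bad == 0 else float("-inf")
    return 1.0 - bad / allowed_bad


# A 99.9 % target over 30 days leaves roughly 43.2 minutes of budget.
widget_budget = error_budget_minutes(0.999)
# 50 failed requests out of 100,000 at 99.9 % spends about half the budget.
half_spent = budget_remaining(0.999, good=99_950, total=100_000)
```

The same arithmetic applies to the ETL success-rate target: at 99.5 %, about 5 failed runs per 1,000 are within budget.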

5. RED + USE metrics

RED on each route + on Pub/Sub consumers + on each ETL job:

http.server.request.duration_ms{route, method, status_code}
pubsub.consumer.duration_ms{subscription}
etl.run.duration_ms{job_id, projection_id}
etl.run.errors_total{job_id, reason}

USE on dependencies:

bigquery.bytes_billed_total{kind, tenant_id} — daily
bigquery.slot_ms_total{kind}
postgres.pool.active, postgres.pool.waiting
cache.hit_ratio{key_prefix}
pubsub.subscription.oldest_unacked_message_age_seconds

Custom domain metrics:

analytics.widget.query.bytes_billed{tenant_id, widget_id}
analytics.widget.query.cap_exceeded_total{tenant_id}
analytics.dq.failed_total{check_id, severity}
analytics.forecast.writeback.rows_total{model_id}
analytics.looker.token.issued_total{tenant_id}
analytics.budget.bytes_used_ratio{tenant_id} — gauge 0..1

6. Dashboards (Grafana / SigNoz)

Service health. RED per route, error budgets, Pub/Sub lag, pool usage.
Pipeline freshness. event landed → curated row latency by domain; ETL run timeline.
Query economics. Bytes billed and slot-ms per tenant per day; top-N expensive widgets; cache hit ratio.
Data quality. DQ pass/fail trend; open critical alerts; freshness deltas.
AI usage. Capability call counts, latency, token spend; off-switch state per tenant.
Tenant view. A drill-down: top queries, byte usage, dashboards last viewed, DQ alerts (used in support).

7. Alerts (PagerDuty)

Alert	Severity	Trigger	Runbook
WidgetQueryErrorRate	P1	`5xx > 1 %` for 5 min	`runbooks/analytics-widget-errors.md`
WidgetQueryLatencyP95	P2	uncached p95 > 8 s for 10 min	`runbooks/analytics-widget-latency.md`
CuratedFreshnessBreach	P2	freshness > 30 min for 10 min	`runbooks/analytics-freshness.md`
CriticalMetricStale	P1	occupancy/RevPAR freshness > 30 min	`runbooks/analytics-critical-metric.md`
ETLJobFailed	P2	any critical job fails twice in a row	`runbooks/analytics-etl.md`
DQCriticalAlert	P1	any `severity=critical` DQ result fails	`runbooks/analytics-dq.md`
BigQueryByteBudget80	P3	tenant ratio ≥ 0.8	`runbooks/analytics-budget.md`
BigQueryByteBudget100	P2	tenant ratio ≥ 1.0 (auto-pause snapshots)	`runbooks/analytics-budget.md`
ForecastWritebackFail	P2	writeback success < 99 % for 30 min	`runbooks/analytics-forecast.md`
PubSubSinkLag	P2	oldest unacked > 5 min	`runbooks/analytics-sink-lag.md`

All alerts attach trace exemplars and the dashboard panel link.
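The two byte-budget alerts in the table above are thresholds on the same ratio. A minimal sketch of that mapping, assuming the ratio is simply bytes used divided by the tenant's budget; the `budget_alert` helper and its return shape are illustrative, not the alerting system's actual rule syntax.

```python
from typing import Optional, Tuple


def budget_alert(bytes_used: int, bytes_budget: int) -> Optional[Tuple[str, str]]:
    """Map a tenant's byte usage to (alert name, severity), or None below 80 %."""
    if bytes_budget <= 0:
        raise ValueError("bytes_budget must be positive")
    ratio = bytes_used / bytes_budget
    # Check the higher threshold first so a tenant at 100 % gets only the P2 alert.
    if ratio >= 1.0:
        return ("BigQueryByteBudget100", "P2")
    if ratio >= 0.8:
        return ("BigQueryByteBudget80", "P3")
    return None
```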

8. Tracing rules of thumb

Always start a span at HTTP/Pub-Sub entry; close it after the response is flushed.
Wrap every BigQuery call in a span and record bytes_billed even on error.
Wrap every ETL step (extract, transform, load) as child spans of etl.run.
Carry correlation_id from inbound headers to outbound publishes.
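The "record bytes_billed even on error" rule is easiest to get right with a `finally` block. The sketch below uses a small stand-in `Span` class and `start_span` context manager instead of the real OpenTelemetry API so that it stays self-contained; `QueryFailed`, `traced_query`, and the `run_query` callable are hypothetical names, and the assumption is that a failed job can still report the bytes it billed.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Span:
    """Stand-in for a tracer span; not the OpenTelemetry class."""
    name: str
    attributes: dict = field(default_factory=dict)
    error: Optional[str] = None
    duration_ms: float = 0.0


FINISHED: List[Span] = []  # stand-in for an exporter


@contextmanager
def start_span(name: str, attributes: dict):
    span = Span(name, dict(attributes))
    started = time.monotonic()
    try:
        yield span
    except Exception as exc:
        span.error = type(exc).__name__  # mark the span as failed, then re-raise
        raise
    finally:
        span.duration_ms = (time.monotonic() - started) * 1000
        FINISHED.append(span)


class QueryFailed(Exception):
    """Hypothetical error that still reports what the failed job billed."""
    def __init__(self, message: str, bytes_billed: int = 0):
        super().__init__(message)
        self.bytes_billed = bytes_billed


def traced_query(run_query: Callable[[str], dict], sql_template: str, tenant_id: str) -> dict:
    with start_span("bigquery.query", {"melmastoon.tenant_id": tenant_id}) as span:
        bytes_billed = 0
        try:
            result = run_query(sql_template)
            bytes_billed = result["bytes_billed"]
            return result
        except QueryFailed as exc:
            bytes_billed = exc.bytes_billed
            raise
        finally:
            # Runs on success and on failure, so the cost is never lost.
            span.attributes["bq.bytes_billed"] = bytes_billed
```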

9. Synthetic / black-box checks

Every 60 s: GET /healthz per region.
Every 5 min: signed widget data probe with synthetic tenant tnt_synthetic_<region> returning a known fixture from BigQuery.
Every 15 min: ETL probe enqueues a synthetic event and asserts curated row appears within freshness SLO.
Every hour: Looker Studio embed token mint + headless verify.

A failed check pages on-call: a single breach pages at P2, and two breaches within 5 min page at P1.
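The paging rule can be sketched as a function evaluated each time a check fails. It assumes breaches are tracked per check as timestamps in seconds and that "two breaches in 5 min" means the current breach plus at least one earlier breach inside the window; `severity_on_breach` is an illustrative name.

```python
from typing import Sequence

WINDOW_SECONDS = 5 * 60


def severity_on_breach(previous_breaches: Sequence[float], now: float) -> str:
    """Severity to page at when a check fails at time `now` (seconds)."""
    recent = [t for t in previous_breaches if 0 <= now - t <= WINDOW_SECONDS]
    # The current failure plus any earlier one in the window makes two breaches.
    return "P1" if recent else "P2"
```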

10. Cost observability

BigQuery cost attribution labels (tenant_id, widget_id?, kind) on every query.
Daily cost-by-tenant report into analytics.bigquery_cost_daily (eat your own dog food).
Cost anomaly alert when daily spend deviates > 3σ from 30-day baseline.
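The 3σ rule above can be sketched as follows. The choice of sample standard deviation and the handling of a flat baseline are assumptions, as is the `is_cost_anomaly` name; the input would be the last 30 daily totals for a tenant.

```python
from statistics import mean, stdev
from typing import Sequence


def is_cost_anomaly(baseline: Sequence[float], today: float, sigmas: float = 3.0) -> bool:
    """True when today's spend deviates more than `sigmas` standard deviations
    from the mean of the baseline window."""
    if len(baseline) < 2:
        raise ValueError("need at least two baseline days")
    mu = mean(baseline)
    sd = stdev(baseline)  # sample standard deviation (an assumption)
    if sd == 0:
        # Flat baseline: any change at all counts as a deviation.
        return today != mu
    return abs(today - mu) > sigmas * sd
```

Because the test is two-sided, an unexpected drop in spend (for example, a stalled pipeline) triggers the alert as well as a spike.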

Cross-references: SECURITY_MODEL §6 audit, FAILURE_MODES, DEPLOYMENT_TOPOLOGY.
