OBSERVABILITY — analytics-service

Sibling: SECURITY_MODEL · FAILURE_MODES · platform anchor: docs/02 §13 Observability

OpenTelemetry → OTel Collector → SigNoz (traces, metrics, logs) and Cloud Monitoring (managed metrics + uptime + SLO budgets).


1. Service identity

| Resource attribute | Value |
|---|---|
| service.name | analytics-service |
| service.version | semver from CI (x.y.z) + git sha |
| deployment.environment | one of: dev, stg, prod |
| cloud.region | residency-bound region |
| melmastoon.role | one of: api, etl-worker, pubsub-sink, looker-broker |

Every span, log, and metric carries melmastoon.tenant_id, melmastoon.user_id?, melmastoon.correlation_id, and melmastoon.causation_id? (a trailing ? marks an optional attribute). The PII redactor strips email, phone, and nationalId patterns before export.
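
A minimal sketch of the redaction step, assuming a regex-based approach. The patterns, the placeholder strings, and the names `redact_value` and `redact_attributes` are illustrative assumptions, not the service's actual redactor; in particular, the nationalId rule here matches on the attribute key because no ID format is specified in this document.

```python
import re

# Illustrative patterns only; the real redactor's rules are not specified here.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Assumes phone numbers are written with a leading "+" (E.164-like).
PHONE_RE = re.compile(r"\+\d[\d\s().-]{6,}\d")
# Assumes national IDs are identified by their attribute key, not by value shape.
NATIONAL_ID_KEY_RE = re.compile(r"national_?id", re.IGNORECASE)


def redact_value(value: str) -> str:
    """Replace email- and phone-like substrings with fixed placeholders."""
    value = EMAIL_RE.sub("[REDACTED_EMAIL]", value)
    return PHONE_RE.sub("[REDACTED_PHONE]", value)


def redact_attributes(attrs: dict) -> dict:
    """Return a copy of attrs that is safe to export."""
    clean = {}
    for key, value in attrs.items():
        if NATIONAL_ID_KEY_RE.search(key):
            clean[key] = "[REDACTED]"          # drop the whole value
        elif isinstance(value, str):
            clean[key] = redact_value(value)   # scrub embedded PII
        else:
            clean[key] = value                 # numbers, booleans pass through
    return clean
```

Running the redactor as the last step before export, rather than at each call site, keeps the rule set in one place.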


2. Required span attributes

| Span | Attributes |
|---|---|
| http.server | http.method, http.route, http.status_code, melmastoon.tenant_id, melmastoon.user_id, melmastoon.idempotency_key? |
| db.postgres | db.system="postgresql", db.statement.template, db.rows_affected, melmastoon.tenant_id |
| bigquery.query | bq.job_id, bq.bytes_processed, bq.bytes_billed, bq.cache_hit, bq.location, melmastoon.metric_id?, melmastoon.dashboard_id?, melmastoon.widget_id?, melmastoon.tenant_id |
| etl.run | melmastoon.etl.job_id, melmastoon.etl.run_id, melmastoon.etl.projection_id, melmastoon.etl.rows_in, melmastoon.etl.rows_out, melmastoon.etl.duration_ms, melmastoon.etl.bytes_billed |
| pubsub.publish / pubsub.consume | messaging.system="gcp-pubsub", messaging.destination, messaging.message_id, melmastoon.event.subject, melmastoon.event.version, melmastoon.tenant_id |
| dq.check | melmastoon.dq.check_id, melmastoon.dq.severity, melmastoon.dq.passed |
| ai.invoke | melmastoon.ai.capability, melmastoon.ai.model_version, melmastoon.ai.tokens_in, melmastoon.ai.tokens_out, melmastoon.ai.cost_usd_micro |

Spans propagate W3C traceparent from gateway → BFF → analytics-service → BigQuery / Pub/Sub.
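
To illustrate what propagation means at each hop, here is a stdlib-only sketch that parses an inbound traceparent header and derives the header for an outbound call. It follows the version 00 layout of the W3C Trace Context format (version, 32-hex trace id, 16-hex parent id, 2-hex flags). The function names are hypothetical, and starting a new sampled trace when the header is missing or invalid is an assumption; in practice the OpenTelemetry SDK handles this.

```python
import re
import secrets

# version-traceid-parentid-flags, lowercase hex, as in W3C Trace Context v00.
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(header):
    """Return the header's fields as a dict, or None if it is invalid."""
    match = TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero ids are invalid per the specification.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {"version": version, "trace_id": trace_id,
            "parent_id": parent_id, "flags": flags}


def child_traceparent(inbound=None):
    """Build the outbound header: same trace id, fresh span id."""
    parsed = parse_traceparent(inbound) if inbound else None
    if parsed is None:
        # Assumption: no valid inbound context means we start a sampled trace.
        trace_id, flags = secrets.token_hex(16), "01"
    else:
        trace_id, flags = parsed["trace_id"], parsed["flags"]
    span_id = secrets.token_hex(8)  # this hop's span becomes the new parent id
    return f"00-{trace_id}-{span_id}-{flags}"
```

The trace id stays constant from gateway to BigQuery or Pub/Sub, while the parent id changes at every hop; that is what lets the backend reassemble one trace.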


3. Structured logs

Logs are written as JSON to stdout, ingested by Cloud Logging, and forwarded to SigNoz. Mandatory fields:

{
  "ts": "2026-04-22T08:13:01.214Z",
  "level": "info",
  "msg": "widget data served",
  "service": "analytics-service",
  "env": "prod",
  "region": "europe-west3",
  "trace_id": "...",
  "span_id": "...",
  "tenant_id": "tnt_01H...",
  "user_id": "usr_01H...",
  "correlation_id": "corr_...",
  "widget_id": "wid_01H...",
  "bytes_billed": 8388608,
  "cache_hit": true,
  "duration_ms": 142
}

Forbidden in logs: full SQL strings (templates only), guest names/emails, raw event payloads, JWTs, embed tokens.
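
One way to produce this shape with the Python standard library is a custom logging formatter. The sketch below is an assumption about how it could be done, not the service's logger: `JsonFormatter`, `build_logger`, the `fields` convention passed through `extra`, and the key names in `FORBIDDEN_FIELDS` are all hypothetical. Filtering by key name only catches fields that are labelled as sensitive; it does not inspect values.

```python
import json
import logging
import sys
from datetime import datetime, timezone

# Hypothetical key names standing in for the forbidden categories above.
FORBIDDEN_FIELDS = frozenset(
    {"sql", "jwt", "embed_token", "raw_payload", "guest_name", "guest_email"}
)


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def __init__(self, service, env, region):
        super().__init__()
        self.static = {"service": service, "env": env, "region": region}

    def format(self, record):
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "ts": ts.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": record.levelname.lower(),
            "msg": record.getMessage(),
            **self.static,
        }
        # Structured fields arrive via logger.info(..., extra={"fields": {...}}).
        for key, value in getattr(record, "fields", {}).items():
            if key not in FORBIDDEN_FIELDS:  # silently drop forbidden keys
                entry[key] = value
        return json.dumps(entry)


def build_logger(name="analytics-service"):
    """Attach the JSON formatter to a stdout handler."""
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter(name, "prod", "europe-west3"))
    logger = logging.getLogger(name)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


if __name__ == "__main__":
    build_logger().info(
        "widget data served",
        extra={"fields": {"tenant_id": "tnt_example", "duration_ms": 142}},
    )
```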


4. SLIs / SLOs

| SLI | Target (30 d) |
|---|---|
| Widget query availability (HTTP 5xx rate < 1 %) | 99.9 % |
| Widget query latency p95 (cached) | ≤ 500 ms |
| Widget query latency p95 (uncached, ≤ 1 GB) | ≤ 4 s |
| Curated freshness (event landed → curated row visible) p95 | ≤ 15 min |
| Critical metric freshness (occupancy, RevPAR) p95 | ≤ 5 min |
| ETL job success rate | ≥ 99.5 % |
| DQ critical alert MTTR | ≤ 1 h |
| Forecast writeback success rate | ≥ 99.9 % |
| Pub/Sub sink lag p95 | ≤ 60 s |

Error budgets are exposed via Cloud Monitoring SLOs and dashboarded in SigNoz.
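
As a worked example of what a budget means in practice, the sketch below converts an SLO target into allowed "bad" time over the 30-day window and into a remaining-budget fraction from event counts. Both helper names are hypothetical, and whether the service's SLOs are time-based or request-based is not stated here, so both views are shown as assumptions.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window (time-based view)."""
    return (1.0 - slo_target) * window_days * 24 * 60


def budget_remaining(slo_target: float, good: int, total: int) -> float:
    """Fraction of the error budget still unspent (request-based view).

    1.0 means untouched, 0.0 means exhausted, negative means overspent.
    """
    if total == 0:
        return 1.0  # no traffic, nothing spent
    allowed_bad = (1.0 - slo_target) * total
    bad = total - good
    return 1.0 - bad / allowed_bad


if __name__ == "__main__":
    # 99.9 % over 30 days leaves roughly 43.2 minutes of budget.
    print(round(error_budget_minutes(0.999), 1))
    # 500 failures out of 1,000,000 requests spends about half of it.
    print(round(budget_remaining(0.999, 999_500, 1_000_000), 3))
```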


5. RED + USE metrics

RED on each route + on Pub/Sub consumers + on each ETL job:

  • http.server.request.duration_ms{route, method, status_code}
  • pubsub.consumer.duration_ms{subscription}
  • etl.run.duration_ms{job_id, projection_id}
  • etl.run.errors_total{job_id, reason}

USE on dependencies:

  • bigquery.bytes_billed_total{kind, tenant_id} — daily
  • bigquery.slot_ms_total{kind}
  • postgres.pool.active, postgres.pool.waiting
  • cache.hit_ratio{key_prefix}
  • pubsub.subscription.oldest_unacked_message_age_seconds

Custom domain metrics:

  • analytics.widget.query.bytes_billed{tenant_id, widget_id}
  • analytics.widget.query.cap_exceeded_total{tenant_id}
  • analytics.dq.failed_total{check_id, severity}
  • analytics.forecast.writeback.rows_total{model_id}
  • analytics.looker.token.issued_total{tenant_id}
  • analytics.budget.bytes_used_ratio{tenant_id} — gauge 0..1
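
For readers less familiar with RED (rate, errors, duration), the following sketch derives the three numbers per route from raw request records. It is illustrative only: in the service these values come from the metrics backend, and the record shape, the `red_summary` name, and the nearest-rank p95 method are assumptions.

```python
from collections import defaultdict


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def red_summary(requests, window_s):
    """Group request records by (route, method) and compute RED figures.

    Each record is a dict with route, method, status_code, duration_ms.
    """
    grouped = defaultdict(list)
    for req in requests:
        grouped[(req["route"], req["method"])].append(req)

    summary = {}
    for key, rows in grouped.items():
        errors = sum(1 for r in rows if r["status_code"] >= 500)
        summary[key] = {
            "rate_per_s": len(rows) / window_s,               # Rate
            "error_ratio": errors / len(rows),                # Errors
            "p95_ms": p95([r["duration_ms"] for r in rows]),  # Duration
        }
    return summary
```

Only 5xx responses count as errors here, matching the availability SLI; 4xx responses are treated as client-side.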

6. Dashboards (Grafana / SigNoz)

  1. Service health. RED per route, error budgets, Pub/Sub lag, pool usage.
  2. Pipeline freshness. event landed → curated row latency by domain; ETL run timeline.
  3. Query economics. Bytes billed and slot-ms per tenant per day; top-N expensive widgets; cache hit ratio.
  4. Data quality. DQ pass/fail trend; open critical alerts; freshness deltas.
  5. AI usage. Capability call counts, latency, token spend; off-switch state per tenant.
  6. Tenant view. A drill-down: top queries, byte usage, dashboards last viewed, DQ alerts (used in support).

7. Alerts (PagerDuty)

| Alert | Severity | Trigger | Runbook |
|---|---|---|---|
| WidgetQueryErrorRate | P1 | 5xx > 1 % for 5 min | runbooks/analytics-widget-errors.md |
| WidgetQueryLatencyP95 | P2 | uncached p95 > 8 s for 10 min | runbooks/analytics-widget-latency.md |
| CuratedFreshnessBreach | P2 | freshness > 30 min for 10 min | runbooks/analytics-freshness.md |
| CriticalMetricStale | P1 | occupancy/RevPAR freshness > 30 min | runbooks/analytics-critical-metric.md |
| ETLJobFailed | P2 | any critical job fails twice in a row | runbooks/analytics-etl.md |
| DQCriticalAlert | P1 | any severity=critical DQ result fails | runbooks/analytics-dq.md |
| BigQueryByteBudget80 | P3 | tenant ratio ≥ 0.8 | runbooks/analytics-budget.md |
| BigQueryByteBudget100 | P2 | tenant ratio ≥ 1.0 (auto-pause snapshots) | runbooks/analytics-budget.md |
| ForecastWritebackFail | P2 | writeback success < 99 % for 30 min | runbooks/analytics-forecast.md |
| PubSubSinkLag | P2 | oldest unacked > 5 min | runbooks/analytics-sink-lag.md |

All alerts attach trace exemplars and the dashboard panel link.
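
Several triggers above have the form "condition for N min". A minimal sketch of that semantics, assuming evenly spaced samples (one per minute here) and a hypothetical helper name; the real rules are evaluated by the alerting backend, whose exact windowing may differ.

```python
def sustained_breach(values, threshold, for_samples):
    """True if the most recent `for_samples` values all exceed threshold.

    `values` is ordered oldest to newest, one sample per evaluation interval.
    With too little history the condition cannot have held for the full
    duration, so the alert stays quiet.
    """
    if for_samples <= 0 or len(values) < for_samples:
        return False
    return all(v > threshold for v in values[-for_samples:])


if __name__ == "__main__":
    # WidgetQueryErrorRate: 5xx ratio > 1 % for 5 min, sampled once a minute.
    error_ratio = [0.002, 0.02, 0.03, 0.02, 0.04, 0.05]
    print(sustained_breach(error_ratio, 0.01, 5))
```

Requiring every sample in the window to breach is what keeps a single spike from paging; one recovered sample resets the condition.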


8. Tracing rules of thumb

  • Always start a span at HTTP/Pub-Sub entry; close it after the response is flushed.
  • Wrap every BigQuery call in a span and record bytes_billed even on error.
  • Wrap every ETL step (extract, transform, load) as child spans of etl.run.
  • Carry correlation_id from inbound headers to outbound publishes.
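
The "record bytes_billed even on error" rule comes down to setting the attribute in a `finally` block. The sketch below shows the control flow with a stand-in span (a plain dict appended to a list) instead of the OpenTelemetry SDK; `bigquery_span`, `run_query`, and `FakeJob` are hypothetical names, and `FakeJob` only imitates a query job that knows its billed bytes whether or not it succeeded.

```python
import time
from contextlib import contextmanager


@contextmanager
def bigquery_span(name, sink, **attrs):
    """Stand-in span: collects attributes and is always exported to sink."""
    span = {"name": name, "attributes": dict(attrs), "status": "ok"}
    start = time.monotonic()
    try:
        yield span
    except Exception as exc:
        span["status"] = "error"
        span["attributes"]["error.type"] = type(exc).__name__
        raise  # never swallow the caller's exception
    finally:
        span["duration_ms"] = round((time.monotonic() - start) * 1000, 3)
        sink.append(span)  # exported on success and on failure


class FakeJob:
    """Hypothetical query job used only for this illustration."""

    def __init__(self, bytes_billed, fail=False):
        self.bytes_billed = bytes_billed
        self.fail = fail

    def result(self):
        if self.fail:
            raise RuntimeError("query failed")
        return [("row",)]


def run_query(job, sink, tenant_id):
    with bigquery_span("bigquery.query", sink,
                       **{"melmastoon.tenant_id": tenant_id}) as span:
        try:
            return job.result()
        finally:
            # Runs before the span closes, on success and on error alike.
            span["attributes"]["bq.bytes_billed"] = job.bytes_billed
```

A failed query can still be billed, so dropping the attribute on the error path would understate cost in exactly the cases worth investigating.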

9. Synthetic / black-box checks

  • Every 60 s: GET /healthz per region.
  • Every 5 min: signed widget data probe with synthetic tenant tnt_synthetic_<region> returning a known fixture from BigQuery.
  • Every 15 min: ETL probe enqueues a synthetic event and asserts curated row appears within freshness SLO.
  • Every hour: Looker Studio embed token mint + headless verify.

A failing check pages on-call: P2 after a single breach, P1 after two breaches within 5 min.


10. Cost observability

  • BigQuery cost attribution labels (tenant_id, widget_id?, kind) on every query.
  • Daily cost-by-tenant report into analytics.bigquery_cost_daily (eat your own dog food).
  • Cost anomaly alert when daily spend deviates > 3σ from 30-day baseline.
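
The 3σ rule can be sketched in a few lines. This version assumes the baseline is the previous 30 daily spend totals, uses the sample standard deviation, and flags deviations in either direction; those choices and the function name are assumptions, since the document only states the threshold.

```python
import statistics


def is_cost_anomaly(baseline, today, sigmas=3.0):
    """True if today's spend deviates more than `sigmas` standard deviations
    from the mean of the baseline (a list of prior daily totals)."""
    if len(baseline) < 2:
        return False  # not enough history to estimate spread
    mean = statistics.fmean(baseline)
    stdev = statistics.stdev(baseline)  # sample standard deviation
    if stdev == 0:
        return today != mean  # perfectly flat baseline: any change stands out
    return abs(today - mean) > sigmas * stdev


if __name__ == "__main__":
    history = [100.0 + (day % 5) for day in range(30)]  # synthetic daily spend
    print(is_cost_anomaly(history, 110.0))
    print(is_cost_anomaly(history, 103.0))
```

A plain mean and standard deviation are sensitive to past spikes inside the baseline window; a median-based variant would be more robust if that becomes a problem.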

Cross-references: SECURITY_MODEL §6 audit, FAILURE_MODES, DEPLOYMENT_TOPOLOGY.