OBSERVABILITY — analytics-service
Sibling: SECURITY_MODEL · FAILURE_MODES · platform anchor: docs/02 §13 Observability
OpenTelemetry → OTel Collector → SigNoz (traces, metrics, logs) and Cloud Monitoring (managed metrics + uptime + SLO budgets).
1. Service identity
| Resource attribute | Value |
|---|---|
| service.name | analytics-service |
| service.version | semver from CI (x.y.z) + git sha |
| deployment.environment | dev \| stg \| prod |
| cloud.region | residency-bound region |
| melmastoon.role | api \| etl-worker \| pubsub-sink \| looker-broker |
Every span/log/metric carries melmastoon.tenant_id, melmastoon.user_id?, melmastoon.correlation_id, melmastoon.causation_id?. The PII redactor strips email, phone, nationalId patterns before export.
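The redaction step can be pictured as a filter applied to attribute maps before export. The sketch below is illustrative only: the regular expressions, the `SENSITIVE_KEYS` set, and the function names are assumptions, not the service's actual redactor, and the phone pattern deliberately requires a leading `+` so that timestamps and numeric IDs are not swallowed.

```python
import re

# Hypothetical patterns; the real redactor's rules are not specified here.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Assumes international format with a leading "+" to avoid matching dates/IDs.
PHONE_RE = re.compile(r"\+\d[\d\s().-]{6,}\d")
# Assumed attribute names whose values are always dropped, whatever they hold.
SENSITIVE_KEYS = {"email", "phone", "nationalId"}


def redact_value(value):
    """Mask email- and phone-shaped substrings; pass non-strings through."""
    if not isinstance(value, str):
        return value
    value = EMAIL_RE.sub("[REDACTED_EMAIL]", value)
    return PHONE_RE.sub("[REDACTED_PHONE]", value)


def redact_attributes(attrs):
    """Return a copy of attrs that is safe to hand to an exporter."""
    return {
        key: "[REDACTED]" if key in SENSITIVE_KEYS else redact_value(val)
        for key, val in attrs.items()
    }
```

National ID formats differ by country, so this sketch drops them by key rather than by pattern; a pattern-based rule would need the concrete formats the platform supports.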
2. Required span attributes
| Span | Attributes |
|---|---|
| http.server | http.method, http.route, http.status_code, melmastoon.tenant_id, melmastoon.user_id, melmastoon.idempotency_key? |
| db.postgres | db.system="postgresql", db.statement.template, db.rows_affected, melmastoon.tenant_id |
| bigquery.query | bq.job_id, bq.bytes_processed, bq.bytes_billed, bq.cache_hit, bq.location, melmastoon.metric_id?, melmastoon.dashboard_id?, melmastoon.widget_id?, melmastoon.tenant_id |
| etl.run | melmastoon.etl.job_id, melmastoon.etl.run_id, melmastoon.etl.projection_id, melmastoon.etl.rows_in, melmastoon.etl.rows_out, melmastoon.etl.duration_ms, melmastoon.etl.bytes_billed |
| pubsub.publish / pubsub.consume | messaging.system="gcp-pubsub", messaging.destination, messaging.message_id, melmastoon.event.subject, melmastoon.event.version, melmastoon.tenant_id |
| dq.check | melmastoon.dq.check_id, melmastoon.dq.severity, melmastoon.dq.passed |
| ai.invoke | melmastoon.ai.capability, melmastoon.ai.model_version, melmastoon.ai.tokens_in, melmastoon.ai.tokens_out, melmastoon.ai.cost_usd_micro |
Spans propagate W3C traceparent from gateway → BFF → analytics-service → BigQuery / Pub/Sub.
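In practice the OpenTelemetry SDK propagators handle this header, but it helps to see what travels on the wire. A W3C `traceparent` value has the form `version-traceid-parentid-flags` in lowercase hex. The sketch below parses version `00` only and derives the header an outbound call would carry; the function names are hypothetical, and sampling a freshly started trace (flags `01`) is an assumption.

```python
import re
import secrets

# Version 00 layout: 2 hex version, 32 hex trace-id, 16 hex parent-id, 2 hex flags.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header):
    """Return the trace context as a dict, or None if the header is invalid."""
    match = TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero IDs are explicitly invalid in the W3C Trace Context spec.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {"trace_id": trace_id, "parent_id": parent_id, "flags": flags}


def child_traceparent(inbound_header):
    """Build the outbound header: same trace, new span id as the parent-id."""
    ctx = parse_traceparent(inbound_header or "")
    if ctx is None:
        # Assumption: an invalid or missing header starts a new sampled trace.
        ctx = {"trace_id": secrets.token_hex(16), "flags": "01"}
    span_id = secrets.token_hex(8)  # 16 hex characters
    return f"00-{ctx['trace_id']}-{span_id}-{ctx['flags']}"
```

The trace-id stays constant across gateway, BFF, and analytics-service, which is what lets a single trace be stitched together downstream.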
3. Structured logs
JSON to stdout, ingested by Cloud Logging + forwarded to SigNoz. Mandatory fields:
{
"ts": "2026-04-22T08:13:01.214Z",
"level": "info",
"msg": "widget data served",
"service": "analytics-service",
"env": "prod",
"region": "europe-west3",
"trace_id": "...",
"span_id": "...",
"tenant_id": "tnt_01H...",
"user_id": "usr_01H...",
"correlation_id": "corr_...",
"widget_id": "wid_01H...",
"bytes_billed": 8388608,
"cache_hit": true,
"duration_ms": 142
}
Forbidden in logs: full SQL strings (templates only), guest names/emails, raw event payloads, JWTs, embed tokens.
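One way to enforce both the mandatory fields and the forbidden list is to route every log call through a single helper. The following is a minimal sketch, not the service's logger: the `REQUIRED_CONTEXT` tuple is an assumed subset of the example above (fields such as `widget_id` look request-specific), and the `FORBIDDEN_KEYS` names are hypothetical stand-ins for the forbidden categories.

```python
import json
import sys
from datetime import datetime, timezone

# Assumed always-present context fields, taken from the example record.
REQUIRED_CONTEXT = ("service", "env", "region", "trace_id", "span_id",
                    "tenant_id", "correlation_id")
# Hypothetical key names for content that must never reach the log stream.
FORBIDDEN_KEYS = {"sql", "jwt", "embed_token", "event_payload",
                  "guest_name", "guest_email"}


def log_line(level, msg, context, now=None):
    """Serialize one structured log record, rejecting bad context early."""
    missing = [key for key in REQUIRED_CONTEXT if key not in context]
    if missing:
        raise ValueError(f"missing mandatory log fields: {missing}")
    leaked = FORBIDDEN_KEYS.intersection(context)
    if leaked:
        raise ValueError(f"forbidden log fields: {sorted(leaked)}")
    moment = now or datetime.now(timezone.utc)
    ts = moment.isoformat(timespec="milliseconds").replace("+00:00", "Z")
    return json.dumps({"ts": ts, "level": level, "msg": msg, **context})


def emit(level, msg, context):
    """Write the record as a single line of JSON to stdout."""
    sys.stdout.write(log_line(level, msg, context) + "\n")
```

Key-based checks catch only the obvious leaks; values still need to pass through the PII redactor before export.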
4. SLIs / SLOs
| SLI | Target (30 d) |
|---|---|
| Widget query availability (HTTP 5xx rate < 1 %) | 99.9 % |
| Widget query latency p95 (cached) | ≤ 500 ms |
| Widget query latency p95 (uncached, ≤ 1 GB) | ≤ 4 s |
| Curated freshness (event landed → curated row visible) p95 | ≤ 15 min |
| Critical metric freshness (occupancy, RevPAR) p95 | ≤ 5 min |
| ETL job success rate | ≥ 99.5 % |
| DQ critical alert MTTR | ≤ 1 h |
| Forecast writeback success rate | ≥ 99.9 % |
| Pub/Sub sink lag p95 | ≤ 60 s |
Budgets exposed via Cloud Monitoring SLO + dashboarded in SigNoz.
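For orientation, an error budget is the share of allowed failures that has not yet been spent. A 99.9 % target over 30 days allows roughly 43.2 minutes of full unavailability. The sketch below shows the arithmetic only; Cloud Monitoring computes the real budgets, and the function names here are hypothetical.

```python
def error_budget_remaining(good_events, total_events, target):
    """Fraction of the error budget still unspent (negative when overspent).

    target is the SLO as a fraction, e.g. 0.999 for 99.9 %.
    """
    if total_events == 0:
        return 1.0  # no traffic, nothing spent
    allowed_bad = (1.0 - target) * total_events
    actual_bad = total_events - good_events
    return 1.0 - actual_bad / allowed_bad


def full_outage_minutes(target, window_days=30):
    """Minutes of total unavailability the target tolerates per window."""
    return (1.0 - target) * window_days * 24 * 60
```

For example, 500 failed widget queries out of 1,000,000 against a 99.9 % target spends half of the budget, so `error_budget_remaining(999_500, 1_000_000, 0.999)` is approximately 0.5.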
5. RED + USE metrics
RED on each route + on Pub/Sub consumers + on each ETL job:
- http.server.request.duration_ms{route, method, status_code}
- pubsub.consumer.duration_ms{subscription}
- etl.run.duration_ms{job_id, projection_id}
- etl.run.errors_total{job_id, reason}
USE on dependencies:
- bigquery.bytes_billed_total{kind, tenant_id} (daily)
- bigquery.slot_ms_total{kind}
- postgres.pool.active, postgres.pool.waiting
- cache.hit_ratio{key_prefix}
- pubsub.subscription.oldest_unacked_message_age_seconds
Custom domain metrics:
- analytics.widget.query.bytes_billed{tenant_id, widget_id}
- analytics.widget.query.cap_exceeded_total{tenant_id}
- analytics.dq.failed_total{check_id, severity}
- analytics.forecast.writeback.rows_total{model_id}
- analytics.looker.token.issued_total{tenant_id}
- analytics.budget.bytes_used_ratio{tenant_id} (gauge, 0..1)
6. Dashboards (Grafana / SigNoz)
- Service health. RED per route, error budgets, Pub/Sub lag, pool usage.
- Pipeline freshness. event landed → curated row latency by domain; ETL run timeline.
- Query economics. Bytes billed and slot-ms per tenant per day; top-N expensive widgets; cache hit ratio.
- Data quality. DQ pass/fail trend; open critical alerts; freshness deltas.
- AI usage. Capability call counts, latency, token spend; off-switch state per tenant.
- Tenant view. A drill-down: top queries, byte usage, dashboards last viewed, DQ alerts (used in support).
7. Alerts (PagerDuty)
| Alert | Severity | Trigger | Runbook |
|---|---|---|---|
| WidgetQueryErrorRate | P1 | 5xx > 1 % for 5 min | runbooks/analytics-widget-errors.md |
| WidgetQueryLatencyP95 | P2 | uncached p95 > 8 s for 10 min | runbooks/analytics-widget-latency.md |
| CuratedFreshnessBreach | P2 | freshness > 30 min for 10 min | runbooks/analytics-freshness.md |
| CriticalMetricStale | P1 | occupancy/RevPAR freshness > 30 min | runbooks/analytics-critical-metric.md |
| ETLJobFailed | P2 | any critical job fails twice in a row | runbooks/analytics-etl.md |
| DQCriticalAlert | P1 | any severity=critical DQ result fails | runbooks/analytics-dq.md |
| BigQueryByteBudget80 | P3 | tenant ratio ≥ 0.8 | runbooks/analytics-budget.md |
| BigQueryByteBudget100 | P2 | tenant ratio ≥ 1.0 (auto-pause snapshots) | runbooks/analytics-budget.md |
| ForecastWritebackFail | P2 | writeback success < 99 % for 30 min | runbooks/analytics-forecast.md |
| PubSubSinkLag | P2 | oldest unacked > 5 min | runbooks/analytics-sink-lag.md |
All alerts attach trace exemplars and the dashboard panel link.
8. Tracing rules of thumb
- Always start a span at HTTP/Pub-Sub entry; close it after the response is flushed.
- Wrap every BigQuery call in a span and record bytes_billed even on error.
- Wrap every ETL step (extract, transform, load) as child spans of etl.run.
- Carry correlation_id from inbound headers to outbound publishes.
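The rule about recording bytes_billed on error is easiest to honor with a `try`/`finally` around the query call. The sketch below uses a plain dict as a stand-in for an OpenTelemetry span so it stays self-contained; `QueryFailed`, `traced_query`, and the `run_query` callable are hypothetical, and it assumes the failing job still reports how many bytes it billed.

```python
import time


class QueryFailed(Exception):
    """Hypothetical error type that still carries billing statistics."""

    def __init__(self, message, bytes_billed):
        super().__init__(message)
        self.bytes_billed = bytes_billed


def traced_query(run_query, sql_template, tenant_id, exported_spans):
    """Run a query inside a span that is always exported, success or not."""
    span = {
        "name": "bigquery.query",
        "status": "ok",
        "attributes": {"melmastoon.tenant_id": tenant_id, "bq.bytes_billed": 0},
    }
    start = time.monotonic()
    try:
        result = run_query(sql_template)
        span["attributes"]["bq.bytes_billed"] = result["bytes_billed"]
        return result
    except QueryFailed as exc:
        # The job failed but may still have been billed; keep the number.
        span["status"] = "error"
        span["attributes"]["bq.bytes_billed"] = exc.bytes_billed
        raise
    except Exception:
        span["status"] = "error"
        raise
    finally:
        # Runs on every path, so the span is closed and exported exactly once.
        span["duration_ms"] = round((time.monotonic() - start) * 1000)
        exported_spans.append(span)
```

With the real SDK the same shape applies: set the attribute inside the exception handler before re-raising, and let the span's context manager end it.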
9. Synthetic / black-box checks
- Every 60 s: GET /healthz per region.
- Every 5 min: signed widget data probe with synthetic tenant tnt_synthetic_<region> returning a known fixture from BigQuery.
- Every 15 min: ETL probe enqueues a synthetic event and asserts curated row appears within freshness SLO.
- Every hour: Looker Studio embed token mint + headless verify.
Check failures page on-call after one breach (P2) or two breaches in 5 min (P1).
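The escalation rule can be read as a small decision over recent breach timestamps. The sketch below encodes one plausible reading (a single breach in the last 5 minutes pages at P2, two or more page at P1); the function name and the exact windowing are assumptions rather than the configured PagerDuty policy.

```python
from datetime import datetime, timedelta


def page_severity(breach_times, now, window=timedelta(minutes=5)):
    """Return "P1", "P2", or None for a synthetic check at time `now`."""
    recent = [t for t in breach_times if now - window <= t <= now]
    if len(recent) >= 2:
        return "P1"  # repeated failure inside the window
    if len(recent) == 1:
        return "P2"  # first breach
    return None      # nothing recent, no page
```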
10. Cost observability
- BigQuery cost attribution labels (tenant_id, widget_id?, kind) on every query.
- Daily cost-by-tenant report into analytics.bigquery_cost_daily (eat your own dog food).
- Cost anomaly alert when daily spend deviates > 3σ from 30-day baseline.
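The 3σ rule compares today's spend with the mean and standard deviation of the baseline window. The following is a minimal sketch of that check, assuming the baseline is a list of daily totals read from the cost report; the function name and the handling of a flat baseline are illustrative choices.

```python
from statistics import mean, stdev


def is_cost_anomaly(baseline_daily_spend, today_spend, sigmas=3.0):
    """True when today's spend is more than `sigmas` deviations from the mean."""
    if len(baseline_daily_spend) < 2:
        return False  # not enough history to estimate spread
    mu = mean(baseline_daily_spend)
    sd = stdev(baseline_daily_spend)  # sample standard deviation
    if sd == 0:
        # Flat baseline: any change at all is a deviation (assumed policy).
        return today_spend != mu
    return abs(today_spend - mu) > sigmas * sd
```

The check is two-sided, so an unexpected drop in spend (for example, a stalled ETL job) is flagged as well as a spike.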
Cross-references: SECURITY_MODEL §6 audit, FAILURE_MODES, DEPLOYMENT_TOPOLOGY.