OBSERVABILITY — reporting-service

Sibling: APPLICATION_LOGIC · FAILURE_MODES · platform anchor: docs/02 §10 Observability

Telemetry stack: OpenTelemetry SDK → OTel Collector (sidecar in dev, sidecar+gateway in prod) → SigNoz (traces, logs) and Cloud Monitoring (metrics, alerts). All emissions follow the naming conventions.


1. Service identity in telemetry

service.name = reporting-service
service.namespace = melmastoon
service.version = $GIT_SHA
deployment.environment = dev | staging | prod
cloud.provider = gcp
cloud.region = $GCP_REGION
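
As a minimal sketch, the identity attributes above could be assembled once at startup and handed to the OpenTelemetry SDK's resource configuration. The helper name `buildResourceAttributes` and the `DEPLOY_ENV` variable name are assumptions made for illustration; `GIT_SHA` and `GCP_REGION` are taken from the list above.

```typescript
type DeploymentEnvironment = "dev" | "staging" | "prod";
type Env = Record<string, string | undefined>;

const DEPLOYMENT_ENVIRONMENTS: readonly DeploymentEnvironment[] = ["dev", "staging", "prod"];

// Hypothetical helper: builds the service identity attributes from environment variables.
function buildResourceAttributes(env: Env): Record<string, string> {
  const required = (key: string): string => {
    const value = env[key];
    if (value === undefined || value === "") {
      throw new Error(`missing required environment variable ${key}`);
    }
    return value;
  };

  // DEPLOY_ENV is an assumed variable name; the document only lists the allowed values.
  const deploymentEnv = required("DEPLOY_ENV");
  if (DEPLOYMENT_ENVIRONMENTS.indexOf(deploymentEnv as DeploymentEnvironment) === -1) {
    throw new Error(`unsupported deployment.environment: ${deploymentEnv}`);
  }

  return {
    "service.name": "reporting-service",
    "service.namespace": "melmastoon",
    "service.version": required("GIT_SHA"),
    "deployment.environment": deploymentEnv,
    "cloud.provider": "gcp",
    "cloud.region": required("GCP_REGION"),
  };
}
```

Failing fast on a missing or unexpected value keeps telemetry from being exported under an incomplete identity.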

2. Required span attributes

Every span we emit:

| Attribute | Source |
|---|---|
| melmastoon.tenant_id | Request context |
| melmastoon.correlation_id | Inbound header / event envelope |
| melmastoon.causation_id | Originating event id when applicable |
| melmastoon.user_id | JWT subject (omitted on machine paths) |
| melmastoon.report_id, melmastoon.run_id, melmastoon.template_id, melmastoon.template_version_id | Domain ids attached at the use case boundary |
| messaging.system = google_pubsub (publish/consume spans) | OTel Pub/Sub instrumentation |
| db.system = postgresql, db.statement (sanitized) | Drizzle wrapper |
| genai.system, genai.model, genai.cost_usd_micros | AIClient wrapper |

PII is forbidden in span attributes; the logger redactor strips email/phone patterns before export.
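
To illustrate the redaction step, here is a minimal sketch of a pattern-based scrubber applied to attribute values before export. The function name, the placeholder strings, and both regular expressions are assumptions; the real redactor's rules are not shown in this document.

```typescript
type AttributeValue = string | number | boolean;

// Illustrative patterns only. The phone pattern is deliberately broad and will also
// match other long digit runs (for example dates), so a real redactor would need tuning.
const EMAIL_PATTERN = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;
const PHONE_PATTERN = /\+?\d[\d\s().-]{7,}\d/g;

// Hypothetical scrubber: returns a copy with email/phone-like substrings masked.
function redactAttributes(attrs: Record<string, AttributeValue>): Record<string, AttributeValue> {
  const redacted: Record<string, AttributeValue> = {};
  for (const key of Object.keys(attrs)) {
    const value = attrs[key];
    redacted[key] =
      typeof value === "string"
        ? value.replace(EMAIL_PATTERN, "[REDACTED_EMAIL]").replace(PHONE_PATTERN, "[REDACTED_PHONE]")
        : value; // numbers and booleans pass through unchanged
  }
  return redacted;
}
```

Pattern matching of this kind is a safety net; the primary control remains not attaching PII to spans in the first place.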


3. Structured logs

JSON Lines on stdout, severity ∈ {DEBUG, INFO, WARN, ERROR}. Required fields:

{
  "ts": "2026-04-22T10:00:00.123Z",
  "severity": "INFO",
  "service": "reporting-service",
  "version": "abc1234",
  "env": "prod",
  "tenant_id": "tnt_…",
  "correlation_id": "cor_…",
  "causation_id": "cor_…",
  "trace_id": "…",
  "span_id": "…",
  "event": "report.run.completed",
  "report_id": "rep_…",
  "run_id": "run_…",
  "duration_ms": 1834
}

Loggers attach trace context automatically; manual console.log is forbidden by an ESLint rule.
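
A minimal sketch of a formatter that produces one JSON Lines record with the required fields might look like the following. The names `formatLogLine` and `LogContext` are hypothetical, and treating `causation_id` as optional is an assumption that mirrors the "when applicable" wording in section 2.

```typescript
type Severity = "DEBUG" | "INFO" | "WARN" | "ERROR";

interface LogContext {
  version: string;
  env: string;
  tenant_id: string;
  correlation_id: string;
  causation_id?: string; // assumed optional
  trace_id: string;
  span_id: string;
}

// Hypothetical formatter: returns a single-line JSON record suitable for stdout.
function formatLogLine(
  severity: Severity,
  event: string,
  ctx: LogContext,
  fields: Record<string, string | number | boolean> = {},
  now: Date = new Date(),
): string {
  const record = {
    ...fields, // caller-supplied fields go first so required fields below always win
    ts: now.toISOString(),
    severity,
    service: "reporting-service",
    ...ctx,
    event,
  };
  return JSON.stringify(record);
}
```

Spreading the caller's fields first means a stray `severity` or `ts` key in domain data cannot overwrite the required values.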


4. SLIs and SLOs

| SLI | Definition | Target (rolling 30 d) |
|---|---|---|
| Run success rate | runs with status='completed' ÷ runs with terminal status | ≥ 99.0 % |
| Run latency p95 (operational) | completed_at - queued_at for ad-hoc runs of category operational | ≤ 8 s |
| Run latency p95 (regulatory) | same, category regulatory, ≤ 50k rows | ≤ 30 s |
| Schedule fire timeliness | drift from cron expectation | p95 ≤ 60 s |
| Regulatory submission success rate | succeeded ÷ all terminal | ≥ 99.5 % monthly |
| API availability | non-5xx ÷ all on /api/v1/reports/* | ≥ 99.9 % |
| Pub/Sub publish success | published outbox rows ÷ attempted | ≥ 99.99 % |
| AI step skipped rate | report.ai_skipped.v1 ÷ runs eligible for AI | informational, alert if > 10 % |

Error budget burn alerts: 1 h fast burn (14 d budget) and 6 h slow burn per platform standard.
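
Burn alerts compare how quickly the error budget is being spent against a threshold. The arithmetic can be sketched as below; the function names are hypothetical, and the actual thresholds belong to the platform standard, which is not reproduced here.

```typescript
// Burn rate = observed error ratio divided by the error budget ratio (1 - SLO target).
// A burn rate of 1 spends the budget exactly over the SLO period.
function burnRate(badEvents: number, totalEvents: number, sloTarget: number): number {
  if (sloTarget <= 0 || sloTarget >= 1) {
    throw new RangeError("sloTarget must be strictly between 0 and 1");
  }
  if (totalEvents <= 0) return 0; // no traffic in the window, nothing burned
  return badEvents / totalEvents / (1 - sloTarget);
}

// Fraction of the whole period's budget consumed if `rate` is sustained for `windowHours`.
function budgetConsumed(rate: number, windowHours: number, periodHours: number): number {
  return (rate * windowHours) / periodHours;
}
```

For example, with the 99.0 % run success target, 50 failed runs out of 1000 in one hour give a burn rate of roughly 5; sustained for that hour of a 30-day (720 h) period, this consumes about 0.7 % of the budget.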


5. RED + USE metrics

| Metric (Cloud Monitoring + Prometheus exposition) | Type | Labels |
|---|---|---|
| reporting_runs_total | counter | tenant_id, category, status, format |
| reporting_run_duration_seconds | histogram | tenant_id, category, format |
| reporting_run_rows | histogram | tenant_id, category |
| reporting_artifact_bytes | histogram | tenant_id, format |
| reporting_render_step_seconds | histogram | step ∈ {query, compose, render, persist, deliver} |
| reporting_schedule_drift_seconds | gauge | tenant_id |
| reporting_regulatory_submissions_total | counter | jurisdiction_code, status |
| reporting_regulatory_submission_attempts | histogram | jurisdiction_code |
| reporting_outbox_lag_seconds | gauge | (per pod) |
| reporting_inbox_dedupe_hits_total | counter | subject |
| reporting_ai_skipped_total | counter | reason ∈ {timeout, budget, residency, low_confidence} |
| reporting_ai_cost_usd_micros_total | counter | tenant_id, capability |

USE for the worker pool: pod CPU, memory, render saturation (Puppeteer page count vs cap), GCS upload bandwidth, BigQuery slot utilization (read).
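
To make the counter rows concrete, the sketch below shows a dependency-free labeled counter rendered in the Prometheus text exposition format. The class `LabeledCounter` is hypothetical and stands in for whichever metrics client the service actually uses.

```typescript
// Escapes a label value per the Prometheus text format (backslash, double quote, newline).
function escapeLabelValue(value: string): string {
  return value.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");
}

// Hypothetical in-memory counter keyed by a fixed, ordered set of label names.
class LabeledCounter {
  private readonly series: Record<string, number> = {};

  constructor(
    private readonly metricName: string,
    private readonly labelNames: readonly string[],
  ) {}

  inc(labels: Record<string, string>, by: number = 1): void {
    if (by < 0) throw new RangeError("counters may only increase");
    const key = this.labelNames
      .map((labelName) => {
        const labelValue = labels[labelName];
        if (labelValue === undefined) throw new Error(`missing label ${labelName}`);
        return `${labelName}="${escapeLabelValue(labelValue)}"`;
      })
      .join(",");
    this.series[key] = (this.series[key] || 0) + by;
  }

  // Renders a TYPE line followed by one sample line per label combination.
  expose(): string {
    const lines = [`# TYPE ${this.metricName} counter`];
    for (const key of Object.keys(this.series)) {
      lines.push(`${this.metricName}{${key}} ${this.series[key]}`);
    }
    return lines.join("\n") + "\n";
  }
}
```

A counter for `reporting_runs_total` would be constructed with the label names `tenant_id, category, status, format` and incremented once per run that reaches a terminal status.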


6. Dashboards

The platform Grafana stack includes Reporting Overview, Reporting Per Tenant, and Reporting Regulatory dashboards. Panels:

  • Run throughput by category & status.
  • Run latency p50/p95/p99 by category.
  • Top 10 tenants by run count and by AI cost.
  • Schedule drift heatmap by hour.
  • Regulatory submission queue depth and success rate by jurisdiction.
  • Outbox lag and Pub/Sub publish errors.
  • Renderer error breakdown (Puppeteer crash, OOM, template invariant).

7. Alerts

| Alert | Condition | Severity | Runbook |
|---|---|---|---|
| RunFailureBurst | rate(reporting_runs_total{status="failed"}[5m]) / rate(reporting_runs_total[5m]) > 0.05 for 10 m | P2 | runbooks/reporting/run-failure-burst.md |
| RegulatorySubmissionMissed | any submission with status='pending' and next_attempt_at < now() - 6h | P1 | runbooks/reporting/regulatory-missed.md |
| OutboxLag | max_over_time(reporting_outbox_lag_seconds[10m]) > 60 | P2 | runbooks/platform/outbox-lag.md |
| ScheduleDriftHigh | quantile(0.95, reporting_schedule_drift_seconds) > 120 for 30 m | P3 | runbooks/reporting/schedule-drift.md |
| AIBudgetExhausted | increase(reporting_ai_skipped_total{reason="budget"}[1h]) > 0 | P3 (notify tenant.owner once / day) | n/a |
| PuppeteerCrashLoop | increase(reporting_render_step_seconds_count{step="render",status="error"}[10m]) > 20 | P1 | runbooks/reporting/puppeteer-crash.md |
| BigQueryTimeoutSpike | upstream analytics-service query timeouts > 1 % for 15 m | P2 | shared with analytics |
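
Several conditions combine a threshold with a "for" duration, meaning the condition must hold continuously before the alert fires. The sketch below models that behaviour for RunFailureBurst over pre-computed samples; the `RatioSample` shape and the function name are assumptions for illustration, and the real evaluation happens in the alerting backend.

```typescript
interface RatioSample {
  minute: number; // evaluation time, in minutes since an arbitrary start (ascending)
  failed: number; // failed-run rate over the trailing window
  total: number;  // all-run rate over the trailing window
}

// Returns true once failed/total has stayed above `threshold` for `forMinutes` without a gap.
function failureBurstFires(samples: RatioSample[], threshold: number, forMinutes: number): boolean {
  let breachStart: number | null = null;
  for (const sample of samples) {
    const ratio = sample.total > 0 ? sample.failed / sample.total : 0;
    if (ratio > threshold) {
      if (breachStart === null) breachStart = sample.minute;
      if (sample.minute - breachStart >= forMinutes) return true;
    } else {
      breachStart = null; // any healthy sample resets the pending period
    }
  }
  return false;
}
```

A single healthy evaluation inside the window resets the clock, which is why short spikes do not page.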

8. Tracing & replay

  • Each event published carries traceparent in attributes; consumers continue the trace.
  • reporting.report.completed.v1 consumers (notification, audit) get linked spans for root-cause investigation.
  • A canary synthetic test runs every 5 min: it requests a tiny operational report against a synthetic tenant property and asserts that the run reaches terminal status completed within 15 s. Failures page the on-call.

Synthetic check definition lives at services/reporting-service/synthetics/canary-run.yaml.
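
For the trace propagation described in the first bullet of this section, the `traceparent` value follows the W3C Trace Context format (`version-traceid-parentid-flags`). A minimal sketch of formatting and parsing it is shown here; the function and type names are hypothetical, only version `00` is handled, and in practice the OTel propagator would do this work.

```typescript
interface TraceContext {
  traceId: string; // 32 lowercase hex characters
  spanId: string;  // 16 lowercase hex characters
  sampled: boolean;
}

const TRACEPARENT_PATTERN = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Parses a version-00 traceparent value; returns null for malformed or all-zero ids.
function parseTraceparent(header: string): TraceContext | null {
  const match = TRACEPARENT_PATTERN.exec(header.trim());
  if (match === null) return null;
  const traceId = match[1];
  const spanId = match[2];
  const flags = match[3];
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}

// Formats the value a publisher would place in the event's attributes.
function formatTraceparent(ctx: TraceContext): string {
  return `00-${ctx.traceId}-${ctx.spanId}-${ctx.sampled ? "01" : "00"}`;
}
```

A consumer that gets `null` back would typically start a new trace rather than fail the message.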


9. On-call playbook entry points

Reporting-specific alerts reference a runbook in docs/runbooks/reporting/:

  • run-failure-burst.md — triage by template/version, check upstream analytics-service health, drain DLQ if needed.
  • regulatory-missed.md — open the submission, choose retry-now or manually-resolved with note, page legal liaison if past statutory cutoff.
  • puppeteer-crash.md — pin Chromium version, scale workers vertically, fall back to single-process mode.
  • schedule-drift.md — check Cloud Scheduler quota, worker pool saturation, regional incidents.

Cross-references: FAILURE_MODES, DEPLOYMENT_TOPOLOGY, SERVICE_READINESS.