OBSERVABILITY — reporting-service

Sibling: APPLICATION_LOGIC · FAILURE_MODES · platform anchor: docs/02 §10 Observability

Telemetry stack: OpenTelemetry SDK → OTel Collector (sidecar in dev, sidecar+gateway in prod) → SigNoz (traces, logs) and Cloud Monitoring (metrics, alerts). All emitted telemetry follows the platform naming conventions.

1. Service identity in telemetry

service.name           = reporting-service
service.namespace      = melmastoon
service.version        = $GIT_SHA
deployment.environment = dev | staging | prod
cloud.provider         = gcp
cloud.region           = $GCP_REGION
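
The identity attributes above could be assembled once at startup and handed to the OpenTelemetry SDK as resource attributes. The following is a minimal sketch, not the service's actual bootstrap code: the function name `buildResourceAttributes` and the `DEPLOY_ENV` variable name are assumptions made for illustration, and the environment is passed in as a plain object so the sketch stays free of runtime dependencies.

```typescript
// Sketch only: buildResourceAttributes and DEPLOY_ENV are hypothetical names.
const ENVIRONMENTS = ["dev", "staging", "prod"] as const;
type DeploymentEnvironment = (typeof ENVIRONMENTS)[number];

interface ResourceEnv {
  GIT_SHA?: string;
  GCP_REGION?: string;
  DEPLOY_ENV?: string;
}

// Type guard: accepts only the three environments listed in this section.
function isDeploymentEnvironment(value: string | undefined): value is DeploymentEnvironment {
  return value !== undefined && (ENVIRONMENTS as readonly string[]).indexOf(value) !== -1;
}

function buildResourceAttributes(env: ResourceEnv): Record<string, string> {
  if (!isDeploymentEnvironment(env.DEPLOY_ENV)) {
    throw new Error(`unsupported deployment.environment: ${String(env.DEPLOY_ENV)}`);
  }
  if (!env.GIT_SHA || !env.GCP_REGION) {
    throw new Error("GIT_SHA and GCP_REGION must be set");
  }
  return {
    "service.name": "reporting-service",
    "service.namespace": "melmastoon",
    "service.version": env.GIT_SHA,
    "deployment.environment": env.DEPLOY_ENV,
    "cloud.provider": "gcp",
    "cloud.region": env.GCP_REGION,
  };
}
```

Failing fast on a missing or unknown value keeps mislabeled telemetry out of the shared backends, where it would be hard to filter afterwards.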

2. Required span attributes

Every span we emit:

| Attribute | Source |
| --- | --- |
| `melmastoon.tenant_id` | Request context |
| `melmastoon.correlation_id` | Inbound header / event envelope |
| `melmastoon.causation_id` | Originating event id when applicable |
| `melmastoon.user_id` | JWT subject (omitted on machine paths) |
| `melmastoon.report_id`, `melmastoon.run_id`, `melmastoon.template_id`, `melmastoon.template_version_id` | Domain ids attached at the use case boundary |
| `messaging.system = google_pubsub` (publish/consume spans) | OTel Pub/Sub instrumentation |
| `db.system = postgresql`, `db.statement` (sanitized) | Drizzle wrapper |
| `genai.system`, `genai.model`, `genai.cost_usd_micros` | AIClient wrapper |

PII is forbidden in span attributes; the logger redactor strips email/phone patterns before export.
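
As an illustration of the attribute rules above, the sketch below builds the request-scoped attributes and runs string values through a pattern-based redactor. It is a hedged example rather than the real redactor: the `RequestContext` shape, the function names, and both regular expressions are assumptions, and the phone pattern is deliberately crude (it would also match long digit runs such as dates).

```typescript
// Illustrative patterns only; a production redactor would need broader coverage.
const EMAIL_PATTERN = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;
const PHONE_PATTERN = /\+?\d[\d\s().-]{7,}\d/g;

// Hypothetical request context shape.
interface RequestContext {
  tenantId: string;
  correlationId: string;
  causationId?: string; // only set when an originating event exists
  userId?: string; // omitted on machine paths
}

// Replace anything that looks like an email address or phone number.
function redactPii(value: string): string {
  return value
    .replace(EMAIL_PATTERN, "[REDACTED_EMAIL]")
    .replace(PHONE_PATTERN, "[REDACTED_PHONE]");
}

// Build the attributes every span carries, skipping optional ids that are absent.
function requiredSpanAttributes(ctx: RequestContext): Record<string, string> {
  const attrs: Record<string, string> = {
    "melmastoon.tenant_id": ctx.tenantId,
    "melmastoon.correlation_id": ctx.correlationId,
  };
  if (ctx.causationId !== undefined) attrs["melmastoon.causation_id"] = ctx.causationId;
  if (ctx.userId !== undefined) attrs["melmastoon.user_id"] = ctx.userId;
  return attrs;
}

// Apply the redactor to every attribute value before export.
function sanitizeAttributes(attrs: Record<string, string>): Record<string, string> {
  const clean: Record<string, string> = {};
  for (const key of Object.keys(attrs)) {
    clean[key] = redactPii(attrs[key]);
  }
  return clean;
}
```

Pattern-based redaction is a safety net, not a substitute for keeping PII out of attributes in the first place.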

3. Structured logs

JSON Lines on stdout, severity ∈ {DEBUG, INFO, WARN, ERROR}. Required fields:

{
  "ts": "2026-04-22T10:00:00.123Z",
  "severity": "INFO",
  "service": "reporting-service",
  "version": "abc1234",
  "env": "prod",
  "tenant_id": "tnt_…",
  "correlation_id": "cor_…",
  "causation_id": "cor_…",
  "trace_id": "…",
  "span_id": "…",
  "event": "report.run.completed",
  "report_id": "rep_…",
  "run_id": "run_…",
  "duration_ms": 1834
}

Loggers attach trace context automatically; manual `console.log` is forbidden by an ESLint rule.
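
A logger that satisfies the required-field list might look like the sketch below. This is an assumption-laden illustration, not the platform logger: `createLogger`, the `LogContext` shape, and the injectable `sink` and `now` parameters are hypothetical, chosen so the example runs without any runtime-specific APIs.

```typescript
type Severity = "DEBUG" | "INFO" | "WARN" | "ERROR";

// Hypothetical per-request context bound to the logger once.
interface LogContext {
  service: string;
  version: string;
  env: string;
  tenant_id: string;
  correlation_id: string;
  causation_id?: string;
  trace_id: string;
  span_id: string;
}

type LogFn = (severity: Severity, event: string, fields?: Record<string, unknown>) => void;

// sink receives one JSON document per call; in the service it would write a line to stdout.
function createLogger(
  ctx: LogContext,
  sink: (line: string) => void,
  now: () => Date = () => new Date(),
): LogFn {
  return (severity, event, fields = {}) => {
    const record: Record<string, unknown> = {
      ts: now().toISOString(),
      severity,
      ...ctx,
      event,
    };
    // Event-specific fields may extend the record but never overwrite required ones.
    for (const key of Object.keys(fields)) {
      if (!(key in record)) record[key] = fields[key];
    }
    sink(JSON.stringify(record));
  };
}
```

Binding the trace and correlation ids when the logger is created means call sites only supply the event name and its domain fields, for example `log("INFO", "report.run.completed", { run_id, duration_ms })`.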

4. SLIs and SLOs

| SLI | Definition | Target (rolling 30 d) |
| --- | --- | --- |
| Run success rate | runs with `status='completed'` ÷ runs with terminal status | ≥ 99.0 % |
| Run latency p95 (operational) | `completed_at - queued_at` for ad-hoc runs of category `operational` | ≤ 8 s |
| Run latency p95 (regulatory) | same, category `regulatory`, ≤ 50k rows | ≤ 30 s |
| Schedule fire timeliness | drift from cron expectation | p95 ≤ 60 s |
| Regulatory submission success rate | succeeded ÷ all terminal | ≥ 99.5 % monthly |
| API availability | non-5xx ÷ all on `/api/v1/reports/*` | ≥ 99.9 % |
| Pub/Sub publish success | published outbox rows ÷ attempted | ≥ 99.99 % |
| AI step skipped rate | `report.ai_skipped.v1` ÷ runs eligible for AI | informational, alert if > 10 % |

Error budget burn alerts: 1 h fast burn (14 d budget) and 6 h slow burn per platform standard.
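
To make the burn-alert arithmetic concrete: burn rate is the observed error ratio divided by the error budget (1 minus the SLO target). The sketch below shows that calculation only; the function names are illustrative, and the actual fast-burn and slow-burn thresholds come from the platform standard, which is not reproduced here.

```typescript
// Fraction of events allowed to fail under the SLO, e.g. 0.01 for a 99.0 % target.
function errorBudget(sloTarget: number): number {
  if (!(sloTarget > 0 && sloTarget < 1)) {
    throw new RangeError("sloTarget must be strictly between 0 and 1");
  }
  return 1 - sloTarget;
}

// How many times faster than "exactly on budget" the budget is being consumed.
function burnRate(errorRatio: number, sloTarget: number): number {
  return errorRatio / errorBudget(sloTarget);
}

// Hours until the whole budget for the SLO window is gone at a constant burn rate.
function hoursToExhaustion(rate: number, windowDays: number = 30): number {
  return rate > 0 ? (windowDays * 24) / rate : Infinity;
}
```

For the 99.0 % run success target, a sustained 5 % failure ratio gives a burn rate of about 5, which would consume a 30-day budget in roughly 144 hours (6 days).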

5. RED + USE metrics

| Metric (Cloud Monitoring + Prometheus exposition) | Type | Labels |
| --- | --- | --- |
| `reporting_runs_total` | counter | `tenant_id`, `category`, `status`, `format` |
| `reporting_run_duration_seconds` | histogram | `tenant_id`, `category`, `format` |
| `reporting_run_rows` | histogram | `tenant_id`, `category` |
| `reporting_artifact_bytes` | histogram | `tenant_id`, `format` |
| `reporting_render_step_seconds` | histogram | `step` ∈ {query, compose, render, persist, deliver} |
| `reporting_schedule_drift_seconds` | gauge | `tenant_id` |
| `reporting_regulatory_submissions_total` | counter | `jurisdiction_code`, `status` |
| `reporting_regulatory_submission_attempts` | histogram | `jurisdiction_code` |
| `reporting_outbox_lag_seconds` | gauge | (per pod) |
| `reporting_inbox_dedupe_hits_total` | counter | `subject` |
| `reporting_ai_skipped_total` | counter | `reason` ∈ {timeout, budget, residency, low_confidence} |
| `reporting_ai_cost_usd_micros_total` | counter | `tenant_id`, `capability` |

USE for the worker pool: pod CPU, memory, render saturation (Puppeteer page count vs cap), GCS upload bandwidth, BigQuery slot utilization (read).
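
The sketch below illustrates how a labeled counter such as `reporting_runs_total` maps onto Prometheus text exposition lines. It is a dependency-free teaching example, not the metrics client the service uses: the `LabeledCounter` class is hypothetical, and the label values in the usage note (including `pdf` for `format`) are made up.

```typescript
// Escape backslash, double quote, and newline as the text exposition format requires.
function escapeLabelValue(value: string): string {
  return value.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");
}

// Hypothetical minimal counter: one running total per distinct label combination.
class LabeledCounter {
  private readonly name: string;
  private readonly labelNames: readonly string[];
  private readonly series = new Map<string, number>();

  constructor(name: string, labelNames: readonly string[]) {
    this.name = name;
    this.labelNames = labelNames;
  }

  inc(labels: Record<string, string>, by: number = 1): void {
    if (by < 0) throw new RangeError("counters can only increase");
    // Serialize labels in declaration order so equal label sets share one series.
    const key = this.labelNames
      .map((labelName) => {
        const value = labels[labelName];
        if (value === undefined) throw new Error(`missing label: ${labelName}`);
        return `${labelName}="${escapeLabelValue(value)}"`;
      })
      .join(",");
    this.series.set(key, (this.series.get(key) ?? 0) + by);
  }

  // Render a TYPE line followed by one sample line per series.
  expose(): string[] {
    const lines: string[] = [`# TYPE ${this.name} counter`];
    this.series.forEach((value, key) => {
      lines.push(`${this.name}{${key}} ${value}`);
    });
    return lines;
  }
}
```

Incrementing with `{ tenant_id: "tnt_1", category: "operational", status: "completed", format: "pdf" }` produces one sample line for that combination. Every distinct label combination becomes its own series, which is why per-tenant labels deserve attention when tenant counts grow.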

6. Dashboards

The platform Grafana stack includes Reporting Overview, Reporting Per Tenant, and Reporting Regulatory dashboards. Panels:

Run throughput by category & status.
Run latency p50/p95/p99 by category.
Top 10 tenants by run count and by AI cost.
Schedule drift heatmap by hour.
Regulatory submission queue depth and success rate by jurisdiction.
Outbox lag and Pub/Sub publish errors.
Renderer error breakdown (Puppeteer crash, OOM, template invariant).

7. Alerts

| Alert | Condition | Severity | Runbook |
| --- | --- | --- | --- |
| `RunFailureBurst` | `sum(rate(reporting_runs_total{status="failed"}[5m])) / sum(rate(reporting_runs_total[5m])) > 0.05` for 10 m | P2 | `runbooks/reporting/run-failure-burst.md` |
| `RegulatorySubmissionMissed` | any submission with `status='pending'` and `next_attempt_at < now() - 6h` | P1 | `runbooks/reporting/regulatory-missed.md` |
| `OutboxLag` | `max_over_time(reporting_outbox_lag_seconds[10m]) > 60` | P2 | `runbooks/platform/outbox-lag.md` |
| `ScheduleDriftHigh` | `quantile(0.95, reporting_schedule_drift_seconds) > 120` for 30 m | P3 | `runbooks/reporting/schedule-drift.md` |
| `AIBudgetExhausted` | `increase(reporting_ai_skipped_total{reason="budget"}[1h]) > 0` | P3 (notify tenant.owner once / day) | n/a |
| `PuppeteerCrashLoop` | `increase(reporting_render_step_seconds_count{step="render",status="error"}[10m]) > 20` | P1 | `runbooks/reporting/puppeteer-crash.md` |
| `BigQueryTimeoutSpike` | upstream `analytics-service` query timeouts > 1 % for 15 m | P2 | shared with analytics |
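
Unlike the metric-based alerts, `RegulatorySubmissionMissed` is a condition over submission rows. A minimal sketch of that predicate follows, assuming a hypothetical `Submission` shape and function name; only the `pending` status and the 6 h threshold come from the table above.

```typescript
// Hypothetical row shape; only status and next_attempt_at matter for the alert.
interface Submission {
  id: string;
  status: string;
  nextAttemptAt: Date;
}

const MISSED_AFTER_MS = 6 * 60 * 60 * 1000; // 6 h, per the alert condition

// Returns submissions that are still pending although their next attempt
// was due more than six hours before `now`.
function findMissedSubmissions(submissions: readonly Submission[], now: Date): Submission[] {
  const cutoff = now.getTime() - MISSED_AFTER_MS;
  return submissions.filter(
    (submission) => submission.status === "pending" && submission.nextAttemptAt.getTime() < cutoff,
  );
}
```

Passing `now` explicitly keeps the check deterministic, which makes it straightforward to unit test around the 6 h boundary.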

8. Tracing & replay

Each event published carries traceparent in attributes; consumers continue the trace.
reporting.report.completed.v1 consumers (notification, audit) get linked spans for root-cause investigation.
A canary synthetic test runs every 5 min: it requests a tiny operational report against a synthetic tenant property and asserts that the run reaches the terminal status completed within 15 s. Failures page the on-call.

Synthetic check definition lives at services/reporting-service/synthetics/canary-run.yaml.
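
Trace continuation across Pub/Sub relies on the W3C `traceparent` value carried in the message attributes. The sketch below shows the header format and a strict parser for version `00`. In practice the OTel instrumentation handles this; the helper names and the `SpanContextLite` type are assumptions for illustration.

```typescript
// Hypothetical minimal span context; real SDK types carry more fields.
interface SpanContextLite {
  traceId: string; // 32 lowercase hex characters
  spanId: string; // 16 lowercase hex characters
  sampled: boolean;
}

const TRACEPARENT_PATTERN = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Serialize as version-traceid-parentid-flags.
function formatTraceparent(ctx: SpanContextLite): string {
  return `00-${ctx.traceId}-${ctx.spanId}-${ctx.sampled ? "01" : "00"}`;
}

// Returns undefined for malformed values so the consumer can start a new trace.
function parseTraceparent(header: string): SpanContextLite | undefined {
  const match = TRACEPARENT_PATTERN.exec(header);
  if (!match) return undefined;
  const traceId = match[1];
  const spanId = match[2];
  const flags = match[3];
  // All-zero ids are invalid per the W3C Trace Context specification.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return undefined;
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 0x01) === 0x01 };
}

// Publisher side: copy the attributes and add the trace context.
function withTraceContext(
  attributes: Record<string, string>,
  ctx: SpanContextLite,
): Record<string, string> {
  return { ...attributes, traceparent: formatTraceparent(ctx) };
}
```

A consumer would call `parseTraceparent(message.attributes.traceparent)` and, when it returns a context, create its processing span as a child of (or link to) that context.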

9. On-call playbook entry points

Reporting-owned alerts reference a runbook in docs/runbooks/reporting/:

run-failure-burst.md — triage by template/version, check upstream analytics-service health, drain DLQ if needed.
regulatory-missed.md — open the submission, choose retry-now or manually-resolved with note, page legal liaison if past statutory cutoff.
puppeteer-crash.md — pin Chromium version, scale workers vertically, fall back to single-process mode.
schedule-drift.md — check Cloud Scheduler quota, worker pool saturation, regional incidents.

Cross-references: FAILURE_MODES, DEPLOYMENT_TOPOLOGY, SERVICE_READINESS.
