OBSERVABILITY — reporting-service
Sibling: APPLICATION_LOGIC · FAILURE_MODES · platform anchor: docs/02 §10 Observability
Telemetry stack: OpenTelemetry SDK → OTel Collector (sidecar in dev, sidecar+gateway in prod) → SigNoz (traces, logs) and Cloud Monitoring (metrics, alerts). All emissions follow the platform naming conventions.
1. Service identity in telemetry
service.name = reporting-service
service.namespace = melmastoon
service.version = $GIT_SHA
deployment.environment = dev | staging | prod
cloud.provider = gcp
cloud.region = $GCP_REGION
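The identity attributes above could be assembled once at startup and passed to the OTel SDK as resource attributes. The following is a minimal sketch, not the service's actual bootstrap code: `buildResourceAttributes` is a hypothetical helper, and the `DEPLOY_ENV` variable name is an assumption (the list above only names `$GIT_SHA` and `$GCP_REGION`).

```typescript
const ALLOWED_ENVIRONMENTS = ["dev", "staging", "prod"];

// Hypothetical helper: maps process-style env vars onto the identity attributes listed above.
// DEPLOY_ENV is an assumed variable name; GIT_SHA and GCP_REGION come from the list above.
function buildResourceAttributes(env: Record<string, string | undefined>): Record<string, string> {
  const deployEnv = env["DEPLOY_ENV"] ?? "";
  if (ALLOWED_ENVIRONMENTS.indexOf(deployEnv) === -1) {
    throw new Error(`deployment.environment must be dev | staging | prod, got "${deployEnv}"`);
  }
  const gitSha = env["GIT_SHA"];
  const region = env["GCP_REGION"];
  if (!gitSha || !region) {
    throw new Error("GIT_SHA and GCP_REGION must be set");
  }
  return {
    "service.name": "reporting-service",
    "service.namespace": "melmastoon",
    "service.version": gitSha,
    "deployment.environment": deployEnv,
    "cloud.provider": "gcp",
    "cloud.region": region,
  };
}
```

Failing fast on a missing or unknown value keeps mislabelled telemetry out of the shared backends; whether the real service does so is not stated here.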
2. Required span attributes
Every span we emit:
| Attribute | Source |
|---|---|
| melmastoon.tenant_id | Request context |
| melmastoon.correlation_id | Inbound header / event envelope |
| melmastoon.causation_id | Originating event id when applicable |
| melmastoon.user_id | JWT subject (omitted on machine paths) |
| melmastoon.report_id, melmastoon.run_id, melmastoon.template_id, melmastoon.template_version_id | Domain ids attached at the use case boundary |
| messaging.system = google_pubsub (publish/consume spans) | OTel Pub/Sub instrumentation |
| db.system = postgresql, db.statement (sanitized) | Drizzle wrapper |
| genai.system, genai.model, genai.cost_usd_micros | AIClient wrapper |
PII is forbidden in span attributes; the logger redactor strips email/phone patterns before export.
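To illustrate the kind of pattern-based redaction described above, here is a minimal sketch. The two regular expressions and the helper names `redactPii` and `redactAttributes` are illustrative assumptions; the production redactor's actual rules are not shown in this document and likely need tuning to avoid false positives on long numeric ids.

```typescript
// Illustrative patterns only (assumptions, not the service's real rules).
const EMAIL_PATTERN = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;
const PHONE_PATTERN = /\+?\d[\d\s().-]{7,}\d/g;

// Replaces email-like and phone-like substrings with fixed placeholders.
function redactPii(value: string): string {
  return value
    .replace(EMAIL_PATTERN, "[REDACTED_EMAIL]")
    .replace(PHONE_PATTERN, "[REDACTED_PHONE]");
}

// Applies redaction to every string attribute; non-string values pass through unchanged.
function redactAttributes(attrs: Record<string, unknown>): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const key of Object.keys(attrs)) {
    const value = attrs[key];
    out[key] = typeof value === "string" ? redactPii(value) : value;
  }
  return out;
}
```

Redaction at export time is a backstop; the rule that PII must not be set on spans in the first place still applies.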
3. Structured logs
JSON Lines on stdout, severity ∈ {DEBUG, INFO, WARN, ERROR}. Required fields:
{
  "ts": "2026-04-22T10:00:00.123Z",
  "severity": "INFO",
  "service": "reporting-service",
  "version": "abc1234",
  "env": "prod",
  "tenant_id": "tnt_…",
  "correlation_id": "cor_…",
  "causation_id": "cor_…",
  "trace_id": "…",
  "span_id": "…",
  "event": "report.run.completed",
  "report_id": "rep_…",
  "run_id": "run_…",
  "duration_ms": 1834
}
Loggers attach trace context automatically; manual console.log is forbidden by ESLint rule.
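A record with the required fields could be produced along these lines. This is a sketch under assumptions: `formatLogLine` and `LogContext` are hypothetical names, and the real logger obtains trace context automatically rather than through an explicit argument.

```typescript
type Severity = "DEBUG" | "INFO" | "WARN" | "ERROR";

interface LogContext {
  version: string;
  env: string;
  tenant_id: string;
  correlation_id: string;
  causation_id?: string; // omitted from the output when undefined
  trace_id: string;
  span_id: string;
}

// Hypothetical helper: renders one JSON Lines record.
// Caller-supplied fields are spread first so they cannot overwrite the required fields.
function formatLogLine(
  severity: Severity,
  event: string,
  ctx: LogContext,
  fields: Record<string, string | number | boolean> = {},
  now: Date = new Date(),
): string {
  const record = {
    ...fields,
    ts: now.toISOString(),
    severity,
    service: "reporting-service",
    ...ctx,
    event,
  };
  return JSON.stringify(record);
}
```

Each call yields a single line without embedded newlines, which is what a JSON Lines consumer on stdout expects.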
4. SLIs and SLOs
| SLI | Definition | Target (rolling 30 d) |
|---|---|---|
| Run success rate | runs with status='completed' ÷ runs with terminal status | ≥ 99.0 % |
| Run latency p95 (operational) | completed_at - queued_at for ad-hoc runs of category operational | ≤ 8 s |
| Run latency p95 (regulatory) | same, category regulatory, ≤ 50k rows | ≤ 30 s |
| Schedule fire timeliness | drift from cron expectation | p95 ≤ 60 s |
| Regulatory submission success rate | succeeded ÷ all terminal | ≥ 99.5 % monthly |
| API availability | non-5xx ÷ all on /api/v1/reports/* | ≥ 99.9 % |
| Pub/Sub publish success | published outbox rows ÷ attempted | ≥ 99.99 % |
| AI step skipped rate | report.ai_skipped.v1 ÷ runs eligible for AI | informational, alert if > 10 % |
Error budget burn alerts: 1 h fast burn (14 d budget) and 6 h slow burn per platform standard.
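As a worked illustration of the run success rate SLI and of burn rate in the usual SRE sense (observed error rate divided by the error rate the SLO allows), consider the sketch below. The function names are hypothetical, treating an empty window as healthy is an assumption, and the platform-standard burn thresholds are not reproduced here.

```typescript
// Run success rate SLI: completed runs ÷ runs with any terminal status.
function runSuccessRate(completed: number, terminal: number): number {
  if (terminal === 0) return 1; // assumption: a window with no terminal runs counts as healthy
  return completed / terminal;
}

// Burn rate: observed error rate ÷ error rate permitted by the SLO target.
// A value of 1 means the budget is being consumed exactly at the sustainable pace.
function burnRate(successRate: number, sloTarget: number): number {
  const allowedErrorRate = 1 - sloTarget;
  if (allowedErrorRate <= 0) throw new Error("SLO target must be below 1");
  return (1 - successRate) / allowedErrorRate;
}
```

For example, 950 completed out of 1000 terminal runs against the 99.0 % target gives a success rate of 0.95 and a burn rate of roughly 5.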
5. RED + USE metrics
| Metric (Cloud Monitoring + Prometheus exposition) | Type | Labels |
|---|---|---|
| reporting_runs_total | counter | tenant_id, category, status, format |
| reporting_run_duration_seconds | histogram | tenant_id, category, format |
| reporting_run_rows | histogram | tenant_id, category |
| reporting_artifact_bytes | histogram | tenant_id, format |
| reporting_render_step_seconds | histogram | step ∈ {query, compose, render, persist, deliver} |
| reporting_schedule_drift_seconds | gauge | tenant_id |
| reporting_regulatory_submissions_total | counter | jurisdiction_code, status |
| reporting_regulatory_submission_attempts | histogram | jurisdiction_code |
| reporting_outbox_lag_seconds | gauge | (per pod) |
| reporting_inbox_dedupe_hits_total | counter | subject |
| reporting_ai_skipped_total | counter | reason ∈ {timeout, budget, residency, low_confidence} |
| reporting_ai_cost_usd_micros_total | counter | tenant_id, capability |
USE for the worker pool: pod CPU, memory, render saturation (Puppeteer page count vs cap), GCS upload bandwidth, BigQuery slot utilization (read).
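To make the label sets in the table concrete, the sketch below shows how a labelled counter such as reporting_runs_total might render in the Prometheus text exposition format. The `Counter` class is a dependency-free stand-in written for illustration; the service presumably uses the OTel metrics SDK or a Prometheus client library instead.

```typescript
type Labels = Record<string, string>;

// Escapes a label value per the Prometheus text format (backslash, double quote, newline).
function escapeLabelValue(value: string): string {
  return value.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");
}

// Illustrative in-memory counter with a fixed, ordered label set.
class Counter {
  private readonly name: string;
  private readonly labelNames: readonly string[];
  private readonly series = new Map<string, number>();

  constructor(name: string, labelNames: readonly string[]) {
    this.name = name;
    this.labelNames = labelNames;
  }

  inc(labels: Labels, amount: number = 1): void {
    if (amount < 0) throw new Error("counters only go up");
    const key = this.labelNames
      .map((labelName) => {
        const value = labels[labelName];
        if (value === undefined) throw new Error(`missing label ${labelName}`);
        return `${labelName}="${escapeLabelValue(value)}"`;
      })
      .join(",");
    this.series.set(key, (this.series.get(key) ?? 0) + amount);
  }

  // Renders a TYPE line followed by one sample line per label combination.
  expose(): string {
    const lines: string[] = [`# TYPE ${this.name} counter`];
    this.series.forEach((value, key) => {
      lines.push(`${this.name}{${key}} ${value}`);
    });
    return lines.join("\n") + "\n";
  }
}
```

Every distinct combination of label values becomes its own time series, so per-tenant labels such as tenant_id multiply cardinality by the number of active tenants.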
6. Dashboards
The platform Grafana stack includes Reporting Overview, Reporting Per Tenant, and Reporting Regulatory dashboards. Panels:
- Run throughput by category & status.
- Run latency p50/p95/p99 by category.
- Top 10 tenants by run count and by AI cost.
- Schedule drift heatmap by hour.
- Regulatory submission queue depth and success rate by jurisdiction.
- Outbox lag and Pub/Sub publish errors.
- Renderer error breakdown (Puppeteer crash, OOM, template invariant).
7. Alerts
| Alert | Condition | Severity | Runbook |
|---|---|---|---|
| RunFailureBurst | rate(reporting_runs_total{status="failed"}[5m]) / rate(reporting_runs_total[5m]) > 0.05 for 10 m | P2 | runbooks/reporting/run-failure-burst.md |
| RegulatorySubmissionMissed | any submission with status='pending' and next_attempt_at < now() - 6h | P1 | runbooks/reporting/regulatory-missed.md |
| OutboxLag | max_over_time(reporting_outbox_lag_seconds[10m]) > 60 | P2 | runbooks/platform/outbox-lag.md |
| ScheduleDriftHigh | quantile(0.95, reporting_schedule_drift_seconds) > 120 for 30 m | P3 | runbooks/reporting/schedule-drift.md |
| AIBudgetExhausted | increase(reporting_ai_skipped_total{reason="budget"}[1h]) > 0 | P3 (notify tenant.owner once / day) | n/a |
| PuppeteerCrashLoop | increase(reporting_render_step_seconds_count{step="render",status="error"}[10m]) > 20 | P1 | runbooks/reporting/puppeteer-crash.md |
| BigQueryTimeoutSpike | upstream analytics-service query timeouts > 1 % for 15 m | P2 | shared with analytics |
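Two of the conditions above can be restated as plain predicates, which may help when unit-testing alert logic or reasoning about edge cases. This is a sketch only: the function and field names are hypothetical, the real rules are evaluated by the monitoring backend, and the "for 10 m" hold duration on RunFailureBurst is not modelled.

```typescript
interface Submission {
  status: string;
  nextAttemptAt: Date;
}

const SIX_HOURS_MS = 6 * 60 * 60 * 1000;

// RegulatorySubmissionMissed: status is pending and next_attempt_at is older than now - 6h.
function isSubmissionMissed(submission: Submission, now: Date): boolean {
  return (
    submission.status === "pending" &&
    submission.nextAttemptAt.getTime() < now.getTime() - SIX_HOURS_MS
  );
}

// RunFailureBurst: failed share of runs in the window strictly exceeds 5 %.
function isRunFailureBurst(failedInWindow: number, totalInWindow: number): boolean {
  if (totalInWindow === 0) return false; // no runs, so there is no ratio to evaluate
  return failedInWindow / totalInWindow > 0.05;
}
```

Note that the comparison is strict in both cases: exactly 5 % failures, or a submission exactly 6 h overdue, does not fire.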
8. Tracing & replay
- Each event published carries traceparent in attributes; consumers continue the trace.
- reporting.report.completed.v1 consumers (notification, audit) get linked spans for root-cause investigation.
- A canary synthetic test runs every 5 min: requests a tiny operational report against a synthetic tenant property, asserts terminal status completed within 15 s. Failures page the on-call.
Synthetic check definition lives at services/reporting-service/synthetics/canary-run.yaml.
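For reference, the traceparent value carried in event attributes follows the W3C Trace Context format `version-traceid-parentid-flags`. The sketch below parses a version-00 value by hand to show what a consumer needs in order to continue a trace; `parseTraceparent` and `TraceContext` are hypothetical names, and in practice the OTel propagation API would normally do this extraction.

```typescript
interface TraceContext {
  traceId: string;
  parentSpanId: string;
  sampled: boolean;
}

// Version 00: 32 hex chars of trace id, 16 hex chars of parent span id, 2 hex chars of flags.
const TRACEPARENT_PATTERN = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Returns null when the attribute is absent or malformed, so the caller can start a fresh trace.
function parseTraceparent(attributes: Record<string, string | undefined>): TraceContext | null {
  const raw = attributes["traceparent"];
  if (raw === undefined) return null;
  const match = TRACEPARENT_PATTERN.exec(raw);
  if (match === null) return null;
  const traceId = match[1];
  const parentSpanId = match[2];
  const flags = match[3];
  if (traceId === undefined || parentSpanId === undefined || flags === undefined) return null;
  // All-zero ids are invalid per the W3C Trace Context specification.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentSpanId)) return null;
  // Bit 0 of the flags byte is the "sampled" flag.
  return { traceId, parentSpanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}
```

Falling back to a fresh trace on a malformed header keeps a bad producer from breaking consumers, at the cost of losing the link to the upstream span.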
9. On-call playbook entry points
Reporting-owned alerts reference runbooks in docs/runbooks/reporting/:
- run-failure-burst.md: triage by template/version, check upstream analytics-service health, drain DLQ if needed.
- regulatory-missed.md: open the submission, choose retry-now or manually-resolved with note, page legal liaison if past statutory cutoff.
- puppeteer-crash.md: pin Chromium version, scale workers vertically, fall back to single-process mode.
- schedule-drift.md: check Cloud Scheduler quota, worker pool saturation, regional incidents.
Cross-references: FAILURE_MODES, DEPLOYMENT_TOPOLOGY, SERVICE_READINESS.