FAILURE_MODES — reporting-service
Siblings: OBSERVABILITY · APPLICATION_LOGIC · SERVICE_RISK_REGISTER · platform anchor: docs/02 §11 Resilience
This document catalogues failure modes by lifecycle stage, with detection, mitigation, and recovery for each. Every error code referenced here is defined in the canonical registry under API_CONTRACTS §0.4 / docs/standards/ERROR_CODES.
1. Lifecycle stages
[REQUEST] → [PERSIST] → [QUEUE] → [QUERY] → [COMPOSE] → [RENDER] → [PERSIST_ARTIFACT] → [DELIVER] → [REGULATORY_SUBMIT (optional)] → [PUBLISH_EVENT]
Each stage maps to one or more failure modes below.
2. Failure mode catalogue
2.1 Request layer
| Mode | Trigger | Detection | Mitigation | Recovery |
|---|---|---|---|---|
| Idempotency-Key collision | Same key, different payload | Conflict on (tenantId, idempotency_key) | Reject with MELMASTOON.IDEMPOTENCY.KEY_COLLISION (422) | Caller must change key |
| Property scope violation | JWT lacks property | RBAC+ABAC check | Reject MELMASTOON.REPORTING.PROPERTY_SCOPE_VIOLATION | Caller obtains scope |
| Filter validation failure | Required filter missing | Domain invariant | Reject MELMASTOON.REPORTING.FILTERS_INVALID (400 + RFC 7807) | Caller fixes input |
| Template archived | template.archived=true | Read check | Reject MELMASTOON.REPORTING.TEMPLATE_ARCHIVED (410) | Caller switches to active version |
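
The Idempotency-Key row can be illustrated with a minimal sketch. It assumes an in-memory map standing in for the unique index on (tenantId, idempotency_key), and it assumes that a repeat of the same key with the same payload replays the original run (the table only specifies the different-payload case). The names `checkIdempotency`, `payloadHash`, and `idempotencyStore` are hypothetical.

```typescript
// Hypothetical in-memory stand-in for the (tenantId, idempotency_key) unique index.
type StoredRequest = { payloadHash: string; runId: string };

type IdempotencyResult =
  | { kind: "new"; runId: string }
  | { kind: "replay"; runId: string }
  | { kind: "collision"; status: 422; code: "MELMASTOON.IDEMPOTENCY.KEY_COLLISION" };

const idempotencyStore = new Map<string, StoredRequest>();

function checkIdempotency(
  tenantId: string,
  key: string,
  payloadHash: string,
  newRunId: string,
): IdempotencyResult {
  // JSON-encode the pair so that separators inside either value cannot collide.
  const storeKey = JSON.stringify([tenantId, key]);
  const existing = idempotencyStore.get(storeKey);
  if (existing === undefined) {
    idempotencyStore.set(storeKey, { payloadHash, runId: newRunId });
    return { kind: "new", runId: newRunId };
  }
  if (existing.payloadHash === payloadHash) {
    // Assumption: same key and same payload returns the original run.
    return { kind: "replay", runId: existing.runId };
  }
  // Same key, different payload: the caller must change the key.
  return { kind: "collision", status: 422, code: "MELMASTOON.IDEMPOTENCY.KEY_COLLISION" };
}
```

In a real deployment the lookup and insert would be a single statement against the database so that the unique constraint, not application code, decides the race.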
2.2 Persistence
| Mode | Trigger | Detection | Mitigation | Recovery |
|---|---|---|---|---|
| Cloud SQL connection exhaustion | Burst | pg_stat_activity saturation, app pool wait | PgBouncer transaction pooling, app circuit-breaker | Autoscale connections; back off; alert P2 |
| RLS misconfiguration | Missing policy | CI migration check + integration test fail | Block deploy | n/a |
| Outbox publish lag | Pub/Sub down | reporting_outbox_lag_seconds > 60 | Outbox publisher retries with exponential backoff | Once Pub/Sub healthy, drain queue; alert P2 |
| Inbox dedupe miss | Concurrent delivery to two pods | UNIQUE on (subject, message_id) | Insert before effect; rollback effect on conflict | Self-healing |
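
The "insert before effect" mitigation for inbox dedupe can be sketched as follows. This is an illustration only: `InboxStore` is a hypothetical in-memory substitute for the table with UNIQUE on (subject, message_id), and the rollback is simulated by removing the row when the effect throws. In the service, the insert and the effect would presumably share one database transaction.

```typescript
// Hypothetical in-memory substitute for the inbox table and its unique constraint.
class InboxStore {
  private readonly seen = new Set<string>();

  private static key(subject: string, messageId: string): string {
    return JSON.stringify([subject, messageId]);
  }

  // Returns false when the row already exists (the unique-violation case).
  tryInsert(subject: string, messageId: string): boolean {
    const k = InboxStore.key(subject, messageId);
    if (this.seen.has(k)) return false;
    this.seen.add(k);
    return true;
  }

  remove(subject: string, messageId: string): void {
    this.seen.delete(InboxStore.key(subject, messageId));
  }
}

function handleOnce(
  inbox: InboxStore,
  subject: string,
  messageId: string,
  effect: () => void,
): "applied" | "duplicate" {
  // Claim the message first; a second delivery loses here and skips the effect.
  if (!inbox.tryInsert(subject, messageId)) return "duplicate";
  try {
    effect();
  } catch (err) {
    // Simulated rollback so that a redelivery can try again.
    inbox.remove(subject, messageId);
    throw err;
  }
  return "applied";
}
```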
2.3 Query (analytics dependency)
| Mode | Trigger | Detection | Mitigation | Recovery |
|---|---|---|---|---|
| AnalyticsClient timeout | BigQuery slot starvation | Per-query span > deadline | Retry with jitter (≤ 3); else fail run with MELMASTOON.REPORTING.UPSTREAM_ANALYTICS_TIMEOUT retriable | Worker reschedules via Pub/Sub redelivery |
| AnalyticsClient 5xx | Service outage | HTTP/5xx spike | Open circuit-breaker; fail fast for 60 s | Circuit half-opens automatically |
| Result too large | Row count > template rowCap | Stream count check | Abort with MELMASTOON.REPORTING.RESULT_TOO_LARGE non-retriable | Caller narrows filters |
| Schema drift in projection | New column missing | Drizzle/AnalyticsClient typed accessor throws | Fail with MELMASTOON.REPORTING.PROJECTION_SCHEMA_MISMATCH; alert | Pin earlier projection version; coordinate with analytics-service |
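
The AnalyticsClient timeout row ("retry with jitter (≤ 3)") can be sketched as a pure decision function. Assumptions, none of which come from the source: "≤ 3" is read as at most three retries after the initial attempt, the jitter is "full jitter" over an exponential ceiling, and the base and cap delays (200 ms, 5 s) are illustrative. `afterTimeout` is a hypothetical name.

```typescript
const MAX_RETRIES = 3;   // assumed reading of "≤ 3"
const BASE_DELAY_MS = 200; // illustrative value
const MAX_DELAY_MS = 5000; // illustrative value

type NextStep =
  | { action: "retry"; delayMs: number }
  | { action: "fail"; code: "MELMASTOON.REPORTING.UPSTREAM_ANALYTICS_TIMEOUT"; retriable: true };

// Decide what to do after a query timed out, given how many retries were already made.
// `random` is injectable so the jitter can be made deterministic in tests.
function afterTimeout(retriesSoFar: number, random: () => number = Math.random): NextStep {
  if (retriesSoFar < MAX_RETRIES) {
    const ceiling = Math.min(MAX_DELAY_MS, BASE_DELAY_MS * Math.pow(2, retriesSoFar));
    // Full jitter: pick uniformly in [0, ceiling).
    return { action: "retry", delayMs: Math.floor(random() * ceiling) };
  }
  // Retries exhausted: fail the run as retriable so Pub/Sub redelivery reschedules it.
  return {
    action: "fail",
    code: "MELMASTOON.REPORTING.UPSTREAM_ANALYTICS_TIMEOUT",
    retriable: true,
  };
}
```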
2.4 Compose & render
| Mode | Trigger | Detection | Mitigation | Recovery |
|---|---|---|---|---|
| Renderer OOM | Large dataset, large images | Pod OOMKilled / heap warn | Stream rows; cap pages; reject pages > 5000 with MELMASTOON.REPORTING.RENDER_PAGE_LIMIT | Lower template rowCap; rerun |
| Puppeteer crash | Chromium fatal | Process exit watcher | Restart Chromium; retry render once; otherwise fail as retriable | Pub/Sub redelivery |
| Font missing for locale | Locale glyphs not in image | Renderer glyph warning | Fallback to platform default font + emit warning | Update image; rebuild |
| AI step latency exceeded | aiClient.invoke timeout | OTel span | Skip callouts; emit report.ai_skipped.v1 | Run completes without AI annotations |
| Template invariant violated at render | Layout block references missing column | Renderer pre-flight | Fail run MELMASTOON.REPORTING.TEMPLATE_INVARIANT_FAILED non-retriable | Author fixes template version |
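
The 5000-page cap in the Renderer OOM row could be enforced before rendering starts. The sketch below assumes the page count can be estimated from the row count and a fixed rows-per-page figure, which is a simplification; `checkPageLimit` and `rowsPerPage` are hypothetical names.

```typescript
const MAX_PAGES = 5000; // limit stated in the table above

type PageCheck =
  | { ok: true; pages: number }
  | { ok: false; pages: number; code: "MELMASTOON.REPORTING.RENDER_PAGE_LIMIT" };

// Estimate the page count up front so that oversized runs are rejected
// before the renderer allocates memory for them.
function checkPageLimit(rowCount: number, rowsPerPage: number): PageCheck {
  if (rowsPerPage <= 0) throw new RangeError("rowsPerPage must be positive");
  const pages = Math.max(1, Math.ceil(rowCount / rowsPerPage));
  if (pages > MAX_PAGES) {
    return { ok: false, pages, code: "MELMASTOON.REPORTING.RENDER_PAGE_LIMIT" };
  }
  return { ok: true, pages };
}
```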
2.5 Persist artifact
| Mode | Trigger | Detection | Mitigation | Recovery |
|---|---|---|---|---|
| GCS 5xx burst | GCS regional incident | Upload error | Retry with backoff (5 attempts); preserve buffer | Mark run failed retriable; reschedule |
| KMS unavailable for CMEK | KMS regional incident | API error | Retry; if persists fail retriable | Recovers when KMS restored |
| Object name collision | Same SHA same path (rare) | Pre-check gcsExists | Skip upload; reuse existing artifact metadata | Self-healing |
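
The object name collision row relies on content-addressed paths: identical bytes produce the same SHA and therefore the same object name, so the upload can be skipped. A minimal sketch follows, assuming a synchronous in-memory store in place of GCS and a hypothetical path layout of `tenantId/sha256.pdf`. The `exists` method plays the role of the gcsExists pre-check named in the table, whose real signature is not given in this document.

```typescript
// Hypothetical storage interface; the real GCS client is asynchronous.
interface ArtifactStore {
  exists(path: string): boolean;
  put(path: string, bytes: Uint8Array): void;
}

class InMemoryStore implements ArtifactStore {
  readonly objects = new Map<string, Uint8Array>();
  putCalls = 0;

  exists(path: string): boolean {
    return this.objects.has(path);
  }

  put(path: string, bytes: Uint8Array): void {
    this.putCalls += 1;
    this.objects.set(path, bytes);
  }
}

function persistArtifact(
  store: ArtifactStore,
  tenantId: string,
  sha256Hex: string,
  bytes: Uint8Array,
): { path: string; uploaded: boolean } {
  const path = `${tenantId}/${sha256Hex}.pdf`; // assumed layout, for illustration only
  if (store.exists(path)) {
    // Same hash at the same path: reuse the existing artifact metadata.
    return { path, uploaded: false };
  }
  store.put(path, bytes);
  return { path, uploaded: true };
}
```

A check-then-write sequence is racy across pods. On GCS, a create with a generation-match precondition of 0 makes the write fail if the object already exists, which may be a safer way to get the same result.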
2.6 Deliver
| Mode | Trigger | Detection | Mitigation | Recovery |
|---|---|---|---|---|
| Notification request 4xx | Recipient invalid | NotificationClient returns code | Mark subscription failure; emit report.delivered with subset success | Operator updates subscription |
| Notification request 5xx | Notification outage | NotificationClient returns 5xx | Retry with backoff; emit report.completed (delivery rolls separately) | Notification service's retry chain takes over |
| WebDAV/SFTP connect timeout | External endpoint down | Adapter timeout | Backoff; fail subscription channel after N attempts | Pause subscription; alert tenant |
2.7 Regulatory submission
| Mode | Trigger | Detection | Mitigation | Recovery |
|---|---|---|---|---|
| Adapter 4xx (rejected payload) | Wrong jurisdiction format | Adapter-specific validator | Mark failed, non-retriable; alert tenant.owner | Operator corrects template; re-attach artifact |
| Adapter timeout | Upstream slow | Circuit-breaker | Retry next attempt window | Self-heal once upstream OK |
| Submission missed (past statutory cutoff) | next_attempt_at < now() - cutoff | Cron monitor | Page on-call P1; emit regulatory.submission_failed.v1 with pastCutoff=true | Manual escalation; legal liaison |
| Receipt invalid (signature/hash mismatch) | Adapter returns malformed | Verifier | Mark failed retriable; quarantine receipt | Re-submit |
| Object lock prevents replace | Try to overwrite locked object | GCS error | Treat as success (existing artifact authoritative); log | Self-heal |
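
The "Submission missed" trigger, `next_attempt_at < now() - cutoff`, can be written out as a small check a cron monitor might run. This is a sketch under assumptions: the cutoff is passed in milliseconds, and the event fields other than `pastCutoff` (such as `submissionId`) are hypothetical.

```typescript
// Event shape is assumed; only the type name and pastCutoff appear in the table above.
type SubmissionFailedEvent = {
  type: "regulatory.submission_failed.v1";
  submissionId: string;
  pastCutoff: boolean;
};

// Direct transcription of: next_attempt_at < now() - cutoff
function isPastCutoff(nextAttemptAt: Date, now: Date, cutoffMs: number): boolean {
  return nextAttemptAt.getTime() < now.getTime() - cutoffMs;
}

// Returns the event to emit (and page on), or null when the submission is still within its window.
function checkSubmission(
  submissionId: string,
  nextAttemptAt: Date,
  now: Date,
  cutoffMs: number,
): SubmissionFailedEvent | null {
  if (!isPastCutoff(nextAttemptAt, now, cutoffMs)) return null;
  return { type: "regulatory.submission_failed.v1", submissionId, pastCutoff: true };
}
```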
2.8 Scheduling
| Mode | Trigger | Detection | Mitigation | Recovery |
|---|---|---|---|---|
| Cloud Scheduler double-fire | Provider deduplication miss | (scheduleId, fireAt) idempotency | Drop duplicate fire; metric increment | Self-heal |
| 5 consecutive failures | Run fails repeatedly | Schedule failure counter | Auto-disable; emit schedule.disabled.v1; notify tenant | Operator re-enables after fix |
| Time-zone DST gap | Wall-clock skip | Cron parser TZ-aware (luxon) | Use TZ-aware schedules | n/a |
| Scheduling collision (multiple schedules same minute per tenant) | Burst | Token bucket per tenant | Smooth via priority queue | Slight delay (≤ 60 s) |
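
The "5 consecutive failures" row implies a per-schedule counter that resets on success. A minimal sketch, assuming the counter lives alongside the schedule's enabled flag; `recordRunResult` and the state shape are hypothetical.

```typescript
const DISABLE_AFTER = 5; // threshold stated in the table above

type ScheduleState = { enabled: boolean; consecutiveFailures: number };
type Transition = { state: ScheduleState; events: string[] };

// Pure transition function: returns the next state and any events to emit.
function recordRunResult(prev: ScheduleState, succeeded: boolean): Transition {
  if (!prev.enabled) return { state: prev, events: [] };
  if (succeeded) {
    // A success breaks the streak.
    return { state: { enabled: true, consecutiveFailures: 0 }, events: [] };
  }
  const failures = prev.consecutiveFailures + 1;
  if (failures >= DISABLE_AFTER) {
    return {
      state: { enabled: false, consecutiveFailures: failures },
      events: ["schedule.disabled.v1"],
    };
  }
  return { state: { enabled: true, consecutiveFailures: failures }, events: [] };
}
```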
3. Backpressure & circuit breakers
| Dependency | Breaker | Half-open probe |
|---|---|---|
| analytics-service query API | Open after 10 errors / 60 s; close after 5 successes | 1 probe / 10 s |
| notification-service | Open after 20 errors / 60 s | 1 probe / 30 s |
| ai-orchestrator-service | Open after 5 errors / 30 s | 1 probe / 60 s; while open we skip AI |
| GCS upload | Soft breaker; switch to retry-with-cap; DO NOT open hard (would lose work) | per-request |
| Regulatory adapters | Per-jurisdiction breaker | scheduled retry |
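
The breaker parameters in the table can be read as a small state machine. The sketch below is one possible interpretation, not the service's implementation: it assumes a sliding error window, one probe per interval while open or half-open, a re-trip on any failed probe, and closing after the configured number of successful probes. All class and field names are hypothetical, and the clock is passed in so the behaviour is deterministic.

```typescript
type BreakerState = "closed" | "open" | "half-open";

type BreakerConfig = {
  errorThreshold: number;   // e.g. 10 for analytics-service
  windowMs: number;         // e.g. 60_000
  probeIntervalMs: number;  // e.g. 10_000
  successesToClose: number; // e.g. 5
};

class CircuitBreaker {
  private readonly cfg: BreakerConfig;
  private state: BreakerState = "closed";
  private errorTimes: number[] = [];
  private probeSuccesses = 0;
  private lastProbeAt = 0;

  constructor(cfg: BreakerConfig) {
    this.cfg = cfg;
  }

  getState(): BreakerState {
    return this.state;
  }

  // Closed: always allow. Otherwise allow one probe per probe interval.
  allowRequest(nowMs: number): boolean {
    if (this.state === "closed") return true;
    if (nowMs - this.lastProbeAt >= this.cfg.probeIntervalMs) {
      this.state = "half-open";
      this.lastProbeAt = nowMs;
      return true;
    }
    return false;
  }

  recordSuccess(): void {
    if (this.state !== "half-open") return;
    this.probeSuccesses += 1;
    if (this.probeSuccesses >= this.cfg.successesToClose) {
      this.state = "closed";
      this.probeSuccesses = 0;
      this.errorTimes = [];
    }
  }

  recordFailure(nowMs: number): void {
    if (this.state === "open") return;
    if (this.state === "half-open") {
      this.trip(nowMs); // a failed probe re-opens the breaker
      return;
    }
    // Keep only errors inside the sliding window, then add this one.
    this.errorTimes = this.errorTimes.filter((t) => nowMs - t < this.cfg.windowMs);
    this.errorTimes.push(nowMs);
    if (this.errorTimes.length >= this.cfg.errorThreshold) this.trip(nowMs);
  }

  private trip(nowMs: number): void {
    this.state = "open";
    this.probeSuccesses = 0;
    this.errorTimes = [];
    this.lastProbeAt = nowMs;
  }
}
```

For the GCS upload row, a hard breaker of this kind is deliberately not used; a capped retry loop keeps the rendered buffer alive instead of discarding it.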
4. DLQ policy
- Pub/Sub subscriptions to the internal reporting.run.dispatch topic have a DLQ at 5 attempts; messages land in melmastoon.reporting.run.dispatch.dlq.
- DLQ contents are surfaced in Grafana, and an automated daily summary opens a Linear ticket.
- DLQ replay is a manual operation via bin/replay-dlq.ts with --max=N and --dry-run flags.
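
As an illustration of the two replay flags, the sketch below parses `--max=N` and `--dry-run` from an argument list. It is not the contents of bin/replay-dlq.ts, which this document does not show. The function name, the rejection of unknown flags, and the requirement that N be a positive integer are assumptions.

```typescript
type ReplayOptions = { max: number | undefined; dryRun: boolean };

// Parse the two flags named above; anything else is rejected so that a typo
// cannot silently turn a dry run into a real replay.
function parseReplayArgs(argv: string[]): ReplayOptions {
  let max: number | undefined = undefined;
  let dryRun = false;
  for (const arg of argv) {
    if (arg === "--dry-run") {
      dryRun = true;
    } else if (arg.startsWith("--max=")) {
      const n = Number(arg.slice("--max=".length));
      if (!Number.isInteger(n) || n <= 0) {
        throw new Error(`--max expects a positive integer, got "${arg}"`);
      }
      max = n;
    } else {
      throw new Error(`unknown flag: ${arg}`);
    }
  }
  return { max, dryRun };
}
```

A call such as `parseReplayArgs(["--max=50", "--dry-run"])` would describe a rehearsal limited to 50 messages.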
5. Recovery & RTO/RPO
| Scenario | RTO | RPO |
|---|---|---|
| Single pod crash | seconds (Cloud Run) | 0 |
| Cloud SQL regional HA failover | ≤ 60 s | 0 |
| Region outage (residency-bound) | ≤ 4 h (manual failover to a standby region within the same residency) | ≤ 5 min (Cloud SQL PITR) |
| Regulatory bucket region outage | wait or restore from cross-region replica (where allowed) | 0 |
| Catastrophic logical bug corrupting templates | ≤ 30 min (revert template version + re-publish) | per-row OCC; no historical loss |
6. Manual interventions (runbooks)
| Action | Tool | Permission |
|---|---|---|
| Cancel a run | POST /api/v1/reports/runs/{id}:cancel | reports.author (caller's run) or admin |
| Re-attempt a failed run | POST /api/v1/reports/runs/{id}:retry | admin |
| Resolve regulatory submission manually | POST /api/v1/reports/regulatory/{id}:resolve | reports.regulatory_submitter + reason in audit |
| Replay DLQ batch | bin/replay-dlq.ts --max=… | platform on-call |
| Restore template version from audit | bin/restore-template.ts | reports.template_publisher |
Cross-references: SERVICE_RISK_REGISTER, OBSERVABILITY §7 alerts.