
FAILURE_MODES — reporting-service

Sibling: OBSERVABILITY · APPLICATION_LOGIC · SERVICE_RISK_REGISTER · platform anchor: docs/02 §11 Resilience

This document catalogues failure modes by lifecycle stage, with detection, mitigation, and recovery for each. Every error code referenced here is registered in the canonical registry (API_CONTRACTS §0.4 / docs/standards/ERROR_CODES).


1. Lifecycle stages

[REQUEST] → [PERSIST] → [QUEUE] → [QUERY] → [COMPOSE] → [RENDER] → [PERSIST_ARTIFACT] → [DELIVER] → [REGULATORY_SUBMIT (optional)] → [PUBLISH_EVENT]

Each stage maps to one or more failure modes below.


2. Failure mode catalogue

2.1 Request layer

| Mode | Trigger | Detection | Mitigation | Recovery |
| --- | --- | --- | --- | --- |
| Idempotency-Key collision | Same key, different payload | Conflict on (tenantId, idempotency_key) | Reject with MELMASTOON.IDEMPOTENCY.KEY_COLLISION (422) | Caller must change key |
| Property scope violation | JWT lacks property | RBAC+ABAC check | Reject MELMASTOON.REPORTING.PROPERTY_SCOPE_VIOLATION | Caller obtains scope |
| Filter validation failure | Required filter missing | Domain invariant | Reject MELMASTOON.REPORTING.FILTERS_INVALID (400 + RFC 7807) | Caller fixes input |
| Template archived | template.archived=true | Read check | Reject MELMASTOON.REPORTING.TEMPLATE_ARCHIVED (410) | Caller switches to active version |
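
The idempotency-key row can be illustrated with a minimal in-memory sketch. Only the error code, the 422 status, and the (tenantId, idempotency_key) scope come from the table; the class, the canonical-JSON comparison, and the "replayed" outcome for an identical payload are assumptions made for illustration, and a real implementation would rely on the database constraint rather than a Map.

```typescript
// Hypothetical sketch: every name here is illustrative, not the service's API.
type Json = null | boolean | number | string | Json[] | { [k: string]: Json };

// Serialise with sorted object keys so logically equal payloads compare equal.
function canonicalize(value: Json): string {
  if (value === null || typeof value !== "object") return JSON.stringify(value);
  if (Array.isArray(value)) return "[" + value.map((v) => canonicalize(v)).join(",") + "]";
  const keys = Object.keys(value).sort();
  return "{" + keys.map((k) => JSON.stringify(k) + ":" + canonicalize(value[k])).join(",") + "}";
}

type IdempotencyResult =
  | { kind: "accepted" }
  | { kind: "replayed" } // assumption: same key + same payload is a harmless replay
  | { kind: "rejected"; status: 422; code: "MELMASTOON.IDEMPOTENCY.KEY_COLLISION" };

class IdempotencyStore {
  // Stand-in for a table with a UNIQUE constraint on (tenantId, idempotency_key).
  private readonly seen = new Map<string, string>();

  check(tenantId: string, key: string, payload: Json): IdempotencyResult {
    const slot = JSON.stringify([tenantId, key]);
    const fingerprint = canonicalize(payload);
    const previous = this.seen.get(slot);
    if (previous === undefined) {
      this.seen.set(slot, fingerprint);
      return { kind: "accepted" };
    }
    if (previous === fingerprint) return { kind: "replayed" };
    // Same key, different payload: the collision case from the table.
    return { kind: "rejected", status: 422, code: "MELMASTOON.IDEMPOTENCY.KEY_COLLISION" };
  }
}
```

Keys are scoped per tenant, so two tenants may reuse the same key without colliding.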

2.2 Persistence

| Mode | Trigger | Detection | Mitigation | Recovery |
| --- | --- | --- | --- | --- |
| Cloud SQL connection exhaustion | Burst | pg_stat_activity saturation, app pool wait | PgBouncer transaction pooling, app circuit-breaker | Autoscale connections; back off; alert P2 |
| RLS misconfiguration | Missing policy | CI migration check + integration test fail | Block deploy | n/a |
| Outbox publish lag | Pub/Sub down | reporting_outbox_lag_seconds > 60 | Outbox publisher retries with exponential backoff | Once Pub/Sub healthy, drain queue; alert P2 |
| Inbox dedupe miss | Concurrent delivery to two pods | UNIQUE on (subject, message_id) | Insert before effect; rollback effect on conflict | Self-healing |
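
The "insert before effect" mitigation for inbox dedupe might look roughly like the sketch below. It simulates the UNIQUE (subject, message_id) constraint with a Set; the class and function names are hypothetical, and in a real Postgres-backed service the inbox insert and the effect would presumably share one transaction so that a failed effect rolls the inbox row back.

```typescript
// Hypothetical in-memory stand-in for the inbox table.
class InboxTable {
  private readonly rows = new Set<string>();

  // Returns false when the pair already exists, mimicking a UNIQUE violation.
  insert(subject: string, messageId: string): boolean {
    const key = JSON.stringify([subject, messageId]);
    if (this.rows.has(key)) return false;
    this.rows.add(key);
    return true;
  }

  // Used to mimic a transaction rollback when the effect fails.
  remove(subject: string, messageId: string): void {
    this.rows.delete(JSON.stringify([subject, messageId]));
  }
}

// Claims the message first, then runs the effect. A duplicate delivery never
// reaches the effect; a failed effect releases the claim so redelivery can retry.
function handleOnce(
  inbox: InboxTable,
  subject: string,
  messageId: string,
  effect: () => void,
): "applied" | "duplicate" {
  if (!inbox.insert(subject, messageId)) return "duplicate";
  try {
    effect();
    return "applied";
  } catch (err) {
    inbox.remove(subject, messageId);
    throw err;
  }
}
```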

2.3 Query (analytics dependency)

| Mode | Trigger | Detection | Mitigation | Recovery |
| --- | --- | --- | --- | --- |
| AnalyticsClient timeout | BigQuery slot starvation | Per-query span > deadline | Retry with jitter (≤ 3); else fail run with MELMASTOON.REPORTING.UPSTREAM_ANALYTICS_TIMEOUT retriable | Worker reschedules via Pub/Sub redelivery |
| AnalyticsClient 5xx | Service outage | HTTP/5xx spike | Open circuit-breaker; fail fast for 60 s | Circuit half-opens automatically |
| Result too large | Row count > template rowCap | Stream count check | Abort with MELMASTOON.REPORTING.RESULT_TOO_LARGE non-retriable | Caller narrows filters |
| Schema drift in projection | New column missing | Drizzle/AnalyticsClient typed accessor throws | Fail with MELMASTOON.REPORTING.PROJECTION_SCHEMA_MISMATCH; alert | Pin earlier projection version; coordinate with analytics-service |
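
As an illustration of the "retry with jitter (≤ 3)" mitigation, the decision taken after each timeout can be written as a small pure function. This is a sketch under assumptions: the cap of three retries and the error code come from the table, while the "full jitter" formula, the base and maximum delays, and all type names are illustrative choices.

```typescript
// Hypothetical policy shape; baseDelayMs and maxDelayMs are not specified in the catalogue.
interface RetryPolicy {
  maxRetries: number;
  baseDelayMs: number;
  maxDelayMs: number;
}

type RetryDecision =
  | { action: "retry"; delayMs: number }
  | { action: "fail"; code: "MELMASTOON.REPORTING.UPSTREAM_ANALYTICS_TIMEOUT"; retriable: true };

// Decides what to do after a timeout, given how many retries were already used.
// Full jitter: the delay is uniform in [0, min(maxDelayMs, baseDelayMs * 2^retriesUsed)).
// `random` is injectable so the function can be tested deterministically.
function decideAfterTimeout(
  retriesUsed: number,
  policy: RetryPolicy,
  random: () => number = Math.random,
): RetryDecision {
  if (retriesUsed >= policy.maxRetries) {
    // Retries exhausted: fail the run as retriable so Pub/Sub redelivery can reschedule it.
    return { action: "fail", code: "MELMASTOON.REPORTING.UPSTREAM_ANALYTICS_TIMEOUT", retriable: true };
  }
  const ceiling = Math.min(policy.maxDelayMs, policy.baseDelayMs * Math.pow(2, retriesUsed));
  return { action: "retry", delayMs: Math.floor(random() * ceiling) };
}
```

Keeping the decision separate from the sleeping and calling code means the same function could serve any caller that owns its own timer.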

2.4 Compose & render

| Mode | Trigger | Detection | Mitigation | Recovery |
| --- | --- | --- | --- | --- |
| Renderer OOM | Large dataset, large images | Pod OOMKilled / heap warn | Stream rows; cap pages; reject pages > 5000 with MELMASTOON.REPORTING.RENDER_PAGE_LIMIT | Lower template rowCap; rerun |
| Puppeteer crash | Chromium fatal | Process exit watcher | Restart Chromium; retry render once; else fail retriable | Pub/Sub redeliver |
| Font missing for locale | Locale glyphs not in image | Renderer glyph warning | Fallback to platform default font + emit warning | Update image; rebuild |
| AI step latency exceeded | aiClient.invoke timeout | OTel span | Skip callouts; emit report.ai_skipped.v1 | Run completes without AI annotations |
| Template invariant violated at render | Layout block references missing column | Renderer pre-flight | Fail run MELMASTOON.REPORTING.TEMPLATE_INVARIANT_FAILED non-retriable | Author fixes template version |
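
The renderer pre-flight in the last row can be sketched as a check that every column referenced by a layout block exists in the query result. The error code and the non-retriable classification come from the table; the LayoutBlock shape and the list of missing references in the failure object are assumptions for illustration.

```typescript
// Hypothetical template shape: each layout block names the result columns it reads.
interface LayoutBlock {
  id: string;
  columns: string[];
}

interface PreflightFailure {
  code: "MELMASTOON.REPORTING.TEMPLATE_INVARIANT_FAILED";
  retriable: false;
  missing: { blockId: string; column: string }[];
}

// Returns null when every referenced column is present, otherwise a failure
// listing each (block, column) pair that cannot be resolved.
function preflight(blocks: LayoutBlock[], resultColumns: string[]): PreflightFailure | null {
  const available = new Set(resultColumns);
  const missing: { blockId: string; column: string }[] = [];
  for (const block of blocks) {
    for (const column of block.columns) {
      if (!available.has(column)) missing.push({ blockId: block.id, column });
    }
  }
  if (missing.length === 0) return null;
  return { code: "MELMASTOON.REPORTING.TEMPLATE_INVARIANT_FAILED", retriable: false, missing };
}
```

Collecting all missing references in one pass, rather than stopping at the first, gives the template author a complete list to fix.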

2.5 Persist artifact

| Mode | Trigger | Detection | Mitigation | Recovery |
| --- | --- | --- | --- | --- |
| GCS 5xx burst | GCS regional incident | Upload error | Retry with backoff (5 attempts); preserve buffer | Mark run failed retriable; reschedule |
| KMS unavailable for CMEK | KMS regional incident | API error | Retry; if persists fail retriable | Recovers when KMS restored |
| Object name collision | Same SHA same path (rare) | Pre-check gcsExists | Skip upload; reuse existing artifact metadata | Self-healing |

2.6 Deliver

| Mode | Trigger | Detection | Mitigation | Recovery |
| --- | --- | --- | --- | --- |
| Notification request 4xx | Recipient invalid | NotificationClient returns code | Mark subscription failure; emit report.delivered with subset success | Operator updates subscription |
| Notification request 5xx | Notification outage | Retry with backoff | Emit report.completed (delivery rolls separately) | Notification service retries chain |
| WebDAV/SFTP connect timeout | External endpoint down | Adapter timeout | Backoff; fail subscription channel after N attempts | Pause subscription; alert tenant |

2.7 Regulatory submission

| Mode | Trigger | Detection | Mitigation | Recovery |
| --- | --- | --- | --- | --- |
| Adapter 4xx (rejected payload) | Wrong jurisdiction format | Adapter-specific validator | Mark failed, non-retriable; alert tenant.owner | Operator corrects template; re-attach artifact |
| Adapter timeout | Upstream slow | Circuit-breaker | Retry next attempt window | Self-heal once upstream OK |
| Submission missed (past statutory cutoff) | next_attempt_at < now() - cutoff | Cron monitor | Page on-call P1; emit regulatory.submission_failed.v1 with pastCutoff=true | Manual escalation; legal liaison |
| Receipt invalid (signature/hash mismatch) | Adapter returns malformed | Verifier | Mark failed retriable; quarantine receipt | Re-submit |
| Object lock prevents replace | Try to overwrite locked object | GCS error | Treat as success (existing artifact authoritative); log | Self-heal |
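
The cron monitor's trigger condition, next_attempt_at < now() - cutoff, can be expressed as a small filter. The event type name and the pastCutoff=true flag come from the table; the record and event shapes, the function name, and passing the cutoff as milliseconds are assumptions made for this sketch.

```typescript
// Hypothetical shapes; the real submission record and event payload may differ.
interface PendingSubmission {
  id: string;
  nextAttemptAt: Date;
}

interface SubmissionFailedEvent {
  type: "regulatory.submission_failed.v1";
  submissionId: string;
  pastCutoff: true;
}

// Selects submissions whose next attempt is older than (now - cutoff) and
// builds the event the monitor would emit for each of them.
function findMissedSubmissions(
  pending: PendingSubmission[],
  now: Date,
  cutoffMs: number,
): SubmissionFailedEvent[] {
  const threshold = now.getTime() - cutoffMs;
  return pending
    .filter((s) => s.nextAttemptAt.getTime() < threshold)
    .map((s): SubmissionFailedEvent => ({
      type: "regulatory.submission_failed.v1",
      submissionId: s.id,
      pastCutoff: true,
    }));
}
```

Passing `now` in as a parameter, instead of reading the clock inside the function, keeps the check reproducible in tests.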

2.8 Scheduling

| Mode | Trigger | Detection | Mitigation | Recovery |
| --- | --- | --- | --- | --- |
| Cloud Scheduler double-fire | Provider deduplication miss | (scheduleId, fireAt) idempotency | Drop duplicate fire; metric increment | Self-heal |
| 5 consecutive failures | Run fails repeatedly | Schedule failure counter | Auto-disable; emit schedule.disabled.v1; notify tenant | Operator re-enables after fix |
| Time-zone DST gap | Wall-clock skip | Cron parser TZ-aware (luxon) | Use TZ-aware schedules | n/a |
| Scheduling collision (multiple schedules same minute per tenant) | Burst | Token bucket per tenant | Smooth via priority queue | Slight delay (≤ 60 s) |
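
The per-tenant token bucket named in the scheduling-collision row could be sketched as follows. The capacity and refill rate are not given in this document and are plain constructor arguments here; the class name and the explicit clock parameter are likewise illustrative assumptions.

```typescript
// Hypothetical per-tenant token bucket; a fire that cannot take a token would
// be handed to the priority queue rather than dropped.
class TenantTokenBuckets {
  private readonly buckets = new Map<string, { tokens: number; updatedAtMs: number }>();
  private readonly capacity: number;
  private readonly refillPerSecond: number;

  constructor(capacity: number, refillPerSecond: number) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
  }

  // Returns true when the fire may start at nowMs, false when it should wait.
  tryTake(tenantId: string, nowMs: number): boolean {
    // A tenant seen for the first time starts with a full bucket.
    const bucket = this.buckets.get(tenantId) ?? { tokens: this.capacity, updatedAtMs: nowMs };
    // Refill in proportion to elapsed time, never beyond capacity.
    const elapsedSeconds = Math.max(0, nowMs - bucket.updatedAtMs) / 1000;
    bucket.tokens = Math.min(this.capacity, bucket.tokens + elapsedSeconds * this.refillPerSecond);
    bucket.updatedAtMs = nowMs;
    this.buckets.set(tenantId, bucket);
    if (bucket.tokens < 1) return false;
    bucket.tokens -= 1;
    return true;
  }
}
```

With a capacity of 2 and a refill of one token per second, a tenant can start two runs at once and then one more per second, while other tenants are unaffected.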

3. Backpressure & circuit breakers

| Dependency | Breaker | Half-open probe |
| --- | --- | --- |
| analytics-service query API | Open after 10 errors / 60 s; closed after 5 successes | 1 probe / 10 s |
| notification-service | Open after 20 errors / 60 s | 1 probe / 30 s |
| ai-orchestrator-service | Open after 5 errors / 30 s | 1 probe / 60 s; while open we skip AI |
| GCS upload | Soft breaker; switch to retry-with-cap; DO NOT open hard (would lose work) | per-request |
| Regulatory adapters | Per-jurisdiction breaker | scheduled retry |
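
To make the breaker parameters concrete, here is a minimal sketch of a breaker that opens after N errors within a window, admits one probe per probe interval while open, and closes after M successful probes. The thresholds map onto the table (for example 10 errors / 60 s, 1 probe / 10 s, 5 successes for analytics-service); the class itself, and the reading of "closed after 5 successes" as five successful probes with any failed probe re-opening the breaker, are assumptions.

```typescript
interface BreakerConfig {
  errorThreshold: number;   // e.g. 10 for analytics-service
  windowMs: number;         // e.g. 60_000
  probeIntervalMs: number;  // e.g. 10_000
  successesToClose: number; // e.g. 5
}

type BreakerState = "closed" | "open" | "half-open";

// Hypothetical breaker; time is passed in explicitly to keep it deterministic.
class CircuitBreaker {
  private state: BreakerState = "closed";
  private errorTimestamps: number[] = [];
  private lastProbeAtMs = 0;
  private probeSuccesses = 0;
  private readonly config: BreakerConfig;

  constructor(config: BreakerConfig) {
    this.config = config;
  }

  getState(): BreakerState {
    return this.state;
  }

  // While not closed, admits at most one probe per probe interval.
  allowRequest(nowMs: number): boolean {
    if (this.state === "closed") return true;
    if (nowMs - this.lastProbeAtMs >= this.config.probeIntervalMs) {
      this.state = "half-open";
      this.lastProbeAtMs = nowMs;
      return true;
    }
    return false;
  }

  recordSuccess(): void {
    if (this.state !== "half-open") return;
    this.probeSuccesses += 1;
    if (this.probeSuccesses >= this.config.successesToClose) {
      this.state = "closed";
      this.errorTimestamps = [];
      this.probeSuccesses = 0;
    }
  }

  recordFailure(nowMs: number): void {
    if (this.state === "open") return;
    if (this.state === "half-open") {
      this.trip(nowMs); // a failed probe re-opens the breaker
      return;
    }
    // Keep only errors inside the sliding window, then count the new one.
    this.errorTimestamps = this.errorTimestamps.filter((t) => nowMs - t < this.config.windowMs);
    this.errorTimestamps.push(nowMs);
    if (this.errorTimestamps.length >= this.config.errorThreshold) this.trip(nowMs);
  }

  private trip(nowMs: number): void {
    this.state = "open";
    this.lastProbeAtMs = nowMs;
    this.probeSuccesses = 0;
  }
}
```

The soft breaker for GCS upload deliberately does not fit this shape, since the table says it must never open hard.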

4. DLQ policy

  • Pub/Sub subscriptions to the internal reporting.run.dispatch topic dead-letter a message after 5 delivery attempts; dead-lettered messages land in melmastoon.reporting.run.dispatch.dlq.
  • DLQ contents are surfaced in Grafana, and an automated daily summary opens a Linear ticket.
  • DLQ replay is a manual operation via bin/replay-dlq.ts with --max=N and --dry-run flags.
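
For orientation, the two documented flags could be parsed along the lines below. This is not the contents of bin/replay-dlq.ts, which this document does not show; the function name, the rule that --max is required and must be a positive integer, and the rejection of unknown arguments are all assumptions.

```typescript
// Hypothetical option parsing for a replay script with --max=N and --dry-run.
interface ReplayOptions {
  max: number;      // upper bound on messages to replay in this batch
  dryRun: boolean;  // when true, list what would be replayed without publishing
}

function parseReplayArgs(argv: string[]): ReplayOptions {
  const prefix = "--max=";
  let max: number | undefined;
  let dryRun = false;
  for (const arg of argv) {
    if (arg === "--dry-run") {
      dryRun = true;
    } else if (arg.indexOf(prefix) === 0) {
      const raw = arg.slice(prefix.length);
      // Assumption: reject zero, negatives, and non-numeric values outright.
      if (!/^[1-9]\d*$/.test(raw)) {
        throw new Error(`--max must be a positive integer, got "${raw}"`);
      }
      max = Number(raw);
    } else {
      throw new Error(`unknown argument: ${arg}`);
    }
  }
  // Assumption: an explicit bound is mandatory so a replay is never unbounded.
  if (max === undefined) throw new Error("--max=N is required");
  return { max, dryRun };
}
```

Under these assumptions, `parseReplayArgs(["--max=50", "--dry-run"])` yields a bounded dry run, which is a natural first step before replaying for real.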

5. Recovery & RTO/RPO

| Scenario | RTO | RPO |
| --- | --- | --- |
| Single pod crash | seconds (Cloud Run) | 0 |
| Cloud SQL regional HA failover | ≤ 60 s | 0 |
| Region outage (residency-bound) | ≤ 4 h (manual fail-over to standby region same residency) | ≤ 5 min Cloud SQL PITR |
| Regulatory bucket region outage | wait or restore from cross-region replica (where allowed) | 0 |
| Catastrophic logical bug corrupting templates | ≤ 30 min (revert template version + re-publish) | per-row OCC; no historical loss |

6. Manual interventions (runbooks)

| Action | Tool | Permission |
| --- | --- | --- |
| Cancel a run | POST /api/v1/reports/runs/{id}:cancel | reports.author (caller's run) or admin |
| Re-attempt a failed run | POST /api/v1/reports/runs/{id}:retry | admin |
| Resolve regulatory submission manually | POST /api/v1/reports/regulatory/{id}:resolve | reports.regulatory_submitter + reason in audit |
| Replay DLQ batch | bin/replay-dlq.ts --max=… | platform on-call |
| Restore template version from audit | bin/restore-template.ts | reports.template_publisher |

Cross-references: SERVICE_RISK_REGISTER, OBSERVABILITY §7 alerts.