
FAILURE_MODES — analytics-service

Sibling: APPLICATION_LOGIC · DATA_MODEL · OBSERVABILITY · platform anchors: docs/02 §15 Resilience, docs/standards/ERROR_CODES

Failure modes are cataloged by lifecycle stage. Error codes take the form MELMASTOON.ANALYTICS.<CODE> (see ERROR_CODES).


1. Pub/Sub sink (raw event landing)

| # | Failure | Trigger | Detection | Mitigation | Recovery |
|---|---------|---------|-----------|------------|----------|
| S1 | Schema validation fails for incoming event | Drift in publisher | Sink span error + DLQ ack | Route to analytics.dlq topic; emit data_quality.alert.v1; log subject + version | Drift owner publishes corrected payload or registers v2; DLQ replay tool |
| S2 | BigQuery streaming insert 5xx burst | BigQuery service degradation | Insert error rate metric; SLO burn | Exponential backoff up to 60 s; spill to GCS staging bucket and load via batch when service recovers | Backfill from staging once healthy |
| S3 | Sink falls behind (Pub/Sub lag) | Spike in ingest | oldest_unacked_message_age > 5 min | Autoscale up; raise concurrency; pre-create partitions; alert P2 | Drain backlog; if persistent, grow reserved slots |
| S4 | Tenant id missing in envelope | Publisher bug | Envelope validator rejects | DLQ + critical alert; do not synthesize tenant id | Publisher fix; replay from DLQ once envelope corrected |
| S5 | OIDC verification fails on push | Misconfigured subscription | 401 in handler | Reject; alert security; rotate OIDC bindings | Re-subscribe with correct audience |
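
Rows S1 and S4 amount to a routing decision made before any insert is attempted. The sketch below is illustrative only: the `Envelope` fields, the `routeEvent` name, the reason strings, and the reduction of schema validation to a known-version lookup are all assumptions, not the service's actual contract.

```typescript
// Hypothetical envelope shape; field names are assumptions for illustration.
interface Envelope {
  eventId: string;
  subject: string;
  schemaVersion: number;
  tenantId?: string;
  payload: unknown;
}

type Route =
  | { kind: "accept" }
  | {
      kind: "dlq";
      reason: "TENANT_ID_MISSING" | "SCHEMA_INVALID";
      alert: "critical" | "data_quality";
    };

// Decide where a raw event goes. A missing tenant id is never synthesized (S4).
function routeEvent(
  env: Envelope,
  knownVersions: ReadonlyMap<string, ReadonlySet<number>>,
): Route {
  if (env.tenantId === undefined || env.tenantId.trim() === "") {
    return { kind: "dlq", reason: "TENANT_ID_MISSING", alert: "critical" };
  }
  // Stand-in for real schema validation: is this subject/version registered?
  const versions = knownVersions.get(env.subject);
  if (versions === undefined || !versions.has(env.schemaVersion)) {
    // S1: the caller would log subject + version and publish to analytics.dlq.
    return { kind: "dlq", reason: "SCHEMA_INVALID", alert: "data_quality" };
  }
  return { kind: "accept" };
}
```

The tenant check runs first so that an event missing its tenant id always raises the critical alert, even when its schema version is also unknown.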

2. ETL & curated layer

| # | Failure | Trigger | Detection | Mitigation | Recovery |
|---|---------|---------|-----------|------------|----------|
| E1 | ETL job exceeds time budget | Backfill or hot day | Job duration > SLO | Workflow timeout 60 min; mark failed; emit etl.failed.v1 | Manual rerun with tighter window or larger slot reservation |
| E2 | MERGE produces duplicates | Bad join keys after schema drift | DQ row-count check fails | Pause downstream metrics; alert P2 | Fix _join_key; rerun for affected partitions; coexistence v1/v2 |
| E3 | Source table not yet landed | Sink lag exceeds ETL trigger | Pre-flight freshness check fails | Skip + reschedule with backoff (max 3); emit etl.failed.v1 reason SOURCE_NOT_READY | Auto-retry until fresh |
| E4 | Slot exhaustion | Concurrent expensive jobs | BigQuery RESOURCES_EXCEEDED | Reservation autoscale; queue jobs by priority; lower-priority jobs throttled | Tune priorities; consider per-tenant reservation |
| E5 | Forecast writeback partial fail | Validation fails mid-batch | Per-row error map | Commit valid rows; emit MELMASTOON.ANALYTICS.FORECAST_PARTIAL event with errors; orchestrator retries failed rows | Investigate row reasons; orchestrator retries are idempotent |
| E6 | DQ critical fails | Bad upstream data | DQ engine writes dq_results; alert | Block dependent metrics from publishing metric.computed.v1; emit data_quality.alert.v1 | Owner remediates; rerun ETL when source fixed |
| E7 | Backfill drifts cost | Wide window | Cost monitor anomaly | Pre-flight dry-run; cap window; chunked backfill | Resume in chunks; budget approval |
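
The "cap window; chunked backfill" mitigation in E7 could look roughly like the following. `BackfillWindow`, `chunkBackfill`, and both limits are hypothetical names and parameters chosen for illustration; dry-run costing and the actual job orchestration are not shown.

```typescript
// Half-open time range [start, end). Name and shape are assumptions.
interface BackfillWindow {
  start: Date;
  end: Date;
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Split a backfill window into fixed-size chunks, refusing windows above a cap.
function chunkBackfill(
  win: BackfillWindow,
  chunkDays: number,
  maxDays: number,
): BackfillWindow[] {
  if (chunkDays <= 0) throw new RangeError("chunkDays must be positive");
  const startMs = win.start.getTime();
  const endMs = win.end.getTime();
  if (endMs <= startMs) return [];
  if (endMs - startMs > maxDays * DAY_MS) {
    throw new RangeError(`window exceeds cap of ${maxDays} days`);
  }
  const chunks: BackfillWindow[] = [];
  for (let t = startMs; t < endMs; t += chunkDays * DAY_MS) {
    // The final chunk is clipped to the end of the requested window.
    const chunkEnd = Math.min(t + chunkDays * DAY_MS, endMs);
    chunks.push({ start: new Date(t), end: new Date(chunkEnd) });
  }
  return chunks;
}
```

Because each chunk is an independent half-open range, a backfill interrupted by a cost anomaly can resume from the first chunk that has not completed, which matches the "resume in chunks" recovery step.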

3. Query API (widget runs / ad-hoc)

| # | Failure | Trigger | Detection | Mitigation | Recovery |
|---|---------|---------|-----------|------------|----------|
| Q1 | Byte cap exceeded (dry-run) | Wide filter / no partition | Pre-flight bytes_processed > cap | Reject 400 MELMASTOON.ANALYTICS.QUERY_BYTE_CAP_EXCEEDED with suggestion to narrow filters | Caller refines; or raises cap with analytics.budget_admin |
| Q2 | Cross-tenant attempt | SQL injection in params or body tenant override | Param-binding validator + UDF | 403 PROPERTY_SCOPE_VIOLATION or CROSS_TENANT_DENIED; security alert | Audit logs; tenant lockout if repeated |
| Q3 | BigQuery 5xx | Backend hiccup | Retry policy + circuit breaker | Retry up to 3 with backoff; on breaker open, return 503 with Retry-After | Auto-recover when healthy |
| Q4 | Latency breach (uncached p95 > 8 s) | Cold cache + heavy query | SLI alert | Suggest cached widget alternative; increase materialization cadence; tune partitioning | Re-tune metric SQL; add intermediate projection |
| Q5 | Schema version mismatch (@schema_version pinned) | Curated table evolved | MELMASTOON.ANALYTICS.SCHEMA_VERSION_MISMATCH | Return 400 with current version; auto-rebind on next query | Caller upgrades pin; coexistence period covers it |
| Q6 | Metric archived but referenced by widget | Lifecycle mismatch | 410 METRIC_ARCHIVED | Surface in dashboard UI as deprecated; force migration | Dashboard owner replaces metric |
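
As a rough illustration of Q1, the pre-flight check compares the dry-run byte estimate with the cap and rejects before the query runs. The `DryRunResult` shape, the `checkByteCap` name, and the hint wording are assumptions; how the dry-run estimate is obtained from BigQuery is out of scope here.

```typescript
// Hypothetical result of a pre-flight dry run; only the byte estimate matters here.
interface DryRunResult {
  bytesProcessed: number;
}

type ByteCapVerdict =
  | { allowed: true }
  | {
      allowed: false;
      status: 400;
      code: "MELMASTOON.ANALYTICS.QUERY_BYTE_CAP_EXCEEDED";
      hint: string;
    };

// Reject only when the estimate is strictly greater than the cap (bytes_processed > cap).
function checkByteCap(dryRun: DryRunResult, capBytes: number): ByteCapVerdict {
  if (dryRun.bytesProcessed > capBytes) {
    return {
      allowed: false,
      status: 400,
      code: "MELMASTOON.ANALYTICS.QUERY_BYTE_CAP_EXCEEDED",
      hint:
        `Estimated ${dryRun.bytesProcessed} bytes exceeds cap of ${capBytes}; ` +
        "narrow the date range or add a partition filter.",
    };
  }
  return { allowed: true };
}
```

A query that lands exactly on the cap is allowed in this sketch, following the strict `>` in the Detection column.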

4. Dashboards / widgets / saved queries

| # | Failure | Trigger | Detection | Mitigation | Recovery |
|---|---------|---------|-----------|------------|----------|
| D1 | OCC conflict on dashboard update | Concurrent edits | 409 OPTIMISTIC_LOCK | Client refetches If-Match and retries | UX surfaces merge prompt |
| D2 | Widget dimension references missing dim | Schema drift | Pre-publish lint | Reject save; surface available dims | Author updates widget |
| D3 | Saved query parameter has non-bound interpolation | Author wrote ${var} | Parser blocks save | Reject 400 SAVED_QUERY_INVALID | Author switches to @param |
| D4 | Looker embed token revoked mid-session | Binding removed | 401 from broker | Force token refresh; if still revoked, sign the user out of the embed | Tenant admin re-grants |
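
The D3 check can be approximated with a small lint that rejects `${var}` interpolation and collects bound `@param` names. This is a sketch under assumptions: `lintSavedQuery` is a hypothetical name, and a regex scan is a simplification of the parser the table refers to (it would also flag `${...}` or `@name` inside string literals).

```typescript
type LintResult =
  | { ok: true; params: string[] }
  | { ok: false; status: 400; code: "SAVED_QUERY_INVALID"; offenders: string[] };

// Reject string interpolation; accept only bound parameters of the form @name.
function lintSavedQuery(sql: string): LintResult {
  const offenders: string[] = sql.match(/\$\{[^}]*\}/g) ?? [];
  if (offenders.length > 0) {
    return { ok: false, status: 400, code: "SAVED_QUERY_INVALID", offenders };
  }
  const found: string[] = sql.match(/@[A-Za-z_][A-Za-z0-9_]*/g) ?? [];
  // De-duplicate and strip the leading "@" so callers get plain parameter names.
  const params = Array.from(new Set(found.map((p) => p.slice(1))));
  return { ok: true, params };
}
```

Returning the offending fragments lets the editor point the author at the exact text to replace with a bound parameter.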

5. Pub/Sub (control-plane events consumed)

| # | Failure | Trigger | Detection | Mitigation | Recovery |
|---|---------|---------|-----------|------------|----------|
| C1 | tenant.deleted.v1 re-delivered | Pub/Sub at-least-once | Inbox dedupe | Idempotent purge; second run no-op | OK |
| C2 | ai.forecast.produced.v1 invalid envelope | Orchestrator bug | Envelope validator | DLQ + alert; emit MELMASTOON.ANALYTICS.FORECAST_INVALID_ENVELOPE | Orchestrator fix; replay |
| C3 | Forecast tenant mismatch | Routing bug | Per-row tenant check | Reject batch; emit FORECAST_INVALID_TENANT | Orchestrator fix |
| C4 | Backlog of forecast events | Model retrain bursts | Subscription lag | Autoscale ETL worker; raise concurrency cap temporarily | Burn down backlog |
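
The inbox dedupe in C1 can be pictured as "claim the event id, then act". The sketch below keeps the inbox in memory purely for illustration; `Inbox`, `claim`, `release`, and `handleTenantDeleted` are hypothetical names, and a production inbox would more plausibly be a Postgres table with a unique constraint, written in the same transaction as the purge.

```typescript
// In-memory stand-in for an inbox table keyed by eventId (an assumption).
class Inbox {
  private readonly seen = new Set<string>();

  // Returns true the first time an eventId is claimed, false on redelivery.
  claim(eventId: string): boolean {
    if (this.seen.has(eventId)) return false;
    this.seen.add(eventId);
    return true;
  }

  // Undo a claim so a failed handler can be retried on the next delivery.
  release(eventId: string): void {
    this.seen.delete(eventId);
  }
}

function handleTenantDeleted(
  inbox: Inbox,
  evt: { eventId: string; tenantId: string },
  purge: (tenantId: string) => void,
): "purged" | "noop" {
  if (!inbox.claim(evt.eventId)) return "noop"; // redelivery: second run is a no-op
  try {
    purge(evt.tenantId);
  } catch (err) {
    inbox.release(evt.eventId); // let Pub/Sub redeliver and retry the purge
    throw err;
  }
  return "purged";
}
```

Releasing the claim on failure keeps at-least-once delivery useful: a purge that throws is retried on redelivery instead of being silently skipped.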

6. Postgres metadata

| # | Failure | Trigger | Detection | Mitigation | Recovery |
|---|---------|---------|-----------|------------|----------|
| P1 | Pool exhaustion | Spike in dashboard reads | Pool gauge | Autoscale Cloud Run; PgBouncer transaction pooling; reject with 503 | Capacity tune |
| P2 | RLS misconfig | Migration mistake | Tenant-isolation integration test | Block release | Hotfix migration |
| P3 | Primary failover | Cloud SQL maintenance | Health degraded | Read-only mode with cached widget data; queue writes | Resume on failover complete |

7. Outbox / publish

| # | Failure | Trigger | Detection | Mitigation | Recovery |
|---|---------|---------|-----------|------------|----------|
| O1 | Pub/Sub publish fails | Network blip | Outbox row stays pending | Worker retries with backoff; surface outbox.lag metric | Auto-recover |
| O2 | Duplicate publish | Worker crash mid-publish | Consumer dedupe via eventId | Inbox guard | OK |
| O3 | Outbox lag > SLO | Worker bug or volume spike | Lag alert | Scale worker; investigate poison messages | Drain |
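
One pass of an outbox worker, covering O1 and hinting at why O2 can happen, might be sketched as follows. The `OutboxRow` shape and `drainOutbox` are hypothetical, the publish call is synchronous only to keep the example self-contained, and backoff between passes is omitted.

```typescript
// Hypothetical outbox row; column names are assumptions.
interface OutboxRow {
  eventId: string;
  payload: string;
  status: "pending" | "sent";
  attempts: number;
}

// Try to publish every pending row once. `publish` is expected to throw on failure.
function drainOutbox(
  rows: OutboxRow[],
  publish: (row: OutboxRow) => void,
): { sent: number; stillPending: number } {
  let sent = 0;
  for (const row of rows) {
    if (row.status !== "pending") continue;
    row.attempts += 1;
    try {
      publish(row);
      // A crash after publish but before this update would republish the row
      // on the next pass (O2), which is why consumers dedupe on eventId.
      row.status = "sent";
      sent += 1;
    } catch {
      // O1: leave the row pending so the next pass retries it.
    }
  }
  const stillPending = rows.filter((r) => r.status === "pending").length;
  return { sent, stillPending };
}
```

The `stillPending` count is the kind of value an outbox.lag metric could be derived from, and a row whose `attempts` keeps climbing is a candidate poison message (O3).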

8. AI integration

| # | Failure | Trigger | Detection | Mitigation | Recovery |
|---|---------|---------|-----------|------------|----------|
| A1 | Orchestrator timeout | Model overloaded | ai.invoke span error | Fallback to no-AI path (e.g., metric explainer hidden); UX banner | Auto-recover |
| A2 | Budget exhausted | Tenant cap reached | MELMASTOON.AI.BUDGET_EXHAUSTED | Disable AI surface; warn admin | Admin tops up budget |
| A3 | Off-switch enabled | Tenant disabled capability | 403 from orchestrator | Hide AI surfaces server-side | Admin re-enables |
| A4 | Forecast write-back schema mismatch | Model output format change | Validator | DLQ batch; alert | Orchestrator versions model output |

9. Cross-cutting policies

  • Retries. Idempotent operations: 3 attempts with jittered exponential backoff (250 ms → 4 s).
  • Circuit breakers. Open at ≥ 50 % errors over a 30 s window; half-open after 30 s.
  • DLQ policy. Pub/Sub DLQ analytics.dlq with 14-day retention; replayer in tools/dlq-replay.ts.
  • Backpressure. API returns 429 with Retry-After when the connection pool or the BQ slot reservation is saturated.
  • Tenant noisy-neighbor. Per-tenant byte budget + per-tenant concurrency cap; auto-pause snapshot generators on budget breach.
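
The retry and circuit-breaker policies above can be sketched as two small pieces. Everything named here is an assumption for illustration: `backoffDelayMs` uses "full jitter" (a uniform draw below the exponential ceiling), which is one of several jitter strategies, and the `minSamples` guard on the breaker is not in the policy text; it is added so that a single early failure does not trip the breaker.

```typescript
// Jittered exponential backoff: ceiling doubles from baseMs and is capped at capMs.
function backoffDelayMs(
  attempt: number, // 0 for the first retry
  rand: () => number = Math.random,
  baseMs = 250,
  capMs = 4000,
): number {
  const ceiling = Math.min(capMs, baseMs * Math.pow(2, attempt));
  return Math.floor(rand() * ceiling);
}

type BreakerState = "closed" | "open" | "half-open";

// Time is passed in explicitly (milliseconds) so the breaker is easy to test.
class CircuitBreaker {
  private samples: { at: number; ok: boolean }[] = [];
  private openedAt: number | null = null;

  constructor(
    private readonly windowMs = 30000,
    private readonly errorRatio = 0.5,
    private readonly coolDownMs = 30000,
    private readonly minSamples = 10, // assumption, not part of the stated policy
  ) {}

  state(now: number): BreakerState {
    if (this.openedAt === null) return "closed";
    return now - this.openedAt >= this.coolDownMs ? "half-open" : "open";
  }

  record(ok: boolean, now: number): void {
    const current = this.state(now);
    if (current === "open") return; // calls are rejected while open
    if (current === "half-open") {
      // A single probe decides: success closes, failure re-opens.
      this.openedAt = ok ? null : now;
      this.samples = [];
      return;
    }
    // Keep only samples inside the sliding window, then add the new one.
    this.samples = this.samples.filter((s) => now - s.at < this.windowMs);
    this.samples.push({ at: now, ok });
    const errors = this.samples.filter((s) => !s.ok).length;
    if (
      this.samples.length >= this.minSamples &&
      errors / this.samples.length >= this.errorRatio
    ) {
      this.openedAt = now;
    }
  }
}
```

A caller would check `state(now)` before each backend call, fail fast while it is "open", and feed the outcome of each attempted call back through `record`.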

10. Manual intervention runbooks

| Runbook | When |
|---------|------|
| runbooks/analytics-dq.md | DQ critical alert |
| runbooks/analytics-etl.md | ETL job failed twice |
| runbooks/analytics-budget.md | Tenant byte budget breach |
| runbooks/analytics-sink-lag.md | Sink lag breach |
| runbooks/analytics-forecast.md | Forecast writeback failure |
| runbooks/analytics-looker.md | Embed token issues |

Each runbook lists symptoms, dashboards, queries, mitigation steps, and rollback. Runbooks are linked from PagerDuty alerts (OBSERVABILITY §7).


11. RTO / RPO summary

| Scenario | RTO | RPO |
|----------|-----|-----|
| Cloud SQL primary failover | 5 min | < 1 min |
| BigQuery curated table corruption | 30 min (restore from snapshot) | up to 1 day (replay from raw) |
| Pub/Sub backlog 1 h | 30 min drain | 0 (replayable) |
| Region outage (within residency only) | 30 min | 5 min |

Cross-references: DATA_MODEL §3, OBSERVABILITY §7 alerts, SERVICE_RISK_REGISTER.