FAILURE_MODES — analytics-service
Sibling: APPLICATION_LOGIC · DATA_MODEL · OBSERVABILITY · platform anchors: docs/02 §15 Resilience, docs/standards/ERROR_CODES
Catalog by lifecycle stage. Error codes are MELMASTOON.ANALYTICS.<CODE> (ERROR_CODES).
1. Pub/Sub sink (raw event landing)
| # | Failure | Trigger | Detection | Mitigation | Recovery |
|---|---|---|---|---|---|
| S1 | Schema validation fails for incoming event | Drift in publisher | Sink span error + DLQ ack | Route to analytics.dlq topic; emit data_quality.alert.v1; log subject + version | Drift owner publishes corrected payload or registers v2; DLQ replay tool |
| S2 | BigQuery streaming insert 5xx burst | BigQuery service degradation | Insert error rate metric; SLO burn | Exponential backoff up to 60 s; spill to GCS staging bucket and load via batch when service recovers | Backfill from staging once healthy |
| S3 | Sink falls behind (Pub/Sub lag) | Spike in ingest | oldest_unacked_message_age > 5 min | Autoscale up; raise concurrency; pre-create partitions; alert P2 | Drain backlog; if persistent, grow reserved slots |
| S4 | Tenant id missing in envelope | Publisher bug | Envelope validator rejects | DLQ + critical alert; do not synthesize tenant id | Publisher fix; replay from DLQ once envelope corrected |
| S5 | OIDC verification fails on push | Misconfigured subscription | 401 in handler | Reject; alert security; rotate OIDC bindings | Re-subscribe with correct audience |
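
The S1 and S4 rows can be read as a single routing decision made before anything is written to BigQuery. The sketch below is illustrative only: the `Envelope` fields, the `knownSchemas` registry, the example subject names, and the `reason` strings are assumptions, not the service's actual contract. It shows the one rule the table states outright: a missing tenant id is sent to the DLQ and never synthesized.

```typescript
// Hypothetical envelope shape; field names are assumptions for illustration.
interface Envelope {
  eventId?: string;
  tenantId?: string;
  subject?: string;
  schemaVersion?: string;
}

type SinkDecision =
  | { route: "land" }
  | { route: "dlq"; reason: string; severity: "alert" | "critical" };

// Illustrative registry of subject@version pairs the sink can validate.
const knownSchemas = new Set<string>(["booking.created@v1", "booking.created@v2"]);

function routeEvent(env: Envelope): SinkDecision {
  // S4: never invent a tenant id; a missing one goes to the DLQ with a critical alert.
  if (!env.tenantId || env.tenantId.trim() === "") {
    return { route: "dlq", reason: "TENANT_ID_MISSING", severity: "critical" };
  }
  // S1: an unknown subject/version pair is treated as publisher schema drift.
  const key = `${env.subject ?? ""}@${env.schemaVersion ?? ""}`;
  if (!knownSchemas.has(key)) {
    return { route: "dlq", reason: `SCHEMA_UNKNOWN:${key}`, severity: "alert" };
  }
  return { route: "land" };
}
```

Checking the tenant id first keeps the critical alert from being masked by a lower-severity drift alert when both problems occur in the same event.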
2. ETL & curated layer
| # | Failure | Trigger | Detection | Mitigation | Recovery |
|---|---|---|---|---|---|
| E1 | ETL job exceeds time budget | Backfill or hot day | Job duration > SLO | Workflow timeout 60 min; mark failed; emit etl.failed.v1 | Manual rerun with tighter window or larger slot reservation |
| E2 | MERGE produces duplicates | Bad join keys after schema drift | DQ row-count check fails | Pause downstream metrics; alert P2 | Fix _join_key; rerun for affected partitions; coexistence v1/v2 |
| E3 | Source table not yet landed | Sink lag exceeds ETL trigger | Pre-flight freshness check fails | Skip + reschedule with backoff (max 3); emit etl.failed.v1 reason SOURCE_NOT_READY | Auto-retry until fresh |
| E4 | Slot exhaustion | Concurrent expensive jobs | BigQuery RESOURCES_EXCEEDED | Reservation autoscale; queue jobs by priority; lower-priority jobs throttled | Tune priorities; consider per-tenant reservation |
| E5 | Forecast writeback partially fails | Validation fails mid-batch | Per-row error map | Commit valid rows; emit MELMASTOON.ANALYTICS.FORECAST_PARTIAL event with errors; orchestrator retries failed rows | Investigate row reasons; orchestrator retries are idempotent |
| E6 | Critical DQ check fails | Bad upstream data | DQ engine writes dq_results; alert | Block dependent metrics from publishing metric.computed.v1; emit data_quality.alert.v1 | Owner remediates; rerun ETL when source fixed |
| E7 | Backfill cost drifts over budget | Wide window | Cost monitor anomaly | Pre-flight dry-run; cap window; chunked backfill | Resume in chunks; budget approval |
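
For E7, "cap window; chunked backfill" amounts to refusing windows above a limit and splitting the rest into fixed-size pieces that can be resumed one at a time. A minimal sketch follows; the function name, the half-open interval convention, and the `chunkDays` / `maxDays` parameters are assumptions chosen for illustration, and the actual cap values are not stated in this document.

```typescript
// Half-open interval [start, end), in UTC.
interface Chunk {
  start: Date;
  end: Date;
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Hypothetical helper: split a backfill window into chunks of at most chunkDays,
// rejecting windows wider than maxDays before any job is submitted.
function chunkBackfill(start: Date, end: Date, chunkDays: number, maxDays: number): Chunk[] {
  if (chunkDays <= 0) throw new RangeError("chunkDays must be positive");
  const totalDays = (end.getTime() - start.getTime()) / DAY_MS;
  if (totalDays <= 0) return [];
  if (totalDays > maxDays) {
    throw new RangeError(`window of ${totalDays} days exceeds cap of ${maxDays} days`);
  }
  const chunks: Chunk[] = [];
  const step = chunkDays * DAY_MS;
  for (let t = start.getTime(); t < end.getTime(); t += step) {
    // The last chunk is clipped so it never runs past the requested end.
    chunks.push({ start: new Date(t), end: new Date(Math.min(t + step, end.getTime())) });
  }
  return chunks;
}
```

Each chunk can then go through its own pre-flight dry-run, so a cost anomaly stops the backfill at a chunk boundary rather than mid-window.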
3. Query API (widget runs / ad-hoc)
| # | Failure | Trigger | Detection | Mitigation | Recovery |
|---|---|---|---|---|---|
| Q1 | Byte cap exceeded (dry-run) | Wide filter / no partition | Pre-flight bytes_processed > cap | Reject 400 MELMASTOON.ANALYTICS.QUERY_BYTE_CAP_EXCEEDED with suggestion to narrow filters | Caller refines; or raises cap with analytics.budget_admin |
| Q2 | Cross-tenant attempt | SQL injection in params or body tenant override | Param-binding validator + UDF | 403 PROPERTY_SCOPE_VIOLATION or CROSS_TENANT_DENIED; security alert | Audit logs; tenant lockout if repeated |
| Q3 | BigQuery 5xx | Backend hiccup | Retry policy + circuit breaker | Retry up to 3 with backoff; on breaker open, return 503 with Retry-After | Auto-recover when healthy |
| Q4 | Latency breach (uncached p95 > 8 s) | Cold cache + heavy query | SLI alert | Suggest cached widget alternative; increase materialization cadence; tune partitioning | Re-tune metric SQL; add intermediate projection |
| Q5 | Schema version mismatch (@schema_version pinned) | Curated table evolved | MELMASTOON.ANALYTICS.SCHEMA_VERSION_MISMATCH | Return 400 with current version; auto-rebind on next query | Caller upgrades pin; coexistence period covers it |
| Q6 | Metric archived but referenced by widget | Lifecycle mismatch | 410 METRIC_ARCHIVED | Surface in dashboard UI as deprecated; force migration | Dashboard owner replaces metric |
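
The Q1 check compares the dry-run byte estimate against the cap before the query is submitted. The sketch below assumes the estimate has already been obtained from a BigQuery dry run and is passed in as a number; the function name, the result shape, and the message wording are illustrative. Only the status and the error code come from the table above.

```typescript
interface QueryRejection {
  status: 400;
  code: "MELMASTOON.ANALYTICS.QUERY_BYTE_CAP_EXCEEDED";
  message: string;
}

// Hypothetical pre-flight guard. Returns null when the query may proceed.
function checkByteCap(estimatedBytes: number, capBytes: number): QueryRejection | null {
  // The table's condition is strictly "greater than", so hitting the cap exactly passes.
  if (estimatedBytes <= capBytes) return null;
  const gib = (n: number): string => (n / 2 ** 30).toFixed(1);
  return {
    status: 400,
    code: "MELMASTOON.ANALYTICS.QUERY_BYTE_CAP_EXCEEDED",
    message:
      `Query would scan ${gib(estimatedBytes)} GiB; cap is ${gib(capBytes)} GiB. ` +
      "Narrow the filters or add a partition filter.",
  };
}
```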
4. Dashboards / widgets / saved queries
| # | Failure | Trigger | Detection | Mitigation | Recovery |
|---|---|---|---|---|---|
| D1 | OCC conflict on dashboard update | Concurrent edits | 409 OPTIMISTIC_LOCK | Client refetches the dashboard and retries with the fresh version in If-Match | UX surfaces merge prompt |
| D2 | Widget dimension references missing dim | Schema drift | Pre-publish lint | Reject save; surface available dims | Author updates widget |
| D3 | Saved query parameter has non-bound interpolation | Author wrote ${var} | Parser blocks save | Reject 400 SAVED_QUERY_INVALID | Author switches to @param |
| D4 | Looker embed token revoked mid-session | Binding removed | 401 from broker | Force token refresh; if still revoked, sign the user out of the embed | Tenant admin re-grants |
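
The D3 rule (block `${var}` interpolation, require `@param` binding) can be approximated with a pattern scan at save time. This is a simplified sketch, not the service's parser: the function names and result shape are assumptions, and a plain regular expression cannot tell whether a match sits inside a SQL string literal or comment, which a real parser would handle.

```typescript
type SavedQueryCheck =
  | { ok: true }
  | { ok: false; status: 400; code: "SAVED_QUERY_INVALID"; offenders: string[] };

// Return every ${...} occurrence found in the SQL text.
function findUnboundInterpolations(sql: string): string[] {
  return sql.match(/\$\{[^}]*\}/g) ?? [];
}

// Hypothetical save-time validator: any ${...} occurrence rejects the query.
function validateSavedQuery(sql: string): SavedQueryCheck {
  const offenders = findUnboundInterpolations(sql);
  if (offenders.length === 0) return { ok: true };
  return { ok: false, status: 400, code: "SAVED_QUERY_INVALID", offenders };
}
```

Returning the offending fragments lets the editor point the author at the exact placeholders to convert to `@param`.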
5. Pub/Sub (control-plane events consumed)
| # | Failure | Trigger | Detection | Mitigation | Recovery |
|---|---|---|---|---|---|
| C1 | tenant.deleted.v1 re-delivered | Pub/Sub at-least-once | Inbox dedupe | Idempotent purge; second run no-op | OK |
| C2 | ai.forecast.produced.v1 invalid envelope | Orchestrator bug | Envelope validator | DLQ + alert; emit MELMASTOON.ANALYTICS.FORECAST_INVALID_ENVELOPE | Orchestrator fix; replay |
| C3 | Forecast tenant mismatch | Routing bug | Per-row tenant check | Reject batch; emit FORECAST_INVALID_TENANT | Orchestrator fix |
| C4 | Backlog of forecast events | Model retrain bursts | Subscription lag | Autoscale ETL worker; raise concurrency cap temporarily | Burn down backlog |
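
C1 relies on two properties together: the inbox recognises a re-delivered event id, and the purge itself is idempotent. A minimal in-memory sketch is shown below. The `Inbox` class and `handleTenantDeleted` function are hypothetical, and a `Set` stands in for whatever durable inbox table the service uses.

```typescript
// In-memory stand-in for a durable inbox table keyed by event id (assumption).
class Inbox {
  private readonly seen = new Set<string>();

  has(eventId: string): boolean {
    return this.seen.has(eventId);
  }

  record(eventId: string): void {
    this.seen.add(eventId);
  }
}

// Hypothetical handler for tenant.deleted.v1.
function handleTenantDeleted(
  inbox: Inbox,
  eventId: string,
  tenantId: string,
  purge: (tenantId: string) => void,
): "purged" | "duplicate" {
  if (inbox.has(eventId)) return "duplicate";
  // Purge first, record second: if the process dies in between, Pub/Sub
  // re-delivers and the idempotent purge runs again as a no-op.
  purge(tenantId);
  inbox.record(eventId);
  return "purged";
}
```

Recording before purging would be unsafe under at-least-once delivery, since a crash after the record but before the purge would cause the re-delivered event to be skipped.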
6. Postgres metadata
| # | Failure | Trigger | Detection | Mitigation | Recovery |
|---|---|---|---|---|---|
| P1 | Pool exhaustion | Spike in dashboard reads | Pool gauge | Autoscale Cloud Run; PgBouncer transaction pooling; reject with 503 | Capacity tune |
| P2 | RLS misconfig | Migration mistake | Tenant-isolation integration test | Block release | Hotfix migration |
| P3 | Primary failover | Cloud SQL maintenance | Health degraded | Read-only mode with cached widget data; queue writes | Resume on failover complete |
7. Outbox / publish
| # | Failure | Trigger | Detection | Mitigation | Recovery |
|---|---|---|---|---|---|
| O1 | Pub/Sub publish fails | Network blip | Outbox row stays pending | Worker retries with backoff; surface outbox.lag metric | Auto-recover |
| O2 | Duplicate publish | Worker crash mid-publish | Consumer dedupe via eventId | Inbox guard | OK |
| O3 | Outbox lag > SLO | Worker bug or volume spike | Lag alert | Scale worker; investigate poison messages | Drain |
8. AI integration
| # | Failure | Trigger | Detection | Mitigation | Recovery |
|---|---|---|---|---|---|
| A1 | Orchestrator timeout | Model overloaded | ai.invoke span error | Fallback to no-AI path (e.g., metric explainer hidden); UX banner | Auto-recover |
| A2 | Budget exhausted | Tenant cap reached | MELMASTOON.AI.BUDGET_EXHAUSTED | Disable AI surface; warn admin | Admin tops up budget |
| A3 | Off-switch enabled | Tenant disabled capability | 403 from orchestrator | Hide AI surfaces server-side | Admin re-enables |
| A4 | Forecast write-back schema mismatch | Model output format change | Validator | DLQ batch; alert | Orchestrator versions model output |
9. Cross-cutting policies
- Retries. Idempotent operations: 3 attempts with jittered exponential backoff (250 ms → 4 s).
- Circuit breakers. Opened on ≥ 50 % errors over 30 s window; half-open after 30 s.
- DLQ policy. Pub/Sub DLQ analytics.dlq with 14-day retention; replayer in tools/dlq-replay.ts.
- Backpressure. API returns 429 with Retry-After if pool or BQ slot reservation saturated.
- Tenant noisy-neighbor. Per-tenant byte budget + per-tenant concurrency cap; auto-pause snapshot generators on budget breach.
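
The retry and circuit-breaker policies above can be sketched as follows. Several details are assumptions, since this document does not specify them: 250 ms is read as the base delay and 4 s as the ceiling, jitter is "full jitter" (uniform between zero and the ceiling for that retry), `MIN_SAMPLES` is a hypothetical minimum request volume added so that one early failure does not open the breaker, and a single probe decides the half-open outcome. The clock and random source are passed in so the behaviour is deterministic under test.

```typescript
const BASE_MS = 250;
const CAP_MS = 4_000;
const MAX_ATTEMPTS = 3;

// Delay before retry number retryIndex (0-based), with full jitter.
function backoffDelayMs(retryIndex: number, random: () => number = Math.random): number {
  const ceiling = Math.min(CAP_MS, BASE_MS * 2 ** retryIndex);
  return Math.floor(random() * ceiling);
}

// Run an idempotent operation up to MAX_ATTEMPTS times, sleeping between attempts.
async function withRetry<T>(
  op: () => Promise<T>,
  sleep: (ms: number) => Promise<void> = (ms) => new Promise<void>((r) => setTimeout(r, ms)),
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
    try {
      return await op();
    } catch (err) {
      lastError = err;
      if (attempt < MAX_ATTEMPTS - 1) await sleep(backoffDelayMs(attempt));
    }
  }
  throw lastError;
}

type BreakerState = "closed" | "open" | "half-open";

const WINDOW_MS = 30_000;
const COOL_DOWN_MS = 30_000;
const ERROR_THRESHOLD = 0.5;
const MIN_SAMPLES = 4; // assumption: not stated in the policy

class CircuitBreaker {
  private outcomes: { at: number; ok: boolean }[] = [];
  private openedAt: number | null = null;

  state(now: number): BreakerState {
    if (this.openedAt === null) return "closed";
    return now - this.openedAt >= COOL_DOWN_MS ? "half-open" : "open";
  }

  record(ok: boolean, now: number): void {
    if (this.state(now) === "half-open") {
      // One probe decides: success closes the breaker, failure re-opens it.
      this.openedAt = ok ? null : now;
      this.outcomes = [];
      return;
    }
    // Keep only outcomes inside the sliding window, then add the new one.
    this.outcomes = this.outcomes.filter((o) => now - o.at < WINDOW_MS);
    this.outcomes.push({ at: now, ok });
    const errors = this.outcomes.filter((o) => !o.ok).length;
    if (this.outcomes.length >= MIN_SAMPLES && errors / this.outcomes.length >= ERROR_THRESHOLD) {
      this.openedAt = now;
    }
  }
}
```

A caller would check `state(Date.now())` before each request, fail fast (for example with the 503 and Retry-After described in Q3) while the breaker is open, and call `record` with the outcome otherwise.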
10. Manual intervention runbooks
| Runbook | When |
|---|---|
| runbooks/analytics-dq.md | DQ critical alert |
| runbooks/analytics-etl.md | ETL job failed twice |
| runbooks/analytics-budget.md | Tenant byte budget breach |
| runbooks/analytics-sink-lag.md | Sink lag breach |
| runbooks/analytics-forecast.md | Forecast writeback failure |
| runbooks/analytics-looker.md | Embed token issues |
Each runbook lists symptoms, dashboards, queries, mitigation steps, and rollback. Linked from PagerDuty alerts (OBSERVABILITY §7).
11. RTO / RPO summary
| Scenario | RTO | RPO |
|---|---|---|
| Cloud SQL primary failover | 5 min | < 1 min |
| BigQuery curated table corruption | 30 min (restore from snapshot) | up to 1 day (replay from raw) |
| Pub/Sub backlog 1 h | 30 min drain | 0 (replayable) |
| Region outage (within residency only) | 30 min | 5 min |
Cross-references: DATA_MODEL §3, OBSERVABILITY §7 alerts, SERVICE_RISK_REGISTER.