Failure Modes
:::info Source
Sourced from services/analytics-service/FAILURE_MODES.md in the documentation repo.
:::
1. Scenarios
1.1 Firehose Lag
- NATS → Kafka bridge backs up.
- Dashboards show stale data.
- Mitigation: autoscale the firehose; alert when lag exceeds 60s; show users a "last updated" timestamp.
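
As a rough illustration of the lag alert and the "last updated" label, the sketch below treats lag as the age of the newest event that reached the analytics store. The function names and the timestamp format are assumptions for illustration; only the 60s threshold comes from the mitigation above.

```python
from datetime import datetime, timedelta, timezone

LAG_ALERT_THRESHOLD = timedelta(seconds=60)  # threshold taken from the mitigation above


def firehose_lag(last_event_at: datetime, now: datetime) -> timedelta:
    """Lag = age of the newest ingested event (assumed definition); never negative."""
    return max(now - last_event_at, timedelta(0))


def should_alert(last_event_at: datetime, now: datetime) -> bool:
    """True once lag strictly exceeds the 60s threshold."""
    return firehose_lag(last_event_at, now) > LAG_ALERT_THRESHOLD


def last_updated_label(last_event_at: datetime) -> str:
    """Text a dashboard could display so users can judge staleness (hypothetical format)."""
    stamp = last_event_at.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S UTC")
    return "Last updated " + stamp
```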
1.2 ClickHouse Node Failure
- Queries fail over to a replica.
- Writes queue in Kafka until the node recovers.
- Replica promotion is automated.
1.3 Ad-Hoc Query Runaway
- Query consumes too much CPU/memory.
- Mitigation: query timeout (30s default, 300s admin); memory limits; dedicated query pool.
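
The timeout tiers could be enforced along these lines. This is a minimal client-side sketch: the role name `"admin"`, the `QueryTimeout` exception, and `run_with_timeout` are hypothetical, and cancelling the awaiting coroutine does not by itself stop work on the database server, which would still need its own server-side limit.

```python
import asyncio
from typing import Any, Awaitable

DEFAULT_TIMEOUT_S = 30   # default tier from the mitigation above
ADMIN_TIMEOUT_S = 300    # admin tier from the mitigation above


class QueryTimeout(Exception):
    """Raised when a query exceeds its time budget (hypothetical error type)."""


def timeout_for(role: str) -> int:
    """Pick the time budget for a caller; the role string is an assumption."""
    return ADMIN_TIMEOUT_S if role == "admin" else DEFAULT_TIMEOUT_S


async def run_with_timeout(query: Awaitable[Any], limit_s: float) -> Any:
    """Await the query, cancelling it and raising QueryTimeout past the limit."""
    try:
        return await asyncio.wait_for(query, timeout=limit_s)
    except asyncio.TimeoutError:
        raise QueryTimeout(f"query exceeded {limit_s}s") from None
```

A caller would typically pass `timeout_for(role)` as `limit_s`.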
1.4 Export Too Large
- Mitigation: size cap per export (100M rows); streaming output; multi-part download.
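
A sketch of how the row cap and multi-part output might fit together. The names `check_export_size` and `split_into_parts` are illustrative, and the row estimate is assumed to come from a cheap count performed before the export starts.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")

MAX_EXPORT_ROWS = 100_000_000  # size cap per export from the mitigation above


class ExportTooLarge(Exception):
    """Raised when an export would exceed the row cap (hypothetical error type)."""


def check_export_size(estimated_rows: int) -> None:
    """Reject the export up front instead of failing part-way through."""
    if estimated_rows > MAX_EXPORT_ROWS:
        raise ExportTooLarge(f"{estimated_rows} rows exceeds cap of {MAX_EXPORT_ROWS}")


def split_into_parts(rows: Iterable[T], rows_per_part: int) -> Iterator[List[T]]:
    """Stream rows into fixed-size parts so no part holds the full result set."""
    part: List[T] = []
    for row in rows:
        part.append(row)
        if len(part) == rows_per_part:
            yield part
            part = []
    if part:  # final, possibly shorter, part
        yield part
```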
1.5 Schema Drift (Producer Emits New Field)
- Firehose tolerates unknown fields.
- Schema registry alerts on drift.
- Materialized views may need recomputation.
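
One way to read "tolerates unknown fields" is that ingestion keeps the fields it knows and reports the rest rather than rejecting the event. The sketch below assumes that reading; the field names are invented, and the real schema lives in the schema registry.

```python
from typing import Any, Dict, FrozenSet, Set, Tuple

# Hypothetical known schema for one event type.
KNOWN_FIELDS: FrozenSet[str] = frozenset({"event_id", "tenant_id", "ts"})


def split_known_unknown(
    event: Dict[str, Any], known_fields: FrozenSet[str] = KNOWN_FIELDS
) -> Tuple[Dict[str, Any], Set[str]]:
    """Return (fields to ingest, names of drifted fields) without raising."""
    known = {k: v for k, v in event.items() if k in known_fields}
    unknown = set(event) - known_fields
    return known, unknown
```

A non-empty `unknown` set is the signal a drift alert could be built on.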
1.6 GDPR Erasure Slow on Cold Tier
- Cold Parquet files rewritten asynchronously.
- Erasure SLA: 30 days from request.
- Audit trail of erasure operation.
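
The 30-day SLA can be tracked with simple date arithmetic. A minimal sketch, with hypothetical function names; an erasure still in progress is measured against the current time.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

ERASURE_SLA = timedelta(days=30)  # SLA from the list above


def erasure_deadline(requested_at: datetime) -> datetime:
    """Latest moment by which the cold-tier rewrite must have finished."""
    return requested_at + ERASURE_SLA


def is_overdue(requested_at: datetime, completed_at: Optional[datetime], now: datetime) -> bool:
    """True if the erasure finished late, or is still pending past its deadline."""
    finished = completed_at if completed_at is not None else now
    return finished > erasure_deadline(requested_at)
```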
1.7 AI Insight Hallucinated SQL
- Generated SQL is validated; destructive operations are rejected.
- A tenant filter is auto-injected, so the query either runs safely or fails with an error.
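
A deliberately conservative, keyword-level sketch of rejecting destructive generated SQL. It is not a SQL parser and is not presented as the service's actual validator: the keyword list and `RejectedSQL` are assumptions, and the check can reject harmless queries (for example, one whose string literal contains `delete`).

```python
import re

# Assumed deny-list; a real validator would rely on a proper SQL parser.
DESTRUCTIVE_KEYWORDS = {
    "INSERT", "UPDATE", "DELETE", "DROP", "TRUNCATE", "ALTER",
    "CREATE", "RENAME", "GRANT", "REVOKE", "ATTACH", "DETACH",
}


class RejectedSQL(ValueError):
    """Raised when generated SQL fails validation (hypothetical error type)."""


def validate_generated_sql(sql: str) -> str:
    """Allow a single read-only statement; raise RejectedSQL otherwise."""
    stripped = sql.strip().rstrip(";").strip()
    if ";" in stripped:
        raise RejectedSQL("multiple statements are not allowed")
    tokens = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", stripped.upper())
    if not tokens or tokens[0] not in ("SELECT", "WITH"):
        raise RejectedSQL("only SELECT queries are allowed")
    hits = DESTRUCTIVE_KEYWORDS.intersection(tokens)
    if hits:
        raise RejectedSQL("destructive keyword(s): " + ", ".join(sorted(hits)))
    return stripped
```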
1.8 Cross-Tenant Query Leak
- Pre-execution validator asserts tenant filter presence.
- Runtime audit logs every query.
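
The two safeguards above might combine as follows. The column name `tenant_id`, the quoted-literal filter form, and the in-memory audit list are assumptions for illustration. A textual presence check like this is weak on its own (it cannot tell whether the predicate sits inside an `OR` branch or a comment), so treat it as a sketch of the idea rather than a complete defence.

```python
import re
from typing import Dict, List

TENANT_COLUMN = "tenant_id"  # hypothetical column name


class MissingTenantFilter(Exception):
    """Raised when a query lacks the caller's tenant predicate."""


def has_tenant_filter(sql: str, tenant_id: str) -> bool:
    """Check for `tenant_id = '<id>'`; column match is case-insensitive, value is exact."""
    pattern = rf"\b(?i:{TENANT_COLUMN})\s*=\s*'{re.escape(tenant_id)}'"
    return re.search(pattern, sql) is not None


def guard_query(sql: str, tenant_id: str, audit_log: List[Dict[str, str]]) -> str:
    """Audit every query, then refuse any query without the tenant filter."""
    allowed = has_tenant_filter(sql, tenant_id)
    audit_log.append(
        {"tenant_id": tenant_id, "sql": sql, "decision": "allowed" if allowed else "rejected"}
    )
    if not allowed:
        raise MissingTenantFilter(f"query lacks filter for tenant {tenant_id!r}")
    return sql
```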
2. Retry / Backoff
| Operation | Max retries | Backoff |
|---|---|---|
| Firehose ingest | infinite | exponential, managed by Kafka |
| Export worker | 3 | 1m, 5m, 15m |
| AI call | 2 | 1s, 5s |
| Outbox | infinite | exponential, capped at 5m |
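
The schedules in the table can be expressed as delay sequences fed to a generic retry loop. In this sketch the "Max" column is read as the number of retries after the first attempt (an assumption, suggested by each row listing that many delays), and the outbox base delay of 1s and doubling factor are also assumptions, since the table gives only the 5m cap.

```python
import time
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")

EXPORT_WORKER_DELAYS_S = (60, 300, 900)  # 1m, 5m, 15m
AI_CALL_DELAYS_S = (1, 5)                # 1s, 5s


def capped_exponential(base_s: float, cap_s: float, factor: float = 2.0) -> Iterator[float]:
    """Endless exponential delays that never exceed cap_s."""
    delay = base_s
    while True:
        yield min(delay, cap_s)
        delay = min(delay * factor, cap_s)


def outbox_delays() -> Iterator[float]:
    """Infinite schedule capped at 5m; the 1s base is an assumed value."""
    return capped_exponential(base_s=1.0, cap_s=300.0)


def call_with_retries(
    op: Callable[[], T],
    delays: Iterable[float],
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run op; on failure wait for the next delay and retry, re-raising when delays run out."""
    remaining = iter(delays)
    while True:
        try:
            return op()
        except Exception:
            delay = next(remaining, None)
            if delay is None:
                raise
            sleep(delay)
```

Passing `sleep` as a parameter keeps the loop testable without real waiting.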
3. Circuit Breakers
ClickHouse: opens after 10 failures within 30s and stays open for 60s. AI gateway: same thresholds (10 failures within 30s, open for 60s).
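
A minimal sliding-window breaker using those thresholds, reading "10 fail/30s → 60s" as "open for 60s after 10 failures inside a 30s window" (an interpretation, not confirmed by the source). The class and method names are hypothetical, and there is no half-open probing state: once the 60s elapse, traffic is simply allowed again.

```python
import time
from collections import deque
from typing import Callable, Deque, Optional


class CircuitBreaker:
    """Opens after max_failures within window_s, then blocks calls for open_s."""

    def __init__(
        self,
        max_failures: int = 10,
        window_s: float = 30.0,
        open_s: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_failures = max_failures
        self.window_s = window_s
        self.open_s = open_s
        self._clock = clock
        self._failures: Deque[float] = deque()
        self._opened_at: Optional[float] = None

    def allow(self) -> bool:
        """True if a call may proceed; closes the breaker once open_s has elapsed."""
        if self._opened_at is None:
            return True
        if self._clock() - self._opened_at >= self.open_s:
            self._opened_at = None
            self._failures.clear()
            return True
        return False

    def record_failure(self) -> None:
        """Record a failure, drop ones outside the window, and open if the limit is hit."""
        now = self._clock()
        self._failures.append(now)
        while self._failures and now - self._failures[0] > self.window_s:
            self._failures.popleft()
        if len(self._failures) >= self.max_failures:
            self._opened_at = now
```

While `allow()` returns `False`, callers would serve the matching fallback instead of calling ClickHouse or the AI gateway.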
4. Fallbacks
| Primary | Fallback |
|---|---|
| Fresh data | Stale dashboard cache |
| AI NL query | Canned metric with matching keywords |
| Export direct | Paginated API download |
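
The "canned metric with matching keywords" fallback could be as simple as picking the metric whose keyword set overlaps most with the user's question. The catalog below is invented for illustration; the real metric names and keywords are not given in the source.

```python
import re
from typing import Dict, Optional, Set

# Hypothetical catalog: metric name -> keywords that should trigger it.
CANNED_METRICS: Dict[str, Set[str]] = {
    "daily_active_users": {"daily", "active", "users"},
    "revenue_by_region": {"revenue", "region", "sales"},
}


def pick_canned_metric(question: str, catalog: Dict[str, Set[str]]) -> Optional[str]:
    """Return the metric with the largest keyword overlap, or None if nothing matches."""
    words = set(re.findall(r"[a-z0-9]+", question.lower()))
    best = max(catalog, key=lambda name: len(words & catalog[name]), default=None)
    if best is None or not (words & catalog[best]):
        return None
    return best
```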
5. Chaos
- Kill firehose pod → Kafka retains; resumes on restart.
- ClickHouse node down → queries use replica.
- Query exceeds timeout → cleanly aborted with error.