Failure Modes

:::info Source
Sourced from services/analytics-service/FAILURE_MODES.md in the documentation repo.
:::

1. Scenarios

1.1 Firehose Lag

NATS → Kafka bridge backs up.
Dashboards show stale data.
Mitigation: autoscale firehose; alert on lag > 60s; users see "last updated" timestamp.
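
The lag alert and the "last updated" label can be sketched as below. This is a minimal illustration, assuming lag is measured as wall-clock time minus the timestamp of the newest event that reached the dashboards; the function names are hypothetical and not taken from the service.

```python
from datetime import datetime, timedelta, timezone

# Threshold from the mitigation above: alert on lag > 60s.
LAG_ALERT_THRESHOLD = timedelta(seconds=60)


def firehose_lag(last_event_time: datetime, now: datetime) -> timedelta:
    """Lag between now and the newest event landed downstream (never negative)."""
    return max(now - last_event_time, timedelta(0))


def should_alert(lag: timedelta) -> bool:
    """True only when lag strictly exceeds the 60s threshold."""
    return lag > LAG_ALERT_THRESHOLD


def last_updated_label(last_event_time: datetime) -> str:
    """Text shown to users so stale dashboards are recognizable as stale."""
    utc_time = last_event_time.astimezone(timezone.utc)
    return f"Last updated {utc_time:%Y-%m-%d %H:%M:%S} UTC"
```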

1.2 ClickHouse Node Failure

Queries fail over to a replica.
Writes queue on Kafka until the node recovers.
Replica promotion is automated.

1.3 Ad-Hoc Query Runaway

Query consumes too much CPU/memory.
Mitigation: query timeout (30s default, 300s admin); memory limits; dedicated query pool.
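
One way to apply the two timeout tiers is sketched below. It assumes the limits are passed to ClickHouse as the per-query settings `max_execution_time` (seconds) and `max_memory_usage` (bytes); whether the service actually enforces them this way is an assumption, and the memory value is a caller-supplied placeholder because the source does not state one.

```python
from typing import Dict, Optional

DEFAULT_TIMEOUT_S = 30   # default tier from the mitigation above
ADMIN_TIMEOUT_S = 300    # admin tier from the mitigation above


def query_timeout_seconds(is_admin: bool, requested: Optional[int] = None) -> int:
    """Return the effective timeout, never above the caller's tier ceiling."""
    ceiling = ADMIN_TIMEOUT_S if is_admin else DEFAULT_TIMEOUT_S
    if requested is None:
        return ceiling
    return max(1, min(requested, ceiling))


def clickhouse_settings(is_admin: bool, max_memory_bytes: int) -> Dict[str, int]:
    """Per-query settings dict; max_memory_bytes is a hypothetical input."""
    return {
        "max_execution_time": query_timeout_seconds(is_admin),
        "max_memory_usage": max_memory_bytes,
    }
```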

1.4 Export Too Large

Mitigation: size cap per export (100M rows); streaming output; multi-part download.
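
The row cap, streaming output, and multi-part split can be combined with generators, as in this sketch. The names `stream_capped`, `split_parts`, and `ExportTooLarge` are hypothetical, and the part size is left to the caller since the source does not specify one.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")

MAX_EXPORT_ROWS = 100_000_000  # size cap per export from the mitigation above


class ExportTooLarge(Exception):
    """Raised when an export would exceed the row cap."""


def stream_capped(rows: Iterable[T], max_rows: int = MAX_EXPORT_ROWS) -> Iterator[T]:
    """Yield rows lazily and abort as soon as the cap is exceeded."""
    for count, row in enumerate(rows, start=1):
        if count > max_rows:
            raise ExportTooLarge(f"export exceeds {max_rows} rows")
        yield row


def split_parts(rows: Iterable[T], part_size: int) -> Iterator[List[T]]:
    """Group a row stream into parts for multi-part download."""
    part: List[T] = []
    for row in rows:
        part.append(row)
        if len(part) == part_size:
            yield part
            part = []
    if part:
        yield part
```

Because both functions are generators, only one part is held in memory at a time, and an oversized export fails partway through rather than after being fully materialized.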

1.5 Schema Drift (Producer Emits New Field)

Firehose tolerates unknown fields.
Schema registry alerts on drift.
Materialized views may need recomputation.

1.6 Data Erasure Request

Cold Parquet files rewritten asynchronously.
Erasure SLA: 30 days from request.
Audit trail of erasure operation.

1.7 AI Insight Hallucinated SQL

Generated SQL validated; destructive ops rejected.
Tenant filter auto-injected; query runs safely or errors.

1.8 Cross-Tenant Query Leak

Pre-execution validator asserts tenant filter presence.
Runtime audit logs every query.
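
The checks described in 1.7 and 1.8 can be illustrated with a deliberately conservative pre-execution validator. This is a sketch only: it assumes the tenant column is named `tenant_id` and that the filter is bound as the parameter `%(tenant_id)s`, neither of which is stated in the source, and it uses regular expressions where a production validator would more likely use a real SQL parser.

```python
import re

# Keywords treated as destructive or mutating; the exact list is an assumption.
DESTRUCTIVE = re.compile(
    r"\b(INSERT|UPDATE|DELETE|DROP|TRUNCATE|ALTER|CREATE|GRANT|RENAME|OPTIMIZE)\b",
    re.IGNORECASE,
)
READ_ONLY_START = re.compile(r"(SELECT|WITH)\b", re.IGNORECASE)
# Hypothetical tenant filter shape: tenant_id = %(tenant_id)s
TENANT_FILTER = re.compile(r"\btenant_id\s*=\s*%\(tenant_id\)s", re.IGNORECASE)


class UnsafeQuery(Exception):
    """Raised when a query fails pre-execution validation."""


def validate_generated_sql(sql: str) -> str:
    """Return the cleaned statement, or raise UnsafeQuery."""
    statement = sql.strip().rstrip(";").strip()
    if ";" in statement:
        raise UnsafeQuery("multiple statements are not allowed")
    if not READ_ONLY_START.match(statement):
        raise UnsafeQuery("only SELECT/WITH queries are allowed")
    if DESTRUCTIVE.search(statement):
        raise UnsafeQuery("destructive keyword found")
    if not TENANT_FILTER.search(statement):
        raise UnsafeQuery("tenant filter missing")
    return statement
```

The keyword check can reject harmless queries that mention a keyword inside a string literal, which is the safer direction to err in. A presence check alone also does not prove that the filter constrains every table in the query, which is one reason to keep the runtime audit log as a second line of defense.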

2. Retry / Backoff

| Operation | Max retries | Backoff |
| --- | --- | --- |
| Firehose ingest | infinite | exponential, managed by Kafka |
| Export worker | 3 | 1m, 5m, 15m |
| AI call | 2 | 1s, 5s |
| Outbox | infinite | exponential, capped at 5m |
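
The fixed schedules and the capped exponential backoff in the table can be expressed as small delay functions. This sketch assumes the "Max" column counts retries, and that the outbox starts at 1s and doubles on each retry; the base delay and growth factor are not given in the source.

```python
from typing import Optional, Sequence

EXPORT_BACKOFF_S = [60, 300, 900]  # 1m, 5m, 15m
AI_BACKOFF_S = [1, 5]              # 1s, 5s
OUTBOX_CAP_S = 300                 # exponential backoff capped at 5m


def fixed_delay(schedule: Sequence[int], retry: int) -> Optional[int]:
    """Delay before the given 1-based retry, or None once retries are exhausted."""
    if 1 <= retry <= len(schedule):
        return schedule[retry - 1]
    return None


def outbox_delay(retry: int, base_s: float = 1.0) -> float:
    """Capped exponential delay; base_s and the factor of 2 are assumptions."""
    # Clamp the exponent so unbounded retry counts cannot overflow.
    exponent = min(max(retry - 1, 0), 32)
    return min(base_s * (2 ** exponent), OUTBOX_CAP_S)
```

For example, `fixed_delay(EXPORT_BACKOFF_S, 4)` returns `None`, which signals that the export worker should give up after its third retry.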

3. Circuit Breakers

ClickHouse: opens after 10 failures within 30s; stays open for 60s.
AI gateway: opens after 10 failures within 30s; stays open for 60s.
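
A breaker with these thresholds might look like the sketch below. It assumes failures are counted in a sliding 30s window (rather than consecutively) and that the breaker simply closes again after 60s with no half-open probe; the source does not say which variant the service uses. The `CircuitBreaker` class is illustrative, not the service's implementation.

```python
import time
from collections import deque
from typing import Callable, Deque, Optional


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 10,   # failures needed to open
        window_s: float = 30.0,        # sliding window for counting failures
        open_s: float = 60.0,          # how long the breaker stays open
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.window_s = window_s
        self.open_s = open_s
        self._clock = clock
        self._failures: Deque[float] = deque()
        self._opened_at: Optional[float] = None

    def allow_request(self) -> bool:
        """False while open; closes and resets once the open period elapses."""
        if self._opened_at is None:
            return True
        if self._clock() - self._opened_at >= self.open_s:
            self._opened_at = None
            self._failures.clear()
            return True
        return False

    def record_failure(self) -> None:
        """Record a failure and open the breaker if the window threshold is hit."""
        now = self._clock()
        self._failures.append(now)
        # Drop failures that fell out of the sliding window.
        while self._failures and now - self._failures[0] > self.window_s:
            self._failures.popleft()
        if len(self._failures) >= self.failure_threshold:
            self._opened_at = now
```

The injectable `clock` parameter exists so the timing behavior can be tested without sleeping.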

4. Fallbacks

| Primary | Fallback |
| --- | --- |
| Fresh data | Stale dashboard cache |
| AI NL query | Canned metric with matching keywords |
| Export direct | Paginated API download |
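
The keyword-matching fallback for AI natural-language queries could work roughly as follows. The metric names and keyword sets are invented for illustration, and scoring by keyword overlap is an assumption about how "matching keywords" is evaluated.

```python
import re
from typing import Dict, Optional, Set

# Hypothetical catalog of canned metrics and their trigger keywords.
CANNED_METRICS: Dict[str, Set[str]] = {
    "daily_active_users": {"daily", "active", "users", "dau"},
    "export_volume": {"export", "exports", "rows", "volume"},
}


def match_canned_metric(
    question: str, catalog: Dict[str, Set[str]] = CANNED_METRICS
) -> Optional[str]:
    """Return the canned metric sharing the most keywords with the question."""
    words = set(re.findall(r"[a-z0-9]+", question.lower()))
    best_name: Optional[str] = None
    best_score = 0
    for name, keywords in catalog.items():
        score = len(words & keywords)
        if score > best_score:
            best_name, best_score = name, score
    return best_name
```

Returning `None` when nothing overlaps lets the caller show an explicit error instead of an unrelated metric.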

5. Chaos

Kill firehose pod → Kafka retains; resumes on restart.
ClickHouse node down → queries use replica.
Query exceeds timeout → cleanly aborted with error.
