Failure Modes

:::info Source
Sourced from services/analytics-service/FAILURE_MODES.md in the documentation repo.
:::

1. Scenarios

1.1 Firehose Lag

  • NATS → Kafka bridge backs up.
  • Dashboards show stale data.
  • Mitigation: autoscale firehose; alert on lag > 60s; users see "last updated" timestamp.
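The lag alert and the "last updated" label can be sketched as follows. This is a minimal illustration, not the service's actual code: the function names are hypothetical, and it assumes lag is measured as the gap between the newest ingested event and the current time.

```python
from datetime import datetime, timedelta, timezone

# Threshold taken from the mitigation above (alert on lag > 60s).
LAG_ALERT_THRESHOLD = timedelta(seconds=60)


def firehose_lag(last_event_time: datetime, now: datetime) -> timedelta:
    """Gap between the newest ingested event and now; clamped at zero."""
    return max(now - last_event_time, timedelta(0))


def should_alert(lag: timedelta) -> bool:
    """True only when lag is strictly greater than the threshold."""
    return lag > LAG_ALERT_THRESHOLD


def last_updated_label(last_event_time: datetime) -> str:
    """Text a dashboard could show so users can judge staleness."""
    utc_time = last_event_time.astimezone(timezone.utc)
    return "Last updated " + utc_time.strftime("%Y-%m-%d %H:%M:%S UTC")
```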

1.2 ClickHouse Node Failure

  • Query failover to replica.
  • Writes queue on Kafka until node recovers.
  • Replica promotion automated.

1.3 Ad-Hoc Query Runaway

  • Query consumes too much CPU/memory.
  • Mitigation: query timeout (30s default, 300s admin); memory limits; dedicated query pool.
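One way to apply the per-role timeout is to pass it as ClickHouse query settings. The sketch below assumes the limits are enforced through the ClickHouse settings max_execution_time (seconds) and max_memory_usage (bytes); the source does not say how the service enforces them, and it gives no memory limit, so that value is left as a caller-supplied parameter.

```python
from typing import Dict

DEFAULT_TIMEOUT_S = 30   # from the mitigation above
ADMIN_TIMEOUT_S = 300    # from the mitigation above


def query_settings(is_admin: bool, max_memory_bytes: int) -> Dict[str, int]:
    """Build per-query limits for an ad-hoc query.

    max_memory_bytes is hypothetical: the document mentions memory
    limits but does not state a number.
    """
    return {
        "max_execution_time": ADMIN_TIMEOUT_S if is_admin else DEFAULT_TIMEOUT_S,
        "max_memory_usage": max_memory_bytes,
    }
```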

1.4 Export Too Large

  • Mitigation: size cap per export (100M rows); streaming output; multi-part download.
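The size cap and multi-part output can be sketched like this. The names are hypothetical, and the rows-per-part value is an assumption left to the caller; only the 100M-row cap comes from the text above.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")

MAX_EXPORT_ROWS = 100_000_000  # cap from the mitigation above


class ExportTooLarge(Exception):
    """Raised when an export would exceed the row cap."""


def check_export_size(estimated_rows: int) -> None:
    """Reject the export up front instead of failing midway."""
    if estimated_rows > MAX_EXPORT_ROWS:
        raise ExportTooLarge(
            f"{estimated_rows} rows exceeds cap of {MAX_EXPORT_ROWS}"
        )


def split_into_parts(rows: Iterable[T], rows_per_part: int) -> Iterator[List[T]]:
    """Stream rows into fixed-size parts without loading everything at once."""
    part: List[T] = []
    for row in rows:
        part.append(row)
        if len(part) == rows_per_part:
            yield part
            part = []
    if part:  # final, possibly shorter, part
        yield part
```

Because split_into_parts is a generator, only one part is held in memory at a time, which is the point of streaming output.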

1.5 Schema Drift (Producer Emits New Field)

  • Firehose tolerates unknown fields.
  • Schema registry alerts on drift.
  • Materialized views may need recomputation.
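Tolerating unknown fields while still surfacing drift might look like the sketch below. The field names in KNOWN_FIELDS are a hypothetical schema for illustration; the real schema lives in the schema registry and is not given here.

```python
from typing import Any, Dict, FrozenSet, Set, Tuple

# Hypothetical schema, for illustration only.
KNOWN_FIELDS: FrozenSet[str] = frozenset({"tenant_id", "event_type", "ts"})


def split_known_unknown(event: Dict[str, Any]) -> Tuple[Dict[str, Any], Set[str]]:
    """Keep known fields for ingestion; return unknown field names for alerting."""
    known = {key: value for key, value in event.items() if key in KNOWN_FIELDS}
    unknown = set(event) - KNOWN_FIELDS
    return known, unknown
```

The event is never rejected: the known part is ingested as usual, and a non-empty set of unknown names is what a drift alert could be raised on.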

1.6 GDPR Erasure Slow on Cold Tier

  • Cold Parquet files rewritten asynchronously.
  • Erasure SLA: 30 days from request.
  • Audit trail of erasure operation.
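Tracking the 30-day SLA comes down to a deadline computed from the request time. A small sketch, with hypothetical function names:

```python
from datetime import datetime, timedelta

ERASURE_SLA = timedelta(days=30)  # from the SLA above


def erasure_deadline(requested_at: datetime) -> datetime:
    """Latest time by which cold-tier files must be rewritten."""
    return requested_at + ERASURE_SLA


def is_overdue(requested_at: datetime, now: datetime) -> bool:
    """True once the current time has passed the deadline."""
    return now > erasure_deadline(requested_at)
```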

1.7 AI Insight Hallucinated SQL

  • Generated SQL validated; destructive ops rejected.
  • Tenant filter auto-injected; query runs safely or errors.
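Rejecting destructive operations can be illustrated with a deliberately conservative keyword check. This is a sketch only: the keyword list and the function name are assumptions, and a production validator would more likely inspect a parsed statement than match text. The check errs on the side of rejection, so a harmless query with a keyword inside a string literal is refused too.

```python
import re

# Assumed list of statement keywords treated as destructive or mutating.
_DESTRUCTIVE = re.compile(
    r"\b(INSERT|UPDATE|DELETE|DROP|ALTER|TRUNCATE|CREATE|GRANT|RENAME|OPTIMIZE)\b",
    re.IGNORECASE,
)
_READ_PREFIX = re.compile(r"(SELECT|WITH)\b", re.IGNORECASE)


def is_read_only(sql: str) -> bool:
    """Accept a single SELECT/WITH statement containing no destructive keyword."""
    statement = sql.strip().rstrip(";").strip()
    if ";" in statement:  # more than one statement
        return False
    if not _READ_PREFIX.match(statement):
        return False
    return _DESTRUCTIVE.search(statement) is None
```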

1.8 Cross-Tenant Query Leak

  • Pre-execution validator asserts tenant filter presence.
  • Runtime audit logs every query.
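A pre-execution check plus audit log could be sketched as below. The column name tenant_id, the logger name, and the function names are assumptions. The text match is naive (for example, it does not notice a filter weakened by an OR clause), so treat it as an illustration of where the check sits, not of how to parse SQL.

```python
import logging
import re

audit_log = logging.getLogger("analytics.audit")  # hypothetical logger name


def has_tenant_filter(sql: str, tenant_id: str) -> bool:
    """True if the SQL text contains tenant_id = '<tenant_id>'."""
    pattern = r"\btenant_id\s*=\s*'" + re.escape(tenant_id) + r"'"
    return re.search(pattern, sql) is not None


def assert_tenant_scoped(sql: str, tenant_id: str) -> None:
    """Log every query, then refuse any that lacks the tenant filter."""
    audit_log.info("tenant=%s sql=%s", tenant_id, sql)
    if not has_tenant_filter(sql, tenant_id):
        raise PermissionError("query is missing the tenant filter")
```

Logging before the check means rejected queries appear in the audit trail as well.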

2. Retry / Backoff

| Op | Max | Backoff |
|---|---|---|
| Firehose ingest | infinite | exp, managed by Kafka |
| Export worker | 3 | 1m, 5m, 15m |
| AI call | 2 | 1s, 5s |
| Outbox | infinite | exp cap 5m |
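The schedules above can be expressed as delay generators. In this sketch the fixed delays come straight from the table, reading "Max" as the number of retries. The exponential base of 1s and the doubling factor are assumptions, since the table only states the 5m cap.

```python
from typing import Iterator, Sequence

EXPORT_WORKER_DELAYS_S = (60, 300, 900)  # 1m, 5m, 15m
AI_CALL_DELAYS_S = (1, 5)                # 1s, 5s
OUTBOX_CAP_S = 300.0                     # exp cap 5m


def fixed_schedule(delays_s: Sequence[float]) -> Iterator[float]:
    """Finite schedule: one delay per retry, then give up."""
    yield from delays_s


def capped_exponential(base_s: float = 1.0, cap_s: float = OUTBOX_CAP_S) -> Iterator[float]:
    """Infinite schedule: the delay doubles each retry until it reaches the cap."""
    delay = base_s
    while True:
        yield min(delay, cap_s)
        delay = min(delay * 2, cap_s)
```

A caller sleeps for each yielded delay between attempts; when a finite schedule is exhausted, the operation is treated as failed.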

3. Circuit Breakers

  • ClickHouse: opens after 10 failures within 30s; stays open for 60s.
  • AI gateway: opens after 10 failures within 30s; stays open for 60s.
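Reading the thresholds above as "open after 10 failures in a 30s window, stay open for 60s", a breaker could be sketched as follows. The class is hypothetical, and it omits a half-open probing state, which real breakers often add.

```python
import time
from collections import deque
from typing import Callable, Deque, Optional


class CircuitBreaker:
    def __init__(
        self,
        max_failures: int = 10,
        window_s: float = 30.0,
        open_s: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._max_failures = max_failures
        self._window_s = window_s
        self._open_s = open_s
        self._clock = clock
        self._failures: Deque[float] = deque()
        self._opened_at: Optional[float] = None

    def allow(self) -> bool:
        """True if a call may proceed; closes the breaker once open_s has elapsed."""
        if self._opened_at is None:
            return True
        if self._clock() - self._opened_at >= self._open_s:
            self._opened_at = None
            self._failures.clear()
            return True
        return False

    def record_failure(self) -> None:
        """Record a failure; open the breaker when the window fills up."""
        now = self._clock()
        self._failures.append(now)
        # Drop failures that fell out of the sliding window.
        while self._failures and now - self._failures[0] > self._window_s:
            self._failures.popleft()
        if len(self._failures) >= self._max_failures:
            self._opened_at = now
```

Callers check allow() before each ClickHouse or AI gateway call and take the fallback path while it returns False.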

4. Fallbacks

| Primary | Fallback |
|---|---|
| Fresh data | Stale dashboard cache |
| AI NL query | Canned metric with matching keywords |
| Export direct | Paginated API download |
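Each primary/fallback pair above follows the same shape: try the primary, and on failure serve the fallback while flagging that it was used. A generic sketch with a hypothetical helper name:

```python
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def with_fallback(primary: Callable[[], T], fallback: Callable[[], T]) -> Tuple[T, bool]:
    """Return (value, used_fallback); the fallback runs only if primary raises."""
    try:
        return primary(), False
    except Exception:
        return fallback(), True
```

The used_fallback flag lets the caller tell the user the result is degraded, for example by marking a dashboard as served from stale cache.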

5. Chaos

  • Kill firehose pod → Kafka retains; resumes on restart.
  • ClickHouse node down → queries use replica.
  • Query exceeds timeout → cleanly aborted with error.