Analytics Service — Failure Modes
Status: populated; Owner: Platform Engineering + SRE; Last updated: 2026-04-18

| # | Failure | User/Platform Impact | Detection | Mitigation |
|---|---|---|---|---|
| 1 | PG primary down | Event processing pauses (consumer NAKs to NATS); REST API returns 503 | AnlytPgErrors alert | NAK triggers NATS redelivery; events back up in the NATS stream (retention: 7 d billing, 3 d DLR); processing resumes on PG recovery |
| 2 | NATS consumer lag grows (PG slow) | Metrics trail real time by the consumer lag | NATS consumer pending count alert | Scale consumer pods; optimize PG queries; lag under 5 min is acceptable for dashboards |
| 3 | Event schema mismatch (upstream breaking change) | Events fail deserialization and go to the consumer DLQ | AnlytDeserializationErrors alert | Schema registry CI gate blocks breaking changes; manual intervention is needed if the gate is bypassed |
| 4 | Daily rollup job fails | metrics_daily stale; dashboard shows yesterday's data | AnlytRollupFailed alert | Rollup is idempotent; re-run manually; hourly data still current |
| 5 | processed_events table full / purge cron fails | Dedup table grows; queries slow | Table size monitoring | Purge cron alert; manual DELETE FROM processed_events WHERE processed_at < now() - interval '48h' |
| 6 | REST API slow (complex aggregation query) | Dashboard renders slowly | P95 latency alert | Pre-aggregated daily tables minimize query work; add DB index on hot query patterns |
| 7 | ClickHouse ETL fails | Historical queries (> 90 d) unavailable | ETL job failure alert | PG keeps the most recent 90 d hot; ClickHouse serves historical queries only, so a transient failure is low impact |
| 8 | Double-count on pod crash mid-upsert | Metrics slightly inflated | Anomaly on delivery rate (> 100%) | processed_events dedup prevents double-count; PG transaction ensures atomicity |
| 9 | NATS stream retention exceeded before consumer catches up | Events lost permanently; gap in metrics | NATS stream fill alert | Increase stream retention (default 7 d billing, 3 d DLR); scale consumer before retention boundary |
| 10 | Account usage endpoint returns data for wrong account | Data privacy incident | Integration test + mTLS policy | accountId scope validation in use case; mTLS limits callers |
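
Row 8 relies on the dedup insert and the metric upsert committing in one transaction. The sketch below illustrates that pattern only; it is not the service's actual code. It uses Python's built-in `sqlite3` as a stand-in for PG so it runs anywhere, and the `metrics_hourly` table, its `delivered` column, and the `process_event` function are hypothetical names chosen for illustration.

```python
import sqlite3


def init_schema(conn):
    # Hypothetical schema: only processed_events is named in the table above.
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS processed_events (
            event_id     TEXT PRIMARY KEY,
            processed_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
        );
        CREATE TABLE IF NOT EXISTS metrics_hourly (
            account_id TEXT NOT NULL,
            hour       TEXT NOT NULL,
            delivered  INTEGER NOT NULL DEFAULT 0,
            PRIMARY KEY (account_id, hour)
        );
    """)


def process_event(conn, event_id, account_id, hour, delivered,
                  crash_before_commit=False):
    """Apply one event at most once. Returns True if applied, False if duplicate."""
    # `with conn` commits on success and rolls back if an exception escapes,
    # so the dedup row and the metric update succeed or fail together.
    with conn:
        cur = conn.execute(
            "INSERT OR IGNORE INTO processed_events (event_id) VALUES (?)",
            (event_id,),
        )
        if cur.rowcount == 0:
            return False  # already processed: skip the metric update
        conn.execute(
            """
            INSERT INTO metrics_hourly (account_id, hour, delivered)
            VALUES (?, ?, ?)
            ON CONFLICT (account_id, hour)
            DO UPDATE SET delivered = delivered + excluded.delivered
            """,
            (account_id, hour, delivered),
        )
        if crash_before_commit:
            # Simulates a pod crash after the upsert but before the commit.
            raise RuntimeError("simulated crash mid-transaction")
    return True
```

If the simulated crash fires, both writes roll back, so the redelivered event is applied exactly once on retry; a redelivery after a successful commit is skipped by the dedup row. In PG the equivalent dedup statement would be `INSERT ... ON CONFLICT DO NOTHING`.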