Analytics Service — Failure Modes

Status: populated | Owner: Platform Engineering + SRE | Last updated: 2026-04-18

| # | Failure | User/Platform Impact | Detection | Mitigation |
|---|---------|----------------------|-----------|------------|
| 1 | PG primary down | Event processing pauses (NAK to NATS); REST API returns 503 | AnlytPgErrors alert | NATS NAK causes redelivery; events backlog in NATS stream (retained 7 d billing / 3 d DLR); processing resumes on PG recovery |
| 2 | NATS consumer lag grows (PG slow) | Metrics lag behind real time by the consumer lag duration | NATS consumer pending count alert | Scale pods; optimize PG queries; acceptable lag is < 5 min for dashboards |
| 3 | Event schema mismatch (upstream breaking change) | Events fail deserialization and land in the consumer DLQ | AnlytDeserializationErrors alert | Schema registry CI gate prevents this; manual intervention if the gate is bypassed |
| 4 | Daily rollup job fails | metrics_daily stale; dashboard shows yesterday's data | AnlytRollupFailed alert | Rollup is idempotent; re-run manually; hourly data is still current |
| 5 | processed_events table full / purge cron fails | Dedup table grows; queries slow down | Table size monitoring | Purge cron alert; manual DELETE FROM processed_events WHERE processed_at < now() - interval '48h' |
| 6 | REST API slow (complex aggregation query) | Dashboard renders slowly | P95 latency alert | Pre-aggregated daily tables minimize query work; add DB indexes for hot query patterns |
| 7 | ClickHouse ETL fails | Historical queries (> 90 d) unavailable | ETL job failure alert | PG holds the hot 90 d; ClickHouse serves historical data only; a transient failure is low impact |
| 8 | Double-count on pod crash mid-upsert | Metrics slightly inflated | Anomaly on delivery rate (> 100%) | processed_events dedup prevents double-count; PG transaction ensures atomicity |
| 9 | NATS stream retention exceeded before consumer catches up | Events lost permanently; gap in metrics | NATS stream fill alert | Increase stream retention (default 7 d billing, 3 d DLR); scale consumers before the retention boundary is reached |
| 10 | Account usage endpoint returns data for wrong account | Data privacy incident | Integration test + mTLS policy | accountId scope validation in use case; mTLS limits callers |
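
Rows 1 and 8 rely on the same mechanism: the dedup insert into processed_events and the metric upsert share one transaction, and the consumer only ACKs after commit. The sketch below illustrates that pattern under stated assumptions. It uses an in-memory SQLite database as a stand-in for PG so that it runs anywhere. The `metrics_hourly` table, every column name, the `crash_mid_upsert` flag, and the string return values `"ack"` / `"nak"` are hypothetical and are not taken from the service's code.

```python
import sqlite3


def open_store() -> sqlite3.Connection:
    """Create an in-memory store (SQLite stand-in for PG; schema is assumed)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE processed_events ("
        " event_id TEXT PRIMARY KEY,"
        " processed_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP)"
    )
    conn.execute(
        "CREATE TABLE metrics_hourly ("
        " account_id TEXT NOT NULL,"
        " bucket TEXT NOT NULL,"
        " delivered INTEGER NOT NULL,"
        " PRIMARY KEY (account_id, bucket))"
    )
    return conn


def handle_event(conn: sqlite3.Connection, event: dict,
                 crash_mid_upsert: bool = False) -> str:
    """Process one event and return "ack" or "nak" for the broker.

    crash_mid_upsert is a hypothetical test hook that simulates a database
    failure between the dedup insert and the metric upsert.
    """
    try:
        # `with conn` commits on success and rolls back on any exception,
        # so the dedup row and the metric update succeed or fail together.
        with conn:
            cur = conn.execute(
                "INSERT OR IGNORE INTO processed_events (event_id) VALUES (?)",
                (event["id"],),
            )
            if cur.rowcount == 0:
                # Already processed: this is a redelivery, skip the upsert.
                return "ack"
            if crash_mid_upsert:
                raise sqlite3.OperationalError("simulated failure mid-upsert")
            conn.execute(
                "INSERT INTO metrics_hourly (account_id, bucket, delivered)"
                " VALUES (?, ?, 1)"
                " ON CONFLICT(account_id, bucket)"
                " DO UPDATE SET delivered = delivered + 1",
                (event["account_id"], event["bucket"]),
            )
        return "ack"
    except sqlite3.OperationalError:
        # Store unavailable or transaction failed: ask for redelivery.
        return "nak"


def delivered_count(conn: sqlite3.Connection, account_id: str, bucket: str) -> int:
    """Read back the counter for one account and hour bucket (0 if absent)."""
    row = conn.execute(
        "SELECT delivered FROM metrics_hourly WHERE account_id = ? AND bucket = ?",
        (account_id, bucket),
    ).fetchone()
    return row[0] if row else 0
```

If the first delivery fails mid-transaction, the rollback also removes the dedup row, so the redelivered event is processed once. Any later redelivery of the same event is acknowledged without touching the counter. In a real consumer the return value would map to the broker client's ack and nak calls, and the error class caught would be the PG driver's.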