
Analytics Service — Service Risk Register

Status: populated
Owner: Platform Engineering
Last updated: 2026-04-18

| ID | Risk | Likelihood | Impact | Mitigation | Owner |
|---|---|---|---|---|---|
| R-ANLYT-01 | Double-count if processed_events dedup bypassed | Low | High | processed_events INSERT with ON CONFLICT DO NOTHING; integration test idempotency.spec.ts | Engineering |
| R-ANLYT-02 | NATS stream retention expires before consumer catches up (event loss) | Low | High | Retention 7 d (billing) / 3 d (DLR); alert on consumer lag > 10,000; scale consumer | SRE |
| R-ANLYT-03 | Upstream schema change (breaking) corrupts metrics silently | Medium | High | Schema registry CI gate; analytics validates schema on consume; deserialization error alert | Engineering |
| R-ANLYT-04 | Daily rollup job fails silently | Low | Medium | Alert AnlytRollupFailed if no rollup in 3 h; rollup is idempotent and can be re-run | SRE |
| R-ANLYT-05 | Account usage data cross-contamination (wrong account in response) | Low | Critical | accountId scope validation in use case; integration test account-scope.spec.ts | Engineering |
| R-ANLYT-06 | PG partition bloat (metrics_hourly > 90 d) | Medium | Medium | Partition maintenance cron; alert on partition age; ClickHouse ETL as pressure valve | SRE |
| R-ANLYT-07 | processed_events table grows unbounded (purge cron fails) | Medium | Medium | Daily purge cron; alert on table > 5 M rows; manual purge runbook | SRE |
| R-ANLYT-08 | Analytics totals diverge from billing totals | Low | High | Daily reconciliation job; alert on divergence > 0.01% | Engineering |
| R-ANLYT-09 | ClickHouse ETL fails and historical queries unavailable | Low | Low | ClickHouse is non-critical (> 90 d queries only); alert on ETL failure; PG serves 90 d | SRE |
| R-ANLYT-10 | P95 latency computed inaccurately (low sample count in bucket) | Medium | Low | Document approximation in API response; flag buckets with < 100 samples | Engineering |
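
Several mitigations in the register reduce to a numeric threshold check (R-ANLYT-02, -04, -07, -08, -10). The following is a minimal sketch of how those checks might be expressed as plain predicates. All names here (`THRESHOLDS`, `divergencePct`, `reconciliationAlert`, `isLowSampleBucket`, `rollupStale`) are hypothetical and not taken from the service's code; only the threshold values come from the register. Treating the billing total as the reference denominator for divergence is also an assumption.

```typescript
// Threshold values copied from the risk register; identifiers are illustrative only.
const THRESHOLDS = {
  consumerLagMax: 10_000,            // R-ANLYT-02: alert on consumer lag > 10,000
  rollupMaxAgeHours: 3,              // R-ANLYT-04: alert if no rollup in 3 h
  processedEventsRowsMax: 5_000_000, // R-ANLYT-07: alert on table > 5 M rows
  divergencePctMax: 0.01,            // R-ANLYT-08: alert on divergence > 0.01%
  minSamplesPerBucket: 100,          // R-ANLYT-10: flag buckets with < 100 samples
} as const;

// Relative divergence between analytics and billing totals, in percent.
// Assumption: billing is the reference value (the denominator).
function divergencePct(analyticsTotal: number, billingTotal: number): number {
  if (billingTotal === 0) {
    return analyticsTotal === 0 ? 0 : Infinity;
  }
  return (Math.abs(analyticsTotal - billingTotal) / Math.abs(billingTotal)) * 100;
}

// True when the daily reconciliation should raise an alert (R-ANLYT-08).
function reconciliationAlert(analyticsTotal: number, billingTotal: number): boolean {
  return divergencePct(analyticsTotal, billingTotal) > THRESHOLDS.divergencePctMax;
}

// True when a latency bucket has too few samples for a trustworthy P95 (R-ANLYT-10).
function isLowSampleBucket(sampleCount: number): boolean {
  return sampleCount < THRESHOLDS.minSamplesPerBucket;
}

// True when the last successful rollup is older than the allowed window (R-ANLYT-04).
function rollupStale(lastRollup: Date, now: Date): boolean {
  const ageHours = (now.getTime() - lastRollup.getTime()) / 3_600_000;
  return ageHours > THRESHOLDS.rollupMaxAgeHours;
}
```

Keeping the thresholds in one constant makes it easier to check that alert rules and the register stay in agreement when either changes.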