Analytics Service — Service Risk Register
Status: populated | Owner: Platform Engineering | Last updated: 2026-04-18
| ID | Risk | Likelihood | Impact | Mitigation | Owner |
|---|---|---|---|---|---|
| R-ANLYT-01 | Events double-counted if the processed_events dedup is bypassed | Low | High | processed_events INSERT with ON CONFLICT DO NOTHING; integration test idempotency.spec.ts | Engineering |
| R-ANLYT-02 | NATS stream retention expires before consumer catches up (event loss) | Low | High | Retention 7 d (billing) / 3 d (DLR); alert on consumer lag > 10,000; scale consumer | SRE |
| R-ANLYT-03 | Upstream schema change (breaking) corrupts metrics silently | Medium | High | Schema registry CI gate; analytics validates schema on consume; deserialization error alert | Engineering |
| R-ANLYT-04 | Daily rollup job fails silently | Low | Medium | Alert AnlytRollupFailed if no rollup in 3 h; rollup is idempotent and can be re-run | SRE |
| R-ANLYT-05 | Account usage data cross-contamination (wrong account in response) | Low | Critical | accountId scope validation in use case; integration test account-scope.spec.ts | Engineering |
| R-ANLYT-06 | PG partition bloat (metrics_hourly > 90 d) | Medium | Medium | Partition maintenance cron; alert on partition age; ClickHouse ETL as pressure valve | SRE |
| R-ANLYT-07 | processed_events table grows unbounded (purge cron fails) | Medium | Medium | Daily purge cron; alert on table > 5 M rows; manual purge runbook | SRE |
| R-ANLYT-08 | Analytics totals diverge from billing totals | Low | High | Daily reconciliation job; alert on divergence > 0.01% | Engineering |
| R-ANLYT-09 | ClickHouse ETL fails and historical queries become unavailable | Low | Low | ClickHouse is non-critical (> 90 d queries only); alert on ETL failure; PG serves 90 d | SRE |
| R-ANLYT-10 | P95 latency computed inaccurately (low sample count in bucket) | Medium | Low | Document approximation in API response; flag buckets with < 100 samples | Engineering |
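
The numeric thresholds in R-ANLYT-08 and R-ANLYT-10 can be sketched as plain checks. This is a minimal illustration, not the service's actual code: every function, type, and field name below is hypothetical, and it assumes divergence is measured relative to the billing total, which the register does not specify. Only the two threshold values (0.01% and 100 samples) come from the table above.

```typescript
// Thresholds taken from the register; all identifiers are hypothetical.
const DIVERGENCE_ALERT_PERCENT = 0.01; // R-ANLYT-08: alert on divergence > 0.01%
const MIN_SAMPLES_PER_BUCKET = 100; // R-ANLYT-10: flag buckets with < 100 samples

/**
 * Divergence of the analytics total from the billing total, in percent.
 * Assumption: billing is the reference value (the denominator).
 */
function divergencePercent(analyticsTotal: number, billingTotal: number): number {
  if (billingTotal === 0) {
    // No billing baseline: identical zeros agree, anything else fully diverges.
    return analyticsTotal === 0 ? 0 : Infinity;
  }
  return (Math.abs(analyticsTotal - billingTotal) / Math.abs(billingTotal)) * 100;
}

/** True when the reconciliation job should raise the divergence alert. */
function shouldAlertOnDivergence(analyticsTotal: number, billingTotal: number): boolean {
  return divergencePercent(analyticsTotal, billingTotal) > DIVERGENCE_ALERT_PERCENT;
}

// Hypothetical shape of one latency bucket in an API response.
interface LatencyBucket {
  bucketStart: string; // ISO 8601 timestamp of the bucket start
  sampleCount: number;
  p95Ms: number;
}

interface FlaggedLatencyBucket extends LatencyBucket {
  lowSampleCount: boolean; // true when the P95 rests on too few samples
}

/** Marks buckets whose P95 is computed from fewer than the minimum sample count. */
function flagLowSampleBuckets(buckets: LatencyBucket[]): FlaggedLatencyBucket[] {
  return buckets.map((bucket) => ({
    ...bucket,
    lowSampleCount: bucket.sampleCount < MIN_SAMPLES_PER_BUCKET,
  }));
}
```

Both comparisons are strict, matching the register's wording: a divergence of exactly 0.01% does not alert, and a bucket with exactly 100 samples is not flagged.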