SERVICE_RISK_REGISTER — analytics-service
Sibling: SERVICE_OVERVIEW · SECURITY_MODEL · FAILURE_MODES · SERVICE_READINESS
Active risks scored on Likelihood (L: 1–5) × Impact (I: 1–5). Score = L × I. Treatment classes: Accept · Mitigate · Transfer · Eliminate.
1. Top risks
R-ANL-001 — Cross-tenant data exposure via authorized view bypass
| Field | Value |
|---|
| L × I | 2 × 5 = 10 |
| Owner | Security squad lead |
| Trigger | Misconfigured view, leaked tenant principal, or SQL injection bypassing param binding |
| Detection | Tenant-isolation integration test; daily access-binding reconciliation; security audit logs |
| Mitigation (M) | Three-layer isolation (JWT, UDF, authorized view); param-only saved queries; SQL lint forbids tenant_id overrides; reconciliation job vs iam-service |
| Residual | Low; periodic red-team exercise |
| Review | Quarterly |
| Field | Value |
|---|
| L × I | 4 × 4 = 16 |
| Owner | Finance + SRE |
| Trigger | Author publishes wide-filter widget; backfill triggered by misconfig |
| Detection | analytics.budget.bytes_used_ratio gauge; cost anomaly alert |
| Mitigation (M) | Per-query byte cap; per-tenant daily budget; auto-pause snapshot generators at 100 %; reservation autoscale ceiling; pre-flight dry-run on save |
| Residual | Medium during launches |
| Review | Monthly |
R-ANL-003 — Schema drift breaks dashboards & metrics
| Field | Value |
|---|
| L × I | 3 × 4 = 12 |
| Owner | Analytics platform lead |
| Trigger | Producer service ships v2 event without coexistence; curated table altered breaking publishers |
| Detection | DQ schema-drift check; CI schema-drift gate; consumer test failures |
| Mitigation (M) | _schema_version pin; v1/v2 coexistence; event-contract CI; curated DDL via Terraform with two-phase rename |
| Residual | Low |
| Review | Quarterly |
R-ANL-004 — Critical metric staleness (occupancy / RevPAR)
| Field | Value |
|---|
| L × I | 3 × 4 = 12 |
| Owner | Analytics squad lead |
| Trigger | Pub/Sub backlog, ETL failure, BigQuery slot exhaustion |
| Detection | CriticalMetricStale SLO alert (P1) |
| Mitigation (M) | High-frequency cadence (5 min) for critical metrics; dedicated reservation; automatic rerun on transient failures; freshness DQ check |
| Residual | Low |
| Review | Monthly |
R-ANL-005 — Forecast writeback corrupts curated rows
| Field | Value |
|---|
| L × I | 2 × 5 = 10 |
| Owner | AI squad + analytics lead |
| Trigger | Orchestrator ships malformed batch; tenant mismatch; model-version regression |
| Detection | Per-row tenant + schema validation; ForecastWritebackFail alert; downstream DQ checks |
| Mitigation (M) | Strict validator; partial-batch error map; MELMASTOON.ANALYTICS.FORECAST_INVALID_* events; idempotent MERGE keys |
| Residual | Low |
| Review | Quarterly |
R-ANL-006 — Looker Studio embed token misuse
| Field | Value |
|---|
| L × I | 2 × 4 = 8 |
| Owner | Security + analytics |
| Trigger | Stolen token, forgotten revocation |
| Detection | Per-tenant token issuance metric; binding reconciliation |
| Mitigation (M) | KMS-signed JWT, ≤ 60 min TTL; binding-revocation immediate; embed re-validates per page load; audit looker.token_issued |
| Residual | Low |
| Review | Quarterly |
R-ANL-007 — Pub/Sub sink lag during traffic spikes
| Field | Value |
|---|
| L × I | 3 × 3 = 9 |
| Owner | SRE |
| Trigger | High-season events spike, BigQuery streaming throttling |
| Detection | oldest_unacked_message_age alert |
| Mitigation (M) | Sink autoscale; spill-to-GCS when streaming inserts fail; chunked batch loader fallback; capacity load test |
| Residual | Medium |
| Review | Monthly |
R-ANL-008 — Composer/Workflows DAG silent skip
| Field | Value |
|---|
| L × I | 2 × 3 = 6 |
| Owner | Analytics platform |
| Trigger | DAG paused, IAM regression, region drift |
| Detection | ETL run heartbeat metric; etl.failed.v1 absence; cron audit |
| Mitigation (M) | Heartbeat events etl.started.v1 + etl.completed.v1; dashboard alert if no run within cadence |
| Residual | Low |
| Review | Quarterly |
R-ANL-009 — Saved query SQL injection
| Field | Value |
|---|
| L × I | 2 × 5 = 10 |
| Owner | Security |
| Trigger | Author writes string concatenation; param interpolation overlooked |
| Detection | Save-time parser blocks; SQL lint CI |
| Mitigation (M) | Allowlist-only datasets/tables; parameter binding only; integration test with malicious payloads |
| Residual | Low |
| Review | Quarterly |
R-ANL-010 — Right-to-erasure delay > SLA
| Field | Value |
|---|
| L × I | 2 × 4 = 8 |
| Owner | Compliance lead |
| Trigger | Cascade purge fails on a curated table; backlog of tenant.deleted.v1 |
| Detection | Purge reconciliation report; tenant_purge.duration_seconds SLI |
| Mitigation (M) | Idempotent purge; partition-aware DELETE/TRUNCATE; scheduled retry with alert; legal-hold override path documented |
| Residual | Low |
| Review | Quarterly |
R-ANL-011 — AI suggestion quality regresses
| Field | Value |
|---|
| L × I | 3 × 2 = 6 |
| Owner | AI squad |
| Trigger | Model upgrade affects metric explanations or forecasts |
| Detection | Per-capability success metric; user feedback loop; eval harness |
| Mitigation (M) | Capability-level off-switch; HITL on writes; canary % rollout per tenant; rollback to previous model version |
| Residual | Medium |
| Review | Monthly |
R-ANL-012 — Region/residency violation
| Field | Value |
|---|
| L × I | 1 × 5 = 5 |
| Owner | Security + platform |
| Trigger | Cross-region replication misconfig; misrouted Pub/Sub |
| Detection | Region tag audits; VPC-SC perimeter denials |
| Mitigation (E) | Per-region deployments enforced by Terraform; VPC-SC perimeter; release gate verifies dataset region |
| Residual | Very low |
| Review | Quarterly |
2. Risk treatment matrix
| Score | Action |
|---|
| ≥ 16 | Executive review; weekly status update |
| 10–15 | Squad-level mitigation plan with owner & due date |
| 6–9 | Tracked; reviewed at retro |
| ≤ 5 | Accepted; reviewed quarterly |
3. Lifecycle & cadence
- New risks added by anyone in the squad via PR.
- Owner assigned within one sprint.
- Quarterly review by service owner + security + finance + AI lead.
- Risks closed only when residual is "accepted" or fully eliminated; closure recorded with date and rationale.
Cross-references: FAILURE_MODES, SECURITY_MODEL, SERVICE_READINESS.