search-aggregation-service — SERVICE_RISK_REGISTER
Companion: SECURITY_MODEL · FAILURE_MODES · SERVICE_READINESS · DEPLOYMENT_TOPOLOGY
Living register, reviewed monthly. Inherent risk = Likelihood × Impact, scored before mitigation; residual is the score after mitigation. Likelihood and Impact each use a 1 to 5 scale (1 = low, 5 = critical), so scores range from 1 to 25.
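
The scoring rule can be written down as a small check. This is a minimal sketch: the `RiskEntry` shape and both function names are hypothetical, and the rule that a residual score should not exceed the inherent score is an assumption for illustration, not something the register states.

```typescript
// Hypothetical types and helpers for illustration; not part of the service.
type Score = 1 | 2 | 3 | 4 | 5;

interface RiskEntry {
  id: string;
  likelihood: Score;
  impact: Score;
  residual: number;
}

/** Inherent risk is likelihood x impact, so it ranges from 1 to 25. */
function inherentScore(risk: RiskEntry): number {
  return risk.likelihood * risk.impact;
}

/** Assumption: a residual above the inherent score indicates a data-entry error. */
function isConsistent(risk: RiskEntry): boolean {
  return risk.residual >= 1 && risk.residual <= inherentScore(risk);
}

const r1: RiskEntry = { id: "R-1", likelihood: 4, impact: 5, residual: 4 };
console.log(inherentScore(r1), isConsistent(r1)); // 20 true
```
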
R-1 — Cross-tenant PII leak via search index
| Aspect | Detail |
|---|---|
| Description | Forbidden upstream field (e.g. ownerEmail, bankAccount, lockSerial) is projected into the search index and returned to anonymous public traffic. |
| Likelihood | 4 |
| Impact | 5 (regulatory + reputational) |
| Inherent | 20 |
| Mitigation | 4-layer allow-list: type guard, projection policy, schema (dynamic: strict), nightly audit. CI gates on schema diff. PRs touching the allow-list require security review. |
| Residual | 4 |
| Owner | service owner + security |
| Next action | Quarterly review of allow-list against new upstream fields (property-service, pricing-service schema diffs). |
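
The projection-policy layer of the allow-list can be sketched as follows. This is an illustration under assumptions: the allow-listed field names and the function name are hypothetical; only the forbidden examples (`ownerEmail`, `bankAccount`, `lockSerial`) come from the register.

```typescript
// Assumption: these allow-listed field names are hypothetical placeholders.
const INDEX_ALLOW_LIST: ReadonlySet<string> = new Set([
  "propertyId",
  "title",
  "city",
  "nightlyPrice",
]);

interface ProjectionResult {
  doc: Record<string, unknown>;
  dropped: string[];
}

/** Copy only allow-listed fields and report every field that was dropped. */
function projectForIndex(upstream: Record<string, unknown>): ProjectionResult {
  const doc: Record<string, unknown> = {};
  const dropped: string[] = [];
  for (const key of Object.keys(upstream)) {
    if (INDEX_ALLOW_LIST.has(key)) {
      doc[key] = upstream[key];
    } else {
      dropped.push(key); // never indexed; can feed an audit log
    }
  }
  return { doc, dropped };
}

const projected = projectForIndex({
  propertyId: "p-1",
  title: "Garden flat",
  ownerEmail: "owner@example.com",
  bankAccount: "hidden",
});
console.log(projected.dropped.join(",")); // ownerEmail,bankAccount
```

Because the policy copies fields in rather than stripping fields out, a new upstream field stays out of the index until someone adds it to the allow-list deliberately.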
R-2 — OpenSearch outage degrading consumer surface
| Aspect | Detail |
|---|---|
| Description | An Aiven cluster outage or shard-failure storm causes search latency to balloon or requests to fail with 5xx. |
| Likelihood | 3 |
| Impact | 4 |
| Inherent | 12 |
| Mitigation | Postgres-only fallback with degradationLevel="opensearch_unavailable"; circuit breaker; cache TTL extension; per-region clusters. |
| Residual | 6 (latency degraded but service stays up) |
| Owner | SRE |
| Next action | Load-test fallback at peak RPS — currently rehearsed at 50 % peak only. |
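
A circuit breaker that routes to the Postgres fallback might look like the sketch below. The class name, thresholds, and injected clock are assumptions for illustration; only the `degradationLevel` value `"opensearch_unavailable"` comes from the register (the `"none"` value is hypothetical).

```typescript
type BreakerState = "closed" | "open" | "half_open";

// Hypothetical breaker; thresholds and naming are illustrative only.
class SearchCircuitBreaker {
  private failures = 0;
  private openedAt = 0;
  private state: BreakerState = "closed";
  private readonly failureThreshold: number;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  constructor(failureThreshold: number, cooldownMs: number, now: () => number = () => Date.now()) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  /** True when the primary engine (OpenSearch) may be called. */
  allowPrimary(): boolean {
    if (this.state === "open" && this.now() - this.openedAt >= this.cooldownMs) {
      this.state = "half_open"; // cooldown elapsed: let probe traffic through
    }
    return this.state !== "open";
  }

  recordSuccess(): void {
    this.failures = 0;
    this.state = "closed";
  }

  recordFailure(): void {
    this.failures += 1;
    // A failed probe reopens immediately; otherwise open at the threshold.
    if (this.state === "half_open" || this.failures >= this.failureThreshold) {
      this.state = "open";
      this.openedAt = this.now();
    }
  }
}

/** Pick the engine and the degradation marker to attach to the response. */
function chooseEngine(breaker: SearchCircuitBreaker): {
  engine: "opensearch" | "postgres";
  degradationLevel: "none" | "opensearch_unavailable";
} {
  return breaker.allowPrimary()
    ? { engine: "opensearch", degradationLevel: "none" }
    : { engine: "postgres", degradationLevel: "opensearch_unavailable" };
}
```

Injecting the clock keeps the breaker deterministic in tests, which matters when rehearsing the fallback under load.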
R-3 — Stale projection from upstream event lag or out-of-order delivery
| Aspect | Detail |
|---|---|
| Description | Out-of-order or delayed events from pricing-service or inventory-service cause stale prices or availability to be shown to users. |
| Likelihood | 4 |
| Impact | 3 |
| Inherent | 12 |
| Mitigation | Pub/Sub ordering by propertyId; vector-clock guard; freshness SLO (p95 < 30 s) with burn alerts; drift sweep job. |
| Residual | 4 |
| Owner | service owner |
| Next action | Add per-upstream producer_lag_seconds SLO and dashboard. |
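
The vector-clock guard can be illustrated with the standard partial-order comparison. This is a sketch under assumptions: clocks are modelled as maps from a producer id to a counter, the function names are hypothetical, and rejecting concurrent updates is a placeholder policy rather than the service's documented behaviour.

```typescript
// Assumption: one monotonically increasing counter per upstream producer.
type VectorClock = Readonly<Record<string, number>>;
type ClockOrder = "before" | "after" | "equal" | "concurrent";

/** Compare clock a against clock b using the vector-clock partial order. */
function compareClocks(a: VectorClock, b: VectorClock): ClockOrder {
  let aAhead = false;
  let bAhead = false;
  // Duplicated keys are harmless here; a missing entry counts as 0.
  const producers = Object.keys(a).concat(Object.keys(b));
  for (const producer of producers) {
    const av = a[producer] ?? 0;
    const bv = b[producer] ?? 0;
    if (av > bv) aAhead = true;
    if (av < bv) bAhead = true;
  }
  if (aAhead && bAhead) return "concurrent";
  if (aAhead) return "after";
  if (bAhead) return "before";
  return "equal";
}

/** Apply an incoming event only if it is strictly newer than the stored projection. */
function shouldApply(incoming: VectorClock, stored: VectorClock): boolean {
  return compareClocks(incoming, stored) === "after";
}

console.log(shouldApply({ pricing: 2 }, { pricing: 3 })); // false (stale event)
console.log(shouldApply({ pricing: 4 }, { pricing: 3 })); // true
```

Duplicates compare as `"equal"` and are skipped, so redelivery is idempotent. Concurrent clocks need an explicit resolution rule, which this sketch leaves out.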
R-4 — Catastrophic data loss in BigQuery event archive
| Aspect | Detail |
|---|---|
| Description | The replay-source archive is corrupted or partially deleted, blocking index rebuild from epoch. |
| Likelihood | 1 |
| Impact | 4 |
| Inherent | 4 |
| Mitigation | Dual-region GCS Object Lock (90 d) on archive copies; nightly snapshot integrity check; for windows of ≤ 7 d, the index can be rebuilt directly from canonical Postgres via outbox replay, without BigQuery. |
| Residual | 2 |
| Owner | platform data |
| Next action | Publish the quarterly archive integrity check log to the audit dashboard. |
R-5 — Boost-rule abuse (operator inflates rankings)
| Aspect | Detail |
|---|---|
| Description | A tenant operator sets an aggressive boost multiplier to dominate cross-tenant rankings unfairly. |
| Likelihood | 4 |
| Impact | 3 |
| Inherent | 12 |
| Mitigation | Domain multiplier clamp [0.1, 5.0]; ranker normalizes per-region; sponsored slots are a separate (Phase 3) channel; audit topic + Looker watchlist for spikes. |
| Residual | 6 |
| Owner | product + platform trust |
| Next action | Add an anomaly detector on multiplier velocity and a per-region ratio cap. |
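
The domain clamp is simple enough to show directly. In this sketch the bounds [0.1, 5.0] come from the register; the function name and the choice to reject non-finite input are assumptions.

```typescript
const MIN_MULTIPLIER = 0.1;
const MAX_MULTIPLIER = 5.0;

/** Clamp an operator-supplied boost multiplier into [0.1, 5.0]. */
function clampBoostMultiplier(requested: number): number {
  // Assumption: NaN and Infinity are rejected rather than silently clamped.
  if (!Number.isFinite(requested)) {
    throw new RangeError("boost multiplier must be a finite number");
  }
  return Math.min(MAX_MULTIPLIER, Math.max(MIN_MULTIPLIER, requested));
}

console.log(clampBoostMultiplier(50));  // 5
console.log(clampBoostMultiplier(0));   // 0.1
console.log(clampBoostMultiplier(1.5)); // 1.5
```
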
R-6 — Hot-query DoS on a single canonical query
| Aspect | Detail |
|---|---|
| Description | A scraping bot or naive client hammers one expensive query, saturating OpenSearch shards. |
| Likelihood | 4 |
| Impact | 3 |
| Inherent | 12 |
| Mitigation | Canonical-query negative cache; Cloud Armor per-IP rate limit; query-cost guard (deep paging cap, bbox area cap, facet count cap); WAF preconfigured rules. |
| Residual | 4 |
| Owner | SRE |
| Next action | Add per-canonical-hash sliding window throttle (currently per-IP only). |
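
A query-cost guard of the kind listed in the mitigation could be sketched as below. The 10 000-document window matches the deep-paging cap cited under R-7; the bounding-box and facet limits, the query shape, and the function name are hypothetical values chosen for illustration.

```typescript
// Hypothetical query shape; the real request model may differ.
interface SearchQueryShape {
  from: number;
  size: number;
  bbox?: { minLat: number; minLon: number; maxLat: number; maxLon: number };
  facets: string[];
}

const COST_LIMITS = {
  maxResultWindow: 10000, // deep-paging cap cited under R-7
  maxBboxAreaDeg2: 25,    // assumption: crude area cap in square degrees
  maxFacets: 10,          // assumption
};

/** Return the list of cost violations; an empty list means the query may run. */
function costViolations(q: SearchQueryShape): string[] {
  const violations: string[] = [];
  if (q.from + q.size > COST_LIMITS.maxResultWindow) {
    violations.push("deep_paging");
  }
  if (q.bbox !== undefined) {
    const area = (q.bbox.maxLat - q.bbox.minLat) * (q.bbox.maxLon - q.bbox.minLon);
    if (area > COST_LIMITS.maxBboxAreaDeg2) violations.push("bbox_area");
  }
  if (q.facets.length > COST_LIMITS.maxFacets) {
    violations.push("facet_count");
  }
  return violations;
}

console.log(costViolations({ from: 9990, size: 20, facets: ["city"] })); // [ 'deep_paging' ]
```

Rejecting the query before it reaches OpenSearch keeps the cost of an abusive request close to zero on the shard side.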
R-7 — Cursor forgery to enumerate index
| Aspect | Detail |
|---|---|
| Description | Attacker forges or mutates pagination cursors to scrape the index past the 10 000-doc deep-paging cap. |
| Likelihood | 2 |
| Impact | 2 |
| Inherent | 4 |
| Mitigation | HMAC-signed cursors with 24 h key rotation; deep-paging hard cap; rate limits. |
| Residual | 1 |
| Owner | security |
| Next action | none |
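
HMAC-signed cursors with rotating keys can be sketched with Node's built-in `crypto` module. The cursor layout `<keyId>.<payload>.<signature>`, the key-ring shape, and the function names are assumptions; the register only states that cursors are HMAC-signed and keys rotate every 24 h.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Assumption: key ids contain no "." and map to shared secrets.
type KeyRing = Record<string, string>;

function signPayload(payload: string, secret: string): string {
  return createHmac("sha256", secret).update(payload).digest("base64url");
}

/** Serialize pagination state and sign it with the current key. */
function encodeCursor(state: object, keyId: string, keys: KeyRing): string {
  const secret = keys[keyId];
  if (secret === undefined) throw new Error(`unknown key id: ${keyId}`);
  const payload = Buffer.from(JSON.stringify(state)).toString("base64url");
  return `${keyId}.${payload}.${signPayload(`${keyId}.${payload}`, secret)}`;
}

/** Return the decoded state, or null if the cursor is malformed, forged, or expired. */
function decodeCursor(cursor: string, keys: KeyRing): unknown {
  const parts = cursor.split(".");
  if (parts.length !== 3) return null;
  const [keyId, payload, signature] = parts;
  const secret = keys[keyId];
  if (secret === undefined) return null; // key rotated out, or a forged key id
  const expected = Buffer.from(signPayload(`${keyId}.${payload}`, secret));
  const actual = Buffer.from(signature);
  // timingSafeEqual requires equal lengths, so check that first.
  if (expected.length !== actual.length || !timingSafeEqual(expected, actual)) {
    return null;
  }
  return JSON.parse(Buffer.from(payload, "base64url").toString("utf8"));
}

const keyRing: KeyRing = { k1: "previous-secret", k2: "current-secret" };
const cursor = encodeCursor({ offset: 40 }, "k2", keyRing);
console.log(decodeCursor(cursor, keyRing) !== null); // true
```

Keeping the previous key in the ring during the rotation window lets in-flight cursors keep working; once a key is removed, every cursor signed with it is rejected.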
R-8 — Tenant cascade purge incomplete (data retention violation)
| Aspect | Detail |
|---|---|
| Description | A tenant.deleted.v1 event is acked but cascade fails partway, leaving rows in the index past the SLO. |
| Likelihood | 2 |
| Impact | 4 |
| Inherent | 8 |
| Mitigation | Resumable cascade keyed on tenant_id index; explicit ack only after OpenSearchEnginePort.deleteByTenant returns; auditor flags orphan rows nightly. |
| Residual | 3 |
| Owner | service owner + DPO |
| Next action | Add an SLO for cascade duration and a paging alert above 60 min. |
R-9 — Wrong region pinning leaks property to disallowed region
| Aspect | Detail |
|---|---|
| Description | A property destined for region IR ends up indexed in AF due to bad upstream metadata or a policy bug. |
| Likelihood | 3 |
| Impact | 3 |
| Inherent | 9 |
| Mitigation | RegionPinningPolicy validates against allow-list at projection time; auditor counter region_mismatch_total; periodic full sweep. |
| Residual | 4 |
| Owner | service owner |
| Next action | Add a weekly sampled audit of property IDs. |
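
A projection-time region check might look like the following sketch. The region codes IR and AF and the counter name `region_mismatch_total` appear in the register; the allow-list contents, the function signature, and the in-memory counter are assumptions, and the real `RegionPinningPolicy` may be structured differently.

```typescript
// Assumption: allow-list contents are illustrative.
const REGION_ALLOW_LIST: ReadonlySet<string> = new Set(["IR", "AF"]);

// Stand-in for the region_mismatch_total metric; a real service would use a metrics client.
let regionMismatchTotal = 0;

interface PinCheck {
  ok: boolean;
  reason: string;
}

/** Validate that a property is indexed only in its declared, allow-listed region. */
function checkRegionPin(declaredRegion: string, targetIndexRegion: string): PinCheck {
  if (!REGION_ALLOW_LIST.has(declaredRegion)) {
    regionMismatchTotal += 1; // bad upstream metadata
    return { ok: false, reason: `region ${declaredRegion} is not on the allow-list` };
  }
  if (declaredRegion !== targetIndexRegion) {
    regionMismatchTotal += 1; // routing or policy bug
    return { ok: false, reason: `declared ${declaredRegion} but target is ${targetIndexRegion}` };
  }
  return { ok: true, reason: "pinned" };
}

console.log(checkRegionPin("IR", "AF").ok); // false
console.log(regionMismatchTotal);           // 1
```
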
R-10 — AI orchestrator cost runaway
| Aspect | Detail |
|---|---|
| Description | Embedding job or intent parser runs unbounded due to a misconfigured cache key or schema change, blowing through the AI budget. |
| Likelihood | 2 |
| Impact | 3 |
| Inherent | 6 |
| Mitigation | Hard rate-limit at orchestrator side; per-service monthly cost cap; cache hit ratio dashboard; alert at 80 % budget. |
| Residual | 3 |
| Owner | service owner + platform finance |
| Next action | Add cost regression test in nightly CI (compare AI spend vs baseline). |
R-11 — Vendor lock-in on Aiven OpenSearch
| Aspect | Detail |
|---|---|
| Description | If Aiven prices increase or service degrades, migrating away is costly. |
| Likelihood | 2 |
| Impact | 3 |
| Inherent | 6 |
| Mitigation | OpenSearch is open-source; index template is portable; index rebuild from BigQuery archive enables quick migration to self-managed OpenSearch on GKE/Cloud Run as called out in SERVICE_OVERVIEW § Tech stack. |
| Residual | 3 |
| Owner | platform owner |
| Next action | Annual cost-vs-self-managed analysis. |
R-12 — Multilingual collation incorrectness
| Aspect | Detail |
|---|---|
| Description | Pashto/Dari/Tajik collation in OpenSearch returns poor matches due to missing analyzer plugins or tokenizer drift between versions. |
| Likelihood | 3 |
| Impact | 3 |
| Inherent | 9 |
| Mitigation | Curated multi-script test corpus with golden top-N per query; analyzer pinned in index template; ICU tokenizer + custom filters; analyzer changes require full reindex. |
| Residual | 4 |
| Owner | search domain expert |
| Next action | Quarterly re-validation by native speakers (volunteer panel + paid review). |
R-13 — Pub/Sub schema drift on consumed events
| Aspect | Detail |
|---|---|
| Description | Upstream service ships a new event payload version without coordination, breaking the consumer. |
| Likelihood | 3 |
| Impact | 4 |
| Inherent | 12 |
| Mitigation | AsyncAPI registry CI check at publisher and consumer side; topic-major-version bumps require dual-publish ≥ 30 d; consumer is tolerant of unknown additive fields. |
| Residual | 4 |
| Owner | platform integration |
| Next action | Quarterly schema-compat review with upstream owners. |
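
A consumer that tolerates unknown additive fields can be sketched as a parser that reads only the fields it knows and rejects unsupported major versions. The event shape (`PriceChangedPayload`), its field names, and the supported major version are hypothetical; the register does not define the payload.

```typescript
// Hypothetical payload for illustration; real event schemas live in the AsyncAPI registry.
interface PriceChangedPayload {
  propertyId: string;
  amountMinor: number;
  currency: string;
}

type ParseOutcome =
  | { ok: true; value: PriceChangedPayload; ignoredFields: string[] }
  | { ok: false; error: string };

const SUPPORTED_MAJOR = 1; // assumption
const KNOWN_FIELDS = ["propertyId", "amountMinor", "currency"];

/** Parse a payload, ignoring unknown additive fields but rejecting breaking changes. */
function parsePriceChanged(topicMajor: number, raw: unknown): ParseOutcome {
  if (topicMajor !== SUPPORTED_MAJOR) {
    return { ok: false, error: `unsupported major version ${topicMajor}` };
  }
  if (typeof raw !== "object" || raw === null) {
    return { ok: false, error: "payload is not an object" };
  }
  const rec = raw as Record<string, unknown>;
  const { propertyId, amountMinor, currency } = rec;
  if (typeof propertyId !== "string" || typeof amountMinor !== "number" || typeof currency !== "string") {
    return { ok: false, error: "required field missing or wrong type" };
  }
  // Unknown fields are tolerated and surfaced for observability, never indexed.
  const ignoredFields = Object.keys(rec).filter((k) => KNOWN_FIELDS.indexOf(k) === -1);
  return { ok: true, value: { propertyId, amountMinor, currency }, ignoredFields };
}

const outcome = parsePriceChanged(1, {
  propertyId: "p-1",
  amountMinor: 12500,
  currency: "USD",
  promoTag: "new-field",
});
console.log(outcome.ok); // true
```

Counting `ignoredFields` per topic gives an early signal that an upstream has started shipping new fields before any coordinated schema review.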
R-14 — On-call burnout / single-team dependency
| Aspect | Detail |
|---|---|
| Description | Service owned by a small team; on-call coverage gaps after attrition. |
| Likelihood | 3 |
| Impact | 3 |
| Inherent | 9 |
| Mitigation | Follow-the-sun rotation across 2 regions; runbooks tested by non-team engineers; cross-train two engineers per quarter. |
| Residual | 4 |
| Owner | engineering manager |
| Next action | Document the top-5 incidents annually for new responders. |
R-15 — Migration regression (expand→backfill→contract slip)
| Aspect | Detail |
|---|---|
| Description | A migration step is skipped or merged out of order, leaving partial schema state. |
| Likelihood | 2 |
| Impact | 4 |
| Inherent | 8 |
| Mitigation | Migration plan dry-run in CI; phased deployment per MIGRATION_PLAN.md; release manager checklist. |
| Residual | 3 |
| Owner | service owner + platform DBA |
| Next action | Add a weekly drift detector against expected.schema.sql. |
Summary heat map
| ID | Title | Inherent | Residual | Trend |
|---|---|---|---|---|
| R-1 | Cross-tenant PII leak | 20 | 4 | ↘ |
| R-2 | OpenSearch outage | 12 | 6 | → |
| R-3 | Stale projection | 12 | 4 | ↘ |
| R-4 | BigQuery archive loss | 4 | 2 | → |
| R-5 | Boost-rule abuse | 12 | 6 | → |
| R-6 | Hot-query DoS | 12 | 4 | ↘ |
| R-7 | Cursor forgery | 4 | 1 | → |
| R-8 | Cascade purge incomplete | 8 | 3 | ↘ |
| R-9 | Wrong region pinning | 9 | 4 | → |
| R-10 | AI cost runaway | 6 | 3 | → |
| R-11 | Aiven lock-in | 6 | 3 | → |
| R-12 | Multilingual collation | 9 | 4 | → |
| R-13 | Pub/Sub schema drift | 12 | 4 | ↘ |
| R-14 | On-call burnout | 9 | 4 | → |
| R-15 | Migration regression | 8 | 3 | ↘ |
Reviewed monthly by the service owner and SRE; security risks (R-1, R-5, R-7, R-8) are additionally reviewed monthly with security.