search-aggregation-service — SERVICE_RISK_REGISTER

Companion: SECURITY_MODEL · FAILURE_MODES · SERVICE_READINESS · DEPLOYMENT_TOPOLOGY

Living register, reviewed monthly. Inherent risk = Likelihood × Impact, scored before mitigation; residual is the post-mitigation score. Likelihood and Impact are each rated from 1 (low) to 5 (critical), so scores range from 1 to 25.
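
As a worked illustration of the scoring rule, the sketch below computes an inherent score and rejects ratings outside the 1 to 5 scale. The function name `inherent_score` is hypothetical and not part of the service.

```python
def inherent_score(likelihood: int, impact: int) -> int:
    """Inherent risk = likelihood x impact, each rated on the 1-5 scale."""
    for name, value in (("likelihood", likelihood), ("impact", impact)):
        if not 1 <= value <= 5:
            raise ValueError(f"{name} must be between 1 and 5, got {value}")
    return likelihood * impact


# R-1 in this register is rated likelihood 4, impact 5.
r1_inherent = inherent_score(4, 5)
```

With the R-1 ratings this gives 4 × 5 = 20, matching the Inherent value recorded for that risk.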


R-1 — Cross-tenant PII leak via search index

| Aspect | Detail |
| --- | --- |
| Description | A forbidden upstream field (e.g. ownerEmail, bankAccount, lockSerial) is projected into the search index and returned to anonymous public traffic. |
| Likelihood | 4 |
| Impact | 5 (regulatory + reputational) |
| Inherent | 20 |
| Mitigation | 4-layer allow-list: type guard, projection policy, schema (dynamic: strict), nightly audit. CI gates on schema diff. PRs touching the allow-list require security review. |
| Residual | 4 |
| Owner | service owner + security |
| Next action | Quarterly review of allow-list against new upstream fields (property-service, pricing-service schema diffs). |
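
The projection-policy and nightly-audit layers of the allow-list can be sketched roughly as follows. This is a minimal illustration under assumptions: the contents of `ALLOWED_FIELDS` and both function names are hypothetical, and the real allow-list lives in the service's own policy code.

```python
# Hypothetical allow-list; the field names here are placeholders.
ALLOWED_FIELDS = frozenset({"propertyId", "title", "region", "price"})


def project_for_index(upstream: dict) -> dict:
    """Copy only allow-listed fields; everything else is dropped by default."""
    return {key: value for key, value in upstream.items() if key in ALLOWED_FIELDS}


def audit_document(doc: dict) -> list:
    """Return the names of fields that should not appear in an indexed document."""
    return sorted(set(doc) - ALLOWED_FIELDS)
```

Because the projection is deny-by-default, a new upstream field such as ownerEmail stays out of the index until someone adds it to the allow-list, which is the change that requires security review.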

R-2 — OpenSearch outage degrading consumer surface

| Aspect | Detail |
| --- | --- |
| Description | An Aiven cluster outage or shard-failure storm causes search latency to balloon or requests to fail with 5xx. |
| Likelihood | 3 |
| Impact | 4 |
| Inherent | 12 |
| Mitigation | Postgres-only fallback with degradationLevel="opensearch_unavailable"; circuit breaker; cache TTL extension; per-region clusters. |
| Residual | 6 (latency degraded but service stays up) |
| Owner | SRE |
| Next action | Load-test fallback at peak RPS; currently rehearsed at 50 % peak only. |
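
One way the circuit breaker and Postgres-only fallback could fit together is sketched below. The class, thresholds, and the `opensearch_search` / `postgres_search` callables are assumptions for illustration; only the `degradationLevel` value comes from this register.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Opens after N consecutive failures, allows a trial call after a cool-down."""

    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: permit a trial call once the cool-down has elapsed.
        return self.clock() - self.opened_at >= self.reset_after_s

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()


def search(query, breaker: CircuitBreaker, opensearch_search, postgres_search) -> dict:
    """Try OpenSearch while the breaker allows it, otherwise serve from Postgres."""
    if breaker.allow():
        try:
            hits = opensearch_search(query)
            breaker.record_success()
            return {"hits": hits, "degradationLevel": None}
        except Exception:
            breaker.record_failure()
    return {"hits": postgres_search(query),
            "degradationLevel": "opensearch_unavailable"}
```

While the breaker is open, requests skip OpenSearch entirely, so the fallback path carries the full load. That is why rehearsing it at peak RPS matters.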

R-3 — Stale projection from upstream event lag or out-of-order delivery

| Aspect | Detail |
| --- | --- |
| Description | Out-of-order or delayed events from pricing-service or inventory-service cause stale prices or availability to be shown to users. |
| Likelihood | 4 |
| Impact | 3 |
| Inherent | 12 |
| Mitigation | Pub/Sub ordering by propertyId; vector-clock guard; freshness SLO (p95 < 30 s) with burn alerts; drift sweep job. |
| Residual | 4 |
| Owner | service owner |
| Next action | Add per-upstream producer_lag_seconds SLO and dashboard. |
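
The idea behind the ordering guard can be illustrated with a simplified sketch. It uses a single monotonically increasing version per propertyId as a stand-in for the vector clock, so it is not the service's actual guard; the `version` and `price` field names are assumptions.

```python
def apply_event(projection: dict, event: dict) -> bool:
    """Apply an event only if it is newer than the stored state for its propertyId.

    Returns True when the projection changed, False when the event was
    stale or a duplicate and was ignored.
    """
    key = event["propertyId"]
    current = projection.get(key)
    if current is not None and event["version"] <= current["version"]:
        return False  # out-of-order or redelivered event: keep the newer state
    projection[key] = {"version": event["version"], "price": event["price"]}
    return True
```

A late event with a lower version is dropped rather than overwriting a newer price, which makes redelivery and reordering safe to tolerate.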

R-4 — Catastrophic data loss in BigQuery event archive

| Aspect | Detail |
| --- | --- |
| Description | The replay-source archive is corrupted or partially deleted, blocking index rebuild from epoch. |
| Likelihood | 1 |
| Impact | 4 |
| Inherent | 4 |
| Mitigation | Dual-region GCS Object Lock (90 d) on archive copies; nightly snapshot integrity check; for windows of ≤ 7 d, the index can be rebuilt directly from canonical Postgres via outbox replay, without BigQuery. |
| Residual | 2 |
| Owner | platform data |
| Next action | Publish the archive integrity check log to the audit dashboard quarterly. |

R-5 — Boost-rule abuse (operator inflates rankings)

| Aspect | Detail |
| --- | --- |
| Description | A tenant operator sets an aggressive boost multiplier to dominate cross-tenant rankings unfairly. |
| Likelihood | 4 |
| Impact | 3 |
| Inherent | 12 |
| Mitigation | Domain multiplier clamp [0.1, 5.0]; ranker normalizes per-region; sponsored slots are a separate (Phase 3) channel; audit topic + Looker watchlist for spikes. |
| Residual | 6 |
| Owner | product + platform trust |
| Next action | Add anomaly detector on multiplier velocity; ratio-cap per region. |
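
The multiplier clamp is simple enough to show directly. The sketch below assumes the clamp is applied to every operator-supplied value before ranking; the function and constant names are hypothetical.

```python
import math

MIN_MULTIPLIER = 0.1
MAX_MULTIPLIER = 5.0


def clamp_multiplier(requested: float) -> float:
    """Clamp an operator-supplied boost multiplier into [0.1, 5.0]."""
    if math.isnan(requested):
        # NaN compares false against everything, so reject it explicitly.
        raise ValueError("multiplier must be a number")
    return max(MIN_MULTIPLIER, min(MAX_MULTIPLIER, requested))
```

The clamp bounds a single rule, but an operator can still sit at 5.0 permanently, which is why the register also lists per-region normalization and a velocity detector.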

R-6 — Hot-query DoS on a single canonical query

| Aspect | Detail |
| --- | --- |
| Description | A scraping bot or naive client hammers one expensive query, saturating OpenSearch shards. |
| Likelihood | 4 |
| Impact | 3 |
| Inherent | 12 |
| Mitigation | Canonical-query negative cache; Cloud Armor per-IP rate limit; query-cost guard (deep paging cap, bbox area cap, facet count cap); WAF preconfigured rules. |
| Residual | 4 |
| Owner | SRE |
| Next action | Add per-canonical-hash sliding window throttle (currently per-IP only). |
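
The proposed per-canonical-hash throttle might look roughly like this. It is a sketch under assumptions: the canonicalization (sorted JSON hashed with SHA-256) and the in-memory store are illustrative choices, and a real deployment would likely need a shared store across instances.

```python
import hashlib
import json
from collections import defaultdict, deque


def canonical_hash(params: dict) -> str:
    """Hash query parameters so that key order does not change the result."""
    normalized = json.dumps(params, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class SlidingWindowThrottle:
    """Allow at most `limit` requests per key within any `window_s` seconds."""

    def __init__(self, limit: int, window_s: float) -> None:
        self.limit = limit
        self.window_s = window_s
        self._hits = defaultdict(deque)

    def allow(self, key: str, now: float) -> bool:
        hits = self._hits[key]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_s:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True
```

Keying on the canonical hash rather than the client IP means a distributed scraper repeating one expensive query is throttled even when each IP stays under its own limit.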

R-7 — Cursor forgery to enumerate index

| Aspect | Detail |
| --- | --- |
| Description | Attacker forges or mutates pagination cursors to scrape the index past the 10 000-doc deep-paging cap. |
| Likelihood | 2 |
| Impact | 2 |
| Inherent | 4 |
| Mitigation | HMAC-signed cursors with 24 h key rotation; deep-paging hard cap; rate limits. |
| Residual | 1 |
| Owner | security |
| Next action | none |
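
HMAC-signed cursors can be sketched with the standard library as below. The cursor layout (base64 payload, a dot, then a hex tag) and the practice of accepting the previous key during rotation are assumptions for illustration, not the service's documented format.

```python
import base64
import hashlib
import hmac
import json
from typing import Optional, Sequence


def sign_cursor(state: dict, key: bytes) -> str:
    """Encode pagination state and append an HMAC-SHA256 tag."""
    payload = base64.urlsafe_b64encode(
        json.dumps(state, sort_keys=True).encode("utf-8")
    ).decode("ascii")
    tag = hmac.new(key, payload.encode("ascii"), hashlib.sha256).hexdigest()
    return payload + "." + tag


def verify_cursor(cursor: str, keys: Sequence[bytes]) -> Optional[dict]:
    """Return the pagination state if any accepted key validates the tag."""
    payload, sep, tag = cursor.rpartition(".")
    if not sep:
        return None
    for key in keys:  # assumed order: current key first, then the previous one
        expected = hmac.new(key, payload.encode("utf-8"), hashlib.sha256).hexdigest()
        # Constant-time comparison on bytes avoids timing side channels.
        if hmac.compare_digest(expected.encode("ascii"), tag.encode("utf-8")):
            return json.loads(base64.urlsafe_b64decode(payload))
    return None
```

The payload is decoded only after the tag verifies, so a mutated cursor is rejected before any of its contents are trusted.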

R-8 — Tenant cascade purge incomplete (data retention violation)

| Aspect | Detail |
| --- | --- |
| Description | A tenant.deleted.v1 event is acked but cascade fails partway, leaving rows in the index past the SLO. |
| Likelihood | 2 |
| Impact | 4 |
| Inherent | 8 |
| Mitigation | Resumable cascade keyed on tenant_id index; explicit ack only after OpenSearchEnginePort.deleteByTenant returns; auditor flags orphan rows nightly. |
| Residual | 3 |
| Owner | service owner + DPO |
| Next action | Add an SLO for cascade duration and a paging alert above 60 min. |
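
The "ack only after the delete returns" rule can be sketched as a handler like the one below. Here `delete_by_tenant` is a Python-style stand-in for OpenSearchEnginePort.deleteByTenant, and the message object with `tenant_id`, `ack()` and `nack()` is an assumed interface.

```python
def handle_tenant_deleted(message, engine) -> bool:
    """Purge a tenant from the index and ack only if the purge completed."""
    try:
        engine.delete_by_tenant(message.tenant_id)
    except Exception:
        # Leave the message unacknowledged so it is redelivered and the
        # resumable cascade can pick up where it stopped.
        message.nack()
        return False
    message.ack()
    return True
```

Because redelivery repeats the delete, this pattern relies on the purge being idempotent for a given tenant_id.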

R-9 — Wrong region pinning leaks property to disallowed region

| Aspect | Detail |
| --- | --- |
| Description | A property destined for region IR ends up indexed in AF due to bad upstream metadata or policy bug. |
| Likelihood | 3 |
| Impact | 3 |
| Inherent | 9 |
| Mitigation | RegionPinningPolicy validates against allow-list at projection time; auditor counter region_mismatch_total; periodic full sweep. |
| Residual | 4 |
| Owner | service owner |
| Next action | Add property-id sample audit weekly. |

R-10 — AI orchestrator cost runaway

| Aspect | Detail |
| --- | --- |
| Description | Embedding job or intent parser runs unbounded due to a misconfigured cache key or schema change, blowing through the AI budget. |
| Likelihood | 2 |
| Impact | 3 |
| Inherent | 6 |
| Mitigation | Hard rate-limit at orchestrator side; per-service monthly cost cap; cache hit ratio dashboard; alert at 80 % budget. |
| Residual | 3 |
| Owner | service owner + platform finance |
| Next action | Add cost regression test in nightly CI (compare AI spend vs baseline). |
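
The budget alert and cost cap can be expressed as a small classification step. This is a sketch, assuming month-to-date spend and the cap share a currency unit; the function name and the returned labels are hypothetical.

```python
def budget_status(spend: float, monthly_cap: float, alert_ratio: float = 0.8) -> str:
    """Classify month-to-date AI spend against the per-service monthly cap."""
    if monthly_cap <= 0:
        raise ValueError("monthly_cap must be positive")
    if spend >= monthly_cap:
        return "blocked"  # cap reached: stop issuing further AI calls
    if spend >= alert_ratio * monthly_cap:
        return "alert"    # 80 % threshold from the mitigation above
    return "ok"
```

For example, with a cap of 1000, a spend of 800 returns "alert" and a spend of 1000 returns "blocked".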

R-11 — Vendor lock-in on Aiven OpenSearch

| Aspect | Detail |
| --- | --- |
| Description | If Aiven prices increase or service degrades, migrating away is costly. |
| Likelihood | 2 |
| Impact | 3 |
| Inherent | 6 |
| Mitigation | OpenSearch is open-source; index template is portable; index rebuild from BigQuery archive enables quick migration to self-managed OpenSearch on GKE/Cloud Run as called out in SERVICE_OVERVIEW § Tech stack. |
| Residual | 3 |
| Owner | platform owner |
| Next action | Annual cost-vs-self-managed analysis. |

R-12 — Multilingual collation incorrectness

| Aspect | Detail |
| --- | --- |
| Description | Pashto/Dari/Tajik collation in OpenSearch returns poor matches due to missing analyzer plugins or tokenizer drift between versions. |
| Likelihood | 3 |
| Impact | 3 |
| Inherent | 9 |
| Mitigation | Curated multi-script test corpus with golden top-N per query; analyzer pinned in index template; ICU tokenizer + custom filters; analyzer changes require full reindex. |
| Residual | 4 |
| Owner | search domain expert |
| Next action | Quarterly re-validation by native speakers (volunteer panel + paid review). |

R-13 — Pub/Sub schema drift on consumed events

| Aspect | Detail |
| --- | --- |
| Description | Upstream service ships a new event payload version without coordination, breaking the consumer. |
| Likelihood | 3 |
| Impact | 4 |
| Inherent | 12 |
| Mitigation | AsyncAPI registry CI check at publisher and consumer side; topic-major-version bumps require dual-publish ≥ 30 d; consumer is tolerant of unknown additive fields. |
| Residual | 4 |
| Owner | platform integration |
| Next action | Quarterly schema-compat review with upstream owners. |
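
A consumer that tolerates unknown additive fields but refuses an unexpected major version might be sketched like this. The event type, its fields, and `SUPPORTED_MAJOR` are hypothetical examples, not the service's actual schema.

```python
from dataclasses import dataclass

SUPPORTED_MAJOR = 1  # assumption: this consumer handles only v1 payloads


@dataclass(frozen=True)
class PriceChanged:
    property_id: str
    price: float


def parse_price_changed(payload: dict, major_version: int) -> PriceChanged:
    """Read only the fields the consumer needs; ignore any additional ones."""
    if major_version != SUPPORTED_MAJOR:
        raise ValueError(f"unsupported major version {major_version}")
    try:
        return PriceChanged(
            property_id=str(payload["propertyId"]),
            price=float(payload["price"]),
        )
    except KeyError as missing:
        raise ValueError(f"missing required field {missing}") from missing
```

Additive changes pass through untouched, while a removed field or a major bump fails loudly instead of producing a silently wrong projection.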

R-14 — On-call burnout / single-team dependency

| Aspect | Detail |
| --- | --- |
| Description | Service owned by a small team; on-call coverage gaps after attrition. |
| Likelihood | 3 |
| Impact | 3 |
| Inherent | 9 |
| Mitigation | Follow-the-sun rotation across 2 regions; runbooks tested by non-team engineers; cross-train two engineers per quarter. |
| Residual | 4 |
| Owner | engineering manager |
| Next action | Document the top-5 incidents annually for new responders. |

R-15 — Migration regression (expand→backfill→contract slip)

| Aspect | Detail |
| --- | --- |
| Description | A migration step is skipped or merged out of order, leaving partial schema state. |
| Likelihood | 2 |
| Impact | 4 |
| Inherent | 8 |
| Mitigation | Migration plan dry-run in CI; phased deployment per MIGRATION_PLAN.md; release manager checklist. |
| Residual | 3 |
| Owner | service owner + platform DBA |
| Next action | Add a weekly drift detector against expected.schema.sql. |

Summary heat map

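| ID | Title | Inherent | Residual | Trend |
| --- | --- | --- | --- | --- |
| R-1 | Cross-tenant PII leak | 20 | 4 | |
| R-2 | OpenSearch outage | 12 | 6 | |
| R-3 | Stale projection | 12 | 4 | |
| R-4 | BigQuery archive loss | 4 | 2 | |
| R-5 | Boost-rule abuse | 12 | 6 | |
| R-6 | Hot-query DoS | 12 | 4 | |
| R-7 | Cursor forgery | 4 | 1 | |
| R-8 | Cascade purge incomplete | 8 | 3 | |
| R-9 | Wrong region pinning | 9 | 4 | |
| R-10 | AI cost runaway | 6 | 3 | |
| R-11 | Aiven lock-in | 6 | 3 | |
| R-12 | Multilingual collation | 9 | 4 | |
| R-13 | Pub/Sub schema drift | 12 | 4 | |
| R-14 | On-call burnout | 9 | 4 | |
| R-15 | Migration regression | 8 | 3 | |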

Reviewed monthly by the service owner and SRE; the security risks (R-1, R-5, R-7, R-8) are additionally reviewed with security each month.