SERVICE_RISK_REGISTER — bff-consumer-service
Sibling: SERVICE_READINESS · FAILURE_MODES · SECURITY_MODEL
Living register of known risks. Each row carries an owner, a current likelihood × impact rating, mitigation status, and a review cadence. The register is reviewed quarterly by the Frontend Platform tech lead with SRE and security reviewers.
Severity scale (likelihood × impact):
- L = low
- M = medium
- H = high
- C = critical (production-stopping)
Status: open (active mitigation), monitored (mitigated but watched), accepted (residual risk acknowledged), closed (no longer applicable).
1. Strategic risks
| ID | Risk | Likelihood | Impact | Mitigation | Owner | Status | Review |
|---|---|---|---|---|---|---|---|
| R-S-1 | Marketing campaign creates 50× traffic spike (vs. 10× planned) | M | H | Cloud Armor rate-limit ratchet runbook; pre-warm autoscale to 80 instances; campaignMode flag toggled by ops; pre-event capacity test required for every major campaign | FE Platform | monitored | per campaign |
| R-S-2 | A bot operator scrapes the entire cross-tenant catalog and undercuts the meta layer | H | M | Bot detector with multi-signal scoring; reCAPTCHA Enterprise; per-fingerprint rate limit; legal cease-and-desist playbook | Security + FE Platform | open | quarterly |
| R-S-3 | Meta-layer search results stale during a tenant's price flash sale | M | M | melmastoon.search_aggregation.listing.indexed.v1 invalidates hot listing cache; tenant promotional events trigger same-second invalidate; priceFromCheapest always re-fetched; staleness banner shown when fallback used | FE Platform + Pricing | monitored | quarterly |
| R-S-4 | Phase 2 authenticated wishlist sync explodes in scope and pollutes anonymous path | M | M | Phase 2 design lives in _future/; feature flag scoped to authenticated users only; design review with architecture team before scaffolding starts | Architecture | open | annual |
2. Performance risks
| ID | Risk | Likelihood | Impact | Mitigation | Owner | Status | Review |
|---|---|---|---|---|---|---|---|
| R-P-1 | Single-flight collapses under sufficiently adversarial cache-key skew (e.g., random millisecond filters in query) | L | H | Cache key derivation strips cosmetic noise (rounding, ordering); fingerprinted-bot patterns trigger early; alarms on stampede metric | FE Platform | monitored | quarterly |
| R-P-2 | Memorystore eviction during traffic spike loses sessions and degrades conversion | M | M | 5 GiB working set with 30-day TTL; alert on eviction rate; auto-scale to 10 GiB during campaignMode | SRE | monitored | per campaign |
| R-P-3 | Outbox table grows unbounded if Pub/Sub publisher fails for hours | L | M | Alert at 5k / 50k / 250k row depth; manual flush script; outbox-relay redrive; storage-budget alarm | SRE | monitored | quarterly |
| R-P-4 | Cloud SQL HA failover takes > 60 s and degrades mutating endpoints | L | L | DR drill verifies actual failover time; idempotency keys absorb retries; failover times trended over 12 months | SRE | accepted | annual |
| R-P-5 | One slow upstream (e.g., pricing-service) dominates composition latency, which is bounded by the slowest of N upstream calls | H | M | Per-call deadlines; partial-result composer; priceFromCheapest=null fallback; SLO budget for upstream-attributed latency tracked separately | FE Platform | monitored | quarterly |
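
The cache-key mitigation for R-P-1 (stripping cosmetic noise through rounding and ordering) can be sketched as follows. This is a minimal illustration, not the service's actual derivation: the function name `normalize_cache_key`, the `ROUNDING` table, and the parameter names in it are all assumptions made for the example.

```python
from urllib.parse import parse_qsl, urlencode

# Hypothetical numeric filters and the number of decimals each is rounded to.
ROUNDING = {"lat": 2, "lng": 2, "minPrice": 0, "maxPrice": 0}


def normalize_cache_key(path: str, query: str) -> str:
    """Derive a cache key that ignores parameter order and numeric noise."""
    pairs = []
    for name, value in parse_qsl(query):
        digits = ROUNDING.get(name)
        if digits is not None:
            try:
                # Round, then format with fixed decimals so 69.170 and 69.17 match.
                value = f"{round(float(value), digits):.{digits}f}"
            except ValueError:
                pass  # non-numeric input is left as-is; validation is out of scope here
        pairs.append((name, value))
    # Sorting makes ?a=1&b=2 and ?b=2&a=1 resolve to the same key.
    pairs.sort()
    return f"{path}?{urlencode(pairs)}"
```

Two requests that differ only in parameter order or sub-precision digits then collapse to one key, so single-flight can deduplicate them. A real derivation would likely also drop or allow-list unknown parameters, which this sketch does not attempt.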
3. Security risks
| ID | Risk | Likelihood | Impact | Mitigation | Owner | Status | Review |
|---|---|---|---|---|---|---|---|
| R-Sec-1 | HMAC handoff key compromise lets an attacker forge handoffs into tenant booking | L | C | Key in Secret Manager with restricted access; rotation every 90 days with 7-day overlap; tenant BFF logs all handoff verifications and DLQs anomalies; rotation drill quarterly | Security + FE Platform | monitored | quarterly |
| R-Sec-2 | Bot detector false-negative — sophisticated bot blends in and harvests pricing | M | M | Multi-signal scoring; behavioural anomalies tracked; manual review of high-volume sessions; legal channel for repeat offenders | Security | open | quarterly |
| R-Sec-3 | Cookie hijack via XSS on @ghasi/app-web-meta | L | H | HttpOnly cookie; CSP with strict allow-list; CSRF protection on mutating endpoints via Origin header; periodic XSS audit on the consumer web app | Security + FE | monitored | quarterly |
| R-Sec-4 | Cross-tenant data leak via misconfigured cache key | L | C | Cache keys explicitly include tenant context where applicable (only for BrandPeek); tests assert key isolation; review every new caching adapter | FE Platform | monitored | quarterly |
| R-Sec-5 | reCAPTCHA secret key leak (secret shipped in client code) | L | M | Server-side verification only; site key public by design; secret in Secret Manager | Security | accepted | annual |
| R-Sec-6 | Search query log carries identifying user data | M | M | Search query strings hashed before logging; geo coords rounded to ~ 1 km; UA bucketed not stored raw; test asserts no email/phone patterns reach logs | Security + Data | monitored | quarterly |
| R-Sec-7 | DDoS bypasses Cloud Armor via low-volume distributed attack | M | M | Per-fingerprint rate limit; behavioural anomaly detection; on-call playbook for L7 DDoS | SRE + Security | monitored | quarterly |
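
The 7-day rotation overlap in R-Sec-1 implies that a verifier must accept signatures from both the incoming and the outgoing key for a while. A minimal sketch of that check, assuming HMAC-SHA256 with hex-encoded signatures (the register does not state the digest or encoding) and hypothetical helper names `sign` and `verify_handoff`:

```python
import hashlib
import hmac
from typing import Sequence


def sign(key: bytes, payload: bytes) -> str:
    """Hex-encoded HMAC-SHA256 of the payload (digest choice is an assumption)."""
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify_handoff(payload: bytes, signature: str, active_keys: Sequence[bytes]) -> bool:
    """Accept a signature produced by any key still inside its overlap window."""
    for key in active_keys:
        expected = sign(key, payload)
        # compare_digest avoids leaking the match position through timing.
        if hmac.compare_digest(expected.encode(), signature.encode()):
            return True
    return False
```

During the overlap the caller passes both keys, for example `[new_key, old_key]`; once the window closes the old key is removed from the list and signatures made with it stop verifying.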
4. Compliance & data risks
| ID | Risk | Likelihood | Impact | Mitigation | Owner | Status | Review |
|---|---|---|---|---|---|---|---|
| R-C-1 | EU traffic without consent banner triggers GDPR violation | L | H | Consumer web defers all telemetry until consent given; BFF accepts X-Consent: declined header and skips telemetry; DPIA on file | Legal + FE | monitored | annual |
| R-C-2 | Cookie banner blocks bot-detection signal collection and increases FP rate | M | M | Bot-detection runs from request signals (UA, IP-bucket, cadence) that don't require cookie consent; CAPTCHA challenge falls back gracefully | FE Platform | monitored | annual |
| R-C-3 | Data residency for EU users (Memorystore session in asia-south1) | M | M | Region-affinity routing planned in Phase 2; current scope: anonymous data only; no PII; legal review confirmed acceptable | Legal | accepted | annual |
5. Operational risks
| ID | Risk | Likelihood | Impact | Mitigation | Owner | Status | Review |
|---|---|---|---|---|---|---|---|
| R-O-1 | On-call burn-out due to bot-related noise | M | M | Alert tuning quarterly; bot-related alerts batched and not individually paging; weekly bot review | SRE | monitored | quarterly |
| R-O-2 | Schema drift from upstream service released without contract test | L | H | Pact provider verification gate; OpenAPI diff gate; nightly schema sync between BFF and upstreams | Platform Eng | monitored | quarterly |
| R-O-3 | Loss of a single Frontend Platform engineer creates bus-factor 1 on the orchestrators | L | M | Pair-on-call rota; runbooks complete; quarterly ops review with rotating reviewer | Eng Manager | monitored | annual |
6. Cost risks
| ID | Risk | Likelihood | Impact | Mitigation | Owner | Status | Review |
|---|---|---|---|---|---|---|---|
| R-Cost-1 | Cloud CDN costs spiral due to non-cacheable parameters in URLs | M | M | Query parameter normalization; Vary header strict; analytics dashboard tracks cache hit by route; quarterly review | FE Platform | monitored | quarterly |
| R-Cost-2 | Pub/Sub volume from telemetry exceeds budget | M | M | Sample rate per event documented; sampling enforced at outbox enqueue; adjustable per-event from feature flags; cost alarm at 120% of monthly budget | SRE | monitored | quarterly |
| R-Cost-3 | Excessive Trace export volume during incident | L | L | Trace sampler caps at 5% steady-state, raised to 100% during incident with timer-bound revert | SRE | monitored | quarterly |
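
The timer-bound revert in R-Cost-3 can be illustrated with a sampler whose incident override expires on its own. This is a sketch under assumptions: the class name `IncidentAwareSampler`, its methods, and the injectable `clock` and `rng` parameters are hypothetical, and a real deployment would wire the rate into the tracing SDK's own sampler interface.

```python
import random
import time


class IncidentAwareSampler:
    """Samples at a steady rate, or at 100% until an incident override expires."""

    def __init__(self, steady_rate=0.05, clock=time.monotonic, rng=random.random):
        self.steady_rate = steady_rate
        self._clock = clock  # injectable so the expiry can be tested
        self._rng = rng
        self._override_until = float("-inf")  # no override active initially

    def raise_for_incident(self, duration_s: float) -> None:
        """Sample everything for duration_s seconds, then revert automatically."""
        self._override_until = self._clock() + duration_s

    def current_rate(self) -> float:
        return 1.0 if self._clock() < self._override_until else self.steady_rate

    def should_sample(self) -> bool:
        return self._rng() < self.current_rate()
```

Because the revert is derived from the clock rather than from a manual reset, a forgotten override cannot keep export volume at 100% after the incident window ends.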
7. Risk acceptance log
| ID | Date accepted | Accepted by | Reason | Re-evaluation date |
|---|---|---|---|---|
| R-Sec-5 | 2026-04-15 | Security | reCAPTCHA secret stored correctly; site key public by design | 2027-04-15 |
| R-P-4 | 2026-04-15 | SRE | Cloud SQL HA failover within SLA on every drill in last 12 months | 2027-04-15 |
| R-C-3 | 2026-04-15 | Legal | No PII in cross-region session blob; Phase 2 will introduce region affinity | 2027-04-15 |
8. Review cadence
- Quarterly: Frontend Platform tech lead + SRE on-call + security reviewer convene; revisit every row marked monitored or open; promote or demote severity; close mitigated rows; capture new risks discovered since the last review.
- Per major release: any risk row touched by the release is re-rated.
- Per incident: post-mortem owners audit this register and add any new risk surfaced.