SERVICE_RISK_REGISTER — bff-tenant-booking-service
Sibling: SERVICE_READINESS · FAILURE_MODES · SECURITY_MODEL
Living register of known risks. Quarterly review by Frontend Platform tech lead + SRE + security.
Ratings (L×I = likelihood × impact): L low, M medium, H high, C critical.
Status: open, monitored, accepted, closed.
1. Strategic risks
| ID | Risk | L×I | Mitigation | Owner | Status | Review |
|---|---|---|---|---|---|---|
| R-S-1 | Tenant flash sale creates 30× spike that overwhelms pricing-service quote endpoint | M×H | Per-tenant rate limits; flashSale mode auto-scale to 60 instances; pre-warm bootstrap cache; capacity test required for every announced sale | FE Platform + Pricing | monitored | per sale |
| R-S-2 | Custom-domain support sprawl (50+ tenant domains) creates operational toil | H×M | Self-serve DNS-01 cert provisioning; automated DNS verification probe; tenant-managed-domain SLAs documented | SRE | open | quarterly |
| R-S-3 | Tenant brand violation via injected theme content (e.g., XSS in brand name) | M×H | Theme content sanitized at theme-config-service; CSP nonce; allow-list of HTML tags in policy summaries | Security + FE | monitored | quarterly |
| R-S-4 | Sharia-compliance regression in pricing/AI passthrough | L×H | complianceProfile propagation tested; AI suggestions filtered (AI §12); audit log on every quote | Compliance + FE | monitored | quarterly |
| R-S-5 | Phase 2 guest sign-in introduces account-takeover surface | M×H | Phase 2 design lives in _future/; security review with iam-service team before scaffolding; MFA required | Security | open | annual |
2. Performance & reliability risks
| ID | Risk | L×I | Mitigation | Owner | Status | Review |
|---|---|---|---|---|---|---|
| R-P-1 | Slowest-of-N composition on /availability makes p95 unbounded if pricing-service is slow | H×M | Per-call deadlines; partial result composer; priceFromCheapest=null fallback; SLO tracked separately | FE Platform | monitored | quarterly |
| R-P-2 | Memorystore session-tier eviction during traffic spike loses booking drafts | L×H | noeviction policy on session tier; alarm on memory utilization; auto-scale to 6 GiB during flashSale | SRE | monitored | per sale |
| R-P-3 | reservation-service confirm latency drift causes /return timeouts | M×H | Confirm SLO tracked; idempotency keys absorb retries; UI polling as fallback | FE Platform + Reservation | monitored | quarterly |
| R-P-4 | payment-gateway-service ambiguous-return cases create stranded reservations | M×H | Polling fallback in UI for 60 s; manual reconciliation runbook; monthly metric review of ambiguous-return rate | FE Platform + Payment | monitored | monthly |
| R-P-5 | Cloud SQL HA failover > 60 s during peak | L×M | DR drill verifies failover time; idempotency keys absorb retries; trended monthly | SRE | accepted | annual |
| R-P-6 | Outbox grows unbounded if Pub/Sub down for hours | L×M | Alert at 5k/50k/250k; manual flush script; outbox-relay redrive | SRE | monitored | quarterly |
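
The R-P-1 mitigation (partial result composer with a priceFromCheapest=null fallback) can be illustrated with a minimal sketch. The real /availability contract is not shown in this register, so the types `CallOutcome`, `RoomOffer`, `PriceQuote` and `AvailabilityResponse`, the `partial` flag, and the function `composeAvailability` are hypothetical names introduced here. The sketch assumes each upstream call has already been settled into either a value or a deadline/error outcome.

```typescript
// Hypothetical shapes, for illustration only.
type CallOutcome<T> =
  | { ok: true; value: T }
  | { ok: false; reason: "timeout" | "error" };

interface RoomOffer {
  roomTypeId: string;
  available: boolean;
}

interface PriceQuote {
  roomTypeId: string;
  amountMinor: number; // price in minor currency units
}

interface AvailabilityResponse {
  rooms: RoomOffer[];
  priceFromCheapest: number | null; // null when no usable price is known
  partial: boolean; // true when an upstream call missed its deadline or failed
}

// Compose a response from whatever upstream calls finished in time,
// instead of waiting for the slowest one.
function composeAvailability(
  rooms: RoomOffer[],
  pricing: CallOutcome<PriceQuote[]>,
): AvailabilityResponse {
  if (!pricing.ok) {
    // Pricing missed its deadline or errored: degrade rather than fail the request.
    return { rooms, priceFromCheapest: null, partial: true };
  }
  // Only consider quotes for room types that are actually available.
  const amounts = pricing.value
    .filter((q) => rooms.some((r) => r.available && r.roomTypeId === q.roomTypeId))
    .map((q) => q.amountMinor);
  return {
    rooms,
    priceFromCheapest: amounts.length > 0 ? Math.min(...amounts) : null,
    partial: false,
  };
}
```

Under these assumptions the pricing deadline bounds the response time, and the UI can render availability without a "from" price when pricing-service is slow.
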
3. Security risks
| ID | Risk | L×I | Mitigation | Owner | Status | Review |
|---|---|---|---|---|---|---|
| R-Sec-1 | HMAC handoff key compromise enables tenant impersonation | L×C | Secret Manager restricted access; rotation 90 d with 7-d overlap; verifier logs all attempts; rotation drill quarterly | Security + FE | monitored | quarterly |
| R-Sec-2 | Cross-tenant cache key collision leaks data | L×C | Tenant-scoped cache keys; tenant-isolation.spec.ts; nightly cross-tenant probe | FE Platform | monitored | quarterly |
| R-Sec-3 | Payment-return tampering (forged returnState) | L×C | Server-side verifyReturn mandatory; provider-state opaque; payment_return_invalid alert | Security + Payment | monitored | quarterly |
| R-Sec-4 | Booking-flow scraper undercuts pricing | M×M | Per-IP / per-session rate limits; reCAPTCHA on anomalies; behavioural anomaly score | Security + FE | open | quarterly |
| R-Sec-5 | Cookie hijack via subdomain takeover | L×H | Subdomain-scoped cookie; CAA records; SNI cert per tenant; periodic DNS audit | Security | monitored | quarterly |
| R-Sec-6 | PII leak via free-text specialRequests field carrying NID/passport | M×M | Input length cap (500 chars); pre-storage redaction heuristic; truncated to 200 chars in cold mirror; periodic synthetic PII probe | Security + Data | monitored | quarterly |
| R-Sec-7 | Custom-domain MITM via misconfigured DNS | L×H | DNS-01 challenge required; CAA records advised; tenant onboarding playbook | Security | open | annual |
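
For R-Sec-2, the idea behind tenant-scoped cache keys can be sketched as follows. The key layout (`t:<tenant>:<namespace>:<parts>`) and the helper name `tenantCacheKey` are assumptions made for illustration; the register only states that keys are tenant-scoped. The sketch encodes every segment so that a delimiter inside a value cannot produce a key that looks like another tenant's.

```typescript
// Hypothetical helper: builds a cache key whose first segment is always the tenant.
function tenantCacheKey(tenantId: string, namespace: string, ...parts: string[]): string {
  if (tenantId.length === 0) {
    // Refuse to build an unscoped key rather than silently sharing entries.
    throw new Error("tenantId is required for every cache key");
  }
  // Percent-encode each segment; encodeURIComponent escapes ':' as '%3A',
  // so a value containing the delimiter cannot forge extra segments.
  const encode = (segment: string): string => encodeURIComponent(segment);
  return ["t", encode(tenantId), encode(namespace), ...parts.map(encode)].join(":");
}
```

With this layout, `tenantCacheKey("acme", "quote", "room-1")` yields `t:acme:quote:room-1`, while a tenant ID of `acme:quote` yields a key starting with `t:acme%3Aquote:`, so the two cannot collide.
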
4. Compliance & data risks
| ID | Risk | L×I | Mitigation | Owner | Status | Review |
|---|---|---|---|---|---|---|
| R-C-1 | EU traffic without consent banner | L×H | App defers telemetry until consent; BFF respects X-Consent: declined; DPIA on file | Legal + FE | monitored | annual |
| R-C-2 | DSR for guest email returns inconsistent results across services | M×M | Tenant-orchestrated DSR via tenant-service; this BFF documented as snapshot-only source | Data Steward | monitored | quarterly |
| R-C-3 | Data residency for EU users (Memorystore in asia-south1) | M×M | Phase 2 region-affinity routing; current scope: short-lived booking-time PII; legal review confirmed acceptable | Legal | accepted | annual |
| R-C-4 | Booking confirmation email contains PII; downstream notification-service audit needed | L×M | Notification service has its own retention policy; this BFF only emits the event, not the email | Notification + Legal | monitored | annual |
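
The R-C-1 mitigation says the BFF respects `X-Consent: declined`. A minimal sketch of such a check is shown here; the `HeaderBag` type and the function `telemetryAllowed` are hypothetical. The register does not say how a missing header is treated, so this sketch assumes that only an explicit `declined` suppresses telemetry, on the grounds that the app is described as deferring telemetry until consent is given.

```typescript
// Hypothetical header container: header name -> value.
type HeaderBag = Record<string, string | undefined>;

// Returns false only when the request carries X-Consent: declined.
function telemetryAllowed(headers: HeaderBag): boolean {
  for (const name in headers) {
    // HTTP field names are case-insensitive, so compare in lower case.
    if (name.toLowerCase() === "x-consent") {
      const value = (headers[name] ?? "").trim().toLowerCase();
      if (value === "declined") {
        return false;
      }
    }
  }
  // Assumption: a missing or different value does not suppress telemetry.
  return true;
}
```

A stricter variant could treat a missing header as declined; which default is correct is a decision for Legal and is not stated in this register.
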
5. Operational risks
| ID | Risk | L×I | Mitigation | Owner | Status | Review |
|---|---|---|---|---|---|---|
| R-O-1 | Bus-factor 1 on the booking saga orchestration code | M×M | Pair-on-call; runbook completeness review; rotating reviewer | Eng Manager | monitored | annual |
| R-O-2 | Schema drift from upstream service released without contract test | L×H | Pact verification gate; OpenAPI diff gate; schema-drift alert | Platform Eng | monitored | quarterly |
| R-O-3 | On-call burn-out from custom-domain DNS failures | M×M | Tenant-self-serve dashboard; tenant-specific alerts batched | SRE | monitored | quarterly |
| R-O-4 | Misconfigured per-tenant rate limit blocks legitimate flash-sale traffic | L×M | Tenant-specific rate-limit overrides via flag; pre-sale capacity audit | SRE + FE | monitored | per sale |
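
The per-tenant rate limits in R-S-1 and the tenant-specific overrides in R-O-4 can be illustrated with a token-bucket sketch. This is an assumed algorithm, not necessarily the one the service uses, and `TenantRateLimiter`, `RateLimit`, and all numbers are hypothetical. The clock is passed in as a parameter so the behaviour is deterministic.

```typescript
// Hypothetical limit shape.
interface RateLimit {
  capacity: number; // maximum burst size, in requests
  refillPerSecond: number; // sustained request rate
}

interface Bucket {
  tokens: number;
  lastRefillMs: number;
}

class TenantRateLimiter {
  private readonly buckets = new Map<string, Bucket>();
  private readonly defaultLimit: RateLimit;
  private readonly overrides: Map<string, RateLimit>;

  constructor(defaultLimit: RateLimit, overrides: Map<string, RateLimit> = new Map()) {
    this.defaultLimit = defaultLimit;
    this.overrides = overrides;
  }

  // A tenant-specific override wins over the default limit.
  limitFor(tenantId: string): RateLimit {
    return this.overrides.get(tenantId) ?? this.defaultLimit;
  }

  // Returns true if the request is admitted at time nowMs.
  tryAcquire(tenantId: string, nowMs: number): boolean {
    const limit = this.limitFor(tenantId);
    // A tenant seen for the first time starts with a full bucket.
    const bucket = this.buckets.get(tenantId) ?? { tokens: limit.capacity, lastRefillMs: nowMs };
    // Refill in proportion to elapsed time, never above capacity.
    const elapsedSeconds = Math.max(0, nowMs - bucket.lastRefillMs) / 1000;
    bucket.tokens = Math.min(limit.capacity, bucket.tokens + elapsedSeconds * limit.refillPerSecond);
    bucket.lastRefillMs = nowMs;
    this.buckets.set(tenantId, bucket);
    if (bucket.tokens < 1) {
      return false;
    }
    bucket.tokens -= 1;
    return true;
  }
}
```

In this sketch a pre-sale capacity audit would amount to checking that `limitFor(tenantId)` returns the intended override for every tenant with an announced sale, since a missing entry silently falls back to the default.
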
6. Cost risks
| ID | Risk | L×I | Mitigation | Owner | Status | Review |
|---|---|---|---|---|---|---|
| R-Cost-1 | Cloud CDN cost inflation from non-cacheable variants | M×M | Vary header strict; query-param normalization; quarterly review | FE Platform | monitored | quarterly |
| R-Cost-2 | Pub/Sub volume from telemetry exceeds budget | M×M | Sample rates per event; cost alarm at 120% | SRE | monitored | quarterly |
| R-Cost-3 | Cert Manager cost growth with custom-domain count | L×L | Track per-cert cost; bundle by tenant tier | FE Platform | accepted | annual |
| R-Cost-4 | High-volume guest abandonment increases telemetry without revenue | M×L | Funnel rate review; sampling-rate calibration on flow.step_completed.v1 | SRE + Product | monitored | quarterly |
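
The per-event sample rates in R-Cost-2 and R-Cost-4 can be sketched as deterministic, session-based sampling. The rate of 0.1 for flow.step_completed.v1, the helper names `fnv1a32` and `shouldSample`, and the choice to hash the session ID are all assumptions for illustration; the register does not state the actual rates or mechanism.

```typescript
// Assumed rates, for illustration only. Events not listed are always kept.
const sampleRates = new Map<string, number>([
  ["flow.step_completed.v1", 0.1],
]);

// FNV-1a style 32-bit hash over the string's UTF-16 code units.
// Adequate for bucketing; not a cryptographic hash.
function fnv1a32(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force unsigned 32-bit
}

// Decide whether to emit an event for a session at the configured rate.
function shouldSample(
  eventName: string,
  sessionId: string,
  rates: Map<string, number> = sampleRates,
): boolean {
  const rate = rates.get(eventName) ?? 1;
  if (rate >= 1) return true;
  if (rate <= 0) return false;
  // Map the session to a fixed position in [0, 1) and keep it if below the rate.
  const position = fnv1a32(sessionId) / 0x100000000;
  return position < rate;
}
```

Hashing the session ID rather than drawing a random number means a given session is either kept or dropped consistently for events that share a rate, so funnel steps stay comparable, and raising a rate only ever adds sessions to the sample.
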
7. Risk acceptance log
| ID | Date accepted | Accepted by | Reason | Re-evaluation |
|---|---|---|---|---|
| R-P-5 | 2026-04-15 | SRE | HA failover within SLA on every drill in last 12 months | 2027-04-15 |
| R-C-3 | 2026-04-15 | Legal | Booking-time PII short-lived; Phase 2 will introduce region affinity | 2027-04-15 |
| R-Cost-3 | 2026-04-15 | FE Platform | Cert Manager cost negligible vs revenue contribution | 2027-04-15 |
8. Review cadence
- Quarterly: FE Platform tech lead + SRE on-call + security reviewer.
- Per major release: any row touched by the release is re-rated.
- Per incident: post-mortem owners audit the register.