SERVICE_RISK_REGISTER — reservation-service
Siblings: SERVICE_READINESS · FAILURE_MODES · TESTING_STRATEGY
Strategic anchor: 02 Enterprise architecture §Risk model · 10 Payments architecture · 09 Lock and key integration
reservation-service is the operational nerve center of the platform. A defect here surfaces immediately to guests and front-desk staff and is rarely silent. This register catalogs known and probable risks, their blast radius, the controls already in place, and the residual exposure.
Severity is the post-mitigation rating. Likelihood and impact are each scored from 1 (low) to 5 (catastrophic). Owner is the team accountable for keeping the mitigation alive.
1. Risk catalog
R-RSV-01 — Booking saga complexity drift
| Field | Value |
|---|---|
| Category | Architectural |
| Description | The booking saga touches 5 services and 8+ events. Adding a new branch (loyalty, group hold, OTA Phase 3) without rigorous compensation design can leave reservations stuck in held or pending_saga_* indefinitely. |
| Likelihood × Impact | 4 × 4 = 16 |
| Severity (post) | High |
| Mitigations | (a) TESTING_STRATEGY §4 requires the 8-compensation matrix to be exhaustive and re-run on every saga change. (b) Stuck-saga alert (saga_steps_stuck) detects within 60 s. (c) FAILURE_MODES §1–3 mandates a runbook per branch. (d) Saga state diagrams in DOMAIN_MODEL §4 are reviewed on every architectural PR. |
| Residual exposure | A novel third-party branch (e.g., Phase 3 OTA) bypasses the 8-compensation matrix until the matrix is extended. |
| Owner | PMS Core team lead |
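
To make mitigation (b) concrete, the following is a minimal sketch of how a stuck-saga check could flag open steps. The `SagaStep` shape, the `findStuckSteps` name, and the `thresholdMs` parameter are assumptions for illustration; the actual saga_steps_stuck alert may be computed differently.

```typescript
// Hypothetical shape of a saga step record; field names are assumptions.
interface SagaStep {
  reservationId: string;
  step: string; // e.g. "await_payment"
  startedAtMs: number; // epoch milliseconds when the step began
  completedAtMs?: number; // undefined while the step is still open
}

// Returns every step that is still open and older than `thresholdMs`.
// The threshold is a hypothetical tuning parameter, not a documented value.
function findStuckSteps(
  steps: SagaStep[],
  nowMs: number,
  thresholdMs: number,
): SagaStep[] {
  return steps.filter(
    (s) => s.completedAtMs === undefined && nowMs - s.startedAtMs > thresholdMs,
  );
}
```

A check of this kind only detects stuck steps; resolving them still depends on the per-branch runbooks and compensation matrix listed above.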
R-RSV-02 — Hold-expiry worker outage causes inventory hoarding
| Field | Value |
|---|---|
| Category | Operational |
| Description | If the hold-expiry Cloud Run worker is paused, fails repeatedly, or its scheduler is misconfigured, holds never expire. Inventory locks held under those reservations also never release. New booking attempts return "no availability" while no real demand exists. |
| Likelihood × Impact | 3 × 4 = 12 |
| Severity (post) | Medium-High |
| Mitigations | (a) Cloud Scheduler alert if missed > 1 cycle. (b) hold_expiry_lag_seconds SLO < 30 s (OBSERVABILITY §3). (c) Worker is single-purpose and stateless — restart is safe. (d) Inventory-service has a 30-minute internal sweep that releases holds older than the longest configured TTL as a backstop (10 Payments §holds). |
| Residual exposure | A simultaneous outage of both worker and inventory backstop sweep. |
| Owner | SRE on-call |
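
One way to reason about the hold_expiry_lag_seconds SLO is as the age of the oldest hold that should already have expired but is still held. The sketch below assumes that definition; the `Hold` shape and the function name are hypothetical, and the authoritative metric definition lives in OBSERVABILITY §3.

```typescript
// Hypothetical hold record; field names are assumptions.
interface Hold {
  id: string;
  expiresAtMs: number; // epoch milliseconds at which the hold should lapse
  released: boolean;
}

// Lag in seconds of the most overdue unreleased hold; 0 when nothing is overdue.
function holdExpiryLagSeconds(holds: Hold[], nowMs: number): number {
  const overdue = holds.filter((h) => !h.released && h.expiresAtMs <= nowMs);
  if (overdue.length === 0) return 0;
  const oldestExpiry = Math.min(...overdue.map((h) => h.expiresAtMs));
  return (nowMs - oldestExpiry) / 1000;
}
```

Under this definition a paused worker shows up as a steadily growing lag, which is what the SLO threshold of 30 s is meant to catch before inventory is visibly hoarded.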
R-RSV-03 — FX rate snapshot stale or missing for IRR
| Field | Value |
|---|---|
| Category | Financial / Regulatory |
| Description | Iran-tenant guests expect IRR pricing. If pricing-service returns no fxSnapshot or one older than tenant.fx.maxAgeMinutes (default 30 min), the reservation either fails (good) or silently uses a stale rate (bad). Stale rates create disputes and potential regulatory exposure (misrepresenting price). |
| Likelihood × Impact | 3 × 4 = 12 |
| Severity (post) | Medium-High |
| Mitigations | (a) Quote rejection if fxSnapshot.capturedAt is older than the tenant policy allows (APPLICATION_LOGIC §RequestQuote). (b) MELMASTOON.RESERVATION.FX_SNAPSHOT_STALE error with explicit Retry-After. (c) Snapshot is frozen on hold; it is not recomputed on confirm. (d) The FX stability test (TESTING_STRATEGY §3) guards against recompute regressions. |
| Residual exposure | If pricing-service silently returns stale data without capturedAt, freshness check passes. Mitigated by Pact contract pinning the capturedAt field as required. |
| Owner | Pricing platform team + PMS Core |
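
The residual exposure above comes down to whether the freshness check fails closed when capturedAt is absent. The following is a minimal sketch of such a check, not the service's actual code: the `FxSnapshot` shape, the `checkFxFreshness` name, and the choice to reuse the stale error code for a missing timestamp are assumptions.

```typescript
// Hypothetical snapshot shape; capturedAt is an ISO-8601 timestamp.
interface FxSnapshot {
  rate: number;
  capturedAt?: string;
}

type FxCheck = { ok: true } | { ok: false; code: string };

const FX_STALE = "MELMASTOON.RESERVATION.FX_SNAPSHOT_STALE";

// Fails closed: a missing snapshot, or a missing or unparseable capturedAt,
// is treated the same as a stale snapshot (an assumption of this sketch).
function checkFxFreshness(
  snapshot: FxSnapshot | undefined,
  nowMs: number,
  maxAgeMinutes = 30, // mirrors the tenant.fx.maxAgeMinutes default
): FxCheck {
  if (!snapshot || !snapshot.capturedAt) return { ok: false, code: FX_STALE };
  const capturedMs = Date.parse(snapshot.capturedAt);
  if (Number.isNaN(capturedMs)) return { ok: false, code: FX_STALE };
  const ageMinutes = (nowMs - capturedMs) / 60_000;
  return ageMinutes <= maxAgeMinutes
    ? { ok: true }
    : { ok: false, code: FX_STALE };
}
```

Treating absence as staleness keeps the check safe even if the contract test that pins capturedAt as required were to lapse.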
R-RSV-04 — OTA reconciliation (Phase 3+) creates duplicate reservations
| Field | Value |
|---|---|
| Category | Architectural / Phase 3 |
| Description | Phase 3 introduces inbound OTA reservations (Booking.com, Expedia, etc.). Without strict idempotency on the OTA confirmation number, a redelivered OTA webhook can produce a second Reservation for the same external booking, double-allocating inventory. |
| Likelihood × Impact | 4 × 5 = 20 |
| Severity (post) | High |
| Mitigations | (a) (tenant_id, channel='ota', channel_partner_id, channel_external_id) UNIQUE index planned in MIGRATION_PLAN. (b) IngestOtaReservation use case dedupes by external ID before any saga step (APPLICATION_LOGIC). (c) Phase 3 launch gated on a chaos drill that replays a webhook 50 times. |
| Residual exposure | OTA partners that mutate their own confirmation number across modifications (rare). Handled by a per-partner adapter mapping. |
| Owner | OTA integration squad (forming Q3) |
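
The dedupe-before-saga idea in mitigation (b) can be illustrated with an in-memory stand-in for the planned UNIQUE index. This is a sketch under stated assumptions: `OtaBooking`, `OtaIngestLedger`, and the callback used to mint reservation IDs are hypothetical names, and a real implementation would rely on the database constraint rather than a process-local map.

```typescript
// Hypothetical inbound OTA booking; field names are assumptions.
interface OtaBooking {
  tenantId: string;
  partnerId: string; // channel_partner_id
  externalId: string; // channel_external_id (OTA confirmation number)
}

// In-memory stand-in for the (tenant_id, channel, partner, external id) UNIQUE index.
class OtaIngestLedger {
  private readonly seen = new Map<string, string>(); // dedupe key -> reservation id

  private static key(b: OtaBooking): string {
    return JSON.stringify([b.tenantId, "ota", b.partnerId, b.externalId]);
  }

  // Returns the existing reservation on redelivery; otherwise records a new one.
  ingest(
    b: OtaBooking,
    newReservationId: () => string,
  ): { reservationId: string; duplicate: boolean } {
    const k = OtaIngestLedger.key(b);
    const existing = this.seen.get(k);
    if (existing !== undefined) return { reservationId: existing, duplicate: true };
    const id = newReservationId();
    this.seen.set(k, id);
    return { reservationId: id, duplicate: false };
  }
}
```

Replaying the same webhook 50 times against this ledger, as in the launch-gate drill, should mint exactly one reservation ID and report the other 49 deliveries as duplicates.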
R-RSV-05 — Cash-on-arrival fraud and abandonment
| Field | Value |
|---|---|
| Category | Financial / Operational |
| Description | Cash-on-arrival is first-class for Afghan/Iranian markets where card penetration is low. Bad actors can flood holds → confirm-as-cash → never arrive, blocking inventory and harming legitimate bookings. |
| Likelihood × Impact | 3 × 4 = 12 |
| Severity (post) | Medium-High |
| Mitigations | (a) Per-tenant policy cashOnArrival.maxOpenReservationsPerGuest (default 3). (b) AI anomaly score on every cash-only reservation (AI_INTEGRATION §1); high score routes to pending_review. (c) no_show.policy.cashOnArrival.deposit_required_after allows tenants to demand a card-on-file deposit after N prior no-shows. (d) RecordNoShow updates a per-guest counter that feeds the anomaly model. |
| Residual exposure | Coordinated multi-guest attacks. Detection moved to fraud-service in Phase 2.5. |
| Owner | Trust & Safety + Pricing |
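
Mitigations (a) and (c) combine into a simple policy gate. The sketch below is illustrative only: the `CashPolicy` shape, the decision labels, and the inclusive comparisons (a guest at the limit is rejected; a guest with exactly N prior no-shows must leave a deposit) are assumptions, not confirmed behavior.

```typescript
// Hypothetical per-tenant policy; names loosely follow the settings above.
interface CashPolicy {
  maxOpenReservationsPerGuest: number; // default 3 per the register
  depositRequiredAfterNoShows: number; // N prior no-shows
}

type CashDecision = "allow" | "require_deposit" | "reject";

// Decides whether a new cash-on-arrival reservation may proceed.
function evaluateCashOnArrival(
  openReservations: number,
  priorNoShows: number,
  policy: CashPolicy,
): CashDecision {
  // Assumption: reaching the cap blocks any further open reservation.
  if (openReservations >= policy.maxOpenReservationsPerGuest) return "reject";
  // Assumption: the deposit requirement applies from the Nth no-show onward.
  if (priorNoShows >= policy.depositRequiredAfterNoShows) return "require_deposit";
  return "allow";
}
```

The anomaly score from mitigation (b) is deliberately left out here; it would act as an additional, independent route to pending_review.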
R-RSV-06 — Concurrent staff modification race vs guest paying
| Field | Value |
|---|---|
| Category | Concurrency |
| Description | Front-desk staff modifies a held reservation (room change, date extension) at the same moment the guest's payment intent captures. Without OCC and saga arbitration, the reservation can land in an inconsistent state (e.g., new room with old payment amount). |
| Likelihood × Impact | 4 × 3 = 12 |
| Severity (post) | Medium |
| Mitigations | (a) version column + If-Match on every mutation. (b) Modification while the saga step is await_payment is rejected with MELMASTOON.RESERVATION.SAGA_IN_PROGRESS. (c) Backoffice UI surfaces a "guest is paying — wait" banner driven by pendingSagaStep. (d) The concurrency test suite (TESTING_STRATEGY §3) covers this race. |
| Residual exposure | Misuse where staff bypasses the banner. Audit log captures every attempt. |
| Owner | PMS Core + Backoffice BFF |
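
Mitigations (a) and (b) can be read as two guards applied in order before a staff modification is accepted. The following sketch assumes a simplified `Reservation` shape; the HTTP status 409 for the saga guard and the `VERSION_MISMATCH` code are hypothetical, while 412 is the standard response for a failed If-Match precondition.

```typescript
// Simplified reservation for illustration; field names are assumptions.
interface Reservation {
  id: string;
  version: number;
  roomId: string;
  pendingSagaStep?: string;
}

type ModifyResult =
  | { status: 200; reservation: Reservation }
  | { status: 409; code: string }
  | { status: 412; code: string };

// Applies a room change only if no payment is in flight and the caller's version is current.
function applyRoomChange(
  current: Reservation,
  ifMatchVersion: number,
  newRoomId: string,
): ModifyResult {
  if (current.pendingSagaStep === "await_payment") {
    return { status: 409, code: "MELMASTOON.RESERVATION.SAGA_IN_PROGRESS" };
  }
  if (ifMatchVersion !== current.version) {
    return { status: 412, code: "VERSION_MISMATCH" }; // hypothetical error code
  }
  return {
    status: 200,
    reservation: { ...current, roomId: newRoomId, version: current.version + 1 },
  };
}
```

Checking the saga guard first means a staff member who holds a current version still cannot change the room while the guest's payment is being captured.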
R-RSV-07 — Double-book race between direct and OTA channels
| Field | Value |
|---|---|
| Category | Concurrency |
| Description | A direct booking and an OTA-pushed reservation arrive within the same millisecond for the same room-night. Inventory-service is authoritative but the race surfaces here. |
| Likelihood × Impact | 3 × 5 = 15 |
| Severity (post) | Medium-High |
| Mitigations | (a) Inventory holds are atomic (04 Event-driven §sagas). (b) Reservation reacts to inventory.allocation.failed.v1 with C1 compensation (FAILURE_MODES C1). (c) Loser channel is notified within seconds; OTA loser triggers an automatic OTA cancellation (Phase 3). |
| Residual exposure | Brief ambiguity in the loser's UI between submit and rejection. Acceptable. |
| Owner | Inventory + PMS Core |
R-RSV-08 — Lock issuance failure blocks check-in
| Field | Value |
|---|---|
| Category | Integration |
| Description | The lock-integration-service adapter for a tenant fails (vendor outage, expired API key, lock offline) while the guest is waiting at the desk. |
| Likelihood × Impact | 3 × 4 = 12 |
| Severity (post) | Medium |
| Mitigations | (a) requires_manual_key=true flag set on reservation; staff issues physical key and continues check-in (FAILURE_MODES C7). (b) Background retry continues attempting credential issuance for up to 24 h. (c) Alert + runbook for vendor-wide outages. (d) Audit trail captures manual override per 09 Lock and key §audit. |
| Residual exposure | Reputational impact if manual fallback is needed too often per tenant. |
| Owner | Lock Integrations team |
R-RSV-09 — Date-arithmetic / timezone bugs across DST and Persian calendar
| Field | Value |
|---|---|
| Category | Domain correctness |
| Description | Stay windows must be computed in property-local time. Bugs around DST transitions, Persian calendar (Solar Hijri) display, and IRR-tenant tz=Asia/Tehran (which observed DST until 2022) can shift nights by ±1. |
| Likelihood × Impact | 3 × 4 = 12 |
| Severity (post) | Medium |
| Mitigations | (a) All stay dates are persisted as date (ISO Gregorian) with a separate propertyTimezone. (b) Explicit conversion via Luxon at the BFF boundary. (c) The date arithmetic test suite (TESTING_STRATEGY §3) covers DST spring-forward, DST fall-back, leap year, and Persian leap year display. (d) Linter forbids new Date() in domain code. |
| Residual exposure | Future Persian calendar reforms (rare). |
| Owner | PMS Core |
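
Because stay dates are persisted as calendar dates rather than instants, the number of nights can be computed without any timezone or DST involvement. The sketch below shows one way to do that using only `Date.UTC` (no `new Date()`); the function names are hypothetical, and it assumes inputs are already valid calendar dates, since `Date.UTC` normalizes out-of-range days rather than rejecting them.

```typescript
const MS_PER_DAY = 86_400_000;

// Converts "YYYY-MM-DD" to a whole-day count since the Unix epoch.
// UTC has no DST, so consecutive calendar dates always differ by exactly 1.
function toDayNumber(isoDate: string): number {
  const m = /^(\d{4})-(\d{2})-(\d{2})$/.exec(isoDate);
  if (!m) throw new Error(`Invalid ISO date: ${isoDate}`);
  return Date.UTC(Number(m[1]), Number(m[2]) - 1, Number(m[3])) / MS_PER_DAY;
}

// Number of nights between two property-local calendar dates.
function nightsBetween(checkIn: string, checkOut: string): number {
  const nights = toDayNumber(checkOut) - toDayNumber(checkIn);
  if (nights <= 0) throw new Error("checkOut must be after checkIn");
  return nights;
}
```

For example, `nightsBetween("2024-02-28", "2024-03-01")` yields 2 because 2024 is a leap year, and a stay spanning a historical Tehran DST transition still counts whole nights, since no wall-clock time enters the calculation.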
R-RSV-10 — Outbox lag during traffic spike
| Field | Value |
|---|---|
| Category | Performance |
| Description | Hajj season or government holiday spike pushes throughput past the outbox dispatcher's design point. Downstream services (notification, billing, search) lag. |
| Likelihood × Impact | 3 × 3 = 9 |
| Severity (post) | Medium |
| Mitigations | (a) outbox_dispatch_lag_seconds SLO < 5 s (OBSERVABILITY). (b) Dispatcher horizontally scaled per DEPLOYMENT_TOPOLOGY. (c) Backpressure: outbox writes never throttle the transactional path. |
| Residual exposure | Brief notification delay during peaks; documented as acceptable. |
| Owner | SRE + PMS Core |
R-RSV-11 — PII leak via event payloads to BigQuery
| Field | Value |
|---|---|
| Category | Privacy / Compliance |
| Description | Event payloads sink to BigQuery for analytics. Including raw email or phone values would violate GDPR data-minimization and regulated-retention requirements. |
| Likelihood × Impact | 2 × 5 = 10 |
| Severity (post) | Low-Medium |
| Mitigations | (a) EVENT_SCHEMAS §1 mandates contactHash (sha256 with tenant salt), never raw values. (b) audit-service envelope encryption for the few fields that must remain reversible. (c) Schema CI guard rejects new event subjects with raw email or phone fields. |
| Residual exposure | Free-text specialRequest.text may incidentally contain PII. An AI redaction step sanitizes this field before it reaches the sink. |
| Owner | Privacy Officer |
R-RSV-12 — Migration importer corrupts legacy data
| Field | Value |
|---|---|
| Category | Migration |
| Description | The Excel/CSV importer (MIGRATION_PLAN) misclassifies a legacy row (e.g., a "tentative" booking imported as confirmed), polluting analytics and triggering wrong-state notifications. |
| Likelihood × Impact | 3 × 3 = 9 |
| Severity (post) | Medium |
| Mitigations | (a) Importer dry-run mode mandatory before commit. (b) Notifications suppressed for imported reservations (source=migration flag). (c) Sample-set review by tenant ops before any import > 100 rows. |
| Residual exposure | Tenant-supplied data quality issues. Surfaced via importer report. |
| Owner | Migration squad |
2. Risk heatmap (post-mitigation severity)
Impact ↑
5 │ R-04 R-07
4 │ R-08 R-01,02,03,05,11
3 │ R-06,09 R-10,12
2 │
1 │
1 2 3 4 5 → Likelihood
3. Review cadence
- Monthly: PMS Core team walks through this register; any risk that fired in the last month is re-rated.
- On change: any architectural PR that touches the saga, FX, OTA, or cash-on-arrival paths must reference the affected risk row(s) in the PR description.
- Quarterly: SRE + Security review for residual exposure trend.
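
The "on change" rule above could be enforced mechanically by scanning PR descriptions for risk IDs. The following is a minimal sketch of such a scan; the `referencedRisks` name is hypothetical, and it assumes risk IDs always follow the two-digit R-RSV-NN pattern used in this register.

```typescript
// Extracts the distinct risk IDs (e.g. "R-RSV-04") mentioned in a PR description.
function referencedRisks(prDescription: string): string[] {
  const matches = prDescription.match(/\bR-RSV-\d{2}\b/g) ?? [];
  return Array.from(new Set(matches)).sort();
}
```

A CI step could then fail a PR that touches the saga, FX, OTA, or cash-on-arrival paths when this function returns an empty list; how those paths are detected is left out of this sketch.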
4. Out-of-scope (tracked elsewhere)
| Risk | Owner |
|---|---|
| Payment processor PCI scope | payment-gateway-service (10 Payments) |
| Lock vendor BLE pairing | lock-integration-service (09 Lock and key) |
| Search index drift | search-aggregation-service |
| Notification deliverability | notification-service |