
SERVICE_RISK_REGISTER — reservation-service

Sibling: SERVICE_READINESS · FAILURE_MODES · TESTING_STRATEGY

Strategic anchor: 02 Enterprise architecture §Risk model · 10 Payments architecture · 09 Lock and key integration

reservation-service is the operational nerve center of the platform. A defect here surfaces immediately to guests and front-desk staff and is rarely silent. This register catalogs known and probable risks, their blast radius, the controls already in place, and the residual exposure.

Severity is the post-mitigation rating. Likelihood and impact are 1 (low) – 5 (catastrophic). Owner is the team accountable for keeping the mitigation alive.
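
The Likelihood × Impact arithmetic used in every row below can be sketched as a small helper. The names `Rating` and `riskScore` are illustrative assumptions. The register does not state how a score maps to a severity band, so no such mapping is shown.

```typescript
// Hypothetical helper: both inputs use the 1-5 scale described above.
type Rating = 1 | 2 | 3 | 4 | 5;

function riskScore(likelihood: Rating, impact: Rating): number {
  return likelihood * impact;
}

// R-RSV-04 is rated 4 × 5 in the catalog.
const otaDuplicateScore = riskScore(4, 5); // 20
```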


1. Risk catalog

R-RSV-01 — Booking saga complexity drift

Field | Value
Category | Architectural
Description | The booking saga touches 5 services and 8+ events. Adding a new branch (loyalty, group hold, OTA Phase 3) without rigorous compensation design can leave reservations stuck in held or pending_saga_* indefinitely.
Likelihood × Impact | 4 × 4 = 16
Severity (post) | High
Mitigations | (a) TESTING_STRATEGY §4 requires the 8-compensation matrix to be exhaustive and re-run on every saga change. (b) Stuck-saga alert (saga_steps_stuck) detects a stuck saga within 60 s. (c) FAILURE_MODES §1–3 mandates a runbook per branch. (d) Saga state diagrams in DOMAIN_MODEL §4 are reviewed on every architectural PR.
Residual exposure | A novel third-party branch (e.g., Phase 3 OTA) bypasses the 8-compensation matrix until the matrix is extended.
Owner | PMS Core team lead

R-RSV-02 — Hold-expiry worker outage causes inventory hoarding

Category | Operational
Description | If the hold-expiry Cloud Run worker is paused, fails repeatedly, or its scheduler is misconfigured, holds never expire. Inventory locks held under those reservations also never release. New booking attempts return "no availability" while no real demand exists.
Likelihood × Impact | 3 × 4 = 12
Severity (post) | Medium-High
Mitigations | (a) Cloud Scheduler alert if missed > 1 cycle. (b) hold_expiry_lag_seconds SLO < 30 s (OBSERVABILITY §3). (c) Worker is single-purpose and stateless, so a restart is safe. (d) Inventory-service has a 30-minute internal sweep that releases holds older than the longest configured TTL as a backstop (10 Payments §holds).
Residual exposure | A simultaneous outage of both worker and inventory backstop sweep.
Owner | SRE on-call
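
One plausible reading of the hold_expiry_lag_seconds signal is the age of the oldest hold that should already have been released. The sketch below illustrates that reading only. The `Hold` shape, the function names, and the definition of the metric itself are assumptions; OBSERVABILITY §3 remains the authority on how the metric is computed.

```typescript
// Hypothetical shape; field names are assumptions, not the service schema.
interface Hold {
  reservationId: string;
  expiresAtMs: number; // epoch milliseconds
}

// Holds whose TTL has elapsed and that the worker should release.
function overdueHolds(holds: Hold[], nowMs: number): Hold[] {
  return holds.filter((h) => h.expiresAtMs <= nowMs);
}

// Seconds since the oldest overdue hold expired; 0 when nothing is overdue.
function holdExpiryLagSeconds(holds: Hold[], nowMs: number): number {
  const overdue = overdueHolds(holds, nowMs);
  if (overdue.length === 0) return 0;
  const oldest = Math.min(...overdue.map((h) => h.expiresAtMs));
  return (nowMs - oldest) / 1000;
}
```

With a healthy worker the lag stays near zero; a paused worker makes it grow without bound, which is the condition the 30 s SLO is meant to catch.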

R-RSV-03 — FX rate snapshot stale or missing for IRR

Category | Financial / Regulatory
Description | Iran-tenant guests expect IRR pricing. If pricing-service returns no fxSnapshot or one older than tenant.fx.maxAgeMinutes (default 30 min), the reservation either fails (good) or silently uses a stale rate (bad). Stale rates create disputes and potential regulatory exposure (misrepresenting price).
Likelihood × Impact | 3 × 4 = 12
Severity (post) | Medium-High
Mitigations | (a) Quote rejection if fxSnapshot.capturedAt is older than tenant policy (APPLICATION_LOGIC §RequestQuote). (b) MELMASTOON.RESERVATION.FX_SNAPSHOT_STALE error with explicit Retry-After. (c) Snapshot is frozen on hold; no recompute on confirm. (d) Test TESTING_STRATEGY §3 FX stability prevents recompute regressions.
Residual exposure | If pricing-service silently returns stale data without capturedAt, the freshness check passes. Mitigated by Pact contract pinning the capturedAt field as required.
Owner | Pricing platform team + PMS Core
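
The freshness rule in mitigation (a) can be sketched as below. This is a fail-closed variant: a snapshot with a missing or unparseable capturedAt is treated as stale, which is an assumption layered on top of the register rather than documented behavior. The `FxSnapshot` shape, the `checkFxFreshness` name, and the 30-second Retry-After default are all hypothetical.

```typescript
// Hypothetical shapes and names; only the error code comes from the register.
interface FxSnapshot {
  rate: number;
  capturedAt?: string; // ISO-8601 timestamp, pinned as required by the Pact contract
}

type FxCheck =
  | { ok: true }
  | { ok: false; code: "MELMASTOON.RESERVATION.FX_SNAPSHOT_STALE"; retryAfterSeconds: number };

function checkFxFreshness(
  snapshot: FxSnapshot | undefined,
  maxAgeMinutes: number,
  nowMs: number,
  retryAfterSeconds = 30, // assumed value, not taken from the register
): FxCheck {
  const stale: FxCheck = {
    ok: false,
    code: "MELMASTOON.RESERVATION.FX_SNAPSHOT_STALE",
    retryAfterSeconds,
  };
  // Assumption: a missing snapshot or timestamp fails closed.
  if (!snapshot || !snapshot.capturedAt) return stale;
  const capturedMs = Date.parse(snapshot.capturedAt);
  if (Number.isNaN(capturedMs)) return stale;
  const ageMinutes = (nowMs - capturedMs) / 60_000;
  return ageMinutes > maxAgeMinutes ? stale : { ok: true };
}
```

Failing closed on a missing timestamp narrows the residual exposure described in the table, at the cost of rejecting quotes whenever the upstream contract is violated.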

R-RSV-04 — OTA reconciliation (Phase 3+) creates duplicate reservations

Category | Architectural / Phase 3
Description | Phase 3 introduces inbound OTA reservations (Booking.com, Expedia, etc.). Without strict idempotency on the OTA confirmation number, a redelivered OTA webhook can produce a second Reservation for the same external booking, double-allocating inventory.
Likelihood × Impact | 4 × 5 = 20
Severity (post) | High
Mitigations | (a) (tenant_id, channel='ota', channel_partner_id, channel_external_id) UNIQUE index planned in MIGRATION_PLAN. (b) IngestOtaReservation use case dedupes by external ID before any saga step (APPLICATION_LOGIC). (c) Phase 3 launch gated on a chaos drill that replays a webhook 50 times.
Residual exposure | OTA partners that mutate their own confirmation number across modifications (rare). Handled by a per-partner adapter mapping.
Owner | OTA integration squad (forming Q3)
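
The dedupe-before-saga idea in mitigation (b) can be illustrated with an in-memory stand-in for the planned UNIQUE index. Everything named here (`OtaWebhook`, `OtaIngestor`, the `rsv-` ID format) is hypothetical, and a real implementation would rely on the database constraint rather than a Map.

```typescript
// Hypothetical webhook shape; field names are assumptions.
interface OtaWebhook {
  tenantId: string;
  partnerId: string;
  externalId: string; // the OTA confirmation number
}

class OtaIngestor {
  // Dedupe key -> reservation id; stands in for the UNIQUE index on
  // (tenant_id, channel, channel_partner_id, channel_external_id).
  private readonly byKey = new Map<string, string>();
  private nextId = 1;

  ingest(hook: OtaWebhook): { reservationId: string; created: boolean } {
    const key = JSON.stringify([hook.tenantId, "ota", hook.partnerId, hook.externalId]);
    const existing = this.byKey.get(key);
    if (existing !== undefined) {
      // Redelivery: return the original reservation, start no saga.
      return { reservationId: existing, created: false };
    }
    const reservationId = `rsv-${this.nextId++}`;
    this.byKey.set(key, reservationId);
    return { reservationId, created: true };
  }
}
```

Replaying the same webhook any number of times, as the chaos drill in (c) does, should yield exactly one `created: true` result.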

R-RSV-05 — Cash-on-arrival fraud and abandonment

Category | Financial / Operational
Description | Cash-on-arrival is first-class for Afghan/Iranian markets where card penetration is low. Bad actors can flood holds → confirm-as-cash → never arrive, blocking inventory and harming legitimate bookings.
Likelihood × Impact | 3 × 4 = 12
Severity (post) | Medium-High
Mitigations | (a) Per-tenant policy cashOnArrival.maxOpenReservationsPerGuest (default 3). (b) AI anomaly score on every cash-only reservation (AI_INTEGRATION §1); high score routes to pending_review. (c) no_show.policy.cashOnArrival.deposit_required_after allows tenants to demand a card-on-file deposit after N prior no-shows. (d) RecordNoShow updates a per-guest counter that feeds the anomaly model.
Residual exposure | Coordinated multi-guest attacks. Detection moved to fraud-service in Phase 2.5.
Owner | Trust & Safety + Pricing
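
Mitigations (a) and (c) combine into a simple per-guest decision, sketched below. The `CashPolicy` shape, the `decideCashOnArrival` name, and the exact comparison semantics (reject at the limit, require a deposit at or above N no-shows) are assumptions; the anomaly score from (b) is deliberately left out.

```typescript
// Hypothetical policy shape mirroring the two tenant settings named above.
interface CashPolicy {
  maxOpenReservationsPerGuest: number; // register default: 3
  depositRequiredAfterNoShows?: number; // stands in for deposit_required_after
}

type CashDecision = "allow" | "require_deposit" | "reject";

function decideCashOnArrival(
  openCashReservations: number,
  priorNoShows: number,
  policy: CashPolicy,
): CashDecision {
  // Assumption: the guest may hold at most `max` open reservations.
  if (openCashReservations >= policy.maxOpenReservationsPerGuest) return "reject";
  const threshold = policy.depositRequiredAfterNoShows;
  if (threshold !== undefined && priorNoShows >= threshold) return "require_deposit";
  return "allow";
}
```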

R-RSV-06 — Concurrent staff modification race vs guest paying

Category | Concurrency
Description | Front-desk staff modifies a held reservation (room change, date extension) at the same moment the guest's payment intent captures. Without OCC and saga arbitration, the reservation can land in an inconsistent state (e.g., new room with old payment amount).
Likelihood × Impact | 4 × 3 = 12
Severity (post) | Medium
Mitigations | (a) version column + If-Match on every mutation. (b) Modification while saga step is await_payment is rejected with MELMASTOON.RESERVATION.SAGA_IN_PROGRESS. (c) Backoffice UI surfaces "guest is paying — wait" banner from pendingSagaStep. (d) Test TESTING_STRATEGY §3 concurrency covers this.
Residual exposure | Misuse where staff bypasses the banner. Audit log captures every attempt.
Owner | PMS Core + Backoffice BFF
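
Mitigations (a) and (b) can be sketched as two guards in front of a mutation. The `Reservation` shape, the `changeRoom` name, the HTTP status codes, and the `VERSION_MISMATCH` code are assumptions for illustration; only SAGA_IN_PROGRESS and the await_payment step name come from the register.

```typescript
// Hypothetical shape; only `version` and `pendingSagaStep` are named in the register.
interface Reservation {
  id: string;
  version: number;
  roomId: string;
  pendingSagaStep?: string;
}

type MutationResult =
  | { status: 200; reservation: Reservation }
  | { status: 409; code: "MELMASTOON.RESERVATION.SAGA_IN_PROGRESS" }
  | { status: 412; code: "VERSION_MISMATCH" }; // hypothetical code

function changeRoom(
  current: Reservation,
  ifMatchVersion: number,
  newRoomId: string,
): MutationResult {
  // Saga arbitration: no staff edits while the guest's payment is in flight.
  if (current.pendingSagaStep === "await_payment") {
    return { status: 409, code: "MELMASTOON.RESERVATION.SAGA_IN_PROGRESS" };
  }
  // Optimistic concurrency: the caller must have seen the latest version.
  if (ifMatchVersion !== current.version) {
    return { status: 412, code: "VERSION_MISMATCH" };
  }
  return {
    status: 200,
    reservation: { ...current, roomId: newRoomId, version: current.version + 1 },
  };
}
```

In a real service both checks and the write would need to happen atomically, for example as a conditional UPDATE on the version column; the sketch only shows the decision order.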

R-RSV-07 — Double-book race between direct and OTA channels

Category | Concurrency
Description | A direct booking and an OTA-pushed reservation arrive within the same millisecond for the same room-night. Inventory-service is authoritative but the race surfaces here.
Likelihood × Impact | 3 × 5 = 15
Severity (post) | Medium-High
Mitigations | (a) Inventory holds are atomic (04 Event-driven §sagas). (b) Reservation reacts to inventory.allocation.failed.v1 with C1 compensation (FAILURE_MODES C1). (c) Loser channel is notified within seconds; OTA loser triggers an automatic OTA cancellation (Phase 3).
Residual exposure | Brief ambiguity in the loser's UI between submit and rejection. Acceptable.
Owner | Inventory + PMS Core

R-RSV-08 — Lock issuance failure blocks check-in

Category | Integration
Description | The lock-integration-service adapter for a tenant fails (vendor outage, expired API key, lock offline) while the guest is at the desk.
Likelihood × Impact | 3 × 4 = 12
Severity (post) | Medium
Mitigations | (a) requires_manual_key=true flag set on reservation; staff issues physical key and continues check-in (FAILURE_MODES C7). (b) Background retry continues attempting credential issuance for up to 24 h. (c) Alert + runbook for vendor-wide outages. (d) Audit trail captures manual override per 09 Lock and key §audit.
Residual exposure | Reputational impact if manual fallback is needed too often per tenant.
Owner | Lock Integrations team

R-RSV-09 — Date-arithmetic / timezone bugs across DST and Persian calendar

Category | Domain correctness
Description | Stay windows must be computed in property-local time. Bugs around DST transitions, Persian calendar (Solar Hijri) display, and Iran-tenant tz=Asia/Tehran (which observed DST until 2022) can shift nights by ±1.
Likelihood × Impact | 3 × 4 = 12
Severity (post) | Medium
Mitigations | (a) All persistence in date (ISO Gregorian) with a separate propertyTimezone. (b) Explicit conversion via Luxon at the BFF boundary. (c) Test suite TESTING_STRATEGY §3 date arithmetic covers DST-spring-forward, DST-fall-back, leap year, Persian leap year display. (d) Linter forbids new Date() in domain.
Residual exposure | Future Persian calendar reforms (rare).
Owner | PMS Core
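
Mitigation (a) works because a stored ISO calendar date carries no clock time, so a DST transition cannot change the number of nights between two dates. The sketch below counts nights with plain integer calendar arithmetic (the well-known days-from-civil algorithm) instead of Luxon, purely to stay dependency-free and to avoid new Date(). The function names are hypothetical and this is not the service's implementation.

```typescript
// Days since 1970-01-01 in the proleptic Gregorian calendar
// (days-from-civil algorithm; the year is shifted so it starts in March).
function daysFromCivil(year: number, month: number, day: number): number {
  const y = month <= 2 ? year - 1 : year;
  const era = Math.floor(y / 400);
  const yearOfEra = y - era * 400;
  const shiftedMonth = (month + 9) % 12; // March = 0 ... February = 11
  const dayOfYear = Math.floor((153 * shiftedMonth + 2) / 5) + day - 1;
  const dayOfEra =
    yearOfEra * 365 + Math.floor(yearOfEra / 4) - Math.floor(yearOfEra / 100) + dayOfYear;
  return era * 146097 + dayOfEra - 719468;
}

// Parses a strict YYYY-MM-DD string into numeric parts.
function parseIsoDate(iso: string): [number, number, number] {
  const m = /^(\d{4})-(\d{2})-(\d{2})$/.exec(iso);
  if (!m) throw new Error(`not an ISO date: ${iso}`);
  return [Number(m[1]), Number(m[2]), Number(m[3])];
}

// Number of nights between arrival and departure, both property-local dates.
function nightsBetween(arrival: string, departure: string): number {
  const [ay, am, ad] = parseIsoDate(arrival);
  const [dy, dm, dd] = parseIsoDate(departure);
  const nights = daysFromCivil(dy, dm, dd) - daysFromCivil(ay, am, ad);
  if (nights <= 0) throw new Error("departure must be after arrival");
  return nights;
}
```

For example, `nightsBetween("2024-02-28", "2024-03-01")` returns 2 because 2024 is a leap year, while the same dates in 2023 give 1.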

R-RSV-10 — Outbox lag during traffic spike

Category | Performance
Description | Hajj season or government holiday spike pushes throughput past the outbox dispatcher's design point. Downstream services (notification, billing, search) lag.
Likelihood × Impact | 3 × 3 = 9
Severity (post) | Medium
Mitigations | (a) outbox_dispatch_lag_seconds SLO < 5 s (OBSERVABILITY). (b) Dispatcher horizontally scaled per DEPLOYMENT_TOPOLOGY. (c) Backpressure: outbox writes never throttle the transactional path.
Residual exposure | Brief notification delay during peaks; documented as acceptable.
Owner | SRE + PMS Core

R-RSV-11 — PII leak via event payloads to BigQuery

Category | Privacy / Compliance
Description | Event payloads sink to BigQuery for analytics. Including raw email or phone values there violates GDPR data-minimization requirements for data under regulated retention.
Likelihood × Impact | 2 × 5 = 10
Severity (post) | Low-Medium
Mitigations | (a) EVENT_SCHEMAS §1 mandates contactHash (sha256 with tenant salt), never raw values. (b) audit-service envelope encryption for the few fields that must remain reversible. (c) Schema CI guard rejects new event subjects with raw email or phone fields.
Residual exposure | Free-text specialRequest.text may incidentally contain PII. AI redaction sanitizes before sink.
Owner | Privacy Officer
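
A contactHash of the kind described in mitigation (a) might look like the sketch below, using Node's built-in crypto module. The normalization step (trim and lowercase), the `salt:value` concatenation, and the hex encoding are assumptions; EVENT_SCHEMAS §1 defines the actual format.

```typescript
import { createHash } from "node:crypto";

// Hypothetical helper: salted SHA-256 of a normalized contact value.
// The normalization and the "salt:value" layout are assumptions.
function contactHash(tenantSalt: string, rawContact: string): string {
  const normalized = rawContact.trim().toLowerCase();
  return createHash("sha256")
    .update(`${tenantSalt}:${normalized}`, "utf8")
    .digest("hex");
}
```

Because the salt is per tenant, the same email hashes to different values in different tenants, so analytics can join on the hash within a tenant without holding the raw value.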

R-RSV-12 — Migration importer corrupts legacy data

Category | Migration
Description | The Excel/CSV importer (MIGRATION_PLAN) misclassifies a legacy row (e.g., a "tentative" booking imported as confirmed), polluting analytics and triggering wrong-state notifications.
Likelihood × Impact | 3 × 3 = 9
Severity (post) | Medium
Mitigations | (a) Importer dry-run mode mandatory before commit. (b) Notifications suppressed for imported reservations (source=migration flag). (c) Sample-set review by tenant ops before any import > 100 rows.
Residual exposure | Tenant-supplied data quality issues. Surfaced via importer report.
Owner | Migration squad

2. Risk heatmap (post-mitigation severity)

Impact ↑
5 │ R-04 R-07
4 │ R-08 R-01,02,03,05,11
3 │ R-06,09 R-10,12
2 │
1 │
1 2 3 4 5 → Likelihood

3. Review cadence

  • Monthly: PMS Core team walks through this register; any risk that fired in the last month is re-rated.
  • On change: any architectural PR that touches the saga, FX, OTA, or cash-on-arrival paths must reference the affected risk row(s) in the PR description.
  • Quarterly: SRE + Security review for residual exposure trend.

4. Out-of-scope (tracked elsewhere)

Risk | Owner
Payment processor PCI scope | payment-gateway-service (10 Payments)
Lock vendor BLE pairing | lock-integration-service (09 Lock and key)
Search index drift | search-aggregation-service
Notification deliverability | notification-service