FAILURE_MODES — reservation-service

Sibling: OBSERVABILITY · DEPLOYMENT_TOPOLOGY · TESTING_STRATEGY

Strategic anchors: 02 §14 Resilience · 04 §6 Outbox/Inbox · 09 §7 Lock failure modes · 10 §13 Payment failure modes

This catalog enumerates how reservation-service can fail, who notices, what runbook applies, and how each failure ends. The booking saga is the highest-leverage path on the platform; every step has an explicit compensation, and every failure has a named alert (OBSERVABILITY §6) and runbook in runbooks/reservation/.


1. Saga step failures (booking forward path)

F1. Inventory hold rejected (inventory.allocation.failed.v1)

| Field | Value |
| --- | --- |
| Surface | guest funnel returns 409 MELMASTOON.RESERVATION.OVERBOOKING_BLOCKED; staff walk-in returns the same |
| Detection | inventory.allocation.failed.v1 arrives within 5 s of hold |
| Compensation | CancelReservationUseCase(reasonCode='inventory_failed'); emit reservation.cancelled.v1; cancel pending payment intent via payment-gateway-service |
| User impact | Guest sees "These rooms just went off-sale; please pick alternatives." Staff sees a re-quote dialog with available rooms. |
| Runbook | runbooks/reservation/inventory-rejected.md |
| Test | C1 in TESTING_STRATEGY §4.2 |
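
The F1 compensation can be sketched as below. Only the event name, the use-case name, and the reasonCode value come from this catalog; the Reservation and Effects shapes and the function name are hypothetical, and a real use case would presumably write through a repository and the outbox rather than in-memory arrays.

```typescript
// Hypothetical shapes for illustration; not the service's actual types.
type ReservationStatus = "held" | "confirmed" | "cancelled";

interface Reservation {
  id: string;
  status: ReservationStatus;
  paymentIntentId?: string;
  reasonCode?: string;
}

// In-memory stand-ins for the outbox and for payment-gateway-service calls.
interface Effects {
  emittedEvents: string[];
  cancelledIntents: string[];
}

// Sketch of what CancelReservationUseCase might do on inventory.allocation.failed.v1.
function cancelForInventoryFailure(r: Reservation, fx: Effects): Reservation {
  // A redelivered failure event finds the reservation already cancelled: do nothing.
  if (r.status === "cancelled") return r;
  // Cancel the pending payment intent, if one was created with the hold.
  if (r.paymentIntentId !== undefined) fx.cancelledIntents.push(r.paymentIntentId);
  fx.emittedEvents.push("reservation.cancelled.v1");
  return { ...r, status: "cancelled", reasonCode: "inventory_failed" };
}
```

Returning the same object on the second call keeps the compensation safe under at-least-once delivery.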

F2. Payment timeout during hold (payment.transaction.failed.v1 or hold TTL elapses)

| Field | Value |
| --- | --- |
| Surface | guest sees payment failure or "Hold expired" message; reservation transitions to cancelled (payment failed) or expired_hold (TTL) |
| Detection | inbox handler / hold-expiry sweeper |
| Compensation | release inventory hold (event-driven); cancel payment intent if still pending; no folio created |
| User impact | guest can re-quote and try again; no double charge |
| Runbook | runbooks/reservation/payment-timeout.md |
| Test | C2, C3 |
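
The two F2 paths end in different terminal states. A minimal sketch of that split and of one sweeper pass, assuming an in-memory list of holds; the Hold shape, the function names, and the epoch-millisecond timestamps are hypothetical.

```typescript
// Status names are taken from this catalog; everything else is illustrative.
type Status = "held" | "confirmed" | "cancelled" | "expired_hold";
type HoldFailure = "payment_failed" | "ttl_elapsed";

interface Hold {
  reservationId: string;
  status: Status;
  expiresAt: number; // epoch milliseconds (assumption)
}

// Payment failure cancels the reservation; an elapsed TTL expires the hold.
function terminalStatus(cause: HoldFailure): Status {
  return cause === "payment_failed" ? "cancelled" : "expired_hold";
}

// One sweeper pass: only reservations still in "held" past their TTL are touched.
function sweepExpired(holds: Hold[], now: number): Hold[] {
  return holds.map((h) =>
    h.status === "held" && h.expiresAt <= now
      ? { ...h, status: terminalStatus("ttl_elapsed") }
      : h
  );
}
```

Confirmed reservations are skipped even when their original hold deadline has passed, which is what keeps the sweeper from undoing a successful payment.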

F3. Guest abandons mid-checkout (no event arrives)

| Field | Value |
| --- | --- |
| Surface | client-side abandonment; no server signal |
| Detection | hold TTL elapses → hold_expired.v1 |
| Compensation | identical to F2 (TTL path) |
| User impact | inventory free for the next guest within 10 minutes (default TTL) |
| Runbook | runbooks/reservation/abandoned-funnel.md |

F4. Lock issuance failure at confirm or check-in (lock.key.failed.v1)

| Field | Value |
| --- | --- |
| Surface | reservation stays in confirmed or transitions to checked_in with requiresManualKey=true |
| Detection | lock.key.failed.v1 arrives; alert RESV-007 fires above threshold |
| Compensation | none on the reservation; the booking remains valid. Staff issues a manual key (legacy method) and the modification audit logs the override |
| User impact | guest may experience a short delay at the door; staff handle in person |
| Runbook | runbooks/reservation/lock-degraded.md; coordinated with lock-integration-service runbook |
| Test | C6 |

F5. notification-service send failure

| Field | Value |
| --- | --- |
| Surface | guest does not receive confirmation; staff sees "Email delivery delayed" badge in backoffice |
| Detection | notification-service retries; after exhaustion, emits notification.delivery_failed.v1 |
| Compensation | reservation remains confirmed; staff can manually re-send via backoffice; no impact on saga state |
| Runbook | runbooks/reservation/notification-delayed.md |

2. Modification sub-saga failures

F6. Date-change: payment delta charge fails

| Field | Value |
| --- | --- |
| Surface | PATCH /:id returns 422 MELMASTOON.PAYMENT.CHARGE_FAILED |
| Compensation | revert inventory reallocation (release the new hold, restore the old hold); reservation stayWindow unchanged |
| User impact | staff retries with alternative payment method or contacts guest |
| Runbook | runbooks/reservation/date-change-charge-failed.md |
| Test | C7 |
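
The F6 compensation amounts to "reallocate, charge, and undo the reallocation if the charge throws". A minimal sketch, assuming holds live in a plain in-memory record keyed by reservation id; the types, the changeDates function, and the chargeDelta callback are hypothetical stand-ins for calls to inventory-service and payment-gateway-service.

```typescript
interface StayWindow {
  checkIn: string;
  checkOut: string;
}

interface Reservation {
  id: string;
  stayWindow: StayWindow;
}

// In-memory stand-in for inventory holds, keyed by reservation id (assumption).
type Holds = Record<string, StayWindow>;

function changeDates(
  r: Reservation,
  next: StayWindow,
  holds: Holds,
  chargeDelta: () => void // throws when the delta charge fails
): Reservation {
  const previous: StayWindow | undefined = holds[r.id];
  holds[r.id] = next; // forward step: the new hold replaces the old one
  try {
    chargeDelta();
  } catch (err) {
    // Compensation: release the new hold and restore the old one.
    if (previous !== undefined) {
      holds[r.id] = previous;
    } else {
      delete holds[r.id];
    }
    throw err; // the caller maps this to 422 MELMASTOON.PAYMENT.CHARGE_FAILED
  }
  return { ...r, stayWindow: next };
}
```

Because the reservation is only rebuilt after the charge succeeds, a failed charge leaves stayWindow untouched without any explicit rollback on the aggregate.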

F7. Room-change: lock revoke fails after key reissue

| Field | Value |
| --- | --- |
| Surface | reservation persists with new room; requiresManualKey=true flag set; alert RESV-007 |
| Compensation | none on aggregate; ops issues manual revoke; modification audit row carries lock_revoke_pending |
| Runbook | runbooks/reservation/room-change-lock-revoke.md |

F8. Group partial cancellation race

| Field | Value |
| --- | --- |
| Surface | two operators cancelling different items concurrently; OCC ensures both succeed eventually |
| Compensation | second writer reloads via STALE_VERSION, retries with new version; no inventory or refund duplication |
| Runbook | runbooks/reservation/group-cancel-race.md |
| Test | C8 + concurrency-group spec |
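
The reload-and-retry behaviour in F8 can be illustrated with a compare-and-set on a version number. This is a sketch under stated assumptions: the Group shape, the in-memory Store class, and the three-attempt limit are hypothetical; only the STALE_VERSION concept comes from this catalog.

```typescript
interface Group {
  version: number;
  items: Record<string, "active" | "cancelled">;
}

// In-memory stand-in for the reservation repository.
class Store {
  private current: Group;

  constructor(initial: Group) {
    this.current = initial;
  }

  // Returns a private copy so callers cannot mutate stored state directly.
  load(): Group {
    return { version: this.current.version, items: { ...this.current.items } };
  }

  // Compare-and-set: succeeds only if the caller saw the latest version.
  save(next: Group, expectedVersion: number): boolean {
    if (this.current.version !== expectedVersion) return false; // STALE_VERSION
    this.current = { version: expectedVersion + 1, items: { ...next.items } };
    return true;
  }
}

// Cancels one item, reloading and retrying when another writer got there first.
function cancelItem(store: Store, itemId: string, maxAttempts = 3): boolean {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const snapshot = store.load();
    // Already cancelled: report success without a second release or refund.
    if (snapshot.items[itemId] === "cancelled") return true;
    snapshot.items[itemId] = "cancelled";
    if (store.save(snapshot, snapshot.version)) return true;
  }
  return false;
}
```

Each retry starts from a fresh load, so the second operator's write is applied on top of the first operator's cancellation rather than overwriting it.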

3. Concurrency & ordering failures

F9. OCC stale version on hot reservation

| Field | Value |
| --- | --- |
| Surface | 409 MELMASTOON.RESERVATION.STALE_VERSION; client reloads and retries |
| Detection | reservation_occ_conflicts_total counter; warning at > 2% per 30 min |
| Compensation | client-side retry with reload; server-side no-op |
| User impact | imperceptible if rate stays low; UI surfaces "Refreshing reservation…" if rate is high |
| Runbook | runbooks/reservation/occ-storm.md (look for hot reservation, ordering-key skew, or runaway client) |
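
As a worked example of the F9 warning condition: the catalog gives the threshold (more than 2% per 30 min) but not the denominator, so the sketch below assumes the rate is conflicts divided by total write attempts in the same 30-minute window. The function names are hypothetical.

```typescript
const OCC_WARN_THRESHOLD = 0.02; // "> 2%" from the catalog

// Assumption: both counts cover the same 30-minute window.
function occConflictRate(conflicts: number, totalWrites: number): number {
  return totalWrites === 0 ? 0 : conflicts / totalWrites;
}

function shouldWarnOnOcc(conflicts: number, totalWrites: number): boolean {
  return occConflictRate(conflicts, totalWrites) > OCC_WARN_THRESHOLD;
}
```

With 1,000 writes in the window, 20 conflicts sits exactly on the threshold and does not warn; the 21st conflict does.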

F10. Out-of-order Pub/Sub delivery within an aggregate

| Field | Value |
| --- | --- |
| Surface | inbox handler observes a payment.captured.v1 for an unknown reservation (event ahead of its held.v1) |
| Detection | handler does not find the aggregate; raises RESERVATION_NOT_FOUND_FOR_EVENT; message NACK'd for retry |
| Compensation | redelivery resolves the race; per-aggregate ordering key prevents this in normal operation; the safety net is the inbox dedupe + retry with backoff |
| Runbook | runbooks/reservation/out-of-order-event.md |
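
The NACK-and-retry safety net in F10 might look like the sketch below. The error name comes from this catalog; the Outcome type, the handler signature, and the backoff constants (500 ms base, doubling, capped at 30 s) are assumptions chosen for illustration, not the service's configured values.

```typescript
type Outcome =
  | { kind: "ack" }
  | { kind: "nack"; retryInMs: number; error: string };

const BASE_DELAY_MS = 500; // assumption
const MAX_DELAY_MS = 30000; // assumption

// Exponential backoff: 500 ms, 1 s, 2 s, ... capped at MAX_DELAY_MS.
function backoffMs(attempt: number): number {
  return Math.min(MAX_DELAY_MS, BASE_DELAY_MS * Math.pow(2, attempt - 1));
}

// attempt is the 1-based delivery attempt reported by the broker.
function onPaymentCaptured(
  reservationId: string,
  knownReservationIds: readonly string[],
  attempt: number
): Outcome {
  if (knownReservationIds.indexOf(reservationId) === -1) {
    // The held.v1 event has not been applied yet: ask for redelivery later.
    return {
      kind: "nack",
      retryInMs: backoffMs(attempt),
      error: "RESERVATION_NOT_FOUND_FOR_EVENT",
    };
  }
  return { kind: "ack" };
}
```

Once the earlier event has been applied, the redelivered message finds the aggregate and is acknowledged normally.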

4. Storage failures

F11. Cloud SQL primary unavailable

| Field | Value |
| --- | --- |
| Surface | reads return 503 (after retry budget); writes return 503 with Retry-After: 5 |
| Detection | DB connection error rate spikes; readiness probe fails |
| Compensation | Cloud SQL HA failover within 90 s; outbox relay backs off; on resume, queued writes retry |
| User impact | brief inability to book/check-in/check-out; no data loss; no double effects |
| Runbook | runbooks/reservation/postgres-unavailable.md |

F12. Outbox relay stalls

| Field | Value |
| --- | --- |
| Surface | downstream services do not receive new events; alert RESV-004 |
| Detection | reservation_outbox_lag_seconds > 30 s p99 |
| Compensation | restart relay pod; if persistent, scale up; replay safe (consumers are idempotent) |
| Runbook | runbooks/reservation/outbox-lag.md |
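
To make the F12 detection rule concrete, here is a sketch that evaluates "p99 > 30 s" over a batch of lag samples. In practice the monitoring backend would compute the percentile; the nearest-rank method and the function names below are assumptions used only to show the arithmetic.

```typescript
const LAG_THRESHOLD_SECONDS = 30; // from the catalog

// Nearest-rank 99th percentile; returns 0 for an empty sample set.
function p99(samples: readonly number[]): number {
  if (samples.length === 0) return 0;
  const sorted = samples.slice().sort((a, b) => a - b);
  const rank = Math.ceil(0.99 * sorted.length); // 1-based rank
  const value = sorted[rank - 1];
  return value === undefined ? 0 : value;
}

function outboxLagBreached(lagSeconds: readonly number[]): boolean {
  return p99(lagSeconds) > LAG_THRESHOLD_SECONDS;
}
```

With only ten samples the nearest-rank p99 is the maximum, so a single 45 s outlier trips the rule; with larger windows an isolated outlier is absorbed.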

F13. Inbox dedupe table corruption

| Field | Value |
| --- | --- |
| Surface | dedupe lookup fails or returns spurious matches; events rejected as duplicate or applied twice |
| Detection | spike in reservation_inbox_dedupe_hits_total or in domain errors from re-applied transitions |
| Compensation | aggregate state machine is the second line of defense (illegal-transition check); manual rebuild of inbox_processed from BigQuery archive |
| Runbook | runbooks/reservation/inbox-dedupe-rebuild.md |
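
The two lines of defense in F13 can be sketched together: an inbox dedupe check first, then an illegal-transition check on the aggregate. The transition table below is a simplified assumption built from status names that appear in this catalog; the real state machine likely has more states (for example no-show) and different rules.

```typescript
type Status =
  | "held"
  | "confirmed"
  | "checked_in"
  | "checked_out"
  | "cancelled"
  | "expired_hold";

// Hypothetical, simplified transition table.
const allowed: Record<Status, Status[]> = {
  held: ["confirmed", "cancelled", "expired_hold"],
  confirmed: ["checked_in", "cancelled"],
  checked_in: ["checked_out"],
  checked_out: [],
  cancelled: [],
  expired_hold: [],
};

// processed stands in for the inbox_processed table.
function applyEvent(
  current: Status,
  target: Status,
  eventId: string,
  processed: string[]
): Status {
  // First line of defense: skip events the inbox has already seen.
  if (processed.indexOf(eventId) !== -1) return current;
  // Second line of defense: refuse transitions the state machine does not allow.
  if (allowed[current].indexOf(target) === -1) {
    throw new Error("illegal transition " + current + " -> " + target);
  }
  processed.push(eventId);
  return target;
}
```

If the dedupe table loses a row, the replayed event reaches the second check and is rejected as an illegal transition instead of being applied twice.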

5. Hold-expiry worker failures

F14. Worker stalled or down

| Field | Value |
| --- | --- |
| Surface | inventory not released for expired holds; available rooms shrink; alert RESV-002 |
| Detection | reservation_outbox_lag_seconds for hold_expired.v1 > 60 s |
| Compensation | restart worker; sweeper is single-replica by design, so the next pass picks up the backlog |
| User impact | visible only as transient inventory pressure; guests get re-quote on next booking attempt |
| Runbook | runbooks/reservation/hold-expiry-stalled.md |

F15. Sweeper double-runs (mis-deployment)

| Field | Value |
| --- | --- |
| Surface | duplicate hold_expired.v1 for the same reservation |
| Compensation | aggregate expireHold() is idempotent, so the second call no-ops; consumers (inventory-service, payment-gateway-service) dedupe on event_id |
| Runbook | runbooks/reservation/sweeper-duplicates.md |
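
A minimal sketch of what "expireHold() is idempotent" means in practice. The method name and event name come from this catalog; the Reservation shape and the returned result object are hypothetical, and the real aggregate presumably records events internally rather than returning them.

```typescript
interface Reservation {
  id: string;
  status: "held" | "confirmed" | "expired_hold";
}

interface ExpireResult {
  reservation: Reservation;
  events: string[]; // event names to append to the outbox
}

// Only a reservation still in "held" changes state and emits an event.
function expireHold(r: Reservation): ExpireResult {
  if (r.status !== "held") {
    return { reservation: r, events: [] }; // second sweeper run: no-op
  }
  return {
    reservation: { ...r, status: "expired_hold" },
    events: ["hold_expired.v1"],
  };
}
```

The guard also protects a reservation that was confirmed between two sweeper passes: it is neither expired nor re-announced.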

6. Edge-case operational failures

F16. Reservation against an out-of-order room (property.room.taken_out_of_order.v1)

| Field | Value |
| --- | --- |
| Surface | property service emits OOO; we find affected active reservations |
| Compensation | for each affected reservation: trigger room_change sub-saga; if no alternative available within property type, emit reservation.alert.relocation_required.v1 for staff to handle manually |
| User impact | guest receives proactive notification; staff prompted in backoffice |
| Runbook | runbooks/reservation/room-ooo-reaccommodation.md |
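
The F16 decision for a single affected reservation can be sketched as a lookup followed by a fallback. This assumes "alternative within property type" means an available room of the same type as the one taken out of order, which the catalog does not spell out; the Room shape and function name are hypothetical.

```typescript
interface Room {
  id: string;
  type: string;
  available: boolean;
}

type Decision =
  | { action: "room_change"; toRoomId: string }
  | { action: "relocation_required" }; // emits reservation.alert.relocation_required.v1

// Picks the first available room of the same type, else escalates to staff.
function reaccommodate(oooRoom: Room, candidates: readonly Room[]): Decision {
  for (const c of candidates) {
    if (c.id !== oooRoom.id && c.type === oooRoom.type && c.available) {
      return { action: "room_change", toRoomId: c.id };
    }
  }
  return { action: "relocation_required" };
}
```

A real implementation would also need to check availability across the full stay window, not a single flag.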

F17. Date extension blocked by next reservation

| Field | Value |
| --- | --- |
| Surface | PATCH /:id date_change returns 409 MELMASTOON.RESERVATION.OVERBOOKING_BLOCKED because the next reservation already holds the room |
| Compensation | UI offers room-change as alternative |
| Runbook | runbooks/reservation/date-extension-conflict.md |

F18. No-show with prepaid

| Field | Value |
| --- | --- |
| Surface | RecordNoShowUseCase runs after grace; reservation.no_show.v1 emitted; billing-service posts no-show penalty |
| Compensation | no compensation needed; the penalty is policy-driven |
| Runbook | runbooks/reservation/no-show-prepaid.md |

F19. Early checkout with refund eligibility

| Field | Value |
| --- | --- |
| Surface | RecordEarlyCheckoutUseCase; early_checkout.v1 and checked_out.v1 emitted; billing-service issues refund per policy |
| Runbook | runbooks/reservation/early-checkout-refund.md |

F20. Walk-in deferred KYC

| Field | Value |
| --- | --- |
| Surface | walk-in by phone-by-staff with deferKyc=true; reservation has guest fields partially filled |
| Compensation | follow-up task created in notification-service to ping staff for KYC within 24 h |
| Runbook | runbooks/reservation/walk-in-kyc-followup.md |

F21. OTA inbound reconciliation drift (Phase 3+)

| Field | Value |
| --- | --- |
| Surface | OTA partner reports a reservation we have no record of (or vice versa) |
| Detection | reservation-channel-manager-service daily reconcile job |
| Compensation | manual operator workflow under runbooks/reservation/ota-reconcile.md; not auto-resolved |

F22. Suspected-fraud auto-block timeout

| Field | Value |
| --- | --- |
| Surface | held-for-review reservation auto-rejects after 15 min without staff confirmation |
| Compensation | reservation.cancelled.v1 with reasonCode='guest_cancelled' and reason='fraud_review_timeout' |
| Runbook | runbooks/reservation/fraud-review-backlog.md |
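
The F22 timeout rule, sketched as a pure function. The 15-minute window, the event name, and the reasonCode and reason values come from this catalog; the ReviewHold shape, the field names, and the epoch-millisecond timestamps are assumptions.

```typescript
const REVIEW_TIMEOUT_MS = 15 * 60 * 1000; // 15 min, per the catalog

interface ReviewHold {
  reservationId: string;
  heldForReviewAt: number; // epoch milliseconds (assumption)
  staffConfirmed: boolean;
}

interface CancelledEvent {
  type: "reservation.cancelled.v1";
  reservationId: string;
  reasonCode: "guest_cancelled";
  reason: "fraud_review_timeout";
}

// Returns the cancellation event once the review window has elapsed, else null.
function autoReject(hold: ReviewHold, now: number): CancelledEvent | null {
  if (hold.staffConfirmed) return null;
  if (now - hold.heldForReviewAt < REVIEW_TIMEOUT_MS) return null;
  return {
    type: "reservation.cancelled.v1",
    reservationId: hold.reservationId,
    reasonCode: "guest_cancelled",
    reason: "fraud_review_timeout",
  };
}
```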

7. Compound failure: payment processor outage during a hold

hold placed → 200 to client with paymentIntent.clientSecret
        │
        │   client attempts payment via Stripe; Stripe is down
        ▼
hold TTL elapses (10 min) → sweeper emits hold_expired.v1
        │
        ▼
inventory-service releases allocation
payment-gateway-service cancels the unconfirmed intent
notification-service sends "We couldn't take payment — try again" email

Outcome: no double charge, no stranded inventory, no manual intervention. The same flow handles MFS, PayPal, and HesabPay outages.


8. Cross-references