FAILURE_MODES — reservation-service
Sibling: OBSERVABILITY · DEPLOYMENT_TOPOLOGY · TESTING_STRATEGY
Strategic anchors: 02 §14 Resilience · 04 §6 Outbox/Inbox · 09 §7 Lock failure modes · 10 §13 Payment failure modes
This catalog enumerates how reservation-service can fail, who notices, what runbook applies, and how each failure ends. The booking saga is the highest-leverage path on the platform; every step has an explicit compensation, and every failure has a named alert (OBSERVABILITY §6) and runbook in runbooks/reservation/.
1. Saga step failures (booking forward path)
F1. Inventory hold rejected (inventory.allocation.failed.v1)
| Field | Value |
|---|---|
| Surface | guest funnel returns 409 MELMASTOON.RESERVATION.OVERBOOKING_BLOCKED; the staff walk-in flow returns the same error |
| Detection | inventory.allocation.failed.v1 arrives within 5 s of hold |
| Compensation | CancelReservationUseCase(reasonCode='inventory_failed'); emit reservation.cancelled.v1; cancel pending payment intent via payment-gateway-service |
| User impact | Guest sees "These rooms just went off-sale; please pick alternatives." Staff sees a re-quote dialog with available rooms. |
| Runbook | runbooks/reservation/inventory-rejected.md |
| Test | C1 in TESTING_STRATEGY §4.2 |
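
The F1 compensation can be sketched as a small handler. This is a minimal illustration, not the service's actual code: the `Reservation`, `OutboxEvent`, and `PaymentGateway` shapes and the function name `compensateAllocationFailed` are assumptions; only the event name, the use-case name, and the reason code come from the table above.

```typescript
// Hypothetical shapes for illustration; only the event name, the use-case
// name, and the reason code come from the catalog.
type ReservationStatus = "held" | "confirmed" | "cancelled";

interface Reservation {
  id: string;
  status: ReservationStatus;
  paymentIntentId?: string;
}

interface OutboxEvent {
  type: string;
  reservationId: string;
  reasonCode: string;
}

// Stand-in for the call to payment-gateway-service (assumed interface).
interface PaymentGateway {
  cancelIntent(intentId: string): void;
}

// Compensation for inventory.allocation.failed.v1, mirroring
// CancelReservationUseCase(reasonCode='inventory_failed').
function compensateAllocationFailed(
  reservation: Reservation,
  gateway: PaymentGateway,
  outbox: OutboxEvent[],
): void {
  // A redelivered failure event must not cancel or emit twice.
  if (reservation.status === "cancelled") {
    return;
  }
  reservation.status = "cancelled";
  outbox.push({
    type: "reservation.cancelled.v1",
    reservationId: reservation.id,
    reasonCode: "inventory_failed",
  });
  // Cancel the pending payment intent, if one was created for the hold.
  if (reservation.paymentIntentId !== undefined) {
    gateway.cancelIntent(reservation.paymentIntentId);
  }
}
```

The early return on an already-cancelled reservation is what keeps a redelivered event from producing a second `reservation.cancelled.v1`.
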
F2. Payment timeout during hold (payment.transaction.failed.v1 or hold TTL elapses)
| Field | Value |
|---|---|
| Surface | guest sees payment failure or "Hold expired" message; reservation transitions to cancelled (payment failed) or expired_hold (TTL) |
| Detection | inbox handler / hold-expiry sweeper |
| Compensation | release inventory hold (event-driven); cancel payment intent if still pending; no folio created |
| User impact | guest can re-quote and try again; no double charge |
| Runbook | runbooks/reservation/payment-timeout.md |
| Test | C2, C3 |
F3. Guest abandons mid-checkout (no event arrives)
| Field | Value |
|---|---|
| Surface | client-side abandonment; no server signal |
| Detection | hold TTL elapses → hold_expired.v1 |
| Compensation | identical to F2 (TTL path) |
| User impact | inventory is freed for the next guest within 10 minutes (default TTL) |
| Runbook | runbooks/reservation/abandoned-funnel.md |
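
The TTL path shared by F2 and F3 can be illustrated with a single sweep pass. This is a sketch under assumptions: the `Hold` shape, the `heldAtMs` field, and the function name `sweepExpiredHolds` are hypothetical; the 10-minute default and the `expired_hold` state come from the catalog.

```typescript
// Hypothetical hold record; field names are assumptions for illustration.
interface Hold {
  reservationId: string;
  status: "held" | "confirmed" | "expired_hold";
  heldAtMs: number; // epoch milliseconds when the hold was placed
}

// Default hold TTL from the catalog: 10 minutes.
const DEFAULT_HOLD_TTL_MS = 10 * 60 * 1000;

// One sweeper pass: expire every hold that is still "held" and older than
// the TTL, and return the reservation ids that need a hold_expired.v1 event.
function sweepExpiredHolds(
  holds: Hold[],
  nowMs: number,
  ttlMs: number = DEFAULT_HOLD_TTL_MS,
): string[] {
  const expiredIds: string[] = [];
  for (const hold of holds) {
    if (hold.status === "held" && nowMs - hold.heldAtMs >= ttlMs) {
      hold.status = "expired_hold";
      expiredIds.push(hold.reservationId);
    }
  }
  return expiredIds;
}
```

Because the pass only touches holds still in `held`, running it again over the same data returns nothing new, which is the property the abandonment path relies on when no client signal ever arrives.
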
F4. Lock issuance failure at confirm or check-in (lock.key.failed.v1)
| Field | Value |
|---|---|
| Surface | reservation stays in confirmed or transitions to checked_in with requiresManualKey=true |
| Detection | lock.key.failed.v1 arrives; alert RESV-007 fires above threshold |
| Compensation | none on the reservation — the booking remains valid; staff issues a manual key (legacy method) and the modification audit logs the override |
| User impact | guest may experience a short delay at the door; staff handle it in person |
| Runbook | runbooks/reservation/lock-degraded.md; coordinated with lock-integration-service runbook |
| Test | C6 |
F5. notification-service send failure
| Field | Value |
|---|---|
| Surface | guest does not receive confirmation; staff sees "Email delivery delayed" badge in backoffice |
| Detection | notification-service retries; after exhaustion, emits notification.delivery_failed.v1 |
| Compensation | reservation remains confirmed; staff can manually re-send via backoffice; no impact on saga state |
| Runbook | runbooks/reservation/notification-delayed.md |
2. Modification sub-saga failures
F6. Date-change: payment delta charge fails
| Field | Value |
|---|---|
| Surface | PATCH /:id returns 422 MELMASTOON.PAYMENT.CHARGE_FAILED |
| Compensation | revert inventory reallocation (release the new hold, restore the old hold); reservation stayWindow unchanged |
| User impact | staff retries with alternative payment method or contacts guest |
| Runbook | runbooks/reservation/date-change-charge-failed.md |
| Test | C7 |
F7. Room-change: lock revoke fails after key reissue
| Field | Value |
|---|---|
| Surface | reservation persists with new room; requiresManualKey=true flag set; alert RESV-007 |
| Compensation | none on aggregate; ops issues manual revoke; modification audit row carries lock_revoke_pending |
| Runbook | runbooks/reservation/room-change-lock-revoke.md |
F8. Group partial cancellation race
| Field | Value |
|---|---|
| Surface | two operators cancel different items concurrently; OCC ensures both eventually succeed |
| Compensation | second writer receives STALE_VERSION, reloads, and retries with the new version; no inventory or refund duplication |
| Runbook | runbooks/reservation/group-cancel-race.md |
| Test | C8 + concurrency-group spec |
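
The reload-and-retry behaviour in F8 can be shown with an in-memory store. This is a minimal sketch, not the service's persistence layer: `InMemoryGroupStore`, `trySave`, and `cancelItemWithRetry` are hypothetical names, and the retry limit of 3 is an assumed value.

```typescript
// Hypothetical group reservation snapshot with an OCC version counter.
interface GroupReservation {
  version: number;
  cancelledItems: string[];
}

// In-memory stand-in for the reservation store (assumption for illustration).
class InMemoryGroupStore {
  private current: GroupReservation = { version: 1, cancelledItems: [] };

  // Returns a detached copy so callers cannot mutate stored state directly.
  load(): GroupReservation {
    return {
      version: this.current.version,
      cancelledItems: this.current.cancelledItems.slice(),
    };
  }

  // Compare-and-set: succeeds only if the caller read the latest version.
  // A false result corresponds to a STALE_VERSION response.
  trySave(next: GroupReservation, expectedVersion: number): boolean {
    if (expectedVersion !== this.current.version) {
      return false;
    }
    this.current = {
      version: expectedVersion + 1,
      cancelledItems: next.cancelledItems.slice(),
    };
    return true;
  }
}

// Cancels one item, reloading and retrying on a stale version.
// Returns the number of attempts it took.
function cancelItemWithRetry(
  store: InMemoryGroupStore,
  itemId: string,
  firstRead?: GroupReservation,
  maxAttempts: number = 3,
): number {
  let snapshot = firstRead !== undefined ? firstRead : store.load();
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    if (snapshot.cancelledItems.indexOf(itemId) === -1) {
      snapshot.cancelledItems.push(itemId);
    }
    if (store.trySave(snapshot, snapshot.version)) {
      return attempt;
    }
    // Stale version: reload the latest state and reapply the change.
    snapshot = store.load();
  }
  throw new Error("retry budget exhausted for item " + itemId);
}
```

Reapplying the cancellation on top of the reloaded snapshot, rather than overwriting with the stale one, is what keeps the first operator's cancellation intact.
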
3. Concurrency & ordering failures
F9. OCC stale version on hot reservation
| Field | Value |
|---|---|
| Surface | 409 MELMASTOON.RESERVATION.STALE_VERSION; client reloads and retries |
| Detection | reservation_occ_conflicts_total counter; warning at > 2% per 30 min |
| Compensation | client-side retry with reload; server-side no-op |
| User impact | imperceptible if rate stays low; UI surfaces "Refreshing reservation…" if rate is high |
| Runbook | runbooks/reservation/occ-storm.md (look for hot reservation, ordering-key skew, or runaway client) |
F10. Out-of-order Pub/Sub delivery within an aggregate
| Field | Value |
|---|---|
| Surface | inbox handler observes a payment.captured.v1 for an unknown reservation (event ahead of its held.v1) |
| Detection | handler does not find the aggregate; raises RESERVATION_NOT_FOUND_FOR_EVENT; message NACK'd for retry |
| Compensation | redelivery resolves the race; per-aggregate ordering key prevents this in normal operation; the safety net is the inbox dedupe + retry with backoff |
| Runbook | runbooks/reservation/out-of-order-event.md |
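
The NACK-and-retry safety net in F10 can be sketched as follows. The handler signature, the `Record`-based lookups, and the backoff parameters (1 s base, 60 s cap) are assumptions for illustration; the catalog only specifies that the message is NACK'd and retried with backoff, and that the inbox dedupes.

```typescript
type Disposition = "ack" | "nack";

// Hypothetical payload; only the event name payment.captured.v1 is from the catalog.
interface PaymentCapturedEvent {
  eventId: string;
  reservationId: string;
}

// Inbox handler sketch: dedupe first, then NACK if the aggregate is not
// there yet (the RESERVATION_NOT_FOUND_FOR_EVENT case).
function handlePaymentCaptured(
  event: PaymentCapturedEvent,
  knownReservations: Record<string, boolean>,
  processedEventIds: Record<string, boolean>,
): Disposition {
  if (processedEventIds[event.eventId] === true) {
    return "ack"; // duplicate delivery: already applied
  }
  if (knownReservations[event.reservationId] !== true) {
    return "nack"; // event arrived ahead of the hold; let redelivery resolve it
  }
  processedEventIds[event.eventId] = true;
  return "ack";
}

// Exponential backoff with a cap; base and cap values are assumed.
function retryDelayMs(
  attempt: number,
  baseMs: number = 1000,
  capMs: number = 60000,
): number {
  return Math.min(capMs, baseMs * Math.pow(2, attempt - 1));
}
```

Recording the event id only after the aggregate is found matters here: a NACK'd event must not be marked as processed, or its redelivery would be dropped as a duplicate.
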
4. Storage failures
F11. Cloud SQL primary unavailable
| Field | Value |
|---|---|
| Surface | reads return 503 once the retry budget is exhausted; writes return 503 with Retry-After: 5 |
| Detection | DB connection error rate spikes; readiness probe fails |
| Compensation | Cloud SQL HA failover within 90 s; outbox relay backs off; on resume, queued writes retry |
| User impact | brief inability to book/check-in/check-out; no data loss; no double effects |
| Runbook | runbooks/reservation/postgres-unavailable.md |
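
The read and write surfaces in F11 differ, and a short sketch makes the difference concrete. Everything here is an assumption except the 503 status and the `Retry-After: 5` header: the `HttpResult` shape, the function names, the retry budget of 3 attempts, and the choice not to retry writes inside the request are illustrative only.

```typescript
// Minimal response shape for illustration (assumed, not the service's type).
interface HttpResult {
  status: number;
  headers: Record<string, string>;
  body?: string;
}

// Reads: retry up to maxAttempts (an assumed budget), then answer 503.
function readWithRetryBudget(
  query: () => string,
  maxAttempts: number = 3,
): HttpResult {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return { status: 200, headers: {}, body: query() };
    } catch (_err) {
      // Connection error: try again until the budget is spent.
    }
  }
  return { status: 503, headers: {} };
}

// Writes: fail fast and tell the client when to come back.
function writeOrReject(command: () => void): HttpResult {
  try {
    command();
    return { status: 200, headers: {} };
  } catch (_err) {
    return { status: 503, headers: { "Retry-After": "5" } };
  }
}
```

Failing writes fast in this sketch avoids repeating a non-idempotent command inside one request; the client retries after the advertised delay instead.
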
F12. Outbox relay stalls
| Field | Value |
|---|---|
| Surface | downstream services do not receive new events; alert RESV-004 |
| Detection | reservation_outbox_lag_seconds > 30 s p99 |
| Compensation | restart relay pod; if the stall persists, scale up; replay is safe because consumers are idempotent |
| Runbook | runbooks/reservation/outbox-lag.md |
F13. Inbox dedupe table corruption
| Field | Value |
|---|---|
| Surface | dedupe lookup fails or returns spurious matches; events rejected as duplicate or applied twice |
| Detection | spike in reservation_inbox_dedupe_hits_total or in domain errors from re-applied transitions |
| Compensation | aggregate state machine is the second line of defense (illegal-transition check); manual rebuild of inbox_processed from BigQuery archive |
| Runbook | runbooks/reservation/inbox-dedupe-rebuild.md |
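
The "second line of defense" in F13 is the aggregate's illegal-transition check. The sketch below shows how such a check rejects an event that slipped past a broken dedupe table. The transition table is an assumption assembled from states named in this catalog, not the service's actual state machine.

```typescript
// States named in this catalog; the allowed transitions are assumed.
type Status =
  | "held"
  | "confirmed"
  | "checked_in"
  | "checked_out"
  | "cancelled"
  | "expired_hold";

const ALLOWED_TRANSITIONS: Record<Status, Status[]> = {
  held: ["confirmed", "cancelled", "expired_hold"],
  confirmed: ["checked_in", "cancelled"],
  checked_in: ["checked_out"],
  checked_out: [],
  cancelled: [],
  expired_hold: [],
};

// Applies a transition or throws a domain error if it is not allowed.
// A re-applied event (for example a second "confirmed") lands here.
function applyTransition(current: Status, target: Status): Status {
  if (ALLOWED_TRANSITIONS[current].indexOf(target) === -1) {
    throw new Error("illegal transition: " + current + " -> " + target);
  }
  return target;
}
```

With this in place, a duplicate that the inbox fails to catch surfaces as a domain error rather than a second state change, which is the signal the Detection row describes.
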
5. Hold-expiry worker failures
F14. Worker stalled or down
| Field | Value |
|---|---|
| Surface | inventory not released for expired holds; available rooms shrink; alert RESV-002 |
| Detection | reservation_outbox_lag_seconds for hold_expired.v1 > 60 s |
| Compensation | restart worker; sweeper is single-replica by design — the next pass picks up the backlog |
| User impact | visible only as transient inventory pressure; guests get re-quote on next booking attempt |
| Runbook | runbooks/reservation/hold-expiry-stalled.md |
F15. Sweeper double-runs (mis-deployment)
| Field | Value |
|---|---|
| Surface | duplicate hold_expired.v1 for the same reservation |
| Compensation | aggregate expireHold() is idempotent — second call no-ops; consumers (inventory-service, payment-gateway-service) dedupe on event_id |
| Runbook | runbooks/reservation/sweeper-duplicates.md |
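
The two defenses in F15 can be sketched side by side: an idempotent `expireHold()` on the aggregate and a consumer that dedupes on `event_id`. The class names, the boolean return value, and the `releases` counter are hypothetical and exist only to make the behaviour observable.

```typescript
// Aggregate sketch: expireHold() changes state at most once.
class ReservationAggregate {
  status: "held" | "expired_hold" = "held";

  // Returns true only when this call performed the transition, so the
  // caller knows whether a hold_expired.v1 event should be emitted.
  expireHold(): boolean {
    if (this.status !== "held") {
      return false; // second call is a no-op
    }
    this.status = "expired_hold";
    return true;
  }
}

// Consumer sketch (for example inventory-service): dedupe on event_id.
class HoldExpiredConsumer {
  private seenEventIds: Record<string, boolean> = {};
  releases = 0; // number of allocations actually released

  onHoldExpired(eventId: string): void {
    if (this.seenEventIds[eventId] === true) {
      return; // duplicate delivery of the same event
    }
    this.seenEventIds[eventId] = true;
    this.releases += 1;
  }
}
```

Either defense alone limits the damage from a double-running sweeper; together they cover both a duplicated call on the aggregate and a duplicated delivery to consumers.
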
6. Edge-case operational failures
F16. Reservation against an out-of-order room (property.room.taken_out_of_order.v1)
| Field | Value |
|---|---|
| Surface | the property service emits the out-of-order (OOO) event; reservation-service looks up the affected active reservations |
| Compensation | for each affected reservation: trigger room_change sub-saga; if no alternative is available within the property type, emit reservation.alert.relocation_required.v1 for staff to handle manually |
| User impact | guest receives proactive notification; staff prompted in backoffice |
| Runbook | runbooks/reservation/room-ooo-reaccommodation.md |
F17. Date extension blocked by next reservation
| Field | Value |
|---|---|
| Surface | PATCH /:id date_change returns 409 MELMASTOON.RESERVATION.OVERBOOKING_BLOCKED because the next reservation already holds the room |
| Compensation | UI offers room-change as alternative |
| Runbook | runbooks/reservation/date-extension-conflict.md |
F18. No-show with prepaid
| Field | Value |
|---|---|
| Surface | RecordNoShowUseCase runs after the grace period; reservation.no_show.v1 emitted; billing-service posts no-show penalty |
| Compensation | no compensation needed — penalty is policy-driven |
| Runbook | runbooks/reservation/no-show-prepaid.md |
F19. Early checkout with refund eligibility
| Field | Value |
|---|---|
| Surface | RecordEarlyCheckoutUseCase; early_checkout.v1 and checked_out.v1 emitted; billing-service issues refund per policy |
| Runbook | runbooks/reservation/early-checkout-refund.md |
F20. Walk-in deferred KYC
| Field | Value |
|---|---|
| Surface | walk-in by phone-by-staff with deferKyc=true; the reservation is created with guest fields only partially filled |
| Compensation | follow-up task created in notification-service to ping staff for KYC within 24 h |
| Runbook | runbooks/reservation/walk-in-kyc-followup.md |
F21. OTA inbound reconciliation drift (Phase 3+)
| Field | Value |
|---|---|
| Surface | OTA partner reports a reservation we have no record of (or vice versa) |
| Detection | reservation-channel-manager-service daily reconcile job |
| Compensation | manual operator workflow under runbooks/reservation/ota-reconcile.md; not auto-resolved |
F22. Suspected-fraud auto-block timeout
| Field | Value |
|---|---|
| Surface | held-for-review reservation auto-rejects after 15 min without staff confirmation |
| Compensation | reservation.cancelled.v1 with reasonCode='guest_cancelled' and reason='fraud_review_timeout' |
| Runbook | runbooks/reservation/fraud-review-backlog.md |
7. Compound failure: payment processor outage during a hold
hold placed → 200 to client with paymentIntent.clientSecret
│
│ client attempts payment via Stripe; Stripe is down
│
▼
hold TTL elapses (10 min) → sweeper emits hold_expired.v1
│
▼
inventory-service releases allocation
payment-gateway-service cancels the unconfirmed intent
notification-service sends "We couldn't take payment — try again" email
Outcome: no double charge, no stranded inventory, no manual intervention. The same flow handles MFS, PayPal, and HesabPay outages.
8. Cross-references
- Saga choreography: 04 §7
- Outbox/inbox semantics: 04 §6
- Lock failure modes: 09 §7
- Payment failure modes: 10 §13
- Alert ladder: OBSERVABILITY §6