FAILURE_MODES — inventory-service
Sibling: OBSERVABILITY · APPLICATION_LOGIC · DATA_MODEL · SECURITY_MODEL · TESTING_STRATEGY
The inventory ledger sits underneath every booking. A failure here cascades to every property, every channel, and every payment flow. This document enumerates the realistic failure modes, their detection signal, and the runbook response. Each row maps to an alert in OBSERVABILITY §6.
1. Severity ladder
| Sev | Definition | Examples |
|---|---|---|
| P0 | Correctness or tenancy breach; user-visible double-booking; cross-tenant leak | False overbooking event observed; row visible across tenants |
| P1 | Hot-path degradation; saga stalled; outbox lag > 30 s | Allocation latency p99 > 200 ms for 10 min; DLQ arrivals; sweeper stalled |
| P2 | Background job failure; cold-path slowness; warning-class business signal | Calendar horizon dropping; reconciliation drift; overbooking-policy alert |
| P3 | Operator-driven anomaly; non-blocking | Manual release spike; offline-arbitration loss > 1% per tenant per 24 h |
2. Failure catalog
2.1 False overbooking event observed
| Field | Value |
|---|---|
| Detection | Counter inventory_overbooking_actual_total increments. Should be impossible (DB CHECK + EXCLUDE constraints). |
| Alert | RESV-INV-001 (P0) — page on-call and tech lead |
| Likely causes | (a) Constraint accidentally relaxed in a migration; (b) raw-SQL admin path bypassing constraints; (c) BYPASSRLS tooling forgot to re-enable invariants; (d) cross-tenant lock collision (impossible with tenant_id-keyed advisory locks unless code regressed) |
| Containment | Set stop_sell=true on the offending property×date via emergency runbook; freeze all writes via Cloud Run min-replicas=0 if necessary |
| Recovery | Re-run reconciliation; emit corrective room.reassigned.v1 events as needed; notify affected reservations through reservation-service |
| Postmortem | Mandatory within 5 business days; include Jepsen replay |
2.2 Advisory-lock timeout / deadlock
| Field | Value |
|---|---|
| Detection | inventory_allocation_lock_wait_ms p99 > 500 ms; inventory_allocation_lock_acquisitions_total{outcome="timeout"} non-zero |
| Alert | RESV-INV-006 (P2) when the lock-timeout rate stays above 2% over 30 min |
| Causes | Long transaction holding a row lock (e.g., a slow audit insert in the same tx); group-hold lock-order regression; Postgres wait-graph cycle |
| Mitigation | Allocation transactions are bounded to <50 ms work; canonical lock order enforced by GroupAtomicHoldPlanner; per-statement timeout 800 ms |
| Recovery | Retry transparently with exponential backoff (max 3); after 3, return MELMASTOON.INVENTORY.LOCK_TIMEOUT 503 with Retry-After. Saga compensates with allocation.failed.v1. |
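
The recovery row above can be sketched as a bounded retry loop. This is a minimal sketch under stated assumptions, not the service's implementation: `allocate_with_retry`, `LockTimeout`, `Response`, the 50 ms base delay, the full-jitter strategy, and the Retry-After value of 1 s are all hypothetical. Only the retry cap of 3, the 503 status, and the error code come from the table. The sketch reads "max 3" as three retries after the initial attempt.

```python
import random
import time
from dataclasses import dataclass
from typing import Callable, Optional


class LockTimeout(Exception):
    """Hypothetical error raised when the advisory lock is not acquired in time."""


@dataclass
class Response:
    status: int
    code: Optional[str] = None
    retry_after_s: Optional[int] = None


def allocate_with_retry(
    attempt_allocation: Callable[[], Response],
    max_retries: int = 3,            # from the table: "max 3"
    base_delay_s: float = 0.05,      # assumption: not specified in the source
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    # One initial attempt plus up to max_retries retries.
    for attempt in range(max_retries + 1):
        try:
            return attempt_allocation()
        except LockTimeout:
            if attempt == max_retries:
                break
            # Exponential backoff with full jitter: uniform in [0, base * 2^attempt].
            sleep(random.uniform(0, base_delay_s * (2 ** attempt)))
    # Retries exhausted: surface the documented error with a Retry-After hint.
    return Response(503, "MELMASTOON.INVENTORY.LOCK_TIMEOUT", retry_after_s=1)
```

Passing `sleep` as a parameter keeps the loop testable without real delays; the saga-side compensation (allocation.failed.v1) is outside the scope of this sketch.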
2.3 Outbox lag
| Field | Value |
|---|---|
| Detection | inventory_outbox_lag_seconds p99 > 5 s |
| Alert | RESV-INV-004 (P1) above 30 s for 5 min |
| Causes | Pub/Sub publish failures; relay process stuck on a single bad row; pool exhaustion |
| Recovery | Inspect outbox for next_attempt_at > now and attempts > 5; quarantine the bad row to outbox_dead; restart relay; if Pub/Sub region down, fail over to secondary region's publisher with same ordering key |
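
The inspection step in the recovery row can be expressed as a simple filter. The sketch below is illustrative only: the `OutboxRow` shape and the function name are assumptions, and the real check would run as a query against the outbox table. The two predicates (next_attempt_at > now, attempts > 5) are taken from the table.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List


@dataclass(frozen=True)
class OutboxRow:
    # Hypothetical row shape; only attempts and next_attempt_at appear in the source.
    id: int
    attempts: int
    next_attempt_at: datetime


def quarantine_candidates(
    rows: Iterable[OutboxRow], now: datetime, max_attempts: int = 5
) -> List[OutboxRow]:
    """Rows that keep failing and are parked in backoff: candidates for outbox_dead."""
    return [
        row
        for row in rows
        if row.attempts > max_attempts and row.next_attempt_at > now
    ]
```

Whether a candidate is actually moved to outbox_dead remains an operator decision; the filter only narrows the search.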
2.4 Inbox / Pub/Sub backlog
| Field | Value |
|---|---|
| Detection | Subscription backlog > 10k messages |
| Alert | RESV-INV-014 (P1) |
| Causes | Spike in reservation events (e.g., bulk import); slow handler; downstream DB slow |
| Recovery | Scale inventory-service to max replicas; throttle ingress at BFFs; if a poison message is the cause, identify via DLQ and inbox_processed.last_error |
2.5 DLQ arrivals
| Field | Value |
|---|---|
| Detection | DLQ depth > 0 on any inbound subject |
| Alert | RESV-INV-007 (P1) |
| Causes | Schema drift (producer published unknown major version); idempotency-key collision; downstream DB constraint violation due to upstream bug |
| Recovery | Quarantine DLQ messages to a triage queue; replay into inbox_processed with corrective payload after engineering review; emit corresponding outbound event if the affected business state was reachable |
2.6 Hold expiry sweeper stalled
| Field | Value |
|---|---|
| Detection | inventory_hold_expiry_lag_seconds p99 > 30 s; sweeper has not run in > 90 s |
| Alert | RESV-INV-005 (P1) |
| Causes | Cloud Scheduler outage; sweeper instance unhealthy; query slow due to missing index after a migration |
| Recovery | Restart Cloud Run service; verify partial index room_allocations_held_expiry_idx; manually invoke /internal/jobs/expire-holds via emergency credentials. reservation-service holds release themselves once their own timers fire, but the inventory rows stay held until a manual catch-up sweep clears them |
2.7 Calendar horizon shortfall
| Field | Value |
|---|---|
| Detection | inventory_calendar_horizon_days_min{property} < 30 |
| Alert | RESV-INV-008 (P2) |
| Causes | extend-calendar-horizon job failed; new tenant onboarded but property.created.v1 not consumed; property.room.created.v1 arrived but extension stalled |
| Recovery | Run job manually; backfill missing dates; investigate failed job logs; root-cause if job container crashed |
2.8 Partition rotation failure
| Field | Value |
|---|---|
| Detection | inventory_partition_rotation_total{outcome="failure"} > 0 |
| Alert | RESV-INV-009 (P1) |
| Causes | Permission drift on schema; disk full; lock contention on parent table during rotation window |
| Recovery | Re-run job; if forward partition missing, default partition will receive writes (slow but functional) until repair; export old partitions to GCS and detach as designed |
2.9 Calendar exhaustion (writes against missing partition)
| Field | Value |
|---|---|
| Detection | Postgres error: no partition of relation "room_type_inventory_daily" found for row (rare; the default partition normally catches these writes) |
| Alert | linked to RESV-INV-008 + RESV-INV-009 |
| Causes | Combined horizon shortfall + partition rotation failure |
| Recovery | Default partition receives writes; immediately create missing partitions and INSERT INTO … SELECT … FROM default_partition WHERE …; truncate default partition |
2.10 OOO during active reservation — reaccommodation backlog
| Field | Value |
|---|---|
| Detection | inventory_reaccommodation_pub_latency_ms p95 > 2 s; reaccommodation events accumulating without consumer ack |
| Alert | linked to RESV-INV-007 if DLQ on reservation-service consumer |
| Cause | reservation-service slow to handle reaccommodation_required.v1; large batch of OOO events |
| Recovery | Inspect reservation-service consumer; emit retries; do not auto-resolve from inventory side — reservation-service owns the room-change sub-saga |
2.11 Group-hold partial-fail thrashing
| Field | Value |
|---|---|
| Detection | inventory_group_hold_total{outcome="partial_fail"} spiking on a single tenant |
| Alert | P3 (notify GM) |
| Cause | Sales operator repeatedly attempting a 10-room hold that conflicts with stop-sell or block; or honest race on tight inventory |
| Recovery | Educate operator via UI hint; service behavior is correct — no remediation needed beyond visibility |
2.12 Walk-in vs saga race for the last room
| Field | Value |
|---|---|
| Detection | inventory_allocation_failed_total{reasonCode="insufficient_availability"} spike correlated with channel mix |
| Cause | Operator booked the last room via walk-in 50 ms before a channel manager's reservation arrived |
| Recovery | Service behavior is correct (atomic). UI should surface "channel hold lost; suggest rebook" message on the channel side. |
2.13 Sweeper-vs-confirm race
| Field | Value |
|---|---|
| Detection | inventory_sweeper_skipped_due_to_advance_total > 0 (sweeper hit a row whose status had already moved past held) |
| Alert | none — informational |
| Cause | normal: confirm event arrived during the sweeper window |
| Recovery | none — sweeper logic is SELECT … FOR UPDATE SKIP LOCKED + state guard |
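
The state guard mentioned in the recovery row can be modeled in memory. This sketch deliberately leaves out the locking half (SELECT … FOR UPDATE SKIP LOCKED) and only shows why a confirm that lands during the sweeper window is harmless. The `Allocation` shape and the status names other than held are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List, Tuple


@dataclass
class Allocation:
    # Hypothetical row shape; "expired" and "confirmed" are assumed status names.
    id: int
    status: str
    hold_until: datetime


def sweep(candidates: Iterable[Allocation], now: datetime) -> Tuple[List[int], int]:
    """Expire overdue holds; count rows that already advanced past 'held'."""
    expired: List[int] = []
    skipped = 0
    for row in candidates:
        if row.status != "held":
            # State guard: the row moved on (for example, it was confirmed)
            # between candidate selection and processing. Skip and count it.
            skipped += 1
            continue
        if row.hold_until <= now:
            row.status = "expired"
            expired.append(row.id)
    return expired, skipped
```

In this model the `skipped` count plays the role of inventory_sweeper_skipped_due_to_advance_total: a non-zero value indicates the guard worked, not that something failed.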
2.14 Cloud SQL primary failover
| Field | Value |
|---|---|
| Detection | Connection errors; replica promotion notification |
| Alert | platform-level (Cloud SQL alert), correlated with RESV-INV-002 / 004 |
| Cause | maintenance, AZ outage, or regression |
| Recovery | Connection pool reconnects to new primary; in-flight transactions fail and saga compensates; relay resumes from outbox; idempotency keys prevent double-effect |
2.15 Memorystore unavailable
| Field | Value |
|---|---|
| Detection | Cache error rate > 50% |
| Alert | RESV-INV-003 (search slow) |
| Cause | maintenance, eviction storm |
| Recovery | Service degrades to direct Postgres queries; throughput drops and latency rises, but functionality is preserved (the cache is a courtesy layer; the source of truth is Postgres) |
2.16 Cross-tenant exposure attempt
| Field | Value |
|---|---|
| Detection | MELMASTOON.TENANT.MISMATCH errors > baseline; or RLS 403 on direct DB access |
| Alert | platform Security alert |
| Cause | client bug (BFF passing wrong header), credential reuse, hostile probe |
| Recovery | Block the offending principal; review audit log; do not silently allow |
2.17 Offline-arbitration loss spike
| Field | Value |
|---|---|
| Detection | inventory_offline_arbitration_lost_total / pushed_total > 1% per tenant per 24 h |
| Alert | RESV-INV-013 (P2) |
| Cause | Operator working offline for too long; clock drift on device; calendar horizon shortened |
| Recovery | Notify the property's GM via the desktop UI; suggest reconnecting more often; review device clock |
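
The detection ratio above is simple arithmetic, shown here as a small sketch. The function name and the choice to treat a tenant with zero pushes as healthy are assumptions; the 1% threshold and the lost / pushed ratio come from the table. The counts are assumed to be pre-aggregated per tenant over a 24 h window.

```python
def arbitration_loss_exceeded(lost: int, pushed: int, threshold: float = 0.01) -> bool:
    """True when the lost/pushed ratio for one tenant's 24 h window exceeds the threshold."""
    if pushed == 0:
        # Assumption: no offline pushes means nothing could be lost, so no alert.
        return False
    return lost / pushed > threshold
```

For example, 3 lost out of 200 pushed is 1.5% and would trip the alert, while 2 out of 200 is exactly 1% and would not, since the comparison is strict.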
2.18 Bug in domain rejection logic (false INSUFFICIENT_AVAILABILITY)
| Field | Value |
|---|---|
| Detection | Spike in allocation.failed.v1 with reasonCode='insufficient_availability' while room_type_inventory_daily.available > 0 (reconciliation job will catch this within 30 min) |
| Alert | RESV-INV-011 (P2) reconciliation drift |
| Recovery | Roll back recent inventory-service deploy; investigate domain regression with property-based tests |
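
The detection rule for this failure mode can be sketched as a cross-check between failure events and current availability. Everything below is hypothetical apart from the reason code and the available > 0 condition: the key shape, `FailedAllocation`, and the function name are illustrative, and the real reconciliation job may compare against a snapshot rather than live values.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, Iterable, List, Tuple

# Hypothetical key: (property_id, room_type_id, stay_date).
Key = Tuple[str, str, date]


@dataclass(frozen=True)
class FailedAllocation:
    key: Key
    reason_code: str


def suspicious_rejections(
    failures: Iterable[FailedAllocation], available: Dict[Key, int]
) -> List[FailedAllocation]:
    """Rejections for insufficient availability where rooms were in fact available."""
    return [
        f
        for f in failures
        if f.reason_code == "insufficient_availability"
        and available.get(f.key, 0) > 0
    ]
```

A handful of hits can be an honest race (the room was released after the rejection); a sustained spike after a deploy is the signal this section describes.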
2.19 Schema migration regression
| Field | Value |
|---|---|
| Detection | Smoke test failure on canary; or P0 RESV-INV-001 immediately after migration |
| Recovery | Roll back Cloud Run revision; revert the migration only if a backwards-compatible down migration is available, otherwise patch forward |
| Mitigation | All migrations are expand → backfill → contract; CI rejects destructive single-PR migrations (see MIGRATION_PLAN) |
2.20 Time skew between services
| Field | Value |
|---|---|
| Detection | holdUntil boundaries firing on the wrong side; sweeper expiring too early or too late |
| Cause | Container time drift; timezone misconfiguration on a worker |
| Recovery | All hosts run NTP; service derives "now" from Clock port (mockable in tests); inventory_clock_drift_seconds panel verifies < 500 ms |
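
The Clock port mentioned above can be illustrated with a small sketch. The class and method names (`Clock`, `SystemClock`, `FixedClock`, `hold_expired`) are assumptions, as is the choice that a hold counts as expired at exactly holdUntil; the source only states that "now" comes from a mockable Clock port.

```python
from datetime import datetime, timedelta, timezone
from typing import Protocol


class Clock(Protocol):
    """Port through which domain code asks for the current time."""

    def now(self) -> datetime: ...


class SystemClock:
    """Production adapter: wall-clock time, always timezone-aware UTC."""

    def now(self) -> datetime:
        return datetime.now(timezone.utc)


class FixedClock:
    """Test adapter: time moves only when the test advances it."""

    def __init__(self, instant: datetime) -> None:
        self._instant = instant

    def now(self) -> datetime:
        return self._instant

    def advance(self, delta: timedelta) -> None:
        self._instant += delta


def hold_expired(hold_until: datetime, clock: Clock) -> bool:
    # Assumption: the boundary is inclusive, so a hold expires at hold_until itself.
    return clock.now() >= hold_until
```

Routing every time read through one port keeps boundary behavior reproducible in tests and makes timezone mistakes harder, because only the adapter decides how "now" is produced.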
3. Manual emergency levers
| Lever | Effect | Who | Audit |
|---|---|---|---|
| POST /internal/admin/stop-sell-property | Sets stop_sell=true for all dates of a property | platform-engineer (break-glass IAM, two-person rule) | Required |
| POST /internal/admin/freeze-tenant-writes | Rejects all writes for a tenant with MELMASTOON.INVENTORY.MAINTENANCE_FREEZE | platform-engineer | Required |
| POST /internal/admin/replay-outbox | Re-emit a specific outbox row | platform-engineer | Required |
| POST /internal/admin/reconcile-property | Force a calendar reconciliation for one property | sre | Required |
All endpoints require audit-service correlation IDs and write to audit_log synchronously.
4. Cross-references
- Alerts wiring: OBSERVABILITY §6
- Concurrency tests that prove containment: TESTING_STRATEGY §4.3 + §5
- Constraint design: DATA_MODEL §4 + §5
- Saga compensation: APPLICATION_LOGIC §3
- Reservation failure modes: reservation-service FAILURE_MODES