FAILURE_MODES — inventory-service

Sibling: OBSERVABILITY · APPLICATION_LOGIC · DATA_MODEL · SECURITY_MODEL · TESTING_STRATEGY

The inventory ledger sits underneath every booking. A failure here cascades to every property, every channel, and every payment flow. This document enumerates the realistic failure modes, their detection signal, and the runbook response. Each row maps to an alert in OBSERVABILITY §6.

1. Severity ladder

Sev	Definition	Examples
P0	Correctness or tenancy breach; user-visible double-booking; cross-tenant leak	False overbooking event observed; row visible across tenants
P1	Hot-path degradation; saga stalled; outbox lag > 30 s	Allocation latency p99 > 200 ms for 10 min; DLQ arrivals; sweeper stalled
P2	Background job failure; cold-path slowness; warning-class business signal	Calendar horizon dropping; reconciliation drift; overbooking-policy alert
P3	Operator-driven anomaly; non-blocking	Manual release spike; offline-arbitration loss > 1% per tenant per 24 h

2. Failure catalog

2.1 False overbooking event observed

Field	Value
Detection	Counter `inventory_overbooking_actual_total` increments. Should be impossible (DB CHECK + EXCLUDE constraints).
Alert	RESV-INV-001 (P0) — page on-call and tech lead
Likely causes	(a) Constraint accidentally relaxed in a migration; (b) raw-SQL admin path bypassing constraints; (c) BYPASSRLS tooling forgot to re-enable invariants; (d) cross-tenant lock collision (impossible with `tenant_id`-keyed advisory locks unless code regressed)
Containment	Set `stop_sell=true` on the offending property×date via emergency runbook; freeze all writes via Cloud Run min-replicas=0 if necessary
Recovery	Re-run reconciliation; emit corrective `room.reassigned.v1` events as needed; notify affected reservations through `reservation-service`
Postmortem	Mandatory within 5 business days; include Jepsen replay

2.2 Advisory-lock timeout / deadlock

Field	Value
Detection	`inventory_allocation_lock_wait_ms` p99 > 500 ms; `inventory_allocation_lock_acquisitions_total{outcome="timeout"}` non-zero
Alert	RESV-INV-006 (P2) above 2% over 30 min
Causes	Long transaction holding a row lock (e.g., a slow audit insert in the same tx); group-hold lock-order regression; Postgres wait-graph cycle
Mitigation	Allocation transactions are bounded to < 50 ms of work; canonical lock order enforced by `GroupAtomicHoldPlanner`; per-statement timeout 800 ms
Recovery	Retry transparently with exponential backoff (max 3); after 3, return `MELMASTOON.INVENTORY.LOCK_TIMEOUT` (HTTP 503) with `Retry-After`. The saga compensates with `allocation.failed.v1`.
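The retry policy in the recovery row can be sketched roughly as follows. This is a minimal illustration, not the service's actual code: `LockTimeout`, `Response`, and `allocate_with_retry` are hypothetical names, "max 3" is read here as three retries after the first attempt, and the base delay, jitter, and `Retry-After` value are assumptions.

```python
import random
import time
from dataclasses import dataclass
from typing import Callable, Optional


class LockTimeout(Exception):
    """Hypothetical error raised when the advisory lock is not acquired in time."""


@dataclass
class Response:
    status: int
    code: Optional[str] = None
    retry_after_s: Optional[int] = None


def allocate_with_retry(
    attempt: Callable[[], None],
    max_retries: int = 3,          # assumption: 3 retries after the first attempt
    base_delay_s: float = 0.05,    # assumption: 50 ms base delay
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Run `attempt`, retrying lock timeouts with exponential backoff."""
    for retry in range(max_retries + 1):
        try:
            attempt()
            return Response(status=200)
        except LockTimeout:
            if retry == max_retries:
                break
            # Full jitter: sleep a random time up to base * 2^retry.
            sleep(random.uniform(0, base_delay_s * (2 ** retry)))
    # Retries exhausted: return the documented error so the saga can compensate.
    return Response(
        status=503,
        code="MELMASTOON.INVENTORY.LOCK_TIMEOUT",
        retry_after_s=1,  # assumption: the real Retry-After value is not specified
    )
```

Passing `sleep` as a parameter keeps the backoff testable without real delays.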

2.3 Outbox lag

Field	Value
Detection	`inventory_outbox_lag_seconds` p99 > 5 s
Alert	RESV-INV-004 (P1) above 30 s for 5 min
Causes	Pub/Sub publish failures; relay process stuck on a single bad row; pool exhaustion
Recovery	Inspect `outbox` for `next_attempt_at > now` and `attempts > 5`; quarantine the bad row to `outbox_dead`; restart the relay; if the Pub/Sub region is down, fail over to the secondary region's publisher with the same ordering key
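The inspect-and-quarantine step can be illustrated with a small sketch. It uses an in-memory SQLite database purely as a stand-in for the production Postgres tables; the `id` and `payload` columns, the epoch-second timestamps, and the function names are assumptions, while `attempts`, `next_attempt_at`, `outbox`, and `outbox_dead` come from the recovery row above.

```python
import sqlite3
from typing import List


def quarantine_stuck_rows(
    conn: sqlite3.Connection, now_epoch_s: int, max_attempts: int = 5
) -> List[int]:
    """Move rows that keep failing from outbox to outbox_dead; return moved ids."""
    stuck = conn.execute(
        "SELECT id, payload, attempts, next_attempt_at FROM outbox "
        "WHERE attempts > ? AND next_attempt_at > ? ORDER BY id",
        (max_attempts, now_epoch_s),
    ).fetchall()
    with conn:  # single transaction: copy and delete succeed or fail together
        conn.executemany(
            "INSERT INTO outbox_dead (id, payload, attempts, next_attempt_at) "
            "VALUES (?, ?, ?, ?)",
            stuck,
        )
        conn.executemany(
            "DELETE FROM outbox WHERE id = ?", [(row[0],) for row in stuck]
        )
    return [row[0] for row in stuck]


def make_demo_db() -> sqlite3.Connection:
    """Build a throwaway database with a hypothetical outbox schema."""
    conn = sqlite3.connect(":memory:")
    for table in ("outbox", "outbox_dead"):
        conn.execute(
            f"CREATE TABLE {table} (id INTEGER PRIMARY KEY, payload TEXT, "
            "attempts INTEGER, next_attempt_at INTEGER)"
        )
    conn.executemany(
        "INSERT INTO outbox VALUES (?, ?, ?, ?)",
        [(1, "ok", 0, 900), (2, "bad", 9, 2000), (3, "retrying", 2, 1500), (4, "due", 7, 500)],
    )
    conn.commit()
    return conn
```

With `now_epoch_s=1000`, only row 2 matches both conditions and is moved; the relay can then be restarted against the remaining rows.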

2.4 Inbox / Pub/Sub backlog

2.5 DLQ arrivals

Field	Value
Detection	DLQ depth > 0 on any inbound subject
Alert	RESV-INV-007 (P1)
Causes	Schema drift (producer published an unknown major version); idempotency-key collision; downstream DB constraint violation due to an upstream bug
Recovery	Quarantine DLQ messages to a triage queue; replay into `inbox_processed` with a corrective payload after engineering review; emit the corresponding outbound event if the affected business state was reachable

2.6 Hold expiry sweeper stalled

Field	Value
Detection	`inventory_hold_expiry_lag_seconds` p99 > 30 s; sweeper has not run in > 90 s
Alert	RESV-INV-005 (P1)
Causes	Cloud Scheduler outage; sweeper instance unhealthy; query slow due to a missing index after a migration
Recovery	Restart the Cloud Run service; verify the partial index `room_allocations_held_expiry_idx`; manually invoke `/internal/jobs/expire-holds` via emergency credentials. Holds in reservation-service release themselves once their own timer fires, but the inventory rows remain held until a manual catch-up sweep clears them.

2.7 Calendar horizon shortfall

Field	Value
Detection	`inventory_calendar_horizon_days_min{property}` < 30
Alert	RESV-INV-008 (P2)
Causes	`extend-calendar-horizon` job failed; new tenant onboarded but `property.created.v1` not consumed; `property.room.created.v1` arrived but extension stalled
Recovery	Run the job manually; backfill missing dates; investigate failed job logs; root-cause if the job container crashed
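One plausible reading of the detection metric is the smallest number of days, across a property's room types, between today and the last materialized calendar date. The sketch below computes it under that assumption; the row shape and function names are hypothetical, and only the 30-day threshold comes from the table above.

```python
from datetime import date
from typing import Dict, Iterable, List, Tuple

# Hypothetical row shape: (property_id, room_type_id, last_calendar_date)
CalendarRow = Tuple[str, str, date]


def horizon_days_min(rows: Iterable[CalendarRow], today: date) -> Dict[str, int]:
    """Return, per property, the shortest remaining horizon across its room types."""
    horizons: Dict[str, int] = {}
    for property_id, _room_type_id, last_date in rows:
        days = (last_date - today).days
        horizons[property_id] = min(days, horizons.get(property_id, days))
    return horizons


def properties_below_threshold(
    horizons: Dict[str, int], threshold_days: int = 30
) -> List[str]:
    """List properties whose minimum horizon is under the alert threshold."""
    return sorted(p for p, days in horizons.items() if days < threshold_days)
```

A property is flagged as soon as any one of its room types falls short, which matches the "min" in the metric name.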

2.8 Partition rotation failure

Field	Value
Detection	`inventory_partition_rotation_total{outcome="failure"}` > 0
Alert	RESV-INV-009 (P1)
Causes	Permission drift on schema; disk full; lock contention on parent table during rotation window
Recovery	Re-run the job; if the forward partition is missing, the default partition will receive writes (slow but functional) until repair; export old partitions to GCS and detach as designed

2.9 Calendar exhaustion (writes against missing partition)

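Field	Value
Detection	Postgres error `no partition of relation "room_type_inventory_daily" found for row` (rare; the default partition normally catches these writes)
Alert	Linked to RESV-INV-008 + RESV-INV-009
Causes	Combined horizon shortfall + partition rotation failure
Recovery	The default partition receives writes in the meantime. Create the missing partitions immediately, move the stranded rows with `INSERT INTO … SELECT … FROM default_partition WHERE …`, and truncate the default partition only after verifying that every row was moved. Postgres rejects creating a partition whose bounds cover rows still sitting in the default partition, so those rows may have to be moved out of the default partition before the new partition is created.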

2.10 OOO during active reservation — reaccommodation backlog

Field	Value
Detection	`inventory_reaccommodation_pub_latency_ms` p95 > 2 s; reaccommodation events accumulating without consumer ack
Alert	Linked to RESV-INV-007 if DLQ on the reservation-service consumer
Cause	reservation-service slow to handle `reaccommodation_required.v1`; large batch of OOO events
Recovery	Inspect the reservation-service consumer; emit retries; do not auto-resolve from the inventory side, because reservation-service owns the room-change sub-saga

2.11 Group-hold partial-fail thrashing

Field	Value
Detection	`inventory_group_hold_total{outcome="partial_fail"}` spiking on a single tenant
Alert	P3 (notify gm)
Cause	Sales operator repeatedly attempting a 10-room hold that conflicts with a stop-sell or block; or an honest race on tight inventory
Recovery	Educate the operator via a UI hint; service behavior is correct, so no remediation is needed beyond visibility

2.12 Walk-in vs saga race for the last room

Field	Value
Detection	`inventory_allocation_failed_total{reasonCode="insufficient_availability"}` spike correlated with channel mix
Cause	Operator booked the last room via walk-in 50 ms before a channel manager's reservation arrived
Recovery	Service behavior is correct (atomic). UI should surface a "channel hold lost; suggest rebook" message on the channel side.
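The "atomic" outcome can be illustrated with a conditional decrement, where the availability check and the write happen in one statement so only one of two competing requests can win. This is a stand-in technique shown on SQLite, not the service's actual mechanism (which the document describes as advisory locks plus database constraints); the `inventory` table, its columns, and `try_allocate` are hypothetical.

```python
import sqlite3


def try_allocate(conn: sqlite3.Connection, room_type: str, day: str, qty: int = 1) -> str:
    """Decrement availability only if enough rooms remain; report the outcome."""
    with conn:  # commit on success, roll back on error
        cur = conn.execute(
            "UPDATE inventory SET available = available - ? "
            "WHERE room_type = ? AND day = ? AND available >= ?",
            (qty, room_type, day, qty),
        )
    # rowcount is 1 when the guard matched, 0 when the last room was already taken.
    return "allocated" if cur.rowcount == 1 else "insufficient_availability"


def make_demo_inventory() -> sqlite3.Connection:
    """Throwaway database holding a single remaining room."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE inventory (room_type TEXT, day TEXT, "
        "available INTEGER CHECK (available >= 0))"
    )
    conn.execute("INSERT INTO inventory VALUES ('std', '2025-01-10', 1)")
    conn.commit()
    return conn
```

In this sketch the walk-in request takes the last room and the channel request that arrives afterwards receives `insufficient_availability`, which is the behavior the UI message has to explain.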

2.13 Sweeper-vs-confirm race

2.14 Cloud SQL primary failover

Field	Value
Detection	Connection errors; replica promotion notification
Alert	Platform-level (Cloud SQL alert), correlated with RESV-INV-002 / 004
Cause	Maintenance, AZ outage, or regression
Recovery	Connection pool reconnects to the new primary; in-flight transactions fail and the saga compensates; relay resumes from the outbox; idempotency keys prevent double-effect
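The last point, idempotency keys preventing double-effect when a request is retried after failover, can be sketched as below. The class and its in-memory result store are hypothetical; a real implementation would persist the key and the outcome in the same transaction as the allocation.

```python
from typing import Dict


class IdempotentAllocator:
    """Toy allocator that applies each idempotency key at most once."""

    def __init__(self, available: int) -> None:
        self.available = available
        self._results: Dict[str, str] = {}  # idempotency key -> recorded outcome

    def allocate(self, idempotency_key: str, rooms: int) -> str:
        # A replayed request returns the recorded outcome without touching inventory.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        if rooms > self.available:
            outcome = "insufficient_availability"
        else:
            self.available -= rooms
            outcome = "allocated"
        self._results[idempotency_key] = outcome
        return outcome
```

A saga step that timed out during failover can be resent with the same key: the second call reports the first call's outcome and inventory is decremented only once.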

2.15 Memorystore unavailable

2.16 Cross-tenant exposure attempt

2.17 Offline-arbitration loss spike

2.18 Bug in domain rejection logic (false `INSUFFICIENT_AVAILABILITY`)

Field	Value
Detection	Spike in `allocation.failed.v1` with `reasonCode='insufficient_availability'` while `room_type_inventory_daily.available > 0` (the reconciliation job will catch this within 30 min)
Alert	RESV-INV-011 (P2) reconciliation drift
Recovery	Roll back the recent inventory-service deploy; investigate the domain regression with property-based tests
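The detection condition amounts to joining failure events against current availability. The sketch below shows that check under stated assumptions: the `FailedAllocation` fields, the lookup key, and the function name are hypothetical, and only the reason code and the `available > 0` condition are taken from the table above.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple


class FailedAllocation(NamedTuple):
    property_id: str
    room_type_id: str
    stay_date: str
    reason_code: str


# Hypothetical key into an availability snapshot: (property, room type, date)
Key = Tuple[str, str, str]


def suspicious_rejections(
    failures: Iterable[FailedAllocation], available: Dict[Key, int]
) -> List[FailedAllocation]:
    """Return rejections that claimed no availability while rooms were free."""
    flagged = []
    for failure in failures:
        if failure.reason_code != "insufficient_availability":
            continue
        key = (failure.property_id, failure.room_type_id, failure.stay_date)
        # Missing snapshot entries are treated as zero availability.
        if available.get(key, 0) > 0:
            flagged.append(failure)
    return flagged
```

A request for more rooms than were free is a legitimate rejection that this simple check would still flag, so a production version would also need to compare the requested quantity.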

2.19 Schema migration regression

Field	Value
Detection	Smoke test failure on canary; or P0 RESV-INV-001 immediately after migration
Recovery	Roll back the Cloud Run revision; revert the migration only if a backwards-compatible down migration is available; otherwise quick-patch forward
Mitigation	All migrations are expand → backfill → contract; CI rejects destructive single-PR migrations (see MIGRATION_PLAN)

2.20 Time skew between services

Field	Value
Detection	`holdUntil` boundaries firing on the wrong side; sweeper expiring too early or too late
Cause	Container time drift; timezone misconfiguration on a worker
Recovery	All hosts run NTP; the service derives "now" from the `Clock` port (mockable in tests); the `inventory_clock_drift_seconds` panel verifies < 500 ms
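A `Clock` port of the kind mentioned in the recovery row might look like the sketch below. Only the idea of a mockable clock comes from the document; the method name, the `FixedClock` test double, and the choice to treat a hold as expired exactly at `holdUntil` are assumptions.

```python
from datetime import datetime, timedelta, timezone
from typing import Protocol


class Clock(Protocol):
    """Port through which domain code reads the current time."""

    def now(self) -> datetime: ...


class SystemClock:
    """Production adapter: timezone-aware UTC from the host clock."""

    def now(self) -> datetime:
        return datetime.now(timezone.utc)


class FixedClock:
    """Test adapter whose time only moves when the test advances it."""

    def __init__(self, instant: datetime) -> None:
        self._instant = instant

    def now(self) -> datetime:
        return self._instant

    def advance(self, delta: timedelta) -> None:
        self._instant += delta


def hold_expired(hold_until: datetime, clock: Clock) -> bool:
    """Assumption: a hold counts as expired at or after its holdUntil instant."""
    return clock.now() >= hold_until
```

Using timezone-aware UTC throughout avoids the worker timezone misconfiguration listed under causes, and the fixed clock lets boundary cases be tested deterministically.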

3. Manual emergency levers

Lever	Effect	Who	Audit
`POST /internal/admin/stop-sell-property`	Sets `stop_sell=true` for all dates of a property	platform-engineer (break-glass IAM, two-person rule)	Required
`POST /internal/admin/freeze-tenant-writes`	Rejects all writes for a tenant with `MELMASTOON.INVENTORY.MAINTENANCE_FREEZE`	platform-engineer	Required
`POST /internal/admin/replay-outbox`	Re-emit a specific outbox row	platform-engineer	Required
`POST /internal/admin/reconcile-property`	Force a calendar reconciliation for one property	sre	Required

All endpoints require audit-service correlation IDs and write to audit_log synchronously.

4. Cross-references

Alerts wiring: OBSERVABILITY §6
Concurrency tests that prove containment: TESTING_STRATEGY §4.3 + §5
Constraint design: DATA_MODEL §4 + §5
Saga compensation: APPLICATION_LOGIC §3
Reservation failure modes: reservation-service FAILURE_MODES
