SERVICE_RISK_REGISTER — inventory-service
Sibling: SERVICE_READINESS · SECURITY_MODEL · FAILURE_MODES · TESTING_STRATEGY
inventory-service is the platform's correctness-critical service. Its top risk is overbooking — issuing two committed allocations for the same physical room-night, or for more rooms of a type than the property has. Every mitigation, test, and runbook is shaped around making that risk effectively zero.
Risks are scored on Likelihood (L) and Impact (I) from 1 (low) to 5 (catastrophic). Score = L × I. Anything ≥ 12 is escalated to the platform risk register.
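The scoring rule can be stated as code. This is a minimal TypeScript sketch: the names `Rating`, `ESCALATION_THRESHOLD`, `riskScore`, and `needsEscalation` are illustrative and do not refer to any existing tooling. Only the 1 to 5 scale, the product rule, and the threshold of 12 come from this register.

```typescript
// Hypothetical helper names; only the 1-5 scale, Score = L × I,
// and the >= 12 escalation threshold are taken from the register.
type Rating = 1 | 2 | 3 | 4 | 5;

const ESCALATION_THRESHOLD = 12;

// Score is the plain product of likelihood and impact.
function riskScore(likelihood: Rating, impact: Rating): number {
  return likelihood * impact;
}

// A risk is escalated to the platform risk register at or above the threshold.
function needsEscalation(likelihood: Rating, impact: Rating): boolean {
  return riskScore(likelihood, impact) >= ESCALATION_THRESHOLD;
}
```

Applied to the table in section 1, only R3 (3 × 4 = 12) reaches the threshold. R14 (2 × 5 = 10) is the next highest.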
1. Top risks
| # | Risk | L | I | Score | Owner | Mitigations | Residual |
|---|---|---|---|---|---|---|---|
| R1 | False overbooking — committed allocations exceed total + overbooking_cap, or two committed allocations target the same physical room-night | 1 | 5 | 5 | tech-lead | Domain invariants I2, I3, I8; DB CHECK available ≥ 0; EXCLUDE USING gist (room_id WITH =, stay_range WITH &&) WHERE status IN ('held','committed'); advisory lock per (tenant, property, room_type, date); Jepsen-style consistency check at every release-candidate; alert RESV-INV-001 (P0) on any non-zero inventory_overbooking_actual_total; counter must always be 0 | very low |
| R2 | Hold leak — held allocations never release, occupying inventory that is actually free | 2 | 4 | 8 | tech-lead | Hold-expiry sweeper every 30 s; holdUntil invariant I4; alert RESV-INV-005 if sweeper lag > 60 s; manual DELETE /allocations/:id available; reservation-service emits hold_expired.v1 independently as defense in depth | low |
| R3 | OOO/OOS during active reservation with no operator follow-through — guest arrives, room is broken, no alternative arranged | 3 | 4 | 12 | product + ops | block.created.v1 triggers reaccommodation_required.v1; reservation-service runs room-change sub-saga; backoffice UI surfaces unresolved reaccommodations; SLO RESV-INV-008 watches reaccommodation publication latency | medium |
| R4 | Cross-tenant leak — query or write touches another tenant's inventory | 1 | 5 | 5 | security | Five-layer defense: JWT/X-Tenant-Id, Postgres RLS, domain TenantId invariant, advisory-lock key includes tenant_id, outbox tagging; tenant-isolation.spec.ts mandatory test | very low |
| R5 | Calendar exhaustion — no future inventory rows because horizon-extender failed | 2 | 4 | 8 | sre | Daily extender; alert RESV-INV-008 when the future-inventory horizon drops below 30 days; default partition catches missing partition writes; reconciliation job fixes drift | low |
| R6 | Partition rotation failure — old partitions never detached / new partitions never created; query latency degrades | 2 | 3 | 6 | sre | Daily rotator; alert RESV-INV-009; default partition keeps writes flowing; manual replay procedure documented | low |
| R7 | Advisory-lock contention storm — long transactions hold locks; allocations time out across the platform | 2 | 4 | 8 | tech-lead | Per-statement timeout 800 ms; allocation tx scoped to <50 ms; retry with backoff; LOCK_TIMEOUT 503 + saga compensation; alert RESV-INV-006 | low |
| R8 | Schema drift in produced events — consumer breaks because a field's semantics changed | 2 | 4 | 8 | events-owner | Additive-only minor versions; vN+1 topic for breaking changes; Pact contracts for every consumer; CI fails on drift | low |
| R9 | DLQ poison message — a broken consumed event blocks the inbox subscription | 3 | 3 | 9 | events-owner | DLQs configured per subscription; alert RESV-INV-007; runbook to quarantine + replay | medium |
| R10 | Operator error — wrong room manually held, wrong block created, OOO marked on the wrong unit | 4 | 2 | 8 | product + ops | Audit log on every state change; backoffice UI confirmation flows; manual-release spike alert RESV-INV-012; nightly GM review report | medium |
| R11 | Offline desktop drift — operator works offline for too long; arbitration loss rate climbs | 3 | 2 | 6 | product | Snapshot-staleness banner > 6 h; arbitration loss alert RESV-INV-013; UI nudges reconnect; offline allocations capped (single-room, ≤14 nights, no group) | low |
| R12 | Cloud SQL primary unavailability | 2 | 4 | 8 | sre | Cloud SQL HA with automatic failover; Cloud Run reconnects; outbox + idempotency keys preserve correctness; saga compensates on in-flight rollbacks | low |
| R13 | Pub/Sub regional outage affecting outbox publication | 2 | 3 | 6 | sre | Outbox table buffers events; relay resumes when Pub/Sub recovers; secondary region failover documented | low |
| R14 | Migration regression — backwards-incompatible migration breaks allocation hot path | 2 | 5 | 10 | tech-lead | Expand → backfill → contract; CI rejects destructive single-PR migrations; canary 5%; auto-rollback on RESV-INV-001 | low |
| R15 | Ledger growth — room_type_inventory_daily and room_allocations grow unbounded | 3 | 2 | 6 | sre | Monthly partitioning; old partitions exported to GCS + detached after 24 months retention; analytical store via BigQuery | low |
| R16 | Misconfigured overbooking policy — owner enables overbooking on a property by accident | 2 | 3 | 6 | product | OCC + audit log on every policy change; alert RESV-INV-010 fires on first overbooking event; default policy enabled=false | low |
| R17 | Misuse of break-glass admin endpoints | 2 | 4 | 8 | security | Two-person IAM rule; sync audit log; break-glass usage reviewed weekly | low |
| R18 | Missing index after migration — allocation latency degrades silently | 2 | 3 | 6 | tech-lead | EXPLAIN ANALYZE in CI for hot queries; alert RESV-INV-002; pgaudit query-time histograms | low |
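R1 and R7 both rely on the advisory lock per (tenant, property, room_type, date), which section 3 describes as keyed in canonical order. A multi-night allocation needs one lock per night, and acquiring them in the same order in every transaction is what keeps two concurrent allocators from deadlocking on each other. The sketch below shows one way to produce that order. `LockKey`, `serialize`, and `canonicalOrder` are hypothetical names, and the `|` separator assumes identifiers never contain that character. The actual lock function is the one defined in DATA_MODEL.

```typescript
// Hypothetical types and names; the register only states that the lock is
// taken per (tenant, property, room_type, date) and keyed in canonical order.
interface LockKey {
  tenantId: string;
  propertyId: string;
  roomTypeId: string;
  date: string; // ISO calendar date, e.g. "2025-03-01"
}

// Flatten a key to a single comparable string.
// Assumption: none of the identifiers contain "|".
function serialize(key: LockKey): string {
  return [key.tenantId, key.propertyId, key.roomTypeId, key.date].join("|");
}

// Return the keys sorted and de-duplicated, so that every transaction
// requesting an overlapping set of locks acquires them in the same sequence.
function canonicalOrder(keys: LockKey[]): LockKey[] {
  const sorted = keys.slice().sort((a, b) => {
    const sa = serialize(a);
    const sb = serialize(b);
    return sa < sb ? -1 : sa > sb ? 1 : 0;
  });
  return sorted.filter(
    (key, i) => i === 0 || serialize(key) !== serialize(sorted[i - 1]),
  );
}
```

Sorting on the serialized tuple puts tenant first, so the order is also stable within a tenant regardless of how the caller assembled the request.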
2. Risk treatment cadence
| Cadence | Activity |
|---|---|
| Per release | Re-evaluate scores affected by changes; new risks added by tech-lead |
| Monthly | SRE review of R5, R6, R7, R12, R13, R15, R18 |
| Quarterly | Full register review with product + security + architecture |
| After incident | Post-mortem updates the relevant row; new mitigations added |
3. Defense-in-depth summary against R1 (false overbooking)
The single most-discussed risk warrants its own layer-by-layer breakdown:
| Layer | Mechanism | What it catches |
|---|---|---|
| Domain | RoomTypeInventory.reserve() enforces available ≥ 0 | logic regressions in code |
| Application | Advisory lock on (tenant, property, room_type, date) keyed in canonical order | concurrent allocators racing |
| Database | CHECK (held + committed + ooo_blocked ≤ total + overbooking_cap) on room_type_inventory_daily | bypass via raw SQL |
| Database | EXCLUDE USING gist (room_id WITH =, stay_range WITH &&) WHERE status IN ('held','committed') on room_allocations | per-physical-room overlap |
| Application | Inbox idempotency on event_id | double processing of same event |
| Observability | inventory_overbooking_actual_total always-zero counter | post-hoc detection in <60 s |
| Test | Jepsen-style consistency check | proves the property under fault injection |
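The seven mechanisms above fall into five mutually reinforcing layers: domain, application, database, observability, and test. Any single layer can be removed and the property still holds. At least two layers must fail simultaneously for overbooking to occur, which the alert ladder, starting with the always-zero counter inventory_overbooking_actual_total, will then detect within one minute.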
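To make the domain layer concrete, here is a minimal sketch of a `reserve()` guard. It assumes that `available` is defined as `total + overbooking_cap − held − committed − ooo_blocked`, so that `available ≥ 0` is the same condition as the database CHECK in the table. That definition, the `InventoryCounts` shape, and the `OverbookingError` type are assumptions made for illustration, not the service's actual code.

```typescript
// Sketch only. Field names mirror the CHECK constraint quoted in the table;
// the class shape, the error type, and the definition of `available`
// are assumptions.
interface InventoryCounts {
  total: number;
  overbookingCap: number;
  held: number;
  committed: number;
  oooBlocked: number;
}

class OverbookingError extends Error {}

class RoomTypeInventory {
  private counts: InventoryCounts;

  constructor(counts: InventoryCounts) {
    this.counts = { ...counts }; // defensive copy
  }

  // Rooms that can still be held without violating the invariant.
  get available(): number {
    const c = this.counts;
    return c.total + c.overbookingCap - c.held - c.committed - c.oooBlocked;
  }

  // Hold `rooms` units, refusing any request that would push available below 0.
  // State is only mutated after the check passes.
  reserve(rooms: number): void {
    if (!Number.isInteger(rooms) || rooms <= 0) {
      throw new RangeError("rooms must be a positive integer");
    }
    if (this.available - rooms < 0) {
      throw new OverbookingError(
        `requested ${rooms} room(s), only ${this.available} available`,
      );
    }
    this.counts.held += rooms;
  }
}
```

Because the check runs before any mutation, a rejected request leaves the aggregate unchanged. The database CHECK then acts as the backstop for writes that never pass through this code path.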
4. Cross-references
- Concrete failure scenarios: FAILURE_MODES
- Test coverage that backs claimed mitigations: TESTING_STRATEGY
- Constraints + advisory lock function: DATA_MODEL §4–§7
- Threat model & tenant isolation: SECURITY_MODEL
- Reservation risk register: reservation-service SERVICE_RISK_REGISTER