
SERVICE_RISK_REGISTER — inventory-service

Sibling: SERVICE_READINESS · SECURITY_MODEL · FAILURE_MODES · TESTING_STRATEGY

inventory-service is the platform's correctness-critical service. Its top risk is overbooking — issuing two committed allocations for the same physical room-night, or for more rooms of a type than the property has. Every mitigation, test, and runbook is shaped around making that risk effectively zero.

Risks are scored on Likelihood (L) and Impact (I) from 1 (low) to 5 (catastrophic). Score = L × I. Anything ≥ 12 is escalated to the platform risk register.
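The scoring rule can be sketched in a few lines. The type and function names below are hypothetical and only illustrate the arithmetic; the register itself defines nothing beyond L, I, the product, and the threshold of 12.

```typescript
// Hypothetical types for illustration; only L, I, Score = L × I and the
// escalation threshold (>= 12) come from the register.
type Rating = 1 | 2 | 3 | 4 | 5;

interface Risk {
  id: string;
  likelihood: Rating;
  impact: Rating;
}

const ESCALATION_THRESHOLD = 12;

// Score = L × I
function score(risk: Risk): number {
  return risk.likelihood * risk.impact;
}

// Anything scoring >= 12 goes to the platform risk register.
function isEscalated(risk: Risk): boolean {
  return score(risk) >= ESCALATION_THRESHOLD;
}

const r1: Risk = { id: "R1", likelihood: 1, impact: 5 };
const r3: Risk = { id: "R3", likelihood: 3, impact: 4 };
console.log(r1.id, score(r1), isEscalated(r1)); // R1 5 false
console.log(r3.id, score(r3), isEscalated(r3)); // R3 12 true
```

With the scores listed in section 1, only R3 reaches the threshold.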


1. Top risks

| # | Risk | L | I | Score | Owner | Mitigations | Residual |
|---|---|---|---|---|---|---|---|
| R1 | False overbooking: committed allocations exceed total + overbooking_cap, or two committed allocations target the same physical room-night | 1 | 5 | 5 | tech-lead | Domain invariants I2, I3, I8; DB CHECK available ≥ 0; EXCLUDE USING gist (room_id WITH =, stay_range WITH &&) WHERE status IN ('held','committed'); advisory lock per (tenant, property, room_type, date); Jepsen-style consistency check at every release candidate; alert RESV-INV-001 (P0) on any non-zero inventory_overbooking_actual_total (the counter must always be 0) | very low |
| R2 | Hold leak: held allocations never release, occupying inventory that is actually free | 2 | 4 | 8 | tech-lead | Hold-expiry sweeper every 30 s; holdUntil invariant I4; alert RESV-INV-005 if sweeper lag > 60 s; manual DELETE /allocations/:id available; reservation-service emits hold_expired.v1 independently as defense in depth | low |
| R3 | OOO/OOS during active reservation with no operator follow-through: guest arrives, room is broken, no alternative arranged | 3 | 4 | 12 | product + ops | block.created.v1 triggers reaccommodation_required.v1; reservation-service runs room-change sub-saga; backoffice UI surfaces unresolved reaccommodations; SLO RESV-INV-008 watches reaccommodation publication latency | medium |
| R4 | Cross-tenant leak: query or write touches another tenant's inventory | 1 | 5 | 5 | security | Five-layer defense: JWT/X-Tenant-Id, Postgres RLS, domain TenantId invariant, advisory-lock key includes tenant_id, outbox tagging; tenant-isolation.spec.ts mandatory test | very low |
| R5 | Calendar exhaustion: no future inventory rows because horizon-extender failed | 2 | 4 | 8 | sre | Daily extender; alert RESV-INV-008 when the horizon is < 30 days; default partition catches missing partition writes; reconciliation job fixes drift | low |
| R6 | Partition rotation failure: old partitions never detached / new partitions never created; query latency degrades | 2 | 3 | 6 | sre | Daily rotator; alert RESV-INV-009; default partition keeps writes flowing; manual replay procedure documented | low |
| R7 | Advisory-lock contention storm: long transactions hold locks; allocations time out across the platform | 2 | 4 | 8 | tech-lead | Per-statement timeout 800 ms; allocation tx scoped to <50 ms; retry with backoff; LOCK_TIMEOUT 503 + saga compensation; alert RESV-INV-006 | low |
| R8 | Schema drift in produced events: a consumer breaks because a field's semantics changed | 2 | 4 | 8 | events-owner | Additive-only minor versions; vN+1 topic for breaking changes; Pact contracts for every consumer; CI fails on drift | low |
| R9 | DLQ poison message: a broken consumed event blocks the inbox subscription | 3 | 3 | 9 | events-owner | DLQs configured per subscription; alert RESV-INV-007; runbook to quarantine + replay | medium |
| R10 | Operator error: wrong room manually held, wrong block created, OOO marked on the wrong unit | 4 | 2 | 8 | product + ops | Audit log on every state change; backoffice UI confirmation flows; manual-release spike alert RESV-INV-012; nightly GM review report | medium |
| R11 | Offline desktop drift: operator works offline for too long; arbitration loss rate climbs | 3 | 2 | 6 | product | Snapshot-staleness banner > 6 h; arbitration loss alert RESV-INV-013; UI nudges reconnect; offline allocations capped (single-room, ≤14 nights, no group) | low |
| R12 | Cloud SQL primary unavailability | 2 | 4 | 8 | sre | Cloud SQL HA with automatic failover; Cloud Run reconnects; outbox + idempotency keys preserve correctness; saga compensates on in-flight rollbacks | low |
| R13 | Pub/Sub regional outage affecting outbox publication | 2 | 3 | 6 | sre | Outbox table buffers events; relay resumes when Pub/Sub recovers; secondary region failover documented | low |
| R14 | Migration regression: backwards-incompatible migration breaks allocation hot path | 2 | 5 | 10 | tech-lead | Expand → backfill → contract; CI rejects destructive single-PR migrations; canary 5%; auto-rollback on RESV-INV-001 | low |
| R15 | Ledger growth: room_type_inventory_daily and room_allocations grow unbounded | 3 | 2 | 6 | sre | Monthly partitioning; old partitions exported to GCS + detached after 24 months retention; analytical store via BigQuery | low |
| R16 | Misconfigured overbooking policy: owner enables overbooking on a property by accident | 2 | 3 | 6 | product | OCC + audit log on every policy change; alert RESV-INV-010 fires on first overbooking event; default policy enabled=false | low |
| R17 | Misuse of break-glass admin endpoints | 2 | 4 | 8 | security | Two-person IAM rule; sync audit log; break-glass usage reviewed weekly | low |
| R18 | Missing index after migration: allocation latency degrades silently | 2 | 3 | 6 | tech-lead | EXPLAIN ANALYZE in CI for hot queries; alert RESV-INV-002; pgaudit query-time histograms | low |
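The hold-expiry sweep named under R2 can be illustrated with a small in-memory sketch. Everything here is an assumption made for illustration: the Allocation shape, the epoch-millisecond representation of holdUntil, and the reading of "sweeper lag" as time since the last completed sweep. The real sweeper presumably runs against the database.

```typescript
// Hypothetical in-memory model; field names and types are assumptions.
interface Allocation {
  id: string;
  status: "held" | "committed" | "released";
  holdUntil: number | null; // epoch milliseconds; null once no hold applies
}

// Ids the sweeper should release: held allocations whose hold has lapsed.
function expiredHolds(allocations: Allocation[], nowMs: number): string[] {
  return allocations
    .filter(a => a.status === "held" && a.holdUntil !== null && a.holdUntil <= nowMs)
    .map(a => a.id);
}

// Assumed meaning of "sweeper lag > 60 s": more than 60 s since the last sweep.
const SWEEP_LAG_ALERT_MS = 60000;

function sweeperLagging(lastSweepMs: number, nowMs: number): boolean {
  return nowMs - lastSweepMs > SWEEP_LAG_ALERT_MS;
}

const allocations: Allocation[] = [
  { id: "a1", status: "held", holdUntil: 1000 },
  { id: "a2", status: "held", holdUntil: 5000 },
  { id: "a3", status: "committed", holdUntil: null },
];
console.log(expiredHolds(allocations, 2000)); // [ 'a1' ]
```

Committed allocations are never selected, so a sweep of this shape cannot release inventory that a guest already owns.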

2. Risk treatment cadence

| Cadence | Activity |
|---|---|
| Per release | Re-evaluate scores affected by changes; new risks added by tech-lead |
| Monthly | SRE review of R5, R6, R7, R12, R13, R15, R18 |
| Quarterly | Full register review with product + security + architecture |
| After incident | Post-mortem updates the relevant row; new mitigations added |

3. Defense-in-depth summary against R1 (false overbooking)

The single most-discussed risk warrants its own layer-by-layer breakdown:

| Layer | Mechanism | What it catches |
|---|---|---|
| Domain | RoomTypeInventory.reserve() enforces available ≥ 0 | logic regressions in code |
| Application | Advisory lock on (tenant, property, room_type, date) keyed in canonical order | concurrent allocators racing |
| Database | CHECK (held + committed + ooo_blocked ≤ total + overbooking_cap) on room_type_inventory_daily | bypass via raw SQL |
| Database | EXCLUDE USING gist (room_id WITH =, stay_range WITH &&) WHERE status IN ('held','committed') on room_allocations | per-physical-room overlap |
| Application | Inbox idempotency on event_id | double processing of same event |
| Observability | inventory_overbooking_actual_total always-zero counter | post-hoc detection in <60 s |
| Test | Jepsen-style consistency check | proves the property under fault injection |
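The quantity invariant from the Domain and Database rows can be sketched as a pure function. This is a minimal illustration, not the service's RoomTypeInventory implementation: the DailyInventory shape and the tryHold helper are hypothetical, and the sketch assumes that available is derived as total + overbooking_cap minus the three occupied counters, so that available ≥ 0 and the CHECK constraint express the same condition.

```typescript
// Hypothetical shape mirroring the columns named in the CHECK constraint.
interface DailyInventory {
  total: number;
  overbookingCap: number;
  held: number;
  committed: number;
  oooBlocked: number;
}

// Assumed derivation of "available"; the register does not define it.
function available(inv: DailyInventory): number {
  return inv.total + inv.overbookingCap - (inv.held + inv.committed + inv.oooBlocked);
}

// Returns a new inventory row with `rooms` more rooms held, or throws if the
// result would violate held + committed + ooo_blocked <= total + overbooking_cap.
function tryHold(inv: DailyInventory, rooms: number): DailyInventory {
  if (!Number.isInteger(rooms) || rooms <= 0) {
    throw new RangeError("rooms must be a positive integer");
  }
  const next: DailyInventory = { ...inv, held: inv.held + rooms };
  if (available(next) < 0) {
    throw new Error("capacity invariant violated");
  }
  return next;
}

const day: DailyInventory = { total: 10, overbookingCap: 0, held: 3, committed: 6, oooBlocked: 0 };
console.log(available(day));             // 1
console.log(tryHold(day, 1).held);       // 4
// tryHold(day, 2) would throw: only one room is left for that date.
```

Running the same condition in both the domain code and a database CHECK means a regression in one place is still rejected by the other.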

The seven mechanisms above span five mutually reinforcing layers (domain, application, database, observability, test); any single layer can be removed and the property still holds. Two layers must fail simultaneously for overbooking to occur, and the alert ladder will then detect it within one minute.
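The "canonical order" in the Application row matters when one allocation spans several nights: if every allocator acquires its per-date locks in the same sorted order, two allocators cannot each hold a lock the other is waiting for. A minimal sketch of that ordering follows. The LockKey shape, the helper names, and the use of ISO date strings are assumptions, and the sketch only computes the order; it does not acquire any Postgres advisory locks.

```typescript
// Hypothetical key shape matching (tenant, property, room_type, date).
interface LockKey {
  tenantId: string;
  propertyId: string;
  roomTypeId: string;
  date: string; // ISO date, e.g. "2025-03-01"; sorts correctly as a string
}

function cmp(a: string, b: string): number {
  return a < b ? -1 : a > b ? 1 : 0;
}

// Total order over keys: tenant, then property, then room type, then date.
function compareKeys(a: LockKey, b: LockKey): number {
  return (
    cmp(a.tenantId, b.tenantId) ||
    cmp(a.propertyId, b.propertyId) ||
    cmp(a.roomTypeId, b.roomTypeId) ||
    cmp(a.date, b.date)
  );
}

// One key per distinct night, sorted so every caller locks in the same order.
function lockKeysForStay(
  tenantId: string,
  propertyId: string,
  roomTypeId: string,
  nights: string[],
): LockKey[] {
  const uniqueNights = Array.from(new Set(nights));
  return uniqueNights
    .map(date => ({ tenantId, propertyId, roomTypeId, date }))
    .sort(compareKeys);
}

const keys = lockKeysForStay("t1", "p1", "rt1", ["2025-03-03", "2025-03-01", "2025-03-02"]);
console.log(keys.map(k => k.date).join(",")); // 2025-03-01,2025-03-02,2025-03-03
```

Including the tenant in the key also keeps locks from different tenants from colliding, which lines up with the advisory-lock item in the R4 mitigations.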


4. Cross-references