OBSERVABILITY — inventory-service

Sibling: APPLICATION_LOGIC · DEPLOYMENT_TOPOLOGY · FAILURE_MODES · TESTING_STRATEGY

Strategic anchors: 02 §13 Observability · 04 §10 Event observability

OpenTelemetry is initialized in main.ts before Nest's NestFactory is invoked (verified by a smoke test). Traces, metrics, and structured logs all flow through the same Cloud OTel Collector and into SigNoz / Cloud Monitoring. The single most important commitment of this service is "zero false-overbooking events, ever". Observability is what proves that commitment, and proves it again with every release.


1. Required span attributes

Every span emitted by inventory-service carries:

| Attribute | Type | Notes |
| --- | --- | --- |
| tenant_id | string | always; unknown for /internal/health |
| property_id | string | when applicable |
| room_type_id | string | when applicable |
| allocation_id | string | when applicable |
| block_id | string | when applicable |
| reservation_id | string | when applicable |
| use_case | string | e.g. place_hold_allocation |
| actor.kind | enum | staff / system / consumer / gm / owner / scheduler / pubsub |
| actor.id | string | hash for staff; pubsub-pusher@… for system |
| request_id | string | ULID; mirrors Idempotency-Key if present |
| idempotency.replayed | bool | true if this request hit the dedupe |
| db.lock.key | int | the advisory-lock hash, when acquired |
| db.lock.acquired_ms | float | how long lock acquisition took |

Sensitive attributes (room number internals, lock vendor ids) are scrubbed at the OTel processor.
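
The scrubbing step can be pictured as a deny-list filter over the attribute bag. The sketch below is only an illustration in plain TypeScript: the key prefixes in SENSITIVE_KEY_PREFIXES are hypothetical (this document does not name the real sensitive keys), and the real scrubbing runs in the OTel processor rather than in application code.

```typescript
// Attribute bag as it might look just before export.
type SpanAttributes = Record<string, string | number | boolean>;

// Hypothetical deny-list: the real sensitive keys are not named in this document.
const SENSITIVE_KEY_PREFIXES = ["room.number_internal", "lock.vendor"];

// Returns a copy of the attributes with every sensitive key removed.
function scrubAttributes(attrs: SpanAttributes): SpanAttributes {
  const clean: SpanAttributes = {};
  for (const [key, value] of Object.entries(attrs)) {
    const sensitive = SENSITIVE_KEY_PREFIXES.some((prefix) => key.startsWith(prefix));
    if (!sensitive) clean[key] = value;
  }
  return clean;
}

// Example: the vendor id is dropped, the required attributes survive.
const scrubbed = scrubAttributes({
  tenant_id: "tnt_demo",
  use_case: "place_hold_allocation",
  "actor.kind": "staff",
  "lock.vendor_id": "vendor-123",
});
console.log(Object.keys(scrubbed));
```

Returning a copy rather than mutating the input keeps the filter safe to apply to attribute objects that are shared between spans.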


2. Structured log fields

Every log line is JSON with:

{
"timestamp": "2026-05-04T08:21:11.412Z",
"level": "info",
"service": "inventory-service",
"version": "<git-sha>",
"tenantId": "tnt_…",
"propertyId": "ppt_…",
"useCase": "place_hold_allocation",
"allocationId": "inv_…",
"reservationId": "rsv_…",
"traceId": "00-…",
"requestId": "01J…",
"msg": "allocation held",
"durationMs": 47
}

Required fields on every record: service, version, tenantId, traceId, requestId. A pre-commit lint rule enforces this on every logger.* call site.
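
As a complement to the lint rule, the same requirement could be checked at runtime, for example in a test helper. This is a minimal sketch under that assumption; missingLogFields is a hypothetical name and is not the lint rule itself.

```typescript
// Fields that must be present on every record, per the list above.
const REQUIRED_LOG_FIELDS = ["service", "version", "tenantId", "traceId", "requestId"] as const;

type LogRecord = Record<string, unknown>;

// Returns the names of required fields that are missing or not a non-empty string.
function missingLogFields(record: LogRecord): string[] {
  return REQUIRED_LOG_FIELDS.filter((field) => {
    const value = record[field];
    return typeof value !== "string" || value.length === 0;
  });
}

// A complete record and an incomplete one.
const completeRecord = missingLogFields({
  service: "inventory-service",
  version: "abc123",
  tenantId: "tnt_demo",
  traceId: "00-demo",
  requestId: "01JDEMO",
  msg: "allocation held",
});
const incompleteRecord = missingLogFields({ service: "inventory-service", msg: "allocation held" });
console.log(completeRecord, incompleteRecord);
```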


3. SLIs and SLOs

| SLI | SLO target | Window | Source |
| --- | --- | --- | --- |
| Allocation latency p99 | < 200 ms | 30 d rolling | inventory_allocation_duration_ms histogram |
| Availability search latency p99 (cold) | < 300 ms | 30 d | inventory_availability_search_duration_ms{cache="miss"} |
| Availability search latency p99 (cached via search-agg) | < 50 ms | 30 d | upstream metric, cross-cited |
| False overbooking events | 0 | any 30 d | inventory_overbooking_actual_total (must always be 0; if it goes positive, P0 page) |
| Outbox lag p99 | < 5 s | 30 d | inventory_outbox_lag_seconds |
| Hold-expiry sweeper lag p99 | < 30 s | 30 d | inventory_hold_expiry_lag_seconds |
| Calendar horizon shortfall | every property ≥ 30 days | continuous | inventory_calendar_horizon_days_min per property |
| API availability (200/2xx ratio) | 99.95% | 30 d | Cloud Run + LB metrics |
| OOO reaccommodation publication latency p95 | < 2 s after block created | 30 d | inventory_reaccommodation_pub_latency_ms |
| Sync snapshot pull p95 | < 800 ms | 30 d | inventory_snapshot_pull_latency_ms |
| Offline-arbitration loss rate | < 1% per tenant per 24 h | rolling | inventory_offline_arbitration_lost_total / …pushed_total |

The inventory_overbooking_actual_total counter increments only when a write detects held + committed > total + overbooking_cap, that is, a violation of the DB invariant held + committed ≤ total + overbooking_cap. The DB CHECK constraint that enforces this invariant should make such a write impossible, so any non-zero value indicates a P0 incident.
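
The condition behind the counter can be written as a small predicate. The sketch below assumes a per-day inventory record with camelCase field names (InventoryDay is hypothetical; the real schema is not shown here) and uses a local variable to stand in for the metric.

```typescript
// Hypothetical shape of one room-type inventory day.
interface InventoryDay {
  held: number;
  committed: number;
  total: number;
  overbookingCap: number;
}

// True when the invariant held + committed <= total + overbooking_cap is violated.
function isOverbooked(day: InventoryDay): boolean {
  return day.held + day.committed > day.total + day.overbookingCap;
}

// Local stand-in for inventory_overbooking_actual_total.
let overbookingActualTotal = 0;

// Called at write time; increments only on an actual violation.
function recordWrite(day: InventoryDay): void {
  if (isOverbooked(day)) overbookingActualTotal += 1;
}

// Exactly at the cap is still valid; one unit over is a violation.
recordWrite({ held: 2, committed: 10, total: 10, overbookingCap: 2 });
recordWrite({ held: 3, committed: 10, total: 10, overbookingCap: 2 });
console.log(overbookingActualTotal);
```

The comparison is strict, so selling up to the overbooking cap is allowed policy and only a sale beyond it counts as a false overbooking.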


4. RED & USE metrics

4.1 RED (per use case)

For each use case (search_availability, place_hold_allocation, commit_allocation, release_allocation, create_block, release_block, group_hold, walk_in_allocate, update_overbooking_policy):

inventory_<use_case>_requests_total{tenant, outcome}
inventory_<use_case>_duration_ms{tenant, outcome} (histogram)
inventory_<use_case>_errors_total{tenant, code}

outcome ∈ {success, business_rejection, system_error}. business_rejection covers INSUFFICIENT_AVAILABILITY, STOP_SELL_ACTIVE, OVERBOOKING_CAP_EXCEEDED, LOCK_TIMEOUT, STALE_VERSION, ROOM_NOT_IN_TYPE.
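
A minimal sketch of how an error code might be mapped to an outcome label and how the three metric names derive from a use case. It assumes that any error code outside the business_rejection list counts as system_error, which the text implies but does not state; classifyOutcome, redMetricNames, and the DB_UNAVAILABLE code are hypothetical.

```typescript
type Outcome = "success" | "business_rejection" | "system_error";

// Codes listed above as business rejections.
const BUSINESS_REJECTION_CODES = new Set<string>([
  "INSUFFICIENT_AVAILABILITY",
  "STOP_SELL_ACTIVE",
  "OVERBOOKING_CAP_EXCEEDED",
  "LOCK_TIMEOUT",
  "STALE_VERSION",
  "ROOM_NOT_IN_TYPE",
]);

// An undefined code means the use case completed without an error.
// Assumption: every code not in the set above is a system error.
function classifyOutcome(errorCode?: string): Outcome {
  if (errorCode === undefined) return "success";
  return BUSINESS_REJECTION_CODES.has(errorCode) ? "business_rejection" : "system_error";
}

// Expands the inventory_<use_case>_* naming pattern for one use case.
function redMetricNames(useCase: string): { requests: string; duration: string; errors: string } {
  return {
    requests: `inventory_${useCase}_requests_total`,
    duration: `inventory_${useCase}_duration_ms`,
    errors: `inventory_${useCase}_errors_total`,
  };
}

console.log(classifyOutcome("LOCK_TIMEOUT"), redMetricNames("commit_allocation"));
```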

4.2 USE (per resource)

| Resource | Utilization | Saturation | Errors |
| --- | --- | --- | --- |
| Cloud Run instance CPU | cpu_utilization | pending_requests | crash_count |
| Drizzle / pg pool | pool_in_use | pool_waiting | pool_acquire_timeout_total |
| Pub/Sub publisher | publish_rate | publish_queue_depth | publish_error_total |
| Pub/Sub inbox subscription | messages_processed_rate | subscription_backlog | nack_total, dlq_arrivals_total |
| Postgres advisory lock pool | held_locks_count | lock_wait_p99_ms | lock_timeout_total |

4.3 Counters worth highlighting

inventory_allocation_lock_acquisitions_total{outcome}
inventory_allocation_lock_wait_ms (histogram)
inventory_allocation_committed_total{tenant, property, room_type}
inventory_allocation_released_total{tenant, reason_code}
inventory_block_created_total{tenant, reason}
inventory_overbooking_alert_total{tenant, property}
inventory_calendar_extension_rows_inserted_total
inventory_partition_rotation_total{outcome}
inventory_offline_allocations_pushed_total{outcome}

5. Dashboards

Three core dashboards in Cloud Monitoring + SigNoz:

5.1 inventory-service: service health

  • p50/p95/p99 latency per endpoint and per use case.
  • Error rate per code.
  • Cloud Run instance count, CPU, memory.
  • Pool saturation panels (DB, Pub/Sub).
  • Outbox lag panel with the 5 s SLO line.
  • Alert ladder status panel (RESV-INV-001..014).

5.2 inventory-service: allocation flow

  • Allocations placed / committed / released per minute, stacked by tenant.
  • Lock acquisition wait p99 per property.
  • Reaccommodation publish latency.
  • Hold-expiry sweeper backlog and runs.
  • Group-hold size distribution.
  • Walk-in vs saga-driven mix.

5.3 inventory-service: integrity

  • inventory_overbooking_actual_total — must be flat at 0 forever. Alert RESV-INV-001 fires on any increase.
  • Calendar horizon panel: per-property days-of-runway with red threshold at 30 days.
  • Reconciliation drift panel: nightly job's diff between availability_calendars.summary and SUM of room_type_inventory_daily.
  • DLQ arrivals panel; should be near zero.

6. Alert ladder

Each alert maps to a runbook under runbooks/inventory/.

| Alert | Condition | Severity | Runbook |
| --- | --- | --- | --- |
| RESV-INV-001 | inventory_overbooking_actual_total increases | P0 (page on-call + tech lead) | runbooks/inventory/false-overbooking.md |
| RESV-INV-002 | Allocation latency p99 > 200 ms for 10 min | P1 | runbooks/inventory/allocation-slow.md |
| RESV-INV-003 | Availability search p99 > 300 ms (cold) for 15 min | P2 | runbooks/inventory/search-slow.md |
| RESV-INV-004 | Outbox lag > 30 s p99 for 5 min | P1 | runbooks/inventory/outbox-lag.md |
| RESV-INV-005 | Hold-expiry sweeper lag > 60 s | P1 | runbooks/inventory/hold-expiry-stalled.md |
| RESV-INV-006 | Lock timeout rate > 2% over 30 min | P2 | runbooks/inventory/lock-contention.md |
| RESV-INV-007 | DLQ arrivals > 0 for any inbound subject | P1 | runbooks/inventory/dlq-triage.md |
| RESV-INV-008 | Calendar horizon < 30 days for any active property | P2 | runbooks/inventory/calendar-horizon-short.md |
| RESV-INV-009 | Partition rotation job failed | P1 | runbooks/inventory/partition-rotation.md |
| RESV-INV-010 | inventory_overbooking_alert_total > 0 for any tenant in 1 min | P2 (notify owner) | runbooks/inventory/overbooking-policy-fired.md |
| RESV-INV-011 | Reconciliation drift > 0 in any property-day | P2 | runbooks/inventory/reconcile-drift.md |
| RESV-INV-012 | Manual DELETE /allocations rate > 5/h per actor | P3 (notify gm) | runbooks/inventory/manual-release-spike.md |
| RESV-INV-013 | Offline-arbitration loss > 1% per tenant in 24 h | P2 | runbooks/inventory/offline-arbitration-loss.md |
| RESV-INV-014 | Pub/Sub backlog > 10k messages on any subscription | P1 | runbooks/inventory/pubsub-backlog.md |

Every page includes the runbook URL in the alert body.
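
One way to guarantee the runbook link is to build the alert body from a typed ladder entry, so a page cannot be rendered without it. The sketch below is an assumption about how that could look: AlertRule, formatPageBody, and RUNBOOK_BASE_URL are hypothetical, and the base URL is a placeholder.

```typescript
// One row of the alert ladder.
interface AlertRule {
  id: string;
  condition: string;
  severity: "P0" | "P1" | "P2" | "P3";
  runbook: string; // path relative to the repository root
}

// Hypothetical base URL; the real runbook host is not given in this document.
const RUNBOOK_BASE_URL = "https://docs.example.invalid/";

// Renders the page body; the runbook URL is always the last line.
function formatPageBody(rule: AlertRule): string {
  return [
    `[${rule.severity}] ${rule.id}: ${rule.condition}`,
    `Runbook: ${RUNBOOK_BASE_URL}${rule.runbook}`,
  ].join("\n");
}

const pageBody = formatPageBody({
  id: "RESV-INV-001",
  condition: "inventory_overbooking_actual_total increases",
  severity: "P0",
  runbook: "runbooks/inventory/false-overbooking.md",
});
console.log(pageBody);
```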


7. Synthetic checks

| Probe | Frequency | Asserts |
| --- | --- | --- |
| POST /availability/search (canary tenant, canary property, fixed dates) | 60 s | 200 OK; latency < 300 ms |
| POST /allocations walk-in then DELETE /allocations/:id (canary) | 5 min | end-to-end allocation + release lifecycle |
| GET /internal/health, /internal/ready | 30 s | 200 OK |
| Pub/Sub end-to-end: publish a synthetic reservation.held.v1 for canary tenant; assert inventory.allocation.confirmed.v1 arrives within 3 s | 5 min | saga inbound→outbound integrity |

Synthetic failures escalate via the same alert ladder.
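
The availability-search probe can be sketched as a timed request plus a pure pass/fail check. In this illustration the HTTP call is injected as a function so the logic runs without a network; PostFn, the canary identifiers, and the request body are hypothetical, since the real payload is not shown in this document.

```typescript
interface ProbeResult {
  ok: boolean;
  status: number;
  latencyMs: number;
}

// Pure assertion from the probe table: 200 OK and latency < 300 ms.
function evaluateSearchProbe(status: number, latencyMs: number): ProbeResult {
  return { ok: status === 200 && latencyMs < 300, status, latencyMs };
}

// Injected transport; a real probe would wrap an HTTP client here.
type PostFn = (path: string, body: unknown) => Promise<{ status: number }>;

// Times one POST /availability/search call and evaluates the result.
async function probeAvailabilitySearch(
  post: PostFn,
  now: () => number = Date.now,
): Promise<ProbeResult> {
  const startedAt = now();
  // Hypothetical request body for the canary tenant and property.
  const res = await post("/availability/search", {
    tenantId: "tnt_canary",
    propertyId: "ppt_canary",
  });
  return evaluateSearchProbe(res.status, now() - startedAt);
}

// Dry run with a fake transport and a fake clock that advances 40 ms per call.
let fakeTime = 0;
probeAvailabilitySearch(async () => ({ status: 200 }), () => (fakeTime += 40))
  .then((result) => console.log(result));
```

Keeping the pass/fail rule in a pure function means the threshold can be unit-tested separately from the transport and the scheduler.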


8. Distributed tracing examples

A booking saga trace (parent: reservation-service):

reservation.holdReservation() 120ms
├─ reservation-service.HoldUseCase 90ms
└─ pubsub.publish reservation.held.v1 25ms
   └─ inventory-service.PlaceHoldAllocationUseCase 145ms
      ├─ db.advisory_lock acquire (3 nights) 18ms
      ├─ db.RoomTypeInventory.loadForUpdate 9ms
      ├─ domain.RoomTypeInventory.reserve <1ms
      ├─ domain.RoomPicker.choose 2ms
      ├─ db.RoomAllocation.insert 7ms
      ├─ db.outbox.insert (allocation.confirmed) 4ms
      └─ db.commit (lock release) 5ms

Every span carries tenant_id and causation_event_id. Failures attach error.code and error.message.


9. Cross-references