OBSERVABILITY — inventory-service

Sibling: APPLICATION_LOGIC · DEPLOYMENT_TOPOLOGY · FAILURE_MODES · TESTING_STRATEGY

Strategic anchors: 02 §13 Observability · 04 §10 Event observability

OpenTelemetry is initialized in main.ts before Nest's NestFactory is called (verified by smoke test). Traces, metrics, and structured logs all flow through the same Cloud OTel Collector and into SigNoz / Cloud Monitoring. The single most important commitment of this service is "zero false-overbooking events, ever"; observability is what proves it, and proves it again on every release.

1. Required span attributes

Every span emitted by inventory-service carries:

Attribute	Type	Notes
`tenant_id`	string	always; `unknown` for `/internal/health`
`property_id`	string	when applicable
`room_type_id`	string	when applicable
`allocation_id`	string	when applicable
`block_id`	string	when applicable
`reservation_id`	string	when applicable
`use_case`	string	e.g. `place_hold_allocation`
`actor.kind`	enum	`staff` / `system` / `consumer` / `gm` / `owner` / `scheduler` / `pubsub`
`actor.id`	string	hash for staff; `pubsub-pusher@…` for system
`request_id`	string	ULID; mirrors `Idempotency-Key` if present
`idempotency.replayed`	bool	true if this request was a replay caught by idempotency deduplication
`db.lock.key`	int	the advisory-lock hash, when acquired
`db.lock.acquired_ms`	float	how long lock acquisition took

Sensitive attributes (room number internals, lock vendor ids) are scrubbed at the OTel processor.
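As a rough illustration of the attribute contract above, the sketch below builds the always-present attributes from a request context and drops sensitive keys. It is not the service's code: the `RequestContext` shape, the function names, and the two sensitive key names are assumptions made for the example, and in the real pipeline scrubbing happens at the OTel processor rather than in application code.

```typescript
// Hypothetical shapes; only the attribute names come from the table above.
type ActorKind = "staff" | "system" | "consumer" | "gm" | "owner" | "scheduler" | "pubsub";
type AttrValue = string | number | boolean;
type Attrs = Record<string, AttrValue>;

interface RequestContext {
  tenantId?: string; // absent for /internal/health
  useCase: string;
  actorKind: ActorKind;
  actorId: string;
  requestId: string;
  idempotencyReplayed: boolean;
  propertyId?: string;
  allocationId?: string;
}

// Assumed key names for sensitive attributes; the source does not name them.
const SENSITIVE_KEYS = ["room.number_internal", "lock.vendor_id"];

// Builds the attributes every span carries, plus the optional ones when present.
function buildSpanAttributes(ctx: RequestContext): Attrs {
  const attrs: Attrs = {
    tenant_id: ctx.tenantId ?? "unknown",
    use_case: ctx.useCase,
    "actor.kind": ctx.actorKind,
    "actor.id": ctx.actorId,
    request_id: ctx.requestId,
    "idempotency.replayed": ctx.idempotencyReplayed,
  };
  if (ctx.propertyId !== undefined) attrs.property_id = ctx.propertyId;
  if (ctx.allocationId !== undefined) attrs.allocation_id = ctx.allocationId;
  return attrs;
}

// Returns a copy of the attributes without any sensitive keys.
function scrubSensitive(attrs: Attrs): Attrs {
  const clean: Attrs = {};
  for (const key of Object.keys(attrs)) {
    if (SENSITIVE_KEYS.indexOf(key) === -1) clean[key] = attrs[key];
  }
  return clean;
}
```

Keeping the "unknown" fallback in one builder means a health-check span still satisfies the "always" rule for `tenant_id`.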

2. Structured log fields

Every log line is JSON with:

{
  "timestamp": "2026-05-04T08:21:11.412Z",
  "level": "info",
  "service": "inventory-service",
  "version": "<git-sha>",
  "tenantId": "tnt_…",
  "propertyId": "ppt_…",
  "useCase": "place_hold_allocation",
  "allocationId": "inv_…",
  "reservationId": "rsv_…",
  "traceId": "00-…",
  "requestId": "01J…",
  "msg": "allocation held",
  "durationMs": 47
}

Required fields on every record: service, version, tenantId, traceId, requestId. A pre-commit lint rule enforces this on every logger.* call site.
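The required-field rule can also be checked at runtime, for example in a unit test that captures logger output. The following is a minimal sketch under that assumption; `missingLogFields` is a hypothetical helper and is not the pre-commit lint rule itself.

```typescript
// The five fields the document requires on every log record.
const REQUIRED_LOG_FIELDS = ["service", "version", "tenantId", "traceId", "requestId"] as const;

// Returns the names of required fields that are absent or not non-empty strings.
function missingLogFields(record: Record<string, unknown>): string[] {
  return REQUIRED_LOG_FIELDS.filter((field) => {
    const value = record[field];
    return typeof value !== "string" || value.length === 0;
  });
}

// Usage: parse a captured log line and assert nothing is missing.
const exampleLine = JSON.stringify({
  service: "inventory-service",
  version: "<git-sha>",
  tenantId: "tnt_example",
  traceId: "00-example",
  requestId: "01J-example",
  msg: "allocation held",
});
const exampleMissing = missingLogFields(JSON.parse(exampleLine));
```

With the example record above, `exampleMissing` is an empty array because all five required fields are present.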

3. SLIs and SLOs

SLI	SLO target	Window	Source
Allocation latency p99	< 200 ms	30 d rolling	`inventory_allocation_duration_ms` histogram
Availability search latency p99 (cold)	< 300 ms	30 d	`inventory_availability_search_duration_ms{cache="miss"}`
Availability search latency p99 (cached via search-agg)	< 50 ms	30 d	upstream metric, cross-cited
False overbooking events	0	any 30 d	`inventory_overbooking_actual_total` (must always be 0; if it goes positive, P0 page)
Outbox lag p99	< 5 s	30 d	`inventory_outbox_lag_seconds`
Hold-expiry sweeper lag p99	< 30 s	30 d	`inventory_hold_expiry_lag_seconds`
Calendar horizon shortfall	every property ≥ 30 days	continuous	`inventory_calendar_horizon_days_min` per property
API availability (2xx response ratio)	99.95%	30 d	Cloud Run + LB metrics
OOO reaccommodation publication latency p95	< 2 s after block created	30 d	`inventory_reaccommodation_pub_latency_ms`
Sync snapshot pull p95	< 800 ms	30 d	`inventory_snapshot_pull_latency_ms`
Offline-arbitration loss rate	< 1% per tenant per 24 h	rolling	`inventory_offline_arbitration_lost_total` / `…pushed_total`

The inventory_overbooking_actual_total counter increments only when a write detects held + committed > total + overbooking_cap, which is a state the DB CHECK constraint should already make impossible. Any non-zero value indicates a P0 incident.
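The inequality behind the counter can be written out as a small predicate. This is a sketch only: the `InventoryDay` field names and the in-memory `OverbookingCounter` are illustrative stand-ins, not the service's schema or its metrics client.

```typescript
// Field names are illustrative; the source only states the inequality.
interface InventoryDay {
  total: number;
  held: number;
  committed: number;
  overbookingCap: number;
}

// True when held + committed exceeds total + overbooking_cap,
// i.e. the state the DB CHECK constraint is meant to rule out.
function violatesOverbookingInvariant(day: InventoryDay): boolean {
  return day.held + day.committed > day.total + day.overbookingCap;
}

// Hypothetical in-memory stand-in for inventory_overbooking_actual_total.
class OverbookingCounter {
  private value = 0;

  // Called at write time; increments only on an actual violation.
  observeWrite(day: InventoryDay): void {
    if (violatesOverbookingInvariant(day)) this.value += 1;
  }

  read(): number {
    return this.value;
  }
}
```

Note that a day sitting exactly at the cap (held + committed equal to total + overbooking_cap) is not a violation; only exceeding it counts.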

4. RED & USE metrics

4.1 RED (per use case)

For each use case (search_availability, place_hold_allocation, commit_allocation, release_allocation, create_block, release_block, group_hold, walk_in_allocate, update_overbooking_policy):

inventory_<use_case>_requests_total{tenant, outcome}
inventory_<use_case>_duration_ms{tenant, outcome} (histogram)
inventory_<use_case>_errors_total{tenant, code}

outcome ∈ {success, business_rejection, system_error}. business_rejection covers INSUFFICIENT_AVAILABILITY, STOP_SELL_ACTIVE, OVERBOOKING_CAP_EXCEEDED, LOCK_TIMEOUT, STALE_VERSION, ROOM_NOT_IN_TYPE.
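A minimal sketch of how the naming pattern and the outcome label could be derived. The function names are hypothetical, and treating every error code outside the listed business rejections as `system_error` is an assumption; the source does not enumerate system error codes.

```typescript
type Outcome = "success" | "business_rejection" | "system_error";

// Codes the document lists under business_rejection.
const BUSINESS_REJECTION_CODES: string[] = [
  "INSUFFICIENT_AVAILABILITY",
  "STOP_SELL_ACTIVE",
  "OVERBOOKING_CAP_EXCEEDED",
  "LOCK_TIMEOUT",
  "STALE_VERSION",
  "ROOM_NOT_IN_TYPE",
];

// No error code means success; any unlisted code is assumed to be a system error.
function classifyOutcome(errorCode?: string): Outcome {
  if (errorCode === undefined) return "success";
  return BUSINESS_REJECTION_CODES.indexOf(errorCode) !== -1 ? "business_rejection" : "system_error";
}

// Expands the inventory_<use_case>_* pattern for one use case.
function redMetricNames(useCase: string): { requests: string; duration: string; errors: string } {
  return {
    requests: `inventory_${useCase}_requests_total`,
    duration: `inventory_${useCase}_duration_ms`,
    errors: `inventory_${useCase}_errors_total`,
  };
}
```

For example, `redMetricNames("create_block").requests` yields `inventory_create_block_requests_total`, and `classifyOutcome("STALE_VERSION")` yields `business_rejection`.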

4.2 USE (per resource)

Resource	Utilization	Saturation	Errors
Cloud Run instance CPU	`cpu_utilization`	`pending_requests`	`crash_count`
Drizzle / pg pool	`pool_in_use`	`pool_waiting`	`pool_acquire_timeout_total`
Pub/Sub publisher	`publish_rate`	`publish_queue_depth`	`publish_error_total`
Pub/Sub inbox subscription	`messages_processed_rate`	`subscription_backlog`	`nack_total`, `dlq_arrivals_total`
Postgres advisory lock pool	`held_locks_count`	`lock_wait_p99_ms`	`lock_timeout_total`

4.3 Counters worth highlighting

inventory_allocation_lock_acquisitions_total{outcome}
inventory_allocation_lock_wait_ms (histogram)
inventory_allocation_committed_total{tenant, property, room_type}
inventory_allocation_released_total{tenant, reason_code}
inventory_block_created_total{tenant, reason}
inventory_overbooking_alert_total{tenant, property}
inventory_calendar_extension_rows_inserted_total
inventory_partition_rotation_total{outcome}
inventory_offline_allocations_pushed_total{outcome}

5. Dashboards

Three core dashboards in Cloud Monitoring + SigNoz:

5.1 `inventory-service: service health`

p50/p95/p99 latency per endpoint and per use case.
Error rate per code.
Cloud Run instance count, CPU, memory.
Pool saturation panels (DB, Pub/Sub).
Outbox lag panel with the 5 s SLO line.
Alert ladder status panel (RESV-INV-001..014).

5.2 `inventory-service: allocation flow`

Allocations placed / committed / released per minute, stacked by tenant.
Lock acquisition wait p99 per property.
Reaccommodation publish latency.
Hold-expiry sweeper backlog and runs.
Group-hold size distribution.
Walk-in vs saga-driven mix.

5.3 `inventory-service: integrity`

inventory_overbooking_actual_total — must be flat at 0 forever. Alert RESV-INV-001 fires on any increase.
Calendar horizon panel: per-property days-of-runway with red threshold at 30 days.
Reconciliation drift panel: nightly job's diff between availability_calendars.summary and SUM of room_type_inventory_daily.
DLQ arrivals panel; should stay at zero, since any arrival fires RESV-INV-007.
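The reconciliation drift panel compares a stored summary against a sum over daily rows. The sketch below shows that comparison in memory, assuming both sides reduce to a single `available` number per property-day; the row shapes and that field name are hypothetical, since the source names only the two tables.

```typescript
// Hypothetical row shapes for the two sides of the comparison.
interface DailyRow { propertyId: string; date: string; roomTypeId: string; available: number; }
interface SummaryRow { propertyId: string; date: string; available: number; }

// Returns non-zero drift (summary minus summed daily rows) keyed by "propertyId|date".
function reconciliationDrift(summaries: SummaryRow[], daily: DailyRow[]): Map<string, number> {
  const summed = new Map<string, number>();
  for (const row of daily) {
    const key = `${row.propertyId}|${row.date}`;
    summed.set(key, (summed.get(key) ?? 0) + row.available);
  }

  const drift = new Map<string, number>();
  const seen = new Set<string>();
  for (const s of summaries) {
    const key = `${s.propertyId}|${s.date}`;
    seen.add(key);
    const diff = s.available - (summed.get(key) ?? 0);
    if (diff !== 0) drift.set(key, diff);
  }

  // Property-days that have daily rows but no summary row also count as drift.
  summed.forEach((sum, key) => {
    if (!seen.has(key) && sum !== 0) drift.set(key, -sum);
  });
  return drift;
}
```

An empty map corresponds to the healthy state; any entry is a property-day where the two sources disagree.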

6. Alert ladder

Each alert maps to a runbook under runbooks/inventory/.

Alert	Condition	Severity	Runbook
RESV-INV-001	`inventory_overbooking_actual_total` increases	P0 (page on-call + tech lead)	`runbooks/inventory/false-overbooking.md`
RESV-INV-002	Allocation latency p99 > 200 ms for 10 min	P1	`runbooks/inventory/allocation-slow.md`
RESV-INV-003	Availability search p99 > 300 ms (cold) for 15 min	P2	`runbooks/inventory/search-slow.md`
RESV-INV-004	Outbox lag p99 > 30 s for 5 min	P1	`runbooks/inventory/outbox-lag.md`
RESV-INV-005	Hold-expiry sweeper lag > 60 s	P1	`runbooks/inventory/hold-expiry-stalled.md`
RESV-INV-006	Lock timeout rate > 2% over 30 min	P2	`runbooks/inventory/lock-contention.md`
RESV-INV-007	DLQ arrivals > 0 for any inbound subject	P1	`runbooks/inventory/dlq-triage.md`
RESV-INV-008	Calendar horizon < 30 days for any active property	P2	`runbooks/inventory/calendar-horizon-short.md`
RESV-INV-009	Partition rotation job failed	P1	`runbooks/inventory/partition-rotation.md`
RESV-INV-010	`inventory_overbooking_alert_total` > 0 for any tenant in 1 min	P2 (notify owner)	`runbooks/inventory/overbooking-policy-fired.md`
RESV-INV-011	Reconciliation drift > 0 in any property-day	P2	`runbooks/inventory/reconcile-drift.md`
RESV-INV-012	Manual `DELETE /allocations` rate > 5/h per actor	P3 (notify gm)	`runbooks/inventory/manual-release-spike.md`
RESV-INV-013	Offline-arbitration loss > 1% per tenant in 24 h	P2	`runbooks/inventory/offline-arbitration-loss.md`
RESV-INV-014	Pub/Sub backlog > 10k messages on any subscription	P1	`runbooks/inventory/pubsub-backlog.md`

Every page includes the runbook URL in the alert body.
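Several conditions in the ladder have the form "metric above threshold for N minutes". The sketch below illustrates one reading of that semantics on in-memory samples: the alert fires only if the window is fully covered by data and every sample in it breaches. It is an illustration, not how Cloud Monitoring or SigNoz evaluate rules, and the `Sample` shape and function name are assumptions.

```typescript
// One evaluated data point, e.g. a per-minute p99 value.
interface Sample {
  atMs: number;  // timestamp in epoch milliseconds
  value: number; // metric value in its own unit (e.g. ms)
}

// True when samples cover the whole trailing window and all of them exceed the threshold.
function breachedFor(samples: Sample[], threshold: number, windowMs: number, nowMs: number): boolean {
  const windowStart = nowMs - windowMs;
  // Coverage: at least one sample at or before the window start.
  const covered = samples.some((s) => s.atMs <= windowStart);
  const inWindow = samples.filter((s) => s.atMs >= windowStart && s.atMs <= nowMs);
  if (!covered || inWindow.length === 0) return false;
  return inWindow.every((s) => s.value > threshold);
}

// Usage, mirroring RESV-INV-002: allocation p99 > 200 ms for 10 min.
const TEN_MINUTES_MS = 10 * 60 * 1000;
const shouldPage = (samples: Sample[], nowMs: number): boolean =>
  breachedFor(samples, 200, TEN_MINUTES_MS, nowMs);
```

Requiring coverage avoids firing on a short burst of samples right after a deploy, when fewer than ten minutes of data exist.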

7. Synthetic checks

Probe	Frequency	Asserts
`POST /availability/search` (canary tenant, canary property, fixed dates)	60 s	200 OK; latency < 300 ms
`POST /allocations` walk-in then `DELETE /allocations/:id` (canary)	5 min	end-to-end allocation + release lifecycle
`GET /internal/health`, `/internal/ready`	30 s	200 OK
Pub/Sub end-to-end: publish a synthetic `reservation.held.v1` for canary tenant; assert `inventory.allocation.confirmed.v1` arrives within 3 s	5 min	saga inbound→outbound integrity

Synthetic failures escalate via the same alert ladder.
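The availability-search probe could be structured roughly as follows. This is a sketch with injected dependencies so it can run without a network: the `HttpPost` type, the request body fields, and the function name are assumptions, and only the path, the 200 OK check, and the 300 ms latency bound come from the table.

```typescript
// Minimal HTTP abstraction so the probe can be exercised with a fake.
type HttpPost = (url: string, body: string) => Promise<{ status: number }>;

interface ProbeResult {
  ok: boolean;
  status: number;
  latencyMs: number;
}

// Posts a fixed canary search and checks status and latency.
async function probeAvailabilitySearch(
  post: HttpPost,
  baseUrl: string,
  now: () => number = Date.now,
): Promise<ProbeResult> {
  // Hypothetical payload; the real request schema is not given in this document.
  const body = JSON.stringify({
    tenantId: "canary-tenant",
    propertyId: "canary-property",
    checkIn: "2026-06-01",
    checkOut: "2026-06-04",
  });
  const started = now();
  const res = await post(`${baseUrl}/availability/search`, body);
  const latencyMs = now() - started;
  return { ok: res.status === 200 && latencyMs < 300, status: res.status, latencyMs };
}
```

A scheduler would call this every 60 s and raise the corresponding alert when `ok` is false.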

8. Distributed tracing examples

A booking saga trace (parent: reservation-service):

reservation.holdReservation()                  120ms
  ├─ reservation-service.HoldUseCase           90ms
  └─ pubsub.publish reservation.held.v1        25ms
       └─ inventory-service.PlaceHoldAllocationUseCase   145ms
             ├─ db.advisory_lock acquire (3 nights)      18ms
             ├─ db.RoomTypeInventory.loadForUpdate       9ms
             ├─ domain.RoomTypeInventory.reserve         <1ms
             ├─ domain.RoomPicker.choose                 2ms
             ├─ db.RoomAllocation.insert                 7ms
             ├─ db.outbox.insert (allocation.confirmed)  4ms
             └─ db.commit (lock release)                 5ms

Every span carries tenant_id and causation_event_id. Failures attach error.code and error.message.
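The attribute rule for saga traces lends itself to a simple tree check, for example in a test that inspects exported spans. The sketch below assumes a simplified `SpanNode` shape with a `failed` flag; it is not an OpenTelemetry SDK type, and the helper name is hypothetical.

```typescript
// Simplified span tree for illustration; not an OpenTelemetry SDK type.
interface SpanNode {
  name: string;
  attributes: Record<string, string | number | boolean>;
  failed?: boolean;
  children?: SpanNode[];
}

// Walks the trace and reports every span that lacks a required attribute.
function spansMissingAttributes(root: SpanNode): string[] {
  const problems: string[] = [];
  const visit = (span: SpanNode): void => {
    const required = ["tenant_id", "causation_event_id"];
    // Failed spans must additionally carry the error details.
    if (span.failed) required.push("error.code", "error.message");
    for (const key of required) {
      if (!(key in span.attributes)) problems.push(`${span.name}: missing ${key}`);
    }
    for (const child of span.children ?? []) visit(child);
  };
  visit(root);
  return problems;
}
```

An empty result means the whole trace satisfies the rule; each entry otherwise names the offending span and the missing key.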

9. Cross-references

Use cases that emit each metric: APPLICATION_LOGIC §2
Failure modes that drive alert ladder: FAILURE_MODES
Test coverage that backs SLO claims: TESTING_STRATEGY §4 + §7
Reservation observability: reservation-service OBSERVABILITY
