OBSERVABILITY — inventory-service
Sibling: APPLICATION_LOGIC · DEPLOYMENT_TOPOLOGY · FAILURE_MODES · TESTING_STRATEGY
Strategic anchors: 02 §13 Observability · 04 §10 Event observability
OpenTelemetry is initialized before Nest's NestFactory in main.ts (verified by smoke test). Traces, metrics, and structured logs all flow through the same Cloud OTel Collector and into SigNoz / Cloud Monitoring. The single most important commitment of this service is "zero false-overbooking events, ever" — observability is what proves it (and proves it again every release).
1. Required span attributes
Every span emitted by inventory-service carries:
| Attribute | Type | Notes |
|---|---|---|
| tenant_id | string | always; unknown for /internal/health |
| property_id | string | when applicable |
| room_type_id | string | when applicable |
| allocation_id | string | when applicable |
| block_id | string | when applicable |
| reservation_id | string | when applicable |
| use_case | string | e.g. place_hold_allocation |
| actor.kind | enum | staff / system / consumer / gm / owner / scheduler / pubsub |
| actor.id | string | hash for staff; pubsub-pusher@… for system |
| request_id | string | ULID; mirrors Idempotency-Key if present |
| idempotency.replayed | bool | true if this request hit the dedupe |
| db.lock.key | int | the advisory-lock hash, when acquired |
| db.lock.acquired_ms | float | how long lock acquisition took |
Sensitive attributes (room number internals, lock vendor ids) are scrubbed at the OTel processor.
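The request_id rule and the scrubbing step can be sketched as two pure functions. This is an illustration only: the deny-list prefixes (room.number_internal, lock.vendor_id), the function names, and the sample values are assumptions, since the document does not name the real sensitive keys or say where in the OTel pipeline the processor runs.

```typescript
// Span attribute values in this sketch are limited to primitives.
type AttrValue = string | number | boolean;
type SpanAttrs = { [key: string]: AttrValue };

// Hypothetical deny-list. The document does not name the real keys used for
// room number internals or lock vendor ids.
const SENSITIVE_KEY_PREFIXES: string[] = ["room.number_internal", "lock.vendor_id"];

// request_id mirrors the Idempotency-Key header when one is present;
// otherwise a fresh id comes from the caller-supplied generator
// (a ULID generator in the real service; any () => string works here).
function resolveRequestId(
  idempotencyKey: string | undefined,
  generateId: () => string,
): string {
  if (idempotencyKey !== undefined && idempotencyKey.length > 0) {
    return idempotencyKey;
  }
  return generateId();
}

// Returns a copy of attrs without any key that starts with a sensitive prefix.
// The input object is left unmodified.
function scrubSpanAttributes(attrs: SpanAttrs): SpanAttrs {
  const cleaned: SpanAttrs = {};
  for (const key of Object.keys(attrs)) {
    const sensitive = SENSITIVE_KEY_PREFIXES.some((prefix) => key.indexOf(prefix) === 0);
    const value = attrs[key];
    if (!sensitive && value !== undefined) {
      cleaned[key] = value;
    }
  }
  return cleaned;
}

// Placeholder values for illustration only.
const exampleAttrs: SpanAttrs = {
  tenant_id: "tnt_example",
  use_case: "place_hold_allocation",
  "actor.kind": "staff",
  request_id: resolveRequestId("01JEXAMPLEKEY", () => "generated-id"),
  "idempotency.replayed": false,
  "lock.vendor_id": "vendor-123",
};
const scrubbedAttrs = scrubSpanAttributes(exampleAttrs);
```

Matching on key prefixes keeps the deny-list short when a vendor integration adds several related attributes; an exact-match set would be the stricter alternative.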
2. Structured log fields
Every log line is JSON with:
{
"timestamp": "2026-05-04T08:21:11.412Z",
"level": "info",
"service": "inventory-service",
"version": "<git-sha>",
"tenantId": "tnt_…",
"propertyId": "ppt_…",
"useCase": "place_hold_allocation",
"allocationId": "inv_…",
"reservationId": "rsv_…",
"traceId": "00-…",
"requestId": "01J…",
"msg": "allocation held",
"durationMs": 47
}
Required fields on every record: service, version, tenantId, traceId, requestId. A pre-commit lint rule enforces this on every logger.* call site.
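As a complement to the lint rule, the required-field list can also be checked at runtime, for example in a logger unit test. The sketch below assumes a hypothetical helper name (missingRequiredLogFields) and uses placeholder identifiers; it is not the lint rule itself.

```typescript
// The five fields the document requires on every log record.
const REQUIRED_LOG_FIELDS: string[] = ["service", "version", "tenantId", "traceId", "requestId"];

type LogRecord = { [field: string]: unknown };

// Returns the required fields that are absent or are not non-empty strings.
function missingRequiredLogFields(record: LogRecord): string[] {
  return REQUIRED_LOG_FIELDS.filter((field) => {
    const value = record[field];
    return typeof value !== "string" || value.length === 0;
  });
}

// Placeholder values; real ids follow the tnt_… / 01J… shapes shown above.
const completeRecord: LogRecord = {
  service: "inventory-service",
  version: "abc1234",
  tenantId: "tnt_example",
  traceId: "00-example",
  requestId: "01JEXAMPLE",
  msg: "allocation held",
  durationMs: 47,
};

// Same record without traceId, to show what the check reports.
const incompleteRecord: LogRecord = {
  service: "inventory-service",
  version: "abc1234",
  tenantId: "tnt_example",
  requestId: "01JEXAMPLE",
  msg: "allocation held",
};

const missingInComplete = missingRequiredLogFields(completeRecord);
const missingInIncomplete = missingRequiredLogFields(incompleteRecord);
```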
3. SLIs and SLOs
| SLI | SLO target | Window | Source |
|---|---|---|---|
| Allocation latency p99 | < 200 ms | 30 d rolling | inventory_allocation_duration_ms histogram |
| Availability search latency p99 (cold) | < 300 ms | 30 d | inventory_availability_search_duration_ms{cache="miss"} |
| Availability search latency p99 (cached via search-agg) | < 50 ms | 30 d | upstream metric, cross-cited |
| False overbooking events | 0 | any 30 d | inventory_overbooking_actual_total (must always be 0; if it goes positive, P0 page) |
| Outbox lag p99 | < 5 s | 30 d | inventory_outbox_lag_seconds |
| Hold-expiry sweeper lag p99 | < 30 s | 30 d | inventory_hold_expiry_lag_seconds |
| Calendar horizon shortfall | every property ≥ 30 days | continuous | inventory_calendar_horizon_days_min per property |
| API availability (2xx response ratio) | 99.95% | 30 d | Cloud Run + LB metrics |
| OOO reaccommodation publication latency p95 | < 2 s after block created | 30 d | inventory_reaccommodation_pub_latency_ms |
| Sync snapshot pull p95 | < 800 ms | 30 d | inventory_snapshot_pull_latency_ms |
| Offline-arbitration loss rate | < 1% per tenant per 24 h | rolling | inventory_offline_arbitration_lost_total / …pushed_total |
The inventory_overbooking_actual_total counter increments only when a violation of the DB invariant is detected at write time, that is, when held + committed > total + overbooking_cap. The DB CHECK constraint on that invariant should make such a write impossible, so any non-zero value indicates a P0 incident.
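The condition that drives the counter can be written as a small predicate. The sketch below is an in-memory illustration: the InventoryDayCounts shape and the OverbookingCounter class are hypothetical stand-ins, not the service's actual schema or metrics client.

```typescript
// Per room-type, per-day counts; field names are assumptions for this sketch.
interface InventoryDayCounts {
  held: number;
  committed: number;
  total: number;
  overbookingCap: number;
}

// True when the write would breach held + committed <= total + overbooking_cap.
function violatesOverbookingInvariant(counts: InventoryDayCounts): boolean {
  return counts.held + counts.committed > counts.total + counts.overbookingCap;
}

// Minimal in-memory stand-in for inventory_overbooking_actual_total.
class OverbookingCounter {
  private value = 0;

  // Increments only when the invariant is violated at write time.
  recordWrite(counts: InventoryDayCounts): void {
    if (violatesOverbookingInvariant(counts)) {
      this.value += 1;
    }
  }

  current(): number {
    return this.value;
  }
}

const overbookingCounter = new OverbookingCounter();
// 2 + 9 = 11, limit is 10 + 1 = 11: exactly at the cap, no violation.
overbookingCounter.recordWrite({ held: 2, committed: 9, total: 10, overbookingCap: 1 });
const afterAtCapWrite = overbookingCounter.current();
// 3 + 9 = 12 exceeds 11: a violation, so the counter moves to 1.
overbookingCounter.recordWrite({ held: 3, committed: 9, total: 10, overbookingCap: 1 });
const afterViolatingWrite = overbookingCounter.current();
```

Note the strict inequality: a day that sits exactly at total + overbooking_cap is full, not overbooked.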
4. RED & USE metrics
4.1 RED (per use case)
For each use case (search_availability, place_hold_allocation, commit_allocation, release_allocation, create_block, release_block, group_hold, walk_in_allocate, update_overbooking_policy):
inventory_<use_case>_requests_total{tenant, outcome}
inventory_<use_case>_duration_ms{tenant, outcome} (histogram)
inventory_<use_case>_errors_total{tenant, code}
outcome ∈ {success, business_rejection, system_error}. business_rejection covers INSUFFICIENT_AVAILABILITY, STOP_SELL_ACTIVE, OVERBOOKING_CAP_EXCEEDED, LOCK_TIMEOUT, STALE_VERSION, ROOM_NOT_IN_TYPE.
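The outcome label and the three metric names per use case follow mechanically from the rules above. A minimal sketch, assuming hypothetical helper names (classifyOutcome, redMetricNames) and assuming that any error code outside the listed business rejections counts as system_error:

```typescript
type RedOutcome = "success" | "business_rejection" | "system_error";

// Codes the document lists under business_rejection.
const BUSINESS_REJECTION_CODES: string[] = [
  "INSUFFICIENT_AVAILABILITY",
  "STOP_SELL_ACTIVE",
  "OVERBOOKING_CAP_EXCEEDED",
  "LOCK_TIMEOUT",
  "STALE_VERSION",
  "ROOM_NOT_IN_TYPE",
];

// No error code means success; a listed code is a business rejection;
// anything else is treated as a system error (an assumption of this sketch).
function classifyOutcome(errorCode?: string): RedOutcome {
  if (errorCode === undefined) {
    return "success";
  }
  return BUSINESS_REJECTION_CODES.indexOf(errorCode) !== -1
    ? "business_rejection"
    : "system_error";
}

interface RedMetricNames {
  requests: string;
  duration: string;
  errors: string;
}

// Expands the inventory_<use_case>_* naming pattern for one use case.
function redMetricNames(useCase: string): RedMetricNames {
  return {
    requests: "inventory_" + useCase + "_requests_total",
    duration: "inventory_" + useCase + "_duration_ms",
    errors: "inventory_" + useCase + "_errors_total",
  };
}

const holdMetricNames = redMetricNames("place_hold_allocation");
const lockTimeoutOutcome = classifyOutcome("LOCK_TIMEOUT");
// "DB_CONNECTION_LOST" is a made-up code used only to show the fallback.
const unknownCodeOutcome = classifyOutcome("DB_CONNECTION_LOST");
```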
4.2 USE (per resource)
| Resource | Utilization | Saturation | Errors |
|---|---|---|---|
| Cloud Run instance CPU | cpu_utilization | pending_requests | crash_count |
| Drizzle / pg pool | pool_in_use | pool_waiting | pool_acquire_timeout_total |
| Pub/Sub publisher | publish_rate | publish_queue_depth | publish_error_total |
| Pub/Sub inbox subscription | messages_processed_rate | subscription_backlog | nack_total, dlq_arrivals_total |
| Postgres advisory lock pool | held_locks_count | lock_wait_p99_ms | lock_timeout_total |
4.3 Counters worth highlighting
inventory_allocation_lock_acquisitions_total{outcome}
inventory_allocation_lock_wait_ms (histogram)
inventory_allocation_committed_total{tenant, property, room_type}
inventory_allocation_released_total{tenant, reason_code}
inventory_block_created_total{tenant, reason}
inventory_overbooking_alert_total{tenant, property}
inventory_calendar_extension_rows_inserted_total
inventory_partition_rotation_total{outcome}
inventory_offline_allocations_pushed_total{outcome}
5. Dashboards
Three core dashboards in Cloud Monitoring + SigNoz:
5.1 inventory-service: service health
- p50/p95/p99 latency per endpoint and per use case.
- Error rate per code.
- Cloud Run instance count, CPU, memory.
- Pool saturation panels (DB, Pub/Sub).
- Outbox lag panel with the 5 s SLO line.
- Alert ladder status panel (RESV-INV-001..014).
5.2 inventory-service: allocation flow
- Allocations placed / committed / released per minute, stacked by tenant.
- Lock acquisition wait p99 per property.
- Reaccommodation publish latency.
- Hold-expiry sweeper backlog and runs.
- Group-hold size distribution.
- Walk-in vs saga-driven mix.
5.3 inventory-service: integrity
- inventory_overbooking_actual_total: must be flat at 0 forever. Alert RESV-INV-001 fires on any increase.
- Calendar horizon panel: per-property days-of-runway with red threshold at 30 days.
- Reconciliation drift panel: nightly job's diff between availability_calendars.summary and SUM of room_type_inventory_daily.
- DLQ arrivals panel; should be near zero.
6. Alert ladder
Each alert maps to a runbook under runbooks/inventory/.
| Alert | Condition | Severity | Runbook |
|---|---|---|---|
| RESV-INV-001 | inventory_overbooking_actual_total increases | P0 (page on-call + tech lead) | runbooks/inventory/false-overbooking.md |
| RESV-INV-002 | Allocation latency p99 > 200 ms for 10 min | P1 | runbooks/inventory/allocation-slow.md |
| RESV-INV-003 | Availability search p99 > 300 ms (cold) for 15 min | P2 | runbooks/inventory/search-slow.md |
| RESV-INV-004 | Outbox lag > 30 s p99 for 5 min | P1 | runbooks/inventory/outbox-lag.md |
| RESV-INV-005 | Hold-expiry sweeper lag > 60 s | P1 | runbooks/inventory/hold-expiry-stalled.md |
| RESV-INV-006 | Lock timeout rate > 2% over 30 min | P2 | runbooks/inventory/lock-contention.md |
| RESV-INV-007 | DLQ arrivals > 0 for any inbound subject | P1 | runbooks/inventory/dlq-triage.md |
| RESV-INV-008 | Calendar horizon < 30 days for any active property | P2 | runbooks/inventory/calendar-horizon-short.md |
| RESV-INV-009 | Partition rotation job failed | P1 | runbooks/inventory/partition-rotation.md |
| RESV-INV-010 | inventory_overbooking_alert_total > 0 for any tenant in 1 min | P2 (notify owner) | runbooks/inventory/overbooking-policy-fired.md |
| RESV-INV-011 | Reconciliation drift > 0 in any property-day | P2 | runbooks/inventory/reconcile-drift.md |
| RESV-INV-012 | Manual DELETE /allocations rate > 5/h per actor | P3 (notify gm) | runbooks/inventory/manual-release-spike.md |
| RESV-INV-013 | Offline-arbitration loss > 1% per tenant in 24 h | P2 | runbooks/inventory/offline-arbitration-loss.md |
| RESV-INV-014 | Pub/Sub backlog > 10k messages on any subscription | P1 | runbooks/inventory/pubsub-backlog.md |
Every page includes the runbook URL in the alert body.
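One way to guarantee the runbook link is to build the alert body from the rule definition itself, so a rule without a runbook cannot be rendered. The sketch below is illustrative: the AlertRule shape, the renderAlertBody helper, and the base URL are assumptions; only the alert id, severity, and runbook path come from the table above.

```typescript
type AlertSeverity = "P0" | "P1" | "P2" | "P3";

interface AlertRule {
  id: string;
  severity: AlertSeverity;
  runbookPath: string; // path as listed in the alert ladder table
}

// One row of the ladder, copied from the table.
const falseOverbookingRule: AlertRule = {
  id: "RESV-INV-001",
  severity: "P0",
  runbookPath: "runbooks/inventory/false-overbooking.md",
};

// Builds the alert body; the runbook URL is always the last line.
// Trailing slashes on the base URL are trimmed to avoid "//" in the link.
function renderAlertBody(rule: AlertRule, summary: string, runbookBaseUrl: string): string {
  const base = runbookBaseUrl.replace(/\/+$/, "");
  return (
    "[" + rule.severity + "] " + rule.id + ": " + summary +
    "\nRunbook: " + base + "/" + rule.runbookPath
  );
}

// The base URL is a placeholder; the document does not say where runbooks are hosted.
const sampleAlertBody = renderAlertBody(
  falseOverbookingRule,
  "inventory_overbooking_actual_total increased",
  "https://runbooks.example.invalid/",
);
```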
7. Synthetic checks
| Probe | Frequency | Asserts |
|---|---|---|
| POST /availability/search (canary tenant, canary property, fixed dates) | 60 s | 200 OK; latency < 300 ms |
| POST /allocations walk-in then DELETE /allocations/:id (canary) | 5 min | end-to-end allocation + release lifecycle |
| GET /internal/health, /internal/ready | 30 s | 200 OK |
| Pub/Sub end-to-end: publish a synthetic reservation.held.v1 for canary tenant; assert inventory.allocation.confirmed.v1 arrives within 3 s | 5 min | saga inbound→outbound integrity |
Synthetic failures escalate via the same alert ladder.
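The assertion for the availability search probe can be expressed as a pure function over one observation, which keeps the pass/fail rule testable without a network call. The types and the evaluateSearchProbe name are hypothetical; the 200 OK and 300 ms thresholds come from the table above.

```typescript
// One probe run: what the canary request returned and how long it took.
interface ProbeObservation {
  httpStatus: number;
  latencyMs: number;
}

interface ProbeVerdict {
  passed: boolean;
  failures: string[];
}

// Applies the search probe's assertions: status must be 200 and latency
// must be strictly below the limit (300 ms by default).
function evaluateSearchProbe(
  observation: ProbeObservation,
  maxLatencyMs: number = 300,
): ProbeVerdict {
  const failures: string[] = [];
  if (observation.httpStatus !== 200) {
    failures.push("expected 200 OK, got " + observation.httpStatus);
  }
  if (!(observation.latencyMs < maxLatencyMs)) {
    failures.push(
      "latency " + observation.latencyMs + " ms is not below " + maxLatencyMs + " ms",
    );
  }
  return { passed: failures.length === 0, failures: failures };
}

const healthyVerdict = evaluateSearchProbe({ httpStatus: 200, latencyMs: 120 });
const slowVerdict = evaluateSearchProbe({ httpStatus: 200, latencyMs: 300 });
const erroredVerdict = evaluateSearchProbe({ httpStatus: 503, latencyMs: 50 });
```

A run at exactly 300 ms fails, matching the strict "< 300 ms" in the table.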
8. Distributed tracing examples
A booking saga trace (parent: reservation-service):
reservation.holdReservation() 120ms
├─ reservation-service.HoldUseCase 90ms
└─ pubsub.publish reservation.held.v1 25ms
   └─ inventory-service.PlaceHoldAllocationUseCase 145ms
      ├─ db.advisory_lock acquire (3 nights) 18ms
      ├─ db.RoomTypeInventory.loadForUpdate 9ms
      ├─ domain.RoomTypeInventory.reserve <1ms
      ├─ domain.RoomPicker.choose 2ms
      ├─ db.RoomAllocation.insert 7ms
      ├─ db.outbox.insert (allocation.confirmed) 4ms
      └─ db.commit (lock release) 5ms
Every span carries tenant_id and causation_event_id. Failures attach error.code and error.message.
9. Cross-references
- Use cases that emit each metric: APPLICATION_LOGIC §2
- Failure modes that drive alert ladder: FAILURE_MODES
- Test coverage that backs SLO claims: TESTING_STRATEGY §4 + §7
- Reservation observability: reservation-service OBSERVABILITY