OBSERVABILITY — reservation-service

Sibling: SECURITY_MODEL · DEPLOYMENT_TOPOLOGY · FAILURE_MODES

Strategic anchors: 02 §13 Observability · standards/SERVICE_TEMPLATE

The platform observability stack is OpenTelemetry (traces + metrics + logs) → Cloud Operations + SigNoz. reservation-service initializes @ghasi/telemetry before NestFactory in main.ts so every span includes baseline attributes and every log record is structured JSON.

1. Required span attributes (every span)

Attribute	Source	Notes
`tenant.id`	request middleware	always present except `/internal/health` and `/internal/ready`
`property.id`	request middleware	for property-scoped endpoints
`actor.type`, `actor.id`	identity resolver	`guest`, `staff`, `system`, `partner`
`reservation.id`	controller / use case	added as soon as known
`reservation.status`	use case	added on read/save
`reservation.channel`	use case	for funnel slicing
`saga.step`	saga handler	`await_inventory`, `await_payment`, etc.
`event.subject`	inbox/outbox handlers	full subject string
`event.id`, `event.causation_id`	inbox handler	for cross-service correlation
`idempotency.key`	controller	from `Idempotency-Key` header
`db.system=postgresql`, `db.statement.fingerprint`	drizzle adapter	hash of normalized SQL
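
As an illustration of what "hash of normalized SQL" in the last row could mean, here is a minimal sketch. The normalization rules, the function names, and the choice of a 32-bit FNV-1a hash are all assumptions made for the example; the actual drizzle adapter may normalize and hash differently.

```typescript
// Hypothetical sketch: fingerprint a SQL statement so that queries differing
// only in literals, parameters, case, or whitespace share one fingerprint.

function normalizeSql(sql: string): string {
  return sql
    .replace(/'(?:[^']|'')*'/g, "?")      // string literals, including '' escapes
    .replace(/\$\d+/g, "?")               // positional parameters such as $1
    .replace(/\b\d+(?:\.\d+)?\b/g, "?")   // bare numeric literals
    .replace(/\s+/g, " ")                 // collapse whitespace runs
    .trim()
    .toLowerCase();
}

// 32-bit FNV-1a over UTF-16 code units; a dependency-free stand-in for
// whatever hash the real adapter uses.
function fnv1a32(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16).padStart(8, "0");
}

function statementFingerprint(sql: string): string {
  return fnv1a32(normalizeSql(sql));
}
```

Because literals are replaced before hashing, the attribute stays low-cardinality and never carries guest data that happened to be inlined into a query.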

traceparent is propagated end-to-end via the W3C standard; every emitted Pub/Sub message carries it as a message attribute and every consumed message restores the parent context before handler execution.
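
The inject/restore round trip described above can be sketched without any SDK. This is only an illustration: the names `SpanContext`, `injectTraceparent`, and `extractTraceparent` are hypothetical, the sketch accepts only version `00` headers, and the real service presumably delegates this work to its telemetry library.

```typescript
// Hypothetical sketch of W3C traceparent propagation via message attributes.

interface SpanContext {
  traceId: string; // 32 lowercase hex characters
  spanId: string;  // 16 lowercase hex characters
  sampled: boolean;
}

const TRACEPARENT_RE = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Producer side: copy the current span context into the message attributes.
function injectTraceparent(
  ctx: SpanContext,
  attributes: Record<string, string> = {},
): Record<string, string> {
  const flags = ctx.sampled ? "01" : "00";
  return { ...attributes, traceparent: `00-${ctx.traceId}-${ctx.spanId}-${flags}` };
}

// Consumer side: restore the parent context before the handler runs.
// Returns undefined when the attribute is missing or malformed.
function extractTraceparent(
  attributes: Record<string, string>,
): SpanContext | undefined {
  const match = TRACEPARENT_RE.exec(attributes["traceparent"] ?? "");
  if (!match) return undefined;
  const traceId = match[1] ?? "";
  const spanId = match[2] ?? "";
  const flags = match[3] ?? "00";
  // All-zero ids are invalid under the W3C Trace Context specification.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return undefined;
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}
```

A consumer that gets `undefined` back would start a new root trace rather than fail the message, so a missing attribute degrades correlation but not processing.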

2. Structured log fields (every record)

Mandatory: timestamp, severity, message, service.name, service.version, trace_id, span_id, tenant_id, request_id, actor.type, actor.id. Optional but expected on errors: error.code (MELMASTOON.RESERVATION.…), error.kind, error.stack, reservation.id, reservation.status.

PII never appears in logs. The structured logger automatically drops keys named email, phone, phoneE164, documentNumber, password, cardNumber, and any field with the pii:true schema marker.
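
A minimal sketch of the key-based dropping rule follows. The key list comes from the paragraph above; the function name `redactPii`, the recursion into nested objects and arrays, and the `piiFields` set that stands in for the pii:true schema marker are assumptions for illustration, not the logger's actual interface.

```typescript
// Hypothetical sketch: drop PII keys from a log payload before serialization.

const PII_KEYS: ReadonlySet<string> = new Set([
  "email", "phone", "phoneE164", "documentNumber", "password", "cardNumber",
]);

type LogValue =
  | string | number | boolean | null
  | LogValue[]
  | { [key: string]: LogValue };

// piiFields holds field names flagged pii:true by a schema (assumed shape).
function redactPii(
  value: LogValue,
  piiFields: ReadonlySet<string> = new Set<string>(),
): LogValue {
  if (Array.isArray(value)) {
    return value.map((item) => redactPii(item, piiFields));
  }
  if (value !== null && typeof value === "object") {
    const clean: { [key: string]: LogValue } = {};
    for (const [key, child] of Object.entries(value)) {
      // The key is dropped entirely rather than masked.
      if (PII_KEYS.has(key) || piiFields.has(key)) continue;
      clean[key] = redactPii(child, piiFields);
    }
    return clean;
  }
  return value;
}
```

Dropping by key name does not catch PII embedded in free-text values such as `message`, so call sites still need to avoid interpolating guest data into log strings.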

3. SLIs and SLOs

SLI	SLO target	Window	Burn-rate alerts
Booking-saga latency p99 (held → confirmed)	< 5 s including external payment capture	30 d	2× burn for 5 min and 14× for 1 h
Hold-expiry job lag (max `now() − expires_at` for any unswept hold)	< 30 s	30 d	2× for 10 min
Modification success rate	> 99.5%	30 d	2× for 30 min
API availability (5xx rate on `/api/v1/reservations/*`)	99.95%	30 d	2× for 5 min and 14× for 1 h
Outbox publish lag p99 (`published_at − created_at`)	< 30 s	7 d	2× for 10 min
Inbox processing lag p99 (`processed_at − publishedAt`)	< 60 s	7 d	2× for 15 min
OCC conflict rate on `Reservation.save`	< 1%	24 h	warning > 2% for 30 min
Walk-in completion latency p95 (start → checked_in)	< 8 s in normal connectivity	7 d	warning > 12 s for 30 min
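
To make the burn-rate multipliers in the table concrete: a burn rate is the observed bad-event ratio divided by the error budget (1 minus the SLO target). The sketch below assumes that conventional definition; the function names are hypothetical, and how the "for N min" sustain condition is evaluated is left to the alerting backend.

```typescript
// Hypothetical sketch: compute an SLO burn rate for a measurement window.
// A burn rate of 1 would spend the error budget exactly over the SLO window.

function burnRate(badEvents: number, totalEvents: number, sloTarget: number): number {
  if (totalEvents <= 0) return 0; // no traffic, nothing burned
  const errorBudget = 1 - sloTarget;
  if (errorBudget <= 0) throw new RangeError("sloTarget must be below 1");
  return badEvents / totalEvents / errorBudget;
}

// True when the window burns budget faster than the given multiplier.
function exceedsBurnRate(rate: number, multiplier: number): boolean {
  return rate > multiplier;
}

// Example for the API availability SLO (99.95%): 80 responses with a 5xx
// status out of 10,000 is a 0.8% error ratio against a 0.05% budget,
// a burn rate of roughly 16, which is above the 14x threshold.
const apiBurn = burnRate(80, 10_000, 0.9995);
console.log(exceedsBurnRate(apiBurn, 14));
```

At a 99.95% target, the 2× threshold corresponds to a 0.1% error ratio and the 14× threshold to 0.7%.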

4. RED + USE metrics

Per endpoint and per use case, OpenTelemetry exports:

Metric	Type	Tags
`reservation_http_requests_total`	counter	`route`, `method`, `status_class`, `tenant_id`
`reservation_http_request_duration_seconds`	histogram	`route`, `method`, `tenant_id`
`reservation_use_case_duration_seconds`	histogram	`use_case`, `outcome`
`reservation_state_transitions_total`	counter	`from_state`, `to_state`, `cause`
`reservation_saga_step_duration_seconds`	histogram	`step`, `outcome`
`reservation_outbox_lag_seconds`	gauge	(single value, refreshed every 15 s)
`reservation_inbox_lag_seconds`	gauge	`subscription`
`reservation_holds_active`	gauge	`tenant_id`, `property_id`
`reservation_hold_expirations_total`	counter	`reason` (`ttl_elapsed`, `staff_override`)
`reservation_occ_conflicts_total`	counter	`aggregate`
`reservation_pubsub_publish_failures_total`	counter	`subject`
`reservation_inbox_dedupe_hits_total`	counter	`subject`

DB pool gauges (db_pool_in_use, db_pool_idle, db_pool_waiters) come from the Drizzle/pg pool adapter.

5. Dashboards

Three dashboards live in SigNoz/Cloud Monitoring under the reservation-service folder:

Service health — RED on every endpoint; saga step heatmap; outbox/inbox lag; DB pool usage; error-code breakdown.
Booking funnel — quote → hold → confirm conversion rates by tenant/channel; saga p50/p95/p99 latency; abandonment timeline.
Operations (front-desk view) — arrivals today vs forecast per property; in-house count; modifications-in-flight; offline desktop reconciliation conflicts; walk-in throughput.

All dashboards filter by tenant_id and property_id and respect the viewer's data-residency entitlements (rendered server-side by the dashboard service).

6. Alerts and runbooks

Each alert has a named runbook under runbooks/reservation/ in the documentation repo.

Alert	Trigger	Runbook
`RESV-001 BookingSagaLatencyHigh`	p99 > 5 s for 5 min	`runbooks/reservation/booking-saga-latency.md`
`RESV-002 HoldExpirySweeperStalled`	`reservation_outbox_lag_seconds` for `hold_expired.v1` > 60 s for 5 min	`runbooks/reservation/hold-expiry-stalled.md`
`RESV-003 ModificationFailureRateHigh`	`>0.5%` failures over 30 min	`runbooks/reservation/modifications-failing.md`
`RESV-004 OutboxPublishLagHigh`	p99 > 30 s for 10 min	`runbooks/reservation/outbox-lag.md`
`RESV-005 InboxDLQGrowing`	DLQ size > 10 for 10 min on any subscription	`runbooks/reservation/inbox-dlq.md`
`RESV-006 PaymentEventReplayFailing`	repeated `MELMASTOON.RESERVATION.STALE_VERSION` on payment.captured handler	`runbooks/reservation/payment-replay.md`
`RESV-007 LockIssuanceDegraded`	`requires_manual_key=true` rate > 5% for 1 h	`runbooks/reservation/lock-degraded.md`
`RESV-008 OverbookingDetected`	any `MELMASTOON.RESERVATION.OVERBOOKING_BLOCKED` raised by domain (means our defense-in-depth caught a bypass)	`runbooks/reservation/overbooking-defense.md` (P1)
`RESV-009 SuspectedFraudHoldBacklog`	held-for-review > 30 unresolved over 1 h per tenant	`runbooks/reservation/fraud-review-backlog.md`
`RESV-010 HoldsExceededTenantLimit`	`MELMASTOON.RESERVATION.HOLD_LIMIT_EXCEEDED` rate > 1/min per tenant	`runbooks/reservation/hold-limit.md`

Pages route to the on-call rotation defined in runbooks/_oncall.md. P1 alerts (RESV-008) page immediately and open an incident in Slack #inc-reservation.

7. Tracing patterns

Saga trace correlation: the booking saga starts at the BFF entry; the traceparent flows through every Pub/Sub message; SigNoz can render the entire saga as one trace tree spanning bff-tenant-booking-service, reservation-service, inventory-service, payment-gateway-service, lock-integration-service, notification-service.
Inbox handler trace: when a Pub/Sub message arrives, the handler restores traceparent from message attributes and starts a child span reservation.inbox.<subject> linked to the producer span. Causation chains are visible in trace search (event.causation_id).
Outbox span: every aggregate save creates reservation.outbox.write and reservation.outbox.publish spans with the same aggregate_id.

8. Replay & backfill observability

The replay runbook reuses Pub/Sub seek-to-timestamp; the inbox dedupe table (reservation.inbox_processed) makes replay safe.
During replay, watch reservation_inbox_dedupe_hits_total (should spike) and reservation_use_case_duration_seconds{use_case=…} for the replayed handlers (should not exceed normal p99).
Audit replay: any operator-initiated replay emits melmastoon.audit.replay_initiated.v1 with the operator id and time range.
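
Why the dedupe table makes replay safe, and why the dedupe-hit counter should spike during one, can be shown with an in-memory sketch. The class and member names are hypothetical; in the real service the lookup and insert would presumably go against reservation.inbox_processed in the same transaction as the handler's writes.

```typescript
// Hypothetical in-memory sketch of inbox dedupe keyed by event id.

type InboxResult = "processed" | "duplicate";

class InboxDedupe {
  private readonly processed = new Set<string>();
  // Stands in for reservation_inbox_dedupe_hits_total.
  dedupeHits = 0;

  handle(eventId: string, handler: () => void): InboxResult {
    if (this.processed.has(eventId)) {
      this.dedupeHits += 1; // replayed message: skip side effects
      return "duplicate";
    }
    handler();
    // Recorded only after the handler succeeds, so a thrown error leaves
    // the event eligible for redelivery.
    this.processed.add(eventId);
    return "processed";
  }
}
```

Replaying a time range redelivers events that were already handled, so each one takes the duplicate branch and increments the counter without re-running the handler.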

9. Synthetic checks

Cloud Monitoring synthetic checks every 60 s from three regions:

POST /api/v1/reservations/quotes against a synthetic tenant — must return 200 with a quote.
POST /api/v1/reservations/holds then immediate confirm via mocked payment webhook — must complete in < 5 s.
GET /internal/health — must return 200 and db.ok=true, pubsub.ok=true.

Synthetic check failures page at the same severity as real availability alerts but are tagged source=synthetic so the on-call engineer can tell them apart.
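
The pass condition for the /internal/health probe in the list above can be written as a small predicate. The response body shape is inferred from the `db.ok` and `pubsub.ok` notation and is an assumption, as is the function name.

```typescript
// Hypothetical sketch: decide whether a health probe result passes.

interface HealthBody {
  db?: { ok?: boolean };
  pubsub?: { ok?: boolean };
}

function healthCheckPasses(status: number, body: HealthBody): boolean {
  // Both dependencies must report ok explicitly; a missing field fails.
  return status === 200 && body.db?.ok === true && body.pubsub?.ok === true;
}
```

Treating a missing field as a failure means a partially degraded or truncated response does not pass silently.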

10. Cross-references

Trace propagation across Pub/Sub: 04 §10
Audit / Merkle anchoring: 07 §9
Failure-mode runbook index: FAILURE_MODES
