OBSERVABILITY — reservation-service

Sibling: SECURITY_MODEL · DEPLOYMENT_TOPOLOGY · FAILURE_MODES

Strategic anchors: 02 §13 Observability · standards/SERVICE_TEMPLATE

The platform observability stack is OpenTelemetry (traces + metrics + logs) → Cloud Operations + SigNoz. reservation-service initializes @ghasi/telemetry before NestFactory in main.ts so every span includes baseline attributes and every log record is structured JSON.


1. Required span attributes (every span)

| Attribute | Source | Notes |
|---|---|---|
| tenant.id | request middleware | always present except /internal/health and /internal/ready |
| property.id | request middleware | for property-scoped endpoints |
| actor.type, actor.id | identity resolver | guest, staff, system, partner |
| reservation.id | controller / use case | added as soon as known |
| reservation.status | use case | added on read/save |
| reservation.channel | use case | for funnel slicing |
| saga.step | saga handler | await_inventory, await_payment, etc. |
| event.subject | inbox/outbox handlers | full subject string |
| event.id, event.causation_id | inbox handler | for cross-service correlation |
| idempotency.key | controller | from Idempotency-Key header |
| db.system=postgresql, db.statement.fingerprint | drizzle adapter | hash of normalized SQL |

traceparent is propagated end-to-end via the W3C standard; every emitted Pub/Sub message carries it as a message attribute and every consumed message restores the parent context before handler execution.
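
The propagation described above can be sketched without any tracing library. The W3C Trace Context header has the form `00-<32 hex trace id>-<16 hex parent span id>-<2 hex flags>`. The sketch below assumes the Pub/Sub attribute key is `traceparent`; the type and function names (`SpanContext`, `injectIntoAttributes`, `startChildFromAttributes`) are hypothetical, and in the real service this work is presumably done by the OpenTelemetry propagator wired up through @ghasi/telemetry.

```typescript
// Minimal W3C traceparent handling (version 00 only); illustrative, not the service's code.
interface SpanContext {
  traceId: string; // 32 lowercase hex characters
  spanId: string; // 16 lowercase hex characters
  sampled: boolean;
}

type Attributes = Record<string, string>;

const TRACEPARENT_RE = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function formatTraceparent(ctx: SpanContext): string {
  return `00-${ctx.traceId}-${ctx.spanId}-${ctx.sampled ? "01" : "00"}`;
}

function parseTraceparent(header: string | undefined): SpanContext | null {
  if (header === undefined) return null;
  const match = TRACEPARENT_RE.exec(header);
  if (match === null) return null;
  const traceId = match[1] ?? "";
  const spanId = match[2] ?? "";
  const flags = match[3] ?? "00";
  // All-zero trace or span ids are invalid per the specification.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

// Publisher side: attach the current span context to the outgoing message attributes.
function injectIntoAttributes(ctx: SpanContext, attrs: Attributes): Attributes {
  return { ...attrs, traceparent: formatTraceparent(ctx) };
}

// Consumer side: restore the producer context and start a span in the same trace.
function startChildFromAttributes(
  attrs: Attributes,
  newSpanId: string,
): (SpanContext & { parentSpanId: string }) | null {
  const parent = parseTraceparent(attrs["traceparent"]);
  if (parent === null) return null; // caller would start a fresh root span instead
  return {
    traceId: parent.traceId,
    spanId: newSpanId,
    sampled: parent.sampled,
    parentSpanId: parent.spanId,
  };
}

// Usage with the example ids from the W3C specification.
const producerCtx: SpanContext = {
  traceId: "4bf92f3577b34da6a3ce929d0e0e4736",
  spanId: "00f067aa0ba902b7",
  sampled: true,
};
const outgoingAttrs = injectIntoAttributes(producerCtx, { subject: "example.subject.v1" });
const consumerSpan = startChildFromAttributes(outgoingAttrs, "b7ad6b7169203331");
```

The consumer span keeps the producer's trace id and records the producer's span id as its parent, which is what lets a backend stitch both sides into one trace.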


2. Structured log fields (every record)

Mandatory: timestamp, severity, message, service.name, service.version, trace_id, span_id, tenant_id, request_id, actor.type, actor.id. Optional but expected on errors: error.code (MELMASTOON.RESERVATION.…), error.kind, error.stack, reservation.id, reservation.status.

PII never appears in logs. The structured logger automatically drops keys named email, phone, phoneE164, documentNumber, password, cardNumber, and any field with the pii:true schema marker.
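
A minimal sketch of key-based redaction follows. It is not the platform logger: the `redact` function is hypothetical, the recursion into nested objects and arrays is an assumption, and the `pii:true` schema marker is modeled simply as a set of field names supplied by the caller.

```typescript
// Keys the section lists as always dropped by the structured logger.
const DENIED_KEYS: ReadonlySet<string> = new Set<string>([
  "email",
  "phone",
  "phoneE164",
  "documentNumber",
  "password",
  "cardNumber",
]);

type LogValue = string | number | boolean | null | LogValue[] | { [key: string]: LogValue };

// piiMarked stands in for fields carrying the pii:true schema marker (assumed representation).
function redact(value: LogValue, piiMarked: ReadonlySet<string> = new Set<string>()): LogValue {
  if (Array.isArray(value)) {
    return value.map((item) => redact(item, piiMarked));
  }
  if (value !== null && typeof value === "object") {
    const out: { [key: string]: LogValue } = {};
    for (const [key, inner] of Object.entries(value)) {
      // Drop the key entirely rather than masking it, matching "drops keys" above.
      if (DENIED_KEYS.has(key) || piiMarked.has(key)) continue;
      out[key] = redact(inner, piiMarked);
    }
    return out;
  }
  return value;
}

// Usage: "email" and the nested "phone" are removed; "loyaltyId" is removed via the marker set.
const sampleRecord = redact(
  { message: "hold created", email: "guest@example.com", guest: { phone: "+10000000000", tier: "gold", loyaltyId: "L-1" } },
  new Set<string>(["loyaltyId"]),
);
```

Dropping by key name keeps the rule cheap to apply on every record, but it only protects fields whose names are known in advance, which is why the schema marker exists as a second mechanism.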


3. SLIs and SLOs

| SLI | SLO target | Window | Burn-rate alerts |
|---|---|---|---|
| Booking-saga latency p99 (held → confirmed) | < 5 s including external payment capture | 30 d | 2× burn for 5 min and 14× for 1 h |
| Hold-expiry job lag (max now() − expires_at for any unswept hold) | < 30 s | 30 d | 2× for 10 min |
| Modification success rate | > 99.5% | 30 d | 2× for 30 min |
| API availability (5xx rate on /api/v1/reservations/*) | 99.95% | 30 d | 2× for 5 min and 14× for 1 h |
| Outbox publish lag p99 (published_at − created_at) | < 30 s | 7 d | 2× for 10 min |
| Inbox processing lag p99 (processed_at − publishedAt) | < 60 s | 7 d | 2× for 15 min |
| OCC conflict rate on Reservation.save | < 1% | 24 h | warning > 2% for 30 min |
| Walk-in completion latency p95 (start → checked_in) | < 8 s in normal connectivity | 7 d | warning > 12 s for 30 min |
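
To make the burn-rate multipliers concrete, the sketch below works through the API availability row. It assumes the table uses the common definition of burn rate (observed bad fraction divided by the error budget, where the budget is 1 minus the SLO target). The function names are hypothetical and the code is illustrative arithmetic only.

```typescript
// Fraction of requests allowed to fail under the SLO.
function errorBudget(sloTarget: number): number {
  return 1 - sloTarget;
}

// How many times faster than "exactly on budget" the budget is being consumed.
function burnRate(badFraction: number, sloTarget: number): number {
  return badFraction / errorBudget(sloTarget);
}

// Bad fraction that corresponds to a given burn-rate threshold.
function badFractionAt(rate: number, sloTarget: number): number {
  return rate * errorBudget(sloTarget);
}

// Days until the budget is fully spent if the burn rate stays constant.
function daysToExhaustion(rate: number, windowDays: number): number {
  return windowDays / rate;
}

// API availability: 99.95% over a 30 d window.
const apiSlo = 0.9995;
const fastBurnFraction = badFractionAt(14, apiSlo); // about 0.007, roughly 0.7% 5xx
const slowBurnFraction = badFractionAt(2, apiSlo); // about 0.001, roughly 0.1% 5xx
const fastBurnDays = daysToExhaustion(14, 30); // a little over 2 days
const slowBurnDays = daysToExhaustion(2, 30); // 15 days
```

Under that definition, a sustained 14× burn would spend the 30 d budget in a little over two days, while a 2× burn would take 15 days.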

4. RED + USE metrics

Per endpoint and per use case, OpenTelemetry exports:

| Metric | Type | Tags |
|---|---|---|
| reservation_http_requests_total | counter | route, method, status_class, tenant_id |
| reservation_http_request_duration_seconds | histogram | route, method, tenant_id |
| reservation_use_case_duration_seconds | histogram | use_case, outcome |
| reservation_state_transitions_total | counter | from_state, to_state, cause |
| reservation_saga_step_duration_seconds | histogram | step, outcome |
| reservation_outbox_lag_seconds | gauge | (single value, refreshed every 15 s) |
| reservation_inbox_lag_seconds | gauge | subscription |
| reservation_holds_active | gauge | tenant_id, property_id |
| reservation_hold_expirations_total | counter | reason (ttl_elapsed, staff_override) |
| reservation_occ_conflicts_total | counter | aggregate |
| reservation_pubsub_publish_failures_total | counter | subject |
| reservation_inbox_dedupe_hits_total | counter | subject |

DB pool gauges (db_pool_in_use, db_pool_idle, db_pool_waiters) come from the Drizzle/pg pool adapter.
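
As an illustration of how the tags on reservation_http_requests_total might be derived, the sketch below collapses status codes into a status_class and counts per label set in memory. The `LabeledCounter` class is a hypothetical stand-in for an OpenTelemetry counter, and using the route template (rather than the raw path) as the `route` value is an assumption made to keep series cardinality bounded.

```typescript
type StatusClass = "1xx" | "2xx" | "3xx" | "4xx" | "5xx";

// Collapse an HTTP status code into its class, e.g. 201 becomes "2xx".
function statusClass(code: number): StatusClass {
  if (!Number.isInteger(code) || code < 100 || code > 599) {
    throw new RangeError(`invalid HTTP status: ${code}`);
  }
  return `${Math.floor(code / 100)}xx` as StatusClass;
}

interface HttpLabels {
  route: string; // assumed to be the route template, not the raw URL
  method: string;
  status_class: StatusClass;
  tenant_id: string;
}

// In-memory stand-in for a counter instrument; one series per distinct label set.
class LabeledCounter {
  private readonly series = new Map<string, number>();

  private static keyOf(labels: HttpLabels): string {
    return JSON.stringify([labels.route, labels.method, labels.status_class, labels.tenant_id]);
  }

  add(labels: HttpLabels, delta: number = 1): void {
    const key = LabeledCounter.keyOf(labels);
    this.series.set(key, (this.series.get(key) ?? 0) + delta);
  }

  value(labels: HttpLabels): number {
    return this.series.get(LabeledCounter.keyOf(labels)) ?? 0;
  }

  seriesCount(): number {
    return this.series.size;
  }
}

// Usage: 200 and 201 on the same route share one series; a 503 creates a second one.
const httpRequests = new LabeledCounter();
const baseLabels = { route: "/api/v1/reservations/holds", method: "POST", tenant_id: "tenant-a" };
httpRequests.add({ ...baseLabels, status_class: statusClass(200) });
httpRequests.add({ ...baseLabels, status_class: statusClass(201) });
httpRequests.add({ ...baseLabels, status_class: statusClass(503) });
```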


5. Dashboards

Three dashboards live in SigNoz/Cloud Monitoring under the reservation-service folder:

  1. Service health — RED on every endpoint; saga step heatmap; outbox/inbox lag; DB pool usage; error-code breakdown.
  2. Booking funnel — quote → hold → confirm conversion rates by tenant/channel; saga p50/p95/p99 latency; abandonment timeline.
  3. Operations (front-desk view) — arrivals today vs forecast per property; in-house count; modifications-in-flight; offline desktop reconciliation conflicts; walk-in throughput.

All dashboards filter by tenant_id and property_id and respect the viewer's data-residency entitlements (rendered server-side by the dashboard service).


6. Alerts and runbooks

Each alert has a named runbook under runbooks/reservation/ in the documentation repo.

| Alert | Trigger | Runbook |
|---|---|---|
| RESV-001 BookingSagaLatencyHigh | p99 > 5 s for 5 min | runbooks/reservation/booking-saga-latency.md |
| RESV-002 HoldExpirySweeperStalled | reservation_outbox_lag_seconds for hold_expired.v1 > 60 s for 5 min | runbooks/reservation/hold-expiry-stalled.md |
| RESV-003 ModificationFailureRateHigh | > 0.5% failures over 30 min | runbooks/reservation/modifications-failing.md |
| RESV-004 OutboxPublishLagHigh | p99 > 30 s for 10 min | runbooks/reservation/outbox-lag.md |
| RESV-005 InboxDLQGrowing | DLQ size > 10 for 10 min on any subscription | runbooks/reservation/inbox-dlq.md |
| RESV-006 PaymentEventReplayFailing | repeated MELMASTOON.RESERVATION.STALE_VERSION on payment.captured handler | runbooks/reservation/payment-replay.md |
| RESV-007 LockIssuanceDegraded | requires_manual_key=true rate > 5% for 1 h | runbooks/reservation/lock-degraded.md |
| RESV-008 OverbookingDetected | any MELMASTOON.RESERVATION.OVERBOOKING_BLOCKED raised by domain (means our defense-in-depth caught a bypass) | runbooks/reservation/overbooking-defense.md (P1) |
| RESV-009 SuspectedFraudHoldBacklog | held-for-review > 30 unresolved over 1 h per tenant | runbooks/reservation/fraud-review-backlog.md |
| RESV-010 HoldsExceededTenantLimit | MELMASTOON.RESERVATION.HOLD_LIMIT_EXCEEDED rate > 1/min per tenant | runbooks/reservation/hold-limit.md |

Pages route to the on-call rotation defined in runbooks/_oncall.md. P1 alerts (RESV-008) page immediately and open an incident in Slack #inc-reservation.


7. Tracing patterns

  • Saga trace correlation: the booking saga starts at the BFF entry; the traceparent flows through every Pub/Sub message; SigNoz can render the entire saga as one trace tree spanning bff-tenant-booking-service, reservation-service, inventory-service, payment-gateway-service, lock-integration-service, notification-service.
  • Inbox handler trace: when a Pub/Sub message arrives, the handler restores traceparent from message attributes and starts a child span reservation.inbox.<subject> linked to the producer span. Causation chains are visible in trace search (event.causation_id).
  • Outbox span: every aggregate save creates reservation.outbox.write and reservation.outbox.publish spans with the same aggregate_id.

8. Replay & backfill observability

  • The replay runbook reuses Pub/Sub seek-to-timestamp; the inbox dedupe table (reservation.inbox_processed) makes replay safe.
  • During replay, watch reservation_inbox_dedupe_hits_total (should spike) and reservation_use_case_duration_seconds{use_case=…} for the replayed handlers (should not exceed normal p99).
  • Audit replay: any operator-initiated replay emits melmastoon.audit.replay_initiated.v1 with the operator id and time range.
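
The reason replay is safe can be sketched as an idempotent inbox handler. This is an in-memory illustration, not the service's implementation: the `InboxProcessor` class is hypothetical, a `Set` stands in for the reservation.inbox_processed table, a `Map` stands in for reservation_inbox_dedupe_hits_total, and keying dedupe on the event id is an assumption.

```typescript
interface InboxMessage {
  eventId: string; // assumed dedupe key
  subject: string;
}

type Outcome = "processed" | "duplicate";

class InboxProcessor {
  // Stand-in for the reservation.inbox_processed table.
  private readonly processed = new Set<string>();
  // Stand-in for reservation_inbox_dedupe_hits_total, keyed by subject.
  private readonly dedupeHits = new Map<string, number>();
  private readonly handler: (message: InboxMessage) => void;

  constructor(handler: (message: InboxMessage) => void) {
    this.handler = handler;
  }

  handle(message: InboxMessage): Outcome {
    if (this.processed.has(message.eventId)) {
      // Already applied: count the hit and skip the side effects.
      const hits = this.dedupeHits.get(message.subject) ?? 0;
      this.dedupeHits.set(message.subject, hits + 1);
      return "duplicate";
    }
    // A database-backed version would typically record the event id and apply the
    // handler's writes in a single transaction; that is assumed here, not shown.
    this.handler(message);
    this.processed.add(message.eventId);
    return "processed";
  }

  hitsFor(subject: string): number {
    return this.dedupeHits.get(subject) ?? 0;
  }
}
```

Delivering the same batch twice applies each event once and raises the dedupe counter once per redelivered message, which matches the spike the second bullet says to expect during a replay.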

9. Synthetic checks

Cloud Monitoring runs synthetic checks every 60 s from three regions:

  • POST /api/v1/reservations/quotes against a synthetic tenant — must return 200 with a quote.
  • POST /api/v1/reservations/holds then immediate confirm via mocked payment webhook — must complete in < 5 s.
  • GET /internal/health — must return 200 with db.ok=true and pubsub.ok=true.

Synthetic check failures page at the same severity as real availability alerts but are tagged source=synthetic so the on-call engineer can tell them apart.
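
The health-check assertion can be sketched as a pure function. The response shape is an assumption: the section only names db.ok and pubsub.ok, so the nested JSON layout, the `HealthBody` and `CheckResult` types, and the `evaluateHealth` function are all hypothetical.

```typescript
// Assumed JSON shape of the /internal/health response body.
interface HealthBody {
  db?: { ok?: boolean };
  pubsub?: { ok?: boolean };
}

interface CheckResult {
  ok: boolean;
  failures: string[];
  source: "synthetic"; // tag that lets on-call separate synthetic from real alerts
}

function evaluateHealth(status: number, body: HealthBody): CheckResult {
  const failures: string[] = [];
  if (status !== 200) failures.push(`status=${status}`);
  // Missing fields count as failures, not as implicit success.
  if (body.db?.ok !== true) failures.push("db.ok!=true");
  if (body.pubsub?.ok !== true) failures.push("pubsub.ok!=true");
  return { ok: failures.length === 0, failures, source: "synthetic" };
}

// Usage: a 200 response with a failing Pub/Sub dependency still fails the check.
const degradedResult = evaluateHealth(200, { db: { ok: true }, pubsub: { ok: false } });
```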


10. Cross-references