OBSERVABILITY — reservation-service
Sibling: SECURITY_MODEL · DEPLOYMENT_TOPOLOGY · FAILURE_MODES
Strategic anchors: 02 §13 Observability · standards/SERVICE_TEMPLATE
The platform observability stack is OpenTelemetry (traces + metrics + logs) → Cloud Operations + SigNoz. reservation-service initializes @ghasi/telemetry before NestFactory in main.ts so every span includes baseline attributes and every log record is structured JSON.
1. Required span attributes (every span)
| Attribute | Source | Notes |
|---|---|---|
| tenant.id | request middleware | always present except /internal/health and /internal/ready |
| property.id | request middleware | for property-scoped endpoints |
| actor.type, actor.id | identity resolver | guest, staff, system, partner |
| reservation.id | controller / use case | added as soon as known |
| reservation.status | use case | added on read/save |
| reservation.channel | use case | for funnel slicing |
| saga.step | saga handler | await_inventory, await_payment, etc. |
| event.subject | inbox/outbox handlers | full subject string |
| event.id, event.causation_id | inbox handler | for cross-service correlation |
| idempotency.key | controller | from Idempotency-Key header |
| db.system=postgresql, db.statement.fingerprint | drizzle adapter | hash of normalized SQL |
The traceparent value is propagated end-to-end following the W3C Trace Context standard: every emitted Pub/Sub message carries it as a message attribute, and every consumed message restores the parent context before the handler executes.
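As a rough illustration of the propagation rule above, the sketch below formats and parses a version-00 traceparent value and carries it in a plain attribute map. The helper names (formatTraceparent, parseTraceparent, withTraceparent, restoreParent) and the MessageAttributes shape are assumptions made for this example; the service is expected to rely on OpenTelemetry's own propagators rather than hand-rolled code like this.

```typescript
// Sketch only. Helper names and the attribute-map shape are illustrative
// assumptions, not the service's actual propagation code.

interface SpanContext {
  traceId: string; // 32 lowercase hex characters, not all zeros
  spanId: string; // 16 lowercase hex characters, not all zeros
  sampled: boolean;
}

type MessageAttributes = Record<string, string>;

// Version 00 of the W3C Trace Context header: version-traceid-parentid-flags.
const TRACEPARENT_V00 = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function formatTraceparent(ctx: SpanContext): string {
  return `00-${ctx.traceId}-${ctx.spanId}-${ctx.sampled ? "01" : "00"}`;
}

function parseTraceparent(value: string | undefined): SpanContext | undefined {
  if (value === undefined) return undefined;
  const match = TRACEPARENT_V00.exec(value);
  if (match === null) return undefined;
  const [, traceId = "", spanId = "", flags = ""] = match;
  // The spec treats all-zero trace and parent ids as invalid.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return undefined;
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

// Publisher side: copy the attributes and attach the current span context.
function withTraceparent(attrs: MessageAttributes, ctx: SpanContext): MessageAttributes {
  return { ...attrs, traceparent: formatTraceparent(ctx) };
}

// Consumer side: recover the parent context before the handler runs.
function restoreParent(attrs: MessageAttributes): SpanContext | undefined {
  return parseTraceparent(attrs["traceparent"]);
}
```

A message without a valid traceparent attribute yields undefined here, in which case a consumer would typically start a new root span rather than fail the message.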
2. Structured log fields (every record)
Mandatory: timestamp, severity, message, service.name, service.version, trace_id, span_id, tenant_id, request_id, actor.type, actor.id. Optional but expected on errors: error.code (MELMASTOON.RESERVATION.…), error.kind, error.stack, reservation.id, reservation.status.
PII never appears in logs. The structured logger automatically drops keys named email, phone, phoneE164, documentNumber, password, cardNumber, and any field with the pii:true schema marker.
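The drop rule can be pictured with a small sketch like the one below. It is not the logger's actual implementation: the redact function, the LogValue type, and the schemaPiiKeys parameter (standing in for fields carrying the pii:true schema marker) are hypothetical, and recursing into nested objects is an assumption, since the text only says that matching keys are dropped.

```typescript
// Sketch only: a hypothetical redaction pass, not the @ghasi/telemetry logger.

// Key names taken from the list in the text above.
const PII_KEYS: ReadonlySet<string> = new Set<string>([
  "email",
  "phone",
  "phoneE164",
  "documentNumber",
  "password",
  "cardNumber",
]);

type LogValue =
  | string
  | number
  | boolean
  | null
  | LogValue[]
  | { [key: string]: LogValue };

// schemaPiiKeys stands in for fields flagged pii:true in a schema (assumption).
function redact(
  value: LogValue,
  schemaPiiKeys: ReadonlySet<string> = new Set<string>(),
): LogValue {
  if (Array.isArray(value)) {
    return value.map((item) => redact(item, schemaPiiKeys));
  }
  if (value !== null && typeof value === "object") {
    const out: { [key: string]: LogValue } = {};
    for (const [key, child] of Object.entries(value)) {
      // Matching keys are dropped entirely rather than masked.
      if (PII_KEYS.has(key) || schemaPiiKeys.has(key)) continue;
      out[key] = redact(child, schemaPiiKeys);
    }
    return out;
  }
  return value;
}
```

Dropping the key instead of masking its value keeps the record free of any hint about the original content, at the cost of not showing that a field was present.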
3. SLIs and SLOs
| SLI | SLO target | Window | Burn-rate alerts |
|---|---|---|---|
| Booking-saga latency p99 (held → confirmed) | < 5 s including external payment capture | 30 d | 2× burn for 5 min and 14× for 1 h |
| Hold-expiry job lag (max now() − expires_at for any unswept hold) | < 30 s | 30 d | 2× for 10 min |
| Modification success rate | > 99.5% | 30 d | 2× for 30 min |
| API availability (5xx rate on /api/v1/reservations/*) | 99.95% | 30 d | 2× for 5 min and 14× for 1 h |
| Outbox publish lag p99 (published_at − created_at) | < 30 s | 7 d | 2× for 10 min |
| Inbox processing lag p99 (processed_at − publishedAt) | < 60 s | 7 d | 2× for 15 min |
| OCC conflict rate on Reservation.save | < 1% | 24 h | warning > 2% for 30 min |
| Walk-in completion latency p95 (start → checked_in) | < 8 s in normal connectivity | 7 d | warning > 12 s for 30 min |
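For readers unfamiliar with the burn-rate column, the multiplier is the observed error ratio divided by the error budget implied by the SLO target. The sketch below shows that arithmetic; the function names are illustrative and the actual alert rules live in the monitoring configuration, not in code like this.

```typescript
// Sketch only: burn-rate arithmetic with illustrative function names.

// Fraction of events allowed to fail under the SLO, e.g. 0.0005 for 99.95%.
function errorBudget(sloTarget: number): number {
  return 1 - sloTarget;
}

// How many times faster than "exactly on budget" the errors are arriving.
function burnRate(badEvents: number, totalEvents: number, sloTarget: number): number {
  if (totalEvents === 0) return 0; // no traffic, nothing burned
  return badEvents / totalEvents / errorBudget(sloTarget);
}

// At a constant burn rate, the whole window's budget lasts windowDays / rate.
function daysUntilBudgetExhausted(windowDays: number, rate: number): number {
  return rate > 0 ? windowDays / rate : Number.POSITIVE_INFINITY;
}
```

Applied to the API availability row (99.95% over 30 d), 7 failed requests out of 1,000 is a burn rate of about 14, which would use up the 30-day budget in a little over two days if sustained; a burn rate of 2 would use it up in 15 days.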
4. RED + USE metrics
Per endpoint and per use case, OpenTelemetry exports:
| Metric | Type | Tags |
|---|---|---|
| reservation_http_requests_total | counter | route, method, status_class, tenant_id |
| reservation_http_request_duration_seconds | histogram | route, method, tenant_id |
| reservation_use_case_duration_seconds | histogram | use_case, outcome |
| reservation_state_transitions_total | counter | from_state, to_state, cause |
| reservation_saga_step_duration_seconds | histogram | step, outcome |
| reservation_outbox_lag_seconds | gauge | (single value, refreshed every 15 s) |
| reservation_inbox_lag_seconds | gauge | subscription |
| reservation_holds_active | gauge | tenant_id, property_id |
| reservation_hold_expirations_total | counter | reason (ttl_elapsed, staff_override) |
| reservation_occ_conflicts_total | counter | aggregate |
| reservation_pubsub_publish_failures_total | counter | subject |
| reservation_inbox_dedupe_hits_total | counter | subject |
DB pool gauges (db_pool_in_use, db_pool_idle, db_pool_waiters) come from the Drizzle/pg pool adapter.
5. Dashboards
Three dashboards live in SigNoz/Cloud Monitoring under the reservation-service folder:
- Service health — RED on every endpoint; saga step heatmap; outbox/inbox lag; DB pool usage; error-code breakdown.
- Booking funnel — quote → hold → confirm conversion rates by tenant/channel; saga p50/p95/p99 latency; abandonment timeline.
- Operations (front-desk view) — arrivals today vs forecast per property; in-house count; modifications-in-flight; offline desktop reconciliation conflicts; walk-in throughput.
All dashboards filter by tenant_id and property_id and respect the viewer's data-residency entitlements (enforced server-side by the dashboard service at render time).
6. Alerts and runbooks
Each alert has a named runbook under runbooks/reservation/ in the documentation repo.
| Alert | Trigger | Runbook |
|---|---|---|
| RESV-001 BookingSagaLatencyHigh | p99 > 5 s for 5 min | runbooks/reservation/booking-saga-latency.md |
| RESV-002 HoldExpirySweeperStalled | reservation_outbox_lag_seconds for hold_expired.v1 > 60 s for 5 min | runbooks/reservation/hold-expiry-stalled.md |
| RESV-003 ModificationFailureRateHigh | > 0.5% failures over 30 min | runbooks/reservation/modifications-failing.md |
| RESV-004 OutboxPublishLagHigh | p99 > 30 s for 10 min | runbooks/reservation/outbox-lag.md |
| RESV-005 InboxDLQGrowing | DLQ size > 10 for 10 min on any subscription | runbooks/reservation/inbox-dlq.md |
| RESV-006 PaymentEventReplayFailing | repeated MELMASTOON.RESERVATION.STALE_VERSION on payment.captured handler | runbooks/reservation/payment-replay.md |
| RESV-007 LockIssuanceDegraded | requires_manual_key=true rate > 5% for 1 h | runbooks/reservation/lock-degraded.md |
| RESV-008 OverbookingDetected | any MELMASTOON.RESERVATION.OVERBOOKING_BLOCKED raised by domain (means our defense-in-depth caught a bypass) | runbooks/reservation/overbooking-defense.md (P1) |
| RESV-009 SuspectedFraudHoldBacklog | held-for-review > 30 unresolved over 1 h per tenant | runbooks/reservation/fraud-review-backlog.md |
| RESV-010 HoldsExceededTenantLimit | MELMASTOON.RESERVATION.HOLD_LIMIT_EXCEEDED rate > 1/min per tenant | runbooks/reservation/hold-limit.md |
Pages route to the on-call rotation defined in runbooks/_oncall.md. P1 alerts (RESV-008) page immediately and open an incident in Slack #inc-reservation.
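Several triggers above have the form "value above a threshold for N minutes" (for example RESV-005, DLQ size > 10 for 10 min). One plausible reading of that condition is sketched below: the alert fires only when every sample covering the trailing window breaches the threshold. The Sample type and the sustainedAbove function are hypothetical, and the real evaluation semantics are whatever the alerting backend implements.

```typescript
// Sketch only: one possible interpretation of "above threshold for N minutes".

interface Sample {
  atMs: number; // sample timestamp, epoch milliseconds
  value: number;
}

function sustainedAbove(
  samples: readonly Sample[],
  threshold: number,
  durationMs: number,
  nowMs: number,
): boolean {
  const windowStart = nowMs - durationMs;
  const ordered = samples
    .filter((s) => s.atMs <= nowMs)
    .sort((a, b) => a.atMs - b.atMs);

  // Anchor: the latest sample at or before the window start. Without it the
  // window is not fully covered and the condition cannot be proven.
  let anchor = -1;
  ordered.forEach((s, i) => {
    if (s.atMs <= windowStart) anchor = i;
  });
  if (anchor === -1) return false;

  // Every sample from the anchor onward must breach the threshold.
  return ordered.slice(anchor).every((s) => s.value > threshold);
}
```

Requiring full window coverage means a freshly started series cannot page until it has at least the full duration of history, which trades a slower first alert for fewer false pages after a restart.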
7. Tracing patterns
- Saga trace correlation: the booking saga starts at the BFF entry; the traceparent flows through every Pub/Sub message; SigNoz can render the entire saga as one trace tree spanning bff-tenant-booking-service, reservation-service, inventory-service, payment-gateway-service, lock-integration-service, notification-service.
- Inbox handler trace: when a Pub/Sub message arrives, the handler restores traceparent from message attributes and starts a child span reservation.inbox.<subject> linked to the producer span. Causation chains are visible in trace search (event.causation_id).
- Outbox span: every aggregate save creates reservation.outbox.write and reservation.outbox.publish spans with the same aggregate_id.
8. Replay & backfill observability
- The replay runbook reuses Pub/Sub seek-to-timestamp; the inbox dedupe table (reservation.inbox_processed) makes replay safe.
- During replay, watch reservation_inbox_dedupe_hits_total (should spike) and reservation_use_case_duration_seconds{use_case=…} for the replayed handlers (should not exceed normal p99).
- Audit replay: any operator-initiated replay emits melmastoon.audit.replay_initiated.v1 with the operator id and time range.
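Why the dedupe table makes replay safe can be seen in a minimal in-memory sketch. The InMemoryInbox class below is hypothetical: a Set stands in for reservation.inbox_processed and a plain field stands in for reservation_inbox_dedupe_hits_total. It also assumes dedupe is keyed on event.id, and that the real service records the id atomically with the handler's side effects, neither of which this sketch demonstrates.

```typescript
// Sketch only: in-memory stand-in for the inbox dedupe table and its counter.

interface InboxEvent {
  id: string; // assumed to be the dedupe key (event.id)
  subject: string;
}

type HandleOutcome = "processed" | "duplicate";

class InMemoryInbox {
  private readonly processed = new Set<string>();
  dedupeHits = 0; // stand-in for reservation_inbox_dedupe_hits_total

  handle(event: InboxEvent, handler: (e: InboxEvent) => void): HandleOutcome {
    if (this.processed.has(event.id)) {
      // Replayed or redelivered message: count it and skip the handler.
      this.dedupeHits += 1;
      return "duplicate";
    }
    handler(event);
    // Recorded only after the handler succeeds, so a thrown error leaves the
    // event eligible for redelivery.
    this.processed.add(event.id);
    return "processed";
  }
}
```

During a seek-to-timestamp replay, every previously handled event takes the duplicate branch, which is why the dedupe-hits counter is expected to spike while handler side effects stay unchanged.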
9. Synthetic checks
Cloud Monitoring runs synthetic checks every 60 s from three regions:
- POST /api/v1/reservations/quotes against a synthetic tenant: must return 200 with a quote.
- POST /api/v1/reservations/holds then immediate confirm via mocked payment webhook: must complete in < 5 s.
- GET /internal/health: must return 200 and db.ok=true, pubsub.ok=true.
Synthetic check failures are paged at the same severity as real availability alerts but tagged source=synthetic so the on-call can distinguish.
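The pass criteria can be expressed as small pure functions, sketched below. The nested HealthBody shape is a guess at how db.ok and pubsub.ok appear in the health response, and CheckVerdict, evaluateHealthCheck, and evaluateHoldConfirm are names invented for this example rather than part of any Cloud Monitoring API.

```typescript
// Sketch only: hypothetical evaluators for two of the synthetic checks.

// Assumed response shape for GET /internal/health.
interface HealthBody {
  db?: { ok?: boolean };
  pubsub?: { ok?: boolean };
}

interface CheckVerdict {
  passed: boolean;
  source: "synthetic"; // lets on-call tell synthetic failures from real ones
  reasons: string[];
}

function evaluateHealthCheck(status: number, body: HealthBody): CheckVerdict {
  const reasons: string[] = [];
  if (status !== 200) reasons.push(`expected HTTP 200, got ${status}`);
  if (body.db?.ok !== true) reasons.push("db.ok is not true");
  if (body.pubsub?.ok !== true) reasons.push("pubsub.ok is not true");
  return { passed: reasons.length === 0, source: "synthetic", reasons };
}

// Hold followed by confirm must finish strictly under the budget (5 s).
function evaluateHoldConfirm(elapsedMs: number, budgetMs: number = 5000): CheckVerdict {
  const reasons: string[] =
    elapsedMs < budgetMs ? [] : [`took ${elapsedMs} ms, budget is ${budgetMs} ms`];
  return { passed: reasons.length === 0, source: "synthetic", reasons };
}
```

Collecting every failed criterion in reasons, rather than stopping at the first, gives the on-call a fuller picture from a single failed probe.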
10. Cross-references
- Trace propagation across Pub/Sub: 04 §10
- Audit / Merkle anchoring: 07 §9
- Failure-mode runbook index: FAILURE_MODES