
OBSERVABILITY — bff-tenant-booking-service

Sibling: SECURITY_MODEL · FAILURE_MODES · API_CONTRACTS

Cross-cutting: 02 Enterprise Architecture · §11 Observability

1. Stack

Concern | Tool
Tracing | OpenTelemetry → Cloud Trace + Tempo
Metrics | OpenTelemetry → Cloud Monitoring + Prometheus → Grafana
Logs | Pino → Cloud Logging → BigQuery (audit lake)
Profiling | Cloud Profiler (continuous, 1% sample)
Synthetic monitoring | Cloud Monitoring uptime checks + Playwright canary on stage
Alerting | Cloud Monitoring + PagerDuty

The OpenTelemetry SDK is initialized before NestFactory is invoked in main.ts, per SERVICE_TEMPLATE, so that instrumentation is in place before the application modules load.

2. SLIs and SLOs

SLI | SLO | Window | Error budget
bootstrap_p95_latency | < 350 ms (warm) | 28 d | 5%
availability_p95_latency | < 600 ms | 28 d | 5%
quote_p95_latency | < 250 ms | 28 d | 1%
hold_p95_latency | < 600 ms | 28 d | 1%
confirm_p95_latency | < 1.2 s | 28 d | 1%
confirmation_first_byte_p95 | < 800 ms | 28 d | 5%
availability (uptime) | 99.95% | 28 d | ≈ 20 min / 28 d
bootstrap_success_ratio | ≥ 99.9% | 28 d | 0.1%
quote_to_hold_success_ratio | ≥ 95% | 28 d | 5%
hold_to_confirm_success_ratio | ≥ 90% | 28 d | 10%
confirm_idempotent_correctness | 100% | continuous | 0
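
The error budgets in the table follow directly from the SLO target and the window. As a minimal sketch (the function names are illustrative and not part of the service), the 99.95% uptime target over 28 days works out to roughly 20 minutes of allowed downtime, and a ratio-style budget translates into a count of requests that may miss the target:

```typescript
// Minutes of allowed downtime for an availability SLO over a rolling window.
function downtimeBudgetMinutes(sloPercent: number, windowDays: number): number {
  const windowMinutes = windowDays * 24 * 60;
  return windowMinutes * (1 - sloPercent / 100);
}

// Number of requests that may miss a ratio-style SLO before the budget is spent.
// Multiplying before dividing keeps whole-number inputs exact.
function requestBudget(budgetPercent: number, totalRequests: number): number {
  return Math.floor((totalRequests * budgetPercent) / 100);
}

const uptimeBudget = downtimeBudgetMinutes(99.95, 28);
console.log(uptimeBudget.toFixed(2)); // 20.16

// With a 1% budget, 1,000,000 quote requests allow 10,000 slow responses.
console.log(requestBudget(1, 1_000_000));
```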

3. RED metrics (per route)

For every route the BFF emits the canonical RED metrics:

bff_tenant_booking_request_total{tenant_id, route, method, status_class}
bff_tenant_booking_request_duration_seconds{tenant_id, route, method, status_class} (histogram)
bff_tenant_booking_request_inflight{route}
bff_tenant_booking_errors_total{tenant_id, route, error_code}
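
To illustrate how the label set above might be derived for one request, the sketch below maps a status code to a status_class and assembles the labels. The helper names and the choice to pass a route template (for example /hold rather than a raw URL) are assumptions made here to keep label cardinality bounded; the source does not specify them.

```typescript
type StatusClass = "1xx" | "2xx" | "3xx" | "4xx" | "5xx";

interface RedLabels {
  tenant_id: string;
  route: string;
  method: string;
  status_class: StatusClass;
}

// Collapse an HTTP status code into its class, e.g. 201 -> "2xx".
function statusClass(statusCode: number): StatusClass {
  if (!Number.isInteger(statusCode) || statusCode < 100 || statusCode > 599) {
    throw new RangeError(`not an HTTP status code: ${statusCode}`);
  }
  return `${Math.floor(statusCode / 100)}xx` as StatusClass;
}

// Build the label set shared by the request counter and duration histogram.
function redLabels(
  tenantId: string,
  method: string,
  routeTemplate: string,
  statusCode: number,
): RedLabels {
  return {
    tenant_id: tenantId,
    route: routeTemplate,
    method: method.toUpperCase(),
    status_class: statusClass(statusCode),
  };
}

console.log(redLabels("tnt_example", "post", "/hold", 201));
```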

4. USE metrics (resources)

bff_tenant_booking_redis_pool_inuse / total
bff_tenant_booking_postgres_pool_inuse / total
bff_tenant_booking_outbox_depth
bff_tenant_booking_outbox_lag_seconds
bff_tenant_booking_circuit_breaker_state{upstream}
bff_tenant_booking_cache_hit_total{cache}
bff_tenant_booking_cache_miss_total{cache}
bff_tenant_booking_singleflight_followers_total{key_prefix}
bff_tenant_booking_handoff_consume_total{outcome}
bff_tenant_booking_payment_return_total{outcome}
bff_tenant_booking_draft_active_count
bff_tenant_booking_draft_abandoned_total{reason}
bff_tenant_booking_draft_converted_total
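
Several of these series are meant to be read as ratios rather than raw values. The following sketch, with hypothetical helper names, shows the two derivations a dashboard would most likely apply: pool utilization from the inuse and total gauges, and cache hit ratio from the hit and miss counters (taken as increases over the same time window).

```typescript
// Fraction of a connection pool currently in use; null when the pool is empty.
function poolUtilization(inuse: number, total: number): number | null {
  return total === 0 ? null : inuse / total;
}

// Hit ratio from counter increases over one window; null when there were no lookups.
function cacheHitRatio(hits: number, misses: number): number | null {
  const lookups = hits + misses;
  return lookups === 0 ? null : hits / lookups;
}

console.log(poolUtilization(12, 20));
console.log(cacheHitRatio(80, 20));
```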

5. Trace attributes (mandatory)

Every span carries:

Key | Source | Cardinality
service.name | bff-tenant-booking-service | low
tenant.id | resolved per request | medium (per tenant)
tenant.slug | resolved per request | medium
request.id | header / minted | high (sampled)
session.id | cookie / minted | high (sampled)
route.name | controller mapping | low
client.surface | header | low
cache.outcome | per-cache lookup | low
upstream.name | per upstream call | low
upstream.deadline_ms | per upstream call | low
circuit.state | per upstream call | low
flow.state.from / flow.state.to | flow transitions | low
bff.draft.id | when in flow | high (sampled)
handoff.id | when consumed | high (sampled)
handoff.replayed | bool | low
payment.intent.id | when present | high (sampled)
idempotency.key | when present | high (sampled)
ai.provenance.id | when AI surfaced | high (sampled)

6. Log fields (mandatory)

{
"ts": "2026-04-23T09:14:22.041Z",
"level": "info",
"service": "bff-tenant-booking-service",
"instance": "bff-tenant-asia-south1-7f8d9c-x4z2",
"traceId": "00-...",
"spanId": "...",
"requestId": "req_01H...",
"tenantId": "tnt_01H...",
"tenantSlug": "kabul-grand-hotel",
"sessionId": "tnt_session_01H...",
"route": "POST /hold",
"statusCode": 201,
"latencyMs": 412,
"cacheOutcome": "MISS",
"upstream": [{"name":"reservation-service","latencyMs":380,"status":"ok"}],
"msg": "hold_created",
"draftId": "bdr_01H..."
}

PII fields (email, phone, name) are NEVER logged in raw form; only their hashes are logged.
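
One way to enforce the PII rule is to pass every log payload through a redaction step before it reaches the logger. The sketch below is an assumption about how that could look, not the service's actual code: the function name, the `<field>Hash` output naming, and the injected `hash` callback are all hypothetical, and it only inspects top-level fields.

```typescript
type Hasher = (value: string) => string;

// Field names treated as PII, taken from the rule above.
const PII_FIELDS: readonly string[] = ["email", "phone", "name"];

// Return a copy of `fields` in which each PII field is replaced by `<field>Hash`.
// The raw value never appears in the returned object.
function redactPii(
  fields: Record<string, unknown>,
  hash: Hasher,
): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const key of Object.keys(fields)) {
    const value = fields[key];
    if (PII_FIELDS.includes(key)) {
      if (value !== undefined && value !== null) {
        out[`${key}Hash`] = hash(String(value));
      }
    } else {
      out[key] = value;
    }
  }
  return out;
}
```

In practice the `hash` argument would presumably be a keyed cryptographic digest (for example HMAC-SHA-256) so that hashes cannot be reversed by dictionary lookup; that choice is outside what the source specifies.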

7. Dashboards

7.1 "Funnel health" (executive)

  • Conversion rate by tenant (last 7 d)
  • Abandoned-cart % by stage
  • Handoff arrival → conversion
  • Top 5 error codes per stage
  • Revenue captured (display currency, by tenant)

7.2 "Service SLO" (SRE)

  • p50/p95/p99 latency by route
  • Error rate by route + status class
  • SLO burn rates (1 h, 6 h, 1 d, 7 d)
  • Upstream dependency health
  • Circuit breaker state timeline
  • Cache hit ratios
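
The burn-rate panels can be read using the usual SRE definition: burn rate is the observed error ratio divided by the error budget, so a value of 1 spends the budget exactly at the end of the window. A minimal sketch under that assumption (function names are illustrative):

```typescript
// Burn rate = observed bad-event ratio / allowed bad-event ratio.
function burnRate(
  badEvents: number,
  totalEvents: number,
  errorBudgetFraction: number,
): number {
  if (totalEvents === 0) return 0;
  return badEvents / totalEvents / errorBudgetFraction;
}

// Share of the whole window's budget consumed if `rate` holds for `lookbackHours`.
function budgetConsumed(
  rate: number,
  lookbackHours: number,
  windowDays: number,
): number {
  return rate * (lookbackHours / (windowDays * 24));
}

// 20 slow quotes out of 1000 against a 1% budget burns at twice the sustainable rate.
const rate = burnRate(20, 1000, 0.01);
console.log(rate, budgetConsumed(rate, 6, 28));
```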

7.3 "Booking flow" (product + on-call)

  • Live flow-state distribution
  • Step-completion durations p50/p95
  • Drop-off histogram
  • Handoff arrival count
  • Confirmation views
  • AI surface display vs interaction

7.4 "Per-tenant" (CSM)

Drill-down per tenant:

  • Daily bookings + revenue
  • Funnel chart (search → quote → hold → confirm)
  • Top error codes
  • Latency distribution
  • Bot-detector activity

8. Alerts (P1)

Alert | Condition | Action
bff-tenant-booking_p95_latency_burn_fast | route p95 > 2× SLO for 5 min | Page on-call; check upstream + cache hit
bff-tenant-booking_error_rate_burn_fast | overall 5xx > 1% for 5 min | Page on-call
bff-tenant-booking_quote_to_hold_drop | success_ratio < 90% for 15 min | Page Frontend Platform; check pricing + inventory
bff-tenant-booking_hold_to_confirm_drop | success_ratio < 80% for 15 min | Page on-call; payment-gateway likely involved
bff-tenant-booking_handoff_invalid_spike | invalid signature rate > 2/min for 5 min | Page Security; possible HMAC compromise or rotation skew
bff-tenant-booking_outbox_lag | > 60 s for 5 min | Page SRE; Pub/Sub or DB issue
bff-tenant-booking_circuit_open | any upstream circuit open > 5 min | Page on-call
bff-tenant-booking_redis_failover | failover event | Auto-acknowledged; verify session continuity
bff-tenant-booking_postgres_down | uptime check fails for 2 min | Page SRE
bff-tenant-booking_dlq_depth | DLQ > 50 | Page SRE
bff-tenant-booking_pii_in_telemetry | synthetic probe finds raw PII | Page Security; halt outbox publisher

Each alert links to a runbook in runbooks.melmastoon.ghasi.io/bff-tenant/<short-name>.
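
Most P1 conditions have the form "threshold breached for N minutes". The sketch below is a simplified model of that check, useful for unit-testing alert thresholds offline; it is not how Cloud Monitoring evaluates policies, and the `Sample` type, function name, and `minSamples` guard are assumptions introduced here.

```typescript
interface Sample {
  tsSeconds: number; // sample timestamp, seconds since an arbitrary epoch
  value: number;
}

// True when the trailing window holds at least `minSamples` samples
// and every one of them breaches the condition.
function sustained(
  samples: readonly Sample[],
  breaches: (value: number) => boolean,
  forSeconds: number,
  nowSeconds: number,
  minSamples: number,
): boolean {
  const windowStart = nowSeconds - forSeconds;
  const inWindow = samples.filter(
    (s) => s.tsSeconds > windowStart && s.tsSeconds <= nowSeconds,
  );
  return inWindow.length >= minSamples && inWindow.every((s) => breaches(s.value));
}

// "overall 5xx > 1% for 5 min" with one sample per minute.
const fiveXxRatio: Sample[] = [60, 120, 180, 240, 300].map((t) => ({
  tsSeconds: t,
  value: 0.02,
}));
console.log(sustained(fiveXxRatio, (v) => v > 0.01, 300, 300, 5));
```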

9. Alerts (P2)

Alert | Condition
bff-tenant-booking_cache_hit_ratio_drop | bootstrap cache hit < 80% for 30 min
bff-tenant-booking_singleflight_collapse | followers / leader > 1.5 sustained
bff-tenant-booking_abandoned_rate_anomaly | abandoned_rate > p95 baseline + 20% for 1 h
bff-tenant-booking_currency_change_spike | currency changes/sec > 3× baseline (potential UX bug)

10. Synthetic monitoring

Cloud Monitoring uptime checks run every 60 s:

  • GET /health/ready from 4 regions
  • GET /bff/tenant-booking/v1/staging-test/bootstrap (canary tenant)

In addition, a nightly Playwright canary runs the full happy-path booking against stage (search → quote → hold → guest → payment-stub → confirm).

11. Trace sampling

  • Default head sampler: 5%.
  • A tail-based sampler in the OTel collector raises sampling to 100% for spans with status_class = 5xx, error_code != null, or duration > p99.
  • During an incident, SRE raises the head sampling rate to 100% via the bff-tenant-flags Memorystore key; the override auto-reverts after 1 h.
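
The sampling rules can be summarized as a single keep-or-drop decision. The sketch below models only that decision rule; in the deployment described above the head decision is made in the SDK and the tail decision in the collector, and the type names, the `p99Ms` input, and the injected `headRoll` random draw are assumptions made for illustration and testability.

```typescript
interface FinishedSpan {
  statusClass: string; // e.g. "2xx", "5xx"
  errorCode: string | null;
  durationMs: number;
}

interface SamplingConfig {
  headRate: number; // 0.05 by default, 1.0 while the incident flag is set
  p99Ms: number; // current p99 latency, supplied by the caller
}

// `headRoll` is a uniform draw in [0, 1); passing it in keeps the function deterministic.
function keepTrace(
  span: FinishedSpan,
  cfg: SamplingConfig,
  headRoll: number,
): boolean {
  const elevated =
    span.statusClass === "5xx" ||
    span.errorCode !== null ||
    span.durationMs > cfg.p99Ms;
  return elevated || headRoll < cfg.headRate;
}

const cfg: SamplingConfig = { headRate: 0.05, p99Ms: 1000 };
console.log(keepTrace({ statusClass: "2xx", errorCode: null, durationMs: 40 }, cfg, Math.random()));
```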

12. Log retention

Log class | Retention
audit.* | 7 years (regulatory)
request.* | 30 d hot, 90 d cold
error.* | 90 d hot, 1 y cold
debug.* | 7 d

All logs export to BigQuery via Log Router for the audit lake.

13. Correlation IDs

Every BFF response carries X-Request-Id and traceparent. The same X-Request-Id is forwarded as the upstream request id on calls to reservation-service, pricing-service, and the other upstreams, so a single trace spans the booking saga and every hop can be joined on one request id.
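
As a rough sketch of the propagation rule, the helper below builds the correlation headers for an upstream call: it reuses the incoming X-Request-Id (minting one only if absent) and forwards a traceparent in the W3C Trace Context format. The function name and the injected `mintRequestId` callback are hypothetical; in an OpenTelemetry-instrumented service the propagator normally injects traceparent automatically.

```typescript
// W3C traceparent: version-traceid-parentid-flags, all lowercase hex.
const TRACEPARENT = /^[0-9a-f]{2}-[0-9a-f]{32}-[0-9a-f]{16}-[0-9a-f]{2}$/;

type IncomingHeaders = Record<string, string | undefined>;

function correlationHeaders(
  incoming: IncomingHeaders,
  activeTraceparent: string,
  mintRequestId: () => string,
): Record<string, string> {
  if (!TRACEPARENT.test(activeTraceparent)) {
    throw new Error("malformed traceparent");
  }
  // Reuse the caller's request id so every hop logs the same value.
  const requestId = incoming["x-request-id"]?.trim() || mintRequestId();
  return {
    "x-request-id": requestId,
    traceparent: activeTraceparent,
  };
}
```

The same two values would be echoed on the BFF response so that a client-side report can be matched to server-side logs and traces.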

14. Cost observability

  • Cloud Billing budget alerts at 50/80/100/120% of monthly budget.
  • Per-tenant inferred-cost dashboard (RPS × ms × cost-per-CPU-second).
  • Pub/Sub egress monitored separately; sampling adjustable per-event.
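
The inferred-cost formula (RPS × ms × cost-per-CPU-second) needs a unit conversion to produce a currency figure. The sketch below assumes that "ms" means mean CPU time per request in milliseconds, which the source does not state explicitly; the function name and the example price are placeholders.

```typescript
// Estimated hourly cost attributable to one tenant's traffic.
function inferredCostPerHour(
  rps: number,
  meanCpuMsPerRequest: number,
  costPerCpuSecond: number,
): number {
  // requests/s × CPU-seconds/request = CPU-seconds consumed per wall-clock second
  const cpuSecondsPerSecond = rps * (meanCpuMsPerRequest / 1000);
  return cpuSecondsPerSecond * costPerCpuSecond * 3600;
}

// 50 RPS at 20 ms CPU each, with a placeholder price per CPU-second.
console.log(inferredCostPerHour(50, 20, 0.00002));
```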

15. SLO error-budget policy

As the 28-day error budget is consumed, the following actions apply:

Consumption | Action
25% | Awareness: Slack notification
50% | Investigate: TODO ticket auto-created
75% | Freeze non-critical changes; SRE pair with Frontend Platform
100% | Hard freeze; revert recent changes; no new deploys until budget recovers
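
The policy table maps directly onto a threshold lookup. In this sketch the action identifiers are invented labels for the four rows, and treating each threshold as "at or above" is an assumption, since the table does not say how boundary values are handled.

```typescript
type BudgetAction =
  | "none"
  | "awareness"
  | "investigate"
  | "freeze_non_critical"
  | "hard_freeze";

// Map the percentage of the 28-day error budget consumed to the policy action.
function budgetAction(consumedPercent: number): BudgetAction {
  if (consumedPercent >= 100) return "hard_freeze";
  if (consumedPercent >= 75) return "freeze_non_critical";
  if (consumedPercent >= 50) return "investigate";
  if (consumedPercent >= 25) return "awareness";
  return "none";
}

console.log(budgetAction(62)); // "investigate"
```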