OBSERVABILITY — bff-tenant-booking-service

Sibling: SECURITY_MODEL · FAILURE_MODES · API_CONTRACTS

Cross-cutting: 02 Enterprise Architecture · §11 Observability

1. Stack

Concern	Tool
Tracing	OpenTelemetry → Cloud Trace + Tempo
Metrics	OpenTelemetry → Cloud Monitoring + Prometheus → Grafana
Logs	Pino → Cloud Logging → BigQuery (audit lake)
Profiling	Cloud Profiler (continuous, 1% sample)
Synthetic monitoring	Cloud Monitoring uptime checks + Playwright canary on stage
Alerting	Cloud Monitoring + PagerDuty

The OpenTelemetry SDK is initialized before NestFactory in main.ts, per SERVICE_TEMPLATE, so that instrumentation is in place before any application module loads.

2. SLIs and SLOs

SLI	SLO	Window	Error budget
`bootstrap_p95_latency`	< 350 ms (warm)	28 d	5%
`availability_p95_latency`	< 600 ms	28 d	5%
`quote_p95_latency`	< 250 ms	28 d	1%
`hold_p95_latency`	< 600 ms	28 d	1%
`confirm_p95_latency`	< 1.2 s	28 d	1%
`confirmation_first_byte_p95`	< 800 ms	28 d	5%
`availability` (uptime)	99.95%	28 d	≈ 20 min / 28 d
`bootstrap_success_ratio`	≥ 99.9%	28 d	0.1%
`quote_to_hold_success_ratio`	≥ 95%	28 d	5%
`hold_to_confirm_success_ratio`	≥ 90%	28 d	10%
`confirm_idempotent_correctness`	100%	continuous	0
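
The time-based error budget for an availability target follows from the target and the window length. A minimal sketch of that arithmetic; the helper name `errorBudgetMinutes` is illustrative and not part of the service:

```typescript
// Illustrative helper (not part of the service): converts an availability
// SLO over a rolling window into the downtime that the SLO permits.
function errorBudgetMinutes(sloPercent: number, windowDays: number): number {
  const windowMinutes = windowDays * 24 * 60; // 28 d -> 40,320 min
  const budgetFraction = 1 - sloPercent / 100; // 99.95% -> 0.0005
  return windowMinutes * budgetFraction;
}

// 99.95% over 28 days allows roughly 20 minutes of downtime.
const uptimeBudgetMinutes = errorBudgetMinutes(99.95, 28);
console.log(uptimeBudgetMinutes.toFixed(2)); // prints "20.16"
```

For the ratio-based SLIs the budget is simply the complement of the target, for example 0.1% of bootstrap requests for a 99.9% success ratio.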

3. RED metrics (per route)

For every route the BFF emits the canonical RED metrics:

bff_tenant_booking_request_total{tenant_id, route, method, status_class}
bff_tenant_booking_request_duration_seconds{tenant_id, route, method, status_class} (histogram)
bff_tenant_booking_request_inflight{route}
bff_tenant_booking_errors_total{tenant_id, route, error_code}
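
The label set above can be sketched as follows. This is an in-memory stand-in written only to show how `status_class` is derived and how one series is keyed per label combination; `RedCounter` and `statusClass` are hypothetical names, and a real service would record these through its OpenTelemetry metrics client instead.

```typescript
// Hypothetical label shape mirroring bff_tenant_booking_request_total.
type RedLabels = {
  tenant_id: string;
  route: string;
  method: string;
  status_class: string;
};

// Collapse an HTTP status code into a low-cardinality class such as "2xx".
function statusClass(statusCode: number): string {
  return `${Math.floor(statusCode / 100)}xx`;
}

// In-memory counter: one series per distinct label combination.
class RedCounter {
  private readonly counts = new Map<string, number>();

  private key(l: RedLabels): string {
    return JSON.stringify([l.tenant_id, l.route, l.method, l.status_class]);
  }

  inc(labels: RedLabels): void {
    const k = this.key(labels);
    this.counts.set(k, (this.counts.get(k) ?? 0) + 1);
  }

  get(labels: RedLabels): number {
    return this.counts.get(this.key(labels)) ?? 0;
  }
}

const requestTotal = new RedCounter();
const holdCreated: RedLabels = {
  tenant_id: "tnt_example", // placeholder tenant id
  route: "/hold",
  method: "POST",
  status_class: statusClass(201),
};
requestTotal.inc(holdCreated);
requestTotal.inc(holdCreated);
```

Recording `status_class` rather than the raw status code keeps the number of series per route bounded, which matters because `tenant_id` already multiplies cardinality.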

4. USE metrics (resources)

bff_tenant_booking_redis_pool_inuse / total
bff_tenant_booking_postgres_pool_inuse / total
bff_tenant_booking_outbox_depth
bff_tenant_booking_outbox_lag_seconds
bff_tenant_booking_circuit_breaker_state{upstream}
bff_tenant_booking_cache_hit_total{cache}
bff_tenant_booking_cache_miss_total{cache}
bff_tenant_booking_singleflight_followers_total{key_prefix}
bff_tenant_booking_handoff_consume_total{outcome}
bff_tenant_booking_payment_return_total{outcome}
bff_tenant_booking_draft_active_count
bff_tenant_booking_draft_abandoned_total{reason}
bff_tenant_booking_draft_converted_total

5. Trace attributes (mandatory)

Every span carries:

Key	Source	Cardinality
`service.name`	`bff-tenant-booking-service`	low
`tenant.id`	resolved per request	medium (per tenant)
`tenant.slug`	resolved per request	medium
`request.id`	header / minted	high (sampled)
`session.id`	cookie / minted	high (sampled)
`route.name`	controller mapping	low
`client.surface`	header	low
`cache.outcome`	per-cache lookup	low
`upstream.name`	per upstream call	low
`upstream.deadline_ms`	per upstream call	low
`circuit.state`	per upstream call	low
`flow.state.from` / `flow.state.to`	flow transitions	low
`bff.draft.id`	when in flow	high (sampled)
`handoff.id`	when consumed	high (sampled)
`handoff.replayed`	bool	low
`payment.intent.id`	when present	high (sampled)
`idempotency.key`	when present	high (sampled)
`ai.provenance.id`	when AI surfaced	high (sampled)

6. Log fields (mandatory)

{
  "ts": "2026-04-23T09:14:22.041Z",
  "level": "info",
  "service": "bff-tenant-booking-service",
  "instance": "bff-tenant-asia-south1-7f8d9c-x4z2",
  "traceId": "00-...",
  "spanId": "...",
  "requestId": "req_01H...",
  "tenantId": "tnt_01H...",
  "tenantSlug": "kabul-grand-hotel",
  "sessionId": "tnt_session_01H...",
  "route": "POST /hold",
  "statusCode": 201,
  "latencyMs": 412,
  "cacheOutcome": "MISS",
  "upstream": [{"name":"reservation-service","latencyMs":380,"status":"ok"}],
  "msg": "hold_created",
  "draftId": "bdr_01H..."
}

PII fields (email, phone, name) are NEVER logged in raw form; only their hashes are logged.
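
One way to enforce the PII rule is to hash the sensitive fields before a record reaches the logger. The sketch below is an assumption, not the service's actual implementation: the keyed HMAC-SHA-256 scheme, the `*Hash` field naming, and the helper names are all illustrative.

```typescript
import { createHmac } from "node:crypto";

// Assumption: these field names and the keyed-hash scheme are illustrative;
// the source only states that raw PII is never logged.
const PII_FIELDS = ["email", "phone", "name"] as const;

function hashPii(value: string, secret: string): string {
  // Normalise so that " Guest@Example.com" and "guest@example.com" match.
  const normalised = value.trim().toLowerCase();
  return createHmac("sha256", secret).update(normalised).digest("hex");
}

// Returns a copy of the record with raw PII removed and hashes added.
function redactForLog(
  fields: Record<string, unknown>,
  secret: string,
): Record<string, unknown> {
  const out: Record<string, unknown> = { ...fields };
  for (const key of PII_FIELDS) {
    const raw = out[key];
    if (typeof raw === "string") {
      delete out[key];
      out[`${key}Hash`] = hashPii(raw, secret);
    }
  }
  return out;
}

const safeRecord = redactForLog(
  { msg: "hold_created", email: "Guest@Example.com" },
  "demo-secret", // placeholder; a real key would come from a secret store
);
```

A keyed hash is used here rather than a plain digest because emails and phone numbers are low-entropy and an unkeyed hash of them can be reversed by enumeration.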

7. Dashboards

7.1 "Funnel health" (executive)

Conversion rate by tenant (last 7 d)
Abandoned-cart % by stage
Handoff arrival → conversion
Top 5 error codes per stage
Revenue captured (display currency, by tenant)

7.2 "Service SLO" (SRE)

p50/p95/p99 latency by route
Error rate by route + status class
SLO burn rates (1 h, 6 h, 1 d, 7 d)
Upstream dependency health
Circuit breaker state timeline
Cache hit ratios
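
A burn rate expresses how fast the error budget is being spent relative to the rate that would exhaust it exactly at the end of the window. A minimal sketch of the calculation behind the burn-rate panel, assuming the conventional definition (observed error ratio divided by the budget fraction); the function names are illustrative:

```typescript
// sloTarget is a fraction, e.g. 0.9995 for 99.95%.
function burnRate(observedErrorRatio: number, sloTarget: number): number {
  const budgetFraction = 1 - sloTarget;
  if (budgetFraction <= 0) {
    throw new Error("a 100% target leaves no budget to burn");
  }
  return observedErrorRatio / budgetFraction;
}

// Days until the whole budget is gone if the current rate persists.
function daysToExhaustion(rate: number, windowDays: number): number {
  return rate > 0 ? windowDays / rate : Infinity;
}

// Example: 0.7% of requests failed over the last hour against a 99.95% SLO.
const hourlyBurn = burnRate(0.007, 0.9995); // roughly 14x
const remainingDays = daysToExhaustion(hourlyBurn, 28); // roughly 2 days
```

A burn rate of 1 means the budget lasts exactly the 28-day window; comparing the 1 h and 6 h rates against the 1 d and 7 d rates separates a sharp incident from a slow leak.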

7.3 "Booking flow" (product + on-call)

Live flow-state distribution
Step-completion durations p50/p95
Drop-off histogram
Handoff arrival count
Confirmation views
AI surface display vs interaction

7.4 "Per-tenant" (CSM)

Drill-down per tenant:

Daily bookings + revenue
Funnel chart (search → quote → hold → confirm)
Top error codes
Latency distribution
Bot-detector activity

8. Alerts (P1)

Alert	Condition	Action
`bff-tenant-booking_p95_latency_burn_fast`	route p95 > 2× SLO for 5 min	Page on-call; check upstream + cache hit
`bff-tenant-booking_error_rate_burn_fast`	overall 5xx > 1% for 5 min	Page on-call
`bff-tenant-booking_quote_to_hold_drop`	success_ratio < 90% for 15 min	Page Frontend Platform; check `pricing` + `inventory`
`bff-tenant-booking_hold_to_confirm_drop`	success_ratio < 80% for 15 min	Page on-call; payment-gateway likely involved
`bff-tenant-booking_handoff_invalid_spike`	invalid signature rate > 2/min for 5 min	Page Security; possible HMAC compromise or rotation skew
`bff-tenant-booking_outbox_lag`	> 60 s for 5 min	Page SRE; Pub/Sub or DB issue
`bff-tenant-booking_circuit_open`	any upstream circuit open > 5 min	Page on-call
`bff-tenant-booking_redis_failover`	failover event	Auto-acknowledged; verify session continuity
`bff-tenant-booking_postgres_down`	uptime check fails for 2 min	Page SRE
`bff-tenant-booking_dlq_depth`	DLQ > 50	Page SRE
`bff-tenant-booking_pii_in_telemetry`	synthetic probe finds raw PII	Page Security; halt outbox publisher

Each alert links to a runbook in runbooks.melmastoon.ghasi.io/bff-tenant/<short-name>.
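
Most conditions in the table have the form "value beyond a threshold for N minutes". The sketch below shows one simple reading of that form, assuming the signal is evaluated at a fixed interval so that "for 5 min" becomes "the last 5 one-minute samples"; the function name and the sample values are hypothetical.

```typescript
// True when the most recent `required` samples all exceed `threshold`.
// Assumes samples are ordered oldest-to-newest at a fixed interval.
function sustainedBreach(
  values: number[],
  threshold: number,
  required: number,
): boolean {
  if (required <= 0 || values.length < required) return false;
  return values.slice(-required).every((v) => v > threshold);
}

// p95_latency_burn_fast for the hold route: SLO is 600 ms, so the alert
// threshold is 2 x 600 = 1200 ms, held for five one-minute samples.
const holdSloMs = 600;
const holdP95PerMinute = [700, 1300, 1250, 1400, 1500, 1210]; // made-up data
const shouldPage = sustainedBreach(holdP95PerMinute, 2 * holdSloMs, 5);
```

Requiring every sample in the window to breach avoids paging on a single slow minute, at the cost of a detection delay equal to the window length.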

9. Alerts (P2)

Alert	Condition
`bff-tenant-booking_cache_hit_ratio_drop`	bootstrap cache hit < 80% for 30 min
`bff-tenant-booking_singleflight_collapse`	followers / leader > 1.5 sustained
`bff-tenant-booking_abandoned_rate_anomaly`	abandoned_rate > p95 baseline + 20% for 1 h
`bff-tenant-booking_currency_change_spike`	currency changes/sec > 3× baseline (potential UX bug)

10. Synthetic monitoring

Cloud Monitoring uptime checks every 60 s:

GET /health/ready from 4 regions
GET /bff/tenant-booking/v1/staging-test/bootstrap (canary tenant)

Playwright nightly canary: full happy-path booking against stage (search → quote → hold → guest → payment-stub → confirm).

11. Trace sampling

Default head sampler: 5%.
A tail-based sampler in the OTel Collector raises sampling to 100% for spans with status_class = 5xx, error_code != null, or duration > p99.
During an incident, SRE sets the head sample rate to 100% via the bff-tenant-flags Memorystore key (time-bound; auto-reverts after 1 h).
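
The combined policy can be sketched as a single keep/drop decision. This is a simplification for illustration: in the deployed system the head decision is made in the SDK and the tail decision in the collector, and the type and function names here are hypothetical.

```typescript
// Hypothetical summary of a finished span, as seen by the sampler.
interface FinishedSpan {
  statusClass: string; // e.g. "2xx", "5xx"
  errorCode: string | null;
  durationMs: number;
}

// Keep every span matching a tail rule; otherwise fall back to the
// probabilistic head rate. `random` is injectable so tests are deterministic.
function shouldKeepSpan(
  span: FinishedSpan,
  p99Ms: number,
  headRate: number = 0.05,
  random: () => number = Math.random,
): boolean {
  if (span.statusClass === "5xx") return true;
  if (span.errorCode !== null) return true;
  if (span.durationMs > p99Ms) return true;
  return random() < headRate;
}

const slowSpan: FinishedSpan = { statusClass: "2xx", errorCode: null, durationMs: 1500 };
const fastSpan: FinishedSpan = { statusClass: "2xx", errorCode: null, durationMs: 80 };

const keepSlow = shouldKeepSpan(slowSpan, 1200); // kept: slower than p99
const keepFastDuringIncident = shouldKeepSpan(fastSpan, 1200, 1.0); // kept: head rate 100%
```

Setting `headRate` to 1.0 corresponds to the incident toggle: every span is kept regardless of the tail rules.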

12. Log retention

Log class	Retention
`audit.*`	7 years (regulatory)
`request.*`	30 d hot, 90 d cold
`error.*`	90 d hot, 1 y cold
`debug.*`	7 d

All logs export to BigQuery via Log Router for the audit lake.

13. Correlation IDs

Every BFF response carries X-Request-Id and traceparent. The same X-Request-Id propagates as the upstream request id when calling reservation-service, pricing-service, and the other upstreams, so a single trace spans the booking saga.
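
A minimal sketch of building the outbound correlation headers for an upstream call. The `traceparent` layout follows the W3C Trace Context format (version, 32-hex trace id, 16-hex parent span id, flags); the function names and the injected `mintRequestId` callback are assumptions, since the source does not specify how request ids are minted.

```typescript
const TRACE_ID = /^[0-9a-f]{32}$/;
const SPAN_ID = /^[0-9a-f]{16}$/;

// Formats a W3C traceparent value: version-traceid-parentid-flags.
function formatTraceparent(
  traceId: string,
  spanId: string,
  sampled: boolean,
): string {
  if (!TRACE_ID.test(traceId) || !SPAN_ID.test(spanId)) {
    throw new Error("invalid trace or span id");
  }
  return `00-${traceId}-${spanId}-${sampled ? "01" : "00"}`;
}

// Reuses the inbound request id when present; otherwise mints one.
function upstreamHeaders(
  incomingRequestId: string | undefined,
  mintRequestId: () => string,
  traceId: string,
  spanId: string,
  sampled: boolean,
): Record<string, string> {
  return {
    "x-request-id": incomingRequestId ?? mintRequestId(),
    traceparent: formatTraceparent(traceId, spanId, sampled),
  };
}

const outbound = upstreamHeaders(
  "req_example", // placeholder inbound id
  () => "req_minted", // placeholder minting function
  "4bf92f3577b34da6a3ce929d0e0e4736",
  "00f067aa0ba902b7",
  true,
);
```

In practice the OpenTelemetry propagator injects `traceparent` automatically; the explicit formatter is shown only to make the header layout visible.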

14. Cost observability

Cloud Billing budget alerts at 50/80/100/120% of monthly budget.
Per-tenant inferred-cost dashboard (RPS × ms × cost-per-CPU-second).
Pub/Sub egress monitored separately; sampling adjustable per-event.
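
The inferred-cost formula can be worked through as below. The sketch assumes that "ms" means mean CPU time per request and that the result is wanted per hour; the function name and the unit price in the example are made up for illustration.

```typescript
// RPS x (ms / 1000) gives CPU-seconds consumed per second of wall time;
// multiplying by the unit price gives cost per second, then scale to an hour.
function inferredCostPerHour(
  requestsPerSecond: number,
  meanCpuMsPerRequest: number,
  costPerCpuSecond: number,
): number {
  const cpuSecondsPerSecond = requestsPerSecond * (meanCpuMsPerRequest / 1000);
  return cpuSecondsPerSecond * costPerCpuSecond * 3600;
}

// Example tenant: 20 RPS at 50 ms CPU per request, with a hypothetical
// price of 0.00002 currency units per CPU-second.
const tenantHourlyCost = inferredCostPerHour(20, 50, 0.00002);
console.log(tenantHourlyCost.toFixed(3)); // prints "0.072"
```

The figure is an attribution estimate rather than a bill: it ignores idle capacity, memory, and egress, which is why Pub/Sub egress is tracked separately.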

15. SLO error-budget policy

As the 28-day error budget is consumed, the following actions apply at each threshold:

Consumption	Action
25%	Awareness — Slack notification
50%	Investigate — TODO ticket auto-created
75%	Freeze non-critical changes; SRE pair with Frontend Platform
100%	Hard freeze; revert recent changes; no new deploys until budget recovers
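
The thresholds above map to actions as in the following sketch, which picks the action for the highest threshold reached. The type and function names are illustrative, not part of any existing tooling.

```typescript
type BudgetAction =
  | "none"
  | "awareness"
  | "investigate"
  | "freeze-non-critical"
  | "hard-freeze";

// Returns the policy action for a given share of the 28-day budget consumed.
function budgetAction(consumedPercent: number): BudgetAction {
  if (consumedPercent >= 100) return "hard-freeze";
  if (consumedPercent >= 75) return "freeze-non-critical";
  if (consumedPercent >= 50) return "investigate";
  if (consumedPercent >= 25) return "awareness";
  return "none";
}

const currentAction = budgetAction(60); // "investigate"
```

Checking thresholds from highest to lowest ensures that a budget overrun (above 100%) still resolves to the hard freeze.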
