OBSERVABILITY — bff-tenant-booking-service
Sibling: SECURITY_MODEL · FAILURE_MODES · API_CONTRACTS
Cross-cutting: 02 Enterprise Architecture · §11 Observability
1. Stack
| Concern | Tool |
|---|---|
| Tracing | OpenTelemetry → Cloud Trace + Tempo |
| Metrics | OpenTelemetry → Cloud Monitoring + Prometheus → Grafana |
| Logs | Pino → Cloud Logging → BigQuery (audit lake) |
| Profiling | Cloud Profiler (continuous, 1% sample) |
| Synthetic monitoring | Cloud Monitoring uptime checks + Playwright canary on stage |
| Alerting | Cloud Monitoring + PagerDuty |
The OpenTelemetry SDK is initialized before NestFactory in main.ts, per SERVICE_TEMPLATE, so that instrumentation is registered before the application modules load.
2. SLIs and SLOs
| SLI | SLO | Window | Error budget |
|---|---|---|---|
| bootstrap_p95_latency | < 350 ms (warm) | 28 d | 5% |
| availability_p95_latency | < 600 ms | 28 d | 5% |
| quote_p95_latency | < 250 ms | 28 d | 1% |
| hold_p95_latency | < 600 ms | 28 d | 1% |
| confirm_p95_latency | < 1.2 s | 28 d | 1% |
| confirmation_first_byte_p95 | < 800 ms | 28 d | 5% |
| availability (uptime) | 99.95% | 28 d | ≈ 20 min / 28 d |
| bootstrap_success_ratio | ≥ 99.9% | 28 d | 0.1% |
| quote_to_hold_success_ratio | ≥ 95% | 28 d | 5% |
| hold_to_confirm_success_ratio | ≥ 90% | 28 d | 10% |
| confirm_idempotent_correctness | 100% | continuous | 0 |
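
The time-based budget for the uptime row follows from the target and the window: (1 − 0.9995) × 28 × 24 × 60 ≈ 20.16 min. A minimal sketch of that arithmetic is shown here; the function names `errorBudgetMinutes` and `budgetConsumed` are illustrative and not part of the service.

```typescript
// Allowed "bad" minutes in a window for a given availability target.
// sloTarget is a fraction, e.g. 0.9995 for 99.95%.
function errorBudgetMinutes(sloTarget: number, windowDays: number): number {
  if (sloTarget <= 0 || sloTarget > 1) {
    throw new RangeError("sloTarget must be in (0, 1]");
  }
  const windowMinutes = windowDays * 24 * 60;
  return (1 - sloTarget) * windowMinutes;
}

// Fraction of the budget already spent, given observed bad minutes.
// A 100% target has a zero budget, so any bad minute exhausts it.
function budgetConsumed(badMinutes: number, sloTarget: number, windowDays: number): number {
  const budget = errorBudgetMinutes(sloTarget, windowDays);
  if (budget === 0) {
    return badMinutes > 0 ? Infinity : 0;
  }
  return badMinutes / budget;
}

const uptimeBudgetMinutes = errorBudgetMinutes(0.9995, 28);
console.log(uptimeBudgetMinutes.toFixed(2)); // 20.16
```

The consumed fraction is the input to the error-budget policy in section 15.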
3. RED metrics (per route)
For every route the BFF emits the canonical RED metrics:
bff_tenant_booking_request_total{tenant_id, route, method, status_class}
bff_tenant_booking_request_duration_seconds{tenant_id, route, method, status_class} (histogram)
bff_tenant_booking_request_inflight{route}
bff_tenant_booking_errors_total{tenant_id, route, error_code}
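
As an illustration of how the labels above could be derived, the sketch below maps a status code to `status_class` and counts requests per label set. `LabeledCounter` is a hypothetical in-memory stand-in for a real metrics counter, and the sketch assumes `route` is the controller's route template rather than the raw URL path, which keeps label cardinality bounded.

```typescript
type StatusClass = "1xx" | "2xx" | "3xx" | "4xx" | "5xx";

// Collapse an HTTP status code into its class, e.g. 201 -> "2xx".
function statusClass(statusCode: number): StatusClass {
  if (!Number.isInteger(statusCode) || statusCode < 100 || statusCode > 599) {
    throw new RangeError(`not an HTTP status code: ${statusCode}`);
  }
  return `${Math.floor(statusCode / 100)}xx` as StatusClass;
}

interface RequestLabels {
  tenant_id: string;
  route: string; // assumed to be a route template such as "POST /hold"
  method: string;
  status_class: StatusClass;
}

// Hypothetical in-memory counter keyed by the full label set.
class LabeledCounter {
  private readonly series = new Map<string, number>();

  private key(l: RequestLabels): string {
    return JSON.stringify([l.tenant_id, l.route, l.method, l.status_class]);
  }

  inc(labels: RequestLabels, by = 1): void {
    const k = this.key(labels);
    this.series.set(k, (this.series.get(k) ?? 0) + by);
  }

  get(labels: RequestLabels): number {
    return this.series.get(this.key(labels)) ?? 0;
  }
}
```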
4. USE metrics (resources)
bff_tenant_booking_redis_pool_inuse / total
bff_tenant_booking_postgres_pool_inuse / total
bff_tenant_booking_outbox_depth
bff_tenant_booking_outbox_lag_seconds
bff_tenant_booking_circuit_breaker_state{upstream}
bff_tenant_booking_cache_hit_total{cache}
bff_tenant_booking_cache_miss_total{cache}
bff_tenant_booking_singleflight_followers_total{key_prefix}
bff_tenant_booking_handoff_consume_total{outcome}
bff_tenant_booking_payment_return_total{outcome}
bff_tenant_booking_draft_active_count
bff_tenant_booking_draft_abandoned_total{reason}
bff_tenant_booking_draft_converted_total
5. Trace attributes (mandatory)
Every span carries:
| Key | Source | Cardinality |
|---|---|---|
| service.name | bff-tenant-booking-service | low |
| tenant.id | resolved per request | medium (per tenant) |
| tenant.slug | resolved per request | medium |
| request.id | header / minted | high (sampled) |
| session.id | cookie / minted | high (sampled) |
| route.name | controller mapping | low |
| client.surface | header | low |
| cache.outcome | per-cache lookup | low |
| upstream.name | per upstream call | low |
| upstream.deadline_ms | per upstream call | low |
| circuit.state | per upstream call | low |
| flow.state.from / flow.state.to | flow transitions | low |
| bff.draft.id | when in flow | high (sampled) |
| handoff.id | when consumed | high (sampled) |
| handoff.replayed | bool | low |
| payment.intent.id | when present | high (sampled) |
| idempotency.key | when present | high (sampled) |
| ai.provenance.id | when AI surfaced | high (sampled) |
6. Log fields (mandatory)
{
"ts": "2026-04-23T09:14:22.041Z",
"level": "info",
"service": "bff-tenant-booking-service",
"instance": "bff-tenant-asia-south1-7f8d9c-x4z2",
"traceId": "00-...",
"spanId": "...",
"requestId": "req_01H...",
"tenantId": "tnt_01H...",
"tenantSlug": "kabul-grand-hotel",
"sessionId": "tnt_session_01H...",
"route": "POST /hold",
"statusCode": 201,
"latencyMs": 412,
"cacheOutcome": "MISS",
"upstream": [{"name":"reservation-service","latencyMs":380,"status":"ok"}],
"msg": "hold_created",
"draftId": "bdr_01H..."
}
PII fields (email, phone, name) are NEVER logged in raw form; only hashes.
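
One way to enforce the hash-only rule before a record reaches the logger is sketched below. It is an assumption-laden illustration, not the service's actual redaction path: `redactPii` and the `<field>Hash` naming are hypothetical, and the hash function is injected so that the choice of algorithm (for example a keyed HMAC-SHA-256) stays outside the sketch.

```typescript
// Injected hash function; algorithm and key management are out of scope here.
type HashFn = (value: string) => string;

// Field names taken from the rule above.
const PII_FIELDS: readonly string[] = ["email", "phone", "name"];

// Returns a copy of the record in which PII fields are replaced by
// "<field>Hash" entries. Nested plain objects are handled recursively;
// arrays are copied as-is, so PII inside arrays is NOT covered by this sketch.
function redactPii(record: Record<string, unknown>, hash: HashFn): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const key of Object.keys(record)) {
    const value = record[key];
    if (PII_FIELDS.includes(key)) {
      // The raw value is never copied; non-string PII is dropped entirely.
      if (typeof value === "string") {
        out[`${key}Hash`] = hash(value);
      }
    } else if (value !== null && typeof value === "object" && !Array.isArray(value)) {
      out[key] = redactPii(value as Record<string, unknown>, hash);
    } else {
      out[key] = value;
    }
  }
  return out;
}
```

Redacting by field name only catches PII that arrives under a known key; free-text fields such as `msg` would need separate handling.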
7. Dashboards
7.1 "Funnel health" (executive)
- Conversion rate by tenant (last 7 d)
- Abandoned-cart % by stage
- Handoff arrival → conversion
- Top 5 error codes per stage
- Revenue captured (display currency, by tenant)
7.2 "Service SLO" (SRE)
- p50/p95/p99 latency by route
- Error rate by route + status class
- SLO burn rates (1 h, 6 h, 1 d, 7 d)
- Upstream dependency health
- Circuit breaker state timeline
- Cache hit ratios
7.3 "Booking flow" (product + on-call)
- Live flow-state distribution
- Step-completion durations p50/p95
- Drop-off histogram
- Handoff arrival count
- Confirmation views
- AI surface display vs interaction
7.4 "Per-tenant" (CSM)
Drill-down per tenant:
- Daily bookings + revenue
- Funnel chart (search → quote → hold → confirm)
- Top error codes
- Latency distribution
- Bot-detector activity
8. Alerts (P1)
| Alert | Condition | Action |
|---|---|---|
| bff-tenant-booking_p95_latency_burn_fast | route p95 > 2× SLO for 5 min | Page on-call; check upstream + cache hit |
| bff-tenant-booking_error_rate_burn_fast | overall 5xx > 1% for 5 min | Page on-call |
| bff-tenant-booking_quote_to_hold_drop | success_ratio < 90% for 15 min | Page Frontend Platform; check pricing + inventory |
| bff-tenant-booking_hold_to_confirm_drop | success_ratio < 80% for 15 min | Page on-call; payment-gateway likely involved |
| bff-tenant-booking_handoff_invalid_spike | invalid signature rate > 2/min for 5 min | Page Security; possible HMAC compromise or rotation skew |
| bff-tenant-booking_outbox_lag | > 60 s for 5 min | Page SRE; Pub/Sub or DB issue |
| bff-tenant-booking_circuit_open | any upstream circuit open > 5 min | Page on-call |
| bff-tenant-booking_redis_failover | failover event | Auto-acknowledged; verify session continuity |
| bff-tenant-booking_postgres_down | uptime check fails for 2 min | Page SRE |
| bff-tenant-booking_dlq_depth | DLQ > 50 | Page SRE |
| bff-tenant-booking_pii_in_telemetry | synthetic probe finds raw PII | Page Security; halt outbox publisher |
Each alert links to a runbook in runbooks.melmastoon.ghasi.io/bff-tenant/<short-name>.
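
Most P1 conditions have the shape "value above a threshold for N minutes". The sketch below shows one way to evaluate that shape over a time-ordered series; in practice the alerting backend evaluates these conditions, so `breachedFor` and the `Sample` type are hypothetical and only illustrate the semantics.

```typescript
interface Sample {
  tsMs: number; // sample timestamp, epoch milliseconds
  value: number; // e.g. route p95 latency in ms
}

// True when every sample in the most recent unbroken run exceeds the
// threshold and that run spans at least durationMs. Samples must be
// sorted by ascending timestamp.
function breachedFor(samples: Sample[], threshold: number, durationMs: number): boolean {
  if (samples.length === 0) {
    return false;
  }
  const latest = samples[samples.length - 1];
  let runStart: number | null = null;
  // Walk backwards until the first sample that is within the threshold.
  for (let i = samples.length - 1; i >= 0; i--) {
    if (samples[i].value <= threshold) {
      break;
    }
    runStart = samples[i].tsMs;
  }
  return runStart !== null && latest.tsMs - runStart >= durationMs;
}

// Example: quote_p95_latency SLO is 250 ms, so the fast-burn threshold is 2 × 250 = 500 ms.
const quoteThresholdMs = 2 * 250;
const fiveMinutesMs = 5 * 60 * 1000;
```

Calling `breachedFor(series, quoteThresholdMs, fiveMinutesMs)` with one sample per minute would fire only once six consecutive samples (covering a full five minutes) exceed 500 ms.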
9. Alerts (P2)
| Alert | Condition |
|---|---|
| bff-tenant-booking_cache_hit_ratio_drop | bootstrap cache hit < 80% for 30 min |
| bff-tenant-booking_singleflight_collapse | followers / leader > 1.5 sustained |
| bff-tenant-booking_abandoned_rate_anomaly | abandoned_rate > p95 baseline + 20% for 1 h |
| bff-tenant-booking_currency_change_spike | currency changes/sec > 3× baseline (potential UX bug) |
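
Because bff_tenant_booking_cache_hit_total and bff_tenant_booking_cache_miss_total are cumulative counters, a windowed hit ratio has to be computed from the difference between two readings rather than from the raw totals. The following sketch shows that calculation; `CacheCounters` and `windowHitRatio` are hypothetical names, and a monitoring backend would normally do this with a rate query.

```typescript
interface CacheCounters {
  hits: number; // reading of cache_hit_total
  misses: number; // reading of cache_miss_total
}

// Hit ratio over the interval between two counter readings.
// Returns null when there was no traffic or a counter reset
// (a negative delta), since no meaningful ratio exists then.
function windowHitRatio(start: CacheCounters, end: CacheCounters): number | null {
  const hits = end.hits - start.hits;
  const misses = end.misses - start.misses;
  const lookups = hits + misses;
  if (hits < 0 || misses < 0 || lookups === 0) {
    return null;
  }
  return hits / lookups;
}

// The P2 condition above treats anything under 80% as a drop.
function isHitRatioDrop(ratio: number | null, floor = 0.8): boolean {
  return ratio !== null && ratio < floor;
}
```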
10. Synthetic monitoring
Cloud Monitoring uptime checks every 60 s:
- GET /health/ready from 4 regions
- GET /bff/tenant-booking/v1/staging-test/bootstrap (canary tenant)

Playwright nightly canary: full happy-path booking against stage (search → quote → hold → guest → payment-stub → confirm).
11. Trace sampling
- Default head sampler: 5%.
- Tail-based sampler in OTel collector elevates to 100% for spans with status_class = 5xx, error_code != null, or duration > p99.
- During an incident: SRE raises the head sampling rate to 100% via the bff-tenant-flags Memorystore key (timer-bound, auto-reverts after 1 h).
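
The sampling rules above can be read as a single keep/drop decision, sketched below. This collapses the head and tail stages into one function purely for illustration; how the two stages compose in the real pipeline is configured in the SDK and the OTel collector, and all names here (`CompletedTrace`, `SamplerState`, `shouldKeep`) are hypothetical.

```typescript
interface CompletedTrace {
  statusClass: string; // e.g. "2xx", "5xx"
  errorCode: string | null;
  durationMs: number;
}

interface SamplerState {
  defaultHeadRatio: number; // 0.05 by default
  p99DurationMs: number; // current p99, supplied by the caller
  // Incident override with an absolute expiry, modelling the 1 h auto-revert.
  incidentOverride?: { ratio: number; expiresAtMs: number };
}

// The override applies only until it expires; afterwards the default returns.
function effectiveHeadRatio(state: SamplerState, nowMs: number): number {
  const override = state.incidentOverride;
  if (override !== undefined && nowMs < override.expiresAtMs) {
    return override.ratio;
  }
  return state.defaultHeadRatio;
}

// Errors and slow traces are always kept; everything else is kept
// with probability equal to the effective head ratio.
function shouldKeep(
  trace: CompletedTrace,
  state: SamplerState,
  nowMs: number,
  random: () => number = Math.random
): boolean {
  const elevated =
    trace.statusClass === "5xx" ||
    trace.errorCode !== null ||
    trace.durationMs > state.p99DurationMs;
  if (elevated) {
    return true;
  }
  return random() < effectiveHeadRatio(state, nowMs);
}
```

Passing `random` as a parameter keeps the decision deterministic under test.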
12. Log retention
| Log class | Retention |
|---|---|
| audit.* | 7 years (regulatory) |
| request.* | 30 d hot, 90 d cold |
| error.* | 90 d hot, 1 y cold |
| debug.* | 7 d |
All logs export to BigQuery via Log Router for the audit lake.
13. Correlation IDs
Every BFF response carries X-Request-Id and traceparent. The same X-Request-Id propagates as the upstream request id when calling reservation-service, pricing-service, etc., so a single trace spans the booking saga.
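
A minimal sketch of the "header or minted" rule for X-Request-Id, together with a validator for the version-00 `traceparent` format from W3C Trace Context, is given below. The helper names are hypothetical, header keys are assumed to be lower-cased already, and the minting function is injected because the ID format is not specified here. In a real service the OpenTelemetry propagator handles `traceparent`; the parser only illustrates the header's structure.

```typescript
type IncomingHeaders = Record<string, string | undefined>;

interface TraceContext {
  version: string;
  traceId: string; // 32 lowercase hex characters
  parentId: string; // 16 lowercase hex characters
  sampled: boolean; // lowest bit of the trace-flags byte
}

// version "-" trace-id "-" parent-id "-" trace-flags (version 00 layout).
const TRACEPARENT_RE = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(value: string | undefined): TraceContext | null {
  if (value === undefined) {
    return null;
  }
  const m = TRACEPARENT_RE.exec(value.trim());
  if (m === null) {
    return null;
  }
  const [, version, traceId, parentId, flags] = m;
  // The spec forbids version "ff" and all-zero trace or parent IDs.
  if (version === "ff" || /^0+$/.test(traceId) || /^0+$/.test(parentId)) {
    return null;
  }
  return { version, traceId, parentId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

// Reuse the caller's request ID when present, otherwise mint a new one.
function resolveRequestId(headers: IncomingHeaders, mint: () => string): string {
  const incoming = headers["x-request-id"]?.trim();
  return incoming ? incoming : mint();
}

// Headers to attach to an upstream call so the same ID follows the saga.
function upstreamHeaders(requestId: string): Record<string, string> {
  return { "x-request-id": requestId };
}
```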
14. Cost observability
- Cloud Billing budget alerts at 50/80/100/120% of monthly budget.
- Per-tenant inferred-cost dashboard (RPS × ms × cost-per-CPU-second).
- Pub/Sub egress monitored separately; sampling adjustable per-event.
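
The per-tenant inferred-cost formula (RPS × ms × cost-per-CPU-second) can be worked through as follows. This sketch assumes the "ms" term is the average request duration used as a proxy for CPU time, and the traffic and price figures in the example are made-up illustrations, not real billing data.

```typescript
// Inferred cost over a period:
//   rps × (msPerRequest / 1000) gives CPU-seconds consumed per second,
//   which is then priced and scaled by the length of the period.
function inferredCost(
  rps: number,
  msPerRequest: number,
  costPerCpuSecond: number,
  periodSeconds: number
): number {
  const cpuSecondsPerSecond = rps * (msPerRequest / 1000);
  return cpuSecondsPerSecond * costPerCpuSecond * periodSeconds;
}

// Illustrative numbers only: 50 RPS at 40 ms per request,
// priced at 0.00002 currency units per CPU-second, over one day.
const dailyCost = inferredCost(50, 40, 0.00002, 24 * 60 * 60);
console.log(dailyCost.toFixed(3)); // 3.456
```

Since request duration includes time spent waiting on upstreams, this proxy tends to overstate CPU cost for I/O-heavy routes.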
15. SLO error-budget policy
As the 28-day error budget is consumed, the following actions apply at each threshold:
| Consumption | Action |
|---|---|
| 25% | Awareness — Slack notification |
| 50% | Investigate — TODO ticket auto-created |
| 75% | Freeze non-critical changes; SRE pair with Frontend Platform |
| 100% | Hard freeze; revert recent changes; no new deploys until budget recovers |
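
The policy table maps directly onto a threshold lookup, sketched below as it might appear in a deploy gate. The action identifiers and the `deployAllowed` helper are hypothetical; only the thresholds come from the table.

```typescript
type BudgetAction = "none" | "notify_slack" | "open_ticket" | "soft_freeze" | "hard_freeze";

// consumed is the fraction of the 28-day budget spent (1.0 = 100%).
function budgetAction(consumed: number): BudgetAction {
  if (consumed >= 1.0) return "hard_freeze";
  if (consumed >= 0.75) return "soft_freeze";
  if (consumed >= 0.5) return "open_ticket";
  if (consumed >= 0.25) return "notify_slack";
  return "none";
}

// Soft freeze blocks non-critical changes only; hard freeze blocks all deploys.
function deployAllowed(consumed: number, isCritical: boolean): boolean {
  const action = budgetAction(consumed);
  if (action === "hard_freeze") return false;
  if (action === "soft_freeze") return isCritical;
  return true;
}
```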