FAILURE_MODES — bff-tenant-booking-service
Sibling: APPLICATION_LOGIC · OBSERVABILITY · SECURITY_MODEL · TESTING_STRATEGY
Cross-cutting: 02 Enterprise Architecture · §10 Failure Posture · Standards · ERROR_CODES
This document catalogues what breaks, how the user (consumer-facing tenant guest) experiences it, how we detect it, and how we mitigate it. The booking BFF carries money risk: failures here can result in over-charges, missed reservations, or stranded inventory holds. The most dangerous classes are payment-return double processing, handoff replay, schema drift on quote, and inventory-confirm race.
User impact column legend: G = guest on the tenant booking surface (web or mobile). The Electron desktop and consumer meta surface are never affected by this BFF.
1. Failure catalogue
1.1 Upstream service failures
| # | Failure | Detection | User impact (G) | Mitigation | Runbook |
|---|---|---|---|---|---|
| F-1 | tenant-service slug resolution down | Per-upstream alert | Bootstrap returns 502 on cache miss; cached slugs continue to work for 1 h | 1 h slug cache absorbs short outages; cache miss + outage → 502 + MELMASTOON.BFF.UPSTREAM_UNAVAILABLE | bff-tenant/slug-resolver-down |
| F-2 | theme-config-service down | Per-upstream alert | Bootstrap serves last-good cache up to 30 min stale; banner: "theme may be outdated" | Memorystore cache TTL 5 min + extended stale-while-error 30 min; CSS sheet served from CDN regardless | bff-tenant/theme-down |
| F-3 | inventory-service slow / down | Per-upstream alert | /availability returns partial result with stale=true; rooms shown without "X left" badge | Stale cache up to 60 s; banner; availability_p95_latency alert | bff-tenant/inventory-degraded |
| F-4 | pricing-service quote 5xx | Per-upstream alert | /quote returns 502; user retries manually | No automatic retry on quote (non-idempotent); user sees provider-specific banner | bff-tenant/pricing-quote-down |
| F-5 | pricing-service cheapest fanout slow | Per-upstream alert | /availability shows rooms without prices | 800 ms deadline; null-price branch; priceUnavailable=true per room | bff-tenant/pricing-cheapest-degraded |
| F-6 | reservation-service hold 5xx | Per-route alert | /hold returns 502; user sees "couldn't reserve, try again" banner | Idempotency-Key prevents double-hold on retry; circuit opens after 3 failures / 15 s | bff-tenant/reservation-hold-down |
| F-7 | reservation-service confirm 5xx | Per-route alert + missing-confirm metric | /return returns 504 after timeout; user sees "we're confirming, please wait" | Idem-Key (confirm:<rsvId>:<providerRef>) prevents double-confirm; UI polls confirmation page; on-call investigates any unresolved confirmation within 5 min | bff-tenant/reservation-confirm-down |
| F-8 | payment-gateway-service 5xx | Per-route alert | /payment-intent returns 502; user sees provider error | Banner with "try a different method"; circuit opens; no retry (provider state ambiguous) | bff-tenant/payment-gateway-down |
| F-9 | payment-gateway-service verifyReturn ambiguous (2xx but no decisive status) | Logged + payment_return_ambiguous_total metric | /return returns 504; UI redirects to confirmation polling | UI polls /confirmation/{rsv} every 3 s for 60 s; if the reservation status flips to confirmed, we honor it; otherwise the user is directed to contact support | bff-tenant/payment-ambiguous |
| F-10 | billing-service down on confirmation | Per-upstream alert | Confirmation page renders without folio summary; folioUnavailable=true | Soft-fail: render reservation block; show "folio loading"; refresh button | bff-tenant/billing-soft-fail |
| F-11 | lock-integration-service down on confirmation | Per-upstream alert | Confirmation page renders without key-credential placeholder | Soft-fail: omit placeholder; mention "your key will be available 24h before arrival" | bff-tenant/lock-soft-fail |
| F-12 | ai-orchestrator-service down | Per-upstream alert | AI surfaces (recommendations, policy summary) hidden | Silent degrade; meta.aiUnavailable=true; no banner | bff-tenant/ai-down |
| F-13 | Schema drift from any upstream | Zod parse failure on response | Affected route returns 502 + MELMASTOON.BFF.SCHEMA_DRIFT | Parser logs full payload (sanitized) at WARN; on-call responds within 15 min; provider is rolled back or BFF schema is patched | bff/schema-drift |
| F-14 | bff-consumer-service /internal/handoff/{id}/consume reachability lost | Internal mTLS alert | Handoff arrival cannot mark consumer-side consumed | We mark our side consumed regardless; consumer-side eventual reconciliation by sweep job | bff-tenant/internal-handoff-callback |
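
F-6 and F-8 rely on a circuit that opens after repeated upstream failures. The sketch below reads "3 failures / 15 s" as three failures inside a sliding 15-second window. The class name, the injected clock, and the 30 s cooldown are assumptions for illustration and are not taken from the service code.

```typescript
// Hypothetical sketch of a failures-in-window circuit breaker.
type BreakerState = "closed" | "open";

class FailureWindowBreaker {
  private failures: number[] = []; // timestamps (ms) of recent failures
  private openedAt: number | null = null;
  private readonly maxFailures: number;
  private readonly windowMs: number;
  private readonly cooldownMs: number; // assumption: not specified in the catalogue

  constructor(maxFailures: number, windowMs: number, cooldownMs: number) {
    this.maxFailures = maxFailures;
    this.windowMs = windowMs;
    this.cooldownMs = cooldownMs;
  }

  // Returns the current state; closes the circuit once the cooldown has elapsed.
  state(nowMs: number): BreakerState {
    if (this.openedAt !== null && nowMs - this.openedAt >= this.cooldownMs) {
      this.openedAt = null;
      this.failures = [];
    }
    return this.openedAt === null ? "closed" : "open";
  }

  // Records a failure and opens the circuit when the window holds enough of them.
  recordFailure(nowMs: number): void {
    this.failures = this.failures.filter((t) => nowMs - t < this.windowMs);
    this.failures.push(nowMs);
    if (this.failures.length >= this.maxFailures) {
      this.openedAt = nowMs;
    }
  }

  // A success clears the failure history.
  recordSuccess(): void {
    this.failures = [];
  }
}

// 3 failures within 15 s, as in F-6; the 30 s cooldown is an assumed value.
const breaker = new FailureWindowBreaker(3, 15000, 30000);
breaker.recordFailure(0);
breaker.recordFailure(5000);
const afterTwo = breaker.state(5000); // "closed"
breaker.recordFailure(10000);
const afterThree = breaker.state(10000); // "open"
const afterCooldown = breaker.state(40000); // "closed"
```

Passing the clock in as a parameter keeps the breaker deterministic under test. A production breaker would typically add a half-open probe state, which this sketch omits.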
1.2 Stateful dependency failures
| # | Failure | Detection | User impact (G) | Mitigation | Runbook |
|---|---|---|---|---|---|
| F-15 | Memorystore cache tier down | Health check + Redis client errors | Higher upstream load; latency rises; bootstrap cache disabled | Read-through to upstream; readiness reflects state; circuit opens on Redis at a 5% error rate | bff-tenant/memorystore-cache-down |
| F-16 | Memorystore session tier down | Health check + Redis client errors | Mutating endpoints fail; 503 + MELMASTOON.BFF.CACHE_UNAVAILABLE; user retries | Cannot serve booking flow without session tier; readiness reflects DOWN; auto-failover to standby | bff-tenant/memorystore-session-down |
| F-17 | Memorystore failover (standby promoted) | Failover event from GCP | < 30 s elevated latency; in-flight drafts may be lost | Drafts that had not yet been snapshotted are lost; a client reload of /draft/{id} returns 404 and restarts the funnel; idempotency keys preserve correctness across retries | bff-tenant/memorystore-failover |
| F-18 | Cloud SQL primary down | Postgres health check | /handoff/consume and idempotency writes fail; mutating endpoints 503 | HA failover (~ 60 s); idempotency keys absorb retries | bff-tenant/postgres-down |
| F-19 | Outbox grows (Pub/Sub publisher down) | Outbox-depth alert at 5k / 50k / 250k | None visible; storage pressure risk | outbox-relay retries with backoff; on-call investigates | platform/outbox-backlog |
| F-20 | Pub/Sub publish 100% failure | Publish error alert | None visible | Outbox queues; no event loss; recovers when Pub/Sub healthy | platform/pubsub-publish-down |
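
Several rows above (F-17, F-18) and F-7 depend on idempotency keys absorbing retries. The following is a minimal in-memory sketch of that replay behaviour. The key shape comes from F-7; the `IdempotencyStore` class and `ConfirmResult` type are hypothetical stand-ins for the Postgres-backed implementation, which would also need an atomic insert (for example the unique constraint mentioned in F-26) to be safe under concurrency.

```typescript
// Hypothetical result type for a confirm call.
interface ConfirmResult {
  reservationId: string;
  status: "confirmed";
}

// Key shape taken from F-7: confirm:<rsvId>:<providerRef>.
function confirmKey(reservationId: string, providerRef: string): string {
  return `confirm:${reservationId}:${providerRef}`;
}

// In-memory stand-in for the idempotency table (assumption for illustration).
class IdempotencyStore {
  private readonly results = new Map<string, ConfirmResult>();
  executions = 0; // how many times the side effect actually ran

  run(key: string, effect: () => ConfirmResult): ConfirmResult {
    const cached = this.results.get(key);
    if (cached !== undefined) {
      return cached; // retry: replay the stored result, skip the side effect
    }
    const result = effect();
    this.executions += 1;
    this.results.set(key, result);
    return result;
  }
}

const store = new IdempotencyStore();
const key = confirmKey("rsv_123", "prov_abc");
const confirm = (): ConfirmResult => ({ reservationId: "rsv_123", status: "confirmed" });

const firstAttempt = store.run(key, confirm);
const retryAttempt = store.run(key, confirm); // same key: effect is not re-run
```

After both calls, `store.executions` is 1 and the retry receives the same stored result object as the first attempt.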
1.3 Edge / ingress failures
| # | Failure | Detection | User impact (G) | Mitigation | Runbook |
|---|---|---|---|---|---|
| F-21 | Custom-domain TLS cert expiry | Cloud Monitoring uptime check on TLS handshake | All requests on that domain fail | Cert Manager auto-renewal; alert at T-30 / T-7 / T-1 days | bff-tenant/custom-domain-tls |
| F-22 | Custom-domain DNS regression (CNAME removed by tenant) | Synthetic uptime check fails | Tenant booking site unreachable | We detect within 5 min; alert tenant via CSM channel; suggest DNS restoration | bff-tenant/custom-domain-dns |
| F-23 | Cloud Armor WAF rule false-positive on legitimate booking flow | Spike in 403 from booking traffic | Some users blocked mid-flow | Staging soak before prod rollout; rollback via Terraform | bff-tenant/waf-fp |
| F-24 | CDN cache poisoning | Synthetic monitor cross-locale fail | Some users see another tenant's content (catastrophic) | Vary header strict; tenant-scoped cache key never normalized; per-tenant cert SNI; nightly cross-tenant probe | bff-tenant/cdn-cross-tenant |
| F-25 | Bootstrap cached for too long across theme publish | theme.published.v1 consumer lag | Stale theme until invalidate processed | Inbox lag alert > 30 s; manual invalidate command available | bff-tenant/theme-cache-stale |
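
The T-30 / T-7 / T-1 alert schedule in F-21 can be expressed as a small tiering function. This is a sketch only: the function names are hypothetical, and it assumes that an alert fires for the tightest threshold already crossed, which the catalogue does not state explicitly.

```typescript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Alert thresholds in days before expiry, from F-21, in descending order.
const ALERT_DAYS = [30, 7, 1] as const;

// Whole days remaining until expiry (rounded down; negative once expired).
function daysUntil(expiryMs: number, nowMs: number): number {
  return Math.floor((expiryMs - nowMs) / MS_PER_DAY);
}

// Returns the tightest threshold already crossed, or null if more than 30 days remain.
function certAlertTier(expiryMs: number, nowMs: number): number | null {
  const remaining = daysUntil(expiryMs, nowMs);
  let tier: number | null = null;
  for (const threshold of ALERT_DAYS) {
    if (remaining <= threshold) {
      tier = threshold; // later (smaller) thresholds override earlier ones
    }
  }
  return tier;
}

const now = Date.UTC(2025, 0, 1);
const tierAt45Days = certAlertTier(now + 45 * MS_PER_DAY, now); // null
const tierAt10Days = certAlertTier(now + 10 * MS_PER_DAY, now); // 30
const tierAt7Days = certAlertTier(now + 7 * MS_PER_DAY, now); // 7
const tierAt12Hours = certAlertTier(now + MS_PER_DAY / 2, now); // 1
```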
1.4 Application-layer failures
| # | Failure | Detection | User impact (G) | Mitigation | Runbook |
|---|---|---|---|---|---|
| F-26 | Payment-return double processing creates double reservation | Reservation-service counter mismatch | Guest charged twice and has 2 reservations | Idem-Key on confirm + Postgres unique constraint on (reservationId, providerReference); chaos test gates this | bff-tenant/double-confirm |
| F-27 | Handoff replay attempted | bff-tenant-booking_handoff_replayed_total spike | None (replay is rejected), but telemetry noise increases | Postgres unique PK on handoff_arrival_log.id; constant-time MAC comparison | bff-tenant/handoff-replay-spike |
| F-28 | HMAC key rotation skew (old key removed too early) | Surge in HANDOFF_SIGNATURE_INVALID from arrivals minted < 7 days ago | Recent handoff links broken | 7-day overlap window enforced by drill; rollback path is to add old key back to keyring | bff-tenant/hmac-rotation-skew |
| F-29 | BookingDraft state-machine divergence (illegal transition) | flow.error_encountered.v1 with INVALID_FLOW_TRANSITION | User sees "session expired, please restart" | State machine in domain layer is single-source-of-truth; unit tests exhaustive; rollback to last known good draft via /draft/{id} | bff-tenant/draft-state-divergence |
| F-30 | Optimistic-concurrency conflict on BookingDraft.patch | bff-tenant-booking_draft_conflict_total | UI sees 412; auto-refetches and retries | Clients use ETag-equivalent expectedUpdatedAt; auto-retry once with fresh state | bff-tenant/draft-conflict |
| F-31 | Single-flight collapse on bootstrap during deploy | Spike in singleflight_followers_total | Brief upstream pressure on theme-config-service | Single-flight wrapper falls back to per-request execution; alarm triggers post-deploy review | bff-tenant/singleflight-collapse |
| F-32 | Cookie size bloat | cookie.size.bytes p99 > 3500 | Browsers may reject; session lost | Session blob server-side; cookie carries only tnt_<ulid> | bff-tenant/cookie-bloat |
| F-33 | View-model breaking change without /v2 | Contract test failure on PR | None (CI gate) | Pact provider verification; client app pacts; CI fails before merge | bff/contract-drift |
| F-34 | Memory leak (long soak) | Heap RSS upward trend > 10% / 4 h | Pod restarts | Cloud Run min instance + readiness check; rolling restart | bff-tenant/memory-leak |
| F-35 | Outbox publisher DLQ spike | DLQ depth alert | None | DLQ has 7-day retention; SRE inspects payload; manual replay | platform/dlq-spike |
| F-36 | Currency change mid-flow not honored | Synthetic E2E flagging mismatched totals | Wrong totals shown briefly | Re-quote on next quote request; clients required to call quote on currency change | bff-tenant/currency-change-bug |
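
The optimistic-concurrency behaviour in F-30 (412 on a stale `expectedUpdatedAt`, then one automatic retry with fresh state) might look roughly like the sketch below. The `BookingDraft` fields other than the timestamp, and both function names, are assumptions for illustration; the real draft model is defined elsewhere.

```typescript
// Hypothetical, trimmed-down draft shape.
interface BookingDraft {
  id: string;
  updatedAt: string; // ISO-8601 timestamp acting as the ETag-equivalent
  guests: number;
}

type PatchOutcome =
  | { status: 200; draft: BookingDraft }
  | { status: 412; current: BookingDraft };

// Server side: apply the patch only if the caller saw the latest version.
function patchDraft(
  stored: BookingDraft,
  expectedUpdatedAt: string,
  changes: { guests: number },
  nowIso: string,
): PatchOutcome {
  if (stored.updatedAt !== expectedUpdatedAt) {
    return { status: 412, current: stored };
  }
  return { status: 200, draft: { ...stored, ...changes, updatedAt: nowIso } };
}

// Client side: on 412, refetch state from the response and retry exactly once.
function patchWithOneRetry(
  stored: BookingDraft,
  expectedUpdatedAt: string,
  changes: { guests: number },
  nowIso: string,
): PatchOutcome {
  const first = patchDraft(stored, expectedUpdatedAt, changes, nowIso);
  if (first.status === 200) {
    return first;
  }
  return patchDraft(stored, first.current.updatedAt, changes, nowIso);
}

const storedDraft: BookingDraft = { id: "d1", updatedAt: "2025-01-01T10:00:05Z", guests: 2 };
const staleStamp = "2025-01-01T10:00:00Z"; // the client's outdated view

const direct = patchDraft(storedDraft, staleStamp, { guests: 3 }, "2025-01-01T10:00:09Z");
const retried = patchWithOneRetry(storedDraft, staleStamp, { guests: 3 }, "2025-01-01T10:00:09Z");
```

Here `direct.status` is 412 because the stamp is stale, while `retried.status` is 200 because the retry uses the timestamp returned with the conflict.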
1.5 Cost / quota
| # | Failure | Detection | User impact (G) | Mitigation | Runbook |
|---|---|---|---|---|---|
| F-37 | Cloud Run cost spike | Billing alert | None directly | Cloud Armor ratchet; capacity audit | platform/cost-spike |
| F-38 | Pub/Sub cost spike from telemetry | Billing alert | None | Reduce sample rates via flag; identify hot subjects | bff-tenant/telemetry-cost |
2. Failure decision tree
incoming request
│
├── Cloud Armor block? ── yes ──► 403 (no telemetry)
│
├── Custom domain unclaimed? ── yes ──► 404
│
├── Tenant slug unknown? ── yes ──► 404 + SLUG_UNKNOWN
│
├── Tenant suspended? ── yes ──► 503 + TENANT.SUSPENDED
│
├── Memorystore session tier down? ── yes ──► 503 + CACHE_UNAVAILABLE
│
├── route fanout
│ ├── all upstreams ok ──► 200 with full VM
│ ├── pricing partial ──► 200 with partial=true
│ ├── theme-cache stale ──► 200 with stale=true banner
│ ├── reservation-service down (hold) ──► 502 + UPSTREAM_UNAVAILABLE
│ ├── reservation-service down (confirm) ──► 504 + UPSTREAM_TIMEOUT (idem-key absorbs retry)
│ ├── payment-gateway down ──► 502 + UPSTREAM_UNAVAILABLE
│ └── schema drift detected ──► 502 + SCHEMA_DRIFT + alert
│
└── outbox enqueue
  ├── Postgres ok ──► 200 (telemetry async)
  └── Postgres down ──► best-effort buffer; readiness reflects DOWN
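
The pre-fanout part of the tree above is an ordered chain of guards, so the first matching condition decides the response. A minimal sketch of that ordering follows; the `RequestContext` fields and the `earlyExit` function are hypothetical names, while the status codes and error codes are the ones shown in the tree.

```typescript
// Hypothetical per-request facts gathered before route fanout.
interface RequestContext {
  blockedByCloudArmor: boolean;
  customDomainClaimed: boolean;
  tenantSlugKnown: boolean;
  tenantSuspended: boolean;
  sessionTierUp: boolean;
}

interface EarlyExit {
  status: number;
  code: string | null; // null where the tree lists no error code
}

// Evaluates guards in tree order; returns null when the request may proceed to fanout.
function earlyExit(ctx: RequestContext): EarlyExit | null {
  if (ctx.blockedByCloudArmor) return { status: 403, code: null };
  if (!ctx.customDomainClaimed) return { status: 404, code: null };
  if (!ctx.tenantSlugKnown) return { status: 404, code: "SLUG_UNKNOWN" };
  if (ctx.tenantSuspended) return { status: 503, code: "TENANT.SUSPENDED" };
  if (!ctx.sessionTierUp) return { status: 503, code: "CACHE_UNAVAILABLE" };
  return null;
}

const healthy: RequestContext = {
  blockedByCloudArmor: false,
  customDomainClaimed: true,
  tenantSlugKnown: true,
  tenantSuspended: false,
  sessionTierUp: true,
};

const healthyExit = earlyExit(healthy); // null: proceed to route fanout
// Suspension is checked before the session tier, so it wins when both apply.
const suspendedExit = earlyExit({ ...healthy, tenantSuspended: true, sessionTierUp: false });
const sessionDownExit = earlyExit({ ...healthy, sessionTierUp: false });
```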
3. Blast radius matrix
| Failure | Tenant booking surface (G) | Consumer surface | Backoffice surface | Other tenants |
|---|---|---|---|---|
| Memorystore down | Severe (mutating endpoints fail) | None | None | All tenants (Memorystore is shared across tenants) |
| Cloud SQL down | Mutating endpoints fail | None | None | All tenants |
| reservation-service down | Hold + confirm fail | None | Backoffice may also be impacted | All tenants |
| theme-config-service down | Bootstrap stale | None | Backoffice may need theme | All tenants |
| Custom-domain DNS regression | One tenant down | None | None | Just that tenant |
| HMAC key rotation skew | Handoff arrivals broken | None | None | All tenants |
| WAF FP | Some bookings blocked | None | None | Variable |
4. Recovery objectives
| Objective | Target |
|---|---|
| RPO | 5 min |
| RTO | 30 min |
| MTTD (P1) | < 2 min |
| MTTA (P1) | < 5 min |
| MTTM (P1) | < 30 min |
| Confirm-idempotent correctness | 100% (no double-confirm tolerated) |
5. Game-day exercises
Game days run quarterly. Recent runs are recorded in services/bff-tenant-booking-service/_chaos/ with date, owner, and remediation backlog. They focus in particular on payment-return ambiguity, handoff key rotation, custom-domain TLS, and Memorystore failover with active drafts.