
FAILURE_MODES — bff-tenant-booking-service

Sibling: APPLICATION_LOGIC · OBSERVABILITY · SECURITY_MODEL · TESTING_STRATEGY

Cross-cutting: 02 Enterprise Architecture · §10 Failure Posture · Standards · ERROR_CODES

This document catalogues what breaks, how the user (the consumer-facing tenant guest) experiences it, how we detect it, and how we mitigate it. The booking BFF carries money risk: failures here can result in overcharges, missed reservations, or stranded inventory holds. The most dangerous classes are payment-return double processing, handoff replay, schema drift on quote, and the inventory-confirm race.

User impact column legend: G = guest on the tenant booking surface (web or mobile). The Electron desktop and consumer meta surface are never affected by this BFF.

1. Failure catalogue

1.1 Upstream service failures

| # | Failure | Detection | User impact (G) | Mitigation | Runbook |
|---|---|---|---|---|---|
| F-1 | tenant-service slug resolution down | Per-upstream alert | Bootstrap returns 502; cached slugs continue to work for 1 h | 1 h slug cache absorbs short outages; cache miss + outage → 503 + MELMASTOON.BFF.UPSTREAM_UNAVAILABLE | bff-tenant/slug-resolver-down |
| F-2 | theme-config-service down | Per-upstream alert | Bootstrap serves last-good cache up to 30 min stale; banner: "theme may be outdated" | Memorystore cache TTL 5 min + extended stale-while-error 30 min; CSS sheet served from CDN regardless | bff-tenant/theme-down |
| F-3 | inventory-service slow / down | Per-upstream alert | /availability returns partial result with stale=true; rooms shown without "X left" badge | Stale cache up to 60 s; banner; availability_p95_latency alert | bff-tenant/inventory-degraded |
| F-4 | pricing-service quote 5xx | Per-upstream alert | /quote returns 502; user retries | No retry on quote (non-idempotent); user sees provider-specific banner | bff-tenant/pricing-quote-down |
| F-5 | pricing-service cheapest fanout slow | Per-upstream alert | /availability shows rooms without prices | 800 ms deadline; null-price branch; priceUnavailable=true per room | bff-tenant/pricing-cheapest-degraded |
| F-6 | reservation-service hold 5xx | Per-route alert | /hold returns 502; user sees "couldn't reserve, try again" banner | Idempotency-Key prevents double-hold on retry; circuit opens after 3 failures / 15 s | bff-tenant/reservation-hold-down |
| F-7 | reservation-service confirm 5xx | Per-route alert + missing-confirm metric | /return returns 504 after timeout; user sees "we're confirming, please wait" | Idem-Key (confirm:<rsvId>:<providerRef>) prevents double-confirm; UI polls confirmation page; on-call investigates any unresolved confirmation within 5 min | bff-tenant/reservation-confirm-down |
| F-8 | payment-gateway-service 5xx | Per-route alert | /payment-intent returns 502; user sees provider error | Banner with "try a different method"; circuit opens; no retry (provider state ambiguous) | bff-tenant/payment-gateway-down |
| F-9 | payment-gateway-service verifyReturn ambiguous (2xx but no decisive status) | Logged + payment_return_ambiguous_total metric | /return returns 504; UI redirects to confirmation polling | UI polls /confirmation/{rsv} every 3 s for 60 s; if reservation status flips to confirmed we honor it; otherwise user contacts support | bff-tenant/payment-ambiguous |
| F-10 | billing-service down on confirmation | Per-upstream alert | Confirmation page renders without folio summary; folioUnavailable=true | Soft-fail: render reservation block; show "folio loading"; refresh button | bff-tenant/billing-soft-fail |
| F-11 | lock-integration-service down on confirmation | Per-upstream alert | Confirmation page renders without key-credential placeholder | Soft-fail: omit placeholder; mention "your key will be available 24h before arrival" | bff-tenant/lock-soft-fail |
| F-12 | ai-orchestrator-service down | Per-upstream alert | AI surfaces (recommendations, policy summary) hidden | Silent degrade; meta.aiUnavailable=true; no banner | bff-tenant/ai-down |
| F-13 | Schema drift from any upstream | Zod parse failure on response | Affected route 502 + MELMASTOON.BFF.SCHEMA_DRIFT | Parser logs full payload (sanitized) at WARN; on-call within 15 min; provider rolled back or BFF schema patched | bff/schema-drift |
| F-14 | bff-consumer-service /internal/handoff/{id}/consume reachability lost | Internal mTLS alert | Handoff arrival cannot mark consumer-side consumed | We mark our side consumed regardless; consumer-side eventual reconciliation by sweep job | bff-tenant/internal-handoff-callback |
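
F-6 cites a circuit that opens after 3 failures within 15 s. The sketch below shows one way such a rule could be expressed, assuming a sliding failure window. The class name, the 30 s cool-down, the reset-on-success behaviour, and the caller-supplied clock are illustrative assumptions, not taken from the service.

```typescript
// Hypothetical sliding-window circuit breaker. Only the 3-failure / 15 s rule
// comes from F-6; everything else here is an assumption for illustration.
type BreakerState = "closed" | "open";

class SlidingWindowBreaker {
  private failures: number[] = []; // timestamps (ms) of recent failures
  private openedAt: number | null = null;
  private readonly maxFailures: number;
  private readonly windowMs: number;
  private readonly coolDownMs: number;

  constructor(maxFailures = 3, windowMs = 15_000, coolDownMs = 30_000) {
    this.maxFailures = maxFailures;
    this.windowMs = windowMs;
    this.coolDownMs = coolDownMs; // assumption: the catalogue states no cool-down
  }

  // The caller passes the current time so the breaker stays deterministic in tests.
  state(now: number): BreakerState {
    if (this.openedAt !== null && now - this.openedAt >= this.coolDownMs) {
      // Cool-down elapsed: close the circuit and start counting afresh.
      this.openedAt = null;
      this.failures = [];
    }
    return this.openedAt === null ? "closed" : "open";
  }

  recordFailure(now: number): void {
    if (this.state(now) === "open") return;
    // Keep only failures that still fall inside the window, then add this one.
    this.failures = this.failures.filter((t) => now - t < this.windowMs);
    this.failures.push(now);
    if (this.failures.length >= this.maxFailures) this.openedAt = now;
  }

  recordSuccess(): void {
    this.failures = []; // assumption: any success clears the failure window
  }
}
```

A caller would check `state(Date.now())` before each upstream call and return the 502 path immediately while the circuit is open.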

1.2 Stateful dependency failures

| # | Failure | Detection | User impact (G) | Mitigation | Runbook |
|---|---|---|---|---|---|
| F-15 | Memorystore cache tier down | Health check + Redis client errors | Higher upstream load; latency rises; bootstrap cache disabled | Read-through to upstream; readiness reflects state; circuit on Redis after 5% errors | bff-tenant/memorystore-cache-down |
| F-16 | Memorystore session tier down | Health check + Redis client errors | Mutating endpoints fail; 503 + MELMASTOON.BFF.CACHE_UNAVAILABLE; user retries | Cannot serve booking flow without session tier; readiness reflects DOWN; auto-failover to standby | bff-tenant/memorystore-session-down |
| F-17 | Memorystore failover (standby promoted) | Failover event from GCP | < 30 s elevated latency; in-flight drafts may be lost | Drafts that had not yet been snapshotted are lost; clients reload via /draft/{id} 404 → restart funnel; idempotency keys preserve correctness across retries | bff-tenant/memorystore-failover |
| F-18 | Cloud SQL primary down | Postgres health check | /handoff/consume and idempotency writes fail; mutating endpoints 503 | HA failover (~ 60 s); idempotency keys absorb retries | bff-tenant/postgres-down |
| F-19 | Outbox grows (Pub/Sub publisher down) | Outbox-depth alert at 5k / 50k / 250k | None visible; storage pressure risk | outbox-relay retries with backoff; on-call investigates | platform/outbox-backlog |
| F-20 | Pub/Sub publish 100% failure | Publish error alert | None visible | Outbox queues; no event loss; recovers when Pub/Sub healthy | platform/pubsub-publish-down |
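
The read-through mitigation in F-15 can be sketched as follows: a cache error is treated like a miss, so the request falls through to the upstream instead of failing. The `Cache` interface, the function name, and the error callback are hypothetical; a real Redis client would sit behind the interface.

```typescript
// Hypothetical cache abstraction; not the service's actual client.
interface Cache {
  get(key: string): Promise<string | undefined>;
  set(key: string, value: string): Promise<void>;
}

async function readThrough(
  cache: Cache,
  key: string,
  loadFromUpstream: () => Promise<string>,
  onCacheError: (err: unknown) => void = () => {},
): Promise<string> {
  try {
    const hit = await cache.get(key);
    if (hit !== undefined) return hit;
  } catch (err) {
    onCacheError(err); // cache tier down: report it and continue to the upstream
  }
  const value = await loadFromUpstream();
  try {
    await cache.set(key, value);
  } catch (err) {
    onCacheError(err); // a failed write-back must not fail the request
  }
  return value;
}
```

The error callback is where a metric or readiness signal would be updated, which is how the "readiness reflects state" part of the mitigation could hook in.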

1.3 Edge / ingress failures

| # | Failure | Detection | User impact (G) | Mitigation | Runbook |
|---|---|---|---|---|---|
| F-21 | Custom-domain TLS cert expiry | Cloud Monitoring uptime check on TLS handshake | All requests on that domain fail | Cert Manager auto-renewal; alert at T-30 / T-7 / T-1 days | bff-tenant/custom-domain-tls |
| F-22 | Custom-domain DNS regression (CNAME removed by tenant) | Synthetic uptime check fails | Tenant booking site unreachable | We detect within 5 min; alert tenant via CSM channel; suggest DNS restoration | bff-tenant/custom-domain-dns |
| F-23 | Cloud Armor WAF rule false-positive on legitimate booking flow | Spike in 403 from booking traffic | Some users blocked mid-flow | Staging soak before prod rollout; rollback via Terraform | bff-tenant/waf-fp |
| F-24 | CDN cache poisoning | Synthetic monitor cross-locale fail | Some users see another tenant's content (catastrophic) | Vary header strict; tenant-scoped cache key never normalized; per-tenant cert SNI; nightly cross-tenant probe | bff-tenant/cdn-cross-tenant |
| F-25 | Bootstrap cached for too long across theme publish | theme.published.v1 consumer lag | Stale theme until invalidate processed | Inbox lag alert > 30 s; manual invalidate command available | bff-tenant/theme-cache-stale |
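
F-24 depends on every cache key being scoped to a tenant. A minimal sketch of such a key builder is shown below. The field names and key layout are assumptions for illustration, not the layout the CDN or BFF actually uses.

```typescript
// Hypothetical cache-key builder; field names and key layout are assumptions.
interface CacheKeyParts {
  tenantId: string; // resolved tenant identifier
  locale: string;
  path: string;
}

function tenantScopedCacheKey(parts: CacheKeyParts): string {
  if (parts.tenantId.trim() === "") {
    // Refuse to build a key that could be shared across tenants.
    throw new Error("tenantId is required for every cache key");
  }
  // Encode each component so a delimiter inside one field cannot make two
  // different (tenant, locale, path) triples collapse into the same key.
  return [parts.tenantId, parts.locale, parts.path].map(encodeURIComponent).join("|");
}
```

Failing loudly on a missing tenant is a deliberate choice in this sketch: an unscoped key is the failure that lets one tenant's content reach another.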

1.4 Application-layer failures

| # | Failure | Detection | User impact (G) | Mitigation | Runbook |
|---|---|---|---|---|---|
| F-26 | Payment-return double processing creates double reservation | Reservation-service counter mismatch | Guest charged twice and has 2 reservations | Idem-Key on confirm + Postgres unique constraint on (reservationId, providerReference); chaos test gates this | bff-tenant/double-confirm |
| F-27 | Handoff replay accepted | bff-tenant-booking_handoff_replayed_total spike | None (rejected); but increases telemetry noise | Postgres unique PK on handoff_arrival_log.id; constant-time MAC comparison | bff-tenant/handoff-replay-spike |
| F-28 | HMAC key rotation skew (old key removed too early) | Surge in HANDOFF_SIGNATURE_INVALID from arrivals minted < 7 days ago | Recent handoff links broken | 7-day overlap window enforced by drill; rollback path is to add old key back to keyring | bff-tenant/hmac-rotation-skew |
| F-29 | BookingDraft state-machine divergence (illegal transition) | flow.error_encountered.v1 with INVALID_FLOW_TRANSITION | User sees "session expired, please restart" | State machine in domain layer is single source of truth; unit tests exhaustive; rollback to last known good draft via /draft/{id} | bff-tenant/draft-state-divergence |
| F-30 | Optimistic-concurrency conflict on BookingDraft.patch | bff-tenant-booking_draft_conflict_total | UI sees 412; auto-refetches and retries | Clients use ETag-equivalent expectedUpdatedAt; auto-retry once with fresh state | bff-tenant/draft-conflict |
| F-31 | Single-flight collapse on bootstrap during deploy | Spike in singleflight_followers_total | Brief upstream pressure on theme-config-service | Single-flight wrapper falls back to per-request execution; alarm triggers post-deploy review | bff-tenant/singleflight-collapse |
| F-32 | Cookie size bloat | cookie.size.bytes p99 > 3500 | Browsers may reject; session lost | Session blob server-side; cookie carries only tnt_<ulid> | bff-tenant/cookie-bloat |
| F-33 | View-model breaking change without /v2 | Contract test failure on PR | None (CI gate) | Pact provider verification; client app pacts; CI fails before merge | bff/contract-drift |
| F-34 | Memory leak (long soak) | Heap RSS upward trend > 10% / 4 h | Pod restarts | Cloud Run min instance + readiness check; rolling restart | bff-tenant/memory-leak |
| F-35 | Outbox publisher DLQ spike | DLQ depth alert | None | DLQ has 7-day retention; SRE inspects payload; manual replay | platform/dlq-spike |
| F-36 | Currency change mid-flow not honored | Synthetic E2E flagging mismatched totals | Wrong totals shown briefly | Re-quote on next quote request; clients required to call quote on currency change | bff-tenant/currency-change-bug |
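
The double-confirm protection in F-26 pairs an idempotency key with a uniqueness guarantee on (reservationId, providerReference). The sketch below uses an in-memory map as a stand-in for the Postgres unique constraint. The class and field names are hypothetical, and only the key shape `confirm:<rsvId>:<providerRef>` comes from F-7.

```typescript
// In-memory stand-in for the unique constraint on (reservationId, providerReference).
// All names are hypothetical; a real store would rely on the database constraint.
interface ConfirmResult {
  reservationId: string;
  providerReference: string;
  confirmedAt: number;
}

class ConfirmLedger {
  private readonly seen = new Map<string, ConfirmResult>();
  confirmCalls = 0; // counts how often the confirm side effect actually ran

  // Key shape taken from F-7: confirm:<rsvId>:<providerRef>
  static key(reservationId: string, providerReference: string): string {
    return `confirm:${reservationId}:${providerReference}`;
  }

  confirm(reservationId: string, providerReference: string, now: number): ConfirmResult {
    const key = ConfirmLedger.key(reservationId, providerReference);
    const existing = this.seen.get(key);
    if (existing) return existing; // replayed payment return: reuse the first outcome
    this.confirmCalls += 1; // the real side effect (confirming the reservation) would run here
    const result: ConfirmResult = { reservationId, providerReference, confirmedAt: now };
    this.seen.set(key, result);
    return result;
  }
}
```

A second payment return carrying the same reservation and provider reference receives the stored result, so the guest ends up with one reservation however often the provider redirects or the client retries.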

1.5 Cost / quota

| # | Failure | Detection | User impact (G) | Mitigation | Runbook |
|---|---|---|---|---|---|
| F-37 | Cloud Run cost spike | Billing alert | None directly | Cloud Armor ratchet; capacity audit | platform/cost-spike |
| F-38 | Pub/Sub cost spike from telemetry | Billing alert | None | Reduce sample rates via flag; identify hot subjects | bff-tenant/telemetry-cost |

2. Failure decision tree

incoming request
├── Cloud Armor block? ── yes ──► 403 (no telemetry)
├── Custom domain unclaimed? ── yes ──► 404
├── Tenant slug unknown? ── yes ──► 404 + SLUG_UNKNOWN
├── Tenant suspended? ── yes ──► 503 + TENANT.SUSPENDED
├── Memorystore session tier down? ── yes ──► 503 + CACHE_UNAVAILABLE
├── route fanout
│   ├── all upstreams ok ──► 200 with full VM
│   ├── pricing partial ──► 200 with partial=true
│   ├── theme-cache stale ──► 200 with stale=true banner
│   ├── reservation-service down (hold) ──► 502 + UPSTREAM_UNAVAILABLE
│   ├── reservation-service down (confirm) ──► 504 + UPSTREAM_TIMEOUT (idem-key absorbs retry)
│   ├── payment-gateway down ──► 502 + UPSTREAM_UNAVAILABLE
│   └── schema drift detected ──► 502 + SCHEMA_DRIFT + alert
└── outbox enqueue
    ├── Postgres ok ──► 200 (telemetry async)
    └── Postgres down ──► best-effort buffer; readiness reflects DOWN
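
The checks that run before route fanout are evaluated in order, and the first match wins. A sketch of that ordering as a pure function follows. The `RequestContext` fields and the function name are hypothetical; the status codes and error codes are the ones shown in the tree.

```typescript
// Hypothetical request context; field names are assumptions chosen to mirror the tree.
interface RequestContext {
  blockedByCloudArmor: boolean;
  customDomainUnclaimed: boolean;
  tenantSlugKnown: boolean;
  tenantSuspended: boolean;
  sessionTierUp: boolean;
}

interface EarlyExit {
  status: number;
  code?: string;
}

// Returns the early-exit response, or null when the request proceeds to route fanout.
function preFanoutDecision(ctx: RequestContext): EarlyExit | null {
  if (ctx.blockedByCloudArmor) return { status: 403 };
  if (ctx.customDomainUnclaimed) return { status: 404 };
  if (!ctx.tenantSlugKnown) return { status: 404, code: "SLUG_UNKNOWN" };
  if (ctx.tenantSuspended) return { status: 503, code: "TENANT.SUSPENDED" };
  if (!ctx.sessionTierUp) return { status: 503, code: "CACHE_UNAVAILABLE" };
  return null;
}
```

Because the order is fixed, a suspended tenant whose session tier is also down is reported as suspended, matching the top-to-bottom reading of the tree.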

3. Blast radius matrix

| Failure | Tenant booking surface (G) | Consumer surface | Backoffice surface | Other tenants |
|---|---|---|---|---|
| Memorystore down | Severe (mutating endpoints fail) | None | None | Same (cross-tenant Memorystore) |
| Cloud SQL down | Mutating endpoints fail | None | None | Same |
| reservation-service down | Hold + confirm fail | None | Backoffice may also be impacted | All tenants |
| theme-config-service down | Bootstrap stale | None | Backoffice may need theme | All tenants |
| Custom-domain DNS regression | One tenant down | None | None | Just that tenant |
| HMAC key rotation skew | Handoff arrivals broken | None | None | All tenants |
| WAF FP | Some bookings blocked | None | None | Variable |

4. Recovery objectives

| Objective | Target |
|---|---|
| RPO | 5 min |
| RTO | 30 min |
| MTTD (P1) | < 2 min |
| MTTA (P1) | < 5 min |
| MTTM (P1) | < 30 min |
| Confirm-idempotent correctness | 100% (no double-confirm tolerated) |

5. Game-day exercises

Game days run quarterly. Recent runs are recorded in services/bff-tenant-booking-service/_chaos/ with date, owner, and remediation backlog. Particular focus areas are payment-return ambiguity, handoff key rotation, custom-domain TLS, and Memorystore failover with active drafts.