FAILURE_MODES — bff-consumer-service

Sibling: APPLICATION_LOGIC · OBSERVABILITY · SECURITY_MODEL · TESTING_STRATEGY

Cross-cutting: 02 Enterprise Architecture · §10 Failure Posture · Standards · ERROR_CODES

This document catalogues what breaks, how the user experiences the break, how we detect it, and how we mitigate it. Every row is paired with an alert and a runbook reference. The BFF is stateless on the hot path — almost every failure is recoverable by retry, cache fallback, or graceful degradation. The most dangerous failure is silent schema drift from upstream services.

User impact column legend: C = consumer web/mobile (this BFF's only direct consumer surface). The Electron desktop and tenant booking surface are never affected by this BFF's failures (they use other BFFs).

1. Failure catalogue

1.1 Upstream service failures

| # | Failure | Detection | User impact (C) | Mitigation | Runbook |
|---|---|---|---|---|---|
| F-1 | search-aggregation-service 5xx > 1% / 1 min | Cloud Monitoring alert on upstream error rate per route | Search list view returns 503 + MELMASTOON.BFF.UPSTREAM_UNAVAILABLE; map view degrades to last cached result with stale=true banner | Circuit breaker opens after 5 consecutive failures (4-second half-open probe). Cache-first responder serves last good payload up to 10 min stale-while-error. | bff-consumer/upstream-search-down |
| F-2 | search-aggregation-service p95 > 1500 ms | Latency SLO burn alert | Search slower; partial results mark priceFromCheapest=null (pricing fanout was deadlined) | Per-call deadline 1500 ms, hedged retry on second connection after 800 ms (idempotent only). | bff-consumer/upstream-search-slow |
| F-3 | pricing-service /quotes/preview timeout > 10% of fanouts | Per-upstream success-rate alert | Listings show without "from $X" badge; banner: "prices loading" | Async price-enrich pattern: search returns first; pricing fanout has 800 ms wall budget; missing prices marked null. | bff-consumer/pricing-preview-degraded |
| F-4 | property-service 5xx | Per-route alert on /hotels/{id} | Hotel detail returns 502 + MELMASTOON.BFF.UPSTREAM_UNAVAILABLE | Cache miss falls through; previously cached detail (TTL 5 min) is served stale up to 30 min when upstream is down. | bff-consumer/property-detail-down |
| F-5 | theme-config-service slow / down | Per-upstream alert | Brand peek (logo + color) missing on detail card; placeholder shown | Detail composition continues without brand peek; brand placeholder asset rendered. Cache TTL 15 min absorbs short outages. | bff-consumer/theme-peek-degraded |
| F-6 | tenant-service slug → tenantId resolution failing | Lookup error rate alert | Handoff requests fail with 502 | Slug → tenantId resolution is cached for 1 h; outage of <1 h is invisible. >1 h outage degrades handoff to error. | bff-consumer/tenant-resolver-down |
| F-7 | bff-tenant-booking-service /internal/handoff/{id}/consume rejects | Handoff success-rate alert | Guest sees 502 on landing page after redirect (handled by tenant BFF) | This BFF only mints; consumer impact is on the tenant BFF side. We log replay rejection and increment a counter for SRE. | bff-tenant/handoff-consume-rejected |
| F-8 | Schema drift from any upstream (e.g., search-aggregation-service adds non-nullable field) | Zod parse failure on response with MELMASTOON.BFF.SCHEMA_DRIFT; alert on first occurrence | Affected route 502s | Parser logs full payload (sanitized) at WARN; on-call investigates within 15 min; provider rolled back or BFF schema patched. | bff/schema-drift |
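
The F-1 mitigation (open after 5 consecutive failures, then a half-open probe after 4 seconds) can be illustrated with a minimal sketch. The class name `CircuitBreaker`, its option names, and the injectable clock are hypothetical and chosen for illustration; this is not the service's actual implementation, and it assumes a single probe is allowed through while half-open.

```typescript
// Minimal circuit-breaker sketch for the F-1 row.
// All names here are illustrative assumptions, not taken from the codebase.
type BreakerState = "closed" | "open" | "half-open";

interface BreakerOptions {
  failureThreshold: number; // 5 in the F-1 row
  halfOpenAfterMs: number;  // 4000 in the F-1 row
  now?: () => number;       // injectable clock so the logic is testable
}

class CircuitBreaker {
  private state: BreakerState = "closed";
  private consecutiveFailures = 0;
  private openedAt = 0;
  private readonly opts: BreakerOptions;
  private readonly now: () => number;

  constructor(opts: BreakerOptions) {
    this.opts = opts;
    this.now = opts.now ?? Date.now;
  }

  /** Returns true if a call to the upstream may be attempted right now. */
  allowRequest(): boolean {
    if (this.state === "closed") return true;
    const waited = this.now() - this.openedAt;
    if (this.state === "open" && waited >= this.opts.halfOpenAfterMs) {
      this.state = "half-open"; // let exactly one probe through
      return true;
    }
    return false; // still open, or a probe is already in flight
  }

  /** A successful call (including a successful probe) closes the breaker. */
  recordSuccess(): void {
    this.consecutiveFailures = 0;
    this.state = "closed";
  }

  /** A failed probe, or reaching the threshold, (re)opens the breaker. */
  recordFailure(): void {
    this.consecutiveFailures += 1;
    const thresholdHit = this.consecutiveFailures >= this.opts.failureThreshold;
    if (this.state === "half-open" || thresholdHit) {
      this.state = "open";
      this.openedAt = this.now();
    }
  }

  current(): BreakerState {
    return this.state;
  }
}
```

A caller would check `allowRequest()` before each upstream call and fall back to the cached payload when it returns false, reporting the outcome through `recordSuccess()` or `recordFailure()` otherwise.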

1.2 Stateful dependency failures

| # | Failure | Detection | User impact (C) | Mitigation | Runbook |
|---|---|---|---|---|---|
| F-9 | Memorystore (Redis) regional outage | Health check fail; Redis client error rate > 5% | Sessions cannot be read/written; cache misses cascade to upstream; latency rises sharply | Sessions are best-effort: if Memorystore is down, requests proceed with an in-memory ephemeral session and a Set-Cookie retry banner. Upstream call rate triples → Cloud Run autoscale absorbs. Aggressive single-flight at the application layer prevents stampede. | bff-consumer/memorystore-down |
| F-10 | Memorystore failover (replica promoted) | Replica failover event from GCP | < 30 s of elevated latency; some sessions reset to first-touch state | No mitigation; client cookies regenerate session; idempotency keys preserve mutating-route correctness. | bff-consumer/memorystore-failover |
| F-11 | Cloud SQL primary down | Postgres health check fail | Mutations (wishlist, handoff, telemetry, locale, currency) fail with 503; reads (search, detail) unaffected | Cloud SQL HA failover (~ 60 s); Idempotency-Key on retried mutations ensures correctness. Outbox writes block until Postgres is back; in-memory buffer of 256 events absorbs short blip. | bff-consumer/postgres-down |
| F-12 | Outbox table growing unbounded (Pub/Sub publisher down) | Outbox-depth alert at 5k / 50k / 250k rows | None visible to user; risk of disk pressure | outbox-relay worker retries with exponential backoff; on-call investigates Pub/Sub region health; manual flush script available. | platform/outbox-backlog |
| F-13 | Pub/Sub publish 100% failure | Publish error rate alert | None visible | All telemetry queues in outbox; redrives when Pub/Sub recovers; no event loss. Handoff and search work unaffected (telemetry is async). | platform/pubsub-publish-down |
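
The single-flight behaviour that F-9 relies on to prevent a stampede when the cache is unavailable can be sketched as follows. The `SingleFlight` class and its `run` method are hypothetical names used for illustration; the sketch assumes that concurrent requests for the same key (for example a query hash) should share one upstream call.

```typescript
// Single-flight sketch: concurrent callers asking for the same key share
// one in-flight promise instead of each hitting the upstream.
// Class and method names are illustrative assumptions.
class SingleFlight<T> {
  private readonly inFlight = new Map<string, Promise<T>>();

  run(key: string, load: () => Promise<T>): Promise<T> {
    const existing = this.inFlight.get(key);
    if (existing) return existing; // join the call already in progress

    // Start the load and forget it once it settles (success or failure),
    // so the next request after completion triggers a fresh upstream call.
    const pending = load().finally(() => {
      this.inFlight.delete(key);
    });
    this.inFlight.set(key, pending);
    return pending;
  }
}
```

Every caller that joins receives the same promise, so a rejected upstream call fails all waiters together; whether to then fall back to per-request execution, as the F-20 row describes, is a separate policy decision that this sketch does not cover.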

1.3 Edge / ingress failures

| # | Failure | Detection | User impact (C) | Mitigation | Runbook |
|---|---|---|---|---|---|
| F-14 | Cloud CDN cache poisoned (incorrect Vary configuration) | Synthetic monitor cross-locale check fails | Some users see another locale's content | Vary header explicitly enumerates Accept-Language, X-Currency, Accept-Encoding. Synthetic test runs every 5 min. Mitigation: purge CDN edge for affected paths; root-cause Vary regression. | bff-consumer/cdn-poisoning |
| F-15 | Cloud Armor false-positive WAF rule | Spike in 403 from legitimate traffic | Some users blocked; banner: "service temporarily unavailable" | Cloud Armor rules deployed via Terraform with staging soak first; hotfix is rule-disable + rollout. | bff-consumer/waf-false-positive |
| F-16 | Geo-block misconfiguration | Spike in 451 from a region | Region's traffic blocked | Geo-block list maintained in Terraform with quarterly review; rollback via PR + Terraform apply. | bff-consumer/geo-block-misfire |
| F-17 | TLS cert expiry | Cloud Monitoring uptime check fails on TLS handshake | All requests fail | Cert auto-renewal via GCLB; alert at T-30 days, T-7 days, T-1 day. Manual issue path documented. | platform/tls-expiry |
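
One piece of the F-14 synthetic check could be a guard that the response's Vary header still lists every header the row enumerates. The sketch below is an assumption about how such a guard might look; the function name `missingVaryHeaders` is hypothetical, and the real synthetic monitor also performs a cross-locale content comparison that is not shown here.

```typescript
// Sketch of a Vary regression guard for the F-14 row.
// The header list comes from the catalogue; the function name is hypothetical.
const REQUIRED_VARY: readonly string[] = [
  "Accept-Language",
  "X-Currency",
  "Accept-Encoding",
];

/**
 * Returns the required header names that are absent from a Vary header value.
 * Header names are compared case-insensitively, as HTTP field names are.
 */
function missingVaryHeaders(varyHeader: string | null): string[] {
  const present = new Set(
    (varyHeader ?? "")
      .split(",")
      .map((name) => name.trim().toLowerCase())
      .filter((name) => name.length > 0),
  );
  return REQUIRED_VARY.filter((name) => !present.has(name.toLowerCase()));
}
```

A non-empty result would indicate a Vary regression of the kind that lets the CDN serve one locale's content to another locale's users.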

1.4 Application-layer failures

| # | Failure | Detection | User impact (C) | Mitigation | Runbook |
|---|---|---|---|---|---|
| F-18 | HMAC signing key rotation skew (old key removed before all clients drained) | Surge in MELMASTOON.BFF.CONSUMER.HANDOFF_INVALID on the tenant BFF | Recent handoff links broken | HMAC verifier accepts both currentKey and previousKey for 7 days; rotation drill runs quarterly; SRE checklist enforces 7-day overlap. | bff-consumer/hmac-rotation-skew |
| F-19 | Bot detector false-positive (real users CAPTCHA-challenged) | False-positive rate > 0.5% via synthetic + sampled human traffic | Real users hit reCAPTCHA challenge; conversion drops | Detector thresholds tunable per-tenant via Memorystore feature flag; rollback path is to lower aggression score; emergency kill-switch flag disableBotDetector. | bff-consumer/bot-fp-spike |
| F-20 | Stampede control failure (single-flight collapse breaks) | Upstream RPS to search-aggregation-service > 5× steady-state for the same query hash | Upstream may overload | cache.stampede.failures metric; Single-flight wrapper has fallback to per-request execution but logs critical alert. Post-mortem required. | bff-consumer/stampede-control |
| F-21 | Cookie too large (session bloat) | cookie.size.bytes p99 alert > 3500 B | Browsers may reject; clients see lost session | Session blob stored server-side in Memorystore; cookie carries only the ULID. Wishlist and recently-viewed never serialized into cookie. | bff-consumer/cookie-bloat |
| F-22 | View-model breaking schema change pushed without /v2 | Contract test failure on PR | None in prod (CI gate) | Pact provider verification gate; consumer apps publish pacts on every release; CI fails before merge. | bff/contract-drift |
| F-23 | Memory leak (long-soak) | Heap RSS upward trend > 10% / 4 h | Pod restarts; brief latency blip | Cloud Run min instance + readiness check; rolling restart; root-cause via heap snapshot. Long-soak load test in pre-prod. | bff-consumer/memory-leak |
| F-24 | Outbox publisher dead-letter spike (poison messages) | DLQ depth alert | None | DLQ has 7-day retention; SRE inspects payload; manual replay or quarantine. | platform/dlq-spike |
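
The dual-key acceptance described in F-18 can be sketched as below. The catalogue names only `currentKey` and `previousKey`; the choice of HMAC-SHA256, the hex signature encoding, and the names `KeyRing`, `sign`, and `verifyHandoff` are assumptions made for illustration. The sketch also assumes the 7-day overlap is enforced by the rotation process removing `previousKey`, not by the verifier itself.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Dual-key verification sketch for the F-18 row.
// SHA-256, hex encoding, and all type/function names are assumptions.
interface KeyRing {
  currentKey: string;
  previousKey?: string; // present only during the rotation overlap window
}

/** Computes the HMAC of a payload with the given key. */
function sign(payload: string, key: string): Buffer {
  return createHmac("sha256", key).update(payload).digest();
}

/** Constant-time comparison; timingSafeEqual throws on unequal lengths. */
function matches(given: Buffer, expected: Buffer): boolean {
  return given.length === expected.length && timingSafeEqual(given, expected);
}

/**
 * Returns which key validated the signature, or null if neither did.
 * Reporting "previous" lets operators watch old-key traffic drain before
 * the previous key is removed.
 */
function verifyHandoff(
  payload: string,
  signatureHex: string,
  keys: KeyRing,
): "current" | "previous" | null {
  const given = Buffer.from(signatureHex, "hex");
  if (matches(given, sign(payload, keys.currentKey))) return "current";
  if (
    keys.previousKey !== undefined &&
    matches(given, sign(payload, keys.previousKey))
  ) {
    return "previous";
  }
  return null;
}
```

Counting how often `"previous"` is returned gives a direct signal for the failure in this row: if that count is still non-zero when the old key is about to be dropped, the overlap is too short.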

1.5 Cost / quota failures

| # | Failure | Detection | User impact (C) | Mitigation | Runbook |
|---|---|---|---|---|---|
| F-25 | Cloud Run cost spike (>2× monthly forecast) | Billing alert at 50% / 80% / 100% / 120% of monthly budget | None directly | On-call investigates traffic mix; suspected DDoS triggers Cloud Armor rate-limit ratchet. | platform/cost-spike |
| F-26 | reCAPTCHA quota exhausted | reCAPTCHA error rate > 1% | Bot challenges fail-open by policy (legitimate users not blocked) | Quota raised in advance for campaigns; fall-back to in-house score-only check (lower confidence) configured behind a flag. | bff-consumer/recaptcha-quota |

2. Failure decision tree

incoming request
│
├── Cloud Armor block? ── yes ──► 403 (no telemetry)
│
├── Bot pre-screen? ── high score ──► 429 + reCAPTCHA challenge
│
├── Memorystore down? ── yes ──► proceed with ephemeral session, log warn
│
├── upstream fanout
│   ├── all upstreams ok ──► 200 with full VM
│   ├── pricing partial / slow ──► 200 with priceFromCheapest=null + meta.partial=true
│   ├── theme-peek down ──► 200 with brand placeholder
│   ├── property down ──► serve stale (≤ 30 min) OR 502
│   ├── search-agg down ──► serve stale OR 503
│   └── schema drift detected ──► 502 + alert + parser tracelog
│
└── outbox enqueue
    ├── Postgres ok ──► 200 (telemetry async)
    └── Postgres down ──► best-effort in-memory queue (256), client banner
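
The upstream-fanout branch of the tree can be read as a pure function from fanout outcome to response shape. The sketch below is one possible reading, not the service's code: the type and function names are hypothetical, the precedence (schema drift first, then primary upstream down, then partial degradation) is an assumption, and the stale windows are taken from rows F-1 (10 min for search) and F-4 (30 min for detail).

```typescript
// Sketch of the "upstream fanout" branch as a pure decision function.
// Names and branch precedence are assumptions; stale windows follow F-1 / F-4.
type Route = "search" | "detail";

interface FanoutResult {
  primaryUp: boolean;         // search-agg for "search", property-service for "detail"
  schemaDrift: boolean;       // upstream response failed schema validation
  pricingComplete: boolean;   // false when the pricing fanout was deadlined
  themePeekUp: boolean;       // false when theme-config-service is slow or down
  cachedAgeMs: number | null; // age of the last good payload, null if none
}

interface Decision {
  status: 200 | 502 | 503;
  stale: boolean;
  partial: boolean;
  brandPlaceholder: boolean;
}

const MAX_STALE_MS: Record<Route, number> = {
  search: 10 * 60 * 1000, // F-1: up to 10 min stale-while-error
  detail: 30 * 60 * 1000, // F-4: served stale up to 30 min
};

function decide(route: Route, r: FanoutResult): Decision {
  const fail = (status: 502 | 503): Decision => ({
    status,
    stale: false,
    partial: false,
    brandPlaceholder: false,
  });

  if (r.schemaDrift) return fail(502); // alerting and logging happen elsewhere

  if (!r.primaryUp) {
    const servable =
      r.cachedAgeMs !== null && r.cachedAgeMs <= MAX_STALE_MS[route];
    if (servable) {
      return { status: 200, stale: true, partial: false, brandPlaceholder: false };
    }
    return fail(route === "search" ? 503 : 502);
  }

  // Primary upstream is healthy: degrade gracefully on secondary failures.
  return {
    status: 200,
    stale: false,
    partial: !r.pricingComplete,
    brandPlaceholder: !r.themePeekUp,
  };
}
```

Keeping the decision free of I/O makes each branch of the tree directly unit-testable, which also suits the game-day exercises in section 5.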

3. Blast radius matrix

| Failure | Consumer surface | Tenant booking surface | Backoffice surface | Other tenants |
|---|---|---|---|---|
| Memorystore down | Degraded (sessions ephemeral) | None | None | Cross-tenant: same |
| Cloud SQL down | Mutating endpoints fail | None | None | Cross-tenant: same |
| search-aggregation-service down | Search degraded | None | None | All consumer tenants |
| theme-config-service down | Brand peek missing | Tenant booking BFF likely also impacted | Backoffice BFF likely also impacted | All tenants |
| HMAC key rotation skew | Recent handoffs broken | Sees HANDOFF_INVALID | None | All consumer tenants |
| Bot FP spike | Conversion drop | None | None | Variable |

4. Recovery objectives

| Objective | Target |
|---|---|
| RPO (data loss tolerance) | 5 min (Postgres PITR; Memorystore is cache) |
| RTO (regional failover) | 30 min (DNS + Cloud Run redeploy in DR region) |
| Mean time to detect (MTTD) for P1 alerts | < 2 min |
| Mean time to acknowledge (MTTA) for P1 alerts | < 5 min |
| Mean time to mitigate (MTTM) for P1 alerts | < 30 min |

5. Game-day exercises

Quarterly chaos game-days exercise the catalogue rows in randomized order. Findings feed back into runbooks. Recent exercises are tracked in services/bff-consumer-service/_chaos/ with date, owner, and remediation backlog.