FAILURE_MODES — bff-consumer-service
Sibling: APPLICATION_LOGIC · OBSERVABILITY · SECURITY_MODEL · TESTING_STRATEGY
Cross-cutting: 02 Enterprise Architecture · §10 Failure Posture · Standards · ERROR_CODES
This document catalogues what breaks, how the user experiences the break, how we detect it, and how we mitigate it. Every row is paired with an alert and a runbook reference. The BFF is stateless on the hot path — almost every failure is recoverable by retry, cache fallback, or graceful degradation. The most dangerous failure is silent schema drift from upstream services.
User impact column legend: C = consumer web/mobile (this BFF's only direct consumer surface). The Electron desktop and tenant booking surface are never affected by this BFF's failures (they use other BFFs).
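
The cache-fallback posture described above can be sketched as a pure decision function. This is a minimal illustration, not the service's actual code: the type names, `classify`, and `decide` are hypothetical. The 5 min TTL and 30 min stale window come from row F-4. Measuring both from the time the entry was stored is an assumption.

```typescript
// Sketch only: all names here are illustrative.
type Freshness = "fresh" | "stale-usable" | "expired";

interface CacheEntry<T> {
  value: T;
  storedAtMs: number; // epoch millis when the payload was cached
}

interface StalePolicy {
  ttlMs: number;      // serve from cache without calling upstream while younger than this
  maxStaleMs: number; // serve only while upstream is failing, up to this age (assumed measured from storedAtMs)
}

// Values from row F-4 (hotel detail): TTL 5 min, stale up to 30 min.
const HOTEL_DETAIL_POLICY: StalePolicy = {
  ttlMs: 5 * 60_000,
  maxStaleMs: 30 * 60_000,
};

function classify<T>(entry: CacheEntry<T>, nowMs: number, policy: StalePolicy): Freshness {
  const ageMs = nowMs - entry.storedAtMs;
  if (ageMs <= policy.ttlMs) return "fresh";
  if (ageMs <= policy.maxStaleMs) return "stale-usable";
  return "expired";
}

type Served<T> =
  | { kind: "hit"; value: T; stale: boolean }
  | { kind: "fetch" }        // caller should call upstream
  | { kind: "unavailable" }; // nothing usable; caller returns 502/503

function decide<T>(
  entry: CacheEntry<T> | undefined,
  upstreamHealthy: boolean,
  nowMs: number,
  policy: StalePolicy,
): Served<T> {
  if (entry !== undefined) {
    const freshness = classify(entry, nowMs, policy);
    if (freshness === "fresh") return { kind: "hit", value: entry.value, stale: false };
    // Stale payloads are served only when upstream cannot answer.
    if (freshness === "stale-usable" && !upstreamHealthy) {
      return { kind: "hit", value: entry.value, stale: true };
    }
  }
  return upstreamHealthy ? { kind: "fetch" } : { kind: "unavailable" };
}
```

Keeping the decision free of I/O makes the TTL boundaries easy to unit-test with a fixed clock.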
1. Failure catalogue
1.1 Upstream service failures
| # | Failure | Detection | User impact (C) | Mitigation | Runbook |
|---|---|---|---|---|---|
| F-1 | search-aggregation-service 5xx > 1% / 1 min | Cloud Monitoring alert on upstream error rate per route | Search list view returns 503 + MELMASTOON.BFF.UPSTREAM_UNAVAILABLE; map view degrades to last cached result with stale=true banner | Circuit breaker opens after 5 consecutive failures (4-second half-open probe). Cache-first responder serves last good payload up to 10 min stale-while-error. | bff-consumer/upstream-search-down |
| F-2 | search-aggregation-service p95 > 1500 ms | Latency SLO burn alert | Search slower; partial results mark priceFromCheapest=null (pricing fanout was deadlined) | Per-call deadline 1500 ms, hedged retry on second connection after 800 ms (idempotent only). | bff-consumer/upstream-search-slow |
| F-3 | pricing-service /quotes/preview timeout > 10% of fanouts | Per-upstream success-rate alert | Listings show without "from $X" badge; banner: "prices loading" | Async price-enrich pattern: search returns first; pricing fanout has 800 ms wall budget; missing prices marked null. | bff-consumer/pricing-preview-degraded |
| F-4 | property-service 5xx | Per-route alert on /hotels/{id} | Hotel detail returns 502 + MELMASTOON.BFF.UPSTREAM_UNAVAILABLE | Cache miss falls through; previously cached detail (TTL 5 min) is served stale up to 30 min when upstream is down. | bff-consumer/property-detail-down |
| F-5 | theme-config-service slow / down | Per-upstream alert | Brand peek (logo + color) missing on detail card; placeholder shown | Detail composition continues without brand peek; brand placeholder asset rendered. Cache TTL 15 min absorbs short outages. | bff-consumer/theme-peek-degraded |
| F-6 | tenant-service slug → tenantId resolution failing | Lookup error rate alert | Handoff requests fail with 502 | Slug → tenantId resolution is cached for 1 h; outage of <1 h is invisible. >1 h outage degrades handoff to error. | bff-consumer/tenant-resolver-down |
| F-7 | bff-tenant-booking-service /internal/handoff/{id}/consume rejects | Handoff success-rate alert | Guest sees 502 on landing page after redirect (handled by tenant BFF) | This BFF only mints; consumer impact is on the tenant BFF side. We log replay rejection and increment a counter for SRE. | bff-tenant/handoff-consume-rejected |
| F-8 | Schema drift from any upstream (e.g., search-aggregation-service adds non-nullable field) | Zod parse failure on response with MELMASTOON.BFF.SCHEMA_DRIFT; alert on first occurrence | Affected route 502s | Parser logs full payload (sanitized) at WARN; on-call investigates within 15 min; provider rolled back or BFF schema patched. | bff/schema-drift |
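
The circuit breaker in row F-1 could look roughly like the state machine below. It is a sketch under stated assumptions: the class and method names are hypothetical, and "4-second half-open probe" is read as "after 4 s in the open state, let exactly one probe request through". The threshold of 5 consecutive failures is taken from the table.

```typescript
// Sketch only: names are illustrative, not the BFF's real breaker.
type BreakerState = "closed" | "open" | "half-open";

class CircuitBreaker {
  private state: BreakerState = "closed";
  private consecutiveFailures = 0;
  private openedAtMs = 0;
  private readonly failureThreshold: number;
  private readonly probeAfterMs: number;
  private readonly now: () => number;

  constructor(failureThreshold = 5, probeAfterMs = 4_000, now: () => number = Date.now) {
    this.failureThreshold = failureThreshold; // F-1: 5 consecutive failures
    this.probeAfterMs = probeAfterMs;         // F-1: 4-second half-open probe (interpretation assumed)
    this.now = now;                           // injectable clock for tests
  }

  // Returns true when the caller may send a request upstream.
  allowRequest(): boolean {
    if (this.state === "closed") return true;
    if (this.state === "open" && this.now() - this.openedAtMs >= this.probeAfterMs) {
      this.state = "half-open"; // admit a single probe
      return true;
    }
    return false; // still cooling down, or a probe is already in flight
  }

  recordSuccess(): void {
    this.consecutiveFailures = 0;
    this.state = "closed";
  }

  recordFailure(): void {
    this.consecutiveFailures += 1;
    // A failed probe reopens immediately; otherwise open at the threshold.
    if (this.state === "half-open" || this.consecutiveFailures >= this.failureThreshold) {
      this.state = "open";
      this.openedAtMs = this.now();
    }
  }

  current(): BreakerState {
    return this.state;
  }
}
```

While `allowRequest()` returns false, the caller would fall through to the cache-first responder rather than call upstream.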
1.2 Stateful dependency failures
| # | Failure | Detection | User impact (C) | Mitigation | Runbook |
|---|---|---|---|---|---|
| F-9 | Memorystore (Redis) regional outage | Health check fail; Redis client error rate > 5% | Sessions cannot be read/written; cache misses cascade to upstream; latency rises sharply | Sessions are best-effort: if Memorystore is down, requests proceed with an in-memory ephemeral session and a Set-Cookie retry banner. Upstream call rate triples → Cloud Run autoscale absorbs. Aggressive single-flight at the application layer prevents stampede. | bff-consumer/memorystore-down |
| F-10 | Memorystore failover (replica promoted) | Replica failover event from GCP | < 30 s of elevated latency; some sessions reset to first-touch state | No active mitigation required: client cookies regenerate the session, and idempotency keys preserve mutating-route correctness. | bff-consumer/memorystore-failover |
| F-11 | Cloud SQL primary down | Postgres health check fail | Mutations (wishlist, handoff, telemetry, locale, currency) fail with 503; reads (search, detail) unaffected | Cloud SQL HA failover (~ 60 s); Idempotency-Key on retried mutations ensures correctness. Outbox writes block until Postgres is back; in-memory buffer of 256 events absorbs short blip. | bff-consumer/postgres-down |
| F-12 | Outbox table growing unbounded (Pub/Sub publisher down) | Outbox-depth alert at 5k / 50k / 250k rows | None visible to user; risk of disk pressure | outbox-relay worker retries with exponential backoff; on-call investigates Pub/Sub region health; manual flush script available. | platform/outbox-backlog |
| F-13 | Pub/Sub publish 100% failure | Publish error rate alert | None visible | All telemetry queues in outbox; redrives when Pub/Sub recovers; no event loss. Handoff and search work unaffected (telemetry is async). | platform/pubsub-publish-down |
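
The 256-event in-memory buffer from row F-11 might be sketched as a bounded FIFO. The names below are hypothetical. The overflow policy (reject the new event and count it) is an assumption, since the catalogue does not say what happens beyond 256 events. The synchronous `write` callback is a simplification of what would be an async Postgres insert.

```typescript
// Sketch only: illustrative names; overflow behaviour is assumed, not documented.
interface OutboxEvent {
  id: string;
  payload: unknown;
}

class OutboxBuffer {
  private readonly queue: OutboxEvent[] = [];
  private readonly capacity: number;
  private rejected = 0;

  constructor(capacity = 256) {
    this.capacity = capacity; // F-11: buffer of 256 events
  }

  // Returns false when the buffer is full; the caller decides how to surface that.
  offer(event: OutboxEvent): boolean {
    if (this.queue.length >= this.capacity) {
      this.rejected += 1;
      return false;
    }
    this.queue.push(event);
    return true;
  }

  // Flushes in FIFO order and stops at the first failed write so ordering is preserved.
  drain(write: (event: OutboxEvent) => boolean): number {
    let written = 0;
    while (this.queue.length > 0) {
      const head = this.queue[0];
      if (head === undefined || !write(head)) break;
      this.queue.shift();
      written += 1;
    }
    return written;
  }

  get size(): number {
    return this.queue.length;
  }

  get rejectedCount(): number {
    return this.rejected;
  }
}
```

A rejected-event counter gives on-call a direct signal that the outage outlasted the buffer.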
1.3 Edge / ingress failures
| # | Failure | Detection | User impact (C) | Mitigation | Runbook |
|---|---|---|---|---|---|
| F-14 | Cloud CDN cache poisoned (incorrect Vary configuration) | Synthetic monitor cross-locale check fails | Some users see another locale's content | Vary header explicitly enumerates Accept-Language, X-Currency, Accept-Encoding. Synthetic test runs every 5 min. Mitigation: purge CDN edge for affected paths; root-cause Vary regression. | bff-consumer/cdn-poisoning |
| F-15 | Cloud Armor false-positive WAF rule | Spike in 403 from legitimate traffic | Some users blocked; banner: "service temporarily unavailable" | Cloud Armor rules deployed via Terraform with staging soak first; hotfix is rule-disable + rollout. | bff-consumer/waf-false-positive |
| F-16 | Geo-block misconfiguration | Spike in 451 from a region | Region's traffic blocked | Geo-block list maintained in Terraform with quarterly review; rollback via PR + Terraform apply. | bff-consumer/geo-block-misfire |
| F-17 | TLS cert expiry | Cloud Monitoring uptime check fails on TLS handshake | All requests fail | Cert auto-renewal via GCLB; alert at T-30 days, T-7 days, T-1 day. Manual issue path documented. | platform/tls-expiry |
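
One check the synthetic monitor for F-14 could run is that every cacheable response still lists the three headers named in the table. The helper below is a hypothetical sketch of that check (function names are illustrative). It inspects only the `Vary` response header and says nothing about how the CDN itself is configured.

```typescript
// Sketch only: the header list comes from row F-14; function names are illustrative.
const REQUIRED_VARY = ["Accept-Language", "X-Currency", "Accept-Encoding"] as const;

// Header field names are case-insensitive, so normalise to lower case.
function parseVary(headerValue: string | null): Set<string> {
  if (headerValue === null) return new Set<string>();
  return new Set(
    headerValue
      .split(",")
      .map((name) => name.trim().toLowerCase())
      .filter((name) => name.length > 0),
  );
}

// Returns the required names that are absent; an empty array means the check passes.
function missingVaryFields(headerValue: string | null): string[] {
  const present = parseVary(headerValue);
  return REQUIRED_VARY.filter((name) => !present.has(name.toLowerCase()));
}
```

For example, `missingVaryFields(response.headers.get("vary"))` on a `fetch` response yields the names to report in the alert.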
1.4 Application-layer failures
| # | Failure | Detection | User impact (C) | Mitigation | Runbook |
|---|---|---|---|---|---|
| F-18 | HMAC signing key rotation skew (old key removed before all clients drained) | Surge in MELMASTOON.BFF.CONSUMER.HANDOFF_INVALID on the tenant BFF | Recent handoff links broken | HMAC verifier accepts both currentKey and previousKey for 7 days; rotation drill runs quarterly; SRE checklist enforces 7-day overlap. | bff-consumer/hmac-rotation-skew |
| F-19 | Bot detector false-positive (real users CAPTCHA-challenged) | False-positive rate > 0.5% via synthetic + sampled human traffic | Real users hit reCAPTCHA challenge; conversion drops | Detector thresholds tunable per-tenant via Memorystore feature flag; rollback path is to lower aggression score; emergency kill-switch flag disableBotDetector. | bff-consumer/bot-fp-spike |
| F-20 | Stampede control failure (single-flight collapse breaks) | Upstream RPS to search-aggregation-service > 5× steady-state for the same query hash | Upstream may overload | Tracked via the cache.stampede.failures metric. The single-flight wrapper falls back to per-request execution and logs a critical alert. Post-mortem required. | bff-consumer/stampede-control |
| F-21 | Cookie too large (session bloat) | cookie.size.bytes p99 alert > 3500 B | Browsers may reject; clients see lost session | Session blob stored server-side in Memorystore; cookie carries only the ULID. Wishlist and recently-viewed never serialized into cookie. | bff-consumer/cookie-bloat |
| F-22 | View-model breaking schema change pushed without /v2 | Contract test failure on PR | None in prod (CI gate) | Pact provider verification gate; consumer apps publish pacts on every release; CI fails before merge. | bff/contract-drift |
| F-23 | Memory leak (long-soak) | Heap RSS upward trend > 10% / 4 h | Pod restarts; brief latency blip | Cloud Run min instance + readiness check; rolling restart; root-cause via heap snapshot. Long-soak load test in pre-prod. | bff-consumer/memory-leak |
| F-24 | Outbox publisher dead-letter spike (poison messages) | DLQ depth alert | None | DLQ has 7-day retention; SRE inspects payload; manual replay or quarantine. | platform/dlq-spike |
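
The dual-key acceptance window in row F-18 can be illustrated with Node's `crypto` module. This is a sketch, not the tenant BFF's verifier: the `KeyRing` shape, the function names, hex-encoded signatures, and the choice of SHA-256 are all assumptions, since the catalogue specifies only HMAC with `currentKey` and `previousKey` and a 7-day overlap.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Sketch only: shapes and names are illustrative; SHA-256 and hex encoding are assumed.
interface KeyRing {
  currentKey: string;
  previousKey?: string;
  rotatedAtMs: number; // epoch millis when currentKey became active
}

const OVERLAP_MS = 7 * 24 * 60 * 60 * 1000; // F-18: previousKey accepted for 7 days

function sign(payload: string, key: string): Buffer {
  return createHmac("sha256", key).update(payload).digest();
}

function matches(payload: string, signatureHex: string, key: string): boolean {
  const expected = sign(payload, key);
  const given = Buffer.from(signatureHex, "hex");
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return given.length === expected.length && timingSafeEqual(given, expected);
}

type VerifyResult = "current" | "previous" | "invalid";

function verifyHandoff(
  payload: string,
  signatureHex: string,
  ring: KeyRing,
  nowMs: number,
): VerifyResult {
  if (matches(payload, signatureHex, ring.currentKey)) return "current";
  const inOverlap = nowMs - ring.rotatedAtMs <= OVERLAP_MS;
  if (ring.previousKey !== undefined && inOverlap && matches(payload, signatureHex, ring.previousKey)) {
    return "previous";
  }
  return "invalid";
}
```

Returning which key matched lets the verifier emit a metric for `previous` hits, which shows when it is safe to retire the old key.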
1.5 Cost / quota failures
| # | Failure | Detection | User impact (C) | Mitigation | Runbook |
|---|---|---|---|---|---|
| F-25 | Cloud Run cost spike (>2× monthly forecast) | Billing alert at 50% / 80% / 100% / 120% of monthly budget | None directly | On-call investigates traffic mix; suspected DDoS triggers Cloud Armor rate-limit ratchet. | platform/cost-spike |
| F-26 | reCAPTCHA quota exhausted | reCAPTCHA error rate > 1% | Bot challenges fail-open by policy (legitimate users not blocked) | Quota raised in advance for campaigns; fall-back to in-house score-only check (lower confidence) configured behind a flag. | bff-consumer/recaptcha-quota |
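
The fail-open policy in row F-26 could be expressed as a small decision function. Everything named here is hypothetical, including the flag shape and the in-house score threshold. The sketch assumes an in-house score in the range 0 to 1 where higher means more bot-like, matching the "high score" wording in the decision tree.

```typescript
// Sketch only: flag names, score scale, and threshold are assumptions.
type RecaptchaOutcome = "passed" | "failed" | "unavailable";
type ChallengeDecision = "allow" | "deny";

interface BotPolicyFlags {
  inHouseFallbackEnabled: boolean; // the flag-guarded score-only fallback from F-26
  inHouseDenyThreshold: number;    // hypothetical cut-off on the in-house score
}

function resolveChallenge(
  outcome: RecaptchaOutcome,
  inHouseScore: number,
  flags: BotPolicyFlags,
): ChallengeDecision {
  if (outcome === "passed") return "allow";
  if (outcome === "failed") return "deny";
  // reCAPTCHA could not answer (for example, quota exhausted).
  if (flags.inHouseFallbackEnabled && inHouseScore >= flags.inHouseDenyThreshold) {
    return "deny"; // lower-confidence in-house verdict
  }
  return "allow"; // fail open: legitimate users are not blocked
}
```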
2. Failure decision tree
incoming request
│
├── Cloud Armor block? ── yes ──► 403 (no telemetry)
│
├── Bot pre-screen? ── high score ──► 429 + reCAPTCHA challenge
│
├── Memorystore down? ── yes ──► proceed with ephemeral session, log warn
│
├── upstream fanout
│ ├── all upstreams ok ──► 200 with full VM
│ ├── pricing partial / slow ──► 200 with priceFromCheapest=null + meta.partial=true
│ ├── theme-peek down ──► 200 with brand placeholder
│ ├── property down ──► serve stale (≤ 30 min) OR 502
│ ├── search-agg down ──► serve stale OR 503
│ └── schema drift detected ──► 502 + alert + parser tracelog
│
└── outbox enqueue
  ├── Postgres ok ──► 200 (telemetry async)
  └── Postgres down ──► best-effort in-memory queue (256), client banner
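
The upstream-fanout branch of the tree above maps naturally onto a pure function. The sketch below is one possible encoding under stated assumptions: the type and field names are hypothetical, schema drift is checked first, a stale payload is returned with status 200, and a stale payload is served as cached without recomputing the partial or placeholder flags. The tree itself does not fix any of these.

```typescript
// Sketch only: names and branch precedence are assumptions.
interface FanoutResult {
  primary: "search" | "property"; // upstream that owns the route's core payload
  primaryState: "ok" | "down";
  staleAvailable: boolean;        // a cached payload within the stale window exists
  pricing: "ok" | "partial";
  themePeek: "ok" | "down";
  schemaDrift: boolean;
}

interface Outcome {
  status: 200 | 502 | 503;
  stale: boolean;
  partial: boolean;          // maps to meta.partial=true
  brandPlaceholder: boolean;
  errorCode?: string;
}

function resolveFanout(r: FanoutResult): Outcome {
  if (r.schemaDrift) {
    return {
      status: 502, stale: false, partial: false, brandPlaceholder: false,
      errorCode: "MELMASTOON.BFF.SCHEMA_DRIFT",
    };
  }
  if (r.primaryState === "down") {
    if (r.staleAvailable) {
      return { status: 200, stale: true, partial: false, brandPlaceholder: false };
    }
    return {
      status: r.primary === "search" ? 503 : 502, // search-agg down: 503; property down: 502
      stale: false, partial: false, brandPlaceholder: false,
      errorCode: "MELMASTOON.BFF.UPSTREAM_UNAVAILABLE",
    };
  }
  // Primary is healthy: secondary upstreams only degrade the view model.
  return {
    status: 200,
    stale: false,
    partial: r.pricing === "partial",
    brandPlaceholder: r.themePeek === "down",
  };
}
```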
3. Blast radius matrix
| Failure | Consumer surface | Tenant booking surface | Backoffice surface | Other tenants |
|---|---|---|---|---|
| Memorystore down | Degraded (sessions ephemeral) | None | None | Same impact across all tenants |
| Cloud SQL down | Mutating endpoints fail | None | None | Same impact across all tenants |
| search-aggregation-service down | Search degraded | None | None | All consumer tenants |
| theme-config-service down | Brand peek missing | Tenant booking BFF likely also impacted | Backoffice BFF likely also impacted | All tenants |
| HMAC key rotation skew | Recent handoffs broken | Sees HANDOFF_INVALID | None | All consumer tenants |
| Bot FP spike | Conversion drop | None | None | Variable |
4. Recovery objectives
| Objective | Target |
|---|---|
| RPO (data loss tolerance) | 5 min (Postgres PITR; Memorystore is cache) |
| RTO (regional failover) | 30 min (DNS + Cloud Run redeploy in DR region) |
| Mean time to detect (MTTD) for P1 alerts | < 2 min |
| Mean time to acknowledge (MTTA) for P1 alerts | < 5 min |
| Mean time to mitigate (MTTM) for P1 alerts | < 30 min |
5. Game-day exercises
Quarterly chaos game-days exercise the catalogue rows in randomized order. Findings feed back into runbooks. Recent exercises are tracked in services/bff-consumer-service/_chaos/ with date, owner, and remediation backlog.