FAILURE_MODES — bff-consumer-service

Sibling: APPLICATION_LOGIC · OBSERVABILITY · SECURITY_MODEL · TESTING_STRATEGY

Cross-cutting: 02 Enterprise Architecture · §10 Failure Posture · Standards · ERROR_CODES

This document catalogues what breaks, how the user experiences the break, how we detect it, and how we mitigate it. Every row is paired with an alert and a runbook reference. The BFF is stateless on the hot path — almost every failure is recoverable by retry, cache fallback, or graceful degradation. The most dangerous failure is silent schema drift from upstream services.

User impact column legend: C = consumer web/mobile (this BFF's only direct consumer surface). The Electron desktop and tenant booking surface are never affected by this BFF's failures (they use other BFFs).

1. Failure catalogue

1.1 Upstream service failures

#	Failure	Detection	User impact (C)	Mitigation	Runbook
F-1	`search-aggregation-service` 5xx > 1% / 1 min	Cloud Monitoring alert on upstream error rate per route	Search list view returns 503 + `MELMASTOON.BFF.UPSTREAM_UNAVAILABLE`; map view degrades to last cached result with `stale=true` banner	Circuit breaker opens after 5 consecutive failures (4-second half-open probe). Cache-first responder serves last good payload up to 10 min `stale-while-error`.	`bff-consumer/upstream-search-down`
F-2	`search-aggregation-service` p95 > 1500 ms	Latency SLO burn alert	Search slower; partial results mark `priceFromCheapest=null` (pricing fanout was deadlined)	Per-call deadline 1500 ms, hedged retry on second connection after 800 ms (idempotent only).	`bff-consumer/upstream-search-slow`
F-3	`pricing-service` `/quotes/preview` timeout > 10% of fanouts	Per-upstream success-rate alert	Listings show without "from $X" badge; banner: "prices loading"	Async price-enrich pattern: search returns first; pricing fanout has 800 ms wall budget; missing prices marked null.	`bff-consumer/pricing-preview-degraded`
F-4	`property-service` 5xx	Per-route alert on `/hotels/{id}`	Hotel detail returns 502 + `MELMASTOON.BFF.UPSTREAM_UNAVAILABLE`	Cache miss falls through; previously cached detail (TTL 5 min) is served stale up to 30 min when upstream is down.	`bff-consumer/property-detail-down`
F-5	`theme-config-service` slow / down	Per-upstream alert	Brand peek (logo + color) missing on detail card; placeholder shown	Detail composition continues without brand peek; brand placeholder asset rendered. Cache TTL 15 min absorbs short outages.	`bff-consumer/theme-peek-degraded`
F-6	`tenant-service` slug → tenantId resolution failing	Lookup error rate alert	Handoff requests fail with 502	Slug → tenantId resolution is cached for 1 h; outage of <1 h is invisible. >1 h outage degrades handoff to error.	`bff-consumer/tenant-resolver-down`
F-7	`bff-tenant-booking-service` `/internal/handoff/{id}/consume` rejects	Handoff success-rate alert	Guest sees 502 on landing page after redirect (handled by tenant BFF)	This BFF only mints; consumer impact is on the tenant BFF side. We log replay rejection and increment a counter for SRE.	`bff-tenant/handoff-consume-rejected`
F-8	Schema drift from any upstream (e.g., `search-aggregation-service` adds non-nullable field)	Zod parse failure on response with `MELMASTOON.BFF.SCHEMA_DRIFT`; alert on first occurrence	Affected route 502s	Parser logs full payload (sanitized) at WARN; on-call investigates within 15 min; provider rolled back or BFF schema patched.	`bff/schema-drift`

1.2 Stateful dependency failures

#	Failure	Detection	User impact (C)	Mitigation	Runbook
F-9	Memorystore (Redis) regional outage	Health check fail; Redis client error rate > 5%	Sessions cannot be read/written; cache misses cascade to upstream; latency rises sharply	Sessions are best-effort: if Memorystore is down, requests proceed with an in-memory ephemeral session and a `Set-Cookie` retry banner. Upstream call rate triples → Cloud Run autoscale absorbs. Aggressive single-flight at the application layer prevents stampede.	`bff-consumer/memorystore-down`
F-10	Memorystore failover (replica promoted)	Replica failover event from GCP	< 30 s of elevated latency; some sessions reset to first-touch state	No mitigation; client cookies regenerate session; idempotency keys preserve mutating-route correctness.	`bff-consumer/memorystore-failover`
F-11	Cloud SQL primary down	Postgres health check fail	Mutations (wishlist, handoff, telemetry, locale, currency) fail with 503; reads (search, detail) unaffected	Cloud SQL HA failover (~ 60 s); Idempotency-Key on retried mutations ensures correctness. Outbox writes block until Postgres is back; in-memory buffer of 256 events absorbs short blip.	`bff-consumer/postgres-down`
F-12	Outbox table growing unbounded (Pub/Sub publisher down)	Outbox-depth alert at 5k / 50k / 250k rows	None visible to user; risk of disk pressure	`outbox-relay` worker retries with exponential backoff; on-call investigates Pub/Sub region health; manual flush script available.	`platform/outbox-backlog`
F-13	Pub/Sub publish 100% failure	Publish error rate alert	None visible	All telemetry queues in outbox; redrives when Pub/Sub recovers; no event loss. Handoff and search work unaffected (telemetry is async).	`platform/pubsub-publish-down`

1.3 Edge / ingress failures

#	Failure	Detection	User impact (C)	Mitigation	Runbook
F-14	Cloud CDN cache poisoned (incorrect Vary configuration)	Synthetic monitor cross-locale check fails	Some users see another locale's content	Vary header explicitly enumerates `Accept-Language, X-Currency, Accept-Encoding`. Synthetic test runs every 5 min. Mitigation: purge CDN edge for affected paths; root-cause Vary regression.	`bff-consumer/cdn-poisoning`
F-15	Cloud Armor false-positive WAF rule	Spike in 403 from legitimate traffic	Some users blocked; banner: "service temporarily unavailable"	Cloud Armor rules deployed via Terraform with staging soak first; hotfix is rule-disable + rollout.	`bff-consumer/waf-false-positive`
F-16	Geo-block misconfiguration	Spike in 451 from a region	Region's traffic blocked	Geo-block list maintained in Terraform with quarterly review; rollback via PR + Terraform apply.	`bff-consumer/geo-block-misfire`
F-17	TLS cert expiry	Cloud Monitoring uptime check fails on TLS handshake	All requests fail	Cert auto-renewal via GCLB; alert at T-30 days, T-7 days, T-1 day. Manual issue path documented.	`platform/tls-expiry`

1.4 Application-layer failures

#	Failure	Detection	User impact (C)	Mitigation	Runbook
F-18	HMAC signing key rotation skew (old key removed before all clients drained)	Surge in `MELMASTOON.BFF.CONSUMER.HANDOFF_INVALID` on the tenant BFF	Recent handoff links broken	HMAC verifier accepts both `currentKey` and `previousKey` for 7 days; rotation drill runs quarterly; SRE checklist enforces 7-day overlap.	`bff-consumer/hmac-rotation-skew`
F-19	Bot detector false-positive (real users CAPTCHA-challenged)	False-positive rate > 0.5% via synthetic + sampled human traffic	Real users hit reCAPTCHA challenge; conversion drops	Detector thresholds tunable per-tenant via Memorystore feature flag; rollback path is to lower aggression score; emergency kill-switch flag `disableBotDetector`.	`bff-consumer/bot-fp-spike`
F-20	Stampede control failure (single-flight collapse breaks)	Upstream RPS to `search-aggregation-service` > 5× steady-state for the same query hash	Upstream may overload	`cache.stampede.failures` metric; Single-flight wrapper has fallback to per-request execution but logs critical alert. Post-mortem required.	`bff-consumer/stampede-control`
F-21	Cookie too large (session bloat)	`cookie.size.bytes` p99 alert > 3500 B	Browsers may reject; clients see lost session	Session blob stored server-side in Memorystore; cookie carries only the ULID. Wishlist and recently-viewed never serialized into cookie.	`bff-consumer/cookie-bloat`
F-22	View-model breaking schema change pushed without /v2	Contract test failure on PR	None in prod (CI gate)	Pact provider verification gate; consumer apps publish pacts on every release; CI fails before merge.	`bff/contract-drift`
F-23	Memory leak (long-soak)	Heap RSS upward trend > 10% / 4 h	Pod restarts; brief latency blip	Cloud Run min instance + readiness check; rolling restart; root-cause via heap snapshot. Long-soak load test in pre-prod.	`bff-consumer/memory-leak`
F-24	Outbox publisher dead-letter spike (poison messages)	DLQ depth alert	None	DLQ has 7-day retention; SRE inspects payload; manual replay or quarantine.	`platform/dlq-spike`

1.5 Cost / quota failures

#	Failure	Detection	User impact (C)	Mitigation	Runbook
F-25	Cloud Run cost spike (>2× monthly forecast)	Billing alert at 50% / 80% / 100% / 120% of monthly budget	None directly	On-call investigates traffic mix; suspected DDoS triggers Cloud Armor rate-limit ratchet.	`platform/cost-spike`
F-26	reCAPTCHA quota exhausted	reCAPTCHA error rate > 1%	Bot challenges fail-open by policy (legitimate users not blocked)	Quota raised in advance for campaigns; fall-back to in-house score-only check (lower confidence) configured behind a flag.	`bff-consumer/recaptcha-quota`

2. Failure decision tree

incoming request
  │
  ├── Cloud Armor block?  ── yes ──► 403 (no telemetry)
  │
  ├── Bot pre-screen?     ── high score ──► 429 + reCAPTCHA challenge
  │
  ├── Memorystore down?   ── yes ──► proceed with ephemeral session, log warn
  │
  ├── upstream fanout
  │     ├── all upstreams ok          ──► 200 with full VM
  │     ├── pricing partial / slow    ──► 200 with priceFromCheapest=null + meta.partial=true
  │     ├── theme-peek down           ──► 200 with brand placeholder
  │     ├── property down             ──► serve stale (≤ 30 min) OR 502
  │     ├── search-agg down           ──► serve stale OR 503
  │     └── schema drift detected     ──► 502 + alert + parser tracelog
  │
  └── outbox enqueue
        ├── Postgres ok               ──► 200 (telemetry async)
        └── Postgres down             ──► best-effort in-memory queue (256), client banner

3. Blast radius matrix

Failure	Consumer surface	Tenant booking surface	Backoffice surface	Other tenants
Memorystore down	Degraded (sessions ephemeral)	None	None	Cross-tenant: same
Cloud SQL down	Mutating endpoints fail	None	None	Cross-tenant: same
`search-aggregation-service` down	Search degraded	None	None	All consumer tenants
`theme-config-service` down	Brand peek missing	Tenant booking BFF likely also impacted	Backoffice BFF likely also impacted	All tenants
HMAC key rotation skew	Recent handoffs broken	Sees `HANDOFF_INVALID`	None	All consumer tenants
Bot FP spike	Conversion drop	None	None	Variable

4. Recovery objectives

Objective	Target
RPO (data loss tolerance)	5 min (Postgres PITR; Memorystore is cache)
RTO (regional failover)	30 min (DNS + Cloud Run redeploy in DR region)
Mean time to detect (MTTD) for P1 alerts	< 2 min
Mean time to acknowledge (MTTA) for P1 alerts	< 5 min
Mean time to mitigate (MTTM) for P1 alerts	< 30 min

5. Game-day exercises

Quarterly chaos game-days exercise the catalogue rows in randomized order. Findings feed back into runbooks. Recent exercises are tracked in services/bff-consumer-service/_chaos/ with date, owner, and remediation backlog.

1. Failure catalogue​

1.1 Upstream service failures​

1.2 Stateful dependency failures​

1.3 Edge / ingress failures​

1.4 Application-layer failures​

1.5 Cost / quota failures​

2. Failure decision tree​

3. Blast radius matrix​

4. Recovery objectives​

5. Game-day exercises​