FAILURE_MODES — bff-backoffice-service

Sibling: APPLICATION_LOGIC · OBSERVABILITY · SECURITY_MODEL · SYNC_CONTRACT

Cross-cutting: 02 Enterprise Architecture · §10 Failure Posture · Standards · ERROR_CODES

The dominant property of this BFF's failure posture is that the desktop is offline-first. When this BFF (or any of its upstreams) is degraded, the desktop falls back to local SQLite reads and outbox-queued mutations. In the tables below, the user impact column (O) describes the backoffice operator at a hotel desk, working on the Electron desktop. Failures here generally lower the quality of the online experience but never block the operator from running the property.

1. Failure catalogue

1.1 Upstream service failures

| # | Failure | Detection | User impact (O) | Mitigation | Runbook |
|---|---|---|---|---|---|
| F-1 | iam-service.refresh 5xx | Per-route alert | Token expires within 15 min; user re-signs in | DPoP retry on next interval; if persistent, desktop falls back to offline mode | bff-backoffice/iam-refresh-down |
| F-2 | iam-service.attestStepUp 5xx | Per-route alert | Sensitive actions (lock revoke, refund) blocked | Banner shown; non-sensitive flows unaffected | bff-backoffice/mfa-down |
| F-3 | tenant-service slow | Per-upstream alert | Operator resolution slow; preferences mirror stale | 5-min cache absorbs; alert if cache miss + outage | bff-backoffice/tenant-degraded |
| F-4 | reservation-service 5xx | Per-route alert | Workbench reads + check-in/check-out fail (online); offline still works locally | Banner; mutation idempotency absorbs retries | bff-backoffice/reservation-down |
| F-5 | inventory-service 5xx | Per-route alert | Occupancy KPI widget unavailable; rest of dashboard intact | Per-widget skeleton | bff-backoffice/inventory-down |
| F-6 | pricing-service 5xx | Per-route alert | Rate snapshot tile unavailable | Per-widget skeleton | bff-backoffice/pricing-down |
| F-7 | housekeeping-service 5xx | Per-route alert | Housekeeping board unavailable; transitions fail online | Board cache absorbs short outage; mutations queue offline | bff-backoffice/housekeeping-down |
| F-8 | maintenance-service 5xx | Per-route alert | Maintenance board unavailable; same | Same | bff-backoffice/maintenance-down |
| F-9 | billing-service 5xx | Per-route alert | Folio summary tile unavailable; charge mutations fail | Per-widget skeleton; idem-key absorbs retries | bff-backoffice/billing-down |
| F-10 | lock-integration-service 5xx | Per-route alert | Lock issue/revoke fails; alert raised | Audit row outcome=failure; manual workaround per ADR-0004 | bff-backoffice/lock-down |
| F-11 | ai-orchestrator-service 5xx | Per-route alert | AI surfaces hidden | Silent degrade | bff-backoffice/ai-down |
| F-12 | notification-service 5xx | Per-route alert | Alert inbox stale; ack fails | Cache absorbs; ack idempotent retry | bff-backoffice/notif-down |
| F-13 | sync-service.handshake 5xx | Per-route alert | Reconnect blocked; desktop stays offline | Desktop continues offline; retry handshake every 30 s | bff-backoffice/sync-handshake-down |
| F-14 | analytics-service 5xx | Per-upstream alert | Some KPI tiles unavailable | Per-widget skeleton | bff-backoffice/analytics-down |
| F-15 | Schema drift from any upstream | Zod parse failure | Affected route 502 + SCHEMA_DRIFT | On-call within 15 min; provider rolled back or BFF schema patched | bff/schema-drift |
| F-16 | property-service 5xx | Per-upstream alert | Property metadata stale | Cache absorbs | bff-backoffice/property-down |
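
The "per-widget skeleton" mitigation (F-5, F-6, F-9, F-14) amounts to loading each widget independently so that one upstream 5xx cannot fail the whole dashboard. The following is a minimal sketch of that idea, not the service's actual composer: `composeDashboard`, `WidgetLoader`, and the result shapes are hypothetical names introduced for illustration.

```typescript
// Hypothetical shapes for illustration; the real dashboard view model is not shown in this document.
type WidgetResult<T> =
  | { name: string; status: "ok"; data: T }
  | { name: string; status: "unavailable" }; // the client renders a skeleton for this widget

interface WidgetLoader<T> {
  name: string;
  load: () => Promise<T>;
}

interface DashboardVM<T> {
  partial: boolean;
  widgets: WidgetResult<T>[];
}

// Load every widget independently; a rejected loader becomes an "unavailable" entry
// instead of rejecting the whole composition.
async function composeDashboard<T>(loaders: WidgetLoader<T>[]): Promise<DashboardVM<T>> {
  const widgets = await Promise.all(
    loaders.map(async (loader): Promise<WidgetResult<T>> => {
      try {
        return { name: loader.name, status: "ok", data: await loader.load() };
      } catch {
        return { name: loader.name, status: "unavailable" };
      }
    }),
  );
  return { partial: widgets.some((w) => w.status === "unavailable"), widgets };
}
```

With this shape, an inventory-service outage would mark only the occupancy widget as unavailable and set `partial` to true, while the remaining widgets still carry data.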

1.2 Stateful dependency failures

| # | Failure | Detection | User impact (O) | Mitigation | Runbook |
|---|---|---|---|---|---|
| F-17 | Memorystore cache tier down | Health + client errors | Higher upstream load; latency rises | Read-through to upstream; readiness reflects state | bff-backoffice/memorystore-cache-down |
| F-18 | Memorystore session tier down | Health + client errors | Mutating endpoints 503; SSE bus down | Reads fall back direct; no session blob; readiness DOWN; auto-failover to standby | bff-backoffice/memorystore-session-down |
| F-19 | Memorystore failover | Failover event | < 30 s elevated latency; sessions preserved | HA standby; session blobs intact | bff-backoffice/memorystore-failover |
| F-20 | Cloud SQL primary down | Health check | Mutating endpoints 503; idempotency unwritable | HA failover (~ 60 s) | bff-backoffice/postgres-down |
| F-21 | Outbox grows | Depth alert | None visible; storage pressure | Relay retries; SRE investigates | platform/outbox-backlog |
| F-22 | Pub/Sub publish 100% failure | Publish error alert | None visible | Outbox queues; recovers | platform/pubsub-publish-down |
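
The F-17 mitigation ("read-through to upstream") can be pictured as treating any cache error as a miss. The sketch below assumes a hypothetical `Cache` interface and `readThrough` helper; the real cache client and key layout are not described in this document.

```typescript
// Hypothetical cache interface; a real Memorystore client would sit behind it.
interface Cache<T> {
  get(key: string): Promise<T | undefined>;
  set(key: string, value: T): Promise<void>;
}

// Treat any cache error as a miss, so a cache-tier outage raises upstream load
// and latency instead of failing the read.
async function readThrough<T>(
  cache: Cache<T>,
  key: string,
  loadFromUpstream: () => Promise<T>,
): Promise<T> {
  try {
    const hit = await cache.get(key);
    if (hit !== undefined) return hit;
  } catch {
    // Cache tier down: fall through to the upstream call.
  }
  const value = await loadFromUpstream();
  try {
    await cache.set(key, value);
  } catch {
    // Best-effort repopulation; the read itself already succeeded.
  }
  return value;
}
```

This matches the user impact listed for F-17: reads keep working, but every request reaches the upstream while the cache tier is down.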

1.3 Edge / ingress failures

| # | Failure | Detection | User impact (O) | Mitigation | Runbook |
|---|---|---|---|---|---|
| F-23 | TLS cert expiry on backoffice.melmastoon.ghasi.io | TLS handshake monitor | Desktop unable to reach BFF; falls offline | Cert Manager auto-renewal; alerts at T-30/7/1 | bff-backoffice/tls-cert-expiry |
| F-24 | Cloud Armor false-positive blocks legitimate device traffic | 403 spike from device IPs | Some operators blocked; offline-mode survives | Stage soak before prod; rollback Terraform | bff-backoffice/waf-fp |
| F-25 | DNS regression | Synthetic uptime fail | Desktop fleet unreachable; offline fallback | Cloud DNS managed; DR DNS as fallback | bff-backoffice/dns |
| F-26 | DDoS via SSE flood | SSE active conns spike | Instance count climbs; cost spike | Per-device 1-conn cap; Cloud Armor IP rules; circuit-break | bff-backoffice/sse-flood |
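
As an illustration of the per-device 1-conn cap listed for F-26, the sketch below keeps at most one SSE connection per device. It assumes that a new connection evicts the older one; rejecting the newcomer is an equally plausible policy, and this document does not say which the service uses. `SseConnectionRegistry` and its method names are hypothetical.

```typescript
// Tracks the single allowed SSE connection per device.
// deviceId and the close callback are illustrative; the real connection type is not shown here.
class SseConnectionRegistry {
  private readonly active = new Map<string, () => void>();

  // Registers a new connection; an existing connection for the same device is closed first.
  register(deviceId: string, close: () => void): void {
    const previous = this.active.get(deviceId);
    if (previous) previous();
    this.active.set(deviceId, close);
  }

  // Call when a connection ends; removes the entry only if this handle is still the current one,
  // so a late release of an evicted connection cannot drop its replacement.
  release(deviceId: string, close: () => void): void {
    if (this.active.get(deviceId) === close) this.active.delete(deviceId);
  }

  get size(): number {
    return this.active.size;
  }
}
```

Under this policy a device that reconnects in a tight loop never holds more than one open stream, which bounds the connection count by the number of distinct devices.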

1.4 Application-layer failures

| # | Failure | Detection | User impact (O) | Mitigation | Runbook |
|---|---|---|---|---|---|
| F-27 | Lock-action audit log gap | lock_audit_completeness < 100% | Compliance + dispute risk | Hard freeze on lock proxy; SRE investigates within 30 min | bff-backoffice/lock-audit-gap |
| F-28 | DPoP replay accepted | dpop_replay_blocked_total < expected; or audit log shows replays succeeding | Token theft enabled | Page Security; halt traffic to affected revision; restore from prior known-good | bff-backoffice/dpop-replay-bypass |
| F-29 | MFA attestation replayed successfully | mfa.attestation.replay_success (would be 0; nonzero is a bug) | Sensitive action without fresh MFA | Page Security; halt sensitive endpoints | bff-backoffice/mfa-replay-bypass |
| F-30 | Cross-tenant data leakage via cache key collision | Synthetic cross-tenant probe fail | Catastrophic: operator sees another tenant's data | Tenant-scoped cache keys; nightly probe; rollback on alert | bff-backoffice/cross-tenant-leak |
| F-31 | Force-logout SSE event lost | E2E latency > 5 s p95 | Revoked operator continues to act | Refresh-time check (next refresh fails with SESSION_EXPIRED) acts as backstop; investigate within 1 h | bff-backoffice/forcelogout-lag |
| F-32 | Idempotency key replay creates duplicate mutation upstream | idempotency_dedup_total < expected; reservation-service duplicate counter spike | Duplicate room assignment / double charge | Page on-call; halt mutation proxy; investigate | bff-backoffice/idem-bypass |
| F-33 | Dashboard composer infinite-await due to upstream that doesn't honor deadline | dashboard.partial.spike + p95 spike | Many widgets unavailable | Hard deadline enforced via withDeadline; bug in offending fetch; revert | bff-backoffice/dashboard-await-leak |
| F-34 | SSE bus stops delivering after Memorystore failover | sse_first_event_p95 spike | New events not delivered until reconnect | Clients reconnect SSE on stale-detection; fallback polling | bff-backoffice/sse-bus-stall |
| F-35 | device_sync_status.cursor_version divergence from sync-service | Reconciliation job alert | Operators see "stale sync" hint | Reset cache from sync-service authoritative on next handshake | bff-backoffice/cursor-cache-divergence |
| F-36 | OperatorPreferences write-through fails (tenant-service down but BFF cache write succeeds) | Reconciliation alert | Mirror diverges from authority | Write-through transaction must succeed both; failure → 503; cache only updated after upstream OK | bff-backoffice/prefs-divergence |
| F-37 | Memory leak | Heap RSS upward trend > 10% / 4 h | Pod restarts (acceptable, no user impact via Cloud Run min-instances) | Rolling restart; pre-stop drains SSE | bff-backoffice/memory-leak |
| F-38 | DLQ spike | DLQ depth alert | None | DLQ 7 d retention; SRE inspects + replays | platform/dlq-spike |
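
F-33 names a `withDeadline` helper but this document does not show its signature. One plausible shape, given here only as an assumption, races the upstream promise against a timer so the composer stops waiting even when the upstream never settles. `DeadlineExceededError` is a hypothetical name.

```typescript
// Hypothetical error type; the service's actual error taxonomy lives in ERROR_CODES.
class DeadlineExceededError extends Error {
  constructor(ms: number) {
    super(`deadline of ${ms} ms exceeded`);
    this.name = "DeadlineExceededError";
  }
}

// Assumed signature: races the work against a timer and rejects once the deadline passes.
// The timer is cleared as soon as either side settles so it cannot keep the process alive.
function withDeadline<T>(work: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const stop = (): void => {
    if (timer !== undefined) clearTimeout(timer);
  };
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new DeadlineExceededError(ms)), ms);
  });
  return Promise.race([work, deadline]).then(
    (value) => {
      stop();
      return value;
    },
    (error) => {
      stop();
      throw error;
    },
  );
}
```

A race of this kind only stops the caller from waiting; it does not cancel the underlying request. That is consistent with the mitigation column treating the offending fetch as a separate bug to fix or revert.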

1.5 Cost / quota

| # | Failure | Detection | User impact (O) | Mitigation | Runbook |
|---|---|---|---|---|---|
| F-39 | Cloud Run cost spike (instance count run-up) | Billing alert | None directly | Capacity audit; reduce SSE timeout; investigate device fleet | platform/cost-spike |
| F-40 | Pub/Sub egress cost spike from telemetry | Billing alert | None | Reduce sample rates via flag | bff-backoffice/telemetry-cost |
| F-41 | DPoP cache (Memorystore) memory pressure from spike in unique jti | Memory utilization alert | DPoP verify failures | TTL trim; switch to bloom filter for replay check | bff-backoffice/dpop-cache-pressure |
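
To make the F-41 trade-off concrete, here is a minimal in-memory sketch of a jti replay check with a TTL. It is a stand-in for the Memorystore-backed cache, not the service's implementation: `JtiReplayCache`, the TTL value, and the injected clock are all assumptions. It keeps one entry per unique jti until expiry, which is why a spike in unique jti values translates directly into memory pressure.

```typescript
// In-memory stand-in for the Memorystore-backed jti cache; TTL and clock are illustrative.
class JtiReplayCache {
  private readonly seen = new Map<string, number>(); // jti -> expiry (epoch ms)
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // Returns true the first time a jti is seen within the TTL, false on a replay.
  checkAndRecord(jti: string): boolean {
    const t = this.now();
    const expiry = this.seen.get(jti);
    if (expiry !== undefined && expiry > t) return false;
    this.seen.set(jti, t + this.ttlMs);
    return true;
  }

  // Drops expired entries and returns how many were removed.
  sweep(): number {
    const t = this.now();
    let removed = 0;
    this.seen.forEach((expiry, jti) => {
      if (expiry <= t) {
        this.seen.delete(jti);
        removed += 1;
      }
    });
    return removed;
  }
}
```

Shortening `ttlMs` corresponds to the "TTL trim" lever: entries expire sooner, at the cost of a shorter replay-detection window. The bloom filter alternative named in the table would trade exactness for bounded memory and is not sketched here.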

2. Failure decision tree

incoming request
├── Cloud Armor block? ── yes ──► 403 (no telemetry)
├── No Authorization header? ── yes ──► 401 SESSION_REQUIRED
├── DPoP invalid? ── yes ──► 401 DPOP_INVALID
├── Token expired? ── yes ──► 401 SESSION_EXPIRED
├── Device mismatch? ── yes ──► 403 DEVICE_MISMATCH
├── Tenant suspended? ── yes ──► 503 TENANT.SUSPENDED
├── Property out of scope? ── yes ──► 403 PROPERTY_OUT_OF_SCOPE
├── Memorystore session down? ── yes ──► 503 CACHE_UNAVAILABLE
├── route fanout
│   ├── all upstreams ok ──► 200 with full VM
│   ├── widget fanout partial ──► 200 with partial=true
│   ├── reservation-service down (read) ──► route 502 + UPSTREAM_UNAVAILABLE
│   ├── reservation-service down (mutation) ──► 504 + UPSTREAM_TIMEOUT (idem-key absorbs retry)
│   ├── lock-integration-service down ──► 502 + UPSTREAM_UNAVAILABLE; audit row outcome=failure
│   ├── MFA required + missing ──► 401 MFA_REQUIRED
│   ├── MFA invalid/used ──► 401 MFA_INVALID_OR_USED
│   ├── schema drift detected ──► 502 SCHEMA_DRIFT + alert
│   └── ok ──► 200
└── outbox enqueue
    ├── Postgres ok ──► 200 (telemetry async)
    └── Postgres down ──► best-effort; readiness reflects DOWN
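
The pre-fanout part of the tree above reads as an ordered guard chain: the first failing check decides the response. The sketch below encodes that ordering with the status and error codes taken from the tree. The `RequestContext` fields and the `preflight` function are hypothetical names, and the Cloud Armor check is omitted on the assumption that it happens at the edge before the BFF sees the request.

```typescript
// Hypothetical request context; field names are illustrative, the codes come from the tree.
interface RequestContext {
  hasAuthorization: boolean;
  dpopValid: boolean;
  tokenExpired: boolean;
  deviceMatches: boolean;
  tenantSuspended: boolean;
  propertyInScope: boolean;
  sessionStoreUp: boolean;
}

interface Rejection {
  status: number;
  code: string;
}

// Evaluates the guards in tree order and returns the first rejection,
// or null when the request may proceed to route fanout.
function preflight(ctx: RequestContext): Rejection | null {
  if (!ctx.hasAuthorization) return { status: 401, code: "SESSION_REQUIRED" };
  if (!ctx.dpopValid) return { status: 401, code: "DPOP_INVALID" };
  if (ctx.tokenExpired) return { status: 401, code: "SESSION_EXPIRED" };
  if (!ctx.deviceMatches) return { status: 403, code: "DEVICE_MISMATCH" };
  if (ctx.tenantSuspended) return { status: 503, code: "TENANT.SUSPENDED" };
  if (!ctx.propertyInScope) return { status: 403, code: "PROPERTY_OUT_OF_SCOPE" };
  if (!ctx.sessionStoreUp) return { status: 503, code: "CACHE_UNAVAILABLE" };
  return null;
}
```

Because the checks are ordered, a request that is both unauthenticated and out of property scope is answered with 401 SESSION_REQUIRED, never with the later 403.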

3. Blast radius matrix

| Failure | Backoffice surface (O) | Consumer surface | Tenant booking surface | Other tenants |
|---|---|---|---|---|
| Memorystore down | Severe (mutating fails) | None | None | Same (cross-tenant Memorystore) |
| Cloud SQL down | Mutating fails | None | None | Same |
| iam-service down | Refresh fails; sessions hold | Same | Same | All tenants |
| reservation-service down | Mutations fail | None | Hold/confirm fail too | All tenants |
| lock-integration-service down | Lock proxy fails | None | None | All tenants |
| sync-service down | Reconnect blocked; offline survives | None | None | All tenants |
| TLS cert expiry on backoffice domain | Backoffice unreachable | None | None | All tenants |
| WAF FP | Some operators blocked | None | None | Variable |
| DPoP replay bypass | Catastrophic security | None | None | Variable |
| Cross-tenant cache leak | Catastrophic | None | None | Variable |
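
Since Memorystore is shared across tenants, the cross-tenant cache leak row depends on every cache key carrying the tenant. A minimal sketch of a tenant-scoped key builder follows; `tenantCacheKey` and the `tenant:resource:id` layout are assumptions for illustration, not the service's actual key scheme.

```typescript
// Builds a cache key that always starts with the tenant id.
// Each part is percent-encoded so a ":" inside one part cannot shift the boundaries
// and make two different (tenant, resource, id) triples produce the same key.
function tenantCacheKey(tenantId: string, resource: string, id: string): string {
  const parts = [tenantId, resource, id];
  if (parts.some((part) => part.length === 0)) {
    throw new Error("tenantId, resource, and id must all be non-empty");
  }
  return parts.map((part) => encodeURIComponent(part)).join(":");
}
```

For example, `tenantCacheKey("t1", "prefs", "op:9")` yields `t1:prefs:op%3A9`. Refusing empty parts means a missing tenant id fails loudly instead of producing a key that other tenants could also resolve.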

4. Recovery objectives

| Objective | Target |
|---|---|
| RPO | 5 min |
| RTO | 30 min |
| MTTD (P1) | < 2 min |
| MTTA (P1) | < 5 min |
| MTTM (P1) | < 30 min |
| Lock-audit completeness | 100% (no tolerance) |
| Mutation idempotency correctness | 100% |
| Force-logout E2E | < 5 s p95 |
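
As a worked example of the force-logout objective, the sketch below checks a set of end-to-end latency samples against the 5 s p95 target. It assumes the nearest-rank percentile definition; the monitoring stack may interpolate differently, and both function names are hypothetical.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

// True when the p95 of the force-logout latencies (in ms) is under the 5 s target.
function meetsForceLogoutObjective(latenciesMs: number[]): boolean {
  return percentile(latenciesMs, 95) < 5000;
}
```

With 20 samples the p95 is the 19th smallest value, so a single slow outlier does not breach the objective, but two samples at or above 5 s do.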

5. Game-day exercises

Game-day exercises run quarterly. Recent runs are recorded in services/bff-backoffice-service/_chaos/ with date, owner, and remediation backlog. Particular focus on:

  • Lock-integration-service outage with active key issuance.
  • Memorystore session-tier failover during shift change.
  • DPoP replay attempt simulation.
  • Force-logout broadcast to 2,000 simultaneously connected SSE devices.
  • Sync-service handshake outage during multi-device reconnect storm.