FAILURE_MODES — bff-backoffice-service
Sibling: APPLICATION_LOGIC · OBSERVABILITY · SECURITY_MODEL · SYNC_CONTRACT
Cross-cutting: 02 Enterprise Architecture · §10 Failure Posture · Standards · ERROR_CODES
The dominant property of this BFF's failure posture is that the desktop is offline-first. When this BFF (or any of its upstreams) is degraded, the desktop falls back to local SQLite reads and outbox-queued mutations. In the tables below, the "User impact (O)" column describes the impact on O, the backoffice operator at a hotel desk using the Electron desktop. Failures here generally lower the quality of the online experience but never block the operator from running the property.
1. Failure catalogue
1.1 Upstream service failures
| # | Failure | Detection | User impact (O) | Mitigation | Runbook |
|---|---|---|---|---|---|
| F-1 | iam-service.refresh 5xx | Per-route alert | Token expires within 15 min; user must sign in again | DPoP retry on next interval; if persistent, desktop falls back to offline mode | bff-backoffice/iam-refresh-down |
| F-2 | iam-service.attestStepUp 5xx | Per-route alert | Sensitive actions (lock revoke, refund) blocked | Banner shown; non-sensitive flows unaffected | bff-backoffice/mfa-down |
| F-3 | tenant-service slow | Per-upstream alert | Operator resolution slow; preferences mirror stale | 5-min cache absorbs; alert if cache miss + outage | bff-backoffice/tenant-degraded |
| F-4 | reservation-service 5xx | Per-route alert | Workbench reads + check-in/check-out fail (online); offline still works locally | Banner; mutation idempotency absorbs retries | bff-backoffice/reservation-down |
| F-5 | inventory-service 5xx | Per-route alert | Occupancy KPI widget unavailable; rest of dashboard intact | Per-widget skeleton | bff-backoffice/inventory-down |
| F-6 | pricing-service 5xx | Per-route alert | Rate snapshot tile unavailable | Per-widget skeleton | bff-backoffice/pricing-down |
| F-7 | housekeeping-service 5xx | Per-route alert | Housekeeping board unavailable; transitions fail online | Board cache absorbs short outage; mutations queue offline | bff-backoffice/housekeeping-down |
| F-8 | maintenance-service 5xx | Per-route alert | Maintenance board unavailable; transitions fail online | Same as F-7: board cache absorbs short outage; mutations queue offline | bff-backoffice/maintenance-down |
| F-9 | billing-service 5xx | Per-route alert | Folio summary tile unavailable; charge mutations fail | Per-widget skeleton; idem-key absorbs retries | bff-backoffice/billing-down |
| F-10 | lock-integration-service 5xx | Per-route alert | Lock issue/revoke fails; alert raised | Audit row outcome=failure; manual workaround per ADR-0004 | bff-backoffice/lock-down |
| F-11 | ai-orchestrator-service 5xx | Per-route alert | AI surfaces hidden | Silent degrade | bff-backoffice/ai-down |
| F-12 | notification-service 5xx | Per-route alert | Alert inbox stale; ack fails | Cache absorbs; ack idempotent retry | bff-backoffice/notif-down |
| F-13 | sync-service.handshake 5xx | Per-route alert | Reconnect blocked; desktop stays offline | Desktop continues offline; retry handshake every 30 s | bff-backoffice/sync-handshake-down |
| F-14 | analytics-service 5xx | Per-upstream alert | Some KPI tiles unavailable | Per-widget skeleton | bff-backoffice/analytics-down |
| F-15 | Schema drift from any upstream | Zod parse failure | Affected route 502 + SCHEMA_DRIFT | On-call within 15 min; provider rolled back or BFF schema patched | bff/schema-drift |
| F-16 | property-service 5xx | Per-upstream alert | Property metadata stale | Cache absorbs | bff-backoffice/property-down |
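Several rows above (F-5, F-6, F-9, F-14) rely on a per-widget skeleton, and the decision tree in section 2 returns `200 with partial=true` when only part of the fanout succeeds. The sketch below shows one way that isolation could look. `composeDashboard`, `WidgetResult`, and the signature given to `withDeadline` are assumptions for illustration; only the name `withDeadline` appears in this document (F-33).

```typescript
// Outcome for one dashboard widget: real data or a skeleton placeholder.
type WidgetResult =
  | { status: "ok"; data: unknown }
  | { status: "unavailable"; reason: string };

// Rejects if `work` has not settled within `ms` milliseconds.
// The name comes from the catalogue; this signature is an assumption.
function withDeadline<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`deadline of ${ms} ms exceeded`)),
      ms,
    );
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

// Hypothetical composer: each widget succeeds or fails on its own,
// so one slow or failing upstream never fails the whole dashboard.
async function composeDashboard(
  fetchers: Record<string, () => Promise<unknown>>,
  deadlineMs: number,
): Promise<{ partial: boolean; widgets: Record<string, WidgetResult> }> {
  const settled = await Promise.all(
    Object.keys(fetchers).map(async (name): Promise<[string, WidgetResult]> => {
      try {
        const data = await withDeadline(fetchers[name](), deadlineMs);
        return [name, { status: "ok", data }];
      } catch (err) {
        const reason = err instanceof Error ? err.message : String(err);
        return [name, { status: "unavailable", reason }];
      }
    }),
  );
  const widgets: Record<string, WidgetResult> = {};
  for (const [name, result] of settled) widgets[name] = result;
  return { partial: settled.some(([, r]) => r.status !== "ok"), widgets };
}
```

Note that this deadline only stops the composer from waiting; it does not cancel the underlying request. Cancelling would need something like an `AbortSignal` passed to each fetcher, which is outside this sketch.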
1.2 Stateful dependency failures
| # | Failure | Detection | User impact (O) | Mitigation | Runbook |
|---|---|---|---|---|---|
| F-17 | Memorystore cache tier down | Health + client errors | Higher upstream load; latency rises | Read-through to upstream; readiness reflects state | bff-backoffice/memorystore-cache-down |
| F-18 | Memorystore session tier down | Health + client errors | Mutating endpoints 503; SSE bus down | Reads fall back direct; no session blob; readiness DOWN; auto-failover to standby | bff-backoffice/memorystore-session-down |
| F-19 | Memorystore failover | Failover event | < 30 s elevated latency; sessions preserved | HA standby; session blobs intact | bff-backoffice/memorystore-failover |
| F-20 | Cloud SQL primary down | Health check | Mutating endpoints 503; idempotency unwritable | HA failover (~ 60 s) | bff-backoffice/postgres-down |
| F-21 | Outbox grows | Depth alert | None visible; storage pressure | Relay retries; SRE investigates | platform/outbox-backlog |
| F-22 | Pub/Sub publish 100% failure | Publish error alert | None visible | Outbox queues; recovers | platform/pubsub-publish-down |
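F-17 lists "read-through to upstream" as the mitigation when the cache tier is down. A minimal sketch of that behaviour follows, assuming a hypothetical `Cache` interface and `readThrough` helper (neither name comes from this service): a cache error is treated like a miss, and the write-back is best-effort.

```typescript
// Hypothetical minimal cache client; a real Memorystore client will differ.
interface Cache {
  get(key: string): Promise<string | undefined>;
  set(key: string, value: string): Promise<void>;
}

// Serves from cache when possible; a cache outage degrades to a direct
// upstream read instead of failing the request.
async function readThrough(
  cache: Cache,
  key: string,
  loadFromUpstream: () => Promise<string>,
): Promise<{ value: string; source: "cache" | "upstream" }> {
  try {
    const hit = await cache.get(key);
    if (hit !== undefined) return { value: hit, source: "cache" };
  } catch {
    // Cache tier unavailable: fall through and treat it as a miss.
  }
  const value = await loadFromUpstream();
  try {
    await cache.set(key, value);
  } catch {
    // Best-effort write-back; the read already succeeded.
  }
  return { value, source: "upstream" };
}
```

The trade-off matches the "higher upstream load; latency rises" impact in the table: every request during the outage reaches the upstream.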
1.3 Edge / ingress failures
| # | Failure | Detection | User impact (O) | Mitigation | Runbook |
|---|---|---|---|---|---|
| F-23 | TLS cert expiry on backoffice.melmastoon.ghasi.io | TLS handshake monitor | Desktop unable to reach BFF; falls offline | Cert Manager auto-renewal; alerts at T-30/7/1 | bff-backoffice/tls-cert-expiry |
| F-24 | Cloud Armor false-positive blocks legitimate device traffic | 403 spike from device IPs | Some operators blocked; offline-mode survives | Stage soak before prod; rollback Terraform | bff-backoffice/waf-fp |
| F-25 | DNS regression | Synthetic uptime fail | Desktop fleet unreachable; offline fallback | Cloud DNS managed; DR DNS as fallback | bff-backoffice/dns |
| F-26 | DDoS via SSE flood | SSE active conns spike | Instance count climbs; cost spike | Per-device 1-conn cap; Cloud Armor IP rules; circuit-break | bff-backoffice/sse-flood |
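The per-device 1-conn cap in F-26 can be illustrated with a small in-process registry. This is a sketch under assumptions: `SseConnectionRegistry` is a hypothetical name, and the policy of closing the older connection when a device reconnects is one possible choice (rejecting the newer connection would also satisfy the cap).

```typescript
// Tracks at most one live SSE connection per device.
class SseConnectionRegistry {
  // deviceId -> function that closes the device's current connection
  private readonly active = new Map<string, () => void>();

  // Registers a new connection. Any previous connection for the same
  // device is closed first, so the count per device never exceeds one.
  open(deviceId: string, close: () => void): void {
    const previous = this.active.get(deviceId);
    if (previous) previous();
    this.active.set(deviceId, close);
  }

  // Called when a connection ends. Ignored if the device has already
  // been handed a newer connection.
  release(deviceId: string, close: () => void): void {
    if (this.active.get(deviceId) === close) this.active.delete(deviceId);
  }

  get size(): number {
    return this.active.size;
  }
}
```

With this policy, a flood from a single device costs one connection slot, not one per request; an instance-local registry like this only caps connections per instance, so a fleet-wide cap would need shared state.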
1.4 Application-layer failures
| # | Failure | Detection | User impact (O) | Mitigation | Runbook |
|---|---|---|---|---|---|
| F-27 | Lock-action audit log gap | lock_audit_completeness < 100% | Compliance + dispute risk | Hard freeze on lock proxy; SRE investigates within 30 min | bff-backoffice/lock-audit-gap |
| F-28 | DPoP replay accepted | dpop_replay_blocked_total < expected; or audit log shows replays succeeding | Token theft enabled | Page Security; halt traffic to affected revision; restore from prior known-good | bff-backoffice/dpop-replay-bypass |
| F-29 | MFA attestation replayed successfully | mfa.attestation.replay_success (would be 0; nonzero is a bug) | Sensitive action without fresh MFA | Page Security; halt sensitive endpoints | bff-backoffice/mfa-replay-bypass |
| F-30 | Cross-tenant data leakage via cache key collision | Synthetic cross-tenant probe fail | Catastrophic: operator sees another tenant's data | Tenant-scoped cache keys; nightly probe; rollback on alert | bff-backoffice/cross-tenant-leak |
| F-31 | Force-logout SSE event lost | E2E latency > 5 s p95 | Revoked operator continues to act | Refresh-time check (next refresh fails with SESSION_EXPIRED) acts as backstop; investigate within 1 h | bff-backoffice/forcelogout-lag |
| F-32 | Idempotency key replay creates duplicate mutation upstream | idempotency_dedup_total < expected; reservation-service duplicate counter spike | Duplicate room assignment / double charge | Page on-call; halt mutation proxy; investigate | bff-backoffice/idem-bypass |
| F-33 | Dashboard composer infinite-await due to upstream that doesn't honor deadline | dashboard.partial.spike + p95 spike | Many widgets unavailable | Hard deadline enforced via withDeadline; treat as a bug in the offending fetch and revert it | bff-backoffice/dashboard-await-leak |
| F-34 | SSE bus stops delivering after Memorystore failover | sse_first_event_p95 spike | New events not delivered until reconnect | Clients reconnect SSE on stale-detection; fallback polling | bff-backoffice/sse-bus-stall |
| F-35 | device_sync_status.cursor_version divergence from sync-service | Reconciliation job alert | Operators see "stale sync" hint | Reset cache from sync-service authoritative on next handshake | bff-backoffice/cursor-cache-divergence |
| F-36 | OperatorPreferences write-through fails (tenant-service down but BFF cache write succeeds) | Reconciliation alert | Mirror diverges from authority | Write-through transaction must succeed both; failure → 503; cache only updated after upstream OK | bff-backoffice/prefs-divergence |
| F-37 | Memory leak | Heap RSS upward trend > 10% / 4 h | Instance restarts (acceptable; Cloud Run min-instances keeps user impact at none) | Rolling restart; pre-stop drains SSE | bff-backoffice/memory-leak |
| F-38 | DLQ spike | DLQ depth alert | None | DLQ 7 d retention; SRE inspects + replays | platform/dlq-spike |
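F-30 names tenant-scoped cache keys as the guard against cross-tenant leakage. The sketch below shows one way to build such keys so that two tenants can never produce the same key. `tenantCacheKey` and the `tenant:resource:id` layout are assumptions for illustration, not this service's actual key scheme.

```typescript
// Builds a cache key whose first segment is always the tenant.
// Each segment is percent-encoded so a ":" inside a value cannot be
// mistaken for the segment separator.
function tenantCacheKey(tenantId: string, resource: string, id: string): string {
  const parts = [tenantId, resource, id];
  if (parts.some((p) => p.length === 0)) {
    throw new Error("tenantId, resource and id must all be non-empty");
  }
  return parts.map((p) => encodeURIComponent(p)).join(":");
}
```

For example, `tenantCacheKey("t1", "reservation", "42")` yields `t1:reservation:42`, while a tenant id of `a:b` is encoded as `a%3Ab` and so cannot collide with tenant `a` and a resource beginning with `b:`.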
1.5 Cost / quota
| # | Failure | Detection | User impact (O) | Mitigation | Runbook |
|---|---|---|---|---|---|
| F-39 | Cloud Run cost spike (instance count run-up) | Billing alert | None directly | Capacity audit; reduce SSE timeout; investigate device fleet | platform/cost-spike |
| F-40 | Pub/Sub egress cost spike from telemetry | Billing alert | None | Reduce sample rates via flag | bff-backoffice/telemetry-cost |
| F-41 | DPoP cache (Memorystore) memory pressure from spike in unique jti | Memory utilization alert | DPoP verify failures | TTL trim; switch to bloom filter for replay check | bff-backoffice/dpop-cache-pressure |
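F-41 concerns memory pressure from unique `jti` values, with TTL trimming as the first mitigation. The sketch below is an in-process stand-in for that replay check, not the Memorystore-backed implementation: `JtiReplayGuard` is a hypothetical name, and the fixed TTL and injected clock are assumptions made so the behaviour is easy to follow.

```typescript
// Remembers each DPoP jti for a fixed TTL and rejects repeats within it.
class JtiReplayGuard {
  // jti -> expiry time in epoch ms; Map preserves insertion order
  private readonly seen = new Map<string, number>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // Returns true the first time a jti is seen inside the TTL window,
  // false if the same jti is presented again before it expires.
  accept(jti: string): boolean {
    const t = this.now();
    this.trim(t);
    if (this.seen.has(jti)) return false;
    this.seen.set(jti, t + this.ttlMs);
    return true;
  }

  // Drops expired entries. With a fixed TTL and a non-decreasing clock,
  // insertion order equals expiry order, so the scan stops at the first
  // entry that is still live.
  private trim(t: number): void {
    for (const [jti, expiresAt] of this.seen) {
      if (expiresAt > t) break;
      this.seen.delete(jti);
    }
  }

  get size(): number {
    return this.seen.size;
  }
}
```

Memory use is bounded by the number of distinct `jti` values seen within one TTL window, which is why shortening the TTL relieves pressure. The bloom-filter alternative mentioned in the table trades that bound for a small false-positive rate (a legitimate proof occasionally rejected).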
2. Failure decision tree
incoming request
│
├── Cloud Armor block? ── yes ──► 403 (no telemetry)
│
├── No Authorization header? ── yes ──► 401 SESSION_REQUIRED
│
├── DPoP invalid? ── yes ──► 401 DPOP_INVALID
│
├── Token expired? ── yes ──► 401 SESSION_EXPIRED
│
├── Device mismatch? ── yes ──► 403 DEVICE_MISMATCH
│
├── Tenant suspended? ── yes ──► 503 TENANT.SUSPENDED
│
├── Property out of scope? ── yes ──► 403 PROPERTY_OUT_OF_SCOPE
│
├── Memorystore session down? ── yes ──► 503 CACHE_UNAVAILABLE
│
├── route fanout
│ ├── all upstreams ok ──► 200 with full VM
│ ├── widget fanout partial ──► 200 with partial=true
│ ├── reservation-service down (read) ──► route 502 + UPSTREAM_UNAVAILABLE
│ ├── reservation-service down (mutation) ──► 504 + UPSTREAM_TIMEOUT (idem-key absorbs retry)
│ ├── lock-integration-service down ──► 502 + UPSTREAM_UNAVAILABLE; audit row outcome=failure
│ ├── MFA required + missing ──► 401 MFA_REQUIRED
│ ├── MFA invalid/used ──► 401 MFA_INVALID_OR_USED
│ ├── schema drift detected ──► 502 SCHEMA_DRIFT + alert
│ └── ok ──► 200
│
└── outbox enqueue
├── Postgres ok ──► 200 (telemetry async)
└── Postgres down ──► best-effort; readiness reflects DOWN
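The pre-fanout part of the tree above is an ordered chain where the first failing check decides the response. A minimal sketch of that ordering follows; `RequestFacts`, `Rejection`, and `preflight` are hypothetical names, and the Cloud Armor check is included only to mirror the tree (in practice it happens at the edge, before the BFF sees the request). Status codes and error codes are copied from the tree.

```typescript
// Facts about an incoming request, assumed to be computed elsewhere.
interface RequestFacts {
  blockedByCloudArmor: boolean;
  hasAuthorizationHeader: boolean;
  dpopValid: boolean;
  tokenExpired: boolean;
  deviceMatches: boolean;
  tenantSuspended: boolean;
  propertyInScope: boolean;
  sessionStoreUp: boolean;
}

interface Rejection {
  status: number;
  code: string | null; // null where the tree names no error code
}

// Returns the first rejection in tree order, or null to proceed to fanout.
function preflight(f: RequestFacts): Rejection | null {
  if (f.blockedByCloudArmor) return { status: 403, code: null };
  if (!f.hasAuthorizationHeader) return { status: 401, code: "SESSION_REQUIRED" };
  if (!f.dpopValid) return { status: 401, code: "DPOP_INVALID" };
  if (f.tokenExpired) return { status: 401, code: "SESSION_EXPIRED" };
  if (!f.deviceMatches) return { status: 403, code: "DEVICE_MISMATCH" };
  if (f.tenantSuspended) return { status: 503, code: "TENANT.SUSPENDED" };
  if (!f.propertyInScope) return { status: 403, code: "PROPERTY_OUT_OF_SCOPE" };
  if (!f.sessionStoreUp) return { status: 503, code: "CACHE_UNAVAILABLE" };
  return null;
}
```

Because the checks are ordered, a request that is both missing its Authorization header and carrying an expired token is reported as `SESSION_REQUIRED`, never `SESSION_EXPIRED`.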
3. Blast radius matrix
| Failure | Backoffice surface (O) | Consumer surface | Tenant booking surface | Other tenants |
|---|---|---|---|---|
| Memorystore down | Severe (mutating fails) | None | None | Same (cross-tenant Memorystore) |
| Cloud SQL down | Mutating fails | None | None | Same |
| iam-service down | Refresh fails; sessions hold | Same | Same | All tenants |
| reservation-service down | Mutations fail | None | Hold/confirm fail too | All tenants |
| lock-integration-service down | Lock proxy fails | None | None | All tenants |
| sync-service down | Reconnect blocked; offline survives | None | None | All tenants |
| TLS cert expiry on backoffice domain | Backoffice unreachable | None | None | All tenants |
| WAF FP | Some operators blocked | None | None | Variable |
| DPoP replay bypass | Catastrophic security | None | None | Variable |
| Cross-tenant cache leak | Catastrophic | None | None | Variable |
4. Recovery objectives
| Objective | Target |
|---|---|
| RPO | 5 min |
| RTO | 30 min |
| MTTD (P1) | < 2 min |
| MTTA (P1) | < 5 min |
| MTTM (P1) | < 30 min |
| Lock-audit completeness | 100% (no tolerance) |
| Mutation idempotency correctness | 100% |
| Force-logout E2E | < 5 s p95 |
5. Game-day exercises
Game days run quarterly. Recent runs are recorded in services/bff-backoffice-service/_chaos/ with date, owner, and remediation backlog. Particular focus on:
- Lock-integration-service outage with active key issuance.
- Memorystore session-tier failover during shift change.
- DPoP replay attempt simulation.
- Force-logout broadcast to 2,000 simultaneously connected SSE devices.
- Sync-service handshake outage during multi-device reconnect storm.