
OBSERVABILITY — bff-backoffice-service

Sibling: SECURITY_MODEL · FAILURE_MODES · API_CONTRACTS

Cross-cutting: 02 Enterprise Architecture · §11 Observability

1. Stack

| Concern | Tool |
|---|---|
| Tracing | OpenTelemetry → Cloud Trace + Tempo |
| Metrics | OpenTelemetry → Cloud Monitoring + Prometheus → Grafana |
| Logs | Pino → Cloud Logging → BigQuery (audit lake) |
| Profiling | Cloud Profiler (1% sample) |
| Synthetic | Cloud Monitoring uptime + Playwright canary |
| Alerting | Cloud Monitoring + PagerDuty |

The OTel SDK is initialized in main.ts before NestFactory is called, so that instrumentation is registered before the application modules load.

2. SLIs and SLOs

| SLI | SLO | Window | Budget |
|---|---|---|---|
| dashboard_p95_latency_warm | < 600 ms | 28 d | 5% |
| dashboard_first_byte_p95 | < 200 ms | 28 d | 5% |
| workbench_p95 | < 400 ms | 28 d | 5% |
| lock_action_p95 | < 800 ms | 28 d | 1% |
| heartbeat_p95 | < 80 ms | 28 d | 1% |
| sync_handshake_p95 | < 300 ms | 28 d | 1% |
| auth_refresh_p95 | < 250 ms | 28 d | 1% |
| availability | 99.9% | 28 d | 40 m / 28 d |
| lock_action_audit_completeness | 100% | continuous | 0 |
| mutation_idempotency_correctness | 100% | continuous | 0 |
| force_logout_e2e_latency_p95 | < 5 s | 28 d | 1% |
| sse_first_event_p95 | < 200 ms | 28 d | 5% |
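
The time-based budget for an availability SLO follows directly from the target and the window. The sketch below shows that arithmetic; `errorBudgetMinutes` and `budgetConsumed` are illustrative names and not part of the service.

```typescript
// Hypothetical helper: allowed downtime, in minutes, for an availability SLO.
// sloTarget is a fraction (0.999 for 99.9%); windowDays is the SLO window.
function errorBudgetMinutes(sloTarget: number, windowDays: number): number {
  if (sloTarget <= 0 || sloTarget > 1) {
    throw new RangeError("sloTarget must be in (0, 1]");
  }
  const windowMinutes = windowDays * 24 * 60;
  return (1 - sloTarget) * windowMinutes;
}

// Fraction of the budget consumed so far, given the observed bad minutes.
// A 100% target has no budget: any bad minute counts as unbounded consumption.
function budgetConsumed(
  badMinutes: number,
  sloTarget: number,
  windowDays: number,
): number {
  const budget = errorBudgetMinutes(sloTarget, windowDays);
  if (budget === 0) {
    return badMinutes > 0 ? Infinity : 0;
  }
  return badMinutes / budget;
}
```

For a 99.9% target the function returns about 40.3 minutes over a 28-day window and about 43.2 minutes over a 30-day window.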

3. RED metrics (per route)

bff_backoffice_request_total{tenant_id, route, method, status_class, role}
bff_backoffice_request_duration_seconds{tenant_id, route, method, status_class}
bff_backoffice_request_inflight{route}
bff_backoffice_errors_total{tenant_id, route, error_code}
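
As an illustration of how the label set above might be populated, the sketch below derives `status_class` from the HTTP status code and increments an in-memory counter. It assumes `route` is the route template (for example `/reservations/{id}/check-in`) and not the raw path, so label cardinality stays bounded. `LabeledCounter` and `recordRequest` are hypothetical stand-ins for the real OpenTelemetry instruments.

```typescript
type StatusClass = "1xx" | "2xx" | "3xx" | "4xx" | "5xx";

// Collapse an HTTP status code into its class, e.g. 503 -> "5xx".
function statusClass(statusCode: number): StatusClass {
  const hundreds = Math.floor(statusCode / 100);
  if (hundreds < 1 || hundreds > 5) {
    throw new RangeError(`unexpected status code ${statusCode}`);
  }
  return `${hundreds}xx` as StatusClass;
}

// Minimal in-memory counter keyed by a sorted label set (illustration only).
class LabeledCounter {
  private readonly values = new Map<string, number>();

  inc(labels: Record<string, string>, by = 1): void {
    const key = LabeledCounter.keyOf(labels);
    this.values.set(key, (this.values.get(key) ?? 0) + by);
  }

  get(labels: Record<string, string>): number {
    return this.values.get(LabeledCounter.keyOf(labels)) ?? 0;
  }

  // Sort label names so that insertion order does not change the series key.
  private static keyOf(labels: Record<string, string>): string {
    return Object.keys(labels)
      .sort()
      .map((k) => `${k}=${labels[k]}`)
      .join(",");
  }
}

const requestTotal = new LabeledCounter();

// Record one finished request under the bff_backoffice_request_total labels.
function recordRequest(
  tenantId: string,
  route: string,
  method: string,
  statusCode: number,
  role: string,
): void {
  requestTotal.inc({
    tenant_id: tenantId,
    route,
    method,
    status_class: statusClass(statusCode),
    role,
  });
}
```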

4. USE metrics

bff_backoffice_redis_pool_inuse / total
bff_backoffice_postgres_pool_inuse / total
bff_backoffice_outbox_depth
bff_backoffice_outbox_lag_seconds
bff_backoffice_circuit_breaker_state{upstream}
bff_backoffice_cache_hit_total{cache}
bff_backoffice_cache_miss_total{cache}
bff_backoffice_singleflight_followers_total{key_prefix}
bff_backoffice_dashboard_partial_total
bff_backoffice_lock_action_total{outcome,vendor}
bff_backoffice_mfa_attestation_total{scope,outcome}
bff_backoffice_dpop_replay_blocked_total
bff_backoffice_sse_active_connections
bff_backoffice_sse_events_pushed_total{channel}
bff_backoffice_device_status_count{status}
bff_backoffice_session_active_count
bff_backoffice_idempotency_dedup_total
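
Several of these series are only useful as ratios. The sketch below shows one way to derive pool utilization and a windowed cache hit ratio; it assumes the `_inuse` and `total` pool values are separate gauges and that the hit and miss series are cumulative counters, so a windowed ratio has to use deltas. All function and type names are illustrative.

```typescript
// Safe division: an empty denominator yields 0 instead of NaN.
function ratio(numerator: number, denominator: number): number {
  return denominator > 0 ? numerator / denominator : 0;
}

// Utilization of a connection pool from its in-use and total gauges.
function poolUtilization(inUse: number, total: number): number {
  return ratio(inUse, total);
}

// Snapshot of the cumulative cache counters at one point in time.
interface CacheCounters {
  hits: number;
  misses: number;
}

// Hit ratio over a window, computed from two counter snapshots.
// Counters only grow, so the window's activity is the difference between them.
function cacheHitRatioOverWindow(
  start: CacheCounters,
  end: CacheCounters,
): number {
  const hits = end.hits - start.hits;
  const misses = end.misses - start.misses;
  return ratio(hits, hits + misses);
}
```

With snapshots `{ hits: 100, misses: 50 }` and `{ hits: 170, misses: 80 }`, the window saw 70 hits and 30 misses, which gives a hit ratio of 0.7.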

5. Trace attributes

Every span carries:

| Key | Cardinality |
|---|---|
| service.name | low |
| tenant.id | medium |
| operator.id | high (sampled) |
| device.id | high (sampled) |
| session.id | high (sampled) |
| route.name | low |
| cache.outcome | low |
| upstream.name | low |
| upstream.deadline_ms | low |
| circuit.state | low |
| idempotency.key | high (sampled) |
| lock.action.vendor | low |
| lock.action.outcome | low |
| mfa.scope | low |
| mfa.attestation_used | low |
| dpop.outcome | low |
| app.version | low |
| app.platform | low |
| network.profile | low |

6. Log fields

{
"ts": "...",
"level": "info",
"service": "bff-backoffice-service",
"instance": "...",
"traceId": "...",
"spanId": "...",
"requestId": "...",
"tenantId": "tnt_...",
"operatorId": "opr_...",
"deviceId": "dev_...",
"sessionId": "bos_...",
"route": "POST /reservations/{id}/check-in",
"statusCode": 200,
"latencyMs": 412,
"cacheOutcome": "MISS",
"upstream": [{"name":"reservation-service","latencyMs":380,"status":"ok"}],
"appVersion": "1.4.2",
"appPlatform": "win32",
"msg": "mutation_proxied"
}

PII (operator name, guest name) is NEVER logged.
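
One way to enforce the no-PII rule is a deny-list filter applied to every record before it is serialized. The sketch below is a minimal, dependency-free version; the key names `operatorName` and `guestName` are assumptions made for illustration, and `stripPii` is not an existing function in the service. Pino's own `redact` option covers the same need for known paths.

```typescript
type LogRecord = { [key: string]: unknown };

// Hypothetical deny-list; the real field names are an assumption.
const PII_KEYS: ReadonlySet<string> = new Set(["operatorName", "guestName"]);

// Return a deep copy of the value with every deny-listed key removed.
// Arrays and nested objects are walked recursively; primitives pass through.
function stripPii(value: unknown): unknown {
  if (Array.isArray(value)) {
    return value.map((item) => stripPii(item));
  }
  if (value !== null && typeof value === "object") {
    const source = value as LogRecord;
    const out: LogRecord = {};
    for (const key of Object.keys(source)) {
      if (!PII_KEYS.has(key)) {
        out[key] = stripPii(source[key]);
      }
    }
    return out;
  }
  return value;
}
```

Removing the key entirely, instead of masking its value, keeps the audit lake free of placeholder fields that could later be mistaken for data.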

7. Dashboards

7.1 "Operator effectiveness" (executive)

  • Active operators per property (live)
  • Mutations / operator / day
  • Lock actions / day (issue vs revoke)
  • AI suggestion acceptance rate
  • Average dashboard p95 by tenant
  • Force-logout rate (anomaly indicator)

7.2 "Service SLO" (SRE)

  • p50/p95/p99 by route
  • Error rate by route + class
  • SLO burn rate (1h/6h/1d/7d)
  • Upstream dependency health
  • Circuit state timeline
  • Cache hit ratios

7.3 "Device health" (on-call + CSM)

  • Online vs offline device count
  • Devices by app version
  • Heartbeat lag distribution
  • Sync handshake success ratio
  • Devices stuck on outdated version
  • DPoP replay blocked count (security)

7.4 "Lock actions" (security)

  • Issue vs revoke counts by tenant + property
  • MFA bypass attempt count
  • Lock vendor failure breakdown
  • Top operators by revoke count (anomaly indicator)
  • Audit completeness gap

7.5 "AI decisions" (product + audit)

  • Suggestions surfaced / decided / expired-without-decision
  • Acceptance rate by category
  • Decision latency (suggestion shown → decision)
  • Operator override rate
  • Edge vs cloud provenance split

8. Alerts (P1)

| Alert | Condition | Action |
|---|---|---|
| bff_backoffice_p95_latency_burn_fast | route p95 > 2× SLO for 5 min | Page on-call |
| bff_backoffice_error_rate_burn_fast | overall 5xx > 1% for 5 min | Page on-call |
| bff_backoffice_lock_audit_gap | any lock_action_proxied event without matching audit row | Page Security; halt lock proxy |
| bff_backoffice_mfa_bypass_attempt | MFA_INVALID_OR_USED rate > 5/min for 5 min | Page Security |
| bff_backoffice_dpop_replay_spike | dpop_replay_blocked_total > 10/min for 5 min | Page Security |
| bff_backoffice_dashboard_partial_spike | partial > 30% for 10 min | Page on-call |
| bff_backoffice_circuit_open | any upstream circuit open > 5 min | Page on-call |
| bff_backoffice_outbox_lag | > 60 s for 5 min | Page SRE |
| bff_backoffice_redis_failover | failover event | Auto-ack; verify SSE bus continuity |
| bff_backoffice_postgres_down | uptime check fails for 2 min | Page SRE |
| bff_backoffice_force_logout_storm | force-logouts > 50 / min | Page Security |
| bff_backoffice_pii_in_telemetry | synthetic probe finds raw PII | Page Security; halt outbox |

Each alert links to a runbook at runbooks.melmastoon.ghasi.io/bff-backoffice/<short-name>.
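
Most conditions above have the shape "value above threshold for N minutes". The sketch below shows one possible reading of that shape, in which the alert fires only when the breach has been continuous for the whole duration. In production this is evaluated by the alerting backend; `MetricSample` and `breachedFor` are illustrative names.

```typescript
interface MetricSample {
  tsMs: number;  // sample timestamp, epoch milliseconds
  value: number; // observed value, e.g. route p95 latency in ms
}

// True when the most recent unbroken run of breaching samples started
// at least durationMs before nowMs. A single healthy sample resets the run.
function breachedFor(
  samples: MetricSample[],
  threshold: number,
  durationMs: number,
  nowMs: number,
): boolean {
  const sorted = samples.slice().sort((a, b) => a.tsMs - b.tsMs);
  let breachStart: number | null = null;
  for (const s of sorted) {
    if (s.value > threshold) {
      if (breachStart === null) {
        breachStart = s.tsMs;
      }
    } else {
      breachStart = null;
    }
  }
  return breachStart !== null && nowMs - breachStart >= durationMs;
}
```

For the dashboard route, whose p95 SLO is 600 ms, the fast-burn threshold would be 1200 ms. Six one-minute samples of 1300 ms starting at t = 0 fire the alert at t = 5 min, while a single 900 ms sample in the middle resets the clock.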

9. Alerts (P2)

| Alert | Condition |
|---|---|
| bff_backoffice_cache_hit_drop | dashboard cache hit < 70% for 30 min |
| bff_backoffice_device_offline_spike | devices in offline state > 2× baseline for 10 min |
| bff_backoffice_sync_handshake_failure_spike | handshake failures > 5% for 10 min |
| bff_backoffice_app_version_skew | active devices on EOL version count > 0 for 24 h |
| bff_backoffice_ai_decision_lag | mean decision-latency > 1 h for 24 h |

10. Synthetic monitoring

  • Cloud Monitoring uptime checks every 60 s on /health/ready from 4 regions.
  • Per-tenant canary device emulator: signs DPoP, refreshes, fetches the dashboard, posts a heartbeat, and decides a synthetic AI suggestion. Runs every 5 min from stage.
  • Playwright nightly E2E from a real Electron build container against stage.

11. Trace sampling

  • Default head sampler 5%; tail sampler in OTel collector elevates to 100% for status_class=5xx, error_code != null, duration > p99, lock-action category, MFA failure.
  • Incident toggle: 100% via bff-backoffice-flags (timer-bound 1 h auto-revert).
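
The sampling rules above can be read as a single keep/drop decision per trace. The sketch below expresses that decision as plain TypeScript for illustration only; the real policy lives in the OTel collector configuration. It assumes the collector sees a trace before the decision is made, that the current p99 is supplied by the caller, and that the incident toggle is modeled as an expiry timestamp. All type and function names are hypothetical.

```typescript
interface TraceSummary {
  statusClass: string;      // e.g. "2xx", "5xx"
  errorCode: string | null; // application error code, if any
  durationMs: number;
  isLockAction: boolean;
  mfaFailed: boolean;
}

interface SamplingConfig {
  baseRate: number;               // default rate, 0.05 for 5%
  p99DurationMs: number;          // current p99, supplied by the caller
  incidentUntilMs: number | null; // 100% sampling until this time, if set
}

const ONE_HOUR_MS = 60 * 60 * 1000;

// Turn on 100% sampling; it lapses on its own after one hour.
function enableIncidentSampling(cfg: SamplingConfig, nowMs: number): SamplingConfig {
  return { ...cfg, incidentUntilMs: nowMs + ONE_HOUR_MS };
}

// Decide whether to keep a trace. `rand` is a uniform draw in [0, 1),
// injected so that the decision is deterministic under test.
function shouldKeepTrace(
  trace: TraceSummary,
  cfg: SamplingConfig,
  nowMs: number,
  rand: number,
): boolean {
  if (cfg.incidentUntilMs !== null && nowMs < cfg.incidentUntilMs) return true;
  if (trace.statusClass === "5xx") return true;
  if (trace.errorCode !== null) return true;
  if (trace.durationMs > cfg.p99DurationMs) return true;
  if (trace.isLockAction || trace.mfaFailed) return true;
  return rand < cfg.baseRate;
}
```

Modeling the toggle as an expiry time means the auto-revert needs no timer or cleanup job: once the hour has passed, the comparison simply stops matching.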

12. Log retention

| Class | Retention |
|---|---|
| audit.* | 7 y |
| request.* | 30 d hot, 90 d cold |
| error.* | 90 d hot, 1 y cold |
| debug.* | 7 d |
| lock.* | 7 y |
| mfa.* | 7 y |

All classes are exported to BigQuery via Log Router.

13. Correlation IDs

Every BFF response carries X-Request-Id and traceparent. The same X-Request-Id value is propagated as X-Audit-RequestId to all upstream services, so that a single trace spans an entire check-in saga (BFF → reservation → folio → notification → analytics).
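
A minimal sketch of the header propagation described above, assuming the BFF already holds the inbound request ID and the active trace context. The `traceparent` value follows the W3C Trace Context layout (`version-traceid-parentid-flags`); in practice an OpenTelemetry propagator would normally inject it. `InboundContext` and `upstreamHeaders` are hypothetical names.

```typescript
interface InboundContext {
  requestId: string; // value of X-Request-Id on the BFF request
  traceId: string;   // 32 lowercase hex characters
  spanId: string;    // 16 lowercase hex characters (current BFF span)
  sampled: boolean;
}

// Build the correlation headers to attach to every upstream call.
function upstreamHeaders(ctx: InboundContext): Record<string, string> {
  if (!/^[0-9a-f]{32}$/.test(ctx.traceId)) {
    throw new Error("traceId must be 32 lowercase hex characters");
  }
  if (!/^[0-9a-f]{16}$/.test(ctx.spanId)) {
    throw new Error("spanId must be 16 lowercase hex characters");
  }
  const flags = ctx.sampled ? "01" : "00";
  return {
    // Same value as X-Request-Id, under the name the upstream audit trail expects.
    "X-Audit-RequestId": ctx.requestId,
    traceparent: `00-${ctx.traceId}-${ctx.spanId}-${flags}`,
  };
}
```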

14. Cost observability

  • Cloud Billing budget alerts at 50/80/100/120% of budget.
  • Per-tenant cost dashboard.
  • Pub/Sub egress per subject.

15. SLO error-budget policy

| Consumption | Action |
|---|---|
| 25% | Slack notification |
| 50% | TODO ticket auto-created |
| 75% | Freeze non-critical changes |
| 100% | Hard freeze; revert recent changes |
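
The policy can be expressed as a threshold lookup. The sketch below assumes thresholds are inclusive and that only the highest tier reached applies; the action identifiers and `actionForConsumption` are illustrative names, not an existing API.

```typescript
type BudgetAction =
  | "none"
  | "slack_notification"
  | "ticket_auto_created"
  | "freeze_non_critical_changes"
  | "hard_freeze_and_revert";

// Tiers ordered from most to least severe so that the first match wins.
const BUDGET_POLICY: ReadonlyArray<{ atFraction: number; action: BudgetAction }> = [
  { atFraction: 1.0, action: "hard_freeze_and_revert" },
  { atFraction: 0.75, action: "freeze_non_critical_changes" },
  { atFraction: 0.5, action: "ticket_auto_created" },
  { atFraction: 0.25, action: "slack_notification" },
];

// Map consumed budget (0.6 means 60% consumed) to the applicable action.
function actionForConsumption(consumedFraction: number): BudgetAction {
  for (const tier of BUDGET_POLICY) {
    if (consumedFraction >= tier.atFraction) {
      return tier.action;
    }
  }
  return "none";
}
```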