OBSERVABILITY — bff-backoffice-service
Sibling: SECURITY_MODEL · FAILURE_MODES · API_CONTRACTS
Cross-cutting: 02 Enterprise Architecture · §11 Observability
1. Stack
| Concern | Tool |
|---|---|
| Tracing | OpenTelemetry → Cloud Trace + Tempo |
| Metrics | OpenTelemetry → Cloud Monitoring + Prometheus → Grafana |
| Logs | Pino → Cloud Logging → BigQuery (audit lake) |
| Profiling | Cloud Profiler (1% sample) |
| Synthetic | Cloud Monitoring uptime + Playwright canary |
| Alerting | Cloud Monitoring + PagerDuty |
The OTel SDK is initialized in main.ts before NestFactory is invoked, so auto-instrumentation can patch libraries before the application loads them.
2. SLIs and SLOs
| SLI | SLO | Window | Budget |
|---|---|---|---|
| dashboard_p95_latency_warm | < 600 ms | 28 d | 5% |
| dashboard_first_byte_p95 | < 200 ms | 28 d | 5% |
| workbench_p95 | < 400 ms | 28 d | 5% |
| lock_action_p95 | < 800 ms | 28 d | 1% |
| heartbeat_p95 | < 80 ms | 28 d | 1% |
| sync_handshake_p95 | < 300 ms | 28 d | 1% |
| auth_refresh_p95 | < 250 ms | 28 d | 1% |
| availability | 99.9% | 28 d | 40 m / 28 d |
| lock_action_audit_completeness | 100% | continuous | 0 |
| mutation_idempotency_correctness | 100% | continuous | 0 |
| force_logout_e2e_latency_p95 | < 5 s | 28 d | 1% |
| sse_first_event_p95 | < 200 ms | 28 d | 5% |
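The Budget column follows from the SLO target and the window. A minimal sketch of that arithmetic, assuming the percentage budgets mean "fraction of events allowed to miss the target"; the helper names are hypothetical and not part of the service:

```typescript
// Hypothetical helpers for illustration; not part of the service code.

/** Allowed downtime in minutes for an availability SLO over a window in days. */
function downtimeBudgetMinutes(sloPercent: number, windowDays: number): number {
  const windowMinutes = windowDays * 24 * 60;
  return windowMinutes * (1 - sloPercent / 100);
}

/**
 * Share of the error budget already spent (1.0 means fully spent).
 * `budgetFraction` is the Budget column as a fraction, e.g. 0.05 for 5%.
 */
function budgetConsumed(
  badEvents: number,
  totalEvents: number,
  budgetFraction: number,
): number {
  if (totalEvents === 0) return 0; // no traffic, nothing consumed
  return badEvents / totalEvents / budgetFraction;
}

// 99.9% over 28 days leaves roughly 40.3 minutes of downtime.
const availabilityBudgetMin = downtimeBudgetMinutes(99.9, 28);

// 10 bad requests out of 1000 against a 5% budget: about 20% of the budget spent.
const dashboardBudgetSpent = budgetConsumed(10, 1000, 0.05);
```

For comparison, the same 99.9% target over a 30-day window would allow about 43.2 minutes.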
3. RED metrics (per route)
bff_backoffice_request_total{tenant_id, route, method, status_class, role}
bff_backoffice_request_duration_seconds{tenant_id, route, method, status_class}
bff_backoffice_request_inflight{route}
bff_backoffice_errors_total{tenant_id, route, error_code}
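To keep label cardinality bounded, the route label would normally hold the route template rather than the raw path, and status_class would collapse the status code into its class. A small sketch of how the labels for bff_backoffice_request_total might be derived; the function names and the example role value are assumptions:

```typescript
type StatusClass = "1xx" | "2xx" | "3xx" | "4xx" | "5xx";

/** Collapses an HTTP status code into its class, e.g. 503 -> "5xx". */
function statusClass(statusCode: number): StatusClass {
  const hundreds = Math.floor(statusCode / 100);
  if (!(hundreds >= 1 && hundreds <= 5)) {
    throw new RangeError("unexpected HTTP status: " + statusCode);
  }
  return (hundreds + "xx") as StatusClass;
}

interface RequestLabels {
  tenant_id: string;
  route: string; // route template, never the raw path with IDs in it
  method: string;
  status_class: StatusClass;
  role: string;
}

/** Builds the label set for one finished request (hypothetical helper). */
function requestLabels(
  tenantId: string,
  routeTemplate: string,
  method: string,
  statusCode: number,
  role: string,
): RequestLabels {
  return {
    tenant_id: tenantId,
    route: routeTemplate,
    method: method.toUpperCase(),
    status_class: statusClass(statusCode),
    role: role,
  };
}

// Example: "front_desk" is an invented role value.
const exampleLabels = requestLabels(
  "tnt_1", "/reservations/{id}/check-in", "post", 200, "front_desk",
);
```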
4. USE metrics
bff_backoffice_redis_pool_inuse / total
bff_backoffice_postgres_pool_inuse / total
bff_backoffice_outbox_depth
bff_backoffice_outbox_lag_seconds
bff_backoffice_circuit_breaker_state{upstream}
bff_backoffice_cache_hit_total{cache}
bff_backoffice_cache_miss_total{cache}
bff_backoffice_singleflight_followers_total{key_prefix}
bff_backoffice_dashboard_partial_total
bff_backoffice_lock_action_total{outcome,vendor}
bff_backoffice_mfa_attestation_total{scope,outcome}
bff_backoffice_dpop_replay_blocked_total
bff_backoffice_sse_active_connections
bff_backoffice_sse_events_pushed_total{channel}
bff_backoffice_device_status_count{status}
bff_backoffice_session_active_count
bff_backoffice_idempotency_dedup_total
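Several of these series are only meaningful as ratios: the cache counters combine into a hit ratio, and the pool gauges into a utilization figure. A sketch of both calculations, with hypothetical helper names, assuming the counter values are deltas over the same interval:

```typescript
/** Hit ratio from hit and miss counter deltas; null when there were no lookups. */
function cacheHitRatio(hits: number, misses: number): number | null {
  const lookups = hits + misses;
  return lookups === 0 ? null : hits / lookups;
}

/** Pool utilization from the inuse and total gauges; null for an empty pool. */
function poolUtilization(inUse: number, total: number): number | null {
  return total <= 0 ? null : inUse / total;
}
```

Returning null instead of 0 for an idle cache keeps a quiet period from looking like a hit-ratio collapse.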
5. Trace attributes
Every span carries:
| Key | Cardinality |
|---|---|
| service.name | low |
| tenant.id | medium |
| operator.id | high (sampled) |
| device.id | high (sampled) |
| session.id | high (sampled) |
| route.name | low |
| cache.outcome | low |
| upstream.name | low |
| upstream.deadline_ms | low |
| circuit.state | low |
| idempotency.key | high (sampled) |
| lock.action.vendor | low |
| lock.action.outcome | low |
| mfa.scope | low |
| mfa.attestation_used | low |
| dpop.outcome | low |
| app.version | low |
| app.platform | low |
| network.profile | low |
6. Log fields
{
  "ts": "...",
  "level": "info",
  "service": "bff-backoffice-service",
  "instance": "...",
  "traceId": "...",
  "spanId": "...",
  "requestId": "...",
  "tenantId": "tnt_...",
  "operatorId": "opr_...",
  "deviceId": "dev_...",
  "sessionId": "bos_...",
  "route": "POST /reservations/{id}/check-in",
  "statusCode": 200,
  "latencyMs": 412,
  "cacheOutcome": "MISS",
  "upstream": [{"name":"reservation-service","latencyMs":380,"status":"ok"}],
  "appVersion": "1.4.2",
  "appPlatform": "win32",
  "msg": "mutation_proxied"
}
PII (operator name, guest name) is NEVER logged.
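One way to enforce the no-PII rule is a deny-list applied to every record before it reaches the logger. The sketch below is illustrative only: the key names operatorName and guestName are assumptions, and the real field names depend on the service's DTOs.

```typescript
// Hypothetical deny-list; the actual PII field names are not given in this document.
const PII_KEYS: string[] = ["operatorName", "guestName"];

/** Returns a deep copy of `value` with deny-listed keys removed at every level. */
function stripPii(value: unknown): unknown {
  if (Array.isArray(value)) {
    return value.map((item) => stripPii(item));
  }
  if (value !== null && typeof value === "object") {
    const source = value as Record<string, unknown>;
    const clean: Record<string, unknown> = {};
    for (const key of Object.keys(source)) {
      if (PII_KEYS.indexOf(key) === -1) {
        clean[key] = stripPii(source[key]);
      }
    }
    return clean;
  }
  return value; // primitives pass through unchanged
}

// Example record with PII at the top level and inside a nested array.
const cleaned = stripPii({
  level: "info",
  operatorId: "opr_123",
  guestName: "Jane Doe",
  upstream: [{ name: "reservation-service", status: "ok", guestName: "Jane Doe" }],
}) as Record<string, unknown>;
```

Pino also ships a built-in redact option that censors or removes configured paths; whether the service uses that or a wrapper like the one above is not stated here.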
7. Dashboards
7.1 "Operator effectiveness" (executive)
- Active operators per property (live)
- Mutations / operator / day
- Lock actions / day (issue vs revoke)
- AI suggestion acceptance rate
- Average dashboard p95 by tenant
- Force-logout rate (anomaly indicator)
7.2 "Service SLO" (SRE)
- p50/p95/p99 by route
- Error rate by route + class
- SLO burn rate (1h/6h/1d/7d)
- Upstream dependency health
- Circuit state timeline
- Cache hit ratios
7.3 "Device health" (on-call + CSM)
- Online vs offline device count
- Devices by app version
- Heartbeat lag distribution
- Sync handshake success ratio
- Devices stuck on outdated version
- DPoP replay blocked count (security)
7.4 "Lock actions" (security)
- Issue vs revoke counts by tenant + property
- MFA bypass attempt count
- Lock vendor failure breakdown
- Top operators by revoke count (anomaly indicator)
- Audit completeness gap
7.5 "AI decisions" (product + audit)
- Suggestions surfaced / decided / expired-without-decision
- Acceptance rate by category
- Decision latency (suggestion shown → decision)
- Operator override rate
- Edge vs cloud provenance split
8. Alerts (P1)
| Alert | Condition | Action |
|---|---|---|
| bff_backoffice_p95_latency_burn_fast | route p95 > 2× SLO for 5 min | Page on-call |
| bff_backoffice_error_rate_burn_fast | overall 5xx > 1% for 5 min | Page on-call |
| bff_backoffice_lock_audit_gap | any lock_action_proxied event without matching audit row | Page Security; halt lock proxy |
| bff_backoffice_mfa_bypass_attempt | MFA_INVALID_OR_USED rate > 5/min for 5 min | Page Security |
| bff_backoffice_dpop_replay_spike | dpop_replay_blocked_total > 10/min for 5 min | Page Security |
| bff_backoffice_dashboard_partial_spike | partial > 30% for 10 min | Page on-call |
| bff_backoffice_circuit_open | any upstream circuit open > 5 min | Page on-call |
| bff_backoffice_outbox_lag | > 60 s for 5 min | Page SRE |
| bff_backoffice_redis_failover | failover event | Auto-ack; verify SSE bus continuity |
| bff_backoffice_postgres_down | uptime check fails for 2 min | Page SRE |
| bff_backoffice_force_logout_storm | force-logouts > 50 / min | Page Security |
| bff_backoffice_pii_in_telemetry | synthetic probe finds raw PII | Page Security; halt outbox |
Each alert links to a runbook at runbooks.melmastoon.ghasi.io/bff-backoffice/<short-name>.
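Most conditions above have the shape "value above threshold for N minutes". In practice the alerting backend evaluates this; the sketch below only illustrates the semantics, assuming one sample per evaluation interval and treating missing data as "not firing". The type and function names are hypothetical.

```typescript
interface Sample {
  tsMs: number; // sample timestamp, epoch milliseconds
  value: number;
}

/**
 * True when every sample in the trailing window (windowStart, now] exceeds
 * the threshold. An empty window never fires.
 */
function sustainedAbove(
  samples: Sample[],
  threshold: number,
  windowMs: number,
  nowMs: number,
): boolean {
  const windowStart = nowMs - windowMs;
  const recent = samples.filter((s) => s.tsMs > windowStart && s.tsMs <= nowMs);
  if (recent.length === 0) return false;
  return recent.every((s) => s.value > threshold);
}

// Example: MFA_INVALID_OR_USED rate sampled once a minute, threshold 5/min, 5 min window.
const MINUTE_MS = 60_000;
const mfaRate: Sample[] = [6, 7, 8, 9, 6].map((value, i) => ({
  tsMs: (i + 1) * MINUTE_MS,
  value: value,
}));
const mfaAlertFiring = sustainedAbove(mfaRate, 5, 5 * MINUTE_MS, 5 * MINUTE_MS);
```

A single sample at or below the threshold inside the window resets the condition, which is what keeps short spikes from paging.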
9. Alerts (P2)
| Alert | Condition |
|---|---|
| bff_backoffice_cache_hit_drop | dashboard cache hit < 70% for 30 min |
| bff_backoffice_device_offline_spike | devices in offline state > 2× baseline for 10 min |
| bff_backoffice_sync_handshake_failure_spike | handshake failures > 5% for 10 min |
| bff_backoffice_app_version_skew | active devices on EOL version count > 0 for 24 h |
| bff_backoffice_ai_decision_lag | mean decision-latency > 1 h for 24 h |
10. Synthetic monitoring
- Cloud Monitoring uptime checks every 60 s on /health/ready from 4 regions.
- Per-tenant canary device emulator: signs DPoP, refreshes, fetches dashboard, posts heartbeat, decides synthetic AI suggestion. Runs every 5 min from stage.
- Playwright nightly E2E from a real Electron build container against stage.
11. Trace sampling
- Default head sampler: 5%. The tail sampler in the OTel collector elevates sampling to 100% for traces with status_class=5xx, error_code != null, duration > p99, the lock-action category, or an MFA failure.
- Incident toggle: 100% via bff-backoffice-flags (timer-bound 1 h auto-revert).
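The sampling rules above reduce to a single keep-or-drop decision per trace. In the real pipeline this lives in the OTel collector's configuration; the TypeScript below is only a sketch of the decision logic, with a hypothetical TraceSummary shape and a simple trace-ID bucket standing in for the 5% baseline.

```typescript
interface TraceSummary {
  traceId: string; // 32 lowercase hex characters
  statusClass: string; // e.g. "2xx", "5xx"
  errorCode: string | null;
  durationMs: number;
  category: string; // e.g. "lock-action"
  mfaFailed: boolean;
}

const BASELINE_RATIO = 0.05; // default 5% sample

/** Deterministic baseline: maps the first 32 bits of the trace ID into [0, 1). */
function inBaseline(traceId: string, ratio: number): boolean {
  const bucket = parseInt(traceId.slice(0, 8), 16) / 0x100000000;
  return bucket < ratio; // NaN (malformed ID) compares false, so the trace is dropped
}

/** Keeps every trace matching an elevation rule, otherwise falls back to the baseline. */
function keepTrace(t: TraceSummary, p99Ms: number, incidentMode: boolean): boolean {
  if (incidentMode) return true; // incident toggle: 100%
  if (t.statusClass === "5xx") return true;
  if (t.errorCode !== null) return true;
  if (t.durationMs > p99Ms) return true;
  if (t.category === "lock-action") return true;
  if (t.mfaFailed) return true;
  return inBaseline(t.traceId, BASELINE_RATIO);
}
```

Deriving the baseline from the trace ID rather than a random draw means every component that sees the same trace reaches the same decision.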
12. Log retention
| Class | Retention |
|---|---|
| audit.* | 7 y |
| request.* | 30 d hot, 90 d cold |
| error.* | 90 d hot, 1 y cold |
| debug.* | 7 d |
| lock.* | 7 y |
| mfa.* | 7 y |
All classes are exported to BigQuery via Log Router.
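The retention table can be read as a prefix lookup on the log class. A small sketch, assuming class names such as audit.lock_action (hypothetical) and keeping the retention values as the strings given in the table:

```typescript
// Retention rules copied from the table; values are kept verbatim as strings.
const RETENTION_RULES: Array<{ prefix: string; retention: string }> = [
  { prefix: "audit.", retention: "7 y" },
  { prefix: "request.", retention: "30 d hot, 90 d cold" },
  { prefix: "error.", retention: "90 d hot, 1 y cold" },
  { prefix: "debug.", retention: "7 d" },
  { prefix: "lock.", retention: "7 y" },
  { prefix: "mfa.", retention: "7 y" },
];

/** Returns the retention for a log class, or null when no rule matches. */
function retentionFor(logClass: string): string | null {
  for (const rule of RETENTION_RULES) {
    if (logClass.indexOf(rule.prefix) === 0) return rule.retention;
  }
  return null;
}
```

Returning null for an unmatched class makes unclassified logs visible instead of silently giving them a default retention.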
13. Correlation IDs
Every BFF response carries X-Request-Id and traceparent. The same X-Request-Id is propagated as X-Audit-RequestId to all upstream services, so a single trace spans an entire check-in saga (BFF → reservation → folio → notification → analytics).
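A sketch of the headers the BFF might attach to an upstream call. It assumes the W3C Trace Context layout for traceparent (version, trace-id, parent-id, flags), where each hop keeps the trace-id and substitutes its own span ID as the parent-id. OTel propagators normally handle this automatically; the function names here are hypothetical.

```typescript
// version(2 hex) - trace-id(32 hex) - parent-id(16 hex) - flags(2 hex)
const TRACEPARENT_RE = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;
const SPAN_ID_RE = /^[0-9a-f]{16}$/;

/** Builds the outgoing traceparent: same trace-id and flags, new parent-id. */
function childTraceparent(incoming: string, currentSpanId: string): string {
  const parts = TRACEPARENT_RE.exec(incoming);
  if (parts === null) throw new Error("malformed traceparent: " + incoming);
  if (!SPAN_ID_RE.test(currentSpanId)) throw new Error("malformed span id");
  return [parts[1], parts[2], currentSpanId, parts[4]].join("-");
}

/** Headers for an upstream call: the request ID is forwarded under its audit name. */
function upstreamHeaders(
  requestId: string,
  incomingTraceparent: string,
  currentSpanId: string,
): Record<string, string> {
  return {
    "X-Audit-RequestId": requestId,
    traceparent: childTraceparent(incomingTraceparent, currentSpanId),
  };
}

// Example with made-up IDs.
const outgoing = upstreamHeaders(
  "req_123",
  "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
  "b7ad6b7169203331",
);
```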
14. Cost observability
- Cloud Billing alerts at 50/80/100/120%.
- Per-tenant cost dashboard.
- Pub/Sub egress per subject.
15. SLO error-budget policy
| Consumption | Action |
|---|---|
| 25% | Slack notification |
| 50% | TODO ticket auto-created |
| 75% | Freeze non-critical changes |
| 100% | Hard freeze; revert recent changes |
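The policy table maps directly to a threshold lookup. A minimal sketch, with hypothetical names, that returns the action for the highest threshold reached:

```typescript
// Thresholds and actions copied from the policy table, in ascending order.
const BUDGET_POLICY: Array<{ atPercent: number; action: string }> = [
  { atPercent: 25, action: "Slack notification" },
  { atPercent: 50, action: "TODO ticket auto-created" },
  { atPercent: 75, action: "Freeze non-critical changes" },
  { atPercent: 100, action: "Hard freeze; revert recent changes" },
];

/** Action for the highest threshold reached, or null below the first one. */
function policyAction(consumedPercent: number): string | null {
  let action: string | null = null;
  for (const step of BUDGET_POLICY) {
    if (consumedPercent >= step.atPercent) action = step.action;
  }
  return action;
}
```

For example, 60% consumption yields the 50% action, and anything at or above 100% yields the hard freeze.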