OBSERVABILITY — bff-consumer-service
Sibling: APPLICATION_LOGIC · FAILURE_MODES · SECURITY_MODEL
Cross-cutting: 02 Enterprise Architecture · §11 Observability
1. Telemetry stack
| Layer | Tool | Purpose |
|---|---|---|
| Traces | OpenTelemetry SDK → Cloud Trace | End-to-end request and upstream call traces |
| Metrics | OpenTelemetry → Cloud Monitoring (Managed Prometheus) | RED + USE + business metrics |
| Logs | Pino → Cloud Logging (structured JSON) | Request, error, audit |
| Errors | Sentry (NestJS integration) | Unhandled exceptions, regressions |
| RUM | Web Vitals + page-view beacons → BigQuery | Front-end UX feedback (collected via /telemetry/page-view) |
| Profiling | Cloud Profiler (continuous CPU + heap, 1% sample) | Hot-path optimisation |
2. Service-level objectives
| SLO | Target | Window | Alert at |
|---|---|---|---|
| Availability of /search (HTTP success rate) | 99.9% | 30 d rolling | Burn rate > 14.4× over 1 h or > 6× over 6 h |
| /search p95 latency | < 700 ms | 30 d | p95 > 850 ms for 10 min |
| /search p99 latency | < 1500 ms | 30 d | p99 > 2000 ms for 10 min |
| /hotels/{id} p95 latency | < 600 ms | 30 d | p95 > 750 ms for 10 min |
| /handoff mint p95 latency | < 350 ms | 30 d | p95 > 500 ms for 10 min |
| /handoff consume p95 latency | < 100 ms | 30 d | p95 > 150 ms for 10 min |
| Cache hit ratio (search list, brand peek) | > 80% | 30 d | < 65% for 30 min |
| Memorystore connection error rate | < 0.1% | 30 d | > 0.5% for 5 min |
| HMAC verify failures (handoff) | < 0.05% of mints | 30 d | > 0.5% for 5 min (potential key issue or attack) |
| Bot-suspected ratio | < 8% of /search | 30 d | > 25% for 15 min (bot wave) |
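
The burn-rate thresholds in the availability row follow the common multi-window pattern, where burn rate is the observed error ratio divided by the error budget (1 minus the target). The sketch below is illustrative only; the function names are assumptions and do not come from the service code.

```typescript
// Illustrative helpers; names are assumptions, not part of the service code.

/** Burn rate = observed error ratio / error budget (1 - SLO target). */
function burnRate(errorRatio: number, sloTarget: number): number {
  const budget = 1 - sloTarget;
  if (budget <= 0) throw new Error("SLO target must be below 1");
  return errorRatio / budget;
}

/** Fraction of the window's total error budget consumed if `rate` is sustained for `hours`. */
function budgetConsumed(rate: number, hours: number, windowDays = 30): number {
  return (rate * hours) / (windowDays * 24);
}

/** Alert condition from the availability row: > 14.4x over 1 h or > 6x over 6 h. */
function shouldAlert(errorRatio1h: number, errorRatio6h: number, sloTarget = 0.999): boolean {
  return burnRate(errorRatio1h, sloTarget) > 14.4 || burnRate(errorRatio6h, sloTarget) > 6;
}
```

With a 99.9% target the budget is 0.1%, so a 14.4× burn corresponds to roughly a 1.44% error ratio. Sustained for 1 h it consumes about 2% of the 30-day budget; a 6× burn sustained for 6 h consumes about 5%.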
3. Golden signals dashboard (Cloud Monitoring)
Per endpoint group (search, hotel-detail, handoff, wishlist, session, telemetry):
- Rate: requests/sec (stacked by `http.status_code`).
- Errors: 5xx + 4xx breakdown.
- Duration: p50, p95, p99 latency histogram.
- Saturation: CPU %, container memory, Memorystore connection pool utilization, Postgres pgbouncer connections.
Plus business panels:
- Conversion funnel: `search → click → handoff` ratio (sampled and projected from telemetry events into BigQuery, surfaced via Looker Studio embed).
- Cache hit ratio (Memorystore single-flight `hit / (hit + miss)`).
- Currency / locale distribution.
- Top 20 search queries (last 1 h, anonymised).
4. Trace propagation
- Inbound: the `traceparent` header (W3C Trace Context) is accepted; if absent, a new trace context is generated.
- Outbound: propagate `traceparent` + `tracestate` to all upstream HTTP calls.
- Memorystore + Postgres calls add `db.system`, `db.statement` (sanitised), `db.cache.key.template` attributes.
- Single-flight wrapper adds `cache.singleflight.outcome ∈ {hit, miss-leader, miss-follower}`.
- Pub/Sub publish adds `messaging.system="pubsub"`, `messaging.destination=<topic>`, `messaging.message.id`.
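
The inbound and outbound rules above can be sketched as follows. In practice the OpenTelemetry SDK propagator would normally handle this; the helper names here are hypothetical, only the version-00 header layout is handled, and marking generated contexts as sampled (`01`) is an assumption, since the real decision belongs to the sampler.

```typescript
// Sketch of W3C Trace Context handling; helper names are illustrative.
interface TraceContext {
  traceId: string;  // 32 lowercase hex chars, not all zeros
  parentId: string; // 16 lowercase hex chars, not all zeros
  flags: string;    // 2 hex chars; "01" means sampled
}

const TRACEPARENT_RE = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

/** Returns the parsed context, or null when the header is missing or invalid. */
function parseTraceparent(header: string | undefined): TraceContext | null {
  if (!header) return null;
  const m = TRACEPARENT_RE.exec(header.trim());
  if (!m) return null;
  const [, version, traceId, parentId, flags] = m;
  if (version !== "00") return null; // this sketch only accepts the version-00 layout
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null; // all-zero IDs are invalid
  return { traceId, parentId, flags };
}

/** Random non-zero lowercase hex string; Math.random keeps the sketch dependency-free. */
function randomHex(chars: number): string {
  let out = "";
  do {
    out = "";
    for (let i = 0; i < chars; i++) out += Math.floor(Math.random() * 16).toString(16);
  } while (/^0+$/.test(out));
  return out;
}

/** Inbound: accept a valid header, otherwise start a new trace. */
function inboundContext(header: string | undefined): TraceContext {
  return parseTraceparent(header) ?? { traceId: randomHex(32), parentId: randomHex(16), flags: "01" };
}

/** Outbound: same trace ID, with the current span as the new parent. */
function outboundTraceparent(ctx: TraceContext, currentSpanId: string): string {
  return `00-${ctx.traceId}-${currentSpanId}-${ctx.flags}`;
}
```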
A canonical search trace looks like:
HTTP GET /bff/consumer/v1/search
├── interceptor.session-bootstrap (5 ms)
├── interceptor.bot-detection (3 ms)
├── interceptor.rate-limit (1 ms)
├── cache.singleflight (45 ms; outcome=miss-leader)
│ ├── http search-aggregation-service POST /search (210 ms)
│ ├── http pricing-service POST /quotes/preview ×10 (parallel, slowest 180 ms)
│ ├── http theme-config-service GET /brand-peek/batch (60 ms)
│ └── compose ListingCardVM ×20 (15 ms)
├── store.search-session (memorystore, 4 ms)
└── pubsub.publish search.executed.v1 (8 ms; non-blocking)
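
The `cache.singleflight` span above reports one of three outcomes. A minimal in-memory sketch of how such a wrapper could derive them is shown below; the class name is hypothetical, a `Map` stands in for Memorystore, and TTLs and cross-instance coordination are deliberately left out.

```typescript
type Outcome = "hit" | "miss-leader" | "miss-follower";

// In-memory stand-in for the Memorystore-backed wrapper (an assumption of this sketch).
class SingleFlightCache<V> {
  private readonly values = new Map<string, V>();
  private readonly inflight = new Map<string, Promise<V>>();

  async get(key: string, load: () => Promise<V>): Promise<{ value: V; outcome: Outcome }> {
    if (this.values.has(key)) {
      return { value: this.values.get(key) as V, outcome: "hit" };
    }
    // Another caller is already loading this key: wait for its result.
    const pending = this.inflight.get(key);
    if (pending) {
      return { value: await pending, outcome: "miss-follower" };
    }
    // First caller for this key: perform the load and share the promise.
    const promise = load()
      .then((value) => {
        this.values.set(key, value);
        return value;
      })
      .finally(() => {
        this.inflight.delete(key);
      });
    this.inflight.set(key, promise);
    return { value: await promise, outcome: "miss-leader" };
  }
}
```

Concurrent callers for the same key share one upstream load: the first is the leader, the rest are followers, and later callers see a hit. If the load fails, the leader and its followers all receive the same rejection and nothing is cached.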
5. Logs
5.1 Structure (Pino → Cloud Logging)
Every log line is JSON with the following base fields:
| Field | Source |
|---|---|
| severity | Pino → Cloud Logging mapping |
| time | ISO 8601 |
| requestId | Generated per request |
| traceId | OTel trace ID |
| spanId | OTel span ID |
| service | bff-consumer-service |
| version | Git SHA |
| env | dev / stage / prod |
| region | Cloud Run region |
| instance | Cloud Run instance ID |
| endpoint | route key, e.g. GET /search |
| httpStatus | int |
| latencyMs | int |
| guestSessionId | when bootstrapped |
| fingerprintHash | always (hashed) |
| ipHash | always (hashed) |
| cacheOutcome | when applicable |
| errorCode | MELMASTOON.… when error |
| errorClass | exception name when error |
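
As a rough illustration of the base fields, the sketch below types the log context and maps Pino's default numeric levels (10 to 60) onto Cloud Logging severity names. The interface and function names are hypothetical, and the exact mapping used by the service is not stated in this document.

```typescript
type Severity = "DEBUG" | "INFO" | "WARNING" | "ERROR" | "CRITICAL";

/** Maps Pino's default numeric levels to Cloud Logging severities (assumed mapping). */
function severityFor(pinoLevel: number): Severity {
  if (pinoLevel >= 60) return "CRITICAL"; // fatal
  if (pinoLevel >= 50) return "ERROR";
  if (pinoLevel >= 40) return "WARNING";  // warn
  if (pinoLevel >= 30) return "INFO";
  return "DEBUG";                         // debug (20) and trace (10)
}

// Field names follow the table above; the type itself is illustrative.
interface RequestLogContext {
  requestId: string;
  traceId: string;
  spanId: string;
  service: "bff-consumer-service";
  version: string;
  env: "dev" | "stage" | "prod";
  region: string;
  instance: string;
  endpoint: string;
  httpStatus: number;
  latencyMs: number;
  fingerprintHash: string;
  ipHash: string;
  guestSessionId?: string; // when bootstrapped
  cacheOutcome?: string;   // when applicable
  errorCode?: string;      // when error
  errorClass?: string;     // when error
}

/** Adds `severity` and an ISO 8601 `time` to the request context. */
function buildLogLine(pinoLevel: number, ctx: RequestLogContext, now: Date = new Date()) {
  return { severity: severityFor(pinoLevel), time: now.toISOString(), ...ctx };
}
```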
5.2 Sampling
- INFO: 10% of successful 2xx requests.
- INFO: 100% for handoff mints, handoff consumes, session bootstrap (first request only), and wishlist mutations.
- WARN: 100%.
- ERROR: 100%.
- AUDIT: 100% (handoff mint, bot suspected, ownership violation).
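
The sampling rules above could be expressed as a single decision function, sketched below. The event names are hypothetical labels for the always-logged cases, and keeping non-2xx INFO lines at 100% is an assumption, since the rules only state a rate for successful requests.

```typescript
type Level = "INFO" | "WARN" | "ERROR" | "AUDIT";

// Hypothetical event keys for the INFO lines that are never sampled out.
const ALWAYS_LOGGED_EVENTS = new Set<string>([
  "handoff.mint",
  "handoff.consume",
  "session.bootstrap.first",
  "wishlist.mutation",
]);

/** Decides whether a log line is emitted; `random` is injectable for tests. */
function shouldEmit(
  level: Level,
  event: string | undefined,
  httpStatus: number,
  random: () => number = Math.random,
): boolean {
  if (level !== "INFO") return true; // WARN, ERROR, AUDIT: 100%
  if (event !== undefined && ALWAYS_LOGGED_EVENTS.has(event)) return true;
  if (httpStatus >= 200 && httpStatus < 300) return random() < 0.1; // 10% of successful requests
  return true; // assumption: INFO lines for non-2xx responses are kept
}
```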
5.3 Sensitive data
- IP and UA are hashed with peppered SHA-256 at ingest; raw values never enter logs.
- Search queries are stored verbatim only in trace span attributes (1% sampled). Logs include only the `queryHash`.
- HMAC secrets, cookie values, and Authorization headers are scrubbed by a Pino redaction list.
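
A minimal sketch of the hashing and redaction described above. "Peppered SHA-256" is interpreted here as HMAC-SHA-256 keyed with the pepper, which is an assumption; the redaction paths are also hypothetical examples written in the format of Pino's `redact` option, not the service's actual list.

```typescript
import { createHmac } from "node:crypto";

/** Peppered hash of a raw value, sketched as HMAC-SHA-256 keyed with the pepper. */
function pepperedHash(value: string, pepper: string): string {
  return createHmac("sha256", pepper).update(value).digest("hex");
}

/** Builds the hashed identity fields so raw IP and UA never reach the logger. */
function hashedIdentity(ip: string, userAgent: string, pepper: string) {
  return {
    ipHash: pepperedHash(ip, pepper),
    fingerprintHash: pepperedHash(userAgent, pepper),
  };
}

// Hypothetical options object that would be passed to pino(...).
const loggerOptions = {
  redact: {
    paths: ["req.headers.authorization", "req.headers.cookie", "*.hmacSecret"],
    censor: "[REDACTED]",
  },
};
```

Keeping the pepper outside the logged data (for example in Secret Manager) is what prevents the hashes from being reversed by brute-forcing the small IPv4 space.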
6. Metrics catalogue
6.1 RED metrics (per endpoint)
| Metric | Type | Labels |
|---|---|---|
| http.server.request.duration | histogram | route, method, status_code, cache_outcome |
| http.server.request.count | counter | same |
| http.server.errors | counter | route, error_code |
6.2 Domain-ish metrics (cross-tenant aggregate analytics)
| Metric | Type | Labels |
|---|---|---|
| bff_consumer.search.executed.total | counter | currency, locale, bot_verdict |
| bff_consumer.handoff.minted.total | counter | tenantId, currency, locale |
| bff_consumer.handoff.consumed.total | counter | tenantId, outcome ∈ {ok, replay} |
| bff_consumer.handoff.replay.total | counter | tenantId |
| bff_consumer.bot.score.bucket.total | counter | bucket ∈ {0-30, 30-60, 60-85, 85-100} |
| bff_consumer.cache.hit_ratio | gauge (recording rule) | cache_kind ∈ {search, hotel, brand-peek, popularity, light-availability} |
| bff_consumer.session.active | gauge (recording rule) | n/a (estimated from Memorystore key scan; sampled) |
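
The hit-ratio gauge is derived as `hit / (hit + miss)` per `cache_kind`. A small sketch of that derivation follows; the type and function names are illustrative, and returning `null` when there is no traffic (rather than 0) is a design assumption so that idle caches do not trip the low-ratio alert.

```typescript
type CacheKind = "search" | "hotel" | "brand-peek" | "popularity" | "light-availability";

interface CacheCounts {
  kind: CacheKind;
  hit: number;
  miss: number;
}

/** hit / (hit + miss), or null when the cache saw no traffic in the interval. */
function hitRatio(hit: number, miss: number): number | null {
  const total = hit + miss;
  return total === 0 ? null : hit / total;
}

/** Computes one ratio per cache kind from counter deltas over the same interval. */
function hitRatiosByKind(counts: CacheCounts[]): Map<CacheKind, number | null> {
  const ratios = new Map<CacheKind, number | null>();
  for (const c of counts) ratios.set(c.kind, hitRatio(c.hit, c.miss));
  return ratios;
}
```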
6.3 Resource metrics
- Cloud Run: container CPU, memory, instance count, concurrent requests, container startup latency.
- Memorystore: ops/sec, evictions, used_memory, connected_clients.
- Postgres: pgbouncer pool active/idle, query duration p99, deadlocks, autovacuum lag.
- Pub/Sub publisher: queue depth, publish latency, ack errors.
- Secret Manager: read latency, version-pin staleness.
7. Alerts (PagerDuty)
| Alert | Severity | Threshold | Action |
|---|---|---|---|
| /search 5xx burn 14.4× over 1 h | P1 | SLO burn | Page primary on-call |
| Memorystore connection failure ratio > 0.5% | P1 | 5 min sustained | Page primary; attempt failover |
| Postgres pgbouncer pool exhausted | P1 | 5 min | Page; raise pool size |
| HMAC verify failures > 0.5% of mints | P1 | 5 min | Page security on-call; check key rotation |
| Handoff replay rate > 1% | P2 | 15 min | Page security on-call |
| Bot-suspected ratio > 25% | P2 | 15 min | Notify SRE; consider campaign mode |
| Cache hit ratio < 65% | P3 | 30 min | Notify SRE; investigate stampede |
| Pub/Sub publish failure ratio > 1% | P2 | 10 min | Page; check IAM / quota |
| Container OOM kill | P2 | any | Page; investigate hot path |
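
Most thresholds in the table are "sustained for N minutes" conditions. The sketch below shows one way to evaluate such a condition over a series of samples. It is not how Cloud Monitoring implements alert policies; the names are hypothetical, and the rule used (the current breaching streak must span at least the window) is an assumption.

```typescript
interface Sample {
  timestampMs: number;
  value: number;
}

/**
 * True when the latest sample breaches the threshold and the unbroken run of
 * breaching samples leading up to it spans at least `windowMs`.
 */
function sustainedBreach(
  samples: Sample[],
  threshold: number,
  windowMs: number,
  direction: "above" | "below" = "above",
): boolean {
  const breaches = (v: number) => (direction === "above" ? v > threshold : v < threshold);
  const sorted = [...samples].sort((a, b) => a.timestampMs - b.timestampMs);
  if (sorted.length === 0) return false;
  const latest = sorted[sorted.length - 1];
  if (!breaches(latest.value)) return false;
  // Walk back to the start of the current breaching streak.
  let streakStart = latest.timestampMs;
  for (let i = sorted.length - 1; i >= 0; i--) {
    if (!breaches(sorted[i].value)) break;
    streakStart = sorted[i].timestampMs;
  }
  return latest.timestampMs - streakStart >= windowMs;
}
```

For example, the Memorystore row would call this with `threshold = 0.005` and `windowMs = 5 * 60_000`, and the cache hit ratio row with `direction = "below"`.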
8. Synthetic checks
- Cloud Monitoring uptime check every 60 s from 5 regions:
  - `GET /health/live` → expects 200 with `{ "ok": true }`.
  - `GET /health/ready` → expects 200 with all dependency probes green.
- Synthetic search every 5 min from 3 regions: `GET /search?city=kabul&checkIn=…&checkOut=…&adults=2` → asserts `results.length > 0`, p95 < 1.5 s.
- Synthetic handoff every 15 min in stage: mint → consume cycle; asserts `consumed=true` and replay returns `MELMASTOON.BFF.TENANT.HANDOFF_REPLAYED`.
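
The synthetic search assertions could be evaluated roughly as sketched below. The probe shape and function names are hypothetical, the p95 is assumed to be taken over a batch of recent probe latencies, and the nearest-rank percentile method is one common choice rather than something this document specifies.

```typescript
/** Nearest-rank percentile over a non-empty list of values. */
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("no samples");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

// Hypothetical record of one synthetic search run.
interface SearchProbe {
  resultCount: number; // results.length in the response body
  latencyMs: number;
}

/** Checks both assertions: non-empty results on every probe and p95 below 1.5 s. */
function evaluateSearchProbes(probes: SearchProbe[]): { ok: boolean; failures: string[] } {
  const failures: string[] = [];
  probes.forEach((probe, i) => {
    if (probe.resultCount <= 0) failures.push(`probe ${i}: empty results`);
  });
  if (probes.length > 0) {
    const p95 = percentile(probes.map((probe) => probe.latencyMs), 95);
    if (p95 >= 1500) failures.push(`p95 ${p95} ms is not below 1500 ms`);
  }
  return { ok: failures.length === 0, failures };
}
```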
9. Runbooks
Runbooks are cross-linked from SECURITY_MODEL §15 and FAILURE_MODES. All of them live under runbooks/bff-consumer/.
10. Capacity planning signals
Tracked weekly in the platform capacity review:
- p99 latency trend.
- Cache hit ratio trend.
- Cold-start ratio (Cloud Run).
- Memorystore memory headroom (target < 70% sustained).
- Pub/Sub publish backlog.
11. Audit + compliance feeds
The following feeds are wired automatically into the platform audit-service via Pub/Sub subscription:
- `melmastoon.bff.consumer.handoff.initiated.v1`
- `melmastoon.bff.consumer.bot_suspected.v1`
- `melmastoon.bff.consumer.session.started.v1` (sampled)
12. Data retention
| Data | Retention | Storage |
|---|---|---|
| Cloud Logging structured logs | 30 d (info), 90 d (warn/error/audit) | Cloud Logging buckets |
| Cloud Trace spans | 30 d | Cloud Trace |
| Cloud Monitoring metrics | 6 weeks (default), 18 months (recording rules) | Managed Prometheus |
| Sentry | 90 d | Sentry SaaS |
| BigQuery telemetry sink | 13 months hot, 5 years archive | BigQuery + GCS |
| RUM beacons | 90 d | BigQuery |