OBSERVABILITY — bff-consumer-service

Sibling: APPLICATION_LOGIC · FAILURE_MODES · SECURITY_MODEL

Cross-cutting: 02 Enterprise Architecture · §11 Observability

1. Telemetry stack

| Layer | Tool | Purpose |
| --- | --- | --- |
| Traces | OpenTelemetry SDK → Cloud Trace | End-to-end request and upstream call traces |
| Metrics | OpenTelemetry → Cloud Monitoring (Managed Prometheus) | RED + USE + business metrics |
| Logs | Pino → Cloud Logging (structured JSON) | Request, error, audit |
| Errors | Sentry (NestJS integration) | Unhandled exceptions, regressions |
| RUM | Web Vitals + page-view beacons → BigQuery | Front-end UX feedback (collected via /telemetry/page-view) |
| Profiling | Cloud Profiler (continuous CPU + heap, 1% sample) | Hot-path optimisation |

2. Service-level objectives

| SLO | Target | Window | Alert at |
| --- | --- | --- | --- |
| Availability of /search (HTTP success rate) | 99.9% | 30 d rolling | Burn rate > 14.4× over 1 h or > 6× over 6 h |
| /search p95 latency | < 700 ms | 30 d | p95 > 850 ms for 10 min |
| /search p99 latency | < 1500 ms | 30 d | p99 > 2000 ms for 10 min |
| /hotels/{id} p95 latency | < 600 ms | 30 d | p95 > 750 ms for 10 min |
| /handoff mint p95 latency | < 350 ms | 30 d | p95 > 500 ms for 10 min |
| /handoff consume p95 latency | < 100 ms | 30 d | p95 > 150 ms for 10 min |
| Cache hit ratio (search list, brand peek) | > 80% | 30 d | < 65% for 30 min |
| Memorystore connection error rate | < 0.1% | 30 d | > 0.5% for 5 min |
| HMAC verify failures (handoff) | < 0.05% of mints | 30 d | > 0.5% for 5 min (potential key issue or attack) |
| Bot-suspected ratio | < 8% of /search | 30 d | > 25% for 15 min (bot wave) |
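
The burn-rate thresholds on the availability SLO can be read as multiples of the error budget. The following is a minimal sketch, assuming the common definition of burn rate as the observed error ratio divided by the error budget (1 minus the SLO target). The function names are hypothetical and not taken from the service code.

```typescript
// Hypothetical helpers for illustration; names are not from the service.

/** Burn rate = observed error ratio / error budget, where budget = 1 - SLO target. */
function burnRate(errorRatio: number, sloTarget: number): number {
  const errorBudget = 1 - sloTarget;
  return errorRatio / errorBudget;
}

/** Fraction of the whole window's error budget used by burning at `rate` for `hours`. */
function budgetConsumed(rate: number, hours: number, windowDays: number): number {
  return (rate * hours) / (windowDays * 24);
}

/** Mirrors the availability alert: page when > 14.4x over 1 h or > 6x over 6 h. */
function shouldPage(errorRatio1h: number, errorRatio6h: number, sloTarget = 0.999): boolean {
  return burnRate(errorRatio1h, sloTarget) > 14.4 || burnRate(errorRatio6h, sloTarget) > 6;
}
```

With a 99.9% target the error budget is 0.1%, so a 14.4× burn corresponds to a 1.44% error ratio and uses about 2% of the 30-day budget in one hour (14.4 × 1 / 720). A 6× burn sustained for 6 h uses about 5% (6 × 6 / 720).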

3. Golden signals dashboard (Cloud Monitoring)

Per endpoint group (search, hotel-detail, handoff, wishlist, session, telemetry):

  • Rate — requests/sec (stacked by http.status_code).
  • Errors — 5xx + 4xx breakdown.
  • Duration — p50, p95, p99 latency histogram.
  • Saturation — CPU %, container memory, Memorystore connection pool utilization, Postgres pgbouncer connections.

Plus business panels:

  • Conversion funnel: search → click → handoff ratio (sampled and projected from telemetry events into BigQuery, surfaced via Looker Studio embed).
  • Cache hit ratio (Memorystore single-flight hit / (hit + miss)).
  • Currency / locale distribution.
  • Top 20 search queries (last 1 h, anonymised).

4. Trace propagation

  • Inbound: the traceparent header (W3C Trace Context) is accepted; if it is absent, a new trace context is generated.
  • Outbound: propagate traceparent + tracestate to all upstream HTTP calls.
  • Memorystore + Postgres calls add db.system, db.statement (sanitised), db.cache.key.template attributes.
  • Single-flight wrapper adds cache.singleflight.outcome ∈ {hit, miss-leader, miss-follower}.
  • Pub/Sub publish adds messaging.system="pubsub", messaging.destination=<topic>, messaging.message.id.
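
The inbound and outbound rules above can be sketched as follows. This is an illustration only: in practice the OpenTelemetry SDK's W3C propagator normally handles this, and the sketch only accepts version 00 headers and ignores tracestate. All names (TraceContext, inboundContext, and so on) are hypothetical. Math.random is used to keep the example dependency-free.

```typescript
// Illustrative sketch of W3C traceparent handling; names are hypothetical.
interface TraceContext {
  traceId: string; // 32 lowercase hex characters
  spanId: string;  // 16 lowercase hex characters
  sampled: boolean;
}

// Only version "00" is accepted in this sketch.
const TRACEPARENT_RE = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;
const ALL_ZERO_RE = /^0+$/;

/** Random hex ID that is never all zeros (all-zero IDs are invalid per the spec). */
function randomId(length: number): string {
  let id = "";
  do {
    id = "";
    for (let i = 0; i < length; i++) {
      id += Math.floor(Math.random() * 16).toString(16);
    }
  } while (ALL_ZERO_RE.test(id));
  return id;
}

/** Returns the parsed context, or null when the header is missing or malformed. */
function parseTraceparent(header: string | undefined): TraceContext | null {
  if (!header) return null;
  const match = TRACEPARENT_RE.exec(header.trim());
  if (!match) return null;
  const [, traceId = "", spanId = "", flags = "00"] = match;
  if (ALL_ZERO_RE.test(traceId) || ALL_ZERO_RE.test(spanId)) return null;
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}

/** Inbound rule: accept a valid header, otherwise start a new trace. */
function inboundContext(header: string | undefined): TraceContext {
  return parseTraceparent(header) ?? { traceId: randomId(32), spanId: randomId(16), sampled: true };
}

/** Outbound rule: same trace ID, fresh span ID for the upstream call. */
function outboundTraceparent(ctx: TraceContext): string {
  return `00-${ctx.traceId}-${randomId(16)}-${ctx.sampled ? "01" : "00"}`;
}
```

Marking a newly generated context as sampled is an assumption of the sketch; the real sampling decision belongs to the configured sampler.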

A canonical search trace looks like:

HTTP GET /bff/consumer/v1/search
├── interceptor.session-bootstrap (5 ms)
├── interceptor.bot-detection (3 ms)
├── interceptor.rate-limit (1 ms)
├── cache.singleflight (45 ms; outcome=miss-leader)
│   ├── http search-aggregation-service POST /search (210 ms)
│   ├── http pricing-service POST /quotes/preview ×10 (parallel, slowest 180 ms)
│   ├── http theme-config-service GET /brand-peek/batch (60 ms)
│   └── compose ListingCardVM ×20 (15 ms)
├── store.search-session (memorystore, 4 ms)
└── pubsub.publish search.executed.v1 (8 ms; non-blocking)
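
The cache.singleflight span and its outcome attribute can be illustrated with a small in-process sketch. It assumes an in-memory Map in place of Memorystore and omits TTLs and cross-instance coordination. The class and type names are hypothetical.

```typescript
// Illustrative single-flight wrapper; an in-memory Map stands in for Memorystore.
type SingleFlightOutcome = "hit" | "miss-leader" | "miss-follower";

interface SingleFlightResult<T> {
  value: T;
  outcome: SingleFlightOutcome;
}

class SingleFlightCache<T> {
  private readonly cache = new Map<string, T>();
  private readonly inFlight = new Map<string, Promise<T>>();

  async get(key: string, load: () => Promise<T>): Promise<SingleFlightResult<T>> {
    // Cached value: no upstream call at all.
    if (this.cache.has(key)) {
      return { value: this.cache.get(key) as T, outcome: "hit" };
    }
    // Another caller is already loading this key: wait for its result.
    const pending = this.inFlight.get(key);
    if (pending) {
      return { value: await pending, outcome: "miss-follower" };
    }
    // First caller for this key: perform the load and publish the result.
    const promise = load()
      .then((value) => {
        this.cache.set(key, value);
        return value;
      })
      .finally(() => {
        this.inFlight.delete(key);
      });
    this.inFlight.set(key, promise);
    return { value: await promise, outcome: "miss-leader" };
  }
}
```

In this sketch a failed load rejects for the leader and every follower, and nothing is cached, so the next request retries as a new leader.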

5. Logs

5.1 Structure (Pino → Cloud Logging)

Every log line is JSON with the following base fields:

| Field | Source |
| --- | --- |
| severity | Pino → Cloud Logging mapping |
| time | ISO 8601 |
| requestId | Generated per request |
| traceId | OTel trace ID |
| spanId | OTel span ID |
| service | bff-consumer-service |
| version | Git SHA |
| env | dev / stage / prod |
| region | Cloud Run region |
| instance | Cloud Run instance ID |
| endpoint | route key, e.g. GET /search |
| httpStatus | int |
| latencyMs | int |
| guestSessionId | when bootstrapped |
| fingerprintHash | always (hashed) |
| ipHash | always (hashed) |
| cacheOutcome | when applicable |
| errorCode | MELMASTOON.… when error |
| errorClass | exception name when error |
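
As an illustration, the base fields could be typed as below. The interface name and every value in the example record are hypothetical placeholders; only the field names come from the list above.

```typescript
// Hypothetical type for the base log fields; optional fields are those marked
// "when ..." in the field list.
interface BaseLogFields {
  severity: string;
  time: string; // ISO 8601
  requestId: string;
  traceId: string;
  spanId: string;
  service: "bff-consumer-service";
  version: string; // Git SHA
  env: "dev" | "stage" | "prod";
  region: string;
  instance: string;
  endpoint: string;
  httpStatus: number;
  latencyMs: number;
  guestSessionId?: string;
  fingerprintHash: string;
  ipHash: string;
  cacheOutcome?: string;
  errorCode?: string;
  errorClass?: string;
}

// Example record with made-up values, serialised as one JSON line.
const exampleLine: BaseLogFields = {
  severity: "INFO",
  time: new Date(0).toISOString(),
  requestId: "req-example",
  traceId: "4bf92f3577b34da6a3ce929d0e0e4736",
  spanId: "00f067aa0ba902b7",
  service: "bff-consumer-service",
  version: "abc1234",
  env: "dev",
  region: "example-region",
  instance: "example-instance",
  endpoint: "GET /search",
  httpStatus: 200,
  latencyMs: 312,
  fingerprintHash: "example-fingerprint-hash",
  ipHash: "example-ip-hash",
  cacheOutcome: "miss-leader",
};

const serialised: string = JSON.stringify(exampleLine);
```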

5.2 Sampling

  • INFO: 10% of successful 2xx requests.
  • INFO: 100% for handoff mints, handoff consumes, session bootstrap (first request only), and wishlist mutations.
  • WARN: 100%.
  • ERROR: 100%.
  • AUDIT: 100% (handoff mint, bot suspected, ownership violation).
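
The sampling rules above can be sketched as a single decision function. The event names are hypothetical labels for the always-logged operations, and the treatment of INFO lines for non-2xx responses (kept at 100% here) is an assumption, since the rules above do not cover that case.

```typescript
// Illustrative sampling decision; event names are hypothetical.
type LogLevel = "INFO" | "WARN" | "ERROR" | "AUDIT";

interface LogCandidate {
  level: LogLevel;
  httpStatus: number;
  event?: string;
}

// Operations logged at 100% even at INFO level.
const ALWAYS_LOGGED_EVENTS: ReadonlySet<string> = new Set([
  "handoff.mint",
  "handoff.consume",
  "session.bootstrap", // first request of a session only
  "wishlist.mutation",
]);

const INFO_SUCCESS_SAMPLE_RATE = 0.1;

function shouldLog(candidate: LogCandidate, random: () => number = Math.random): boolean {
  // WARN, ERROR, and AUDIT are never sampled out.
  if (candidate.level !== "INFO") return true;
  if (candidate.event !== undefined && ALWAYS_LOGGED_EVENTS.has(candidate.event)) return true;
  const isSuccess = candidate.httpStatus >= 200 && candidate.httpStatus < 300;
  // Assumption: INFO lines for non-2xx responses are kept.
  if (!isSuccess) return true;
  return random() < INFO_SUCCESS_SAMPLE_RATE;
}
```

Passing the random source as a parameter keeps the decision deterministic under test.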

5.3 Sensitive data

  • IP and UA are hashed with peppered SHA-256 at ingest; raw values never enter logs.
  • Search queries are stored verbatim only in trace span attributes (1% sampled). Logs include only the queryHash.
  • HMAC secrets, cookie values, and Authorization headers are scrubbed by a Pino redaction list.
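
A minimal sketch of the hashing and scrubbing steps follows. It assumes "peppered SHA-256" is realised as HMAC-SHA-256 keyed with the pepper, which is one common construction and may differ from the service's actual scheme. The header scrubber is a dependency-free stand-in for a Pino redaction list, and all names are hypothetical.

```typescript
import { createHmac } from "node:crypto";

// Assumption: the pepper is used as an HMAC-SHA-256 key. The real construction
// may differ (for example, SHA-256 over pepper + value).
function hashIdentifier(raw: string, pepper: string): string {
  return createHmac("sha256", pepper).update(raw, "utf8").digest("hex");
}

// Hypothetical list of header names whose values must never reach the logs.
const SCRUBBED_HEADERS: ReadonlySet<string> = new Set(["authorization", "cookie", "set-cookie"]);

// Returns a copy of the headers with sensitive values replaced.
function scrubHeaders(headers: Record<string, string>): Record<string, string> {
  const scrubbed: Record<string, string> = {};
  for (const [name, value] of Object.entries(headers)) {
    scrubbed[name] = SCRUBBED_HEADERS.has(name.toLowerCase()) ? "[Redacted]" : value;
  }
  return scrubbed;
}
```

Hashing at ingest means only ipHash and fingerprintHash ever reach the logger, so the raw values cannot leak through a missed redaction path.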

6. Metrics catalogue

6.1 RED metrics (per endpoint)

| Metric | Type | Labels |
| --- | --- | --- |
| http.server.request.duration | histogram | route, method, status_code, cache_outcome |
| http.server.request.count | counter | same |
| http.server.errors | counter | route, error_code |

6.2 Domain-ish metrics (cross-tenant aggregate analytics)

| Metric | Type | Labels |
| --- | --- | --- |
| bff_consumer.search.executed.total | counter | currency, locale, bot_verdict |
| bff_consumer.handoff.minted.total | counter | tenantId, currency, locale |
| bff_consumer.handoff.consumed.total | counter | tenantId, outcome ∈ {ok, replay} |
| bff_consumer.handoff.replay.total | counter | tenantId |
| bff_consumer.bot.score.bucket.total | counter | bucket ∈ {0-30, 30-60, 60-85, 85-100} |
| bff_consumer.cache.hit_ratio | gauge (recording rule) | cache_kind ∈ {search, hotel, brand-peek, popularity, light-availability} |
| bff_consumer.session.active | gauge (recording rule) | n/a (estimated from Memorystore key scan; sampled) |
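
Two of these metrics involve a small derivation: the bucket label for a bot score and the hit ratio behind the cache gauge. The sketch below assumes each bucket includes its lower bound and excludes its upper bound, except the last, which includes 100. The catalogue does not say how boundary values are assigned, and the function names are hypothetical.

```typescript
// Illustrative helpers; boundary handling is an assumption.
type BotScoreBucket = "0-30" | "30-60" | "60-85" | "85-100";

function botScoreBucket(score: number): BotScoreBucket {
  if (!Number.isFinite(score) || score < 0 || score > 100) {
    throw new RangeError(`bot score out of range: ${score}`);
  }
  if (score < 30) return "0-30";
  if (score < 60) return "30-60";
  if (score < 85) return "60-85";
  return "85-100";
}

/** hit / (hit + miss); returns 0 when there were no lookups (assumption). */
function cacheHitRatio(hits: number, misses: number): number {
  const lookups = hits + misses;
  return lookups === 0 ? 0 : hits / lookups;
}
```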

6.3 Resource metrics

  • Cloud Run: container CPU, memory, instance count, concurrent requests, container startup latency.
  • Memorystore: ops/sec, evictions, used_memory, connected_clients.
  • Postgres: pgbouncer pool active/idle, query duration p99, deadlocks, autovacuum lag.
  • Pub/Sub publisher: queue depth, publish latency, ack errors.
  • Secret Manager: read latency, version-pin staleness.

7. Alerts (PagerDuty)

| Alert | Severity | Threshold | Action |
| --- | --- | --- | --- |
| /search 5xx burn 14.4× over 1 h | P1 | SLO burn | Page primary on-call |
| Memorystore connection failure ratio > 0.5% | P1 | 5 min sustained | Page primary; attempt failover |
| Postgres pgbouncer pool exhausted | P1 | 5 min | Page; raise pool size |
| HMAC verify failures > 0.5% of mints | P1 | 5 min | Page security on-call; check key rotation |
| Handoff replay rate > 1% | P2 | 15 min | Page security on-call |
| Bot-suspected ratio > 25% | P2 | 15 min | Notify SRE; consider campaign mode |
| Cache hit ratio < 65% | P3 | 30 min | Notify SRE; investigate stampede |
| Pub/Sub publish failure ratio > 1% | P2 | 10 min | Page; check IAM / quota |
| Container OOM kill | P2 | any | Page; investigate hot path |
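
Most of these alerts fire only when a condition holds for a sustained period. A minimal sketch of that check follows, assuming time-ordered samples and treating "sustained" as an unbroken run of breaching samples that is at least as long as the required duration. The names are hypothetical, and the real evaluation is done by the alerting backend.

```typescript
// Illustrative "above threshold for N minutes" check; not the backend's algorithm.
interface MetricSample {
  atMs: number; // sample timestamp in epoch milliseconds
  value: number;
}

function sustainedAbove(
  samples: MetricSample[], // assumed sorted by atMs ascending
  threshold: number,
  durationMs: number,
  nowMs: number,
): boolean {
  // Track where the current unbroken run of breaching samples started.
  let runStart: number | null = null;
  for (const sample of samples) {
    if (sample.value > threshold) {
      if (runStart === null) runStart = sample.atMs;
    } else {
      runStart = null;
    }
  }
  return runStart !== null && nowMs - runStart >= durationMs;
}
```

For example, the Memorystore alert would call this with a threshold of 0.005 and a duration of 5 minutes. A single sample at or below the threshold resets the run.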

8. Synthetic checks

  • Cloud Monitoring uptime check every 60 s from 5 regions:
    • GET /health/live → expects 200 with { "ok": true }.
    • GET /health/ready → expects 200 with all dependency probes green.
  • Synthetic search every 5 min from 3 regions: GET /search?city=kabul&checkIn=…&checkOut=…&adults=2 → asserts results.length > 0, p95 < 1.5 s.
  • Synthetic handoff every 15 min in stage: mint → consume cycle; asserts consumed=true and replay returns MELMASTOON.BFF.TENANT.HANDOFF_REPLAYED.
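
The synthetic search probe can be sketched as below. The fetcher is injected so the check runs without network access, and the caller supplies the full URL including query parameters. A single probe can only compare its own latency with the 1.5 s bound; the p95 would be computed across many probes. The type and function names are hypothetical.

```typescript
// Illustrative synthetic search probe; names and response shape are assumptions.
type Fetcher = (url: string) => Promise<{ status: number; json(): Promise<unknown> }>;

interface CheckResult {
  ok: boolean;
  latencyMs: number;
  reason?: string;
}

async function syntheticSearchCheck(
  url: string,
  fetcher: Fetcher,
  maxLatencyMs = 1500,
  now: () => number = Date.now,
): Promise<CheckResult> {
  const started = now();
  try {
    const response = await fetcher(url);
    const body = await response.json();
    const latencyMs = now() - started;
    if (response.status !== 200) {
      return { ok: false, latencyMs, reason: `unexpected status ${response.status}` };
    }
    // Assumes the response body carries a `results` array.
    const results =
      typeof body === "object" && body !== null ? (body as { results?: unknown }).results : undefined;
    if (!Array.isArray(results) || results.length === 0) {
      return { ok: false, latencyMs, reason: "empty results" };
    }
    if (latencyMs > maxLatencyMs) {
      return { ok: false, latencyMs, reason: `latency ${latencyMs} ms exceeds ${maxLatencyMs} ms` };
    }
    return { ok: true, latencyMs };
  } catch (err) {
    return { ok: false, latencyMs: now() - started, reason: `request failed: ${String(err)}` };
  }
}
```

In a runtime with a global fetch, the fetcher can be `(u) => fetch(u)`.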

9. Runbooks

All runbooks live under runbooks/bff-consumer/ and are cross-linked from SECURITY_MODEL §15 and FAILURE_MODES.

10. Capacity planning signals

Tracked weekly in the platform capacity review:

  • p99 latency trend.
  • Cache hit ratio trend.
  • Cold-start ratio (Cloud Run).
  • Memorystore memory utilisation (target < 70% sustained, to preserve headroom).
  • Pub/Sub publish backlog.

11. Audit + compliance feeds

The following feeds are wired automatically into the platform audit-service via Pub/Sub subscription:

  • melmastoon.bff.consumer.handoff.initiated.v1
  • melmastoon.bff.consumer.bot_suspected.v1
  • melmastoon.bff.consumer.session.started.v1 (sampled)

12. Data retention

| Data | Retention | Storage |
| --- | --- | --- |
| Cloud Logging structured logs | 30 d (info), 90 d (warn/error/audit) | Cloud Logging buckets |
| Cloud Trace spans | 30 d | Cloud Trace |
| Cloud Monitoring metrics | 6 weeks (default), 18 months (recording rules) | Managed Prometheus |
| Sentry | 90 d | Sentry SaaS |
| BigQuery telemetry sink | 13 months hot, 5 years archive | BigQuery + GCS |
| RUM beacons | 90 d | BigQuery |