OBSERVABILITY — bff-consumer-service

Sibling: APPLICATION_LOGIC · FAILURE_MODES · SECURITY_MODEL

Cross-cutting: 02 Enterprise Architecture · §11 Observability

1. Telemetry stack

Layer	Tool	Purpose
Traces	OpenTelemetry SDK → Cloud Trace	End-to-end request and upstream call traces
Metrics	OpenTelemetry → Cloud Monitoring (Managed Prometheus)	RED + USE + business metrics
Logs	Pino → Cloud Logging (structured JSON)	Request, error, audit
Errors	Sentry (NestJS integration)	Unhandled exceptions, regressions
RUM	Web Vitals + page-view beacons → BigQuery	Front-end UX feedback (collected via `/telemetry/page-view`)
Profiling	Cloud Profiler (continuous CPU + heap, 1% sample)	Hot-path optimisation

2. Service-level objectives

SLO	Target	Window	Alert at
Availability of `/search` (HTTP success rate)	99.9%	30 d rolling	Burn rate > 14.4× over 1 h or > 6× over 6 h
`/search` p95 latency	< 700 ms	30 d	p95 > 850 ms for 10 min
`/search` p99 latency	< 1500 ms	30 d	p99 > 2000 ms for 10 min
`/hotels/{id}` p95 latency	< 600 ms	30 d	p95 > 750 ms for 10 min
`/handoff` mint p95 latency	< 350 ms	30 d	p95 > 500 ms for 10 min
`/handoff` consume p95 latency	< 100 ms	30 d	p95 > 150 ms for 10 min
Cache hit ratio (search list, brand peek)	> 80%	30 d	< 65% for 30 min
Memorystore connection error rate	< 0.1%	30 d	> 0.5% for 5 min
HMAC verify failures (handoff)	< 0.05% of mints	30 d	> 0.5% for 5 min (potential key issue or attack)
Bot-suspected ratio	< 8% of `/search`	30 d	> 25% for 15 min (bot wave)
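
The burn-rate thresholds in the availability row follow from the error budget. With a 99.9% target the budget is 0.1% of requests over the 30-day (720 h) window, so a 14.4× burn sustained for 1 h consumes 2% of the budget and a 6× burn sustained for 6 h consumes 5%. The sketch below shows that arithmetic; the function names are illustrative and not taken from the service code.

```typescript
// Only the 99.9% target, the 30-day window and the 14.4x / 6x thresholds
// come from the SLO table; every name here is hypothetical.
const SLO_TARGET = 0.999;
const WINDOW_HOURS = 30 * 24; // 720 h rolling window
const ERROR_BUDGET = 1 - SLO_TARGET; // fraction of requests allowed to fail

// Burn rate: how many times faster than "exactly on budget" errors arrive.
function burnRate(errorRatio: number): number {
  return errorRatio / ERROR_BUDGET;
}

// Fraction of the whole 30-day budget used by a given burn rate held for `hours`.
function budgetConsumed(rate: number, hours: number): number {
  return (rate * hours) / WINDOW_HOURS;
}

// Alert condition from the table: > 14.4x over 1 h OR > 6x over 6 h.
function availabilityAlert(errorRatio1h: number, errorRatio6h: number): boolean {
  return burnRate(errorRatio1h) > 14.4 || burnRate(errorRatio6h) > 6;
}
```

For example, a 2% error ratio over the last hour is a burn rate of roughly 20× and would trip the 1 h condition on its own.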

3. Golden signals dashboard (Cloud Monitoring)

Per endpoint group (search, hotel-detail, handoff, wishlist, session, telemetry):

Rate — requests/sec (stacked by http.status_code).
Errors — 5xx + 4xx breakdown.
Duration — p50, p95, p99 latency histogram.
Saturation — CPU %, container memory, Memorystore connection pool utilization, Postgres pgbouncer connections.

Plus business panels:

Conversion funnel: search → click → handoff ratio (sampled and projected from telemetry events into BigQuery, surfaced via Looker Studio embed).
Cache hit ratio (Memorystore single-flight hit / (hit + miss)).
Currency / locale distribution.
Top 20 search queries (last 1 h, anonymised).

4. Trace propagation

Inbound: traceparent header (W3C Trace Context) is accepted; if it is absent, a new trace context is generated.
Outbound: propagate traceparent + tracestate to all upstream HTTP calls.
Memorystore + Postgres calls add db.system, db.statement (sanitised), db.cache.key.template attributes.
Single-flight wrapper adds cache.singleflight.outcome ∈ {hit, miss-leader, miss-follower}.
Pub/Sub publish adds messaging.system="pubsub", messaging.destination=<topic>, messaging.message.id.
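
The inbound and outbound rules above can be sketched as follows. In the service the OpenTelemetry SDK propagator would normally do this; the hand-rolled version below only illustrates the version-00 `traceparent` layout (`00-<32 hex trace id>-<16 hex parent id>-<2 hex flags>`). The function names, the use of `Math.random`, and the choice to mark generated contexts as sampled are assumptions for illustration.

```typescript
interface TraceContext {
  traceId: string; // 32 lowercase hex characters
  parentId: string; // 16 lowercase hex characters
  sampled: boolean; // bit 0 of the trace-flags byte
}

// Version 00 only; this sketch rejects other versions instead of forward-parsing them.
const TRACEPARENT_V00 = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(header: string | undefined): TraceContext | null {
  if (!header) return null;
  const m = TRACEPARENT_V00.exec(header.trim());
  if (!m) return null;
  const [, traceId, parentId, flags] = m;
  // All-zero IDs are invalid per W3C Trace Context.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  return { traceId, parentId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}

// Illustration only: Math.random is not a suitable ID source in production.
function randomHex(length: number): string {
  let out = "";
  for (let i = 0; i < length; i++) {
    out += Math.floor(Math.random() * 16).toString(16);
  }
  return /^0+$/.test(out) ? randomHex(length) : out;
}

// Inbound: accept a valid header, otherwise start a new trace.
function inboundContext(header: string | undefined): TraceContext {
  return (
    parseTraceparent(header) ?? {
      traceId: randomHex(32),
      parentId: randomHex(16),
      sampled: true, // assumption: the sampling decision is not specified here
    }
  );
}

// Outbound: same trace ID, the current span becomes the parent of the upstream call.
function outboundTraceparent(ctx: TraceContext, currentSpanId: string): string {
  return `00-${ctx.traceId}-${currentSpanId}-${ctx.sampled ? "01" : "00"}`;
}
```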

A canonical search trace looks like:

HTTP GET /bff/consumer/v1/search
├── interceptor.session-bootstrap (5 ms)
├── interceptor.bot-detection (3 ms)
├── interceptor.rate-limit (1 ms)
├── cache.singleflight (45 ms; outcome=miss-leader)
│   ├── http search-aggregation-service POST /search (210 ms)
│   ├── http pricing-service POST /quotes/preview ×10 (parallel, slowest 180 ms)
│   ├── http theme-config-service GET /brand-peek/batch (60 ms)
│   └── compose ListingCardVM ×20 (15 ms)
├── store.search-session (memorystore, 4 ms)
└── pubsub.publish search.executed.v1 (8 ms; non-blocking)
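
The `outcome=miss-leader` annotation in the trace above comes from the single-flight wrapper. A minimal sketch of how the three outcomes could be derived is shown below. It uses in-process maps in place of Memorystore and has no TTL or eviction; the class and method names are hypothetical.

```typescript
type SingleFlightOutcome = "hit" | "miss-leader" | "miss-follower";

class SingleFlightCache<V> {
  private readonly values = new Map<string, V>(); // stand-in for the real cache
  private readonly inFlight = new Map<string, Promise<V>>(); // loads in progress

  async get(
    key: string,
    load: () => Promise<V>,
  ): Promise<{ value: V; outcome: SingleFlightOutcome }> {
    // Cached value: no upstream call at all.
    if (this.values.has(key)) {
      return { value: this.values.get(key) as V, outcome: "hit" };
    }
    // Another request is already loading this key: wait for its result.
    const pending = this.inFlight.get(key);
    if (pending) {
      return { value: await pending, outcome: "miss-follower" };
    }
    // First request for this key: perform the load and publish the result.
    const promise = load()
      .then((value) => {
        this.values.set(key, value);
        return value;
      })
      .finally(() => this.inFlight.delete(key));
    this.inFlight.set(key, promise);
    return { value: await promise, outcome: "miss-leader" };
  }
}
```

The returned `outcome` is what would be written to the `cache.singleflight.outcome` span attribute; only the leader's span has the upstream HTTP calls as children.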

5. Logs

5.1 Structure (Pino → Cloud Logging)

Every log line is JSON with the following base fields:

Field	Source
`severity`	Pino → Cloud Logging mapping
`time`	ISO 8601
`requestId`	Generated per request
`traceId`	OTel trace ID
`spanId`	OTel span ID
`service`	`bff-consumer-service`
`version`	Git SHA
`env`	dev / stage / prod
`region`	Cloud Run region
`instance`	Cloud Run instance ID
`endpoint`	route key, e.g. `GET /search`
`httpStatus`	int
`latencyMs`	int
`guestSessionId`	when bootstrapped
`fingerprintHash`	always (hashed)
`ipHash`	always (hashed)
`cacheOutcome`	when applicable
`errorCode`	`MELMASTOON.…` when error
`errorClass`	exception name when error

5.2 Sampling

INFO: 10% of successful 2xx requests by default.
INFO: 100% for handoff mints, handoff consumes, session bootstrap (first request only), and wishlist mutations.
WARN: 100%.
ERROR: 100%.
AUDIT: 100% (handoff mint, bot suspected, ownership violation).
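
The sampling rules above reduce to a small decision function. The sketch below is one way to express them; the type names and event kinds are hypothetical, and keeping non-2xx INFO lines is an assumption, since the rules only state the 10% rate for successful 2xx requests.

```typescript
type LogLevel = "INFO" | "WARN" | "ERROR" | "AUDIT";
type EventKind =
  | "handoff-mint"
  | "handoff-consume"
  | "session-bootstrap-first"
  | "wishlist-mutation"
  | "other";

interface LogCandidate {
  level: LogLevel;
  httpStatus: number;
  kind: EventKind;
}

// INFO events that are always kept, per the second rule.
const ALWAYS_LOGGED = new Set<EventKind>([
  "handoff-mint",
  "handoff-consume",
  "session-bootstrap-first",
  "wishlist-mutation",
]);

const INFO_2XX_SAMPLE_RATE = 0.1;

// `random` is injectable so the decision can be tested deterministically.
function shouldLog(c: LogCandidate, random: () => number = Math.random): boolean {
  if (c.level !== "INFO") return true; // WARN, ERROR, AUDIT: 100%
  if (ALWAYS_LOGGED.has(c.kind)) return true; // INFO: 100% for listed events
  const is2xx = c.httpStatus >= 200 && c.httpStatus < 300;
  if (!is2xx) return true; // assumption: not covered by the stated rules
  return random() < INFO_2XX_SAMPLE_RATE; // INFO: 10% of successful requests
}
```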

5.3 Sensitive data

IP and UA are hashed with peppered SHA-256 at ingest; raw values never enter logs.
Search queries are stored verbatim only in trace span attributes (1% sampled). Logs include only the queryHash.
HMAC secrets, cookie values, and Authorization headers are scrubbed by a Pino redaction list.
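
As an illustration of the hashing and redaction rules above, the sketch below hashes an identifier with a pepper and declares a redaction list. "Peppered SHA-256" is interpreted here as HMAC-SHA-256 keyed with the pepper, which is an assumption; the actual construction, the pepper source, and the redaction paths are not specified in this document and the ones shown are hypothetical.

```typescript
import { createHmac } from "node:crypto";

// Assumption: the pepper is a secret loaded at start-up (for example from
// Secret Manager) and is never written to logs.
function pepperedHash(value: string, pepper: string): string {
  return createHmac("sha256", pepper).update(value).digest("hex");
}

// Build the hashed fields a log line may carry; raw values stay out of the record.
function hashedIdentity(rawIp: string, rawUserAgent: string, pepper: string) {
  return {
    ipHash: pepperedHash(rawIp, pepper),
    fingerprintHash: pepperedHash(`${rawIp}|${rawUserAgent}`, pepper), // hypothetical recipe
  };
}

// Shape accepted by Pino's `redact` option, e.g. pino({ redact: REDACT }).
// The paths are illustrative, not the service's real list.
const REDACT = {
  paths: ["req.headers.authorization", "req.headers.cookie", "*.hmacSecret"],
  censor: "[REDACTED]",
};
```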

6. Metrics catalogue

6.1 RED metrics (per endpoint)

Metric	Type	Labels
`http.server.request.duration`	histogram	`route`, `method`, `status_code`, `cache_outcome`
`http.server.request.count`	counter	same
`http.server.errors`	counter	`route`, `error_code`

6.2 Domain-ish metrics (cross-tenant aggregate analytics)

Metric	Type	Labels
`bff_consumer.search.executed.total`	counter	`currency`, `locale`, `bot_verdict`
`bff_consumer.handoff.minted.total`	counter	`tenantId`, `currency`, `locale`
`bff_consumer.handoff.consumed.total`	counter	`tenantId`, `outcome ∈ {ok, replay}`
`bff_consumer.handoff.replay.total`	counter	`tenantId`
`bff_consumer.bot.score.bucket.total`	counter	`bucket ∈ {0-30, 30-60, 60-85, 85-100}`
`bff_consumer.cache.hit_ratio`	gauge (recording rule)	`cache_kind ∈ {search, hotel, brand-peek, popularity, light-availability}`
`bff_consumer.session.active`	gauge (recording rule)	n/a (estimated from Memorystore key scan; sampled)
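
Two of the derived values in this table can be written down directly. The sketch below maps a bot score to its bucket label and computes the hit ratio as hit / (hit + miss). The bucket boundaries are treated as half-open intervals (a score of exactly 30 falls into `30-60`), which is an assumption because the table does not say which side a boundary belongs to; the function names are hypothetical.

```typescript
type BotBucket = "0-30" | "30-60" | "60-85" | "85-100";

// Assumption: lower bound inclusive, upper bound exclusive, except 100 itself.
function botScoreBucket(score: number): BotBucket {
  if (!Number.isFinite(score) || score < 0 || score > 100) {
    throw new RangeError(`bot score out of range: ${score}`);
  }
  if (score < 30) return "0-30";
  if (score < 60) return "30-60";
  if (score < 85) return "60-85";
  return "85-100";
}

// hit / (hit + miss); null when there was no traffic, so an idle window is
// not mistaken for a 0% hit ratio.
function cacheHitRatio(hits: number, misses: number): number | null {
  const total = hits + misses;
  return total === 0 ? null : hits / total;
}
```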

6.3 Resource metrics

Cloud Run: container CPU, memory, instance count, concurrent requests, container startup latency.
Memorystore: ops/sec, evictions, used_memory, connected_clients.
Postgres: pgbouncer pool active/idle, query duration p99, deadlocks, autovacuum lag.
Pub/Sub publisher: queue depth, publish latency, ack errors.
Secret Manager: read latency, version-pin staleness.

7. Alerts (PagerDuty)

Alert	Severity	Threshold	Action
`/search` 5xx burn 14.4× over 1 h	P1	SLO burn	Page primary on-call
Memorystore connection failure ratio > 0.5%	P1	5 min sustained	Page primary; attempt failover
Postgres pgbouncer pool exhausted	P1	5 min	Page; raise pool size
HMAC verify failures > 0.5% of mints	P1	5 min	Page security on-call; check key rotation
Handoff replay rate > 1%	P2	15 min	Page security on-call
Bot-suspected ratio > 25%	P2	15 min	Notify SRE; consider campaign mode
Cache hit ratio < 65%	P3	30 min	Notify SRE; investigate stampede
Pub/Sub publish failure ratio > 1%	P2	10 min	Page; check IAM / quota
Container OOM kill	P2	any	Page; investigate hot path

8. Synthetic checks

Cloud Monitoring uptime check every 60 s from 5 regions:
- GET /health/live → expects 200 with { "ok": true }.
- GET /health/ready → expects 200 with all dependency probes green.
Synthetic search every 5 min from 3 regions: GET /search?city=kabul&checkIn=…&checkOut=…&adults=2 → asserts results.length > 0, p95 < 1.5 s.
Synthetic handoff every 15 min in stage: mint → consume cycle; asserts consumed=true and replay returns MELMASTOON.BFF.TENANT.HANDOFF_REPLAYED.
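
The synthetic handoff check above could be structured as in the sketch below. The `HandoffClient` interface and its response shapes are hypothetical stand-ins for the real HTTP calls; only the assertions (consumed=true on first use, `MELMASTOON.BFF.TENANT.HANDOFF_REPLAYED` on replay) come from the description above.

```typescript
// Hypothetical client abstraction over the stage /handoff endpoints.
interface HandoffClient {
  mint(): Promise<{ token: string }>;
  consume(token: string): Promise<{ consumed: boolean; errorCode?: string }>;
}

const REPLAY_CODE = "MELMASTOON.BFF.TENANT.HANDOFF_REPLAYED";

// Runs one mint -> consume -> replay cycle and returns the list of failed
// assertions; an empty list means the check passed.
async function runSyntheticHandoff(client: HandoffClient): Promise<string[]> {
  const failures: string[] = [];
  const { token } = await client.mint();

  const first = await client.consume(token);
  if (!first.consumed) {
    failures.push("first consume did not report consumed=true");
  }

  const replay = await client.consume(token);
  if (replay.errorCode !== REPLAY_CODE) {
    failures.push(`replay returned ${replay.errorCode ?? "no error code"}`);
  }
  return failures;
}
```

Passing the client in keeps the check itself free of transport details, so the same function can be exercised against an in-memory fake before it is pointed at stage.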

9. Runbooks

Cross-linked from SECURITY_MODEL §15 and FAILURE_MODES. All runbooks live under runbooks/bff-consumer/.

10. Capacity planning signals

Tracked weekly in the platform capacity review:

p99 latency trend.
Cache hit ratio trend.
Cold-start ratio (Cloud Run).
Memorystore memory headroom (target < 70% sustained).
Pub/Sub publish backlog.

11. Audit + compliance feeds

The following feeds are wired automatically into the platform audit-service via Pub/Sub subscription:

melmastoon.bff.consumer.handoff.initiated.v1
melmastoon.bff.consumer.bot_suspected.v1
melmastoon.bff.consumer.session.started.v1 (sampled)

12. Data retention

Data	Retention	Storage
Cloud Logging structured logs	30 d (info), 90 d (warn/error/audit)	Cloud Logging buckets
Cloud Trace spans	30 d	Cloud Trace
Cloud Monitoring metrics	6 weeks (default), 18 months (recording rules)	Managed Prometheus
Sentry	90 d	Sentry SaaS
BigQuery telemetry sink	13 months hot, 5 years archive	BigQuery + GCS
RUM beacons	90 d	BigQuery
