search-aggregation-service — OBSERVABILITY

Companion: SERVICE_OVERVIEW · API_CONTRACTS · EVENT_SCHEMAS · FAILURE_MODES · DEPLOYMENT_TOPOLOGY

1. Telemetry stack

Concern	Tool	Notes
Distributed traces	OpenTelemetry SDK → Cloud Trace	OTLP exporter; W3C Trace Context propagated to Pub/Sub messages and OpenSearch (X-Opaque-Id)
Metrics	OpenTelemetry → Cloud Monitoring (Managed Prometheus)	RED + USE; histograms via OTel `View` config
Logs	Pino → Cloud Logging (structured JSON)	`severity` mapping; PII redaction middleware
Profiling	Cloud Profiler (Node.js)	Always-on in production
Error reporting	Cloud Error Reporting	via Pino `level: error` adapter
Dashboards & alerts	Cloud Monitoring + Grafana (read-only mirror)	Dashboards committed in `ops/dashboards/`
Synthetic monitoring	Cloud Monitoring uptime checks + Playwright synthetics from `bff-consumer-service`	Hits `/api/v1/search/queries` with golden query

2. Trace propagation

Every external request enters with a W3C traceparent header (Apigee injects one when it is missing). The trace propagates through:

bff-consumer-service → search-aggregation-service → Postgres / OpenSearch / Memorystore / Pub/Sub publish.

For consumed events, the publisher's trace context is unpacked from the Pub/Sub traceparent attribute and a new child span is started in the consumer.
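
As an illustration of the unpacking step, the sketch below parses a `traceparent` value from a Pub/Sub attribute map into the parent context a consumer span would be started under. The names `ParentContext` and `parseTraceparent` are hypothetical and not taken from the service code; in practice the OpenTelemetry propagator would normally do this work. The sketch assumes only version `00` of the header needs to be accepted.

```typescript
// Hypothetical helper for illustration; not the service's actual implementation.
interface ParentContext {
  traceId: string; // 32 lowercase hex characters
  spanId: string; // 16 lowercase hex characters (the publisher's span)
  sampled: boolean; // bit 0 of the trace-flags byte
}

// W3C Trace Context, version 00: 00-<trace-id>-<parent-id>-<trace-flags>
const TRACEPARENT_V00 = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(
  attributes: Record<string, string | undefined>,
): ParentContext | null {
  const raw = attributes["traceparent"];
  if (raw === undefined) return null;

  const match = TRACEPARENT_V00.exec(raw.trim());
  if (match === null) return null;

  const traceId = match[1] ?? "";
  const spanId = match[2] ?? "";
  const flags = match[3] ?? "00";

  // All-zero trace or span IDs are invalid per the specification.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;

  return { traceId, spanId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}
```

When the function returns `null`, a consumer would typically start a new root span rather than drop the message; that fallback is an assumption, not something this document specifies.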

Span naming

Span	Kind	Key attributes
`http.server POST /api/v1/search/queries`	server	`http.route`, `query.canonicalHash`, `query.locale`, `query.region`, `query.degradationLevel`
`http.server GET /api/v1/search/hotels/:id`	server	`http.route`, `propertyId`, `currency`
`app.command UpsertHotelIndexEntry`	internal	`propertyId`, `tenantId`, `vc.property_service`, `result`
`app.consumer melmastoon.property.published.v1`	consumer	`messaging.system=gcp_pubsub`, `messaging.message.id`, `propertyId`, `result`
`db.postgres SELECT search.hotel_index_entries`	client	`db.system=postgresql`, `db.statement.fingerprint`, `rows`
`opensearch.search melmastoon-search-current`	client	`opensearch.took_ms`, `opensearch.shards_failed`, `opensearch.timed_out`
`cache.get srh:q`	client	`cache.hit`, `cache.key.hash` (never raw key)
`pubsub.publish melmastoon.search.projection.updated.v1`	producer	`topic`, `messaging.message.id`, `ordering_key`
`ai.orchestrator parse-search`	client	`ai.cache_hit`, `ai.model`, `ai.prompt_version`, `ai.cost_micros`

db.statement.fingerprint strips literals (uses pg-query-emscripten normalizer) before tagging the span — no raw query text is exported.
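
To make the idea of literal stripping concrete, here is a deliberately simplified, regex-based stand-in. It is not the pg-query-emscripten normalizer and does not use its API; the function name `naiveFingerprint` is hypothetical. A real normalizer works on the parsed statement and handles cases this sketch ignores (dollar-quoted strings, comments, casts).

```typescript
// Illustrative only: a regex approximation of "strip literals before tagging".
// The production path described above uses a parser-based normalizer instead.
function naiveFingerprint(sql: string): string {
  return sql
    // Single-quoted string literals, including '' escapes, become "?".
    .replace(/'(?:[^']|'')*'/g, "?")
    // Standalone numeric literals become "?". The captured prefix keeps
    // bind parameters ($1) and identifiers ending in digits untouched.
    .replace(/(^|[^\w$.])\d+(?:\.\d+)?\b/g, "$1?")
    // Collapse whitespace so formatting differences do not change the result.
    .replace(/\s+/g, " ")
    .trim();
}

const example = naiveFingerprint(
  "SELECT * FROM search.hotel_index_entries WHERE property_id = 'abc-123' AND stars >= 3 LIMIT 20",
);
// example === "SELECT * FROM search.hotel_index_entries WHERE property_id = ? AND stars >= ? LIMIT ?"
```

Two statements that differ only in literal values map to the same fingerprint, which keeps span attribute cardinality low and keeps user-supplied values out of exported traces.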

3. SLIs

SLI	Definition	Source
Search latency	`http.server.duration{route="POST /api/v1/search/queries", status<500}`	OTel histogram
Hotel detail latency	`http.server.duration{route="GET /api/v1/search/hotels/:id", status<500}`	OTel histogram
Search availability	`1 - (5xx / total)` on the same route, 5-min window	OTel counter
Projection freshness	`now() - max(occurred_at)` over events processed in the last 1 min, per upstream topic	gauge `projection_freshness_seconds{topic}`
Index lag	`(rows in postgres with last_upserted_at > T) - (docs in opensearch with last_upserted_at > T)`	gauge `index_lag_docs`
Cache hit ratio	`cache_hits / (cache_hits + cache_misses)` per surface	counters
Allow-list strip rate	`projection_field_stripped_total / projection_events_total`	counters
AI fallback rate	`ai_search_intent_fallbacks_total / ai_search_intent_calls_total`	counters
DLQ rate	`pubsub_dlq_total / pubsub_messages_total` per subscription	counters

4. SLOs

SLO	Target	Window	Burn-rate alerts
Search latency p95 < 250 ms	99 %	30 d rolling	1 h burn ≥ 14× ⇒ page; 6 h burn ≥ 6× ⇒ ticket
Search availability ≥ 99.9 %	99.9 %	30 d rolling	as above
Hotel detail latency p95 < 200 ms	99 %	30 d rolling	as above
Projection freshness p95 < 30 s	99 %	7 d rolling	1 h burn ≥ 14× ⇒ page
Index lag < 1 000 docs	99.5 %	7 d rolling	sustained > 5 min ⇒ ticket; > 30 min ⇒ page
Allow-list strip rate = 0 in steady state	100 %	24 h	ANY occurrence ⇒ security page
DLQ rate < 0.1 %	99 %	1 h	sustained > 0.1 % over 15 min ⇒ page
Cache hit ratio ≥ 70 % (`/queries`)	n/a (efficiency)	24 h	< 50 % over 1 h ⇒ ticket

Burn rate is computed via Cloud Monitoring SLO services with multi-window alerts (1 h fast burn + 6 h slow burn).
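
For readers unfamiliar with the arithmetic, burn rate is the observed bad-event ratio divided by the error budget (1 minus the SLO target). The sketch below shows that calculation and the two thresholds from the table above. Cloud Monitoring performs the real computation; the function names `burnRate` and `evaluateBurn` are hypothetical.

```typescript
type AlertAction = "page" | "ticket" | "none";

// Burn rate = (bad / total) / (1 - target).
// A burn rate of 1 consumes the error budget exactly over the SLO window.
function burnRate(badEvents: number, totalEvents: number, sloTarget: number): number {
  if (totalEvents === 0) return 0; // no traffic, no budget consumed
  const errorBudget = 1 - sloTarget;
  return badEvents / totalEvents / errorBudget;
}

// Thresholds taken from the SLO table: 1 h burn >= 14x pages, 6 h burn >= 6x tickets.
function evaluateBurn(fastBurn1h: number, slowBurn6h: number): AlertAction {
  if (fastBurn1h >= 14) return "page";
  if (slowBurn6h >= 6) return "ticket";
  return "none";
}

// Example: availability SLO of 99.9 % with 2 % of requests failing in the last hour.
const fast = burnRate(20, 1000, 0.999); // roughly 20x
const action = evaluateBurn(fast, 3);
```

With a 99.9 % target the error budget is 0.1 %, so a 2 % failure ratio burns budget about 20 times faster than sustainable and crosses the page threshold.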

5. Metrics catalog

5.1 Counters

Metric	Labels	Increment
`search_queries_total`	`region`, `locale`, `degradation_level`	per executed query
`search_results_returned_total`	`region`	per query; incremented by the query's `result_count`
`clicks_recorded_total`	`region`	per click
`projection_events_total`	`topic`, `result`	per consumed event
`projection_field_stripped_total`	`field`	per stripped field
`projection_skipped_stale_total`	`topic`, `slice`	per stale-vector-clock skip
`pubsub_dlq_total`	`subscription`	per message routed to the DLQ
`idempotency_replay_total`	`route`	per replayed request
`ai_search_intent_calls_total`	`cache_hit`	per orchestrator call
`ai_search_intent_fallbacks_total`	`reason`	per fallback
`cache_hits_total` / `cache_misses_total`	`surface`	per cache op
`boost_rule_writes_total`	`result`	per boost-rule mutation
`index_build_phase_total`	`phase`, `result`	per phase transition
`tenant_purge_total`	`result`	per tenant purge

5.2 Histograms

Metric	Buckets (ms)	Notes
`http_server_duration_ms`	5,10,25,50,100,250,500,1000,2500,5000	by `route`, `status`
`db_query_duration_ms`	1,2,5,10,25,50,100,250,500,1000	by `op`, `table`
`opensearch_query_duration_ms`	5,10,25,50,100,250,500,1000,2500	by `op`
`cache_op_duration_ms`	0.5,1,2,5,10,25,50	by `op`, `surface`
`consumer_processing_duration_ms`	5,10,25,50,100,250,500,1000,2500,5000	by `topic`
`outbox_publish_lag_ms`	50,100,250,500,1000,2500,5000,10000	created→published
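
Because percentiles are derived from bucket counts rather than raw samples, bucket boundaries determine how precise a p95 estimate can be. The sketch below estimates a quantile by linear interpolation inside the bucket that contains the target rank, in the style of Prometheus `histogram_quantile`. It is an illustration of the technique, not the query the dashboards run; `estimateQuantile` and the sample counts are hypothetical.

```typescript
// Upper bounds (ms) copied from the `http_server_duration_ms` row above.
const HTTP_BUCKETS_MS = [5, 10, 25, 50, 100, 250, 500, 1000, 2500, 5000];

// `counts` holds per-bucket (non-cumulative) counts: one entry per bound,
// plus a final overflow entry for observations above the last bound.
function estimateQuantile(q: number, bounds: number[], counts: number[]): number {
  const total = counts.reduce((sum, c) => sum + c, 0);
  if (total === 0) return NaN;

  const rank = q * total;
  let cumulative = 0;
  for (let i = 0; i < bounds.length; i++) {
    const count = counts[i] ?? 0;
    if (count > 0 && cumulative + count >= rank) {
      const lower = i === 0 ? 0 : bounds[i - 1] ?? 0;
      const upper = bounds[i] ?? lower;
      // Assume observations are spread evenly inside the bucket.
      return lower + (upper - lower) * ((rank - cumulative) / count);
    }
    cumulative += count;
  }
  // The rank falls in the overflow bucket: report the highest finite bound.
  return bounds[bounds.length - 1] ?? NaN;
}

// Hypothetical sample: 50 requests in (50, 100], 45 in (100, 250], 5 in (250, 500].
const sampleCounts = [0, 0, 0, 0, 50, 45, 5, 0, 0, 0, 0];
const p95 = estimateQuantile(0.95, HTTP_BUCKETS_MS, sampleCounts);
const p50 = estimateQuantile(0.5, HTTP_BUCKETS_MS, sampleCounts);
```

One consequence worth noting: the 250 ms search latency threshold coincides with a bucket boundary, so compliance can be read directly from bucket counts. The 200 ms hotel detail threshold falls inside the 100 to 250 ms bucket, so any estimate near it is interpolated.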

5.3 Gauges

Metric	Labels	Source
`projection_freshness_seconds`	`topic`	scraped every 15 s
`index_lag_docs`	none	scraped every 60 s
`outbox_pending`	none	scraped every 30 s
`inbox_unprocessed`	none	scraped every 30 s
`redis_cache_keys`	`prefix`	scraped every 60 s
`opensearch_cluster_status`	`color`	from cluster health
`index_build_active`	`region`	scraped every 60 s

6. Logs

6.1 Format

{
  "ts": "2026-04-23T12:34:56.789Z",
  "severity": "INFO",
  "service": "search-aggregation-service",
  "version": "1.42.0",
  "env": "prod",
  "region": "europe-west1",
  "trace": "00-<trace-id>-<span-id>-01",
  "spanId": "<span>",
  "traceId": "<trace>",
  "tenantId": null,                      // null for cross-tenant reads
  "route": "POST /api/v1/search/queries",
  "msg": "search.executed",
  "ctx": {
    "queryHash": "sha256:…",
    "locale": "ps",
    "region": "AF",
    "resultCount": 42,
    "tookMs": 187,
    "degradationLevel": "none",
    "cacheHit": false
  }
}

6.2 Required fields

ts, severity, service, version, env, region, traceId, spanId, route, msg. Optional but standardized: tenantId, propertyId, eventId, topic, result, errorCode, durationMs.
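
A check like the following could back a unit test or lint step that keeps log entries conformant. It is a minimal sketch: `REQUIRED_LOG_FIELDS` mirrors the list above, while `missingRequiredFields` is a hypothetical helper name.

```typescript
// Required fields, copied from the list above.
const REQUIRED_LOG_FIELDS = [
  "ts",
  "severity",
  "service",
  "version",
  "env",
  "region",
  "traceId",
  "spanId",
  "route",
  "msg",
] as const;

// Returns the names of required fields that are absent, null, or empty.
function missingRequiredFields(entry: Record<string, unknown>): string[] {
  return REQUIRED_LOG_FIELDS.filter((field) => {
    const value = entry[field];
    return value === undefined || value === null || value === "";
  });
}

// Example: an entry that forgot `spanId`.
const incompleteEntry = {
  ts: "2026-04-23T12:34:56.789Z",
  severity: "INFO",
  service: "search-aggregation-service",
  version: "1.42.0",
  env: "prod",
  region: "europe-west1",
  traceId: "<trace>",
  route: "POST /api/v1/search/queries",
  msg: "search.executed",
};
const missingFields = missingRequiredFields(incompleteEntry);
```

Optional fields such as `tenantId` are intentionally not checked, since `null` is a valid value for cross-tenant reads.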

6.3 Log levels

Level	When
`DEBUG`	Local dev / temporarily enabled per-pod via `LOG_LEVEL` env override
`INFO`	Normal events: query executed, projection upsert applied, cache invalidate
`WARN`	Recoverable degradation: OpenSearch fallback, AI fallback, cache miss storm
`ERROR`	Unrecoverable per-request: validation failure (with code), 5xx
`CRITICAL`	Process-wide: allow-list breach, RLS sentinel mismatch, secret refresh failure
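
One way the `severity` mapping mentioned in the telemetry stack could look is sketched below. The mapping itself is an assumption: Pino's built-in level labels are `trace`, `debug`, `info`, `warn`, `error`, and `fatal`, and Cloud Logging's severity enum spells the warning level `WARNING`, which this sketch assumes corresponds to `WARN` in the table. The helper names are hypothetical.

```typescript
type CloudSeverity = "DEBUG" | "INFO" | "WARNING" | "ERROR" | "CRITICAL";

// Assumed mapping from Pino level labels to Cloud Logging severities.
const PINO_TO_SEVERITY: Record<string, CloudSeverity> = {
  trace: "DEBUG",
  debug: "DEBUG",
  info: "INFO",
  warn: "WARNING",
  error: "ERROR",
  fatal: "CRITICAL",
};

function toSeverity(pinoLabel: string): CloudSeverity {
  // Unknown or custom labels fall back to INFO rather than failing the log call.
  return PINO_TO_SEVERITY[pinoLabel] ?? "INFO";
}

// Shaped to match Pino's `formatters.level(label, number)` hook, so it could be
// passed as `pino({ formatters: { level: levelFormatter } })`.
function levelFormatter(label: string, _levelNumber: number): { severity: CloudSeverity } {
  return { severity: toSeverity(label) };
}
```

Emitting `severity` as a top-level key lets Cloud Logging index the level without a separate parsing step.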

6.4 PII redaction

Pino redact paths: ctx.text, ctx.userBucket, ctx.email, req.headers.authorization, req.headers.cookie. Redaction is enforced at appender time and again at the Cloud Logging sink filter (defense in depth).
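
The effect of these paths can be illustrated with a small standalone function. In the service, Pino's `redact` option (which accepts a `paths` array and a `censor` value) would do this; the sketch below only mimics the behavior for exact dotted paths so it can run without dependencies. `redactEntry` and `isRecord` are hypothetical names.

```typescript
// Paths copied from the list above.
const REDACT_PATHS = [
  "ctx.text",
  "ctx.userBucket",
  "ctx.email",
  "req.headers.authorization",
  "req.headers.cookie",
];

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null;
}

// Returns a deep copy of `entry` with every configured path replaced by `censor`.
// Paths that do not exist in the entry are ignored.
function redactEntry(
  entry: Record<string, unknown>,
  paths: string[] = REDACT_PATHS,
  censor: string = "[Redacted]",
): Record<string, unknown> {
  const copy = JSON.parse(JSON.stringify(entry)) as Record<string, unknown>;
  for (const path of paths) {
    const keys = path.split(".");
    let node: unknown = copy;
    // Walk down to the parent of the final key.
    for (let i = 0; i < keys.length - 1 && isRecord(node); i++) {
      node = node[keys[i] ?? ""];
    }
    const last = keys[keys.length - 1] ?? "";
    if (isRecord(node) && last in node) node[last] = censor;
  }
  return copy;
}
```

For example, an entry with `ctx.text` set to a raw search string and an `authorization` header comes back with both values replaced, while `ctx.locale` and other headers are untouched.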

7. Standard dashboards

Stored in ops/dashboards/search-aggregation-service.json (Grafana JSON). Tabs:

SLO overview — burn-rate, error budget remaining, all SLIs.
Search read path — RPS by route, latency p50/p95/p99, error rate, cache hit ratio.
Projection write path — events/sec by topic, processing duration, DLQ rate, vector-clock skip rate.
Freshness & lag — projection freshness gauge by topic, index lag, outbox pending.
OpenSearch health — cluster color, shard status, took_ms, query rejections.
Postgres health — connection pool, slow queries, RLS errors (should be zero), partitioned table sizes.
Cache — Memorystore hit ratio, evictions, memory usage, key count by prefix.
AI — orchestrator call rate, cache hit, fallback rate, cost micros sum.
Security & compliance — allow-list strip rate, auth failures, idempotency replays, tenant purges.
Index builds — active builds, phase durations, swap events.

8. Alerts (Cloud Monitoring)

Alert	Condition	Severity	Runbook
`search-latency-fast-burn`	1 h burn ≥ 14× against the search latency SLO (99 %, p95 < 250 ms)	P1 page	FAILURE_MODES § Latency
`search-availability-fast-burn`	1 h burn ≥ 14× against the 99.9 % availability SLO	P1 page	FAILURE_MODES § 5xx
`freshness-fast-burn`	`projection_freshness_seconds{topic=*}` p95 > 60 for 5 min	P2 page	FAILURE_MODES § Stale projection
`dlq-spike`	`pubsub_dlq_total / pubsub_messages_total` > 0.1 % sustained over 15 min	P2 page	FAILURE_MODES § DLQ
`allow-list-breach`	`projection_field_stripped_total` ≥ 1 in 5 min	P1 security page	SECURITY_MODEL § Allow-list
`opensearch-yellow`	cluster status `yellow` for 10 min	P3 ticket	FAILURE_MODES § OpenSearch
`opensearch-red`	cluster status `red` for 1 min	P1 page	FAILURE_MODES § OpenSearch
`outbox-backlog`	`outbox_pending > 5 000` for 5 min	P2 page	FAILURE_MODES § Outbox
`index-lag-high`	`index_lag_docs > 5 000` for 5 min	P2 page	runbook `index-lag-reconcile.md`
`ai-fallback-elevated`	fallback rate > 20 % for 15 min	P3 ticket	AI_INTEGRATION § Failure modes
`cost-budget-soft`	monthly Cloud Run CPU and memory spend > 80 % of budget	P3 ticket	platform finance
`cache-hit-low`	cache hit ratio for `/queries` < 50 % for 1 h	P3 ticket	platform owner

9. Runbooks

Live in ops/runbooks/search-aggregation-service/:

search-latency-degraded.md
freshness-stale.md
dlq-drain.md
opensearch-degraded.md
outbox-stuck.md
index-lag-reconcile.md
index-rebuild.md
tenant-purge.md
allow-list-breach-ir.md (incident response)
ai-orchestrator-degraded.md

Each runbook follows the standard sections: Symptom, Detection, Triage, Mitigate, Recover, Postmortem.

10. Synthetic checks

Check	Frequency	Region	Expectation
`search-golden-query-kabul-3star`	60 s	EU + ASIA	200, results > 5, p95 < 400 ms
`hotel-detail-pinned-property`	60 s	EU	200, includes hero photo
`suggest-prefix-kab`	60 s	EU	200, suggestions > 0
`healthz`	30 s	EU + ASIA	200
`readyz`	30 s	EU + ASIA	200

A synthetic check pages after three consecutive failures.
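
The consecutive-failure rule can be sketched as a small state machine: any success resets the streak, and the page fires once when the streak reaches the threshold. Cloud Monitoring uptime checks handle this natively; the class below is a hypothetical illustration of the rule, not the service's code.

```typescript
class ConsecutiveFailureTracker {
  private streak = 0;
  private readonly threshold: number;

  constructor(threshold: number = 3) {
    this.threshold = threshold;
  }

  // Records one check result. Returns true exactly when the failure streak
  // reaches the threshold, so a long outage pages once rather than every run.
  record(passed: boolean): boolean {
    this.streak = passed ? 0 : this.streak + 1;
    return this.streak === this.threshold;
  }
}

// Example: two failures, a success, then three failures in a row.
const tracker = new ConsecutiveFailureTracker();
const pages = [false, false, true, false, false, false].map((ok) => tracker.record(ok));
// Only the final result is true: the success in the middle reset the streak.
```

At a 60 s check frequency this rule means a page fires roughly three minutes into a sustained failure, and at 30 s roughly ninety seconds in.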

11. Audit feed

Per SECURITY_MODEL § Auditability, the topics melmastoon.search.boost_rule.v1 and melmastoon.search.index.v1 are mirrored to audit-service BigQuery dataset audit.search with retention controls. A daily Looker dashboard summarizes admin actions for compliance review.
