search-aggregation-service — OBSERVABILITY

Companion: SERVICE_OVERVIEW · API_CONTRACTS · EVENT_SCHEMAS · FAILURE_MODES · DEPLOYMENT_TOPOLOGY

1. Telemetry stack

| Concern | Tool | Notes |
|---|---|---|
| Distributed traces | OpenTelemetry SDK → Cloud Trace | OTLP exporter; W3C Trace Context propagated to Pub/Sub messages and OpenSearch (X-Opaque-Id) |
| Metrics | OpenTelemetry → Cloud Monitoring (Managed Prometheus) | RED + USE; histograms via OTel View config |
| Logs | Pino → Cloud Logging (structured JSON) | severity mapping; PII redaction middleware |
| Profiling | Cloud Profiler (Node.js) | Always-on in production |
| Error reporting | Cloud Error Reporting | via Pino level: error adapter |
| Dashboards & alerts | Cloud Monitoring + Grafana (read-only mirror) | Dashboards committed in ops/dashboards/ |
| Synthetic monitoring | Cloud Monitoring uptime checks + Playwright synthetics from bff-consumer-service | Hits /api/v1/search/queries with golden query |

2. Trace propagation

Every external request enters with a W3C traceparent (Apigee injects when missing). The trace propagates through:

bff-consumer-service → search-aggregation-service → Postgres / OpenSearch / Memorystore / Pub/Sub publish.

For consumed events, the publisher's trace context is unpacked from the Pub/Sub traceparent attribute and a new child span is started in the consumer.
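The consumer-side step can be sketched as follows. In the service this work would normally be done by the OpenTelemetry W3C propagator; the sketch only shows the mechanics of reading the traceparent attribute and deriving a child context. The names TraceContext, parseTraceparent, and childTraceparent are illustrative and do not come from the codebase.

```typescript
// Minimal W3C traceparent handling for a consumed Pub/Sub message.
// All names here are hypothetical; the real service may rely on the OTel propagator.

interface TraceContext {
  version: string;  // "00" in the current spec
  traceId: string;  // 32 lowercase hex chars
  parentId: string; // 16 lowercase hex chars (span id of the publisher)
  flags: string;    // 2 hex chars; "01" means sampled
}

const TRACEPARENT_RE = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(attribute: string | undefined): TraceContext | null {
  if (!attribute) return null;
  const m = TRACEPARENT_RE.exec(attribute.trim());
  if (!m) return null;
  const version = m[1] as string;
  const traceId = m[2] as string;
  const parentId = m[3] as string;
  const flags = m[4] as string;
  // The spec forbids version "ff" and all-zero trace or span ids.
  if (version === "ff" || /^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  return { version, traceId, parentId, flags };
}

// Random hex string, used here only to mint a span id for the child span.
function randomHex(chars: number): string {
  let out = "";
  for (let i = 0; i < chars; i++) out += Math.floor(Math.random() * 16).toString(16);
  return out;
}

// The child span keeps the trace id and flags and gets a fresh span id.
function childTraceparent(parent: TraceContext, newSpanId: string = randomHex(16)): string {
  return `${parent.version}-${parent.traceId}-${newSpanId}-${parent.flags}`;
}
```

A message without a valid traceparent attribute yields null, in which case the consumer would presumably start a new root span.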

Span naming

| Span | Kind | Key attributes |
|---|---|---|
| http.server POST /api/v1/search/queries | server | http.route, query.canonicalHash, query.locale, query.region, query.degradationLevel |
| http.server GET /api/v1/search/hotels/:id | server | http.route, propertyId, currency |
| app.command UpsertHotelIndexEntry | internal | propertyId, tenantId, vc.property_service, result |
| app.consumer melmastoon.property.published.v1 | consumer | messaging.system=gcp_pubsub, messaging.message.id, propertyId, result |
| db.postgres SELECT search.hotel_index_entries | client | db.system=postgresql, db.statement.fingerprint, rows |
| opensearch.search melmastoon-search-current | client | opensearch.took_ms, opensearch.shards_failed, opensearch.timed_out |
| cache.get srh:q | client | cache.hit, cache.key.hash (never raw key) |
| pubsub.publish melmastoon.search.projection.updated.v1 | producer | topic, messaging.message.id, ordering_key |
| ai.orchestrator parse-search | client | ai.cache_hit, ai.model, ai.prompt_version, ai.cost_micros |

db.statement.fingerprint strips literals (uses pg-query-emscripten normalizer) before tagging the span — no raw query text is exported.
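To illustrate what literal stripping means, here is a deliberately simplified, regex-based stand-in. It is not the pg-query-emscripten normalizer, which works on the parsed statement; fingerprintSql is a hypothetical name and the `?` placeholder is an assumption.

```typescript
// Simplified illustration of literal stripping; NOT the parser-based normalizer
// the service uses. fingerprintSql is a hypothetical helper.
function fingerprintSql(sql: string): string {
  return sql
    // Single-quoted string literals, including '' escapes.
    .replace(/'(?:[^']|'')*'/g, "?")
    // Standalone numeric literals. The leading-character guard leaves digits
    // inside identifiers and $1-style bind parameters untouched.
    .replace(/(^|[^\w$])\d+(?:\.\d+)?\b/g, "$1?")
    // Collapse whitespace so formatting differences do not change the result.
    .replace(/\s+/g, " ")
    .trim();
}

const example = fingerprintSql(
  "SELECT *\n  FROM search.hotel_index_entries WHERE property_id = 'p-123' AND stars >= 3 AND tenant_id = $1"
);
// example === "SELECT * FROM search.hotel_index_entries WHERE property_id = ? AND stars >= ? AND tenant_id = $1"
```

Two statements that differ only in their literal values produce the same fingerprint, which is what makes the attribute safe to export and useful for grouping.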

3. SLIs

| SLI | Definition | Source |
|---|---|---|
| Search latency | http.server.duration{route="POST /api/v1/search/queries", status<500} | OTel histogram |
| Hotel detail latency | http.server.duration{route="GET /api/v1/search/hotels/:id", status<500} | OTel histogram |
| Search availability | 1 - (5xx / total) on the same route, 5-min window | OTel counter |
| Projection freshness | now() - max(occurred_at) per upstream topic processed in the last 1 min | gauge projection_freshness_seconds{topic} |
| Index lag | (rows in postgres with last_upserted_at > T) - (docs in opensearch with last_upserted_at > T) | gauge index_lag_docs |
| Cache hit ratio | cache_hits / (cache_hits + cache_misses) per surface | counters |
| Allow-list strip rate | projection_field_stripped_total / projection_events_total | counters |
| AI fallback rate | ai_search_intent_fallbacks_total / ai_search_intent_calls_total | counters |
| DLQ rate | pubsub_dlq_total / pubsub_messages_total per subscription | counters |
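The ratio-style SLIs above share one shape: a counter divided by a total, with an undefined result when there is no traffic. A minimal sketch of three of the definitions, assuming counter deltas for the evaluation window have already been fetched (all function names are hypothetical):

```typescript
// Returns null when there is no traffic, so "no data" is not reported as 0 % or 100 %.
function ratio(numerator: number, denominator: number): number | null {
  return denominator > 0 ? numerator / denominator : null;
}

// Search availability: 1 - (5xx / total) over the window.
function searchAvailability(totalRequests: number, serverErrors: number): number | null {
  const errorRatio = ratio(serverErrors, totalRequests);
  return errorRatio === null ? null : 1 - errorRatio;
}

// Cache hit ratio: cache_hits / (cache_hits + cache_misses) for one surface.
function cacheHitRatio(hits: number, misses: number): number | null {
  return ratio(hits, hits + misses);
}

// Projection freshness: now() - max(occurred_at) for one topic, in seconds.
// occurredAtMs holds the occurred_at timestamps of recently processed events.
function projectionFreshnessSeconds(nowMs: number, occurredAtMs: number[]): number | null {
  if (occurredAtMs.length === 0) return null;
  return (nowMs - Math.max(...occurredAtMs)) / 1000;
}
```

Returning null rather than 0 for an empty window is a design choice of this sketch; the service's own handling of empty windows is not specified here.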

4. SLOs

| SLO | Target | Window | Burn-rate alerts |
|---|---|---|---|
| Search latency p95 < 250 ms | 99 % | 30 d rolling | 1 h burn ≥ 14× ⇒ page; 6 h burn ≥ 6× ⇒ ticket |
| Search availability ≥ 99.9 % | 99.9 % | 30 d rolling | as above |
| Hotel detail latency p95 < 200 ms | 99 % | 30 d rolling | as above |
| Projection freshness p95 < 30 s | 99 % | 7 d rolling | 1 h burn ≥ 14× ⇒ page |
| Index lag < 1 000 docs | 99.5 % | 7 d rolling | sustained > 5 min ⇒ ticket; > 30 min ⇒ page |
| Allow-list strip rate = 0 in steady state | 100 % | 24 h | ANY occurrence ⇒ security page |
| DLQ rate < 0.1 % | 99 % | 1 h | sustained > 0.1 % over 15 min ⇒ page |
| Cache hit ratio ≥ 70 % (/queries) | n/a (efficiency) | 24 h | < 50 % over 1 h ⇒ ticket |

Burn-rate is computed via Cloud Monitoring SLO services with multi-window alerts (1 h fast burn + 6 h slow burn).
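Burn rate is the observed error ratio divided by the error budget (1 minus the SLO target). A small sketch of the arithmetic behind the 14× and 6× thresholds, assuming the error ratios for the 1 h and 6 h windows are already known; the function names are illustrative, and Cloud Monitoring performs the real evaluation:

```typescript
type BurnAction = "page" | "ticket" | "none";

// errorRatio: share of bad events in the window (5xx responses for availability,
// requests over the latency threshold for latency SLOs).
// sloTarget: for example 0.999 for a 99.9 % SLO.
function burnRate(errorRatio: number, sloTarget: number): number {
  const budget = 1 - sloTarget;
  if (budget <= 0) throw new RangeError("sloTarget must be below 1");
  return errorRatio / budget;
}

// Fast burn (1 h window, >= 14x) pages; slow burn (6 h window, >= 6x) opens a ticket.
function evaluateBurn(errorRatio1h: number, errorRatio6h: number, sloTarget: number): BurnAction {
  if (burnRate(errorRatio1h, sloTarget) >= 14) return "page";
  if (burnRate(errorRatio6h, sloTarget) >= 6) return "ticket";
  return "none";
}

// Hours until the whole budget is gone if the current burn rate persists.
function hoursToExhaustBudget(windowDays: number, burn: number): number {
  return (windowDays * 24) / burn;
}
```

For the 99.9 % availability SLO, a 2 % error ratio over the last hour is a burn rate of about 20× and pages. At a steady 14× burn, a 30-day budget lasts 720 h / 14, roughly 51 hours.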

5. Metrics catalog

5.1 Counters

| Metric | Labels | Increment |
|---|---|---|
| search_queries_total | region, locale, degradation_level | per executed query |
| search_results_returned_total | region | per query, by result_count (sum) |
| clicks_recorded_total | region | per click |
| projection_events_total | topic, result | per consumed event |
| projection_field_stripped_total | field | per stripped field |
| projection_skipped_stale_total | topic, slice | per stale-vector-clock skip |
| pubsub_dlq_total | subscription | per message routed to the DLQ |
| idempotency_replay_total | route | per replayed request |
| ai_search_intent_calls_total | cache_hit | per orchestrator call |
| ai_search_intent_fallbacks_total | reason | per fallback |
| cache_hits_total / cache_misses_total | surface | per cache op |
| boost_rule_writes_total | result | per boost-rule mutation |
| index_build_phase_total | phase, result | per phase transition |
| tenant_purge_total | result | per tenant purge |

5.2 Histograms

| Metric | Buckets (ms) | Notes |
|---|---|---|
| http_server_duration_ms | 5,10,25,50,100,250,500,1000,2500,5000 | by route, status |
| db_query_duration_ms | 1,2,5,10,25,50,100,250,500,1000 | by op, table |
| opensearch_query_duration_ms | 5,10,25,50,100,250,500,1000,2500 | by op |
| cache_op_duration_ms | 0.5,1,2,5,10,25,50 | by op, surface |
| consumer_processing_duration_ms | 5,10,25,50,100,250,500,1000,2500,5000 | by topic |
| outbox_publish_lag_ms | 50,100,250,500,1000,2500,5000,10000 | created→published |
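Percentiles such as p95 are estimated from these bucket counts rather than from raw samples. The sketch below shows one common estimation approach, linear interpolation within the bucket that contains the target rank; it is an assumption that the monitoring backend behaves similarly, and bucketize and estimateQuantile are hypothetical helpers.

```typescript
// Upper bounds of http_server_duration_ms from the table above.
const HTTP_BUCKETS_MS = [5, 10, 25, 50, 100, 250, 500, 1000, 2500, 5000];

// Per-bucket counts; the final slot is the overflow (+Inf) bucket.
function bucketize(samplesMs: number[], bounds: number[]): number[] {
  const counts = new Array<number>(bounds.length + 1).fill(0);
  for (const s of samplesMs) {
    let i = bounds.findIndex((b) => s <= b);
    if (i === -1) i = bounds.length;
    counts[i] += 1;
  }
  return counts;
}

// Estimates quantile q (0..1) by interpolating inside the bucket holding the target rank.
function estimateQuantile(q: number, counts: number[], bounds: number[]): number {
  const total = counts.reduce((a, b) => a + b, 0);
  if (total === 0) return NaN;
  const rank = q * total;
  let cumulative = 0;
  for (let i = 0; i < counts.length; i++) {
    const next = cumulative + counts[i];
    if (next >= rank && counts[i] > 0) {
      // Overflow bucket has no upper bound; report the highest finite bound.
      if (i === bounds.length) return bounds[bounds.length - 1];
      const lower = i === 0 ? 0 : bounds[i - 1];
      const upper = bounds[i];
      return lower + (upper - lower) * ((rank - cumulative) / counts[i]);
    }
    cumulative = next;
  }
  return bounds[bounds.length - 1];
}
```

With 90 samples at 40 ms and 10 at 200 ms, the p95 estimate is 175 ms even though every slow sample was 200 ms, which shows the resolution limit of bucketed data. Because 250 ms is a bucket bound, the share of requests at or under the search latency threshold can be read exactly; 200 ms is not a bound in http_server_duration_ms, so the hotel detail threshold can only be interpolated.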

5.3 Gauges

| Metric | Labels | Source |
|---|---|---|
| projection_freshness_seconds | topic | scraped every 15 s |
| index_lag_docs | none | scraped every 60 s |
| outbox_pending | none | scraped every 30 s |
| inbox_unprocessed | none | scraped every 30 s |
| redis_cache_keys | prefix | scraped every 60 s |
| opensearch_cluster_status | color | from cluster health |
| index_build_active | region | scraped every 60 s |

6. Logs

6.1 Format

{
  "ts": "2026-04-23T12:34:56.789Z",
  "severity": "INFO",
  "service": "search-aggregation-service",
  "version": "1.42.0",
  "env": "prod",
  "region": "europe-west1",
  "trace": "00-<trace-id>-<span-id>-01",
  "spanId": "<span>",
  "traceId": "<trace>",
  "tenantId": null, // null for cross-tenant reads
  "route": "POST /api/v1/search/queries",
  "msg": "search.executed",
  "ctx": {
    "queryHash": "sha256:…",
    "locale": "ps",
    "region": "AF",
    "resultCount": 42,
    "tookMs": 187,
    "degradationLevel": "none",
    "cacheHit": false
  }
}

6.2 Required fields

ts, severity, service, version, env, region, traceId, spanId, route, msg. Optional but standardized: tenantId, propertyId, eventId, topic, result, errorCode, durationMs.
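A check like the following could enforce the required-field list in a unit test or a log-pipeline lint. It is a sketch only; missingRequiredFields is a hypothetical helper, and treating null or empty strings as missing is an assumption.

```typescript
// Required fields from section 6.2.
const REQUIRED_LOG_FIELDS = [
  "ts", "severity", "service", "version", "env",
  "region", "traceId", "spanId", "route", "msg",
] as const;

// Returns the names of required fields that are absent, null, or empty.
function missingRequiredFields(entry: Record<string, unknown>): string[] {
  return REQUIRED_LOG_FIELDS.filter((field) => {
    const value = entry[field];
    return value === undefined || value === null || value === "";
  });
}
```

An entry that passes returns an empty array; optional fields such as tenantId may be null without being reported.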

6.3 Log levels

| Level | When |
|---|---|
| DEBUG | Local dev / temporarily enabled per-pod via LOG_LEVEL env override |
| INFO | Normal events: query executed, projection upsert applied, cache invalidate |
| WARN | Recoverable degradation: OpenSearch fallback, AI fallback, cache miss storm |
| ERROR | Unrecoverable per-request: validation failure (with code), 5xx |
| CRITICAL | Process-wide: allow-list breach, RLS sentinel mismatch, secret refresh failure |
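The severity mapping from Pino level labels to the severity field might look like the sketch below. The mapping itself is an assumption (in particular trace → DEBUG and fatal → CRITICAL), and the function is written standalone; in Pino it would typically be wired in through the formatters.level option. Cloud Logging's severity enum spells the warning level WARNING, which is what the sketch emits for the WARN row above.

```typescript
type PinoLevel = "trace" | "debug" | "info" | "warn" | "error" | "fatal";
type Severity = "DEBUG" | "INFO" | "WARNING" | "ERROR" | "CRITICAL";

// Assumed mapping; not confirmed by the service's configuration.
const SEVERITY_BY_LEVEL: Record<PinoLevel, Severity> = {
  trace: "DEBUG",
  debug: "DEBUG",
  info: "INFO",
  warn: "WARNING",
  error: "ERROR",
  fatal: "CRITICAL",
};

function isPinoLevel(label: string): label is PinoLevel {
  return Object.prototype.hasOwnProperty.call(SEVERITY_BY_LEVEL, label);
}

// Shaped like a level formatter: receives the level label and returns the
// fields to merge into the log line. Unknown labels fall back to INFO.
function severityForLevel(label: string): { severity: Severity } {
  return { severity: isPinoLevel(label) ? SEVERITY_BY_LEVEL[label] : "INFO" };
}
```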

6.4 PII redaction

Pino redact paths: ctx.text, ctx.userBucket, ctx.email, req.headers.authorization, req.headers.cookie. Redaction is enforced in the service at log-write time and again at the Cloud Logging sink filter (defense in depth).
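The effect of path-based redaction can be illustrated with a standalone function. This is a stand-in written for clarity, not Pino's implementation; in the service the same paths would be passed to Pino's redact option. The censor string and the helper names are assumptions.

```typescript
// Paths from section 6.4.
const REDACT_PATHS = [
  "ctx.text", "ctx.userBucket", "ctx.email",
  "req.headers.authorization", "req.headers.cookie",
];

type JsonObject = { [key: string]: unknown };

function isObject(value: unknown): value is JsonObject {
  return typeof value === "object" && value !== null;
}

// Returns a redacted deep copy; the input record is left untouched.
// Paths that do not exist in the record are skipped, not created.
function redactEntry(
  entry: JsonObject,
  paths: string[] = REDACT_PATHS,
  censor: string = "[REDACTED]", // hypothetical censor value
): JsonObject {
  const copy = JSON.parse(JSON.stringify(entry)) as JsonObject;
  for (const path of paths) {
    const keys = path.split(".");
    const leaf = keys[keys.length - 1] as string;
    let node: unknown = copy;
    // Walk to the parent of the leaf key, stopping if the path breaks off.
    for (let i = 0; i < keys.length - 1 && isObject(node); i++) {
      node = node[keys[i] as string];
    }
    if (isObject(node) && leaf in node) node[leaf] = censor;
  }
  return copy;
}
```

Fields outside the listed paths, such as ctx.locale, pass through unchanged.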

7. Standard dashboards

Stored in ops/dashboards/search-aggregation-service.json (Grafana JSON). Tabs:

  1. SLO overview — burn-rate, error budget remaining, all SLIs.
  2. Search read path — RPS by route, latency p50/p95/p99, error rate, cache hit ratio.
  3. Projection write path — events/sec by topic, processing duration, DLQ rate, vector-clock skip rate.
  4. Freshness & lag — projection freshness gauge by topic, index lag, outbox pending.
  5. OpenSearch health — cluster color, shard status, took_ms, query rejections.
  6. Postgres health — connection pool, slow queries, RLS errors (should be zero), partitioned table sizes.
  7. Cache — Memorystore hit ratio, evictions, memory usage, key count by prefix.
  8. AI — orchestrator call rate, cache hit, fallback rate, cost micros sum.
  9. Security & compliance — allow-list strip rate, auth failures, idempotency replays, tenant purges.
  10. Index builds — active builds, phase durations, swap events.

8. Alerts (Cloud Monitoring)

| Alert | Condition | Severity | Runbook |
|---|---|---|---|
| search-latency-fast-burn | 1 h burn ≥ 14× over SLO 99 % p95 250 ms | P1 page | FAILURE_MODES § Latency |
| search-availability-fast-burn | 1 h burn ≥ 14× over 99.9 % | P1 page | FAILURE_MODES § 5xx |
| freshness-fast-burn | projection_freshness_seconds{topic=*} p95 > 60 for 5 min | P2 page | FAILURE_MODES § Stale projection |
| dlq-spike | pubsub_dlq_total rate > 0.1% over 15 min | P2 page | FAILURE_MODES § DLQ |
| allow-list-breach | projection_field_stripped_total ≥ 1 in 5 min | P1 security page | SECURITY_MODEL § Allow-list |
| opensearch-yellow | cluster status yellow for 10 min | P3 ticket | FAILURE_MODES § OpenSearch |
| opensearch-red | cluster status red for 1 min | P1 page | FAILURE_MODES § OpenSearch |
| outbox-backlog | outbox_pending > 5 000 for 5 min | P2 page | FAILURE_MODES § Outbox |
| index-lag-high | index_lag_docs > 5 000 for 5 min | P2 page | runbook Reconcile index lag |
| ai-fallback-elevated | fallback rate > 20 % for 15 min | P3 ticket | AI_INTEGRATION § Failure modes |
| cost-budget-soft | monthly Cloud Run CPU * memory > 80 % budget | P3 ticket | platform finance |
| cache-hit-low | cache hit ratio for /queries < 50 % for 1 h | P3 ticket | platform owner |
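Several conditions above have the form "metric above threshold for N minutes". A sketch of that evaluation over scraped gauge samples, assuming the condition requires every sample in the trailing window to exceed the threshold and the data to cover the whole window; Cloud Monitoring's actual alignment rules may differ, and sustainedAbove is a hypothetical helper.

```typescript
interface GaugeSample {
  tsMs: number;  // scrape timestamp in epoch milliseconds
  value: number;
}

// True when all samples in [nowMs - windowMs, nowMs] exceed the threshold and
// the oldest of them lies within one scrape interval of the window start.
function sustainedAbove(
  samples: GaugeSample[],
  threshold: number,
  windowMs: number,
  nowMs: number,
  scrapeIntervalMs: number,
): boolean {
  const start = nowMs - windowMs;
  const inWindow = samples.filter((s) => s.tsMs >= start && s.tsMs <= nowMs);
  if (inWindow.length === 0) return false;
  const oldest = Math.min(...inWindow.map((s) => s.tsMs));
  // Without this check, two minutes of data would satisfy a five-minute condition.
  if (oldest > start + scrapeIntervalMs) return false;
  return inWindow.every((s) => s.value > threshold);
}
```

For outbox-backlog this would be called with a threshold of 5 000, a 5-minute window, and the 30 s scrape interval of outbox_pending.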

9. Runbooks

Live in ops/runbooks/search-aggregation-service/:

  • search-latency-degraded.md
  • freshness-stale.md
  • dlq-drain.md
  • opensearch-degraded.md
  • outbox-stuck.md
  • index-lag-reconcile.md
  • index-rebuild.md
  • tenant-purge.md
  • allow-list-breach-ir.md (incident response)
  • ai-orchestrator-degraded.md

Each runbook follows the standard sections: Symptom, Detection, Triage, Mitigate, Recover, Postmortem.

10. Synthetic checks

| Check | Frequency | Region | Expectation |
|---|---|---|---|
| search-golden-query-kabul-3star | 60 s | EU + ASIA | 200, results > 5, p95 < 400 ms |
| hotel-detail-pinned-property | 60 s | EU | 200, includes hero photo |
| suggest-prefix-kab | 60 s | EU | 200, suggestions > 0 |
| healthz | 30 s | EU + ASIA | 200 |
| readyz | 30 s | EU + ASIA | 200 |

Synthetic failures page on three consecutive failures.
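The three-consecutive-failures rule amounts to a streak counter that resets on any success. A minimal sketch follows; the class name is hypothetical, and paging once per streak rather than on every further failure is an assumption.

```typescript
class ConsecutiveFailureTracker {
  private streak = 0;
  private readonly threshold: number;

  constructor(threshold: number = 3) {
    this.threshold = threshold;
  }

  // Records one check result. Returns true exactly when the failure streak
  // reaches the threshold, so a single outage triggers a single page.
  record(ok: boolean): boolean {
    if (ok) {
      this.streak = 0;
      return false;
    }
    this.streak += 1;
    return this.streak === this.threshold;
  }
}
```

With a 60 s check frequency, this means a check has to fail for roughly three minutes before anyone is paged, and a single success in between restarts the count.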

11. Audit feed

Per SECURITY_MODEL § Auditability, the topics melmastoon.search.boost_rule.v1 and melmastoon.search.index.v1 are mirrored to the audit-service BigQuery dataset audit.search, which is subject to retention controls. A Looker dashboard, refreshed daily, summarizes admin actions for compliance review.