search-aggregation-service — OBSERVABILITY
Companion: SERVICE_OVERVIEW · API_CONTRACTS · EVENT_SCHEMAS · FAILURE_MODES · DEPLOYMENT_TOPOLOGY
1. Telemetry stack
| Concern | Tool | Notes |
|---|---|---|
| Distributed traces | OpenTelemetry SDK → Cloud Trace | OTLP exporter; W3C Trace Context propagated to Pub/Sub messages and OpenSearch (X-Opaque-Id) |
| Metrics | OpenTelemetry → Cloud Monitoring (Managed Prometheus) | RED + USE; histograms via OTel View config |
| Logs | Pino → Cloud Logging (structured JSON) | severity mapping; PII redaction middleware |
| Profiling | Cloud Profiler (Node.js) | Always-on in production |
| Error reporting | Cloud Error Reporting | via Pino level: error adapter |
| Dashboards & alerts | Cloud Monitoring + Grafana (read-only mirror) | Dashboards committed in ops/dashboards/ |
| Synthetic monitoring | Cloud Monitoring uptime checks + Playwright synthetics from bff-consumer-service | Hits /api/v1/search/queries with golden query |
2. Trace propagation
Every external request enters with a W3C traceparent header (Apigee injects one when it is missing). The trace propagates through:
bff-consumer-service → search-aggregation-service → Postgres / OpenSearch / Memorystore / Pub/Sub publish.
For consumed events, the publisher's trace context is unpacked from the Pub/Sub traceparent attribute and a new child span is started in the consumer.
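The extraction step for consumed events can be sketched as below. In the running service the OpenTelemetry propagator would normally do this; the helper name parseTraceparent and the ParentSpanContext shape are hypothetical, and only the version-00 traceparent layout is handled.

```typescript
// Hypothetical helper, not part of the service: parses the W3C `traceparent`
// value carried as a Pub/Sub message attribute (version-00 layout only).
interface ParentSpanContext {
  traceId: string;  // 32 lowercase hex characters
  spanId: string;   // 16 lowercase hex characters (the publisher's span)
  sampled: boolean; // bit 0 of the trace-flags byte
}

const TRACEPARENT_RE = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(attributes: Record<string, string>): ParentSpanContext | null {
  const raw = attributes["traceparent"];
  if (raw === undefined) return null;
  const match = TRACEPARENT_RE.exec(raw.trim());
  if (match === null) return null;
  const version = match[1] ?? "";
  const traceId = match[2] ?? "";
  const spanId = match[3] ?? "";
  const flags = match[4] ?? "00";
  // Version "ff" is forbidden and all-zero ids are invalid per the W3C spec.
  if (version === "ff" || /^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}
```

The consumer span would then reuse traceId and record spanId as its parent, so the publish and consume sides appear in one trace.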
Span naming
| Span | Kind | Key attributes |
|---|---|---|
| http.server POST /api/v1/search/queries | server | http.route, query.canonicalHash, query.locale, query.region, query.degradationLevel |
| http.server GET /api/v1/search/hotels/:id | server | http.route, propertyId, currency |
| app.command UpsertHotelIndexEntry | internal | propertyId, tenantId, vc.property_service, result |
| app.consumer melmastoon.property.published.v1 | consumer | messaging.system=gcp_pubsub, messaging.message.id, propertyId, result |
| db.postgres SELECT search.hotel_index_entries | client | db.system=postgresql, db.statement.fingerprint, rows |
| opensearch.search melmastoon-search-current | client | opensearch.took_ms, opensearch.shards_failed, opensearch.timed_out |
| cache.get srh:q | client | cache.hit, cache.key.hash (never raw key) |
| pubsub.publish melmastoon.search.projection.updated.v1 | producer | topic, messaging.message.id, ordering_key |
| ai.orchestrator parse-search | client | ai.cache_hit, ai.model, ai.prompt_version, ai.cost_micros |
db.statement.fingerprint strips literals (uses pg-query-emscripten normalizer) before tagging the span — no raw query text is exported.
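To illustrate what stripping literals means, here is a deliberately simplified, regex-based stand-in. It is not the pg-query-emscripten API and does not parse SQL; the function name fingerprintSql is hypothetical, and a real normalizer handles many cases (comments, dollar-quoted strings, casts) that this sketch ignores.

```typescript
// Simplified illustration only: replaces literals with "?" so that queries
// differing only in their values share one fingerprint.
function fingerprintSql(sql: string): string {
  return sql
    // Single-quoted string literals, including doubled '' escapes.
    .replace(/'(?:[^']|'')*'/g, "?")
    // Numeric literals that are not part of an identifier or a $n placeholder.
    .replace(/(^|[^\w$])\d+(?:\.\d+)?\b/g, "$1?")
    // Collapse whitespace so formatting differences do not change the result.
    .replace(/\s+/g, " ")
    .trim();
}
```

Bind placeholders such as $1 are left untouched because they carry no user data.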
3. SLIs
| SLI | Definition | Source |
|---|---|---|
| Search latency | http.server.duration{route="POST /api/v1/search/queries", status<500} | OTel histogram |
| Hotel detail latency | http.server.duration{route="GET /api/v1/search/hotels/:id", status<500} | OTel histogram |
| Search availability | 1 - (5xx / total) on the same route, 5-min window | OTel counter |
| Projection freshness | now() - max(occurred_at) per upstream topic processed in the last 1 min | gauge projection_freshness_seconds{topic} |
| Index lag | (rows in postgres with last_upserted_at > T) - (docs in opensearch with last_upserted_at > T) | gauge index_lag_docs |
| Cache hit ratio | cache_hits / (cache_hits + cache_misses) per surface | counters |
| Allow-list strip rate | projection_field_stripped_total / projection_events_total | counters |
| AI fallback rate | ai_search_intent_fallbacks_total / ai_search_intent_calls_total | counters |
| DLQ rate | pubsub_dlq_total / pubsub_messages_total per subscription | counters |
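
The ratio-style SLIs and the freshness SLI in the table reduce to a few lines of arithmetic. The sketch below assumes counter deltas over a single evaluation window; the function names are hypothetical, and returning null for an empty window is an assumption of this example, not documented service behaviour.

```typescript
// Generic ratio; null when there was no traffic, so an idle window is not
// reported as 0 % or 100 %. (Assumption of this sketch.)
function ratioSli(numerator: number, denominator: number): number | null {
  return denominator > 0 ? numerator / denominator : null;
}

// Search availability: 1 - (5xx / total).
function searchAvailability(total: number, serverErrors: number): number | null {
  const errorRatio = ratioSli(serverErrors, total);
  return errorRatio === null ? null : 1 - errorRatio;
}

// Cache hit ratio: cache_hits / (cache_hits + cache_misses).
function cacheHitRatio(hits: number, misses: number): number | null {
  return ratioSli(hits, hits + misses);
}

// Projection freshness: seconds between now and the newest occurred_at
// processed for one topic (timestamps in epoch milliseconds).
function projectionFreshnessSeconds(nowMs: number, occurredAtMs: number[]): number | null {
  if (occurredAtMs.length === 0) return null;
  return (nowMs - Math.max(...occurredAtMs)) / 1000;
}
```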
4. SLOs
| SLO | Target | Window | Burn-rate alerts |
|---|---|---|---|
| Search latency p95 < 250 ms | 99 % | 30 d rolling | 1 h burn ≥ 14× ⇒ page; 6 h burn ≥ 6× ⇒ ticket |
| Search availability ≥ 99.9 % | 99.9 % | 30 d rolling | as above |
| Hotel detail latency p95 < 200 ms | 99 % | 30 d rolling | as above |
| Projection freshness p95 < 30 s | 99 % | 7 d rolling | 1 h burn ≥ 14× ⇒ page |
| Index lag < 1 000 docs | 99.5 % | 7 d rolling | sustained > 5 min ⇒ ticket; > 30 min ⇒ page |
| Allow-list strip rate = 0 in steady state | 100 % | 24 h | ANY occurrence ⇒ security page |
| DLQ rate < 0.1 % | 99 % | 1 h | sustained > 0.1 % over 15 min ⇒ page |
| Cache hit ratio ≥ 70 % (/queries) | n/a (efficiency) | 24 h | < 50 % over 1 h ⇒ ticket |
Burn-rate is computed via Cloud Monitoring SLO services with multi-window alerts (1 h fast burn + 6 h slow burn).
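For readers unfamiliar with burn rate: it is the observed bad-event ratio divided by the error budget (1 - target). The sketch below applies the 14× and 6× thresholds from the table; the function names are hypothetical and the actual evaluation is done by Cloud Monitoring, not by service code.

```typescript
type BurnAction = "page" | "ticket" | "none";

// Burn rate = bad-event ratio / error budget, where budget = 1 - SLO target.
// A burn rate of 1 spends the budget exactly over the full SLO window.
function burnRate(badRatio: number, sloTarget: number): number {
  const budget = 1 - sloTarget;
  // A 100 % target (such as the allow-list SLO) has no budget to burn;
  // it alerts on any occurrence instead.
  if (budget <= 0) throw new RangeError("sloTarget must be below 1");
  return badRatio / budget;
}

// Thresholds from the SLO table: 1 h burn >= 14x pages, 6 h burn >= 6x tickets.
function evaluateBurn(badRatio1h: number, badRatio6h: number, sloTarget: number): BurnAction {
  if (burnRate(badRatio1h, sloTarget) >= 14) return "page";
  if (burnRate(badRatio6h, sloTarget) >= 6) return "ticket";
  return "none";
}
```

For the 99.9 % availability SLO the budget is 0.1 %, so a 2 % error ratio over the last hour is a burn rate of about 20 and pages.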
5. Metrics catalog
5.1 Counters
| Metric | Labels | Increment |
|---|---|---|
| search_queries_total | region, locale, degradation_level | per executed query |
| search_results_returned_total | region | per query; incremented by result_count |
| clicks_recorded_total | region | per click |
| projection_events_total | topic, result | per consumed event |
| projection_field_stripped_total | field | per stripped field |
| projection_skipped_stale_total | topic, slice | per stale-vector-clock skip |
| pubsub_dlq_total | subscription | per dead-lettered message |
| idempotency_replay_total | route | per replayed request |
| ai_search_intent_calls_total | cache_hit | per orchestrator call |
| ai_search_intent_fallbacks_total | reason | per fallback |
| cache_hits_total / cache_misses_total | surface | per cache op |
| boost_rule_writes_total | result | per boost-rule mutation |
| index_build_phase_total | phase, result | per phase transition |
| tenant_purge_total | result | per tenant purge |
5.2 Histograms
| Metric | Buckets (ms) | Notes |
|---|---|---|
| http_server_duration_ms | 5,10,25,50,100,250,500,1000,2500,5000 | by route, status |
| db_query_duration_ms | 1,2,5,10,25,50,100,250,500,1000 | by op, table |
| opensearch_query_duration_ms | 5,10,25,50,100,250,500,1000,2500 | by op |
| cache_op_duration_ms | 0.5,1,2,5,10,25,50 | by op, surface |
| consumer_processing_duration_ms | 5,10,25,50,100,250,500,1000,2500,5000 | by topic |
| outbox_publish_lag_ms | 50,100,250,500,1000,2500,5000,10000 | created→published |
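
Because the latency SLOs are stated as p95 values, it helps to see how a percentile is recovered from explicit buckets. The sketch below records values into the http_server_duration_ms bounds and estimates a quantile by linear interpolation within the matching bucket, similar in spirit to what a metrics backend does. The function names are hypothetical and this is not the backend's exact algorithm.

```typescript
// Upper bounds (ms) copied from the http_server_duration_ms row above.
const HTTP_BUCKETS_MS = [5, 10, 25, 50, 100, 250, 500, 1000, 2500, 5000];

// counts holds per-bucket (non-cumulative) tallies; the final slot is the
// overflow bucket for values above the highest bound.
function recordDuration(counts: number[], bounds: number[], valueMs: number): void {
  let i = bounds.findIndex((upper) => valueMs <= upper);
  if (i === -1) i = bounds.length;
  counts[i] = (counts[i] ?? 0) + 1;
}

// Estimates quantile q (0..1) by interpolating linearly inside the bucket
// that contains the target rank. Accuracy is limited by bucket width.
function estimateQuantile(counts: number[], bounds: number[], q: number): number | null {
  const total = counts.reduce((sum, c) => sum + c, 0);
  if (total === 0) return null;
  const rank = q * total;
  let seen = 0;
  for (let i = 0; i < counts.length; i++) {
    const inBucket = counts[i] ?? 0;
    if (inBucket > 0 && seen + inBucket >= rank) {
      // Overflow bucket has no upper bound: clamp to the highest finite bound.
      if (i >= bounds.length) return bounds[bounds.length - 1] ?? null;
      const lower = i === 0 ? 0 : (bounds[i - 1] ?? 0);
      const upper = bounds[i] ?? lower;
      return lower + (upper - lower) * ((rank - seen) / inBucket);
    }
    seen += inBucket;
  }
  return bounds[bounds.length - 1] ?? null;
}
```

The 250 ms bound lines up with the search latency SLO threshold, so the share of requests under 250 ms can be read from bucket counts without interpolation.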
5.3 Gauges
| Metric | Labels | Source |
|---|---|---|
| projection_freshness_seconds | topic | scraped every 15 s |
| index_lag_docs | none | scraped every 60 s |
| outbox_pending | none | scraped every 30 s |
| inbox_unprocessed | none | scraped every 30 s |
| redis_cache_keys | prefix | scraped every 60 s |
| opensearch_cluster_status | color | from cluster health |
| index_build_active | region | scraped every 60 s |
6. Logs
6.1 Format
{
  "ts": "2026-04-23T12:34:56.789Z",
  "severity": "INFO",
  "service": "search-aggregation-service",
  "version": "1.42.0",
  "env": "prod",
  "region": "europe-west1",
  "trace": "00-<trace-id>-<span-id>-01",
  "spanId": "<span>",
  "traceId": "<trace>",
  "tenantId": null, // null for cross-tenant reads
  "route": "POST /api/v1/search/queries",
  "msg": "search.executed",
  "ctx": {
    "queryHash": "sha256:…",
    "locale": "ps",
    "region": "AF",
    "resultCount": 42,
    "tookMs": 187,
    "degradationLevel": "none",
    "cacheHit": false
  }
}
6.2 Required fields
ts, severity, service, version, env, region, traceId, spanId, route, msg. Optional but standardized: tenantId, propertyId, eventId, topic, result, errorCode, durationMs.
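A check for the required fields could look like the sketch below, for example in a unit test for the logging wrapper. The function name missingLogFields is hypothetical, and treating null or empty strings as missing is an assumption of this example.

```typescript
// Required fields copied from the list above.
const REQUIRED_LOG_FIELDS = [
  "ts", "severity", "service", "version", "env",
  "region", "traceId", "spanId", "route", "msg",
] as const;

// Returns the names of required fields that are absent, null, or empty.
function missingLogFields(record: Record<string, unknown>): string[] {
  return REQUIRED_LOG_FIELDS.filter((field) => {
    const value = record[field];
    return value === undefined || value === null || value === "";
  });
}
```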
6.3 Log levels
| Level | When |
|---|---|
| DEBUG | Local dev / temporarily enabled per-pod via LOG_LEVEL env override |
| INFO | Normal events: query executed, projection upsert applied, cache invalidate |
| WARN | Recoverable degradation: OpenSearch fallback, AI fallback, cache miss storm |
| ERROR | Unrecoverable per-request: validation failure (with code), 5xx |
| CRITICAL | Process-wide: allow-list breach, RLS sentinel mismatch, secret refresh failure |
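
The severity mapping mentioned in the telemetry stack table might look like the sketch below. It relies on Pino's default numeric levels (trace 10, debug 20, info 30, warn 40, error 50, fatal 60) and Cloud Logging's severity names, where WARN is spelled WARNING. Mapping Pino's fatal level to CRITICAL is an assumption of this example, and the function name is hypothetical.

```typescript
type CloudSeverity = "DEBUG" | "INFO" | "WARNING" | "ERROR" | "CRITICAL";

// Maps a Pino numeric level to a Cloud Logging severity string.
function toCloudSeverity(pinoLevel: number): CloudSeverity {
  if (pinoLevel >= 60) return "CRITICAL"; // fatal (assumed mapping)
  if (pinoLevel >= 50) return "ERROR";
  if (pinoLevel >= 40) return "WARNING";  // the WARN row above
  if (pinoLevel >= 30) return "INFO";
  return "DEBUG";                         // debug (20) and trace (10)
}
```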
6.4 PII redaction
Pino redact paths: ctx.text, ctx.userBucket, ctx.email, req.headers.authorization, req.headers.cookie. Redaction is enforced in-process when the log record is serialized and again at the Cloud Logging sink filter (defense in depth).
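In the service these paths would be passed to Pino's redact option. To show the effect without depending on the library, the sketch below reimplements path-based redaction in plain TypeScript. redactPaths and JsonObject are hypothetical names, and the "[Redacted]" censor string mirrors Pino's default.

```typescript
// Paths copied from the list above.
const REDACT_PATHS = [
  "ctx.text", "ctx.userBucket", "ctx.email",
  "req.headers.authorization", "req.headers.cookie",
];

type JsonObject = { [key: string]: unknown };

function isObject(value: unknown): value is JsonObject {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Returns a deep copy of `record` with each listed dotted path replaced by
// the censor string. Paths that do not exist are left absent.
function redactPaths(record: JsonObject, paths: string[], censor = "[Redacted]"): JsonObject {
  const copy = JSON.parse(JSON.stringify(record)) as JsonObject; // assumes JSON-safe input
  for (const path of paths) {
    const keys = path.split(".");
    const last = keys.pop();
    if (last === undefined) continue;
    let node: unknown = copy;
    for (const key of keys) {
      node = isObject(node) ? node[key] : undefined;
    }
    if (isObject(node) && last in node) node[last] = censor;
  }
  return copy;
}
```

Working on a copy keeps the caller's object intact, which matters when the same context object is also used to build the response.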
7. Standard dashboards
Stored in ops/dashboards/search-aggregation-service.json (Grafana JSON). Tabs:
- SLO overview — burn-rate, error budget remaining, all SLIs.
- Search read path — RPS by route, latency p50/p95/p99, error rate, cache hit ratio.
- Projection write path — events/sec by topic, processing duration, DLQ rate, vector-clock skip rate.
- Freshness & lag — projection freshness gauge by topic, index lag, outbox pending.
- OpenSearch health — cluster color, shard status, took_ms, query rejections.
- Postgres health — connection pool, slow queries, RLS errors (should be zero), partitioned table sizes.
- Cache — Memorystore hit ratio, evictions, memory usage, key count by prefix.
- AI — orchestrator call rate, cache hit, fallback rate, cost micros sum.
- Security & compliance — allow-list strip rate, auth failures, idempotency replays, tenant purges.
- Index builds — active builds, phase durations, swap events.
8. Alerts (Cloud Monitoring)
| Alert | Condition | Severity | Runbook |
|---|---|---|---|
| search-latency-fast-burn | 1 h burn ≥ 14× over SLO 99 % p95 250 ms | P1 page | FAILURE_MODES § Latency |
| search-availability-fast-burn | 1 h burn ≥ 14× over 99.9 % | P1 page | FAILURE_MODES § 5xx |
| freshness-fast-burn | projection_freshness_seconds{topic=*} p95 > 60 for 5 min | P2 page | FAILURE_MODES § Stale projection |
| dlq-spike | pubsub_dlq_total rate > 0.1% over 15 min | P2 page | FAILURE_MODES § DLQ |
| allow-list-breach | projection_field_stripped_total ≥ 1 in 5 min | P1 security page | SECURITY_MODEL § Allow-list |
| opensearch-yellow | cluster status yellow for 10 min | P3 ticket | FAILURE_MODES § OpenSearch |
| opensearch-red | cluster status red for 1 min | P1 page | same |
| outbox-backlog | outbox_pending > 5 000 for 5 min | P2 page | FAILURE_MODES § Outbox |
| index-lag-high | index_lag_docs > 5 000 for 5 min | P2 page | runbook Reconcile index lag |
| ai-fallback-elevated | fallback rate > 20 % for 15 min | P3 ticket | AI_INTEGRATION § Failure modes |
| cost-budget-soft | monthly Cloud Run CPU * memory > 80 % budget | P3 ticket | platform finance |
| cache-hit-low | cache hit ratio for /queries < 50 % for 1 h | P3 ticket | platform owner |
9. Runbooks
Live in ops/runbooks/search-aggregation-service/:
- search-latency-degraded.md
- freshness-stale.md
- dlq-drain.md
- opensearch-degraded.md
- outbox-stuck.md
- index-lag-reconcile.md
- index-rebuild.md
- tenant-purge.md
- allow-list-breach-ir.md (incident response)
- ai-orchestrator-degraded.md
Each runbook follows the standard sections: Symptom, Detection, Triage, Mitigate, Recover, Postmortem.
10. Synthetic checks
| Check | Frequency | Region | Expectation |
|---|---|---|---|
| search-golden-query-kabul-3star | 60 s | EU + ASIA | 200, results > 5, p95 < 400 ms |
| hotel-detail-pinned-property | 60 s | EU | 200, includes hero photo |
| suggest-prefix-kab | 60 s | EU | 200, suggestions > 0 |
| healthz | 30 s | EU + ASIA | 200 |
| readyz | 30 s | EU + ASIA | 200 |
A synthetic check pages after three consecutive failures.
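The three-consecutive-failures rule can be sketched as a small streak counter. In practice the alerting policy in Cloud Monitoring would express this; the class name is hypothetical, and paging only once per streak (rather than on every further failure) is an assumption of this example.

```typescript
// Tracks consecutive failures per check name.
class ConsecutiveFailureTracker {
  private readonly streaks = new Map<string, number>();
  private readonly threshold: number;

  constructor(threshold = 3) {
    this.threshold = threshold;
  }

  // Records one check result. Returns true exactly once per failure streak,
  // at the moment the streak reaches the threshold.
  record(check: string, ok: boolean): boolean {
    if (ok) {
      this.streaks.set(check, 0);
      return false;
    }
    const streak = (this.streaks.get(check) ?? 0) + 1;
    this.streaks.set(check, streak);
    return streak === this.threshold;
  }
}
```

A single success resets the streak, so intermittent failures of a 60 s check do not page unless three runs in a row fail.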
11. Audit feed
Per SECURITY_MODEL § Auditability, the topics melmastoon.search.boost_rule.v1 and melmastoon.search.index.v1 are mirrored to the audit-service BigQuery dataset audit.search, which is subject to retention controls. A daily Looker dashboard summarizes admin actions for compliance review.