
Observability

:::info Source
Sourced from services/search-service/OBSERVABILITY.md in the documentation repo.
:::

This service inherits the OpenTelemetry stack, SigNoz dashboards, alerting, and SLO policy from docs/15-observability-telemetry.md.

1. Telemetry Pillars

| Pillar | Tool | Retention |
|---|---|---|
| Traces | OpenTelemetry → SigNoz/Tempo | 14d hot, 90d cold |
| Metrics | OpenTelemetry → Prometheus | 30d hot, 1y cold |
| Logs | OpenTelemetry → Loki | 14d hot, 90d cold |
| Continuous profiling | Pyroscope (pprof) | 7d |

2. Trace Instrumentation

2.1 Auto-instrumented

  • HTTP server (Fastify).
  • NATS consumer + publisher.
  • Postgres (pg).
  • Redis.
  • HTTP client (axios) → ai-gateway, OpenSearch.

2.2 Custom spans

| Span name | Attributes |
|---|---|
| search.query | tenant.id, semantic, types, hybrid.alpha, page.size, degraded, result.count |
| search.lexical.opensearch | index.name, shards.hit, took.ms, cache.hit |
| search.vector.knn | k, ef, model.id, candidates.returned |
| search.rerank.l2r | model.version, candidates.in, duration.ms, degraded |
| search.suggest | prefix.len, result.count, cache.hit |
| search.recommend.generate | user.id.hash, context, cache.hit, items.returned, model.version |
| search.index.upsert | doc.type, doc.id, aggregateVersion, embedding.model.id, embedding.skipped |
| search.projector.handle | source.subject, event.id, tenant.id, handled.as |
| search.reindex.phase | job.id, phase, docs.processed |

W3C traceparent propagated through NATS headers via OpenTelemetry NATS instrumentation.
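For orientation, the value carried in the `traceparent` header has a fixed shape: version, trace ID, parent span ID, and flags. The sketch below parses and formats that value for version `00`. It is an illustration only: in this service the propagation is handled by the OpenTelemetry NATS instrumentation, and the helper names `parseTraceparent` and `formatTraceparent` are hypothetical.

```typescript
// Parsed fields of a W3C `traceparent` header value (version 00 shape).
interface TraceParent {
  version: string;
  traceId: string;
  parentId: string;
  sampled: boolean;
}

// version(2 hex)-trace-id(32 hex)-parent-id(16 hex)-flags(2 hex), lowercase.
const TRACEPARENT_RE = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Hypothetical helper: returns null for values the Trace Context spec rejects.
function parseTraceparent(header: string): TraceParent | null {
  const match = TRACEPARENT_RE.exec(header.trim());
  if (match === null) return null;
  const version = match[1];
  const traceId = match[2];
  const parentId = match[3];
  const flags = match[4];
  if (version === "ff") return null;      // "ff" is a forbidden version value
  if (/^0+$/.test(traceId)) return null;  // all-zero trace-id is invalid
  if (/^0+$/.test(parentId)) return null; // all-zero parent-id is invalid
  // Bit 0 of the flags byte is the "sampled" flag.
  return { version, traceId, parentId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}

// Hypothetical helper: builds the header value a publisher would attach.
function formatTraceparent(traceId: string, spanId: string, sampled: boolean): string {
  return `00-${traceId}-${spanId}-${sampled ? "01" : "00"}`;
}
```

A consumer that reads this header from the NATS message can continue the publisher's trace instead of starting a new one, which is what makes the index-path trace in section 7.2 a single tree.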

3. Metrics

3.1 RED / USE

| Metric | Type | Labels |
|---|---|---|
| search_http_requests_total | counter | route, status, tenant_bucket |
| search_http_duration_ms | histogram | route, status |
| search_query_total | counter | semantic, tenant_bucket, degraded |
| search_query_result_count | histogram | semantic |
| search_index_upserts_total | counter | doc_type, had_embedding |
| search_indexing_lag_ms | histogram | source.subject |
| search_embedding_batch_size | histogram | - |
| search_embedding_cache_hit_ratio | gauge | - |
| search_reindex_running | gauge | scope |
| search_reindex_duration_sec | histogram | scope |
| search_dlq_depth | gauge | subject |
| search_rate_limited_total | counter | actor_class, endpoint |
| search_ndcg_at_10 | gauge | tenant_bucket (computed offline, scraped) |
| search_recommendation_ctr | gauge | context |

Tenant label cardinality is capped: we bucket tenants (bucket = floor(log10(tenantId_hash))) or use an allowlist of the top 20 tenants.
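A minimal sketch of how such a `tenant_bucket` label value could be derived. The hash function is not specified in this document, so a 32-bit FNV-1a over UTF-16 code units stands in as an assumption; the `b` prefix and the names `fnv1a32` and `tenantBucket` are likewise hypothetical.

```typescript
// Assumed stand-in hash: 32-bit FNV-1a over the string's UTF-16 code units.
function fnv1a32(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // multiply by FNV prime, keep unsigned
  }
  return hash >>> 0;
}

// Allowlisted tenants keep their own label value; all others share a bucket.
function tenantBucket(tenantId: string, allowlist: ReadonlySet<string> = new Set<string>()): string {
  if (allowlist.has(tenantId)) return tenantId;
  const hash = fnv1a32(tenantId);
  // log10(0) is -Infinity, so a zero hash is pinned to bucket 0.
  const bucket = hash === 0 ? 0 : Math.floor(Math.log10(hash));
  return `b${bucket}`;
}
```

With a 32-bit hash the bucket is always between 0 and 9, so the label takes at most ten bucket values plus the allowlisted tenant IDs. A uniform hash places most tenants in the two highest buckets, because most 32-bit values have nine or ten decimal digits.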

3.2 SLI Definitions

| SLO | SLI |
|---|---|
| Query p95 ≤ 250ms | histogram_quantile(0.95, sum(rate(search_http_duration_ms_bucket{route="/search"}[5m])) by (le)) |
| Indexing lag p95 ≤ 2s | histogram_quantile(0.95, sum(rate(search_indexing_lag_ms_bucket[5m])) by (le)) |
| Availability 99.9% | 1 - sum(rate(search_http_requests_total{status=~"5.."}[30d])) / sum(rate(search_http_requests_total[30d])) |
| NDCG@10 ≥ 0.72 | scraped from offline eval job |
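The latency SLIs depend on how `histogram_quantile` estimates a percentile from cumulative buckets: it finds the bucket containing the target rank and interpolates linearly within it. The sketch below reproduces that estimate for non-negative values. It is a simplified illustration, not Prometheus's implementation, and the bucket boundaries in the usage note are invented for the example.

```typescript
// One cumulative histogram bucket: `count` observations were <= `le`.
interface Bucket {
  le: number;
  count: number;
}

// Simplified quantile estimate over cumulative buckets (values assumed >= 0).
function histogramQuantile(q: number, buckets: Bucket[]): number {
  if (q < 0 || q > 1) return NaN;
  const sorted = buckets.slice().sort((a, b) => a.le - b.le);
  const last = sorted.length - 1;
  // A usable histogram needs a finite bucket plus the +Inf bucket.
  if (sorted.length < 2 || sorted[last].le !== Infinity) return NaN;
  const total = sorted[last].count;
  if (total === 0) return NaN;

  const rank = q * total;
  const idx = sorted.findIndex((b) => b.count >= rank);
  // Rank falls in the +Inf bucket: report the highest finite bound.
  if (idx === last) return sorted[last - 1].le;

  const upper = sorted[idx].le;
  const lower = idx === 0 ? 0 : sorted[idx - 1].le;
  const below = idx === 0 ? 0 : sorted[idx - 1].count;
  const inBucket = sorted[idx].count - below;
  if (inBucket === 0) return lower;
  // Linear interpolation inside the bucket that contains the rank.
  return lower + (upper - lower) * ((rank - below) / inBucket);
}
```

For example, with cumulative counts of 80 at `le=100`, 96 at `le=250`, and 100 at `le=500` and `+Inf`, the p95 estimate lands inside the 100 to 250 bucket and so meets a 250ms target. The estimate can be no more precise than the bucket boundaries, so a boundary at the SLO threshold is useful.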

4. Logging

  • JSON lines, OpenTelemetry logs shape.
  • Always include: service.name=search-service, tenant.id, trace.id, span.id, route.
  • Sampling: DEBUG dropped in prod except tenant_id IN debug_allowlist.

Never log: q verbatim (PII risk), user-supplied free-form values. Log hashes and lengths.

Log example:

{
  "ts": "2026-04-15T08:12:00Z",
  "level": "INFO",
  "msg": "search.query.completed",
  "service.name": "search-service",
  "tenant.id": "01HA...",
  "route": "/api/v1/search",
  "semantic": "hybrid",
  "q.length": 17,
  "q.hash": "sha256:abc...",
  "result.count": 42,
  "duration.ms": 174,
  "degraded": false,
  "trace.id": "abc...",
  "span.id": "def..."
}
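The `q.length` and `q.hash` fields above can be produced without the raw query ever reaching the logger. A minimal sketch, assuming an unsalted SHA-256 over the UTF-8 bytes and a length measured in UTF-16 code units; this document does not say whether the service salts or keys the hash, and `redactQuery` is a hypothetical helper name.

```typescript
import { createHash } from "node:crypto";

// Log-safe stand-ins for the user's query string.
interface RedactedQuery {
  "q.length": number;
  "q.hash": string;
}

// Hypothetical helper: replaces the raw query with its length and digest.
function redactQuery(q: string): RedactedQuery {
  const digest = createHash("sha256").update(q, "utf8").digest("hex");
  return { "q.length": q.length, "q.hash": `sha256:${digest}` };
}

// Usage: merge the redacted fields into the log record instead of `q`.
const logLine = JSON.stringify({
  level: "INFO",
  msg: "search.query.completed",
  ...redactQuery("intro to rust"),
});
console.log(logLine);
```

Hashing keeps identical queries correlatable across log lines. Short or common queries can still be guessed from an unsalted digest, so a keyed hash may be preferable if that matters.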

5. Dashboards (SigNoz)

| Dashboard | Owner | Purpose |
|---|---|---|
| search-overview | search team | RED + traffic + degraded rate |
| search-tenant-hotspots | search team | Top 20 tenants by QPS / latency / errors |
| search-indexing-pipeline | search team | Projector throughput, inbox depth, DLQ |
| search-ranking-quality | data science | NDCG, CTR, L2R feature drift |
| search-ai-costs | platform | Tokens/sec, cost per tenant, batch efficiency |
| search-slo-burn | SRE | SLO burn-rate windows (1h / 6h / 24h) |

6. Alerting

6.1 Page-worthy

| Alert | Condition | Page |
|---|---|---|
| Service down | up{service="search"} == 0 for 2m | yes |
| Error budget fast burn (2% of the 30d budget in 1h) | burn_rate{window="1h"} > 14.4 | yes |
| Indexing lag p95 > 10s | 10m sustained | yes |
| DLQ depth > 0 | 5m sustained | yes |
| OpenSearch cluster red | via elastic exporter | yes |
| ai-gateway embeddings failing > 50% | 5m | yes |
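The 14.4 threshold on the burn-rate alert follows from the usual definition: burn rate is the observed error ratio divided by the error ratio the SLO allows. A short worked sketch, assuming the 99.9% availability SLO and the 30-day window used in section 3.2; the function names are hypothetical.

```typescript
// Burn rate: how many times faster than "exactly on budget" errors are accruing.
function burnRate(errorRatio: number, slo: number): number {
  return errorRatio / (1 - slo);
}

// Fraction of the whole period's error budget spent during one alert window.
function budgetConsumed(rate: number, windowHours: number, periodHours: number): number {
  return rate * (windowHours / periodHours);
}

const slo = 0.999;           // availability target from section 3.2
const periodHours = 30 * 24; // 30-day SLO window = 720h

// A 1.44% error ratio against a 0.1% allowance is a burn rate of about 14.4.
const rate = burnRate(0.0144, slo);

// Sustained for 1h, that spends about 2% of the 30-day budget.
const spent = budgetConsumed(rate, 1, periodHours);
console.log(rate.toFixed(1), (spent * 100).toFixed(1) + "%");
```

Put another way, at a burn rate of 14.4 the full 30-day budget would be gone in 50 hours, which is why this condition pages rather than opening a ticket.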

6.2 Ticket-worthy

| Alert | Condition |
|---|---|
| Hybrid degraded rate > 5% | 30m |
| Rate-limit spike (top tenant) | 10m |
| NDCG@10 < 0.68 | 24h |
| Embedding cache hit < 50% | 1h |
| Reindex running > 2h | - |

7. Tracing Examples

7.1 Search happy path

HTTP GET /search [142ms]
├── auth.verify [3ms]
└── search.query (semantic=hybrid) [138ms]
    ├── cache.get(redis) [1ms, miss]
    ├── search.lexical.opensearch [38ms]
    ├── ai-gateway.embed(q) [42ms]
    ├── search.vector.knn [35ms]
    ├── search.rerank.l2r [18ms]
    ├── hydrate.highlights [4ms]
    └── cache.set(redis) [1ms]

7.2 Index path

nats.consumer.handle (catalog.course_version.published.v1) [520ms]
├── inbox.check [2ms]
├── projector.map [1ms]
├── search.index.upsert [510ms]
│ ├── ai-gateway.embed(doc) [440ms]
│ ├── opensearch.upsert [32ms]
│ └── pgvector.upsert [22ms]
└── inbox.record [2ms]

8. Health Endpoints

  • /healthz — always 200 if the process is up.
  • /readyz — 200 iff: NATS connected, OpenSearch reachable, Postgres reachable, ai-gateway reachable (tried with 300ms timeout).
  • /startup — used by orchestrator; 200 once initial consumer durables are bound.
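The /readyz rule above can be sketched as a set of dependency probes that must all succeed within a deadline. This is a minimal illustration, not the service's handler: the `Probe` type and `checkReady` helper are hypothetical, and applying the 300ms timeout to every dependency rather than only ai-gateway is an assumption.

```typescript
// A probe resolves when the dependency is reachable and rejects otherwise.
type Probe = () => Promise<void>;

interface Readiness {
  ready: boolean;
  failed: string[]; // names of dependencies that failed or timed out
}

// Rejects if the probe has not settled within timeoutMs.
function withTimeout(probe: Probe, timeoutMs: number): Promise<void> {
  return new Promise<void>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error("probe timed out")), timeoutMs);
    Promise.resolve()
      .then(probe)
      .then(resolve, reject)
      .then(() => clearTimeout(timer)); // runs after either outcome
  });
}

// Runs all probes concurrently; ready only if every one succeeds in time.
async function checkReady(
  probes: Record<string, Probe>,
  timeoutMs: number = 300,
): Promise<Readiness> {
  const names = Object.keys(probes);
  const results = await Promise.all(
    names.map((name) =>
      withTimeout(probes[name], timeoutMs).then(
        () => true,
        () => false,
      ),
    ),
  );
  const failed = names.filter((_, i) => !results[i]);
  return { ready: failed.length === 0, failed };
}
```

A route handler would answer 200 when `ready` is true. Returning the `failed` list in the response body makes it quicker to see which dependency is holding the pod out of rotation.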

9. Continuous Profiling

  • pprof endpoints gated by mTLS.
  • Automatic flamegraph on latency p99 regressions > 2σ.
  • Heap profiles collected hourly.
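The 2σ rule for triggering a flamegraph can be read as: the current p99 exceeds the baseline mean by more than two standard deviations. A minimal sketch under that reading. The baseline window, the use of population rather than sample standard deviation, and the function names are all assumptions; this document does not specify them.

```typescript
function mean(values: number[]): number {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Population standard deviation of the baseline samples.
function stddev(values: number[]): number {
  const m = mean(values);
  const variance = values.reduce((sum, v) => sum + (v - m) * (v - m), 0) / values.length;
  return Math.sqrt(variance);
}

// True when `currentP99` is more than `sigmas` deviations above the baseline mean.
function isP99Regression(baselineP99s: number[], currentP99: number, sigmas: number = 2): boolean {
  if (baselineP99s.length < 2) return false; // not enough history to judge
  const threshold = mean(baselineP99s) + sigmas * stddev(baselineP99s);
  return currentP99 > threshold;
}

// Baseline p99 samples in ms; the threshold works out to roughly 102.8ms.
const baseline = [100, 102, 98, 101, 99];
console.log(isP99Regression(baseline, 110)); // well above the threshold
console.log(isP99Regression(baseline, 102)); // inside normal variation
```

Only upward deviations count here, since a faster p99 is not a regression.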

10. Synthetic Monitoring

| Probe | Frequency | Region | Assertion |
|---|---|---|---|
| Public search health | 1 min | US, EU, ME, AP | 200 within 1s |
| Authenticated search sample query | 5 min | 4 regions | 200, 5+ results |
| Autocomplete sample | 5 min | 4 regions | 200 |
| Reindex dry-run | daily | US | 202 |

11. Runbook Index

See FAILURE_MODES.md for the full runbook tree. Each alert links to its runbook.

12. Change Correlation

  • Deployment events emitted as deployment.search.performed.v1.
  • Grafana annotations tied to deploy timestamps.
  • Canary rollouts tag 10% traffic with x-rollout: canary → filterable in all dashboards.