Observability
:::info Source
Sourced from services/search-service/OBSERVABILITY.md in the documentation repo.
:::
Inherits the OpenTelemetry stack, SigNoz dashboards, alerting, and SLO policy from docs/15-observability-telemetry.md.
1. Telemetry Pillars
| Pillar | Tool | Retention |
|---|---|---|
| Traces | OpenTelemetry → SigNoz/Tempo | 14d hot, 90d cold |
| Metrics | OpenTelemetry → Prometheus | 30d hot, 1y cold |
| Logs | OpenTelemetry → Loki | 14d hot, 90d cold |
| Continuous profiling | Pyroscope (pprof) | 7d |
2. Trace Instrumentation
2.1 Auto-instrumented
- HTTP server (Fastify).
- NATS consumer + publisher.
- Postgres (pg).
- Redis.
- HTTP client (axios) → ai-gateway, OpenSearch.
2.2 Custom spans
| Span name | Attributes |
|---|---|
| search.query | tenant.id, semantic, types, hybrid.alpha, page.size, degraded, result.count |
| search.lexical.opensearch | index.name, shards.hit, took.ms, cache.hit |
| search.vector.knn | k, ef, model.id, candidates.returned |
| search.rerank.l2r | model.version, candidates.in, duration.ms, degraded |
| search.suggest | prefix.len, result.count, cache.hit |
| search.recommend.generate | user.id.hash, context, cache.hit, items.returned, model.version |
| search.index.upsert | doc.type, doc.id, aggregateVersion, embedding.model.id, embedding.skipped |
| search.projector.handle | source.subject, event.id, tenant.id, handled.as |
| search.reindex.phase | job.id, phase, docs.processed |
W3C traceparent is propagated through NATS headers via the OpenTelemetry NATS instrumentation.
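To make the propagation step concrete, here is a minimal sketch of writing and reading a version-00 W3C traceparent value in a plain header map. In the service this work is done by the OpenTelemetry instrumentation; the helper names (TraceContext, formatTraceparent, parseTraceparent, inject, extract) are hypothetical and exist only for illustration.

```typescript
// Sketch only: hand-rolled W3C Trace Context (version 00) helpers.
// All names below are hypothetical; the real service relies on OpenTelemetry.

interface TraceContext {
  traceId: string; // 32 lowercase hex characters, not all zeros
  spanId: string;  // 16 lowercase hex characters, not all zeros
  sampled: boolean;
}

// version "00" - trace-id - parent-id - trace-flags
const TRACEPARENT_V00 = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function formatTraceparent(ctx: TraceContext): string {
  const flags = ctx.sampled ? "01" : "00";
  return `00-${ctx.traceId}-${ctx.spanId}-${flags}`;
}

function parseTraceparent(header: string): TraceContext | null {
  const match = TRACEPARENT_V00.exec(header.trim());
  if (match === null) return null;
  const [, traceId, spanId, flags] = match;
  // All-zero IDs are invalid per the specification.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  // Bit 0 of trace-flags is the "sampled" flag.
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}

// Publisher side: attach the current context to outgoing message headers.
function inject(headers: Record<string, string>, ctx: TraceContext): void {
  headers["traceparent"] = formatTraceparent(ctx);
}

// Consumer side: recover the parent context, or null if absent or malformed.
function extract(headers: Record<string, string>): TraceContext | null {
  const raw = headers["traceparent"];
  return raw === undefined ? null : parseTraceparent(raw);
}
```

A consumer that gets null from extract would start a new root trace rather than fail the message, which keeps a malformed header from blocking indexing.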
3. Metrics
3.1 RED / USE
| Metric | Type | Labels |
|---|---|---|
| search_http_requests_total | counter | route, status, tenant_bucket |
| search_http_duration_ms | histogram | route, status |
| search_query_total | counter | semantic, tenant_bucket, degraded |
| search_query_result_count | histogram | semantic |
| search_index_upserts_total | counter | doc_type, had_embedding |
| search_indexing_lag_ms | histogram | source_subject |
| search_embedding_batch_size | histogram | - |
| search_embedding_cache_hit_ratio | gauge | - |
| search_reindex_running | gauge | scope |
| search_reindex_duration_sec | histogram | scope |
| search_dlq_depth | gauge | subject |
| search_rate_limited_total | counter | actor_class, endpoint |
| search_ndcg_at_10 | gauge | tenant_bucket (computed offline, scraped) |
| search_recommendation_ctr | gauge | context |
Tenant label cardinality is capped: tenants are bucketed (bucket = floor(log10(tenantId_hash))), or an allowlist of the top 20 tenants is used.
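The bucketing rule can be sketched as follows. The source does not say how tenantId_hash is derived, so the 32-bit FNV-1a hash here is an assumption, as are the function names (fnv1a32, tenantBucket) and the choice to combine the allowlist and the bucket in one function.

```typescript
// Sketch only: derive a low-cardinality tenant_bucket label value.
// The hash function is an assumption; the source only names "tenantId_hash".

// 32-bit FNV-1a over UTF-16 code units (adequate for ASCII tenant IDs).
function fnv1a32(input: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // multiply by FNV prime, keep unsigned
  }
  return hash >>> 0;
}

function tenantBucket(tenantId: string, allowlist: ReadonlySet<string>): string {
  // Allowlisted (top) tenants keep their own label value.
  if (allowlist.has(tenantId)) return tenantId;
  const hash = fnv1a32(tenantId);
  // log10(0) is -Infinity, so map a zero hash to bucket 0 explicitly.
  const bucket = hash === 0 ? 0 : Math.floor(Math.log10(hash));
  return `bucket_${bucket}`;
}
```

With a 32-bit hash the bucket is always in the range 0 to 9, so the label takes at most ten bucket values plus the allowlisted tenants. Under this particular hash assumption, most tenants land in bucket 9, because the majority of 32-bit values are at least 10^9. The formula caps cardinality but does not spread tenants evenly.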
3.2 SLI Definitions
| SLO | SLI |
|---|---|
| Query p95 ≤ 250ms | histogram_quantile(0.95, sum(rate(search_http_duration_ms_bucket{route="/search"}[5m])) by (le)) |
| Indexing lag p95 ≤ 2s | histogram_quantile(0.95, rate(search_indexing_lag_ms_bucket[5m])) |
| Availability 99.9% | 1 - sum(rate(search_http_requests_total{status=~"5.."}[30d])) / sum(rate(search_http_requests_total[30d])) |
| NDCG@10 ≥ 0.72 | scraped from offline eval job |
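The latency SLIs above rest on histogram_quantile, which estimates a quantile from cumulative bucket counts by linear interpolation inside the bucket that contains the target rank. The sketch below mirrors that behavior for classic Prometheus histograms. The bucket boundaries and counts are made-up illustration values, and Bucket and histogramQuantile are hypothetical names.

```typescript
// Sketch only: approximate what histogram_quantile does for classic histograms.

interface Bucket {
  le: number;    // upper bound in ms; Infinity stands for the +Inf bucket
  count: number; // cumulative number of observations <= le
}

function histogramQuantile(q: number, buckets: Bucket[]): number {
  const sorted = buckets.slice().sort((a, b) => a.le - b.le);
  if (sorted.length < 2) return NaN;
  const total = sorted[sorted.length - 1].count; // +Inf bucket holds the total
  if (total === 0) return NaN;

  const rank = q * total;
  let idx = 0;
  while (idx < sorted.length - 1 && sorted[idx].count < rank) idx++;

  // Rank falls into the +Inf bucket: report the highest finite upper bound.
  if (idx === sorted.length - 1) return sorted[idx - 1].le;

  const lower = idx === 0 ? 0 : sorted[idx - 1].le;
  const prevCount = idx === 0 ? 0 : sorted[idx - 1].count;
  const upper = sorted[idx].le;
  // Assume observations are spread uniformly inside the bucket.
  return lower + (upper - lower) * ((rank - prevCount) / (sorted[idx].count - prevCount));
}

// Illustrative data: 1000 requests, 940 of them at or below 250 ms.
const exampleBuckets: Bucket[] = [
  { le: 100, count: 600 },
  { le: 250, count: 940 },
  { le: 500, count: 990 },
  { le: Infinity, count: 1000 },
];
const p95 = histogramQuantile(0.95, exampleBuckets); // about 300 ms, above the 250 ms target
```

Because the estimate is interpolated, it is only as precise as the bucket layout. Placing bucket boundaries exactly at the SLO thresholds (250 ms for queries, 2000 ms for indexing lag) lets the count at that boundary answer "what fraction met the target" without interpolation error.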
4. Logging
- JSON lines, OpenTelemetry logs shape.
- Always include: service.name=search-service, tenant.id, trace.id, span.id, route.
- Sampling: DEBUG dropped in prod except tenant_id IN debug_allowlist.
Never log: q verbatim (PII risk), user-supplied free-form values. Log hashes and lengths.
Log example:
{
"ts": "2026-04-15T08:12:00Z",
"level": "INFO",
"msg": "search.query.completed",
"tenant.id": "01HA...",
"route": "/api/v1/search",
"semantic": "hybrid",
"q.length": 17,
"q.hash": "sha256:abc...",
"result.count": 42,
"duration.ms": 174,
"degraded": false,
"trace.id": "00-abc-def-01"
}
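The redaction and sampling rules in this section could be implemented along these lines. This is a sketch under assumptions: the function names (buildQueryLog, shouldEmit), the input shape, and the "production" environment string are hypothetical, and SHA-256 is inferred from the sha256: prefix in the example above.

```typescript
// Sketch only: build a search.query.completed record without logging q verbatim.
import { createHash } from "node:crypto";

type Level = "DEBUG" | "INFO" | "WARN" | "ERROR";

interface QueryLogInput {
  tenantId: string;
  traceId: string;
  spanId: string;
  route: string;
  semantic: string;
  q: string; // raw user query: must never appear in the output
  resultCount: number;
  durationMs: number;
  degraded: boolean;
}

function buildQueryLog(input: QueryLogInput): Record<string, unknown> {
  const digest = createHash("sha256").update(input.q, "utf8").digest("hex");
  return {
    ts: new Date().toISOString(),
    level: "INFO",
    msg: "search.query.completed",
    "service.name": "search-service",
    "tenant.id": input.tenantId,
    "trace.id": input.traceId,
    "span.id": input.spanId,
    route: input.route,
    semantic: input.semantic,
    "q.length": input.q.length, // length instead of content
    "q.hash": `sha256:${digest}`, // hash instead of content
    "result.count": input.resultCount,
    "duration.ms": input.durationMs,
    degraded: input.degraded,
  };
}

// DEBUG is dropped in production unless the tenant is on the debug allowlist.
function shouldEmit(
  level: Level,
  tenantId: string,
  env: string,
  debugAllowlist: ReadonlySet<string>,
): boolean {
  if (level !== "DEBUG" || env !== "production") return true;
  return debugAllowlist.has(tenantId);
}
```

Keeping the raw query out of the record type's output entirely, rather than filtering it later in the pipeline, means a misconfigured log exporter cannot leak it.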
5. Dashboards (SigNoz)
| Dashboard | Owner | Purpose |
|---|---|---|
| search-overview | search team | RED + traffic + degraded rate |
| search-tenant-hotspots | search team | Top 20 tenants by QPS / latency / errors |
| search-indexing-pipeline | search team | Projector throughput, inbox depth, DLQ |
| search-ranking-quality | data science | NDCG, CTR, L2R feature drift |
| search-ai-costs | platform | Tokens/sec, cost per tenant, batch efficiency |
| search-slo-burn | SRE | SLO burn-rate windows (1h / 6h / 24h) |
6. Alerting
6.1 Page-worthy
| Alert | Condition | Page |
|---|---|---|
| Service down | up{service="search"} == 0 for 2m | yes |
| Error budget fast burn (2% of the 30d budget in 1h) | burn_rate{window="1h"} > 14.4 | yes |
| Indexing lag p95 > 10s | 10m sustained | yes |
| DLQ depth > 0 | 5m sustained | yes |
| OpenSearch cluster red | via elastic exporter | yes |
| ai-gateway embeddings failing > 50% | 5m | yes |
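The 14.4 threshold in the burn-rate alert follows from the 99.9% availability SLO and a 30-day window: a burn rate of 14.4 sustained for one hour consumes about 2% of the monthly error budget. The arithmetic is sketched below; burnRate and budgetConsumed are hypothetical helper names, and the 1.44% error ratio is an illustrative input.

```typescript
// Sketch only: error-budget burn-rate arithmetic for a 99.9% / 30-day SLO.

// Burn rate = observed error ratio divided by the allowed error ratio.
function burnRate(errorRatio: number, sloTarget: number): number {
  return errorRatio / (1 - sloTarget);
}

// Fraction of the whole budget consumed by a given burn rate held for a window.
function budgetConsumed(rate: number, windowHours: number, periodHours: number): number {
  return (rate * windowHours) / periodHours;
}

const SLO_TARGET = 0.999;     // availability SLO from the SLI table
const PERIOD_HOURS = 30 * 24; // 30-day SLO window = 720 hours

// Example: 1.44% of requests failing over the last hour.
const rate = burnRate(0.0144, SLO_TARGET);              // approximately 14.4
const consumed = budgetConsumed(rate, 1, PERIOD_HOURS); // approximately 0.02 (2%)
```

At that rate the entire 30-day budget would be gone in roughly 50 hours (720 / 14.4), which is why the condition pages rather than opening a ticket.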
6.2 Ticket-worthy
| Alert | Condition |
|---|---|
| Hybrid degraded rate > 5% | 30m |
| Rate-limit spike (top tenant) | 10m |
| NDCG@10 < 0.68 | 24h |
| Embedding cache hit < 50% | 1h |
| Reindex running > 2h | - |
7. Tracing Examples
7.1 Search happy path
HTTP GET /search [142ms]
├── auth.verify [3ms]
└── search.query (semantic=hybrid) [138ms]
  ├── cache.get(redis) [1ms, miss]
  ├── search.lexical.opensearch [38ms]
  ├── ai-gateway.embed(q) [42ms]
  ├── search.vector.knn [35ms]
  ├── search.rerank.l2r [18ms]
  ├── hydrate.highlights [4ms]
  └── cache.set(redis) [1ms]
7.2 Index path
nats.consumer.handle (catalog.course_version.published.v1) [520ms]
├── inbox.check [2ms]
├── projector.map [1ms]
├── search.index.upsert [510ms]
│ ├── ai-gateway.embed(doc) [440ms]
│ ├── opensearch.upsert [32ms]
│ └── pgvector.upsert [22ms]
└── inbox.record [2ms]
8. Health Endpoints
- /healthz: always 200 if the process is up.
- /readyz: 200 iff: NATS connected, OpenSearch reachable, Postgres reachable, ai-gateway reachable (tried with 300ms timeout).
- /startup: used by orchestrator; 200 once initial consumer durables are bound.
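A readiness handler matching the /readyz rule might look like the sketch below. Each dependency check is injected as a function, so the example needs no client libraries. The names (Check, withTimeout, readiness) are hypothetical. Applying the 300 ms timeout to every dependency, and returning 503 on failure, are assumptions: the source mentions the timeout next to ai-gateway and specifies only the 200 case.

```typescript
// Sketch only: aggregate dependency checks into a readiness result.

type Check = () => Promise<void>; // resolves if the dependency is reachable

// Resolve to true if the check succeeds in time, false on error or timeout.
function withTimeout(check: Check, ms: number): Promise<boolean> {
  return new Promise<boolean>((resolve) => {
    const timer = setTimeout(() => resolve(false), ms);
    check().then(
      () => { clearTimeout(timer); resolve(true); },
      () => { clearTimeout(timer); resolve(false); },
    );
  });
}

async function readiness(
  checks: Record<string, Check>,
  timeoutMs = 300,
): Promise<{ status: number; failed: string[] }> {
  const names = Object.keys(checks);
  // Run all checks concurrently so total latency is bounded by one timeout.
  const results = await Promise.all(names.map((name) => withTimeout(checks[name], timeoutMs)));
  const failed = names.filter((_, i) => !results[i]);
  // 503 for "not ready" is an assumption of this sketch.
  return { status: failed.length === 0 ? 200 : 503, failed };
}
```

In a Fastify route the handler would pass one check per dependency (NATS, OpenSearch, Postgres, ai-gateway) and reply with the returned status. Including the failed list in the body makes it easier to see which dependency took the pod out of rotation.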
9. Continuous Profiling
- pprof endpoints gated by mTLS.
- Automatic flamegraph on latency p99 regressions > 2σ.
- Heap profiles collected hourly.
10. Synthetic Monitoring
| Probe | Frequency | Region | Assertion |
|---|---|---|---|
| Public search health | 1 min | US, EU, ME, AP | 200 within 1s |
| Authenticated search sample query | 5 min | 4 regions | 200, 5+ results |
| Autocomplete sample | 5 min | 4 regions | 200 |
| Reindex dry-run | daily | US | 202 |
11. Runbook Index
See FAILURE_MODES.md for the full runbook tree. Each alert links to its runbook.
12. Change Correlation
- Deployment events emitted as deployment.search.performed.v1.
- Grafana annotations tied to deploy timestamps.
- Canary rollouts tag 10% traffic with x-rollout: canary → filterable in all dashboards.