Observability

:::info Source
Sourced from services/search-service/OBSERVABILITY.md in the documentation repo.
:::

Inherits OpenTelemetry stack, SigNoz dashboards, alerting, and SLO policy from docs/15-observability-telemetry.md.

1. Telemetry Pillars

Pillar	Tool	Retention
Traces	OpenTelemetry → SigNoz/Tempo	14d hot, 90d cold
Metrics	OpenTelemetry → Prometheus	30d hot, 1y cold
Logs	OpenTelemetry → Loki	14d hot, 90d cold
Continuous profiling	Pyroscope (pprof)	7d

2. Trace Instrumentation

2.1 Auto-instrumented

HTTP server (Fastify).
NATS consumer + publisher.
Postgres (pg).
Redis.
HTTP client (axios) → ai-gateway, OpenSearch.

2.2 Custom spans

Span name	Attributes
`search.query`	`tenant.id`, `semantic`, `types`, `hybrid.alpha`, `page.size`, `degraded`, `result.count`
`search.lexical.opensearch`	`index.name`, `shards.hit`, `took.ms`, `cache.hit`
`search.vector.knn`	`k`, `ef`, `model.id`, `candidates.returned`
`search.rerank.l2r`	`model.version`, `candidates.in`, `duration.ms`, `degraded`
`search.suggest`	`prefix.len`, `result.count`, `cache.hit`
`search.recommend.generate`	`user.id.hash`, `context`, `cache.hit`, `items.returned`, `model.version`
`search.index.upsert`	`doc.type`, `doc.id`, `aggregateVersion`, `embedding.model.id`, `embedding.skipped`
`search.projector.handle`	`source.subject`, `event.id`, `tenant.id`, `handled.as`
`search.reindex.phase`	`job.id`, `phase`, `docs.processed`

W3C traceparent is propagated through NATS headers via the OpenTelemetry NATS instrumentation.
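
To make the propagated value concrete, here is a minimal sketch of parsing and formatting a W3C `traceparent` header (`version-traceid-parentid-flags`). In the service itself the OpenTelemetry instrumentation handles this; `parseTraceParent` and `formatTraceParent` are hypothetical helper names used only for illustration, and the sketch accepts only the four-field layout.

```typescript
// W3C Trace Context `traceparent` header: version-traceid-parentid-flags.
interface TraceParent {
  version: string;  // 2 hex chars, "00" in the current spec
  traceId: string;  // 32 lowercase hex chars, must not be all zeros
  parentId: string; // 16 lowercase hex chars, must not be all zeros
  sampled: boolean; // bit 0 of the trace-flags byte
}

const TRACEPARENT_RE =
  /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Hypothetical helper: returns null for anything that is not a valid header.
function parseTraceParent(header: string): TraceParent | null {
  const m = TRACEPARENT_RE.exec(header.trim());
  if (m === null) return null;
  const [, version, traceId, parentId, flags] = m;
  if (version === "ff") return null; // "ff" is reserved as invalid
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  return {
    version,
    traceId,
    parentId,
    sampled: (parseInt(flags, 16) & 0x01) === 1,
  };
}

// Hypothetical helper: serialises the value back into header form.
// Only the sampled bit is preserved; other flag bits are dropped.
function formatTraceParent(tp: TraceParent): string {
  const flags = tp.sampled ? "01" : "00";
  return `${tp.version}-${tp.traceId}-${tp.parentId}-${flags}`;
}
```

A consumer that reads the header from a NATS message could call `parseTraceParent` to recover the trace ID and parent span ID before starting `search.projector.handle`.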

3. Metrics

3.1 RED / USE

Metric	Type	Labels
`search_http_requests_total`	counter	`route`, `status`, `tenant_bucket`
`search_http_duration_ms`	histogram	`route`, `status`
`search_query_total`	counter	`semantic`, `tenant_bucket`, `degraded`
`search_query_result_count`	histogram	`semantic`
`search_index_upserts_total`	counter	`doc_type`, `had_embedding`
`search_indexing_lag_ms`	histogram	`source_subject`
`search_embedding_batch_size`	histogram	-
`search_embedding_cache_hit_ratio`	gauge	-
`search_reindex_running`	gauge	`scope`
`search_reindex_duration_sec`	histogram	`scope`
`search_dlq_depth`	gauge	`subject`
`search_rate_limited_total`	counter	`actor_class`, `endpoint`
`search_ndcg_at_10`	gauge	`tenant_bucket` (computed offline, scraped)
`search_recommendation_ctr`	gauge	`context`

Tenant label cardinality is capped: we bucket tenants (bucket = floor(log10(tenantId_hash))) or use an allowlist of the top 20 tenants.
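
The bucketing rule above can be sketched as follows. The source does not say which hash produces `tenantId_hash`, so a 32-bit FNV-1a is assumed here. Combining the allowlist with the bucket fallback in one function is also an assumption, and `tenantBucket` and the `bucket_` prefix are hypothetical names.

```typescript
// 32-bit FNV-1a over UTF-16 code units (equals byte-wise FNV-1a for ASCII IDs).
// The hash choice is an assumption; the source only names `tenantId_hash`.
function fnv1a32(input: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // multiply by FNV prime, keep unsigned
  }
  return hash >>> 0;
}

// Hypothetical helper: allowlisted tenants keep their own label value,
// everyone else is folded into floor(log10(hash)), i.e. at most ten buckets.
function tenantBucket(
  tenantId: string,
  allowlist: ReadonlySet<string>,
): string {
  if (allowlist.has(tenantId)) return tenantId;
  const hash = Math.max(fnv1a32(tenantId), 1); // avoid log10(0) = -Infinity
  return `bucket_${Math.floor(Math.log10(hash))}`;
}
```

Under this assumption the label takes at most 10 bucket values plus the allowlisted tenants. A uniformly distributed 32-bit hash puts roughly three quarters of tenants in bucket 9 and about a fifth in bucket 8, so the log10 scheme bounds cardinality but does not spread tenants evenly.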

3.2 SLI Definitions

SLO	SLI
Query p95 ≤ 250ms	`histogram_quantile(0.95, sum(rate(search_http_duration_ms_bucket{route="/search"}[5m])) by (le))`
Indexing lag p95 ≤ 2s	`histogram_quantile(0.95, sum(rate(search_indexing_lag_ms_bucket[5m])) by (le))`
Availability 99.9%	`1 - sum(rate(search_http_requests_total{status=~"5.."}[30d])) / sum(rate(search_http_requests_total[30d]))`
NDCG@10 ≥ 0.72	scraped from offline eval job
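
The latency SLIs rely on `histogram_quantile`, which estimates a quantile from cumulative bucket counts by linear interpolation inside the bucket that contains the target rank. The sketch below follows that idea for positive-valued buckets. It is a simplified illustration, not Prometheus's implementation, and the bucket boundaries in the usage note are made up.

```typescript
// One cumulative histogram bucket: `count` observations were <= `le`.
interface Bucket {
  le: number; // upper bound; the last bucket must be Infinity
  count: number;
}

// Simplified quantile estimate over cumulative buckets with positive bounds.
function histogramQuantile(q: number, buckets: Bucket[]): number {
  if (q < 0 || q > 1 || buckets.length < 2) return NaN;
  const sorted = [...buckets].sort((a, b) => a.le - b.le);
  const last = sorted[sorted.length - 1];
  if (last.le !== Infinity || last.count === 0) return NaN;

  const rank = q * last.count;
  const idx = sorted.findIndex((b) => b.count >= rank);
  // Rank falls in the +Inf bucket: report the highest finite bound.
  if (idx === sorted.length - 1) return sorted[sorted.length - 2].le;

  const upper = sorted[idx].le;
  const lower = idx === 0 ? 0 : sorted[idx - 1].le;
  const below = idx === 0 ? 0 : sorted[idx - 1].count;
  const inBucket = sorted[idx].count - below;
  if (inBucket === 0) return lower;
  // Assume observations are spread evenly inside the bucket.
  return lower + (upper - lower) * ((rank - below) / inBucket);
}
```

For example, with cumulative counts 40, 70, 90, 98, 100 at bounds 50, 100, 250, 500, +Inf (milliseconds), the p95 estimate lands inside the 250 to 500 bucket, which would breach the 250ms query SLO. The estimate is only as precise as the bucket layout, so a boundary at the SLO threshold helps.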

4. Logging

JSON lines, OpenTelemetry logs shape.
Always include: service.name=search-service, tenant.id, trace.id, span.id, route.
Sampling: DEBUG logs are dropped in prod unless the tenant is in debug_allowlist.

Never log: q verbatim (PII risk) or user-supplied free-form values. Log hashes and lengths instead.

Log example:

{
  "ts": "2026-04-15T08:12:00Z",
  "level": "INFO",
  "msg": "search.query.completed",
  "service.name": "search-service",
  "tenant.id": "01HA...",
  "route": "/api/v1/search",
  "semantic": "hybrid",
  "q.length": 17,
  "q.hash": "sha256:abc...",
  "result.count": 42,
  "duration.ms": 174,
  "degraded": false,
  "trace.id": "abc...",
  "span.id": "def..."
}
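
The logging rules above (hash and length in place of the raw query, DEBUG dropped in prod outside the allowlist) can be sketched as two small helpers. The names `redactQuery` and `shouldEmit` are hypothetical. The `sha256:` prefix mirrors the example record; whether the service salts or keys the hash is not stated in the source, so a plain digest is assumed.

```typescript
import { createHash } from "node:crypto";

type Level = "DEBUG" | "INFO" | "WARN" | "ERROR";

// Hypothetical helper: replace the raw query with its length and a SHA-256
// digest so records can be correlated without storing user input.
function redactQuery(q: string): { "q.length": number; "q.hash": string } {
  const digest = createHash("sha256").update(q, "utf8").digest("hex");
  // Note: length is counted in UTF-16 code units, as JavaScript strings are.
  return { "q.length": q.length, "q.hash": `sha256:${digest}` };
}

// Hypothetical helper: DEBUG records survive in prod only for tenants on
// the debug allowlist; every other level is always emitted.
function shouldEmit(
  level: Level,
  env: string,
  tenantId: string,
  debugAllowlist: ReadonlySet<string>,
): boolean {
  if (level !== "DEBUG" || env !== "prod") return true;
  return debugAllowlist.has(tenantId);
}
```

An unsalted digest of a short query can be recovered by guessing common inputs, so a keyed hash may be preferable if query confidentiality matters.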

5. Dashboards (SigNoz)

Dashboard	Owner	Purpose
`search-overview`	search team	RED + traffic + degraded rate
`search-tenant-hotspots`	search team	Top 20 tenants by QPS / latency / errors
`search-indexing-pipeline`	search team	Projector throughput, inbox depth, DLQ
`search-ranking-quality`	data science	NDCG, CTR, L2R feature drift
`search-ai-costs`	platform	Tokens/sec, cost per tenant, batch efficiency
`search-slo-burn`	SRE	SLO burn-rate windows (1h / 6h / 24h)

6. Alerting

6.1 Page-worthy

Alert	Condition	Page
Service down	`up{service="search"} == 0` for 2m	yes
Error budget fast burn (2% of the 30d budget in 1h)	`burn_rate{window="1h"} > 14.4`	yes
Indexing lag p95 > 10s	10m sustained	yes
DLQ depth > 0	5m sustained	yes
OpenSearch cluster red	via elastic exporter	yes
ai-gateway embeddings failing > 50%	5m	yes
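
The 14.4 threshold in the burn-rate alert follows from standard error-budget arithmetic: the burn rate is the observed error ratio divided by the budget (1 minus the SLO target), and the share of budget consumed is the burn rate times the window divided by the SLO period. A minimal sketch, assuming the 99.9% availability SLO over 30 days from section 3.2; the function names are illustrative.

```typescript
// Burn rate: how many times faster than "exactly on budget" errors occur.
function burnRate(errorRatio: number, sloTarget: number): number {
  return errorRatio / (1 - sloTarget);
}

// Fraction of the whole period's error budget consumed during the window.
function budgetConsumed(
  rate: number,
  windowHours: number,
  periodHours: number,
): number {
  return (rate * windowHours) / periodHours;
}

const PERIOD_HOURS = 30 * 24; // 30-day SLO period, as in the availability SLI

// A sustained 1.44% error ratio against a 99.9% target burns at about 14.4x,
// which uses about 2% of the 30-day budget in one hour.
const rate = burnRate(0.0144, 0.999);
const consumed = budgetConsumed(14.4, 1, PERIOD_HOURS);
console.log(rate.toFixed(1), (consumed * 100).toFixed(1) + "%");
```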

6.2 Ticket-worthy

Alert	Condition
Hybrid degraded rate > 5%	30m
Rate-limit spike (top tenant)	10m
NDCG@10 < 0.68	24h
Embedding cache hit < 50%	1h
Reindex running > 2h	-

7. Tracing Examples

7.1 Search happy path

HTTP GET /search                                [142ms]
├── auth.verify                                 [3ms]
├── search.query (semantic=hybrid)              [138ms]
│   ├── cache.get(redis)                        [1ms, miss]
│   ├── search.lexical.opensearch               [38ms]
│   ├── ai-gateway.embed(q)                     [42ms]
│   ├── search.vector.knn                       [35ms]
│   ├── search.rerank.l2r                       [18ms]
│   ├── hydrate.highlights                      [4ms]
│   └── cache.set(redis)                        [1ms]

7.2 Index path

nats.consumer.handle (catalog.course_version.published.v1)   [520ms]
├── inbox.check                                               [2ms]
├── projector.map                                             [1ms]
├── search.index.upsert                                       [510ms]
│   ├── ai-gateway.embed(doc)                                 [440ms]
│   ├── opensearch.upsert                                     [32ms]
│   └── pgvector.upsert                                       [22ms]
└── inbox.record                                              [2ms]

8. Health Endpoints

/healthz — always 200 if the process is up.
/readyz — 200 iff: NATS connected, OpenSearch reachable, Postgres reachable, ai-gateway reachable (tried with 300ms timeout).
/startup — used by orchestrator; 200 once initial consumer durables are bound.
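
The readiness rule can be sketched as a set of dependency checks, each raced against a deadline, with 200 returned only if all pass. The source states the 300ms timeout for the ai-gateway check; applying it as the default for every dependency is an assumption made here. `Dependency`, `probe`, and `readyz` are hypothetical names, and the checks would wrap real client pings in practice.

```typescript
type Check = () => Promise<void>;

interface Dependency {
  name: string;
  check: Check;
  timeoutMs?: number; // optional per-dependency deadline
}

// Resolves to true if the check finishes in time, false on error or timeout.
function probe(dep: Dependency, defaultTimeoutMs: number): Promise<boolean> {
  return new Promise<boolean>((resolve) => {
    const timer = setTimeout(
      () => resolve(false),
      dep.timeoutMs ?? defaultTimeoutMs,
    );
    // Promise.resolve().then(...) also captures synchronous throws.
    Promise.resolve()
      .then(() => dep.check())
      .then(
        () => { clearTimeout(timer); resolve(true); },
        () => { clearTimeout(timer); resolve(false); },
      );
  });
}

// 200 iff every dependency check passed, otherwise 503 with the failures.
async function readyz(
  deps: Dependency[],
  defaultTimeoutMs = 300,
): Promise<{ status: 200 | 503; failed: string[] }> {
  const ok = await Promise.all(deps.map((d) => probe(d, defaultTimeoutMs)));
  const failed = deps.filter((_, i) => !ok[i]).map((d) => d.name);
  return { status: failed.length === 0 ? 200 : 503, failed };
}
```

Running the checks concurrently keeps the worst-case response time close to the longest single timeout, which matters when an orchestrator polls `/readyz` frequently.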

9. Continuous Profiling

pprof endpoints gated by mTLS.
Automatic flamegraph on latency p99 regressions > 2σ.
Heap profiles collected hourly.
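
One way to read the "> 2σ" trigger is as a comparison of the current p99 against the mean plus two standard deviations of a baseline window. The sketch below assumes that interpretation. The baseline window, the use of the population standard deviation, and the name `isP99Regression` are all assumptions, not details from the source.

```typescript
function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

// Population standard deviation of the baseline samples.
function stddev(xs: number[]): number {
  const m = mean(xs);
  const variance = xs.reduce((acc, x) => acc + (x - m) * (x - m), 0) / xs.length;
  return Math.sqrt(variance);
}

// Hypothetical trigger: true when the current p99 exceeds the baseline mean
// by more than `sigmas` standard deviations.
function isP99Regression(
  baselineP99Ms: number[],
  currentP99Ms: number,
  sigmas = 2,
): boolean {
  if (baselineP99Ms.length < 2) return false; // not enough history to judge
  const threshold = mean(baselineP99Ms) + sigmas * stddev(baselineP99Ms);
  return currentP99Ms > threshold;
}
```

With baseline p99 samples of 200, 210, 190, 205, and 195 ms, the mean is 200 ms and the standard deviation is about 7.07 ms, so the threshold sits near 214 ms. A reading of 220 ms would trigger a flamegraph capture and 212 ms would not.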

10. Synthetic Monitoring

Probe	Frequency	Region	Assertion
Public search health	1 min	US, EU, ME, AP	200 within 1s
Authenticated search sample query	5 min	4 regions	200, 5+ results
Autocomplete sample	5 min	4 regions	200
Reindex dry-run	daily	US	202
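
The assertions in the probe table can be expressed as data and checked by one function. This is an illustrative sketch: the `ProbeResult` and `ProbeAssertion` shapes and `evaluateProbe` are hypothetical, and only the thresholds (200 within 1s, 5+ results, 202 for the dry-run) come from the table.

```typescript
interface ProbeResult {
  status: number;
  latencyMs: number;
  resultCount?: number; // only set for search-style probes
}

interface ProbeAssertion {
  expectStatus: number;
  maxLatencyMs?: number;
  minResults?: number;
}

// Returns a list of violated assertions; an empty list means the probe passed.
function evaluateProbe(r: ProbeResult, a: ProbeAssertion): string[] {
  const violations: string[] = [];
  if (r.status !== a.expectStatus) {
    violations.push(`status ${r.status}, expected ${a.expectStatus}`);
  }
  if (a.maxLatencyMs !== undefined && r.latencyMs > a.maxLatencyMs) {
    violations.push(`latency ${r.latencyMs}ms over ${a.maxLatencyMs}ms`);
  }
  if (a.minResults !== undefined && (r.resultCount ?? 0) < a.minResults) {
    violations.push(`results ${r.resultCount ?? 0}, expected ${a.minResults}+`);
  }
  return violations;
}

// Assertions taken from the probe table.
const publicHealth: ProbeAssertion = { expectStatus: 200, maxLatencyMs: 1000 };
const authenticatedSearch: ProbeAssertion = { expectStatus: 200, minResults: 5 };
const reindexDryRun: ProbeAssertion = { expectStatus: 202 };
```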

11. Runbook Index

See FAILURE_MODES.md for the full runbook tree. Each alert links to its runbook.

12. Change Correlation

Deployment events emitted as deployment.search.performed.v1.
Grafana annotations tied to deploy timestamps.
Canary rollouts tag 10% of traffic with x-rollout: canary, which is filterable in all dashboards.
