Failure Modes

:::info Source
Sourced from services/search-service/FAILURE_MODES.md in the documentation repo.
:::

Complements OBSERVABILITY.md and DEPLOYMENT_TOPOLOGY.md.

1. Failure Matrix

| # | Mode | Detection | Blast radius | Mitigation | Runbook |
|---|---|---|---|---|---|
| F1 | OpenSearch cluster red | `cluster.health=red` alert | All search | Degraded path + fallback to Postgres trigram | §3.1 |
| F2 | OpenSearch slow queries | p95 latency alert | Subset of queries | Circuit breaker + reject heavy queries | §3.2 |
| F3 | Postgres unavailable | connection errors | Admin/reindex paths; metadata | Serve from Redis cache; block reindex | §3.3 |
| F4 | Redis unavailable | connection errors | Cache miss → latency ↑ | Disable cache path; rely on OpenSearch | §3.4 |
| F5 | ai-gateway unreachable | HTTP 5xx/timeout | Semantic + recs | Lexical-only fallback; disable recs | §3.5 |
| F6 | Embedding model drift | model mismatch alert | Mixed quality | Queued rebuild; read both versions | §3.6 |
| F7 | NATS disconnected | consumer lag | Indexing stalls | Retry + queue in-memory; alert at 2m | §3.7 |
| F8 | DLQ non-empty | DLQ depth > 0 | Lost updates for some docs | Triage + replay | §3.8 |
| F9 | Cross-tenant leak suspected | audit alert | Compliance | Emergency filter + forensic review | §3.9 |
| F10 | Reindex stuck | job running > 2h | One tenant | Kill + rollback to previous alias | §3.10 |
| F11 | Embedding budget blown | ai-gateway quota alert | Semantic disabled | Throttle to lexical; raise ticket | §3.11 |
| F12 | Document stream poisoning | payload outside schema | Projector crash | Quarantine; DLQ; patch projector | §3.12 |
| F13 | Tenant erasure failure | audit mismatch | GDPR non-compliance | Manual purge + DPO notification | §3.13 |
| F14 | Alias swap failure | reindex final step fails | Intermittent 5xx | Rollback alias; retry | §3.14 |
| F15 | Over-capacity query (regex) | long-running span | Per tenant | Cancel + rate limit harder | §3.15 |
| F16 | Certificate revocation not indexed | stale cert visible | Per user (compliance) | Force targeted reindex | §3.16 |

2. Degradation Ladder

L0 Full hybrid + L2R
  ↓ ai-gateway ranker down
L1 Hybrid + RRF (no L2R)
  ↓ pgvector down
L2 Lexical + quality + recency
  ↓ OpenSearch slow/unavailable
L3 Postgres trigram LIKE on cached metadata
  ↓ everything down
L4 Static cached "trending" + 503 banner

Every response carries a meta.degraded field and emits a degradation.level telemetry attribute, so clients and dashboards can tell which rung of the ladder served the request.
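
The ladder can be read as a pure function of dependency health. The following is a minimal Python sketch, not the service's actual code. The four health flags and the `pick_level` / `response_meta` helpers are hypothetical, and it assumes "everything down" means both OpenSearch and Postgres are unavailable.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    L0 = 0  # full hybrid + L2R
    L1 = 1  # hybrid + RRF (no L2R)
    L2 = 2  # lexical + quality + recency
    L3 = 3  # Postgres trigram on cached metadata
    L4 = 4  # static cached "trending" + 503 banner


@dataclass(frozen=True)
class Health:
    # Hypothetical flags; a real service might derive them from circuit breakers.
    ranker_ok: bool
    pgvector_ok: bool
    opensearch_ok: bool
    postgres_ok: bool


def pick_level(h: Health) -> Level:
    """Walk the ladder from the most severe failure to the least severe."""
    if not h.opensearch_ok:
        # Assumption: L3 needs Postgres; without it only L4 is left.
        return Level.L3 if h.postgres_ok else Level.L4
    if not h.pgvector_ok:
        return Level.L2
    if not h.ranker_ok:
        return Level.L1
    return Level.L0


def response_meta(level: Level) -> dict:
    """Values for meta.degraded and the degradation.level attribute."""
    return {"degraded": level > Level.L0, "degradation.level": level.name}
```

Checking the most severe condition first keeps the function total: any combination of failures maps to exactly one level.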

3. Runbooks

3.1 OpenSearch red

Symptoms: cluster red, query errors surge.

Steps:

Check OpenSearch dashboard; identify lost shard(s).
If primary lost and replica promoted → wait for recovery.
If storage full → scale disk, or force-merge to reclaim space held by deleted documents (a force-merge needs temporary free space itself).
Set feature flag search.opensearch.degraded=true → clients fall to L3.
If unrecoverable → roll the alias back first (§3.14), then reindex the affected index from NATS.
File the postmortem template within 24h.

3.2 OpenSearch slow

Symptoms: p95 > 1s, CPU > 80%.

Steps:

Inspect slow-log for heavy queries.
Identify offending tenant/actor → tighter rate limit.
Reject leading-wildcard and regex queries via a short-term circuit breaker.
Scale data nodes or pre-warm caches.
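
The "reject heavy queries" step needs a cheap pre-check before the query reaches OpenSearch. The sketch below is one possible check, assuming Lucene-style query strings where a regex is written as `/.../` and wildcards are `*` or `?`. The `is_heavy` helper is hypothetical, and it will miss regexes that contain spaces.

```python
def is_heavy(query: str) -> bool:
    """True if any term is a /regex/ literal or starts with a wildcard."""
    for term in query.split():
        # Drop an optional "field:" prefix, e.g. "title:*foo" -> "*foo".
        value = term.rsplit(":", 1)[-1]
        if value.startswith(("*", "?")):
            return True
        if len(value) >= 2 and value.startswith("/") and value.endswith("/"):
            return True
    return False
```

Trailing wildcards such as `phone*` pass the check, since they can use the term dictionary prefix and are far cheaper than leading ones.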

3.3 Postgres down

Symptoms: 5xx on admin endpoints; projectors failing on inbox write.

Steps:

Stop indexer pods (let NATS buffer).
Ops fails over to replica.
Once Postgres back → re-enable indexer; backfill inbox rows from NATS replay.
If data corruption → restore from PITR.

3.4 Redis down

Symptoms: latency +30%, cache-miss rate 100%.

Steps:

Fall back to no-cache mode (feature flag search.cache.disabled=true).
Scale Redis cluster; monitor OOM.
Re-enable cache; warm hottest keys via synthetic probes.
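
The no-cache fallback amounts to a read-through wrapper that can be bypassed by flag and that treats an unreachable cache as a miss. The sketch below is illustrative only: `search_with_cache` is a hypothetical helper, the cache is modelled as a dict-like object, and `cache_disabled` stands in for the `search.cache.disabled` flag.

```python
def search_with_cache(query, cache, backend, cache_disabled=False):
    """Read-through lookup that never fails because of the cache."""
    if not cache_disabled:
        try:
            hit = cache.get(query)
            if hit is not None:
                return hit
        except ConnectionError:
            pass  # unreachable cache counts as a miss

    result = backend(query)

    if not cache_disabled:
        try:
            cache[query] = result
        except ConnectionError:
            pass  # failing to populate the cache must not fail the request
    return result
```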

3.5 ai-gateway down

Symptoms: embedding/ranker calls fail.

Steps:

Circuit breaker opens at a 50% error rate (windows in §4).
Semantic search disabled (meta.degraded=true).
Recommendations served from last cached snapshot; new generations blocked.
When gateway back → breaker half-open; gradual recovery.

3.6 Embedding model drift

Symptoms: search.embeddingModel.mismatch > 0.

Steps:

Verify ai-gateway emitted ai.embedding.model.rotated.v1.
Schedule rolling rebuild (14-day budget).
Dual-read both model vectors during cutover.
Post-cutover verification: NDCG@10 unchanged.
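
For the post-cutover check, NDCG@10 can be computed from graded relevance labels of the top results. The sketch below uses the linear-gain form of DCG (the exponential-gain variant is also common). The `unchanged` helper and its `0.01` tolerance are assumptions, not values from this document.

```python
import math


def dcg(rels, k=10):
    """Discounted cumulative gain over the first k relevance grades."""
    return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))


def ndcg(rels, k=10):
    """DCG normalised by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(rels, reverse=True), k)
    return dcg(rels, k) / ideal if ideal > 0 else 0.0


def unchanged(before, after, tol=0.01):
    """Assumed acceptance rule: scores may differ by at most `tol`."""
    return abs(after - before) <= tol
```

In practice the score would be averaged over a fixed evaluation query set, run once against the old model's vectors and once against the new ones.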

3.7 NATS down

Symptoms: consumer connection errors, publisher retries.

Steps:

Indexer buffers up to 10k events in RAM (outbox flush blocked).
API stays up (reads unaffected).
If NATS down > 30m → freeze outbox (no further internal events).
On recovery → drain outbox, resume consumers.
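
The in-RAM buffer has to be bounded so that a long outage cannot exhaust indexer memory. The sketch below caps the buffer at the 10k events mentioned above. The `EventBuffer` class is hypothetical, and rejecting new events when full (rather than dropping the oldest) is an assumption, since the behaviour at the limit is not stated here.

```python
from collections import deque


class EventBuffer:
    def __init__(self, capacity=10_000):
        self.capacity = capacity
        self._events = deque()

    def offer(self, event) -> bool:
        """Return False when full so the caller can apply backpressure."""
        if len(self._events) >= self.capacity:
            return False
        self._events.append(event)
        return True

    def drain(self, publish) -> int:
        """Publish buffered events in arrival order.

        An event is removed only after `publish` returns, so an exception
        from a still-broken connection leaves it at the head of the buffer.
        """
        sent = 0
        while self._events:
            publish(self._events[0])
            self._events.popleft()
            sent += 1
        return sent
```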

3.8 DLQ non-empty

Symptoms: DLQ alert.

Steps:

Query search.dlq — classify error.
Patch projector or schema validator.
POST /search/dlq/{id}/replay for each recoverable row.
If irrecoverable → document in incident report.

3.9 Cross-tenant leak

Symptoms: audit alert cross_tenant_match_detected.

Steps:

Immediate: enable search.paranoid.filter=true (adds double-check on every result).
Freeze new deploys.
Compliance + security paged.
Forensic: trace replay + tenant filter audit.
Fix + re-deploy.
DPO notification within 72h if confirmed.
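
The "double-check on every result" behind `search.paranoid.filter` can be pictured as a last-line filter applied after the search backend returns. This is a sketch under assumptions: the `tenant_id` field name on hits and the `paranoid_filter` / `on_leak` names are hypothetical.

```python
def paranoid_filter(hits, tenant_id, on_leak=None):
    """Keep only hits that belong to the requesting tenant.

    Hits with a different or missing tenant are dropped and reported
    through `on_leak`, e.g. to raise the audit alert.
    """
    if not tenant_id:
        # Without this guard a missing tenant would match hits that also lack one.
        raise ValueError("tenant_id is required")
    safe = []
    for hit in hits:
        if hit.get("tenant_id") == tenant_id:
            safe.append(hit)
        elif on_leak is not None:
            on_leak(hit)
    return safe
```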

3.10 Reindex stuck

Symptoms: job running > 2h, no progress.

Steps:

Inspect job phase via GET /search/reindex/{id}.
If snapshot phase stuck → check source service snapshot endpoint.
Kill job: ghasi-ops search reindex cancel <jobId>.
Alias still points to old index → clients unaffected.
Retry with smaller batch size or during low-traffic window.

3.11 Embedding budget blown

Symptoms: ai-gateway quota alert.

Steps:

Disable semantic for affected tenant.
Investigate cause (massive authoring burst? reindex loop?).
Raise quota or throttle source.
Resume semantic.

3.12 Stream poisoning

Symptoms: projector CPU spike, JSON parse errors.

Steps:

Route to DLQ after 5 retries (automatic).
Filter out poison subject temporarily (via feature flag).
Notify producer team.
Once producer patched → replay DLQ.
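
The automatic routing in the first step can be sketched as a bounded retry loop. Here "5 retries" is read as one initial attempt plus five retries. The `process`, `project`, and `dead_letter` names are hypothetical, and a real consumer would normally add backoff between attempts.

```python
def process(message, project, dead_letter, max_retries=5):
    """Run `project`; after the retries are exhausted, hand off to the DLQ.

    Returns True if the message was projected, False if it was dead-lettered.
    """
    last_error = None
    for _attempt in range(max_retries + 1):  # first try + retries
        try:
            project(message)
            return True
        except Exception as exc:  # poison payloads can fail in many ways
            last_error = exc
    dead_letter(message, repr(last_error))
    return False
```

Recording the last error alongside the message makes the later triage step (§3.8) possible without re-running the projector.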

3.13 Tenant erasure failure

Symptoms: audit mismatch — user doc still queryable after 30d.

Steps:

Force targeted delete via admin API.
Confirm pgvector row gone.
Confirm Redis cache invalidated.
File DPO report.

3.14 Alias swap failure

Steps:

Rollback alias to previous index.
Investigate swap failure (permission, cluster health).
Retry swap after issue fix.
If rollback also fails → escalate to SRE on-call.
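
OpenSearch applies all actions in a single `POST /_aliases` request atomically, so both the swap and its rollback can be expressed as one remove/add pair. The sketch below only builds the request body. The alias and index names are made up, and sending the request is left to whatever client the service uses.

```python
def alias_swap_actions(alias, from_index, to_index):
    """Body for POST /_aliases that moves `alias` in one atomic step."""
    return {
        "actions": [
            {"remove": {"index": from_index, "alias": alias}},
            {"add": {"index": to_index, "alias": alias}},
        ]
    }


# Swap to the new index, and the matching rollback (hypothetical names).
swap = alias_swap_actions("docs-tenant-a", "docs-tenant-a-v1", "docs-tenant-a-v2")
rollback = alias_swap_actions("docs-tenant-a", "docs-tenant-a-v2", "docs-tenant-a-v1")
```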

3.15 Heavy query

Steps:

Identify caller → apply a soft-limit lock to the account.
Cancel long-running query (OpenSearch task API).
Add fingerprint to blocklist; tighten tenant limits.
If abuse → notify tenant admin; escalate.
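
A query fingerprint for the blocklist can be as simple as a hash of the normalised query text. The sketch below assumes that collapsing whitespace is an acceptable normalisation. The `fingerprint` and `is_blocked` helpers and the 16-character truncation are illustrative choices, not the service's actual scheme.

```python
import hashlib
import re


def fingerprint(query: str) -> str:
    """Stable short identifier for a query, ignoring whitespace differences."""
    normalized = re.sub(r"\s+", " ", query.strip())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16]


def is_blocked(query: str, blocklist: set) -> bool:
    """True if the query's fingerprint has been added to the blocklist."""
    return fingerprint(query) in blocklist
```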

3.16 Stale revocation

Steps:

Trigger a targeted re-projection via the admin API: POST /search/rebuild/{docId}.
Confirm the revoked certificate no longer appears in search results.
Open bug on certification-service consumer lag.

4. Circuit Breakers

| Downstream | Threshold | Open for | Half-open probe |
|---|---|---|---|
| OpenSearch | 50% errors in 30s | 60s | 1 req/5s |
| pgvector (ai-gateway) | 50% errors in 30s | 60s | 1 req/5s |
| Redis | 80% errors in 10s | 30s | 1 req/3s |
| ai-gateway LLM | 50% errors in 60s | 120s | 1 req/10s |
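
The four columns map directly onto the parameters of a sliding-window breaker. The sketch below is one way to model them, not the service's implementation. The `Breaker` class, the `min_calls` guard against tiny samples, and passing `now` explicitly (to keep the logic testable) are all assumptions.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Breaker:
    threshold: float      # error ratio that opens the breaker, e.g. 0.5
    window_s: float       # sliding window length
    open_s: float         # "Open for"
    probe_every_s: float  # spacing of half-open probes
    min_calls: int = 10   # assumption: do not trip on very small samples
    _calls: deque = field(default_factory=deque)  # (timestamp, ok) pairs
    _opened_at: Optional[float] = None
    _last_probe: Optional[float] = None

    def allow(self, now: float) -> bool:
        if self._opened_at is None:
            return True   # closed
        if now - self._opened_at < self.open_s:
            return False  # open
        # Half-open: let one probe through per interval.
        if self._last_probe is None or now - self._last_probe >= self.probe_every_s:
            self._last_probe = now
            return True
        return False

    def record(self, now: float, ok: bool) -> None:
        if self._opened_at is not None:
            if now - self._opened_at >= self.open_s:  # outcome of a probe
                if ok:
                    self._opened_at = self._last_probe = None
                    self._calls.clear()
                else:
                    self._opened_at, self._last_probe = now, None  # re-open
            return
        self._calls.append((now, ok))
        while self._calls and now - self._calls[0][0] > self.window_s:
            self._calls.popleft()
        errors = sum(1 for _, good in self._calls if not good)
        if len(self._calls) >= self.min_calls and errors / len(self._calls) >= self.threshold:
            self._opened_at = now
```

With this shape, the OpenSearch row would be `Breaker(0.5, 30, 60, 5)` and the Redis row `Breaker(0.8, 10, 30, 3)`.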

5. Postmortem Triggers

| Scenario | Required postmortem |
|---|---|
| P0/P1 user-facing | yes, within 5 days |
| Cross-tenant leak | yes, within 48h |
| SLO burn > budget/quarter | yes |
| Data loss > 0 documents | yes |

6. Game Days

A monthly chaos exercise rotates through the F1–F16 scenarios in staging; once a quarter it also includes a DR drill. See TESTING_STRATEGY.md §9.

7. Known Gotchas

OpenSearch _reindex vs custom: built-in _reindex ignores our PII sanitizer — always use the projector pipeline.
Alias swap race: never run two reindex jobs for the same tenant concurrently (enforced by Redis mutex).
Embedding timeouts: must be < 300ms per call; longer timeouts risk indexer backpressure.
Locale drift: documents must always include primary locale; missing locale silently breaks ranking.
