Failure Modes
:::info Source
Sourced from services/search-service/FAILURE_MODES.md in the documentation repo.
:::
Complements OBSERVABILITY.md and DEPLOYMENT_TOPOLOGY.md.
1. Failure Matrix
| # | Mode | Detection | Blast radius | Mitigation | Runbook |
|---|---|---|---|---|---|
| F1 | OpenSearch cluster red | cluster.health=red alert | All search | Degraded path + fallback to Postgres trigram | §3.1 |
| F2 | OpenSearch slow queries | p95 latency alert | Subset of queries | Circuit breaker + reject heavy queries | §3.2 |
| F3 | Postgres unavailable | connection errors | Admin/reindex paths; metadata | Serve from Redis cache; block reindex | §3.3 |
| F4 | Redis unavailable | connection errors | Cache miss → latency ↑ | Disable cache path; rely on OpenSearch | §3.4 |
| F5 | ai-gateway unreachable | HTTP 5xx/timeout | Semantic + recs | Lexical-only fallback; disable recs | §3.5 |
| F6 | Embedding model drift | model mismatch alert | Mixed quality | Queued rebuild; read both versions | §3.6 |
| F7 | NATS disconnected | consumer lag | Indexing stalls | Retry + queue in-memory; alert at 2m | §3.7 |
| F8 | DLQ non-empty | DLQ depth > 0 | Lost updates for some docs | Triage + replay | §3.8 |
| F9 | Cross-tenant leak suspected | audit alert | Compliance | Emergency filter + forensic review | §3.9 |
| F10 | Reindex stuck | job running > 2h | One tenant | Kill + rollback to previous alias | §3.10 |
| F11 | Embedding budget blown | ai-gateway quota alert | Semantic disabled | Throttle to lexical; raise ticket | §3.11 |
| F12 | Document stream poisoning | payload outside schema | Projector crash | Quarantine; DLQ; patch projector | §3.12 |
| F13 | Tenant erasure failure | audit mismatch | GDPR non-compliance | Manual purge + DPO notification | §3.13 |
| F14 | Alias swap failure | reindex final step fails | Intermittent 5xx | Rollback alias; retry | §3.14 |
| F15 | Over-capacity query (regex) | long-running span | Per tenant | Cancel + rate limit harder | §3.15 |
| F16 | Certificate revocation not indexed | stale cert visible | Per user (compliance) | Force targeted reindex | §3.16 |
2. Degradation Ladder
L0 Full hybrid + L2R
↓ ai-gateway ranker down
L1 Hybrid + RRF (no L2R)
↓ pgvector down
L2 Lexical + quality + recency
↓ OpenSearch slow/unavailable
L3 Postgres trigram LIKE on cached metadata
↓ everything down
L4 Static cached "trending" + 503 banner
Every response emits meta.degraded and a degradation.level telemetry attribute.
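The ladder can be read as a pure function of dependency health. The following is a minimal sketch under that reading; the `Health` record, its field names, and the `select_level` and `response_meta` helpers are hypothetical and not taken from the service code, and the shape of `meta.degraded` (a boolean) is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Health:
    """Hypothetical health snapshot; field names are illustrative only."""
    ranker_ok: bool      # ai-gateway L2R ranker reachable
    pgvector_ok: bool    # vector store reachable
    opensearch_ok: bool  # OpenSearch healthy and not slow
    trigram_ok: bool     # Postgres trigram fallback path usable


def select_level(h: Health) -> int:
    """Return the degradation level (0-4), walking the ladder top-down."""
    if not h.opensearch_ok:
        # Without OpenSearch only the trigram path (L3) or static content (L4) remains.
        return 3 if h.trigram_ok else 4
    if not h.pgvector_ok:
        return 2  # lexical + quality + recency
    if not h.ranker_ok:
        return 1  # hybrid + RRF, no L2R
    return 0      # full hybrid + L2R


def response_meta(level: int) -> dict:
    """Build response metadata; treating meta.degraded as a boolean is an assumption."""
    return {"degraded": level > 0, "degradation.level": level}
```

Keeping the selection in one function makes it straightforward to unit-test every rung before a game day.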
3. Runbooks
3.1 OpenSearch red
Symptoms: cluster red, query errors surge.
Steps:
- Check OpenSearch dashboard; identify lost shard(s).
- If primary lost and replica promoted → wait for recovery.
- If storage full → scale disk OR run force-merge.
- Set feature flag `search.opensearch.degraded=true` → clients fall to L3.
- If unrecoverable → reindex affected index from NATS (see §3.14 rollback first).
- Post-mortem template filed within 24h.
3.2 OpenSearch slow
Symptoms: p95 > 1s, CPU > 80%.
Steps:
- Inspect slow-log for heavy queries.
- Identify offending tenant/actor → tighter rate limit.
- Reject queries with wildcard-leading or regex via short-term circuit.
- Scale data nodes or pre-warm caches.
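The rejection step above could be approximated with a cheap pre-flight check. This sketch assumes queries arrive as Lucene-style query strings, where regular expressions are wrapped in forward slashes and `*` or `?` act as wildcards. The `is_heavy` helper is hypothetical, and the check is a heuristic that can produce false positives (for example, on text containing two literal slashes).

```python
import re

# Lucene query-string syntax wraps regular expressions in forward slashes,
# e.g. name:/joh?n(ath[oa]n)/ . Escaped slashes (\/) are ignored.
_REGEX_TERM = re.compile(r"(?<!\\)/.+?(?<!\\)/")


def is_heavy(query: str) -> bool:
    """Heuristic: flag regex terms and terms that start with a wildcard."""
    if _REGEX_TERM.search(query):
        return True
    for term in query.split():
        # Strip an optional field prefix such as "title:" before checking.
        value = term.rsplit(":", 1)[-1]
        if value.startswith(("*", "?")):
            return True
    return False
```

Trailing wildcards such as `foo*` pass the check because they can use the term dictionary prefix; only leading wildcards and regex terms are flagged.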
3.3 Postgres down
Symptoms: 5xx on admin endpoints; projectors failing on inbox write.
Steps:
- Stop indexer pods (let NATS buffer).
- Ops fails over to replica.
- Once Postgres back → re-enable indexer; backfill inbox rows from NATS replay.
- If data corruption → restore from PITR.
3.4 Redis down
Symptoms: latency +30%, cache-miss rate 100%.
Steps:
- Fall back to no-cache mode (feature flag `search.cache.disabled=true`).
- Scale Redis cluster; monitor OOM.
- Re-enable cache; warm hottest keys via synthetic probes.
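One way to picture no-cache mode is a read-through wrapper that the flag bypasses entirely. The sketch below is illustrative only: the `Cache` protocol, `cached_search`, and the choice to fail open on `ConnectionError` are assumptions, not the service's actual cache client.

```python
from typing import Callable, List, Optional, Protocol


class Cache(Protocol):
    """Hypothetical cache interface; a real Redis client would be adapted to it."""
    def get(self, key: str) -> Optional[List[str]]: ...
    def set(self, key: str, value: List[str]) -> None: ...


def cached_search(query: str, cache: Cache,
                  search: Callable[[str], List[str]],
                  cache_disabled: bool) -> List[str]:
    """Read-through cache that is skipped when the flag is set."""
    if cache_disabled:
        return search(query)      # no cache reads or writes at all
    try:
        hit = cache.get(query)
    except ConnectionError:
        return search(query)      # fail open: treat an unreachable cache as a miss
    if hit is not None:
        return hit
    result = search(query)
    try:
        cache.set(query, result)
    except ConnectionError:
        pass                      # a failed write must not fail the request
    return result
```

With the flag set, requests skip the cache calls altogether instead of waiting on connection errors, which is the reason to flip it during an outage.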
3.5 ai-gateway down
Symptoms: embedding/ranker calls fail.
Steps:
- Circuit breaker opens after 50% error rate.
- Semantic search disabled (`degraded=true`).
- Recommendations served from last cached snapshot; new generations blocked.
- When gateway back → breaker half-open; gradual recovery.
3.6 Embedding model drift
Symptoms: search.embeddingModel.mismatch > 0.
Steps:
- Verify ai-gateway emitted `ai.embedding.model.rotated.v1`.
- Schedule rolling rebuild (14-day budget).
- Dual-read both model vectors during cutover.
- Post-cutover verification: NDCG@10 unchanged.
3.7 NATS down
Symptoms: consumer connection errors, publisher retries.
Steps:
- Indexer buffers up to 10k events in RAM (outbox flush blocked).
- API stays up (reads unaffected).
- If NATS down > 30m → freeze outbox (no further internal events).
- On recovery → drain outbox, resume consumers.
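The 10k in-RAM buffer can be sketched as a bounded FIFO. The runbook does not say what happens once the buffer is full; this sketch assumes new events are refused (so the caller can leave them in the outbox) rather than dropped. `EventBuffer` and its methods are hypothetical names.

```python
from collections import deque
from typing import Deque, Iterator


class EventBuffer:
    """Bounded FIFO for events that cannot be published while NATS is down."""

    def __init__(self, capacity: int = 10_000) -> None:
        self.capacity = capacity
        self._events: Deque[bytes] = deque()

    def offer(self, event: bytes) -> bool:
        """Accept the event if there is room; return False when full (assumed policy)."""
        if len(self._events) >= self.capacity:
            return False
        self._events.append(event)
        return True

    def drain(self) -> Iterator[bytes]:
        """Yield buffered events oldest-first, removing each as it is yielded."""
        while self._events:
            yield self._events.popleft()
```

On recovery, the indexer would iterate `drain()` and publish each event before resuming normal consumption, which preserves the original order.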
3.8 DLQ non-empty
Symptoms: DLQ alert.
Steps:
- Query `search.dlq` and classify the error.
- Patch projector or schema validator.
- `POST /search/dlq/{id}/replay` for each recoverable row.
- If irrecoverable → document in incident report.
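The triage-and-replay loop above might look like the following sketch. The row shape (`id`, `recoverable`) and the injected `post` callable, which takes a path and returns an HTTP status code, are assumptions made for illustration; only the replay path itself comes from the runbook.

```python
from typing import Callable, Dict, Iterable, List


def replay_recoverable(rows: Iterable[Dict[str, object]],
                       post: Callable[[str], int]) -> Dict[str, List[str]]:
    """Replay rows classified as recoverable and report what happened to each."""
    report: Dict[str, List[str]] = {"replayed": [], "failed": [], "irrecoverable": []}
    for row in rows:
        row_id = str(row["id"])
        if not row.get("recoverable", False):
            # Not replayed; these ids belong in the incident report.
            report["irrecoverable"].append(row_id)
            continue
        status = post(f"/search/dlq/{row_id}/replay")
        bucket = "replayed" if 200 <= status < 300 else "failed"
        report[bucket].append(row_id)
    return report
```

The returned report gives the incident write-up a ready list of irrecoverable ids and of replays that need a second attempt.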
3.9 Cross-tenant leak
Symptoms: audit alert cross_tenant_match_detected.
Steps:
- Immediate: enable `search.paranoid.filter=true` (adds double-check on every result).
- Freeze new deploys.
- Compliance + security paged.
- Forensic: trace replay + tenant filter audit.
- Fix + re-deploy.
- DPO notification within 72h if confirmed.
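The double-check enabled by the paranoid filter can be pictured as a last-line tenant comparison on every hit before it leaves the service. This is a sketch under assumptions: the hit shape and the `tenant_id` field name are hypothetical, and hits without a tenant are dropped (fail closed).

```python
from typing import Dict, List, Tuple


def paranoid_filter(tenant_id: str,
                    hits: List[Dict[str, str]]) -> Tuple[List[Dict[str, str]], int]:
    """Keep only hits owned by the requesting tenant; return them with the drop count."""
    safe = [h for h in hits if h.get("tenant_id") == tenant_id]
    # Any dropped hit is evidence for the forensic review, so the count is returned
    # to the caller instead of being silently discarded.
    return safe, len(hits) - len(safe)
```

A non-zero drop count is the signal worth logging with the trace id, since it identifies which queries bypassed the primary tenant filter.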
3.10 Reindex stuck
Symptoms: job running > 2h, no progress.
Steps:
- Inspect job phase via `GET /search/reindex/{id}`.
- If `snapshot` phase stuck → check source service snapshot endpoint.
- Kill job: `ghasi-ops search reindex cancel <jobId>`.
- Alias still points to old index → clients unaffected.
- Retry with smaller batch size or during low-traffic window.
3.11 Embedding budget blown
Symptoms: ai-gateway quota alert.
Steps:
- Disable semantic for affected tenant.
- Investigate cause (massive authoring burst? reindex loop?).
- Raise quota or throttle source.
- Resume semantic.
3.12 Stream poisoning
Symptoms: projector CPU spike, JSON parse errors.
Steps:
- Route to DLQ after 5 retries (automatic).
- Filter out poison subject temporarily (via feature flag).
- Notify producer team.
- Once producer patched → replay DLQ.
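The automatic routing in the first step can be sketched as a bounded retry loop that parks the message once retries are exhausted. The `project_with_dlq` helper, the list-backed DLQ, and the record fields are hypothetical; the sketch also omits backoff between attempts, which a real projector would likely add.

```python
from typing import Callable, List

MAX_RETRIES = 5  # from the runbook: route to DLQ after 5 retries


def project_with_dlq(payload: bytes,
                     project: Callable[[bytes], None],
                     dlq: List[dict]) -> bool:
    """Try to project a message; park it in the DLQ once retries are exhausted."""
    attempts = 1 + MAX_RETRIES  # one initial attempt plus the retries
    last_error = ""
    for _ in range(attempts):
        try:
            project(payload)
            return True
        except Exception as exc:  # a real projector would catch narrower errors
            last_error = repr(exc)
    # Keep the payload and the last error so the row can be triaged and replayed.
    dlq.append({"payload": payload, "error": last_error, "attempts": attempts})
    return False
```

Because a poison message fails deterministically, capping the attempts is what stops the CPU spike; the DLQ record preserves enough context for the later replay.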
3.13 Tenant erasure failure
Symptoms: audit mismatch — user doc still queryable after 30d.
Steps:
- Force targeted delete via admin API.
- Confirm pgvector row gone.
- Confirm Redis cache invalidated.
- File DPO report.
3.14 Alias swap failure
Steps:
- Rollback alias to previous index.
- Investigate swap failure (permission, cluster health).
- Retry swap after issue fix.
- If rollback also fails → escalate to SRE on-call.
3.15 Heavy query
Steps:
- Identify caller → lock account soft-limit.
- Cancel long-running query (OpenSearch task API).
- Add fingerprint to blocklist; tighten tenant limits.
- If abuse → notify tenant admin; escalate.
3.16 Stale revocation
Steps:
- Emit targeted re-projection via admin API `POST /search/rebuild/{docId}`.
- Confirm removed from search.
- Open bug on certification-service consumer lag.
4. Circuit Breakers
| Downstream | Threshold | Open for | Half-open probe |
|---|---|---|---|
| OpenSearch | 50% errors in 30s | 60s | 1 req/5s |
| pgvector (ai-gateway) | 50% errors in 30s | 60s | 1 req/5s |
| Redis | 80% errors in 10s | 30s | 1 req/3s |
| ai-gateway LLM | 50% errors in 60s | 120s | 1 req/10s |
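To make the table's columns concrete, here is a minimal sketch of a sliding-window breaker with an open period and a rate-limited half-open probe. It is not the service's implementation: the class, the `min_calls` guard against tripping on tiny samples, and the injectable clock are all assumptions added for illustration.

```python
import time
from collections import deque
from typing import Callable, Deque, Optional, Tuple


class CircuitBreaker:
    """Sliding-window error-rate breaker (illustrative sketch only)."""

    def __init__(self, threshold: float, window_s: float, open_s: float,
                 probe_interval_s: float, min_calls: int = 10,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.threshold = threshold                # e.g. 0.5 for "50% errors"
        self.window_s = window_s                  # e.g. 30 for "in 30s"
        self.open_s = open_s                      # "Open for" column
        self.probe_interval_s = probe_interval_s  # "1 req/5s" -> 5
        self.min_calls = min_calls                # assumption, not in the table
        self.clock = clock
        self.calls: Deque[Tuple[float, bool]] = deque()  # (timestamp, succeeded)
        self.opened_at: Optional[float] = None
        self.last_probe_at: Optional[float] = None

    def state(self) -> str:
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at < self.open_s:
            return "open"
        return "half-open"

    def allow(self) -> bool:
        """Return True if a request may be sent downstream right now."""
        state = self.state()
        if state != "half-open":
            return state == "closed"
        # Half-open: let through at most one probe per probe interval.
        now = self.clock()
        if self.last_probe_at is None or now - self.last_probe_at >= self.probe_interval_s:
            self.last_probe_at = now
            return True
        return False

    def record(self, ok: bool) -> None:
        """Record the outcome of a request that was allowed through."""
        now = self.clock()
        state = self.state()
        if state == "open":
            return                      # nothing should have been sent; ignore
        if state == "half-open":
            if ok:
                self.opened_at = None   # probe succeeded: close and start fresh
                self.last_probe_at = None
                self.calls.clear()
            else:
                self.opened_at = now    # probe failed: stay open for another open_s
            return
        self.calls.append((now, ok))
        while self.calls and now - self.calls[0][0] > self.window_s:
            self.calls.popleft()
        failures = sum(1 for _, succeeded in self.calls if not succeeded)
        if len(self.calls) >= self.min_calls and failures / len(self.calls) >= self.threshold:
            self.opened_at = now
```

Under these assumptions the OpenSearch row would correspond to `CircuitBreaker(threshold=0.5, window_s=30, open_s=60, probe_interval_s=5)`.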
5. Postmortem Triggers
| Scenario | Required postmortem |
|---|---|
| P0/P1 user-facing | yes, within 5 days |
| Cross-tenant leak | yes, within 48h |
| SLO burn > budget/quarter | yes |
| Data loss > 0 documents | yes |
6. Game Days
A monthly chaos exercise rotates through the F1–F16 scenarios in staging; once per quarter the exercise also includes a DR drill. See TESTING_STRATEGY.md §9.
7. Known Gotchas
- OpenSearch `_reindex` vs custom: built-in `_reindex` ignores our PII sanitizer; always use the projector pipeline.
- Alias swap race: never run two reindex jobs for the same tenant concurrently (enforced by Redis mutex).
- Embedding timeouts: must be < 300ms per call; longer timeouts risk indexer backpressure.
- Locale drift: documents must always include primary locale; missing locale silently breaks ranking.
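The Redis mutex mentioned in the alias swap gotcha is commonly built on `SET` with the `NX` and `EX` options. The sketch below assumes that approach; the key format, the TTL, and the `try_acquire_reindex_lock` helper are hypothetical, and the `RedisLike` protocol only mirrors the `set(name, value, nx=..., ex=...)` call shape that redis-py exposes.

```python
import uuid
from typing import Optional, Protocol


class RedisLike(Protocol):
    """Subset of a Redis client used here; matches redis-py's set() keyword names."""
    def set(self, name: str, value: str, *,
            nx: bool = False, ex: Optional[int] = None) -> Optional[bool]: ...


def try_acquire_reindex_lock(client: RedisLike, tenant_id: str,
                             ttl_s: int = 3 * 60 * 60) -> Optional[str]:
    """Return a lock token if this caller now owns the tenant's reindex lock."""
    token = uuid.uuid4().hex                   # identifies the owner for a safe release
    key = f"search:reindex:lock:{tenant_id}"   # hypothetical key format
    # NX: set only if the key does not exist. EX: expire so a crashed job cannot
    # hold the lock forever. The TTL value here is an assumption.
    acquired = client.set(key, token, nx=True, ex=ttl_s)
    return token if acquired else None
```

A release would need to delete the key only when it still holds the caller's token, which is typically done atomically in a small Lua script and is omitted here.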