Failure Modes

:::info Source
Sourced from services/search-service/FAILURE_MODES.md in the documentation repo.
:::

Complements OBSERVABILITY.md and DEPLOYMENT_TOPOLOGY.md.

1. Failure Matrix

| # | Mode | Detection | Blast radius | Mitigation | Runbook |
|---|------|-----------|--------------|------------|---------|
| F1 | OpenSearch cluster red | cluster.health=red alert | All search | Degraded path + fallback to Postgres trigram | §3.1 |
| F2 | OpenSearch slow queries | p95 latency alert | Subset of queries | Circuit breaker + reject heavy queries | §3.2 |
| F3 | Postgres unavailable | connection errors | Admin/reindex paths; metadata | Serve from Redis cache; block reindex | §3.3 |
| F4 | Redis unavailable | connection errors | Cache miss → latency ↑ | Disable cache path; rely on OS | §3.4 |
| F5 | ai-gateway unreachable | HTTP 5xx/timeout | Semantic + recs | Lexical-only fallback; disable recs | §3.5 |
| F6 | Embedding model drift | model mismatch alert | Mixed quality | Queued rebuild; read both versions | §3.6 |
| F7 | NATS disconnected | consumer lag | Indexing stalls | Retry + queue in-memory; alert at 2m | §3.7 |
| F8 | DLQ non-empty | DLQ depth > 0 | Lost updates for some docs | Triage + replay | §3.8 |
| F9 | Cross-tenant leak suspected | audit alert | Compliance | Emergency filter + forensic review | §3.9 |
| F10 | Reindex stuck | job running > 2h | One tenant | Kill + rollback to previous alias | §3.10 |
| F11 | Embedding budget blown | ai-gateway quota alert | Semantic disabled | Throttle to lexical; raise ticket | §3.11 |
| F12 | Document stream poisoning | payload outside schema | Projector crash | Quarantine; DLQ; patch projector | §3.12 |
| F13 | Tenant erasure failure | audit mismatch | GDPR non-compliance | Manual purge + DPO notification | §3.13 |
| F14 | Alias swap failure | reindex final step fails | Intermittent 5xx | Rollback alias; retry | §3.14 |
| F15 | Over-capacity query (regex) | long-running span | Per tenant | Cancel + rate limit harder | §3.15 |
| F16 | Certificate revocation not indexed | stale cert visible | Per user (compliance) | Force targeted reindex | §3.16 |

2. Degradation Ladder

L0 Full hybrid + L2R
↓ ai-gateway ranker down
L1 Hybrid + RRF (no L2R)
↓ pgvector down
L2 Lexical + quality + recency
↓ OpenSearch slow/unavailable
L3 Postgres trigram LIKE on cached metadata
↓ everything down
L4 Static cached "trending" + 503 banner

Every response emits meta.degraded and a degradation.level telemetry attribute.
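
As a rough illustration of how the ladder could be evaluated per request, here is a minimal Python sketch. The `Health` flags, the `pick_level` and `response_meta` names, and the rule that L3 requires Postgres to be reachable are all assumptions made for this example, not the service's actual implementation.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    L0 = 0  # full hybrid + L2R
    L1 = 1  # hybrid + RRF (no L2R)
    L2 = 2  # lexical + quality + recency
    L3 = 3  # Postgres trigram LIKE on cached metadata
    L4 = 4  # static cached "trending" + 503 banner


@dataclass(frozen=True)
class Health:
    # Hypothetical dependency flags; a real service might derive them
    # from circuit-breaker state or feature flags.
    ranker_ok: bool
    pgvector_ok: bool
    opensearch_ok: bool
    postgres_ok: bool


def pick_level(h: Health) -> Level:
    """Walk the ladder from the most severe failure down to the mildest."""
    if not h.opensearch_ok:
        # Assumption: L3 needs Postgres; without it only L4 remains.
        return Level.L3 if h.postgres_ok else Level.L4
    if not h.pgvector_ok:
        return Level.L2
    if not h.ranker_ok:
        return Level.L1
    return Level.L0


def response_meta(level: Level) -> dict:
    """Build the degraded flag and the telemetry attribute for a response."""
    return {
        "meta": {"degraded": level > Level.L0},
        "degradation.level": level.name,
    }
```

Checking the most severe condition first keeps the function a straight walk down the ladder, so a request never lands on a level whose dependencies are unavailable.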

3. Runbooks

3.1 OpenSearch red

Symptoms: cluster red, query errors surge.

Steps:

  1. Check OpenSearch dashboard; identify lost shard(s).
  2. If primary lost and replica promoted → wait for recovery.
  3. If storage full → scale disk OR run force-merge.
  4. Set feature flag search.opensearch.degraded=true → clients fall to L3.
  5. If unrecoverable → roll back the alias first (see §3.14), then reindex the affected index from NATS.
  6. File the post-mortem template within 24h.

3.2 OpenSearch slow

Symptoms: p95 > 1s, CPU > 80%.

Steps:

  1. Inspect slow-log for heavy queries.
  2. Identify offending tenant/actor → tighter rate limit.
  3. Reject leading-wildcard and regex queries via a short-term circuit breaker.
  4. Scale data nodes or pre-warm caches.
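
The rejection in step 3 could be approximated with a cheap pre-flight check like the one below. This is a sketch only: it assumes queries arrive as query_string-style text where `*` and `?` are wildcards and `/.../` delimits a regex, and `is_heavy` is a hypothetical helper name.

```python
import re

# A wildcard counts as "leading" when it starts a term: at the start of
# the query, or right after whitespace, a field separator, or a paren.
_LEADING_WILDCARD = re.compile(r"(^|[\s:(])[*?]")
# A regex literal is a /.../ span that starts a term.
_REGEX_LITERAL = re.compile(r"(^|[\s:(])/.+/")


def is_heavy(query: str) -> bool:
    """Return True if the query should be rejected while the circuit is active."""
    return bool(_LEADING_WILDCARD.search(query) or _REGEX_LITERAL.search(query))
```

Trailing wildcards such as `foo*` pass the check, since prefix queries are far cheaper than leading-wildcard scans. Note that this heuristic also flags a match-all `*:*`, which may or may not be desirable.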

3.3 Postgres down

Symptoms: 5xx on admin endpoints; projectors failing on inbox write.

Steps:

  1. Stop indexer pods (let NATS buffer).
  2. Ops fails over to replica.
  3. Once Postgres back → re-enable indexer; backfill inbox rows from NATS replay.
  4. If data corruption → restore from PITR.

3.4 Redis down

Symptoms: latency +30%, cache-miss rate 100%.

Steps:

  1. Fall back to no-cache mode (feature flag search.cache.disabled=true).
  2. Scale Redis cluster; monitor OOM.
  3. Re-enable cache; warm hottest keys via synthetic probes.

3.5 ai-gateway down

Symptoms: embedding/ranker calls fail.

Steps:

  1. Circuit breaker opens once the error rate reaches 50% (windows in §4).
  2. Semantic search disabled (degraded=true).
  3. Recommendations served from last cached snapshot; new generations blocked.
  4. When gateway back → breaker half-open; gradual recovery.

3.6 Embedding model drift

Symptoms: search.embeddingModel.mismatch > 0.

Steps:

  1. Verify ai-gateway emitted ai.embedding.model.rotated.v1.
  2. Schedule rolling rebuild (14-day budget).
  3. Dual-read both model vectors during cutover.
  4. Post-cutover verification: NDCG@10 unchanged.
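
For the step 4 check, NDCG@10 can be computed from graded relevance judgments as sketched below. The exponential gain `2^rel - 1` is one common convention and is assumed here. The ideal ranking is taken over the returned list only, which is a simplification. The document does not say which variant the team's evaluation harness uses.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """Discounted cumulative gain over the top-k results, in rank order."""
    return sum(
        (2 ** rel - 1) / math.log2(rank + 2)
        for rank, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0
```

In practice "unchanged" would need a tolerance agreed by the team, for example comparing the mean NDCG@10 over a fixed query set before and after cutover.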

3.7 NATS down

Symptoms: consumer connection errors, publisher retries.

Steps:

  1. Indexer buffers up to 10k events in RAM (outbox flush blocked).
  2. API stays up (reads unaffected).
  3. If NATS down > 30m → freeze outbox (no further internal events).
  4. On recovery → drain outbox, resume consumers.
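
The buffering and freeze behaviour in steps 1 to 4 might look roughly like the following. `BoundedEventBuffer` is a hypothetical class. Rejecting events when the buffer is full (rather than dropping the oldest) is an assumption, and timestamps are passed in explicitly to keep the sketch testable.

```python
from collections import deque


class BoundedEventBuffer:
    """In-memory buffer used while NATS is unreachable."""

    def __init__(self, capacity: int = 10_000, freeze_after_s: float = 30 * 60):
        self._events = deque()
        self._capacity = capacity              # "up to 10k events in RAM"
        self._freeze_after_s = freeze_after_s  # "NATS down > 30m"
        self._down_since = None

    def mark_down(self, now: float) -> None:
        """Record the start of the outage; repeated calls keep the first timestamp."""
        if self._down_since is None:
            self._down_since = now

    def frozen(self, now: float) -> bool:
        """True once the outage has lasted longer than the freeze threshold."""
        return (
            self._down_since is not None
            and now - self._down_since > self._freeze_after_s
        )

    def offer(self, event, now: float) -> bool:
        """Buffer an event; return False if frozen or full so the caller can back off."""
        if self.frozen(now) or len(self._events) >= self._capacity:
            return False
        self._events.append(event)
        return True

    def drain(self) -> list:
        """On recovery: clear the outage marker and hand back events in FIFO order."""
        self._down_since = None
        drained = list(self._events)
        self._events.clear()
        return drained
```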

3.8 DLQ non-empty

Symptoms: DLQ alert.

Steps:

  1. Query search.dlq — classify error.
  2. Patch projector or schema validator.
  3. POST /search/dlq/{id}/replay for each recoverable row.
  4. If irrecoverable → document in incident report.
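
A triage-and-replay loop for steps 1 to 4 could be sketched as follows. The `error_class` field, the set of recoverable classes, and the injected `post` callable are all hypothetical. Only the `/search/dlq/{id}/replay` path comes from the runbook.

```python
from typing import Callable, Iterable

# Hypothetical error classes considered safe to replay once the
# projector or schema validator has been patched.
RECOVERABLE = {"schema_validation", "projector_bug", "timeout"}


def replay_recoverable(rows: Iterable[dict], post: Callable[[str], object]) -> dict:
    """Replay recoverable DLQ rows; collect the rest for the incident report."""
    replayed, irrecoverable = [], []
    for row in rows:
        if row["error_class"] in RECOVERABLE:
            post(f"/search/dlq/{row['id']}/replay")
            replayed.append(row["id"])
        else:
            irrecoverable.append(row["id"])
    return {"replayed": replayed, "incident_report": irrecoverable}
```

Injecting `post` keeps the sketch independent of any particular HTTP client. In real use it would wrap an authenticated POST against the search service.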

3.9 Cross-tenant leak

Symptoms: audit alert cross_tenant_match_detected.

Steps:

  1. Immediate: enable search.paranoid.filter=true (adds double-check on every result).
  2. Freeze new deploys.
  3. Compliance + security paged.
  4. Forensic: trace replay + tenant filter audit.
  5. Fix + re-deploy.
  6. DPO notification within 72h if confirmed.
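
The double-check that `search.paranoid.filter` adds in step 1 can be pictured as a final pass over results before they leave the service. The sketch below assumes each hit carries a `tenant_id` field and fails closed on hits that lack one. `paranoid_filter` and `on_violation` are illustrative names.

```python
from typing import Callable, Iterable


def paranoid_filter(
    results: Iterable[dict],
    caller_tenant_id: str,
    on_violation: Callable[[dict], None],
) -> list:
    """Drop any hit whose tenant does not match the caller and report it."""
    safe = []
    for doc in results:
        if doc.get("tenant_id") == caller_tenant_id:
            safe.append(doc)
        else:
            # Missing or foreign tenant_id: never return it; record it
            # for the forensic review.
            on_violation(doc)
    return safe
```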

3.10 Reindex stuck

Symptoms: job running > 2h, no progress.

Steps:

  1. Inspect job phase via GET /search/reindex/{id}.
  2. If snapshot phase stuck → check source service snapshot endpoint.
  3. Kill job: ghasi-ops search reindex cancel <jobId>.
  4. Alias still points to old index → clients unaffected.
  5. Retry with smaller batch size or during low-traffic window.

3.11 Embedding budget blown

Symptoms: ai-gateway quota alert.

Steps:

  1. Disable semantic for affected tenant.
  2. Investigate cause (massive authoring burst? reindex loop?).
  3. Raise quota or throttle source.
  4. Resume semantic.

3.12 Stream poisoning

Symptoms: projector CPU spike, JSON parse errors.

Steps:

  1. Route to DLQ after 5 retries (automatic).
  2. Filter out poison subject temporarily (via feature flag).
  3. Notify producer team.
  4. Once producer patched → replay DLQ.
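
The automatic routing in step 1 amounts to a bounded retry loop around the projector. In this sketch "5 retries" is read as one initial attempt plus five retries, and `project` and `send_to_dlq` are hypothetical callables standing in for the real projector and DLQ writer. Backoff between attempts is omitted for brevity.

```python
from typing import Callable


def project_with_retries(
    payload: dict,
    project: Callable[[dict], None],
    send_to_dlq: Callable[[dict, str], None],
    max_retries: int = 5,
) -> bool:
    """Try to project a payload; after max_retries failed retries, route it to the DLQ."""
    last_error = None
    for _attempt in range(max_retries + 1):  # initial attempt + retries
        try:
            project(payload)
            return True
        except Exception as exc:  # a poison payload can fail in many ways
            last_error = exc
    send_to_dlq(payload, repr(last_error))
    return False
```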

3.13 Tenant erasure failure

Symptoms: audit mismatch — user doc still queryable after 30d.

Steps:

  1. Force targeted delete via admin API.
  2. Confirm pgvector row gone.
  3. Confirm Redis cache invalidated.
  4. File DPO report.

3.14 Alias swap failure

Steps:

  1. Rollback alias to previous index.
  2. Investigate swap failure (permission, cluster health).
  3. Retry swap after issue fix.
  4. If rollback also fails → escalate to SRE on-call.
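
The rollback in step 1 can be expressed as a single atomic request to the OpenSearch `_aliases` API. The helper below only builds the request body. The alias and index names are placeholders, and the sketch assumes the alias currently points at the failed index, which should be verified before sending.

```python
def alias_rollback_actions(alias: str, failed_index: str, previous_index: str) -> dict:
    """Body for POST /_aliases that moves `alias` back to the previous index.

    Both actions are applied atomically, so clients never see the alias
    pointing at nothing or at two indices at once.
    """
    return {
        "actions": [
            {"remove": {"index": failed_index, "alias": alias}},
            {"add": {"index": previous_index, "alias": alias}},
        ]
    }
```

If the swap failed before the alias moved, the `remove` action targets an alias that is not attached to the failed index and the request may be rejected. In that case the alias is already where it should be.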

3.15 Heavy query

Steps:

  1. Identify the caller → apply a soft limit to the account.
  2. Cancel long-running query (OpenSearch task API).
  3. Add fingerprint to blocklist; tighten tenant limits.
  4. If abuse → notify tenant admin; escalate.

3.16 Stale revocation

Steps:

  1. Emit targeted re-projection via admin API POST /search/rebuild/{docId}.
  2. Confirm the revoked certificate no longer appears in search results.
  3. Open bug on certification-service consumer lag.

4. Circuit Breakers

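| Downstream | Threshold | Open for | Half-open probe |
|------------|-----------|----------|-----------------|
| OpenSearch | 50% errors in 30s | 60s | 1 req/5s |
| pgvector (ai-gateway) | 50% errors in 30s | 60s | 1 req/5s |
| Redis | 80% errors in 10s | 30s | 1 req/3s |
| ai-gateway LLM | 50% errors in 60s | 120s | 1 req/10s |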
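
To make the columns concrete, here is a minimal sliding-window breaker in Python. It is a sketch under assumptions, not the service's implementation: the `min_calls` guard is not in the table, a single successful probe closes the breaker (real recovery may be more gradual), and time is passed in explicitly so the logic is easy to test.

```python
from collections import deque


class CircuitBreaker:
    def __init__(self, error_ratio, window_s, open_for_s, probe_interval_s, min_calls=10):
        self.error_ratio = error_ratio            # Threshold, e.g. 0.5
        self.window_s = window_s                  # Threshold window, e.g. 30
        self.open_for_s = open_for_s              # "Open for", e.g. 60
        self.probe_interval_s = probe_interval_s  # "1 req/5s" -> 5
        self.min_calls = min_calls                # assumption: avoid tripping on tiny samples
        self._calls = deque()                     # (timestamp, ok) pairs
        self._opened_at = None
        self._last_probe = None

    def state(self, now):
        if self._opened_at is None:
            return "closed"
        if now - self._opened_at < self.open_for_s:
            return "open"
        return "half-open"

    def allow(self, now):
        """Decide whether a request may be sent downstream."""
        current = self.state(now)
        if current == "closed":
            return True
        if current == "open":
            return False
        # Half-open: let through at most one probe per interval.
        if self._last_probe is None or now - self._last_probe >= self.probe_interval_s:
            self._last_probe = now
            return True
        return False

    def record(self, ok, now):
        """Record the outcome of a call that was allowed through."""
        current = self.state(now)
        if current == "open":
            return  # ignore stragglers that finish while open
        if current == "half-open":
            if ok:
                self._opened_at = None   # probe succeeded: close
                self._calls.clear()
            else:
                self._opened_at = now    # probe failed: re-open
            self._last_probe = None
            return
        self._calls.append((now, ok))
        while self._calls and now - self._calls[0][0] > self.window_s:
            self._calls.popleft()
        failures = sum(1 for _, good in self._calls if not good)
        if len(self._calls) >= self.min_calls and failures / len(self._calls) >= self.error_ratio:
            self._opened_at = now
```

With the OpenSearch row, this would be constructed as `CircuitBreaker(0.5, 30, 60, 5)`.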

5. Postmortem Triggers

| Scenario | Required postmortem |
|----------|---------------------|
| P0/P1 user-facing | yes, within 5 days |
| Cross-tenant leak | yes, within 48h |
| SLO burn > budget/quarter | yes |
| Data loss > 0 documents | yes |

6. Game Days

Monthly chaos exercise rotates through F1–F16 scenarios in staging; quarterly includes a DR drill. See TESTING_STRATEGY.md §9.

7. Known Gotchas

  • OpenSearch _reindex vs custom: built-in _reindex ignores our PII sanitizer — always use the projector pipeline.
  • Alias swap race: never run two reindex jobs for the same tenant concurrently (enforced by Redis mutex).
  • Embedding timeouts: must be < 300ms per call; longer timeouts risk indexer backpressure.
  • Locale drift: documents must always include primary locale; missing locale silently breaks ranking.
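
Because a missing locale fails silently, a guard ahead of the projector can surface it early. The sketch below assumes the primary locale lives in a top-level `locale` field. Both the field name and the `missing_locale` helper are hypothetical.

```python
from typing import Iterable


def missing_locale(docs: Iterable[dict], field: str = "locale") -> list:
    """Return the ids of documents with an absent or empty primary locale."""
    return [doc["id"] for doc in docs if not doc.get(field)]
```

A non-empty result could be routed to the DLQ or raised as an alert rather than indexed.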