
AI Integration

:::info Source
Sourced from services/search-service/AI_INTEGRATION.md in the documentation repo.
:::

All model calls flow through ai-gateway-service. Direct vendor calls from search-service are forbidden.

Reference implementation note (EP-11, Ghasi-EdTech): Hybrid search and recommendations use lexical match plus catalog tag overlap as a semantic proxy and attach structured aiProvenance (e.g. embedding: "tag_proxy") until corpus and query embeddings are wired to ai-gateway on the hot path. The "why" strings in that slice are template- and heuristic-based; LLM-generated explanations remain the long-term target (see §8).

Search uses AI in four places:

  1. Embeddings — semantic search corpus + query embedding.
  2. Query expansion — short or noisy queries augmented with rephrased variants.
  3. Learning-to-rank — re-ranking hybrid candidates with a LightGBM model.
  4. Recommendation explanation — short natural-language "why you see this" strings.

1. Model Inventory

| Purpose | Model family | Hosted via | Token/cost class | Governance |
|---|---|---|---|---|
| Corpus embedding | text-embed-3-small (or local bge-m3 in EU/ME) | ai-gateway | low (batched) | embedding model rotation event |
| Query embedding | same as corpus | ai-gateway | low (hot path) | must match corpus model id |
| Query expansion | small LLM (claude-haiku or local qwen2.5-7b) | ai-gateway | medium | opt-in flag per tenant |
| L2R ranker | LightGBM LambdaMART (not an LLM) | ai-gateway inference | low | offline-trained, shipped artifact |
| Rec explanation | small LLM | ai-gateway | medium | opt-in per tenant; templated prompt |
| On-device semantic (M5) | distilled MiniLM (int8) | on-device | n/a | optional |

2. ai-gateway Client Surface

interface AiGatewayClient {
  embeddings: {
    embed(input: { text: string; locale?: BCP47; tenantId: TenantId; purpose: 'corpus' | 'query' | 'user-profile' }):
      Promise<{ vector: number[]; modelId: string; embeddingHash: string }>;
    embedBatch(inputs: EmbedInput[]): Promise<EmbedResult[]>;
  };
  ranker: {
    rerank(req: { tenantId: TenantId; candidates: Candidate[]; features: Record<string, number>[] }):
      Promise<{ scores: number[]; modelVersion: string; explanationTopK?: Feature[][] }>;
  };
  completions: {
    expandQuery(req: { tenantId: TenantId; q: string; locale: BCP47 }):
      Promise<{ expansions: string[]; modelId: string; tokensIn: number; tokensOut: number }>;
    explainRec(req: { tenantId: TenantId; userId: UserId; itemId: string; reasonCode: string }):
      Promise<{ text: string; modelId: string }>;
  };
}

3. Corpus Embedding Pipeline

Batching: up to 100 inputs per call; flush on size or timeout. Cost tracked by ai-gateway and attributed to search-service's tenant-bucket.
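The flush-on-size-or-timeout behaviour can be sketched as below. This is a minimal illustration, not the service's implementation: `EmbedJob`, `EmbedBatcher`, and the 200 ms wait are hypothetical, and only the limit of 100 inputs per call comes from the text above.

```typescript
// Hypothetical job shape; the real EmbedInput type is defined by ai-gateway.
type EmbedJob = { text: string; tenantId: string };

class EmbedBatcher {
  private pending: EmbedJob[] = [];
  private timer: ReturnType<typeof setTimeout> | undefined;

  constructor(
    private readonly flushFn: (batch: EmbedJob[]) => void,
    private readonly maxBatch = 100,  // from the text: up to 100 inputs per call
    private readonly maxWaitMs = 200, // assumed value; the text does not state the timeout
  ) {}

  add(job: EmbedJob): void {
    this.pending.push(job);
    if (this.pending.length >= this.maxBatch) {
      this.flush(); // size-triggered flush
    } else if (this.timer === undefined) {
      // Start the timeout when the first job of a new batch arrives.
      this.timer = setTimeout(() => this.flush(), this.maxWaitMs);
    }
  }

  flush(): void {
    if (this.timer !== undefined) {
      clearTimeout(this.timer);
      this.timer = undefined;
    }
    if (this.pending.length === 0) return;
    const batch = this.pending;
    this.pending = [];
    this.flushFn(batch); // in practice this would call embeddings.embedBatch
  }
}
```

Calling `flush()` on shutdown drains any partial batch, so no queued document is left waiting on the timer.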

3.1 Content-to-Embed Template

[TYPE=$type] [LOCALE=$locale]
TITLE: $title
SUMMARY: $summary
TAGS: $tags
BODY: $body_truncated_4k
TAXONOMY: $taxonomy

PII scrubber runs before this template for any document with visibility ∈ {marketplace, public}. See §6.
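As an illustration, the template could be rendered with a small helper like the one below. The `EmbeddableDoc` interface and `renderEmbedText` are hypothetical names. Reading "4k" as 4,000 characters (rather than tokens) and the separators used for tags and taxonomy are assumptions the source does not confirm.

```typescript
// Field names mirror the template placeholders; the interface itself is hypothetical.
interface EmbeddableDoc {
  type: string;
  locale: string;
  title: string;
  summary: string;
  tags: string[];
  body: string;
  taxonomy: string[];
}

const BODY_LIMIT = 4000; // assumption: "4k" means 4,000 characters, not tokens

function renderEmbedText(doc: EmbeddableDoc): string {
  return [
    `[TYPE=${doc.type}] [LOCALE=${doc.locale}]`,
    `TITLE: ${doc.title}`,
    `SUMMARY: ${doc.summary}`,
    `TAGS: ${doc.tags.join(", ")}`,          // separator is an assumption
    `BODY: ${doc.body.slice(0, BODY_LIMIT)}`,
    `TAXONOMY: ${doc.taxonomy.join(" > ")}`, // separator is an assumption
  ].join("\n");
}
```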

4. Query Embedding

  • Cached per (tenantId, q, locale) with 60s TTL to avoid repeat embeddings for pagination.
  • Model-mismatch guard: if cached corpus model id ≠ gateway's current query model id, query falls back to lexical-only and an alert fires.
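The two rules above might look roughly like this in code. It is a sketch under assumptions: `QueryEmbeddingCache`, `chooseSearchMode`, and the injected clock and alert callback are hypothetical, and an in-process map stands in for whatever cache the service actually uses.

```typescript
type CachedQueryVector = { vector: number[]; modelId: string; expiresAt: number };
type SearchMode = "hybrid" | "lexical-only";

const QUERY_TTL_MS = 60_000; // 60s TTL from the text

class QueryEmbeddingCache {
  private entries = new Map<string, CachedQueryVector>();

  // The clock is injectable so expiry can be tested without waiting.
  constructor(private readonly now: () => number = () => Date.now()) {}

  private key(tenantId: string, q: string, locale: string): string {
    return JSON.stringify([tenantId, q, locale]);
  }

  get(tenantId: string, q: string, locale: string): CachedQueryVector | undefined {
    const k = this.key(tenantId, q, locale);
    const hit = this.entries.get(k);
    if (hit === undefined) return undefined;
    if (hit.expiresAt <= this.now()) {
      this.entries.delete(k); // expired: evict and report a miss
      return undefined;
    }
    return hit;
  }

  set(tenantId: string, q: string, locale: string, vector: number[], modelId: string): void {
    const expiresAt = this.now() + QUERY_TTL_MS;
    this.entries.set(this.key(tenantId, q, locale), { vector, modelId, expiresAt });
  }
}

// Model-mismatch guard: degrade to lexical-only and raise an alert.
function chooseSearchMode(
  corpusModelId: string,
  queryModelId: string,
  alert: (message: string) => void,
): SearchMode {
  if (corpusModelId !== queryModelId) {
    alert(`embedding model mismatch: corpus=${corpusModelId} query=${queryModelId}`);
    return "lexical-only";
  }
  return "hybrid";
}
```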

5. Learning-to-Rank

5.1 Feature Set

| Feature | Source | Range |
|---|---|---|
| bm25_title | OpenSearch | 0..∞ |
| bm25_body | OpenSearch | 0..∞ |
| cosine_sim | pgvector | -1..1 |
| recency_days | doc.updatedAt | 0..∞ |
| quality_rating | doc.quality.ratingAvg | 0..5 |
| quality_completion_rate | doc.quality.completionRate | 0..1 |
| enrollment_log | log10(enrollmentCount+1) | 0..7 |
| locale_match | bool | 0/1 |
| user_cohort_affinity | cohort propensity | 0..1 |
| user_taxonomy_affinity | past interactions in that taxonomy | 0..1 |
| click_through_rate_30d | analytics rollup | 0..1 |
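To make the document-derived rows concrete, here is a sketch of how those features could be computed. `RankableDoc` and `docFeatures` are hypothetical names. The retrieval scores (BM25, cosine) and the user affinity features come from other systems and are left out.

```typescript
// Hypothetical document shape, using the field paths named in the table.
interface RankableDoc {
  updatedAt: Date;
  enrollmentCount: number;
  locale: string;
  quality: { ratingAvg: number; completionRate: number };
}

type DocFeatures = {
  recency_days: number;
  quality_rating: number;
  quality_completion_rate: number;
  enrollment_log: number;
  locale_match: number;
};

const MS_PER_DAY = 86_400_000;

function docFeatures(doc: RankableDoc, userLocale: string, now: Date): DocFeatures {
  return {
    // Clamp at 0 so a clock skew cannot produce a negative age.
    recency_days: Math.max(0, (now.getTime() - doc.updatedAt.getTime()) / MS_PER_DAY),
    quality_rating: doc.quality.ratingAvg,
    quality_completion_rate: doc.quality.completionRate,
    enrollment_log: Math.log10(doc.enrollmentCount + 1),
    locale_match: doc.locale === userLocale ? 1 : 0,
  };
}
```

The `+1` inside the logarithm keeps documents with zero enrollments at a feature value of 0 instead of negative infinity.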

5.2 Training

  • Offline job in analytics-service: pulls search.recommendation.feedback.recorded.v1 + search.recommendation.generated.v1 + query logs.
  • Trains LambdaMART on pairwise judgements.
  • Outputs artifact → ai-gateway model registry → rolled out behind rankerModelVersion flag.
  • Canary: 5% traffic for 24h, NDCG@10 gate.
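For reference, the NDCG@10 metric behind the canary gate can be computed as follows. This sketch uses the exponential gain `2^rel - 1` and derives the ideal ordering from the judged results of the same query, both of which are common conventions but assumptions here. The function names and the gate threshold parameter are hypothetical, since the text does not state the gate value.

```typescript
// DCG@k with exponential gain and log2 position discount.
function dcgAtK(relevances: number[], k: number): number {
  return relevances
    .slice(0, k)
    .reduce((sum, rel, i) => sum + (2 ** rel - 1) / Math.log2(i + 2), 0);
}

// NDCG@k: DCG of the served order divided by DCG of the ideal order.
function ndcgAtK(relevances: number[], k = 10): number {
  const ideal = relevances.slice().sort((a, b) => b - a);
  const idealDcg = dcgAtK(ideal, k);
  return idealDcg === 0 ? 0 : dcgAtK(relevances, k) / idealDcg;
}

// Gate on the mean NDCG@10 across queries; threshold is supplied by the caller.
function passesCanaryGate(perQueryRelevances: number[][], threshold: number): boolean {
  if (perQueryRelevances.length === 0) return false;
  const total = perQueryRelevances.reduce((sum, rels) => sum + ndcgAtK(rels, 10), 0);
  return total / perQueryRelevances.length >= threshold;
}
```

Each inner array holds graded relevance labels in the order the ranker served the results for one query.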

5.3 Serving

const scored = await ai.ranker.rerank({
  tenantId,
  candidates: candidates.map(c => ({ id: c.id })),
  features: candidates.map(c => featurize(c, user)),
});
  • Timeout: 50ms hard; on timeout or error, fall back to the lexical+RRF score.
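One way to implement the hard timeout and the RRF fallback is sketched below. `withTimeout`, `rrfScores`, and `scoreWithFallback` are hypothetical helpers. The RRF constant `k = 60` is the conventional default rather than a value stated in this document, and fusing one lexical and one vector ranking is an assumption about what "lexical+RRF" means here.

```typescript
// Reciprocal rank fusion: each ranking contributes 1 / (k + rank), rank starting at 1.
function rrfScores(rankings: string[][], k = 60): Map<string, number> {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, idx) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + idx + 1));
    });
  }
  return scores;
}

// Rejects if `work` has not settled within `ms`; always clears the timer.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

// `rerank` stands in for the ai.ranker.rerank call shown above.
async function scoreWithFallback(
  rerank: () => Promise<number[]>,
  candidateIds: string[],
  rankings: string[][], // e.g. [lexicalOrder, vectorOrder]
): Promise<number[]> {
  try {
    return await withTimeout(rerank(), 50); // 50ms hard timeout from the text
  } catch {
    const fused = rrfScores(rankings);
    return candidateIds.map((id) => fused.get(id) ?? 0);
  }
}
```

Because the fallback is computed from rankings the service already holds, a gateway timeout costs no extra network call.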

6. Content Safety & PII

Before embedding or sending any content to LLMs, search-service runs content through ai-gateway's sanitizer:

  • Strips email/phone/ID numbers via regex + Presidio.
  • Redacts named-entity PII in user documents.
  • Refuses to embed if visibility=public and sanitizer flags unredactable fragments.
  • Logs every refusal with doc id for audit.

For visibility = org | private, PII is permitted in embeddings but the resulting vectors never leave the tenant's pgvector scope.
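The visibility rules can be read as a small decision function. The sketch below is an interpretation, not the gateway's contract: `SanitizerResult`, `decideEmbedding`, and the injected `sanitize` and `audit` callbacks are hypothetical, and the refusal branch applies only to `public` because that is the only case the text names.

```typescript
type Visibility = "public" | "marketplace" | "org" | "private";

// Hypothetical shape of what the sanitizer returns.
interface SanitizerResult {
  redactedText: string;
  unredactableFragments: number;
}

type EmbedDecision =
  | { action: "embed"; text: string }
  | { action: "refuse"; reason: string };

function decideEmbedding(
  docId: string,
  visibility: Visibility,
  rawText: string,
  sanitize: (text: string) => SanitizerResult,
  audit: (docId: string) => void,
): EmbedDecision {
  // org/private: PII is permitted; vectors stay inside the tenant's pgvector scope.
  if (visibility === "org" || visibility === "private") {
    return { action: "embed", text: rawText };
  }
  const result = sanitize(rawText);
  // public: refuse when the sanitizer could not redact everything, and log for audit.
  if (visibility === "public" && result.unredactableFragments > 0) {
    audit(docId);
    return { action: "refuse", reason: "unredactable PII in public content" };
  }
  return { action: "embed", text: result.redactedText };
}
```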

7. Query Expansion (opt-in)

When enabled for tenant:

  • Short queries (fewer than 4 words) or noisy queries (more than 2 typos according to the spellchecker) trigger expansion.
  • Gateway returns up to 3 rewrites.
  • Expansions run through lexical search only (cost-capped); scores merged via max-over-variants.

Disabled by default. Gated by tenantPolicy.search.queryExpansion.
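A sketch of the trigger rule and the max-over-variants merge follows. `shouldExpand`, `capExpansions`, and `mergeMaxOverVariants` are hypothetical names, the typo count is assumed to come from an external spellchecker, and treating an empty query as not expandable is an added assumption.

```typescript
type ScoredHit = { id: string; score: number };

function shouldExpand(q: string, typoCount: number, enabledForTenant: boolean): boolean {
  if (!enabledForTenant) return false; // gated by tenantPolicy.search.queryExpansion
  const words = q.trim().split(/\s+/).filter((w) => w.length > 0);
  if (words.length === 0) return false; // assumption: nothing to expand
  return words.length < 4 || typoCount > 2;
}

// The gateway returns up to 3 rewrites; enforce the cap defensively.
function capExpansions(expansions: string[]): string[] {
  return expansions.slice(0, 3);
}

// Each variant (original query plus rewrites) yields its own lexical hits;
// a document keeps the best score it achieved under any variant.
function mergeMaxOverVariants(hitsPerVariant: ScoredHit[][]): Map<string, number> {
  const best = new Map<string, number>();
  for (const hits of hitsPerVariant) {
    for (const hit of hits) {
      const prev = best.get(hit.id);
      if (prev === undefined || hit.score > prev) best.set(hit.id, hit.score);
    }
  }
  return best;
}
```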

8. Recommendation Explanations

  • Template-guided LLM prompt, producing ≤ 140 chars, no PII, grounded in the reason code.
  • Cached per (itemId, reasonCode, userId) for 24h.
  • The client can fall back to a hard-coded reason phrase if the explanation is missing.
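The cache key and the client-side fallback might be handled as in this sketch. The reason codes and phrases in `FALLBACK_PHRASES` are invented for illustration, and rejecting over-length text is an assumption layered on the 140-character limit. The text itself only specifies the fallback for a missing explanation.

```typescript
const MAX_EXPLANATION_CHARS = 140; // limit from the text

// Hypothetical reason codes and phrases, for illustration only.
const FALLBACK_PHRASES: Record<string, string> = {
  similar_topic: "Similar to topics you have studied",
  popular_in_cohort: "Popular with learners like you",
};
const DEFAULT_PHRASE = "Recommended for you";

// Key matches the documented cache scope: (itemId, reasonCode, userId).
function explanationCacheKey(itemId: string, reasonCode: string, userId: string): string {
  return JSON.stringify([itemId, reasonCode, userId]);
}

function resolveExplanation(llmText: string | undefined, reasonCode: string): string {
  const usable =
    llmText !== undefined &&
    llmText.length > 0 &&
    llmText.length <= MAX_EXPLANATION_CHARS; // assumption: over-length text is discarded
  if (usable) return llmText as string;
  return FALLBACK_PHRASES[reasonCode] ?? DEFAULT_PHRASE;
}
```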

9. Cost Governance

| Budget | Owner | Limits |
|---|---|---|
| Corpus embedding monthly tokens | platform | per-tenant soft cap + hard cap |
| Query embedding monthly tokens | platform | per-tenant hard cap |
| LLM expansion monthly tokens | tenant | tenant-configurable |
| LLM rec explanation tokens | tenant | tenant-configurable |

All limits enforced by ai-gateway; when breached, search-service gracefully degrades.

10. Evaluation

| Metric | Target | Source |
|---|---|---|
| NDCG@10 | ≥ 0.72 | golden judgments + L2R holdout |
| Rec CTR | baseline +15% | A/B via analytics-service |
| Expansion success rate | ≥ 40% | queries with >0 results after expansion that had 0 before |
| PII leak rate (sampled) | 0 | audit sample of 500 docs/month |
| Embedding hash cache hit | ≥ 70% | service metrics |

11. Fallback Hierarchy

hybrid (L2R) → hybrid (RRF) → lexical+quality → lexical → cache → static

On gateway degraded signal, search-service drops down one level.
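The ladder lends itself to an ordered list with a single step-down operation. In this sketch the level identifiers and `dropOneLevel` are hypothetical, and only the ordering comes from the hierarchy above.

```typescript
// Ordered from most to least capable, following the hierarchy above.
const FALLBACK_LADDER = [
  "hybrid-l2r",
  "hybrid-rrf",
  "lexical+quality",
  "lexical",
  "cache",
  "static",
] as const;

type SearchLevel = (typeof FALLBACK_LADDER)[number];

// On a degraded signal, move one rung down; "static" is the floor.
function dropOneLevel(current: SearchLevel): SearchLevel {
  const idx = FALLBACK_LADDER.indexOf(current);
  const next = Math.min(idx + 1, FALLBACK_LADDER.length - 1);
  return FALLBACK_LADDER[next] ?? current;
}
```

Repeated degraded signals walk the service down the ladder one step at a time instead of jumping straight to static results.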

12. Model Rotation

When ai-gateway publishes ai.embedding.model.rotated.v1:

  1. search-service schedules a rolling rebuild of embeddings (14-day budget).
  2. Queries dual-read both model vectors during cutover (tag by embeddingModelId at kNN time).
  3. Old vectors retained for 14d, then deleted.
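A possible shape for the dual-read merge and the retention check is sketched below. `KnnHit`, `mergeDualRead`, and `oldVectorsExpired` are hypothetical. Preferring the new-model hit when a document appears under both models, and counting the 14 days from the rotation event, are assumptions the text leaves open. Scores from different models are not strictly comparable, so the final sort is only a rough ordering.

```typescript
interface KnnHit {
  docId: string;
  score: number;
  embeddingModelId: string; // tag applied at kNN time
}

// During cutover, query both vector sets and keep one hit per document.
function mergeDualRead(newModelHits: KnnHit[], oldModelHits: KnnHit[]): KnnHit[] {
  const byDoc = new Map<string, KnnHit>();
  for (const hit of oldModelHits) byDoc.set(hit.docId, hit);
  for (const hit of newModelHits) byDoc.set(hit.docId, hit); // new-model hit wins
  return Array.from(byDoc.values()).sort((a, b) => b.score - a.score);
}

const RETENTION_DAYS = 14; // from the text

// Assumption: the retention window starts at the rotation event.
function oldVectorsExpired(rotatedAt: Date, now: Date): boolean {
  return now.getTime() - rotatedAt.getTime() >= RETENTION_DAYS * 86_400_000;
}
```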

13. Forbidden Patterns

  • ❌ Calling model vendors directly from search-service.
  • ❌ Storing raw LLM responses without hashing + provenance.
  • ❌ Embedding PII from visibility=public content.
  • ❌ Using LLMs in the critical ranking path without a deterministic fallback.