AI Integration
:::info Source
Sourced from services/search-service/AI_INTEGRATION.md in the documentation repo.
:::
All model calls flow through ai-gateway-service. Direct vendor calls from search-service are forbidden.
Reference implementation note (EP-11, Ghasi-EdTech): Hybrid search and recommendations use lexical match plus catalog tag overlap as a semantic proxy and attach structured aiProvenance (e.g. embedding: "tag_proxy") until corpus and query embeddings are wired to ai-gateway on the hot path. In that slice, the “why” explanation strings are template- and heuristic-based; LLM-generated explanations remain the long-term target (see §8).
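To make the tag-proxy idea concrete, here is a minimal sketch of how a lexical score might be blended with tag overlap and stamped with provenance. Only aiProvenance and embedding: "tag_proxy" come from the note above; the Jaccard metric, the 0.7/0.3 weighting, and every other name are assumptions for illustration.

```typescript
// Hypothetical shapes; only `aiProvenance` and `embedding: "tag_proxy"` come from the note.
interface AiProvenance {
  embedding: "tag_proxy" | "model";
}

interface ProxyScoredHit {
  id: string;
  score: number;
  aiProvenance: AiProvenance;
}

// Jaccard overlap between query-derived tags and catalog tags (assumed proxy metric).
function tagOverlap(queryTags: string[], docTags: string[]): number {
  const q = new Set(queryTags.map((t) => t.toLowerCase()));
  const d = new Set(docTags.map((t) => t.toLowerCase()));
  if (q.size === 0 && d.size === 0) return 0;
  let shared = 0;
  q.forEach((t) => {
    if (d.has(t)) shared += 1;
  });
  return shared / (q.size + d.size - shared);
}

// Blend a normalized lexical score (0..1) with the tag proxy; the weight is illustrative.
function scoreWithTagProxy(
  id: string,
  lexicalScore: number,
  queryTags: string[],
  docTags: string[],
  lexicalWeight = 0.7,
): ProxyScoredHit {
  const semanticProxy = tagOverlap(queryTags, docTags);
  return {
    id,
    score: lexicalWeight * lexicalScore + (1 - lexicalWeight) * semanticProxy,
    aiProvenance: { embedding: "tag_proxy" },
  };
}
```

Carrying the provenance on every hit lets downstream consumers tell proxy-scored results apart from model-embedded ones once real embeddings land.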
Search uses AI in four places:
- Embeddings — semantic search corpus + query embedding.
- Query expansion — short or noisy queries augmented with rephrased variants.
- Learning-to-rank — re-ranking hybrid candidates with a LightGBM model.
- Recommendation explanation — short natural-language "why you see this" strings.
1. Model Inventory
| Purpose | Model family | Hosted via | Token/cost class | Governance |
|---|---|---|---|---|
| Corpus embedding | text-embed-3-small (or local bge-m3 in EU/ME) | ai-gateway | low (batched) | embedding model rotation event |
| Query embedding | same as corpus | ai-gateway | low (hot path) | must match corpus model id |
| Query expansion | small LLM (claude-haiku or local qwen2.5-7b) | ai-gateway | medium | opt-in flag per tenant |
| L2R ranker | LightGBM LambdaMART (not an LLM) | ai-gateway inference | low | offline-trained, shipped artifact |
| Rec explanation | small LLM | ai-gateway | medium | opt-in per tenant; templated prompt |
| On-device semantic (M5) | distilled MiniLM (int8) | on-device | n/a | optional |
2. ai-gateway Client Surface
interface AiGatewayClient {
  embeddings: {
    embed(input: { text: string; locale?: BCP47; tenantId: TenantId; purpose: 'corpus' | 'query' | 'user-profile' }):
      Promise<{ vector: number[]; modelId: string; embeddingHash: string }>;
    embedBatch(inputs: EmbedInput[]): Promise<EmbedResult[]>;
  };
  ranker: {
    rerank(req: { tenantId: TenantId; candidates: Candidate[]; features: Record<string, number>[] }):
      Promise<{ scores: number[]; modelVersion: string; explanationTopK?: Feature[][] }>;
  };
  completions: {
    expandQuery(req: { tenantId: TenantId; q: string; locale: BCP47 }):
      Promise<{ expansions: string[]; modelId: string; tokensIn: number; tokensOut: number }>;
    explainRec(req: { tenantId: TenantId; userId: UserId; itemId: string; reasonCode: string }):
      Promise<{ text: string; modelId: string }>;
  };
}
3. Corpus Embedding Pipeline
Batching: up to 100 inputs per call; flush on size or timeout. Cost tracked by ai-gateway and attributed to search-service's tenant-bucket.
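The flush-on-size-or-timeout behaviour could look roughly like the sketch below. The class name, the 200 ms wait, and the injected clock are assumptions; only the 100-input limit comes from the text. The clock is passed in so the logic stays deterministic; a real service would presumably call poll from a timer.

```typescript
// Size-or-timeout batcher (illustrative; not the service's actual implementation).
class EmbedBatcher<T> {
  private pending: T[] = [];
  private oldestAt: number | null = null;
  private readonly flush: (batch: T[]) => void;
  private readonly maxBatch: number;
  private readonly maxWaitMs: number;

  constructor(
    flush: (batch: T[]) => void,
    maxBatch = 100, // "up to 100 inputs per call"
    maxWaitMs = 200, // assumed timeout; not specified in this document
  ) {
    this.flush = flush;
    this.maxBatch = maxBatch;
    this.maxWaitMs = maxWaitMs;
  }

  // Queue one input; flush immediately once the batch is full.
  add(item: T, nowMs: number): void {
    if (this.oldestAt === null) this.oldestAt = nowMs;
    this.pending.push(item);
    if (this.pending.length >= this.maxBatch) this.drain();
  }

  // Call periodically; flushes a partial batch once the oldest item has waited long enough.
  poll(nowMs: number): void {
    if (this.oldestAt !== null && nowMs - this.oldestAt >= this.maxWaitMs) this.drain();
  }

  private drain(): void {
    if (this.pending.length === 0) return;
    const batch = this.pending;
    this.pending = [];
    this.oldestAt = null;
    this.flush(batch);
  }
}
```

In practice the flush callback would wrap embeddings.embedBatch from §2.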
3.1 Content-to-Embed Template
[TYPE=$type] [LOCALE=$locale]
TITLE: $title
SUMMARY: $summary
TAGS: $tags
BODY: $body_truncated_4k
TAXONOMY: $taxonomy
PII scrubber runs before this template for any document with visibility ∈ {marketplace, public}. See §6.
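As an illustration, the template above might be rendered by a function like this one. The EmbedDoc shape, the list separators, and reading "4k" as 4,000 characters (rather than tokens) are all assumptions; the sketch also presumes the PII scrubber has already run on the input.

```typescript
// Hypothetical document shape; field names mirror the template placeholders.
interface EmbedDoc {
  type: string;
  locale: string;
  title: string;
  summary: string;
  tags: string[];
  body: string;
  taxonomy: string[];
}

const BODY_LIMIT = 4000; // assumption: "4k" interpreted as 4,000 characters

// Render the content-to-embed text, one template line per field.
function renderEmbedText(doc: EmbedDoc): string {
  return [
    `[TYPE=${doc.type}] [LOCALE=${doc.locale}]`,
    `TITLE: ${doc.title}`,
    `SUMMARY: ${doc.summary}`,
    `TAGS: ${doc.tags.join(", ")}`, // separator is an assumption
    `BODY: ${doc.body.slice(0, BODY_LIMIT)}`,
    `TAXONOMY: ${doc.taxonomy.join(" > ")}`, // separator is an assumption
  ].join("\n");
}
```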
4. Query Embedding
- Cached per (tenantId, q, locale) with 60s TTL to avoid repeat embeddings for pagination.
- Model-mismatch guard: if cached corpus model id ≠ gateway's current query model id, query falls back to lexical-only and an alert fires.
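A minimal sketch of the two rules above, assuming an in-memory cache and an injected clock. The class and function names are hypothetical; the 60s TTL, the (tenantId, q, locale) key, and the lexical-only fallback come from the bullets.

```typescript
interface CachedEmbedding {
  vector: number[];
  modelId: string;
  expiresAt: number;
}

const QUERY_TTL_MS = 60000; // 60s TTL

// In-memory cache keyed by (tenantId, q, locale); a shared cache would work the same way.
class QueryEmbeddingCache {
  private entries = new Map<string, CachedEmbedding>();

  private key(tenantId: string, q: string, locale: string): string {
    return JSON.stringify([tenantId, q, locale]);
  }

  get(tenantId: string, q: string, locale: string, nowMs: number): CachedEmbedding | undefined {
    const k = this.key(tenantId, q, locale);
    const hit = this.entries.get(k);
    if (!hit) return undefined;
    if (hit.expiresAt <= nowMs) {
      this.entries.delete(k); // expired: drop and report a miss
      return undefined;
    }
    return hit;
  }

  set(tenantId: string, q: string, locale: string, vector: number[], modelId: string, nowMs: number): void {
    this.entries.set(this.key(tenantId, q, locale), { vector, modelId, expiresAt: nowMs + QUERY_TTL_MS });
  }
}

type SearchMode = "hybrid" | "lexical-only";

// Model-mismatch guard: vectors from different models are not comparable,
// so fall back to lexical-only search and raise an alert.
function chooseSearchMode(
  corpusModelId: string,
  queryModelId: string,
  alert: (message: string) => void,
): SearchMode {
  if (corpusModelId !== queryModelId) {
    alert(`embedding model mismatch: corpus=${corpusModelId} query=${queryModelId}`);
    return "lexical-only";
  }
  return "hybrid";
}
```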
5. Learning-to-Rank
5.1 Feature Set
| Feature | Source | Range |
|---|---|---|
| bm25_title | OpenSearch | 0..∞ |
| bm25_body | OpenSearch | 0..∞ |
| cosine_sim | pgvector | -1..1 |
| recency_days | doc.updatedAt | 0..∞ |
| quality_rating | doc.quality.ratingAvg | 0..5 |
| quality_completion_rate | doc.quality.completionRate | 0..1 |
| enrollment_log | log10(enrollmentCount+1) | 0..7 |
| locale_match | bool | 0/1 |
| user_cohort_affinity | cohort propensity | 0..1 |
| user_taxonomy_affinity | past interactions in that taxonomy | 0..1 |
| click_through_rate_30d | analytics rollup | 0..1 |
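One way the feature table might map onto a featurize function is sketched below. The DocSignals and UserSignals shapes are hypothetical, as is the assumption that affinities and CTR arrive precomputed; the feature names, the log10(enrollmentCount+1) transform, and the ranges come from the table.

```typescript
// Hypothetical inputs; the real candidate and user types are not shown in this document.
interface DocSignals {
  bm25Title: number;
  bm25Body: number;
  cosineSim: number;
  updatedAt: Date;
  ratingAvg: number;
  completionRate: number;
  enrollmentCount: number;
  locale: string;
  ctr30d: number;
}

interface UserSignals {
  locale: string;
  cohortAffinity: number;
  taxonomyAffinity: number;
}

const MS_PER_DAY = 86400000;

// Clamp x into [lo, hi] so features stay inside the ranges listed in the table.
function clamp(x: number, lo: number, hi: number): number {
  return Math.min(hi, Math.max(lo, x));
}

function featurize(doc: DocSignals, user: UserSignals, now: Date): Record<string, number> {
  return {
    bm25_title: Math.max(0, doc.bm25Title),
    bm25_body: Math.max(0, doc.bm25Body),
    cosine_sim: clamp(doc.cosineSim, -1, 1),
    recency_days: Math.max(0, (now.getTime() - doc.updatedAt.getTime()) / MS_PER_DAY),
    quality_rating: clamp(doc.ratingAvg, 0, 5),
    quality_completion_rate: clamp(doc.completionRate, 0, 1),
    enrollment_log: Math.log10(doc.enrollmentCount + 1),
    locale_match: doc.locale === user.locale ? 1 : 0,
    user_cohort_affinity: clamp(user.cohortAffinity, 0, 1),
    user_taxonomy_affinity: clamp(user.taxonomyAffinity, 0, 1),
    click_through_rate_30d: clamp(doc.ctr30d, 0, 1),
  };
}
```

Clamping at featurization time is a design choice in this sketch: it keeps serving-time inputs inside the ranges the ranker saw during training.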
5.2 Training
- Offline job in analytics-service: pulls search.recommendation.feedback.recorded.v1 + search.recommendation.generated.v1 + query logs.
- Trains LambdaMART on pairwise judgements.
- Outputs artifact → ai-gateway model registry → rolled out behind rankerModelVersion flag.
- Canary: 5% traffic for 24h, NDCG@10 gate.
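For reference, the NDCG@10 metric behind the canary gate can be computed as in the sketch below. It uses the common exponential-gain form, DCG = Σ (2^rel − 1) / log2(i + 2), and derives the ideal ordering from the judged items in each result list; both choices are assumptions, as is the gate's exact pass criterion, which this document does not specify.

```typescript
// DCG over the top-k graded relevance labels, in ranked order.
function dcgAtK(gains: number[], k: number): number {
  return gains
    .slice(0, k)
    .reduce((sum, rel, i) => sum + (Math.pow(2, rel) - 1) / Math.log2(i + 2), 0);
}

// NDCG@k for one query; `gains` holds relevance labels in the order the ranker returned them.
function ndcgAtK(gains: number[], k = 10): number {
  const ideal = gains.slice().sort((a, b) => b - a);
  const idcg = dcgAtK(ideal, k);
  return idcg === 0 ? 0 : dcgAtK(gains, k) / idcg;
}

// Hypothetical gate: mean NDCG@10 over the canary queries must reach a caller-supplied threshold.
function canaryPasses(queries: number[][], threshold: number): boolean {
  if (queries.length === 0) return false;
  const mean = queries.reduce((sum, q) => sum + ndcgAtK(q, 10), 0) / queries.length;
  return mean >= threshold;
}
```

A threshold of 0.72 would match the NDCG@10 target listed in §10, though whether the gate uses that absolute value or a comparison against the current model is not stated here.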
5.3 Serving
const scored = await ai.ranker.rerank({
  tenantId,
  candidates: candidates.map(c => ({ id: c.id })),
  features: candidates.map(c => featurize(c, user)),
});
- Timeout: 50ms hard; on fail → fall back to lexical+RRF score.
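The RRF score used as the fallback can be sketched as follows. This is standard reciprocal rank fusion, score(d) = Σ 1 / (k + rank), with the conventional k = 60; the constant and the function name are assumptions, since this document does not state how the service configures it. The 50ms timeout handling itself is not shown.

```typescript
interface FusedHit {
  id: string;
  score: number;
}

// Fuse several ranked id lists (best first) into one ordering.
// Ranks are 1-based, so the top item of each list contributes 1 / (k + 1).
function reciprocalRankFusion(rankings: string[][], k = 60): FusedHit[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const prev = scores.get(id) ?? 0;
      scores.set(id, prev + 1 / (k + index + 1));
    });
  }
  const fused: FusedHit[] = [];
  scores.forEach((score, id) => fused.push({ id, score }));
  return fused.sort((a, b) => b.score - a.score);
}
```

Because RRF only needs ranks, it can combine lexical and vector result lists without calibrating their raw scores, which makes it a cheap deterministic fallback when the ranker is unavailable.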
6. Content Safety & PII
Before embedding or sending any content to LLMs, search-service runs content through ai-gateway's sanitizer:
- Strips email/phone/ID numbers via regex + Presidio.
- Redacts named-entity PII in user documents.
- Refuses to embed if visibility=public and sanitizer flags unredactable fragments.
- Logs every refusal with doc id for audit.
For visibility = org | private, PII is permitted in embeddings but the resulting vectors never leave the tenant's pgvector scope.
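The refusal rule could be expressed roughly as below. This is a sketch only: the real sanitizer lives in ai-gateway and uses Presidio in addition to regexes, whereas the two patterns here are simplified stand-ins, and the SanitizerResult shape and all function names are hypothetical.

```typescript
type Visibility = "public" | "marketplace" | "org" | "private";

// Assumed sanitizer output: redacted text plus any fragments it could not safely redact.
interface SanitizerResult {
  text: string;
  unredactable: string[];
}

// Simplified illustrative patterns; real detection is considerably more thorough.
const EMAIL_PATTERN = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;
const PHONE_PATTERN = /\+?\d[\d\s().-]{7,}\d/g;

function redactWithRegex(text: string): string {
  return text.replace(EMAIL_PATTERN, "[EMAIL]").replace(PHONE_PATTERN, "[PHONE]");
}

type EmbedDecision =
  | { action: "embed"; text: string }
  | { action: "refuse"; docId: string; reason: string };

// Refuse public documents with unredactable fragments and record the refusal for audit.
function decideEmbed(
  docId: string,
  visibility: Visibility,
  result: SanitizerResult,
  auditLog: (entry: { docId: string; reason: string }) => void,
): EmbedDecision {
  if (visibility === "public" && result.unredactable.length > 0) {
    const reason = `unredactable fragments: ${result.unredactable.length}`;
    auditLog({ docId, reason });
    return { action: "refuse", docId, reason };
  }
  return { action: "embed", text: result.text };
}
```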
7. Query Expansion (opt-in)
When enabled for tenant:
- Short queries (q.length < 4 words) or noisy queries (>2 typos per spellchecker) trigger expansion.
- Gateway returns up to 3 rewrites.
- Expansions run through lexical search only (cost-capped); scores merged via max-over-variants.
Disabled by default. Gated by tenantPolicy.search.queryExpansion.
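The trigger and merge rules above might be sketched like this. The function names and the Hit shape are hypothetical, the typo count is assumed to come from an upstream spellchecker, and skipping empty queries is an added assumption.

```typescript
interface Hit {
  id: string;
  score: number;
}

// Expansion triggers only when the tenant has opted in and the query is short or noisy.
function shouldExpand(q: string, typoCount: number, enabledForTenant: boolean): boolean {
  if (!enabledForTenant) return false; // disabled by default
  const words = q.trim().split(/\s+/).filter((w) => w.length > 0);
  if (words.length === 0) return false; // assumption: nothing to expand in an empty query
  return words.length < 4 || typoCount > 2;
}

// Max-over-variants: each document keeps its best lexical score across the
// original query and its expansions.
function mergeMaxOverVariants(resultsPerVariant: Hit[][]): Hit[] {
  const best = new Map<string, number>();
  for (const results of resultsPerVariant) {
    for (const hit of results) {
      const prev = best.get(hit.id);
      if (prev === undefined || hit.score > prev) best.set(hit.id, hit.score);
    }
  }
  const merged: Hit[] = [];
  best.forEach((score, id) => merged.push({ id, score }));
  return merged.sort((a, b) => b.score - a.score);
}
```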
8. Recommendation Explanations
- Template-guided LLM prompt, producing ≤ 140 chars, no PII, grounded in the reason code.
- Cached per (itemId, reasonCode, userId) for 24h.
- Client can fall back to a hard-coded reason phrase if explanation missing.
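On the client side, the fallback could look something like the sketch below. The reason codes and phrases are invented for illustration, and treating over-length text the same as a missing explanation is an assumption; the 140-character limit and the cache key fields come from the bullets above.

```typescript
const MAX_EXPLANATION_CHARS = 140;

// Hypothetical reason codes and phrases; the real catalogue is not listed in this document.
const FALLBACK_PHRASES: Record<string, string> = {
  similar_to_completed: "Similar to a course you completed",
  popular_in_cohort: "Popular with learners like you",
};

const GENERIC_FALLBACK = "Recommended for you";

// Cache key matching the (itemId, reasonCode, userId) tuple.
function explanationCacheKey(itemId: string, reasonCode: string, userId: string): string {
  return JSON.stringify([itemId, reasonCode, userId]);
}

// Use the LLM text when present and within the length limit; otherwise a hard-coded phrase.
function resolveExplanation(llmText: string | undefined, reasonCode: string): string {
  const text = llmText?.trim();
  if (text && text.length <= MAX_EXPLANATION_CHARS) return text;
  return FALLBACK_PHRASES[reasonCode] ?? GENERIC_FALLBACK;
}
```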
9. Cost Governance
| Budget | Owner | Limits |
|---|---|---|
| Corpus embedding monthly tokens | platform | per-tenant soft cap + hard cap |
| Query embedding monthly tokens | platform | per-tenant hard cap |
| LLM expansion monthly tokens | tenant | tenant-configurable |
| LLM rec explanation tokens | tenant | tenant-configurable |
All limits enforced by ai-gateway; when breached, search-service gracefully degrades.
10. Evaluation
| Metric | Target | Source |
|---|---|---|
| NDCG@10 | ≥ 0.72 | golden judgments + L2R holdout |
| Rec CTR | baseline +15% | A/B via analytics-service |
| Expansion success rate | ≥ 40% | queries with >0 results after expansion that had 0 before |
| PII leak rate (sampled) | 0 | audit sample of 500 docs/month |
| Embedding hash cache hit | ≥ 70% | service metrics |
11. Fallback Hierarchy
hybrid (L2R) → hybrid (RRF) → lexical+quality → lexical → cache → static
On gateway degraded signal, search-service drops down one level.
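The hierarchy can be modelled as an ordered list with a single step-down operation, as in this sketch. The level identifiers are hypothetical labels for the stages above; how the service steps back up after recovery is not described here and is left out.

```typescript
// Ordered from richest to most degraded, mirroring the chain above.
const FALLBACK_LEVELS = [
  "hybrid-l2r",
  "hybrid-rrf",
  "lexical-quality",
  "lexical",
  "cache",
  "static",
] as const;

type FallbackLevel = (typeof FALLBACK_LEVELS)[number];

// Drop one level on a degraded signal; the last level ("static") is a floor.
function dropOneLevel(current: FallbackLevel): FallbackLevel {
  const index = FALLBACK_LEVELS.indexOf(current);
  const next = Math.min(index + 1, FALLBACK_LEVELS.length - 1);
  return FALLBACK_LEVELS[next] ?? current;
}
```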
12. Model Rotation
When ai-gateway publishes ai.embedding.model.rotated.v1:
- search-service schedules a rolling rebuild of embeddings (14-day budget).
- Queries dual-read both model vectors during cutover (tag by embeddingModelId at kNN time).
- Old vectors retained for 14d, then deleted.
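A rough sketch of the dual-read window follows. It assumes the 14-day retention is counted from the rotation event and that per-model kNN results are merged by rank, preferring the new model's vector when a document has both; neither detail is specified in this document, and all type and function names are hypothetical.

```typescript
const RETENTION_MS = 14 * 24 * 60 * 60 * 1000; // old vectors retained for 14d

interface Rotation {
  oldModelId: string;
  newModelId: string;
  rotatedAtMs: number;
}

// Model ids to query at kNN time: both during the retention window, only the new one after.
function modelIdsToQuery(rotation: Rotation, nowMs: number): string[] {
  return nowMs - rotation.rotatedAtMs < RETENTION_MS
    ? [rotation.newModelId, rotation.oldModelId]
    : [rotation.newModelId];
}

interface KnnHit {
  docId: string;
  embeddingModelId: string;
  rank: number; // rank within that model's own kNN result list
}

// Keep one hit per document, preferring the new model's vector when both exist.
// Ranks are used instead of raw similarities, which are not comparable across models.
function mergeDualRead(hits: KnnHit[], newModelId: string): KnnHit[] {
  const byDoc = new Map<string, KnnHit>();
  for (const hit of hits) {
    const existing = byDoc.get(hit.docId);
    const upgrades =
      existing !== undefined &&
      hit.embeddingModelId === newModelId &&
      existing.embeddingModelId !== newModelId;
    if (existing === undefined || upgrades) byDoc.set(hit.docId, hit);
  }
  const merged: KnnHit[] = [];
  byDoc.forEach((hit) => merged.push(hit));
  return merged.sort((a, b) => a.rank - b.rank);
}
```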
13. Forbidden Patterns
- ❌ Calling model vendors directly from search-service.
- ❌ Storing raw LLM responses without hashing + provenance.
- ❌ Embedding PII from visibility=public content.
- ❌ Using LLMs in the critical ranking path without a deterministic fallback.