AI Integration

:::info Source
Sourced from services/search-service/AI_INTEGRATION.md in the documentation repo.
:::

All model calls flow through ai-gateway-service. Direct vendor calls from search-service are forbidden.

Reference implementation note (EP-11, Ghasi-EdTech): Hybrid search and recommendations currently use lexical match plus catalog tag overlap as a semantic proxy, and attach structured aiProvenance (e.g. embedding: "tag_proxy"), until corpus and query embeddings are wired to ai-gateway on the hot path. In that slice, the "why" explanation strings are template- and heuristic-based; LLM-generated explanations remain the long-term target (see §8).
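
As a rough illustration of the tag-overlap proxy described in the note above, the sketch below blends a lexical score with a Jaccard overlap of tags and attaches the provenance marker. The function names, the `tagWeight` default, and the assumption that the lexical score is already normalised to 0..1 are all hypothetical; only the `embedding: "tag_proxy"` value comes from the note.

```typescript
// Hypothetical provenance shape; only the "tag_proxy" value comes from the note above.
interface AiProvenance {
  embedding: string;
}

// Jaccard overlap between query-derived tags and catalog tags, in [0, 1].
function tagOverlap(queryTags: string[], docTags: string[]): number {
  const a = new Set(queryTags.map(t => t.toLowerCase()));
  const b = new Set(docTags.map(t => t.toLowerCase()));
  if (a.size === 0 && b.size === 0) return 0;
  let shared = 0;
  a.forEach(t => {
    if (b.has(t)) shared += 1;
  });
  return shared / (a.size + b.size - shared);
}

// Blend a lexical score (assumed normalised to [0, 1]) with the tag proxy.
// tagWeight is an illustrative default, not a documented value.
function proxyScore(
  lexical: number,
  queryTags: string[],
  docTags: string[],
  tagWeight = 0.3,
): { score: number; aiProvenance: AiProvenance } {
  const score = (1 - tagWeight) * lexical + tagWeight * tagOverlap(queryTags, docTags);
  return { score, aiProvenance: { embedding: "tag_proxy" } };
}
```

Keeping the provenance on every result lets downstream consumers tell proxy-scored results apart from embedding-scored ones once real embeddings are live.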

Search uses AI in four places:

Embeddings — semantic search corpus + query embedding.
Query expansion — short or noisy queries augmented with rephrased variants.
Learning-to-rank — re-ranking hybrid candidates with a LightGBM model.
Recommendation explanation — short natural-language "why you see this" strings.

1. Model Inventory

| Purpose | Model family | Hosted via | Token/cost class | Governance |
| --- | --- | --- | --- | --- |
| Corpus embedding | `text-embed-3-small` (or local `bge-m3` in EU/ME) | ai-gateway | low (batched) | embedding model rotation event |
| Query embedding | same as corpus | ai-gateway | low (hot path) | must match corpus model id |
| Query expansion | small LLM (`claude-haiku` or local `qwen2.5-7b`) | ai-gateway | medium | opt-in flag per tenant |
| L2R ranker | LightGBM LambdaMART (not an LLM) | ai-gateway inference | low | offline-trained, shipped artifact |
| Rec explanation | small LLM | ai-gateway | medium | opt-in per tenant; templated prompt |
| On-device semantic (M5) | distilled MiniLM (int8) | on-device | n/a | optional |

2. ai-gateway Client Surface

// Client surface that search-service uses for every model call.
interface AiGatewayClient {
  // Embeddings for corpus documents, queries, and user profiles (§3, §4).
  embeddings: {
    embed(input: { text: string; locale?: BCP47; tenantId: TenantId; purpose: 'corpus' | 'query' | 'user-profile' }):
      Promise<{ vector: number[]; modelId: string; embeddingHash: string }>;
    embedBatch(inputs: EmbedInput[]): Promise<EmbedResult[]>;
  };
  // Learning-to-rank re-scoring (§5); features[i] belongs to candidates[i].
  ranker: {
    rerank(req: { tenantId: TenantId; candidates: Candidate[]; features: Record<string, number>[] }):
      Promise<{ scores: number[]; modelVersion: string; explanationTopK?: Feature[][] }>;
  };
  // Opt-in LLM calls: query expansion (§7) and recommendation explanations (§8).
  completions: {
    expandQuery(req: { tenantId: TenantId; q: string; locale: BCP47 }):
      Promise<{ expansions: string[]; modelId: string; tokensIn: number; tokensOut: number }>;
    explainRec(req: { tenantId: TenantId; userId: UserId; itemId: string; reasonCode: string }):
      Promise<{ text: string; modelId: string }>;
  };
}

3. Corpus Embedding Pipeline

Batching: up to 100 inputs per call; flush on size or timeout. Cost tracked by ai-gateway and attributed to search-service's tenant-bucket.
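
A minimal sketch of the size-or-timeout flush described above. The class name, the `send` callback, and the 200 ms default wait are assumptions for illustration; only the 100-input batch limit comes from this section.

```typescript
// Hypothetical batcher: collects inputs and hands them to `send` in groups.
class EmbedBatcher<T> {
  private pending: T[] = [];
  private timer: ReturnType<typeof setTimeout> | undefined = undefined;
  private readonly send: (batch: T[]) => void;
  private readonly maxBatch: number;
  private readonly maxWaitMs: number;

  // maxBatch = 100 follows the text; maxWaitMs = 200 is an assumed value.
  constructor(send: (batch: T[]) => void, maxBatch = 100, maxWaitMs = 200) {
    this.send = send;
    this.maxBatch = maxBatch;
    this.maxWaitMs = maxWaitMs;
  }

  add(input: T): void {
    this.pending.push(input);
    if (this.pending.length >= this.maxBatch) {
      this.flush(); // flush on size
    } else if (this.timer === undefined) {
      // flush on timeout, measured from the first input of the batch
      this.timer = setTimeout(() => this.flush(), this.maxWaitMs);
    }
  }

  flush(): void {
    if (this.timer !== undefined) {
      clearTimeout(this.timer);
      this.timer = undefined;
    }
    if (this.pending.length === 0) return;
    const batch = this.pending;
    this.pending = [];
    this.send(batch);
  }
}
```

In practice `send` would wrap a call such as `ai.embeddings.embedBatch`; it is kept as a plain callback here so the flushing logic can be exercised without a gateway.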

3.1 Content-to-Embed Template

[TYPE=$type] [LOCALE=$locale]
TITLE: $title
SUMMARY: $summary
TAGS: $tags
BODY: $body_truncated_4k
TAXONOMY: $taxonomy

PII scrubber runs before this template for any document with visibility ∈ {marketplace, public}. See §6.
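
The template above could be rendered with a small helper like the one below. The `EmbeddableDoc` shape, the tag and taxonomy separators, and the reading of "4k" as a character budget are assumptions; if the limit is meant in tokens, a tokenizer-based truncation would be needed instead.

```typescript
// Hypothetical document shape; field names mirror the template placeholders.
interface EmbeddableDoc {
  type: string;
  locale: string;
  title: string;
  summary: string;
  tags: string[];
  body: string;
  taxonomy: string[];
}

// Assumption: "4k" means 4000 characters.
const BODY_LIMIT = 4000;

// Renders the content-to-embed text; the caller is expected to have run
// the PII scrubber first where the document's visibility requires it.
function renderEmbedText(doc: EmbeddableDoc): string {
  return [
    `[TYPE=${doc.type}] [LOCALE=${doc.locale}]`,
    `TITLE: ${doc.title}`,
    `SUMMARY: ${doc.summary}`,
    `TAGS: ${doc.tags.join(", ")}`,
    `BODY: ${doc.body.slice(0, BODY_LIMIT)}`,
    `TAXONOMY: ${doc.taxonomy.join(" > ")}`,
  ].join("\n");
}
```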

4. Query Embedding

Cached per (tenantId, q, locale) with 60s TTL to avoid repeat embeddings for pagination.
Model-mismatch guard: if cached corpus model id ≠ gateway's current query model id, query falls back to lexical-only and an alert fires.
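
The two rules above can be sketched as follows. The class and function names are hypothetical, and the clock is passed in explicitly so that expiry is easy to reason about; only the cache key, the 60 s TTL, and the lexical-only fallback come from this section.

```typescript
interface CachedEmbedding {
  vector: number[];
  modelId: string;
  expiresAt: number; // epoch milliseconds
}

const QUERY_TTL_MS = 60 * 1000; // 60s TTL

// Hypothetical in-memory cache keyed by (tenantId, q, locale).
class QueryEmbeddingCache {
  private entries = new Map<string, CachedEmbedding>();

  private key(tenantId: string, q: string, locale: string): string {
    return JSON.stringify([tenantId, q, locale]);
  }

  get(tenantId: string, q: string, locale: string, now: number): CachedEmbedding | undefined {
    const k = this.key(tenantId, q, locale);
    const hit = this.entries.get(k);
    if (hit !== undefined && hit.expiresAt <= now) {
      this.entries.delete(k); // expired
      return undefined;
    }
    return hit;
  }

  set(tenantId: string, q: string, locale: string, vector: number[], modelId: string, now: number): void {
    this.entries.set(this.key(tenantId, q, locale), { vector, modelId, expiresAt: now + QUERY_TTL_MS });
  }
}

type SearchMode = "hybrid" | "lexical-only";

// Model-mismatch guard: vectors from different models are not comparable,
// so a mismatch disables the semantic leg and raises an alert.
function chooseSearchMode(
  corpusModelId: string,
  queryModelId: string,
  alert: (message: string) => void,
): SearchMode {
  if (corpusModelId !== queryModelId) {
    alert(`embedding model mismatch: corpus=${corpusModelId} query=${queryModelId}`);
    return "lexical-only";
  }
  return "hybrid";
}
```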

5. Learning-to-Rank

5.1 Feature Set

| Feature | Source | Range |
| --- | --- | --- |
| `bm25_title` | OpenSearch | 0..∞ |
| `bm25_body` | OpenSearch | 0..∞ |
| `cosine_sim` | pgvector | -1..1 |
| `recency_days` | doc.updatedAt | 0..∞ |
| `quality_rating` | doc.quality.ratingAvg | 0..5 |
| `quality_completion_rate` | doc.quality.completionRate | 0..1 |
| `enrollment_log` | log10(enrollmentCount+1) | 0..7 |
| `locale_match` | bool | 0/1 |
| `user_cohort_affinity` | cohort propensity | 0..1 |
| `user_taxonomy_affinity` | past interactions in that taxonomy | 0..1 |
| `click_through_rate_30d` | analytics rollup | 0..1 |
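
One possible shape for a featurizer that produces the feature set above. The input interfaces and their field names are hypothetical, the affinities are assumed to be precomputed for the user and candidate pair, and clamping to the documented ranges is an illustrative choice rather than confirmed behaviour.

```typescript
// Hypothetical inputs; field names are illustrative.
interface CandidateSignals {
  bm25Title: number;
  bm25Body: number;
  cosineSim: number;
  updatedAtMs: number; // epoch milliseconds of doc.updatedAt
  ratingAvg: number;
  completionRate: number;
  enrollmentCount: number;
  locale: string;
  ctr30d: number;
  cohortAffinity: number;   // assumed precomputed per user and candidate
  taxonomyAffinity: number; // assumed precomputed per user and candidate
}

interface UserSignals {
  locale: string;
}

const clamp = (x: number, lo: number, hi: number): number => Math.min(hi, Math.max(lo, x));
const MS_PER_DAY = 24 * 60 * 60 * 1000;

function featurize(c: CandidateSignals, user: UserSignals, nowMs: number): Record<string, number> {
  return {
    bm25_title: Math.max(0, c.bm25Title),
    bm25_body: Math.max(0, c.bm25Body),
    cosine_sim: clamp(c.cosineSim, -1, 1),
    recency_days: Math.max(0, (nowMs - c.updatedAtMs) / MS_PER_DAY),
    quality_rating: clamp(c.ratingAvg, 0, 5),
    quality_completion_rate: clamp(c.completionRate, 0, 1),
    enrollment_log: Math.log10(c.enrollmentCount + 1),
    locale_match: c.locale === user.locale ? 1 : 0,
    user_cohort_affinity: clamp(c.cohortAffinity, 0, 1),
    user_taxonomy_affinity: clamp(c.taxonomyAffinity, 0, 1),
    click_through_rate_30d: clamp(c.ctr30d, 0, 1),
  };
}
```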

5.2 Training

Offline job in analytics-service: pulls search.recommendation.feedback.recorded.v1 + search.recommendation.generated.v1 + query logs.
Trains LambdaMART on pairwise judgements.
Outputs artifact → ai-gateway model registry → rolled out behind rankerModelVersion flag.
Canary: 5% traffic for 24h, NDCG@10 gate.
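
For reference, one common way to compute NDCG@10 and apply it as a gate is sketched below. It uses the exponential-gain form of DCG and computes the ideal ordering from the judged items in the list itself, which is a simplification. How the canary actually aggregates queries and which threshold it uses are not stated here, so `passesCanaryGate` and its `threshold` parameter are assumptions.

```typescript
// DCG@k with exponential gain: sum of (2^rel - 1) / log2(rank + 1), rank starting at 1.
function dcgAtK(relevances: number[], k: number): number {
  return relevances
    .slice(0, k)
    .reduce((sum, rel, i) => sum + (Math.pow(2, rel) - 1) / Math.log2(i + 2), 0);
}

// NDCG@k: DCG of the served order divided by DCG of the ideal (sorted) order.
function ndcgAtK(relevances: number[], k = 10): number {
  const ideal = relevances.slice().sort((a, b) => b - a);
  const idealDcg = dcgAtK(ideal, k);
  return idealDcg === 0 ? 0 : dcgAtK(relevances, k) / idealDcg;
}

// Hypothetical gate: mean NDCG@10 over judged queries must reach the threshold.
function passesCanaryGate(perQueryRelevances: number[][], threshold: number): boolean {
  if (perQueryRelevances.length === 0) return false;
  const total = perQueryRelevances.reduce((sum, rels) => sum + ndcgAtK(rels, 10), 0);
  return total / perQueryRelevances.length >= threshold;
}
```

Each inner array holds the graded relevance labels of one query's results in the order they were served.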

5.3 Serving

const scored = await ai.ranker.rerank({
  tenantId,
  candidates: candidates.map(c => ({ id: c.id })),
  features: candidates.map(c => featurize(c, user)),
});

Timeout: 50 ms hard; on timeout or error, fall back to the lexical+RRF score.
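
The timeout rule can be sketched with a small wrapper around the rerank call. `withTimeoutFallback` and `scoreCandidates` are hypothetical helpers; `rerank` stands in for `ai.ranker.rerank`, and `rrfScores` is assumed to hold the already-computed fallback scores in candidate order.

```typescript
// Resolves with `fallback` if `work` rejects or does not settle within `ms`.
function withTimeoutFallback<T>(work: Promise<T>, ms: number, fallback: T): Promise<T> {
  return new Promise<T>(resolve => {
    const timer = setTimeout(() => resolve(fallback), ms);
    work.then(
      value => {
        clearTimeout(timer);
        resolve(value);
      },
      () => {
        clearTimeout(timer);
        resolve(fallback);
      },
    );
  });
}

const RERANK_TIMEOUT_MS = 50; // hard timeout from the text

// Returns L2R scores when the ranker answers in time, otherwise the fallback scores.
async function scoreCandidates(
  rerank: () => Promise<{ scores: number[] }>,
  rrfScores: number[],
): Promise<number[]> {
  const result = await withTimeoutFallback(rerank(), RERANK_TIMEOUT_MS, { scores: rrfScores });
  return result.scores;
}
```

Note that the wrapper only stops waiting; it does not cancel the in-flight request. Cancelling would need something like an abort signal, if the gateway client supports one.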

6. Content Safety & PII

Before embedding or sending any content to LLMs, search-service runs content through ai-gateway's sanitizer:

Strips email/phone/ID numbers via regex + Presidio.
Redacts named-entity PII in user documents.
Refuses to embed if visibility=public and sanitizer flags unredactable fragments.
Logs every refusal with doc id for audit.

For visibility = org | private, PII is permitted in embeddings but the resulting vectors never leave the tenant's pgvector scope.
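
The visibility rules in this section can be summarised in a small decision function. This is only a sketch: the real sanitizer lives in ai-gateway, the two regexes are deliberately simplistic stand-ins for its regex and Presidio passes, and `hasUnredactable` is a hypothetical callback representing whatever detector flags fragments that cannot be redacted.

```typescript
type Visibility = "public" | "marketplace" | "org" | "private";

// Simplistic illustrative patterns; not the gateway's actual rules.
const EMAIL = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;
const PHONE = /\+?\d[\d\s().-]{7,}\d/g;

function scrubWithRegex(text: string): string {
  return text.replace(EMAIL, "[EMAIL]").replace(PHONE, "[PHONE]");
}

// Returns the text that may be embedded, or null when embedding is refused.
function prepareForEmbedding(
  docId: string,
  visibility: Visibility,
  raw: string,
  hasUnredactable: (scrubbed: string) => boolean,
  audit: (docId: string) => void,
): string | null {
  if (visibility === "org" || visibility === "private") {
    return raw; // PII permitted; vectors stay in the tenant's pgvector scope
  }
  const scrubbed = scrubWithRegex(raw);
  if (visibility === "public" && hasUnredactable(scrubbed)) {
    audit(docId); // every refusal is logged with the doc id
    return null;
  }
  return scrubbed;
}
```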

7. Query Expansion (opt-in)

When enabled for tenant:

Short queries (fewer than 4 words) or noisy queries (more than 2 typos according to the spellchecker) trigger expansion.
Gateway returns up to 3 rewrites.
Expansions run through lexical search only (cost-capped); scores merged via max-over-variants.

Disabled by default. Gated by tenantPolicy.search.queryExpansion.
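
A sketch of the trigger rule and the max-over-variants merge described above. The function names and the `Hit` shape are hypothetical, `typoCount` is assumed to come from the spellchecker, `enabledForTenant` stands for the `tenantPolicy.search.queryExpansion` flag, and skipping empty queries is an added assumption.

```typescript
interface Hit {
  id: string;
  score: number;
}

// Expansion triggers only for opted-in tenants, on short or noisy queries.
function shouldExpand(q: string, typoCount: number, enabledForTenant: boolean): boolean {
  if (!enabledForTenant) return false;
  const words = q.trim().split(/\s+/).filter(w => w.length > 0);
  if (words.length === 0) return false; // assumption: nothing to expand
  return words.length < 4 || typoCount > 2;
}

// Max-over-variants: each document keeps its best score across the original
// query and its rewrites; the result is sorted by that score, descending.
function mergeMaxOverVariants(resultsPerVariant: Hit[][]): Hit[] {
  const best = new Map<string, number>();
  for (const results of resultsPerVariant) {
    for (const hit of results) {
      const previous = best.get(hit.id);
      if (previous === undefined || hit.score > previous) best.set(hit.id, hit.score);
    }
  }
  const merged: Hit[] = [];
  best.forEach((score, id) => merged.push({ id, score }));
  return merged.sort((a, b) => b.score - a.score);
}
```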

8. Recommendation Explanations

Template-guided LLM prompt, producing ≤ 140 chars, no PII, grounded in the reason code.
Cached per (itemId, reasonCode, userId) for 24h.
The client can fall back to a hard-coded reason phrase if the explanation is missing.
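
The rules above could be combined roughly as follows. The reason codes and phrases are invented for illustration, the cache-key format is an assumption, and treating an over-length LLM response as missing is one possible policy rather than documented behaviour.

```typescript
const MAX_EXPLANATION_CHARS = 140;
const EXPLANATION_TTL_MS = 24 * 60 * 60 * 1000; // 24h cache lifetime

// Hypothetical hard-coded phrases keyed by reason code.
const FALLBACK_PHRASES = new Map<string, string>([
  ["similar_to_completed", "Similar to a course you completed."],
  ["popular_in_cohort", "Popular with learners like you."],
]);
const DEFAULT_PHRASE = "Recommended for you.";

// Cache key per (itemId, reasonCode, userId).
function explanationCacheKey(itemId: string, reasonCode: string, userId: string): string {
  return JSON.stringify([itemId, reasonCode, userId]);
}

// Uses the LLM text when present and within the length limit,
// otherwise the hard-coded phrase for the reason code.
function pickExplanation(llmText: string | undefined, reasonCode: string): string {
  const text = llmText === undefined ? "" : llmText.trim();
  if (text.length > 0 && text.length <= MAX_EXPLANATION_CHARS) return text;
  const phrase = FALLBACK_PHRASES.get(reasonCode);
  return phrase !== undefined ? phrase : DEFAULT_PHRASE;
}
```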

9. Cost Governance

| Budget | Owner | Limits |
| --- | --- | --- |
| Corpus embedding monthly tokens | platform | per-tenant soft cap + hard cap |
| Query embedding monthly tokens | platform | per-tenant hard cap |
| LLM expansion monthly tokens | tenant | tenant-configurable |
| LLM rec explanation tokens | tenant | tenant-configurable |

All limits are enforced by ai-gateway; when a limit is breached, search-service degrades gracefully.

10. Evaluation

| Metric | Target | Source |
| --- | --- | --- |
| NDCG@10 | ≥ 0.72 | golden judgments + L2R holdout |
| Rec CTR | baseline +15% | A/B via analytics-service |
| Expansion success rate | ≥ 40% | queries with >0 results after expansion that had 0 before |
| PII leak rate (sampled) | 0 | audit sample of 500 docs/month |
| Embedding hash cache hit | ≥ 70% | service metrics |

11. Fallback Hierarchy

hybrid (L2R) → hybrid (RRF) → lexical+quality → lexical → cache → static

On gateway degraded signal, search-service drops down one level.
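
The one-level step-down can be expressed as a walk along an ordered list. The level identifiers below are hypothetical labels for the levels in the chain above, and staying at `static` once the bottom is reached is an assumption.

```typescript
// Ordered from most to least capable, mirroring the hierarchy above.
const FALLBACK_LEVELS = [
  "hybrid-l2r",
  "hybrid-rrf",
  "lexical-quality",
  "lexical",
  "cache",
  "static",
] as const;

type FallbackLevel = typeof FALLBACK_LEVELS[number];

// Drops one level per degraded signal; the last level is terminal.
function stepDown(current: FallbackLevel): FallbackLevel {
  const index = FALLBACK_LEVELS.indexOf(current);
  const nextIndex = Math.min(index + 1, FALLBACK_LEVELS.length - 1);
  return FALLBACK_LEVELS[nextIndex];
}
```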

12. Model Rotation

When ai-gateway publishes ai.embedding.model.rotated.v1:

search-service schedules a rolling rebuild of embeddings (14-day budget).
Queries dual-read both model vectors during cutover (tag by embeddingModelId at kNN time).
Old vectors retained for 14d, then deleted.
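
One way to picture the dual-read and retention rules is sketched below. The names are hypothetical, and preferring the new model's hit when a document appears under both models is an assumption; similarity scores from different embedding models are generally not comparable, so the sketch avoids re-sorting across models. The source does not say which moment starts the 14-day retention clock, so that timestamp is left as a parameter.

```typescript
interface KnnHit {
  docId: string;
  score: number;
  embeddingModelId: string; // tag applied at kNN time
}

// Dual-read merge: keep all new-model hits, then add old-model hits only for
// documents that have not been re-embedded yet.
function mergeDualRead(newHits: KnnHit[], oldHits: KnnHit[]): KnnHit[] {
  const covered = new Set(newHits.map(h => h.docId));
  return newHits.concat(oldHits.filter(h => !covered.has(h.docId)));
}

const OLD_VECTOR_RETENTION_MS = 14 * 24 * 60 * 60 * 1000; // 14d

// retentionStartMs: whichever event starts the retention clock (not specified above).
function canDeleteOldVectors(retentionStartMs: number, nowMs: number): boolean {
  return nowMs - retentionStartMs >= OLD_VECTOR_RETENTION_MS;
}
```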

13. Forbidden Patterns

❌ Calling model vendors directly from search-service.
❌ Storing raw LLM responses without hashing + provenance.
❌ Embedding PII from visibility=public content.
❌ Using LLMs in the critical ranking path without a deterministic fallback.
