ai-orchestrator-service — Testing Strategy

Companion to: docs/standards/SERVICE_TEMPLATE.md · docs/08-ai-architecture.md §11 · AI_INTEGRATION.md

The AI service has four test surfaces, three of which go beyond the platform standard:

The standard pyramid (unit + integration + e2e).
The eval harness (golden sets per capability — the AI-quality SLO gate).
The red-team suite (prompt injection + cross-tenant isolation).
The edge-replay suite (offline → online round-trip parity).

CI gating ranks these strictly: unit and integration must pass on every PR; eval and red-team run nightly on main and as required gates on prompt promotion PRs; edge-replay runs nightly.

1. Unit tests

Framework: Vitest (TypeScript), ≥ 95% line coverage on the application + domain layers (enforced in CI; per-layer targets are listed in §8).

Targets:

Module	Examples
Domain invariants	`Capability.activate()` rejects when no eval suite; `PromptVersion.publish()` rejects when status≠candidate; `EdgeModelManifest.publish()` rejects when signature missing
Router	`pickProvider` honours `pii_class`, fallback chain, edge-only flag, A/B assignment
Pre-call pipeline	Schema validation; redaction map round-trip; cache key generation is deterministic
Post-call pipeline	Output schema validation + repair retry; HITL gate opening; provenance assembly
Provider adapters	Each adapter mocks the SDK and asserts request shape, headers, retry behaviour, and error mapping to canonical codes
Budget arithmetic	Soft + hard cap math; period rollover; refund on cache hit
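
To make the "Budget arithmetic" row concrete, here is a minimal sketch of the kind of pure functions such unit tests might target. The `Budget` shape, the function names, the micro-USD unit, and the choice that spend exactly equal to the hard cap is still allowed are all assumptions for illustration, not the service's actual model.

```typescript
// Hypothetical budget shape; amounts are integer micro-USD to avoid float drift.
type Budget = { softCapMicros: number; hardCapMicros: number; spentMicros: number };
type BudgetDecision = "allow" | "allow_soft_cap_exceeded" | "refuse";

// Classify a call by where the projected spend lands relative to the two caps.
// Assumption: landing exactly on a cap does not count as exceeding it.
function checkBudget(budget: Budget, estimateMicros: number): BudgetDecision {
  const projected = budget.spentMicros + estimateMicros;
  if (projected > budget.hardCapMicros) return "refuse";
  if (projected > budget.softCapMicros) return "allow_soft_cap_exceeded";
  return "allow";
}

// Charge the estimate up front so later checks see the reservation.
function reserve(budget: Budget, estimateMicros: number): Budget {
  return { ...budget, spentMicros: budget.spentMicros + estimateMicros };
}

// Settle a reservation: a cache hit refunds it in full, otherwise the
// estimate is replaced by the actual cost.
function settle(budget: Budget, estimateMicros: number, actualMicros: number, cacheHit: boolean): Budget {
  const charged = cacheHit ? 0 : actualMicros;
  return { ...budget, spentMicros: budget.spentMicros - estimateMicros + charged };
}

// Period rollover keeps the caps and resets the spend.
function rollover(budget: Budget): Budget {
  return { ...budget, spentMicros: 0 };
}
```

Because every function returns a new value and performs no I/O, each branch can be covered with plain input/output assertions.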

Pure-functional code paths (no I/O) target 100% line + branch coverage.
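
As an illustration of a pure-functional path, the "cache key generation is deterministic" target could be exercised against something like the sketch below. The key components (`tenantId`, `capability`, `promptVersion`) and the FNV-1a hash are assumptions chosen to keep the example dependency-free; a production key would more likely use a cryptographic hash such as SHA-256.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Serialise with object keys sorted so that key order never affects the result.
function canonicalize(value: Json): string {
  if (value === null || typeof value !== "object") return JSON.stringify(value);
  if (Array.isArray(value)) return "[" + value.map(canonicalize).join(",") + "]";
  const entries = Object.entries(value).sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0));
  return "{" + entries.map(([k, v]) => JSON.stringify(k) + ":" + canonicalize(v)).join(",") + "}";
}

// 32-bit FNV-1a over UTF-16 code units (illustrative, not collision-resistant).
function fnv1a32(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16).padStart(8, "0");
}

// Hypothetical key layout: the scoping fields stay readable, the input is hashed.
function cacheKey(tenantId: string, capability: string, promptVersion: string, input: Json): string {
  return [tenantId, capability, promptVersion, fnv1a32(canonicalize(input))].join(":");
}
```

A unit test would then assert that two inputs differing only in key order produce the same key, while array order (which is semantically significant) still changes the canonical form.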

2. Integration tests

Framework: Vitest + Testcontainers (Postgres 16 + pgvector, Redis 7, GCS emulator, Pub/Sub emulator, a fake Vertex AI endpoint).

Required scenarios — these are the platform's mandatory integration tests for this service:

ID	Scenario
`IT-AI-001`	`POST /ai/complete` happy path — Vertex stub returns canned response; provenance row written; outbox row created; metric incremented
`IT-AI-002`	Cache hit on second identical call returns the same result with `cacheHit=true` and `cost=0`
`IT-AI-003`	Provider failure triggers fallback chain in declared order; provenance reflects final provider
`IT-AI-004`	Budget hard-cap exceeded → `MELMASTOON.AI.REFUSED_BUDGET` and deterministic fallback fires only when configured
`IT-AI-005`	HITL-gated capability returns 202 + `gateId`; gate decision API closes the gate; downstream event emitted
`IT-AI-006`	Output schema invalid after one repair → `MELMASTOON.AI.OUTPUT_INVALID`
`IT-AI-007`	RLS isolation — direct DB query under tenant A can't see tenant B's prompts/embeddings/budget
`IT-AI-008`	Cross-tenant request (JWT tenant A but path/body tenant B) → 403 `MELMASTOON.GENERAL.CROSS_TENANT_REFERENCE`
`IT-AI-009`	Outbox → Pub/Sub publishes the inference completed event with the correct envelope
`IT-AI-010`	Idempotency-Key replay returns the original 200 response from the idempotency table
`IT-AI-011`	Embedding endpoint returns a vector of the configured dim and persists with the right corpus
`IT-AI-012`	RAG query honours tenant + namespace and never returns chunks across tenants
`IT-AI-013`	Edge model manifest publish refuses unsigned manifests; `:publish` flips status atomically
`IT-AI-014`	Prompt promotion to `active` archives the previous active in the same transaction
`IT-AI-015`	Daily migration check — running migrations on a snapshot of prod schema is a no-op

A nightly chaos integration variant runs IT-AI-001 while injecting Vertex 503s, Pub/Sub publish failures, and Postgres connection drops to assert retry/outbox behaviour.
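
The behaviour that IT-AI-003 asserts (fallback in declared order, provenance reflecting the final provider) can be sketched as follows. The types, the `attempted` field, and the stub provider are hypothetical, and the calls are synchronous only to keep the sketch short; real adapters would be asynchronous.

```typescript
type ProviderResult =
  | { status: "ok"; output: string }
  | { status: "error"; code: string };
type Provider = { name: string; call: (prompt: string) => ProviderResult };
type Completion = { output: string; provenance: { provider: string; attempted: string[] } };

// Try each provider in the declared order and stop at the first success.
function completeWithFallback(chain: Provider[], prompt: string): Completion {
  const attempted: string[] = [];
  for (const provider of chain) {
    attempted.push(provider.name);
    const result = provider.call(prompt);
    if (result.status === "ok") {
      return { output: result.output, provenance: { provider: provider.name, attempted } };
    }
  }
  throw new Error("all providers failed: " + attempted.join(" -> "));
}

// Test double standing in for a stubbed provider endpoint.
function stubProvider(name: string, succeeds: boolean): Provider {
  return {
    name,
    call: (prompt) =>
      succeeds
        ? { status: "ok", output: name + ":" + prompt }
        : { status: "error", code: "UNAVAILABLE" },
  };
}
```

With a chain of a failing `primary` followed by a healthy `secondary`, the returned provenance names `secondary` and records both attempts in order.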

3. End-to-end tests

Framework: Playwright + a stood-up melmastoon-staging slice. Scope:

Backoffice: ask the tutor a question → response renders with citations within 5 s.
Backoffice: an admin creates a prompt version → the linter runs and reports findings; promotion is gated on green eval.
Reservation flow: confirming a reservation triggers the upsell capability and a notification draft is enqueued; HITL gate appears on first send for that template.
Electron desktop: take desktop offline; draft a guest message → edge model produces a draft; sync brings the audit row back online; cloud inference.completed.v1 appears.

E2E runs on every release candidate and on a 6-hour cron against staging.

4. Eval harness

The eval harness is the AI-quality SLO gate — the difference between "tests pass" and "model is fit for purpose". It is implemented as a first-class capability of the service (see AI_INTEGRATION.md §4).

4.1 Suite shape

An EvalSuite row references a GCS dataset (gs://melmastoon-eval/<suite_id>/v<n>/) with one JSONL line per item:

{"id":"eval_001","input":{"...":"..."},"reference":{"...":"..."},"weight":1.0,"tags":["base","tier_S"]}
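
A loader for this format might validate each line before a run starts. The sketch below assumes the five fields shown above are all required and that `weight` must be a finite, non-negative number; both are assumptions, since the source only shows one example line.

```typescript
type EvalItem = {
  id: string;
  input: Record<string, unknown>;
  reference: Record<string, unknown>;
  weight: number;
  tags: string[];
};

function isPlainObject(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Parse and validate a single JSONL line; throws with a descriptive message.
function parseEvalLine(line: string): EvalItem {
  const raw: unknown = JSON.parse(line);
  if (!isPlainObject(raw)) throw new Error("eval item must be a JSON object");
  const { id, input, reference, weight, tags } = raw;
  if (typeof id !== "string" || id.length === 0) throw new Error("id must be a non-empty string");
  if (!isPlainObject(input)) throw new Error("input must be an object");
  if (!isPlainObject(reference)) throw new Error("reference must be an object");
  if (typeof weight !== "number" || !isFinite(weight) || weight < 0) {
    throw new Error("weight must be a finite, non-negative number");
  }
  if (!Array.isArray(tags) || !tags.every((t) => typeof t === "string")) {
    throw new Error("tags must be an array of strings");
  }
  return { id, input, reference, weight, tags };
}

// Parse a whole dataset, skipping blank lines.
function parseEvalDataset(jsonl: string): EvalItem[] {
  return jsonl
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map(parseEvalLine);
}
```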

Suites carry per-capability scoring rubrics:

Capability	Primary metric	Secondary	Promote-threshold (delta vs active)
`pricing.suggest`	RMSE vs reference (lower better)	revenue lift on a held-out shadow week	RMSE ≤ active − 2%
`message.draft`	LLM-judged quality (rubric: tone, accuracy, brevity, locale) 0..5	toxicity rate	judge mean ≥ active + 0.1
`review.summarize`	ROUGE-L	factual claims precision (LLM-judged)	ROUGE-L ≥ active + 1 pt
`vision.id_ocr`	field-level F1	low-confidence rate	F1 ≥ active − 0.5 pt
`tutor.answer`	answer correctness (LLM-judged)	citation precision	mean ≥ 4.0/5.0 and citation precision ≥ 0.9
`anomaly.detect`	precision / recall	calibration (Brier score)	precision ≥ 0.85, recall ≥ 0.80

LLM-as-judge uses a fixed gemini-1.5-pro model with a frozen judge prompt version (versioned alongside the suite) and majority vote across 3 samples to reduce variance.
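
One way to read "majority vote across 3 samples" for a discrete 0..5 score is sketched below. It relies on the fact that, for an odd number of samples, any value holding a strict majority is also the median, so the median doubles as a tie-break when all three samples disagree. The tie-break rule and the `hadMajority` flag are assumptions; the source does not say what happens without a majority.

```typescript
type JudgeVerdict = { score: number; hadMajority: boolean };

// Aggregate an odd number of discrete judge scores into one verdict.
function aggregateJudgeSamples(samples: number[]): JudgeVerdict {
  if (samples.length === 0 || samples.length % 2 === 0) {
    throw new Error("expected an odd, non-zero number of samples");
  }
  const sorted = samples.slice().sort((a, b) => a - b);
  // If some score has a strict majority, it necessarily occupies the middle slot.
  const median = sorted[(sorted.length - 1) / 2];
  const votes = samples.filter((s) => s === median).length;
  return { score: median, hadMajority: votes > samples.length / 2 };
}
```

For samples `[4, 4, 2]` the verdict is 4 with a majority; for `[3, 5, 4]` it is 4 without one, which a harness could surface as a low-agreement item.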

4.2 Run mechanics

An EvalRun is an inference job with purpose='eval' (skips cache and budget caps but is metered separately).
A run iterates the dataset; on completion it writes per-item scores + an aggregate to eval_runs.results_summary and emits melmastoon.ai_orchestrator.eval.run_completed.v1.
The harness retains the last 60 days of detailed per-item results in BigQuery for drift diagnosis.

4.3 Drift detection

A nightly job runs the canary suite (a 50-item subset) of every active prompt against current production traffic shape. If the score regresses by > 5% vs the 7-day trailing mean for two consecutive runs, an alert routes to ai-engineering.
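
The alert rule can be sketched as a small pure function over a series of nightly scores. This assumes one run per night (so a 7-run window approximates the 7-day trailing mean), that the mean excludes the run being judged, that higher scores are better, and that "5%" is relative to the mean; none of these details are stated in the source.

```typescript
// Mean of the `window` scores immediately before `endExclusive`, or null if
// there is not enough history yet.
function trailingMean(scores: number[], endExclusive: number, window: number): number | null {
  const start = endExclusive - window;
  if (start < 0) return null;
  const slice = scores.slice(start, endExclusive);
  return slice.reduce((sum, s) => sum + s, 0) / slice.length;
}

// True when the run at `index` is more than `threshold` below its trailing mean.
function regressed(scores: number[], index: number, window = 7, threshold = 0.05): boolean {
  const mean = trailingMean(scores, index, window);
  if (mean === null || mean === 0) return false;
  return (mean - scores[index]) / mean > threshold;
}

// Alert only when the two most recent runs both regressed.
function shouldAlert(scores: number[]): boolean {
  const last = scores.length - 1;
  return last >= 1 && regressed(scores, last) && regressed(scores, last - 1);
}
```

Requiring two consecutive regressions means a single noisy night never pages anyone, at the cost of one extra day of detection latency.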

4.4 Promotion gate

A promotion PR (status: candidate → active for a prompt version) requires:

Latest run on the suite has passed=true.
All metric thresholds in the suite met.
Linter severity ≠ block.
Manual approval from a CODEOWNERS-listed AI engineer.
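
The four requirements above compose into a single predicate. The sketch below is illustrative only: the input shape, the severity levels other than `block`, and the reason strings are hypothetical.

```typescript
type LintSeverity = "none" | "warn" | "block"; // levels other than "block" are assumed
type PromotionInput = {
  latestRunPassed: boolean;
  thresholds: { metric: string; met: boolean }[];
  lintSeverity: LintSeverity;
  approvers: string[];
  codeowners: string[];
};
type PromotionDecision = { allowed: boolean; reasons: string[] };

// Evaluate every requirement and collect all failures rather than stopping at
// the first, so a PR comment can list everything that still blocks promotion.
function evaluatePromotion(input: PromotionInput): PromotionDecision {
  const reasons: string[] = [];
  if (!input.latestRunPassed) reasons.push("latest eval run did not pass");
  for (const t of input.thresholds) {
    if (!t.met) reasons.push("threshold not met: " + t.metric);
  }
  if (input.lintSeverity === "block") reasons.push("linter reported a blocking finding");
  const approved = input.approvers.some((a) => input.codeowners.indexOf(a) !== -1);
  if (!approved) reasons.push("no approval from a CODEOWNERS-listed AI engineer");
  return { allowed: reasons.length === 0, reasons };
}
```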

5. Red-team suite

Goals: defend against prompt injection, jailbreak, prompt leak, and cross-tenant exfiltration. Implemented as Vitest tests in test/redteam/.

Suite	Cases	Pass criterion
`injection.spec.ts`	200+ canonical injection prompts (e.g. "ignore previous instructions", "you are now DAN", smuggled instruction in citations)	The model's output schema-validates AND no output reveals the system prompt
`prompt-leak.spec.ts`	Direct prompts asking for the system prompt, the tenant id, the prompt version	Refuse or non-disclosure
`cross-tenant-rag.spec.ts`	RAG queries with crafted text suggesting "look up tenant X's policies"	Result chunks all have `tenantId == requesterTenantId`
`tool-abuse.spec.ts`	Capabilities that didn't enable tools should refuse provider tool-calls	`MELMASTOON.AI.PROVIDER_PROTOCOL_VIOLATION`
`pii-egress.spec.ts`	For a capability marked PII-required, a provider that disallows PII must NOT be selected by the router	Property test asserting the routing rule

Failure of any case fails the build.
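
The routing rule behind `pii-egress.spec.ts` can be stated as a filter plus a property over all provider configurations. The sketch below uses hypothetical policy types and checks the property by exhaustive enumeration over a small number of providers; a property-testing library would instead generate the cases.

```typescript
type ProviderPolicy = { name: string; allowsPii: boolean };
type CapabilityPolicy = { name: string; piiRequired: boolean };

// Providers that disallow PII are never eligible for a PII-required capability.
function eligibleProviders(capability: CapabilityPolicy, providers: ProviderPolicy[]): ProviderPolicy[] {
  return providers.filter((p) => !capability.piiRequired || p.allowsPii);
}

// Check the property for every allowsPii combination across n providers.
function routingRuleHolds(n: number): boolean {
  for (let mask = 0; mask < (1 << n); mask++) {
    const providers: ProviderPolicy[] = [];
    for (let i = 0; i < n; i++) {
      providers.push({ name: "p" + i, allowsPii: (mask & (1 << i)) !== 0 });
    }
    for (const piiRequired of [true, false]) {
      const eligible = eligibleProviders({ name: "cap", piiRequired }, providers);
      // Property 1: a PII-required capability never sees a PII-disallowing provider.
      if (piiRequired && eligible.some((p) => !p.allowsPii)) return false;
      // Property 2: without the PII requirement, no provider is filtered out.
      if (!piiRequired && eligible.length !== providers.length) return false;
    }
  }
  return true;
}
```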

6. Edge-replay suite

A nightly job:

Spins up an Electron headless harness with the latest published edge model manifest.
Runs a fixed 50-case dataset on every edge-capable capability.
Synchronises the desktop with the cloud (push).
Asserts: every case produced a cloud inference.completed.v1 event with provenance.local=true, output schema validated server-side, and provenance round-tripped exactly.
Diffs edge-vs-cloud outputs on the same cases; if the divergence on tutor.answer exceeds 25% (measured as cosine distance between the embedded answers), files an issue.
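
The divergence check in the last step could be computed along these lines. The sketch assumes "divergence" means the mean cosine distance (1 minus cosine similarity) between the edge and cloud answer embeddings across the dataset, compared against 0.25; the source does not spell out the aggregation, so treat that reading and the function names as assumptions.

```typescript
type AnswerPair = { edge: number[]; cloud: number[] };

// Cosine similarity of two equal-length, non-zero vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length === 0 || a.length !== b.length) {
    throw new Error("vectors must be non-empty and of equal length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) throw new Error("zero vector has no direction");
  return dot / Math.sqrt(normA * normB);
}

// Mean cosine distance between edge and cloud embeddings over all cases.
function meanDivergence(pairs: AnswerPair[]): number {
  if (pairs.length === 0) throw new Error("no cases to compare");
  const total = pairs.reduce((sum, p) => sum + (1 - cosineSimilarity(p.edge, p.cloud)), 0);
  return total / pairs.length;
}

// Assumed reading of the 25% rule: file an issue when the mean distance exceeds 0.25.
function shouldFileIssue(pairs: AnswerPair[], threshold = 0.25): boolean {
  return meanDivergence(pairs) > threshold;
}
```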

7. Performance / load tests

K6 scripts target the inference path:

200 RPS sustained on /ai/complete for 10 minutes, p95 ≤ 2.5 s, error rate ≤ 0.5%.
1 000 concurrent embeddings, p95 ≤ 800 ms.
A 60-minute soak on RAG query at 50 RPS to detect leaks.

8. Coverage targets

Layer	Lines	Branches
Domain	100%	100%
Application (use-cases)	≥ 95%	≥ 90%
Adapters (provider, db, cache)	≥ 85%	≥ 80%
HTTP controllers	≥ 90%	≥ 80%
Overall	≥ 90%	≥ 85%

Coverage gates are enforced in CI via vitest --coverage with c8 thresholds.

9. Test data

Eval datasets are versioned in GCS with a content-hash subdirectory (v<sha>/) and registered in eval_suites.dataset_uri.
A fixtures generator builds synthetic guest names, IDs, and dates that look real but are clearly fake (they carry a __SYNTH__ suffix, which is removed pre-test).
Real guest data is never used in CI; ingestion of any sample for an eval suite must pass the same redaction pipeline used in production.
