ai-orchestrator-service — Testing Strategy
Companion to: docs/standards/SERVICE_TEMPLATE.md · docs/08-ai-architecture.md §11 · AI_INTEGRATION.md
The AI service has four test surfaces, three of which go beyond the platform standard:
- The standard pyramid (unit + integration + e2e).
- The eval harness (golden sets per capability — the AI-quality SLO gate).
- The red-team suite (prompt injection + cross-tenant isolation).
- The edge-replay suite (offline → online round-trip parity).
CI gating ranks these strictly: unit and integration must pass on every PR; eval and red-team run nightly on main and as required gates on prompt promotion PRs; edge-replay runs nightly.
1. Unit tests
Framework: Vitest (TypeScript), >= 95% line coverage on the application + domain layers (enforced in CI).
Targets:
| Module | Examples |
|---|---|
| Domain invariants | Capability.activate() rejects when no eval suite; PromptVersion.publish() rejects when status≠candidate; EdgeModelManifest.publish() rejects when signature missing |
| Router | pickProvider honours pii_class, fallback chain, edge-only flag, A/B assignment |
| Pre-call pipeline | Schema validation; redaction map round-trip; cache key generation is deterministic |
| Post-call pipeline | Output schema validation + repair retry; HITL gate opening; provenance assembly |
| Provider adapters | Each adapter mocks the SDK and asserts request shape, headers, retry behaviour, and error mapping to canonical codes |
| Budget arithmetic | Soft + hard cap math; period rollover; refund on cache hit |
Pure-functional code paths (no I/O) target 100% line + branch coverage.
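As an illustration of the "cache key generation is deterministic" target, the sketch below builds a key from a canonical serialisation in which object keys are sorted at every level. The key layout (tenant, capability, prompt version, input) and the function names are assumptions for illustration, not the service's actual scheme; a real implementation would probably hash the result.

```typescript
// JSON-compatible value; the request fields used below are hypothetical.
type JsonValue = string | number | boolean | null | JsonValue[] | { [key: string]: JsonValue };

// Serialise with object keys sorted at every level, so logically equal
// inputs produce identical strings regardless of property order.
function canonicalJson(value: JsonValue): string {
  if (value === null || typeof value !== "object") {
    return JSON.stringify(value);
  }
  if (Array.isArray(value)) {
    return "[" + value.map((v) => canonicalJson(v)).join(",") + "]";
  }
  const keys = Object.keys(value).sort();
  return "{" + keys.map((k) => JSON.stringify(k) + ":" + canonicalJson(value[k])).join(",") + "}";
}

// Assumed key layout: tenant, capability and prompt version scope the entry.
function cacheKey(tenantId: string, capability: string, promptVersion: string, input: JsonValue): string {
  return [tenantId, capability, promptVersion, canonicalJson(input)].join("|");
}
```

A unit test would then assert that two inputs differing only in property order map to the same key, and that changing the tenant changes the key.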
2. Integration tests
Framework: Vitest + Testcontainers (Postgres 16 + pgvector, Redis 7, GCS emulator, Pub/Sub emulator, a fake Vertex AI endpoint).
Required scenarios — these are the platform's mandatory integration tests for this service:
| ID | Scenario |
|---|---|
| IT-AI-001 | POST /ai/complete happy path: Vertex stub returns canned response; provenance row written; outbox row created; metric incremented |
| IT-AI-002 | Cache hit on second identical call returns the same result with cacheHit=true and cost=0 |
| IT-AI-003 | Provider failure triggers fallback chain in declared order; provenance reflects final provider |
| IT-AI-004 | Budget hard-cap exceeded → MELMASTOON.AI.REFUSED_BUDGET and deterministic fallback fires only when configured |
| IT-AI-005 | HITL-gated capability returns 202 + gateId; gate decision API closes the gate; downstream event emitted |
| IT-AI-006 | Output schema invalid after one repair → MELMASTOON.AI.OUTPUT_INVALID |
| IT-AI-007 | RLS isolation: direct DB query under tenant A can't see tenant B's prompts/embeddings/budget |
| IT-AI-008 | Cross-tenant request (JWT tenant A but path/body tenant B) → 403 MELMASTOON.GENERAL.CROSS_TENANT_REFERENCE |
| IT-AI-009 | Outbox → Pub/Sub publishes the inference completed event with the correct envelope |
| IT-AI-010 | Idempotency-Key replay returns the original 200 response from the idempotency table |
| IT-AI-011 | Embedding endpoint returns a vector of the configured dim and persists with the right corpus |
| IT-AI-012 | RAG query honours tenant + namespace and never returns chunks across tenants |
| IT-AI-013 | Edge model manifest publish refuses unsigned manifests; :publish flips status atomically |
| IT-AI-014 | Prompt promotion to active archives the previous active in the same transaction |
| IT-AI-015 | Daily migration check: running migrations on a snapshot of prod schema is a no-op |
A nightly chaos integration variant runs IT-AI-001 while injecting Vertex 503s, Pub/Sub publish failures, and Postgres connection drops to assert retry/outbox behaviour.
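The behaviour that IT-AI-003 asserts (fallback in declared order, provenance naming the final provider) can be sketched as follows. The adapter interface and provider names are hypothetical, and the calls are kept synchronous for brevity; real provider adapters would be asynchronous.

```typescript
// Hypothetical adapter shape: complete() throws on provider failure.
interface ProviderAdapter {
  name: string;
  complete: (prompt: string) => string;
}

interface CompletionResult {
  output: string;
  provider: string;    // final provider, as it would appear in provenance
  attempted: string[]; // every provider tried, in declared order
}

// Try each provider in the order given; return on the first success.
function completeWithFallback(chain: ProviderAdapter[], prompt: string): CompletionResult {
  const attempted: string[] = [];
  let lastError: unknown = undefined;
  for (const adapter of chain) {
    attempted.push(adapter.name);
    try {
      const output = adapter.complete(prompt);
      return { output, provider: adapter.name, attempted };
    } catch (err) {
      lastError = err; // remember the failure and move on to the next provider
    }
  }
  throw new Error(`all providers failed: ${attempted.join(" -> ")}; last error: ${String(lastError)}`);
}
```

An integration test would make the first stub fail, then assert both the order of `attempted` and that `provider` names the adapter that actually answered.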
3. End-to-end tests
Framework: Playwright + a stood-up melmastoon-staging slice. Scope:
- Backoffice: ask the tutor a question → response renders with citations within 5 s.
- Backoffice: an admin creates a prompt version → the linter runs and reports findings; promotion is gated on green eval.
- Reservation flow: confirming a reservation triggers the upsell capability and a notification draft is enqueued; HITL gate appears on first send for that template.
- Electron desktop: take desktop offline; draft a guest message → edge model produces a draft; sync brings the audit row back online; cloud inference.completed.v1 appears.
E2E runs on every release candidate and on a 6-hour cron against staging.
4. Eval harness
The eval harness is the AI-quality SLO gate — the difference between "tests pass" and "model is fit for purpose". It is implemented as a first-class capability of the service (see AI_INTEGRATION.md §4).
4.1 Suite shape
An EvalSuite row references a GCS dataset (gs://melmastoon-eval/<suite_id>/v<n>/) with one JSONL line per item:
{"id":"eval_001","input":{"...":"..."},"reference":{"...":"..."},"weight":1.0,"tags":["base","tier_S"]}
Suites carry per-capability scoring rubrics:
| Capability | Primary metric | Secondary | Promote-threshold (delta vs active) |
|---|---|---|---|
| pricing.suggest | RMSE vs reference (lower better) | revenue lift on a held-out shadow week | RMSE ≤ active − 2% |
| message.draft | LLM-judged quality (rubric: tone, accuracy, brevity, locale) 0..5 | toxicity rate | judge mean ≥ active + 0.1 |
| review.summarize | ROUGE-L | factual claims precision (LLM-judged) | ROUGE-L ≥ active + 1 pt |
| vision.id_ocr | field-level F1 | low-confidence rate | F1 ≥ active − 0.5 pt |
| tutor.answer | answer correctness (LLM-judged) | citation precision | mean ≥ 4.0/5.0 and citation precision ≥ 0.9 |
| anomaly.detect | precision / recall | calibration (Brier score) | precision ≥ 0.85, recall ≥ 0.80 |
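A loader for the JSONL item format shown in this section might look like the sketch below. The field names come from the example line; the validation rules (non-empty id, non-negative finite weight, string tags) are assumptions rather than documented constraints.

```typescript
// Shape of one dataset line, taken from the example item in this section.
interface EvalItem {
  id: string;
  input: Record<string, unknown>;
  reference: Record<string, unknown>;
  weight: number;
  tags: string[];
}

function isPlainObject(v: unknown): v is Record<string, unknown> {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}

// Parse and validate a single JSONL line; throws with the item id where known.
function parseEvalItem(line: string): EvalItem {
  const raw: unknown = JSON.parse(line);
  if (!isPlainObject(raw)) throw new Error("eval item must be a JSON object");
  const { id, input, reference, weight, tags } = raw;
  if (typeof id !== "string" || id.length === 0) throw new Error("id must be a non-empty string");
  if (!isPlainObject(input)) throw new Error(`${id}: input must be an object`);
  if (!isPlainObject(reference)) throw new Error(`${id}: reference must be an object`);
  if (typeof weight !== "number" || !isFinite(weight) || weight < 0) {
    throw new Error(`${id}: weight must be a non-negative finite number`);
  }
  if (!Array.isArray(tags) || !tags.every((t) => typeof t === "string")) {
    throw new Error(`${id}: tags must be an array of strings`);
  }
  return { id, input, reference, weight, tags };
}

// Parse a whole dataset file, skipping blank lines.
function parseEvalDataset(jsonl: string): EvalItem[] {
  return jsonl
    .split("\n")
    .map((l) => l.trim())
    .filter((l) => l.length > 0)
    .map((l) => parseEvalItem(l));
}
```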
LLM-as-judge uses a fixed gemini-1.5-pro model with a frozen judge prompt version (versioned alongside the suite) and majority vote across 3 samples to reduce variance.
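One way to combine the three judge samples is sketched below. It assumes integer scores and an odd sample count, and it falls back to the median when no score holds a strict majority; the tie-breaking rule is an assumption, since the text does not specify one. With an odd count, any score that holds a strict majority is also the median, so the function returns the median and reports whether a majority actually existed.

```typescript
interface JudgeVerdict {
  score: number;        // the agreed score (median of the samples)
  hasMajority: boolean; // true when more than half the samples gave this score
}

// Combine an odd number of judge samples into a single verdict.
function combineJudgeSamples(samples: number[]): JudgeVerdict {
  if (samples.length === 0 || samples.length % 2 === 0) {
    throw new Error("expected an odd, non-zero number of samples");
  }
  const sorted = samples.slice().sort((a, b) => a - b);
  const median = sorted[(sorted.length - 1) / 2];
  const votes = sorted.filter((s) => s === median).length;
  return { score: median, hasMajority: votes > samples.length / 2 };
}
```

For example, samples of 4, 5, 4 yield a score of 4 with a majority, while 3, 4, 5 yield 4 without one, which a harness could surface as a low-agreement item.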
4.2 Run mechanics
- An EvalRun is an inference job with purpose='eval' (skips cache and budget caps but is metered separately).
- A run iterates the dataset; on completion it writes per-item scores + an aggregate to eval_runs.results_summary and emits melmastoon.ai_orchestrator.eval.run_completed.v1.
- The harness retains the last 60 days of detailed per-item results in BigQuery for drift diagnosis.
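The aggregate written at the end of a run could be computed as below. This is a sketch under the assumption that the aggregate is a mean weighted by each item's weight field; the actual shape of results_summary is not specified here, and the ResultsSummary fields are hypothetical.

```typescript
interface ItemScore {
  id: string;
  score: number;
  weight: number; // from the dataset item
}

// Hypothetical summary shape; not the documented results_summary schema.
interface ResultsSummary {
  itemCount: number;
  totalWeight: number;
  weightedMean: number;
}

// Weighted mean of per-item scores.
function summarise(items: ItemScore[]): ResultsSummary {
  let totalWeight = 0;
  let weightedSum = 0;
  for (const item of items) {
    totalWeight += item.weight;
    weightedSum += item.score * item.weight;
  }
  if (totalWeight <= 0) throw new Error("total weight must be positive");
  return { itemCount: items.length, totalWeight, weightedMean: weightedSum / totalWeight };
}
```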
4.3 Drift detection
A nightly job runs the canary suite (a 50-item subset) of every active prompt against current production traffic shape. If the score regresses by > 5% vs the 7-day trailing mean for two consecutive runs, an alert routes to ai-engineering.
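The alert rule can be sketched as follows. The sketch assumes one canary score per night, higher is better, "regresses by > 5%" is relative to the baseline, and the 7-day trailing mean for a run is taken over the seven runs immediately before it. Under that last assumption, the baseline for the second run includes the first regressed run.

```typescript
// Mean of up to `window` scores ending just before index `endExclusive`.
function trailingMean(scores: number[], endExclusive: number, window: number): number {
  const slice = scores.slice(Math.max(0, endExclusive - window), endExclusive);
  let sum = 0;
  for (const s of slice) sum += s;
  return sum / slice.length;
}

// True when the run at `index` is more than `tolerance` below its trailing mean.
function regressed(scores: number[], index: number, window = 7, tolerance = 0.05): boolean {
  if (index < window) return false; // not enough history to form a baseline
  const baseline = trailingMean(scores, index, window);
  return scores[index] < baseline * (1 - tolerance);
}

// Alert only when the two most recent runs have both regressed.
// `scores` holds nightly canary scores, oldest first.
function shouldAlert(scores: number[]): boolean {
  const last = scores.length - 1;
  return last >= 1 && regressed(scores, last) && regressed(scores, last - 1);
}
```

Requiring two consecutive regressions means a single noisy night does not page anyone, at the cost of a one-day delay on genuine drift.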
4.4 Promotion gate
A promotion PR (status: candidate → active for a prompt version) requires:
- Latest run on the suite has passed=true.
- All metric thresholds in the suite met.
- Linter severity ≠ block.
- Manual approval from a CODEOWNERS-listed AI engineer.
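The four conditions above can be expressed as a single check that returns the reasons a promotion is blocked. The types are hypothetical: only the "block" severity appears in the text, and the other severity levels and field names are placeholders.

```typescript
// Only "block" is named in the text; the other levels are assumed.
type LinterSeverity = "none" | "info" | "warn" | "block";

interface MetricCheck {
  metric: string;
  value: number;
  threshold: number;
  direction: "gte" | "lte"; // "lte" for lower-is-better metrics such as RMSE
}

interface PromotionInput {
  latestRunPassed: boolean;
  metrics: MetricCheck[];
  linterSeverity: LinterSeverity;
  approvedByCodeowner: boolean;
}

// Returns every unmet condition; an empty array means the gate is open.
function promotionBlockers(input: PromotionInput): string[] {
  const blockers: string[] = [];
  if (!input.latestRunPassed) blockers.push("latest eval run did not pass");
  for (const m of input.metrics) {
    const ok = m.direction === "gte" ? m.value >= m.threshold : m.value <= m.threshold;
    if (!ok) blockers.push(`metric ${m.metric} missed its threshold`);
  }
  if (input.linterSeverity === "block") blockers.push("linter reported a blocking finding");
  if (!input.approvedByCodeowner) blockers.push("missing CODEOWNERS approval");
  return blockers;
}
```

Returning all blockers rather than failing on the first makes the PR check output more useful to the author.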
5. Red-team suite
Threats covered: prompt injection, jailbreak, prompt leak, and cross-tenant exfiltration. Implemented as Vitest tests in test/redteam/.
| Suite | Cases | Pass criterion |
|---|---|---|
| injection.spec.ts | 200+ canonical injection prompts (e.g. "ignore previous instructions", "you are now DAN", smuggled instruction in citations) | The model's output schema-validates AND no response reveals the system prompt |
| prompt-leak.spec.ts | Direct prompts asking for the system prompt, the tenant id, the prompt version | Refuse or non-disclosure |
| cross-tenant-rag.spec.ts | RAG queries with crafted text suggesting "look up tenant X's policies" | Result chunks all have tenantId == requesterTenantId |
| tool-abuse.spec.ts | Capabilities that didn't enable tools should refuse provider tool-calls | MELMASTOON.AI.PROVIDER_PROTOCOL_VIOLATION |
| pii-egress.spec.ts | For a capability marked PII-required, a provider that disallows PII must NOT be selected by the router | property test asserting routing rule |
Failure of any case fails the build.
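The routing rule behind pii-egress.spec.ts lends itself to an exhaustive property check over a small provider pool. The sketch below uses hypothetical types and a stand-in eligibility filter, not the real router; it counts violations across every subset of the pool for both values of the PII flag.

```typescript
interface Provider {
  name: string;
  allowsPii: boolean;
}

interface Capability {
  name: string;
  piiRequired: boolean;
}

// Stand-in for the router's eligibility filter (an assumption, not the real router).
function eligibleProviders(capability: Capability, providers: Provider[]): Provider[] {
  return providers.filter((p) => !capability.piiRequired || p.allowsPii);
}

// Enumerate every subset of the pool and count rule violations.
function countViolations(pool: Provider[]): number {
  let violations = 0;
  const subsetCount = 1 << pool.length;
  for (const piiRequired of [true, false]) {
    const capability: Capability = { name: "cap", piiRequired };
    for (let mask = 0; mask < subsetCount; mask++) {
      const subset = pool.filter((_, i) => (mask & (1 << i)) !== 0);
      const eligible = eligibleProviders(capability, subset);
      // A PII-required capability must never see a provider that disallows PII.
      if (piiRequired && eligible.some((p) => !p.allowsPii)) violations++;
      // Without the PII requirement, no provider should be filtered out.
      if (!piiRequired && eligible.length !== subset.length) violations++;
    }
  }
  return violations;
}
```

In the real suite the same loop would call the production router, and a property-testing library could replace the manual enumeration.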
6. Edge-replay suite
A nightly job:
- Spins up an Electron headless harness with the latest published edge model manifest.
- Runs a fixed 50-case dataset on every edge-capable capability.
- Synchronises the desktop with the cloud (push).
- Asserts: every case produced a cloud inference.completed.v1 event with provenance.local=true, output schema validated server-side, and provenance round-tripped exactly.
- Diffs edge-vs-cloud outputs on the same cases; if the divergence on tutor.answer exceeds 25% (cosine of embedded answers), files an issue.
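The divergence check in the last step might be computed as below. This sketch reads "divergence" as the mean cosine distance (1 minus cosine similarity) between edge and cloud answer embeddings across the dataset; that reading and the 0.25 default are assumptions, since the text does not define the aggregation.

```typescript
// Cosine similarity of two equal-length, non-zero vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length === 0) throw new Error("vectors must be non-empty and of equal length");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) throw new Error("zero vector has no direction");
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

interface AnswerPair {
  edge: number[];  // embedding of the edge model's answer
  cloud: number[]; // embedding of the cloud model's answer
}

// Mean cosine distance across all cases.
function meanDivergence(pairs: AnswerPair[]): number {
  if (pairs.length === 0) throw new Error("no cases to compare");
  let total = 0;
  for (const p of pairs) total += 1 - cosineSimilarity(p.edge, p.cloud);
  return total / pairs.length;
}

function exceedsDivergenceBudget(pairs: AnswerPair[], budget = 0.25): boolean {
  return meanDivergence(pairs) > budget;
}
```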
7. Performance / load tests
k6 scripts target the inference path:
- 200 RPS sustained on /ai/complete for 10 minutes, p95 ≤ 2.5 s, error rate ≤ 0.5%.
- 1 000 concurrent embeddings, p95 ≤ 800 ms.
- A 60-minute soak on RAG query at 50 RPS to detect leaks.
8. Coverage targets
| Layer | Lines | Branches |
|---|---|---|
| Domain | 100% | 100% |
| Application (use-cases) | ≥ 95% | ≥ 90% |
| Adapters (provider, db, cache) | ≥ 85% | ≥ 80% |
| HTTP controllers | ≥ 90% | ≥ 80% |
| Overall | ≥ 90% | ≥ 85% |
Coverage gates are enforced in CI via vitest --coverage with c8 thresholds.
9. Test data
- Eval datasets are versioned in GCS with a content-hash subdirectory (v<sha>/) and registered in eval_suites.dataset_uri.
- A fixtures generator builds synthetic guest names, IDs, dates that look real but are clearly fake (suffix __SYNTH__ removed pre-test).
- Real guest data is never used in CI; ingestion of any sample for an eval suite must pass the same redaction pipeline used in production.
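The __SYNTH__ convention could be implemented along the lines below. The name pools, the index-based generator, and the guard that refuses unmarked values are all assumptions for illustration; only the suffix itself comes from the text.

```typescript
const SYNTH_SUFFIX = "__SYNTH__";

// Hypothetical name pools; a real generator would be larger and locale-aware.
const FIRST_NAMES = ["Alex", "Sam", "Noor"];
const SURNAMES = ["Example", "Sample", "Placeholder"];

// Deterministic synthetic name for a non-negative integer index, tagged with the suffix.
function syntheticGuestName(index: number): string {
  const first = FIRST_NAMES[index % FIRST_NAMES.length];
  const last = SURNAMES[Math.floor(index / FIRST_NAMES.length) % SURNAMES.length];
  return `${first} ${last}${SYNTH_SUFFIX}`;
}

function isSynthetic(value: string): boolean {
  return value.length >= SYNTH_SUFFIX.length && value.slice(-SYNTH_SUFFIX.length) === SYNTH_SUFFIX;
}

// Remove the marker before the value is fed to a test. Refusing unmarked
// values (an assumed safeguard) keeps real data from slipping into fixtures.
function stripSynthetic(value: string): string {
  if (!isSynthetic(value)) throw new Error("refusing to use a value without the synthetic marker");
  return value.slice(0, value.length - SYNTH_SUFFIX.length);
}
```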