ai-orchestrator-service — Service Overview

Catalog summary: docs/03-microservices/ai-orchestrator-service.md · Strategic refs: 02 Enterprise Architecture · 04 Event-Driven Architecture · 05 API Design · 06 Data Models · 07 Security & Tenancy · 08 AI Architecture (canonical) · ADR-0003 Electron Offline-First · Standards: NAMING · Standards: ERROR_CODES · Standards: SERVICE_TEMPLATE

1. Purpose

ai-orchestrator-service is the single AI gateway of Ghasi Melmastoon. It is the only service in the platform that talks to model providers: Vertex AI (primary), Anthropic Claude and OpenAI (cloud fallback adapters), and ONNX Runtime Node on the Electron desktop (edge inference). Every AI capability advertised in docs/08-ai-architecture.md §7 (dynamic pricing suggestion, demand forecast, housekeeping schedule optimization, anomaly detection, upsell recommendation, smart guest message draft, review summarization, OCR for ID scan at check-in, voice transcription, description generation, translation drafts, AI tutor) reaches other services and BFFs only through this service.

It owns the capability catalog; the prompt registry with semver versioning, A/B rollout, and a deprecation policy; the model catalog spanning cloud and edge; per-tenant and per-feature cost control with soft and hard caps; content moderation (pre- and post-call); PII redaction; RAG over per-tenant pgvector namespaces; mandatory provenance metadata generation (AIProvenance); HITL gate orchestration (open, notify, capture decision, audit); the eval harness (golden sets, A/B promotion gates, drift detection on `active` prompt versions); and the edge model manifest, the signed list of ONNX models packaged with the Electron installer, whose SHA-256 digests and signatures are verified on every load.

It does not decide whether to invoke an AI capability — that is the calling feature's choice, expressed by capability id and tenant context. The gateway decides how to fulfill it: which model, which provider, with which prompt version, with which cache, with which guardrails, and how to record what happened.
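The split described above (the caller names a capability and tenant context, the gateway picks everything else) can be illustrated with a minimal sketch. The `CapabilityRequest` shape, its field names, and the list of gateway-owned fields are assumptions made for illustration; the real contract lives in API_CONTRACTS.md.

```typescript
// Hypothetical request shape; field names are assumptions for illustration.
interface TenantContext {
  tenantId: string;
  feature: string; // calling feature, e.g. for per-feature sub-budgets
}

interface CapabilityRequest {
  capability: string; // e.g. "pricing.suggest"
  context: TenantContext;
  input: Record<string, unknown>;
}

// Routing choices belong to the gateway, so callers may not send these.
const GATEWAY_OWNED_FIELDS = ["model", "provider", "promptVersion"];

function isRecord(v: unknown): v is Record<string, unknown> {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}

function parseCapabilityRequest(raw: unknown): CapabilityRequest {
  if (!isRecord(raw)) throw new Error("request must be an object");
  for (const field of GATEWAY_OWNED_FIELDS) {
    if (field in raw) throw new Error(`callers may not specify '${field}'`);
  }
  const { capability, context, input } = raw;
  if (typeof capability !== "string" || !/^[a-z]+\.[a-z_]+$/.test(capability)) {
    throw new Error("capability must look like 'domain.name'");
  }
  const tenantId = isRecord(context) ? context.tenantId : undefined;
  const feature = isRecord(context) ? context.feature : undefined;
  if (typeof tenantId !== "string" || typeof feature !== "string") {
    throw new Error("context must carry tenantId and feature");
  }
  if (!isRecord(input)) throw new Error("input must be an object");
  return { capability, context: { tenantId, feature }, input };
}
```

Rejecting a `model` field outright, rather than ignoring it, is one way to keep callers from drifting toward direct model selection.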

2. Bounded Context

Field	Value
Bounded context	AI
Subdomain type	Core — the AI thesis is the platform's force multiplier for small/medium hotels in low-resource markets
Strategic patterns	Open Host Service (capability catalog + REST surface) · Conformist (callers conform to the typed `AIClient` port) · Anti-Corruption Layer (provider adapters isolate Vertex AI / Anthropic / OpenAI / ONNX from the rest of the platform)
Bounded context map	`ai-orchestrator-service` ◀── (capability calls) ── every service + every BFF + Electron desktop · `ai-orchestrator-service` ──▶ (only) Vertex AI / Anthropic / OpenAI / ONNX edge
Ubiquitous language	Capability, Prompt, PromptVersion, Model, ModelDeployment, Provider, InferenceRequest, InferenceResult, Provenance, EvalSuite, EvalRun, RAGCorpus, Embedding, BudgetCounter, HitlGate, HitlDecision, EdgeModelManifest, FallbackChain, CostClass, LatencyClass, SafetyVerdict, RoutingDecision

3. Responsibilities (in scope)

#	Responsibility	Detail
1	Capability catalog	Versioned registry binding capability id → prompt template + default model + fallback chain + HITL gate config + eval suite + latency target + cost class + JSON output schema
2	Prompt registry	Semver versions per `(domain, ordinal)`; `draft → active → deprecated → retired` lifecycle; eval-gated promotion
3	Model catalog	Cloud + edge models with cost class, latency class, modality, context window
4	Provider routing	`pickProvider(capability, context)` — cloud (Vertex primary) vs edge; per-capability fallback chain on provider error or unhealthy circuit
5	Cost & budget	Per-tenant monthly token budget, soft cap (warn 80%) + hard cap (degrade 100%); per-feature sub-budgets; real-time counters
6	RAG	Per-tenant pgvector namespaces; HNSW indexes; chunking + ingestion pipeline; cross-tenant guard
7	Provenance	`AIProvenance` stamped on every artifact persisted or returned; a CI gate in every sibling service blocks persistence without it
8	HITL gates	Open `HitlGate` row, notify reviewer (via `notification-service`), capture `HitlDecision` within SLA, audit
9	Content moderation	Pre-call on input (block on `harm_high`); post-call on output (block on `harm_*`, `hate`, `sexual`, `dangerous`, `pii_exposed`)
10	PII redaction	Pre-call strip of emails, phones, government IDs, credit-card-shaped strings, IBANs from anything bound for the model and from logs
11	Embedding generation	Vertex `text-embedding-004` (cloud) or `all-MiniLM-L6-v2` (edge); batched
12	Vector search	k-NN over per-tenant HNSW indexes; per-call `SET LOCAL hnsw.ef_search` tuned per use case
13	Eval harness	Golden sets per capability; precision/recall + acceptance metrics; A/B promotion gate; drift alerts on `active`
14	A/B prompt rollout	New prompt versions ship as `draft` → 5% sticky-by-tenant traffic → eval + 7-day production review → promote to `active`
15	Edge model manifest	Signed JSON manifest packaged with Electron installer; integrity verified on every model load
16	Telemetry	Token counts, latency, cost, cache hits per `(tenant_id, capability, model)` to BigQuery `ai_calls_fact` + Cloud Monitoring
17	Cache	Per-tenant prompt+input hash cache in Memorystore; TTL per capability
18	GDPR participation	Purge per-tenant embeddings + RAG corpora + cached artifacts within 7 days on guest erasure
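The soft and hard caps in row 5 can be sketched as a single check against a budget counter. This is a minimal illustration: the `BudgetCounter` fields, the verdict names, and the choice to evaluate projected usage (current usage plus the tokens about to be spent) are assumptions, not the service's actual data model.

```typescript
// Verdicts mirror row 5: warn at the soft cap, degrade at the hard cap.
type BudgetVerdict = "ok" | "warn" | "degrade";

// Hypothetical counter shape; the real schema is defined in DATA_MODEL.md.
interface BudgetCounter {
  monthlyTokenBudget: number;
  tokensUsed: number;
}

const SOFT_CAP_PERCENT = 80; // soft cap from row 5; the hard cap is 100%

function checkBudget(counter: BudgetCounter, requestedTokens: number): BudgetVerdict {
  // Assumption: the check runs on projected usage, before the call is made.
  const projected = counter.tokensUsed + requestedTokens;
  if (projected >= counter.monthlyTokenBudget) return "degrade";
  // Integer comparison avoids floating-point edge cases at exactly 80%.
  if (projected * 100 >= counter.monthlyTokenBudget * SOFT_CAP_PERCENT) return "warn";
  return "ok";
}
```

A `degrade` verdict would route the call to the capability's deterministic fallback rather than rejecting it, in line with the rationale given for the per-tenant budget.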

4. Non-Responsibilities (explicitly out of scope)

#	Concern	Owner
1	Deciding whether to invoke AI for a feature	Calling service / BFF
2	Authoring business logic that consumes AI output (e.g., when to actually update a price)	`pricing-service` and others
3	Storing AI artifacts that are domain-owned (e.g., the published price)	Owning service (with `aiProvenanceRef` foreign reference)
4	Sending guest-facing messages	`notification-service` (we hand it the HITL-accepted draft)
5	Persisting reviewer identity profile	`iam-service` + `tenant-service`
6	Long-term archive of AI artifacts beyond audit	`audit-service` (we feed it events)
7	Authoring tenant policies / FAQ / SOP source documents	`theme-config-service` / tenant authoring tool
8	Sync engine to the desktop	`sync-service` (we publish the snapshot of the registry + manifest)

5. Dependencies

5.1 Upstream (we depend on)

Dependency	Relationship	Failure handling
Vertex AI (Gemini 1.5 Pro / Flash / Flash-8B; `text-embedding-004`; Document AI; Speech-to-Text)	Synchronous — primary cloud provider	Fallback chain: `gemini-pro → claude-3-5-sonnet → gpt-4.1`; `gemini-flash → claude-3-5-sonnet → gpt-4o-mini`; circuit breaker per provider
Anthropic Claude (via Vertex partner endpoint)	Synchronous — fallback adapter	Marked degraded after 5 consecutive errors; resumes on probe
OpenAI	Synchronous — tertiary fallback	Same circuit-breaker model; restricted to capabilities where data residency permits
Cloud SQL Postgres + pgvector	Synchronous — capability catalog, prompt registry, RAG, audit	Read replica fallback for catalog reads; writes block until primary recovers
Memorystore Redis (HA)	Synchronous — cache, HITL SLA timers, rate limiter	Postgres fallback for catalog reads (degraded latency); cache miss tolerated
GCS (`melmastoon-ai-artifacts-<env>`)	Synchronous — eval datasets, signed ONNX models, prompt fixtures	Cached in memory; degraded ingestion blocks eval runs only
Cloud KMS	Synchronous — manifest signature key, secrets envelope	Boot fails if KMS unreachable; cached signing context (5 min)
Secret Manager	Synchronous on boot	Cached; rotated via SIGHUP
Pub/Sub	Asynchronous — outbox publish + event consumption	Outbox table buffers; retry with backoff
`tenant-service`	Asynchronous — tenant region pin + plan limits	Cached for 5 min; degraded to last-known on error
`iam-service`	Synchronous — JWT verification + reviewer role assertion on HITL	JWKS cached; circuit breaker
`notification-service`	Asynchronous — HITL gate notification dispatch	Outbox; never blocks the inference response
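The provider failure handling above (a circuit breaker per provider, degraded after 5 consecutive errors, resumed on probe, with a fallback chain walked in order) could look roughly like the sketch below. `ProviderCircuit`, `ChainStep`, and `callWithFallback` are hypothetical names, and the probe scheduling itself is left out.

```typescript
// Per-provider breaker: degraded after N consecutive errors, healthy again
// only after a successful probe. The default of 5 comes from the table above.
class ProviderCircuit {
  private consecutiveErrors = 0;
  private degraded = false;
  private readonly threshold: number;

  constructor(threshold = 5) {
    this.threshold = threshold;
  }

  isHealthy(): boolean {
    return !this.degraded;
  }

  recordSuccess(): void {
    this.consecutiveErrors = 0;
  }

  recordError(): void {
    this.consecutiveErrors += 1;
    if (this.consecutiveErrors >= this.threshold) this.degraded = true;
  }

  recordProbeSuccess(): void {
    this.consecutiveErrors = 0;
    this.degraded = false;
  }
}

interface ChainStep<T> {
  provider: string;
  circuit: ProviderCircuit;
  call: () => Promise<T>;
}

// Walk the chain in order, skipping degraded providers.
async function callWithFallback<T>(
  chain: ChainStep<T>[],
): Promise<{ provider: string; result: T }> {
  for (const step of chain) {
    if (!step.circuit.isHealthy()) continue;
    try {
      const result = await step.call();
      step.circuit.recordSuccess();
      return { provider: step.provider, result };
    } catch {
      step.circuit.recordError(); // fall through to the next provider
    }
  }
  throw new Error("fallback chain exhausted");
}
```

In this sketch a degraded provider is skipped without a call attempt, so a sustained outage costs no extra latency on later requests.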

5.2 Downstream (depend on us)

Consumer	What they consume	Coupling
Every feature service invoking AI (`pricing-service`, `housekeeping-service`, `reservation-service`, `billing-service`, `lock-integration-service`, `iam-service` adaptive MFA, `theme-config-service`, `search-aggregation-service`, etc.)	`POST /api/v1/ai/complete` and friends	OHS / Conformist
`bff-backoffice-service`	Tutor + draft assist + eval dashboards	Direct REST
`bff-tenant-booking-service`	Booking conversion assist (consumer hint)	Direct REST
`bff-consumer-service`	Booking conversion assist	Direct REST
Electron desktop (`@ghasi/app-desktop-backoffice`)	Edge inference via local ONNX + cloud passthrough; pulls signed `EdgeModelManifest`; pulls prompt registry snapshot	Sync + REST
`audit-service`	Every `melmastoon.ai_orchestrator.*` event (regulated retention)	Append-only ingest
`reporting-service`	Cost + acceptance dashboards via BigQuery `ai_calls_fact`	Read-only

6. Architecture Diagram

                                     ┌──────────────────────────────────┐
                                     │ 1. Edge / API Gateway            │
                                     │ Cloud Armor + WAF + mTLS         │
                                     │ rate-limit per (tenant, feature) │
                                     └──────────────┬───────────────────┘
                                                    │
        ┌───────────────────────────────────────────┴────────────────────────────────────────┐
        │                                                                                    │
        ▼                                                                                    ▼
┌────────────────────┐                                                              ┌─────────────────────┐
│ 2. Inference API   │                                                              │ 8. Admin API        │
│ /ai/complete /embed│                                                              │ prompts, eval, mfst │
│ /moderate /rag     │                                                              └────────┬────────────┘
│ /vision /transcribe│                                                                       │
└─────┬──────────────┘                                                                       │
      │                                                                                      │
      ▼                                                                                      │
┌─────────────────────────────────────────────────────────────────────────────────────┐      │
│ 3. Pre-call pipeline                                                                │      │
│   ─ moderate input  ─ redact PII  ─ check budget  ─ pin prompt version              │      │
│   ─ assemble system prompt  ─ hash input  ─ cache lookup  ─ HITL pre-check          │      │
└─────┬───────────────────────────────────────────────────────────────────────────────┘      │
      │ cache hit ──────────────────────────────────────────────────────────────────────┐    │
      │                                                                                 ▼    │
      ▼                                                                          ┌──────────────────┐
┌─────────────────────────┐         ┌───────────────────────────────┐            │ Memorystore      │
│ 4. Router               │────────▶│ 5. Provider adapters          │            │ prompt+input     │
│   pickProvider(...)     │         │   ─ vertex.adapter.ts         │            │ hash cache       │
│   capability fallback   │         │   ─ anthropic.adapter.ts      │            └──────────────────┘
└─────┬───────────────────┘         │   ─ openai.adapter.ts         │
      │                             │   ─ onnx-edge passthrough     │            ┌──────────────────┐
      │                             └────────────┬──────────────────┘            │ Vertex AI        │
      ▼                                          │                               │ (primary cloud)  │
┌─────────────────────────────────┐              ▼                               └──────────────────┘
│ 6. Post-call pipeline           │      ┌────────────────────────────┐
│   ─ moderate output             │      │ 7. Persist + emit          │
│   ─ schema validate (+ repair)  │─────▶│   ─ provenance row         │
│   ─ stamp AIProvenance          │      │   ─ outbox events          │
│   ─ open HITL gate if required  │      │   ─ BigQuery streaming     │
└─────────────────────────────────┘      └────────────────────────────┘

Sections:

1. Edge gateway: Cloud Armor + WAF; per-(tenant, feature) rate limits.
2. Inference API: REST surface; mTLS for service-to-service; JWT for BFF callers.
3. Pre-call pipeline: moderation, redaction, budget check, prompt pinning, system-prompt assembly, input hash + cache lookup, HITL pre-check (some capabilities require HITL on every call, e.g., guest-facing message dispatch).
4. Router: pickProvider(capability, context); per-capability fallback chain.
5. Provider adapters: exactly four, each implementing AIProviderPort. Adapters are the only modules that import provider SDKs.
6. Post-call pipeline: output moderation, JSON-schema validation with one repair attempt, provenance stamping, optional HITL-gate opening.
7. Persistence + outbox: provenance row written transactionally with the artifact; outbox events for inference.completed, capability-specific suggestion.*, and hitl.gate_opened; BigQuery streaming for ai_calls_fact.
8. Admin API: prompt CRUD, eval triggers, edge model manifest publish; restricted to platform admins via JWT scope melmastoon:ai:admin.
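The "hash input + cache lookup" step of the pre-call pipeline can be sketched as follows. The key layout (`ai:cache:<tenantId>:<digest>`), the `stableStringify` helper, and the decision to fold the pinned prompt version into the digest are assumptions for illustration. The sketch also assumes PII redaction has already run, so the hashed input is the redacted form.

```typescript
import { createHash } from "node:crypto";

// Serialize with sorted object keys so logically equal inputs hash identically.
function stableStringify(value: unknown): string {
  if (value === undefined) return "null";
  if (value === null || typeof value !== "object") return JSON.stringify(value);
  if (Array.isArray(value)) return `[${value.map(stableStringify).join(",")}]`;
  const obj = value as Record<string, unknown>;
  const body = Object.keys(obj)
    .sort()
    .map((k) => `${JSON.stringify(k)}:${stableStringify(obj[k])}`)
    .join(",");
  return `{${body}}`;
}

// Hypothetical key layout: the tenant id sits outside the digest so that
// per-tenant purges can target a key prefix.
function cacheKey(tenantId: string, promptVersion: string, input: unknown): string {
  const digest = createHash("sha256")
    .update(promptVersion)
    .update("\n")
    .update(stableStringify(input))
    .digest("hex");
  return `ai:cache:${tenantId}:${digest}`;
}
```

Because the prompt version is part of the digest, promoting a new prompt version naturally misses the old cache entries instead of serving stale output.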

7. Capability Catalog (one row per capability — implementation detail)

The catalog is the single source of truth for the gateway. Every capability id used by callers must be a row here. New entries require an ADR or an equivalent record.

Capability id	Prompt	Default model	Fallback chain	HITL gate	Latency target (p95)	Cost class	Output schema	Eval suite
`pricing.suggest`	`PRMP_PRICING_001_v3`	`gemini-1.5-flash`	`claude-sonnet → gpt-4o-mini → fallback-deterministic`	Yes if deviation > 5% from BAR	1.5 s	Medium	`PricingSuggestion`	`EVAL_PRICING_001`
`pricing.demand_forecast`	`PRMP_PRICING_002_v2`	tabular + `gemini-1.5-flash-8b`	last-year naive	No	1 h batch	Low	`DemandForecast`	`EVAL_FORECAST_001`
`housekeeping.route`	none (solver) + optional `PRMP_HK_001_v1` annotation	`melmastoon-edge-hkt-v2.onnx`	`gemini-1.5-flash` (annotation only) → greedy nearest-floor	Yes (lead accepts)	500 ms on device	Free	`HousekeepingRoute`	`EVAL_HK_001`
`staff.shift_optimize`	`PRMP_STAFF_001_v1`	`gemini-1.5-flash`	rule-based scheduler	Yes (manager accepts)	2 s	Medium	`ShiftSchedule`	`EVAL_STAFF_001`
`anomaly.detect`	`PRMP_ANOMALY_001_v4` (explanation)	`melmastoon-edge-anomaly-v3.onnx` (classification); `gemini-1.5-flash` (explanation)	rule-based heuristics	Yes for any auto-block	200 ms (edge) / 2 s (explanation)	Low	`AnomalyVerdict`	`EVAL_ANOMALY_001`
`upsell.recommend`	`PRMP_UPSELL_001_v2`	`gemini-1.5-flash`	static rules	No	1 s	Medium	`UpsellOffer[]`	`EVAL_UPSELL_001`
`message.draft`	`PRMP_MSG_001_v3`	`gemini-1.5-flash` (online) / `phi-3-mini` (offline)	static templates per intent + locale	Yes — always	1.5 s online / 4 s offline	Medium	`MessageDraft`	`EVAL_MSG_001`
`review.summarize`	`PRMP_REVIEW_001_v2`	`gemini-1.5-pro`	`claude-3-5-sonnet` → rule-based theme extraction	No	8 s for ≤200 reviews	High	`ReviewSummary`	`EVAL_REVIEW_001`
`booking.conversion_hint`	`PRMP_BOOKING_001_v1`	`gemini-1.5-flash`	static FAQ links	No	800 ms	Medium	`ConversionHint`	`EVAL_BOOKING_001`
`tutor.answer`	`PRMP_TUTOR_001_v2`	`gemini-1.5-flash-8b` (online) / `phi-3-mini` (offline)	local FAQ vector search (MiniLM + cosine)	No	1.5 s	Low	`TutorAnswer`	`EVAL_TUTOR_001`
`description.generate`	`PRMP_DESC_001_v3`	`gemini-1.5-flash-8b`	template fill	Yes (tenant accepts before publish)	2 s	Low	`Description`	`EVAL_DESC_001`
`translation.draft`	`PRMP_TRANSLATE_001_v2`	`gemini-1.5-flash-8b`	Cloud Translation API	Yes (tenant per-locale)	3 s per chunk	Low	`TranslationDraft`	`EVAL_TRANSLATE_001`
`ocr.id_scan`	Document AI + `PRMP_OCR_001_v1`	Document AI + `gemini-1.5-flash-8b`	manual entry (image attached)	Yes — always	4 s end-to-end	Medium	`IdScanFields`	`EVAL_OCR_001`
`stt.transcribe`	Vertex Speech + `PRMP_STT_001_v1`	Vertex Speech-to-Text + `gemini-1.5-flash-8b`	manual taps	No (reversible action)	2 s	Low	`Transcription`	`EVAL_STT_001`
`vision.photo_quality`	none	`mobilenet-v3-small-image-quality.onnx` (edge) / Vertex Vision (cloud)	accept all	No	200 ms (edge)	Free	`PhotoQualityScore`	`EVAL_VISION_001`
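As a rough illustration of how catalog rows might be represented and resolved, the sketch below encodes two rows from the table and the `pricing.suggest` HITL condition. The `CapabilityEntry` type and function names are hypothetical. Treating the 5% deviation as an absolute value (either direction from BAR) is an assumption, as is gating when no usable BAR exists.

```typescript
type CostClass = "Free" | "Low" | "Medium" | "High";

// Hypothetical in-memory view of a catalog row.
interface CapabilityEntry {
  id: string;
  prompt: string;
  defaultModel: string;
  fallbackChain: string[];
  latencyTargetMs: number; // p95 target
  costClass: CostClass;
  outputSchema: string;
  evalSuite: string;
}

// Two rows copied from the table above.
const CATALOG = new Map<string, CapabilityEntry>([
  ["pricing.suggest", {
    id: "pricing.suggest",
    prompt: "PRMP_PRICING_001_v3",
    defaultModel: "gemini-1.5-flash",
    fallbackChain: ["claude-sonnet", "gpt-4o-mini", "fallback-deterministic"],
    latencyTargetMs: 1500,
    costClass: "Medium",
    outputSchema: "PricingSuggestion",
    evalSuite: "EVAL_PRICING_001",
  }],
  ["upsell.recommend", {
    id: "upsell.recommend",
    prompt: "PRMP_UPSELL_001_v2",
    defaultModel: "gemini-1.5-flash",
    fallbackChain: ["static rules"],
    latencyTargetMs: 1000,
    costClass: "Medium",
    outputSchema: "UpsellOffer[]",
    evalSuite: "EVAL_UPSELL_001",
  }],
]);

// Unknown ids are rejected: every capability used by callers must be a row.
function resolveCapability(id: string): CapabilityEntry {
  const entry = CATALOG.get(id);
  if (!entry) throw new Error(`unknown capability '${id}'`);
  return entry;
}

// HITL rule for pricing.suggest: gate when deviation from BAR exceeds 5%.
function pricingNeedsHitl(suggestedRate: number, bar: number): boolean {
  if (bar <= 0) return true; // assumption: no usable baseline means review
  return Math.abs(suggestedRate - bar) / bar > 0.05;
}
```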

8. Key Decisions

#	Decision	Rationale
1	Single AI gateway, no exceptions	Cost control, audit, model swap, prompt versioning, moderation, PII redaction, cache, observability: all impossible to enforce uniformly otherwise. Enforced by a CI dependency-graph gate on model SDK imports.
2	Vertex AI primary, Anthropic + OpenAI fallback	Native to GCP; private VPC; CMEK; data residency support. Heterogeneous fallback survives a single-provider outage.
3	Edge inference via ONNX Runtime Node on Electron main process	Renderer never has model bytes. Target markets are bandwidth-constrained. Phi-3-mini + MiniLM + custom anomaly classifier ship signed with the installer.
4	Capability catalog, not direct model calls	Callers request `capability: 'pricing.suggest'`, never `model: 'gemini-1.5-flash'`. Lets the gateway swap models, prompts, and providers without caller changes.
5	Prompts are first-class versioned artifacts	`PRMP_<DOMAIN>_<NUMBER>_v<n>` registry; `draft → active → deprecated → retired` lifecycle; A/B promotion gated on eval green.
6	`AIProvenance` is mandatory	Every persisted AI artifact carries provenance or it is not persisted. A CI gate enforces this in sibling services.
7	HITL gate is a first-class aggregate	Decisions are auditable, SLA-tracked, and linked to the resulting state-change event via `decisionId`.
8	pgvector for RAG with per-tenant namespacing	Postgres-native; one fewer engine to operate; RLS gives a second isolation line; HNSW indexes meet recall + latency targets at our scale.
9	Per-tenant token budget (soft + hard)	Cost is the most likely incident vector; degrade gracefully to deterministic fallback rather than blocking the user.
10	Edge model manifest signed with KMS, verified on every load	Tampering blocks load; supply-chain attack on the installer is detected at runtime.
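Decision 10 (integrity verified on every model load) can be sketched for the SHA-256 half of the check. The `ManifestEntry` shape is hypothetical, and the sketch assumes the manifest's own signature has already been verified against the KMS-held key, which is not shown here.

```typescript
import { createHash, timingSafeEqual } from "node:crypto";

// Hypothetical manifest entry: file name plus hex-encoded SHA-256 digest.
interface ManifestEntry {
  modelFile: string;
  sha256: string;
}

// Throws if the model bytes do not match the digest recorded in the manifest.
// Assumption: the manifest signature was checked before this function runs.
function verifyModelBytes(entry: ManifestEntry, bytes: Uint8Array): void {
  const actual = createHash("sha256").update(bytes).digest();
  const expected = Buffer.from(entry.sha256, "hex");
  // timingSafeEqual requires equal lengths, so compare lengths first.
  if (expected.length !== actual.length || !timingSafeEqual(expected, actual)) {
    throw new Error(`integrity check failed for ${entry.modelFile}`);
  }
}
```

Throwing rather than returning a boolean makes it harder for a caller to load a model while forgetting to inspect the result.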

9. Service Boundaries Summary

Inputs: capability invocations from any service or BFF (REST or Pub/Sub event request/reply); prompt + capability + model admin operations from platform admins; HITL decisions from reviewers.
Outputs: structured AI artifacts with AIProvenance; melmastoon.ai_orchestrator.* outbox events; BigQuery ai_calls_fact rows; Cloud Monitoring metrics; signed EdgeModelManifest snapshots.
State owned: capability catalog, prompt registry, model catalog, inference + result audit, provenance, HITL gates + decisions, budget counters, eval suites + runs, RAG corpora + chunks + embeddings (per-tenant pgvector), edge model manifests.
State referenced (not owned): tenant region pin and plan limits (from tenant-service); reviewer identity (from iam-service); recipient + delivery for HITL notifications (notification-service).

10. Phased Maturity

Phase	Capabilities live	Edge models	Notes
0 (MVP)	`pricing.suggest`, `anomaly.detect`, `upsell.recommend`, `message.draft`, `tutor.answer`, `vision.photo_quality`	anomaly classifier, image-quality scorer	Cloud-only LLM via single Vertex Flash model; provenance + HITL + budget from day 1
1	+ `description.generate`, `translation.draft`, `ocr.id_scan`, `review.summarize`, `housekeeping.route`	+ `phi-3-mini` (offline drafting), `MiniLM` (offline RAG), `melmastoon-edge-hkt-v2` (route optimizer)	Full model catalog; A/B prompt rollout; per-tenant cost dashboard
2	+ `stt.transcribe`, `staff.shift_optimize`, `booking.conversion_hint`	unchanged	Per-tenant RAG over policies/FAQ/SOPs (cloud + edge); residency-aware routing
3	Self-tuning prompts, per-tenant LoRA fine-tunes (with consent), federated edge model updates via signed differential packs	+ per-tenant LoRA adapters where data permits	Continuous eval pipelines auto-promote prompts on green

11. Cross-Reference Quick Index

Aggregates + invariants: DOMAIN_MODEL.md
Use cases + ports + orchestration: APPLICATION_LOGIC.md
REST contracts + error codes: API_CONTRACTS.md
Events published + consumed: EVENT_SCHEMAS.md
Tables, indexes, RLS, pgvector schemas: DATA_MODEL.md
Desktop sync of registry + manifest: SYNC_CONTRACT.md
Self-AI integration (eval harness as customer of own gateway): AI_INTEGRATION.md
Prompt injection, PII, cross-tenant: SECURITY_MODEL.md
SLOs, cost dashboards: OBSERVABILITY.md
Eval harness in detail: TESTING_STRATEGY.md
Cloud Run + Vertex AI region pinning: DEPLOYMENT_TOPOLOGY.md
Failure catalog + runbooks: FAILURE_MODES.md
Local LLM emulator + ONNX dev loop: LOCAL_DEV_SETUP.md
Readiness gate: SERVICE_READINESS.md
Risks + mitigations: SERVICE_RISK_REGISTER.md
Migration of existing prompts + RAG corpora: MIGRATION_PLAN.md
