ai-orchestrator-service — Service Overview
Catalog summary:
docs/03-microservices/ai-orchestrator-service.md · Strategic refs: 02 Enterprise Architecture · 04 Event-Driven Architecture · 05 API Design · 06 Data Models · 07 Security & Tenancy · 08 AI Architecture (canonical) · ADR-0003 Electron Offline-First · Standards · NAMING · Standards · ERROR_CODES · Standards · SERVICE_TEMPLATE
1. Purpose
ai-orchestrator-service is the single AI gateway of Ghasi Melmastoon. It is the only service in the platform that talks to model providers — Vertex AI (primary), Anthropic Claude and OpenAI (cloud fallback adapters), and ONNX Runtime Node on the Electron desktop (edge inference). Every AI capability advertised in docs/08-ai-architecture.md §7 — dynamic pricing suggestion, demand forecast, housekeeping schedule optimization, anomaly detection, upsell recommendation, smart guest message draft, review summarization, OCR for ID scan at check-in, voice transcription, description generation, translation drafts, AI tutor — is delivered by other services and BFFs only through this service.
It owns the capability catalog; the prompt registry (semver versioning, A/B rollout, deprecation policy); the model catalog spanning cloud and edge; per-tenant and per-feature cost control with soft and hard caps; content moderation (pre-call and post-call); PII redaction; RAG over per-tenant pgvector namespaces; mandatory provenance metadata generation (AIProvenance); HITL gate orchestration (open, notify, capture decision, audit); the eval harness (golden sets, A/B promotion gates, drift detection on active prompt versions); and the edge model manifest, the signed list of ONNX models packaged with the Electron installer, whose SHA-256 digest and signature are verified on every load.
It does not decide whether to invoke an AI capability — that is the calling feature's choice, expressed by capability id and tenant context. The gateway decides how to fulfill it: which model, which provider, with which prompt version, with which cache, with which guardrails, and how to record what happened.
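
The split between what the caller says and what the gateway decides can be sketched as follows. This is a minimal illustration, not the service's actual contract: the field names on `InferenceRequest`, `TenantContext`, and `RoutingDecision`, the `ROUTES` table, and `resolveRouting` are all assumptions made for the example. Only the capability ids, model names, and prompt ids are taken from the capability catalog in section 7.

```typescript
// Hypothetical caller-facing shapes; field names are illustrative assumptions.
type CapabilityId = "pricing.suggest" | "message.draft" | "tutor.answer";

interface TenantContext {
  tenantId: string;
  locale: string;
}

interface InferenceRequest {
  capability: CapabilityId; // what the caller wants done
  tenant: TenantContext; // on whose behalf
  input: Record<string, unknown>;
  // Deliberately no model, provider, or prompt-version field:
  // those are gateway decisions, not caller decisions.
}

interface RoutingDecision {
  model: string;
  provider: string;
  promptVersion: string;
}

// Gateway-side lookup (hypothetical); values copied from the catalog table.
const ROUTES: Record<CapabilityId, RoutingDecision> = {
  "pricing.suggest": { model: "gemini-1.5-flash", provider: "vertex", promptVersion: "PRMP_PRICING_001_v3" },
  "message.draft": { model: "gemini-1.5-flash", provider: "vertex", promptVersion: "PRMP_MSG_001_v3" },
  "tutor.answer": { model: "gemini-1.5-flash-8b", provider: "vertex", promptVersion: "PRMP_TUTOR_001_v2" },
};

// The caller never sees this function; it only receives the resulting artifact.
function resolveRouting(req: InferenceRequest): RoutingDecision {
  return ROUTES[req.capability];
}
```

Because the request type has no model field, swapping the model behind `pricing.suggest` in this sketch would only touch the gateway-side table.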
2. Bounded Context
| Field | Value |
|---|---|
| Bounded context | AI |
| Subdomain type | Core — the AI thesis is the platform's force multiplier for small/medium hotels in low-resource markets |
| Strategic patterns | Open Host Service (capability catalog + REST surface) · Conformist (callers conform to the typed AIClient port) · Anti-Corruption Layer (provider adapters isolate Vertex AI / Anthropic / OpenAI / ONNX from the rest of the platform) |
| Bounded context map | ai-orchestrator-service ◀── (capability calls) ── every service + every BFF + Electron desktop · ai-orchestrator-service ──▶ (only) Vertex AI / Anthropic / OpenAI / ONNX edge |
| Ubiquitous language | Capability, Prompt, PromptVersion, Model, ModelDeployment, Provider, InferenceRequest, InferenceResult, Provenance, EvalSuite, EvalRun, RAGCorpus, Embedding, BudgetCounter, HitlGate, HitlDecision, EdgeModelManifest, FallbackChain, CostClass, LatencyClass, SafetyVerdict, RoutingDecision |
3. Responsibilities (in scope)
| # | Responsibility | Detail |
|---|---|---|
| 1 | Capability catalog | Versioned registry binding capability id → prompt template + default model + fallback chain + HITL gate config + eval suite + latency target + cost class + JSON output schema |
| 2 | Prompt registry | Semver versions per (domain, ordinal); draft → active → deprecated → retired lifecycle; eval-gated promotion |
| 3 | Model catalog | Cloud + edge models with cost class, latency class, modality, context window |
| 4 | Provider routing | pickProvider(capability, context) — cloud (Vertex primary) vs edge; per-capability fallback chain on provider error or unhealthy circuit |
| 5 | Cost & budget | Per-tenant monthly token budget, soft cap (warn 80%) + hard cap (degrade 100%); per-feature sub-budgets; real-time counters |
| 6 | RAG | Per-tenant pgvector namespaces; HNSW indexes; chunking + ingestion pipeline; cross-tenant guard |
| 7 | Provenance | AIProvenance stamped on every artifact persisted or returned; CI gate refuses persistence without it in any sibling service |
| 8 | HITL gates | Open HitlGate row, notify reviewer (via notification-service), capture HitlDecision within SLA, audit |
| 9 | Content moderation | Pre-call on input (block on harm_high); post-call on output (block on harm_*, hate, sexual, dangerous, pii_exposed) |
| 10 | PII redaction | Pre-call strip of emails, phones, government IDs, credit-card-shaped strings, IBANs from anything bound for the model and from logs |
| 11 | Embedding generation | Vertex text-embedding-004 (cloud) or all-MiniLM-L6-v2 (edge); batched |
| 12 | Vector search | k-NN over per-tenant HNSW indexes; per-call SET LOCAL hnsw.ef_search tuned per use case |
| 13 | Eval harness | Golden sets per capability; precision/recall + acceptance metrics; A/B promotion gate; drift alerts on active |
| 14 | A/B prompt rollout | New prompt versions ship as draft → 5% sticky-by-tenant traffic → eval + 7-day production review → promote to active |
| 15 | Edge model manifest | Signed JSON manifest packaged with Electron installer; integrity verified on every model load |
| 16 | Telemetry | Token counts, latency, cost, cache hits per (tenant_id, capability, model) to BigQuery ai_calls_fact + Cloud Monitoring |
| 17 | Cache | Per-tenant prompt+input hash cache in Memorystore; TTL per capability |
| 18 | GDPR participation | Purge per-tenant embeddings + RAG corpora + cached artifacts within 7 d on guest erasure |
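
Row 14 describes routing 5% of traffic to a new prompt version in a way that is sticky by tenant. One way this could be done is deterministic bucketing on a hash of the tenant id, sketched below. The hash choice (an FNV-1a style hash over UTF-16 code units), the 100-bucket scheme, and the names `bucketFor` and `pickPromptVersion` are assumptions for illustration; the source does not say how stickiness is implemented.

```typescript
// FNV-1a style 32-bit hash over UTF-16 code units.
// Chosen here only because it is small and dependency-free (an assumption).
function fnv1a32(text: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash >>> 0;
}

// Maps a (prompt, tenant) pair to a stable bucket in [0, 100).
// Salting with the prompt id keeps the same tenants from always being the canaries.
function bucketFor(tenantId: string, promptId: string): number {
  return fnv1a32(`${promptId}:${tenantId}`) % 100;
}

// Returns the candidate version for tenants whose bucket falls under the rollout
// percentage, and the active version for everyone else.
function pickPromptVersion(
  tenantId: string,
  promptId: string,
  activeVersion: string,
  candidateVersion: string,
  rolloutPercent: number = 5,
): string {
  return bucketFor(tenantId, promptId) < rolloutPercent ? candidateVersion : activeVersion;
}
```

Since the bucket depends only on the tenant id and the prompt id, a given tenant sees the same version on every call for as long as the rollout percentage is unchanged.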
4. Non-Responsibilities (explicitly out of scope)
| # | Concern | Owner |
|---|---|---|
| 1 | Deciding whether to invoke AI for a feature | Calling service / BFF |
| 2 | Authoring business logic that consumes AI output (e.g., when to actually update a price) | pricing-service and others |
| 3 | Storing AI artifacts that are domain-owned (e.g., the published price) | Owning service (with aiProvenanceRef foreign reference) |
| 4 | Sending guest-facing messages | notification-service (we hand it the HITL-accepted draft) |
| 5 | Persisting reviewer identity profile | iam-service + tenant-service |
| 6 | Long-term archive of AI artifacts beyond audit | audit-service (we feed it events) |
| 7 | Authoring tenant policies / FAQ / SOP source documents | theme-config-service / tenant authoring tool |
| 8 | Sync engine to the desktop | sync-service (we publish the snapshot of the registry + manifest) |
5. Dependencies
5.1 Upstream (we depend on)
| Dependency | Relationship | Failure handling |
|---|---|---|
| Vertex AI (Gemini 1.5 Pro / Flash / Flash-8B; text-embedding-004; Document AI; Speech-to-Text) | Synchronous — primary cloud provider | Fallback chain: gemini-pro → claude-3-5-sonnet → gpt-4.1; gemini-flash → claude-3-5-sonnet → gpt-4o-mini; circuit breaker per provider |
| Anthropic Claude (via Vertex partner endpoint) | Synchronous — fallback adapter | Marked degraded after 5 consecutive errors; resumes on probe |
| OpenAI | Synchronous — tertiary fallback | Same circuit-breaker model; restricted to capabilities where data residency permits |
| Cloud SQL Postgres + pgvector | Synchronous — capability catalog, prompt registry, RAG, audit | Read replica fallback for catalog reads; writes block until primary recovers |
| Memorystore Redis (HA) | Synchronous — cache, HITL SLA timers, rate limiter | Postgres fallback for catalog reads (degraded latency); cache miss tolerated |
| GCS (melmastoon-ai-artifacts-<env>) | Synchronous — eval datasets, signed ONNX models, prompt fixtures | Cached in memory; degraded ingestion blocks eval runs only |
| Cloud KMS | Synchronous — manifest signature key, secrets envelope | Boot fails if KMS unreachable; cached signing context (5 min) |
| Secret Manager | Synchronous on boot | Cached; rotated via SIGHUP |
| Pub/Sub | Asynchronous — outbox publish + event consumption | Outbox table buffers; retry with backoff |
| tenant-service | Asynchronous — tenant region pin + plan limits | Cached for 5 min; degraded to last-known on error |
| iam-service | Synchronous — JWT verification + reviewer role assertion on HITL | JWKS cached; circuit breaker |
| notification-service | Asynchronous — HITL gate notification dispatch | Outbox; never blocks the inference response |
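
The table states that a fallback provider is marked degraded after 5 consecutive errors and resumes on probe. A minimal sketch of that state machine is shown below. The class name `ProviderCircuit` and its method names are hypothetical, and the sketch assumes that a success resets the error streak and that only a successful probe clears the degraded state.

```typescript
type ProviderHealth = "healthy" | "degraded";

// Hypothetical per-provider circuit; one instance per provider adapter.
class ProviderCircuit {
  private consecutiveErrors = 0;
  private health: ProviderHealth = "healthy";
  private readonly errorThreshold: number;

  constructor(errorThreshold: number = 5) {
    this.errorThreshold = errorThreshold;
  }

  // A successful call breaks the error streak (assumption).
  recordSuccess(): void {
    this.consecutiveErrors = 0;
  }

  // The provider is marked degraded once the streak reaches the threshold.
  recordError(): void {
    this.consecutiveErrors += 1;
    if (this.consecutiveErrors >= this.errorThreshold) {
      this.health = "degraded";
    }
  }

  // Health probes run out of band; only a passing probe restores traffic.
  recordProbe(ok: boolean): void {
    if (ok) {
      this.health = "healthy";
      this.consecutiveErrors = 0;
    }
  }

  isAvailable(): boolean {
    return this.health === "healthy";
  }
}
```

A router could consult `isAvailable()` before each attempt and skip degraded providers without calling them.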
5.2 Downstream (depend on us)
| Consumer | What they consume | Coupling |
|---|---|---|
| Every feature service invoking AI (pricing-service, housekeeping-service, reservation-service, billing-service, lock-integration-service, iam-service adaptive MFA, theme-config-service, search-aggregation-service, etc.) | POST /api/v1/ai/complete and friends | OHS / Conformist |
| bff-backoffice-service | Tutor + draft assist + eval dashboards | Direct REST |
| bff-tenant-booking-service | Booking conversion assist (consumer hint) | Direct REST |
| bff-consumer-service | Booking conversion assist | Direct REST |
| Electron desktop (@ghasi/app-desktop-backoffice) | Edge inference via local ONNX + cloud passthrough; pulls signed EdgeModelManifest; pulls prompt registry snapshot | Sync + REST |
| audit-service | Every melmastoon.ai_orchestrator.* event (regulated retention) | Append-only ingest |
| reporting-service | Cost + acceptance dashboards via BigQuery ai_calls_fact | Read-only |
6. Architecture Diagram
┌──────────────────────────────────┐
│ 1. Edge / API Gateway │
│ Cloud Armor + WAF + mTLS │
│ rate-limit per (tenant, feature) │
└──────────────┬───────────────────┘
│
┌───────────────────────────────────────────┴────────────────────────────────────────┐
│ │
▼ ▼
┌────────────────────┐ ┌─────────────────────┐
│ 2. Inference API │ │ 8. Admin API │
│ /ai/complete /embed│ │ prompts, eval, mfst │
│ /moderate /rag │ └────────┬────────────┘
│ /vision /transcribe│ │
└─────┬──────────────┘ │
│ │
▼ │
┌─────────────────────────────────────────────────────────────────────────────────────┐ │
│ 3. Pre-call pipeline │ │
│ ─ moderate input ─ redact PII ─ check budget ─ pin prompt version │ │
│ ─ assemble system prompt ─ hash input ─ cache lookup ─ HITL pre-check │ │
└─────┬───────────────────────────────────────────────────────────────────────────────┘ │
│ cache hit ──────────────────────────────────────────────────────────────────────┐ │
│ ▼ │
▼ ┌──────────────────┐
┌─────────────────────────┐ ┌───────────────────────────────┐ │ Memorystore │
│ 4. Router │────────▶│ 5. Provider adapters │ │ prompt+input │
│ pickProvider(...) │ │ ─ vertex.adapter.ts │ │ hash cache │
│ capability fallback │ │ ─ anthropic.adapter.ts │ └──────────────────┘
└─────┬───────────────────┘ │ ─ openai.adapter.ts │
│ │ ─ onnx-edge passthrough │ ┌──────────────────┐
│ └────────────┬──────────────────┘ │ Vertex AI │
▼ │ │ (primary cloud) │
┌─────────────────────────────────┐ ▼ └──────────────────┘
│ 6. Post-call pipeline │ ┌────────────────────────────┐
│ ─ moderate output │ │ 7. Persist + emit │
│ ─ schema validate (+ repair) │─────▶│ ─ provenance row │
│ ─ stamp AIProvenance │ │ ─ outbox events │
│ ─ open HITL gate if required │ │ ─ BigQuery streaming │
└─────────────────────────────────┘ └────────────────────────────┘
Sections:
- Edge gateway: Cloud Armor + WAF; per-(tenant, feature) rate limits.
- Inference API: REST surface; mTLS for service-to-service; JWT for BFF callers.
- Pre-call pipeline: moderation, redaction, budget check, prompt pinning, system-prompt assembly, input hash + cache lookup, HITL pre-check (some capabilities require HITL on every call, e.g., guest-facing message dispatch).
- Router: pickProvider(capability, context); per-capability fallback chain.
- Provider adapters: exactly four, each implementing AIProviderPort. Adapters are the only modules that import provider SDKs.
- Post-call pipeline: output moderation, JSON-schema validation with one repair attempt, provenance stamping, optional HITL-gate opening.
- Persistence + outbox: provenance row written transactionally with the artifact; outbox events for inference.completed, capability-specific suggestion.*, and hitl.gate_opened; BigQuery streaming for ai_calls_fact.
- Admin API: prompt CRUD, eval triggers, edge model manifest publish; restricted to platform admins via JWT scope melmastoon:ai:admin.
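
The router and adapter sections above can be illustrated with a small fallback loop. This is a sketch under assumptions: `AIProviderPort` is named in the source, but its `complete` method signature is invented here, as are `ChainResult`, `completeWithFallback`, and the `isHealthy` callback. The terminal `fallback-deterministic` step mirrors the last entry in the pricing.suggest fallback chain.

```typescript
// Assumed minimal port; the real AIProviderPort signature is not given in this document.
interface AIProviderPort {
  readonly id: string;
  complete(prompt: string): Promise<string>;
}

interface ChainResult {
  providerId: string;
  output: string;
  attempts: string[]; // providers actually called, in order
}

// Walks the chain in order, skipping providers whose circuit is unhealthy and
// moving on when a provider throws. If every provider is skipped or fails,
// the deterministic fallback produces the answer.
async function completeWithFallback(
  chain: AIProviderPort[],
  prompt: string,
  isHealthy: (providerId: string) => boolean,
  deterministicFallback: (prompt: string) => string,
): Promise<ChainResult> {
  const attempts: string[] = [];
  for (const provider of chain) {
    if (!isHealthy(provider.id)) continue; // unhealthy circuit: do not call
    attempts.push(provider.id);
    try {
      const output = await provider.complete(prompt);
      return { providerId: provider.id, output, attempts };
    } catch {
      // Provider error: fall through to the next entry in the chain.
    }
  }
  return {
    providerId: "fallback-deterministic",
    output: deterministicFallback(prompt),
    attempts,
  };
}
```

Recording `attempts` alongside the winning provider is one way the routing outcome could later be captured in provenance; whether the service does this is not stated in the source.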
7. Capability Catalog (one row per capability — implementation detail)
The catalog is the single source of truth for the gateway. Every capability id used by callers must be a row here. New entries require an ADR-or-equivalent record.
| Capability id | Prompt | Default model | Fallback chain | HITL gate | Latency target (p95) | Cost class | Output schema | Eval suite |
|---|---|---|---|---|---|---|---|---|
| pricing.suggest | PRMP_PRICING_001_v3 | gemini-1.5-flash | claude-sonnet → gpt-4o-mini → fallback-deterministic | Yes if deviation > 5% from BAR | 1.5 s | Medium | PricingSuggestion | EVAL_PRICING_001 |
| pricing.demand_forecast | PRMP_PRICING_002_v2 | tabular + gemini-1.5-flash-8b | last-year naive | No | 1 h batch | Low | DemandForecast | EVAL_FORECAST_001 |
| housekeeping.route | none (solver) + optional PRMP_HK_001_v1 annotation | melmastoon-edge-hkt-v2.onnx | gemini-1.5-flash (annotation only) → greedy nearest-floor | Yes (lead accepts) | 500 ms on device | Free | HousekeepingRoute | EVAL_HK_001 |
| staff.shift_optimize | PRMP_STAFF_001_v1 | gemini-1.5-flash | rule-based scheduler | Yes (manager accepts) | 2 s | Medium | ShiftSchedule | EVAL_STAFF_001 |
| anomaly.detect | PRMP_ANOMALY_001_v4 (explanation) | melmastoon-edge-anomaly-v3.onnx (classification); gemini-1.5-flash (explanation) | rule-based heuristics | Yes for any auto-block | 200 ms (edge) / 2 s (explanation) | Low | AnomalyVerdict | EVAL_ANOMALY_001 |
| upsell.recommend | PRMP_UPSELL_001_v2 | gemini-1.5-flash | static rules | No | 1 s | Medium | UpsellOffer[] | EVAL_UPSELL_001 |
| message.draft | PRMP_MSG_001_v3 | gemini-1.5-flash (online) / phi-3-mini (offline) | static templates per intent + locale | Yes — always | 1.5 s online / 4 s offline | Medium | MessageDraft | EVAL_MSG_001 |
| review.summarize | PRMP_REVIEW_001_v2 | gemini-1.5-pro | claude-3-5-sonnet → rule-based theme extraction | No | 8 s for ≤200 reviews | High | ReviewSummary | EVAL_REVIEW_001 |
| booking.conversion_hint | PRMP_BOOKING_001_v1 | gemini-1.5-flash | static FAQ links | No | 800 ms | Medium | ConversionHint | EVAL_BOOKING_001 |
| tutor.answer | PRMP_TUTOR_001_v2 | gemini-1.5-flash-8b (online) / phi-3-mini (offline) | local FAQ vector search (MiniLM + cosine) | No | 1.5 s | Low | TutorAnswer | EVAL_TUTOR_001 |
| description.generate | PRMP_DESC_001_v3 | gemini-1.5-flash-8b | template fill | Yes (tenant accepts before publish) | 2 s | Low | Description | EVAL_DESC_001 |
| translation.draft | PRMP_TRANSLATE_001_v2 | gemini-1.5-flash-8b | Cloud Translation API | Yes (tenant per-locale) | 3 s per chunk | Low | TranslationDraft | EVAL_TRANSLATE_001 |
| ocr.id_scan | Document AI + PRMP_OCR_001_v1 | Document AI + gemini-1.5-flash-8b | manual entry (image attached) | Yes — always | 4 s end-to-end | Medium | IdScanFields | EVAL_OCR_001 |
| stt.transcribe | Vertex Speech + PRMP_STT_001_v1 | Vertex Speech-to-Text + gemini-1.5-flash-8b | manual taps | No (reversible action) | 2 s | Low | Transcription | EVAL_STT_001 |
| vision.photo_quality | none | mobilenet-v3-small-image-quality.onnx (edge) / Vertex Vision (cloud) | accept all | No | 200 ms (edge) | Free | PhotoQualityScore | EVAL_VISION_001 |
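
The prompt ids in the table above share a PRMP_<DOMAIN>_<NUMBER>_v<n> shape, and the prompt registry moves versions through a draft → active → deprecated → retired lifecycle. The sketch below parses such an id and checks lifecycle transitions. The regular expression is inferred from the ids listed in this table, and the type and function names (`PromptRef`, `parsePromptId`, `canTransition`) are hypothetical; the sketch also assumes that stages cannot be skipped, which the source does not state.

```typescript
interface PromptRef {
  domain: string; // e.g. "PRICING"
  ordinal: number; // e.g. 1 for "001"
  version: number; // e.g. 3 for "v3"
}

// Pattern inferred from ids such as PRMP_PRICING_001_v3 (an assumption).
const PROMPT_ID_PATTERN = /^PRMP_([A-Z]+)_(\d+)_v(\d+)$/;

function parsePromptId(id: string): PromptRef | null {
  const match = PROMPT_ID_PATTERN.exec(id);
  if (match === null) return null;
  const domain = match[1];
  const ordinal = match[2];
  const version = match[3];
  if (domain === undefined || ordinal === undefined || version === undefined) return null;
  return { domain, ordinal: Number(ordinal), version: Number(version) };
}

type PromptStatus = "draft" | "active" | "deprecated" | "retired";

// Each status has at most one successor; "retired" is terminal.
const NEXT_STATUS: Record<PromptStatus, PromptStatus | null> = {
  draft: "active",
  active: "deprecated",
  deprecated: "retired",
  retired: null,
};

// Only single forward steps are allowed in this sketch.
function canTransition(from: PromptStatus, to: PromptStatus): boolean {
  return NEXT_STATUS[from] === to;
}
```

For example, `parsePromptId("PRMP_PRICING_001_v3")` yields domain `PRICING`, ordinal `1`, and version `3`, while an id without the `PRMP_` prefix yields `null`.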
8. Key Decisions
| # | Decision | Rationale |
|---|---|---|
| 1 | Single AI gateway, no exceptions | Cost control, audit, model swap, prompt versioning, moderation, PII redaction, cache, observability — all impossible to enforce uniformly otherwise. CI dependency-graph gate on model SDK imports. |
| 2 | Vertex AI primary, Anthropic + OpenAI fallback | Native to GCP; private VPC; CMEK; data residency support. Heterogeneous fallback survives a single-provider outage. |
| 3 | Edge inference via ONNX Runtime Node on Electron main process | Renderer never has model bytes. Target markets are bandwidth-constrained. Phi-3-mini + MiniLM + custom anomaly classifier ship signed with the installer. |
| 4 | Capability catalog, not direct model calls | Callers request capability: 'pricing.suggest', never model: 'gemini-1.5-flash'. Lets the gateway swap models, prompts, and providers without caller changes. |
| 5 | Prompts are first-class versioned artifacts | PRMP_<DOMAIN>_<NUMBER>_v<n> registry; draft → active → deprecated → retired lifecycle; A/B promotion gated on eval green. |
| 6 | AIProvenance is mandatory | Every persisted AI artifact carries provenance or it is not persisted. CI gate enforces in sibling services. |
| 7 | HITL gate is a first-class aggregate | Decisions are auditable, SLA-tracked, and linked to the resulting state-change event via decisionId. |
| 8 | pgvector for RAG with per-tenant namespacing | Postgres-native; one fewer engine to operate; RLS gives a second isolation line; HNSW indexes meet recall + latency targets at our scale. |
| 9 | Per-tenant token budget (soft + hard) | Cost is the most likely incident vector; degrade gracefully to deterministic fallback rather than blocking the user. |
| 10 | Edge model manifest signed with KMS, verified on every load | Tampering blocks load; supply-chain attack on the installer is detected at runtime. |
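
Decision 9, together with the soft cap (warn at 80%) and hard cap (degrade at 100%) from the responsibilities table, can be expressed as a small pure function. This is a sketch only: `BudgetVerdict` and `evaluateBudget` are hypothetical names, the integer-percent comparison is a choice made here to avoid floating-point edge cases, and treating a zero or negative budget as exhausted is an assumption.

```typescript
type BudgetVerdict = "ok" | "warn" | "degrade";

// Compares month-to-date token usage against the tenant's monthly budget.
// "warn" means the soft cap was reached; "degrade" means the hard cap was
// reached and the caller should get the deterministic fallback instead of
// a model call.
function evaluateBudget(
  tokensUsed: number,
  monthlyBudget: number,
  softCapPercent: number = 80,
): BudgetVerdict {
  if (monthlyBudget <= 0) return "degrade"; // assumption: no budget means no model calls
  if (tokensUsed >= monthlyBudget) return "degrade";
  if (tokensUsed * 100 >= monthlyBudget * softCapPercent) return "warn";
  return "ok";
}
```

With a budget of 1000000 tokens, 799999 tokens used yields `ok`, 800000 yields `warn`, and 1000000 yields `degrade`. The request is still served in the `degrade` case, just not by a model, which matches the stated goal of degrading rather than blocking the user.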
9. Service Boundaries Summary
- Inputs: capability invocations from any service or BFF (REST or Pub/Sub event request/reply); prompt + capability + model admin operations from platform admins; HITL decisions from reviewers.
- Outputs: structured AI artifacts with AIProvenance; melmastoon.ai_orchestrator.* outbox events; BigQuery ai_calls_fact rows; Cloud Monitoring metrics; signed EdgeModelManifest snapshots.
- State owned: capability catalog, prompt registry, model catalog, inference + result audit, provenance, HITL gates + decisions, budget counters, eval suites + runs, RAG corpora + chunks + embeddings (per-tenant pgvector), edge model manifests.
- State referenced (not owned): tenant region pin and plan limits (from tenant-service); reviewer identity (from iam-service); recipient + delivery for HITL notifications (notification-service).
10. Phased Maturity
| Phase | Capabilities live | Edge models | Notes |
|---|---|---|---|
| 0 (MVP) | pricing.suggest, anomaly.detect, upsell.recommend, message.draft, tutor.answer, vision.photo_quality | anomaly classifier, image-quality scorer | Cloud-only LLM via single Vertex Flash model; provenance + HITL + budget from day 1 |
| 1 | + description.generate, translation.draft, ocr.id_scan, review.summarize, housekeeping.route | + phi-3-mini (offline drafting), MiniLM (offline RAG), melmastoon-edge-hkt-v2 (route optimizer) | Full model catalog; A/B prompt rollout; per-tenant cost dashboard |
| 2 | + stt.transcribe, staff.shift_optimize, booking.conversion_hint | unchanged | Per-tenant RAG over policies/FAQ/SOPs (cloud + edge); residency-aware routing |
| 3 | Self-tuning prompts, per-tenant LoRA fine-tunes (with consent), federated edge model updates via signed differential packs | + per-tenant LoRA adapters where data permits | Continuous eval pipelines auto-promote prompts on green |
11. Cross-Reference Quick Index
- Aggregates + invariants: DOMAIN_MODEL.md
- Use cases + ports + orchestration: APPLICATION_LOGIC.md
- REST contracts + error codes: API_CONTRACTS.md
- Events published + consumed: EVENT_SCHEMAS.md
- Tables, indexes, RLS, pgvector schemas: DATA_MODEL.md
- Desktop sync of registry + manifest: SYNC_CONTRACT.md
- Self-AI integration (eval harness as customer of own gateway): AI_INTEGRATION.md
- Prompt injection, PII, cross-tenant: SECURITY_MODEL.md
- SLOs, cost dashboards: OBSERVABILITY.md
- Eval harness in detail: TESTING_STRATEGY.md
- Cloud Run + Vertex AI region pinning: DEPLOYMENT_TOPOLOGY.md
- Failure catalog + runbooks: FAILURE_MODES.md
- Local LLM emulator + ONNX dev loop: LOCAL_DEV_SETUP.md
- Readiness gate: SERVICE_READINESS.md
- Risks + mitigations: SERVICE_RISK_REGISTER.md
- Migration of existing prompts + RAG corpora: MIGRATION_PLAN.md