ai-orchestrator-service — Service Overview

Catalog summary: docs/03-microservices/ai-orchestrator-service.md · Strategic refs: 02 Enterprise Architecture · 04 Event-Driven Architecture · 05 API Design · 06 Data Models · 07 Security & Tenancy · 08 AI Architecture (canonical) · ADR-0003 Electron Offline-First · Standards · NAMING · Standards · ERROR_CODES · Standards · SERVICE_TEMPLATE

1. Purpose

ai-orchestrator-service is the single AI gateway of Ghasi Melmastoon. It is the only service in the platform that talks to model providers — Vertex AI (primary), Anthropic Claude and OpenAI (cloud fallback adapters), and ONNX Runtime Node on the Electron desktop (edge inference). Every AI capability advertised in docs/08-ai-architecture.md §7 — dynamic pricing suggestion, demand forecast, housekeeping schedule optimization, anomaly detection, upsell recommendation, smart guest message draft, review summarization, OCR for ID scan at check-in, voice transcription, description generation, translation drafts, AI tutor — is delivered by other services and BFFs only through this service.

It owns the capability catalog, the prompt registry with semver versioning + A/B rollout + deprecation policy, the model catalog spanning cloud and edge, per-tenant + per-feature cost control with soft and hard caps, content moderation (pre + post call), PII redaction, RAG over per-tenant pgvector namespaces, mandatory provenance metadata generation (AIProvenance), HITL gate orchestration (open, notify, capture decision, audit), the eval harness (golden sets, A/B promotion gates, drift detection on active), and the edge model manifest — the signed list of ONNX models packaged with the Electron installer with SHA-256 + signature verified on every load.

It does not decide whether to invoke an AI capability — that is the calling feature's choice, expressed by capability id and tenant context. The gateway decides how to fulfill it: which model, which provider, with which prompt version, with which cache, with which guardrails, and how to record what happened.
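
The split described above (the caller names a capability and a tenant, the gateway picks everything else) can be sketched as a request type that has no model field at all. This is a minimal sketch under stated assumptions: the field names `tenantId` and `input`, the `GATEWAY_OWNED_KEYS` list, and the `validateRequest` helper are hypothetical and are not taken from the service's actual contract.

```typescript
// Hypothetical shape of a capability invocation. Field names are assumptions.
interface InferenceRequest {
  capability: string;             // e.g. "pricing.suggest"
  tenantId: string;               // tenant context supplied by the caller
  input: Record<string, unknown>; // capability-specific payload
}

// Keys the caller must not send: the gateway alone decides these.
const GATEWAY_OWNED_KEYS = ["model", "provider", "promptVersion"] as const;

function validateRequest(raw: Record<string, unknown>): InferenceRequest {
  for (const key of GATEWAY_OWNED_KEYS) {
    if (key in raw) {
      throw new Error(`"${key}" is chosen by the gateway, not the caller`);
    }
  }
  const { capability, tenantId, input } = raw;
  if (typeof capability !== "string" || capability.length === 0) {
    throw new Error("capability id is required");
  }
  if (typeof tenantId !== "string" || tenantId.length === 0) {
    throw new Error("tenant context is required");
  }
  // A missing payload is treated as an empty object in this sketch.
  const payload =
    typeof input === "object" && input !== null
      ? (input as Record<string, unknown>)
      : {};
  return { capability, tenantId, input: payload };
}
```

Rejecting `model` at the boundary, rather than silently ignoring it, makes a caller that tries to bypass the capability catalog fail loudly.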

2. Bounded Context

| Field | Value |
| --- | --- |
| Bounded context | AI |
| Subdomain type | Core: the AI thesis is the platform's force multiplier for small/medium hotels in low-resource markets |
| Strategic patterns | Open Host Service (capability catalog + REST surface) · Conformist (callers conform to the typed AIClient port) · Anti-Corruption Layer (provider adapters isolate Vertex AI / Anthropic / OpenAI / ONNX from the rest of the platform) |
| Bounded context map | ai-orchestrator-service ◀── (capability calls) ── every service + every BFF + Electron desktop · ai-orchestrator-service ──▶ (only) Vertex AI / Anthropic / OpenAI / ONNX edge |
| Ubiquitous language | Capability, Prompt, PromptVersion, Model, ModelDeployment, Provider, InferenceRequest, InferenceResult, Provenance, EvalSuite, EvalRun, RAGCorpus, Embedding, BudgetCounter, HitlGate, HitlDecision, EdgeModelManifest, FallbackChain, CostClass, LatencyClass, SafetyVerdict, RoutingDecision |

3. Responsibilities (in scope)

| # | Responsibility | Detail |
| --- | --- | --- |
| 1 | Capability catalog | Versioned registry binding capability id → prompt template + default model + fallback chain + HITL gate config + eval suite + latency target + cost class + JSON output schema |
| 2 | Prompt registry | Semver versions per (domain, ordinal); draft → active → deprecated → retired lifecycle; eval-gated promotion |
| 3 | Model catalog | Cloud + edge models with cost class, latency class, modality, context window |
| 4 | Provider routing | pickProvider(capability, context): cloud (Vertex primary) vs edge; per-capability fallback chain on provider error or unhealthy circuit |
| 5 | Cost & budget | Per-tenant monthly token budget, soft cap (warn 80%) + hard cap (degrade 100%); per-feature sub-budgets; real-time counters |
| 6 | RAG | Per-tenant pgvector namespaces; HNSW indexes; chunking + ingestion pipeline; cross-tenant guard |
| 7 | Provenance | AIProvenance stamped on every artifact persisted or returned; CI gate refuses persistence without it in any sibling service |
| 8 | HITL gates | Open HitlGate row, notify reviewer (via notification-service), capture HitlDecision within SLA, audit |
| 9 | Content moderation | Pre-call on input (block on harm_high); post-call on output (block on harm_*, hate, sexual, dangerous, pii_exposed) |
| 10 | PII redaction | Pre-call strip of emails, phones, government IDs, credit-card-shaped strings, IBANs from anything bound for the model and from logs |
| 11 | Embedding generation | Vertex text-embedding-004 (cloud) or all-MiniLM-L6-v2 (edge); batched |
| 12 | Vector search | k-NN over per-tenant HNSW indexes; per-call SET LOCAL hnsw.ef_search tuned per use case |
| 13 | Eval harness | Golden sets per capability; precision/recall + acceptance metrics; A/B promotion gate; drift alerts on active |
| 14 | A/B prompt rollout | New prompt versions ship as draft → 5% sticky-by-tenant traffic → eval + 7-day production review → promote to active |
| 15 | Edge model manifest | Signed JSON manifest packaged with Electron installer; integrity verified on every model load |
| 16 | Telemetry | Token counts, latency, cost, cache hits per (tenant_id, capability, model) to BigQuery ai_calls_fact + Cloud Monitoring |
| 17 | Cache | Per-tenant prompt+input hash cache in Memorystore; TTL per capability |
| 18 | GDPR participation | Purge per-tenant embeddings + RAG corpora + cached artifacts within 7 d on guest erasure |
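
The soft and hard caps in the cost and budget responsibility can be illustrated with a small threshold check. This is a sketch only: the 80% and 100% thresholds come from the table above, while the function name `evaluateBudget`, the `BudgetState` labels, and the handling of a zero budget are assumptions.

```typescript
// Hypothetical budget states; only the 80% / 100% thresholds come from the document.
type BudgetState = "ok" | "warn" | "degrade";

const SOFT_CAP_RATIO = 0.8; // soft cap: warn
const HARD_CAP_RATIO = 1.0; // hard cap: degrade to the deterministic fallback

function evaluateBudget(usedTokens: number, monthlyBudgetTokens: number): BudgetState {
  // Assumption: a tenant with no configured budget gets the most conservative behaviour.
  if (monthlyBudgetTokens <= 0) return "degrade";
  const ratio = usedTokens / monthlyBudgetTokens;
  if (ratio >= HARD_CAP_RATIO) return "degrade";
  if (ratio >= SOFT_CAP_RATIO) return "warn";
  return "ok";
}
```

A caller in the pre-call pipeline might map `"warn"` to a notification and `"degrade"` to the capability's deterministic fallback, so the user is never blocked outright.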

4. Non-Responsibilities (explicitly out of scope)

| # | Concern | Owner |
| --- | --- | --- |
| 1 | Deciding whether to invoke AI for a feature | Calling service / BFF |
| 2 | Authoring business logic that consumes AI output (e.g., when to actually update a price) | pricing-service and others |
| 3 | Storing AI artifacts that are domain-owned (e.g., the published price) | Owning service (with aiProvenanceRef foreign reference) |
| 4 | Sending guest-facing messages | notification-service (we hand it the HITL-accepted draft) |
| 5 | Persisting reviewer identity profile | iam-service + tenant-service |
| 6 | Long-term archive of AI artifacts beyond audit | audit-service (we feed it events) |
| 7 | Authoring tenant policies / FAQ / SOP source documents | theme-config-service / tenant authoring tool |
| 8 | Sync engine to the desktop | sync-service (we publish the snapshot of the registry + manifest) |

5. Dependencies

5.1 Upstream (we depend on)

| Dependency | Relationship | Failure handling |
| --- | --- | --- |
| Vertex AI (Gemini 1.5 Pro / Flash / Flash-8B; text-embedding-004; Document AI; Speech-to-Text) | Synchronous: primary cloud provider | Fallback chain: gemini-pro → claude-3-5-sonnet → gpt-4.1; gemini-flash → claude-3-5-sonnet → gpt-4o-mini; circuit breaker per provider |
| Anthropic Claude (via Vertex partner endpoint) | Synchronous: fallback adapter | Marked degraded after 5 consecutive errors; resumes on probe |
| OpenAI | Synchronous: tertiary fallback | Same circuit-breaker model; restricted to capabilities where data residency permits |
| Cloud SQL Postgres + pgvector | Synchronous: capability catalog, prompt registry, RAG, audit | Read replica fallback for catalog reads; writes block until primary recovers |
| Memorystore Redis (HA) | Synchronous: cache, HITL SLA timers, rate limiter | Postgres fallback for catalog reads (degraded latency); cache miss tolerated |
| GCS (melmastoon-ai-artifacts-<env>) | Synchronous: eval datasets, signed ONNX models, prompt fixtures | Cached in memory; degraded ingestion blocks eval runs only |
| Cloud KMS | Synchronous: manifest signature key, secrets envelope | Boot fails if KMS unreachable; cached signing context (5 min) |
| Secret Manager | Synchronous on boot | Cached; rotated via SIGHUP |
| Pub/Sub | Asynchronous: outbox publish + event consumption | Outbox table buffers; retry with backoff |
| tenant-service | Asynchronous: tenant region pin + plan limits | Cached for 5 min; degraded to last-known on error |
| iam-service | Synchronous: JWT verification + reviewer role assertion on HITL | JWKS cached; circuit breaker |
| notification-service | Asynchronous: HITL gate notification dispatch | Outbox; never blocks the inference response |
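
The failure handling for the fallback adapters ("marked degraded after 5 consecutive errors; resumes on probe") can be sketched as a small per-provider circuit. This is a minimal illustration, not the service's implementation: the class name `ProviderCircuit` and its method names are hypothetical, and only the threshold of 5 and the resume-on-probe rule come from the table above.

```typescript
// Hypothetical per-provider circuit. Only the "5 consecutive errors" threshold
// and the "resume on probe" rule are taken from the document.
class ProviderCircuit {
  private readonly threshold: number;
  private consecutiveErrors = 0;
  private degraded = false;

  constructor(threshold: number = 5) {
    this.threshold = threshold;
  }

  // A successful call breaks the run of consecutive errors.
  recordSuccess(): void {
    this.consecutiveErrors = 0;
  }

  // The provider is marked degraded once the run reaches the threshold.
  recordError(): void {
    this.consecutiveErrors += 1;
    if (this.consecutiveErrors >= this.threshold) {
      this.degraded = true;
    }
  }

  // Only a successful health probe clears the degraded flag.
  recordProbe(ok: boolean): void {
    if (ok) {
      this.degraded = false;
      this.consecutiveErrors = 0;
    }
  }

  get isDegraded(): boolean {
    return this.degraded;
  }
}
```

Keeping one circuit per provider means a degraded Anthropic adapter does not affect routing to Vertex AI or OpenAI.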

5.2 Downstream (depend on us)

| Consumer | What they consume | Coupling |
| --- | --- | --- |
| Every feature service invoking AI (pricing-service, housekeeping-service, reservation-service, billing-service, lock-integration-service, iam-service adaptive MFA, theme-config-service, search-aggregation-service, etc.) | POST /api/v1/ai/complete and friends | OHS / Conformist |
| bff-backoffice-service | Tutor + draft assist + eval dashboards | Direct REST |
| bff-tenant-booking-service | Booking conversion assist (consumer hint) | Direct REST |
| bff-consumer-service | Booking conversion assist | Direct REST |
| Electron desktop (@ghasi/app-desktop-backoffice) | Edge inference via local ONNX + cloud passthrough; pulls signed EdgeModelManifest; pulls prompt registry snapshot | Sync + REST |
| audit-service | Every melmastoon.ai_orchestrator.* event (regulated retention) | Append-only ingest |
| reporting-service | Cost + acceptance dashboards via BigQuery ai_calls_fact | Read-only |

6. Architecture Diagram

                   ┌──────────────────────────────────┐
                   │ 1. Edge / API Gateway            │
                   │ Cloud Armor + WAF + mTLS         │
                   │ rate-limit per (tenant, feature) │
                   └──────────────┬───────────────────┘
           ┌──────────────────────┴────────────────────────────┐
           ▼                                                   ▼
┌─────────────────────┐                             ┌─────────────────────┐
│ 2. Inference API    │                             │ 8. Admin API        │
│ /ai/complete /embed │                             │ prompts, eval, mfst │
│ /moderate /rag      │                             └─────────────────────┘
│ /vision /transcribe │
└──────────┬──────────┘
           ▼
┌─────────────────────────────────────────────────────────────────────────┐
│ 3. Pre-call pipeline                                                    │
│ ─ moderate input ─ redact PII ─ check budget ─ pin prompt version       │
│ ─ assemble system prompt ─ hash input ─ cache lookup ─ HITL pre-check   │
└──────────┬─────────────────────────────────────────────────────┬────────┘
           │                                           cache hit │
           ▼                                                     ▼
┌─────────────────────┐    ┌─────────────────────────┐  ┌─────────────────┐
│ 4. Router           │───▶│ 5. Provider adapters    │  │ Memorystore     │
│ pickProvider(...)   │    │ ─ vertex.adapter.ts     │  │ prompt+input    │
│ capability fallback │    │ ─ anthropic.adapter.ts  │  │ hash cache      │
└──────────┬──────────┘    │ ─ openai.adapter.ts     │  └─────────────────┘
           │               │ ─ onnx-edge passthrough │  ┌─────────────────┐
           │               └────────────┬────────────┘  │ Vertex AI       │
           │                            │               │ (primary cloud) │
           ▼                            │               └─────────────────┘
┌──────────────────────────────┐        ▼
│ 6. Post-call pipeline        │     ┌──────────────────────┐
│ ─ moderate output            │     │ 7. Persist + emit    │
│ ─ schema validate (+ repair) │────▶│ ─ provenance row     │
│ ─ stamp AIProvenance         │     │ ─ outbox events      │
│ ─ open HITL gate if required │     │ ─ BigQuery streaming │
└──────────────────────────────┘     └──────────────────────┘

Sections:

  1. Edge gateway: Cloud Armor + WAF; per-(tenant, feature) rate limits.
  2. Inference API: REST surface; mTLS for service-to-service; JWT for BFF callers.
  3. Pre-call pipeline: moderation, redaction, budget check, prompt pinning, system-prompt assembly, input hash + cache lookup, HITL pre-check (some capabilities require HITL on every call, e.g., guest-facing message dispatch).
  4. Router: pickProvider(capability, context); per-capability fallback chain.
  5. Provider adapters: exactly four, each implementing AIProviderPort. Adapters are the only modules that import provider SDKs.
  6. Post-call pipeline: output moderation, JSON-schema validation with one repair attempt, provenance stamping, optional HITL-gate opening.
  7. Persistence + outbox: provenance row written transactionally with the artifact; outbox events for inference.completed, capability-specific suggestion.*, and hitl.gate_opened; BigQuery streaming for ai_calls_fact.
  8. Admin API: prompt CRUD, eval triggers, edge model manifest publish; restricted to platform admins via JWT scope melmastoon:ai:admin.
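
The router step (item 4) can be sketched as a walk over a capability's fallback chain that skips providers which are unreachable or unhealthy. The document gives only the name `pickProvider(capability, context)`; everything else below is an assumption made for illustration, including the `ProviderId` union, the `offline` flag, the `unhealthy` set, the choice to pass the chain in directly, and the `"deterministic-fallback"` sentinel.

```typescript
// Hypothetical types. The four provider ids mirror the four adapters in the diagram.
type ProviderId = "vertex" | "anthropic" | "openai" | "onnx-edge";

interface RoutingContext {
  offline: boolean;                   // assumption: true when cloud providers are unreachable
  unhealthy: ReadonlySet<ProviderId>; // providers whose circuit is currently open
}

interface CapabilityRoute {
  chain: readonly ProviderId[];       // default provider first, then fallbacks in order
}

function pickProvider(
  route: CapabilityRoute,
  ctx: RoutingContext,
): ProviderId | "deterministic-fallback" {
  for (const provider of route.chain) {
    // Offline, only the edge adapter can serve the call.
    if (ctx.offline && provider !== "onnx-edge") continue;
    if (ctx.unhealthy.has(provider)) continue;
    return provider;
  }
  // Nothing usable in the chain: fall through to the non-AI fallback.
  return "deterministic-fallback";
}
```

Returning a sentinel instead of throwing reflects the catalog's pattern of ending most fallback chains in a rule-based or static alternative.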

7. Capability Catalog (one row per capability — implementation detail)

The catalog is the single source of truth for the gateway. Every capability id used by callers must be a row here. New entries require an ADR-or-equivalent record.

| Capability id | Prompt | Default model | Fallback chain | HITL gate | Latency target (p95) | Cost class | Output schema | Eval suite |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| pricing.suggest | PRMP_PRICING_001_v3 | gemini-1.5-flash | claude-sonnet → gpt-4o-mini → fallback-deterministic | Yes if deviation > 5% from BAR | 1.5 s | Medium | PricingSuggestion | EVAL_PRICING_001 |
| pricing.demand_forecast | PRMP_PRICING_002_v2 | tabular + gemini-1.5-flash-8b | last-year naive | No | 1 h batch | Low | DemandForecast | EVAL_FORECAST_001 |
| housekeeping.route | none (solver) + optional PRMP_HK_001_v1 annotation | melmastoon-edge-hkt-v2.onnx | gemini-1.5-flash (annotation only) → greedy nearest-floor | Yes (lead accepts) | 500 ms on device | Free | HousekeepingRoute | EVAL_HK_001 |
| staff.shift_optimize | PRMP_STAFF_001_v1 | gemini-1.5-flash | rule-based scheduler | Yes (manager accepts) | 2 s | Medium | ShiftSchedule | EVAL_STAFF_001 |
| anomaly.detect | PRMP_ANOMALY_001_v4 (explanation) | melmastoon-edge-anomaly-v3.onnx (classification); gemini-1.5-flash (explanation) | rule-based heuristics | Yes for any auto-block | 200 ms (edge) / 2 s (explanation) | Low | AnomalyVerdict | EVAL_ANOMALY_001 |
| upsell.recommend | PRMP_UPSELL_001_v2 | gemini-1.5-flash | static rules | No | 1 s | Medium | UpsellOffer[] | EVAL_UPSELL_001 |
| message.draft | PRMP_MSG_001_v3 | gemini-1.5-flash (online) / phi-3-mini (offline) | static templates per intent + locale | Yes, always | 1.5 s online / 4 s offline | Medium | MessageDraft | EVAL_MSG_001 |
| review.summarize | PRMP_REVIEW_001_v2 | gemini-1.5-pro | claude-3-5-sonnet → rule-based theme extraction | No | 8 s for ≤200 reviews | High | ReviewSummary | EVAL_REVIEW_001 |
| booking.conversion_hint | PRMP_BOOKING_001_v1 | gemini-1.5-flash | static FAQ links | No | 800 ms | Medium | ConversionHint | EVAL_BOOKING_001 |
| tutor.answer | PRMP_TUTOR_001_v2 | gemini-1.5-flash-8b (online) / phi-3-mini (offline) | local FAQ vector search (MiniLM + cosine) | No | 1.5 s | Low | TutorAnswer | EVAL_TUTOR_001 |
| description.generate | PRMP_DESC_001_v3 | gemini-1.5-flash-8b | template fill | Yes (tenant accepts before publish) | 2 s | Low | Description | EVAL_DESC_001 |
| translation.draft | PRMP_TRANSLATE_001_v2 | gemini-1.5-flash-8b | Cloud Translation API | Yes (tenant per-locale) | 3 s per chunk | Low | TranslationDraft | EVAL_TRANSLATE_001 |
| ocr.id_scan | Document AI + PRMP_OCR_001_v1 | Document AI + gemini-1.5-flash-8b | manual entry (image attached) | Yes, always | 4 s end-to-end | Medium | IdScanFields | EVAL_OCR_001 |
| stt.transcribe | Vertex Speech + PRMP_STT_001_v1 | Vertex Speech-to-Text + gemini-1.5-flash-8b | manual taps | No (reversible action) | 2 s | Low | Transcription | EVAL_STT_001 |
| vision.photo_quality | none | mobilenet-v3-small-image-quality.onnx (edge) / Vertex Vision (cloud) | accept all | No | 200 ms (edge) | Free | PhotoQualityScore | EVAL_VISION_001 |
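
One way to picture a catalog row in code is as a typed record keyed by capability id. The sketch below is illustrative only: the interface name `CapabilityEntry`, its field names, the three-valued `hitl` field, and the `resolveCapability` helper are assumptions, while the two sample rows copy their values from the table above.

```typescript
// Hypothetical in-memory view of a catalog row. Field names are assumptions.
interface CapabilityEntry {
  id: string;
  prompt: string | null;          // null for capabilities with no prompt
  defaultModel: string;
  fallbackChain: string[];
  hitl: "always" | "conditional" | "never";
  latencyTargetMs: number;        // p95 target
  costClass: "Free" | "Low" | "Medium" | "High";
  outputSchema: string;
  evalSuite: string;
}

// Two rows copied from the table above.
const catalog = new Map<string, CapabilityEntry>([
  ["pricing.suggest", {
    id: "pricing.suggest",
    prompt: "PRMP_PRICING_001_v3",
    defaultModel: "gemini-1.5-flash",
    fallbackChain: ["claude-sonnet", "gpt-4o-mini", "fallback-deterministic"],
    hitl: "conditional",          // only if deviation > 5% from BAR
    latencyTargetMs: 1500,
    costClass: "Medium",
    outputSchema: "PricingSuggestion",
    evalSuite: "EVAL_PRICING_001",
  }],
  ["upsell.recommend", {
    id: "upsell.recommend",
    prompt: "PRMP_UPSELL_001_v2",
    defaultModel: "gemini-1.5-flash",
    fallbackChain: ["static rules"],
    hitl: "never",
    latencyTargetMs: 1000,
    costClass: "Medium",
    outputSchema: "UpsellOffer[]",
    evalSuite: "EVAL_UPSELL_001",
  }],
]);

// Unknown capability ids are an error: every id must be a catalog row.
function resolveCapability(id: string): CapabilityEntry {
  const entry = catalog.get(id);
  if (entry === undefined) throw new Error(`unknown capability: ${id}`);
  return entry;
}
```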

8. Key Decisions

| # | Decision | Rationale |
| --- | --- | --- |
| 1 | Single AI gateway, no exceptions | Cost control, audit, model swap, prompt versioning, moderation, PII redaction, cache, observability: all impossible to enforce uniformly otherwise. CI dependency-graph gate on model SDK imports. |
| 2 | Vertex AI primary, Anthropic + OpenAI fallback | Native to GCP; private VPC; CMEK; data residency support. Heterogeneous fallback survives a single-provider outage. |
| 3 | Edge inference via ONNX Runtime Node on Electron main process | Renderer never has model bytes. Target markets are bandwidth-constrained. Phi-3-mini + MiniLM + custom anomaly classifier ship signed with the installer. |
| 4 | Capability catalog, not direct model calls | Callers request capability: 'pricing.suggest', never model: 'gemini-1.5-flash'. Lets the gateway swap models, prompts, and providers without caller changes. |
| 5 | Prompts are first-class versioned artifacts | PRMP_<DOMAIN>_<NUMBER>_v<n> registry; draft → active → deprecated → retired lifecycle; A/B promotion gated on eval green. |
| 6 | AIProvenance is mandatory | Every persisted AI artifact carries provenance or it is not persisted. CI gate enforces in sibling services. |
| 7 | HITL gate is a first-class aggregate | Decisions are auditable, SLA-tracked, and linked to the resulting state-change event via decisionId. |
| 8 | pgvector for RAG with per-tenant namespacing | Postgres-native; one fewer engine to operate; RLS gives a second isolation line; HNSW indexes meet recall + latency targets at our scale. |
| 9 | Per-tenant token budget (soft + hard) | Cost is the most likely incident vector; degrade gracefully to deterministic fallback rather than blocking the user. |
| 10 | Edge model manifest signed with KMS, verified on every load | Tampering blocks load; supply-chain attack on the installer is detected at runtime. |
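
The prompt identifier convention in decision 5, PRMP_<DOMAIN>_<NUMBER>_v<n>, lends itself to a small parser. This is a sketch under assumptions: the names `PromptRef` and `parsePromptId` are hypothetical, and the pattern assumes an upper-case alphabetic domain and decimal digits for both the number and the version, which matches the ids shown in the capability catalog but is not stated as a rule in the document.

```typescript
// Hypothetical parsed form of a prompt id such as "PRMP_PRICING_001_v3".
interface PromptRef {
  domain: string;   // e.g. "PRICING"
  ordinal: number;  // e.g. 1 (from "001")
  version: number;  // e.g. 3 (from "v3")
}

// Assumed grammar: upper-case domain, numeric ordinal, numeric version.
const PROMPT_ID_PATTERN = /^PRMP_([A-Z]+)_(\d+)_v(\d+)$/;

function parsePromptId(id: string): PromptRef {
  const match = PROMPT_ID_PATTERN.exec(id);
  if (match === null) {
    throw new Error(`malformed prompt id: ${id}`);
  }
  const [, domain = "", ordinal = "", version = ""] = match;
  return { domain, ordinal: Number(ordinal), version: Number(version) };
}
```

For example, `parsePromptId("PRMP_HK_001_v1")` yields domain `HK`, ordinal `1`, and version `1`, which is enough to group versions of the same prompt when comparing a draft against the active version.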

9. Service Boundaries Summary

  • Inputs: capability invocations from any service or BFF (REST or Pub/Sub event request/reply); prompt + capability + model admin operations from platform admins; HITL decisions from reviewers.
  • Outputs: structured AI artifacts with AIProvenance; melmastoon.ai_orchestrator.* outbox events; BigQuery ai_calls_fact rows; Cloud Monitoring metrics; signed EdgeModelManifest snapshots.
  • State owned: capability catalog, prompt registry, model catalog, inference + result audit, provenance, HITL gates + decisions, budget counters, eval suites + runs, RAG corpora + chunks + embeddings (per-tenant pgvector), edge model manifests.
  • State referenced (not owned): tenant region pin and plan limits (from tenant-service); reviewer identity (from iam-service); recipient + delivery for HITL notifications (notification-service).

10. Phased Maturity

| Phase | Capabilities live | Edge models | Notes |
| --- | --- | --- | --- |
| 0 (MVP) | pricing.suggest, anomaly.detect, upsell.recommend, message.draft, tutor.answer, vision.photo_quality | anomaly classifier, image-quality scorer | Cloud-only LLM via single Vertex Flash model; provenance + HITL + budget from day 1 |
| 1 | + description.generate, translation.draft, ocr.id_scan, review.summarize, housekeeping.route | + phi-3-mini (offline drafting), MiniLM (offline RAG), melmastoon-edge-hkt-v2 (route optimizer) | Full model catalog; A/B prompt rollout; per-tenant cost dashboard |
| 2 | + stt.transcribe, staff.shift_optimize, booking.conversion_hint | unchanged | Per-tenant RAG over policies/FAQ/SOPs (cloud + edge); residency-aware routing |
| 3 | Self-tuning prompts, per-tenant LoRA fine-tunes (with consent), federated edge model updates via signed differential packs | + per-tenant LoRA adapters where data permits | Continuous eval pipelines auto-promote prompts on green |

11. Cross-Reference Quick Index