ai-orchestrator-service

Bounded Context: AI (Core) · Owner: AI Platform squad · Phase: 0 (gateway + minimal use cases) → 4 (self-tuning) · Storage: Cloud SQL Postgres + pgvector + Memorystore Redis + GCS · Bundle: services/ai-orchestrator-service/ · Canonical AI thesis: 08 AI Architecture

ai-orchestrator-service is the single AI gateway of Ghasi Melmastoon — the multi-tenant hotel SaaS platform whose backoffice is an Electron offline-first desktop app and whose cloud is GCP with Vertex AI as the primary model provider. No other service is allowed to import a model SDK (@google-cloud/vertexai, @anthropic-ai/sdk, openai, onnxruntime-node). Every AI capability — dynamic pricing suggestion, demand forecast, housekeeping route optimization, anomaly detection, upsell, smart guest message draft, review summarization, OCR for ID scan, voice transcription, description generation, translation drafts, AI tutor — funnels through this service via REST or Pub/Sub event request/reply.
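
The SDK ban above could be enforced by a check along these lines. This is a minimal sketch, not the actual CI gate: the function name, the manifest shape, and the exemption list are assumptions, and it only inspects directly declared dependencies, whereas the rule on this page covers the full dependency graph.

```typescript
// Hypothetical CI helper: report model SDKs declared by a package that is not exempt.
// The SDK names come from this page; everything else is illustrative.
const FORBIDDEN_MODEL_SDKS = [
  "@google-cloud/vertexai",
  "@anthropic-ai/sdk",
  "openai",
  "onnxruntime-node",
];

interface PackageManifest {
  name: string;
  dependencies?: Record<string, string>;
  devDependencies?: Record<string, string>;
}

function findForbiddenSdks(
  manifest: PackageManifest,
  exemptPackages: ReadonlySet<string>, // assumed: packages allowed to hold a model SDK
): string[] {
  if (exemptPackages.has(manifest.name)) return [];
  const declared = new Set([
    ...Object.keys(manifest.dependencies ?? {}),
    ...Object.keys(manifest.devDependencies ?? {}),
  ]);
  // Direct dependencies only; a real gate would also walk transitive dependencies.
  return FORBIDDEN_MODEL_SDKS.filter((sdk) => declared.has(sdk));
}
```

A CI step could call this for each workspace package and fail the build when the returned list is non-empty.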

The service owns the capability catalog, prompt registry (with semver versioning, A/B rollout, deprecation policy), model catalog (cloud + edge), provider routing (Vertex AI primary, Anthropic + OpenAI fallback adapters, ONNX Runtime Node on Electron for edge), cost & budget control (per-tenant token caps, soft + hard, per-feature quotas), content moderation (pre + post), PII redaction, eval harness (golden sets, A/B promotion gates), RAG over per-tenant pgvector namespaces, provenance metadata generation (AIProvenance stamped on every artifact), HITL gate orchestration (request, capture decision, audit), and the edge model manifest — the signed list of ONNX models packaged with the Electron installer with SHA-256 + signature integrity check at every load.

Purpose

Be the only path between Melmastoon and any model provider. CI fails any service whose dependency graph reaches a model SDK outside this service.
Enforce uniform provenance, moderation, redaction, budget, cache, and eval so a feature team building "review summarization" cannot ship a regression on those concerns.
Provide a typed, discoverable capability catalog (one row per capability with prompt, model, latency target, cost class, fallback chain, HITL gate config, eval suite) so other services and BFFs request inference by capability id, never by model name.
Carry the edge AI surface for the Electron desktop: ship signed ONNX models with the installer, expose window.melmastoon.ai.infer(capability, input) via preload, replay edge-inference audit events on next sync.
Run the eval harness that gates any prompt or model rollout (draft → 5% A/B → active), and surface drift on the active set.
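
The 5% A/B stage of the rollout is described elsewhere on this page as sticky by tenant. One common way to get that stickiness is to hash the tenant id into a stable bucket, sketched below. The function name, the hashing scheme, and the bucket count are assumptions, and this selects roughly 5% of tenants rather than exactly 5% of traffic.

```typescript
import { createHash } from "node:crypto";

// Hypothetical cohort check: the same (promptId, tenantId) pair always lands in the same bucket.
function inDraftCohort(tenantId: string, promptId: string, percent = 5): boolean {
  const digest = createHash("sha256").update(`${promptId}:${tenantId}`).digest();
  // Map the first 4 bytes of the digest to a bucket in 0..99.
  const bucket = digest.readUInt32BE(0) % 100;
  return bucket < percent;
}
```

Including the prompt id in the hash input means a tenant is not placed in the draft cohort of every experiment at once.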

Key responsibilities

Capability catalog management — versioned registry of every AI capability (pricing.suggest, housekeeping.route, anomaly.detect, upsell.recommend, message.draft, review.summarize, ocr.id_scan, stt.transcribe, description.generate, translation.draft, tutor.answer, etc.) with all attributes pinned at the gateway.
Prompt registry with semver, eval suite reference, ownership, deprecation timeline (active → deprecated ≥14 d → retired).
Model routing — pickProvider(capability, context) chooses cloud vs edge, with a per-capability fallback chain executed on provider error or unhealthy circuit.
Per-tenant + per-feature cost control — soft cap (warn at 80%) + hard cap (degrade to deterministic fallback at 100%); per-feature sub-budgets.
RAG — per-tenant pgvector namespaces; HNSW indexes; cross-tenant query path strictly forbidden by RLS + session GUC + assertion.
Provenance — every persisted artifact carries AIProvenance; CI gate refuses persistence in any sibling service that omits it.
HITL gate orchestration — opens a HitlGate row, notifies the right role, captures the HitlDecision (accept/modify/reject) within SLA, audits both the request and the decision.
Content moderation — pre-call on input, post-call on output; blocked outputs return deterministic fallback and emit melmastoon.ai_orchestrator.moderation.flagged.v1.
Embedding generation — batched via Vertex text-embedding-004 (cloud) or all-MiniLM-L6-v2 (edge); written to embeddings_* per-tenant tables.
Eval harness — golden sets per capability; precision/recall + acceptance metrics; A/B promotion gate; drift alerts.
Edge model manifest — signed JSON manifest of ONNX models packaged with the Electron installer; integrity verified on every model load by the desktop main process.
Telemetry — token counts, latency, cost, cache hits per (tenant_id, capability, model) to BigQuery melmastoon_analytics_prod.ai_calls_fact.
Cache — per-tenant prompt+input hash cache in Memorystore; TTL per capability; cache hit returns instantly with cacheHit: true provenance.
GDPR participation — purges per-tenant embeddings + RAG corpora + cached prompt artifacts on melmastoon.tenant.guest.erasure_requested.v1 within 7 days.
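
The soft and hard caps in the cost-control item above reduce to a threshold check on budget burn. The sketch below shows that check under stated assumptions: the type and function names are hypothetical, the unit (tokens or cost) is left to the caller, and a missing or zero cap is treated as "degrade", which the page does not specify.

```typescript
type BudgetAction = "allow" | "warn" | "degrade";

// Hypothetical decision helper: 80% soft cap warns, 100% hard cap degrades.
function budgetAction(used: number, cap: number): BudgetAction {
  if (cap <= 0) return "degrade"; // assumption: no usable cap means no paid inference
  const ratio = used / cap;
  if (ratio >= 1) return "degrade"; // hard cap: switch to the deterministic fallback
  if (ratio >= 0.8) return "warn"; // soft cap: emit a budget warning, keep serving
  return "allow";
}
```

The same function could be applied twice per call, once against the tenant budget and once against the per-feature sub-budget, taking the stricter result.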

Hotel-specific shape

Edge inference is critical: the target markets (Afghanistan, rural Tajikistan, Iranian provinces, Pakistan's KPK) routinely lose connectivity. The Electron desktop must keep producing message drafts, anomaly flags, and offline RAG answers when the cloud is unreachable.
Phi-3-mini-4k-instruct (INT4, ~2.4 GB) ships with the installer for offline drafting; all-MiniLM-L6-v2 (FP16) for offline embeddings; melmastoon-edge-anomaly-v3.onnx for booking/payment/lock anomaly heuristics; melmastoon-edge-hkt-v2.onnx for housekeeping route optimization; mobilenet-v3-small-image-quality.onnx for property photo upload triage.
All edge models are packaged with the installer (no first-launch download — onboarding may be offline), signed by the Melmastoon release key, verified on every load.
Cloud-first when available: for the same capability id, the gateway picks the Vertex AI Gemini family by default and falls back to edge only when explicitly requested or when the cloud is unhealthy.
Multilingual by default: Pashto, Dari, Arabic (RTL), English, French (LTR). Translation drafts always go through HITL.
Hard rules — edge inference never sees PCI data; never dispatches guest-facing messages without HITL; never executes destructive actions even when tutor is invoked.
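
The per-load integrity check on edge models can be illustrated with a SHA-256 comparison against the manifest entry. This is a sketch only: the `ManifestEntry` shape and function name are assumptions, and verifying the manifest's own signature against the release key is a separate step not shown here.

```typescript
import { createHash, timingSafeEqual } from "node:crypto";

// Assumed manifest entry shape: file name plus hex-encoded SHA-256 digest.
interface ManifestEntry {
  file: string;
  sha256: string;
}

// Returns true only when the model bytes hash to the digest recorded in the manifest.
function verifyModelDigest(bytes: Uint8Array, entry: ManifestEntry): boolean {
  const actual = createHash("sha256").update(bytes).digest();
  const expected = Buffer.from(entry.sha256, "hex");
  // timingSafeEqual throws on length mismatch, so guard the length first.
  return expected.length === actual.length && timingSafeEqual(actual, expected);
}
```

A loader would call this before handing the bytes to the runtime and refuse to load on `false`.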

Aggregates owned

Aggregate	Cardinality	Purpose	Identity prefix
`Capability`	1 per capability id	Declarative row binding capability id → prompt template + default model + fallback chain + HITL config + eval suite + latency target + cost class	`cap_`
`Prompt`	1 per `(domain, ordinal)`	Logical prompt; carries pointer to active version	`prm_`
`PromptVersion`	1..N per Prompt	Immutable system + user template + output schema; `draft → active → deprecated → retired`	`pmv_`
`Model`	1 per registered model	Model catalog row (provider, modality, context, cost class, latency class)	`mdl_`
`ModelDeployment`	1 per active deployment	Per-region deployment + traffic share (used during model rollouts)	`mdp_`
`Provider`	1 per provider	`vertex`, `anthropic`, `openai`, `onnx-edge` health + circuit state	`prv_`
`InferenceRequest`	1 per call	Captured input hash + capability + tenant + caller; PII-redacted	`ifr_`
`InferenceResult`	1 per request	Captured output + provenance reference; PII-redacted	`ifs_`
`Provenance`	1 per result	`AIProvenance` row (model, prompt, tokens, cost, safety verdicts, decision)	`prv_p_`
`EvalSuite`	1 per suite	Golden set + scoring rubric per capability	`eva_`
`EvalRun`	1 per scheduled or ad-hoc run	Suite + prompt version + model + scores	`evr_`
`RAGCorpus`	1 per (tenant, namespace)	Logical corpus of policies / FAQ / SOPs / amenity catalog	`rag_`
`Embedding`	1 per chunk	768-dim (cloud) or 384-dim (edge) vector + chunk text	(composite)
`BudgetCounter`	1 per (tenant, period, scope)	Real-time token + cost burn vs cap	`bdg_`
`HitlGate`	1 per gated artifact	Open request to a human; carries SLA timer	`hgt_`
`HitlDecision`	1 per gate	`accepted` / `modified` / `rejected` + justification + reviewer	`dec_`
`EdgeModelManifest`	1 per published manifest	Signed JSON of ONNX models packaged with installer	`emm_`

Key APIs (REST, `/api/v1/ai/*`)

Method	Path	Purpose	Auth
`POST`	`/api/v1/ai/complete`	Synchronous completion for a capability	service-to-service mTLS or BFF JWT
`POST`	`/api/v1/ai/embed`	Embedding generation (single or batch)	service-to-service
`POST`	`/api/v1/ai/moderate`	Standalone moderation pass	service-to-service
`POST`	`/api/v1/ai/rag/query`	RAG retrieval for a tenant corpus	service-to-service
`POST`	`/api/v1/ai/vision`	Vision capability (photo quality, OCR)	service-to-service
`POST`	`/api/v1/ai/transcribe`	STT (cloud or edge fallback)	service-to-service
`GET`	`/api/v1/ai/capabilities`	List capability catalog visible to caller	service or BFF
`GET`	`/api/v1/ai/capabilities/:capabilityId`	Capability detail	service or BFF
`POST`	`/api/v1/ai/prompts`	Create new prompt or version (admin)	platform admin
`POST`	`/api/v1/ai/prompts/:id/promote`	Promote `draft` to `active` after eval green	platform admin
`POST`	`/api/v1/ai/prompts/:id/deprecate`	Mark `active` row `deprecated` (≥14 d before retire)	platform admin
`POST`	`/api/v1/ai/eval/runs`	Trigger an eval run for a capability + prompt version + model	AI team
`GET`	`/api/v1/ai/eval/runs/:runId`	Eval run results	AI team
`POST`	`/api/v1/ai/hitl/gates/:gateId/decision`	Submit HITL decision (accept/modify/reject)	reviewer (RBAC)
`GET`	`/api/v1/ai/hitl/gates`	List open HITL gates for the caller's role + tenant	tenant member
`GET`	`/api/v1/ai/budget`	Per-tenant + per-feature budget snapshot	tenant `owner` / `gm`
`GET`	`/api/v1/ai/edge-model-manifest`	Current signed manifest (consumed by Electron installer + runtime check)	desktop device-bound
`POST`	`/bff/backoffice/v1/ai/tutor/ask`	AI tutor question (BFF entry)	tenant member

Top events published

Event	When
`melmastoon.ai_orchestrator.inference.requested.v1`	On every accepted call
`melmastoon.ai_orchestrator.inference.completed.v1`	On every successful return
`melmastoon.ai_orchestrator.inference.failed.v1`	On call failure (provider error, schema invalid after repair)
`melmastoon.ai_orchestrator.inference.cached_hit.v1`	On cache hit (cost = 0)
`melmastoon.ai_orchestrator.suggestion.dynamic_pricing.v1`	Pricing suggestion produced
`melmastoon.ai_orchestrator.suggestion.demand_forecast.v1`	Forecast produced
`melmastoon.ai_orchestrator.suggestion.housekeeping_routing.v1`	Route suggested
`melmastoon.ai_orchestrator.suggestion.shift_optimization.v1`	Shift schedule suggested
`melmastoon.ai_orchestrator.anomaly.detected.v1`	Anomaly flagged
`melmastoon.ai_orchestrator.upsell.recommended.v1`	Upsell produced
`melmastoon.ai_orchestrator.message.drafted.v1`	Guest-message draft produced (always HITL)
`melmastoon.ai_orchestrator.review.summarized.v1`	Review summary produced
`melmastoon.ai_orchestrator.ocr.completed.v1`	OCR + structured extraction returned
`melmastoon.ai_orchestrator.transcription.completed.v1`	STT returned
`melmastoon.ai_orchestrator.description.drafted.v1`	Description draft produced
`melmastoon.ai_orchestrator.translation.drafted.v1`	Translation draft produced
`melmastoon.ai_orchestrator.hitl.gate_opened.v1`	HITL gate opened, SLA timer started
`melmastoon.ai_orchestrator.hitl.gate_decided.v1`	Reviewer accepted / modified / rejected
`melmastoon.ai_orchestrator.budget.warning.v1`	80% soft cap crossed
`melmastoon.ai_orchestrator.budget.exceeded.v1`	100% hard cap crossed; degraded to deterministic fallback
`melmastoon.ai_orchestrator.eval.run_completed.v1`	Scheduled or ad-hoc eval finished
`melmastoon.ai_orchestrator.prompt.version_published.v1`	Prompt version promoted to `active`
`melmastoon.ai_orchestrator.model.deployment_changed.v1`	Model traffic shift or deployment activation
`melmastoon.ai_orchestrator.edge_model.manifest_updated.v1`	New signed manifest published for installer
`melmastoon.ai_orchestrator.moderation.flagged.v1`	Pre or post moderation blocked content

Top events consumed

Event	Triggers capability
`melmastoon.reservation.booking.confirmed.v1`	`upsell.recommend` (immediate + pre-arrival), `anomaly.detect`
`melmastoon.reservation.booking.cancelled.v1`	`anomaly.detect` (cancellation pattern)
`melmastoon.iam.user.login_failed.v1`	`anomaly.detect` (credential stuffing)
`melmastoon.payment.transaction.failed.v1`	`anomaly.detect` (payment fraud signal)
`melmastoon.payment.intent.captured.v1`	`anomaly.detect` (rapid-fire pattern)
`melmastoon.lock_integration.key_credential.issued.v1`	`anomaly.detect` (key issuance pattern)
`melmastoon.lock_integration.key_credential.not_returned.v1`	`anomaly.detect` (key-not-returned)
`melmastoon.housekeeping.task.assigned.v1` (batch)	`housekeeping.route`
`melmastoon.inventory.allocation.committed.v1` (occupancy ≥ 70%)	`pricing.suggest`
`melmastoon.tenant.guest.erasure_requested.v1`	Purge embeddings + cached artifacts (saga participant)
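
The table above can be read as a lookup from event type to triggered capabilities. The sketch below encodes it that way; the `ConsumedEvent` shape, the `occupancy` field (a fraction between 0 and 1), and the function name are assumptions. The erasure event is left out because it triggers a purge rather than a capability.

```typescript
// Assumed event envelope: only the fields this sketch needs.
interface ConsumedEvent {
  type: string;
  occupancy?: number; // hypothetical field, fraction in 0..1
}

const TRIGGERS: Record<string, readonly string[]> = {
  "melmastoon.reservation.booking.confirmed.v1": ["upsell.recommend", "anomaly.detect"],
  "melmastoon.reservation.booking.cancelled.v1": ["anomaly.detect"],
  "melmastoon.iam.user.login_failed.v1": ["anomaly.detect"],
  "melmastoon.payment.transaction.failed.v1": ["anomaly.detect"],
  "melmastoon.payment.intent.captured.v1": ["anomaly.detect"],
  "melmastoon.lock_integration.key_credential.issued.v1": ["anomaly.detect"],
  "melmastoon.lock_integration.key_credential.not_returned.v1": ["anomaly.detect"],
  "melmastoon.housekeeping.task.assigned.v1": ["housekeeping.route"],
  "melmastoon.inventory.allocation.committed.v1": ["pricing.suggest"],
};

function capabilitiesFor(event: ConsumedEvent): readonly string[] {
  // Pricing suggestions are only triggered at or above 70% occupancy.
  if (
    event.type === "melmastoon.inventory.allocation.committed.v1" &&
    (event.occupancy ?? 0) < 0.7
  ) {
    return [];
  }
  return TRIGGERS[event.type] ?? [];
}
```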

Storage

Cloud SQL Postgres (HA, regional) with the pgvector extension: capability catalog, prompt registry, model catalog, inference + result audit, provenance, HITL gates + decisions, budget counters, eval suites + runs, RAG corpora + chunks + embeddings (HNSW indexes per tenant namespace).
Memorystore Redis (HA): prompt+input hash cache (per-tenant keyspace), result cache, hot capability config, in-flight HITL SLA timers, rate limiters.
GCS bucket gs://melmastoon-ai-artifacts-<env>: eval datasets (versioned), large prompt fixtures, model artifacts (signed ONNX models for edge installer), RAG source documents pre-chunking.
BigQuery melmastoon_analytics_prod.ai_calls_fact: long-tail analytics; per-call fact row (token counts, latency, cost) for cost dashboards and capacity planning.
Vertex AI: cloud inference + embeddings + Document AI for OCR + Speech-to-Text; private VPC connectivity; CMEK; region pinned to nearest available (me-central1 preferred for AF/IR; europe-west4 failover).
ONNX Runtime Node: only on the Electron desktop main process (@ghasi/app-desktop-backoffice); never inside this service.
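
For the per-tenant prompt+input hash cache in Memorystore, a key could be derived as sketched below. The key layout, the `ai:cache:` prefix, and the function name are assumptions; the sketch also relies on `JSON.stringify`, so two inputs with the same fields in a different order would produce different keys unless the caller canonicalizes them first.

```typescript
import { createHash } from "node:crypto";

// Hypothetical key builder: tenant id scopes the keyspace, the digest covers prompt version + input.
function cacheKey(
  tenantId: string,
  capabilityId: string,
  promptVersionId: string,
  input: unknown,
): string {
  const digest = createHash("sha256")
    .update(promptVersionId)
    .update("\n")
    .update(JSON.stringify(input))
    .digest("hex");
  return `ai:cache:${tenantId}:${capabilityId}:${digest}`;
}
```

Putting the tenant id in the key prefix keeps one tenant's cached results from ever being served to another, and including the prompt version means a promoted prompt does not reuse stale entries.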

Multi-tenancy & residency

Every persisted row carries tenant_id + RLS policy <table>_tenant_isolation; session GUC app.tenant_id is set on every connection.
Per-tenant residency preference (region_pin) honored by the router; tenants pinned to me-central1 never have their data egressed to us-*.
Per-tenant model preference (Plus + Enterprise plans) honored — e.g., "Anthropic-only" tenants never route to OpenAI.
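
The residency and provider preferences above amount to filtering candidate deployments before routing. A minimal sketch, assuming hypothetical `Deployment` and `TenantRoutingPolicy` shapes and treating `regionPin` as an exact region match (a real router may also permit configured failover regions):

```typescript
// Assumed shapes; the field names are illustrative, not the service's schema.
interface Deployment {
  provider: string;
  region: string;
}

interface TenantRoutingPolicy {
  regionPin?: string; // e.g. "me-central1"
  allowedProviders?: string[]; // e.g. ["anthropic"] for an "Anthropic-only" tenant
}

// Keep only deployments that satisfy every constraint the tenant has set.
function eligibleDeployments(
  deployments: Deployment[],
  policy: TenantRoutingPolicy,
): Deployment[] {
  return deployments.filter(
    (d) =>
      (policy.regionPin === undefined || d.region === policy.regionPin) &&
      (policy.allowedProviders === undefined || policy.allowedProviders.includes(d.provider)),
  );
}
```

An empty result would mean no compliant deployment exists, in which case failing closed is safer than routing outside the tenant's constraints.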

Edge cases & invariants

Budget hard-cap exceeded → router selects the deterministic fallback registered for the capability (template fill, rule-based ranker, BAR baseline) and stamps costUsd: 0, model: 'fallback-deterministic'.
Provider down → fallback chain executed in order; melmastoon.ai_orchestrator.inference.failed.v1 emitted only after the chain is exhausted.
HITL gate timeout → conservative default applied (reject the AI suggestion; decision: 'rejected', reason: 'timeout'); hitl.gate_decided.v1 emitted with auto: true.
Edge model integrity fail → ONNX Runtime refuses to load; capability falls back to cloud or deterministic; emits model.deployment_changed.v1 with degradation: 'edge_signature_invalid'.
Prompt-injection attempt → input length cap + system prompt isolation + output schema validation; offending output replaced with deterministic fallback; emits moderation.flagged.v1.
RAG cross-tenant leak attempt → namespace assertion fails before query reaches pgvector; returns MELMASTOON.GENERAL.CROSS_TENANT_REFERENCE; pages on-call.
A/B routing — draft prompt versions get exactly 5% sticky-by-tenant traffic; promotion requires green eval + 7-day production metric review.
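
The provider-down rule above (walk the fallback chain in order, report failure only once it is exhausted) could look like the following sketch. The types, the two-state circuit model, and the function name are assumptions; the caller would be the one to emit `inference.failed.v1` when `ok` is `false`.

```typescript
type ProviderId = "vertex" | "anthropic" | "openai" | "onnx-edge";
type CircuitState = "closed" | "open"; // simplified: no half-open state

type ChainOutcome =
  | { ok: true; provider: ProviderId; output: string }
  | { ok: false; attempted: ProviderId[] };

// Try each provider in order, skipping open circuits and moving on after an error.
async function runFallbackChain(
  chain: ProviderId[],
  circuits: Partial<Record<ProviderId, CircuitState>>,
  call: (provider: ProviderId) => Promise<string>,
): Promise<ChainOutcome> {
  const attempted: ProviderId[] = [];
  for (const provider of chain) {
    if (circuits[provider] === "open") continue; // unhealthy circuit: do not call
    attempted.push(provider);
    try {
      return { ok: true, provider, output: await call(provider) };
    } catch {
      // Provider error: fall through to the next link in the chain.
    }
  }
  return { ok: false, attempted }; // chain exhausted
}
```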

Non-functional targets

Concern	Target
Availability (cloud gateway)	99.9% monthly
Latency p95 — `gemini-1.5-flash` capabilities	< 1.5 s end-to-end (gateway overhead + provider)
Latency p95 — edge `phi-3-mini` on M1/i7 baseline	< 4 s
Edge model load (cold)	< 1.5 s for MiniLM, < 8 s for Phi-3-mini INT4
Cache hit rate (target)	≥ 35% across all capabilities
Provenance completeness	100% of persisted artifacts
Eval drift detection latency	≤ 24 h on `active` prompts
GDPR purge SLA	7 days from `tenant.guest.erasure_requested.v1`
Budget enforcement accuracy	≤ 1% over hard cap before degradation
Cross-tenant leak	0 incidents (CI + RLS + assertion)
