Bounded Context: AI (Core) · Owner: AI Platform squad · Phase: 0 (gateway + minimal use cases) → 4 (self-tuning) · Storage: Cloud SQL Postgres + pgvector + Memorystore Redis + GCS · Bundle: services/ai-orchestrator-service/ · Canonical AI thesis: 08 AI Architecture
ai-orchestrator-service is the single AI gateway of Ghasi Melmastoon — the multi-tenant hotel SaaS platform whose backoffice is an Electron offline-first desktop app and whose cloud is GCP with Vertex AI as the primary model provider. No other service is allowed to import a model SDK (@google-cloud/vertexai, @anthropic-ai/sdk, openai, onnxruntime-node). Every AI capability — dynamic pricing suggestion, demand forecast, housekeeping route optimization, anomaly detection, upsell, smart guest message draft, review summarization, OCR for ID scan, voice transcription, description generation, translation drafts, AI tutor — funnels through this service via REST or Pub/Sub event request/reply.
The service owns the capability catalog, prompt registry (with semver versioning, A/B rollout, deprecation policy), model catalog (cloud + edge), provider routing (Vertex AI primary, Anthropic + OpenAI fallback adapters, ONNX Runtime Node on Electron for edge), cost & budget control (per-tenant token caps, soft + hard, per-feature quotas), content moderation (pre + post), PII redaction, eval harness (golden sets, A/B promotion gates), RAG over per-tenant pgvector namespaces, provenance metadata generation (AIProvenance stamped on every artifact), HITL gate orchestration (request, capture decision, audit), and the edge model manifest — the signed list of ONNX models packaged with the Electron installer with SHA-256 + signature integrity check at every load.
Purpose
- Be the only path between Melmastoon and any model provider. CI fails any service whose dependency graph reaches a model SDK outside this service.
- Enforce uniform provenance, moderation, redaction, budget, cache, and eval so a feature team building "review summarization" cannot ship a regression on those concerns.
- Provide a typed, discoverable capability catalog (one row per capability with prompt, model, latency target, cost class, fallback chain, HITL gate config, eval suite) so other services and BFFs request inference by capability id, never by model name.
- Carry the edge AI surface for the Electron desktop: ship signed ONNX models with the installer, expose window.melmastoon.ai.infer(capability, input) via preload, and replay edge-inference audit events on next sync.
- Run the eval harness that gates any prompt or model rollout (draft → 5% A/B → active), and surface drift on the active set.
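The "request by capability id, never by model name" rule can be sketched as a typed catalog lookup. This is a minimal illustration, not the service's actual schema: the field names, the `resolveCapability` helper, and the sample row values (`prm_review_01`, `eva_review_01`) are assumptions made for the example.

```typescript
// Hypothetical shape of one capability catalog row; field names are assumptions.
type CostClass = "low" | "medium" | "high";

interface Capability {
  id: string;              // e.g. "review.summarize"
  promptId: string;        // pointer into the prompt registry
  defaultModel: string;    // chosen by the gateway, never by the caller
  fallbackChain: string[]; // tried in order after the default model
  latencyTargetMs: number;
  costClass: CostClass;
  hitlRequired: boolean;
  evalSuiteId: string;
}

// Illustrative single-row catalog; real rows live in Postgres.
const catalog = new Map<string, Capability>([
  ["review.summarize", {
    id: "review.summarize",
    promptId: "prm_review_01",
    defaultModel: "gemini-1.5-flash",
    fallbackChain: ["fallback-deterministic"],
    latencyTargetMs: 1500,
    costClass: "low",
    hitlRequired: false,
    evalSuiteId: "eva_review_01",
  }],
]);

// Callers pass a capability id; an unknown id is an error, not a model guess.
function resolveCapability(id: string): Capability {
  const cap = catalog.get(id);
  if (!cap) throw new Error(`unknown capability: ${id}`);
  return cap;
}
```

A caller would write `resolveCapability("review.summarize")` and never see which provider serves the request.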
Key responsibilities
- Capability catalog management — versioned registry of every AI capability (pricing.suggest, housekeeping.route, anomaly.detect, upsell.recommend, message.draft, review.summarize, ocr.id_scan, stt.transcribe, description.generate, translation.draft, tutor.answer, etc.) with all attributes pinned at the gateway.
- Prompt registry with semver, eval suite reference, ownership, deprecation timeline (active → deprecated ≥14 d → retired).
- Model routing — pickProvider(capability, context) chooses cloud vs edge, with a per-capability fallback chain executed on provider error or unhealthy circuit.
- Per-tenant + per-feature cost control — soft cap (warn at 80%) + hard cap (degrade to deterministic fallback at 100%); per-feature sub-budgets.
- RAG — per-tenant pgvector namespaces; HNSW indexes; cross-tenant query path strictly forbidden by RLS + session GUC + assertion.
- Provenance — every persisted artifact carries AIProvenance; CI gate refuses persistence in any sibling service that omits it.
- HITL gate orchestration — opens a HitlGate row, notifies the right role, captures the HitlDecision (accept/modify/reject) within SLA, audits both the request and the decision.
- Content moderation — pre-call on input, post-call on output; blocked outputs return deterministic fallback and emit melmastoon.ai_orchestrator.moderation.flagged.v1.
- Embedding generation — batched via Vertex text-embedding-004 (cloud) or all-MiniLM-L6-v2 (edge); written to embeddings_* per-tenant tables.
- Eval harness — golden sets per capability; precision/recall + acceptance metrics; A/B promotion gate; drift alerts.
- Edge model manifest — signed JSON manifest of ONNX models packaged with the Electron installer; integrity verified on every model load by the desktop main process.
- Telemetry — token counts, latency, cost, cache hits per (tenant_id, capability, model), written to BigQuery melmastoon_analytics_prod.ai_calls_fact.
- Cache — per-tenant prompt+input hash cache in Memorystore; TTL per capability; a cache hit returns without a provider call and stamps cacheHit: true in the provenance.
- GDPR participation — purges per-tenant embeddings + RAG corpora + cached prompt artifacts within 7 days of melmastoon.tenant.guest.erasure_requested.v1.
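The soft-cap and hard-cap thresholds from the cost-control bullet can be expressed as a small pure function. The sketch below is illustrative only: the names `budgetState` and `combinedState` are hypothetical, and the rule that the stricter of the tenant budget and the feature sub-budget wins is an assumption, since the document does not say how the two combine.

```typescript
type BudgetState = "ok" | "warning" | "exceeded";

// Soft cap warns at 80% of the cap; hard cap degrades at 100%.
// Integer comparisons avoid floating-point edge cases at the thresholds.
function budgetState(usedTokens: number, capTokens: number): BudgetState {
  if (capTokens <= 0 || usedTokens >= capTokens) return "exceeded";
  if (usedTokens * 5 >= capTokens * 4) return "warning";
  return "ok";
}

// Assumption: the stricter of the tenant-level and feature-level states applies.
function combinedState(tenant: BudgetState, feature: BudgetState): BudgetState {
  const rank = { ok: 0, warning: 1, exceeded: 2 } as const;
  return rank[tenant] >= rank[feature] ? tenant : feature;
}
```

In this sketch, "warning" would map to emitting budget.warning.v1 and "exceeded" to switching the capability to its deterministic fallback.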
Hotel-specific shape
- Edge inference is critical — target markets (Afghanistan, Tajikistan rural, Iran provinces, Pakistan KPK) routinely lose connectivity. The Electron desktop must keep producing message drafts, anomaly flags, and offline RAG answers when the cloud is unreachable.
- Phi-3-mini-4k-instruct (INT4, ~2.4 GB) ships with the installer for offline drafting; all-MiniLM-L6-v2 (FP16) for offline embeddings; melmastoon-edge-anomaly-v3.onnx for booking/payment/lock anomaly heuristics; melmastoon-edge-hkt-v2.onnx for housekeeping route optimization; mobilenet-v3-small-image-quality.onnx for property photo upload triage.
- All edge models are packaged with the installer (no first-launch download — onboarding may be offline), signed by the Melmastoon release key, verified on every load.
- Cloud-first when available: same capability id, gateway picks Vertex AI Gemini family by default; falls back to edge only when explicitly requested or when cloud is unhealthy.
- Multilingual by default — Pashto, Dari, Arabic (RTL), English, French (LTR). Translation drafts always HITL.
- Hard rules — edge inference never sees PCI data; never dispatches guest-facing messages without HITL; never executes destructive actions even when tutor is invoked.
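The "verified on every load" step can be illustrated with a digest check against the manifest. This is a sketch under stated assumptions: the `ManifestEntry` shape and the `verifyModel` helper are hypothetical, and the check of the manifest's own signature against the release key is deliberately left out because the document does not specify the signature scheme.

```typescript
import { createHash } from "node:crypto";

// Hypothetical manifest entry; the real manifest schema is not given here.
interface ManifestEntry {
  file: string;   // e.g. "melmastoon-edge-anomaly-v3.onnx"
  sha256: string; // hex digest pinned in the signed manifest
}

function sha256Hex(bytes: Uint8Array): string {
  return createHash("sha256").update(bytes).digest("hex");
}

// Returns true only when the model bytes match the digest pinned in the manifest.
// A false result should block the load and trigger the cloud or deterministic fallback.
function verifyModel(entry: ManifestEntry, bytes: Uint8Array): boolean {
  return sha256Hex(bytes) === entry.sha256.toLowerCase();
}
```

In the desktop main process this check would run before the bytes are handed to ONNX Runtime, so a tampered file never becomes an inference session.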
Aggregates owned
| Aggregate | Cardinality | Purpose | Identity prefix |
|---|---|---|---|
| Capability | 1 per capability id | Declarative row binding capability id → prompt template + default model + fallback chain + HITL config + eval suite + latency target + cost class | cap_ |
| Prompt | 1 per (domain, ordinal) | Logical prompt; carries pointer to active version | prm_ |
| PromptVersion | 1..N per Prompt | Immutable system + user template + output schema; draft → active → deprecated → retired | pmv_ |
| Model | 1 per registered model | Model catalog row (provider, modality, context, cost class, latency class) | mdl_ |
| ModelDeployment | 1 per active deployment | Per-region deployment + traffic share (used during model rollouts) | mdp_ |
| Provider | 1 per provider | vertex, anthropic, openai, onnx-edge health + circuit state | prv_ |
| InferenceRequest | 1 per call | Captured input hash + capability + tenant + caller; PII-redacted | ifr_ |
| InferenceResult | 1 per request | Captured output + provenance reference; PII-redacted | ifs_ |
| Provenance | 1 per result | AIProvenance row (model, prompt, tokens, cost, safety verdicts, decision) | prv_p_ |
| EvalSuite | 1 per suite | Golden set + scoring rubric per capability | eva_ |
| EvalRun | 1 per scheduled or ad-hoc run | Suite + prompt version + model + scores | evr_ |
| RAGCorpus | 1 per (tenant, namespace) | Logical corpus of policies / FAQ / SOPs / amenity catalog | rag_ |
| Embedding | 1 per chunk | 768-dim (cloud) or 384-dim (edge) vector + chunk text | (composite) |
| BudgetCounter | 1 per (tenant, period, scope) | Real-time token + cost burn vs cap | bdg_ |
| HitlGate | 1 per gated artifact | Open request to a human; carries SLA timer | hgt_ |
| HitlDecision | 1 per gate | accepted / modified / rejected + justification + reviewer | dec_ |
| EdgeModelManifest | 1 per published manifest | Signed JSON of ONNX models packaged with installer | emm_ |
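The PromptVersion lifecycle (draft → active → deprecated → retired, with at least 14 days in deprecated) reads naturally as a small state machine. The sketch below is one possible encoding, not the service's implementation; `transition` and `PromptVersionState` are hypothetical names, and the eval-green precondition for draft → active is omitted.

```typescript
type PromptStatus = "draft" | "active" | "deprecated" | "retired";

// Minimum time a version must stay deprecated before it may be retired (14 days).
const MIN_DEPRECATION_MS = 14 * 24 * 60 * 60 * 1000;

interface PromptVersionState {
  status: PromptStatus;
  deprecatedAt?: Date;
}

const allowed: Record<PromptStatus, PromptStatus[]> = {
  draft: ["active"],
  active: ["deprecated"],
  deprecated: ["retired"],
  retired: [],
};

function transition(v: PromptVersionState, to: PromptStatus, now: Date): PromptVersionState {
  if (!allowed[v.status].includes(to)) {
    throw new Error(`illegal transition ${v.status} -> ${to}`);
  }
  if (to === "retired") {
    // Retirement is refused until the deprecation window has fully elapsed.
    const elapsed = v.deprecatedAt ? now.getTime() - v.deprecatedAt.getTime() : 0;
    if (elapsed < MIN_DEPRECATION_MS) throw new Error("deprecation window not elapsed");
  }
  return to === "deprecated" ? { status: to, deprecatedAt: now } : { ...v, status: to };
}
```

Because PromptVersion rows are immutable, only the status and its timestamp change here; the templates and output schema are never touched by a transition.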
Key APIs (REST, /api/v1/ai/*)
| Method | Path | Purpose | Auth |
|---|---|---|---|
| POST | /api/v1/ai/complete | Synchronous completion for a capability | service-to-service mTLS or BFF JWT |
| POST | /api/v1/ai/embed | Embedding generation (single or batch) | service-to-service |
| POST | /api/v1/ai/moderate | Standalone moderation pass | service-to-service |
| POST | /api/v1/ai/rag/query | RAG retrieval for a tenant corpus | service-to-service |
| POST | /api/v1/ai/vision | Vision capability (photo quality, OCR) | service-to-service |
| POST | /api/v1/ai/transcribe | STT (cloud or edge fallback) | service-to-service |
| GET | /api/v1/ai/capabilities | List capability catalog visible to caller | service or BFF |
| GET | /api/v1/ai/capabilities/:capabilityId | Capability detail | service or BFF |
| POST | /api/v1/ai/prompts | Create new prompt or version (admin) | platform admin |
| POST | /api/v1/ai/prompts/:id/promote | Promote draft to active after eval green | platform admin |
| POST | /api/v1/ai/prompts/:id/deprecate | Mark active row deprecated (≥14 d before retire) | platform admin |
| POST | /api/v1/ai/eval/runs | Trigger an eval run for a capability + prompt version + model | AI team |
| GET | /api/v1/ai/eval/runs/:runId | Eval run results | AI team |
| POST | /api/v1/ai/hitl/gates/:gateId/decision | Submit HITL decision (accept/modify/reject) | reviewer (RBAC) |
| GET | /api/v1/ai/hitl/gates | List open HITL gates for the caller's role + tenant | tenant member |
| GET | /api/v1/ai/budget | Per-tenant + per-feature budget snapshot | tenant owner / gm |
| GET | /api/v1/ai/edge-model-manifest | Current signed manifest (consumed by Electron installer + runtime check) | desktop device-bound |
| POST | /bff/backoffice/v1/ai/tutor/ask | AI tutor question (BFF entry) | tenant member |
Top events published
| Event | When |
|---|---|
| melmastoon.ai_orchestrator.inference.requested.v1 | On every accepted call |
| melmastoon.ai_orchestrator.inference.completed.v1 | On every successful return |
| melmastoon.ai_orchestrator.inference.failed.v1 | On call failure (provider error, schema invalid after repair) |
| melmastoon.ai_orchestrator.inference.cached_hit.v1 | On cache hit (cost = 0) |
| melmastoon.ai_orchestrator.suggestion.dynamic_pricing.v1 | Pricing suggestion produced |
| melmastoon.ai_orchestrator.suggestion.demand_forecast.v1 | Forecast produced |
| melmastoon.ai_orchestrator.suggestion.housekeeping_routing.v1 | Route suggested |
| melmastoon.ai_orchestrator.suggestion.shift_optimization.v1 | Shift schedule suggested |
| melmastoon.ai_orchestrator.anomaly.detected.v1 | Anomaly flagged |
| melmastoon.ai_orchestrator.upsell.recommended.v1 | Upsell produced |
| melmastoon.ai_orchestrator.message.drafted.v1 | Guest-message draft produced (always HITL) |
| melmastoon.ai_orchestrator.review.summarized.v1 | Review summary produced |
| melmastoon.ai_orchestrator.ocr.completed.v1 | OCR + structured extraction returned |
| melmastoon.ai_orchestrator.transcription.completed.v1 | STT returned |
| melmastoon.ai_orchestrator.description.drafted.v1 | Description draft produced |
| melmastoon.ai_orchestrator.translation.drafted.v1 | Translation draft produced |
| melmastoon.ai_orchestrator.hitl.gate_opened.v1 | HITL gate opened, SLA timer started |
| melmastoon.ai_orchestrator.hitl.gate_decided.v1 | Reviewer accepted / modified / rejected |
| melmastoon.ai_orchestrator.budget.warning.v1 | 80% soft cap crossed |
| melmastoon.ai_orchestrator.budget.exceeded.v1 | 100% hard cap crossed; degraded to deterministic fallback |
| melmastoon.ai_orchestrator.eval.run_completed.v1 | Scheduled or ad-hoc eval finished |
| melmastoon.ai_orchestrator.prompt.version_published.v1 | Prompt version promoted to active |
| melmastoon.ai_orchestrator.model.deployment_changed.v1 | Model traffic shift or deployment activation |
| melmastoon.ai_orchestrator.edge_model.manifest_updated.v1 | New signed manifest published for installer |
| melmastoon.ai_orchestrator.moderation.flagged.v1 | Pre or post moderation blocked content |
Top events consumed
| Event | Triggers capability |
|---|---|
| melmastoon.reservation.booking.confirmed.v1 | upsell.recommend (immediate + pre-arrival), anomaly.detect |
| melmastoon.reservation.booking.cancelled.v1 | anomaly.detect (cancellation pattern) |
| melmastoon.iam.user.login_failed.v1 | anomaly.detect (credential stuffing) |
| melmastoon.payment.transaction.failed.v1 | anomaly.detect (payment fraud signal) |
| melmastoon.payment.intent.captured.v1 | anomaly.detect (rapid-fire pattern) |
| melmastoon.lock_integration.key_credential.issued.v1 | anomaly.detect (key issuance pattern) |
| melmastoon.lock_integration.key_credential.not_returned.v1 | anomaly.detect (key-not-returned) |
| melmastoon.housekeeping.task.assigned.v1 (batch) | housekeeping.route |
| melmastoon.inventory.allocation.committed.v1 (occupancy ≥ 70%) | pricing.suggest |
| melmastoon.tenant.guest.erasure_requested.v1 | Purge embeddings + cached artifacts (saga participant) |
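The consumed-events table is effectively a routing map from event name to capability, with an occupancy guard on the pricing trigger. A minimal sketch of that map follows, covering only a few rows. The `Trigger` shape, the `capabilitiesFor` helper, and the `occupancy` payload field (expressed as a fraction between 0 and 1) are assumptions, since the event payloads are not defined in this document.

```typescript
// Assumed payload subset; real event schemas are not shown in this document.
interface EventPayload {
  occupancy?: number; // fraction in [0, 1], an assumption for this sketch
}

interface Trigger {
  capabilities: string[];
  when?: (payload: EventPayload) => boolean; // optional guard on the payload
}

const triggers: Record<string, Trigger> = {
  "melmastoon.reservation.booking.confirmed.v1": {
    capabilities: ["upsell.recommend", "anomaly.detect"],
  },
  "melmastoon.payment.transaction.failed.v1": { capabilities: ["anomaly.detect"] },
  "melmastoon.housekeeping.task.assigned.v1": { capabilities: ["housekeeping.route"] },
  "melmastoon.inventory.allocation.committed.v1": {
    capabilities: ["pricing.suggest"],
    when: (p) => (p.occupancy ?? 0) >= 0.7, // only at occupancy of 70% or more
  },
};

// Returns the capability ids to invoke for an event, or an empty list.
function capabilitiesFor(event: string, payload: EventPayload = {}): string[] {
  const trigger = triggers[event];
  if (!trigger) return [];
  if (trigger.when && !trigger.when(payload)) return [];
  return trigger.capabilities;
}
```

Keeping the guard next to the mapping means the 70% threshold lives in one place and does not have to be repeated in each subscriber.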
Storage
- Cloud SQL Postgres (HA, regional) with the pgvector extension: capability catalog, prompt registry, model catalog, inference + result audit, provenance, HITL gates + decisions, budget counters, eval suites + runs, RAG corpora + chunks + embeddings (HNSW indexes per tenant namespace).
- Memorystore Redis (HA): prompt+input hash cache (per-tenant keyspace), result cache, hot capability config, in-flight HITL SLA timers, rate limiters.
- GCS bucket gs://melmastoon-ai-artifacts-<env>: eval datasets (versioned), large prompt fixtures, model artifacts (signed ONNX models for edge installer), RAG source documents pre-chunking.
- BigQuery melmastoon_analytics_prod.ai_calls_fact: long-tail analytics; per-call fact row (token counts, latency, cost) for cost dashboards and capacity planning.
- Vertex AI: cloud inference + embeddings + Document AI for OCR + Speech-to-Text; private VPC connectivity; CMEK; region pinned to nearest available (me-central1 preferred for AF/IR; europe-west4 failover).
- ONNX Runtime Node: only on the Electron desktop main process (@ghasi/app-desktop-backoffice); never inside this service.
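The per-tenant prompt+input hash cache in Memorystore needs a key that isolates tenants and changes whenever the prompt version or input changes. One way to build such a key is sketched below. The key layout `ai:cache:<tenant>:<capability>:<digest>` and the `cacheKey` helper are assumptions for illustration, not the service's actual keyspace.

```typescript
import { createHash } from "node:crypto";

// Builds a Redis key scoped to one tenant and capability.
// Assumption: the digest covers the prompt version id plus the JSON-serialized input.
function cacheKey(
  tenantId: string,
  capability: string,
  promptVersion: string,
  input: unknown,
): string {
  const serialized = JSON.stringify(input) ?? "null"; // undefined serializes to nothing
  const digest = createHash("sha256")
    .update(promptVersion)
    .update("\u0000") // separator so version and input cannot run together
    .update(serialized)
    .digest("hex");
  return `ai:cache:${tenantId}:${capability}:${digest}`;
}
```

JSON.stringify is sensitive to property order, so a real implementation would likely canonicalize the input first; otherwise logically equal inputs can miss the cache.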
Multi-tenancy & residency
- Every persisted row carries tenant_id + RLS policy <table>_tenant_isolation; session GUC app.tenant_id is set on every connection.
- Per-tenant residency preference (region_pin) honored by the router; tenants pinned to me-central1 never have their data egressed to us-*.
- Per-tenant model preference (Plus + Enterprise plans) honored — e.g., "Anthropic-only" tenants never route to OpenAI.
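Residency and model-preference rules both act as filters on the fallback chain before any provider is called. The sketch below shows that filtering step. It is hypothetical: `TenantRouting`, `ProviderTarget`, and `eligibleTargets` are illustrative names, and how a `region_pin` expands into a set of allowed regions is an assumption (here it is passed in as an explicit list).

```typescript
interface ProviderTarget {
  provider: string; // "vertex" | "anthropic" | "openai" | ...
  region: string;   // e.g. "me-central1"
}

// Assumed per-tenant routing constraints derived from region_pin and plan settings.
interface TenantRouting {
  allowedRegions?: string[];   // undefined means no residency restriction
  allowedProviders?: string[]; // undefined means no provider restriction
}

// Keeps chain order, dropping every target the tenant is not allowed to use.
function eligibleTargets(chain: ProviderTarget[], tenant: TenantRouting): ProviderTarget[] {
  return chain.filter((target) =>
    (tenant.allowedRegions === undefined || tenant.allowedRegions.includes(target.region)) &&
    (tenant.allowedProviders === undefined || tenant.allowedProviders.includes(target.provider)));
}
```

If the filtered chain is empty, the deterministic fallback is the only remaining option, which keeps a pinned tenant's data out of disallowed regions even during an outage.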
Edge cases & invariants
- Budget hard-cap exceeded → router selects the deterministic fallback registered for the capability (template fill, rule-based ranker, BAR baseline) and stamps costUsd: 0, model: 'fallback-deterministic'.
- Provider down → fallback chain executed in order; melmastoon.ai_orchestrator.inference.failed.v1 emitted only after the chain is exhausted.
- HITL gate timeout → conservative default applied (reject the AI suggestion; decision: 'rejected', reason: 'timeout'); hitl.gate_decided.v1 emitted with auto: true.
- Edge model integrity fail → the desktop main process refuses to load the model into ONNX Runtime; capability falls back to cloud or deterministic; emits model.deployment_changed.v1 with degradation: 'edge_signature_invalid'.
- Prompt-injection attempt → input length cap + system prompt isolation + output schema validation; offending output replaced with deterministic fallback; emits moderation.flagged.v1.
- RAG cross-tenant leak attempt → namespace assertion fails before query reaches pgvector; returns MELMASTOON.GENERAL.CROSS_TENANT_REFERENCE; pages on-call.
- A/B routing — draft prompt versions get exactly 5% sticky-by-tenant traffic; promotion requires green eval + 7-day production metric review.
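The "provider down" invariant (walk the chain in order, report failure only once it is exhausted) can be sketched as a small async loop. This is an illustration under assumptions: `Attempt`, `runChain`, and the `onExhausted` callback are hypothetical, with the callback standing in for publishing inference.failed.v1 and the `healthy` flag standing in for circuit state.

```typescript
// One step in a capability's fallback chain.
interface Attempt<T> {
  provider: string;
  healthy: boolean; // false models an open circuit
  call: () => Promise<T>;
}

interface ChainResult<T> {
  value: T;
  provider: string; // provider that finally answered
  errors: string[]; // failures recorded before the success
}

async function runChain<T>(
  chain: Attempt<T>[],
  onExhausted: (errors: string[]) => void, // stand-in for emitting inference.failed.v1
): Promise<ChainResult<T>> {
  const errors: string[] = [];
  for (const attempt of chain) {
    if (!attempt.healthy) {
      errors.push(`${attempt.provider}: circuit open`);
      continue; // skip without calling the provider
    }
    try {
      const value = await attempt.call();
      return { value, provider: attempt.provider, errors };
    } catch (e) {
      errors.push(`${attempt.provider}: ${e instanceof Error ? e.message : String(e)}`);
    }
  }
  onExhausted(errors); // reached only when every entry failed or was skipped
  throw new Error("fallback chain exhausted");
}
```

Placing the deterministic fallback as the last chain entry would mean the exhausted path is hit only when that fallback itself fails.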
Non-functional targets
| Concern | Target |
|---|---|
| Availability (cloud gateway) | 99.9% monthly |
| Latency p95 — gemini-1.5-flash capabilities | < 1.5 s end-to-end (gateway overhead + provider) |
| Latency p95 — edge phi-3-mini on M1/i7 baseline | < 4 s |
| Edge model load (cold) | < 1.5 s for MiniLM, < 8 s for Phi-3-mini INT4 |
| Cache hit rate (target) | ≥ 35% across all capabilities |
| Provenance completeness | 100% of persisted artifacts |
| Eval drift detection latency | ≤ 24 h on active prompts |
| GDPR purge SLA | 7 days from tenant.guest.erasure_requested.v1 |
| Budget enforcement accuracy | ≤ 1% over hard cap before degradation |
| Cross-tenant leak | 0 incidents (CI + RLS + assertion) |