08 — AI Architecture

Companion: 02 Enterprise Architecture · 03 Microservices · 05 API Design · 06 Data Models · 07 Security & Tenancy · 09 Lock & Key Integration · 12 Desktop Spec · ADR-0003 Electron Offline-First

This document is the canonical AI architecture for Ghasi Melmastoon. It defines the AI thesis, the gateway pattern, provider routing across cloud (Vertex AI) and edge (ONNX Runtime Node on Electron), the model catalog, the prompt registry, mandatory provenance, the use-case catalog, HITL gates, edge inference rules, evaluation + guardrails, cost discipline, vector storage, safety, observability, and the phased roadmap. Every AI-related claim in any other document defers to this one.

1. AI Thesis

Small and medium hotels in low-resource markets do not get to hire revenue managers, demand-forecasting analysts, or full-time CRM specialists. The cost-per-room of those skills is prohibitive at the 8-50 room scale, where most of the Afghanistan, Tajikistan, and Iran markets sit. AI is the force multiplier that lets one general manager + a small staff run an operation that would otherwise need a much larger team.

Concretely, AI inside Melmastoon is responsible for:

Dynamic pricing suggestions per room-type per day, based on occupancy, seasonality, and historical demand.
Demand forecasting at 30 / 60 / 90 day horizons.
Housekeeping route + schedule optimization for the day.
Anomaly detection on bookings (suspicious rapid-fire bookings, payment risk, key-not-returned patterns, late-checkout patterns, occupancy spikes).
Upsell recommendations at booking and pre-arrival (room upgrade, breakfast add-on, late checkout, airport transfer).
Smart guest communications drafting (multilingual, tone-controlled).
Review summarization across long histories (multilingual; preserves sentiment + actionable themes).
Multilingual content drafting for tenants — descriptions, FAQs, policies, social copy.
AI-assisted operations dashboard — surfaces "what changed since you last logged in" with explanation.
AI tutor — answers staff "how do I…?" questions in-app, with deep links to the right action.

What AI is not allowed to do unilaterally: anything irreversible, anything monetary above a threshold, anything guest-facing without HITL acceptance. AI proposes; humans accept; the audit chain records both.

2. Single AI Gateway (`ai-orchestrator-service`)

ai-orchestrator-service is the only service that talks to model providers. Every other service, BFF, and client reaches it over REST + Pub/Sub through the AIClient port. CI dependency-graph analysis fails any build that imports @google-cloud/vertexai, @anthropic-ai/sdk, openai, onnxruntime-node, or any other model provider SDK outside ai-orchestrator-service. The one exception is the desktop edge module that explicitly belongs to app-desktop-backoffice.

2.1 Why one gateway

Concern	Without a gateway	With the gateway
Cost control	Every service has its own usage; no central budget enforcement	Per-tenant budgets + per-feature quotas + soft/hard caps in one place
Audit + provenance	Each service stamps its own (or doesn't); inconsistent	Every call carries `AIProvenance` issued centrally; one schema
Model swap	Every caller hardcodes a provider SDK; rewrite-the-world to switch	One adapter swap inside the gateway; callers see no change
Prompt versioning	Prompts scattered, drift, untested	Prompts in a registry with eval suites + deprecation policy
Moderation	Each caller "should" moderate input/output; in practice doesn't	Pre/post moderation enforced by the gateway
PII redaction	Each caller responsible; easy to miss	Mandatory pre-call redaction enforced by the gateway
Cache	Each caller caches differently; cache poisoning across tenants possible	One per-tenant cache with explicit policy
Observability	Different metrics shape per caller	Uniform traces, token counts, latency, cost in one place

2.2 Surface

Caller side (NestJS DI inside any service):

  constructor(@Inject('AIClient') private ai: AIClient) {}

  const result = await this.ai.complete({
    capability: 'pricing.suggest',
    promptId: 'PRMP_PRICING_001_v3',
    tenantId,
    input: { propertyId, roomTypeId, date, occupancy, baseline, seasonalSignal },
    timeoutMs: 4_000,
    fallback: 'deterministic',
    correlation: { traceId, requestId },
  });

The gateway:

Pre-call — moderate the input, redact PII, check the per-tenant + per-feature budget, pin the prompt version from the registry, attach the system prompt, hash the input for cache lookup.
Route — decide cloud (Vertex AI / fallback OpenAI / Anthropic) vs edge (ONNX Runtime Node on the Electron desktop). If cloud, pick the model from the catalog by (capability, latencyClass, costClass, fallbackChain).
Call — invoke the provider through the appropriate adapter. Enforce timeout + retries with jitter.
Post-call — moderate the output, validate against the use-case JSON schema, stamp AIProvenance, record cost, write to cache, persist the artifact + provenance, emit ai.gateway.call.completed.v1.
Return — the caller receives a typed result + provenance reference; the caller never sees raw model details.
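
The "Call" step's timeout + retries with jitter could look roughly like the sketch below. The source does not say which jitter strategy the gateway uses; full-jitter exponential backoff, the base and cap values, the per-attempt timeout, and the names backoffDelayMs and callWithRetry are all assumptions made for illustration.

```typescript
// Hypothetical helpers: the source only says "timeout + retries with jitter".
// Full-jitter exponential backoff and every constant here are assumptions.
function backoffDelayMs(
  attempt: number, // 0-based retry attempt
  baseMs = 100,
  capMs = 2_000,
  random: () => number = Math.random,
): number {
  // The ceiling doubles per attempt up to the cap; the delay is uniform below it.
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

async function callWithRetry<T>(
  call: (signal: AbortSignal) => Promise<T>,
  opts: { timeoutMs: number; maxRetries: number },
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= opts.maxRetries; attempt++) {
    // Each attempt gets its own timeout, signalled through an AbortController.
    const controller = new AbortController();
    const timer = setTimeout(() => controller.abort(), opts.timeoutMs);
    try {
      return await call(controller.signal);
    } catch (err) {
      lastError = err;
      if (attempt < opts.maxRetries) {
        await new Promise((resolve) => setTimeout(resolve, backoffDelayMs(attempt)));
      }
    } finally {
      clearTimeout(timer);
    }
  }
  throw lastError;
}
```

Injecting random keeps the delay calculation deterministic in tests. A real adapter would probably retry only on retryable errors (timeouts, 429, 5xx) rather than on every failure.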

3. Provider Routing

3.1 Cloud — Vertex AI primary

Vertex AI (Gemini family) is the primary cloud provider:

Native to GCP; private VPC connectivity; no extra egress; CMEK supported.
Gemini 1.5 Pro / Flash / Flash-8B cover the bulk of our LLM workload at multiple cost tiers.
Vertex AI Embeddings (text-embedding-004, 768-dim) for tenant content, reviews, room descriptions.
Vertex AI Vision for image quality scoring on property uploads.

3.2 Cloud — fallback adapters

A single fallback chain per capability lets us survive a Vertex AI incident without manual intervention:

Anthropic Claude (Sonnet / Opus tiers) via Vertex AI partner endpoint or direct API. Used when the capability calls for stronger long-context reasoning (review synthesis, policy drafting).
OpenAI (GPT-4.1 / GPT-4o-mini) as a third option for heterogeneous fallback. Restricted to capabilities where data residency allows.

Adapter shape — every provider implements AIProviderPort:

interface AIProviderPort {
  name: 'vertex' | 'anthropic' | 'openai' | 'onnx-edge';
  complete(req: AICompletionRequest): Promise<AICompletionResponse>;
  embed(req: AIEmbeddingRequest): Promise<AIEmbeddingResponse>;
  vision(req: AIVisionRequest): Promise<AIVisionResponse>;
  moderate(req: AIModerationRequest): Promise<AIModerationResponse>;
  capabilities(): ProviderCapabilities;
}

3.3 Edge — ONNX Runtime Node (Electron desktop)

The Electron desktop ships with ONNX Runtime Node running in the main process (Node 20). The renderer never holds model bytes; it requests inference through the preload-exposed window.melmastoon.ai.infer(...) channel.

Edge models are small and quantized:

Phi-3-mini-4k-instruct (INT4 quantization) for short-context drafting and Q&A when offline. Roughly 2.4 GB on disk; loads on demand; idle-unloaded after 10 minutes of inactivity.
all-MiniLM-L6-v2 (FP16) for sentence embeddings (384-dim). Used for offline RAG over the tenant's local cached policies + FAQ.
Anomaly classifier — small custom-trained ONNX (LightGBM-converted) for booking + payment + lock anomaly heuristics.
Image quality scorer — MobileNet-V3 small for property photo upload quality flags.

Models are signed: the installer ships a manifest with each model's SHA-256 digest plus a signature, and the desktop model loader refuses to pass a model to ONNX Runtime when verification fails.

3.4 Routing decision

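// Pseudocode inside ai-orchestrator-service.
// isHealthy() is assumed to be a health probe on each adapter; it is not part of
// AIProviderPort as listed in 3.2.
function pickProvider(req: AIRequest): AIProviderPort {
  // Edge first: local requests use the on-device model when one exists.
  if (req.context.local && capabilityHasEdgeModel(req.capability)) {
    return providers.onnxEdge;
  }
  // Vertex AI is primary, but only while healthy; otherwise fall through to the chain.
  if (
    req.context.regionPin === 'me-central1' &&
    req.capability !== 'long-context-policy' &&
    providers.vertex.isHealthy()
  ) {
    return providers.vertex;
  }
  // Fallback chain configured per capability:
  for (const name of capabilityFallbackChain(req.capability)) {
    if (providers[name].isHealthy()) return providers[name];
  }
  throw new AIError('NO_HEALTHY_PROVIDER');
}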

4. Model Catalog

The catalog is the source of truth for which model serves which capability. New entries require an ADR-or-equivalent record.

Model	Provider	Modality	Context	Cost class	Latency class	Primary use cases	Fallback chain
`gemini-1.5-pro`	Vertex AI	LLM, multimodal	1M	High	Medium (~2-6 s)	Review synthesis, policy drafting, long-context analysis	claude-sonnet → gpt-4.1
`gemini-1.5-flash`	Vertex AI	LLM, multimodal	1M	Medium	Low (~0.5-2 s)	Pricing suggestions, anomaly explanations, upsell drafting, guest message draft	claude-sonnet → gpt-4o-mini
`gemini-1.5-flash-8b`	Vertex AI	LLM	1M	Low	Very low (~0.2-0.8 s)	Translation drafts, room description generation, AI tutor, OCR post-processing	gemini-flash
`text-embedding-004`	Vertex AI	Embedding	2k	Very low	Very low	Room descriptions, review summaries, FAQ, RAG	local MiniLM (degraded)
`text-multilingual-embedding-002`	Vertex AI	Embedding	2k	Very low	Very low	Multilingual content embeddings	text-embedding-004
`claude-3-5-sonnet`	Anthropic (via Vertex partner)	LLM	200k	High	Medium	Long-context reasoning fallback; complex policy drafting	gemini-pro
`gpt-4o-mini`	OpenAI	LLM	128k	Medium	Low	Tertiary fallback for short prompts	—
`phi-3-mini-4k-instruct` (INT4)	ONNX Edge	LLM	4k	Free (CPU)	Medium-on-device (~1-3 s)	Offline drafting, AI tutor, simple Q&A	gemini-flash on next sync
`all-MiniLM-L6-v2` (FP16)	ONNX Edge	Embedding	256	Free (CPU)	Very low	Offline RAG over local policies / FAQ	text-embedding-004 on next sync
`melmastoon-edge-anomaly-v3`	ONNX Edge	Classifier	—	Free (CPU)	Very low	Booking / payment / lock anomaly heuristics	gemini-flash on next sync
`mobilenet-v3-small-image-quality`	ONNX Edge	Vision classifier	—	Free (CPU)	Very low	Photo upload quality flag	Vertex Vision

Cost classes (per-1k-tokens reference, indicative): Very low < $0.0005, Low < $0.005, Medium < $0.05, High ≥ $0.05.

5. Prompt Registry

Every prompt template has an ID, a version, an owner, an eval suite, and a deprecation policy. Prompts are first-class versioned artifacts.

5.1 ID format

PRMP_<DOMAIN>_<NUMBER>_v<n>

<DOMAIN> — domain code in UPPER_SNAKE_CASE: PRICING, HK, ANOMALY, UPSELL, MSG, REVIEW, BOOKING, TUTOR, DESC, TRANSLATE, OCR, STT.
<NUMBER> — zero-padded ordinal within the domain.
<n> — integer prompt version (semver-major).

Examples: PRMP_PRICING_001_v3, PRMP_MSG_004_v1, PRMP_ANOMALY_002_v5.
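
A minimal sketch of how a caller or CI check might validate and split an ID in this format. The regular expression, the PromptId shape, and parsePromptId are hypothetical; treating version 0 as invalid is also an assumption, since the source only says the version is an integer.

```typescript
interface PromptId {
  domain: string;  // e.g. 'PRICING'
  ordinal: number; // zero-padded in the ID, numeric here
  version: number; // integer prompt version
}

// PRMP_<DOMAIN>_<NUMBER>_v<n>; the domain is letters with optional underscores.
const PROMPT_ID_RE = /^PRMP_([A-Z]+(?:_[A-Z]+)*)_(\d+)_v([1-9]\d*)$/;

function parsePromptId(id: string): PromptId | null {
  const m = PROMPT_ID_RE.exec(id);
  if (!m) return null;
  return { domain: m[1], ordinal: Number(m[2]), version: Number(m[3]) };
}
```

Returning null instead of throwing lets a lint step report every malformed ID in one pass.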

5.2 Storage

Postgres table prompt_templates inside ai-orchestrator-service:

CREATE TABLE prompt_templates (
  id              text PRIMARY KEY,                 -- 'PRMP_PRICING_001_v3'
  domain          text NOT NULL,
  ordinal         int  NOT NULL,
  version         int  NOT NULL,
  status          text NOT NULL CHECK (status IN ('draft','active','deprecated','retired')),
  owner_user_id   uuid NOT NULL,
  capability      text NOT NULL,
  system_prompt   text NOT NULL,
  user_template   text NOT NULL,
  output_schema   jsonb NOT NULL,                   -- JSON Schema
  default_model   text NOT NULL,
  eval_suite_id   text NOT NULL,
  notes           text,
  created_at      timestamptz NOT NULL DEFAULT now(),
  retired_at      timestamptz,
  UNIQUE (domain, ordinal, version)
);

The table is replicated to the Electron desktop's prompt_templates SQLite table on session start, so offline edge inference uses the same templates.

5.3 Versioning + deprecation

New prompt versions ship as new rows (never overwrite). The active row for a given (domain, ordinal) is what new traffic hits.
A previously active row flips to deprecated for at least 14 days before retiring (gives consumers time to drain).
retired rows remain in the table for audit but cannot be served.
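
The lifecycle above can be read as a small state machine. The sketch below is one way to enforce it, assuming a hypothetical deprecatedAt timestamp (the prompt_templates table shown in 5.2 has no such column) and the hypothetical function name canTransition.

```typescript
type PromptStatus = "draft" | "active" | "deprecated" | "retired";

// Allowed forward transitions; rows are never overwritten or moved backwards.
const ALLOWED: Record<PromptStatus, PromptStatus[]> = {
  draft: ["active"],
  active: ["deprecated"],
  deprecated: ["retired"],
  retired: [],
};

const MIN_DEPRECATION_MS = 14 * 24 * 60 * 60 * 1000; // 14-day drain window

function canTransition(
  from: PromptStatus,
  to: PromptStatus,
  now: Date,
  deprecatedAt?: Date, // hypothetical: when the row flipped to 'deprecated'
): boolean {
  if (!ALLOWED[from].includes(to)) return false;
  if (to === "retired") {
    // Retiring requires that the row has been deprecated for at least 14 days.
    return (
      deprecatedAt !== undefined &&
      now.getTime() - deprecatedAt.getTime() >= MIN_DEPRECATION_MS
    );
  }
  return true;
}
```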

5.4 Eval suites

Every prompt template references an eval_suite_id that points to a curated set of inputs + expected outcomes (or scoring criteria). New versions must beat or match the active version on the eval suite before promotion.

6. Provenance Metadata (Mandatory)

No AI artifact is persisted, displayed, or used to drive a decision without AIProvenance. The schema is defined in 02 Enterprise Architecture §9.3 and reproduced here for completeness:

interface AIProvenance {
  promptId: string;                                  // 'PRMP_PRICING_001_v3'
  promptVersion: SemVer;
  model: string;                                      // 'gemini-1.5-flash' or 'phi-3-mini-4k-instruct'
  modelVersion?: string;
  traceId: string;                                    // W3C traceparent
  occurredAt: ISODate;
  tokensIn: number;
  tokensOut: number;
  costUsd: number;                                    // computed at the gateway
  local: boolean;                                     // true iff edge inference
  cacheHit: boolean;
  safety: { input: SafetyVerdict; output: SafetyVerdict };
  reviewedBy?: UserId;                                // populated on HITL acceptance
  reviewedAt?: ISODate;
  decision?: 'accepted'|'rejected'|'modified';        // HITL outcome
}

The UI surfaces an "AI" badge on any artifact carrying provenance; click reveals the metadata.
audit-service retains provenance for 7 years.

7. Use Case Catalog

For each use case below: trigger, prompt template, model, latency target, HITL gate, fallback strategy, eval method.

7.1 Dynamic pricing suggestion (per room-type per day)

Field	Value
Trigger	Daily 02:00 local; on-demand from `pricing-service` UI; on `inventory.allocated.v1` for tomorrow when occupancy crosses 70% threshold
Prompt	`PRMP_PRICING_001_v3` — system: pricing analyst persona; user: occupancy + 30-day baseline + seasonality + competitor anchor
Model	`gemini-1.5-flash`
Latency target	p95 < 1.5 s
HITL gate	Yes when suggestion deviates >5% from BAR baseline; otherwise auto-applied within ±5% band
Fallback	Deterministic baseline (BAR + day-of-week multiplier)
Eval	Backtest against last 12 months: did the suggestion improve revenue vs deterministic? Precision on "should-raise" / "should-lower" labels
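
The HITL rule in the table reduces to a relative-deviation check against the BAR baseline. A minimal sketch follows; the function name, the return labels, and escalating when the baseline is missing or non-positive are assumptions.

```typescript
type PricingDisposition = "auto_apply" | "needs_hitl";

function pricingDisposition(
  suggested: number,
  barBaseline: number,
  band = 0.05, // the ±5% band from the use-case table
): PricingDisposition {
  // Assumption: without a usable baseline the deviation is undefined, so escalate.
  if (!(barBaseline > 0)) return "needs_hitl";
  const deviation = Math.abs(suggested - barBaseline) / barBaseline;
  return deviation > band ? "needs_hitl" : "auto_apply";
}
```

For example, a suggestion of 106 against a baseline of 100 deviates by 6% and goes to a human, while 104 stays inside the band and is applied automatically.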

7.2 Demand forecast (30 / 60 / 90 days)

Field	Value
Trigger	Nightly 03:00 local
Prompt	`PRMP_PRICING_002_v2` — explanation only; the numeric forecast is from a Vertex AI Forecast or local quantile model, the LLM annotates
Model	Forecast: tabular model (Vertex AI Forecast or local LightGBM); annotation: `gemini-1.5-flash-8b`
Latency target	Batch; SLA 1 hour
HITL gate	No (informational)
Fallback	Last-year same-period naive forecast
Eval	MAPE per horizon; quantile coverage
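
The two eval metrics named above can be computed as in the sketch below. Skipping days whose actual value is zero is a common convention and an assumption here, as are the function names; the result is a fraction, so 0.18 corresponds to the 18% target used later in this document.

```typescript
// Mean absolute percentage error, returned as a fraction (0.1 = 10%).
function mape(actual: number[], forecast: number[]): number {
  if (actual.length !== forecast.length || actual.length === 0) {
    throw new Error("series must be non-empty and of equal length");
  }
  let sum = 0;
  let counted = 0;
  for (let i = 0; i < actual.length; i++) {
    if (actual[i] === 0) continue; // assumption: zero-actual days are skipped
    sum += Math.abs((actual[i] - forecast[i]) / actual[i]);
    counted++;
  }
  if (counted === 0) throw new Error("no non-zero actuals to score");
  return sum / counted;
}

// Fraction of actuals that fall inside the predicted [lower, upper] interval.
function quantileCoverage(actual: number[], lower: number[], upper: number[]): number {
  if (actual.length === 0 || actual.length !== lower.length || actual.length !== upper.length) {
    throw new Error("series must be non-empty and of equal length");
  }
  let inside = 0;
  for (let i = 0; i < actual.length; i++) {
    if (actual[i] >= lower[i] && actual[i] <= upper[i]) inside++;
  }
  return inside / actual.length;
}
```

A well-calibrated 10%-90% interval should show coverage near 0.8; coverage far below that suggests the quantile model is overconfident.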

7.3 Housekeeping schedule optimization

Field	Value
Trigger	At shift start; on `housekeeping.task.assigned.v1` batch flush
Prompt	None; this runs on a small TSP-like solver (`melmastoon-edge-hkt-v2.onnx`) on the Electron desktop
Model	Edge ONNX (or fallback Vertex AI flash for explanation only)
Latency target	< 500 ms on device
HITL gate	Yes — the lead must accept the proposed order; can reorder before dispatch
Fallback	Greedy nearest-floor heuristic
Eval	Time-to-complete vs baseline; staff acceptance rate
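
The fallback row names a greedy nearest-floor heuristic without defining it. One plausible reading is sketched below: starting from the housekeeper's current floor, always take the remaining task on the closest floor. The HkTask shape and the tie-breaking rule (first in input order) are assumptions.

```typescript
interface HkTask {
  taskId: string;
  floor: number;
}

function greedyNearestFloor(tasks: HkTask[], startFloor: number): HkTask[] {
  const remaining = [...tasks]; // copy so the caller's array is untouched
  const ordered: HkTask[] = [];
  let current = startFloor;
  while (remaining.length > 0) {
    // Find the remaining task whose floor is closest to the current floor.
    let best = 0;
    for (let i = 1; i < remaining.length; i++) {
      const candidate = Math.abs(remaining[i].floor - current);
      if (candidate < Math.abs(remaining[best].floor - current)) best = i;
    }
    const [next] = remaining.splice(best, 1);
    ordered.push(next);
    current = next.floor;
  }
  return ordered;
}
```

This is quadratic in the number of tasks, which is fine at the 8-50 room scale the document targets.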

7.4 Anomaly detection

Field	Value
Trigger	On every `reservation.confirmed.v1`, `payment.captured.v1`, `lock.key.issued.v1`, `reservation.checkout.v1`; daily aggregation for occupancy spikes
Prompt	`PRMP_ANOMALY_001_v4` for explanation; classification is the edge anomaly classifier
Model	Edge: `melmastoon-edge-anomaly-v3`; cloud explanation: `gemini-1.5-flash`
Latency target	Edge: < 200 ms; explanation: < 2 s
HITL gate	Yes for any auto-block; alerts only otherwise
Fallback	Rule-based heuristics (rapid-fire booking from same IP, payment failure pattern, key not returned > 24 h, etc.)
Eval	Precision / recall on labeled incidents; false-positive rate per tenant
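
Two of the fallback rules listed above could be expressed as in the sketch below. The 10-minute window, the limit of 3 bookings, measuring the 24 h from checkout, and all names are assumptions; the source gives only the rule descriptions.

```typescript
interface BookingEvent {
  ip: string;
  at: number; // epoch milliseconds
}

// Rapid-fire rule: more than maxBookings from one IP inside the window.
function isRapidFire(
  events: BookingEvent[],
  ip: string,
  now: number,
  windowMs = 10 * 60_000, // assumed window
  maxBookings = 3,        // assumed threshold
): boolean {
  const recent = events.filter(
    (e) => e.ip === ip && e.at <= now && now - e.at <= windowMs,
  );
  return recent.length > maxBookings;
}

// Key rule: key still not returned more than 24 h after checkout (assumed reference point).
function isKeyOverdue(checkoutAt: number, keyReturnedAt: number | null, now: number): boolean {
  if (keyReturnedAt !== null) return false;
  return now - checkoutAt > 24 * 60 * 60_000;
}
```

Because 7.4 requires HITL for any auto-block, rules like these would only raise alerts or a temporary hold on their own.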

7.5 Upsell recommendation

Field	Value
Trigger	At booking confirmation; pre-arrival 48 h before check-in
Prompt	`PRMP_UPSELL_001_v2` — system: hospitality concierge persona; user: reservation details + property amenity catalog
Model	`gemini-1.5-flash`
Latency target	p95 < 1 s
HITL gate	No (suggested to guest, guest decides)
Fallback	Static rule set (breakfast for stays >2 nights, late checkout for premium room types)
Eval	Conversion rate per recommendation type

7.6 Smart guest message draft (multilingual, tone-controlled)

Field	Value
Trigger	Front desk requests a draft; pre-arrival template fills; post-stay thank-you
Prompt	`PRMP_MSG_001_v3` — variables: tone (formal/warm), locale, message intent, reservation context
Model	`gemini-1.5-flash` (online) or `phi-3-mini` (offline)
Latency target	p95 < 1.5 s online; < 4 s offline
HITL gate	Yes — always for guest-facing messages; the staff edits + sends
Fallback	Static templates per intent + locale
Eval	Staff acceptance rate; edit-distance from draft to send
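
The edit-distance eval can be approximated with character-level Levenshtein distance between the AI draft and the message staff actually sent. That choice of metric, the normalization by the longer string, and the function names are assumptions; the source does not specify them.

```typescript
// Levenshtein distance over Unicode code points, using two rolling rows.
function editDistance(a: string, b: string): number {
  const s = Array.from(a); // code points, so non-Latin scripts count per character
  const t = Array.from(b);
  let prev = Array.from({ length: t.length + 1 }, (_, j) => j);
  for (let i = 1; i <= s.length; i++) {
    const curr = [i];
    for (let j = 1; j <= t.length; j++) {
      const cost = s[i - 1] === t[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // deletion
        curr[j - 1] + 1,    // insertion
        prev[j - 1] + cost, // substitution or match
      );
    }
    prev = curr;
  }
  return prev[t.length];
}

// 0 means the draft was sent unchanged; 1 means it was fully rewritten.
function normalizedEditDistance(draft: string, sent: string): number {
  const longest = Math.max(Array.from(draft).length, Array.from(sent).length);
  return longest === 0 ? 0 : editDistance(draft, sent) / longest;
}
```

Tracking the normalized value per locale would show where drafts need the most manual rework.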

7.7 Review summarization (multilingual)

Field	Value
Trigger	Weekly summary across last 30 / 90 days; on-demand for GM dashboard
Prompt	`PRMP_REVIEW_001_v2` — produces `{ themes:[…], sentiment, actionable:[…], topQuotes:[…] }`
Model	`gemini-1.5-pro` (long context); fallback `claude-3-5-sonnet`
Latency target	p95 < 8 s for ≤ 200 reviews
HITL gate	No (informational)
Fallback	Rule-based theme extraction
Eval	Theme F1 against a hand-labeled set; sentiment accuracy

7.8 Booking conversion assist (consumer chat hint on the meta layer)

Field	Value
Trigger	User idles >15 s on a results page; explicit "I need help"
Prompt	`PRMP_BOOKING_001_v1` — short prompt; outputs a single suggestion or a clarifying question
Model	`gemini-1.5-flash`
Latency target	p95 < 800 ms
HITL gate	No
Fallback	Static FAQ links
Eval	Hint → click-through → booking-completion uplift

7.9 AI tutor for backoffice

Field	Value
Trigger	Staff opens the help drawer; types a question
Prompt	`PRMP_TUTOR_001_v2` — system: helpful product expert; tools: `linkToScreen(screenId)`, `runWalkthrough(walkthroughId)`
Model	`gemini-1.5-flash-8b` (online) or `phi-3-mini` (offline)
Latency target	p95 < 1.5 s
HITL gate	No (informational; tutor never executes destructive actions)
Fallback	Local FAQ vector search (MiniLM + cosine)
Eval	Resolution rate; thumbs-up rate; deflection from support tickets

7.10 Description generation (room types, property)

Field	Value
Trigger	Tenant clicks "Generate description" on a room-type / property
Prompt	`PRMP_DESC_001_v3` — variables: room features, brand voice, target audience, locale
Model	`gemini-1.5-flash-8b`
Latency target	p95 < 2 s
HITL gate	Yes — tenant edits + accepts before publishing
Fallback	Template fill
Eval	Acceptance rate; edit distance

7.11 Translation drafts for tenant content

Field	Value
Trigger	Tenant adds content in source locale; target locales auto-draft
Prompt	`PRMP_TRANSLATE_001_v2` — preserves brand voice tokens; flags untranslatable terms
Model	`gemini-1.5-flash-8b`
Latency target	p95 < 3 s per chunk
HITL gate	Yes — tenant reviews per locale
Fallback	Cloud Translation API as a baseline
Eval	Native-speaker review acceptance rate per locale

7.12 OCR for ID scan at check-in (with HITL)

Field	Value
Trigger	Front desk scans guest ID at check-in
Prompt	OCR via Vertex AI Document AI; LLM post-process (`PRMP_OCR_001_v1`) to extract structured fields
Model	Document AI + `gemini-1.5-flash-8b`
Latency target	p95 < 4 s end-to-end
HITL gate	Yes — always — staff verifies extracted fields before save
Fallback	Manual entry (ID image still attached)
Eval	Field-level precision; staff edit rate per field

7.13 Voice transcription for staff hands-free updates

Field	Value
Trigger	Housekeeper holds the "voice" button on the desktop or mobile to update task state
Prompt	STT via Vertex AI Speech (or Whisper-large-v3 via ONNX edge if offline); intent extraction `PRMP_STT_001_v1`
Model	Vertex AI Speech-to-Text + `gemini-1.5-flash-8b` for intent
Latency target	p95 < 2 s
HITL gate	No (the action is always reversible — flip a status; it's logged and undoable)
Fallback	Manual taps
Eval	WER per locale; intent classification accuracy

8. HITL Gates (consolidated list)

The following actions must be gated by human acceptance before they take effect. Each is recorded as a Decision (dec_…) and linked to the resulting state-change event.

Action	Why HITL is required
Pricing publish where deviation > 5% from baseline	Material revenue impact
Reservation auto-cancel triggered by anomaly	Irreversible to the guest
Refund initiated by AI	Money leaving the tenant
Bulk lock-credential revoke triggered by anomaly	Operational + guest-experience impact
Guest-facing AI-drafted message dispatch	Brand + relationship risk
Tenant content publish (description, translation)	Public-facing brand artifact
OCR-extracted ID fields written to guest profile	Data integrity + privacy
Housekeeping schedule dispatch	Staff scheduling depends on it
Auto-block of a flagged booking beyond temporary hold	Revenue + customer impact

UI affordances:

A draft_ai badge on the artifact, with an "Accept", "Modify", "Reject" trio.
Required justification on "Reject" (free-text; logged).
The accepted state-change event carries decisionId; downstream services can correlate.
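
A sketch of how the Accept / Modify / Reject trio might be recorded, including the required justification on Reject. The Decision shape, the recordDecision function, and the injected newId and now helpers are hypothetical; only the dec_ prefix and the decisionId correlation come from the source.

```typescript
type HitlOutcome = "accepted" | "modified" | "rejected";

interface Decision {
  decisionId: string; // 'dec_' + opaque id; the id scheme is not specified in the source
  artifactId: string;
  outcome: HitlOutcome;
  reviewedBy: string;
  reviewedAt: string; // ISO 8601
  justification?: string;
}

function recordDecision(
  input: { artifactId: string; outcome: HitlOutcome; reviewedBy: string; justification?: string },
  newId: () => string,
  now: () => Date = () => new Date(),
): Decision {
  const justification = input.justification?.trim();
  // Reject requires a free-text justification, which is logged with the decision.
  if (input.outcome === "rejected" && !justification) {
    throw new Error("justification is required when rejecting an AI draft");
  }
  return {
    decisionId: `dec_${newId()}`,
    artifactId: input.artifactId,
    outcome: input.outcome,
    reviewedBy: input.reviewedBy,
    reviewedAt: now().toISOString(),
    ...(justification ? { justification } : {}),
  };
}
```

The returned decisionId is what the resulting state-change event would carry so downstream services can correlate the two.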

9. Offline / Edge AI

9.1 What runs on Electron via ONNX Runtime Node

Anomaly classification (booking, payment, lock-key) — melmastoon-edge-anomaly-v3.onnx.
Embedding generation for offline RAG over the tenant's cached policies + FAQ — all-MiniLM-L6-v2.
Draft message suggestions when offline — phi-3-mini-4k-instruct INT4.
Simple forecasting for next 7 days — small LightGBM-converted ONNX (melmastoon-edge-forecast-v2.onnx).
Image quality scoring for photo upload — mobilenet-v3-small-image-quality.onnx.
Housekeeping route optimizer — melmastoon-edge-hkt-v2.onnx.

All edge inference happens in the main process (Node 20). The renderer requests inference via window.melmastoon.ai.infer(capability, input) exposed by contextBridge.

9.2 Packaging + verification

Models are packaged with the installer (no first-launch download — the user may be onboarding offline).
Each model is shipped with its SHA-256 + a manifest signature signed by the Melmastoon release key.
On first run (and on every load) the desktop model loader verifies the signature against the public key embedded in the binary before creating the ONNX Runtime session; tampering invalidates the signature and the model is not loaded.
Model updates ship via electron-updater with the rest of the app; partial model updates are atomic (download to temp → verify → swap).
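
The verification step might look like the sketch below, which uses Node's built-in crypto module. The signature algorithm (Ed25519), the manifest layout, and the name verifyModel are assumptions; the source states only that a SHA-256 per model and a signed manifest exist. The demo at the end uses a throwaway key pair purely to exercise the function.

```typescript
import { createHash, generateKeyPairSync, sign, verify, KeyObject } from "node:crypto";

// Hypothetical manifest layout: model file name -> hex SHA-256 digest.
interface ModelManifest {
  models: Record<string, string>;
}

function verifyModel(
  manifestBytes: Buffer,
  manifestSignature: Buffer,
  releasePublicKey: KeyObject,
  modelName: string,
  modelBytes: Buffer,
): boolean {
  // 1. The manifest must carry a valid signature (Ed25519 takes a null algorithm).
  if (!verify(null, manifestBytes, releasePublicKey, manifestSignature)) return false;
  // 2. The model bytes must hash to the digest the manifest records.
  const manifest = JSON.parse(manifestBytes.toString("utf8")) as ModelManifest;
  const expected = manifest.models[modelName];
  if (typeof expected !== "string") return false;
  const actual = createHash("sha256").update(modelBytes).digest("hex");
  return actual === expected.toLowerCase();
}

// Demo with a throwaway key pair; a real release private key never ships with the app.
const demoKeys = generateKeyPairSync("ed25519");
const demoModel = Buffer.from("fake-onnx-bytes");
const demoManifest = Buffer.from(
  JSON.stringify({
    models: { "demo.onnx": createHash("sha256").update(demoModel).digest("hex") },
  }),
);
const demoSignature = sign(null, demoManifest, demoKeys.privateKey);
const intactOk = verifyModel(demoManifest, demoSignature, demoKeys.publicKey, "demo.onnx", demoModel);
const tamperedOk = verifyModel(demoManifest, demoSignature, demoKeys.publicKey, "demo.onnx", Buffer.from("tampered"));
```

Only when this check passes would the loader hand the bytes to ONNX Runtime; for multi-gigabyte models a streaming hash would be preferable to holding the whole file in memory.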

9.3 Audit trail for edge inference

Edge inference still emits ai.inference.local.completed.v1 to the local outbox. On next sync the event is replayed for audit; the cloud ai-orchestrator-service accepts these events and persists provenance with local: true.

9.4 Hard rules

Edge inference never sees PCI data (no cards), never sees lock vendor secrets, never runs guest-facing message dispatch without HITL.
Model files live under app.getPath('userData')/models/ with restrictive ACLs.
Idle-unload: large models (Phi-3-mini) unload after 10 minutes of inactivity to free RAM.
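
The idle-unload rule could be implemented with a small wrapper like the one below. It is a simplified sketch: loading is shown as synchronous (real ONNX Runtime session creation is asynchronous), time is passed in explicitly so the logic is testable, and the class and method names are hypothetical.

```typescript
class IdleUnloadingModel<T> {
  private instance: T | null = null;
  private lastUsedAt = 0;
  private readonly load: () => T;
  private readonly dispose: (instance: T) => void;
  private readonly idleMs: number;

  constructor(load: () => T, dispose: (instance: T) => void, idleMs = 10 * 60_000) {
    this.load = load;
    this.dispose = dispose;
    this.idleMs = idleMs;
  }

  // Load on demand and record the time of use.
  use(now: number): T {
    if (this.instance === null) this.instance = this.load();
    this.lastUsedAt = now;
    return this.instance;
  }

  // Call periodically; unloads once the model has been idle for idleMs.
  sweep(now: number): boolean {
    if (this.instance !== null && now - this.lastUsedAt >= this.idleMs) {
      this.dispose(this.instance);
      this.instance = null;
      return true;
    }
    return false;
  }

  get loaded(): boolean {
    return this.instance !== null;
  }
}
```

In the main process, sweep(Date.now()) could run from a setInterval timer, and only the large models would be wrapped this way.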

10. Eval & Guardrails

10.1 Golden sets

Every use case has a curated golden set (eval_suite_id referenced from the prompt template).
Stored in ai-orchestrator-service Postgres + version-controlled in a companion repo (melmastoon-ai-evals) for reproducibility.
Golden sets include both positive and adversarial examples (prompt-injection attempts, edge-case inputs).

10.2 Precision / recall targets

Use case	Metric	Target
Pricing suggestion	Direction-accuracy on labeled "should-raise / hold / lower"	≥ 75%
Demand forecast	MAPE @ 30-day horizon	≤ 18%
Anomaly detection	Precision @ recall 0.9	≥ 0.7
Upsell	Conversion uplift vs static rules	≥ 1.3×
OCR	Field-level precision	≥ 0.95
Translation	Native-speaker acceptance	≥ 0.85
Tutor	Resolution rate	≥ 0.7

10.3 A/B routing for prompt changes

New prompt versions ship as draft → routed to 5% of traffic → eval suite + production metrics monitored for 7 days → promoted to active on green.
Per-tenant opt-out for prompt experimentation (Plus + Enterprise plans).
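
Routing 5% of traffic to a draft prompt is usually done with deterministic bucketing, so the same unit always sees the same version. The sketch below hashes an experiment key plus a unit ID; whether the unit is a tenant, a user, or a request is not stated in the source, and the function names are hypothetical.

```typescript
import { createHash } from "node:crypto";

// Deterministically place a unit into one of 10,000 buckets.
function inExperiment(experimentKey: string, unitId: string, percent = 5): boolean {
  const digest = createHash("sha256").update(`${experimentKey}:${unitId}`).digest();
  const bucket = digest.readUInt32BE(0) % 10_000; // 0..9999
  return bucket < percent * 100;
}

function pickPromptVersion(
  activeId: string,
  draftId: string,
  unitId: string,
  tenantOptedOut: boolean,
): string {
  // Opted-out tenants always get the active prompt.
  if (tenantOptedOut) return activeId;
  return inExperiment(draftId, unitId) ? draftId : activeId;
}
```

Keying the hash on the draft's ID means each experiment gets an independent 5% slice rather than always exposing the same tenants.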

10.4 Cost guardrails

Per-tenant monthly token budget with soft cap (warn at 80%) and hard cap (degrade to deterministic fallback at 100%).
Per-feature quotas inside a tenant — pricing + anomaly + upsell each have independent caps.
Real-time cost dashboard surfaced to tenant owner + gm roles.
Alerting: email + in-app at 80%; in-app + on-call SRE at 100%.
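
The soft and hard caps above amount to a three-state check, applied at both tenant and feature level. A minimal sketch, with hypothetical names and integer arithmetic to avoid floating-point edge cases at exactly 80%:

```typescript
type BudgetState = "ok" | "soft_cap" | "hard_cap";

function budgetState(usedTokens: number, monthlyBudget: number): BudgetState {
  // Assumption: a zero or negative budget is treated as already exhausted.
  if (monthlyBudget <= 0 || usedTokens >= monthlyBudget) return "hard_cap";
  // usedTokens / monthlyBudget >= 0.8, written without division.
  if (usedTokens * 5 >= monthlyBudget * 4) return "soft_cap";
  return "ok";
}

const SEVERITY: Record<BudgetState, number> = { ok: 0, soft_cap: 1, hard_cap: 2 };

// The stricter of the tenant-level and feature-level states applies.
function effectiveState(tenant: BudgetState, feature: BudgetState): BudgetState {
  return SEVERITY[tenant] >= SEVERITY[feature] ? tenant : feature;
}
```

On hard_cap the gateway would skip the provider call and return the capability's deterministic fallback.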

11. Cost & Budget

Lever	Mechanism
Tiered model routing	Default to the cheapest model that meets latency + quality; escalate only when needed
Edge first	Capabilities with edge models run on the desktop; cloud is the fallback
Cache	Per-tenant prompt+input hash cache; cache hit returns instantly with `cacheHit: true` provenance; TTL per capability
Per-tenant budgets	Soft + hard caps; degrade to deterministic fallback on hard cap; per-feature sub-budgets
Batch where possible	Embeddings batched; nightly forecast batched per property
Prompt economy	Templates iteratively shortened; outputs schema-constrained to minimize output tokens
Model right-sizing	`gemini-1.5-flash-8b` is the default for short-form generation; only escalate to `flash` or `pro` when justified
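
The per-tenant prompt+input hash cache named in the table needs a key that is stable across equivalent inputs and never shared between tenants. One possible construction is sketched below; the canonical JSON routine, the key layout, and the function names are assumptions.

```typescript
import { createHash } from "node:crypto";

// Serialize with sorted object keys so logically equal inputs hash identically.
function canonicalJson(value: unknown): string {
  if (value === undefined) return "null";
  if (Array.isArray(value)) return `[${value.map((v) => canonicalJson(v)).join(",")}]`;
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const body = Object.keys(obj)
      .filter((k) => obj[k] !== undefined)
      .sort()
      .map((k) => `${JSON.stringify(k)}:${canonicalJson(obj[k])}`)
      .join(",");
    return `{${body}}`;
  }
  return JSON.stringify(value);
}

// The tenant ID is part of the key, so one tenant can never read another's entry.
function cacheKey(tenantId: string, promptId: string, input: unknown): string {
  const digest = createHash("sha256")
    .update(promptId)
    .update("\n")
    .update(canonicalJson(input))
    .digest("hex");
  return `${tenantId}:${promptId}:${digest}`;
}
```

Because the prompt ID includes its version, promoting a new prompt version naturally invalidates old entries. The input hashed here should be the redacted input, so the key itself carries no PII.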

The cost dashboard (Looker Studio + BigQuery melmastoon_analytics_prod.ai_calls_fact) breaks down spend per (tenant_id, capability, model) and surfaces top consumers per period.

12. Vector Storage

All embeddings live in pgvector inside ai-orchestrator-service's Postgres schema. Per-tenant namespacing via tenant_id column + RLS (see 06 Data Models §7).

12.1 Per-tenant namespace

Every k-NN query carries WHERE tenant_id = $1 and runs under a session with app.tenant_id set; RLS is the second line of defense.
Cross-tenant query embeddings (embeddings_search_queries) live in a separate table with tenant_id nullable for anonymous queries; no PII.

12.2 HNSW indexes

m=16, ef_construction=64 as defaults.
Per-call SET LOCAL hnsw.ef_search = 40 for the typical recall-vs-latency target; tuned per index after observing recall in production.
Re-indexing triggered when corpus growth exceeds 25% since last build.
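
Putting 12.1 and 12.2 together, a tenant-scoped k-NN query might be issued as the statement sequence sketched below. The table name embeddings, its columns, and the helper name are hypothetical, and the cosine operator assumes the HNSW index was built with vector_cosine_ops. SET LOCAL does not accept bind parameters, so ef_search is validated as an integer before being interpolated.

```typescript
interface SqlStatement {
  text: string;
  values: unknown[];
}

function buildKnnStatements(
  tenantId: string,
  queryEmbedding: number[],
  k: number,
  efSearch = 40,
): SqlStatement[] {
  if (!Number.isInteger(efSearch) || efSearch < 1 || efSearch > 1000) {
    throw new RangeError("efSearch must be an integer in 1..1000");
  }
  if (!Number.isInteger(k) || k < 1) throw new RangeError("k must be a positive integer");
  if (queryEmbedding.length === 0 || !queryEmbedding.every((x) => Number.isFinite(x))) {
    throw new RangeError("embedding must be a non-empty list of finite numbers");
  }
  return [
    { text: "BEGIN", values: [] },
    // Transaction-local setting that the RLS policy reads.
    { text: "SELECT set_config('app.tenant_id', $1, true)", values: [tenantId] },
    { text: `SET LOCAL hnsw.ef_search = ${efSearch}`, values: [] },
    {
      text:
        "SELECT id, content, embedding <=> $2::vector AS distance " +
        "FROM embeddings WHERE tenant_id = $1 " +
        "ORDER BY embedding <=> $2::vector LIMIT $3",
      values: [tenantId, `[${queryEmbedding.join(",")}]`, k],
    },
    { text: "COMMIT", values: [] },
  ];
}
```

The explicit WHERE tenant_id = $1 is the first line of defense and the transaction-local app.tenant_id feeds RLS as the second, matching the layering described in 12.1.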

12.3 RAG over tenant-private content

The gateway's rag() method runs a per-tenant retrieval over the tenant's:

Policies (cancellation, house rules, child + pet policy, etc.)
FAQ
Staff playbook (uploaded SOPs, training docs)
Property amenity catalog

The retrieved chunks are injected into the prompt as context. The model is instructed to ground answers in the provided context and to refuse if the answer is not present.

Edge RAG runs the same pattern with the local SQLite-cached subset and all-MiniLM-L6-v2 embeddings.
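
On the desktop there is no pgvector, so the retrieval step can be a brute-force cosine scan over the locally cached chunks. A minimal sketch, assuming embeddings are already loaded into memory as number arrays; the Chunk shape and function names are hypothetical.

```typescript
interface Chunk {
  id: string;
  embedding: number[]; // e.g. 384-dim all-MiniLM-L6-v2 vectors
}

function cosine(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom; // zero vectors score 0 rather than NaN
}

// Score every chunk against the query and keep the k most similar.
function topK(query: number[], chunks: Chunk[], k: number): Array<{ id: string; score: number }> {
  return chunks
    .map((c) => ({ id: c.id, score: cosine(query, c.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

A linear scan is adequate for a single tenant's policies and FAQ, which are typically a few hundred chunks.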

13. Safety

13.1 Prompt injection defense

System prompt isolation — assembled centrally; never composed from user input.
Input length limits — per capability (4 KB guest-facing, 16 KB admin-side).
Output schema validation — every response validated against a JSON schema; non-conforming outputs are rejected.
Tool-call allowlist — for capabilities with tool use, only the declared tools are callable; tool execution is server-side.
Adversarial eval examples — every golden set includes known prompt-injection patterns; new attacks added on detection.

13.2 PII redaction in logs

Pre-call redaction strips emails, phones, government IDs, credit-card-shaped strings, IBANs from anything bound for the model and from anything written to logs.
Logs use pino with declared redactors per service; CI verifies no new field is added without a redactor entry.
AI traces capture token counts and model decisions, never raw user content (unless redacted).
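
A pattern-based redaction pass could be sketched as below. The regular expressions are illustrative assumptions, not the gateway's actual rules: they cover emails, unspaced IBANs, card-shaped digit runs, and phone-shaped numbers. Government ID formats are locale-specific and are left out.

```typescript
// Order matters: more specific patterns run before the looser phone pattern.
const REDACTORS: Array<[RegExp, string]> = [
  [/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g, "[EMAIL]"],
  [/\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b/g, "[IBAN]"],
  // 13 to 19 digits, optionally separated by single spaces or dashes.
  [/\b(?:\d[ -]?){12,18}\d\b/g, "[CARD]"],
  [/\+?\d[\d\s().-]{7,}\d/g, "[PHONE]"],
];

function redactPii(text: string): string {
  return REDACTORS.reduce((acc, [pattern, label]) => acc.replace(pattern, label), text);
}
```

For example, redactPii("Mail ali@example.com, tel +93 70 123 4567.") yields "Mail [EMAIL], tel [PHONE]." Pattern matching like this misses names and free-text identifiers, so it complements rather than replaces the declared pino redactors.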

13.3 Content moderation on guest-facing output

Pre-moderation on input (block on harm_high).
Post-moderation on output (block on harm_*, hate, sexual, dangerous, pii_exposed).
Blocked outputs return a deterministic fallback and raise ai.moderation.blocked.v1 to the audit log.

14. Observability

Signal	Source	Destination
Per-call traces	OpenTelemetry inside `ai-orchestrator-service`; spans annotated with `model`, `promptId`, `cacheHit`, `tokens`, `costUsd`, `latencyMs`	Cloud Trace; sampled at 10% (100% on errors + on > p95 latency)
Token counts + cost	Computed at the gateway from provider response	BigQuery `melmastoon_analytics_prod.ai_calls_fact`
Latency histograms	Per `(capability, model, provider)`	Cloud Monitoring; per-tenant breakdown for hot tenants
Cache hit rate	Per `(capability, tenantId)`	Cloud Monitoring
Model error rates	Per provider	Cloud Monitoring + alert on sustained > 1%
HITL acceptance rate	Per `(capability, tenantId)`	BigQuery + Looker Studio dashboard
Eval drift	Comparing `active` prompt eval vs golden set on schedule	Alert on regression
Provenance integrity	Sample audit job verifies every persisted AI artifact has provenance	Daily report

A purpose-built AI eval logging dashboard (Looker Studio over BigQuery) gives the AI team a real-time view of:

Calls per capability per tenant per model.
Cost burn vs budget.
HITL acceptance / rejection / modification rates.
Drift signals on key prompts.
Top failure modes (output-schema violations, moderation blocks, provider errors).

15. Roadmap

Phase 1 — Minimal AI (MVP)

Heuristic + simple pricing suggestions (rule-based with optional LLM annotation).
Cloud-only AI through ai-orchestrator-service.
One model (gemini-1.5-flash) covering most capabilities.
Provenance + HITL gates implemented from day 1.
Per-tenant budget + soft cap.
Edge inference: anomaly classifier + image-quality scorer only.

Phase 2 — Expansion

Full model catalog (Gemini Pro + Flash + Flash-8B; Anthropic + OpenAI fallbacks).
All Phase-1-listed use cases live.
Edge: Phi-3-mini for offline drafting; MiniLM for offline RAG; HK route optimizer.
Prompt registry with eval suites + A/B routing.
Per-tenant cost dashboard.
Translation drafts + multilingual review summarization.

Phase 3 — Personalization

Per-tenant RAG over policies / FAQ / playbook (cloud + edge).
Booking conversion assist on the consumer meta layer.
AI tutor with deep-link tools.
Voice STT for hands-free housekeeping updates.
Bug bounty + adversarial eval expansion.
Per-feature quotas + per-tenant residency-aware routing.

Phase 4 — Self-tuning

Per-tenant LoRA fine-tunes (where data volume + tenant consent permit) for tenant-voice messaging + property-specific descriptions.
Continuous eval pipelines that promote prompts automatically on green.
Federated edge model updates via signed differential model packs.
Per-tenant model preference (e.g., Anthropic-only for tenants with that contractual preference).

Cross-references: per-service AI integration details live in services/<service-name>/AI_INTEGRATION.md. The model catalog, prompt registry schema, and eval suites are owned by ai-orchestrator-service and version-controlled. Safety + provenance + HITL contracts are referenced from 07 Security & Tenancy §8 and from every use-case implementation.
