08 — AI Architecture

Companion: 02 Enterprise Architecture · 03 Microservices · 05 API Design · 06 Data Models · 07 Security & Tenancy · 09 Lock & Key Integration · 12 Desktop Spec · ADR-0003 Electron Offline-First

This document is the canonical AI architecture for Ghasi Melmastoon. It defines the AI thesis, the gateway pattern, provider routing across cloud (Vertex AI) and edge (ONNX Runtime Node on Electron), the model catalog, the prompt registry, mandatory provenance, the use-case catalog, HITL gates, edge inference rules, evaluation + guardrails, cost discipline, vector storage, safety, observability, and the phased roadmap. Every AI-related claim in any other document defers to this one.


1. AI Thesis

Small and medium hotels in low-resource markets do not get to hire revenue managers, demand-forecasting analysts, or full-time CRM specialists. The cost-per-room of those skills is prohibitive at the 8-50 room scale, where most of the Afghanistan, Tajikistan, and Iran markets sit. AI is the force multiplier that lets one general manager + a small staff run an operation that would otherwise need a much larger team.

Concretely, AI inside Melmastoon is responsible for:

  • Dynamic pricing suggestions per room-type per day, based on occupancy, seasonality, and historical demand.
  • Demand forecasting at 30 / 60 / 90 day horizons.
  • Housekeeping route + schedule optimization for the day.
  • Anomaly detection on bookings (suspicious rapid-fire bookings, payment risk, key-not-returned patterns, late-checkout patterns, occupancy spikes).
  • Upsell recommendations at booking and pre-arrival (room upgrade, breakfast add-on, late checkout, airport transfer).
  • Smart guest communications drafting (multilingual, tone-controlled).
  • Review summarization across long histories (multilingual; preserves sentiment + actionable themes).
  • Multilingual content drafting for tenants — descriptions, FAQs, policies, social copy.
  • AI-assisted operations dashboard — surfaces "what changed since you last logged in" with explanation.
  • AI tutor — answers staff "how do I…?" questions in-app, with deep links to the right action.

What AI is not allowed to do unilaterally: anything irreversible, anything monetary above a threshold, anything guest-facing without HITL acceptance. AI proposes; humans accept; the audit chain records both.


2. Single AI Gateway (ai-orchestrator-service)

ai-orchestrator-service is the only service that talks to model providers. Every other service, BFF, and client calls it via REST + Pub/Sub through the AIClient port. CI dependency-graph analysis fails any service that imports @google-cloud/vertexai, @anthropic-ai/sdk, openai, onnxruntime-node, or any other model provider SDK outside ai-orchestrator-service. The one exception is the desktop edge module, which explicitly belongs to app-desktop-backoffice.

2.1 Why one gateway

| Concern | Without a gateway | With the gateway |
| --- | --- | --- |
| Cost control | Every service has its own usage; no central budget enforcement | Per-tenant budgets + per-feature quotas + soft/hard caps in one place |
| Audit + provenance | Each service stamps its own (or doesn't); inconsistent | Every call carries AIProvenance issued centrally; one schema |
| Model swap | Every caller hardcodes a provider SDK; rewrite-the-world to switch | One adapter swap inside the gateway; callers see no change |
| Prompt versioning | Prompts scattered, drift, untested | Prompts in a registry with eval suites + deprecation policy |
| Moderation | Each caller "should" moderate input/output; in practice doesn't | Pre/post moderation enforced by the gateway |
| PII redaction | Each caller responsible; easy to miss | Mandatory pre-call redaction enforced by the gateway |
| Cache | Each caller caches differently; cache poisoning across tenants possible | One per-tenant cache with explicit policy |
| Observability | Different metrics shape per caller | Uniform traces, token counts, latency, cost in one place |

2.2 Surface

Caller side (NestJS DI inside any service):

constructor(@Inject('AIClient') private ai: AIClient) {}

const result = await this.ai.complete({
  capability: 'pricing.suggest',
  promptId: 'PRMP_PRICING_001_v3',
  tenantId,
  input: { propertyId, roomTypeId, date, occupancy, baseline, seasonalSignal },
  timeoutMs: 4_000,
  fallback: 'deterministic',
  correlation: { traceId, requestId },
});

The gateway:

  1. Pre-call — moderate the input, redact PII, check the per-tenant + per-feature budget, pin the prompt version from the registry, attach the system prompt, hash the input for cache lookup.
  2. Route — decide cloud (Vertex AI / fallback OpenAI / Anthropic) vs edge (ONNX Runtime Node on the Electron desktop). If cloud, pick the model from the catalog by (capability, latencyClass, costClass, fallbackChain).
  3. Call — invoke the provider through the appropriate adapter. Enforce timeout + retries with jitter.
  4. Post-call — moderate the output, validate against the use-case JSON schema, stamp AIProvenance, record cost, write to cache, persist the artifact + provenance, emit ai.gateway.call.completed.v1.
  5. Return — the caller receives a typed result + provenance reference; the caller never sees raw model details.
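The five steps above can be sketched as a single function. This is a minimal, synchronous sketch rather than the service's actual implementation: GatewayDeps, GatewayResult, handleCall, and the string cache key are hypothetical names, and routing, schema validation, and provenance stamping are left out to keep the control flow visible.

```typescript
// Hypothetical, simplified shapes; the real gateway contracts are not shown in this document.
type Verdict = 'ok' | 'blocked';

interface GatewayDeps {
  moderate(text: string): Verdict;               // pre- and post-call moderation
  redact(text: string): string;                  // PII redaction
  budgetRemainingUsd(tenantId: string): number;  // per-tenant budget check
  callModel(prompt: string): { text: string; costUsd: number };
  fallback(): string;                            // deterministic fallback
}

interface GatewayResult {
  output: string;
  source: 'model' | 'cache' | 'fallback';
  costUsd: number;
}

// Keys start with the tenant ID so entries are never shared across tenants.
const responseCache = new Map<string, string>();

function handleCall(tenantId: string, promptId: string, input: string, deps: GatewayDeps): GatewayResult {
  const degraded = (costUsd = 0): GatewayResult => ({ output: deps.fallback(), source: 'fallback', costUsd });

  // 1. Pre-call: moderate, redact, check budget, look up the cache.
  if (deps.moderate(input) === 'blocked') return degraded();
  const safeInput = deps.redact(input);
  if (deps.budgetRemainingUsd(tenantId) <= 0) return degraded();
  const key = `${tenantId}|${promptId}|${safeInput}`; // stand-in for a real input hash
  const cached = responseCache.get(key);
  if (cached !== undefined) return { output: cached, source: 'cache', costUsd: 0 };

  // 2-3. Route + call (provider selection is omitted in this sketch).
  const res = deps.callModel(safeInput);

  // 4. Post-call: moderate the output, then cache it.
  if (deps.moderate(res.text) === 'blocked') return degraded(res.costUsd);
  responseCache.set(key, res.text);

  // 5. Return a typed result; the caller never sees provider details.
  return { output: res.text, source: 'model', costUsd: res.costUsd };
}
```

Because redaction runs before the cache key is built, two inputs that differ only in redacted PII share one cache entry within a tenant, and nothing unredacted reaches the cache.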

3. Provider Routing

3.1 Cloud — Vertex AI primary

Vertex AI (Gemini family) is the primary cloud provider:

  • Native to GCP; private VPC connectivity; no extra egress; CMEK supported.
  • Gemini 1.5 Pro / Flash / Flash-8B cover the bulk of our LLM workload at multiple cost tiers.
  • Vertex AI Embeddings (text-embedding-004, 768-dim) for tenant content, reviews, room descriptions.
  • Vertex AI Vision for image quality scoring on property uploads.

3.2 Cloud — fallback adapters

A single fallback chain per capability lets us survive a Vertex AI incident without manual intervention:

  • Anthropic Claude (Sonnet / Opus tiers) via Vertex AI partner endpoint or direct API. Used when the capability calls for stronger long-context reasoning (review synthesis, policy drafting).
  • OpenAI (GPT-4.1 / GPT-4o-mini) as a third option for heterogeneous fallback. Restricted to capabilities where data residency allows.

Adapter shape — every provider implements AIProviderPort:

interface AIProviderPort {
  name: 'vertex' | 'anthropic' | 'openai' | 'onnx-edge';
  complete(req: AICompletionRequest): Promise<AICompletionResponse>;
  embed(req: AIEmbeddingRequest): Promise<AIEmbeddingResponse>;
  vision(req: AIVisionRequest): Promise<AIVisionResponse>;
  moderate(req: AIModerationRequest): Promise<AIModerationResponse>;
  capabilities(): ProviderCapabilities;
}

3.3 Edge — ONNX Runtime Node (Electron desktop)

The Electron desktop ships with ONNX Runtime Node running in the main process (Node 20). The renderer never holds model bytes; it requests inference via the preload-exposed window.melmastoon.ai.infer(...) channel.

Edge models are small and quantized:

  • Phi-3-mini-4k-instruct (INT4 quantization) for short-context drafting and Q&A when offline. Roughly 2.4 GB on disk; loads on demand; idle-unloaded after 10 minutes of inactivity.
  • all-MiniLM-L6-v2 (FP16) for sentence embeddings (384-dim). Used for offline RAG over the tenant's local cached policies + FAQ.
  • Anomaly classifier — small custom-trained ONNX (LightGBM-converted) for booking + payment + lock anomaly heuristics.
  • Image quality scorer — MobileNet-V3 small for property photo upload quality flags.

Models are signed; the installer ships a manifest with SHA-256 + signature. The edge module verifies the signature before handing a model to ONNX Runtime and refuses to load any model whose signature does not verify.

3.4 Routing decision

// Pseudocode inside ai-orchestrator-service
function pickProvider(req: AIRequest): AIProviderPort {
  if (req.context.local && capabilityHasEdgeModel(req.capability)) {
    return providers.onnxEdge;
  }
  if (req.context.regionPin === 'me-central1' && req.capability !== 'long-context-policy') {
    return providers.vertex; // primary
  }
  // Fallback chain configured per capability:
  for (const name of capabilityFallbackChain(req.capability)) {
    if (providers[name].isHealthy()) return providers[name];
  }
  throw new AIError('NO_HEALTHY_PROVIDER');
}
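To make the fallback-chain loop concrete, the sketch below simulates a Vertex AI incident with in-memory stubs. ProviderStub, fallbackChains, firstHealthy, and the 'review.summarize' capability name are hypothetical; the chains shown are illustrative and not the configured ones.

```typescript
type ProviderName = 'vertex' | 'anthropic' | 'openai';

interface ProviderStub {
  name: ProviderName;
  healthy: boolean; // stands in for isHealthy()
}

// Hypothetical chains; the real ones are configured per capability.
const fallbackChains: Record<string, ProviderName[]> = {
  'review.summarize': ['vertex', 'anthropic', 'openai'],
  'pricing.suggest': ['vertex', 'anthropic'],
};

// Walk the chain in order and return the first provider that reports healthy.
function firstHealthy(capability: string, providers: Record<ProviderName, ProviderStub>): ProviderStub {
  const chain = fallbackChains[capability] ?? [];
  for (const name of chain) {
    if (providers[name].healthy) return providers[name];
  }
  throw new Error('NO_HEALTHY_PROVIDER');
}

// Simulated Vertex AI incident: the chain moves on to the next healthy provider.
const duringOutage: Record<ProviderName, ProviderStub> = {
  vertex: { name: 'vertex', healthy: false },
  anthropic: { name: 'anthropic', healthy: true },
  openai: { name: 'openai', healthy: true },
};
console.log(firstHealthy('review.summarize', duringOutage).name); // prints "anthropic"
```

An unknown capability has an empty chain in this sketch, so it fails with NO_HEALTHY_PROVIDER instead of silently picking a default provider.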

4. Model Catalog

The catalog is the source of truth for which model serves which capability. New entries require an ADR-or-equivalent record.

| Model | Provider | Modality | Context | Cost class | Latency class | Primary use cases | Fallback chain |
| --- | --- | --- | --- | --- | --- | --- | --- |
| gemini-1.5-pro | Vertex AI | LLM, multimodal | 1M | High | Medium (~2-6 s) | Review synthesis, policy drafting, long-context analysis | claude-sonnet → gpt-4.1 |
| gemini-1.5-flash | Vertex AI | LLM, multimodal | 1M | Medium | Low (~0.5-2 s) | Pricing suggestions, anomaly explanations, upsell drafting, guest message draft | claude-sonnet → gpt-4o-mini |
| gemini-1.5-flash-8b | Vertex AI | LLM | 1M | Low | Very low (~0.2-0.8 s) | Translation drafts, room description generation, AI tutor, OCR post-processing | gemini-flash |
| text-embedding-004 | Vertex AI | Embedding | 2k | Very low | Very low | Room descriptions, review summaries, FAQ, RAG | local MiniLM (degraded) |
| text-multilingual-embedding-002 | Vertex AI | Embedding | 2k | Very low | Very low | Multilingual content embeddings | text-embedding-004 |
| claude-3-5-sonnet | Anthropic (via Vertex partner) | LLM | 200k | High | Medium | Long-context reasoning fallback; complex policy drafting | gemini-pro |
| gpt-4o-mini | OpenAI | LLM | 128k | Medium | Low | Tertiary fallback for short prompts |  |
| phi-3-mini-4k-instruct (INT4) | ONNX Edge | LLM | 4k | Free (CPU) | Medium-on-device (~1-3 s) | Offline drafting, AI tutor, simple Q&A | gemini-flash on next sync |
| all-MiniLM-L6-v2 (FP16) | ONNX Edge | Embedding | 256 | Free (CPU) | Very low | Offline RAG over local policies / FAQ | text-embedding-004 on next sync |
| melmastoon-edge-anomaly-v3 | ONNX Edge | Classifier |  | Free (CPU) | Very low | Booking / payment / lock anomaly heuristics | gemini-flash on next sync |
| mobilenet-v3-small-image-quality | ONNX Edge | Vision classifier |  | Free (CPU) | Very low | Photo upload quality flag | Vertex Vision |

Cost classes (per-1k-tokens reference, indicative): Very low < $0.0005, Low < $0.005, Medium < $0.05, High ≥ $0.05.
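The class boundaries can be read as half-open intervals. The helper below is an illustrative sketch (costClass is a hypothetical name) that maps an indicative per-1k-token price to its class, which makes the behavior at the exact thresholds explicit.

```typescript
type CostClass = 'Very low' | 'Low' | 'Medium' | 'High';

// Thresholds mirror the indicative reference values listed above.
function costClass(usdPer1kTokens: number): CostClass {
  if (usdPer1kTokens < 0.0005) return 'Very low';
  if (usdPer1kTokens < 0.005) return 'Low';
  if (usdPer1kTokens < 0.05) return 'Medium';
  return 'High';
}
```

A price sitting exactly on a threshold falls into the more expensive class, consistent with "High ≥ $0.05".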


5. Prompt Registry

Every prompt template has an ID, a version, an owner, an eval suite, and a deprecation policy. Prompts are first-class versioned artifacts.

5.1 ID format

PRMP_<DOMAIN>_<NUMBER>_v<n>

  • <DOMAIN> — uppercase, snake_case domain code: PRICING, HK, ANOMALY, UPSELL, MSG, REVIEW, BOOKING, TUTOR, DESC, TRANSLATE, OCR, STT.
  • <NUMBER> — zero-padded ordinal within the domain.
  • <n> — integer prompt version (semver-major).

Examples: PRMP_PRICING_001_v3, PRMP_MSG_004_v1, PRMP_ANOMALY_002_v5.
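A small parser makes the ID format checkable at service boundaries. This is a sketch under assumptions: parsePromptId and PromptIdParts are hypothetical names, domains are taken to be letters only (as in the codes listed above), and the zero-padding width is not enforced.

```typescript
interface PromptIdParts {
  domain: string;
  ordinal: number;
  version: number;
}

// PRMP_<DOMAIN>_<NUMBER>_v<n>; letters-only domains match the codes listed above.
const PROMPT_ID_PATTERN = /^PRMP_([A-Z]+)_(\d+)_v(\d+)$/;

// Returns the parsed parts, or null when the ID does not follow the format.
function parsePromptId(id: string): PromptIdParts | null {
  const m = PROMPT_ID_PATTERN.exec(id);
  if (m === null) return null;
  return { domain: m[1] ?? '', ordinal: Number(m[2]), version: Number(m[3]) };
}
```

For example, parsePromptId('PRMP_PRICING_001_v3') yields domain 'PRICING', ordinal 1, and version 3, while an ID with a missing version suffix yields null.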

5.2 Storage

  • Postgres table prompt_templates inside ai-orchestrator-service:
    CREATE TABLE prompt_templates (
      id text PRIMARY KEY, -- 'PRMP_PRICING_001_v3'
      domain text NOT NULL,
      ordinal int NOT NULL,
      version int NOT NULL,
      status text NOT NULL CHECK (status IN ('draft','active','deprecated','retired')),
      owner_user_id uuid NOT NULL,
      capability text NOT NULL,
      system_prompt text NOT NULL,
      user_template text NOT NULL,
      output_schema jsonb NOT NULL, -- JSON Schema
      default_model text NOT NULL,
      eval_suite_id text NOT NULL,
      notes text,
      created_at timestamptz NOT NULL DEFAULT now(),
      retired_at timestamptz,
      UNIQUE (domain, ordinal, version)
    );
  • Replicated to the Electron desktop's prompt_templates SQLite table on session start so offline edge inference uses the same templates.

5.3 Versioning + deprecation

  • New prompt versions ship as new rows (never overwrite). The active row for a given (domain, ordinal) is what new traffic hits.
  • A previously active row flips to deprecated for at least 14 days before retiring (gives consumers time to drain).
  • retired rows remain in the table for audit but cannot be served.
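The lifecycle rules above can be expressed as a transition check. This sketch assumes a forward-only lifecycle (draft → active → deprecated → retired) and a deprecatedAt timestamp that the table definition does not include; canTransition and allowedTransitions are hypothetical names.

```typescript
type PromptStatus = 'draft' | 'active' | 'deprecated' | 'retired';

const MIN_DEPRECATION_DAYS = 14;
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Assumed forward-only lifecycle: rows are never overwritten or revived.
const allowedTransitions: Record<PromptStatus, PromptStatus[]> = {
  draft: ['active'],
  active: ['deprecated'],
  deprecated: ['retired'],
  retired: [],
};

function canTransition(from: PromptStatus, to: PromptStatus, deprecatedAt: Date | null, now: Date): boolean {
  if (!allowedTransitions[from].includes(to)) return false;
  if (to === 'retired') {
    // Retirement requires at least 14 days in the deprecated state.
    if (deprecatedAt === null) return false;
    return now.getTime() - deprecatedAt.getTime() >= MIN_DEPRECATION_DAYS * MS_PER_DAY;
  }
  return true;
}
```

Keeping the drain window in one constant means the 14-day policy can be tested on its own, separately from whatever job performs the retirement.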

5.4 Eval suites

Every prompt template references an eval_suite_id that points to a curated set of inputs + expected outcomes (or scoring criteria). New versions must beat or match the active version on the eval suite before promotion.


6. Provenance Metadata (Mandatory)

No AI artifact is persisted, displayed, or used to drive a decision without AIProvenance. The schema is defined in 02 Enterprise Architecture §9.3 and reproduced here for completeness:

interface AIProvenance {
  promptId: string; // 'PRMP_PRICING_001_v3'
  promptVersion: SemVer;
  model: string; // 'gemini-1.5-flash' or 'phi-3-mini-4k-instruct'
  modelVersion?: string;
  traceId: string; // W3C traceparent
  occurredAt: ISODate;
  tokensIn: number;
  tokensOut: number;
  costUsd: number; // computed at the gateway
  local: boolean; // true iff edge inference
  cacheHit: boolean;
  safety: { input: SafetyVerdict; output: SafetyVerdict };
  reviewedBy?: UserId; // populated on HITL acceptance
  reviewedAt?: ISODate;
  decision?: 'accepted'|'rejected'|'modified'; // HITL outcome
}
  • The UI surfaces an "AI" badge on any artifact carrying provenance; click reveals the metadata.
  • audit-service retains provenance for 7 years.

7. Use Case Catalog

For each use case below: trigger, prompt template, model, latency target, HITL gate, fallback strategy, eval method.

7.1 Dynamic pricing suggestion (per room-type per day)

| Field | Value |
| --- | --- |
| Trigger | Daily 02:00 local; on-demand from pricing-service UI; on inventory.allocated.v1 for tomorrow when occupancy crosses 70% threshold |
| Prompt | PRMP_PRICING_001_v3 (system: pricing analyst persona; user: occupancy + 30-day baseline + seasonality + competitor anchor) |
| Model | gemini-1.5-flash |
| Latency target | p95 < 1.5 s |
| HITL gate | Yes when suggestion deviates >5% from BAR baseline; otherwise auto-applied within ±5% band |
| Fallback | Deterministic baseline (BAR + day-of-week multiplier) |
| Eval | Backtest against last 12 months: did the suggestion improve revenue vs deterministic? Precision on "should-raise" / "should-lower" labels |
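The HITL gate for pricing reduces to a band check against the BAR baseline. The sketch below is illustrative (pricingNeedsHitl is a hypothetical name); treating a missing or non-positive baseline as "needs review" is an assumption, chosen as the conservative default.

```typescript
// True when a suggestion falls outside the auto-apply band around the BAR baseline.
function pricingNeedsHitl(suggested: number, barBaseline: number, bandPct = 5): boolean {
  // Assumption: without a usable baseline, always ask a human.
  if (barBaseline <= 0) return true;
  // Cross-multiplied comparison avoids floating-point division at the exact boundary.
  return Math.abs(suggested - barBaseline) * 100 > bandPct * barBaseline;
}
```

With a baseline of 100, a suggestion of 105 is auto-applied (exactly on the band edge), while 106 or 94 goes to the reviewer.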

7.2 Demand forecast (30 / 60 / 90 days)

| Field | Value |
| --- | --- |
| Trigger | Nightly 03:00 local |
| Prompt | PRMP_PRICING_002_v2 (explanation only; the numeric forecast is from a Vertex AI Forecast or local quantile model, the LLM annotates) |
| Model | Forecast: tabular model (Vertex AI Forecast or local LightGBM); annotation: gemini-1.5-flash-8b |
| Latency target | Batch; SLA 1 hour |
| HITL gate | No (informational) |
| Fallback | Last-year same-period naive forecast |
| Eval | MAPE per horizon; quantile coverage |

7.3 Housekeeping schedule optimization

| Field | Value |
| --- | --- |
| Trigger | At shift start; on housekeeping.task.assigned.v1 batch flush |
| Prompt | None; this runs on a small TSP-like solver (melmastoon-edge-hkt-v2.onnx) on the Electron desktop |
| Model | Edge ONNX (or fallback Vertex AI flash for explanation only) |
| Latency target | < 500 ms on device |
| HITL gate | Yes: the lead must accept the proposed order; can reorder before dispatch |
| Fallback | Greedy nearest-floor heuristic |
| Eval | Time-to-complete vs baseline; staff acceptance rate |

7.4 Anomaly detection

| Field | Value |
| --- | --- |
| Trigger | On every reservation.confirmed.v1, payment.captured.v1, lock.key.issued.v1, reservation.checkout.v1; daily aggregation for occupancy spikes |
| Prompt | PRMP_ANOMALY_001_v4 for explanation; classification is the edge anomaly classifier |
| Model | Edge: melmastoon-edge-anomaly-v3; cloud explanation: gemini-1.5-flash |
| Latency target | Edge: < 200 ms; explanation: < 2 s |
| HITL gate | Yes for any auto-block; alerts only otherwise |
| Fallback | Rule-based heuristics (rapid-fire booking from same IP, payment failure pattern, key not returned > 24 h, etc.) |
| Eval | Precision / recall on labeled incidents; false-positive rate per tenant |

7.5 Upsell recommendation

| Field | Value |
| --- | --- |
| Trigger | At booking confirmation; pre-arrival 48 h before check-in |
| Prompt | PRMP_UPSELL_001_v2 (system: hospitality concierge persona; user: reservation details + property amenity catalog) |
| Model | gemini-1.5-flash |
| Latency target | p95 < 1 s |
| HITL gate | No (suggested to guest, guest decides) |
| Fallback | Static rule set (breakfast for stays >2 nights, late checkout for premium room types) |
| Eval | Conversion rate per recommendation type |

7.6 Smart guest message draft (multilingual, tone-controlled)

| Field | Value |
| --- | --- |
| Trigger | Front desk requests a draft; pre-arrival template fills; post-stay thank-you |
| Prompt | PRMP_MSG_001_v3 (variables: tone (formal/warm), locale, message intent, reservation context) |
| Model | gemini-1.5-flash (online) or phi-3-mini (offline) |
| Latency target | p95 < 1.5 s online; < 4 s offline |
| HITL gate | Yes, always for guest-facing messages; the staff edits + sends |
| Fallback | Static templates per intent + locale |
| Eval | Staff acceptance rate; edit-distance from draft to send |

7.7 Review summarization (multilingual)

| Field | Value |
| --- | --- |
| Trigger | Weekly summary across last 30 / 90 days; on-demand for GM dashboard |
| Prompt | PRMP_REVIEW_001_v2 (produces { themes:[…], sentiment, actionable:[…], topQuotes:[…] }) |
| Model | gemini-1.5-pro (long context); fallback claude-3-5-sonnet |
| Latency target | p95 < 8 s for ≤ 200 reviews |
| HITL gate | No (informational) |
| Fallback | Rule-based theme extraction |
| Eval | Theme F1 against a hand-labeled set; sentiment accuracy |

7.8 Booking conversion assist (consumer chat hint on the meta layer)

| Field | Value |
| --- | --- |
| Trigger | User idles >15 s on a results page; explicit "I need help" |
| Prompt | PRMP_BOOKING_001_v1 (short prompt; outputs a single suggestion or a clarifying question) |
| Model | gemini-1.5-flash |
| Latency target | p95 < 800 ms |
| HITL gate | No |
| Fallback | Static FAQ links |
| Eval | Hint → click-through → booking-completion uplift |

7.9 AI tutor for backoffice

| Field | Value |
| --- | --- |
| Trigger | Staff opens the help drawer; types a question |
| Prompt | PRMP_TUTOR_001_v2 (system: helpful product expert; tools: linkToScreen(screenId), runWalkthrough(walkthroughId)) |
| Model | gemini-1.5-flash-8b (online) or phi-3-mini (offline) |
| Latency target | p95 < 1.5 s |
| HITL gate | No (informational; tutor never executes destructive actions) |
| Fallback | Local FAQ vector search (MiniLM + cosine) |
| Eval | Resolution rate; thumbs-up rate; deflection from support tickets |

7.10 Description generation (room types, property)

| Field | Value |
| --- | --- |
| Trigger | Tenant clicks "Generate description" on a room-type / property |
| Prompt | PRMP_DESC_001_v3 (variables: room features, brand voice, target audience, locale) |
| Model | gemini-1.5-flash-8b |
| Latency target | p95 < 2 s |
| HITL gate | Yes: tenant edits + accepts before publishing |
| Fallback | Template fill |
| Eval | Acceptance rate; edit distance |

7.11 Translation drafts for tenant content

| Field | Value |
| --- | --- |
| Trigger | Tenant adds content in source locale; target locales auto-draft |
| Prompt | PRMP_TRANSLATE_001_v2 (preserves brand voice tokens; flags untranslatable terms) |
| Model | gemini-1.5-flash-8b |
| Latency target | p95 < 3 s per chunk |
| HITL gate | Yes: tenant reviews per locale |
| Fallback | Cloud Translation API as a baseline |
| Eval | Native-speaker review acceptance rate per locale |

7.12 OCR for ID scan at check-in (with HITL)

| Field | Value |
| --- | --- |
| Trigger | Front desk scans guest ID at check-in |
| Prompt | OCR via Vertex AI Document AI; LLM post-process (PRMP_OCR_001_v1) to extract structured fields |
| Model | Document AI + gemini-1.5-flash-8b |
| Latency target | p95 < 4 s end-to-end |
| HITL gate | Yes, always: staff verifies extracted fields before save |
| Fallback | Manual entry (ID image still attached) |
| Eval | Field-level precision; staff edit rate per field |

7.13 Voice transcription for staff hands-free updates

| Field | Value |
| --- | --- |
| Trigger | Housekeeper holds the "voice" button on the desktop or mobile to update task state |
| Prompt | STT via Vertex AI Speech (or Whisper-large-v3 via ONNX edge if offline); intent extraction PRMP_STT_001_v1 |
| Model | Vertex AI Speech-to-Text + gemini-1.5-flash-8b for intent |
| Latency target | p95 < 2 s |
| HITL gate | No (the action is always reversible: flip a status; it's logged and undoable) |
| Fallback | Manual taps |
| Eval | WER per locale; intent classification accuracy |

8. HITL Gates (consolidated list)

The following actions must be gated by human acceptance before they take effect. Each is recorded as a Decision (dec_…) and linked to the resulting state-change event.

| Action | Why HITL is required |
| --- | --- |
| Pricing publish where deviation > 5% from baseline | Material revenue impact |
| Reservation auto-cancel triggered by anomaly | Irreversible to the guest |
| Refund initiated by AI | Money leaving the tenant |
| Bulk lock-credential revoke triggered by anomaly | Operational + guest-experience impact |
| Guest-facing AI-drafted message dispatch | Brand + relationship risk |
| Tenant content publish (description, translation) | Public-facing brand artifact |
| OCR-extracted ID fields written to guest profile | Data integrity + privacy |
| Housekeeping schedule dispatch | Staff scheduling depends on it |
| Auto-block of a flagged booking beyond temporary hold | Revenue + customer impact |

UI affordances:

  • A draft_ai badge on the artifact, with an "Accept", "Modify", "Reject" trio.
  • Required justification on "Reject" (free-text; logged).
  • The accepted state-change event carries decisionId; downstream services can correlate.
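The affordances above imply two checks when a reviewer acts on a draft: a rejection must carry a justification, and only accepted or modified drafts may produce a state change. The sketch below illustrates both; HitlDecision, makeDecision, and mayTakeEffect are hypothetical names, and the decision ID is passed in because its format is not modeled here.

```typescript
type HitlOutcome = 'accepted' | 'modified' | 'rejected';

interface HitlDecision {
  decisionId: string;   // issued elsewhere; the ID format is not modeled here
  outcome: HitlOutcome;
  reviewedBy: string;
  reviewedAt: string;   // ISO-8601 timestamp
  justification?: string;
}

function makeDecision(
  decisionId: string,
  outcome: HitlOutcome,
  reviewedBy: string,
  now: Date,
  justification?: string,
): HitlDecision {
  const trimmed = justification?.trim() ?? '';
  // "Reject" requires a non-empty free-text justification.
  if (outcome === 'rejected' && trimmed === '') {
    throw new Error('JUSTIFICATION_REQUIRED');
  }
  const decision: HitlDecision = { decisionId, outcome, reviewedBy, reviewedAt: now.toISOString() };
  if (trimmed !== '') decision.justification = trimmed;
  return decision;
}

// Only accepted or modified drafts may produce a state-change event.
function mayTakeEffect(decision: HitlDecision): boolean {
  return decision.outcome !== 'rejected';
}
```

The resulting reviewedBy, reviewedAt, and outcome values line up with the optional HITL fields on AIProvenance.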

9. Offline / Edge AI

9.1 What runs on Electron via ONNX Runtime Node

  • Anomaly classification (booking, payment, lock-key) — melmastoon-edge-anomaly-v3.onnx.
  • Embedding generation for offline RAG over the tenant's cached policies + FAQ — all-MiniLM-L6-v2.
  • Draft message suggestions when offline — phi-3-mini-4k-instruct INT4.
  • Simple forecasting for next 7 days — small LightGBM-converted ONNX (melmastoon-edge-forecast-v2.onnx).
  • Image quality scoring for photo upload — mobilenet-v3-small-image-quality.onnx.
  • Housekeeping route optimizer (melmastoon-edge-hkt-v2.onnx).

All edge inference happens in the main process (Node 20). The renderer requests inference via window.melmastoon.ai.infer(capability, input) exposed by contextBridge.

9.2 Packaging + verification

  • Models are packaged with the installer (no first-launch download — the user may be onboarding offline).
  • Each model is shipped with its SHA-256 + a manifest signature signed by the Melmastoon release key.
  • On first run (and on every load) the edge module verifies the signature against the public key embedded in the binary before handing the model to ONNX Runtime; tampering invalidates the signature and the model is not loaded.
  • Model updates ship via electron-updater with the rest of the app; partial model updates are atomic (download to temp → verify → swap).
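The verify step can be sketched with Node's built-in crypto module. Everything specific here is an assumption: the document does not name the signature scheme (Ed25519 is used for illustration), the manifest layout is invented, and verifyModel and ModelManifest are hypothetical names. The demo uses a throwaway key pair in place of the release key.

```typescript
import { createHash, generateKeyPairSync, sign, verify, type KeyObject } from 'node:crypto';

// Hypothetical manifest layout: one SHA-256 digest per model file.
interface ModelManifest {
  models: { file: string; sha256: string }[];
}

function sha256Hex(bytes: Buffer): string {
  return createHash('sha256').update(bytes).digest('hex');
}

// True only if the manifest is signed by the release key AND the model bytes
// hash to the digest the manifest records for that file.
function verifyModel(
  file: string,
  modelBytes: Buffer,
  manifestBytes: Buffer,
  manifestSignature: Buffer,
  releasePublicKey: KeyObject,
): boolean {
  if (!verify(null, manifestBytes, releasePublicKey, manifestSignature)) return false;
  const manifest = JSON.parse(manifestBytes.toString('utf8')) as ModelManifest;
  const entry = manifest.models.find((m) => m.file === file);
  return entry !== undefined && entry.sha256 === sha256Hex(modelBytes);
}

// Demo with a throwaway Ed25519 key pair standing in for the release key.
const { publicKey, privateKey } = generateKeyPairSync('ed25519');
const demoModel = Buffer.from('fake onnx bytes');
const demoManifest = Buffer.from(
  JSON.stringify({ models: [{ file: 'demo.onnx', sha256: sha256Hex(demoModel) }] }),
);
const demoSignature = sign(null, demoManifest, privateKey);

console.log(verifyModel('demo.onnx', demoModel, demoManifest, demoSignature, publicKey)); // true
console.log(verifyModel('demo.onnx', Buffer.from('tampered'), demoManifest, demoSignature, publicKey)); // false
```

Checking the manifest signature before parsing it means a forged manifest is rejected before any of its contents are trusted. The same function could run on the temp file during an update, before the swap.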

9.3 Audit trail for edge inference

Edge inference still emits ai.inference.local.completed.v1 to the local outbox. On next sync the event is replayed for audit; the cloud ai-orchestrator-service accepts these events and persists provenance with local: true.

9.4 Hard rules

  • Edge inference never sees PCI data (no cards), never sees lock vendor secrets, never runs guest-facing message dispatch without HITL.
  • Model files live under app.getPath('userData')/models/ with restrictive ACLs.
  • Idle-unload: large models (Phi-3-mini) unload after 10 minutes of inactivity to free RAM.

10. Eval & Guardrails

10.1 Golden sets

  • Every use case has a curated golden set (eval_suite_id referenced from the prompt template).
  • Stored in ai-orchestrator-service Postgres + version-controlled in a companion repo (melmastoon-ai-evals) for reproducibility.
  • Golden sets include both positive and adversarial examples (prompt-injection attempts, edge-case inputs).

10.2 Precision / recall targets

| Use case | Metric | Target |
| --- | --- | --- |
| Pricing suggestion | Direction-accuracy on labeled "should-raise / hold / lower" | ≥ 75% |
| Demand forecast | MAPE @ 30-day horizon | ≤ 18% |
| Anomaly detection | Precision @ recall 0.9 | ≥ 0.7 |
| Upsell | Conversion uplift vs static rules | ≥ 1.3× |
| OCR | Field-level precision | ≥ 0.95 |
| Translation | Native-speaker acceptance | ≥ 0.85 |
| Tutor | Resolution rate | ≥ 0.7 |

10.3 A/B routing for prompt changes

  • New prompt versions ship as draft → routed to 5% of traffic → eval suite + production metrics monitored for 7 days → promoted to active on green.
  • Per-tenant opt-out for prompt experimentation (Plus + Enterprise plans).
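One way to route a fixed 5% of traffic to a draft version is deterministic hash bucketing, so the same routing key always sees the same version. This is a sketch, not the gateway's actual mechanism: routePromptVersion, the choice of routing key, and the FNV-1a-style hash over UTF-16 code units are all assumptions.

```typescript
// FNV-1a-style 32-bit hash over UTF-16 code units: small and deterministic, not cryptographic.
function fnv1a(text: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force unsigned 32-bit
}

function routePromptVersion(
  routingKey: string,
  activeId: string,
  draftId: string,
  tenantOptedOut: boolean,
  draftPercent = 5,
): string {
  // Tenants that opted out of experimentation always get the active version.
  if (tenantOptedOut) return activeId;
  const bucket = fnv1a(routingKey) % 100; // stable bucket in 0..99
  return bucket < draftPercent ? draftId : activeId;
}
```

Bucketing on a stable key (for example a tenant ID plus capability) keeps a tenant's experience consistent for the whole 7-day observation window, where per-request random sampling would not.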

10.4 Cost guardrails

  • Per-tenant monthly token budget with soft cap (warn at 80%) and hard cap (degrade to deterministic fallback at 100%).
  • Per-feature quotas inside a tenant — pricing + anomaly + upsell each have independent caps.
  • Real-time cost dashboard surfaced to tenant owner + gm roles.
  • Alerting: email + in-app at 80%; in-app + on-call SRE at 100%.
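The soft and hard caps map onto a three-state check. In this sketch, budgetState, effectiveState, and the rule that a call is governed by the stricter of the tenant budget and the feature quota are assumptions made for illustration.

```typescript
type BudgetState = 'ok' | 'soft_cap' | 'hard_cap';

function budgetState(usedTokens: number, budgetTokens: number): BudgetState {
  if (budgetTokens <= 0 || usedTokens >= budgetTokens) return 'hard_cap';
  // Integer comparison avoids floating-point edge cases at exactly 80%.
  if (usedTokens * 10 >= budgetTokens * 8) return 'soft_cap';
  return 'ok';
}

const severity: Record<BudgetState, number> = { ok: 0, soft_cap: 1, hard_cap: 2 };

// Assumption: a call is governed by the stricter of the tenant budget and the feature quota.
function effectiveState(tenant: BudgetState, feature: BudgetState): BudgetState {
  return severity[tenant] >= severity[feature] ? tenant : feature;
}

// At the hard cap the gateway degrades to the deterministic fallback.
function shouldUseDeterministicFallback(state: BudgetState): boolean {
  return state === 'hard_cap';
}
```

For a 1,000-token budget, 799 used tokens is ok, 800 triggers the soft-cap warning, and 1,000 triggers the hard cap.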

11. Cost & Budget

| Lever | Mechanism |
| --- | --- |
| Tiered model routing | Default to the cheapest model that meets latency + quality; escalate only when needed |
| Edge first | Capabilities with edge models run on the desktop; cloud is the fallback |
| Cache | Per-tenant prompt+input hash cache; cache hit returns instantly with cacheHit: true provenance; TTL per capability |
| Per-tenant budgets | Soft + hard caps; degrade to deterministic fallback on hard cap; per-feature sub-budgets |
| Batch where possible | Embeddings batched; nightly forecast batched per property |
| Prompt economy | Templates iteratively shortened; outputs schema-constrained to minimize output tokens |
| Model right-sizing | gemini-1.5-flash-8b is the default for short-form generation; only escalate to flash or pro when justified |

The cost dashboard (Looker Studio + BigQuery melmastoon_analytics_prod.ai_calls_fact) breaks down spend per (tenant_id, capability, model) and surfaces top consumers per period.


12. Vector Storage

All embeddings live in pgvector inside ai-orchestrator-service's Postgres schema. Per-tenant namespacing via tenant_id column + RLS (see 06 Data Models §7).

12.1 Per-tenant namespace

  • Every k-NN query carries WHERE tenant_id = $1 and runs under a session with app.tenant_id set; RLS is the second line of defense.
  • Cross-tenant query embeddings (embeddings_search_queries) live in a separate table with tenant_id nullable for anonymous queries; no PII.

12.2 HNSW indexes

  • m=16, ef_construction=64 as defaults.
  • Per-call SET LOCAL hnsw.ef_search = 40 for the typical recall-vs-latency target; tuned per index after observing recall in production.
  • Re-indexing triggered when corpus growth exceeds 25% since last build.
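The defaults above translate into pgvector statements roughly as follows. The table and column names (embeddings, embedding, tenant_id) and the choice of cosine distance are assumptions; the statements are kept as strings so the sketch stays independent of any database client.

```typescript
// Hypothetical table: embeddings(id, tenant_id, embedding vector(768)).
const createIndexSql = `
CREATE INDEX IF NOT EXISTS embeddings_hnsw_idx
  ON embeddings USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);`;

// SET LOCAL only lasts until the end of the current transaction,
// so the query must run inside the same BEGIN ... COMMIT block.
const knnStatements: string[] = [
  'BEGIN',
  'SET LOCAL hnsw.ef_search = 40',
  'SELECT id FROM embeddings WHERE tenant_id = $1 ORDER BY embedding <=> $2 LIMIT $3',
  'COMMIT',
];

// Re-index once the corpus has grown by more than 25% since the last build.
function needsReindex(rowsAtLastBuild: number, rowsNow: number): boolean {
  return rowsNow > rowsAtLastBuild * 1.25;
}
```

The `<=>` operator is pgvector's cosine distance, which matches the vector_cosine_ops operator class on the index; a mismatch between the two would prevent the index from being used.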

12.3 RAG over tenant-private content

The gateway's rag() method runs a per-tenant retrieval over the tenant's:

  • Policies (cancellation, house rules, child + pet policy, etc.)
  • FAQ
  • Staff playbook (uploaded SOPs, training docs)
  • Property amenity catalog

The retrieved chunks are injected into the prompt as context. The model is instructed to ground answers in the provided context and to refuse if the answer is not present.

Edge RAG runs the same pattern with the local SQLite-cached subset and all-MiniLM-L6-v2 embeddings.
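On the edge, retrieval over a small cached corpus can be a brute-force cosine ranking with no index at all. The sketch below shows that ranking step only; Chunk, cosineSimilarity, and topK are hypothetical names, and the embeddings are assumed to have been computed already.

```typescript
interface Chunk {
  id: string;
  text: string;
  embedding: number[];
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  const n = Math.min(a.length, b.length);
  for (let i = 0; i < n; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  // A zero vector has no direction; treat it as unrelated to everything.
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}

// Rank all chunks by similarity to the query and keep the best k.
function topK(query: number[], chunks: Chunk[], k: number): Chunk[] {
  return chunks
    .map((chunk) => ({ chunk, score: cosineSimilarity(query, chunk.embedding) }))
    .sort((p, q) => q.score - p.score)
    .slice(0, k)
    .map((scored) => scored.chunk);
}
```

The chunks returned by topK are what would be injected into the prompt as grounding context.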


13. Safety

13.1 Prompt injection defense

  • System prompt isolation — assembled centrally; never composed from user input.
  • Input length limits — per capability (4 KB guest-facing, 16 KB admin-side).
  • Output schema validation — every response validated against a JSON schema; non-conforming outputs are rejected.
  • Tool-call allowlist — for capabilities with tool use, only the declared tools are callable; tool execution is server-side.
  • Adversarial eval examples — every golden set includes known prompt-injection patterns; new attacks added on detection.
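Two of the defenses above are simple enough to sketch directly: the per-capability input limit and the tool-call allowlist. The names are hypothetical, and 1 KB is taken to mean 1,024 bytes of UTF-8, which the document does not specify.

```typescript
// Assumption: limits are measured in UTF-8 bytes, with 1 KB = 1024 bytes.
const INPUT_LIMIT_BYTES = { guestFacing: 4 * 1024, adminSide: 16 * 1024 } as const;
type Surface = keyof typeof INPUT_LIMIT_BYTES;

function withinInputLimit(input: string, surface: Surface): boolean {
  return new TextEncoder().encode(input).length <= INPUT_LIMIT_BYTES[surface];
}

// Only tools declared for the capability may be called; anything else is refused.
function isToolCallAllowed(declaredTools: readonly string[], requestedTool: string): boolean {
  return declaredTools.includes(requestedTool);
}

// The AI tutor declares exactly two tools in its prompt definition.
const tutorTools = ['linkToScreen', 'runWalkthrough'] as const;
```

Measuring bytes and not characters matters for Dari, Pashto, Persian, and Tajik input, where a single character can occupy two or more bytes.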

13.2 PII redaction in logs

  • Pre-call redaction strips emails, phones, government IDs, credit-card-shaped strings, IBANs from anything bound for the model and from anything written to logs.
  • Logs use pino with declared redactors per service; CI verifies no new field is added without a redactor entry.
  • AI traces capture token counts and model decisions, never raw user content (unless redacted).
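Pattern-based redaction for some of the categories above can be sketched as an ordered list of replacements. These regular expressions are illustrative only: production rules would need locale-specific tuning, and government IDs are left out because their formats vary by country. redactPii and the placeholder labels are hypothetical.

```typescript
// Order matters: IBANs and card numbers are replaced before the looser phone pattern.
const redactors: { label: string; pattern: RegExp }[] = [
  { label: '[EMAIL]', pattern: /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g },
  { label: '[IBAN]', pattern: /\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b/g },
  { label: '[CARD]', pattern: /\b\d(?:[ -]?\d){12,18}\b/g },   // 13 to 19 digits, optional separators
  { label: '[PHONE]', pattern: /\+?\d[\d ()-]{7,}\d/g },
];

// Apply every redactor in sequence to the text.
function redactPii(text: string): string {
  return redactors.reduce((acc, r) => acc.replace(r.pattern, r.label), text);
}

console.log(redactPii('Contact amina@example.com or +93 70 123 4567, card 4111 1111 1111 1111.'));
// prints "Contact [EMAIL] or [PHONE], card [CARD]."
```

Running the same function over both the model-bound payload and the log line keeps the two redaction paths from drifting apart.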

13.3 Content moderation on guest-facing output

  • Pre-moderation on input (block on harm_high).
  • Post-moderation on output (block on harm_*, hate, sexual, dangerous, pii_exposed).
  • Blocked outputs return a deterministic fallback and raise ai.moderation.blocked.v1 to the audit log.

14. Observability

| Signal | Source | Destination |
| --- | --- | --- |
| Per-call traces | OpenTelemetry inside ai-orchestrator-service; spans annotated with model, promptId, cacheHit, tokens, costUsd, latencyMs | Cloud Trace; sampled at 10% (100% on errors + on > p95 latency) |
| Token counts + cost | Computed at the gateway from provider response | BigQuery melmastoon_analytics_prod.ai_calls_fact |
| Latency histograms | Per (capability, model, provider) | Cloud Monitoring; per-tenant breakdown for hot tenants |
| Cache hit rate | Per (capability, tenantId) | Cloud Monitoring |
| Model error rates | Per provider | Cloud Monitoring + alert on sustained > 1% |
| HITL acceptance rate | Per (capability, tenantId) | BigQuery + Looker Studio dashboard |
| Eval drift | Comparing active prompt eval vs golden set on schedule | Alert on regression |
| Provenance integrity | Sample audit job verifies every persisted AI artifact has provenance | Daily report |

A purpose-built AI eval logging dashboard (Looker Studio over BigQuery) gives the AI team a real-time view of:

  • Calls per capability per tenant per model.
  • Cost burn vs budget.
  • HITL acceptance / rejection / modification rates.
  • Drift signals on key prompts.
  • Top failure modes (output-schema violations, moderation blocks, provider errors).

15. Roadmap

Phase 1 — Minimal AI (MVP)

  • Heuristic + simple pricing suggestions (rule-based with optional LLM annotation).
  • Cloud-only AI through ai-orchestrator-service.
  • One model (gemini-1.5-flash) covering most capabilities.
  • Provenance + HITL gates implemented from day 1.
  • Per-tenant budget + soft cap.
  • Edge inference: anomaly classifier + image-quality scorer only.

Phase 2 — Expansion

  • Full model catalog (Gemini Pro + Flash + Flash-8B; Anthropic + OpenAI fallbacks).
  • All Phase-1-listed use cases live.
  • Edge: Phi-3-mini for offline drafting; MiniLM for offline RAG; HK route optimizer.
  • Prompt registry with eval suites + A/B routing.
  • Per-tenant cost dashboard.
  • Translation drafts + multilingual review summarization.

Phase 3 — Personalization

  • Per-tenant RAG over policies / FAQ / playbook (cloud + edge).
  • Booking conversion assist on the consumer meta layer.
  • AI tutor with deep-link tools.
  • Voice STT for hands-free housekeeping updates.
  • Bug bounty + adversarial eval expansion.
  • Per-feature quotas + per-tenant residency-aware routing.

Phase 4 — Self-tuning

  • Per-tenant LoRA fine-tunes (where data volume + tenant consent permit) for tenant-voice messaging + property-specific descriptions.
  • Continuous eval pipelines that promote prompts automatically on green.
  • Federated edge model updates via signed differential model packs.
  • Per-tenant model preference (e.g., Anthropic-only for tenants with that contractual preference).

Cross-references: per-service AI integration details live in services/<service-name>/AI_INTEGRATION.md. The model catalog, prompt registry schema, and eval suites are owned by ai-orchestrator-service and version-controlled. Safety + provenance + HITL contracts are referenced from 07 Security & Tenancy §8 and from every use-case implementation.