08 — AI Architecture
Companion: 02 Enterprise Architecture · 03 Microservices · 05 API Design · 06 Data Models · 07 Security & Tenancy · 09 Lock & Key Integration · 12 Desktop Spec · ADR-0003 Electron Offline-First
This document is the canonical AI architecture for Ghasi Melmastoon. It defines the AI thesis, the gateway pattern, provider routing across cloud (Vertex AI) and edge (ONNX Runtime Node on Electron), the model catalog, the prompt registry, mandatory provenance, the use-case catalog, HITL gates, edge inference rules, evaluation + guardrails, cost discipline, vector storage, safety, observability, and the phased roadmap. Every AI-related claim in any other document defers to this one.
1. AI Thesis
Small and medium hotels in low-resource markets do not get to hire revenue managers, demand-forecasting analysts, or full-time CRM specialists. The cost-per-room of those skills is prohibitive at the 8-50 room scale, where most of the Afghanistan, Tajikistan, and Iran markets sit. AI is the force multiplier that lets one general manager + a small staff run an operation that would otherwise need a much larger team.
Concretely, AI inside Melmastoon is responsible for:
- Dynamic pricing suggestions per room-type per day, based on occupancy, seasonality, and historical demand.
- Demand forecasting at 30 / 60 / 90 day horizons.
- Housekeeping route + schedule optimization for the day.
- Anomaly detection on bookings (suspicious rapid-fire bookings, payment risk, key-not-returned patterns, late-checkout patterns, occupancy spikes).
- Upsell recommendations at booking and pre-arrival (room upgrade, breakfast add-on, late checkout, airport transfer).
- Smart guest communications drafting (multilingual, tone-controlled).
- Review summarization across long histories (multilingual; preserves sentiment + actionable themes).
- Multilingual content drafting for tenants — descriptions, FAQs, policies, social copy.
- AI-assisted operations dashboard — surfaces "what changed since you last logged in" with explanation.
- AI tutor — answers staff "how do I…?" questions in-app, with deep links to the right action.
What AI is not allowed to do unilaterally: anything irreversible, anything monetary above a threshold, anything guest-facing without HITL acceptance. AI proposes; humans accept; the audit chain records both.
2. Single AI Gateway (ai-orchestrator-service)
ai-orchestrator-service is the only service that talks to model providers. Every other service, BFF, and client calls it via REST + Pub/Sub through the AIClient port. CI dependency-graph analysis fails any service that imports @google-cloud/vertexai, @anthropic-ai/sdk, openai, onnxruntime-node, or any other model provider SDK outside ai-orchestrator-service. The one exception is the desktop edge module that explicitly belongs to app-desktop-backoffice.
2.1 Why one gateway
| Concern | Without a gateway | With the gateway |
|---|---|---|
| Cost control | Every service has its own usage; no central budget enforcement | Per-tenant budgets + per-feature quotas + soft/hard caps in one place |
| Audit + provenance | Each service stamps its own (or doesn't); inconsistent | Every call carries AIProvenance issued centrally; one schema |
| Model swap | Every caller hardcodes a provider SDK; rewrite-the-world to switch | One adapter swap inside the gateway; callers see no change |
| Prompt versioning | Prompts scattered, drift, untested | Prompts in a registry with eval suites + deprecation policy |
| Moderation | Each caller "should" moderate input/output; in practice doesn't | Pre/post moderation enforced by the gateway |
| PII redaction | Each caller responsible; easy to miss | Mandatory pre-call redaction enforced by the gateway |
| Cache | Each caller caches differently; cache poisoning across tenants possible | One per-tenant cache with explicit policy |
| Observability | Different metrics shape per caller | Uniform traces, token counts, latency, cost in one place |
2.2 Surface
Caller side (NestJS DI inside any service):
    // Constructor injection of the AIClient port
    constructor(@Inject('AIClient') private readonly ai: AIClient) {}

    // Later, inside a method of the same class:
    const result = await this.ai.complete({
      capability: 'pricing.suggest',
      promptId: 'PRMP_PRICING_001_v3',
      tenantId,
      input: { propertyId, roomTypeId, date, occupancy, baseline, seasonalSignal },
      timeoutMs: 4_000,
      fallback: 'deterministic',
      correlation: { traceId, requestId },
    });
The gateway:
- Pre-call: moderate the input, redact PII, check the per-tenant + per-feature budget, pin the prompt version from the registry, attach the system prompt, hash the input for cache lookup.
- Route: decide cloud (Vertex AI / fallback Anthropic / OpenAI) vs edge (ONNX Runtime Node on the Electron desktop). If cloud, pick the model from the catalog by (capability, latencyClass, costClass, fallbackChain).
- Call: invoke the provider through the appropriate adapter. Enforce timeout + retries with jitter.
- Post-call: moderate the output, validate against the use-case JSON schema, stamp AIProvenance, record cost, write to cache, persist the artifact + provenance, emit ai.gateway.call.completed.v1.
- Return: the caller receives a typed result + provenance reference; the caller never sees raw model details.
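The "timeout + retries with jitter" part of the Call step could look roughly like the sketch below. It assumes full-jitter exponential backoff, which is one common choice and is not confirmed by this document. The names backoffDelayMs, withTimeout, and callWithRetry, and the default base and cap values, are hypothetical.

```typescript
// Full-jitter exponential backoff: the delay is uniform in [0, min(cap, base * 2^attempt)).
// `rng` is injectable so the function is deterministic under test.
function backoffDelayMs(
  attempt: number,
  baseMs = 200, // assumed default, not a documented gateway setting
  capMs = 2_000, // assumed default, not a documented gateway setting
  rng: () => number = Math.random,
): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(rng() * ceiling);
}

// Rejects with AI_TIMEOUT if `work` does not settle within `timeoutMs`.
function withTimeout<T>(work: Promise<T>, timeoutMs: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error('AI_TIMEOUT')), timeoutMs);
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

// Retries a provider call up to `maxAttempts` times, sleeping a jittered delay between tries.
async function callWithRetry<T>(
  call: () => Promise<T>,
  opts: { timeoutMs: number; maxAttempts: number; sleep?: (ms: number) => Promise<void> },
): Promise<T> {
  const sleep = opts.sleep ?? ((ms: number) => new Promise<void>((r) => setTimeout(r, ms)));
  let lastError: unknown = new Error('NO_ATTEMPTS');
  for (let attempt = 0; attempt < opts.maxAttempts; attempt++) {
    try {
      return await withTimeout(call(), opts.timeoutMs);
    } catch (err) {
      lastError = err;
      if (attempt < opts.maxAttempts - 1) await sleep(backoffDelayMs(attempt));
    }
  }
  throw lastError;
}
```

Jitter spreads retries from many callers over time, so a brief provider outage is less likely to be followed by a synchronized retry burst.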
3. Provider Routing
3.1 Cloud — Vertex AI primary
Vertex AI (Gemini family) is the primary cloud provider:
- Native to GCP; private VPC connectivity; no extra egress; CMEK supported.
- Gemini 1.5 Pro / Flash / Flash-8B cover the bulk of our LLM workload at multiple cost tiers.
- Vertex AI Embeddings (text-embedding-004, 768-dim) for tenant content, reviews, room descriptions.
- Vertex AI Vision for image quality scoring on property uploads.
3.2 Cloud — fallback adapters
A single fallback chain per capability lets us survive a Vertex AI incident without manual intervention:
- Anthropic Claude (Sonnet / Opus tiers) via Vertex AI partner endpoint or direct API. Used when the capability calls for stronger long-context reasoning (review synthesis, policy drafting).
- OpenAI (GPT-4.1 / GPT-4o-mini) as a third option for heterogeneous fallback. Restricted to capabilities where data residency allows.
Adapter shape — every provider implements AIProviderPort:
    interface AIProviderPort {
      name: 'vertex' | 'anthropic' | 'openai' | 'onnx-edge';
      complete(req: AICompletionRequest): Promise<AICompletionResponse>;
      embed(req: AIEmbeddingRequest): Promise<AIEmbeddingResponse>;
      vision(req: AIVisionRequest): Promise<AIVisionResponse>;
      moderate(req: AIModerationRequest): Promise<AIModerationResponse>;
      capabilities(): ProviderCapabilities;
    }
3.3 Edge — ONNX Runtime Node (Electron desktop)
The Electron desktop ships with ONNX Runtime Node running in the main process (Node 20). The renderer never holds model bytes; it requests inference via the preload-exposed window.melmastoon.ai.infer(...) channel.
Edge models are small and quantized:
- Phi-3-mini-4k-instruct (INT4 quantization) for short-context drafting and Q&A when offline. Roughly 2.4 GB on disk; loads on demand; idle-unloaded after 10 minutes of inactivity.
- all-MiniLM-L6-v2 (FP16) for sentence embeddings (384-dim). Used for offline RAG over the tenant's local cached policies + FAQ.
- Anomaly classifier — small custom-trained ONNX (LightGBM-converted) for booking + payment + lock anomaly heuristics.
- Image quality scorer — MobileNet-V3 small for property photo upload quality flags.
Models are signed; the installer ships a manifest with SHA-256 + signature, and the edge module refuses to load into ONNX Runtime any model whose signature does not verify.
3.4 Routing decision
    // Pseudocode inside ai-orchestrator-service
    function pickProvider(req: AIRequest): AIProviderPort {
      // Edge first: local requests use the on-device model when one exists.
      if (req.context.local && capabilityHasEdgeModel(req.capability)) {
        return providers.onnxEdge;
      }
      // Vertex AI is primary for region-pinned traffic, provided it is healthy.
      if (
        req.context.regionPin === 'me-central1' &&
        req.capability !== 'long-context-policy' &&
        providers.vertex.isHealthy()
      ) {
        return providers.vertex;
      }
      // Otherwise walk the fallback chain configured per capability.
      for (const name of capabilityFallbackChain(req.capability)) {
        if (providers[name].isHealthy()) return providers[name];
      }
      throw new AIError('NO_HEALTHY_PROVIDER');
    }
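A self-contained variant of the routing decision, with stub providers, may make the edge-first and fallback-chain behavior easier to test. The types RoutableProvider, RouteRequest, and CapabilityRoute, and the helper stubProvider, are hypothetical; region pinning is left out to keep the sketch short.

```typescript
type ProviderName = 'vertex' | 'anthropic' | 'openai' | 'onnx-edge';

interface RoutableProvider {
  name: ProviderName;
  isHealthy(): boolean;
}

interface RouteRequest {
  capability: string;
  local: boolean; // true when the call originates on the desktop edge
}

// Hypothetical per-capability configuration; the real chains live in the model catalog.
interface CapabilityRoute {
  hasEdgeModel: boolean;
  chain: ProviderName[]; // ordered: primary first, then fallbacks
}

function pickProviderSketch(
  req: RouteRequest,
  providers: Record<ProviderName, RoutableProvider>,
  routes: Record<string, CapabilityRoute>,
): RoutableProvider {
  const route = routes[req.capability];
  if (!route) throw new Error(`UNKNOWN_CAPABILITY: ${req.capability}`);
  // Edge first: use the on-device model when the request is local and one exists.
  if (req.local && route.hasEdgeModel) return providers['onnx-edge'];
  // Otherwise return the first healthy provider in the configured order.
  for (const name of route.chain) {
    if (providers[name].isHealthy()) return providers[name];
  }
  throw new Error('NO_HEALTHY_PROVIDER');
}

// Test helper: a provider with a fixed health status.
function stubProvider(name: ProviderName, healthy: boolean): RoutableProvider {
  return { name, isHealthy: () => healthy };
}
```

With vertex marked unhealthy and a chain of vertex, anthropic, openai, the function returns the anthropic stub, which is the behavior the fallback chain is meant to provide during a Vertex AI incident.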
4. Model Catalog
The catalog is the source of truth for which model serves which capability. New entries require an ADR-or-equivalent record.
| Model | Provider | Modality | Context | Cost class | Latency class | Primary use cases | Fallback chain |
|---|---|---|---|---|---|---|---|
| gemini-1.5-pro | Vertex AI | LLM, multimodal | 1M | High | Medium (~2-6 s) | Review synthesis, policy drafting, long-context analysis | claude-sonnet → gpt-4.1 |
| gemini-1.5-flash | Vertex AI | LLM, multimodal | 1M | Medium | Low (~0.5-2 s) | Pricing suggestions, anomaly explanations, upsell drafting, guest message draft | claude-sonnet → gpt-4o-mini |
| gemini-1.5-flash-8b | Vertex AI | LLM | 1M | Low | Very low (~0.2-0.8 s) | Translation drafts, room description generation, AI tutor, OCR post-processing | gemini-flash |
| text-embedding-004 | Vertex AI | Embedding | 2k | Very low | Very low | Room descriptions, review summaries, FAQ, RAG | local MiniLM (degraded) |
| text-multilingual-embedding-002 | Vertex AI | Embedding | 2k | Very low | Very low | Multilingual content embeddings | text-embedding-004 |
| claude-3-5-sonnet | Anthropic (via Vertex partner) | LLM | 200k | High | Medium | Long-context reasoning fallback; complex policy drafting | gemini-pro |
| gpt-4o-mini | OpenAI | LLM | 128k | Medium | Low | Tertiary fallback for short prompts | none |
| phi-3-mini-4k-instruct (INT4) | ONNX Edge | LLM | 4k | Free (CPU) | Medium-on-device (~1-3 s) | Offline drafting, AI tutor, simple Q&A | gemini-flash on next sync |
| all-MiniLM-L6-v2 (FP16) | ONNX Edge | Embedding | 256 | Free (CPU) | Very low | Offline RAG over local policies / FAQ | text-embedding-004 on next sync |
| melmastoon-edge-anomaly-v3 | ONNX Edge | Classifier | n/a | Free (CPU) | Very low | Booking / payment / lock anomaly heuristics | gemini-flash on next sync |
| mobilenet-v3-small-image-quality | ONNX Edge | Vision classifier | n/a | Free (CPU) | Very low | Photo upload quality flag | Vertex Vision |
Cost classes (per-1k-tokens reference, indicative): Very low < $0.0005, Low < $0.005, Medium < $0.05, High ≥ $0.05.
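The thresholds above map directly to a small classifier. The sketch below is illustrative only; the function name costClassFor is hypothetical, and the inputs used to exercise it are made-up prices, not quotes for any real model.

```typescript
type CostClass = 'Very low' | 'Low' | 'Medium' | 'High';

// Maps an indicative USD price per 1k tokens to the catalog's cost class.
function costClassFor(usdPer1kTokens: number): CostClass {
  if (usdPer1kTokens < 0.0005) return 'Very low';
  if (usdPer1kTokens < 0.005) return 'Low';
  if (usdPer1kTokens < 0.05) return 'Medium';
  return 'High'; // >= $0.05
}
```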
5. Prompt Registry
Every prompt template has an ID, a version, an owner, an eval suite, and a deprecation policy. Prompts are first-class versioned artifacts.
5.1 ID format
PRMP_<DOMAIN>_<NUMBER>_v<n>
- <DOMAIN>: uppercase, snake_case domain code (PRICING, HK, ANOMALY, UPSELL, MSG, REVIEW, BOOKING, TUTOR, DESC, TRANSLATE, OCR, STT).
- <NUMBER>: zero-padded ordinal within the domain.
- <n>: integer prompt version (semver-major).
Examples: PRMP_PRICING_001_v3, PRMP_MSG_004_v1, PRMP_ANOMALY_002_v5.
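A parser for this ID format might look like the sketch below. The regular expression is an assumption derived from the format and examples above: it allows underscores inside the domain and does not enforce a fixed ordinal width, since neither is pinned down here. The names PROMPT_ID_PATTERN and parsePromptId are hypothetical.

```typescript
interface ParsedPromptId {
  domain: string;
  ordinal: number;
  version: number;
}

// PRMP_<DOMAIN>_<NUMBER>_v<n>; the domain is uppercase letters, optionally joined by underscores.
const PROMPT_ID_PATTERN = /^PRMP_([A-Z]+(?:_[A-Z]+)*)_(\d+)_v(\d+)$/;

// Returns the parsed parts, or null when the ID does not match the format.
function parsePromptId(id: string): ParsedPromptId | null {
  const match = PROMPT_ID_PATTERN.exec(id);
  if (!match) return null;
  return {
    domain: match[1],
    ordinal: Number(match[2]),
    version: Number(match[3]),
  };
}
```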
5.2 Storage
- Postgres table prompt_templates inside ai-orchestrator-service:
    CREATE TABLE prompt_templates (
      id              text PRIMARY KEY,        -- 'PRMP_PRICING_001_v3'
      domain          text NOT NULL,
      ordinal         int  NOT NULL,
      version         int  NOT NULL,
      status          text NOT NULL CHECK (status IN ('draft','active','deprecated','retired')),
      owner_user_id   uuid NOT NULL,
      capability      text NOT NULL,
      system_prompt   text NOT NULL,
      user_template   text NOT NULL,
      output_schema   jsonb NOT NULL,          -- JSON Schema
      default_model   text NOT NULL,
      eval_suite_id   text NOT NULL,
      notes           text,
      created_at      timestamptz NOT NULL DEFAULT now(),
      retired_at      timestamptz,
      UNIQUE (domain, ordinal, version)
    );
- Replicated to the Electron desktop's prompt_templates SQLite table on session start so offline edge inference uses the same templates.
5.3 Versioning + deprecation
- New prompt versions ship as new rows (never overwrite). The active row for a given (domain, ordinal) is what new traffic hits.
- A previously active row flips to deprecated for at least 14 days before retiring (gives consumers time to drain).
- retired rows remain in the table for audit but cannot be served.
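The lifecycle rules above can be expressed as a small transition check. This is a sketch under assumptions: the status names come from the table's CHECK constraint, but the strictly forward transition map and the deprecatedAt timestamp are hypothetical (the table as shown has no column recording when a row was deprecated).

```typescript
type PromptStatus = 'draft' | 'active' | 'deprecated' | 'retired';

const MIN_DEPRECATION_DAYS = 14;
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Assumed forward-only lifecycle: draft -> active -> deprecated -> retired.
const ALLOWED_TRANSITIONS: Record<PromptStatus, PromptStatus[]> = {
  draft: ['active'],
  active: ['deprecated'],
  deprecated: ['retired'],
  retired: [],
};

// Returns true when the status change is permitted at time `now`.
function canTransition(
  from: PromptStatus,
  to: PromptStatus,
  deprecatedAt?: Date, // hypothetical: when the row entered 'deprecated'
  now: Date = new Date(),
): boolean {
  if (ALLOWED_TRANSITIONS[from].indexOf(to) === -1) return false;
  if (to === 'retired') {
    if (!deprecatedAt) return false;
    // Retirement requires at least 14 full days in 'deprecated'.
    return now.getTime() - deprecatedAt.getTime() >= MIN_DEPRECATION_DAYS * MS_PER_DAY;
  }
  return true;
}
```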
5.4 Eval suites
Every prompt template references an eval_suite_id that points to a curated set of inputs + expected outcomes (or scoring criteria). New versions must beat or match the active version on the eval suite before promotion.
6. Provenance Metadata (Mandatory)
No AI artifact is persisted, displayed, or used to drive a decision without AIProvenance. The schema is defined in 02 Enterprise Architecture §9.3 and reproduced here for completeness:
    interface AIProvenance {
      promptId: string;          // 'PRMP_PRICING_001_v3'
      promptVersion: SemVer;
      model: string;             // 'gemini-1.5-flash' or 'phi-3-mini-4k-instruct'
      modelVersion?: string;
      traceId: string;           // W3C traceparent
      occurredAt: ISODate;
      tokensIn: number;
      tokensOut: number;
      costUsd: number;           // computed at the gateway
      local: boolean;            // true iff edge inference
      cacheHit: boolean;
      safety: { input: SafetyVerdict; output: SafetyVerdict };
      reviewedBy?: UserId;       // populated on HITL acceptance
      reviewedAt?: ISODate;
      decision?: 'accepted' | 'rejected' | 'modified'; // HITL outcome
    }
- The UI surfaces an "AI" badge on any artifact carrying provenance; click reveals the metadata.
- audit-service retains provenance for 7 years.
7. Use Case Catalog
For each use case below: trigger, prompt template, model, latency target, HITL gate, fallback strategy, eval method.
7.1 Dynamic pricing suggestion (per room-type per day)
| Field | Value |
|---|---|
| Trigger | Daily 02:00 local; on-demand from pricing-service UI; on inventory.allocated.v1 for tomorrow when occupancy crosses 70% threshold |
| Prompt | PRMP_PRICING_001_v3 — system: pricing analyst persona; user: occupancy + 30-day baseline + seasonality + competitor anchor |
| Model | gemini-1.5-flash |
| Latency target | p95 < 1.5 s |
| HITL gate | Yes when suggestion deviates >5% from BAR baseline; otherwise auto-applied within ±5% band |
| Fallback | Deterministic baseline (BAR + day-of-week multiplier) |
| Eval | Backtest against last 12 months: did the suggestion improve revenue vs deterministic? Precision on "should-raise" / "should-lower" labels |
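The HITL gate and the deterministic fallback in this table reduce to two small functions. The sketch below reads "BAR + day-of-week multiplier" as BAR multiplied by a per-weekday factor, which is an interpretation rather than a confirmed formula. The function names and the treatment of a non-positive baseline are also assumptions.

```typescript
const AUTO_APPLY_BAND = 0.05; // suggestions within +/-5% of baseline are auto-applied

// Deterministic fallback: BAR scaled by a day-of-week multiplier (assumed multiplicative).
function deterministicBaseline(barRate: number, dayOfWeekMultiplier: number): number {
  return barRate * dayOfWeekMultiplier;
}

// True when a human must accept the suggestion before it is published.
function requiresHitl(suggestedRate: number, baselineRate: number): boolean {
  if (baselineRate <= 0) return true; // no meaningful baseline: always ask a human
  const deviation = Math.abs(suggestedRate - baselineRate) / baselineRate;
  return deviation > AUTO_APPLY_BAND;
}
```

For a baseline of 100, a suggestion of 104 would be auto-applied and a suggestion of 106 would wait for acceptance.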
7.2 Demand forecast (30 / 60 / 90 days)
| Field | Value |
|---|---|
| Trigger | Nightly 03:00 local |
| Prompt | PRMP_PRICING_002_v2 — explanation only; the numeric forecast is from a Vertex AI Forecast or local quantile model, the LLM annotates |
| Model | Forecast: tabular model (Vertex AI Forecast or local LightGBM); annotation: gemini-1.5-flash-8b |
| Latency target | Batch; SLA 1 hour |
| HITL gate | No (informational) |
| Fallback | Last-year same-period naive forecast |
| Eval | MAPE per horizon; quantile coverage |
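The two eval metrics named here can be computed as in the following sketch. Skipping zero actuals in MAPE is a common convention and an assumption on our part, as are the function names mape and quantileCoverage.

```typescript
// Mean absolute percentage error, in percent. Points with a zero actual are skipped
// because the percentage error is undefined there.
function mape(actual: number[], forecast: number[]): number {
  if (actual.length !== forecast.length) throw new Error('length mismatch');
  let sum = 0;
  let counted = 0;
  for (let i = 0; i < actual.length; i++) {
    if (actual[i] === 0) continue;
    sum += Math.abs((actual[i] - forecast[i]) / actual[i]);
    counted++;
  }
  if (counted === 0) throw new Error('no non-zero actuals');
  return (sum / counted) * 100;
}

// Fraction of actuals that fall inside the forecast interval [lower, upper].
function quantileCoverage(actual: number[], lower: number[], upper: number[]): number {
  if (actual.length === 0) throw new Error('empty input');
  if (actual.length !== lower.length || actual.length !== upper.length) {
    throw new Error('length mismatch');
  }
  let inside = 0;
  for (let i = 0; i < actual.length; i++) {
    if (actual[i] >= lower[i] && actual[i] <= upper[i]) inside++;
  }
  return inside / actual.length;
}
```

A well-calibrated 10th-to-90th percentile interval should give a coverage near 0.8; a MAPE at the 30-day horizon would be compared against the target in section 10.2.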
7.3 Housekeeping schedule optimization
| Field | Value |
|---|---|
| Trigger | At shift start; on housekeeping.task.assigned.v1 batch flush |
| Prompt | None; this runs on a small TSP-like solver (melmastoon-edge-hkt-v2.onnx) on the Electron desktop |
| Model | Edge ONNX (or fallback Vertex AI flash for explanation only) |
| Latency target | < 500 ms on device |
| HITL gate | Yes — the lead must accept the proposed order; can reorder before dispatch |
| Fallback | Greedy nearest-floor heuristic |
| Eval | Time-to-complete vs baseline; staff acceptance rate |
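One possible reading of the "greedy nearest-floor heuristic" fallback is sketched below: starting from the housekeeper's current floor, always pick the remaining task on the closest floor. The RoomTask shape and the tie-breaking rule (earliest in input order wins) are assumptions; the actual fallback may weigh other factors such as checkout priority.

```typescript
interface RoomTask {
  roomId: string;
  floor: number;
}

// Orders tasks by repeatedly choosing the task whose floor is nearest the current floor.
function greedyNearestFloor(tasks: RoomTask[], startFloor: number): RoomTask[] {
  const remaining = tasks.slice(); // do not mutate the caller's array
  const ordered: RoomTask[] = [];
  let current = startFloor;
  while (remaining.length > 0) {
    let best = 0;
    for (let i = 1; i < remaining.length; i++) {
      const candidate = Math.abs(remaining[i].floor - current);
      const incumbent = Math.abs(remaining[best].floor - current);
      if (candidate < incumbent) best = i; // strict '<' keeps input order on ties
    }
    const next = remaining.splice(best, 1)[0];
    ordered.push(next);
    current = next.floor;
  }
  return ordered;
}
```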
7.4 Anomaly detection
| Field | Value |
|---|---|
| Trigger | On every reservation.confirmed.v1, payment.captured.v1, lock.key.issued.v1, reservation.checkout.v1; daily aggregation for occupancy spikes |
| Prompt | PRMP_ANOMALY_001_v4 for explanation; classification is the edge anomaly classifier |
| Model | Edge: melmastoon-edge-anomaly-v3; cloud explanation: gemini-1.5-flash |
| Latency target | Edge: < 200 ms; explanation: < 2 s |
| HITL gate | Yes for any auto-block; alerts only otherwise |
| Fallback | Rule-based heuristics (rapid-fire booking from same IP, payment failure pattern, key not returned > 24 h, etc.) |
| Eval | Precision / recall on labeled incidents; false-positive rate per tenant |
7.5 Upsell recommendation
| Field | Value |
|---|---|
| Trigger | At booking confirmation; pre-arrival 48 h before check-in |
| Prompt | PRMP_UPSELL_001_v2 — system: hospitality concierge persona; user: reservation details + property amenity catalog |
| Model | gemini-1.5-flash |
| Latency target | p95 < 1 s |
| HITL gate | No (suggested to guest, guest decides) |
| Fallback | Static rule set (breakfast for stays >2 nights, late checkout for premium room types) |
| Eval | Conversion rate per recommendation type |
7.6 Smart guest message draft (multilingual, tone-controlled)
| Field | Value |
|---|---|
| Trigger | Front desk requests a draft; pre-arrival template fills; post-stay thank-you |
| Prompt | PRMP_MSG_001_v3 — variables: tone (formal/warm), locale, message intent, reservation context |
| Model | gemini-1.5-flash (online) or phi-3-mini (offline) |
| Latency target | p95 < 1.5 s online; < 4 s offline |
| HITL gate | Yes — always for guest-facing messages; the staff edits + sends |
| Fallback | Static templates per intent + locale |
| Eval | Staff acceptance rate; edit-distance from draft to send |
7.7 Review summarization (multilingual)
| Field | Value |
|---|---|
| Trigger | Weekly summary across last 30 / 90 days; on-demand for GM dashboard |
| Prompt | PRMP_REVIEW_001_v2 — produces { themes:[…], sentiment, actionable:[…], topQuotes:[…] } |
| Model | gemini-1.5-pro (long context); fallback claude-3-5-sonnet |
| Latency target | p95 < 8 s for ≤ 200 reviews |
| HITL gate | No (informational) |
| Fallback | Rule-based theme extraction |
| Eval | Theme F1 against a hand-labeled set; sentiment accuracy |
7.8 Booking conversion assist (consumer chat hint on the meta layer)
| Field | Value |
|---|---|
| Trigger | User idles >15 s on a results page; explicit "I need help" |
| Prompt | PRMP_BOOKING_001_v1 — short prompt; outputs a single suggestion or a clarifying question |
| Model | gemini-1.5-flash |
| Latency target | p95 < 800 ms |
| HITL gate | No |
| Fallback | Static FAQ links |
| Eval | Hint → click-through → booking-completion uplift |
7.9 AI tutor for backoffice
| Field | Value |
|---|---|
| Trigger | Staff opens the help drawer; types a question |
| Prompt | PRMP_TUTOR_001_v2 — system: helpful product expert; tools: linkToScreen(screenId), runWalkthrough(walkthroughId) |
| Model | gemini-1.5-flash-8b (online) or phi-3-mini (offline) |
| Latency target | p95 < 1.5 s |
| HITL gate | No (informational; tutor never executes destructive actions) |
| Fallback | Local FAQ vector search (MiniLM + cosine) |
| Eval | Resolution rate; thumbs-up rate; deflection from support tickets |
7.10 Description generation (room types, property)
| Field | Value |
|---|---|
| Trigger | Tenant clicks "Generate description" on a room-type / property |
| Prompt | PRMP_DESC_001_v3 — variables: room features, brand voice, target audience, locale |
| Model | gemini-1.5-flash-8b |
| Latency target | p95 < 2 s |
| HITL gate | Yes — tenant edits + accepts before publishing |
| Fallback | Template fill |
| Eval | Acceptance rate; edit distance |
7.11 Translation drafts for tenant content
| Field | Value |
|---|---|
| Trigger | Tenant adds content in source locale; target locales auto-draft |
| Prompt | PRMP_TRANSLATE_001_v2 — preserves brand voice tokens; flags untranslatable terms |
| Model | gemini-1.5-flash-8b |
| Latency target | p95 < 3 s per chunk |
| HITL gate | Yes — tenant reviews per locale |
| Fallback | Cloud Translation API as a baseline |
| Eval | Native-speaker review acceptance rate per locale |
7.12 OCR for ID scan at check-in (with HITL)
| Field | Value |
|---|---|
| Trigger | Front desk scans guest ID at check-in |
| Prompt | OCR via Vertex AI Document AI; LLM post-process (PRMP_OCR_001_v1) to extract structured fields |
| Model | Document AI + gemini-1.5-flash-8b |
| Latency target | p95 < 4 s end-to-end |
| HITL gate | Yes — always — staff verifies extracted fields before save |
| Fallback | Manual entry (ID image still attached) |
| Eval | Field-level precision; staff edit rate per field |
7.13 Voice transcription for staff hands-free updates
| Field | Value |
|---|---|
| Trigger | Housekeeper holds the "voice" button on the desktop or mobile to update task state |
| Prompt | STT via Vertex AI Speech (or Whisper-large-v3 via ONNX edge if offline); intent extraction PRMP_STT_001_v1 |
| Model | Vertex AI Speech-to-Text + gemini-1.5-flash-8b for intent |
| Latency target | p95 < 2 s |
| HITL gate | No (the action is always reversible — flip a status; it's logged and undoable) |
| Fallback | Manual taps |
| Eval | WER per locale; intent classification accuracy |
8. HITL Gates (consolidated list)
The following actions must be gated by human acceptance before they take effect. Each is recorded as a Decision (dec_…) and linked to the resulting state-change event.
| Action | Why HITL is required |
|---|---|
| Pricing publish where deviation > 5% from baseline | Material revenue impact |
| Reservation auto-cancel triggered by anomaly | Irreversible to the guest |
| Refund initiated by AI | Money leaving the tenant |
| Bulk lock-credential revoke triggered by anomaly | Operational + guest-experience impact |
| Guest-facing AI-drafted message dispatch | Brand + relationship risk |
| Tenant content publish (description, translation) | Public-facing brand artifact |
| OCR-extracted ID fields written to guest profile | Data integrity + privacy |
| Housekeeping schedule dispatch | Staff scheduling depends on it |
| Auto-block of a flagged booking beyond temporary hold | Revenue + customer impact |
UI affordances:
- A draft_ai badge on the artifact, with an "Accept", "Modify", "Reject" trio.
- Required justification on "Reject" (free-text; logged).
- The accepted state-change event carries decisionId; downstream services can correlate.
9. Offline / Edge AI
9.1 What runs on Electron via ONNX Runtime Node
- Anomaly classification (booking, payment, lock-key): melmastoon-edge-anomaly-v3.onnx.
- Embedding generation for offline RAG over the tenant's cached policies + FAQ: all-MiniLM-L6-v2.
- Draft message suggestions when offline: phi-3-mini-4k-instruct INT4.
- Simple forecasting for next 7 days: small LightGBM-converted ONNX (melmastoon-edge-forecast-v2.onnx).
- Image quality scoring for photo upload: mobilenet-v3-small-image-quality.onnx.
- Housekeeping route optimizer: melmastoon-edge-hkt-v2.onnx.
All edge inference happens in the main process (Node 20). The renderer requests inference via window.melmastoon.ai.infer(capability, input) exposed by contextBridge.
9.2 Packaging + verification
- Models are packaged with the installer (no first-launch download — the user may be onboarding offline).
- Each model is shipped with its SHA-256 + a manifest signature signed by the Melmastoon release key.
- On first run (and on every load) the edge module verifies the signature against the public key embedded in the binary before handing the model to ONNX Runtime; tampering invalidates the signature and the model is not loaded.
- Model updates ship via electron-updater with the rest of the app; partial model updates are atomic (download to temp → verify → swap).
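The digest half of the verification step could be sketched as follows, using Node's built-in crypto module. Verifying the manifest's own signature against the release key is deliberately left out here, and the ManifestEntry shape is hypothetical rather than the actual manifest format.

```typescript
import { createHash } from 'node:crypto';

// Hypothetical manifest entry: one per model file shipped with the installer.
interface ManifestEntry {
  file: string;
  sha256: string; // hex-encoded digest recorded at release time
}

// Hex-encoded SHA-256 of the model bytes.
function sha256Hex(bytes: Uint8Array): string {
  return createHash('sha256').update(bytes).digest('hex');
}

// True when the bytes on disk match the digest recorded in the manifest.
// The manifest itself must already have passed signature verification.
function digestMatches(bytes: Uint8Array, entry: ManifestEntry): boolean {
  return sha256Hex(bytes) === entry.sha256.toLowerCase();
}
```

In the atomic update flow (download to temp → verify → swap), a check like digestMatches would run on the temp file, and the swap would happen only when it returns true.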
9.3 Audit trail for edge inference
Edge inference still emits ai.inference.local.completed.v1 to the local outbox. On next sync the event is replayed for audit; the cloud ai-orchestrator-service accepts these events and persists provenance with local: true.
9.4 Hard rules
- Edge inference never sees PCI data (no cards), never sees lock vendor secrets, never runs guest-facing message dispatch without HITL.
- Model files live under app.getPath('userData')/models/ with restrictive ACLs.
- Idle-unload: large models (Phi-3-mini) unload after 10 minutes of inactivity to free RAM.
10. Eval & Guardrails
10.1 Golden sets
- Every use case has a curated golden set (eval_suite_id referenced from the prompt template).
- Stored in ai-orchestrator-service Postgres + version-controlled in a companion repo (melmastoon-ai-evals) for reproducibility.
- Golden sets include both positive and adversarial examples (prompt-injection attempts, edge-case inputs).
10.2 Precision / recall targets
| Use case | Metric | Target |
|---|---|---|
| Pricing suggestion | Direction-accuracy on labeled "should-raise / hold / lower" | ≥ 75% |
| Demand forecast | MAPE @ 30-day horizon | ≤ 18% |
| Anomaly detection | Precision @ recall 0.9 | ≥ 0.7 |
| Upsell | Conversion uplift vs static rules | ≥ 1.3× |
| OCR | Field-level precision | ≥ 0.95 |
| Translation | Native-speaker acceptance | ≥ 0.85 |
| Tutor | Resolution rate | ≥ 0.7 |
10.3 A/B routing for prompt changes
- New prompt versions ship as draft → routed to 5% of traffic → eval suite + production metrics monitored for 7 days → promoted to active on green.
- Per-tenant opt-out for prompt experimentation (Plus + Enterprise plans).
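One way to route a stable 5% of traffic to a draft prompt is deterministic hash bucketing, sketched below. Bucketing by tenant and prompt family (so a tenant sees a consistent variant) is an assumption; the gateway may instead bucket per request. The FNV-1a hash and the names fnv1a32 and routesToDraft are illustrative choices.

```typescript
// FNV-1a 32-bit hash over UTF-16 code units; used only for stable bucketing, not security.
function fnv1a32(text: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force unsigned 32-bit
}

// True when this tenant's traffic for the prompt family should hit the draft version.
function routesToDraft(
  tenantId: string,
  promptFamily: string, // e.g. 'PRMP_PRICING_001' (the ID without its version suffix)
  draftPercent = 5,
  optedOut = false, // per-tenant opt-out from prompt experimentation
): boolean {
  if (optedOut) return false;
  const bucket = fnv1a32(`${tenantId}:${promptFamily}`) % 100;
  return bucket < draftPercent;
}
```

Because the bucket depends only on the tenant and prompt family, the same tenant stays in the same arm for the whole 7-day observation window.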
10.4 Cost guardrails
- Per-tenant monthly token budget with soft cap (warn at 80%) and hard cap (degrade to deterministic fallback at 100%).
- Per-feature quotas inside a tenant — pricing + anomaly + upsell each have independent caps.
- Real-time cost dashboard surfaced to tenant owner + gm roles.
- Alerting: email + in-app at 80%; in-app + on-call SRE at 100%.
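The soft and hard caps can be modeled as a three-state check applied to both the tenant budget and the per-feature quota. This is a minimal sketch; the type and function names are hypothetical, and treating a non-positive budget as exhausted is an assumption.

```typescript
type BudgetState = 'ok' | 'soft_cap_warn' | 'hard_cap_degrade';

interface Usage {
  usedTokens: number;
  budgetTokens: number;
}

// Warn at 80% of the budget; degrade to the deterministic fallback at 100%.
function budgetState(usage: Usage): BudgetState {
  if (usage.budgetTokens <= 0) return 'hard_cap_degrade';
  const ratio = usage.usedTokens / usage.budgetTokens;
  if (ratio >= 1) return 'hard_cap_degrade';
  if (ratio >= 0.8) return 'soft_cap_warn';
  return 'ok';
}

// A call falls back to the deterministic path if either the tenant-wide budget
// or the feature's own quota has hit its hard cap.
function mustUseFallback(tenant: Usage, feature: Usage): boolean {
  return (
    budgetState(tenant) === 'hard_cap_degrade' ||
    budgetState(feature) === 'hard_cap_degrade'
  );
}
```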
11. Cost & Budget
| Lever | Mechanism |
|---|---|
| Tiered model routing | Default to the cheapest model that meets latency + quality; escalate only when needed |
| Edge first | Capabilities with edge models run on the desktop; cloud is the fallback |
| Cache | Per-tenant prompt+input hash cache; cache hit returns instantly with cacheHit: true provenance; TTL per capability |
| Per-tenant budgets | Soft + hard caps; degrade to deterministic fallback on hard cap; per-feature sub-budgets |
| Batch where possible | Embeddings batched; nightly forecast batched per property |
| Prompt economy | Templates iteratively shortened; outputs schema-constrained to minimize output tokens |
| Model right-sizing | gemini-1.5-flash-8b is the default for short-form generation; only escalate to flash or pro when justified |
The cost dashboard (Looker Studio + BigQuery melmastoon_analytics_prod.ai_calls_fact) breaks down spend per (tenant_id, capability, model) and surfaces top consumers per period.
12. Vector Storage
All embeddings live in pgvector inside ai-orchestrator-service's Postgres schema. Per-tenant namespacing via tenant_id column + RLS (see 06 Data Models §7).
12.1 Per-tenant namespace
- Every k-NN query carries WHERE tenant_id = $1 and runs under a session with app.tenant_id set; RLS is the second line of defense.
- Cross-tenant query embeddings (embeddings_search_queries) live in a separate table with tenant_id nullable for anonymous queries; no PII.
12.2 HNSW indexes
- m=16, ef_construction=64 as defaults.
- Per-call SET LOCAL hnsw.ef_search = 40 for the typical recall-vs-latency target; tuned per index after observing recall in production.
- Re-indexing triggered when corpus growth exceeds 25% since last build.
12.3 RAG over tenant-private content
The gateway's rag() method runs a per-tenant retrieval over the tenant's:
- Policies (cancellation, house rules, child + pet policy, etc.)
- FAQ
- Staff playbook (uploaded SOPs, training docs)
- Property amenity catalog
The retrieved chunks are injected into the prompt as context. The model is instructed to ground answers in the provided context and to refuse if the answer is not present.
Edge RAG runs the same pattern with the local SQLite-cached subset and all-MiniLM-L6-v2 embeddings.
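On the edge, where the cached corpus is small, retrieval can be a brute-force cosine scan instead of an index. The sketch below assumes embeddings are already computed (for example 384-dim MiniLM vectors) and held in memory; the Chunk shape and function names are hypothetical.

```typescript
interface Chunk {
  id: string;
  text: string;
  embedding: number[];
}

// Cosine similarity of two equal-length vectors; returns 0 if either vector is all zeros.
function cosine(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('dimension mismatch');
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Returns the k chunks most similar to the query embedding, best first.
function topK(query: number[], chunks: Chunk[], k: number): Chunk[] {
  return chunks
    .map((chunk) => ({ chunk, score: cosine(query, chunk.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((scored) => scored.chunk);
}
```

The returned chunks would then be injected into the prompt as context, exactly as in the cloud path.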
13. Safety
13.1 Prompt injection defense
- System prompt isolation — assembled centrally; never composed from user input.
- Input length limits — per capability (4 KB guest-facing, 16 KB admin-side).
- Output schema validation — every response validated against a JSON schema; non-conforming outputs are rejected.
- Tool-call allowlist — for capabilities with tool use, only the declared tools are callable; tool execution is server-side.
- Adversarial eval examples — every golden set includes known prompt-injection patterns; new attacks added on detection.
13.2 PII redaction in logs
- Pre-call redaction strips emails, phones, government IDs, credit-card-shaped strings, IBANs from anything bound for the model and from anything written to logs.
- Logs use pino with declared redactors per service; CI verifies no new field is added without a redactor entry.
- AI traces capture token counts and model decisions, never raw user content (unless redacted).
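A pattern-based redactor for two of the categories above (emails and credit-card-shaped strings) might look like the sketch below. The patterns are deliberately simple illustrations, not the gateway's actual rules; phones, government IDs, and IBANs would need their own locale-aware patterns, and the placeholder tokens are hypothetical.

```typescript
// Simple email shape: local part, '@', domain with at least one dot.
const EMAIL_PATTERN = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;

// Card-shaped: 13 to 19 digits, optionally separated by single spaces or hyphens.
const CARD_SHAPED_PATTERN = /\b\d(?:[ -]?\d){12,18}\b/g;

// Replaces matches with fixed placeholders before text reaches a model or a log line.
function redact(text: string): string {
  return text
    .replace(EMAIL_PATTERN, '[EMAIL]')
    .replace(CARD_SHAPED_PATTERN, '[CARD]');
}
```

Short numbers such as room numbers are left alone because they fall below the 13-digit minimum.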
13.3 Content moderation on guest-facing output
- Pre-moderation on input (block on harm_high).
- Post-moderation on output (block on harm_*, hate, sexual, dangerous, pii_exposed).
- Blocked outputs return a deterministic fallback and raise ai.moderation.blocked.v1 to the audit log.
14. Observability
| Signal | Source | Destination |
|---|---|---|
| Per-call traces | OpenTelemetry inside ai-orchestrator-service; spans annotated with model, promptId, cacheHit, tokens, costUsd, latencyMs | Cloud Trace; sampled at 10% (100% on errors + on > p95 latency) |
| Token counts + cost | Computed at the gateway from provider response | BigQuery melmastoon_analytics_prod.ai_calls_fact |
| Latency histograms | Per (capability, model, provider) | Cloud Monitoring; per-tenant breakdown for hot tenants |
| Cache hit rate | Per (capability, tenantId) | Cloud Monitoring |
| Model error rates | Per provider | Cloud Monitoring + alert on sustained > 1% |
| HITL acceptance rate | Per (capability, tenantId) | BigQuery + Looker Studio dashboard |
| Eval drift | Comparing active prompt eval vs golden set on schedule | Alert on regression |
| Provenance integrity | Sample audit job verifies every persisted AI artifact has provenance | Daily report |
A purpose-built AI eval logging dashboard (Looker Studio over BigQuery) gives the AI team a real-time view of:
- Calls per capability per tenant per model.
- Cost burn vs budget.
- HITL acceptance / rejection / modification rates.
- Drift signals on key prompts.
- Top failure modes (output-schema violations, moderation blocks, provider errors).
15. Roadmap
Phase 1 — Minimal AI (MVP)
- Heuristic + simple pricing suggestions (rule-based with optional LLM annotation).
- Cloud-only AI through ai-orchestrator-service.
- One model (gemini-1.5-flash) covering most capabilities.
- Provenance + HITL gates implemented from day 1.
- Per-tenant budget + soft cap.
- Edge inference: anomaly classifier + image-quality scorer only.
Phase 2 — Expansion
- Full model catalog (Gemini Pro + Flash + Flash-8B; Anthropic + OpenAI fallbacks).
- All Phase-1-listed use cases live.
- Edge: Phi-3-mini for offline drafting; MiniLM for offline RAG; HK route optimizer.
- Prompt registry with eval suites + A/B routing.
- Per-tenant cost dashboard.
- Translation drafts + multilingual review summarization.
Phase 3 — Personalization
- Per-tenant RAG over policies / FAQ / playbook (cloud + edge).
- Booking conversion assist on the consumer meta layer.
- AI tutor with deep-link tools.
- Voice STT for hands-free housekeeping updates.
- Bug bounty + adversarial eval expansion.
- Per-feature quotas + per-tenant residency-aware routing.
Phase 4 — Self-tuning
- Per-tenant LoRA fine-tunes (where data volume + tenant consent permit) for tenant-voice messaging + property-specific descriptions.
- Continuous eval pipelines that promote prompts automatically on green.
- Federated edge model updates via signed differential model packs.
- Per-tenant model preference (e.g., Anthropic-only for tenants with that contractual preference).
Cross-references: per-service AI integration details live in services/<service-name>/AI_INTEGRATION.md. The model catalog, prompt registry schema, and eval suites are owned by ai-orchestrator-service and version-controlled. Safety + provenance + HITL contracts are referenced from 07 Security & Tenancy §8 and from every use-case implementation.