AI Integration

:::info Source
Sourced from services/delivery-service/AI_INTEGRATION.md in the documentation repo.
:::

Companion: 03 ai-gateway-service · DOMAIN_MODEL · SECURITY_MODEL

1. AI Usage in Delivery

Delivery uses AI for a single core feature: the AI Tutor — a conversational assistant scoped to the learner's current lesson context. The tutor operates in two modes:

Mode	Inference Path	Use Case
Online	ai-gateway-service -> cloud provider (Anthropic, OpenAI)	Device has connectivity
Offline	Local on-device model (via local AI runtime)	Device is offline (mounted bundle scenario)

All AI calls route through the AIClient port — delivery-service never imports vendor SDKs directly.

2. AIClient Port

interface AIClient {
  streamTutorResponse(request: TutorRequest): AsyncIterable<TutorChunk>;
  checkLocalAvailability(): Promise<boolean>;
}

interface TutorRequest {
  tenantId: TenantId;
  userId: UserId;
  sessionId: PlaySessionId;
  turnId: string;
  lessonContext: {
    lessonId: string;
    lessonTitle: string;
    lessonContent: string;        // flattened lesson text
    blockIds: string[];
  };
  conversationHistory: ConversationTurn[];
  prompt: string;
  preferLocal: boolean;            // true when offline or device prefers local
  maxTokens?: number;
  temperature?: number;
}

interface TutorChunk {
  type: 'token' | 'tool_call' | 'done' | 'error';
  text?: string;
  index?: number;
  toolCall?: ToolCallTrace;
  done?: {
    totalTokens: number;
    aiProvenance: AIProvenance;
  };
  error?: { code: string; message: string };
}

interface ConversationTurn {
  role: 'user' | 'assistant';
  content: string;
  timestamp: ISODate;
}
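
A caller of the port typically has to forward chunks and buffer the full response at the same time. The sketch below shows one way to fold a chunk stream into a buffered turn. It is illustrative only: `BufferedTurn`, `emptyTurn`, `applyChunk`, and `bufferStream` are hypothetical names, and `AIProvenance` and `ToolCallTrace` are reduced to `unknown` placeholders because their definitions are not shown in this document.

```typescript
// Simplified copies of the port types. AIProvenance and ToolCallTrace are
// placeholders; their real definitions live elsewhere.
type AIProvenance = unknown;
type ToolCallTrace = unknown;

interface TutorChunk {
  type: 'token' | 'tool_call' | 'done' | 'error';
  text?: string;
  index?: number;
  toolCall?: ToolCallTrace;
  done?: { totalTokens: number; aiProvenance: AIProvenance };
  error?: { code: string; message: string };
}

// Hypothetical accumulator for one assistant turn.
interface BufferedTurn {
  text: string;
  totalTokens?: number;
  aiProvenance?: AIProvenance;
  error?: { code: string; message: string };
  finished: boolean;
}

const emptyTurn = (): BufferedTurn => ({ text: '', finished: false });

// Pure reducer: returns the next state after a single chunk.
function applyChunk(turn: BufferedTurn, chunk: TutorChunk): BufferedTurn {
  switch (chunk.type) {
    case 'token':
      return { ...turn, text: turn.text + (chunk.text ?? '') };
    case 'done':
      return {
        ...turn,
        totalTokens: chunk.done?.totalTokens,
        aiProvenance: chunk.done?.aiProvenance,
        finished: true,
      };
    case 'error':
      return { ...turn, error: chunk.error, finished: true };
    case 'tool_call':
      return turn; // tool calls add no response text in this sketch
  }
}

// Drains a stream, forwarding every chunk (for example to SSE) while buffering.
async function bufferStream(
  stream: AsyncIterable<TutorChunk>,
  forward: (chunk: TutorChunk) => void,
): Promise<BufferedTurn> {
  let turn = emptyTurn();
  for await (const chunk of stream) {
    forward(chunk);
    turn = applyChunk(turn, chunk);
    if (turn.finished) break;
  }
  return turn;
}
```

Keeping the reducer pure makes the buffering logic testable without a live stream; `bufferStream` is only a thin loop around it.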

3. Tutor Orchestration Flow

┌─────────────────────────────────────────────────────────────┐
│ 1. Client POSTs /api/v1/play-sessions/{id}/tutor/turn       │
│    Body: { prompt: "Explain polymorphism" }                  │
│                                                              │
│ 2. Delivery:                                                 │
│    a. Verify session ownership + active state                │
│    b. Load current cursor and lesson content from manifest   │
│    c. Fetch last 5 tutor turns for conversation context      │
│    d. Determine inference path:                              │
│       - isOffline session? -> preferLocal = true             │
│       - ai-gateway reachable? -> preferLocal = false         │
│    e. Build TutorRequest                                     │
│                                                              │
│ 3. Call AIClient.streamTutorResponse(request)                │
│    - Stream chunks back to client via SSE                    │
│    - Buffer full response for persistence                    │
│                                                              │
│ 4. On stream complete:                                       │
│    a. Create AssistantTurn record                            │
│    b. Persist with aiProvenance                              │
│    c. Emit delivery.play_session.tutor_turn.completed.v1     │
└─────────────────────────────────────────────────────────────┘
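
Steps 2c and 2d can be sketched as two small pure functions. This is a sketch under stated assumptions, not the service's actual code: `chooseInferencePath`, `recentHistory`, `PathInputs`, and the `'unavailable'` outcome are hypothetical, and the behavior for an online session whose gateway is unreachable is an assumption, since the flow above does not cover that case.

```typescript
interface ConversationTurn {
  role: 'user' | 'assistant';
  content: string;
  timestamp: string; // ISO-8601 string; stands in for ISODate
}

// Hypothetical result type: the flow names only the first two outcomes.
type InferencePath = 'local' | 'gateway' | 'unavailable';

interface PathInputs {
  isOfflineSession: boolean;
  gatewayReachable: boolean;
  localModelAvailable: boolean; // e.g. the result of checkLocalAvailability()
}

function chooseInferencePath(inputs: PathInputs): InferencePath {
  if (inputs.isOfflineSession) {
    return inputs.localModelAvailable ? 'local' : 'unavailable';
  }
  if (inputs.gatewayReachable) return 'gateway';
  // ASSUMPTION: an online session with an unreachable gateway falls back to
  // the local model when one is available. The source does not specify this.
  return inputs.localModelAvailable ? 'local' : 'unavailable';
}

const HISTORY_TURNS = 5; // "last 5 tutor turns" from step 2c

// Returns the most recent turns, oldest first.
function recentHistory(
  turns: ConversationTurn[],
  limit: number = HISTORY_TURNS,
): ConversationTurn[] {
  // slice(-0) would return the whole array, so guard non-positive limits.
  return limit > 0 ? turns.slice(-limit) : [];
}
```

`preferLocal` on the `TutorRequest` would then be `chooseInferencePath(...) === 'local'`.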

4. RAG Context Strategy

The tutor uses in-context RAG with lesson content (not vector search). Rationale:

- Lessons are short and self-contained (typically 500-5000 tokens).
- The tutor should be scoped to the current lesson to prevent curriculum drift.
- Vector search introduces latency and cost without meaningful quality gains at lesson scale.

For long-form courses or book chapters, the ai-gateway-service can be invoked with a retrieval hint to pull from course-level embeddings, but this is opt-in per course via a PlayPackage manifest flag.

5. System Prompt

The delivery-service injects a system prompt controlled by the ai-gateway's prompt registry. The active prompt (pinned per tenant) reads along these lines:

You are a focused educational tutor helping a learner understand the current lesson.

LESSON: {lessonTitle}
CONTENT:
{lessonContent}

Rules:
- Answer only questions related to this lesson.
- If the learner asks about unrelated topics, politely redirect to the lesson.
- Be concise: aim for 100-300 words.
- Use examples from the lesson when possible.
- Never invent facts not in the lesson.
- If you don't know, say so.

Delivery does not craft this prompt itself; it calls the gateway's getSystemPrompt('delivery.tutor', tenantId) to retrieve the pinned version.
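
This document does not say where the `{lessonTitle}` and `{lessonContent}` placeholders are filled in; the gateway may well do it server-side. Wherever it runs, the substitution could look like the sketch below. `renderPrompt` is a hypothetical helper, not part of the gateway API.

```typescript
// Replaces {name} placeholders with values from `vars`. Unknown placeholders
// are left untouched so a missing variable stays visible instead of being
// silently blanked.
function renderPrompt(template: string, vars: Record<string, string>): string {
  return template.replace(/\{(\w+)\}/g, (match: string, name: string) =>
    Object.prototype.hasOwnProperty.call(vars, name) ? (vars[name] ?? match) : match,
  );
}

const rendered = renderPrompt('LESSON: {lessonTitle}\nCONTENT:\n{lessonContent}', {
  lessonTitle: 'Polymorphism',
  lessonContent: 'A subclass can override a method of its parent class.',
});
```

The replacement is a single pass over the template, so braces that appear inside the lesson content itself are not expanded a second time.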

6. Offline AI Inference

When the session is offline:

Aspect	Behavior
Model	Small on-device model (e.g., Phi-4, Qwen 4B quantized) bundled with the app
Inference	Runs in a dedicated worker thread / native runtime
Context window	Reduced (typically 4K vs 32K+ online)
Response quality	Acceptable for Q&A on lesson content; reduced for complex reasoning
Cost	Zero (compute is on-device)
Provenance	`aiProvenance.local = true`, `aiProvenance.cost = undefined`

The local model is updated via app update cycles. PlayPackage bundles may include supplementary fine-tuning artifacts for domain-specific tutoring (planned for S5+).

7. AIProvenance

Every AssistantTurn records full provenance per platform convention:

{
  model: "claude-haiku-4.5",       // or "qwen-4b-local" for on-device inference
  version: "2026.03",
  promptId: "delivery.tutor",
  promptVersion: "1.4.2",
  traceId: "00-abc...-01",
  decisionId: undefined,           // no HITL for tutor
  local: false,
  generatedAt: "2026-04-15T10:15:00Z",
  cost: {
    microUSD: 450,                 // 0.00045 USD
    tokens: { in: 850, out: 320 }
  }
}
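
For consumers of these records, the shape above can be written down as a type with two small helpers. The interface is inferred from the example only, so the canonical platform type may differ; `costInUsd` and `totalTokens` are hypothetical helper names.

```typescript
// Shape inferred from the example record; the platform's canonical
// AIProvenance type may carry more fields or stricter types.
interface AIProvenance {
  model: string;
  version: string;
  promptId: string;
  promptVersion: string;
  traceId: string;
  decisionId?: string;
  local: boolean;
  generatedAt: string; // ISO-8601 timestamp
  cost?: { microUSD: number; tokens: { in: number; out: number } };
}

const MICRO_USD_PER_USD = 1_000_000;

// Converts the recorded cost to USD. On-device inference records no cost,
// which this sketch treats as zero.
function costInUsd(p: AIProvenance): number {
  return p.cost ? p.cost.microUSD / MICRO_USD_PER_USD : 0;
}

// Sum of input and output tokens, or undefined when no cost block exists.
function totalTokens(p: AIProvenance): number | undefined {
  return p.cost ? p.cost.tokens.in + p.cost.tokens.out : undefined;
}

const cloudExample: AIProvenance = {
  model: 'claude-haiku-4.5',
  version: '2026.03',
  promptId: 'delivery.tutor',
  promptVersion: '1.4.2',
  traceId: '00-abc...-01',
  local: false,
  generatedAt: '2026-04-15T10:15:00Z',
  cost: { microUSD: 450, tokens: { in: 850, out: 320 } },
};
```

Storing cost as integer micro-dollars avoids accumulating floating-point error when many small per-turn costs are summed; conversion to USD happens only at the display edge.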

8. Cost Controls

Control	Enforcement
Per-session turn limit	30 turns/hour (via Redis counter)
Per-user daily budget	Inherited from tenant; enforced by ai-gateway
Token limits	`maxTokens` capped at 2048 per turn
Model routing	ai-gateway selects cheapest model satisfying quality threshold
Local-first for offline	Zero cost when offline
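
The two controls that delivery enforces itself (the turn limit and the `maxTokens` cap) can be sketched as follows. This is an in-memory stand-in for the Redis counter, shown only to illustrate the logic: `TurnLimiter` and `clampMaxTokens` are hypothetical names, the fixed one-hour window is an assumption (the table does not say whether the window is fixed or sliding), and defaulting an omitted `maxTokens` to the cap is also an assumption.

```typescript
const MAX_TOKENS_CAP = 2048;
const TURNS_PER_HOUR = 30;
const WINDOW_MS = 60 * 60 * 1000;

// Clamps a requested maxTokens to the per-turn cap.
// ASSUMPTION: an omitted or non-finite value defaults to the cap.
function clampMaxTokens(requested?: number): number {
  if (requested === undefined || !Number.isFinite(requested)) return MAX_TOKENS_CAP;
  return Math.min(Math.max(1, Math.floor(requested)), MAX_TOKENS_CAP);
}

// In-memory stand-in for the Redis counter (conceptually an increment with
// an expiry set on the first increment of each window).
class TurnLimiter {
  private readonly windows = new Map<string, { startedAt: number; count: number }>();
  private readonly now: () => number;

  // The clock is injectable so the limiter can be tested deterministically.
  constructor(now: () => number = () => Date.now()) {
    this.now = now;
  }

  // Returns true and counts the turn if the session is under its limit.
  tryConsume(sessionId: string): boolean {
    const t = this.now();
    const w = this.windows.get(sessionId);
    if (!w || t - w.startedAt >= WINDOW_MS) {
      this.windows.set(sessionId, { startedAt: t, count: 1 });
      return true;
    }
    if (w.count >= TURNS_PER_HOUR) return false;
    w.count += 1;
    return true;
  }
}
```

A process-local map like this would not hold up across multiple service instances, which is presumably why the real counter lives in Redis.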

9. Safety Controls

Control	Where
Prompt injection classifier	ai-gateway (pre-call)
PII redaction in prompts	ai-gateway (pre-call)
Output toxicity filter	ai-gateway (post-stream)
Curriculum-drift detector	ai-gateway (post-stream); flags off-topic responses
Rate limiting	delivery + ai-gateway

When unsafe content is detected, the client receives an error chunk with a user-friendly fallback message, and the turn is recorded with rating: 'unhelpful' set automatically so that it surfaces for review.

10. HITL (Human-in-the-Loop)

Tutor responses are not subject to HITL in production (latency requirement). However:

- An instructor review surface is provided in the admin portal via GET /api/v1/admin/tutor-turns?flagged=true.
- Instructors can flag turns for retraining the safety classifier.
- Turns with rating: 'unhelpful' feed into the quality-improvement dataset.

11. Fallback Behavior

If AIClient is unavailable:

- The SSE stream sends an error event with code ai.unavailable and a friendly fallback message ("The AI tutor is temporarily unavailable. Please try again in a moment, or review the lesson content.").
- An AssistantTurn is recorded with no response text and with error metadata.
- No charge is incurred.
- The client-side UI shows a fallback state and offers retry.
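
On the wire, the fallback amounts to a single Server-Sent Events frame. The sketch below serializes one, assuming the event is named `error` and the payload is a JSON object with `code` and `message`, mirroring the `error` field of `TutorChunk`. The exact event name and payload shape are assumptions, and `sseFrame` is a hypothetical helper.

```typescript
interface TutorErrorEvent {
  code: string;
  message: string;
}

const AI_UNAVAILABLE: TutorErrorEvent = {
  code: 'ai.unavailable',
  message:
    'The AI tutor is temporarily unavailable. Please try again in a moment, or review the lesson content.',
};

// Serializes one SSE frame: an "event:" line, a "data:" line, and a blank
// line that terminates the frame. JSON.stringify escapes newlines, so the
// payload always fits on a single data line.
function sseFrame(event: string, payload: unknown): string {
  return `event: ${event}\ndata: ${JSON.stringify(payload)}\n\n`;
}

const frame = sseFrame('error', AI_UNAVAILABLE);
```

The client can then branch on `code` to show the fallback state and a retry control without parsing the human-readable message.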

12. Telemetry

Metric	Type	Labels
`delivery_tutor_turn_duration_seconds`	Histogram	`tenant_id`, `local`, `model`
`delivery_tutor_turn_tokens_total`	Counter	`direction` (in/out), `tenant_id`
`delivery_tutor_cost_microusd_total`	Counter	`tenant_id`, `model`
`delivery_tutor_errors_total`	Counter	`error_code`, `tenant_id`
`delivery_tutor_rating_total`	Counter	`rating`, `tenant_id`

All tutor interactions carry trace_id across the SSE stream so ai-gateway traces can be correlated with delivery sessions.
