Compliance Layer — AI Integration
Status: populated | Last updated: 2026-04-18
1. Purpose
The Compliance Layer uses Large Language Models (LLMs) to perform content classification on SMS bodies when keyword and regex rules are insufficient. AI classification is applied only when an active AI_CLASSIFICATION rule is present in the tenant's rule set — most messages never invoke the LLM.
AI classification strengthens detection for:
- Obfuscated spam (misspellings, unicode substitution, zero-width characters)
- Context-dependent phishing (URLs + social engineering patterns)
- Multilingual content not covered by keyword lists
- Emerging fraud patterns that outpace keyword list updates
2. Provider Strategy — Local LLM First
| Provider | Priority | Use Case |
|---|---|---|
| Local LLM (self-hosted) | Primary | Default for all production traffic; preferred for data residency, cost predictability, and because no third-party data processing agreement (DPA) is needed |
| Anthropic Claude API | Secondary / failover | Optional — enabled per tenant or as automatic failover when local LLM is unavailable |
| OpenAI API | Tertiary / failover | Optional — secondary external failover |
| Mock provider | Dev/test only | In-memory responses for local development and CI |
Provider selection is governed by the AI_PROVIDER environment variable and a failover chain. The async pipeline's relaxed latency SLA (500 ms) accommodates local LLM inference comfortably.
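The failover chain can be resolved once at startup. The sketch below is one possible way to do that, not the engine's actual configuration code: the provider identifiers other than "local", the comma-separated failover setting, and the default to "local" when AI_PROVIDER is unset are all assumptions made for illustration.

```typescript
// Provider identifiers are assumptions; only "local" appears in this document.
type ProviderName = "local" | "claude" | "openai" | "mock";

const KNOWN_PROVIDERS: readonly string[] = ["local", "claude", "openai", "mock"];

function isProviderName(value: string): value is ProviderName {
  return KNOWN_PROVIDERS.includes(value);
}

// Returns the ordered chain: primary first, then failovers, without duplicates.
// `failoverList` stands in for a hypothetical comma-separated setting.
function buildProviderChain(primary: string | undefined, failoverList: string = ""): ProviderName[] {
  // Assumption: an unset or blank AI_PROVIDER falls back to the local LLM.
  const first = primary !== undefined && primary.trim() !== "" ? primary : "local";
  const requested = [first, ...failoverList.split(",")]
    .map((name) => name.trim().toLowerCase())
    .filter((name) => name !== "");

  const chain: ProviderName[] = [];
  for (const name of requested) {
    if (!isProviderName(name)) {
      throw new Error(`Unknown AI provider: ${name}`);
    }
    if (!chain.includes(name)) {
      chain.push(name); // keep the first position of a repeated provider
    }
  }
  return chain;
}
```

In a Node.js process this would typically be called with the value of `process.env.AI_PROVIDER` plus whatever setting carries the failover order. Rejecting unknown names at startup avoids discovering a typo only when failover is first needed.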
3. Local LLM Architecture
3.1 Deployment Topology
The local LLM runs as a separate Kubernetes deployment in the same namespace as compliance-engine. It is not deployed as a sidecar, because the GPU pods and the Node.js pods have different scaling characteristics:
┌───────────────────────┐         ┌───────────────────────┐
│  compliance-engine    │ ──────► │  local-llm            │
│  (Node.js pods)       │  gRPC/  │  (GPU pods, OpenAI-   │
│  3–20 replicas        │  HTTP   │  compatible API)      │
│                       │         │  2–6 replicas         │
└───────────────────────┘         └───────────────────────┘
            │
            ▼
    Redis (AI cache)
3.2 Recommended Serving Stack
| Component | Choice | Rationale |
|---|---|---|
| Inference server | vLLM (primary) or Ollama (dev) or TGI | vLLM provides OpenAI-compatible API + excellent throughput via PagedAttention |
| Model | Llama-3.1-8B-Instruct or Mistral-7B-Instruct or Qwen2.5-7B-Instruct | Small enough for cost-effective GPU use; strong multilingual coverage (English, Dari/Farsi, Pashto, Arabic) |
| GPU | NVIDIA A10 / L4 (24 GB) or A100 (40 GB) for higher throughput | Smaller GPUs suffice for 7–8B models |
| Quantisation | AWQ or GPTQ 4-bit | ~4× smaller weight footprint than FP16, with <1% quality loss on classification tasks |
| API interface | OpenAI-compatible (/v1/chat/completions) | Allows seamless provider abstraction — same client code as Claude/OpenAI |
3.3 Capacity Planning
Assumptions:
- 7B model with 4-bit quantisation on A10 GPU
- Typical request: 200 input tokens, 150 output tokens (structured JSON)
- Single-GPU throughput: ~10–20 requests/second
| Expected AI eval RPS | GPU pods | Cache hit target |
|---|---|---|
| 1–5 RPS | 2 (HA) | ≥ 90% |
| 5–20 RPS | 2–4 | ≥ 95% |
| 20–100 RPS | 4–6 | ≥ 95% |
| 100+ RPS | 6+ with load balancing | ≥ 97% |
Cache hit rate is the dominant cost driver — 95% cache hit means only 5% of AI-rule evaluations reach the LLM.
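The relationship between cache hit rate and LLM load can be made concrete with a small calculation. This is a sizing sketch under the assumptions listed above (10–20 requests/second per GPU, a floor of 2 pods for HA); the function names are illustrative only.

```typescript
// Requests per second that miss the cache and therefore reach the LLM.
function llmBoundRps(evalRps: number, cacheHitRate: number): number {
  if (cacheHitRate < 0 || cacheHitRate > 1) {
    throw new RangeError("cacheHitRate must be between 0 and 1");
  }
  return evalRps * (1 - cacheHitRate);
}

// GPU pods needed to serve a given LLM-bound load.
// minPods = 2 mirrors the HA floor used in the table above.
function podsForLlmLoad(llmRps: number, perGpuRps: number, minPods: number = 2): number {
  if (perGpuRps <= 0) {
    throw new RangeError("perGpuRps must be positive");
  }
  // The small epsilon keeps floating-point noise (e.g. 2.000000000000002)
  // from rounding up to an extra pod.
  const raw = Math.ceil(llmRps / perGpuRps - 1e-9);
  return Math.max(minPods, raw);
}
```

For example, 800 evaluations/second at a 95% hit rate leaves about 40 requests/second for the LLM, which two GPUs at 20 requests/second each can absorb. Dropping the hit rate to 90% doubles the LLM-bound load to about 80 requests/second.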
3.4 Provider Abstraction
compliance-engine implements a provider-agnostic LLMClient interface:
interface LLMClient {
  classify(body: string, categories: AiContentCategory[]): Promise<ClassificationResult>;
}
// Concrete implementations:
class LocalLLMClient implements LLMClient { /* vLLM endpoint */ }
class ClaudeClient implements LLMClient { /* Anthropic SDK */ }
class OpenAIClient implements LLMClient { /* OpenAI SDK */ }
class MockClient implements LLMClient { /* deterministic test responses */ }
An LLMRouter selects the client based on AI_PROVIDER and handles failover:
class LLMRouter implements LLMClient {
  async classify(body: string, categories: AiContentCategory[]) {
    // While the primary's circuit is open, skip it rather than adding load.
    if (!this.circuitBreaker.isOpenFor(this.primary)) {
      try {
        return await this.primary.classify(body, categories);
      } catch (err) {
        this.circuitBreaker.recordFailure();
        if (!this.secondary) throw err; // caller applies rule.fallbackAction
      }
    }
    if (!this.secondary) {
      // No failover target: caller applies rule.fallbackAction
      throw new Error("Primary LLM unavailable and no secondary configured");
    }
    return await this.secondary.classify(body, categories);
  }
}
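For local development and CI, the MockClient listed among the implementations could look like the sketch below. The shape of ClassificationResult is not defined in this document, so the `scores` field is a hypothetical stand-in, as is the canned-response map passed to the constructor.

```typescript
type AiContentCategory =
  | "SPAM" | "PHISHING" | "FINANCIAL_FRAUD" | "ADULT_CONTENT" | "HATE_SPEECH"
  | "POLITICAL_CONTENT" | "DRUG_REFERENCE" | "GAMBLING" | "TERRORISM"
  | "MALWARE_LINK" | "HEALTH_MISINFORMATION";

type CategoryScores = Partial<Record<AiContentCategory, number>>;

// Hypothetical result shape; the document does not define ClassificationResult.
interface ClassificationResult {
  scores: CategoryScores;
}

interface LLMClient {
  classify(body: string, categories: AiContentCategory[]): Promise<ClassificationResult>;
}

class MockClient implements LLMClient {
  private readonly canned: Map<string, CategoryScores>;

  constructor(canned: Map<string, CategoryScores> = new Map()) {
    this.canned = canned;
  }

  // The same body always yields the same scores; unknown bodies score 0 everywhere.
  async classify(body: string, categories: AiContentCategory[]): Promise<ClassificationResult> {
    const known: CategoryScores = this.canned.get(body) ?? {};
    const scores: CategoryScores = {};
    for (const category of categories) {
      scores[category] = known[category] ?? 0;
    }
    return { scores };
  }
}
```

Because the mock holds no network or timing state, tests that exercise rule thresholds stay deterministic.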
4. Classification Categories
| Category | Description |
|---|---|
| SPAM | Unsolicited commercial messaging, clickbait |
| PHISHING | Credential-harvesting URLs, fake bank/service pages, impersonation |
| FINANCIAL_FRAUD | Advance-fee fraud, fake investment schemes, money muling |
| ADULT_CONTENT | Sexually explicit content |
| HATE_SPEECH | Slurs, incitement against protected groups |
| POLITICAL_CONTENT | Political campaigning (relevant in election blackout periods) |
| DRUG_REFERENCE | Drug sales, narcotic-related content |
| GAMBLING | Gambling promotions in restricted jurisdictions |
| TERRORISM | Terrorist propaganda, recruitment, incitement |
| MALWARE_LINK | URLs pointing to known malware distribution, APK sideload prompts |
| HEALTH_MISINFORMATION | False health claims, unlicensed medical advice |
Categories are evaluated independently — a single message can score high on multiple categories.
5. Prompt Design
Single-turn, structured-output prompt. The LLM returns only a JSON object with confidence scores — no free-text reasoning, which minimises prompt-injection risk and token usage.
System:
You are an SMS compliance content classifier. Given an SMS message body,
return a JSON object mapping each of the following categories to a
confidence score between 0.0 and 1.0:
SPAM, PHISHING, FINANCIAL_FRAUD, ADULT_CONTENT, HATE_SPEECH,
POLITICAL_CONTENT, DRUG_REFERENCE, GAMBLING, TERRORISM, MALWARE_LINK,
HEALTH_MISINFORMATION.
A score of 1.0 means certain match; 0.0 means no indication.
Return ONLY the JSON object, no explanation or other text.
User:
[MESSAGE BODY HERE]
Expected response (enforced via grammar-constrained decoding on local LLM):
{
"SPAM": 0.12,
"PHISHING": 0.87,
"FINANCIAL_FRAUD": 0.05,
"ADULT_CONTENT": 0.0,
"HATE_SPEECH": 0.0,
"POLITICAL_CONTENT": 0.0,
"DRUG_REFERENCE": 0.0,
"GAMBLING": 0.0,
"TERRORISM": 0.0,
"MALWARE_LINK": 0.65,
"HEALTH_MISINFORMATION": 0.0
}
Grammar-constrained decoding
Local inference servers (vLLM, llama.cpp, etc.) support grammar-constrained decoding, which restricts token sampling so that the output conforms to a supplied JSON schema. As long as generation is not cut off by the token limit, the response is valid JSON of the expected shape, which removes malformed-output parsing errors as a failure class.
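A schema for the expected response can be generated from the category list. The sketch below builds that schema and a request body for /v1/chat/completions. How the schema is handed to the server is an assumption: an OpenAI-style `response_format` of type `json_schema` is used here, but the exact parameter differs between inference servers and versions, and the schema name is made up for illustration.

```typescript
const CATEGORIES = [
  "SPAM", "PHISHING", "FINANCIAL_FRAUD", "ADULT_CONTENT", "HATE_SPEECH",
  "POLITICAL_CONTENT", "DRUG_REFERENCE", "GAMBLING", "TERRORISM",
  "MALWARE_LINK", "HEALTH_MISINFORMATION",
] as const;

interface ScoreProperty {
  type: "number";
  minimum: number;
  maximum: number;
}

// JSON Schema requiring exactly one numeric score in [0, 1] per category.
function buildScoreSchema(categories: readonly string[]) {
  const properties: Record<string, ScoreProperty> = {};
  for (const category of categories) {
    properties[category] = { type: "number", minimum: 0, maximum: 1 };
  }
  return {
    type: "object" as const,
    properties,
    required: [...categories],
    additionalProperties: false,
  };
}

// Request body for POST /v1/chat/completions.
// ASSUMPTION: the server accepts an OpenAI-style `response_format` of type
// "json_schema"; check the serving stack's documentation for the actual option.
function buildClassificationRequest(model: string, systemPrompt: string, messageBody: string) {
  return {
    model,
    messages: [
      { role: "system", content: systemPrompt },
      { role: "user", content: messageBody }, // untrusted SMS body stays in the user turn
    ],
    response_format: {
      type: "json_schema",
      json_schema: { name: "sms_category_scores", schema: buildScoreSchema(CATEGORIES) },
    },
  };
}
```

Some decoding backends may not enforce numeric bounds such as `minimum` and `maximum`, so a range check in the response parser remains worthwhile even with constrained decoding enabled.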
Prompt injection resistance
- User input (message body) lives in a clearly delimited user message.
- System prompt restricts output to JSON — injection attempts have no free-text channel.
- Response parser rejects any response not matching the expected schema — treated as LLM failure (fallback action applies).
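The strict parsing described in the last bullet can be sketched as follows. This is an illustrative validator, not the engine's actual parser; the function name and error messages are assumptions. Any thrown error would be treated by the caller as an LLM failure.

```typescript
// Parses a raw LLM response and returns the scores, or throws if the response
// is anything other than a JSON object with exactly the expected categories,
// each mapped to a finite number in [0, 1].
function parseScores(raw: string, categories: readonly string[]): Record<string, number> {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    throw new Error("LLM response is not valid JSON");
  }
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    throw new Error("LLM response is not a JSON object");
  }

  const candidate = parsed as Record<string, unknown>;
  if (Object.keys(candidate).length !== categories.length) {
    throw new Error("LLM response has missing or unexpected keys");
  }

  const scores: Record<string, number> = {};
  for (const category of categories) {
    const value = candidate[category];
    if (typeof value !== "number" || !Number.isFinite(value) || value < 0 || value > 1) {
      throw new Error(`Invalid score for ${category}`);
    }
    scores[category] = value;
  }
  return scores;
}
```

A response such as `Sure! {"SPAM": 0.1}` fails at the JSON parse step, so text smuggled in by an injection attempt never reaches rule evaluation.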
6. PII Anonymisation Before Inference
Although the local LLM runs inside our trust boundary, anonymisation is still recommended as defence-in-depth (and critical if external LLM failover is ever enabled). When ANONYMIZE_BODY_BEFORE_AI=true:
| Pattern | Replacement |
|---|---|
| E.164 phone numbers | [PHONE] |
| Monetary amounts | [AMOUNT] |
| 5+ digit sequences (OTPs, account numbers) | [NUMERIC] |
| Common first names (curated list) | [NAME] |
| URLs | [URL] (presence preserved for phishing detection) |
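The replacement table can be approximated with a chain of regular expressions. The sketch below is a simplified illustration, not the production anonymiser: the currency symbols and codes, the 7-digit minimum for phone numbers, and the caller-supplied name list are all assumptions.

```typescript
// Order matters: URLs, phone numbers and amounts are replaced before the generic
// digit rule so that their digits are not rewritten as [NUMERIC] first.
const URL_RE = /\b(?:https?:\/\/|www\.)\S+/gi;
// E.164 allows at most 15 digits after "+"; the 7-digit minimum is an assumption.
const PHONE_RE = /\+[1-9]\d{6,14}\b/g;
// Currency symbols and codes here are examples only.
const AMOUNT_RE = /[$€£]\s?\d[\d,]*(?:\.\d+)?|\b\d[\d,]*(?:\.\d+)?\s?(?:USD|EUR|GBP|AFN)\b/g;
const LONG_DIGITS_RE = /\d{5,}/g;

function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// `firstNames` stands in for the curated name list mentioned in the table.
function anonymiseBody(body: string, firstNames: readonly string[] = []): string {
  let out = body
    .replace(URL_RE, "[URL]")
    .replace(PHONE_RE, "[PHONE]")
    .replace(AMOUNT_RE, "[AMOUNT]")
    .replace(LONG_DIGITS_RE, "[NUMERIC]");

  // An empty list would produce a pattern that matches everywhere, so skip it.
  if (firstNames.length > 0) {
    const namesRe = new RegExp(`\\b(?:${firstNames.map(escapeRegExp).join("|")})\\b`, "g");
    out = out.replace(namesRe, "[NAME]");
  }
  return out;
}
```

Note that `\b` in JavaScript regular expressions is ASCII-oriented, so names written in Arabic script would need a different boundary strategy than the one shown here.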
7. Caching Strategy
AI classification is the slowest operation in compliance evaluation. Aggressive caching is critical:
| Cache Layer | Key | TTL | Hit Expectation |
|---|---|---|---|
| Redis L1 | ai:cache:{modelVersion}:{sha256(anonymised_body)} | 24 h | ≥ 95% for templated messages (OTPs, alerts, campaigns) |
| PostgreSQL L2 (future) | ai_classification_cache table | 7 d | ≥ 98% including cross-pod sharing |
Cache entry format
{
"version": "1.0",
"classifiedAt": "2026-04-18T12:00:00Z",
"provider": "local",
"model": "llama-3.1-8b-instruct-awq",
"categories": {
"SPAM": 0.12,
"PHISHING": 0.87,
...
}
}
Cache invalidation
- Time-based TTL (24 h) is the primary mechanism.
- On model version upgrade: the cache key format is ai:cache:{modelVersion}:{sha256}, so a model change implicitly bypasses the old entries without an explicit purge.
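A versioned key of this form could be derived as in the sketch below, which uses Node.js's built-in `crypto` module. The function name is illustrative, and hashing the anonymised body as UTF-8 is an assumption about how the digest is computed.

```typescript
import { createHash } from "node:crypto";

// Builds ai:cache:{modelVersion}:{sha256} for an already-anonymised message body.
function aiCacheKey(modelVersion: string, anonymisedBody: string): string {
  const digest = createHash("sha256").update(anonymisedBody, "utf8").digest("hex");
  return `ai:cache:${modelVersion}:${digest}`;
}
```

Because the model version is part of the key, entries written by the previous model are simply never read again and age out through the TTL.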
8. Budget and Timeout Control
| Control | Value | Purpose |
|---|---|---|
| AI_TIMEOUT_MS | 2000 ms (default for local LLM) | Hard limit for a single LLM call |
| Eval budget allocation | 300 ms of 450 ms total | Maximum AI spend within one evaluation |
| Concurrency limit | 200 in-flight LLM calls per pod | Prevent thundering herd on LLM service |
| Circuit breaker | 5 consecutive failures in 30 s opens circuit for 60 s | Shed load when LLM degraded |
When the timeout or concurrency limit is reached, or the circuit is open, the rule's fallbackAction applies. For fail-closed operation, fallbackAction: HOLD is the recommended default for all AI rules: the message is queued for manual review rather than being let through.
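The circuit-breaker row in the table above can be read as the following state machine. This is a minimal sketch of one interpretation (five consecutive failures inside a 30 s window open the circuit for 60 s), not the engine's actual breaker; the class shape and the injectable clock are assumptions made so the behaviour can be tested.

```typescript
class CircuitBreaker {
  private failureTimes: number[] = [];
  private openUntil = 0;
  private readonly threshold: number;
  private readonly windowMs: number;
  private readonly openMs: number;
  private readonly now: () => number;

  constructor(
    threshold: number = 5,
    windowMs: number = 30000,
    openMs: number = 60000,
    now: () => number = () => Date.now(),
  ) {
    this.threshold = threshold;
    this.windowMs = windowMs;
    this.openMs = openMs;
    this.now = now;
  }

  // True while calls should be skipped and the fallback action applied.
  isOpen(): boolean {
    return this.now() < this.openUntil;
  }

  // A success ends the run of consecutive failures.
  recordSuccess(): void {
    this.failureTimes = [];
  }

  recordFailure(): void {
    const t = this.now();
    // Keep only failures that still fall inside the sliding window.
    this.failureTimes = this.failureTimes.filter((at) => t - at < this.windowMs);
    this.failureTimes.push(t);
    if (this.failureTimes.length >= this.threshold) {
      this.openUntil = t + this.openMs;
      this.failureTimes = [];
    }
  }
}
```

One instance per provider keeps a degraded local LLM from affecting the failure count of an external failover provider.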
9. Rule Configuration Examples
High-severity phishing detection (HOLD on AI unavailable)
{
"name": "Phishing URL Detection",
"ruleType": "AI_CLASSIFICATION",
"action": "HOLD",
"priority": 10,
"config": {
"categories": ["PHISHING", "MALWARE_LINK"],
"minConfidence": 0.75,
"fallbackAction": "HOLD"
}
}
Enhanced spam detection (still HOLD on AI unavailable — fail-closed)
{
"name": "Enhanced Spam Detection",
"ruleType": "AI_CLASSIFICATION",
"action": "FLAG",
"priority": 50,
"config": {
"categories": ["SPAM"],
"minConfidence": 0.85,
"fallbackAction": "HOLD"
}
}
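How such a rule might be applied to a set of scores is sketched below. The matching semantics are an assumption: the rule is treated as matching when any listed category reaches minConfidence, and a null score set models an unavailable LLM. The type and function names are illustrative.

```typescript
type Action = "FLAG" | "HOLD" | "BLOCK";

interface AiRule {
  name: string;
  ruleType: "AI_CLASSIFICATION";
  action: Action;
  priority: number;
  config: {
    categories: string[];
    minConfidence: number;
    fallbackAction: Action;
  };
}

// `scores` is null when the LLM could not be used (timeout, open circuit,
// rejected response). Returns the action to take, or null if the rule does not match.
function applyAiRule(rule: AiRule, scores: Record<string, number> | null): Action | null {
  if (scores === null) {
    return rule.config.fallbackAction;
  }
  // Assumption: one category at or above the threshold is enough to match.
  const matched = rule.config.categories.some(
    (category) => (scores[category] ?? 0) >= rule.config.minConfidence,
  );
  return matched ? rule.action : null;
}
```

With the example response from the prompt section (PHISHING 0.87, SPAM 0.12), the phishing rule returns HOLD while the spam rule returns null, and both return HOLD if the scores are unavailable.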
National security — combined keyword + AI
{
"name": "Terrorism Content (Combined)",
"ruleType": "COMPOSITE",
"action": "BLOCK",
"priority": 1,
"config": {
"operator": "OR",
"ruleIds": ["keyword-terror-rule-id", "ai-terrorism-rule-id"]
}
}
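A composite rule of this kind combines the outcomes of its child rules. The sketch below shows one plausible evaluation; only the OR operator appears in this document, so the AND branch and the result-map input are assumptions.

```typescript
type Operator = "AND" | "OR"; // "AND" is assumed; only "OR" appears in the example

interface CompositeConfig {
  operator: Operator;
  ruleIds: string[];
}

// `results` maps a child rule id to whether that rule matched the message.
function evaluateComposite(config: CompositeConfig, results: ReadonlyMap<string, boolean>): boolean {
  const outcomes = config.ruleIds.map((id) => {
    const matched = results.get(id);
    if (matched === undefined) {
      throw new Error(`No result for child rule ${id}`);
    }
    return matched;
  });
  return config.operator === "OR"
    ? outcomes.some((matched) => matched)
    : outcomes.every((matched) => matched);
}
```

With OR, a keyword hit is enough to trigger the composite even when the AI child rule did not match, which is what makes the combined rule robust to missed classifications.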
10. Monitoring AI Usage
| Metric | Target |
|---|---|
| compliance_ai_cache_hits_total / compliance_ai_requests_total | ≥ 95% |
| compliance_ai_duration_seconds{quantile="0.95"} | ≤ 500 ms (local LLM) |
| compliance_ai_fallback_total | ≤ 0.1% of AI requests |
| Local LLM GPU utilisation | 40–70% (headroom for spikes) |
11. Cost Model
Local LLM (primary)
Assumptions:
- 2× A10 GPU nodes, $0.60/hr each = $1.20/hr = ~$864/month
- Throughput: 20 RPS sustained per pod, 40 RPS total
- At a 95% cache hit rate, 40 RPS of LLM capacity supports ~800 RPS of AI-rule evaluations (40 / 0.05)
- Monthly capacity: ~2 billion AI evaluations at full sustained throughput (800 RPS × ~2.59 million seconds per 30-day month)
Fixed cost model — scale cost by GPU capacity, not per-message. Cost-effective at volume.
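The arithmetic behind these assumptions can be reproduced with a short calculation. The sketch assumes a 30-day month, one serving pod per GPU node, and sustained full utilisation; the field names are illustrative.

```typescript
const SECONDS_PER_MONTH = 30 * 24 * 3600; // 30-day month, matching the ~$864 figure

interface LocalCostInputs {
  gpuNodes: number;         // assumes one serving pod per node
  hourlyUsdPerNode: number;
  rpsPerPod: number;        // sustained LLM requests per second per pod
  cacheHitRate: number;     // fraction of evaluations served from cache
}

function localCostModel(inputs: LocalCostInputs) {
  if (inputs.cacheHitRate < 0 || inputs.cacheHitRate >= 1) {
    throw new RangeError("cacheHitRate must be in [0, 1)");
  }
  const monthlyUsd = inputs.gpuNodes * inputs.hourlyUsdPerNode * 24 * 30;
  const llmRps = inputs.gpuNodes * inputs.rpsPerPod;
  // Only cache misses reach the LLM, so evaluations scale by 1 / (1 - hit rate).
  const evalRps = llmRps / (1 - inputs.cacheHitRate);
  const monthlyEvals = evalRps * SECONDS_PER_MONTH;
  const usdPerMillionEvals = monthlyUsd / (monthlyEvals / 1e6);
  return { monthlyUsd, llmRps, evalRps, monthlyEvals, usdPerMillionEvals };
}
```

With 2 nodes at $0.60/hr, 20 RPS per pod and a 95% hit rate, this gives about $864 per month, roughly 800 evaluations per second, about 2.07 billion evaluations per month, and around $0.42 per million evaluations. That last figure only holds at full utilisation; at lower traffic the fixed monthly cost is spread over fewer evaluations.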
Fallback external LLM (Claude Haiku — illustrative)
- ~$21.60 per 1M messages evaluated (per prior estimate)
- Used only on local LLM failover or per-tenant opt-in
- Budget cap configurable via spend and rate limits on the Anthropic account
12. Local LLM Operations Runbook (Summary)
| Task | Owner | Cadence |
|---|---|---|
| Model evaluation against labelled SMS dataset | Trust & Safety + ML | Quarterly |
| Model version upgrade (e.g., Llama 3.1 → 3.2) | Platform Engineering | As released, with A/B test first |
| GPU pod scaling review | Platform SRE | Monthly |
| Classification accuracy audit (precision/recall per category) | Trust & Safety | Monthly |
| Prompt tuning based on false-positive patterns | Trust & Safety + ML | Ongoing |
| Fine-tuning on platform-specific examples (future) | ML | 2026 Q4 |
13. Future Enhancements
| Enhancement | Rationale | Timeline |
|---|---|---|
| Fine-tuned SMS-domain classifier | Lower latency, higher accuracy for our specific traffic patterns | 2026 Q4 |
| Active learning — reviewer feedback updates training set | Continuous improvement | 2027 Q1 |
| Multilingual keyword generation via LLM | Auto-expand keyword lists for new languages | 2027 |
| Sender-reputation + content joint model | Reduce false positives on legitimate senders | 2027 |
| On-device inference for edge deployments | Regional compliance, offline resilience | 2027+ |