ai-orchestrator-service — Risk Register

Companion to: FAILURE_MODES.md · SECURITY_MODEL.md · docs/08-ai-architecture.md

Each risk is scored by likelihood (L: 1 = rare, 5 = expected) × impact (I: 1 = minor, 5 = catastrophic), giving score = L × I. Mitigations are tracked as tickets on the AI engineering board, each with an SLA. The register is reviewed monthly by the AI lead, security, and SRE.

1. Risk table

| ID | Risk | Category | L | I | Score | Status | Owner | Mitigation |
|---|---|---|---|---|---|---|---|---|
| R-AI-001 | Unbounded provider spend by misconfigured tenant or buggy caller | Cost | 4 | 4 | 16 | mitigated | AI lead | Per-tenant + per-purpose hard caps, cache, per-IP rate limits, per-tenant in-flight cap; daily anomaly detection on cost_micros per tenant |
| R-AI-002 | Cross-tenant RAG data leak | Security/compliance | 2 | 5 | 10 | mitigated | Security | RLS + defence-in-depth WHERE clauses + grep guard in CI + integration test IT-AI-007 + red-team cross-tenant-rag.spec.ts |
| R-AI-003 | PII egress to non-Vertex provider via fallback | Security/compliance | 2 | 5 | 10 | mitigated | Security | Router refuses non-Vertex providers when pii_class >= guest_pii; property test |
| R-AI-004 | Prompt injection coerces system to act as agent / reveal data | Security | 4 | 4 | 16 | mitigated | AI lead | <user_content> wrapper + denylist + tool-call refusal + output schema enforcement + red-team suite (≥ 200 cases) |
| R-AI-005 | Edge model tampering on desktop (post-install) | Security | 2 | 4 | 8 | mitigated | Desktop lead | KMS-signed manifest + per-load SHA-256 verify + refuse-on-mismatch |
| R-AI-006 | Manifest signing key compromise | Security | 1 | 5 | 5 | partial | Security | Smallest-blast-radius signer revision + KMS rotation + audit on every sign call. Open: automated revocation playbook to push a "kill manifest" to all desktops |
| R-AI-007 | Provider outage cascading into platform-wide degradation | Reliability | 4 | 3 | 12 | mitigated | SRE | Fallback chain + per-(provider, model) breaker + deterministic fallbacks per capability |
| R-AI-008 | Eval suite drift: model regressions go undetected | Quality | 3 | 4 | 12 | mitigated | AI lead | Nightly canary on every active prompt + alerting + promotion gate |
| R-AI-009 | Hallucinated guest PII in drafts (model confabulates names, IDs) | Compliance | 3 | 3 | 9 | mitigated | AI lead | Strict output schemas + RAG-grounded prompts + post-call PII detector that flags fields not present in input |
| R-AI-010 | HITL queue starvation: gates expire and auto-decisions cause real-world harm | Operational | 2 | 4 | 8 | partial | Tenant ops | Per-policy defaultOnTimeout favours reject; queue alerting; open: SLA dashboard per tenant for HITL latency |
| R-AI-011 | Provider key leak (e.g. via accidentally-logged prompt template) | Security | 2 | 5 | 10 | mitigated | Security | Secret-scanner CI on prompt templates + structured logging that never includes raw prompts/outputs + 60-day key rotation |
| R-AI-012 | A/B test drives a tenant to a measurably worse experience | Product | 3 | 3 | 9 | mitigated | AI lead | Bayesian early-stop on conversion metric; auto-revert; tenant-level opt-out |
| R-AI-013 | Outbox publisher backlog leads to long event-delivery latency | Reliability | 3 | 3 | 9 | mitigated | SRE | Lag gauge + alert; horizontally-scalable publisher; DLQ |
| R-AI-014 | Pgvector index growth degrades RAG p95 latency | Performance | 3 | 3 | 9 | partial | Data-eng | HNSW indexes + per-corpus chunk caps; open: per-corpus partitioning when chunks > 5M |
| R-AI-015 | Edge replay diverges from cloud (model + prompt drift between desktop and cloud) | Quality | 3 | 3 | 9 | mitigated | AI lead | Edge-replay nightly suite + version pinning of prompts + manifest publish event triggers desktop re-pull |
| R-AI-016 | Right-to-erasure (GDPR) cannot fully purge embeddings derived from a guest | Compliance | 2 | 4 | 8 | partial | Compliance | Embeddings tagged with subject_id; deletion job purges; open: documenting irreversibility for embeddings already shipped to desktops (deleted on next sync) |
| R-AI-017 | Event schema evolution breaks a downstream consumer | Compatibility | 2 | 3 | 6 | mitigated | AI lead | Major version bumps required for breaking changes; 30-day overlap; consumer contract tests |
| R-AI-018 | Vertex regional outage with no warm secondary | Reliability | 1 | 4 | 4 | partial | SRE | DR plan documented; open: warm Vertex AI secondary in europe-west4 with synchronous routing readiness drill |
| R-AI-019 | LLM-as-judge bias inflates eval scores for stylistically similar outputs | Quality | 3 | 2 | 6 | partial | AI lead | Frozen judge prompt + 3-sample majority vote; open: human spot-check sampling each release |
| R-AI-020 | Feature scope sprawl: capability catalog grows un-evaluated | Governance | 4 | 2 | 8 | partial | AI lead | Capability registration HITL gate + every capability requires an eval suite to leave draft |
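
To make the R-AI-003 mitigation concrete, the following is a minimal sketch of a fallback router that refuses non-Vertex providers once pii_class reaches guest_pii. It is not the service's actual router. The names `PiiClass`, `Provider`, `eligible_providers`, `route`, and `NoEligibleProvider` are hypothetical, and the ordering of PII levels is an assumption; only `guest_pii` is named in the register.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Callable, List


class PiiClass(IntEnum):
    # Assumed ordering; the register only names `guest_pii`.
    NONE = 0
    PSEUDONYMOUS = 1
    GUEST_PII = 2
    SENSITIVE = 3


@dataclass(frozen=True)
class Provider:
    name: str
    is_vertex: bool


class NoEligibleProvider(RuntimeError):
    """Raised when no permitted provider is healthy."""


def eligible_providers(chain: List[Provider], pii_class: PiiClass) -> List[Provider]:
    """Filter the fallback chain: at guest_pii or above, keep Vertex providers only."""
    if pii_class >= PiiClass.GUEST_PII:
        return [p for p in chain if p.is_vertex]
    return list(chain)


def route(
    chain: List[Provider],
    pii_class: PiiClass,
    is_healthy: Callable[[Provider], bool],
) -> Provider:
    """Pick the first healthy provider that the PII policy allows.

    Fails closed: if only non-Vertex providers are healthy and the request
    carries guest PII, the call is refused rather than rerouted.
    """
    for provider in eligible_providers(chain, pii_class):
        if is_healthy(provider):
            return provider
    raise NoEligibleProvider(f"no permitted provider for pii_class={pii_class.name}")
```

Filtering the chain before any health check means a Vertex outage can never cause a PII-bearing request to fall through to another provider, which is the invariant a property test would assert over arbitrary chains and health states.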

2. Risk acceptance

Risks scored ≥ 12 require quarterly review with platform leadership; risks scored ≥ 16 that are not fully mitigated also require a written exception signed by the CTO.
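
The scoring and acceptance rules can be expressed as a small check. This is a sketch under assumptions: the `Risk` type, `required_actions`, and the action labels are hypothetical, and "not fully mitigated" is read as any status other than "mitigated".

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Risk:
    risk_id: str
    likelihood: int  # L: 1 = rare, 5 = expected
    impact: int      # I: 1 = minor, 5 = catastrophic
    status: str      # "mitigated" or "partial"

    def __post_init__(self) -> None:
        # Both scales run from 1 to 5 in the register.
        for label, value in (("likelihood", self.likelihood), ("impact", self.impact)):
            if not 1 <= value <= 5:
                raise ValueError(f"{label} must be in 1..5, got {value}")

    @property
    def score(self) -> int:
        """Score = L x I, as defined at the top of the register."""
        return self.likelihood * self.impact


def required_actions(risk: Risk) -> List[str]:
    """Return the governance actions the acceptance rules call for."""
    actions: List[str] = []
    if risk.score >= 12:
        actions.append("quarterly_review")
    # Assumption: "not fully mitigated" means status != "mitigated".
    if risk.score >= 16 and risk.status != "mitigated":
        actions.append("cto_exception")
    return actions
```

Applied to the table above, R-AI-001, R-AI-004, R-AI-007, and R-AI-008 fall under quarterly review. No risk currently needs a CTO exception, because both risks scored 16 are marked mitigated.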

3. Open mitigations (tracked)

  • R-AI-006: ticket AI-2057 — automated kill-manifest push.
  • R-AI-010: ticket AI-2061 — per-tenant HITL latency dashboard.
  • R-AI-014: ticket AI-2074 — corpus partitioning runbook.
  • R-AI-016: ticket COMP-1432 — desktop-side embedding retraction documentation.
  • R-AI-018: ticket SRE-3398 — Vertex secondary readiness drill.
  • R-AI-019: ticket AI-2099 — human spot-check sampling cadence.
  • R-AI-020: ticket AI-2103 — quarterly capability catalog audit.