ai-orchestrator-service — Risk Register
Companion to: FAILURE_MODES.md · SECURITY_MODEL.md · docs/08-ai-architecture.md
Risks are scored by likelihood (L: 1=rare, 5=expected) × impact (I: 1=minor, 5=catastrophic), giving a score of L×I. Mitigations are tracked as tickets on the AI engineering board, with SLAs. The register is reviewed monthly by the AI lead, security, and SRE.
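The scoring rule is simple enough to lint automatically. The following is a minimal sketch, assuming the register rows have been parsed into objects; the `RiskEntry` shape and the function names are hypothetical and not part of the service.

```typescript
// Hypothetical types: the register itself only defines L, I and score = L × I.
type Rating = 1 | 2 | 3 | 4 | 5;

interface RiskEntry {
  id: string;
  likelihood: Rating; // L: 1 = rare, 5 = expected
  impact: Rating;     // I: 1 = minor, 5 = catastrophic
  score: number;      // value recorded in the register
}

// Score as defined by the register: likelihood multiplied by impact.
function riskScore(likelihood: Rating, impact: Rating): number {
  return likelihood * impact;
}

// Returns the IDs of entries whose recorded score disagrees with L × I.
function findScoreMismatches(entries: RiskEntry[]): string[] {
  return entries
    .filter((e) => riskScore(e.likelihood, e.impact) !== e.score)
    .map((e) => e.id);
}

// Two rows copied from the risk table.
const sample: RiskEntry[] = [
  { id: "R-AI-001", likelihood: 4, impact: 4, score: 16 },
  { id: "R-AI-006", likelihood: 1, impact: 5, score: 5 },
];

console.log(findScoreMismatches(sample)); // prints an empty array
```

A check like this could run during the monthly review to catch rows where L or I was updated without recomputing the score.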
1. Risk table
| ID | Risk | Category | L | I | Score | Status | Owner | Mitigation |
|---|---|---|---|---|---|---|---|---|
| R-AI-001 | Unbounded provider spend by misconfigured tenant or buggy caller | Cost | 4 | 4 | 16 | mitigated | AI lead | Per-tenant + per-purpose hard caps, cache, per-IP rate limits, per-tenant in-flight cap; daily anomaly detection on cost_micros per tenant |
| R-AI-002 | Cross-tenant RAG data leak | Security/compliance | 2 | 5 | 10 | mitigated | Security | RLS + defence-in-depth WHERE clauses + grep guard in CI + integration test IT-AI-007 + red-team cross-tenant-rag.spec.ts |
| R-AI-003 | PII egress to non-Vertex provider via fallback | Security/compliance | 2 | 5 | 10 | mitigated | Security | Router refuses non-Vertex providers when pii_class >= guest_pii; property test |
| R-AI-004 | Prompt injection coerces system to act as agent / reveal data | Security | 4 | 4 | 16 | mitigated | AI lead | <user_content> wrapper + denylist + tool-call refusal + output schema enforcement + red-team suite (≥ 200 cases) |
| R-AI-005 | Edge model tampering on desktop (post-install) | Security | 2 | 4 | 8 | mitigated | Desktop lead | KMS-signed manifest + per-load SHA-256 verify + refuse-on-mismatch |
| R-AI-006 | Manifest signing key compromise | Security | 1 | 5 | 5 | partial | Security | Smallest-blast-radius signer revision + KMS rotation + audit on every sign call. Open: automated revocation playbook to push a "kill manifest" to all desktops |
| R-AI-007 | Provider outage cascading into platform-wide degradation | Reliability | 4 | 3 | 12 | mitigated | SRE | Fallback chain + per-(provider, model) breaker + deterministic fallbacks per capability |
| R-AI-008 | Eval suite drift — model regressions go undetected | Quality | 3 | 4 | 12 | mitigated | AI lead | Nightly canary on every active prompt + alerting + promotion gate |
| R-AI-009 | Hallucinated guest PII in drafts (model confabulates names, IDs) | Compliance | 3 | 3 | 9 | mitigated | AI lead | Strict output schemas + RAG-grounded prompts + post-call PII detector that flags fields not present in input |
| R-AI-010 | HITL queue starvation — gates expire and auto-decisions cause real-world harm | Operational | 2 | 4 | 8 | partial | Tenant ops | Per-policy defaultOnTimeout favours reject; queue alerting; open: SLA dashboard per tenant for HITL latency |
| R-AI-011 | Provider key leak (e.g. via accidentally-logged prompt template) | Security | 2 | 5 | 10 | mitigated | Security | Secret-scanner CI on prompt templates + structured logging that never includes raw prompts/outputs + 60-day key rotation |
| R-AI-012 | A/B test drives a tenant to a measurably worse experience | Product | 3 | 3 | 9 | mitigated | AI lead | Bayesian early-stop on conversion metric; auto-revert; tenant-level opt-out |
| R-AI-013 | Outbox publisher backlog leads to long event-delivery latency | Reliability | 3 | 3 | 9 | mitigated | SRE | Lag gauge + alert; horizontally-scalable publisher; DLQ |
| R-AI-014 | Pgvector index growth degrades RAG p95 latency | Performance | 3 | 3 | 9 | partial | Data-eng | HNSW indexes + per-corpus chunk caps; open: per-corpus partitioning when chunks > 5M |
| R-AI-015 | Edge replay diverges from cloud (model+prompt drift between desktop + cloud) | Quality | 3 | 3 | 9 | mitigated | AI lead | Edge-replay nightly suite + version pinning of prompts + manifest publish event triggers desktop re-pull |
| R-AI-016 | Right-to-erasure (GDPR) cannot fully purge embeddings derived from a guest | Compliance | 2 | 4 | 8 | partial | Compliance | Embeddings tagged with subject_id; deletion job purges; open: documenting irreversibility for embeddings already shipped to desktops (deleted on next sync) |
| R-AI-017 | Event schema evolution breaks a downstream consumer | Compatibility | 2 | 3 | 6 | mitigated | AI lead | Major version bumps required for breaking changes; 30-day overlap; consumer contract tests |
| R-AI-018 | Vertex regional outage with no warm secondary | Reliability | 1 | 4 | 4 | partial | SRE | DR plan documented; open: warm Vertex AI secondary in europe-west4 with synchronous routing readiness drill |
| R-AI-019 | LLM-as-judge bias inflates eval scores for stylistically similar outputs | Quality | 3 | 2 | 6 | partial | AI lead | Frozen judge prompt + 3-sample majority vote; open: human spot-check sampling each release |
| R-AI-020 | Feature scope sprawl: capability catalog grows un-evaluated | Governance | 4 | 2 | 8 | partial | AI lead | Capability registration HITL gate + every capability requires an eval suite to leave draft; open: quarterly capability catalog audit |
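To illustrate the R-AI-003 mitigation (the router refuses non-Vertex providers when `pii_class >= guest_pii`), here is a minimal sketch of such a filter over a fallback chain. It is not the service's actual router: the PII class ordering, every class name other than `guest_pii`, the provider identifiers, and the function name are all assumptions made for illustration.

```typescript
// Assumed ordering of PII classes, lowest to highest sensitivity.
// Only "guest_pii" appears in the register; the other names are hypothetical.
const PII_ORDER = ["none", "internal", "guest_pii", "sensitive"] as const;
type PiiClass = (typeof PII_ORDER)[number];

interface ProviderCandidate {
  provider: string; // e.g. "vertex"; identifiers are hypothetical
  model: string;
}

// Drops every non-Vertex candidate once the request's PII class reaches guest_pii.
function allowedProviders(
  piiClass: PiiClass,
  chain: ProviderCandidate[],
): ProviderCandidate[] {
  const restricted =
    PII_ORDER.indexOf(piiClass) >= PII_ORDER.indexOf("guest_pii");
  return restricted ? chain.filter((c) => c.provider === "vertex") : chain;
}

// Hypothetical fallback chain with one Vertex and one non-Vertex entry.
const chain: ProviderCandidate[] = [
  { provider: "vertex", model: "model-a" },
  { provider: "other-provider", model: "model-b" },
];

console.log(allowedProviders("internal", chain).length);  // 2
console.log(allowedProviders("guest_pii", chain).length); // 1
```

A property test of the kind the table mentions would then assert that, for any chain and any class at or above `guest_pii`, the result contains only Vertex candidates.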
2. Risk acceptance
Risks scored ≥ 12 require quarterly review with platform leadership; risks scored ≥ 16 that are not fully mitigated also require a written exception signed by the CTO.
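The acceptance thresholds can be expressed as a small decision function. This is a sketch under one assumption: "not fully mitigated" is read as any status other than `mitigated`. The type and function names are hypothetical.

```typescript
// The two status values that appear in the risk table.
type Status = "mitigated" | "partial";

interface Governance {
  quarterlyLeadershipReview: boolean; // score >= 12
  ctoExceptionRequired: boolean;      // score >= 16 and not fully mitigated
}

// Maps a risk's score and status to the governance steps it triggers.
function governanceFor(score: number, status: Status): Governance {
  return {
    quarterlyLeadershipReview: score >= 12,
    // Assumption: any status other than "mitigated" counts as not fully mitigated.
    ctoExceptionRequired: score >= 16 && status !== "mitigated",
  };
}

// R-AI-001 from the table: score 16, status mitigated.
console.log(governanceFor(16, "mitigated"));
// { quarterlyLeadershipReview: true, ctoExceptionRequired: false }
```

Under this reading, the two risks currently scored 16 (R-AI-001 and R-AI-004) need quarterly review but no CTO exception, because both are marked mitigated.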
3. Open mitigations (tracked)
- R-AI-006: ticket AI-2057 (automated kill-manifest push).
- R-AI-010: ticket AI-2061 (per-tenant HITL latency dashboard).
- R-AI-014: ticket AI-2074 (corpus partitioning runbook).
- R-AI-016: ticket COMP-1432 (desktop-side embedding retraction documentation).
- R-AI-018: ticket SRE-3398 (Vertex secondary readiness drill).
- R-AI-019: ticket AI-2099 (human spot-check sampling cadence).
- R-AI-020: ticket AI-2103 (quarterly capability catalog audit).