ai-orchestrator-service — Risk Register

Companion to: FAILURE_MODES.md · SECURITY_MODEL.md · docs/08-ai-architecture.md

Risks are scored by likelihood (L: 1 = rare, 5 = expected) × impact (I: 1 = minor, 5 = catastrophic), giving score = L × I (range 1 to 25). Mitigations are tracked as tickets on the AI engineering board, each with an SLA. The register is reviewed monthly by the AI lead, security, and SRE.
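The scoring rule can be sketched in a few lines. This is a minimal illustration only: the `Risk` class and its field names are hypothetical and do not describe any schema used by the service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Risk:
    """One register row; class and field names are illustrative assumptions."""
    risk_id: str
    likelihood: int  # L: 1 = rare, 5 = expected
    impact: int      # I: 1 = minor, 5 = catastrophic

    def __post_init__(self) -> None:
        # Reject values outside the 1-5 scales defined by the register.
        for name in ("likelihood", "impact"):
            value = getattr(self, name)
            if not 1 <= value <= 5:
                raise ValueError(f"{name} must be in 1..5, got {value}")

    @property
    def score(self) -> int:
        # Score is the plain product L x I, so it ranges from 1 to 25.
        return self.likelihood * self.impact
```

For example, `Risk("R-AI-001", 4, 4).score` evaluates to 16, matching the first row of the table.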

1. Risk table

ID	Risk	Category	L	I	Score	Status	Owner	Mitigation
R-AI-001	Unbounded provider spend by misconfigured tenant or buggy caller	Cost	4	4	16	mitigated	AI lead	Per-tenant + per-purpose hard caps, cache, per-IP rate limits, per-tenant in-flight cap; daily anomaly detection on `cost_micros` per tenant
R-AI-002	Cross-tenant RAG data leak	Security/compliance	2	5	10	mitigated	Security	RLS + defence-in-depth WHERE clauses + grep guard in CI + integration test `IT-AI-007` + red-team `cross-tenant-rag.spec.ts`
R-AI-003	PII egress to non-Vertex provider via fallback	Security/compliance	2	5	10	mitigated	Security	Router refuses non-Vertex providers when `pii_class >= guest_pii`; property test
R-AI-004	Prompt injection coerces system to act as agent / reveal data	Security	4	4	16	mitigated	AI lead	`<user_content>` wrapper + denylist + tool-call refusal + output schema enforcement + red-team suite (≥ 200 cases)
R-AI-005	Edge model tampering on desktop (post-install)	Security	2	4	8	mitigated	Desktop lead	KMS-signed manifest + per-load SHA-256 verify + refuse-on-mismatch
R-AI-006	Manifest signing key compromise	Security	1	5	5	partial	Security	Smallest-blast-radius signer revision + KMS rotation + audit on every sign call; open: automated revocation playbook to push a "kill manifest" to all desktops
R-AI-007	Provider outage cascading into platform-wide degradation	Reliability	4	3	12	mitigated	SRE	Fallback chain + per-(provider, model) breaker + deterministic fallbacks per capability
R-AI-008	Eval suite drift — model regressions go undetected	Quality	3	4	12	mitigated	AI lead	Nightly canary on every active prompt + alerting + promotion gate
R-AI-009	Hallucinated guest PII in drafts (model confabulates names, IDs)	Compliance	3	3	9	mitigated	AI lead	Strict output schemas + RAG-grounded prompts + post-call PII detector that flags fields not present in input
R-AI-010	HITL queue starvation — gates expire and auto-decisions cause real-world harm	Operational	2	4	8	partial	Tenant ops	Per-policy `defaultOnTimeout` favours `reject`; queue alerting; open: SLA dashboard per tenant for HITL latency
R-AI-011	Provider key leak (e.g. via accidentally-logged prompt template)	Security	2	5	10	mitigated	Security	Secret-scanner CI on prompt templates + structured logging that never includes raw prompts/outputs + 60-day key rotation
R-AI-012	A/B test drives a tenant to a measurably worse experience	Product	3	3	9	mitigated	AI lead	Bayesian early-stop on conversion metric; auto-revert; tenant-level opt-out
R-AI-013	Outbox publisher backlog leads to long event-delivery latency	Reliability	3	3	9	mitigated	SRE	Lag gauge + alert; horizontally-scalable publisher; DLQ
R-AI-014	Pgvector index growth degrades RAG p95 latency	Performance	3	3	9	partial	Data-eng	HNSW indexes + per-corpus chunk caps; open: per-corpus partitioning when chunks > 5M
R-AI-015	Edge replay diverges from cloud (model+prompt drift between desktop + cloud)	Quality	3	3	9	mitigated	AI lead	Edge-replay nightly suite + version pinning of prompts + manifest publish event triggers desktop re-pull
R-AI-016	Right-to-erasure (GDPR) cannot fully purge embeddings derived from a guest	Compliance	2	4	8	partial	Compliance	Embeddings tagged with `subject_id`; deletion job purges; open: documenting irreversibility for embeddings already shipped to desktops (deleted on next sync)
R-AI-017	Schema evolution of an event breaks a downstream consumer	Compatibility	2	3	6	mitigated	AI lead	Major version bumps required for breaking changes; 30-day overlap; consumer contract tests
R-AI-018	Vertex regional outage with no warm secondary	Reliability	1	4	4	partial	SRE	DR plan documented; open: warm Vertex AI secondary in `europe-west4` with synchronous routing readiness drill
R-AI-019	LLM-as-judge bias inflates eval scores for stylistically similar outputs	Quality	3	2	6	partial	AI lead	Frozen judge prompt + 3-sample majority vote; open: human spot-check sampling each release
R-AI-020	Feature scope sprawl — capability catalog grows un-evaluated	Governance	4	2	8	partial	AI lead	Capability registration HITL gate + every capability requires an eval suite to leave `draft`; open: quarterly capability catalog audit
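
As an illustration of the R-AI-005 mitigation (per-load SHA-256 verify with refuse-on-mismatch), the check might look roughly like the sketch below. It assumes the expected digest has already been taken from a manifest whose KMS signature was verified elsewhere; the function and exception names are hypothetical and are not the desktop client's actual API.

```python
import hashlib
import hmac


class ModelIntegrityError(Exception):
    """Raised when model bytes do not match the expected digest (hypothetical name)."""


def verify_model_bytes(model_bytes: bytes, expected_sha256_hex: str) -> bytes:
    """Return the bytes unchanged if their SHA-256 matches, otherwise refuse."""
    actual = hashlib.sha256(model_bytes).hexdigest()
    # compare_digest does a constant-time comparison of the two hex strings.
    if not hmac.compare_digest(actual, expected_sha256_hex.lower()):
        raise ModelIntegrityError("model digest mismatch; refusing to load")
    return model_bytes
```

Raising instead of returning a boolean keeps the refuse-on-mismatch behaviour hard to ignore: a caller cannot load the model without either passing the check or handling the error explicitly.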

2. Risk acceptance

Risks scoring ≥ 12 require quarterly review with platform leadership; risks scoring ≥ 16 that are not fully mitigated also require a written exception signed by the CTO.
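
The two thresholds above can be expressed as a small rule. The sketch below is an assumption-laden illustration: the function name is hypothetical, and it treats only the status value `mitigated` as "fully mitigated", which the register does not state explicitly.

```python
from typing import List


def acceptance_requirements(score: int, status: str) -> List[str]:
    """Return the governance steps the register requires for one risk."""
    requirements: List[str] = []
    if score >= 12:
        requirements.append("quarterly review with platform leadership")
    # Assumption: any status other than "mitigated" counts as not fully mitigated.
    if score >= 16 and status != "mitigated":
        requirements.append("written exception signed by CTO")
    return requirements
```

Under this reading, R-AI-001 and R-AI-004 (score 16, status `mitigated`) need the quarterly review but no CTO exception, and risks scoring below 12 need neither.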

3. Open mitigations (tracked)

R-AI-006: ticket AI-2057 — automated kill-manifest push.
R-AI-010: ticket AI-2061 — per-tenant HITL latency dashboard.
R-AI-014: ticket AI-2074 — corpus partitioning runbook.
R-AI-016: ticket COMP-1432 — desktop-side embedding retraction documentation.
R-AI-018: ticket SRE-3398 — Vertex secondary readiness drill.
R-AI-019: ticket AI-2099 — human spot-check sampling cadence.
R-AI-020: ticket AI-2103 — quarterly capability catalog audit.
