ai-orchestrator-service — Failure Modes

Companion to: docs/standards/ERROR_CODES.md · OBSERVABILITY.md · SECURITY_MODEL.md

The AI service has a wide failure surface: external providers, schema drift, budget pressure, HITL queue starvation, signed-artifact tampering, edge-replay drift, and PII leak vectors. This file enumerates each failure, its detection, blast radius, response, and runbook.

1. Failure-mode summary table

#	Class	Mode	Detection	Blast radius	Response	Error code
1	provider	Vertex AI hard outage	health probe + 5xx rate from adapter	All cloud capabilities	Fallback chain → Anthropic → OpenAI → deterministic	`MELMASTOON.AI.PROVIDER_UNAVAILABLE`
2	provider	Vertex AI rate-limited	429 from adapter	Tenant or capability burst	Token-bucket retry with jitter; on persistent 429 escalate to fallback	`MELMASTOON.AI.PROVIDER_RATE_LIMITED`
3	provider	Anthropic / OpenAI key revoked or expired	401/403 from adapter	Fallback path only	Drop provider from chain for 5 min; page on-call; rotate key	`MELMASTOON.AI.PROVIDER_AUTH_FAILED`
4	provider	Provider returns malformed JSON repeatedly	Output validator after 1 repair retry	Capability calls only	Refuse, emit `output_invalid`; auto-degrade tenant after 5% failure rate over 1 000 calls	`MELMASTOON.AI.OUTPUT_INVALID`
5	budget	Tenant hard cap exceeded	Budget guard pre-call	All AI calls for that tenant for the period	If capability has `degrade_on_budget=true` → deterministic; else refuse	`MELMASTOON.AI.REFUSED_BUDGET`
6	budget	Soft warning crossed	Budget guard	None (informational)	Emit `budget.warning.v1`; surface banner in backoffice	n/a
7	safety	Pre-call moderation blocks input	Moderation port	Single request	Refuse 422 + audit row	`MELMASTOON.AI.REFUSED_SAFETY`
8	safety	Post-call moderation blocks output	Moderation port	Single request	Refuse 422; if capability has `safety_fallback=template` → return safe template	`MELMASTOON.AI.REFUSED_SAFETY`
9	safety	Prompt injection detected	Pattern filter + post-call schema mismatch	Single request	Refuse; increment `injection_attempts_total`; over threshold → tenant-level alert	`MELMASTOON.AI.REFUSED_SAFETY`
10	hitl	HITL gate timeout	Scheduled scanner every 60 s	Per gate	Auto-decision per `defaultOnTimeout` (`reject` for risky capabilities); emit `gate_decided`	n/a (auto-resolve)
11	hitl	HITL queue depth > 50	Gauge alert	Tenant-wide	Page tenant ops + send autoresponder banner; do not auto-flush	n/a
12	edge	Edge model integrity check fails	Desktop SHA-256 verify on load	Single desktop	Refuse load; mark capability `edge_unavailable`; user sees banner; capability falls back to cloud or refuses if offline	`MELMASTOON.AI.EDGE_MODEL_INTEGRITY_FAIL`
13	edge	Edge manifest signature invalid	Desktop verify on download	Single desktop	Refuse manifest; retain previous trusted manifest; alert	`MELMASTOON.AI.EDGE_MANIFEST_SIGNATURE_INVALID`
14	edge	Edge inference output schema invalid	Local validator	Single request	Show "AI offline draft unavailable; tap to draft online when reconnected"	`MELMASTOON.AI.OUTPUT_INVALID`
15	data	RAG cross-tenant leak attempt	Defence-in-depth WHERE tenant assertion	Single request rejected	Reject with audit row + `security.cross_tenant_attempt.v1` event	`MELMASTOON.GENERAL.CROSS_TENANT_REFERENCE`
16	data	pgvector index corruption	DB error on query	RAG capabilities	Failover to read replica; rebuild index	`MELMASTOON.AI.PROVIDER_UNAVAILABLE` (rare path)
17	data	Outbox publisher backlog > 5 min	Lag gauge	Delayed events for downstream consumers	Scale outbox publisher; if persistent, page	n/a
18	data	Idempotency-Key collision (different bodies)	Hash mismatch on key reuse	Single request	409 with explanation	`MELMASTOON.GENERAL.IDEMPOTENCY_CONFLICT`
19	infra	Postgres primary down	Connection failures	All writes	Cloud SQL HA failover (auto); retries succeed	`MELMASTOON.GENERAL.STORAGE_UNAVAILABLE` (transient)
20	infra	Memorystore down	Cache miss escalation + connection errors	Cache layer only	Fail open and bypass the cache; log a warning; alert	n/a (degraded)
21	infra	KMS unavailable for manifest publish	Sign API errors	Manifest publish only	Retry; if persistent, hold publish; alert	`MELMASTOON.AI.MANIFEST_SIGNING_UNAVAILABLE`
22	infra	Pub/Sub publish failure	Publish error from outbox publisher	Eventual events	Outbox holds row; backoff retry; DLQ after N	n/a
23	quality	Eval canary drift > 5% twice	Nightly drift detector	Quality SLO	Page ai-engineering; consider rolling the prompt back to the prior active version	n/a
24	quality	A/B variant significantly worse	Bayesian early-stop on conversion metric	Subset of traffic	Auto-stop A/B; revert to control	n/a
25	identity	Device JWT expired or signature invalid	Authn guard	Single desktop session	Force re-bind via desktop UX	`MELMASTOON.IDENTITY.DEVICE_NOT_BOUND`

2. Cascading failures and circuit breakers

The router uses a per-(provider, model) circuit breaker:

states: CLOSED, OPEN, HALF_OPEN
trip threshold: 50% failures over the last 50 calls (sliding window) OR ≥ 10 timeouts in 30 s
open duration: 60 s, then HALF_OPEN with 3 trial requests
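
The thresholds above could be implemented roughly as follows. This is a minimal sketch, not the router's actual code: the class and method names are hypothetical, and it assumes (beyond what is stated above) that the failure-rate rule applies only once the window holds 50 calls, that timeouts also count as failures in the window, that all 3 trial requests must succeed to close the breaker, and that any trial failure re-opens it.

```typescript
type BreakerState = "CLOSED" | "OPEN" | "HALF_OPEN";
type Outcome = "ok" | "error" | "timeout";

const WINDOW_SIZE = 50;         // sliding window: the last 50 calls
const FAILURE_RATIO = 0.5;      // trip at 50% failures
const TIMEOUT_LIMIT = 10;       // or at >= 10 timeouts ...
const TIMEOUT_SPAN_MS = 30000;  // ... within 30 s
const OPEN_MS = 60000;          // stay OPEN for 60 s
const TRIAL_REQUESTS = 3;       // HALF_OPEN admits 3 trial requests

class CircuitBreaker {
  private state: BreakerState = "CLOSED";
  private failures: boolean[] = [];    // true = failed call, newest last
  private timeoutTimes: number[] = []; // timestamps (ms) of recent timeouts
  private openedAt = 0;
  private trialsLeft = 0;
  private trialSuccesses = 0;

  // The clock is injectable so the breaker can be tested without real delays.
  constructor(private readonly now: () => number = () => Date.now()) {}

  // Returns the state, moving OPEN -> HALF_OPEN once the open duration elapses.
  currentState(): BreakerState {
    if (this.state === "OPEN" && this.now() - this.openedAt >= OPEN_MS) {
      this.state = "HALF_OPEN";
      this.trialsLeft = TRIAL_REQUESTS;
      this.trialSuccesses = 0;
    }
    return this.state;
  }

  // Call before each provider request; false means "use the fallback".
  allowRequest(): boolean {
    const state = this.currentState();
    if (state === "CLOSED") return true;
    if (state === "OPEN") return false;
    if (this.trialsLeft === 0) return false;
    this.trialsLeft -= 1;
    return true;
  }

  // Call after each provider request with its outcome.
  record(outcome: Outcome): void {
    const state = this.currentState();
    if (state === "OPEN") return; // late result from a call admitted before the trip
    if (state === "HALF_OPEN") {
      if (outcome !== "ok") this.trip();
      else if (++this.trialSuccesses >= TRIAL_REQUESTS) this.close();
      return;
    }
    const t = this.now();
    this.failures.push(outcome !== "ok");
    if (this.failures.length > WINDOW_SIZE) this.failures.shift();
    if (outcome === "timeout") this.timeoutTimes.push(t);
    this.timeoutTimes = this.timeoutTimes.filter((ts) => t - ts <= TIMEOUT_SPAN_MS);

    const failed = this.failures.filter(Boolean).length;
    const windowFull = this.failures.length === WINDOW_SIZE;
    const rateTripped = windowFull && failed / WINDOW_SIZE >= FAILURE_RATIO;
    if (rateTripped || this.timeoutTimes.length >= TIMEOUT_LIMIT) this.trip();
  }

  private trip(): void {
    this.state = "OPEN";
    this.openedAt = this.now();
    this.failures = [];
    this.timeoutTimes = [];
  }

  private close(): void {
    this.state = "CLOSED";
    this.failures = [];
    this.timeoutTimes = [];
  }
}
```

One instance would be held per (provider, model) pair; the router calls allowRequest() before each call and record() after it.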

When the breaker for vertex/gemini-1.5-flash opens for a capability, the router increments the counter provider.fallback_used_total{from='vertex',to='anthropic',capability_key=...} and keeps serving from the fallback until the breaker closes. Capabilities marked pii_class >= 'guest_pii' never use non-Vertex providers; when the Vertex breaker opens they degrade straight to the deterministic fallback.
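
The routing rule in this paragraph can be sketched as below. The chain order (Vertex, then Anthropic, then OpenAI, then deterministic) comes from row 1 of §1; everything else is an assumption made for illustration, in particular the function names and the list of pii_class values, of which only 'guest_pii' appears in this document.

```typescript
type Provider = "vertex" | "anthropic" | "openai";
type Route = Provider | "deterministic";

// HYPOTHETICAL ordering of pii_class values, least sensitive first.
// Only "guest_pii" is named in the source document.
const PII_CLASSES = ["none", "staff_pii", "guest_pii", "payment_pii"] as const;
type PiiClass = (typeof PII_CLASSES)[number];

// Fallback chain order taken from row 1 of the summary table.
const FALLBACK_CHAIN: Provider[] = ["vertex", "anthropic", "openai"];

function isAtLeast(actual: PiiClass, floor: PiiClass): boolean {
  return PII_CLASSES.indexOf(actual) >= PII_CLASSES.indexOf(floor);
}

// Picks the first provider whose breaker is not open. Capabilities at or
// above guest_pii may only use Vertex, so they go straight to the
// deterministic fallback when the Vertex breaker is open.
function resolveRoute(
  piiClass: PiiClass,
  breakerOpen: (provider: Provider) => boolean,
): Route {
  const candidates = isAtLeast(piiClass, "guest_pii")
    ? FALLBACK_CHAIN.filter((p) => p === "vertex")
    : FALLBACK_CHAIN;
  for (const provider of candidates) {
    if (!breakerOpen(provider)) return provider;
  }
  return "deterministic";
}

// In-memory stand-in for the provider.fallback_used_total counter.
function recordFallback(
  counters: Map<string, number>,
  from: Provider,
  to: Route,
  capabilityKey: string,
): void {
  const key = `provider.fallback_used_total{from='${from}',to='${to}',capability_key='${capabilityKey}'}`;
  counters.set(key, (counters.get(key) ?? 0) + 1);
}
```

For example, with only the Vertex breaker open, a capability with no PII restriction resolves to anthropic, while a guest_pii capability resolves to deterministic.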

3. Deterministic fallbacks

Per-capability fallback functions live in src/fallbacks/<capability>.ts. They produce a schema-valid output without calling any model:

Capability	Fallback
`pricing.suggest`	Returns BAR ± 0% with `confidence=0` and `reason='deterministic_fallback'`
`pricing.demand_forecast`	Last-year same-day occupancy
`housekeeping.route_optimize`	OR-tools-only solution; LLM rationale field empty
`staff.shift_optimize`	LP-only solution
`anomaly.detect`	Static rule set (e.g. > 5 failed logins / 5 min)
`upsell.recommend`	Top-3 popular per segment lookup
`message.draft`	Template message with merge fields filled
`review.summarize`	Extractive top-3 sentences by TF-IDF
`vision.id_ocr`	Refuse with `MELMASTOON.AI.PROVIDER_UNAVAILABLE` (no safe deterministic OCR)
`audio.transcribe`	Same — refuse
`content.description`	Returns the existing description unchanged
`content.translate`	Returns `[NEEDS TRANSLATION]` placeholder
`tutor.answer`	Returns top RAG snippet verbatim with citation, no synthesis

A capability without a fallback (vision.id_ocr, audio.transcribe) refuses and surfaces a "try again later" banner in the calling UI.
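
As an illustration of the table above, two fallbacks and the refusal path might look like this. It is a sketch under assumptions: the input and output field names (barRate, suggestedRate, adjustmentPct, translatedText, banner) and the function names are hypothetical, since the real capability schemas are not shown in this file. Only the values (0% adjustment, confidence=0, reason='deterministic_fallback', the `[NEEDS TRANSLATION]` placeholder, and the refusal error code) come from the table.

```typescript
// HYPOTHETICAL shapes; the real schemas live with each capability.
interface PricingSuggestInput {
  barRate: number; // best available rate (BAR) for the stay date
}

interface PricingSuggestOutput {
  suggestedRate: number;
  adjustmentPct: number;
  confidence: number;
  reason: string;
}

// pricing.suggest: return BAR unchanged (± 0%) and flag the deterministic path.
function pricingSuggestFallback(input: PricingSuggestInput): PricingSuggestOutput {
  return {
    suggestedRate: input.barRate,
    adjustmentPct: 0,
    confidence: 0,
    reason: "deterministic_fallback",
  };
}

// content.translate: return a visible placeholder instead of pretending
// that a translation was produced.
function contentTranslateFallback(): { translatedText: string } {
  return { translatedText: "[NEEDS TRANSLATION]" };
}

// Capabilities the table lists as having no safe deterministic output.
const NO_FALLBACK = new Set<string>(["vision.id_ocr", "audio.transcribe"]);

interface Refusal {
  code: "MELMASTOON.AI.PROVIDER_UNAVAILABLE";
  banner: string;
}

// Returns a refusal for capabilities without a fallback, otherwise null.
function refuseIfNoFallback(capability: string): Refusal | null {
  if (!NO_FALLBACK.has(capability)) return null;
  return { code: "MELMASTOON.AI.PROVIDER_UNAVAILABLE", banner: "try again later" };
}
```

Neither fallback calls a model, so both remain available while every provider breaker is open.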

4. Blast-radius limits

To avoid a single failing tenant or capability dragging down the whole service:

Per-tenant in-flight cap (default 50 concurrent inferences); exceeded → 429 MELMASTOON.AI.REFUSED_BUDGET with Retry-After.
Per-capability circuit breaker can mark a capability degraded for a tenant for 10 min if its failure rate spikes; UI gets a per-capability banner.
The Cloud Run revision uses request concurrency 40 and CPU-bound autoscaling; bursty edge replay traffic is throttled by a separate Cloud Run revision (-replay) with its own quota so it can never starve interactive traffic.
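
The per-tenant in-flight cap from the first item could be enforced with a small counter like the one below. This is a sketch only: the class name, the Admission shape, and the 1-second Retry-After value are assumptions, and a real deployment spanning several Cloud Run instances would need a shared store rather than process memory.

```typescript
type Admission =
  | { admitted: true }
  | { admitted: false; status: 429; code: string; retryAfterSeconds: number };

class TenantInFlightLimiter {
  private readonly inFlight = new Map<string, number>();

  constructor(
    private readonly cap: number = 50,             // default from the text above
    private readonly retryAfterSeconds: number = 1 // ASSUMED value for Retry-After
  ) {}

  // Reserves a slot for the tenant, or returns the 429 refusal to send back.
  tryAcquire(tenantId: string): Admission {
    const current = this.inFlight.get(tenantId) ?? 0;
    if (current >= this.cap) {
      return {
        admitted: false,
        status: 429,
        code: "MELMASTOON.AI.REFUSED_BUDGET",
        retryAfterSeconds: this.retryAfterSeconds,
      };
    }
    this.inFlight.set(tenantId, current + 1);
    return { admitted: true };
  }

  // Must be called exactly once per admitted request, including on errors.
  release(tenantId: string): void {
    const current = this.inFlight.get(tenantId) ?? 0;
    if (current <= 1) this.inFlight.delete(tenantId);
    else this.inFlight.set(tenantId, current - 1);
  }
}
```

A caller would wrap each inference in try/finally so that release() runs even when the provider call throws; otherwise a tenant's slots leak and it is eventually locked out.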

5. Runbook index

Each row in §1 has a runbook on Confluence under runbooks/ai/<error-code>, mirrored to docs/runbooks/ai/ in the documentation repo. Every runbook has the same shape:

Symptom
Probable cause
First-check (commands / dashboards / queries)
Mitigation (immediate)
Recovery (full)
Post-incident: what to add to the eval suite or red-team to prevent recurrence

6. Game-day exercises

Quarterly game days run the following injected failures against staging during business hours:

Force Vertex AI 503 for 10 minutes (verify fallback and SLO impact).
Drain Memorystore (verify cache-bypass behaviour).
Publish a manifest with a bad signature (verify desktop refusal).
Submit 500 prompt-injection requests (verify red-team detection rates).
Saturate HITL queue (verify alerting and auto-decision behaviour).

Exit criteria: either no SLO is breached during the exercise, or every breach has a root-cause follow-up ticket filed within 48 h.
