ai-orchestrator-service — Failure Modes
Companion to: docs/standards/ERROR_CODES.md · OBSERVABILITY.md · SECURITY_MODEL.md
The AI service has a wide failure surface: external providers, schema drift, budget pressure, HITL queue starvation, signed-artifact tampering, edge-replay drift, and PII leak vectors. This file enumerates each failure, its detection, blast radius, response, and runbook.
1. Failure-mode summary table
| # | Class | Mode | Detection | Blast radius | Response | Error code |
|---|---|---|---|---|---|---|
| 1 | provider | Vertex AI hard outage | health probe + 5xx rate from adapter | All cloud capabilities | Fallback chain → Anthropic → OpenAI → deterministic | MELMASTOON.AI.PROVIDER_UNAVAILABLE |
| 2 | provider | Vertex AI rate-limited | 429 from adapter | Tenant or capability burst | Token-bucket retry with jitter; on persistent 429 escalate to fallback | MELMASTOON.AI.PROVIDER_RATE_LIMITED |
| 3 | provider | Anthropic / OpenAI key revoked or expired | 401/403 from adapter | Fallback path only | Drop provider from chain for 5 min; page on-call; rotate key | MELMASTOON.AI.PROVIDER_AUTH_FAILED |
| 4 | provider | Provider returns malformed JSON repeatedly | Output validator after 1 repair retry | Capability calls only | Refuse, emit output_invalid; auto-degrade tenant after 5% failure rate over 1 000 calls | MELMASTOON.AI.OUTPUT_INVALID |
| 5 | budget | Tenant hard cap exceeded | Budget guard pre-call | All AI calls for that tenant for the period | If capability has degrade_on_budget=true → deterministic; else refuse | MELMASTOON.AI.REFUSED_BUDGET |
| 6 | budget | Soft warning crossed | Budget guard | None (informational) | Emit budget.warning.v1; surface banner in backoffice | n/a |
| 7 | safety | Pre-call moderation blocks input | Moderation port | Single request | Refuse 422 + audit row | MELMASTOON.AI.REFUSED_SAFETY |
| 8 | safety | Post-call moderation blocks output | Moderation port | Single request | Refuse 422; if capability has safety_fallback=template → return safe template | MELMASTOON.AI.REFUSED_SAFETY |
| 9 | safety | Prompt injection detected | Pattern filter + post-call schema mismatch | Single request | Refuse; increment injection_attempts_total; over threshold → tenant-level alert | MELMASTOON.AI.REFUSED_SAFETY |
| 10 | hitl | HITL gate timeout | Scheduled scanner every 60 s | Per gate | Auto-decision per defaultOnTimeout (reject for risky capabilities); emit gate_decided | n/a (auto-resolve) |
| 11 | hitl | HITL queue depth > 50 | Gauge alert | Tenant-wide | Page tenant ops + send autoresponder banner; do not auto-flush | n/a |
| 12 | edge | Edge model integrity check fails | Desktop SHA-256 verify on load | Single desktop | Refuse load; mark capability edge_unavailable; user sees banner; capability falls back to cloud or refuses if offline | MELMASTOON.AI.EDGE_MODEL_INTEGRITY_FAIL |
| 13 | edge | Edge manifest signature invalid | Desktop verify on download | Single desktop | Refuse manifest; retain previous trusted manifest; alert | MELMASTOON.AI.EDGE_MANIFEST_SIGNATURE_INVALID |
| 14 | edge | Edge inference output schema invalid | Local validator | Single request | Show "AI offline draft unavailable; tap to draft online when reconnected" | MELMASTOON.AI.OUTPUT_INVALID |
| 15 | data | RAG cross-tenant leak attempt | Defence-in-depth WHERE tenant assertion | Single request rejected | Reject with audit row + security.cross_tenant_attempt.v1 event | MELMASTOON.GENERAL.CROSS_TENANT_REFERENCE |
| 16 | data | pgvector index corruption | DB error on query | RAG capabilities | Failover to read replica; rebuild index | MELMASTOON.AI.PROVIDER_UNAVAILABLE (rare path) |
| 17 | data | Outbox publisher backlog > 5 min | Lag gauge | Eventual-consistency for downstream consumers | Scale outbox publisher; if persistent, page | n/a |
| 18 | data | Idempotency-Key collision (different bodies) | Hash mismatch on key reuse | Single request | 409 with explanation | MELMASTOON.GENERAL.IDEMPOTENCY_CONFLICT |
| 19 | infra | Postgres primary down | Connection failures | All writes | Cloud SQL HA failover (auto); retries succeed | MELMASTOON.GENERAL.STORAGE_UNAVAILABLE (transient) |
| 20 | infra | Memorystore down | Cache miss escalation + connection errors | Cache layer only | Open-loop bypass cache; warn; alert | n/a (degraded) |
| 21 | infra | KMS unavailable for manifest publish | Sign API errors | Manifest publish only | Retry; if persistent, hold publish; alert | MELMASTOON.AI.MANIFEST_SIGNING_UNAVAILABLE |
| 22 | infra | Pub/Sub publish failure | Publish error from outbox publisher | Eventual events | Outbox holds row; backoff retry; DLQ after N | n/a |
| 23 | quality | Eval canary drift > 5% on two consecutive runs | Nightly drift detector | Quality SLO | Page ai-engineering; consider rolling the prompt back to the prior active version | n/a |
| 24 | quality | A/B variant significantly worse | Bayesian early-stop on conversion metric | Subset of traffic | Auto-stop A/B; revert to control | n/a |
| 25 | identity | Device JWT expired or signature invalid | Authn guard | Single desktop session | Force re-bind via desktop UX | MELMASTOON.IDENTITY.DEVICE_NOT_BOUND |
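Rows 5 and 6 describe a pre-call budget guard. A minimal sketch of that decision follows, under stated assumptions: the TenantBudget shape, the checkBudget name, the cost units, and the choice to compare the projected spend (current spend plus estimated call cost) against the caps are all hypothetical. Only the degrade_on_budget behaviour and the MELMASTOON.AI.REFUSED_BUDGET code come from the table.

```typescript
// Hypothetical shapes; only the error code and the degrade/refuse rule come from the table.
type BudgetDecision =
  | { action: "allow"; warn: boolean } // warn=true would trigger budget.warning.v1
  | { action: "degrade" } // serve the deterministic fallback instead of a model
  | { action: "refuse"; errorCode: "MELMASTOON.AI.REFUSED_BUDGET" };

interface TenantBudget {
  spent: number; // spend so far in the current period (units are an assumption)
  softCap: number; // warning threshold
  hardCap: number; // refusal threshold
}

function checkBudget(
  budget: TenantBudget,
  estimatedCost: number,
  degradeOnBudget: boolean,
): BudgetDecision {
  // Assumption: the guard projects spend including the call about to be made.
  const projected = budget.spent + estimatedCost;
  if (projected > budget.hardCap) {
    return degradeOnBudget
      ? { action: "degrade" }
      : { action: "refuse", errorCode: "MELMASTOON.AI.REFUSED_BUDGET" };
  }
  // Below the hard cap the call proceeds; crossing the soft cap is informational.
  return { action: "allow", warn: projected >= budget.softCap };
}
```

Returning a tagged decision rather than throwing keeps the soft-warning path (informational) and the hard-cap path (degrade or refuse) in one place, so the caller can emit the event or the error code without re-deriving the thresholds.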
2. Cascading failures and circuit breakers
The router uses a per-(provider, model) circuit breaker:
- States: CLOSED, OPEN, HALF_OPEN
- Trip threshold: 50% failures over the last 50 calls (sliding window) OR ≥ 10 timeouts in 30 s
- Open duration: 60 s, then HALF_OPEN with 3 trial requests
When the breaker for vertex/gemini-1.5-flash opens for a capability, the router increments the counter provider.fallback_used_total{from='vertex',to='anthropic',capability_key=...} and keeps serving from the fallback provider until the breaker closes. Capabilities marked pii_class >= 'guest_pii' skip non-Vertex providers and degrade straight to deterministic when the breaker opens.
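The breaker parameters above can be sketched as a small state machine. This is an illustration, not the router's actual code: the CircuitBreaker class and its method names are hypothetical, the clock is injected for testability, and several behaviours are assumptions the text does not settle (the 50% rule applies only once the window holds 50 calls, timeouts count as failures in that window, any failed trial reopens the breaker, and three successful trials close it).

```typescript
type BreakerState = "CLOSED" | "OPEN" | "HALF_OPEN";
type Outcome = "success" | "failure" | "timeout";

class CircuitBreaker {
  private state: BreakerState = "CLOSED";
  private window: boolean[] = []; // last 50 calls, true = failed
  private timeouts: number[] = []; // timestamps (ms) of recent timeouts
  private openedAt = 0;
  private trialsStarted = 0;
  private trialsSucceeded = 0;
  private readonly now: () => number;

  constructor(now: () => number = () => Date.now()) {
    this.now = now;
  }

  current(): BreakerState {
    return this.state;
  }

  /** Returns true if a call may be sent to this (provider, model) pair. */
  allow(): boolean {
    if (this.state === "OPEN" && this.now() - this.openedAt >= 60_000) {
      this.state = "HALF_OPEN"; // open duration elapsed: start trials
      this.trialsStarted = 0;
      this.trialsSucceeded = 0;
    }
    if (this.state === "CLOSED") return true;
    if (this.state === "OPEN") return false;
    if (this.trialsStarted < 3) {
      this.trialsStarted += 1; // HALF_OPEN admits at most 3 trial requests
      return true;
    }
    return false;
  }

  /** Records the outcome of a call that allow() admitted. */
  record(outcome: Outcome): void {
    const t = this.now();
    if (this.state === "HALF_OPEN") {
      if (outcome !== "success") {
        this.trip(t); // assumption: any failed trial reopens the breaker
        return;
      }
      this.trialsSucceeded += 1;
      if (this.trialsSucceeded >= 3) this.reset();
      return;
    }
    if (this.state === "OPEN") return; // late result from before the trip
    this.window.push(outcome !== "success");
    if (this.window.length > 50) this.window.shift();
    if (outcome === "timeout") this.timeouts.push(t);
    this.timeouts = this.timeouts.filter((ts) => t - ts <= 30_000);
    const failures = this.window.filter(Boolean).length;
    const rateTripped = this.window.length === 50 && failures / 50 >= 0.5;
    if (rateTripped || this.timeouts.length >= 10) this.trip(t);
  }

  private trip(at: number): void {
    this.state = "OPEN";
    this.openedAt = at;
  }

  private reset(): void {
    this.state = "CLOSED";
    this.window = [];
    this.timeouts = [];
  }
}
```

A router would hold one instance per (provider, model) key, call allow() before dispatching, and fall through to the next provider in the chain (or to the deterministic fallback for PII-restricted capabilities) whenever it returns false.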
3. Deterministic fallbacks
Per-capability fallback functions live in src/fallbacks/<capability>.ts. They produce a schema-valid output without calling any model:
| Capability | Fallback |
|---|---|
| pricing.suggest | Returns BAR ± 0% with confidence=0 and reason='deterministic_fallback' |
| pricing.demand_forecast | Last-year same-day occupancy |
| housekeeping.route_optimize | OR-tools-only solution; LLM rationale field empty |
| staff.shift_optimize | LP-only solution |
| anomaly.detect | Static rule set (e.g. > 5 failed logins / 5 min) |
| upsell.recommend | Top-3 popular per segment lookup |
| message.draft | Template message with merge fields filled |
| review.summarize | Extractive top-3 sentences by TF-IDF |
| vision.id_ocr | Refuse with MELMASTOON.AI.PROVIDER_UNAVAILABLE (no safe deterministic OCR) |
| audio.transcribe | Refuse, same as vision.id_ocr |
| content.description | Returns the existing description unchanged |
| content.translate | Returns [NEEDS TRANSLATION] placeholder |
| tutor.answer | Returns top RAG snippet verbatim with citation, no synthesis |
A capability without a fallback (vision.id_ocr, audio.transcribe) refuses and surfaces a "try again later" banner in the calling UI.
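One way to picture the dispatch described in this section is a registry keyed by capability, where a missing entry means "refuse". The sketch below is an assumption-laden illustration: the registry, the runFallback function, and the input/output field names (bar, rate, adjustmentPct, existingDescription, text) are hypothetical, and the real modules live in separate files under src/fallbacks/. Only the capability keys, the fallback behaviours, and the error code are taken from the table.

```typescript
type Payload = Record<string, unknown>;
type Fallback = (input: Payload) => Payload;

type FallbackResult =
  | { kind: "output"; output: Payload }
  | { kind: "refused"; errorCode: string };

// Hypothetical in-memory registry covering three of the rows in the table.
const fallbacks = new Map<string, Fallback>([
  [
    "pricing.suggest",
    // BAR ± 0%: echo the best available rate unchanged, with zero confidence.
    (input) => ({
      rate: input["bar"],
      adjustmentPct: 0,
      confidence: 0,
      reason: "deterministic_fallback",
    }),
  ],
  [
    "content.description",
    (input) => ({ description: input["existingDescription"] }),
  ],
  ["content.translate", () => ({ text: "[NEEDS TRANSLATION]" })],
]);

function runFallback(capability: string, input: Payload): FallbackResult {
  const fallback = fallbacks.get(capability);
  if (fallback === undefined) {
    // No safe deterministic answer exists (e.g. vision.id_ocr, audio.transcribe).
    return { kind: "refused", errorCode: "MELMASTOON.AI.PROVIDER_UNAVAILABLE" };
  }
  return { kind: "output", output: fallback(input) };
}
```

Because the fallbacks never call a model, they can be exercised in unit tests against the same output schema validator that guards model responses.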
4. Blast-radius limits
To avoid a single failing tenant or capability dragging down the whole service:
- Per-tenant in-flight cap (default 50 concurrent inferences); exceeded → 429 MELMASTOON.AI.REFUSED_BUDGET with Retry-After.
- Per-capability circuit breaker can mark a capability degraded for a tenant for 10 min if its failure rate spikes; UI gets a per-capability banner.
- The Cloud Run revision uses request concurrency 40 and CPU-bound autoscaling; bursty edge replay traffic is throttled by a separate Cloud Run revision (-replay) with its own quota so it can never starve interactive traffic.
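The per-tenant in-flight cap from the first bullet can be sketched as a counter per tenant. This is a single-process illustration only: the InFlightLimiter class, the admit function, and the 1-second Retry-After value are hypothetical, and a service running several Cloud Run instances would need shared state or a per-instance share of the limit.

```typescript
type Admission =
  | { admitted: true }
  | {
      admitted: false;
      status: 429;
      errorCode: "MELMASTOON.AI.REFUSED_BUDGET";
      retryAfterSeconds: number;
    };

class InFlightLimiter {
  private readonly inFlight = new Map<string, number>();
  private readonly limit: number;

  constructor(limit: number = 50) {
    this.limit = limit; // default taken from the text: 50 concurrent inferences
  }

  /** Reserves a slot for the tenant, or returns false if the cap is reached. */
  tryAcquire(tenantId: string): boolean {
    const current = this.inFlight.get(tenantId) ?? 0;
    if (current >= this.limit) return false;
    this.inFlight.set(tenantId, current + 1);
    return true;
  }

  /** Frees a slot; must be called exactly once per successful tryAcquire. */
  release(tenantId: string): void {
    const current = this.inFlight.get(tenantId) ?? 0;
    if (current <= 1) this.inFlight.delete(tenantId);
    else this.inFlight.set(tenantId, current - 1);
  }
}

function admit(limiter: InFlightLimiter, tenantId: string): Admission {
  if (limiter.tryAcquire(tenantId)) return { admitted: true };
  return {
    admitted: false,
    status: 429,
    errorCode: "MELMASTOON.AI.REFUSED_BUDGET",
    retryAfterSeconds: 1, // assumption: the real Retry-After value is not specified
  };
}
```

Callers would wrap the inference in try/finally so that release() runs even when the provider call throws; otherwise a failing tenant would leak slots and lock itself out.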
5. Runbook index
Each row in §1 has a runbook on Confluence under runbooks/ai/<error-code>, mirrored to docs/runbooks/ai/ in the documentation repo. Every runbook has the same shape:
- Symptom
- Probable cause
- First-check (commands / dashboards / queries)
- Mitigation (immediate)
- Recovery (full)
- Post-incident: what to add to the eval suite or red-team to prevent recurrence
6. Game-day exercises
Quarterly game days run the following injected failures against staging during business hours:
- Force Vertex AI 503 for 10 minutes (verify fallback and SLO impact).
- Drain Memorystore (verify cache-bypass behaviour).
- Publish a manifest with a bad signature (verify desktop refusal).
- Submit 500 prompt-injection requests (verify red-team detection rates).
- Saturate HITL queue (verify alerting and auto-decision behaviour).
Exit criteria: either no SLO is breached during the exercise, or the root cause of every breach has a follow-up ticket filed within 48 h.