
ai-orchestrator-service — Failure Modes

Companion to: docs/standards/ERROR_CODES.md · OBSERVABILITY.md · SECURITY_MODEL.md

The AI service has a wide failure surface: external providers, schema drift, budget pressure, HITL queue starvation, signed-artifact tampering, edge-replay drift, and PII leak vectors. This file enumerates each failure, its detection, blast radius, response, and runbook.

1. Failure-mode summary table

| # | Class | Mode | Detection | Blast radius | Response | Error code |
|---|-------|------|-----------|--------------|----------|------------|
| 1 | provider | Vertex AI hard outage | health probe + 5xx rate from adapter | All cloud capabilities | Fallback chain → Anthropic → OpenAI → deterministic | MELMASTOON.AI.PROVIDER_UNAVAILABLE |
| 2 | provider | Vertex AI rate-limited | 429 from adapter | Tenant or capability burst | Token-bucket retry with jitter; on persistent 429 escalate to fallback | MELMASTOON.AI.PROVIDER_RATE_LIMITED |
| 3 | provider | Anthropic / OpenAI key revoked or expired | 401/403 from adapter | Fallback path only | Drop provider from chain for 5 min; page on-call; rotate key | MELMASTOON.AI.PROVIDER_AUTH_FAILED |
| 4 | provider | Provider returns malformed JSON repeatedly | Output validator after 1 repair retry | Capability calls only | Refuse, emit output_invalid; auto-degrade tenant after 5% failure rate over 1 000 calls | MELMASTOON.AI.OUTPUT_INVALID |
| 5 | budget | Tenant hard cap exceeded | Budget guard pre-call | All AI calls for that tenant for the period | If capability has degrade_on_budget=true → deterministic; else refuse | MELMASTOON.AI.REFUSED_BUDGET |
| 6 | budget | Soft warning crossed | Budget guard | None (informational) | Emit budget.warning.v1; surface banner in backoffice | n/a |
| 7 | safety | Pre-call moderation blocks input | Moderation port | Single request | Refuse 422 + audit row | MELMASTOON.AI.REFUSED_SAFETY |
| 8 | safety | Post-call moderation blocks output | Moderation port | Single request | Refuse 422; if capability has safety_fallback=template → return safe template | MELMASTOON.AI.REFUSED_SAFETY |
| 9 | safety | Prompt injection detected | Pattern filter + post-call schema mismatch | Single request | Refuse; increment injection_attempts_total; over threshold → tenant-level alert | MELMASTOON.AI.REFUSED_SAFETY |
| 10 | hitl | HITL gate timeout | Scheduled scanner every 60 s | Per gate | Auto-decision per defaultOnTimeout (reject for risky capabilities); emit gate_decided | n/a (auto-resolve) |
| 11 | hitl | HITL queue depth > 50 | Gauge alert | Tenant-wide | Page tenant ops + send autoresponder banner; do not auto-flush | n/a |
| 12 | edge | Edge model integrity check fails | Desktop SHA-256 verify on load | Single desktop | Refuse load; mark capability edge_unavailable; user sees banner; capability falls back to cloud or refuses if offline | MELMASTOON.AI.EDGE_MODEL_INTEGRITY_FAIL |
| 13 | edge | Edge manifest signature invalid | Desktop verify on download | Single desktop | Refuse manifest; retain previous trusted manifest; alert | MELMASTOON.AI.EDGE_MANIFEST_SIGNATURE_INVALID |
| 14 | edge | Edge inference output schema invalid | Local validator | Single request | Show "AI offline draft unavailable; tap to draft online when reconnected" | MELMASTOON.AI.OUTPUT_INVALID |
| 15 | data | RAG cross-tenant leak attempt | Defence-in-depth WHERE tenant assertion | Single request rejected | Reject with audit row + security.cross_tenant_attempt.v1 event | MELMASTOON.GENERAL.CROSS_TENANT_REFERENCE |
| 16 | data | pgvector index corruption | DB error on query | RAG capabilities | Failover to read replica; rebuild index | MELMASTOON.AI.PROVIDER_UNAVAILABLE (rare path) |
| 17 | data | Outbox publisher backlog > 5 min | Lag gauge | Eventual-consistency for downstream consumers | Scale outbox publisher; if persistent, page | n/a |
| 18 | data | Idempotency-Key collision (different bodies) | Hash mismatch on key reuse | Single request | 409 with explanation | MELMASTOON.GENERAL.IDEMPOTENCY_CONFLICT |
| 19 | infra | Postgres primary down | Connection failures | All writes | Cloud SQL HA failover (auto); retries succeed | MELMASTOON.GENERAL.STORAGE_UNAVAILABLE (transient) |
| 20 | infra | Memorystore down | Cache miss escalation + connection errors | Cache layer only | Open-loop bypass cache; warn; alert | n/a (degraded) |
| 21 | infra | KMS unavailable for manifest publish | Sign API errors | Manifest publish only | Retry; if persistent, hold publish; alert | MELMASTOON.AI.MANIFEST_SIGNING_UNAVAILABLE |
| 22 | infra | Pub/Sub publish failure | Publish error from outbox publisher | Eventual events | Outbox holds row; backoff retry; DLQ after N | n/a |
| 23 | quality | Eval canary drift > 5% twice | Nightly drift detector | Quality SLO | Page ai-engineering; consider rollback prompt to prior active | n/a |
| 24 | quality | A/B variant significantly worse | Bayesian early-stop on conversion metric | Subset of traffic | Auto-stop A/B; revert to control | n/a |
| 25 | identity | Device JWT expired or signature invalid | Authn guard | Single desktop session | Force re-bind via desktop UX | MELMASTOON.IDENTITY.DEVICE_NOT_BOUND |
25identityDevice JWT expired or signature invalidAuthn guardSingle desktop sessionForce re-bind via desktop UXMELMASTOON.IDENTITY.DEVICE_NOT_BOUND

2. Cascading failures and circuit breakers

The router uses a per-(provider, model) circuit breaker:

states: CLOSED, OPEN, HALF_OPEN
trip threshold: 50% failures over the last 50 calls (sliding window) OR ≥ 10 timeouts in 30 s
open duration: 60 s, then HALF_OPEN with 3 trial requests
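
A minimal sketch of a breaker with these parameters follows. The thresholds come from the spec above; everything else is an assumption made for illustration: the class and method names, evaluating the failure rate only once the 50-call window is full, counting a timeout as a failure, and closing only after all 3 trial requests succeed (any trial failure re-opens the breaker).

```typescript
// Hypothetical sketch of the per-(provider, model) breaker; names are illustrative.
type BreakerState = "CLOSED" | "OPEN" | "HALF_OPEN";
type Outcome = "success" | "failure" | "timeout";

const WINDOW_SIZE = 50;           // sliding window of calls
const FAILURE_RATIO = 0.5;        // 50% failures trips the breaker
const TIMEOUT_LIMIT = 10;         // >= 10 timeouts ...
const TIMEOUT_WINDOW_MS = 30_000; // ... within 30 s
const OPEN_DURATION_MS = 60_000;  // stay OPEN for 60 s
const TRIAL_REQUESTS = 3;         // trial requests admitted in HALF_OPEN

class CircuitBreaker {
  private state: BreakerState = "CLOSED";
  private failed: boolean[] = [];  // true = failed; last WINDOW_SIZE calls
  private timeouts: number[] = []; // timestamps (ms) of recent timeouts
  private openedAt = 0;
  private trialsAdmitted = 0;
  private trialsSucceeded = 0;

  getState(): BreakerState {
    return this.state;
  }

  // Returns true if a request may be sent at time `now` (milliseconds).
  allow(now: number): boolean {
    if (this.state === "OPEN" && now - this.openedAt >= OPEN_DURATION_MS) {
      this.state = "HALF_OPEN";
      this.trialsAdmitted = 0;
      this.trialsSucceeded = 0;
    }
    if (this.state === "CLOSED") return true;
    if (this.state === "OPEN") return false;
    if (this.trialsAdmitted >= TRIAL_REQUESTS) return false;
    this.trialsAdmitted += 1;
    return true;
  }

  // Records the outcome of a request that allow() admitted.
  record(outcome: Outcome, now: number): void {
    if (this.state === "OPEN") return; // late result from before the trip
    if (this.state === "HALF_OPEN") {
      if (outcome !== "success") {
        this.trip(now); // assumption: any failed trial re-opens the breaker
      } else {
        this.trialsSucceeded += 1;
        if (this.trialsSucceeded >= TRIAL_REQUESTS) this.close();
      }
      return;
    }
    this.failed.push(outcome !== "success");
    if (this.failed.length > WINDOW_SIZE) this.failed.shift();
    if (outcome === "timeout") this.timeouts.push(now);
    this.timeouts = this.timeouts.filter((t) => now - t < TIMEOUT_WINDOW_MS);

    const failures = this.failed.filter((f) => f).length;
    const rateTripped =
      this.failed.length === WINDOW_SIZE && failures / WINDOW_SIZE >= FAILURE_RATIO;
    if (rateTripped || this.timeouts.length >= TIMEOUT_LIMIT) this.trip(now);
  }

  private trip(now: number): void {
    this.state = "OPEN";
    this.openedAt = now;
    this.clearWindows();
  }

  private close(): void {
    this.state = "CLOSED";
    this.clearWindows();
  }

  private clearWindows(): void {
    this.failed = [];
    this.timeouts = [];
  }
}
```

Passing `now` explicitly rather than reading the clock inside the class makes the state transitions deterministic under test.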

When the breaker for vertex/gemini-1.5-flash opens for a capability, the router increments the counter provider.fallback_used_total{from='vertex',to='anthropic',capability_key=...} and keeps serving from the fallback until the breaker closes. Capabilities marked pii_class >= 'guest_pii' skip non-Vertex providers: when the breaker opens, they degrade directly to the deterministic fallback.
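
The routing rule can be sketched as a pure function. The chain order follows row 1 of the summary table; the function name and the PII class ordering are assumptions (only 'guest_pii' appears in this document, so the list below contains one invented lower class purely for illustration).

```typescript
// Hypothetical sketch of provider selection under open breakers.
type Provider = "vertex" | "anthropic" | "openai";
type Target = Provider | "deterministic";

// Fallback order taken from row 1 of the summary table.
const CHAIN: Provider[] = ["vertex", "anthropic", "openai"];

// Assumed ordering from least to most sensitive; "none" is an invented
// placeholder, and the real class list may contain more entries.
const PII_ORDER = ["none", "guest_pii"] as const;
type PiiClass = (typeof PII_ORDER)[number];

function selectTarget(
  piiClass: PiiClass,
  isBreakerOpen: (provider: Provider) => boolean,
): Target {
  // pii_class >= 'guest_pii' may only be served by Vertex.
  const vertexOnly =
    PII_ORDER.indexOf(piiClass) >= PII_ORDER.indexOf("guest_pii");
  const candidates = vertexOnly ? CHAIN.filter((p) => p === "vertex") : CHAIN;
  // First provider whose breaker is not open wins; otherwise go deterministic.
  const available = candidates.find((p) => !isBreakerOpen(p));
  return available ?? "deterministic";
}
```

With the Vertex breaker open, `selectTarget("none", ...)` moves to Anthropic while `selectTarget("guest_pii", ...)` goes straight to the deterministic fallback.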

3. Deterministic fallbacks

Per-capability fallback functions live in src/fallbacks/<capability>.ts. They produce a schema-valid output without calling any model:

| Capability | Fallback |
|------------|----------|
| pricing.suggest | Returns BAR ± 0% with confidence=0 and reason='deterministic_fallback' |
| pricing.demand_forecast | Last-year same-day occupancy |
| housekeeping.route_optimize | OR-tools-only solution; LLM rationale field empty |
| staff.shift_optimize | LP-only solution |
| anomaly.detect | Static rule set (e.g. > 5 failed logins / 5 min) |
| upsell.recommend | Top-3 popular per segment lookup |
| message.draft | Template message with merge fields filled |
| review.summarize | Extractive top-3 sentences by TF-IDF |
| vision.id_ocr | Refuse with MELMASTOON.AI.PROVIDER_UNAVAILABLE (no safe deterministic OCR) |
| audio.transcribe | Same: refuse |
| content.description | Returns the existing description unchanged |
| content.translate | Returns [NEEDS TRANSLATION] placeholder |
| tutor.answer | Returns top RAG snippet verbatim with citation, no synthesis |

A capability without a fallback (vision.id_ocr, audio.transcribe) refuses and surfaces a "try again later" banner in the calling UI.
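
As an illustration, three of these fallbacks might be sketched as below. The input and output shapes, the function names, the AiRefusal class, and the `{{field}}` merge syntax are all assumptions; the real capability schemas are not shown in this document.

```typescript
// Hypothetical sketches of deterministic fallbacks; no model is called.

interface PricingSuggestInput {
  barRate: number; // current best available rate
}

interface PricingSuggestOutput {
  suggestedRate: number;
  adjustmentPct: number;
  confidence: number;
  reason: string;
}

// pricing.suggest: return BAR unchanged with zero confidence.
function pricingSuggestFallback(input: PricingSuggestInput): PricingSuggestOutput {
  return {
    suggestedRate: input.barRate,
    adjustmentPct: 0,
    confidence: 0,
    reason: "deterministic_fallback",
  };
}

// message.draft: fill merge fields in a template. The {{field}} syntax is
// assumed; unknown fields are left in place rather than blanked.
function messageDraftFallback(
  template: string,
  fields: Record<string, string>,
): string {
  return template.replace(
    /\{\{(\w+)\}\}/g,
    (match: string, key: string) => fields[key] ?? match,
  );
}

// Error type carrying the service error code for a refusal.
class AiRefusal extends Error {
  readonly code: string;

  constructor(code: string) {
    super(code);
    this.code = code;
  }
}

// vision.id_ocr: no safe deterministic OCR exists, so refuse.
function visionIdOcrFallback(): never {
  throw new AiRefusal("MELMASTOON.AI.PROVIDER_UNAVAILABLE");
}
```

Leaving unknown merge fields visible is a deliberate choice in this sketch: a reviewer can spot `{{room}}` in a draft, whereas a silently blanked field is easy to miss.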

4. Blast-radius limits

To avoid a single failing tenant or capability dragging down the whole service:

  • Per-tenant in-flight cap (default 50 concurrent inferences); exceeded → 429 MELMASTOON.AI.REFUSED_BUDGET with Retry-After.
  • Per-capability circuit breaker can mark a capability degraded for a tenant for 10 min if its failure rate spikes; UI gets a per-capability banner.
  • The Cloud Run revision uses request concurrency 40 and CPU-based autoscaling; bursty edge replay traffic is handled by a separate Cloud Run revision (-replay) with its own quota, so it cannot starve interactive traffic.
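
The per-tenant in-flight cap in the first bullet could be enforced with a small counter, sketched below. The class and function names and the Retry-After value of 1 second are assumptions; the sketch is in-memory and per-instance, so a real deployment with multiple instances would need a shared store or a per-instance share of the cap.

```typescript
// Hypothetical sketch of a per-tenant concurrency cap (single instance only).
const DEFAULT_IN_FLIGHT_CAP = 50;

class TenantInFlightLimiter {
  private readonly inFlight = new Map<string, number>();
  private readonly cap: number;

  constructor(cap: number = DEFAULT_IN_FLIGHT_CAP) {
    this.cap = cap;
  }

  // Reserves a slot for the tenant; returns false when the cap is reached.
  tryAcquire(tenantId: string): boolean {
    const current = this.inFlight.get(tenantId) ?? 0;
    if (current >= this.cap) return false;
    this.inFlight.set(tenantId, current + 1);
    return true;
  }

  // Frees a slot previously reserved with tryAcquire().
  release(tenantId: string): void {
    const current = this.inFlight.get(tenantId) ?? 0;
    if (current <= 1) {
      this.inFlight.delete(tenantId);
    } else {
      this.inFlight.set(tenantId, current - 1);
    }
  }
}

type Limited<T> =
  | { status: 200; body: T }
  | { status: 429; code: string; retryAfterSeconds: number };

async function runWithCap<T>(
  limiter: TenantInFlightLimiter,
  tenantId: string,
  inference: () => Promise<T>,
): Promise<Limited<T>> {
  if (!limiter.tryAcquire(tenantId)) {
    // Retry-After of 1 s is an illustrative value, not a documented setting.
    return { status: 429, code: "MELMASTOON.AI.REFUSED_BUDGET", retryAfterSeconds: 1 };
  }
  try {
    return { status: 200, body: await inference() };
  } finally {
    limiter.release(tenantId); // always free the slot, even on error
  }
}
```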

5. Runbook index

Each row in §1 has a runbook on Confluence under runbooks/ai/<error-code>, mirrored to docs/runbooks/ai/ in the documentation repo. Every runbook has the same shape:

  1. Symptom
  2. Probable cause
  3. First-check (commands / dashboards / queries)
  4. Mitigation (immediate)
  5. Recovery (full)
  6. Post-incident: what to add to the eval suite or red-team to prevent recurrence

6. Game-day exercises

Quarterly game days run the following injected failures against staging during business hours:

  • Force Vertex AI 503 for 10 minutes (verify fallback and SLO impact).
  • Drain Memorystore (verify cache-bypass behaviour).
  • Publish a manifest with a bad signature (verify desktop refusal).
  • Submit 500 prompt-injection requests (verify red-team detection rates).
  • Saturate HITL queue (verify alerting and auto-decision behaviour).

Exit criteria: either no SLO is breached during the exercise, or every breach has a root-cause follow-up ticket filed within 48 h.