Failure Modes

:::info Source
Sourced from services/ai-gateway-service/FAILURE_MODES.md in the documentation repo.
:::

1. Scenarios

1.1 Primary Provider Outage

  • Fallback chain: primary → secondary → tertiary → local.
  • Alert on fallback usage > 10%.
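The fallback chain and the usage alert can be sketched as follows. This is a minimal illustration, not the gateway's actual code: `ProviderError`, `AllProvidersDown`, `call_with_fallback`, and `fallback_rate` are hypothetical names, and each provider is modeled as a plain callable.

```python
from typing import Callable, Sequence


class ProviderError(Exception):
    """Hypothetical error a provider callable raises when it cannot serve."""


class AllProvidersDown(Exception):
    """Hypothetical error raised when every provider in the chain failed."""


Provider = tuple[str, Callable[[str], str]]


def call_with_fallback(prompt: str, chain: Sequence[Provider]) -> tuple[str, str]:
    """Try each provider in order and return (provider_name, response)."""
    errors = []
    for name, call in chain:
        try:
            return name, call(prompt)
        except ProviderError as exc:
            errors.append(f"{name}: {exc}")
    raise AllProvidersDown("; ".join(errors))


def fallback_rate(served_by: Sequence[str], primary: str = "primary") -> float:
    """Fraction of requests that were not served by the primary provider."""
    if not served_by:
        return 0.0
    return sum(1 for name in served_by if name != primary) / len(served_by)
```

An alerting job could compare `fallback_rate(...)` over a recent window against the 10% threshold.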

1.2 All Providers Down

  • Refuse with ai.refused.provider (502).
  • Caller shows graceful UX ("AI temporarily unavailable").
  • Emit ai.provider.all_down.v1 alert.

1.3 Budget Race (Two Concurrent Callers)

  • An atomic UPDATE ... RETURNING debits the budget, so concurrent callers cannot over-spend.
  • Second caller sees budget exhausted.
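The idea behind the atomic debit is that the balance check and the decrement happen in one statement. The sketch below uses SQLite from the Python standard library as a stand-in for the real database, and checks `rowcount` where a Postgres query would read the row returned by `RETURNING`. The `budgets` table, its columns, and `try_spend` are assumptions made for illustration.

```python
import sqlite3


def try_spend(conn: sqlite3.Connection, tenant: str, cost: int) -> bool:
    """Debit `cost` only if enough budget remains; return True on success."""
    cur = conn.execute(
        # The WHERE clause makes check-and-decrement a single atomic step.
        "UPDATE budgets SET remaining = remaining - ? "
        "WHERE tenant = ? AND remaining >= ?",
        (cost, tenant, cost),
    )
    conn.commit()
    # No row updated means the budget was insufficient.
    return cur.rowcount == 1


# Hypothetical schema and seed data for the example.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE budgets (tenant TEXT PRIMARY KEY, remaining INTEGER NOT NULL)"
)
conn.execute("INSERT INTO budgets VALUES ('acme', 100)")
conn.commit()
```

With a starting budget of 100, a first `try_spend(conn, "acme", 80)` succeeds and a second identical call is refused, which matches the "second caller sees budget exhausted" behavior.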

1.4 Cache Stampede

  • Request coalescing (singleflight): multiple callers for the same key share one provider call.
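A minimal asyncio sketch of request coalescing is shown below. `SingleFlight` is a hypothetical helper, not the service's implementation, and it does not handle cancellation of the leading caller.

```python
import asyncio
from typing import Awaitable, Callable


class SingleFlight:
    """Coalesce concurrent calls for the same key into one in-flight call."""

    def __init__(self) -> None:
        self._inflight: dict[str, asyncio.Future] = {}

    async def do(self, key: str, fn: Callable[[], Awaitable[str]]) -> str:
        existing = self._inflight.get(key)
        if existing is not None:
            # A call for this key is already running: wait for its result.
            return await existing
        fut = asyncio.ensure_future(fn())
        self._inflight[key] = fut
        try:
            return await fut
        finally:
            # Later callers start a fresh call once this one has finished.
            self._inflight.pop(key, None)
```

Five concurrent `do("k", provider)` calls would trigger a single `provider()` invocation and all receive the same result.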

1.5 Safety Classifier False Positive

  • The classifier overblocks legitimate content.
  • Mitigation: admin review queue for blocked requests; thresholds tuned quarterly; override path with audit.

1.6 Output Schema Violation (Provider Returned Bad JSON)

  • One retry with stricter system prompt.
  • If the output is still invalid, emit ai.refused.schema and return an error to the caller.
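The single retry with a stricter system prompt might look like the sketch below. The provider is modeled as a callable taking `(system_prompt, user_prompt)`, and `SchemaRefused`, `STRICT_SUFFIX`, and the required-keys check are assumptions standing in for the real schema validation and the `ai.refused.schema` event.

```python
import json
from typing import Callable, Iterable


class SchemaRefused(Exception):
    """Hypothetical stand-in for emitting ai.refused.schema."""


# Assumed wording for the stricter retry prompt.
STRICT_SUFFIX = "\nRespond with a single valid JSON object and nothing else."


def call_with_schema_retry(
    call_provider: Callable[[str, str], str],
    system_prompt: str,
    user_prompt: str,
    required_keys: Iterable[str],
) -> dict:
    """Call once, then retry once with a stricter system prompt."""
    required = set(required_keys)
    for prompt in (system_prompt, system_prompt + STRICT_SUFFIX):
        raw = call_provider(prompt, user_prompt)
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue  # bad JSON: fall through to the stricter attempt
        if isinstance(parsed, dict) and required.issubset(parsed):
            return parsed
    raise SchemaRefused("ai.refused.schema")
```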

1.7 Prompt Version Rollback

  • The previous version is still available in the DB.
  • Flip its status back to active and mark the current version deprecated.
  • Consumers pick up on next fetch (60s).

1.8 Provider API Key Leaked

  • Rotate immediately (KMS + provider console).
  • Old key revoked; deploy new key to all pods.
  • Audit log check for unauthorized calls.

1.9 Local Model Crashes

  • Restart GPU pod; route traffic to cloud fallback.
  • Alert if the crash rate is high.

1.10 Streaming Connection Drop (SSE)

  • The client reconnects; the server replays its buffer if the reconnect falls within the 30s window, otherwise it re-issues the request.
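One way to model the replay decision is a time-bounded event buffer keyed by event id, where the client reports the last id it received (as the SSE `Last-Event-ID` header allows). The sketch assumes the 30s window means events are retained for 30 seconds; `ReplayBuffer` and its methods are hypothetical.

```python
import time
from collections import deque
from typing import Callable, Optional


class ReplayBuffer:
    """Keep recent events so a reconnecting client can catch up."""

    def __init__(self, window_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._window_s = window_s
        self._clock = clock
        self._events: deque = deque()  # entries are (event_id, timestamp, data)
        self._next_id = 1

    def _evict(self, now: float) -> None:
        while self._events and now - self._events[0][1] > self._window_s:
            self._events.popleft()

    def append(self, data: str) -> int:
        now = self._clock()
        self._evict(now)
        event_id = self._next_id
        self._next_id += 1
        self._events.append((event_id, now, data))
        return event_id

    def replay_after(self, last_event_id: int) -> Optional[list]:
        """Return missed (id, data) pairs, or None if the caller must re-issue."""
        self._evict(self._clock())
        first_needed = last_event_id + 1
        if first_needed >= self._next_id:
            return []  # client is already up to date
        if not self._events or self._events[0][0] > first_needed:
            return None  # part of the gap has aged out of the window
        return [(eid, data) for eid, _, data in self._events if eid > last_event_id]
```

A `None` result corresponds to the "re-issues" branch: the missed events are gone, so the request has to run again.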

2. Retry / Backoff

| Op | Max | Backoff |
|---|---|---|
| Provider call | 2 | 200ms, 1s |
| Postgres | 3 | 10ms–200ms |
| Redis | 3 | 10ms–100ms |
| Outbox | infinite | exp cap 5m |
| Embedding gen | 3 | 200ms, 1s, 3s |
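The retry policies can be expressed as delay schedules. The sketch below assumes Max counts retries (so a provider call gets delays of 0.2s and 1s), and that the outbox schedule doubles from a 1s base up to the 5-minute cap; the base value and the helper names `retry` and `capped_exponential` are assumptions.

```python
import time
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")


def retry(op: Callable[[], T], delays: Iterable[float],
          sleep: Callable[[float], None] = time.sleep) -> T:
    """Call op; after each failure wait for the next delay and try again."""
    schedule = iter(delays)
    while True:
        try:
            return op()
        except Exception:
            delay = next(schedule, None)
            if delay is None:
                raise  # schedule exhausted: surface the last error
            sleep(delay)


def capped_exponential(base_s: float = 1.0, cap_s: float = 300.0) -> Iterator[float]:
    """Endless doubling delays that never exceed cap_s."""
    delay = base_s
    while True:
        yield min(delay, cap_s)
        delay = min(delay * 2, cap_s)


# Schedules taken from the table, in seconds.
PROVIDER_CALL_DELAYS = [0.2, 1.0]
EMBEDDING_GEN_DELAYS = [0.2, 1.0, 3.0]
```

For the outbox, `retry(publish, capped_exponential())` would keep retrying indefinitely, where `publish` is whatever callable delivers the message.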

3. Circuit Breakers

Per provider: 10 failures within 30s open the breaker for 60s. Safety classifier: 20 failures within 60s open the breaker for 120s.
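A sliding-window breaker with these thresholds could be sketched as below. `CircuitBreaker` is a hypothetical class, and this version has no half-open probing state: once the open period elapses it simply closes again.

```python
import time
from collections import deque
from typing import Callable, Optional


class CircuitBreaker:
    """Open after max_failures within window_s; stay open for open_s."""

    def __init__(self, max_failures: int, window_s: float, open_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max_failures = max_failures
        self._window_s = window_s
        self._open_s = open_s
        self._clock = clock
        self._failures: deque = deque()  # timestamps of recent failures
        self._open_until: Optional[float] = None

    def allow(self) -> bool:
        """Return True if a call may proceed."""
        if self._open_until is None:
            return True
        if self._clock() >= self._open_until:
            # Open period elapsed: close the breaker and start counting afresh.
            self._open_until = None
            self._failures.clear()
            return True
        return False

    def record_failure(self) -> None:
        now = self._clock()
        self._failures.append(now)
        # Drop failures that have fallen out of the sliding window.
        while self._failures and now - self._failures[0] > self._window_s:
            self._failures.popleft()
        if len(self._failures) >= self._max_failures:
            self._open_until = now + self._open_s


# Thresholds from the text above.
provider_breaker = CircuitBreaker(max_failures=10, window_s=30.0, open_s=60.0)
safety_breaker = CircuitBreaker(max_failures=20, window_s=60.0, open_s=120.0)
```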

4. Fallbacks

| Primary | Fallback |
|---|---|
| Cloud model (large) | Cloud (smaller) → local |
| Real-time safety | Local classifier |
| KMS provider-key fetch | Cached in-memory (5 min) |
| Live eval | Stored eval score |
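As an illustration of the KMS row, the sketch below falls back to an in-memory copy of the key when the fetch fails, as long as that copy is at most 5 minutes old. This reads "5 min" as a maximum age for the cached key; `KmsUnavailable` and `CachedKeyFetcher` are hypothetical names, and the fetch function is injected rather than tied to a real KMS client.

```python
import time
from typing import Callable, Optional


class KmsUnavailable(Exception):
    """Hypothetical error raised when the KMS fetch fails."""


class CachedKeyFetcher:
    """Fetch the provider key, falling back to a recent in-memory copy."""

    def __init__(self, fetch: Callable[[], str], max_age_s: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch
        self._max_age_s = max_age_s
        self._clock = clock
        self._cached: Optional[tuple[str, float]] = None  # (key, fetched_at)

    def get(self) -> str:
        now = self._clock()
        try:
            key = self._fetch()
        except KmsUnavailable:
            cached = self._cached
            if cached is not None and now - cached[1] <= self._max_age_s:
                return cached[0]  # serve the cached key during the outage
            raise  # no usable copy: propagate the failure
        self._cached = (key, now)
        return key
```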

5. Chaos

  • 30% provider timeout → fallback chain completes.
  • Budget exhausted mid-session → refuse gracefully.
  • Cache delete during call → recompute + store.
  • Local model OOM → fallback to cloud.
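The first chaos scenario can be approximated in a unit test by injecting timeouts into a provider with a seeded random generator and checking that every request still completes. The helpers `with_injected_timeouts` and `run_chain` are hypothetical, and providers are modeled as plain callables.

```python
import random
from typing import Callable, Sequence


def with_injected_timeouts(call: Callable[[str], str], rate: float,
                           rng: random.Random) -> Callable[[str], str]:
    """Wrap a provider so that roughly `rate` of its calls time out."""
    def wrapped(prompt: str) -> str:
        if rng.random() < rate:
            raise TimeoutError("injected provider timeout")
        return call(prompt)
    return wrapped


def run_chain(prompt: str,
              chain: Sequence[tuple[str, Callable[[str], str]]]) -> tuple[str, str]:
    """Return (provider_name, response) from the first provider that answers."""
    for name, call in chain:
        try:
            return name, call(prompt)
        except TimeoutError:
            continue
    raise RuntimeError("all providers failed")


# Hypothetical experiment: 30% of primary calls time out, local always answers.
rng = random.Random(0)
chain = [
    ("primary", with_injected_timeouts(lambda p: "cloud answer", 0.3, rng)),
    ("local", lambda p: "local answer"),
]
served_by = [run_chain("hello", chain)[0] for _ in range(1000)]
```

The experiment passes if `served_by` contains one entry per request, meaning no request failed outright.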