Failure Modes
:::info Source
Sourced from services/ai-gateway-service/FAILURE_MODES.md in the documentation repo.
:::
1. Scenarios
1.1 Primary Provider Outage
- Fallback chain: primary → secondary → tertiary → local.
- Alert on fallback usage > 10%.
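
The fallback chain and the 10% alert threshold above can be sketched roughly as follows. This is a minimal illustration, not the gateway's actual code: `FallbackChain`, `ProviderError`, `AllProvidersDown`, and the provider callables are all hypothetical names, and the sketch assumes "fallback usage" means the share of requests served by anything other than the first provider.

```python
from typing import Callable, List, Sequence, Tuple


class ProviderError(Exception):
    """Hypothetical: raised by a provider callable that cannot serve a request."""


class AllProvidersDown(Exception):
    """Hypothetical: raised when every provider in the chain has failed."""


class FallbackChain:
    def __init__(
        self,
        providers: Sequence[Tuple[str, Callable[[str], str]]],
        alert_ratio: float = 0.10,  # 10% threshold from the text
    ) -> None:
        self.providers: List[Tuple[str, Callable[[str], str]]] = list(providers)
        self.alert_ratio = alert_ratio
        self.total = 0
        self.fallbacks = 0

    def call(self, prompt: str) -> Tuple[str, str]:
        """Try providers in order; return (provider_name, result)."""
        self.total += 1
        for index, (name, provider) in enumerate(self.providers):
            try:
                result = provider(prompt)
            except ProviderError:
                continue  # move on to the next provider in the chain
            if index > 0:
                self.fallbacks += 1  # served by a non-primary provider
            return name, result
        raise AllProvidersDown("no provider could serve the request")

    def should_alert(self) -> bool:
        """True when the observed fallback share exceeds the threshold."""
        return self.total > 0 and self.fallbacks / self.total > self.alert_ratio
```

A production version would likely compute the ratio over a sliding time window rather than over the process lifetime.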
1.2 All Providers Down
- Refuse with `ai.refused.provider` (502).
- Caller shows graceful UX ("AI temporarily unavailable").
- Emit `ai.provider.all_down.v1` alert.
1.3 Budget Race (Two Concurrent Callers)
- Atomic UPDATE ... RETURNING ensures no over-spend.
- Second caller sees budget exhausted.
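
A sketch of the conditional decrement, under stated assumptions: the `budgets` table and its `tenant_id` / `remaining` columns are hypothetical, and the example uses the standard-library `sqlite3` module with `rowcount` so that it runs anywhere. In Postgres the same statement would typically end with `RETURNING remaining`, and an empty result would signal an exhausted budget.

```python
import sqlite3


def try_spend(conn: sqlite3.Connection, tenant_id: str, cost: int) -> bool:
    """Deduct `cost` only if enough budget remains.

    The check and the decrement happen in one UPDATE statement, so two
    concurrent callers cannot both pass the check against the same balance.
    Returns True when the spend was applied, False when budget is exhausted.
    """
    with conn:  # commits on success, rolls back on exception
        cur = conn.execute(
            "UPDATE budgets SET remaining = remaining - ? "
            "WHERE tenant_id = ? AND remaining >= ?",
            (cost, tenant_id, cost),
        )
    # No matching row means the WHERE guard rejected the spend.
    return cur.rowcount == 1
```

The key design point is that there is no separate read-then-write step; the guard lives in the `WHERE` clause.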
1.4 Cache Stampede
- Request coalescing (singleflight): multiple concurrent callers with the same key share one provider call.
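
Request coalescing can be illustrated with a small thread-based sketch. The `SingleFlight` class below is a hypothetical stand-in written for this example (the name echoes Go's `singleflight` package, but nothing here is taken from the gateway): the first caller for a key runs the function, and callers that arrive while it is in flight wait and receive the same result or the same exception.

```python
import threading
from typing import Any, Callable, Dict, Optional


class _Call:
    """State shared between the leader and the waiting followers."""

    def __init__(self) -> None:
        self.done = threading.Event()
        self.result: Any = None
        self.error: Optional[Exception] = None


class SingleFlight:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._inflight: Dict[str, _Call] = {}

    def do(self, key: str, fn: Callable[[], Any]) -> Any:
        with self._lock:
            call = self._inflight.get(key)
            leader = call is None
            if leader:
                call = _Call()
                self._inflight[key] = call
        if leader:
            try:
                call.result = fn()  # the only real provider call for this key
            except Exception as exc:
                call.error = exc
            finally:
                with self._lock:
                    del self._inflight[key]
                call.done.set()  # wake every follower
        else:
            call.done.wait()  # follower: wait for the leader's outcome
        if call.error is not None:
            raise call.error
        return call.result
```

Results are shared only while a call is in flight; storing the result for later callers remains the cache's job.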
1.5 Safety Classifier False Positive
- Overblocks legitimate content.
- Mitigation: admin review queue for blocked requests; thresholds tuned quarterly; override path with audit.
1.6 Output Schema Violation (Provider Returned Bad JSON)
- One retry with stricter system prompt.
- Still invalid → emit `ai.refused.schema`; the caller receives an error.
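
The retry-once flow might look like the sketch below. Everything in it is an assumption made for illustration: the `provider(system_prompt, user_prompt)` callable, the wording of the stricter prompt suffix, the `SchemaRefused` exception standing in for `ai.refused.schema`, and the use of a required-keys check in place of a full schema validator.

```python
import json
from typing import Any, Callable, Dict, Set


class SchemaRefused(Exception):
    """Hypothetical stand-in for the ai.refused.schema refusal."""


# Assumed wording; the real stricter prompt is not given in the source.
STRICT_SUFFIX = "\nRespond with a single valid JSON object and nothing else."


def call_with_schema_retry(
    provider: Callable[[str, str], str],
    system_prompt: str,
    user_prompt: str,
    required_keys: Set[str],
) -> Dict[str, Any]:
    """First attempt with the normal prompt, one retry with a stricter one."""
    for prompt in (system_prompt, system_prompt + STRICT_SUFFIX):
        raw = provider(prompt, user_prompt)
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # not JSON at all; fall through to the retry
        if isinstance(data, dict) and required_keys.issubset(data.keys()):
            return data
    # Both attempts produced invalid output: refuse instead of looping.
    raise SchemaRefused("ai.refused.schema")
```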
1.7 Prompt Version Rollback
- Previous version is still retained in the DB.
- Flip the previous version's `status` back to `active`; mark the current version deprecated.
- Consumers pick up the change on next fetch (60s).
1.8 Provider API Key Leaked
- Rotate immediately (KMS + provider console).
- Old key revoked; deploy new key to all pods.
- Audit log check for unauthorized calls.
1.9 Local Model Crashes
- Restart GPU pod; route traffic to cloud fallback.
- Alert if crash rate high.
1.10 Streaming Connection Drop (SSE)
- Client reconnects; server replays buffer (if within 30s window) or re-issues.
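
One way to picture the 30s replay window is a time-bounded event buffer keyed by event ID, which a reconnecting SSE client would reference through the standard `Last-Event-ID` header. The `ReplayBuffer` class, its method names, and the per-stream integer IDs are assumptions made for this sketch.

```python
import time
from collections import deque
from typing import Callable, Deque, List, Optional, Tuple


class ReplayBuffer:
    def __init__(
        self,
        window_seconds: float = 30.0,  # replay window from the text
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.window = window_seconds
        self.clock = clock
        self._events: Deque[Tuple[int, float, str]] = deque()
        self._next_id = 1

    def append(self, data: str) -> int:
        """Record an event as it is sent; returns its event ID."""
        self._evict()
        event_id = self._next_id
        self._next_id += 1
        self._events.append((event_id, self.clock(), data))
        return event_id

    def replay_after(self, last_event_id: int) -> Optional[List[Tuple[int, str]]]:
        """Events newer than last_event_id, or None if some have expired.

        None tells the caller that replay is impossible and the request
        has to be re-issued.
        """
        self._evict()
        oldest_needed = last_event_id + 1
        if oldest_needed >= self._next_id:
            return []  # client already has everything
        if not self._events or self._events[0][0] > oldest_needed:
            return None  # part of the gap fell outside the window
        return [(eid, data) for eid, _, data in self._events if eid > last_event_id]

    def _evict(self) -> None:
        cutoff = self.clock() - self.window
        while self._events and self._events[0][1] < cutoff:
            self._events.popleft()
```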
2. Retry / Backoff
| Operation | Max retries | Backoff |
|---|---|---|
| Provider call | 2 | 200ms, 1s |
| Postgres | 3 | 10ms–200ms |
| Redis | 3 | 10ms–100ms |
| Outbox | infinite | exponential, capped at 5m |
| Embedding gen | 3 | 200ms, 1s, 3s |
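
The fixed schedules in the table (provider call and embedding generation) map naturally onto a helper that takes the list of delays. The sketch below is illustrative only: `retry`, `PROVIDER_DELAYS`, and `EMBEDDING_DELAYS` are hypothetical names, and the injectable `sleep` parameter exists only to make the helper testable.

```python
import time
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")

# Delay schedules in seconds, taken from the table above.
PROVIDER_DELAYS = (0.2, 1.0)
EMBEDDING_DELAYS = (0.2, 1.0, 3.0)


def retry(
    fn: Callable[[], T],
    delays: Sequence[float],
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn once, then retry once per entry in `delays`.

    The number of retries equals len(delays); the last failure is re-raised.
    """
    for attempt in range(len(delays) + 1):
        try:
            return fn()
        except Exception:
            if attempt == len(delays):
                raise  # retries exhausted
            sleep(delays[attempt])
    raise AssertionError("unreachable")
```

For example, `retry(call_provider, PROVIDER_DELAYS)` would make up to three attempts in total, where `call_provider` is whatever zero-argument callable wraps the provider request.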
3. Circuit Breakers
Per provider: 10 failures within 30s open the breaker for 60s. Safety classifier: 20 failures within 60s open the breaker for 120s.
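
A minimal sketch of such a breaker, assuming the thresholds mean "N failures inside a sliding window open the breaker for a fixed cool-down". The `CircuitBreaker` class is hypothetical, and it deliberately omits a half-open probing state, which the source does not describe.

```python
import time
from collections import deque
from typing import Callable, Deque, Optional


class CircuitBreaker:
    def __init__(
        self,
        max_failures: int,
        window_seconds: float,
        open_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_failures = max_failures
        self.window = window_seconds
        self.open_seconds = open_seconds
        self.clock = clock
        self._failures: Deque[float] = deque()
        self._opened_at: Optional[float] = None

    def allow(self) -> bool:
        """True if a call may proceed; False while the breaker is open."""
        if self._opened_at is None:
            return True
        if self.clock() - self._opened_at >= self.open_seconds:
            self._opened_at = None  # cool-down elapsed: close again
            self._failures.clear()
            return True
        return False

    def record_failure(self) -> None:
        now = self.clock()
        self._failures.append(now)
        # Keep only failures that are still inside the sliding window.
        while self._failures and self._failures[0] <= now - self.window:
            self._failures.popleft()
        if len(self._failures) >= self.max_failures:
            self._opened_at = now


# Values from the text: per provider, and for the safety classifier.
provider_breaker = CircuitBreaker(10, 30.0, 60.0)
safety_breaker = CircuitBreaker(20, 60.0, 120.0)
```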
4. Fallbacks
| Primary | Fallback |
|---|---|
| Cloud model (large) | Cloud (smaller) → local |
| Real-time safety | Local classifier |
| KMS provider-key fetch | Cached in-memory (5 min) |
| Live eval | Stored eval score |
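
As one reading of the KMS row, the sketch below serves the last successfully fetched key from memory when a fresh fetch fails, but only while that copy is at most 5 minutes old. `CachedSecret` and the `fetch` callable are hypothetical; the real gateway may instead use the cache on every read and only refresh on expiry.

```python
import time
from typing import Callable, Optional


class CachedSecret:
    def __init__(
        self,
        fetch: Callable[[], str],
        ttl_seconds: float = 300.0,  # 5 min from the table
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.fetch = fetch
        self.ttl = ttl_seconds
        self.clock = clock
        self._value: Optional[str] = None
        self._fetched_at = 0.0

    def get(self) -> str:
        """Prefer a live fetch; fall back to the in-memory copy if it is fresh."""
        try:
            value = self.fetch()
        except Exception:
            fresh = self.clock() - self._fetched_at <= self.ttl
            if self._value is not None and fresh:
                return self._value  # fallback: cached copy within TTL
            raise  # nothing usable in memory; surface the failure
        self._value = value
        self._fetched_at = self.clock()
        return value
```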
5. Chaos
- 30% provider timeout → fallback chain completes.
- Budget exhausted mid-session → refuse gracefully.
- Cache delete during call → recompute + store.
- Local model OOM → fallback to cloud.