
Failure Modes

:::info Source
Sourced from services/ai-gateway-service/FAILURE_MODES.md in the documentation repo.
:::

1. Scenarios

1.1 Primary Provider Outage

Fallback chain: primary → secondary → tertiary → local.
Alert on fallback usage > 10%.
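
The chain and the 10% alert can be sketched as below. This is a minimal illustration, not the gateway's actual code: `ProviderError`, `AllProvidersDown`, `call_with_fallback`, and `FallbackMeter` are hypothetical names, and each provider is assumed to be a plain callable that raises on failure.

```python
from typing import Callable, List, Sequence, Tuple


class ProviderError(Exception):
    """Raised by a provider callable when the call fails (hypothetical)."""


class AllProvidersDown(Exception):
    """Raised when every provider in the chain has failed."""


def call_with_fallback(
    providers: Sequence[Tuple[str, Callable[[str], str]]],
    prompt: str,
) -> Tuple[str, str]:
    """Try each provider in order; return (provider_name, response)."""
    errors: List[str] = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except ProviderError as exc:
            errors.append(f"{name}: {exc}")  # remember why this hop failed
    raise AllProvidersDown("; ".join(errors))


class FallbackMeter:
    """Tracks the share of requests not served by the primary provider."""

    def __init__(self, threshold: float = 0.10) -> None:
        self.threshold = threshold  # 10% taken from the text above
        self.total = 0
        self.fallbacks = 0

    def record(self, served_by: str, primary: str = "primary") -> None:
        self.total += 1
        if served_by != primary:
            self.fallbacks += 1

    def should_alert(self) -> bool:
        return self.total > 0 and self.fallbacks / self.total > self.threshold
```

A caller would pass the chain in order (primary, secondary, tertiary, local) and feed the returned provider name into `FallbackMeter.record`.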

1.2 All Providers Down

Refuse with ai.refused.provider (502).
Caller shows graceful UX ("AI temporarily unavailable").
Emit ai.provider.all_down.v1 alert.

1.3 Budget Race (Two Concurrent Callers)

Atomic UPDATE ... RETURNING ensures no over-spend.
Second caller sees budget exhausted.
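
A minimal sketch of the single-statement debit, using Python's built-in `sqlite3` as a stand-in for Postgres. The `budgets` table, its columns, and `try_spend` are assumptions for illustration. The sketch checks `rowcount` rather than using `RETURNING`; in Postgres, appending `RETURNING remaining` to the same statement would also return the new balance.

```python
import sqlite3


def try_spend(conn: sqlite3.Connection, tenant: str, cost: int) -> bool:
    """Debit `cost` only if enough budget remains; True when the debit landed."""
    # The check and the decrement happen in one statement, so two concurrent
    # callers cannot both pass the check against the same remaining balance.
    cur = conn.execute(
        "UPDATE budgets SET remaining = remaining - ? "
        "WHERE tenant = ? AND remaining >= ?",
        (cost, tenant, cost),
    )
    conn.commit()
    return cur.rowcount == 1  # zero rows matched means the budget is exhausted


# Hypothetical schema and data, for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE budgets (tenant TEXT PRIMARY KEY, remaining INTEGER)")
conn.execute("INSERT INTO budgets VALUES ('acme', 100)")
conn.commit()

first = try_spend(conn, "acme", 80)   # debit lands, 20 left
second = try_spend(conn, "acme", 80)  # matches no row: caller sees exhausted
```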

1.4 Cache Stampede

Request coalescing (singleflight): multiple concurrent callers with the same key share one provider call.
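
One way to picture the coalescing is a thread-based sketch in the spirit of Go's `singleflight` package. `SingleFlight` and `_Call` are hypothetical names; the real gateway may coalesce differently (for example, across processes via Redis).

```python
import threading
from typing import Any, Callable, Dict, Optional


class _Call:
    """Shared state for one in-flight computation."""

    def __init__(self) -> None:
        self.done = threading.Event()
        self.result: Any = None
        self.error: Optional[Exception] = None


class SingleFlight:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._inflight: Dict[str, _Call] = {}

    def do(self, key: str, fn: Callable[[], Any]) -> Any:
        """Run fn once per key at a time; concurrent callers share the result."""
        with self._lock:
            call = self._inflight.get(key)
            leader = call is None
            if leader:
                call = _Call()
                self._inflight[key] = call
        if leader:
            try:
                call.result = fn()  # only the first caller hits the provider
            except Exception as exc:
                call.error = exc
            finally:
                with self._lock:
                    del self._inflight[key]  # later callers start a fresh call
                call.done.set()
        else:
            call.done.wait()  # followers block until the leader finishes
        if call.error is not None:
            raise call.error
        return call.result
```

Followers receive the leader's result or its exception, so a failed provider call is not silently retried once per waiting caller.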

1.5 Safety Classifier False Positive

Overblocks legitimate content.
Mitigation: admin review queue for blocked requests; thresholds tuned quarterly; override path with audit.

1.6 Output Schema Violation (Provider Returned Bad JSON)

One retry with stricter system prompt.
If the output is still invalid, emit ai.refused.schema and return an error to the caller.
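
The retry-once behaviour might look like the following sketch. `call_provider`, `STRICT_SUFFIX`, `SchemaRefused`, and the required-keys check are all assumptions; the real service presumably validates against a full schema.

```python
import json
from typing import Callable, List


class SchemaRefused(Exception):
    """Stands in for emitting the ai.refused.schema event."""


# Hypothetical wording for the stricter second-attempt system prompt.
STRICT_SUFFIX = "\nRespond with a single valid JSON object and nothing else."


def complete_json(
    call_provider: Callable[[str, str], str],
    system_prompt: str,
    user_prompt: str,
    required_keys: List[str],
) -> dict:
    """First attempt with the normal prompt, one retry with the stricter one."""
    for attempt_prompt in (system_prompt, system_prompt + STRICT_SUFFIX):
        raw = call_provider(attempt_prompt, user_prompt)
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # unparseable output: fall through to the retry
        if isinstance(data, dict) and all(k in data for k in required_keys):
            return data
    raise SchemaRefused("ai.refused.schema")
```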

1.7 Prompt Version Rollback

The previous version is still stored in the DB.
Set its status back to active and mark the current version deprecated.
Consumers pick up the change on their next fetch (within 60s).
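
A sketch of the status flip as one transaction, so consumers never see zero or two active versions. The `prompt_versions` table, its columns, and `rollback_prompt` are hypothetical, and `sqlite3` stands in for the real database.

```python
import sqlite3


def rollback_prompt(
    conn: sqlite3.Connection,
    prompt_id: str,
    current_version: int,
    previous_version: int,
) -> None:
    """Deprecate the current version and reactivate the previous one atomically."""
    with conn:  # one transaction: commit on success, roll back on any exception
        conn.execute(
            "UPDATE prompt_versions SET status = 'deprecated' "
            "WHERE prompt_id = ? AND version = ?",
            (prompt_id, current_version),
        )
        cur = conn.execute(
            "UPDATE prompt_versions SET status = 'active' "
            "WHERE prompt_id = ? AND version = ?",
            (prompt_id, previous_version),
        )
        if cur.rowcount != 1:
            # Raising here undoes the deprecation above as well.
            raise LookupError("previous version not found; nothing changed")
```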

1.8 Provider API Key Leaked

Rotate immediately (KMS + provider console).
Old key revoked; deploy new key to all pods.
Audit log check for unauthorized calls.

1.9 Local Model Crashes

Restart GPU pod; route traffic to cloud fallback.
Alert if crash rate high.

1.10 Streaming Connection Drop (SSE)

The client reconnects; the server replays its buffer if the missed events fall within the 30s window, otherwise it re-issues the request.
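
The replay-or-re-issue decision can be sketched with a time-bounded buffer keyed by event ID. This assumes the client reports the last event ID it received (as browsers do via the SSE `Last-Event-ID` header); `ReplayBuffer` and its methods are hypothetical names.

```python
import time
from collections import deque
from typing import Callable, Deque, List, Optional, Tuple


class ReplayBuffer:
    def __init__(
        self,
        window_s: float = 30.0,  # replay window from the text above
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.window_s = window_s
        self._clock = clock
        self._events: Deque[Tuple[int, float, str]] = deque()
        self._next_id = 1

    def append(self, data: str) -> int:
        """Store an outgoing event and return its ID."""
        event_id = self._next_id
        self._next_id += 1
        self._events.append((event_id, self._clock(), data))
        self._evict()
        return event_id

    def _evict(self) -> None:
        cutoff = self._clock() - self.window_s
        while self._events and self._events[0][1] < cutoff:
            self._events.popleft()

    def replay_after(self, last_event_id: int) -> Optional[List[Tuple[int, str]]]:
        """Events after last_event_id, or None if some of them already expired."""
        self._evict()
        oldest_needed = last_event_id + 1
        if oldest_needed >= self._next_id:
            return []  # client is already up to date
        if not self._events or self._events[0][0] > oldest_needed:
            return None  # gap in the buffer: caller should re-issue the request
        return [(i, d) for i, _, d in self._events if i > last_event_id]
```

A `None` return is the signal to fall back to re-issuing the request rather than replaying a partial stream.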

2. Retry / Backoff

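| Operation | Max retries | Backoff |
| --- | --- | --- |
| Provider call | 2 | 200 ms, 1 s |
| Postgres | 3 | 10–200 ms |
| Redis | 3 | 10–100 ms |
| Outbox | Unlimited | Exponential, capped at 5 min |
| Embedding generation | 3 | 200 ms, 1 s, 3 s |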
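
The fixed schedules and the capped exponential schedule in the table could be expressed as below. This sketch assumes the "Max" column counts retries after the first attempt (which matches the number of delays listed); `retry`, `capped_exponential`, and the 1-second outbox base delay are assumptions.

```python
import time
from typing import Callable, Iterator, Sequence, TypeVar

T = TypeVar("T")

# Delay schedules in seconds, copied from the table above.
PROVIDER_DELAYS = (0.2, 1.0)
EMBEDDING_DELAYS = (0.2, 1.0, 3.0)


def retry(
    fn: Callable[[], T],
    delays: Sequence[float],
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn once, then once more after each delay in the schedule."""
    for delay in delays:
        try:
            return fn()
        except Exception:
            sleep(delay)
    return fn()  # final attempt: let any exception propagate to the caller


def capped_exponential(base: float = 1.0, cap: float = 300.0) -> Iterator[float]:
    """Unbounded schedule for the outbox: doubles each time, never above cap."""
    delay = base
    while True:
        yield min(delay, cap)
        delay = min(delay * 2, cap)
```

In production the `except Exception` clause would usually be narrowed to retryable errors (timeouts, connection resets) so that permanent failures surface immediately.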

3. Circuit Breakers

Per provider: 10 failures within 30s open the breaker for 60s. Safety classifier: 20 failures within 60s open the breaker for 120s.
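
A minimal sliding-window breaker matching these thresholds might look like this. `CircuitBreaker` is a hypothetical class, and the sketch deliberately omits a half-open probing state: once the open period elapses, it simply closes again.

```python
import time
from collections import deque
from typing import Callable, Deque, Optional


class CircuitBreaker:
    def __init__(
        self,
        max_failures: int,
        window_s: float,
        open_s: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_failures = max_failures
        self.window_s = window_s
        self.open_s = open_s
        self._clock = clock
        self._failures: Deque[float] = deque()  # timestamps of recent failures
        self._opened_at: Optional[float] = None

    def allow(self) -> bool:
        """True if calls may proceed; closes the breaker once open_s has passed."""
        if self._opened_at is None:
            return True
        if self._clock() - self._opened_at >= self.open_s:
            self._opened_at = None
            self._failures.clear()
            return True
        return False

    def record_failure(self) -> None:
        now = self._clock()
        self._failures.append(now)
        # Drop failures that have aged out of the counting window.
        while self._failures and self._failures[0] <= now - self.window_s:
            self._failures.popleft()
        if len(self._failures) >= self.max_failures:
            self._opened_at = now


# Thresholds from the text above.
provider_breaker = CircuitBreaker(max_failures=10, window_s=30.0, open_s=60.0)
safety_breaker = CircuitBreaker(max_failures=20, window_s=60.0, open_s=120.0)
```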

4. Fallbacks

| Primary | Fallback |
| --- | --- |
| Cloud model (large) | Cloud model (smaller) → local model |
| Real-time safety | Local classifier |
| KMS provider-key fetch | Cached in-memory (5 min) |
| Live eval | Stored eval score |
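
As an illustration of the KMS row, the sketch below serves the last successfully fetched key from memory when the fetch fails, as long as the cached copy is at most 5 minutes old. This reading of "Cached in-memory (5 min)" is an assumption, as are the `CachedKeyFetcher` name and the `fetch` callable.

```python
import time
from typing import Callable, Dict, Tuple


class CachedKeyFetcher:
    def __init__(
        self,
        fetch: Callable[[str], str],  # hypothetical KMS client call
        max_age_s: float = 300.0,     # 5 minutes, from the table above
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self.max_age_s = max_age_s
        self._clock = clock
        self._cache: Dict[str, Tuple[str, float]] = {}

    def get(self, key_id: str) -> str:
        """Fetch from KMS; on failure, fall back to a fresh-enough cached copy."""
        try:
            value = self._fetch(key_id)
        except Exception:
            cached = self._cache.get(key_id)
            if cached is not None and self._clock() - cached[1] <= self.max_age_s:
                return cached[0]  # KMS is down but the cached copy is recent
            raise  # nothing usable in cache: surface the KMS error
        self._cache[key_id] = (value, self._clock())
        return value
```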

5. Chaos

30% provider timeout → fallback chain completes.
Budget exhausted mid-session → refuse gracefully.
Cache delete during call → recompute + store.
Local model OOM → fallback to cloud.
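
The first experiment can be approximated in-process by injecting timeouts into the primary provider and checking that every request still completes. This is only a sketch of the idea: `flaky`, `run_chain`, and the stub providers are hypothetical, and a real chaos run would inject faults at the network or pod level.

```python
import random
from typing import Callable, List, Tuple


def flaky(
    call: Callable[[str], str], timeout_rate: float, rng: random.Random
) -> Callable[[str], str]:
    """Wrap a provider so that a fraction of calls raise TimeoutError."""

    def wrapped(prompt: str) -> str:
        if rng.random() < timeout_rate:
            raise TimeoutError("injected timeout")
        return call(prompt)

    return wrapped


def run_chain(chain: List[Tuple[str, Callable[[str], str]]], prompt: str) -> Tuple[str, str]:
    """Return (provider_name, response) from the first provider that answers."""
    for name, call in chain:
        try:
            return name, call(prompt)
        except TimeoutError:
            continue
    raise RuntimeError("all providers down")


rng = random.Random(7)  # fixed seed keeps the experiment repeatable
chain = [
    ("primary", flaky(lambda p: "primary:" + p, 0.30, rng)),
    ("secondary", lambda p: "secondary:" + p),
]
served = [run_chain(chain, "ping")[0] for _ in range(1000)]
fallback_share = served.count("secondary") / len(served)
```

The pass condition is that all 1000 requests return a response; `fallback_share` should land near the injected 30% rate.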
