Failure Modes
:::info Source
Sourced from services/media-service/FAILURE_MODES.md in the documentation repo.
:::
1. Scenarios
1.1 Scanner Outage
- Queue builds; uploads delayed in "scanning" state; author UX shows progress.
- Retry on recovery.
1.2 Transcoder Failure (Specific Profile)
- Retry other profiles; mark that profile failed; asset still usable with other variants.
1.3 AI Provider Outage (Image Gen / TTS)
- Queue; fallback to local model or cached output; UI shows "AI temporarily unavailable."
1.4 S3 Outage
- Upload URL creation fails → retries; fall over to secondary region.
1.5 CDN Cache Poisoning
- Signed URLs include content hash; CDN validates.
- Purge on asset update or revocation.
1.6 Quarantine False Positive
- Admin review queue; manual release override with audit.
1.7 Large File Upload Interrupted
- Multipart upload; resumable; client retries.
1.8 GDPR Deletion Race
- Mark asset deleted; soft-delete first; purge after 30-day grace.
2. Retry / Backoff
| Op | Max | Backoff |
|---|---|---|
| S3 write | 5 | exp 100ms–10s |
| Scan | 3 | 1s, 5s, 15s |
| Transcode | 3 | 30s, 2m, 10m |
| AI call | 2 | 1s, 5s |
| Outbox | infinite | exp cap 5m |
3. Circuit Breakers
S3: 10 fail/30s → 60s. AI gateway: 10 fail/30s → 60s. Scanner: 10 fail/30s → 60s.
4. Fallbacks
| Primary | Fallback |
|---|---|
| Cloud AI | Local model |
| MediaConvert | ffmpeg workers |
| CDN | Origin direct |
5. Chaos
- Scanner 30s latency → queue drains cleanly.
- Corrupt S3 object → SHA check catches at next read.
- CDN stale → purge invalidates.