
Failure Modes

:::info Source
Sourced from services/media-service/FAILURE_MODES.md in the documentation repo.
:::

1. Scenarios

1.1 Scanner Outage

  • Scan queue builds up; uploads stay in the "scanning" state while the author UI shows progress.
  • Queued scans are retried once the scanner recovers.

1.2 Transcoder Failure (Specific Profile)

  • Retry other profiles; mark that profile failed; asset still usable with other variants.
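
The per-profile isolation described above can be sketched roughly as follows. This is a minimal illustration, not the media service's actual code: `transcode_all`, `is_usable`, the status strings, and the profile names are all hypothetical.

```python
from typing import Callable, Dict, Iterable


def transcode_all(
    profiles: Iterable[str],
    transcode: Callable[[str], None],
) -> Dict[str, str]:
    """Run every profile independently so one failure does not stop the rest."""
    status: Dict[str, str] = {}
    for profile in profiles:
        try:
            transcode(profile)
            status[profile] = "ready"
        except Exception:
            # Mark only this profile as failed and keep going.
            status[profile] = "failed"
    return status


def is_usable(status: Dict[str, str]) -> bool:
    """The asset stays usable while at least one variant is ready."""
    return any(state == "ready" for state in status.values())


def _fake_transcode(profile: str) -> None:
    # Hypothetical transcoder that fails only for the 1080p profile.
    if profile == "1080p":
        raise RuntimeError("encoder crashed")


result = transcode_all(["480p", "720p", "1080p"], _fake_transcode)
```

Here `result` records `1080p` as failed while the other two variants remain ready, so `is_usable(result)` is true.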

1.3 AI Provider Outage (Image Gen / TTS)

  • Queue requests; fall back to local model or cached output; UI shows "AI temporarily unavailable."

1.4 S3 Outage

  • Upload URL creation fails → retry; fail over to secondary region.

1.5 CDN Cache Poisoning

  • Signed URLs include content hash; CDN validates.
  • Purge on asset update or revocation.
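
One way a content hash could be bound into a signed URL is sketched below. The source does not specify the signing scheme, so everything here is an assumption: the query parameter names (`h`, `e`, `sig`), HMAC-SHA256 as the signature, the demo key, and the helper names `sign_url` and `validate_url`.

```python
import hashlib
import hmac
from urllib.parse import parse_qs, urlparse

SIGNING_KEY = b"demo-key"  # assumption: illustrative shared key only


def _signature(path: str, digest: str, expires: int) -> str:
    message = f"{path}|{digest}|{expires}".encode()
    return hmac.new(SIGNING_KEY, message, hashlib.sha256).hexdigest()


def sign_url(path: str, content: bytes, expires: int) -> str:
    """Build a URL whose signature covers the path, content hash, and expiry."""
    digest = hashlib.sha256(content).hexdigest()
    sig = _signature(path, digest, expires)
    return f"{path}?h={digest}&e={expires}&sig={sig}"


def validate_url(url: str, body: bytes, now: int) -> bool:
    """Check the signature, the expiry, and that the body matches the hash."""
    parsed = urlparse(url)
    query = parse_qs(parsed.query)
    try:
        digest = query["h"][0]
        expires = int(query["e"][0])
        sig = query["sig"][0]
    except (KeyError, ValueError):
        return False
    expected = _signature(parsed.path, digest, expires)
    if not hmac.compare_digest(expected.encode(), sig.encode()):
        return False
    if now > expires:
        return False
    # A poisoned cache entry fails here: its bytes do not hash to `digest`.
    return hashlib.sha256(body).hexdigest() == digest


url = sign_url("/media/a.png", b"pixels", expires=2000)
```

Because the hash is covered by the signature, an attacker cannot swap both the cached body and the `h` parameter without invalidating `sig`.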

1.6 Quarantine False Positive

  • Admin review queue; manual release override with audit.

1.7 Large File Upload Interrupted

  • Multipart upload; resumable; client retries.
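
Resuming an interrupted multipart upload comes down to working out which parts still need to be sent. The sketch below assumes 1-based part numbers (as S3 multipart uploads use) and a hypothetical `missing_parts` helper; how the client learns which parts already landed is outside its scope.

```python
from typing import List, Set


def part_count(total_size: int, part_size: int) -> int:
    """Number of parts needed to cover total_size (integer ceiling division)."""
    return -(-total_size // part_size)


def missing_parts(total_size: int, part_size: int, uploaded: Set[int]) -> List[int]:
    """Part numbers (1-based) that the client still has to upload."""
    total = part_count(total_size, part_size)
    return [n for n in range(1, total + 1) if n not in uploaded]


# A 25-byte file in 10-byte parts where only part 1 was acknowledged.
todo = missing_parts(25, 10, {1})
```

After a dropped connection the client would retry only the part numbers in `todo` rather than restarting the whole upload.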

1.8 GDPR Deletion Race

  • Soft-delete first (mark asset deleted); purge after 30-day grace period.
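
A minimal sketch of the soft-delete-then-purge flow, assuming a hypothetical `Asset` record with a `deleted_at` timestamp. The source does not describe the race in detail; making `soft_delete` idempotent is one assumed way to keep repeated or concurrent deletion requests from extending the grace period.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

GRACE = timedelta(days=30)  # grace period from the scenario above


@dataclass
class Asset:
    asset_id: str
    deleted_at: Optional[datetime] = None


def soft_delete(asset: Asset, now: datetime) -> None:
    """Mark the asset deleted; repeated calls keep the first timestamp."""
    if asset.deleted_at is None:
        asset.deleted_at = now


def is_servable(asset: Asset) -> bool:
    """Soft-deleted assets stop being served immediately."""
    return asset.deleted_at is None


def purge_due(asset: Asset, now: datetime) -> bool:
    """True once the asset has been soft-deleted for the full grace period."""
    return asset.deleted_at is not None and now - asset.deleted_at >= GRACE


t0 = datetime(2024, 1, 1, tzinfo=timezone.utc)
asset = Asset("a1")
soft_delete(asset, t0)
soft_delete(asset, t0 + timedelta(days=10))  # duplicate request, ignored
```

A background job could then call `purge_due` on soft-deleted assets and hard-delete those for which it returns true.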

2. Retry / Backoff

| Op | Max retries | Backoff |
|---|---|---|
| S3 write | 5 | exp 100ms–10s |
| Scan | 3 | 1s, 5s, 15s |
| Transcode | 3 | 30s, 2m, 10m |
| AI call | 2 | 1s, 5s |
| Outbox | infinite | exp, cap 5m |
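
The schedules above could be applied with a small retry helper like the sketch below. It reads the Max column as the number of retries after the first attempt, and it assumes a doubling factor for the exponential S3 schedule; neither is confirmed by the source. `exponential_delays` and `call_with_retries` are hypothetical names.

```python
import time
from typing import Callable, Dict, List, Sequence, TypeVar

T = TypeVar("T")

# Fixed schedules copied from the table, in seconds.
SCHEDULES: Dict[str, List[float]] = {
    "scan": [1, 5, 15],
    "transcode": [30, 120, 600],
    "ai_call": [1, 5],
}


def exponential_delays(
    base: float, cap: float, retries: int, factor: float = 2.0
) -> List[float]:
    """Delays that grow by `factor` each retry and never exceed `cap`."""
    return [min(cap, base * factor**i) for i in range(retries)]


# S3 write: 5 retries, starting at 100ms and capped at 10s.
SCHEDULES["s3_write"] = exponential_delays(0.1, 10.0, 5)


def call_with_retries(
    op: Callable[[], T],
    delays: Sequence[float],
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Try `op`, sleeping for each delay after a failure, then one last try."""
    for delay in delays:
        try:
            return op()
        except Exception:
            sleep(delay)
    # Final attempt: let any exception propagate to the caller.
    return op()
```

Passing `sleep` as a parameter keeps the helper testable without real waiting. Production code would usually also add jitter and retry only on errors known to be transient.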

3. Circuit Breakers

  • S3: 10 failures within 30s → breaker opens for 60s.
  • AI gateway: 10 failures within 30s → breaker opens for 60s.
  • Scanner: 10 failures within 30s → breaker opens for 60s.
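
A minimal sliding-window breaker using these thresholds might look like the sketch below. It assumes "→ 60s" means the breaker stays open for 60 seconds, and it omits a half-open probe state for brevity. The `CircuitBreaker` class and its method names are hypothetical.

```python
import time
from collections import deque
from typing import Callable, Deque, Optional


class CircuitBreaker:
    def __init__(
        self,
        max_failures: int = 10,
        window: float = 30.0,
        open_for: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_failures = max_failures
        self.window = window
        self.open_for = open_for
        self.clock = clock
        self.failures: Deque[float] = deque()
        self.opened_at: Optional[float] = None

    def allow(self) -> bool:
        """Return True if calls may proceed."""
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.open_for:
            # Cool-down elapsed: close the breaker and start counting afresh.
            self.opened_at = None
            self.failures.clear()
            return True
        return False

    def record_failure(self) -> None:
        now = self.clock()
        self.failures.append(now)
        # Drop failures that fell out of the sliding window.
        while self.failures and now - self.failures[0] > self.window:
            self.failures.popleft()
        if len(self.failures) >= self.max_failures:
            self.opened_at = now
```

Callers would check `allow()` before each request and call `record_failure()` when one fails. The injectable `clock` exists only so the timing can be tested deterministically.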

4. Fallbacks

| Primary | Fallback |
|---|---|
| Cloud AI | Local model |
| MediaConvert | ffmpeg workers |
| CDN | Origin direct |
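
Each primary/fallback pair can be treated as an ordered list of providers tried in turn. The sketch below assumes a hypothetical `first_success` helper and stand-in provider functions; it is not the service's real dispatch code.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def first_success(providers: Sequence[Tuple[str, Callable[[], T]]]) -> Tuple[str, T]:
    """Try providers in order; return the name and result of the first that works."""
    errors: List[str] = []
    for name, call in providers:
        try:
            return name, call()
        except Exception as exc:
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


def cloud_ai() -> str:
    # Stand-in for the primary provider during an outage.
    raise ConnectionError("provider unavailable")


def local_model() -> str:
    # Stand-in for the fallback.
    return "caption from local model"


used, output = first_success([("cloud_ai", cloud_ai), ("local_model", local_model)])
```

Returning the provider name along with the result lets the caller log or surface which path actually served the request.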

5. Chaos

  • Scanner 30s latency → queue drains cleanly.
  • Corrupt S3 object → SHA check catches at next read.
  • CDN stale → purge invalidates.
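
The corrupt-object case relies on verifying a stored digest whenever an object is read. A minimal sketch, assuming SHA-256 (the source says only "SHA") and hypothetical names `verified_read` and `CorruptObjectError`:

```python
import hashlib


class CorruptObjectError(Exception):
    """Raised when stored bytes no longer match their recorded digest."""


def verified_read(data: bytes, expected_sha256: str) -> bytes:
    """Return data only if its SHA-256 matches the digest recorded at write time."""
    actual = hashlib.sha256(data).hexdigest()
    if actual != expected_sha256:
        raise CorruptObjectError(f"expected {expected_sha256}, got {actual}")
    return data


payload = b"frame-data"
recorded = hashlib.sha256(payload).hexdigest()
```

A chaos test could flip a byte in the stored object and assert that the next `verified_read` raises instead of serving the corrupted bytes.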