:::info Source
Sourced from services/content-service/FAILURE_MODES.md in the documentation repo.
:::
1. Scenarios
1.1 Package Build Failure
- Cause: Missing asset, invalid manifest, KMS signing error.
- Mitigation:
- Publish saga compensation: discard package; draft →
approved; alert author.
- Retry up to 3 times for transient errors (network, S3 rate limit).
- Runbook:
runbooks/content/build-failures.md
1.2 Bundle Encryption Failure
- Cause: KMS unavailable, device key missing, disk full on worker.
- Mitigation:
- Retry with exponential backoff.
- Fall back to queuing (bundle later when KMS available).
- Alert ops; learner sees "preparing offline content" state.
1.3 Bundle Tamper Detected (Learner-Side)
- Cause: Device storage corruption or attack.
- Response:
- Client unmounts bundle; emits
content.bundle.tamper_detected.v1 on next sync.
- Server flags bundle + device; offers fresh download.
- Repeated tampers on same device → device trust revoked; admin alert.
1.4 SCORM Import Malicious Zip
- Symptom: Import worker detects eval, network attempt, or path traversal.
- Response: Quarantine zip; alert security; block importer for that tenant until review.
1.5 Revocation Propagation Delay
- Symptom: Bundle revoked server-side; offline device still plays until next sync.
- Mitigation:
- License envelope
expiresAt bounds risk window (default 30 days, configurable).
- Sync prioritizes revocation deltas.
- Push notification on reconnect forces sync.
- Accepted risk: offline device plays for ≤
expiresAt; documented in compliance.
1.6 KMS kid Rotation Collision
- Symptom: Old bundles signed with rotated-out key fail verification.
- Mitigation:
- Overlap window ≥ 2 days.
- Public verification keys stored in
/.well-known/content-keys.json with kid → pubkey map.
- Device cache includes all active
kids.
- Rotation communicated via sync well before cutover.
1.7 Publish Saga Split-Brain
- Symptom: content emits
.built but catalog doesn't receive; saga stuck.
- Mitigation:
- NATS at-least-once + outbox; retry to depletion.
- Saga timeout 15 min → compensate.
- Monitoring alert on saga in
building > 5 min.
1.8 Large Package Build OOM
- Symptom: Course with 500 4K videos OOMs builder.
- Mitigation:
- Stream processing (not bulk load).
- Size cap per tenant plan tier.
- Vertical scale budget; alert on OOM → admin review.
1.9 Signed URL Expiry Before Download Complete
- Symptom: Client on slow network gets URL but connection drops.
- Mitigation:
- TTL 10 min — generous.
- Client resumable downloads (HTTP range).
- Auto-renew signed URL on 403 from CDN.
1.10 CDN Cache Poisoning
- Symptom: Corrupt bundle served from CDN.
- Mitigation:
- Signed URLs include content hash; CDN validates.
- Client verifies SHA-256 on arrival.
- Fail → fetch origin + invalidate CDN.
2. Retry / Backoff
| Operation | Max | Backoff | Budget |
|---|
| KMS sign | 3 | 50ms, 200ms, 500ms | 1s |
| S3 upload | 5 | exp 100ms–10s | 30s |
| Asset fetch | 3 | 500ms, 2s, 5s | 10s |
| Publish saga step | 5 | exp cap 30s | 15 min total |
| Bundle create retry | 5 | exp cap 60s | 10 min |
3. Circuit Breakers
| Target | Trip | Reset |
|---|
| KMS | 10 fail / 30s | 60s |
| S3 | 10 fail / 30s | 60s |
| ai-gateway (for assistant config) | 5 fail / 30s | 60s |
4. Fallbacks
| Primary | Fallback |
|---|
| Real-time bundle create | Queue; learner sees "preparing…" |
| Server-side revocation push | Expiry-based (bounded) |
| CDN | Origin direct (signed URL) |
| AI-generated assistant config | Default config |
5. Chaos
- Kill KMS for 30s mid-publish → verify retry + recovery.
- Drop 10% of NATS messages for 2 min → verify no event loss.
- Corrupt random S3 object → verify tamper detection on next mount.
- Partition content ↔ catalog for 60s during publish saga → verify compensation.