Skip to main content

Failure Modes

:::info Source Sourced from services/content-service/FAILURE_MODES.md in the documentation repo. :::

1. Scenarios

1.1 Package Build Failure

  • Cause: Missing asset, invalid manifest, KMS signing error.
  • Mitigation:
    • Publish saga compensation: discard package; draft → approved; alert author.
    • Retry up to 3 times for transient errors (network, S3 rate limit).
  • Runbook: runbooks/content/build-failures.md

1.2 Bundle Encryption Failure

  • Cause: KMS unavailable, device key missing, disk full on worker.
  • Mitigation:
    • Retry with exponential backoff.
    • Fall back to queuing (bundle later when KMS available).
    • Alert ops; learner sees "preparing offline content" state.

1.3 Bundle Tamper Detected (Learner-Side)

  • Cause: Device storage corruption or attack.
  • Response:
    • Client unmounts bundle; emits content.bundle.tamper_detected.v1 on next sync.
    • Server flags bundle + device; offers fresh download.
    • Repeated tampers on same device → device trust revoked; admin alert.

1.4 SCORM Import Malicious Zip

  • Symptom: Import worker detects eval, network attempt, or path traversal.
  • Response: Quarantine zip; alert security; block importer for that tenant until review.

1.5 Revocation Propagation Delay

  • Symptom: Bundle revoked server-side; offline device still plays until next sync.
  • Mitigation:
    • License envelope expiresAt bounds risk window (default 30 days, configurable).
    • Sync prioritizes revocation deltas.
    • Push notification on reconnect forces sync.
    • Accepted risk: offline device plays for ≤ expiresAt; documented in compliance.

1.6 KMS kid Rotation Collision

  • Symptom: Old bundles signed with rotated-out key fail verification.
  • Mitigation:
    • Overlap window ≥ 2 days.
    • Public verification keys stored in /.well-known/content-keys.json with kid → pubkey map.
    • Device cache includes all active kids.
    • Rotation communicated via sync well before cutover.

1.7 Publish Saga Split-Brain

  • Symptom: content emits .built but catalog doesn't receive; saga stuck.
  • Mitigation:
    • NATS at-least-once + outbox; retry to depletion.
    • Saga timeout 15 min → compensate.
    • Monitoring alert on saga in building > 5 min.

1.8 Large Package Build OOM

  • Symptom: Course with 500 4K videos OOMs builder.
  • Mitigation:
    • Stream processing (not bulk load).
    • Size cap per tenant plan tier.
    • Vertical scale budget; alert on OOM → admin review.

1.9 Signed URL Expiry Before Download Complete

  • Symptom: Client on slow network gets URL but connection drops.
  • Mitigation:
    • TTL 10 min — generous.
    • Client resumable downloads (HTTP range).
    • Auto-renew signed URL on 403 from CDN.

1.10 CDN Cache Poisoning

  • Symptom: Corrupt bundle served from CDN.
  • Mitigation:
    • Signed URLs include content hash; CDN validates.
    • Client verifies SHA-256 on arrival.
    • Fail → fetch origin + invalidate CDN.

2. Retry / Backoff

OperationMaxBackoffBudget
KMS sign350ms, 200ms, 500ms1s
S3 upload5exp 100ms–10s30s
Asset fetch3500ms, 2s, 5s10s
Publish saga step5exp cap 30s15 min total
Bundle create retry5exp cap 60s10 min

3. Circuit Breakers

TargetTripReset
KMS10 fail / 30s60s
S310 fail / 30s60s
ai-gateway (for assistant config)5 fail / 30s60s

4. Fallbacks

PrimaryFallback
Real-time bundle createQueue; learner sees "preparing…"
Server-side revocation pushExpiry-based (bounded)
CDNOrigin direct (signed URL)
AI-generated assistant configDefault config

5. Chaos

  • Kill KMS for 30s mid-publish → verify retry + recovery.
  • Drop 10% of NATS messages for 2 min → verify no event loss.
  • Corrupt random S3 object → verify tamper detection on next mount.
  • Partition content ↔ catalog for 60s during publish saga → verify compensation.