Skip to main content

Failure Modes

:::info Source Sourced from services/delivery-service/FAILURE_MODES.md in the documentation repo. :::

1. Scenarios

1.1 Bundle Mount Fails (License Invalid / Tamper)

  • Symptom: Player cannot mount bundle; learner sees "content unavailable."
  • Mitigation:
    • Diagnostic event emitted (content.bundle.tamper_detected.v1 or equivalent).
    • Player offers fresh download if online.
    • Admin alert for repeated failures on same device.

1.2 AI Tutor Provider Unavailable

  • Symptom: POST /play-sessions/{id}/tutor/turn returns 502.
  • Mitigation:
    • Fallback to local model (cached in bundle).
    • UI degrades: "tutor temporarily unavailable — try again."
    • Turn persisted locally; retried when provider restored.

1.3 Progress Emission Lost Mid-Play

  • Symptom: Network blip drops progress.statement.stored.v1.
  • Mitigation:
    • Client-side statements outbox (IndexedDB / SQLite).
    • Replay on reconnect with idempotency key.
    • Duplicate statements deduplicated in progress-service via statementId.

1.4 Play Session Concurrent Modification

  • Symptom: Two devices update same session simultaneously.
  • Mitigation:
    • Session is device-scoped; typically no conflict.
    • Cross-device "resume" uses vector clock + LWW on navigation cursor.
    • Statements append-only, never conflict.

1.5 SCORM Runtime Tracking Mismatch

  • Symptom: SCORM course API calls (cmi.*) not persisted.
  • Mitigation:
    • SCORM adapter translates cmi.* → xAPI statements → progress-service.
    • Adapter logs all calls; integration tests cover SCORM 1.2 + 2004.
    • SCORM Cloud conformance in CI.

1.6 Offline Session Corruption

  • Symptom: IndexedDB write fails mid-session.
  • Mitigation:
    • Atomic writes per block completion.
    • Session state reconstructible from statements-outbox + cursor.
    • Client-side recovery on app restart.

1.7 AI Budget Exhausted Offline

  • Symptom: Local AI tutor exhausts device token budget.
  • Mitigation:
    • Local model has rate limit (per session).
    • UI shows "tutor paused — resume when online."
    • Budget replenishes on next sync with cloud credit.

1.8 Device Trust Revoked During Session

  • Symptom: Admin revokes device; learner's active session ends abruptly.
  • Mitigation:
    • Graceful unmount: save state, emit session-abandoned.
    • Revocation event propagated via sync.
    • Data re-accessible when re-enrolled on another device.

1.9 Clock Skew on Offline Device

  • Symptom: Device clock way off; license expiresAt check fails.
  • Mitigation:
    • Server-time token embedded in bundle at mount-time (signed).
    • Player compares session time vs server-time + elapsed.
    • Large skew (> 24h) warns user; forces reconnect.

1.10 Assessment Submission After Session End

  • Symptom: Learner submits quiz response after session marked completed.
  • Mitigation:
    • Session state completed rejects further writes.
    • Client shows error; late submission discarded.
    • Retry allowed only via new attempt.

2. Retry / Backoff

OpMaxBackoffBudget
Statement emitinfinite (outbox)exp, cap 5 min
AI tutor turn2500ms, 2s5s
Bundle mount retry31s, 5s, 15s25s
Resume sync pull5exp cap 30s2 min

3. Circuit Breakers

TargetTripReset
ai-gateway5 fail / 30s60s (switch to local model)
progress-service10 fail / 30s60s (queue statements locally)
content-service5 fail / 30s60s (serve from cached bundle)

4. Fallbacks

PrimaryFallback
Cloud AI tutorLocal model (smaller, slower, same API)
Real-time progressClient outbox (up to 7 days offline)
Server resume stateClient cursor
Content CDNCached bundle

5. Chaos

  • Network partition during active session for 30 min → verify resume works.
  • Kill AI provider → verify local-model fallback.
  • Corrupt local statements-outbox → verify reconstruction from cursor.
  • Device clock jump +2h → verify server-time reconciliation.