:::info Source
Sourced from services/delivery-service/FAILURE_MODES.md in the documentation repo.
:::
1. Scenarios
1.1 Bundle Mount Fails (License Invalid / Tamper)
- Symptom: Player cannot mount bundle; learner sees "content unavailable."
- Mitigation:
- Diagnostic event emitted (`content.bundle.tamper_detected.v1` or equivalent).
- Player offers fresh download if online.
- Admin alert for repeated failures on same device.
1.2 AI Tutor Provider Unavailable
- Symptom: `POST /play-sessions/{id}/tutor/turn` returns 502.
- Mitigation:
- Fallback to local model (cached in bundle).
- UI degrades: "tutor temporarily unavailable — try again."
- Turn persisted locally; retried when provider restored.
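
The fallback path above could look roughly like the sketch below. It assumes the cloud client raises a hypothetical `ProviderUnavailable` on a 502, that `cloud` and `local` are plain callables, and that the turn is queued for later retry whenever the cloud call fails, even if the local model answers. None of these names come from the service itself.

```python
class ProviderUnavailable(Exception):
    """Hypothetical error raised by the cloud client, e.g. on a 502."""


def tutor_turn(prompt, cloud, local, pending):
    """Try the cloud tutor first, then the local model cached in the bundle.

    `pending` is a list standing in for local persistence; turns placed in it
    are meant to be retried once the provider is restored.
    """
    try:
        return {"source": "cloud", "reply": cloud(prompt)}
    except ProviderUnavailable:
        # Persist the turn locally before falling back.
        pending.append(prompt)
        if local is not None:
            return {"source": "local", "reply": local(prompt)}
        # No local model available: the UI shows its degraded message.
        return {"source": "unavailable", "reply": None}
```

Returning the `source` alongside the reply lets the UI tell the learner which tutor answered.
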
1.3 Progress Emission Lost Mid-Play
- Symptom: Network blip drops `progress.statement.stored.v1`.
- Mitigation:
- Client-side statements outbox (IndexedDB / SQLite).
- Replay on reconnect with idempotency key.
- Duplicate statements deduplicated in progress-service via `statementId`.
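
A minimal sketch of the outbox and replay behavior described above, with an in-memory list standing in for IndexedDB / SQLite and a small class standing in for progress-service. The class names, field names other than `statementId`, and the use of `ConnectionError` to signal a failed send are assumptions made for illustration.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class ProgressService:
    """Stand-in for progress-service: keeps each statementId at most once."""
    stored: dict = field(default_factory=dict)

    def store(self, statement: dict) -> bool:
        # Returns True when newly stored, False when it was a duplicate.
        sid = statement["statementId"]
        if sid in self.stored:
            return False
        self.stored[sid] = statement
        return True


@dataclass
class Outbox:
    """Client-side queue of statements that have not been acknowledged yet."""
    pending: list = field(default_factory=list)

    def enqueue(self, verb: str, block_id: str) -> dict:
        # The statementId is assigned once, so every replay reuses the same key.
        statement = {"statementId": str(uuid.uuid4()), "verb": verb, "blockId": block_id}
        self.pending.append(statement)
        return statement

    def replay(self, send) -> int:
        """Send pending statements in order; keep the unsent tail on failure."""
        delivered = 0
        remaining = []
        for i, statement in enumerate(self.pending):
            try:
                send(statement)
                delivered += 1
            except ConnectionError:
                remaining = self.pending[i:]
                break
        self.pending = remaining
        return delivered
```

Because the key is fixed at enqueue time, a statement whose acknowledgement was lost can be sent again safely: the second delivery is recognized as a duplicate and ignored.
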
1.4 Play Session Concurrent Modification
- Symptom: Two devices update same session simultaneously.
- Mitigation:
- Session is device-scoped; typically no conflict.
- Cross-device "resume" uses vector clock + LWW on navigation cursor.
- Statements append-only, never conflict.
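
One way to read "vector clock + LWW on navigation cursor" is sketched below: if one device's clock has seen everything the other has, its cursor wins outright, and only truly concurrent updates fall back to last-writer-wins. The `CursorState` shape and the tie-break on `(updated_at, device_id)` are assumptions, not the service's actual data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CursorState:
    position: str      # e.g. a block identifier (hypothetical format)
    clock: dict        # vector clock: device id -> event counter
    updated_at: float  # wall-clock timestamp, used only for the LWW tie-break
    device_id: str


def dominates(a: dict, b: dict) -> bool:
    """True if clock a has seen everything b has, plus at least one more event."""
    keys = set(a) | set(b)
    at_least = all(a.get(k, 0) >= b.get(k, 0) for k in keys)
    strictly = any(a.get(k, 0) > b.get(k, 0) for k in keys)
    return at_least and strictly


def merge_clocks(a: dict, b: dict) -> dict:
    """Element-wise maximum of two vector clocks."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in set(a) | set(b)}


def resolve(local: CursorState, remote: CursorState) -> CursorState:
    if dominates(local.clock, remote.clock):
        winner = local
    elif dominates(remote.clock, local.clock):
        winner = remote
    else:
        # Concurrent (or identical) clocks: last writer wins.
        winner = max(local, remote, key=lambda s: (s.updated_at, s.device_id))
    merged = merge_clocks(local.clock, remote.clock)
    return CursorState(winner.position, merged, winner.updated_at, winner.device_id)
```

Statements need no such merge because they are append-only; only the cursor has a single mutable value to reconcile.
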
1.5 SCORM Runtime Tracking Mismatch
- Symptom: SCORM course API calls (cmi.*) not persisted.
- Mitigation:
- SCORM adapter translates cmi.* → xAPI statements → progress-service.
- Adapter logs all calls; integration tests cover SCORM 1.2 + 2004.
- SCORM Cloud conformance in CI.
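
As an illustration of the `cmi.*` to xAPI translation step, the sketch below maps a few well-known SCORM 1.2 and 2004 data-model elements onto ADL verb IRIs. The mapping is deliberately partial, and the function name, the `https://lms.example` home page and the choice to return `None` for untranslated elements are assumptions; the real adapter may differ.

```python
import uuid

VERB_BASE = "http://adlnet.gov/expapi/verbs/"

STATUS_ELEMENTS = {
    "cmi.core.lesson_status",  # SCORM 1.2
    "cmi.completion_status",   # SCORM 2004
    "cmi.success_status",      # SCORM 2004
}
STATUS_TO_VERB = {"completed": "completed", "passed": "passed", "failed": "failed"}
SCORE_ELEMENTS = {"cmi.core.score.raw", "cmi.score.raw"}


def translate(element: str, value: str, learner_id: str, activity_id: str):
    """Turn one SetValue call into an xAPI-style statement dict.

    Returns None for elements this sketch does not translate; per the list
    above, the adapter would still log those calls.
    """
    if element in STATUS_ELEMENTS and value in STATUS_TO_VERB:
        verb, result = STATUS_TO_VERB[value], None
    elif element in SCORE_ELEMENTS:
        verb, result = "scored", {"score": {"raw": float(value)}}
    else:
        return None
    statement = {
        "id": str(uuid.uuid4()),
        # Placeholder account home page, not a real endpoint.
        "actor": {"account": {"homePage": "https://lms.example", "name": learner_id}},
        "verb": {"id": VERB_BASE + verb},
        "object": {"id": activity_id},
    }
    if result is not None:
        statement["result"] = result
    return statement
```
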
1.6 Offline Session Corruption
- Symptom: IndexedDB write fails mid-session.
- Mitigation:
- Atomic writes per block completion.
- Session state reconstructible from statements-outbox + cursor.
- Client-side recovery on app restart.
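
The sketch below shows the idea of one atomic write per block completion, using Python's `sqlite3` as a stand-in for the client store. The table layout and function names are hypothetical. The cursor update and the outbox insert share a single transaction, so a failed write leaves neither behind and recovery on restart sees a consistent pair.

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS outbox "
        "(statement_id TEXT PRIMARY KEY, block_id TEXT, verb TEXT)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cursor "
        "(id INTEGER PRIMARY KEY CHECK (id = 1), block_id TEXT)"
    )
    conn.commit()
    return conn


def complete_block(conn, statement_id: str, block_id: str, next_block_id: str) -> None:
    # `with conn` commits if the body succeeds and rolls back if it raises,
    # so the cursor never advances without its matching statement.
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO cursor (id, block_id) VALUES (1, ?)",
            (next_block_id,),
        )
        conn.execute(
            "INSERT INTO outbox (statement_id, block_id, verb) VALUES (?, ?, 'completed')",
            (statement_id, block_id),
        )


def recover(conn) -> dict:
    """Rebuild session state from the statements outbox plus the cursor."""
    rows = conn.execute(
        "SELECT block_id FROM outbox WHERE verb = 'completed' ORDER BY rowid"
    )
    completed = [r[0] for r in rows]
    row = conn.execute("SELECT block_id FROM cursor WHERE id = 1").fetchone()
    return {"completed": completed, "cursor": row[0] if row else None}
```
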
1.7 AI Budget Exhausted Offline
- Symptom: Local AI tutor exhausts device token budget.
- Mitigation:
- Local model has rate limit (per session).
- UI shows "tutor paused — resume when online."
- Budget replenishes on next sync with cloud credit.
1.8 Device Trust Revoked During Session
- Symptom: Admin revokes device; learner's active session ends abruptly.
- Mitigation:
- Graceful unmount: save state, emit session-abandoned.
- Revocation event propagated via sync.
- Data re-accessible when re-enrolled on another device.
1.9 Clock Skew on Offline Device
- Symptom: Device clock is significantly off; the license `expiresAt` check fails.
- Mitigation:
- Server-time token embedded in bundle at mount-time (signed).
- Player compares session time vs server-time + elapsed.
- Large skew (> 24h) warns user; forces reconnect.
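
A hedged sketch of the reconciliation above: instead of trusting the wall clock, the player derives "now" from the signed server time plus elapsed time on a monotonic clock, and uses the wall clock only to measure skew. Signature verification is omitted, and `ServerTimeToken`, its fields and the returned dict are illustrative names. A monotonic reading does not survive a device reboot, so a real player would need to handle that case separately.

```python
from dataclasses import dataclass

MAX_SKEW_SECONDS = 24 * 60 * 60  # the "> 24h" threshold from the list above


@dataclass(frozen=True)
class ServerTimeToken:
    server_time: float         # epoch seconds taken from the signed token
    monotonic_at_mount: float  # monotonic clock reading when the bundle was mounted


def trusted_now(token: ServerTimeToken, monotonic_now: float) -> float:
    """Server time at mount plus the time elapsed since mount."""
    return token.server_time + (monotonic_now - token.monotonic_at_mount)


def check_license(token, expires_at: float, wall_clock_now: float, monotonic_now: float) -> dict:
    now = trusted_now(token, monotonic_now)
    skew = abs(wall_clock_now - now)
    return {
        "expired": now >= expires_at,               # decided on trusted time only
        "skew_seconds": skew,
        "force_reconnect": skew > MAX_SKEW_SECONDS,  # warn the user and reconnect
    }
```

In an application the two clock arguments would typically come from `time.time()` and `time.monotonic()`; they are passed in here so the check stays easy to test.
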
1.10 Assessment Submission After Session End
- Symptom: Learner submits quiz response after session marked completed.
- Mitigation:
- Session state `completed` rejects further writes.
- Client shows error; late submission discarded.
- Retry allowed only via new attempt.
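
The write guard above can be sketched as a small state check. `PlaySession`, `SessionClosedError` and the attempt counter are hypothetical names; how the real service identifies a new attempt is not specified here.

```python
class SessionClosedError(Exception):
    """Raised when a write arrives after the session was marked completed."""


class PlaySession:
    def __init__(self, session_id: str, attempt: int = 1):
        self.session_id = session_id
        self.attempt = attempt
        self.state = "active"
        self.responses = []

    def submit(self, response: dict) -> None:
        # A completed session rejects further writes; the late response is discarded.
        if self.state == "completed":
            raise SessionClosedError(f"session {self.session_id} is completed")
        self.responses.append(response)

    def complete(self) -> None:
        self.state = "completed"

    def new_attempt(self) -> "PlaySession":
        # Retrying means starting over in a fresh attempt, not reopening this one.
        return PlaySession(self.session_id, self.attempt + 1)
```
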
2. Retry / Backoff
| Operation | Max retries | Backoff | Time budget |
|---|---|---|---|
| Statement emit | infinite (outbox) | exp, cap 5 min | — |
| AI tutor turn | 2 | 500ms, 2s | 5s |
| Bundle mount retry | 3 | 1s, 5s, 15s | 25s |
| Resume sync pull | 5 | exp cap 30s | 2 min |
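
To make the table concrete, here is a minimal retry helper that takes a delay schedule and a time budget. It is a sketch, not the service's implementation: the function names are invented, the base delay for the exponential rows is not given in the table and must be chosen by the caller, and the budget is interpreted as "do not start a wait that would push total elapsed time past the budget".

```python
import time


def exponential_delays(base: float, cap: float):
    """Yield base, 2*base, 4*base, ... seconds, never exceeding cap."""
    delay = base
    while True:
        yield min(delay, cap)
        delay = min(delay * 2, cap)


def call_with_retry(op, delays, budget=None, sleep=time.sleep, clock=time.monotonic):
    """Run op(); after a failure, wait for the next delay and try again.

    Gives up (re-raising the last error) when the schedule is exhausted or
    the next wait would exceed the time budget.
    """
    start = clock()
    schedule = iter(delays)
    while True:
        try:
            return op()
        except Exception:
            delay = next(schedule, None)
            if delay is None:
                raise
            if budget is not None and (clock() - start) + delay > budget:
                raise
            sleep(delay)
```

Under these assumptions the "AI tutor turn" row would be `call_with_retry(send_turn, [0.5, 2.0], budget=5.0)`, and the "Statement emit" row would pass `exponential_delays(base, 300.0)` with no budget, where `send_turn` and `base` are placeholders.
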
3. Circuit Breakers
| Target | Trip threshold | Reset after (behavior while open) |
|---|---|---|
| ai-gateway | 5 fail / 30s | 60s (switch to local model) |
| progress-service | 10 fail / 30s | 60s (queue statements locally) |
| content-service | 5 fail / 30s | 60s (serve from cached bundle) |
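
A sliding-window breaker matching the shape of the table might look like the sketch below. It is simplified on purpose: there is no half-open probe state, the breaker simply closes again once the reset period has passed, and the class and method names are assumptions.

```python
from collections import deque


class CircuitBreaker:
    def __init__(self, max_failures: int, window_s: float, reset_s: float):
        self.max_failures = max_failures
        self.window_s = window_s
        self.reset_s = reset_s
        self.failures = deque()  # timestamps of recent failures
        self.opened_at = None    # None while the breaker is closed

    def allow(self, now: float) -> bool:
        """True if calls to the target may proceed; False means use the fallback."""
        if self.opened_at is not None and now - self.opened_at >= self.reset_s:
            # Reset period elapsed: close the breaker and start counting afresh.
            self.opened_at = None
            self.failures.clear()
        return self.opened_at is None

    def record_failure(self, now: float) -> None:
        self.failures.append(now)
        # Drop failures that have fallen out of the sliding window.
        while self.failures and now - self.failures[0] > self.window_s:
            self.failures.popleft()
        if len(self.failures) >= self.max_failures:
            self.opened_at = now
```

With this sketch the ai-gateway row would be `CircuitBreaker(5, 30, 60)`: while `allow()` returns `False`, the caller switches to the local model.
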
4. Fallbacks
| Primary | Fallback |
|---|---|
| Cloud AI tutor | Local model (smaller, slower, same API) |
| Real-time progress | Client outbox (up to 7 days offline) |
| Server resume state | Client cursor |
| Content CDN | Cached bundle |
5. Chaos
- Network partition during active session for 30 min → verify resume works.
- Kill AI provider → verify local-model fallback.
- Corrupt local statements-outbox → verify reconstruction from cursor.
- Device clock jump +2h → verify server-time reconciliation.