:::info Source
Sourced from services/authoring-service/13-FAILURE_MODES.md in the documentation repo.
:::
1. Known Scenarios
1.1 Publish Saga Half-Failure
- Symptom: Publish saga fails between step 2 (cataloging) and step 3 (bundling); CourseVersion registered but no bundle.
- Mitigation:
  - Explicit compensation: catalog unregister, package discard, draft → approved.
  - Saga state persisted in `sagas` table; worker crash-safe.
  - Saga timeout 15 min → force compensation.
  - Chaos tests inject failure at every step.
- Runbook: `runbooks/authoring/publish-saga.md`
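
The compensation flow in this scenario can be sketched as a small saga runner. This is a minimal illustration, not the service's implementation: `SagaStep`, `run_saga`, the step names, and the assumption that step 1 moves the draft to `publishing` are all hypothetical. Persistence to the `sagas` table and the 15 min timeout are left out.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class SagaStep:
    name: str
    action: Callable[[], None]
    compensate: Callable[[], None]


def run_saga(steps: List[SagaStep]) -> Tuple[bool, List[str]]:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[SagaStep] = []
    log: List[str] = []
    for step in steps:
        try:
            step.action()
        except Exception:
            log.append(f"failed:{step.name}")
            # Undo only what actually completed, newest first.
            for done in reversed(completed):
                done.compensate()
                log.append(f"compensated:{done.name}")
            return False, log
        completed.append(step)
        log.append(f"done:{step.name}")
    return True, log


def demo_half_failure() -> Tuple[bool, List[str], Dict[str, object]]:
    """Simulate a failure at bundling after cataloging succeeded."""
    state: Dict[str, object] = {"draft": "approved", "cataloged": False}

    def start_publishing() -> None:
        state["draft"] = "publishing"

    def revert_draft() -> None:
        state["draft"] = "approved"

    def register() -> None:
        state["cataloged"] = True

    def unregister() -> None:
        state["cataloged"] = False

    def bundle() -> None:
        raise RuntimeError("bundling failed")

    steps = [
        SagaStep("start_publishing", start_publishing, revert_draft),  # assumed step 1
        SagaStep("cataloging", register, unregister),
        SagaStep("bundling", bundle, lambda: None),
    ]
    ok, log = run_saga(steps)
    return ok, log, state
```

Calling `demo_half_failure()` leaves the draft back in `approved` with no catalog entry, which is the end state the mitigation list describes.
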
1.2 Yjs Document Corruption
- Symptom: Collab session loses state mid-edit.
- Mitigation:
  - Yjs snapshots written to Postgres every 60s.
  - Replay from event log (`authoring.block.added.v1`, etc.) as recovery.
  - Per-draft repair tool runs in staging first, then promoted.
- Recovery: kick affected clients; server restores from snapshot; clients rejoin.
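
One way the snapshot and the event log could work together is to restore the latest snapshot and then replay only the events recorded after it. The sketch below assumes that combination and uses hypothetical shapes (`Snapshot`, `BlockAdded`, a per-draft sequence number); it models the document as a plain dictionary rather than a real Yjs document.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Snapshot:
    seq: int                # last event sequence number contained in the snapshot
    blocks: Dict[str, str]  # block id -> content


@dataclass(frozen=True)
class BlockAdded:
    seq: int
    block_id: str
    content: str


def recover(snapshot: Snapshot, events: List[BlockAdded]) -> Dict[str, str]:
    """Rebuild draft state from a snapshot plus any newer events."""
    blocks = dict(snapshot.blocks)
    for event in sorted(events, key=lambda e: e.seq):
        if event.seq > snapshot.seq:  # skip events the snapshot already covers
            blocks[event.block_id] = event.content
    return blocks
```

With snapshots every 60s, the replayed tail stays short, so recovery time depends mostly on snapshot load time.
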
1.3 AI Block Generated Without Provenance
- Symptom: AI block persisted with missing `aiProvenance` VO.
- Mitigation: Domain invariant throws `DomainError.AIProvenanceRequired` at write. Consumer code uses only the `AIClient` port, which always attaches provenance.
- Verification: Unit tests cover every codepath that creates an `AIBlock`.
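
The invariant can be illustrated by making it impossible to construct an AI block without provenance. This is a sketch in Python, not the service's code: the fields of `AIProvenance` and `AIBlock` are assumptions, and the exception stands in for `DomainError.AIProvenanceRequired`.

```python
from dataclasses import dataclass
from typing import Optional


class AIProvenanceRequired(Exception):
    """Stand-in for DomainError.AIProvenanceRequired."""


@dataclass(frozen=True)
class AIProvenance:
    model: str        # hypothetical field
    prompt_hash: str  # hypothetical field


@dataclass(frozen=True)
class AIBlock:
    block_id: str
    content: str
    ai_provenance: Optional[AIProvenance]

    def __post_init__(self) -> None:
        # Enforce the invariant at construction, so no write path can skip it.
        if self.ai_provenance is None:
            raise AIProvenanceRequired(f"AI block {self.block_id} has no provenance")
```

Putting the check in the constructor means the unit tests only need to prove that every creation path goes through `AIBlock`.
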
1.4 Block Schema Drift
- Symptom: Block registered with `kind: "interactive-coding"`; older delivery-service doesn't understand it.
- Mitigation:
  - Block registry is frozen (F17). New kinds are additive only.
  - Player falls back to "unsupported block" stub.
  - Forward-compatibility test: S1 bundle must load in latest player.
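
The player-side fallback might look like the following sketch. The `render_block` function, the renderer map, and the stub text are hypothetical; the point is only that an unknown `kind` degrades to a placeholder instead of raising.

```python
from typing import Callable, Dict, Mapping

Renderer = Callable[[Mapping[str, object]], str]


def render_block(block: Mapping[str, object], renderers: Dict[str, Renderer]) -> str:
    """Render a block, or return a stub when its kind is not known to this player."""
    kind = str(block.get("kind", "unknown"))
    renderer = renderers.get(kind)
    if renderer is None:
        return f"[unsupported block: {kind}]"
    return renderer(block)
```
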
1.5 Dangling Media Reference
- Symptom: Block references a media asset that was deleted.
- Mitigation: Invariant: draft cannot transition to `publishing` unless every media ref resolves to `media.asset.status = ready`. Validator runs on publish command.
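
A validator for this invariant could be as small as the sketch below. The function names, the `PublishBlocked` exception, and the status lookup passed in as a dictionary are assumptions; a real validator would query the media service.

```python
from typing import Dict, Iterable, List


class PublishBlocked(Exception):
    """Raised when a draft must not enter the publishing state."""


def unresolved_media_refs(media_refs: Iterable[str], asset_status: Dict[str, str]) -> List[str]:
    """Return refs whose asset is missing (deleted) or not in status 'ready'."""
    return [ref for ref in media_refs if asset_status.get(ref) != "ready"]


def begin_publish(media_refs: Iterable[str], asset_status: Dict[str, str]) -> str:
    """Return the new draft state, or raise if any media ref does not resolve."""
    missing = unresolved_media_refs(media_refs, asset_status)
    if missing:
        raise PublishBlocked(f"unresolved media refs: {', '.join(missing)}")
    return "publishing"
```
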
1.6 Large SCORM Import Exhausts Memory
- Symptom: 2GB SCORM zip OOMs import worker.
- Mitigation:
  - Stream parser (no full extract to memory).
  - Size cap 500MB at API (configurable per tenant).
  - Sandboxed worker with memory limit.
  - Progress events emitted; UI shows "rejected — too large".
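
The size cap and the streaming requirement can be combined by counting bytes while copying the upload in fixed-size chunks, so the worker never holds the whole archive in memory. This is a sketch: `copy_with_cap`, `UploadTooLarge`, the chunk size, and reading 500MB as 500 × 1024 × 1024 bytes are assumptions.

```python
import io
from typing import BinaryIO

DEFAULT_CAP_BYTES = 500 * 1024 * 1024  # assumed interpretation of "500MB"


class UploadTooLarge(Exception):
    """Raised as soon as the running byte count exceeds the cap."""


def copy_with_cap(
    src: BinaryIO,
    dst: BinaryIO,
    cap_bytes: int = DEFAULT_CAP_BYTES,
    chunk_size: int = 64 * 1024,
) -> int:
    """Copy src to dst in chunks; abort once more than cap_bytes have been read."""
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            return total
        total += len(chunk)
        if total > cap_bytes:
            raise UploadTooLarge(f"upload exceeds {cap_bytes} bytes")
        dst.write(chunk)
```

Memory use stays bounded by `chunk_size` regardless of the archive size, and an oversized upload is rejected without reading the rest of the stream.
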
1.7 Collab Server Pod Eviction Mid-Edit
- Symptom: WebSocket drops; client reconnects to new pod with stale state.
- Mitigation:
  - Sticky sessions by `draftId` hash.
  - On pod restart: clients receive "reconnect" hint; rejoin with last-known server version.
  - Yjs CRDT merges any diverged edits without loss.
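
Routing by `draftId` hash can be sketched as below. The function name, the choice of SHA-256, and the pod list are assumptions; the source does not say which hash or routing layer is used.

```python
import hashlib
from typing import List


def pod_for_draft(draft_id: str, pods: List[str]) -> str:
    """Pick a pod deterministically so all editors of one draft share a server."""
    if not pods:
        raise ValueError("no collab pods available")
    digest = hashlib.sha256(draft_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(pods)
    return pods[index]
```

Plain modulo hashing remaps most drafts whenever the pod count changes. A consistent-hashing ring would reduce that churn, at the cost of more routing logic.
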
1.8 AI Co-Author Budget Exhausted Mid-Session
- Symptom: AI request returns `429 ai.refused.budget`.
- Mitigation:
  - UI shows budget status; warns at 80%.
  - Fallback to local model (slower, lower quality) with user consent.
  - Admin alert at 95%.
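
The thresholds above map onto a simple status function. The status labels and the choice to treat each threshold as inclusive are assumptions made for this sketch.

```python
def budget_status(used: float, limit: float) -> str:
    """Classify AI budget consumption against the 80% and 95% thresholds."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    ratio = used / limit
    if ratio >= 1.0:
        return "exhausted"    # requests would be refused with 429
    if ratio >= 0.95:
        return "admin_alert"
    if ratio >= 0.80:
        return "warn"
    return "ok"
```
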
2. Retry / Backoff
| Op | Max | Backoff | Budget |
|---|---|---|---|
| Postgres write | 3 | 10ms, 50ms, 200ms | 300ms |
| AI call | 2 | 1s | 5s |
| Media asset check | 3 | 200ms, 1s, 3s | 5s |
| Publish saga step | 5 | exp, cap 30s | 15min (saga timeout) |
| SCORM import step | 3 | 5s, 30s, 2m | 10min |
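
A retry helper driven by a row of this table might look like the sketch below. It assumes "Max" is the number of retries after the first attempt and "Budget" is a wall-clock limit covering all attempts; `retry` and its parameters are hypothetical. The `sleep` and `clock` arguments are injectable so the behaviour can be tested without waiting.

```python
import time
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def retry(
    op: Callable[[], T],
    delays: Sequence[float],
    budget_s: float,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> T:
    """Call op; on failure wait delays[i] and retry while the time budget allows."""
    start = clock()
    attempt = 0
    while True:
        try:
            return op()
        except Exception:
            if attempt >= len(delays):
                raise  # retries exhausted
            delay = delays[attempt]
            if clock() - start + delay > budget_s:
                raise  # the next wait would exceed the budget
            sleep(delay)
            attempt += 1
```

For the Postgres write row this would be called as `retry(write, delays=[0.010, 0.050, 0.200], budget_s=0.300)`.
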
3. Circuit Breakers
| Target | Trip | Reset |
|---|---|---|
| ai-gateway | 10 / 30s | 60s |
| media-service | 5 / 30s | 60s |
| catalog-service | 5 / 30s | 60s |
| content-service | 5 / 30s | 60s |
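
The breaker settings can be read as "trip after N failures within a window, then stay open for the reset period". That reading of the Trip column is an assumption, as is the `CircuitBreaker` class below, which also simplifies the half-open state to a plain reset.

```python
from collections import deque
from typing import Deque, Optional


class CircuitBreaker:
    """Sliding-window breaker: opens after `threshold` failures in `window_s`."""

    def __init__(self, threshold: int, window_s: float, reset_s: float) -> None:
        self.threshold = threshold
        self.window_s = window_s
        self.reset_s = reset_s
        self._failures: Deque[float] = deque()
        self._opened_at: Optional[float] = None

    def allow(self, now: float) -> bool:
        """Return True if a call may proceed at time `now` (seconds)."""
        if self._opened_at is None:
            return True
        if now - self._opened_at >= self.reset_s:
            # Reset period elapsed: close the breaker and forget old failures.
            self._opened_at = None
            self._failures.clear()
            return True
        return False

    def record_failure(self, now: float) -> None:
        """Record a failed call and open the breaker if the threshold is reached."""
        self._failures.append(now)
        while self._failures and now - self._failures[0] > self.window_s:
            self._failures.popleft()
        if len(self._failures) >= self.threshold:
            self._opened_at = now
```

Under this reading, the media-service row corresponds to `CircuitBreaker(threshold=5, window_s=30, reset_s=60)`.
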
4. Fallbacks
| Primary | Fallback |
|---|---|
| AI cloud model | Local model (reduced quality) |
| Real-time collab | Last-saved snapshot; single-user mode |
| Media preview CDN | Origin direct |
| Publish saga step | Explicit compensation; draft returns to editing |
5. Chaos Experiments
- Kill collab pod during edit session (verify client reconnect).
- Inject failure at each publish-saga step (verify compensation).
- AI-gateway 10s latency spike (verify UX degradation, not error).
- Postgres primary failover during block write (verify no loss).