
Failure Modes

:::info Source
Sourced from services/authoring-service/13-FAILURE_MODES.md in the documentation repo.
:::

1. Known Scenarios

1.1 Publish Saga Half-Failure

Symptom: The publish saga fails between step 2 (cataloging) and step 3 (bundling), leaving a CourseVersion registered with no bundle.
Mitigation:
- Explicit compensation: catalog unregister, package discard, draft → approved.
- Saga state persisted in sagas table; worker crash-safe.
- Saga timeout 15 min → force compensation.
- Chaos tests inject failure at every step.
Runbook: runbooks/authoring/publish-saga.md
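
The compensation flow above can be sketched as follows. This is a minimal illustration under stated assumptions, not the service's implementation: `SagaStep` and `run_saga` are hypothetical names, the step-1 name "package" is a guess based on the "package discard" compensation, and persistence to the sagas table and the 15 min timeout are left out.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SagaStep:
    name: str
    action: Callable[[], None]      # forward operation
    compensate: Callable[[], None]  # undo for this step


def run_saga(steps: List[SagaStep]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[SagaStep] = []
    for step in steps:
        try:
            step.action()
        except Exception:
            for done in reversed(completed):
                done.compensate()
            return False
        completed.append(step)
    return True


log: List[str] = []


def fail_bundling() -> None:
    raise RuntimeError("bundling failed")


# Hypothetical step list: bundling fails after cataloging succeeded.
steps = [
    SagaStep("package", lambda: log.append("package"),
             lambda: log.append("package discard")),
    SagaStep("catalog", lambda: log.append("catalog register"),
             lambda: log.append("catalog unregister")),
    SagaStep("bundle", fail_bundling, lambda: log.append("bundle discard")),
]
ok = run_saga(steps)
```

Compensations run newest-first, so the catalog entry is removed before the package is discarded. The failed step itself is not compensated because it never completed.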

1.2 Yjs Document Corruption

Symptom: Collab session loses state mid-edit.
Mitigation:
- Periodic Yjs snapshots every 60s to Postgres.
- Replay from event log (authoring.block.added.v1, etc.) as recovery.
- Per-draft repair tool runs in staging first, then promoted.
Recovery: kick affected clients; server restores from snapshot; clients rejoin.

1.3 AI Block Generated Without Provenance

Symptom: AI block persisted with missing aiProvenance VO.
Mitigation: A domain invariant throws DomainError.AIProvenanceRequired at write. Consumer code uses only the AIClient port, which always attaches provenance.
Verification: Unit tests cover every code path that creates an AIBlock.
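
A write-time invariant of this kind might look like the sketch below. The class shapes are assumptions for illustration: `AIProvenanceRequired` stands in for DomainError.AIProvenanceRequired, and the `AIProvenance` fields are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


class AIProvenanceRequired(Exception):
    """Stand-in for DomainError.AIProvenanceRequired."""


@dataclass(frozen=True)
class AIProvenance:
    model: str      # hypothetical field
    prompt_id: str  # hypothetical field


@dataclass(frozen=True)
class AIBlock:
    content: str
    ai_provenance: Optional[AIProvenance]

    def __post_init__(self) -> None:
        # Invariant: an AI block can never be constructed without provenance.
        if self.ai_provenance is None:
            raise AIProvenanceRequired("AIBlock requires aiProvenance")
```

Enforcing the check in the constructor means every code path that creates an AIBlock hits it, which is what makes an exhaustive unit test feasible.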

1.4 Block Schema Drift

Symptom: A block is registered with kind: "interactive-coding"; an older delivery-service does not recognize it.
Mitigation:
- Block registry is frozen (F17). New kinds additive.
- Player falls back to "unsupported block" stub.
- Forward-compatibility test: S1 bundle must load in latest player.
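
The player-side fallback can be illustrated with a renderer lookup that defaults to a stub. This is a sketch only: the `RENDERERS` table, the "text" kind, and the block dictionary shape are hypothetical.

```python
from typing import Callable, Dict

Renderer = Callable[[dict], str]

# Renderers known to this (older) player build; the "text" kind is illustrative.
RENDERERS: Dict[str, Renderer] = {
    "text": lambda block: f"<p>{block['body']}</p>",
}


def render_unsupported(block: dict) -> str:
    """Stub shown when the player does not know the block kind."""
    return f"[unsupported block: {block.get('kind', 'unknown')}]"


def render_block(block: dict) -> str:
    """Render a block, falling back to the stub for unknown kinds."""
    renderer = RENDERERS.get(block.get("kind", ""), render_unsupported)
    return renderer(block)
```

Because new kinds are additive, an older player only ever meets kinds it knows or kinds it has never seen, and the stub covers the second case without failing the whole bundle.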

1.5 Media Reference Unresolved at Publish

Symptom: Block references media asset that was deleted.
Mitigation: Invariant: draft cannot transition to publishing unless every media ref resolves to media.asset.status = ready. Validator runs on publish command.
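
A validator for this invariant could be as small as the sketch below. The function names and the `asset_status` mapping (asset id to status string) are assumptions; a real validator would presumably query media-service.

```python
from typing import Dict, Iterable, List


def unresolved_media_refs(media_refs: Iterable[str],
                          asset_status: Dict[str, str]) -> List[str]:
    """Return refs whose asset is missing (e.g. deleted) or not in the ready state."""
    return [ref for ref in media_refs if asset_status.get(ref) != "ready"]


def can_start_publishing(media_refs: Iterable[str],
                         asset_status: Dict[str, str]) -> bool:
    """The draft may move to publishing only if every media ref resolves."""
    return not unresolved_media_refs(media_refs, asset_status)
```

Returning the list of offending refs, rather than only a boolean, lets the publish command report which assets block the transition.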

1.6 Large SCORM Import Exhausts Memory

Symptom: 2GB SCORM zip OOMs import worker.
Mitigation:
- Stream parser (no full extract to memory).
- Size cap 500MB at API (configurable per tenant).
- Sandboxed worker with memory limit.
- Progress events emitted; UI shows "rejected — too large".
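
The size cap and chunked reading can be sketched with Python's standard `zipfile` module. The names `check_size` and `iter_entry_chunks`, the 64 KiB chunk size, and reading 500MB as decimal megabytes are assumptions. Note that `zipfile` needs a seekable file, so this assumes the upload has already been spooled to disk rather than parsed straight off the network.

```python
import io
import zipfile
from typing import BinaryIO, Iterator, Tuple

MAX_UPLOAD_BYTES = 500 * 1000 * 1000  # 500MB default cap; decimal MB assumed
CHUNK_BYTES = 64 * 1024               # assumed read size


class ImportTooLarge(Exception):
    pass


def check_size(declared_bytes: int, cap: int = MAX_UPLOAD_BYTES) -> None:
    """Reject oversized uploads at the API boundary, before any parsing."""
    if declared_bytes > cap:
        raise ImportTooLarge(f"{declared_bytes} bytes exceeds cap of {cap}")


def iter_entry_chunks(archive: BinaryIO) -> Iterator[Tuple[str, bytes]]:
    """Yield (entry name, chunk) pairs without loading whole entries into memory."""
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            with zf.open(info) as entry:
                while chunk := entry.read(CHUNK_BYTES):
                    yield info.filename, chunk


# Usage with a tiny in-memory archive.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("imsmanifest.xml", "<manifest/>")
buf.seek(0)
chunks = list(iter_entry_chunks(buf))
```

Peak memory stays near the chunk size regardless of archive size, which is what allows the sandboxed worker to run under a fixed memory limit.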

1.7 Collab Server Pod Eviction Mid-Edit

Symptom: WebSocket drops; client reconnects to new pod with stale state.
Mitigation:
- Sticky sessions by draftId hash.
- On pod restart: clients receive "reconnect" hint; rejoin with last-known server version.
- Yjs CRDT merges any diverged edits without loss.
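
Sticky routing by draftId hash can be sketched as below. The function name and pod names are hypothetical, and plain modulo hashing is used for brevity, not because the source specifies it.

```python
import hashlib
from typing import Sequence


def pod_for_draft(draft_id: str, pods: Sequence[str]) -> str:
    """Map a draftId to a pod deterministically so all editors of a draft share one pod."""
    digest = hashlib.sha256(draft_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(pods)
    return pods[index]


pods = ["collab-0", "collab-1", "collab-2"]  # hypothetical pod names
first = pod_for_draft("draft-42", pods)
second = pod_for_draft("draft-42", pods)
```

With modulo hashing, changing the pod count remaps most drafts. A consistent-hashing ring would limit that churn, at the cost of more routing logic.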

1.8 AI Co-Author Budget Exhausted Mid-Session

Symptom: AI request returns 429 ai.refused.budget.
Mitigation:
- UI shows budget status; warns at 80%.
- Fallback to local model (slower, lower quality) with user consent.
- Admin alert at 95%.
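
The thresholds above map naturally onto a small state function. This is a sketch: `BudgetState` and `budget_state` are hypothetical names, and treating 100% usage as the point where requests are refused is an assumption.

```python
from enum import Enum


class BudgetState(Enum):
    OK = "ok"
    WARN_USER = "warn_user"      # UI warning at 80%
    ALERT_ADMIN = "alert_admin"  # admin alert at 95%
    EXHAUSTED = "exhausted"      # requests refused with 429 ai.refused.budget


def budget_state(used: float, limit: float) -> BudgetState:
    """Classify budget usage against the 80% / 95% / 100% thresholds."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    ratio = used / limit
    if ratio >= 1.0:
        return BudgetState.EXHAUSTED
    if ratio >= 0.95:
        return BudgetState.ALERT_ADMIN
    if ratio >= 0.80:
        return BudgetState.WARN_USER
    return BudgetState.OK
```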

2. Retry / Backoff

| Operation | Max retries | Backoff | Budget |
| --- | --- | --- | --- |
| Postgres write | 3 | 10ms, 50ms, 200ms | 300ms |
| AI call | 2 | 1s | 5s |
| Media asset check | 3 | 200ms, 1s, 3s | 5s |
| Publish saga step | 5 | exp, cap 30s | 15min (saga timeout) |
| SCORM import step | 3 | 5s, 30s, 2m | 10min |
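
One way to apply a row of this table is a retry helper that takes the backoff schedule and the overall budget. This is a sketch, not the service's code: `retry_with_backoff` is a hypothetical name, and it reads the Max column as the number of retries after the first attempt, which the table does not state explicitly.

```python
import time
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    op: Callable[[], T],
    delays_s: Sequence[float],
    budget_s: float,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> T:
    """Call op; after each failure wait the next delay, unless that would exceed the budget."""
    start = clock()
    for delay in delays_s:
        try:
            return op()
        except Exception:
            # Give up early if waiting again would overrun the total budget.
            if clock() - start + delay > budget_s:
                raise
            sleep(delay)
    return op()  # final attempt; its exception propagates to the caller
```

For the Postgres write row, a call would pass `delays_s=[0.010, 0.050, 0.200]` and `budget_s=0.300`. The `sleep` and `clock` parameters exist so tests can run without real waiting.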

3. Circuit Breakers

| Target | Trip (failures / window) | Reset |
| --- | --- | --- |
| ai-gateway | 10 / 30s | 60s |
| media-service | 5 / 30s | 60s |
| catalog-service | 5 / 30s | 60s |
| content-service | 5 / 30s | 60s |
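
A breaker with these parameters can be sketched as a sliding-window failure counter. The `CircuitBreaker` class below is hypothetical and simplified: it assumes the Trip column means failures per window, and it has no half-open probing state, so the breaker simply closes once the reset period has elapsed.

```python
import time
from collections import deque
from typing import Callable, Deque, Optional


class CircuitBreaker:
    """Opens after `threshold` failures within `window_s`; allows calls again after `reset_s`."""

    def __init__(self, threshold: int, window_s: float, reset_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.threshold = threshold
        self.window_s = window_s
        self.reset_s = reset_s
        self.clock = clock
        self.failures: Deque[float] = deque()
        self.opened_at: Optional[float] = None

    def allow(self) -> bool:
        """Return True if a call to the target may proceed."""
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.reset_s:
            # Reset period elapsed: close the breaker and start counting afresh.
            self.opened_at = None
            self.failures.clear()
            return True
        return False

    def record_failure(self) -> None:
        now = self.clock()
        self.failures.append(now)
        # Drop failures that fell out of the sliding window.
        while self.failures and now - self.failures[0] > self.window_s:
            self.failures.popleft()
        if len(self.failures) >= self.threshold:
            self.opened_at = now
```

For media-service, the table's values correspond to `CircuitBreaker(threshold=5, window_s=30.0, reset_s=60.0)`.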

4. Fallbacks

| Primary | Fallback |
| --- | --- |
| AI cloud model | Local model (reduced quality) |
| Real-time collab | Last-saved snapshot; single-user mode |
| Media preview CDN | Origin direct |
| Publish saga step | Explicit compensation; draft returns to editing |

5. Chaos Experiments

- Kill collab pod during edit session (verify client reconnect).
- Inject failure at each publish-saga step (verify compensation).
- AI-gateway 10s latency spike (verify UX degradation, not error).
- Postgres primary failover during block write (verify no loss).
