Failure Modes

:::info Source
Sourced from services/authoring-service/13-FAILURE_MODES.md in the documentation repo.
:::

1. Known Scenarios

1.1 Publish Saga Half-Failure

  • Symptom: Publish saga fails between step 2 (cataloging) and step 3 (bundling); CourseVersion registered but no bundle.
  • Mitigation:
    • Explicit compensation: catalog unregister, package discard, draft → approved.
    • Saga state persisted in sagas table; worker crash-safe.
    • Saga timeout 15 min → force compensation.
    • Chaos tests inject failure at every step.
  • Runbook: runbooks/authoring/publish-saga.md
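The compensation flow above can be sketched as a small in-memory saga runner. This is a minimal illustration, not the service's implementation: `SagaStep`, `SagaResult`, and `run_saga` are hypothetical names, and persistence to the sagas table and the 15 min timeout are deliberately left out.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class SagaStep:
    # Hypothetical shape: a forward action plus the compensation that undoes it.
    name: str
    action: Callable[[], None]
    compensate: Callable[[], None]


@dataclass
class SagaResult:
    succeeded: bool
    completed: List[str] = field(default_factory=list)
    compensated: List[str] = field(default_factory=list)


def run_saga(steps: List[SagaStep]) -> SagaResult:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    result = SagaResult(succeeded=True)
    done: List[SagaStep] = []
    for step in steps:
        try:
            step.action()
        except Exception:
            result.succeeded = False
            # Undo only the steps that finished, newest first.
            for prior in reversed(done):
                prior.compensate()
                result.compensated.append(prior.name)
            return result
        done.append(step)
        result.completed.append(step.name)
    return result
```

If a bundling step raises after packaging and cataloging have completed, the runner calls the catalog compensation first and the package compensation second, which matches the order listed in the mitigation.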

1.2 Yjs Document Corruption

  • Symptom: Collab session loses state mid-edit.
  • Mitigation:
    • Periodic Yjs snapshots every 60s to Postgres.
    • Replay from event log (authoring.block.added.v1, etc.) as recovery.
    • Per-draft repair tool runs in staging first, then promoted.
  • Recovery: kick affected clients; server restores from snapshot; clients rejoin.

1.3 AI Block Generated Without Provenance

  • Symptom: AI block persisted with missing aiProvenance VO.
  • Mitigation: Domain invariant throws DomainError.AIProvenanceRequired at write. Consumer code uses only the AIClient port, which always attaches provenance.
  • Verification: Unit tests cover every codepath that creates an AIBlock.
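One way to enforce such an invariant is to check it in the entity's constructor so that an invalid block can never exist. The sketch below assumes that approach; the `AIProvenance` fields and the `AIProvenanceRequired` exception class are illustrative stand-ins for the real value object and `DomainError.AIProvenanceRequired`.

```python
from dataclasses import dataclass
from typing import Optional


class AIProvenanceRequired(Exception):
    """Stand-in for DomainError.AIProvenanceRequired."""


@dataclass(frozen=True)
class AIProvenance:
    model: str       # hypothetical field
    request_id: str  # hypothetical field


@dataclass(frozen=True)
class AIBlock:
    block_id: str
    content: str
    ai_provenance: Optional[AIProvenance]

    def __post_init__(self) -> None:
        # Invariant: an AI block without provenance is rejected at construction.
        if self.ai_provenance is None:
            raise AIProvenanceRequired(
                f"AIBlock {self.block_id} has no provenance"
            )
```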

1.4 Block Schema Drift

  • Symptom: Block registered with kind: "interactive-coding"; older delivery-service doesn't understand.
  • Mitigation:
    • Block registry is frozen (F17). New kinds additive.
    • Player falls back to "unsupported block" stub.
    • Forward-compatibility test: S1 bundle must load in latest player.
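The "unsupported block" fallback can be sketched as a renderer lookup with a default. The renderer table, the `text` and `quiz` kinds, and the stub wording are assumptions for illustration only.

```python
from typing import Callable, Dict

Renderer = Callable[[dict], str]

# Hypothetical set of kinds that an older player knows how to render.
KNOWN_RENDERERS: Dict[str, Renderer] = {
    "text": lambda block: block["body"],
    "quiz": lambda block: f"[quiz: {block['title']}]",
}


def render_block(block: dict) -> str:
    """Render a known kind, or return a stub instead of failing on an unknown one."""
    kind = block.get("kind", "unknown")
    renderer = KNOWN_RENDERERS.get(kind)
    if renderer is None:
        return f"[unsupported block: {kind}]"
    return renderer(block)
```

Because new kinds are additive, a player built this way keeps rendering every kind it already knows and degrades only on the ones it has never seen.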

1.5 Media Reference Unresolved at Publish

  • Symptom: Block references media asset that was deleted.
  • Mitigation: Invariant: draft cannot transition to publishing unless every media ref resolves to media.asset.status = ready. Validator runs on publish command.
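A minimal sketch of such a validator, assuming media references are plain asset IDs and that asset statuses are available as a dictionary; `unresolved_media_refs`, `assert_publishable`, and `MediaRefUnresolved` are hypothetical names.

```python
from typing import Dict, Iterable, List


class MediaRefUnresolved(Exception):
    """Hypothetical error raised when a draft is not publishable."""


def unresolved_media_refs(refs: Iterable[str], asset_status: Dict[str, str]) -> List[str]:
    """Return refs whose asset is missing or not in the 'ready' state."""
    return [ref for ref in refs if asset_status.get(ref) != "ready"]


def assert_publishable(refs: Iterable[str], asset_status: Dict[str, str]) -> None:
    """Block the transition to publishing if any media ref does not resolve."""
    missing = unresolved_media_refs(refs, asset_status)
    if missing:
        raise MediaRefUnresolved(", ".join(missing))
```

A deleted asset simply has no entry in the status map, so it is reported alongside assets that exist but are not yet ready.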

1.6 Large SCORM Import Exhausts Memory

  • Symptom: 2GB SCORM zip OOMs import worker.
  • Mitigation:
    • Stream parser (no full extract to memory).
    • Size cap 500MB at API (configurable per tenant).
    • Sandboxed worker with memory limit.
    • Progress events emitted; UI shows "rejected — too large".
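The size cap and streaming read can be sketched with the standard-library `zipfile` module. This is an assumption about the approach, not the import worker's actual code; `check_upload_size`, `iter_entry_chunks`, and `ImportRejected` are hypothetical names, and the 500MB default mirrors the cap stated above.

```python
import io
import zipfile
from typing import BinaryIO, Iterator, Tuple

MAX_UPLOAD_BYTES = 500 * 1024 * 1024  # default cap; configurable per tenant


class ImportRejected(Exception):
    """Hypothetical error for uploads rejected before any parsing starts."""


def check_upload_size(size_bytes: int, cap: int = MAX_UPLOAD_BYTES) -> None:
    """Reject oversized uploads at the API boundary."""
    if size_bytes > cap:
        raise ImportRejected("too large")


def iter_entry_chunks(archive: BinaryIO, chunk_size: int = 64 * 1024) -> Iterator[Tuple[str, bytes]]:
    """Yield (entry name, chunk) pairs without loading a whole entry into memory."""
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            with zf.open(info) as entry:
                while True:
                    chunk = entry.read(chunk_size)
                    if not chunk:
                        break
                    yield info.filename, chunk
```

Peak memory for reading is bounded by `chunk_size` rather than by the size of the largest file in the package; the sandbox memory limit remains the backstop.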

1.7 Collab Server Pod Eviction Mid-Edit

  • Symptom: WebSocket drops; client reconnects to new pod with stale state.
  • Mitigation:
    • Sticky sessions by draftId hash.
    • On pod restart: clients receive "reconnect" hint; rejoin with last-known server version.
    • Yjs CRDT merges any diverged edits without loss.
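Sticky routing by draftId hash can be sketched as a stable hash modulo the pod list. The function name and pod names are hypothetical, and the real routing layer may use a different hash or a consistent-hashing ring.

```python
import hashlib
from typing import Sequence


def pod_for_draft(draft_id: str, pods: Sequence[str]) -> str:
    """Map a draft ID to one pod deterministically, independent of process state."""
    if not pods:
        raise ValueError("no collab pods available")
    # Use a stable digest; Python's built-in hash() is randomized per process.
    digest = hashlib.sha256(draft_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(pods)
    return pods[index]
```

With plain modulo hashing, changing the number of pods remaps many drafts at once, which is one reason the reconnect hint and CRDT merge remain necessary after an eviction.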

1.8 AI Co-Author Budget Exhausted Mid-Session

  • Symptom: AI request returns 429 ai.refused.budget.
  • Mitigation:
    • UI shows budget status; warns at 80%.
    • Fallback to local model (slower, lower quality) with user consent.
    • Admin alert at 95%.
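The 80% and 95% thresholds can be expressed as a small classification function. The state names and the choice to treat 100% as exhausted are assumptions for illustration.

```python
from enum import Enum


class BudgetState(Enum):
    OK = "ok"
    WARN_USER = "warn_user"      # UI warning at 80%
    ALERT_ADMIN = "alert_admin"  # admin alert at 95%
    EXHAUSTED = "exhausted"      # requests would be refused


def budget_state(used: int, limit: int) -> BudgetState:
    """Classify budget usage; integer math avoids float rounding at thresholds."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    if used >= limit:
        return BudgetState.EXHAUSTED
    if used * 100 >= 95 * limit:
        return BudgetState.ALERT_ADMIN
    if used * 100 >= 80 * limit:
        return BudgetState.WARN_USER
    return BudgetState.OK
```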

2. Retry / Backoff

| Op | Max | Backoff | Budget |
|---|---|---|---|
| Postgres write | 3 | 10ms, 50ms, 200ms | 300ms |
| AI call | 2 | 1s | 5s |
| Media asset check | 3 | 200ms, 1s, 3s | 5s |
| Publish saga step | 5 | exp, cap 30s | 15min (saga timeout) |
| SCORM import step | 3 | 5s, 30s, 2m | 10min |
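A fixed backoff schedule with an overall time budget can be sketched as follows. The helper is hypothetical, and it assumes one listed delay per retry and that a retry is skipped when waiting would exceed the budget; the source does not spell out either rule.

```python
import time
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def retry_with_schedule(
    op: Callable[[], T],
    delays: Sequence[float],
    budget_s: float,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> T:
    """Call op; after each failure wait the next delay, within the time budget."""
    start = clock()
    attempt = 0
    while True:
        try:
            return op()
        except Exception:
            # Give up when the schedule is exhausted.
            if attempt >= len(delays):
                raise
            delay = delays[attempt]
            # Give up when waiting would push total elapsed time past the budget.
            if clock() - start + delay > budget_s:
                raise
            sleep(delay)
            attempt += 1
```

Under these assumptions, the Postgres write row would correspond to `retry_with_schedule(write, [0.010, 0.050, 0.200], 0.300)`. The `sleep` and `clock` parameters exist so tests can run without real waiting.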

3. Circuit Breakers

| Target | Trip | Reset |
|---|---|---|
| ai-gateway | 10 / 30s | 60s |
| media-service | 5 / 30s | 60s |
| catalog-service | 5 / 30s | 60s |
| content-service | 5 / 30s | 60s |
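A breaker with these parameters can be sketched as below. It reads "10 / 30s" as ten failures within a 30 second window and "60s" as the time the circuit stays open before a trial call is allowed; both readings are assumptions, as are the class and exception names.

```python
import time
from collections import deque
from typing import Callable, Deque, Optional, TypeVar

T = TypeVar("T")


class CircuitOpen(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, max_failures: int, window_s: float, reset_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.window_s = window_s
        self.reset_s = reset_s
        self.clock = clock
        self.failures: Deque[float] = deque()
        self.opened_at: Optional[float] = None

    def call(self, op: Callable[[], T]) -> T:
        now = self.clock()
        if self.opened_at is not None and now - self.opened_at < self.reset_s:
            raise CircuitOpen("circuit open")
        # Closed, or reset period elapsed: let the call through (trial if open).
        try:
            result = op()
        except Exception:
            self._record_failure(now)
            raise
        if self.opened_at is not None:
            # Trial call succeeded: close the circuit and start fresh.
            self.opened_at = None
            self.failures.clear()
        return result

    def _record_failure(self, now: float) -> None:
        if self.opened_at is not None:
            self.opened_at = now  # trial call failed: reopen
            return
        self.failures.append(now)
        # Drop failures that fell out of the sliding window.
        while self.failures and now - self.failures[0] > self.window_s:
            self.failures.popleft()
        if len(self.failures) >= self.max_failures:
            self.opened_at = now
```

Under that reading, the media-service row would be `CircuitBreaker(max_failures=5, window_s=30.0, reset_s=60.0)`.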

4. Fallbacks

| Primary | Fallback |
|---|---|
| AI cloud model | Local model (reduced quality) |
| Real-time collab | Last-saved snapshot; single-user mode |
| Media preview CDN | Origin direct |
| Publish saga step | Explicit compensation; draft returns to editing |

5. Chaos Experiments

  • Kill collab pod during edit session (verify client reconnect).
  • Inject failure at each publish-saga step (verify compensation).
  • AI-gateway 10s latency spike (verify UX degradation, not error).
  • Postgres primary failover during block write (verify no loss).