Failure Modes
:::info Source
Sourced from services/assessment-service/FAILURE_MODES.md in the documentation repo.
:::
1. Scenarios
1.1 AI Grading Provider Unavailable
- Fallback: queue response for later grading; notify instructor; SLA = 4h.
1.2 Answer Key Decryption Failure
- Symptom: KMS unavailable or DEK lost.
- Response: 503 to scoring endpoint; alert P1; cached answer key for 5 min allows continued serving.
1.3 Scoring Timeout
- Complex rubric grading exceeds 30s → async grading path; learner sees "grading in progress" state.
1.4 Duplicate Submission
- Idempotent on attemptId + attemptNumber; second submission returns first result.
1.5 Offline-Computed Score Mismatch vs Server Recompute
- Server recomputes on ingest; mismatch → log + trust server; alert on high mismatch rate (may indicate tampered bundle).
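
The ingest check in 1.5, sketched as an added example (not code from the assessment service): the server recomputes the score, returns its own value regardless of the client's claim, logs each disagreement, and raises an alert once the mismatch rate over recent ingests is high. The `offlineScore` field name, the scoring rule, the window size and the alert threshold are assumptions; the sample data is constructed.

```python
from collections import deque

# Assumed thresholds; the source only says "high mismatch rate".
WINDOW, MIN_SAMPLES, ALERT_RATE = 50, 8, 0.25

def recompute_score(answers, answer_key):
    """Server-side scoring (assumed rule): one point per answer matching the key."""
    return sum(1 for qid, correct in answer_key.items()
               if answers.get(qid) == correct)

class IngestMonitor:
    def __init__(self):
        self.recent = deque(maxlen=WINDOW)   # True = mismatch, one entry per ingest
        self.mismatch_log = []

    def ingest(self, submission, answer_key):
        """Return the authoritative (server) score; log any disagreement."""
        server = recompute_score(submission["answers"], answer_key)
        claimed = submission.get("offlineScore")  # hypothetical field name
        mismatch = claimed != server              # a missing claim also counts
        self.recent.append(mismatch)
        if mismatch:
            self.mismatch_log.append({"attemptId": submission["attemptId"],
                                      "claimed": claimed, "server": server})
        return server  # the client's claim is never used as the result

    def mismatch_rate(self):
        return sum(self.recent) / len(self.recent) if self.recent else 0.0

    def should_alert(self):
        # Require a minimum sample so one bad bundle does not page anyone.
        return (len(self.recent) >= MIN_SAMPLES
                and self.mismatch_rate() >= ALERT_RATE)

# Constructed sample data (not from the service).
KEY = {"q1": "b", "q2": "d", "q3": "a"}

def make_submission(n, claimed):
    # Two of three answers are correct, so the true score is always 2.
    return {"attemptId": f"attempt-{n}", "offlineScore": claimed,
            "answers": {"q1": "b", "q2": "c", "q3": "a"}}

if __name__ == "__main__":
    mon = IngestMonitor()
    for n in range(8):
        mon.ingest(make_submission(n, claimed=2), KEY)   # claims match
    for n in range(8, 11):
        mon.ingest(make_submission(n, claimed=3), KEY)   # claims inflated
    print(mon.mismatch_rate(), mon.should_alert(), mon.mismatch_log[0])
```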
1.6 Branching Scenario Node Not Found
- Domain invariant at publish prevents orphan; runtime defensive fallback: return to previous node.
1.7 AI Question Generation Refused
- Show author the refusal reason; offer manual path; log for prompt tuning.
1.8 Rubric Grading Disagreement
- AI vs human disagreement > threshold → flag; retrain eval corpus; prompt version may be rolled back.
2. Retry / Backoff
| Op | Max | Backoff |
|---|---|---|
| AI generate | 2 | 1s, 5s |
| AI grade | 3 | 2s, 10s, 30s |
| Postgres write | 3 | 10ms, 50ms, 200ms |
| Outbox | infinite | exp cap 5m |
3. Circuit Breakers
ai-gateway: 10 fail/30s → 60s. KMS: 10 fail/30s → 60s.
4. Fallbacks
| Primary | Fallback |
|---|---|
| AI grading | Queue + notify instructor |
| AI question gen | Manual authoring |
| Real-time scoring | Async grading notification |
5. Chaos
- AI gateway 30s latency → verify UX degradation (not error).
- KMS 30s outage → scoring queue builds up; drains on recovery.
- Bundle tamper → scoring fails cleanly with diagnostic.