
Failure Modes

:::info Source
Sourced from services/certification-service/FAILURE_MODES.md in the documentation repo.
:::

1. Scenarios

1.1 Rendering Failure (PDF/PNG)

  • Cause: headless Chrome crash, malformed template.
  • Mitigation: retry with exponential backoff; fall back to minimal template if template render fails; alert + manual intervention.

1.2 KMS Unavailable for Signing

  • Certificates queue; issuance retries; critical alert if the outage exceeds 10 min.

1.3 Offline Claim Signature Invalid

  • Reject; learner sees reason ("content integrity could not be verified — please resync"); support can manually issue if evidence supports it.

1.4 Duplicate Completion Event

  • Idempotent: one certificate per (enrollment, course_version, state='issued'). Second event returns existing.
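A minimal sketch of the idempotency rule above, assuming an in-memory store keyed by (enrollment, course_version). The class and method names are hypothetical; a real service would more likely enforce this with a unique constraint in its database.

```python
import uuid
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Certificate:
    certificate_id: str
    enrollment_id: str
    course_version: str
    state: str = "issued"


class InMemoryCertificateStore:
    """Hypothetical stand-in for a table with a uniqueness rule on
    (enrollment_id, course_version) for certificates in state 'issued'."""

    def __init__(self) -> None:
        self._issued: Dict[Tuple[str, str], Certificate] = {}

    def issue_once(self, enrollment_id: str, course_version: str) -> Tuple[Certificate, bool]:
        """Return (certificate, created). A duplicate completion event
        gets the existing certificate back with created=False."""
        key = (enrollment_id, course_version)
        existing = self._issued.get(key)
        if existing is not None:
            return existing, False
        cert = Certificate(str(uuid.uuid4()), enrollment_id, course_version)
        self._issued[key] = cert
        return cert, True
```

Calling `issue_once` twice with the same pair returns the same `certificate_id`; a new `course_version` produces a new certificate.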

1.5 Clock Skew in Offline Claim

  • Tolerance ±24h; outside window → rejected with specific reason; learner can request manual review.
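The tolerance check might look like the sketch below. It assumes the ±24h window is measured between the timestamp in the claim and the server's receipt time, which the source does not state; the function name and reason strings are illustrative.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Tolerance from the scenario above.
CLOCK_SKEW_TOLERANCE = timedelta(hours=24)


def check_claim_timestamp(claimed_at: datetime, received_at: datetime) -> Optional[str]:
    """Return None when the claim is inside the window, else a rejection reason.

    Assumption: skew is claim time minus server receipt time.
    """
    skew = claimed_at - received_at
    if abs(skew) <= CLOCK_SKEW_TOLERANCE:
        return None
    direction = "ahead of" if skew > timedelta(0) else "behind"
    return f"claim timestamp is more than 24h {direction} server time"


now = datetime(2024, 1, 1, 12, tzinfo=timezone.utc)
print(check_claim_timestamp(now - timedelta(hours=30), now))
```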

1.6 Public Verify Token Enumeration

  • Rate-limited per IP + per tenant; alert on spike; WAF geo rules.
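One way to combine the per-IP and per-tenant limits is a pair of token buckets, sketched below. The token-bucket algorithm, the class, and the capacities are assumptions for illustration; the source only states that both dimensions are limited.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class _Bucket:
    tokens: float
    updated_at: float


class TokenBucketLimiter:
    """Hypothetical in-memory token bucket, one bucket per key."""

    def __init__(self, capacity: int, refill_per_second: float) -> None:
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self._buckets: Dict[str, _Bucket] = {}

    def _refill(self, key: str, now: float) -> _Bucket:
        bucket = self._buckets.setdefault(key, _Bucket(float(self.capacity), now))
        elapsed = max(0.0, now - bucket.updated_at)
        bucket.tokens = min(float(self.capacity), bucket.tokens + elapsed * self.refill_per_second)
        bucket.updated_at = now
        return bucket

    def has_token(self, key: str, now: float) -> bool:
        return self._refill(key, now).tokens >= 1.0

    def take(self, key: str, now: float) -> None:
        self._refill(key, now).tokens -= 1.0


def allow_verify(ip_limiter: TokenBucketLimiter, tenant_limiter: TokenBucketLimiter,
                 ip: str, tenant_id: str, now: float) -> bool:
    """Allow a verify request only if both the IP and the tenant have budget."""
    # Check both before consuming, so a request blocked by one limiter
    # does not drain the other.
    if not (ip_limiter.has_token(ip, now) and tenant_limiter.has_token(tenant_id, now)):
        return False
    ip_limiter.take(ip, now)
    tenant_limiter.take(tenant_id, now)
    return True
```

Passing `now` explicitly (for example from `time.monotonic()`) keeps the limiter deterministic under test.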

1.7 kid Rotation Without Overlap

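  • Published JWKS is updated with a 2-day overlap between old and new keys; if rotation is rushed and the overlap is skipped, bundles signed under the old kid fail verification and require emergency issuance.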
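To illustrate why the overlap matters, the sketch below publishes both keys during the 2-day window and looks a key up by kid. The function names and the example kid values are hypothetical; only the `{"keys": [...]}` shape comes from the JWKS format.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional

# Overlap window from the scenario above.
KEY_OVERLAP = timedelta(days=2)


def published_jwks(new_key: dict, old_key: dict,
                   rotated_at: datetime, now: datetime) -> Dict[str, List[dict]]:
    """Build the JWKS document; the old key stays listed during the overlap."""
    keys = [new_key]
    if now - rotated_at < KEY_OVERLAP:
        keys.append(old_key)
    return {"keys": keys}


def find_key(jwks: Dict[str, List[dict]], kid: str) -> Optional[dict]:
    """Return the key matching kid, or None (verification then fails)."""
    return next((k for k in jwks["keys"] if k.get("kid") == kid), None)


rotated = datetime(2024, 1, 1, tzinfo=timezone.utc)
old = {"kid": "2023-12", "kty": "EC"}
new = {"kid": "2024-01", "kty": "EC"}
jwks = published_jwks(new, old, rotated, rotated + timedelta(days=1))
print(find_key(jwks, "2023-12") is not None)
```

Once the overlap has elapsed, or if it is skipped, `find_key` returns `None` for the old kid, which corresponds to the "old bundles fail verify" case.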

1.8 CDN Cache Staleness on Revoke

  • Revoke triggers CDN purge + Cache-Control: no-cache on response for 5 min.
  • Short-window stale cache accepted; reinforced by verify endpoint re-check on user click.

1.9 Artifact S3 Outage

  • Learner portfolio shows stub with "temporarily unavailable"; retries via CDN fallback to origin.

1.10 GDPR Anonymization Race

  • Certificate is retained with user_display_name_at_issuance; policy decides whether that name is anonymized on erasure.

2. Retry / Backoff

| Op | Max | Backoff |
|---|---|---|
| KMS sign | 3 | 50ms, 200ms, 500ms |
| Render | 3 | 1s, 5s, 15s |
| S3 upload | 5 | exp 100ms–10s |
| Outbox | infinite | exp cap 5m |
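The retry schedules could be expressed as delay generators, as in the sketch below. Several details are assumptions: "Max" is read as the number of retries, the exponential factor is taken as 2, and the Outbox base delay of 1s is invented for illustration because the source gives only the 5m cap.

```python
import time
from typing import Callable, Dict, Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")


def exponential_delays(base_s: float, cap_s: float,
                       max_retries: Optional[int]) -> Iterator[float]:
    """Yield doubling delays capped at cap_s; max_retries=None means unbounded."""
    delay, n = base_s, 0
    while max_retries is None or n < max_retries:
        yield min(delay, cap_s)
        delay = min(delay * 2, cap_s)
        n += 1


# Delay schedules in seconds; each entry builds a fresh iterator.
RETRY_DELAYS: Dict[str, Callable[[], Iterator[float]]] = {
    "kms_sign": lambda: iter([0.05, 0.2, 0.5]),
    "render": lambda: iter([1.0, 5.0, 15.0]),
    "s3_upload": lambda: exponential_delays(0.1, 10.0, 5),
    "outbox": lambda: exponential_delays(1.0, 300.0, None),  # 1s base is an assumption
}


def call_with_retry(op: Callable[[], T], delays: Iterable[float],
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Call op, sleeping through the delay schedule between failed attempts."""
    schedule = iter(delays)
    while True:
        try:
            return op()
        except Exception:
            delay = next(schedule, None)
            if delay is None:
                raise  # schedule exhausted: surface the last error
            sleep(delay)
```

For example, `call_with_retry(sign, RETRY_DELAYS["kms_sign"]())` would attempt a hypothetical `sign` callable up to four times in total. Production code would usually add jitter and retry only on errors known to be transient.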

3. Circuit Breakers

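Each breaker opens after the listed number of failures within the window and stays open for the listed duration. KMS: 10 failures/30s → open 60s. S3: 10 failures/30s → open 60s. Renderer: 5 failures/60s → open 120s.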
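A minimal sliding-window breaker using these thresholds is sketched below. It reads "N fail/Ws → Ds" as "open for D seconds after N failures within W seconds", which is an interpretation of the source, and it omits a half-open probe state for brevity. The class name is hypothetical.

```python
from collections import deque
from typing import Deque, Dict, Optional


class CircuitBreaker:
    """Opens after failure_threshold failures within window_s seconds,
    then rejects calls for open_s seconds."""

    def __init__(self, failure_threshold: int, window_s: float, open_s: float) -> None:
        self.failure_threshold = failure_threshold
        self.window_s = window_s
        self.open_s = open_s
        self._failures: Deque[float] = deque()
        self._opened_at: Optional[float] = None

    def allow(self, now: float) -> bool:
        """Return True if a call may proceed at time now."""
        if self._opened_at is None:
            return True
        if now - self._opened_at >= self.open_s:
            # Open period elapsed: close and start counting afresh.
            self._opened_at = None
            self._failures.clear()
            return True
        return False

    def record_failure(self, now: float) -> None:
        self._failures.append(now)
        # Drop failures that have aged out of the sliding window.
        while self._failures and now - self._failures[0] > self.window_s:
            self._failures.popleft()
        if len(self._failures) >= self.failure_threshold:
            self._opened_at = now


# Thresholds from the section above.
BREAKERS: Dict[str, CircuitBreaker] = {
    "kms": CircuitBreaker(10, 30.0, 60.0),
    "s3": CircuitBreaker(10, 30.0, 60.0),
    "renderer": CircuitBreaker(5, 60.0, 120.0),
}
```

Callers would check `allow(now)` before each dependency call and invoke `record_failure(now)` when the call fails.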

4. Fallbacks

| Primary | Fallback |
|---|---|
| Custom template render | Minimal default template |
| OpenBadges generator | Basic PDF + PNG only (deferred OB) |
| CDN | Origin direct |
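The template fallback could be wired roughly as follows. `TemplateRenderError`, `render_with_fallback`, and the `render` callable are hypothetical names used only for this sketch; the source does not describe the renderer's interface.

```python
from typing import Callable, Tuple


class TemplateRenderError(Exception):
    """Hypothetical error raised when a template cannot be rendered."""


def render_with_fallback(render: Callable[[str, dict], bytes],
                         custom_template: str,
                         minimal_template: str,
                         data: dict) -> Tuple[bytes, str]:
    """Try the custom template first; on a template error use the minimal one.

    Returns the rendered bytes and which template was used, so the
    caller can record that a fallback happened.
    """
    try:
        return render(custom_template, data), "custom"
    except TemplateRenderError:
        # A failure of the minimal template propagates to the caller,
        # which is where alerting would take over.
        return render(minimal_template, data), "minimal"
```

Returning the template label alongside the artifact makes fallbacks visible in logs or metrics rather than silent.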

5. Chaos

  • KMS 60s outage → issuance queue builds; drains cleanly.
  • Renderer OOM on adversarial template → isolated pod crash; saga retries.
  • Verify spike 10x baseline → rate limit + CDN absorb; alert on breach.