Observability

Source: services/content-service/OBSERVABILITY.md in the documentation repo.

1. Logs

Structured JSON (schema v3 per docs/15-observability-telemetry.md).

Key events:

  • content.package.build.started / .completed / .failed
  • content.bundle.create.started / .completed / .failed
  • content.bundle.sign.completed
  • content.scorm.import.started / .completed / .quarantined / .rejected
  • content.export.started / .completed
  • content.bundle.tamper_detected (AUDIT)
  • content.license.revoked (AUDIT)

Attrs: course_version_id, play_package_id, bundle_id, device_id_hash, size_bytes, encryption_kid.

Redaction: license_signature_jws is logged only as its kid plus a truncated signature prefix.
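The redaction rule can be sketched roughly as follows. This is a minimal illustration, not the service's actual logger: the function names, the output field names `license_signature_kid` and `license_signature_prefix`, and the 8-character prefix length are all assumptions, and the sketch assumes the JWS uses compact serialization (`header.payload.signature`) with a `kid` in the protected header.

```python
import base64
import json

SIGNATURE_PREFIX_LEN = 8  # assumed truncation length; the source does not specify one


def _b64url_decode(segment: str) -> bytes:
    # JWS segments are base64url without padding; restore padding before decoding.
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def redact_license_signature(jws: str) -> dict:
    """Reduce a compact JWS to its key ID and a truncated signature prefix."""
    header_b64, _payload_b64, signature_b64 = jws.split(".")
    header = json.loads(_b64url_decode(header_b64))
    return {
        "license_signature_kid": header.get("kid"),  # hypothetical field name
        "license_signature_prefix": signature_b64[:SIGNATURE_PREFIX_LEN],
    }


def build_log_record(event: str, attrs: dict) -> str:
    """Serialize one log record as a JSON line, redacting license_signature_jws."""
    safe = dict(attrs)
    jws = safe.pop("license_signature_jws", None)
    if jws is not None:
        safe.update(redact_license_signature(jws))
    return json.dumps({"event": event, **safe}, sort_keys=True)
```

Redacting at serialization time, rather than at each call site, means a caller that passes the raw JWS by mistake still cannot leak it into the log stream.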

2. Metrics

2.1 RED (Rate/Errors/Duration)

  • content_api_requests_total{endpoint,method,status,tenant_id} — counter
  • content_api_duration_seconds{endpoint,method} — histogram

2.2 Domain KPIs

  • content_package_build_duration_seconds — histogram (target p95 < 120s)
  • content_bundle_create_duration_seconds — histogram (target p95 < 30s)
  • content_bundle_size_bytes — histogram
  • content_package_revocation_total{reason} — counter
  • content_bundle_tamper_detected_total{tenant_id} — counter (alert if > 0 in 1 h)
  • content_scorm_import_total{outcome="success|quarantined|rejected"} — counter
  • content_export_duration_seconds{format="scorm12|scorm2004|html|xapi"} — histogram

2.3 USE (Utilization/Saturation/Errors)

  • content_worker_queue_depth{queue="build|bundle|import|export"} — gauge
  • content_storage_bytes_used{tenant_id} — gauge

2.4 Cost

  • content_storage_bytes{class="hot|cold"} — gauge (for cost dashboards)
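The p95 targets on the duration histograms above can be checked from cumulative bucket counts. The sketch below is a simplified version of the linear interpolation that Prometheus's `histogram_quantile` performs; the function name and the bucket boundaries in the usage note are illustrative assumptions, not the service's configured buckets.

```python
import math


def estimate_quantile(q: float, buckets: list) -> float:
    """Estimate a quantile from cumulative histogram buckets.

    buckets: (upper_bound, cumulative_count) pairs sorted by upper bound,
    ending with the +Inf bucket. The lowest bucket is assumed to start at 0.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations yet
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for upper, count in buckets:
        if count >= rank:
            if math.isinf(upper):
                # Quantile falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Interpolate linearly inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (upper - prev_bound) * fraction
        prev_bound, prev_count = upper, count
    return prev_bound
```

For example, with hypothetical buckets `[(30, 50), (60, 80), (120, 96), (300, 100), (math.inf, 100)]` for content_package_build_duration_seconds, the estimated p95 is about 116.25 s, which is inside the 120 s target. The estimate is only as precise as the bucket layout, so a bucket boundary at or near each target value keeps the comparison meaningful.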

3. Traces (OpenTelemetry)

Spans:

  • content.package.build → sub-spans: fetch_draft, assemble_manifest, gather_assets, sign, upload
  • content.bundle.create → sub-spans: derive_key, encrypt_assets, sign_envelope, upload, emit_event
  • content.scorm.import → sub-spans: validate_manifest, scan_av, extract, translate_to_package

Every span carries tenant_id, request_id, actor_id_hash, course_version_id, play_package_id, bundle_id.
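The span layout and the required attributes can be written down as data, which makes them easy to assert on in tests. The following is a sketch only: the helper names are hypothetical, and it treats an empty or missing value as "attribute not carried", which the source does not state.

```python
# Attributes that every span is expected to carry, per the list above.
REQUIRED_SPAN_ATTRS = (
    "tenant_id",
    "request_id",
    "actor_id_hash",
    "course_version_id",
    "play_package_id",
    "bundle_id",
)

# Root spans and their documented sub-spans.
SUB_SPANS = {
    "content.package.build": [
        "fetch_draft", "assemble_manifest", "gather_assets", "sign", "upload",
    ],
    "content.bundle.create": [
        "derive_key", "encrypt_assets", "sign_envelope", "upload", "emit_event",
    ],
    "content.scorm.import": [
        "validate_manifest", "scan_av", "extract", "translate_to_package",
    ],
}


def missing_span_attrs(attrs: dict) -> list:
    """Return the required attribute names that are absent or empty."""
    return [name for name in REQUIRED_SPAN_ATTRS if attrs.get(name) in (None, "")]


def is_known_sub_span(root: str, child: str) -> bool:
    """True if `child` is a documented sub-span of the root span `root`."""
    return child in SUB_SPANS.get(root, [])
```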

4. Dashboards (Grafana)

  • Build Pipeline — package build duration, queue depth, failure rate.
  • Bundle Factory — bundle create duration, size distribution, per-tenant rate.
  • Offline Trust — tamper events, revocations, license expiry near-term.
  • SCORM Import — success/quarantine/reject rates; import duration.
  • Exports — export format × duration.
  • Storage Cost — per-tenant hot + cold bytes.

5. Alerts

| Alert | Threshold | Severity | Runbook |
| --- | --- | --- | --- |
| content-build-failure-rate | > 2% for 15 min | P2 | runbooks/content/build-failures.md |
| content-bundle-tamper-detected | > 0 in 1h for a tenant | P1 | runbooks/content/tamper.md |
| content-bundle-backlog | > 10k queued | P2 | runbooks/content/queue-backlog.md |
| content-saga-timeout | any saga > 15 min | P2 | runbooks/content/saga-timeout.md |
| content-scorm-import-rejected-spike | > 10/min | P3 | runbooks/content/scorm-spike.md |
| content-signing-kms-failure | > 5 fail / 1 min | P1 | runbooks/content/kms.md |
| content-outbox-lag | > 30s p99 | P2 | runbooks/content/outbox.md |
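A threshold such as "> 2% for 15 min" implies that the condition must hold continuously for the whole period before the alert fires. The sketch below illustrates that semantics for content-build-failure-rate. It assumes one failure-rate sample per minute and a hypothetical function name; the source does not say how the rate is sampled or which alerting engine evaluates it.

```python
def build_failure_alert_firing(
    rates: list,
    threshold: float = 0.02,
    hold_samples: int = 15,
) -> bool:
    """Decide whether the build-failure-rate alert should fire.

    rates: per-minute failure rates (failed / total builds), oldest first.
    The alert fires only if each of the most recent `hold_samples` samples
    exceeds `threshold`; a single sample at or below it resets the condition.
    """
    if len(rates) < hold_samples:
        return False  # not enough history to cover the hold period
    return all(rate > threshold for rate in rates[-hold_samples:])
```

Requiring the full hold period keeps a short burst of failed builds from paging anyone, at the cost of a 15-minute detection delay for a sustained failure.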

6. SLOs

| SLI | Target | Error Budget |
| --- | --- | --- |
| Package build success rate | ≥ 99% | 1% |
| Bundle create success rate | ≥ 99.5% | 0.5% |
| Signing latency p95 | < 200ms | |
| Bundle download signed URL validity | 99.99% | |
| Tamper detection → revocation | < 60s online | |

7. Error Budget Policy

  • Budgets are measured over a 30-day rolling window; if less than 20% of a budget remains, content-service enters a feature freeze until the budget recovers.
  • CTO sign-off required to override freeze.
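The freeze rule reduces to simple arithmetic over the 30-day window. The sketch below is one plausible reading of it for a success-rate SLI: the function names are hypothetical, and counting the budget in failed requests (rather than, say, bad minutes) is an assumption.

```python
def error_budget_remaining(error_budget: float, total: int, failed: int) -> float:
    """Fraction of the error budget still unspent in the rolling window.

    error_budget: allowed failure fraction, e.g. 0.01 for a 99% target.
    total / failed: event counts over the 30-day window.
    A negative result means the budget is overspent.
    """
    if not 0 < error_budget < 1:
        raise ValueError("error_budget must be between 0 and 1")
    if total == 0:
        return 1.0  # nothing happened, so nothing was spent
    allowed_failures = error_budget * total
    return 1.0 - failed / allowed_failures


def feature_freeze_required(
    error_budget: float, total: int, failed: int, min_remaining: float = 0.20
) -> bool:
    """True if less than `min_remaining` of the budget is left."""
    return error_budget_remaining(error_budget, total, failed) < min_remaining
```

As a worked case with made-up counts: at a 1% budget and 20,000 package builds in the window, 200 failures are allowed. With 170 failed builds, 15% of the budget remains, which is below the 20% line and would trigger the freeze; with 100 failed builds, 50% remains and it would not.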