Observability
:::info Source
Sourced from services/content-service/OBSERVABILITY.md in the documentation repo.
:::
1. Logs
Structured JSON (schema v3 per docs/15-observability-telemetry.md).
Key events:
- content.package.build.started / .completed / .failed
- content.bundle.create.started / .completed / .failed
- content.bundle.sign.completed
- content.scorm.import.started / .completed / .quarantined / .rejected
- content.export.started / .completed
- content.bundle.tamper_detected (AUDIT)
- content.license.revoked (AUDIT)
Attrs: course_version_id, play_package_id, bundle_id, device_id_hash, size_bytes, encryption_kid.
Redaction: license_signature_jws is never logged in full; only its kid and a truncated signature prefix are recorded.
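
The redaction rule above can be sketched as follows. This is a minimal illustration, not the service's actual logger: the field names `schema_version`, `timestamp`, `event`, and `license_signature`, the helper names, and the 8-character prefix length are all assumptions, since the schema v3 definition and the prefix length are not given here.

```python
import json
from datetime import datetime, timezone
from typing import Optional

# Assumption: the prefix length is not specified in this document.
SIGNATURE_PREFIX_LEN = 8


def redact_license_signature(jws: str, kid: str,
                             prefix_len: int = SIGNATURE_PREFIX_LEN) -> dict:
    """Reduce a compact JWS to its key ID plus a truncated signature prefix."""
    # A compact JWS has the form header.payload.signature,
    # so the signature is the segment after the last dot.
    signature = jws.rsplit(".", 1)[-1]
    return {"kid": kid, "signature_prefix": signature[:prefix_len]}


def build_log_event(event: str, attrs: dict,
                    license_signature_jws: Optional[str] = None,
                    kid: Optional[str] = None) -> str:
    """Serialize one structured log line; the full JWS never enters the record."""
    record = {
        "schema_version": 3,  # hypothetical field name for "schema v3"
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "event": event,
        **attrs,
    }
    if license_signature_jws is not None:
        record["license_signature"] = redact_license_signature(
            license_signature_jws, kid or "unknown"
        )
    return json.dumps(record, sort_keys=True)
```

A call such as `build_log_event("content.bundle.sign.completed", {"bundle_id": "b-1"}, license_signature_jws=jws, kid="k-1")` yields a JSON line that carries the kid and the first eight signature characters only.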
2. Metrics
2.1 RED (Rate/Errors/Duration)
- content_api_requests_total{endpoint,method,status,tenant_id}: counter
- content_api_duration_seconds{endpoint,method}: histogram
2.2 Domain KPIs
- content_package_build_duration_seconds: histogram (target p95 < 120s)
- content_bundle_create_duration_seconds: histogram (target p95 < 30s)
- content_bundle_size_bytes: histogram
- content_package_revocation_total{reason}: counter
- content_bundle_tamper_detected_total{tenant_id}: counter (alert if > 0 in 1h)
- content_scorm_import_total{outcome="success|quarantined|rejected"}: counter
- content_export_duration_seconds{format="scorm12|scorm2004|html|xapi"}: histogram
2.3 USE (Utilization/Saturation/Errors)
- content_worker_queue_depth{queue="build|bundle|import|export"}: gauge
- content_storage_bytes_used{tenant_id}: gauge
2.4 Cost
- content_storage_bytes{class="hot|cold"}: gauge (for cost dashboards)
3. Traces (OpenTelemetry)
Spans:
- content.package.build → sub-spans: fetch_draft, assemble_manifest, gather_assets, sign, upload
- content.bundle.create → sub-spans: derive_key, encrypt_assets, sign_envelope, upload, emit_event
- content.scorm.import → sub-spans: validate_manifest, scan_av, extract, translate_to_package
Every span carries tenant_id, request_id, actor_id_hash, course_version_id, play_package_id, bundle_id.
4. Dashboards (Grafana)
- Build Pipeline — package build duration, queue depth, failure rate.
- Bundle Factory — bundle create duration, size distribution, per-tenant rate.
- Offline Trust — tamper events, revocations, license expiry near-term.
- SCORM Import — success/quarantine/reject rates; import duration.
- Exports — export format × duration.
- Storage Cost — per-tenant hot + cold bytes.
5. Alerts
| Alert | Threshold | Severity | Runbook |
|---|---|---|---|
| content-build-failure-rate | > 2% for 15 min | P2 | runbooks/content/build-failures.md |
| content-bundle-tamper-detected | > 0 in 1h for a tenant | P1 | runbooks/content/tamper.md |
| content-bundle-backlog | > 10k queued | P2 | runbooks/content/queue-backlog.md |
| content-saga-timeout | any saga > 15 min | P2 | runbooks/content/saga-timeout.md |
| content-scorm-import-rejected-spike | > 10/min | P3 | runbooks/content/scorm-spike.md |
| content-signing-kms-failure | > 5 fail / 1 min | P1 | runbooks/content/kms.md |
| content-outbox-lag | > 30s p99 | P2 | runbooks/content/outbox.md |
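
A threshold of the form "> 2% for 15 min" fires only when the condition holds continuously over the whole period, not on a single bad sample. The sketch below illustrates that reading for content-build-failure-rate. It assumes one failure-rate sample per minute, and the function name and input shape are hypothetical; the actual alert rules are not shown in this document.

```python
from typing import Sequence


def sustained_breach(samples: Sequence[float], threshold: float,
                     for_samples: int) -> bool:
    """Return True when the most recent `for_samples` samples all exceed `threshold`.

    `samples` is ordered oldest to newest. With one sample per minute
    (an assumption), for_samples=15 models "for 15 min".
    """
    if for_samples <= 0:
        raise ValueError("for_samples must be positive")
    if len(samples) < for_samples:
        # Not enough history yet to say the breach was sustained.
        return False
    return all(s > threshold for s in samples[-for_samples:])


# Failure rate above 2% in each of the last 15 one-minute samples: alert fires.
fires = sustained_breach([0.03] * 15, threshold=0.02, for_samples=15)

# A single recovered sample inside the window resets the condition.
recovered = sustained_breach([0.03] * 14 + [0.01], threshold=0.02, for_samples=15)
```

Requiring a sustained breach trades a few minutes of detection latency for fewer pages on brief spikes, which suits a P2 alert; the P1 tamper alert, by contrast, fires on any single event within its window.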
6. SLOs
| SLI | Target | Error Budget |
|---|---|---|
| Package build success rate | ≥ 99% | 1% |
| Bundle create success rate | ≥ 99.5% | 0.5% |
| Signing latency p95 | < 200ms | — |
| Bundle download signed URL validity | 99.99% | — |
| Tamper detection → revocation | < 60s online | — |
7. Error Budget Policy
- The error budget is measured over a 30-day rolling window; if less than 20% of the budget remains, content-service enters a feature freeze until the budget recovers.
- CTO sign-off required to override freeze.
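
As a worked example of the policy, the remaining budget can be computed from request counts over the 30-day window. This is a sketch under stated assumptions: the function names are hypothetical, and it treats the budget as the share of allowed failures not yet consumed, which is one common way to define it.

```python
def error_budget_remaining(total: int, failed: int, slo_target: float) -> float:
    """Fraction of the error budget still unspent (negative once overspent).

    `total` and `failed` are event counts over the rolling window;
    `slo_target` is the success-rate target, e.g. 0.99 for >= 99%.
    """
    if total == 0:
        return 1.0  # no traffic, so no budget has been spent
    allowed_failures = (1.0 - slo_target) * total
    if allowed_failures <= 0:
        # A 100% target leaves no budget: any failure exhausts it.
        return 1.0 if failed == 0 else float("-inf")
    return 1.0 - failed / allowed_failures


def should_freeze(total: int, failed: int, slo_target: float,
                  freeze_below: float = 0.20) -> bool:
    """Apply the freeze rule: less than 20% of the budget remaining."""
    return error_budget_remaining(total, failed, slo_target) < freeze_below


# Package builds: 10,000 in the window with a 99% target allows about 100 failures.
# 85 failures leaves roughly 15% of the budget, which triggers the freeze.
freeze = should_freeze(total=10_000, failed=85, slo_target=0.99)

# 50 failures leaves roughly 50% of the budget, so no freeze.
no_freeze = should_freeze(total=10_000, failed=50, slo_target=0.99)
```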