Observability

:::info Source
Sourced from services/progress-service/OBSERVABILITY.md in the documentation repo.
:::

1. Logs

  • Structured (schema v3).
  • Events: progress.statement.ingested, progress.statement.rejected, progress.attempt.opened, progress.attempt.closed, progress.completion.recorded, progress.xapi.query.served, progress.replay.started.
  • Attrs: statement_id, attempt_id, enrollment_id, verb_id, tenant_id, partition.
  • Log actor identity only as a hash and redact all other actor PII; never log quiz response bodies.
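The logging rules above can be sketched as a small helper. This is a minimal illustration, not the service's actual logger: the field names `schema_version` and `event`, the salted SHA-256 hashing scheme, and the `salt` parameter are assumptions. Only the attribute names and the `actor_id_hash` key come from this document.

```python
import hashlib
import json

SCHEMA_VERSION = 3  # "schema v3" from the list above

# Attribute names listed in this section; anything else is dropped.
ALLOWED_ATTRS = {
    "statement_id", "attempt_id", "enrollment_id",
    "verb_id", "tenant_id", "partition",
}


def hash_actor(actor_id: str, salt: str) -> str:
    """Return a salted SHA-256 hex digest (assumed scheme) of the actor identifier."""
    return hashlib.sha256(f"{salt}:{actor_id}".encode("utf-8")).hexdigest()


def build_log_event(event: str, attrs: dict, actor_id: str, salt: str) -> str:
    """Serialize one log line, keeping only allow-listed attributes."""
    record = {"schema_version": SCHEMA_VERSION, "event": event}
    # Dropping unknown keys keeps quiz response bodies and raw PII out of the log.
    record.update({k: v for k, v in attrs.items() if k in ALLOWED_ATTRS})
    record["actor_id_hash"] = hash_actor(actor_id, salt)
    return json.dumps(record, sort_keys=True)
```

An allow-list is used here rather than a deny-list so that a newly added field is excluded until someone deliberately permits it.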

2. Metrics

RED (rate, errors, duration)

  • progress_xapi_ingest_total{status,tenant_id} — counter
  • progress_xapi_ingest_duration_seconds{batch_size_bucket} — histogram
  • progress_api_requests_total{endpoint,method,status} — counter
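The `batch_size_bucket` label implies that raw batch sizes are mapped to a small, fixed set of label values so the histogram's cardinality stays bounded. A minimal sketch of such a mapping follows; the bucket edges and label strings are hypothetical, since this document does not specify them.

```python
import bisect

# Hypothetical inclusive upper bounds; the real bucket edges are not given here.
BUCKET_UPPER_BOUNDS = [1, 10, 50, 100, 500]
BUCKET_LABELS = ["1", "2-10", "11-50", "51-100", "101-500", ">500"]


def batch_size_bucket(batch_size: int) -> str:
    """Map a batch size to one of a fixed set of label values."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    # bisect_left returns the index of the first bound >= batch_size,
    # or len(bounds) when batch_size exceeds every bound.
    return BUCKET_LABELS[bisect.bisect_left(BUCKET_UPPER_BOUNDS, batch_size)]
```

With these assumed edges, a batch of 10 statements falls in `2-10` and a batch of 501 falls in `>500`.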

Domain

  • progress_statements_stored_total{tenant_id,verb} — counter
  • progress_attempts_opened_total, progress_attempts_closed_total{outcome} — counter
  • progress_completions_recorded_total{tenant_id} — counter
  • progress_transcript_render_duration_seconds — histogram
  • progress_xapi_query_duration_seconds{filters_hash} — histogram

USE (utilization, saturation, errors)

  • progress_outbox_lag_seconds — gauge
  • progress_inbox_dedup_ratio — gauge
  • progress_partition_row_count{partition} — gauge

3. Traces

Spans:

  • progress.xapi.statement.post → validate, dedup, persist, emit
  • progress.attempt.close → aggregate statements, set outcome, emit
  • progress.completion.record → idempotent insert, emit integration event
  • progress.xapi.query → parse filters, query Postgres, serialize

Every span carries the attributes tenant_id, attempt_id, and actor_id_hash.
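To illustrate the span structure and the required attributes, here is a standard-library-only sketch. A real service would most likely use a tracing SDK; the `span` helper, the `finished_spans` list, and `post_statement` are hypothetical stand-ins, and only the span names and attribute names come from this document.

```python
import time
from contextlib import contextmanager

REQUIRED_ATTRS = ("tenant_id", "attempt_id", "actor_id_hash")

finished_spans = []  # stand-in for a span exporter


@contextmanager
def span(name, parent=None, **attrs):
    """Record a span, rejecting it up front if a required attribute is missing."""
    missing = [k for k in REQUIRED_ATTRS if k not in attrs]
    if missing:
        raise ValueError(f"span {name!r} missing attributes: {missing}")
    record = {"name": name, "parent": parent, "attrs": attrs,
              "start": time.monotonic()}
    try:
        yield record
    finally:
        record["duration_s"] = time.monotonic() - record["start"]
        finished_spans.append(record)


def post_statement(common_attrs):
    """Mirror progress.xapi.statement.post and its four child steps."""
    with span("progress.xapi.statement.post", **common_attrs) as root:
        for step in ("validate", "dedup", "persist", "emit"):
            with span(step, parent=root["name"], **common_attrs):
                pass  # the real work for each step would go here
```

Child spans finish before their parent, so `finished_spans` ends with the root span.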

4. Dashboards

  • Ingest — TPS, error rate, batch-size distribution.
  • Attempt Lifecycle — open/close rates per tenant.
  • Completion Funnel — attempts started → completed → passed.
  • xAPI Query — query duration by filter shape.
  • Outbox Health — lag, DLQ depth.
  • Partition Management — row counts, detach readiness.

5. Alerts

| Alert | Threshold | Severity | Runbook |
|---|---|---|---|
| progress-ingest-error-rate | > 0.5% for 10 min | P2 | runbooks/progress/ingest-errors.md |
| progress-completion-gap | p95 statement→completion event > 30s | P2 | runbooks/progress/completion-lag.md |
| progress-outbox-lag | > 60s p99 | P2 | runbooks/progress/outbox.md |
| progress-dlq-non-empty | any msg in DLQ | P2 | runbooks/progress/dlq.md |
| progress-attempt-open-stale | attempt open > 7 days | P3 | runbooks/progress/stale-attempts.md |
| progress-xapi-query-slow | p95 > 2s | P3 | runbooks/progress/query-perf.md |
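A threshold such as "> 0.5% for 10 min" is a sustained condition: the rate must stay above the limit for the whole window before the alert fires. The sketch below shows one way to evaluate that, assuming one sample per minute; the `Sample` type and `should_fire` function are hypothetical, and the actual alerting backend is not named in this document.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    minute: int   # minutes since an arbitrary epoch, one sample per minute
    total: int    # ingest requests observed in that minute
    errors: int   # failed ingest requests in that minute


def should_fire(samples, threshold=0.005, for_minutes=10):
    """True when the latest `for_minutes` consecutive samples all exceed the threshold."""
    recent = sorted(samples, key=lambda s: s.minute)[-for_minutes:]
    if len(recent) < for_minutes:
        return False
    # Assuming distinct minutes, this span check rejects windows with gaps.
    if recent[-1].minute - recent[0].minute != for_minutes - 1:
        return False
    return all(s.total > 0 and s.errors / s.total > threshold for s in recent)
```

A single minute at or below the threshold resets the condition, which keeps short error spikes from paging anyone.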

6. SLOs

| SLI | Target | Budget |
|---|---|---|
| Ingestion availability | 99.99% | 4.38 min/mo |
| Statement-to-completion-event p95 | < 5s | 1% > 30s |
| xAPI query p95 | < 300ms | 1% > 1s |
| Transcript render p95 | < 2s | 1% > 10s |
| Outbox lag p99 | < 30s | |
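The availability budget follows directly from the target: allowed downtime is (1 − target) × window length. The sketch below reproduces the arithmetic. The 4.38 min/mo figure appears consistent with an average calendar month of 365.25 / 12 days, whereas a strict 30-day window gives 4.32 minutes. The function name is illustrative.

```python
def downtime_budget_minutes(target: float, window_days: float) -> float:
    """Allowed downtime, in minutes, for an availability target over a window."""
    return (1.0 - target) * window_days * 24 * 60


AVERAGE_MONTH_DAYS = 365.25 / 12  # about 30.44 days

per_average_month = downtime_budget_minutes(0.9999, AVERAGE_MONTH_DAYS)
per_30_days = downtime_budget_minutes(0.9999, 30)
```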

7. Error Budget Policy

Error budgets are evaluated over a 30-day rolling window. When a budget is exhausted, non-safety feature work is frozen.

8. RUM (for admin transcript UI)

  • Transcript page LCP target < 2s.
  • Query panel INP < 200ms.