Observability
:::info Source
Sourced from services/progress-service/OBSERVABILITY.md in the documentation repo.
:::
1. Logs
- Structured (schema v3).
- Events: progress.statement.ingested, progress.statement.rejected, progress.attempt.opened, progress.attempt.closed, progress.completion.recorded, progress.xapi.query.served, progress.replay.started.
- Attrs: statement_id, attempt_id, enrollment_id, verb_id, tenant_id, partition.
- Redact actor PII beyond hash; no quiz response bodies in logs.
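
The logging rules above can be sketched as an allowlist plus a hashed actor identifier. This is a minimal illustration, not the service's actual logger: the function names, the JSON output, the `event`, `schema_version`, and `actor_id_hash` keys, and the choice of SHA-256 are all assumptions.

```python
import hashlib
import json

# Attributes listed in this section; anything else is dropped before logging.
ALLOWED_ATTRS = {
    "statement_id", "attempt_id", "enrollment_id",
    "verb_id", "tenant_id", "partition",
}


def hash_actor(actor_id: str) -> str:
    """Return a hex digest so the raw actor identifier never reaches the log.

    SHA-256 is an assumption; the source only says "hash".
    """
    return hashlib.sha256(actor_id.encode("utf-8")).hexdigest()


def build_log_record(event: str, actor_id: str, attrs: dict) -> str:
    """Serialize one structured log line, keeping only allowlisted attributes."""
    record = {
        "schema_version": 3,  # "schema v3" per this section; the key name is hypothetical
        "event": event,
        "actor_id_hash": hash_actor(actor_id),
    }
    # The allowlist drops quiz response bodies and any other unexpected field.
    record.update({k: v for k, v in attrs.items() if k in ALLOWED_ATTRS})
    return json.dumps(record, sort_keys=True)


line = build_log_record(
    "progress.statement.ingested",
    actor_id="mailto:learner@example.com",
    attrs={"statement_id": "s-1", "tenant_id": "t-9", "response": "secret answer"},
)
```

An allowlist fails closed: a new field stays out of the logs until someone adds it deliberately.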
2. Metrics
RED
- progress_xapi_ingest_total{status,tenant_id} (counter)
- progress_xapi_ingest_duration_seconds{batch_size_bucket} (histogram)
- progress_api_requests_total{endpoint,method,status} (counter)
Domain
- progress_statements_stored_total{tenant_id,verb} (counter)
- progress_attempts_opened_total, progress_attempts_closed_total{outcome} (counter)
- progress_completions_recorded_total{tenant_id} (counter)
- progress_transcript_render_duration_seconds (histogram)
- progress_xapi_query_duration_seconds{filters_hash} (histogram)
USE
- progress_outbox_lag_seconds (gauge)
- progress_inbox_dedup_ratio (gauge)
- progress_partition_row_count{partition} (gauge)
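
One plausible way to derive the filters_hash label is to hash which filters a query uses rather than their values, so label cardinality stays bounded. The sketch below assumes that reading; `filter_shape_hash`, SHA-256, and the 8-character truncation are hypothetical choices, not taken from the service.

```python
import hashlib


def filter_shape_hash(filters: dict) -> str:
    """Hash the sorted filter names only, so queries that differ
    only in filter values share one label value."""
    shape = ",".join(sorted(filters))
    return hashlib.sha256(shape.encode("utf-8")).hexdigest()[:8]


# Same filter names, different values and ordering: same label.
a = filter_shape_hash({"verb": "completed", "agent": "a1"})
b = filter_shape_hash({"agent": "a2", "verb": "passed"})
# A different set of filter names yields a different label.
c = filter_shape_hash({"verb": "completed"})
```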
3. Traces
Spans:
- progress.xapi.statement.post → validate, dedup, persist, emit
- progress.attempt.close → aggregate statements, set outcome, emit
- progress.completion.record → idempotent insert, emit integration event
- progress.xapi.query → parse filters, query Postgres, serialize
Every span: tenant_id, attempt_id, actor_id_hash.
4. Dashboards
- Ingest — TPS, error rate, batch-size distribution.
- Attempt Lifecycle — open/close rates per tenant.
- Completion Funnel — attempts started → completed → passed.
- xAPI Query — query duration by filter shape.
- Outbox Health — lag, DLQ depth.
- Partition Management — row counts, detach readiness.
5. Alerts
| Alert | Threshold | Severity | Runbook |
|---|---|---|---|
| progress-ingest-error-rate | > 0.5% for 10 min | P2 | runbooks/progress/ingest-errors.md |
| progress-completion-gap | p95 statement→completion event > 30s | P2 | runbooks/progress/completion-lag.md |
| progress-outbox-lag | > 60s p99 | P2 | runbooks/progress/outbox.md |
| progress-dlq-non-empty | any msg in DLQ | P2 | runbooks/progress/dlq.md |
| progress-attempt-open-stale | attempt open > 7 days | P3 | runbooks/progress/stale-attempts.md |
| progress-xapi-query-slow | p95 > 2s | P3 | runbooks/progress/query-perf.md |
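
A threshold such as "> 0.5% for 10 min" fires only when the breach is sustained, not on a single bad sample. The sketch below illustrates that semantics, assuming one error-rate sample per minute; `sustained_breach` and the sampling interval are hypothetical and do not describe the actual alerting backend.

```python
def sustained_breach(error_rates: list[float],
                     threshold: float = 0.005,
                     minutes: int = 10) -> bool:
    """Return True when each of the last `minutes` one-minute samples
    exceeds the threshold (0.005 corresponds to 0.5%)."""
    if len(error_rates) < minutes:
        return False  # not enough history to call the breach sustained
    return all(rate > threshold for rate in error_rates[-minutes:])


steady = sustained_breach([0.01] * 10)            # breached for the full window
recovered = sustained_breach([0.01] * 9 + [0.001])  # latest sample is healthy
too_short = sustained_breach([0.01] * 5)          # only five minutes of data
```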
6. SLOs
| SLI | Target | Budget |
|---|---|---|
| Ingestion availability | 99.99% | 4.38 min/mo |
| Statement-to-completion-event p95 | < 5s | 1% > 30s |
| xAPI query p95 | < 300ms | 1% > 1s |
| Transcript render p95 | < 2s | 1% > 10s |
| Outbox lag p99 | < 30s | — |
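
The availability budget follows directly from the target. The 4.38 min/mo figure corresponds to an average calendar month (365.25 / 12 days); over a strict 30-day window the same 99.99% target allows about 4.32 minutes. A short worked calculation, with `downtime_budget_minutes` as an illustrative helper name:

```python
def downtime_budget_minutes(target: float, days: float) -> float:
    """Allowed unavailability in minutes for an availability target
    over a window of `days` days."""
    return (1.0 - target) * days * 24 * 60


average_month = downtime_budget_minutes(0.9999, 365.25 / 12)  # about 4.38 minutes
thirty_days = downtime_budget_minutes(0.9999, 30)             # about 4.32 minutes
```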
7. Error Budget Policy
30-day rolling; freeze non-safety features on budget exhaustion.
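
The freeze rule can be read as a simple check against the rolling window. This is a sketch under stated assumptions: that the budget is tracked as minutes of failed ingestion availability, and that "exhaustion" means consumed minutes have reached the full budget. `should_freeze` and the way `bad_minutes` is measured are hypothetical.

```python
def should_freeze(bad_minutes: float,
                  target: float = 0.9999,
                  window_days: int = 30) -> bool:
    """Return True once the rolling window has used its whole error budget."""
    budget_minutes = (1.0 - target) * window_days * 24 * 60
    return bad_minutes >= budget_minutes


within_budget = should_freeze(2.0)  # 2 bad minutes in the last 30 days
exhausted = should_freeze(5.0)      # 5 bad minutes exceeds the roughly 4.32-minute budget
```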
8. RUM (for admin transcript UI)
- Transcript page LCP target < 2s.
- Query panel INP < 200ms.