Observability
:::info Source
Sourced from services/authoring-service/10-OBSERVABILITY.md in the documentation repo.
:::
Companion: 15 Observability & Telemetry
1. Instrumentation Stack
| Pillar | Tooling |
|---|---|
| Tracing | OpenTelemetry SDK (Node) → OTel Collector → Tempo/SigNoz |
| Metrics | OpenTelemetry SDK → OTel Collector → Prometheus/Mimir |
| Logs | Pino (JSON) → OTel Collector → Loki → S3 Parquet (cold) |
| Profiling | Pyroscope (pprof format) |
Every log line, span, and metric exemplar carries: trace_id, tenant_id, user_id_hash, request_id, service=authoring, commit=$SHA, instance.
2. Structured Log Schema
interface AuthoringLogLine {
  timestamp: ISODate;
  level: 'trace'|'debug'|'info'|'warn'|'error'|'fatal';
  service: 'authoring-service';
  instance: string;
  commit: string;
  log_schema_version: '1.0';
  trace_id?: string;
  span_id?: string;
  request_id?: string;
  tenant_id?: string;
  user_id_hash?: string; // sha256(userId + salt); never the raw ID
  actor_type?: 'user'|'system'|'api_key';
  event: string; // canonical event name, e.g. 'draft.created'
  resource_type?: string;
  resource_id?: string;
  duration_ms?: number;
  status?: string;
  error?: { code: string; message: string; stack?: string };
  meta?: Record<string, unknown>; // non-PII context
}
The RedactingLogger wrapper applies redaction before a line is emitted. Free-form strings containing user data are not allowed, and meta values pass through a PII scanner before they are written.
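As a rough illustration of the meta scan, the sketch below replaces email-like substrings in meta values before a line is written. The names `redactMeta`, `redactValue`, and `EMAIL_PATTERN` are hypothetical, and the single regex only stands in for whatever PII scanner the service uses; the real RedactingLogger is not shown in this document.

```typescript
// Hypothetical sketch: the real RedactingLogger and PII scanner are not shown
// in this document. One email regex stands in for the scanner.
const EMAIL_PATTERN = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;

type Meta = Record<string, unknown>;

// Replaces email-like substrings in string values, recursing into nested
// objects and arrays. Non-string leaves are returned unchanged.
function redactValue(value: unknown): unknown {
  if (typeof value === "string") {
    return value.replace(EMAIL_PATTERN, "[REDACTED]");
  }
  if (Array.isArray(value)) {
    return value.map((item) => redactValue(item));
  }
  if (value !== null && typeof value === "object") {
    const out: Meta = {};
    for (const [key, inner] of Object.entries(value as Meta)) {
      out[key] = redactValue(inner);
    }
    return out;
  }
  return value;
}

// Returns a redacted copy of `meta`; the input object is not mutated.
function redactMeta(meta: Meta): Meta {
  return redactValue(meta) as Meta;
}
```

A production scanner would cover more than email addresses (names, phone numbers, free-text identifiers); the point here is only that redaction runs on a copy of `meta` before the log line is emitted.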
3. Metric Catalog
3.1 RED (Rate, Errors, Duration)
| Metric | Type | Labels | Purpose |
|---|---|---|---|
| `authoring_http_requests_total` | counter | method, route, status, tenant | HTTP request rate |
| `authoring_http_request_duration_seconds` | histogram | method, route, status | HTTP latency |
| `authoring_http_errors_total` | counter | method, route, error_code | Error taxonomy |
| `authoring_ws_connections_active` | gauge | tenant | Active collab sessions |
| `authoring_ws_messages_total` | counter | tenant, direction | WS message rate |
3.2 Domain-Specific
| Metric | Type | Labels |
|---|---|---|
| `authoring_drafts_total` | gauge | tenant, state |
| `authoring_draft_state_transitions_total` | counter | tenant, from_state, to_state |
| `authoring_blocks_added_total` | counter | tenant, kind, source (manual/ai) |
| `authoring_blocks_removed_total` | counter | tenant, kind |
| `authoring_ai_requests_total` | counter | tenant, flow, status |
| `authoring_ai_duration_seconds` | histogram | tenant, flow |
| `authoring_ai_tokens_total` | counter | tenant, flow, direction |
| `authoring_ai_cost_microusd_total` | counter | tenant, flow, model |
| `authoring_ai_acceptance_rate` | gauge | tenant, flow |
| `authoring_ai_moderation_blocks_total` | counter | tenant, flow, verdict |
| `authoring_publish_saga_duration_seconds` | histogram | tenant, step, outcome |
| `authoring_publish_saga_active` | gauge | tenant, step |
| `authoring_publish_saga_timeouts_total` | counter | tenant |
| `authoring_publish_saga_compensations_total` | counter | tenant, step |
| `authoring_scorm_import_duration_seconds` | histogram | tenant, scorm_version, outcome |
| `authoring_scorm_import_total` | counter | tenant, scorm_version, status |
| `authoring_collab_session_duration_seconds` | histogram | tenant |
| `authoring_time_to_first_block_seconds` | histogram | tenant |
3.3 Outbox / Event Health
| Metric | Type | Labels |
|---|---|---|
| `authoring_outbox_pending_total` | gauge | — |
| `authoring_outbox_publish_duration_seconds` | histogram | topic |
| `authoring_outbox_retries_total` | counter | topic |
| `authoring_outbox_dlq_total` | counter | topic |
| `authoring_inbox_processed_total` | counter | topic, status |
| `authoring_inbox_duration_seconds` | histogram | topic |
3.4 USE (Utilization, Saturation, Errors) for Resources
| Metric | Type | Labels |
|---|---|---|
| `authoring_db_pool_active` | gauge | — |
| `authoring_db_pool_idle` | gauge | — |
| `authoring_db_pool_wait_seconds` | histogram | — |
| `authoring_db_query_duration_seconds` | histogram | operation |
| `authoring_memory_heap_bytes` | gauge | — |
| `authoring_event_loop_lag_seconds` | histogram | — |
4. Distributed Tracing
4.1 Key Spans
| Span name | Attributes |
|---|---|
| `http.request` | method, route, status, tenant_id |
| `draft.create` | draft_id, tenant_id |
| `draft.update` | draft_id, change_set_size |
| `block.add` | block_id, kind, lesson_id |
| `ai.generate_block` | flow, prompt_id, prompt_version, model, tokens, cost_microusd |
| `ai.improve_block` | flow, prompt_id, block_id |
| `publish.saga.start` | draft_id, saga_id |
| `publish.saga.step.{building\|cataloging\|bundling\|ready}` | step, duration, outcome |
| `publish.saga.compensate` | step, reason |
| `scorm.import.parse` | source_url, version |
| `scorm.import.map` | warning_count |
| `outbox.publish` | topic, outbox_id |
| `db.query` | statement (sanitized), rows |
4.2 Span Sampling
| Path | Sampling |
|---|---|
| AI generation spans | 100% (safety-critical) |
| Publish saga spans | 100% (tamper-evident) |
| Write endpoints | 50% head-based; 100% tail-based on errors |
| Read endpoints | 1% head-based; 100% tail-based on slow (> 500ms) |
| Health / readiness | 0% (noise) |
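The sampling table can be read as a head decision made when a trace starts plus a tail decision made once a span has finished. The sketch below encodes that reading. The `SpanCategory` names, the `FinishedSpan` shape, and the function names are assumptions for illustration; in practice the tail-based part would normally run in the OTel Collector rather than in application code.

```typescript
// Hypothetical sketch of the sampling table. Category names and the
// FinishedSpan shape are assumptions, not the service's real types.
type SpanCategory = "ai" | "publish_saga" | "write" | "read" | "health";

interface FinishedSpan {
  category: SpanCategory;
  durationMs: number;
  isError: boolean;
}

// Head-based sampling probability per category, taken from the table.
const HEAD_RATE: Record<SpanCategory, number> = {
  ai: 1.0,
  publish_saga: 1.0,
  write: 0.5,
  read: 0.01,
  health: 0.0,
};

const SLOW_READ_MS = 500;

// Head decision, made when the trace starts. `roll` is a uniform random
// number in [0, 1); it is a parameter so the decision can be tested.
function headSample(category: SpanCategory, roll: number = Math.random()): boolean {
  return roll < HEAD_RATE[category];
}

// Tail decision, made after the span finishes: keep every failed write and
// every slow read, whatever the head decision was.
function tailKeep(span: FinishedSpan): boolean {
  if (span.category === "write") return span.isError;
  if (span.category === "read") return span.durationMs > SLOW_READ_MS;
  return false;
}

// A span is kept if either decision says so.
function shouldKeep(span: FinishedSpan, roll: number = Math.random()): boolean {
  return headSample(span.category, roll) || tailKeep(span);
}
```

With these rates, health checks are never kept (no roll is below 0), AI and publish saga spans are always kept (every roll is below 1), and a slow read is kept even when its head roll misses the 1% window.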
4.3 Context Propagation
- W3C Trace Context (`traceparent`, `tracestate`) on all HTTP requests
- NATS events carry `trace_id` in envelope; consumers link to same trace
- SSE streams propagate `trace_id` as event meta
- WebSocket connections carry `trace_id` as first message
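For reference, a W3C `traceparent` header has four dash-separated lowercase hex fields: version, trace ID (32 hex digits), parent span ID (16 hex digits), and flags (2 hex digits). The sketch below parses the version `00` layout by hand to show where `trace_id` comes from; the function and type names are illustrative, and a service using the OpenTelemetry SDK would normally rely on its built-in propagator instead.

```typescript
// Illustrative parser for the W3C Trace Context `traceparent` header.
// Only version 00 is accepted in this sketch.
interface TraceParent {
  version: string;
  traceId: string;
  parentId: string;
  sampled: boolean;
}

const TRACEPARENT_RE = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Returns the parsed header, or null when the value is malformed.
function parseTraceparent(header: string): TraceParent | null {
  const match = TRACEPARENT_RE.exec(header.trim());
  if (match === null) return null;
  const [, version, traceId, parentId, flags] = match;
  if (version !== "00") return null;
  // All-zero trace and parent IDs are invalid per the specification.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  // Bit 0 of the flags byte is the "sampled" flag.
  const sampled = (parseInt(flags, 16) & 0x01) === 1;
  return { version, traceId, parentId, sampled };
}
```

The same `traceId` value is what the NATS envelope, SSE event meta, and first WebSocket message would carry so that consumers can attach their spans to the originating trace.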
5. Dashboards
5.1 Service Overview Dashboard
Panels:
- RED metrics (RPS, error rate, p50/p95/p99 latency)
- Active collab sessions (gauge)
- Drafts by state (stacked bar)
- Publish saga throughput (steps/sec)
- DLQ depth (single-stat with alert threshold)
5.2 AI Authoring Dashboard
Panels:
- AI requests by flow (stacked)
- AI acceptance rate by flow (time series)
- AI cost by tenant (top 10, per-day)
- AI duration p95 by model (time series)
- Moderation block rate (time series + alert line)
- Local vs remote AI mix (ratio)
5.3 Publish Saga Dashboard
Panels:
- Active sagas by step (stacked)
- Saga duration histogram
- Timeout rate
- Compensation rate
- Per-step failure breakdown
5.4 Tenant Drill-Down
Parameterized by tenant_id:
- Per-tenant draft count, block count
- Per-tenant AI spend (month-to-date + forecast)
- Per-tenant publish rate
- Per-tenant rate-limit near-misses
6. Alerts
6.1 SLO-Based
| Alert | Condition | Severity | Runbook |
|---|---|---|---|
| `AuthoringAvailabilityBurn` | Error budget burn > 14.4x in 1h | critical | `rb/authoring-availability` |
| `AuthoringLatencyBurn` | p95 > 800ms for 5m | warning | `rb/authoring-latency` |
| `AuthoringDLQNonEmpty` | DLQ depth > 0 for 5m | critical | `rb/authoring-dlq` |
| `AuthoringOutboxStalled` | Pending > 10k for 5m | critical | `rb/authoring-outbox-stalled` |
| `AuthoringPublishSagaTimeoutRate` | timeout_rate > 5% in 15m | warning | `rb/publish-saga-timeout` |
| `AuthoringAIModerationRate` | moderation_block_rate > 20% in 15m | warning | `rb/ai-moderation-spike` |
| `AuthoringAICostAnomaly` | cost p99 > 3x 7d baseline | warning | `rb/ai-cost-anomaly` |
| `AuthoringDBConnectionsExhausted` | pool_active / pool_max > 0.9 | critical | `rb/authoring-db` |
6.2 Security Alerts
| Alert | Condition | Severity |
|---|---|---|
| `AuthoringCrossTenantAttempt` | Any `authoring.cross_tenant` 403 | critical (SecOps) |
| `AuthoringUnauthorizedPublish` | Publish attempted on foreign draft | critical (SecOps) |
| `AuthoringScormImportFailureSpike` | Import failure rate > 50% in 10m | warning (possible RCE attempt) |
7. SLIs / SLOs
| SLI | Definition | SLO Target | Window |
|---|---|---|---|
| Availability (write) | 1 - (5xx rate on write endpoints) | 99.9% | 30d rolling |
| Availability (read) | 1 - (5xx rate on read endpoints) | 99.95% | 30d rolling |
| Latency (write p95) | write endpoint duration | < 400ms | 7d rolling |
| Latency (read p95) | read endpoint duration | < 150ms | 7d rolling |
| AI job success rate | ai_jobs completed / attempted | 98% | 7d rolling |
| Publish saga success rate | sagas reaching ready / initiated | 99% | 7d rolling |
| Event delivery lag p95 | outbox→NATS publish latency | < 5s | 7d rolling |
Error budget = (1 - SLO). Alerts fire at 2x, 5x, 14.4x burn rates.
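The burn-rate arithmetic can be sketched as follows, assuming burn rate means the observed error rate divided by the error budget (the usual SRE definition; this document does not spell it out). The function names are illustrative, and the real alert rules presumably live in Prometheus rather than in application code.

```typescript
// Error budget: the fraction of requests allowed to fail, 1 - SLO.
function errorBudget(slo: number): number {
  return 1 - slo;
}

// Burn rate: how many times faster than "exactly on budget" errors arrive.
// Assumed definition: observed error rate / error budget.
function burnRate(errorRate: number, slo: number): number {
  return errorRate / errorBudget(slo);
}

// Fraction of the whole window's budget used up if `rate` is sustained
// for `hours` within a window of `windowDays`.
function budgetConsumed(rate: number, hours: number, windowDays: number): number {
  return (rate * hours) / (windowDays * 24);
}

// Burn-rate multiples at which alerts fire, from the text above.
const BURN_THRESHOLDS = [2, 5, 14.4];

// Returns the thresholds exceeded by the current error rate.
function firedThresholds(errorRate: number, slo: number): number[] {
  const rate = burnRate(errorRate, slo);
  return BURN_THRESHOLDS.filter((threshold) => rate > threshold);
}
```

For the 99.9% write availability SLO, a 1% 5xx rate is roughly a 10x burn, which exceeds the 2x and 5x thresholds but not 14.4x. A 14.4x burn sustained for one hour uses 14.4 / 720 = 2% of a 30-day budget.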
8. Runbooks (Pointers)
All runbooks live at runbooks.ghasi.io/authoring/*:
- `rb/authoring-availability`: service down / 5xx spike
- `rb/authoring-latency`: p95 regression
- `rb/authoring-dlq`: DLQ non-empty (manual replay)
- `rb/authoring-outbox-stalled`: relay crashed or backed up
- `rb/publish-saga-timeout`: saga stuck in step
- `rb/ai-moderation-spike`: content safety team triage
- `rb/ai-cost-anomaly`: AI spend investigation
- `rb/authoring-db`: database pool exhaustion
9. Exemplars
Every histogram metric carries trace exemplars. Grafana panels allow "jump to trace" from a latency spike.
10. AI-Specific Telemetry
Per AI job, we emit:
trace_id, span_id, tenant_id, user_id_hash, draft_id,
flow, prompt_id, prompt_version, model, model_version,
input_tokens, output_tokens, cost_microusd, cached (bool),
local (bool), moderation_verdict, duration_ms,
accepted_at (null | ts), edit_distance (null | int)
Analytics-service consumes `authoring.block.ai_generated.v1` and `authoring.block.reviewed.v1` to compute acceptance rate, time-to-accept, the edit-distance distribution, and a per-author AI trust score.
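One way to picture the acceptance-rate and time-to-accept calculation is the sketch below, which aggregates per-job records by flow. The `AiJobRecord` shape is an assumption: `flow` and `acceptedAt` mirror the `flow` and `accepted_at` fields listed above, while `generatedAt` is a hypothetical field standing in for the time the block was generated. The actual analytics-service logic is not described in this document.

```typescript
// Hypothetical record shape. `generatedAt` is an assumed field; timestamps
// are epoch milliseconds in this sketch.
interface AiJobRecord {
  flow: string;
  generatedAt: number;
  acceptedAt: number | null; // null when the block was never accepted
}

interface FlowStats {
  generated: number;
  accepted: number;
  acceptanceRate: number;            // accepted / generated
  meanTimeToAcceptMs: number | null; // null when nothing was accepted
}

// Groups jobs by flow and computes acceptance rate and mean time-to-accept.
function statsByFlow(jobs: AiJobRecord[]): Map<string, FlowStats> {
  const totals = new Map<string, { generated: number; accepted: number; totalMs: number }>();
  for (const job of jobs) {
    const entry = totals.get(job.flow) ?? { generated: 0, accepted: 0, totalMs: 0 };
    entry.generated += 1;
    if (job.acceptedAt !== null) {
      entry.accepted += 1;
      entry.totalMs += job.acceptedAt - job.generatedAt;
    }
    totals.set(job.flow, entry);
  }

  const result = new Map<string, FlowStats>();
  totals.forEach((entry, flow) => {
    result.set(flow, {
      generated: entry.generated,
      accepted: entry.accepted,
      acceptanceRate: entry.accepted / entry.generated,
      meanTimeToAcceptMs: entry.accepted > 0 ? entry.totalMs / entry.accepted : null,
    });
  });
  return result;
}
```

Keeping `acceptedAt` nullable, as the field list does, lets unreviewed blocks count toward the denominator without skewing time-to-accept.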
11. Publish Saga Telemetry
Every saga emits a trace spanning all steps. The span tree:
publish.saga.start
├── publish.saga.step.building (waits for content-service)
├── publish.saga.step.cataloging (waits for catalog-service)
├── publish.saga.step.bundling (waits for content-service bundle)
└── publish.saga.step.ready
On failure, span attributes include compensation_path and failure_step.
12. Offline Telemetry (S5)
- Client buffers structured events to IndexedDB when offline
- On reconnect, events are uploaded with original timestamps and a `delayed=true` tag
- Server applies replay-safe ingestion: original timestamps are preserved, and delayed events are not counted in real-time metrics
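The buffering behaviour above can be sketched as a small queue. In this sketch an in-memory array stands in for IndexedDB and a `send` callback stands in for the upload call; `OfflineEventBuffer`, `ClientEvent`, and `UploadedEvent` are hypothetical names, and sending live events with `delayed: false` is an assumption rather than something this document specifies.

```typescript
// Hypothetical sketch: an in-memory array stands in for IndexedDB and the
// `send` callback stands in for the real upload call.
interface ClientEvent {
  event: string;
  timestamp: string; // ISO 8601, captured when the event happened
}

interface UploadedEvent extends ClientEvent {
  delayed: boolean;
}

class OfflineEventBuffer {
  private pending: ClientEvent[] = [];

  constructor(private readonly send: (batch: UploadedEvent[]) => void) {}

  // Sends immediately when online; otherwise buffers the event unchanged.
  record(ev: ClientEvent, online: boolean): void {
    if (online) {
      this.send([{ ...ev, delayed: false }]);
    } else {
      this.pending.push(ev);
    }
  }

  // Called on reconnect: uploads buffered events in their original order,
  // keeping the original timestamps and tagging each event as delayed.
  // Returns the number of events uploaded.
  flush(): number {
    if (this.pending.length === 0) return 0;
    const batch = this.pending.map((ev) => ({ ...ev, delayed: true }));
    this.pending = [];
    this.send(batch);
    return batch.length;
  }
}
```

Because the timestamp is captured at record time and never rewritten, the server can place replayed events at the moment they happened while using the `delayed` tag to keep them out of real-time metrics.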
13. PII Handling in Telemetry
| Field | Treatment |
|---|---|
| `userId` | Hashed (sha256 + salt rotating daily) |
| `email` | Never logged |
| draft content | Never logged in full; truncated hashes only |
| AI prompt inputs | Redacted via classifier; hashes only |
| IP address | /24 prefix only; rotated hash after 30d |
PII leakage detected in logs → automated quarantine + incident ticket.
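As an illustration of the "/24 prefix only" treatment for IP addresses, the sketch below zeroes the last octet of an IPv4 address before it is recorded. `truncateIpv4` is a hypothetical helper; IPv6 handling and the 30-day hash rotation are outside the scope of this sketch.

```typescript
// Hypothetical helper: keeps only the /24 network prefix of an IPv4 address.
// Returns null for anything that is not a well-formed dotted-quad address.
function truncateIpv4(ip: string): string | null {
  const parts = ip.trim().split(".");
  if (parts.length !== 4) return null;

  // Each part must be 1 to 3 decimal digits with a value of at most 255.
  const octets = parts.map((part) => (/^\d{1,3}$/.test(part) ? Number(part) : NaN));
  if (octets.some((octet) => Number.isNaN(octet) || octet > 255)) return null;

  // Drop the host octet so individual machines cannot be distinguished.
  return `${octets[0]}.${octets[1]}.${octets[2]}.0/24`;
}
```

For example, `truncateIpv4("203.0.113.77")` yields `"203.0.113.0/24"`, while malformed input such as `"300.1.1.1"` yields `null` so that nothing unvalidated reaches the log line.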
14. Retention
| Data | Hot | Cold | Archive |
|---|---|---|---|
| Traces | 7d (Tempo) | 30d (Parquet) | 90d then delete |
| Metrics | 30d (Prometheus) | 13mo (Mimir) | — |
| Logs | 14d (Loki) | 395d (S3 parquet) | — |
| Profiles | 7d | — | — |
| Audit logs | 90d hot | 7y cold | legal hold on request |
Per-tenant retention may override (higher tiers get longer retention).