
Observability

:::info Source
Sourced from services/authoring-service/10-OBSERVABILITY.md in the documentation repo.
:::

Companion: 15 Observability & Telemetry


1. Instrumentation Stack

| Pillar | Tooling |
|---|---|
| Tracing | OpenTelemetry SDK (Node) → OTel Collector → Tempo/SigNoz |
| Metrics | OpenTelemetry SDK → OTel Collector → Prometheus/Mimir |
| Logs | Pino (JSON) → OTel Collector → Loki → S3 Parquet (cold) |
| Profiling | Pyroscope (pprof format) |

Every log line, span, and metric exemplar carries: trace_id, tenant_id, user_id_hash, request_id, service=authoring, commit=$SHA, instance.
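One way to make sure every log line carries these fields is to keep the per-request values in an async-local store and merge them in at emit time. The sketch below is only an illustration under that assumption: `RequestContext`, `buildLogLine`, and the environment variable names `COMMIT_SHA` and `HOSTNAME` are hypothetical and not taken from the service code.

```typescript
import { AsyncLocalStorage } from "node:async_hooks";

// Hypothetical shape of the per-request correlation fields listed above.
interface RequestContext {
  trace_id: string;
  tenant_id: string;
  user_id_hash: string;
  request_id: string;
}

// Process-wide fields. The env var names are assumptions for illustration.
const staticFields = {
  service: "authoring",
  commit: process.env.COMMIT_SHA ?? "unknown",
  instance: process.env.HOSTNAME ?? "local",
};

const requestContext = new AsyncLocalStorage<RequestContext>();

// Merge static fields, the active request context (if any), and event fields.
function buildLogLine(
  event: string,
  fields: Record<string, unknown> = {},
): Record<string, unknown> {
  return {
    timestamp: new Date().toISOString(),
    ...staticFields,
    ...(requestContext.getStore() ?? {}),
    event,
    ...fields,
  };
}

// Everything called inside run() sees the same context without passing it around.
const line = requestContext.run(
  {
    trace_id: "4bf92f3577b34da6a3ce929d0e0e4736",
    tenant_id: "t-1",
    user_id_hash: "ab12",
    request_id: "r-1",
  },
  () => buildLogLine("draft.created", { resource_type: "draft" }),
);
```

Outside of `run()` the store is empty, so background jobs would still emit the static fields but no request-scoped ones.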

2. Structured Log Schema

type ISODate = string; // assumed alias: ISO 8601 timestamp string

interface AuthoringLogLine {
  timestamp: ISODate;
  level: 'trace' | 'debug' | 'info' | 'warn' | 'error' | 'fatal';
  service: 'authoring-service';
  instance: string;
  commit: string;
  log_schema_version: '1.0';
  trace_id?: string;
  span_id?: string;
  request_id?: string;
  tenant_id?: string;
  user_id_hash?: string; // sha256(userId + salt); never the raw user ID
  actor_type?: 'user' | 'system' | 'api_key';
  event: string; // canonical event name, e.g. 'draft.created'
  resource_type?: string;
  resource_id?: string;
  duration_ms?: number;
  status?: string;
  error?: { code: string; message: string; stack?: string };
  meta?: Record<string, unknown>; // non-PII context
}

Redaction is applied by the RedactingLogger wrapper before a line is emitted. Free-form strings containing user data are not allowed, and every meta value passes through a PII scanner before it is written.
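As a rough illustration of that wrapper, the sketch below hashes the user ID and scrubs email-like substrings from meta before anything is serialized. It is a minimal stand-in, not the actual RedactingLogger: the constructor signature, the `info` method, and the single email pattern are all assumptions, and a real PII scanner would cover many more cases.

```typescript
import { createHash } from "node:crypto";

// Hypothetical pattern: only email-like strings are detected in this sketch.
const EMAIL = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;

// sha256(userId + salt), hex-encoded, matching the user_id_hash comment above.
function hashUserId(userId: string, salt: string): string {
  return createHash("sha256").update(userId + salt).digest("hex");
}

// Scrub string values and recurse into arrays and nested objects.
function redactMeta(value: unknown): unknown {
  if (typeof value === "string") return value.replace(EMAIL, "[REDACTED]");
  if (Array.isArray(value)) return value.map(redactMeta);
  if (value !== null && typeof value === "object") {
    return Object.fromEntries(
      Object.entries(value as Record<string, unknown>).map(
        ([key, inner]) => [key, redactMeta(inner)],
      ),
    );
  }
  return value;
}

// Illustrative wrapper: the raw user ID never reaches the emit function.
class RedactingLogger {
  private readonly emit: (line: string) => void;
  private readonly salt: string;

  constructor(emit: (line: string) => void, salt: string) {
    this.emit = emit;
    this.salt = salt;
  }

  info(event: string, userId?: string, meta: Record<string, unknown> = {}): void {
    this.emit(
      JSON.stringify({
        level: "info",
        event,
        user_id_hash: userId === undefined ? undefined : hashUserId(userId, this.salt),
        meta: redactMeta(meta),
      }),
    );
  }
}
```

Redacting before `JSON.stringify` means the emitted string is already clean, so downstream transports do not need to know about PII rules.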

3. Metric Catalog

3.1 RED (Rate, Errors, Duration)

| Metric | Type | Labels | Purpose |
|---|---|---|---|
| authoring_http_requests_total | counter | method, route, status, tenant | HTTP request rate |
| authoring_http_request_duration_seconds | histogram | method, route, status | HTTP latency |
| authoring_http_errors_total | counter | method, route, error_code | Error taxonomy |
| authoring_ws_connections_active | gauge | tenant | Active collab sessions |
| authoring_ws_messages_total | counter | tenant, direction | WS message rate |

3.2 Domain-Specific

| Metric | Type | Labels |
|---|---|---|
| authoring_drafts_total | gauge | tenant, state |
| authoring_draft_state_transitions_total | counter | tenant, from_state, to_state |
| authoring_blocks_added_total | counter | tenant, kind, source (manual/ai) |
| authoring_blocks_removed_total | counter | tenant, kind |
| authoring_ai_requests_total | counter | tenant, flow, status |
| authoring_ai_duration_seconds | histogram | tenant, flow |
| authoring_ai_tokens_total | counter | tenant, flow, direction |
| authoring_ai_cost_microusd_total | counter | tenant, flow, model |
| authoring_ai_acceptance_rate | gauge | tenant, flow |
| authoring_ai_moderation_blocks_total | counter | tenant, flow, verdict |
| authoring_publish_saga_duration_seconds | histogram | tenant, step, outcome |
| authoring_publish_saga_active | gauge | tenant, step |
| authoring_publish_saga_timeouts_total | counter | tenant |
| authoring_publish_saga_compensations_total | counter | tenant, step |
| authoring_scorm_import_duration_seconds | histogram | tenant, scorm_version, outcome |
| authoring_scorm_import_total | counter | tenant, scorm_version, status |
| authoring_collab_session_duration_seconds | histogram | tenant |
| authoring_time_to_first_block_seconds | histogram | tenant |

3.3 Outbox / Event Health

| Metric | Type | Labels |
|---|---|---|
| authoring_outbox_pending_total | gauge | |
| authoring_outbox_publish_duration_seconds | histogram | topic |
| authoring_outbox_retries_total | counter | topic |
| authoring_outbox_dlq_total | counter | topic |
| authoring_inbox_processed_total | counter | topic, status |
| authoring_inbox_duration_seconds | histogram | topic |

3.4 USE (Utilization, Saturation, Errors) for Resources

| Metric | Type | Labels |
|---|---|---|
| authoring_db_pool_active | gauge | |
| authoring_db_pool_idle | gauge | |
| authoring_db_pool_wait_seconds | histogram | |
| authoring_db_query_duration_seconds | histogram | operation |
| authoring_memory_heap_bytes | gauge | |
| authoring_event_loop_lag_seconds | histogram | |

4. Distributed Tracing

4.1 Key Spans

| Span name | Attributes |
|---|---|
| http.request | method, route, status, tenant_id |
| draft.create | draft_id, tenant_id |
| draft.update | draft_id, change_set_size |
| block.add | block_id, kind, lesson_id |
| ai.generate_block | flow, prompt_id, prompt_version, model, tokens, cost_microusd |
| ai.improve_block | flow, prompt_id, block_id |
| publish.saga.start | draft_id, saga_id |
| publish.saga.step.{building\|cataloging\|bundling\|ready} | step, duration, outcome |
| publish.saga.compensate | step, reason |
| scorm.import.parse | source_url, version |
| scorm.import.map | warning_count |
| outbox.publish | topic, outbox_id |
| db.query | statement (sanitized), rows |

4.2 Span Sampling

| Path | Sampling |
|---|---|
| AI generation spans | 100% (safety-critical) |
| Publish saga spans | 100% (tamper-evident) |
| Write endpoints | 50% head-based; 100% tail-based on errors |
| Read endpoints | 1% head-based; 100% tail-based on slow (> 500ms) |
| Health / readiness | 0% (noise) |
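The sampling policy above combines a head decision (made when a trace starts) with a tail decision (made once it has finished). The following sketch shows how the two could compose. The `PathClass` names and function names are hypothetical, and in practice the tail step would typically run in the collector rather than in the service.

```typescript
type PathClass = "ai" | "publish_saga" | "write" | "read" | "health";

// Head-based sampling probabilities taken from the sampling table.
const HEAD_RATE: Record<PathClass, number> = {
  ai: 1.0,
  publish_saga: 1.0,
  write: 0.5,
  read: 0.01,
  health: 0.0,
};

// Head decision, taken at trace start. `random` is injectable for testing.
function headSample(path: PathClass, random: () => number = Math.random): boolean {
  return random() < HEAD_RATE[path];
}

interface CompletedTrace {
  path: PathClass;
  hadError: boolean;
  durationMs: number;
}

// Tail decision, taken after completion: keep every errored write and every
// read slower than 500 ms, regardless of the head decision.
function tailKeep(trace: CompletedTrace): boolean {
  if (trace.path === "write") return trace.hadError;
  if (trace.path === "read") return trace.durationMs > 500;
  return false;
}

// A trace is retained if either decision selects it.
function shouldKeep(trace: CompletedTrace, headSampled: boolean): boolean {
  return headSampled || tailKeep(trace);
}
```

Because `Math.random()` returns values in [0, 1), a rate of 1.0 always samples and a rate of 0.0 never does.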

4.3 Context Propagation

  • W3C Trace Context (traceparent, tracestate) on all HTTP requests
  • NATS events carry trace_id in envelope; consumers link to same trace
  • SSE streams propagate trace_id as event meta
  • WebSocket connections carry trace_id in the first message
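For the HTTP case, the `traceparent` header has the fixed form `version-traceid-parentid-flags`. The sketch below parses and formats version `00` headers only. It is an illustration, not a replacement for the OpenTelemetry propagators, and the `TraceContext` type and function names are made up for this example.

```typescript
interface TraceContext {
  traceId: string;  // 32 lowercase hex characters
  parentId: string; // 16 lowercase hex characters
  sampled: boolean; // bit 0 of the trace-flags byte
}

// Version 00 layout: 00-<trace-id>-<parent-id>-<trace-flags>
const TRACEPARENT = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(header: string): TraceContext | null {
  const match = TRACEPARENT.exec(header.trim());
  if (match === null) return null;
  const [, traceId, parentId, flags] = match;
  // All-zero trace or parent IDs are invalid.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  return { traceId, parentId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

function formatTraceparent(ctx: TraceContext): string {
  return `00-${ctx.traceId}-${ctx.parentId}-${ctx.sampled ? "01" : "00"}`;
}
```

A consumer that receives only a trace_id (as in the NATS, SSE, and WebSocket cases) would generate a fresh span ID locally and attach it to that trace.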

5. Dashboards

5.1 Service Overview Dashboard

Panels:

  • RED metrics (RPS, error rate, p50/p95/p99 latency)
  • Active collab sessions (gauge)
  • Drafts by state (stacked bar)
  • Publish saga throughput (steps/sec)
  • DLQ depth (single-stat with alert threshold)

5.2 AI Authoring Dashboard

Panels:

  • AI requests by flow (stacked)
  • AI acceptance rate by flow (time series)
  • AI cost by tenant (top 10, per-day)
  • AI duration p95 by model (time series)
  • Moderation block rate (time series + alert line)
  • Local vs remote AI mix (ratio)

5.3 Publish Saga Dashboard

Panels:

  • Active sagas by step (stacked)
  • Saga duration histogram
  • Timeout rate
  • Compensation rate
  • Per-step failure breakdown

5.4 Tenant Drill-Down

Parameterized by tenant_id:

  • Per-tenant draft count, block count
  • Per-tenant AI spend (month-to-date + forecast)
  • Per-tenant publish rate
  • Per-tenant rate-limit near-misses

6. Alerts

6.1 SLO-Based

| Alert | Condition | Severity | Runbook |
|---|---|---|---|
| AuthoringAvailabilityBurn | Error budget burn > 14.4x in 1h | critical | rb/authoring-availability |
| AuthoringLatencyBurn | p95 > 800ms for 5m | warning | rb/authoring-latency |
| AuthoringDLQNonEmpty | DLQ depth > 0 for 5m | critical | rb/authoring-dlq |
| AuthoringOutboxStalled | Pending > 10k for 5m | critical | rb/authoring-outbox-stalled |
| AuthoringPublishSagaTimeoutRate | timeout_rate > 5% in 15m | warning | rb/publish-saga-timeout |
| AuthoringAIModerationRate | moderation_block_rate > 20% in 15m | warning | rb/ai-moderation-spike |
| AuthoringAICostAnomaly | cost p99 > 3x 7d baseline | warning | rb/ai-cost-anomaly |
| AuthoringDBConnectionsExhausted | pool_active / pool_max > 0.9 | critical | rb/authoring-db |

6.2 Security Alerts

| Alert | Condition | Severity |
|---|---|---|
| AuthoringCrossTenantAttempt | Any authoring.cross_tenant 403 | critical (SecOps) |
| AuthoringUnauthorizedPublish | Publish attempted on foreign draft | critical (SecOps) |
| AuthoringScormImportFailureSpike | Import failure rate > 50% in 10m | warning (possible RCE attempt) |

7. SLIs / SLOs

| SLI | Definition | SLO Target | Window |
|---|---|---|---|
| Availability (write) | 1 - (5xx rate on write endpoints) | 99.9% | 30d rolling |
| Availability (read) | 1 - (5xx rate on read endpoints) | 99.95% | 30d rolling |
| Latency (write p95) | write endpoint duration | < 400ms | 7d rolling |
| Latency (read p95) | read endpoint duration | < 150ms | 7d rolling |
| AI job success rate | ai_jobs completed / attempted | 98% | 7d rolling |
| Publish saga success rate | sagas reaching ready / initiated | 99% | 7d rolling |
| Event delivery lag p95 | outbox→NATS publish latency | < 5s | 7d rolling |

Error budget = (1 - SLO). Alerts fire at 2x, 5x, 14.4x burn rates.
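To make the burn-rate thresholds concrete: burn rate is the observed error ratio divided by the error budget, and a burn rate sustained for some time consumes a proportional share of the window's budget. The helper names below are illustrative only.

```typescript
// Burn rate: how many times faster than "exactly on budget" errors are arriving.
function burnRate(errorRatio: number, slo: number): number {
  return errorRatio / (1 - slo);
}

// Fraction of the whole window's error budget consumed if `rate` is sustained
// for `sustainedHours` within a window of `windowDays`.
function budgetConsumed(rate: number, sustainedHours: number, windowDays: number): number {
  return (rate * sustainedHours) / (windowDays * 24);
}

// Write availability SLO of 99.9% leaves a 0.1% error budget, so a 1.44%
// 5xx ratio corresponds to a burn rate of about 14.4.
const rate = burnRate(0.0144, 0.999);

// Sustained for 1h in a 30d window, 14.4x consumes about 2% of the budget.
const consumed = budgetConsumed(14.4, 1, 30);
```

This is why a 14.4x burn over one hour is treated as critical: at that pace the full 30-day budget would be gone in roughly two days (30 / 14.4 ≈ 2.08).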

8. Runbooks (Pointers)

All runbooks live at runbooks.ghasi.io/authoring/*:

  • rb/authoring-availability — service down / 5xx spike
  • rb/authoring-latency — p95 regression
  • rb/authoring-dlq — DLQ non-empty (manual replay)
  • rb/authoring-outbox-stalled — relay crashed or backed up
  • rb/publish-saga-timeout — saga stuck in step
  • rb/ai-moderation-spike — content safety team triage
  • rb/ai-cost-anomaly — AI spend investigation
  • rb/authoring-db — database pool exhaustion

9. Exemplars

Every histogram metric carries trace exemplars. Grafana panels allow "jump to trace" from a latency spike.

10. AI-Specific Telemetry

Per AI job, we emit:

trace_id, span_id, tenant_id, user_id_hash, draft_id,
flow, prompt_id, prompt_version, model, model_version,
input_tokens, output_tokens, cost_microusd, cached (bool),
local (bool), moderation_verdict, duration_ms,
accepted_at (null | ts), edit_distance (null | int)

Analytics-service consumes authoring.block.ai_generated.v1 and authoring.block.reviewed.v1 to compute acceptance rate, time-to-accept, edit distance distribution, and a per-author AI trust score.
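The per-job record and one of the derived metrics could be modeled as below. This is a sketch under assumptions: the field types (for example, timestamps as ISO strings), the flow names used in the usage example, and the definition of acceptance rate as accepted jobs divided by all jobs per flow are not specified by this document.

```typescript
// Field names follow the list above; the types are assumptions.
interface AiJobTelemetry {
  trace_id: string;
  span_id: string;
  tenant_id: string;
  user_id_hash: string;
  draft_id: string;
  flow: string;
  prompt_id: string;
  prompt_version: string;
  model: string;
  model_version: string;
  input_tokens: number;
  output_tokens: number;
  cost_microusd: number;
  cached: boolean;
  local: boolean;
  moderation_verdict: string;
  duration_ms: number;
  accepted_at: string | null;   // null until the author accepts the block
  edit_distance: number | null; // null until the block is reviewed
}

// Assumed definition: accepted jobs / all jobs, grouped by flow.
function acceptanceRateByFlow(
  jobs: Pick<AiJobTelemetry, "flow" | "accepted_at">[],
): Map<string, number> {
  const counts = new Map<string, { accepted: number; total: number }>();
  for (const job of jobs) {
    const entry = counts.get(job.flow) ?? { accepted: 0, total: 0 };
    entry.total += 1;
    if (job.accepted_at !== null) entry.accepted += 1;
    counts.set(job.flow, entry);
  }
  const rates = new Map<string, number>();
  counts.forEach((entry, flow) => rates.set(flow, entry.accepted / entry.total));
  return rates;
}

// Usage with illustrative flow names.
const rates = acceptanceRateByFlow([
  { flow: "generate_block", accepted_at: "2024-01-01T00:00:00Z" },
  { flow: "generate_block", accepted_at: null },
  { flow: "improve_block", accepted_at: null },
]);
```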

11. Publish Saga Telemetry

Every saga emits a trace spanning all steps. The span tree:

publish.saga.start
├── publish.saga.step.building (waits for content-service)
├── publish.saga.step.cataloging (waits for catalog-service)
├── publish.saga.step.bundling (waits for content-service bundle)
└── publish.saga.step.ready

On failure, span attributes include compensation_path and failure_step.

12. Offline Telemetry (S5)

  • Client buffers structured events to IndexedDB when offline
  • On reconnect, events are uploaded with original timestamps and delayed=true tag
  • The server applies replay-safe ingestion: original timestamps are preserved, and delayed events do not feed real-time metrics
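The client side of this flow can be sketched as a small buffer that records events while offline and flushes them on reconnect. To keep the example runnable outside a browser, an in-memory array stands in for IndexedDB. The `OfflineEventBuffer` class, its method names, and the `upload` callback are hypothetical.

```typescript
interface ClientEvent {
  event: string;
  timestamp: string; // ISO 8601, captured when the event happened
  delayed?: boolean; // set only on events replayed after reconnect
}

// In-memory stand-in for the IndexedDB store described above.
class OfflineEventBuffer {
  private pending: ClientEvent[] = [];
  private readonly upload: (batch: ClientEvent[]) => Promise<void>;

  constructor(upload: (batch: ClientEvent[]) => Promise<void>) {
    this.upload = upload;
  }

  // Send immediately when online; otherwise buffer with the original timestamp.
  record(event: string, online: boolean, now: Date = new Date()): Promise<void> {
    const entry: ClientEvent = { event, timestamp: now.toISOString() };
    if (online) return this.upload([entry]);
    this.pending.push(entry);
    return Promise.resolve();
  }

  // Called on reconnect. Returns the number of events uploaded.
  async flush(): Promise<number> {
    if (this.pending.length === 0) return 0;
    const batch = this.pending.map((e) => ({ ...e, delayed: true }));
    await this.upload(batch);
    // Drop only what was uploaded; events buffered during the upload stay queued.
    this.pending = this.pending.slice(batch.length);
    return batch.length;
  }
}
```

If the upload rejects, `flush()` propagates the error and the buffer is left untouched, so a later reconnect can retry the same events.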

13. PII Handling in Telemetry

| Field | Treatment |
|---|---|
| userId | Hashed (sha256 + salt rotating daily) |
| email | Never logged |
| draft content | Never logged in full; truncated hashes only |
| AI prompt inputs | Redacted via classifier; hashes only |
| IP address | /24 prefix only; rotated hash after 30d |

When PII leakage is detected in logs, an automated quarantine is triggered and an incident ticket is opened.
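Two of the treatments above are simple enough to sketch directly: truncating an IPv4 address to its /24 prefix and hashing a user ID with a salt that changes each day. The way the daily salt is derived here (a secret joined with the UTC date) is an assumption for illustration, as are the function names. IPv6 is not handled.

```typescript
import { createHash } from "node:crypto";

// Keep only the /24 network of an IPv4 address, e.g. "203.0.113.57" becomes
// "203.0.113.0/24". Returns null for anything that is not a valid IPv4 address.
function truncateIpv4(ip: string): string | null {
  const parts = ip.split(".");
  if (parts.length !== 4) return null;
  if (!parts.every((p) => /^\d{1,3}$/.test(p) && Number(p) <= 255)) return null;
  return `${Number(parts[0])}.${Number(parts[1])}.${Number(parts[2])}.0/24`;
}

// Hypothetical daily salt: a secret combined with the UTC calendar date.
function dailySalt(secret: string, day: Date): string {
  return `${secret}:${day.toISOString().slice(0, 10)}`;
}

// sha256(userId + salt) with the salt rotating once per UTC day.
function hashUserIdForDay(userId: string, secret: string, day: Date): string {
  return createHash("sha256").update(userId + dailySalt(secret, day)).digest("hex");
}
```

One consequence of daily rotation is that the same user produces different hashes on different days, so a user_id_hash can only be correlated within a single day.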

14. Retention

| Data | Hot | Cold | Archive |
|---|---|---|---|
| Traces | 7d (Tempo) | 30d (Parquet) | 90d then delete |
| Metrics | 30d (Prometheus) | 13mo (Mimir) | |
| Logs | 14d (Loki) | 395d (S3 parquet) | |
| Profiles | 7d | | |
| Audit logs | 90d hot | 7y cold | legal hold on request |

Per-tenant retention may override (higher tiers get longer retention).