Observability

:::info Source
Sourced from services/authoring-service/10-OBSERVABILITY.md in the documentation repo.
:::

Companion: 15 Observability & Telemetry

1. Instrumentation Stack

Pillar	Tooling
Tracing	OpenTelemetry SDK (Node) → OTel Collector → Tempo/SigNoz
Metrics	OpenTelemetry SDK → OTel Collector → Prometheus/Mimir
Logs	Pino (JSON) → OTel Collector → Loki → S3 Parquet (cold)
Profiling	Pyroscope (pprof format)

Every log line, span, and metric exemplar carries: trace_id, tenant_id, user_id_hash, request_id, service=authoring, commit=$SHA, instance.

2. Structured Log Schema

type ISODate = string;             // ISO 8601 timestamp; alias assumed, not defined in the source

interface AuthoringLogLine {
  timestamp: ISODate;
  level: 'trace'|'debug'|'info'|'warn'|'error'|'fatal';
  service: 'authoring-service';
  instance: string;
  commit: string;
  log_schema_version: '1.0';
  trace_id?: string;
  span_id?: string;
  request_id?: string;
  tenant_id?: string;
  user_id_hash?: string;           // sha256(userId + salt) — never raw
  actor_type?: 'user'|'system'|'api_key';
  event: string;                   // canonical event name, e.g. 'draft.created'
  resource_type?: string;
  resource_id?: string;
  duration_ms?: number;
  status?: string;
  error?: { code: string; message: string; stack?: string };
  meta?: Record<string, unknown>;  // non-PII context
}

Redaction is applied by the RedactingLogger wrapper before emit. Free-form strings containing user data are not allowed, and meta values pass through a PII scanner before they are written.
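To make the redaction step concrete, here is a minimal sketch of a wrapper that scrubs meta values before handing the line to a sink. The patterns, the `Sink` type, the `[REDACTED]` placeholder, and the method names are illustrative assumptions; the actual RedactingLogger and PII scanner are not described in this document.

```typescript
type Meta = Record<string, unknown>;
type Sink = (line: Record<string, unknown>) => void;

// Hypothetical patterns; the real PII scanner's rules are not given in the source.
const PII_PATTERNS: RegExp[] = [
  /[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}/i, // email address
  /\b\d{3}-\d{2}-\d{4}\b/, // US SSN-shaped number
];

function looksLikePii(value: unknown): boolean {
  if (typeof value !== "string") return false;
  const text: string = value;
  return PII_PATTERNS.some((pattern) => pattern.test(text));
}

// Replace top-level meta values that match a PII pattern.
// Nested objects are not inspected in this sketch.
function scrubMeta(meta: Meta): Meta {
  const clean: Meta = {};
  for (const [key, value] of Object.entries(meta)) {
    clean[key] = looksLikePii(value) ? "[REDACTED]" : value;
  }
  return clean;
}

class RedactingLogger {
  private readonly sink: Sink;

  constructor(sink: Sink) {
    this.sink = sink;
  }

  // Emit an info-level line; meta is scrubbed before it reaches the sink.
  info(event: string, meta: Meta = {}): void {
    this.sink({ level: "info", event, meta: scrubMeta(meta) });
  }
}
```

A production scanner would also need to walk nested values and cover more identifier types than the two patterns shown here.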

3. Metric Catalog

3.1 RED (Rate, Errors, Duration)

Metric	Type	Labels	Purpose
`authoring_http_requests_total`	counter	method, route, status, tenant	HTTP request rate
`authoring_http_request_duration_seconds`	histogram	method, route, status	HTTP latency
`authoring_http_errors_total`	counter	method, route, error_code	Error taxonomy
`authoring_ws_connections_active`	gauge	tenant	Active collab sessions
`authoring_ws_messages_total`	counter	tenant, direction	WS message rate
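
As an illustration of how the label sets in this table map to time series, the sketch below uses a small in-memory counter rather than the OpenTelemetry SDK. `LabeledCounter`, the route template, and the tenant value are hypothetical stand-ins; only the metric name and label names come from the table.

```typescript
// Label names for authoring_http_requests_total, taken from the table above.
type HttpRequestLabels = {
  method: string;
  route: string;
  status: string;
  tenant: string;
};

// Minimal stand-in for a metrics SDK counter: one series per distinct label set.
class LabeledCounter<L extends Record<string, string>> {
  readonly name: string;
  private readonly series = new Map<string, number>();

  constructor(name: string) {
    this.name = name;
  }

  // Sort the rendered pairs so the same label set always maps to the same key.
  private key(labels: L): string {
    return Object.entries(labels)
      .map(([k, v]) => `${k}="${v}"`)
      .sort()
      .join(",");
  }

  inc(labels: L, by = 1): void {
    const k = this.key(labels);
    this.series.set(k, (this.series.get(k) ?? 0) + by);
  }

  get(labels: L): number {
    return this.series.get(this.key(labels)) ?? 0;
  }
}

const httpRequests = new LabeledCounter<HttpRequestLabels>(
  "authoring_http_requests_total",
);

// Use the route template, not the raw URL, to keep label cardinality bounded.
const created: HttpRequestLabels = {
  method: "POST",
  route: "/drafts/:id/blocks",
  status: "201",
  tenant: "t1",
};
httpRequests.inc(created);
httpRequests.inc(created);
```

Each distinct combination of label values becomes its own series, which is why the `tenant` label appears on the request counter but not on the latency histogram in the table.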

3.2 Domain-Specific

Metric	Type	Labels
`authoring_drafts_total`	gauge	tenant, state
`authoring_draft_state_transitions_total`	counter	tenant, from_state, to_state
`authoring_blocks_added_total`	counter	tenant, kind, source (manual/ai)
`authoring_blocks_removed_total`	counter	tenant, kind
`authoring_ai_requests_total`	counter	tenant, flow, status
`authoring_ai_duration_seconds`	histogram	tenant, flow
`authoring_ai_tokens_total`	counter	tenant, flow, direction
`authoring_ai_cost_microusd_total`	counter	tenant, flow, model
`authoring_ai_acceptance_rate`	gauge	tenant, flow
`authoring_ai_moderation_blocks_total`	counter	tenant, flow, verdict
`authoring_publish_saga_duration_seconds`	histogram	tenant, step, outcome
`authoring_publish_saga_active`	gauge	tenant, step
`authoring_publish_saga_timeouts_total`	counter	tenant
`authoring_publish_saga_compensations_total`	counter	tenant, step
`authoring_scorm_import_duration_seconds`	histogram	tenant, scorm_version, outcome
`authoring_scorm_import_total`	counter	tenant, scorm_version, status
`authoring_collab_session_duration_seconds`	histogram	tenant
`authoring_time_to_first_block_seconds`	histogram	tenant

3.3 Outbox / Event Health

Metric	Type	Labels
`authoring_outbox_pending_total`	gauge	—
`authoring_outbox_publish_duration_seconds`	histogram	topic
`authoring_outbox_retries_total`	counter	topic
`authoring_outbox_dlq_total`	counter	topic
`authoring_inbox_processed_total`	counter	topic, status
`authoring_inbox_duration_seconds`	histogram	topic

3.4 USE (Utilization, Saturation, Errors) for Resources

Metric	Type	Labels
`authoring_db_pool_active`	gauge	—
`authoring_db_pool_idle`	gauge	—
`authoring_db_pool_wait_seconds`	histogram	—
`authoring_db_query_duration_seconds`	histogram	operation
`authoring_memory_heap_bytes`	gauge	—
`authoring_event_loop_lag_seconds`	histogram	—

4. Distributed Tracing

4.1 Key Spans

Span name	Attributes
`http.request`	method, route, status, tenant_id
`draft.create`	draft_id, tenant_id
`draft.update`	draft_id, change_set_size
`block.add`	block_id, kind, lesson_id
`ai.generate_block`	flow, prompt_id, prompt_version, model, tokens, cost_microusd
`ai.improve_block`	flow, prompt_id, block_id
`publish.saga.start`	draft_id, saga_id
`publish.saga.step.{building|cataloging|bundling|ready}`	step, duration, outcome
`publish.saga.compensate`	step, reason
`scorm.import.parse`	source_url, version
`scorm.import.map`	warning_count
`outbox.publish`	topic, outbox_id
`db.query`	statement (sanitized), rows

4.2 Span Sampling

Path	Sampling
AI generation spans	100% (safety-critical)
Publish saga spans	100% (tamper-evident)
Write endpoints	50% head-based; 100% tail-based on errors
Read endpoints	1% head-based; 100% tail-based on slow (> 500ms)
Health / readiness	0% (noise)
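
The sampling table can be read as a head decision taken when a trace starts plus a tail decision taken once it has finished. The sketch below encodes that reading; the `PathKind` names, the `FinishedTrace` shape, and the function names are assumptions, and a real deployment would typically configure this in the SDK sampler and the Collector rather than in application code.

```typescript
type PathKind = "ai" | "publish_saga" | "write" | "read" | "health";

// Head-based sampling ratios from the table above.
const HEAD_RATIO: Record<PathKind, number> = {
  ai: 1.0,
  publish_saga: 1.0,
  write: 0.5,
  read: 0.01,
  health: 0.0,
};

// Reads slower than this are always kept by the tail rule.
const SLOW_READ_MS = 500;

// Head decision: `roll` is a uniform random number in [0, 1).
function sampleAtHead(kind: PathKind, roll: number = Math.random()): boolean {
  return roll < HEAD_RATIO[kind];
}

interface FinishedTrace {
  kind: PathKind;
  error: boolean;
  durationMs: number;
}

// Tail decision: keep failed writes and slow reads regardless of the head roll.
function keepAtTail(trace: FinishedTrace): boolean {
  if (trace.kind === "write") return trace.error;
  if (trace.kind === "read") return trace.durationMs > SLOW_READ_MS;
  return false;
}

function shouldKeep(trace: FinishedTrace, roll: number = Math.random()): boolean {
  return sampleAtHead(trace.kind, roll) || keepAtTail(trace);
}
```

Passing `roll` explicitly keeps the decision deterministic in tests; in normal use the default `Math.random()` applies.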

4.3 Context Propagation

W3C Trace Context (traceparent, tracestate) on all HTTP requests
NATS events carry trace_id in envelope; consumers link to same trace
SSE streams propagate trace_id as event meta
WebSocket connections carry trace_id as first message
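
For reference, a version-00 `traceparent` header has the form `00-<32 hex trace-id>-<16 hex parent-id>-<2 hex flags>`. The sketch below parses and formats that header and copies the trace ID into an event envelope. The `EventEnvelope` shape and the helper names are assumptions for illustration; the actual NATS envelope is not specified here, and in practice the OpenTelemetry propagators handle this.

```typescript
interface TraceContext {
  traceId: string;
  parentId: string;
  sampled: boolean;
}

// Version 00 only: 32 hex trace-id, 16 hex parent-id, 2 hex trace-flags.
const TRACEPARENT_RE = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(header: string): TraceContext | null {
  const match = TRACEPARENT_RE.exec(header.trim());
  if (match === null) return null;
  const [, traceId, parentId, flags] = match;
  // All-zero IDs are invalid per the W3C Trace Context specification.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  // Bit 0 of trace-flags is the "sampled" flag.
  return { traceId, parentId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}

function formatTraceparent(ctx: TraceContext): string {
  return `00-${ctx.traceId}-${ctx.parentId}-${ctx.sampled ? "01" : "00"}`;
}

// Hypothetical envelope: carries trace_id so consumers can link to the same trace.
interface EventEnvelope<T> {
  trace_id: string;
  payload: T;
}

function wrapEvent<T>(ctx: TraceContext, payload: T): EventEnvelope<T> {
  return { trace_id: ctx.traceId, payload };
}
```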

5. Dashboards

5.1 Service Overview Dashboard

Panels:

RED metrics (RPS, error rate, p50/p95/p99 latency)
Active collab sessions (gauge)
Drafts by state (stacked bar)
Publish saga throughput (steps/sec)
DLQ depth (single-stat with alert threshold)

5.2 AI Authoring Dashboard

Panels:

AI requests by flow (stacked)
AI acceptance rate by flow (time series)
AI cost by tenant (top 10, per-day)
AI duration p95 by model (time series)
Moderation block rate (time series + alert line)
Local vs remote AI mix (ratio)

5.3 Publish Saga Dashboard

Panels:

Active sagas by step (stacked)
Saga duration histogram
Timeout rate
Compensation rate
Per-step failure breakdown

5.4 Tenant Drill-Down

Parameterized by tenant_id:

Per-tenant draft count, block count
Per-tenant AI spend (month-to-date + forecast)
Per-tenant publish rate
Per-tenant rate-limit near-misses

6. Alerts

6.1 SLO-Based

Alert	Condition	Severity	Runbook
`AuthoringAvailabilityBurn`	Error budget burn > 14.4x in 1h	critical	`rb/authoring-availability`
`AuthoringLatencyBurn`	p95 > 800ms for 5m	warning	`rb/authoring-latency`
`AuthoringDLQNonEmpty`	DLQ depth > 0 for 5m	critical	`rb/authoring-dlq`
`AuthoringOutboxStalled`	Pending > 10k for 5m	critical	`rb/authoring-outbox-stalled`
`AuthoringPublishSagaTimeoutRate`	timeout_rate > 5% in 15m	warning	`rb/publish-saga-timeout`
`AuthoringAIModerationRate`	moderation_block_rate > 20% in 15m	warning	`rb/ai-moderation-spike`
`AuthoringAICostAnomaly`	cost p99 > 3x 7d baseline	warning	`rb/ai-cost-anomaly`
`AuthoringDBConnectionsExhausted`	pool_active / pool_max > 0.9	critical	`rb/authoring-db`

6.2 Security Alerts

Alert	Condition	Severity
`AuthoringCrossTenantAttempt`	Any `authoring.cross_tenant` 403	critical (SecOps)
`AuthoringUnauthorizedPublish`	Publish attempted on foreign draft	critical (SecOps)
`AuthoringScormImportFailureSpike`	Import failure rate > 50% in 10m	warning (possible RCE attempt)

7. SLIs / SLOs

SLI	Definition	SLO Target	Window
Availability (write)	1 - (5xx rate on write endpoints)	99.9%	30d rolling
Availability (read)	1 - (5xx rate on read endpoints)	99.95%	30d rolling
Latency (write p95)	write endpoint duration	< 400ms	7d rolling
Latency (read p95)	read endpoint duration	< 150ms	7d rolling
AI job success rate	ai_jobs completed / attempted	98%	7d rolling
Publish saga success rate	sagas reaching `ready` / initiated	99%	7d rolling
Event delivery lag p95	outbox→NATS publish latency	< 5s	7d rolling

Error budget = (1 - SLO). Burn rate is the observed error rate divided by the error budget. Alerts fire at 2x, 5x, and 14.4x burn rates.
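
As a worked example of the arithmetic: with a 99.9% SLO the error budget is 0.1%, so an observed 5xx rate of 2% is a burn rate of about 20x. A 14.4x burn sustained for one hour consumes about 2% of a 30-day budget (14.4 × 1h / 720h). The sketch below encodes these formulas; the function names are illustrative and the strict `>` comparison is an assumption taken from the "burn > 14.4x" alert condition.

```typescript
// Error budget = 1 - SLO, with the SLO expressed as a fraction (0.999 for 99.9%).
function errorBudget(slo: number): number {
  return 1 - slo;
}

// Burn rate = observed error rate / error budget.
function burnRate(observedErrorRate: number, slo: number): number {
  return observedErrorRate / errorBudget(slo);
}

// Fraction of the window's budget consumed by sustaining `rate` for `hours`.
function budgetConsumed(rate: number, hours: number, windowDays: number): number {
  return (rate * hours) / (windowDays * 24);
}

// Burn-rate thresholds from the text above, highest first.
const BURN_THRESHOLDS = [14.4, 5, 2];

// Returns the highest threshold exceeded, or null if none is.
function firedThreshold(rate: number): number | null {
  for (const threshold of BURN_THRESHOLDS) {
    if (rate > threshold) return threshold;
  }
  return null;
}
```

Because of floating-point rounding in `1 - slo`, compare computed burn rates with a small tolerance rather than for exact equality.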

8. Runbooks (Pointers)

All runbooks live at runbooks.ghasi.io/authoring/*:

rb/authoring-availability — service down / 5xx spike
rb/authoring-latency — p95 regression
rb/authoring-dlq — DLQ non-empty (manual replay)
rb/authoring-outbox-stalled — relay crashed or backed up
rb/publish-saga-timeout — saga stuck in step
rb/ai-moderation-spike — content safety team triage
rb/ai-cost-anomaly — AI spend investigation
rb/authoring-db — database pool exhaustion

9. Exemplars

Every histogram metric carries trace exemplars. Grafana panels allow "jump to trace" from a latency spike.

10. AI-Specific Telemetry

Per AI job, we emit:

trace_id, span_id, tenant_id, user_id_hash, draft_id,
flow, prompt_id, prompt_version, model, model_version,
input_tokens, output_tokens, cost_microusd, cached (bool),
local (bool), moderation_verdict, duration_ms,
accepted_at (null | ts), edit_distance (null | int)

Analytics-service consumes authoring.block.ai_generated.v1 and authoring.block.reviewed.v1 to compute acceptance rate, time-to-accept, edit distance distribution, and a per-author AI trust score.
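
To show how the per-job fields could roll up into an acceptance rate, here is a minimal sketch over a subset of the fields listed above. It assumes a job counts as accepted when `accepted_at` is non-null; the `FlowStats` shape and `statsByFlow` name are hypothetical, and the actual analytics-service computation is not described in this document.

```typescript
// Subset of the per-job fields listed above.
interface AiJobRecord {
  trace_id: string;
  tenant_id: string;
  flow: string;
  input_tokens: number;
  output_tokens: number;
  cost_microusd: number;
  accepted_at: string | null;
  edit_distance: number | null;
}

interface FlowStats {
  jobs: number;
  accepted: number;
  acceptanceRate: number;
  costMicrousd: number;
}

// Aggregate job records per flow.
// Assumption: a job is "accepted" when accepted_at is non-null.
function statsByFlow(records: AiJobRecord[]): Map<string, FlowStats> {
  const stats = new Map<string, FlowStats>();
  for (const record of records) {
    const s = stats.get(record.flow) ?? {
      jobs: 0,
      accepted: 0,
      acceptanceRate: 0,
      costMicrousd: 0,
    };
    s.jobs += 1;
    if (record.accepted_at !== null) s.accepted += 1;
    s.costMicrousd += record.cost_microusd;
    s.acceptanceRate = s.accepted / s.jobs;
    stats.set(record.flow, s);
  }
  return stats;
}
```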

11. Publish Saga Telemetry

Every saga emits a trace spanning all steps. The span tree:

publish.saga.start
├── publish.saga.step.building       (waits for content-service)
├── publish.saga.step.cataloging     (waits for catalog-service)
├── publish.saga.step.bundling       (waits for content-service bundle)
└── publish.saga.step.ready

On failure, span attributes include compensation_path and failure_step.
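
The span tree and failure attributes can be sketched as plain data. This is an illustration only: the `Span` shape, the `success`/`failure` outcome values, and the reading of `compensation_path` as the completed steps in reverse order are assumptions, not the service's documented behavior.

```typescript
const SAGA_STEPS = ["building", "cataloging", "bundling", "ready"] as const;
type SagaStep = (typeof SAGA_STEPS)[number];

interface Span {
  name: string;
  attributes: Record<string, string>;
  children: Span[];
}

// Build the span tree for one saga run, stopping at `failedAt` if given.
function buildSagaTrace(draftId: string, sagaId: string, failedAt?: SagaStep): Span {
  const root: Span = {
    name: "publish.saga.start",
    attributes: { draft_id: draftId, saga_id: sagaId },
    children: [],
  };
  const completed: SagaStep[] = [];
  for (const step of SAGA_STEPS) {
    const failed = step === failedAt;
    root.children.push({
      name: `publish.saga.step.${step}`,
      attributes: { step, outcome: failed ? "failure" : "success" },
      children: [],
    });
    if (failed) {
      root.attributes.failure_step = step;
      // Assumption: compensation undoes completed steps in reverse order.
      root.attributes.compensation_path = [...completed].reverse().join(",");
      break;
    }
    completed.push(step);
  }
  return root;
}
```

For example, a failure at `bundling` would yield three step spans, with the root carrying `failure_step` set to `bundling` and a compensation path covering `cataloging` then `building`.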

12. Offline Telemetry (S5)

Client buffers structured events to IndexedDB when offline
On reconnect, events are uploaded with original timestamps and delayed=true tag
Server applies replay-safe ingestion (original timestamps are preserved; delayed events do not feed real-time metrics)
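
The client-side flow above can be sketched as a buffer in front of an upload function. The `EventStore` interface and `MemoryStore` are hypothetical stand-ins so the example runs outside a browser; in the client, the store would be backed by IndexedDB.

```typescript
interface TelemetryEvent {
  event: string;
  timestamp: string; // original client timestamp, preserved on replay
  delayed?: boolean;
}

// Hypothetical storage interface; in the browser this would wrap IndexedDB.
interface EventStore {
  append(e: TelemetryEvent): void;
  drain(): TelemetryEvent[];
}

class MemoryStore implements EventStore {
  private items: TelemetryEvent[] = [];

  append(e: TelemetryEvent): void {
    this.items.push(e);
  }

  // Return all buffered events and clear the buffer.
  drain(): TelemetryEvent[] {
    const out = this.items;
    this.items = [];
    return out;
  }
}

type Upload = (batch: TelemetryEvent[]) => void;

class OfflineBuffer {
  private readonly store: EventStore;
  private readonly upload: Upload;

  constructor(store: EventStore, upload: Upload) {
    this.store = store;
    this.upload = upload;
  }

  // Send immediately when online; otherwise buffer for later.
  record(e: TelemetryEvent, online: boolean): void {
    if (online) this.upload([e]);
    else this.store.append(e);
  }

  // On reconnect: upload buffered events with delayed=true, timestamps untouched.
  flushOnReconnect(): number {
    const batch = this.store.drain().map((e) => ({ ...e, delayed: true }));
    if (batch.length > 0) this.upload(batch);
    return batch.length;
  }
}
```

This sketch drains the store before the upload succeeds; a real client would want to delete events only after the server acknowledges them.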

13. PII Handling in Telemetry

Field	Treatment
`userId`	Hashed (sha256 + salt rotating daily)
`email`	Never logged
`draft content`	Never logged in full; truncated hashes only
`AI prompt inputs`	Redacted via classifier; hashes only
`IP address`	`/24` prefix only; rotated hash after 30d

PII leakage detected in logs triggers automated quarantine and an incident ticket.
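
Two of the treatments in the table, the salted user ID hash and the `/24` IP prefix, can be sketched as follows. How the daily salt is derived is not stated in this document; deriving it from a secret plus the UTC date is an assumption, as are the function names. IPv6 is not handled.

```typescript
import { createHash } from "node:crypto";

// Hypothetical salt derivation: a secret combined with the UTC date (YYYY-MM-DD),
// so the salt changes once per day.
function dailySalt(secret: string, now: Date): string {
  const day = now.toISOString().slice(0, 10);
  return createHash("sha256").update(`${secret}:${day}`).digest("hex");
}

// sha256(userId + salt), as in the user_id_hash field of the log schema.
function hashUserId(userId: string, salt: string): string {
  return createHash("sha256").update(userId + salt).digest("hex");
}

// Keep only the /24 prefix of an IPv4 address; returns null for invalid input.
function ipv4Prefix24(ip: string): string | null {
  const parts = ip.split(".");
  if (parts.length !== 4) return null;
  if (!parts.every((p) => /^\d{1,3}$/.test(p))) return null;
  const octets = parts.map(Number);
  if (octets.some((o) => o > 255)) return null;
  return `${octets[0]}.${octets[1]}.${octets[2]}.0/24`;
}
```

With a daily salt, the same user hashes to different values on different days, which limits long-term correlation across log lines.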

14. Retention

Data	Hot	Cold	Archive
Traces	7d (Tempo)	30d (Parquet)	90d then delete
Metrics	30d (Prometheus)	13mo (Mimir)	—
Logs	14d (Loki)	395d (S3 parquet)	—
Profiles	7d	—	—
Audit logs	90d hot	7y cold	legal hold on request

Per-tenant retention may override (higher tiers get longer retention).
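
One way to read the table is as cumulative age thresholds with an optional per-tenant override. The sketch below assumes that reading, which the table does not state explicitly, and covers only the rows whose values are given in days; `tierFor` and the override shape are hypothetical.

```typescript
type DataKind = "traces" | "logs" | "profiles";
type Tier = "hot" | "cold" | "archive" | "deleted";

interface Retention {
  hotDays: number;
  coldDays?: number;
  archiveDays?: number;
}

// Defaults from the table above, read as cumulative age limits (an assumption).
const DEFAULT_RETENTION: Record<DataKind, Retention> = {
  traces: { hotDays: 7, coldDays: 30, archiveDays: 90 },
  logs: { hotDays: 14, coldDays: 395 },
  profiles: { hotDays: 7 },
};

// Resolve the storage tier for data of a given age, applying any tenant override.
function tierFor(kind: DataKind, ageDays: number, override: Partial<Retention> = {}): Tier {
  const r: Retention = { ...DEFAULT_RETENTION[kind], ...override };
  if (ageDays <= r.hotDays) return "hot";
  if (r.coldDays !== undefined && ageDays <= r.coldDays) return "cold";
  if (r.archiveDays !== undefined && ageDays <= r.archiveDays) return "archive";
  return "deleted";
}
```

A higher-tier tenant could then be modeled by passing an override such as `{ hotDays: 30 }` for logs.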
