Observability

:::info Source
Sourced from services/delivery-service/OBSERVABILITY.md in the documentation repo.
:::

Companion: 15 Observability & Telemetry · APPLICATION_LOGIC

1. Telemetry Stack

The service follows the platform standard: OpenTelemetry SDK (Node.js) -> OTel Collector -> Prometheus + Tempo + Loki. It imports the @ghasi/telemetry wrapper and never uses vendor SDKs directly.

2. Service Metadata

service.name: delivery-service
service.namespace: ghasi-edtech
service.version: <semver>
deployment.environment: <dev|staging|prod>
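
The metadata above could be assembled into a resource attribute map along the following lines. This is only a sketch: `buildResourceAttributes` and the loose semver check are hypothetical and are not part of @ghasi/telemetry, whose actual API is not shown in this document.

```typescript
// Hypothetical helper: builds the resource attributes listed above.
type Environment = "dev" | "staging" | "prod";

// Loose semver check (assumption): MAJOR.MINOR.PATCH with optional suffixes.
const SEMVER_RE = /^\d+\.\d+\.\d+(?:[-+][0-9A-Za-z.-]+)*$/;

function buildResourceAttributes(
  version: string,
  environment: Environment,
): Record<string, string> {
  if (!SEMVER_RE.test(version)) {
    throw new Error(`service.version must be semver, got "${version}"`);
  }
  return {
    "service.name": "delivery-service",
    "service.namespace": "ghasi-edtech",
    "service.version": version,
    "deployment.environment": environment,
  };
}
```

Rejecting a non-semver version at startup keeps free-form strings such as `latest` out of `service.version`.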

3. Metrics

3.1 RED Metrics (required)

Metric	Type	Labels	Purpose
`delivery_http_requests_total`	Counter	`method`, `route`, `status_code`, `tenant_id`	Request rate
`delivery_http_request_duration_seconds`	Histogram	`method`, `route`, `status_code`	Latency
`delivery_http_errors_total`	Counter	`method`, `route`, `error_code`	Error rate

3.2 Domain Metrics

Metric	Type	Labels
`delivery_play_sessions_active`	Gauge	`tenant_id`
`delivery_play_sessions_started_total`	Counter	`tenant_id`, `is_offline`
`delivery_play_sessions_completed_total`	Counter	`tenant_id`
`delivery_play_sessions_abandoned_total`	Counter	`tenant_id`, `reason`
`delivery_session_duration_seconds`	Histogram	`tenant_id`, `outcome`
`delivery_navigation_events_total`	Counter	`tenant_id`, `navigation_type`
`delivery_tutor_turns_total`	Counter	`tenant_id`, `local`
`delivery_tutor_turn_duration_seconds`	Histogram	`tenant_id`, `local`, `model`
`delivery_tutor_tokens_total`	Counter	`direction`, `tenant_id`, `model`
`delivery_tutor_cost_microusd_total`	Counter	`tenant_id`, `model`
`delivery_offline_mounts_active`	Gauge	`tenant_id`
`delivery_offline_mounts_total`	Counter	`tenant_id`, `outcome` (success/rejected)
`delivery_tamper_detected_total`	Counter	`tenant_id`
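
As an illustration of how `delivery_tutor_tokens_total` and `delivery_tutor_cost_microusd_total` might relate, the sketch below derives a per-turn cost in micro-USD from token counts. The `ModelPrice` shape, the per-token prices, and the split into input and output tokens are assumptions made for the example; the document does not specify pricing or the values of the `direction` label.

```typescript
// Hypothetical per-model pricing, expressed in micro-USD per token.
interface ModelPrice {
  inputMicroUsdPerToken: number;
  outputMicroUsdPerToken: number;
}

// Computes the amount by which a cost counter would be incremented
// for one tutor turn. Rounded to a whole number of micro-USD.
function tutorTurnCostMicroUsd(
  inputTokens: number,
  outputTokens: number,
  price: ModelPrice,
): number {
  if (inputTokens < 0 || outputTokens < 0) {
    throw new Error("token counts must be non-negative");
  }
  return Math.round(
    inputTokens * price.inputMicroUsdPerToken +
      outputTokens * price.outputMicroUsdPerToken,
  );
}

// Example with made-up prices: 1000 input tokens at 3 micro-USD each
// plus 500 output tokens at 15 micro-USD each.
const exampleCost = tutorTurnCostMicroUsd(1000, 500, {
  inputMicroUsdPerToken: 3,
  outputMicroUsdPerToken: 15,
});
console.log(exampleCost); // 10500
```

Keeping the counter in integer micro-USD avoids accumulating floating-point error across many small increments.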

3.3 Infrastructure Metrics

Metric	Type	Purpose
`delivery_db_pool_connections_active`	Gauge	DB pool saturation
`delivery_db_query_duration_seconds`	Histogram	DB query latency
`delivery_redis_operations_total`	Counter	Redis ops
`delivery_redis_latency_seconds`	Histogram	Redis latency
`delivery_outbox_lag_seconds`	Gauge	Outbox publish delay
`delivery_inbox_processed_total`	Counter	Events consumed
`delivery_inbox_dlq_total`	Counter	Events sent to DLQ

4. Logs

Logs are structured JSON, schema-validated via @ghasi/telemetry. Required fields per platform §3.1:

{
  "timestamp": "2026-04-15T10:15:30.123Z",
  "severity": "INFO",
  "service": "delivery-service",
  "trace_id": "00-abc...-01",
  "span_id": "def...",
  "tenant_id": "tnt_01H...",
  "request_id": "req_01H...",
  "actor_id": "<hashed>",
  "event": "play_session.started",
  "session_id": "pls_01H...",
  "enrollment_id": "enr_01H...",
  "course_version_id": "cvr_01H...",
  "device_id": "dev_01H...",
  "is_offline": false,
  "log_schema_version": "1.0"
}
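
A rough sketch of the kind of check schema validation might perform on a record like the one above. The real validation lives in @ghasi/telemetry and is not shown here; `validateLogRecord` is hypothetical, and treating the common envelope fields as required while leaving event-specific fields (such as `session_id`) optional is an assumption.

```typescript
// Severity names taken from the Log Levels table.
const SEVERITIES = ["DEBUG", "INFO", "WARN", "ERROR", "FATAL"] as const;

// Assumed set of required envelope fields; the authoritative list is platform §3.1.
const REQUIRED_FIELDS = [
  "timestamp", "severity", "service", "trace_id", "span_id",
  "tenant_id", "request_id", "actor_id", "event", "log_schema_version",
] as const;

// Returns a list of problems; an empty list means the record passed these checks.
function validateLogRecord(record: Record<string, unknown>): string[] {
  const problems: string[] = [];
  for (const field of REQUIRED_FIELDS) {
    const value = record[field];
    if (typeof value !== "string" || value.length === 0) {
      problems.push(`missing or empty field: ${field}`);
    }
  }
  const severity = record["severity"];
  if (
    typeof severity === "string" &&
    !(SEVERITIES as readonly string[]).includes(severity)
  ) {
    problems.push(`unknown severity: ${severity}`);
  }
  const timestamp = record["timestamp"];
  if (typeof timestamp === "string" && Number.isNaN(Date.parse(timestamp))) {
    problems.push("timestamp is not a parseable date");
  }
  return problems;
}
```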

4.1 Log Levels

Level	Usage
`DEBUG`	Development only; disabled in prod
`INFO`	Business events (session start, completion, tutor turn)
`WARN`	Recoverable issues (rate limits hit, retry attempts)
`ERROR`	Request failures, unexpected exceptions
`FATAL`	Service cannot continue (startup failures, unrecoverable states)

4.2 PII Redaction

actor_id is always hashed (SHA-256 with a tenant salt).
Tutor prompt and response fields are NEVER logged at INFO level.
Tutor content is logged at DEBUG only, with a PII classifier pass.
Redaction is enforced by the @ghasi/telemetry emitter.
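
The hashing rule for actor_id could look roughly like this. The document only states "SHA-256 with tenant salt"; how the salt is combined with the identifier (here, salt, a separator, then the ID) and the hex output encoding are assumptions, and `hashActorId` is a hypothetical name rather than the @ghasi/telemetry API.

```typescript
import { createHash } from "node:crypto";

// Hypothetical helper: salted SHA-256 of an actor ID, hex-encoded.
// The salt-then-separator-then-ID layout is an assumption for illustration.
function hashActorId(actorId: string, tenantSalt: string): string {
  return createHash("sha256")
    .update(tenantSalt)
    .update(":")
    .update(actorId)
    .digest("hex");
}
```

Because the salt is per tenant, the same actor ID hashes to different values in different tenants, so hashed IDs cannot be correlated across tenants from logs alone.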

5. Distributed Tracing

5.1 Span Taxonomy

Span Name	Kind	Attributes
`delivery.start_play_session`	SERVER	session.id, enrollment.id, is_offline
`delivery.navigate`	SERVER	session.id, navigation.type
`delivery.complete_session`	SERVER	session.id, duration_seconds
`delivery.tutor_turn`	SERVER	session.id, turn.id, ai.local
`delivery.tutor_stream`	INTERNAL	turn.id, tokens
`delivery.mount_offline`	SERVER	mount.id, bundle.id
`delivery.db.query`	CLIENT	db.statement
`delivery.nats.publish`	PRODUCER	messaging.destination
`delivery.nats.consume`	CONSUMER	messaging.destination
`delivery.ai_client.stream`	CLIENT	ai.model, ai.local
`delivery.content_client.manifest`	CLIENT	http.url
`delivery.enrollment_client.validate`	CLIENT	http.url

5.2 Trace Propagation

Incoming requests: parse traceparent header; if missing, start new trace.
Outgoing HTTP: inject traceparent and tracestate.
NATS: inject traceparent into envelope correlationId (W3C-derived).
SSE streams: parent span kept alive until stream closes; each chunk is a span event (not a child span) to avoid span explosion.
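
For the incoming-request rule, a minimal sketch of parsing a W3C `traceparent` header follows. In practice the OpenTelemetry SDK's propagator handles this; the standalone `parseTraceparent` function is hypothetical and only illustrates the header layout (version, trace ID, parent ID, flags). It accepts only the four-field form and does not attempt forward compatibility with future versions.

```typescript
interface TraceContext {
  version: string;
  traceId: string;   // 32 lowercase hex characters
  parentId: string;  // 16 lowercase hex characters
  sampled: boolean;  // lowest bit of the trace-flags field
}

const TRACEPARENT_RE =
  /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Returns null when the header is missing or invalid, in which case
// the caller would start a new trace.
function parseTraceparent(header: string | undefined): TraceContext | null {
  if (!header) return null;
  const match = TRACEPARENT_RE.exec(header.trim());
  if (!match) return null;
  const [, version, traceId, parentId, flags] = match;
  if (version === "ff") return null;            // version ff is invalid
  if (/^0+$/.test(traceId)) return null;        // all-zero trace ID is invalid
  if (/^0+$/.test(parentId)) return null;       // all-zero parent ID is invalid
  return {
    version,
    traceId,
    parentId,
    sampled: (parseInt(flags, 16) & 0x01) === 0x01,
  };
}
```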

5.3 Sampling

Default: head-based 10% sampling.
100% sampling for:
- Errors (5xx)
- Sessions containing tutor turns
- Offline mount operations
- Tamper detection events
Sampling decision propagated via tracestate.
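
The sampling rules above can be sketched as a single decision function. This is an illustration, not the service's sampler: `SamplingInput` and `shouldSample` are hypothetical names, and deriving the 10% bucket from the last 8 hex characters of the trace ID is an assumed technique. Note that a 5xx status is only known once the request has finished, so that rule cannot be evaluated at the head of a trace; the sketch simply treats all signals as already available.

```typescript
interface SamplingInput {
  traceId: string;          // 32 hex characters
  statusCode?: number;
  hasTutorTurn?: boolean;
  isOfflineMount?: boolean;
  tamperDetected?: boolean;
}

function shouldSample(input: SamplingInput, baseRate = 0.1): boolean {
  // Always-sample rules from the list above.
  if ((input.statusCode ?? 0) >= 500) return true;
  if (input.hasTutorTurn || input.isOfflineMount || input.tamperDetected) {
    return true;
  }
  // Deterministic ratio sampling: map the trace ID tail onto [0, 1).
  // An unparsable trace ID yields NaN, which compares false (not sampled).
  const bucket = parseInt(input.traceId.slice(-8), 16) / 0x100000000;
  return bucket < baseRate;
}
```

Deriving the decision from the trace ID rather than a random number means every service that sees the same trace reaches the same answer.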

6. Dashboards

All dashboards are stored as code in the grafana/ repo:

Dashboard	Purpose
`delivery-overview`	RED metrics, active sessions, error rates
`delivery-sessions`	Session lifecycle funnel, abandonment reasons, duration histogram
`delivery-ai-tutor`	Tutor turn volume, latency, cost, local-vs-cloud split, error rate
`delivery-offline`	Active mounts, mount success rate, tamper detections, unmount reasons
`delivery-tenant-{tenantId}`	Per-tenant views for large customers (RBAC-scoped)
`delivery-slo`	SLO burn-rate, error budget remaining

7. Alerts

Alerts are declared in alerts/delivery/, with Alertmanager routing to PagerDuty and Slack.

7.1 Critical (P1)

Alert	Condition	Runbook
`DeliveryServiceDown`	`up{service="delivery"} == 0` for 2 min	`runbooks/delivery/service-down`
`DeliveryHighErrorRate`	5xx rate > 5% for 5 min	`runbooks/delivery/high-error-rate`
`DeliveryDatabaseUnreachable`	DB connection errors > 10/min	`runbooks/delivery/db-unreachable`
`TamperSpike`	tamper_detected rate > 10/min	`runbooks/delivery/tamper-spike`
`CrossTenantAccessAttempt`	Any `audit.security.cross_tenant_rejected`	`runbooks/security/cross-tenant`

7.2 High (P2)

Alert	Condition
`DeliveryTutorLatencyHigh`	p95 > 5s for 10 min
`DeliveryOutboxLag`	Lag > 30s for 5 min
`DeliveryDLQGrowing`	DLQ has > 100 messages
`AITutorCostSpike`	Cost > 3x baseline over 1h

7.3 Medium (P3)

Alert	Condition
`DeliveryNavigationLatencyHigh`	p95 > 500ms for 15 min
`SessionAbandonmentHigh`	Abandonment rate > 40% for 1 hour

8. SLOs

SLI	Target	Window
Start session availability	99.9%	30 days
Navigation latency	p95 < 300ms	30 days
Tutor turn time-to-first-token	p95 < 1.5s	30 days
Tutor turn availability	99.5%	30 days
Offline mount success rate	99%	30 days
Session completion fidelity (no lost sessions)	99.99%	90 days

Error budgets are calculated by Sloth; burn-rate alerts fire at 2x and 10x burn rates.
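
For readers unfamiliar with burn rates, the arithmetic behind them can be sketched as follows. Sloth generates the actual recording and alerting rules; the function names here are hypothetical and only illustrate the standard definitions (error budget = 1 - target, burn rate = observed error ratio / error budget).

```typescript
// Fraction of requests allowed to fail under the SLO, e.g. 0.001 for 99.9%.
function errorBudget(sloTarget: number): number {
  return 1 - sloTarget;
}

// How many times faster than "exactly on budget" the budget is being spent.
function burnRate(observedErrorRatio: number, sloTarget: number): number {
  const budget = errorBudget(sloTarget);
  if (budget <= 0) throw new Error("SLO target must be below 1");
  return observedErrorRatio / budget;
}

// Days until the whole budget is spent if the burn rate stays constant.
function daysToExhaustion(rate: number, windowDays: number): number {
  return rate > 0 ? windowDays / rate : Infinity;
}
```

For example, a 0.2% error ratio against the 99.9% start-session target is a burn rate of about 2, and a sustained 10x burn would spend a 30-day budget in roughly 3 days.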

9. Exemplars

Every histogram metric has exemplars linking to traces. Exemplars are enabled at the Collector, with the tail_sampling processor configured for errors and slow paths.

10. Cost Monitoring

Cost signals are tagged per tenant:

Cost Signal	Source
AI tutor tokens	`delivery_tutor_cost_microusd_total`
Database storage	PG table size monitoring
NATS throughput	Stream size monitoring
Network egress	AWS/GCP cost explorer tagged

A per-tenant cost dashboard is available to finance for cost attribution.
