Observability

:::info Source
Sourced from services/delivery-service/OBSERVABILITY.md in the documentation repo.
:::

Companion: 15 Observability & Telemetry · APPLICATION_LOGIC

1. Telemetry Stack

The service follows the platform standard: OpenTelemetry SDK (Node.js) -> OTel Collector -> Prometheus + Tempo + Loki. It imports the @ghasi/telemetry wrapper and never uses vendor SDKs directly.

2. Service Metadata

service.name: delivery-service
service.namespace: ghasi-edtech
service.version: <semver>
deployment.environment: <dev|staging|prod>

3. Metrics

3.1 RED Metrics (required)

| Metric | Type | Labels | Purpose |
| --- | --- | --- | --- |
| delivery_http_requests_total | Counter | method, route, status_code, tenant_id | Request rate |
| delivery_http_request_duration_seconds | Histogram | method, route, status_code | Latency |
| delivery_http_errors_total | Counter | method, route, error_code | Error rate |
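To make the label sets above concrete, here is a minimal sketch of a labelled counter that renders samples in the Prometheus text exposition format. It is not the @ghasi/telemetry API, which this document does not describe; `LabeledCounter`, `sortedEntries`, and the `/sessions/:id` route are hypothetical names used only for illustration.

```typescript
type Labels = Record<string, string>;

// Escape a label value per the Prometheus text exposition format.
function escapeLabelValue(value: string): string {
  return value.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");
}

// Sort labels by name so that label order never creates a second series.
function sortedEntries(labels: Labels): [string, string][] {
  return Object.entries(labels).sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0));
}

// Hypothetical in-memory counter; a real service would go through the wrapper.
class LabeledCounter {
  readonly name: string;
  private readonly series = new Map<string, { labels: Labels; value: number }>();

  constructor(name: string) {
    this.name = name;
  }

  inc(labels: Labels, by: number = 1): void {
    if (by < 0) throw new Error("counters only go up");
    const key = JSON.stringify(sortedEntries(labels));
    const existing = this.series.get(key);
    if (existing) existing.value += by;
    else this.series.set(key, { labels: { ...labels }, value: by });
  }

  get(labels: Labels): number {
    return this.series.get(JSON.stringify(sortedEntries(labels)))?.value ?? 0;
  }

  // One exposition line per label combination.
  render(): string[] {
    const lines: string[] = [];
    this.series.forEach(({ labels, value }) => {
      const body = sortedEntries(labels)
        .map(([k, v]) => `${k}="${escapeLabelValue(v)}"`)
        .join(",");
      lines.push(`${this.name}{${body}} ${value}`);
    });
    return lines;
  }
}

const requests = new LabeledCounter("delivery_http_requests_total");
requests.inc({ method: "GET", route: "/sessions/:id", status_code: "200", tenant_id: "tnt_a" });
```

Using the route template rather than the raw path as the `route` value keeps the number of series bounded; that choice is an assumption here, since the source only names the label.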

3.2 Domain Metrics

| Metric | Type | Labels |
| --- | --- | --- |
| delivery_play_sessions_active | Gauge | tenant_id |
| delivery_play_sessions_started_total | Counter | tenant_id, is_offline |
| delivery_play_sessions_completed_total | Counter | tenant_id |
| delivery_play_sessions_abandoned_total | Counter | tenant_id, reason |
| delivery_session_duration_seconds | Histogram | tenant_id, outcome |
| delivery_navigation_events_total | Counter | tenant_id, navigation_type |
| delivery_tutor_turns_total | Counter | tenant_id, local |
| delivery_tutor_turn_duration_seconds | Histogram | tenant_id, local, model |
| delivery_tutor_tokens_total | Counter | direction, tenant_id, model |
| delivery_tutor_cost_microusd_total | Counter | tenant_id, model |
| delivery_offline_mounts_active | Gauge | tenant_id |
| delivery_offline_mounts_total | Counter | tenant_id, outcome (success/rejected) |
| delivery_tamper_detected_total | Counter | tenant_id |

3.3 Infrastructure Metrics

| Metric | Type | Purpose |
| --- | --- | --- |
| delivery_db_pool_connections_active | Gauge | DB pool saturation |
| delivery_db_query_duration_seconds | Histogram | DB query latency |
| delivery_redis_operations_total | Counter | Redis ops |
| delivery_redis_latency_seconds | Histogram | Redis latency |
| delivery_outbox_lag_seconds | Gauge | Outbox publish delay |
| delivery_inbox_processed_total | Counter | Events consumed |
| delivery_inbox_dlq_total | Counter | Events sent to DLQ |

4. Logs

Structured JSON, schema-validated via @ghasi/telemetry. Required fields per platform §3.1:

{
  "timestamp": "2026-04-15T10:15:30.123Z",
  "severity": "INFO",
  "service": "delivery-service",
  "trace_id": "00-abc...-01",
  "span_id": "def...",
  "tenant_id": "tnt_01H...",
  "request_id": "req_01H...",
  "actor_id": "<hashed>",
  "event": "play_session.started",
  "session_id": "pls_01H...",
  "enrollment_id": "enr_01H...",
  "course_version_id": "cvr_01H...",
  "device_id": "dev_01H...",
  "is_offline": false,
  "log_schema_version": "1.0"
}
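As an illustration of what schema validation of such a record could look like, the sketch below checks a log record for a set of required fields. The actual required list lives in platform §3.1 and is not reproduced here, so `REQUIRED_FIELDS` is an assumption (the non-event-specific fields from the sample), and `validateLogRecord` is a hypothetical helper rather than part of @ghasi/telemetry.

```typescript
// Assumed required fields; the authoritative list is in platform §3.1.
const REQUIRED_FIELDS = [
  "timestamp", "severity", "service", "trace_id", "span_id",
  "tenant_id", "request_id", "actor_id", "event", "log_schema_version",
] as const;

const SEVERITIES: readonly string[] = ["DEBUG", "INFO", "WARN", "ERROR", "FATAL"];

// Returns a list of problems; an empty list means the record passed.
function validateLogRecord(record: Record<string, unknown>): string[] {
  const problems: string[] = [];

  for (const field of REQUIRED_FIELDS) {
    const value = record[field];
    if (typeof value !== "string" || value.length === 0) {
      problems.push(`missing or empty field: ${field}`);
    }
  }

  const severity = record["severity"];
  if (typeof severity === "string" && SEVERITIES.indexOf(severity) === -1) {
    problems.push(`unknown severity: ${severity}`);
  }

  const timestamp = record["timestamp"];
  if (typeof timestamp === "string" && timestamp.length > 0 && Number.isNaN(Date.parse(timestamp))) {
    problems.push("timestamp is not a parseable date");
  }

  return problems;
}
```

Returning a list of problems instead of throwing lets an emitter decide whether to drop, repair, or flag a malformed record; which of those the wrapper does is not stated in the source.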

4.1 Log Levels

| Level | Usage |
| --- | --- |
| DEBUG | Development only; disabled in prod |
| INFO | Business events (session start, completion, tutor turn) |
| WARN | Recoverable issues (rate limits hit, retry attempts) |
| ERROR | Request failures, unexpected exceptions |
| FATAL | Service cannot continue (startup failures, unrecoverable states) |

4.2 PII Redaction

  • actor_id is always hashed (SHA-256 with tenant salt).
  • Tutor prompt and response fields are NEVER logged at INFO level.
  • Tutor content is logged at DEBUG only, with a PII classifier pass.
  • Redaction is enforced by the @ghasi/telemetry emitter.
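The redaction rules above can be sketched as a pure function applied before a record is emitted. This is an illustration under assumptions, not the emitter's real implementation: the field names `tutor_prompt` and `tutor_response` are hypothetical, the hash is injected so the sketch stays dependency-free (production would use SHA-256 with the tenant salt, as stated above), and the PII classifier pass for DEBUG content is left out.

```typescript
type LogFields = Record<string, unknown>;
type Hasher = (value: string, tenantSalt: string) => string;

// Hypothetical names for the tutor content fields.
const TUTOR_CONTENT_FIELDS = ["tutor_prompt", "tutor_response"];

function redactForEmit(
  record: LogFields,
  level: string,
  tenantSalt: string,
  hash: Hasher,
): LogFields {
  // Work on a copy so the caller's object is never mutated.
  const out: LogFields = { ...record };

  // actor_id is always hashed, regardless of level.
  const actor = out["actor_id"];
  if (typeof actor === "string") {
    out["actor_id"] = hash(actor, tenantSalt);
  }

  // Tutor content survives only at DEBUG; it is stripped at every other level.
  if (level !== "DEBUG") {
    for (const field of TUTOR_CONTENT_FIELDS) {
      delete out[field];
    }
  }
  return out;
}
```

Stripping tutor content at every level other than DEBUG is a deliberate reading of "DEBUG only": it is stricter than removing the fields at INFO alone, so WARN and ERROR records cannot carry tutor text either.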

5. Distributed Tracing

5.1 Span Taxonomy

| Span Name | Kind | Attributes |
| --- | --- | --- |
| delivery.start_play_session | SERVER | session.id, enrollment.id, is_offline |
| delivery.navigate | SERVER | session.id, navigation.type |
| delivery.complete_session | SERVER | session.id, duration_seconds |
| delivery.tutor_turn | SERVER | session.id, turn.id, ai.local |
| delivery.tutor_stream | INTERNAL | turn.id, tokens |
| delivery.mount_offline | SERVER | mount.id, bundle.id |
| delivery.db.query | CLIENT | db.statement |
| delivery.nats.publish | PRODUCER | messaging.destination |
| delivery.nats.consume | CONSUMER | messaging.destination |
| delivery.ai_client.stream | CLIENT | ai.model, ai.local |
| delivery.content_client.manifest | CLIENT | http.url |
| delivery.enrollment_client.validate | CLIENT | http.url |

5.2 Trace Propagation

  • Incoming requests: parse traceparent header; if missing, start new trace.
  • Outgoing HTTP: inject traceparent and tracestate.
  • NATS: inject traceparent into envelope correlationId (W3C-derived).
  • SSE streams: parent span kept alive until stream closes; each chunk is a span event (not a child span) to avoid span explosion.
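For the incoming and outgoing HTTP rules above, a minimal sketch of parsing and re-emitting a W3C `traceparent` header follows. In the service this is presumably handled by the OpenTelemetry propagator; `parseTraceparent` and `formatTraceparent` are hypothetical helpers, and the strict regular expression only accepts the four-field layout of version `00`.

```typescript
interface TraceContext {
  version: string;
  traceId: string;   // 32 lowercase hex characters
  parentId: string;  // 16 lowercase hex characters
  sampled: boolean;  // lowest bit of trace-flags
}

const TRACEPARENT_RE = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Returns null when the header is missing or invalid; the caller then starts a new trace.
function parseTraceparent(header: string | undefined): TraceContext | null {
  if (!header) return null;
  const match = TRACEPARENT_RE.exec(header.trim());
  if (!match) return null;

  const [, version = "", traceId = "", parentId = "", flags = ""] = match;
  if (version === "ff") return null;           // version ff is forbidden
  if (/^0+$/.test(traceId)) return null;       // all-zero trace-id is invalid
  if (/^0+$/.test(parentId)) return null;      // all-zero parent-id is invalid

  return { version, traceId, parentId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}

// Builds the header for an outgoing call, with the current span as the new parent.
function formatTraceparent(ctx: TraceContext, currentSpanId: string): string {
  return `00-${ctx.traceId}-${currentSpanId}-${ctx.sampled ? "01" : "00"}`;
}
```

Treating a malformed header the same as a missing one matches the "if missing, start new trace" rule and avoids continuing a trace from untrusted, unparseable input.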

5.3 Sampling

  • Default: head-based 10% sampling.
  • 100% sampling for:
    • Errors (5xx)
    • Sessions containing tutor turns
    • Offline mount operations
    • Tamper detection events
  • Sampling decision propagated via tracestate.
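The sampling policy above can be expressed as a small decision function. This is a simplified sketch, not the platform sampler: `shouldSample` and the `SamplingInput` fields are hypothetical, and the ratio check derives a bucket from the first 8 hex characters of the trace ID, which is an illustrative choice rather than the OpenTelemetry algorithm. Note that some inputs, such as a 5xx status, are only known once the request has finished, so where this decision runs is left open here.

```typescript
interface SamplingInput {
  traceId: string;          // 32 hex characters
  httpStatus?: number;
  hasTutorTurn?: boolean;
  isOfflineMount?: boolean;
  tamperDetected?: boolean;
}

const DEFAULT_RATIO = 0.1; // 10% default from the policy above

function shouldSample(input: SamplingInput, ratio: number = DEFAULT_RATIO): boolean {
  // Always-keep rules: errors, tutor sessions, offline mounts, tamper events.
  if ((input.httpStatus ?? 0) >= 500) return true;
  if (input.hasTutorTurn || input.isOfflineMount || input.tamperDetected) return true;

  // Deterministic ratio: every service seeing the same trace ID reaches the
  // same decision. Maps the first 32 bits of the ID onto [0, 1).
  const bucket = parseInt(input.traceId.slice(0, 8), 16) / 0x100000000;
  return bucket < ratio;
}
```

Deriving the decision from the trace ID rather than a random number keeps it stable across retries and across services that evaluate the same trace.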

6. Dashboards

All dashboards are stored as code in the grafana/ repo:

| Dashboard | Purpose |
| --- | --- |
| delivery-overview | RED metrics, active sessions, error rates |
| delivery-sessions | Session lifecycle funnel, abandonment reasons, duration histogram |
| delivery-ai-tutor | Tutor turn volume, latency, cost, local-vs-cloud split, error rate |
| delivery-offline | Active mounts, mount success rate, tamper detections, unmount reasons |
| delivery-tenant-{tenantId} | Per-tenant views for large customers (RBAC-scoped) |
| delivery-slo | SLO burn-rate, error budget remaining |

7. Alerts

Alerts are declared in alerts/delivery/, with Alertmanager routing to PagerDuty and Slack.

7.1 Critical (P1)

| Alert | Condition | Runbook |
| --- | --- | --- |
| DeliveryServiceDown | up{service="delivery"} == 0 for 2 min | runbooks/delivery/service-down |
| DeliveryHighErrorRate | 5xx rate > 5% for 5 min | runbooks/delivery/high-error-rate |
| DeliveryDatabaseUnreachable | DB connection errors > 10/min | runbooks/delivery/db-unreachable |
| TamperSpike | tamper_detected rate > 10/min | runbooks/delivery/tamper-spike |
| CrossTenantAccessAttempt | Any audit.security.cross_tenant_rejected | runbooks/security/cross-tenant |
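Several conditions above have the form "threshold exceeded for N minutes". The sketch below illustrates that semantics on a list of evaluation samples: the condition must hold at every evaluation across the whole duration, and a single dip resets the clock. The real rules are evaluated by Prometheus from the files in alerts/delivery/; `holdsFor` and `RateSample` are hypothetical names, and the samples are assumed to be sorted by time.

```typescript
interface RateSample {
  tMs: number;    // evaluation time in milliseconds
  value: number;  // e.g. 5xx ratio at that evaluation
}

// True when the most recent unbroken run of samples above the threshold
// spans at least forMs. Samples must be in ascending time order.
function holdsFor(samples: RateSample[], threshold: number, forMs: number): boolean {
  let newest: number | null = null;
  let runStart: number | null = null;

  // Walk backwards from the latest evaluation until the condition breaks.
  for (let i = samples.length - 1; i >= 0; i--) {
    const sample = samples[i];
    if (sample === undefined || sample.value <= threshold) break;
    if (newest === null) newest = sample.tMs;
    runStart = sample.tMs;
  }

  return newest !== null && runStart !== null && newest - runStart >= forMs;
}

// Example: "5xx rate > 5% for 5 min" with one evaluation per minute.
const FIVE_MINUTES_MS = 5 * 60 * 1000;
const steady: RateSample[] = [0, 1, 2, 3, 4, 5].map((m) => ({ tMs: m * 60000, value: 0.08 }));
const firing = holdsFor(steady, 0.05, FIVE_MINUTES_MS);
```

The duration requirement is what keeps a brief spike from paging anyone: a single evaluation below the threshold restarts the wait.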

7.2 High (P2)

| Alert | Condition |
| --- | --- |
| DeliveryTutorLatencyHigh | p95 > 5s for 10 min |
| DeliveryOutboxLag | Lag > 30s for 5 min |
| DeliveryDLQGrowing | DLQ has > 100 messages |
| AITutorCostSpike | Cost > 3x baseline over 1h |

7.3 Medium (P3)

| Alert | Condition |
| --- | --- |
| DeliveryNavigationLatencyHigh | p95 > 500ms for 15 min |
| SessionAbandonmentHigh | Abandonment rate > 40% for 1 hour |

8. SLOs

| SLI | Target | Window |
| --- | --- | --- |
| Start session availability | 99.9% | 30 days |
| Navigation latency | p95 < 300ms | 30 days |
| Tutor turn time-to-first-token | p95 < 1.5s | 30 days |
| Tutor turn availability | 99.5% | 30 days |
| Offline mount success rate | 99% | 30 days |
| Session completion fidelity (no lost sessions) | 99.99% | 90 days |

Error budgets are calculated by Sloth; burn-rate alerts fire at 2x and 10x burn rates.
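To make the budget and burn-rate numbers tangible, here is a sketch of the underlying arithmetic for an availability SLO. Sloth generates the actual recording and alerting rules; the function names below are hypothetical and the formulas are the standard definitions (budget = 1 - target, burn rate = observed error ratio / budget).

```typescript
// Total allowed "bad" minutes over the window for an availability target.
function errorBudgetMinutes(target: number, windowDays: number): number {
  return (1 - target) * windowDays * 24 * 60;
}

// How many times faster than sustainable the budget is being consumed.
function burnRate(observedErrorRatio: number, target: number): number {
  return observedErrorRatio / (1 - target);
}

// Days until the whole budget is gone if the current burn rate persists.
function daysToExhaustion(rate: number, windowDays: number): number {
  return rate > 0 ? windowDays / rate : Infinity;
}

// Start session availability: 99.9% over 30 days is about 43.2 minutes of budget.
const startSessionBudget = errorBudgetMinutes(0.999, 30);

// An observed 1% error ratio against a 99.9% target burns at roughly 10x,
// which would exhaust a 30-day budget in about 3 days.
const rate = burnRate(0.01, 0.999);
const daysLeft = daysToExhaustion(rate, 30);
```

A burn rate of 1 means the budget lasts exactly the window; the 2x and 10x thresholds correspond to exhausting a 30-day budget in about 15 and 3 days respectively.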

9. Exemplars

Every histogram metric carries exemplars that link to traces. Exemplars are enabled at the Collector, with the tail_sampling processor configured for errors and slow paths.

10. Cost Monitoring

Cost signals tagged per tenant:

| Cost Signal | Source |
| --- | --- |
| AI tutor tokens | delivery_tutor_cost_microusd_total |
| Database storage | PG table size monitoring |
| NATS throughput | Stream size monitoring |
| Network egress | AWS/GCP cost explorer tagged |

Per-tenant cost dashboard available to finance for cost attribution.
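As an example of per-tenant attribution for the AI tutor signal, the sketch below rolls up micro-USD increments from delivery_tutor_cost_microusd_total into USD per tenant. How the dashboard actually queries the data is not described here; `CostSample`, `costUsdByTenant`, and the model names are hypothetical.

```typescript
interface CostSample {
  tenantId: string;
  model: string;
  microUsd: number; // increase of the counter over the reporting period
}

function costUsdByTenant(samples: CostSample[]): Map<string, number> {
  // Sum in integer micro-USD first so the totals stay exact.
  const microTotals = new Map<string, number>();
  for (const sample of samples) {
    microTotals.set(sample.tenantId, (microTotals.get(sample.tenantId) ?? 0) + sample.microUsd);
  }

  // Convert to USD once, at the end.
  const usdTotals = new Map<string, number>();
  microTotals.forEach((micro, tenantId) => {
    usdTotals.set(tenantId, micro / 1e6);
  });
  return usdTotals;
}

const report = costUsdByTenant([
  { tenantId: "tnt_a", model: "model-x", microUsd: 1500000 },
  { tenantId: "tnt_a", model: "model-y", microUsd: 500000 },
  { tenantId: "tnt_b", model: "model-x", microUsd: 250000 },
]);
```

Keeping the counter in micro-USD and converting only for display avoids accumulating floating-point error across many small per-turn costs.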