Observability
:::info Source
Sourced from services/delivery-service/OBSERVABILITY.md in the documentation repo.
:::
Companion: 15 Observability & Telemetry · APPLICATION_LOGIC
1. Telemetry Stack
Follows the platform standard: OpenTelemetry SDK (Node.js) -> OTel Collector -> Prometheus + Tempo + Loki. The service imports the @ghasi/telemetry wrapper and never imports vendor SDKs directly.
2. Service Metadata
service.name: delivery-service
service.namespace: ghasi-edtech
service.version: <semver>
deployment.environment: <dev|staging|prod>
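As a minimal sketch, the four attributes above could be assembled and validated at startup as follows. The helper name `buildServiceMetadata` and the semver pattern are assumptions for illustration; the real wiring lives inside @ghasi/telemetry, whose API is not shown here.

```typescript
// Hypothetical helper: builds the resource attributes listed above.
const ENVIRONMENTS = ["dev", "staging", "prod"] as const;
type DeploymentEnvironment = (typeof ENVIRONMENTS)[number];

interface ServiceMetadata {
  "service.name": string;
  "service.namespace": string;
  "service.version": string;
  "deployment.environment": DeploymentEnvironment;
}

// Simplified semver check (major.minor.patch plus optional pre-release/build).
const SEMVER = /^\d+\.\d+\.\d+(?:-[0-9A-Za-z.-]+)?(?:\+[0-9A-Za-z.-]+)?$/;

function isEnvironment(value: string): value is DeploymentEnvironment {
  return (ENVIRONMENTS as readonly string[]).indexOf(value) !== -1;
}

function buildServiceMetadata(version: string, environment: string): ServiceMetadata {
  if (!SEMVER.test(version)) {
    throw new Error(`service.version must be semver, got "${version}"`);
  }
  if (!isEnvironment(environment)) {
    throw new Error(`deployment.environment must be dev, staging, or prod, got "${environment}"`);
  }
  return {
    "service.name": "delivery-service",
    "service.namespace": "ghasi-edtech",
    "service.version": version,
    "deployment.environment": environment,
  };
}
```

Failing fast on a malformed version or environment keeps mislabeled telemetry out of shared dashboards.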
3. Metrics
3.1 RED Metrics (required)
| Metric | Type | Labels | Purpose |
|---|---|---|---|
| delivery_http_requests_total | Counter | method, route, status_code, tenant_id | Request rate |
| delivery_http_request_duration_seconds | Histogram | method, route, status_code | Latency |
| delivery_http_errors_total | Counter | method, route, error_code | Error rate |
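The label sets in the table can be illustrated with a small sketch that turns one finished request into the samples it would contribute. The types `FinishedRequest` and `MetricSample`, the function `redSamples`, and the rule that an error sample is emitted only when an `errorCode` is present are assumptions; the actual recording goes through @ghasi/telemetry.

```typescript
interface FinishedRequest {
  method: string;
  route: string;          // route template (e.g. "/sessions/:id", a hypothetical route), not the raw URL
  statusCode: number;
  tenantId: string;
  durationSeconds: number;
  errorCode?: string;     // assumed to be set only when the request failed
}

interface MetricSample {
  name: string;
  value: number;
  labels: Record<string, string>;
}

// Maps one request onto the three RED metrics with the label sets from the table.
function redSamples(req: FinishedRequest): MetricSample[] {
  const base = { method: req.method, route: req.route };
  const status = String(req.statusCode);
  const samples: MetricSample[] = [
    {
      name: "delivery_http_requests_total",
      value: 1,
      labels: { ...base, status_code: status, tenant_id: req.tenantId },
    },
    {
      name: "delivery_http_request_duration_seconds",
      value: req.durationSeconds,
      labels: { ...base, status_code: status },
    },
  ];
  if (req.errorCode !== undefined) {
    samples.push({
      name: "delivery_http_errors_total",
      value: 1,
      labels: { ...base, error_code: req.errorCode },
    });
  }
  return samples;
}
```

Using the route template rather than the raw path keeps label cardinality bounded, which matters most for the histogram.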
3.2 Domain Metrics
| Metric | Type | Labels |
|---|---|---|
| delivery_play_sessions_active | Gauge | tenant_id |
| delivery_play_sessions_started_total | Counter | tenant_id, is_offline |
| delivery_play_sessions_completed_total | Counter | tenant_id |
| delivery_play_sessions_abandoned_total | Counter | tenant_id, reason |
| delivery_session_duration_seconds | Histogram | tenant_id, outcome |
| delivery_navigation_events_total | Counter | tenant_id, navigation_type |
| delivery_tutor_turns_total | Counter | tenant_id, local |
| delivery_tutor_turn_duration_seconds | Histogram | tenant_id, local, model |
| delivery_tutor_tokens_total | Counter | direction, tenant_id, model |
| delivery_tutor_cost_microusd_total | Counter | tenant_id, model |
| delivery_offline_mounts_active | Gauge | tenant_id |
| delivery_offline_mounts_total | Counter | tenant_id, outcome (success/rejected) |
| delivery_tamper_detected_total | Counter | tenant_id |
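To make the unit of delivery_tutor_cost_microusd_total concrete, here is a sketch of the arithmetic for one tutor turn. It relies on the identity that a price of 1 USD per million tokens equals 1 micro-USD per token. The function name, the `input`/`output` split (standing in for the `direction` label, whose actual values are not given here), and the prices used in the usage note are all assumptions, not real rates.

```typescript
interface TurnUsage {
  inputTokens: number;
  outputTokens: number;
}

interface ModelPrice {
  inputUsdPerMillionTokens: number;   // hypothetical price, supplied by the caller
  outputUsdPerMillionTokens: number;  // hypothetical price, supplied by the caller
}

// USD per million tokens is numerically equal to micro-USD per token,
// so the cost in micro-USD is a plain weighted sum of token counts.
function turnCostMicroUsd(usage: TurnUsage, price: ModelPrice): number {
  const cost =
    usage.inputTokens * price.inputUsdPerMillionTokens +
    usage.outputTokens * price.outputUsdPerMillionTokens;
  // Counters work best with whole increments; round to the nearest micro-USD.
  return Math.round(cost);
}
```

For example, 1,200 input tokens at a placeholder 0.5 USD per million and 300 output tokens at a placeholder 1.5 USD per million come to 600 + 450 = 1,050 micro-USD. Storing whole micro-USD avoids accumulating floating-point drift in a long-lived counter.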
3.3 Infrastructure Metrics
| Metric | Type | Purpose |
|---|---|---|
| delivery_db_pool_connections_active | Gauge | DB pool saturation |
| delivery_db_query_duration_seconds | Histogram | DB query latency |
| delivery_redis_operations_total | Counter | Redis ops |
| delivery_redis_latency_seconds | Histogram | Redis latency |
| delivery_outbox_lag_seconds | Gauge | Outbox publish delay |
| delivery_inbox_processed_total | Counter | Events consumed |
| delivery_inbox_dlq_total | Counter | Events sent to DLQ |
4. Logs
Structured JSON, schema-validated via @ghasi/telemetry. Required fields per platform §3.1:
{
"timestamp": "2026-04-15T10:15:30.123Z",
"severity": "INFO",
"service": "delivery-service",
"trace_id": "00-abc...-01",
"span_id": "def...",
"tenant_id": "tnt_01H...",
"request_id": "req_01H...",
"actor_id": "<hashed>",
"event": "play_session.started",
"session_id": "pls_01H...",
"enrollment_id": "enr_01H...",
"course_version_id": "cvr_01H...",
"device_id": "dev_01H...",
"is_offline": false,
"log_schema_version": "1.0"
}
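A rough sketch of what schema validation for such a record might look like is shown below. It assumes the common fields in the sample (everything except the event-specific IDs and flags) are the required ones; the authoritative list is platform §3.1, and the real validator is part of @ghasi/telemetry. The name `validateLogRecord` is hypothetical.

```typescript
// Assumed required fields, taken from the sample record above.
const REQUIRED_STRING_FIELDS = [
  "timestamp", "severity", "service", "trace_id", "span_id",
  "tenant_id", "request_id", "actor_id", "event", "log_schema_version",
];

const SEVERITIES = ["DEBUG", "INFO", "WARN", "ERROR", "FATAL"];

// Returns a list of problems; an empty list means the record passed.
function validateLogRecord(record: Record<string, unknown>): string[] {
  const problems: string[] = [];
  for (const field of REQUIRED_STRING_FIELDS) {
    const value = record[field];
    if (typeof value !== "string" || value.length === 0) {
      problems.push(`missing or non-string field: ${field}`);
    }
  }
  const severity = record["severity"];
  if (typeof severity === "string" && SEVERITIES.indexOf(severity) === -1) {
    problems.push(`unknown severity: ${severity}`);
  }
  const timestamp = record["timestamp"];
  if (typeof timestamp === "string" && Number.isNaN(Date.parse(timestamp))) {
    problems.push(`unparseable timestamp: ${timestamp}`);
  }
  return problems;
}
```

Returning a list of problems instead of throwing lets the emitter decide whether to drop the record, repair it, or log a validation warning.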
4.1 Log Levels
| Level | Usage |
|---|---|
| DEBUG | Development only; disabled in prod |
| INFO | Business events (session start, completion, tutor turn) |
| WARN | Recoverable issues (rate limits hit, retry attempts) |
| ERROR | Request failures, unexpected exceptions |
| FATAL | Service cannot continue (startup failures, unrecoverable states) |
4.2 PII Redaction
- actor_id is always hashed (SHA-256 with tenant salt).
- Tutor prompt and response fields are NEVER logged at INFO level.
- Tutor content logged at DEBUG only, with PII classifier pass.
- Redaction enforced by @ghasi/telemetry emitter.
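The first two rules can be sketched with Node's built-in crypto module. How the tenant salt is combined with the actor ID is not specified above, so the `salt:actorId` concatenation here is an assumption, as are the function names. The PII classifier pass for DEBUG content is out of scope for this sketch.

```typescript
import { createHash } from "node:crypto";

// Salted SHA-256 of the actor ID; the "salt:actorId" layout is an assumption.
function hashActorId(actorId: string, tenantSalt: string): string {
  return createHash("sha256")
    .update(tenantSalt)
    .update(":")
    .update(actorId)
    .digest("hex");
}

// Drops tutor prompt/response from any record that is not DEBUG.
function redactTutorContent(record: Record<string, unknown>): Record<string, unknown> {
  if (record["severity"] === "DEBUG") {
    return record; // DEBUG content is handled by the PII classifier (not shown)
  }
  const copy: Record<string, unknown> = { ...record };
  delete copy["prompt"];
  delete copy["response"];
  return copy;
}
```

Because the salt is per tenant, the same actor hashes to different values in different tenants, so hashed IDs cannot be joined across tenant boundaries.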
5. Distributed Tracing
5.1 Span Taxonomy
| Span Name | Kind | Attributes |
|---|---|---|
| delivery.start_play_session | SERVER | session.id, enrollment.id, is_offline |
| delivery.navigate | SERVER | session.id, navigation.type |
| delivery.complete_session | SERVER | session.id, duration_seconds |
| delivery.tutor_turn | SERVER | session.id, turn.id, ai.local |
| delivery.tutor_stream | INTERNAL | turn.id, tokens |
| delivery.mount_offline | SERVER | mount.id, bundle.id |
| delivery.db.query | CLIENT | db.statement |
| delivery.nats.publish | PRODUCER | messaging.destination |
| delivery.nats.consume | CONSUMER | messaging.destination |
| delivery.ai_client.stream | CLIENT | ai.model, ai.local |
| delivery.content_client.manifest | CLIENT | http.url |
| delivery.enrollment_client.validate | CLIENT | http.url |
5.2 Trace Propagation
- Incoming requests: parse traceparent header; if missing, start new trace.
- Outgoing HTTP: inject traceparent and tracestate.
- NATS: inject traceparent into envelope correlationId (W3C-derived).
- SSE streams: parent span kept alive until stream closes; each chunk is a span event (not a child span) to avoid span explosion.
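The incoming and outgoing rules above can be sketched against the W3C Trace Context header layout (`version-traceid-parentid-flags`). In practice the OpenTelemetry propagator behind @ghasi/telemetry does this; the function and type names below are hypothetical. The parser handles the version 00 layout only and rejects headers with extra fields, which is stricter than the specification requires for future versions.

```typescript
import { randomBytes } from "node:crypto";

interface TraceContext {
  traceId: string;  // 32 lowercase hex characters
  spanId: string;   // 16 lowercase hex characters (the parent-id field on the wire)
  sampled: boolean; // bit 0 of the trace-flags field
}

const TRACEPARENT = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(header: string | undefined): TraceContext | null {
  if (header === undefined) return null;
  const match = TRACEPARENT.exec(header.trim());
  if (match === null) return null;
  const version = match[1];
  const traceId = match[2];
  const spanId = match[3];
  const flags = match[4];
  if (version === "ff") return null;                                   // reserved, invalid
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;         // all-zero IDs are invalid
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}

// Serializes a context for an outgoing request.
function formatTraceparent(ctx: TraceContext): string {
  return `00-${ctx.traceId}-${ctx.spanId}-${ctx.sampled ? "01" : "00"}`;
}

// Incoming rule: reuse the caller's context, or start a new trace if none is usable.
function contextForIncoming(header: string | undefined): TraceContext {
  const parsed = parseTraceparent(header);
  if (parsed !== null) return parsed;
  return {
    traceId: randomBytes(16).toString("hex"),
    spanId: randomBytes(8).toString("hex"),
    sampled: false, // left for the sampler to decide
  };
}
```

Treating a malformed header the same as a missing one means a bad upstream client cannot break tracing for the request; it simply starts a fresh trace.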
5.3 Sampling
- Default: head-based 10% sampling.
- 100% sampling for:
  - Errors (5xx)
  - Sessions containing tutor turns
  - Offline mount operations
  - Tamper detection events
- Sampling decision propagated via tracestate.
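The policy above can be expressed as a single decision function, sketched below. Note that a 5xx status is only known once the request has finished, so this sketch assumes the decision is evaluated where the trace outcome is visible (for example in a tail-sampling stage, such as the Collector's tail_sampling processor mentioned under Exemplars). The `TraceSummary` shape and the use of the trace ID's low 32 bits as the sampling bucket are assumptions.

```typescript
interface TraceSummary {
  traceId: string;        // 32 lowercase hex characters
  httpStatus?: number;    // known only after the request completes
  hasTutorTurn: boolean;
  isOfflineMount: boolean;
  tamperDetected: boolean;
}

const DEFAULT_SAMPLE_RATIO = 0.1; // the 10% default

function shouldKeepTrace(trace: TraceSummary, ratio: number = DEFAULT_SAMPLE_RATIO): boolean {
  // Always-keep rules from the list above.
  if (trace.httpStatus !== undefined && trace.httpStatus >= 500) return true;
  if (trace.hasTutorTurn || trace.isOfflineMount || trace.tamperDetected) return true;

  // Deterministic ratio sampling: map the low 32 bits of the trace ID onto [0, 1).
  const bucket = parseInt(trace.traceId.slice(-8), 16) / 0x100000000;
  return bucket < ratio;
}
```

Deriving the bucket from the trace ID, rather than from a random number, means every service that sees the same trace reaches the same decision without coordination.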
6. Dashboards
All dashboards are stored as code in the grafana/ repo:
| Dashboard | Purpose |
|---|---|
| delivery-overview | RED metrics, active sessions, error rates |
| delivery-sessions | Session lifecycle funnel, abandonment reasons, duration histogram |
| delivery-ai-tutor | Tutor turn volume, latency, cost, local-vs-cloud split, error rate |
| delivery-offline | Active mounts, mount success rate, tamper detections, unmount reasons |
| delivery-tenant-{tenantId} | Per-tenant views for large customers (RBAC-scoped) |
| delivery-slo | SLO burn-rate, error budget remaining |
7. Alerts
Alerts are declared in alerts/delivery/, with Alertmanager routing to PagerDuty and Slack.
7.1 Critical (P1)
| Alert | Condition | Runbook |
|---|---|---|
| DeliveryServiceDown | up{service="delivery"} == 0 for 2 min | runbooks/delivery/service-down |
| DeliveryHighErrorRate | 5xx rate > 5% for 5 min | runbooks/delivery/high-error-rate |
| DeliveryDatabaseUnreachable | DB connection errors > 10/min | runbooks/delivery/db-unreachable |
| TamperSpike | tamper_detected rate > 10/min | runbooks/delivery/tamper-spike |
| CrossTenantAccessAttempt | Any audit.security.cross_tenant_rejected | runbooks/security/cross-tenant |
7.2 High (P2)
| Alert | Condition |
|---|---|
| DeliveryTutorLatencyHigh | p95 > 5s for 10 min |
| DeliveryOutboxLag | Lag > 30s for 5 min |
| DeliveryDLQGrowing | DLQ has > 100 messages |
| AITutorCostSpike | Cost > 3x baseline over 1h |
7.3 Medium (P3)
| Alert | Condition |
|---|---|
| DeliveryNavigationLatencyHigh | p95 > 500ms for 15 min |
| SessionAbandonmentHigh | Abandonment rate > 40% for 1 hour |
8. SLOs
| SLI | Target | Window |
|---|---|---|
| Start session availability | 99.9% | 30 days |
| Navigation latency | p95 < 300ms | 30 days |
| Tutor turn time-to-first-token | p95 < 1.5s | 30 days |
| Tutor turn availability | 99.5% | 30 days |
| Offline mount success rate | 99% | 30 days |
| Session completion fidelity (no lost sessions) | 99.99% | 90 days |
Error budgets are calculated by Sloth; burn-rate alerts fire at 2x and 10x burn rates.
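To make the budget and burn-rate figures concrete, the arithmetic can be sketched as below. Sloth generates the real recording and alerting rules; these helper names are hypothetical and the functions only illustrate the standard definitions (budget = 1 - target, burn rate = observed error ratio / budget).

```typescript
// Allowed "bad" minutes in the window for an availability target such as 0.999.
function errorBudgetMinutes(target: number, windowDays: number): number {
  return (1 - target) * windowDays * 24 * 60;
}

// How many times faster than sustainable the budget is being spent.
function burnRate(observedErrorRatio: number, target: number): number {
  return observedErrorRatio / (1 - target);
}

// Days until the whole budget is gone if the current burn rate persists.
function daysToExhaustion(rate: number, windowDays: number): number {
  return windowDays / rate;
}
```

For the start-session SLO (99.9% over 30 days) the budget is roughly 43.2 minutes of unavailability. A sustained 1% error ratio is a burn rate of about 10x, which would exhaust the budget in about 3 days; a 2x burn rate would exhaust it in about 15 days.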
9. Exemplars
Every histogram metric carries exemplars that link to traces. Exemplars are enabled at the Collector, with the tail_sampling processor configured for errors and slow paths.
10. Cost Monitoring
Cost signals are tagged per tenant:
| Cost Signal | Source |
|---|---|
| AI tutor tokens | delivery_tutor_cost_microusd_total |
| Database storage | PG table size monitoring |
| NATS throughput | Stream size monitoring |
| Network egress | AWS/GCP cost explorer tagged |
A per-tenant cost dashboard is available to finance for cost attribution.