Observability

Source: docs/15-observability-telemetry.md in the documentation repo.

Document: 15 of N
Status: Normative
Owners: Platform SRE, Data Platform, AI Platform, Security
Aligned with: 01 Enterprise Architecture, 02 DDD Bounded Contexts, 03 Microservices, 04 Event-Driven Architecture, 05 API Design, 10 Authoring Tool, 11 LMS Player, 12 Data Models, 13 Security/Compliance/Tenancy
Applies to: All backend services, web/mobile clients, offline runtimes, authoring tools, AI services, marketplace, and analytics pipeline.

0. Purpose & Scope

Ghasi-edTech is an AI-first, offline-first, multi-tenant, event-driven learning platform. Observability is not optional instrumentation — it is the operating substrate that lets us:

Prove the platform is safe for learners (AI moderation, PII handling, abuse).
Prove it is correct (assessment scoring, licensing, gradebook integrity).
Prove it is available (SLOs, error budgets).
Prove it is cost-sane (AI token burn, egress, storage).
Prove it is tamper-evident (offline bundles, device binding, audit chain).
Drive learning analytics and instructor feedback loops via the same pipeline.

This document is the single source of truth for:

Log schema, field contracts, and redaction rules.
Metric taxonomy (RED, USE, and domain KPIs).
Distributed tracing via OpenTelemetry (OTel).
Dashboards, alerts, SLIs/SLOs, and error budgets.
AI-, offline-, player-, authoring-, and marketplace-specific telemetry.
Retention, privacy, residency, and downstream integration with analytics-service.
Incident response and automated remediation hooks.

It is normative for all code generation and review.

1. Principles

#	Principle	Consequence
P1	Three pillars, one correlation ID	Every log line, metric exemplar, and span carries `trace_id`, `tenant_id`, `request_id`, and `actor_id` (hashed).
P2	Structured by default	No free-form log strings in production paths. JSON, schema-validated, versioned (`log_schema_version`).
P3	PII never leaves the boundary unredacted	Redaction is a library, not discipline. Applied at emitter, verified at collector, re-verified at sink.
P4	Tenant isolation extends to telemetry	Tenant-scoped dashboards, alerts, retention, and export. No cross-tenant leakage in incident artifacts.
P5	Sampling is policy, not accident	Head-based for hot paths, tail-based for error/slow paths, 100% for safety-critical AI and assessment scoring.
P6	Cost is a first-class signal	AI spend, egress, and storage have SLOs just like latency.
P7	Offline is observable when it reconnects	Device-side telemetry buffer with tamper-evident framing; reconciled on sync.
P8	Events are the ledger	Domain events (see Doc 04) are replayable; telemetry augments but never replaces them.
P9	Every alert is actionable	Alerts reference a runbook slug, owner, and auto-remediation hook where applicable.
P10	Privacy > curiosity	When in doubt, drop the field. Learner wellbeing outranks debuggability.

2. Reference Stack

Layer	Chosen Tooling (normative)	Notes
Instrumentation	OpenTelemetry SDK (Node, Python, JVM, Go, Swift, Kotlin, Dart)	Single vendor-neutral API.
Collection	OTel Collector (gateway + agent tiers)	Redaction, tenant routing, sampling decisions.
Logs	Loki (hot 14d) → S3 + Parquet (cold 395d)	Indexed by `tenant_id`, `service`, `severity`.
Metrics	Prometheus (hot 30d) → Mimir/Thanos (13mo)	Remote-write from Collector.
Traces	Tempo (hot 7d) → S3 (90d, sampled)	Exemplars link metrics → traces → logs.
Dashboards	Grafana (per-tenant folders, RBAC)	Stored as code (`grafana/` repo).
Alerts	Alertmanager + PagerDuty + Slack `#oncall-*`	Alerts declared in Git, reviewed via PR.
Analytics sink	`analytics-service` (Kafka → ClickHouse)	Learning analytics, not SRE.
Audit sink	`audit-service` (append-only, WORM S3)	Security/compliance only.
SLO engine	Sloth → Prometheus rules	Generates burn-rate alerts.
Incident	PagerDuty + Statuspage + `incident-bot`	Auto-declares, pulls runbook, opens bridge.

Services MUST NOT import vendor SDKs directly. They import @ghasi/telemetry (per language), which wraps OTel and enforces field contracts.

3. Identity, Correlation, and Context Propagation

3.1 Required Context Keys

Every telemetry signal (log, metric exemplar, span attribute, event) MUST include the keys below where the value exists. Absent values use null, never "".

Key	Type	Source	Notes
`trace_id`	hex(32)	W3C `traceparent`	Generated at edge if absent.
`span_id`	hex(16)	OTel
`request_id`	uuidv7	Edge (API gateway)	Survives through queues via baggage.
`tenant_id`	ULID	JWT → baggage	Mandatory outside onboarding.
`org_unit_id`	ULID?	JWT	School/campus scope.
`actor_id_hash`	sha256(actor_id + tenant_salt)	Auth layer	Raw `actor_id` NEVER in telemetry.
`actor_role`	enum	Auth	`learner\|instructor\|author\|admin\|service\|anonymous`
`session_id`	ULID	Player/Authoring	For cross-request stitching.
`device_id_hash`	sha256	Offline SDK	Device binding (Doc 13).
`app`	string	Build	`web-player`, `mobile-player`, `authoring`, `admin`.
`app_version`	semver	Build
`env`	enum	Runtime	`dev\|staging\|prod\|sandbox`.
`region`	string	Runtime	`af-jnb-1`, `eu-fra-1`, …
`log_schema_version`	int	Library	Current: `3`.
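
A minimal sketch of how the hashed identity keys above could be derived and how absent values stay null. The table specifies only `sha256(actor_id + tenant_salt)`; the helper names `hash_identifier` and `build_context`, the byte-string salt, and the sample values are assumptions made for illustration.

```python
import hashlib
import json
from typing import Optional


def hash_identifier(raw_id: Optional[str], tenant_salt: bytes) -> Optional[str]:
    """Return sha256(raw_id + tenant_salt) as hex, or None when the id is absent."""
    if raw_id is None:
        return None  # absent values are null, never ""
    return hashlib.sha256(raw_id.encode("utf-8") + tenant_salt).hexdigest()


def build_context(tenant_id, actor_id, org_unit_id, tenant_salt):
    """Assemble a subset of the required context keys (illustrative only)."""
    return {
        "tenant_id": tenant_id,
        "org_unit_id": org_unit_id,  # None serializes to JSON null
        "actor_id_hash": hash_identifier(actor_id, tenant_salt),
        "log_schema_version": 3,
    }


# Hypothetical values; a real salt would come from a per-tenant secret.
ctx = build_context("tenant-a", "user-123", None, b"example-salt")
print(json.dumps(ctx))
```

Because the salt is per tenant, the same raw `actor_id` hashes to different values in different tenants, which keeps hashed identifiers from being joined across tenants.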

3.2 Baggage

OTel Baggage carries tenant_id, request_id, actor_role, offline_origin, ai_budget_id across all hops (HTTP, gRPC, Kafka). Baggage is stripped at egress to 3rd-party APIs.

3.3 Correlation Across Boundaries

HTTP: traceparent, tracestate, x-ghasi-tenant, x-ghasi-request-id.
Kafka: OTel headers + tenant_id header used for partitioning and DLQ routing.
WebSocket/SSE (player tutor): trace_id established per connection, per-message span_id.
Offline → Online sync: Device emits a sync_batch_id; server links all replayed signals to the original offline trace_id preserved in the bundle.
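
As an illustration of the HTTP case, the sketch below parses a W3C `traceparent` header and starts a new trace at the edge when none is present. It handles only version `00` of the header; the function names and the choice to mark edge-generated traces as sampled are assumptions, not part of this specification.

```python
import re
import secrets

# W3C Trace Context, version 00: 32-hex trace-id, 16-hex parent-id, 2-hex flags.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header):
    """Return (trace_id, parent_span_id, sampled) or None if the header is invalid."""
    match = _TRACEPARENT.match(header or "")
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero ids are invalid per the W3C spec.
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return trace_id, span_id, bool(int(flags, 16) & 0x01)


def ensure_trace_context(headers):
    """Reuse an inbound trace, or generate one at the edge when absent or invalid."""
    parsed = parse_traceparent(headers.get("traceparent"))
    trace_id = parsed[0] if parsed else secrets.token_hex(16)
    span_id = secrets.token_hex(8)  # a new span id for this hop
    # Assumption: traces created at the edge start out flagged as sampled.
    flags = "01" if (parsed is None or parsed[2]) else "00"
    return {**headers, "traceparent": f"00-{trace_id}-{span_id}-{flags}"}
```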

4. Logging

4.1 Log Schema (v3)

{
  "ts": "2026-04-15T09:12:33.214Z",
  "level": "INFO|DEBUG|WARN|ERROR|FATAL|AUDIT",
  "msg": "short human summary, <=120 chars, no interpolation of PII",
  "event": "player.lesson.completed",
  "service": "player-service",
  "component": "ProgressAggregator",
  "trace_id": "…", "span_id": "…", "request_id": "…",
  "tenant_id": "…", "org_unit_id": "…",
  "actor_id_hash": "…", "actor_role": "learner",
  "session_id": "…", "device_id_hash": "…",
  "app": "web-player", "app_version": "2026.4.1",
  "env": "prod", "region": "af-jnb-1",
  "attrs": { "course_id": "…", "lesson_id": "…", "score": 0.87 },
  "error": { "type": "…", "message": "…", "stack": "…", "cause": {…} },
  "log_schema_version": 3
}

Rules:

msg is static; variable data goes in attrs.
event uses domain.entity.action (see Doc 04 event catalog).
attrs keys MUST be snake_case and namespaced.
error.stack only at ERROR/FATAL and only in non-prod, or after scrubbing in prod.
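
The rules above lend themselves to a mechanical check. The following is a rough sketch of one, not the `@ghasi/telemetry` validator: `validate_record` is a hypothetical name, the event pattern accepts two or more segments because the catalogue in this document includes names such as `player.navigation`, and the static-`msg`, namespacing, and prod-scrubbing rules are not checked here.

```python
import re

# Lowercase dot-separated segments, two or more.
EVENT_RE = re.compile(r"^[a-z0-9_]+(\.[a-z0-9_]+)+$")
SNAKE_RE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")
STACK_LEVELS = {"ERROR", "FATAL"}


def validate_record(record):
    """Return a list of rule violations for one log record (empty list means OK)."""
    problems = []
    if len(record.get("msg", "")) > 120:
        problems.append("msg exceeds 120 characters")
    if not EVENT_RE.match(record.get("event", "")):
        problems.append("event is not dot-separated lowercase")
    for key in record.get("attrs", {}):
        if not SNAKE_RE.match(key):
            problems.append(f"attrs key not snake_case: {key}")
    error = record.get("error") or {}
    if error.get("stack") and record.get("level") not in STACK_LEVELS:
        problems.append("error.stack present below ERROR")
    if record.get("log_schema_version") != 3:
        problems.append("unexpected log_schema_version")
    return problems
```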

4.2 Levels & Usage

Level	Use	Sampling
`FATAL`	Process-terminating	100%
`ERROR`	Contract violation, unhandled	100%
`WARN`	Recoverable anomaly, degraded mode	100%
`AUDIT`	Security/compliance events (see 4.5)	100%, also to `audit-service`
`INFO`	Domain event emission, lifecycle	100% in prod, filtered by category
`DEBUG`	Developer detail	0% prod, 100% sandbox

4.3 PII & Redaction

Redaction is applied in the @ghasi/telemetry library before serialization. The collector re-runs redaction as defense in depth.

Deny-list fields (never logged, even hashed unless noted):

password, password_hash, otp, totp_secret, access_token, refresh_token, session_cookie, id_token, private_key, webhook_secret, payment_pan, cvv, national_id, passport_no, date_of_birth, home_address, phone_e164, email, learner_free_text_answer, tutor_prompt_raw, tutor_response_raw, parent_contact, health_note.

Hashed (keyed per tenant salt): actor_id, device_id, ip_address, learner_email, guardian_email.

Truncated + categorized (never raw):

Tutor prompts/responses → prompt_category, prompt_length, prompt_lang, safety_tags[].
Free-text answers → answer_length, lang, similarity_to_reference (0–1).

Automated scanners: A nightly job runs regex + ML PII detectors on Loki; any hit opens a SEV-2 ticket and quarantines the offending log stream.
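
To make the emitter-side step concrete, here is a minimal sketch of a redaction pass over `attrs`. The deny and hash lists are copied from this section; the function name `redact`, the `_hash` key suffix, and the concatenation-style salted hash (mirroring `actor_id_hash` in §3.1) are assumptions. Only nested dictionaries are walked, so lists of objects would need extra handling.

```python
import hashlib

DENY = frozenset({
    "password", "password_hash", "otp", "totp_secret", "access_token",
    "refresh_token", "session_cookie", "id_token", "private_key",
    "webhook_secret", "payment_pan", "cvv", "national_id", "passport_no",
    "date_of_birth", "home_address", "phone_e164", "email",
    "learner_free_text_answer", "tutor_prompt_raw", "tutor_response_raw",
    "parent_contact", "health_note",
})
HASHED = frozenset({"actor_id", "device_id", "ip_address", "learner_email", "guardian_email"})


def redact(attrs, tenant_salt):
    """Drop deny-listed keys and replace hashed keys with tenant-salted digests."""
    out = {}
    for key, value in attrs.items():
        if key in DENY:
            continue  # never logged, not even hashed
        if key in HASHED:
            if value is not None:
                digest = hashlib.sha256(str(value).encode("utf-8") + tenant_salt)
                out[key + "_hash"] = digest.hexdigest()
        elif isinstance(value, dict):
            out[key] = redact(value, tenant_salt)  # recurse into nested attrs
        else:
            out[key] = value
    return out
```

Running the same function at the collector, as this section requires, catches fields that a misconfigured emitter let through.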

4.4 Multi-Tenancy in Logs

Loki label set: {service, env, region, severity, tenant_id} — tenant_id is a required label; unlabeled logs are dropped at the Collector with a counter incremented.
Per-tenant retention overrides (enterprise tier) are supported via Collector routing to dedicated indexes.
Tenant admin UI can export only their own logs (signed, time-boxed S3 URL, 24h).

4.5 Audit Logs (separate pipeline)

Audit events are never best-effort. They go through a synchronous, ack'd write to audit-service before the user-facing response completes. Failures return 503 — we do not transact without audit.

Audit event categories:

Auth: login, mfa, session revoke, impersonation.
Data: export, delete (DSAR), bulk access.
Content: publish, unpublish, license grant/revoke.
Moderation: AI block, human override, appeal outcome.
Offline: bundle issue, revoke, device bind/unbind.
Grading: grade change, regrade, override, release.
Marketplace: purchase, refund, payout, dispute.
Admin: role change, policy change, tenant config change.

Schema: canonical log schema + audit.action, audit.target, audit.before_hash, audit.after_hash, audit.signed_by, audit.chain_prev_hash (hash-chained for tamper evidence).
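
The hash chaining mentioned above can be sketched as follows. This is an illustration under stated assumptions, not the `audit-service` implementation: the canonical JSON encoding, the all-zero genesis value, and the function names are hypothetical, and only three of the audit fields are modelled.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder for the first entry's previous hash


def entry_hash(entry):
    """Hash a canonical JSON encoding of the entry (sorted keys, no whitespace)."""
    canonical = json.dumps(entry, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def append_entry(chain, action, target):
    """Append an entry whose chain_prev_hash commits to the previous entry."""
    prev = entry_hash(chain[-1]) if chain else GENESIS
    entry = {"audit.action": action, "audit.target": target,
             "audit.chain_prev_hash": prev}
    chain.append(entry)
    return entry


def verify_chain(chain):
    """Return True when every entry points at the hash of its predecessor."""
    prev = GENESIS
    for entry in chain:
        if entry["audit.chain_prev_hash"] != prev:
            return False
        prev = entry_hash(entry)
    return True
```

A chain of this shape detects edits to any entry that has a successor. Protecting the most recent entry requires an external anchor, for which a signature such as `audit.signed_by` is one candidate.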

5. Metrics

5.1 Taxonomy

Three families, each with strict naming:

RED per request path: http_requests_total, http_request_duration_seconds, http_requests_errors_total.
USE per resource: process_cpu_seconds_total, db_pool_connections{state}, kafka_consumer_lag.
Domain (DKPIs): <domain>_<entity>_<action>_total — see §5.3+.

Naming rules:

_total for counters, _seconds for latency, _ratio for 0–1, _bytes for sizes.
Labels are bounded cardinality. tenant_id is allowed (bounded ~ low 10k); user_id is never a label.
High-cardinality dimensions (course, lesson) go to exemplars and analytics-service, not Prometheus labels.

5.2 Standard Labels

Every metric includes: service, env, region, tenant_tier (free|school|district|enterprise). Domain metrics additionally include tenant_id when cardinality permits (Collector enforces a per-series cap and down-labels above threshold).

5.3 Per-Service SLIs (all services)

SLI	Definition	Target
Availability	`1 - errors/total` over read paths (errors = 5xx responses; client-closed 499s are not counted)	99.9%
Latency P95	`http_request_duration_seconds` P95	≤ 300 ms (API) / ≤ 150 ms (edge)
Latency P99		≤ 800 ms
Saturation	`db_pool_in_use / db_pool_max`	< 0.8 sustained
Queue lag	`kafka_consumer_lag`	< 5000 / partition

5.4 Domain Metrics — Catalogue

Full catalogue is in telemetry/metrics.yaml (code-generated). Highlights:

Identity: auth_login_total{result}, auth_mfa_challenge_total{method,result}, auth_session_revoked_total{reason}.

Curriculum: curriculum_course_published_total, curriculum_draft_autosave_seconds, curriculum_review_cycle_duration_seconds.

Authoring: see §9.

Player: see §8.

Assessment: assessment_attempt_started_total, assessment_attempt_completed_total{result}, assessment_autograde_latency_seconds, assessment_regrade_total{reason}, assessment_integrity_flags_total{flag}.

AI: see §7.

Offline: see §8.

Marketplace: see §10.

Billing: billing_invoice_total{status}, billing_dunning_stage_total{stage}.

Notifications: notification_delivered_total{channel,status}, notification_suppressed_total{reason}.

5.5 Exemplars

Every domain counter/histogram carries exemplars linking to trace IDs for 1 in N successful requests and 100% of errored requests. Grafana exemplar panels enable one-click metric → trace → log drill-down.

6. Distributed Tracing

6.1 Instrumentation Rules

All inbound HTTP, gRPC, GraphQL, WebSocket, and Kafka consumer handlers create a root or child span.
All outbound DB, cache, HTTP, Kafka producer, object-store, and AI provider calls create a child span.
Spans have otel.status_code, error.type, and domain-specific attributes.
Critical rule: Any span crossing a trust boundary (tenant → provider, online → offline, sync-in) MUST carry a trust.boundary attribute and its own error budget.

6.2 Span Attributes (domain examples)

ghasi.tenant_id, ghasi.actor_role
ghasi.course_id, ghasi.lesson_id, ghasi.attempt_id
ghasi.ai.model, ghasi.ai.provider, ghasi.ai.purpose
ghasi.ai.tokens_in, ghasi.ai.tokens_out, ghasi.ai.cost_usd
ghasi.offline.bundle_id, ghasi.offline.sync_batch_id
ghasi.integrity.flag (proctoring signals)

6.3 Sampling

Path	Strategy	Rate
Health / liveness	Drop	0%
Read APIs (cached)	Head-based	1%
Write APIs	Head-based	10%
Assessment scoring, grade commit, payment, DSAR	Always-on	100%
AI inference (tutor, content-gen, moderation)	Always-on	100%
Errors (4xx ≥ 429, all 5xx)	Tail-based override	100%
Slow requests (> SLO P99)	Tail-based override	100%
Offline sync replay	Always-on	100%

Tail sampling runs in the Collector gateway tier with a 30s decision window.
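
The table can be read as a head decision followed by a tail override. The sketch below encodes that reading; the path-class names, the deterministic bucketing on the last eight hex digits of `trace_id`, the zero rate for unknown classes, and the rule that health checks are dropped even on error are all assumptions. The 800 ms default is the P99 target from §5.3.

```python
ALWAYS_ON = {"assessment_scoring", "grade_commit", "payment", "dsar",
             "ai_inference", "offline_sync_replay"}
HEAD_RATES = {"health": 0.0, "read_api": 0.01, "write_api": 0.10}


def head_sample(trace_id, path_class):
    """Deterministic head decision: the same trace_id always gets the same answer."""
    if path_class in ALWAYS_ON:
        return True
    rate = HEAD_RATES.get(path_class, 0.0)
    bucket = int(trace_id[-8:], 16) / 0x1_0000_0000  # value in [0, 1)
    return bucket < rate


def tail_keep(status_code, duration_ms, slo_p99_ms=800):
    """Tail override: keep 429-499, all 5xx, and anything slower than the P99 SLO."""
    if status_code >= 429:
        return True
    return duration_ms > slo_p99_ms


def keep_trace(trace_id, path_class, status_code, duration_ms):
    """Combine both decisions; health and liveness traces are always dropped."""
    if path_class == "health":
        return False
    return head_sample(trace_id, path_class) or tail_keep(status_code, duration_ms)
```

Bucketing on the trace ID rather than a random draw means every service that sees the same trace reaches the same head decision without coordination.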

6.4 Redaction in Spans

db.statement is captured parameterized only (placeholders, never values). URL paths are templatized (/courses/:id/lessons/:lid). Request/response bodies are never placed in span attributes.

7. AI Telemetry

AI is the highest-risk, highest-cost surface. It gets its own first-class observability.

7.1 Dimensions to Capture (per invocation)

Dimension	Example	Purpose
`ai.purpose`	`tutor.explain`, `authoring.outline`, `moderation.text`, `assessment.rubric_grade`	SLOs, cost allocation
`ai.model` / `ai.model_version`	`claude-sonnet-4-6@20261001`	Drift detection, A/B
`ai.provider`	`anthropic`, `internal-llm`, `openai-fallback`	Provider SLOs, failover
`ai.prompt_template_id` + `hash`	`tutor.v14`	Provenance
`ai.tokens_in` / `ai.tokens_out`		Cost
`ai.cost_usd`		Cost
`ai.latency_ms_ttfb` / `ttlb`		UX
`ai.cache.hit` / `ai.cache.key_hash`		Cost
`ai.safety.pre` / `ai.safety.post`	`{"self_harm":0.01,...}`	Safety
`ai.safety.action`	`allow\|redact\|block\|escalate`	Safety
`ai.guardrail.violations[]`	`pii_leak`, `age_inappropriate`	Safety
`ai.output.citations[]`	source doc IDs	Provenance
`ai.output.grounding_score`	0–1	Hallucination
`ai.human_override`	bool	Loop closure
`ai.confidence`	0–1	Routing

Prompts and responses themselves go to a separate, encrypted, tenant-scoped store (ai-transcripts-service) with tighter retention (§11). Telemetry carries only hashes, categories, and safety signals.

7.2 AI SLIs & SLOs

SLI	Target
Tutor TTFB P95	≤ 1.2s online, ≤ 300ms offline-SLM
Tutor stream completion success	≥ 99.5%
Moderation decision latency P99	≤ 400ms
Safety false-negative rate (sampled audit)	≤ 0.1%
Safety false-positive rate	≤ 2%
Authoring AI accept rate	tracked, no threshold (product KPI)
AI cost per MAU	≤ budget per tier
Cache hit rate (prompt+ctx)	≥ 35% tutor, ≥ 60% authoring
Provider failover success	≥ 99%
Grounding score (RAG paths)	P50 ≥ 0.8

7.3 AI Cost Observability

Budget IDs attach to every invocation; rollups by tenant_id × purpose × model in ClickHouse.
Circuit breakers fire when a tenant/purpose breaches 120% of hourly budget: degrade model (Opus → Sonnet → Haiku), enable aggressive caching, then fail-closed to cached/static responses.
Dashboards: ai/cost-overview, ai/per-tenant, ai/per-purpose, ai/model-mix, ai/cache-efficacy.
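
One way to picture the circuit breaker's degradation ladder is as a small state machine. The sketch below assumes that each evaluation window in which spend exceeds 120% of the hourly budget advances the state by exactly one rung; this section does not say how the steps are sequenced, and `BreakerState` and `step` are hypothetical names.

```python
from dataclasses import dataclass, replace

MODEL_TIERS = ["opus", "sonnet", "haiku"]  # downgrade order from the text above


@dataclass(frozen=True)
class BreakerState:
    model: str = "opus"
    aggressive_cache: bool = False
    fail_closed: bool = False


def step(state, spend_usd, hourly_budget_usd):
    """Advance one rung of the ladder if spend exceeds 120% of the hourly budget."""
    if spend_usd <= 1.2 * hourly_budget_usd:
        return state  # within tolerance, nothing changes
    tier = MODEL_TIERS.index(state.model)
    if tier < len(MODEL_TIERS) - 1:
        return replace(state, model=MODEL_TIERS[tier + 1])
    if not state.aggressive_cache:
        return replace(state, aggressive_cache=True)
    # Last resort: serve cached/static responses only.
    return replace(state, fail_closed=True)
```

Recovery (stepping back up the ladder once spend returns under budget) is deliberately left out, since the text above does not describe it.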

7.4 AI Provenance & Replay

Each AI output stores {prompt_template_id, prompt_hash, context_hash, model, model_version, params, safety_decisions, citations} — sufficient to replay a response for audit (Doc 13). Replay is exposed via ai-audit admin tool and logged as AUDIT.

8. Offline Telemetry & Player Telemetry

8.1 Offline SDK Responsibilities

The offline runtime (mobile, desktop player, low-bandwidth web cache) runs a local telemetry buffer:

SQLite-backed, append-only, size-capped (default 64 MB, configurable).
Each batch is MAC-signed with the device binding key (Doc 13) so the server can detect tampering on replay.
Buffer encrypts at rest using the device-bound key.
Batches are chunked by sync_batch_id, ordered by monotonic sequence number; gaps flagged on server.
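
A minimal sketch of the signing and gap-detection responsibilities listed above. The text says only that batches are MAC-signed; HMAC-SHA256 over a canonical JSON encoding is an assumption, as are the function names and the batch layout.

```python
import hashlib
import hmac
import json


def sign_batch(batch, device_key):
    """Return an HMAC-SHA256 tag over a canonical JSON encoding of the batch."""
    payload = json.dumps(batch, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hmac.new(device_key, payload, hashlib.sha256).hexdigest()


def verify_batch(batch, mac_hex, device_key):
    """Server side: recompute the tag and compare in constant time."""
    return hmac.compare_digest(sign_batch(batch, device_key), mac_hex)


def find_gaps(sequence_numbers):
    """Return the sequence numbers missing between the lowest and highest seen."""
    seen = sorted(set(sequence_numbers))
    return [n for a, b in zip(seen, seen[1:]) for n in range(a + 1, b)]
```

For example, a server that receives sequence numbers 1, 2, 4, and 7 within a `sync_batch_id` would flag 3, 5, and 6 as missing.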

8.2 Offline Signals

Signal	Fields
`offline.bundle.activated`	`bundle_id`, `course_id`, `size_bytes`, `integrity_ok`
`offline.bundle.integrity_failure`	`bundle_id`, `expected_hash`, `actual_hash`, `reason`
`offline.sync.started` / `.completed` / `.failed`	`batch_id`, `items`, `bytes`, `duration_ms`, `conflicts`
`offline.conflict.detected`	`entity`, `strategy`, `winner`, `loser_preserved`
`offline.device.bind` / `.unbind` / `.rebind_denied`	`device_id_hash`, `reason`
`offline.tamper.suspected`	`signal`, `severity`, `evidence_hash`
`offline.clock.skew`	`skew_seconds`
`offline.duration.active_seconds`	rolling

8.3 Offline SLIs/SLOs

SLI	Target
Sync success rate	≥ 99% per device/day
Conflict rate (of synced writes)	≤ 1%
Bundle tamper detection	100% of test cases caught
Device-binding mismatch blocked	100%
Mean time from reconnection → sync complete	≤ 60s for ≤ 10MB

8.4 Player Telemetry

Event	Key Attributes
`player.session.started`	`session_id`, `course_id`, `entry_point`
`player.navigation`	`from_lesson_id`, `to_lesson_id`, `method` (next, TOC, deeplink, tutor)
`player.lesson.viewed`	`lesson_id`, `media_type`, `duration_ms`, `completion_ratio`
`player.media.buffered`	`stall_count`, `stall_total_ms`
`player.quiz.started` / `.answered` / `.submitted`	`attempt_id`, `question_id`, `time_on_question_ms`, `changed_answer_count`
`player.tutor.opened` / `.prompt` / `.response`	`turn_number`, `ai.purpose=tutor.*`, `escalated_to_human`
`player.accessibility.used`	`feature` (captions, TTS, dyslexia_mode, hi_contrast)
`player.integrity.flag`	`flag`, `confidence` (proctoring)
`player.session.ended`	`reason`, `duration_ms`, `progress_delta`

These feed both SRE dashboards and analytics-service for learning analytics (engagement, mastery).

Quiz/assessment integrity telemetry is captured at 100% sampling and mirrored to audit-service whenever integrity flags fire.

9. Authoring Telemetry

The authoring tool is where AI leverage is highest; telemetry closes the feedback loop to improve prompts and UX.

Event	Attributes	Why
`authoring.session.started`	`course_id`, `editor_version`	Engagement
`authoring.block.inserted`	`block_type`, `source` (manual, ai, template, import)	Block-mix
`authoring.ai.suggestion.shown`	`surface`, `purpose`, `suggestion_id`	AI UX
`authoring.ai.suggestion.accepted`	`suggestion_id`, `accept_mode` (full, partial, edited)	Accept rate
`authoring.ai.suggestion.rejected`	`suggestion_id`, `reason_category`	Learning signal
`authoring.ai.regenerate`	`count`, `deltas`	Friction
`authoring.autosave`	`latency_ms`, `bytes`, `conflict`	Reliability
`authoring.conflict.detected`	`entity`, `resolution_strategy`, `lost_bytes`	Collab quality
`authoring.review.cycle`	`state_from`, `state_to`, `duration_s`, `reviewer_role`	Workflow
`authoring.publish`	`course_id`, `validator_errors`, `warnings`, `a11y_score`	Release quality
`authoring.accessibility.score`	`score`, `violations[]`	Quality gate

9.1 Authoring SLIs

SLI	Target
Autosave success	≥ 99.95%
Autosave latency P95	≤ 250 ms
Merge conflict rate	≤ 0.5% of autosaves
AI suggestion TTFB P95	≤ 900 ms
AI accept rate (rolling 30d)	tracked; alert on ±25% WoW shift
Publish validator pass rate	≥ 99%

10. Marketplace Telemetry

Marketplace telemetry joins SRE, product analytics, and finance.

Event	Attributes
`market.listing.viewed`	`listing_id`, `source`
`market.checkout.started`	`cart_hash`, `price_cents`, `currency`, `coupon`
`market.checkout.payment_attempt`	`provider`, `method`, `3ds`
`market.checkout.succeeded` / `.failed`	`reason`
`market.refund.requested` / `.approved` / `.denied`	`reason_category`
`market.payout.scheduled` / `.completed`	`creator_id_hash`, `amount_cents`
`market.fraud.flag`	`signals[]`, `decision`
`market.dispute.opened` / `.resolved`	`provider_case_id_hash`
`market.license.activated` / `.exhausted`	`listing_id`, `seats`

10.1 Marketplace SLIs

SLI	Target
Checkout success rate	≥ 98% (excluding legit declines)
Payment provider failover success	≥ 99%
Refund SLA (request → decision)	P95 ≤ 72h
Chargeback rate	≤ 0.5%
Fraud false positive	≤ 2%
Payout on-time rate	≥ 99.5%

Revenue metrics (market_gmv_cents_total, market_refund_cents_total) are reconciled daily against the ledger in billing-service. Divergence > 0.1% pages finance oncall.
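
The reconciliation check reduces to a small integer comparison. The sketch below assumes the divergence is measured relative to the ledger total and works in whole cents to avoid floating-point rounding; `gmv_diverges` and its parameters are hypothetical names.

```python
def gmv_diverges(metric_cents: int, ledger_cents: int, tolerance_ppm: int = 1000) -> bool:
    """True when |metric - ledger| exceeds the tolerance (1000 ppm = 0.1%) of the ledger."""
    if ledger_cents == 0:
        return metric_cents != 0
    diff = abs(metric_cents - ledger_cents)
    # Cross-multiplied so the comparison stays in exact integer arithmetic.
    return diff * 1_000_000 > tolerance_ppm * abs(ledger_cents)
```

With a ledger total of 100,000 cents, a metric total of 100,100 cents sits exactly at 0.1% and does not page; 100,101 cents does.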

11. Data Retention, Residency & Privacy

11.1 Retention Matrix

Signal	Hot	Warm	Cold	Max
App logs (non-PII)	14d Loki	90d S3	395d S3 Glacier	395d
Audit logs	30d hot	7y WORM S3	—	7y
Metrics	30d Prom	13mo Mimir	—	13mo
Traces (sampled)	7d Tempo	90d S3	—	90d
Traces (errors, AI, payment)	30d	395d	—	395d
AI transcripts (learner)	30d	180d per tenant config	—	≤ 365d
AI transcripts (safety-flagged)	180d	2y audit	—	2y
Offline device telemetry	14d post-sync	90d	—	90d
Learner event stream (analytics)	90d	2y aggregated	—	K–12 tenants: raw ≤ 13 mo
Marketplace financial events	7y	—	—	regulatory

11.2 Residency

Tenants select a home region (af-jnb-1, eu-fra-1, us-iad-1, me-bah-1). Logs, metrics, traces, and transcripts are pinned to that region. Cross-region replication is off by default; enabled only for enterprise DR with DPA addendum.

11.3 Minor/K–12 Rules

No actor email or phone anywhere in telemetry, not even hashed with a global salt; where a hash is unavoidable, use the tenant salt.
AI transcript retention ≤ 30d unless safety-flagged.
No behavioral profiling data exported to third parties.
DSAR deletion propagates to Loki, Mimir, Tempo, ClickHouse, and AI transcripts within 30 days; deletion receipts logged to audit-service.

11.4 Access Control

Grafana folders per tenant + per function (SRE, Product, Finance, Safety).
Raw logs accessible only to SRE+Security; product engineers see sanitized views.
Break-glass access is MFA-gated, time-boxed (≤ 4h), auto-audited.

12. Dashboards

Dashboards are stored as JSON in grafana/ and provisioned via CI. Each has an owner and an SLO link.

12.1 Global Dashboards

Platform Overview — availability, latency, saturation across all services.
Error Budget Burn — per service, per SLO, 1h/6h/24h burn rates.
Tenant Health — per-tenant error rate, latency, AI spend, offline sync.
Release Radar — correlates deploys → error rate / latency deltas.
Cost Control — infra + AI + egress.

12.2 Per-Capability Dashboards

Identity & Access — login success, MFA mix, session revocations, impersonations.
Authoring — autosave, AI accept rate, conflict rate, publish validator, review cycle time.
Player — session starts, lesson completion funnel, quiz funnel, tutor usage, a11y usage.
Assessment — attempts, grading latency, regrades, integrity flags.
AI / Tutor — TTFB, cache hit, safety actions, grounding, cost.
Offline — sync success, conflict %, tamper flags, bundle integrity.
Marketplace — checkout funnel, refunds, fraud, payout SLA, GMV vs ledger.
Billing — invoice status, dunning, MRR proxy.
Safety & Moderation — blocks, overrides, appeals, SLA to decision.
Data Platform — Kafka lag, DLQ depth, consumer group health.

12.3 Per-Service Dashboards (template)

Each service automatically gets a dashboard with: RED panels, USE panels, top errors, top slow endpoints, top consumers, DB pool, Kafka lag (if applicable), dependency map.

13. Alerts, SLIs/SLOs, Error Budgets

13.1 SLO Framework

All SLOs defined in slo/*.yaml (Sloth format), reviewed via PR.
Multiwindow, multi-burn-rate alerts (Google SRE): 1h+5m (fast burn), 6h+30m (slow burn).
28-day rolling windows; error-budget policy enforced automatically.
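
The arithmetic behind a multiwindow, multi-burn-rate alert can be sketched as below. The burn rate is the observed error ratio divided by the error budget (1 minus the SLO target), and an alert fires only when both the long and the short window exceed the threshold. The budget fractions used in the example (2% of budget in 1h, 5% in 6h) follow the common Google SRE workbook convention and are an assumption here; this document does not state them.

```python
SLO_WINDOW_HOURS = 28 * 24  # 28-day rolling window


def burn_rate(error_ratio, slo_target):
    """How many times faster than 'exactly on budget' errors are being spent."""
    return error_ratio / (1.0 - slo_target)


def burn_threshold(budget_fraction, long_window_hours):
    """Burn rate that consumes budget_fraction of the budget within the long window."""
    return budget_fraction * SLO_WINDOW_HOURS / long_window_hours


def should_alert(long_err, short_err, slo_target, budget_fraction, long_window_hours):
    """Fire only when both the long and the short window exceed the threshold."""
    limit = burn_threshold(budget_fraction, long_window_hours)
    return (burn_rate(long_err, slo_target) > limit
            and burn_rate(short_err, slo_target) > limit)


# Fast burn (1h + 5m) for a 99.9% SLO, assuming a 2% budget fraction.
print(should_alert(0.02, 0.02, 0.999, 0.02, 1))
```

Requiring the short window to agree means the alert clears soon after the error rate recovers, rather than staying red until the long window ages out.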

13.2 Alert Contract

Every alert declares:

alert: PlayerTutorTTFBHigh
expr: ...
for: 5m
severity: SEV-2
owner: ai-platform
runbook: https://runbooks.ghasi/obs/ai-tutor-ttfb
auto_remediation: ai.failover.downgrade_model
dashboards: [ai/tutor, player/engagement]
slos: [player.tutor.ttfb]

Alerts without runbook + owner are rejected in CI.
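
The CI rejection rule could look roughly like the following check. This is a sketch only: `lint_alert` is a hypothetical name, the alert is assumed to be already parsed into a dictionary, and the severity check against the ladder in §13.3 is an addition beyond the runbook-and-owner rule stated above.

```python
SEVERITIES = {"SEV-1", "SEV-2", "SEV-3", "SEV-4"}


def lint_alert(alert):
    """Return a list of contract violations for one parsed alert definition."""
    errors = []
    for field in ("owner", "runbook"):
        if not str(alert.get(field) or "").strip():
            errors.append(f"missing {field}")
    if alert.get("severity") not in SEVERITIES:
        errors.append("severity must be one of SEV-1..SEV-4")
    return errors


example = {
    "alert": "PlayerTutorTTFBHigh",
    "severity": "SEV-2",
    "owner": "ai-platform",
    "runbook": "https://runbooks.ghasi/obs/ai-tutor-ttfb",
}
print(lint_alert(example))
```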

13.3 Severity Ladder

Severity	Definition	Response
SEV-1	User-visible outage, data loss risk, safety breach, payment outage	Page + bridge + Statuspage within 5 min
SEV-2	Capability degraded, SLO fast-burn	Page primary oncall
SEV-3	Slow-burn SLO, non-blocking	Ticket + Slack
SEV-4	Housekeeping	Ticket

13.4 Example Alerts (non-exhaustive)

IdentityLoginErrorRate (SEV-2): 5xx on /auth/login > 1% for 5m.
AssessmentGradeCommitFailure (SEV-1): any ERROR on grade commit path — always-on pager.
AIContentModerationBypass (SEV-1): ai.safety.action != block but ai.guardrail.violations non-empty.
AITutorTTFBSlowBurn (SEV-3): 6h burn > 1× budget on player.tutor.ttfb.
OfflineBundleTamper (SEV-1): any offline.bundle.integrity_failure.
OfflineSyncConflictSpike (SEV-2): conflict rate > 2% over 30m per tenant.
AICostBudgetBreach (SEV-2): tenant hourly AI spend > 120% budget; auto-downgrade fires first, alert on 2 consecutive windows.
MarketplaceGMVReconMismatch (SEV-1): daily GMV vs ledger diff > 0.1%.
KafkaConsumerLagHigh (SEV-2): any consumer lag > 50k for 10m.
DSARDeletionSLA (SEV-2): any open DSAR > 27 days.
AuditWriteFailure (SEV-1): any failed audit write; blocks transactions.
AuthoringAutosaveFailure (SEV-2): autosave success < 99.5% for 10m.
PlayerA11yViolation (SEV-3): a11y violations on published course > threshold.

13.5 Error Budget Policy

When a service consumes:

50% of monthly budget → notify service owner; feature freeze optional.
75% → feature freeze mandatory; only reliability PRs merge.
100% → rollback latest risky changes; post-incident review required; no launches until budget restored.

Policy enforcement is automated via a GitHub check that reads the SLO service.
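
The thresholds above map directly onto a lookup that a check of this kind could perform. The sketch below is illustrative: the function names and action labels are hypothetical, and budget consumption is assumed to be computed from bad and total event counts over the SLO window.

```python
def budget_consumed(bad_events, total_events, slo_target):
    """Fraction of the error budget used: bad events over allowed bad events."""
    if total_events == 0:
        return 0.0
    allowed = (1.0 - slo_target) * total_events
    return bad_events / allowed if allowed > 0 else float("inf")


def policy_action(consumed_fraction):
    """Map budget consumption onto the policy tiers listed above."""
    if consumed_fraction >= 1.0:
        return "rollback_and_block_launches"
    if consumed_fraction >= 0.75:
        return "mandatory_feature_freeze"
    if consumed_fraction >= 0.5:
        return "notify_owner"
    return "none"
```

For a 99.9% SLO over 100,000 requests the budget is about 100 failed requests, so 60 failures lands in the notify tier and 80 triggers the mandatory freeze.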

14. Incident Response Hooks

14.1 Auto-Declare

Any SEV-1 or two concurrent SEV-2s auto-declares an incident:

incident-bot opens a Slack channel #inc-YYYYMMDD-NN.
Creates PagerDuty incident, pages on-call per owner.
Posts runbook, current burn rate, recent deploys, related alerts.
Opens a bridge (Zoom/Meet) with auto-invite.
Updates Statuspage with tenant-scoped visibility.
Starts a timeline log (every human comment + alert transition captured).

14.2 Automated Remediation Hooks

Trigger	Action
AI provider error rate > 5%	Failover to secondary provider + model downgrade
Tutor TTFB P95 breach	Reduce streaming concurrency per tenant
Kafka DLQ depth spike	Pause producer, alert, enable dead-letter drain tool
Offline tamper spike from a tenant	Auto-revoke affected device bindings, require re-enroll
Payment provider outage	Route new checkouts to secondary
Audit write failure	Trip global "no-write" breaker on affected capability
Cost breach	Model downgrade, cache-only mode, then 503 with graceful copy

All auto-remediation actions are logged as AUDIT and reversible via a single command.

14.3 Postmortems

Blameless template generated from incident timeline + telemetry (pm-bot).
Required within 5 business days for SEV-1/2.
Action items are tracked with an SLA; overdue action items appear on the error-budget dashboard.

15. Integration with `analytics-service`

analytics-service (Doc 03) is the learning & product analytics sink — not an SRE tool, but it shares the telemetry substrate.

15.1 Dual Emission

Domain events (player.*, authoring.*, market.*) are:

Logged (for SRE forensics).
Emitted as Kafka domain events (source of truth, Doc 04).
Projected into ClickHouse by analytics-service consumers.

Metrics are not duplicated into analytics; analytics re-aggregates from events.

15.2 Contract Between Telemetry Library and Analytics

Events declared once in events/*.proto with @analytics and @telemetry annotations.
Codegen produces: (a) producer client, (b) log emitter, (c) ClickHouse DDL, (d) dbt models, (e) Grafana/Looker source docs.
Adding an event without updating the schema fails CI and blocks the PR.

15.3 PII Boundary

Analytics warehouse receives tenant-salted hashed IDs. Reverse identification requires a signed, audited join against identity-service — gated, rate-limited, logged.

16. Frontend & Mobile Client Telemetry

Web: OTel Web SDK + custom @ghasi/telemetry-web. CLS/LCP/INP collected. Session replay disabled by default (privacy); enabled per-tenant with explicit consent, never for learners under 18.
Mobile: OTel Android/iOS + Dart. Offline buffer as in §8.
Crash reporting: native platform crash collectors, symbolicated, linked to trace_id via breadcrumbs. No PII in crash payloads.
Network errors and retry storms are explicit metrics (client_retry_storm_total).

Consent flags (consent.telemetry, consent.session_replay) are read on every emission; denial drops events at the client before transport.

17. Security & Threat Telemetry

authn_bruteforce_detected_total, authn_credential_stuffing_detected_total, abuse_rate_limit_triggered_total.
WAF events forwarded with rule_id, action.
Secret scanner alerts (source code, logs) mapped to SEV-2.
Anomaly detection: per-tenant usage baselines in ClickHouse; deviations > 4σ raise SecurityAnomaly alerts.
Supply chain: SBOM diffs per deploy, signed; rollback triggers if vulnerability severity ≥ HIGH.
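
The 4σ anomaly rule above amounts to a z-score test against a per-tenant baseline. The sketch below assumes the baseline is a plain list of recent usage values and uses the population standard deviation; in practice the baselines live in ClickHouse, and `is_anomalous` is a hypothetical name.

```python
import statistics


def is_anomalous(baseline, value, sigmas=4.0):
    """True when value lies more than `sigmas` standard deviations from the baseline mean."""
    mean = statistics.fmean(baseline)
    stdev = statistics.pstdev(baseline)
    if stdev == 0:
        return value != mean  # a flat baseline makes any change stand out
    return abs(value - mean) > sigmas * stdev


# Hypothetical hourly request counts for one tenant.
history = [100, 102, 98, 101, 99]
print(is_anomalous(history, 110), is_anomalous(history, 103))
```

Here the mean is 100 and the standard deviation is about 1.41, so the 4σ band is roughly 94.3 to 105.7.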

18. Implementation Checklist (per service)

A service is not production-ready until:

19. Open Questions & Roadmap

On-device anomaly detection for offline tamper — currently server-side only; prototype lightweight ML on-device in H2.
Learner-visible AI transparency card — surfaces ai.model, citations, grounding_score to learners ≥ age 13. Needs legal review.
Per-course cost attribution — currently tenant+purpose; extend to course+creator for marketplace unit economics.
Formal privacy budgets (differential privacy) for analytics exports — exploratory.
eBPF-based USE telemetry for node-local kernel metrics — under PoC.
Synthetic journeys for K–12 low-bandwidth scenarios — planned Q3.

20. Change Management

This document is versioned; material changes require an RFC in rfcs/observability/.
Breaking changes to the log schema require log_schema_version bump and a deprecation window of 2 minor releases.
Alert and SLO changes are PR-reviewed by SRE + capability owner + (for safety/AI) Trust & Safety.

End of Document 15 — Observability & Telemetry Specification.
