Observability
:::info Source
Sourced from docs/15-observability-telemetry.md in the documentation repo.
:::
- Document: 15 of N
- Status: Normative
- Owners: Platform SRE, Data Platform, AI Platform, Security
- Aligned with: 01 Enterprise Architecture, 02 DDD Bounded Contexts, 03 Microservices, 04 Event-Driven Architecture, 05 API Design, 10 Authoring Tool, 11 LMS Player, 12 Data Models, 13 Security/Compliance/Tenancy
- Applies to: All backend services, web/mobile clients, offline runtimes, authoring tools, AI services, marketplace, and analytics pipeline.
0. Purpose & Scope
Ghasi-edTech is an AI-first, offline-first, multi-tenant, event-driven learning platform. Observability is not optional instrumentation — it is the operating substrate that lets us:
- Prove the platform is safe for learners (AI moderation, PII handling, abuse).
- Prove it is correct (assessment scoring, licensing, gradebook integrity).
- Prove it is available (SLOs, error budgets).
- Prove it is cost-sane (AI token burn, egress, storage).
- Prove it is tamper-evident (offline bundles, device binding, audit chain).
- Drive learning analytics and instructor feedback loops via the same pipeline.
This document is the single source of truth for:
- Log schema, field contracts, and redaction rules.
- Metric taxonomy (RED, USE, and domain KPIs).
- Distributed tracing via OpenTelemetry (OTel).
- Dashboards, alerts, SLIs/SLOs, and error budgets.
- AI-, offline-, player-, authoring-, and marketplace-specific telemetry.
- Retention, privacy, residency, and downstream integration with analytics-service.
- Incident response and automated remediation hooks.
It is normative for all code generation and review.
1. Principles
| # | Principle | Consequence |
|---|---|---|
| P1 | Three pillars, one correlation ID | Every log line, metric exemplar, and span carries trace_id, tenant_id, request_id, and actor_id (hashed). |
| P2 | Structured by default | No free-form log strings in production paths. JSON, schema-validated, versioned (log_schema_version). |
| P3 | PII never leaves the boundary unredacted | Redaction is a library, not discipline. Applied at emitter, verified at collector, re-verified at sink. |
| P4 | Tenant isolation extends to telemetry | Tenant-scoped dashboards, alerts, retention, and export. No cross-tenant leakage in incident artifacts. |
| P5 | Sampling is policy, not accident | Head-based for hot paths, tail-based for error/slow paths, 100% for safety-critical AI and assessment scoring. |
| P6 | Cost is a first-class signal | AI spend, egress, and storage have SLOs just like latency. |
| P7 | Offline is observable when it reconnects | Device-side telemetry buffer with tamper-evident framing; reconciled on sync. |
| P8 | Events are the ledger | Domain events (see Doc 04) are replayable; telemetry augments but never replaces them. |
| P9 | Every alert is actionable | Alerts reference a runbook slug, owner, and auto-remediation hook where applicable. |
| P10 | Privacy > curiosity | When in doubt, drop the field. Learner wellbeing outranks debuggability. |
2. Reference Stack
| Layer | Chosen Tooling (normative) | Notes |
|---|---|---|
| Instrumentation | OpenTelemetry SDK (Node, Python, JVM, Go, Swift, Kotlin, Dart) | Single vendor-neutral API. |
| Collection | OTel Collector (gateway + agent tiers) | Redaction, tenant routing, sampling decisions. |
| Logs | Loki (hot 14d) → S3 + Parquet (cold 395d) | Indexed by tenant_id, service, severity. |
| Metrics | Prometheus (hot 30d) → Mimir/Thanos (13mo) | Remote-write from Collector. |
| Traces | Tempo (hot 7d) → S3 (90d, sampled) | Exemplars link metrics → traces → logs. |
| Dashboards | Grafana (per-tenant folders, RBAC) | Stored as code (grafana/ repo). |
| Alerts | Alertmanager + PagerDuty + Slack #oncall-* | Alerts declared in Git, reviewed via PR. |
| Analytics sink | analytics-service (Kafka → ClickHouse) | Learning analytics, not SRE. |
| Audit sink | audit-service (append-only, WORM S3) | Security/compliance only. |
| SLO engine | Sloth → Prometheus rules | Generates burn-rate alerts. |
| Incident | PagerDuty + Statuspage + incident-bot | Auto-declares, pulls runbook, opens bridge. |
Services MUST NOT import vendor SDKs directly. They import @ghasi/telemetry (per language), which wraps OTel and enforces field contracts.
3. Identity, Correlation, and Context Propagation
3.1 Required Context Keys
Every telemetry signal (log, metric exemplar, span attribute, event) MUST include the keys below where the value exists. Absent values use null, never "".
| Key | Type | Source | Notes |
|---|---|---|---|
| trace_id | hex(32) | W3C traceparent | Generated at edge if absent. |
| span_id | hex(16) | OTel | |
| request_id | uuidv7 | Edge (API gateway) | Survives through queues via baggage. |
| tenant_id | ULID | JWT → baggage | Mandatory outside onboarding. |
| org_unit_id | ULID? | JWT | School/campus scope. |
| actor_id_hash | sha256(actor_id + tenant_salt) | Auth layer | Raw actor_id NEVER in telemetry. |
| actor_role | enum | Auth | learner\|instructor\|author\|admin\|service\|anonymous |
| session_id | ULID | Player/Authoring | For cross-request stitching. |
| device_id_hash | sha256 | Offline SDK | Device binding (Doc 13). |
| app | string | Build | web-player, mobile-player, authoring, admin. |
| app_version | semver | Build | |
| env | enum | Runtime | dev\|staging\|prod\|sandbox. |
| region | string | Runtime | af-jnb-1, eu-fra-1, … |
| log_schema_version | int | Library | Current: 3. |
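As a minimal sketch of the contract above, the following shows how a subset of the context keys might be assembled. The function names `hash_actor_id` and `build_context` are hypothetical and are not the @ghasi/telemetry API; the hash follows the sha256(actor_id + tenant_salt) definition in the table, and empty strings are normalized to null.

```python
import hashlib
from typing import Optional

# Hypothetical helper names for illustration; not the @ghasi/telemetry API.


def hash_actor_id(actor_id: str, tenant_salt: str) -> str:
    """sha256(actor_id + tenant_salt), hex-encoded, as defined in the table."""
    return hashlib.sha256((actor_id + tenant_salt).encode("utf-8")).hexdigest()


def build_context(
    trace_id: str,
    tenant_id: Optional[str],
    actor_id: Optional[str],
    tenant_salt: str,
    org_unit_id: Optional[str] = None,
) -> dict:
    """Assemble a subset of the required context keys.

    Empty strings are normalized to None so absent values serialize as null.
    """

    def norm(value: Optional[str]) -> Optional[str]:
        return value if value else None

    actor = norm(actor_id)
    return {
        "trace_id": trace_id,
        "tenant_id": norm(tenant_id),
        "org_unit_id": norm(org_unit_id),
        # Only the salted hash is emitted; the raw actor_id is not returned.
        "actor_id_hash": hash_actor_id(actor, tenant_salt) if actor else None,
        "log_schema_version": 3,
    }
```

Normalizing at the point of construction keeps the "null, never empty string" rule out of individual call sites.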
3.2 Baggage
OTel Baggage carries tenant_id, request_id, actor_role, offline_origin, ai_budget_id across all hops (HTTP, gRPC, Kafka). Baggage is stripped at egress to 3rd-party APIs.
3.3 Correlation Across Boundaries
- HTTP: traceparent, tracestate, x-ghasi-tenant, x-ghasi-request-id.
- Kafka: OTel headers + tenant_id header used for partitioning and DLQ routing.
- WebSocket/SSE (player tutor): trace_id established per connection, per-message span_id.
- Offline → Online sync: Device emits a sync_batch_id; server links all replayed signals to the original offline trace_id preserved in the bundle.
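For the HTTP case, a rough sketch of edge handling might look like the following. It parses a version-00 W3C traceparent header and falls back to a freshly generated trace_id when the header is absent or invalid, matching the "generated at edge if absent" rule in §3.1. The names `parse_traceparent` and `ensure_trace_id` are hypothetical.

```python
import re
import secrets
from typing import Optional, Tuple

# W3C Trace Context, version 00: version-traceid-parentid-flags (lowercase hex).
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header: Optional[str]) -> Optional[Tuple[str, str, str]]:
    """Return (trace_id, parent_span_id, flags), or None if missing or invalid."""
    if not header:
        return None
    match = _TRACEPARENT.match(header.strip())
    if not match:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero IDs are invalid in the W3C specification.
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return trace_id, span_id, flags


def ensure_trace_id(header: Optional[str]) -> str:
    """Reuse the inbound trace_id, or generate a new 32-hex-char one."""
    parsed = parse_traceparent(header)
    return parsed[0] if parsed else secrets.token_hex(16)
```

In practice the OTel SDK propagators handle this parsing; the sketch only makes the expected header shape explicit.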
4. Logging
4.1 Log Schema (v3)
{
"ts": "2026-04-15T09:12:33.214Z",
"level": "INFO|DEBUG|WARN|ERROR|FATAL|AUDIT",
"msg": "short human summary, <=120 chars, no interpolation of PII",
"event": "player.lesson.completed",
"service": "player-service",
"component": "ProgressAggregator",
"trace_id": "…", "span_id": "…", "request_id": "…",
"tenant_id": "…", "org_unit_id": "…",
"actor_id_hash": "…", "actor_role": "learner",
"session_id": "…", "device_id_hash": "…",
"app": "web-player", "app_version": "2026.4.1",
"env": "prod", "region": "af-jnb-1",
"attrs": { "course_id": "…", "lesson_id": "…", "score": 0.87 },
"error": { "type": "…", "message": "…", "stack": "…", "cause": {…} },
"log_schema_version": 3
}
Rules:
- msg is static; variable data goes in attrs.
- event uses domain.entity.action (see Doc 04 event catalog).
- attrs keys MUST be snake_case and namespaced.
- error.stack only at ERROR/FATAL and only in non-prod, or after scrubbing in prod.
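The rules above lend themselves to a mechanical check. The sketch below is one possible validator, with assumptions stated plainly: `validate_log_record` is a hypothetical name, the event pattern accepts three or more dot-separated segments (some events in this document have four), and the namespacing convention for attrs keys is not checked because it is not defined here.

```python
import re

# Assumed patterns; the authoritative rules live in the schema, not here.
_EVENT = re.compile(r"^[a-z0-9_]+(\.[a-z0-9_]+){2,}$")
_SNAKE = re.compile(r"^[a-z][a-z0-9_]*$")


def validate_log_record(record: dict) -> list:
    """Return a list of rule violations; an empty list means the record passes."""
    problems = []
    msg = record.get("msg", "")
    if not isinstance(msg, str) or len(msg) > 120:
        problems.append("msg must be a string of at most 120 chars")
    if not _EVENT.match(str(record.get("event", ""))):
        problems.append("event must look like domain.entity.action")
    for key in record.get("attrs", {}):
        if not _SNAKE.match(key):
            problems.append(f"attrs key {key!r} is not snake_case")
    error = record.get("error") or {}
    if "stack" in error and record.get("level") not in ("ERROR", "FATAL"):
        problems.append("error.stack is only allowed at ERROR/FATAL")
    if record.get("log_schema_version") != 3:
        problems.append("log_schema_version must be 3")
    return problems
```

Whether msg is truly static cannot be verified at runtime; that rule is better enforced by a lint on call sites.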
4.2 Levels & Usage
| Level | Use | Sampling |
|---|---|---|
| FATAL | Process-terminating | 100% |
| ERROR | Contract violation, unhandled | 100% |
| WARN | Recoverable anomaly, degraded mode | 100% |
| AUDIT | Security/compliance events (see 4.5) | 100%, also to audit-service |
| INFO | Domain event emission, lifecycle | 100% in prod, filtered by category |
| DEBUG | Developer detail | 0% prod, 100% sandbox |
4.3 PII & Redaction
Redaction is applied in the @ghasi/telemetry library before serialization. The collector re-runs redaction as defense in depth.
Deny-list fields (never logged, even hashed unless noted):
password, password_hash, otp, totp_secret, access_token, refresh_token, session_cookie, id_token, private_key, webhook_secret, payment_pan, cvv, national_id, passport_no, date_of_birth, home_address, phone_e164, email, learner_free_text_answer, tutor_prompt_raw, tutor_response_raw, parent_contact, health_note.
Hashed (keyed per tenant salt): actor_id, device_id, ip_address, learner_email, guardian_email.
Truncated + categorized (never raw):
- Tutor prompts/responses → prompt_category, prompt_length, prompt_lang, safety_tags[].
- Free-text answers → answer_length, lang, similarity_to_reference (0–1).
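To make the three treatments concrete, here is a small sketch of an emitter-side redaction pass. It is an illustration under assumptions, not the library's implementation: `redact` is a hypothetical name, the two sets are short subsets of the lists above, and hashed fields are emitted under a `_hash` suffix by analogy with actor_id_hash.

```python
import hashlib

# Illustrative subsets of the deny-list and hashed-field list above.
DENY = {"password", "otp", "access_token", "email", "date_of_birth", "tutor_prompt_raw"}
HASHED = {"actor_id", "device_id", "ip_address", "learner_email", "guardian_email"}


def redact(attrs: dict, tenant_salt: str) -> dict:
    """Drop deny-listed keys, hash keyed fields, and recurse into nested dicts."""
    out = {}
    for key, value in attrs.items():
        if key in DENY:
            continue  # dropped entirely, never hashed
        if key in HASHED:
            # sha256(value + tenant_salt), mirroring the actor_id_hash rule in 3.1.
            digest = hashlib.sha256((str(value) + tenant_salt).encode("utf-8"))
            out[key + "_hash"] = digest.hexdigest()
        elif isinstance(value, dict):
            out[key] = redact(value, tenant_salt)
        else:
            out[key] = value
    return out
```

Key-based redaction only catches known field names, which is why the collector re-check and the nightly scanners remain necessary.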
Automated scanners: A nightly job runs regex + ML PII detectors on Loki; any hit opens a SEV-2 ticket and quarantines the offending log stream.
4.4 Multi-Tenancy in Logs
- Loki label set: {service, env, region, severity, tenant_id}. tenant_id is a required label; unlabeled logs are dropped at the Collector with a counter incremented.
- Per-tenant retention overrides (enterprise tier) are supported via Collector routing to dedicated indexes.
- Tenant admin UI can export only their own logs (signed, time-boxed S3 URL, 24h).
4.5 Audit Logs (separate pipeline)
Audit events are never best-effort. They go through a synchronous, ack'd write to audit-service before the user-facing response completes. Failures return 503 — we do not transact without audit.
Audit event categories:
- Auth: login, mfa, session revoke, impersonation.
- Data: export, delete (DSAR), bulk access.
- Content: publish, unpublish, license grant/revoke.
- Moderation: AI block, human override, appeal outcome.
- Offline: bundle issue, revoke, device bind/unbind.
- Grading: grade change, regrade, override, release.
- Marketplace: purchase, refund, payout, dispute.
- Admin: role change, policy change, tenant config change.
Schema: canonical log schema + audit.action, audit.target, audit.before_hash, audit.after_hash, audit.signed_by, audit.chain_prev_hash (hash-chained for tamper evidence).
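The hash chaining behind audit.chain_prev_hash can be sketched as follows. This assumes each entry commits to a SHA-256 digest of the previous entry's canonical JSON; the function names and the flattened field names are hypothetical, and the real audit-service schema may differ.

```python
import hashlib
import json

# Hypothetical helpers; field names are simplified from the audit.* schema.


def _digest(entry: dict) -> str:
    """Hash a canonical JSON encoding (sorted keys, no extra whitespace)."""
    payload = json.dumps(entry, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_audit(chain: list, action: str, target: str) -> dict:
    """Append an entry whose chain_prev_hash commits to the previous entry."""
    prev_hash = _digest(chain[-1]) if chain else None
    entry = {"action": action, "target": target, "chain_prev_hash": prev_hash}
    chain.append(entry)
    return entry


def verify_chain(chain: list) -> bool:
    """True if every entry's chain_prev_hash matches the entry before it."""
    for index, entry in enumerate(chain):
        expected = _digest(chain[index - 1]) if index > 0 else None
        if entry.get("chain_prev_hash") != expected:
            return False
    return True
```

A chain like this detects edits to any entry that has a successor; protecting the newest entry needs an external anchor, for example the signature in audit.signed_by or the WORM store.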
5. Metrics
5.1 Taxonomy
Three families, each with strict naming:
- RED per request path: http_requests_total, http_request_duration_seconds, http_requests_errors_total.
- USE per resource: process_cpu_seconds_total, db_pool_connections{state}, kafka_consumer_lag.
- Domain (DKPIs): <domain>_<entity>_<action>_total (see §5.3+).
Naming rules:
- _total for counters, _seconds for latency, _ratio for 0–1, _bytes for sizes.
- Labels are bounded cardinality. tenant_id is allowed (bounded ~ low 10k); user_id is never a label.
- High-cardinality dimensions (course, lesson) go to exemplars and analytics-service, not Prometheus labels.
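A naming lint along these lines could run in CI against telemetry/metrics.yaml. The sketch below is an assumption about how such a check might look: `lint_metric` is a hypothetical name, only the counter suffix is enforced, and the forbidden-label set is inferred from the rules above.

```python
import re

_NAME = re.compile(r"^[a-z][a-z0-9_]*$")
# Inferred from the rules above: these belong in exemplars or analytics-service.
FORBIDDEN_LABELS = {"user_id", "course_id", "lesson_id"}


def lint_metric(name: str, kind: str, labels: set) -> list:
    """Return naming and label violations for one metric definition."""
    problems = []
    if not _NAME.match(name):
        problems.append("name must be snake_case")
    if kind == "counter" and not name.endswith("_total"):
        problems.append("counters must end in _total")
    bad = sorted(labels & FORBIDDEN_LABELS)
    if bad:
        problems.append("high-cardinality labels not allowed: " + ", ".join(bad))
    return problems
```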
5.2 Standard Labels
Every metric includes: service, env, region, tenant_tier (free|school|district|enterprise). Domain metrics additionally include tenant_id when cardinality permits (Collector enforces a per-series cap and down-labels above threshold).
5.3 Per-Service SLIs (all services)
| SLI | Definition | Target |
|---|---|---|
| Availability | 1 - errors/total over read paths (errors are 5xx responses; client-closed 499s are excluded) | 99.9% |
| Latency P95 | http_request_duration_seconds P95 | ≤ 300 ms (API) / ≤ 150 ms (edge) |
| Latency P99 | http_request_duration_seconds P99 | ≤ 800 ms |
| Saturation | db_pool_in_use / db_pool_max | < 0.8 sustained |
| Queue lag | kafka_consumer_lag | < 5000 / partition |
5.4 Domain Metrics — Catalogue
Full catalogue is in telemetry/metrics.yaml (code-generated). Highlights:
Identity: auth_login_total{result}, auth_mfa_challenge_total{method,result}, auth_session_revoked_total{reason}.
Curriculum: curriculum_course_published_total, curriculum_draft_autosave_seconds, curriculum_review_cycle_duration_seconds.
Authoring: see §9.
Player: see §8.
Assessment: assessment_attempt_started_total, assessment_attempt_completed_total{result}, assessment_autograde_latency_seconds, assessment_regrade_total{reason}, assessment_integrity_flags_total{flag}.
AI: see §7.
Offline: see §8.
Marketplace: see §10.
Billing: billing_invoice_total{status}, billing_dunning_stage_total{stage}.
Notifications: notification_delivered_total{channel,status}, notification_suppressed_total{reason}.
5.5 Exemplars
Every domain counter/histogram carries exemplars linking to trace IDs for 1 in N successful requests and 100% of errored requests. Grafana exemplar panels enable one-click metric → trace → log drill-down.
6. Distributed Tracing
6.1 Instrumentation Rules
- All inbound HTTP, gRPC, GraphQL, WebSocket, and Kafka consumer handlers create a root or child span.
- All outbound DB, cache, HTTP, Kafka producer, object-store, and AI provider calls create a child span.
- Spans have otel.status_code, error.type, and domain-specific attributes.
- Critical rule: Any span crossing a trust boundary (tenant → provider, online → offline, sync-in) MUST carry a trust.boundary attribute and its own error budget.
6.2 Span Attributes (domain examples)
- ghasi.tenant_id, ghasi.actor_role
- ghasi.course_id, ghasi.lesson_id, ghasi.attempt_id
- ghasi.ai.model, ghasi.ai.provider, ghasi.ai.purpose
- ghasi.ai.tokens_in, ghasi.ai.tokens_out, ghasi.ai.cost_usd
- ghasi.offline.bundle_id, ghasi.offline.sync_batch_id
- ghasi.integrity.flag (proctoring signals)
6.3 Sampling
| Path | Strategy | Rate |
|---|---|---|
| Health / liveness | Drop | 0% |
| Read APIs (cached) | Head-based | 1% |
| Write APIs | Head-based | 10% |
| Assessment scoring, grade commit, payment, DSAR | Always-on | 100% |
| AI inference (tutor, content-gen, moderation) | Always-on | 100% |
| Errors (4xx ≥ 429, all 5xx) | Tail-based override | 100% |
| Slow requests (> SLO P99) | Tail-based override | 100% |
| Offline sync replay | Always-on | 100% |
Tail sampling runs in the Collector gateway tier with a 30s decision window.
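The sampling table can be read as two decisions: a head decision made when the trace starts and a tail override made once the outcome is known. The sketch below illustrates that split. The path-class names and both function names are hypothetical, and the trace_id bucketing is a simple deterministic scheme chosen for illustration, not the OTel sampler algorithm.

```python
# Hypothetical path classes mirroring the rows of the sampling table.
ALWAYS_ON = {
    "assessment_scoring", "grade_commit", "payment", "dsar",
    "ai_inference", "offline_sync_replay",
}
HEAD_RATES = {"health": 0.0, "read_api": 0.01, "write_api": 0.10}


def head_sample(path_class: str, trace_id: str) -> bool:
    """Decide at span start whether to keep the trace."""
    if path_class in ALWAYS_ON:
        return True
    rate = HEAD_RATES[path_class]
    # Map the first 8 hex chars of trace_id to [0, 1): same trace, same decision.
    bucket = int(trace_id[:8], 16) / 0x1_0000_0000
    return bucket < rate


def tail_override(status_code: int, duration_ms: float, slo_p99_ms: float) -> bool:
    """Keep the trace regardless of head sampling if it errored or was slow."""
    is_error = status_code >= 429  # 4xx at or above 429, and all 5xx
    return is_error or duration_ms > slo_p99_ms
```

Deriving the head decision from trace_id keeps all services on a request path in agreement without coordination.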
6.4 Redaction in Spans
db.statement is captured parameterized only (placeholders, never values). URL paths are templatized (/courses/:id/lessons/:lid). Request/response bodies are never placed in span attributes.
7. AI Telemetry
AI is the highest-risk, highest-cost surface. It gets its own first-class observability.
7.1 Dimensions to Capture (per invocation)
| Dimension | Example | Purpose |
|---|---|---|
| ai.purpose | tutor.explain, authoring.outline, moderation.text, assessment.rubric_grade | SLOs, cost allocation |
| ai.model / ai.model_version | claude-sonnet-4-6@20261001 | Drift detection, A/B |
| ai.provider | anthropic, internal-llm, openai-fallback | Provider SLOs, failover |
| ai.prompt_template_id + hash | tutor.v14 | Provenance |
| ai.tokens_in / ai.tokens_out | | Cost |
| ai.cost_usd | | Cost |
| ai.latency_ms_ttfb / ttlb | | UX |
| ai.cache.hit / ai.cache.key_hash | | Cost |
| ai.safety.pre / ai.safety.post | {"self_harm":0.01,...} | Safety |
| ai.safety.action | allow\|redact\|block\|escalate | Safety |
| ai.guardrail.violations[] | pii_leak, age_inappropriate | Safety |
| ai.output.citations[] | source doc IDs | Provenance |
| ai.output.grounding_score | 0–1 | Hallucination |
| ai.human_override | bool | Loop closure |
| ai.confidence | 0–1 | Routing |
Prompts and responses themselves go to a separate, encrypted, tenant-scoped store (ai-transcripts-service) with tighter retention (§11). Telemetry carries only hashes, categories, and safety signals.
7.2 AI SLIs & SLOs
| SLI | Target |
|---|---|
| Tutor TTFB P95 | ≤ 1.2s online, ≤ 300ms offline-SLM |
| Tutor stream completion success | ≥ 99.5% |
| Moderation decision latency P99 | ≤ 400ms |
| Safety false-negative rate (sampled audit) | ≤ 0.1% |
| Safety false-positive rate | ≤ 2% |
| Authoring AI accept rate | tracked, no threshold (product KPI) |
| AI cost per MAU | ≤ budget per tier |
| Cache hit rate (prompt+ctx) | ≥ 35% tutor, ≥ 60% authoring |
| Provider failover success | ≥ 99% |
| Grounding score (RAG paths) | P50 ≥ 0.8 |
7.3 AI Cost Observability
- Budget IDs attach to every invocation; rollups by tenant_id × purpose × model in ClickHouse.
- Circuit breakers fire when a tenant/purpose breaches 120% of hourly budget: degrade model (Opus → Sonnet → Haiku), enable aggressive caching, then fail-closed to cached/static responses.
- Dashboards: ai/cost-overview, ai/per-tenant, ai/per-purpose, ai/model-mix, ai/cache-efficacy.
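One way to picture the circuit breaker is as a ladder that is climbed one rung per breached evaluation window. That stepping rule is an assumption made for this sketch (the text above gives only the order of the steps), and the mode names and function names are hypothetical.

```python
# Assumed ladder, in the order given above: model tiers, then caching, then fail-closed.
LADDER = ["opus", "sonnet", "haiku", "aggressive_cache", "cached_only"]


def breaches_budget(spend_usd: float, hourly_budget_usd: float) -> bool:
    """True when hourly spend exceeds 120% of the hourly budget."""
    return spend_usd > 1.20 * hourly_budget_usd


def next_mode(current: str, spend_usd: float, hourly_budget_usd: float) -> str:
    """Move one rung down the ladder on a breach; otherwise stay put."""
    if not breaches_budget(spend_usd, hourly_budget_usd):
        return current
    index = LADDER.index(current)
    return LADDER[min(index + 1, len(LADDER) - 1)]
```

Recovery (climbing back up once spend returns under budget) is deliberately left out; it would need its own hysteresis rule.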
7.4 AI Provenance & Replay
Each AI output stores {prompt_template_id, prompt_hash, context_hash, model, model_version, params, safety_decisions, citations} — sufficient to replay a response for audit (Doc 13). Replay is exposed via ai-audit admin tool and logged as AUDIT.
8. Offline Telemetry & Player Telemetry
8.1 Offline SDK Responsibilities
The offline runtime (mobile, desktop player, low-bandwidth web cache) runs a local telemetry buffer:
- SQLite-backed, append-only, size-capped (default 64 MB, configurable).
- Each batch is MAC-signed with the device binding key (Doc 13) so the server can detect tampering on replay.
- Buffer encrypts at rest using the device-bound key.
- Batches are chunked by sync_batch_id, ordered by monotonic sequence number; gaps flagged on server.
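The signing and gap-detection responsibilities above might be sketched like this. HMAC-SHA256 is assumed as the MAC, the batch layout is invented for illustration, and all function names are hypothetical; the actual key handling is defined by the device binding in Doc 13.

```python
import hashlib
import hmac
import json

# Hypothetical batch layout; the MAC algorithm (HMAC-SHA256) is an assumption.


def sign_batch(device_key: bytes, sync_batch_id: str, seq: int, events: list) -> dict:
    """Serialize a batch canonically and attach a MAC over the serialized body."""
    body = json.dumps(
        {"sync_batch_id": sync_batch_id, "seq": seq, "events": events},
        sort_keys=True,
        separators=(",", ":"),
    )
    mac = hmac.new(device_key, body.encode("utf-8"), hashlib.sha256).hexdigest()
    return {"body": body, "mac": mac}


def verify_batch(device_key: bytes, batch: dict) -> bool:
    """Server side: recompute the MAC and compare in constant time."""
    expected = hmac.new(device_key, batch["body"].encode("utf-8"), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, batch["mac"])


def find_gaps(seqs: list) -> list:
    """Return missing sequence numbers between the smallest and largest seen."""
    seen = set(seqs)
    if not seen:
        return []
    return [n for n in range(min(seen), max(seen) + 1) if n not in seen]
```

Including the sequence number inside the MAC'd body means a reordered or replayed batch cannot be relabeled without failing verification.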
8.2 Offline Signals
| Signal | Fields |
|---|---|
| offline.bundle.activated | bundle_id, course_id, size_bytes, integrity_ok |
| offline.bundle.integrity_failure | bundle_id, expected_hash, actual_hash, reason |
| offline.sync.started / .completed / .failed | batch_id, items, bytes, duration_ms, conflicts |
| offline.conflict.detected | entity, strategy, winner, loser_preserved |
| offline.device.bind / .unbind / .rebind_denied | device_id_hash, reason |
| offline.tamper.suspected | signal, severity, evidence_hash |
| offline.clock.skew | skew_seconds |
| offline.duration.active_seconds | rolling |
8.3 Offline SLIs/SLOs
| SLI | Target |
|---|---|
| Sync success rate | ≥ 99% per device/day |
| Conflict rate (of synced writes) | ≤ 1% |
| Bundle tamper detection | 100% of test cases caught |
| Device-binding mismatch blocked | 100% |
| Mean time from reconnection → sync complete | ≤ 60s for ≤ 10MB |
8.4 Player Telemetry
| Event | Key Attributes |
|---|---|
| player.session.started | session_id, course_id, entry_point |
| player.navigation | from_lesson_id, to_lesson_id, method (next, TOC, deeplink, tutor) |
| player.lesson.viewed | lesson_id, media_type, duration_ms, completion_ratio |
| player.media.buffered | stall_count, stall_total_ms |
| player.quiz.started / .answered / .submitted | attempt_id, question_id, time_on_question_ms, changed_answer_count |
| player.tutor.opened / .prompt / .response | turn_number, ai.purpose=tutor.*, escalated_to_human |
| player.accessibility.used | feature (captions, TTS, dyslexia_mode, hi_contrast) |
| player.integrity.flag | flag, confidence (proctoring) |
| player.session.ended | reason, duration_ms, progress_delta |
These feed both SRE dashboards and analytics-service for learning analytics (engagement, mastery).
Quiz/assessment integrity telemetry is captured at 100% sampling and mirrored to audit-service whenever integrity flags fire.
9. Authoring Telemetry
The authoring tool is where AI leverage is highest; telemetry closes the feedback loop to improve prompts and UX.
| Event | Attributes | Why |
|---|---|---|
| authoring.session.started | course_id, editor_version | Engagement |
| authoring.block.inserted | block_type, source (manual, ai, template, import) | Block-mix |
| authoring.ai.suggestion.shown | surface, purpose, suggestion_id | AI UX |
| authoring.ai.suggestion.accepted | suggestion_id, accept_mode (full, partial, edited) | Accept rate |
| authoring.ai.suggestion.rejected | suggestion_id, reason_category | Learning signal |
| authoring.ai.regenerate | count, deltas | Friction |
| authoring.autosave | latency_ms, bytes, conflict | Reliability |
| authoring.conflict.detected | entity, resolution_strategy, lost_bytes | Collab quality |
| authoring.review.cycle | state_from, state_to, duration_s, reviewer_role | Workflow |
| authoring.publish | course_id, validator_errors, warnings, a11y_score | Release quality |
| authoring.accessibility.score | score, violations[] | Quality gate |
9.1 Authoring SLIs
| SLI | Target |
|---|---|
| Autosave success | ≥ 99.95% |
| Autosave latency P95 | ≤ 250 ms |
| Merge conflict rate | ≤ 0.5% of autosaves |
| AI suggestion TTFB P95 | ≤ 900 ms |
| AI accept rate (rolling 30d) | tracked; alert on ±25% WoW shift |
| Publish validator pass rate | ≥ 99% |
10. Marketplace Telemetry
Marketplace telemetry joins SRE, product analytics, and finance.
| Event | Attributes |
|---|---|
| market.listing.viewed | listing_id, source |
| market.checkout.started | cart_hash, price_cents, currency, coupon |
| market.checkout.payment_attempt | provider, method, 3ds |
| market.checkout.succeeded / .failed | reason |
| market.refund.requested / .approved / .denied | reason_category |
| market.payout.scheduled / .completed | creator_id_hash, amount_cents |
| market.fraud.flag | signals[], decision |
| market.dispute.opened / .resolved | provider_case_id_hash |
| market.license.activated / .exhausted | listing_id, seats |
10.1 Marketplace SLIs
| SLI | Target |
|---|---|
| Checkout success rate | ≥ 98% (excluding legit declines) |
| Payment provider failover success | ≥ 99% |
| Refund SLA (request → decision) | P95 ≤ 72h |
| Chargeback rate | ≤ 0.5% |
| Fraud false positive | ≤ 2% |
| Payout on-time rate | ≥ 99.5% |
Revenue metrics (market_gmv_cents_total, market_refund_cents_total) are reconciled daily against the ledger in billing-service. Divergence > 0.1% pages finance oncall.
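The reconciliation check reduces to a relative difference against a threshold. The sketch below assumes the ledger total is the denominator, which the text does not specify; both function names are hypothetical.

```python
def gmv_divergence(metric_cents: int, ledger_cents: int) -> float:
    """Relative difference between the metric total and the ledger total.

    Assumption: divergence is measured relative to the ledger.
    """
    if ledger_cents == 0:
        return 0.0 if metric_cents == 0 else float("inf")
    return abs(metric_cents - ledger_cents) / ledger_cents


def should_page_finance(metric_cents: int, ledger_cents: int, threshold: float = 0.001) -> bool:
    """True when divergence exceeds 0.1%."""
    return gmv_divergence(metric_cents, ledger_cents) > threshold
```

Working in integer cents, as the metric names suggest, keeps the inputs exact; only the final ratio is a float.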
11. Data Retention, Residency & Privacy
11.1 Retention Matrix
| Signal | Hot | Warm | Cold | Max |
|---|---|---|---|---|
| App logs (non-PII) | 14d Loki | 90d S3 | 395d S3 Glacier | 395d |
| Audit logs | 30d hot | 7y WORM S3 | — | 7y |
| Metrics | 30d Prom | 13mo Mimir | — | 13mo |
| Traces (sampled) | 7d Tempo | 90d S3 | — | 90d |
| Traces (errors, AI, payment) | 30d | 395d | — | 395d |
| AI transcripts (learner) | 30d | 180d per tenant config | — | ≤ 365d |
| AI transcripts (safety-flagged) | 180d | 2y audit | — | 2y |
| Offline device telemetry | 14d post-sync | 90d | — | 90d |
| Learner event stream (analytics) | 90d | 2y aggregated | — | K–12 tenants: raw ≤ 13 mo |
| Marketplace financial events | 7y | — | — | regulatory |
11.2 Residency
Tenants select a home region (af-jnb-1, eu-fra-1, us-iad-1, me-bah-1). Logs, metrics, traces, and transcripts are pinned to that region. Cross-region replication is off by default; enabled only for enterprise DR with DPA addendum.
11.3 Minor/K–12 Rules
- No raw actor email/phone anywhere in telemetry, and no hashes of them derived with a global salt; where a hashed form is required, use the tenant salt.
- AI transcript retention ≤ 30d unless safety-flagged.
- No behavioral profiling data exported to third parties.
- DSAR deletion propagates to Loki, Mimir, Tempo, ClickHouse, and AI transcripts within 30 days; deletion receipts logged to audit-service.
11.4 Access Control
- Grafana folders per tenant + per function (SRE, Product, Finance, Safety).
- Raw logs accessible only to SRE+Security; product engineers see sanitized views.
- Break-glass access is MFA-gated, time-boxed (≤ 4h), auto-audited.
12. Dashboards
Dashboards are stored as JSON in grafana/ and provisioned via CI. Each has an owner and an SLO link.
12.1 Global Dashboards
- Platform Overview — availability, latency, saturation across all services.
- Error Budget Burn — per service, per SLO, 1h/6h/24h burn rates.
- Tenant Health — per-tenant error rate, latency, AI spend, offline sync.
- Release Radar — correlates deploys → error rate / latency deltas.
- Cost Control — infra + AI + egress.
12.2 Per-Capability Dashboards
- Identity & Access — login success, MFA mix, session revocations, impersonations.
- Authoring — autosave, AI accept rate, conflict rate, publish validator, review cycle time.
- Player — session starts, lesson completion funnel, quiz funnel, tutor usage, a11y usage.
- Assessment — attempts, grading latency, regrades, integrity flags.
- AI / Tutor — TTFB, cache hit, safety actions, grounding, cost.
- Offline — sync success, conflict %, tamper flags, bundle integrity.
- Marketplace — checkout funnel, refunds, fraud, payout SLA, GMV vs ledger.
- Billing — invoice status, dunning, MRR proxy.
- Safety & Moderation — blocks, overrides, appeals, SLA to decision.
- Data Platform — Kafka lag, DLQ depth, consumer group health.
12.3 Per-Service Dashboards (template)
Each service auto-gets a dashboard with: RED panels, USE panels, top errors, top slow endpoints, top consumers, DB pool, Kafka lag (if applicable), dependency map.
13. Alerts, SLIs/SLOs, Error Budgets
13.1 SLO Framework
- All SLOs defined in slo/*.yaml (Sloth format), reviewed via PR.
- Multiwindow, multi-burn-rate alerts (Google SRE): 1h+5m (fast burn), 6h+30m (slow burn).
- 28-day rolling windows; error-budget policy enforced automatically.
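For readers less familiar with burn rates, the arithmetic can be sketched as follows. The burn rate is the observed error ratio divided by the error budget (1 minus the SLO target), and a multiwindow alert fires only when both the long and the short window exceed the threshold. The budget fractions used in the usage note (2% and 5%) are the conventional values from the Google SRE Workbook and are an assumption here; Sloth generates the actual rules.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / (1.0 - slo_target)


def burn_threshold(budget_fraction: float, window_hours: float, period_days: int = 28) -> float:
    """Burn rate that consumes budget_fraction of the period's budget within window_hours."""
    return budget_fraction * period_days * 24 / window_hours


def should_alert(long_ratio: float, short_ratio: float, slo_target: float, threshold: float) -> bool:
    """Multiwindow rule: both the long and the short window must exceed the threshold."""
    return (
        burn_rate(long_ratio, slo_target) > threshold
        and burn_rate(short_ratio, slo_target) > threshold
    )
```

Under those assumed fractions and a 28-day window, `burn_threshold(0.02, 1)` gives about 13.44 for the 1h+5m pair and `burn_threshold(0.05, 6)` gives about 5.6 for the 6h+30m pair. Requiring the short window as well lets the alert clear quickly once the problem stops.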
13.2 Alert Contract
Every alert declares:
alert: PlayerTutorTTFBHigh
expr: ...
for: 5m
severity: SEV-2
owner: ai-platform
runbook: https://runbooks.ghasi/obs/ai-tutor-ttfb
auto_remediation: ai.failover.downgrade_model
dashboards: [ai/tutor, player/engagement]
slos: [player.tutor.ttfb]
Alerts without runbook + owner are rejected in CI.
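The CI rejection rule could be implemented with a check along these lines. This is a sketch under assumptions: `check_alert` is a hypothetical name, the alert is assumed to be already parsed into a dict, and the severity check against the §13.3 ladder is an addition beyond the stated runbook + owner rule.

```python
# Required by the contract above; other fields are not enforced in this sketch.
REQUIRED = ("owner", "runbook")
# Assumed extra check: severity must be one of the ladder values in 13.3.
SEVERITIES = {"SEV-1", "SEV-2", "SEV-3", "SEV-4"}


def check_alert(alert: dict) -> list:
    """Return contract violations for one parsed alert definition."""
    problems = [f"missing field: {field}" for field in REQUIRED if not alert.get(field)]
    severity = alert.get("severity")
    if severity and severity not in SEVERITIES:
        problems.append(f"unknown severity: {severity}")
    return problems
```

A CI job would run this over every alert file and fail the build if any list is non-empty.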
13.3 Severity Ladder
| Severity | Definition | Response |
|---|---|---|
| SEV-1 | User-visible outage, data loss risk, safety breach, payment outage | Page + bridge + Statuspage within 5 min |
| SEV-2 | Capability degraded, SLO fast-burn | Page primary oncall |
| SEV-3 | Slow-burn SLO, non-blocking | Ticket + Slack |
| SEV-4 | Housekeeping | Ticket |
13.4 Example Alerts (non-exhaustive)
- IdentityLoginErrorRate (SEV-2): 5xx on /auth/login > 1% for 5m.
- AssessmentGradeCommitFailure (SEV-1): any ERROR on grade commit path; always-on pager.
- AIContentModerationBypass (SEV-1): ai.safety.post != block but guardrail.violations non-empty.
- AITutorTTFBSlowBurn (SEV-3): 6h burn > 1× budget on player.tutor.ttfb.
- OfflineBundleTamper (SEV-1): any offline.bundle.integrity_failure.
- OfflineSyncConflictSpike (SEV-2): conflict rate > 2% over 30m per tenant.
- AICostBudgetBreach (SEV-2): tenant hourly AI spend > 120% budget; auto-downgrade fires first, alert on 2 consecutive windows.
- MarketplaceGMVReconMismatch (SEV-1): daily GMV vs ledger diff > 0.1%.
- KafkaConsumerLagHigh (SEV-2): any consumer lag > 50k for 10m.
- DSARDeletionSLA (SEV-2): any open DSAR > 27 days.
- AuditWriteFailure (SEV-1): any failed audit write; blocks transactions.
- AuthoringAutosaveFailure (SEV-2): autosave success < 99.5% for 10m.
- PlayerA11yViolation (SEV-3): a11y violations on published course > threshold.
13.5 Error Budget Policy
When a service consumes:
- 50% of monthly budget → notify service owner; feature freeze optional.
- 75% → feature freeze mandatory; only reliability PRs merge.
- 100% → rollback latest risky changes; post-incident review required; no launches until budget restored.
Policy enforcement is automated via a GitHub check that reads the SLO service.
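The thresholds above map naturally onto a small decision function that such a check could call. The sketch below is illustrative only: the action names, the `reliability` PR label, and both function names are hypothetical, and the 100% tier's "no launches" rule is not modeled in the merge gate.

```python
def budget_action(consumed_fraction: float) -> str:
    """Map error-budget consumption (0.0 to 1.0+) to the policy tier above."""
    if consumed_fraction >= 1.0:
        return "rollback_and_block_launches"
    if consumed_fraction >= 0.75:
        return "mandatory_feature_freeze"
    if consumed_fraction >= 0.50:
        return "notify_owner"
    return "none"


def merge_allowed(consumed_fraction: float, pr_labels: set) -> bool:
    """From 75% consumption, only PRs carrying the assumed 'reliability' label merge."""
    if consumed_fraction >= 0.75:
        return "reliability" in pr_labels
    return True
```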
14. Incident Response Hooks
14.1 Auto-Declare
Any SEV-1 or two concurrent SEV-2s auto-declares an incident:
- incident-bot opens a Slack channel #inc-YYYYMMDD-NN.
- Creates PagerDuty incident, pages on-call per owner.
- Posts runbook, current burn rate, recent deploys, related alerts.
- Opens a bridge (Zoom/Meet) with auto-invite.
- Updates Statuspage with tenant-scoped visibility.
- Starts a timeline log (every human comment + alert transition captured).
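The trigger condition and channel naming in this section are simple enough to sketch directly. Both function names are hypothetical, and the two-digit zero padding of the sequence number is an assumption read from the NN placeholder.

```python
from datetime import date


def should_auto_declare(active_severities: list) -> bool:
    """Any SEV-1, or at least two concurrent SEV-2s, declares an incident."""
    return (
        active_severities.count("SEV-1") >= 1
        or active_severities.count("SEV-2") >= 2
    )


def incident_channel(day: date, sequence: int) -> str:
    """Build the #inc-YYYYMMDD-NN channel name (NN assumed zero-padded)."""
    return f"#inc-{day:%Y%m%d}-{sequence:02d}"
```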
14.2 Automated Remediation Hooks
| Trigger | Action |
|---|---|
| AI provider error rate > 5% | Failover to secondary provider + model downgrade |
| Tutor TTFB P95 breach | Reduce streaming concurrency per tenant |
| Kafka DLQ depth spike | Pause producer, alert, enable dead-letter drain tool |
| Offline tamper spike from a tenant | Auto-revoke affected device bindings, require re-enroll |
| Payment provider outage | Route new checkouts to secondary |
| Audit write failure | Trip global "no-write" breaker on affected capability |
| Cost breach | Model downgrade, cache-only mode, then 503 with graceful copy |
All auto-remediation actions are logged as AUDIT and reversible via single command.
14.3 Postmortems
- Blameless template generated from incident timeline + telemetry (pm-bot).
- Required within 5 business days for SEV-1/2.
- Action items tracked with SLA; overdue action items appear on the error-budget dashboard.
15. Integration with analytics-service
analytics-service (Doc 03) is the learning & product analytics sink — not an SRE tool, but it shares the telemetry substrate.
15.1 Dual Emission
Domain events (player.*, authoring.*, market.*) are:
- Logged (for SRE forensics).
- Emitted as Kafka domain events (source of truth, Doc 04).
- Projected into ClickHouse by analytics-service consumers.
Metrics are not duplicated into analytics; analytics re-aggregates from events.
15.2 Contract Between Telemetry Library and Analytics
- Events declared once in events/*.proto with @analytics and @telemetry annotations.
- Codegen produces: (a) producer client, (b) log emitter, (c) ClickHouse DDL, (d) dbt models, (e) Grafana/Looker source docs.
- Adding an event without updating the schema fails CI and blocks the PR.
15.3 PII Boundary
Analytics warehouse receives tenant-salted hashed IDs. Reverse identification requires a signed, audited join against identity-service — gated, rate-limited, logged.
16. Frontend & Mobile Client Telemetry
- Web: OTel Web SDK + custom @ghasi/telemetry-web. CLS/LCP/INP collected. Session replay disabled by default (privacy); enabled per-tenant with explicit consent, never for learners under 18.
- Mobile: OTel Android/iOS + Dart. Offline buffer as in §8.
- Crash reporting: native platform crash collectors, symbolicated, linked to trace_id via breadcrumbs. No PII in crash payloads.
- Network errors and retry storms are explicit metrics (client_retry_storm_total).
Consent flags (consent.telemetry, consent.session_replay) are read on every emission; denial drops events at the client before transport.
17. Security & Threat Telemetry
- authn_bruteforce_detected_total, authn_credential_stuffing_detected_total, abuse_rate_limit_triggered_total.
- WAF events forwarded with rule_id, action.
- Secret scanner alerts (source code, logs) mapped to SEV-2.
- Anomaly detection: per-tenant usage baselines in ClickHouse; deviations > 4σ raise SecurityAnomaly alerts.
- Supply chain: SBOM diffs per deploy, signed; rollback triggers if vulnerability severity ≥ HIGH.
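The 4σ rule for usage baselines can be illustrated with a plain z-score test. This is a simplified sketch: `is_anomalous` is a hypothetical name, the baseline is assumed to be a short list of recent per-tenant values, and the production baselines in ClickHouse may well be seasonal or windowed.

```python
import statistics


def is_anomalous(baseline: list, observed: float, sigmas: float = 4.0) -> bool:
    """True if observed deviates from the baseline mean by more than sigmas * stdev."""
    if len(baseline) < 2:
        return False  # not enough history to estimate spread
    mean = statistics.fmean(baseline)
    stdev = statistics.stdev(baseline)  # sample standard deviation
    if stdev == 0:
        return observed != mean
    return abs(observed - mean) > sigmas * stdev
```

For a baseline of `[100, 102, 98, 101, 99]` the sample standard deviation is about 1.58, so an observation of 107 is flagged while 105 is not.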
18. Implementation Checklist (per service)
A service is not production-ready until:
- Uses @ghasi/telemetry wrapper (no raw OTel).
- Emits correlation context per §3.
- Passes PII redaction CI test (fixtures of bad logs).
- Declares RED SLIs; defines at least 1 domain SLI.
- Has a per-service dashboard provisioned.
- Has at least one SEV-2 alert with runbook.
- Declares Kafka consumer lag alert (if a consumer).
- Audit events (if in scope) use synchronous audit write.
- Offline surfaces (if any) use buffered, MAC-signed telemetry.
- AI surfaces (if any) emit §7.1 dimensions and pass safety-replay test.
- Domain events registered in events/ with both telemetry + analytics projections.
- Documented retention & residency overrides if non-default.
19. Open Questions & Roadmap
- On-device anomaly detection for offline tamper: currently server-side only; prototype lightweight ML on-device in H2.
- Learner-visible AI transparency card: surfaces ai.model, citations, grounding_score to learners ≥ age 13. Needs legal review.
- Per-course cost attribution: currently tenant+purpose; extend to course+creator for marketplace unit economics.
- Formal privacy budgets (differential privacy) for analytics exports: exploratory.
- eBPF-based USE telemetry for node-local kernel metrics: under PoC.
- Synthetic journeys for K–12 low-bandwidth scenarios: planned Q3.
20. Change Management
- This document is versioned; material changes require an RFC in rfcs/observability/.
- Breaking changes to the log schema require log_schema_version bump and a deprecation window of 2 minor releases.
- Alert and SLO changes are PR-reviewed by SRE + capability owner + (for safety/AI) Trust & Safety.
End of Document 15 — Observability & Telemetry Specification.