
Observability

:::info Source
Sourced from docs/15-observability-telemetry.md in the documentation repo.
:::

Document: 15 of N
Status: Normative
Owners: Platform SRE, Data Platform, AI Platform, Security
Aligned with: 01 Enterprise Architecture, 02 DDD Bounded Contexts, 03 Microservices, 04 Event-Driven Architecture, 05 API Design, 10 Authoring Tool, 11 LMS Player, 12 Data Models, 13 Security/Compliance/Tenancy
Applies to: All backend services, web/mobile clients, offline runtimes, authoring tools, AI services, marketplace, and analytics pipeline.


0. Purpose & Scope

Ghasi-edTech is an AI-first, offline-first, multi-tenant, event-driven learning platform. Observability is not optional instrumentation — it is the operating substrate that lets us:

  1. Prove the platform is safe for learners (AI moderation, PII handling, abuse).
  2. Prove it is correct (assessment scoring, licensing, gradebook integrity).
  3. Prove it is available (SLOs, error budgets).
  4. Prove it is cost-sane (AI token burn, egress, storage).
  5. Prove it is tamper-evident (offline bundles, device binding, audit chain).
  6. Drive learning analytics and instructor feedback loops via the same pipeline.

This document is the single source of truth for:

  • Log schema, field contracts, and redaction rules.
  • Metric taxonomy (RED, USE, and domain KPIs).
  • Distributed tracing via OpenTelemetry (OTel).
  • Dashboards, alerts, SLIs/SLOs, and error budgets.
  • AI-, offline-, player-, authoring-, and marketplace-specific telemetry.
  • Retention, privacy, residency, and downstream integration with analytics-service.
  • Incident response and automated remediation hooks.

It is normative for all code generation and review.


1. Principles

| # | Principle | Consequence |
| --- | --- | --- |
| P1 | Three pillars, one correlation ID | Every log line, metric exemplar, and span carries trace_id, tenant_id, request_id, and actor_id (hashed). |
| P2 | Structured by default | No free-form log strings in production paths. JSON, schema-validated, versioned (log_schema_version). |
| P3 | PII never leaves the boundary unredacted | Redaction is a library, not discipline. Applied at emitter, verified at collector, re-verified at sink. |
| P4 | Tenant isolation extends to telemetry | Tenant-scoped dashboards, alerts, retention, and export. No cross-tenant leakage in incident artifacts. |
| P5 | Sampling is policy, not accident | Head-based for hot paths, tail-based for error/slow paths, 100% for safety-critical AI and assessment scoring. |
| P6 | Cost is a first-class signal | AI spend, egress, and storage have SLOs just like latency. |
| P7 | Offline is observable when it reconnects | Device-side telemetry buffer with tamper-evident framing; reconciled on sync. |
| P8 | Events are the ledger | Domain events (see Doc 04) are replayable; telemetry augments but never replaces them. |
| P9 | Every alert is actionable | Alerts reference a runbook slug, owner, and auto-remediation hook where applicable. |
| P10 | Privacy > curiosity | When in doubt, drop the field. Learner wellbeing outranks debuggability. |

2. Reference Stack

| Layer | Chosen Tooling (normative) | Notes |
| --- | --- | --- |
| Instrumentation | OpenTelemetry SDK (Node, Python, JVM, Go, Swift, Kotlin, Dart) | Single vendor-neutral API. |
| Collection | OTel Collector (gateway + agent tiers) | Redaction, tenant routing, sampling decisions. |
| Logs | Loki (hot 14d) → S3 + Parquet (cold 395d) | Indexed by tenant_id, service, severity. |
| Metrics | Prometheus (hot 30d) → Mimir/Thanos (13mo) | Remote-write from Collector. |
| Traces | Tempo (hot 7d) → S3 (90d, sampled) | Exemplars link metrics → traces → logs. |
| Dashboards | Grafana (per-tenant folders, RBAC) | Stored as code (grafana/ repo). |
| Alerts | Alertmanager + PagerDuty + Slack #oncall-* | Alerts declared in Git, reviewed via PR. |
| Analytics sink | analytics-service (Kafka → ClickHouse) | Learning analytics, not SRE. |
| Audit sink | audit-service (append-only, WORM S3) | Security/compliance only. |
| SLO engine | Sloth → Prometheus rules | Generates burn-rate alerts. |
| Incident | PagerDuty + Statuspage + incident-bot | Auto-declares, pulls runbook, opens bridge. |

Services MUST NOT import vendor SDKs directly. They import @ghasi/telemetry (per language), which wraps OTel and enforces field contracts.


3. Identity, Correlation, and Context Propagation

3.1 Required Context Keys

Every telemetry signal (log, metric exemplar, span attribute, event) MUST include the keys below where the value exists. Absent values use null, never "".

| Key | Type | Source | Notes |
| --- | --- | --- | --- |
| trace_id | hex(32) | W3C traceparent | Generated at edge if absent. |
| span_id | hex(16) | OTel | |
| request_id | uuidv7 | Edge (API gateway) | Survives through queues via baggage. |
| tenant_id | ULID | JWT → baggage | Mandatory outside onboarding. |
| org_unit_id | ULID? | JWT | School/campus scope. |
| actor_id_hash | sha256(actor_id + tenant_salt) | Auth layer | Raw actor_id NEVER in telemetry. |
| actor_role | enum | Auth | learner, instructor, author, admin, service, anonymous |
| session_id | ULID | Player/Authoring | For cross-request stitching. |
| device_id_hash | sha256 | Offline SDK | Device binding (Doc 13). |
| app | string | Build | web-player, mobile-player, authoring, admin. |
| app_version | semver | Build | |
| env | enum | Runtime | dev, staging, prod, sandbox. |
| region | string | Runtime | af-jnb-1, eu-fra-1, … |
| log_schema_version | int | Library | Current: 3. |
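
As a minimal sketch of the actor_id_hash row above, the snippet below computes sha256(actor_id + tenant_salt) and builds a small subset of the required keys. The helper names hash_identifier and build_context are hypothetical, as is the way the salt is passed in; the real @ghasi/telemetry API is not shown in this document.

```python
import hashlib
from typing import Optional


def hash_identifier(raw_id: str, tenant_salt: str) -> str:
    """sha256(raw_id + tenant_salt) as 64 lowercase hex characters."""
    return hashlib.sha256((raw_id + tenant_salt).encode("utf-8")).hexdigest()


def build_context(tenant_id: str, actor_id: Optional[str], actor_role: str,
                  tenant_salt: str) -> dict:
    """Assemble a subset of the required context keys (hypothetical helper)."""
    return {
        "tenant_id": tenant_id,
        # Absent values are null (None), never an empty string.
        "actor_id_hash": hash_identifier(actor_id, tenant_salt) if actor_id else None,
        "actor_role": actor_role,
        "log_schema_version": 3,
    }
```

Because the salt is per tenant, the same actor hashes to different values in different tenants, which keeps hashed identifiers from being joined across tenant boundaries.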

3.2 Baggage

OTel Baggage carries tenant_id, request_id, actor_role, offline_origin, ai_budget_id across all hops (HTTP, gRPC, Kafka). Baggage is stripped at egress to 3rd-party APIs.

3.3 Correlation Across Boundaries

  • HTTP: traceparent, tracestate, x-ghasi-tenant, x-ghasi-request-id.
  • Kafka: OTel headers + tenant_id header used for partitioning and DLQ routing.
  • WebSocket/SSE (player tutor): trace_id established per connection, per-message span_id.
  • Offline → Online sync: Device emits a sync_batch_id; server links all replayed signals to the original offline trace_id preserved in the bundle.
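
The HTTP case can be sketched as follows: parse an incoming W3C traceparent header and generate a trace_id at the edge if it is absent or invalid. This is an illustration only. It handles version 00 headers, assumes header names are already lowercased, and the function names are hypothetical.

```python
import re
import secrets
from typing import Optional, Tuple

# version-traceid-parentid-flags, all lowercase hex (W3C Trace Context, version 00).
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header: Optional[str]) -> Optional[Tuple[str, str]]:
    """Return (trace_id, parent_span_id), or None if the header is missing/invalid."""
    if not header:
        return None
    match = _TRACEPARENT.match(header.strip())
    if not match:
        return None
    version, trace_id, parent_id, _flags = match.groups()
    # Version ff and all-zero IDs are invalid per the W3C specification.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return trace_id, parent_id


def ensure_trace_context(headers: dict) -> dict:
    """Continue the caller's trace if present, otherwise start one at the edge."""
    parsed = parse_traceparent(headers.get("traceparent"))
    return {
        "trace_id": parsed[0] if parsed else secrets.token_hex(16),  # 32 hex chars
        "parent_span_id": parsed[1] if parsed else None,
        "span_id": secrets.token_hex(8),  # 16 hex chars
    }
```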

4. Logging

4.1 Log Schema (v3)

{
  "ts": "2026-04-15T09:12:33.214Z",
  "level": "INFO|DEBUG|WARN|ERROR|FATAL|AUDIT",
  "msg": "short human summary, <=120 chars, no interpolation of PII",
  "event": "player.lesson.completed",
  "service": "player-service",
  "component": "ProgressAggregator",
  "trace_id": "…", "span_id": "…", "request_id": "…",
  "tenant_id": "…", "org_unit_id": "…",
  "actor_id_hash": "…", "actor_role": "learner",
  "session_id": "…", "device_id_hash": "…",
  "app": "web-player", "app_version": "2026.4.1",
  "env": "prod", "region": "af-jnb-1",
  "attrs": { "course_id": "…", "lesson_id": "…", "score": 0.87 },
  "error": { "type": "…", "message": "…", "stack": "…", "cause": {} },
  "log_schema_version": 3
}

Rules:

  • msg is static; variable data goes in attrs.
  • event uses domain.entity.action (see Doc 04 event catalog).
  • attrs keys MUST be snake_case and namespaced.
  • error.stack only at ERROR/FATAL and only in non-prod, or after scrubbing in prod.
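
A rough sketch of how some of these rules could be checked in a unit test or CI fixture is shown below. The function validate_log_record is hypothetical and covers only part of the schema: level, msg length, event shape, snake_case attrs keys, the ERROR/FATAL restriction on error.stack, and the schema version. The event pattern accepts two or more dot-separated segments, which is an assumption made here.

```python
import re

LEVELS = frozenset({"INFO", "DEBUG", "WARN", "ERROR", "FATAL", "AUDIT"})
# Lowercase dot-separated segments, e.g. player.lesson.completed (assumed pattern).
EVENT_RE = re.compile(r"^[a-z][a-z0-9_]*(\.[a-z][a-z0-9_]*)+$")
SNAKE_RE = re.compile(r"^[a-z][a-z0-9_]*$")


def validate_log_record(rec: dict) -> list:
    """Return the names of fields that violate the v3 rules (empty list if none)."""
    problems = []
    if rec.get("level") not in LEVELS:
        problems.append("level")
    msg = rec.get("msg") or ""
    if not msg or len(msg) > 120:
        problems.append("msg")
    if not EVENT_RE.match(rec.get("event") or ""):
        problems.append("event")
    for key in rec.get("attrs") or {}:
        if not SNAKE_RE.match(key):
            problems.append(f"attrs.{key}")
    stack = (rec.get("error") or {}).get("stack")
    if stack and rec.get("level") not in {"ERROR", "FATAL"}:
        problems.append("error.stack")
    if rec.get("log_schema_version") != 3:
        problems.append("log_schema_version")
    return problems
```

The check cannot tell whether msg is truly static or whether a prod stack has been scrubbed; those rules still need review or a separate scanner.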

4.2 Levels & Usage

| Level | Use | Sampling |
| --- | --- | --- |
| FATAL | Process-terminating | 100% |
| ERROR | Contract violation, unhandled | 100% |
| WARN | Recoverable anomaly, degraded mode | 100% |
| AUDIT | Security/compliance events (see 4.5) | 100%, also to audit-service |
| INFO | Domain event emission, lifecycle | 100% in prod, filtered by category |
| DEBUG | Developer detail | 0% prod, 100% sandbox |

4.3 PII & Redaction

Redaction is applied in the @ghasi/telemetry library before serialization. The collector re-runs redaction as defense in depth.

Deny-list fields (never logged, even hashed unless noted):

password, password_hash, otp, totp_secret, access_token, refresh_token, session_cookie, id_token, private_key, webhook_secret, payment_pan, cvv, national_id, passport_no, date_of_birth, home_address, phone_e164, email, learner_free_text_answer, tutor_prompt_raw, tutor_response_raw, parent_contact, health_note.

Hashed (keyed per tenant salt): actor_id, device_id, ip_address, learner_email, guardian_email.

Truncated + categorized (never raw):

  • Tutor prompts/responses → prompt_category, prompt_length, prompt_lang, safety_tags[].
  • Free-text answers → answer_length, lang, similarity_to_reference (0–1).

Automated scanners: A nightly job runs regex + ML PII detectors on Loki; any hit opens a SEV-2 ticket and quarantines the offending log stream.
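
The emitter-side step could look roughly like the sketch below: deny-listed keys are dropped, hashed fields are replaced by a tenant-salted digest under a `_hash` suffix, and free-text answers are reduced to answer_length. The function name redact, the `_hash` suffix for every hashed field, and the sha256(value + salt) construction are assumptions for illustration, not the library's actual behaviour.

```python
import hashlib

DENY_LIST = frozenset({
    "password", "password_hash", "otp", "totp_secret", "access_token",
    "refresh_token", "session_cookie", "id_token", "private_key",
    "webhook_secret", "payment_pan", "cvv", "national_id", "passport_no",
    "date_of_birth", "home_address", "phone_e164", "email",
    "learner_free_text_answer", "tutor_prompt_raw", "tutor_response_raw",
    "parent_contact", "health_note",
})
HASHED_FIELDS = frozenset({"actor_id", "device_id", "ip_address",
                           "learner_email", "guardian_email"})


def _hash(value: str, tenant_salt: str) -> str:
    return hashlib.sha256((value + tenant_salt).encode("utf-8")).hexdigest()


def redact(attrs: dict, tenant_salt: str) -> dict:
    """Return a copy of attrs that is safe to serialize."""
    out = {}
    for key, value in attrs.items():
        if key == "learner_free_text_answer" and isinstance(value, str):
            out["answer_length"] = len(value)  # categorize, never keep raw text
            continue
        if key in DENY_LIST:
            continue  # dropped entirely, not hashed
        if key in HASHED_FIELDS:
            out[key + "_hash"] = None if value is None else _hash(str(value), tenant_salt)
            continue
        # Nested dicts are redacted recursively; lists pass through in this sketch.
        out[key] = redact(value, tenant_salt) if isinstance(value, dict) else value
    return out
```

A production version would also need to walk lists and scan string values, which is the gap the collector-side re-run and the nightly scanner are meant to cover.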

4.4 Multi-Tenancy in Logs

  • Loki label set: {service, env, region, severity, tenant_id}. tenant_id is a required label; logs without it are dropped at the Collector and a counter is incremented.
  • Per-tenant retention overrides (enterprise tier) are supported via Collector routing to dedicated indexes.
  • Tenant admin UI can export only their own logs (signed, time-boxed S3 URL, 24h).

4.5 Audit Logs (separate pipeline)

Audit events are never best-effort. They go through a synchronous, ack'd write to audit-service before the user-facing response completes. Failures return 503 — we do not transact without audit.

Audit event categories:

  • Auth: login, mfa, session revoke, impersonation.
  • Data: export, delete (DSAR), bulk access.
  • Content: publish, unpublish, license grant/revoke.
  • Moderation: AI block, human override, appeal outcome.
  • Offline: bundle issue, revoke, device bind/unbind.
  • Grading: grade change, regrade, override, release.
  • Marketplace: purchase, refund, payout, dispute.
  • Admin: role change, policy change, tenant config change.

Schema: canonical log schema + audit.action, audit.target, audit.before_hash, audit.after_hash, audit.signed_by, audit.chain_prev_hash (hash-chained for tamper evidence).
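
The hash chaining behind audit.chain_prev_hash can be illustrated with the sketch below, in which each event stores the digest of the previous event so that any later edit to an earlier event breaks verification. The canonical JSON serialization, the all-zero genesis value, and the helper names are assumptions; signing (audit.signed_by) is left out.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder for the first event's chain_prev_hash


def _digest(event: dict) -> str:
    # Canonical form: sorted keys, no insignificant whitespace.
    canonical = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def append_audit(chain: list, action: str, target: str) -> dict:
    """Append an audit event that points at the digest of its predecessor."""
    prev = _digest(chain[-1]) if chain else GENESIS
    event = {"audit": {"action": action, "target": target, "chain_prev_hash": prev}}
    chain.append(event)
    return event


def verify_chain(chain: list) -> bool:
    """Recompute every link; False means an event was altered or removed."""
    prev = GENESIS
    for event in chain:
        if event["audit"]["chain_prev_hash"] != prev:
            return False
        prev = _digest(event)
    return True
```

A chain like this detects changes to any event that has a successor. Protecting the newest event requires anchoring the latest digest somewhere the writer cannot modify, for example the WORM store.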


5. Metrics

5.1 Taxonomy

Three families, each with strict naming:

  • RED per request path: http_requests_total, http_request_duration_seconds, http_requests_errors_total.
  • USE per resource: process_cpu_seconds_total, db_pool_connections{state}, kafka_consumer_lag.
  • Domain (DKPIs): <domain>_<entity>_<action>_total — see §5.3+.

Naming rules:

  • _total for counters, _seconds for latency, _ratio for 0–1, _bytes for sizes.
  • Labels are bounded cardinality. tenant_id is allowed (bounded ~ low 10k); user_id is never a label.
  • High-cardinality dimensions (course, lesson) go to exemplars and analytics-service, not Prometheus labels.

5.2 Standard Labels

Every metric includes: service, env, region, tenant_tier (free|school|district|enterprise). Domain metrics additionally include tenant_id when cardinality permits (Collector enforces a per-series cap and down-labels above threshold).

5.3 Per-Service SLIs (all services)

| SLI | Definition | Target |
| --- | --- | --- |
| Availability | 1 - errors/total over read paths (5xx excluding 499) | 99.9% |
| Latency P95 | http_request_duration_seconds P95 | ≤ 300 ms (API) / ≤ 150 ms (edge) |
| Latency P99 | | ≤ 800 ms |
| Saturation | db_pool_in_use / db_pool_max | < 0.8 sustained |
| Queue lag | kafka_consumer_lag | < 5000 / partition |

5.4 Domain Metrics — Catalogue

Full catalogue is in telemetry/metrics.yaml (code-generated). Highlights:

Identity: auth_login_total{result}, auth_mfa_challenge_total{method,result}, auth_session_revoked_total{reason}.

Curriculum: curriculum_course_published_total, curriculum_draft_autosave_seconds, curriculum_review_cycle_duration_seconds.

Authoring: see §9.

Player: see §8.

Assessment: assessment_attempt_started_total, assessment_attempt_completed_total{result}, assessment_autograde_latency_seconds, assessment_regrade_total{reason}, assessment_integrity_flags_total{flag}.

AI: see §7.

Offline: see §8.

Marketplace: see §10.

Billing: billing_invoice_total{status}, billing_dunning_stage_total{stage}.

Notifications: notification_delivered_total{channel,status}, notification_suppressed_total{reason}.

5.5 Exemplars

Every domain counter/histogram carries exemplars linking to trace IDs for 1 in N successful requests and 100% of errored requests. Grafana exemplar panels enable one-click metric → trace → log drill-down.


6. Distributed Tracing

6.1 Instrumentation Rules

  • All inbound HTTP, gRPC, GraphQL, WebSocket, and Kafka consumer handlers create a root or child span.
  • All outbound DB, cache, HTTP, Kafka producer, object-store, and AI provider calls create a child span.
  • Spans have otel.status_code, error.type, and domain-specific attributes.
  • Critical rule: Any span crossing a trust boundary (tenant → provider, online → offline, sync-in) MUST carry a trust.boundary attribute and its own error budget.

6.2 Span Attributes (domain examples)

  • ghasi.tenant_id, ghasi.actor_role
  • ghasi.course_id, ghasi.lesson_id, ghasi.attempt_id
  • ghasi.ai.model, ghasi.ai.provider, ghasi.ai.purpose
  • ghasi.ai.tokens_in, ghasi.ai.tokens_out, ghasi.ai.cost_usd
  • ghasi.offline.bundle_id, ghasi.offline.sync_batch_id
  • ghasi.integrity.flag (proctoring signals)

6.3 Sampling

| Path | Strategy | Rate |
| --- | --- | --- |
| Health / liveness | Drop | 0% |
| Read APIs (cached) | Head-based | 1% |
| Write APIs | Head-based | 10% |
| Assessment scoring, grade commit, payment, DSAR | Always-on | 100% |
| AI inference (tutor, content-gen, moderation) | Always-on | 100% |
| Errors (4xx ≥ 429, all 5xx) | Tail-based override | 100% |
| Slow requests (> SLO P99) | Tail-based override | 100% |
| Offline sync replay | Always-on | 100% |

Tail sampling runs in the Collector gateway tier with a 30s decision window.
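
The combined effect of the sampling table can be sketched as one decision function. This is a simplification: a real tail sampler buffers spans for the decision window before deciding, whereas the function below only shows the final outcome. The path-class names, the 800 ms default (taken from the P99 target in §5.3), and the choice to drop health checks even on error are assumptions.

```python
ALWAYS_ON = frozenset({
    "assessment_scoring", "grade_commit", "payment", "dsar",
    "ai_inference", "offline_sync_replay",
})
HEAD_RATES = {"read_api": 0.01, "write_api": 0.10}


def head_decision(trace_id: str, rate: float) -> bool:
    """Deterministic per-trace decision: same trace_id, same answer on every hop."""
    bucket = int(trace_id[-8:], 16) / 2**32  # roughly uniform in [0, 1) for random IDs
    return bucket < rate


def keep_trace(trace_id: str, path_class: str, status_code: int,
               duration_ms: float, slo_p99_ms: float = 800.0) -> bool:
    if path_class == "health":
        return False  # health/liveness traces are dropped
    if path_class in ALWAYS_ON:
        return True
    # Tail-based override: errors (429 and above) and slow requests are always kept.
    if status_code >= 429 or duration_ms > slo_p99_ms:
        return True
    # Unknown path classes fall back to a 0% head rate in this sketch.
    return head_decision(trace_id, HEAD_RATES.get(path_class, 0.0))
```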

6.4 Redaction in Spans

db.statement is captured parameterized only (placeholders, never values). URL paths are templatized (/courses/:id/lessons/:lid). Request/response bodies are never placed in span attributes.


7. AI Telemetry

AI is the highest-risk, highest-cost surface. It gets its own first-class observability.

7.1 Dimensions to Capture (per invocation)

| Dimension | Example | Purpose |
| --- | --- | --- |
| ai.purpose | tutor.explain, authoring.outline, moderation.text, assessment.rubric_grade | SLOs, cost allocation |
| ai.model / ai.model_version | claude-sonnet-4-6@20261001 | Drift detection, A/B |
| ai.provider | anthropic, internal-llm, openai-fallback | Provider SLOs, failover |
| ai.prompt_template_id + hash | tutor.v14 | Provenance |
| ai.tokens_in / ai.tokens_out | | Cost |
| ai.cost_usd | | Cost |
| ai.latency_ms_ttfb / ttlb | | UX |
| ai.cache.hit / ai.cache.key_hash | | Cost |
| ai.safety.pre / ai.safety.post | {"self_harm":0.01,...} | Safety |
| ai.safety.action | allow, redact, block, escalate | Safety |
| ai.guardrail.violations[] | pii_leak, age_inappropriate | Safety |
| ai.output.citations[] | source doc IDs | Provenance |
| ai.output.grounding_score | 0–1 | Hallucination |
| ai.human_override | bool | Loop closure |
| ai.confidence | 0–1 | Routing |

Prompts and responses themselves go to a separate, encrypted, tenant-scoped store (ai-transcripts-service) with tighter retention (§11). Telemetry carries only hashes, categories, and safety signals.

7.2 AI SLIs & SLOs

| SLI | Target |
| --- | --- |
| Tutor TTFB P95 | ≤ 1.2s online, ≤ 300ms offline-SLM |
| Tutor stream completion success | ≥ 99.5% |
| Moderation decision latency P99 | ≤ 400ms |
| Safety false-negative rate (sampled audit) | ≤ 0.1% |
| Safety false-positive rate | ≤ 2% |
| Authoring AI accept rate | tracked, no threshold (product KPI) |
| AI cost per MAU | ≤ budget per tier |
| Cache hit rate (prompt+ctx) | ≥ 35% tutor, ≥ 60% authoring |
| Provider failover success | ≥ 99% |
| Grounding score (RAG paths) | P50 ≥ 0.8 |

7.3 AI Cost Observability

  • Budget IDs attach to every invocation; rollups by tenant_id × purpose × model in ClickHouse.
  • Circuit breakers fire when a tenant/purpose breaches 120% of hourly budget: degrade model (Opus → Sonnet → Haiku), enable aggressive caching, then fail-closed to cached/static responses.
  • Dashboards: ai/cost-overview, ai/per-tenant, ai/per-purpose, ai/model-mix, ai/cache-efficacy.
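
One way to read the circuit-breaker ladder is sketched below: while spend is above 120% of the hourly budget, each evaluation moves one step further down the ladder. How the real breaker paces its steps is not specified here, so the one-step-per-evaluation behaviour, the lowercase model tier names, and the function next_step are assumptions.

```python
MODEL_LADDER = ["opus", "sonnet", "haiku"]  # degrade order, strongest first
BREACH_RATIO = 1.20  # 120% of the hourly budget


def next_step(spend_usd: float, hourly_budget_usd: float,
              current_model: str, caching_aggressive: bool) -> tuple:
    """Return (action, model_to_use) for one tenant/purpose evaluation."""
    if hourly_budget_usd <= 0:
        raise ValueError("hourly_budget_usd must be positive")
    if spend_usd <= BREACH_RATIO * hourly_budget_usd:
        return ("none", current_model)
    idx = MODEL_LADDER.index(current_model)
    if idx < len(MODEL_LADDER) - 1:
        return ("degrade_model", MODEL_LADDER[idx + 1])
    if not caching_aggressive:
        return ("enable_aggressive_caching", current_model)
    # Last resort: serve only cached/static responses.
    return ("fail_closed_cached", current_model)
```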

7.4 AI Provenance & Replay

Each AI output stores {prompt_template_id, prompt_hash, context_hash, model, model_version, params, safety_decisions, citations} — sufficient to replay a response for audit (Doc 13). Replay is exposed via ai-audit admin tool and logged as AUDIT.


8. Offline Telemetry & Player Telemetry

8.1 Offline SDK Responsibilities

The offline runtime (mobile, desktop player, low-bandwidth web cache) runs a local telemetry buffer:

  • SQLite-backed, append-only, size-capped (default 64 MB, configurable).
  • Each batch is MAC-signed with the device binding key (Doc 13) so the server can detect tampering on replay.
  • Buffer encrypts at rest using the device-bound key.
  • Batches are chunked by sync_batch_id, ordered by monotonic sequence number; gaps flagged on server.
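
The signing and gap-detection bullets can be sketched as follows, using HMAC-SHA256 as the MAC. The specific MAC algorithm, the batch envelope shape, and the function names are assumptions; encryption at rest and SQLite storage are omitted.

```python
import hashlib
import hmac
import json


def sign_batch(device_key: bytes, sync_batch_id: str, seq: int, events: list) -> dict:
    """Serialize a batch canonically and attach an HMAC-SHA256 tag."""
    payload = json.dumps(
        {"sync_batch_id": sync_batch_id, "seq": seq, "events": events},
        sort_keys=True, separators=(",", ":"),
    )
    mac = hmac.new(device_key, payload.encode("utf-8"), hashlib.sha256).hexdigest()
    return {"payload": payload, "mac": mac}


def verify_batch(device_key: bytes, batch: dict) -> bool:
    """Server side: recompute the tag and compare in constant time."""
    expected = hmac.new(device_key, batch["payload"].encode("utf-8"),
                        hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, batch["mac"])


def find_sequence_gaps(seqs: list) -> list:
    """Return the sequence numbers missing between the lowest and highest seen."""
    ordered = sorted(set(seqs))
    gaps = []
    for prev, cur in zip(ordered, ordered[1:]):
        gaps.extend(range(prev + 1, cur))
    return gaps
```

A failed verification or a non-empty gap list would be candidates for an offline.tamper.suspected signal.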

8.2 Offline Signals

| Signal | Fields |
| --- | --- |
| offline.bundle.activated | bundle_id, course_id, size_bytes, integrity_ok |
| offline.bundle.integrity_failure | bundle_id, expected_hash, actual_hash, reason |
| offline.sync.started / .completed / .failed | batch_id, items, bytes, duration_ms, conflicts |
| offline.conflict.detected | entity, strategy, winner, loser_preserved |
| offline.device.bind / .unbind / .rebind_denied | device_id_hash, reason |
| offline.tamper.suspected | signal, severity, evidence_hash |
| offline.clock.skew | skew_seconds |
| offline.duration.active_seconds | rolling |

8.3 Offline SLIs/SLOs

| SLI | Target |
| --- | --- |
| Sync success rate | ≥ 99% per device/day |
| Conflict rate (of synced writes) | ≤ 1% |
| Bundle tamper detection | 100% of test cases caught |
| Device-binding mismatch blocked | 100% |
| Mean time from reconnection → sync complete | ≤ 60s for ≤ 10MB |

8.4 Player Telemetry

| Event | Key Attributes |
| --- | --- |
| player.session.started | session_id, course_id, entry_point |
| player.navigation | from_lesson_id, to_lesson_id, method (next, TOC, deeplink, tutor) |
| player.lesson.viewed | lesson_id, media_type, duration_ms, completion_ratio |
| player.media.buffered | stall_count, stall_total_ms |
| player.quiz.started / .answered / .submitted | attempt_id, question_id, time_on_question_ms, changed_answer_count |
| player.tutor.opened / .prompt / .response | turn_number, ai.purpose=tutor.*, escalated_to_human |
| player.accessibility.used | feature (captions, TTS, dyslexia_mode, hi_contrast) |
| player.integrity.flag | flag, confidence (proctoring) |
| player.session.ended | reason, duration_ms, progress_delta |

These feed both SRE dashboards and analytics-service for learning analytics (engagement, mastery).

Quiz/assessment integrity telemetry is captured at 100% sampling and mirrored to audit-service whenever integrity flags fire.


9. Authoring Telemetry

The authoring tool is where AI leverage is highest; telemetry closes the feedback loop to improve prompts and UX.

| Event | Attributes | Why |
| --- | --- | --- |
| authoring.session.started | course_id, editor_version | Engagement |
| authoring.block.inserted | block_type, source (manual, ai, template, import) | Block-mix |
| authoring.ai.suggestion.shown | surface, purpose, suggestion_id | AI UX |
| authoring.ai.suggestion.accepted | suggestion_id, accept_mode (full, partial, edited) | Accept rate |
| authoring.ai.suggestion.rejected | suggestion_id, reason_category | Learning signal |
| authoring.ai.regenerate | count, deltas | Friction |
| authoring.autosave | latency_ms, bytes, conflict | Reliability |
| authoring.conflict.detected | entity, resolution_strategy, lost_bytes | Collab quality |
| authoring.review.cycle | state_from, state_to, duration_s, reviewer_role | Workflow |
| authoring.publish | course_id, validator_errors, warnings, a11y_score | Release quality |
| authoring.accessibility.score | score, violations[] | Quality gate |

9.1 Authoring SLIs

| SLI | Target |
| --- | --- |
| Autosave success | ≥ 99.95% |
| Autosave latency P95 | ≤ 250 ms |
| Merge conflict rate | ≤ 0.5% of autosaves |
| AI suggestion TTFB P95 | ≤ 900 ms |
| AI accept rate (rolling 30d) | tracked; alert on ±25% WoW shift |
| Publish validator pass rate | ≥ 99% |

10. Marketplace Telemetry

Marketplace telemetry joins SRE, product analytics, and finance.

| Event | Attributes |
| --- | --- |
| market.listing.viewed | listing_id, source |
| market.checkout.started | cart_hash, price_cents, currency, coupon |
| market.checkout.payment_attempt | provider, method, 3ds |
| market.checkout.succeeded / .failed | reason |
| market.refund.requested / .approved / .denied | reason_category |
| market.payout.scheduled / .completed | creator_id_hash, amount_cents |
| market.fraud.flag | signals[], decision |
| market.dispute.opened / .resolved | provider_case_id_hash |
| market.license.activated / .exhausted | listing_id, seats |

10.1 Marketplace SLIs

| SLI | Target |
| --- | --- |
| Checkout success rate | ≥ 98% (excluding legit declines) |
| Payment provider failover success | ≥ 99% |
| Refund SLA (request → decision) | P95 ≤ 72h |
| Chargeback rate | ≤ 0.5% |
| Fraud false positive | ≤ 2% |
| Payout on-time rate | ≥ 99.5% |

Revenue metrics (market_gmv_cents_total, market_refund_cents_total) are reconciled daily against the ledger in billing-service. Divergence > 0.1% pages finance oncall.
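
The daily reconciliation check could be expressed as in the sketch below. Measuring divergence relative to the ledger value, and the function names, are assumptions made for illustration.

```python
def gmv_divergence(metric_cents: int, ledger_cents: int) -> float:
    """Relative difference between the telemetry total and the billing ledger."""
    if ledger_cents == 0:
        return 0.0 if metric_cents == 0 else float("inf")
    return abs(metric_cents - ledger_cents) / abs(ledger_cents)


def should_page_finance(metric_cents: int, ledger_cents: int,
                        threshold: float = 0.001) -> bool:
    """True when divergence exceeds 0.1%."""
    return gmv_divergence(metric_cents, ledger_cents) > threshold
```

Amounts stay in integer cents so that only the final ratio involves floating point.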


11. Data Retention, Residency & Privacy

11.1 Retention Matrix

| Signal | Hot | Warm | Cold | Max |
| --- | --- | --- | --- | --- |
| App logs (non-PII) | 14d Loki | 90d S3 | 395d S3 Glacier | 395d |
| Audit logs | 30d hot | | 7y WORM S3 | 7y |
| Metrics | 30d Prom | 13mo Mimir | | 13mo |
| Traces (sampled) | 7d Tempo | 90d S3 | | 90d |
| Traces (errors, AI, payment) | 30d | 395d | | 395d |
| AI transcripts (learner) | 30d | 180d per tenant config | | ≤ 365d |
| AI transcripts (safety-flagged) | 180d | 2y audit | | 2y |
| Offline device telemetry | 14d post-sync | 90d | | 90d |
| Learner event stream (analytics) | 90d | 2y aggregated | | K–12 tenants: raw ≤ 13 mo |
| Marketplace financial events | | | 7y | regulatory |

11.2 Residency

Tenants select a home region (af-jnb-1, eu-fra-1, us-iad-1, me-bah-1). Logs, metrics, traces, and transcripts are pinned to that region. Cross-region replication is off by default; enabled only for enterprise DR with DPA addendum.

11.3 Minor/K–12 Rules

  • No actor email or phone number appears anywhere in telemetry, not even hashed with a global salt; where a hash is permitted, use the tenant salt.
  • AI transcript retention ≤ 30d unless safety-flagged.
  • No behavioral profiling data exported to third parties.
  • DSAR deletion propagates to Loki, Mimir, Tempo, ClickHouse, and AI transcripts within 30 days; deletion receipts logged to audit-service.

11.4 Access Control

  • Grafana folders per tenant + per function (SRE, Product, Finance, Safety).
  • Raw logs accessible only to SRE+Security; product engineers see sanitized views.
  • Break-glass access is MFA-gated, time-boxed (≤ 4h), auto-audited.

12. Dashboards

Dashboards are stored as JSON in grafana/ and provisioned via CI. Each has an owner and an SLO link.

12.1 Global Dashboards

  1. Platform Overview — availability, latency, saturation across all services.
  2. Error Budget Burn — per service, per SLO, 1h/6h/24h burn rates.
  3. Tenant Health — per-tenant error rate, latency, AI spend, offline sync.
  4. Release Radar — correlates deploys → error rate / latency deltas.
  5. Cost Control — infra + AI + egress.

12.2 Per-Capability Dashboards

  • Identity & Access — login success, MFA mix, session revocations, impersonations.
  • Authoring — autosave, AI accept rate, conflict rate, publish validator, review cycle time.
  • Player — session starts, lesson completion funnel, quiz funnel, tutor usage, a11y usage.
  • Assessment — attempts, grading latency, regrades, integrity flags.
  • AI / Tutor — TTFB, cache hit, safety actions, grounding, cost.
  • Offline — sync success, conflict %, tamper flags, bundle integrity.
  • Marketplace — checkout funnel, refunds, fraud, payout SLA, GMV vs ledger.
  • Billing — invoice status, dunning, MRR proxy.
  • Safety & Moderation — blocks, overrides, appeals, SLA to decision.
  • Data Platform — Kafka lag, DLQ depth, consumer group health.

12.3 Per-Service Dashboards (template)

Each service auto-gets a dashboard with: RED panels, USE panels, top errors, top slow endpoints, top consumers, DB pool, Kafka lag (if applicable), dependency map.


13. Alerts, SLIs/SLOs, Error Budgets

13.1 SLO Framework

  • All SLOs defined in slo/*.yaml (Sloth format), reviewed via PR.
  • Multiwindow, multi-burn-rate alerts (Google SRE): 1h+5m (fast burn), 6h+30m (slow burn).
  • 28-day rolling windows; error-budget policy enforced automatically.
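
The multiwindow check can be sketched as below. Burn rate is the observed error ratio divided by the error budget ratio (1 minus the SLO target), and an alert fires only when both the long and the short window exceed the threshold. The thresholds 14.4 and 6 are the values commonly quoted from the Google SRE Workbook for a 30-day window and are an assumption here; the values Sloth generates for a 28-day window may differ.

```python
FAST_BURN = 14.4  # assumed threshold for the 1h + 5m pair
SLOW_BURN = 6.0   # assumed threshold for the 6h + 30m pair


def burn_rate(errors: int, total: int, slo_target: float) -> float:
    """Error ratio divided by the budget ratio; 1.0 means burning exactly on budget."""
    if total == 0:
        return 0.0
    return (errors / total) / (1.0 - slo_target)


def should_alert(long_window_rate: float, short_window_rate: float,
                 threshold: float) -> bool:
    """Both windows must exceed the threshold, so alerts clear once burn stops."""
    return long_window_rate > threshold and short_window_rate > threshold
```

For example, with a 99.9% target, 200 errors in 10,000 requests over the last hour is a burn rate of about 20, which exceeds the assumed fast-burn threshold if the 5-minute window agrees.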

13.2 Alert Contract

Every alert declares:

alert: PlayerTutorTTFBHigh
expr: ...
for: 5m
severity: SEV-2
owner: ai-platform
runbook: https://runbooks.ghasi/obs/ai-tutor-ttfb
auto_remediation: ai.failover.downgrade_model
dashboards: [ai/tutor, player/engagement]
slos: [player.tutor.ttfb]

Alerts without runbook + owner are rejected in CI.
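
A CI gate for this contract might look like the sketch below, operating on an alert rule that has already been parsed into a dict. Only the runbook and owner requirement comes from the rule above; the severity check against the SEV-1 to SEV-4 ladder and the function name check_alert are assumptions.

```python
REQUIRED_FIELDS = ("owner", "runbook")
SEVERITIES = frozenset({"SEV-1", "SEV-2", "SEV-3", "SEV-4"})


def check_alert(rule: dict) -> list:
    """Return a list of contract violations; an empty list means the rule passes."""
    problems = [f"missing {field}" for field in REQUIRED_FIELDS if not rule.get(field)]
    severity = rule.get("severity")
    if severity is not None and severity not in SEVERITIES:
        problems.append(f"unknown severity {severity}")
    return problems
```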

13.3 Severity Ladder

| Severity | Definition | Response |
| --- | --- | --- |
| SEV-1 | User-visible outage, data loss risk, safety breach, payment outage | Page + bridge + Statuspage within 5 min |
| SEV-2 | Capability degraded, SLO fast-burn | Page primary oncall |
| SEV-3 | Slow-burn SLO, non-blocking | Ticket + Slack |
| SEV-4 | Housekeeping | Ticket |

13.4 Example Alerts (non-exhaustive)

  • IdentityLoginErrorRate (SEV-2): 5xx on /auth/login > 1% for 5m.
  • AssessmentGradeCommitFailure (SEV-1): any ERROR on grade commit path — always-on pager.
  • AIContentModerationBypass (SEV-1): ai.safety.action != block but ai.guardrail.violations is non-empty.
  • AITutorTTFBSlowBurn (SEV-3): 6h burn > 1× budget on player.tutor.ttfb.
  • OfflineBundleTamper (SEV-1): any offline.bundle.integrity_failure.
  • OfflineSyncConflictSpike (SEV-2): conflict rate > 2% over 30m per tenant.
  • AICostBudgetBreach (SEV-2): tenant hourly AI spend > 120% budget; auto-downgrade fires first, alert on 2 consecutive windows.
  • MarketplaceGMVReconMismatch (SEV-1): daily GMV vs ledger diff > 0.1%.
  • KafkaConsumerLagHigh (SEV-2): any consumer lag > 50k for 10m.
  • DSARDeletionSLA (SEV-2): any open DSAR > 27 days.
  • AuditWriteFailure (SEV-1): any failed audit write; blocks transactions.
  • AuthoringAutosaveFailure (SEV-2): autosave success < 99.5% for 10m.
  • PlayerA11yViolation (SEV-3): a11y violations on published course > threshold.

13.5 Error Budget Policy

When a service consumes:

  • 50% of monthly budget → notify service owner; feature freeze optional.
  • 75% → feature freeze mandatory; only reliability PRs merge.
  • 100% → rollback latest risky changes; post-incident review required; no launches until budget restored.

Policy enforcement is automated via a GitHub check that reads the SLO service.
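
The thresholds above map onto a small decision function, sketched here. The action labels and function names are hypothetical, and how the GitHub check obtains the consumed fraction is not covered.

```python
def budget_consumed(errors: int, total: int, slo_target: float) -> float:
    """Fraction of the error budget used: errors / (total * (1 - slo_target))."""
    allowed = total * (1.0 - slo_target)
    if allowed == 0:
        return 0.0 if errors == 0 else float("inf")
    return errors / allowed


def policy_action(consumed: float) -> str:
    """Map the consumed fraction to the error-budget policy tier."""
    if consumed >= 1.0:
        return "rollback_and_block_launches"
    if consumed >= 0.75:
        return "feature_freeze_mandatory"
    if consumed >= 0.50:
        return "notify_owner"
    return "ok"
```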


14. Incident Response Hooks

14.1 Auto-Declare

Any SEV-1 or two concurrent SEV-2s auto-declares an incident:

  1. incident-bot opens a Slack channel #inc-YYYYMMDD-NN.
  2. Creates PagerDuty incident, pages on-call per owner.
  3. Posts runbook, current burn rate, recent deploys, related alerts.
  4. Opens a bridge (Zoom/Meet) with auto-invite.
  5. Updates Statuspage with tenant-scoped visibility.
  6. Starts a timeline log (every human comment + alert transition captured).
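
The trigger condition and channel naming can be sketched as below. The function names are hypothetical, and zero-padding NN to two digits is an assumption about the #inc-YYYYMMDD-NN format.

```python
from datetime import date


def should_auto_declare(active_severities: list) -> bool:
    """True for any SEV-1, or for two or more concurrent SEV-2 alerts."""
    sev1 = sum(1 for s in active_severities if s == "SEV-1")
    sev2 = sum(1 for s in active_severities if s == "SEV-2")
    return sev1 >= 1 or sev2 >= 2


def incident_channel(day: date, seq: int) -> str:
    """Build the Slack channel name, e.g. #inc-20260415-03."""
    return f"#inc-{day:%Y%m%d}-{seq:02d}"
```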

14.2 Automated Remediation Hooks

| Trigger | Action |
| --- | --- |
| AI provider error rate > 5% | Failover to secondary provider + model downgrade |
| Tutor TTFB P95 breach | Reduce streaming concurrency per tenant |
| Kafka DLQ depth spike | Pause producer, alert, enable dead-letter drain tool |
| Offline tamper spike from a tenant | Auto-revoke affected device bindings, require re-enroll |
| Payment provider outage | Route new checkouts to secondary |
| Audit write failure | Trip global "no-write" breaker on affected capability |
| Cost breach | Model downgrade, cache-only mode, then 503 with graceful copy |

All auto-remediation actions are logged as AUDIT and reversible via single command.

14.3 Postmortems

  • Blameless template generated from incident timeline + telemetry (pm-bot).
  • Required within 5 business days for SEV-1/2.
  • Action items tracked with SLA; overdue action items appear on the error-budget dashboard.

15. Integration with analytics-service

analytics-service (Doc 03) is the learning & product analytics sink — not an SRE tool, but it shares the telemetry substrate.

15.1 Dual Emission

Domain events (player.*, authoring.*, market.*) are:

  1. Logged (for SRE forensics).
  2. Emitted as Kafka domain events (source of truth, Doc 04).
  3. Projected into ClickHouse by analytics-service consumers.

Metrics are not duplicated into analytics; analytics re-aggregates from events.

15.2 Contract Between Telemetry Library and Analytics

  • Events declared once in events/*.proto with @analytics and @telemetry annotations.
  • Codegen produces: (a) producer client, (b) log emitter, (c) ClickHouse DDL, (d) dbt models, (e) Grafana/Looker source docs.
  • Adding an event without updating the schema fails CI, which blocks the PR.

15.3 PII Boundary

Analytics warehouse receives tenant-salted hashed IDs. Reverse identification requires a signed, audited join against identity-service — gated, rate-limited, logged.


16. Frontend & Mobile Client Telemetry

  • Web: OTel Web SDK + custom @ghasi/telemetry-web. CLS/LCP/INP collected. Session replay disabled by default (privacy); enabled per-tenant with explicit consent, never for learners under 18.
  • Mobile: OTel Android/iOS + Dart. Offline buffer as in §8.
  • Crash reporting: native platform crash collectors, symbolicated, linked to trace_id via breadcrumbs. No PII in crash payloads.
  • Network errors and retry storms are explicit metrics (client_retry_storm_total).

Consent flags (consent.telemetry, consent.session_replay) are read on every emission; denial drops events at the client before transport.


17. Security & Threat Telemetry

  • authn_bruteforce_detected_total, authn_credential_stuffing_detected_total, abuse_rate_limit_triggered_total.
  • WAF events forwarded with rule_id, action.
  • Secret scanner alerts (source code, logs) mapped to SEV-2.
  • Anomaly detection: per-tenant usage baselines in ClickHouse; deviations > 4σ raise SecurityAnomaly alerts.
  • Supply chain: SBOM diffs per deploy, signed; rollback triggers if vulnerability severity ≥ HIGH.
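
The 4σ anomaly rule can be illustrated with a plain z-score against a per-tenant baseline, as in the sketch below. Using the sample standard deviation over a list of recent values is an assumption; the actual ClickHouse baseline query is not shown in this document.

```python
import statistics


def is_anomalous(baseline: list, observed: float, sigma: float = 4.0) -> bool:
    """True when observed deviates from the baseline mean by more than sigma stdevs."""
    if len(baseline) < 2:
        return False  # not enough history to estimate spread
    mean = statistics.fmean(baseline)
    stdev = statistics.stdev(baseline)
    if stdev == 0:
        return observed != mean
    return abs(observed - mean) / stdev > sigma
```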

18. Implementation Checklist (per service)

A service is not production-ready until:

  • Uses @ghasi/telemetry wrapper (no raw OTel).
  • Emits correlation context per §3.
  • Passes PII redaction CI test (fixtures of bad logs).
  • Declares RED SLIs; defines at least 1 domain SLI.
  • Has a per-service dashboard provisioned.
  • Has at least one SEV-2 alert with runbook.
  • Declares Kafka consumer lag alert (if a consumer).
  • Audit events (if in scope) use synchronous audit write.
  • Offline surfaces (if any) use buffered, MAC-signed telemetry.
  • AI surfaces (if any) emit §7.1 dimensions and pass safety-replay test.
  • Domain events registered in events/ with both telemetry + analytics projections.
  • Documented retention & residency overrides if non-default.

19. Open Questions & Roadmap

  1. On-device anomaly detection for offline tamper — currently server-side only; prototype lightweight ML on-device in H2.
  2. Learner-visible AI transparency card — surfaces ai.model, citations, grounding_score to learners ≥ age 13. Needs legal review.
  3. Per-course cost attribution — currently tenant+purpose; extend to course+creator for marketplace unit economics.
  4. Formal privacy budgets (differential privacy) for analytics exports — exploratory.
  5. eBPF-based USE telemetry for node-local kernel metrics — under PoC.
  6. Synthetic journeys for K–12 low-bandwidth scenarios — planned Q3.

20. Change Management

  • This document is versioned; material changes require an RFC in rfcs/observability/.
  • Breaking changes to the log schema require log_schema_version bump and a deprecation window of 2 minor releases.
  • Alert and SLO changes are PR-reviewed by SRE + capability owner + (for safety/AI) Trust & Safety.

End of Document 15 — Observability & Telemetry Specification.