Testing Strategy

:::info Source
Sourced from docs/16-testing-strategy-qa.md in the documentation repo.
:::

Document status: Canonical specification (v1.0)
Owner: Platform QA Guild & Chief Architect
Related specs: 01-enterprise-architecture, 02-ddd-bounded-contexts, 03-microservices, 04-event-driven-architecture, 05-api-design, 06-traceability-matrix, 10-authoring-tool-spec, 11-lms-runtime-player-spec, 12-data-models, 13-security-compliance-tenancy, 14-risks-and-tradeoffs


Table of Contents

  1. Purpose, Scope & Quality Philosophy
  2. Testing Principles & Non-Negotiables
  3. The Ghasi Test Pyramid (Extended)
  4. Unit Testing — Domain Layer
  5. Integration Testing — Service Layer
  6. Contract Testing — APIs & Events
  7. End-to-End Testing — User Journeys
  8. Offline Testing
  9. AI Testing (Prompt, Safety, Hallucination, Structured Output)
  10. Load, Performance & Scalability Testing
  11. Accessibility Testing — WCAG 2.2 AA
  12. Security Testing — SAST, DAST, Pen-Test, AuthZ Matrix
  13. Chaos & Resilience Testing
  14. Replay Testing — Event Log Rebuild
  15. Multi-Device Sync Testing
  16. Testing Environments & Data Management
  17. CI/CD Quality Gates
  18. Coverage Expectations per Service
  19. Tooling Matrix
  20. Governance, Ownership & RACI
  21. Appendix A — Test IDs & Traceability
  22. Appendix B — Flake Policy & Quarantine

1. Purpose, Scope & Quality Philosophy

Ghasi-edTech is a multi-tenant, AI-first, offline-first, event-driven education platform serving learners, instructors, authors, and administrators across heterogeneous devices (mobile, tablet, desktop, low-end Android, kiosk). Quality assurance on a system of this complexity cannot be reduced to "does it build and do the tests pass." It must be a layered, continuously executed evidence system that proves:

  • Correctness — The domain behaves as specified in 02-ddd-bounded-contexts.
  • Safety — AI outputs cannot harm learners (harassment, hallucinated facts, PII leakage, curriculum drift).
  • Resilience — The system degrades gracefully under failure, network loss, or adversarial load.
  • Portability of state — A learner's progress, attempts, and artifacts survive device loss, network partitions, and bundle corruption.
  • Isolation — Tenants cannot observe, influence, or exfiltrate each other's data (per 13-security-compliance-tenancy).
  • Accessibility — No learner is excluded by disability, device class, or connectivity.
  • Reproducibility — Any historical state can be rebuilt from the event log (04-event-driven-architecture).

1.1 Scope

This document governs testing for:

  • All 18 microservices enumerated in 03-microservices.
  • All frontends: Learner Web, Learner Mobile (iOS/Android), Instructor Console, Authoring Tool, Admin Portal, Public Marketing Site.
  • All AI pipelines: tutor, content-generation, rubric-grading, summarization, translation, accessibility-captioning, adaptive-path.
  • All event producers, consumers, and projections.
  • The offline runtime (11-lms-runtime-player-spec).
  • All infrastructure-as-code (Terraform, Helm, Crossplane).

1.2 Out of Scope

  • Manufacturer-level hardware testing of tablets distributed to schools (vendor-owned).
  • Legal/regulatory sign-off of content (editorial, not QA).
  • Third-party vendor SaaS internals (Stripe, Clerk, OpenAI, Anthropic) — we test our contracts with them, not their systems.

1.3 Quality Philosophy

"Every production incident is a missing test. Every missing test is a missed risk conversation."

We adopt four guiding stances:

  1. Tests as executable specifications. Domain tests read like the ubiquitous language.
  2. Shift-left, shift-right equally. Pre-merge gates catch regressions; production observability catches emergent behavior; both feed the backlog.
  3. AI is a first-class citizen of the test strategy. Non-determinism does not excuse non-testing — it raises the bar.
  4. Offline is not a feature; it is a tier-0 invariant. Any test plan that only works online is incomplete.

2. Testing Principles & Non-Negotiables

2.1 Principles

| # | Principle | Implication |
| --- | --- | --- |
| P1 | Tests are first-class code | Reviewed, refactored, owned by the producing team; never "thrown over the wall" to QA. |
| P2 | Deterministic by default | Flakes are bugs; see Appendix B. |
| P3 | One assertion concept per test | Multiple expect lines are fine; multiple concepts are not. |
| P4 | Arrange-Act-Assert, always | Readability > brevity. |
| P5 | Test behavior, not implementation | Private methods are never tested directly. |
| P6 | Real dependencies where cheap | Testcontainers for Postgres, Redis, Kafka, S3 (MinIO). Mocks only at hard boundaries (payments, external LLMs). |
| P7 | Data builders over fixtures | aLearner().enrolledIn(course).withAttempts(3).build() beats JSON blobs. |
| P8 | Tests must fail loudly | No silent try/catch; no if (err) return. |
| P9 | Coverage is a floor, not a ceiling | 80% line coverage is the minimum; 95% branch coverage for domain aggregates. |
| P10 | Every bug fix ships with a regression test | Enforced via PR template checkbox + CI grep. |

2.2 Non-Negotiables (merge-blocking)

  • No PR merges without green CI across all stages listed in §17.
  • No production deploy without a successful canary and a rollback plan verified in staging within the last 7 days.
  • No new AI prompt in production without a prompt regression suite (§9.2).
  • No new event schema without a consumer contract test (§6.3).
  • No new public API endpoint without a Pact contract and an OpenAPI diff review.
  • No schema migration without a forward+backward compatibility test and a replay test (§14).

3. The Ghasi Test Pyramid (Extended)

The classic pyramid (unit → integration → E2E) is insufficient. Ghasi uses an extended pyramid with orthogonal axes:

┌──────────────────┐
│ Chaos / Replay │ (continuous, production-like)
├──────────────────┤
│ E2E Journeys │ (~200 tests, <15 min)
├──────────────────┤
│ Contract Tests │ (per API, per event; ~500 tests)
├──────────────────┤
│ Integration │ (Testcontainers; ~2,000 tests)
├──────────────────┤
│ Unit │ (domain + utils; ~15,000 tests)
└──────────────────┘

Orthogonal: Accessibility │ Security │ AI-Eval │ Offline │ Load │ Sync

Each orthogonal axis runs against multiple tiers. For example, accessibility applies to unit (component render), E2E (axe scan), and manual (assistive-tech spot checks).

3.1 Rough Volume Targets

| Tier | Count target | Wall-clock budget | Frequency |
| --- | --- | --- | --- |
| Unit | 12k–18k | < 4 min parallelized | Every commit |
| Integration | 1.5k–2.5k | < 12 min | Every PR |
| Contract | 400–700 | < 3 min | Every PR |
| E2E | 150–250 critical paths | < 15 min | Every PR (smoke) / nightly (full) |
| Load | 30+ scenarios | Nightly + pre-release | Nightly |
| Chaos | 20+ experiments | Weekly (staging), monthly (prod) | Weekly |
| AI-eval | 1k+ prompts × models | On prompt change + nightly | Continuous |

4. Unit Testing — Domain Layer

4.1 What Counts as a Domain Unit Test

A domain unit test exercises one aggregate, entity, value object, or domain service in isolation, with zero I/O, zero time-dependence unless injected, zero randomness unless seeded.

4.2 Coverage Targets

| Layer | Line | Branch | Mutation (Stryker) |
| --- | --- | --- | --- |
| Aggregates (Course, Enrollment, AttemptSession, LearnerProgress, PaymentOrder) | 95% | 95% | ≥ 75% |
| Value objects (Score, TenantId, ContentRef, OfflineBundleHash) | 100% | 100% | ≥ 85% |
| Domain services | 90% | 90% | ≥ 70% |
| Policy/specification classes | 95% | 95% | ≥ 80% |

4.3 Patterns & Conventions

  • Given/When/Then naming: given_expired_enrollment_when_starting_attempt_then_throws_EnrollmentExpired.
  • Builders in __builders__/ colocated with the aggregate.
  • No ORM, no DB, no network, no filesystem. Tests import the pure TS/Go/Python domain module only.
  • Time is injected via a Clock port; tests use a FakeClock at a fixed ISO-8601 instant.
  • UUIDs are injected via IdFactory; tests use a deterministic SeededIdFactory.
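The conventions above can be sketched as follows. This is a minimal illustration, not the project's actual test kit: the method names on Clock and IdFactory, the Enrollment shape, and EnrollmentBuilder are all assumptions introduced for the example.

```typescript
// Ports named in the text; their method names are assumptions for illustration.
interface Clock { now(): Date; }
interface IdFactory { next(): string; }

// Test double: returns a fixed instant until a test advances it explicitly.
class FakeClock implements Clock {
  private currentMs: number;
  constructor(isoInstant: string) { this.currentMs = Date.parse(isoInstant); }
  now(): Date { return new Date(this.currentMs); }
  advance(ms: number): void { this.currentMs += ms; }
}

// Test double: yields the same ID sequence for the same seed on every run.
class SeededIdFactory implements IdFactory {
  private counter = 0;
  private readonly seed: string;
  constructor(seed: string) { this.seed = seed; }
  next(): string { this.counter += 1; return `${this.seed}-${this.counter}`; }
}

// Hypothetical minimal shape; the real Enrollment aggregate is defined elsewhere.
interface Enrollment { id: string; learnerId: string; courseId: string; expiresAt: Date; }

// Builder in the style of principle P7: readable defaults, fluent overrides.
class EnrollmentBuilder {
  private learnerId = "learner-1";
  private courseId = "course-1";
  private offsetMs = 30 * 24 * 60 * 60 * 1000; // default: expires in 30 days
  private readonly clock: Clock;
  private readonly ids: IdFactory;
  constructor(clock: Clock, ids: IdFactory) { this.clock = clock; this.ids = ids; }
  forLearner(id: string): this { this.learnerId = id; return this; }
  inCourse(id: string): this { this.courseId = id; return this; }
  expiredSince(ms: number): this { this.offsetMs = -ms; return this; }
  build(): Enrollment {
    return {
      id: this.ids.next(),
      learnerId: this.learnerId,
      courseId: this.courseId,
      expiresAt: new Date(this.clock.now().getTime() + this.offsetMs),
    };
  }
}
```

Because both time and identity come from injected ports, two runs of the same test produce identical aggregates, which is what makes snapshot-style assertions on domain objects safe.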

4.4 Example Coverage — AttemptSession Aggregate

Invariants that MUST have dedicated tests (excerpt):

  • Cannot start a new attempt while a previous attempt is IN_PROGRESS and its lock window has not expired.
  • Cannot submit an answer after timeLimit has elapsed (using injected clock).
  • score is recomputed only on transition to SUBMITTED.
  • Re-submission of the same question version is idempotent.
  • Offline-replay ordering (see §14) produces identical final state regardless of event arrival order within the same causal group.
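A few of these invariants can be illustrated with a deliberately small aggregate. This is a sketch under stated assumptions: the constructor signature, the error messages, the "questionId@version" key format, and the count-of-correct-answers scoring rule are hypothetical and stand in for whatever the real AttemptSession defines.

```typescript
type AttemptStatus = "IN_PROGRESS" | "SUBMITTED";

// Hypothetical, reduced AttemptSession; only three invariants are modelled.
class AttemptSession {
  private readonly answers = new Map<string, string>(); // key: "<questionId>@<version>"
  private status: AttemptStatus = "IN_PROGRESS";
  private finalScore: number | null = null;
  private readonly startedAtMs: number;
  private readonly clock: { now(): Date };
  private readonly timeLimitMs: number;

  constructor(clock: { now(): Date }, timeLimitMs: number) {
    this.clock = clock;
    this.timeLimitMs = timeLimitMs;
    this.startedAtMs = clock.now().getTime();
  }

  submitAnswer(questionId: string, version: number, answer: string): void {
    if (this.status !== "IN_PROGRESS") throw new Error("AttemptAlreadySubmitted");
    // Invariant: no answers after timeLimit has elapsed (per the injected clock).
    if (this.clock.now().getTime() - this.startedAtMs > this.timeLimitMs) {
      throw new Error("TimeLimitElapsed");
    }
    // Invariant: re-submitting the same question version does not add a second entry.
    this.answers.set(`${questionId}@${version}`, answer);
  }

  // Invariant: score is computed only on the transition to SUBMITTED.
  submit(answerKey: Record<string, string>): number {
    if (this.status === "SUBMITTED") return this.finalScore as number;
    let correct = 0;
    this.answers.forEach((answer, key) => {
      if (answerKey[key] === answer) correct += 1;
    });
    this.finalScore = correct;
    this.status = "SUBMITTED";
    return correct;
  }

  score(): number | null { return this.finalScore; }
  answerCount(): number { return this.answers.size; }
}
```

Each invariant maps to one test: the test arranges a clock value, performs a single action, and asserts on one concept, in line with principles P3 and P4.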

4.5 Mutation Testing

  • Tool: Stryker (TS), go-mutesting (Go), mutmut (Python).
  • Run frequency: Nightly on main; on-demand in PRs touching domain/**.
  • Budget: Each domain module ≤ 15 minutes; otherwise sharded.
  • Gate: Mutation score drop > 5 percentage points blocks merge.

5. Integration Testing — Service Layer

5.1 Definition

Integration tests exercise one service's real code against real infrastructure dependencies (Postgres, Redis, Kafka/NATS, S3, OpenSearch) running in Testcontainers. External SaaS (Stripe, Clerk, LLM providers) are replaced by recorded contract fakes (see §6).

5.2 What They Prove

  • SQL queries match the schema and indexes (regression against migration drift).
  • Transactional boundaries hold (including SAGA compensations for cross-service flows — §5.5).
  • Event publication via the transactional outbox is correct and effectively exactly-once from the consumer's perspective (at-least-once delivery, deduplicated by event_id).
  • Caching layers (Redis) are invalidated on writes.
  • Row-Level-Security policies enforce tenant isolation (see 13-security-compliance-tenancy.md §4).

5.3 Environment

  • Testcontainers spin up: Postgres 16, Redis 7, Kafka (KRaft mode), MinIO, OpenSearch, Mailpit.
  • Containers are per-test-class (shared) with per-test transactional rollback for SQL, and topic-prefix isolation for Kafka.
  • Seeds applied once per suite via Flyway/Liquibase migrations + deterministic fixture loaders.

5.4 Tenant Isolation Tests (mandatory, every service)

Every service that touches tenant data MUST ship a test file tenant_isolation.test.ts containing:

  1. Two tenants seeded with identical entity IDs (to catch ID-leak bugs).
  2. Queries executed as Tenant A MUST NOT return Tenant B rows under any filter.
  3. Direct row ID access (bypassing the service filter) MUST be blocked by RLS.
  4. Bulk/admin endpoints MUST require X-Tenant-Scope: global and elevated role.

5.5 SAGA and Outbox Tests

For each cross-service SAGA (e.g., Enroll → Payment → Provisioning → Notification):

  • Happy path: all steps succeed, final events produced in order.
  • Compensating path: inject failure at step N, assert reversal events and idempotent cleanup.
  • Outbox poll crash: kill the outbox relay mid-flush; on restart, no duplicate publishes (dedupe via event_id).
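The outbox crash case can be modelled in memory to show what the integration test asserts. The sketch below is an assumption-laden stand-in: OutboxRow, DedupingConsumer, flushOutbox, and the crashAfter parameter are hypothetical names, and the real test would run against Postgres and Kafka in Testcontainers. It assumes the relay publishes first and marks the row afterwards, so a crash between the two steps causes a re-publish that the consumer must absorb by event_id.

```typescript
interface OutboxRow { eventId: string; payload: string; published: boolean; }

// Consumer that applies each event_id at most once.
class DedupingConsumer {
  private readonly seen = new Set<string>();
  readonly applied: string[] = [];
  handle(eventId: string, payload: string): void {
    if (this.seen.has(eventId)) return; // duplicate delivery: ignore
    this.seen.add(eventId);
    this.applied.push(payload);
  }
}

// Relay: publish each unpublished row, then mark it as published.
// `crashAfter` simulates the process dying after N publishes, before the
// matching "published" flag is written.
function flushOutbox(rows: OutboxRow[], consumer: DedupingConsumer, crashAfter?: number): void {
  let publishedThisRun = 0;
  for (const row of rows) {
    if (row.published) continue;
    consumer.handle(row.eventId, row.payload);
    publishedThisRun += 1;
    if (crashAfter !== undefined && publishedThisRun === crashAfter) {
      throw new Error("relay crashed mid-flush");
    }
    row.published = true;
  }
}
```

A test built on this shape runs the relay once with a crash, runs it again without one, and asserts that the consumer applied every event exactly once and in order.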

6. Contract Testing — APIs & Events

Contracts are the load-bearing beams of a distributed system. We use consumer-driven contracts for both HTTP APIs and events.

6.1 HTTP Contract Testing (Pact)

  • Consumer side: Each frontend and each downstream service publishes a Pact file to the broker on every PR.
  • Provider side: Services verify against all current consumer contracts in CI; failure blocks merge.
  • OpenAPI alignment: A generator validates that each Pact interaction is covered by the OpenAPI spec, and vice-versa — no undocumented endpoints, no unused OpenAPI paths.
  • Breaking change policy: Removing or narrowing a field requires a versioned endpoint (/v2/...) and a 90-day deprecation window.

6.2 GraphQL Contract Testing

  • Federation subgraphs run rover subgraph check against the published supergraph schema.
  • Consumer queries are extracted from frontend builds and replayed as persisted-query tests.

6.3 Event Contract Testing

Events are versioned via Avro schemas in the Schema Registry (Confluent-compatible). Each event topic has:

  • schema-compatibility: BACKWARD_TRANSITIVE (consumers can read any historical version).
  • A producer contract test that publishes a golden sample per version and asserts schema-registry acceptance.
  • A consumer contract test per downstream service that replays golden samples and asserts projection correctness.

6.4 Event Catalog Tests

The event catalog (see 04-event-driven-architecture.md §6) is machine-readable YAML. CI asserts:

  • Every event produced in code appears in the catalog.
  • Every catalog entry has ≥1 producer test and ≥1 consumer test.
  • Every event has a documented retention, PII classification, and replay eligibility flag.

7. End-to-End Testing — User Journeys

7.1 Canonical Journeys (must always be green on main)

| ID | Journey | Primary actor | Tooling |
| --- | --- | --- | --- |
| J-01 | Learner signs up → enrolls in free course → completes first lesson | Learner | Playwright |
| J-02 | Learner purchases paid course → receives receipt → starts course | Learner | Playwright + Stripe test clock |
| J-03 | Learner downloads bundle → goes airplane mode → completes quiz → syncs | Learner | Playwright + device emulation |
| J-04 | Instructor creates assignment → grades with AI-assist → publishes feedback | Instructor | Playwright |
| J-05 | Author builds course (drag-drop, AI-generate, preview) → publishes v1 → edits v2 | Author | Playwright |
| J-06 | Admin provisions tenant → configures SSO → invites cohort | Admin | Playwright |
| J-07 | Learner asks AI tutor a question → receives cited answer → flags hallucination | Learner | Playwright + AI-eval harness |
| J-08 | Cohort teacher runs live session → projects results → exports gradebook | Instructor | Playwright |
| J-09 | Learner on low-end Android completes lesson offline for 7 days → reconnects | Learner | Playwright + Android emulator |
| J-10 | Tenant requests data export (GDPR Art. 20) → receives archive → confirms erasure | Admin/DPO | Playwright + archive validator |

7.2 E2E Principles

  • Run against deployed staging, not local docker-compose, for PR smoke.
  • Seeded tenants are prefixed e2e-<runId>- and cleaned within 1 hour by a janitor job.
  • No shared state between tests; each test creates and destroys its own learner/cohort.
  • Visual regression via Playwright snapshots at viewport widths of 320, 768, 1024, and 1440 px for all learner-facing pages.
  • Trace + video + screenshot always captured, uploaded to artifact store, linked from failure report.

7.3 Performance Budget (inside E2E)

  • Lighthouse assertions inside Playwright tests for key landing, dashboard, and player pages. See §10 for thresholds.

8. Offline Testing

Offline is a tier-0 invariant. A single offline regression is a P0 incident.

8.1 Scenarios

| ID | Scenario | Expected behavior |
| --- | --- | --- |
| O-01 | Airplane mode mid-lesson | Player continues; progress queued; UI shows offline badge. |
| O-02 | 7-day offline streak | All queued events persisted; no data loss on sync. |
| O-03 | Bundle tamper (hash mismatch) | Player refuses to load; user prompted to re-download; incident logged. |
| O-04 | Clock skew (device time in the past) | Events accepted server-side via logical clocks; no silent re-ordering. |
| O-05 | Conflict: same attempt edited on two devices | Deterministic resolution per CRDT/LWW rules in 12-data-models.md §8. |
| O-06 | Storage full on device | Graceful degradation: evict LRU bundles; warn user; never lose unsynced events. |
| O-07 | Background sync killed by OS | Resume on next app open; no duplicate submissions. |
| O-08 | Partial bundle download | Resumable via range requests; hash verified before activation. |
| O-09 | Expired offline license | Player blocks new attempts; allows sync of existing queued events. |
| O-10 | Cross-device restore from account recovery | Learner sees same progress on new device after sync. |

8.2 Tooling

  • Playwright with context.setOffline(true) for web.
  • Detox + Android emulator airplane-mode toggle for mobile.
  • Chaos Monkey for networks: toxiproxy to simulate latency, packet loss, bandwidth caps (2G, 3G, flaky Wi-Fi).
  • Bundle fuzzer: bit-flip + truncation + metadata tampering harness that must always be rejected.

8.3 Sync Conflict Test Matrix

Every mergeable entity (progress, notes, bookmarks, assignment drafts) has a matrix:

             Device A change type
             │ add │ update │ delete │
Device B ────┼─────┼────────┼────────┤
add          │ M1  │ M2     │ M3     │
update       │ M2  │ M4     │ M5     │
delete       │ M3  │ M5     │ M6     │

Each cell (M1–M6) has at least 2 tests: one where A wins by timestamp, one where B wins. Expected resolutions are documented in 12-data-models.md §8.4.
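The matrix lends itself to generated test cases. The sketch below assumes a plain last-writer-wins rule with a device-ID tie-break; that rule, and the names DeviceChange, resolveLww, and matrixCases, are illustrative assumptions. The authoritative resolutions are the ones in 12-data-models.md §8.4 and may differ per cell (for example, a delete could be defined to win regardless of timestamp).

```typescript
type ChangeType = "add" | "update" | "delete";

interface DeviceChange {
  type: ChangeType;
  value: string | null; // null for deletes
  timestampMs: number;
  deviceId: string;
}

// Assumed rule: later timestamp wins; equal timestamps fall back to device ID
// so that both devices reach the same answer independently.
function resolveLww(a: DeviceChange, b: DeviceChange): DeviceChange {
  if (a.timestampMs !== b.timestampMs) return a.timestampMs > b.timestampMs ? a : b;
  return a.deviceId >= b.deviceId ? a : b;
}

interface MatrixCase { a: DeviceChange; b: DeviceChange; expectedWinner: "A" | "B"; }

// Two cases per ordered (A type, B type) pair: one where A is newer, one where B is.
function matrixCases(): MatrixCase[] {
  const types: ChangeType[] = ["add", "update", "delete"];
  const winners: Array<"A" | "B"> = ["A", "B"];
  const cases: MatrixCase[] = [];
  for (const typeA of types) {
    for (const typeB of types) {
      for (const expectedWinner of winners) {
        const make = (deviceId: "A" | "B", type: ChangeType): DeviceChange => ({
          type,
          value: type === "delete" ? null : `${deviceId}-value`,
          timestampMs: deviceId === expectedWinner ? 2000 : 1000,
          deviceId,
        });
        cases.push({ a: make("A", typeA), b: make("B", typeB), expectedWinner });
      }
    }
  }
  return cases;
}
```

Generating the cases from the axes keeps the suite in step with the matrix: adding a fourth change type would add its row and column of tests without hand-written duplication.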


9. AI Testing

AI testing is the most novel and the most risky surface. We treat it with the same rigor as security testing.

9.1 Categories

| Category | What it catches |
| --- | --- |
| Prompt regression | Output drift across prompt or model version changes. |
| Safety (harms) | Harassment, self-harm encouragement, age-inappropriate content. |
| Hallucination | Fabricated citations, invented facts, wrong math. |
| Structured output | Invalid JSON, missing required fields, wrong types. |
| PII leakage | Names, emails, tokens, embedded training data leaking into output. |
| Jailbreak resistance | Prompt injection from user content (assignment text, PDF OCR, uploaded images). |
| Cost / latency | Token usage, p95 latency, provider-side error rates. |
| Bias & fairness | Differential output quality across demographics, languages, dialects. |
| Curriculum drift | Output contradicts published curriculum standards. |

9.2 Prompt Regression

  • Golden set: ≥ 1,000 prompts per AI feature (tutor, grader, generator, summarizer), curated by subject-matter experts and labeled with expected answer classes.
  • Scoring:
    • Deterministic checks: JSON schema, required citations, forbidden tokens.
    • LLM-as-judge: a separate, cheaper model scores similarity to expected answer along rubric dimensions (accuracy, helpfulness, tone).
    • Human-in-the-loop: 5% random sample reviewed weekly.
  • Gate: A new prompt/model must score ≥ parity on the golden set before replacing the current version. Statistical test: bootstrapped 95% CI on delta.
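The statistical gate can be sketched as a paired percentile bootstrap over per-prompt score deltas. Everything below is illustrative: the function names, the 2,000-resample default, the seeded generator, and in particular the pass rule (lower CI bound not below zero minus a tolerance) are assumptions, since the text names the test but not its exact decision rule.

```typescript
// Small seeded linear congruential generator so eval runs are reproducible.
function seededRandom(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state * 1664525 + 1013904223) % 4294967296;
    return state / 4294967296;
  };
}

interface DeltaCi { mean: number; lower: number; upper: number; }

// Paired bootstrap: both arrays hold scores for the same golden prompts, in the same order.
function bootstrapDeltaCi(baseline: number[], candidate: number[], resamples = 2000, seed = 42): DeltaCi {
  if (baseline.length !== candidate.length || baseline.length === 0) {
    throw new Error("score arrays must be non-empty and paired");
  }
  const deltas = candidate.map((score, i) => score - baseline[i]);
  const rand = seededRandom(seed);
  const means: number[] = [];
  for (let r = 0; r < resamples; r++) {
    let sum = 0;
    for (let i = 0; i < deltas.length; i++) {
      sum += deltas[Math.floor(rand() * deltas.length)]; // sample with replacement
    }
    means.push(sum / deltas.length);
  }
  means.sort((x, y) => x - y);
  const quantile = (q: number) => means[Math.min(means.length - 1, Math.floor(q * means.length))];
  const mean = deltas.reduce((s, d) => s + d, 0) / deltas.length;
  return { mean, lower: quantile(0.025), upper: quantile(0.975) };
}

// Assumed gate: the 95% CI on (candidate - baseline) must not extend below -tolerance.
function passesParityGate(ci: DeltaCi, tolerance = 0): boolean {
  return ci.lower >= -tolerance;
}
```

Pairing the scores per prompt removes between-prompt variance from the comparison, which matters when the golden set mixes easy and hard prompts.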

9.3 Safety Testing

  • Red-team corpus: 5,000+ adversarial prompts covering self-harm, hate, sexual content, violence, drugs, weapons, cheating, and education-specific attacks ("write my essay", "solve my test for me and hide it").
  • Every model deploy: 100% of corpus re-run; zero policy violations permitted. Violations block deploy.
  • Policy coverage: Every safety policy (internal doc SAFETY-001 through SAFETY-037) has ≥ 10 probes.

9.4 Hallucination Detection

  • Citation verifier: Every AI tutor answer with a citation is validated — the cited source must exist in the retrieval index, and the cited passage must contain a semantically similar sentence (bi-encoder similarity > 0.75).
  • Fact checker sub-chain: For math/science answers, a deterministic verifier (e.g., SymPy for algebra) re-checks the AI's final numeric answer.
  • Uncertainty escalation: Answers below a confidence threshold trigger a "not sure, routing to teacher" fallback; tests assert this path fires on known ambiguous prompts.
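The citation check above reduces to two conditions: the source exists, and some sentence in it is close enough to the claim. The sketch below assumes embeddings have already been computed by a bi-encoder and are supplied as plain number arrays; CitedPassage, citationIsSupported, and the in-memory index are hypothetical stand-ins for the real retrieval index.

```typescript
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length === 0) {
    throw new Error("vectors must be non-empty and of equal length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // zero vector: treat as no similarity
  return dot / Math.sqrt(normA * normB);
}

// Hypothetical index entry: one embedding per sentence of the cited passage.
interface CitedPassage { sourceId: string; sentenceEmbeddings: number[][]; }

function citationIsSupported(
  claimEmbedding: number[],
  citedSourceId: string,
  index: Map<string, CitedPassage>,
  threshold = 0.75, // threshold taken from the text above
): boolean {
  const passage = index.get(citedSourceId);
  if (!passage) return false; // the cited source does not exist in the index
  return passage.sentenceEmbeddings.some(s => cosineSimilarity(claimEmbedding, s) > threshold);
}
```

A test suite for the verifier would include a fabricated source ID and a real source with an unrelated claim, since those are the two ways a citation can be hallucinated.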

9.5 Structured Output Validation

  • All AI outputs consumed programmatically MUST declare a JSON schema (Zod/Pydantic/Ajv).
  • Failure test: Inject adversarial model outputs (truncated JSON, trailing comments, wrong types); system MUST retry-with-repair then fall back to a safe default. Silent coercion is forbidden.
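The retry-with-repair path can be sketched without a schema library. The example below hand-rolls the validation that Zod, Pydantic, or Ajv would perform in practice; GradeResult, SAFE_DEFAULT, and the repair callback (standing in for a second model call) are assumptions made for illustration.

```typescript
interface GradeResult { score: number; feedback: string; }

// Assumed safe default; the real fallback is feature-specific.
const SAFE_DEFAULT: GradeResult = { score: 0, feedback: "Automatic grading unavailable; routed to instructor." };

// Strict parse: returns null instead of coercing or guessing.
function parseStrict(raw: string): GradeResult | null {
  let value: unknown;
  try {
    value = JSON.parse(raw);
  } catch {
    return null; // truncated JSON, trailing comments, and similar
  }
  if (typeof value !== "object" || value === null) return null;
  const v = value as Record<string, unknown>;
  if (typeof v.score !== "number" || !Number.isFinite(v.score)) return null; // "7" is not coerced to 7
  if (typeof v.feedback !== "string") return null;
  return { score: v.score, feedback: v.feedback };
}

type OutputSource = "model" | "repaired" | "fallback";

function validateWithRepair(
  firstOutput: string,
  repair: (badOutput: string) => string,
  maxRepairs = 1,
): { result: GradeResult; source: OutputSource } {
  let raw = firstOutput;
  for (let attempt = 0; attempt <= maxRepairs; attempt++) {
    const parsed = parseStrict(raw);
    if (parsed) return { result: parsed, source: attempt === 0 ? "model" : "repaired" };
    if (attempt < maxRepairs) raw = repair(raw);
  }
  return { result: SAFE_DEFAULT, source: "fallback" };
}
```

Returning the source alongside the result lets the failure test assert which path fired, and lets production metrics track how often repair or fallback is needed.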

9.6 PII Leakage

  • Canary data: Seeded synthetic PII strings (canary-pii-XXXX) in training-adjacent logs and retrieval sources.
  • Test: A daily probe suite asks the model innocuous questions; if any canary surfaces in the output, a P1 incident is raised.

9.7 Jailbreak / Prompt Injection

  • Corpus of 2,000+ injection attempts embedded in user-uploaded content (PDFs, images via OCR, assignment text).
  • Assertion: System prompt and tool-access invariants hold; assistant never reveals system prompt, never exfiltrates other users' data, never calls destructive tools.

9.8 Eval Infrastructure

  • Harness: Internal ghasi-evals framework built on Inspect / promptfoo conventions.
  • Storage: Eval runs versioned in object storage with prompt hash, model ID, parameters, and outputs. Queryable in ClickHouse.
  • Dashboards: Grafana — per-feature quality score over time, regression alerts.

10. Load, Performance & Scalability Testing

10.1 Tooling

  • k6 for HTTP and gRPC (primary).
  • Locust for complex stateful flows (scripted learners over long sessions).
  • Gatling for a secondary, JVM-native cross-check on Payments.

10.2 Per-Service Load Scenarios

| Service | Peak target | SLO (p95) | Scenario |
| --- | --- | --- | --- |
| API Gateway | 20k rps | 80 ms | Mixed auth + unauth. |
| Identity | 2k rps sign-in | 200 ms | Password + SSO + MFA. |
| Enrollment | 1k rps | 150 ms | First-day-of-term burst × 10. |
| Progress | 10k events/s | 300 ms (end-to-end to projection) | Cohort of 50k learners simultaneously answering. |
| AI Tutor | 500 concurrent sessions | 2.5 s first token | Sustained + spike. |
| Authoring | 200 rps write | 400 ms | Bulk import of 10k questions. |
| Payments | 300 rps | 500 ms | Black-Friday-class burst. |
| Content Delivery (CDN origin) | 5k rps | 100 ms | Bundle downloads, warm + cold cache. |
| Notifications | 5k msg/s | async | Fan-out to cohort of 100k. |

10.3 System-Wide Scenarios

  • First-Day-of-Term: 500k learners log in within 1 hour, each enrolls in 3 courses, each starts 1 lesson. Success: 5xx rate ≤ 0.1% and p95 latency within each service's SLO.
  • Global Event Storm: 1M progress events/minute for 10 minutes. Success: no consumer lag > 30 s; no data loss.
  • Graceful Degradation: Kill the AI provider; assert tutor fails open with clear UX; no cascading failures.

10.4 Frontend Performance Budgets

Per [web/performance.md], enforced in CI via Lighthouse CI:

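| Page | LCP | INP | CLS | JS (gzip) |
| --- | --- | --- | --- | --- |
| Marketing landing | 2.0 s | 150 ms | 0.05 | 120 KB |
| Learner dashboard | 2.5 s | 200 ms | 0.10 | 280 KB |
| Player (lesson) | 2.0 s | 150 ms | 0.05 | 220 KB |
| Authoring canvas | 3.0 s | 250 ms | 0.10 | 600 KB (tooling-heavy, justified) |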

11. Accessibility Testing — WCAG 2.2 AA

11.1 Commitment

Ghasi targets WCAG 2.2 Level AA across all learner- and instructor-facing surfaces, with Level AAA aspiration for the Player (lesson runtime).

11.2 Layered Approach

| Layer | Tool | Coverage |
| --- | --- | --- |
| Static | eslint-plugin-jsx-a11y | Every PR |
| Component | @storybook/addon-a11y + axe-core | Every story, every variant |
| E2E | axe-playwright scan per E2E journey | Every PR |
| Keyboard | Scripted tab-order + focus-trap tests | Every interactive component |
| Screen reader | Manual NVDA/VoiceOver/TalkBack audits | Per release |
| Low-vision | Contrast, zoom 200%, reflow at 320 CSS px | Per release |
| Cognitive | Reading-level + plain-language review | Per content release (editorial) |
| Motor | Target size ≥ 24×24 CSS px (2.5.8), no hover-only | Linter + manual |

11.3 Required Assertions

  • All interactive elements reachable by keyboard.
  • Focus ring visible and meets 3:1 contrast against adjacent colors.
  • prefers-reduced-motion honored (all motion in §10-adjacent UI).
  • Captions on all instructional video; transcripts downloadable.
  • Math rendered via MathML + ARIA label fallback (not image-only).
  • Error messages programmatically associated with fields (aria-describedby).
  • Language of page and parts declared (lang attribute per passage for multilingual content).

11.4 Gate

Axe violations at serious or critical severity block merge. moderate requires a triage ticket within 48 h.


12. Security Testing

12.1 SAST (Static)

  • Tools: Semgrep (custom ruleset + OWASP Top 10), CodeQL, gitleaks, trivy-config.
  • Gate: Any high or critical finding blocks merge. medium opens a ticket.
  • Custom rules: No dangerouslySetInnerHTML; no raw SQL concatenation; no eval; no child_process.exec with user input; no import of deprecated crypto.

12.2 SCA (Dependencies)

  • Tools: Dependabot + Snyk + OSV-Scanner.
  • Policy: Criticals patched within 24 h; highs within 7 d; mediums within 30 d.
  • Supply-chain: SBOM (CycloneDX) produced per build; artifact signing via Sigstore/cosign; provenance via SLSA level 3.

12.3 DAST (Dynamic)

  • Tools: OWASP ZAP (authenticated scans), Burp Suite Enterprise.
  • Cadence: Nightly against staging; pre-release against pre-prod.
  • Coverage: Every public endpoint + representative authenticated endpoints per role.

12.4 Pen-Testing

  • External firm engaged quarterly with rotating scope: external perimeter, internal tenant isolation, mobile app, AI surfaces (prompt injection, model extraction).
  • Internal red team runs a monthly exercise mirroring a realistic attacker TTP chain; blue team must detect and respond.
  • Bug bounty: Public program (scoped) with clear SLA for response.

12.5 AuthZ Matrix Testing

Every role × resource × action combination is expressed as a policy decision test against the OPA/Cedar policy engine:

roles:     [anonymous, learner, instructor, author, tenant-admin, super-admin, dpo, support]
resources: [Course, Lesson, Attempt, Grade, Cohort, Tenant, BillingAccount, UserPII, AuditLog, AIPrompt]
actions:   [read, list, create, update, delete, export, impersonate, configure]

Matrix size = 8 × 10 × 8 = 640 cells. Each cell declares allow or deny; tests assert both directions. A newly added role or resource cannot ship without filling its column/row.
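A completeness check over the matrix might look like the sketch below. The three axes are copied from the listing above; the "role|resource|action" key format, the DecisionTable type, and the decide callback (a stand-in for a call to the OPA/Cedar engine) are assumptions for illustration.

```typescript
const ROLES = ["anonymous", "learner", "instructor", "author", "tenant-admin", "super-admin", "dpo", "support"] as const;
const RESOURCES = ["Course", "Lesson", "Attempt", "Grade", "Cohort", "Tenant", "BillingAccount", "UserPII", "AuditLog", "AIPrompt"] as const;
const ACTIONS = ["read", "list", "create", "update", "delete", "export", "impersonate", "configure"] as const;

type Decision = "allow" | "deny";
type DecisionTable = Record<string, Decision>; // key: "role|resource|action"

function cellKey(role: string, resource: string, action: string): string {
  return `${role}|${resource}|${action}`;
}

// Enumerate every role × resource × action combination.
function allCells(): string[] {
  const cells: string[] = [];
  for (const role of ROLES) {
    for (const resource of RESOURCES) {
      for (const action of ACTIONS) {
        cells.push(cellKey(role, resource, action));
      }
    }
  }
  return cells;
}

// Cells with no declared decision; a non-empty result should fail CI.
function undeclaredCells(table: DecisionTable): string[] {
  return allCells().filter(key => !(key in table));
}

// Cells where the engine disagrees with the declared decision, in either direction.
function mismatches(table: DecisionTable, decide: (cell: string) => Decision): string[] {
  return Object.keys(table).filter(key => decide(key) !== table[key]);
}
```

Because the cells are derived from the axis lists, adding a role or resource immediately produces undeclared cells, which is how the "cannot ship without filling its column/row" rule becomes enforceable.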

12.6 Threat-Model-Driven Tests

For every threat in the STRIDE model maintained in 13-security-compliance-tenancy.md, at least one test case exists proving the mitigation works.


13. Chaos & Resilience Testing

13.1 Principles

  • Run chaos in staging continuously, in production monthly with a blast-radius limit.
  • Every chaos experiment has a written hypothesis, a measurable steady state, and a rollback.

13.2 Experiment Catalog (excerpt)

| ID | Experiment | Hypothesis | Blast radius |
| --- | --- | --- | --- |
| C-01 | Kill one Progress service pod | ≤ 0.5% increase in 5xx for 30 s | 1 pod |
| C-02 | 50% packet loss between Gateway and Identity | Retries absorb; auth p99 < 2 s | one AZ pair |
| C-03 | Partition Kafka broker | Producers buffer via outbox; no data loss | one broker |
| C-04 | CPU stress on all AI workers | Tutor degrades to cached responses; non-AI unaffected | AI pool only |
| C-05 | Redis primary failover | Session cache rebuilds within 60 s; no user-visible logouts | staging only |
| C-06 | Postgres read replica lag 30 s | Read-your-writes queries route to primary; no stale reads surfaced to learners mid-attempt | staging only |
| C-07 | LLM provider 500s for 10 min | Circuit breaker opens; fallback model engaged; eval-latency SLO holds | AI pool only |
| C-08 | CDN origin offline | Bundle downloads route to secondary; no 404s surface | staging only |

13.3 Tooling

  • LitmusChaos or Chaos Mesh in Kubernetes.
  • AWS Fault Injection Simulator for managed services.
  • toxiproxy for deterministic network faults in integration tests.

14. Replay Testing — Event Log Rebuild

Per 04-event-driven-architecture, any projection/read-model is reproducible from the event log. This is a compliance and recovery guarantee, not an optimization.

14.1 Guarantees Tested

  1. Deterministic replay: Rebuilding a projection from event index 0 produces a byte-identical (or schema-equivalent) result to the live projection.
  2. Partial replay: Replaying from any checkpoint reconstructs correct state.
  3. Out-of-order safety: Replaying events in a shuffled-but-causally-valid order produces the same final state (for CRDT/idempotent projections) or rejects (for strict-ordering projections).
  4. Schema migration: Old event schemas are readable by current consumers via registry + upcasters.
  5. PII redaction on replay: Events marked PII:erased (post-GDPR-erasure) MUST replay with tombstoned fields; downstream state must reflect erasure.
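Guarantees 1 to 3 can be illustrated with a toy projection. The sketch assumes an idempotent, order-insensitive read model (the set of lessons a learner has completed); LessonCompleted, ProgressProjection, replay, and canonical are hypothetical names, and strict-ordering projections would need a different harness that asserts rejection instead.

```typescript
interface LessonCompleted { eventId: string; learnerId: string; lessonId: string; }

// Read model: learnerId -> sorted list of completed lessonIds.
type ProgressProjection = Record<string, string[]>;

// Pure reducer: duplicates are ignored and lesson lists are kept sorted,
// so the result does not depend on arrival order.
function applyEvent(state: ProgressProjection, e: LessonCompleted): ProgressProjection {
  const lessons = state[e.learnerId] || [];
  if (lessons.indexOf(e.lessonId) !== -1) return state;
  return { ...state, [e.learnerId]: [...lessons, e.lessonId].sort() };
}

// Replay from index 0 (empty state) or from a checkpointed state.
function replay(events: LessonCompleted[], from: ProgressProjection = {}): ProgressProjection {
  return events.reduce(applyEvent, from);
}

// Canonical serialization so two projections can be compared byte for byte.
function canonical(state: ProgressProjection): string {
  return JSON.stringify(Object.keys(state).sort().map(key => [key, state[key]]));
}
```

A replay test then compares canonical(replay(log)) against the live projection, against a replay of the same log in a different causally valid order, and against a replay resumed from a checkpoint.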

14.2 Harness

  • Nightly job: Rebuild every projection from scratch in a sandboxed cluster; diff against live; any drift is a P1.
  • Per-PR job: For changed projections, rebuild from a canned 10k-event fixture and assert equality.

15. Multi-Device Sync Testing

15.1 Why It Warrants Its Own Section

Ghasi learners routinely use 2–3 devices (phone + tablet + school PC). Sync bugs are invisible in single-device test suites.

15.2 Scenarios

  • Simultaneous edit: Same learner edits notes on phone and tablet within 2 s; both devices converge to the same final state within 10 s of reconnection.
  • Staggered edit: Phone edits offline for 3 days; tablet edits online; merge on reconnection respects documented precedence.
  • Device decommission: Learner signs out of phone; queued unsynced events must flush before logout completes (or be migrated to new device via account-bound queue).
  • Clock skew: Phone clock 10 min ahead; sync still ordered correctly via logical clocks (Lamport/HLC).
  • Partial-sync visibility: Tablet shows a progress indicator during large sync; never presents partially-synced state as authoritative.
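The clock-skew scenario depends on ordering by logical rather than wall-clock time. Below is a minimal hybrid logical clock sketch following the commonly described HLC update rules; the class shape, field names, and the node-ID tie-break are assumptions, and the platform's actual clock implementation is not specified in this document.

```typescript
interface HlcTimestamp { wallMs: number; counter: number; nodeId: string; }

class HybridLogicalClock {
  private wallMs = 0;
  private counter = 0;
  private readonly nodeId: string;
  private readonly physicalNow: () => number;

  constructor(nodeId: string, physicalNow: () => number) {
    this.nodeId = nodeId;
    this.physicalNow = physicalNow;
  }

  // Local edit or outgoing message.
  tick(): HlcTimestamp {
    const pt = this.physicalNow();
    if (pt > this.wallMs) {
      this.wallMs = pt;
      this.counter = 0;
    } else {
      this.counter += 1; // wall clock has not moved past what we have already seen
    }
    return this.stamp();
  }

  // Incoming message from another device.
  receive(remote: HlcTimestamp): HlcTimestamp {
    const pt = this.physicalNow();
    const maxWall = Math.max(this.wallMs, remote.wallMs, pt);
    if (maxWall === this.wallMs && maxWall === remote.wallMs) {
      this.counter = Math.max(this.counter, remote.counter) + 1;
    } else if (maxWall === this.wallMs) {
      this.counter += 1;
    } else if (maxWall === remote.wallMs) {
      this.counter = remote.counter + 1;
    } else {
      this.counter = 0;
    }
    this.wallMs = maxWall;
    return this.stamp();
  }

  private stamp(): HlcTimestamp {
    return { wallMs: this.wallMs, counter: this.counter, nodeId: this.nodeId };
  }
}

// Total order: wall component, then counter, then node ID as a tie-break.
function compareHlc(a: HlcTimestamp, b: HlcTimestamp): number {
  if (a.wallMs !== b.wallMs) return a.wallMs - b.wallMs;
  if (a.counter !== b.counter) return a.counter - b.counter;
  return a.nodeId < b.nodeId ? -1 : a.nodeId > b.nodeId ? 1 : 0;
}
```

In the skew scenario, the tablet receives the phone's edit before making its own, so the tablet's next timestamp sorts after the phone's even though the tablet's wall clock is ten minutes behind.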

15.3 Harness

  • Playwright multi-context: Two browser contexts simulate two devices; a third orchestrator drives scenarios.
  • Mobile: Detox drives two Android emulators in parallel under Bazel/Tuist.
  • Assertion library: Custom expectConvergedState(deviceA, deviceB, { within: '10s' }).

16. Testing Environments & Data Management

16.1 Environments

| Env | Purpose | Data | Refresh |
| --- | --- | --- | --- |
| local | Developer loop | Ephemeral Testcontainers | Per-run |
| ci | Per-PR validation | Ephemeral Testcontainers + synthetic seed | Per-run |
| preview | PR review env (optional, per-service) | Scrubbed prod snapshot | Per-PR |
| staging | Integration + chaos + load | Scrubbed prod snapshot, refreshed weekly | Weekly |
| pre-prod | Release candidate validation | Full synthetic at prod scale | Per-release |
| prod | Real users | Real | n/a |

16.2 Test Data Principles

  • No real PII in non-prod. Scrubbing pipeline uses deterministic pseudonymization (FPE for IDs, Faker for names/emails, stable per-learner).
  • Synthetic generators for load: ghasi-datagen produces realistic cohorts, courses, and event streams at arbitrary scale.
  • Fixture libraries versioned under /fixtures/<service>/; generated artifacts checked in for reproducibility.
  • Data retention in test envs: 30 days; automatic purge; no backups.

16.3 Secrets in Tests

  • Never real secrets. Each env has its own keyring (AWS Secrets Manager / Vault).
  • Test secrets rotated weekly via automation.

17. CI/CD Quality Gates

17.1 Pipeline Stages

┌─────────────────────────────────────────────────────────────────────┐
│ 1. Lint + Format + Type-check (≤ 90 s) │
│ 2. Unit tests (≤ 4 min) │
│ 3. Mutation (changed files only) (≤ 3 min) │
│ 4. SAST + SCA + secrets scan (≤ 2 min) │
│ 5. Integration tests (Testcontainers) (≤ 12 min) │
│ 6. Contract tests (Pact + Avro) (≤ 3 min) │
│ 7. Build + SBOM + sign (≤ 5 min) │
│ 8. E2E smoke (10 critical journeys) (≤ 8 min) │
│ 9. Accessibility scan (≤ 3 min) │
│10. Lighthouse budgets (≤ 3 min) │
│11. Deploy to preview (optional) │
│12. DAST (nightly on staging) │
│13. Load smoke (nightly on staging) │
│14. AI-eval regression (on AI changes or nightly) │
└─────────────────────────────────────────────────────────────────────┘

17.2 Gate Policy

| Stage | Failure policy |
| --- | --- |
| 1–7 | Hard block, no override |
| 8 | Hard block, no override |
| 9 | Hard block for serious/critical; ticket for moderate |
| 10 | Hard block if any budget exceeded by > 10% |
| 12 | Hard block for high/critical findings; incident for medium |
| 14 | Hard block if golden-set score regresses > 2% or any safety violation |

17.3 Deployment Gates

  • Canary: 5% traffic for 30 min; auto-rollback on error-rate or latency breach vs. baseline.
  • Progressive: 25% → 50% → 100% over 2 h.
  • Feature flags: All risky features dark-launched behind LaunchDarkly flags with ramp plan in the PR description.

17.4 Post-Deploy Verification

  • Synthetic checks (uptime, critical journeys) every minute.
  • SLO burn-rate alerts (2%, 5%, 10% of monthly budget) to on-call.
  • Automatic rollback if error budget burns > 10% in 1 h.

18. Coverage Expectations per Service

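| Service | Unit | Integration | Contract | E2E (journeys) | AI-eval | Load SLO |
| --- | --- | --- | --- | --- | --- | --- |
| identity | 90% | 85% | Pact + OIDC | J-01, J-06 |  | 200 ms p95 |
| tenancy | 90% | 85% | Pact | J-06 |  | 150 ms |
| catalog | 85% | 80% | Pact + GraphQL | J-01, J-02 |  | 100 ms |
| enrollment | 90% | 85% | Pact + events | J-01, J-02 |  | 150 ms |
| progress | 95% | 90% | events | J-03, J-09 |  | 300 ms end-to-end |
| attempts/grading | 95% | 90% | events | J-04 | grader-eval | 400 ms |
| authoring | 85% | 80% | Pact | J-05 | generator-eval | 400 ms |
| ai-tutor | 80% | 75% | Pact | J-07 | tutor-eval | 2.5 s TTFT |
| ai-content-gen | 80% | 75% | events | J-05 | generator-eval | async |
| payments | 95% | 90% | Pact + Stripe | J-02 |  | 500 ms |
| notifications | 85% | 80% | events | J-04 |  | async |
| content-delivery | 90% | 85% | Pact | J-03 |  | 100 ms |
| offline-sync | 95% | 90% | events | J-03, J-09 |  | 300 ms |
| analytics | 80% | 75% | events | J-08 |  | async |
| admin | 85% | 80% | Pact | J-06, J-10 |  | 300 ms |
| audit | 90% | 85% | events | J-10 |  | append-only; no data loss |
| search | 85% | 80% | Pact | J-01 |  | 200 ms |
| localization | 85% | 80% | Pact | J-01 | translation-eval | 150 ms |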

19. Tooling Matrix

| Concern | Primary | Secondary | Notes |
| --- | --- | --- | --- |
| Unit (TS) | Vitest | Jest | Vitest preferred; shared config in tooling/vitest |
| Unit (Go) | go test | testify |  |
| Unit (Python) | pytest |  | pytest-xdist for parallel |
| Mutation (TS) | Stryker |  |  |
| Mutation (Go) | go-mutesting |  |  |
| Mutation (Python) | mutmut |  |  |
| Integration | Testcontainers | docker-compose | Testcontainers first |
| Contract (HTTP) | Pact |  | Broker self-hosted |
| Contract (Events) | Schema Registry + custom harness |  | Avro-first |
| E2E | Playwright | Cypress | Playwright standard |
| Mobile E2E | Detox | Maestro |  |
| Visual | Playwright snapshots | Chromatic |  |
| Accessibility | axe-core + axe-playwright | Lighthouse a11y |  |
| Perf (frontend) | Lighthouse CI | WebPageTest |  |
| Load | k6 | Locust, Gatling |  |
| Chaos | LitmusChaos | Chaos Mesh, AWS FIS |  |
| Security (SAST) | Semgrep + CodeQL |  |  |
| Security (DAST) | OWASP ZAP | Burp |  |
| Security (SCA) | Snyk + OSV | Dependabot |  |
| Secrets | gitleaks | trufflehog |  |
| AI eval | ghasi-evals (on promptfoo + Inspect) |  |  |
| Observability | OpenTelemetry + Grafana + ClickHouse |  |  |
| Feature flags | LaunchDarkly | OpenFeature SDK |  |
| Test data | ghasi-datagen | Faker |  |

20. Governance, Ownership & RACI

20.1 Roles

  • Service team — owns unit, integration, contract, service-level E2E, and performance SLOs for its service.
  • Platform QA Guild — owns cross-service E2E, load/chaos harnesses, flake policy, and tooling.
  • AI Safety team — owns safety corpus, eval harness, red-team.
  • Security team — owns SAST/DAST/SCA tooling, authz matrix enforcement, pen-test coordination.
  • Accessibility team — owns axe ruleset, assistive-tech audits.
  • SRE — owns chaos experiments in prod, SLO burn alerts, rollback automation.

20.2 RACI (excerpt)

ActivityService teamQA GuildAI SafetySecuritySRE
Write unit + integration testsR/AC
Maintain E2E journey suiteCR/A
Maintain AI-eval golden setCCR/A
Pen-test schedulingCCR/AC
Chaos in prodCCCR/A
Flake triageRA
Release gate override (emergency)CCCCR/A

20.3 Cadence

  • Weekly: QA Guild reviews flakes, coverage drifts, load regressions.
  • Monthly: AI Safety red-team debrief; chaos experiment retro.
  • Quarterly: External pen-test; full-system GameDay.
  • Annually: Strategy refresh (this document).

Appendix A — Test IDs & Traceability

Every test carries a stable ID following:

<service>.<layer>.<capability>.<scenario>

Examples:

  • enrollment.unit.policy.expired-enrollment-blocks-attempt
  • progress.integration.outbox.crash-mid-flush-no-duplicate
  • tutor.ai.safety.self-harm-corpus-v3
  • player.e2e.offline.seven-day-streak

Traceability is maintained in 06-traceability-matrix.md: every epic/story maps to ≥ 1 test ID at each applicable layer.
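A lint-style validator for the ID format could be sketched as below. The segment pattern (lowercase letters, digits, and hyphens) is inferred from the four examples, and the list of allowed layers is an assumption; only unit, integration, ai, and e2e actually appear in this document.

```typescript
// Assumed layer vocabulary; extend to match the real traceability matrix.
const KNOWN_LAYERS = ["unit", "integration", "contract", "e2e", "ai", "load", "chaos"];

// Each segment: starts with a lowercase letter, then lowercase letters, digits, or hyphens.
const SEGMENT = /^[a-z][a-z0-9-]*$/;

interface TestId { service: string; layer: string; capability: string; scenario: string; }

// Returns the parsed ID, or null when the string does not follow
// <service>.<layer>.<capability>.<scenario>.
function parseTestId(id: string): TestId | null {
  const parts = id.split(".");
  if (parts.length !== 4) return null;
  if (!parts.every(part => SEGMENT.test(part))) return null;
  const [service, layer, capability, scenario] = parts;
  if (KNOWN_LAYERS.indexOf(layer) === -1) return null;
  return { service, layer, capability, scenario };
}
```

Run in CI over all declared test IDs, a check like this keeps the traceability matrix joinable by service and layer without manual cleanup.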


Appendix B — Flake Policy & Quarantine

  1. A test that fails non-deterministically is quarantined within 24 h (tagged @quarantined and excluded from merge gates).
  2. The producing team has 5 business days to either fix and restore, or delete.
  3. A test may not remain quarantined > 10 business days without a written exception from the QA Guild lead.
  4. Quarantine count per service is a tracked KPI; > 5 active quarantines triggers a focused reliability sprint.
  5. Root-cause categories are tagged (flake:timing, flake:shared-state, flake:network, flake:ai-nondeterminism) to drive harness improvements.
