Testing Strategy

:::info Source
Sourced from docs/16-testing-strategy-qa.md in the documentation repo.
:::

Document status: Canonical specification, v1.0
Owner: Platform QA Guild & Chief Architect
Related specs: 01-enterprise-architecture, 02-ddd-bounded-contexts, 03-microservices, 04-event-driven-architecture, 05-api-design, 06-traceability-matrix, 10-authoring-tool-spec, 11-lms-runtime-player-spec, 12-data-models, 13-security-compliance-tenancy, 14-risks-and-tradeoffs

Purpose, Scope & Quality Philosophy
Testing Principles & Non-Negotiables
The Ghasi Test Pyramid (Extended)
Unit Testing — Domain Layer
Integration Testing — Service Layer
Contract Testing — APIs & Events
End-to-End Testing — User Journeys
Offline Testing
AI Testing (Prompt, Safety, Hallucination, Structured Output)
Load, Performance & Scalability Testing
Accessibility Testing — WCAG 2.2 AA
Security Testing — SAST, DAST, Pen-Test, AuthZ Matrix
Chaos & Resilience Testing
Replay Testing — Event Log Rebuild
Multi-Device Sync Testing
Testing Environments & Data Management
CI/CD Quality Gates
Coverage Expectations per Service
Tooling Matrix
Governance, Ownership & RACI
Appendix A — Test IDs & Traceability
Appendix B — Flake Policy & Quarantine

1. Purpose, Scope & Quality Philosophy

Ghasi-edTech is a multi-tenant, AI-first, offline-first, event-driven education platform serving learners, instructors, authors, and administrators across heterogeneous devices (mobile, tablet, desktop, low-end Android, kiosk). Quality assurance on a system of this complexity cannot be reduced to "does it build and do the tests pass." It must be a layered, continuously executed evidence system that proves:

Correctness — The domain behaves as specified in 02-ddd-bounded-contexts.
Safety — AI outputs cannot harm learners (harassment, hallucinated facts, PII leakage, curriculum drift).
Resilience — The system degrades gracefully under failure, network loss, or adversarial load.
Portability of state — A learner's progress, attempts, and artifacts survive device loss, network partitions, and bundle corruption.
Isolation — Tenants cannot observe, influence, or exfiltrate each other's data (per 13-security-compliance-tenancy).
Accessibility — No learner is excluded by disability, device class, or connectivity.
Reproducibility — Any historical state can be rebuilt from the event log (04-event-driven-architecture).

1.1 Scope

This document governs testing for:

All 18 microservices enumerated in 03-microservices.
All frontends: Learner Web, Learner Mobile (iOS/Android), Instructor Console, Authoring Tool, Admin Portal, Public Marketing Site.
All AI pipelines: tutor, content-generation, rubric-grading, summarization, translation, accessibility-captioning, adaptive-path.
All event producers, consumers, and projections.
The offline runtime (11-lms-runtime-player-spec).
All infrastructure-as-code (Terraform, Helm, Crossplane).

1.2 Out of Scope

Manufacturer-level hardware testing of tablets distributed to schools (vendor-owned).
Legal/regulatory sign-off of content (editorial, not QA).
Third-party vendor SaaS internals (Stripe, Clerk, OpenAI, Anthropic) — we test our contracts with them, not their systems.

1.3 Quality Philosophy

"Every production incident is a missing test. Every missing test is a missed risk conversation."

We adopt four guiding stances:

Tests as executable specifications. Domain tests read like the ubiquitous language.
Shift-left, shift-right equally. Pre-merge gates catch regressions; production observability catches emergent behavior; both feed the backlog.
AI is a first-class citizen of the test strategy. Non-determinism does not excuse non-testing — it raises the bar.
Offline is not a feature; it is a tier-0 invariant. Any test plan that only works online is incomplete.

2. Testing Principles & Non-Negotiables

2.1 Principles

#	Principle	Implication
P1	Tests are first-class code	Reviewed, refactored, owned by the producing team; never "thrown over the wall" to QA.
P2	Deterministic by default	Flakes are bugs; see Appendix B.
P3	One assertion concept per test	Multiple `expect` lines are fine; multiple concepts are not.
P4	Arrange-Act-Assert, always	Readability > brevity.
P5	Test behavior, not implementation	Private methods are never tested directly.
P6	Real dependencies where cheap	Testcontainers for Postgres, Redis, Kafka, S3 (MinIO). Mocks only at hard boundaries (payments, external LLMs).
P7	Data builders over fixtures	`aLearner().enrolledIn(course).withAttempts(3).build()` beats JSON blobs.
P8	Tests must fail loudly	No silent `try/catch`; no `if (err) return`.
P9	Coverage is a floor, not a ceiling	80% line coverage is the minimum; 95% branch coverage for domain aggregates.
P10	Every bug fix ships with a regression test	Enforced via PR template checkbox + CI grep.
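Principle P7 can be illustrated with a minimal builder sketch. The `Learner` shape, its field names, and the default values are hypothetical, chosen only to mirror the fluent call quoted in the table; the real builders may look different.

```typescript
// Hypothetical domain shape, for illustration only.
interface Learner {
  id: string;
  tenantId: string;
  enrolledCourseIds: string[];
  attemptCount: number;
}

class LearnerBuilder {
  // Defaults keep each test focused on the fields it actually cares about.
  private learner: Learner = {
    id: "learner-1",
    tenantId: "tenant-a",
    enrolledCourseIds: [],
    attemptCount: 0,
  };

  enrolledIn(courseId: string): this {
    this.learner.enrolledCourseIds = [...this.learner.enrolledCourseIds, courseId];
    return this;
  }

  withAttempts(count: number): this {
    if (count < 0) throw new RangeError("attempt count must be >= 0");
    this.learner.attemptCount = count;
    return this;
  }

  build(): Learner {
    // Return a copy so later builder calls cannot mutate an already built learner.
    return { ...this.learner, enrolledCourseIds: [...this.learner.enrolledCourseIds] };
  }
}

const aLearner = (): LearnerBuilder => new LearnerBuilder();

const learner = aLearner().enrolledIn("course-42").withAttempts(3).build();
```

Compared with a JSON fixture, the builder states only what the test depends on, so unrelated schema changes touch one default rather than every fixture file.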

2.2 Non-Negotiables (merge-blocking)

No PR merges without green CI across all per-PR stages listed in §17.1 (nightly stages are gated separately per §17.2).
No production deploy without a successful canary and a rollback plan verified in staging within the last 7 days.
No new AI prompt in production without a prompt regression suite (§9.2).
No new event schema without a consumer contract test (§6.3).
No new public API endpoint without a Pact contract and an OpenAPI diff review.
No schema migration without a forward+backward compatibility test and a replay test (§14).

3. The Ghasi Test Pyramid (Extended)

The classic pyramid (unit → integration → E2E) is insufficient. Ghasi uses an extended pyramid with orthogonal axes:

                      ┌──────────────────┐
                      │  Chaos / Replay  │   (continuous, production-like)
                      ├──────────────────┤
                      │   E2E Journeys   │   (~200 tests, <15 min)
                      ├──────────────────┤
                      │  Contract Tests  │   (per API, per event; ~500 tests)
                      ├──────────────────┤
                      │   Integration    │   (Testcontainers; ~2,000 tests)
                      ├──────────────────┤
                      │      Unit        │   (domain + utils; ~15,000 tests)
                      └──────────────────┘

  Orthogonal:  Accessibility │ Security │ AI-Eval │ Offline │ Load │ Sync

Each orthogonal axis runs against multiple tiers. For example, accessibility applies to unit (component render), E2E (axe scan), and manual (assistive-tech spot checks).

3.1 Rough Volume Targets

Tier	Count target	Wall-clock budget	Frequency
Unit	12k–18k	< 4 min parallelized	Every commit
Integration	1.5k–2.5k	< 12 min	Every PR
Contract	400–700	< 3 min	Every PR
E2E	150–250 critical paths	< 15 min	Every PR (smoke) / nightly (full)
Load	30+ scenarios	Nightly + pre-release	Nightly
Chaos	20+ experiments	Weekly (staging), monthly (prod)	Weekly
AI-eval	1k+ prompts × models	On prompt change + nightly	Continuous

4. Unit Testing — Domain Layer

4.1 What Counts as a Domain Unit Test

A domain unit test exercises one aggregate, entity, value object, or domain service in isolation, with zero I/O, zero time-dependence unless injected, zero randomness unless seeded.

4.2 Coverage Targets

Layer	Line	Branch	Mutation (Stryker)
Aggregates (`Course`, `Enrollment`, `AttemptSession`, `LearnerProgress`, `PaymentOrder`)	95%	95%	≥ 75%
Value objects (`Score`, `TenantId`, `ContentRef`, `OfflineBundleHash`)	100%	100%	≥ 85%
Domain services	90%	90%	≥ 70%
Policy/specification classes	95%	95%	≥ 80%

4.3 Patterns & Conventions

Given/When/Then naming: given_expired_enrollment_when_starting_attempt_then_throws_EnrollmentExpired.
Builders in __builders__/ colocated with the aggregate.
No ORM, no DB, no network, no filesystem. Tests import the pure TS/Go/Python domain module only.
Time is injected via a Clock port; tests use a FakeClock at a fixed ISO-8601 instant.
UUIDs are injected via IdFactory; tests use a deterministic SeededIdFactory.
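The two injection conventions above can be sketched as follows. The port names `Clock` and `IdFactory` and the doubles `FakeClock` and `SeededIdFactory` come from the text; the method names (`now`, `nextId`, `advanceSeconds`) and the ID format are assumptions made for this illustration.

```typescript
// Ports the domain depends on. Method names are assumed for this sketch.
interface Clock {
  now(): Date;
}
interface IdFactory {
  nextId(): string;
}

// Test double: time only moves when the test says so.
class FakeClock implements Clock {
  private currentMs: number;

  constructor(isoInstant: string) {
    this.currentMs = Date.parse(isoInstant);
    if (Number.isNaN(this.currentMs)) throw new Error(`invalid instant: ${isoInstant}`);
  }

  now(): Date {
    return new Date(this.currentMs);
  }

  advanceSeconds(seconds: number): void {
    this.currentMs += seconds * 1000;
  }
}

// Test double: the same seed always yields the same ID sequence.
// IDs are shaped like UUIDs for readability but are NOT RFC-compliant UUIDs.
class SeededIdFactory implements IdFactory {
  private readonly seed: number;
  private counter = 0;

  constructor(seed: number) {
    this.seed = seed;
  }

  nextId(): string {
    this.counter += 1;
    const hex = (n: number, width: number): string => n.toString(16).padStart(width, "0");
    return `${hex(this.seed, 8)}-0000-0000-0000-${hex(this.counter, 12)}`;
  }
}

const clock = new FakeClock("2024-01-01T00:00:00.000Z");
clock.advanceSeconds(90);
const ids = new SeededIdFactory(7);
```

Because neither double reads the system clock or a random source, a failing test reproduces identically on any machine.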

4.4 Example Coverage — `AttemptSession` Aggregate

Invariants that MUST have dedicated tests (excerpt):

Cannot start a new attempt while a previous attempt is IN_PROGRESS and its lock window has not expired.
Cannot submit an answer after timeLimit has elapsed (using injected clock).
score is recomputed only on transition to SUBMITTED.
Re-submission of the same question version is idempotent.
Offline-replay ordering (see §14) produces identical final state regardless of event arrival order within the same causal group.
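Two of these invariants (the time limit and idempotent re-submission) might be exercised as in the sketch below. The `AttemptSession` class here is a heavily simplified, hypothetical stand-in for the real aggregate; its constructor, method names, and error message are assumptions.

```typescript
interface Clock {
  now(): Date;
}

// Hypothetical, simplified aggregate covering two invariants only.
class AttemptSession {
  private readonly answers = new Map<string, string>();
  private readonly clock: Clock;
  private readonly startedAtMs: number;
  private readonly timeLimitMs: number;

  constructor(clock: Clock, timeLimitSeconds: number) {
    this.clock = clock;
    this.startedAtMs = clock.now().getTime();
    this.timeLimitMs = timeLimitSeconds * 1000;
  }

  submitAnswer(questionVersionId: string, answer: string): void {
    const elapsedMs = this.clock.now().getTime() - this.startedAtMs;
    if (elapsedMs > this.timeLimitMs) {
      throw new Error("AttemptTimeLimitElapsed");
    }
    // Keyed by question version, so repeating a submission leaves state unchanged.
    this.answers.set(questionVersionId, answer);
  }

  answerCount(): number {
    return this.answers.size;
  }
}

// Arrange: a controllable clock and a 60-second attempt.
let nowMs = Date.parse("2024-01-01T00:00:00.000Z");
const fixedClock: Clock = { now: () => new Date(nowMs) };
const session = new AttemptSession(fixedClock, 60);

// Act: submit the same question version twice, then let the limit elapse.
session.submitAnswer("q1@v1", "B");
session.submitAnswer("q1@v1", "B");
const countAfterResubmit = session.answerCount();
nowMs += 61 * 1000;

let lateSubmissionError = "";
try {
  session.submitAnswer("q2@v1", "A");
} catch (e) {
  lateSubmissionError = (e as Error).message;
}
```

The assertions would then check that `countAfterResubmit` is 1 and that the late submission was rejected without changing the answer count.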

4.5 Mutation Testing

Tool: Stryker (TS), go-mutesting (Go), mutmut (Python).
Run frequency: Nightly on main; on-demand in PRs touching domain/**.
Budget: Each domain module ≤ 15 minutes; otherwise sharded.
Gate: Mutation score drop > 5 percentage points blocks merge.
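The gate is expressed in percentage points, not as a relative change. A minimal sketch of the comparison, assuming the baseline is the mutation score on main and that both scores are percentages (the function name is hypothetical):

```typescript
// Returns true when the merge should be blocked.
// Scores are percentages in [0, 100]; the drop is measured in percentage points.
function mutationGateBlocks(
  baselineScorePct: number,
  currentScorePct: number,
  maxDropPoints: number = 5,
): boolean {
  return baselineScorePct - currentScorePct > maxDropPoints;
}

const blocked = mutationGateBlocks(80, 74); // drop of 6 points
const allowed = mutationGateBlocks(80, 75); // drop of exactly 5 points
```

Under this reading a drop from 80% to 75% passes, because the rule blocks only drops greater than 5 points.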

5. Integration Testing — Service Layer

5.1 Definition

Integration tests exercise one service's real code against real infrastructure dependencies (Postgres, Redis, Kafka/NATS, S3, OpenSearch) running in Testcontainers. External SaaS (Stripe, Clerk, LLM providers) are replaced by recorded contract fakes (see §6).

5.2 What They Prove

SQL queries match the schema and indexes (regression against migration drift).
Transactional boundaries hold (including SAGA compensations for cross-service flows — §5.5).
Event publication via the transactional outbox is correct and effectively exactly-once from the consumer's perspective (at-least-once delivery, deduplicated on event_id; see §5.5).
Caching layers (Redis) are invalidated on writes.
Row-Level-Security policies enforce tenant isolation (see 13-security-compliance-tenancy.md §4).

5.3 Environment

Testcontainers spin up: Postgres 16, Redis 7, Kafka (KRaft mode), MinIO, OpenSearch, Mailpit.
Containers are per-test-class (shared) with per-test transactional rollback for SQL, and topic-prefix isolation for Kafka.
Seeds applied once per suite via Flyway/Liquibase migrations + deterministic fixture loaders.

5.4 Tenant Isolation Tests (mandatory, every service)

Every service that touches tenant data MUST ship a test file tenant_isolation.test.ts containing:

Two tenants seeded with identical entity IDs (to catch ID-leak bugs).
Queries executed as Tenant A MUST NOT return Tenant B rows under any filter.
Direct row ID access (bypassing the service filter) MUST be blocked by RLS.
Bulk/admin endpoints MUST require X-Tenant-Scope: global and an elevated role.
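The first two checks can be sketched with an in-memory stand-in. The real tests run against Postgres with RLS in Testcontainers; `CourseRepo`, `CourseRow`, and the helper below are hypothetical names used only to show the shape of the seeding and assertions.

```typescript
interface CourseRow {
  tenantId: string;
  id: string;
  title: string;
}

// In-memory stand-in for a tenant-scoped repository (assumption: the real
// repository takes the caller's tenant on every read).
class CourseRepo {
  private rows: CourseRow[] = [];

  insert(row: CourseRow): void {
    this.rows.push(row);
  }

  findById(callerTenantId: string, id: string): CourseRow | undefined {
    return this.rows.find((r) => r.tenantId === callerTenantId && r.id === id);
  }

  list(callerTenantId: string, titleContains: string = ""): CourseRow[] {
    return this.rows.filter(
      (r) => r.tenantId === callerTenantId && r.title.includes(titleContains),
    );
  }
}

// Fails loudly if any returned row belongs to another tenant.
function assertNoCrossTenantRows(rows: CourseRow[], callerTenantId: string): void {
  const leaked = rows.filter((r) => r.tenantId !== callerTenantId);
  if (leaked.length > 0) {
    throw new Error(`leaked ${leaked.length} row(s) across tenants`);
  }
}

const repo = new CourseRepo();
// Identical entity IDs in both tenants: an ID-only lookup would leak.
repo.insert({ tenantId: "tenant-a", id: "course-1", title: "Algebra (A)" });
repo.insert({ tenantId: "tenant-b", id: "course-1", title: "Algebra (B)" });

const seenByA = repo.list("tenant-a", "Algebra");
assertNoCrossTenantRows(seenByA, "tenant-a");
```

Seeding both tenants with the same ID is what makes the test meaningful: a lookup that forgets the tenant predicate would return two rows or the wrong one.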

5.5 SAGA and Outbox Tests

For each cross-service SAGA (e.g., Enroll → Payment → Provisioning → Notification):

Happy path: all steps succeed, final events produced in order.
Compensating path: inject failure at step N, assert reversal events and idempotent cleanup.
Outbox poll crash: kill the outbox relay mid-flush; on restart, no duplicate publishes (dedupe via event_id).
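The outbox crash case can be sketched in memory as below. This is one possible reading of the scenario: the relay may redeliver after a crash, and deduplication on `event_id` keeps the consumer's state free of duplicates. `flushOutbox`, `DedupingConsumer`, and the `crashAfter` parameter are hypothetical names.

```typescript
interface OutboxEvent {
  eventId: string;
  payload: string;
  published: boolean;
}

// Consumer that deduplicates on eventId, so redelivery is harmless.
class DedupingConsumer {
  private seen = new Set<string>();
  readonly applied: string[] = [];

  handle(e: OutboxEvent): void {
    if (this.seen.has(e.eventId)) return;
    this.seen.add(e.eventId);
    this.applied.push(e.payload);
  }
}

// Relay that delivers unpublished rows in order. `crashAfter` simulates the
// process being killed after a delivery but before the row is marked published.
function flushOutbox(
  outbox: OutboxEvent[],
  consumer: DedupingConsumer,
  crashAfter?: number,
): void {
  let delivered = 0;
  for (const e of outbox) {
    if (e.published) continue;
    consumer.handle(e);
    delivered += 1;
    if (crashAfter !== undefined && delivered === crashAfter) {
      throw new Error("relay killed mid-flush");
    }
    e.published = true;
  }
}

const outbox: OutboxEvent[] = [
  { eventId: "e-1", payload: "enrolled", published: false },
  { eventId: "e-2", payload: "paid", published: false },
  { eventId: "e-3", payload: "provisioned", published: false },
];
const consumer = new DedupingConsumer();

let relayCrashed = false;
try {
  flushOutbox(outbox, consumer, 2); // crash after delivering e-2
} catch (e) {
  relayCrashed = true;
}
flushOutbox(outbox, consumer); // restart: e-2 is redelivered, e-3 delivered
```

After the restart the consumer has applied each payload once and in order, even though `e-2` was delivered twice.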

6. Contract Testing — APIs & Events

Contracts are the load-bearing beams of a distributed system. We use consumer-driven contracts for both HTTP APIs and events.

6.1 HTTP Contract Testing (Pact)

Consumer side: Each frontend and each downstream service publishes a Pact file to the broker on every PR.
Provider side: Services verify against all current consumer contracts in CI; failure blocks merge.
OpenAPI alignment: A generator validates that each Pact interaction is covered by the OpenAPI spec, and vice-versa — no undocumented endpoints, no unused OpenAPI paths.
Breaking change policy: Removing or narrowing a field requires a versioned endpoint (/v2/...) and a 90-day deprecation window.

6.2 GraphQL Contract Testing

Federation subgraphs run rover subgraph check against the published supergraph schema.
Consumer queries are extracted from frontend builds and replayed as persisted-query tests.

6.3 Event Contract Testing

Events are versioned via Avro schemas in the Schema Registry (Confluent-compatible). Each event topic has:

schema-compatibility: BACKWARD_TRANSITIVE (a consumer on the latest schema can read data written with every earlier version).
A producer contract test that publishes a golden sample per version and asserts schema-registry acceptance.
A consumer contract test per downstream service that replays golden samples and asserts projection correctness.
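A consumer contract test of this kind might look like the sketch below, with plain objects standing in for decoded Avro records. The `LessonCompleted` event, its fields, and the `upcast` helper are hypothetical.

```typescript
// Hypothetical event versions; v2 adds a field with a default.
interface LessonCompletedV1 {
  version: 1;
  learnerId: string;
  lessonId: string;
}
interface LessonCompletedV2 {
  version: 2;
  learnerId: string;
  lessonId: string;
  durationSeconds: number;
}
type LessonCompleted = LessonCompletedV1 | LessonCompletedV2;

// Upcast older versions by filling the added field with its default.
function upcast(e: LessonCompleted): LessonCompletedV2 {
  if (e.version === 2) return e;
  return { version: 2, learnerId: e.learnerId, lessonId: e.lessonId, durationSeconds: 0 };
}

// Projection under test: completed-lesson count per learner.
function project(events: LessonCompleted[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const e of events.map(upcast)) {
    counts.set(e.learnerId, (counts.get(e.learnerId) ?? 0) + 1);
  }
  return counts;
}

// One golden sample per schema version.
const goldenSamples: LessonCompleted[] = [
  { version: 1, learnerId: "l-1", lessonId: "lesson-1" },
  { version: 2, learnerId: "l-1", lessonId: "lesson-2", durationSeconds: 300 },
];

const projected = project(goldenSamples);
```

Keeping one golden sample per version means a consumer that silently drops an old version fails this test rather than failing in production replay.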

6.4 Event Catalog Tests

The event catalog (see 04-event-driven-architecture.md §6) is machine-readable YAML. CI asserts:

Every event produced in code appears in the catalog.
Every catalog entry has ≥1 producer test and ≥1 consumer test.
Every event has a documented retention, PII classification, and replay eligibility flag.
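These three assertions reduce to set and field checks once the YAML is parsed. The sketch below assumes the catalog has already been loaded into objects; the `CatalogEntry` field names are hypothetical and may not match the real catalog schema.

```typescript
// Assumed shape of one parsed catalog entry.
interface CatalogEntry {
  eventName: string;
  producerTests: number;
  consumerTests: number;
  retention?: string;
  piiClassification?: string;
  replayEligible?: boolean;
}

// Returns one human-readable message per violated rule; empty means green.
function catalogViolations(producedInCode: string[], catalog: CatalogEntry[]): string[] {
  const violations: string[] = [];
  const cataloged = new Set(catalog.map((e) => e.eventName));

  for (const produced of producedInCode) {
    if (!cataloged.has(produced)) {
      violations.push(`${produced}: produced in code but missing from catalog`);
    }
  }
  for (const e of catalog) {
    if (e.producerTests < 1) violations.push(`${e.eventName}: no producer test`);
    if (e.consumerTests < 1) violations.push(`${e.eventName}: no consumer test`);
    if (
      e.retention === undefined ||
      e.piiClassification === undefined ||
      e.replayEligible === undefined
    ) {
      violations.push(`${e.eventName}: missing retention, PII classification, or replay flag`);
    }
  }
  return violations;
}

const violations = catalogViolations(
  ["LessonCompleted", "AttemptSubmitted"],
  [
    {
      eventName: "LessonCompleted",
      producerTests: 1,
      consumerTests: 2,
      retention: "365d",
      piiClassification: "none",
      replayEligible: true,
    },
  ],
);
```

In this example `AttemptSubmitted` is emitted by code but absent from the catalog, so CI would fail with a single violation.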

7. End-to-End Testing — User Journeys

7.1 Canonical Journeys (must always be green on `main`)

ID	Journey	Primary actor	Tooling
J-01	Learner signs up → enrolls in free course → completes first lesson	Learner	Playwright
J-02	Learner purchases paid course → receives receipt → starts course	Learner	Playwright + Stripe test clock
J-03	Learner downloads bundle → goes airplane mode → completes quiz → syncs	Learner	Playwright + device emulation
J-04	Instructor creates assignment → grades with AI-assist → publishes feedback	Instructor	Playwright
J-05	Author builds course (drag-drop, AI-generate, preview) → publishes v1 → edits v2	Author	Playwright
J-06	Admin provisions tenant → configures SSO → invites cohort	Admin	Playwright
J-07	Learner asks AI tutor a question → receives cited answer → flags hallucination	Learner	Playwright + AI-eval harness
J-08	Cohort teacher runs live session → projects results → exports gradebook	Instructor	Playwright
J-09	Learner on low-end Android completes lesson offline for 7 days → reconnects	Learner	Playwright + Android emulator
J-10	Tenant requests data export (GDPR Art. 20) → receives archive → confirms erasure	Admin/DPO	Playwright + archive validator

7.2 E2E Principles

Run against deployed staging, not local docker-compose, for PR smoke.
Seeded tenants are prefixed e2e-<runId>- and cleaned within 1 hour by a janitor job.
No shared state between tests; each test creates and destroys its own learner/cohort.
Visual regression via Playwright snapshots at viewport widths of 320, 768, 1024, and 1440 px for all learner-facing pages.
Trace + video + screenshot always captured, uploaded to artifact store, linked from failure report.

7.3 Performance Budget (inside E2E)

Lighthouse assertions inside Playwright tests for key landing, dashboard, and player pages. See §10 for thresholds.

8. Offline Testing

Offline is a tier-0 invariant. A single offline regression is a P0 incident.

8.1 Scenarios

ID	Scenario	Expected behavior
O-01	Airplane mode mid-lesson	Player continues; progress queued; UI shows offline badge.
O-02	7-day offline streak	All queued events persisted; no data loss on sync.
O-03	Bundle tamper (hash mismatch)	Player refuses to load; user prompted to re-download; incident logged.
O-04	Clock skew (device time in the past)	Events accepted server-side via logical clocks; no silent re-ordering.
O-05	Conflict: same attempt edited on two devices	Deterministic resolution per CRDT/LWW rules in 12-data-models.md §8.
O-06	Storage full on device	Graceful degradation: evict LRU bundles; warn user; never lose unsynced events.
O-07	Background sync killed by OS	Resume on next app open; no duplicate submissions.
O-08	Partial bundle download	Resumable via range requests; hash verified before activation.
O-09	Expired offline license	Player blocks new attempts; allows sync of existing queued events.
O-10	Cross-device restore from account recovery	Learner sees same progress on new device after sync.

8.2 Tooling

Playwright with context.setOffline(true) for web.
Detox + Android emulator airplane-mode toggle for mobile.
Network fault injection (a "Chaos Monkey for networks"): toxiproxy simulates latency, packet loss, and bandwidth caps (2G, 3G, flaky Wi-Fi).
Bundle fuzzer: a harness that applies bit-flips, truncation, and metadata tampering; every mutated bundle must be rejected.

8.3 Sync Conflict Test Matrix

Every mergeable entity (progress, notes, bookmarks, assignment drafts) has a matrix:

             Device A change type
             │ add │ update │ delete │
Device B ────┼─────┼────────┼────────┤
add          │ M1  │  M2    │  M3    │
update       │ M2  │  M4    │  M5    │
delete       │ M3  │  M5    │  M6    │

Each cell (M1–M6) has at least 2 tests: one where A wins by timestamp, one where B wins. Expected resolutions are documented in 12-data-models.md §8.4.
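As an illustration of the two tests per cell, the sketch below covers cell M5 (update on one device, delete on the other) under a plain last-writer-wins rule. The authoritative resolutions live in 12-data-models.md §8.4 and are not reproduced here; `resolveLww`, the `DeviceChange` type, and the device-ID tie-break are assumptions.

```typescript
type DeviceChange<T> =
  | { kind: "add" | "update"; value: T; timestamp: number; deviceId: string }
  | { kind: "delete"; timestamp: number; deviceId: string };

// Last-writer-wins: the later timestamp wins; on a tie the higher device ID
// wins, so both replicas reach the same answer regardless of argument order.
function resolveLww<T>(a: DeviceChange<T>, b: DeviceChange<T>): DeviceChange<T> {
  if (a.timestamp !== b.timestamp) return a.timestamp > b.timestamp ? a : b;
  return a.deviceId > b.deviceId ? a : b;
}

const deleteOnB: DeviceChange<string> = { kind: "delete", timestamp: 20, deviceId: "device-b" };

// Test 1: B wins by timestamp (the delete is newer than the update).
const earlyUpdateOnA: DeviceChange<string> = {
  kind: "update",
  value: "note v2",
  timestamp: 10,
  deviceId: "device-a",
};
const bWins = resolveLww(earlyUpdateOnA, deleteOnB);

// Test 2: A wins by timestamp (the update is newer than the delete).
const lateUpdateOnA: DeviceChange<string> = {
  kind: "update",
  value: "note v3",
  timestamp: 30,
  deviceId: "device-a",
};
const aWins = resolveLww(lateUpdateOnA, deleteOnB);
```

A third assertion worth adding per cell is symmetry: swapping the arguments must not change the winner.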

9. AI Testing

AI testing is the most novel and the riskiest surface. We treat it with the same rigor as security testing.

9.1 Categories

Category	What it catches
Prompt regression	Output drift across prompt or model version changes.
Safety (harms)	Harassment, self-harm encouragement, age-inappropriate content.
Hallucination	Fabricated citations, invented facts, wrong math.
Structured output	Invalid JSON, missing required fields, wrong types.
PII leakage	Names, emails, tokens, embedded training data leaking into output.
Jailbreak resistance	Prompt injection from user content (assignment text, PDF OCR, uploaded images).
Cost / latency	Token usage, p95 latency, provider-side error rates.
Bias & fairness	Differential output quality across demographics, languages, dialects.
Curriculum drift	Output contradicts published curriculum standards.

9.2 Prompt Regression

Golden set: ≥ 1,000 prompts per AI feature (tutor, grader, generator, summarizer), curated by subject-matter experts and labeled with expected answer classes.
Scoring:
- Deterministic checks: JSON schema, required citations, forbidden tokens.
- LLM-as-judge: a separate, cheaper model scores similarity to expected answer along rubric dimensions (accuracy, helpfulness, tone).
- Human-in-the-loop: 5% random sample reviewed weekly.
Gate: A new prompt/model must score ≥ parity on the golden set before replacing the current version. Statistical test: bootstrapped 95% CI on delta.
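The statistical test can be sketched as a paired percentile bootstrap over per-prompt scores. Several choices below are assumptions rather than the documented method: pairing by prompt, 2,000 resamples, the seeded `mulberry32` generator, and reading "≥ parity" as "the lower bound of the 95% interval on the delta is at or above zero".

```typescript
// Small seeded PRNG (mulberry32) so the resampling is reproducible in CI.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// baseline[i] and candidate[i] are scores for the same golden prompt.
function bootstrapDeltaCi(
  baseline: number[],
  candidate: number[],
  resamples: number = 2000,
  seed: number = 42,
): { lower: number; upper: number } {
  if (baseline.length !== candidate.length || baseline.length === 0) {
    throw new Error("need paired, non-empty score arrays");
  }
  const rand = mulberry32(seed);
  const n = baseline.length;
  const meanDeltas: number[] = [];
  for (let r = 0; r < resamples; r++) {
    let sum = 0;
    for (let i = 0; i < n; i++) {
      const j = Math.floor(rand() * n); // sample prompts with replacement
      sum += candidate[j] - baseline[j];
    }
    meanDeltas.push(sum / n);
  }
  meanDeltas.sort((x, y) => x - y);
  // Percentile method: 2.5th and 97.5th percentiles of the resampled means.
  return {
    lower: meanDeltas[Math.floor(0.025 * (resamples - 1))],
    upper: meanDeltas[Math.ceil(0.975 * (resamples - 1))],
  };
}

// Assumed gate: candidate is at parity or better with 95% confidence.
function passesParityGate(ci: { lower: number; upper: number }): boolean {
  return ci.lower >= 0;
}

const baselineScores = [0.5, 0.6, 0.7, 0.8];
const sameCi = bootstrapDeltaCi(baselineScores, baselineScores);
const worseCi = bootstrapDeltaCi(baselineScores, baselineScores.map((s) => s - 0.1));
```

A looser reading would pass the candidate whenever the interval includes zero (`ci.upper >= 0`); which one applies is a policy decision, not something this sketch settles.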

9.3 Safety Testing

Red-team corpus: 5,000+ adversarial prompts covering self-harm, hate, sexual content, violence, drugs, weapons, cheating, and education-specific attacks ("write my essay", "solve my test for me and hide it").
Every model deploy: 100% of corpus re-run; zero policy violations permitted. Violations block deploy.
Policy coverage: Every safety policy (internal doc SAFETY-001 through SAFETY-037) has ≥ 10 probes.

9.4 Hallucination Detection

Citation verifier: Every AI tutor answer with a citation is validated — the cited source must exist in the retrieval index, and the cited passage must contain a semantically similar sentence (bi-encoder similarity > 0.75).
Fact checker sub-chain: For math/science answers, a deterministic verifier (e.g., SymPy for algebra) re-checks the AI's final numeric answer.
Uncertainty escalation: Answers below a confidence threshold trigger a "not sure, routing to teacher" fallback; tests assert this path fires on known ambiguous prompts.
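The citation check can be sketched as a threshold on embedding similarity. The text specifies only "bi-encoder similarity > 0.75"; using cosine similarity as the measure is an assumption, and the toy two-dimensional vectors below stand in for real sentence embeddings.

```typescript
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length === 0) {
    throw new Error("vectors must share a non-zero length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}

// A citation counts as supported when at least one sentence of the cited
// passage is more similar to the claim than the threshold.
function citationSupported(
  claimEmbedding: number[],
  passageSentenceEmbeddings: number[][],
  threshold: number = 0.75,
): boolean {
  return passageSentenceEmbeddings.some(
    (sentence) => cosineSimilarity(claimEmbedding, sentence) > threshold,
  );
}

const supported = citationSupported([1, 0], [[0, 1], [1, 0.1]]);
const unsupported = citationSupported([1, 0], [[0, 1], [1, 1]]);
```

In the second call the best match has similarity of about 0.71, below the threshold, so the citation would be flagged.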

9.5 Structured Output Validation

All AI outputs consumed programmatically MUST declare a JSON schema (Zod/Pydantic/Ajv).
Failure test: Inject adversarial model outputs (truncated JSON, trailing comments, wrong types); system MUST retry-with-repair then fall back to a safe default. Silent coercion is forbidden.
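The retry-with-repair path might be tested as below. This is a synchronous sketch with a hand-written validator in place of Zod/Pydantic/Ajv; `callModel`, the repair prompt wording, and the `needs_human_review` default are all hypothetical.

```typescript
type GradeOutcome =
  | { status: "graded"; score: number; feedback: string }
  | { status: "needs_human_review" };

// Strict validation: no coercion of "7" to 7, no defaulting of missing fields.
function parseGrade(raw: string): { score: number; feedback: string } | null {
  let value: unknown;
  try {
    value = JSON.parse(raw);
  } catch (e) {
    return null; // truncated JSON, trailing comments, etc.
  }
  if (typeof value !== "object" || value === null) return null;
  const record = value as Record<string, unknown>;
  const score = record["score"];
  const feedback = record["feedback"];
  if (typeof score !== "number" || !Number.isFinite(score)) return null;
  if (typeof feedback !== "string") return null;
  return { score, feedback };
}

// Ask once, request a repair up to `maxRepairs` times, then fall back.
function gradeWithRepair(
  callModel: (prompt: string) => string,
  prompt: string,
  maxRepairs: number = 1,
): GradeOutcome {
  let raw = callModel(prompt);
  for (let attempt = 0; ; attempt++) {
    const parsed = parseGrade(raw);
    if (parsed !== null) {
      return { status: "graded", score: parsed.score, feedback: parsed.feedback };
    }
    if (attempt >= maxRepairs) return { status: "needs_human_review" };
    raw = callModel(`${prompt}\nThe previous reply did not match the schema. Reply with JSON only.`);
  }
}

// Stub model that replays canned outputs in order.
function stubModel(outputs: string[]): (prompt: string) => string {
  let i = 0;
  return () => outputs[Math.min(i++, outputs.length - 1)];
}

const repaired = gradeWithRepair(
  stubModel(['{"score": 7, "feedback": "ok"', '{"score": 7, "feedback": "ok"}']),
  "Grade this answer.",
);
const fellBack = gradeWithRepair(stubModel(['{"score": "7", "feedback": "ok"}']), "Grade this answer.");
```

The second case is the important one: a string `"7"` where a number is required is rejected and routed to the safe default instead of being coerced.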

9.6 PII Leakage

Canary data: Seeded synthetic PII strings (canary-pii-XXXX) in training-adjacent logs and retrieval sources.
Test: A daily probe suite asks the model innocuous questions; any canary that surfaces in output raises a P1 incident.
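The output scan itself can be a simple pattern match. The sketch assumes the `XXXX` suffix is four alphanumeric characters, which the text does not specify.

```typescript
// Assumption: the XXXX suffix is exactly four alphanumeric characters.
const CANARY_PATTERN = /canary-pii-[A-Za-z0-9]{4}/g;

// Returns every canary string found in a model output.
function findCanaries(modelOutput: string): string[] {
  return modelOutput.match(CANARY_PATTERN) ?? [];
}

const cleanHits = findCanaries("The capital of France is Paris.");
const leakHits = findCanaries("Contact canary-pii-A1b2 or canary-pii-9zZ0 for details.");
```

Any non-empty result from the probe suite would open the incident.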

9.7 Jailbreak / Prompt Injection

Corpus of 2,000+ injection attempts embedded in user-uploaded content (PDFs, images via OCR, assignment text).
Assertion: System prompt and tool-access invariants hold; assistant never reveals system prompt, never exfiltrates other users' data, never calls destructive tools.

9.8 Eval Infrastructure

Harness: Internal ghasi-evals framework built on Inspect / promptfoo conventions.
Storage: Eval runs versioned in object storage with prompt hash, model ID, parameters, and outputs. Queryable in ClickHouse.
Dashboards: Grafana — per-feature quality score over time, regression alerts.

10. Load, Performance & Scalability Testing

10.1 Tooling

k6 for HTTP and gRPC (primary).
Locust for complex stateful flows (scripted learners over long sessions).
Gatling for a secondary, JVM-native cross-check on Payments.

10.2 Per-Service Load Scenarios

Service	Peak target	SLO (p95)	Scenario
API Gateway	20k rps	80 ms	Mixed auth + unauth.
Identity	2k rps sign-in	200 ms	Password + SSO + MFA.
Enrollment	1k rps	150 ms	First-day-of-term burst × 10.
Progress	10k events/s	300 ms (end-to-end to projection)	Cohort of 50k learners simultaneously answering.
AI Tutor	500 concurrent sessions	2.5 s first token	Sustained + spike.
Authoring	200 rps write	400 ms	Bulk import of 10k questions.
Payments	300 rps	500 ms	Black-Friday-class burst.
Content Delivery (CDN origin)	5k rps	100 ms	Bundle downloads, warm + cold cache.
Notifications	5k msg/s	async	Fan-out to cohort of 100k.

10.3 System-Wide Scenarios

First-Day-of-Term: 500k learners log in within 1 hour, each enrolls in 3 courses, each starts 1 lesson. Success: 5xx rate ≤ 0.1%; p95 latency within each service's SLO (§10.2).
Global Event Storm: 1M progress events/minute for 10 minutes. Success: no consumer lag > 30 s; no data loss.
Graceful Degradation: Kill the AI provider; assert the tutor degrades gracefully with clear UX messaging and causes no cascading failures.

10.4 Frontend Performance Budgets

Per web/performance.md, enforced in CI via Lighthouse CI:

Page	LCP	INP	CLS	JS (gzip)
Marketing landing	2.0 s	150 ms	0.05	120 KB
Learner dashboard	2.5 s	200 ms	0.10	280 KB
Player (lesson)	2.0 s	150 ms	0.05	220 KB
Authoring canvas	3.0 s	250 ms	0.10	600 KB (tooling-heavy, justified)

11. Accessibility Testing — WCAG 2.2 AA

11.1 Commitment

Ghasi targets WCAG 2.2 Level AA across all learner- and instructor-facing surfaces, with Level AAA aspiration for the Player (lesson runtime).

11.2 Layered Approach

Layer	Tool	Coverage
Static	eslint-plugin-jsx-a11y	Every PR
Component	@storybook/addon-a11y + axe-core	Every story, every variant
E2E	axe-playwright scan per E2E journey	Every PR
Keyboard	Scripted tab-order + focus-trap tests	Every interactive component
Screen reader	Manual NVDA/VoiceOver/TalkBack audits	Per release
Low-vision	Contrast, zoom 200%, reflow at 320 CSS px	Per release
Cognitive	Reading-level + plain-language review	Per content release (editorial)
Motor	Target size ≥ 24×24 CSS px (2.5.8), no hover-only	Linter + manual

11.3 Required Assertions

All interactive elements reachable by keyboard.
Focus ring visible and meets 3:1 contrast against adjacent colors.
prefers-reduced-motion honored for all motion, including the UI covered by the §10 budgets.
Captions on all instructional video; transcripts downloadable.
Math rendered via MathML + ARIA label fallback (not image-only).
Error messages programmatically associated with fields (aria-describedby).
Language of page and parts declared (lang attribute per passage for multilingual content).

11.4 Gate

Axe violations at serious or critical severity block merge. Violations at moderate severity require a triage ticket within 48 h.

12. Security Testing

12.1 SAST (Static)

Tools: Semgrep (custom ruleset + OWASP Top 10), CodeQL, gitleaks, trivy-config.
Gate: Any high or critical finding blocks merge; a medium finding opens a ticket.
Custom rules: No dangerouslySetInnerHTML; no raw SQL concatenation; no eval; no child_process.exec with user input; no import of deprecated crypto.

12.2 SCA (Dependencies)

Tools: Dependabot + Snyk + OSV-Scanner.
Policy: Criticals patched within 24 h; highs within 7 d; mediums within 30 d.
Supply-chain: SBOM (CycloneDX) produced per build; artifact signing via Sigstore/cosign; provenance via SLSA level 3.

12.3 DAST (Dynamic)

Tools: OWASP ZAP (authenticated scans), Burp Suite Enterprise.
Cadence: Nightly against staging; pre-release against pre-prod.
Coverage: Every public endpoint + representative authenticated endpoints per role.

12.4 Pen-Testing

External firm engaged quarterly with rotating scope: external perimeter, internal tenant isolation, mobile app, AI surfaces (prompt injection, model extraction).
Internal red team runs a monthly exercise mirroring a realistic attacker TTP chain; blue team must detect and respond.
Bug bounty: Public program (scoped) with clear SLA for response.

12.5 AuthZ Matrix Testing

Every role × resource × action combination is expressed as a policy decision test against the OPA/Cedar policy engine:

roles:     [anonymous, learner, instructor, author, tenant-admin, super-admin, dpo, support]
resources: [Course, Lesson, Attempt, Grade, Cohort, Tenant, BillingAccount, UserPII, AuditLog, AIPrompt]
actions:   [read, list, create, update, delete, export, impersonate, configure]

Matrix size: 8 × 10 × 8 = 640 cells. Each cell declares allow or deny; tests assert both directions. A newly added role, resource, or action cannot ship until every cell in its slice of the matrix is filled.
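The completeness rule can be checked mechanically by enumerating the cross product and looking for undeclared cells. The sketch below uses the three lists above; the `role:resource:action` key format and the in-memory `Map` are assumptions, since the real decisions live in the OPA/Cedar policies.

```typescript
const roles = ["anonymous", "learner", "instructor", "author", "tenant-admin", "super-admin", "dpo", "support"] as const;
const resources = ["Course", "Lesson", "Attempt", "Grade", "Cohort", "Tenant", "BillingAccount", "UserPII", "AuditLog", "AIPrompt"] as const;
const actions = ["read", "list", "create", "update", "delete", "export", "impersonate", "configure"] as const;

type Decision = "allow" | "deny";

// Enumerate every role x resource x action cell as a string key.
function allCells(): string[] {
  const cells: string[] = [];
  for (const role of roles) {
    for (const resource of resources) {
      for (const action of actions) {
        cells.push(`${role}:${resource}:${action}`);
      }
    }
  }
  return cells;
}

// Cells that have no declared decision; must be empty before shipping.
function missingCells(declared: Map<string, Decision>): string[] {
  return allCells().filter((cell) => !declared.has(cell));
}

// Example: start from deny-everything, then forget one cell.
const declared = new Map<string, Decision>();
for (const cell of allCells()) declared.set(cell, "deny");
declared.set("learner:Course:read", "allow");
declared.delete("support:AuditLog:export");

const gaps = missingCells(declared);
```

Here `gaps` contains the single forgotten cell, which is the condition that would block the change.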

12.6 Threat-Model-Driven Tests

For every threat in the STRIDE model maintained in 13-security-compliance-tenancy.md, at least one test case exists proving the mitigation works.

13. Chaos & Resilience Testing

13.1 Principles

Run chaos in staging continuously, in production monthly with a blast-radius limit.
Every chaos experiment has a written hypothesis, a measurable steady state, and a rollback.

13.2 Experiment Catalog (excerpt)

ID	Experiment	Hypothesis	Blast radius
C-01	Kill one Progress service pod	≤ 0.5% increase in 5xx for 30 s	1 pod
C-02	50% packet loss between Gateway and Identity	Retries absorb; auth p99 < 2 s	one AZ pair
C-03	Partition Kafka broker	Producers buffer via outbox; no data loss	one broker
C-04	CPU stress on all AI workers	Tutor degrades to cached responses; non-AI unaffected	AI pool only
C-05	Redis primary failover	Session cache rebuilds within 60 s; no user-visible logouts	staging only
C-06	Postgres read replica lag 30 s	Read-your-writes queries route to primary; no stale reads surfaced to learners mid-attempt	staging only
C-07	LLM provider 500s for 10 min	Circuit breaker opens; fallback model engaged; eval-latency SLO holds	AI pool only
C-08	CDN origin offline	Bundle downloads route to secondary; no 404s surface	staging only

13.3 Tooling

LitmusChaos or Chaos Mesh in Kubernetes.
AWS Fault Injection Simulator for managed services.
toxiproxy for deterministic network faults in integration tests.

14. Replay Testing — Event Log Rebuild

Per 04-event-driven-architecture, any projection/read-model is reproducible from the event log. This is a compliance and recovery guarantee, not an optimization.

14.1 Guarantees Tested

Deterministic replay: Rebuilding a projection from event index 0 produces a byte-identical (or schema-equivalent) result to the live projection.
Partial replay: Replaying from any checkpoint reconstructs correct state.
Out-of-order safety: Replaying events in a shuffled-but-causally-valid order produces the same final state (for CRDT/idempotent projections) or rejects (for strict-ordering projections).
Schema migration: Old event schemas are readable by current consumers via registry + upcasters.
PII redaction on replay: Events marked PII:erased (post-GDPR-erasure) MUST replay with tombstoned fields; downstream state must reflect erasure.

14.2 Harness

Nightly job: Rebuild every projection from scratch in a sandboxed cluster; diff against live; any drift is a P1.
Per-PR job: For changed projections, rebuild from a canned 10k-event fixture and assert equality.
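The per-PR equality check depends on a canonical serialization, so that equal states compare byte-for-byte. A minimal sketch for an idempotent projection follows; the `LessonProgressEvent` shape and the completed-lessons projection are hypothetical.

```typescript
interface LessonProgressEvent {
  eventId: string;
  learnerId: string;
  lessonId: string;
}

// Rebuild the projection from scratch and serialize it canonically:
// learners and lessons are sorted so equal states give identical strings.
function rebuildProjection(events: LessonProgressEvent[]): string {
  const completed = new Map<string, Set<string>>();
  for (const e of events) {
    const lessons = completed.get(e.learnerId) ?? new Set<string>();
    lessons.add(e.lessonId); // set semantics make duplicates harmless
    completed.set(e.learnerId, lessons);
  }
  const entries = Array.from(completed.entries())
    .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0))
    .map(([learnerId, lessons]) => [learnerId, Array.from(lessons).sort()]);
  return JSON.stringify(entries);
}

const fixture: LessonProgressEvent[] = [
  { eventId: "e-1", learnerId: "l-1", lessonId: "lesson-2" },
  { eventId: "e-2", learnerId: "l-1", lessonId: "lesson-1" },
  { eventId: "e-3", learnerId: "l-2", lessonId: "lesson-1" },
];

const inOrder = rebuildProjection(fixture);
const reversed = rebuildProjection(fixture.slice().reverse());
const withDuplicate = rebuildProjection([...fixture, fixture[0]]);
```

All three rebuilds serialize to the same string, which is the property the out-of-order guarantee requires of CRDT or idempotent projections.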

15. Multi-Device Sync Testing

15.1 Why It Warrants Its Own Section

Ghasi learners routinely use 2–3 devices (phone + tablet + school PC). Sync bugs are invisible in single-device test suites.

15.2 Scenarios

Simultaneous edit: Same learner edits notes on phone and tablet within 2 s; both devices converge to the same final state within 10 s of reconnection.
Staggered edit: Phone edits offline for 3 days; tablet edits online; merge on reconnection respects documented precedence.
Device decommission: Learner signs out of phone; queued unsynced events must flush before logout completes (or be migrated to new device via account-bound queue).
Clock skew: Phone clock 10 min ahead; sync still ordered correctly via logical clocks (Lamport/HLC).
Partial-sync visibility: Tablet shows a progress indicator during large sync; never presents partially-synced state as authoritative.
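The clock-skew scenario can be sketched with a Lamport clock, the simpler of the two logical clocks named above. The class and field names are hypothetical; a hybrid logical clock would additionally carry a physical-time component.

```typescript
class LamportClock {
  private counter = 0;

  // Local event: advance the counter.
  tick(): number {
    this.counter += 1;
    return this.counter;
  }

  // On receiving a remote timestamp: jump past it.
  receive(remote: number): number {
    this.counter = Math.max(this.counter, remote) + 1;
    return this.counter;
  }
}

interface StampedEdit {
  deviceId: string;
  lamport: number;
  wallClockMs: number;
  text: string;
}

// Order by logical time; device ID breaks ties so every replica sorts identically.
function orderEdits(edits: StampedEdit[]): StampedEdit[] {
  return edits.slice().sort((a, b) => {
    if (a.lamport !== b.lamport) return a.lamport - b.lamport;
    return a.deviceId < b.deviceId ? -1 : a.deviceId > b.deviceId ? 1 : 0;
  });
}

const trueNowMs = Date.parse("2024-01-01T00:00:00.000Z");
const phone = new LamportClock();
const tablet = new LamportClock();

// The phone edits first, but its wall clock runs 10 minutes ahead.
const phoneEdit: StampedEdit = {
  deviceId: "phone",
  lamport: phone.tick(),
  wallClockMs: trueNowMs + 10 * 60 * 1000,
  text: "first draft",
};

// The tablet receives the phone's edit, then edits 5 seconds later.
tablet.receive(phoneEdit.lamport);
const tabletEdit: StampedEdit = {
  deviceId: "tablet",
  lamport: tablet.tick(),
  wallClockMs: trueNowMs + 5 * 1000,
  text: "second draft",
};

const ordered = orderEdits([tabletEdit, phoneEdit]);
```

Sorting by wall clock would place the tablet's edit first; the logical timestamps preserve the causal order despite the skew.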

15.3 Harness

Playwright multi-context: Two browser contexts simulate two devices; a third orchestrator drives scenarios.
Mobile: Detox drives two Android emulators in parallel under Bazel/Tuist.
Assertion library: Custom expectConvergedState(deviceA, deviceB, { within: '10s' }).

16. Testing Environments & Data Management

16.1 Environments

Env	Purpose	Data	Refresh
local	Developer loop	Ephemeral Testcontainers	Per-run
ci	Per-PR validation	Ephemeral Testcontainers + synthetic seed	Per-run
preview	PR review env (optional, per-service)	Scrubbed prod snapshot	Per-PR
staging	Integration + chaos + load	Scrubbed prod snapshot, refreshed weekly	Weekly
pre-prod	Release candidate validation	Full synthetic at prod scale	Per-release
prod	Real users	Real	n/a

16.2 Test Data Principles

No real PII in non-prod. The scrubbing pipeline uses deterministic pseudonymization (format-preserving encryption for IDs, Faker for names/emails), stable per learner.
Synthetic generators for load: ghasi-datagen produces realistic cohorts, courses, and event streams at arbitrary scale.
Fixture libraries versioned under /fixtures/<service>/; generated artifacts checked in for reproducibility.
Data retention in test envs: 30 days; automatic purge; no backups.

16.3 Secrets in Tests

Never real secrets. Each env has its own keyring (AWS Secrets Manager / Vault).
Test secrets rotated weekly via automation.

17. CI/CD Quality Gates

17.1 Pipeline Stages

┌─────────────────────────────────────────────────────────────────────┐
│ 1. Lint + Format + Type-check         (≤ 90 s)                      │
│ 2. Unit tests                         (≤ 4 min)                     │
│ 3. Mutation (changed files only)      (≤ 3 min)                     │
│ 4. SAST + SCA + secrets scan          (≤ 2 min)                     │
│ 5. Integration tests (Testcontainers) (≤ 12 min)                    │
│ 6. Contract tests (Pact + Avro)       (≤ 3 min)                     │
│ 7. Build + SBOM + sign                (≤ 5 min)                     │
│ 8. E2E smoke (10 critical journeys)   (≤ 8 min)                     │
│ 9. Accessibility scan                 (≤ 3 min)                     │
│10. Lighthouse budgets                 (≤ 3 min)                     │
│11. Deploy to preview (optional)                                     │
│12. DAST (nightly on staging)                                        │
│13. Load smoke (nightly on staging)                                  │
│14. AI-eval regression (on AI changes or nightly)                    │
└─────────────────────────────────────────────────────────────────────┘

17.2 Gate Policy

Stage	Failure policy
1–7	Hard block — no override
8	Hard block — no override
9	Hard block for `serious`/`critical`; ticket for `moderate`
10	Hard block if any budget exceeded by > 10%
12	Hard block for `high`/`critical` findings; incident for `medium`
14	Hard block if golden-set score regresses > 2% or any safety violation

17.3 Deployment Gates

Canary: 5% traffic for 30 min; auto-rollback on error-rate or latency breach vs. baseline.
Progressive: 25% → 50% → 100% over 2 h.
Feature flags: All risky features dark-launched behind LaunchDarkly flags with ramp plan in the PR description.

17.4 Post-Deploy Verification

Synthetic checks (uptime, critical journeys) every minute.
SLO burn-rate alerts (2%, 5%, 10% of monthly budget) to on-call.
Automatic rollback if error budget burns > 10% in 1 h.
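The rollback rule can be made concrete with a small calculation. The SLO target (99.9%) and the monthly request volume (100 million) below are illustrative assumptions, not values from this specification, and the function names are hypothetical.

```typescript
// Fraction of the monthly error budget consumed by failures in a window.
function budgetBurnedFraction(
  failedInWindow: number,
  expectedMonthlyRequests: number,
  sloTarget: number,
): number {
  const monthlyErrorBudget = expectedMonthlyRequests * (1 - sloTarget);
  if (monthlyErrorBudget <= 0) throw new Error("SLO target leaves no error budget");
  return failedInWindow / monthlyErrorBudget;
}

// Rollback rule from the text: more than 10% of the budget burned in 1 h.
function shouldAutoRollback(burnedInLastHour: number, threshold: number = 0.1): boolean {
  return burnedInLastHour > threshold;
}

// Assumed example: 99.9% SLO and 100M requests per month give a budget of
// about 100,000 failed requests; 12,000 failures in one hour burn about 12%.
const burned = budgetBurnedFraction(12000, 100000000, 0.999);
const rollback = shouldAutoRollback(burned);
```

With the same assumptions, 5,000 failures in an hour would burn about 5% of the budget, enough to page on-call at the 5% alert but not to trigger the automatic rollback.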

18. Coverage Expectations per Service

Service	Unit	Integration	Contract	E2E (journeys)	AI-eval	Load SLO
identity	90%	85%	Pact + OIDC	J-01, J-06	—	200 ms p95
tenancy	90%	85%	Pact	J-06	—	150 ms
catalog	85%	80%	Pact + GraphQL	J-01, J-02	—	100 ms
enrollment	90%	85%	Pact + events	J-01, J-02	—	150 ms
progress	95%	90%	events	J-03, J-09	—	300 ms end-to-end
attempts/grading	95%	90%	events	J-04	grader-eval	400 ms
authoring	85%	80%	Pact	J-05	generator-eval	400 ms
ai-tutor	80%	75%	Pact	J-07	tutor-eval	2.5 s TTFT
ai-content-gen	80%	75%	events	J-05	generator-eval	async
payments	95%	90%	Pact + Stripe	J-02	—	500 ms
notifications	85%	80%	events	J-04	—	async
content-delivery	90%	85%	Pact	J-03	—	100 ms
offline-sync	95%	90%	events	J-03, J-09	—	300 ms
analytics	80%	75%	events	J-08	—	async
admin	85%	80%	Pact	J-06, J-10	—	300 ms
audit	90%	85%	events	J-10	—	append-only; no data loss
search	85%	80%	Pact	J-01	—	200 ms
localization	85%	80%	Pact	J-01	translation-eval	150 ms
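
The unit and integration columns lend themselves to an automated floor check. The sketch below is one possible shape, assuming the coverage tool reports a single percentage per layer. `COVERAGE_FLOORS` copies three rows from the table for illustration, and `coverage_shortfalls` is a hypothetical helper.

```python
# Floors copied from the table above (percent); only three services shown.
COVERAGE_FLOORS = {
    "identity": {"unit": 90, "integration": 85},
    "progress": {"unit": 95, "integration": 90},
    "ai-tutor": {"unit": 80, "integration": 75},
}


def coverage_shortfalls(service: str, measured: dict) -> dict:
    """Map each failing layer to (measured, required).

    A layer missing from `measured` is treated as 0% coverage.
    """
    floors = COVERAGE_FLOORS[service]
    return {
        layer: (measured.get(layer, 0.0), floor)
        for layer, floor in floors.items()
        if measured.get(layer, 0.0) < floor
    }
```

An empty result means the service meets both floors. For instance, `coverage_shortfalls("progress", {"unit": 96.0, "integration": 88.5})` reports only the integration layer.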

19. Tooling Matrix

Concern	Primary	Secondary	Notes
Unit (TS)	Vitest	Jest	Vitest preferred; shared config in `tooling/vitest`
Unit (Go)	`go test`	testify	—
Unit (Python)	pytest	—	`pytest-xdist` for parallel runs
Mutation (TS)	Stryker	—	—
Mutation (Go)	go-mutesting	—	—
Mutation (Python)	mutmut	—	—
Integration	Testcontainers	docker-compose	Testcontainers first
Contract (HTTP)	Pact	—	Self-hosted broker
Contract (Events)	Schema Registry + custom harness	—	Avro-first
E2E	Playwright	Cypress	Playwright is the standard
Mobile E2E	Detox	Maestro	—
Visual	Playwright snapshots	Chromatic	—
Accessibility	axe-core + axe-playwright	Lighthouse a11y	—
Perf (frontend)	Lighthouse CI	WebPageTest	—
Load	k6	Locust, Gatling	—
Chaos	LitmusChaos	Chaos Mesh, AWS FIS	—
Security (SAST)	Semgrep + CodeQL	—	—
Security (DAST)	OWASP ZAP	Burp	—
Security (SCA)	Snyk + OSV	Dependabot	—
Secrets	gitleaks	trufflehog	—
AI eval	ghasi-evals (built on promptfoo + Inspect)	—	—
Observability	OpenTelemetry + Grafana + ClickHouse	—	—
Feature flags	LaunchDarkly	OpenFeature SDK	—
Test data	ghasi-datagen	Faker	—

20. Governance, Ownership & RACI

20.1 Roles

Service team — owns unit, integration, contract, service-level E2E, and performance SLOs for its service.
Platform QA Guild — owns cross-service E2E, load/chaos harnesses, flake policy, and tooling.
AI Safety team — owns safety corpus, eval harness, red-team.
Security team — owns SAST/DAST/SCA tooling, authz matrix enforcement, pen-test coordination.
Accessibility team — owns axe ruleset, assistive-tech audits.
SRE — owns chaos experiments in prod, SLO burn alerts, rollback automation.

20.2 RACI (excerpt)

Activity	Service team	QA Guild	AI Safety	Security	SRE
Write unit + integration tests	R/A	C	—	—	—
Maintain E2E journey suite	C	R/A	—	—	—
Maintain AI-eval golden set	C	C	R/A	—	—
Pen-test scheduling	C	C	—	R/A	C
Chaos in prod	C	C	—	C	R/A
Flake triage	R	A	—	—	—
Release gate override (emergency)	C	C	C	C	R/A

20.3 Cadence

Weekly: QA Guild reviews flakes, coverage drifts, load regressions.
Monthly: AI Safety red-team debrief; chaos experiment retro.
Quarterly: External pen-test; full-system GameDay.
Annually: Strategy refresh (this document).

Appendix A — Test IDs & Traceability

Every test carries a stable ID of the form:

<service>.<layer>.<capability>.<scenario>

Examples:

enrollment.unit.policy.expired-enrollment-blocks-attempt
progress.integration.outbox.crash-mid-flush-no-duplicate
tutor.ai.safety.self-harm-corpus-v3
player.e2e.offline.seven-day-streak

Traceability is maintained in 06-traceability-matrix.md: every epic/story maps to ≥ 1 test ID at each applicable layer.
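
A linter could enforce the ID format before a test is registered in the traceability matrix. The sketch below assumes each of the four segments is lowercase kebab-case, which matches the examples above but is not stated as a rule. `ParsedTestId` and `parse_test_id` are hypothetical names.

```python
import re
from typing import NamedTuple, Optional

# Assumption: each segment is lowercase alphanumerics joined by hyphens.
_SEGMENT = r"[a-z0-9]+(?:-[a-z0-9]+)*"
TEST_ID_PATTERN = re.compile(
    rf"(?P<service>{_SEGMENT})\.(?P<layer>{_SEGMENT})"
    rf"\.(?P<capability>{_SEGMENT})\.(?P<scenario>{_SEGMENT})"
)


class ParsedTestId(NamedTuple):
    service: str
    layer: str
    capability: str
    scenario: str


def parse_test_id(raw: str) -> Optional[ParsedTestId]:
    """Split a test ID into its four parts, or return None if malformed."""
    match = TEST_ID_PATTERN.fullmatch(raw)
    return ParsedTestId(**match.groupdict()) if match else None
```

All four example IDs parse under this pattern, while an ID with only three segments or with uppercase characters is rejected.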

Appendix B — Flake Policy & Quarantine

A test that fails non-deterministically is quarantined within 24 h: it is tagged @quarantined and excluded from merge gates.
The owning team has 5 business days to either fix and restore the test or delete it.
A test may not remain quarantined for more than 10 business days without a written exception from the QA Guild lead.
Quarantine count per service is a tracked KPI; more than 5 active quarantines trigger a focused reliability sprint.
Root-cause categories are tagged (flake:timing, flake:shared-state, flake:network, flake:ai-nondeterminism) to drive harness improvements.
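
The deadlines in this policy can be checked mechanically. The sketch below counts business days as Monday to Friday and ignores public holidays, which is an assumption. The status strings and function names are hypothetical.

```python
from datetime import date, timedelta

FIX_WINDOW_DAYS = 5        # business days to fix and restore, or delete
MAX_QUARANTINE_DAYS = 10   # business days before an exception is required
SPRINT_TRIGGER = 5         # more than this many active quarantines


def business_days_between(start: date, end: date) -> int:
    """Count weekdays after `start` up to and including `end`."""
    days = 0
    current = start
    while current < end:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 ... Friday=4
            days += 1
    return days


def quarantine_status(quarantined_on: date, today: date,
                      has_exception: bool = False) -> str:
    """Classify a quarantined test against the policy deadlines."""
    elapsed = business_days_between(quarantined_on, today)
    if elapsed > MAX_QUARANTINE_DAYS and not has_exception:
        return "exception-required"
    if elapsed > FIX_WINDOW_DAYS:
        return "overdue"
    return "within-window"


def needs_reliability_sprint(active_quarantines: int) -> bool:
    """Apply the per-service KPI trigger."""
    return active_quarantines > SPRINT_TRIGGER
```

A weekly job could run `quarantine_status` over every test tagged @quarantined and feed the results into the QA Guild flake review.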

End of document.
