Testing Strategy
:::info Source
Sourced from docs/16-testing-strategy-qa.md in the documentation repo.
:::
Document status: Canonical specification, v1.0
Owner: Platform QA Guild & Chief Architect
Related specs: 01-enterprise-architecture, 02-ddd-bounded-contexts, 03-microservices, 04-event-driven-architecture, 05-api-design, 06-traceability-matrix, 10-authoring-tool-spec, 11-lms-runtime-player-spec, 12-data-models, 13-security-compliance-tenancy, 14-risks-and-tradeoffs
Table of Contents
- Purpose, Scope & Quality Philosophy
- Testing Principles & Non-Negotiables
- The Ghasi Test Pyramid (Extended)
- Unit Testing — Domain Layer
- Integration Testing — Service Layer
- Contract Testing — APIs & Events
- End-to-End Testing — User Journeys
- Offline Testing
- AI Testing (Prompt, Safety, Hallucination, Structured Output)
- Load, Performance & Scalability Testing
- Accessibility Testing — WCAG 2.2 AA
- Security Testing — SAST, DAST, Pen-Test, AuthZ Matrix
- Chaos & Resilience Testing
- Replay Testing — Event Log Rebuild
- Multi-Device Sync Testing
- Testing Environments & Data Management
- CI/CD Quality Gates
- Coverage Expectations per Service
- Tooling Matrix
- Governance, Ownership & RACI
- Appendix A — Test IDs & Traceability
- Appendix B — Flake Policy & Quarantine
1. Purpose, Scope & Quality Philosophy
Ghasi-edTech is a multi-tenant, AI-first, offline-first, event-driven education platform serving learners, instructors, authors, and administrators across heterogeneous devices (mobile, tablet, desktop, low-end Android, kiosk). Quality assurance on a system of this complexity cannot be reduced to "does it build and do the tests pass." It must be a layered, continuously executed evidence system that proves:
- Correctness — The domain behaves as specified in 02-ddd-bounded-contexts.
- Safety — AI outputs cannot harm learners (harassment, hallucinated facts, PII leakage, curriculum drift).
- Resilience — The system degrades gracefully under failure, network loss, or adversarial load.
- Portability of state — A learner's progress, attempts, and artifacts survive device loss, network partitions, and bundle corruption.
- Isolation — Tenants cannot observe, influence, or exfiltrate each other's data (per 13-security-compliance-tenancy).
- Accessibility — No learner is excluded by disability, device class, or connectivity.
- Reproducibility — Any historical state can be rebuilt from the event log (04-event-driven-architecture).
1.1 Scope
This document governs testing for:
- All 18 microservices enumerated in 03-microservices.
- All frontends: Learner Web, Learner Mobile (iOS/Android), Instructor Console, Authoring Tool, Admin Portal, Public Marketing Site.
- All AI pipelines: tutor, content-generation, rubric-grading, summarization, translation, accessibility-captioning, adaptive-path.
- All event producers, consumers, and projections.
- The offline runtime (11-lms-runtime-player-spec).
- All infrastructure-as-code (Terraform, Helm, Crossplane).
1.2 Out of Scope
- Manufacturer-level hardware testing of tablets distributed to schools (vendor-owned).
- Legal/regulatory sign-off of content (editorial, not QA).
- Third-party vendor SaaS internals (Stripe, Clerk, OpenAI, Anthropic) — we test our contracts with them, not their systems.
1.3 Quality Philosophy
"Every production incident is a missing test. Every missing test is a missed risk conversation."
We adopt four guiding stances:
- Tests as executable specifications. Domain tests read like the ubiquitous language.
- Shift-left, shift-right equally. Pre-merge gates catch regressions; production observability catches emergent behavior; both feed the backlog.
- AI is a first-class citizen of the test strategy. Non-determinism does not excuse non-testing — it raises the bar.
- Offline is not a feature; it is a tier-0 invariant. Any test plan that only works online is incomplete.
2. Testing Principles & Non-Negotiables
2.1 Principles
| # | Principle | Implication |
|---|---|---|
| P1 | Tests are first-class code | Reviewed, refactored, owned by the producing team; never "thrown over the wall" to QA. |
| P2 | Deterministic by default | Flakes are bugs; see Appendix B. |
| P3 | One assertion concept per test | Multiple expect lines are fine; multiple concepts are not. |
| P4 | Arrange-Act-Assert, always | Readability > brevity. |
| P5 | Test behavior, not implementation | Private methods are never tested directly. |
| P6 | Real dependencies where cheap | Testcontainers for Postgres, Redis, Kafka, S3 (MinIO). Mocks only at hard boundaries (payments, external LLMs). |
| P7 | Data builders over fixtures | aLearner().enrolledIn(course).withAttempts(3).build() beats JSON blobs. |
| P8 | Tests must fail loudly | No silent try/catch; no if (err) return. |
| P9 | Coverage is a floor, not a ceiling | 80% line coverage is the minimum; 95% branch coverage for domain aggregates. |
| P10 | Every bug fix ships with a regression test | Enforced via PR template checkbox + CI grep. |
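As an illustration of P7, a data builder might look like the sketch below. The `Learner` and `Course` shapes, the default values, and the builder internals are assumptions made for this example; only the `aLearner().enrolledIn(course).withAttempts(3).build()` call chain comes from the table above.

```typescript
// Hypothetical domain shapes; the real aggregates live in the domain modules.
interface Course {
  id: string;
  title: string;
}

interface Learner {
  id: string;
  tenantId: string;
  enrollments: string[]; // course IDs
  attemptCount: number;
}

class LearnerBuilder {
  // Defaults let each test state only the facts it cares about.
  private learner: Learner = {
    id: "learner-1",
    tenantId: "tenant-a",
    enrollments: [],
    attemptCount: 0,
  };

  enrolledIn(course: Course): this {
    this.learner = {
      ...this.learner,
      enrollments: [...this.learner.enrollments, course.id],
    };
    return this;
  }

  withAttempts(count: number): this {
    if (!Number.isInteger(count) || count < 0) {
      throw new Error(`invalid attempt count: ${count}`);
    }
    this.learner = { ...this.learner, attemptCount: count };
    return this;
  }

  build(): Learner {
    // Copy on build so built objects never share mutable state.
    return { ...this.learner, enrollments: [...this.learner.enrollments] };
  }
}

const aLearner = (): LearnerBuilder => new LearnerBuilder();

const algebra: Course = { id: "course-42", title: "Algebra I" };
const builtLearner = aLearner().enrolledIn(algebra).withAttempts(3).build();
```

The builder keeps the intent of the test visible in one line, which is the main argument for preferring it over a JSON fixture.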
2.2 Non-Negotiables (merge-blocking)
- No PR merges without green CI across all stages listed in §17.
- No production deploy without a successful canary and a rollback plan verified in staging within the last 7 days.
- No new AI prompt in production without a prompt regression suite (§9.2).
- No new event schema without a consumer contract test (§6.3).
- No new public API endpoint without a Pact contract and an OpenAPI diff review.
- No schema migration without a forward+backward compatibility test and a replay test (§14).
3. The Ghasi Test Pyramid (Extended)
The classic pyramid (unit → integration → E2E) is insufficient. Ghasi uses an extended pyramid with orthogonal axes:
┌──────────────────┐
│ Chaos / Replay │ (continuous, production-like)
├──────────────────┤
│ E2E Journeys │ (~200 tests, <15 min)
├──────────────────┤
│ Contract Tests │ (per API, per event; ~500 tests)
├──────────────────┤
│ Integration │ (Testcontainers; ~2,000 tests)
├──────────────────┤
│ Unit │ (domain + utils; ~15,000 tests)
└──────────────────┘
Orthogonal: Accessibility │ Security │ AI-Eval │ Offline │ Load │ Sync
Each orthogonal axis runs against multiple tiers. For example, accessibility applies to unit (component render), E2E (axe scan), and manual (assistive-tech spot checks).
3.1 Rough Volume Targets
| Tier | Count target | Wall-clock budget | Frequency |
|---|---|---|---|
| Unit | 12k–18k | < 4 min parallelized | Every commit |
| Integration | 1.5k–2.5k | < 12 min | Every PR |
| Contract | 400–700 | < 3 min | Every PR |
| E2E | 150–250 critical paths | < 15 min | Every PR (smoke) / nightly (full) |
| Load | 30+ scenarios | Nightly + pre-release | Nightly |
| Chaos | 20+ experiments | Weekly (staging), monthly (prod) | Weekly |
| AI-eval | 1k+ prompts × models | On prompt change + nightly | Continuous |
4. Unit Testing — Domain Layer
4.1 What Counts as a Domain Unit Test
A domain unit test exercises one aggregate, entity, value object, or domain service in isolation, with zero I/O, zero time-dependence unless injected, zero randomness unless seeded.
4.2 Coverage Targets
| Layer | Line | Branch | Mutation (Stryker) |
|---|---|---|---|
| Aggregates (Course, Enrollment, AttemptSession, LearnerProgress, PaymentOrder) | 95% | 95% | ≥ 75% |
| Value objects (Score, TenantId, ContentRef, OfflineBundleHash) | 100% | 100% | ≥ 85% |
| Domain services | 90% | 90% | ≥ 70% |
| Policy/specification classes | 95% | 95% | ≥ 80% |
4.3 Patterns & Conventions
- Given/When/Then naming: `given_expired_enrollment_when_starting_attempt_then_throws_EnrollmentExpired`.
- Builders in `__builders__/` colocated with the aggregate.
- No ORM, no DB, no network, no filesystem. Tests import the pure TS/Go/Python domain module only.
- Time is injected via a `Clock` port; tests use a `FakeClock` at a fixed ISO-8601 instant.
- UUIDs are injected via `IdFactory`; tests use a deterministic `SeededIdFactory`.
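A minimal sketch of the two test doubles named above follows. The port method names (`now`, `nextId`), the `advanceBy` helper, and the ID format are assumptions for illustration; the real ports may differ.

```typescript
// Ports named in the text; the method names are assumptions for this sketch.
interface Clock {
  now(): Date;
}

interface IdFactory {
  nextId(): string;
}

// Test double: time only moves when the test says so.
class FakeClock implements Clock {
  private currentMs: number;

  constructor(isoInstant: string) {
    this.currentMs = Date.parse(isoInstant);
    if (Number.isNaN(this.currentMs)) {
      throw new Error(`invalid ISO-8601 instant: ${isoInstant}`);
    }
  }

  now(): Date {
    return new Date(this.currentMs);
  }

  advanceBy(ms: number): void {
    this.currentMs += ms;
  }
}

// Test double: the same seed and call order always yield the same IDs.
class SeededIdFactory implements IdFactory {
  private counter = 0;

  constructor(private readonly seed: string) {}

  nextId(): string {
    this.counter += 1;
    return `${this.seed}-${this.counter}`;
  }
}

// Given/When/Then naming applied to a trivial check of the fake clock.
function given_fixed_clock_when_advanced_then_now_moves_by_that_amount(): boolean {
  const clock = new FakeClock("2025-01-01T00:00:00.000Z");
  clock.advanceBy(90000);
  return clock.now().toISOString() === "2025-01-01T00:01:30.000Z";
}
```

Because neither double touches the system clock or a random source, a failing test can be re-run with identical inputs.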
4.4 Example Coverage — AttemptSession Aggregate
Invariants that MUST have dedicated tests (excerpt):
- Cannot start a new attempt if a previous attempt is in `IN_PROGRESS` and lock window has not expired.
- Cannot submit an answer after `timeLimit` has elapsed (using injected clock).
- `score` is recomputed only on transition to `SUBMITTED`.
- Re-submission of the same question version is idempotent.
- Offline-replay ordering (see §14) produces identical final state regardless of event arrival order within the same causal group.
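Three of these invariants (time limit, score on submit, idempotent re-submission) can be sketched against a toy aggregate. This is not the real `AttemptSession`: the constructor arguments, the error names, and the answer-key scoring are assumptions chosen to keep the example self-contained.

```typescript
interface Clock {
  now(): Date;
}

type AttemptStatus = "IN_PROGRESS" | "SUBMITTED";

// Toy aggregate, assumed shape. Answers are keyed by "questionId@version".
class AttemptSession {
  private status: AttemptStatus = "IN_PROGRESS";
  private readonly answers = new Map<string, string>();
  private readonly startedAtMs: number;
  private finalScore: number | null = null;

  constructor(
    private readonly clock: Clock,
    private readonly timeLimitMs: number,
    private readonly answerKey: Map<string, string>,
  ) {
    this.startedAtMs = clock.now().getTime();
  }

  submitAnswer(questionId: string, version: number, answer: string): void {
    if (this.status !== "IN_PROGRESS") throw new Error("AttemptNotInProgress");
    const elapsedMs = this.clock.now().getTime() - this.startedAtMs;
    if (elapsedMs >= this.timeLimitMs) throw new Error("TimeLimitElapsed");
    // Same question version maps to the same key, so a repeat is a no-op.
    this.answers.set(`${questionId}@${version}`, answer);
  }

  submit(): void {
    if (this.status === "SUBMITTED") return;
    this.status = "SUBMITTED";
    // Score is computed only here, on the transition to SUBMITTED.
    let correct = 0;
    this.answers.forEach((answer, key) => {
      if (this.answerKey.get(key) === answer) correct += 1;
    });
    this.finalScore = correct;
  }

  get score(): number | null {
    return this.finalScore;
  }

  get answerCount(): number {
    return this.answers.size;
  }
}

// Injected clock controlled by the test.
let attemptNowMs = Date.parse("2025-01-01T00:00:00.000Z");
const attemptClock: Clock = { now: () => new Date(attemptNowMs) };
const session = new AttemptSession(attemptClock, 60000, new Map([["q1@1", "B"]]));

session.submitAnswer("q1", 1, "B");
session.submitAnswer("q1", 1, "B"); // idempotent re-submission
const scoreBeforeSubmit = session.score;

attemptNowMs += 60000; // the time limit has now elapsed
let lateAnswerRejected = false;
try {
  session.submitAnswer("q2", 1, "A");
} catch {
  lateAnswerRejected = true;
}
session.submit();
```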
4.5 Mutation Testing
- Tool: Stryker (TS), go-mutesting (Go), mutmut (Python).
- Run frequency: Nightly on `main`; on-demand in PRs touching `domain/**`.
- Budget: Each domain module ≤ 15 minutes; otherwise sharded.
- Gate: Mutation score drop > 5 percentage points blocks merge.
5. Integration Testing — Service Layer
5.1 Definition
Integration tests exercise one service's real code against real infrastructure dependencies (Postgres, Redis, Kafka/NATS, S3, OpenSearch) running in Testcontainers. External SaaS (Stripe, Clerk, LLM providers) are replaced by recorded contract fakes (see §6).
5.2 What They Prove
- SQL queries match the schema and indexes (regression against migration drift).
- Transactional boundaries hold (including SAGA compensations for cross-service flows — §5.5).
- Event publication via the transactional outbox is correct: delivery is at-least-once, and deduplication makes processing effectively exactly-once from the consumer's perspective.
- Caching layers (Redis) are invalidated on writes.
- Row-Level-Security policies enforce tenant isolation (see 13-security-compliance-tenancy.md §4).
5.3 Environment
- Testcontainers spin up: Postgres 16, Redis 7, Kafka (KRaft mode), MinIO, OpenSearch, Mailpit.
- Containers are per-test-class (shared) with per-test transactional rollback for SQL, and topic-prefix isolation for Kafka.
- Seeds applied once per suite via Flyway/Liquibase migrations + deterministic fixture loaders.
5.4 Tenant Isolation Tests (mandatory, every service)
Every service that touches tenant data MUST ship a test file tenant_isolation.test.ts containing:
- Two tenants seeded with identical entity IDs (to catch ID-leak bugs).
- Queries executed as Tenant A MUST NOT return Tenant B rows under any filter.
- Direct row ID access (bypassing the service filter) MUST be blocked by RLS.
- Bulk/admin endpoints MUST require `X-Tenant-Scope: global` and elevated role.
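The shape of the first two checks can be sketched with an in-memory stand-in. This is only an illustration of the test structure: `TenantScopedCourses` is a hypothetical class, and a real `tenant_isolation.test.ts` would run against Postgres with RLS enabled rather than an array.

```typescript
interface CourseRow {
  tenantId: string;
  id: string;
  title: string;
}

// In-memory stand-in for a tenant-scoped table (assumption for this sketch).
class TenantScopedCourses {
  private readonly rows: CourseRow[] = [];

  insert(row: CourseRow): void {
    this.rows.push({ ...row });
  }

  // The caller's tenant is applied before any caller-supplied filter.
  find(
    callerTenantId: string,
    predicate: (row: CourseRow) => boolean = () => true,
  ): CourseRow[] {
    return this.rows
      .filter((row) => row.tenantId === callerTenantId && predicate(row))
      .map((row) => ({ ...row }));
  }

  findById(callerTenantId: string, id: string): CourseRow | undefined {
    return this.find(callerTenantId, (row) => row.id === id)[0];
  }
}

const courses = new TenantScopedCourses();
// Identical entity IDs across tenants, so an ID-only lookup would leak.
courses.insert({ tenantId: "tenant-a", id: "course-1", title: "A: Algebra" });
courses.insert({ tenantId: "tenant-b", id: "course-1", title: "B: Biology" });

const seenByTenantA = courses.find("tenant-a");
const byIdAsTenantA = courses.findById("tenant-a", "course-1");
const broadFilterAsTenantA = courses.find("tenant-a", () => true);
```

Seeding both tenants with the same `id` is what makes the test meaningful: a lookup that forgets the tenant predicate would return two rows, or the wrong one.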
5.5 SAGA and Outbox Tests
For each cross-service SAGA (e.g., Enroll → Payment → Provisioning → Notification):
- Happy path: all steps succeed, final events produced in order.
- Compensating path: inject failure at step N, assert reversal events and idempotent cleanup.
- Outbox poll crash: kill the outbox relay mid-flush; on restart, no duplicate publishes (dedupe via `event_id`).
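The outbox crash case can be sketched as follows, under the assumption that the relay delivers a record and then marks it published, so a crash between the two steps causes a redelivery that the consumer must absorb. `OutboxRecord`, `IdempotentConsumer`, and `flushOutbox` are hypothetical names for this example.

```typescript
interface OutboxRecord {
  eventId: string;
  payload: string;
  published: boolean;
}

// Consumer that dedupes on eventId, so a redelivery has no second effect.
class IdempotentConsumer {
  private readonly seen = new Set<string>();
  readonly applied: string[] = [];

  handle(record: { eventId: string; payload: string }): void {
    if (this.seen.has(record.eventId)) return;
    this.seen.add(record.eventId);
    this.applied.push(record.payload);
  }
}

// Relay flush: deliver, then mark published. A crash between the two steps
// leaves the row unpublished, so it is delivered again after restart.
function flushOutbox(
  outbox: OutboxRecord[],
  consumer: IdempotentConsumer,
  crashAfterDeliveries?: number,
): void {
  let deliveries = 0;
  for (const record of outbox) {
    if (record.published) continue;
    consumer.handle(record);
    deliveries += 1;
    if (crashAfterDeliveries !== undefined && deliveries === crashAfterDeliveries) {
      throw new Error("relay crashed mid-flush");
    }
    record.published = true;
  }
}

const outboxRows: OutboxRecord[] = [
  { eventId: "e1", payload: "enrolled", published: false },
  { eventId: "e2", payload: "paid", published: false },
  { eventId: "e3", payload: "provisioned", published: false },
];
const outboxConsumer = new IdempotentConsumer();

let relayCrashed = false;
try {
  flushOutbox(outboxRows, outboxConsumer, 2); // crash after delivering e2
} catch {
  relayCrashed = true;
}
flushOutbox(outboxRows, outboxConsumer); // restart: e2 is redelivered, e3 is new
```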
6. Contract Testing — APIs & Events
Contracts are the load-bearing beams of a distributed system. We use consumer-driven contracts for both HTTP APIs and events.
6.1 HTTP Contract Testing (Pact)
- Consumer side: Each frontend and each downstream service publishes a Pact file to the broker on every PR.
- Provider side: Services verify against all current consumer contracts in CI; failure blocks merge.
- OpenAPI alignment: A generator validates that each Pact interaction is covered by the OpenAPI spec, and vice-versa — no undocumented endpoints, no unused OpenAPI paths.
- Breaking change policy: Removing or narrowing a field requires a versioned endpoint (`/v2/...`) and a 90-day deprecation window.
6.2 GraphQL Contract Testing
- Federation subgraphs run `rover subgraph check` against the published supergraph schema.
- Consumer queries are extracted from frontend builds and replayed as persisted-query tests.
6.3 Event Contract Testing
Events are versioned via Avro schemas in the Schema Registry (Confluent-compatible). Each event topic has:
- `schema-compatibility: BACKWARD_TRANSITIVE` (consumers can read any historical version).
- A producer contract test that publishes a golden sample per version and asserts schema-registry acceptance.
- A consumer contract test per downstream service that replays golden samples and asserts projection correctness.
6.4 Event Catalog Tests
The event catalog (see 04-event-driven-architecture.md §6) is machine-readable YAML. CI asserts:
- Every event produced in code appears in the catalog.
- Every catalog entry has ≥1 producer test and ≥1 consumer test.
- Every event has a documented retention, PII classification, and replay eligibility flag.
7. End-to-End Testing — User Journeys
7.1 Canonical Journeys (must always be green on main)
| ID | Journey | Primary actor | Tooling |
|---|---|---|---|
| J-01 | Learner signs up → enrolls in free course → completes first lesson | Learner | Playwright |
| J-02 | Learner purchases paid course → receives receipt → starts course | Learner | Playwright + Stripe test clock |
| J-03 | Learner downloads bundle → goes airplane mode → completes quiz → syncs | Learner | Playwright + device emulation |
| J-04 | Instructor creates assignment → grades with AI-assist → publishes feedback | Instructor | Playwright |
| J-05 | Author builds course (drag-drop, AI-generate, preview) → publishes v1 → edits v2 | Author | Playwright |
| J-06 | Admin provisions tenant → configures SSO → invites cohort | Admin | Playwright |
| J-07 | Learner asks AI tutor a question → receives cited answer → flags hallucination | Learner | Playwright + AI-eval harness |
| J-08 | Cohort teacher runs live session → projects results → exports gradebook | Instructor | Playwright |
| J-09 | Learner on low-end Android completes lesson offline for 7 days → reconnects | Learner | Playwright + Android emulator |
| J-10 | Tenant requests data export (GDPR Art. 20) → receives archive → confirms erasure | Admin/DPO | Playwright + archive validator |
7.2 E2E Principles
- Run against deployed staging, not local docker-compose, for PR smoke.
- Seeded tenants are prefixed `e2e-<runId>-` and cleaned within 1 hour by a janitor job.
- No shared state between tests; each test creates and destroys its own learner/cohort.
- Visual regression via Playwright snapshots at viewport widths of 320, 768, 1024, and 1440 px for all learner-facing pages.
- Trace + video + screenshot always captured, uploaded to artifact store, linked from failure report.
7.3 Performance Budget (inside E2E)
- Lighthouse assertions inside Playwright tests for key landing, dashboard, and player pages. See §10 for thresholds.
8. Offline Testing
Offline is a tier-0 invariant. A single offline regression is a P0 incident.
8.1 Scenarios
| ID | Scenario | Expected behavior |
|---|---|---|
| O-01 | Airplane mode mid-lesson | Player continues; progress queued; UI shows offline badge. |
| O-02 | 7-day offline streak | All queued events persisted; no data loss on sync. |
| O-03 | Bundle tamper (hash mismatch) | Player refuses to load; user prompted to re-download; incident logged. |
| O-04 | Clock skew (device time in the past) | Events accepted server-side via logical clocks; no silent re-ordering. |
| O-05 | Conflict: same attempt edited on two devices | Deterministic resolution per CRDT/LWW rules in 12-data-models.md §8. |
| O-06 | Storage full on device | Graceful degradation: evict LRU bundles; warn user; never lose unsynced events. |
| O-07 | Background sync killed by OS | Resume on next app open; no duplicate submissions. |
| O-08 | Partial bundle download | Resumable via range requests; hash verified before activation. |
| O-09 | Expired offline license | Player blocks new attempts; allows sync of existing queued events. |
| O-10 | Cross-device restore from account recovery | Learner sees same progress on new device after sync. |
8.2 Tooling
- Playwright with `context.setOffline(true)` for web.
- Detox + Android emulator airplane-mode toggle for mobile.
- Chaos Monkey for networks: `toxiproxy` to simulate latency, packet loss, bandwidth caps (2G, 3G, flaky Wi-Fi).
- Bundle fuzzer: bit-flip + truncation + metadata tampering harness that must always be rejected.
8.3 Sync Conflict Test Matrix
Every mergeable entity (progress, notes, bookmarks, assignment drafts) has a matrix:
| Device B ↓ / Device A → | add | update | delete |
|---|---|---|---|
| add | M1 | M2 | M3 |
| update | M2 | M4 | M5 |
| delete | M3 | M5 | M6 |
Each cell (M1–M6) has at least 2 tests: one where A wins by timestamp, one where B wins. Expected resolutions are documented in 12-data-models.md §8.4.
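The "A wins by timestamp / B wins" pair of tests per cell can be generated rather than hand-written. The sketch below assumes plain last-writer-wins with a device-ID tie-break; the authoritative resolution rules are the ones in 12-data-models.md §8.4, and `resolveLww` and the change shape are hypothetical.

```typescript
type ChangeType = "add" | "update" | "delete";
type DeviceId = "A" | "B";

interface DeviceChange {
  device: DeviceId;
  type: ChangeType;
  value: string | null;
  timestamp: number;
}

// Last-writer-wins: the later timestamp wins; ties are broken by device ID
// so the outcome does not depend on argument order.
function resolveLww(left: DeviceChange, right: DeviceChange): DeviceChange {
  if (left.timestamp !== right.timestamp) {
    return left.timestamp > right.timestamp ? left : right;
  }
  return left.device < right.device ? left : right;
}

const changeTypes: ChangeType[] = ["add", "update", "delete"];

const makeChange = (device: DeviceId, type: ChangeType, timestamp: number): DeviceChange => ({
  device,
  type,
  timestamp,
  value: type === "delete" ? null : `${device}-${type}`,
});

interface MatrixCase {
  aType: ChangeType;
  bType: ChangeType;
  winnerWhenALater: DeviceId;
  winnerWhenBLater: DeviceId;
}

// 3 x 3 ordered pairs cover the six symmetric cells M1 to M6, each with
// one "A later" and one "B later" case.
const matrixCases: MatrixCase[] = [];
for (const aType of changeTypes) {
  for (const bType of changeTypes) {
    matrixCases.push({
      aType,
      bType,
      winnerWhenALater: resolveLww(makeChange("A", aType, 200), makeChange("B", bType, 100)).device,
      winnerWhenBLater: resolveLww(makeChange("A", aType, 100), makeChange("B", bType, 200)).device,
    });
  }
}
```

Entity-specific rules (for example, whether a delete should beat a later update) would replace `resolveLww`, while the case generation stays the same.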
9. AI Testing
AI testing is the most novel and the most risky surface. We treat it with the same rigor as security testing.
9.1 Categories
| Category | What it catches |
|---|---|
| Prompt regression | Output drift across prompt or model version changes. |
| Safety (harms) | Harassment, self-harm encouragement, age-inappropriate content. |
| Hallucination | Fabricated citations, invented facts, wrong math. |
| Structured output | Invalid JSON, missing required fields, wrong types. |
| PII leakage | Names, emails, tokens, embedded training data leaking into output. |
| Jailbreak resistance | Prompt injection from user content (assignment text, PDF OCR, uploaded images). |
| Cost / latency | Token usage, p95 latency, provider-side error rates. |
| Bias & fairness | Differential output quality across demographics, languages, dialects. |
| Curriculum drift | Output contradicts published curriculum standards. |
9.2 Prompt Regression
- Golden set: ≥ 1,000 prompts per AI feature (tutor, grader, generator, summarizer), curated by subject-matter experts and labeled with expected answer classes.
- Scoring:
- Deterministic checks: JSON schema, required citations, forbidden tokens.
- LLM-as-judge: a separate, cheaper model scores similarity to expected answer along rubric dimensions (accuracy, helpfulness, tone).
- Human-in-the-loop: 5% random sample reviewed weekly.
- Gate: A new prompt/model must score ≥ parity on the golden set before replacing the current version. Statistical test: bootstrapped 95% CI on delta.
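One way to compute the bootstrapped 95% CI on the score delta is a paired percentile bootstrap, sketched below. The pairing of scores per golden prompt, the seeded generator, the resample count, and the `tolerance` parameter of the gate are all assumptions of this example, not the harness's actual configuration.

```typescript
// Small seeded PRNG (mulberry32-style) so the bootstrap is reproducible in CI.
function seededRandom(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

interface DeltaCi {
  mean: number;
  lower: number;
  upper: number;
}

// Paired percentile bootstrap on per-prompt score deltas (candidate - baseline).
function bootstrapDeltaCi(
  baseline: number[],
  candidate: number[],
  resamples: number,
  seed: number,
): DeltaCi {
  if (baseline.length === 0 || baseline.length !== candidate.length) {
    throw new Error("paired, non-empty score arrays are required");
  }
  const deltas = candidate.map((score, i) => score - baseline[i]);
  const random = seededRandom(seed);
  const means: number[] = [];
  for (let r = 0; r < resamples; r++) {
    let sum = 0;
    for (let i = 0; i < deltas.length; i++) {
      sum += deltas[Math.floor(random() * deltas.length)];
    }
    means.push(sum / deltas.length);
  }
  means.sort((x, y) => x - y);
  return {
    mean: deltas.reduce((acc, d) => acc + d, 0) / deltas.length,
    lower: means[Math.floor(0.025 * (resamples - 1))],
    upper: means[Math.ceil(0.975 * (resamples - 1))],
  };
}

// Assumed gate: the lower CI bound must not fall below -tolerance.
function passesParityGate(ci: DeltaCi, tolerance = 0): boolean {
  return ci.lower >= -tolerance;
}

// Synthetic scores for illustration only.
const baselineScores = Array.from({ length: 200 }, (_, i) => 0.1 + (i % 8) / 10);
const improvedScores = baselineScores.map((score) => score + 0.05);
const regressedScores = baselineScores.map((score) => score - 0.05);

const improvedCi = bootstrapDeltaCi(baselineScores, improvedScores, 1000, 42);
const regressedCi = bootstrapDeltaCi(baselineScores, regressedScores, 1000, 42);
```

Whether "parity" means a lower bound at or above zero or a small non-inferiority margin is a policy decision; the `tolerance` argument only makes that choice explicit.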
9.3 Safety Testing
- Red-team corpus: 5,000+ adversarial prompts covering self-harm, hate, sexual content, violence, drugs, weapons, cheating, and education-specific attacks ("write my essay", "solve my test for me and hide it").
- Every model deploy: 100% of corpus re-run; zero policy violations permitted. Violations block deploy.
- Policy coverage: Every safety policy (internal doc `SAFETY-001` through `SAFETY-037`) has ≥ 10 probes.
9.4 Hallucination Detection
- Citation verifier: Every AI tutor answer with a citation is validated — the cited source must exist in the retrieval index, and the cited passage must contain a semantically similar sentence (bi-encoder similarity > 0.75).
- Fact checker sub-chain: For math/science answers, a deterministic verifier (e.g., SymPy for algebra) re-checks the AI's final numeric answer.
- Uncertainty escalation: Answers below a confidence threshold trigger a "not sure, routing to teacher" fallback; tests assert this path fires on known ambiguous prompts.
9.5 Structured Output Validation
- All AI outputs consumed programmatically MUST declare a JSON schema (Zod/Pydantic/Ajv).
- Failure test: Inject adversarial model outputs (truncated JSON, trailing comments, wrong types); system MUST retry-with-repair then fall back to a safe default. Silent coercion is forbidden.
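The retry-with-repair path can be exercised with a scripted fake model, as in this sketch. It uses a hand-written validator instead of Zod/Pydantic/Ajv to stay dependency-free; `RubricGrade`, `gradeWithRepair`, the repair hint text, and the `needs_review` fallback are assumptions for illustration.

```typescript
interface RubricGrade {
  score: number;
  feedback: string;
}

type GradeOutcome =
  | { status: "graded"; grade: RubricGrade; attempts: number }
  | { status: "needs_review"; attempts: number }; // safe default: route to a human

// Strict validation: malformed JSON and wrong types are rejected, never coerced.
function parseGrade(raw: string): RubricGrade | null {
  let value: unknown;
  try {
    value = JSON.parse(raw);
  } catch {
    return null;
  }
  if (typeof value !== "object" || value === null) return null;
  const candidate = value as { score?: unknown; feedback?: unknown };
  if (typeof candidate.score !== "number" || !Number.isFinite(candidate.score)) return null;
  if (typeof candidate.feedback !== "string") return null;
  return { score: candidate.score, feedback: candidate.feedback };
}

// The model is called with an optional repair hint after a failed validation.
type ModelCall = (repairHint: string | null) => string;

function gradeWithRepair(callModel: ModelCall, maxRepairs: number): GradeOutcome {
  let hint: string | null = null;
  for (let attempt = 1; attempt <= maxRepairs + 1; attempt++) {
    const parsed = parseGrade(callModel(hint));
    if (parsed !== null) return { status: "graded", grade: parsed, attempts: attempt };
    hint = "Previous output failed schema validation; return only JSON with score:number and feedback:string.";
  }
  return { status: "needs_review", attempts: maxRepairs + 1 };
}

// Fake model that replays adversarial outputs in order (test helper).
function scriptedModel(outputs: string[]): ModelCall {
  let index = 0;
  return () => outputs[Math.min(index++, outputs.length - 1)];
}

const repairedOutcome = gradeWithRepair(
  scriptedModel(['{"score": 4, "feed', '{"score": 4, "feedback": "ok"}']),
  2,
);
const fallbackOutcome = gradeWithRepair(
  scriptedModel(['{"score": "4", "feedback": "ok"}', '{"score": 4, "feedback": "ok"} // done']),
  2,
);
```

The second script never validates: a string `"4"` is a wrong type, and a trailing comment makes the payload invalid JSON, so the outcome is the fallback rather than a coerced grade.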
9.6 PII Leakage
- Canary data: Seeded synthetic PII strings (`canary-pii-XXXX`) in training-adjacent logs and retrieval sources.
- Test: Daily probe suite asks the model innocuous questions; if any canary surfaces in output, a P1 incident is opened.
9.7 Jailbreak / Prompt Injection
- Corpus of 2,000+ injection attempts embedded in user-uploaded content (PDFs, images via OCR, assignment text).
- Assertion: System prompt and tool-access invariants hold; assistant never reveals system prompt, never exfiltrates other users' data, never calls destructive tools.
9.8 Eval Infrastructure
- Harness: Internal `ghasi-evals` framework built on Inspect / promptfoo conventions.
- Storage: Eval runs versioned in object storage with prompt hash, model ID, parameters, and outputs. Queryable in ClickHouse.
- Dashboards: Grafana — per-feature quality score over time, regression alerts.
10. Load, Performance & Scalability Testing
10.1 Tooling
- k6 for HTTP and gRPC (primary).
- Locust for complex stateful flows (scripted learners over long sessions).
- Gatling for a secondary, JVM-native cross-check on Payments.
10.2 Per-Service Load Scenarios
| Service | Peak target | SLO (p95) | Scenario |
|---|---|---|---|
| API Gateway | 20k rps | 80 ms | Mixed auth + unauth. |
| Identity | 2k rps sign-in | 200 ms | Password + SSO + MFA. |
| Enrollment | 1k rps | 150 ms | First-day-of-term burst × 10. |
| Progress | 10k events/s | 300 ms (end-to-end to projection) | Cohort of 50k learners simultaneously answering. |
| AI Tutor | 500 concurrent sessions | 2.5 s first token | Sustained + spike. |
| Authoring | 200 rps write | 400 ms | Bulk import of 10k questions. |
| Payments | 300 rps | 500 ms | Black-Friday-class burst. |
| Content Delivery (CDN origin) | 5k rps | 100 ms | Bundle downloads, warm + cold cache. |
| Notifications | 5k msg/s | async | Fan-out to cohort of 100k. |
10.3 System-Wide Scenarios
- First-Day-of-Term: 500k learners log in within 1 hour, each enrolls in 3 courses, each starts 1 lesson. Success: 5xx rate ≤ 0.1%, p95 latency within each service's SLO.
- Global Event Storm: 1M progress events/minute for 10 minutes. Success: no consumer lag > 30 s; no data loss.
- Graceful Degradation: Kill the AI provider; assert the tutor degrades to a clear fallback UX while the rest of the platform stays available; no cascading failures.
10.4 Frontend Performance Budgets
Per [web/performance.md], enforced in CI via Lighthouse CI:
| Page | LCP | INP | CLS | JS (gzip) |
|---|---|---|---|---|
| Marketing landing | 2.0 s | 150 ms | 0.05 | 120 KB |
| Learner dashboard | 2.5 s | 200 ms | 0.10 | 280 KB |
| Player (lesson) | 2.0 s | 150 ms | 0.05 | 220 KB |
| Authoring canvas | 3.0 s | 250 ms | 0.10 | 600 KB (tooling-heavy, justified) |
11. Accessibility Testing — WCAG 2.2 AA
11.1 Commitment
Ghasi targets WCAG 2.2 Level AA across all learner- and instructor-facing surfaces, with Level AAA aspiration for the Player (lesson runtime).
11.2 Layered Approach
| Layer | Tool | Coverage |
|---|---|---|
| Static | eslint-plugin-jsx-a11y | Every PR |
| Component | @storybook/addon-a11y + axe-core | Every story, every variant |
| E2E | axe-playwright scan per E2E journey | Every PR |
| Keyboard | Scripted tab-order + focus-trap tests | Every interactive component |
| Screen reader | Manual NVDA/VoiceOver/TalkBack audits | Per release |
| Low-vision | Contrast, zoom 200%, reflow at 320 CSS px | Per release |
| Cognitive | Reading-level + plain-language review | Per content release (editorial) |
| Motor | Target size ≥ 24×24 CSS px (2.5.8), no hover-only | Linter + manual |
11.3 Required Assertions
- All interactive elements reachable by keyboard.
- Focus ring visible and meets 3:1 contrast against adjacent colors.
- `prefers-reduced-motion` honored (all motion in §10-adjacent UI).
- Captions on all instructional video; transcripts downloadable.
- Math rendered via MathML + ARIA label fallback (not image-only).
- Error messages programmatically associated with fields (`aria-describedby`).
- Language of page and parts declared (`lang` attribute per passage for multilingual content).
11.4 Gate
Axe violations at `serious` or `critical` severity block merge. A `moderate` violation requires a triage ticket within 48 h.
12. Security Testing
12.1 SAST (Static)
- Tools: Semgrep (custom ruleset + OWASP Top 10), CodeQL, gitleaks, trivy-config.
- Gate: Any `high` or `critical` finding blocks merge. `medium` opens a ticket.
- Custom rules: No `dangerouslySetInnerHTML`; no raw SQL concatenation; no `eval`; no `child_process.exec` with user input; no import of deprecated crypto.
12.2 SCA (Dependencies)
- Tools: Dependabot + Snyk + OSV-Scanner.
- Policy: Criticals patched within 24 h; highs within 7 d; mediums within 30 d.
- Supply-chain: SBOM (CycloneDX) produced per build; artifact signing via Sigstore/cosign; provenance via SLSA level 3.
12.3 DAST (Dynamic)
- Tools: OWASP ZAP (authenticated scans), Burp Suite Enterprise.
- Cadence: Nightly against staging; pre-release against pre-prod.
- Coverage: Every public endpoint + representative authenticated endpoints per role.
12.4 Pen-Testing
- External firm engaged quarterly with rotating scope: external perimeter, internal tenant isolation, mobile app, AI surfaces (prompt injection, model extraction).
- Internal red team runs a monthly exercise mirroring a realistic attacker TTP chain; blue team must detect and respond.
- Bug bounty: Public program (scoped) with clear SLA for response.
12.5 AuthZ Matrix Testing
Every role × resource × action combination is expressed as a policy decision test against the OPA/Cedar policy engine:
roles: [anonymous, learner, instructor, author, tenant-admin, super-admin, dpo, support]
resources: [Course, Lesson, Attempt, Grade, Cohort, Tenant, BillingAccount, UserPII, AuditLog, AIPrompt]
actions: [read, list, create, update, delete, export, impersonate, configure]
Matrix size ≈ 8 × 10 × 8 = 640 cells. Each cell declares allow or deny; tests assert both directions. A newly added role or resource cannot ship without filling its column/row.
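A completeness check over the matrix can be sketched as below. The role, resource, and action lists are copied from the text; the string cell key, the `decide` helper, and the example allow rule are assumptions, and a real suite would query the OPA/Cedar engine instead of a `Map`.

```typescript
const roles = ["anonymous", "learner", "instructor", "author", "tenant-admin", "super-admin", "dpo", "support"] as const;
const resources = ["Course", "Lesson", "Attempt", "Grade", "Cohort", "Tenant", "BillingAccount", "UserPII", "AuditLog", "AIPrompt"] as const;
const actions = ["read", "list", "create", "update", "delete", "export", "impersonate", "configure"] as const;

type Role = (typeof roles)[number];
type Resource = (typeof resources)[number];
type Action = (typeof actions)[number];
type Decision = "allow" | "deny";

const cellKey = (role: Role, resource: Resource, action: Action): string =>
  `${role}|${resource}|${action}`;

// Enumerates every role x resource x action cell: 8 x 10 x 8 = 640.
function allCells(): string[] {
  const cells: string[] = [];
  for (const role of roles) {
    for (const resource of resources) {
      for (const action of actions) {
        cells.push(cellKey(role, resource, action));
      }
    }
  }
  return cells;
}

// Cells with no explicit allow/deny; a non-empty result should fail CI.
function missingCells(declared: Map<string, Decision>): string[] {
  return allCells().filter((key) => !declared.has(key));
}

// Undeclared cells are an error, never an implicit deny.
function decide(declared: Map<string, Decision>, role: Role, resource: Resource, action: Action): Decision {
  const decision = declared.get(cellKey(role, resource, action));
  if (decision === undefined) throw new Error(`undeclared cell: ${cellKey(role, resource, action)}`);
  return decision;
}

// Hypothetical declarations: explicit deny everywhere, then explicit allows.
const declaredMatrix = new Map<string, Decision>();
for (const key of allCells()) declaredMatrix.set(key, "deny");
declaredMatrix.set(cellKey("learner", "Course", "read"), "allow");

const gapsWhenComplete = missingCells(declaredMatrix);
declaredMatrix.delete(cellKey("support", "AuditLog", "export"));
const gapsAfterRemoval = missingCells(declaredMatrix);
```

Asserting both `allow` and `deny` cells through `decide` covers "both directions", and the `missingCells` check is what stops a new role or resource from shipping with unfilled cells.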
12.6 Threat-Model-Driven Tests
For every threat in the STRIDE model maintained in 13-security-compliance-tenancy.md, at least one test case exists proving the mitigation works.
13. Chaos & Resilience Testing
13.1 Principles
- Run chaos in staging continuously, in production monthly with a blast-radius limit.
- Every chaos experiment has a written hypothesis, a measurable steady state, and a rollback.
13.2 Experiment Catalog (excerpt)
| ID | Experiment | Hypothesis | Blast radius |
|---|---|---|---|
| C-01 | Kill one Progress service pod | ≤ 0.5% increase in 5xx for 30 s | 1 pod |
| C-02 | 50% packet loss between Gateway and Identity | Retries absorb; auth p99 < 2 s | one AZ pair |
| C-03 | Partition Kafka broker | Producers buffer via outbox; no data loss | one broker |
| C-04 | CPU stress on all AI workers | Tutor degrades to cached responses; non-AI unaffected | AI pool only |
| C-05 | Redis primary failover | Session cache rebuilds within 60 s; no user-visible logouts | staging only |
| C-06 | Postgres read replica lag 30 s | Read-your-writes queries route to primary; no stale reads surfaced to learners mid-attempt | staging only |
| C-07 | LLM provider 500s for 10 min | Circuit breaker opens; fallback model engaged; eval-latency SLO holds | AI pool only |
| C-08 | CDN origin offline | Bundle downloads route to secondary; no 404s surface | staging only |
13.3 Tooling
- LitmusChaos or Chaos Mesh in Kubernetes.
- AWS Fault Injection Simulator for managed services.
- toxiproxy for deterministic network faults in integration tests.
14. Replay Testing — Event Log Rebuild
Per 04-event-driven-architecture, any projection/read-model is reproducible from the event log. This is a compliance and recovery guarantee, not an optimization.
14.1 Guarantees Tested
- Deterministic replay: Rebuilding a projection from event index 0 produces a byte-identical (or schema-equivalent) result to the live projection.
- Partial replay: Replaying from any checkpoint reconstructs correct state.
- Out-of-order safety: Replaying events in a shuffled-but-causally-valid order produces the same final state (for CRDT/idempotent projections) or rejects (for strict-ordering projections).
- Schema migration: Old event schemas are readable by current consumers via registry + upcasters.
- PII redaction on replay: Events marked `PII:erased` (post-GDPR-erasure) MUST replay with tombstoned fields; downstream state must reflect erasure.
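The first three guarantees can be illustrated with a pure reducer over a toy projection. `LessonCompleted`, the set-based projection, and the canonical serialization used for comparison are assumptions of this sketch; strict-ordering projections would reject a shuffled log instead of converging.

```typescript
interface LessonCompleted {
  learnerId: string;
  lessonId: string;
}

// Toy projection: learnerId -> set of completed lessons. Set semantics make
// the reducer idempotent and order-insensitive.
type ProgressView = Map<string, Set<string>>;

// Pure reducer: never mutates the previous state.
function applyEvent(state: ProgressView, event: LessonCompleted): ProgressView {
  const next = new Map(state);
  const lessons = new Set<string>(next.get(event.learnerId) ?? []);
  lessons.add(event.lessonId);
  next.set(event.learnerId, lessons);
  return next;
}

function replay(events: LessonCompleted[], from: ProgressView = new Map()): ProgressView {
  return events.reduce(applyEvent, from);
}

// Canonical serialization so two projections can be compared byte for byte.
function canonical(state: ProgressView): string {
  const entries = Array.from(state.entries())
    .map(([learnerId, lessons]): [string, string[]] => [learnerId, Array.from(lessons).sort()])
    .sort((x, y) => (x[0] < y[0] ? -1 : x[0] > y[0] ? 1 : 0));
  return JSON.stringify(entries);
}

const eventLog: LessonCompleted[] = [
  { learnerId: "l1", lessonId: "intro" },
  { learnerId: "l2", lessonId: "intro" },
  { learnerId: "l1", lessonId: "fractions" },
  { learnerId: "l1", lessonId: "intro" }, // duplicate delivery
  { learnerId: "l2", lessonId: "decimals" },
];

const fullRebuild = replay(eventLog);
const checkpoint = replay(eventLog.slice(0, 2));
const resumedFromCheckpoint = replay(eventLog.slice(2), checkpoint);
const shuffledRebuild = replay([...eventLog].reverse());
```

The same comparison via `canonical` is what a nightly rebuild job could use to diff a rebuilt projection against the live one.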
14.2 Harness
- Nightly job: Rebuild every projection from scratch in a sandboxed cluster; diff against live; any drift is a P1.
- Per-PR job: For changed projections, rebuild from a canned 10k-event fixture and assert equality.
15. Multi-Device Sync Testing
15.1 Why It Warrants Its Own Section
Ghasi learners routinely use 2–3 devices (phone + tablet + school PC). Sync bugs are invisible in single-device test suites.
15.2 Scenarios
- Simultaneous edit: Same learner edits notes on phone and tablet within 2 s; both devices converge to the same final state within 10 s of reconnection.
- Staggered edit: Phone edits offline for 3 days; tablet edits online; merge on reconnection respects documented precedence.
- Device decommission: Learner signs out of phone; queued unsynced events must flush before logout completes (or be migrated to new device via account-bound queue).
- Clock skew: Phone clock 10 min ahead; sync still ordered correctly via logical clocks (Lamport/HLC).
- Partial-sync visibility: Tablet shows a progress indicator during large sync; never presents partially-synced state as authoritative.
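The clock-skew scenario can be reproduced with a small hybrid logical clock (HLC). This sketch follows the usual HLC update rules; the class, field names, and the node-ID tie-break in `compareHlc` are assumptions, not the platform's sync implementation.

```typescript
interface HlcTimestamp {
  wallMs: number;
  counter: number;
  nodeId: string;
}

class HybridLogicalClock {
  private wallMs = 0;
  private counter = 0;

  constructor(
    private readonly nodeId: string,
    private readonly physicalNow: () => number,
  ) {}

  // Local or send event.
  tick(): HlcTimestamp {
    const physical = this.physicalNow();
    if (physical > this.wallMs) {
      this.wallMs = physical;
      this.counter = 0;
    } else {
      this.counter += 1;
    }
    return this.snapshot();
  }

  // Receive event: merge the remote timestamp so later local events sort after it.
  receive(remote: HlcTimestamp): HlcTimestamp {
    const physical = this.physicalNow();
    const maxWall = Math.max(this.wallMs, remote.wallMs, physical);
    if (maxWall === this.wallMs && maxWall === remote.wallMs) {
      this.counter = Math.max(this.counter, remote.counter) + 1;
    } else if (maxWall === this.wallMs) {
      this.counter += 1;
    } else if (maxWall === remote.wallMs) {
      this.counter = remote.counter + 1;
    } else {
      this.counter = 0;
    }
    this.wallMs = maxWall;
    return this.snapshot();
  }

  private snapshot(): HlcTimestamp {
    return { wallMs: this.wallMs, counter: this.counter, nodeId: this.nodeId };
  }
}

// Total order: wall time, then counter, then node ID as a final tie-break.
function compareHlc(a: HlcTimestamp, b: HlcTimestamp): number {
  if (a.wallMs !== b.wallMs) return a.wallMs - b.wallMs;
  if (a.counter !== b.counter) return a.counter - b.counter;
  return a.nodeId < b.nodeId ? -1 : a.nodeId > b.nodeId ? 1 : 0;
}

// Phone clock runs 10 minutes ahead of the tablet.
const trueNowMs = Date.parse("2025-01-01T00:00:00.000Z");
const phonePhysicalMs = trueNowMs + 10 * 60 * 1000;
let tabletPhysicalMs = trueNowMs + 1000;

const phoneClock = new HybridLogicalClock("phone", () => phonePhysicalMs);
const tabletClock = new HybridLogicalClock("tablet", () => tabletPhysicalMs);

const phoneEdit = phoneClock.tick();
tabletClock.receive(phoneEdit); // tablet syncs the phone's edit
tabletPhysicalMs += 1000;
const tabletEdit = tabletClock.tick(); // causally later edit on the tablet
```

Even though the tablet's physical clock is behind the phone's, its edit made after seeing the phone's edit still sorts later, which is the property the scenario asserts.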
15.3 Harness
- Playwright multi-context: Two browser contexts simulate two devices; a third orchestrator drives scenarios.
- Mobile: Detox drives two Android emulators in parallel under Bazel/Tuist.
- Assertion library: Custom `expectConvergedState(deviceA, deviceB, within: '10s')`.
16. Testing Environments & Data Management
16.1 Environments
| Env | Purpose | Data | Refresh |
|---|---|---|---|
| local | Developer loop | Ephemeral Testcontainers | Per-run |
| ci | Per-PR validation | Ephemeral Testcontainers + synthetic seed | Per-run |
| preview | PR review env (optional, per-service) | Scrubbed prod snapshot | Per-PR |
| staging | Integration + chaos + load | Scrubbed prod snapshot, refreshed weekly | Weekly |
| pre-prod | Release candidate validation | Full synthetic at prod scale | Per-release |
| prod | Real users | Real | n/a |
16.2 Test Data Principles
- No real PII in non-prod. Scrubbing pipeline uses deterministic pseudonymization (format-preserving encryption for IDs, Faker for names/emails, stable per-learner).
- Synthetic generators for load: `ghasi-datagen` produces realistic cohorts, courses, and event streams at arbitrary scale.
- Fixture libraries versioned under `/fixtures/<service>/`; generated artifacts checked in for reproducibility.
- Data retention in test envs: 30 days; automatic purge; no backups.
16.3 Secrets in Tests
- Never real secrets. Each env has its own keyring (AWS Secrets Manager / Vault).
- Test secrets rotated weekly via automation.
17. CI/CD Quality Gates
17.1 Pipeline Stages
┌─────────────────────────────────────────────────────────────────────┐
│ 1. Lint + Format + Type-check (≤ 90 s) │
│ 2. Unit tests (≤ 4 min) │
│ 3. Mutation (changed files only) (≤ 3 min) │
│ 4. SAST + SCA + secrets scan (≤ 2 min) │
│ 5. Integration tests (Testcontainers) (≤ 12 min) │
│ 6. Contract tests (Pact + Avro) (≤ 3 min) │
│ 7. Build + SBOM + sign (≤ 5 min) │
│ 8. E2E smoke (10 critical journeys) (≤ 8 min) │
│ 9. Accessibility scan (≤ 3 min) │
│10. Lighthouse budgets (≤ 3 min) │
│11. Deploy to preview (optional) │
│12. DAST (nightly on staging) │
│13. Load smoke (nightly on staging) │
│14. AI-eval regression (on AI changes or nightly) │
└─────────────────────────────────────────────────────────────────────┘
17.2 Gate Policy
| Stage | Failure policy |
|---|---|
| 1–7 | Hard block — no override |
| 8 | Hard block — no override |
| 9 | Hard block for serious/critical; ticket for moderate |
| 10 | Hard block if any budget exceeded by > 10% |
| 12 | Hard block for high/critical findings; incident for medium |
| 14 | Hard block if golden-set score regresses > 2% or any safety violation |
17.3 Deployment Gates
- Canary: 5% traffic for 30 min; auto-rollback on error-rate or latency breach vs. baseline.
- Progressive: 25% → 50% → 100% over 2 h.
- Feature flags: All risky features dark-launched behind LaunchDarkly flags with ramp plan in the PR description.
17.4 Post-Deploy Verification
- Synthetic checks (uptime, critical journeys) every minute.
- SLO burn-rate alerts (2%, 5%, 10% of monthly budget) to on-call.
- Automatic rollback if error budget burns > 10% in 1 h.
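The rollback rule can be expressed as burn-rate arithmetic. The sketch assumes a request-based SLO, a 30-day (720 h) budget period, and roughly uniform traffic, so that the budget consumed in a window is the burn rate times window/period; the function names are hypothetical.

```typescript
const HOURS_PER_BUDGET_PERIOD = 30 * 24; // 720 h, assumed 30-day month

// Burn rate: how many times faster than "exactly on budget" errors accrue.
function burnRate(errorRate: number, sloTarget: number): number {
  if (sloTarget <= 0 || sloTarget >= 1) throw new Error("sloTarget must be in (0, 1)");
  return errorRate / (1 - sloTarget);
}

// Fraction of the period's error budget consumed during the window.
function budgetConsumed(
  errorRate: number,
  sloTarget: number,
  windowHours: number,
  periodHours: number = HOURS_PER_BUDGET_PERIOD,
): number {
  return burnRate(errorRate, sloTarget) * (windowHours / periodHours);
}

// Rule from the text: roll back if more than 10% of the budget burns in 1 h.
function shouldRollBack(errorRateLastHour: number, sloTarget: number): boolean {
  return budgetConsumed(errorRateLastHour, sloTarget, 1) > 0.1;
}

// With a 99.9% SLO, 10% of a 720 h budget in 1 h corresponds to a burn rate
// of 72, i.e. an hourly error rate of about 7.2%.
const rollbackAtTenPercentErrors = shouldRollBack(0.1, 0.999);
const rollbackAtFivePercentErrors = shouldRollBack(0.05, 0.999);
```

Under these assumptions the 1 h rollback trigger only fires on severe incidents; the lower 2% and 5% burn alerts exist to catch slower degradations.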
18. Coverage Expectations per Service
| Service | Unit | Integration | Contract | E2E (journeys) | AI-eval | Load SLO |
|---|---|---|---|---|---|---|
| identity | 90% | 85% | Pact + OIDC | J-01, J-06 | — | 200 ms p95 |
| tenancy | 90% | 85% | Pact | J-06 | — | 150 ms |
| catalog | 85% | 80% | Pact + GraphQL | J-01, J-02 | — | 100 ms |
| enrollment | 90% | 85% | Pact + events | J-01, J-02 | — | 150 ms |
| progress | 95% | 90% | events | J-03, J-09 | — | 300 ms end-to-end |
| attempts/grading | 95% | 90% | events | J-04 | grader-eval | 400 ms |
| authoring | 85% | 80% | Pact | J-05 | generator-eval | 400 ms |
| ai-tutor | 80% | 75% | Pact | J-07 | tutor-eval | 2.5 s TTFT |
| ai-content-gen | 80% | 75% | events | J-05 | generator-eval | async |
| payments | 95% | 90% | Pact + Stripe | J-02 | — | 500 ms |
| notifications | 85% | 80% | events | J-04 | — | async |
| content-delivery | 90% | 85% | Pact | J-03 | — | 100 ms |
| offline-sync | 95% | 90% | events | J-03, J-09 | — | 300 ms |
| analytics | 80% | 75% | events | J-08 | — | async |
| admin | 85% | 80% | Pact | J-06, J-10 | — | 300 ms |
| audit | 90% | 85% | events | J-10 | — | append-only; no data loss |
| search | 85% | 80% | Pact | J-01 | — | 200 ms |
| localization | 85% | 80% | Pact | J-01 | translation-eval | 150 ms |
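One way to make the coverage columns enforceable is to keep them as data and compare each service's measured coverage against its floor. The sketch below copies three rows from the table for illustration; `COVERAGE_FLOORS` and `coverage_shortfalls` are hypothetical names, and how the measured numbers are produced is outside its scope.

```python
# Unit and integration floors (percent) copied from three rows of the table above.
COVERAGE_FLOORS = {
    "identity": {"unit": 90, "integration": 85},
    "progress": {"unit": 95, "integration": 90},
    "ai-tutor": {"unit": 80, "integration": 75},
}


def coverage_shortfalls(service: str, measured: dict) -> dict:
    """Return {layer: (measured, required)} for every layer below its floor.

    A layer missing from `measured` is treated as 0% covered.
    """
    floors = COVERAGE_FLOORS[service]
    return {
        layer: (measured.get(layer, 0.0), required)
        for layer, required in floors.items()
        if measured.get(layer, 0.0) < required
    }
```

An empty result means the service meets both floors; a non-empty result gives the reviewer the measured and required values side by side.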
19. Tooling Matrix
| Concern | Primary | Secondary | Notes |
|---|---|---|---|
| Unit (TS) | Vitest | Jest | Vitest preferred; shared config in tooling/vitest |
| Unit (Go) | go test | testify | |
| Unit (Python) | pytest | — | pytest-xdist for parallel |
| Mutation (TS) | Stryker | — | |
| Mutation (Go) | go-mutesting | — | |
| Mutation (Python) | mutmut | — | |
| Integration | Testcontainers | docker-compose | Testcontainers first |
| Contract (HTTP) | Pact | — | Broker self-hosted |
| Contract (Events) | Schema Registry + custom harness | — | Avro-first |
| E2E | Playwright | Cypress | Playwright standard |
| Mobile E2E | Detox | Maestro | |
| Visual | Playwright snapshots | Chromatic | |
| Accessibility | axe-core + axe-playwright | Lighthouse a11y | |
| Perf (frontend) | Lighthouse CI | WebPageTest | |
| Load | k6 | Locust, Gatling | |
| Chaos | LitmusChaos | Chaos Mesh, AWS FIS | |
| Security (SAST) | Semgrep + CodeQL | — | |
| Security (DAST) | OWASP ZAP | Burp | |
| Security (SCA) | Snyk + OSV | Dependabot | |
| Secrets | gitleaks | trufflehog | |
| AI eval | ghasi-evals (on promptfoo + Inspect) | — | |
| Observability | OpenTelemetry + Grafana + ClickHouse | — | |
| Feature flags | LaunchDarkly | OpenFeature SDK | |
| Test data | ghasi-datagen | Faker | |
20. Governance, Ownership & RACI
20.1 Roles
- Service team — owns unit, integration, contract, service-level E2E, and performance SLOs for its service.
- Platform QA Guild — owns cross-service E2E, load/chaos harnesses, flake policy, and tooling.
- AI Safety team — owns safety corpus, eval harness, red-team.
- Security team — owns SAST/DAST/SCA tooling, authz matrix enforcement, pen-test coordination.
- Accessibility team — owns axe ruleset, assistive-tech audits.
- SRE — owns chaos experiments in prod, SLO burn alerts, rollback automation.
20.2 RACI (excerpt)
| Activity | Service team | QA Guild | AI Safety | Security | SRE |
|---|---|---|---|---|---|
| Write unit + integration tests | R/A | C | — | — | — |
| Maintain E2E journey suite | C | R/A | — | — | — |
| Maintain AI-eval golden set | C | C | R/A | — | — |
| Pen-test scheduling | C | C | — | R/A | C |
| Chaos in prod | C | C | — | C | R/A |
| Flake triage | R | A | — | — | — |
| Release gate override (emergency) | C | C | C | C | R/A |
20.3 Cadence
- Weekly: QA Guild reviews flakes, coverage drifts, load regressions.
- Monthly: AI Safety red-team debrief; chaos experiment retro.
- Quarterly: External pen-test; full-system GameDay.
- Annually: Strategy refresh (this document).
Appendix A — Test IDs & Traceability
Every test carries a stable ID following:
<service>.<layer>.<capability>.<scenario>
Examples:
enrollment.unit.policy.expired-enrollment-blocks-attempt
progress.integration.outbox.crash-mid-flush-no-duplicate
tutor.ai.safety.self-harm-corpus-v3
player.e2e.offline.seven-day-streak
Traceability is maintained in 06-traceability-matrix.md: every epic/story maps to ≥ 1 test ID at each applicable layer.
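Because the ID has a fixed four-part shape, it can be validated mechanically before a test is registered in the traceability matrix. The following is a minimal sketch: the allowed character set (lowercase letters, digits, hyphens) is inferred from the examples rather than specified, and `parse_test_id` is a hypothetical helper.

```python
import re

# ASSUMPTION: each segment is lowercase letters, digits, and single hyphens,
# inferred from the example IDs; the document does not define the character set.
_SEGMENT = r"[a-z0-9]+(?:-[a-z0-9]+)*"
TEST_ID_RE = re.compile(rf"({_SEGMENT})\.({_SEGMENT})\.({_SEGMENT})\.({_SEGMENT})")


def parse_test_id(test_id: str) -> dict:
    """Split <service>.<layer>.<capability>.<scenario> into its four parts."""
    match = TEST_ID_RE.fullmatch(test_id)
    if match is None:
        raise ValueError(f"not a valid test ID: {test_id!r}")
    service, layer, capability, scenario = match.groups()
    return {
        "service": service,
        "layer": layer,
        "capability": capability,
        "scenario": scenario,
    }
```

For instance, `parse_test_id("tutor.ai.safety.self-harm-corpus-v3")` yields service `tutor`, layer `ai`, capability `safety`, and scenario `self-harm-corpus-v3`, while an ID with only three segments raises `ValueError`.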
Appendix B — Flake Policy & Quarantine
- A test that fails non-deterministically is quarantined within 24 h (moved to a @quarantined tag, excluded from merge gates).
- The producing team has 5 business days to either fix and restore the test or delete it.
- A test may not remain quarantined > 10 business days without a written exception from the QA Guild lead.
- Quarantine count per service is a tracked KPI; > 5 active quarantines triggers a focused reliability sprint.
- Root-cause categories are tagged (flake:timing, flake:shared-state, flake:network, flake:ai-nondeterminism) to drive harness improvements.
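The quarantine deadlines are expressed in business days, which is easy to get wrong by hand. A minimal sketch of the deadline check follows, assuming a Monday to Friday week with public holidays ignored; the status strings and function names are hypothetical.

```python
from datetime import date, timedelta

FIX_OR_DELETE_DAYS = 5    # business days for the producing team to fix or delete
MAX_QUARANTINE_DAYS = 10  # business days before a written exception is required


def add_business_days(start: date, days: int) -> date:
    """Advance by `days` weekdays (Mon-Fri). ASSUMPTION: public holidays are ignored."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            remaining -= 1
    return current


def quarantine_status(quarantined_on: date, today: date) -> str:
    """Classify a quarantined test against the two deadlines in the policy."""
    if today > add_business_days(quarantined_on, MAX_QUARANTINE_DAYS):
        return "needs-exception"
    if today > add_business_days(quarantined_on, FIX_OR_DELETE_DAYS):
        return "overdue"
    return "within-window"
```

A test quarantined on Monday 2024-01-01 is within its window through Monday 2024-01-08, overdue from 2024-01-09, and needs a written exception after 2024-01-15.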
End of document.