Enterprise Architecture

:::info Source
Sourced from docs/01-enterprise-architecture.md in the documentation repo.
:::

Companion docs: 02 DDD & Bounded Contexts · 03 Microservices · 04 Event-Driven · 13 Security & Tenancy · ADR 0001 — Kong edge gateway

1. Business Architecture

1.1 Stakeholders & Personas

Side	Persona	Primary Goals
Provider	Course Author	Create high-quality interactive courses fast, with AI assistance
Provider	Provider Admin	Manage catalog, pricing, payouts, analytics
Organization	Org Admin	Assign courses to employees, enforce compliance, get reports
Organization	Org Manager	View team progress, intervene on at-risk learners
Learner	Employee Learner	Complete assigned trainings, including offline
Learner	Individual Learner	Buy courses, learn on demand, earn certificates
Internal	Platform Admin	Tenant management, abuse handling, billing reconciliation
Internal	Compliance Officer	Audit logs, AI provenance review, data subject requests

1.2 Value Streams

Author → Publish → Sell (Provider value stream)
Discover → License → Assign (Organization value stream)
Receive → Learn → Certify (Learner value stream)
Operate → Audit → Comply (Platform/Compliance value stream)

1.3 Capability Map (top-level)

Identity & Access | Tenant & Org Mgmt | Catalog & Discovery | Authoring Suite
Marketplace & Licensing | Billing & Payouts | Assignment & Compliance
Learning Delivery & Player | Progress & Records (LRS) | Assessment & Grading
Certification | Notifications | Media Pipeline | Search & Recommendation
Analytics & Reporting | AI Services | Offline Sync | Security & Compliance

2. Application Architecture (Clean Architecture Across the Estate)

Every backend microservice is structured per Clean / Hexagonal Architecture, with Domain-Driven layers:

┌────────────────────────── Presentation ───────────────────────────┐
│ NestJS Controllers · GraphQL Resolvers (optional) · WebSockets/SSE │
└─────────────┬──────────────────────────────────────────────────────┘
              │
┌────────────────────────── Application ───────────────────────────┐
│ Use-Cases · Command/Query Handlers · DTOs · Mappers · Ports      │
└─────────────┬──────────────────────────────────────────────────────┘
              │
┌────────────────────────── Domain (pure TS) ──────────────────────┐
│ Aggregates · Entities · Value Objects · Domain Events · Services │
└─────────────┬──────────────────────────────────────────────────────┘
              │
┌────────────────────────── Infrastructure ────────────────────────┐
│ Postgres Repos · NATS Pub/Sub · S3/R2 Adapters · HTTP Clients    │
│ Outbox · Saga Engines · AI Gateway Client · Sync Client          │
└──────────────────────────────────────────────────────────────────┘

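Strict dependency rule: dependencies point inward only, so outer layers may depend on inner layers but never the reverse. The domain layer has zero framework imports. The application layer defines its interfaces under ports/; infrastructure provides the adapters, which are bound at module wiring time.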
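
The dependency rule can be illustrated with a minimal sketch. All names here (`Course`, `CourseRepository`, `RenameCourse`, `InMemoryCourseRepository`) are hypothetical and not taken from the codebase, and the port is kept synchronous for brevity; a real repository port would normally return Promises.

```typescript
// Domain layer: pure TypeScript, no framework imports.
class Course {
  constructor(
    readonly id: string,
    readonly tenantId: string,
    private _title: string,
  ) {
    if (_title.trim() === "") throw new Error("Course title must not be empty");
  }

  get title(): string {
    return this._title;
  }

  rename(newTitle: string): void {
    if (newTitle.trim() === "") throw new Error("Course title must not be empty");
    this._title = newTitle;
  }
}

// Application layer: the port is declared here, not in infrastructure.
interface CourseRepository {
  findById(tenantId: string, id: string): Course | undefined;
  save(course: Course): void;
}

// Application layer: a use-case that depends only on the port.
class RenameCourse {
  constructor(private readonly courses: CourseRepository) {}

  execute(tenantId: string, courseId: string, newTitle: string): void {
    const course = this.courses.findById(tenantId, courseId);
    if (!course) throw new Error(`Course ${courseId} not found`);
    course.rename(newTitle);
    this.courses.save(course);
  }
}

// Infrastructure layer: one adapter for the port. A Postgres-backed adapter
// would implement the same interface and be swapped in at module wiring time.
class InMemoryCourseRepository implements CourseRepository {
  private readonly rows = new Map<string, Course>();

  findById(tenantId: string, id: string): Course | undefined {
    return this.rows.get(`${tenantId}/${id}`);
  }

  save(course: Course): void {
    this.rows.set(`${course.tenantId}/${course.id}`, course);
  }
}
```

Because `RenameCourse` only sees the `CourseRepository` interface, it can be unit-tested with the in-memory adapter and wired to a database adapter in production without changing the use-case.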

The frontend mirrors this with: app/ (presentation routes) → services/ + hooks/ (application) → lib/domain/ (pure TS domain models) → lib/adapters/ (HTTP, IndexedDB, Service Worker, AI client).

3. Bounded Contexts (Detail in doc 02)

19 contexts grouped by domain class:

Class	Contexts
Core	Authoring · Delivery · Progress (LRS) · Assignment · Marketplace · AI Services · Offline Sync
Supporting	Catalog · Content-Packaging · Assessment · Certification · Enrollment · Search · Analytics
Generic	Identity · Tenant · Billing · Notification · Media

4. Data Architecture

4.1 Storage Topology

Store	Owner	Purpose
Postgres (per service)	Each microservice	OLTP, RLS-isolated
pgvector (extension)	ai-gateway-service	Embeddings, semantic search
OpenSearch	search-service	Lexical search index
Redis	All services	Caches, rate-limit counters, idempotency keys, ephemeral sessions
S3 / Cloudflare R2	media-service, content-service	Media, SCORM zips, PlayPackage Bundles, AI artifacts
ClickHouse (or BigQuery / Snowflake)	analytics-service	Columnar warehouse for reporting
NATS JetStream	All services	Event log, durable streams
IndexedDB (Dexie)	Web client	Offline data, outbox
SQLite	Mobile/desktop client	Same logical schema as IndexedDB

4.2 Data Ownership

Each service owns its schema. Cross-service queries are forbidden; read models are projected from NATS subscriptions instead.
The write side is authoritative; the read side is eventually consistent (target lag < 2 s at p95).
Reference data (e.g., country list, languages) is replicated via NATS broadcast on change.
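
As a rough illustration of a projected read model, the sketch below applies events to a local view instead of querying the owning service. The event shape (`CoursePublished`, with `eventId` and `version`) and the `CatalogProjection` class are assumptions made for this example; the actual event contracts live in the schema registry.

```typescript
// Hypothetical event published by the owning service; field names are illustrative.
interface CoursePublished {
  eventId: string;  // unique per event, used for de-duplication
  tenantId: string;
  courseId: string;
  title: string;
  version: number;  // assumed to increase with every publish of the same course
}

interface CatalogEntry {
  title: string;
  version: number;
}

// A read model owned by the consuming service. It never queries the
// producer's database; it only folds events into its own state.
class CatalogProjection {
  private readonly seen = new Set<string>();
  private readonly entries = new Map<string, CatalogEntry>();

  apply(event: CoursePublished): void {
    // At-least-once delivery means the same event may arrive twice.
    if (this.seen.has(event.eventId)) return;
    this.seen.add(event.eventId);

    const key = `${event.tenantId}/${event.courseId}`;
    const current = this.entries.get(key);
    // Ignore events that are older than what the read model already holds.
    if (current && current.version >= event.version) return;
    this.entries.set(key, { title: event.title, version: event.version });
  }

  get(tenantId: string, courseId: string): CatalogEntry | undefined {
    return this.entries.get(`${tenantId}/${courseId}`);
  }
}
```

In a real service the de-duplication set and the entries would be persisted in the consumer's own Postgres schema rather than held in memory.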

4.3 Multi-Tenant Isolation

tenant_id UUID NOT NULL on every row.
Postgres Row-Level Security policies (USING (tenant_id = current_setting('app.tenant_id')::uuid)) enforce isolation at the DB layer.
Application sets app.tenant_id per request from JWT.
Largest tenants may be promoted to dedicated logical schemas (no shared tables) without API change.
Shared connection pools route through a tenant-aware proxy (PgBouncer + connection-init that sets app.tenant_id).
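
One way to set `app.tenant_id` safely on pooled connections is to scope it to a single transaction. The sketch below is an assumption about how that could look, not the documented connection-init mechanism: `SqlClient` and `withTenant` are hypothetical names, while `set_config(name, value, is_local)` is the standard Postgres function, and passing `true` limits the setting to the current transaction.

```typescript
// Minimal shape of a SQL client; node-postgres' `client.query(text, values)` fits it.
interface SqlClient {
  query(text: string, values?: unknown[]): Promise<unknown>;
}

const UUID_RE = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

// Runs `work` inside a transaction whose app.tenant_id is set for that transaction only.
async function withTenant<T>(
  client: SqlClient,
  tenantId: string,
  work: (client: SqlClient) => Promise<T>,
): Promise<T> {
  // Fail fast before touching the database; the RLS policy casts this value to uuid.
  if (!UUID_RE.test(tenantId)) throw new Error("tenantId must be a UUID");

  await client.query("BEGIN");
  try {
    // is_local = true: the setting disappears at COMMIT or ROLLBACK, so a pooled
    // connection does not carry one tenant's context into the next request.
    await client.query("SELECT set_config('app.tenant_id', $1, true)", [tenantId]);
    const result = await work(client);
    await client.query("COMMIT");
    return result;
  } catch (err) {
    await client.query("ROLLBACK");
    throw err;
  }
}
```

Every query issued inside `work` is then filtered by the RLS policy `tenant_id = current_setting('app.tenant_id')::uuid` without the repository code having to add a tenant predicate itself.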

5. Technology Architecture

5.1 Runtime Components

                ┌──────────────┐         ┌──────────────┐
   Web (PWA)  ─┤   Edge CDN   ├────────►│ Kong Gateway │
   Mobile App ─┤  (Cloudflare)│         │  (API edge)  │
   Desktop    └──────┬───────┘         └──────┬───────┘
                     │                          │
                     │                          ▼
              ┌──────▼─────┐     ┌──────────────────────────┐
              │ Next.js SSR│     │ NestJS Microservices x19 │
              └────────────┘     └──────┬───────────────────┘
                                        │
                       ┌────────────────┼─────────────────┐
                       ▼                ▼                 ▼
                ┌────────────┐   ┌────────────┐    ┌────────────┐
                │ Postgres   │   │   NATS JS  │    │  S3 / R2   │
                └────────────┘   └────────────┘    └────────────┘
                       ▲                ▲                 ▲
                       │                │                 │
                ┌────────────┐   ┌────────────┐    ┌────────────┐
                │ pgvector   │   │ OpenSearch │    │ ClickHouse │
                └────────────┘   └────────────┘    └────────────┘

5.2 Component Notes

Kong API Gateway (see ADR 0001) terminates north-south TLS, enforces edge auth (e.g. JWT/JWKS, API keys on documented routes), applies global and per-route rate limits, and forwards correlation and tenancy headers. Microservices still enforce authorization, tenant invariants, idempotency, and RLS.
A service mesh is optional; if one is used, it provides mTLS between services. The JWT continues to carry user identity end-to-end.
Schema registry (NATS subjects → JSON Schema) versioned in a Git-backed registry consumed at build time.

6. Integration Architecture

Integration	Pattern	Details
Inter-service commands	NATS request/reply	Hard-bounded ≤ 2 hops
Inter-service events	NATS JetStream	At-least-once, idempotent consumers
External: Identity	OIDC/SAML	Pluggable per tenant
External: Billing	Webhooks (in/out)	Stripe/Adyen/Tap (vendor-abstract)
External: Email/SMS	Provider adapters	SES/SendGrid/Twilio (abstracted)
External: AI vendors	Behind ai-gateway	Multiple providers fan-out via router
External: LMS imports	SCORM upload + xAPI ingest	Per content-service
Public APIs	REST + Webhooks (signed)	Per-tenant API keys
Embedded experiences	Signed iframe + LTI 1.3	Player embedding for partners
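
The two-hop bound on request/reply commands could be enforced with a hop counter carried in the message envelope. The following is only a sketch under that assumption: the `hops` field, the `CommandEnvelope` type, and the subject names used in the usage note are invented for illustration and are not part of the documented NATS contract.

```typescript
const MAX_HOPS = 2;

interface CommandEnvelope<T> {
  subject: string;
  traceparent: string; // propagated unchanged so the whole chain shares one trace
  hops: number;        // hypothetical header: service-to-service hops so far
  payload: T;
}

// Creates the first command in a chain (hop 1).
function newCommand<T>(subject: string, traceparent: string, payload: T): CommandEnvelope<T> {
  return { subject, traceparent, hops: 1, payload };
}

// Derives a follow-up command from an incoming one, refusing to exceed the budget.
function forward<T, U>(
  incoming: CommandEnvelope<T>,
  subject: string,
  payload: U,
): CommandEnvelope<U> {
  const hops = incoming.hops + 1;
  if (hops > MAX_HOPS) {
    throw new Error(`Hop budget exceeded on ${subject}: ${hops} > ${MAX_HOPS}`);
  }
  return { subject, traceparent: incoming.traceparent, hops, payload };
}
```

With this shape, a handler that receives a first-hop command may issue one further request, while a handler that receives a second-hop command gets an error if it tries to call onward and would have to publish an event instead.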

7. AI-First Positioning

AI is a first-class architectural concern, not a sidecar.

Single point of control: ai-gateway-service is the only egress for any model call. All product services depend on ports/AIClient whose default adapter calls the gateway.
Capability surface: completion (chat), structured generation (JSON), embeddings, image, TTS/STT, moderation, classification, summarization, translation.
Local-first: the gateway exposes the same interface to clients via /ai/v1/* SSE; clients may route to local models (WebGPU/WASM/on-device) and call the gateway only when local inference is unavailable or the quality threshold is not met. Local inferences still emit ai.inference.local.completed.v1 events for audit when the device next syncs.
Provenance: every AI artifact carries an ai_provenance block (see doc 12). Domain aggregates that hold AI content reject persistence without it.
Human-in-the-loop: AI-generated authoring blocks are persisted with status: 'draft_ai'; promotion to 'reviewed' | 'published' requires a domain action by an authorized user.
Safety: pre-call moderation (input policy), post-call moderation (output policy), PII redaction for any cloud-bound payload, refusal handling with UX path.
Cost & quotas: per-tenant budgets, per-feature quotas, soft-degrade (route to cheaper model) before hard-stop.
Auditability: every gateway call writes ai.gateway.call.completed.v1 with prompt-hash, model, tokens, cost, traceId, tenantId, userId, decisionId.
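
The provenance and human-in-the-loop rules above might look like this inside a domain aggregate. The `AiProvenance` fields are placeholders (the real `ai_provenance` block is defined in doc 12), `Reviewer.canReview` stands in for whatever authorization check applies, and requiring `reviewed` before `published` is an assumption of this sketch.

```typescript
type BlockStatus = "draft_ai" | "reviewed" | "published";

// Hypothetical shape; the real ai_provenance block is specified in doc 12.
interface AiProvenance {
  model: string;
  promptHash: string;
  decisionId: string;
}

// Stand-in for an authorized user; real authorization lives in the application layer.
interface Reviewer {
  userId: string;
  canReview: boolean;
}

class AiAuthoringBlock {
  private _status: BlockStatus = "draft_ai";

  private constructor(
    readonly content: string,
    readonly provenance: AiProvenance,
  ) {}

  // AI-generated content cannot be constructed without a complete provenance block.
  static fromAi(content: string, provenance: AiProvenance | undefined): AiAuthoringBlock {
    if (!provenance || !provenance.model || !provenance.promptHash || !provenance.decisionId) {
      throw new Error("AI content rejected: ai_provenance is missing or incomplete");
    }
    return new AiAuthoringBlock(content, provenance);
  }

  get status(): BlockStatus {
    return this._status;
  }

  markReviewed(by: Reviewer): void {
    if (!by.canReview) throw new Error(`User ${by.userId} may not review AI content`);
    if (this._status !== "draft_ai") throw new Error("Only draft_ai blocks can be reviewed");
    this._status = "reviewed";
  }

  // Assumption: publishing requires a prior review step.
  publish(by: Reviewer): void {
    if (!by.canReview) throw new Error(`User ${by.userId} may not publish AI content`);
    if (this._status !== "reviewed") throw new Error("Only reviewed blocks can be published");
    this._status = "published";
  }
}
```

Since the constructor is private and every state change goes through a method that takes a `Reviewer`, there is no code path that persists AI output as published content without a recorded human action.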

8. Offline-First Positioning

Offline is a first-class architectural concern, not a fallback.

Single sync protocol: all clients use sync-service via /sync/v1/pull|push. No service invents its own sync.
Local store port: LocalStore interface implemented by Dexie (web) and SQLite (mobile/desktop). Domain logic on the client targets the port — never the implementation.
Outbox-on-client: every mutation produces a LocalMutation row keyed by clientMutationId. Sync push sends these in causal order; server is idempotent.
PlayPackage Bundles: delivery-service publishes signed, encrypted bundles per (course version, locale). Clients download, verify signature, decrypt with device-bound key, mount in local storage.
Conflict policy per aggregate: declared in each service doc; enforced server-side; surfaced in UI when needed (authoring drafts only).
License enforcement offline: every PlayPackage Bundle includes a license envelope (offline expiry, device limit, feature flags). Player refuses to play expired/revoked bundles. Revocation propagates on next sync.
Tamper detection: bundle hash + signature verified on every mount. Failure raises delivery.bundle.tamper_detected.v1.
Offline AI: local inference engine integrated into client; usage logged locally, replayed at sync.
Storage management: quotas configurable per tenant + per device; eviction policy LRU among non-pinned bundles; user-initiated pinning prevents eviction.
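
The client outbox and idempotent push described above can be sketched as follows. `ClientOutbox`, `SyncPushHandler`, and the `seq` field are hypothetical; only `LocalMutation` and `clientMutationId` come from the text, and the in-memory sets stand in for the LocalStore and the server's persistence.

```typescript
interface LocalMutation {
  clientMutationId: string; // idempotency key generated on the device
  seq: number;              // hypothetical local sequence used for causal ordering
  type: string;
  payload: unknown;
}

// Client side: every mutation is recorded before it is sent anywhere.
class ClientOutbox {
  private pending: LocalMutation[] = [];
  private nextSeq = 1;

  record(clientMutationId: string, type: string, payload: unknown): void {
    this.pending.push({ clientMutationId, seq: this.nextSeq++, type, payload });
  }

  // Mutations are pushed in the order they were made on the device.
  nextBatch(): LocalMutation[] {
    return [...this.pending].sort((a, b) => a.seq - b.seq);
  }

  // Rows leave the outbox only after the server has acknowledged them.
  acknowledge(ackedIds: string[]): void {
    const acked = new Set(ackedIds);
    this.pending = this.pending.filter((m) => !acked.has(m.clientMutationId));
  }

  get size(): number {
    return this.pending.length;
  }
}

// Server side: applying the same clientMutationId twice has no further effect.
class SyncPushHandler {
  private readonly applied = new Set<string>();
  readonly appliedTypes: string[] = [];

  push(batch: LocalMutation[]): string[] {
    const acked: string[] = [];
    for (const m of batch) {
      if (!this.applied.has(m.clientMutationId)) {
        this.applied.add(m.clientMutationId);
        this.appliedTypes.push(m.type); // stand-in for applying the mutation
      }
      acked.push(m.clientMutationId); // duplicates are acknowledged too
    }
    return acked;
  }
}
```

If an acknowledgement is lost in transit, the client simply pushes the same batch again; the server recognizes the `clientMutationId` values and acknowledges them without applying them a second time.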

9. Multi-Tenant Model

Layer	Mechanism
API	JWT carries `tenant_id`; gateway sets `RequestContext.tenant`; deny if absent on tenant-scoped endpoints
Application	All use-cases require `TenantId` parameter; cross-tenant references rejected at constructor
Domain	`TenantId` is a value object on every aggregate root; aggregates refuse construction without it
DB	`tenant_id` column + RLS on every table; no service-account bypass except for ops jobs which set context explicitly
Storage (S3/R2)	Per-tenant prefix `s3://bucket/tenants/<tenantId>/...`; signed URLs scoped per object
Search	Tenant filter injected into every query; index aliases per tenant for the largest customers
Vector	Tenant filter on every similarity query; collection partitioning for largest
AI	Gateway pins per-tenant prompts, models, budgets, safety policies
Sync	Cursor scoped to (tenantId, userId, deviceId)
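
A minimal sketch of the domain and storage rows of this table, assuming hypothetical `Course` and `Block` aggregates and a helper named `tenantObjectKey`; only the `TenantId` value object and the `tenants/<tenantId>/` prefix come from the text.

```typescript
const UUID_PATTERN = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

// Value object: the only way to obtain one is through validation.
class TenantId {
  private constructor(readonly value: string) {}

  static parse(raw: string | null | undefined): TenantId {
    if (!raw || !UUID_PATTERN.test(raw)) throw new Error("Invalid or missing tenant id");
    return new TenantId(raw.toLowerCase());
  }

  equals(other: TenantId): boolean {
    return this.value === other.value;
  }
}

class Block {
  constructor(readonly id: string, readonly tenantId: TenantId) {}
}

// Aggregate root: cannot be constructed without a TenantId and rejects
// references to blocks owned by another tenant.
class Course {
  private readonly blockIds: string[] = [];

  constructor(readonly id: string, readonly tenantId: TenantId) {}

  addBlock(block: Block): void {
    if (!block.tenantId.equals(this.tenantId)) {
      throw new Error("Cross-tenant reference rejected");
    }
    this.blockIds.push(block.id);
  }

  get blockCount(): number {
    return this.blockIds.length;
  }
}

// Builds an object key under the per-tenant prefix and refuses path traversal.
function tenantObjectKey(tenantId: TenantId, relativePath: string): string {
  const segments = relativePath.split("/").filter((s) => s.length > 0);
  if (segments.length === 0 || segments.some((s) => s === "..")) {
    throw new Error("Invalid object path");
  }
  return `tenants/${tenantId.value}/${segments.join("/")}`;
}
```

The database and search layers repeat the same check independently, so a bug in one layer does not by itself expose another tenant's data.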

Tenant types:

Org tenant — multi-user, RBAC, SSO-eligible.
Provider tenant — author + sell on marketplace; can be combined with org tenant.
Individual tenant — single-user, no RBAC, social/email login.

10. Security Model (overview; full detail in doc 13)

AuthN: OIDC/SAML SSO + email+password + magic link + WebAuthn; JWT (15-min access, 30-day rotating refresh).
AuthZ: RBAC at coarse scope, ABAC for fine-grained (attribute = tenant_id, org_unit, course_visibility).
Encryption: TLS 1.3 everywhere; AES-256 at rest; per-tenant data keys via KMS envelope encryption; per-device keys for offline bundles.
Audit: immutable append-only log streamed off NATS; daily Merkle-root anchoring for tamper evidence.
AI safety: see section 7; full policy in doc 13.
Offline: device binding, license envelope, bundle signing — see section 8.
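
The split between coarse RBAC and fine-grained ABAC can be sketched as two checks that must both pass. The role names, action names, and visibility values below are invented for illustration; doc 13 holds the actual policy model.

```typescript
// Hypothetical roles and actions, loosely based on the personas in section 1.1.
type Role = "org_admin" | "org_manager" | "employee_learner";
type Action = "course:assign" | "course:view" | "team_progress:view";

const ROLE_ACTIONS: Record<Role, readonly Action[]> = {
  org_admin: ["course:assign", "course:view", "team_progress:view"],
  org_manager: ["course:view", "team_progress:view"],
  employee_learner: ["course:view"],
};

interface Subject {
  userId: string;
  tenantId: string;
  orgUnit: string;
  roles: readonly Role[];
}

interface CourseResource {
  tenantId: string;
  orgUnit: string;
  visibility: "tenant" | "org_unit"; // assumed values for course_visibility
}

// RBAC: does any of the subject's roles grant the action at all?
function rbacAllows(subject: Subject, action: Action): boolean {
  return subject.roles.some((role) => ROLE_ACTIONS[role].indexOf(action) !== -1);
}

// ABAC: do the subject's attributes match the resource's attributes?
function abacAllows(subject: Subject, course: CourseResource): boolean {
  if (subject.tenantId !== course.tenantId) return false;
  return course.visibility === "tenant" || subject.orgUnit === course.orgUnit;
}

function isAuthorized(subject: Subject, action: Action, course: CourseResource): boolean {
  return rbacAllows(subject, action) && abacAllows(subject, course);
}
```

Keeping the two checks separate lets role definitions stay small while attribute rules such as tenant and org-unit matching evolve independently.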

11. Compliance Posture (overview)

GDPR: lawful basis per processing activity; data subject rights (export, erasure, portability) implemented per service via gdpr.subject_request.received.v1 choreography.
SOC 2 Type II: logging, change management, access reviews enforced via automation.
ISO 27001: ISMS controls; risk register integrated with this spec set.
HIPAA (optional add-on for healthcare tenants): BAA, PHI encryption, restricted AI providers (no training on tenant data), audit-export.
Regional residency: per-tenant data region pin; cross-region replication only with explicit opt-in.
Accessibility: WCAG 2.2 AA across all tenant-facing surfaces.
AI-specific: EU AI Act risk-classification for each AI capability; AI provenance + human review records retained for 7 years.

12. Cross-Cutting Concerns (Architectural Mandates)

Concern	Mandate
Observability	OpenTelemetry traces, metrics, logs across every service; trace propagation via `traceparent` header through HTTP, NATS, and SSE
Versioning	URLs `/api/v1`, events `*.vN`, schemas in registry, prompts semver-pinned
Idempotency	All write endpoints accept `Idempotency-Key`; required on sync push
Time	UTC everywhere on the wire; clients convert to local time; Lamport clocks for sync ordering
I18n	Every user-facing string flows through translation pipeline; locale-aware formatting; AI translation behind the AI Gateway
RTL	Logical CSS properties only; tested in both directions; no LTR-only widgets
Accessibility	WCAG 2.2 AA; automated axe checks gate CI; manual audit per release of player + author
Testing	Unit (domain-pure), integration (Testcontainers), contract (Pact), E2E (Playwright), prompt regression (golden + structural), offline (airplane-mode E2E + sync replay), load (k6)
Feature flags	Tenant-scoped; default-off for AI features pending per-tenant opt-in
Disaster recovery	RPO 5 min (Postgres PITR + JetStream replay), RTO 60 min, quarterly region-failover drill
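
For the sync-ordering mandate, a Lamport clock is a small counter that orders events across devices without relying on wall-clock time. The sketch below shows the standard algorithm; the `Stamped` type and the tie-break on `deviceId` are assumptions added to get a deterministic total order.

```typescript
class LamportClock {
  private counter = 0;

  // Called for every local event, including sending a message.
  tick(): number {
    this.counter += 1;
    return this.counter;
  }

  // Called when a message stamped with a remote time is received.
  receive(remoteTime: number): number {
    this.counter = Math.max(this.counter, remoteTime) + 1;
    return this.counter;
  }

  get time(): number {
    return this.counter;
  }
}

// Hypothetical stamp attached to each synced mutation.
interface Stamped {
  lamport: number;
  deviceId: string;
}

// Orders by Lamport time, then by deviceId so that ties resolve identically everywhere.
function compareStamped(a: Stamped, b: Stamped): number {
  if (a.lamport !== b.lamport) return a.lamport - b.lamport;
  if (a.deviceId < b.deviceId) return -1;
  if (a.deviceId > b.deviceId) return 1;
  return 0;
}
```

A Lamport order respects causality (an event that caused another always sorts first) but says nothing about real elapsed time, which is why UTC timestamps are still carried on the wire for display and audit.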

13. Why This Architecture (DDD + Clean Architecture Tie-back)

DDD: Bounded contexts align with team ownership and ubiquitous language. Aggregates protect invariants (e.g., a Course cannot reference blocks from another tenant). Domain events drive cross-context choreography rather than RPC chains.
Clean Architecture: Domain-first design lets us swap NestJS for any framework, Postgres for any RDBMS, and NATS for any broker without touching business rules. Use-cases are explicit and individually testable.
Event-driven: Decouples services, enables replay, gives us a natural audit log, and makes offline-first sync expressible as a stream of intents.
AI-first: Centralizing AI behind a single gateway means safety, provenance, cost, and governance are enforced once — not 19 times.
Offline-first: A single sync protocol and a single client-side store port mean every team builds offline support the same way, with the same conflict semantics, instead of inventing N incompatible local caches.
