Enterprise Architecture
:::info Source
Sourced from docs/01-enterprise-architecture.md in the documentation repo.
:::
Companion docs: 02 DDD & Bounded Contexts · 03 Microservices · 04 Event-Driven · 13 Security & Tenancy · ADR 0001 — Kong edge gateway
1. Business Architecture
1.1 Stakeholders & Personas
| Side | Persona | Primary Goals |
|---|---|---|
| Provider | Course Author | Create high-quality interactive courses fast, with AI assistance |
| Provider | Provider Admin | Manage catalog, pricing, payouts, analytics |
| Organization | Org Admin | Assign courses to employees, enforce compliance, get reports |
| Organization | Org Manager | View team progress, intervene on at-risk learners |
| Learner | Employee Learner | Complete assigned trainings, including offline |
| Learner | Individual Learner | Buy courses, learn on demand, earn certificates |
| Internal | Platform Admin | Tenant management, abuse handling, billing reconciliation |
| Internal | Compliance Officer | Audit logs, AI provenance review, data subject requests |
1.2 Value Streams
- Author → Publish → Sell (Provider value stream)
- Discover → License → Assign (Organization value stream)
- Receive → Learn → Certify (Learner value stream)
- Operate → Audit → Comply (Platform/Compliance value stream)
1.3 Capability Map (top-level)
Identity & Access | Tenant & Org Mgmt | Catalog & Discovery | Authoring Suite
Marketplace & Licensing | Billing & Payouts | Assignment & Compliance
Learning Delivery & Player | Progress & Records (LRS) | Assessment & Grading
Certification | Notifications | Media Pipeline | Search & Recommendation
Analytics & Reporting | AI Services | Offline Sync | Security & Compliance
2. Application Architecture (Clean Architecture Across the Estate)
Every backend microservice is structured per Clean / Hexagonal Architecture, with Domain-Driven layers:
┌────────────────────────── Presentation ───────────────────────────┐
│ NestJS Controllers · GraphQL Resolvers (optional) · WebSockets/SSE │
└─────────────┬──────────────────────────────────────────────────────┘
│
┌────────────────────────── Application ───────────────────────────┐
│ Use-Cases · Command/Query Handlers · DTOs · Mappers · Ports │
└─────────────┬──────────────────────────────────────────────────────┘
│
┌────────────────────────── Domain (pure TS) ──────────────────────┐
│ Aggregates · Entities · Value Objects · Domain Events · Services │
└─────────────┬──────────────────────────────────────────────────────┘
│
┌────────────────────────── Infrastructure ────────────────────────┐
│ Postgres Repos · NATS Pub/Sub · S3/R2 Adapters · HTTP Clients │
│ Outbox · Saga Engines · AI Gateway Client · Sync Client │
└──────────────────────────────────────────────────────────────────┘
Strict dependency rule: outer layers depend inward only. Domain has zero framework imports. Application defines `ports/` interfaces; infrastructure provides adapters bound at module wiring time.
The frontend mirrors this with: `app/` (presentation routes) → `services/` + `hooks/` (application) → `lib/domain/` (pure TS domain models) → `lib/adapters/` (HTTP, IndexedDB, Service Worker, AI client).
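As a rough illustration of the dependency rule for backend services, the sketch below keeps the domain free of framework imports, lets the application layer own a port, and supplies an adapter from infrastructure. The `Course` aggregate, the `CourseRepository` port, the `RenameCourse` use-case, and the in-memory adapter are hypothetical names chosen for this example; a real service would presumably bind a Postgres-backed adapter in its NestJS module.

```typescript
// domain/: pure TypeScript, zero framework imports.
class Course {
  readonly id: string;
  readonly tenantId: string;
  private title: string;

  constructor(id: string, tenantId: string, title: string) {
    if (!title.trim()) throw new Error("Course title must not be empty");
    this.id = id;
    this.tenantId = tenantId;
    this.title = title;
  }

  rename(title: string): void {
    if (!title.trim()) throw new Error("Course title must not be empty");
    this.title = title;
  }

  get currentTitle(): string {
    return this.title;
  }
}

// application/ports/: the interface is owned by the application layer.
interface CourseRepository {
  findById(tenantId: string, id: string): Promise<Course | undefined>;
  save(course: Course): Promise<void>;
}

// application/: the use-case sees only the domain and the port.
class RenameCourse {
  private readonly courses: CourseRepository;

  constructor(courses: CourseRepository) {
    this.courses = courses;
  }

  async execute(tenantId: string, id: string, title: string): Promise<void> {
    const course = await this.courses.findById(tenantId, id);
    if (!course) throw new Error(`Course ${id} not found`);
    course.rename(title);
    await this.courses.save(course);
  }
}

// infrastructure/: an adapter (in-memory stand-in for a Postgres repository).
class InMemoryCourseRepository implements CourseRepository {
  private readonly rows = new Map<string, Course>();

  async findById(tenantId: string, id: string): Promise<Course | undefined> {
    return this.rows.get(`${tenantId}:${id}`);
  }

  async save(course: Course): Promise<void> {
    this.rows.set(`${course.tenantId}:${course.id}`, course);
  }
}
```

Module wiring is the only place that would name the adapter, for example `new RenameCourse(new InMemoryCourseRepository())`; swapping the adapter leaves the use-case and the aggregate untouched.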
3. Bounded Contexts (Detail in doc 02)
19 contexts grouped by domain class:
| Class | Contexts |
|---|---|
| Core | Authoring · Delivery · Progress (LRS) · Assignment · Marketplace · AI Services · Offline Sync |
| Supporting | Catalog · Content-Packaging · Assessment · Certification · Enrollment · Search · Analytics |
| Generic | Identity · Tenant · Billing · Notification · Media |
4. Data Architecture
4.1 Storage Topology
| Store | Owner | Purpose |
|---|---|---|
| Postgres (per service) | Each microservice | OLTP, RLS-isolated |
| pgvector (extension) | ai-gateway-service | Embeddings, semantic search |
| OpenSearch | search-service | Lexical search index |
| Redis | All services | Caches, rate-limit counters, idempotency keys, ephemeral session |
| S3 / Cloudflare R2 | media-service, content-service | Media, SCORM zips, PlayPackage Bundles, AI artifacts |
| ClickHouse (or BigQuery / Snowflake) | analytics-service | Columnar warehouse for reporting |
| NATS JetStream | All services | Event log, durable streams |
| IndexedDB (Dexie) | Web client | Offline data, outbox |
| SQLite | Mobile/desktop client | Same logical schema as IndexedDB |
4.2 Data Ownership
- Each service owns its schema. Cross-service queries forbidden. Read-models are projected via NATS subscriptions.
- Write-side authoritative; read-side eventually consistent (target lag < 2s p95).
- Reference data (e.g., country list, languages) is replicated via NATS broadcast on changes.
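A read-model projection under these ownership rules might look like the following sketch. It assumes each event carries a per-aggregate `version` and the full projected state, so duplicates and stale deliveries can simply be skipped. The event envelope, the subject names, and the `CourseListProjection` class are illustrative assumptions, not the registry schema.

```typescript
// Hypothetical event envelope; field and subject names are assumptions.
interface CourseCatalogEvent {
  eventId: string;
  subject: "catalog.course.published.v1" | "catalog.course.renamed.v1";
  tenantId: string;
  courseId: string;
  version: number; // per-aggregate sequence assigned by the owning service
  title: string;   // events carry the full projected state in this sketch
}

interface CourseRow {
  tenantId: string;
  courseId: string;
  title: string;
  version: number;
}

// Read model owned by a consuming service; it never queries the owner's tables.
class CourseListProjection {
  private readonly rows = new Map<string, CourseRow>();

  // Returns true when the event changed the read model,
  // false when it was a duplicate or older than what is already stored.
  apply(event: CourseCatalogEvent): boolean {
    const key = `${event.tenantId}:${event.courseId}`;
    const current = this.rows.get(key);
    if (current && event.version <= current.version) return false;
    this.rows.set(key, {
      tenantId: event.tenantId,
      courseId: event.courseId,
      title: event.title,
      version: event.version,
    });
    return true;
  }

  listForTenant(tenantId: string): CourseRow[] {
    return Array.from(this.rows.values()).filter((r) => r.tenantId === tenantId);
  }
}
```

In a running service, `apply` would be called from the handler of a durable NATS JetStream subscription; the in-memory map stands in for the service's own Postgres table.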
4.3 Multi-Tenant Isolation
- `tenant_id UUID NOT NULL` on every row.
- Postgres Row-Level Security policies (`USING (tenant_id = current_setting('app.tenant_id')::uuid)`) enforce isolation at the DB layer.
- Application sets `app.tenant_id` per request from JWT.
- Largest tenants may be promoted to dedicated logical schemas (no shared tables) without API change.
- Shared connection pools route through a tenant-aware proxy (PgBouncer + connection-init that sets `app.tenant_id`).
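One possible way for a service to set `app.tenant_id` per request is sketched below. It uses PostgreSQL's `set_config` with `is_local = true`, which scopes the setting to the current transaction so that a pooled connection does not carry it over to the next request. The `SqlClient` interface and the `withTenant` helper are hypothetical; the interface is only shaped like node-postgres' `client.query(text, values)`, and this is an assumption about how the connection-init step could be implemented rather than a description of the actual proxy.

```typescript
// Minimal structural type, shaped like node-postgres' client.query(text, values).
interface SqlClient {
  query(text: string, values?: unknown[]): Promise<{ rows: unknown[] }>;
}

const UUID_RE = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

// Runs `work` inside a transaction whose RLS context is pinned to one tenant.
async function withTenant<T>(
  client: SqlClient,
  tenantId: string,
  work: (c: SqlClient) => Promise<T>,
): Promise<T> {
  if (!UUID_RE.test(tenantId)) throw new Error("tenantId must be a UUID");
  await client.query("BEGIN");
  try {
    // The third argument (is_local = true) limits the setting to this transaction.
    await client.query("SELECT set_config('app.tenant_id', $1, true)", [tenantId]);
    const result = await work(client);
    await client.query("COMMIT");
    return result;
  } catch (err) {
    await client.query("ROLLBACK");
    throw err;
  }
}
```

Every statement issued inside `work` is then filtered by the RLS policy, because `current_setting('app.tenant_id')` resolves to the tenant taken from the request's JWT.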
5. Technology Architecture
5.1 Runtime Components
┌──────────────┐ ┌──────────────┐
Web (PWA) ─┤ Edge CDN ├────────►│ Kong Gateway │
Mobile App ─┤ (Cloudflare)│ │ (API edge) │
Desktop └──────┬───────┘ └──────┬───────┘
│ │
│ ▼
┌──────▼─────┐ ┌──────────────────────────┐
│ Next.js SSR│ │ NestJS Microservices x19 │
└────────────┘ └──────┬───────────────────┘
│
┌────────────────┼─────────────────┐
▼ ▼ ▼
┌────────────┐ ┌────────────┐ ┌────────────┐
│ Postgres │ │ NATS JS │ │ S3 / R2 │
└────────────┘ └────────────┘ └────────────┘
▲ ▲ ▲
│ │ │
┌────────────┐ ┌────────────┐ ┌────────────┐
│ pgvector │ │ OpenSearch │ │ ClickHouse │
└────────────┘ └────────────┘ └────────────┘
5.2 Component Notes
- Kong API Gateway (see ADR 0001) terminates north-south TLS, enforces edge auth (e.g. JWT/JWKS, API keys on documented routes), applies global and per-route rate limits, and forwards correlation and tenancy headers. Microservices still enforce authorization, tenant invariants, idempotency, and RLS.
- Service mesh optional; if used, mTLS between services. JWT continues to carry user identity end-to-end.
- Schema registry (NATS subjects → JSON Schema) versioned in a Git-backed registry consumed at build time.
6. Integration Architecture
| Integration | Pattern | Details |
|---|---|---|
| Inter-service commands | NATS request/reply | Hard-bounded ≤ 2 hops |
| Inter-service events | NATS JetStream | At-least-once, idempotent consumers |
| External: Identity | OIDC/SAML | Pluggable per tenant |
| External: Billing | Webhooks (in/out) | Stripe/Adyen/Tap (vendor-abstract) |
| External: Email/SMS | Provider adapters | SES/Sendgrid/Twilio (abstracted) |
| External: AI vendors | Behind ai-gateway | Multiple providers fan-out via router |
| External: LMS imports | SCORM upload + xAPI ingest | Per content-service |
| Public APIs | REST + Webhooks (signed) | Per-tenant API keys |
| Embedded experiences | Signed iframe + LTI 1.3 | Player embedding for partners |
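The table does not specify how public webhooks are signed. As one common approach, the sketch below assumes an HMAC-SHA256 signature over `timestamp.body` with a per-tenant secret and a replay tolerance window. The scheme, the function names, and the 300-second default are all assumptions made for illustration.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Assumed scheme: hex-encoded HMAC-SHA256 over `${timestamp}.${body}`.
function signWebhook(secret: string, timestamp: number, body: string): string {
  return createHmac("sha256", secret).update(`${timestamp}.${body}`).digest("hex");
}

function verifyWebhook(
  secret: string,
  timestamp: number,
  body: string,
  signature: string,
  nowSeconds: number,
  toleranceSeconds = 300,
): boolean {
  // Reject deliveries outside the tolerance window to limit replay.
  if (Math.abs(nowSeconds - timestamp) > toleranceSeconds) return false;

  const expected = Buffer.from(signWebhook(secret, timestamp, body), "hex");
  const received = Buffer.from(signature, "hex");

  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return received.length === expected.length && timingSafeEqual(received, expected);
}
```

The receiver must verify against the raw request body, before any JSON parsing, because re-serialization can change the bytes that were signed.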
7. AI-First Positioning
AI is a first-class architectural concern, not a sidecar.
- Single point of control: `ai-gateway-service` is the only egress for any model call. All product services depend on `ports/AIClient` whose default adapter calls the gateway.
- Capability surface: completion (chat), structured generation (JSON), embeddings, image, TTS/STT, moderation, classification, summarization, translation.
- Local-first: the gateway exposes the same interface to clients via `/ai/v1/*` SSE; clients may route to local models (WebGPU/WASM/on-device) and only call the gateway when local inference is unavailable or quality threshold not met. Local inferences still emit `ai.inference.local.completed.v1` events for audit when the device next syncs.
- Provenance: every AI artifact carries an `ai_provenance` block (see doc 12). Domain aggregates that hold AI content reject persistence without it.
- Human-in-the-loop: AI-generated authoring blocks are persisted with `status: 'draft_ai'`; promotion to `'reviewed' | 'published'` requires a domain action by an authorized user.
- Safety: pre-call moderation (input policy), post-call moderation (output policy), PII redaction for any cloud-bound payload, refusal handling with UX path.
- Cost & quotas: per-tenant budgets, per-feature quotas, soft-degrade (route to cheaper model) before hard-stop.
- Auditability: every gateway call writes `ai.gateway.call.completed.v1` with prompt-hash, model, tokens, cost, traceId, tenantId, userId, decisionId.
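The provenance and human-in-the-loop rules above could be expressed in a domain aggregate roughly as follows. The `AiProvenance` fields are placeholders (the real block is defined in doc 12), the `Reviewer` shape is hypothetical, and requiring `reviewed` before `published` is an assumption of this sketch rather than something the rules state.

```typescript
type BlockStatus = "draft_ai" | "reviewed" | "published";

// Placeholder fields; the real `ai_provenance` block is specified in doc 12.
interface AiProvenance {
  model: string;
  promptHash: string;
  generatedAt: string;
}

// Hypothetical view of the acting user's permissions.
interface Reviewer {
  userId: string;
  canReview: boolean;
  canPublish: boolean;
}

class AiAuthoringBlock {
  readonly content: string;
  readonly provenance: AiProvenance;
  private status: BlockStatus = "draft_ai";

  constructor(content: string, provenance: AiProvenance | undefined) {
    // AI content without provenance is rejected before it can be persisted.
    if (!provenance) throw new Error("AI content requires an ai_provenance block");
    this.content = content;
    this.provenance = provenance;
  }

  // Promotion is an explicit domain action by an authorized user.
  review(by: Reviewer): void {
    if (!by.canReview) throw new Error(`User ${by.userId} may not review AI blocks`);
    if (this.status !== "draft_ai") throw new Error("Only draft_ai blocks can be reviewed");
    this.status = "reviewed";
  }

  // Assumption of this sketch: a block must be reviewed before publication.
  publish(by: Reviewer): void {
    if (!by.canPublish) throw new Error(`User ${by.userId} may not publish AI blocks`);
    if (this.status !== "reviewed") throw new Error("Only reviewed blocks can be published");
    this.status = "published";
  }

  get currentStatus(): BlockStatus {
    return this.status;
  }
}
```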
8. Offline-First Positioning
Offline is a first-class architectural concern, not a fallback.
- Single sync protocol: all clients use `sync-service` via `/sync/v1/pull|push`. No service invents its own sync.
- Local store port: `LocalStore` interface implemented by Dexie (web) and SQLite (mobile/desktop). Domain logic on the client targets the port, never the implementation.
- Outbox-on-client: every mutation produces a `LocalMutation` row keyed by `clientMutationId`. Sync push sends these in causal order; server is idempotent.
- PlayPackage Bundles: delivery-service publishes signed, encrypted bundles per (course version, locale). Clients download, verify signature, decrypt with device-bound key, mount in local storage.
- Conflict policy per aggregate: declared in each service doc; enforced server-side; surfaced in UI when needed (authoring drafts only).
- License enforcement offline: every PlayPackage Bundle includes a license envelope (offline expiry, device limit, feature flags). Player refuses to play expired/revoked bundles. Revocation propagates on next sync.
- Tamper detection: bundle hash + signature verified on every mount. Failure raises `delivery.bundle.tamper_detected.v1`.
- Offline AI: local inference engine integrated into client; usage logged locally, replayed at sync.
- Storage management: quotas configurable per tenant + per device; eviction policy LRU among non-pinned bundles; user-initiated pinning prevents eviction.
9. Multi-Tenant Model
| Layer | Mechanism |
|---|---|
| API | JWT carries `tenant_id`; gateway sets `RequestContext.tenant`; deny if absent on tenant-scoped endpoints |
| Application | All use-cases require `TenantId` parameter; cross-tenant references rejected at constructor |
| Domain | `TenantId` is a value object on every aggregate root; aggregates refuse construction without it |
| DB | `tenant_id` column + RLS on every table; no service-account bypass except for ops jobs which set context explicitly |
| Storage (S3/R2) | Per-tenant prefix `s3://bucket/tenants/<tenantId>/...`; signed URLs scoped per object |
| Search | Tenant filter injected into every query; index aliases per tenant for the largest customers |
| Vector | Tenant filter on every similarity query; collection partitioning for largest |
| AI | Gateway pins per-tenant prompts, models, budgets, safety policies |
| Sync | Cursor scoped to (tenantId, userId, deviceId) |
Tenant types:
- Org tenant — multi-user, RBAC, SSO-eligible.
- Provider tenant — author + sell on marketplace; can be combined with org tenant.
- Individual tenant — single-user, no RBAC, social/email login.
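The Domain and Application rows of the table could translate into code along these lines. This is a minimal sketch: `TenantId` is named in the table, while the UUID check, the `BlockRef` type, and this particular `Course` shape are assumptions made for illustration.

```typescript
const TENANT_UUID = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

// Value object: can only be obtained through validation.
class TenantId {
  readonly value: string;

  private constructor(value: string) {
    this.value = value;
  }

  static of(raw: string | null | undefined): TenantId {
    if (!raw || !TENANT_UUID.test(raw)) throw new Error("TenantId must be a UUID");
    return new TenantId(raw.toLowerCase());
  }

  equals(other: TenantId): boolean {
    return this.value === other.value;
  }
}

// Hypothetical reference to a content block owned by some tenant.
interface BlockRef {
  blockId: string;
  tenantId: TenantId;
}

class Course {
  readonly id: string;
  readonly tenantId: TenantId;
  private readonly blockIds: string[] = [];

  // The aggregate cannot be constructed without a valid TenantId.
  constructor(id: string, tenantId: TenantId) {
    this.id = id;
    this.tenantId = tenantId;
  }

  addBlock(block: BlockRef): void {
    if (!block.tenantId.equals(this.tenantId)) {
      throw new Error("Cross-tenant block reference rejected");
    }
    this.blockIds.push(block.blockId);
  }

  get blocks(): readonly string[] {
    return this.blockIds;
  }
}
```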
10. Security Model (overview; full detail in doc 13)
- AuthN: OIDC/SAML SSO + email+password + magic link + WebAuthn; JWT (15-min access, 30-day rotating refresh).
- AuthZ: RBAC at coarse scope, ABAC for fine-grained (attribute = `tenant_id`, `org_unit`, `course_visibility`).
- Encryption: TLS 1.3 everywhere; AES-256 at rest; per-tenant data keys via KMS envelope encryption; per-device keys for offline bundles.
- Audit: immutable append-only log streamed off NATS; daily Merkle-root anchoring for tamper evidence.
- AI safety: see section 7; full policy in doc 13.
- Offline: device binding, license envelope, bundle signing — see section 8.
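To make the daily Merkle-root anchoring in the Audit bullet concrete, the sketch below computes a root over one day's log entries. The hash function (SHA-256), the leaf and node prefixes, and the handling of an odd node are assumptions of this example; the actual tree construction is not specified here.

```typescript
import { createHash } from "node:crypto";

const sha256 = (data: Buffer): Buffer => createHash("sha256").update(data).digest();

// Distinct prefixes keep a leaf hash from being confused with an inner node.
const hashLeaf = (entry: string): Buffer =>
  sha256(Buffer.concat([Buffer.from([0x00]), Buffer.from(entry, "utf8")]));
const hashNode = (left: Buffer, right: Buffer): Buffer =>
  sha256(Buffer.concat([Buffer.from([0x01]), left, right]));

// Returns the hex-encoded root for an ordered list of audit log entries.
function merkleRoot(entries: string[]): string {
  if (entries.length === 0) return sha256(Buffer.alloc(0)).toString("hex");

  let level = entries.map(hashLeaf);
  while (level.length > 1) {
    const next: Buffer[] = [];
    for (let i = 0; i < level.length; i += 2) {
      // Assumption: an unpaired last node is promoted unchanged.
      next.push(i + 1 < level.length ? hashNode(level[i], level[i + 1]) : level[i]);
    }
    level = next;
  }
  return level[0].toString("hex");
}
```

Publishing each day's root to an external, append-only location means any later edit to a logged entry yields a different root and is therefore detectable.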
11. Compliance Posture (overview)
- GDPR: lawful basis per processing activity; data subject rights (export, erasure, portability) implemented per service via `gdpr.subject_request.received.v1` choreography.
- SOC 2 Type II: logging, change management, access reviews enforced via automation.
- ISO 27001: ISMS controls; risk register integrated with this spec set.
- HIPAA (optional add-on for healthcare tenants): BAA, PHI encryption, restricted AI providers (no training on tenant data), audit-export.
- Regional residency: per-tenant data region pin; cross-region replication only with explicit opt-in.
- Accessibility: WCAG 2.2 AA across all tenant-facing surfaces.
- AI-specific: EU AI Act risk-classification for each AI capability; AI provenance + human review records retained for 7 years.
12. Cross-Cutting Concerns (Architectural Mandates)
| Concern | Mandate |
|---|---|
| Observability | OpenTelemetry traces, metrics, logs across every service; trace propagation via traceparent header through HTTP, NATS, and SSE |
| Versioning | URLs /api/v1, events *.vN, schemas in registry, prompts semver-pinned |
| Idempotency | All write endpoints accept Idempotency-Key; required on sync push |
| Time | UTC everywhere on the wire; client converts; Lamport clocks for sync ordering |
| I18n | Every user-facing string flows through translation pipeline; locale-aware formatting; AI translation behind the AI Gateway |
| RTL | Logical CSS properties only; tested in both directions; no LTR-only widgets |
| Accessibility | WCAG 2.2 AA; automated axe checks gate CI; manual audit per release of player + author |
| Testing | Unit (domain-pure), integration (Testcontainers), contract (Pact), E2E (Playwright), prompt regression (golden + structural), offline (airplane-mode E2E + sync replay), load (k6) |
| Feature flags | Tenant-scoped; default-off for AI features pending per-tenant opt-in |
| Disaster recovery | RPO 5 min (Postgres PITR + JetStream replay), RTO 60 min; region failover drilled quarterly |
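The Time row relies on Lamport clocks for sync ordering. A minimal sketch of such a clock follows; the `LamportClock` class, the `Stamp` type, and the tie-break on `deviceId` are illustrative choices, not the sync protocol's actual wire format.

```typescript
// Logical clock: counts events instead of reading wall-clock time.
class LamportClock {
  private counter = 0;

  // Call for every local event (for example, each new local mutation).
  tick(): number {
    this.counter += 1;
    return this.counter;
  }

  // Call when a message stamped by another party is received.
  receive(remote: number): number {
    this.counter = Math.max(this.counter, remote) + 1;
    return this.counter;
  }

  get time(): number {
    return this.counter;
  }
}

interface Stamp {
  counter: number;
  deviceId: string;
}

// Total order: counter first, then deviceId as a deterministic tie-break.
function compareStamps(a: Stamp, b: Stamp): number {
  if (a.counter !== b.counter) return a.counter - b.counter;
  if (a.deviceId === b.deviceId) return 0;
  return a.deviceId < b.deviceId ? -1 : 1;
}
```

Because `receive` always moves the counter past the sender's value, an event that causally follows another gets a larger stamp even when device wall clocks disagree.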
13. Why This Architecture (DDD + Clean Architecture Tie-back)
- DDD: Bounded contexts align with team ownership and ubiquitous language. Aggregates protect invariants (e.g., a `Course` cannot reference blocks from another tenant). Domain events drive cross-context choreography rather than RPC chains.
- Clean Architecture: Domain-first design lets us swap NestJS for any framework, Postgres for any RDBMS, and NATS for any broker without touching business rules. Use-cases are explicit and individually testable.
- Event-driven: Decouples services, enables replay, gives us a natural audit log, and makes offline-first sync expressible as a stream of intents.
- AI-first: Centralizing AI behind a single gateway means safety, provenance, cost, and governance are enforced once — not 19 times.
- Offline-first: A single sync protocol and a single client-side store port mean every team builds offline support the same way, with the same conflict semantics, instead of inventing N incompatible local caches.