Enterprise Architecture

:::info Source
Sourced from docs/01-enterprise-architecture.md in the documentation repo.
:::

Companion docs: 02 DDD & Bounded Contexts · 03 Microservices · 04 Event-Driven · 13 Security & Tenancy · ADR 0001 — Kong edge gateway

1. Business Architecture

1.1 Stakeholders & Personas

| Side | Persona | Primary Goals |
| --- | --- | --- |
| Provider | Course Author | Create high-quality interactive courses fast, with AI assistance |
| Provider | Provider Admin | Manage catalog, pricing, payouts, analytics |
| Organization | Org Admin | Assign courses to employees, enforce compliance, get reports |
| Organization | Org Manager | View team progress, intervene on at-risk learners |
| Learner | Employee Learner | Complete assigned trainings, including offline |
| Learner | Individual Learner | Buy courses, learn on demand, earn certificates |
| Internal | Platform Admin | Tenant management, abuse handling, billing reconciliation |
| Internal | Compliance Officer | Audit logs, AI provenance review, data subject requests |

1.2 Value Streams

  1. Author → Publish → Sell (Provider value stream)
  2. Discover → License → Assign (Organization value stream)
  3. Receive → Learn → Certify (Learner value stream)
  4. Operate → Audit → Comply (Platform/Compliance value stream)

1.3 Capability Map (top-level)

Identity & Access | Tenant & Org Mgmt | Catalog & Discovery | Authoring Suite
Marketplace & Licensing | Billing & Payouts | Assignment & Compliance
Learning Delivery & Player | Progress & Records (LRS) | Assessment & Grading
Certification | Notifications | Media Pipeline | Search & Recommendation
Analytics & Reporting | AI Services | Offline Sync | Security & Compliance

2. Application Architecture (Clean Architecture Across the Estate)

Every backend microservice is structured per Clean / Hexagonal Architecture, with Domain-Driven layers:

┌─────────────────────────── Presentation ───────────────────────────┐
│ NestJS Controllers · GraphQL Resolvers (optional) · WebSockets/SSE │
└─────────────┬──────────────────────────────────────────────────────┘
              │
┌─────────────────────────── Application ────────────────────────────┐
│ Use-Cases · Command/Query Handlers · DTOs · Mappers · Ports        │
└─────────────┬──────────────────────────────────────────────────────┘
              │
┌───────────────────────── Domain (pure TS) ─────────────────────────┐
│ Aggregates · Entities · Value Objects · Domain Events · Services   │
└─────────────┬──────────────────────────────────────────────────────┘
              │
┌────────────────────────── Infrastructure ──────────────────────────┐
│ Postgres Repos · NATS Pub/Sub · S3/R2 Adapters · HTTP Clients      │
│ Outbox · Saga Engines · AI Gateway Client · Sync Client            │
└────────────────────────────────────────────────────────────────────┘

Strict dependency rule: outer layers depend inward only. Domain has zero framework imports. Application defines port interfaces under ports/; infrastructure provides the adapters, which are bound to those ports at module wiring time.
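As a rough illustration of the dependency rule, the sketch below shows a domain class, an application port, a use-case, and an adapter. All names and file paths (Course, CourseRepository, RenameCourse, InMemoryCourseRepository) are hypothetical and not taken from the codebase; a real service would bind the adapter through NestJS module wiring.

```typescript
// domain/course.ts (hypothetical path): pure TypeScript, no framework imports.
class Course {
  readonly id: string;
  readonly tenantId: string;
  private currentTitle: string;

  constructor(id: string, tenantId: string, title: string) {
    if (title.trim() === "") throw new Error("Course title must not be empty");
    this.id = id;
    this.tenantId = tenantId;
    this.currentTitle = title;
  }

  get title(): string {
    return this.currentTitle;
  }

  rename(newTitle: string): void {
    if (newTitle.trim() === "") throw new Error("Course title must not be empty");
    this.currentTitle = newTitle;
  }
}

// application/ports/course-repository.ts: the port, owned by the application layer.
interface CourseRepository {
  findById(tenantId: string, id: string): Promise<Course | undefined>;
  save(course: Course): Promise<void>;
}

// application/use-cases/rename-course.ts: depends only on the domain and the port.
class RenameCourse {
  private readonly courses: CourseRepository;

  constructor(courses: CourseRepository) {
    this.courses = courses;
  }

  async execute(tenantId: string, courseId: string, newTitle: string): Promise<void> {
    const course = await this.courses.findById(tenantId, courseId);
    if (!course) throw new Error(`Course ${courseId} not found`);
    course.rename(newTitle);
    await this.courses.save(course);
  }
}

// infrastructure/in-memory-course-repository.ts: one adapter for the port.
// A Postgres-backed adapter would implement the same interface.
class InMemoryCourseRepository implements CourseRepository {
  private readonly rows = new Map<string, Course>();

  async findById(tenantId: string, id: string): Promise<Course | undefined> {
    return this.rows.get(`${tenantId}/${id}`);
  }

  async save(course: Course): Promise<void> {
    this.rows.set(`${course.tenantId}/${course.id}`, course);
  }
}
```

Because RenameCourse only sees the CourseRepository interface, the use-case can be unit-tested against the in-memory adapter and run in production against a database adapter without changes.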

The frontend mirrors this with: app/ (presentation routes) → services/ + hooks/ (application) → lib/domain/ (pure TS domain models) → lib/adapters/ (HTTP, IndexedDB, Service Worker, AI client).

3. Bounded Contexts (Detail in doc 02)

19 contexts grouped by domain class:

| Class | Contexts |
| --- | --- |
| Core | Authoring · Delivery · Progress (LRS) · Assignment · Marketplace · AI Services · Offline Sync |
| Supporting | Catalog · Content-Packaging · Assessment · Certification · Enrollment · Search · Analytics |
| Generic | Identity · Tenant · Billing · Notification · Media |

4. Data Architecture

4.1 Storage Topology

| Store | Owner | Purpose |
| --- | --- | --- |
| Postgres (per service) | Each microservice | OLTP, RLS-isolated |
| pgvector (extension) | ai-gateway-service | Embeddings, semantic search |
| OpenSearch | search-service | Lexical search index |
| Redis | All services | Caches, rate-limit counters, idempotency keys, ephemeral session |
| S3 / Cloudflare R2 | media-service, content-service | Media, SCORM zips, PlayPackage Bundles, AI artifacts |
| ClickHouse (or BigQuery / Snowflake) | analytics-service | Columnar warehouse for reporting |
| NATS JetStream | All services | Event log, durable streams |
| IndexedDB (Dexie) | Web client | Offline data, outbox |
| SQLite | Mobile/desktop client | Same logical schema as IndexedDB |

4.2 Data Ownership

  • Each service owns its schema. Cross-service queries forbidden. Read-models are projected via NATS subscriptions.
  • Write-side authoritative; read-side eventually consistent (target lag < 2s p95).
  • Reference data (e.g., country list, languages) is replicated via NATS broadcast on changes.
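To make the read-model rule concrete, here is a minimal sketch of a projection that a consuming service might keep in its own store instead of querying another service's schema. The event names, the per-course seq field, and the CourseListProjection class are assumptions for illustration; real subjects and payloads would come from the schema registry, and the subscription itself (NATS JetStream) is left out.

```typescript
// Hypothetical event shapes; seq is assumed to increase per course.
type CatalogEvent =
  | { type: "catalog.course.published.v1"; seq: number; tenantId: string; courseId: string; title: string }
  | { type: "catalog.course.archived.v1"; seq: number; tenantId: string; courseId: string };

interface CourseRow {
  tenantId: string;
  courseId: string;
  title: string;
  archived: boolean;
  lastSeq: number;
}

// Local read model owned by the consuming service; a Map stands in for its own table.
class CourseListProjection {
  private readonly rows = new Map<string, CourseRow>();

  apply(event: CatalogEvent): void {
    const key = `${event.tenantId}/${event.courseId}`;
    const existing = this.rows.get(key);
    // Ignore redelivered or stale events so replays converge to the same state.
    if (existing && event.seq <= existing.lastSeq) return;

    if (event.type === "catalog.course.published.v1") {
      this.rows.set(key, {
        tenantId: event.tenantId,
        courseId: event.courseId,
        title: event.title,
        archived: false,
        lastSeq: event.seq,
      });
    } else {
      // Keep a tombstone row so an older "published" event cannot resurrect the course.
      this.rows.set(key, {
        tenantId: event.tenantId,
        courseId: event.courseId,
        title: existing ? existing.title : "",
        archived: true,
        lastSeq: event.seq,
      });
    }
  }

  // Titles visible to one tenant; archived courses are hidden.
  list(tenantId: string): string[] {
    const titles: string[] = [];
    this.rows.forEach((row) => {
      if (row.tenantId === tenantId && !row.archived) titles.push(row.title);
    });
    return titles;
  }
}
```

The read side stays eventually consistent: until an event arrives the projection simply shows the previous state, which is the lag the 2s p95 target bounds.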

4.3 Multi-Tenant Isolation

  • tenant_id UUID NOT NULL on every row.
  • Postgres Row-Level Security policies (USING (tenant_id = current_setting('app.tenant_id')::uuid)) enforce isolation at the DB layer.
  • Application sets app.tenant_id per request from JWT.
  • Largest tenants may be promoted to dedicated logical schemas (no shared tables) without API change.
  • Shared connection pools route through a tenant-aware proxy (PgBouncer + connection-init that sets app.tenant_id).
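One way the application might set app.tenant_id per request is sketched below. It wraps the work in a transaction and uses PostgreSQL's set_config function with is_local = true, so the setting is discarded at commit or rollback and cannot leak to the next user of a pooled connection. The SqlClient interface and the withTenant helper are hypothetical stand-ins for whatever driver and wrapper a service actually uses; the connection-init approach mentioned above is a separate mechanism not shown here.

```typescript
// Minimal stand-in for a database client (assumption; e.g. a thin wrapper over a real driver).
interface SqlClient {
  query(text: string, values?: unknown[]): Promise<unknown>;
}

const TENANT_UUID_PATTERN = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

// Runs `work` inside a transaction whose app.tenant_id is set for RLS policies to read.
async function withTenant<T>(
  client: SqlClient,
  tenantId: string,
  work: (client: SqlClient) => Promise<T>,
): Promise<T> {
  if (!TENANT_UUID_PATTERN.test(tenantId)) throw new Error("tenantId must be a UUID");
  await client.query("BEGIN");
  try {
    // Third argument `true` makes the setting transaction-local, like SET LOCAL.
    await client.query("SELECT set_config('app.tenant_id', $1, true)", [tenantId]);
    const result = await work(client);
    await client.query("COMMIT");
    return result;
  } catch (err) {
    await client.query("ROLLBACK");
    throw err;
  }
}
```

With this in place, a query such as SELECT * FROM courses issued inside work only returns rows whose tenant_id matches the policy expression shown above.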

5. Technology Architecture

5.1 Runtime Components

            ┌──────────────┐         ┌──────────────┐
Web (PWA)  ─┤ Edge CDN     ├────────►│ Kong Gateway │
Mobile App ─┤ (Cloudflare) │         │ (API edge)   │
Desktop     └──────┬───────┘         └──────┬───────┘
                   │                        │
                   │                        ▼
            ┌──────▼─────┐     ┌──────────────────────────┐
            │ Next.js SSR│     │ NestJS Microservices x19 │
            └────────────┘     └──────┬───────────────────┘
                                      │
                     ┌────────────────┼────────────────┐
                     ▼                ▼                ▼
               ┌────────────┐   ┌────────────┐   ┌────────────┐
               │ Postgres   │   │ NATS JS    │   │ S3 / R2    │
               └────────────┘   └────────────┘   └────────────┘
                     ▲                ▲                ▲
                     │                │                │
               ┌────────────┐   ┌────────────┐   ┌────────────┐
               │ pgvector   │   │ OpenSearch │   │ ClickHouse │
               └────────────┘   └────────────┘   └────────────┘

5.2 Component Notes

  • Kong API Gateway (see ADR 0001) terminates north-south TLS, enforces edge auth (e.g. JWT/JWKS, API keys on documented routes), applies global and per-route rate limits, and forwards correlation and tenancy headers. Microservices still enforce authorization, tenant invariants, idempotency, and RLS.
  • Service mesh optional; if used, mTLS between services. JWT continues to carry user identity end-to-end.
  • Schema registry (NATS subjects → JSON Schema) versioned in a Git-backed registry consumed at build time.

6. Integration Architecture

| Integration | Pattern | Details |
| --- | --- | --- |
| Inter-service commands | NATS request/reply | Hard-bounded ≤ 2 hops |
| Inter-service events | NATS JetStream | At-least-once, idempotent consumers |
| External: Identity | OIDC/SAML | Pluggable per tenant |
| External: Billing | Webhooks (in/out) | Stripe/Adyen/Tap (vendor-abstract) |
| External: Email/SMS | Provider adapters | SES/SendGrid/Twilio (abstracted) |
| External: AI vendors | Behind ai-gateway | Multiple providers fan-out via router |
| External: LMS imports | SCORM upload + xAPI ingest | Per content-service |
| Public APIs | REST + Webhooks (signed) | Per-tenant API keys |
| Embedded experiences | Signed iframe + LTI 1.3 | Player embedding for partners |
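Since inter-service events are delivered at least once, a consumer can see the same event more than once. The sketch below shows one common way to make a side-effecting handler idempotent by remembering processed event ids. EventEnvelope and IdempotentConsumer are hypothetical names, and the in-memory Set is a stand-in for a durable store such as a Postgres table or Redis.

```typescript
// Hypothetical envelope; `id` is assumed to be unique per published event.
interface EventEnvelope<T> {
  id: string;
  subject: string;
  payload: T;
}

class IdempotentConsumer<T> {
  // Stand-in for a durable dedupe store.
  private readonly processed = new Set<string>();
  private readonly handler: (payload: T) => void;

  constructor(handler: (payload: T) => void) {
    this.handler = handler;
  }

  // Returns true when the handler ran, false when a duplicate was skipped.
  handle(event: EventEnvelope<T>): boolean {
    if (this.processed.has(event.id)) return false;
    this.handler(event.payload);
    // Recorded only after the handler succeeds, so a failed attempt is retried on redelivery.
    this.processed.add(event.id);
    return true;
  }
}

// Usage: a redelivered event does not issue a second certificate.
let certificatesIssued = 0;
const consumer = new IdempotentConsumer<{ learnerId: string }>(() => {
  certificatesIssued += 1;
});
const completed = { id: "evt-1", subject: "progress.course.completed.v1", payload: { learnerId: "u1" } };
consumer.handle(completed);
consumer.handle(completed);
```

In a real service the handler's side effect and the dedupe record would ideally be written in the same database transaction, so a crash between the two cannot cause a double apply.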

7. AI-First Positioning

AI is a first-class architectural concern, not a sidecar.

  • Single point of control: ai-gateway-service is the only egress for any model call. All product services depend on ports/AIClient whose default adapter calls the gateway.
  • Capability surface: completion (chat), structured generation (JSON), embeddings, image, TTS/STT, moderation, classification, summarization, translation.
  • Local-first: the gateway exposes the same interface to clients via /ai/v1/* SSE; clients may route to local models (WebGPU/WASM/on-device) and call the gateway only when local inference is unavailable or a quality threshold is not met. Local inferences still emit ai.inference.local.completed.v1 events for audit when the device next syncs.
  • Provenance: every AI artifact carries an ai_provenance block (see doc 12). Domain aggregates that hold AI content reject persistence without it.
  • Human-in-the-loop: AI-generated authoring blocks are persisted with status: 'draft_ai'; promotion to 'reviewed' | 'published' requires a domain action by an authorized user.
  • Safety: pre-call moderation (input policy), post-call moderation (output policy), PII redaction for any cloud-bound payload, refusal handling with UX path.
  • Cost & quotas: per-tenant budgets, per-feature quotas, soft-degrade (route to cheaper model) before hard-stop.
  • Auditability: every gateway call writes ai.gateway.call.completed.v1 with prompt-hash, model, tokens, cost, traceId, tenantId, userId, decisionId.
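The provenance and human-in-the-loop rules above could be enforced inside the aggregate itself, roughly as sketched here. The AiProvenance fields, the Actor shape, and the AiAuthoredBlock class are illustrative assumptions; the actual ai_provenance schema lives in doc 12 and real authorization is described in doc 13.

```typescript
// Illustrative fields only; not the real ai_provenance schema.
interface AiProvenance {
  model: string;
  promptHash: string;
  generatedAt: string; // ISO-8601 UTC timestamp
}

type BlockStatus = "draft_ai" | "reviewed" | "published";

// Hypothetical actor shape; a real check would go through RBAC/ABAC.
interface Actor {
  userId: string;
  canPromoteAiContent: boolean;
}

class AiAuthoredBlock {
  readonly id: string;
  readonly content: string;
  readonly provenance: AiProvenance;
  private currentStatus: BlockStatus = "draft_ai";
  private promotedBy: string | undefined = undefined;

  constructor(id: string, content: string, provenance: AiProvenance | undefined) {
    // The aggregate cannot be constructed without provenance, so it cannot be persisted without it.
    if (!provenance) throw new Error("AI content requires an ai_provenance block");
    this.id = id;
    this.content = content;
    this.provenance = provenance;
  }

  get status(): BlockStatus {
    return this.currentStatus;
  }

  get lastPromotedBy(): string | undefined {
    return this.promotedBy;
  }

  // Promotion is an explicit domain action by an authorized user, never an automatic step.
  promote(target: "reviewed" | "published", actor: Actor): void {
    if (!actor.canPromoteAiContent) {
      throw new Error(`User ${actor.userId} may not promote AI content`);
    }
    this.currentStatus = target;
    this.promotedBy = actor.userId;
  }
}
```

Whether 'published' must always pass through 'reviewed' first is not stated above, so the sketch does not enforce an ordering between the two targets.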

8. Offline-First Positioning

Offline is a first-class architectural concern, not a fallback.

  • Single sync protocol: all clients use sync-service via /sync/v1/pull|push. No service invents its own sync.
  • Local store port: LocalStore interface implemented by Dexie (web) and SQLite (mobile/desktop). Domain logic on the client targets the port — never the implementation.
  • Outbox-on-client: every mutation produces a LocalMutation row keyed by clientMutationId. Sync push sends these in causal order; server is idempotent.
  • PlayPackage Bundles: delivery-service publishes signed, encrypted bundles per (course version, locale). Clients download, verify signature, decrypt with device-bound key, mount in local storage.
  • Conflict policy per aggregate: declared in each service doc; enforced server-side; surfaced in UI when needed (authoring drafts only).
  • License enforcement offline: every PlayPackage Bundle includes a license envelope (offline expiry, device limit, feature flags). Player refuses to play expired/revoked bundles. Revocation propagates on next sync.
  • Tamper detection: bundle hash + signature verified on every mount. Failure raises delivery.bundle.tamper_detected.v1.
  • Offline AI: local inference engine integrated into client; usage logged locally, replayed at sync.
  • Storage management: quotas configurable per tenant + per device; eviction policy LRU among non-pinned bundles; user-initiated pinning prevents eviction.
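The outbox-on-client rule pairs with an idempotent server: if an acknowledgement is lost, the client can resend the same batch safely. The following is a simplified sketch of that interaction. ClientOutbox and SyncServerStub are hypothetical, the clientMutationId format is invented for the example, and creation order on a single device is used as a stand-in for causal order.

```typescript
interface LocalMutation {
  clientMutationId: string;
  aggregateId: string;
  op: string;
  payload: unknown;
}

class ClientOutbox {
  private readonly deviceId: string;
  private pending: LocalMutation[] = [];
  private nextSeq = 1;

  constructor(deviceId: string) {
    this.deviceId = deviceId;
  }

  // Every local mutation gets a stable id before it leaves the device.
  enqueue(aggregateId: string, op: string, payload: unknown): LocalMutation {
    const mutation: LocalMutation = {
      clientMutationId: `${this.deviceId}:${this.nextSeq}`,
      aggregateId,
      op,
      payload,
    };
    this.nextSeq += 1;
    this.pending.push(mutation);
    return mutation;
  }

  // Pending mutations in creation order.
  nextBatch(): LocalMutation[] {
    return this.pending.slice();
  }

  // Drop mutations only once the server has confirmed them.
  acknowledge(ids: string[]): void {
    const acked = new Set(ids);
    this.pending = this.pending.filter((m) => !acked.has(m.clientMutationId));
  }
}

// Stand-in for the server side of /sync/v1/push.
class SyncServerStub {
  private readonly applied = new Map<string, LocalMutation>();

  push(batch: LocalMutation[]): string[] {
    const acknowledged: string[] = [];
    for (const mutation of batch) {
      // Apply each clientMutationId at most once; always acknowledge so the client can clear it.
      if (!this.applied.has(mutation.clientMutationId)) {
        this.applied.set(mutation.clientMutationId, mutation);
      }
      acknowledged.push(mutation.clientMutationId);
    }
    return acknowledged;
  }

  get appliedCount(): number {
    return this.applied.size;
  }
}
```

A client that pushes a batch, loses the response, and pushes the same batch again ends up with each mutation applied once, after which acknowledge() empties the outbox.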

9. Multi-Tenant Model

| Layer | Mechanism |
| --- | --- |
| API | JWT carries tenant_id; gateway sets RequestContext.tenant; deny if absent on tenant-scoped endpoints |
| Application | All use-cases require TenantId parameter; cross-tenant references rejected at constructor |
| Domain | TenantId is a value object on every aggregate root; aggregates refuse construction without it |
| DB | tenant_id column + RLS on every table; no service-account bypass except for ops jobs which set context explicitly |
| Storage (S3/R2) | Per-tenant prefix s3://bucket/tenants/<tenantId>/...; signed URLs scoped per object |
| Search | Tenant filter injected into every query; index aliases per tenant for the largest customers |
| Vector | Tenant filter on every similarity query; collection partitioning for largest |
| AI | Gateway pins per-tenant prompts, models, budgets, safety policies |
| Sync | Cursor scoped to (tenantId, userId, deviceId) |
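The Domain and Storage rows can be illustrated with a small sketch: a TenantId value object that refuses invalid input, an aggregate that rejects cross-tenant references, and a helper that builds the per-tenant object key. CourseAggregate and tenantObjectKey are hypothetical names, and the UUID format check is an assumption based on the tenant_id UUID column described earlier.

```typescript
const TENANT_ID_FORMAT = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

class TenantId {
  readonly value: string;

  private constructor(value: string) {
    this.value = value;
  }

  // The only way to obtain a TenantId, so an invalid or missing id never enters the domain.
  static parse(raw: string | undefined): TenantId {
    if (!raw || !TENANT_ID_FORMAT.test(raw)) throw new Error("A valid tenant id is required");
    return new TenantId(raw.toLowerCase());
  }

  equals(other: TenantId): boolean {
    return this.value === other.value;
  }
}

interface BlockRef {
  id: string;
  tenantId: TenantId;
}

class CourseAggregate {
  readonly id: string;
  readonly tenantId: TenantId;
  private readonly blockIds: string[] = [];

  constructor(id: string, tenantId: TenantId) {
    this.id = id;
    this.tenantId = tenantId;
  }

  // Invariant: a course never references a block owned by another tenant.
  addBlock(block: BlockRef): void {
    if (!block.tenantId.equals(this.tenantId)) {
      throw new Error("Cross-tenant block reference rejected");
    }
    this.blockIds.push(block.id);
  }

  get blockCount(): number {
    return this.blockIds.length;
  }
}

// Builds the object key under the per-tenant prefix tenants/<tenantId>/...
function tenantObjectKey(tenantId: TenantId, relativePath: string): string {
  if (relativePath.split("/").indexOf("..") !== -1) throw new Error("Path traversal is not allowed");
  return `tenants/${tenantId.value}/${relativePath.replace(/^\/+/, "")}`;
}
```

These checks are defence in depth only; the DB row above (RLS on every table) remains the final barrier if an application-level check is missed.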

Tenant types:

  • Org tenant — multi-user, RBAC, SSO-eligible.
  • Provider tenant — author + sell on marketplace; can be combined with org tenant.
  • Individual tenant — single-user, no RBAC, social/email login.

10. Security Model (overview; full detail in doc 13)

  • AuthN: OIDC/SAML SSO + email+password + magic link + WebAuthn; JWT (15-min access, 30-day rotating refresh).
  • AuthZ: RBAC at coarse scope, ABAC for fine-grained (attribute = tenant_id, org_unit, course_visibility).
  • Encryption: TLS 1.3 everywhere; AES-256 at rest; per-tenant data keys via KMS envelope encryption; per-device keys for offline bundles.
  • Audit: immutable append-only log streamed off NATS; daily Merkle-root anchoring for tamper evidence.
  • AI safety: see section 7; full policy in doc 13.
  • Offline: device binding, license envelope, bundle signing — see section 8.
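For the audit bullet, the idea behind daily Merkle-root anchoring can be sketched as follows: the day's log entries are hashed pairwise up to a single root, and any later change to any entry changes that root. This is only an illustration under stated assumptions. The tree layout (leaf and node prefixes, duplicating the last node on odd levels) is one common choice and not necessarily what the platform uses, and the default hash is a toy 32-bit FNV-1a included so the sketch runs without dependencies. It is not tamper-resistant; a real deployment would inject a cryptographic hash such as SHA-256.

```typescript
type HashFn = (input: string) => string;

// Toy, NON-cryptographic stand-in hash (32-bit FNV-1a over UTF-16 code units).
function fnv1a(input: string): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    h ^= input.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return ("0000000" + h.toString(16)).slice(-8);
}

// Computes a Merkle root over the serialized audit entries of one day.
function merkleRoot(entries: string[], hash: HashFn = fnv1a): string {
  if (entries.length === 0) return hash("");
  // Distinct prefixes keep a leaf hash from being confused with an inner node hash.
  let level = entries.map((entry) => hash("leaf:" + entry));
  while (level.length > 1) {
    const next: string[] = [];
    for (let i = 0; i < level.length; i += 2) {
      const left = level[i];
      // On an odd-sized level the last node is paired with itself.
      const right = i + 1 < level.length ? level[i + 1] : left;
      next.push(hash("node:" + left + right));
    }
    level = next;
  }
  return level[0];
}

// Usage: anchor `root` somewhere immutable; recompute later and compare.
const dayEntries = ["2024-01-01T00:00:01Z login u1", "2024-01-01T00:05:00Z export u2"];
const root = merkleRoot(dayEntries);
```

Anchoring only the root keeps the published value small while still letting an auditor detect edits, insertions, or deletions in that day's entries.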

11. Compliance Posture (overview)

  • GDPR: lawful basis per processing activity; data subject rights (export, erasure, portability) implemented per service via gdpr.subject_request.received.v1 choreography.
  • SOC 2 Type II: logging, change management, access reviews enforced via automation.
  • ISO 27001: ISMS controls; risk register integrated with this spec set.
  • HIPAA (optional add-on for healthcare tenants): BAA, PHI encryption, restricted AI providers (no training on tenant data), audit-export.
  • Regional residency: per-tenant data region pin; cross-region replication only with explicit opt-in.
  • Accessibility: WCAG 2.2 AA across all tenant-facing surfaces.
  • AI-specific: EU AI Act risk-classification for each AI capability; AI provenance + human review records retained for 7 years.

12. Cross-Cutting Concerns (Architectural Mandates)

| Concern | Mandate |
| --- | --- |
| Observability | OpenTelemetry traces, metrics, logs across every service; trace propagation via traceparent header through HTTP, NATS, and SSE |
| Versioning | URLs /api/v1, events *.vN, schemas in registry, prompts semver-pinned |
| Idempotency | All write endpoints accept Idempotency-Key; required on sync push |
| Time | UTC everywhere on the wire; client converts; Lamport clocks for sync ordering |
| I18n | Every user-facing string flows through translation pipeline; locale-aware formatting; AI translation behind the AI Gateway |
| RTL | Logical CSS properties only; tested in both directions; no LTR-only widgets |
| Accessibility | WCAG 2.2 AA; automated axe checks gate CI; manual audit per release of player + author |
| Testing | Unit (domain-pure), integration (Testcontainers), contract (Pact), E2E (Playwright), prompt regression (golden + structural), offline (airplane-mode E2E + sync replay), load (k6) |
| Feature flags | Tenant-scoped; default-off for AI features pending per-tenant opt-in |
| Disaster recovery | RPO 5min (Postgres PITR + JetStream replay), RTO 60min, region failover quarterly drill |
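The Time row mentions Lamport clocks for sync ordering. A Lamport clock is a plain counter that is incremented on every local event and, on receiving a remote timestamp, jumps past it. The minimal sketch below shows those two rules; the LamportClock class is illustrative, and how timestamps travel in sync payloads is not specified here.

```typescript
class LamportClock {
  private counter = 0;

  // Local event (for example, a new mutation): advance and return the timestamp.
  tick(): number {
    this.counter += 1;
    return this.counter;
  }

  // Message received: move past both our own time and the sender's timestamp.
  receive(remoteTimestamp: number): number {
    this.counter = Math.max(this.counter, remoteTimestamp) + 1;
    return this.counter;
  }

  get now(): number {
    return this.counter;
  }
}

// Usage: a device makes two edits offline, then syncs with the server.
const deviceClock = new LamportClock();
const serverClock = new LamportClock();
const firstEdit = deviceClock.tick();              // 1
const secondEdit = deviceClock.tick();             // 2
const received = serverClock.receive(secondEdit);  // 3
const serverEvent = serverClock.tick();            // 4
const pulled = deviceClock.receive(serverEvent);   // 5
```

If one event causally precedes another, its timestamp is smaller. The converse does not hold, so concurrent events still need a tie-breaker such as a device id, which is an assumption beyond what is stated above.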

13. Why This Architecture (DDD + Clean Architecture Tie-back)

  • DDD: Bounded contexts align with team ownership and ubiquitous language. Aggregates protect invariants (e.g., a Course cannot reference blocks from another tenant). Domain events drive cross-context choreography rather than RPC chains.
  • Clean Architecture: Domain-first design lets us swap NestJS for any framework, Postgres for any RDBMS, and NATS for any broker without touching business rules. Use-cases are explicit and individually testable.
  • Event-driven: Decouples services, enables replay, gives us a natural audit log, and makes offline-first sync expressible as a stream of intents.
  • AI-first: Centralizing AI behind a single gateway means safety, provenance, cost, and governance are enforced once — not 19 times.
  • Offline-first: A single sync protocol and a single client-side store port mean every team builds offline support the same way, with the same conflict semantics, instead of inventing N incompatible local caches.