
Security & Tenancy

:::info Source
Sourced from docs/13-security-compliance-tenancy.md in the documentation repo.
:::

Companion: 01 Enterprise Architecture · 03 Microservices · 05 API Design · 12 Data Models

This document consolidates the platform's security model, compliance posture, and the multi-layer enforcement of multi-tenant isolation. Every other doc references this one for the canonical rules.

1. Threat Model (summary)

| Threat | Surface | Mitigation |
| --- | --- | --- |
| Credential stuffing / spray | identity-service | argon2id, lockout, adaptive MFA, anomaly classifier |
| Session hijack | API gateway | short JWT TTL + rotating refresh; device binding |
| Cross-tenant data leak | All services | RLS + JWT scope + domain invariants |
| Insecure direct object reference | All services | ABAC policy on every resource access |
| SCORM zip RCE | content-service | Sandbox import + manifest validator + signed origin allowlist |
| Bundle tampering | client + content-service | JWS signature + AES-GCM + tamper report flow |
| Prompt injection | ai-gateway-service | Prompt-injection classifier + system prompt isolation |
| AI PII exfiltration | ai-gateway-service | Pre-call PII redaction; provider no-train flag |
| Webhook replay | webhooks (in/out) | HMAC signing + nonce + timestamp window |
| LMS embed CSRF | LTI | LTI 1.3 platform key validation + per-launch nonce |
| Storage takeover | media + content | Signed URLs scoped per object + per-caller |
| Supply chain | All services | SBOM + provenance attestations (SLSA) + lockfiles |
| Insider abuse | platform admin | Just-in-time elevation + four-eyes + audit |
| DoS | edge + APIs | Rate limits + WAF + geographic blocking |

2. Identity & Authentication

  • Standards: OAuth 2.1, OIDC, WebAuthn Level 2, SAML 2.0.
  • Local AuthN: email + password (argon2id, m=64 MiB, t=3, p=1), magic link, WebAuthn passkeys.
  • MFA factors: TOTP, SMS (deprecated for sensitive scopes), WebAuthn, recovery codes.
  • Adaptive MFA: triggered by risk classifier (new device, atypical IP, behavior).
  • Sessions: access JWT (15 min), rotating refresh (30 d sliding, single-use rotation, family-revoke on detected reuse).
  • Device binding: every offline-capable device has a public key registered with identity-service; PlayPackage Bundles are encrypted with a key derived from (tenantKey, devicePubKey, bundleId).
  • JWT signing: asymmetric (EdDSA Ed25519) with KMS-backed keys + kid rotation; JWKS published.
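
The single-use rotation and family-revoke rule for refresh tokens can be sketched as follows. This is a minimal in-memory illustration, not the identity-service implementation: `RefreshStore`, its field names, and the use of `PermissionError` are assumptions, and the 30-day sliding expiry is omitted for brevity.

```python
import secrets
from dataclasses import dataclass, field


@dataclass
class RefreshStore:
    """Hypothetical in-memory stand-in for the refresh-token table."""

    tokens: dict = field(default_factory=dict)          # token -> {"family", "used"}
    revoked_families: set = field(default_factory=set)  # families killed after reuse

    def issue(self, family_id: str) -> str:
        """Create a fresh, unused refresh token inside a token family."""
        token = secrets.token_urlsafe(32)
        self.tokens[token] = {"family": family_id, "used": False}
        return token

    def rotate(self, token: str) -> str:
        """Exchange a refresh token for a new one; each token works exactly once."""
        record = self.tokens.get(token)
        if record is None or record["family"] in self.revoked_families:
            raise PermissionError("unknown or revoked refresh token")
        if record["used"]:
            # An already-rotated token was presented again: treat it as theft
            # and revoke every token descended from the same login.
            self.revoked_families.add(record["family"])
            raise PermissionError("refresh token reuse detected; family revoked")
        record["used"] = True
        return self.issue(record["family"])
```

Revoking the whole family means that once a stolen token is replayed, the legitimate client's newest token also stops working, which forces a fresh login.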

3. Authorization (RBAC + ABAC)

3.1 Coarse RBAC roles (system + tenant)

  • platform_admin, compliance_officer
  • org_owner, org_admin, org_manager
  • provider_admin, author, reviewer, publisher
  • learner, individual

3.2 Fine-grained ABAC predicates

Examples (serialized as expression trees in permissions.condition):

  • resource.tenant_id == ctx.tenant_id
  • resource.org_unit_id IN ctx.user.org_units
  • resource.visibility IN ('marketplace','public')
  • resource.created_by == ctx.user.id
  • resource.assignment.owner_id == ctx.user.id

3.3 Decision flow

  1. JWT presented → verified → claims loaded into RequestContext.
  2. Route declares required resource:action.
  3. Policy engine evaluates (role grants action) AND (ABAC predicate true).
  4. Decision logged with decisionId (linked into AI/HITL audit chains).

The POST /api/v1/authz/check endpoint lets UIs ask whether an action is permitted before showing it.
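
Step 3 of the decision flow, (role grants action) AND (ABAC predicate true), can be sketched with a small expression-tree evaluator. The tree encoding (`op`, `ref`, `lit`), the `ROLE_GRANTS` table, and the `course:*` actions are hypothetical; the source only states that predicates are serialized as expression trees in permissions.condition. Decision logging with decisionId is left out.

```python
from typing import Any

# Hypothetical role -> granted "resource:action" pairs.
ROLE_GRANTS = {
    "author": {"course:read", "course:update"},
    "learner": {"course:read"},
}


def resolve(operand: dict, env: dict) -> Any:
    """Resolve {"ref": "resource.tenant_id"} against env, or return a {"lit": ...} value."""
    if "ref" in operand:
        value: Any = env
        for part in operand["ref"].split("."):
            value = value[part]
        return value
    return operand["lit"]


def evaluate(node: dict, env: dict) -> bool:
    """Evaluate one node of the (assumed) ABAC expression tree."""
    op = node["op"]
    if op == "eq":
        return resolve(node["left"], env) == resolve(node["right"], env)
    if op == "in":
        return resolve(node["left"], env) in resolve(node["right"], env)
    if op == "and":
        return all(evaluate(child, env) for child in node["args"])
    if op == "or":
        return any(evaluate(child, env) for child in node["args"])
    raise ValueError(f"unknown operator: {op}")


def check(role: str, action: str, condition: dict, resource: dict, ctx: dict) -> bool:
    """Allow only when the role grants the action AND the ABAC predicate holds."""
    env = {"resource": resource, "ctx": ctx}
    return action in ROLE_GRANTS.get(role, set()) and evaluate(condition, env)


# resource.tenant_id == ctx.tenant_id AND resource.org_unit_id IN ctx.user.org_units
condition = {"op": "and", "args": [
    {"op": "eq", "left": {"ref": "resource.tenant_id"}, "right": {"ref": "ctx.tenant_id"}},
    {"op": "in", "left": {"ref": "resource.org_unit_id"}, "right": {"ref": "ctx.user.org_units"}},
]}
```

Unknown operators raise rather than evaluate to false, so a malformed policy fails loudly instead of silently changing who is allowed.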

4. Multi-Tenant Isolation (Multi-Layer)

| Layer | Enforcement detail |
| --- | --- |
| Edge (CDN) | Per-tenant domain or path; CSP nonces per tenant |
| Kong + services | Edge (Kong) validates JWT on protected routes; services inject tenant_id into RequestContext and reject mismatched X-Tenant-Id header (see ADR 0001) |
| Application (use cases) | Use-cases require TenantId parameter; cross-tenant references rejected at construction |
| Domain (aggregates) | TenantId value object on every aggregate root; invariants reject cross-tenant references |
| Postgres | RLS enabled on every table: USING (tenant_id = current_setting('app.tenant_id')::uuid) |
| Postgres connections | Pool wrapper sets app.tenant_id per request (proxy-init) |
| Storage (S3/R2) | Per-tenant prefix tenants/{tid}/...; signed URLs scoped per object + caller; bucket policy denies cross-tenant prefix access |
| Search (OpenSearch) | Tenant filter injected; alias-per-tenant for the largest tenants |
| Vectors (pgvector) | Tenant filter on every k-NN; collection partitions for the largest tenants |
| Caches (Redis) | Key prefixes tenants/{tid}/...; eviction never crosses tenants |
| AI Gateway | Per-tenant prompt pinning, per-tenant budgets, per-tenant cache |
| Sync | Cursors + mutations + conflicts scoped by tenant + user + device |
| Logs | tenant_id on every line; PII-scrubbed; per-tenant retention |
| Backups | Per-region per-tenant; restore test quarterly |

Tenant isolation tests are mandatory in CI: every service runs a "two-tenant simulator" suite that asserts every read/write/event surface refuses cross-tenant access.
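
Two of the layers above, the X-Tenant-Id mismatch check and the tenant-prefixed cache keys, can be illustrated with a short sketch. The claim names (`tenant_id`, `sub`), the `TenantMismatch` exception, and the helper names are assumptions; a real service would also treat header names case-insensitively.

```python
from dataclasses import dataclass


class TenantMismatch(Exception):
    """Raised when the X-Tenant-Id header disagrees with the verified JWT."""


@dataclass(frozen=True)
class RequestContext:
    tenant_id: str
    user_id: str


def build_context(jwt_claims: dict, headers: dict) -> RequestContext:
    """Take the tenant from the verified token; the header may only confirm it."""
    tenant_id = jwt_claims["tenant_id"]
    header_tid = headers.get("X-Tenant-Id")
    if header_tid is not None and header_tid != tenant_id:
        raise TenantMismatch("X-Tenant-Id does not match the token's tenant")
    return RequestContext(tenant_id=tenant_id, user_id=jwt_claims["sub"])


def cache_key(ctx: RequestContext, *parts: str) -> str:
    """Build a Redis key under the tenants/{tid}/ prefix."""
    return "/".join(("tenants", ctx.tenant_id, *parts))
```

A "two-tenant simulator" test in this style would build a context for tenant A, then assert that a request carrying tenant B's header raises and that no key produced for A starts with B's prefix.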

5. Data Classification & Encryption

| Class | Examples | At-rest | In-transit |
| --- | --- | --- | --- |
| Public | Marketing, public certs | TLS 1.3 | TLS 1.3 |
| Internal | Course catalog metadata | AES-256, KMS shared | TLS 1.3 |
| Confidential | Learner progress, quiz keys | AES-256, per-tenant KMS data keys (envelope) | TLS 1.3 + mTLS internal |
| Restricted | Credentials, payment refs, PHI | AES-256, per-tenant KMS, restricted access | TLS 1.3 + mTLS + JIT access |
| Offline-bundled | PlayPackage Bundles | AES-256-GCM, per-device-derived key | TLS 1.3 |

Key management:

  • KMS-backed (HSM root); per-tenant DEK; hierarchical KEK rotation annual; emergency rotation supported.
  • HSM-backed signing keys for JWT, JWS provenance, certificate proof, bundle signature.
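
The per-device-derived key for offline bundles is described only as derived from (tenantKey, devicePubKey, bundleId); the derivation function itself is not named. One plausible shape, assuming HKDF-SHA256 (RFC 5869) built from the standard library, is sketched below. The salt label `playpackage-bundle-v1` and the length-prefixed `info` encoding are illustrative choices, not documented behaviour.

```python
import hashlib
import hmac


def hkdf_sha256(ikm: bytes, salt: bytes, info: bytes, length: int = 32) -> bytes:
    """HKDF (RFC 5869) with SHA-256: extract a PRK, then expand to `length` bytes."""
    prk = hmac.new(salt, ikm, hashlib.sha256).digest()
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def derive_bundle_key(tenant_key: bytes, device_pub_key: bytes, bundle_id: str) -> bytes:
    """Derive a 256-bit bundle key bound to one tenant, one device, and one bundle."""
    # Length-prefix each component so (b"ab", "c") and (b"a", "bc") cannot collide.
    parts = (device_pub_key, bundle_id.encode("utf-8"))
    info = b"".join(len(p).to_bytes(2, "big") + p for p in parts)
    return hkdf_sha256(ikm=tenant_key, salt=b"playpackage-bundle-v1", info=info)
```

Because the device public key and bundle ID feed the derivation, a bundle copied to another device or relabelled as a different bundle yields a different key and fails AES-GCM authentication.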

6. Network & Edge

  • TLS 1.3 only; HSTS preload.
  • Strong CSP per-route; nonce-based scripts.
  • WAF rules: OWASP CRS + custom (LMS-specific).
  • Geo controls per tenant.
  • Anti-bot on signup + checkout.
  • Service mesh (mTLS) inside cluster; per-service identities for inter-service auth.

7. Application Security

  • OWASP ASVS L2 baseline; selected modules (auth, sessions, payments) at L3.
  • Input validation at every boundary using Zod (frontend) and Ajv (backend) against shared schemas.
  • Output encoding: React escaping + DOMPurify for any innerHTML (rare).
  • ORM-only DB access; no string-built SQL.
  • File uploads scanned (AV + content-safety) before becoming addressable.
  • SCORM imports validated + sandboxed (no eval in zip; manifest-driven).
  • Webhook signatures HMAC-SHA256 with nonce + timestamp (5-min window).
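
Webhook verification with HMAC-SHA256, a nonce, and a 5-minute timestamp window might look like the sketch below. The signed-message layout (`timestamp.nonce.body`) and the in-memory `seen_nonces` set are assumptions; a deployment would keep nonces in a shared store with a TTL at least as long as the window.

```python
import hashlib
import hmac
import time

WINDOW_SECONDS = 300  # the 5-minute window from the text


def sign(secret: bytes, timestamp: int, nonce: str, body: bytes) -> str:
    """Sign timestamp, nonce, and body together so none can be swapped independently."""
    message = f"{timestamp}.{nonce}.".encode("utf-8") + body
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify(secret: bytes, timestamp: int, nonce: str, body: bytes,
           signature: str, seen_nonces: set, now: float = None) -> bool:
    """Accept a webhook only if it is fresh, correctly signed, and not a replay."""
    now = time.time() if now is None else now
    if abs(now - timestamp) > WINDOW_SECONDS:
        return False  # outside the timestamp window
    expected = sign(secret, timestamp, nonce, body)
    if not hmac.compare_digest(expected, signature):
        return False  # bad signature; constant-time comparison
    if nonce in seen_nonces:
        return False  # replay of an already accepted delivery
    seen_nonces.add(nonce)  # record the nonce only after the signature checks out
    return True
```

Recording the nonce only after the signature is verified keeps an attacker from filling the nonce store with unsigned garbage.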

8. AI Safety & Governance (Mandatory; full surface in 03 ai-gateway-service)

  • All AI calls routed via ai-gateway-service.
  • Pre-call: moderation; PII redaction (configurable); prompt-injection shield (heuristic + classifier).
  • Routing: local → small cloud → large cloud; per-tenant budget gate.
  • Post-call: moderation; structured-output schema validation; refusal handling.
  • Provenance: every artifact carries AIProvenance (see 12).
  • HITL: AI-generated authoring blocks keep status='draft_ai' until a human accepts them; decisionId ties acceptance into the audit chain.
  • No training on tenant data: outbound provider configs explicitly disable training; verified at the integration-test layer with provider-specific assertions.
  • Tenant-scoped embeddings: never cross-tenant; deletion follows tenant + user lifecycle.
  • EU AI Act: each AI capability is classified (limited / high-risk); high-risk capabilities (e.g., AI grading, AI risk-scoring of learners) require additional documentation, post-market monitoring, and explicit human override paths.
  • Bias monitoring: AI assignment recommendations + AI grading evaluated quarterly against demographic-parity + equalized-odds metrics on consenting sample data.
  • Right to explanation: UI surfaces "why this recommendation" / "why this score" using model rationale + feature attribution where available.
  • Refusal & dispute: users may dispute any AI decision; routes to human reviewer with SLA.
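
The two fairness metrics named under bias monitoring have standard definitions that are easy to compute. The sketch below assumes binary predictions and labels (0/1) and one group label per record; the function names are illustrative, and the source does not specify alert thresholds, so none are shown.

```python
def _rate(flags: list) -> float:
    """Share of 1s in a list of 0/1 values (0.0 for an empty list)."""
    return sum(flags) / len(flags) if flags else 0.0


def demographic_parity_gap(preds: list, groups: list) -> float:
    """Largest difference in positive-prediction rate between any two groups."""
    by_group: dict = {}
    for pred, group in zip(preds, groups):
        by_group.setdefault(group, []).append(pred)
    rates = [_rate(values) for values in by_group.values()]
    return max(rates) - min(rates)


def equalized_odds_gap(preds: list, labels: list, groups: list) -> float:
    """Larger of the between-group gaps in true-positive and false-positive rate."""
    tpr: dict = {}  # predictions for records whose true label is 1
    fpr: dict = {}  # predictions for records whose true label is 0
    for pred, label, group in zip(preds, labels, groups):
        (tpr if label == 1 else fpr).setdefault(group, []).append(pred)
    gaps = []
    for table in (tpr, fpr):
        rates = [_rate(values) for values in table.values()]
        if rates:
            gaps.append(max(rates) - min(rates))
    return max(gaps) if gaps else 0.0
```

A group with no positive (or no negative) labels is simply absent from the corresponding rate table here; a production evaluation would want to flag that case rather than skip it.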

9. Offline License Enforcement

  • Every PlayPackage Bundle issued with a LicenseEnvelope (see 12) signed by tenant key.
  • Enforces: expiry, device binding, feature gating (AI tutor on/off offline, certificate eligibility, copy/download).
  • Player refuses to mount expired/revoked bundles; revocation propagates via sync within minutes online.
  • License envelope tampering detected via signature verification on every mount.

10. Tamper Detection (Offline)

  • Bundle SHA-256 verified against signed manifest at mount.
  • Failure → unmount + content.bundle.tamper_detected.v1 queued for next sync.
  • Repeated failures → device flagged; user offered fresh download; admin alert.
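
The mount-time check can be sketched as follows. It assumes the manifest's signature has already been verified, so the digest it carries is trusted; `mount`, the `outbox` list, and the event payload fields other than the event name are hypothetical.

```python
import hashlib
import hmac


def verify_bundle(bundle_bytes: bytes, manifest_sha256_hex: str) -> bool:
    """Compare the bundle's SHA-256 with the digest from the signed manifest."""
    actual = hashlib.sha256(bundle_bytes).hexdigest()
    return hmac.compare_digest(actual, manifest_sha256_hex.lower())


def mount(bundle_id: str, bundle_bytes: bytes, manifest_sha256_hex: str, outbox: list) -> bool:
    """Mount only intact bundles; queue a tamper event for the next sync otherwise."""
    if verify_bundle(bundle_bytes, manifest_sha256_hex):
        return True
    # Refuse to mount and leave an event for the sync layer to deliver later.
    outbox.append({"type": "content.bundle.tamper_detected.v1", "bundleId": bundle_id})
    return False
```

Counting queued events per device is one way the "repeated failures" rule could be implemented, though the source does not say where that counter lives.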

11. Audit & Logging

  • Append-only audit log for: identity events, role/permission changes, data access decisions, AI calls, billing actions, GDPR requests, license grants, certificate issuance, sync conflicts, tenant data residency changes.
  • Daily Merkle anchoring: root hash committed to internal anchor store + emitted as audit.merkle.anchored.v1. Optionally anchored externally per tenant policy.
  • Tamper evidence: any audit table change made after anchoring produces an anchor mismatch, which the daily verification detects.
  • PII scrubbing in operational logs; full PII allowed only in dedicated audit log under restricted access.
  • Per-tenant export: compliance officer can export audit slice via analytics-service.
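
Daily Merkle anchoring and its verification can be illustrated with a small root computation. The tree construction is an assumption (SHA-256, domain-separated leaf and node prefixes, an unpaired node carried up unchanged); the source does not specify the exact scheme, and `verify_anchor` is a hypothetical name for the daily verification job's core check.

```python
import hashlib
from typing import Sequence


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(entries: Sequence[bytes]) -> bytes:
    """Compute a Merkle root over the day's serialized audit entries, in order."""
    if not entries:
        return _h(b"")
    # Prefix leaves with 0x00 and inner nodes with 0x01 so a leaf can never
    # be reinterpreted as an inner node.
    level = [_h(b"\x00" + entry) for entry in entries]
    while len(level) > 1:
        paired = [_h(b"\x01" + level[i] + level[i + 1])
                  for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            paired.append(level[-1])  # carry the unpaired node up unchanged
        level = paired
    return level[0]


def verify_anchor(entries: Sequence[bytes], anchored_root_hex: str) -> bool:
    """Recompute the root from the audit table and compare it with the anchored value."""
    return merkle_root(entries).hex() == anchored_root_hex
```

Any edited, removed, or reordered entry changes the recomputed root, so the comparison against the anchored value fails on the next verification run.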

12. Compliance Posture

| Standard | Status | Notes |
| --- | --- | --- |
| GDPR | Required | DSR flow + lawful basis registry + DPA |
| SOC 2 Type II | Required | Annual audit; logging + access reviews |
| ISO 27001 | Required | ISMS docs aligned with this spec set |
| HIPAA (opt-in) | Available | BAA + restricted AI providers + PHI tagging |
| FERPA (opt-in) | Available | Education records + parental access flow |
| ISO/IEC 42001 (AI MS) | Adopt | Aligns with EU AI Act |
| EU AI Act | Required | Risk classification + transparency obligations |
| WCAG 2.2 AA | Required | All tenant-facing UIs |
| PCI DSS | Out-of-scope (tokenized) | Card data never touches our DB |
| KSA / UAE PDPL | Required for region | Data residency + lawful basis |
| Schrems II | Required | Cloud LLM transfers via SCCs + supplementary measures |

13. Data Subject Rights (GDPR / equivalents)

  • Access / portability: POST /api/v1/me/data-export raises gdpr.subject_request.received.v1; each service contributes its data, and the aggregated zip is emailed to the requester.
  • Erasure: request raised; saga across services; financial/audit data may be retained under legal basis with redacted flag.
  • Rectification: standard profile/preference endpoints.
  • Objection / restriction: AI-decision opt-out and opt-down (manual review only).
  • Right not to be subject to automated decision-making: AI features that meet the threshold (high-risk) provide explicit human-only path.

14. Data Residency

  • Per-tenant region pin (us, eu, me, ap).
  • Cross-region replication only with explicit opt-in.
  • Tenant residency change is a saga (see 04).
  • Vector + AI cache stay in-region; cross-region embeddings forbidden.
  • Backups stay in-region; DR cross-region only with opt-in.

15. Operational Security

  • Just-in-time access: internal staff request elevated tenant access; auto-expires; logged with reason; four-eyes for restricted actions.
  • Bastion + audit: all production access via bastion with session recording.
  • Secrets: vault-backed; rotated; per-environment; never in source.
  • SBOM: generated per build; signed; vulnerability scan gate.
  • Patch SLA: critical 24h, high 7d, medium 30d.
  • DR drills: quarterly; full region failover annually.

16. Incident Response

  • 24×7 on-call rotation per service.
  • Severity matrix (Sev1 → Sev4) with comms playbooks.
  • Customer-impacting incidents disclosed within contractual SLA (typically 72 h for data incidents).
  • Post-incident review within 5 business days; corrective actions tracked.

17. Privacy & Data Minimization

  • Collect only what's needed; defaults privacy-preserving.
  • AI features default-off pending tenant opt-in.
  • Telemetry user-identifying fields hashed; opt-out per user.
  • Cookies: strictly necessary by default; consent for analytics; granular per region.

18. Testing & Verification

  • Tenant isolation suite in every service.
  • AuthZ test matrix (role × resource × condition).
  • DAST + SAST in CI; targeted pen-test per release; bug-bounty program.
  • AI safety suite: prompt-injection corpus; PII corpus; jailbreak corpus; bias evals.
  • Offline integrity suite: bundle tamper, license expiry, revocation under sync.
  • Audit proofs: daily Merkle root verified by independent job.

19. Why

Trust is the product's substrate. Multi-tenant SaaS with AI + offline can fail in many subtle ways — cross-tenant cache hits, AI provenance loss, offline license bypass. Concentrating these mitigations into a single doc + a single set of mandates (RLS, gateway, sync protocol, audit) means we enforce them consistently across 19 services rather than re-deriving them per team.