Security & Tenancy

:::info Source
Sourced from docs/13-security-compliance-tenancy.md in the documentation repo.
:::

Companion: 01 Enterprise Architecture · 03 Microservices · 05 API Design · 12 Data Models

This document consolidates the platform's security model, compliance posture, and the multi-layer enforcement of multi-tenant isolation. Every other doc references this one for the canonical rules.

1. Threat Model (summary)

Threat	Surface	Mitigation
Credential stuffing / spray	identity-service	argon2id, lockout, adaptive MFA, anomaly classifier
Session hijack	API gateway	short JWT TTL + rotating refresh; device binding
Cross-tenant data leak	All services	RLS + JWT scope + domain invariants
Insecure direct object reference	All services	ABAC policy on every resource access
SCORM zip RCE	content-service	Sandbox import + manifest validator + signed origin allowlist
Bundle tampering	client + content-service	JWS signature + AES-GCM + tamper report flow
Prompt injection	ai-gateway-service	Prompt-injection classifier + system prompt isolation
AI PII exfiltration	ai-gateway-service	Pre-call PII redaction; provider no-train flag
Webhook replay	webhooks (in/out)	HMAC signing + nonce + timestamp window
LMS embed CSRF	LTI	LTI 1.3 platform key validation + per-launch nonce
Storage takeover	media + content	Signed URLs scoped per object + per-caller
Supply chain	All services	SBOM + provenance attestations (SLSA) + lockfiles
Insider abuse	platform admin	Just-in-time elevation + four-eyes + audit
DoS	edge + APIs	Rate limits + WAF + geographic blocking

2. Identity & Authentication

Standards: OAuth 2.1, OIDC, WebAuthn Level 2, SAML 2.0.
Local AuthN: email + password (argon2id, m=64 MiB, t=3, p=1), magic link, WebAuthn passkeys.
MFA factors: TOTP, SMS (deprecated for sensitive scopes), WebAuthn, recovery codes.
Adaptive MFA: triggered by risk classifier (new device, atypical IP, behavior).
Sessions: access JWT (15 min), rotating refresh (30 d sliding, single-use rotation, family-revoke on detected reuse).
Device binding: every offline-capable device has a public key registered with identity-service; PlayPackage Bundles encrypted with key derived from (tenantKey, devicePubKey, bundleId).
JWT signing: asymmetric (EdDSA Ed25519) with KMS-backed keys + kid rotation; JWKS published.
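The single-use rotation and family-revoke rule for refresh tokens can be sketched as follows. This is a minimal in-memory illustration, not the identity-service implementation: `RefreshTokenStore` and its methods are hypothetical names, tokens are plain UUIDs, and the 30-day sliding expiry is left out for brevity.

```typescript
import { randomUUID } from "node:crypto";

// One record per issued refresh token (hypothetical shape).
type RefreshRecord = { familyId: string; used: boolean };

class RefreshTokenStore {
  private tokens = new Map<string, RefreshRecord>();
  private revokedFamilies = new Set<string>();

  /** Start a new token family, for example at login. */
  issue(): string {
    return this.mint(randomUUID());
  }

  /** Exchange a refresh token for its successor; returns null when rejected. */
  rotate(token: string): string | null {
    const record = this.tokens.get(token);
    if (!record || this.revokedFamilies.has(record.familyId)) return null;
    if (record.used) {
      // A single-use token was presented twice: assume theft and
      // revoke every token descended from the same login.
      this.revokedFamilies.add(record.familyId);
      return null;
    }
    record.used = true;
    return this.mint(record.familyId);
  }

  private mint(familyId: string): string {
    const token = randomUUID();
    this.tokens.set(token, { familyId, used: false });
    return token;
  }
}
```

Revoking the whole family means that after a replay neither the attacker nor the legitimate client holds a usable token, so the user has to re-authenticate.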

3. Authorization (RBAC + ABAC)

3.1 Coarse RBAC roles (system + tenant)

platform_admin, compliance_officer
org_owner, org_admin, org_manager
provider_admin, author, reviewer, publisher
learner, individual

3.2 Fine-grained ABAC predicates

Examples (serialized as expression trees in permissions.condition):

resource.tenant_id == ctx.tenant_id
resource.org_unit_id IN ctx.user.org_units
resource.visibility IN ('marketplace','public')
resource.created_by == ctx.user.id
resource.assignment.owner_id == ctx.user.id
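To make the "expression tree" idea concrete, here is a small evaluator for predicates like the ones above. The node shape (`op`, `left`, `right`, `args`) is an assumption for illustration; the actual serialization in permissions.condition is not specified in this document.

```typescript
// Hypothetical expression-tree nodes. `left` and string-valued `right` are
// dotted paths into the evaluation scope, e.g. "resource.tenant_id".
type Expr =
  | { op: "eq"; left: string; right: string }
  | { op: "in"; left: string; right: string | unknown[] }
  | { op: "and"; args: Expr[] }
  | { op: "or"; args: Expr[] };

type Scope = { resource: Record<string, unknown>; ctx: Record<string, unknown> };

// Walk a dotted path; any missing segment yields undefined.
function resolve(scope: Scope, path: string): unknown {
  return path.split(".").reduce<unknown>(
    (value, key) =>
      value !== null && typeof value === "object"
        ? (value as Record<string, unknown>)[key]
        : undefined,
    scope,
  );
}

function evaluate(expr: Expr, scope: Scope): boolean {
  switch (expr.op) {
    case "eq": {
      const left = resolve(scope, expr.left);
      // Fail closed: two missing values must not count as equal.
      return left !== undefined && left === resolve(scope, expr.right);
    }
    case "in": {
      const haystack =
        typeof expr.right === "string" ? resolve(scope, expr.right) : expr.right;
      return Array.isArray(haystack) && haystack.includes(resolve(scope, expr.left));
    }
    case "and":
      return expr.args.every((arg) => evaluate(arg, scope));
    case "or":
      return expr.args.some((arg) => evaluate(arg, scope));
  }
}

// "Same tenant AND visibility is marketplace or public".
const sameTenantAndListed: Expr = {
  op: "and",
  args: [
    { op: "eq", left: "resource.tenant_id", right: "ctx.tenant_id" },
    { op: "in", left: "resource.visibility", right: ["marketplace", "public"] },
  ],
};
```

The evaluator treats unresolved paths as a denial, so a missing `tenant_id` on either side can never satisfy an equality check.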

3.3 Decision flow

JWT presented → verified → claims loaded into RequestContext.
Route declares required resource:action.
Policy engine evaluates (role grants action) AND (ABAC predicate true).
Decision logged with decisionId (linked into AI/HITL audit chains).

Endpoint POST /api/v1/authz/check lets UIs ask before showing actions.
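The decision flow above can be sketched as a single function that combines the role check with the ABAC predicate and records a `decisionId`. The grants table, the `Decision` shape, and the in-memory `decisionLog` are illustrative assumptions, not the policy engine's actual interfaces.

```typescript
import { randomUUID } from "node:crypto";

interface RequestContext {
  userId: string;
  tenantId: string;
  roles: string[];
}

interface Resource {
  tenantId: string;
  createdBy: string;
}

interface Decision {
  decisionId: string;
  allowed: boolean;
  reason: "allowed" | "role_denied" | "predicate_denied";
}

type Predicate = (resource: Resource, ctx: RequestContext) => boolean;

// Hypothetical grants table: role -> permitted "resource:action" pairs.
const ROLE_GRANTS = new Map<string, string[]>([
  ["author", ["course:read", "course:update"]],
  ["learner", ["course:read"]],
]);

// Stand-in for the append-only audit log.
const decisionLog: Decision[] = [];

function decide(
  ctx: RequestContext,
  permission: string,
  resource: Resource,
  predicate: Predicate,
): Decision {
  const roleGrants = ctx.roles.some((role) =>
    (ROLE_GRANTS.get(role) ?? []).includes(permission),
  );
  let reason: Decision["reason"] = "allowed";
  if (!roleGrants) reason = "role_denied";
  else if (!predicate(resource, ctx)) reason = "predicate_denied";

  // Every decision, allow or deny, gets an id and is logged.
  const decision: Decision = {
    decisionId: randomUUID(),
    allowed: reason === "allowed",
    reason,
  };
  decisionLog.push(decision);
  return decision;
}

// Example predicate: same tenant and created by the caller.
const sameTenantOwner: Predicate = (resource, ctx) =>
  resource.tenantId === ctx.tenantId && resource.createdBy === ctx.userId;
```

Logging denials as well as grants keeps the audit chain complete when a `decisionId` is later referenced from an AI or HITL record.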

4. Multi-Tenant Isolation (Multi-Layer)

Layer	Enforcement
Edge (CDN)	Per-tenant domain or path; CSP nonces per tenant
Kong + services	Edge (Kong) validates JWT on protected routes; services inject `tenant_id` into `RequestContext` and reject mismatched `X-Tenant-Id` header (see ADR 0001)
Application (use cases)	Use-cases require `TenantId` parameter; cross-tenant references rejected at construction
Domain (aggregates)	`TenantId` value object on every aggregate root; invariants reject cross-tenant references
Postgres	RLS enabled on every table: `USING (tenant_id = current_setting('app.tenant_id')::uuid)`
Postgres connections	Pool wrapper sets `app.tenant_id` per request (proxy-init)
Storage (S3/R2)	Per-tenant prefix `tenants/{tid}/...`; signed URLs scoped per object + caller; bucket policy denies cross-tenant prefix access
Search (OpenSearch)	Tenant filter injected; alias-per-tenant for largest
Vectors (pgvector)	Tenant filter on every k-NN; collection partitions for largest
Caches (Redis)	Key prefixes `tenants/{tid}/...`; eviction never crosses tenants
AI Gateway	Per-tenant prompt pinning, per-tenant budgets, per-tenant cache
Sync	Cursors + mutations + conflicts scoped by tenant + user + device
Logs	`tenant_id` on every line; PII-scrubbed; per-tenant retention
Backups	Per-region per-tenant; restore test quarterly

Tenant isolation tests are mandatory in CI: every service runs a "two-tenant simulator" suite that asserts every read/write/event surface refuses cross-tenant access.
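Two of the layers above, the `X-Tenant-Id` header check and the per-request `app.tenant_id` setting that RLS reads, might look roughly like this. `Queryable` is a hypothetical stand-in for a database client, and the function names are illustrative; `set_config` is the standard Postgres function, and passing `true` as its third argument scopes the setting to the current transaction.

```typescript
// Minimal client interface (hypothetical); a real pool client would fit it.
interface Queryable {
  query(sql: string, params: unknown[]): Promise<unknown>;
}

class TenantMismatchError extends Error {}

/** The JWT claim is authoritative; a conflicting header is rejected. */
function resolveTenantId(jwtTenantId: string, headerTenantId?: string): string {
  if (headerTenantId !== undefined && headerTenantId !== jwtTenantId) {
    throw new TenantMismatchError("X-Tenant-Id does not match the token's tenant");
  }
  return jwtTenantId;
}

/** Run `work` in a transaction whose RLS policies see `tenantId`. */
async function withTenant<T>(
  client: Queryable,
  tenantId: string,
  work: (client: Queryable) => Promise<T>,
): Promise<T> {
  await client.query("BEGIN", []);
  try {
    // Transaction-local, so the value cannot leak to the next request
    // that reuses this pooled connection.
    await client.query("SELECT set_config('app.tenant_id', $1, true)", [tenantId]);
    const result = await work(client);
    await client.query("COMMIT", []);
    return result;
  } catch (err) {
    await client.query("ROLLBACK", []);
    throw err;
  }
}
```

Because the setting is bound as a parameter rather than interpolated, the tenant id never becomes part of the SQL text.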

5. Data Classification & Encryption

Class	Examples	At-rest	In-transit
Public	Marketing, public certs	TLS 1.3	TLS 1.3
Internal	Course catalog metadata	AES-256, KMS shared	TLS 1.3
Confidential	Learner progress, quiz keys	AES-256, per-tenant KMS data keys (envelope)	TLS 1.3 + mTLS internal
Restricted	Credentials, payment refs, PHI	AES-256, per-tenant KMS, restricted access	TLS 1.3 + mTLS + JIT access
Offline-bundled	PlayPackage Bundles	AES-256-GCM, per-device-derived key	TLS 1.3

Key management:

KMS-backed (HSM root); per-tenant DEK; KEK hierarchy rotated annually; emergency rotation supported.
HSM-backed signing keys for JWT, JWS provenance, certificate proof, bundle signature.
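Envelope encryption with a per-tenant data key can be illustrated with Node's built-in crypto module. In this sketch a local 32-byte buffer stands in for the tenant's KMS-held key-encryption key; in production the wrap and unwrap steps would be KMS calls. All function and type names are hypothetical.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

interface Sealed {
  iv: Buffer;
  ciphertext: Buffer;
  tag: Buffer;
}

// AES-256-GCM with a fresh 96-bit IV per call.
function seal(key: Buffer, plaintext: Buffer): Sealed {
  const iv = randomBytes(12);
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext), cipher.final()]);
  return { iv, ciphertext, tag: cipher.getAuthTag() };
}

// Throws if the key is wrong or the ciphertext/tag was modified.
function unseal(key: Buffer, sealed: Sealed): Buffer {
  const decipher = createDecipheriv("aes-256-gcm", key, sealed.iv);
  decipher.setAuthTag(sealed.tag);
  return Buffer.concat([decipher.update(sealed.ciphertext), decipher.final()]);
}

interface Envelope {
  wrappedDek: Sealed; // data key encrypted under the tenant KEK
  payload: Sealed;    // data encrypted under the data key
}

function encryptForTenant(tenantKek: Buffer, plaintext: Buffer): Envelope {
  const dek = randomBytes(32); // fresh data-encryption key
  return { wrappedDek: seal(tenantKek, dek), payload: seal(dek, plaintext) };
}

function decryptForTenant(tenantKek: Buffer, envelope: Envelope): Buffer {
  const dek = unseal(tenantKek, envelope.wrappedDek);
  return unseal(dek, envelope.payload);
}
```

Rotating a KEK then only requires re-wrapping the small `wrappedDek` values, not re-encrypting the payloads.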

6. Network & Edge

TLS 1.3 only; HSTS preload.
Strong CSP per-route; nonce-based scripts.
WAF rules: OWASP CRS + custom (LMS-specific).
Geo controls per tenant.
Anti-bot on signup + checkout.
Service mesh (mTLS) inside cluster; per-service identities for inter-service auth.

7. Application Security

OWASP ASVS L2 baseline; selected modules (auth, sessions, payments) at L3.
Input validation at every boundary using Zod (frontend) and Ajv (backend) against shared schemas.
Output encoding: React escaping + DOMPurify for any innerHTML (rare).
ORM-only DB access; no string-built SQL.
File uploads scanned (AV + content-safety) before becoming addressable.
SCORM imports validated + sandboxed (no eval in zip; manifest-driven).
Webhook signatures HMAC-SHA256 with nonce + timestamp (5-min window).
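A sketch of the webhook check (HMAC-SHA256, nonce, 5-minute timestamp window) follows. The canonical string `timestamp.nonce.body`, the hex encoding, and the in-memory nonce set are assumptions made for the example; the platform's actual wire format is not defined in this document.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

const WINDOW_MS = 5 * 60 * 1000; // 5-minute acceptance window

function signWebhook(secret: string, timestampMs: number, nonce: string, body: string): string {
  return createHmac("sha256", secret)
    .update(`${timestampMs}.${nonce}.${body}`)
    .digest("hex");
}

function verifyWebhook(
  secret: string,
  timestampMs: number,
  nonce: string,
  body: string,
  signatureHex: string,
  seenNonces: Set<string>,
  nowMs: number = Date.now(),
): boolean {
  // 1. Reject anything outside the timestamp window.
  if (Math.abs(nowMs - timestampMs) > WINDOW_MS) return false;

  // 2. Constant-time signature comparison.
  const expected = Buffer.from(signWebhook(secret, timestampMs, nonce, body), "hex");
  const given = Buffer.from(signatureHex, "hex");
  if (given.length !== expected.length || !timingSafeEqual(given, expected)) return false;

  // 3. Record the nonce only after the signature checks out, so forged
  //    requests cannot burn nonces for legitimate senders.
  if (seenNonces.has(nonce)) return false;
  seenNonces.add(nonce);
  return true;
}
```

In a multi-instance deployment the nonce set would need shared storage with an expiry at least as long as the window; the in-memory `Set` here is only for illustration.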

8. AI Safety & Governance (Mandatory; full surface in 03 ai-gateway-service)

All AI calls routed via ai-gateway-service.
Pre-call: moderation; PII redaction (configurable); prompt-injection shield (heuristic + classifier).
Routing: local → small cloud → large cloud; per-tenant budget gate.
Post-call: moderation; structured-output schema validation; refusal handling.
Provenance: every artifact carries AIProvenance (see 12).
HITL: AI-generated authoring blocks status='draft_ai' until accepted; decisionId ties acceptance into audit chain.
No training on tenant data: outbound provider configs explicitly disable; verified at integration test layer with provider-specific assertions.
Tenant-scoped embeddings: never cross-tenant; deletion follows tenant + user lifecycle.
EU AI Act: each AI capability is classified (limited / high-risk); high-risk capabilities (e.g., AI grading, AI risk-scoring of learners) require additional documentation, post-market monitoring, and explicit human override paths.
Bias monitoring: AI assignment recommendations + AI grading evaluated quarterly against demographic-parity + equalized-odds metrics on consenting sample data.
Right to explanation: UI surfaces "why this recommendation" / "why this score" using model rationale + feature attribution where available.
Refusal & dispute: users may dispute any AI decision; routes to human reviewer with SLA.
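The tiered routing with a per-tenant budget gate could be sketched as below. The document does not say what drives escalation between tiers, so the `complexity` score, the cost units, and the thresholds are purely hypothetical placeholders.

```typescript
type Tier = "local" | "small_cloud" | "large_cloud";

interface TierConfig {
  tier: Tier;
  costPerCall: number;   // hypothetical budget units
  maxComplexity: number; // hypothetical capability ceiling
}

// Ordered cheapest first: local -> small cloud -> large cloud.
const TIERS: TierConfig[] = [
  { tier: "local", costPerCall: 0, maxComplexity: 3 },
  { tier: "small_cloud", costPerCall: 1, maxComplexity: 6 },
  { tier: "large_cloud", costPerCall: 10, maxComplexity: 10 },
];

/**
 * Pick the cheapest tier able to handle the request, then apply the
 * budget gate. Returns null when the tenant cannot afford that tier
 * or no tier can handle the request.
 */
function route(complexity: number, remainingBudget: number): Tier | null {
  for (const config of TIERS) {
    if (complexity <= config.maxComplexity) {
      return config.costPerCall <= remainingBudget ? config.tier : null;
    }
  }
  return null;
}
```

This variant refuses rather than silently downgrading to a weaker tier when the budget is exhausted; whether to downgrade instead is a product decision the document leaves open.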

9. Offline License Enforcement

Every PlayPackage Bundle issued with a LicenseEnvelope (see 12) signed by tenant key.
Enforces: expiry, device binding, feature gating (AI tutor on/off offline, certificate eligibility, copy/download).
Player refuses to mount expired or revoked bundles; revocation propagates via sync within minutes once the device is online.
License envelope tampering detected via signature verification on every mount.
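The mount-time checks can be sketched as: verify the signature first, then enforce device binding, expiry, and revocation. The claim fields and function names below are hypothetical (the real LicenseEnvelope is defined in 12), and Ed25519 is assumed as the tenant signing algorithm only for the sake of the example.

```typescript
import { generateKeyPairSync, sign, verify, KeyObject } from "node:crypto";

// Hypothetical claim set; see 12 for the real LicenseEnvelope.
interface LicenseClaims {
  bundleId: string;
  deviceId: string;
  expiresAt: string; // ISO 8601
  features: string[];
}

interface LicenseEnvelope {
  payload: string; // JSON-encoded claims, exactly as signed
  signature: Buffer;
}

type MountResult =
  | { ok: true; claims: LicenseClaims }
  | { ok: false; reason: "bad_signature" | "wrong_device" | "expired" | "revoked" };

function issueLicense(claims: LicenseClaims, tenantPrivateKey: KeyObject): LicenseEnvelope {
  const payload = JSON.stringify(claims);
  // Ed25519 takes no separate digest algorithm, hence `null`.
  return { payload, signature: sign(null, Buffer.from(payload), tenantPrivateKey) };
}

function checkLicense(
  envelope: LicenseEnvelope,
  tenantPublicKey: KeyObject,
  deviceId: string,
  revokedBundleIds: Set<string>,
  now: Date = new Date(),
): MountResult {
  // Verify before parsing so unsigned data is never trusted.
  if (!verify(null, Buffer.from(envelope.payload), tenantPublicKey, envelope.signature)) {
    return { ok: false, reason: "bad_signature" };
  }
  const claims = JSON.parse(envelope.payload) as LicenseClaims;
  if (claims.deviceId !== deviceId) return { ok: false, reason: "wrong_device" };
  if (new Date(claims.expiresAt).getTime() <= now.getTime()) return { ok: false, reason: "expired" };
  if (revokedBundleIds.has(claims.bundleId)) return { ok: false, reason: "revoked" };
  return { ok: true, claims };
}

// Demo key pair; in production the private key never leaves the KMS/HSM.
const demoKeys = generateKeyPairSync("ed25519");
```

Feature gating would then read `claims.features` from a successful result, for example to decide whether the offline AI tutor may start.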

10. Tamper Detection (Offline)

Bundle SHA-256 verified against signed manifest at mount.
Failure → unmount + content.bundle.tamper_detected.v1 queued for next sync.
Repeated failures → device flagged; user offered fresh download; admin alert.
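The tamper check reduces to comparing the bundle's SHA-256 with the value in the manifest and queuing an event on mismatch. In this sketch the manifest signature is assumed to have been verified already, `BundleMounter` is a hypothetical name, and the threshold of three failures before flagging the device is an invented placeholder.

```typescript
import { createHash } from "node:crypto";

const FLAG_THRESHOLD = 3; // hypothetical: failures before the device is flagged

class BundleMounter {
  private failures = 0;
  /** Events waiting for the next sync. */
  readonly outbox: string[] = [];

  /**
   * Returns true when the bundle may be mounted. `manifestSha256Hex` must
   * come from a manifest whose signature has already been verified.
   */
  mount(bundle: Buffer, manifestSha256Hex: string): boolean {
    const actual = createHash("sha256").update(bundle).digest("hex");
    if (actual === manifestSha256Hex.toLowerCase()) return true;

    // Mismatch: refuse to mount and queue the tamper event.
    this.failures += 1;
    this.outbox.push("content.bundle.tamper_detected.v1");
    return false;
  }

  /** True once repeated failures should trigger the flag and admin alert. */
  get deviceFlagged(): boolean {
    return this.failures >= FLAG_THRESHOLD;
  }
}
```

Queuing the event locally rather than sending it immediately matches the offline case, where the device may have no connectivity when tampering is detected.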

11. Audit & Logging

Append-only audit log for: identity events, role/permission changes, data access decisions, AI calls, billing actions, GDPR requests, license grants, certificate issuance, sync conflicts, tenant data residency changes.
Daily Merkle anchoring: root hash committed to internal anchor store + emitted as audit.merkle.anchored.v1. Optionally anchored externally per tenant policy.
Tamper evidence: any change to already-anchored audit rows produces a mismatch against the anchored root, which the daily verification detects.
PII scrubbing in operational logs; full PII allowed only in dedicated audit log under restricted access.
Per-tenant export: compliance officer can export audit slice via analytics-service.
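To illustrate how a daily Merkle root makes audit changes detectable, here is a minimal root computation over a day's entries. The tree layout (SHA-256, 0x00/0x01 domain-separation prefixes, unpaired nodes promoted unchanged) is an assumption for the sketch; the document does not specify the construction used for audit.merkle.anchored.v1.

```typescript
import { createHash } from "node:crypto";

const LEAF_PREFIX = Buffer.from([0x00]);
const NODE_PREFIX = Buffer.from([0x01]);

const sha256 = (data: Buffer): Buffer => createHash("sha256").update(data).digest();

/** Hex-encoded Merkle root over the serialized audit entries, in order. */
function merkleRoot(entries: string[]): string {
  if (entries.length === 0) return sha256(Buffer.alloc(0)).toString("hex");

  // Leaves and inner nodes use different prefixes so a leaf can never be
  // reinterpreted as an inner node.
  let level: Buffer[] = entries.map((entry) =>
    sha256(Buffer.concat([LEAF_PREFIX, Buffer.from(entry, "utf8")])),
  );

  while (level.length > 1) {
    const next: Buffer[] = [];
    for (let i = 0; i < level.length; i += 2) {
      const left = level[i];
      const right = level[i + 1];
      // An unpaired node at the end is promoted to the next level as-is.
      next.push(
        right === undefined ? left : sha256(Buffer.concat([NODE_PREFIX, left, right])),
      );
    }
    level = next;
  }
  return level[0].toString("hex");
}
```

The daily verification job would recompute the root from the stored rows and compare it with the anchored value; any edited, removed, or reordered row changes the result.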

12. Compliance Posture

Standard	Status	Notes
GDPR	Required	DSR flow + lawful basis registry + DPA
SOC 2 Type II	Required	Annual audit; logging + access reviews
ISO 27001	Required	ISMS docs aligned with this spec set
HIPAA (opt-in)	Available	BAA + restricted AI providers + PHI tagging
FERPA (opt-in)	Available	Education records + parental access flow
ISO/IEC 42001 (AI MS)	Adopt	Aligns with EU AI Act
EU AI Act	Required	Risk classification + transparency obligations
WCAG 2.2 AA	Required	All tenant-facing UIs
PCI DSS	Out-of-scope (tokenized)	Card data never touches our DB
KSA / UAE PDPL	Required for region	Data residency + lawful basis
Schrems II	Required	Cloud LLM transfers via SCCs + supplementary measures

13. Data Subject Rights (GDPR / equivalents)

Access / portability: POST /api/v1/me/data-export raises gdpr.subject_request.received.v1; each service contributes; aggregator zip emailed.
Erasure: request raised; saga across services; financial/audit data may be retained under legal basis with redacted flag.
Rectification: standard profile/preference endpoints.
Objection / restriction: AI-decision opt-out and opt-down (manual review only).
Right not to be subject to automated decision-making: AI features that meet the threshold (high-risk) provide explicit human-only path.

14. Data Residency

Per-tenant region pin (us, eu, me, ap).
Cross-region replication only with explicit opt-in.
Tenant residency change is a saga (see 04).
Vector + AI cache stay in-region; cross-region embeddings forbidden.
Backups stay in-region; DR cross-region only with opt-in.

15. Operational Security

Just-in-time access: internal staff request elevated tenant access; auto-expires; logged with reason; four-eyes for restricted actions.
Bastion + audit: all production access via bastion with session recording.
Secrets: vault-backed; rotated; per-environment; never in source.
SBOM: generated per build; signed; vulnerability scan gate.
Patch SLA: critical 24h, high 7d, medium 30d.
DR drills: quarterly; full region failover annually.

16. Incident Response

24×7 on-call rotation per service.
Severity matrix (Sev1 → Sev4) with comms playbooks.
Customer-impacting incidents disclosed within contractual SLA (typically 72 h for data incidents).
Post-incident review within 5 business days; corrective actions tracked.

17. Privacy & Data Minimization

Collect only what's needed; defaults privacy-preserving.
AI features default-off pending tenant opt-in.
User-identifying telemetry fields are hashed; opt-out per user.
Cookies: strictly necessary by default; consent for analytics; granular per region.

18. Testing & Verification

Tenant isolation suite in every service.
AuthZ test matrix (role × resource × condition).
DAST + SAST in CI; targeted pen-test per release; bug-bounty program.
AI safety suite: prompt-injection corpus; PII corpus; jailbreak corpus; bias evals.
Offline integrity suite: bundle tamper, license expiry, revocation under sync.
Audit proofs: daily Merkle root verified by independent job.

19. Why

Trust is the product's substrate. Multi-tenant SaaS with AI + offline can fail in many subtle ways — cross-tenant cache hits, AI provenance loss, offline license bypass. Concentrating these mitigations into a single doc + a single set of mandates (RLS, gateway, sync protocol, audit) means we enforce them consistently across 19 services rather than re-deriving them per team.
