11 — Risks and Trade-offs

Status: populated
Last updated: 2026-04-18
Companion: 01 enterprise-architecture · 13 security-compliance-tenancy · 14 compliance-security-extended · 16 offline-first · 17 technology-stack

This document inventories the platform-level risks and the architectural trade-offs that shaped Ghasi-eHealth. The goal is not to predict the future but to keep each design decision aware of the failure modes it implies, and to name the alternative someone might later say we should have chosen.

Severity: S1 critical · S2 high · S3 medium · S4 low. Probability: P1 very likely · P2 likely · P3 possible · P4 rare.

1. Risk register (platform)

A. Architectural

ID	Risk	Domain	Sev	Prob	Mitigation	Residual	Owner
R-001	Eventual consistency confuses clinicians	cross-service	S1	P2	Optimistic UI, explicit pending state, "read-your-writes" guarantees on chart + order + billing paths	Low	platform-arch
R-002	27 services = 27 pipelines and 27 deployments	ops	S2	P1	Shared CI template, service scaffold, SLO + runbook at creation, shared libs (`@ghasi/*`)	Low-medium	platform-sre
R-003	Multi-step sagas (referral → interop → billing) fail mid-way	domain	S2	P2	Saga state machines with compensations; saga inspector UI; chaos tests inject mid-saga failures	Rare orphans surface to platform-admin queue	each domain
R-004	NATS JetStream outage cascades	infra	S2	P3	Multi-AZ replication; producer outbox; HTTP fallback drains for safety-critical flows	> 30 min outage degrades sync UX	platform-sre
R-005	Schema drift across versioned events	contracts	S3	P2	Schema registry + Pact gates; backward-compat mandatory; deprecation window	Low	platform-arch
R-006	Test data sprawl (per-tenant fixtures balloon)	QA	S3	P2	Factory libraries + synthetic data generator; scrubbed-prod snapshots for staging	Low	QA guild
R-007	Monorepo vs. split repos debate regresses	ops	S3	P3	Decision log kept in ADRs; rotate leads through release train	—	CTO
R-008	Monorepo build times exceed 30 min	ops	S3	P2	Turbo + task filters; per-service affected-only CI; remote cache	Accepted	platform-sre
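
The saga mitigation in R-003 can be pictured with a minimal sketch. Everything below (`SagaStep`, `runSaga`, the step names) is hypothetical and not the platform's actual saga library; steps are synchronous only to keep the example short. It shows completed steps being compensated in reverse order when a later step fails, and a failed compensation being reported as an orphan rather than swallowed.

```typescript
// Hypothetical saga runner; all names here are illustrative.
type SagaStep<C> = {
  name: string;
  run: (ctx: C) => void; // throws to signal failure
  compensate: (ctx: C) => void; // undoes a completed run
};

type SagaOutcome =
  | { status: "completed" }
  | { status: "compensated"; failedStep: string; orphaned: string[] };

function runSaga<C>(steps: SagaStep<C>[], ctx: C): SagaOutcome {
  const completed: SagaStep<C>[] = [];
  for (const step of steps) {
    try {
      step.run(ctx);
      completed.push(step);
    } catch {
      const orphaned: string[] = [];
      // Compensate the completed steps in reverse order.
      for (const prev of [...completed].reverse()) {
        try {
          prev.compensate(ctx);
        } catch {
          orphaned.push(prev.name); // would be routed to an admin queue
        }
      }
      return { status: "compensated", failedStep: step.name, orphaned };
    }
  }
  return { status: "completed" };
}

// Usage: billing fails after referral and interop have succeeded.
const sagaLog: string[] = [];
const sagaOutcome = runSaga<{ referralId: string }>(
  [
    { name: "referral", run: () => { sagaLog.push("referral"); }, compensate: () => { sagaLog.push("undo referral"); } },
    { name: "interop", run: () => { sagaLog.push("interop"); }, compensate: () => { sagaLog.push("undo interop"); } },
    { name: "billing", run: () => { throw new Error("billing unavailable"); }, compensate: () => {} },
  ],
  { referralId: "ref-1" },
);
```

A chaos test in the sense of R-003 would inject the failure at each step position in turn and assert that `orphaned` stays empty.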

B. Multi-tenancy and isolation

ID	Risk	Sev	Prob	Mitigation	Residual
R-010	Cross-tenant data leak through missed `tenant_id` filter	S1	P3	PostgreSQL RLS as defense in depth + mandatory `tenant_isolation.test.ts` per service	Very low
R-011	Noisy-neighbour tenant degrades others	S2	P2	Per-tenant rate limits; per-tenant AI budgets; schema-per-tenant promotion for largest tenants	Low
R-012	Platform-admin super-admin over-reach	S2	P3	All super-admin reads audited; break-glass signed; time-boxed; quarterly review	Low
R-013	Row-level-only tenancy insufficient for top-tier compliance (national-scale)	S2	P3	Per-schema promotion path already specified in 15 tenancy-decision-matrix; no API change	Low
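
As an illustration of the per-tenant rate limits named in R-011, here is a small token-bucket sketch. The class name, capacity, and refill rate are assumptions made for the example; the document does not say which algorithm the platform uses or where the limit is enforced.

```typescript
// Hypothetical per-tenant token bucket; capacity and refill rate are illustrative.
type Bucket = { tokens: number; lastMs: number };

class TenantRateLimiter {
  private readonly buckets = new Map<string, Bucket>();
  private readonly capacity: number;
  private readonly refillPerSec: number;

  constructor(capacity: number, refillPerSec: number) {
    this.capacity = capacity;
    this.refillPerSec = refillPerSec;
  }

  // The caller passes the clock in, which keeps the limiter deterministic in tests.
  tryAcquire(tenantId: string, nowMs: number): boolean {
    const bucket = this.buckets.get(tenantId) ?? { tokens: this.capacity, lastMs: nowMs };
    const elapsedSec = Math.max(0, nowMs - bucket.lastMs) / 1000;
    bucket.tokens = Math.min(this.capacity, bucket.tokens + elapsedSec * this.refillPerSec);
    bucket.lastMs = nowMs;
    const allowed = bucket.tokens >= 1;
    if (allowed) bucket.tokens -= 1;
    this.buckets.set(tenantId, bucket);
    return allowed;
  }
}

// Usage: tenant-a exhausts its burst; tenant-b is unaffected.
const limiter = new TenantRateLimiter(2, 1);
const burst = [
  limiter.tryAcquire("tenant-a", 0),
  limiter.tryAcquire("tenant-a", 0),
  limiter.tryAcquire("tenant-a", 0),
];
const neighbour = limiter.tryAcquire("tenant-b", 0);
const afterRefill = limiter.tryAcquire("tenant-a", 1000);
```

Keeping one bucket per tenant is what prevents a noisy neighbour from consuming another tenant's allowance.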

C. FHIR-first and canonicalisation

ID	Risk	Sev	Prob	Mitigation	Residual
R-020	FHIR R4 profile drift (Afghanistan national profile lags IG updates)	S2	P2	Internal profile pack versioned; conformance tests; quarterly alignment with MoPH IG	Low
R-021	Teams skip FHIR and build parallel custom schemas	S1	P2	ESLint `FHIR_FIRST_STANDARD` rule on controllers; review gate; architect sign-off on any non-FHIR write path	Low
R-022	FHIR R5 upgrade migration scope	S3	P3	Resource-by-resource roadmap; adapter layer isolates breaking changes	Accepted
R-023	Terminology churn (ICD-11 rollout) breaks downstream codings	S2	P2	terminology-service versioned; `ConceptMap` preserves historical mappings; query-time translation	Low

D. Offline-first complexity

ID	Risk	Sev	Prob	Mitigation	Residual
R-030	Sync conflict on clinically-critical aggregate (allergies, meds) causes silent data loss	S1	P3	Conflict policy is always `server-authoritative` on safety-critical aggregates; UI forces manual resolution; every resolution is audited	Very low
R-031	Offline field clinic with outdated allergy list delivers contraindicated vaccine	S1	P3	Pre-clinic sync mandatory; cached allergy lists carry a timestamp; clinical alert if last sync > 7 d	Low
R-032	Local device compromise exposes PHI cache	S2	P2	At-rest encryption (device-bound key); MAM policies; remote wipe; auto-lock 15 min	Low
R-033	Clock skew on field device silently corrupts sync order	S3	P3	Hybrid logical clock (HLC) on every event; server rejects events whose skew exceeds the 5 min tolerance	Low
R-034	Offline buffer overflow (weeks offline)	S3	P3	Size cap per device + LRU eviction of non-critical data; emergency purge UI	Accepted
R-035	Break-glass access queued offline then reconciled online across conflicting policies	S2	P3	Offline break-glass events signed by device key; server-side policy re-evaluated on replay	Low
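
The hybrid logical clock in R-033 can be sketched as follows. This follows the standard HLC update rules; the type and function names are hypothetical, and rejecting only timestamps that run ahead of the server clock is an assumption, since the register does not say in which direction skew is checked.

```typescript
// Hypothetical HLC helpers; names and the skew direction are assumptions.
type Hlc = { wallMs: number; counter: number };

const MAX_SKEW_MS = 5 * 60 * 1000; // the 5 min tolerance from R-033

// Local event or send: advance the wall component, or bump the counter on a tie.
function hlcTick(local: Hlc, nowMs: number): Hlc {
  const wallMs = Math.max(local.wallMs, nowMs);
  return { wallMs, counter: wallMs === local.wallMs ? local.counter + 1 : 0 };
}

// Receive: merge a remote timestamp, rejecting clocks too far ahead of ours.
function hlcReceive(local: Hlc, remote: Hlc, nowMs: number): Hlc {
  if (remote.wallMs - nowMs > MAX_SKEW_MS) {
    throw new Error("event rejected: device clock exceeds skew tolerance");
  }
  const wallMs = Math.max(local.wallMs, remote.wallMs, nowMs);
  let counter = 0;
  if (wallMs === local.wallMs && wallMs === remote.wallMs) {
    counter = Math.max(local.counter, remote.counter) + 1;
  } else if (wallMs === local.wallMs) {
    counter = local.counter + 1;
  } else if (wallMs === remote.wallMs) {
    counter = remote.counter + 1;
  }
  return { wallMs, counter };
}

// Total order used when replaying events: wall time first, then counter.
function hlcCompare(a: Hlc, b: Hlc): number {
  return a.wallMs !== b.wallMs ? a.wallMs - b.wallMs : a.counter - b.counter;
}
```

Because the counter breaks ties, two events stamped within the same millisecond still sort deterministically, which is what keeps a skewed device from silently reordering a sync batch.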

E. AI / clinical intelligence

ID	Risk	Sev	Prob	Mitigation	Residual
R-040	LLM hallucination inserts fabricated clinical fact	S1	P2	HITL signature mandatory; provenance on every artifact; refusal when low confidence; grounded prompts with RAG over chart	Non-zero — residual addressed via transparency and signature
R-041	Prompt injection via user-uploaded document / scanned note	S1	P2	Pre-call classifier; system-prompt isolation; structured generation; allowlist tool surface	Continuous tuning
R-042	AI cost runaway (unbounded tutor / scribe usage)	S2	P2	Per-tenant + per-feature budgets; cache by prompt-hash; circuit breakers auto-downgrade model	Low
R-043	Bias in triage / risk-stratification AI	S1	P2	Fairness evaluation per model version; parity + equalised odds on consenting cohorts; explicit human-only path	Ongoing review
R-044	Local (edge) AI quality gap vs. cloud	S2	P2	Local-first only with quality-threshold heuristic; "Local model" badge in UI; cloud refresh when online	Accepted
R-045	Cross-tenant leakage via shared vector store	S2	P3	Tenant filter on every query + schema partitioning; pen-test category for embedding isolation	Low
R-046	AI regulation uncertainty (EU AI Act analogue in Afghanistan / UAE)	S2	P3	Per-feature classification; documentation and logging for high-risk features; quarterly review	—
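
For R-042, the combination of per-tenant, per-feature budgets and a breaker that downgrades the model might look like the sketch below. The class, the tier names, and the two thresholds are hypothetical; in particular, the `blocked` tier above a hard limit is an assumption that goes beyond the "auto-downgrade" wording of the register. Amounts are integer cents to avoid floating-point drift.

```typescript
// Hypothetical budget breaker; tiers and thresholds are illustrative.
type ModelTier = "primary" | "economy" | "blocked";

class AiBudgetBreaker {
  private readonly spentCents = new Map<string, number>();
  private readonly softLimitCents: number;
  private readonly hardLimitCents: number;

  constructor(softLimitCents: number, hardLimitCents: number) {
    this.softLimitCents = softLimitCents;
    this.hardLimitCents = hardLimitCents;
  }

  // Budgets are tracked per (tenant, feature) pair.
  private key(tenantId: string, feature: string): string {
    return JSON.stringify([tenantId, feature]);
  }

  record(tenantId: string, feature: string, costCents: number): void {
    const k = this.key(tenantId, feature);
    this.spentCents.set(k, (this.spentCents.get(k) ?? 0) + costCents);
  }

  // Downgrade at the soft limit; stop serving at the hard limit.
  tierFor(tenantId: string, feature: string): ModelTier {
    const spent = this.spentCents.get(this.key(tenantId, feature)) ?? 0;
    if (spent >= this.hardLimitCents) return "blocked";
    if (spent >= this.softLimitCents) return "economy";
    return "primary";
  }
}

// Usage: the scribe feature burns through its budget; the tutor is untouched.
const breaker = new AiBudgetBreaker(800, 1000);
breaker.record("tenant-a", "scribe", 700);
const tierAt700 = breaker.tierFor("tenant-a", "scribe");
breaker.record("tenant-a", "scribe", 100);
const tierAt800 = breaker.tierFor("tenant-a", "scribe");
breaker.record("tenant-a", "scribe", 200);
const tierAt1000 = breaker.tierFor("tenant-a", "scribe");
const tutorTier = breaker.tierFor("tenant-a", "tutor");
```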

F. E-prescribing and cross-facility interop

ID	Risk	Sev	Prob	Mitigation	Residual
R-050	Cross-border e-prescribing legality (patient fills prescription in neighbouring country)	S1	P3	Jurisdiction policy engine in ghasi-eprescribing-gateway; block by default; licensed corridor allowlist	Requires legal gate per corridor
R-051	MedicationRequest ↔ MedicationDispense spine outage	S2	P3	Idempotent FHIR writes; gateway persistence; subscription replay	Low
R-052	Duplicate dispense via retry	S2	P3	`Idempotency-Key` on dispense writes; dedupe on `(tenantId, clientMutationId)`	Very low
R-053	Cross-facility identity mismatch (same patient, two MRNs)	S1	P2	MPI with NID + phone + DOB + biometric; explicit merge queue; audit every merge	Low
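
The dedupe rule in R-052 is small enough to show directly. In this sketch the `DispenseLedger` class and its record shape are hypothetical, and it assumes the `Idempotency-Key` header value is what arrives as `clientMutationId`; the register names both but does not spell out how they relate.

```typescript
// Hypothetical in-memory ledger; a real service would back this with a unique index.
type DispenseWrite = {
  tenantId: string;
  clientMutationId: string; // assumed to carry the Idempotency-Key value
  medicationRequestId: string;
  quantity: number;
};
type DispenseResult = { dispenseId: string; replayed: boolean };

class DispenseLedger {
  private readonly byKey = new Map<string, string>();
  private nextId = 1;

  write(req: DispenseWrite): DispenseResult {
    // Dedupe on (tenantId, clientMutationId); JSON avoids delimiter collisions.
    const key = JSON.stringify([req.tenantId, req.clientMutationId]);
    const existing = this.byKey.get(key);
    if (existing !== undefined) return { dispenseId: existing, replayed: true };
    const dispenseId = `disp-${this.nextId++}`;
    this.byKey.set(key, dispenseId);
    return { dispenseId, replayed: false };
  }
}

// Usage: a retried request returns the original dispense instead of creating a second one.
const ledger = new DispenseLedger();
const dispenseReq: DispenseWrite = {
  tenantId: "tenant-a",
  clientMutationId: "mut-42",
  medicationRequestId: "mr-7",
  quantity: 30,
};
const firstWrite = ledger.write(dispenseReq);
const retryWrite = ledger.write(dispenseReq);
const otherTenantWrite = ledger.write({ ...dispenseReq, tenantId: "tenant-b" });
```

Scoping the key by tenant means two tenants can reuse the same client-generated identifier without colliding.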

G. Security / compliance / privacy

ID	Risk	Sev	Prob	Mitigation	Residual
R-060	Keycloak compromise exposes all tenants	S1	P4	Realm-per-tenant; key rotation; HSM-backed signing; IdP isolation	Low
R-061	Kong DB-less config drift between environments	S2	P3	Declarative YAML in VCS; contract tests against Kong in CI	Low
R-062	Insufficient consent capture for secondary use (research, population health)	S2	P2	Consent aggregate (FHIR `Consent`) gates every read; consent policy versioned	Low
R-063	Audit write failure silently accepts transaction	S1	P3	Synchronous audit write; transaction fails 503 if audit unavailable	Very low
R-064	PHI in telemetry logs/traces	S1	P2	`@ghasi/telemetry` redaction at emit + collector re-verify + nightly scanner	Low
R-065	DSAR export misses a service's data	S2	P2	DSAR is a fan-out saga; every service implements `exportForSubject`; coverage CI test	Low
R-066	Minor / guardian delegation mis-scoped (teen access to own sensitive category)	S2	P3	Age-of-majority policy per category; jurisdiction-configurable; quarterly review	Medium
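
The emit-time redaction step in R-064 could be approximated as below. This is not the `@ghasi/telemetry` API; the function and the key denylist are hypothetical, and field-name matching only covers structured attributes, which is one reason the register pairs it with collector re-verification and a nightly scanner.

```typescript
// Hypothetical denylist of attribute names treated as PHI.
const PHI_KEYS = new Set(["patientName", "nationalId", "phone", "dob", "mrn"]);

// Recursively copy a log payload, masking any value stored under a PHI key.
function redactPhi(value: unknown): unknown {
  if (Array.isArray(value)) return value.map((item) => redactPhi(item));
  if (value !== null && typeof value === "object") {
    const source = value as Record<string, unknown>;
    const out: Record<string, unknown> = {};
    for (const key of Object.keys(source)) {
      out[key] = PHI_KEYS.has(key) ? "[REDACTED]" : redactPhi(source[key]);
    }
    return out;
  }
  return value;
}

// Usage: nested objects and arrays are both covered.
const redactedEvent = redactPhi({
  event: "chart.opened",
  facilityId: "fac-9",
  patient: { patientName: "Jane Doe", mrn: "123" },
  contacts: [{ phone: "0700000000" }],
});
```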

H. Afghanistan / regional operational

ID	Risk	Sev	Prob	Mitigation	Residual
R-070	Connectivity loss in district hospital (power + internet)	S1	P1	Full offline clinician desktop + provider mobile; UPS requirement per facility tier	Accepted
R-071	On-premise vs. cloud deployment split within one country	S2	P2	All services support on-prem + cloud; per-tenant deployment class; private cloud option	Accepted
R-072	Sanctions / vendor access restrictions	S2	P2	Multi-vendor AI router; open-source fallbacks; self-host tier	Low
R-073	Regulatory shift (MoPH data residency tightens)	S2	P3	Residency is a tenant attribute, not a codepath; in-country regions ready	Low
R-074	HMIS data quality from low-digitised facilities	S2	P1	Population-health service pulls from chart data, not manual reports; data-quality indicators visible	Medium
R-075	Paper-first handovers in emergency settings	S3	P1	Print / scan-back workflows in document-service; OCR-assisted ingestion	Accepted

I. Licensing and commercial

ID	Risk	Sev	Prob	Mitigation	Residual
R-080	Licensing boundary enforcement complexity across 27 services	S2	P2	Central licensing service; `ModuleEntitlementGuard` in `@ghasi/nestjs-common`; UI hides unlicensed module nav	Low
R-081	Unlicensed usage via direct NATS event subscription	S2	P3	Subject-level ACLs; licensing-aware NATS consumer registrar	Low
R-082	Over-license charges (mis-seeded license at onboarding)	S3	P2	License seed template per tenant class; migration log; reconciliation job nightly	Low

J. AI-gateway and platform-AI

ID	Risk	Sev	Prob	Mitigation	Residual
R-090	AI vendor lock-in via provider-specific features	S2	P2	All calls through ai-gateway-service; vendor-abstract types; multi-vendor router	Low
R-091	AI provenance loss on export	S2	P2	Domain aggregates refuse writes of AI artifacts without `aiProvenance`; export includes provenance block	Low
R-092	AI-assisted clinical decision without human attestation	S1	P3	Hard rule: no AI-only persistence of clinical facts; sign = human attestation	Very low
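
R-091 and R-092 together amount to a write-time guard. A minimal sketch follows; apart from the `aiProvenance` field named in the register, every type and field here (`aiGenerated`, `signedBy`, `assertPersistable`) is a hypothetical stand-in.

```typescript
// Hypothetical artifact shapes; only `aiProvenance` is named in the register.
type AiProvenance = { model: string; promptHash: string; signedBy?: string };
type ClinicalArtifact = {
  id: string;
  body: string;
  aiGenerated: boolean;
  aiProvenance?: AiProvenance;
};

// Refuse AI artifacts that lack provenance (R-091) or a human signature (R-092).
function assertPersistable(artifact: ClinicalArtifact): void {
  if (!artifact.aiGenerated) return;
  if (!artifact.aiProvenance) {
    throw new Error("AI artifact rejected: missing aiProvenance");
  }
  if (!artifact.aiProvenance.signedBy) {
    throw new Error("AI artifact rejected: no human attestation");
  }
}

// Helper for the usage below: returns the rejection message, or null if accepted.
function rejectionReason(artifact: ClinicalArtifact): string | null {
  try {
    assertPersistable(artifact);
    return null;
  } catch (err) {
    return err instanceof Error ? err.message : String(err);
  }
}

const unsignedDraft = rejectionReason({
  id: "note-1",
  body: "Draft summary",
  aiGenerated: true,
  aiProvenance: { model: "local-model", promptHash: "abc123" },
});
const signedNote = rejectionReason({
  id: "note-2",
  body: "Reviewed summary",
  aiGenerated: true,
  aiProvenance: { model: "local-model", promptHash: "abc123", signedBy: "clinician-17" },
});
```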

2. Trade-offs (explicit)

T-01 — Eventual vs. strong consistency

Choice: Eventual across services; strong within a service (read-your-writes inside one bounded context).
Why: Distributed transactions across 27 services are forbidden by the architecture baseline. Strong consistency per aggregate is cheap and gives clinicians the guarantees they need on chart, order, and billing.
Alt we rejected: Two-phase commit via Saga Monitor service — too much operational cost.

T-02 — FHIR-first vs. custom schemas

Choice: FHIR R4 is canonical; local tables are operational indexes + workflow state.
Why: Interoperability with MoPH, national registries, and future cross-border HIE is the product's reason to exist. Custom schemas would orphan us.
Cost: Onboarding ramp is steeper; some domains (billing) are verbose in FHIR.

T-03 — Row-level vs. schema-per-tenant vs. DB-per-tenant

Choice: Row-level by default + per-schema promotion path for largest tenants.
Why: Row-level scales operationally; per-schema isolation is available without an API change for isolation-critical tenants. DB-per-tenant is the nuclear option reserved for sovereign deployments (MoPH-only instance).
Cost: RLS correctness must be tested on every service.

T-04 — Monorepo vs. split repos

Choice: Monorepo with Turbo; one repo per logical platform product (eHealth, edTech).
Why: Shared libraries (@ghasi/*), unified standards, atomic cross-service changes. One repo per service (27 repos) would multiply CI/CD overhead roughly 27×.
Cost: Build times; dependency upgrade coordination.

T-05 — On-premise vs. cloud

Choice: Both — every service is packagable for both. Tenant class determines deployment.
Why: Afghanistan reference deployments span national cloud, private MoPH DC, and facility-level on-prem.
Cost: Ops team must keep two deployment modes warm; IaC modules maintained for both.

T-06 — NestJS + Node vs. polyglot

Choice: Single stack — NestJS 11 / Node 22 / TypeScript 5.x for all services.
Why: Team cohesion; shared @ghasi/* libs; one hiring pipeline; one tooling chain.
Cost: Node is not optimal for CPU-bound work (imaging, analytics). Mitigation: offload to worker services when needed.

T-07 — Kong DB-less edge

Choice: Kong in declarative YAML mode as the sole HTTP edge.
Why: Simpler config, no Kong DB to operate, Git-reviewable routes.
Cost: Runtime plugin admin not available; all changes are code-review driven.

T-08 — Realm vs. SQLite vs. Dexie for offline

Choice: Per-surface optimum — Realm on mobile, SQLite (better-sqlite3) on desktop Electron, Dexie/IndexedDB on web. Same sync protocol across all three.
Why: Each local store fits its platform's performance profile and native APIs.
Cost: Three adapter implementations; covered by one contract test suite.
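
One way to read "three adapters, one contract test suite" is a shared set of checks that each adapter must pass. The sketch below uses a hypothetical `LocalStore` interface and an in-memory stand-in rather than Realm, better-sqlite3, or Dexie, and it is synchronous for brevity even though a real adapter interface would likely be asynchronous.

```typescript
// Hypothetical port that each local-store adapter would implement.
interface LocalStore {
  put(key: string, value: string): void;
  get(key: string): string | undefined;
  pendingKeys(): string[]; // keys written since they were last marked synced
  markSynced(keys: string[]): void;
}

// In-memory stand-in used here in place of a real adapter.
class InMemoryStore implements LocalStore {
  private readonly rows = new Map<string, string>();
  private readonly pending = new Set<string>();

  put(key: string, value: string): void {
    this.rows.set(key, value);
    this.pending.add(key);
  }
  get(key: string): string | undefined {
    return this.rows.get(key);
  }
  pendingKeys(): string[] {
    return Array.from(this.pending).sort();
  }
  markSynced(keys: string[]): void {
    for (const key of keys) this.pending.delete(key);
  }
}

// The shared contract: returns a list of failed expectations (empty means pass).
function runStoreContract(name: string, makeStore: () => LocalStore): string[] {
  const failures: string[] = [];
  const expect = (label: string, ok: boolean): void => {
    if (!ok) failures.push(`${name}: ${label}`);
  };

  const store = makeStore();
  store.put("allergy/1", "penicillin");
  expect("get returns last put", store.get("allergy/1") === "penicillin");
  expect("missing key is undefined", store.get("missing") === undefined);
  expect("write is pending until synced", store.pendingKeys().join() === "allergy/1");
  store.markSynced(["allergy/1"]);
  expect("synced key leaves the queue", store.pendingKeys().length === 0);
  expect("synced value stays readable", store.get("allergy/1") === "penicillin");
  store.put("allergy/1", "latex");
  expect("rewrite re-queues the key", store.pendingKeys().join() === "allergy/1");
  return failures;
}

const memoryFailures = runStoreContract("in-memory", () => new InMemoryStore());
```

Each real adapter would be passed to `runStoreContract` through its own factory, so a behavioural difference between platforms shows up as a named failure rather than as a sync bug in the field.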

T-09 — Keycloak realm-per-tenant vs. single-realm-multi-tenant

Choice: Realm-per-tenant.
Why: IdP isolation, per-tenant policies, jurisdiction-specific federation.
Cost: Realm management scales with tenants; admin automation required.

T-10 — AI default on vs. off

Choice: Off by default per tenant; explicit opt-in per feature per facility.
Why: Clinical trust, predictable cost, regulatory posture.
Cost: AI-driven efficiency gains delayed by onboarding friction.

T-11 — Local AI vs. cloud AI

Choice: Both, behind one port (ai-gateway-service). Local-first only when offline or for low-stakes tasks.
Why: Offline-first demands local; clinical quality demands cloud.
Cost: Dual evaluation suites; provenance must treat both uniformly.

T-12 — GraphQL vs. REST

Choice: REST + FHIR-REST; GraphQL not pursued at the platform edge.
Why: FHIR canonical shape does not benefit from GraphQL; REST tooling is strong.
Cost: Patient-portal composite queries require BFF composition work.

T-13 — pgvector vs. external vector DB

Choice: pgvector inside ai-gateway-service DB.
Why: Tenant isolation via RLS is symmetric with the rest of Postgres; one operational model.
Cost: Scale ceiling is a future problem; abstracted behind VectorIndex port.
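
The `VectorIndex` port mentioned above might have a shape like the following. The method signatures are assumptions, and the in-memory implementation stands in for pgvector purely to show that the tenant filter belongs inside the port, so that swapping the backend later cannot drop it.

```typescript
// Hypothetical port shape; the real VectorIndex signature is not given in this document.
type VectorHit = { id: string; score: number };

interface VectorIndex {
  upsert(tenantId: string, id: string, embedding: number[]): void;
  query(tenantId: string, embedding: number[], k: number): VectorHit[];
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  const n = Math.min(a.length, b.length);
  for (let i = 0; i < n; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// In-memory stand-in for a pgvector-backed implementation.
class InMemoryVectorIndex implements VectorIndex {
  private rows: { tenantId: string; id: string; embedding: number[] }[] = [];

  upsert(tenantId: string, id: string, embedding: number[]): void {
    this.rows = this.rows.filter((r) => !(r.tenantId === tenantId && r.id === id));
    this.rows.push({ tenantId, id, embedding });
  }

  query(tenantId: string, embedding: number[], k: number): VectorHit[] {
    return this.rows
      .filter((r) => r.tenantId === tenantId) // tenant filter applied on every query
      .map((r) => ({ id: r.id, score: cosineSimilarity(r.embedding, embedding) }))
      .sort((x, y) => y.score - x.score)
      .slice(0, k);
  }
}

// Usage: tenant-b holds an identical vector, yet never appears in tenant-a results.
const vectorIndex = new InMemoryVectorIndex();
vectorIndex.upsert("tenant-a", "a1", [1, 0]);
vectorIndex.upsert("tenant-a", "a2", [0, 1]);
vectorIndex.upsert("tenant-b", "b1", [1, 0]);
const tenantAHits = vectorIndex.query("tenant-a", [1, 0], 5);
```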

T-14 — Yjs CRDT vs. server-authoritative for concurrent authoring

Choice: Server-authoritative for clinical documentation (note editing is typically one clinician at a time); no CRDT.
Why: Clinical liability model is per-author; attestation is per-author; CRDT adds complexity without clinical benefit.
Cost: Second-clinician "observer" edits are merged only via explicit amendment flow.

T-15 — Service count (27)

Choice: 27 services aligned to bounded contexts.
Why: Team ownership, release independence, licensing granularity.
Cost: Operational overhead; dedicated SRE posture; shared CI template mandatory.

T-16 — Audit synchrony

Choice: Audit write is synchronous — request fails 503 if audit-service is down.
Why: Audit is the one cross-service path where we are not willing to be eventually consistent. Safety > availability for audit.
Cost: Audit-service is on the critical path for every PHI write; SRE posture reflects this.
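
The fail-closed behaviour can be sketched in a few lines. The names below (`AuditSink`, `writeWithAudit`) are hypothetical, the calls are synchronous only to keep the example short, and writing the audit entry before committing is an assumption; the document states only that the request fails with 503 when the audit write cannot be made.

```typescript
// Hypothetical shapes for the sketch.
type AuditEntry = { actor: string; action: string; resource: string };
type AuditSink = { append(entry: AuditEntry): void }; // throws when unavailable
type WriteResponse<T> = { status: 200; body: T } | { status: 503; error: string };

// Fail closed: if the audit entry cannot be written, the PHI write never runs.
function writeWithAudit<T>(audit: AuditSink, entry: AuditEntry, commit: () => T): WriteResponse<T> {
  try {
    audit.append(entry);
  } catch {
    return { status: 503, error: "audit-service unavailable; write rejected" };
  }
  return { status: 200, body: commit() };
}

// Usage: the same write succeeds against a healthy sink and is refused against a dead one.
const auditTrail: AuditEntry[] = [];
const healthySink: AuditSink = { append: (e) => { auditTrail.push(e); } };
const downSink: AuditSink = { append: () => { throw new Error("connection refused"); } };

let commitCount = 0;
const auditEntry: AuditEntry = { actor: "clinician-17", action: "update", resource: "AllergyIntolerance/1" };
const okResponse = writeWithAudit(healthySink, auditEntry, () => ++commitCount);
const refusedResponse = writeWithAudit(downSink, auditEntry, () => ++commitCount);
```

The cost noted above follows directly: every PHI write now has the audit sink's availability as an upper bound on its own.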

T-17 — Print-first clinical artifacts

Choice: Every clinical document (allergy banner, medication list, discharge summary) has a print stylesheet at design-token parity.
Why: Paper handovers are still common in Afghanistan reference clinics; printout must remain accurate and legible in LTR+RTL.
Cost: Print stylesheet test matrix expands QA scope.

3. Trade-off hierarchy (when two principles conflict)

Patient safety > everything.
Audit integrity > feature velocity.
Tenant isolation > operational convenience.
FHIR canonical > per-team ergonomics.
Offline correctness > online responsiveness (when they conflict).
Explicit AI > autonomous AI.
Immutability of clinical artifacts > storage cost.

4. Watchlist (quarterly review)

Cross-tenant test failures — target zero.
AI cost burn vs. clinical adoption curve.
Sync conflict rate on clinical aggregates.
Offline bundle tamper reports.
Break-glass invocation rate per facility.
HMIS data-quality indicators.
Licensing mis-seed reconciliation rate.
DSAR fulfilment SLA.
Audit-write failure rate.
E-prescribing cross-facility fill rate + dispute rate.

5. Governance

This document is versioned; each quarterly review produces a PR with updated severities/probabilities.
New features that introduce a risk rated more severe than S2 require a mitigation plan in the PR description.
ADRs under docs/architecture/ carry the narrative for individual decisions; this doc is the summary register.

6. Open questions

Whether sovereign deployment (single-tenant MoPH instance) warrants a separate tier or is simply a special-case of the per-schema promotion path.
Long-term AI moderation vendor strategy for Dari + Pashto clinical content — currently no mature safety classifier for these languages.
