11 — Risks and Trade-offs

Status: populated
Last updated: 2026-04-18
Companion: 01 enterprise-architecture · 13 security-compliance-tenancy · 14 compliance-security-extended · 16 offline-first · 17 technology-stack

This document inventories the platform-level risks and the architectural trade-offs that shaped Ghasi-eHealth. The goal is not to predict the future but to keep each design decision aware of the failure modes it implies, and to name the alternative someone might later say we should have chosen.

Severity: S1 critical · S2 high · S3 medium · S4 low. Probability: P1 very likely · P2 likely · P3 possible · P4 rare.
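
One way to make the two scales machine-sortable is to encode them as types and order entries by severity first, then probability. This is only an illustrative sketch: the `RiskEntry` shape and the `byUrgency` ordering are assumptions, not something the register itself prescribes.

```typescript
// Severity and probability codes as defined above (1 = most severe / most likely).
type Severity = "S1" | "S2" | "S3" | "S4";
type Probability = "P1" | "P2" | "P3" | "P4";

// Hypothetical shape for one register row; only the fields needed for sorting.
interface RiskEntry {
  id: string;
  sev: Severity;
  prob: Probability;
}

// The numeric rank is the digit in the code, so a lower rank is more urgent.
const rank = (code: Severity | Probability): number => Number(code.slice(1));

// Assumed ordering: severity first, probability as the tie-breaker.
function byUrgency(a: RiskEntry, b: RiskEntry): number {
  return rank(a.sev) - rank(b.sev) || rank(a.prob) - rank(b.prob);
}

const sample: RiskEntry[] = [
  { id: "R-008", sev: "S3", prob: "P2" },
  { id: "R-070", sev: "S1", prob: "P1" },
  { id: "R-001", sev: "S1", prob: "P2" },
];
const ordered = [...sample].sort(byUrgency).map((r) => r.id);
```

With the three sample rows taken from the register, `ordered` lists R-070 before R-001 before R-008.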

1. Risk register (platform)

A. Architectural

| ID | Risk | Domain | Sev | Prob | Mitigation | Residual | Owner |
|---|---|---|---|---|---|---|---|
| R-001 | Eventual consistency confuses clinicians | cross-service | S1 | P2 | Optimistic UI, explicit pending state, "read-your-writes" guarantees on chart + order + billing paths | Low | platform-arch |
| R-002 | 27 services = 27 pipelines and 27 deployments | ops | S2 | P1 | Shared CI template, service scaffold, SLO + runbook at creation, shared libs (@ghasi/*) | Low-medium | platform-sre |
| R-003 | Multi-step sagas (referral → interop → billing) fail mid-way | domain | S2 | P2 | Saga state machines with compensations; saga inspector UI; chaos tests inject mid-saga fail | Rare orphans surface to platform-admin queue | each domain |
| R-004 | NATS JetStream outage cascades | infra | S2 | P3 | Multi-AZ replication; producer outbox; HTTP fallback drains for safety-critical flows | > 30 min outage degrades sync UX | platform-sre |
| R-005 | Schema drift across versioned events | contracts | S3 | P2 | Schema registry + Pact gates; backward-compat mandatory; deprecation window | Low | platform-arch |
| R-006 | Test data sprawl (per-tenant fixtures balloon) | QA | S3 | P2 | Factory libraries + synthetic data generator; scrubbed-prod snapshots for staging | Low | QA guild |
| R-007 | Monorepo vs. split repos debate regresses | ops | S3 | P3 | Decision log kept in ADRs; rotate leads through release train | | CTO |
| R-008 | Monolithic monorepo build times exceed 30 min | ops | S3 | P2 | Turbo + task filters; per-service affected-only CI; remote cache | Accepted | platform-sre |
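
The "saga state machines with compensations" mitigation for R-003 can be pictured with a small runner that undoes completed steps in reverse order when a later step fails. This is a minimal sketch under stated assumptions: `SagaStep`, `SagaResult` and `runSaga` are hypothetical names, and steps are synchronous here for brevity, whereas real saga steps would be asynchronous and persisted.

```typescript
// Hypothetical saga step: a forward action plus the compensation that undoes it.
interface SagaStep {
  name: string;
  run: () => void;
  compensate: () => void;
}

type SagaResult =
  | { status: "completed" }
  | { status: "compensated"; failedStep: string }
  | { status: "orphaned"; failedStep: string; uncompensated: string[] };

function runSaga(steps: SagaStep[]): SagaResult {
  const done: SagaStep[] = [];
  for (const step of steps) {
    try {
      step.run();
      done.push(step);
    } catch {
      // Undo completed steps in reverse order; remember any that cannot be undone.
      const uncompensated: string[] = [];
      for (const prev of [...done].reverse()) {
        try {
          prev.compensate();
        } catch {
          uncompensated.push(prev.name);
        }
      }
      return uncompensated.length === 0
        ? { status: "compensated", failedStep: step.name }
        : { status: "orphaned", failedStep: step.name, uncompensated };
    }
  }
  return { status: "completed" };
}

// Example: the interop step fails, so the referral step is compensated
// and the billing step never runs.
const sagaLog: string[] = [];
const sagaResult = runSaga([
  { name: "referral", run: () => { sagaLog.push("referral"); }, compensate: () => { sagaLog.push("undo referral"); } },
  { name: "interop", run: () => { throw new Error("interop unavailable"); }, compensate: () => {} },
  { name: "billing", run: () => { sagaLog.push("billing"); }, compensate: () => { sagaLog.push("undo billing"); } },
]);
```

The `orphaned` outcome corresponds to the residual in the register: a saga whose compensation also failed is the kind of case that would surface in the platform-admin queue.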

B. Multi-tenancy and isolation

| ID | Risk | Sev | Prob | Mitigation | Residual |
|---|---|---|---|---|---|
| R-010 | Cross-tenant data leak through missed tenant_id filter | S1 | P3 | PostgreSQL RLS as defense in depth + mandatory tenant_isolation.test.ts per service | Very low |
| R-011 | Noisy-neighbour tenant degrades others | S2 | P2 | Per-tenant rate limits; per-tenant AI budgets; schema-per-tenant promotion for largest tenants | Low |
| R-012 | Platform-admin super-admin over-reach | S2 | P3 | All super-admin reads audited; break-glass signed; time-boxed; quarterly review | Low |
| R-013 | Row-level-only tenancy insufficient for top-tier compliance (national-scale) | S2 | P3 | Per-schema promotion path already specified in 15 tenancy-decision-matrix; no API change | Low |
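
The per-tenant rate limits listed for R-011 could take many forms; a token bucket keyed by tenant is one common choice. The sketch below is an assumption rather than the platform's actual limiter: `TenantRateLimiter`, its capacity and its refill rate are hypothetical, and the clock is passed in explicitly so the behaviour is easy to test.

```typescript
// Token bucket per tenant: each tenant gets its own budget, so a noisy
// tenant exhausts only its own tokens.
class TenantRateLimiter {
  private readonly buckets = new Map<string, { tokens: number; updatedAt: number }>();
  private readonly capacity: number;
  private readonly refillPerSec: number;

  constructor(capacity: number, refillPerSec: number) {
    this.capacity = capacity;
    this.refillPerSec = refillPerSec;
  }

  // Returns true if the request is allowed at time nowMs (milliseconds).
  tryAcquire(tenantId: string, nowMs: number): boolean {
    const bucket = this.buckets.get(tenantId) ?? { tokens: this.capacity, updatedAt: nowMs };
    const elapsedSec = Math.max(0, (nowMs - bucket.updatedAt) / 1000);
    // Refill proportionally to elapsed time, never above capacity.
    bucket.tokens = Math.min(this.capacity, bucket.tokens + elapsedSec * this.refillPerSec);
    bucket.updatedAt = nowMs;
    const allowed = bucket.tokens >= 1;
    if (allowed) bucket.tokens -= 1;
    this.buckets.set(tenantId, bucket);
    return allowed;
  }
}

// Two requests of burst, one token refilled per second (illustrative numbers).
const limiter = new TenantRateLimiter(2, 1);
const burst = [
  limiter.tryAcquire("tenant-a", 0),
  limiter.tryAcquire("tenant-a", 0),
  limiter.tryAcquire("tenant-a", 0), // third call in the same instant is refused
];
const otherTenant = limiter.tryAcquire("tenant-b", 0); // unaffected by tenant-a
```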

C. FHIR-first and canonicalisation

| ID | Risk | Sev | Prob | Mitigation | Residual |
|---|---|---|---|---|---|
| R-020 | FHIR R4 profile drift (Afghanistan national profile lags IG updates) | S2 | P2 | Internal profile pack versioned; conformance tests; quarterly alignment with MoPH IG | Low |
| R-021 | Teams skip FHIR and build parallel custom schemas | S1 | P2 | ESLint FHIR_FIRST_STANDARD rule on controllers; review gate; architect sign-off on any non-FHIR write path | Low |
| R-022 | FHIR R5 upgrade migration scope | S3 | P3 | Resource-by-resource roadmap; adapter layer isolates breaking changes | Accepted |
| R-023 | Terminology churn (ICD-11 rollout) breaks downstream codings | S2 | P2 | terminology-service versioned; ConceptMap preserves historical mappings; query-time translation | Low |

D. Offline-first complexity

| ID | Risk | Sev | Prob | Mitigation | Residual |
|---|---|---|---|---|---|
| R-030 | Sync conflict on clinically-critical aggregate (allergies, meds) causes silent data loss | S1 | P3 | Conflict policy is always server-authoritative on safety-critical aggregates; UI forces manual resolve; audit on every resolve | Very low |
| R-031 | Offline field clinic with outdated allergy list delivers contraindicated vaccine | S1 | P3 | Pre-clinic sync mandatory; cached allergies carry a timestamp; clinical alert if last sync > 7 d | Low |
| R-032 | Local device compromise exposes PHI cache | S2 | P2 | At-rest encryption (device-bound key); MAM policies; remote wipe; auto-lock 15 min | Low |
| R-033 | Clock skew on field device silently corrupts sync order | S3 | P3 | Hybrid logical clock (HLC) on every event; server rejects events skewed > 5 min (beyond tolerance) | Low |
| R-034 | Offline buffer overflow (weeks offline) | S3 | P3 | Size cap per device + LRU eviction of non-critical data; emergency purge UI | Accepted |
| R-035 | Break-glass access queued offline then reconciled online across conflicting policies | S2 | P3 | Offline break-glass events signed by device key; server-side policy re-evaluated on replay | Low |
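
The hybrid logical clock named under R-033 pairs a wall-clock reading with a counter so that events stay ordered even when device clocks disagree. The following is a minimal sketch of the standard HLC update rules, not the platform's sync code: the `Hlc` shape and function names are hypothetical, and applying the 5 min tolerance only to timestamps ahead of the server clock is an assumption.

```typescript
// Hypothetical HLC timestamp: wall-clock milliseconds plus a logical counter.
interface Hlc {
  wallMs: number;
  counter: number;
}

const MAX_SKEW_MS = 5 * 60 * 1000; // the 5 min tolerance from the register

// Local event: advance to physical time if it moved forward, else bump the counter.
function tick(local: Hlc, nowMs: number): Hlc {
  const wallMs = Math.max(local.wallMs, nowMs);
  return { wallMs, counter: wallMs === local.wallMs ? local.counter + 1 : 0 };
}

// Receiving a remote event: reject excessive skew, then merge the two clocks.
function receive(local: Hlc, remote: Hlc, nowMs: number): Hlc {
  if (remote.wallMs - nowMs > MAX_SKEW_MS) {
    throw new Error("remote clock skew exceeds tolerance");
  }
  const wallMs = Math.max(local.wallMs, remote.wallMs, nowMs);
  let counter = 0;
  if (wallMs === local.wallMs && wallMs === remote.wallMs) {
    counter = Math.max(local.counter, remote.counter) + 1;
  } else if (wallMs === local.wallMs) {
    counter = local.counter + 1;
  } else if (wallMs === remote.wallMs) {
    counter = remote.counter + 1;
  }
  return { wallMs, counter };
}

// Total order used when replaying events: wall time first, counter second.
function compareHlc(a: Hlc, b: Hlc): number {
  return a.wallMs - b.wallMs || a.counter - b.counter;
}
```

Because a timestamp never moves backwards under `tick` or `receive`, a device whose physical clock jumps back still produces events that sort after its earlier ones.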

E. AI / clinical intelligence

| ID | Risk | Sev | Prob | Mitigation | Residual |
|---|---|---|---|---|---|
| R-040 | LLM hallucination inserts fabricated clinical fact | S1 | P2 | HITL signature mandatory; provenance on every artifact; refusal when low confidence; grounded prompts with RAG over chart | Non-zero; residual addressed via transparency and signature |
| R-041 | Prompt injection via user-uploaded document / scanned note | S1 | P2 | Pre-call classifier; system-prompt isolation; structured generation; allowlist tool surface | Continuous tuning |
| R-042 | AI cost runaway (unbounded tutor / scribe usage) | S2 | P2 | Per-tenant + per-feature budgets; cache by prompt-hash; circuit breakers auto-downgrade model | Low |
| R-043 | Bias in triage / risk-stratification AI | S1 | P2 | Fairness evaluation per model version; parity + equalised odds on consenting cohorts; explicit human-only path | Ongoing review |
| R-044 | Local (edge) AI quality gap vs. cloud | S2 | P2 | Local-first only with quality-threshold heuristic; "Local model" badge in UI; cloud refresh when online | Accepted |
| R-045 | Cross-tenant leakage via shared vector store | S2 | P3 | Tenant filter on every query + schema partitioning; pen-test category for embedding isolation | Low |
| R-046 | AI regulation uncertainty (EU AI Act analogue in Afghanistan / UAE) | S2 | P3 | Per-feature classification; documentation and logging for high-risk features; quarterly review | |
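
The budget and circuit-breaker mitigation for R-042 can be sketched as a lookup keyed by tenant and feature that picks a model tier from the share of budget already spent. Everything named here is hypothetical: the `AiBudgets` class, the tier names, and the 80% downgrade threshold are illustrative assumptions, not documented platform values.

```typescript
type ModelTier = "premium" | "economy" | "blocked";

interface Budget {
  limitUsd: number;
  spentUsd: number;
}

// Assumed threshold: downgrade the model once 80% of the budget is used.
const DOWNGRADE_AT = 0.8;

class AiBudgets {
  private readonly budgets = new Map<string, Budget>();

  private key(tenantId: string, feature: string): string {
    return `${tenantId}::${feature}`;
  }

  setLimit(tenantId: string, feature: string, limitUsd: number): void {
    this.budgets.set(this.key(tenantId, feature), { limitUsd, spentUsd: 0 });
  }

  charge(tenantId: string, feature: string, usd: number): void {
    const budget = this.budgets.get(this.key(tenantId, feature));
    if (budget) budget.spentUsd += usd;
  }

  // No configured budget means the feature is not enabled, so calls are blocked.
  tierFor(tenantId: string, feature: string): ModelTier {
    const budget = this.budgets.get(this.key(tenantId, feature));
    if (!budget || budget.limitUsd <= 0) return "blocked";
    const used = budget.spentUsd / budget.limitUsd;
    if (used >= 1) return "blocked";
    return used >= DOWNGRADE_AT ? "economy" : "premium";
  }
}

const budgets = new AiBudgets();
budgets.setLimit("tenant-a", "scribe", 100);
const tierAtStart = budgets.tierFor("tenant-a", "scribe");
budgets.charge("tenant-a", "scribe", 80);
const tierAfterHeavyUse = budgets.tierFor("tenant-a", "scribe");
```

Blocking when no budget exists is a deliberate choice in this sketch, in line with AI being off by default per tenant.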

F. E-prescribing and cross-facility interop

| ID | Risk | Sev | Prob | Mitigation | Residual |
|---|---|---|---|---|---|
| R-050 | Cross-border e-prescribing legality (patient fills prescription in neighbouring country) | S1 | P3 | Jurisdiction policy engine in ghasi-eprescribing-gateway; block by default; licensed corridor allowlist | Requires legal gate per corridor |
| R-051 | MedicationRequest ↔ MedicationDispense spine outage | S2 | P3 | Idempotent FHIR writes; gateway persistence; subscription replay | Low |
| R-052 | Duplicate dispense via retry | S2 | P3 | Idempotency-Key on dispense writes; dedupe on (tenantId, clientMutationId) | Very low |
| R-053 | Cross-facility identity mismatch (same patient, two MRNs) | S1 | P2 | MPI with NID + phone + DOB + biometric; explicit merge queue; audit every merge | Low |
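
The dedupe rule for R-052 keys each dispense write on the pair (tenantId, clientMutationId), so a retried request returns the original record instead of creating a second one. Below is a minimal in-memory sketch of that idea; `DispenseStore` and the record shapes are hypothetical, and a real service would presumably enforce the same key with a unique constraint in its database.

```typescript
interface DispenseRequest {
  tenantId: string;
  clientMutationId: string;
  medicationRequestId: string;
  quantity: number;
}

interface DispenseRecord {
  id: string;
  request: DispenseRequest;
}

class DispenseStore {
  private readonly byKey = new Map<string, DispenseRecord>();
  private nextId = 1;

  // A retry with the same (tenantId, clientMutationId) replays the stored record.
  write(request: DispenseRequest): { record: DispenseRecord; replayed: boolean } {
    // JSON-encode the pair so that no choice of ids can collide across tenants.
    const key = JSON.stringify([request.tenantId, request.clientMutationId]);
    const existing = this.byKey.get(key);
    if (existing) return { record: existing, replayed: true };
    const record: DispenseRecord = { id: `dispense-${this.nextId++}`, request };
    this.byKey.set(key, record);
    return { record, replayed: false };
  }
}

const store = new DispenseStore();
const dispense: DispenseRequest = {
  tenantId: "tenant-a",
  clientMutationId: "m-1",
  medicationRequestId: "rx-42",
  quantity: 30,
};
const first = store.write(dispense);
const retry = store.write(dispense); // same key: no second dispense is created
```

The same client mutation id used by a different tenant is treated as a separate write, because the tenant is part of the key.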

G. Security / compliance / privacy

| ID | Risk | Sev | Prob | Mitigation | Residual |
|---|---|---|---|---|---|
| R-060 | Keycloak compromise exposes all tenants | S1 | P4 | Realm-per-tenant; key rotation; HSM-backed signing; IdP isolation | Low |
| R-061 | Kong DB-less config drift between environments | S2 | P3 | Declarative YAML in VCS; contract tests against Kong in CI | Low |
| R-062 | Insufficient consent capture for secondary use (research, population health) | S2 | P2 | Consent aggregate (FHIR Consent) gates every read; consent policy versioned | Low |
| R-063 | Audit write failure silently accepts transaction | S1 | P3 | Synchronous audit write; transaction fails 503 if audit unavailable | Very low |
| R-064 | PHI in telemetry logs/traces | S1 | P2 | @ghasi/telemetry redaction at emit + collector re-verify + nightly scanner | Low |
| R-065 | DSAR export misses a service's data | S2 | P2 | DSAR is a fan-out saga; every service implements exportForSubject; coverage CI test | Low |
| R-066 | Minor / guardian delegation mis-scoped (teen access to own sensitive category) | S2 | P3 | Age-of-majority policy per category; jurisdiction-configurable; quarterly review | Medium |
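
Redaction at emit time (R-064) can be illustrated with a recursive pass that masks known PHI fields before a log record leaves the process. This is a sketch only: it does not show the real `@ghasi/telemetry` API, and the `PHI_KEYS` list and `redact` function are hypothetical.

```typescript
// Hypothetical deny-list of attribute names that must never reach telemetry.
const PHI_KEYS = new Set(["patientName", "nid", "phone", "dob"]);

type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

// Returns a copy of the value with every PHI-keyed field masked, at any depth.
function redact(value: Json): Json {
  if (Array.isArray(value)) return value.map((item) => redact(item));
  if (value !== null && typeof value === "object") {
    const out: { [key: string]: Json } = {};
    for (const [key, inner] of Object.entries(value)) {
      out[key] = PHI_KEYS.has(key) ? "[REDACTED]" : redact(inner);
    }
    return out;
  }
  return value;
}

const logRecord: Json = {
  msg: "chart opened",
  tenantId: "tenant-a",
  patient: { patientName: "Example Name", nid: "0000", visits: [{ phone: "000", ward: "ER" }] },
};
const safeRecord = redact(logRecord);
```

A key-based deny-list cannot catch PHI embedded in free-text messages, which is presumably why the register pairs it with a collector re-verify step and a nightly scanner.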

H. Afghanistan / regional operational

| ID | Risk | Sev | Prob | Mitigation | Residual |
|---|---|---|---|---|---|
| R-070 | Connectivity loss in district hospital (power + internet) | S1 | P1 | Full offline clinician desktop + provider mobile; UPS requirement per facility tier | Accepted |
| R-071 | On-premise vs. cloud deployment split within one country | S2 | P2 | All services support on-prem + cloud; per-tenant deployment class; private cloud option | Accepted |
| R-072 | Sanctions / vendor access restrictions | S2 | P2 | Multi-vendor AI router; open-source fallbacks; self-host tier | Low |
| R-073 | Regulatory shift (MoPH data residency tightens) | S2 | P3 | Residency is a tenant attribute, not a codepath; in-country regions ready | Low |
| R-074 | HMIS data quality from low-digitised facilities | S2 | P1 | Population-health service pulls from chart data, not manual reports; data-quality indicators visible | Medium |
| R-075 | Paper-first handovers in emergency settings | S3 | P1 | Print / scan-back workflows in document-service; OCR-assisted ingestion | Accepted |

I. Licensing and commercial

| ID | Risk | Sev | Prob | Mitigation | Residual |
|---|---|---|---|---|---|
| R-080 | Licensing boundary enforcement complexity across 27 services | S2 | P2 | Central licensing service; ModuleEntitlementGuard in @ghasi/nestjs-common; UI hides unlicensed module nav | Low |
| R-081 | Unlicensed usage via direct NATS event subscription | S2 | P3 | Subject-level ACLs; licensing-aware NATS consumer registrar | Low |
| R-082 | Over-license charges (mis-seeded license at onboarding) | S3 | P2 | License seed template per tenant class; migration log; reconciliation job nightly | Low |

J. AI-gateway and platform-AI

| ID | Risk | Sev | Prob | Mitigation | Residual |
|---|---|---|---|---|---|
| R-090 | AI vendor lock-in via provider-specific features | S2 | P2 | All calls through ai-gateway-service; vendor-abstract types; multi-vendor router | Low |
| R-091 | AI provenance loss on export | S2 | P2 | Domain aggregates refuse writes of AI artifacts without aiProvenance; export includes provenance block | Low |
| R-092 | AI-assisted clinical decision without human attestation | S1 | P3 | Hard rule: no AI-only persistence of clinical facts; sign = human attestation | Very low |

2. Trade-offs (explicit)

T-01 — Eventual vs. strong consistency

  • Choice: Eventual across services; strong within a service (read-your-writes inside one bounded context).
  • Why: Distributed transactions across 27 services are forbidden by the architecture baseline. Strong consistency per aggregate is cheap and gives clinicians the guarantees they need on chart, order, and billing.
  • Alt we rejected: Two-phase commit via Saga Monitor service — too much operational cost.

T-02 — FHIR-first vs. custom schemas

  • Choice: FHIR R4 is canonical; local tables are operational indexes + workflow state.
  • Why: Interoperability with MoPH, national registries, and future cross-border HIE is the product's reason to exist. Custom schemas would orphan us.
  • Cost: Onboarding ramp is steeper; some domains (billing) are verbose in FHIR.

T-03 — Row-level vs. schema-per-tenant vs. DB-per-tenant

  • Choice: Row-level by default + per-schema promotion path for largest tenants.
  • Why: Row-level scales operationally; per-schema is available without API change for isolation-critical tenants. DB-per-tenant is the nuclear option reserved for sovereign deployments (MoPH-only instance).
  • Cost: RLS correctness must be tested on every service.

T-04 — Monorepo vs. split repos

  • Choice: Monorepo with Turbo; one repo per logical platform product (eHealth, edTech).
  • Why: Shared libraries (@ghasi/*), unified standards, atomic cross-service changes. Splitting 27 services across 27 repos would multiply CI/CD overhead roughly 27×.
  • Cost: Build times; dependency upgrade coordination.

T-05 — On-premise vs. cloud

  • Choice: Both; every service is packageable for either mode. Tenant class determines deployment.
  • Why: Afghanistan reference deployments span national cloud, private MoPH DC, and facility-level on-prem.
  • Cost: Ops team must keep two deployment modes warm; IaC modules maintained for both.

T-06 — NestJS + Node vs. polyglot

  • Choice: Single stack — NestJS 11 / Node 22 / TypeScript 5.x for all services.
  • Why: Team cohesion; shared @ghasi/* libs; one hiring pipeline; one tooling chain.
  • Cost: Node is not optimal for CPU-bound work (imaging, analytics). Mitigation: offload to worker services when needed.

T-07 — Kong DB-less edge

  • Choice: Kong in declarative YAML mode as the sole HTTP edge.
  • Why: Simpler config, no Kong DB to operate, Git-reviewable routes.
  • Cost: Runtime plugin admin not available; all changes are code-review driven.

T-08 — Realm vs. SQLite vs. Dexie for offline

  • Choice: Per-surface optimum — Realm on mobile, SQLite (better-sqlite3) on desktop Electron, Dexie/IndexedDB on web. Same sync protocol across all three.
  • Why: Each local store fits its platform's performance and platform APIs.
  • Cost: Three adapter implementations; covered by one contract test suite.

T-09 — Keycloak realm-per-tenant vs. single-realm-multi-tenant

  • Choice: Realm-per-tenant.
  • Why: IdP isolation, per-tenant policies, jurisdiction-specific federation.
  • Cost: Realm management scales with tenants; admin automation required.

T-10 — AI default on vs. off

  • Choice: Off by default per tenant; explicit opt-in per feature per facility.
  • Why: Clinical trust, predictable cost, regulatory posture.
  • Cost: AI-driven efficiency gains delayed by onboarding friction.

T-11 — Local AI vs. cloud AI

  • Choice: Both, behind one port (ai-gateway-service). Local-first only when offline or for low-stakes tasks.
  • Why: Offline-first demands local; clinical quality demands cloud.
  • Cost: Dual evaluation suites; provenance must treat both uniformly.
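
The routing rule in T-11 (local only when offline or for low-stakes tasks, cloud otherwise) is simple enough to write down as a pure function. The sketch below is an assumption about how such a decision could look: `routeAiCall`, `Stakes` and the `refreshWhenOnline` flag are hypothetical names, the last one reflecting the "cloud refresh when online" mitigation from R-044.

```typescript
type Stakes = "low" | "clinical";

interface RouteDecision {
  target: "local" | "cloud";
  // True when a local answer to a clinical task should be redone in the cloud later.
  refreshWhenOnline: boolean;
}

function routeAiCall(online: boolean, stakes: Stakes): RouteDecision {
  // Low-stakes tasks are served locally whether or not the device is online.
  if (stakes === "low") return { target: "local", refreshWhenOnline: false };
  // Clinical tasks go to the cloud when reachable; offline they fall back to
  // the local model and are flagged for a later cloud refresh.
  return online
    ? { target: "cloud", refreshWhenOnline: false }
    : { target: "local", refreshWhenOnline: true };
}

const offlineClinical = routeAiCall(false, "clinical");
const onlineClinical = routeAiCall(true, "clinical");
```

Keeping the decision in one function behind the gateway means provenance can record the chosen target uniformly for both paths.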

T-12 — GraphQL vs. REST

  • Choice: REST + FHIR-REST; GraphQL not pursued at the platform edge.
  • Why: FHIR canonical shape does not benefit from GraphQL; REST tooling is strong.
  • Cost: Patient-portal composite queries require BFF composition work.

T-13 — pgvector vs. external vector DB

  • Choice: pgvector inside ai-gateway-service DB.
  • Why: Tenant isolation via RLS is symmetric with the rest of Postgres; one operational model.
  • Cost: Scale ceiling is a future problem; abstracted behind VectorIndex port.
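
To illustrate what a `VectorIndex` port might look like, here is a sketch with an in-memory adapter that applies the tenant filter before any similarity scoring. Only the port name comes from this document; the method signatures, the `InMemoryVectorIndex` adapter and the cosine scoring are assumptions, and a pgvector adapter would implement the same interface with SQL under RLS.

```typescript
interface VectorHit {
  id: string;
  score: number;
}

// Hypothetical port: every operation takes the tenant explicitly.
interface VectorIndex {
  upsert(tenantId: string, id: string, embedding: number[]): void;
  query(tenantId: string, embedding: number[], topK: number): VectorHit[];
}

// Cosine similarity; assumes both vectors have the same dimension.
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  a.forEach((x, i) => {
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  });
  // Fall back to a divisor of 1 so a zero vector scores 0 instead of NaN.
  return dot / (Math.sqrt(normA) * Math.sqrt(normB) || 1);
}

class InMemoryVectorIndex implements VectorIndex {
  private rows: { tenantId: string; id: string; embedding: number[] }[] = [];

  upsert(tenantId: string, id: string, embedding: number[]): void {
    this.rows = this.rows.filter((r) => !(r.tenantId === tenantId && r.id === id));
    this.rows.push({ tenantId, id, embedding });
  }

  query(tenantId: string, embedding: number[], topK: number): VectorHit[] {
    return this.rows
      .filter((r) => r.tenantId === tenantId) // tenant filter comes first
      .map((r) => ({ id: r.id, score: cosine(r.embedding, embedding) }))
      .sort((x, y) => y.score - x.score)
      .slice(0, topK);
  }
}

const vectorIndex: VectorIndex = new InMemoryVectorIndex();
vectorIndex.upsert("tenant-a", "note-1", [1, 0]);
vectorIndex.upsert("tenant-a", "note-2", [0, 1]);
vectorIndex.upsert("tenant-b", "note-9", [1, 0]);
const hitsForA = vectorIndex.query("tenant-a", [1, 0], 5).map((h) => h.id);
```

Because callers depend only on the port, swapping pgvector for an external vector database later would not change the calling code.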

T-14 — Yjs CRDT vs. server-authoritative for concurrent authoring

  • Choice: Server-authoritative for clinical documentation (note editing is typically one clinician at a time); no CRDT.
  • Why: Clinical liability model is per-author; attestation is per-author; CRDT adds complexity without clinical benefit.
  • Cost: Second-clinician "observer" edits are merged only via explicit amendment flow.

T-15 — Service count (27)

  • Choice: 27 services aligned to bounded contexts.
  • Why: Team ownership, release independence, licensing granularity.
  • Cost: Operational overhead; dedicated SRE posture; shared CI template mandatory.

T-16 — Audit synchrony

  • Choice: Audit write is synchronous — request fails 503 if audit-service is down.
  • Why: The one place we are not willing to be eventually consistent. Safety > availability for audit.
  • Cost: Audit-service is on the critical path for every PHI write; SRE posture reflects this.
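
The synchronous audit rule can be sketched as a wrapper that records the audit event first and commits the PHI write only if that succeeds. This is one possible ordering under stated assumptions, not the platform's implementation: `AuditSink`, `writePhi` and the result shape are hypothetical, and the calls are synchronous here for brevity where a real service would await a network call.

```typescript
interface AuditEvent {
  actor: string;
  action: string;
  resource: string;
}

// Hypothetical audit client: record() throws when audit-service is unreachable.
interface AuditSink {
  record(auditEvent: AuditEvent): void;
}

type PhiWriteResult<T> = { status: 200; body: T } | { status: 503; error: string };

function writePhi<T>(audit: AuditSink, auditEvent: AuditEvent, commit: () => T): PhiWriteResult<T> {
  try {
    audit.record(auditEvent);
  } catch {
    // Audit is unavailable: refuse the write rather than accept it unaudited.
    return { status: 503, error: "audit unavailable" };
  }
  return { status: 200, body: commit() };
}

const downSink: AuditSink = {
  record: () => {
    throw new Error("audit-service down");
  },
};
let commits = 0;
const rejected = writePhi(
  downSink,
  { actor: "dr-a", action: "update", resource: "AllergyIntolerance/1" },
  () => ++commits,
);
```

With the failing sink, `rejected.status` is 503 and `commits` stays at 0, which is the behaviour the trade-off describes.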

T-17 — Print-first clinical artifacts

  • Choice: Every clinical document (allergy banner, medication list, discharge summary) has a print stylesheet at design-token parity.
  • Why: Paper handovers are still common in Afghanistan reference clinics; printout must remain accurate and legible in LTR+RTL.
  • Cost: Print stylesheet test matrix expands QA scope.

3. Trade-off hierarchy (when two principles conflict)

  1. Patient safety > everything.
  2. Audit integrity > feature velocity.
  3. Tenant isolation > operational convenience.
  4. FHIR canonical > per-team ergonomics.
  5. Offline correctness > online responsiveness (when they conflict).
  6. Explicit AI > autonomous AI.
  7. Immutability of clinical artifacts > storage cost.

4. Watchlist (quarterly review)

  1. Cross-tenant test failures — target zero.
  2. AI cost burn vs. clinical adoption curve.
  3. Sync conflict rate on clinical aggregates.
  4. Offline bundle tamper reports.
  5. Break-glass invocation rate per facility.
  6. HMIS data-quality indicators.
  7. Licensing mis-seed reconciliation rate.
  8. DSAR fulfilment SLA.
  9. Audit-write failure rate.
  10. E-prescribing cross-facility fill rate + dispute rate.

5. Governance

  • This document is versioned; each quarterly review produces a PR with updated severities/probabilities.
  • New features > S2 risk require a mitigation plan in the PR description.
  • ADRs under docs/architecture/ carry the narrative for individual decisions; this doc is the summary register.

6. Open questions

  • Whether sovereign deployment (single-tenant MoPH instance) warrants a separate tier or is simply a special case of the per-schema promotion path.
  • Long-term AI moderation vendor strategy for Dari + Pashto clinical content — currently no mature safety classifier for these languages.