11 — Risks and Trade-offs
Status: populated
Last updated: 2026-04-18
Companion: 01 enterprise-architecture · 13 security-compliance-tenancy · 14 compliance-security-extended · 16 offline-first · 17 technology-stack
This document inventories the platform-level risks and the architectural trade-offs that shaped Ghasi-eHealth. The goal is not to predict the future but to keep each design decision aware of the failure modes it implies, and to name the alternative someone might later say we should have chosen.
Severity: S1 critical · S2 high · S3 medium · S4 low. Probability: P1 very likely · P2 likely · P3 possible · P4 rare.
1. Risk register (platform)
A. Architectural
| ID | Risk | Domain | Sev | Prob | Mitigation | Residual | Owner |
|---|---|---|---|---|---|---|---|
| R-001 | Eventual consistency confuses clinicians | cross-service | S1 | P2 | Optimistic UI, explicit pending state, "read-your-writes" guarantees on chart + order + billing paths | Low | platform-arch |
| R-002 | 27 services = 27 pipelines and 27 deployments | ops | S2 | P1 | Shared CI template, service scaffold, SLO + runbook at creation, shared libs (@ghasi/*) | Low-medium | platform-sre |
| R-003 | Multi-step sagas (referral → interop → billing) fail mid-way | domain | S2 | P2 | Saga state machines with compensations; saga inspector UI; chaos tests inject mid-saga failures | Rare orphans surface to platform-admin queue | each domain |
| R-004 | NATS JetStream outage cascades | infra | S2 | P3 | Multi-AZ replication; producer outbox; HTTP fallback drains for safety-critical flows | > 30 min outage degrades sync UX | platform-sre |
| R-005 | Schema drift across versioned events | contracts | S3 | P2 | Schema registry + Pact gates; backward-compat mandatory; deprecation window | Low | platform-arch |
| R-006 | Test data sprawl (per-tenant fixtures balloon) | QA | S3 | P2 | Factory libraries + synthetic data generator; scrubbed-prod snapshots for staging | Low | QA guild |
| R-007 | Monorepo vs. split repos debate regresses | ops | S3 | P3 | Decision log kept in ADRs; rotate leads through release train | — | CTO |
| R-008 | Monorepo build times exceed 30 min | ops | S3 | P2 | Turbo + task filters; per-service affected-only CI; remote cache | Accepted | platform-sre |
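The compensation behaviour named in R-003 can be illustrated with a minimal sketch. It assumes synchronous steps for brevity (real steps would be asynchronous and persisted), and `SagaStep`, `SagaOutcome`, and `runSaga` are hypothetical names, not the platform's API. Steps that cannot be compensated are reported as orphans, which is what would surface to the platform-admin queue.

```typescript
// Hypothetical types: none of these names come from the platform codebase.
type SagaStep = {
  name: string;
  run: () => void; // throws on failure
  compensate: () => void; // undoes a completed run
};

type SagaOutcome =
  | { status: "completed" }
  | { status: "compensated"; failedStep: string; orphaned: string[] };

function runSaga(steps: SagaStep[]): SagaOutcome {
  const completed: SagaStep[] = [];
  for (const step of steps) {
    try {
      step.run();
      completed.push(step);
    } catch {
      const orphaned: string[] = [];
      // Undo already-completed steps in reverse order.
      for (const prev of completed.reverse()) {
        try {
          prev.compensate();
        } catch {
          // Compensation itself failed: report the step as an orphan.
          orphaned.push(prev.name);
        }
      }
      return { status: "compensated", failedStep: step.name, orphaned };
    }
  }
  return { status: "completed" };
}
```

Running compensations in reverse order mirrors the order in which side effects were created, so later steps are undone before the steps they depend on.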
B. Multi-tenancy and isolation
| ID | Risk | Sev | Prob | Mitigation | Residual |
|---|---|---|---|---|---|
| R-010 | Cross-tenant data leak through missed tenant_id filter | S1 | P3 | PostgreSQL RLS as defense in depth + mandatory tenant_isolation.test.ts per service | Very low |
| R-011 | Noisy-neighbour tenant degrades others | S2 | P2 | Per-tenant rate limits; per-tenant AI budgets; schema-per-tenant promotion for largest tenants | Low |
| R-012 | Platform-admin super-admin over-reach | S2 | P3 | All super-admin reads audited; break-glass signed; time-boxed; quarterly review | Low |
| R-013 | Row-level-only tenancy insufficient for top-tier compliance (national-scale) | S2 | P3 | Per-schema promotion path already specified in 15 tenancy-decision-matrix; no API change | Low |
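As an illustration of the per-tenant rate limits in R-011, the following is a minimal token-bucket sketch. The factory name, the in-memory `Map`, and the explicit `nowMs` parameter are assumptions made for the example; a production limiter would most likely live at the edge or in a shared store.

```typescript
type Bucket = { tokens: number; updatedAtMs: number };

// Hypothetical factory: one bucket per tenant, so one tenant cannot drain another's quota.
function createTenantRateLimiter(capacity: number, refillPerSec: number) {
  const buckets = new Map<string, Bucket>();
  return function tryAcquire(tenantId: string, nowMs: number): boolean {
    const bucket = buckets.get(tenantId) ?? { tokens: capacity, updatedAtMs: nowMs };
    // Refill in proportion to elapsed time, capped at capacity.
    const elapsedSec = Math.max(0, nowMs - bucket.updatedAtMs) / 1000;
    bucket.tokens = Math.min(capacity, bucket.tokens + elapsedSec * refillPerSec);
    bucket.updatedAtMs = nowMs;
    const allowed = bucket.tokens >= 1;
    if (allowed) bucket.tokens -= 1;
    buckets.set(tenantId, bucket);
    return allowed;
  };
}
```

Passing the clock in as `nowMs` keeps the limiter deterministic under test.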
C. FHIR-first and canonicalisation
| ID | Risk | Sev | Prob | Mitigation | Residual |
|---|---|---|---|---|---|
| R-020 | FHIR R4 profile drift (Afghanistan national profile lags IG updates) | S2 | P2 | Internal profile pack versioned; conformance tests; quarterly alignment with MoPH IG | Low |
| R-021 | Teams skip FHIR and build parallel custom schemas | S1 | P2 | ESLint FHIR_FIRST_STANDARD rule on controllers; review gate; architect sign-off on any non-FHIR write path | Low |
| R-022 | FHIR R5 upgrade migration scope | S3 | P3 | Resource-by-resource roadmap; adapter layer isolates breaking changes | Accepted |
| R-023 | Terminology churn (ICD-11 rollout) breaks downstream codings | S2 | P2 | terminology-service versioned; ConceptMap preserves historical mappings; query-time translation | Low |
D. Offline-first complexity
| ID | Risk | Sev | Prob | Mitigation | Residual |
|---|---|---|---|---|---|
| R-030 | Sync conflict on clinically-critical aggregate (allergies, meds) causes silent data loss | S1 | P3 | Conflict policy is always server-authoritative on safety-critical aggregates; UI forces manual resolution; audit on every resolution | Very low |
| R-031 | Offline field clinic with outdated allergy list delivers contraindicated vaccine | S1 | P3 | Pre-clinic sync mandatory; cached allergy list carries a last-sync timestamp; clinical alert if last sync > 7 d | Low |
| R-032 | Local device compromise exposes PHI cache | S2 | P2 | At-rest encryption (device-bound key); MAM policies; remote wipe; auto-lock 15 min | Low |
| R-033 | Clock skew on field device silently corrupts sync order | S3 | P3 | Hybrid logical clock (HLC) on every event; server rejects events whose skew exceeds the 5 min tolerance | Low |
| R-034 | Offline buffer overflow (weeks offline) | S3 | P3 | Size cap per device + LRU eviction of non-critical data; emergency purge UI | Accepted |
| R-035 | Break-glass access queued offline then reconciled online across conflicting policies | S2 | P3 | Offline break-glass events signed by device key; server-side policy re-evaluated on replay | Low |
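The hybrid logical clock mitigation in R-033 can be sketched as follows. This follows the standard HLC send/receive rules. The 5 minute tolerance comes from the table above, while the type and function names, and the choice to reject only remote timestamps that run ahead of the server, are assumptions.

```typescript
type Hlc = { wallMs: number; counter: number };

const MAX_SKEW_MS = 5 * 60 * 1000; // tolerance taken from R-033

// Stamp a local event: advance to physical time if it moved on, else bump the counter.
function hlcSend(local: Hlc, nowMs: number): Hlc {
  return nowMs > local.wallMs
    ? { wallMs: nowMs, counter: 0 }
    : { wallMs: local.wallMs, counter: local.counter + 1 };
}

// Merge a remote timestamp; assumed policy: reject remote clocks too far ahead.
function hlcReceive(local: Hlc, remote: Hlc, nowMs: number): Hlc {
  if (remote.wallMs - nowMs > MAX_SKEW_MS) {
    throw new Error("clock skew exceeds tolerance");
  }
  const wallMs = Math.max(local.wallMs, remote.wallMs, nowMs);
  let counter = 0;
  if (wallMs === local.wallMs && wallMs === remote.wallMs) {
    counter = Math.max(local.counter, remote.counter) + 1;
  } else if (wallMs === local.wallMs) {
    counter = local.counter + 1;
  } else if (wallMs === remote.wallMs) {
    counter = remote.counter + 1;
  }
  return { wallMs, counter };
}

// Total order used when replaying events: wall time first, then counter.
function hlcCompare(a: Hlc, b: Hlc): number {
  return a.wallMs - b.wallMs || a.counter - b.counter;
}
```

Because the counter breaks ties when wall-clock values collide, events keep a consistent order even when a device clock stalls or runs slightly behind.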
E. AI / clinical intelligence
| ID | Risk | Sev | Prob | Mitigation | Residual |
|---|---|---|---|---|---|
| R-040 | LLM hallucination inserts fabricated clinical fact | S1 | P2 | HITL signature mandatory; provenance on every artifact; refusal when low confidence; grounded prompts with RAG over chart | Non-zero — residual addressed via transparency and signature |
| R-041 | Prompt injection via user-uploaded document / scanned note | S1 | P2 | Pre-call classifier; system-prompt isolation; structured generation; allowlist tool surface | Continuous tuning |
| R-042 | AI cost runaway (unbounded tutor / scribe usage) | S2 | P2 | Per-tenant + per-feature budgets; cache by prompt-hash; circuit breakers auto-downgrade model | Low |
| R-043 | Bias in triage / risk-stratification AI | S1 | P2 | Fairness evaluation per model version; parity + equalised odds on consenting cohorts; explicit human-only path | Ongoing review |
| R-044 | Local (edge) AI quality gap vs. cloud | S2 | P2 | Local-first only with quality-threshold heuristic; "Local model" badge in UI; cloud refresh when online | Accepted |
| R-045 | Cross-tenant leakage via shared vector store | S2 | P3 | Tenant filter on every query + schema partitioning; pen-test category for embedding isolation | Low |
| R-046 | AI regulation uncertainty (EU AI Act analogue in Afghanistan / UAE) | S2 | P3 | Per-feature classification; documentation and logging for high-risk features; quarterly review | — |
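The budget and auto-downgrade mitigation in R-042 might look like the sketch below. The tier names, the 80% downgrade threshold, and the `createBudgetGuard` factory are illustrative assumptions; only the per-tenant, per-feature scoping comes from the table.

```typescript
type ModelTier = "premium" | "standard" | "blocked";

// Hypothetical guard: tracks spend per (tenant, feature) and picks a model tier.
function createBudgetGuard(limitUsd: number, downgradeRatio = 0.8) {
  const spentUsd = new Map<string, number>();
  const keyOf = (tenantId: string, feature: string) => JSON.stringify([tenantId, feature]);
  return {
    record(tenantId: string, feature: string, costUsd: number): void {
      const key = keyOf(tenantId, feature);
      spentUsd.set(key, (spentUsd.get(key) ?? 0) + costUsd);
    },
    tierFor(tenantId: string, feature: string): ModelTier {
      const spent = spentUsd.get(keyOf(tenantId, feature)) ?? 0;
      if (spent >= limitUsd) return "blocked"; // budget exhausted
      if (spent >= limitUsd * downgradeRatio) return "standard"; // auto-downgrade
      return "premium";
    },
  };
}
```

Downgrading before blocking gives tenants a softer landing than a hard cut-off when usage spikes.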
F. E-prescribing and cross-facility interop
| ID | Risk | Sev | Prob | Mitigation | Residual |
|---|---|---|---|---|---|
| R-050 | Cross-border e-prescribing legality (patient fills prescription in neighbouring country) | S1 | P3 | Jurisdiction policy engine in ghasi-eprescribing-gateway; block by default; licensed corridor allowlist | Requires legal gate per corridor |
| R-051 | MedicationRequest ↔ MedicationDispense spine outage | S2 | P3 | Idempotent FHIR writes; gateway persistence; subscription replay | Low |
| R-052 | Duplicate dispense via retry | S2 | P3 | Idempotency-Key on dispense writes; dedupe on (tenantId, clientMutationId) | Very low |
| R-053 | Cross-facility identity mismatch (same patient, two MRNs) | S1 | P2 | MPI with NID + phone + DOB + biometric; explicit merge queue; audit every merge | Low |
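The dedupe key in R-052, (tenantId, clientMutationId), can be illustrated with a small in-memory sketch. The record shape and the `createDispenseStore` factory are hypothetical; a real implementation would presumably enforce the same key with a unique constraint in the database.

```typescript
type DispenseWrite = {
  tenantId: string;
  clientMutationId: string;
  medicationRequestId: string;
  quantity: number;
};
type DispenseRecord = DispenseWrite & { dispenseId: string };

// Hypothetical store: a retried write with the same key returns the first result.
function createDispenseStore() {
  const byKey = new Map<string, DispenseRecord>();
  let sequence = 0;
  return {
    write(request: DispenseWrite): { record: DispenseRecord; replayed: boolean } {
      const key = JSON.stringify([request.tenantId, request.clientMutationId]);
      const existing = byKey.get(key);
      if (existing) return { record: existing, replayed: true };
      sequence += 1;
      const record: DispenseRecord = { ...request, dispenseId: `disp-${sequence}` };
      byKey.set(key, record);
      return { record, replayed: false };
    },
    count: (): number => byKey.size,
  };
}
```

Scoping the key by tenant means two tenants can reuse the same client mutation ID without colliding.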
G. Security / compliance / privacy
| ID | Risk | Sev | Prob | Mitigation | Residual |
|---|---|---|---|---|---|
| R-060 | Keycloak compromise exposes all tenants | S1 | P4 | Realm-per-tenant; key rotation; HSM-backed signing; IdP isolation | Low |
| R-061 | Kong DB-less config drift between environments | S2 | P3 | Declarative YAML in VCS; contract tests against Kong in CI | Low |
| R-062 | Insufficient consent capture for secondary use (research, population health) | S2 | P2 | Consent aggregate (FHIR Consent) gates every read; consent policy versioned | Low |
| R-063 | Audit write failure silently accepts transaction | S1 | P3 | Synchronous audit write; transaction fails 503 if audit unavailable | Very low |
| R-064 | PHI in telemetry logs/traces | S1 | P2 | @ghasi/telemetry redaction at emit + collector re-verify + nightly scanner | Low |
| R-065 | DSAR export misses a service's data | S2 | P2 | DSAR is a fan-out saga; every service implements exportForSubject; coverage CI test | Low |
| R-066 | Minor / guardian delegation mis-scoped (teen access to own sensitive category) | S2 | P3 | Age-of-majority policy per category; jurisdiction-configurable; quarterly review | Medium |
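Redaction at emit (R-064) could be approximated as below. This is a sketch only: the field names in `PHI_KEYS` are invented for illustration, and nothing here reflects the actual @ghasi/telemetry API.

```typescript
// Hypothetical deny-list of attribute names treated as PHI.
const PHI_KEYS = new Set(["patientName", "nid", "phone", "dob"]);

// Recursively replace PHI-bearing fields before a log or span attribute is emitted.
function redact(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(redact);
  if (value !== null && typeof value === "object") {
    const out: Record<string, unknown> = {};
    for (const [key, inner] of Object.entries(value)) {
      out[key] = PHI_KEYS.has(key) ? "[REDACTED]" : redact(inner);
    }
    return out;
  }
  return value;
}
```

A key-based deny-list is easy to defeat with free-text fields, which is one reason the register also lists collector re-verification and a nightly scanner.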
H. Afghanistan / regional operational
| ID | Risk | Sev | Prob | Mitigation | Residual |
|---|---|---|---|---|---|
| R-070 | Connectivity loss in district hospital (power + internet) | S1 | P1 | Full offline clinician desktop + provider mobile; UPS requirement per facility tier | Accepted |
| R-071 | On-premise vs. cloud deployment split within one country | S2 | P2 | All services support on-prem + cloud; per-tenant deployment class; private cloud option | Accepted |
| R-072 | Sanctions / vendor access restrictions | S2 | P2 | Multi-vendor AI router; open-source fallbacks; self-host tier | Low |
| R-073 | Regulatory shift (MoPH data residency tightens) | S2 | P3 | Residency is a tenant attribute, not a codepath; in-country regions ready | Low |
| R-074 | HMIS data quality from low-digitised facilities | S2 | P1 | Population-health service pulls from chart data, not manual reports; data-quality indicators visible | Medium |
| R-075 | Paper-first handovers in emergency settings | S3 | P1 | Print / scan-back workflows in document-service; OCR-assisted ingestion | Accepted |
I. Licensing and commercial
| ID | Risk | Sev | Prob | Mitigation | Residual |
|---|---|---|---|---|---|
| R-080 | Licensing boundary enforcement complexity across 27 services | S2 | P2 | Central licensing service; ModuleEntitlementGuard in @ghasi/nestjs-common; UI hides unlicensed module nav | Low |
| R-081 | Unlicensed usage via direct NATS event subscription | S2 | P3 | Subject-level ACLs; licensing-aware NATS consumer registrar | Low |
| R-082 | Over-license charges (mis-seeded license at onboarding) | S3 | P2 | License seed template per tenant class; migration log; reconciliation job nightly | Low |
J. AI-gateway and platform-AI
| ID | Risk | Sev | Prob | Mitigation | Residual |
|---|---|---|---|---|---|
| R-090 | AI vendor lock-in via provider-specific features | S2 | P2 | All calls through ai-gateway-service; vendor-abstract types; multi-vendor router | Low |
| R-091 | AI provenance loss on export | S2 | P2 | Domain aggregates refuse writes of AI artifacts without aiProvenance; export includes provenance block | Low |
| R-092 | AI-assisted clinical decision without human attestation | S1 | P3 | Hard rule: no AI-only persistence of clinical facts; sign = human attestation | Very low |
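R-091 and R-092 together amount to a persistence precondition, sketched here under assumed names. The `aiProvenance` field name comes from the table; the shape of `AiProvenance`, the `aiGenerated` flag, and `signedBy` are hypothetical.

```typescript
type AiProvenance = { model: string; promptHash: string; generatedAt: string };

type ClinicalArtifact = {
  id: string;
  body: string;
  aiGenerated: boolean;
  aiProvenance?: AiProvenance;
  signedBy?: string; // clinician who attested the content
};

// Throws unless an AI-generated artifact carries provenance and a human signature.
function assertPersistable(artifact: ClinicalArtifact): void {
  if (!artifact.aiGenerated) return;
  if (!artifact.aiProvenance) {
    throw new Error("AI artifact rejected: missing aiProvenance");
  }
  if (!artifact.signedBy) {
    throw new Error("AI artifact rejected: human attestation required");
  }
}
```

Placing the check in the domain aggregate rather than in the AI gateway means no write path can bypass it.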
2. Trade-offs (explicit)
T-01 — Eventual vs. strong consistency
- Choice: Eventual across services; strong within a service (read-your-writes inside one bounded context).
- Why: Distributed transactions across 27 services are forbidden by the architecture baseline. Strong consistency per aggregate is cheap and gives clinicians the guarantees they need on chart, order, and billing.
- Alt we rejected: Two-phase commit via Saga Monitor service — too much operational cost.
T-02 — FHIR-first vs. custom schemas
- Choice: FHIR R4 is canonical; local tables are operational indexes + workflow state.
- Why: Interoperability with MoPH, national registries, and future cross-border HIE is the product's reason to exist. Custom schemas would orphan us.
- Cost: Onboarding ramp is steeper; some domains (billing) are verbose in FHIR.
T-03 — Row-level vs. schema-per-tenant vs. DB-per-tenant
- Choice: Row-level by default + per-schema promotion path for largest tenants.
- Why: Row-level scales operationally; per-schema available without API change for isolation-critical tenants. DB-per-tenant is the nuclear option reserved for sovereign deployments (MoPH-only instance).
- Cost: RLS correctness must be tested on every service.
T-04 — Monorepo vs. split repos
- Choice: Monorepo with Turbo; one repo per logical platform product (eHealth, edTech).
- Why: Shared libraries (@ghasi/*), unified standards, atomic cross-service changes. Splitting 27 services across 27 repos would multiply CI/CD overhead 27×.
- Cost: Build times; dependency upgrade coordination.
T-05 — On-premise vs. cloud
- Choice: Both — every service is packagable for both. Tenant class determines deployment.
- Why: Afghanistan reference deployments span national cloud, private MoPH DC, and facility-level on-prem.
- Cost: Ops team must keep two deployment modes warm; IaC modules maintained for both.
T-06 — NestJS + Node vs. polyglot
- Choice: Single stack — NestJS 11 / Node 22 / TypeScript 5.x for all services.
- Why: Team cohesion; shared @ghasi/* libs; one hiring pipeline; one tooling chain.
- Cost: Node is not optimal for CPU-bound work (imaging, analytics). Mitigation: offload to worker services when needed.
T-07 — Kong DB-less edge
- Choice: Kong in declarative YAML mode as the sole HTTP edge.
- Why: Simpler config, no Kong DB to operate, Git-reviewable routes.
- Cost: Runtime plugin admin not available; all changes are code-review driven.
T-08 — Realm vs. SQLite vs. Dexie for offline
- Choice: Per-surface optimum — Realm on mobile, SQLite (better-sqlite3) on desktop Electron, Dexie/IndexedDB on web. Same sync protocol across all three.
- Why: Each local store fits its platform's performance and platform APIs.
- Cost: Three adapter implementations; covered by one contract test suite.
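The "one contract test suite" idea in T-08 can be sketched as a shared interface plus a reusable contract check. Everything here is an assumption for illustration: the `LocalStore` method names, the synchronous signatures (IndexedDB access is asynchronous in practice), and the in-memory adapter standing in for Realm, SQLite, or Dexie.

```typescript
// Hypothetical port that every local-store adapter would implement.
interface LocalStore {
  put(key: string, value: string): void;
  get(key: string): string | undefined;
  pendingKeys(): string[]; // keys awaiting sync, oldest write first
  markSynced(key: string): void;
}

// In-memory stand-in for a platform adapter.
function createMemoryStore(): LocalStore {
  const data = new Map<string, string>();
  const pending = new Set<string>();
  return {
    put(key, value) {
      data.set(key, value);
      pending.delete(key); // re-queue at the end on rewrite
      pending.add(key);
    },
    get: (key) => data.get(key),
    pendingKeys: () => Array.from(pending),
    markSynced(key) {
      pending.delete(key);
    },
  };
}

// Contract check reused for every adapter; returns the names of failed expectations.
function runStoreContract(makeStore: () => LocalStore): string[] {
  const failures: string[] = [];
  const store = makeStore();
  store.put("allergy:1", "penicillin");
  if (store.get("allergy:1") !== "penicillin") failures.push("get returns last put");
  store.put("med:1", "amoxicillin");
  if (store.pendingKeys().join(",") !== "allergy:1,med:1") failures.push("pending keeps write order");
  store.markSynced("allergy:1");
  if (store.pendingKeys().join(",") !== "med:1") failures.push("markSynced clears pending");
  return failures;
}
```

Each platform adapter would be passed to `runStoreContract` through its own factory, so the three implementations are held to the same observable behaviour.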
T-09 — Keycloak realm-per-tenant vs. single-realm-multi-tenant
- Choice: Realm-per-tenant.
- Why: IdP isolation, per-tenant policies, jurisdiction-specific federation.
- Cost: Realm management scales with tenants; admin automation required.
T-10 — AI default on vs. off
- Choice: Off by default per tenant; explicit opt-in per feature per facility.
- Why: Clinical trust, predictable cost, regulatory posture.
- Cost: AI-driven efficiency gains delayed by onboarding friction.
T-11 — Local AI vs. cloud AI
- Choice: Both, behind one port (ai-gateway-service). Local-first only when offline or for low-stakes tasks.
- Why: Offline-first demands local; clinical quality demands cloud.
- Cost: Dual evaluation suites; provenance must treat both uniformly.
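The routing rule in T-11 reduces to a small decision function. This is a sketch with assumed names: `AiTask`, its `stakes` classification, and `chooseAiRoute` are not taken from ai-gateway-service.

```typescript
type AiTask = { stakes: "low" | "clinical" };
type AiRoute = "local" | "cloud";

// Local when offline or when the task is low-stakes; cloud otherwise.
function chooseAiRoute(task: AiTask, online: boolean): AiRoute {
  if (!online) return "local";
  return task.stakes === "low" ? "local" : "cloud";
}
```

The returned route is also the natural input for the provenance record, so local and cloud outputs can be labelled uniformly.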
T-12 — GraphQL vs. REST
- Choice: REST + FHIR-REST; GraphQL not pursued at the platform edge.
- Why: FHIR canonical shape does not benefit from GraphQL; REST tooling is strong.
- Cost: Patient-portal composite queries require BFF composition work.
T-13 — pgvector vs. external vector DB
- Choice: pgvector inside ai-gateway-service DB.
- Why: Tenant isolation via RLS is symmetric with the rest of Postgres; one operational model.
- Cost: Scale ceiling is a future problem; abstracted behind the VectorIndex port.
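A port of this kind might look like the sketch below. Only the name VectorIndex comes from this document; the method signatures and the in-memory cosine-similarity adapter are assumptions, standing in for a pgvector-backed implementation where RLS would enforce the tenant filter.

```typescript
// Hypothetical shape of the port; the method names are illustrative.
interface VectorIndex {
  upsert(tenantId: string, id: string, embedding: number[]): void;
  query(tenantId: string, embedding: number[], topK: number): { id: string; score: number }[];
}

function cosine(a: number[], b: number[]): number {
  const dot = a.reduce((sum, x, i) => sum + x * (b[i] ?? 0), 0);
  const normA = Math.sqrt(a.reduce((sum, x) => sum + x * x, 0));
  const normB = Math.sqrt(b.reduce((sum, x) => sum + x * x, 0));
  return normA === 0 || normB === 0 ? 0 : dot / (normA * normB);
}

// In-memory adapter used here only to show the contract.
function createMemoryVectorIndex(): VectorIndex {
  const rows: { tenantId: string; id: string; embedding: number[] }[] = [];
  return {
    upsert(tenantId, id, embedding) {
      const existing = rows.findIndex((r) => r.tenantId === tenantId && r.id === id);
      if (existing >= 0) rows.splice(existing, 1);
      rows.push({ tenantId, id, embedding });
    },
    query(tenantId, embedding, topK) {
      return rows
        .filter((r) => r.tenantId === tenantId) // tenant filter applied before ranking
        .map((r) => ({ id: r.id, score: cosine(r.embedding, embedding) }))
        .sort((x, y) => y.score - x.score)
        .slice(0, topK);
    },
  };
}
```

Keeping `tenantId` as a required argument on every method makes an unfiltered query impossible to express through the port.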
T-14 — Yjs CRDT vs. server-authoritative for concurrent authoring
- Choice: Server-authoritative for clinical documentation (note editing is typically one clinician at a time); no CRDT.
- Why: Clinical liability model is per-author; attestation is per-author; CRDT adds complexity without clinical benefit.
- Cost: Second-clinician "observer" edits are merged only via explicit amendment flow.
T-15 — Service count (27)
- Choice: 27 services aligned to bounded contexts.
- Why: Team ownership, release independence, licensing granularity.
- Cost: Operational overhead; dedicated SRE posture; shared CI template mandatory.
T-16 — Audit synchrony
- Choice: Audit write is synchronous — request fails 503 if audit-service is down.
- Why: The one place we are not willing to be eventually consistent. Safety > availability for audit.
- Cost: Audit-service is on the critical path for every PHI write; SRE posture reflects this.
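The fail-closed behaviour in T-16 can be sketched as a wrapper around the write. The names `AuditEvent`, `WriteResult`, and `writeWithAudit` are hypothetical, the sketch is synchronous for brevity, and writing the audit record before committing is an assumed ordering rather than a documented one.

```typescript
type AuditEvent = { actor: string; action: string; resource: string };
type WriteResult<T> = { status: 200; value: T } | { status: 503; error: string };

// The domain write only runs once the audit record has been accepted.
function writeWithAudit<T>(
  appendAudit: (event: AuditEvent) => void,
  event: AuditEvent,
  commit: () => T,
): WriteResult<T> {
  try {
    appendAudit(event);
  } catch {
    // Fail closed: no audit record, no PHI write.
    return { status: 503, error: "audit unavailable" };
  }
  return { status: 200, value: commit() };
}
```

This makes the availability cost explicit: any audit-service outage is directly visible to callers as a 503.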
T-17 — Print-first clinical artifacts
- Choice: Every clinical document (allergy banner, medication list, discharge summary) has a print stylesheet at design-token parity.
- Why: Paper handovers are still common in Afghanistan reference clinics; printout must remain accurate and legible in LTR+RTL.
- Cost: Print stylesheet test matrix expands QA scope.
3. Trade-off hierarchy (when two principles conflict)
- Patient safety > everything.
- Audit integrity > feature velocity.
- Tenant isolation > operational convenience.
- FHIR canonical > per-team ergonomics.
- Offline correctness > online responsiveness (when they conflict).
- Explicit AI > autonomous AI.
- Immutability of clinical artifacts > storage cost.
4. Watchlist (quarterly review)
- Cross-tenant test failures — target zero.
- AI cost burn vs. clinical adoption curve.
- Sync conflict rate on clinical aggregates.
- Offline bundle tamper reports.
- Break-glass invocation rate per facility.
- HMIS data-quality indicators.
- Licensing mis-seed reconciliation rate.
- DSAR fulfilment SLA.
- Audit-write failure rate.
- E-prescribing cross-facility fill rate + dispute rate.
5. Governance
- This document is versioned; each quarterly review produces a PR with updated severities/probabilities.
- New features > S2 risk require a mitigation plan in the PR description.
- ADRs under docs/architecture/ carry the narrative for individual decisions; this doc is the summary register.
6. Open questions
- Whether sovereign deployment (single-tenant MoPH instance) warrants a separate tier or is simply a special case of the per-schema promotion path.
- Long-term AI moderation vendor strategy for Dari + Pashto clinical content — currently no mature safety classifier for these languages.