ADR-0004 — National-Backbone Resilience, Sovereignty, and Telecom-Grade Data Plane

Status: Proposed
Date: 2026-04-20
Owner: Platform Architecture Council
Deciders: CTO, Head of Platform, Head of SRE, Head of Security, Head of Trust & Safety, Regulator Liaison
Supersedes / extends: ADR-0001 (Kong), ADR-0002 (Keycloak/IdP), ADR-0003 (Compliance Layer)
References:


1. Context

Ghasi-SMS-Gateway is positioned as the national SMS backbone for Afghanistan and as a regional alternative to Twilio / Infobip / Sinch. The current architecture (ADR-0001 + ADR-0002 + ADR-0003) defines a single-region, multi-tenant SaaS with Kong + Keycloak + Compliance + NATS + Postgres. That topology is correct for an early-stage SaaS but insufficient for:

  1. Mission-critical government, banking, healthcare, and emergency-services traffic.
  2. ATRA / GSMA-grade regulator obligations (CDR, LI, MNP, sender-ID registry).
  3. Sovereign data-residency obligations (no PII off-shore by default).
  4. Telecom-grade SLAs (P99 OTP latency ≤ 3 s, availability ≥ 99.99%, RPO ≤ 5 s, RTO ≤ 5 min for OTP class).
  5. Per-MNO operational scale (5 Afghan operators, growing — each requiring isolated bind pools, TPS governors, and DLR pipelines).
  6. Telecom-grade fraud control (AIT, SIM-box, grey-route, OTP harvesting) at national scale.

This ADR defines the architectural uplifts the platform must adopt to meet that bar.


2. Decision (summary)

We will adopt the following architectural baseline for v2.0 (national-backbone GA), in addition to all prior ADRs:

  1. Multi-region active-active deployment across kbl (Kabul, primary) and mzr (Mazar-i-Sharif, secondary), with a sovereign-allowed cold-DR copy in dxb (Dubai). Both Afghan regions are read-write; geo-aware routing pins traffic to the closest healthy region.
  2. Control plane vs. data plane split: the orchestration / compliance / billing / portal services run on a control-plane node pool with classic HA; the SMPP / DLR / webhook / channel-router services run on a data-plane node pool with telecom-grade NICs, dedicated egress IP pools, and sticky bind affinity.
  3. Per-MNO connector pools: replace the single smpp-connector StatefulSet with one Deployment per (MNO × bind-direction): smpp-connector-{mno}-{tx|rx|trx} with bind-affinity, per-bind sequence-number management, per-bind concatenation buffers, and per-bind TPS governor backed by Redis.
  4. Twelve new bounded contexts (see §3): national SMS firewall, number intelligence (HLR + MNP), sender-ID registry, numbering / short-code, CDR mediation, cell-broadcast bridge, channel router (SMS / RCS / WhatsApp / Voice / Email), fraud intelligence, regulator liaison, developer portal, campaign manager, consent / DND ledger.
  5. HSM-backed key management (PKCS#11, FIPS 140-2 L3) for: platform JWT signing, SAML SP keys, webhook HMAC root, SMS-content envelope keys.
  6. Service mesh with SPIFFE/SPIRE workload identities (Istio or Linkerd) for east-west mTLS, replacing implicit namespace trust.
  7. NATS JetStream multi-cluster topology: super-cluster across kbl and mzr with stream mirrors; dxb is a leaf node with audit-only mirrors.
  8. PostgreSQL topology: Patroni-managed clusters per region with synchronous replication intra-region and logical replication inter-region for hot-standby of identity, compliance, sender-ID, and consent data; per-tenant schema sharding for messaging and CDR.
  9. CDR pipeline distinct from billing events: append-only object-storage CDRs (S3-compatible / MinIO) + ClickHouse for analytics + regulator export jobs.
  10. National traffic priority lanes (P0 emergency, P1 OTP, P2 transactional, P3 marketing, P4 broadcast) with end-to-end SLA budgets, dedicated NATS subjects, dedicated SMPP windows, and enforced TPS shaping.
  11. Trusted-tenant fast-path for vetted regulated tenants (banks, ministries, healthcare): cryptographically pre-approved templates that bypass the full compliance pipeline (replaced by signature verification + sample-mode AI shadow).
  12. Chaos engineering programme running weekly with an explicit GameDay scoreboard.
  13. 24×7 NOC with PagerDuty / Opsgenie integration, tiered escalation, and live MNO partnership channels.

3. New Bounded Contexts (12)

| Service | Purpose | Owner | Sync interface | Async topics |
| --- | --- | --- | --- | --- |
| sms-firewall-service | Inbound MO firewall, transit firewall, AIT detection, SIM-box detection, grey-route exclusion, DND enforcement | Trust & Safety | gRPC FilterInbound, EvaluateTransit | firewall.alert.*, firewall.audit.* |
| number-intelligence-service | MSISDN → MNO resolution, ported-number cache (MNP), EIR/CEIR check, line-type classification | Messaging Core | gRPC Lookup | numint.cache.refreshed, numint.mnp.changed |
| sender-id-registry-service | Registration, KYC of registrant, verification, rotation, suspension, regulator export of all registered sender IDs | Trust & Safety + Regulator | REST + gRPC Verify | sender.id.registered, sender.id.suspended, sender.id.regulator.exported |
| numbering-service | Long-codes, short-codes, alpha-IDs, MSISDN inventory, leasing, reservation, expiry, recall | Commerce | REST | number.assigned, number.released, number.expired |
| cdr-mediation-service | Append-only CDR generation, TAP 3.12 / RAP export, regulator export | Commerce + Regulator | (none; async only) | cdr.generated.v1, cdr.exported.v1 |
| cbc-bridge-service | 3GPP TS 23.041 / ETSI EN 302 117 cell-broadcast bridge for civil emergencies | Government / Emergency | gRPC BroadcastEmergency (mTLS, government-only) | cbc.broadcast.requested, cbc.broadcast.dispatched, cbc.broadcast.acked |
| channel-router-service | Multi-channel fallback (SMS → MMS → RCS → WhatsApp BSP → Voice OTP → email), per-recipient profile and per-tenant policy | Messaging Core | gRPC RouteWithFallback | channel.fallback.taken, channel.delivery.confirmed |
| fraud-intel-service | ML scoring for AIT, SIM-box, OTP harvesting, grey-route arbitrage; fraud feed export | Trust & Safety | gRPC Score | fraud.detected.*, fraud.feed.updated.v1 |
| regulator-portal-service | Regulator-facing portal: license artifacts, monthly CDR submission, LI requests, complaint ingest | Regulator + Legal | REST (regulator only, mTLS) | regulator.report.submitted, regulator.complaint.received |
| developer-portal-service | Public dev portal: API docs, SDKs, sandbox, key management self-serve, consumption analytics | Product + DevRel | REST | devportal.signup, devportal.key.created |
| campaign-service | Campaigns: segments, templates, schedule, A/B, throttle, kill-switch, conversation sessions | Product | REST + gRPC EnqueueCampaign | campaign.created, campaign.dispatched, campaign.completed, campaign.killed |
| consent-ledger-service | Opt-in/opt-out ledger, DND registry sync, STOP-keyword handling, consent revocation propagation | Trust & Safety | gRPC CheckConsent, REST admin | consent.granted, consent.revoked, dnd.registry.synced |

4. Updated System Context (C4 L1)


5. Multi-region Topology

Region-affinity policy.

  • Identity, sender-ID, consent, compliance rules: multi-master; concurrent updates are resolved deterministically (logical replication + per-row last-writer-wins, ordered by hybrid logical clock).
  • Messaging hot path (orch.sms_messages, dlr.delivery_receipts, cdr.records): region-local primary, cross-region mirror is read-only.
  • Routing decisions are region-pinned to keep MNO bind affinity (an MNO bind is owned by exactly one region at a time).
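
The per-row LWW rule with HLC timestamps in the first bullet can be sketched as follows. This is a minimal illustration, not the platform's implementation: the class names, the row layout (a dict with an "hlc" field), and the use of the region id as the final tie-break are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class HlcTimestamp:
    """Hybrid logical clock timestamp; field order defines the comparison."""
    wall_ms: int  # physical clock component
    counter: int  # logical counter for events within the same millisecond
    node: str     # assumed tie-break: region id, e.g. "kbl" or "mzr"


class HybridLogicalClock:
    def __init__(self, node: str) -> None:
        self.node = node
        self.wall_ms = 0
        self.counter = 0

    def now(self, physical_ms: int) -> HlcTimestamp:
        """Timestamp a local write."""
        if physical_ms > self.wall_ms:
            self.wall_ms, self.counter = physical_ms, 0
        else:
            self.counter += 1
        return HlcTimestamp(self.wall_ms, self.counter, self.node)

    def observe(self, remote: HlcTimestamp, physical_ms: int) -> HlcTimestamp:
        """Advance the clock when a replicated row arrives from the peer region."""
        new_wall = max(self.wall_ms, remote.wall_ms, physical_ms)
        if new_wall == self.wall_ms and new_wall == remote.wall_ms:
            self.counter = max(self.counter, remote.counter) + 1
        elif new_wall == self.wall_ms:
            self.counter += 1
        elif new_wall == remote.wall_ms:
            self.counter = remote.counter + 1
        else:
            self.counter = 0
        self.wall_ms = new_wall
        return HlcTimestamp(self.wall_ms, self.counter, self.node)


def lww_merge(local: dict, incoming: dict) -> dict:
    """Keep whichever row version carries the larger HLC timestamp."""
    return incoming if incoming["hlc"] > local["hlc"] else local
```

Because the comparison is total and identical in both regions, applying the same two row versions in either order yields the same winner, which is the property that lets both regions accept writes to control tables.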

Failover. Region failover is automatic for read paths (Cloudflare + GeoDNS) but gated for write paths (manual, GameDay-tested cutover), so the two regions never split-brain on idempotency keys.


6. Data-Plane Separation

| Plane | Workloads | Node pool | Network | SLO |
| --- | --- | --- | --- | --- |
| Edge | Cloudflare, Kong | edge | public | P99 ≤ 30 ms TLS handshake |
| Control | orchestrator, compliance, routing, billing, portals, IdP | np-ctrl (general-purpose) | private | 99.95% |
| Data | smpp-connector pool, dlr-processor, webhook-dispatcher, channel-router, cbc-bridge | np-data (telecom NICs, dedicated egress IPs whitelisted by MNOs) | private + MNO IPSec / leased | 99.99% per pool |
| Stateful | Postgres, NATS, Redis, ClickHouse, MinIO, HSM | np-state (local NVMe, anti-affinity) | private | 99.99% |
| Observability | Prometheus, Loki, OTel, Grafana, NOC | np-obs | private | 99.9% |
| Identity | Keycloak (HA), auth-service, compliance-ai (LLM) | np-identity (GPU node pool for LLM) | private + tightly NetworkPolicy'd | 99.95% |

7. SMPP Connector Pool Redesign

Replace the single smpp-connector StatefulSet with the following per-MNO topology:

smpp-connector-awcc-tx Deployment, replicas=N_tx
smpp-connector-awcc-rx Deployment, replicas=N_rx
smpp-connector-awcc-trx Deployment, replicas=N_trx (used for MO+MT-DLR co-bind where MNO requires)
smpp-connector-roshan-{tx|rx|trx} ...
smpp-connector-etisalat-af-{tx|rx|trx} ...
smpp-connector-mtn-af-{tx|rx|trx} ...
smpp-connector-salaam-{tx|rx|trx} ...

Each pod owns:

  • Exactly one persistent SMPP bind (TX / RX / TRX) keyed by MNO + bind-id.
  • A per-bind sequence-number monotonic counter (Redis-backed; survives pod restart with a 60 s warm-up).
  • A per-bind sliding TPS governor (Redis sorted-set; tracks N seconds × M ms granularity per bind).
  • A per-bind concatenation buffer with TTL = concat_window_seconds (default 60).
  • A per-bind enquire_link cadence (default 30 s; configurable per MNO contract).
  • A per-bind submit_sm window (default 100; learned per MNO under back-pressure).
  • A per-bind exponential reconnection back-off with decorrelated jitter (initial 1 s, max 60 s).
  • Per-bind metrics: smpp_bind_state{mno,bind,direction}, smpp_window_inflight, smpp_submit_throttled_total{esme_status}, smpp_dlr_latency_seconds{mno,bind}.
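
The per-bind sliding TPS governor from the list above can be sketched in memory as follows. This is an illustrative stand-in, not the connector's code: the class and method names are hypothetical, and a local deque replaces the Redis sorted set so the example runs without a server.

```python
from collections import deque


class SlidingTpsGovernor:
    """Sliding-window rate limiter for one SMPP bind (in-memory sketch)."""

    def __init__(self, max_tps: int, window_s: float = 1.0) -> None:
        self.window_s = window_s
        self.limit = int(max_tps * window_s)  # submits allowed per window
        self._sent = deque()                  # timestamps of accepted submits

    def try_acquire(self, now: float) -> bool:
        """Return True if a submit_sm may be sent at time `now` (seconds)."""
        cutoff = now - self.window_s
        # Drop timestamps that have slid out of the window.
        while self._sent and self._sent[0] <= cutoff:
            self._sent.popleft()
        if len(self._sent) >= self.limit:
            return False  # over budget: caller should delay or queue the PDU
        self._sent.append(now)
        return True


if __name__ == "__main__":
    governor = SlidingTpsGovernor(max_tps=3)
    for t in (0.0, 0.1, 0.2, 0.3, 1.05):
        print(t, governor.try_acquire(t))
```

With a Redis sorted set keyed per bind, the same three steps would correspond to ZREMRANGEBYSCORE (trim), ZCARD (count) and ZADD (record), ideally executed in a single Lua script or transaction so that the check and the insert are atomic across pods.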

Bind affinity. A NATS subject smpp.{mno}.{direction}.{bindId} is consumed by exactly one pod (queue group of 1) so PDU sequence is strictly ordered per bind. Failover transfers consumer ownership atomically via JetStream consumer recreation.


8. National Traffic Priority Lanes

| Lane | Use cases | NATS subject | TPS budget per MNO | Submit→DLR P99 | Compliance treatment |
| --- | --- | --- | --- | --- | --- |
| P0 Emergency | Civil emergency cell-broadcast, public-safety alerts | lane.p0.emergency.* | reserved 100% pre-emption | ≤ 1 s | bypass (replaced by gov-PKI signature verification) |
| P1 OTP | OTP, 2FA, transactional codes | lane.p1.otp.* | reserved 30% of MNO TPS | ≤ 3 s | trusted-tenant fast-path; compliance shadow-mode |
| P2 Transactional | Bank alerts, delivery notifications, healthcare | lane.p2.tx.* | reserved 30% | ≤ 10 s | full compliance, optimised |
| P3 Marketing | Promotional, bulk | lane.p3.mkt.* | floating 30% | ≤ 60 s | full compliance, regulator quiet-window honoured |
| P4 Broadcast | Authorised national broadcasts (non-emergency) | lane.p4.bcast.* | floating, throttled | ≤ 5 min | full compliance + secondary regulator approval |

The Routing Engine assigns lane based on tenant tier × content classification × explicit X-Priority-Lane header (subject to authorisation).
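
One way the lane-assignment rule could look is sketched below. The lane names and subject prefixes come from the table above; the tenant tiers, the content-class labels, the per-tier ceilings, and the rule that an unauthorised X-Priority-Lane header is simply ignored are assumptions made for illustration only.

```python
from enum import IntEnum
from typing import Optional


class Lane(IntEnum):
    """Priority lanes; a lower value means higher priority."""
    P0_EMERGENCY = 0
    P1_OTP = 1
    P2_TRANSACTIONAL = 2
    P3_MARKETING = 3
    P4_BROADCAST = 4


# NATS subject prefixes taken from the lane table.
SUBJECT_PREFIX = {
    Lane.P0_EMERGENCY: "lane.p0.emergency",
    Lane.P1_OTP: "lane.p1.otp",
    Lane.P2_TRANSACTIONAL: "lane.p2.tx",
    Lane.P3_MARKETING: "lane.p3.mkt",
    Lane.P4_BROADCAST: "lane.p4.bcast",
}

# Hypothetical: lane implied by the content classifier's label.
LANE_BY_CONTENT = {
    "emergency": Lane.P0_EMERGENCY,
    "otp": Lane.P1_OTP,
    "transactional": Lane.P2_TRANSACTIONAL,
    "marketing": Lane.P3_MARKETING,
    "broadcast": Lane.P4_BROADCAST,
}

# Hypothetical: highest-priority lane each tenant tier is authorised to use.
TIER_CEILING = {
    "government": Lane.P0_EMERGENCY,
    "regulated": Lane.P1_OTP,
    "standard": Lane.P2_TRANSACTIONAL,
}


def assign_lane(tenant_tier: str, content_class: str,
                requested: Optional[Lane] = None) -> Lane:
    """Combine tenant tier, content classification and the optional header."""
    ceiling = TIER_CEILING[tenant_tier]
    lane = LANE_BY_CONTENT[content_class]
    # Honour X-Priority-Lane only if the tenant is authorised for that lane.
    if requested is not None and requested >= ceiling:
        lane = requested
    # Never exceed the tier's ceiling (larger value = lower priority).
    return max(lane, ceiling)
```

Under these assumed rules, a standard-tier tenant that sends OTP content lands in P2, and the same tenant requesting P0 via the header for marketing content stays in P3.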


9. Quantitative NFR Anchors (non-negotiable)

| NFR | Target |
| --- | --- |
| Steady-state throughput | 5 M msg/h sustained per region; 10 M/h across both Afghan regions |
| Burst | 100 K msg/min for 5 min, 250 K msg/min for 30 s |
| Submit→DLR P95 (P1 OTP class) | ≤ 3 s end-to-end (incl. compliance) |
| Submit→DLR P95 (P2 transactional) | ≤ 10 s |
| Submit→202 ack latency P99 | ≤ 200 ms (Kong → orchestrator → 202) |
| Compliance EvaluateCompliance P99 | ≤ 800 ms (current spec is P95 ≤ 500 ms; keep) |
| Availability: Edge + Orchestrator | 99.99% monthly (≤ 4 m 22 s downtime) |
| Availability: SMPP per MNO bind | 99.95% monthly per-bind, 99.99% any-bind |
| RPO (OTP/transactional) | ≤ 5 s |
| RPO (compliance audit + CDR) | 0 (synchronous WAL ship) |
| RTO (any region) | ≤ 5 min for OTP class, ≤ 15 min for full platform |
| Webhook delivery first-attempt success | ≥ 99.9% within 5 s; full retry budget 24 h with exp back-off |
| Compliance hold-queue oldest age | ≤ 4 h (P95), ≤ 24 h auto-expiry hard limit |
| Audit-log retention | ≥ 13 months hot, ≥ 7 years cold (regulator) |
| CDR generation lag from DLR | ≤ 10 s P99 |
| Fraud detection mean-time-to-detect (AIT) | ≤ 15 min |
| Tenant compliance-score refresh | ≤ 15 min |

These are bound to alerts in 15-nfr-sla-catalog.md (to be authored).
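
The arithmetic behind a few of these anchors can be checked with a short script. It is a sketch under two stated assumptions: a month is taken as one twelfth of a 365-day year, and bind failures are treated as independent when computing "any-bind" availability (correlated MNO outages would make the real figure lower).

```python
SECONDS_PER_MONTH = 365 * 24 * 3600 / 12  # average month: 2,628,000 s


def downtime_budget_s(availability_pct: float) -> float:
    """Allowed downtime per average month for a given availability target."""
    return SECONDS_PER_MONTH * (1 - availability_pct / 100)


def any_of_n_availability_pct(per_unit_pct: float, n: int) -> float:
    """Availability of 'at least one of n units is up', assuming independence."""
    return 100 * (1 - (1 - per_unit_pct / 100) ** n)


def rate_per_second(messages: float, period_s: float) -> float:
    """Convert a throughput target into messages per second."""
    return messages / period_s


if __name__ == "__main__":
    budget = downtime_budget_s(99.99)
    minutes, seconds = divmod(budget, 60)
    # 99.99% monthly works out to about 262.8 s, i.e. 4 min 22.8 s.
    print(f"99.99% monthly: {int(minutes)} min {seconds:.1f} s")
    # Two independent 99.95% binds already exceed the 99.99% any-bind target.
    print(f"any of 2 binds: {any_of_n_availability_pct(99.95, 2):.6f}%")
    # 5 M msg/h sustained is roughly 1,389 msg/s per region.
    print(f"steady state: {rate_per_second(5_000_000, 3600):.0f} msg/s")
    # The 250 K msg/min burst is roughly 4,167 msg/s.
    print(f"30 s burst: {rate_per_second(250_000, 60):.0f} msg/s")
```

The second figure suggests that the 99.99% any-bind target is reachable with two binds per MNO only if their failure modes are genuinely independent (separate pods, egress IPs, and MNO-side endpoints).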


10. Trusted-Tenant Fast Path

For pre-vetted tenants (banks, ministries, healthcare, mass transit, accredited brands):

  1. Tenant pre-registers a template catalog with content + variable schema.
  2. Each template is signed by compliance-engine after one-time human review and stored in compliance.approved_templates with a content fingerprint and template-id.
  3. At submit time, tenant supplies X-Template-Id + variable bindings.
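  4. Orchestrator looks up the template by template-id, recomputes the content fingerprint of the stored template, and verifies that it matches the signed fingerprint; the variable bindings are validated against the registered variable schema.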
  5. If match → EvaluateCompliance is called in shadow mode (logged, not blocking). Routing proceeds.
  6. If mismatch → fall back to full compliance evaluation.
  7. For a 1-in-1000 sample of matching submissions, full evaluation is run anyway for drift detection.

This delivers OTP-class latency without sacrificing compliance evidence.
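
The decision logic in steps 3 to 7 can be sketched as below. This is one reading of the flow, not the orchestrator's code: the function names and record layout are hypothetical, an HMAC with a demo key stands in for the compliance-engine signature (which the ADR keeps in the HSM), and the 1-in-1000 sample is taken from a simple submit counter.

```python
import hashlib
import hmac

# Stand-in for the compliance-engine signing key; real keys would stay in the HSM.
DEMO_SIGNING_KEY = b"demo-only-not-a-real-key"
SAMPLE_EVERY = 1000  # the "1 in 1000" drift-detection sample


def fingerprint(template_id: str, body: str) -> str:
    """Content fingerprint over the template id and its approved text."""
    return hashlib.sha256(f"{template_id}\n{body}".encode("utf-8")).hexdigest()


def sign(fp: str) -> str:
    """Sign a fingerprint (HMAC here; an HSM-backed signature in practice)."""
    return hmac.new(DEMO_SIGNING_KEY, fp.encode("ascii"), hashlib.sha256).hexdigest()


def approve_template(template_id: str, body: str, variables: list) -> dict:
    """Record produced after the one-time human review (steps 1 and 2)."""
    fp = fingerprint(template_id, body)
    return {"body": body, "variables": set(variables),
            "fingerprint": fp, "signature": sign(fp)}


def compliance_mode(store: dict, template_id: str, bindings: dict,
                    submit_seq: int) -> str:
    """Return "shadow" for the fast path or "full" for full evaluation."""
    record = store.get(template_id)
    if record is None:
        return "full"  # unknown template
    fp = fingerprint(template_id, record["body"])
    intact = (fp == record["fingerprint"]
              and hmac.compare_digest(sign(fp), record["signature"]))
    if not intact or set(bindings) != record["variables"]:
        return "full"  # mismatch: fall back to full compliance evaluation
    if submit_seq % SAMPLE_EVERY == 0:
        return "full"  # sampled for drift detection
    return "shadow"    # logged, non-blocking evaluation
```

A production version would also need to constrain variable values (length, character set), since a signed template with an unconstrained free-text variable could otherwise carry arbitrary content through the fast path.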


11. HSM-Backed Key Hierarchy

| Key class | HSM-held? | Rotation | Notes |
| --- | --- | --- | --- |
| Platform JWT root signing keys (RS256) | Yes | 90 d (30 d previously) | kid exposed via JWKS; HSM signs, HSM never exports |
| SAML SP signing keys (per tenant) | Yes | Annual or on-demand | Per-tenant key under shared HSM partition |
| Webhook HMAC root | Yes | 180 d | Per-tenant secrets derived (HKDF) |
| SMS-content envelope keys (per-tenant DEK) | Yes (KEK) / Postgres TDE (DEK) | KEK 90 d; DEK 30 d | Envelope encryption; KEK in HSM, DEK in Postgres encrypted |
| TLS certs | Public CA (Cloudflare) | Auto | |
| mTLS (service mesh) | SPIRE-issued workload SVIDs | 1 h rotation | |
| Database TDE master key | Yes | 365 d | |
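
The "per-tenant secrets derived (HKDF)" note for the webhook HMAC root can be illustrated with the sketch below, which implements HKDF-SHA256 as specified in RFC 5869 using only the standard library. The info-string layout, the function names, and the in-memory root key are assumptions; with a non-exportable HSM-held root, the derivation would have to run inside the HSM or from an HSM-derived intermediate key.

```python
import hashlib
import hmac


def hkdf_sha256(ikm: bytes, salt: bytes, info: bytes, length: int = 32) -> bytes:
    """HKDF (RFC 5869) with SHA-256: extract, then expand to `length` bytes."""
    if length > 255 * 32:
        raise ValueError("HKDF-SHA256 output is limited to 8160 bytes")
    # Extract: an empty salt defaults to a block of zero bytes.
    prk = hmac.new(salt or b"\x00" * 32, ikm, hashlib.sha256).digest()
    # Expand: chain HMAC blocks until enough output is produced.
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def tenant_webhook_secret(root_key: bytes, tenant_id: str, root_version: int) -> bytes:
    """Derive a per-tenant webhook secret (hypothetical info-string layout)."""
    info = f"webhook-hmac:{tenant_id}:v{root_version}".encode("utf-8")
    return hkdf_sha256(root_key, salt=b"", info=info)


def sign_webhook(secret: bytes, payload: bytes) -> str:
    """HMAC-SHA256 signature a tenant can verify on receipt of a webhook."""
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()
```

Including the root version in the info string means a 180-day root rotation yields a fresh secret for every tenant without storing per-tenant key material.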

12. Service Mesh + Zero Trust

Adopt Istio (or Linkerd) with:

  • Automatic mTLS between every pod (STRICT mode).
  • SPIRE as workload identity provider issuing SVIDs per service account.
  • AuthorizationPolicies per service: explicit from.principals allow-lists; deny-by-default.
  • Per-namespace egress gateways (no pod talks to the Internet directly except egress gateways).
  • Telemetry into the existing OTel collector; no separate observability stack.

13. NATS JetStream — Multi-cluster

super-cluster: ghasi-jetstream
  cluster ghasi-jetstream-kbl (3 nodes; primary streams)
  cluster ghasi-jetstream-mzr (3 nodes; mirrored streams)
  leaf-nodes:
    ghasi-jetstream-dxb (audit-only mirrors of compliance.audit.v1, cdr.*, regulator.*)

  • Streams sms.outbound.*, sms.dlr.inbound, lane.p*.* are region-local (not mirrored — region-pinned).
  • Streams compliance.audit.v1, cdr.generated.v1, cdr.exported.v1, regulator.*, auth.events, consent.*, sender.id.*, firewall.audit.*, fraud.detected.* are mirrored to the peer Afghan region.
  • Audit + CDR mirror to dxb is append-only WORM (object storage immutable bucket-policy).
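
The placement rules above can be expressed as a small lookup, sketched below. NATS wildcards match whole tokens (`*` for exactly one token, `>` for one or more trailing tokens), so the sketch writes the document's loose patterns such as lane.p*.* and cdr.* as lane.> and cdr.>; that translation, the function names, and the "origin" / "peer" / "dxb" labels are assumptions for illustration.

```python
def subject_matches(pattern: str, subject: str) -> bool:
    """NATS-style subject match: '*' is one token, '>' is the remaining tokens."""
    p, s = pattern.split("."), subject.split(".")
    for i, tok in enumerate(p):
        if tok == ">":
            # '>' must be last and must cover at least one token.
            return i == len(p) - 1 and len(s) > i
        if i >= len(s) or (tok != "*" and tok != s[i]):
            return False
    return len(p) == len(s)


# Mirrored to the peer Afghan region.
PEER_MIRRORED = [
    "compliance.audit.v1", "cdr.generated.v1", "cdr.exported.v1", "regulator.>",
    "auth.events", "consent.>", "sender.id.>", "firewall.audit.>", "fraud.detected.>",
]
# Additionally mirrored (audit-only) to the dxb leaf node.
DXB_AUDIT = ["compliance.audit.v1", "cdr.>", "regulator.>"]
# Region-local by design; listed so that misrouted subjects can be detected.
REGION_LOCAL = ["sms.outbound.>", "sms.dlr.inbound", "lane.>"]


def placement(subject: str) -> set:
    """Return where a message on `subject` should be stored."""
    targets = {"origin"}
    if any(subject_matches(p, subject) for p in PEER_MIRRORED):
        targets.add("peer")
    if any(subject_matches(p, subject) for p in DXB_AUDIT):
        targets.add("dxb")
    return targets


def is_region_local(subject: str) -> bool:
    """True for subjects that must never leave their origin region."""
    return any(subject_matches(p, subject) for p in REGION_LOCAL)
```

A check like `is_region_local(subject) and placement(subject) != {"origin"}` could run in CI against the stream definitions to catch a region-pinned subject that is accidentally added to a mirror.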

14. Postgres Topology

  • Per region: Patroni-managed cluster of 3 nodes (1 primary, 1 synchronous replica, 1 asynchronous replica).
  • Per service: dedicated logical database within the cluster (or dedicated cluster for very hot services like orch, cdr).
  • Cross-region: logical replication for control tables (identity, sender-id, compliance rules, consent, tenant config). Hot messaging tables stay region-local.
  • Encryption at rest (Postgres pgcrypto column-level encryption + HSM-held KEK).
  • Row-level security (RLS) enforced per tenant for all tenant-scoped tables (already in DoD, must be audited).
  • WAL archived continuously to object storage (kbl + mzr cross-mirrored).

15. CDR Pipeline (separate from billing events)

DLR → dlr-processor → NATS billing.events [for billing]
DLR → dlr-processor → NATS cdr.generated.v1 [for mediation]


cdr-mediation-service
· normalises to Ghasi CDR canonical schema (JSON Lines)
· partitions by hour into MinIO bucket cdr-{region}-{yyyymmdd}/
· writes daily TAP 3.12 / RAP roll-up
· signs daily file with regulator-approved key
· exports nightly to ATRA SFTP / API
· ingests into ClickHouse for analytics

CDR is immutable; corrections are appended as adjustment records.
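
The append-only rule and the hourly partitioning can be sketched as follows. The bucket name follows the cdr-{region}-{yyyymmdd} pattern given above; the hour prefix inside the bucket, the record fields, and the class and function names are hypothetical, and an in-memory list stands in for the MinIO object.

```python
import json
from datetime import datetime, timezone


def cdr_partition(region: str, event_time: datetime) -> tuple:
    """Return (bucket, hour prefix) for a CDR, using UTC for partitioning."""
    t = event_time.astimezone(timezone.utc)
    return f"cdr-{region}-{t:%Y%m%d}", f"{t:%H}/"


class CdrLedger:
    """Append-only JSON Lines ledger; corrections are new adjustment records."""

    def __init__(self) -> None:
        self.lines = []  # each entry is one immutable JSON line

    def append(self, record: dict) -> None:
        self.lines.append(json.dumps(record, sort_keys=True))

    def adjust(self, cdr_id: str, field: str, new_value, reason: str) -> None:
        """Record a correction without touching the original line."""
        self.append({"type": "adjustment", "adjusts": cdr_id,
                     "field": field, "value": new_value, "reason": reason})

    def effective(self, cdr_id: str) -> dict:
        """Fold the original CDR and its adjustments into the current view."""
        view = {}
        for line in self.lines:
            rec = json.loads(line)
            if rec.get("type") == "cdr" and rec.get("cdr_id") == cdr_id:
                view = dict(rec)
            elif rec.get("type") == "adjustment" and rec.get("adjusts") == cdr_id:
                view[rec["field"]] = rec["value"]
        return view
```

Keeping the original line intact means a signed daily file never has to be re-issued for a correction; the adjustment simply appears in a later file and the effective value is computed at read time.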


16. Consequences

Positive

  • True national-asset behaviour: regional resilience, regulator integration, sovereign data residency.
  • Telecom-grade SLAs are achievable and provable with the new SLA catalog and chaos drills.
  • Genuine differentiation against Twilio/Infobip/Sinch on (a) sovereign on-prem AI compliance, (b) national sender-ID authority, (c) CBC integration, (d) tenant compliance scores exposed to tenants.
  • Eliminates the major risks identified in 00-critique-and-gap-analysis.md.

Negative

  • Significant capex (HSM appliances per region, GPU nodes for LLM, dedicated leased lines / IPSec to MNOs, second region buildout).
  • Operational complexity grows (12 new services, multi-region failover GameDays, mesh, HSM, regulator integration).
  • Time-to-GA extends by ~6 months relative to the single-region baseline.
  • Requires hiring SRE, security engineering, and regulator-liaison roles.

Risks

  • Regulator (ATRA) requirements are still evolving; CDR / LI schema may change. Mitigation: design cdr-mediation-service with pluggable export schemas.
  • HSM vendor lock-in. Mitigation: PKCS#11 abstraction; multi-vendor procurement.
  • Multi-region Postgres conflict resolution edge cases. Mitigation: keep messaging hot path region-local; only control-plane data is multi-master.

17. Acceptance / Done

This ADR is "Approved" once:

  1. Architecture Council signs off on the §2 list.
  2. SRE confirms NFRs in §9 are budgeted (capacity + cost).
  3. Security signs off on §11–§12 (HSM + mesh).
  4. Trust & Safety signs off on §10 (trusted-tenant fast-path) and §3 (new contexts).
  5. Regulator Liaison signs off on §15 (CDR pipeline) and the regulator-portal-service epic in 007.
  6. Product signs off on the new bounded contexts and their epic IDs in 007.
  7. Roadmap is updated to reflect the additional 6-month national-backbone GA window.