ADR-0004 — National-Backbone Resilience, Sovereignty, and Telecom-Grade Data Plane
Status: Proposed
Date: 2026-04-20
Owner: Platform Architecture Council
Deciders: CTO, Head of Platform, Head of SRE, Head of Security, Head of Trust & Safety, Regulator Liaison
Supersedes / extends: ADR-0001 (Kong), ADR-0002 (Keycloak/IdP), ADR-0003 (Compliance Layer)
References:
- docs/01-enterprise-architecture.md
- docs/03-platform-services.md
- docs/13-security-compliance-tenancy.md
- docs/reports/00-critique-and-gap-analysis.md
- docs/07-epics-and-user-stories.md
1. Context
Ghasi-SMS-Gateway is positioned as the national SMS backbone for Afghanistan and as a regional alternative to Twilio / Infobip / Sinch. The current architecture (ADR-0001 + ADR-0002 + ADR-0003) defines a single-region, multi-tenant SaaS with Kong + Keycloak + Compliance + NATS + Postgres. That topology is correct for an early-stage SaaS but insufficient for:
- Mission-critical government, banking, healthcare, and emergency-services traffic.
- ATRA / GSMA-grade regulator obligations (CDR, LI, MNP, sender-ID registry).
- Sovereign data-residency obligations (no PII off-shore by default).
- Telecom-grade SLAs (P99 OTP latency ≤ 3 s, availability ≥ 99.99%, RPO ≤ 5 s, RTO ≤ 5 min for OTP class).
- Per-MNO operational scale (5 Afghan operators, growing — each requiring isolated bind pools, TPS governors, and DLR pipelines).
- Telecom-grade fraud control (AIT, SIM-box, grey-route, OTP harvesting) at national scale.
This ADR defines the architectural uplifts the platform must adopt to meet that bar.
2. Decision (summary)
We will adopt the following architectural baseline for v2.0 (national-backbone GA), in addition to all prior ADRs:
- Multi-region active-active deployment across kbl (Kabul, primary) and mzr (Mazar-i-Sharif, secondary), with a sovereign-allowed cold-DR copy in dxb (Dubai). Both Afghan regions are read-write; geo-aware routing pins traffic to the closest healthy region.
- Control plane vs. data plane split: the orchestration / compliance / billing / portal services run on a control-plane node pool with classic HA; the SMPP / DLR / webhook / channel-router services run on a data-plane node pool with telecom-grade NICs, dedicated egress IP pools, and sticky bind affinity.
- Per-MNO connector pools: replace the single smpp-connector StatefulSet with one Deployment per (MNO × bind-direction): smpp-connector-{mno}-{tx|rx|trx} with bind-affinity, per-bind sequence-number management, per-bind concatenation buffers, and per-bind TPS governor backed by Redis.
- Twelve new bounded contexts (see §3): national SMS firewall, number intelligence (HLR + MNP), sender-ID registry, numbering / short-code, CDR mediation, cell-broadcast bridge, channel router (SMS / RCS / WhatsApp / Voice / Email), fraud intelligence, regulator liaison, developer portal, campaign manager, consent / DND ledger.
- HSM-backed key management (PKCS#11, FIPS 140-2 L3) for: platform JWT signing, SAML SP keys, webhook HMAC root, SMS-content envelope keys.
- Service mesh with SPIFFE/SPIRE workload identities (Istio or Linkerd) for east-west mTLS, replacing implicit namespace trust.
- NATS JetStream multi-cluster topology: super-cluster across kbl and mzr with stream mirrors; dxb is a leaf node with audit-only mirrors.
- PostgreSQL topology: Patroni-managed clusters per region with synchronous replication intra-region and logical replication inter-region for hot-standby of identity, compliance, sender-ID, and consent data; per-tenant schema sharding for messaging and CDR.
- CDR pipeline distinct from billing events: append-only object-storage CDRs (S3-compatible / MinIO) + ClickHouse for analytics + regulator export jobs.
- National traffic priority lanes (P0 emergency, P1 OTP, P2 transactional, P3 marketing, P4 broadcast) with end-to-end SLA budgets, dedicated NATS subjects, dedicated SMPP windows, and enforced TPS shaping.
- Trusted-tenant fast-path for vetted regulated tenants (banks, ministries, healthcare): cryptographically pre-approved templates that bypass the full compliance pipeline (replaced by signature verification + sample-mode AI shadow).
- Chaos engineering programme running weekly with an explicit GameDay scoreboard.
- 24×7 NOC with PagerDuty / Opsgenie integration, tiered escalation, and live MNO partnership channels.
3. New Bounded Contexts (12)
| Service | Purpose | Owner | Sync interface | Async topics |
|---|---|---|---|---|
| sms-firewall-service | Inbound MO firewall, transit firewall, AIT detection, SIM-box detection, grey-route exclusion, DND enforcement | Trust & Safety | gRPC FilterInbound, EvaluateTransit | firewall.alert.*, firewall.audit.* |
| number-intelligence-service | MSISDN → MNO resolution, ported-number cache (MNP), EIR/CEIR check, line-type classification | Messaging Core | gRPC Lookup | numint.cache.refreshed, numint.mnp.changed |
| sender-id-registry-service | Registration, KYC of registrant, verification, rotation, suspension, regulator export of all registered sender IDs | Trust & Safety + Regulator | REST + gRPC Verify | sender.id.registered, sender.id.suspended, sender.id.regulator.exported |
| numbering-service | Long-codes, short-codes, alpha-IDs, MSISDN inventory, leasing, reservation, expiry, recall | Commerce | REST | number.assigned, number.released, number.expired |
| cdr-mediation-service | Append-only CDR generation, TAP 3.12 / RAP export, regulator export | Commerce + Regulator | (none; async only) | cdr.generated.v1, cdr.exported.v1 |
| cbc-bridge-service | 3GPP TS 23.041 / ETSI EN 302 117 cell-broadcast bridge for civil emergencies | Government / Emergency | gRPC BroadcastEmergency (mTLS, government-only) | cbc.broadcast.requested, cbc.broadcast.dispatched, cbc.broadcast.acked |
| channel-router-service | Multi-channel fallback (SMS → MMS → RCS → WhatsApp BSP → Voice OTP → email), per-recipient profile and per-tenant policy | Messaging Core | gRPC RouteWithFallback | channel.fallback.taken, channel.delivery.confirmed |
| fraud-intel-service | ML scoring for AIT, SIM-box, OTP harvesting, grey-route arbitrage; fraud feed export | Trust & Safety | gRPC Score | fraud.detected.*, fraud.feed.updated.v1 |
| regulator-portal-service | Regulator-facing portal: license artifacts, monthly CDR submission, LI requests, complaint ingest | Regulator + Legal | REST (regulator only, mTLS) | regulator.report.submitted, regulator.complaint.received |
| developer-portal-service | Public dev portal: API docs, SDKs, sandbox, key management self-serve, consumption analytics | Product + DevRel | REST | devportal.signup, devportal.key.created |
| campaign-service | Campaigns: segments, templates, schedule, A/B, throttle, kill-switch, conversation sessions | Product | REST + gRPC EnqueueCampaign | campaign.created, campaign.dispatched, campaign.completed, campaign.killed |
| consent-ledger-service | Opt-in/opt-out ledger, DND registry sync, STOP-keyword handling, consent revocation propagation | Trust & Safety | gRPC CheckConsent, REST admin | consent.granted, consent.revoked, dnd.registry.synced |
4. Updated System Context (C4 L1)
5. Multi-region Topology
Region-affinity policy.
- Identity, sender-ID, consent, compliance rules: multi-master with conflict-free updates (logical replication + per-row LWW with HLC).
- Messaging hot path (orch.sms_messages, dlr.delivery_receipts, cdr.records): region-local primary, cross-region mirror is read-only.
- Routing decisions are region-pinned to keep MNO bind affinity (an MNO bind is owned by exactly one region at a time).
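The per-row LWW with HLC rule for multi-master control data can be illustrated with a small sketch. It is a minimal, in-memory model and not the replication implementation: the class names, the injectable `now_ms` clock, and the use of the region id as a tie-breaker are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True, order=True)
class HLC:
    """Hybrid logical clock timestamp, ordered by (wall_ms, counter, node)."""
    wall_ms: int
    counter: int
    node: str  # region id as deterministic tie-breaker (assumption)

class Clock:
    def __init__(self, node, now_ms):
        self.node = node
        self.now_ms = now_ms  # injectable physical clock, in milliseconds
        self.last = HLC(0, 0, node)

    def tick(self):
        """Timestamp a local write."""
        wall = max(self.now_ms(), self.last.wall_ms)
        counter = self.last.counter + 1 if wall == self.last.wall_ms else 0
        self.last = HLC(wall, counter, self.node)
        return self.last

    def observe(self, remote):
        """Merge a timestamp that arrived via cross-region replication."""
        wall = max(self.now_ms(), self.last.wall_ms, remote.wall_ms)
        if wall == self.last.wall_ms and wall == remote.wall_ms:
            counter = max(self.last.counter, remote.counter) + 1
        elif wall == self.last.wall_ms:
            counter = self.last.counter + 1
        elif wall == remote.wall_ms:
            counter = remote.counter + 1
        else:
            counter = 0
        self.last = HLC(wall, counter, self.node)
        return self.last

def lww_merge(local_row, remote_row):
    """Last-writer-wins: keep the row version with the greater HLC."""
    return remote_row if remote_row["hlc"] > local_row["hlc"] else local_row

# mzr's physical clock lags kbl by 10 ms, yet its later write still wins.
kbl = Clock("kbl", lambda: 1000)
mzr = Clock("mzr", lambda: 990)
first = {"hlc": kbl.tick(), "status": "active"}
mzr.observe(first["hlc"])
second = {"hlc": mzr.tick(), "status": "suspended"}
print(lww_merge(first, second)["status"])  # suspended
```

The point of the HLC is visible in the last lines: a causally later write in the region with the slower clock still orders after the write it observed, which plain wall-clock LWW would get wrong.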
Failover. Region failover is automatic for read paths (Cloudflare + GeoDNS), gated for write paths (manual / GameDay-tested cutover) so we never split-brain on idempotency keys.
6. Data-Plane Separation
| Plane | Workloads | Node pool | Network | SLO |
|---|---|---|---|---|
| Edge | Cloudflare, Kong | edge | Public | P99 ≤ 30 ms TLS handshake |
| Control | orchestrator, compliance, routing, billing, portals, IdP | np-ctrl (general-purpose) | private | 99.95% |
| Data | smpp-connector pool, dlr-processor, webhook-dispatcher, channel-router, cbc-bridge | np-data (telecom NICs, dedicated egress IPs whitelisted by MNOs) | private + MNO IPSec / leased | 99.99% per pool |
| Stateful | Postgres, NATS, Redis, ClickHouse, MinIO, HSM | np-state (local NVMe, anti-affinity) | private | 99.99% |
| Observability | Prometheus, Loki, OTel, Grafana, NOC | np-obs | private | 99.9% |
| Identity | Keycloak (HA), auth-service, compliance-ai (LLM) | np-identity (GPU node pool for LLM) | private + tightly NetworkPolicy'd | 99.95% |
7. SMPP Connector Pool Redesign
Replace the single smpp-connector StatefulSet with the following per-MNO topology:
smpp-connector-awcc-tx Deployment, replicas=N_tx
smpp-connector-awcc-rx Deployment, replicas=N_rx
smpp-connector-awcc-trx Deployment, replicas=N_trx (used for MO+MT-DLR co-bind where MNO requires)
smpp-connector-roshan-{tx|rx|trx} ...
smpp-connector-etisalat-af-{tx|rx|trx} ...
smpp-connector-mtn-af-{tx|rx|trx} ...
smpp-connector-salaam-{tx|rx|trx} ...
Each pod owns:
- Exactly one persistent SMPP bind (TX / RX / TRX) keyed by MNO + bind-id.
- A per-bind sequence-number monotonic counter (Redis-backed; survives pod restart with a 60 s warm-up).
- A per-bind sliding TPS governor (Redis sorted-set; tracks N seconds × M ms granularity per bind).
- A per-bind concatenation buffer with TTL = concat_window_seconds (default 60).
- A per-bind enquire_link cadence (default 30 s; configurable per MNO contract).
- A per-bind submit_sm window (default 100; learned per MNO under back-pressure).
- A per-bind reconnection back-off with decorrelated jitter (initial 1 s, max 60 s).
- Per-bind metrics: smpp_bind_state{mno,bind,direction}, smpp_window_inflight, smpp_submit_throttled_total{esme_status}, smpp_dlr_latency_seconds{mno,bind}.
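The per-bind sliding TPS governor listed above can be sketched as follows. This is an in-memory stand-in, assuming a one-second window; the class and method names are hypothetical. In a Redis sorted-set version, the deque operations would roughly correspond to ZREMRANGEBYSCORE, ZCARD, and ZADD on a per-bind key, executed atomically.

```python
from collections import deque

class SlidingWindowGovernor:
    """Admit at most max_tps * window_s submit_sm PDUs per sliding window."""

    def __init__(self, max_tps, window_s=1.0):
        self.window_s = window_s
        self.limit = int(max_tps * window_s)
        self.sent = deque()  # timestamps of admitted PDUs, oldest first

    def try_acquire(self, now):
        # Evict timestamps that have left the window.
        while self.sent and self.sent[0] <= now - self.window_s:
            self.sent.popleft()
        if len(self.sent) >= self.limit:
            return False  # caller should delay or re-queue the PDU
        self.sent.append(now)
        return True

gov = SlidingWindowGovernor(max_tps=3)
print([gov.try_acquire(t) for t in (0.0, 0.1, 0.2, 0.3, 1.05)])
# [True, True, True, False, True]
```

Passing `now` explicitly keeps the sketch deterministic; a real connector would read a monotonic clock.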
Bind affinity. A NATS subject smpp.{mno}.{direction}.{bindId} is consumed by exactly one pod (queue group of 1) so PDU sequence is strictly ordered per bind. Failover transfers consumer ownership atomically via JetStream consumer recreation.
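When bind ownership moves or a bind drops, the pod reconnects using the back-off listed among the per-bind properties above. A minimal sketch of decorrelated jitter with the stated bounds (initial 1 s, max 60 s) follows; the function name and the factor of 3 are assumptions based on the commonly described form of the algorithm, not values taken from this ADR.

```python
import random

def decorrelated_backoff(base=1.0, cap=60.0, rng=None):
    """Yield reconnect delays: the first is `base`, later ones are drawn
    uniformly from [base, 3 * previous] and clamped to `cap`."""
    rng = rng or random.Random()
    delay = base
    while True:
        yield delay
        delay = min(cap, rng.uniform(base, delay * 3))

# Example: the first five delays for one bind (values vary per run).
delays = decorrelated_backoff()
for attempt in range(5):
    print(f"attempt {attempt}: wait {next(delays):.2f} s")
```

Because each delay depends on the previous one rather than on the attempt number, pods that lose their binds at the same moment tend to spread their reconnect attempts instead of retrying in lock-step against the MNO's SMSC.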
8. National Traffic Priority Lanes
| Lane | Use cases | NATS subject | TPS budget per MNO | Submit→DLR P99 | Compliance treatment |
|---|---|---|---|---|---|
| P0 — Emergency | Civil emergency cell-broadcast, public-safety alerts | lane.p0.emergency.* | reserved 100% pre-emption | ≤ 1 s | bypass (replaced by gov-PKI signature verification) |
| P1 — OTP | OTP, 2FA, transactional codes | lane.p1.otp.* | reserved 30% of MNO TPS | ≤ 3 s | trusted-tenant fast-path; compliance shadow-mode |
| P2 — Transactional | Bank alerts, delivery notifications, healthcare | lane.p2.tx.* | reserved 30% | ≤ 10 s | full compliance, optimised |
| P3 — Marketing | Promotional, bulk | lane.p3.mkt.* | floating 30% | ≤ 60 s | full compliance, regulator quiet-window honoured |
| P4 — Broadcast | Authorised national broadcasts (non-emergency) | lane.p4.bcast.* | floating, throttled | ≤ 5 min | full compliance + secondary regulator approval |
The Routing Engine assigns lane based on tenant tier × content classification × explicit X-Priority-Lane header (subject to authorisation).
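To make the TPS budget column concrete, the sketch below splits one MNO's contracted TPS across lanes. The 30% shares for P1 to P3 come from the table; treating an active P0 event as taking the whole budget, and giving P4 the unallocated remainder, are assumptions of this sketch, as is the function name.

```python
# Percent of MNO TPS per lane, from the priority-lane table.
LANE_SHARE_PCT = {"p1": 30, "p2": 30, "p3": 30}

def lane_budgets(mno_tps, p0_active=False):
    """Return the TPS budget per lane for one MNO."""
    if p0_active:
        # Assumption: emergency traffic pre-empts every other lane.
        return {"p0": mno_tps, "p1": 0, "p2": 0, "p3": 0, "p4": 0}
    budgets = {lane: mno_tps * pct // 100 for lane, pct in LANE_SHARE_PCT.items()}
    budgets["p0"] = 0
    # Assumption: throttled broadcast traffic gets whatever is left over.
    budgets["p4"] = mno_tps - sum(budgets.values())
    return budgets

print(lane_budgets(1000))
# {'p1': 300, 'p2': 300, 'p3': 300, 'p0': 0, 'p4': 100}
```

P3 is described as floating, so a fuller model would let P1 and P2 borrow from it under load; that borrowing policy is not specified here and is left out.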
9. Quantitative NFR Anchors (non-negotiable)
| NFR | Target |
|---|---|
| Steady-state throughput | 5 M msg/h sustained per region; 10 M/h across both Afghan regions |
| Burst | 100 K msg/min for 5 min, 250 K msg/min for 30 s |
| Submit→DLR P99 (P1 OTP class) | ≤ 3 s end-to-end (incl. compliance) |
| Submit→DLR P99 (P2 transactional) | ≤ 10 s |
| Submit→202 ack latency P99 | ≤ 200 ms (Kong → orchestrator → 202) |
| Compliance EvaluateCompliance P99 | ≤ 800 ms (current spec is P95 ≤ 500 ms — keep) |
| Availability — Edge + Orchestrator | 99.99% monthly (≤ 4 m 22 s downtime) |
| Availability — SMPP per MNO bind | 99.95% monthly per-bind, 99.99% any-bind |
| RPO (OTP/transactional) | ≤ 5 s |
| RPO (compliance audit + CDR) | 0 (synchronous WAL ship) |
| RTO (any region) | ≤ 5 min for OTP class, ≤ 15 min for full platform |
| Webhook delivery first-attempt success | ≥ 99.9% within 5 s; full retry budget 24 h with exp back-off |
| Compliance hold-queue oldest age | ≤ 4 h (P95), ≤ 24 h auto-expiry hard limit |
| Audit-log retention | ≥ 13 months hot, ≥ 7 years cold (regulator) |
| CDR generation lag from DLR | ≤ 10 s P99 |
| Fraud detection mean-time-to-detect (AIT) | ≤ 15 min |
| Tenant compliance-score refresh | ≤ 15 min |
These are bound to alerts in 15-nfr-sla-catalog.md (to be authored).
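The availability and throughput anchors translate into concrete budgets with simple arithmetic. The sketch below assumes an average month of 365.25 / 12 days, which reproduces the 4 m 22 s figure in the table; the helper names are illustrative.

```python
SECONDS_PER_MONTH = 365.25 * 24 * 3600 / 12  # average month (assumption)

def downtime_budget_s(availability_pct):
    """Allowed downtime per month, in seconds."""
    return SECONDS_PER_MONTH * (1 - availability_pct / 100)

def fmt(seconds):
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes} m {secs} s"

print(fmt(downtime_budget_s(99.99)))  # 4 m 22 s
print(fmt(downtime_budget_s(99.95)))  # 21 m 54 s
print(round(5_000_000 / 3600))        # 1389 msg/s sustained per region
```

The same conversion shows that the 100 K msg/min burst target is about 1,667 msg/s, roughly 20% above the sustained per-region rate.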
10. Trusted-Tenant Fast Path
For pre-vetted tenants (banks, ministries, healthcare, mass transit, accredited brands):
- Tenant pre-registers a template catalog with content + variable schema.
- Each template is signed by compliance-engine after one-time human review and stored in compliance.approved_templates with a content fingerprint and template-id.
- At submit time, tenant supplies X-Template-Id + variable bindings.
- Orchestrator computes content fingerprint using template-id + variables; verifies fingerprint matches stored template hash.
- If match → EvaluateCompliance is called in shadow mode (logged, not blocking). Routing proceeds.
- If mismatch → fall back to full compliance evaluation.
- Periodically (1 in 1000 sample) full evaluation is run anyway for drift detection.
This delivers OTP-class latency without sacrificing compliance evidence.
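One possible reading of the fast-path check is sketched below. Several details are assumptions and not the platform's implementation: the fingerprint is SHA-256 over a canonical JSON form of the template, HMAC-SHA-256 with an in-process key stands in for the compliance-engine signature (which this ADR places behind an HSM), and variables are validated against per-variable regular expressions from the registered schema.

```python
import hashlib
import hmac
import json
import re

SIGNING_KEY = b"demo-only-key"  # placeholder; a real key would stay in the HSM

def fingerprint(template_id, body, schema):
    canonical = json.dumps({"id": template_id, "body": body, "schema": schema},
                           sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

def sign(fp):
    return hmac.new(SIGNING_KEY, fp.encode(), hashlib.sha256).hexdigest()

def approve(template_id, body, schema):
    """Record stored after the one-time human review."""
    fp = fingerprint(template_id, body, schema)
    return {"template_id": template_id, "body": body, "schema": schema,
            "fingerprint": fp, "signature": sign(fp)}

def evaluate_fast_path(record, variables):
    """Return 'shadow' if the fast path applies, else 'full'."""
    fp = fingerprint(record["template_id"], record["body"], record["schema"])
    if not hmac.compare_digest(fp, record["fingerprint"]):
        return "full"  # stored template was altered after approval
    if not hmac.compare_digest(sign(fp), record["signature"]):
        return "full"  # signature does not verify
    if set(variables) != set(record["schema"]):
        return "full"  # missing or unexpected variables
    for name, pattern in record["schema"].items():
        if not re.fullmatch(pattern, variables[name]):
            return "full"  # value outside the approved shape
    return "shadow"

otp = approve("otp-01", "Your code is {code}", {"code": r"\d{6}"})
print(evaluate_fast_path(otp, {"code": "123456"}))        # shadow
print(evaluate_fast_path(otp, {"code": "visit evil.example"}))  # full
```

Constraining variable values matters as much as the template hash: without it, a tenant could smuggle arbitrary content through an approved template's free-text slot.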
11. HSM-Backed Key Hierarchy
| Key class | HSM-held? | Rotation | Notes |
|---|---|---|---|
| Platform JWT root signing keys (RS256) | Yes | 90 d (30 d previously) | kid exposed via JWKS; HSM signs, HSM never exports |
| SAML SP signing keys (per tenant) | Yes | Annual or on-demand | Per-tenant key under shared HSM partition |
| Webhook HMAC root | Yes | 180 d | Per-tenant secrets derived (HKDF) |
| SMS-content envelope keys (per-tenant DEK) | Yes (KEK) / Postgres TDE (DEK) | KEK 90 d; DEK 30 d | Envelope encryption; KEK in HSM, DEK in Postgres encrypted |
| TLS certs | Public CA (Cloudflare) | Auto | — |
| mTLS (service mesh) | SPIRE-issued workload SVIDs | 1 h rotation | — |
| Database TDE master key | Yes | 365 d | — |
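The HKDF derivation noted for per-tenant webhook secrets can be sketched with the standard library. The extract-and-expand steps follow RFC 5869 with HMAC-SHA-256; the salt, the `info` layout, and holding the root in process memory are assumptions for illustration only, since with an HSM-held root the derivation would run inside the HSM.

```python
import hashlib
import hmac

def hkdf_sha256(ikm, salt, info, length=32):
    """HKDF (RFC 5869) using HMAC-SHA-256: extract, then expand."""
    if length > 255 * 32:
        raise ValueError("requested output too long for HKDF-SHA-256")
    # Extract: condense the input keying material into a pseudorandom key.
    prk = hmac.new(salt or b"\x00" * 32, ikm, hashlib.sha256).digest()
    # Expand: T(n) = HMAC(PRK, T(n-1) | info | n)
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]

def tenant_webhook_secret(root, tenant_id, key_version):
    """Derive one tenant's webhook HMAC secret from the root (hypothetical layout)."""
    info = f"webhook-hmac|{tenant_id}|v{key_version}".encode()
    return hkdf_sha256(root, salt=b"ghasi-webhook", info=info)

root = b"\x01" * 32  # demo value only
print(tenant_webhook_secret(root, "tenant-a", 1).hex())
```

Binding the tenant id and a key version into `info` means a 180-day root rotation or a single-tenant re-key yields fresh secrets without storing any per-tenant key material.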
12. Service Mesh + Zero Trust
Adopt Istio (or Linkerd) with:
- Automatic mTLS between every pod (STRICT mode).
- SPIRE as workload identity provider issuing SVIDs per service account.
- AuthorizationPolicies per service: explicit from.principals allow-lists; deny-by-default.
- Per-namespace egress gateways (no pod talks to the Internet directly except egress gateways).
- Telemetry into the existing OTel collector; no separate observability stack.
13. NATS JetStream — Multi-cluster
super-cluster: ghasi-jetstream
  cluster ghasi-jetstream-kbl (3 nodes; primary streams)
  cluster ghasi-jetstream-mzr (3 nodes; mirrored streams)
leaf-nodes:
  ghasi-jetstream-dxb (audit-only mirrors of compliance.audit.v1, cdr.*, regulator.*)
- Streams sms.outbound.*, sms.dlr.inbound, lane.p*.* are region-local (not mirrored; region-pinned).
- Streams compliance.audit.v1, cdr.generated.v1, cdr.exported.v1, regulator.*, auth.events, consent.*, sender.id.*, firewall.audit.*, fraud.detected.* are mirrored to the peer Afghan region.
- Audit + CDR mirror to dxb is append-only WORM (object storage immutable bucket-policy).
14. Postgres Topology
- Per region: Patroni-managed cluster of 3 (1 primary, 1 sync replica, 1 async replica).
- Per service: dedicated logical database within the cluster (or dedicated cluster for very hot services like orch, cdr).
- Cross-region: logical replication for control tables (identity, sender-id, compliance rules, consent, tenant config). Hot messaging tables stay region-local.
- TDE for data at rest (Postgres pgcrypto + HSM-held KEK).
- Row-level security (RLS) enforced per tenant for all tenant-scoped tables (already in DoD, must be audited).
- WAL archived continuously to object storage (kbl + mzr cross-mirrored).
15. CDR Pipeline (separate from billing events)
DLR → dlr-processor → NATS billing.events      [for billing]
DLR → dlr-processor → NATS cdr.generated.v1    [for mediation]
                              │
                              ▼
                    cdr-mediation-service
                      · normalises to Ghasi CDR canonical schema (JSON Lines)
                      · partitions by hour into MinIO bucket cdr-{region}-{yyyymmdd}/
                      · writes daily TAP 3.12 / RAP roll-up
                      · signs daily file with regulator-approved key
                      · exports nightly to ATRA SFTP / API
                      · ingests into ClickHouse for analytics
CDR is immutable; corrections are appended as adjustment records.
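A minimal sketch of the partitioning and the append-only correction rule follows. The bucket pattern cdr-{region}-{yyyymmdd}/ is taken from the pipeline above; the hour prefix inside the bucket, the use of UTC for partition boundaries, and every field name in the records are assumptions, since the canonical schema is not defined in this ADR.

```python
import json
from datetime import datetime, timedelta, timezone

def cdr_object_key(region, event_time):
    """Daily bucket per region, hourly prefix inside it (layout assumed)."""
    t = event_time.astimezone(timezone.utc)  # assumption: partitions cut in UTC
    return f"cdr-{region}-{t:%Y%m%d}/hour={t:%H}/part.jsonl"

def cdr_line(record):
    """Serialise one record as a single JSON Lines row."""
    return json.dumps(record, sort_keys=True, separators=(",", ":"))

def adjustment(original_id, field, new_value, reason):
    """Corrections never rewrite a CDR; they append a record pointing at it."""
    return {"type": "adjustment", "adjusts": original_id,
            "field": field, "value": new_value, "reason": reason}

kabul = timezone(timedelta(hours=4, minutes=30))
when = datetime(2026, 4, 20, 2, 0, tzinfo=kabul)
print(cdr_object_key("kbl", when))  # cdr-kbl-20260419/hour=21/part.jsonl
print(cdr_line(adjustment("cdr-123", "status", "DELIVERED", "late DLR")))
```

The example shows why the partition time zone needs to be pinned down with the regulator: 02:00 Kabul time on 20 April falls into the 19 April bucket when partitions are cut in UTC.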
16. Consequences
Positive
- True national-asset behaviour: regional resilience, regulator integration, sovereign data residency.
- Telecom-grade SLAs are achievable and provable with the new SLA catalog and chaos drills.
- Genuine differentiation against Twilio/Infobip/Sinch on (a) sovereign on-prem AI compliance, (b) sender-ID national authority, (c) CBC integration, (d) tenant compliance scoring exposed.
- Eliminates the major risks identified in 00-critique-and-gap-analysis.md.
Negative
- Significant capex (HSM appliances per region, GPU nodes for LLM, dedicated leased lines / IPSec to MNOs, second region buildout).
- Operational complexity grows (12 new services, multi-region failover GameDays, mesh, HSM, regulator integration).
- Time-to-GA extends by ~6 months relative to the single-region baseline.
- Requires hiring SRE, security engineering, and regulator-liaison roles.
Risks
- Regulator (ATRA) requirements are still evolving; CDR / LI schema may change. Mitigation: design cdr-mediation-service with pluggable export schemas.
- HSM vendor lock-in. Mitigation: PKCS#11 abstraction; multi-vendor procurement.
- Multi-region Postgres conflict resolution edge cases. Mitigation: keep messaging hot path region-local; only control-plane data is multi-master.
17. Acceptance / Done
This ADR is "Approved" once:
- Architecture Council signs §2 list.
- SRE confirms NFRs in §9 are budgeted (capacity + cost).
- Security signs §11–§12 (HSM + mesh).
- Trust & Safety signs §10 (trusted-tenant fast-path) and §3 (new contexts).
- Regulator Liaison signs §15 (CDR pipeline) and regulator-portal-service epic in 007.
- Product signs the new bounded contexts and their epic IDs in 007.
- Roadmap is updated to reflect the additional 6-month national-backbone GA window.