Ghasi-SMS-Gateway — Critique & Gap Analysis
Version: 1.1 (re-scanned baseline)
Status: Approved (baseline critique)
Owner: Platform Architecture + Trust & Safety + SRE
Last Updated: 2026-04-20
Scope reviewed: docs/01-…14-*.md, docs/architecture/ADR-000{1,2,3,4}, all 14 service docs under services/<svc>/ (incl. _report.md, SERVICE_OVERVIEW.md, JIRA_IMPORT.csv), docs/standards/, docs/roadmap/ROADMAP.md.
Change log
- v1.1 (2026-04-20): Re-scanned the repo after correcting an inventory error. The earlier draft overstated the stub problem (it claimed 11 of 14 services were stubs and that 07-epics-and-user-stories.md was empty). Both claims are wrong: every service has a populated SERVICE_OVERVIEW.md (3KB–12KB) and a _report.md carrying full Jira-ready epics & stories under the EP-{PREFIX}-NN / US-{PREFIX}-NNN scheme; 07-…md is populated with five cross-cutting platform epics (EP-PLAT-01..05). The critique below is recalibrated to that real baseline. The architectural / NFR / national-backbone critique stands.
- v1.0 (2026-04-20): Initial draft.
Companion deliverables:
- 07-epics-and-user-stories.md — extended catalog (existing IDs preserved + national-backbone additions)
- 07-epics-and-user-stories.JIRA_IMPORT.csv — Jira import (new/updated items only)
- architecture/ADR-0004-national-backbone-resilience.md — national resilience blueprint
0. Verdict
The platform has a strong micro-architectural skeleton (Kong → orchestrator → compliance → routing → SMPP, NATS JetStream backbone, fail-closed compliance, Keycloak-brokered IdP) and a substantial Jira-ready backlog already authored (60+ epics, 200+ stories, consistent IDs across 14 services). It is materially better than the average vendor starting point.
What is missing is not engineering discipline — it's the telecom-grade national-asset DNA: multi-region resilience, regulator integration, fraud / AIT / SIM-box defence, sender-ID national registry, HLR/MNP authority, cell-broadcast for emergencies, multi-channel fallback, sovereign HSM key custody, service-mesh zero trust, quantitative NFR/SLA catalog, and the operational machinery (NOC, chaos drills, status page, change management) that distinguishes an enterprise SaaS from a national backbone.
One-line summary: strong foundations, mature backlog, but architected for SaaS, not for a national asset; the gap to Twilio/Infobip/Sinch parity is now small — the gap to surpassing them is closed by the additions in ADR-0004 and the new epics in 007.
1. What is Strong (keep, do not regress)
| # | Strength | Evidence | Why it matters |
|---|---|---|---|
| 1 | Mature, ID-consistent backlog already exists across 14 services using EP-{PREFIX}-NN and US-{PREFIX}-NNN | services/*/_report.md — 60+ epics, 200+ stories, all with acceptance criteria and story points | Most "national backbone" pitches do not have this much actually written down. |
| 2 | First-class fail-closed Compliance Layer between orchestrator and routing | 01 §3.2, §4; compliance-engine is the most fully-specified service (12.6 KB SERVICE_OVERVIEW, 763-line _report.md, dedicated JIRA_IMPORT.csv with 10 epics + 45 stories) | Twilio / Infobip treat moderation as a sidecar; you treat it as a tier. Most differentiated control today. |
| 3 | Pluggable IdP abstraction with Keycloak as broker and per-tenant OIDC/SAML | 01 §3.1, ADR-0002 | Enterprise customers (banks, ministries) demand SSO; Twilio's federation story is weaker. |
| 4 | NATS JetStream as the only async substrate, with explicit DLQ + durable consumer policy | 01 §6, 13 §8 | Avoids the Kafka operational tax; correct choice for sub-30 ms inter-service hops. |
| 5 | gRPC for hot paths (Routing Engine SelectOperator, Compliance EvaluateCompliance) with REST for admin/CRUD | 01 §3.2; ADR-0002, ADR-0003 | Right choice for sub-50 ms decisions. |
| 6 | 17-doc service template + Definition of Done with mutation-test thresholds, RLS/tenant guards, AI-provenance VOs, WCAG 2.2, ICU MessageFormat | standards/SERVICE_TEMPLATE.md, standards/DEFINITION_OF_DONE.md | Better engineering discipline than most regional competitors ship with. |
| 7 | Real telecom rigour already authored in smpp-connector/_report.md (bind modes, enquire_link heartbeat, exponential backoff, GSM7/UCS2 encoding, CSMS segmentation, message_payload TLV, message correlation persistence, Redis sliding-window TPS, primary/backup failover) | services/smpp-connector/_report.md US-SC-001..015 | Far stronger than the typical "build SMPP later" placeholder. |
| 8 | 13-month immutable compliance audit log | 01 §6 retention notes; compliance-engine/DATA_MODEL.md | Telecom-regulator-defensible. |
| 9 | Tenant compliance scoring (0–100) + risk tiering + automated thresholds | 13 §4; EP-CE-05 epic | Twilio has none of this surfaced to tenants. |
| 10 | mTLS for gRPC, HMAC for webhooks, explicit rotation cadence per secret class | 13 §5, §7 | Enterprise table stakes done right. |
2. What is Weak (must improve before national-backbone GA)
2.1 Documentation rot and contradictions (real, post re-scan)
The "stub" claim from v1.0 was wrong. The remaining real issues:
- docs/02-ddd-bounded-contexts.md is mis-titled: its content is the testing strategy, not bounded contexts. Either rename it or write the actual context map. The DDD context map is referenced everywhere but exists nowhere.
- docs/04, 05, 08, 09, 10, 11, 12, 14 are 17-line placeholders (event-driven-architecture, api-design, frontend-design-guidelines, frontend-workflows, data-models, risks-and-tradeoffs, observability-telemetry, testing-strategy-qa). Not optional for a national backbone. (These are targeted topical docs; leaving them unwritten is a documentation gap, not a backlog gap.)
- Auth contradiction with the customer portal: customer-portal/_report.md US-CUST-01-01 still says login uses Firebase (Firebase Auth signInWithEmailAndPassword, POST /v1/auth/firebase). This contradicts ADR-0002 / PLT-ADR-009 (Keycloak baseline; Firebase is legacy-only). → Update story to Keycloak OIDC PKCE flow with Firebase as a feature-flagged legacy fallback.
- auth-service/_report.md scope header says: "Firebase federation, session management, and account provisioning". Stories are Keycloak-aware but the scope sentence drifts; clean it up.
- compliance-engine epic-ID mismatch: services/compliance-engine/_report.md uses canonical IDs EP-CE-01..10 and US-CE-001..045. The same service's services/compliance-engine/JIRA_IMPORT.csv uses CE-E1..10 and CE-1..45. One source of truth must win. Recommendation: EP-CE-* / US-CE-* is canonical (matches the platform EP-{PREFIX}-NN / US-{PREFIX}-NNN registry in 07-…md). The CSV must be regenerated.
- _sources/<svc>/epics.md legacy artifacts still contain pre-Keycloak language (e.g., auth-service/_sources/auth-service/epics.md calls AUTH-EPIC-001 "Firebase Integration & Account Provisioning"). They are clearly legacy migration aids per README.md, but they should be deleted or marked DEPRECATED to prevent future confusion.
2.2 Architecture: missing layers (national-backbone deltas)
The container diagram is missing tiers a national backbone requires:
| Missing tier | Why it must exist | New service proposed |
|---|---|---|
| National SMS Firewall (inbound + transit) | Detect and block grey-route A2P, AIT (Artificially Inflated Traffic), SIM-box originators, fraudulent OTP harvesting, inbound spam at the border. Today the platform mediates only outbound to MNOs — there is no inbound firewall. | sms-firewall-service |
| HLR/HSS lookup + MNP authority | Without HLR/MSISDN-to-MNO resolution and Mobile Number Portability, "operator selection" is a guess. Twilio buys this; you must own it because Afghanistan has no neutral MNP authority — Ghasi can be it. | number-intelligence-service |
| Sender-ID Registry | India's DLT, UAE's TRA, KSA's CITC, US 10DLC all require a registered-sender model. ATRA does not yet — Ghasi should ship the registry. | sender-id-registry-service |
| Cell Broadcast / Emergency Alerts adjunct | A national gateway is the natural home for 3GPP TS 23.041 / ETSI EN 302 117 cell-broadcast for civil alerts. Requires MNO RAN integration; unaddressed. | cbc-bridge-service |
| Lawful Intercept + Regulator Reporting | ATRA will require LI (ETSI TS 102 232) and periodic CDR reporting. Not in any doc. | regulator-portal-service + cdr-mediation-service |
| MMS / RCS / WhatsApp Business / Voice-OTP fallback channels | Prompt asks for "intelligent fallback (SMPP → HTTP → USSD)"; no channel beyond SMPP is in the architecture. | channel-router-service |
| CDR (Charging Data Record) pipeline distinct from billing events | Telecom CDR has specific schema (TAP 3.12, RAP) and is the artifact regulators audit. billing.events is not a CDR. | cdr-mediation-service |
| Number-pool / short-code lifecycle | Long-codes, short-codes, alpha-IDs, MSISDN inventory, leasing, reservation, expiry, recall — none modelled. | numbering-service |
| DND / Consent / STOP-keyword national ledger | Compliance has TEMPORAL/RECIPIENT rules but no national consent ledger; STOP keyword routing undefined. | consent-ledger-service |
| Fraud / AIT / SIM-box detection | Twilio loses ~3% of revenue to AIT; without prevention you will too. | fraud-intel-service |
| Multi-region active-active data plane + cross-border sovereign DR | Single region with "DR optional" (01 §10 A-001/A-002) is unacceptable for a national asset. | Topology in ADR-0004 |
| Tenant VPC peering / Direct Connect ingress | Banks and ministries will not accept messaging from the public internet. | Edge enhancement in ADR-0004 |
| Quota / TPS shaping engine — per-tenant × per-operator × per-shortcode × per-priority-lane | Today TPS is a single Redis namespace owned by smpp-connector. | New tier in routing-engine (epic added) |
| Government priority-lane + emergency override | Mentioned nowhere. | EP-PLAT-NB-08 (lane policy) + cbc-bridge-service |
| Verify / Lookup / Notify / Conversation APIs (Twilio parity) | Not in scope; required for developer adoption. | New epics under developer-portal-service + campaign-service |
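The per-tenant × per-operator × per-shortcode × per-priority-lane shaping tier proposed in the table above could be sketched as one token bucket per composite key. This is a minimal in-memory illustration under stated assumptions, not the routing-engine design: the class names, the key layout, and the single shared rate are hypothetical, and a real deployment would presumably hold bucket state in Redis and load rates from each MNO contract.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

# Hypothetical composite key: (tenant_id, operator_id, short_code, priority_lane).
ShapingKey = Tuple[str, str, str, str]


@dataclass
class TokenBucket:
    rate_per_s: float   # sustained refill rate in messages per second
    burst: float        # bucket capacity, i.e. the largest permitted burst
    tokens: float       # tokens currently available
    updated_at: float   # clock reading at the last refill

    def try_take(self, now: float) -> bool:
        """Refill for the elapsed time, then consume one token if possible."""
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate_per_s)
        self.updated_at = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class TpsShaper:
    """One independent bucket per shaping key, created full on first use."""

    def __init__(self, rate_per_s: float, burst: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._rate = rate_per_s
        self._burst = burst
        self._clock = clock  # injectable so behaviour can be tested deterministically
        self._buckets: Dict[ShapingKey, TokenBucket] = {}

    def allow(self, key: ShapingKey) -> bool:
        now = self._clock()
        bucket = self._buckets.get(key)
        if bucket is None:
            bucket = TokenBucket(self._rate, self._burst, self._burst, now)
            self._buckets[key] = bucket
        return bucket.try_take(now)
```

Because each key owns its bucket, an exhausted marketing lane for one tenant does not consume the OTP lane's budget on the same operator, which is the isolation a single shared TPS namespace cannot give.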
2.3 Architecture: weak choices
- Single-region Postgres with "HA replica" is not telecom-grade. Postgres needs (a) multi-AZ synchronous + multi-region async (Patroni or managed equivalent), (b) a CDR/event store on object storage with a separate cold-tier query layer (ClickHouse). The note in 01 §10 A-003 ("ClickHouse is optional scaffolding") must be promoted to a hard requirement.
- smpp-connector as a single StatefulSet with 2 replicas is naive at national scale. Each MNO bind needs (i) per-bind TPS shaping, (ii) per-bind windowing, (iii) per-bind sequence-number management, (iv) per-bind enquire_link cadence, (v) per-bind reconnection backoff with jitter, (vi) per-bind concatenation buffers, (vii) per-bind UDH/TLV handling. The current _report.md already covers most of these as logic (US-SC-001..015); what is missing is the deployment topology: per-MNO per-direction Deployment/StatefulSet pools with bind affinity. → EP-SC-05 and EP-SC-06 added.
- Single NATS JetStream cluster with no cross-region mirror (01 §3, 03 §2). For a regulated national service, JetStream needs leaf-node + mirror replication into a DR region. → ADR-0004 §13.
- Routing Engine today exposes COST / PRIORITY / FAILOVER strategies (US-RE-005..007), which is a strong start. Missing: live operator quality scoring (delivery rate, latency, cost), per-route cost tables that vary by hour/day, per-tenant route preferences and exclusions, regulatory route restrictions, grey-route exclusion, QoS lanes (OTP < 3 s, marketing best-effort, government priority). → EP-RE-05 and EP-RE-06 added.
- compliance-engine AI fallback to "external LLM" opens a data-residency hole (SCT-001, SCT-003) acknowledged but not closed. For Afghanistan, all PII-bearing inference must remain on-prem; external fallback must be feature-flagged off by default and per-tenant opt-in only. → EP-CE-09 (Local LLM Platform) is in place; add story to disable external fallback by default and require explicit per-tenant opt-in.
- No long-SMS encoding/segment-pricing alignment with Pashto/Dari (UCS-2 → 70 chars/segment vs. GSM7 → 160). smpp-connector handles encoding correctly (US-SC-007); billing-service does not have a story for UCS-2 segment-pricing parity. Risk: customers billed wrong on Pashto/Dari. → New story US-BILL-037.
- webhook-dispatcher retries are not specified with explicit back-off, signing-key rotation, or back-pressure policy. Existing EP-HOOK-01..04 cover delivery and HMAC but not back-pressure under stampede. → EP-HOOK-05 added.
- api-gateway (Kong) has 29 stories (US-KONG-01..29), which is strong. Missing: JA3 fingerprint blocking, adaptive rate-limit per consumer+key dimensions, mTLS upstream policy spec for sensitive routes (compliance, regulator). → EP-KONG-06 added.
- No idempotency contract for inbound DLR in dlr-processor. Operators sometimes re-deliver DLR PDUs. → US-DLR-015 added (dedup by (operator_id, message_id, status, timestamp_bucket)).
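The dedup key named for US-DLR-015 could be exercised roughly as follows. This is a sketch under assumptions: the 60-second bucket width, the class and function names, and the in-process set are all hypothetical, and the story itself does not fix any of them. A shared store with expiry would be needed across dlr-processor replicas.

```python
from datetime import datetime, timezone
from typing import Set, Tuple

# Assumed bucket width; US-DLR-015 names a timestamp bucket but not its size.
BUCKET_SECONDS = 60

DedupKey = Tuple[str, str, str, int]


def dlr_dedup_key(operator_id: str, message_id: str, status: str,
                  received_at: datetime) -> DedupKey:
    """Build (operator_id, message_id, status, timestamp_bucket)."""
    bucket = int(received_at.timestamp()) // BUCKET_SECONDS
    return (operator_id, message_id, status, bucket)


class DlrDeduplicator:
    """In-memory illustration only; entries are never expired here."""

    def __init__(self) -> None:
        self._seen: Set[DedupKey] = set()

    def is_first_delivery(self, operator_id: str, message_id: str,
                          status: str, received_at: datetime) -> bool:
        key = dlr_dedup_key(operator_id, message_id, status, received_at)
        if key in self._seen:
            return False   # re-delivered DLR PDU: acknowledge but do not reprocess
        self._seen.add(key)
        return True
```

One consequence of bucketing worth settling in the story: two copies of the same DLR that straddle a bucket boundary produce different keys and are both processed, so downstream handlers still need to tolerate an occasional duplicate.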
2.4 NFRs and SLAs — almost entirely missing
The platform documents:
- 90/80/60% test coverage thresholds.
- gRPC P95 ≤ 500 ms for compliance evaluation.
- Some per-service P95 latency targets (e.g., orch GET P95 ≤ 50 ms, billing usage P95 ≤ 300 ms, RBAC P95 ≤ 50 ms).
- "30 days Prometheus retention", "14 days Loki", "7 days OTel".
That is essentially all the quantitative NFRs the platform commits to platform-wide. Missing:
- Throughput targets (msg/s steady, msg/s burst, peak-hour, peak-second).
- Submit-to-DLR P50/P95/P99 latency budgets per traffic class (OTP, transactional, marketing, broadcast).
- DLR-receipt SLA per operator.
- Availability SLOs (99.9 / 99.95 / 99.99) per service tier.
- Error-budget policy.
- RPO/RTO commitments (A-002 admits "TBD; assumed RPO 1h / RTO 4h"; for OTP traffic, RPO 1h is unacceptable).
- Per-tenant quotas / fair-use defaults.
- Per-priority-lane SLAs.
- Concurrency, queue-depth, and back-pressure thresholds.
- Maximum acceptable compliance HOLD review latency.
- Maximum acceptable webhook-delivery time.
- TPS-shaping precision and burst policy per MNO contract.
→ Fixed by EP-PLAT-NB-09 — NFR/SLA Catalog & Error-Budget Policy in 007 and a new 15-nfr-sla-catalog.md doc (to be authored).
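To make the availability tiers listed above concrete, the error budget each one implies can be computed directly. The sketch below assumes a 30-day rolling window, which is a common convention but not something the platform docs specify; the function name is illustrative.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of full unavailability permitted by an availability SLO."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - slo_percent) / 100.0


# The three tiers named in the missing-NFR list.
for slo in (99.9, 99.95, 99.99):
    print(f"{slo}% over 30 days -> {error_budget_minutes(slo):.2f} min")
```

Over 30 days this gives 43.20, 21.60, and 4.32 minutes respectively, which is the kind of number the error-budget policy in EP-PLAT-NB-09 would need to attach to each service tier.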
2.5 Security gaps (national-asset bar)
| Gap | Required uplift | Epic |
|---|---|---|
| No HSM / KMS-backed signing for platform JWT, SAML SP keys, webhook HMAC | PKCS#11 HSM (FIPS 140-2 L3) for: JWT signing, SAML SP keys, webhook HMAC root, SMS-content envelope keys. Vault stays for transit & lifecycle, HSM holds master. | EP-PLAT-NB-04 |
| No Zero-Trust east-west policy specified | Service-mesh mTLS + SPIFFE/SPIRE workload identities (Istio or Linkerd). | EP-PLAT-NB-05 |
| No DDoS / abuse defence at the edge beyond Cloudflare WAF | Per-tenant per-API-key adaptive rate-limit, layer-7 fingerprinting, JA3 blocking, tarpit lane. | EP-KONG-06 |
| No threat model artifact | STRIDE per service under docs/security/threat-models/. | EP-PLAT-NB-10 |
| No SBOM, no signed images | Sigstore/Cosign image signing + SBOM (CycloneDX) per build, verified by Kyverno/Gatekeeper. | EP-PLAT-NB-11 |
| No CIS-Benchmarked node + Pod Security Standards profile | All workloads restricted PSA, runAsNonRoot, read-only root FS, seccomp RuntimeDefault. | EP-PLAT-NB-11 |
| No secrets-in-source CI gate | gitleaks + trufflehog as required CI step (claimed manually in 13 §7 but no enforcement). | EP-PLAT-NB-11 |
| Lawful intercept and SIEM forwarding undefined | Security-relevant events to a SIEM (Splunk/ELK/QRadar) with WORM retention. | EP-REG-01 |
| Customer-portal session security headers missing | CSP, COEP, COOP, Trusted Types, Subresource Integrity (SRI). | EP-CUST-07 |
2.6 Operational excellence — partial
- Grafana dashboards listed but no NOC dashboard (single pane: per-MNO bind health, queue depth, TPS shaping, DLR latency heatmap, compliance hold queue, fraud signals, regulator alerts). → EP-PLAT-NB-12.
- No runbook catalogue. Compliance engine mentions runbooks; no other service does. → covered in DoD; track in EP-PLAT-NB-12.
- No chaos engineering programme. → EP-PLAT-NB-13.
- No capacity model. What does 10 M msg/h mean in NATS bytes/s, Postgres rows/s, Redis ops/s, Postgres WAL/h, cluster CPU cores, MNO TPS budget? Nowhere computed. → EP-PLAT-NB-09.
- Status page not specified. → EP-PLAT-NB-14.
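The capacity question raised above starts with simple arithmetic: 10 M msg/h is about 2,778 msg/s sustained. A first-pass model might look like the sketch below. Every per-message multiplier in it (events per message, event size, rows and Redis operations per message) is a placeholder assumption to be replaced with measured values; none comes from the platform docs.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class CapacityAssumptions:
    # All values are illustrative placeholders, not measurements.
    events_per_msg: int = 4        # NATS events emitted per submitted message
    avg_event_bytes: int = 1024    # average serialized event size
    rows_per_msg: int = 3          # Postgres rows written per message
    redis_ops_per_msg: int = 5     # Redis operations per message


def capacity_per_second(msgs_per_hour: float,
                        a: CapacityAssumptions = CapacityAssumptions()
                        ) -> Dict[str, float]:
    """Translate an hourly message volume into per-second substrate load."""
    msgs_per_s = msgs_per_hour / 3600.0
    return {
        "msgs_per_s": msgs_per_s,
        "nats_bytes_per_s": msgs_per_s * a.events_per_msg * a.avg_event_bytes,
        "postgres_rows_per_s": msgs_per_s * a.rows_per_msg,
        "redis_ops_per_s": msgs_per_s * a.redis_ops_per_msg,
    }


if __name__ == "__main__":
    for name, value in capacity_per_second(10_000_000).items():
        print(f"{name}: {value:,.0f}")
```

The steady-state figure is only the floor; the peak-second and burst targets listed in §2.4 would multiply it, and the MNO TPS budget caps how much of it can leave the platform.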
2.7 Product / commercial gaps
- billing-service has 36 stories (EP-BILL-01..05): a strong start, with usage queries, invoicing, pricing CRUD, alerts. Missing for national backbone:
  - SLA-backed pricing tiers (committed throughput, reserved capacity, government bulk, OTP premium).
  - Tax engine (VAT/national sales tax).
  - AFN-USD multi-currency + FX policy.
  - Pre-paid wallet + post-paid invoicing dual model.
  - Credit notes / refunds / dispute workflow.
  - Revenue assurance / leakage detection.
  - → EP-BILL-06 and EP-BILL-07 added.
- No marketplace / template catalog: pre-approved templates (DLT-style) are how India and the Gulf solved spam at scale. → EP-CAMP-01..04 (new service).
- No partner/reseller programme (sub-tenants under a tenant). → EP-AUTH-06 extension for sub-org model.
- No SDKs named (Node, Python, Java, .NET, Go, PHP, Flutter, Android, iOS). → EP-DEV-01..04 (new service).
- No public developer portal beyond customer-portal. → EP-DEV-01.
- No template-based personalisation engine with merge-fields + conditional content (Twilio Notify equivalent). → EP-CAMP-02.
- No campaign management UI (segments, schedule, A/B, throttle, kill-switch). → EP-CAMP-01.
- No 2-way SMS / inbound MO routing to tenant flow specified. → EP-CHAN-03 (new service).
- No conversational session manager (sticky alpha-ID ↔ MSISDN ↔ tenant correlation across MO/MT pairs). → EP-CHAN-04.
- No Verify API (managed OTP, Twilio Verify equivalent). → EP-DEV-05 or sub-epic of channel-router-service.
- No Lookup API (number intelligence as a tenant-callable API). → EP-NI-04.
2.8 Regulatory / sovereignty gaps
- ATRA (Afghanistan Telecom Regulatory Authority) is not named anywhere; reporting cadence, CDR format, and licensing posture are undefined. → EP-REG-01..03.
- Data residency policy is one open point (SCT-003); it must be a first-class policy, not a TODO. → EP-PLAT-NB-04 and ADR-0004 §5.
- GDPR / TCPA / GSMA RCS-BM compliance is mentioned in one paragraph; it needs an actual control catalogue. → covered by EP-CE-06 (existing) extended.
- PII tokenisation (SCT-002) is open; for SMS bodies traversing compliance/AI, a deterministic tokeniser (e.g., FF1) with HSM-held key is the only defensible answer. → EP-PLAT-NB-04.
- Number-portability legal posture unaddressed. → EP-NI-02.
2.9 Edge cases not in any flow
These are gaps in the flows, not always in the IDs. New stories added under appropriate epics:
- Operator returns ESME_RTHROTTLED mid-window → new story under EP-SC-04.
- Operator returns ESME_RSUBMITFAIL mid-window (half-close behaviour) → new story.
- DLR for unknown messageId (operator re-delivers from stale buffer) → US-DLR-015.
- Concatenated-SMS partial DLR (segment 2 of 3 fails), requiring segment-aware DLR aggregation → new story under EP-DLR-05.
- MO message arrives for a sender-ID never registered → EP-SID-03.
- MNO emits Stop-keyword DLR (recipient opt-out) → EP-CONS-02.
- Tenant tries to send to recipient on the DND registry → EP-CONS-01.
- Tenant tries to send during a regulator-imposed quiet window → already in compliance-engine TEMPORAL but no national default rule-set → EP-CE-11 extension.
- Tenant flips IdP mid-session (token-revocation propagation) → EP-AUTH-06.
- Compliance BLOCK for pre-credited tenant (refund/credit reversal) → US-BILL-038 (new).
- Webhook destination 5xx for >1 hour (circuit-break + tenant-portal alert) → EP-HOOK-05.
- Operator-ID renamed by MNO mid-day (config swap with zero in-flight loss) → US-OPS-09 (new).
- Nation-wide MNO outage (fallback to OTT: WhatsApp / Telegram / Signal-as-OTP) → EP-CHAN-01..02.
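For the concatenated-SMS case in the list above, one possible aggregation rule is sketched here. The policy itself is an assumption, not something the backlog defines: any failed segment fails the whole message, and the message counts as delivered only once every segment has a delivered receipt. The enum and function names are hypothetical.

```python
from enum import Enum
from typing import Dict


class SegmentStatus(Enum):
    PENDING = "PENDING"
    DELIVERED = "DELIVERED"
    FAILED = "FAILED"


def aggregate_dlr(total_segments: int,
                  segment_status: Dict[int, SegmentStatus]) -> SegmentStatus:
    """Collapse per-segment DLRs (keyed by 1-based segment number) into one status.

    Segments with no receipt yet are simply absent from the mapping.
    """
    if any(s is SegmentStatus.FAILED for s in segment_status.values()):
        return SegmentStatus.FAILED
    delivered = sum(1 for s in segment_status.values()
                    if s is SegmentStatus.DELIVERED)
    if delivered == total_segments:
        return SegmentStatus.DELIVERED
    return SegmentStatus.PENDING


# Segment 2 of 3 fails while segments 1 and 3 are delivered.
partial = {1: SegmentStatus.DELIVERED, 2: SegmentStatus.FAILED,
           3: SegmentStatus.DELIVERED}
print(aggregate_dlr(3, partial).value)
```

Whether a partially delivered message is billed per delivered segment or not at all is a separate decision that the EP-DLR-05 story would need to coordinate with billing-service.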
2.10 Risk register essentially absent
docs/11-risks-and-tradeoffs.md is a 17-line stub; per-service SERVICE_RISK_REGISTER.md files exist via the 17-doc template but are not all populated. For a national-asset programme, the risk register is a board-level artifact and must be populated. → EP-PLAT-NB-15.
3. What is Unclear
| Area | Specific question that must be answered |
|---|---|
| Identity broker scope | Does Keycloak broker for every tenant or only enterprise SSO? Self-serve tenants today appear to use Keycloak directly; spec implies both — clarify in auth-service/SERVICE_OVERVIEW.md. |
| Compliance verdict semantics | FLAG is named in §4 but its semantics (logged but allowed? logged + sample-routed for human review?) are not defined. |
| Idempotency-Key scope | Scoped per (tenant, key) or globally? Retention of 48h is documented (US-ORCH-005); confirm conflict semantics for the same key with a different payload. |
| API versioning | "OpenAPI 3.1 with /v1/" — deprecation policy N+2? Sunset header policy? |
| Tenant deletion | GDPR erasure mentioned for users; tenant deletion (cascade across 13 schemas) is not. |
| Multi-tenant Postgres | "per-service schema" — within a schema, RLS is enforced (DoD §2). Confirm that all tenant-scoped tables across all services have an RLS policy and a contract test asserting it. |
| Currency | billing-service does not pin currency strategy. AFN, USD, multi-currency? FX policy? |
| Legal entity model | Tenant ≠ legal entity; one legal entity may own multiple tenants (sub-orgs). Not modelled. |
| Operator-side outbound IPs | MNOs whitelist source IPs; how does the platform present a stable egress? NAT gateway, dedicated egress pods? → ADR-0004 §6 begins to address. |
| Disaster mode | What is "graceful degradation" if compliance-engine is down for >1 h? Today it just queues — at national scale that becomes a regulator incident. → EP-PLAT-NB-08 (trusted-tenant fast-path) addresses the OTP slice. |
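One way to answer the Idempotency-Key question in the table above is sketched below. It assumes per-(tenant, key) scoping and treats a reused key with a different payload as a conflict; both choices are assumptions for illustration, as are all names. The documented 48h retention is omitted for brevity.

```python
import hashlib
import json
from typing import Any, Callable, Dict, Tuple


class IdempotencyConflict(Exception):
    """Same (tenant, key) was reused with a different request payload."""


class IdempotencyStore:
    """In-memory illustration; a real store would expire entries."""

    def __init__(self) -> None:
        self._entries: Dict[Tuple[str, str], Tuple[str, Any]] = {}

    @staticmethod
    def _fingerprint(payload: Dict[str, Any]) -> str:
        # Canonical JSON so that key order does not change the fingerprint.
        canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def execute(self, tenant_id: str, key: str, payload: Dict[str, Any],
                handler: Callable[[Dict[str, Any]], Any]) -> Any:
        scope = (tenant_id, key)
        fingerprint = self._fingerprint(payload)
        if scope in self._entries:
            stored_fingerprint, stored_response = self._entries[scope]
            if stored_fingerprint != fingerprint:
                raise IdempotencyConflict(f"key {key!r} reused with a new payload")
            return stored_response          # replay: do not run the handler again
        response = handler(payload)
        self._entries[scope] = (fingerprint, response)
        return response
```

Under this scheme two tenants can use the same key without colliding, a retried request returns the original response, and a mismatched payload surfaces as an explicit error (for an HTTP API, typically a 409 or 422) rather than silently returning the earlier result.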
4. What is Risky (top 10)
- Compliance fail-closed under MNO-OTP burst: A nationwide bank pushing OTPs would queue indefinitely if compliance-engine flaps. Need trusted-tenant fast-path (EP-PLAT-NB-08).
- Single-region everything: One DC outage = national SMS outage. Multi-region active-active is non-negotiable for the use case described. → ADR-0004 + EP-PLAT-NB-01..03.
- External-LLM PII leakage: Even "fallback only" is a regulator and reputational disaster waiting. Default off, per-tenant opt-in, audited. → US-CE-046 (new story under EP-CE-09).
- SMPP connector deployment topology: logic is good, deployment is naive. Per-MNO per-direction pools required. → EP-SC-05..06.
- No HLR/MNP authority: number-portability changes will silently break delivery. → number-intelligence-service.
- No SIM-box / AIT detection: within 12 months of public launch, fraud rings will arbitrage Ghasi for grey-route termination. → fraud-intel-service.
- No regulator integration: ATRA can shut you down on a single CDR audit failure. → regulator-portal-service + cdr-mediation-service.
- No emergency / cell-broadcast plan: the moment a public emergency happens, government will demand the gateway. → cbc-bridge-service.
- No supply-chain security: npm dependency hijack would compromise the national gateway; SBOM + signed images + locked registries required. → EP-PLAT-NB-11.
- Customer-portal Firebase login contradicts ADR-0002: fix immediately to avoid a regression on go-live. → update US-CUST-01-01.
5. What is Not Enterprise-Grade (must be lifted)
| Symptom | Enterprise-grade target |
|---|---|
| Multiple topical docs are 17-line stubs (04, 05, 08, 09, 10, 11, 12, 14) | Authored to the bar of 01, 03, 13. |
| No NFR/SLA catalog | A 15-nfr-sla-catalog.md with quantitative targets per traffic class, per service tier. → EP-PLAT-NB-09. |
| No DR plan | Documented multi-region active-active, RTO ≤ 5 min for OTP class, ≤ 15 min for transactional, ≤ 60 min for marketing. RPO ≤ 5 s for OTP/transactional, ≤ 60 s for marketing. → ADR-0004. |
| No published support model | 24×7 NOC, T1/T2/T3, P1 ≤ 15 min ack, monthly SLA credits. → EP-PLAT-NB-12. |
| No status page | https://status.ghasi.io with per-MNO and per-API-class signals. → EP-PLAT-NB-14. |
| No compliance certifications roadmap | ISO 27001, ISO 27017/27018, SOC 2 Type II, PCI DSS scope-out attestation, GSMA AA.18 (A2P SMS) accreditation. → EP-PLAT-NB-15. |
| No formal change management | CAB process, change windows, MNO-coordinated change notices. → EP-PLAT-NB-12. |
| No customer success surface | Quarterly business reviews, dedicated TAM model for enterprise tenants. → EP-CUST-08. |
| No localisation discipline beyond ICU MessageFormat tag | Pashto/Dari translation memory, RTL audit, content lengths recomputed for UCS-2. → EP-CUST-09, US-BILL-037. |
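The UCS-2 length recomputation in the last row, and the billing parity story US-BILL-037, both depend on counting segments the same way the connector does. The sketch below uses the standard limits: 160 GSM 7-bit septets or 70 UCS-2 code units for a single message, dropping to 153 and 67 per segment once a concatenation header is present. The function name is hypothetical, and as a simplification the sketch ignores the rule that an escape pair or surrogate pair must not straddle a segment boundary.

```python
import math
from typing import Tuple

# GSM 03.38 default alphabet (one septet each) and extension table (two septets each).
GSM7_BASIC = set(
    "@£$¥èéùìòÇ\nØø\rÅåΔ_ΦΓΛΩΠΨΣΘΞÆæßÉ !\"#¤%&'()*+,-./0123456789:;<=>?"
    "¡ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÑÜ§¿abcdefghijklmnopqrstuvwxyzäöñüà"
)
GSM7_EXTENDED = set("^{}\\[~]|€\f")


def count_segments(text: str) -> Tuple[str, int]:
    """Return (encoding, billable segment count) for a message body."""
    if all(ch in GSM7_BASIC or ch in GSM7_EXTENDED for ch in text):
        units = sum(2 if ch in GSM7_EXTENDED else 1 for ch in text)
        encoding, single_limit, multi_limit = "GSM7", 160, 153
    else:
        # Count UTF-16 code units so characters outside the BMP count twice.
        units = len(text.encode("utf-16-be")) // 2
        encoding, single_limit, multi_limit = "UCS2", 70, 67
    if units <= single_limit:
        return encoding, 1
    return encoding, math.ceil(units / multi_limit)


print(count_segments("Your code is 123456"))
print(count_segments("سلام"))
```

A single Pashto or Dari character anywhere in the body switches the whole message to UCS-2, so a 100-character message that would be one GSM7 segment becomes two billable segments; that jump is the discrepancy billing has to price consistently.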
6. Mandatory Architectural Uplifts (specified in ADR-0004)
- Multi-region active-active across kbl (Kabul) and mzr (Mazar), plus sovereign-DR cold copy in dxb (Dubai).
- Control-plane vs. data-plane split on separate node pools.
- Per-MNO connector pools (smpp-connector-{mno}-{tx|rx|trx}) with bind affinity.
- Twelve new bounded contexts (see §7).
- HSM-backed key custody (PKCS#11, FIPS 140-2 L3).
- Service mesh with SPIFFE/SPIRE workload identities.
- NATS JetStream multi-cluster (super-cluster + leaf nodes).
- Postgres Patroni clusters per region; multi-master only on control-plane data.
- CDR pipeline distinct from billing events.
- National traffic priority lanes (P0–P4).
- Trusted-tenant fast-path for vetted regulated tenants.
- Chaos engineering programme.
- 24×7 NOC tooling.
- Status page + customer-facing SLO dashboard.
- SBOM + signed images + admission-controlled image policy.
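As an illustration of the P0–P4 priority lanes in the list above, a strict-priority queue could look like the sketch below. It assumes P0 is the most urgent lane and that messages within a lane are served first-in, first-out; ADR-0004 may define different semantics, and the class name is hypothetical.

```python
import heapq
import itertools
from typing import Any, List, Tuple

# Assumed ordering: P0 is served before P1, and so on down to P4.
LANES = ("P0", "P1", "P2", "P3", "P4")


class LaneQueue:
    """Strict-priority queue across lanes, FIFO within a lane."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, Any]] = []
        self._sequence = itertools.count()   # tie-breaker that preserves arrival order

    def put(self, lane: str, message: Any) -> None:
        rank = LANES.index(lane)             # raises ValueError for an unknown lane
        heapq.heappush(self._heap, (rank, next(self._sequence), message))

    def get(self) -> Any:
        _, _, message = heapq.heappop(self._heap)
        return message

    def __len__(self) -> int:
        return len(self._heap)
```

Strict priority can starve the lower lanes under sustained P0/P1 load, so a production scheduler would more likely combine it with per-lane minimum shares or weighted fair queuing.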
7. New Bounded Contexts (12) — naming and prefixes
| New context | Service name (proposed) | Epic prefix | Owner |
|---|---|---|---|
| Number Intelligence | number-intelligence-service | EP-NI-* / US-NI-* | Messaging Core |
| SMS Firewall | sms-firewall-service | EP-FW-* / US-FW-* | Trust & Safety |
| Sender ID Registry | sender-id-registry-service | EP-SID-* / US-SID-* | Trust & Safety + Regulator-facing |
| Numbering / Short-code | numbering-service | EP-NUM-* / US-NUM-* | Commerce |
| CDR / Mediation | cdr-mediation-service | EP-CDR-* / US-CDR-* | Commerce + Regulator |
| Cell Broadcast Bridge | cbc-bridge-service | EP-CBC-* / US-CBC-* | Government / Emergency |
| Channel Router (multi-channel) | channel-router-service | EP-CHAN-* / US-CHAN-* | Messaging Core |
| Fraud Intelligence | fraud-intel-service | EP-FRAUD-* / US-FRAUD-* | Trust & Safety |
| Regulator Liaison | regulator-portal-service | EP-REG-* / US-REG-* | Regulator-facing |
| Developer / SDK Portal | developer-portal-service | EP-DEV-* / US-DEV-* | Product |
| Campaign / Template Manager | campaign-service | EP-CAMP-* / US-CAMP-* | Product |
| Consent / DND Ledger | consent-ledger-service | EP-CONS-* / US-CONS-* | Trust & Safety |
This brings the platform from 14 to 26 services. They are scoped, sequenced, and assigned epic IDs in 07-epics-and-user-stories.md.
8. Twilio / Infobip / Sinch Comparison — Where We Win, Catch Up, Differentiate
| Capability | Twilio | Infobip | Sinch | Ghasi today | Ghasi target |
|---|---|---|---|---|---|
| Multi-channel (SMS, MMS, RCS, WhatsApp, Voice) | ✅ | ✅ | ✅ | ❌ (SMS only) | ✅ via channel-router-service |
| Per-MNO direct binds in-country | Partial | ✅ | ✅ | ✅ | ✅✅ (sovereign + national) |
| AI content moderation | Partial (Trust Hub) | Partial | Partial | ✅ (compliance-engine) | ✅✅ (already ahead) |
| Tenant compliance scoring exposed to tenant | ❌ | ❌ | ❌ | ✅ | ✅✅ (differentiator) |
| Sender-ID registry | US 10DLC only | DLT integrations | DLT integrations | ❌ | ✅✅ (national authority) |
| HLR/MNP service | Buys | Buys | Buys | ❌ | ✅✅ (owns nationally) |
| Cell-broadcast emergency alerts | ❌ | ❌ | ❌ | ❌ | ✅✅ (national authority) |
| Government priority lane | Limited | Limited | Limited | ❌ | ✅✅ |
| Regulator-direct CDR export | ❌ | ❌ | ❌ | ❌ | ✅✅ |
| Local LLM compliance (no data export) | ❌ (cloud LLM) | ❌ | ❌ | ✅ | ✅✅ |
| Fraud / AIT / SIM-box prevention | Partial | Partial | Partial | ❌ | ✅ (parity) |
| Status page + per-route SLO | ✅ | ✅ | ✅ | ❌ | ✅ (parity) |
| Multi-region active-active | ✅ | ✅ | ✅ | ❌ | ✅ (parity) |
| OAuth/SAML enterprise SSO | ✅ | ✅ | ✅ | ✅ | ✅ (parity) |
| RCS Business Messaging | ✅ | ✅ | ✅ | ❌ | ✅ post-GA (channel-router) |
| Voice OTP fallback | ✅ | ✅ | ✅ | ❌ | ✅ (channel-router) |
| Verify API (managed OTP) | ✅ | ✅ | ✅ | ❌ | ✅ (developer-portal/channel-router) |
| Lookup API (number intelligence) | ✅ (paid) | ✅ | ✅ | ❌ | ✅✅ (free for nationals) |
| Notify API (broadcast/segments) | ✅ | ✅ | ✅ | ❌ | ✅ (campaign-service) |
| Conversation API (2-way sticky) | ✅ | ✅ | ✅ | ❌ | ✅ (channel-router) |
| Pricing transparency | ✅ | Partial | Partial | ❌ | ✅ (must publish) |
9. Engineering Punch List
| # | Action | Owner | Sprint window |
|---|---|---|---|
| 1 | Adopt this critique + extended 007 catalog + Jira CSV | Platform Architecture | Now |
| 2 | Approve ADR-0004 (national resilience blueprint) | Architecture Council | This sprint |
| 3 | Author 15-nfr-sla-catalog.md; bind every NFR to a Prometheus alert | SRE | +2 sprints |
| 4 | Reconcile EP-CE-* vs CE-E* IDs — regenerate services/compliance-engine/JIRA_IMPORT.csv to use canonical IDs | Compliance Eng | This sprint |
| 5 | Fix customer-portal/_report.md US-CUST-01-01 to use Keycloak OIDC PKCE (not Firebase) | Frontend Eng | This sprint |
| 6 | Update auth-service/_report.md scope header to remove "Firebase federation" wording | Identity | This sprint |
| 7 | Rewrite 02-ddd-bounded-contexts.md (currently testing standards) — mis-titled file | Platform Arch | +1 sprint |
| 8 | Author docs 04, 05, 08, 09, 10, 11, 12, 14 to the bar of 01, 03, 13 | Per-domain leads | +3 sprints |
| 9 | Stand up the 12 new services (firewall, number-intel, sender-id, numbering, CDR, CBC, channel-router, fraud-intel, regulator-portal, dev-portal, campaign, consent-ledger) as 17-doc skeletons | Platform PM + each domain lead | +2 sprints |
| 10 | Multi-region topology, HSM, service mesh, chaos, NOC dashboards | SRE + Security | Sprint windows S6–S12 |
| 11 | Regulator engagement (ATRA) — license posture, CDR schema, LI plan | Legal + Platform Leadership | Continuous |
| 12 | MNO commercial + technical onboarding playbook (per-MNO bind plan, TPS contracts, escalation tree) | MNO Partnerships | Continuous |
| 13 | Public status page + SDKs + developer portal | Product + DevRel | Sprint windows S8–S14 |
End of critique. Continue to the extended epic catalog 07-epics-and-user-stories.md and the Jira import 07-epics-and-user-stories.JIRA_IMPORT.csv.