Architecture Baseline
Version: 1.2
Status: Approved
Owner: Platform Architecture Team
Last Updated: 2026-04-19
References: ADR-0001 Kong edge gateway, system.md §1–4, AGENT.md §4–5
Change log
- v1.2 (2026-04-19): (a) Identity rebaselined. Keycloak is the default/base Identity Provider (IdP); auth-service exposes a pluggable IdP provider abstraction so tenant organisations can federate their own external IdP (OIDC or SAML 2.0) for SSO. Firebase is retained only as a legacy/optional provider. (b) Compliance Engine added as a first-class architectural tier. Every outbound SMS now traverses the Compliance Layer (async, in the NATS consumer) before routing. Container view, outbound SMS sequence, NATS topology, database ownership, and technology stack updated accordingly.
- v1.1 (2026-04-17): Kong adopted as the north-south API gateway. Custom NestJS api-gateway service retired; its responsibilities moved to Kong (TLS, auth, rate limiting, correlation, logging) and to sms-orchestrator (validation, idempotency, NATS publish). See ADR-0001. Container view and sequence diagrams updated accordingly.
- v1.0 (2026-04-12): Initial baseline with custom NestJS api-gateway.
1. Purpose
This document establishes the authoritative architectural baseline for the Ghasi Messaging Gateway — a telecom-grade SMS gateway platform. It defines service boundaries, communication patterns, data ownership, and system context for all 14 platform units (13 messaging/commerce services plus the compliance-engine).
2. System Context (C4 Level 1)
Identity topology (summary). Kong terminates TLS and validates JWTs/API keys at the edge. auth-service is the platform's canonical identity surface and owns the IdP provider abstraction: a pluggable set of providers of which Keycloak is the base/default. Keycloak itself acts as an OIDC/SAML broker so that a tenant organisation can bring its own corporate IdP (Azure AD, Okta, Google Workspace, ADFS, generic OIDC/SAML) for SSO without any change to downstream services. Firebase remains available as an optional legacy provider for early customers and will be retired per the migration plan. See auth-service SERVICE_OVERVIEW §5.
Compliance Layer (summary). The compliance-engine is a first-class architectural tier alongside ingestion, routing, and transport. It is invoked asynchronously by sms-orchestrator after the tenant has received a 202, and its verdict (ALLOW / FLAG / HOLD / BLOCK) gates whether the message is handed to routing-engine. See compliance-engine SERVICE_OVERVIEW.
3. Container View (C4 Level 2)
3.1 Identity & Access — provider abstraction
auth-service does not hard-code a single external IdP. It exposes an IdP Provider Abstraction with three categories of concrete providers, all implementing the same internal IdentityProvider port (verifyExternalToken, resolveExternalIdentity, provisionUserFromClaims, revokeExternalSession):
| Category | Concrete provider | When used |
|---|---|---|
| Base / Default | KeycloakProvider (OIDC, RS256) | Every tenant by default; also the OIDC/SAML broker for any tenant that enables external SSO |
| External organisation SSO | TenantOIDCProvider, TenantSAMLProvider (brokered through a per-tenant Keycloak realm / IdP mapper) | Enterprise tenants who require SSO against their own corporate IdP (Azure AD / Okta / Google / ADFS / generic OIDC / SAML 2.0) |
| Legacy / optional | FirebaseProvider | Existing Firebase-based customers during the migration window; slated for retirement |
Tenant-level configuration (tenant_identity_providers table, owned by auth-service) selects which provider(s) apply. Downstream services are indifferent to which provider authenticated the user: they only see the canonical platform JWT signed by auth-service and validated by Kong. See auth-service SECURITY_MODEL §1 for the auth flows per provider.
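A minimal sketch of how the IdentityProvider port and per-tenant provider selection could look. Only the four method names come from this document. The parameter and return types, the `ProviderKind` values, the `TenantIdpBinding` row shape, and the `ProviderRegistry` class are hypothetical, and the fallback to Keycloak when a tenant has no enabled binding is an assumption drawn from the "Base / Default" row above.

```typescript
// Hypothetical shapes: only the four port method names are taken from the document.
type ProviderKind = "keycloak" | "tenant-oidc" | "tenant-saml" | "firebase-legacy";
type Claims = Record<string, unknown>;

interface ExternalIdentity {
  subject: string; // subject identifier issued by the external IdP
  email?: string;
}

interface IdentityProvider {
  readonly kind: ProviderKind;
  verifyExternalToken(token: string): Promise<Claims>;
  resolveExternalIdentity(claims: Claims): Promise<ExternalIdentity>;
  provisionUserFromClaims(tenantId: string, claims: Claims): Promise<string>;
  revokeExternalSession(sessionId: string): Promise<void>;
}

// Stand-in for a row of tenant_identity_providers (column names are assumed).
interface TenantIdpBinding {
  tenantId: string;
  kind: ProviderKind;
  enabled: boolean;
}

class ProviderRegistry {
  private readonly providers = new Map<ProviderKind, IdentityProvider>();

  register(provider: IdentityProvider): void {
    this.providers.set(provider.kind, provider);
  }

  // Returns the providers enabled for a tenant.
  // Assumption: a tenant with no enabled binding falls back to Keycloak.
  resolveForTenant(tenantId: string, bindings: TenantIdpBinding[]): IdentityProvider[] {
    const kinds = bindings
      .filter((b) => b.tenantId === tenantId && b.enabled)
      .map((b) => b.kind);
    const effective: ProviderKind[] = kinds.length > 0 ? kinds : ["keycloak"];
    return effective.map((kind) => {
      const provider = this.providers.get(kind);
      if (!provider) throw new Error(`No provider registered for kind "${kind}"`);
      return provider;
    });
  }
}

// Stub used only to exercise the registry; real providers would call Keycloak, etc.
function stubProvider(kind: ProviderKind): IdentityProvider {
  return {
    kind,
    verifyExternalToken: async () => ({}),
    resolveExternalIdentity: async () => ({ subject: "stub" }),
    provisionUserFromClaims: async () => "user-stub",
    revokeExternalSession: async () => undefined,
  };
}
```

Because callers depend only on the port, adding a new provider category would mean registering one more implementation rather than touching downstream services.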
3.2 Compliance Layer — container-level view
The compliance-engine sits between orchestration and routing. It is a gRPC service with an HTTP REST admin surface and a NATS producer/consumer pair:
| Interface | Purpose | Caller |
|---|---|---|
| gRPC EvaluateCompliance (P95 ≤ 500 ms) | Synchronous rule evaluation for a queued message | sms-orchestrator (NATS consumer) |
| HTTPS /v1/compliance/* (admin) | CRUD on rules, rule-sets, hold-queue review, tenant score overrides | admin-dashboard, Kong-authenticated |
| NATS producer (compliance.*) | Emits audit, hold, block, release, reject, expire, score-change events | Fan-out to notification-service, analytics-service, billing-service |
| NATS consumer (sms.dlr.inbound → stats) | Consumes DLR statistics feeding tenant-score models | From dlr-processor |
Crucially, no outbound SMS reaches routing-engine until the Compliance Layer returns ALLOW (or an admin releases a held message). The pipeline is fail-closed.
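The gating rule can be expressed as a small predicate. This is an illustrative sketch, not the orchestrator's actual code: the `ComplianceState` shape and the function name are hypothetical. The document does not say how a FLAG verdict is handled, so the sketch takes the strict reading of the rule above and keeps FLAG out of routing.

```typescript
type Verdict = "ALLOW" | "FLAG" | "HOLD" | "BLOCK";

// Hypothetical view of a message's compliance status.
interface ComplianceState {
  verdict: Verdict | null; // null while the message is still being evaluated
  adminReleased: boolean;  // true once an admin has released a held message
}

// True only for an explicit ALLOW, or a held message that an admin has released.
function mayDispatchToRouting(state: ComplianceState): boolean {
  if (state.verdict === "ALLOW") return true;
  if (state.verdict === "HOLD" && state.adminReleased) return true;
  // FLAG (assumed), BLOCK, un-released HOLD, and "no verdict yet" all stay out of routing.
  return false;
}
```

Defaulting to `false` is what makes the check fail-closed: any state the function does not recognise as explicitly releasable is never routed.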
4. Outbound SMS Pipeline (Sequence)
Fail-closed guarantee. If compliance-engine is unavailable, sms-orchestrator retries via NATS (bounded by DLQ policy). The message remains in EVALUATING and is never dispatched to routing-engine without an explicit ALLOW verdict or admin release.
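The retry behaviour described above might be sketched as a pure decision function inside the NATS consumer. The type names, the action labels, and the `maxDeliver` parameter are hypothetical stand-ins for whatever the DLQ policy defines. With the nats client, the actions would roughly correspond to acknowledging the message (`ack()`) or requesting redelivery (`nak()`).

```typescript
type Verdict = "ALLOW" | "FLAG" | "HOLD" | "BLOCK";

// Result of one EvaluateCompliance attempt.
type Attempt =
  | { kind: "verdict"; verdict: Verdict }
  | { kind: "unavailable" }; // call failed or timed out

type ConsumerAction = "ACK_AND_ROUTE" | "ACK_NO_ROUTE" | "NAK_FOR_RETRY" | "DEAD_LETTER";

interface Outcome {
  action: ConsumerAction;
  state: "EVALUATING" | "DECIDED";
}

// deliveryCount: how many times this message has been delivered so far (1 on first delivery).
// maxDeliver: assumed upper bound taken from the DLQ policy.
function handleAttempt(attempt: Attempt, deliveryCount: number, maxDeliver: number): Outcome {
  if (attempt.kind === "unavailable") {
    // No verdict: the message stays in EVALUATING and is retried or dead-lettered, never routed.
    const exhausted = deliveryCount >= maxDeliver;
    return { action: exhausted ? "DEAD_LETTER" : "NAK_FOR_RETRY", state: "EVALUATING" };
  }
  // A verdict arrived: only ALLOW hands the message to routing-engine.
  return {
    action: attempt.verdict === "ALLOW" ? "ACK_AND_ROUTE" : "ACK_NO_ROUTE",
    state: "DECIDED",
  };
}
```

Note that no branch maps an unavailable compliance-engine to a routing action, which is the property the fail-closed guarantee relies on.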
5. DLR Return Path (Sequence)
6. NATS JetStream Topology
Retention notes. compliance.audit.v1 is retained for ≥ 13 months (regulatory evidence window). compliance.message.* and compliance.tenant.tier.changed.v1 use standard 7-day JetStream retention; durable consumers fan events into long-term Postgres storage inside each subscribing service. auth.events captures SSO-relevant signals (external IdP link/unlink, SAML/OIDC session events) for audit.
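As an illustration, the retention windows above could be captured as stream definitions along these lines. The stream names and the grouping of subjects into streams are hypothetical. The 400-day figure is an assumed conservative value for "≥ 13 months", not one taken from this document. JetStream expresses `max_age` in nanoseconds, hence the conversion helper.

```typescript
const NANOS_PER_DAY = 24 * 60 * 60 * 1e9;

// Converts whole days to nanoseconds, the unit JetStream uses for max_age.
function daysToNanos(days: number): number {
  return days * NANOS_PER_DAY;
}

// Subset of a JetStream stream configuration relevant to retention.
interface StreamRetention {
  name: string;       // hypothetical stream name (stream names cannot contain dots)
  subjects: string[]; // subjects as named in this document
  max_age: number;    // nanoseconds
}

const complianceStreams: StreamRetention[] = [
  {
    name: "COMPLIANCE_AUDIT",
    subjects: ["compliance.audit.v1"],
    max_age: daysToNanos(400), // assumed value covering the >= 13 month evidence window
  },
  {
    name: "COMPLIANCE_EVENTS",
    subjects: ["compliance.message.*", "compliance.tenant.tier.changed.v1"],
    max_age: daysToNanos(7), // standard 7-day retention
  },
];
```

Objects of this shape could be passed to the JetStream manager when streams are provisioned; the exact provisioning path is outside the scope of this sketch.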
7. Database Ownership Map
Keycloak data ownership. Keycloak manages its own schema (keycloak) inside the same PostgreSQL instance but in an isolated logical database or schema; no Ghasi service reads the Keycloak schema directly. auth-service interacts with Keycloak exclusively via its Admin REST API + OIDC endpoints. The auth schema stores Ghasi-owned projections: tenant_identity_providers (which IdP each tenant is bound to), external_identities (link between platform userId and external IdP subject), and idp_session_audit.
Compliance data retention. compliance.audit_log is append-only, partitioned by month, and retained ≥ 13 months. evaluation_log (per-message evaluation trace) uses a shorter 90-day retention with cold-tier archival to object storage.
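A sketch of how a maintenance job might decide which monthly audit_log partitions are old enough to archive or drop. The partition naming scheme (`audit_log_YYYY_MM`) and the function names are assumptions. The 13-month default mirrors the retention floor stated above, and a partition is only considered droppable once every row in it is older than that window.

```typescript
// Name of the monthly partition holding a given UTC date (naming scheme assumed).
function partitionName(date: Date): string {
  const year = date.getUTCFullYear();
  const month = date.getUTCMonth() + 1;
  return `audit_log_${year}_${month < 10 ? "0" + month : String(month)}`;
}

// First day (UTC) of the oldest month that must still be retained.
function oldestRetainedMonth(now: Date, retentionMonths: number): Date {
  // Date.UTC normalises negative month indexes into earlier years.
  return new Date(Date.UTC(now.getUTCFullYear(), now.getUTCMonth() - retentionMonths, 1));
}

// partitionMonth: first day (UTC) of the month the partition covers.
function isDroppable(partitionMonth: Date, now: Date, retentionMonths: number = 13): boolean {
  return partitionMonth.getTime() < oldestRetainedMonth(now, retentionMonths).getTime();
}
```

For example, evaluated on 2026-04-19 with the 13-month default, the March 2025 partition is still retained while the February 2025 partition is eligible for removal.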
8. Architectural Principles
| Principle | Rule | Source |
|---|---|---|
| No shared databases | Each service owns exactly one schema | AGENT.md §5.2 |
| Async-first | Inter-service communication via NATS JetStream by default | system.md §4 |
| Sync only when required | gRPC for latency-sensitive calls (Routing Engine, Compliance Engine EvaluateCompliance) | system.md §2 |
| DDD enforcement | Domain layer contains zero framework imports | AGENT.md §4.2 |
| Idempotency | All message processing is idempotent via Redis keys | system.md §2 |
| SMPP resilience | Persistent reconnect, operator failover, DLQ | system.md §2 |
| Secret management | Vault or K8s Secrets — never plaintext | AGENT.md §11.1 |
| Observability | Every service exposes Prometheus metrics + OTel traces | AGENT.md §12 |
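To make the idempotency principle concrete, the following sketch shows the usual set-if-absent pattern behind Redis-backed idempotency keys. The key format, the 24-hour TTL, and all names are hypothetical. An in-memory store stands in for Redis so the example runs on its own; with ioredis the equivalent call would be along the lines of `redis.set(key, "1", "EX", ttlSeconds, "NX")`, which resolves to `"OK"` only when the key did not already exist.

```typescript
interface KeyStore {
  // Returns true if the key was newly set, false if it already existed and has not expired.
  setIfAbsent(key: string, ttlSeconds: number): boolean;
}

// In-memory stand-in for Redis SET ... NX EX, for illustration only.
class InMemoryKeyStore implements KeyStore {
  private readonly expiries = new Map<string, number>();
  private readonly now: () => number;

  constructor(now: () => number = () => Date.now()) {
    this.now = now;
  }

  setIfAbsent(key: string, ttlSeconds: number): boolean {
    const current = this.now();
    const expiry = this.expiries.get(key);
    if (expiry !== undefined && expiry > current) return false;
    this.expiries.set(key, current + ttlSeconds * 1000);
    return true;
  }
}

// Runs the handler at most once per (tenantId, messageId) within the TTL window.
function processOnce(
  store: KeyStore,
  tenantId: string,
  messageId: string,
  handler: () => void,
): "processed" | "duplicate" {
  const key = `idem:${tenantId}:${messageId}`; // assumed key format
  if (!store.setIfAbsent(key, 24 * 60 * 60)) return "duplicate";
  handler();
  return "processed";
}
```

One design point the sketch leaves open: if the handler throws, the key remains set, so a production version would need to decide whether to delete the key on failure or rely on the handler itself being safe to skip.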
9. Technology Stack
| Layer | Technology | Key packages / version |
|---|---|---|
| Language | TypeScript | 5.x, strict mode |
| Backend framework | NestJS | @nestjs/core, @nestjs/common, @nestjs/platform-fastify (HTTP adapter), latest stable |
| HTTP adapter | Fastify (via NestJS platform adapter) | @nestjs/platform-fastify 10.x — NestJS drives Fastify internally; no raw Fastify code in services |
| API documentation | @nestjs/swagger | OpenAPI 3.1 generated from decorators |
| Input validation | class-validator + class-transformer + Zod | DTO validation via NestJS Pipes |
| Frontend | Next.js (App Router) | 14+ |
| UI components | ShadCN UI + TailwindCSS | Latest stable |
| Primary DB | PostgreSQL | 16+ |
| ORM | Prisma (via @nestjs/prisma / custom module) | 5.x |
| Caching / rate limiting | Redis (@nestjs/cache-manager, ioredis) | 7+ |
| Message bus | NATS JetStream (nats npm package, via shared nats-client) | 2.10+ |
| Identity Provider (base / default) | Keycloak (self-hosted) — realm-per-environment, OIDC + SAML 2.0 broker | 24.x LTS |
| IdP client libraries | openid-client, @node-saml/node-saml, keycloak-admin-client | Latest stable |
| IdP provider abstraction | In-house IdentityProvider port in auth-service with pluggable providers (Keycloak / Tenant-OIDC / Tenant-SAML / Firebase-legacy) | — |
| Legacy IdP (optional) | Firebase Authentication (firebase-admin) — retained only for migration window | 12.x |
| Auth guards | @nestjs/passport + custom NestJS Guards | Latest stable |
| SCIM (tenant org user provisioning) | scim2-server or equivalent, exposed via auth-service for enterprise tenants | Latest stable |
| SMPP | SMPP 3.4 connector (custom NestJS module) | — |
| Logging | Pino via nestjs-pino | Latest stable |
| Tracing | OpenTelemetry SDK | 1.x |
| Container | Docker | 24+ |
| Orchestration | Kubernetes | 1.29+ |
| DNS / WAF | Cloudflare | — |
| CI/CD | GitHub Actions | — |
| Observability | Prometheus + Grafana + Loki + OpenTelemetry | Latest stable |
| Compliance AI | Local LLM (e.g. llama.cpp / vLLM) with external LLM fallback for classification (@compliance-engine/ai) | Latest stable |
10. Assumptions and Open Points
| ID | Assumption / Open Point | Owner | Resolution Date |
|---|---|---|---|
| A-001 | Cloud region not specified in system.md; assumed single primary region with optional DR | Infra Team | TBD |
| A-002 | RPO and RTO targets not defined; assumed RPO 1h, RTO 4h as initial baseline | Infra Team | TBD |
| A-003 | ClickHouse integration for analytics is optional scaffolding; not in baseline architecture | Analytics Team | TBD |
| A-004 | gRPC is used for Routing Engine and compliance-engine (EvaluateCompliance) synchronous calls; all other sync calls use REST | Platform Arch | TBD |
| A-005 | Vault is preferred for secrets; K8s Secrets as fallback | Security Team | TBD |
| A-006 | Keycloak runs as a managed deployment inside the cluster (HA pair) with PostgreSQL as its persistence backend. Managed/cloud Keycloak (e.g., Red Hat SSO) remains an option for regulated regions. | Platform Arch + Security | TBD |
| A-007 | Tenant-specific external IdP onboarding (OIDC discovery URL or SAML metadata URL) is self-serve via admin-dashboard → auth-service → Keycloak Admin REST. | Platform Arch | TBD |
| A-008 | compliance-engine local LLM runs as a sidecar or shared in-cluster service; external LLM fallback is region-scoped and governed by data residency policy. | Trust & Safety + Security | TBD |
| A-009 | Compliance Layer is fail-closed: if unavailable, messages remain in EVALUATING and are retried from NATS until the service recovers or DLQ policy fires. Messages are never released to routing without an explicit ALLOW verdict. | Trust & Safety | Approved |