Version: 1.2
Status: Approved
Owner: Platform Infrastructure Team
Last Updated: 2026-04-19
References: ADR-0001 Kong edge gateway, system.md §7–8, AGENT.md §11–12
Change log
- v1.2 (2026-04-19) — Added Keycloak (base/default IdP + OIDC/SAML broker for tenant external IdP SSO) and compliance-engine (first-class Compliance Layer service) to the infrastructure topology, Docker Compose inventory, Kubernetes namespace strategy, and secrets management. Firebase Auth is retained only as an optional legacy provider.
- v1.1 (2026-04-17) — Ingress layer updated: Kong Gateway replaces NGINX + custom NestJS api-gateway at the edge. Custom api-gateway pod removed; Kong runs as a deployment with its own HPA and (optional) DB-less configuration. See ADR-0001.
- v1.0 (2026-04-12) — Initial baseline.
1. Purpose
This document defines the infrastructure topology, environment strategy, Kubernetes configuration standards, observability stack, and secrets management approach for the Ghasi Messaging Gateway platform.
2. Infrastructure Topology
3. Local Development — Docker Compose
Services defined in infra/docker/docker-compose.yml:
| Service | Port | Notes |
|---|---|---|
| postgres | 5432 | Single instance with volume |
| redis | 6379 | Single instance |
| nats | 4222 / 8222 | JetStream enabled, monitoring UI |
| api-gateway | 3000 | Hot reload via nodemon |
| sms-orchestrator | 3003 | — |
| smpp-connector | 3004 | Connects to mock SMPP server |
| routing-engine | 3005 | — |
| dlr-processor | 3006 | — |
| billing-service | 3007 | — |
| webhook-dispatcher | 3008 | — |
| auth-service | 3009 | Talks to local Keycloak; Firebase emulator optional (legacy provider) |
| analytics-service | 3010 | — |
| notification-service | 3011 | — |
| operator-management-service | 3012 | — |
| compliance-engine | 3013 | gRPC on :50051, REST admin on :3013; pairs with compliance-ai |
| compliance-ai | 8088 | Local LLM (container) for compliance classification |
| keycloak | 8080 / 8443 | Base / default IdP; dev realm ghasi-local; Postgres-backed |
| admin-dashboard | 3001 | Next.js dev server |
| customer-portal | 3002 | Next.js dev server |
| smpp-simulator | 2775 | Mock SMPP operator |
| prometheus | 9090 | Scrapes all services (incl. Keycloak + compliance-engine) |
| grafana | 3100 | Pre-loaded dashboards |
| loki | 3200 | Log aggregation |
4. Kubernetes — Production Configuration
4.1 Namespace Strategy
| Namespace | Contents |
|---|---|
| ghasi-prod | All production application services (incl. compliance-engine) |
| ghasi-identity | Keycloak (HA), auth-service, compliance-ai (local LLM) |
| ghasi-data | Postgres, Redis, NATS |
| ghasi-obs | Prometheus, Grafana, Loki, OTel Collector |
| ghasi-vault | HashiCorp Vault |
Note. Keycloak and compliance-ai live in their own namespace (ghasi-identity) because they (a) handle sensitive credentials/PII and warrant tighter NetworkPolicies, and (b) have different scaling and upgrade cadences from the messaging core.
4.2 Resource Standards (per service pod)
| Profile | CPU Request | CPU Limit | Memory Request | Memory Limit |
|---|---|---|---|---|
| Light (UI, analytics) | 100m | 500m | 128Mi | 512Mi |
| Standard (API, billing) | 250m | 1000m | 256Mi | 1Gi |
| Heavy (SMPP, orchestrator) | 500m | 2000m | 512Mi | 2Gi |
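
The profiles in the table can be captured as a typed lookup so that manifest or Helm value generators stay consistent with it. The following is a minimal sketch only: the names `ResourceProfile`, `RESOURCE_PROFILES`, and `resourcesFor` are illustrative assumptions and are not taken from the platform codebase.

```typescript
// Hypothetical helper: maps the profile names from the table above to the
// `resources` block of a Kubernetes container spec.
type ResourceProfile = "light" | "standard" | "heavy";

interface ContainerResources {
  requests: { cpu: string; memory: string };
  limits: { cpu: string; memory: string };
}

// Values copied from the Resource Standards table.
const RESOURCE_PROFILES: Record<ResourceProfile, ContainerResources> = {
  light: {
    requests: { cpu: "100m", memory: "128Mi" },
    limits: { cpu: "500m", memory: "512Mi" },
  },
  standard: {
    requests: { cpu: "250m", memory: "256Mi" },
    limits: { cpu: "1000m", memory: "1Gi" },
  },
  heavy: {
    requests: { cpu: "500m", memory: "512Mi" },
    limits: { cpu: "2000m", memory: "2Gi" },
  },
};

function resourcesFor(profile: ResourceProfile): ContainerResources {
  // Return a copy so callers cannot mutate the shared table.
  const p = RESOURCE_PROFILES[profile];
  return { requests: { ...p.requests }, limits: { ...p.limits } };
}
```

A generator could call `resourcesFor("heavy")` when rendering the SMPP connector pod template, keeping the table as the single source of truth.
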
4.3 HPA Configuration
- All services use a HorizontalPodAutoscaler with a CPU utilisation target of 70%.
- The SMPP Connector runs as a StatefulSet (sticky SMPP sessions require stable pod identity).
- Minimum replicas: 2 for all production services (high availability).
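
To illustrate how the 70% target and the two-replica minimum interact, the sketch below applies the core HorizontalPodAutoscaler scaling rule, `ceil(currentReplicas * currentUtilisation / targetUtilisation)`, and clamps the result. It is a simplification: it ignores the controller's tolerance band and stabilisation windows, and the `max` of 10 is an assumed placeholder because this document does not define a maximum replica count.

```typescript
interface HpaOptions {
  targetPct: number; // CPU utilisation target from this section
  min: number;       // minimum replicas from this section
  max: number;       // ASSUMPTION: not specified in this document
}

const DEFAULT_HPA: HpaOptions = { targetPct: 70, min: 2, max: 10 };

// Simplified form of the HPA scaling rule, clamped to [min, max].
function desiredReplicas(
  currentReplicas: number,
  currentCpuPct: number,
  opts: HpaOptions = DEFAULT_HPA,
): number {
  const raw = Math.ceil(currentReplicas * (currentCpuPct / opts.targetPct));
  return Math.min(opts.max, Math.max(opts.min, raw));
}
```

For example, `desiredReplicas(4, 105)` evaluates to 6, while `desiredReplicas(2, 35)` stays at 2 because the high-availability minimum prevents scaling down to a single pod.
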
4.4 Health Endpoints (required on all services)
| Endpoint | Purpose |
|---|---|
| GET /health/live | Kubernetes liveness probe |
| GET /health/ready | Kubernetes readiness probe |
| GET /metrics | Prometheus scrape endpoint |
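
One way to think about the three required endpoints is as a small routing function that a service's HTTP layer delegates to. The sketch below is framework-agnostic and hypothetical: `handleProbe`, `isReady`, and `renderMetrics` are illustrative names, and the JSON bodies are assumptions (Kubernetes HTTP probes only inspect the status code, treating 200 to 399 as success).

```typescript
interface ProbeResponse {
  status: number;
  contentType: string;
  body: string;
}

interface ProbeDeps {
  isReady: () => boolean;      // hypothetical: e.g. DB and NATS connections are up
  renderMetrics: () => string; // hypothetical: Prometheus text exposition output
}

const JSON_TYPE = "application/json";

function handleProbe(method: string, path: string, deps: ProbeDeps): ProbeResponse {
  if (method !== "GET") {
    return { status: 405, contentType: JSON_TYPE, body: JSON.stringify({ error: "method_not_allowed" }) };
  }
  switch (path) {
    case "/health/live":
      // Liveness: the process is running and able to answer requests.
      return { status: 200, contentType: JSON_TYPE, body: JSON.stringify({ status: "ok" }) };
    case "/health/ready": {
      // Readiness: 503 tells Kubernetes to stop routing traffic to this pod.
      const ready = deps.isReady();
      return {
        status: ready ? 200 : 503,
        contentType: JSON_TYPE,
        body: JSON.stringify({ status: ready ? "ready" : "not_ready" }),
      };
    }
    case "/metrics":
      return { status: 200, contentType: "text/plain; version=0.0.4; charset=utf-8", body: deps.renderMetrics() };
    default:
      return { status: 404, contentType: JSON_TYPE, body: JSON.stringify({ error: "not_found" }) };
  }
}
```

Keeping liveness independent of downstream dependencies avoids restart loops when, say, Postgres is briefly unavailable; only readiness should reflect dependency health.
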
4.5 Ingress Rules
| Host | Service | TLS |
|---|---|---|
| api.ghasi.io | api-gateway | Cloudflare-managed cert |
| admin.ghasi.io | admin-dashboard | Cloudflare-managed cert |
| app.ghasi.io | customer-portal | Cloudflare-managed cert |
5. Secrets Management
- All secrets (DB credentials, API keys, SMPP operator credentials, Keycloak admin credentials, Keycloak realm signing keys, per-tenant OIDC/SAML broker client secrets, SAML signing keys, legacy Firebase service account, external LLM API keys) stored in HashiCorp Vault.
- K8s Secrets used as fallback for environments without Vault.
- Secrets injected as environment variables via Vault Agent Sidecar Injector or External Secrets Operator.
- Keycloak realm signing keys are managed inside Keycloak but backed up via Vault-sealed exports.
- Prohibited: Secrets in ConfigMaps, Helm values files, or source code.
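
Because secrets arrive as environment variables, a service can validate them once at startup and fail fast if any are absent. The following is a minimal sketch under that assumption; `loadSecrets` and the variable names in the usage note are hypothetical and not defined by this document.

```typescript
type Env = Record<string, string | undefined>;

// Reads the named secrets from an environment-like object and throws if any
// are missing or empty. Only names are reported, never values.
function loadSecrets<K extends string>(env: Env, required: readonly K[]): Record<K, string> {
  const missing = required.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing required secrets: ${missing.join(", ")}`);
  }
  const out = {} as Record<K, string>;
  for (const name of required) {
    out[name] = env[name] as string;
  }
  return out;
}
```

A Node.js service might call `loadSecrets(process.env, ["DATABASE_URL", "KEYCLOAK_CLIENT_SECRET"] as const)` during bootstrap (both names are placeholders), so that a misconfigured Vault injection surfaces as a failed readiness check rather than a runtime error deep in a request path.
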
6. Observability Stack
| Tool | Role | Retention |
|---|---|---|
| Prometheus | Metrics collection and alerting | 30 days |
| Grafana | Dashboards and alert routing | — |
| Loki | Log aggregation (Pino JSON logs) | 14 days |
| OpenTelemetry Collector | Trace collection and export | 7 days |
Required Dashboards (Grafana)
- Platform overview: message throughput, delivery rates, error rates
- Service-level: latency P50/P95/P99 per service
- SMPP connector: TPS, bind status per operator
- Billing: events per hour, invoice generation rate
- Infrastructure: pod CPU/memory, Postgres connections, Redis hit rate
7. CI/CD Pipeline (GitHub Actions)
| Stage | Trigger | Action |
|---|---|---|
| Lint | PR opened | ESLint + TypeScript check |
| Test | PR opened | Unit + integration tests |
| Build | PR merged to main | Docker image build + push to registry |
| Deploy staging | Build success | kubectl apply to staging |
| E2E | Deploy staging complete | Playwright + API E2E suite |
| Deploy production | Manual approval | kubectl apply to production |
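
The trigger column can be read as a mapping from pipeline events to the stages they start. The sketch below encodes only what the table states; the type and function names are illustrative, and it deliberately does not model whether a passing E2E run is a precondition for the manual approval, since the table does not say.

```typescript
type Stage = "lint" | "test" | "build" | "deploy-staging" | "e2e" | "deploy-production";

type PipelineEvent =
  | "pr-opened"
  | "pr-merged-to-main"
  | "build-success"
  | "deploy-staging-complete"
  | "manual-approval";

// Event-to-stage mapping taken from the CI/CD table above.
const TRIGGERS: Record<PipelineEvent, readonly Stage[]> = {
  "pr-opened": ["lint", "test"],
  "pr-merged-to-main": ["build"],
  "build-success": ["deploy-staging"],
  "deploy-staging-complete": ["e2e"],
  "manual-approval": ["deploy-production"],
};

function stagesFor(event: PipelineEvent): Stage[] {
  // Copy so callers cannot mutate the shared mapping.
  return [...TRIGGERS[event]];
}
```

In GitHub Actions these events would correspond to workflow triggers and job dependencies; the mapping here is only a compact way to check a workflow file against the table.
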
8. Assumptions and Open Points
| ID | Assumption / Open Point | Owner | Resolution Date |
|---|---|---|---|
| A-001 | Cloud provider and region not specified; assumed Kubernetes-compatible (GKE / EKS / AKS) | Infra Team | TBD |
| A-002 | Postgres HA via Patroni or managed cloud service TBD | Infra Team | TBD |
| A-003 | Redis cluster vs Redis Sentinel decision TBD | Infra Team | TBD |
| A-004 | NATS cluster deployment (self-managed vs managed) TBD | Infra Team | TBD |
| A-005 | Container registry (GCR / ECR / GHCR) TBD | Infra Team | TBD |
| A-006 | ClickHouse for analytics: optional, not in baseline K8s manifests | Analytics Team | TBD |