SMS Firewall Service — Deployment Topology

Version: 1.0
Status: Draft
Owner: Trust & Safety + SRE
Last Updated: 2026-04-21
Companion: SERVICE_OVERVIEW · LOCAL_DEV_SETUP · SECURITY_MODEL
Related ADR: ADR-0004 §5–§6

1. Runtime

Aspect	Value
Language	TypeScript
Runtime	Node.js 22 LTS
Framework	NestJS 10 (gRPC + HTTP)
gRPC server	`@grpc/grpc-js` on port `50061` (data plane) and `50062` (control plane)
HTTP server	Fastify on port `3061` (admin REST)
Metrics	Prometheus on port `9061`
Container base image	`gcr.io/distroless/nodejs22-debian12:nonroot`
Image registry	`registry.ghasi.af/platform/sms-firewall-service`
Health endpoints	`/health/live`, `/health/ready`
OS user	`nonroot` (UID 65532)

2. Kubernetes Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: sms-firewall-service
  namespace: np-data
  labels:
    app: sms-firewall-service
    tier: data-plane
    sovereignty: national
spec:
  replicas: 5
  selector:
    matchLabels: { app: sms-firewall-service }
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0       # zero downtime; firewall is a national choke-point
      maxSurge: 2
  template:
    metadata:
      labels:
        app: sms-firewall-service
        tier: data-plane
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9061"
        prometheus.io/path: "/metrics"
        spire.io/managed: "true"
    spec:
      serviceAccountName: sms-firewall-service
      automountServiceAccountToken: false
      securityContext:
        runAsNonRoot: true
        runAsUser: 65532
        seccompProfile: { type: RuntimeDefault }
        fsGroup: 65532
      affinity:
        podAntiAffinity:
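          # Preferred, not required: a required zone-level rule would allow at most
          # one pod per zone and leave the remaining replicas unschedulable.
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - { key: app, operator: In, values: [sms-firewall-service] }
                topologyKey: topology.kubernetes.io/zone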
      nodeSelector:
        node-pool: np-data
        sovereignty: af-only
      tolerations:
        - key: node-pool
          operator: Equal
          value: np-data
          effect: NoSchedule
      priorityClassName: data-plane-critical
      containers:
        - name: firewall
          image: registry.ghasi.af/platform/sms-firewall-service:1.0.0
          imagePullPolicy: IfNotPresent
          ports:
            - { containerPort: 50061, name: grpc-data }
            - { containerPort: 50062, name: grpc-ctrl }
            - { containerPort: 3061,  name: http-admin }
            - { containerPort: 9061,  name: metrics }
          env:
            - name: NODE_ENV
              value: production
            - name: LOG_LEVEL
              value: info
            - name: GRPC_DATA_PORT
              value: "50061"
            - name: GRPC_CTRL_PORT
              value: "50062"
            - name: HTTP_PORT
              value: "3061"
            - name: METRICS_PORT
              value: "9061"
            - name: REGION
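              # The downward API reads the pod's own labels, not node labels, so this label
              # must be set on the pod template (for example by a per-region overlay).
              # The path is quoted because brackets are not allowed in a plain flow scalar.
              valueFrom: { fieldRef: { fieldPath: "metadata.labels['topology.kubernetes.io/region']" } }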
            - name: DATABASE_URL
              valueFrom: { secretKeyRef: { name: firewall-db, key: url } }
            - name: REDIS_URL
              valueFrom: { secretKeyRef: { name: firewall-redis, key: url } }
            - name: NATS_URL
              valueFrom: { secretKeyRef: { name: firewall-nats, key: url } }
            - name: VAULT_ADDR
              value: https://vault.np-ctrl.svc.cluster.local:8200
            - name: SPIFFE_ENDPOINT_SOCKET
              value: unix:///run/spire/agent-sockets/spire-agent.sock
            - name: HSM_PKCS11_URI
              value: pkcs11:object=ghasi-firewall-fed-signer
            - name: LOCAL_LLM_URL
              value: http://local-llm-service.np-data.svc.cluster.local:8000
            - name: NUMBER_INTEL_URL
              value: number-intelligence-service.np-ctrl.svc.cluster.local:50080
            - name: SENDER_ID_REGISTRY_URL
              value: sender-id-registry-service.np-ctrl.svc.cluster.local:50081
            - name: EVAL_BUDGET_MS
              value: "30"
            - name: TRANSIT_BUDGET_MS
              value: "50"
            - name: PER_BIND_CONCURRENCY
              value: "200"
            - name: EXTERNAL_LLM_ENABLED
              value: "false"   # MUST be false; service refuses to boot otherwise
          resources:
            requests:
              cpu: 1000m
              memory: 1Gi
              ephemeral-storage: 2Gi
            limits:
              cpu: 4000m
              memory: 4Gi
              ephemeral-storage: 4Gi
          livenessProbe:
            httpGet: { path: /health/live, port: http-admin }
            initialDelaySeconds: 20
            periodSeconds: 10
            timeoutSeconds: 3
            failureThreshold: 3
          readinessProbe:
            httpGet: { path: /health/ready, port: http-admin }
            initialDelaySeconds: 15
            periodSeconds: 5
            timeoutSeconds: 2
            failureThreshold: 2
          startupProbe:
            httpGet: { path: /health/ready, port: http-admin }
            initialDelaySeconds: 30
            periodSeconds: 5
            failureThreshold: 30   # Bloom rebuild + rule cache warm-up may take 60–90s
          volumeMounts:
            - { name: spire-agent-socket, mountPath: /run/spire/agent-sockets, readOnly: true }
            - { name: hsm-pkcs11, mountPath: /opt/pkcs11, readOnly: true }
            - { name: tmp, mountPath: /tmp }
          lifecycle:
            preStop:
              exec:
                # distroless nodejs images ship the node binary at /nodejs/bin/node
                command: ["/nodejs/bin/node", "/app/dist/scripts/graceful-shutdown.js", "--drain-seconds=15"]
      volumes:
        - name: spire-agent-socket
          hostPath: { path: /run/spire/agent-sockets, type: Directory }
        - name: hsm-pkcs11
          secret: { secretName: firewall-hsm-pkcs11 }
        - name: tmp
          emptyDir: { medium: Memory, sizeLimit: 256Mi }
      terminationGracePeriodSeconds: 30
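
The env block above implies a boot-time validation step, including the rule that the service refuses to start unless EXTERNAL_LLM_ENABLED is "false". The following is a minimal sketch of such a check, not the service's actual code: the `FirewallConfig` shape and the `loadConfig` and `requireInt` helpers are hypothetical names, and treating an unset EXTERNAL_LLM_ENABLED as a boot failure is an assumption that goes slightly beyond what the manifest states.

```typescript
// Hypothetical config shape; fields mirror a subset of the env vars in the Deployment.
interface FirewallConfig {
  grpcDataPort: number;
  grpcCtrlPort: number;
  httpPort: number;
  metricsPort: number;
  evalBudgetMs: number;
  transitBudgetMs: number;
  perBindConcurrency: number;
}

type Env = Record<string, string | undefined>;

// Parse a required integer env var, failing loudly on missing or malformed input.
function requireInt(env: Env, name: string): number {
  const raw = env[name];
  if (raw === undefined || !/^[0-9]+$/.test(raw)) {
    throw new Error(`${name} must be a non-negative integer, got ${String(raw)}`);
  }
  return Number(raw);
}

function loadConfig(env: Env): FirewallConfig {
  // Documented invariant: boot is refused unless the flag is "false".
  // Assumption: an unset flag is also rejected.
  if (env["EXTERNAL_LLM_ENABLED"] !== "false") {
    throw new Error('EXTERNAL_LLM_ENABLED must be "false"; refusing to boot');
  }
  return {
    grpcDataPort: requireInt(env, "GRPC_DATA_PORT"),
    grpcCtrlPort: requireInt(env, "GRPC_CTRL_PORT"),
    httpPort: requireInt(env, "HTTP_PORT"),
    metricsPort: requireInt(env, "METRICS_PORT"),
    evalBudgetMs: requireInt(env, "EVAL_BUDGET_MS"),
    transitBudgetMs: requireInt(env, "TRANSIT_BUDGET_MS"),
    perBindConcurrency: requireInt(env, "PER_BIND_CONCURRENCY"),
  };
}
```

In a real process this would be called once at boot with the process environment, before any network connection is opened.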

3. HorizontalPodAutoscaler (KEDA ScaledObject + Prometheus scaler)

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: sms-firewall-service
  namespace: np-data
spec:
  scaleTargetRef: { name: sms-firewall-service }
  minReplicaCount: 5
  maxReplicaCount: 15
  pollingInterval: 15
  cooldownPeriod: 120
  triggers:
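    # metricType: Value compares the query result to the threshold directly; KEDA's
    # default (AverageValue) would first divide the result by the replica count.
    - type: prometheus
      metricType: Value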
      metadata:
        serverAddress: http://prometheus.np-obs.svc:9090
        metricName: firewall_filter_inbound_p95
        threshold: "0.025"   # scale up when P95 > 25ms (5ms before SLO breach)
        query: |
          histogram_quantile(0.95,
            sum by (le) (
              rate(firewall_request_duration_seconds_bucket{rpc="FilterInbound"}[2m])
            )
          )
    # metricType: Value because the query below already yields a per-pod average.
    - type: prometheus
      metricType: Value
      metadata:
        serverAddress: http://prometheus.np-obs.svc:9090
        metricName: firewall_inflight_requests
        threshold: "150"   # scale up when in-flight per pod > 150 (cap is 200)
        query: sum(firewall_inflight_requests) / count(up{app="sms-firewall-service"})
    - type: cpu
      metricType: Utilization
      metadata: { value: "70" }

Replica bounds per region (the manifest above shows the kbl values):

- kbl (primary, ~80% of national volume): min=5, max=15
- mzr (secondary): min=3, max=8
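
To make the scaling arithmetic concrete, the sketch below applies the standard Kubernetes HPA ratio rule, desired = ceil(currentReplicas × currentValue / targetValue), to each trigger, takes the largest proposal, and clamps it to the regional bounds. It assumes the triggers are evaluated as Value-type targets and ignores the HPA tolerance band and stabilization windows; the function and type names are hypothetical.

```typescript
// One trigger's current reading and its configured threshold.
interface TriggerReading {
  current: number;
  target: number;
}

// HPA ratio rule for a Value-type target.
function desiredForTrigger(currentReplicas: number, reading: TriggerReading): number {
  return Math.ceil((currentReplicas * reading.current) / reading.target);
}

// The HPA follows the largest proposal across triggers, clamped to [min, max].
function desiredReplicas(
  currentReplicas: number,
  readings: TriggerReading[],
  min: number,
  max: number,
): number {
  let wanted = min;
  for (const reading of readings) {
    wanted = Math.max(wanted, desiredForTrigger(currentReplicas, reading));
  }
  return Math.min(max, wanted);
}

// kbl example: 5 replicas, 180 in-flight per pod (target 150), CPU at 35% (target 70).
const kblNext = desiredReplicas(5, [{ current: 180, target: 150 }, { current: 35, target: 70 }], 5, 15);
console.log(kblNext); // 6
```

With these inputs the in-flight trigger proposes 6 replicas and the CPU trigger proposes 3, so the larger value wins.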

4. PodDisruptionBudget

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: sms-firewall-service-pdb
  namespace: np-data
spec:
  minAvailable: 4         # kbl: never go below 4 during voluntary disruption; mzr (min=3) needs a lower value, or every eviction is blocked
  selector:
    matchLabels: { app: sms-firewall-service }
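
As a quick check on what this budget permits: for a `minAvailable` budget, the number of voluntary evictions allowed at any moment is the count of healthy pods minus `minAvailable`, floored at zero. The helper below is a hypothetical illustration of that arithmetic using the replica minimums from section 3.

```typescript
// Evictions the budget allows right now, given the number of healthy pods.
function allowedDisruptions(healthyPods: number, minAvailable: number): number {
  return Math.max(0, healthyPods - minAvailable);
}

// kbl at its minimum of 5 replicas: one pod may be evicted at a time.
const kblAllowed = allowedDisruptions(5, 4);
// The same budget applied to mzr at its minimum of 3 replicas would block every eviction.
const mzrAllowed = allowedDisruptions(3, 4);
console.log(kblAllowed, mzrAllowed); // 1 0
```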

5. Services

apiVersion: v1
kind: Service
metadata:
  name: sms-firewall-grpc-data
  namespace: np-data
spec:
  selector: { app: sms-firewall-service }
  ports:
    - { name: grpc-data, port: 50061, targetPort: grpc-data, protocol: TCP }
  type: ClusterIP
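  # Prefer same-zone connector→firewall traffic. (internalTrafficPolicy: Local would
  # restrict routing to same-node endpoints and drop traffic where none exist.)
  trafficDistribution: PreferClose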
---
apiVersion: v1
kind: Service
metadata:
  name: sms-firewall-grpc-ctrl
  namespace: np-data
spec:
  selector: { app: sms-firewall-service }
  ports:
    - { name: grpc-ctrl, port: 50062, targetPort: grpc-ctrl }
  type: ClusterIP
---
apiVersion: v1
kind: Service
metadata:
  name: sms-firewall-http
  namespace: np-data
spec:
  selector: { app: sms-firewall-service }
  ports:
    - { name: http-admin, port: 3061, targetPort: http-admin }
  type: ClusterIP

A region-local headless Service is also published for the data-plane gRPC port, so connectors can resolve pod IPs directly and balance client-side (DNS round-robin) instead of going through kube-proxy iptables (saves ~1 ms per call).
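
The selection logic behind client-side round-robin is small enough to sketch. In practice a gRPC client would normally get this behaviour from a `dns:///` target combined with the `round_robin` load-balancing policy rather than hand-rolled code; the `RoundRobinPicker` class below is a hypothetical illustration of the rotation only, and the addresses are made up.

```typescript
// Rotates through the pod addresses returned by a headless-Service DNS lookup.
class RoundRobinPicker {
  private next = 0;
  private addresses: string[];

  constructor(addresses: string[]) {
    if (addresses.length === 0) throw new Error("no endpoints resolved");
    this.addresses = [...addresses];
  }

  // Swap in a fresh endpoint set after DNS re-resolution and restart the rotation.
  update(addresses: string[]): void {
    if (addresses.length === 0) throw new Error("no endpoints resolved");
    this.addresses = [...addresses];
    this.next = 0;
  }

  // Return the next address in rotation.
  pick(): string {
    const address = this.addresses[this.next];
    if (address === undefined) throw new Error("picker index out of range");
    this.next = (this.next + 1) % this.addresses.length;
    return address;
  }
}

const picker = new RoundRobinPicker(["10.0.0.1:50061", "10.0.0.2:50061", "10.0.0.3:50061"]);
const firstFour = [picker.pick(), picker.pick(), picker.pick(), picker.pick()];
console.log(firstFour.join(" ")); // wraps back to 10.0.0.1:50061 on the fourth pick
```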

6. NetworkPolicy

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: sms-firewall-service
  namespace: np-data
spec:
  podSelector:
    matchLabels: { app: sms-firewall-service }
  policyTypes: [Ingress, Egress]
  ingress:
    # Data-plane gRPC: only from smpp-connector pods in same namespace
    - from:
        - podSelector:
            matchExpressions:
              - { key: app, operator: In, values:
                  [smpp-connector-awcc-rx, smpp-connector-awcc-trx,
                   smpp-connector-roshan-rx, smpp-connector-roshan-trx,
                   smpp-connector-etisalat-rx, smpp-connector-etisalat-trx,
                   smpp-connector-mtn-af-rx, smpp-connector-mtn-af-trx,
                   smpp-connector-salaam-rx, smpp-connector-salaam-trx,
                   smpp-connector-transit-rx] }
      ports: [{ port: 50061, protocol: TCP }]
    # Control-plane gRPC: routing-engine + channel-router-service + fraud-intel + cdr-mediation
    - from:
        - namespaceSelector: { matchLabels: { name: np-ctrl } }
          podSelector:
            matchExpressions:
              - { key: app, operator: In, values:
                  [routing-engine, channel-router-service] }
        - namespaceSelector: { matchLabels: { name: np-data } }
          podSelector:
            matchExpressions:
              - { key: app, operator: In, values:
                  [fraud-intel-service, cdr-mediation-service] }
      ports: [{ port: 50062 }]
    # Admin REST: only from Kong (in np-edge)
    - from:
        - namespaceSelector: { matchLabels: { name: np-edge } }
          podSelector: { matchLabels: { app: kong } }
      ports: [{ port: 3061 }]
    # Metrics: from prometheus only
    - from:
        - namespaceSelector: { matchLabels: { name: np-obs } }
          podSelector: { matchLabels: { app: prometheus } }
      ports: [{ port: 9061 }]
  egress:
    # Postgres (region-local primary)
    - to:
        - namespaceSelector: { matchLabels: { name: np-data } }
          podSelector: { matchLabels: { app: postgres-firewall } }
      ports: [{ port: 5432 }]
    # Redis (region-local cluster)
    - to:
        - namespaceSelector: { matchLabels: { name: np-data } }
          podSelector: { matchLabels: { app: redis-firewall } }
      ports: [{ port: 6379 }]
    # NATS JetStream
    - to:
        - namespaceSelector: { matchLabels: { name: np-data } }
          podSelector: { matchLabels: { app: nats } }
      ports: [{ port: 4222 }]
    # Local LLM
    - to:
        - podSelector: { matchLabels: { app: local-llm-service } }
      ports: [{ port: 8000 }]
    # Vault (control-plane)
    - to:
        - namespaceSelector: { matchLabels: { name: np-ctrl } }
          podSelector: { matchLabels: { app: vault } }
      ports: [{ port: 8200 }]
    # Adjacent services
    - to:
        - namespaceSelector: { matchLabels: { name: np-ctrl } }
          podSelector:
            matchExpressions:
              - { key: app, operator: In, values:
                  [number-intelligence-service, sender-id-registry-service] }
      ports: [{ port: 50080 }, { port: 50081 }]
    # MinIO (audit + federation)
    - to:
        - namespaceSelector: { matchLabels: { name: np-data } }
          podSelector: { matchLabels: { app: minio } }
      ports: [{ port: 9000 }]
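    # DNS: the env vars above use cluster DNS names, so resolution must be allowed
    # (assumes CoreDNS/kube-dns in kube-system with the standard k8s-app label)
    - to:
        - namespaceSelector: { matchLabels: { kubernetes.io/metadata.name: kube-system } }
          podSelector: { matchLabels: { k8s-app: kube-dns } }
      ports: [{ port: 53, protocol: UDP }, { port: 53, protocol: TCP }]
    # SPIRE agent: reached over a Unix domain socket (hostPath), so no network rule is needed
    # All other egress is denied because policyTypes includes Egress (fail-closed network)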

No external egress. No internet. All traffic stays within the platform.

7. Pod lifecycle

Startup sequence (target: ready in ≤ 90 s)

1. Boot the process and validate the environment; abort if EXTERNAL_LLM_ENABLED=true.
2. Fetch the SVID from the SPIRE agent over its Unix domain socket.
3. Connect to Postgres, Redis, and NATS; verify schemas and ping each.
4. Acquire Vault tokens for KEK and HSM access; verify the HSM signer.
5. Load active rules (firewall.rules WHERE enabled=TRUE) into the in-process cache.
6. Rebuild Bloom filters: BF.RESERVE fw:blocklist:national 0.01 10000000, then bulk-load from firewall.blocklist_entries WHERE active=TRUE.
7. Subscribe to NATS consumers (consent.dnd.snapshot.v1, fraud.detected.*, regulator.blocklist.published.v1, sender.id.*).
8. Register a heartbeat with firewall.mno_bind_registry only if the pod represents a connector binding; firewall pods skip this step.
9. Start the gRPC and HTTP servers; mark /health/ready true.
10. Begin processing.
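
The BF.RESERVE parameters in the Bloom rebuild step (error rate 0.01, capacity 10,000,000) can be sanity-checked against the textbook Bloom filter sizing formulas, m = −n·ln(p) / (ln 2)² bits and k = (m/n)·ln 2 hash functions. The sketch below is only that textbook estimate; RedisBloom's actual allocation may differ, and `bloomSizing` is a hypothetical helper.

```typescript
interface BloomSizing {
  bits: number;      // optimal filter size in bits
  hashes: number;    // optimal number of hash functions
  mebibytes: number; // the same size expressed in MiB
}

// Textbook optimal sizing for a Bloom filter holding `capacity` items
// at a target false-positive probability `errorRate`.
function bloomSizing(capacity: number, errorRate: number): BloomSizing {
  const bits = Math.ceil((-capacity * Math.log(errorRate)) / (Math.LN2 * Math.LN2));
  const hashes = Math.round((bits / capacity) * Math.LN2);
  return { bits, hashes, mebibytes: bits / 8 / (1024 * 1024) };
}

const national = bloomSizing(10_000_000, 0.01);
console.log(national.hashes, national.mebibytes.toFixed(1));
```

By this estimate the national blocklist filter needs roughly 9.6 bits per entry, on the order of 11 to 12 MiB in total.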

Shutdown sequence (SIGTERM, drain ≤ 25 s within 30 s grace)

1. Mark /health/ready false so the pod is removed from Service Endpoints.
2. Stop accepting new gRPC calls (close the listener gracefully); refuse new calls with UNAVAILABLE.
3. Drain in-flight gRPC calls for up to 15 s.
4. Close NATS consumers (durable offsets persist).
5. Flush the rate-counter Redis pipeline.
6. Flush the outbox queue.
7. Close the Postgres and Redis pools; revoke the Vault token.
8. Exit with code 0.
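
One way to keep the whole sequence inside the 25 s drain target is to give each phase a timeout equal to the smaller of its own cap and the time still left in the overall budget. The sketch below shows that bookkeeping only; `ShutdownBudget` is a hypothetical helper, the clock is passed in explicitly to keep the example deterministic, and the 10 s flush cap in the example is an illustrative value, not one from this document.

```typescript
// Tracks a single overall deadline and hands out per-phase timeouts from it.
class ShutdownBudget {
  private readonly deadlineMs: number;

  constructor(startedAtMs: number, totalMs: number) {
    this.deadlineMs = startedAtMs + totalMs;
  }

  // Milliseconds left before the overall deadline (never negative).
  remaining(nowMs: number): number {
    return Math.max(0, this.deadlineMs - nowMs);
  }

  // Timeout for a phase: its own cap, or whatever is left, whichever is smaller.
  phaseTimeout(capMs: number, nowMs: number): number {
    return Math.min(capMs, this.remaining(nowMs));
  }
}

// SIGTERM received at t=0 with a 25 s budget.
const budget = new ShutdownBudget(0, 25_000);
const drainTimeout = budget.phaseTimeout(15_000, 0);      // gRPC drain gets its full 15 s
const lateFlushTimeout = budget.phaseTimeout(10_000, 20_000); // only 5 s remain at t=20 s
console.log(drainTimeout, lateFlushTimeout); // 15000 5000
```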

8. Background workers

Workers run inside the firewall pods (no separate Deployment). Workers marked as singleton use a Redis lock for leader election so that only one instance runs per region; the remaining workers run on every pod:

Worker	Schedule	Singleton	Lock
`OutboxRelayWorker`	continuous (250 ms poll)	per pod (sharded)	none (sharded)
`RuleCacheRefreshWorker`	60 s	per pod	none
`BloomRebuildWorker`	on-demand + 02:30	yes	`fw:bloom:rebuild:lock`
`QuarantineExpiryWorker`	5 min	yes	`fw:quarantine:expiry:lock`
`FederationExportWorker`	02:00 Asia/Kabul	yes	`fw:fed:export:lock`
`AuditVerifierWorker`	03:30 Asia/Kabul	yes	`fw:audit:verify:lock`
`AuditArchiveWorker`	03:00 Asia/Kabul	yes	`fw:audit:archive:lock`
`PartitionMaintenanceWorker`	02:00 daily	yes	`fw:partition:lock`
`PeerHygieneScoreWorker`	5 min	yes	`fw:peer:score:lock`
`BindHeartbeatWatcherWorker`	30 s	yes	`fw:bind:watcher:lock`
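
The lock-based singleton pattern in the table can be sketched as follows. This is an in-memory stand-in, not the service's implementation: against real Redis the acquire step is typically `SET key owner NX PX ttl` and the release step a compare-and-delete script, and real client calls would be asynchronous. `InMemoryLockStore` and `runIfLeader` are hypothetical names, and the clock is injected to keep the example deterministic.

```typescript
interface Lease {
  owner: string;
  expiresAtMs: number;
}

class InMemoryLockStore {
  private leases = new Map<string, Lease>();

  // Mirrors SET NX PX: succeeds only when no unexpired lease exists for the key.
  tryAcquire(key: string, owner: string, ttlMs: number, nowMs: number): boolean {
    const lease = this.leases.get(key);
    if (lease !== undefined && lease.expiresAtMs > nowMs) return false;
    this.leases.set(key, { owner, expiresAtMs: nowMs + ttlMs });
    return true;
  }

  // Release only if the caller still owns the lease.
  release(key: string, owner: string): boolean {
    const lease = this.leases.get(key);
    if (lease === undefined || lease.owner !== owner) return false;
    this.leases.delete(key);
    return true;
  }
}

// Run the job only on the pod that wins the lock; report whether it ran.
function runIfLeader(
  store: InMemoryLockStore,
  key: string,
  podName: string,
  ttlMs: number,
  nowMs: number,
  job: () => void,
): boolean {
  if (!store.tryAcquire(key, podName, ttlMs, nowMs)) return false;
  try {
    job();
  } finally {
    store.release(key, podName);
  }
  return true;
}
```

The TTL bounds how long a crashed leader can hold the lock, so it should comfortably exceed the worker's expected run time.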

9. Region strategy

Per ADR-0004 §6:

Region	Role	Postgres	NATS	Redis	Connectors
`kbl`	Primary	Logical-replication source	JetStream cluster (R=3)	Cluster (3 masters / 3 replicas)	All MNO connectors active
`mzr`	Secondary	Read-only replica	Mirror for `FIREWALL_AUDIT` (R=2 added)	Region-local cluster	Standby connectors (active in failover)
`dxb`	Cold-archive only	Audit Parquet objects only	Leaf node (one-way mirror)	—	—

Audit-event JetStream stream FIREWALL_AUDIT: 3 replicas in kbl, 2 mirror replicas in mzr, leaf-mirrored to dxb. Loss of kbl does not lose audit evidence.

10. Infrastructure dependencies

Dependency	Version	Topology	Failure response
Postgres	16	Patroni HA (1 primary + 2 replicas per region); PgBouncer transaction pooling	Failover < 30 s; firewall fails closed during failover
Redis	7.2+	Cluster mode (3 masters + 3 replicas per region); RedisBloom module	Bloom + rate degraded; fall through to PG
NATS JetStream	2.10+	3-node cluster per region; mirror to other region; leaf to dxb	Outbox queues locally
Local LLM	vLLM 0.6+	2 GPU pods (NVIDIA L4), shared with compliance-engine	CLASSIFIER rules skip
MinIO	RELEASE.2024+	4-node erasure-coded; Object Lock Compliance enabled	Federation export postponed
Vault	1.15+	HA cluster in np-ctrl	New verdicts may fail-closed
SPIRE	1.9+	Agent per node + Server in np-ctrl	mTLS rotation paused; existing certs honoured
HSM (Thales Luna or AWS CloudHSM `dxb`)	n/a	2-partition for HA	Federation export postponed
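
The Redis and Postgres rows combine into a simple lookup pattern: consult the Bloom filter first, fall through to Postgres when Redis is degraded, and fail closed when Postgres is also unavailable. The sketch below illustrates that chain under several assumptions: the verdict names are hypothetical, "fail closed" is interpreted as returning BLOCK, a Bloom hit is confirmed in Postgres because Bloom filters can return false positives, and the lookups are synchronous only to keep the example short.

```typescript
type Verdict = "ALLOW" | "BLOCK";

function checkBlocklist(
  msisdn: string,
  bloomMightContain: (m: string) => boolean, // may throw when Redis is unavailable
  pgContains: (m: string) => boolean,        // may throw during Postgres failover
): Verdict {
  let needsPostgres: boolean;
  try {
    // A Bloom miss is definitive; a hit may be a false positive.
    needsPostgres = bloomMightContain(msisdn);
  } catch {
    // Redis degraded: fall through to Postgres for every lookup.
    needsPostgres = true;
  }
  if (!needsPostgres) return "ALLOW";
  try {
    return pgContains(msisdn) ? "BLOCK" : "ALLOW";
  } catch {
    // Postgres unavailable as well: fail closed.
    return "BLOCK";
  }
}
```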

11. Canary / rollout strategy

Per SERVICE_RISK_REGISTER §R-OPS-01, the firewall is a national choke-point — rollouts are conservative.

1. Build & sign: the image is built reproducibly and signed with cosign; an SBOM is generated; Trivy and osv-scanner must pass.
2. Staging: full deployment to the staging cluster; integration, E2E, and load tests must pass.
3. Production canary at 5% in mzr (lower volume): Argo Rollouts with traffic split via Istio VirtualService weight; observe for 30 minutes.
4. Auto-rollback if any of:
   - P95 of firewall_request_duration_seconds{rpc="FilterInbound"} (histogram_quantile over the _bucket series) > 0.030 s for 5 min
   - rate of firewall_errors_total{code!="OK"} > 0.1% of requests for 5 min
   - firewall_audit_chain_break_total > 0
   - Pod restart loop detected
5. Promote the canary to 25% in mzr; observe for 30 min.
6. Promote to 100% in mzr; observe for 60 min.
7. Replicate to kbl with the same gradient (5% → 25% → 100%).

Total rollout window: ~4 hours of observation (30 + 30 + 60 min in each of the two regions). A manual roll-forward override is available with dual approval.
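
The auto-rollback conditions can be expressed as a small gate over a window of samples. The sketch below is illustrative only: in the described setup the analysis would be performed by Argo Rollouts against Prometheus, the one-sample-per-minute shape is an assumption, and `CanaryWindow`, `breachedFor`, and `rollbackReason` are hypothetical names.

```typescript
interface CanaryWindow {
  p95Seconds: number[];        // one sample per minute, most recent last
  errorRatio: number[];        // fraction of non-OK responses per minute
  auditChainBreaks: number;    // firewall_audit_chain_break_total during the canary
  restartLoopDetected: boolean;
}

// True when each of the last `minutes` samples exceeds the limit.
function breachedFor(samples: number[], limit: number, minutes: number): boolean {
  if (samples.length < minutes) return false;
  return samples.slice(-minutes).every((sample) => sample > limit);
}

// Returns the reason for rollback, or null when the canary may continue.
function rollbackReason(window: CanaryWindow): string | null {
  if (breachedFor(window.p95Seconds, 0.030, 5)) return "P95 latency above 30 ms for 5 min";
  if (breachedFor(window.errorRatio, 0.001, 5)) return "error rate above 0.1% for 5 min";
  if (window.auditChainBreaks > 0) return "audit chain break";
  if (window.restartLoopDetected) return "pod restart loop";
  return null;
}
```

A single sample back under the limit resets the five-minute condition, which matches the "for 5 min" wording of the latency and error-rate rules.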

12. Storage & secrets

Artifact	Mount	Source
TLS server cert + private key (mTLS)	injected via SPIRE workload SVID API	SPIRE
HSM PKCS#11 token	`/opt/pkcs11/firewall-fed-signer.cfg`	K8s Secret (referenced; HSM holds keys)
Postgres credentials	env (Vault dynamic, 24 h TTL)	Vault
Redis credentials	env	Vault KV
NATS credentials	env	Vault KV
Per-MNO KEK refs (Vault Transit)	Vault Transit (referenced)	—
Event-signing Ed25519 key	env (rotated 90d)	Vault KV
