SMS Firewall Service — Deployment Topology

Version: 1.0
Status: Draft
Owner: Trust & Safety + SRE
Last Updated: 2026-04-21
Companion: SERVICE_OVERVIEW · LOCAL_DEV_SETUP · SECURITY_MODEL
Related ADR: ADR-0004 §5–§6


1. Runtime

| Aspect | Value |
| --- | --- |
| Language | TypeScript |
| Runtime | Node.js 22 LTS |
| Framework | NestJS 10 (gRPC + HTTP) |
| gRPC server | @grpc/grpc-js on port 50061 (data plane) and 50062 (control plane) |
| HTTP server | Fastify on port 3061 (admin REST) |
| Metrics | Prometheus on port 9061 |
| Container base image | gcr.io/distroless/nodejs22-debian12:nonroot |
| Image registry | registry.ghasi.af/platform/sms-firewall-service |
| Health endpoints | /health/live, /health/ready |
| OS user | nonroot (UID 65532) |

2. Kubernetes Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: sms-firewall-service
  namespace: np-data
  labels:
    app: sms-firewall-service
    tier: data-plane
    sovereignty: national
spec:
  replicas: 5
  selector:
    matchLabels: { app: sms-firewall-service }
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0 # zero downtime; firewall is a national choke-point
      maxSurge: 2
  template:
    metadata:
      labels:
        app: sms-firewall-service
        tier: data-plane
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9061"
        prometheus.io/path: "/metrics"
        spire.io/managed: "true"
    spec:
      serviceAccountName: sms-firewall-service
      automountServiceAccountToken: false
      securityContext:
        runAsNonRoot: true
        runAsUser: 65532
        seccompProfile: { type: RuntimeDefault }
        fsGroup: 65532
      affinity:
        podAntiAffinity:
          # NOTE: "required" with a zone topologyKey allows at most one firewall pod per zone
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - { key: app, operator: In, values: [sms-firewall-service] }
              topologyKey: topology.kubernetes.io/zone
      nodeSelector:
        node-pool: np-data
        sovereignty: af-only
      tolerations:
        - key: node-pool
          operator: Equal
          value: np-data
          effect: NoSchedule
      priorityClassName: data-plane-critical
      containers:
        - name: firewall
          image: registry.ghasi.af/platform/sms-firewall-service:1.0.0
          imagePullPolicy: IfNotPresent
          ports:
            - { containerPort: 50061, name: grpc-data }
            - { containerPort: 50062, name: grpc-ctrl }
            - { containerPort: 3061, name: http-admin }
            - { containerPort: 9061, name: metrics }
          env:
            - name: NODE_ENV
              value: production
            - name: LOG_LEVEL
              value: info
            - name: GRPC_DATA_PORT
              value: "50061"
            - name: GRPC_CTRL_PORT
              value: "50062"
            - name: HTTP_PORT
              value: "3061"
            - name: METRICS_PORT
              value: "9061"
            - name: REGION
              # NOTE: fieldRef reads pod labels only; topology.kubernetes.io/region is
              # normally a node label and is not set in this pod template
              valueFrom: { fieldRef: { fieldPath: metadata.labels['topology.kubernetes.io/region'] } }
            - name: DATABASE_URL
              valueFrom: { secretKeyRef: { name: firewall-db, key: url } }
            - name: REDIS_URL
              valueFrom: { secretKeyRef: { name: firewall-redis, key: url } }
            - name: NATS_URL
              valueFrom: { secretKeyRef: { name: firewall-nats, key: url } }
            - name: VAULT_ADDR
              value: https://vault.np-ctrl.svc.cluster.local:8200
            - name: SPIFFE_ENDPOINT_SOCKET
              value: unix:///run/spire/agent-sockets/spire-agent.sock
            - name: HSM_PKCS11_URI
              value: pkcs11:object=ghasi-firewall-fed-signer
            - name: LOCAL_LLM_URL
              value: http://local-llm-service.np-data.svc.cluster.local:8000
            - name: NUMBER_INTEL_URL
              value: number-intelligence-service.np-ctrl.svc.cluster.local:50080
            - name: SENDER_ID_REGISTRY_URL
              value: sender-id-registry-service.np-ctrl.svc.cluster.local:50081
            - name: EVAL_BUDGET_MS
              value: "30"
            - name: TRANSIT_BUDGET_MS
              value: "50"
            - name: PER_BIND_CONCURRENCY
              value: "200"
            - name: EXTERNAL_LLM_ENABLED
              value: "false" # MUST be false; service refuses to boot otherwise
          resources:
            requests:
              cpu: 1000m
              memory: 1Gi
              ephemeral-storage: 2Gi
            limits:
              cpu: 4000m
              memory: 4Gi
              ephemeral-storage: 4Gi
          livenessProbe:
            httpGet: { path: /health/live, port: http-admin }
            initialDelaySeconds: 20
            periodSeconds: 10
            timeoutSeconds: 3
            failureThreshold: 3
          readinessProbe:
            httpGet: { path: /health/ready, port: http-admin }
            initialDelaySeconds: 15
            periodSeconds: 5
            timeoutSeconds: 2
            failureThreshold: 2
          startupProbe:
            httpGet: { path: /health/ready, port: http-admin }
            initialDelaySeconds: 30
            periodSeconds: 5
            failureThreshold: 30 # Bloom rebuild + rule cache warm-up may take 60–90s
          volumeMounts:
            - { name: spire-agent-socket, mountPath: /run/spire/agent-sockets, readOnly: true }
            - { name: hsm-pkcs11, mountPath: /opt/pkcs11, readOnly: true }
            - { name: tmp, mountPath: /tmp }
          lifecycle:
            preStop:
              exec:
                command: ["/usr/bin/node", "/app/dist/scripts/graceful-shutdown.js", "--drain-seconds=15"]
      volumes:
        - name: spire-agent-socket
          hostPath: { path: /run/spire/agent-sockets, type: Directory }
        - name: hsm-pkcs11
          secret: { secretName: firewall-hsm-pkcs11 }
        - name: tmp
          emptyDir: { medium: Memory, sizeLimit: 256Mi }
      terminationGracePeriodSeconds: 30
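
The manifest states that the service refuses to boot unless EXTERNAL_LLM_ENABLED is "false". A minimal sketch of what such a boot-time check might look like is shown here. The names validateEnv, requirePositiveInt and FirewallConfig are hypothetical, and treating a missing value as a refusal is an assumption that goes further than the manifest comment.

```typescript
// Sketch only: validateEnv, requirePositiveInt and FirewallConfig are hypothetical names.
type Env = Record<string, string | undefined>;

interface FirewallConfig {
  grpcDataPort: number;
  grpcCtrlPort: number;
  evalBudgetMs: number;
  transitBudgetMs: number;
  perBindConcurrency: number;
}

// Parse a required positive integer; the error names the offending variable.
function requirePositiveInt(env: Env, name: string): number {
  const raw = env[name];
  const value = Number(raw);
  if (raw === undefined || raw.trim() === "" || !Number.isInteger(value) || value <= 0) {
    throw new Error(`${name} must be a positive integer, got ${String(raw)}`);
  }
  return value;
}

function validateEnv(env: Env): FirewallConfig {
  // Assumption: fail closed, so a missing or misspelled value also aborts boot,
  // not only the literal "true".
  if (env["EXTERNAL_LLM_ENABLED"] !== "false") {
    throw new Error('EXTERNAL_LLM_ENABLED must be "false"; refusing to boot');
  }
  return {
    grpcDataPort: requirePositiveInt(env, "GRPC_DATA_PORT"),
    grpcCtrlPort: requirePositiveInt(env, "GRPC_CTRL_PORT"),
    evalBudgetMs: requirePositiveInt(env, "EVAL_BUDGET_MS"),
    transitBudgetMs: requirePositiveInt(env, "TRANSIT_BUDGET_MS"),
    perBindConcurrency: requirePositiveInt(env, "PER_BIND_CONCURRENCY"),
  };
}
```

In a real process the argument would be the process environment; passing it in as a plain object keeps the check easy to unit-test.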

3. HorizontalPodAutoscaler (KEDA + Prometheus adapter)

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: sms-firewall-service
  namespace: np-data
spec:
  scaleTargetRef: { name: sms-firewall-service }
  minReplicaCount: 5
  maxReplicaCount: 15
  pollingInterval: 15
  cooldownPeriod: 120
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.np-obs.svc:9090
        metricName: firewall_filter_inbound_p95
        threshold: "0.025" # scale up when P95 > 25ms (5ms before SLO breach)
        query: |
          histogram_quantile(0.95,
            sum by (le) (
              rate(firewall_request_duration_seconds_bucket{rpc="FilterInbound"}[2m])
            )
          )
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.np-obs.svc:9090
        metricName: firewall_inflight_requests
        threshold: "150" # scale up when in-flight per pod > 150 (cap is 200)
        query: sum(firewall_inflight_requests) / count(up{app="sms-firewall-service"})
    - type: cpu
      metricType: Utilization
      metadata: { value: "70" }

Per region:

  • kbl (primary, ~80% national volume): min=5, max=15
  • mzr (secondary): min=3, max=8
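
KEDA hands its triggers to a HorizontalPodAutoscaler, whose documented rule is desired = ceil(currentReplicas × currentMetric / targetMetric), taking the largest proposal when several metrics are configured. The sketch below applies that rule with the per-region bounds listed above. The metric readings are made-up inputs, and the HPA tolerance band and stabilisation windows are deliberately ignored.

```typescript
interface MetricReading {
  current: number; // observed value (same unit as target)
  target: number;  // trigger threshold
}

// Simplified HPA rule: each metric proposes ceil(replicas * current / target);
// the largest proposal wins and is clamped to [min, max].
function desiredReplicas(
  currentReplicas: number,
  readings: MetricReading[],
  min: number,
  max: number,
): number {
  const proposals = readings.map((r) =>
    Math.ceil(currentReplicas * (r.current / r.target)),
  );
  const largest = Math.max(0, ...proposals);
  return Math.min(max, Math.max(min, largest));
}

// kbl at its floor of 5 pods, with illustrative readings:
const kbl = desiredReplicas(
  5,
  [
    { current: 0.032, target: 0.025 }, // P95 latency in seconds
    { current: 120, target: 150 },     // in-flight requests per pod
    { current: 60, target: 70 },       // CPU utilisation in percent
  ],
  5,
  15,
);
```

With these readings the latency trigger dominates (5 × 1.28 rounds up to 7), so kbl would scale from 5 to 7 pods; mzr would apply the same rule with its own bounds of 3 and 8.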

4. PodDisruptionBudget

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: sms-firewall-service-pdb
  namespace: np-data
spec:
  minAvailable: 4 # never go below 4 in kbl during voluntary disruption
  selector:
    matchLabels: { app: sms-firewall-service }
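
With an integer minAvailable, the number of voluntary evictions Kubernetes permits at any moment is the count of healthy pods minus that floor, never below zero. The small sketch below works through the replica counts used in this document. Whether mzr carries its own budget is not stated here, so the mzr line is purely illustrative.

```typescript
// Evictions allowed right now under a PodDisruptionBudget with integer minAvailable.
function disruptionsAllowed(healthyPods: number, minAvailable: number): number {
  return Math.max(0, healthyPods - minAvailable);
}

const kblAtFloor = disruptionsAllowed(5, 4);   // kbl at minReplicaCount
const kblAtCeiling = disruptionsAllowed(15, 4); // kbl fully scaled out
// Illustrative only: the same budget applied to mzr at its floor of 3 pods.
const mzrAtFloor = disruptionsAllowed(3, 4);
```

At the kbl floor a node drain can evict one firewall pod at a time. If the same minAvailable were applied to a 3-pod mzr deployment, no voluntary eviction would be permitted, so mzr would need a lower floor of its own.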

5. Services

apiVersion: v1
kind: Service
metadata:
  name: sms-firewall-grpc-data
  namespace: np-data
spec:
  selector: { app: sms-firewall-service }
  ports:
    - { name: grpc-data, port: 50061, targetPort: grpc-data, protocol: TCP }
  type: ClusterIP
  # Intent: keep connector→firewall traffic close. NOTE: Local routes only to endpoints
  # on the caller's own node (not same-zone) and drops traffic if that node has none.
  internalTrafficPolicy: Local
---
apiVersion: v1
kind: Service
metadata:
  name: sms-firewall-grpc-ctrl
  namespace: np-data
spec:
  selector: { app: sms-firewall-service }
  ports:
    - { name: grpc-ctrl, port: 50062, targetPort: grpc-ctrl }
  type: ClusterIP
---
apiVersion: v1
kind: Service
metadata:
  name: sms-firewall-http
  namespace: np-data
spec:
  selector: { app: sms-firewall-service }
  ports:
    - { name: http-admin, port: 3061, targetPort: http-admin }
  type: ClusterIP

A region-local headless Service is also published for the gRPC service so that connectors can use client-side DNS round-robin and bypass kube-proxy iptables (saves ~1 ms per call).
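
To illustrate the client-side round-robin idea, the sketch below rotates through a list of pod addresses such as a headless DNS name would resolve to. RoundRobinPicker is a hypothetical helper and the addresses are placeholders. In practice a connector would more likely rely on the gRPC client's built-in round_robin load-balancing policy than on hand-written code.

```typescript
// Hypothetical helper: rotates through resolved pod addresses in order.
class RoundRobinPicker {
  private addresses: string[];
  private cursor = 0;

  constructor(addresses: string[]) {
    this.addresses = [...addresses];
  }

  // Replace the address list, for example after a DNS re-resolution.
  update(addresses: string[]): void {
    this.addresses = [...addresses];
    this.cursor = 0;
  }

  // Return the next address, wrapping around at the end of the list.
  pick(): string {
    if (this.addresses.length === 0) {
      throw new Error("no firewall endpoints resolved");
    }
    const address = this.addresses[this.cursor];
    this.cursor = (this.cursor + 1) % this.addresses.length;
    return address;
  }
}

// Placeholder pod IPs on the data-plane port.
const picker = new RoundRobinPicker([
  "10.0.0.1:50061",
  "10.0.0.2:50061",
  "10.0.0.3:50061",
]);
const firstFour = [picker.pick(), picker.pick(), picker.pick(), picker.pick()];
```

The fourth pick wraps back to the first address, which is what spreads calls evenly across pods without a proxy hop.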


6. NetworkPolicy

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: sms-firewall-service
  namespace: np-data
spec:
  podSelector:
    matchLabels: { app: sms-firewall-service }
  policyTypes: [Ingress, Egress]
  ingress:
    # Data-plane gRPC: only from smpp-connector pods in same namespace
    - from:
        - podSelector:
            matchExpressions:
              - { key: app, operator: In, values:
                  [smpp-connector-awcc-rx, smpp-connector-awcc-trx,
                   smpp-connector-roshan-rx, smpp-connector-roshan-trx,
                   smpp-connector-etisalat-rx, smpp-connector-etisalat-trx,
                   smpp-connector-mtn-af-rx, smpp-connector-mtn-af-trx,
                   smpp-connector-salaam-rx, smpp-connector-salaam-trx,
                   smpp-connector-transit-rx] }
      ports: [{ port: 50061, protocol: TCP }]
    # Control-plane gRPC: routing-engine + channel-router-service + fraud-intel + cdr-mediation
    - from:
        - namespaceSelector: { matchLabels: { name: np-ctrl } }
          podSelector:
            matchExpressions:
              - { key: app, operator: In, values:
                  [routing-engine, channel-router-service] }
        - namespaceSelector: { matchLabels: { name: np-data } }
          podSelector:
            matchExpressions:
              - { key: app, operator: In, values:
                  [fraud-intel-service, cdr-mediation-service] }
      ports: [{ port: 50062 }]
    # Admin REST: only from Kong (in np-edge)
    - from:
        - namespaceSelector: { matchLabels: { name: np-edge } }
          podSelector: { matchLabels: { app: kong } }
      ports: [{ port: 3061 }]
    # Metrics: from prometheus only
    - from:
        - namespaceSelector: { matchLabels: { name: np-obs } }
          podSelector: { matchLabels: { app: prometheus } }
      ports: [{ port: 9061 }]
  egress:
    # Postgres (region-local primary)
    - to:
        - namespaceSelector: { matchLabels: { name: np-data } }
          podSelector: { matchLabels: { app: postgres-firewall } }
      ports: [{ port: 5432 }]
    # Redis (region-local cluster)
    - to:
        - namespaceSelector: { matchLabels: { name: np-data } }
          podSelector: { matchLabels: { app: redis-firewall } }
      ports: [{ port: 6379 }]
    # NATS JetStream
    - to:
        - namespaceSelector: { matchLabels: { name: np-data } }
          podSelector: { matchLabels: { app: nats } }
      ports: [{ port: 4222 }]
    # Local LLM
    - to:
        - podSelector: { matchLabels: { app: local-llm-service } }
      ports: [{ port: 8000 }]
    # Vault (control-plane)
    - to:
        - namespaceSelector: { matchLabels: { name: np-ctrl } }
          podSelector: { matchLabels: { app: vault } }
      ports: [{ port: 8200 }]
    # Adjacent services
    - to:
        - namespaceSelector: { matchLabels: { name: np-ctrl } }
          podSelector:
            matchExpressions:
              - { key: app, operator: In, values:
                  [number-intelligence-service, sender-id-registry-service] }
      ports: [{ port: 50080 }, { port: 50081 }]
    # MinIO (audit + federation)
    - to:
        - namespaceSelector: { matchLabels: { name: np-data } }
          podSelector: { matchLabels: { app: minio } }
      ports: [{ port: 9000 }]
    # SPIRE agent: reached over a Unix domain socket on a hostPath mount, so no egress rule is needed
    # NOTE: cluster DNS (port 53) is not listed here; it must be allowed by another policy
    # for the *.svc.cluster.local names used in the Deployment env to resolve
    # All other egress is blocked (fail-closed network)

No external egress. No internet. All traffic stays within the platform.


7. Pod lifecycle

Startup sequence (target: ready in ≤ 90 s)

  1. Process boot, env validation; abort if EXTERNAL_LLM_ENABLED=true.
  2. Fetch SVID via SPIRE agent UDS.
  3. Connect Postgres + Redis + NATS; verify schemas + ping.
  4. Acquire Vault tokens for KEK and HSM access; verify HSM signer.
  5. Load active rules from firewall.rules WHERE enabled=TRUE into in-process cache.
  6. Rebuild Bloom filters: BF.RESERVE fw:blocklist:national 0.01 10000000 then bulk-load from firewall.blocklist_entries WHERE active=TRUE.
  7. Subscribe to NATS consumers (consent.dnd.snapshot.v1, fraud.detected.*, regulator.blocklist.published.v1, sender.id.*).
  8. Register heartbeat with firewall.mno_bind_registry. This step applies only to pods that represent a connector binding; firewall pods skip it.
  9. Start gRPC + HTTP servers; mark /health/ready true.
  10. Begin processing.
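
Step 6 reserves a filter for 10,000,000 entries at a 1% false-positive rate. As a rough feel for what that costs, the sketch below applies the textbook Bloom filter sizing formulas, m = −n·ln(p) / (ln 2)² bits and k = (m/n)·ln 2 hash functions. This is an approximation only: RedisBloom chooses its own internal layout, so the memory it actually reports can differ. The function name bloomSizing is hypothetical.

```typescript
interface BloomSizing {
  bits: number;      // total bits in the filter
  hashes: number;    // number of hash functions
  mebibytes: number; // bits expressed as MiB
}

// Textbook sizing for a classic (non-scalable) Bloom filter.
function bloomSizing(capacity: number, errorRate: number): BloomSizing {
  const bits = Math.ceil((-capacity * Math.log(errorRate)) / (Math.LN2 * Math.LN2));
  const hashes = Math.round((bits / capacity) * Math.LN2);
  return { bits, hashes, mebibytes: bits / 8 / 1024 / 1024 };
}

// Parameters from: BF.RESERVE fw:blocklist:national 0.01 10000000
const national = bloomSizing(10_000_000, 0.01);
```

For these parameters the formulas give about 9.6 bits per entry, which works out to roughly 11.4 MiB and 7 hash functions, small enough that a rebuild at startup is bounded mostly by the bulk load from Postgres rather than by memory.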

Shutdown sequence (SIGTERM, drain ≤ 25 s within 30 s grace)

  1. Mark /health/ready false; remove from Service Endpoints.
  2. Stop accepting new gRPC calls (graceful close listener); refuse with UNAVAILABLE.
  3. Drain in-flight gRPC up to 15 s.
  4. Close NATS consumers (durable offsets persist).
  5. Flush rate-counter Redis pipeline.
  6. Flush outbox queue.
  7. Close the Postgres and Redis pools; revoke the Vault token.
  8. Process exit 0.
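
One way to keep the shutdown sequence honest is to give every step a time budget and check that the total fits the 25 s drain limit inside the 30 s grace period. The sketch below does that arithmetic. Only the 15 s gRPC drain comes from the list above; every other per-step budget is a placeholder chosen for illustration, and planShutdown is a hypothetical helper.

```typescript
interface ShutdownStep {
  name: string;
  budgetMs: number;
}

interface ShutdownPlan {
  totalMs: number;         // sum of all step budgets
  fitsDrainLimit: boolean; // true if totalMs is within the drain limit
  headroomMs: number;      // grace period left after all steps
}

function planShutdown(
  steps: ShutdownStep[],
  drainLimitMs: number,
  gracePeriodMs: number,
): ShutdownPlan {
  const totalMs = steps.reduce((sum, step) => sum + step.budgetMs, 0);
  return {
    totalMs,
    fitsDrainLimit: totalMs <= drainLimitMs,
    headroomMs: gracePeriodMs - totalMs,
  };
}

// Only the 15 s gRPC drain is taken from the sequence; the rest are placeholders.
const shutdownSteps: ShutdownStep[] = [
  { name: "mark not ready, stop accepting calls", budgetMs: 1000 },
  { name: "drain in-flight gRPC", budgetMs: 15000 },
  { name: "close NATS consumers", budgetMs: 2000 },
  { name: "flush Redis rate counters", budgetMs: 2000 },
  { name: "flush outbox queue", budgetMs: 3000 },
  { name: "close pools, revoke Vault token", budgetMs: 2000 },
];
const shutdownPlan = planShutdown(shutdownSteps, 25000, 30000);
```

With these placeholder budgets the plan totals exactly 25 s and leaves 5 s of headroom. Kubernetes counts preStop hook time against terminationGracePeriodSeconds, so any time the hook spends draining has to be included in the same total.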

8. Background workers

Workers run inside the firewall pods (no separate deployment). Workers marked as singletons use leader election via a Redis lock so that only one instance runs per region; the others run on every pod:

| Worker | Schedule | Singleton | Lock |
| --- | --- | --- | --- |
| OutboxRelayWorker | continuous (250 ms poll) | per pod (sharded) | none (sharded) |
| RuleCacheRefreshWorker | 60 s | per pod | none |
| BloomRebuildWorker | on-demand + 02:30 | yes | fw:bloom:rebuild:lock |
| QuarantineExpiryWorker | 5 min | yes | fw:quarantine:expiry:lock |
| FederationExportWorker | 02:00 Asia/Kabul | yes | fw:fed:export:lock |
| AuditVerifierWorker | 03:30 Asia/Kabul | yes | fw:audit:verify:lock |
| AuditArchiveWorker | 03:00 Asia/Kabul | yes | fw:audit:archive:lock |
| PartitionMaintenanceWorker | 02:00 daily | yes | fw:partition:lock |
| PeerHygieneScoreWorker | 5 min | yes | fw:peer:score:lock |
| BindHeartbeatWatcherWorker | 30 s | yes | fw:bind:watcher:lock |
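
The singleton workers depend on a lock that only one pod can take per schedule tick. The sketch below shows the idea with an in-memory stand-in for the Redis command `SET key owner NX PX ttl`, so that it runs without a Redis server. InMemoryLockStore, runIfLeader, the pod names and the 60 s TTL are all assumptions made for illustration; the lock key is one of those listed in the table.

```typescript
interface LockEntry {
  owner: string;
  expiresAtMs: number;
}

// In-memory stand-in for Redis `SET key owner NX PX ttlMs`.
class InMemoryLockStore {
  private entries = new Map<string, LockEntry>();

  // Succeeds only if the key is absent or its previous TTL has expired.
  tryAcquire(key: string, owner: string, ttlMs: number, nowMs: number): boolean {
    const existing = this.entries.get(key);
    if (existing !== undefined && existing.expiresAtMs > nowMs) {
      return false; // another pod still holds the lock
    }
    this.entries.set(key, { owner, expiresAtMs: nowMs + ttlMs });
    return true;
  }
}

// Run the job only on the pod that wins the lock for this tick.
// The lock is left to expire so that other pods firing the same tick skip it.
function runIfLeader(
  store: InMemoryLockStore,
  lockKey: string,
  podName: string,
  ttlMs: number,
  nowMs: number,
  job: () => void,
): boolean {
  if (!store.tryAcquire(lockKey, podName, ttlMs, nowMs)) {
    return false;
  }
  job();
  return true;
}

const lockStore = new InMemoryLockStore();
const executedBy: string[] = [];
const lockKey = "fw:quarantine:expiry:lock";
const wonByA = runIfLeader(lockStore, lockKey, "pod-a", 60000, 0, () => { executedBy.push("pod-a"); });
const wonByB = runIfLeader(lockStore, lockKey, "pod-b", 60000, 1000, () => { executedBy.push("pod-b"); });
const wonLater = runIfLeader(lockStore, lockKey, "pod-b", 60000, 61000, () => { executedBy.push("pod-b"); });
```

Here pod-a wins the first tick, pod-b is turned away one second later, and pod-b can take over once the TTL has lapsed. A production version would also need to handle jobs that outlive the TTL, which this sketch does not attempt.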

9. Region strategy

Per ADR-0004 §6:

| Region | Role | Postgres | NATS | Redis | Connectors |
| --- | --- | --- | --- | --- | --- |
| kbl | Primary | Logical-replication source | JetStream cluster (R=3) | Cluster (3 masters / 3 replicas) | All MNO connectors active |
| mzr | Secondary | Read-only replica | Mirror for FIREWALL_AUDIT (R=2 added) | Region-local cluster | Standby connectors (active in failover) |
| dxb | Cold-archive only | Audit Parquet objects only | Leaf node (one-way mirror) | | |

Audit-event JetStream stream FIREWALL_AUDIT: 3 replicas in kbl, 2 mirror replicas in mzr, leaf-mirrored to dxb. Loss of kbl does not lose audit evidence.


10. Infrastructure dependencies

| Dependency | Version | Topology | Failure response |
| --- | --- | --- | --- |
| Postgres | 16 | Patroni HA (1 primary + 2 replicas per region); pgBouncer transaction pooling | Failover < 30 s; firewall fail-closed during failover |
| Redis | 7.2+ | Cluster mode (3 masters + 3 replicas per region); RedisBloom module | Bloom + rate degraded; fall through to PG |
| NATS JetStream | 2.10+ | 3-node cluster per region; mirror to other region; leaf to dxb | Outbox queues locally |
| Local LLM | vLLM 0.6+ | 2 GPU pods (NVIDIA L4), shared with compliance-engine | CLASSIFIER rules skip |
| MinIO | RELEASE.2024+ | 4-node erasure-coded; Object Lock Compliance enabled | Federation export postponed |
| Vault | 1.15+ | HA cluster in np-ctrl | New verdicts may fail-closed |
| SPIRE | 1.9+ | Agent per node + Server in np-ctrl | mTLS rotation paused; existing certs honoured |
| HSM (Thales Luna or AWS CloudHSM dxb) | n/a | 2-partition for HA | Federation export postponed |

11. Canary / rollout strategy

Per SERVICE_RISK_REGISTER §R-OPS-01, the firewall is a national choke-point — rollouts are conservative.

  1. Build & sign: image built reproducibly; signed with cosign; SBOM generated; Trivy + osv-scanner pass.
  2. Staging: full deployment to staging cluster; integration + E2E + load tests pass.
  3. Production canary 5% in mzr (lower volume): Argo Rollouts with traffic split via Istio VirtualService weight; observe for 30 minutes.
  4. Auto-rollback if any of:
    • histogram_quantile(0.95, sum by (le) (rate(firewall_request_duration_seconds_bucket{rpc="FilterInbound"}[5m]))) > 0.030 for 5 min
    • firewall_errors_total{code!="OK"} rate > 0.1% for 5 min
    • firewall_audit_chain_break_total > 0
    • Pod restart loop detected
  5. Promote canary to 25% in mzr; observe 30 min.
  6. Promote to 100% in mzr; observe 60 min.
  7. Replicate to kbl with same gradient (5% → 25% → 100%).

Total rollout window: ~3 hours (the mzr observation windows alone account for 2 hours: 30 + 30 + 60 min). Manual roll-forward override available with dual approval.
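
The auto-rollback conditions in step 4 can be read as a small decision function over recent metric samples. The sketch below is one possible reading, not the Argo Rollouts analysis itself: it assumes the caller passes the samples covering the last 5 minutes, treats "for 5 min" as "true for every sample in that window", and uses hypothetical names throughout.

```typescript
interface CanarySample {
  p95Seconds: number;       // FilterInbound P95 latency
  errorRatio: number;       // non-OK responses divided by all responses
  auditChainBreaks: number; // value of firewall_audit_chain_break_total
  restartLoop: boolean;     // pod restart loop detected
}

// `window` is assumed to hold the samples covering the last 5 minutes.
function rollbackReasons(window: CanarySample[]): string[] {
  if (window.length === 0) {
    return []; // no data: nothing to decide on in this sketch
  }
  const reasons: string[] = [];
  if (window.every((s) => s.p95Seconds > 0.030)) {
    reasons.push("P95 latency above 30 ms for the whole window");
  }
  if (window.every((s) => s.errorRatio > 0.001)) {
    reasons.push("error rate above 0.1% for the whole window");
  }
  if (window.some((s) => s.auditChainBreaks > 0)) {
    reasons.push("audit chain break detected");
  }
  if (window.some((s) => s.restartLoop)) {
    reasons.push("pod restart loop detected");
  }
  return reasons;
}

const healthy: CanarySample = { p95Seconds: 0.020, errorRatio: 0.0002, auditChainBreaks: 0, restartLoop: false };
const slow: CanarySample = { ...healthy, p95Seconds: 0.035 };

const sustainedSlow = rollbackReasons([slow, slow, slow]);
const briefSpike = rollbackReasons([slow, healthy, slow]);
```

A sustained latency breach yields one rollback reason, while a brief spike that recovers within the window yields none. The audit-chain and restart-loop conditions trip on a single occurrence because the list above attaches no duration to them.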


12. Storage & secrets

| Artifact | Mount | Source |
| --- | --- | --- |
| TLS server cert + private key (mTLS) | injected via SPIRE workload SVID API | SPIRE |
| HSM PKCS#11 token | /opt/pkcs11/firewall-fed-signer.cfg | K8s Secret (referenced; HSM holds keys) |
| Postgres credentials | env (Vault dynamic, 24 h TTL) | Vault |
| Redis credentials | env | Vault KV |
| NATS credentials | env | Vault KV |
| Per-MNO KEK refs (Vault Transit) | | Vault Transit (referenced) |
| Event-signing Ed25519 key | env (rotated 90d) | Vault KV |