numbering-service — Deployment Topology

Version: 1.0
Status: Draft
Owner: Commerce Engineering + Platform SRE
Last Updated: 2026-04-21
Companion: OBSERVABILITY · FAILURE_MODES · ../../docs/architecture/ADR-0004-national-backbone-resilience.md

1. Kubernetes Resources

1.1 API Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: numbering-service
  namespace: sms-platform
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector: { matchLabels: { app: numbering-service } }
  template:
    metadata:
      labels:
        app: numbering-service
        region: kbl                     # overridden to 'mzr' in mzr overlay
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "3021"
        prometheus.io/path: "/metrics"
    spec:
      serviceAccountName: numbering-service
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway
          labelSelector: { matchLabels: { app: numbering-service } }
      containers:
        - name: numbering-service
          image: ghcr.io/ghasi/numbering-service:${GIT_SHA}
          ports:
            - { containerPort: 50061, name: grpc }
            - { containerPort: 3021,  name: http }
          env:
            - { name: NODE_ENV,      value: production }
            - { name: LOG_LEVEL,     value: info }
            - { name: REGION_ID,     valueFrom: { fieldRef: { fieldPath: "metadata.labels['region']" } } }  # quoted: '[' and ']' are not allowed in plain scalars inside flow mappings
            - { name: GRPC_PORT,     value: "50061" }
            - { name: HTTP_PORT,     value: "3021" }
            - { name: DATABASE_URL,  valueFrom: { secretKeyRef: { name: numbering-db-secret, key: url } } }
            - { name: REDIS_URL,     valueFrom: { secretKeyRef: { name: numbering-redis-secret, key: url } } }
            - { name: NATS_URL,      valueFrom: { secretKeyRef: { name: nats-credentials, key: url } } }
            - { name: NATS_CREDS_PATH, value: /etc/nats/creds.jwt }
            - { name: S3_REGULATOR_BUCKET, value: ghasi-regulator-exports-kbl }  # overridden per region
            - { name: GRPC_TLS_ENABLED, value: "true" }
            - { name: TLS_CERT_PATH, value: /etc/tls/server.crt }
            - { name: TLS_KEY_PATH,  value: /etc/tls/server.key }
            - { name: TLS_CA_PATH,   value: /etc/tls/ca.crt }
            - { name: RESERVATION_RESERVE_TTL_SECS, value: "900" }
            - { name: RESERVATION_HOLD_TTL_SECS,    value: "86400" }
            - { name: QUARANTINE_MSISDN_DAYS,       value: "90" }
            - { name: QUARANTINE_SHORT_CODE_DAYS,   value: "30" }
            - { name: QUARANTINE_VANITY_DAYS,       value: "365" }
            - { name: QUARANTINE_ALPHA_DAYS,        value: "0" }
          envFrom:
            - secretRef: { name: numbering-vault-secrets }
          resources:
            requests: { cpu: 500m,  memory: 512Mi }
            limits:   { cpu: 2000m, memory: 1Gi }
          livenessProbe:
            httpGet: { path: /health/live,  port: http }
            initialDelaySeconds: 15
            periodSeconds: 10
          readinessProbe:
            httpGet: { path: /health/ready, port: http }
            initialDelaySeconds: 10
            periodSeconds: 5
            failureThreshold: 3
          volumeMounts:
            - { name: tls-certs,   mountPath: /etc/tls, readOnly: true }
            - { name: nats-creds,  mountPath: /etc/nats, readOnly: true }
      volumes:
        - name: tls-certs
          secret: { secretName: numbering-tls }
        - name: nats-creds
          secret: { secretName: numbering-nats-creds }

1.2 HPA

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata: { name: numbering-service-hpa, namespace: sms-platform }
spec:
  scaleTargetRef: { apiVersion: apps/v1, kind: Deployment, name: numbering-service }
  minReplicas: 3
  maxReplicas: 24
  metrics:
    - type: Resource
      resource: { name: cpu,    target: { type: Utilization, averageUtilization: 65 } }
    - type: Resource
      resource: { name: memory, target: { type: Utilization, averageUtilization: 75 } }
    - type: Pods
      pods:
        metric: { name: numbering_validate_lease_requests_per_pod }
        target: { type: AverageValue, averageValue: "1500" }  # target 1500 RPS/pod
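
The autoscaler evaluates each of the three metrics on its own and acts on the largest proposal. The sketch below applies the documented Kubernetes rule, desired = ceil(currentReplicas × currentValue / targetValue), to the targets in this manifest. The sample readings are hypothetical, and the controller's tolerance band and stabilization windows are deliberately left out.

```typescript
// Sketch of the HorizontalPodAutoscaler scaling rule:
//   desired = ceil(currentReplicas * currentValue / targetValue)
// computed per metric, taking the largest result, then clamped to min/max.

interface MetricReading {
  name: string;
  current: number; // observed average across pods (hypothetical sample values)
  target: number;  // target taken from the HPA spec
}

function desiredReplicas(
  currentReplicas: number,
  readings: MetricReading[],
  minReplicas: number,
  maxReplicas: number,
): number {
  let desired = minReplicas;
  for (const r of readings) {
    const proposal = Math.ceil((currentReplicas * r.current) / r.target);
    if (proposal > desired) desired = proposal;
  }
  return Math.min(maxReplicas, Math.max(minReplicas, desired));
}

// Hypothetical snapshot: 3 pods, each serving 2,400 ValidateLease RPS.
const hpaExample = desiredReplicas(
  3,
  [
    { name: "cpu", current: 80, target: 65 },
    { name: "memory", current: 40, target: 75 },
    { name: "numbering_validate_lease_requests_per_pod", current: 2400, target: 1500 },
  ],
  3,
  24,
);
console.log(hpaExample); // 5: the per-pod RPS metric proposes the most replicas
```

In this snapshot CPU alone would ask for 4 replicas and memory for 2, so the per-pod request metric drives the decision.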

1.3 PodDisruptionBudget

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata: { name: numbering-service-pdb, namespace: sms-platform }
spec:
  minAvailable: 2
  selector: { matchLabels: { app: numbering-service } }
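
A PodDisruptionBudget limits voluntary evictions (for example node drains through the Eviction API); it does not govern the Deployment's own rolling update, which is bounded by maxSurge and maxUnavailable instead. As a small worked example, the number of evictions the budget permits at any moment is the healthy pod count minus minAvailable:

```typescript
// disruptionsAllowed = healthy pods - minAvailable (never below zero).
function disruptionsAllowed(healthyPods: number, minAvailable: number): number {
  return Math.max(0, healthyPods - minAvailable);
}

const PDB_MIN_AVAILABLE = 2; // from the PodDisruptionBudget above

for (const healthy of [2, 3, 6]) {
  const allowed = disruptionsAllowed(healthy, PDB_MIN_AVAILABLE);
  console.log(`${healthy} healthy pods -> ${allowed} evictions allowed`);
}
```

With the base replica count of 3, a node drain can evict one pod at a time; if only 2 pods are healthy, evictions are blocked until a replacement becomes Ready.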

1.4 CronJobs — Background Workers

apiVersion: batch/v1
kind: CronJob
metadata: { name: numbering-reservation-cleanup, namespace: sms-platform }
spec:
  schedule: "* * * * *"       # every minute (safety net; Redis keyspace expiry notifications are the primary trigger)
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: numbering-service
          restartPolicy: OnFailure
          containers:
            - name: cleanup
              image: ghcr.io/ghasi/numbering-service:${GIT_SHA}
              command: ["node", "dist/workers/reservation-cleanup.js"]
              envFrom: [ { secretRef: { name: numbering-vault-secrets } } ]
---
apiVersion: batch/v1
kind: CronJob
metadata: { name: numbering-quarantine-sweep, namespace: sms-platform }
spec:
  schedule: "*/5 * * * *"
  concurrencyPolicy: Forbid
  jobTemplate: { spec: { template: { spec: { containers: [ { name: sweep, image: ghcr.io/ghasi/numbering-service:${GIT_SHA}, command: ["node","dist/workers/quarantine-sweep.js"] } ], restartPolicy: OnFailure, serviceAccountName: numbering-service } } } }
---
apiVersion: batch/v1
kind: CronJob
metadata: { name: numbering-lease-renewal, namespace: sms-platform }
spec:
  schedule: "0 2 * * *"      # daily 02:00 UTC
  jobTemplate: { spec: { template: { spec: { containers: [ { name: renew, image: ghcr.io/ghasi/numbering-service:${GIT_SHA}, command: ["node","dist/workers/lease-renewal.js"] } ], restartPolicy: OnFailure, serviceAccountName: numbering-service } } } }
---
apiVersion: batch/v1
kind: CronJob
metadata: { name: numbering-regulator-export, namespace: sms-platform }
spec:
  schedule: "0 1 1 * *"      # 01:00 UTC on the 1st of each month
  jobTemplate: { spec: { template: { spec: { containers: [ { name: export, image: ghcr.io/ghasi/numbering-service:${GIT_SHA}, command: ["node","dist/workers/regulator-export.js"] } ], restartPolicy: OnFailure, serviceAccountName: numbering-service } } } }
---
apiVersion: batch/v1
kind: CronJob
metadata: { name: numbering-reconciliation, namespace: sms-platform }
spec:
  schedule: "0 3 * * *"      # nightly 03:00 UTC
  jobTemplate: { spec: { template: { spec: { containers: [ { name: recon, image: ghcr.io/ghasi/numbering-service:${GIT_SHA}, command: ["node","dist/workers/reconciliation.js"] } ], restartPolicy: OnFailure, serviceAccountName: numbering-service } } } }
---
apiVersion: batch/v1
kind: CronJob
metadata: { name: numbering-partition-maintenance, namespace: sms-platform }
spec:
  schedule: "30 3 * * *"     # nightly 03:30 UTC
  jobTemplate: { spec: { template: { spec: { containers: [ { name: part, image: ghcr.io/ghasi/numbering-service:${GIT_SHA}, command: ["node","dist/workers/partition-maintenance.js"] } ], restartPolicy: OnFailure, serviceAccountName: numbering-service } } } }
---
apiVersion: batch/v1
kind: CronJob
metadata: { name: numbering-audit-chain-verify, namespace: sms-platform }
spec:
  schedule: "0 4 * * *"      # nightly 04:00 UTC
  jobTemplate: { spec: { template: { spec: { containers: [ { name: verify, image: ghcr.io/ghasi/numbering-service:${GIT_SHA}, command: ["node","dist/workers/audit-chain-verify.js"] } ], restartPolicy: OnFailure, serviceAccountName: numbering-service } } } }

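Every worker acquires a Redis distributed lock before it starts work, so overlapping or duplicate runs (across replicas or consecutive schedules) cannot process the same data concurrently.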
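
This document does not specify how the workers implement their lock. One common single-instance pattern is `SET key token NX PX ttl` to acquire and a compare-and-delete to release. The sketch below shows that pattern under stated assumptions: `LockStore`, `InMemoryLockStore`, and `runExclusive` are hypothetical names, and the in-memory store stands in for Redis so the example runs on its own.

```typescript
// Hypothetical abstraction over the two Redis operations a lock needs.
// With a real client these would map to `SET key token NX PX ttl` and a
// Lua script that deletes the key only if it still holds `token`.
interface LockStore {
  setIfAbsent(key: string, token: string, ttlMs: number): Promise<boolean>;
  deleteIfEquals(key: string, token: string): Promise<boolean>;
}

// In-memory stand-in so the sketch runs without a Redis server.
class InMemoryLockStore implements LockStore {
  private entries: Record<string, { token: string; expiresAt: number }> = {};

  async setIfAbsent(key: string, token: string, ttlMs: number): Promise<boolean> {
    const existing = this.entries[key];
    if (existing !== undefined && existing.expiresAt > Date.now()) return false;
    this.entries[key] = { token, expiresAt: Date.now() + ttlMs };
    return true;
  }

  async deleteIfEquals(key: string, token: string): Promise<boolean> {
    const existing = this.entries[key];
    if (existing === undefined || existing.token !== token) return false;
    delete this.entries[key];
    return true;
  }
}

// Runs `work` only if the lock is free; returns false when another run holds it.
async function runExclusive(
  store: LockStore,
  lockKey: string,
  ttlMs: number,
  work: () => Promise<void>,
): Promise<boolean> {
  const token = `${Date.now()}-${Math.random().toString(36).slice(2)}`;
  if (!(await store.setIfAbsent(lockKey, token, ttlMs))) return false;
  try {
    await work();
  } finally {
    // Release only our own lock; a lock that expired and was re-acquired
    // by another run is left alone.
    await store.deleteIfEquals(lockKey, token);
  }
  return true;
}

// Usage sketch (key name and worker function are hypothetical):
//   await runExclusive(store, "lock:numbering:quarantine-sweep", 240000, sweepQuarantine);
```

The TTL should exceed the worst-case runtime of the job; if the lock expires while the work is still running, a second run can start alongside it.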

1.5 Services

apiVersion: v1
kind: Service
metadata: { name: numbering-service-grpc, namespace: sms-platform }
spec:
  selector: { app: numbering-service }
  ports: [ { name: grpc, port: 50061, targetPort: grpc } ]
  type: ClusterIP
---
apiVersion: v1
kind: Service
metadata: { name: numbering-service-http, namespace: sms-platform }
spec:
  selector: { app: numbering-service }
  ports: [ { name: http, port: 3021, targetPort: http } ]
  type: ClusterIP

1.6 NetworkPolicy

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata: { name: numbering-service-netpol, namespace: sms-platform }
spec:
  podSelector: { matchLabels: { app: numbering-service } }
  policyTypes: [ Ingress, Egress ]
  ingress:
    # gRPC callers (mTLS-authenticated)
    - from:
        - podSelector: { matchLabels: { app: sms-orchestrator } }
        - podSelector: { matchLabels: { app: routing-engine } }
        - podSelector: { matchLabels: { app: number-intelligence-service } }
        - podSelector: { matchLabels: { app: sender-id-registry-service } }
        - podSelector: { matchLabels: { app: compliance-engine } }
        - podSelector: { matchLabels: { app: billing-service } }
        - podSelector: { matchLabels: { app: customer-portal-bff } }
        - podSelector: { matchLabels: { app: admin-dashboard-bff } }
      ports: [ { port: 50061 } ]
    # REST via Kong
    - from:
        - podSelector: { matchLabels: { app: kong } }
      ports: [ { port: 3021 } ]
    # Prometheus scrape
    - from:
        - namespaceSelector: { matchLabels: { name: monitoring } }
      ports: [ { port: 3021 } ]
  egress:
    - to: [ { podSelector: { matchLabels: { app: postgresql } } } ]
      ports: [ { port: 5432 } ]
    - to: [ { podSelector: { matchLabels: { app: redis } } } ]
      ports: [ { port: 6379 } ]
    - to: [ { podSelector: { matchLabels: { app: nats } } } ]
      ports: [ { port: 4222 } ]
    # Vault, S3 (regulator exports) — restricted
    - to:
        - namespaceSelector: { matchLabels: { name: vault } }
      ports: [ { port: 8200 } ]
    - to:
        - ipBlock:
            cidr: 0.0.0.0/0
            except: [ 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16 ]
      ports: [ { port: 443 } ]   # S3 HTTPS

2. Infrastructure Dependencies

| Dependency | Version | Topology |
|---|---|---|
| PostgreSQL | 15+ | Primary per region + read replica; cross-region logical replication per ADR-0004 §14 |
| Redis | 7.0+ | Cluster mode; logical DB 7 (usable only on non-cluster instances, since Redis Cluster supports only DB 0); keyspace notifications enabled (`notify-keyspace-events Ex`) |
| NATS JetStream | 2.10+ | 3-node cluster per region with mesh; streams replicated cross-region |
| S3 (or S3-compatible) | — | `ghasi-regulator-exports-{region}` bucket with Object Lock (WORM), 7-year retention |
| Vault | 1.15+ | PKI engine for mTLS; Transit for regulator-export signing; DB engine for dynamic PG credentials |
| Kong | 3.4+ | REST ingress; JWT + rate-limiting-advanced plugins |
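
With `notify-keyspace-events Ex`, Redis publishes each key expiration on the channel `__keyevent@<db>__:expired` and sends the expired key name as the message. The sketch below shows how a subscriber might recognise a reservation expiry from such an event. The `numbering:reservation:` key prefix and both function names are hypothetical, and the database index is a parameter rather than a fixed value.

```typescript
// Redis publishes key expirations on "__keyevent@<db>__:expired" and sends the
// expired key name as the message body (requires notify-keyspace-events Ex).
// The key prefix below is a hypothetical naming scheme, not taken from the service.
const RESERVATION_KEY_PREFIX = "numbering:reservation:";

function expiredChannel(db: number): string {
  return `__keyevent@${db}__:expired`;
}

// Returns the reservation id if the event is a reservation expiry, else null.
function reservationIdFromExpiry(
  db: number,
  channel: string,
  expiredKey: string,
): string | null {
  if (channel !== expiredChannel(db)) return null;
  if (expiredKey.indexOf(RESERVATION_KEY_PREFIX) !== 0) return null;
  const id = expiredKey.slice(RESERVATION_KEY_PREFIX.length);
  return id.length > 0 ? id : null;
}

console.log(expiredChannel(7));
console.log(reservationIdFromExpiry(7, "__keyevent@7__:expired", "numbering:reservation:abc123"));
```

Keyspace notifications travel over Pub/Sub and are not redelivered to a subscriber that was disconnected, which is consistent with running the every-minute cleanup CronJob as a safety net.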

2.1 Regions

Per ADR-0004 §14:

| Region | Purpose | Primary for |
|---|---|---|
| `kbl` (Kabul) | Primary control plane | All control-plane writes; monthly regulator export |
| `mzr` (Mazar-i-Sharif) | Secondary control plane, DR | Read replicas; takes over on `kbl` failover |

Writes to numbers and leases use a synchronous cross-region quorum. Reservations are region-local.

3. Environment Variables

| Variable | Required | Default | Description |
|---|---|---|---|
| `NODE_ENV` | Yes | — | `production` / `staging` / `development` |
| `REGION_ID` | Yes | — | `kbl` / `mzr` |
| `GRPC_PORT` | No | `50061` | gRPC listener |
| `HTTP_PORT` | No | `3021` | REST + health + metrics |
| `DATABASE_URL` | Yes | — | PostgreSQL connection string (dynamic from Vault) |
| `REDIS_URL` | Yes | — | Redis connection string (includes DB 7) |
| `NATS_URL` | Yes | — | NATS server URL (regional cluster) |
| `NATS_CREDS_PATH` | Yes | — | Path to NATS credentials JWT |
| `S3_REGULATOR_BUCKET` | Yes | — | Bucket name for regulator exports |
| `GRPC_TLS_ENABLED` | No | `true` | Set `false` only in dev |
| `TLS_CERT_PATH`, `TLS_KEY_PATH`, `TLS_CA_PATH` | If TLS | — | mTLS material |
| `LOG_LEVEL` | No | `info` | `debug` / `info` / `warn` / `error` |
| `RESERVATION_RESERVE_TTL_SECS` | No | `900` | 15 min default |
| `RESERVATION_HOLD_TTL_SECS` | No | `86400` | 24 h default |
| `QUARANTINE_MSISDN_DAYS` | No | `90` | Cool-off for MSISDN |
| `QUARANTINE_SHORT_CODE_DAYS` | No | `30` | Cool-off for short code |
| `QUARANTINE_VANITY_DAYS` | No | `365` | Cool-off for vanity |
| `QUARANTINE_ALPHA_DAYS` | No | `0` | Cool-off for alpha |
| `OUTBOX_POLL_INTERVAL_MS` | No | `500` | Outbox relay tick |
| `RECONCILIATION_WINDOW_HOURS` | No | `24` | Nightly reconciliation look-back |
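
The Required and Default columns imply a fail-fast loader at startup: abort when a required variable is missing, and fall back to the documented default otherwise. The sketch below covers a subset of the variables. The names `loadConfig`, `NumberingConfig`, `required`, and `intOrDefault` are hypothetical and do not describe the service's actual configuration module.

```typescript
// Sketch of a startup config loader for a subset of the documented variables.
// Required variables fail fast; optional ones fall back to documented defaults.
type Env = Record<string, string | undefined>;

interface NumberingConfig {
  nodeEnv: string;
  regionId: "kbl" | "mzr";
  grpcPort: number;
  httpPort: number;
  databaseUrl: string;
  reserveTtlSecs: number;
  holdTtlSecs: number;
  quarantineMsisdnDays: number;
  outboxPollIntervalMs: number;
}

function required(env: Env, name: string): string {
  const value = env[name];
  if (value === undefined || value === "") {
    throw new Error(`Missing required environment variable ${name}`);
  }
  return value;
}

function intOrDefault(env: Env, name: string, fallback: number): number {
  const raw = env[name];
  if (raw === undefined || raw === "") return fallback;
  const parsed = Number(raw);
  if (!isFinite(parsed) || Math.floor(parsed) !== parsed || parsed < 0) {
    throw new Error(`${name} must be a non-negative integer, got "${raw}"`);
  }
  return parsed;
}

function loadConfig(env: Env): NumberingConfig {
  const regionId = required(env, "REGION_ID");
  if (regionId !== "kbl" && regionId !== "mzr") {
    throw new Error(`REGION_ID must be "kbl" or "mzr", got "${regionId}"`);
  }
  return {
    nodeEnv: required(env, "NODE_ENV"),
    regionId: regionId as "kbl" | "mzr",
    grpcPort: intOrDefault(env, "GRPC_PORT", 50061),
    httpPort: intOrDefault(env, "HTTP_PORT", 3021),
    databaseUrl: required(env, "DATABASE_URL"),
    reserveTtlSecs: intOrDefault(env, "RESERVATION_RESERVE_TTL_SECS", 900),
    holdTtlSecs: intOrDefault(env, "RESERVATION_HOLD_TTL_SECS", 86400),
    quarantineMsisdnDays: intOrDefault(env, "QUARANTINE_MSISDN_DAYS", 90),
    outboxPollIntervalMs: intOrDefault(env, "OUTBOX_POLL_INTERVAL_MS", 500),
  };
}

// In the service this would be called once at startup, e.g. loadConfig(process.env).
const sampleConfig = loadConfig({
  NODE_ENV: "production",
  REGION_ID: "kbl",
  DATABASE_URL: "postgres://example.invalid/numbering", // placeholder value
  RESERVATION_RESERVE_TTL_SECS: "600",
});
console.log(sampleConfig.grpcPort, sampleConfig.reserveTtlSecs);
```

Passing the environment in as an argument keeps the loader testable without touching the real process environment.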

4. Deployment Environments

| Environment | Replicas | PostgreSQL | Redis | NATS |
|---|---|---|---|---|
| Production (kbl) | 6–24 (HPA) | Primary + 2 replicas | Cluster 3 shards × 3 replicas | JetStream 3-node |
| Production (mzr) | 3–12 (HPA) | Secondary + 1 replica | Cluster 3 shards × 3 replicas | JetStream 3-node |
| Staging | 2 | Primary + 1 replica | Single-node | Single-node |
| Development | 1 | Single-node (Docker) | Single-node | Single-node |
| CI | 1 | Testcontainers | Testcontainers | Testcontainers |

5. Rollout & Rollback

Rolling update with maxSurge: 1, maxUnavailable: 0, so a new pod must become Ready before an old one is terminated. The PDB (minAvailable: 2) separately keeps voluntary disruptions such as node drains from leaving fewer than 2 pods available.
Graceful shutdown on SIGTERM: drain in-flight gRPC calls (30 s grace), flush the current outbox relay batch, then close the PG pool.
Rollback: redeploy the previous image tag. Schema changes follow expand/contract: no column is removed in the same release as its last code use.
Multi-region: deploy to mzr first as a canary, observe for 24 h, then deploy to kbl.
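
The SIGTERM sequence above is order-sensitive: assuming the outbox lives in PostgreSQL, the relay flush needs the pool, so the pool closes last. A minimal sketch of that ordering follows. `ShutdownDeps` and `gracefulShutdown` are hypothetical names; a real gRPC server, outbox relay, and PG pool would be adapted to the interface.

```typescript
// Sketch of the SIGTERM shutdown sequence. The three dependencies are
// hypothetical interfaces standing in for the real server, relay, and pool.
interface ShutdownDeps {
  drainGrpc(graceMs: number): Promise<void>; // stop accepting, wait for in-flight calls
  flushOutbox(): Promise<void>;              // publish the current relay batch
  closePgPool(): Promise<void>;              // release database connections
}

const GRPC_DRAIN_GRACE_MS = 30000; // 30 s grace from the rollout notes

// Runs the steps in order and returns the names of the steps that succeeded.
async function gracefulShutdown(deps: ShutdownDeps): Promise<string[]> {
  const completed: string[] = [];
  const steps: Array<[string, () => Promise<void>]> = [
    ["drain-grpc", () => deps.drainGrpc(GRPC_DRAIN_GRACE_MS)],
    ["flush-outbox", () => deps.flushOutbox()],
    ["close-pg-pool", () => deps.closePgPool()],
  ];
  for (const [name, run] of steps) {
    try {
      await run();
      completed.push(name);
    } catch (err) {
      // Keep going: a failed flush must not leave the PG pool open.
      console.error(`shutdown step ${name} failed`, err);
    }
  }
  return completed;
}

// Wiring sketch (not executed here):
//   process.once("SIGTERM", () => { void gracefulShutdown(deps); });
```

Kubernetes sends SIGKILL once `terminationGracePeriodSeconds` elapses (30 s by default), and the Deployment in §1.1 does not set it, so a full 30 s drain followed by a flush would likely need a larger value in the pod spec.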

6. Capacity Planning

Baseline assumptions (Phase 3 production targets):

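National peak SMS rate: 3,500 msg/s, which translates to 3,500 ValidateLease RPS (one call per message).
Hot-path cache hit ratio ≥ 95 %, so at most 5 % of calls reach PostgreSQL: 3,500 × 0.05 ≈ 175 QPS of reads.
Per-pod capacity: 1,500 ValidateLease RPS at 65 % CPU, so 3 pods cover the national peak (3,500 / 1,500 ≈ 2.3).
Baseline of 6 pods gives 2× headroom over that; the HPA scales up to 24 for bursts and safety margin.
PG connection pool: 20 per pod × 24 pods = 480 client connections at most; PgBouncer in transaction mode fronts PG and multiplexes them onto a target pool of 100.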
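
The figures above can be reproduced with a few lines of arithmetic. The inputs are the stated assumptions; the derived headroom and multiplexing ratios are computed here for illustration and are not additional targets.

```typescript
// Reproduces the capacity arithmetic from the baseline assumptions.
const peakRps = 3500;        // national peak ValidateLease RPS
const cacheHitRatio = 0.95;  // hot-path cache hit ratio (lower bound)
const perPodRps = 1500;      // per-pod capacity at 65 % CPU
const baselinePods = 6;      // production baseline
const maxPods = 24;          // HPA maxReplicas
const poolPerPod = 20;       // PG connections per pod
const pgBouncerPool = 100;   // PgBouncer target pool

const pgReadQps = Math.round(peakRps * (1 - cacheHitRatio)); // cache misses reaching PG
const podsAtPeak = Math.ceil(peakRps / perPodRps);           // minimum pods for peak
const baselineHeadroom = baselinePods / podsAtPeak;          // baseline vs. minimum
const maxClientConnections = poolPerPod * maxPods;           // worst-case client side
const multiplexRatio = maxClientConnections / pgBouncerPool; // clients per pooled connection

console.log({ pgReadQps, podsAtPeak, baselineHeadroom, maxClientConnections, multiplexRatio });
```

Three pods carry the peak, so the baseline of 6 runs at roughly half the per-pod capacity target, and at maximum scale PgBouncer shares each pooled connection among about 4.8 client connections.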

End of DEPLOYMENT_TOPOLOGY.md
