numbering-service — Deployment Topology
Version: 1.0 | Status: Draft | Owner: Commerce Engineering + Platform SRE | Last Updated: 2026-04-21 | Companion: OBSERVABILITY · FAILURE_MODES · ../../docs/architecture/ADR-0004-national-backbone-resilience.md
1. Kubernetes Resources
1.1 API Deployment
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: numbering-service
  namespace: sms-platform
spec:
  replicas: 3 # initial value; the HPA in §1.2 adjusts it at runtime
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector: { matchLabels: { app: numbering-service } }
  template:
    metadata:
      labels:
        app: numbering-service
        region: kbl # overridden to 'mzr' in mzr overlay
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "3021"
        prometheus.io/path: "/metrics"
    spec:
      serviceAccountName: numbering-service
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway
          labelSelector: { matchLabels: { app: numbering-service } }
      containers:
        - name: numbering-service
          image: ghcr.io/ghasi/numbering-service:${GIT_SHA}
          ports:
            - { containerPort: 50061, name: grpc }
            - { containerPort: 3021, name: http }
          env:
            - { name: NODE_ENV, value: production }
            - { name: LOG_LEVEL, value: info }
            # Quoted: brackets are not allowed in an unquoted scalar inside a flow mapping.
            - { name: REGION_ID, valueFrom: { fieldRef: { fieldPath: "metadata.labels['region']" } } }
            - { name: GRPC_PORT, value: "50061" }
            - { name: HTTP_PORT, value: "3021" }
            - { name: DATABASE_URL, valueFrom: { secretKeyRef: { name: numbering-db-secret, key: url } } }
            - { name: REDIS_URL, valueFrom: { secretKeyRef: { name: numbering-redis-secret, key: url } } }
            - { name: NATS_URL, valueFrom: { secretKeyRef: { name: nats-credentials, key: url } } }
            - { name: NATS_CREDS_PATH, value: /etc/nats/creds.jwt }
            - { name: S3_REGULATOR_BUCKET, value: ghasi-regulator-exports-kbl } # overridden per region
            - { name: GRPC_TLS_ENABLED, value: "true" }
            - { name: TLS_CERT_PATH, value: /etc/tls/server.crt }
            - { name: TLS_KEY_PATH, value: /etc/tls/server.key }
            - { name: TLS_CA_PATH, value: /etc/tls/ca.crt }
            - { name: RESERVATION_RESERVE_TTL_SECS, value: "900" }
            - { name: RESERVATION_HOLD_TTL_SECS, value: "86400" }
            - { name: QUARANTINE_MSISDN_DAYS, value: "90" }
            - { name: QUARANTINE_SHORT_CODE_DAYS, value: "30" }
            - { name: QUARANTINE_VANITY_DAYS, value: "365" }
            - { name: QUARANTINE_ALPHA_DAYS, value: "0" }
          envFrom:
            - secretRef: { name: numbering-vault-secrets }
          resources:
            requests: { cpu: 500m, memory: 512Mi }
            limits: { cpu: 2000m, memory: 1Gi }
          livenessProbe:
            httpGet: { path: /health/live, port: http }
            initialDelaySeconds: 15
            periodSeconds: 10
          readinessProbe:
            httpGet: { path: /health/ready, port: http }
            initialDelaySeconds: 10
            periodSeconds: 5
            failureThreshold: 3
          volumeMounts:
            - { name: tls-certs, mountPath: /etc/tls, readOnly: true }
            - { name: nats-creds, mountPath: /etc/nats, readOnly: true }
      volumes:
        - name: tls-certs
          secret: { secretName: numbering-tls }
        - name: nats-creds
          secret: { secretName: numbering-nats-creds }
```
1.2 HPA
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata: { name: numbering-service-hpa, namespace: sms-platform }
spec:
  scaleTargetRef: { apiVersion: apps/v1, kind: Deployment, name: numbering-service }
  minReplicas: 3  # §4 lists 6 for production kbl and 3 for production mzr
  maxReplicas: 24 # §4 lists 24 for production kbl and 12 for production mzr
  metrics:
    - type: Resource
      resource: { name: cpu, target: { type: Utilization, averageUtilization: 65 } }
    - type: Resource
      resource: { name: memory, target: { type: Utilization, averageUtilization: 75 } }
    - type: Pods
      pods:
        metric: { name: numbering_validate_lease_requests_per_pod }
        target: { type: AverageValue, averageValue: "1500" } # target 1500 RPS/pod
```
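With three metrics configured, the autoscaler computes a replica proposal for each one and acts on the largest. The sketch below applies the scaling rule from the Kubernetes HPA documentation, `desired = ceil(current × metric / target)`, and clamps the result to the configured bounds. The function and type names are illustrative, the sample readings are made up, and the real controller also applies a tolerance band and stabilization windows that are left out here.

```typescript
// Sketch of the HPA scaling rule: desired = ceil(currentReplicas * metric / target),
// evaluated per metric; the largest proposal wins and is clamped to [min, max].
// Names are hypothetical; tolerance and stabilization windows are ignored.

interface MetricReading {
  name: string;
  current: number; // observed average per pod (CPU %, memory %, or RPS)
  target: number;  // configured target in the same unit
}

function desiredReplicas(
  currentReplicas: number,
  readings: MetricReading[],
  minReplicas: number,
  maxReplicas: number,
): number {
  // Multiply before dividing to keep the arithmetic exact for integer inputs.
  const proposals = readings.map((r) =>
    Math.ceil((currentReplicas * r.current) / r.target),
  );
  const largest = Math.max(minReplicas, ...proposals);
  return Math.min(maxReplicas, largest);
}

// Example: 6 pods, CPU and memory under target, ValidateLease traffic over target.
const proposed = desiredReplicas(
  6,
  [
    { name: "cpu", current: 52, target: 65 },
    { name: "memory", current: 60, target: 75 },
    { name: "numbering_validate_lease_requests_per_pod", current: 2000, target: 1500 },
  ],
  3,
  24,
);
console.log(proposed);
```

In this example the request-rate metric drives the decision: CPU and memory each propose 5 pods, while 2000 RPS per pod against a 1500 target proposes 8.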
1.3 PodDisruptionBudget
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata: { name: numbering-service-pdb, namespace: sms-platform }
spec:
  minAvailable: 2
  selector: { matchLabels: { app: numbering-service } }
```
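A `minAvailable` budget permits a voluntary eviction only while the number of healthy pods exceeds the minimum. The sketch below shows that arithmetic for the replica counts used in this document; the function name is illustrative, and whether this PDB is applied outside production is not stated here.

```typescript
// Sketch: how many pods may be evicted voluntarily under a minAvailable budget.
// `allowedDisruptions` is a hypothetical helper, not part of any Kubernetes client.
function allowedDisruptions(healthyPods: number, minAvailable: number): number {
  return Math.max(0, healthyPods - minAvailable);
}

console.log(allowedDisruptions(3, 2)); // HPA floor of 3 pods
console.log(allowedDisruptions(6, 2)); // baseline of 6 pods
console.log(allowedDisruptions(2, 2)); // a 2-replica environment
```

With 3 healthy pods one eviction is allowed at a time. With only 2 replicas the budget allows none, so a node drain would wait until another pod becomes ready.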
1.4 CronJobs — Background Workers
```yaml
apiVersion: batch/v1
kind: CronJob
metadata: { name: numbering-reservation-cleanup, namespace: sms-platform }
spec:
  schedule: "* * * * *" # every minute (safety net; Redis keyspace notifications are primary)
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: numbering-service
          restartPolicy: OnFailure
          containers:
            - name: cleanup
              image: ghcr.io/ghasi/numbering-service:${GIT_SHA}
              command: ["node", "dist/workers/reservation-cleanup.js"]
              envFrom: [ { secretRef: { name: numbering-vault-secrets } } ]
---
apiVersion: batch/v1
kind: CronJob
metadata: { name: numbering-quarantine-sweep, namespace: sms-platform }
spec:
  schedule: "*/5 * * * *"
  concurrencyPolicy: Forbid
  jobTemplate: { spec: { template: { spec: { containers: [ { name: sweep, image: "ghcr.io/ghasi/numbering-service:${GIT_SHA}", command: ["node", "dist/workers/quarantine-sweep.js"] } ], restartPolicy: OnFailure, serviceAccountName: numbering-service } } } }
---
apiVersion: batch/v1
kind: CronJob
metadata: { name: numbering-lease-renewal, namespace: sms-platform }
spec:
  schedule: "0 2 * * *" # daily 02:00 UTC
  jobTemplate: { spec: { template: { spec: { containers: [ { name: renew, image: "ghcr.io/ghasi/numbering-service:${GIT_SHA}", command: ["node", "dist/workers/lease-renewal.js"] } ], restartPolicy: OnFailure, serviceAccountName: numbering-service } } } }
---
apiVersion: batch/v1
kind: CronJob
metadata: { name: numbering-regulator-export, namespace: sms-platform }
spec:
  schedule: "0 1 1 * *" # 01:00 UTC on the 1st of each month
  jobTemplate: { spec: { template: { spec: { containers: [ { name: export, image: "ghcr.io/ghasi/numbering-service:${GIT_SHA}", command: ["node", "dist/workers/regulator-export.js"] } ], restartPolicy: OnFailure, serviceAccountName: numbering-service } } } }
---
apiVersion: batch/v1
kind: CronJob
metadata: { name: numbering-reconciliation, namespace: sms-platform }
spec:
  schedule: "0 3 * * *" # nightly 03:00 UTC
  jobTemplate: { spec: { template: { spec: { containers: [ { name: recon, image: "ghcr.io/ghasi/numbering-service:${GIT_SHA}", command: ["node", "dist/workers/reconciliation.js"] } ], restartPolicy: OnFailure, serviceAccountName: numbering-service } } } }
---
apiVersion: batch/v1
kind: CronJob
metadata: { name: numbering-partition-maintenance, namespace: sms-platform }
spec:
  schedule: "30 3 * * *" # nightly 03:30 UTC
  jobTemplate: { spec: { template: { spec: { containers: [ { name: part, image: "ghcr.io/ghasi/numbering-service:${GIT_SHA}", command: ["node", "dist/workers/partition-maintenance.js"] } ], restartPolicy: OnFailure, serviceAccountName: numbering-service } } } }
---
apiVersion: batch/v1
kind: CronJob
metadata: { name: numbering-audit-chain-verify, namespace: sms-platform }
spec:
  schedule: "0 4 * * *" # nightly 04:00 UTC
  jobTemplate: { spec: { template: { spec: { containers: [ { name: verify, image: "ghcr.io/ghasi/numbering-service:${GIT_SHA}", command: ["node", "dist/workers/audit-chain-verify.js"] } ], restartPolicy: OnFailure, serviceAccountName: numbering-service } } } }
```
Every worker acquires a Redis distributed lock before starting work, which makes it safe to run alongside other replicas.
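One common way to build such a lock is a single Redis key written with `SET key value NX PX ttl` and released with a compare-and-delete, so a worker never removes a lock it no longer owns. The sketch below shows that pattern under stated assumptions: `LockStore`, `InMemoryLockStore` and `runExclusive` are hypothetical names, the in-memory store stands in for Redis so the example runs on its own, and the workers' actual locking code is not shown in this document.

```typescript
// Sketch of a single-holder worker lock. `LockStore` is a hypothetical seam: a real
// implementation would issue `SET key value NX PX ttl` against Redis and use a
// compare-and-delete Lua script so that release is atomic.
interface LockStore {
  setIfAbsent(key: string, value: string, ttlMs: number): Promise<boolean>;
  deleteIfEquals(key: string, value: string): Promise<boolean>;
}

// In-memory stand-in so the sketch runs without a Redis server.
class InMemoryLockStore implements LockStore {
  private readonly entries = new Map<string, { value: string; expiresAt: number }>();

  async setIfAbsent(key: string, value: string, ttlMs: number): Promise<boolean> {
    const now = Date.now();
    const existing = this.entries.get(key);
    if (existing !== undefined && existing.expiresAt > now) {
      return false; // another holder's lock is still live
    }
    this.entries.set(key, { value, expiresAt: now + ttlMs });
    return true;
  }

  async deleteIfEquals(key: string, value: string): Promise<boolean> {
    const existing = this.entries.get(key);
    if (existing === undefined || existing.value !== value) {
      return false; // never delete a lock owned by someone else
    }
    this.entries.delete(key);
    return true;
  }
}

// Runs `work` only if the lock was acquired; returns false when the run is skipped.
async function runExclusive(
  store: LockStore,
  lockKey: string,
  ownerId: string,
  ttlMs: number,
  work: () => Promise<void>,
): Promise<boolean> {
  const acquired = await store.setIfAbsent(lockKey, ownerId, ttlMs);
  if (!acquired) {
    return false;
  }
  try {
    await work();
  } finally {
    // Release only our own lock; if the TTL already expired this is a no-op.
    await store.deleteIfEquals(lockKey, ownerId);
  }
  return true;
}
```

A worker would call `runExclusive(store, "lock:numbering-quarantine-sweep", podName, ttlMs, sweep)` with a key name of its own choosing (the one here is made up) and a TTL longer than the expected run time, so the lock does not expire while work is still in progress.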
1.5 Services
```yaml
apiVersion: v1
kind: Service
metadata: { name: numbering-service-grpc, namespace: sms-platform }
spec:
  selector: { app: numbering-service }
  ports: [ { name: grpc, port: 50061, targetPort: grpc } ]
  type: ClusterIP
---
apiVersion: v1
kind: Service
metadata: { name: numbering-service-http, namespace: sms-platform }
spec:
  selector: { app: numbering-service }
  ports: [ { name: http, port: 3021, targetPort: http } ]
  type: ClusterIP
```
1.6 NetworkPolicy
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata: { name: numbering-service-netpol, namespace: sms-platform }
spec:
  podSelector: { matchLabels: { app: numbering-service } }
  policyTypes: [ Ingress, Egress ]
  ingress:
    # gRPC callers (mTLS-authenticated)
    - from:
        - podSelector: { matchLabels: { app: sms-orchestrator } }
        - podSelector: { matchLabels: { app: routing-engine } }
        - podSelector: { matchLabels: { app: number-intelligence-service } }
        - podSelector: { matchLabels: { app: sender-id-registry-service } }
        - podSelector: { matchLabels: { app: compliance-engine } }
        - podSelector: { matchLabels: { app: billing-service } }
        - podSelector: { matchLabels: { app: customer-portal-bff } }
        - podSelector: { matchLabels: { app: admin-dashboard-bff } }
      ports: [ { port: 50061 } ]
    # REST via Kong
    - from:
        - podSelector: { matchLabels: { app: kong } }
      ports: [ { port: 3021 } ]
    # Prometheus scrape
    - from:
        - namespaceSelector: { matchLabels: { name: monitoring } }
      ports: [ { port: 3021 } ]
  egress:
    # No DNS rule is listed here; resolving service names also needs egress on
    # port 53 (UDP/TCP) unless another policy selecting these pods grants it.
    - to: [ { podSelector: { matchLabels: { app: postgresql } } } ]
      ports: [ { port: 5432 } ]
    - to: [ { podSelector: { matchLabels: { app: redis } } } ]
      ports: [ { port: 6379 } ]
    - to: [ { podSelector: { matchLabels: { app: nats } } } ]
      ports: [ { port: 4222 } ]
    # Vault and S3 (regulator exports): restricted
    - to:
        - namespaceSelector: { matchLabels: { name: vault } }
      ports: [ { port: 8200 } ]
    - to:
        - ipBlock:
            cidr: 0.0.0.0/0
            except: [ 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16 ]
      ports: [ { port: 443 } ] # S3 HTTPS
```
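The final egress rule allows port 443 to any IPv4 address outside the three RFC 1918 private ranges. The sketch below evaluates an address against an `ipBlock` of that shape to make the effect concrete. The helper names are illustrative and the sample addresses are arbitrary.

```typescript
// Sketch: evaluate an IPv4 address against an ipBlock { cidr, except[] } rule.
// Helper names are hypothetical; only IPv4 dotted-quad input is handled.

function ipv4ToInt(address: string): number {
  const parts = address.split(".");
  if (parts.length !== 4 || parts.some((p) => !/^\d{1,3}$/.test(p) || Number(p) > 255)) {
    throw new Error(`invalid IPv4 address: ${address}`);
  }
  // Multiplication instead of bit shifts keeps the value an unsigned 32-bit number.
  return parts.reduce((acc, p) => acc * 256 + Number(p), 0);
}

function inCidr(address: string, cidr: string): boolean {
  const slash = cidr.indexOf("/");
  if (slash < 0) {
    throw new Error(`invalid CIDR: ${cidr}`);
  }
  const base = ipv4ToInt(cidr.slice(0, slash));
  const prefix = Number(cidr.slice(slash + 1));
  if (Math.floor(prefix) !== prefix || prefix < 0 || prefix > 32) {
    throw new Error(`invalid prefix length in ${cidr}`);
  }
  // Two addresses share a network when they agree after dropping the host bits.
  const hostSpan = Math.pow(2, 32 - prefix);
  return Math.floor(ipv4ToInt(address) / hostSpan) === Math.floor(base / hostSpan);
}

function allowedByIpBlock(address: string, cidr: string, except: string[]): boolean {
  return inCidr(address, cidr) && !except.some((range) => inCidr(address, range));
}

const PRIVATE_RANGES = ["10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16"];
console.log(allowedByIpBlock("10.1.2.3", "0.0.0.0/0", PRIVATE_RANGES));
console.log(allowedByIpBlock("172.32.0.1", "0.0.0.0/0", PRIVATE_RANGES));
```

The first call returns `false` because 10.1.2.3 is inside 10.0.0.0/8. The second returns `true` because 172.32.0.1 lies just past the end of 172.16.0.0/12. The rule therefore admits every public HTTPS endpoint, not only S3.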
2. Infrastructure Dependencies
| Dependency | Version | Topology |
|---|---|---|
| PostgreSQL | 15+ | Primary per region + read replica; cross-region logical replication per ADR-0004 §14 |
| Redis | 7.0+ | Cluster mode; keyspace notifications enabled (`notify-keyspace-events Ex`). Logical DB 7 applies only to single-node environments, because Redis Cluster exposes only DB 0 |
| NATS JetStream | 2.10+ | 3-node cluster per region with mesh; streams replicated cross-region |
| S3 (or S3-compatible) | — | `ghasi-regulator-exports-{region}` bucket with Object Lock (WORM), 7-year retention |
| Vault | 1.15+ | PKI engine for mTLS; Transit for regulator-export signing; DB engine for dynamic PG creds |
| Kong | 3.4+ | REST ingress; JWT + rate-limiting-advanced plugins |
2.1 Regions
Per ADR-0004 §14:
| Region | Purpose | Primary for |
|---|---|---|
| `kbl` (Kabul) | Primary control plane | All control-plane writes; monthly regulator export |
| `mzr` (Mazar-i-Sharif) | Secondary control plane, DR | Read replicas; takes over on `kbl` failover |
Writes use synchronous cross-region quorum on numbers and leases. Reservations are region-local.
3. Environment Variables
| Variable | Required | Default | Description |
|---|---|---|---|
| `NODE_ENV` | Yes | — | `production` / `staging` / `development` |
| `REGION_ID` | Yes | — | `kbl` / `mzr` |
| `GRPC_PORT` | No | 50061 | gRPC listener |
| `HTTP_PORT` | No | 3021 | REST + health + metrics |
| `DATABASE_URL` | Yes | — | PostgreSQL connection string (dynamic from Vault) |
| `REDIS_URL` | Yes | — | Redis connection string (includes DB 7) |
| `NATS_URL` | Yes | — | NATS server URL (regional cluster) |
| `NATS_CREDS_PATH` | Yes | — | Path to NATS credentials JWT |
| `S3_REGULATOR_BUCKET` | Yes | — | Bucket name for regulator exports |
| `GRPC_TLS_ENABLED` | No | true | Set `false` only in dev |
| `TLS_CERT_PATH`, `TLS_KEY_PATH`, `TLS_CA_PATH` | If TLS | — | mTLS material |
| `LOG_LEVEL` | No | info | `debug` / `info` / `warn` / `error` |
| `RESERVATION_RESERVE_TTL_SECS` | No | 900 | 15 min default |
| `RESERVATION_HOLD_TTL_SECS` | No | 86400 | 24 h default |
| `QUARANTINE_MSISDN_DAYS` | No | 90 | Cool-off for MSISDN |
| `QUARANTINE_SHORT_CODE_DAYS` | No | 30 | Cool-off for short code |
| `QUARANTINE_VANITY_DAYS` | No | 365 | Cool-off for vanity |
| `QUARANTINE_ALPHA_DAYS` | No | 0 | Cool-off for alpha |
| `OUTBOX_POLL_INTERVAL_MS` | No | 500 | Outbox relay tick |
| `RECONCILIATION_WINDOW_HOURS` | No | 24 | Nightly reconciliation look-back |
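The table's rules (required variables, defaults, and TLS paths needed only when TLS is on) can be resolved once at startup so that a misconfigured pod fails fast. The sketch below covers a subset of the variables. The names and defaults come from the table, while `loadConfig`, `NumberingConfig` and the helpers are hypothetical and do not describe the service's actual loader.

```typescript
// Sketch of resolving a subset of the variables above at startup.
// Variable names and defaults are from the table; everything else is illustrative.
type Env = Record<string, string | undefined>;

interface NumberingConfig {
  nodeEnv: string;
  regionId: "kbl" | "mzr";
  grpcPort: number;
  httpPort: number;
  grpcTlsEnabled: boolean;
  reservationReserveTtlSecs: number;
  reservationHoldTtlSecs: number;
  outboxPollIntervalMs: number;
}

function required(env: Env, name: string): string {
  const value = env[name];
  if (value === undefined || value === "") {
    throw new Error(`missing required environment variable ${name}`);
  }
  return value;
}

function intOrDefault(env: Env, name: string, fallback: number): number {
  const raw = env[name];
  if (raw === undefined || raw === "") {
    return fallback;
  }
  const parsed = Number(raw);
  if (!isFinite(parsed) || Math.floor(parsed) !== parsed || parsed < 0) {
    throw new Error(`${name} must be a non-negative integer, got "${raw}"`);
  }
  return parsed;
}

function loadConfig(env: Env): NumberingConfig {
  const regionId = required(env, "REGION_ID");
  if (regionId !== "kbl" && regionId !== "mzr") {
    throw new Error(`REGION_ID must be kbl or mzr, got "${regionId}"`);
  }
  // TLS defaults to on; only the literal string "false" turns it off.
  const grpcTlsEnabled = (env["GRPC_TLS_ENABLED"] ?? "true") !== "false";
  if (grpcTlsEnabled) {
    // "If TLS" in the table: the three paths are required only when TLS is enabled.
    required(env, "TLS_CERT_PATH");
    required(env, "TLS_KEY_PATH");
    required(env, "TLS_CA_PATH");
  }
  return {
    nodeEnv: required(env, "NODE_ENV"),
    regionId,
    grpcPort: intOrDefault(env, "GRPC_PORT", 50061),
    httpPort: intOrDefault(env, "HTTP_PORT", 3021),
    grpcTlsEnabled,
    reservationReserveTtlSecs: intOrDefault(env, "RESERVATION_RESERVE_TTL_SECS", 900),
    reservationHoldTtlSecs: intOrDefault(env, "RESERVATION_HOLD_TTL_SECS", 86400),
    outboxPollIntervalMs: intOrDefault(env, "OUTBOX_POLL_INTERVAL_MS", 500),
  };
}
```

In a Node process this would typically be called as `loadConfig(process.env)`. Taking the environment as a parameter keeps the function testable without touching global state.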
4. Deployment Environments
| Environment | Replicas | PG | Redis | NATS |
|---|---|---|---|---|
| Production (kbl) | 6–24 (HPA) | Primary + 2 replicas | Cluster 3 shards × 3 replicas | JetStream 3-node |
| Production (mzr) | 3–12 (HPA) | Secondary + 1 replica | Cluster 3 shards × 3 replicas | JetStream 3-node |
| Staging | 2 | Primary + 1 replica | Single-node | Single-node |
| Development | 1 | Single-node (Docker) | Single-node | Single-node |
| CI | 1 | Testcontainers | Testcontainers | Testcontainers |
5. Rollout & Rollback
- Rolling update with `maxSurge: 1`, `maxUnavailable: 0`. The PDB (`minAvailable: 2`) keeps at least 2 pods available during voluntary disruptions such as node drains.
- Graceful shutdown on SIGTERM: drain in-flight gRPC (30 s grace), flush outbox relay batch, close PG pool.
- Rollback: re-deploy the previous image tag. Schema changes follow expand/contract: no column is removed in the same release as its last code use.
- Multi-region: deploy to `mzr` first (canary); observe for 24 h; then `kbl`.
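The shutdown order in the list above matters: gRPC drains first so that no new work arrives, the outbox flush follows so that completed work is published, and the PG pool closes last because the earlier steps may still need it. The sketch below expresses that sequence with a bounded drain. `ShutdownDeps`, `withDeadline` and `shutdown` are hypothetical names, not the service's real classes.

```typescript
// Sketch of the SIGTERM sequence: bounded gRPC drain, then outbox flush, then PG close.
// The dependency interface is hypothetical and stands in for the real components.
interface ShutdownDeps {
  drainGrpc(): Promise<void>;   // stop accepting calls, wait for in-flight ones
  forceStopGrpc(): void;        // cut remaining calls once the grace period ends
  flushOutbox(): Promise<void>; // publish the relay batch currently in progress
  closePgPool(): Promise<void>; // release database connections last
}

// Resolves true if `task` finishes within `ms`, false if the deadline passes first.
function withDeadline(task: Promise<void>, ms: number): Promise<boolean> {
  return new Promise<boolean>((resolve, reject) => {
    const timer = setTimeout(() => resolve(false), ms);
    task.then(
      () => {
        clearTimeout(timer);
        resolve(true);
      },
      (err) => {
        clearTimeout(timer);
        reject(err);
      },
    );
  });
}

async function shutdown(deps: ShutdownDeps, graceMs: number): Promise<void> {
  const drained = await withDeadline(deps.drainGrpc(), graceMs);
  if (!drained) {
    deps.forceStopGrpc();
  }
  await deps.flushOutbox();
  await deps.closePgPool();
}
```

A process would wire this up with something like `process.on("SIGTERM", () => void shutdown(deps, 30_000))`. The pod spec in §1.1 does not set `terminationGracePeriodSeconds`, so the Kubernetes default of 30 s applies; a drain that uses the full 30 s leaves no time for the flush and pool close before SIGKILL, which suggests either a shorter drain or a longer pod grace period.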
6. Capacity Planning
Baseline assumptions (Phase 3 production targets):
- National peak SMS rate: 3,500 msg/s → 3,500 `ValidateLease` RPS.
- Hot-path cache hit ratio ≥ 95 % → effective PG read rate ≈ 175 QPS.
- Per-pod capacity: 1500 `ValidateLease` RPS @ CPU 65 %.
- Baseline 6 pods (headroom); HPA scales to 24 for peak + safety.
- PG connection pool: 20 per pod × 24 pods = 480 max; PgBouncer in transaction mode fronts PG with target pool 100.
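The figures in the list above follow from a few lines of arithmetic, reproduced in the sketch below so they can be recomputed when an assumption changes. The inputs are the baseline assumptions from this section; the function and field names are illustrative.

```typescript
// Sketch that reproduces the capacity arithmetic from the baseline assumptions.
// All names are hypothetical; only the input numbers come from this section.
interface CapacityInputs {
  peakRps: number;             // national peak ValidateLease RPS
  cacheHitPercent: number;     // hot-path cache hit ratio, in percent
  perPodRps: number;           // ValidateLease RPS per pod at the CPU target
  baselinePods: number;
  maxPods: number;
  pgConnectionsPerPod: number;
}

interface CapacityEstimate {
  pgReadQps: number;             // cache misses that reach PostgreSQL
  podsNeededAtPeak: number;      // minimum pods to serve peak at target load
  baselineHeadroom: number;      // baseline capacity divided by peak demand
  maxAppSideConnections: number; // client connections presented to PgBouncer
}

function estimateCapacity(i: CapacityInputs): CapacityEstimate {
  return {
    // Integer percent keeps the miss-rate arithmetic exact (3500 * 5 / 100).
    pgReadQps: (i.peakRps * (100 - i.cacheHitPercent)) / 100,
    podsNeededAtPeak: Math.ceil(i.peakRps / i.perPodRps),
    baselineHeadroom: (i.baselinePods * i.perPodRps) / i.peakRps,
    maxAppSideConnections: i.pgConnectionsPerPod * i.maxPods,
  };
}

const estimate = estimateCapacity({
  peakRps: 3500,
  cacheHitPercent: 95,
  perPodRps: 1500,
  baselinePods: 6,
  maxPods: 24,
  pgConnectionsPerPod: 20,
});
console.log(estimate);
```

Under these assumptions 3 pods would carry the peak, so the 6-pod baseline provides roughly 2.6 times the required capacity. The 480 application-side connections are multiplexed by PgBouncer onto its target pool of 100 server connections.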
End of DEPLOYMENT_TOPOLOGY.md