Channel Router Service — Deployment Topology

Version: 1.0
Status: Draft
Owner: Messaging Core + SRE
Last Updated: 2026-04-21
Companion: SERVICE_OVERVIEW · SECURITY_MODEL · LOCAL_DEV_SETUP
Related ADR: ADR-0004 §5–§6 (multi-region), §11–§12 (sovereignty, mesh identity)


1. Runtime

| Aspect | Value |
|---|---|
| Language | TypeScript |
| Runtime | Node.js 22 LTS |
| Framework | NestJS 10 (gRPC + HTTP) |
| gRPC server | @grpc/grpc-js; data plane :50071, control plane :50072 |
| HTTP server | Fastify on :3071 |
| Metrics | Prometheus on :9061 (channel-router); :9062 (chan-mo-router) |
| Container base | gcr.io/distroless/nodejs22-debian12:nonroot |
| Image registry | registry.ghasi.af/platform/channel-router-service |
| OS user | nonroot (UID 65532) |
| Health endpoints | /health/live, /health/ready |

2. Topology — Two Deployments + N OTT-adapter Deployments

Channel-router runs as two distinct workloads for blast-radius isolation:

  1. channel-router — handles RouteWithFallback, REST admin/tenant surface, OTT webhook ingress.
  2. chan-mo-router — handles inbound MO routing (NATS consumer + tenant-webhook delegation to webhook-dispatcher).

In addition, each OTT adapter runs as its own Deployment so that provider-specific issues do not drain shared pools:

  1. chan-adapter-whatsapp — pinned to nodes with WhatsApp-allow-listed egress IP.
  2. chan-adapter-telegram.
  3. chan-adapter-viber.
  4. chan-adapter-voice — gRPC client to Voice OTP gateway.
  5. chan-adapter-email — SMTP egress from dedicated mail IP pool.

3. Kubernetes Deployment — channel-router (decision core)

apiVersion: apps/v1
kind: Deployment
metadata:
  name: channel-router-service
  namespace: np-data
  labels:
    app: channel-router-service
    tier: data-plane
    sovereignty: national
spec:
  replicas: 8
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 2
  selector:
    matchLabels: { app: channel-router-service }
  template:
    metadata:
      labels:
        app: channel-router-service
        tier: data-plane
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9061"
        prometheus.io/path: "/metrics"
        spire.io/managed: "true"
    spec:
      serviceAccountName: channel-router-service
      automountServiceAccountToken: false
      securityContext:
        runAsNonRoot: true
        runAsUser: 65532
        seccompProfile: { type: RuntimeDefault }
        fsGroup: 65532
      affinity:
        podAntiAffinity:
          # Required zone-level anti-affinity allows at most one replica per zone;
          # confirm the zone count covers replicas plus maxSurge.
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - { key: app, operator: In, values: [channel-router-service] }
              topologyKey: topology.kubernetes.io/zone
      nodeSelector:
        node-pool: np-data
        sovereignty: af-only
      priorityClassName: data-plane-critical
      tolerations:
        - key: node-pool
          operator: Equal
          value: np-data
          effect: NoSchedule
      containers:
        - name: channel-router
          image: registry.ghasi.af/platform/channel-router-service:1.0.0
          imagePullPolicy: IfNotPresent
          ports:
            - { containerPort: 50071, name: grpc-data }
            - { containerPort: 50072, name: grpc-ctrl }
            - { containerPort: 3071, name: http }
            - { containerPort: 9061, name: metrics }
          env:
            - { name: NODE_ENV, value: production }
            - { name: LOG_LEVEL, value: info }
            - { name: GRPC_DATA_PORT, value: "50071" }
            - { name: GRPC_CTRL_PORT, value: "50072" }
            - { name: HTTP_PORT, value: "3071" }
            - { name: METRICS_PORT, value: "9061" }
            # The downward API reads pod labels, not node labels, so the pod itself
            # must carry this label. fieldPath is quoted because of the brackets.
            - { name: REGION, valueFrom: { fieldRef: { fieldPath: "metadata.labels['topology.kubernetes.io/region']" } } }
            - { name: DATABASE_URL, valueFrom: { secretKeyRef: { name: chan-db, key: url } } }
            - { name: REDIS_URL, valueFrom: { secretKeyRef: { name: chan-redis, key: url } } }
            - { name: NATS_URL, valueFrom: { secretKeyRef: { name: chan-nats, key: url } } }
            - { name: VAULT_ADDR, value: "https://vault.np-ctrl.svc.cluster.local:8200" }
            - { name: SPIFFE_ENDPOINT_SOCKET, value: "unix:///run/spire/agent-sockets/spire-agent.sock" }
            - { name: CONSENT_LEDGER_URL, value: "consent-ledger-service.np-data.svc.cluster.local:50051" }
            - { name: COMPLIANCE_ENGINE_URL, value: "compliance-engine.np-data.svc.cluster.local:50052" }
            - { name: SENDER_ID_REGISTRY_URL, value: "sender-id-registry-service.np-data.svc.cluster.local:50081" }
            - { name: WEBHOOK_DISPATCHER_URL, value: "webhook-dispatcher.np-data.svc.cluster.local:50091" }
            - { name: TRITON_URL, value: "triton.np-ml.svc.cluster.local:8001" }
            - { name: ROUTE_DECISION_BUDGET_MS, value: "50" }
            - { name: GATE_DEADLINE_MS, value: "15" }
            - { name: MAX_INFLIGHT_GRPC, value: "1000" }
            - { name: MAX_INFLIGHT_CONSUMER, value: "200" }
            - { name: CHAN_EXTERNAL_LLM_ENABLED, value: "false" } # Sovereignty guard: pod refuses to start if true
            - { name: CHAN_ML_PREFERENCE_ORDERING_ENABLED, value: "true" }
          resources:
            requests: { cpu: 1000m, memory: 1Gi, ephemeral-storage: 1Gi }
            limits: { cpu: 4000m, memory: 4Gi, ephemeral-storage: 2Gi }
          livenessProbe:
            httpGet: { path: /health/live, port: http }
            initialDelaySeconds: 15
            periodSeconds: 10
            timeoutSeconds: 3
            failureThreshold: 3
          readinessProbe:
            httpGet: { path: /health/ready, port: http }
            initialDelaySeconds: 10
            periodSeconds: 5
            timeoutSeconds: 2
            failureThreshold: 2
          startupProbe:
            httpGet: { path: /health/ready, port: http }
            failureThreshold: 30
          volumeMounts:
            - { name: spire-agent-socket, mountPath: /run/spire/agent-sockets, readOnly: true }
            - { name: tmp, mountPath: /tmp }
          lifecycle:
            preStop:
              exec:
                # Distroless nodejs images ship the node binary at /nodejs/bin/node, not /usr/bin/node.
                command: ["/nodejs/bin/node", "/app/dist/scripts/graceful-shutdown.js", "--drain-seconds=15"]
      volumes:
        - name: spire-agent-socket
          hostPath: { path: /run/spire/agent-sockets, type: Directory }
        - name: tmp
          emptyDir: { medium: Memory, sizeLimit: 256Mi }
      terminationGracePeriodSeconds: 30

Per-region replicas: kbl: 8 (HPA 8..24), mzr: 6 (HPA 6..20). The replica count in the manifest is the kbl value.
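The environment block in the manifest implies some start-up validation, including the sovereignty guard on CHAN_EXTERNAL_LLM_ENABLED. The following is a minimal sketch of how that could look, not the service's actual code. The names `RouterConfig`, `requirePositiveInt`, and `loadRouterConfig` are hypothetical, and the rule that GATE_DEADLINE_MS must be smaller than ROUTE_DECISION_BUDGET_MS is an assumption inferred from the values 15 and 50.

```typescript
// Hypothetical config shape; field names mirror env vars from the manifest.
interface RouterConfig {
  routeDecisionBudgetMs: number;
  gateDeadlineMs: number;
  maxInflightGrpc: number;
}

type Env = Record<string, string | undefined>;

// Parse a required positive integer, failing loudly on missing or bad input.
function requirePositiveInt(env: Env, key: string): number {
  const raw = env[key];
  const value = Number(raw);
  if (raw === undefined || !Number.isInteger(value) || value <= 0) {
    throw new Error(`${key} must be a positive integer, got ${String(raw)}`);
  }
  return value;
}

function loadRouterConfig(env: Env): RouterConfig {
  // Sovereignty guard: refuse to start when external LLM calls are enabled.
  if (env["CHAN_EXTERNAL_LLM_ENABLED"] === "true") {
    throw new Error("CHAN_EXTERNAL_LLM_ENABLED=true is not permitted; refusing to start");
  }
  const config: RouterConfig = {
    routeDecisionBudgetMs: requirePositiveInt(env, "ROUTE_DECISION_BUDGET_MS"),
    gateDeadlineMs: requirePositiveInt(env, "GATE_DEADLINE_MS"),
    maxInflightGrpc: requirePositiveInt(env, "MAX_INFLIGHT_GRPC"),
  };
  // Assumed invariant: the gate deadline has to fit inside the overall budget.
  if (config.gateDeadlineMs >= config.routeDecisionBudgetMs) {
    throw new Error("GATE_DEADLINE_MS must be smaller than ROUTE_DECISION_BUDGET_MS");
  }
  return config;
}
```

Taking the environment as a plain record rather than reading `process.env` directly keeps the function easy to test; a real entry point would pass `process.env` in.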


4. Kubernetes Deployment — chan-mo-router

# Abbreviated manifest: selector, pod labels, probes, and security context are not shown.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: chan-mo-router
  namespace: np-data
spec:
  replicas: 4
  template:
    spec:
      containers:
        - name: chan-mo-router
          image: registry.ghasi.af/platform/channel-router-service:1.0.0
          args: ["--mode=mo-router"]
          env:
            - { name: METRICS_PORT, value: "9062" }
            - { name: NATS_CONSUMER_GROUP, value: "chan-mo-router" }
            - { name: MAX_INFLIGHT_CONSUMER, value: "100" }
          resources:
            requests: { cpu: 500m, memory: 512Mi }
            limits: { cpu: 2000m, memory: 2Gi }

Per-region replicas: kbl: 4, mzr: 4. Autoscaling is driven by the chan_mo_inbound_total rate and NATS consumer lag (see section 6).
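MAX_INFLIGHT_CONSUMER caps how many inbound messages a pod works on at once. One way such a cap might be enforced is a simple counting gate, sketched below. The class name `InflightGate` and its methods are hypothetical, and the document does not say how the service actually applies the limit.

```typescript
// Counting gate: admits work only while fewer than `limit` items are in flight.
class InflightGate {
  private inflight = 0;

  constructor(private readonly limit: number) {}

  // Returns true and reserves a slot, or false when the gate is full.
  tryAcquire(): boolean {
    if (this.inflight >= this.limit) {
      return false;
    }
    this.inflight += 1;
    return true;
  }

  // Frees a slot; extra releases are ignored so the counter never goes negative.
  release(): void {
    if (this.inflight > 0) {
      this.inflight -= 1;
    }
  }

  current(): number {
    return this.inflight;
  }
}
```

A consumer loop would call `tryAcquire()` before pulling the next message and `release()` in a `finally` block once handling completes, leaving unpulled messages on the stream so that consumer lag reflects the backlog.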


5. OTT-adapter Deployments

apiVersion: apps/v1
kind: Deployment
metadata:
  name: chan-adapter-whatsapp
  namespace: np-data
spec:
  replicas: 4
  template:
    spec:
      nodeSelector:
        node-pool: np-data
        egress-ip-pool: whatsapp-allowlisted # Only nodes with the Meta-allow-listed source IPs
      containers:
        - name: chan-adapter-whatsapp
          image: registry.ghasi.af/platform/channel-router-service:1.0.0
          args: ["--mode=adapter-whatsapp"]
          env:
            - { name: METRICS_PORT, value: "9063" }
            - { name: WHATSAPP_API_BASE, value: "https://graph.facebook.com/v20.0" }
            - { name: TPS_LIMIT_DEFAULT, value: "80" }

chan-adapter-telegram, chan-adapter-viber, chan-adapter-voice, and chan-adapter-email follow the same template with provider-specific config and, where the provider requires it (for example the dedicated mail IP pool for chan-adapter-email), their own egress IP pools.
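Every workload runs the same image and is selected with a `--mode=` argument. A minimal sketch of how the entry point might resolve that argument follows. Only `mo-router` and `adapter-whatsapp` appear in the manifests; the other adapter mode names are assumed by analogy, and the default mode name `router` and the function `parseMode` are hypothetical.

```typescript
// Modes taken from the manifests, plus assumed names for the remaining adapters.
const KNOWN_MODES: readonly string[] = [
  "router", // assumed default when no --mode flag is given
  "mo-router",
  "adapter-whatsapp",
  "adapter-telegram",
  "adapter-viber",
  "adapter-voice",
  "adapter-email",
];

// Resolve the workload mode from container args such as ["--mode=mo-router"].
function parseMode(argv: readonly string[]): string {
  const prefix = "--mode=";
  const flag = argv.find((arg) => arg.startsWith(prefix));
  if (flag === undefined) {
    return "router";
  }
  const mode = flag.slice(prefix.length);
  if (KNOWN_MODES.indexOf(mode) < 0) {
    throw new Error(`unknown mode: ${mode}`);
  }
  return mode;
}
```

Rejecting unknown modes at start-up means a typo in a Deployment's `args` fails fast instead of silently running the decision core.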


6. HorizontalPodAutoscaler (KEDA + Prometheus adapter)

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: channel-router-service
  namespace: np-data
spec:
  scaleTargetRef: { name: channel-router-service }
  minReplicaCount: 8
  maxReplicaCount: 24
  pollingInterval: 15
  cooldownPeriod: 120
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.np-obs.svc:9090
        metricName: chan_route_p95
        threshold: "0.040"
        query: |
          histogram_quantile(0.95,
            sum(rate(chan_request_duration_seconds_bucket{rpc="RouteWithFallback"}[3m])) by (le)
          )
    - type: prometheus
      metadata:
        # serverAddress is required on every prometheus trigger.
        serverAddress: http://prometheus.np-obs.svc:9090
        metricName: chan_consumer_lag
        threshold: "500"
        query: |
          max(nats_consumer_pending{stream="NOTIFICATION_DISPATCH",consumer="chan-router"})

chan-mo-router is autoscaled in the same way, on the rate of chan_mo_inbound_total plus NATS consumer lag.
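To reason about what these thresholds do, the sketch below models how the HPA that KEDA creates turns a trigger's metric into a replica count. It is a simplified model under stated assumptions: it ignores the HPA tolerance band and stabilization windows, and it assumes KEDA's documented default `metricType` of AverageValue unless Value is set. The names `Trigger`, `desiredForTrigger`, and `desiredReplicas` are hypothetical.

```typescript
type MetricType = "AverageValue" | "Value";

interface Trigger {
  metricValue: number; // current result of the Prometheus query
  threshold: number; // `threshold` from the ScaledObject trigger
  metricType: MetricType;
}

// Replica count a single trigger proposes, before min/max clamping.
function desiredForTrigger(t: Trigger, currentReplicas: number): number {
  const ratio = t.metricValue / t.threshold;
  return t.metricType === "AverageValue"
    ? Math.ceil(ratio) // metric treated as a total shared across pods
    : Math.ceil(currentReplicas * ratio); // metric treated as one global value
}

// The highest proposal across triggers wins, then the result is clamped to the bounds.
function desiredReplicas(
  triggers: Trigger[],
  currentReplicas: number,
  min: number,
  max: number,
): number {
  const proposals = triggers.map((t) => desiredForTrigger(t, currentReplicas));
  const highest = Math.max(min, ...proposals);
  return Math.min(max, highest);
}
```

With the manifest's numbers, a consumer lag of 6000 against a threshold of 500 proposes 12 replicas. A p95 of 0.080 s against 0.040 proposes only 2 under AverageValue, so the floor of 8 applies, but 16 under Value with 8 current replicas. Whether the latency trigger should set `metricType: Value` is worth checking against the KEDA version in use.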


7. NetworkPolicy

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: channel-router-allow
  namespace: np-data
spec:
  podSelector:
    matchLabels: { app: channel-router-service }
  policyTypes: [Ingress, Egress]
  ingress:
    - from:
        - namespaceSelector: { matchLabels: { name: np-data } }
          podSelector: { matchLabels: { app: sms-orchestrator } }
        - namespaceSelector: { matchLabels: { name: np-ctrl } }
          podSelector: { matchLabels: { app: admin-dashboard } }
        - namespaceSelector: { matchLabels: { name: np-edge } }
          podSelector: { matchLabels: { app: kong } }
        - namespaceSelector: { matchLabels: { name: np-obs } }
          podSelector: { matchLabels: { app: prometheus } }
      ports:
        - { protocol: TCP, port: 50071 }
        - { protocol: TCP, port: 50072 }
        - { protocol: TCP, port: 3071 }
        - { protocol: TCP, port: 9061 }
  egress:
    - to:
        - podSelector: { matchLabels: { app: postgres-chan } }
        - podSelector: { matchLabels: { app: redis-chan } }
        - podSelector: { matchLabels: { app: nats-jetstream } }
        - namespaceSelector: { matchLabels: { name: np-ctrl } }
          podSelector: { matchLabels: { app: vault } }
        - namespaceSelector: { matchLabels: { name: np-data } }
          podSelector: { matchLabels: { app: consent-ledger-service } }
        - namespaceSelector: { matchLabels: { name: np-data } }
          podSelector: { matchLabels: { app: compliance-engine } }
        - namespaceSelector: { matchLabels: { name: np-data } }
          podSelector: { matchLabels: { app: sender-id-registry-service } }
        - namespaceSelector: { matchLabels: { name: np-data } }
          podSelector: { matchLabels: { app: webhook-dispatcher } }
        - namespaceSelector: { matchLabels: { name: np-ml } }
          podSelector: { matchLabels: { app: triton } }

OTT-adapter Deployments have distinct egress NetworkPolicies allowing only the relevant provider FQDNs (resolved via DNS-aware policy or via egress-proxy with FQDN allow-list).
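The ingress rule in the channel-router policy pairs a namespaceSelector with a podSelector inside one `from` element, which Kubernetes evaluates as a logical AND. The sketch below models that rule as a plain lookup to make the resulting access matrix explicit. It is a simplification: real policies match labels rather than names, and the identifiers `Peer`, `INGRESS_PEERS`, `INGRESS_PORTS`, and `ingressAllowed` are hypothetical.

```typescript
interface Peer {
  namespace: string;
  app: string;
}

// Allowed sources from the ingress rule: namespace AND app label must both match.
const INGRESS_PEERS: readonly Peer[] = [
  { namespace: "np-data", app: "sms-orchestrator" },
  { namespace: "np-ctrl", app: "admin-dashboard" },
  { namespace: "np-edge", app: "kong" },
  { namespace: "np-obs", app: "prometheus" },
];

// Ports listed on the same rule; they apply to every peer in that rule.
const INGRESS_PORTS: readonly number[] = [50071, 50072, 3071, 9061];

function ingressAllowed(source: Peer, port: number): boolean {
  const peerMatches = INGRESS_PEERS.some(
    (p) => p.namespace === source.namespace && p.app === source.app,
  );
  return peerMatches && INGRESS_PORTS.indexOf(port) >= 0;
}
```

Because all peers and ports sit in a single rule, every listed peer can reach every listed port; for instance, Prometheus is admitted on the gRPC ports as well as :9061. Restricting a peer to one port would need a separate rule per peer and port group.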


8. Region affinity

| Component | Region pinning |
|---|---|
| channel-router (decision) | Region-local; both regions active-active |
| chan-mo-router | Region-local; cross-region MO forwarding via internal NATS subject |
| chan-adapter-* | Region-local for routing; OTT adapter pods may be region-pinned by egress IP allow-list |
| postgres-chan (Patroni) | Per-region cluster (1 primary + 2 sync standbys); cross-region logical replication for control plane |
| redis-chan Sentinel | Per-region (6-node) |
| Conversations | Region-pinned to the region that opened them |
| Profiles | Multi-master with LWW |
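Profiles are replicated multi-master with last-writer-wins (LWW) resolution. The sketch below shows a minimal LWW merge for a single record. The `Versioned` shape, the millisecond timestamp field, and the region-name tie-break are assumptions made for illustration; the document does not specify how ties or clock skew are handled.

```typescript
// A replicated value tagged with when and where it was last written.
interface Versioned<T> {
  value: T;
  updatedAtMs: number; // wall-clock write time in milliseconds (assumed)
  region: string; // writing region, e.g. "kbl" or "mzr"
}

// Last-writer-wins: the later write survives. Equal timestamps fall back to a
// deterministic tie-break on region name so both regions converge on the same record.
function mergeLww<T>(a: Versioned<T>, b: Versioned<T>): Versioned<T> {
  if (a.updatedAtMs !== b.updatedAtMs) {
    return a.updatedAtMs > b.updatedAtMs ? a : b;
  }
  return a.region >= b.region ? a : b;
}
```

The tie-break matters because the merge has to be commutative: both regions must pick the same winner regardless of the order in which they see the two writes.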

9. Secrets

| Secret | Source |
|---|---|
| chan-db (DB URL + dynamic creds) | Vault (1 h dynamic) |
| chan-redis | Vault (static) |
| chan-nats | SPIRE-issued NATS NKEY |
| chan-hsm-pkcs11 | HSM-managed (audit-chain signing) |
| OTT credentials per-tenant per-provider | Vault secrets/data/chan/ott/{tenantId}/{provider} |
| Tenant webhook HMAC secrets | Vault secrets/data/chan/webhook/{tenantId}/{inbound} |
| Meta app-secret (webhook signature) | Vault secrets/data/chan/meta/app_secret |
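The per-tenant, per-provider OTT credential path follows the template secrets/data/chan/ott/{tenantId}/{provider}. A small helper that builds the path and rejects unexpected segments could look like the sketch below. The function name `ottSecretPath`, the provider names (derived from the adapter Deployment names), and the tenant ID character set are all assumptions.

```typescript
// Provider names assumed from the chan-adapter-* Deployment names.
const OTT_PROVIDERS: readonly string[] = ["whatsapp", "telegram", "viber", "voice", "email"];

// Assumed tenant ID format: letters, digits, underscore, and hyphen only.
const TENANT_ID_PATTERN = /^[A-Za-z0-9_-]+$/;

// Build the Vault path for one tenant's credentials with one provider.
function ottSecretPath(tenantId: string, provider: string): string {
  if (!TENANT_ID_PATTERN.test(tenantId)) {
    throw new Error(`invalid tenantId: ${tenantId}`);
  }
  if (OTT_PROVIDERS.indexOf(provider) < 0) {
    throw new Error(`unknown provider: ${provider}`);
  }
  return `secrets/data/chan/ott/${tenantId}/${provider}`;
}
```

Validating both segments before interpolation keeps a crafted tenant ID such as `../meta` from resolving to a different secret.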

10. Disaster recovery

  • RPO ≤ 60 s for control-plane data (cross-region logical replication).
  • RPO ≤ 5 s for audit/outcome streams (JetStream mirror).
  • RTO ≤ 15 min for region failover (manual, drilled quarterly).
  • Postgres backups: PITR via WAL-G; nightly base backup; 30 d retention; encrypted with per-environment KMS.
  • DR drill (quarterly): kill the kbl region; verify that mzr continues serving; verify that outcome events are not duplicated.
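The last drill step, checking that outcome events are not duplicated, can be automated with a simple pass over the event IDs observed after failover. The sketch below assumes each outcome event carries a unique string ID; the function name `findDuplicateEventIds` is hypothetical, and how the IDs are collected from the stream is left out.

```typescript
// Return every event ID that occurs more than once, each reported a single time.
function findDuplicateEventIds(ids: readonly string[]): string[] {
  const seen = new Set<string>();
  const duplicates = new Set<string>();
  for (const id of ids) {
    if (seen.has(id)) {
      duplicates.add(id);
    } else {
      seen.add(id);
    }
  }
  return Array.from(duplicates);
}
```

A drill would pass when the returned array is empty for the window that spans the failover.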

11. Service mesh

  • Linkerd (or Istio per ADR-0004 §12) sidecars on every pod.
  • mTLS enforced; SPIRE SVID rotation 1 h.
  • Outbound proxy enforces FQDN allow-list per Deployment (graph.facebook.com, api.telegram.org, chatapi.viber.com, etc.).
  • Distributed tracing via OTel collector → Tempo.
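The per-Deployment FQDN allow-list enforced by the outbound proxy can be pictured as a lookup keyed by Deployment name. The sketch below is illustrative only: the pairing of each host with a Deployment is inferred from the names, the allow-list is limited to the three hosts listed, and `EGRESS_ALLOW` and `egressAllowed` are hypothetical identifiers rather than the proxy's real configuration.

```typescript
// Assumed pairing of Deployments with the provider hosts named in the list.
const EGRESS_ALLOW: Record<string, readonly string[]> = {
  "chan-adapter-whatsapp": ["graph.facebook.com"],
  "chan-adapter-telegram": ["api.telegram.org"],
  "chan-adapter-viber": ["chatapi.viber.com"],
};

// Exact-match check after lowercasing and dropping a trailing root dot.
function egressAllowed(deployment: string, host: string): boolean {
  const allowed = EGRESS_ALLOW[deployment] ?? [];
  const normalized = host.toLowerCase().replace(/\.$/, "");
  return allowed.indexOf(normalized) >= 0;
}
```

Exact matching is deliberate in this sketch: suffix or wildcard matching would let a host such as `graph.facebook.com.example.net` slip through if implemented carelessly.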