Channel Router Service — Deployment Topology

Version: 1.0
Status: Draft
Owner: Messaging Core + SRE
Last Updated: 2026-04-21
Companion: SERVICE_OVERVIEW · SECURITY_MODEL · LOCAL_DEV_SETUP
Related ADR: ADR-0004 §5–§6 (multi-region), §11–§12 (sovereignty, mesh identity)

1. Runtime

Aspect	Value
Language	TypeScript
Runtime	Node.js 22 LTS
Framework	NestJS 10 (gRPC + HTTP)
gRPC server	`@grpc/grpc-js`; data plane on `:50071`, control plane on `:50072`
HTTP server	Fastify on `:3071`
Metrics	Prometheus on `:9061` (channel-router); `:9062` (chan-mo-router)
Container base	`gcr.io/distroless/nodejs22-debian12:nonroot`
Image registry	`registry.ghasi.af/platform/channel-router-service`
OS user	`nonroot` (UID 65532)
Health endpoints	`/health/live`, `/health/ready`

2. Topology — Two Deployments + N OTT-adapter Deployments

Channel-router runs as two distinct workloads for blast-radius isolation:

channel-router — handles RouteWithFallback, REST admin/tenant surface, OTT webhook ingress.
chan-mo-router — handles inbound MO routing (NATS consumer + tenant-webhook delegation to webhook-dispatcher).

In addition, each OTT adapter runs as its own Deployment so that provider-specific issues do not drain shared pools:

chan-adapter-whatsapp — pinned to nodes with WhatsApp-allow-listed egress IP.
chan-adapter-telegram.
chan-adapter-viber.
chan-adapter-voice — gRPC client to Voice OTP gateway.
chan-adapter-email — SMTP egress from dedicated mail IP pool.

3. Kubernetes Deployment — `channel-router` (decision core)

apiVersion: apps/v1
kind: Deployment
metadata:
  name: channel-router-service
  namespace: np-data
  labels:
    app: channel-router-service
    tier: data-plane
    sovereignty: national
spec:
  replicas: 8
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 2
  selector:
    matchLabels: { app: channel-router-service }
  template:
    metadata:
      labels:
        app: channel-router-service
        tier: data-plane
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9061"
        prometheus.io/path: "/metrics"
        spire.io/managed: "true"
    spec:
      serviceAccountName: channel-router-service
      automountServiceAccountToken: false
      securityContext:
        runAsNonRoot: true
        runAsUser: 65532
        seccompProfile: { type: RuntimeDefault }
        fsGroup: 65532
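      affinity:
        podAntiAffinity:
          # Preferred rather than required: a required per-zone rule admits only one
          # replica per zone, which would leave most of the 8 replicas Pending.
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - { key: app, operator: In, values: [channel-router-service] }
                topologyKey: topology.kubernetes.io/zone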
      nodeSelector:
        node-pool: np-data
        sovereignty: af-only
      priorityClassName: data-plane-critical
      tolerations:
        - key: node-pool
          operator: Equal
          value: np-data
          effect: NoSchedule
      containers:
        - name: channel-router
          image: registry.ghasi.af/platform/channel-router-service:1.0.0
          imagePullPolicy: IfNotPresent
          ports:
            - { containerPort: 50071, name: grpc-data }
            - { containerPort: 50072, name: grpc-ctrl }
            - { containerPort: 3071,  name: http }
            - { containerPort: 9061,  name: metrics }
          env:
            - { name: NODE_ENV, value: production }
            - { name: LOG_LEVEL, value: info }
            - { name: GRPC_DATA_PORT, value: "50071" }
            - { name: GRPC_CTRL_PORT, value: "50072" }
            - { name: HTTP_PORT, value: "3071" }
            - { name: METRICS_PORT, value: "9061" }
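            # The Downward API exposes pod labels only, not node labels. This assumes the
            # region label is copied onto the pod (for example by an admission webhook).
            - { name: REGION, valueFrom: { fieldRef: { fieldPath: metadata.labels['topology.kubernetes.io/region'] } } }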
            - { name: DATABASE_URL, valueFrom: { secretKeyRef: { name: chan-db, key: url } } }
            - { name: REDIS_URL, valueFrom: { secretKeyRef: { name: chan-redis, key: url } } }
            - { name: NATS_URL, valueFrom: { secretKeyRef: { name: chan-nats, key: url } } }
            - { name: VAULT_ADDR, value: https://vault.np-ctrl.svc.cluster.local:8200 }
            - { name: SPIFFE_ENDPOINT_SOCKET, value: unix:///run/spire/agent-sockets/spire-agent.sock }
            - { name: CONSENT_LEDGER_URL, value: consent-ledger-service.np-data.svc.cluster.local:50051 }
            - { name: COMPLIANCE_ENGINE_URL, value: compliance-engine.np-data.svc.cluster.local:50052 }
            - { name: SENDER_ID_REGISTRY_URL, value: sender-id-registry-service.np-data.svc.cluster.local:50081 }
            - { name: WEBHOOK_DISPATCHER_URL, value: webhook-dispatcher.np-data.svc.cluster.local:50091 }
            - { name: TRITON_URL, value: triton.np-ml.svc.cluster.local:8001 }
            - { name: ROUTE_DECISION_BUDGET_MS, value: "50" }
            - { name: GATE_DEADLINE_MS, value: "15" }
            - { name: MAX_INFLIGHT_GRPC, value: "1000" }
            - { name: MAX_INFLIGHT_CONSUMER, value: "200" }
            - { name: CHAN_EXTERNAL_LLM_ENABLED, value: "false" }     # Sovereignty guard — pod refuses to start if true
            - { name: CHAN_ML_PREFERENCE_ORDERING_ENABLED, value: "true" }
          resources:
            requests: { cpu: 1000m, memory: 1Gi, ephemeral-storage: 1Gi }
            limits:   { cpu: 4000m, memory: 4Gi, ephemeral-storage: 2Gi }
          livenessProbe:
            httpGet: { path: /health/live, port: http }
            initialDelaySeconds: 15
            periodSeconds: 10
            timeoutSeconds: 3
            failureThreshold: 3
          readinessProbe:
            httpGet: { path: /health/ready, port: http }
            initialDelaySeconds: 10
            periodSeconds: 5
            timeoutSeconds: 2
            failureThreshold: 2
          startupProbe:
            httpGet: { path: /health/ready, port: http }
            failureThreshold: 30
          volumeMounts:
            - { name: spire-agent-socket, mountPath: /run/spire/agent-sockets, readOnly: true }
            - { name: tmp, mountPath: /tmp }
          lifecycle:
            preStop:
              exec:
                # The distroless Node.js image ships the binary at /nodejs/bin/node and has no shell.
                command: ["/nodejs/bin/node", "/app/dist/scripts/graceful-shutdown.js", "--drain-seconds=15"]
      volumes:
        - name: spire-agent-socket
          hostPath: { path: /run/spire/agent-sockets, type: Directory }
        - name: tmp
          emptyDir: { medium: Memory, sizeLimit: 256Mi }
      terminationGracePeriodSeconds: 30

Per-region replicas: kbl: 8 (HPA 8..24), mzr: 6 (HPA 6..20).
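The sovereignty guard and the latency budgets from the env block above could be checked at startup along these lines. This is a minimal sketch, not the service's actual bootstrap code: `loadRouterConfig`, `RouterConfig`, and the rule that the gate deadline must fit inside the route decision budget are assumptions; only the variable names and values come from the manifest.

```typescript
// Hypothetical startup validation. Only the environment variable names are
// taken from the manifest; types and function names are illustrative.
type Env = Record<string, string | undefined>;

interface RouterConfig {
  routeDecisionBudgetMs: number;
  gateDeadlineMs: number;
  maxInflightGrpc: number;
}

function readPositiveInt(env: Env, key: string): number {
  const raw = env[key];
  const value = Number(raw);
  if (raw === undefined || !Number.isInteger(value) || value <= 0) {
    throw new Error(`${key} must be a positive integer, got ${String(raw)}`);
  }
  return value;
}

function loadRouterConfig(env: Env): RouterConfig {
  // Sovereignty guard: refuse to start when the external LLM flag is on.
  if (env["CHAN_EXTERNAL_LLM_ENABLED"] === "true") {
    throw new Error("CHAN_EXTERNAL_LLM_ENABLED=true is not permitted");
  }

  const config: RouterConfig = {
    routeDecisionBudgetMs: readPositiveInt(env, "ROUTE_DECISION_BUDGET_MS"),
    gateDeadlineMs: readPositiveInt(env, "GATE_DEADLINE_MS"),
    maxInflightGrpc: readPositiveInt(env, "MAX_INFLIGHT_GRPC"),
  };

  // Assumed invariant: the gate checks run inside the overall route decision,
  // so their deadline has to be shorter than the total budget.
  if (config.gateDeadlineMs >= config.routeDecisionBudgetMs) {
    throw new Error("GATE_DEADLINE_MS must be below ROUTE_DECISION_BUDGET_MS");
  }
  return config;
}
```

With the manifest values (50 ms budget, 15 ms gate deadline, 1000 in-flight calls) the function returns a config object; flipping the LLM flag to "true" makes it throw before anything is bound to a port.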

4. Kubernetes Deployment — `chan-mo-router`

apiVersion: apps/v1
kind: Deployment
metadata:
  name: chan-mo-router
  namespace: np-data
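spec:
  replicas: 4
  # selector and pod labels are required by apps/v1; the label value is assumed to match the Deployment name.
  selector:
    matchLabels: { app: chan-mo-router }
  template:
    metadata:
      labels: { app: chan-mo-router }
    spec: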
      containers:
        - name: chan-mo-router
          image: registry.ghasi.af/platform/channel-router-service:1.0.0
          args: ["--mode=mo-router"]
          env:
            - { name: METRICS_PORT, value: "9062" }
            - { name: NATS_CONSUMER_GROUP, value: "chan-mo-router" }
            - { name: MAX_INFLIGHT_CONSUMER, value: "100" }
          resources:
            requests: { cpu: 500m, memory: 512Mi }
            limits:   { cpu: 2000m, memory: 2Gi }

Per-region replicas: kbl: 4, mzr: 4. The HPA scales on NATS consumer lag.
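The MAX_INFLIGHT_CONSUMER cap in the manifest above could be enforced with a simple counter-based limiter. The following is a minimal sketch under stated assumptions: `InflightLimiter` and `handleWithLimit` are hypothetical names, and the actual consumer may instead rely on the NATS client's own flow-control settings.

```typescript
// Hypothetical in-flight limiter for the MO consumer. Only the cap value
// (MAX_INFLIGHT_CONSUMER = 100) comes from the manifest.
class InflightLimiter {
  private inflight = 0;
  private readonly max: number;

  constructor(max: number) {
    if (!Number.isInteger(max) || max <= 0) {
      throw new Error("max must be a positive integer");
    }
    this.max = max;
  }

  // Claims a slot if one is free; never blocks.
  tryAcquire(): boolean {
    if (this.inflight >= this.max) return false;
    this.inflight += 1;
    return true;
  }

  // Returns a slot previously claimed with tryAcquire().
  release(): void {
    if (this.inflight === 0) throw new Error("release without matching acquire");
    this.inflight -= 1;
  }

  get current(): number {
    return this.inflight;
  }
}

// Runs the handler only when a slot is free. A false result tells the caller
// to leave the message for redelivery instead of processing it now.
async function handleWithLimit<T>(
  limiter: InflightLimiter,
  msg: T,
  handler: (msg: T) => Promise<void>,
): Promise<boolean> {
  if (!limiter.tryAcquire()) return false;
  try {
    await handler(msg);
    return true;
  } finally {
    limiter.release();
  }
}
```

Rejecting work at the cap, rather than queueing it in memory, keeps pod memory bounded and lets consumer lag build up in the stream, where the autoscaler can see it.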

5. OTT-adapter Deployments

apiVersion: apps/v1
kind: Deployment
metadata:
  name: chan-adapter-whatsapp
  namespace: np-data
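spec:
  replicas: 4
  # selector and pod labels are required by apps/v1; the label value is assumed to match the Deployment name.
  selector:
    matchLabels: { app: chan-adapter-whatsapp }
  template:
    metadata:
      labels: { app: chan-adapter-whatsapp }
    spec: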
      nodeSelector:
        node-pool: np-data
        egress-ip-pool: whatsapp-allowlisted   # Only nodes with the Meta-allow-listed source IPs
      containers:
        - name: chan-adapter-whatsapp
          image: registry.ghasi.af/platform/channel-router-service:1.0.0
          args: ["--mode=adapter-whatsapp"]
          env:
            - { name: METRICS_PORT, value: "9063" }
            - { name: WHATSAPP_API_BASE, value: "https://graph.facebook.com/v20.0" }
            - { name: TPS_LIMIT_DEFAULT, value: "80" }

chan-adapter-telegram, chan-adapter-viber, chan-adapter-voice, and chan-adapter-email follow the same template with provider-specific config and, where the provider requires it (for example the dedicated mail IP pool for chan-adapter-email), their own egress IP pools.
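Because every workload runs the same image and differs only in its `--mode` argument, the entrypoint has to map that argument to a workload. A minimal sketch of such a dispatcher follows. Only `mo-router` and `adapter-whatsapp` appear in the manifests above; the other adapter mode names, the `router` default, and the `parseMode` function are assumptions made for illustration.

```typescript
// Hypothetical mode table. "mo-router" and "adapter-whatsapp" come from the
// manifests; the remaining names are assumed to follow the same pattern.
const MODES = [
  "router", // assumed default when no --mode flag is passed
  "mo-router",
  "adapter-whatsapp",
  "adapter-telegram",
  "adapter-viber",
  "adapter-voice",
  "adapter-email",
] as const;

type Mode = (typeof MODES)[number];

// Extracts the workload mode from container args such as ["--mode=mo-router"].
function parseMode(args: readonly string[]): Mode {
  const prefix = "--mode=";
  const flag = args.find((arg) => arg.startsWith(prefix));
  if (flag === undefined) return "router";

  const value = flag.slice(prefix.length);
  const mode = MODES.find((candidate) => candidate === value);
  if (mode === undefined) {
    // Fail fast so a typo in the manifest crashes the pod instead of
    // silently starting the wrong workload.
    throw new Error(`unknown mode: ${value}`);
  }
  return mode;
}
```

Failing on an unknown mode means a misconfigured Deployment shows up as a crash loop at rollout time, which the `maxUnavailable: 0` strategy would then halt.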

6. HorizontalPodAutoscaler (KEDA + Prometheus adapter)

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: channel-router-service
  namespace: np-data
spec:
  scaleTargetRef: { name: channel-router-service }
  minReplicaCount: 8
  maxReplicaCount: 24
  pollingInterval: 15
  cooldownPeriod: 120
  triggers:
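    - type: prometheus
      # Compare the p95 itself to the threshold; the default AverageValue would divide it by the replica count.
      metricType: Value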
      metadata:
        serverAddress: http://prometheus.np-obs.svc:9090
        metricName: chan_route_p95
        threshold: "0.040"
        query: |
          histogram_quantile(0.95,
            sum(rate(chan_request_duration_seconds_bucket{rpc="RouteWithFallback"}[3m])) by (le)
          )
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.np-obs.svc:9090
        metricName: chan_consumer_lag
        threshold: "500"
        query: |
          max(nats_consumer_pending{stream="NOTIFICATION_DISPATCH",consumer="chan-router"})

chan-mo-router is autoscaled on the rate of chan_mo_inbound_total and on NATS consumer lag.
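To see how the two triggers for channel-router-service combine, the replica arithmetic can be sketched as below. It assumes the latency trigger is evaluated as a Value metric (the replica count scales with metric/threshold) and the lag trigger as an AverageValue metric (the threshold is a per-replica share), and that the largest proposal wins. The HPA tolerance band and stabilization windows are ignored, and all function names are hypothetical.

```typescript
// Simplified replica arithmetic for the two triggers above.
// Thresholds and replica bounds come from the ScaledObject.
const P95_THRESHOLD_SECONDS = 0.04;
const LAG_THRESHOLD_PER_REPLICA = 500;
const MIN_REPLICAS = 8;
const MAX_REPLICAS = 24;

// Value metric: scale the current replica count by how far the metric
// sits above (or below) its target.
function desiredFromValue(currentReplicas: number, metric: number, target: number): number {
  return Math.ceil(currentReplicas * (metric / target));
}

// AverageValue metric: the target is a per-replica share of the total.
function desiredFromAverageValue(metricTotal: number, targetPerReplica: number): number {
  return Math.ceil(metricTotal / targetPerReplica);
}

function clamp(value: number, min: number, max: number): number {
  return Math.min(max, Math.max(min, value));
}

// With several triggers, the largest proposed replica count is used.
function desiredReplicas(currentReplicas: number, p95Seconds: number, consumerLag: number): number {
  const byLatency = desiredFromValue(currentReplicas, p95Seconds, P95_THRESHOLD_SECONDS);
  const byLag = desiredFromAverageValue(consumerLag, LAG_THRESHOLD_PER_REPLICA);
  return clamp(Math.max(byLatency, byLag), MIN_REPLICAS, MAX_REPLICAS);
}
```

For example, with 8 replicas, a p95 of 80 ms, and 1,000 pending messages, the latency trigger proposes 16 replicas and the lag trigger 2, so the result is 16. A lag of 20,000 would propose 40 replicas and be capped at 24.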

7. NetworkPolicy

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: channel-router-allow
  namespace: np-data
spec:
  podSelector:
    matchLabels: { app: channel-router-service }
  policyTypes: [Ingress, Egress]
  ingress:
    - from:
        - namespaceSelector: { matchLabels: { name: np-data } }
          podSelector: { matchLabels: { app: sms-orchestrator } }
        - namespaceSelector: { matchLabels: { name: np-ctrl } }
          podSelector: { matchLabels: { app: admin-dashboard } }
        - namespaceSelector: { matchLabels: { name: np-edge } }
          podSelector: { matchLabels: { app: kong } }
        - namespaceSelector: { matchLabels: { name: np-obs } }
          podSelector: { matchLabels: { app: prometheus } }
      ports:
        - { protocol: TCP, port: 50071 }
        - { protocol: TCP, port: 50072 }
        - { protocol: TCP, port: 3071 }
        - { protocol: TCP, port: 9061 }
  egress:
    - to:
        - podSelector: { matchLabels: { app: postgres-chan } }
        - podSelector: { matchLabels: { app: redis-chan } }
        - podSelector: { matchLabels: { app: nats-jetstream } }
        - namespaceSelector: { matchLabels: { name: np-ctrl } }
          podSelector: { matchLabels: { app: vault } }
        - namespaceSelector: { matchLabels: { name: np-data } }
          podSelector: { matchLabels: { app: consent-ledger-service } }
        - namespaceSelector: { matchLabels: { name: np-data } }
          podSelector: { matchLabels: { app: compliance-engine } }
        - namespaceSelector: { matchLabels: { name: np-data } }
          podSelector: { matchLabels: { app: sender-id-registry-service } }
        - namespaceSelector: { matchLabels: { name: np-data } }
          podSelector: { matchLabels: { app: webhook-dispatcher } }
        - namespaceSelector: { matchLabels: { name: np-ml } }
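          podSelector: { matchLabels: { app: triton } }
    # DNS: once Egress is restricted, the cluster-DNS names used in the env block cannot
    # resolve without this rule. Assumes CoreDNS/kube-dns runs in kube-system with the
    # conventional k8s-app label.
    - to:
        - namespaceSelector: { matchLabels: { kubernetes.io/metadata.name: kube-system } }
          podSelector: { matchLabels: { k8s-app: kube-dns } }
      ports:
        - { protocol: UDP, port: 53 }
        - { protocol: TCP, port: 53 }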

OTT-adapter Deployments have distinct egress NetworkPolicies allowing only the relevant provider FQDNs (resolved via DNS-aware policy or via egress-proxy with FQDN allow-list).

8. Region affinity

Component	Region pinning
`channel-router` (decision)	Region-local; both regions active-active
`chan-mo-router`	Region-local; cross-region MO forwarding via internal NATS subject
`chan-adapter-*`	Region-local for routing; OTT adapter pods may be region-pinned by egress IP allow-list
`postgres-chan` (Patroni)	Per-region cluster (1 primary + 2 sync standbys); cross-region logical replication for control plane
`redis-chan` Sentinel	Per-region (6-node)
Conversations	Pinned to the region that opened them
Profiles	Multi-master with LWW
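Multi-master replication with last-writer-wins (LWW) means both regions accept profile writes and reconcile conflicts by timestamp. A minimal sketch of such a merge follows. The `VersionedProfile` shape, the millisecond timestamp field, and the region-name tie-break are assumptions; the document does not say how ties or clock skew are handled.

```typescript
// Hypothetical record shape for an LWW-replicated profile.
interface VersionedProfile<T> {
  value: T;
  updatedAtMs: number; // writer's timestamp in ms since epoch
  region: string; // writing region, used only to break ties
}

// Returns the winning version. The function is commutative, so both regions
// converge on the same record regardless of the order they see the writes in.
function mergeLww<T>(a: VersionedProfile<T>, b: VersionedProfile<T>): VersionedProfile<T> {
  if (a.updatedAtMs !== b.updatedAtMs) {
    return a.updatedAtMs > b.updatedAtMs ? a : b;
  }
  // Equal timestamps: pick deterministically by region name (assumed rule).
  return a.region >= b.region ? a : b;
}
```

Note that LWW silently discards the losing write, which is acceptable for profile data only if concurrent edits to the same profile from both regions are rare.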

9. Secrets

Secret	Source
`chan-db` (DB URL + dynamic creds)	Vault (1 h dynamic)
`chan-redis`	Vault (static)
`chan-nats`	SPIRE-issued NATS NKEY
`chan-hsm-pkcs11`	HSM-managed (audit-chain signing)
OTT credentials per-tenant per-provider	Vault `secrets/data/chan/ott/{tenantId}/{provider}`
Tenant webhook HMAC secrets	Vault `secrets/data/chan/webhook/{tenantId}/{inbound}`
Meta app-secret (webhook signature)	Vault `secrets/data/chan/meta/app_secret`
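Since the OTT credential path is templated on tenant and provider, any code that builds it should validate both segments before calling Vault. The sketch below shows one way to do that. `ottCredentialPath` and the allowed character set are assumptions; only the path template comes from the table above.

```typescript
// Allowed characters for a path segment (assumed; adjust to the real
// tenant and provider identifier formats).
const SEGMENT_PATTERN = /^[A-Za-z0-9_-]+$/;

function safeSegment(name: string, value: string): string {
  // Rejects empty strings, slashes, dots, and anything else that could
  // escape the intended Vault subtree.
  if (!SEGMENT_PATTERN.test(value)) {
    throw new Error(`invalid ${name}: ${JSON.stringify(value)}`);
  }
  return value;
}

// Builds secrets/data/chan/ott/{tenantId}/{provider} from validated parts.
function ottCredentialPath(tenantId: string, provider: string): string {
  const tenant = safeSegment("tenantId", tenantId);
  const prov = safeSegment("provider", provider);
  return `secrets/data/chan/ott/${tenant}/${prov}`;
}
```

Validating the segments keeps a malformed or hostile tenant identifier from addressing another tenant's secrets through path traversal.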

10. Disaster recovery

RPO ≤ 60 s for control-plane data (cross-region logical replication).
RPO ≤ 5 s for audit/outcome streams (JetStream mirror).
RTO ≤ 15 min region-failover (manual, drilled quarterly).
Postgres backups: PITR via WAL-G; nightly base backup; 30 d retention; encrypted with per-environment KMS.
DR drill quarterly: kill the kbl region; verify that mzr continues serving; verify that outcome events are not duplicated.
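The "no duplicated outcome events" check in the drill could be automated by scanning the events collected during failover for repeated identifiers. A minimal sketch follows, assuming each outcome event carries a unique `eventId`; the `OutcomeEvent` shape and the function name are hypothetical.

```typescript
// Hypothetical shape of an outcome event captured during a DR drill.
interface OutcomeEvent {
  eventId: string;
  region: string;
}

// Returns the sorted list of event IDs that occur more than once.
function findDuplicateEventIds(events: readonly OutcomeEvent[]): string[] {
  const seen = new Set<string>();
  const duplicates = new Set<string>();
  events.forEach((event) => {
    if (seen.has(event.eventId)) {
      duplicates.add(event.eventId);
    } else {
      seen.add(event.eventId);
    }
  });
  return Array.from(duplicates).sort();
}
```

An empty result means the drill passed this check; any returned ID points at an event that was emitted by both regions or replayed after failover.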

11. Service mesh

Linkerd (or Istio per ADR-0004 §12) sidecars on every pod.
mTLS enforced; SPIRE SVID rotation 1 h.
Outbound proxy enforces FQDN allow-list per Deployment (graph.facebook.com, api.telegram.org, chatapi.viber.com, etc.).
Distributed tracing via OTel collector → Tempo.
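The per-Deployment FQDN allow-list enforced by the outbound proxy amounts to an exact-match lookup, sketched below. The host names come from the list above, but the mapping of each host to a Deployment is inferred from the provider names, and `isEgressAllowed` is a hypothetical helper rather than the proxy's real configuration.

```typescript
// Assumed mapping of Deployment name to permitted outbound hosts.
const EGRESS_ALLOW_LIST: Record<string, readonly string[]> = {
  "chan-adapter-whatsapp": ["graph.facebook.com"],
  "chan-adapter-telegram": ["api.telegram.org"],
  "chan-adapter-viber": ["chatapi.viber.com"],
};

// Lower-cases the host and drops a trailing root dot so that
// "GRAPH.facebook.com." and "graph.facebook.com" compare equal.
function normalizeHost(host: string): string {
  const lower = host.toLowerCase();
  return lower.endsWith(".") ? lower.slice(0, -1) : lower;
}

// Exact match only: no suffix or wildcard matching, so a host such as
// "graph.facebook.com.evil.example" is rejected.
function isEgressAllowed(deployment: string, host: string): boolean {
  const allowed = EGRESS_ALLOW_LIST[deployment] ?? [];
  return allowed.indexOf(normalizeHost(host)) !== -1;
}
```

Deployments that are missing from the table get an empty list, so the default is to deny all external egress.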
