Channel Router Service — Service Readiness

Version: 1.0
Status: Draft
Owner: Messaging Core
Last Updated: 2026-04-21
Companion: SERVICE_OVERVIEW · TESTING_STRATEGY · SERVICE_RISK_REGISTER
Related: ADR-0004, docs/standards/SERVICE_TEMPLATE.md

This document tracks the readiness criteria for taking channel-router-service from development to production. Channel-router is the platform's omnichannel decisioner: it sits on the hot path for every omnichannel send and on the synchronous path for every inbound MO. The readiness bar is therefore elevated: sub-50 ms (P95) RouteWithFallback latency, fail-closed behaviour on consent and compliance checks, audit-chain integrity, and OTT provider health under per-provider rate limits.


1. Code readiness

| Criterion | Status |
| --- | --- |
| gRPC ChannelRouterService.v1 (RouteWithFallback, DeliverNow, GetConversationSession, GetRecipientProfile) | |
| gRPC ChannelControlService.v1 (policies, OTT accounts, circuit breakers, inbound routes) | |
| REST /v1/channel/* admin + tenant surfaces | |
| OTT webhook ingress (/v1/webhooks/whatsapp, /telegram/{secretPath}, /viber) with HMAC verification | |
| chan-mo-router worker: mo.allowed.v1 consumer + tenant-webhook fan-out via webhook-dispatcher | |
| Per-adapter implementations (WhatsApp Cloud, Telegram Bot, Viber, Voice OTP, SMTP) behind ChannelAdapter port | |
| Per-adapter circuit-breakers + token-buckets | |
| Deadline scanner (Redis ZSET + 1 s tick) with distributed lock | |
| Outbox relay (chan.outbox, chan.delivery_outbox) with at-least-once + uniqueness | |
| Conversation session manager (Redis HASH + PG mirror) | |
| STOP-keyword detector (en/fa/ps/ar) + tenant overrides | |
| Recipient-profile updater (LWW, async consumer on delivery feedback) | |
| Fail-closed gating on consent / compliance / sender-id (deadlines + cache) | |
| Audit hash-chain implementation (prev_hash, `record_hash = sha256(payload …`) | |
| Audit chain daily verifier job | |
| Idempotency-Key support on REST writes; Nats-Msg-Id on outbound events | |
| mTLS gRPC (SPIRE SVID) | |
| Sovereignty guard (CHAN_EXTERNAL_LLM_ENABLED=false); pod refuses to boot otherwise | |
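
The audit hash-chain and its daily verifier can be illustrated with a minimal sketch. This document does not spell out the full hash input, so the sketch assumes each record hash is SHA-256 over the canonical JSON payload followed by the previous record's hash. The names `AuditRecord`, `append_record`, `verify_chain` and the all-zero genesis hash are hypothetical, and Python is used only for illustration.

```python
import hashlib
import json
from dataclasses import dataclass

GENESIS_HASH = "0" * 64  # assumption: the first record chains to an all-zero hash


@dataclass(frozen=True)
class AuditRecord:
    payload: dict
    prev_hash: str
    record_hash: str


def compute_hash(payload: dict, prev_hash: str) -> str:
    # Assumption: canonical JSON (sorted keys) followed by the previous hash.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((canonical + prev_hash).encode("utf-8")).hexdigest()


def append_record(chain: list, payload: dict) -> AuditRecord:
    """Link a new record to the tail of the chain and return it."""
    prev_hash = chain[-1].record_hash if chain else GENESIS_HASH
    record = AuditRecord(payload, prev_hash, compute_hash(payload, prev_hash))
    chain.append(record)
    return record


def verify_chain(chain: list) -> int:
    """Return the index of the first broken record, or -1 if the chain is intact."""
    expected_prev = GENESIS_HASH
    for index, record in enumerate(chain):
        if record.prev_hash != expected_prev:
            return index  # link to the previous record is broken
        if record.record_hash != compute_hash(record.payload, record.prev_hash):
            return index  # payload or hash was altered after the fact
        expected_prev = record.record_hash
    return -1
```

A verifier job along these lines would walk the chain in insertion order and raise an alert on any non-negative result.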

2. Testing readiness

| Criterion | Target | Status |
| --- | --- | --- |
| Unit coverage | ≥ 85% line / ≥ 80% branch (domain ≥ 90%) | |
| Unit tests for fallback policy evaluator | ≥ 30 tests | |
| Unit tests for cost calculator + cost-cap | ≥ 15 tests | |
| Unit tests for state-machine transitions per attempt | ≥ 25 tests | |
| Unit tests for STOP-keyword matcher per language | ≥ 50 per language (en/fa/ps/ar) | |
| Unit tests for MSISDN normalisation + per-tenant hashing | ≥ 30 | |
| Unit tests for provider-status canonicalisation | ≥ 50 (covers WhatsApp, Telegram, Viber, Voice) | |
| Integration: RouteWithFallback P95 ≤ 50 ms @ 5000 RPS | Passed | |
| Integration: full SMS→WhatsApp→Voice cascade | Passed | |
| Integration: MO routing roundtrip via mock webhook | Passed | |
| Integration: STOP keyword closes session + records opt-out | Passed | |
| Integration: profile learning from delivery feedback (LWW) | Passed | |
| Integration: consent.revoked.v1 invalidates cache within 1 s | Passed | |
| Contract: sms-orchestrator → channel-router gRPC (Pact) | Passed | |
| Contract: channel-router → consent-ledger gRPC | Passed | |
| Contract: channel-router → compliance-engine gRPC | Passed | |
| Contract: channel-router → webhook-dispatcher gRPC | Passed | |
| Contract: WhatsApp / Telegram / Viber webhook fixtures | Passed | |
| E2E: bank OTP fallback (SMS → Voice) | Passed | |
| E2E: government PARALLEL alert across OTT | Passed | |
| E2E: conversational MT/MO multi-turn | Passed | |
| Load: 5000 RPS sustained 1 h, P99 ≤ 120 ms | Passed | |
| Chaos: PG primary kill, fail-over within 30 s | Passed | |
| Chaos: consent-ledger out → fail-closed verified | Passed | |
| Chaos: WhatsApp adapter 503 storm → breaker opens, ladder skips | Passed | |
| Chaos: NATS lag → outbox grows; on heal, drains within 5 min | Passed | |
| Security: RLS cross-tenant blocked | Passed | |
| Security: audit log UPDATE/DELETE rejected at PG rule | Passed | |
| Security: webhook signature spoof → 401 | Passed | |
| Security: cost-cap evasion attempts rejected | Passed | |
| Security: penetration test on REST + OTT webhook surface | Passed | |
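
The cascade and breaker chaos tests above both exercise ladder traversal: channels are tried in order, and a channel whose circuit breaker is open is skipped without calling its adapter. A minimal sketch of that behaviour follows. The function name `route_with_fallback`, the adapter callables, and the `open_breakers` set are hypothetical stand-ins and do not describe the service's actual interfaces.

```python
from typing import Callable, Dict, List, Optional, Set


def route_with_fallback(
    ladder: List[str],
    adapters: Dict[str, Callable[[str], bool]],
    open_breakers: Set[str],
    message: str,
) -> Optional[str]:
    """Try each channel in ladder order; return the first channel that accepts the message."""
    for channel in ladder:
        if channel in open_breakers:
            continue  # breaker open: skip this rung without calling the adapter
        send = adapters.get(channel)
        if send is None:
            continue  # no adapter registered for this channel
        if send(message):
            return channel
    return None  # every rung was skipped or failed
```

A test for the 503-storm scenario would then place `"whatsapp"` in `open_breakers` and assert that an SMS failure falls through directly to voice.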

3. Observability readiness

| Criterion | Status |
| --- | --- |
| All Prometheus metrics emitting (see OBSERVABILITY §2) | |
| Grafana dashboard channel-router.json deployed | |
| All alerts configured in Alertmanager with runbooks | |
| Structured JSON logs (Pino) with MSISDN/body redaction | |
| OTel trace propagation from sms-orchestrator → channel-router → consent-ledger / compliance / OTT adapter verified | |
| Loki parsing rules validated for service logs | |
| SIEM forwarding of channel.audit.v1 via regulator-portal-service verified | |
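
The MSISDN/body redaction criterion can be pictured as a transform applied to every log event before serialisation. The sketch below is language-neutral in intent and is not a Pino configuration. The field names (`msisdn`, `recipient`, `body`) and the keep-last-two-characters masking rule are assumptions chosen for illustration.

```python
import json
from typing import Any

MSISDN_KEYS = {"msisdn", "recipient"}  # assumption: illustrative field names
BODY_KEYS = {"body"}                   # assumption: message text lives under "body"


def mask_msisdn(value: str) -> str:
    """Keep only the last two characters so log lines stay correlatable but not identifying."""
    if len(value) <= 2:
        return "*" * len(value)
    return "*" * (len(value) - 2) + value[-2:]


def redact(event: Any) -> Any:
    """Recursively copy a log event, masking MSISDN fields and replacing body text."""
    if isinstance(event, dict):
        redacted = {}
        for key, value in event.items():
            if key in BODY_KEYS:
                redacted[key] = "[REDACTED]"
            elif key in MSISDN_KEYS and isinstance(value, str):
                redacted[key] = mask_msisdn(value)
            else:
                redacted[key] = redact(value)
        return redacted
    if isinstance(event, list):
        return [redact(item) for item in event]
    return event


def to_log_line(event: dict) -> str:
    """Serialise the redacted copy as a single JSON log line."""
    return json.dumps(redact(event), sort_keys=True)
```

Because `redact` returns a copy, the original event stays available to the request handler while only the masked form reaches the log sink.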

Alerts configured

  • ChannelRouteLatencyHigh (P95 > 50 ms for 5 min)
  • ChannelFallbackRateHigh (> 25% for 15 min)
  • ChannelAdapterUnavailable (per provider)
  • ChannelCostCapBreach
  • ChannelConsentViolationAttempt (Critical)
  • ChannelComplianceViolationAttempt
  • ChannelMoLatencyHigh (P95 > 1 s)
  • ChannelMoRoutingFailed
  • ChannelWebhookSignatureInvalidSpike
  • ChannelAuditChainBreak (Critical)
  • ChannelJetStreamMirrorLag
  • ChannelOutcomePublishLagHigh
  • ChannelMlFeatureDrift (medium)
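
Several of the alerts above carry a hold duration (for example, P95 > 50 ms for 5 min): the condition must stay true for the whole window before the alert fires, so a single spike does not page anyone. The sketch below illustrates that semantics only. It is not how Prometheus or Alertmanager evaluate rules internally, and the function name `fires` and the sample layout are hypothetical.

```python
from typing import List, Optional, Tuple


def fires(samples: List[Tuple[float, float]], threshold: float, hold_seconds: float) -> bool:
    """Decide whether an alert with a hold duration should fire.

    samples are (timestamp_seconds, value) pairs in time order. The alert fires
    when the value has stayed above threshold, without interruption, for at
    least hold_seconds up to and including the latest sample.
    """
    breach_start: Optional[float] = None
    for timestamp, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = timestamp  # start of the current breach
        else:
            breach_start = None  # any recovery resets the hold timer
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start >= hold_seconds
```

With one sample per minute, six consecutive readings above 50 ms satisfy a 300 s hold, whereas a single reading at or below the threshold in the middle restarts the timer.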

4. Security readiness

| Criterion | Status |
| --- | --- |
| mTLS enforced on gRPC ports (SPIRE SVID) | |
| NetworkPolicy restricting ingress to allow-listed callers (sms-orchestrator, admin, kong, prometheus) | |
| Kong JWT validation on REST | |
| RLS policies on recipient_profiles, conversations, tenant_inbound_routes verified | |
| MSISDN per-tenant salted hashing implemented; salt rotation procedure documented | |
| OTT credentials in Vault (secrets/data/chan/ott/{tenantId}/{provider}); never in PG plaintext | |
| Tenant webhook secrets in Vault; 24 h rotation grace | |
| WhatsApp / Telegram / Viber webhook HMAC verification (constant-time) | |
| Audit log Postgres rule rejects UPDATE/DELETE | |
| Audit chain verifier passing daily | |
| Sovereignty guard enforced (CHAN_EXTERNAL_LLM_ENABLED=false) | |
| Penetration test against REST + OTT webhook ingress | |
| Security team sign-off | |
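
Constant-time HMAC verification of an inbound webhook can be sketched as follows. The comparison uses `hmac.compare_digest` from the Python standard library, which avoids leaking the position of the first mismatching byte through timing. The hex encoding, the SHA-256 choice, and the helper names `sign`, `verify_signature`, and `status_for` are assumptions; each OTT provider defines its own header name and signature format.

```python
import hashlib
import hmac


def sign(secret: bytes, body: bytes) -> str:
    """Compute the hex HMAC-SHA256 of the raw request body (assumed scheme)."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_signature(secret: bytes, body: bytes, received_signature: str) -> bool:
    """Compare the expected and received signatures in constant time."""
    expected = sign(secret, body)
    # Encode both sides so compare_digest accepts any input string.
    return hmac.compare_digest(expected.encode("utf-8"), received_signature.encode("utf-8"))


def status_for(secret: bytes, body: bytes, received_signature: str) -> int:
    """Map the verification result to an HTTP status: 200 on match, 401 on spoof."""
    return 200 if verify_signature(secret, body, received_signature) else 401
```

Verification must run over the raw bytes as received; re-serialising parsed JSON before hashing can change the byte sequence and reject legitimate deliveries.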

5. Operational readiness

| Criterion | Status |
| --- | --- |
| K8s Deployment manifests reviewed (channel-router, chan-mo-router, chan-adapter-*) | |
| HPA configured (KEDA on RPS + consumer lag) | |
| PodDisruptionBudget minAvailable: 6 (decision core), minAvailable: 2 (mo-router) | |
| Rolling update: zero dropped gRPC under 2000 RPS | |
| Graceful shutdown: 15 s drain with SIGTERM handler | |
| Resource requests/limits validated under 5000 RPS | |
| PG connection pool sized (PgBouncer transaction mode) | |
| Redis connection pool sized (min 50, max 200 per pod) | |
| Multi-region replication verified (kbl ↔ mzr logical replication for control-plane) | |
| Cross-region MO forwarding tested end-to-end | |
| Adapter-circuit-open runbook drafted | |
| Cost-cap-breach runbook drafted | |
| OTT credential rotation runbook drafted (per provider) | |
| Region failover runbook + drill | |
| On-call schedule populated (Messaging Core primary; SRE secondary) | |

6. External dependencies

| Dependency | Readiness gate |
| --- | --- |
| WhatsApp Business Cloud account | Active Meta business account; phone-number-id provisioned; tokens issued; webhook URL verified |
| Telegram BotFather | Bot tokens issued; webhook secret-path configured; setWebhook API confirmed |
| Viber Public Account | PA token issued; webhook URL registered with Viber |
| Voice OTP gateway MoU | Gateway operational; per-trunk CPS contract signed; SLA 99.9% |
| SMTP egress | Mail-egress IP pool reputation ≥ "Good"; SPF/DKIM/DMARC aligned |
| consent-ledger-service | Production-ready (own readiness doc) |
| compliance-engine | Production-ready |
| sender-id-registry-service | Production-ready |
| webhook-dispatcher | Production-ready |
| numbering-service | Production-ready (lease verification) |

7. Legal and regulatory readiness

| Criterion | Status |
| --- | --- |
| Default fallback policies reviewed by Legal (especially OTT-after-SMS for OTPs: does it constitute a "new message" under regulation?) | |
| Per-tenant fallback policies require tenant attestation that costCap is acceptable | |
| WhatsApp business-template content categories documented and aligned with compliance-engine | |
| Voice OTP: recorded prompts approved by Legal in en/fa/ps | |
| MO conversation retention (default 90 d) approved | |
| Audit retention (13 m hot + 7 y cold) confirmed by Regulator Liaison | |
| Cross-border data egress (only to OTT providers in approved jurisdictions) reviewed | |

8. Documentation

| Doc | Status |
| --- | --- |
| SERVICE_OVERVIEW.md | |
| DOMAIN_MODEL.md | |
| APPLICATION_LOGIC.md | |
| API_CONTRACTS.md | |
| EVENT_SCHEMAS.md | |
| DATA_MODEL.md | |
| SYNC_CONTRACT.md | |
| AI_INTEGRATION.md | |
| SECURITY_MODEL.md | |
| OBSERVABILITY.md | |
| TESTING_STRATEGY.md | |
| DEPLOYMENT_TOPOLOGY.md | |
| FAILURE_MODES.md | |
| LOCAL_DEV_SETUP.md | |
| SERVICE_READINESS.md | ☑ (this) |
| SERVICE_RISK_REGISTER.md | |
| MIGRATION_PLAN.md | |
| Runbooks (see OBSERVABILITY §7) | |

9. Sign-off matrix

| Role | Sign-off required |
| --- | --- |
| Messaging Core lead | |
| SRE lead | |
| Security lead | |
| Trust & Safety (consent integration) | |
| Legal (fallback policy review) | |
| Regulator Liaison | |
| Tenant pilot success criteria met (3 design partners) | |

Production go-live is blocked until all of the above are signed off.


10. Post-launch review

Review the following at T+30 d, T+60 d, and T+90 d:

  • Fallback rate vs target (≤ 10%)
  • Per-channel success rate per tenant
  • Cost-per-OTP across the fallback ladder
  • Tenant escalations (categorised)
  • Audit-chain integrity (must be 100%)
  • ML preference-ordering quality (vs static baseline)
  • OTT provider relationship health (any quality-score concerns)