Channel Router Service — Service Readiness
Version: 1.0 | Status: Draft | Owner: Messaging Core | Last Updated: 2026-04-21
Companion: SERVICE_OVERVIEW · TESTING_STRATEGY · SERVICE_RISK_REGISTER
Related: ADR-0004, docs/standards/SERVICE_TEMPLATE.md
This document tracks the readiness criteria for taking channel-router-service from development to production. Channel-router is the platform's omnichannel decisioner: it sits on the hot path for every omnichannel send and on the synchronous path for every inbound MO (mobile-originated) message. The readiness bar is therefore elevated: sub-50 ms RouteWithFallback latency, fail-closed behaviour on consent and compliance checks, audit-chain integrity, and healthy OTT provider integrations under per-provider rate limits.
1. Code readiness
| Criterion | Status |
|---|---|
| gRPC ChannelRouterService.v1 (RouteWithFallback, DeliverNow, GetConversationSession, GetRecipientProfile) | ☐ |
| gRPC ChannelControlService.v1 (policies, OTT accounts, circuit breakers, inbound routes) | ☐ |
| REST /v1/channel/* admin + tenant surfaces | ☐ |
| OTT webhook ingress (/v1/webhooks/whatsapp, /telegram/{secretPath}, /viber) with HMAC verification | ☐ |
| chan-mo-router worker: mo.allowed.v1 consumer + tenant-webhook fan-out via webhook-dispatcher | ☐ |
| Per-adapter implementations (WhatsApp Cloud, Telegram Bot, Viber, Voice OTP, SMTP) behind ChannelAdapter port | ☐ |
| Per-adapter circuit-breakers + token-buckets | ☐ |
| Deadline scanner (Redis ZSET + 1 s tick) with distributed lock | ☐ |
| Outbox relay (chan.outbox, chan.delivery_outbox) with at-least-once + uniqueness | ☐ |
| Conversation session manager (Redis HASH + PG mirror) | ☐ |
| STOP-keyword detector (en/fa/ps/ar) + tenant overrides | ☐ |
| Recipient-profile updater (LWW, async consumer on delivery feedback) | ☐ |
| Fail-closed gating on consent / compliance / sender-id (deadlines + cache) | ☐ |
| Audit hash-chain implementation (prev_hash, `record_hash = sha256(payload …`) | ☐ |
| Audit chain daily verifier job | ☐ |
| Idempotency-Key support on REST writes; Nats-Msg-Id on outbound events | ☐ |
| mTLS gRPC (SPIRE SVID) | ☐ |
| Sovereignty guard (CHAN_EXTERNAL_LLM_ENABLED=false); pod refuses to boot otherwise | ☐ |
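
The audit hash-chain and daily verifier criteria above can be illustrated with a minimal sketch. It assumes each record's hash covers its payload plus the previous record's hash; the exact hash input, the `GENESIS` sentinel, the `AuditRecord` shape, and the function names are hypothetical and are not taken from the service's actual schema.

```typescript
import { createHash } from "node:crypto";

// Hypothetical record shape; only prev_hash and record_hash are named in the checklist.
interface AuditRecord {
  payload: string;     // canonical serialisation of the audited event (assumed)
  prev_hash: string;   // record_hash of the previous record, or GENESIS for the first
  record_hash: string; // sha256 over payload and prev_hash (assumed input layout)
}

const GENESIS = "0".repeat(64); // assumed sentinel for the first record

function hashRecord(payload: string, prevHash: string): string {
  // A length prefix keeps the payload / prev_hash boundary unambiguous.
  return createHash("sha256")
    .update(`${payload.length}:${payload}`)
    .update(prevHash)
    .digest("hex");
}

// Returns a new chain with one record appended; the input chain is not mutated.
function appendRecord(chain: AuditRecord[], payload: string): AuditRecord[] {
  const last = chain[chain.length - 1];
  const prev_hash = last ? last.record_hash : GENESIS;
  return [...chain, { payload, prev_hash, record_hash: hashRecord(payload, prev_hash) }];
}

// Returns the index of the first broken record, or -1 if the chain is intact.
function verifyChain(chain: AuditRecord[]): number {
  let expectedPrev = GENESIS;
  let index = 0;
  for (const record of chain) {
    const linkOk = record.prev_hash === expectedPrev;
    const hashOk = record.record_hash === hashRecord(record.payload, record.prev_hash);
    if (!linkOk || !hashOk) return index;
    expectedPrev = record.record_hash;
    index += 1;
  }
  return -1;
}
```

A daily verifier built along these lines would walk the stored records in insertion order and raise an alert on any result other than -1.
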
2. Testing readiness
| Criterion | Target | Status |
|---|---|---|
| Unit coverage | ≥ 85% line / ≥ 80% branch (domain ≥ 90%) | ☐ |
| Unit tests for fallback policy evaluator | ≥ 30 tests | ☐ |
| Unit tests for cost calculator + cost-cap | ≥ 15 tests | ☐ |
| Unit tests for state-machine transitions per attempt | ≥ 25 tests | ☐ |
| Unit tests for STOP-keyword matcher per language | ≥ 50 per language (en/fa/ps/ar) | ☐ |
| Unit tests for MSISDN normalisation + per-tenant hashing | ≥ 30 | ☐ |
| Unit tests for provider-status canonicalisation | ≥ 50 (covers WhatsApp, Telegram, Viber, Voice) | ☐ |
| Integration: RouteWithFallback P95 ≤ 50 ms @ 5000 RPS | Passed | ☐ |
| Integration: full SMS→WhatsApp→Voice cascade | Passed | ☐ |
| Integration: MO routing roundtrip via mock webhook | Passed | ☐ |
| Integration: STOP keyword closes session + records opt-out | Passed | ☐ |
| Integration: profile learning from delivery feedback (LWW) | Passed | ☐ |
| Integration: consent.revoked.v1 invalidates cache within 1 s | Passed | ☐ |
| Contract: sms-orchestrator → channel-router gRPC (Pact) | Passed | ☐ |
| Contract: channel-router → consent-ledger gRPC | Passed | ☐ |
| Contract: channel-router → compliance-engine gRPC | Passed | ☐ |
| Contract: channel-router → webhook-dispatcher gRPC | Passed | ☐ |
| Contract: WhatsApp / Telegram / Viber webhook fixtures | Passed | ☐ |
| E2E: bank OTP fallback (SMS → Voice) | Passed | ☐ |
| E2E: government PARALLEL alert across OTT | Passed | ☐ |
| E2E: conversational MT/MO multi-turn | Passed | ☐ |
| Load: 5000 RPS sustained 1 h, P99 ≤ 120 ms | Passed | ☐ |
| Chaos: PG primary kill, fail-over within 30 s | Passed | ☐ |
| Chaos: consent-ledger out → fail-closed verified | Passed | ☐ |
| Chaos: WhatsApp adapter 503 storm → breaker opens, ladder skips | Passed | ☐ |
| Chaos: NATS lag → outbox grows; on heal, drains within 5 min | Passed | ☐ |
| Security: RLS cross-tenant blocked | Passed | ☐ |
| Security: audit log UPDATE/DELETE rejected at PG rule | Passed | ☐ |
| Security: webhook signature spoof → 401 | Passed | ☐ |
| Security: cost-cap evasion attempts rejected | Passed | ☐ |
| Security: penetration test on REST + OTT webhook surface | Passed | ☐ |
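
The chaos criterion "breaker opens, ladder skips" can be made concrete with a small sketch. It is a generic consecutive-failure circuit breaker, not the service's actual implementation; the class, the `nextChannel` helper, the thresholds, and the channel names are all hypothetical. Time is passed in explicitly so the behaviour can be tested deterministically.

```typescript
type BreakerState = "CLOSED" | "OPEN" | "HALF_OPEN";

class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;
  private state: BreakerState = "CLOSED";
  private readonly failureThreshold: number; // consecutive failures before opening
  private readonly coolDownMs: number;       // how long to stay open before probing

  constructor(failureThreshold: number, coolDownMs: number) {
    this.failureThreshold = failureThreshold;
    this.coolDownMs = coolDownMs;
  }

  // True if a call may be attempted at time `now` (milliseconds).
  allow(now: number): boolean {
    if (this.state === "OPEN" && now - this.openedAt >= this.coolDownMs) {
      this.state = "HALF_OPEN"; // probing is allowed until the next recorded result
    }
    return this.state !== "OPEN";
  }

  recordSuccess(): void {
    this.failures = 0;
    this.state = "CLOSED";
  }

  recordFailure(now: number): void {
    this.failures += 1;
    // A failed probe re-opens immediately; otherwise open once the threshold is hit.
    if (this.state === "HALF_OPEN" || this.failures >= this.failureThreshold) {
      this.state = "OPEN";
      this.openedAt = now;
    }
  }
}

// Walks the fallback ladder and returns the first channel whose breaker allows a call.
// Channels without a registered breaker are treated as available.
function nextChannel(
  ladder: string[],
  breakers: Map<string, CircuitBreaker>,
  now: number,
): string | undefined {
  return ladder.find((channel) => breakers.get(channel)?.allow(now) ?? true);
}
```

In a chaos test, a 503 storm against one adapter would be simulated by calling `recordFailure` repeatedly and then asserting that `nextChannel` returns the next rung of the ladder until the cool-down elapses.
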
3. Observability readiness
| Criterion | Status |
|---|---|
| All Prometheus metrics emitting (see OBSERVABILITY §2) | ☐ |
| Grafana dashboard channel-router.json deployed | ☐ |
| All alerts configured in Alertmanager with runbooks | ☐ |
| Structured JSON logs (Pino) with MSISDN/body redaction | ☐ |
| OTel trace propagation from sms-orchestrator → channel-router → consent-ledger / compliance / OTT adapter verified | ☐ |
| Loki parsing rules validated for service logs | ☐ |
| SIEM forwarding of channel.audit.v1 via regulator-portal-service verified | ☐ |
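
As an illustration of the MSISDN/body redaction criterion, the sketch below scrubs a log entry before it is serialised. It is a standalone function rather than Pino configuration; the field names in `REDACTED_KEYS` and the loose digit-run pattern are assumptions, since the real log schema is not specified in this document.

```typescript
// Assumed names of fields that carry message content; not taken from the real log schema.
const REDACTED_KEYS = new Set(["body", "text", "message_body"]);

// Loose E.164-style match (assumption): optional "+" followed by 8 to 15 digits.
// Note that this also masks other long digit runs that appear inside strings.
const MSISDN_PATTERN = /\+?\d{8,15}/g;

// Recursively returns a redacted copy of a log value; the input is not mutated.
function redactValue(value: unknown): unknown {
  if (typeof value === "string") {
    return value.replace(MSISDN_PATTERN, "[MSISDN]");
  }
  if (Array.isArray(value)) {
    return value.map((item) => redactValue(item));
  }
  if (value !== null && typeof value === "object") {
    const source = value as Record<string, unknown>;
    const out: Record<string, unknown> = {};
    for (const key of Object.keys(source)) {
      // Content fields are dropped wholesale; everything else is scanned for MSISDNs.
      out[key] = REDACTED_KEYS.has(key) ? "[REDACTED]" : redactValue(source[key]);
    }
    return out;
  }
  return value; // numbers, booleans, null, undefined pass through unchanged
}
```

A function like this could be attached wherever the logger allows a pre-serialisation hook; numeric fields are left untouched, so MSISDNs should be logged as strings for the mask to apply.
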
Alerts configured
- `ChannelRouteLatencyHigh` (P95 > 50 ms for 5 min)
- `ChannelFallbackRateHigh` (> 25% for 15 min)
- `ChannelAdapterUnavailable` (per provider)
- `ChannelCostCapBreach`
- `ChannelConsentViolationAttempt` (Critical)
- `ChannelComplianceViolationAttempt`
- `ChannelMoLatencyHigh` (P95 > 1 s)
- `ChannelMoRoutingFailed`
- `ChannelWebhookSignatureInvalidSpike`
- `ChannelAuditChainBreak` (Critical)
- `ChannelJetStreamMirrorLag`
- `ChannelOutcomePublishLagHigh`
- `ChannelMlFeatureDrift` (medium)
4. Security readiness
| Criterion | Status |
|---|---|
| mTLS enforced on gRPC ports (SPIRE SVID) | ☐ |
| NetworkPolicy restricting ingress to allow-listed callers (sms-orchestrator, admin, kong, prometheus) | ☐ |
| Kong JWT validation on REST | ☐ |
| RLS policies on recipient_profiles, conversations, tenant_inbound_routes verified | ☐ |
| MSISDN per-tenant salted hashing implemented; salt rotation procedure documented | ☐ |
| OTT credentials in Vault (secrets/data/chan/ott/{tenantId}/{provider}); never in PG plaintext | ☐ |
| Tenant webhook secrets in Vault; 24 h rotation grace | ☐ |
| WhatsApp / Telegram / Viber webhook HMAC verification (constant-time) | ☐ |
| Audit log Postgres rule rejects UPDATE/DELETE | ☐ |
| Audit chain verifier passing daily | ☐ |
| Sovereignty guard enforced (CHAN_EXTERNAL_LLM_ENABLED=false) | ☐ |
| Penetration test against REST + OTT webhook ingress | ☐ |
| Security team sign-off | ☐ |
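
The constant-time HMAC verification criterion above might look like the following sketch. It assumes an HMAC-SHA256 signature delivered as a hex string over the raw request body; the function name and parameters are hypothetical, and each OTT provider's actual header name, encoding, and signing scheme must be taken from that provider's documentation.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Verifies a hex-encoded HMAC-SHA256 signature over the raw (unparsed) request body.
function verifyWebhookSignature(rawBody: string, signatureHex: string, secret: string): boolean {
  const expected = createHmac("sha256", secret).update(rawBody, "utf8").digest();
  const received = Buffer.from(signatureHex, "hex");
  // timingSafeEqual throws on length mismatch, so reject that case up front.
  if (received.length !== expected.length) {
    return false;
  }
  // Constant-time comparison avoids leaking how many leading bytes matched.
  return timingSafeEqual(received, expected);
}
```

A handler using this would respond 401 whenever the function returns false, which is the behaviour the "webhook signature spoof → 401" test in section 2 expects. The signature must be computed over the exact bytes received, before any JSON parsing or re-serialisation.
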
5. Operational readiness
| Criterion | Status |
|---|---|
| K8s Deployment manifests reviewed (channel-router, chan-mo-router, chan-adapter-*) | ☐ |
| HPA configured (KEDA on RPS + consumer lag) | ☐ |
| PodDisruptionBudget minAvailable: 6 (decision core), minAvailable: 2 (mo-router) | ☐ |
| Rolling update: zero dropped gRPC under 2000 RPS | ☐ |
| Graceful shutdown: 15 s drain with SIGTERM handler | ☐ |
| Resource requests/limits validated under 5000 RPS | ☐ |
| PG connection pool sized (PgBouncer transaction mode) | ☐ |
| Redis connection pool sized (min 50, max 200 per pod) | ☐ |
| Multi-region replication verified (kbl ↔ mzr logical replication for control-plane) | ☐ |
| Cross-region MO forwarding tested end-to-end | ☐ |
| Adapter-circuit-open runbook drafted | ☐ |
| Cost-cap-breach runbook drafted | ☐ |
| OTT credential rotation runbook drafted (per provider) | ☐ |
| Region failover runbook + drill | ☐ |
| On-call schedule populated (Messaging Core primary; SRE secondary) | ☐ |
6. External dependencies
| Dependency | Readiness gate |
|---|---|
| WhatsApp Business Cloud account | Active Meta business account; phone-number-id provisioned; tokens issued; webhook URL verified |
| Telegram BotFather | Bot tokens issued; webhook secret-path configured; setWebhook API confirmed |
| Viber Public Account | PA token issued; webhook URL registered with Viber |
| Voice OTP gateway MoU | Gateway operational; per-trunk CPS contract signed; SLA 99.9% |
| SMTP egress | Mail-egress IP pool reputation ≥ "Good"; SPF/DKIM/DMARC aligned |
| consent-ledger-service | Production-ready (own readiness doc) |
| compliance-engine | Production-ready |
| sender-id-registry-service | Production-ready |
| webhook-dispatcher | Production-ready |
| numbering-service | Production-ready (lease verification) |
7. Legal & regulatory
| Criterion | Status |
|---|---|
| Default fallback policies reviewed by Legal (especially OTT-after-SMS for OTPs — does it constitute a "new message" under regulation?) | ☐ |
| Per-tenant fallback policies require tenant attestation that costCap is acceptable | ☐ |
| WhatsApp business-template content categories documented and aligned with compliance-engine | ☐ |
| Voice OTP — recorded prompts approved by Legal in en/fa/ps | ☐ |
| MO conversation retention (default 90 days) approved | ☐ |
| Audit retention (13 months hot + 7 years cold) confirmed by Regulator Liaison | ☐ |
| Cross-border data egress (only to OTT providers in approved jurisdictions) reviewed | ☐ |
8. Documentation
| Doc | Status |
|---|---|
| SERVICE_OVERVIEW.md | ☑ |
| DOMAIN_MODEL.md | ☑ |
| APPLICATION_LOGIC.md | ☑ |
| API_CONTRACTS.md | ☑ |
| EVENT_SCHEMAS.md | ☑ |
| DATA_MODEL.md | ☑ |
| SYNC_CONTRACT.md | ☑ |
| AI_INTEGRATION.md | ☑ |
| SECURITY_MODEL.md | ☑ |
| OBSERVABILITY.md | ☑ |
| TESTING_STRATEGY.md | ☑ |
| DEPLOYMENT_TOPOLOGY.md | ☑ |
| FAILURE_MODES.md | ☑ |
| LOCAL_DEV_SETUP.md | ☑ |
| SERVICE_READINESS.md | ☑ (this) |
| SERVICE_RISK_REGISTER.md | ☑ |
| MIGRATION_PLAN.md | ☑ |
| Runbooks (see OBSERVABILITY §7) | ☐ |
9. Sign-off matrix
| Role / gate | Signed off |
|---|---|
| Messaging Core lead | ☐ |
| SRE lead | ☐ |
| Security lead | ☐ |
| Trust & Safety (consent integration) | ☐ |
| Legal (fallback policy review) | ☐ |
| Regulator Liaison | ☐ |
| Tenant pilot success criteria met (3 design partners) | ☐ |
Production go-live is blocked until all of the above are signed off.
10. Post-launch review
At T+30 d, T+60 d, and T+90 d, review:
- Fallback rate vs target (≤ 10%)
- Per-channel success rate per tenant
- Cost-per-OTP across the fallback ladder
- Tenant escalations (categorised)
- Audit-chain integrity (must be 100%)
- ML preference-ordering quality (vs static baseline)
- OTT provider relationship health (any quality-score concerns)