
Channel Router Service — Sync Contract

Version: 1.0 · Status: Draft · Owner: Messaging Core · Last Updated: 2026-04-21
Companion: API_CONTRACTS · APPLICATION_LOGIC · DATA_MODEL

This document defines what other services depend on from channel-router, what channel-router depends on from others, and the conflict resolution rules that govern multi-region replication of channel-router state.


1. Consumers of channel-router

| Service | Interface | Dependency type | SLA expectation |
|---|---|---|---|
| sms-orchestrator | gRPC RouteWithFallback | Synchronous, in the async orchestrator pipeline | P95 ≤ 50 ms; availability 99.99% per region |
| admin-dashboard | REST /v1/channel/* | Synchronous admin surface | P95 ≤ 500 ms; availability 99.5% |
| tenant-portal | REST /v1/channel/tenants/{tenantId}/... | Synchronous tenant surface | P95 ≤ 500 ms |
| webhook-dispatcher | NATS chan.mo.delivery.requested.v1 (internal) | Async fan-out for MO tenant webhooks | P99 ≤ 1 s end-to-end |
| billing-service | NATS channel.billing.event.v1 | Async metering feed | lag ≤ 60 s |
| analytics-service | NATS channel.*, notification.delivery.outcome.v1 | Async archival | lag ≤ 5 min |
| notification-service | NATS notification.delivery.outcome.v1, channel.conversation.* | Async | lag ≤ 60 s |
| regulator-portal-service | NATS channel.audit.v1 | SIEM / audit forwarding | lag ≤ 5 min |
| compliance-engine | (consumes channel.mo.inbound.v1 via sms.mo.inbound re-publish) | Async | lag ≤ 5 s |

Async contract semantics. The gRPC RouteWithFallback call is synchronous only up to step-0 dispatch. The rest of the ladder runs asynchronously, and its termination produces exactly one notification.delivery.outcome.v1 event per (notificationId, recipientId). Consumers of that event therefore see a definitive final state, never a partial outcome.


2. Dependencies of channel-router

| Dependency | Interface | Failure mode if unavailable |
|---|---|---|
| consent-ledger-service | gRPC CheckConsent (hot path) | Fail-closed after deadline + cache miss: refuse dispatch with REFUSED_CONSENT_UNKNOWN |
| compliance-engine | gRPC EvaluateChannelCompliance (hot path) | Fail-closed: refuse dispatch with REFUSED_COMPLIANCE_UNKNOWN |
| sender-id-registry-service | gRPC VerifySender | Hot-cached 300 s; beyond that, fail-closed |
| numbering-service | gRPC GetLease | Only on MoRouting CRUD writes; soft-fail on reads (cached) |
| smpp-connector | NATS sms.outbound.dispatch.v1 | SMS step fails_temp → ladder skips SMS on breaker-open |
| dlr-processor | NATS sms.dlr.inbound | SMS step progresses on deadline elapse instead of DLR receipt |
| webhook-dispatcher | gRPC (tenant webhook deliver) | MO routing retries via webhook-dispatcher's own back-off; after 5 attempts → mo.webhook.deadletter.v1 |
| fraud-intel-service | NATS fraud.detected.channel_abuse.v1 | Signal-absent = default breaker state; no impact to hot path |
| WhatsApp Cloud API | HTTPS | Adapter breaker opens after 50 calls with > 50% errors |
| Telegram Bot API | HTTPS | Same |
| Viber Business | HTTPS | Same |
| Voice OTP gateway | gRPC | Step fails_temp → ladder progresses |
| SMTP egress | SMTP | Step fails_temp → ladder progresses |
| PostgreSQL chan schema | SQL (PgBouncer transaction mode) | Hot-path reads served from Redis up to 300 s; writes fail with UNAVAILABLE causing NATS redelivery |
| Redis Sentinel | Sliding-window TPS + cache + sessions | Hot path drops to PG direct (higher latency); sessions degrade to PG-only |
| NATS JetStream | Outbox publish + consumer | Outbox buffer grows; consumer lag alert; no data loss (outbox persists) |
| Vault | OTT credential fetch | Cached 60 s in-process; if cache stale + Vault down → adapter breaker opens |
| HSM (SVID signing) | SPIRE | mTLS SVID rotation blocked; existing SVIDs valid up to 1 h |
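The OTT adapter rows state that a breaker "opens after 50 calls with > 50% errors". One plausible reading, sketched below, is a rolling window over the last 50 calls that trips once the error rate in a full window is strictly above 50%. The class name AdapterBreaker, the window interpretation, and the manual reset are assumptions made for illustration; the table does not specify half-open behaviour or how the real adapters count calls.

```typescript
// Illustrative rolling-window breaker; not the service's actual implementation.
type CircuitState = "CLOSED" | "OPEN";

class AdapterBreaker {
  private readonly windowSize: number;     // calls considered (50 per the table)
  private readonly errorThreshold: number; // trip when error rate is strictly above this
  private outcomes: boolean[] = [];        // true = error; most recent call last
  private state: CircuitState = "CLOSED";

  constructor(windowSize = 50, errorThreshold = 0.5) {
    this.windowSize = windowSize;
    this.errorThreshold = errorThreshold;
  }

  // Record one call outcome and return the resulting circuit state.
  record(isError: boolean): CircuitState {
    this.outcomes.push(isError);
    if (this.outcomes.length > this.windowSize) this.outcomes.shift();
    // Assumption: evaluate only once a full window of calls has been observed.
    if (this.outcomes.length === this.windowSize) {
      const errors = this.outcomes.filter(Boolean).length;
      if (errors / this.windowSize > this.errorThreshold) this.state = "OPEN";
    }
    return this.state;
  }

  // Assumption: closing the circuit is an explicit operator or timer action.
  reset(): void {
    this.outcomes = [];
    this.state = "CLOSED";
  }

  get circuitState(): CircuitState {
    return this.state;
  }
}
```

With these defaults, 25 errors in 50 calls leaves the circuit closed (exactly 50% is not above the threshold), while 26 errors in 50 calls opens it.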

3. Proto definitions

syntax = "proto3";
package ghasi.sms.channel.v1;
option go_package = "github.com/ghasi/sms-gateway/channel/v1";

// ============== Data plane (:50071) ==============

service ChannelRouterService {
  rpc RouteWithFallback (RouteWithFallbackRequest) returns (RouteWithFallbackAck);
  rpc DeliverNow (DeliverNowRequest) returns (DeliverNowAck);
  rpc GetConversationSession (GetConversationSessionRequest) returns (ConversationSession);
  rpc GetRecipientProfile (GetRecipientProfileRequest) returns (RecipientProfile);
}

// ============== Control plane (:50072) ==============

service ChannelControlService {
  rpc PutFallbackPolicy (PutFallbackPolicyRequest) returns (FallbackPolicy);
  rpc DeleteFallbackPolicy (DeleteFallbackPolicyRequest) returns (Empty);
  rpc PutOttAccount (PutOttAccountRequest) returns (OttAccount);
  rpc RotateOttAccount (RotateOttAccountRequest) returns (OttAccount);
  rpc SetCircuitState (SetCircuitStateRequest) returns (AdapterHealth);
  rpc PutInboundRoute (PutInboundRouteRequest) returns (TenantInboundRoute);
  rpc DeleteInboundRoute (DeleteInboundRouteRequest) returns (Empty);
  rpc PutAdapterStatusMap (PutAdapterStatusMapRequest) returns (AdapterStatusMap);
}

// ============== Shared messages (excerpt) ==============

message FallbackPolicy {
  string policy_id = 1;
  string tenant_id = 2;
  string use_case = 3;
  string strategy = 4;
  double cost_cap_per_message_ngn = 5;
  int32 session_ttl_seconds = 6;
  repeated LadderStep ladder = 7;
  repeated string stop_keywords_override = 8;
  int32 version = 9;
}

message LadderStep {
  Channel channel = 1;
  int32 deadline_seconds = 2;
  int32 retry_budget = 3;
  repeated string escalate_on = 4;
  double cost_cap_step_ngn = 5;
}

message OttAccount {
  string adapter_config_id = 1;
  string tenant_id = 2;
  string provider = 3;
  string phone_number_id_or_handle = 4;
  string secret_ref = 5;
  string circuit_state = 6;
  bool active = 7;
}

message AdapterHealth {
  string adapter = 1;
  string circuit_state = 2;
  int32 in_flight = 3;
  double error_rate_1m = 4;
  string last_webhook_at = 5;
}

message Empty {}

4. Integration point in sms-orchestrator

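// sms-orchestrator NATS consumer (omnichannel lane).
// parseDispatchRequest, channelRouter, repo and isTransient are defined
// elsewhere in sms-orchestrator and are not shown here.
import type { JsMsg } from "nats";

async function handleOmniChannelDispatch(msg: JsMsg): Promise<void> {
  const req = parseDispatchRequest(msg.data);
  try {
    const ack = await channelRouter.routeWithFallback({
      notificationId: req.notificationId,
      recipientId: req.recipientId,
      tenantId: req.tenantId,
      useCase: req.useCase,
      msisdn: req.msisdn,
      body: req.body,
      segments: req.segments,
      encoding: req.encoding,
      senderId: req.senderId,
      idempotencyKey: req.idempotencyKey,
    }, { deadline: 500 }); // 500 ms orchestrator-side deadline
    // NOTE: if channelRouter is a raw @grpc/grpc-js client, pass an absolute
    // deadline instead (Date.now() + 500); grpc-js does not take relative values.

    // Record any channels the router excluded from the ladder.
    if (ack.excluded.length > 0) {
      await repo.annotate(req.notificationId, { excluded: ack.excluded });
    }
    msg.ack();
  } catch (err) {
    // gRPC UNAVAILABLE / DEADLINE_EXCEEDED / INTERNAL → do NOT ack.
    // NATS will redeliver; after 5 attempts → notification.dispatch.deadletter.v1
    if (isTransient(err)) {
      msg.nak(5_000); // negative-ack: ask NATS to redeliver after 5 s
    } else {
      // Non-transient failure: mark the notification and ack so it is not redelivered.
      await repo.updateStatus(req.notificationId, 'BLOCKED_BAD_INPUT');
      msg.ack();
    }
  }
}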
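The handler above relies on an isTransient helper that this document does not show. A minimal sketch of what it might look like follows, assuming errors carry a numeric gRPC status code in a code property (as the ServiceError type in @grpc/grpc-js does) and that the transient set is exactly the three statuses named in the handler's comment. The numeric values are the standard gRPC status codes; everything else is illustrative.

```typescript
// Standard gRPC status codes for the statuses the handler treats as retryable.
const GRPC_STATUS = {
  DEADLINE_EXCEEDED: 4,
  INTERNAL: 13,
  UNAVAILABLE: 14,
} as const;

const TRANSIENT_CODES = new Set<number>([
  GRPC_STATUS.DEADLINE_EXCEEDED,
  GRPC_STATUS.INTERNAL,
  GRPC_STATUS.UNAVAILABLE,
]);

// Hypothetical helper: true when the error looks like a retryable gRPC failure.
// Anything without a numeric `code` is treated as non-transient.
function isTransient(err: unknown): boolean {
  if (typeof err !== "object" || err === null) return false;
  const code = (err as { code?: unknown }).code;
  return typeof code === "number" && TRANSIENT_CODES.has(code);
}
```

Under this sketch an INVALID_ARGUMENT error (code 3) falls through to the BLOCKED_BAD_INPUT branch, while UNAVAILABLE (code 14) triggers a redelivery.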

5. Multi-region conflict policy

Per ADR-0004 §5, the platform runs active-active across kbl and mzr. Channel-router state is partitioned by region, with an explicit conflict rule for each aggregate:

| Aggregate | Replication | Conflict rule | Rationale |
|---|---|---|---|
| recipient_profiles | Multi-master via PG logical replication | LWW on updated_at; causally-related delta merges at the profile-update worker | Delivery observations may arrive from either region after a regional outage; LWW gives stable eventual convergence within the 30 d decay window |
| fallback_policies | Control-plane primary in kbl, replicated read-only to mzr | Server authoritative: writes only in kbl; read-your-own-write via sticky tenant-admin session | Policies change infrequently; simpler invariants |
| channel_adapter_configs | Same as fallback_policies | Server authoritative | Credential rotations must have a single source of truth |
| tenant_inbound_routes | Same | Server authoritative | Inbound number uniqueness is a global invariant |
| conversations | Region-pinned | Conversations pin to the region that opened them; cross-region MO arriving in the other region is routed to the owning region via internal NATS mirror | Session stickiness is critical for 2-way SMS context |
| fallback_executions | Region-pinned (write-once, read-anywhere) | Append-only | Execution row is complete within one region; audit read from either region |
| delivery_attempts | Region-pinned | Append-only | Same |
| audit | Mirror both regions | Append-only; hash-chain per-region; daily cross-region reconciliation | Regulatory evidence preserved against regional loss |
| delivery_outbox | Region-pinned | UNIQUE(notificationId, recipientId) enforced locally; cross-region dedup via NATS JetStream dedup window | Prevents double-outcome during regional partition |
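The LWW rule for recipient_profiles can be illustrated with a small merge function. This is a sketch under stated assumptions: the row shape, the preferredChannel column, and the tie-break on origin region are hypothetical (the table only says "LWW on updated_at"), and the delta merge performed by the profile-update worker is not modelled.

```typescript
// Hypothetical projection of a recipient_profiles row; only updatedAt is from the table.
interface RecipientProfileRow {
  recipientId: string;
  updatedAt: string;            // ISO-8601 UTC timestamp (updated_at)
  originRegion: "kbl" | "mzr";  // assumption: the row records where it was written
  preferredChannel: string;     // illustrative payload column
}

// Last-writer-wins: the row with the later updated_at survives.
// Assumption: ties are broken by region name so both regions pick the same row.
function mergeLww(local: RecipientProfileRow, remote: RecipientProfileRow): RecipientProfileRow {
  const l = Date.parse(local.updatedAt);
  const r = Date.parse(remote.updatedAt);
  if (l !== r) return l > r ? local : remote;
  return local.originRegion <= remote.originRegion ? local : remote;
}
```

The deterministic tie-break matters because each region applies the merge with the roles of local and remote swapped; without it, equal timestamps could leave the two regions holding different rows.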

Cross-region NATS mirrors.

  • CHANNEL_OUTCOMES and CHANNEL_BILLING are mirrored both ways (audit-grade).
  • CHANNEL_CONVERSATIONS is mirrored to the other region only for cross-region MO routing lookups.
  • CHANNEL_EVENTS is per-region; consumers in the other region subscribe as durable mirror consumers.

6. Outbox + idempotency contract

  • All state-changing writes go into chan.outbox or chan.delivery_outbox within the same PG transaction as the aggregate change.
  • Relay publishes with Nats-Msg-Id equal to the event's deterministic dedup key:
    • For outcomes: outcome:{notificationId}:{recipientId}
    • For attempts: attempt:{attemptId}
    • For billing: billing:{notificationId}:{recipientId}:{stepIndex}
  • The JetStream dedup window (2 min; 5 min for CHANNEL_BILLING) suppresses duplicate publishes with the same Nats-Msg-Id, so relay retries inside the window do not reach consumers twice.
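The three dedup key formats listed above can be sketched as one builder. The type and function names are illustrative; only the key formats come from this section. The relay would set the returned string as the Nats-Msg-Id header when publishing.

```typescript
// Illustrative event identity types; field names follow the key templates above.
type DedupSubject =
  | { kind: "outcome"; notificationId: string; recipientId: string }
  | { kind: "attempt"; attemptId: string }
  | { kind: "billing"; notificationId: string; recipientId: string; stepIndex: number };

// Build the deterministic Nats-Msg-Id for an outbox event.
// The same logical event always yields the same key, so a relay retry
// publishes under an identical id.
function natsMsgId(e: DedupSubject): string {
  switch (e.kind) {
    case "outcome":
      return `outcome:${e.notificationId}:${e.recipientId}`;
    case "attempt":
      return `attempt:${e.attemptId}`;
    case "billing":
      return `billing:${e.notificationId}:${e.recipientId}:${e.stepIndex}`;
  }
}
```

Because the key is derived only from identifiers already stored in the outbox row, the relay needs no extra state to stay idempotent across restarts.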

7. Versioning rules

  • gRPC services evolve via additive proto (buf breaking CI gate).
  • REST endpoints evolve additively; breaking → /v2/.
  • NATS events evolve additively within schemaVersion; breaking → .v2 subject with ≥ 90 d overlap.
  • Adapter-status maps are data, loaded on startup and refreshed on chan.status_map.changed.v1 — no deploy required.

8. Operational contract

  • sms-orchestrator MUST set a client-side gRPC deadline of 500 ms on RouteWithFallback and MUST NOT ack the NATS message when that deadline is exceeded.
  • admin-dashboard MUST send an If-Match header on policy PUT so that a version mismatch is rejected instead of causing a lost update.
  • OTT webhook ingress MUST verify the provider HMAC signature; unsigned payloads are dropped with no retry.
  • Cross-region MO arriving in the non-owning region is forwarded via internal subject chan.mo.crossregion.forward.v1; expected lag ≤ 100 ms.
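The If-Match requirement on policy PUT can be illustrated with a server-side precondition check. This sketch assumes the entity tag is the integer version field of FallbackPolicy rendered as a string, and that a missing header is answered with 428 Precondition Required; neither detail is stated in this contract. The function and type names are hypothetical.

```typescript
// Hypothetical stored state; `version` corresponds to FallbackPolicy.version.
interface StoredPolicy {
  policyId: string;
  version: number;
}

type PutResult =
  | { status: 200; newVersion: number } // write accepted, version bumped
  | { status: 412 }                     // Precondition Failed: stale version
  | { status: 428 };                    // Precondition Required: header missing

// Compare the If-Match header with the stored version before applying a PUT.
function checkIfMatch(ifMatch: string | undefined, stored: StoredPolicy): PutResult {
  if (ifMatch === undefined) return { status: 428 };
  // ETags are normally quoted ("3"); accept the bare form as well.
  const tag = ifMatch.trim().replace(/^"|"$/g, "");
  if (tag !== String(stored.version)) return { status: 412 };
  return { status: 200, newVersion: stored.version + 1 };
}
```

Two admins editing the same policy at version 3 would both send If-Match: "3"; the first write succeeds and moves the policy to version 4, and the second then receives 412 and must re-read before retrying.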