Channel Router Service — Sync Contract
Version: 1.0 Status: Draft Owner: Messaging Core Last Updated: 2026-04-21 Companion: API_CONTRACTS · APPLICATION_LOGIC · DATA_MODEL
This document defines what other services depend on from channel-router, what channel-router depends on from others, and the conflict resolution rules that govern multi-region replication of channel-router state.
1. Consumers of channel-router
| Service | Interface | Dependency type | SLA expectation |
|---|---|---|---|
| sms-orchestrator | gRPC RouteWithFallback | Synchronous, in the async orchestrator pipeline | P95 ≤ 50 ms; availability 99.99% per region |
| admin-dashboard | REST /v1/channel/* | Synchronous admin surface | P95 ≤ 500 ms; availability 99.5% |
| tenant-portal | REST /v1/channel/tenants/{tenantId}/... | Synchronous tenant surface | P95 ≤ 500 ms |
| webhook-dispatcher | NATS chan.mo.delivery.requested.v1 (internal) | Async fan-out for MO tenant webhooks | P99 ≤ 1 s end-to-end |
| billing-service | NATS channel.billing.event.v1 | Async metering feed | lag ≤ 60 s |
| analytics-service | NATS channel.*, notification.delivery.outcome.v1 | Async archival | lag ≤ 5 min |
| notification-service | NATS notification.delivery.outcome.v1, channel.conversation.* | Async | lag ≤ 60 s |
| regulator-portal-service | NATS channel.audit.v1 | SIEM / audit forwarding | lag ≤ 5 min |
| compliance-engine | (consumes channel.mo.inbound.v1 via sms.mo.inbound re-publish) | Async | lag ≤ 5 s |
Async contract semantics. The gRPC RouteWithFallback call is synchronous only up to step-0 dispatch; the rest of the ladder runs asynchronously. When the ladder terminates, it produces exactly one notification.delivery.outcome.v1 event per (notificationId, recipientId), so consumers of that event always see a definitive final state and never a partial outcome.
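The single-outcome rule above can be sketched as a ladder walk that emits nothing per step and returns one terminal result. This is a minimal illustration, not the service's implementation: `runLadder`, `StepResult`, `FinalOutcome`, and the `DELIVERED` / `FAILED` status names are hypothetical, and the sketch ignores deadlines, retry budgets, and `escalate_on` filtering.

```typescript
// Hypothetical types: only the "one outcome per (notificationId, recipientId)"
// rule comes from the contract; all names here are illustrative.
type StepResult = "delivered" | "fails_temp" | "fails_perm";

interface FinalOutcome {
  notificationId: string;
  recipientId: string;
  status: "DELIVERED" | "FAILED";
  stepsTried: number;
}

// Walks the ladder in order and returns exactly one terminal outcome.
function runLadder(
  notificationId: string,
  recipientId: string,
  ladder: string[],
  attempt: (channel: string) => StepResult,
): FinalOutcome {
  let stepsTried = 0;
  for (const channel of ladder) {
    stepsTried += 1;
    if (attempt(channel) === "delivered") {
      // First success terminates the ladder.
      return { notificationId, recipientId, status: "DELIVERED", stepsTried };
    }
    // Any failure falls through to the next step; no outcome is produced here.
  }
  // Ladder exhausted (or empty): still exactly one outcome.
  return { notificationId, recipientId, status: "FAILED", stepsTried };
}
```

Because the function has a single return value per invocation, a caller that publishes the result cannot emit a partial outcome for an intermediate step.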
2. Dependencies of channel-router
| Dependency | Interface | Failure mode if unavailable |
|---|---|---|
| consent-ledger-service | gRPC CheckConsent (hot path) | Fail-closed after deadline + cache miss: refuse dispatch with REFUSED_CONSENT_UNKNOWN |
| compliance-engine | gRPC EvaluateChannelCompliance (hot path) | Fail-closed: refuse dispatch with REFUSED_COMPLIANCE_UNKNOWN |
| sender-id-registry-service | gRPC VerifySender | Hot-cached 300 s; beyond that, fail-closed |
| numbering-service | gRPC GetLease | Only on MoRouting CRUD writes; soft-fail on reads (cached) |
| smpp-connector | NATS sms.outbound.dispatch.v1 | SMS step fails_temp → ladder skips SMS on breaker-open |
| dlr-processor | NATS sms.dlr.inbound | SMS step progresses on deadline elapse instead of DLR receipt |
| webhook-dispatcher | gRPC (tenant webhook deliver) | MO routing retries via webhook-dispatcher's own back-off; after 5 attempts → mo.webhook.deadletter.v1 |
| fraud-intel-service | NATS fraud.detected.channel_abuse.v1 | Signal-absent = default breaker state; no impact to hot path |
| WhatsApp Cloud API | HTTPS | Adapter breaker opens after 50 calls with > 50% errors |
| Telegram Bot API | HTTPS | Same |
| Viber Business | HTTPS | Same |
| Voice OTP gateway | gRPC | Step fails_temp → ladder progresses |
| SMTP egress | SMTP | Step fails_temp → ladder progresses |
| PostgreSQL chan schema | SQL (PgBouncer transaction mode) | Hot-path reads served from Redis up to 300 s; writes fail with UNAVAILABLE causing NATS redelivery |
| Redis Sentinel | Sliding-window TPS + cache + sessions | Hot path drops to PG direct (higher latency); sessions degrade to PG-only |
| NATS JetStream | Outbox publish + consumer | Outbox buffer grows; consumer lag alert; no data loss (outbox persists) |
| Vault | OTT credential fetch | Cached 60 s in-process; if cache stale + Vault down → adapter breaker opens |
| HSM (SVID signing) | SPIRE | mTLS SVID rotation blocked; existing SVIDs valid up to 1 h |
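The fail-closed rows in the table above share one shape: use the live answer if it arrived, fall back to a fresh cache entry, and otherwise refuse. The sketch below illustrates that decision for the consent check. `decideConsent`, `CachedConsent`, and the `cacheTtlMs` parameter are hypothetical names; the table does not state a cache TTL for consent, so the caller supplies one. Only `REFUSED_CONSENT_UNKNOWN` is taken from the table.

```typescript
type ConsentDecision = "ALLOW" | "DENY" | "REFUSED_CONSENT_UNKNOWN";

interface CachedConsent {
  granted: boolean;
  fetchedAtMs: number; // when the cached answer was obtained
}

// live === undefined models "dependency unavailable or deadline exceeded".
function decideConsent(
  live: boolean | undefined,
  cached: CachedConsent | undefined,
  nowMs: number,
  cacheTtlMs: number, // assumption: TTL is supplied by configuration
): ConsentDecision {
  // 1. A live answer always wins.
  if (live !== undefined) {
    return live ? "ALLOW" : "DENY";
  }
  // 2. No live answer: accept the cache only while it is still fresh.
  if (cached !== undefined && nowMs - cached.fetchedAtMs <= cacheTtlMs) {
    return cached.granted ? "ALLOW" : "DENY";
  }
  // 3. Deadline passed and cache missing or stale: fail closed.
  return "REFUSED_CONSENT_UNKNOWN";
}
```

The same structure would apply to the sender-ID check with a 300 s TTL; the important property is that the final branch never defaults to allowing dispatch.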
3. Proto definitions
syntax = "proto3";
package ghasi.sms.channel.v1;
option go_package = "github.com/ghasi/sms-gateway/channel/v1";
// ============== Data plane (:50071) ==============
service ChannelRouterService {
  rpc RouteWithFallback (RouteWithFallbackRequest) returns (RouteWithFallbackAck);
  rpc DeliverNow (DeliverNowRequest) returns (DeliverNowAck);
  rpc GetConversationSession (GetConversationSessionRequest) returns (ConversationSession);
  rpc GetRecipientProfile (GetRecipientProfileRequest) returns (RecipientProfile);
}
// ============== Control plane (:50072) ==============
service ChannelControlService {
  rpc PutFallbackPolicy (PutFallbackPolicyRequest) returns (FallbackPolicy);
  rpc DeleteFallbackPolicy (DeleteFallbackPolicyRequest) returns (Empty);
  rpc PutOttAccount (PutOttAccountRequest) returns (OttAccount);
  rpc RotateOttAccount (RotateOttAccountRequest) returns (OttAccount);
  rpc SetCircuitState (SetCircuitStateRequest) returns (AdapterHealth);
  rpc PutInboundRoute (PutInboundRouteRequest) returns (TenantInboundRoute);
  rpc DeleteInboundRoute (DeleteInboundRouteRequest) returns (Empty);
  rpc PutAdapterStatusMap (PutAdapterStatusMapRequest) returns (AdapterStatusMap);
}
// ============== Shared messages (excerpt) ==============
message FallbackPolicy {
  string policy_id = 1;
  string tenant_id = 2;
  string use_case = 3;
  string strategy = 4;
  double cost_cap_per_message_ngn = 5;
  int32 session_ttl_seconds = 6;
  repeated LadderStep ladder = 7;
  repeated string stop_keywords_override = 8;
  int32 version = 9;
}
message LadderStep {
  Channel channel = 1;
  int32 deadline_seconds = 2;
  int32 retry_budget = 3;
  repeated string escalate_on = 4;
  double cost_cap_step_ngn = 5;
}
message OttAccount {
  string adapter_config_id = 1;
  string tenant_id = 2;
  string provider = 3;
  string phone_number_id_or_handle = 4;
  string secret_ref = 5;
  string circuit_state = 6;
  bool active = 7;
}
message AdapterHealth {
  string adapter = 1;
  string circuit_state = 2;
  int32 in_flight = 3;
  double error_rate_1m = 4;
  string last_webhook_at = 5;
}
message Empty {}
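For TypeScript callers, the FallbackPolicy and LadderStep messages above might be mirrored roughly as follows, using the lowerCamelCase field names of the proto3 JSON mapping. This is a hand-written sketch, not generated code. The `Channel` enum is not part of the excerpt, so it is modelled as a plain string, and every value in `examplePolicy` is illustrative. `worstCaseLadderSeconds` is a hypothetical helper that assumes steps run sequentially and that retries stay inside each step's deadline.

```typescript
interface LadderStep {
  channel: string; // stand-in for the Channel enum, which the excerpt omits
  deadlineSeconds: number;
  retryBudget: number;
  escalateOn: string[];
  costCapStepNgn: number;
}

interface FallbackPolicy {
  policyId: string;
  tenantId: string;
  useCase: string;
  strategy: string;
  costCapPerMessageNgn: number;
  sessionTtlSeconds: number;
  ladder: LadderStep[];
  stopKeywordsOverride: string[];
  version: number;
}

// Illustrative values only; none of them come from the contract.
const examplePolicy: FallbackPolicy = {
  policyId: "pol-example",
  tenantId: "tenant-example",
  useCase: "otp",
  strategy: "sequential",
  costCapPerMessageNgn: 10,
  sessionTtlSeconds: 600,
  ladder: [
    { channel: "WHATSAPP", deadlineSeconds: 30, retryBudget: 1, escalateOn: [], costCapStepNgn: 4 },
    { channel: "SMS", deadlineSeconds: 60, retryBudget: 2, escalateOn: [], costCapStepNgn: 6 },
  ],
  stopKeywordsOverride: [],
  version: 1,
};

// Rough upper bound on ladder duration: the sum of the step deadlines.
function worstCaseLadderSeconds(policy: FallbackPolicy): number {
  return policy.ladder.reduce((sum, step) => sum + step.deadlineSeconds, 0);
}
```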
4. Integration point in sms-orchestrator
// sms-orchestrator NATS consumer (omnichannel lane)
async handleOmniChannelDispatch(msg: JsMsg): Promise<void> {
  const req = parseDispatchRequest(msg.data);
  try {
    const ack = await channelRouter.routeWithFallback({
      notificationId: req.notificationId,
      recipientId: req.recipientId,
      tenantId: req.tenantId,
      useCase: req.useCase,
      msisdn: req.msisdn,
      body: req.body,
      segments: req.segments,
      encoding: req.encoding,
      senderId: req.senderId,
      idempotencyKey: req.idempotencyKey,
    }, { deadline: 500 }); // 500ms orchestrator-side deadline
    if (ack.excluded.length > 0) {
      await repo.annotate(req.notificationId, { excluded: ack.excluded });
    }
    msg.ack();
  } catch (err) {
    // gRPC UNAVAILABLE / DEADLINE_EXCEEDED / INTERNAL → do NOT ack
    // NATS will redeliver; after 5 attempts → notification.dispatch.deadletter.v1
    if (isTransient(err)) {
      msg.nak(5_000);
    } else {
      await repo.updateStatus(req.notificationId, 'BLOCKED_BAD_INPUT');
      msg.ack();
    }
  }
}
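The handler above relies on an `isTransient` helper that this document does not define. One possible shape, assuming the thrown error carries a numeric gRPC status in a `code` property (as errors from common Node gRPC clients do), is sketched below. The three codes match the comment in the handler; their numeric values are the standard gRPC status codes.

```typescript
// Standard gRPC status code numbers for the three retryable cases
// named in the handler comment.
const TRANSIENT_GRPC_CODES: ReadonlySet<number> = new Set([
  4, // DEADLINE_EXCEEDED
  13, // INTERNAL
  14, // UNAVAILABLE
]);

// Assumption: the error object exposes the gRPC status as a numeric `code`.
function isTransient(err: unknown): boolean {
  if (typeof err !== "object" || err === null) {
    return false;
  }
  const code = (err as { code?: unknown }).code;
  return typeof code === "number" && TRANSIENT_GRPC_CODES.has(code);
}
```

Anything without a recognised code is treated as non-transient here, which sends it down the `BLOCKED_BAD_INPUT` branch; a real implementation may prefer the opposite default for unknown errors.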
5. Multi-region conflict policy
Per ADR-0004 §5, the platform runs active-active across kbl and mzr. Channel-router state partitions by region with explicit conflict resolution:
| Aggregate | Replication | Conflict rule | Rationale |
|---|---|---|---|
| recipient_profiles | Multi-master via PG logical replication | LWW on updated_at; causally-related delta merges at the profile-update worker | Delivery observations may arrive from either region after a regional outage; LWW gives stable eventual convergence within the 30 d decay window |
| fallback_policies | Control-plane primary in kbl, replicated read-only to mzr | Server authoritative: writes only in kbl; read-your-own-write via sticky tenant-admin session | Policies change infrequently; simpler invariants |
| channel_adapter_configs | Same as fallback_policies | Server authoritative | Credential rotations must have a single source of truth |
| tenant_inbound_routes | Same | Server authoritative | Inbound number uniqueness is a global invariant |
| conversations | Region-pinned | Conversations pin to the region that opened them; cross-region MO arriving in the other region is routed to the owning region via internal NATS mirror | Session stickiness is critical for 2-way SMS context |
| fallback_executions | Region-pinned (write-once, read-anywhere) | Append-only | Execution row is complete within one region; audit read from either region |
| delivery_attempts | Region-pinned | Append-only | Same |
| audit | Mirror both regions | Append-only; hash-chain per-region; daily cross-region reconciliation | Regulatory evidence preserved against regional loss |
| delivery_outbox | Region-pinned | UNIQUE(notificationId, recipientId) enforced locally; cross-region dedup via NATS JetStream dedup window | Prevents double-outcome during regional partition |
Cross-region NATS mirrors.
CHANNEL_OUTCOMES and CHANNEL_BILLING are mirrored both ways (audit-grade). CHANNEL_CONVERSATIONS is mirrored to the other region only for cross-region MO routing lookups. CHANNEL_EVENTS is per-region; consumers in the other region subscribe as durable mirror consumers.
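The LWW rule for recipient_profiles in the conflict table can be illustrated with a small merge function. This is a sketch under stated assumptions: `RecipientProfileRow` and its `preferredChannel` field are hypothetical, and the tie-break on region name for equal timestamps is an assumption added so that both regions converge on the same row. The contract itself only specifies LWW on updated_at.

```typescript
interface RecipientProfileRow {
  recipientId: string;
  updatedAtMs: number; // mirrors updated_at, as epoch milliseconds
  originRegion: "kbl" | "mzr";
  preferredChannel: string; // hypothetical payload field
}

// Last-writer-wins: the row with the newer updated_at survives.
function mergeLww(a: RecipientProfileRow, b: RecipientProfileRow): RecipientProfileRow {
  if (a.updatedAtMs !== b.updatedAtMs) {
    return a.updatedAtMs > b.updatedAtMs ? a : b;
  }
  // Assumption: equal timestamps are broken deterministically by region name,
  // so merge(a, b) and merge(b, a) pick the same winner in both regions.
  return a.originRegion <= b.originRegion ? a : b;
}
```

A deterministic tie-break matters because each region applies the merge independently; without it, two rows with identical timestamps could leave the regions permanently divergent.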
6. Outbox + idempotency contract
- All state-changing writes go into chan.outbox or chan.delivery_outbox within the same PG transaction as the aggregate change.
- Relay publishes with Nats-Msg-Id equal to the event's deterministic dedup key:
  - For outcomes: outcome:{notificationId}:{recipientId}
  - For attempts: attempt:{attemptId}
  - For billing: billing:{notificationId}:{recipientId}:{stepIndex}
- JetStream dedup window 2 min (5 min for CHANNEL_BILLING) guarantees no duplicate delivery to consumers during relay retries.
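The dedup keys listed above are plain string templates, so they can be built with small pure functions. The sketch below also includes `DedupWindow`, a hypothetical in-memory model of a time-bounded dedup window. It does not use or reproduce the JetStream API; it only illustrates why a relay retry inside the window is dropped while a republish after the window would be accepted again.

```typescript
function outcomeMsgId(notificationId: string, recipientId: string): string {
  return `outcome:${notificationId}:${recipientId}`;
}

function attemptMsgId(attemptId: string): string {
  return `attempt:${attemptId}`;
}

function billingMsgId(notificationId: string, recipientId: string, stepIndex: number): string {
  return `billing:${notificationId}:${recipientId}:${stepIndex}`;
}

// Illustrative model only: remembers when each message id was first accepted
// and rejects repeats that arrive within windowMs of that time.
class DedupWindow {
  private readonly firstSeenMs = new Map<string, number>();

  constructor(private readonly windowMs: number) {}

  accept(msgId: string, nowMs: number): boolean {
    const seenAt = this.firstSeenMs.get(msgId);
    if (seenAt !== undefined && nowMs - seenAt < this.windowMs) {
      return false; // duplicate inside the window: dropped
    }
    this.firstSeenMs.set(msgId, nowMs);
    return true;
  }
}
```

Because the keys are deterministic, a relay that crashes after publishing but before marking the outbox row as sent will republish with the same id, and the window absorbs the repeat as long as the retry happens within it.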
7. Versioning rules
- gRPC services evolve via additive proto (buf breaking CI gate).
- REST endpoints evolve additively; breaking → /v2/.
- NATS events evolve additively within schemaVersion; breaking → .v2 subject with ≥ 90 d overlap.
- Adapter-status maps are data, loaded on startup and refreshed on chan.status_map.changed.v1; no deploy required.
8. Operational contract
- sms-orchestrator MUST set a client-side gRPC deadline of 500 ms on RouteWithFallback and MUST NOT ack NATS messages when the deadline is exceeded.
- Admin-dashboard MUST send an If-Match header on policy PUT so that a version mismatch is rejected instead of causing a lost update.
- OTT webhook ingress MUST verify the provider HMAC signature; unsigned payloads are dropped with no retry.
- Cross-region MO arriving in the non-owning region is forwarded via internal subject chan.mo.crossregion.forward.v1; expected lag ≤ 100 ms.
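The If-Match requirement on policy PUT amounts to an optimistic-concurrency check on the policy's version field. A minimal sketch of the server-side decision follows. It assumes the ETag is the quoted version number (for example `"7"`) and that a missing header is answered with 428 Precondition Required; neither detail is stated in this contract. `applyPolicyPut` and `etagFor` are hypothetical names.

```typescript
type PutResult =
  | { status: 200; version: number } // write accepted, version bumped
  | { status: 412 } // Precondition Failed: caller edited a stale version
  | { status: 428 }; // Precondition Required: If-Match header missing

// Assumption: the ETag is simply the policy version in double quotes.
function etagFor(version: number): string {
  return `"${version}"`;
}

function applyPolicyPut(storedVersion: number, ifMatch: string | undefined): PutResult {
  if (ifMatch === undefined) {
    return { status: 428 };
  }
  if (ifMatch !== etagFor(storedVersion)) {
    // Someone else updated the policy since the caller read it.
    return { status: 412 };
  }
  return { status: 200, version: storedVersion + 1 };
}
```

With two admins editing version 7 concurrently, the first PUT succeeds and moves the policy to version 8; the second still sends `"7"` and receives 412 rather than silently overwriting the first change.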