Channel Router Service — Failure Modes

Version: 1.0
Status: Draft
Owner: Messaging Core + SRE
Last Updated: 2026-04-21
Companion: SERVICE_OVERVIEW · OBSERVABILITY · SERVICE_RISK_REGISTER
Related ADR: ADR-0004 §3 + §6

This catalogue describes how channel-router-service fails, the user/tenant/regulator-visible effect, how the platform detects each failure, and the designed mitigation. The service is consulted synchronously by sms-orchestrator for omnichannel deliveries and synchronously by tenant webhooks via chan-mo-router. Faults propagate to OTP success rate, MO conversation continuity, and per-attempt billing accuracy.

1. Operating principle

The channel-router has a mixed posture:

Fail-closed on consent, compliance, sender-ID gating — refusing to dispatch is safer than violating a recipient's opt-out or a regulator's rule.
Fail-degraded on adapter availability — a single OTT provider being down should not refuse the message; the ladder skips that adapter and continues.
Fail-degraded on profile/ML — these affect ordering quality, not correctness.

2. Failure-mode summary

#	Name	Class	Detection time	User-visible effect	Runbook
FM-01	OTT provider API unavailable (WhatsApp / Telegram / Viber)	Dependency	< 30 s	Ladder skips that channel; recipient still served via fallback	`runbooks/channel/adapter-circuit-open.md`
FM-02	Postgres unavailable (write path)	Infra	< 30 s	gRPC `UNAVAILABLE`; orchestrator NATS redelivery; cache-served reads continue 5 min	`runbooks/channel/pg-out.md`
FM-03	Redis unavailable	Infra	< 10 s	PG-direct fallback; latency P95 30→80 ms; sessions degrade to PG reads	`runbooks/channel/redis-out.md`
FM-04	consent-ledger unreachable on hot path	Dependency	< 10 ms (deadline)	Cache-served until 60 s; then `REFUSED_CONSENT_UNKNOWN` (fail-closed)	`runbooks/channel/consent-out.md`
FM-05	compliance-engine unreachable on hot path	Dependency	< 15 ms (deadline)	Same fail-closed pattern as FM-04	`runbooks/channel/compliance-out.md`
FM-06	sender-id-registry unreachable	Dependency	< 10 ms	Cache 300 s; then fail-closed `REFUSED_SENDER_UNAUTHORIZED`	`runbooks/channel/sender-id-out.md`
FM-07	Profile-learning loop fails (Triton/ML)	Dependency	< 10 ms	Static preference fallback; no SLO impact on correctness	`runbooks/channel/ml-out.md`
FM-08	webhook-dispatcher unreachable	Dependency	< 1 s	MO ingest queues in NATS work-queue; dead-lettered once the webhook-dispatcher retry policy is exhausted	`runbooks/channel/webhook-dispatcher-out.md`
FM-09	Conversation-id race condition	Code	< 1 s	One MT may transiently see an ambiguous `turnCount`; Redis WATCH/MULTI prevents persisted divergence	`runbooks/channel/session-race.md`
FM-10	OTT provider rate-limit hit (429)	Adversarial / Volume	seconds	Adapter back-off; eventual TPS degradation; ladder progresses if back-off > deadline	`runbooks/channel/ott-ratelimit.md`
FM-11	Cost-cap breach during high-volume event	Code / Operator	minutes	Spike of `REFUSED_COST_CAP` outcomes; tenant alerted	`runbooks/channel/cost-cap-breach.md`
FM-12	Regional partition with active conversations	Infra	1 min	Cross-region MO forwarding; sessions remain owned by origin region	`runbooks/channel/region-split.md`
FM-13	Audit hash-chain break	Correctness / Security	24 h (daily verifier)	Audit defensibility for affected partition lost	`runbooks/channel/audit-chain-break.md`
FM-14	OTT provider webhook signature spoof / spike	Adversarial	5 min	401-rejected; alert; no false correlation	`runbooks/channel/webhook-signature-spike.md`
FM-15	OTT credential rotation propagation failure	Code / Vault	60 s	Adapter pods continue with stale creds; eventual 401 from provider; breaker opens	`runbooks/channel/ott-rotation-failed.md`

3. Detailed failure modes

FM-01 — OTT provider API unavailable

Scenario. WhatsApp Cloud API returns 5xx, times out, or DNS fails for graph.facebook.com. Same pattern for Telegram (api.telegram.org) or Viber (chatapi.viber.com).

Impact. The affected ladder step ends in fails_temp; the per-adapter circuit-breaker opens once the error rate reaches 50% over a 50-call window. While open, that channel is step_skipped with reason adapter_circuit_open, and the recipient is served by the next ladder step.

Detection. chan_provider_api_errors_total{provider} rate spike; chan_adapter_circuit_state{provider} flips to OPEN; alert ChannelAdapterUnavailable fires within 2 min.

Mitigation.

Per-adapter circuit-breaker (window 50 calls, error-rate 50%, open 60 s, half-open probe).
Ladder progression bypasses the broken provider with audit-trailed reason.
Fraud-intel fraud.detected.channel_abuse.v1 may force-open the breaker for 15 min on suspected provider abuse.
Manual breaker control via POST /v1/channel/adapters/{adapter}/circuit.

Recovery. Half-open probe (1 dispatch / 30 s) closes breaker on success; chan_adapter_circuit_state returns to CLOSED.
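
The breaker parameters above (50-call window, 50% error rate, 60 s open period, half-open probe) can be illustrated with a minimal in-process sketch. The class and method names are hypothetical and not taken from the service, and the sketch does not model the 1 dispatch / 30 s probe pacing; it only shows the state transitions.

```python
from collections import deque


class AdapterBreaker:
    """Count-based sliding-window breaker: CLOSED -> OPEN -> HALF_OPEN -> CLOSED."""

    def __init__(self, window=50, error_rate=0.5, open_seconds=60.0):
        self.window = window
        self.error_rate = error_rate
        self.open_seconds = open_seconds
        self.results = deque(maxlen=window)  # True marks a failed call
        self.state = "CLOSED"
        self.opened_at = 0.0

    def allow(self, now):
        """Return True if a dispatch may be attempted at time `now` (seconds)."""
        if self.state == "OPEN" and now - self.opened_at >= self.open_seconds:
            self.state = "HALF_OPEN"  # open period elapsed: permit probing
        return self.state != "OPEN"

    def record(self, failed, now):
        """Record the outcome of a dispatch that `allow` permitted."""
        if self.state == "HALF_OPEN":
            if failed:
                self._open(now)  # probe failed: reopen for another period
            else:
                self.state = "CLOSED"
                self.results.clear()
            return
        self.results.append(failed)
        window_full = len(self.results) == self.window
        if window_full and sum(self.results) / self.window >= self.error_rate:
            self._open(now)

    def _open(self, now):
        self.state = "OPEN"
        self.opened_at = now
        self.results.clear()
```

A ladder step would call `allow` before dispatching and treat a `False` result as step_skipped with reason adapter_circuit_open.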

FM-02 — Postgres unavailable (write path)

Scenario. Primary PG (or sync replica during failover) unreachable.

Impact. Writes to fallback_executions, delivery_attempts, and outbox fail. Cache-served hot-path reads continue until their TTL expires (60–300 s). gRPC returns UNAVAILABLE; orchestrator NATS redelivers.

Detection. chan_pg_query_duration_seconds P95 spike; pool-wait counter rises; /health/ready returns 503; alert within 30 s.

Mitigation.

Patroni HA (1 primary + 2 sync standbys); auto-failover ≤ 30 s.
Cross-region failover (manual-gated) ≤ 15 min.
PgBouncer transaction-mode pooling absorbs short blips.

Recovery. Pods auto-recover. NATS backlog drains.

FM-04 — consent-ledger unreachable on hot path

Scenario. consent-ledger pods down; gRPC UNAVAILABLE or deadline exceeded > 10 ms.

Impact. First level: cached gating result still serves (60 s TTL). Beyond cache TTL: hard fail-closed → REFUSED_CONSENT_UNKNOWN; outcome event emitted; orchestrator marks notification BLOCKED. No silent dispatch.

Detection. chan_gate_fail_closed_total{gate="consent"} spike; alert ChannelConsentViolationAttempt (severity critical, 50/5m threshold).

Mitigation.

consent-ledger HA (3+ replicas per region).
60 s decision cache absorbs short outages.
Cross-region failover for consent-ledger.
RouteWithFallback returns FAILED_PRECONDITION + REFUSED_CONSENT_UNKNOWN with traceId so SRE can correlate.

Recovery. Once consent-ledger is back, cache misses succeed; refusal rate returns to baseline.
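
The two-level behaviour described for this failure mode (serve a cached decision for up to 60 s, then refuse) might look roughly like the sketch below. `ConsentGate`, `LedgerUnavailable`, the `lookup` callable, and the `ALLOW` / `REFUSED_CONSENT` outcome strings are assumptions made for illustration; only `REFUSED_CONSENT_UNKNOWN` and the 60 s TTL come from the text above.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple

CACHE_TTL_S = 60.0  # decision cache TTL stated for FM-04


class LedgerUnavailable(Exception):
    """Stand-in for gRPC UNAVAILABLE or a blown deadline."""


@dataclass
class ConsentGate:
    # Hypothetical client call: recipient -> consent granted?
    lookup: Callable[[str], bool]
    cache: Dict[str, Tuple[bool, float]] = field(default_factory=dict)

    def decide(self, recipient: str, now: float) -> str:
        hit = self.cache.get(recipient)
        if hit is not None and now - hit[1] <= CACHE_TTL_S:
            return "ALLOW" if hit[0] else "REFUSED_CONSENT"
        try:
            allowed = self.lookup(recipient)
        except LedgerUnavailable:
            # Fail-closed: never dispatch on an unknown consent state.
            return "REFUSED_CONSENT_UNKNOWN"
        self.cache[recipient] = (allowed, now)
        return "ALLOW" if allowed else "REFUSED_CONSENT"
```

The important property is the `except` branch: an expired or missing cache entry combined with an unreachable ledger never falls through to a dispatch.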

FM-08 — webhook-dispatcher unreachable

Scenario. webhook-dispatcher down or saturated.

Impact. Inbound MOs queue in NATS work-queue stream; tenant webhooks not delivered; recipient context lost from tenant's perspective.

Detection. chan_mo_routing_duration_seconds P95 spike; webhook-dispatcher upstream metric exposed; alert ChannelMoLatencyHigh.

Mitigation.

webhook-dispatcher HA (5+ replicas) + retry policy (1s, 5s, 30s, 5m, 1h).
After 5 attempts: mo.webhook.deadletter.v1; tenant notified via notification-service.
Tenant may re-deliver from dead-letter via admin API.
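
As a rough illustration of the retry schedule above, the sketch below maps the number of retries already made to the next action. It assumes that "5 attempts" means five retries, each preceded by the corresponding delay; the function name and return shape are hypothetical.

```python
RETRY_DELAYS_S = [1, 5, 30, 300, 3600]  # 1s, 5s, 30s, 5m, 1h


def next_action(retries_done):
    """Return ("retry", delay_seconds) or ("deadletter", subject)."""
    if retries_done >= len(RETRY_DELAYS_S):
        # Schedule exhausted: hand over to the dead-letter subject.
        return ("deadletter", "mo.webhook.deadletter.v1")
    return ("retry", RETRY_DELAYS_S[retries_done])


# Under this assumption, the longest an MO can wait before dead-lettering
# is the sum of all delays, ignoring delivery time itself.
WORST_CASE_WAIT_S = sum(RETRY_DELAYS_S)
```

With these delays the worst-case wait is 3936 s, a little under 66 minutes, which bounds how long a tenant may go without the MO before the dead-letter notification.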

FM-09 — Conversation-id race condition

Scenario. Two concurrent MT dispatches for the same (senderId, msisdn, tenantId) racing to create / refresh a session.

Impact. Without guards, turnCount could be incremented twice and conversationId could diverge (one in Redis, one in PG outbox).

Detection. chan_session_race_detected_total (incremented when WATCH/MULTI rejects).

Mitigation.

Redis WATCH on chan:session:{senderId}:{msisdnHash} with MULTI/EXEC guard.
PG INSERT ... ON CONFLICT for the conversation row uses unique (tenantId, senderId, msisdnHash, openedAt).
A daily reconciliation job detects Redis-vs-PG divergence and resolves it by closing PG ghosts with reason=redis_loss.
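
The optimistic guard that WATCH with MULTI/EXEC provides can be illustrated without a Redis server. The sketch below is an in-memory stand-in, not the service's code: `VersionedStore` and `bump_turn` are hypothetical names, and the version counter plays the role that WATCH plays in Redis (the commit is rejected if the key changed after it was read).

```python
class VersionedStore:
    """In-memory stand-in for a Redis key guarded by WATCH/MULTI/EXEC."""

    def __init__(self):
        self.data = {}      # key -> session dict
        self.versions = {}  # key -> int, bumped on every successful write

    def read(self, key):
        """Like WATCH + GET: return the value and the version observed."""
        return self.data.get(key), self.versions.get(key, 0)

    def commit(self, key, value, seen_version):
        """Like EXEC: apply the write only if nobody wrote since `read`."""
        if self.versions.get(key, 0) != seen_version:
            return False
        self.data[key] = value
        self.versions[key] = seen_version + 1
        return True


def bump_turn(store, key, races):
    """Increment turnCount, retrying whenever a concurrent writer wins."""
    while True:
        session, version = store.read(key)
        session = dict(session or {"turnCount": 0})
        session["turnCount"] += 1
        if store.commit(key, session, version):
            return session["turnCount"]
        races.append(key)  # stands in for chan_session_race_detected_total
```

Because a losing writer re-reads before retrying, two racing dispatches produce turnCount values 1 and 2 rather than both writing 1.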

FM-10 — OTT provider rate-limit hit

Scenario. Meta returns 429 on WhatsApp Cloud (Tier-1: 80 msg/s per phone-number-id). Telegram throttles (30 msg/s per bot, 1 msg/s per chat). Viber 429.

Impact. Adapter back-off (exponential with jitter); if back-off > deadline → step_skipped and ladder progresses. Tenant sees minor success-rate dip on that adapter for the burst window.

Detection. chan_provider_api_errors_total{http_status="429"} spike; chan_adapter_tps_bucket_used ≥ 95%.

Mitigation.

Per-provider Redis token bucket sized to provider documented limits (see DOMAIN_MODEL §5).
Per-chat secondary bucket for Telegram (1 msg/s).
Negotiate Tier-2/Tier-3 with Meta/Viber for high-volume tenants.
Spread load via PARALLEL strategy across multiple OTT providers for emergency alerts.
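
The per-provider bucket and the secondary per-chat bucket for Telegram can be sketched as follows. This is a single-process illustration under stated assumptions: the service keeps its buckets in Redis, the names `TokenBucket` and `allow_telegram` are hypothetical, and the burst capacity is assumed equal to the per-second rate.

```python
class TokenBucket:
    """Refills continuously at `rate_per_s`, holding at most `capacity` tokens."""

    def __init__(self, rate_per_s, capacity):
        self.rate = rate_per_s
        self.capacity = capacity
        self.tokens = float(capacity)
        self.updated = 0.0

    def try_take(self, now):
        """Consume one token if available at time `now` (seconds)."""
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def allow_telegram(bot_bucket, chat_buckets, chat_id, now):
    """Require both the per-chat (1 msg/s) and the per-bot bucket to admit."""
    chat = chat_buckets.setdefault(chat_id, TokenBucket(rate_per_s=1, capacity=1))
    if not chat.try_take(now):
        return False
    if not bot_bucket.try_take(now):
        chat.tokens += 1.0  # refund the per-chat token that was just taken
        return False
    return True
```

Checking the per-chat bucket first and refunding it on a per-bot refusal keeps a throttled bot from silently draining individual chats.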

FM-11 — Cost-cap breach during high-volume event

Scenario. A tenant's costCapPerMessage is set conservatively but a fallback cascade (especially Voice OTP) reaches it.

Impact. REFUSED_COST_CAP outcomes; affected recipients not notified.

Detection. chan_cost_cap_breach_total spike; alert ChannelCostCapBreach at 10/15m.

Mitigation.

UC-09 enforces cap pre-step-transition (no surprise charges).
Tenant policy validation rejects costCapPerMessage < cheapest path at write-time.
Tenant alert via portal; suggest policy revision.
Operator can temporarily raise cap via PUT policy with audit.
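
A minimal sketch of the pre-step-transition check and the write-time validation follows. It assumes per-attempt billing (a failed attempt still costs), integer cost units, and that "cheapest path" means the cheapest single ladder step; the function names and the `DELIVERED` / `EXHAUSTED` outcomes are hypothetical.

```python
def run_ladder(steps, cost_cap):
    """Walk the ladder, refusing any step that would push spend over the cap.

    steps: list of (channel, cost, succeeds) tuples, cheapest-first or not.
    Returns (outcome, total_spent).
    """
    spent = 0
    for channel, cost, succeeds in steps:
        if spent + cost > cost_cap:
            # Checked before the step runs, so no charge exceeds the cap.
            return ("REFUSED_COST_CAP", spent)
        spent += cost
        if succeeds:
            return ("DELIVERED:" + channel, spent)
    return ("EXHAUSTED", spent)


def validate_cap(step_costs, cost_cap):
    """Write-time check: the cap must cover at least the cheapest step."""
    return cost_cap >= min(step_costs)
```

For example, with steps costing 5, 10, and 40 units and a cap of 50, two failed attempts spend 15 and the Voice step (which would bring the total to 55) is refused before it is dispatched.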

FM-12 — Regional partition with active conversations

Scenario. Network partition between kbl and mzr; conversations opened in kbl cannot accept MOs that arrive in mzr (because sessions are region-pinned).

Impact. Cross-region MO forwarding queues until the partition heals, then delivers; cross-region session lag is on the order of minutes.

Detection. chan_jetstream_mirror_lag_seconds{stream="CHANNEL_CONVERSATIONS"} spike; alert ChannelJetStreamMirrorLag.

Mitigation.

Cross-region NATS subject chan.mo.crossregion.forward.v1 automatically forwards mismatched MOs to the owning region.
On extended partition (> 5 min), DR procedure re-points DNS to single region; conversations continue locally.
After heal, both regions reconcile; no data loss (events durable in JetStream mirror).

FM-13 — Audit hash-chain break

Scenario. Daily verifier finds record_hash ≠ sha256(payload || prev_hash) for some row in chan.audit.

Impact. Audit defensibility for affected partition lost; regulator confidence reduced.

Detection. chan_audit_chain_break_total > 0; alert ChannelAuditChainBreak (Critical).

Mitigation.

Investigate root cause: schema tampering, code bug, replica divergence.
Quarantine affected partition; subsequent rows continue from a new chain origin.
If tamper: CISO + Legal escalation.
If code bug: bug-fix + chain recompute; regulator notified if previously-submitted audit was affected.
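
The verifier's check, record_hash = sha256(payload || prev_hash), can be sketched as below. The all-zero chain origin, the raw-byte concatenation, and the function names are assumptions for illustration; the real encoding of rows in chan.audit is not specified here.

```python
import hashlib

CHAIN_ORIGIN = b"\x00" * 32  # assumed origin value for the first row


def record_hash(payload: bytes, prev_hash: bytes) -> bytes:
    """sha256(payload || prev_hash), with || as byte concatenation."""
    return hashlib.sha256(payload + prev_hash).digest()


def first_break(rows, origin=CHAIN_ORIGIN):
    """Return the index of the first row whose stored hash does not verify.

    rows: iterable of (payload, stored_hash) in chain order.
    Returns None when the whole chain verifies.
    """
    prev = origin
    for index, (payload, stored) in enumerate(rows):
        if record_hash(payload, prev) != stored:
            return index
        prev = stored
    return None
```

Reporting the first failing index gives the quarantine step a precise boundary: rows before it still verify, and a new chain origin can start after it.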

FM-14 — Webhook signature spoof / spike

Scenario. Attacker forges WhatsApp webhook payloads to inject false delivered statuses (motive: fraudulent billing or deliverability fraud).

Impact. Without signature validation, false correlation could bias billing or success metrics. With validation: 401-rejected; no impact, but spike indicates probing.

Detection. chan_webhook_signature_invalid_total{provider} rate spike; alert ChannelWebhookSignatureInvalidSpike (10/5m).

Mitigation.

HMAC verification before parse, constant-time compare.
IP allow-list for provider source IPs (Meta, Telegram, Viber publish IP ranges).
Source-IP-based rate-limit before signature verification (defence in depth).
Forensics on User-Agent + source ASN.
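
Verification before parsing, with a constant-time compare, can be sketched with Python's standard `hmac` module. The sketch assumes the provider sends a hex HMAC-SHA256 of the raw request body in a header value of the form `sha256=<hex>`; header names and formats differ per provider, and `verify_signature` is a hypothetical helper name.

```python
import hashlib
import hmac


def verify_signature(secret: bytes, raw_body: bytes, header_value: str) -> bool:
    """Check the signature over the raw bytes, before any JSON parsing."""
    digest = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
    expected = "sha256=" + digest
    # compare_digest runs in time independent of where the inputs differ.
    return hmac.compare_digest(expected.encode(), header_value.encode())
```

A handler would return 401 and increment chan_webhook_signature_invalid_total when this returns `False`, without ever parsing the body.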

FM-15 — OTT credential rotation propagation failure

Scenario. chan.ott_account.rotated.v1 event lost or adapter pod fails to reload from Vault.

Impact. Adapter pod uses stale credential; provider returns 401; circuit opens.

Detection. chan_provider_api_errors_total{http_status="401"} spike on a provider after a rotation event.

Mitigation.

Adapter pods poll Vault every 60 s as a safety net (in addition to event-driven reload).
Rotation procedure publishes the event AND posts to a sentinel Redis key; adapter pods use the more-recent of (event, sentinel).
The platform accepts the old credential in parallel for a 24 h grace period, but the provider does not. The operator must verify adapter health within 60 s of rotation.

4. Cross-FM postures

Concern	Posture
Recipient privacy (consent)	Fail-closed
Regulatory rules (compliance)	Fail-closed
Sender-ID authorisation	Fail-closed
Adapter / channel availability	Fail-degraded (skip + audit)
Session continuity	Best-effort; cross-region forwarding
Audit integrity	Fail-loud (alert; partition quarantine)
Cost containment	Fail-fast (refuse with audit)

5. Per-FM metrics binding

Every alert in OBSERVABILITY §5 is bound to one or more failure modes here. SLO error-budget consumption is attributed by outcome label; FM-driven incidents are tagged in PagerDuty for post-incident review.

6. Drill schedule

Drill	Cadence
FM-01 (OTT-down)	Monthly per provider
FM-02 (PG primary kill)	Quarterly
FM-04 (consent-ledger out)	Quarterly
FM-12 (region partition)	Quarterly
FM-13 (audit chain break — synthetic)	Monthly
Full DR (region failover)	Quarterly

All drills tracked in docs/drills/channel-router/.
