Channel Router Service — Failure Modes
Version: 1.0 Status: Draft Owner: Messaging Core + SRE Last Updated: 2026-04-21 Companion: SERVICE_OVERVIEW · OBSERVABILITY · SERVICE_RISK_REGISTER Related ADR: ADR-0004 §3 + §6
This catalogue describes how channel-router-service fails, the user/tenant/regulator-visible effect, how the platform detects each failure, and the designed mitigation. The service is consulted synchronously by sms-orchestrator for omnichannel deliveries and synchronously by tenant webhooks via chan-mo-router. Faults propagate to OTP success rate, MO conversation continuity, and per-attempt billing accuracy.
1. Operating principle
The channel-router has a mixed posture:
- Fail-closed on consent, compliance, sender-ID gating — refusing to dispatch is safer than violating a recipient's opt-out or a regulator's rule.
- Fail-degraded on adapter availability — a single OTT provider being down should not refuse the message; the ladder skips that adapter and continues.
- Fail-degraded on profile/ML — these affect ordering quality, not correctness.
2. Failure-mode summary
| # | Name | Class | Detection time | User-visible effect | Runbook |
|---|---|---|---|---|---|
| FM-01 | OTT provider API unavailable (WhatsApp / Telegram / Viber) | Dependency | < 30 s | Ladder skips that channel; recipient still served via fallback | runbooks/channel/adapter-circuit-open.md |
| FM-02 | Postgres unavailable (write path) | Infra | < 30 s | gRPC UNAVAILABLE; orchestrator NATS redelivery; cache-served reads continue 5 min | runbooks/channel/pg-out.md |
| FM-03 | Redis unavailable | Infra | < 10 s | PG-direct fallback; latency P95 30→80 ms; sessions degrade to PG reads | runbooks/channel/redis-out.md |
| FM-04 | consent-ledger unreachable on hot path | Dependency | < 10 ms (deadline) | Cache-served until 60 s; then REFUSED_CONSENT_UNKNOWN (fail-closed) | runbooks/channel/consent-out.md |
| FM-05 | compliance-engine unreachable on hot path | Dependency | < 15 ms (deadline) | Same fail-closed pattern as FM-04 | runbooks/channel/compliance-out.md |
| FM-06 | sender-id-registry unreachable | Dependency | < 10 ms | Cache 300 s; then fail-closed REFUSED_SENDER_UNAUTHORIZED | runbooks/channel/sender-id-out.md |
| FM-07 | Profile-learning loop fails (Triton/ML) | Dependency | < 10 ms | Static preference fallback; no SLO impact on correctness | runbooks/channel/ml-out.md |
| FM-08 | webhook-dispatcher unreachable | Dependency | < 1 s | MO ingest queues in NATS work-queue; eventually dead-lettered after webhook-dispatcher policy | runbooks/channel/webhook-dispatcher-out.md |
| FM-09 | Conversation-id race condition | Code | < 1 s | One MT may transiently see turnCount ambiguous; Redis WATCH/MULTI prevents persisted divergence | runbooks/channel/session-race.md |
| FM-10 | OTT provider rate-limit hit (429) | Adversarial / Volume | seconds | Adapter back-off; eventual TPS degradation; ladder progresses if back-off > deadline | runbooks/channel/ott-ratelimit.md |
| FM-11 | Cost-cap breach during high-volume event | Code / Operator | minutes | Spike of REFUSED_COST_CAP outcomes; tenant alerted | runbooks/channel/cost-cap-breach.md |
| FM-12 | Regional partition with active conversations | Infra | 1 min | Cross-region MO forwarding; sessions remain owned by origin region | runbooks/channel/region-split.md |
| FM-13 | Audit hash-chain break | Correctness / Security | 24 h (daily verifier) | Audit defensibility for affected partition lost | runbooks/channel/audit-chain-break.md |
| FM-14 | OTT provider webhook signature spoof / spike | Adversarial | 5 m | 401-rejected; alert; no false correlation | runbooks/channel/webhook-signature-spike.md |
| FM-15 | OTT credential rotation propagation failure | Code / Vault | 60 s | Adapter pods continue with stale creds; eventual 401 from provider; breaker opens | runbooks/channel/ott-rotation-failed.md |
3. Detailed failure modes
FM-01 — OTT provider API unavailable
Scenario. WhatsApp Cloud API returns 5xx, times out, or DNS fails for graph.facebook.com. Same pattern for Telegram (api.telegram.org) or Viber (chatapi.viber.com).
Impact. The affected ladder step ends in fails_temp; the per-adapter circuit-breaker opens once the error rate reaches 50% over a 50-call window. While the breaker is open, that channel is step_skipped with reason adapter_circuit_open, and the recipient is served by the next ladder step.
Detection. chan_provider_api_errors_total{provider} rate spike; chan_adapter_circuit_state{provider} flips to OPEN; alert ChannelAdapterUnavailable fires within 2 m.
Mitigation.
- Per-adapter circuit-breaker (window 50 calls, error-rate 50%, open 60 s, half-open probe).
- Ladder progression bypasses the broken provider with audit-trailed reason.
- Fraud-intel fraud.detected.channel_abuse.v1 may force-open the breaker for 15 m on suspected provider abuse.
- Manual breaker control via POST /v1/channel/adapters/{adapter}/circuit.
Recovery. Half-open probe (1 dispatch / 30 s) closes breaker on success; chan_adapter_circuit_state returns to CLOSED.
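The breaker behaviour described for FM-01 can be sketched roughly as follows. This is a minimal illustration, not the service's implementation: the class name AdapterBreaker, its method names, and the explicit `now` argument are assumptions for the example, and the 1 dispatch / 30 s probe pacing is simplified to a single in-flight probe.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class AdapterBreaker:
    """Hypothetical per-adapter breaker using the thresholds quoted in FM-01."""
    window: int = 50             # sliding window size, in calls
    error_rate: float = 0.5      # open when failures / window reaches this
    open_seconds: float = 60.0   # time spent OPEN before a probe is allowed
    state: str = "CLOSED"
    opened_at: float = 0.0
    probing: bool = False
    results: deque = field(default_factory=deque)

    def allow(self, now: float) -> bool:
        """Return True if a dispatch may be attempted on this adapter."""
        if self.state == "OPEN" and now - self.opened_at >= self.open_seconds:
            self.state, self.probing = "HALF_OPEN", False
        if self.state == "HALF_OPEN":
            if self.probing:
                return False     # one probe at a time
            self.probing = True
            return True
        return self.state == "CLOSED"

    def record(self, ok: bool, now: float) -> None:
        """Record the outcome of a dispatch that allow() let through."""
        if self.state == "HALF_OPEN":
            self.probing = False
            if ok:
                self.state = "CLOSED"
                self.results.clear()
            else:
                self.state, self.opened_at = "OPEN", now
            return
        self.results.append(ok)
        if len(self.results) > self.window:
            self.results.popleft()
        if len(self.results) == self.window:
            failures = self.results.count(False)
            if failures / self.window >= self.error_rate:
                self.state, self.opened_at = "OPEN", now
```

A caller would check allow() before each dispatch, treat a False result as step_skipped with reason adapter_circuit_open, and pass the dispatch outcome to record().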
FM-02 — Postgres unavailable (write path)
Scenario. Primary PG (or sync replica during failover) unreachable.
Impact. Writes to fallback_executions, delivery_attempts, outbox fail. Hot-path cache-served reads continue up to TTL (60–300 s). gRPC returns UNAVAILABLE; orchestrator NATS redelivers.
Detection. chan_pg_query_duration_seconds P95 spike; pool-wait counter rises; /health/ready returns 503; alert within 30 s.
Mitigation.
- Patroni HA (1 primary + 2 sync standbys); auto-failover ≤ 30 s.
- Cross-region failover (manual-gated) ≤ 15 min.
- PgBouncer transaction-mode pooling absorbs short blips.
Recovery. Pods auto-recover. NATS backlog drains.
FM-04 — consent-ledger unreachable on hot path
Scenario. consent-ledger pods down; gRPC UNAVAILABLE or deadline exceeded > 10 ms.
Impact. First level: cached gating result still serves (60 s TTL). Beyond cache TTL: hard fail-closed → REFUSED_CONSENT_UNKNOWN; outcome event emitted; orchestrator marks notification BLOCKED. No silent dispatch.
Detection. chan_gate_fail_closed_total{gate="consent"} spike; alert ChannelConsentViolationAttempt (severity critical, 50/5m threshold).
Mitigation.
- consent-ledger HA (3+ replicas per region).
- 60 s decision cache absorbs short outages.
- Cross-region failover for consent-ledger.
- RouteWithFallback returns FAILED_PRECONDITION + REFUSED_CONSENT_UNKNOWN with traceId so SRE can correlate.
Recovery. Once consent-ledger is back, cache misses succeed; refusal rate returns to baseline.
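The two-level behaviour in FM-04 (serve from cache while fresh, refuse once the cache is stale and the ledger is unreachable) might look like the sketch below. The ConsentGate class, the `lookup` callable, and the ALLOWED / REFUSED_OPT_OUT labels are hypothetical; only REFUSED_CONSENT_UNKNOWN and the 60 s TTL come from this document.

```python
from typing import Callable, Dict, Tuple

CACHE_TTL_S = 60.0  # decision cache TTL quoted in FM-04


class ConsentGate:
    """Hypothetical fail-closed consent gate with a short decision cache."""

    def __init__(self, lookup: Callable[[str], bool]):
        # lookup stands in for the consent-ledger call; it is assumed to
        # raise an exception when the ledger is unreachable or times out.
        self._lookup = lookup
        self._cache: Dict[str, Tuple[bool, float]] = {}

    def decide(self, recipient: str, now: float) -> str:
        cached = self._cache.get(recipient)
        if cached is not None and now - cached[1] <= CACHE_TTL_S:
            allowed = cached[0]              # fresh cache entry: serve it
        else:
            try:
                allowed = self._lookup(recipient)
            except Exception:
                # Cache is missing or stale and the ledger cannot answer:
                # refuse rather than dispatch without a consent decision.
                return "REFUSED_CONSENT_UNKNOWN"
            self._cache[recipient] = (allowed, now)
        return "ALLOWED" if allowed else "REFUSED_OPT_OUT"
```

The point of the structure is that an outage can never produce an ALLOWED result on its own: the only paths to ALLOWED are a fresh cached decision or a successful ledger response.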
FM-08 — webhook-dispatcher unreachable
Scenario. webhook-dispatcher down or saturated.
Impact. Inbound MOs queue in NATS work-queue stream; tenant webhooks not delivered; recipient context lost from tenant's perspective.
Detection. chan_mo_routing_duration_seconds P95 spike; webhook-dispatcher upstream metric exposed; alert ChannelMoLatencyHigh.
Mitigation.
- webhook-dispatcher HA (5+ replicas) + retry policy (1s, 5s, 30s, 5m, 1h).
- After 5 attempts: mo.webhook.deadletter.v1; tenant notified via notification-service.
- Tenant may re-deliver from dead-letter via admin API.
FM-09 — Conversation-id race condition
Scenario. Two concurrent MT dispatches for the same (senderId, msisdn, tenantId) racing to create / refresh a session.
Impact. Without guards, turnCount could be incremented twice and conversationId could diverge (one in Redis, one in PG outbox).
Detection. chan_session_race_detected_total (incremented when WATCH/MULTI rejects).
Mitigation.
- Redis WATCH on chan:session:{senderId}:{msisdnHash} with MULTI/EXEC guard.
- PG INSERT ... ON CONFLICT for the conversation row uses unique (tenantId, senderId, msisdnHash, openedAt).
- Reconciliation job daily detects Redis-vs-PG divergence and resolves by closing PG ghosts with reason=redis_loss.
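The WATCH/MULTI/EXEC guard is a form of optimistic check-and-set: a writer reads the session, prepares its update, and commits only if nobody else changed the key in between. The sketch below models that semantics with an in-memory stand-in rather than a Redis client, so VersionedStore, bump_turn, and the retry count are all illustrative assumptions.

```python
class VersionedStore:
    """In-memory stand-in for a key guarded by WATCH-style optimistic locking."""

    def __init__(self):
        self._data = {}  # key -> (version, value)

    def read(self, key):
        """Return (version, value); analogous to WATCH followed by GET."""
        return self._data.get(key, (0, None))

    def commit(self, key, seen_version, value) -> bool:
        """Write only if the key is unchanged since read().

        A False result is analogous to EXEC returning a null reply after a
        watched key was modified by another client.
        """
        current_version, _ = self._data.get(key, (0, None))
        if current_version != seen_version:
            return False
        self._data[key] = (current_version + 1, value)
        return True


def bump_turn(store, key, race_counter, max_retries=3):
    """Increment turnCount exactly once, retrying on a detected race."""
    for _ in range(max_retries):
        version, session = store.read(key)
        updated = dict(session or {"turnCount": 0})
        updated["turnCount"] += 1
        if store.commit(key, version, updated):
            return updated["turnCount"]
        # Mirrors incrementing chan_session_race_detected_total.
        race_counter["detected"] += 1
    raise RuntimeError("session update kept conflicting")
```

Because a losing writer re-reads before retrying, two concurrent dispatches end up with consecutive turn counts instead of both writing the same value.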
FM-10 — OTT provider rate-limit hit
Scenario. Meta returns 429 on WhatsApp Cloud (Tier-1: 80 msg/s per phone-number-id). Telegram throttles (30 msg/s per bot, 1 msg/s per chat). Viber 429.
Impact. Adapter back-off (exponential with jitter); if back-off > deadline → step_skipped and ladder progresses. Tenant sees minor success-rate dip on that adapter for the burst window.
Detection. chan_provider_api_errors_total{http_status="429"} spike; chan_adapter_tps_bucket_used ≥ 95%.
Mitigation.
- Per-provider Redis token bucket sized to provider documented limits (see DOMAIN_MODEL §5).
- Per-chat secondary bucket for Telegram (1 msg/s).
- Negotiate Tier-2/Tier-3 with Meta/Viber for high-volume tenants.
- Spread load via PARALLEL strategy across multiple OTT providers for emergency alerts.
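The per-provider and per-chat buckets can be illustrated with a plain token bucket. This is a single-process sketch under assumptions: the document describes Redis-backed buckets, whereas the TokenBucket class and allow_telegram_send helper here are hypothetical and keep state in memory; the rates are the Telegram figures quoted above.

```python
class TokenBucket:
    """Simple token bucket; time is passed in explicitly for determinism."""

    def __init__(self, rate_per_s: float, capacity: float):
        self.rate = rate_per_s
        self.capacity = capacity
        self.tokens = capacity
        self.updated = 0.0

    def has_token(self, now: float) -> bool:
        """Refill for elapsed time, then report whether one send is allowed."""
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        return self.tokens >= 1.0

    def take(self) -> None:
        self.tokens -= 1.0


def allow_telegram_send(bot: TokenBucket, chats: dict, chat_id: str, now: float) -> bool:
    """Allow a send only if both the per-bot and the per-chat bucket permit it."""
    chat = chats.setdefault(chat_id, TokenBucket(rate_per_s=1.0, capacity=1.0))
    # Check both buckets before consuming, so a refused chat does not
    # burn a token from the shared per-bot bucket.
    if bot.has_token(now) and chat.has_token(now):
        bot.take()
        chat.take()
        return True
    return False
```

A refusal here corresponds to the adapter backing off locally instead of provoking a 429 from the provider.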
FM-11 — Cost-cap breach during high-volume event
Scenario. A tenant's costCapPerMessage is set conservatively but a fallback cascade (especially Voice OTP) reaches it.
Impact. REFUSED_COST_CAP outcomes; affected recipients not notified.
Detection. chan_cost_cap_breach_total spike; alert ChannelCostCapBreach at 10/15m.
Mitigation.
- UC-09 enforces cap pre-step-transition (no surprise charges).
- Tenant policy validation rejects costCapPerMessage < cheapest path at write-time.
- Tenant alert via portal; suggest policy revision.
- Operator can temporarily raise cap via PUT policy with audit.
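A rough sketch of the two cost-cap checks follows: write-time validation against the cheapest path, and the pre-step check that refuses before a charge is incurred. It assumes each attempted step is billed and that "cheapest path" means the cheapest single ladder step; the function names, step costs, and the DELIVERED / EXHAUSTED labels are hypothetical.

```python
from decimal import Decimal
from typing import Callable, List, Tuple

Step = Tuple[str, Decimal]  # (channel name, cost per attempt)


def validate_policy(cost_cap: Decimal, steps: List[Step]) -> None:
    """Reject a policy whose cap cannot pay for even the cheapest step."""
    cheapest = min(cost for _, cost in steps)
    if cost_cap < cheapest:
        raise ValueError("costCapPerMessage is below the cheapest path")


def run_ladder(steps: List[Step], cost_cap: Decimal,
               attempt: Callable[[str], bool]) -> Tuple[str, Decimal]:
    """Walk the ladder, checking the cap before each step transition."""
    spent = Decimal("0")
    for channel, cost in steps:
        if spent + cost > cost_cap:
            # Refuse before dispatching, so no charge exceeds the cap.
            return "REFUSED_COST_CAP", spent
        spent += cost
        if attempt(channel):
            return "DELIVERED", spent
    return "EXHAUSTED", spent
```

With illustrative costs of 0.01, 0.03 and 0.10 and a cap of 0.05, two failed steps leave 0.04 spent, and the third step is refused because it would take the total to 0.14.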
FM-12 — Regional partition with active conversations
Scenario. Network partition between kbl and mzr; conversations opened in kbl cannot accept MOs that arrive in mzr (because sessions are region-pinned).
Impact. Cross-region MO forwarding queues until the partition heals, then delivers; cross-region session lag can reach several minutes.
Detection. chan_jetstream_mirror_lag_seconds{stream="CHANNEL_CONVERSATIONS"} spike; alert ChannelJetStreamMirrorLag.
Mitigation.
- Cross-region NATS subject chan.mo.crossregion.forward.v1 automatically forwards mismatched MOs to the owning region.
- On extended partition (> 5 m), DR procedure re-points DNS to single region; conversations continue locally.
- After heal, both regions reconcile; no data loss (events durable in JetStream mirror).
FM-13 — Audit hash-chain break
Scenario. Daily verifier finds record_hash ≠ sha256(payload || prev_hash) for some row in chan.audit.
Impact. Audit defensibility for affected partition lost; regulator confidence reduced.
Detection. chan_audit_chain_break_total > 0; alert ChannelAuditChainBreak (Critical).
Mitigation.
- Investigate root cause: schema tampering, code bug, replica divergence.
- Quarantine affected partition; subsequent rows continue from a new chain origin.
- If tamper: CISO + Legal escalation.
- If code bug: bug-fix + chain recompute; regulator notified if previously-submitted audit was affected.
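The verifier's check, record_hash = sha256(payload || prev_hash), can be sketched as a walk over the chain that reports the first row whose stored hash does not match. The byte encoding of payloads, the all-zero origin hash, and the function names are assumptions for illustration.

```python
import hashlib
from typing import List, Optional, Tuple

ORIGIN_HASH = b"\x00" * 32  # assumed chain origin; the real value is not specified here


def record_hash(payload: bytes, prev_hash: bytes) -> bytes:
    """sha256(payload || prev_hash), as stated in the FM-13 scenario."""
    return hashlib.sha256(payload + prev_hash).digest()


def first_break(rows: List[Tuple[bytes, bytes]],
                origin: bytes = ORIGIN_HASH) -> Optional[int]:
    """Return the index of the first row that fails verification, or None.

    Each row is (payload, stored_record_hash), in chain order.
    """
    prev = origin
    for index, (payload, stored) in enumerate(rows):
        if record_hash(payload, prev) != stored:
            return index
        prev = stored
    return None
```

The returned index marks where quarantine would begin; rows after it cannot be trusted even if their own hashes are internally consistent, since they chain from a row that failed.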
FM-14 — Webhook signature spoof / spike
Scenario. Attacker forges WhatsApp webhook payloads to inject false delivered statuses (motive: fraudulent billing or deliverability fraud).
Impact. Without signature validation, false correlation could bias billing or success metrics. With validation: 401-rejected; no impact, but spike indicates probing.
Detection. chan_webhook_signature_invalid_total{provider} rate spike; alert ChannelWebhookSignatureInvalidSpike (10/5m).
Mitigation.
- HMAC verification before parse, constant-time compare.
- IP allow-list for provider source IPs (Meta, Telegram, Viber publish IP ranges).
- Source-IP-based rate-limit before signature verification (defence in depth).
- Forensics on User-Agent + source ASN.
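"HMAC verification before parse, constant-time compare" can be sketched with the Python standard library as below. The `sha256=<hex>` header layout and the function name are assumptions for the example; each provider documents its own signature header and encoding.

```python
import hashlib
import hmac


def signature_is_valid(secret: bytes, raw_body: bytes, header_value: str) -> bool:
    """Verify a webhook signature over the raw, unparsed request body.

    Assumes the header carries 'sha256=<hex digest>'; adjust per provider.
    """
    digest = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
    expected = ("sha256=" + digest).encode("ascii")
    # compare_digest avoids leaking how many leading bytes matched.
    return hmac.compare_digest(expected, header_value.encode("utf-8"))
```

The body is verified as raw bytes before any JSON parsing, so a forged payload is rejected with 401 without ever reaching the correlation logic.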
FM-15 — OTT credential rotation propagation failure
Scenario. chan.ott_account.rotated.v1 event lost or adapter pod fails to reload from Vault.
Impact. Adapter pod uses stale credential; provider returns 401; circuit opens.
Detection. chan_provider_api_errors_total{http_status="401"} spike on a provider after a rotation event.
Mitigation.
- Adapter pods poll Vault every 60 s as a safety net (in addition to event-driven reload).
- Rotation procedure publishes the event AND posts to a sentinel Redis key; adapter pods use the more-recent of (event, sentinel).
- Old credential accepted in parallel for 24 h grace by the platform — but not by the provider. Operator must verify adapter health within 60 s of rotate.
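The "more-recent of (event, sentinel)" rule can be sketched as follows, assuming credential versions are monotonically increasing integers and that either signal may be missing. The function names and the integer versioning scheme are hypothetical.

```python
from typing import Optional


def target_version(event_version: Optional[int],
                   sentinel_version: Optional[int]) -> Optional[int]:
    """Pick the newest credential version announced by either signal."""
    seen = [v for v in (event_version, sentinel_version) if v is not None]
    return max(seen) if seen else None


def needs_reload(loaded_version: int,
                 event_version: Optional[int],
                 sentinel_version: Optional[int]) -> bool:
    """True when either signal announces a version newer than the loaded one."""
    target = target_version(event_version, sentinel_version)
    return target is not None and target > loaded_version
```

If the rotation event is lost but the sentinel key was written, needs_reload(3, None, 4) is still True, which is the case the sentinel exists to cover.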
4. Cross-FM postures
| Concern | Posture |
|---|---|
| Recipient privacy (consent) | Fail-closed |
| Regulatory rules (compliance) | Fail-closed |
| Sender-ID authorisation | Fail-closed |
| Adapter / channel availability | Fail-degraded (skip + audit) |
| Session continuity | Best-effort; cross-region forwarding |
| Audit integrity | Fail-loud (alert; partition quarantine) |
| Cost containment | Fail-fast (refuse with audit) |
5. Per-FM metrics binding
Every alert in OBSERVABILITY §5 is bound to one or more failure modes here. SLO error-budget consumption is attributed by outcome label; FM-driven incidents are tagged in PagerDuty for post-incident review.
6. Drill schedule
| Drill | Cadence |
|---|---|
| FM-01 (OTT-down) | Monthly per provider |
| FM-02 (PG primary kill) | Quarterly |
| FM-04 (consent-ledger out) | Quarterly |
| FM-12 (region partition) | Quarterly |
| FM-13 (audit chain break — synthetic) | Monthly |
| Full DR (region failover) | Quarterly |
All drills tracked in docs/drills/channel-router/.