Channel Router Service — Observability

Version: 1.0
Status: Draft
Owner: Messaging Core + SRE
Last Updated: 2026-04-21
Companion: SERVICE_OVERVIEW · FAILURE_MODES · SERVICE_READINESS


1. SLIs and SLOs

| SLI | SLO | Window | Error budget consumed by |
| --- | --- | --- | --- |
| RouteWithFallback P95 decision latency | ≤ 50 ms | 5 min | PG/Redis tail, consent + compliance gating round-trips |
| RouteWithFallback P99 decision latency | ≤ 120 ms | 5 min | |
| Full fallback cascade P95 (3-step OTP) | ≤ 25 s | 5 min | Voice call setup, SMS DLR tail |
| Outcome-event availability (published / total executions) | ≥ 99.99% | monthly | Outbox relay, NATS availability |
| MO routing P95 (mo.allowed.v1 consume → tenant 2xx) | ≤ 1 s | 5 min | webhook-dispatcher retries |
| OTT adapter availability per provider | ≥ 99.9% | 24 h | Provider-side outages (breaker-driven) |
| Fallback-taken ratio | ≤ 10% of executions | 24 h | Channel health, tenant policy quality |
| Cost-cap breach rate | ≤ 0.1% of executions | 24 h | Policy misconfig |
| Recipient-profile cache hit ratio | ≥ 85% | 5 min | Redis health, TTL tuning |
| Audit-chain integrity | 100% per day | 24 h | |
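
To make the ratio-style SLOs above concrete, the following is a minimal sketch of how consumed error budget could be computed. The function name `errorBudgetConsumed` and the traffic figures in the usage line are illustrative assumptions, not values from this service.

```typescript
// Fraction of the error budget consumed for a ratio-style SLO.
// sloTarget is the objective as a fraction, e.g. 0.9999 for 99.99%.
// A 100% target (such as audit-chain integrity) has no budget, so it is rejected here.
function errorBudgetConsumed(
  sloTarget: number,
  totalEvents: number,
  badEvents: number,
): number {
  if (sloTarget <= 0 || sloTarget >= 1) {
    throw new RangeError("sloTarget must be strictly between 0 and 1");
  }
  if (totalEvents === 0) return 0; // no traffic, nothing consumed
  const allowedBad = totalEvents * (1 - sloTarget);
  return badEvents / allowedBad;
}

// Hypothetical month: 10 million executions, 250 outcome events never published.
// 99.99% allows about 1,000 missing events, so about a quarter of the budget is spent.
const consumed = errorBudgetConsumed(0.9999, 10000000, 250);
console.log(`budget consumed: ${(consumed * 100).toFixed(1)}%`);
```

A value above 1 means the budget for the window is exhausted.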

2. Prometheus metrics

Metrics are exposed at /metrics on :9061 in Prometheus text format and scraped every 15 s.

2.1 RED metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| chan_requests_total | Counter | rpc, outcome, tenant_tier | All gRPC calls |
| chan_request_duration_seconds | Histogram | rpc, outcome | End-to-end handler latency |
| chan_errors_total | Counter | rpc, code | gRPC status errors |

Histogram buckets (seconds): [0.005, 0.010, 0.020, 0.035, 0.050, 0.075, 0.100, 0.200, 0.500, 1.000, 2.500, 5.000, 10.000]. The low end is fine-grained to resolve decision latency; the high end is coarse to cover the fallback cascade.
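
The bucket layout matters because quantiles are estimated by interpolating inside a bucket. The sketch below mirrors the linear interpolation that Prometheus's `histogram_quantile` applies to classic histograms; the function name `estimateQuantile` and the sample counts are illustrative assumptions.

```typescript
// Upper bounds (seconds) from the bucket list above; the +Inf bucket is implicit.
const BUCKET_BOUNDS = [0.005, 0.010, 0.020, 0.035, 0.050, 0.075, 0.100,
  0.200, 0.500, 1.000, 2.500, 5.000, 10.000];

// cumulativeCounts[i] is the number of observations <= upperBounds[i];
// the final extra entry is the +Inf bucket, i.e. the total count.
function estimateQuantile(
  q: number,
  upperBounds: number[],
  cumulativeCounts: number[],
): number {
  const total = cumulativeCounts[cumulativeCounts.length - 1];
  if (total === 0) return NaN;
  const rank = q * total;
  for (let i = 0; i < upperBounds.length; i++) {
    if (cumulativeCounts[i] >= rank) {
      const lower = i === 0 ? 0 : upperBounds[i - 1];
      const prev = i === 0 ? 0 : cumulativeCounts[i - 1];
      const inBucket = cumulativeCounts[i] - prev;
      if (inBucket === 0) return upperBounds[i];
      // Assume observations are spread evenly inside the bucket.
      return lower + (upperBounds[i] - lower) * ((rank - prev) / inBucket);
    }
  }
  // The quantile falls in the +Inf bucket: report the highest finite bound.
  return upperBounds[upperBounds.length - 1];
}

// Hypothetical window of 100 decisions, 10 of them between 50 ms and 75 ms.
const sampleCounts = [0, 10, 40, 80, 90, 100, 100, 100, 100, 100, 100, 100, 100, 100];
const p95 = estimateQuantile(0.95, BUCKET_BOUNDS, sampleCounts);
console.log(`estimated P95: ${(p95 * 1000).toFixed(1)} ms`);
```

Because the estimate can never be more precise than the bucket it lands in, the 0.035, 0.050 and 0.075 bounds are what make a 50 ms P95 threshold measurable at all.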

2.2 Fallback / ladder

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| chan_fallback_taken_total | Counter | from_channel, to_channel, reason | Ladder progressions |
| chan_fallback_cascade_duration_seconds | Histogram | use_case, final_channel, final_outcome | Full cascade end-to-end |
| chan_ladder_length | Histogram | use_case | Number of resolved ladder steps (after gating) |
| chan_excluded_reasons_total | Counter | channel, reason | Per-channel exclusion accounting |
| chan_cost_cap_breach_total | Counter | use_case | REFUSED_COST_CAP counts |

2.3 Per-channel delivery

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| chan_delivery_attempts_total | Counter | channel, status | Per-channel attempt outcomes |
| chan_delivery_duration_seconds | Histogram | channel, status | Per-channel latency |
| chan_delivery_success_ratio | Gauge | channel | Rolling 5-min success rate |
| chan_provider_api_duration_seconds | Histogram | provider, operation | Raw provider API latency |
| chan_provider_api_errors_total | Counter | provider, http_status | Upstream errors |
| chan_adapter_circuit_state | Gauge | adapter, provider | 0=closed, 1=half_open, 2=open |
| chan_adapter_tps_bucket_used | Gauge | provider, account_id | Token-bucket occupancy |
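
Two of the gauges above encode derived state rather than raw counts. The sketch below shows one way the values behind chan_adapter_circuit_state and chan_delivery_success_ratio could be produced; the `RollingSuccessRatio` class and its sample-list implementation are assumptions for illustration, and only the 0/1/2 encoding and the 5-minute window come from the table.

```typescript
// Gauge encoding stated in the table: 0=closed, 1=half_open, 2=open.
type BreakerState = "closed" | "half_open" | "open";

const CIRCUIT_STATE_VALUE: Record<BreakerState, number> = {
  closed: 0,
  half_open: 1,
  open: 2,
};

interface AttemptSample {
  atMs: number;
  ok: boolean;
}

// Hypothetical helper: success ratio over a sliding window (default 5 minutes).
class RollingSuccessRatio {
  private samples: AttemptSample[] = [];
  private readonly windowMs: number;

  constructor(windowMs: number = 5 * 60 * 1000) {
    this.windowMs = windowMs;
  }

  record(atMs: number, ok: boolean): void {
    this.samples.push({ atMs, ok });
  }

  // Drops samples older than the window, then returns successes / attempts.
  // Returns NaN when the window is empty so "no traffic" is not reported as 0% or 100%.
  ratio(nowMs: number): number {
    this.samples = this.samples.filter((s) => nowMs - s.atMs < this.windowMs);
    if (this.samples.length === 0) return NaN;
    const okCount = this.samples.filter((s) => s.ok).length;
    return okCount / this.samples.length;
  }
}

const smsRatio = new RollingSuccessRatio();
smsRatio.record(0, true);
smsRatio.record(1000, false);
console.log(smsRatio.ratio(2000), CIRCUIT_STATE_VALUE["open"]);
```

Keeping one ratio instance per channel label matches the single `channel` label on the gauge.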

2.4 MO routing

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| chan_mo_inbound_total | Counter | tenant_id, destination, mno, route_type | MOs routed (SESSION/STATIC/STOP_KEYWORD) |
| chan_mo_unmatched_total | Counter | destination, mno | Unmatched MOs |
| chan_mo_routing_duration_seconds | Histogram | route_type, outcome | End-to-end routing latency |
| chan_mo_session_hit_total | Counter | (none) | Session path (vs static) |
| chan_mo_stop_keyword_total | Counter | tenant_id, language | STOP hits |

2.5 Conversation / session

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| chan_conversation_open_count | Gauge | tenant_id | Active OPEN conversations |
| chan_conversation_opened_total | Counter | tenant_id, channel | New session rate |
| chan_conversation_closed_total | Counter | reason | Per-reason closure |
| chan_session_lookup_hit_total | Counter | source (redis/pg_fallback) | |
| chan_session_lookup_duration_seconds | Histogram | | |

2.6 Recipient profile

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| chan_profile_cache_hits_total | Counter | | |
| chan_profile_cache_misses_total | Counter | | |
| chan_profile_updates_total | Counter | source (delivery_feedback/capability_probe) | |
| chan_profile_discovery_state | Gauge | state | Distribution of profiles across UNSEEN/LEARNING/STABLE |

2.7 Gates

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| chan_gate_cache_hits_total | Counter | gate (consent/compliance) | |
| chan_gate_cache_misses_total | Counter | gate | |
| chan_gate_latency_seconds | Histogram | gate | Round-trip to dependency |
| chan_gate_fail_closed_total | Counter | gate, reason | Hard fail-closed refusals |

2.8 OTT webhook ingress

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| chan_webhook_requests_total | Counter | provider, outcome | Provider webhook traffic |
| chan_webhook_signature_invalid_total | Counter | provider | Rejected by signature |
| chan_webhook_correlation_miss_total | Counter | provider | No matching attempt found |

2.9 Audit + outbox + replication

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| chan_outbox_unpublished_total | Gauge | outbox (delivery/general) | Backlog depth |
| chan_outbox_publish_duration_seconds | Histogram | | Publish latency |
| chan_audit_rows_total | Counter | entity_type, action | Audit write rate |
| chan_audit_chain_break_total | Counter | (none) | Daily verifier breaks (target 0) |
| chan_jetstream_mirror_lag_seconds | Gauge | stream | Cross-region lag |
| chan_pg_replication_lag_seconds | Gauge | | PG logical replication lag |

2.10 ML (see AI_INTEGRATION §9)

chan_ml_inference_duration_seconds, chan_ml_budget_exceeded_total, chan_ml_fallback_rate, chan_ml_feature_drift_psi.


3. Structured logs (Pino)

All logs are JSON. Pino redactor masks msisdn, body, senderId (preserves first 3 chars), secret*, token*, authorization.
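
The masking rules above can be sketched as a plain recursive function, independent of Pino's own redaction configuration. This is an illustration under assumptions: the names `redactLog` and `CENSOR`, the `***` suffix, and the reading that only senderId keeps its first 3 characters are not confirmed by the source.

```typescript
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

const FULL_MASK_KEYS = new Set(["msisdn", "body", "authorization"]);
const MASKED_KEY_PREFIXES = ["secret", "token"]; // stands in for secret* and token*
const KEEP_PREFIX_KEYS = new Set(["senderId"]);  // assumed: first 3 chars preserved
const CENSOR = "[REDACTED]";                     // hypothetical placeholder text

// Returns a redacted copy of the log object; the input is not mutated.
function redactLog(value: Json, key: string = ""): Json {
  if (key !== "") {
    if (FULL_MASK_KEYS.has(key) || MASKED_KEY_PREFIXES.some((p) => key.startsWith(p))) {
      return CENSOR;
    }
    if (KEEP_PREFIX_KEYS.has(key) && typeof value === "string") {
      return value.slice(0, 3) + "***";
    }
  }
  if (Array.isArray(value)) {
    return value.map((item) => redactLog(item));
  }
  if (value !== null && typeof value === "object") {
    const out: { [key: string]: Json } = {};
    for (const [k, v] of Object.entries(value)) {
      out[k] = redactLog(v, k);
    }
    return out;
  }
  return value;
}

console.log(JSON.stringify(redactLog({
  event: "chan.route.decided",
  msisdn: "+93700000000",
  senderId: "GHASI",
  headers: { authorization: "Bearer abc" },
})));
```

Key matching here is exact and case-sensitive, so a real deployment would need to decide how differently cased keys such as `Authorization` are handled.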

3.1 Decision log (sampled 1% on DELIVERED, 100% on FAILED / REFUSED_*)

{
  "level": "info",
  "time": "2026-04-21T10:14:23.123Z",
  "event": "chan.route.decided",
  "executionId": "exec_01HKX...",
  "notificationId": "n_01HKX...",
  "tenantId": "t_...",
  "useCase": "otp",
  "ladderAccepted": ["SMS", "WHATSAPP", "VOICE"],
  "excluded": [{"channel": "EMAIL", "reason": "recipient_opt_out"}],
  "decisionLatencyMs": 38,
  "gateLatencies": {"consent": 7, "compliance": 12, "senderId": 3},
  "traceId": "00-abc123-def456-01"
}
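
The sampling rule in the heading (1% on DELIVERED, 100% on FAILED / REFUSED_*) could be expressed as a small predicate. This is a sketch: the function name, the injectable random source, and the choice to always log unrecognised outcomes are assumptions.

```typescript
const DELIVERED_SAMPLE_RATE = 0.01; // 1% of DELIVERED decisions

// Decides whether a chan.route.decided log line should be emitted.
// `random` is injectable so the rule can be tested deterministically.
function shouldLogDecision(
  outcome: string,
  random: () => number = Math.random,
): boolean {
  if (outcome === "FAILED" || outcome.startsWith("REFUSED_")) {
    return true; // failures and refusals are always logged
  }
  if (outcome === "DELIVERED") {
    return random() < DELIVERED_SAMPLE_RATE;
  }
  return true; // assumption: outcomes not covered by the rule are logged in full
}

console.log(shouldLogDecision("REFUSED_COST_CAP"));
```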

3.2 Fallback progression log

{
  "event": "chan.fallback.taken",
  "executionId": "exec_01HKX...",
  "fromChannel": "SMS",
  "toChannel": "WHATSAPP",
  "fromStatus": "timed_out",
  "reason": "deadline_elapsed",
  "fromDurationMs": 60001,
  "costAccumulatedNgn": 2.50
}

3.3 MO routing log

{
  "event": "chan.mo.routed",
  "messageId": "mo_01HKX...",
  "tenantId": "t_...",
  "destination": "2211",
  "mno": "AWCC",
  "routeType": "SESSION",
  "conversationId": "conv_01HKX...",
  "closedSessionByStop": false,
  "tenantWebhookStatus": 200,
  "latencyMs": 320
}

3.4 OTT webhook log

{
  "event": "chan.webhook.received",
  "provider": "WHATSAPP_CLOUD",
  "correlationHit": true,
  "attemptId": "attempt_01HKX...",
  "providerStatus": "delivered",
  "providerMessageId": "wamid.HBgL..."
}

3.5 Error log

{
  "level": "error",
  "event": "chan.gate.fail_closed",
  "gate": "consent",
  "reason": "deadline_exceeded",
  "cacheAgeSeconds": 312,
  "executionId": "exec_01HKX..."
}
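
One plausible reading of the error log above is that a gate refuses when the live dependency call fails and the cached verdict is too old to trust. The sketch below illustrates that reading only; the 300-second maximum cache age, the type names, and the `decideGate` function are hypothetical and not taken from the service.

```typescript
interface CachedVerdict {
  allowed: boolean;
  ageSeconds: number;
}

type LiveResult =
  | { ok: true; allowed: boolean }
  | { ok: false; reason: string };

interface GateDecision {
  allowed: boolean;
  source: "live" | "cache" | "fail_closed";
  reason?: string;
}

// Prefers the live answer, falls back to a fresh-enough cache entry,
// and otherwise refuses (fail closed) rather than letting the send proceed.
function decideGate(
  live: LiveResult,
  cached: CachedVerdict | null,
  maxCacheAgeSeconds: number = 300, // hypothetical threshold
): GateDecision {
  if (live.ok) {
    return { allowed: live.allowed, source: "live" };
  }
  if (cached !== null && cached.ageSeconds <= maxCacheAgeSeconds) {
    return { allowed: cached.allowed, source: "cache" };
  }
  return { allowed: false, source: "fail_closed", reason: live.reason };
}

// Mirrors the example log: the call timed out and the cache entry is 312 s old.
console.log(decideGate({ ok: false, reason: "deadline_exceeded" }, { allowed: true, ageSeconds: 312 }));
```

Each fail_closed decision would correspond to one increment of chan_gate_fail_closed_total with the matching gate and reason labels.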

4. OpenTelemetry tracing

Parent span: chan.RouteWithFallback or chan.RouteMo. Trace context propagated via grpc-trace-bin and W3C traceparent. Head-sampling: 100% for FAILED/REFUSED_*; 1% for DELIVERED.

| Span | Operation | Attributes |
| --- | --- | --- |
| chan.policy.load | Redis GET + PG fallback | cache.hit, policy.version |
| chan.profile.load | Redis GET + PG fallback | cache.hit, discovery.state |
| chan.gate.consent | gRPC to consent-ledger | cache.hit, result.denied_channels[] |
| chan.gate.compliance | gRPC to compliance-engine | cache.hit, result.denied_channels[] |
| chan.gate.sender | gRPC to sender-id-registry | cache.hit, allowed |
| chan.ladder.resolve | In-process | steps, cost_pre_check |
| chan.dispatch.<channel> | Per-adapter call | provider, provider_message_id, status |
| chan.webhook.verify | HMAC / secret-path | provider, signature_valid |
| chan.outcome.emit | Outbox + NATS publish | final, attempts |
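
For the W3C traceparent propagation mentioned above, a minimal parser for the version 00 header layout (version, 32-hex trace ID, 16-hex parent ID, 2-hex flags) might look like the sketch below. The `parseTraceparent` name and the `TraceContext` shape are illustrative; a production service would normally rely on its OpenTelemetry SDK's propagator instead.

```typescript
interface TraceContext {
  version: string;
  traceId: string;
  parentId: string;
  sampled: boolean;
}

// Lowercase hex only: 2-32-16-2 characters separated by dashes.
const TRACEPARENT_RE = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Returns null for anything that is not a well-formed four-field header.
function parseTraceparent(header: string): TraceContext | null {
  const match = TRACEPARENT_RE.exec(header.trim());
  if (match === null) return null;
  const [, version, traceId, parentId, flags] = match;
  // Version ff and all-zero IDs are invalid per the Trace Context specification.
  if (version === "ff" || /^0+$/.test(traceId) || /^0+$/.test(parentId)) {
    return null;
  }
  // Bit 0 of the flags byte is the "sampled" flag.
  const sampled = (parseInt(flags, 16) & 0x01) === 1;
  return { version, traceId, parentId, sampled };
}

console.log(parseTraceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"));
```

Shortened values such as the `traceId` shown in the decision log example are abbreviated for readability and would not pass this check.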

5. Alerts (Prometheus / Alertmanager YAML)

groups:
  - name: channel.slo
    rules:
      - alert: ChannelRouteLatencyHigh
        expr: |
          histogram_quantile(0.95,
            rate(chan_request_duration_seconds_bucket{rpc="RouteWithFallback"}[5m])
          ) > 0.050
        for: 5m
        labels: { severity: high, service: channel-router }
        annotations:
          summary: "RouteWithFallback P95 > 50ms for 5m"
          runbook: "https://runbooks.ghasi.af/channel/route-latency-high"

      - alert: ChannelFallbackRateHigh
        expr: |
          sum(rate(chan_fallback_taken_total[30m])) /
          sum(rate(chan_requests_total{rpc="RouteWithFallback"}[30m])) > 0.25
        for: 15m
        labels: { severity: high }
        annotations:
          summary: "Fallback taken on > 25% of traffic: SMS channel may be degraded"

      - alert: ChannelOutcomePublishLagHigh
        expr: |
          histogram_quantile(0.99,
            rate(chan_outbox_publish_duration_seconds_bucket[5m])
          ) > 1.0
        for: 5m
        labels: { severity: high }

  - name: channel.adapter
    rules:
      - alert: ChannelAdapterUnavailable
        expr: chan_adapter_circuit_state == 2
        for: 2m
        labels: { severity: high }
        annotations:
          summary: "Adapter {{ $labels.adapter }} breaker OPEN"
          runbook: "https://runbooks.ghasi.af/channel/adapter-circuit-open"

      - alert: ChannelProviderApiErrorSpike
        expr: |
          sum(rate(chan_provider_api_errors_total{http_status=~"5.."}[5m])) by (provider) > 10
        for: 3m
        labels: { severity: high }

  - name: channel.gate
    rules:
      - alert: ChannelConsentViolationAttempt
        expr: increase(chan_gate_fail_closed_total{gate="consent"}[5m]) > 50
        labels: { severity: critical }
        annotations:
          summary: "Spike of consent-gate fail-closed events: investigate consent-ledger"

      - alert: ChannelComplianceViolationAttempt
        expr: increase(chan_excluded_reasons_total{reason="compliance_block"}[5m]) > 200
        labels: { severity: medium }

  - name: channel.cost
    rules:
      - alert: ChannelCostCapBreach
        expr: increase(chan_cost_cap_breach_total[15m]) > 10
        labels: { severity: medium }
        annotations:
          summary: "Multiple cost-cap breaches: tenant policies may be misconfigured"

  - name: channel.mo
    rules:
      - alert: ChannelMoRoutingFailed
        expr: |
          sum(rate(chan_mo_unmatched_total[5m])) > 100
        for: 5m
        labels: { severity: medium }
        annotations:
          summary: "Unmatched MO spike: inbound route configuration drift?"

      - alert: ChannelMoLatencyHigh
        expr: |
          histogram_quantile(0.95,
            rate(chan_mo_routing_duration_seconds_bucket[5m])
          ) > 1.0
        for: 5m
        labels: { severity: high }

  - name: channel.webhook
    rules:
      - alert: ChannelWebhookSignatureInvalidSpike
        expr: rate(chan_webhook_signature_invalid_total[5m]) > 10
        labels: { severity: high }
        annotations:
          summary: "Spike of invalid OTT webhook signatures: potential spoofing"

  - name: channel.audit
    rules:
      - alert: ChannelAuditChainBreak
        expr: increase(chan_audit_chain_break_total[24h]) > 0
        labels: { severity: critical }
        annotations:
          summary: "Audit hash-chain integrity failure: engage Security IR"

  - name: channel.replication
    rules:
      - alert: ChannelJetStreamMirrorLag
        expr: chan_jetstream_mirror_lag_seconds{stream="CHANNEL_OUTCOMES"} > 10
        for: 5m
        labels: { severity: high }

6. Grafana dashboards

Dashboard: dashboards/channel-router.json. Tags: service:channel-router, environment:prod, region:{kbl,mzr}.

| Panel | Query | Visualisation |
| --- | --- | --- |
| RouteWithFallback RPS | sum(rate(chan_requests_total{rpc="RouteWithFallback"}[1m])) by (outcome) | Stacked area |
| Decision P50/P95/P99 | histogram_quantile(0.{50,95,99}, …) | Time series |
| Fallback-taken ratio (24h) | | Single stat |
| Fallback Sankey (from→to) | sum(rate(chan_fallback_taken_total[24h])) by (from_channel, to_channel) | Sankey |
| Per-channel success rate | chan_delivery_success_ratio | Line |
| Adapter circuit states | chan_adapter_circuit_state | State timeline per provider |
| Provider API latency P95 | per provider | Time series |
| MO RPS by route-type | sum(rate(chan_mo_inbound_total[1m])) by (route_type) | Stacked area |
| MO latency P95 | histogram_quantile(0.95, sum(rate(chan_mo_routing_duration_seconds_bucket[5m])) by (le)) | Time series |
| Active conversations by tenant | topk(20, chan_conversation_open_count) | Bar |
| Cost-cap breach rate | rate(chan_cost_cap_breach_total[1h]) | Line |
| Profile cache hit ratio | sum(rate(chan_profile_cache_hits_total[5m])) / (sum(rate(chan_profile_cache_hits_total[5m])) + sum(rate(chan_profile_cache_misses_total[5m]))) | Gauge |
| Discovery-state distribution | sum(chan_profile_discovery_state) by (state) | Pie |
| Audit chain break (must be 0) | chan_audit_chain_break_total | Single stat (red if > 0) |
| Mirror lag (kbl↔mzr) | chan_jetstream_mirror_lag_seconds | Time series |
| ML inference P95 + fallback rate | | Dual-axis |

7. Runbooks

  • runbooks/channel/route-latency-high.md
  • runbooks/channel/adapter-circuit-open.md (per provider: whatsapp / telegram / viber / voice)
  • runbooks/channel/consent-fail-closed-spike.md
  • runbooks/channel/mo-routing-failed.md
  • runbooks/channel/audit-chain-break.md
  • runbooks/channel/webhook-signature-spike.md (possible spoofing)
  • runbooks/channel/cost-cap-breach.md (per-tenant investigation)
  • runbooks/channel/mirror-lag.md

All runbooks are versioned in docs/runbooks/channel/.