Channel Router Service — Observability
Version: 1.0
Status: Draft
Owner: Messaging Core + SRE
Last Updated: 2026-04-21
Companion: SERVICE_OVERVIEW · FAILURE_MODES · SERVICE_READINESS
1. SLIs and SLOs
| SLI | SLO | Window | Error budget consumed by |
|---|---|---|---|
| RouteWithFallback P95 decision latency | ≤ 50 ms | 5 min | PG/Redis tail, consent + compliance gating round-trips |
| RouteWithFallback P99 decision latency | ≤ 120 ms | 5 min | — |
| Full fallback cascade P95 (3-step OTP) | ≤ 25 s | 5 min | Voice call setup, SMS DLR tail |
| Outcome-event availability (published / total executions) | ≥ 99.99% | monthly | Outbox relay, NATS availability |
| MO routing P95 (mo.allowed.v1 consume → tenant 2xx) | ≤ 1 s | 5 min | webhook-dispatcher retries |
| OTT adapter availability per provider | ≥ 99.9% | 24 h | Provider-side outages (breaker-driven) |
| Fallback-taken ratio | ≤ 10% of executions | 24 h | Channel health, tenant policy quality |
| Cost-cap breach rate | ≤ 0.1% of executions | 24 h | Policy misconfig |
| Recipient-profile cache hit ratio | ≥ 85% | 5 min | Redis health, TTL tuning |
| Audit-chain integrity | 100% per day | 24 h | — |
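As a rough worked example of what an error budget means in absolute terms, the sketch below converts an SLO percentage into an allowed number of bad events and a consumed fraction. The function names and the 50,000,000 monthly execution volume are illustrative assumptions, not figures from this service.

```typescript
// Hypothetical helper: number of "bad" events an SLO tolerates in a window.
// Integer arithmetic avoids floating-point drift on values like 99.99.
function allowedBadEvents(sloPercent: number, totalEvents: number): number {
  const scale = 10000000; // work in parts per ten million
  const goodParts = Math.round((sloPercent / 100) * scale);
  return Math.floor(((scale - goodParts) * totalEvents) / scale);
}

// Fraction of the budget already spent; values above 1 mean the SLO is missed.
function budgetConsumed(badEvents: number, allowed: number): number {
  if (allowed === 0) return badEvents > 0 ? Infinity : 0;
  return badEvents / allowed;
}

// Assumed volume: 50,000,000 executions in the month (illustrative only).
// Outcome-event availability SLO of 99.99% then tolerates 5000 unpublished events.
const monthlyAllowed = allowedBadEvents(99.99, 50000000);

// 1250 unpublished events so far would have spent a quarter of the budget.
const spent = budgetConsumed(1250, monthlyAllowed);
```

A 100% target, such as audit-chain integrity, yields a budget of zero, so any single failure exhausts it.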
2. Prometheus metrics
Exposed at /metrics on :9061. Prometheus text format. Scraped every 15 s.
2.1 RED metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| chan_requests_total | Counter | rpc, outcome, tenant_tier | All gRPC calls |
| chan_request_duration_seconds | Histogram | rpc, outcome | End-to-end handler latency |
| chan_errors_total | Counter | rpc, code | gRPC status errors |
Histogram buckets (seconds): [0.005, 0.010, 0.020, 0.035, 0.050, 0.075, 0.100, 0.200, 0.500, 1.000, 2.500, 5.000, 10.000]. The low end is fine-grained for decision latency; the high end is coarse, for the cascade.
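Bucket placement matters because quantiles are estimated from bucket counts rather than from raw samples. The sketch below mirrors the linear interpolation that Prometheus' histogram_quantile applies to classic histograms; the Bucket type, function name, and sample counts are illustrative assumptions.

```typescript
// One cumulative histogram bucket: number of observations <= le (seconds).
interface Bucket {
  le: number; // upper bound; Infinity stands in for the +Inf bucket
  count: number; // cumulative count
}

// Estimate quantile q (0 < q <= 1): find the bucket that contains the
// target rank, then interpolate linearly between its bounds.
function histogramQuantile(q: number, buckets: Bucket[]): number {
  if (buckets.length === 0) return NaN;
  const sorted = [...buckets].sort((a, b) => a.le - b.le);
  const total = sorted[sorted.length - 1].count;
  if (total === 0) return NaN;

  const rank = q * total;
  const idx = sorted.findIndex((b) => b.count >= rank);
  if (sorted[idx].le === Infinity) {
    // Rank lands in +Inf: only the highest finite bound can be reported.
    return sorted[idx - 1].le;
  }
  const lower = idx === 0 ? 0 : sorted[idx - 1].le;
  const prevCount = idx === 0 ? 0 : sorted[idx - 1].count;
  const inBucket = sorted[idx].count - prevCount;
  if (inBucket === 0) return lower;
  return lower + (sorted[idx].le - lower) * ((rank - prevCount) / inBucket);
}

// Made-up counts: 940 of 1000 requests finished within 50 ms.
const sample: Bucket[] = [
  { le: 0.035, count: 900 },
  { le: 0.05, count: 940 },
  { le: 0.075, count: 990 },
  { le: Infinity, count: 1000 },
];
const p95 = histogramQuantile(0.95, sample); // about 0.055 s, above the 50 ms target
```

Since 0.050 is itself a bucket boundary, whether P95 meets the 50 ms target reduces to whether at least 95% of observations fall in the buckets up to 0.050; interpolation error only affects how far over or under the estimate lands.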
2.2 Fallback / ladder
| Metric | Type | Labels | Description |
|---|---|---|---|
| chan_fallback_taken_total | Counter | from_channel, to_channel, reason | Ladder progressions |
| chan_fallback_cascade_duration_seconds | Histogram | use_case, final_channel, final_outcome | Full cascade end-to-end |
| chan_ladder_length | Histogram | use_case | Number of resolved ladder steps (after gating) |
| chan_excluded_reasons_total | Counter | channel, reason | Per-channel exclusion accounting |
| chan_cost_cap_breach_total | Counter | use_case | REFUSED_COST_CAP counts |
2.3 Per-channel delivery
| Metric | Type | Labels | Description |
|---|---|---|---|
| chan_delivery_attempts_total | Counter | channel, status | Per-channel attempt outcomes |
| chan_delivery_duration_seconds | Histogram | channel, status | Per-channel latency |
| chan_delivery_success_ratio | Gauge | channel | Rolling 5-min success rate |
| chan_provider_api_duration_seconds | Histogram | provider, operation | Raw provider API latency |
| chan_provider_api_errors_total | Counter | provider, http_status | Upstream errors |
| chan_adapter_circuit_state | Gauge | adapter, provider | 0=closed, 1=half_open, 2=open |
| chan_adapter_tps_bucket_used | Gauge | provider, account_id | Token-bucket occupancy |
2.4 MO routing
| Metric | Type | Labels | Description |
|---|---|---|---|
| chan_mo_inbound_total | Counter | tenant_id, destination, mno, route_type | MOs routed (SESSION/STATIC/STOP_KEYWORD) |
| chan_mo_unmatched_total | Counter | destination, mno | Unmatched MOs |
| chan_mo_routing_duration_seconds | Histogram | route_type, outcome | End-to-end routing latency |
| chan_mo_session_hit_total | Counter | (none) | Session path (vs static) |
| chan_mo_stop_keyword_total | Counter | tenant_id, language | STOP hits |
2.5 Conversation / session
| Metric | Type | Labels | Description |
|---|---|---|---|
| chan_conversation_open_count | Gauge | tenant_id | Active OPEN conversations |
| chan_conversation_opened_total | Counter | tenant_id, channel | New session rate |
| chan_conversation_closed_total | Counter | reason | Per-reason closure |
| chan_session_lookup_hit_total | Counter | source (redis/pg_fallback) | |
| chan_session_lookup_duration_seconds | Histogram | — | — |
2.6 Recipient profile
| Metric | Type | Labels | Description |
|---|---|---|---|
| chan_profile_cache_hits_total | Counter | — | |
| chan_profile_cache_misses_total | Counter | — | |
| chan_profile_updates_total | Counter | source (delivery_feedback/capability_probe) | |
| chan_profile_discovery_state | Gauge | state | Distribution of profiles across UNSEEN/LEARNING/STABLE |
2.7 Consent / compliance gating
| Metric | Type | Labels | Description |
|---|---|---|---|
| chan_gate_cache_hits_total | Counter | gate (consent/compliance) | |
| chan_gate_cache_misses_total | Counter | gate | |
| chan_gate_latency_seconds | Histogram | gate | Round-trip to dependency |
| chan_gate_fail_closed_total | Counter | gate, reason | Hard fail-closed refusals |
2.8 OTT webhook ingress
| Metric | Type | Labels | Description |
|---|---|---|---|
| chan_webhook_requests_total | Counter | provider, outcome | Provider webhook traffic |
| chan_webhook_signature_invalid_total | Counter | provider | Rejected by signature |
| chan_webhook_correlation_miss_total | Counter | provider | No matching attempt found |
2.9 Audit + outbox + replication
| Metric | Type | Labels | Description |
|---|---|---|---|
| chan_outbox_unpublished_total | Gauge | outbox (delivery/general) | Backlog depth |
| chan_outbox_publish_duration_seconds | Histogram | — | Publish latency |
| chan_audit_rows_total | Counter | entity_type, action | Audit write rate |
| chan_audit_chain_break_total | Counter | (none) | Daily verifier breaks (target 0) |
| chan_jetstream_mirror_lag_seconds | Gauge | stream | Cross-region lag |
| chan_pg_replication_lag_seconds | Gauge | — | PG logical replication lag |
2.10 ML (see AI_INTEGRATION §9)
chan_ml_inference_duration_seconds, chan_ml_budget_exceeded_total, chan_ml_fallback_rate, chan_ml_feature_drift_psi.
3. Structured logs (Pino)
All logs are JSON. Pino redactor masks msisdn, body, senderId (preserves first 3 chars), secret*, token*, authorization.
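The masking rule can be sketched as a dependency-free function. This is not Pino's own redaction configuration; the key lists, the "[REDACTED]" placeholder, and the reading that only senderId keeps its first 3 characters are assumptions made for illustration.

```typescript
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

// Assumed split: these keys are masked entirely...
const FULL_MASK_KEYS = ["msisdn", "body", "authorization"];
const FULL_MASK_PREFIXES = ["secret", "token"]; // stands in for secret* / token*
// ...while these keep a short readable prefix.
const PREFIX_KEEP_KEYS = ["senderId"];

function keepPrefix(value: string, keep = 3): string {
  return value.slice(0, keep) + "*".repeat(Math.max(0, value.length - keep));
}

function isFullyMasked(key: string): boolean {
  return (
    FULL_MASK_KEYS.includes(key) ||
    FULL_MASK_PREFIXES.some((prefix) => key.startsWith(prefix))
  );
}

// Walk the log object and mask values according to the key they sit under.
function redact(value: Json, key = ""): Json {
  if (isFullyMasked(key)) return "[REDACTED]"; // mask whole subtree
  if (Array.isArray(value)) return value.map((item) => redact(item, key));
  if (value !== null && typeof value === "object") {
    const out: { [k: string]: Json } = {};
    for (const [k, v] of Object.entries(value)) out[k] = redact(v, k);
    return out;
  }
  // Only string values are prefix-masked; numeric fields pass through.
  if (PREFIX_KEEP_KEYS.includes(key) && typeof value === "string") {
    return keepPrefix(value);
  }
  return value;
}

const redacted = redact({
  msisdn: "+93700123456",
  senderId: "GHASIBANK",
  tokenValue: "abc",
  gateLatencies: { senderId: 3 },
  useCase: "otp",
});
// msisdn and tokenValue become "[REDACTED]", senderId becomes "GHA******",
// and the numeric gateLatencies.senderId latency is left untouched.
```

Restricting the prefix rule to strings keeps latency fields that happen to share a key name, such as gateLatencies.senderId, readable.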
3.1 Decision log (sampled 1% on DELIVERED, 100% on FAILED / REFUSED_*)
{
  "level": "info",
  "time": "2026-04-21T10:14:23.123Z",
  "event": "chan.route.decided",
  "executionId": "exec_01HKX...",
  "notificationId": "n_01HKX...",
  "tenantId": "t_...",
  "useCase": "otp",
  "ladderAccepted": ["SMS", "WHATSAPP", "VOICE"],
  "excluded": [{"channel": "EMAIL", "reason": "recipient_opt_out"}],
  "decisionLatencyMs": 38,
  "gateLatencies": {"consent": 7, "compliance": 12, "senderId": 3},
  "traceId": "00-abc123-def456-01"
}
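One way to apply the 1% / 100% split for decision logs is a deterministic sampler keyed on the execution ID, so every log line for a given execution gets the same keep-or-drop answer. This is a sketch under assumptions: the document does not say how sampling is implemented, and the FNV-1a hash, the function names, and the choice to keep every outcome other than DELIVERED are all illustrative.

```typescript
// FNV-1a (32-bit): a small dependency-free hash that maps an ID to a stable number.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash >>> 0;
}

// Keep every non-DELIVERED decision (FAILED, REFUSED_*, and, by assumption,
// anything else); keep DELIVERED decisions at deliveredRate.
function shouldLogDecision(
  outcome: string,
  executionId: string,
  deliveredRate = 0.01,
): boolean {
  if (outcome !== "DELIVERED") return true;
  // Map the hash to [0, 1) and compare against the sampling rate.
  return fnv1a(executionId) / 0x100000000 < deliveredRate;
}

const keepFailed = shouldLogDecision("FAILED", "exec_01HKX"); // always true
const keepRefused = shouldLogDecision("REFUSED_COST_CAP", "exec_01HKX"); // always true
```

Hashing instead of calling Math.random means a retried or replayed log write for the same execution cannot flip the decision.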
3.2 Fallback progression log
{
  "event": "chan.fallback.taken",
  "executionId": "exec_01HKX...",
  "fromChannel": "SMS",
  "toChannel": "WHATSAPP",
  "fromStatus": "timed_out",
  "reason": "deadline_elapsed",
  "fromDurationMs": 60001,
  "costAccumulatedNgn": 2.50
}
3.3 MO routing log
{
  "event": "chan.mo.routed",
  "messageId": "mo_01HKX...",
  "tenantId": "t_...",
  "destination": "2211",
  "mno": "AWCC",
  "routeType": "SESSION",
  "conversationId": "conv_01HKX...",
  "closedSessionByStop": false,
  "tenantWebhookStatus": 200,
  "latencyMs": 320
}
3.4 OTT webhook log
{
  "event": "chan.webhook.received",
  "provider": "WHATSAPP_CLOUD",
  "correlationHit": true,
  "attemptId": "attempt_01HKX...",
  "providerStatus": "delivered",
  "providerMessageId": "wamid.HBgL..."
}
3.5 Error log
{
  "level": "error",
  "event": "chan.gate.fail_closed",
  "gate": "consent",
  "reason": "deadline_exceeded",
  "cacheAgeSeconds": 312,
  "executionId": "exec_01HKX..."
}
4. OpenTelemetry tracing
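Parent span: chan.RouteWithFallback or chan.RouteMo. Trace context is propagated via grpc-trace-bin and W3C traceparent. Sampling: 100% for FAILED/REFUSED_*; 1% for DELIVERED. Because the outcome is only known once the execution finishes, this split has to be decided after the fact (tail-based sampling, or buffering spans until the outcome is known); a pure head sampler cannot see the outcome.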
| Span | Operation | Attributes |
|---|---|---|
| chan.policy.load | Redis GET + PG fallback | cache.hit, policy.version |
| chan.profile.load | Redis GET + PG fallback | cache.hit, discovery.state |
| chan.gate.consent | gRPC to consent-ledger | cache.hit, result.denied_channels[] |
| chan.gate.compliance | gRPC to compliance-engine | cache.hit, result.denied_channels[] |
| chan.gate.sender | gRPC to sender-id-registry | cache.hit, allowed |
| chan.ladder.resolve | In-process | steps, cost_pre_check |
| chan.dispatch.<channel> | Per-adapter call | provider, provider_message_id, status |
| chan.webhook.verify | HMAC / secret-path | provider, signature_valid |
| chan.outcome.emit | Outbox + NATS publish | final, attempts |
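For reference, a W3C traceparent header carries four dash-separated lowercase-hex fields: version (2 characters), trace ID (32), parent span ID (16), and flags (2). The parser below is a minimal sketch of that layout; the TraceParent type and function name are hypothetical, and it deliberately accepts only the exact four-field form.

```typescript
interface TraceParent {
  version: string;
  traceId: string; // 32 lowercase hex chars
  parentId: string; // 16 lowercase hex chars
  sampled: boolean; // lowest bit of the flags byte
}

// Returns null for anything that is not a well-formed four-field header.
function parseTraceparent(header: string): TraceParent | null {
  const match =
    /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(
      header.trim(),
    );
  if (!match) return null;
  const [, version, traceId, parentId, flags] = match;
  if (version === "ff") return null; // version ff is invalid
  // All-zero trace or parent IDs are invalid.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  return {
    version,
    traceId,
    parentId,
    sampled: (parseInt(flags, 16) & 0x01) === 1,
  };
}

// Example header in the form used by the W3C Trace Context specification.
const parsed = parseTraceparent(
  "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
);
// parsed.traceId is "4bf92f3577b34da6a3ce929d0e0e4736" and parsed.sampled is true.
```

Shortened placeholder values, such as the abbreviated traceId shown in the decision log sample, do not match this layout and are rejected by the sketch.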
5. Alerts (Prometheus / Alertmanager YAML)
groups:
  - name: channel.slo
    rules:
      - alert: ChannelRouteLatencyHigh
        # Aggregate buckets by le so the quantile is fleet-wide, not per series.
        expr: |
          histogram_quantile(0.95,
            sum by (le) (rate(chan_request_duration_seconds_bucket{rpc="RouteWithFallback"}[5m]))
          ) > 0.050
        for: 5m
        labels: { severity: high, service: channel-router }
        annotations:
          summary: "RouteWithFallback P95 > 50ms for 5m"
          runbook: "https://runbooks.ghasi.af/channel/route-latency-high"
      - alert: ChannelFallbackRateHigh
        expr: |
          sum(rate(chan_fallback_taken_total[30m])) /
          sum(rate(chan_requests_total{rpc="RouteWithFallback"}[30m])) > 0.25
        for: 15m
        labels: { severity: high }
        annotations:
          summary: "Fallback taken on > 25% of traffic — SMS channel may be degraded"
      - alert: ChannelOutcomePublishLagHigh
        expr: |
          histogram_quantile(0.99,
            sum by (le) (rate(chan_outbox_publish_duration_seconds_bucket[5m]))
          ) > 1.0
        for: 5m
        labels: { severity: high }
  - name: channel.adapter
    rules:
      - alert: ChannelAdapterUnavailable
        expr: chan_adapter_circuit_state == 2
        for: 2m
        labels: { severity: high }
        annotations:
          summary: "Adapter {{ $labels.adapter }} breaker OPEN"
          runbook: "https://runbooks.ghasi.af/channel/adapter-circuit-open"
      - alert: ChannelProviderApiErrorSpike
        expr: |
          sum(rate(chan_provider_api_errors_total{http_status=~"5.."}[5m])) by (provider) > 10
        for: 3m
        labels: { severity: high }
  - name: channel.gate
    rules:
      - alert: ChannelConsentViolationAttempt
        expr: increase(chan_gate_fail_closed_total{gate="consent"}[5m]) > 50
        labels: { severity: critical }
        annotations:
          summary: "Spike of consent-gate fail-closed events — investigate consent-ledger"
      - alert: ChannelComplianceViolationAttempt
        expr: increase(chan_excluded_reasons_total{reason="compliance_block"}[5m]) > 200
        labels: { severity: medium }
  - name: channel.cost
    rules:
      - alert: ChannelCostCapBreach
        expr: increase(chan_cost_cap_breach_total[15m]) > 10
        labels: { severity: medium }
        annotations:
          summary: "Multiple cost-cap breaches — tenant policies may be misconfigured"
  - name: channel.mo
    rules:
      - alert: ChannelMoRoutingFailed
        expr: |
          sum(rate(chan_mo_unmatched_total[5m])) > 100
        for: 5m
        labels: { severity: medium }
        annotations:
          summary: "Unmatched MO spike — inbound route configuration drift?"
      - alert: ChannelMoLatencyHigh
        expr: |
          histogram_quantile(0.95,
            sum by (le) (rate(chan_mo_routing_duration_seconds_bucket[5m]))
          ) > 1.0
        for: 5m
        labels: { severity: high }
  - name: channel.webhook
    rules:
      - alert: ChannelWebhookSignatureInvalidSpike
        expr: rate(chan_webhook_signature_invalid_total[5m]) > 10
        labels: { severity: high }
        annotations:
          summary: "Spike of invalid OTT webhook signatures — potential spoofing"
  - name: channel.audit
    rules:
      - alert: ChannelAuditChainBreak
        expr: increase(chan_audit_chain_break_total[24h]) > 0
        labels: { severity: critical }
        annotations:
          summary: "Audit hash-chain integrity failure — engage Security IR"
  - name: channel.replication
    rules:
      - alert: ChannelJetStreamMirrorLag
        expr: chan_jetstream_mirror_lag_seconds{stream="CHANNEL_OUTCOMES"} > 10
        for: 5m
        labels: { severity: high }
6. Grafana dashboards
Dashboard: dashboards/channel-router.json. Tags: service:channel-router, environment:prod, region:{kbl,mzr}.
| Panel | Query | Visualisation |
|---|---|---|
| RouteWithFallback RPS | sum(rate(chan_requests_total{rpc="RouteWithFallback"}[1m])) by (outcome) | Stacked area |
| Decision P50/P95/P99 | histogram_quantile(0.{50,95,99}, …) | Time series |
| Fallback-taken ratio (24h) | — | Single stat |
| Fallback Sankey (from→to) | sum(rate(chan_fallback_taken_total[24h])) by (from_channel, to_channel) | Sankey |
| Per-channel success rate | chan_delivery_success_ratio | Line |
| Adapter circuit states | chan_adapter_circuit_state | State timeline per provider |
| Provider API latency P95 | per provider | Time series |
| MO RPS by route-type | sum(rate(chan_mo_inbound_total[1m])) by (route_type) | Stacked area |
| MO latency P95 | histogram_quantile(0.95, sum by (le) (rate(chan_mo_routing_duration_seconds_bucket[5m]))) | Time series |
| Active conversations by tenant | topk(20, chan_conversation_open_count) | Bar |
| Cost-cap breach rate | rate(chan_cost_cap_breach_total[1h]) | Line |
| Profile cache hit ratio | sum(rate(chan_profile_cache_hits_total[5m])) / (sum(rate(chan_profile_cache_hits_total[5m])) + sum(rate(chan_profile_cache_misses_total[5m]))) | Gauge |
| Discovery-state distribution | sum(chan_profile_discovery_state) by (state) | Pie |
| Audit chain break (must be 0) | sum(increase(chan_audit_chain_break_total[24h])) | Single stat (red if > 0) |
| Mirror lag (kbl↔mzr) | chan_jetstream_mirror_lag_seconds | Time series |
| ML inference P95 + fallback rate | — | Dual-axis |
7. Runbook links
- runbooks/channel/route-latency-high.md
- runbooks/channel/adapter-circuit-open.md (per provider: whatsapp / telegram / viber / voice)
- runbooks/channel/consent-fail-closed-spike.md
- runbooks/channel/mo-routing-failed.md
- runbooks/channel/audit-chain-break.md
- runbooks/channel/webhook-signature-spike.md (possible spoofing)
- runbooks/channel/cost-cap-breach.md (per-tenant investigation)
- runbooks/channel/mirror-lag.md
All runbooks versioned in docs/runbooks/channel/.