SMS Firewall Service — Observability
Version: 1.0
Status: Draft
Owner: Trust & Safety + SRE
Last Updated: 2026-04-21
Companion: SERVICE_OVERVIEW · FAILURE_MODES · SERVICE_READINESS
1. SLIs and SLOs
The firewall is the synchronous gate for inbound MO and transit MT traffic. Its SLOs are hard non-functional requirements (NFRs): every breach degrades MNO bind health and propagates to subscriber experience.
| SLI | SLO | Window | Error budget consumed by |
|---|---|---|---|
| FilterInbound P95 latency | ≤ 30 ms | 5 min | Redis tail latency, regex evaluation, classifier mis-budgeting |
| FilterInbound P99 latency | ≤ 50 ms | 5 min | — |
| EvaluateTransit P95 latency | ≤ 50 ms | 5 min | + sender-id-registry RPC, number-intel HLR lookup |
| EvaluateTransit P99 latency | ≤ 100 ms | 5 min | — |
| Verdict availability (gRPC OK / total) | ≥ 99.99% per region | monthly | Pod restarts, PG/Redis blips |
| Audit-event publish lag (verdict → NATS ACK) | ≤ 1 s P99 | 5 min | Outbox relay throughput |
| DND snapshot age | ≤ 1 h continuous | — | consent-ledger snapshot worker delays |
| Federation freshness (last regulator import) | ≤ 24 h | — | Regulator publishing cadence |
| Bloom rebuild lag (after blocklist.changed) | ≤ 5 s P99 | — | Worker scheduling |
| Cross-region replication lag (kbl→mzr) | ≤ 5 s control-plane / ≤ 10 s audit | — | WAN, JetStream mirror |
| Hash-chain integrity | 100% rows verified daily | 24h | — |
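
The 99.99% monthly availability target translates into a concrete error budget. The sketch below works through that arithmetic; it assumes a 30-day month (the table only says "monthly"), and the function names are illustrative rather than part of the service.

```typescript
// Error-budget arithmetic for an availability SLO.
// Assumption: a 30-day window; the SLO table only says "monthly".
function errorBudgetSeconds(sloTarget: number, windowDays: number): number {
  const windowSeconds = windowDays * 24 * 60 * 60;
  return (1 - sloTarget) * windowSeconds;
}

// Burn rate: observed error ratio divided by the ratio the SLO allows.
// 1.0 spends the budget exactly over the window; 2.0 spends it in half the window.
function burnRate(observedErrorRatio: number, sloTarget: number): number {
  return observedErrorRatio / (1 - sloTarget);
}

const monthlyBudget = errorBudgetSeconds(0.9999, 30);
console.log(`budget: ${monthlyBudget.toFixed(1)} s per 30 days`); // about 259.2 s, roughly 4.3 minutes
console.log(`burn rate at 0.02% errors: ${burnRate(0.0002, 0.9999).toFixed(2)}`);
```

Read this way, an error ratio above 0.0001 corresponds to a burn rate above 1, which is the condition the FirewallVerdictAvailabilityBudget alert in section 5 checks over a 1 h window.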
2. Prometheus metrics
All metrics are exposed at GET /metrics on port 9061 in Prometheus text format and are scraped every 15 s.
2.1 RED metrics (Requests / Errors / Duration)
| Metric | Type | Labels | Description |
|---|---|---|---|
| firewall_requests_total | Counter | rpc, verdict, direction, mno_id, peer_asn | Total verdict calls by outcome |
| firewall_request_duration_seconds | Histogram | rpc, verdict | End-to-end gRPC handler latency |
| firewall_errors_total | Counter | rpc, code (gRPC status) | Errors by code |
Histogram buckets for firewall_request_duration_seconds (upper bounds in seconds):
[0.001, 0.005, 0.010, 0.020, 0.030, 0.050, 0.075, 0.100, 0.150, 0.250, 0.500]
The firewall operates in a sub-100 ms regime, so the buckets are finer at the low end.
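
The SLO thresholds (30 ms, 50 ms, 100 ms) coincide with bucket boundaries, which matters because quantiles computed from a histogram are interpolated within a bucket. The sketch below illustrates that interpolation on cumulative buckets, in the spirit of PromQL's histogram_quantile; the counts are made-up sample data and the helper is not part of the service.

```typescript
// Cumulative histogram buckets, as exposed through the `le` label.
interface Bucket {
  le: number;         // upper bound in seconds; Infinity for the +Inf bucket
  cumulative: number; // observations with value <= le
}

// Estimate quantile q (0 < q <= 1) by linear interpolation inside the bucket
// that contains the target rank. Assumes at least one finite bucket.
function estimateQuantile(q: number, buckets: Bucket[]): number {
  const sorted = [...buckets].sort((a, b) => a.le - b.le);
  const total = sorted[sorted.length - 1].cumulative;
  if (total === 0) return NaN;
  const rank = q * total;
  const i = sorted.findIndex((b) => b.cumulative >= rank);
  const upper = sorted[i];
  // Rank falls in the +Inf bucket: report the highest finite bound.
  if (upper.le === Infinity) return sorted[i - 1].le;
  const lowerBound = i === 0 ? 0 : sorted[i - 1].le;
  const lowerCount = i === 0 ? 0 : sorted[i - 1].cumulative;
  const inBucket = upper.cumulative - lowerCount;
  return lowerBound + (upper.le - lowerBound) * ((rank - lowerCount) / inBucket);
}

// Hypothetical sample: 100 requests, 95 of them at or under 30 ms.
const exampleBuckets: Bucket[] = [
  { le: 0.01, cumulative: 50 },
  { le: 0.02, cumulative: 80 },
  { le: 0.03, cumulative: 95 },
  { le: 0.05, cumulative: 100 },
  { le: Infinity, cumulative: 100 },
];
console.log(estimateQuantile(0.95, exampleBuckets)); // close to 0.030
console.log(estimateQuantile(0.99, exampleBuckets)); // close to 0.046
```

Because a boundary sits exactly at each SLO threshold, the estimate crosses the threshold only when the share of requests under that boundary actually drops below the quantile; no interpolation error is involved at the decision point.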
2.2 Verdict & rule metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| firewall_verdict_total | Counter | verdict, direction, block_reason, mno_bind_id, peer_asn | Verdict distribution |
| firewall_rule_eval_seconds | Histogram | rule_id, rule_type | Per-rule evaluation latency |
| firewall_rule_match_total | Counter | rule_id, rule_type, action | Rule hit counts |
| firewall_rule_degraded_total | Counter | rule_id, reason | Auto-disabled rules (REGEX_TIMEOUT, CLASSIFIER_UNAVAILABLE) |
| firewall_active_rules | Gauge | scope, rule_type | Active rule count |
| firewall_rule_set_version | Gauge | (none) | Currently loaded rule-set version |
2.3 Cache & data-plane infra
| Metric | Type | Labels | Description |
|---|---|---|---|
| firewall_verdict_cache_hits_total | Counter | verdict | fw:verdict:* Redis hits |
| firewall_verdict_cache_misses_total | Counter | — | Misses |
| firewall_bloom_check_total | Counter | bloom_kind (blocklist/dnd), result (hit/miss) | Bloom lookup outcomes |
| firewall_bloom_unavailable_total | Counter | bloom_kind | Times Bloom fell through to PG |
| firewall_rate_governor_skip_total | Counter | scope | Rate governor skipped (Redis unavailable) |
| firewall_rate_governor_block_total | Counter | scope, window | Rate governor BLOCKED |
| firewall_pg_pool_waiting_clients | Gauge | — | PgBouncer wait queue |
| firewall_pg_query_duration_seconds | Histogram | query | PG round-trip |
| firewall_redis_command_duration_seconds | Histogram | command | Redis round-trip |
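
The two rate-governor counters imply three outcomes per check: allowed, blocked, or skipped when Redis is unavailable. The sketch below models that decision with an in-memory stand-in for the Redis sorted set (section 4 lists ZADD/ZCARD for this step). The trimming step, the strict "greater than threshold" comparison, and all names here are assumptions for illustration, not the service's implementation.

```typescript
type RateDecision = "ALLOW" | "BLOCK" | "SKIP";

// In-memory stand-in for a Redis sorted set keyed by sender or scope.
class SlidingWindowStore {
  private readonly events = new Map<string, number[]>();

  // Record one event at nowMs and return how many events fall inside the
  // window ending at nowMs. Blocked attempts are recorded too.
  recordAndCount(key: string, nowMs: number, windowMs: number): number {
    const kept = (this.events.get(key) ?? []).filter((t) => t > nowMs - windowMs);
    kept.push(nowMs);
    this.events.set(key, kept);
    return kept.length;
  }
}

function governRate(
  store: SlidingWindowStore | null,
  key: string,
  nowMs: number,
  windowMs: number,
  threshold: number,
): RateDecision {
  // Store unavailable: skip the check rather than fail the verdict
  // (the case counted by firewall_rate_governor_skip_total).
  if (store === null) return "SKIP";
  const count = store.recordAndCount(key, nowMs, windowMs);
  // Over the limit: the case counted by firewall_rate_governor_block_total.
  return count > threshold ? "BLOCK" : "ALLOW";
}

const demoStore = new SlidingWindowStore();
for (const t of [0, 100, 200]) {
  console.log(t, governRate(demoStore, "peer:demo", t, 1000, 2));
}
```

A sustained rise in the skip counter therefore means traffic is passing without rate enforcement, which is worth reading alongside firewall_redis_command_duration_seconds.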
2.4 Quarantine & federation
| Metric | Type | Labels | Description |
|---|---|---|---|
| firewall_quarantine_depth | Gauge | direction, status | Hold queue depth |
| firewall_quarantine_review_duration_seconds | Histogram | outcome (released/rejected/auto_expired) | Time hold spent in queue |
| firewall_federation_import_total | Counter | source, result (success/signature_invalid) | Federation imports |
| firewall_federation_export_lag_seconds | Gauge | — | Time since last successful export |
| firewall_blocklist_entries_total | Gauge | source, active | Blocklist size |
| firewall_dnd_snapshot_age_seconds | Gauge | — | Age of currently-loaded DND snapshot |
2.5 SIM-box / AIT / fraud-intel
| Metric | Type | Labels | Description |
|---|---|---|---|
| firewall_simbox_signals_total | Gauge | active | Active SIM-box signals |
| firewall_simbox_block_total | Counter | (none) | MOs blocked by SIM-box signature |
| firewall_ait_patterns_total | Gauge | pattern_type, active | Active AIT patterns |
| firewall_ait_block_total | Counter | pattern_type | MOs blocked by AIT signature |
| firewall_fraudintel_event_age_seconds | Gauge | event_type | Age of last consumed fraud.detected.* event |
2.6 Peer & MNO bind
| Metric | Type | Labels | Description |
|---|---|---|---|
| firewall_peer_hygiene_score | Gauge | peer_id | Current rolling 24h peer score (0–100) |
| firewall_peer_quarantined | Gauge | peer_id | 1 if quarantined |
| firewall_bind_heartbeat_age_seconds | Gauge | mno_bind_id | Time since last connector heartbeat |
2.7 Operating mode
| Metric | Type | Labels | Description |
|---|---|---|---|
| firewall_mode_current | Gauge | mode | One-hot per mode (NORMAL/DEGRADED/PANIC/MAINTENANCE) |
| firewall_mode_panic_active | Gauge | (none) | 1 when in PANIC |
| firewall_mode_changed_total | Counter | from, to, trigger | Mode-switch events |
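
"One-hot per mode" means one sample per mode label, with exactly one sample set to 1 at any time. A minimal sketch of that encoding follows; the helper names are hypothetical, and the rendered lines only imitate the Prometheus text format for illustration.

```typescript
const MODES = ["NORMAL", "DEGRADED", "PANIC", "MAINTENANCE"] as const;
type Mode = (typeof MODES)[number];

// One value per mode; only the current mode carries 1.
function oneHotMode(current: Mode): Record<Mode, 0 | 1> {
  const out = {} as Record<Mode, 0 | 1>;
  for (const m of MODES) out[m] = m === current ? 1 : 0;
  return out;
}

// Render the encoding as exposition-style sample lines.
function renderModeGauge(current: Mode): string[] {
  const encoded = oneHotMode(current);
  return MODES.map((m) => `firewall_mode_current{mode="${m}"} ${encoded[m]}`);
}

console.log(renderModeGauge("PANIC").join("\n"));
```

Under this encoding, firewall_mode_panic_active carries the same value as the PANIC sample of firewall_mode_current, and summing firewall_mode_current over the mode label should always yield 1 per instance.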
2.8 Audit + outbox + replication
| Metric | Type | Labels | Description |
|---|---|---|---|
| firewall_audit_rows_total | Counter | direction, verdict | Audit insert rate |
| firewall_audit_chain_break_total | Counter | (none) | Chain-break detections (target: 0) |
| firewall_outbox_unpublished_total | Gauge | — | Backlog depth |
| firewall_outbox_publish_duration_seconds | Histogram | — | Publish latency |
| firewall_jetstream_mirror_lag_seconds | Gauge | stream | JetStream mirror lag |
| firewall_pg_replication_lag_seconds | Gauge | stream | Postgres logical replication lag |
2.9 Classifier
| Metric | Type | Labels | Description |
|---|---|---|---|
| firewall_classifier_requests_total | Counter | model_version, cache_hit | Classifier invocations |
| firewall_classifier_duration_seconds | Histogram | model_version | Inference latency |
| firewall_classifier_timeout_total | Counter | — | 15 ms timeout breaches |
| firewall_classifier_skip_total | Counter | reason (PANIC_MODE, BUDGET_SKIP, UNAVAILABLE) | Classifier skipped |
3. Structured logs (Pino)
All log output is valid JSON. PII redaction is enforced at logger level by the Pino redactor: the fields srcMsisdn, dstMsisdn, pduBody, and senderId are masked. An ESLint rule prevents direct logging of the PDU body.
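
The verdict log sample in 3.1 carries masked fields of the form "+93701***". The sketch below shows one way such a value could be derived before a record is built. The six-character visible prefix is inferred from that sample, and both helpers are hypothetical; the actual masking policy is not specified in this document.

```typescript
// Keep a short prefix and replace the remainder with a fixed marker.
// visiblePrefix = 6 reproduces the "+93701***" shape; this is an assumption.
function maskMsisdn(msisdn: string, visiblePrefix = 6): string {
  if (msisdn.length <= visiblePrefix) return "***";
  return msisdn.slice(0, visiblePrefix) + "***";
}

interface RawParties {
  srcMsisdn: string;
  dstMsisdn: string;
}

// Build only the masked fields, so the raw values never reach the logger.
function toLogFields(raw: RawParties): { srcMsisdnMasked: string; dstMsisdnMasked: string } {
  return {
    srcMsisdnMasked: maskMsisdn(raw.srcMsisdn),
    dstMsisdnMasked: maskMsisdn(raw.dstMsisdn),
  };
}

console.log(toLogFields({ srcMsisdn: "+93701234567", dstMsisdn: "+93702345678" }));
```

Deriving the masked fields up front complements logger-level redaction: the redactor remains the safety net for any raw field that slips into a log call.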
3.1 Verdict log (sampled 1% for ALLOW; 100% for FLAG/BLOCK/QUARANTINE)
{
  "level": "info",
  "time": "2026-04-21T10:14:23.123Z",
  "event": "firewall.verdict",
  "verdictId": "fv_01HKX...",
  "rpc": "FilterInbound",
  "direction": "MO",
  "verdict": "BLOCK",
  "blockReason": "ORIGIN_BLOCKLIST",
  "mnoBindId": "awcc-rx-01",
  "srcMsisdnMasked": "+93701***",
  "dstMsisdnMasked": "+93702***",
  "evaluatedRuleIds": ["fr_01HKX...", "fr_01HKY..."],
  "ruleHits": [{ "ruleId": "fr_01HKY...", "action": "BLOCK", "severity": "HIGH" }],
  "operatingMode": "NORMAL",
  "evaluationLatencyMs": 14,
  "flags": [],
  "traceId": "00-abc123-def456-01"
}
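
The sampling policy for this log (1% of ALLOW, 100% of FLAG/BLOCK/QUARANTINE) can be expressed as a small decision function. This is a sketch under the assumption of independent random sampling per verdict; the source does not say how the 1% is selected, and the names below are illustrative.

```typescript
type Verdict = "ALLOW" | "FLAG" | "BLOCK" | "QUARANTINE";

const ALLOW_SAMPLE_RATE = 0.01;

// Every non-ALLOW verdict is logged; ALLOW is logged with probability 1%.
// The random source is injectable so the policy can be tested deterministically.
function shouldLogVerdict(verdict: Verdict, random: () => number = Math.random): boolean {
  if (verdict !== "ALLOW") return true;
  return random() < ALLOW_SAMPLE_RATE;
}

console.log(shouldLogVerdict("BLOCK")); // true
console.log(shouldLogVerdict("ALLOW", () => 0.5)); // false
```

Since ALLOW records are sampled, verdict volumes should be taken from firewall_verdict_total rather than from log counts.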
3.2 Quarantine event log
{
  "level": "info",
  "event": "firewall.quarantine.held",
  "holdId": "fq_01HKX...",
  "verdictId": "fv_01HKX...",
  "triggerRuleIds": ["fr_classifier-phishing-v3"],
  "reasonCode": "CLASSIFIER_PHISHING",
  "expiresAt": "2026-04-22T10:14:23Z"
}
3.3 Federation log
{
  "level": "info",
  "event": "firewall.federation.import",
  "source": "REGULATOR",
  "regulatorRef": "ATRA-2026-04-21-001",
  "addedCount": 142,
  "removedCount": 7,
  "signatureValid": true,
  "importBatchId": "imp_01HKX..."
}
3.4 Mode change log
{
  "level": "warn",
  "event": "firewall.mode.changed",
  "previousMode": "NORMAL",
  "newMode": "PANIC",
  "trigger": "AUTO_LATENCY_BREACH",
  "latencyP95Ms": 112,
  "breachedForSeconds": 65,
  "disabledRuleTypes": ["REGEX", "CLASSIFIER"]
}
3.5 Error log
{
  "level": "error",
  "event": "firewall.evaluation.error",
  "rpc": "FilterInbound",
  "errorType": "pg_unavailable",
  "failClosedAction": "MO_WAL_REPLAY_ENQUEUED",
  "err": { "message": "Connection timeout", "code": "ECONNREFUSED" }
}
4. OpenTelemetry tracing
Parent span: firewall.FilterInbound or firewall.EvaluateTransit. Trace context propagated via grpc-trace-bin (W3C). Sampling: 100% for FLAG/BLOCK/QUARANTINE; 1% head-sampled for ALLOW.
| Span | Operation | Attributes |
|---|---|---|
| firewall.cache.verdict | Redis GET fw:verdict:* | cache.hit |
| firewall.bloom.origin | BF.EXISTS fw:blocklist:national | bloom.hit |
| firewall.pg.blocklist | PG SELECT (after Bloom hit) | pg.rows |
| firewall.geo.check | In-process MCC/MNC lookup + numint Lookup | numint.available |
| firewall.rate.governor | Redis ZADD/ZCARD | window, current_count, threshold |
| firewall.bloom.dnd | BF.EXISTS fw:dnd:bloom | bloom.hit |
| firewall.rules.evaluate | CEL-style rule pipeline | rules_evaluated, verdict |
| firewall.classifier.invoke | Local LLM HTTP call | model_version, cache.hit, inference_ms |
| firewall.audit.write | PG INSERT + outbox row | verdict_id, partition |
| firewall.quarantine.write | PG INSERT (encrypted PDU) | hold_id, kek_id |
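
The traceId in the 3.1 verdict log sample follows the four-field layout of a W3C traceparent value (version, trace ID, parent span ID, flags), with the IDs shortened for readability. Assuming the text form of that header is what gets logged, the sketch below parses and validates a full-length value. The parser is illustrative, handles only the version-00 layout, and is not taken from the service.

```typescript
interface TraceParent {
  version: string;
  traceId: string;  // 32 lowercase hex characters
  parentId: string; // 16 lowercase hex characters
  sampled: boolean; // lowest bit of the trace-flags byte
}

// Returns null for anything that is not a well-formed version-00 style value.
function parseTraceparent(value: string): TraceParent | null {
  const m = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(value);
  if (m === null) return null;
  const [, version, traceId, parentId, flags] = m;
  if (version === "ff") return null; // reserved, invalid version
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null; // all-zero IDs are invalid
  return { version, traceId, parentId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

// Example value in the format used by the W3C Trace Context specification.
console.log(parseTraceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"));
```

The sampled flag is what lets downstream spans in the table above honour the sampling decision made at the parent span.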
5. Alerting rules (Prometheus / AlertManager YAML)
groups:
  - name: firewall.slo
    rules:
      - alert: FirewallFilterInboundLatencyHigh
        # Aggregate across verdicts and instances before taking the quantile.
        expr: |
          histogram_quantile(0.95,
            sum by (le) (rate(firewall_request_duration_seconds_bucket{rpc="FilterInbound"}[5m]))
          ) > 0.030
        for: 5m
        labels: { severity: high, service: sms-firewall }
        annotations:
          summary: "FilterInbound P95 > 30ms for 5m"
          runbook: "https://runbooks.ghasi.af/firewall/latency-high"
      - alert: FirewallEvaluateTransitLatencyHigh
        expr: |
          histogram_quantile(0.95,
            sum by (le) (rate(firewall_request_duration_seconds_bucket{rpc="EvaluateTransit"}[5m]))
          ) > 0.050
        for: 5m
        labels: { severity: high, service: sms-firewall }
        annotations:
          runbook: "https://runbooks.ghasi.af/firewall/latency-high"
      - alert: FirewallVerdictAvailabilityBudget
        expr: |
          sum(rate(firewall_errors_total[1h])) /
          sum(rate(firewall_requests_total[1h])) > 0.0001
        for: 10m
        labels: { severity: critical, service: sms-firewall }
        annotations:
          summary: "Firewall verdict availability < 99.99% over 1h"
      - alert: FirewallAuditPublishLagHigh
        expr: |
          histogram_quantile(0.99,
            sum by (le) (rate(firewall_outbox_publish_duration_seconds_bucket[5m]))
          ) > 1.0
        for: 5m
        labels: { severity: high }
  - name: firewall.federation
    rules:
      - alert: FirewallBlocklistFederationStale
        # The gauge is already an age in seconds (section 2.4); fire when it exceeds 24h.
        expr: firewall_federation_export_lag_seconds > 86400
        for: 30m
        labels: { severity: medium }
        annotations:
          summary: "No regulator blocklist import in > 24h"
      - alert: FirewallFederationSignatureInvalid
        expr: increase(firewall_federation_import_total{result="signature_invalid"}[15m]) > 0
        labels: { severity: critical }
        annotations:
          summary: "Regulator HSM signature failed validation"
          runbook: "https://runbooks.ghasi.af/firewall/federation-signature-invalid"
  - name: firewall.simbox.ait
    rules:
      - alert: FirewallSimboxSurge
        expr: increase(firewall_simbox_block_total[5m]) > 1000
        for: 2m
        labels: { severity: high }
        annotations:
          summary: "SIM-box block rate spike (> 1000 in 5m)"
      - alert: FirewallAitSurge
        expr: increase(firewall_ait_block_total[5m]) > 5000
        for: 2m
        labels: { severity: high }
      - alert: FirewallFraudIntelStale
        expr: firewall_fraudintel_event_age_seconds > 3600
        for: 30m
        labels: { severity: medium }
  - name: firewall.dnd
    rules:
      - alert: FirewallDndSnapshotStale
        expr: firewall_dnd_snapshot_age_seconds > 21600 # 6h
        for: 15m
        labels: { severity: medium }
        annotations:
          summary: "National DND projection > 6h stale"
  - name: firewall.audit
    rules:
      - alert: FirewallAuditChainBreak
        expr: increase(firewall_audit_chain_break_total[1h]) > 0
        labels: { severity: critical }
        annotations:
          summary: "Audit hash-chain integrity failure; engage Security IR"
          runbook: "https://runbooks.ghasi.af/firewall/audit-chain-break"
      - alert: FirewallPartitionMissing
        expr: |
          (firewall_audit_partitions_provisioned_count - firewall_audit_partitions_required_count) < 0
        labels: { severity: high }
  - name: firewall.mode
    rules:
      - alert: FirewallPanicEntered
        expr: firewall_mode_panic_active == 1
        for: 1m
        labels: { severity: critical }
        annotations:
          summary: "Firewall in PANIC mode"
          runbook: "https://runbooks.ghasi.af/firewall/panic-mode-entered"
  - name: firewall.peers
    rules:
      - alert: FirewallPeerDegraded
        expr: firewall_peer_hygiene_score < 60
        for: 15m
        labels: { severity: medium }
      - alert: FirewallPeerAutoQuarantined
        # firewall_peer_quarantined is a gauge, so delta() detects the 0 -> 1 flip.
        expr: delta(firewall_peer_quarantined[5m]) > 0
        labels: { severity: high }
        annotations:
          summary: "Peer auto-quarantined; carrier-relations should engage"
  - name: firewall.replication
    rules:
      - alert: FirewallJetStreamMirrorLag
        expr: firewall_jetstream_mirror_lag_seconds{stream="FIREWALL_AUDIT"} > 30
        for: 5m
        labels: { severity: high }
        annotations:
          summary: "Audit stream mirror lag > 30s; DR exposure"
  - name: firewall.bind.heartbeat
    rules:
      - alert: FirewallBindMissing
        expr: firewall_bind_heartbeat_age_seconds > 60
        for: 2m
        labels: { severity: high }
        annotations:
          summary: "smpp-connector heartbeat missing; bind may be down"
6. Grafana dashboards
Dashboard: dashboards/sms-firewall.json (in monorepo).
| Panel | Query | Visualisation |
|---|---|---|
| Verdict rate (RPS) | sum(rate(firewall_requests_total[1m])) by (verdict) | Stacked area |
| FilterInbound P50/P95/P99 | histogram_quantile(0.{50,95,99}, ...) | Time series, 3 lines |
| EvaluateTransit P50/P95/P99 | same for rpc="EvaluateTransit" | Time series |
| Verdict distribution (24h) | sum(increase(firewall_verdict_total[24h])) by (verdict) | Pie chart |
| Block-reason breakdown | sum(increase(firewall_verdict_total{verdict="BLOCK"}[$__range])) by (block_reason) | Bar chart |
| Quarantine queue depth | firewall_quarantine_depth{status="PENDING"} | Gauge + time series |
| Top hot rules (24h) | topk(10, sum(rate(firewall_rule_match_total[24h])) by (rule_id)) | Bar chart |
| Bloom hit rate | sum(rate(firewall_bloom_check_total{result="hit"}[5m])) / sum(rate(firewall_bloom_check_total[5m])) | Gauge |
| Federation lag | firewall_federation_export_lag_seconds | Single stat |
| DND snapshot age | firewall_dnd_snapshot_age_seconds | Single stat |
| Operating mode timeline | firewall_mode_current | State timeline |
| SIM-box / AIT block volume | rate(firewall_simbox_block_total[5m]), rate(firewall_ait_block_total[5m]) | Time series |
| Peer hygiene heatmap | max by (peer_id) (firewall_peer_hygiene_score) | Heatmap |
| Audit chain break (must be 0) | firewall_audit_chain_break_total | Single stat (red if > 0) |
| JetStream mirror lag | firewall_jetstream_mirror_lag_seconds | Time series |
| Per-MNO block rate (fairness) | sum by (mno_id) (rate(firewall_requests_total{verdict="BLOCK"}[1h])) / sum by (mno_id) (rate(firewall_requests_total[1h])) | Bar chart |
| Classifier latency + cache hit | firewall_classifier_duration_seconds P95 + cache-hit rate | Dual-axis |
Dashboard tags: service:sms-firewall, environment:prod, region:{kbl,mzr}.
7. Runbook links
- runbooks/firewall/latency-high.md: diagnosis of classifier mis-budget, catastrophic regex backtracking, slow PG queries, GC pauses
- runbooks/firewall/panic-mode-entered.md: confirm auto-trip vs manual; investigate root cause; manual recovery
- runbooks/firewall/quarantine-backlog.md: reviewer staffing, bulk-action, tier escalation
- runbooks/firewall/federation-signature-invalid.md: verify regulator HSM key in Vault, escalate to regulator-liaison
- runbooks/firewall/audit-chain-break.md: Security IR engagement, Postgres forensics, restore from JetStream mirror
- runbooks/firewall/bind-missing.md: connector pod restart, SVID rotation check, SMPP bind diagnostics
- runbooks/firewall/simbox-surge.md: coordinate with fraud-intel; check for systemic vs spike
- runbooks/firewall/dnd-snapshot-stale.md: consent-ledger consumer health, manual rebuild
All runbooks are versioned in docs/runbooks/firewall/.