SMS Firewall Service — Observability

Version: 1.0 | Status: Draft | Owner: Trust & Safety + SRE | Last Updated: 2026-04-21 | Companion: SERVICE_OVERVIEW · FAILURE_MODES · SERVICE_READINESS


1. SLIs and SLOs

The firewall is the synchronous gate for inbound mobile-originated (MO) and transit mobile-terminated (MT) traffic. SLOs are hard non-functional requirements (NFRs): every breach degrades MNO bind health and propagates to subscriber experience.

| SLI | SLO | Window | Error budget consumed by |
| --- | --- | --- | --- |
| FilterInbound P95 latency | ≤ 30 ms | 5 min | Redis tail latency, regex evaluation, classifier mis-budgeting |
| FilterInbound P99 latency | ≤ 50 ms | 5 min | |
| EvaluateTransit P95 latency | ≤ 50 ms | 5 min | + sender-id-registry RPC, number-intel HLR lookup |
| EvaluateTransit P99 latency | ≤ 100 ms | 5 min | |
| Verdict availability (gRPC OK / total) | ≥ 99.99% per region | monthly | Pod restarts, PG/Redis blips |
| Audit-event publish lag (verdict → NATS ACK) | ≤ 1 s P99 | 5 min | Outbox relay throughput |
| DND snapshot age | ≤ 1 h | continuous | consent-ledger snapshot worker delays |
| Federation freshness (last regulator import) | ≤ 24 h | | Regulator publishing cadence |
| Bloom rebuild lag (after blocklist.changed) | ≤ 5 s P99 | | Worker scheduling |
| Cross-region replication lag (kbl→mzr) | ≤ 5 s control-plane / ≤ 10 s audit | | WAN, JetStream mirror |
| Hash-chain integrity | 100% rows verified daily | 24h | |
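
To make the availability target concrete, the sketch below converts an SLO percentage into an error budget. It is illustrative arithmetic only: the helper name `errorBudget`, the 30-day month, and the request volume are assumptions, not values from this service. For a 30-day window (43,200 minutes), a 99.99% target leaves 0.01% of the window, which is 4.32 minutes of full outage.

```typescript
// Illustrative error-budget arithmetic for an availability SLO.
// `errorBudget` is a hypothetical helper, not part of the firewall codebase.
function errorBudget(sloPercent: number, windowMinutes: number, totalRequests: number) {
  // Fraction of requests (or time) that may fail without breaching the SLO.
  const allowedFailureRatio = (100 - sloPercent) / 100;
  return {
    allowedFailureRatio,
    // Failed requests the window can absorb at the given traffic volume.
    allowedFailedRequests: totalRequests * allowedFailureRatio,
    // Equivalent budget expressed as minutes of total unavailability.
    allowedFullOutageMinutes: windowMinutes * allowedFailureRatio,
  };
}

// Assumed inputs: a 30-day month and one billion verdict calls per region.
const monthlyBudget = errorBudget(99.99, 30 * 24 * 60, 1_000_000_000);
console.log(monthlyBudget.allowedFullOutageMinutes.toFixed(2), "minutes of full outage");
console.log(Math.round(monthlyBudget.allowedFailedRequests), "failed requests");
```

The same ratio (0.0001) is the threshold used by the availability alert in section 5.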

2. Prometheus metrics

All metrics are exposed at GET /metrics on port 9061 in Prometheus text format and are scraped every 15 s.

2.1 RED metrics (Requests / Errors / Duration)

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| firewall_requests_total | Counter | rpc, verdict, direction, mno_id, peer_asn | Total verdict calls by outcome |
| firewall_request_duration_seconds | Histogram | rpc, verdict | End-to-end gRPC handler latency |
| firewall_errors_total | Counter | rpc, code (gRPC status) | Errors by code |

Histogram buckets for firewall_request_duration_seconds: [0.001, 0.005, 0.010, 0.020, 0.030, 0.050, 0.075, 0.100, 0.150, 0.250, 0.500] (firewall is sub-100 ms regime; finer buckets at low end).
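
The bucket layout matters because Prometheus estimates quantiles by interpolating linearly inside the bucket that contains the target rank. The sketch below is a simplified re-implementation of that interpolation, written only to show why bounds at 0.030 and 0.050 line up with the SLO thresholds. The function name and the sample counts are made up for illustration.

```typescript
// Upper bounds (seconds) from the bucket list above; Prometheus adds +Inf implicitly.
const BOUNDS = [0.001, 0.005, 0.010, 0.020, 0.030, 0.050, 0.075, 0.100, 0.150, 0.250, 0.500, Infinity];

// Simplified sketch of the linear interpolation behind histogram_quantile().
// cumulative[i] is the number of observations <= BOUNDS[i]; q is expected in (0, 1].
function estimateQuantile(q: number, cumulative: number[]): number {
  const total = cumulative[cumulative.length - 1];
  if (total === 0) return NaN;
  const rank = q * total;
  const i = cumulative.findIndex((c) => c >= rank);
  // Rank falls in the +Inf bucket: report the highest finite bound instead.
  if (i === BOUNDS.length - 1) return BOUNDS[BOUNDS.length - 2];
  const lower = i === 0 ? 0 : BOUNDS[i - 1];
  const below = i === 0 ? 0 : cumulative[i - 1];
  const inBucket = cumulative[i] - below;
  return lower + (BOUNDS[i] - lower) * ((rank - below) / inBucket);
}

// Hypothetical 5-minute sample of 1000 FilterInbound calls.
const sample = [50, 300, 600, 850, 940, 990, 998, 1000, 1000, 1000, 1000, 1000];
console.log("estimated P95 (s):", estimateQuantile(0.95, sample).toFixed(3));
```

With these made-up counts, rank 950 lands in the 0.030 to 0.050 bucket, so the estimate exceeds the 30 ms target. An estimate can never be more precise than the bucket it falls in, which is why the low end is densely bucketed.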

2.2 Verdict & rule metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| firewall_verdict_total | Counter | verdict, direction, block_reason, mno_bind_id, peer_asn | Verdict distribution |
| firewall_rule_eval_seconds | Histogram | rule_id, rule_type | Per-rule evaluation latency |
| firewall_rule_match_total | Counter | rule_id, rule_type, action | Rule hit counts |
| firewall_rule_degraded_total | Counter | rule_id, reason | Auto-disabled rules (REGEX_TIMEOUT, CLASSIFIER_UNAVAILABLE) |
| firewall_active_rules | Gauge | scope, rule_type | Active rule count |
| firewall_rule_set_version | Gauge | (none) | Currently loaded rule-set version |

2.3 Cache & data-plane infra

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| firewall_verdict_cache_hits_total | Counter | verdict | fw:verdict:* Redis hits |
| firewall_verdict_cache_misses_total | Counter | | Misses |
| firewall_bloom_check_total | Counter | bloom_kind (blocklist/dnd), result (hit/miss) | Bloom lookup outcomes |
| firewall_bloom_unavailable_total | Counter | bloom_kind | Times Bloom fell through to PG |
| firewall_rate_governor_skip_total | Counter | scope | Rate governor skipped (Redis unavailable) |
| firewall_rate_governor_block_total | Counter | scope, window | Rate governor BLOCKED |
| firewall_pg_pool_waiting_clients | Gauge | | PgBouncer wait queue |
| firewall_pg_query_duration_seconds | Histogram | query | PG round-trip |
| firewall_redis_command_duration_seconds | Histogram | command | Redis round-trip |
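
The Bloom metrics above imply a three-way lookup: a miss is final, a hit is confirmed in Postgres, and an unavailable filter falls through to Postgres. The following is a minimal sketch of that flow under stated assumptions. The `BlocklistDeps` interface and its three functions are hypothetical stand-ins, and the calls are synchronous for brevity where real Redis and PG calls would be asynchronous.

```typescript
type BloomResult = "hit" | "miss" | "unavailable";

// Hypothetical dependencies; the names are illustrative, not the service's API.
interface BlocklistDeps {
  bloomCheck(msisdn: string): BloomResult; // stand-in for BF.EXISTS
  pgIsBlocked(msisdn: string): boolean; // stand-in for the authoritative PG SELECT
  count(metric: string, labels: Record<string, string>): void; // stand-in metrics hook
}

function isOriginBlocked(msisdn: string, deps: BlocklistDeps): boolean {
  const result = deps.bloomCheck(msisdn);
  if (result === "unavailable") {
    // Filter cannot answer: fall through to Postgres and record the event.
    deps.count("firewall_bloom_unavailable_total", { bloom_kind: "blocklist" });
    return deps.pgIsBlocked(msisdn);
  }
  deps.count("firewall_bloom_check_total", { bloom_kind: "blocklist", result });
  // A Bloom filter has no false negatives, so a miss needs no PG round-trip.
  if (result === "miss") return false;
  // A hit may be a false positive, so confirm against Postgres.
  return deps.pgIsBlocked(msisdn);
}
```

The design point is that only hits and outages pay for a PG round-trip, which keeps the common (miss) path inside the latency budget.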

2.4 Quarantine & federation

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| firewall_quarantine_depth | Gauge | direction, status | Hold queue depth |
| firewall_quarantine_review_duration_seconds | Histogram | outcome (released/rejected/auto_expired) | Time hold spent in queue |
| firewall_federation_import_total | Counter | source, result (success/signature_invalid) | Federation imports |
| firewall_federation_export_lag_seconds | Gauge | | Time since last successful export |
| firewall_blocklist_entries_total | Gauge | source, active | Blocklist size |
| firewall_dnd_snapshot_age_seconds | Gauge | | Age of currently-loaded DND snapshot |

2.5 SIM-box / AIT / fraud-intel

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| firewall_simbox_signals_total | Gauge | active | Active SIM-box signals |
| firewall_simbox_block_total | Counter | (none) | MOs blocked by SIM-box signature |
| firewall_ait_patterns_total | Gauge | pattern_type, active | Active AIT patterns |
| firewall_ait_block_total | Counter | pattern_type | MOs blocked by AIT signature |
| firewall_fraudintel_event_age_seconds | Gauge | event_type | Age of last consumed fraud.detected.* event |

2.6 Peer & MNO bind

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| firewall_peer_hygiene_score | Gauge | peer_id | Current rolling 24h peer score (0–100) |
| firewall_peer_quarantined | Gauge | peer_id | 1 if quarantined |
| firewall_bind_heartbeat_age_seconds | Gauge | mno_bind_id | Time since last connector heartbeat |

2.7 Operating mode

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| firewall_mode_current | Gauge | mode | One-hot per mode (NORMAL/DEGRADED/PANIC/MAINTENANCE) |
| firewall_mode_panic_active | Gauge | (none) | 1 when in PANIC |
| firewall_mode_changed_total | Counter | from, to, trigger | Mode-switch events |
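
"One-hot" here means one series per mode, with exactly one of them set to 1. A small sketch of what the exposition output could look like follows. The function `renderModeMetrics` is a hypothetical helper that builds the text by hand to stay dependency-free; a real service would more likely use a Prometheus client library.

```typescript
const MODES = ["NORMAL", "DEGRADED", "PANIC", "MAINTENANCE"] as const;
type Mode = (typeof MODES)[number];

// Render the operating-mode gauges in Prometheus text exposition format.
// Exactly one firewall_mode_current series is 1; all others are 0.
function renderModeMetrics(current: Mode): string {
  const lines = MODES.map(
    (m) => `firewall_mode_current{mode="${m}"} ${m === current ? 1 : 0}`,
  );
  lines.push(`firewall_mode_panic_active ${current === "PANIC" ? 1 : 0}`);
  return lines.join("\n") + "\n";
}

console.log(renderModeMetrics("PANIC"));
```

Emitting explicit zeros for the inactive modes keeps every series continuous, so a state-timeline panel can draw transitions without gaps, and `firewall_mode_current == 1` always returns the single active mode.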

2.8 Audit + outbox + replication

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| firewall_audit_rows_total | Counter | direction, verdict | Audit insert rate |
| firewall_audit_chain_break_total | Counter | (none) | Chain-break detections (target: 0) |
| firewall_outbox_unpublished_total | Gauge | | Backlog depth |
| firewall_outbox_publish_duration_seconds | Histogram | | Publish latency |
| firewall_jetstream_mirror_lag_seconds | Gauge | stream | JetStream mirror lag |
| firewall_pg_replication_lag_seconds | Gauge | stream | Postgres logical replication lag |
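
A chain-break detection, as counted by firewall_audit_chain_break_total, can be pictured as a walk over audit rows in which each row's hash must cover its predecessor's hash. The sketch below illustrates that idea only. The row shape (`AuditRow`), the SHA-256 choice, and the all-zero genesis value are assumptions; the actual audit schema is not described in this document.

```typescript
import { createHash } from "node:crypto";

// Hypothetical row shape; the real audit table layout is not specified here.
interface AuditRow {
  payload: string;
  prevHash: string;
  hash: string;
}

// Assumed starting value for the first row in the chain.
const GENESIS = "0".repeat(64);

// Hash of a row: SHA-256 over the previous hash followed by the payload.
function rowHash(prevHash: string, payload: string): string {
  return createHash("sha256").update(prevHash).update(payload).digest("hex");
}

// Returns the index of the first row that breaks the chain, or -1 if intact.
function firstChainBreak(rows: AuditRow[]): number {
  let prev = GENESIS;
  for (let i = 0; i < rows.length; i++) {
    const row = rows[i];
    if (row.prevHash !== prev || row.hash !== rowHash(prev, row.payload)) return i;
    prev = row.hash;
  }
  return -1;
}
```

Because every hash depends on all earlier rows, editing or deleting a single row invalidates that row or its successor, so a daily full verification can pinpoint where tampering began.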

2.9 Classifier

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| firewall_classifier_requests_total | Counter | model_version, cache_hit | Classifier invocations |
| firewall_classifier_duration_seconds | Histogram | model_version | Inference latency |
| firewall_classifier_timeout_total | Counter | | 15 ms timeout breaches |
| firewall_classifier_skip_total | Counter | reason (PANIC_MODE, BUDGET_SKIP, UNAVAILABLE) | Classifier skipped |

3. Structured logs (Pino)

All log output is valid JSON. PII redaction is enforced by the Pino redactor: the fields srcMsisdn, dstMsisdn, pduBody, and senderId are masked at logger level. An ESLint rule prevents direct logging of the PDU body.
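
As a rough picture of what logger-level masking does, the sketch below shows two hypothetical helpers: one that masks an MSISDN into the `+93701***` shape seen in the sample logs, and one that blanks the four sensitive fields. The helper names, the six-character visible prefix, and the `[Redacted]` placeholder are assumptions for illustration, not the service's actual configuration.

```typescript
// Keep a short prefix of the MSISDN and mask the remainder.
// The six-character prefix is inferred from sample values such as "+93701***".
function maskMsisdn(msisdn: string, visible = 6): string {
  return msisdn.length <= visible ? "***" : msisdn.slice(0, visible) + "***";
}

// Fields named in the paragraph above as masked at logger level.
const REDACTED_FIELDS = ["srcMsisdn", "dstMsisdn", "pduBody", "senderId"];

// Return a copy of the record with sensitive fields replaced by a placeholder.
function redact(record: Record<string, unknown>): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(record)) {
    out[key] = REDACTED_FIELDS.includes(key) ? "[Redacted]" : value;
  }
  return out;
}

console.log(maskMsisdn("+93701234567"));
console.log(redact({ event: "firewall.verdict", srcMsisdn: "+93701234567", pduBody: "hello" }));
```

Doing this inside the logger rather than at each call site means a forgotten mask at one call site cannot leak a raw number.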

3.1 Verdict log (sampled 1% for ALLOW; 100% for FLAG/BLOCK/QUARANTINE)

{
  "level": "info",
  "time": "2026-04-21T10:14:23.123Z",
  "event": "firewall.verdict",
  "verdictId": "fv_01HKX...",
  "rpc": "FilterInbound",
  "direction": "MO",
  "verdict": "BLOCK",
  "blockReason": "ORIGIN_BLOCKLIST",
  "mnoBindId": "awcc-rx-01",
  "srcMsisdnMasked": "+93701***",
  "dstMsisdnMasked": "+93702***",
  "evaluatedRuleIds": ["fr_01HKX...", "fr_01HKY..."],
  "ruleHits": [{ "ruleId": "fr_01HKY...", "action": "BLOCK", "severity": "HIGH" }],
  "operatingMode": "NORMAL",
  "evaluationLatencyMs": 14,
  "flags": [],
  "traceId": "00-abc123-def456-01"
}
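
The sampling policy in the heading of 3.1 (1% of ALLOW, 100% of everything else) reduces to a small decision function. This is a minimal sketch: the function name and the injectable random source are illustrative choices, made so the behaviour can be tested deterministically.

```typescript
type Verdict = "ALLOW" | "FLAG" | "BLOCK" | "QUARANTINE";

// 1% of ALLOW verdicts are logged; every other verdict is always logged.
const ALLOW_SAMPLE_RATE = 0.01;

// `random` is injectable so tests can supply a fixed value instead of Math.random.
function shouldLogVerdict(verdict: Verdict, random: () => number = Math.random): boolean {
  if (verdict !== "ALLOW") return true;
  return random() < ALLOW_SAMPLE_RATE;
}
```

Any ALLOW count derived from these logs has to be scaled by 1 / 0.01 = 100 to estimate real volume, while FLAG, BLOCK, and QUARANTINE counts can be read directly.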

3.2 Quarantine event log

{
  "level": "info",
  "event": "firewall.quarantine.held",
  "holdId": "fq_01HKX...",
  "verdictId": "fv_01HKX...",
  "triggerRuleIds": ["fr_classifier-phishing-v3"],
  "reasonCode": "CLASSIFIER_PHISHING",
  "expiresAt": "2026-04-22T10:14:23Z"
}

3.3 Federation log

{
  "level": "info",
  "event": "firewall.federation.import",
  "source": "REGULATOR",
  "regulatorRef": "ATRA-2026-04-21-001",
  "addedCount": 142,
  "removedCount": 7,
  "signatureValid": true,
  "importBatchId": "imp_01HKX..."
}

3.4 Mode change log

{
  "level": "warn",
  "event": "firewall.mode.changed",
  "previousMode": "NORMAL",
  "newMode": "PANIC",
  "trigger": "AUTO_LATENCY_BREACH",
  "latencyP95Ms": 112,
  "breachedForSeconds": 65,
  "disabledRuleTypes": ["REGEX", "CLASSIFIER"]
}
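
The `latencyP95Ms` and `breachedForSeconds` fields suggest that the automatic trip depends on a breach being sustained rather than on a single bad sample. One way such a check could work is sketched below. The thresholds (100 ms, 60 s) and the class name are hypothetical; this document does not state the real trip values or mechanism.

```typescript
// Hypothetical thresholds: the actual trip values are not given in this document.
const P95_TRIP_MS = 100;
const SUSTAIN_SECONDS = 60;

class LatencyBreachTracker {
  private breachStartedAt: number | null = null;

  // Feed one P95 observation. Returns the breach duration in seconds once the
  // breach has been sustained long enough to trip, otherwise null.
  observe(p95Ms: number, nowSeconds: number): number | null {
    if (p95Ms <= P95_TRIP_MS) {
      // Latency recovered: reset so the next breach starts a fresh timer.
      this.breachStartedAt = null;
      return null;
    }
    if (this.breachStartedAt === null) this.breachStartedAt = nowSeconds;
    const breachedFor = nowSeconds - this.breachStartedAt;
    return breachedFor >= SUSTAIN_SECONDS ? breachedFor : null;
  }
}
```

Requiring a sustained breach avoids flapping into PANIC on a single garbage-collection pause or a brief Redis stall.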

3.5 Error log

{
  "level": "error",
  "event": "firewall.evaluation.error",
  "rpc": "FilterInbound",
  "errorType": "pg_unavailable",
  "failClosedAction": "MO_WAL_REPLAY_ENQUEUED",
  "err": { "message": "Connection timeout", "code": "ETIMEDOUT" }
}

4. OpenTelemetry tracing

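Parent span: firewall.FilterInbound or firewall.EvaluateTransit. Trace context is propagated in gRPC metadata using the W3C traceparent header (grpc-trace-bin is the older binary header and is not a W3C format). Sampling: ALLOW traces are head-sampled at 1%; traces that end in FLAG/BLOCK/QUARANTINE are kept at 100%, which requires the keep decision to be made after the verdict is known.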

| Span | Operation | Attributes |
| --- | --- | --- |
| firewall.cache.verdict | Redis GET fw:verdict:* | cache.hit |
| firewall.bloom.origin | BF.EXISTS fw:blocklist:national | bloom.hit |
| firewall.pg.blocklist | PG SELECT (after Bloom hit) | pg.rows |
| firewall.geo.check | In-process MCC/MNC lookup + numint Lookup | numint.available |
| firewall.rate.governor | Redis ZADD/ZCARD | window, current_count, threshold |
| firewall.bloom.dnd | BF.EXISTS fw:dnd:bloom | bloom.hit |
| firewall.rules.evaluate | CEL-style rule pipeline | rules_evaluated, verdict |
| firewall.classifier.invoke | Local LLM HTTP call | model_version, cache.hit, inference_ms |
| firewall.audit.write | PG INSERT + outbox row | verdict_id, partition |
| firewall.quarantine.write | PG INSERT (encrypted PDU) | hold_id, kek_id |
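
The firewall.rate.governor span lists ZADD/ZCARD with window, current_count, and threshold attributes, which is the usual shape of a sorted-set sliding-window counter. The sketch below is an in-memory stand-in for that pattern, not the service's implementation. The class name is hypothetical, and the pruning step (ZREMRANGEBYSCORE in Redis terms) is an assumption, since the table names only ZADD and ZCARD.

```typescript
// In-memory stand-in for a Redis sorted-set sliding window.
class SlidingWindowCounter {
  private events: number[] = [];
  private readonly windowMs: number;
  private readonly threshold: number;

  constructor(windowMs: number, threshold: number) {
    this.windowMs = windowMs;
    this.threshold = threshold;
  }

  // Record one event at nowMs and report whether the window is over threshold.
  hit(nowMs: number): { currentCount: number; blocked: boolean } {
    const cutoff = nowMs - this.windowMs;
    // Drop events that have aged out (assumed ZREMRANGEBYSCORE equivalent).
    this.events = this.events.filter((t) => t > cutoff);
    // Record the new event (ZADD equivalent, scored by timestamp).
    this.events.push(nowMs);
    // Count what remains in the window (ZCARD equivalent).
    const currentCount = this.events.length;
    return { currentCount, blocked: currentCount > this.threshold };
  }
}

const governor = new SlidingWindowCounter(1000, 2);
console.log(governor.hit(0), governor.hit(100), governor.hit(200));
```

The three values a caller gets back map directly onto the span attributes, which makes a blocked verdict explainable from the trace alone.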

5. Alerting rules (Prometheus / AlertManager YAML)

groups:
  - name: firewall.slo
    rules:
      - alert: FirewallFilterInboundLatencyHigh
        # sum by (le) aggregates all pods and verdicts into one service-wide quantile
        expr: |
          histogram_quantile(0.95,
            sum by (le) (rate(firewall_request_duration_seconds_bucket{rpc="FilterInbound"}[5m]))
          ) > 0.030
        for: 5m
        labels: { severity: high, service: sms-firewall }
        annotations:
          summary: "FilterInbound P95 > 30ms for 5m"
          runbook: "https://runbooks.ghasi.af/firewall/latency-high"

      - alert: FirewallEvaluateTransitLatencyHigh
        expr: |
          histogram_quantile(0.95,
            sum by (le) (rate(firewall_request_duration_seconds_bucket{rpc="EvaluateTransit"}[5m]))
          ) > 0.050
        for: 5m
        labels: { severity: high, service: sms-firewall }
        annotations:
          summary: "EvaluateTransit P95 > 50ms for 5m"
          runbook: "https://runbooks.ghasi.af/firewall/latency-high"

      - alert: FirewallVerdictAvailabilityBudget
        expr: |
          sum(rate(firewall_errors_total[1h])) /
          sum(rate(firewall_requests_total[1h])) > 0.0001
        for: 10m
        labels: { severity: critical, service: sms-firewall }
        annotations:
          summary: "Firewall verdict availability < 99.99% over 1h"

      - alert: FirewallAuditPublishLagHigh
        expr: |
          histogram_quantile(0.99,
            sum by (le) (rate(firewall_outbox_publish_duration_seconds_bucket[5m]))
          ) > 1.0
        for: 5m
        labels: { severity: high }

  - name: firewall.federation
    rules:
      - alert: FirewallBlocklistFederationStale
        # The gauge reports elapsed seconds since the last successful run; 86400 s = 24 h.
        # It is catalogued as an export metric, so confirm it also tracks regulator imports.
        expr: firewall_federation_export_lag_seconds > 86400
        for: 30m
        labels: { severity: medium }
        annotations:
          summary: "No regulator blocklist import in > 24h"

      - alert: FirewallFederationSignatureInvalid
        expr: increase(firewall_federation_import_total{result="signature_invalid"}[15m]) > 0
        labels: { severity: critical }
        annotations:
          summary: "Regulator HSM signature failed validation"
          runbook: "https://runbooks.ghasi.af/firewall/federation-signature-invalid"

  - name: firewall.simbox.ait
    rules:
      - alert: FirewallSimboxSurge
        expr: increase(firewall_simbox_block_total[5m]) > 1000
        for: 2m
        labels: { severity: high }
        annotations:
          summary: "SIM-box block rate spike (> 1000 in 5m)"

      - alert: FirewallAitSurge
        expr: increase(firewall_ait_block_total[5m]) > 5000
        for: 2m
        labels: { severity: high }

      - alert: FirewallFraudIntelStale
        expr: firewall_fraudintel_event_age_seconds > 3600
        for: 30m
        labels: { severity: medium }

  - name: firewall.dnd
    rules:
      - alert: FirewallDndSnapshotStale
        expr: firewall_dnd_snapshot_age_seconds > 21600 # 6h
        for: 15m
        labels: { severity: medium }
        annotations:
          summary: "National DND projection > 6h stale"

  - name: firewall.audit
    rules:
      - alert: FirewallAuditChainBreak
        expr: increase(firewall_audit_chain_break_total[1h]) > 0
        labels: { severity: critical }
        annotations:
          summary: "Audit hash-chain integrity failure; engage Security IR"
          runbook: "https://runbooks.ghasi.af/firewall/audit-chain-break"

      - alert: FirewallPartitionMissing
        expr: |
          (firewall_audit_partitions_provisioned_count - firewall_audit_partitions_required_count) < 0
        labels: { severity: high }

  - name: firewall.mode
    rules:
      - alert: FirewallPanicEntered
        expr: firewall_mode_panic_active == 1
        for: 1m
        labels: { severity: critical }
        annotations:
          summary: "Firewall in PANIC mode"
          runbook: "https://runbooks.ghasi.af/firewall/panic-mode-entered"

  - name: firewall.peers
    rules:
      - alert: FirewallPeerDegraded
        expr: firewall_peer_hygiene_score < 60
        for: 15m
        labels: { severity: medium }

      - alert: FirewallPeerAutoQuarantined
        # delta() because firewall_peer_quarantined is a gauge; increase() is only valid for counters
        expr: delta(firewall_peer_quarantined[5m]) > 0
        labels: { severity: high }
        annotations:
          summary: "Peer auto-quarantined; carrier-relations should engage"

  - name: firewall.replication
    rules:
      - alert: FirewallJetStreamMirrorLag
        expr: firewall_jetstream_mirror_lag_seconds{stream="FIREWALL_AUDIT"} > 30
        for: 5m
        labels: { severity: high }
        annotations:
          summary: "Audit stream mirror lag > 30s; DR exposure"

  - name: firewall.bind.heartbeat
    rules:
      - alert: FirewallBindMissing
        expr: firewall_bind_heartbeat_age_seconds > 60
        for: 2m
        labels: { severity: high }
        annotations:
          summary: "smpp-connector heartbeat missing; bind may be down"

6. Grafana dashboards

Dashboard: dashboards/sms-firewall.json (in monorepo).

| Panel | Query | Visualisation |
| --- | --- | --- |
| Verdict rate (RPS) | sum(rate(firewall_requests_total[1m])) by (verdict) | Stacked area |
| FilterInbound P50/P95/P99 | histogram_quantile(0.{50,95,99}, ...) | Time series, 3 lines |
| EvaluateTransit P50/P95/P99 | same for rpc="EvaluateTransit" | Time series |
| Verdict distribution (24h) | sum(increase(firewall_verdict_total[24h])) by (verdict) | Pie chart |
| Block-reason breakdown | sum(increase(firewall_verdict_total{verdict="BLOCK"}[$__range])) by (block_reason) | Bar chart |
| Quarantine queue depth | firewall_quarantine_depth{status="PENDING"} | Gauge + time series |
| Top hot rules (24h) | topk(10, sum(rate(firewall_rule_match_total[24h])) by (rule_id)) | Bar chart |
| Bloom hit rate | sum(rate(firewall_bloom_check_total{result="hit"}[5m])) / sum(rate(firewall_bloom_check_total[5m])) | Gauge |
| Federation lag | firewall_federation_export_lag_seconds | Single stat |
| DND snapshot age | firewall_dnd_snapshot_age_seconds | Single stat |
| Operating mode timeline | firewall_mode_current | State timeline |
| SIM-box / AIT block volume | rate(firewall_simbox_block_total[5m]), rate(firewall_ait_block_total[5m]) | Time series |
| Peer hygiene heatmap | max by (peer_id) (firewall_peer_hygiene_score) | Heatmap |
| Audit chain break (must be 0) | firewall_audit_chain_break_total | Single stat (red if > 0) |
| JetStream mirror lag | firewall_jetstream_mirror_lag_seconds | Time series |
| Per-MNO block rate (fairness) | sum by (mno_id) (rate(firewall_requests_total{verdict="BLOCK"}[1h])) / sum by (mno_id) (rate(firewall_requests_total[1h])) | Bar chart |
| Classifier latency + cache hit | firewall_classifier_duration_seconds P95 + cache-hit rate | Dual-axis |

Dashboard tags: service:sms-firewall, environment:prod, region:{kbl,mzr}.

7. Runbooks


  • runbooks/firewall/latency-high.md — diagnosis: classifier mis-budget, regex catastrophic, PG slow query, GC pause
  • runbooks/firewall/panic-mode-entered.md — confirm auto-trip vs manual; investigate root cause; manual recovery
  • runbooks/firewall/quarantine-backlog.md — reviewer staffing, bulk-action, tier escalation
  • runbooks/firewall/federation-signature-invalid.md — verify regulator HSM key in Vault, escalate to regulator-liaison
  • runbooks/firewall/audit-chain-break.md — Security IR engagement, Postgres forensics, restore from JetStream mirror
  • runbooks/firewall/bind-missing.md — connector pod restart, SVID rotation check, SMPP bind diagnostics
  • runbooks/firewall/simbox-surge.md — coordinate with fraud-intel; check for systemic vs spike
  • runbooks/firewall/dnd-snapshot-stale.md — consent-ledger consumer health, manual rebuild

All runbooks are versioned in docs/runbooks/firewall/.