SMS Firewall Service — Sync Contract
Version: 1.0
Status: Draft
Owner: Trust & Safety
Last Updated: 2026-04-21
Companion: DATA_MODEL · SERVICE_OVERVIEW · API_CONTRACTS · EVENT_SCHEMAS
Related ADR: ADR-0004 §6 Multi-Region Replication
1. What other services depend on us for
| Caller | Interface | Dependency type | SLA |
|---|---|---|---|
| `smpp-connector-{mno}-rx` / `-trx` | gRPC `SmsFirewallService/FilterInbound` | Synchronous, hot path, mandatory before MNO `deliver_sm_resp` | P95 ≤ 30 ms; availability 99.95% per region |
| `smpp-connector-transit-rx` | gRPC `SmsFirewallService/EvaluateTransit` | Synchronous, fail-closed | P95 ≤ 50 ms |
| `routing-engine` | gRPC `FirewallControlPlane/CheckOutboundEgress` | Synchronous DND check at egress | P95 ≤ 20 ms |
| `channel-router-service` | gRPC `SmsFirewallService/GetVerdict` | Cache-only verdict lookup | P95 ≤ 5 ms (Redis-backed) |
| `cdr-mediation-service` | NATS `firewall.audit.v1` | At-least-once event consumer | Lag ≤ 10 s P99 |
| `fraud-intel-service` | NATS `firewall.audit.v1`, `firewall.simbox.detected.v1` | At-least-once event consumer | Lag ≤ 30 s P99 |
| `regulator-portal-service` | NATS `firewall.audit.v1`, `firewall.federation.exported.v1` | At-least-once + REST `/v1/internal/firewall/blocklist/export` | Lag ≤ 30 s P99 |
| `admin-dashboard` | REST `/v1/admin/firewall/*` | Admin synchronous | P95 ≤ 500 ms |
| `notification-service` | NATS `firewall.alert.*`, `firewall.peer_quarantine.entered.v1` | At-least-once | Lag ≤ 5 s P99 |
Fail-closed semantics
- Transit MT: firewall UNAVAILABLE → `smpp-connector-transit-rx` returns `submit_sm_resp` with `command_status = ESME_RSUBMITFAIL (0x00000045)` to peer; emits `firewall.transit.unavailable.v1`.
- Inbound MO: firewall UNAVAILABLE → `smpp-connector-{mno}-rx` writes the PDU to local-disk WAL (`/var/lib/connector/wal/inbound/{ts}.pdu`), responds to MNO with `deliver_sm_resp ESME_ROK` (so MO is not NACK'd back to subscriber), and replays the WAL once the firewall is available. This preserves the subscriber relationship per ADR-0004 §3.
- Egress DND check: `routing-engine` UNAVAILABLE → service is the firewall itself, so degraded latency only; if firewall fully unreachable, `routing-engine` defers the route to local Redis-cached DND projection (last known good ≤ 5 min). This is an operator-knowledge-only fallback, audited explicitly.
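
The three behaviours above can be summarised as a single decision function. The following is a minimal TypeScript sketch, not the connector implementation: the type names, `onFirewallUnavailable`, `dndCacheAgeSeconds` and the `NO_FALLBACK` outcome are hypothetical, while the SMPP statuses, the event subject and the 5-minute bound come from the list above.

```typescript
// Hypothetical names throughout; only the SMPP statuses, the event subject
// and the 5-minute cache bound are taken from the contract text.
type TrafficPath = "TRANSIT_MT" | "INBOUND_MO" | "EGRESS_DND";

type FallbackAction =
  | { kind: "REJECT"; smppStatus: "ESME_RSUBMITFAIL"; emit: string }
  | { kind: "WAL_AND_ACK"; smppStatus: "ESME_ROK" }
  | { kind: "USE_CACHED_DND"; audited: true }
  | { kind: "NO_FALLBACK" };

const DND_CACHE_MAX_AGE_SECONDS = 5 * 60; // "last known good ≤ 5 min"

function onFirewallUnavailable(
  path: TrafficPath,
  dndCacheAgeSeconds: number,
): FallbackAction {
  switch (path) {
    case "TRANSIT_MT":
      // Fail closed: reject towards the peer and emit the unavailability event.
      return {
        kind: "REJECT",
        smppStatus: "ESME_RSUBMITFAIL",
        emit: "firewall.transit.unavailable.v1",
      };
    case "INBOUND_MO":
      // Persist to the local WAL, acknowledge the MNO, replay later.
      return { kind: "WAL_AND_ACK", smppStatus: "ESME_ROK" };
    case "EGRESS_DND":
      // The contract does not say what happens once the projection is older
      // than 5 minutes; NO_FALLBACK is a placeholder for that case.
      return dndCacheAgeSeconds <= DND_CACHE_MAX_AGE_SECONDS
        ? { kind: "USE_CACHED_DND", audited: true }
        : { kind: "NO_FALLBACK" };
  }
}
```

Keeping the mapping in one pure function makes the per-path policy easy to unit-test independently of the SMPP and gRPC plumbing.
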
2. What we depend on
| Dependency | Interface | Failure mode |
|---|---|---|
| Postgres `firewall` schema | TCP, pg driver | Service degrades to Redis-only verdict cache; new verdicts cannot be persisted; fail-closed for new evaluations after 60 s |
| Redis (region-local, cluster mode) | TCP | Bloom + rate-governor unavailable → fall through to definitive Postgres reads; latency increases ~30 ms; rate governor returns ALLOW + flag=RATE_GOVERNOR_DEGRADED |
| NATS JetStream | TCP | Outbox queues events locally; consumers retry; events delivered with bounded delay |
| `consent-ledger-service` | NATS `consent.dnd.snapshot.v1` | DND snapshot stale → emit `firewall.alert.dnd.snapshot.stale.v1`; existing projection continues to be used |
| `fraud-intel-service` | NATS `fraud.detected.*` | Signal updates pause; existing signatures continue to apply; alert at 1h staleness |
| `regulator-portal-service` | NATS `regulator.blocklist.published.v1` | Federation imports pause; existing entries remain active |
| `sender-id-registry-service` | gRPC `Verify(senderId, peerId)` | Fall back to local hourly cache `firewall.peer_senderid_allowlist`; alert if cache > 4h stale |
| `number-intelligence-service` | gRPC `Lookup(msisdn)` | Geo + grey-route checks degrade; emit `flag=NUMINT_UNAVAILABLE`; verdicts continue using MCC/MNC table only |
| Vault Transit (KEK for quarantine + HSM signer) | HTTPS (token rotated 30d) | Quarantine inserts cannot encrypt; new QUARANTINE verdicts fail-closed → re-emit as BLOCK with reason recorded; signer unavailable → federation export postponed |
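
As an illustration of how several of these failure modes could surface on a single verdict, here is a hedged sketch. The types, `applyDegradation`, and the reason string `QUARANTINE_ENCRYPTION_UNAVAILABLE` are hypothetical; the two flag names and the QUARANTINE → BLOCK downgrade are taken from the table above.

```typescript
type VerdictAction = "ALLOW" | "BLOCK" | "QUARANTINE";

interface DraftVerdict {
  action: VerdictAction;
  reason?: string;
  flags: string[];
}

// Hypothetical health snapshot of the dependencies listed in the table.
interface DependencyHealth {
  vaultTransitUp: boolean;
  redisUp: boolean;
  numberIntelUp: boolean;
}

function applyDegradation(v: DraftVerdict, h: DependencyHealth): DraftVerdict {
  const flags = v.flags.slice();
  // Simplification: the flag is attached whenever Redis is down, regardless
  // of what the rate governor would have decided.
  if (!h.redisUp) flags.push("RATE_GOVERNOR_DEGRADED");
  if (!h.numberIntelUp) flags.push("NUMINT_UNAVAILABLE");

  // Without the KEK the payload cannot be encrypted, so a QUARANTINE verdict
  // is re-emitted as BLOCK with the reason recorded.
  if (v.action === "QUARANTINE" && !h.vaultTransitUp) {
    return { action: "BLOCK", reason: "QUARANTINE_ENCRYPTION_UNAVAILABLE", flags };
  }
  return { ...v, flags };
}
```
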
3. Per-aggregate conflict / replication policy
The firewall is region-pinned for the data plane (verdict latency demands region-local Redis + Postgres) and uses Postgres logical replication (kbl primary → mzr replica) for control-plane state. The audit log is replicated via NATS JetStream stream mirror.
| Aggregate | Replication mechanism | Conflict policy | Rationale |
|---|---|---|---|
| `firewall.audit` (verdict log) | NATS JetStream mirror (3× kbl, 2× mzr, leaf dxb) + Postgres logical replication | `append_only` | Verdicts are immutable evidence; chain hash continuity preserved per partition |
| `firewall.rules` | Postgres logical replication (kbl primary) | `server_authoritative` | Rule edits go through admin REST in kbl only; mzr is read-only |
| `firewall.rule_versions` | Same | `append_only` | Snapshot history |
| `firewall.blocklist_entries` | Postgres logical replication + NATS event | `merge_on_unique` | Multiple federation sources may report the same entry; merge `sources` JSONB array; recompute `confidence_score` after merge |
| `firewall.blocklist_audit` | Postgres logical replication | `append_only` | History |
| `firewall.quarantine_queue` | Region-local (NOT replicated) | `region_authoritative` | Quarantine review is region-local; on regional failover, in-flight holds in lost region are reconstructed from `firewall.audit` rows (verdict + encrypted payload reference) |
| `firewall.mno_bind_registry` | Postgres logical replication (kbl primary) | `server_authoritative` | Platform config |
| `firewall.peer_aggregators`, `peer_asn_allowlist`, `peer_senderid_allowlist`, `peer_mno_routes` | Postgres logical replication (kbl primary) | `server_authoritative` | Control-plane |
| `firewall.peer_hygiene_scores` | Region-local + nightly merge to kbl aggregator | `max_of` per window | Both regions independently compute; the worst score wins (conservative: degraded peers stay quarantined) |
| `firewall.simbox_signals`, `firewall.ait_patterns` | Postgres logical replication (consumer of fraud-intel) | `merge_on_unique` (`originator` / `dst_msisdn_range`) | Both regions consume `fraud.detected.*` independently; merge with `last_seen_at = max(...)`, `confidence = max(...)` |
| `firewall.dnd_snapshot` | Region-local rebuild from `consent.dnd.snapshot.v1` | `server_authoritative` (consent-ledger is canonical) | Both regions independently materialise the same snapshot URL |
| `firewall.operating_mode` | Region-local | `region_authoritative` | Mode is per region; cross-region coordination uses an explicit "set both regions PANIC" admin endpoint (not auto-replicated) |
| `firewall.federation_log` | Postgres logical replication | `append_only` | Audit |
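
To make the `merge_on_unique` rule for `firewall.simbox_signals` concrete, here is a minimal sketch of the field-wise merge described in the table (`last_seen_at = max(...)`, `confidence = max(...)`). The `SimboxSignal` shape, the epoch-millisecond timestamps and the 0..1 confidence range are assumptions made for illustration.

```typescript
interface SimboxSignal {
  originator: string;  // unique key for simbox_signals
  lastSeenAt: number;  // epoch milliseconds (assumed representation)
  confidence: number;  // 0..1 (assumed range)
}

function mergeSimboxSignal(a: SimboxSignal, b: SimboxSignal): SimboxSignal {
  if (a.originator !== b.originator) {
    throw new Error("merge_on_unique applies only to rows with the same key");
  }
  return {
    originator: a.originator,
    lastSeenAt: Math.max(a.lastSeenAt, b.lastSeenAt),
    confidence: Math.max(a.confidence, b.confidence),
  };
}
```

Because `max` is commutative, associative and idempotent, both regions reach the same row regardless of the order in which they see each other's updates, and replaying an update is harmless.
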
3.1 Merge algorithm for blocklist_entries.sources
When two federation imports for the same (source, regulator_ref, type, value) arrive in different regions:
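
```typescript
// Merge two regional copies of the same blocklist entry.
// dedupeBy, recomputeConfidence and max are assumed to be defined elsewhere.
function mergeBlocklistEntry(local: Entry, remote: Entry): Entry {
  // Union the source lists, keeping one record per (sourceId, sourceType).
  const sources = dedupeBy(
    local.sources.concat(remote.sources),
    s => `${s.sourceId}:${s.sourceType}`,
  );
  // Compute the score once so autoApply can reference it.
  const confidenceScore = recomputeConfidence(sources);
  return {
    ...local,
    sources,
    confidenceScore,
    autoApply: confidenceScore >= 0.8,
    active: local.active && remote.active, // any deactivation wins
    updatedAt: max(local.updatedAt, remote.updatedAt),
  };
}
```
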
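
`dedupeBy` is not defined in this document. A minimal sketch of one possible helper follows, assuming the first occurrence wins and input order is preserved (both assumptions). The `SourceRef` shape is inferred from the key function above, and the `sourceType` values in the usage lines are made up. `recomputeConfidence` is left out because its formula is not specified here.

```typescript
interface SourceRef {
  sourceId: string;
  sourceType: string;
}

// Returns the items with duplicate keys removed; the first item seen for
// each key is kept and the relative order of survivors is unchanged.
function dedupeBy<T>(items: T[], keyOf: (item: T) => string): T[] {
  // Prototype-free object so keys such as "constructor" are safe.
  const seen: Record<string, boolean> = Object.create(null);
  const out: T[] = [];
  for (const item of items) {
    const key = keyOf(item);
    if (!seen[key]) {
      seen[key] = true;
      out.push(item);
    }
  }
  return out;
}

// Usage: the same regulator source reported in both regions collapses to one.
const localSources: SourceRef[] = [
  { sourceId: "r1", sourceType: "REGULATOR" },
  { sourceId: "p1", sourceType: "PEER_MNO" },
];
const remoteSources: SourceRef[] = [{ sourceId: "r1", sourceType: "REGULATOR" }];
const mergedSources = dedupeBy(
  localSources.concat(remoteSources),
  s => `${s.sourceId}:${s.sourceType}`,
);
```
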
3.2 Region-failover behaviour
Per ADR-0004 §6:
- kbl loss: `mzr` promotes Postgres replica to primary; admin REST DNS swings (Cloudflare); rule edits resume in `mzr`. In-flight `kbl` quarantines reconstructed from `firewall.audit` rows + encrypted payload archive in MinIO.
- mzr loss: `kbl` continues unaffected; mzr-region MO/transit traffic is rebalanced to kbl-region connectors via service-mesh failover (cost: cross-region latency 8–15 ms additional on the firewall verdict path).
- JetStream mirror: audit-event mirror lag P99 ≤ 10 s; alert `FirewallAuditMirrorLagHigh` at > 30 s.
4. Outbox pattern (transactional event publishing)
Every state mutation that produces a NATS event writes to firewall.outbox in the same Postgres transaction:

```sql
BEGIN;
-- The state change and its event row commit or roll back together.
INSERT INTO firewall.audit (...) VALUES (...);
INSERT INTO firewall.outbox (event_id, subject, payload, partition_key)
VALUES ($eventId, 'firewall.audit.v1', $payload, $mnoBindId);
COMMIT;
```

A relay worker (`OutboxRelayWorker`) polls `firewall.outbox WHERE published_at IS NULL ORDER BY created_at LIMIT 1000` every 250 ms; on successful NATS ack, it sets `published_at = now()`. After 7 days, published rows are pruned.
Guarantees:
- No emitted event without a persisted state change.
- No persisted state change without an emitted event for more than a few seconds.
- Sticky partition routing via `partition_key` (e.g. `mnoBindId`) preserves ordering for downstream stateful consumers.
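
One polling pass of the relay can be sketched as follows. This is an in-memory, synchronous illustration rather than the actual `OutboxRelayWorker`: the `OutboxRow` shape, `PublishFn` and `relayOnce` are hypothetical stand-ins for the Postgres query and the NATS publish, and stopping at the first failed publish is an assumed design choice to keep ordering intact.

```typescript
interface OutboxRow {
  eventId: string;
  subject: string;
  payload: string;
  partitionKey: string;
  createdAt: number;          // epoch milliseconds
  publishedAt: number | null; // null until NATS acknowledges the publish
}

// Stand-in for a NATS publish; returns true when the broker acked.
type PublishFn = (row: OutboxRow) => boolean;

const BATCH_LIMIT = 1000; // mirrors LIMIT 1000 in the polling query

// Publishes up to BATCH_LIMIT unpublished rows in created_at order and
// returns how many were acknowledged in this pass.
function relayOnce(outbox: OutboxRow[], publish: PublishFn, now: number): number {
  const pending = outbox
    .filter(r => r.publishedAt === null)
    .sort((a, b) => a.createdAt - b.createdAt)
    .slice(0, BATCH_LIMIT);

  let published = 0;
  for (const row of pending) {
    // Stop at the first failure so later rows are not published ahead of it;
    // the remaining rows are picked up again on the next poll.
    if (!publish(row)) break;
    row.publishedAt = now;
    published++;
  }
  return published;
}
```

If the worker crashes after the ack but before `published_at` is written, the row is published again on the next pass. That is why consumers are described as at-least-once and would typically deduplicate on `event_id`.
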
5. Replication lag targets & telemetry
| Path | Target | Alert threshold | Metric |
|---|---|---|---|
| `firewall.rules` (kbl → mzr) Postgres logical | P99 ≤ 5 s | > 30 s for 5 min | `firewall_pg_replication_lag_seconds{stream='control'}` |
| `firewall.audit` (kbl → mzr) NATS mirror | P99 ≤ 10 s | > 30 s for 5 min | `firewall_jetstream_mirror_lag_seconds` |
| `firewall.blocklist_entries` cross-region | P99 ≤ 15 s | > 60 s for 5 min | `firewall_pg_replication_lag_seconds{stream='blocklist'}` |
| Bloom filter rebuild after `firewall.blocklist.changed.v1` | P99 ≤ 5 s | > 15 s | `firewall_bloom_refresh_lag_seconds` |
| Outbox unpublished depth | < 100 messages | > 10 000 for 1 min | firewall_outbox_unpublished_total |
| Conflict resolutions per minute | — | > 100/min sustained 5 min | firewall_conflict_resolved_total{aggregate,policy} |
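
The "> N s for 5 min" thresholds describe a sustained breach, not a single spike. A minimal sketch of that evaluation is shown below; in practice it would normally be expressed in the alerting system's own rule language. `LagSample` and `breachedFor` are hypothetical names, and the samples are assumed to be sorted by time, oldest first.

```typescript
interface LagSample {
  atSeconds: number;  // sample timestamp
  lagSeconds: number; // observed replication lag
}

// True when the most recent unbroken run of samples above the threshold
// started at least windowSeconds before nowSeconds.
function breachedFor(
  samples: LagSample[],
  thresholdSeconds: number,
  windowSeconds: number,
  nowSeconds: number,
): boolean {
  let runStart: number | null = null;
  // Walk backwards from the newest sample until the lag was within bounds.
  for (let i = samples.length - 1; i >= 0; i--) {
    if (samples[i].lagSeconds <= thresholdSeconds) break;
    runStart = samples[i].atSeconds;
  }
  return runStart !== null && nowSeconds - runStart >= windowSeconds;
}
```

For the `firewall.rules` row this would be called with `thresholdSeconds = 30` and `windowSeconds = 300`.
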
6. Operator override procedure (cross-region merge conflict)
Although merge_on_unique is intended to converge automatically, an operator may need to manually resolve a contested blocklist entry (e.g. regulator and peer-MNO disagree on active status):
- NOC runs `GET /v1/admin/firewall/blocklist/{entryId}/regions` to compare per-region state.
- NOC posts `POST /v1/admin/firewall/blocklist/{entryId}/resolve {decision: 'ACTIVATE'|'DEACTIVATE', reason}`.
- Decision applies in kbl (primary) and replicates to mzr; emits `firewall.blocklist.changed.v1` with `actor='OPERATOR_OVERRIDE'`.
- Audit row records both the pre-merge regional states and the resolution.
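
For tooling that wraps the resolve step, the request could be assembled along these lines. This is a sketch only: `buildResolveCall` and its return shape are hypothetical, and rejecting an empty `reason` is an assumption, since the contract lists the field without stating validation rules.

```typescript
type ResolveDecision = "ACTIVATE" | "DEACTIVATE";

interface ResolveBody {
  decision: ResolveDecision;
  reason: string;
}

interface ResolveCall {
  method: "POST";
  path: string;
  body: ResolveBody;
}

function buildResolveCall(
  entryId: string,
  decision: ResolveDecision,
  reason: string,
): ResolveCall {
  // Assumed rule: an override without a stated reason is not useful for audit.
  if (reason.trim().length === 0) {
    throw new Error("reason must not be empty");
  }
  return {
    method: "POST",
    path: "/v1/admin/firewall/blocklist/" + encodeURIComponent(entryId) + "/resolve",
    body: { decision, reason },
  };
}
```
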
7. Schema stability guarantees
7.1 gRPC
| Field | Stability |
|---|---|
| `FilterInboundRequest.*` required fields | Stable |
| `EvaluateTransitRequest.*` required fields | Stable |
| `Verdict.*` | Stable |
| `BlockReason` enum | Additive only; callers MUST handle `BLOCK_REASON_UNSPECIFIED` |
| `FirewallAction` enum | Additive only |
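
Because the enums are additive only, a caller built against an older schema may receive a value it does not recognise. One way to stay forward-compatible is to normalise unknown values to `BLOCK_REASON_UNSPECIFIED`. In the sketch below, `BLOCK_REASON_BLOCKLIST` and `BLOCK_REASON_DND` are made-up members used only for illustration; only `BLOCK_REASON_UNSPECIFIED` appears in this contract.

```typescript
// Values this (hypothetical) caller was compiled against.
type KnownBlockReason =
  | "BLOCK_REASON_UNSPECIFIED"
  | "BLOCK_REASON_BLOCKLIST" // illustrative, not from the contract
  | "BLOCK_REASON_DND";      // illustrative, not from the contract

const KNOWN_BLOCK_REASONS: Record<KnownBlockReason, true> = {
  BLOCK_REASON_UNSPECIFIED: true,
  BLOCK_REASON_BLOCKLIST: true,
  BLOCK_REASON_DND: true,
};

// Maps any wire value to one the caller can handle; values added by a newer
// server version fall back to BLOCK_REASON_UNSPECIFIED instead of failing.
function normaliseBlockReason(wire: string): KnownBlockReason {
  return Object.prototype.hasOwnProperty.call(KNOWN_BLOCK_REASONS, wire)
    ? (wire as KnownBlockReason)
    : "BLOCK_REASON_UNSPECIFIED";
}
```
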
7.2 NATS events
| Aspect | Stability |
|---|---|
| `firewall.audit.v1` field set | Additive (optional) is non-breaking |
| Removing a field | Requires firewall.audit.v2 + 90-day deprecation overlap |
| Hash-chain canonical JSON shape | STABLE — any change requires partition cutover at month boundary |
7.3 Postgres
| Aspect | Stability |
|---|---|
| `firewall.audit` column set | Additive non-null requires backfill migration |
| Hash-chain trigger | Versioned; mode upgrade requires partition rotation |
| Partition naming convention | firewall.audit_YYYY_MM — never renamed |
8. Versioning policy
- gRPC: `ghasi.sms.firewall.v1`. Breaking → `v2` package coexisting for ≥ 90 days.
- NATS subjects: `firewall.<topic>.v1`. Breaking → `.v2` with 90-day overlap.
- REST: `/v1/admin/firewall/*`. Additive non-breaking; breaking → `/v2/`.
- Postgres: forward-only migrations via Atlas/Drizzle; expand-then-contract.
9. Backpressure & flow control
- Per-bind concurrency cap 200 in-flight at the gRPC server prevents one runaway bind from starving healthy ones.
- gRPC server returns `RESOURCE_EXHAUSTED` when cap hit; connector retries on a sibling firewall replica via DNS round-robin.
- NATS consumers use `MaxAckPending = 5000`; durable consumer offsets persisted per replica group.
- Outbox relay backpressure: if NATS publish fails > 1000 times consecutively, alert `FirewallOutboxStalled` (HIGH); admin can manually flush via `POST /v1/admin/firewall/internal/outbox/flush`.
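
The per-bind cap can be pictured as a small counter keyed by bind. The sketch below is illustrative and not the server's interceptor: `BindLimiter`, `tryAcquire` and `release` are hypothetical names, while the cap of 200 and the `RESOURCE_EXHAUSTED` outcome come from the list above.

```typescript
const PER_BIND_CAP = 200;

class BindLimiter {
  private cap: number;
  // Prototype-free object so arbitrary bind ids are safe as keys.
  private inFlight: Record<string, number> = Object.create(null);

  constructor(cap: number = PER_BIND_CAP) {
    this.cap = cap;
  }

  // Reserves a slot for the bind, or reports that its cap is reached.
  tryAcquire(bindId: string): "OK" | "RESOURCE_EXHAUSTED" {
    const current = this.inFlight[bindId] || 0;
    if (current >= this.cap) return "RESOURCE_EXHAUSTED";
    this.inFlight[bindId] = current + 1;
    return "OK";
  }

  // Must be called once per successful tryAcquire when the request ends.
  release(bindId: string): void {
    const current = this.inFlight[bindId] || 0;
    if (current > 0) this.inFlight[bindId] = current - 1;
  }
}
```

Counting per bind rather than globally is what keeps one saturated bind from consuming slots that healthy binds need.
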