SMS Firewall Service — Sync Contract

Version: 1.0
Status: Draft
Owner: Trust & Safety
Last Updated: 2026-04-21
Companion: DATA_MODEL · SERVICE_OVERVIEW · API_CONTRACTS · EVENT_SCHEMAS
Related ADR: ADR-0004 §6 Multi-Region Replication


1. What other services depend on from us

| Caller | Interface | Dependency type | SLA |
| --- | --- | --- | --- |
| smpp-connector-{mno}-rx / -trx | gRPC SmsFirewallService/FilterInbound | Synchronous, hot path, mandatory before MNO deliver_sm_resp | P95 ≤ 30 ms; availability 99.95% per region |
| smpp-connector-transit-rx | gRPC SmsFirewallService/EvaluateTransit | Synchronous, fail-closed | P95 ≤ 50 ms |
| routing-engine | gRPC FirewallControlPlane/CheckOutboundEgress | Synchronous DND check at egress | P95 ≤ 20 ms |
| channel-router-service | gRPC SmsFirewallService/GetVerdict | Cache-only verdict lookup | P95 ≤ 5 ms (Redis-backed) |
| cdr-mediation-service | NATS firewall.audit.v1 | At-least-once event consumer | Lag ≤ 10 s P99 |
| fraud-intel-service | NATS firewall.audit.v1, firewall.simbox.detected.v1 | At-least-once event consumer | Lag ≤ 30 s P99 |
| regulator-portal-service | NATS firewall.audit.v1, firewall.federation.exported.v1 | At-least-once + REST /v1/internal/firewall/blocklist/export | Lag ≤ 30 s P99 |
| admin-dashboard | REST /v1/admin/firewall/* | Admin synchronous | P95 ≤ 500 ms |
| notification-service | NATS firewall.alert.*, firewall.peer_quarantine.entered.v1 | At-least-once | Lag ≤ 5 s P99 |

Fail-closed semantics

  • Transit MT: firewall UNAVAILABLE → smpp-connector-transit-rx returns submit_sm_resp with command_status = ESME_RSUBMITFAIL (0x00000045) to peer; emits firewall.transit.unavailable.v1.
  • Inbound MO: firewall UNAVAILABLE → smpp-connector-{mno}-rx writes the PDU to local-disk WAL (/var/lib/connector/wal/inbound/{ts}.pdu), responds to MNO with deliver_sm_resp ESME_ROK (so MO is not NACK'd back to subscriber), and replays the WAL once the firewall is available. This preserves the subscriber relationship per ADR-0004 §3.
  • Egress DND check: the check invoked by routing-engine is served by the firewall itself, so partial unavailability shows up only as degraded latency. If the firewall is fully unreachable, routing-engine falls back to its local Redis-cached DND projection (last known good ≤ 5 min). This fallback is operator-knowledge-only and is audited explicitly.
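
The first two rules above can be sketched as a small decision function. This is only an illustration under assumptions: the names `Path`, `UnavailableAction` and `onFirewallUnavailable` are hypothetical and do not come from the connector code; the SMPP status values are the ones quoted in the bullets.

```typescript
// SMPP command_status values quoted in the bullets above.
const ESME_ROK = 0x00000000;
const ESME_RSUBMITFAIL = 0x00000045;

// Hypothetical names for the two connector paths described above.
type Path = "TRANSIT_MT" | "INBOUND_MO";

interface UnavailableAction {
  commandStatus: number;    // status returned in the SMPP response PDU
  writeToWal: boolean;      // spool the PDU to the local-disk WAL for replay
  emitEvent: string | null; // NATS subject to emit, if the source names one
}

// Hypothetical helper: what a connector does when the firewall is UNAVAILABLE.
function onFirewallUnavailable(path: Path): UnavailableAction {
  if (path === "TRANSIT_MT") {
    // Fail closed: reject towards the peer and emit the unavailability event.
    return {
      commandStatus: ESME_RSUBMITFAIL,
      writeToWal: false,
      emitEvent: "firewall.transit.unavailable.v1",
    };
  }
  // Inbound MO: acknowledge to the MNO, spool to WAL, replay later.
  return { commandStatus: ESME_ROK, writeToWal: true, emitEvent: null };
}
```

Keeping the decision in one pure function makes the asymmetry between the two paths explicit: transit traffic is rejected, while inbound MO is acknowledged and deferred.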

2. What we depend on

| Dependency | Interface | Failure mode |
| --- | --- | --- |
| Postgres firewall schema | TCP, pg driver | Service degrades to Redis-only verdict cache; new verdicts cannot be persisted; fail-closed for new evaluations after 60 s |
| Redis (region-local, cluster mode) | TCP | Bloom + rate-governor unavailable → fall through to definitive Postgres reads; latency increases ~30 ms; rate governor returns ALLOW + flag=RATE_GOVERNOR_DEGRADED |
| NATS JetStream | TCP | Outbox queues events locally; consumers retry; events delivered with bounded delay |
| consent-ledger-service | NATS consent.dnd.snapshot.v1 | DND snapshot stale → emit firewall.alert.dnd.snapshot.stale.v1; existing projection continues to be used |
| fraud-intel-service | NATS fraud.detected.* | Signal updates pause; existing signatures continue to apply; alert at 1h staleness |
| regulator-portal-service | NATS regulator.blocklist.published.v1 | Federation imports pause; existing entries remain active |
| sender-id-registry-service | gRPC Verify(senderId, peerId) | Fall back to local hourly cache firewall.peer_senderid_allowlist; alert if cache > 4h stale |
| number-intelligence-service | gRPC Lookup(msisdn) | Geo + grey-route checks degrade; emit flag=NUMINT_UNAVAILABLE; verdicts continue using MCC/MNC table only |
| Vault Transit (KEK for quarantine + HSM signer) | HTTPS (token rotated 30d) | Quarantine inserts cannot encrypt; new QUARANTINE verdicts fail-closed → re-emit as BLOCK with reason recorded; signer unavailable → federation export postponed |

3. Per-aggregate conflict / replication policy

The firewall is region-pinned for the data plane (verdict latency demands region-local Redis + Postgres) and uses Postgres logical replication (kbl primary → mzr replica) for control-plane state. The audit log is replicated via NATS JetStream stream mirror.

| Aggregate | Replication mechanism | Conflict policy | Rationale |
| --- | --- | --- | --- |
| firewall.audit (verdict log) | NATS JetStream mirror (3× kbl, 2× mzr, leaf dxb) + Postgres logical replication | append_only | Verdicts are immutable evidence; chain hash continuity preserved per partition |
| firewall.rules | Postgres logical replication (kbl primary) | server_authoritative | Rule edits go through admin REST in kbl only; mzr is read-only |
| firewall.rule_versions | Same | append_only | Snapshot history |
| firewall.blocklist_entries | Postgres logical replication + NATS event | merge_on_unique | Multiple federation sources may report the same entry; merge sources JSONB array; recompute confidence_score after merge |
| firewall.blocklist_audit | Postgres logical replication | append_only | History |
| firewall.quarantine_queue | Region-local (NOT replicated) | region_authoritative | Quarantine review is region-local; on regional failover, in-flight holds in lost region are reconstructed from firewall.audit rows (verdict + encrypted payload reference) |
| firewall.mno_bind_registry | Postgres logical replication (kbl primary) | server_authoritative | Platform config |
| firewall.peer_aggregators, peer_asn_allowlist, peer_senderid_allowlist, peer_mno_routes | Postgres logical replication (kbl primary) | server_authoritative | Control-plane |
| firewall.peer_hygiene_scores | Region-local + nightly merge to kbl aggregator | max_of per window | Both regions independently compute; the worst score wins (conservative: degraded peers stay quarantined) |
| firewall.simbox_signals, firewall.ait_patterns | Postgres logical replication (consumer of fraud-intel) | merge_on_unique (originator / dst_msisdn_range) | Both regions consume fraud.detected.* independently; merge with last_seen_at = max(...), confidence = max(...) |
| firewall.dnd_snapshot | Region-local rebuild from consent.dnd.snapshot.v1 | server_authoritative (consent-ledger is canonical) | Both regions independently materialise the same snapshot URL |
| firewall.operating_mode | Region-local | region_authoritative | Mode is per region; cross-region coordination uses an explicit "set both regions PANIC" admin endpoint (not auto-replicated) |
| firewall.federation_log | Postgres logical replication | append_only | Audit |
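
As an illustration of the merge_on_unique policy for firewall.simbox_signals, the rule "last_seen_at = max(...), confidence = max(...)" might look like the sketch below. The `SimboxSignal` shape, the epoch-millisecond timestamp and the function name `mergeSignal` are assumptions made for the example, not the actual schema.

```typescript
// Assumed, simplified shape of a simbox signal row.
interface SimboxSignal {
  originator: string; // unique key named in the table above
  lastSeenAt: number; // epoch milliseconds (assumed representation)
  confidence: number; // assumed to be a score in [0, 1]
}

// Hypothetical merge for two regional copies of the same signal.
function mergeSignal(a: SimboxSignal, b: SimboxSignal): SimboxSignal {
  if (a.originator !== b.originator) {
    throw new Error("merge requires rows with the same unique key");
  }
  return {
    originator: a.originator,
    lastSeenAt: Math.max(a.lastSeenAt, b.lastSeenAt),
    confidence: Math.max(a.confidence, b.confidence),
  };
}
```

Because `max` is commutative, associative and idempotent, both regions reach the same row regardless of the order in which they see each other's updates, which is what lets this policy converge without coordination.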

3.1 Merge algorithm for blocklist_entries.sources

When two federation imports for the same (source, regulator_ref, type, value) arrive in different regions:

function mergeBlocklistEntry(local: Entry, remote: Entry): Entry {
  // Union the source lists, keeping one record per (sourceId, sourceType).
  const sources = dedupeBy(local.sources.concat(remote.sources),
    s => `${s.sourceId}:${s.sourceType}`);
  // Recompute the score from the merged sources before deriving autoApply.
  const confidenceScore = recomputeConfidence(sources);
  return {
    ...local,
    sources,
    confidenceScore,
    autoApply: confidenceScore >= 0.8,
    active: local.active && remote.active, // any deactivation wins
    updatedAt: max(local.updatedAt, remote.updatedAt)
  };
}
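
The merge function relies on a `dedupeBy` helper that is not shown in this document. A minimal sketch of one possible implementation follows; the first-occurrence-wins behaviour and the sample source records are assumptions for illustration only.

```typescript
// Hypothetical helper: keep the first item seen for each key, preserving order.
function dedupeBy<T>(items: T[], keyOf: (item: T) => string): T[] {
  const seen: Record<string, boolean> = Object.create(null);
  const out: T[] = [];
  for (const item of items) {
    const key = keyOf(item);
    if (!seen[key]) {
      seen[key] = true;
      out.push(item);
    }
  }
  return out;
}

// Assumed, simplified source record.
interface Source { sourceId: string; sourceType: string; }

const kblSources: Source[] = [{ sourceId: "reg-1", sourceType: "REGULATOR" }];
const mzrSources: Source[] = [
  { sourceId: "reg-1", sourceType: "REGULATOR" }, // duplicate of the kbl record
  { sourceId: "mno-7", sourceType: "PEER_MNO" },
];
const mergedSources = dedupeBy(kblSources.concat(mzrSources),
  s => `${s.sourceId}:${s.sourceType}`);
```

With this helper the merged list holds two records, one per distinct (sourceId, sourceType) pair, so importing the same report in both regions does not inflate the input to the confidence calculation.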

3.2 Region-failover behaviour

Per ADR-0004 §6:

  • kbl loss: mzr promotes its Postgres replica to primary; admin REST DNS swings to mzr (Cloudflare); rule edits resume in mzr. In-flight kbl quarantines are reconstructed from firewall.audit rows plus the encrypted payload archive in MinIO.
  • mzr loss: kbl continues unaffected; mzr-region MO/transit traffic is rebalanced to kbl-region connectors via service-mesh failover (cost: cross-region latency 8–15 ms additional on the firewall verdict path).
  • JetStream mirror: audit-event mirror lag P99 ≤ 10 s; alert FirewallAuditMirrorLagHigh at > 30 s.

4. Outbox pattern (transactional event publishing)

Every state mutation that produces a NATS event writes to firewall.outbox in the same Postgres transaction:

BEGIN;
INSERT INTO firewall.audit (...) VALUES (...);
INSERT INTO firewall.outbox (event_id, subject, payload, partition_key)
VALUES ($eventId, 'firewall.audit.v1', $payload, $mnoBindId);
COMMIT;

A relay worker (OutboxRelayWorker) polls firewall.outbox WHERE published_at IS NULL ORDER BY created_at LIMIT 1000 every 250 ms; on successful NATS ack, sets published_at = now(). After 7 days, published rows are pruned.

Guarantees:

  • No emitted event without a persisted state change.
  • No persisted state change goes without its emitted event for more than a few seconds.
  • Sticky partition routing via partition_key (e.g. mnoBindId) preserves ordering for downstream stateful consumers.
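
One polling pass of the relay described above could be sketched as follows. This is a simplified, in-memory illustration: `OutboxRow`, `Publish` and `relayOnce` are hypothetical names, the publish call is synchronous for brevity (a real NATS client would be asynchronous), and stopping at the first failed publish is an assumption made here to keep ordering simple.

```typescript
// Assumed in-memory view of a firewall.outbox row.
interface OutboxRow {
  eventId: string;
  subject: string;
  payload: string;
  partitionKey: string;
  createdAt: number;          // epoch milliseconds
  publishedAt: number | null; // null until the broker acknowledges
}

// Stand-in for a broker publish; returns true when the publish is acked.
type Publish = (row: OutboxRow) => boolean;

// One relay pass: oldest unpublished rows first, up to `limit` rows.
function relayOnce(outbox: OutboxRow[], publish: Publish, now: number, limit = 1000): number {
  const batch = outbox
    .filter(r => r.publishedAt === null)
    .sort((a, b) => a.createdAt - b.createdAt)
    .slice(0, limit);
  let published = 0;
  for (const row of batch) {
    if (!publish(row)) break; // leave the rest for the next tick
    row.publishedAt = now;    // mark only after a successful ack
    published++;
  }
  return published;
}
```

Because a row is marked only after the ack, a crash between the ack and the update leads to a re-publish on the next pass. That matches the at-least-once delivery listed for consumers in section 1, so consumers would be expected to deduplicate, for example on event_id.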

5. Replication lag targets & telemetry

| Path | Target | Alert threshold | Metric |
| --- | --- | --- | --- |
| firewall.rules (kbl → mzr) Postgres logical | P99 ≤ 5 s | > 30 s for 5 min | firewall_pg_replication_lag_seconds{stream='control'} |
| firewall.audit (kbl → mzr) NATS mirror | P99 ≤ 10 s | > 30 s for 5 min | firewall_jetstream_mirror_lag_seconds |
| firewall.blocklist_entries cross-region | P99 ≤ 15 s | > 60 s for 5 min | firewall_pg_replication_lag_seconds{stream='blocklist'} |
| Bloom filter rebuild after firewall.blocklist.changed.v1 | P99 ≤ 5 s | > 15 s | firewall_bloom_refresh_lag_seconds |
| Outbox unpublished depth | < 100 messages | > 10 000 for 1 min | firewall_outbox_unpublished_total |
| Conflict resolutions per minute | (none stated) | > 100/min sustained 5 min | firewall_conflict_resolved_total{aggregate,policy} |

6. Operator override procedure (cross-region merge conflict)

Although merge_on_unique is intended to converge automatically, an operator may need to manually resolve a contested blocklist entry (e.g. regulator and peer-MNO disagree on active status):

  1. NOC calls GET /v1/admin/firewall/blocklist/{entryId}/regions to compare per-region state.
  2. NOC calls POST /v1/admin/firewall/blocklist/{entryId}/resolve with body {decision: 'ACTIVATE'|'DEACTIVATE', reason}.
  3. Decision applies in kbl (primary) and replicates to mzr; emits firewall.blocklist.changed.v1 with actor='OPERATOR_OVERRIDE'.
  4. Audit row records both the pre-merge regional states and the resolution.

7. Schema stability guarantees

7.1 gRPC

| Field | Stability |
| --- | --- |
| FilterInboundRequest.* required fields | Stable |
| EvaluateTransitRequest.* required fields | Stable |
| Verdict.* | Stable |
| BlockReason enum | Additive only; callers MUST handle BLOCK_REASON_UNSPECIFIED |
| FirewallAction enum | Additive only |
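
One way a caller might tolerate an additive-only enum is to map any value it does not recognise to BLOCK_REASON_UNSPECIFIED and handle that case explicitly. The sketch below is one interpretation of the requirement, not the contract's prescribed behaviour; apart from BLOCK_REASON_UNSPECIFIED, the member names are hypothetical placeholders.

```typescript
// Values this (older) caller was compiled against.
// Only BLOCK_REASON_UNSPECIFIED comes from the contract; the rest are hypothetical.
const KNOWN_BLOCK_REASONS: string[] = [
  "BLOCK_REASON_UNSPECIFIED",
  "BLOCK_REASON_DND",
  "BLOCK_REASON_BLOCKLIST",
];

// Hypothetical helper: collapse unrecognised values to the UNSPECIFIED member
// so that a newly added reason never crashes an older caller.
function normaliseBlockReason(raw: string): string {
  return KNOWN_BLOCK_REASONS.indexOf(raw) >= 0 ? raw : "BLOCK_REASON_UNSPECIFIED";
}
```

A caller built this way keeps working when the firewall ships a new reason; it treats the verdict as blocked for an unspecified reason until it is upgraded.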

7.2 NATS events

| Aspect | Stability |
| --- | --- |
| firewall.audit.v1 field set | Additive (optional) is non-breaking |
| Removing a field | Requires firewall.audit.v2 + 90-day deprecation overlap |
| Hash-chain canonical JSON shape | STABLE; any change requires partition cutover at month boundary |

7.3 Postgres

| Aspect | Stability |
| --- | --- |
| firewall.audit column set | Additive non-null requires backfill migration |
| Hash-chain trigger | Versioned; mode upgrade requires partition rotation |
| Partition naming convention | firewall.audit_YYYY_MM; never renamed |

8. Versioning policy

  • gRPC: ghasi.sms.firewall.v1. Breaking → v2 package coexisting for ≥ 90 days.
  • NATS subjects: firewall.<topic>.v1. Breaking → .v2 with 90-day overlap.
  • REST: /v1/admin/firewall/*. Additive non-breaking; breaking → /v2/.
  • Postgres: forward-only migrations via Atlas/Drizzle; expand-then-contract.

9. Backpressure & flow control

  • A per-bind concurrency cap of 200 in-flight requests at the gRPC server prevents one runaway bind from starving healthy ones.
  • The gRPC server returns RESOURCE_EXHAUSTED when the cap is hit; the connector retries on a sibling firewall replica via DNS round-robin.
  • NATS consumers use MaxAckPending = 5000; durable consumer offsets are persisted per replica group.
  • Outbox relay backpressure: if NATS publish fails more than 1000 consecutive times, alert FirewallOutboxStalled (HIGH); an admin can manually flush via POST /v1/admin/firewall/internal/outbox/flush.
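
The per-bind cap in the first two bullets can be illustrated with a small counter-based limiter. This is a sketch under assumptions: the class name `PerBindLimiter` and its methods are hypothetical, and the real server may enforce the cap differently (for example in a gRPC interceptor).

```typescript
// Hypothetical limiter: at most `cap` in-flight requests per bind.
class PerBindLimiter {
  private inFlight: Record<string, number> = Object.create(null);
  private cap: number;

  constructor(cap: number = 200) {
    this.cap = cap;
  }

  // Returns false when the bind is at its cap; the server would then
  // answer with gRPC RESOURCE_EXHAUSTED instead of queueing the call.
  tryAcquire(bindId: string): boolean {
    const current = this.inFlight[bindId] || 0;
    if (current >= this.cap) return false;
    this.inFlight[bindId] = current + 1;
    return true;
  }

  // Must be called exactly once for every successful tryAcquire.
  release(bindId: string): void {
    const current = this.inFlight[bindId] || 0;
    if (current <= 1) delete this.inFlight[bindId];
    else this.inFlight[bindId] = current - 1;
  }
}
```

Counting per bind rather than globally is what keeps one runaway bind from consuming capacity that healthy binds need; rejecting immediately instead of queueing leaves the retry decision with the connector.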