Number Intelligence Service — Sync Contract

Version: 1.0
Status: Draft
Owner: Messaging Core
Last Updated: 2026-04-21
Companion: API_CONTRACTS · APPLICATION_LOGIC · SECURITY_MODEL · ADR-0004 §14

This document defines what other services depend on from number-intelligence-service, what it depends on from others, how its aggregates resolve concurrent updates, how it replicates across regions per ADR-0004, and the proto contract for the synchronous gRPC surface.


1. Consumers of number-intelligence-service

Service | Interface | Dependency type | SLA expectation
routing-engine | gRPC ResolveMsisdn, LookupPorting | Synchronous, per-MT pre-dispatch | P95 ≤ 5 ms; availability 99.95 %
sms-firewall-service | gRPC ResolveMsisdn, LookupEir | Synchronous, per-MO + per-MT | P95 ≤ 10 ms; availability 99.9 %
compliance-engine | gRPC ResolveMsisdn (for GEO_RESTRICTION rule) | Synchronous, per evaluation | P95 ≤ 15 ms; availability 99.9 %
channel-router-service | gRPC ResolveMsisdn (capability check lineType == MOBILE before SMS dispatch) | Synchronous | P95 ≤ 10 ms; availability 99.9 %
fraud-intel-service | gRPC ResolveMsisdn + LookupMsisdnImei + numint.hlr_probe.completed.v1, numint.mnp.divergence.v1 events | Sync + async | Sync P95 ≤ 15 ms; event lag P95 ≤ 5 s
sms-orchestrator | gRPC ResolveBatch for bulk pre-flight | Synchronous | P95 ≤ 80 ms (500 entries)
Tenant via Kong | REST /v1/lookup/* (Public Lookup API, billable) | Synchronous public-facing | P95 ≤ 200 ms; availability 99.5 %
regulator-portal-service | REST /v1/regulator/numint/mnp/* + numint.audit.* | Async event delivery | Event delivery P95 ≤ 5 s
billing-service | Consumes numint.lookup.billed.v1 | Async | Delivery P95 ≤ 5 s

1.1 Async contract semantics

ResolveMsisdn is synchronous, but every internal caller operates inside the platform's async outbound pipeline: tenants have already received 202 Accepted before any call reaches number-intelligence-service (NI). NI's fail-degraded behaviour (returning LOW/UNKNOWN confidence rather than an error) therefore does not violate any tenant-facing SLA. The worst case is sub-optimal routing that is corrected on the next reconciliation cycle.

Public Lookup API tenants, by contrast, see errors synchronously and MUST treat 5xx as retryable.
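
A tenant-side retry policy for the Public Lookup API might look like the sketch below. Only the rule that 5xx is retryable comes from this contract; the function names, the attempt cap and the backoff values are illustrative assumptions.

```typescript
// Hypothetical tenant-side helpers; names and tuning values are assumptions.

/** 5xx responses are retryable per this contract; other statuses are returned as-is. */
function isRetryableStatus(httpStatus: number): boolean {
  return httpStatus >= 500 && httpStatus <= 599;
}

/** Capped exponential backoff: 200 ms, 400 ms, 800 ms, ... up to maxDelayMs. */
function backoffDelayMs(attempt: number, baseMs = 200, maxDelayMs = 5000): number {
  return Math.min(maxDelayMs, baseMs * 2 ** attempt);
}

type LookupResponse = { httpStatus: number; body: unknown };

/** Calls `doLookup` until it returns a non-5xx response or the attempts run out. */
async function lookupWithRetry(
  doLookup: () => Promise<LookupResponse>,
  maxAttempts = 4,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<LookupResponse> {
  let last = await doLookup();
  for (let attempt = 0; attempt < maxAttempts - 1 && isRetryableStatus(last.httpStatus); attempt++) {
    await sleep(backoffDelayMs(attempt));
    last = await doLookup();
  }
  return last; // may still be a 5xx if every attempt failed
}
```

Injecting `sleep` keeps the loop testable without real timers; a production client would probably also add jitter to the delay.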

1.2 Internal-caller defence-in-depth pattern

// inside routing-engine MT pipeline
let attribution: MsisdnAttribution;
try {
  attribution = await numintClient.resolveMsisdn(
    { e164, opts: { maxStalenessSeconds: 86400 }, traceId },
    { deadline: 30 /* ms */ }
  );
} catch (err) {
  // UNAVAILABLE / DEADLINE_EXCEEDED → fall back to prefix table
  attribution = prefixTable.lookup(e164); // local, in-memory
}

if (attribution.confidence === 'UNKNOWN') {
  // still acceptable for routing — prefix table will have given a default MNO
  metrics.inc('routing_engine_numint_unknown_total');
}

sms-firewall-service uses a similar shape with an 80 ms deadline; compliance-engine with a 50 ms deadline inside its 450 ms evaluation budget.


2. Dependencies of number-intelligence-service

Dependency | Interface | Failure mode if unavailable
PostgreSQL numint schema | Read/write via PgBouncer | Hot path falls back to Redis + prefix-table; writes fail with 503; MNP reconciliation pauses
Redis (cluster, DB 5) | GET/SET + Lua for token buckets | Cascade falls through to PG (latency degrades from 5 ms → 15 ms P95); distributed locks cannot be acquired (workers skip cycle)
NATS JetStream | Outbox publishes; minimal consumption (operator.config.changed.v1, billing.tenant.plan.changed.v1) | Outbox accumulates; events delayed; hot path unaffected
ni-hlr-gateway DaemonSet | gRPC LiveLookup(e164) | Live probes return ADAPTER_DOWN; callers receive stale/LOW-confidence answer
Per-MNO HLR/HSS endpoint (SS7/MAP or REST) | SIGTRAN M3UA/SCTP (MAP SRI_SM per 3GPP TS 29.002) or HTTPS REST | TIMEOUT / MAP_ABORT / REST_5XX; fall back to last-known persisted attribution
MNO MNP SFTP | Daily file fetch | Reconciliation run fails; NumIntMnpReconciliationStale after 24 h; MNP-overlay accuracy degrades
ATRA / per-MNO CEIR SFTP | Daily EIR file fetch | EIR sync stale; LookupEir returns UNKNOWN for newly-flagged IMEIs
Vault (PKI, KV) | mTLS certs, MSISDN pepper, per-tenant salts, PCAP KEK | Service refuses to boot without TLS; cached pepper covers ≤ 15 min outage
MinIO / S3 | MNP raw archive, HLR PCAP samples, audit cold archive | Archive writes queued; MNP run still commits to PG; archive lag alert
operator-management-service | NATS operator.config.changed.v1 | Adapter config drifts until event flows; alert on event-lag
billing-service | NATS billing.tenant.plan.changed.v1 | Quota snapshot drifts ≤ 60 s; minor tenant-facing surprise on plan-change
auth-service | JWT introspection via Kong | Kong caches JWKS; short outage transparent
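
The hot-path degradation described for PostgreSQL and Redis can be pictured as a tiered cascade that skips an unavailable tier rather than failing the call. The sketch below is a minimal in-memory model under that assumption; the tier interface and all names are hypothetical, since the service's storage API is not specified in this document.

```typescript
// Hypothetical tier model; the real storage clients are not described here.
type CascadeAttribution = { mno: string; confidence: string; source: string };

/** Returns a hit, `undefined` on a miss, or throws when the tier is unavailable. */
type TierLookup = (e164: string) => CascadeAttribution | undefined;

function resolveThroughCascade(
  e164: string,
  tiers: { name: string; lookup: TierLookup }[],
  prefixFallback: (e164: string) => CascadeAttribution,
): CascadeAttribution {
  for (const tier of tiers) {
    try {
      const hit = tier.lookup(e164);
      if (hit !== undefined) return hit;
    } catch {
      // Tier unavailable: fall through to the next tier instead of failing the call.
    }
  }
  // Every tier missed or was down: answer from the local prefix table.
  return prefixFallback(e164);
}
```

With tiers ordered Redis then PostgreSQL, a Redis outage falls through to PostgreSQL and a PostgreSQL outage leaves Redis plus the prefix table, which matches the two failure modes in the table.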

3. Proto Definition

See API_CONTRACTS §1 for the complete proto. Reproduced here is the core hot-path:

syntax = "proto3";
package ghasi.sms.numint.v1;
option go_package = "github.com/ghasi/sms-gateway/numint/v1";

import "google/protobuf/timestamp.proto";

service NumberIntelligenceService {
  rpc ResolveMsisdn    (ResolveMsisdnRequest)    returns (MsisdnAttribution);
  rpc ResolveBatch     (ResolveBatchRequest)     returns (stream MsisdnAttribution);
  rpc ProbeHlr         (ProbeHlrRequest)         returns (HlrProbeResult);
  rpc LookupPorting    (LookupPortingRequest)    returns (PortingStatus);
  rpc LookupEir        (LookupEirRequest)        returns (EirStatus);
  rpc GetMnpHistory    (GetMnpHistoryRequest)    returns (MnpHistory);
  rpc LookupMsisdnImei (LookupMsisdnImeiRequest) returns (MsisdnImeiLink);
}

Full enums and message bodies are in API_CONTRACTS §1 and are not duplicated here.


4. Per-aggregate conflict policy

Per ADR-0004 §14, number-intelligence-service holds control-plane-adjacent data: reconciliation state is regional (Kabul runs jobs), but the attribution tables and Redis cache are hot-read from any region. The replication posture is therefore active-active with per-job leader election rather than strict primary-standby.

Aggregate | Policy | Rationale
NumberRecord | server_authoritative with source-priority + monotonic version | Sources ranked: ADMIN_OVERRIDE > MNP_RECON > LIVE_HLR_MAP > LIVE_HLR_REST > MNO_HLR_DUMP > POSTGRES > PREFIX_FALLBACK. Updates conditional on version = :expected. Concurrent writers serialise on pg_advisory_xact_lock(hashtext(msisdn_hash)).
PortabilityRecord | append_only with chain ordering | Per-MSISDN seq monotonic; chain hash inviolable. Duplicate (msisdn_hash, port_date, recipient_mno, source_feed) is a no-op (INSERT … ON CONFLICT DO NOTHING).
LookupAuditEntry | append_only with per-partition chain | Partition-scoped seq; advisory lock on partition_name.
EirRecord | server_authoritative with most-restrictive merge | Multiple reporters (ATRA + MNOs) may disagree on status; effective_status = max_restriction(statuses).
ReconciliationRun | singleton per (mno, date) | Distributed Redis lock numint:lock:mnp_recon:{mnoId} enforces exclusivity
ReconciliationConflict | server_authoritative | One row per unresolved conflict; idempotent on (msisdn_hash, candidate_a, candidate_b). Admin resolution is the single writer of resolution.
MnoSnapshot | last-write-wins on config_version | Mirrors operator-management-service; version bumps monotonically; older versions ignored.
TenantLookupQuota | last-write-wins on plan_version | Mirrors billing-service.
HlrProbe | append_only | Probe ledger is audit-class; never updated.
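
The NumberRecord and EirRecord policies can be sketched as two pure functions. The source ranking is copied from the table. Everything else is an assumption for illustration: that an equal-priority source may overwrite, that a rejected write simply leaves the record unchanged, and the EIR status names (WHITE, GREY, BLACK), which this document does not define.

```typescript
// Source ranking from the NumberRecord row, highest priority first.
const SOURCE_PRIORITY = [
  "ADMIN_OVERRIDE", "MNP_RECON", "LIVE_HLR_MAP", "LIVE_HLR_REST",
  "MNO_HLR_DUMP", "POSTGRES", "PREFIX_FALLBACK",
] as const;
type AttributionSource = (typeof SOURCE_PRIORITY)[number];

interface NumberRecordState { mno: string; source: AttributionSource; version: number }
interface NumberRecordUpdate { mno: string; source: AttributionSource; expectedVersion: number }

function sourceRank(source: AttributionSource): number {
  return SOURCE_PRIORITY.indexOf(source); // 0 = most authoritative
}

/** Applies an update only if the version matches and the source is not lower-priority. */
function applyNumberRecordUpdate(
  current: NumberRecordState,
  update: NumberRecordUpdate,
): NumberRecordState {
  if (update.expectedVersion !== current.version) return current; // stale writer loses
  if (sourceRank(update.source) > sourceRank(current.source)) return current; // lower priority loses
  return { mno: update.mno, source: update.source, version: current.version + 1 };
}

// Assumed EIR statuses, ordered least to most restrictive (hypothetical names).
const EIR_RESTRICTION_ORDER = ["WHITE", "GREY", "BLACK"] as const;
type EirListStatus = (typeof EIR_RESTRICTION_ORDER)[number];

/** Most-restrictive merge across reporters; undefined when nobody has reported. */
function maxRestriction(statuses: EirListStatus[]): EirListStatus | undefined {
  let worst: EirListStatus | undefined;
  for (const s of statuses) {
    if (worst === undefined ||
        EIR_RESTRICTION_ORDER.indexOf(s) > EIR_RESTRICTION_ORDER.indexOf(worst)) {
      worst = s;
    }
  }
  return worst;
}
```

In the service itself the version check would be part of the conditional UPDATE under the advisory lock; the functions here only model the decision.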

4.1 Outbox pattern

Every state mutation writes a row to numint.outbox in the same transaction as the source change. OutboxRelay (continuous worker; per-replica with SELECT … FOR UPDATE SKIP LOCKED):

  1. Picks up to 200 unpublished rows ordered by created_at.
  2. Publishes to NATS with Nats-Msg-Id: event_id for consumer-side dedup.
  3. Updates published_at on success; increments attempts and stores last_error on failure.
  4. After 3 attempts → NumIntOutboxStuck alert (rows remain for SRE inspection; never auto-discarded).
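
The relay steps above can be modelled in memory as follows. This is a sketch under stated assumptions: the row shape and function names are hypothetical, `publish` stands in for the NATS publish call and is synchronous for brevity, and rows that reach the alert threshold keep being retried because the text only says they are never discarded.

```typescript
// In-memory stand-in for numint.outbox rows; field names are assumptions.
interface OutboxRow {
  eventId: string;
  createdAt: number;            // epoch ms
  publishedAt: number | null;
  attempts: number;
  lastError: string | null;
}

const OUTBOX_BATCH_SIZE = 200;
const OUTBOX_STUCK_AFTER_ATTEMPTS = 3;

/** Processes one batch and returns the event ids that should raise NumIntOutboxStuck. */
function relayOutboxBatch(
  rows: OutboxRow[],
  publish: (eventId: string) => void, // hypothetical publisher; throws on failure
  now: () => number = Date.now,
): string[] {
  // Step 1: up to 200 unpublished rows, oldest first.
  const batch = rows
    .filter((r) => r.publishedAt === null)
    .sort((a, b) => a.createdAt - b.createdAt)
    .slice(0, OUTBOX_BATCH_SIZE);

  const stuck: string[] = [];
  for (const row of batch) {
    try {
      publish(row.eventId);        // Step 2: event_id doubles as Nats-Msg-Id
      row.publishedAt = now();     // Step 3: mark success
    } catch (err) {
      row.attempts += 1;           // Step 3: record the failure
      row.lastError = err instanceof Error ? err.message : String(err);
      if (row.attempts >= OUTBOX_STUCK_AFTER_ATTEMPTS) {
        stuck.push(row.eventId);   // Step 4: alert; the row is never discarded
      }
    }
  }
  return stuck;
}
```

Because the message id is the event id, republishing a row after a crash between steps 2 and 3 is safe as long as consumers or the stream deduplicate on it.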

4.2 Cross-region replication topology

Per ADR-0004 §14 — NI is active-active across Kabul and Mazar. Unlike consent-ledger-service (strict primary-standby), NI's attribution data is effectively public and strongly convergent under MNP-file reconciliation, so both regions can serve hot reads independently.

        Kabul (af-kabul-1)              Mazar (af-mzr-1)               Dubai (ae-dxb-1, cold-DR)
       ┌──────────────────┐            ┌──────────────────┐            ┌──────────────────┐
       │ Postgres 16      │ streaming  │ Postgres 16      │  logical   │ Postgres 16      │
RW ──▶ │ numint schema    │ ◄──────▶   │ numint schema    │  ──────▶   │ audit-only       │
RW ──▶ │ (MNP jobs here)  │   sync     │ (failover RW)    │   async    │ (AES-GCM wrapped │
       │ NATS cluster A   │            │ NATS cluster B   │            │ backups only;    │
       │ Redis cluster    │  (local)   │ Redis cluster    │  (local)   │ keys stay kbl)   │
       │ ni-hlr-gateway   │            │ ni-hlr-gateway   │            │ no hot service   │
       └──────────────────┘            └──────────────────┘            └──────────────────┘
                 ▲                               ▲
                 │                               │
                 └────── MNP recon leader ───────┘
                   (Kabul wins; Mazar warm-idle)

4.3 MNP reconciliation leader election

MNP reconciliation is batch-exclusive: only one region runs the job per MNO per day. Leadership is held through a Redis SET NX EX lock in the Kabul cluster; Mazar picks up the job if and only if the Kabul lock cannot be acquired for > 10 minutes.

On Kabul-region isolation, Mazar promotes to both (a) hot-read primary and (b) reconciliation leader. Manual cutover by on-call is a no-op for the hot path because Mazar was already serving hot reads.
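
The lock and takeover rule can be illustrated with a small in-memory model. `FakeRedisLock` only imitates the semantics of Redis `SET key value NX EX ttl` and is not a real client. The reading that Mazar measures how long the Kabul lock cluster has been unreachable is an assumption, as are all names below.

```typescript
/** Minimal in-memory model of `SET key value NX EX ttl`; for illustration only. */
class FakeRedisLock {
  private entries = new Map<string, { owner: string; expiresAtMs: number }>();

  /** Returns true if the lock was acquired, false if another owner still holds it. */
  setNxEx(key: string, owner: string, ttlSeconds: number, nowMs: number): boolean {
    const held = this.entries.get(key);
    if (held !== undefined && held.expiresAtMs > nowMs) return false; // NX: key exists
    this.entries.set(key, { owner, expiresAtMs: nowMs + ttlSeconds * 1000 }); // EX: TTL
    return true;
  }
}

const MNP_TAKEOVER_AFTER_MS = 10 * 60 * 1000;

/** Lock key shape taken from the ReconciliationRun row in section 4. */
function mnpReconLockKey(mnoId: string): string {
  return `numint:lock:mnp_recon:${mnoId}`;
}

/**
 * Mazar-side decision (assumed interpretation): take over only once the Kabul
 * lock has been unobtainable for strictly more than 10 minutes.
 */
function mazarShouldLead(kabulUnreachableSinceMs: number | null, nowMs: number): boolean {
  return kabulUnreachableSinceMs !== null &&
    nowMs - kabulUnreachableSinceMs > MNP_TAKEOVER_AFTER_MS;
}
```

The TTL on the lock matters as much as the NX flag: if the Kabul leader crashes mid-run, the key expires and the next cycle can acquire it without manual cleanup.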

4.4 Failover semantics

Failure | Detection | Action
Kabul Postgres primary down | Patroni + etcd consensus | Promote Kabul standby; RTO ≤ 90 s
Kabul region isolated (network partition) | Cross-region heartbeat | Mazar continues hot path unchanged; MNP jobs pause until Kabul returns OR > 10 min elapses → Mazar promotes to MNP leader
HLR gateway adapter failure | numint_hlr_adapter_health{mno} goes 0 | Route probes to sibling pod; alert NumIntHlrAdapterDown
MNP SFTP unreachable | HTTP fetch error metric | Retry hourly until 23:00; escalate to P1
Cross-region divergence detected | Cross-region audit verifier cron | CRITICAL alert; freeze writes (manual)

5. Schema stability guarantees

5.1 gRPC proto

Field | Stability
ResolveMsisdnRequest.e164, scope | Stable; required forever
MsisdnAttribution.mno, line_type, mnp_status, source, confidence, tier | Stable
All enums | Stable; new values may be added; callers MUST handle *_UNSPECIFIED/*_UNKNOWN as a no-op default
MsisdnAttribution.risk_flags | Stable as a list; new enum values may appear
New fields with proto3 defaults | Non-breaking
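
On the caller side, the enum rule amounts to collapsing any value the caller does not recognise into the unspecified member and treating that member as a no-op. The sketch below assumes a line-type enum; only MOBILE appears in this document, so the other value names and the action names are hypothetical.

```typescript
// Hypothetical subset of line-type values; only MOBILE is named in this document.
const KNOWN_LINE_TYPES = ["LINE_TYPE_UNSPECIFIED", "MOBILE", "FIXED_LINE"] as const;
type KnownLineType = (typeof KNOWN_LINE_TYPES)[number];

/** Maps any wire value, including ones added after this caller was built, to a known member. */
function normaliseLineType(wireValue: string): KnownLineType {
  const known = (KNOWN_LINE_TYPES as readonly string[]).indexOf(wireValue) >= 0;
  return known ? (wireValue as KnownLineType) : "LINE_TYPE_UNSPECIFIED";
}

/** Example consumer: unknown or unspecified values change nothing. */
function lineTypeAction(wireValue: string): "ALLOW_SMS" | "SKIP_SMS" | "NO_OP" {
  switch (normaliseLineType(wireValue)) {
    case "MOBILE":
      return "ALLOW_SMS";
    case "FIXED_LINE":
      return "SKIP_SMS";
    default:
      return "NO_OP"; // *_UNSPECIFIED and any future value
  }
}
```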

5.2 REST API

  • /v1/lookup/*, /v1/admin/numint/*, /v1/regulator/numint/* are stable within v1.
  • Breaking changes require /v2/* with a 90-day deprecation window.
  • Public Lookup API is tenant-facing and therefore has additional stability requirements — response JSON keys are frozen within v1; deprecated fields are tombstoned (retained in response as null) rather than removed.
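
Tombstoning a deprecated field could be as simple as the following sketch, which keeps the key in the response and sets its value to null. The helper and the field names in the usage note are hypothetical; the actual response schema lives in the OpenAPI document.

```typescript
/** Returns a copy of the payload with deprecated keys retained but set to null. */
function tombstoneFields(
  payload: Record<string, unknown>,
  deprecatedKeys: string[],
): Record<string, unknown> {
  const out: Record<string, unknown> = { ...payload };
  for (const key of deprecatedKeys) {
    out[key] = null; // key stays present so v1 clients that read it do not break
  }
  return out;
}
```

For instance, `tombstoneFields({ mno: "mno-a", legacyCarrier: "x" }, ["legacyCarrier"])` yields an object that still has a `legacyCarrier` key, now with a null value.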

5.3 Event subjects

Per EVENT_SCHEMAS §4.


6. Versioning policy

  • gRPC package: ghasi.sms.numint.v1. Major bump → coordinated migration plan.
  • REST: /v1/lookup/*, /v1/admin/numint/*. OpenAPI document at /v1/numint/openapi.json is the contract source of truth.
  • Contract tests: Pact (tenant REST), gRPC reflection-based contract tests for routing-engine, sms-firewall-service, compliance-engine, channel-router-service, fraud-intel-service. Run on every PR; failures block merge.

7. Fail-degraded vs Fail-closed semantics

Consumers MUST NOT treat NI as fail-closed. The correct caller pattern is:

  1. ResolveMsisdn returns OK with confidence = UNKNOWN → use caller-side fallback (prefix table).
  2. ResolveMsisdn returns UNAVAILABLE / DEADLINE_EXCEEDED → use caller-side fallback.
  3. ResolveMsisdn returns INVALID_ARGUMENT → this is a caller bug; surface the validation error upstream.
  4. LookupEir returning UNKNOWN for an unknown IMEI is not an error — it is a legitimate result.
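
The ResolveMsisdn cases in the list above can be written as a single decision function. This is a sketch with hypothetical type and action names. The default branch, which degrades on any status code this contract does not mention, is an assumption consistent with the rule that NI is never fail-closed.

```typescript
type NumintOutcome =
  | { kind: "OK"; confidence: string }
  | { kind: "ERROR"; code: string }; // gRPC status code name

type CallerAction = "USE_RESULT" | "USE_PREFIX_FALLBACK" | "SURFACE_VALIDATION_ERROR";

function decideCallerAction(outcome: NumintOutcome): CallerAction {
  if (outcome.kind === "OK") {
    // Case 1: OK with UNKNOWN confidence still means "use the caller-side fallback".
    return outcome.confidence === "UNKNOWN" ? "USE_PREFIX_FALLBACK" : "USE_RESULT";
  }
  switch (outcome.code) {
    case "UNAVAILABLE":
    case "DEADLINE_EXCEEDED":
      return "USE_PREFIX_FALLBACK";       // Case 2: transient failure
    case "INVALID_ARGUMENT":
      return "SURFACE_VALIDATION_ERROR";  // Case 3: caller bug, do not mask it
    default:
      return "USE_PREFIX_FALLBACK";       // Assumption: degrade rather than fail closed
  }
}
```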

The one hard contract: LookupPorting MUST NOT return a stale isPorted = false for an MSISDN whose PortabilityRecord has just been inserted; the MNP overlay step in UC-Lookup enforces this.


8. Cross-service invariants

  1. Authoritative source. No other service may query an MNO HLR directly, ingest an MNO MNP file, or run a parallel attribution table.
  2. Confidence floor. Callers using confidence: UNKNOWN as the sole basis for a regulatory decision (compliance-engine GEO_RESTRICTION specifically) MUST treat it as the most-restrictive class. Documented in compliance-engine DOMAIN_MODEL.
  3. Tenant-salt pairing. Per-tenant salt for lookup_audit.msisdn_hash is the single mechanism preventing cross-tenant audit correlation. Salt rotation is a coordinated operation documented in SECURITY_MODEL §3.2.
  4. Backward compatibility. Schema changes follow the platform's evolution policy (add-only fields; new RPCs / subjects for breaking changes).
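
The confidence floor in invariant 2 can be illustrated with a small decision function. The decision names and the `countryAllowed` input are hypothetical; the authoritative rule is in the compliance-engine DOMAIN_MODEL.

```typescript
type GeoDecision = "ALLOW" | "BLOCK"; // assumed classes; BLOCK is the most restrictive

interface GeoRestrictionInput {
  confidence: string;      // MsisdnAttribution.confidence as returned by NI
  countryAllowed: boolean; // hypothetical result of the rule's own country check
}

/** UNKNOWN confidence is never enough to allow: it maps to the most-restrictive class. */
function geoRestrictionDecision(input: GeoRestrictionInput): GeoDecision {
  if (input.confidence === "UNKNOWN") return "BLOCK";
  return input.countryAllowed ? "ALLOW" : "BLOCK";
}
```

This is the one place where the fail-degraded default of section 7 is deliberately not applied: routing may guess, a regulatory decision may not.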