Number Intelligence Service — Sync Contract
Version: 1.0 | Status: Draft | Owner: Messaging Core | Last Updated: 2026-04-21 | Companion: API_CONTRACTS · APPLICATION_LOGIC · SECURITY_MODEL · ADR-0004 §14
This document defines what other services depend on from number-intelligence-service, what it depends on from others, how its aggregates resolve concurrent updates, how it replicates across regions per ADR-0004, and the proto contract for the synchronous gRPC surface.
1. Consumers of number-intelligence-service
| Service | Interface | Dependency type | SLA expectation |
|---|---|---|---|
| routing-engine | gRPC ResolveMsisdn, LookupPorting | Synchronous, per-MT pre-dispatch | P95 ≤ 5 ms; availability 99.95 % |
| sms-firewall-service | gRPC ResolveMsisdn, LookupEir | Synchronous, per-MO + per-MT | P95 ≤ 10 ms; availability 99.9 % |
| compliance-engine | gRPC ResolveMsisdn (for GEO_RESTRICTION rule) | Synchronous, per evaluation | P95 ≤ 15 ms; availability 99.9 % |
| channel-router-service | gRPC ResolveMsisdn (capability check lineType == MOBILE before SMS dispatch) | Synchronous | P95 ≤ 10 ms; availability 99.9 % |
| fraud-intel-service | gRPC ResolveMsisdn + LookupMsisdnImei + numint.hlr_probe.completed.v1, numint.mnp.divergence.v1 events | Sync + async | Sync P95 ≤ 15 ms; event lag P95 ≤ 5 s |
| sms-orchestrator | gRPC ResolveBatch for bulk pre-flight | Synchronous | P95 ≤ 80 ms (500 entries) |
| Tenant via Kong | REST /v1/lookup/* (Public Lookup API — billable) | Synchronous public-facing | P95 ≤ 200 ms; availability 99.5 % |
| regulator-portal-service | REST /v1/regulator/numint/mnp/* + numint.audit.* | Async event delivery | Event delivery P95 ≤ 5 s |
| billing-service | Consumes numint.lookup.billed.v1 | Async | Delivery P95 ≤ 5 s |
1.1 Async contract semantics
ResolveMsisdn is synchronous, but every internal caller operates inside the platform's async outbound pipeline: tenants have already received 202 Accepted before any call reaches NI. NI's fail-degraded behaviour (returning LOW/UNKNOWN confidence rather than an error) therefore does not violate any tenant-facing SLA. The worst case is sub-optimal routing that is corrected on the next reconciliation cycle.
Public Lookup API tenants, by contrast, see errors synchronously and MUST treat 5xx as retryable.
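A tenant-side client could express the "5xx is retryable" rule as a small policy function kept separate from the HTTP call itself. The following is a minimal sketch under stated assumptions: the names isRetryable, backoffDelayMs and nextAction, the attempt cap of 4, and the backoff constants are all illustrative and not part of this contract.

```typescript
// Hypothetical tenant-side retry policy for the Public Lookup API.
// Names, attempt cap and backoff constants are illustrative assumptions.

type RetryDecision =
  | { retry: true; delayMs: number }
  | { retry: false };

// The contract says 5xx MUST be treated as retryable.
function isRetryable(status: number): boolean {
  return status >= 500 && status <= 599;
}

// Exponential backoff with a cap; attempt is 1-based (the attempt that just failed).
function backoffDelayMs(attempt: number, baseMs: number = 100, capMs: number = 2000): number {
  return Math.min(capMs, baseMs * Math.pow(2, attempt - 1));
}

// Decide what to do after attempt number `attempt` returned `status`.
function nextAction(status: number, attempt: number, maxAttempts: number = 4): RetryDecision {
  if (!isRetryable(status) || attempt >= maxAttempts) {
    return { retry: false };
  }
  return { retry: true, delayMs: backoffDelayMs(attempt) };
}
```

The caller would sleep for delayMs and reissue the request while retry is true, and surface the last response once it turns false.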
1.2 Internal-caller defence-in-depth pattern
// inside routing-engine MT pipeline
let attribution: MsisdnAttribution;
try {
  attribution = await numintClient.resolveMsisdn(
    { e164, opts: { maxStalenessSeconds: 86400 }, traceId },
    { deadline: 30 /* ms */ }
  );
} catch (err) {
  // UNAVAILABLE / DEADLINE_EXCEEDED → fall back to prefix table
  attribution = prefixTable.lookup(e164); // local, in-memory
}
if (attribution.confidence === 'UNKNOWN') {
  // still acceptable for routing: the prefix table will have given a default MNO
  metrics.inc('routing_engine_numint_unknown_total');
}
sms-firewall-service uses a similar shape with an 80 ms deadline; compliance-engine with a 50 ms deadline inside its 450 ms evaluation budget.
2. Dependencies of number-intelligence-service
| Dependency | Interface | Failure mode if unavailable |
|---|---|---|
| PostgreSQL numint schema | Read/write via PgBouncer | Hot path falls back to Redis + prefix-table; writes fail with 503; MNP reconciliation pauses |
| Redis (cluster, DB 5) | GET/SET + Lua for token buckets | Cascade falls through to PG (latency degrades from 5 ms → 15 ms P95); distributed locks cannot be acquired (workers skip cycle) |
| NATS JetStream | Outbox publishes; minimal consumption (operator.config.changed.v1, billing.tenant.plan.changed.v1) | Outbox accumulates; events delayed; hot path unaffected |
| ni-hlr-gateway DaemonSet | gRPC LiveLookup(e164) | Live probes return ADAPTER_DOWN; callers receive stale/LOW-confidence answer |
| Per-MNO HLR/HSS endpoint (SS7/MAP or REST) | SIGTRAN M3UA/SCTP (MAP SRI_SM per 3GPP TS 29.002) or HTTPS REST | TIMEOUT / MAP_ABORT / REST_5XX; fall back to last-known persisted attribution |
| MNO MNP SFTP | Daily file fetch | Reconciliation run fails; NumIntMnpReconciliationStale after 24 h; MNP-overlay accuracy degrades |
| ATRA / per-MNO CEIR SFTP | Daily EIR file fetch | EIR sync stale; LookupEir returns UNKNOWN for newly-flagged IMEIs |
| Vault (PKI, KV) | mTLS certs, MSISDN pepper, per-tenant salts, PCAP KEK | Service refuses to boot without TLS; cached pepper covers ≤ 15 min outage |
| MinIO / S3 | MNP raw archive, HLR PCAP samples, audit cold archive | Archive writes queued; MNP run still commits to PG; archive lag alert |
| operator-management-service | NATS operator.config.changed.v1 | Adapter config drifts until event flows; alert on event-lag |
| billing-service | NATS billing.tenant.plan.changed.v1 | Quota snapshot drifts ≤ 60 s; minor tenant-facing surprise on plan-change |
| auth-service | JWT introspection via Kong | Kong caches JWKS; short outage transparent |
3. Proto Definition
See API_CONTRACTS §1 for the complete proto. Reproduced here is the core hot-path:
syntax = "proto3";
package ghasi.sms.numint.v1;
option go_package = "github.com/ghasi/sms-gateway/numint/v1";
import "google/protobuf/timestamp.proto";
service NumberIntelligenceService {
  rpc ResolveMsisdn    (ResolveMsisdnRequest)    returns (MsisdnAttribution);
  rpc ResolveBatch     (ResolveBatchRequest)     returns (stream MsisdnAttribution);
  rpc ProbeHlr         (ProbeHlrRequest)         returns (HlrProbeResult);
  rpc LookupPorting    (LookupPortingRequest)    returns (PortingStatus);
  rpc LookupEir        (LookupEirRequest)        returns (EirStatus);
  rpc GetMnpHistory    (GetMnpHistoryRequest)    returns (MnpHistory);
  rpc LookupMsisdnImei (LookupMsisdnImeiRequest) returns (MsisdnImeiLink);
}
Full enums and message bodies are in API_CONTRACTS §1 and are not duplicated here.
4. Per-aggregate conflict policy
Per ADR-0004 §14, number-intelligence-service holds control-plane-adjacent data: reconciliation state is regional (Kabul runs jobs), but the attribution tables and Redis cache are hot-read from any region. The replication posture is therefore active-active with per-job leader election rather than strict primary-standby.
| Aggregate | Policy | Rationale |
|---|---|---|
| NumberRecord | server_authoritative with source-priority + monotonic version | Sources ranked: ADMIN_OVERRIDE > MNP_RECON > LIVE_HLR_MAP > LIVE_HLR_REST > MNO_HLR_DUMP > POSTGRES > PREFIX_FALLBACK. Updates conditional on version = :expected. Concurrent writers serialise on pg_advisory_xact_lock(hashtext(msisdn_hash)). |
| PortabilityRecord | append_only with chain ordering | Per-MSISDN seq monotonic; chain hash inviolable. Duplicate (msisdn_hash, port_date, recipient_mno, source_feed) is a no-op (INSERT … ON CONFLICT DO NOTHING). |
| LookupAuditEntry | append_only with per-partition chain | Partition-scoped seq; advisory lock on partition_name. |
| EirRecord | server_authoritative with most-restrictive merge | Multiple reporters (ATRA + MNOs) may disagree on status; effective_status = max_restriction(statuses). |
| ReconciliationRun | singleton per (mno, date) | Distributed Redis lock numint:lock:mnp_recon:{mnoId} enforces exclusivity |
| ReconciliationConflict | server_authoritative | One row per unresolved conflict; idempotent on (msisdn_hash, candidate_a, candidate_b). Admin resolution is the single writer of resolution. |
| MnoSnapshot | last-write-wins on config_version | Mirrors operator-management-service; version bumps monotonically; older versions ignored. |
| TenantLookupQuota | last-write-wins on plan_version | Mirrors billing-service. |
| HlrProbe | append_only | Probe ledger is audit-class; never updated. |
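Two of the policies in the table, the NumberRecord source-priority check and the EirRecord most-restrictive merge, can be sketched as pure functions. This is an illustration under assumptions rather than the service's implementation: the source ranking is taken from the table, but the type names, the rule that a lower-ranked source never overwrites a higher-ranked one, and the EIR status names with their ordering are hypothetical.

```typescript
// Sketch of two conflict policies. Only the source ranking and the
// "version = :expected" condition come from the document.

type Source =
  | "ADMIN_OVERRIDE" | "MNP_RECON" | "LIVE_HLR_MAP" | "LIVE_HLR_REST"
  | "MNO_HLR_DUMP" | "POSTGRES" | "PREFIX_FALLBACK";

// Highest priority first, as ranked in the table.
const SOURCE_RANKING: Source[] = [
  "ADMIN_OVERRIDE", "MNP_RECON", "LIVE_HLR_MAP", "LIVE_HLR_REST",
  "MNO_HLR_DUMP", "POSTGRES", "PREFIX_FALLBACK",
];

interface NumberRecord { mno: string; source: Source; version: number; }
interface NumberRecordUpdate { mno: string; source: Source; expectedVersion: number; }

// Returns the new record, or null when the write is rejected.
function applyUpdate(current: NumberRecord, update: NumberRecordUpdate): NumberRecord | null {
  // Optimistic concurrency: mirrors "version = :expected".
  if (update.expectedVersion !== current.version) return null;
  // Assumed rule: a lower-ranked source never overwrites a higher-ranked one.
  const incoming = SOURCE_RANKING.indexOf(update.source);
  const existing = SOURCE_RANKING.indexOf(current.source);
  if (incoming > existing) return null;
  return { mno: update.mno, source: update.source, version: current.version + 1 };
}

// Hypothetical EIR statuses, least to most restrictive.
type EirStatus = "UNKNOWN" | "WHITELISTED" | "GREYLISTED" | "BLACKLISTED";
const EIR_RESTRICTION: EirStatus[] = ["UNKNOWN", "WHITELISTED", "GREYLISTED", "BLACKLISTED"];

// effective_status = max_restriction(statuses)
function maxRestriction(statuses: EirStatus[]): EirStatus {
  let effective: EirStatus = "UNKNOWN";
  for (const s of statuses) {
    if (EIR_RESTRICTION.indexOf(s) > EIR_RESTRICTION.indexOf(effective)) effective = s;
  }
  return effective;
}
```

In the real service the version check would sit in the UPDATE's WHERE clause under the advisory lock; the sketch only shows the decision logic.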
4.1 Outbox pattern
Every state mutation writes a row to numint.outbox in the same transaction as the source change. OutboxRelay (continuous worker; per-replica with SELECT … FOR UPDATE SKIP LOCKED):
- Picks up to 200 unpublished rows ordered by created_at.
- Publishes to NATS with Nats-Msg-Id: event_id for consumer-side dedup.
- Updates published_at on success; increments attempts and stores last_error on failure.
- After 3 attempts → NumIntOutboxStuck alert (rows remain for SRE inspection; never auto-discarded).
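One relay pass can be sketched in memory as follows. The column names come from the list above; the Publisher interface and relayOnce are hypothetical, the publish call is modelled as synchronous for brevity (a real NATS publish would be asynchronous), and the row locking provided by SELECT … FOR UPDATE SKIP LOCKED is not modelled. The source does not say whether rows that have hit the alert threshold keep being retried; this sketch keeps picking them up.

```typescript
// In-memory sketch of one OutboxRelay pass (assumptions noted above).

interface OutboxRow {
  event_id: string;
  created_at: number;           // epoch ms
  published_at: number | null;
  attempts: number;
  last_error: string | null;
}

// Stand-in for a NATS publish that carries Nats-Msg-Id = msgId.
// Throws on failure.
interface Publisher {
  publish(msgId: string, row: OutboxRow): void;
}

const BATCH_SIZE = 200;
const MAX_ATTEMPTS = 3;

// Returns the event_ids that should raise NumIntOutboxStuck.
function relayOnce(rows: OutboxRow[], publisher: Publisher, now: number): string[] {
  const stuck: string[] = [];
  const batch = rows
    .filter((r) => r.published_at === null)
    .sort((a, b) => a.created_at - b.created_at)
    .slice(0, BATCH_SIZE);

  for (const row of batch) {
    try {
      publisher.publish(row.event_id, row);
      row.published_at = now;
    } catch (err) {
      row.attempts += 1;
      row.last_error = err instanceof Error ? err.message : String(err);
      // Rows are never discarded; they stay unpublished for inspection.
      if (row.attempts >= MAX_ATTEMPTS) stuck.push(row.event_id);
    }
  }
  return stuck;
}
```

Because event_id doubles as the Nats-Msg-Id, a row that is published twice after a crash between publish and the published_at update can be deduplicated on the consumer side.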
4.2 Cross-region replication topology
Per ADR-0004 §14 — NI is active-active across Kabul and Mazar. Unlike consent-ledger-service (strict primary-standby), NI's attribution data is effectively public and strongly convergent under MNP-file reconciliation, so both regions can serve hot reads independently.
        Kabul (af-kabul-1)               Mazar (af-mzr-1)           Dubai (ae-dxb-1, cold-DR)
       ┌──────────────────┐            ┌──────────────────┐           ┌──────────────────┐
       │ Postgres 16      │ streaming  │ Postgres 16      │ logical   │ Postgres 16      │
RW ──▶ │ numint schema    │ ◄──────▶   │ numint schema    │ ──────▶   │ audit-only       │
RW ──▶ │ (MNP jobs here)  │   sync     │ (failover RW)    │  async    │ (AES-GCM wrapped │
       │ NATS cluster A   │            │ NATS cluster B   │           │ backups only;    │
       │ Redis cluster    │ (local)    │ Redis cluster    │ (local)   │ keys stay kbl)   │
       │ ni-hlr-gateway   │            │ ni-hlr-gateway   │           │ no hot service   │
       └──────────────────┘            └──────────────────┘           └──────────────────┘
                 ▲                               ▲
                 │                               │
                 └────── MNP recon leader ───────┘
                   (Kabul wins; Mazar warm-idle)
4.3 MNP reconciliation leader election
MNP reconciliation is batch-exclusive: only one region runs the job per MNO per day. Leadership is held through a Redis SET NX EX lock in the Kabul cluster; Mazar picks up the job if and only if the Kabul lock cannot be acquired for > 10 minutes.
On Kabul-region isolation, Mazar promotes to both (a) hot-read primary and (b) reconciliation leader. A manual hot-path cutover by on-call is a no-op, because Mazar was already serving hot reads.
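The takeover rule can be sketched as a small state fold over lock attempts. The lock key format and the 10-minute threshold come from this document; the observation type and function names are hypothetical, and reading "cannot be acquired" as a run of consecutive failed attempts observed from Mazar is an interpretation, not something the text states.

```typescript
// Sketch of the Mazar takeover rule (interpretation noted above).

const TAKEOVER_THRESHOLD_MS = 10 * 60 * 1000; // "> 10 minutes"

// Key format from the ReconciliationRun row of the conflict-policy table.
function mnpReconLockKey(mnoId: string): string {
  return "numint:lock:mnp_recon:" + mnoId;
}

interface LockObservation {
  // Epoch ms of the first attempt in the current run of failures,
  // or null when the most recent attempt succeeded.
  failingSinceMs: number | null;
}

// Fold one lock attempt into the running observation.
function observeAttempt(prev: LockObservation, succeeded: boolean, nowMs: number): LockObservation {
  if (succeeded) return { failingSinceMs: null };
  return { failingSinceMs: prev.failingSinceMs === null ? nowMs : prev.failingSinceMs };
}

// Mazar leads only after strictly more than 10 minutes of failures.
function mazarShouldLead(obs: LockObservation, nowMs: number): boolean {
  return obs.failingSinceMs !== null && nowMs - obs.failingSinceMs > TAKEOVER_THRESHOLD_MS;
}
```

A single successful attempt resets the window, so a brief recovery of the Kabul lock restarts the 10-minute count.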
4.4 Failover semantics
| Failure | Detection | Action |
|---|---|---|
| Kabul Postgres primary down | Patroni + etcd consensus | Promote Kabul standby; RTO ≤ 90 s |
| Kabul region isolated (network partition) | Cross-region heartbeat | Mazar continues hot path unchanged; MNP jobs pause until Kabul returns OR > 10 min elapses → Mazar promotes to MNP leader |
| HLR gateway adapter failure | numint_hlr_adapter_health{mno} goes 0 | Route probes to sibling pod; alert NumIntHlrAdapterDown |
| MNP SFTP unreachable | SFTP fetch error metric | Retry hourly until 23:00, then escalate to P1 |
| Cross-region divergence detected | Cross-region audit verifier cron | CRITICAL alert; freeze writes (manual) |
5. Schema stability guarantees
5.1 gRPC proto
| Field | Stability |
|---|---|
| ResolveMsisdnRequest.e164, scope | Stable; required forever |
| MsisdnAttribution.mno, line_type, mnp_status, source, confidence, tier | Stable |
| All enums | Stable; new values may be added; callers MUST handle *_UNSPECIFIED/*_UNKNOWN as a no-op default |
| MsisdnAttribution.risk_flags | Stable as a list; new enum values may appear |
| New fields with proto3 defaults | Non-breaking |
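Because new enum values may be added, callers need a decoding step that maps anything unrecognised onto the unspecified default instead of failing. A minimal sketch follows; only MOBILE appears in this document, so the other LineType values, the LINE_TYPE_UNSPECIFIED name, and parseLineType itself are assumptions.

```typescript
// Sketch of forward-compatible enum decoding on the caller side.
// FIXED, VOIP and LINE_TYPE_UNSPECIFIED are hypothetical value names.

type LineType = "LINE_TYPE_UNSPECIFIED" | "MOBILE" | "FIXED" | "VOIP";

const KNOWN_LINE_TYPES: LineType[] = ["LINE_TYPE_UNSPECIFIED", "MOBILE", "FIXED", "VOIP"];

// Values added by a newer server fall back to the unspecified default
// rather than throwing.
function parseLineType(raw: string): LineType {
  for (const known of KNOWN_LINE_TYPES) {
    if (known === raw) return known;
  }
  return "LINE_TYPE_UNSPECIFIED";
}
```

For example, a caller built before a hypothetical SATELLITE value existed would decode it as LINE_TYPE_UNSPECIFIED and continue with its default handling.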
5.2 REST API
- /v1/lookup/*, /v1/admin/numint/*, /v1/regulator/numint/* are stable within v1.
- Breaking changes require /v2/* with a 90-day deprecation window.
- Public Lookup API is tenant-facing and therefore has additional stability requirements: response JSON keys are frozen within v1; deprecated fields are tombstoned (retained in response as null) rather than removed.
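Tombstoning can be pictured as a serialisation step that re-adds deprecated keys with a null value. The sketch below is illustrative only: the key name legacy_operator_code and the helper toV1Response are hypothetical, since this document does not list any deprecated field.

```typescript
// Sketch of v1 tombstoning; key name and helper are hypothetical.

type JsonObject = { [key: string]: unknown };

const TOMBSTONED_V1_KEYS: string[] = ["legacy_operator_code"];

function toV1Response(current: JsonObject): JsonObject {
  const out: JsonObject = {};
  for (const key of Object.keys(current)) out[key] = current[key];
  // Deprecated keys stay in the payload as null instead of disappearing.
  for (const key of TOMBSTONED_V1_KEYS) out[key] = null;
  return out;
}
```

A tenant that still reads the deprecated key sees null rather than a missing property, so parsers that require the key to exist keep working.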
5.3 Event subjects
Per EVENT_SCHEMAS §4.
6. Versioning policy
- gRPC package: ghasi.sms.numint.v1. Major bump → coordinated migration plan.
- REST: /v1/lookup/*, /v1/admin/numint/*. OpenAPI document at /v1/numint/openapi.json is the contract source of truth.
- Contract tests: Pact (tenant REST), gRPC reflection-based contract tests for routing-engine, sms-firewall-service, compliance-engine, channel-router-service, fraud-intel-service. Run on every PR; failures block merge.
7. Fail-degraded vs Fail-closed semantics
Consumers MUST NOT treat NI as fail-closed. The correct caller pattern is:
- ResolveMsisdn returns OK with confidence = UNKNOWN → use caller-side fallback (prefix table).
- ResolveMsisdn returns UNAVAILABLE / DEADLINE_EXCEEDED → use caller-side fallback.
- ResolveMsisdn returns INVALID_ARGUMENT → this is a caller bug; surface the validation error upstream.
- LookupEir returning UNKNOWN for an unknown IMEI is not an error; it is a legitimate result.
The one hard contract: LookupPorting MUST NOT return a stale isPorted = false for an MSISDN that has a freshly inserted PortabilityRecord; the MNP overlay step in UC-Lookup enforces this.
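The ResolveMsisdn caller pattern above reduces to a small decision function. This is a sketch under assumptions: the status codes and the LOW/UNKNOWN confidence values appear in this document, while the Outcome and Action types, the HIGH and MEDIUM confidence values, and the function name are hypothetical.

```typescript
// Sketch of the fail-degraded caller pattern for ResolveMsisdn.

type Outcome =
  | { status: "OK"; confidence: "HIGH" | "MEDIUM" | "LOW" | "UNKNOWN" } // HIGH/MEDIUM assumed
  | { status: "UNAVAILABLE" | "DEADLINE_EXCEEDED" | "INVALID_ARGUMENT" };

type Action = "USE_ATTRIBUTION" | "USE_PREFIX_FALLBACK" | "SURFACE_ERROR";

function decide(outcome: Outcome): Action {
  switch (outcome.status) {
    case "OK":
      // An UNKNOWN answer is still a successful call; degrade, do not fail.
      return outcome.confidence === "UNKNOWN" ? "USE_PREFIX_FALLBACK" : "USE_ATTRIBUTION";
    case "UNAVAILABLE":
    case "DEADLINE_EXCEEDED":
      return "USE_PREFIX_FALLBACK";
    case "INVALID_ARGUMENT":
      // Caller bug: retrying or falling back would hide it.
      return "SURFACE_ERROR";
  }
}
```

Note that no branch blocks the message on NI being unavailable, which is what "MUST NOT treat NI as fail-closed" amounts to in code.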
8. Cross-service invariants
- Authoritative source. No other service may query an MNO HLR directly, ingest an MNO MNP file, or run a parallel attribution table.
- Confidence floor. Callers using confidence: UNKNOWN as the sole basis for a regulatory decision (compliance-engine GEO_RESTRICTION specifically) MUST treat it as the most-restrictive class. Documented in compliance-engine DOMAIN_MODEL.
- Tenant-salt pairing. Per-tenant salt for lookup_audit.msisdn_hash is the single mechanism preventing cross-tenant audit correlation. Salt rotation is a coordinated operation documented in SECURITY_MODEL §3.2.
- Backward compatibility. Schema changes follow the platform's evolution policy (add-only fields; new RPCs / subjects for breaking changes).
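The confidence-floor invariant can be illustrated with a GEO_RESTRICTION-style check. Only the rule that UNKNOWN confidence maps to the most-restrictive class comes from this document; the countryCode field, the verdict type, the allow-list shape, and the HIGH/MEDIUM confidence values are assumptions made for the sketch.

```typescript
// Sketch of the confidence floor for a regulatory decision.

type Confidence = "HIGH" | "MEDIUM" | "LOW" | "UNKNOWN"; // HIGH/MEDIUM assumed

interface Attribution {
  confidence: Confidence;
  countryCode: string; // hypothetical field, e.g. "AF"
}

type Verdict = "ALLOW" | "DENY";

function geoRestrictionVerdict(attribution: Attribution, allowedCountries: string[]): Verdict {
  // Confidence floor: an UNKNOWN attribution never passes a regulatory check,
  // even if its (possibly defaulted) country code is on the allow-list.
  if (attribution.confidence === "UNKNOWN") return "DENY";
  return allowedCountries.indexOf(attribution.countryCode) >= 0 ? "ALLOW" : "DENY";
}
```

This is the one place where a degraded NI answer tightens behaviour instead of relaxing it: routing may proceed on UNKNOWN, a regulatory rule may not.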