Number Intelligence Service — Application Logic
Version: 1.0
Status: Draft
Owner: Messaging Core
Last Updated: 2026-04-21
Companion: DOMAIN_MODEL · API_CONTRACTS · SYNC_CONTRACT · SECURITY_MODEL
1. Use Cases
Use cases are organised by caller plane: hot path (gRPC, sub-15 ms P95), batch (cron, SFTP intake), and tenant-facing (REST via Kong, billable).
UC-Lookup: ResolveMsisdn (gRPC hot path)
Trigger: Any authorised internal caller (routing-engine, sms-firewall-service, compliance-engine, channel-router-service, fraud-intel-service) invokes NumberIntelligenceService/ResolveMsisdn(e164, opts).
Input: ResolveMsisdnRequest { e164, scope, opts: { maxStalenessSeconds, forceFresh, tpsWaitMs }, traceId }
Output: MsisdnAttribution { mno, originalMno?, lineType, country, mnpStatus, riskFlags[], source, confidence, cachedAt, stalenessSeconds, tier }
SLA: P95 ≤ 15 ms aggregate (assumes ≥ 99 % cache hit), P99 ≤ 50 ms. Worst-case under forced live_hlr is bounded by the MAP timeout (1500 ms) or REST timeout (800 ms) per SERVICE_OVERVIEW §10.
Steps:
1. Input validation. e164 must match ^\+[1-9]\d{6,14}$; normalise to NFKC; reject with INVALID_ARGUMENT on failure.
2. Compute msisdnHash with the platform pepper.
3. LRU probe. lru.get(msisdnHash): in-process lru-cache v10; TTL 60 s; 100 000 entries/pod. On HIT return immediately with tier = LRU, source preserved from the cached record.
4. Redis probe. GET numint:lookup:{hash}: hash-per-record containing { mno, lineType, country, mnpStatus, vlr, cachedAt, source, confidence }. On HIT populate LRU and return with tier = REDIS.
5. Postgres probe. SELECT * FROM numint.number_records WHERE msisdn_hash = $1 on the replica pool. On HIT with cachedAt >= now() - ttl_by_class(source):
   - Populate Redis with per-class TTL (LINE_TYPE 30 d, MNO 24 h, VLR 5 min; see SERVICE_OVERVIEW §9).
   - Populate LRU; return tier = PG.
6. Stale or missing. Apply the refresh policy:
   - If opts.forceFresh = true OR opts.maxStalenessSeconds is violated by the PG row, proceed to UC-HlrProbe.
   - Else return PG row with confidence = LOW and tier = PG; do not probe live HLR.
7. MNP overlay. Before returning, consult the most-recent PortabilityRecord row for this msisdnHash: if a port is recorded after NumberRecord.cachedAt, overwrite mnoId with recipientMnoId and emit numint.mnp.divergence.v1 for fraud correlation (async).
8. MNP divergence detection. When step 7 triggers, mark riskFlags += MNP_DIVERGENCE on the response.
9. Write-through (async). On any live/MNP update, enqueue an outbox write (UC-WriteThrough) so the durable store grows toward steady-state coverage.
10. Metric side-effects. Increment numint_lookup_total{tier, source, confidence}; record the numint_lookup_duration_seconds histogram.
11. Return to caller.
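The validation, hashing and tiered-probe steps above can be sketched as follows. This is a minimal Python illustration, not the service's implementation: HMAC-SHA256 as the hashing scheme, the plain dicts standing in for LRU, Redis and Postgres, and the is_fresh callback are all assumptions made for the example. The forceFresh / maxStalenessSeconds branch and the MNP overlay are left out.

```python
import hashlib
import hmac
import re
import unicodedata

# ASCII-only digit class: Python's \d would also accept non-ASCII digits.
E164_RE = re.compile(r"^\+[1-9][0-9]{6,14}$")


def msisdn_hash(e164: str, pepper: bytes) -> str:
    """Normalise to NFKC, validate, then hash (HMAC-SHA256 is an assumption)."""
    normalised = unicodedata.normalize("NFKC", e164)
    if not E164_RE.fullmatch(normalised):
        raise ValueError("INVALID_ARGUMENT: malformed E.164")
    return hmac.new(pepper, normalised.encode("ascii"), hashlib.sha256).hexdigest()


def resolve(e164, pepper, lru, redis, pg, is_fresh):
    """Probe LRU, then Redis, then PG, back-filling the faster tiers on a hit."""
    key = msisdn_hash(e164, pepper)
    if key in lru:
        return {**lru[key], "tier": "LRU"}
    if key in redis:
        lru[key] = redis[key]
        return {**redis[key], "tier": "REDIS"}
    row = pg.get(key)
    if row is not None and is_fresh(row):
        redis[key] = row
        lru[key] = row
        return {**row, "tier": "PG"}
    if row is not None:
        # Stale row and no forced refresh: answer with lowered confidence.
        return {**row, "confidence": "LOW", "tier": "PG"}
    return None  # the caller would escalate to UC-HlrProbe or the prefix fallback
```

Normalising before matching means a full-width rendering of a number hashes to the same value as its ASCII form, so both spellings share one cache entry.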
Error codes:
| gRPC status | Condition |
|---|---|
| INVALID_ARGUMENT | Malformed E.164 |
| RESOURCE_EXHAUSTED | Per-pod concurrency cap exceeded (default 10 000 in-flight) |
| DEADLINE_EXCEEDED | > 1 s default deadline (rare; the cache usually returns in sub-ms) |
| UNAVAILABLE | PG and Redis are both down AND no live HLR fallback succeeds. A { source: "PREFIX_FALLBACK", confidence: UNKNOWN } answer is preferred over an error; UNAVAILABLE is returned only if the prefix table load also fails |
| INTERNAL | Unhandled exception |
Fail-degraded rationale. routing-engine has a prefix-table fallback; sms-firewall-service has allow-by-default for unknown origins; compliance-engine GEO_RESTRICTION treats UNKNOWN country as the most-restrictive class. A typed answer with LOW/UNKNOWN confidence is always more useful than an error.
UC-BulkLookup: ResolveBatch (gRPC server-streaming)
Trigger: sms-orchestrator bulk-submit pipeline or tenant SDK calls ResolveBatch(repeated e164).
Input: ResolveBatchRequest { entries: string[ ≤ 1000 ], opts, traceId }
Output: Stream of MsisdnAttribution in input order, one message per entry.
SLA: P95 ≤ 80 ms for 500-entry batch at ≥ 95 % cache-hit.
Steps:
1. Validate size (> 1000 → RESOURCE_EXHAUSTED); validate each entry (invalid → emit one error slot for that index, not a whole-batch fail).
2. Deduplicate. Group by msisdnHash; consult cascade once per unique hash; replicate into each slot.
3. Parallel cascade. For unique entries: LRU batch-get → Redis MGET → Postgres WHERE msisdn_hash = ANY($1).
4. Live-HLR fan-out (bounded). If forceFresh or stale beyond limit, enqueue live probes through the per-MNO TPS governor; entries exceeding the 2 s internal deadline return with source = FALLBACK_PREFIX, confidence = LOW, tier = FALLBACK.
5. Emit in input order.
Error handling: Per-slot errors are returned inline; a partial failure never fails the whole batch unless input validation of the request envelope fails.
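A minimal Python sketch of the deduplicate-then-replicate flow, with per-slot errors kept inline. The hash_fn and cascade callables are hypothetical stand-ins for the hashing step and the LRU/Redis/Postgres cascade; treating a hash with no answer as the prefix fallback is a simplification made for this example.

```python
from typing import Callable, Dict, List, Union

# Shape of the degraded answer named in the fan-out step.
FALLBACK = {"source": "FALLBACK_PREFIX", "confidence": "LOW", "tier": "FALLBACK"}


def resolve_batch(entries: List[str],
                  hash_fn: Callable[[str], str],
                  cascade: Callable[[List[str]], Dict[str, dict]]) -> List[dict]:
    """Resolve each unique hash once and emit one result per input slot, in order."""
    if len(entries) > 1000:
        raise ValueError("RESOURCE_EXHAUSTED: batch exceeds 1000 entries")
    slots: List[Union[str, dict]] = []
    unique: Dict[str, None] = {}  # dict keys keep first-seen order
    for entry in entries:
        try:
            digest = hash_fn(entry)
        except ValueError as exc:
            # An invalid entry fails only its own slot, never the whole batch.
            slots.append({"error": str(exc)})
            continue
        unique[digest] = None
        slots.append(digest)
    answers = cascade(list(unique))  # one cascade call for all unique hashes
    return [s if isinstance(s, dict) else answers.get(s, FALLBACK) for s in slots]
```

Because slots are filled positionally, a duplicate MSISDN costs one cascade lookup but still yields one message per occurrence.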
UC-HlrProbe: Live HLR lookup via ni-hlr-gateway
Trigger: UC-Lookup or UC-BulkLookup escalates to live, OR admin explicitly invokes ProbeHlr(e164).
Steps:
1. TPS gate. EVAL Lua against Redis token bucket numint:tps:hlr:{mnoHint} (capacity = MnoSnapshot.tpsLimit, refill = same / sec). On bucket empty, wait up to opts.tpsWaitMs (default 200 ms). On exhaustion, emit HlrProbe { status: THROTTLED } and return the most-recent persisted answer with source = stale_throttled, confidence = MEDIUM.
2. Transport selection. Read MnoSnapshot.hlrEndpoint.kind:
   - MAP → dispatch through ni-hlr-gateway LiveLookup(e164, mnoHint) gRPC. Gateway builds a MAP SendRoutingInfoForSM per 3GPP TS 29.002; application context shortMsgGatewayContext-v3; timeout 1500 ms.
   - REST → gateway issues POST {endpoint}/v1/hlr/lookup with client-credentials JWT; timeout 800 ms.
3. Response normalisation. The gateway returns { imsi, vlr, lineType, mnoId }. NI derives country from the E.164 CC (MCC derivation from IMSI first 3 digits is used as a secondary confirmation).
4. PCAP sampling. At 0.1 % sampling, the gateway captures the full MAP TCAP+SCCP PDU (encrypted with KMS key numint-pcap-kek) to MinIO numint-hlr-pcap/ for post-incident review.
5. Write-through to PG + Redis + LRU (UC-WriteThrough).
6. Record HlrProbe row (append-only) with status, durationMs, resultSnapshot.
7. Emit numint.hlr_probe.completed.v1 (async, for fraud-intel VLR-change correlation).
Error codes:
| Condition | Behaviour |
|---|---|
| DEADLINE_EXCEEDED (MAP timeout) | Status TIMEOUT; caller gets last-known answer with confidence = LOW |
| MAP_ABORT (SS7 MAP abort) | Status MAP_ABORT; same as timeout |
| REST 5xx | Status REST_5XX; retry once then stale fallback |
| ADAPTER_DOWN | Gateway DaemonSet pod unreachable → retry against sibling pod; alert NumIntHlrAdapterDown |
UC-WriteThrough: Authoritative attribution UPSERT
Trigger: Any successful live HLR or MNP reconciliation result.
Steps:
1. Begin PG transaction.
   - INSERT INTO numint.number_records (...) ON CONFLICT (msisdn_hash) DO UPDATE SET mno_id = EXCLUDED.mno_id, line_type = EXCLUDED.line_type, vlr = EXCLUDED.vlr, imsi_prefix = EXCLUDED.imsi_prefix, last_seen = now(), lookup_count = number_records.lookup_count + 1, version = number_records.version + 1, cached_at = now() WHERE number_records.version = $expected_version.
   - If mno_id or mnp_status changed vs prior, insert numint.outbox row for numint.attribution.changed.v1.
   - SETEX numint:lookup:{hash} with per-class TTL.
   - Commit; LRU update happens outside the transaction.
2. The caller-facing response does not wait on write; failures retry through the outbox with exponential backoff (max 6 attempts).
Idempotency. Same (msisdn_hash, mno_id, line_type) as the current row → version bump only; no event emitted.
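The version guard and the event rule can be sketched in Python as a pure decision function. This is an illustration only: the NumberRecord dataclass is a reduced, hypothetical shape, and emitting no event for a first insert (there is no prior row to differ from) is an assumption the text does not settle.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class NumberRecord:
    mno_id: str
    line_type: str
    mnp_status: str
    version: int


def apply_write(current: Optional[NumberRecord], incoming: NumberRecord,
                expected_version: int) -> Tuple[Optional[NumberRecord], bool]:
    """Return (row_to_store, emit_changed_event); the row is None on a version conflict."""
    if current is None:
        # Assumption: the first sighting has no prior row, so nothing "changed".
        return replace(incoming, version=1), False
    if current.version != expected_version:
        # Another writer won the optimistic-concurrency race; retry via the outbox.
        return None, False
    changed = (current.mno_id != incoming.mno_id
               or current.mnp_status != incoming.mnp_status)
    return replace(incoming, version=current.version + 1), changed
```

Writing the same attribution twice therefore bumps the version without emitting a second numint.attribution.changed.v1.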
UC-MnpReconciliationDaily: Per-MNO SFTP intake
Trigger: Kubernetes CronJob mnp-recon at 02:30 Asia/Kabul daily; one job per MNO (fan-out). Runs only in kbl region.
Steps:
1. Distributed lock. SET numint:lock:mnp_recon:{mnoId} NX EX 1800 (prevents concurrent runs during cron re-runs).
2. Fetch. Pull sftp://{mno-sftp}/mnp/{yyyy-mm-dd}.csv (CSV: msisdn,donor_mno,recipient_mno,port_date,direction).
3. Archive raw file. PUT s3://numint-mnp-raw/{mnoId}/{yyyy}/{mm}/{dd}.csv with sha256:{hash} tag.
4. Validate each row: E.164 regex; valid mno_id; port_date <= today; CSV schema version header.
5. Conflict detection pre-insert. For each row compute msisdnHash; read the most-recent existing PortabilityRecord for this MSISDN. If a different recipientMnoId with port_date within ±2 days exists, this is a conflict: insert into numint.reconciliation_conflicts and SKIP the port insert (the active NumberRecord.mnpStatus is not updated).
6. Insert valid, non-conflicting rows into numint.portability_history via INSERT … ON CONFLICT DO NOTHING keyed on (msisdn_hash, port_date, recipient_mno_id, source_feed).
7. Materialise NumberRecord. For each new port, UPDATE numint.number_records SET mno_id = :recipient, original_mno_id = COALESCE(original_mno_id, :donor), mnp_status = 'PORTED_IN', version = version + 1, cached_at = now() WHERE msisdn_hash = :msisdn_hash.
8. Chain hash. Update ReconciliationRun.recordHash = sha256(canonical(run-payload) || prevChainHash); the chain is per-MNO.
9. Invalidate cache. DEL numint:lookup:{hash} for every changed MSISDN; emit numint.attribution.changed.v1 per row to warm subscriber caches (routing-engine, sms-firewall-service).
10. Summary event. Publish numint.reconciliation.completed.v1 { runId, mnoId, totalRecords, accepted, rejected, conflictsCount, durationMs, fileSha256 }.
11. Failure handling. If SFTP fetch fails, the job retries hourly until 23:00 same day; after that it escalates to P1 via the NumIntMnpReconciliationStale alert.
Budget. Each MNO's daily file is typically < 100 k rows; whole run completes P99 ≤ 4 h end-to-end per SERVICE_OVERVIEW §1.
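The pre-insert conflict rule (a different recipient MNO with a port date within ±2 days) reduces to a small predicate. The sketch below is a Python illustration under assumed names; is_conflict and its parameters are hypothetical, and treating "no existing record" as "no conflict" is inferred from the rule rather than stated.

```python
from datetime import date, timedelta
from typing import Optional


def is_conflict(existing_recipient: Optional[str],
                existing_port_date: Optional[date],
                new_recipient: str,
                new_port_date: date,
                window_days: int = 2) -> bool:
    """True when a different recipient MNO claims the number within the window."""
    if existing_recipient is None or existing_port_date is None:
        return False  # no prior PortabilityRecord to disagree with
    if existing_recipient == new_recipient:
        return False  # same claim re-reported; the insert's ON CONFLICT handles it
    return abs(new_port_date - existing_port_date) <= timedelta(days=window_days)
```

Rows for which the predicate is true would go to numint.reconciliation_conflicts instead of numint.portability_history.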
UC-MnpConflictResolve: Admin dispute resolution
Trigger: Platform admin invokes POST /v1/admin/mnp/conflicts/{conflictId}/resolve with { resolution, note }.
Steps:
1. Load ReconciliationConflict by id; reject if resolution IS NOT NULL (already resolved).
2. Apply the chosen resolution:
   - A_WINS → insert the candidate A PortabilityRecord; materialise NumberRecord.
   - B_WINS → insert the candidate B PortabilityRecord; materialise NumberRecord.
   - KEEP_BOTH_PENDING_VENDOR_CONFIRM → leave state as-is; mark conflict as deferred; create a follow-up ticket.
   - DISCARDED → ignore both (e.g., both turned out to be erroneous reports).
3. Write AuditLog entry { entityType: 'MNP_CONFLICT', action: 'RESOLVE', before, after }.
4. Emit numint.mnp.changed.v1 if a record was committed.
UC-EirCheck: LookupEir(imei)
Trigger: sms-firewall-service or fraud-intel-service calls LookupEir(imei).
Steps:
1. Validate Luhn per 3GPP TS 23.003 §6.2.1; reject malformed with INVALID_ARGUMENT.
2. GET numint:eir:{imeiHash} (Redis) → HIT returns { status, reasonCode, reportedBy[], lastUpdated }.
3. On MISS: SELECT … FROM numint.eir_records WHERE imei_hash = $1. On row return, populate Redis with 24 h TTL.
4. On row absent: return { status: UNKNOWN }; never error (known-unknowns are a legitimate response shape).
5. Emit numint_eir_lookup_total{outcome} metric.
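For reference, the Luhn check on a 15-digit IMEI can be sketched as below. This is a generic Python rendering of the Luhn algorithm; the function name is hypothetical, and the restriction to exactly 15 ASCII digits is an assumption about what counts as well-formed input here.

```python
def imei_is_valid(imei: str) -> bool:
    """Return True if imei is 15 ASCII digits and passes the Luhn check."""
    if len(imei) != 15 or not (imei.isascii() and imei.isdigit()):
        return False
    total = 0
    # Walk from the check digit leftwards, doubling every second digit.
    for position, ch in enumerate(reversed(imei)):
        digit = int(ch)
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as summing the two digits of the product
        total += digit
    return total % 10 == 0
```

A caller would map a False result to INVALID_ARGUMENT before any cache or database probe is attempted.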
UC-LookupBillingEvent: Per-call billing meter
Trigger: Public Lookup API REST call (GET /v1/lookup/{msisdn} or POST /v1/lookup/batch).
Steps:
1. After successful response, determine SKU:
   - lookup.v1: standard (cache or PG path; maxStaleness >= 86400).
   - lookup.fresh.v1: forced live probe (maxStaleness < 86400 AND a live HLR was actually issued).
2. Compute msisdnHash with per-tenant salt (tenantSalt loaded from Vault KV secret/ghasi/numint/tenant-salts/{tenantId}).
3. Insert numint.outbox row with subject billing.metering.recorded.v1 and payload { tenantId, sku, quantity: 1, occurredAt, requestId, msisdnHash }.
4. Insert matching numint.lookup_audit row (hash-chained).
5. Response header X-Metering-Status: ok (or degraded if the outbox insert failed; the call still returns 200 but the tenant is not billed, and an SRE alert fires).
6. Internal gRPC callers (SPIFFE SAN in {routing-engine, sms-firewall-service, compliance-engine, channel-router-service, fraud-intel-service}) are not metered.
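The SKU rule and the metering payload can be sketched as follows. This Python illustration uses hypothetical function names; billing a request with maxStaleness < 86400 that was nonetheless served without a live probe as lookup.v1 is inferred from the "AND a live HLR was actually issued" condition rather than stated outright.

```python
def billing_sku(max_staleness_seconds: int, live_hlr_issued: bool) -> str:
    """Pick the SKU: the fresh SKU needs both a tight bound and a real live probe."""
    if max_staleness_seconds < 86400 and live_hlr_issued:
        return "lookup.fresh.v1"
    return "lookup.v1"


def metering_payload(tenant_id: str, request_id: str, msisdn_hash: str,
                     occurred_at: str, max_staleness_seconds: int,
                     live_hlr_issued: bool) -> dict:
    """Build the payload for one billing.metering.recorded.v1 outbox row."""
    return {
        "tenantId": tenant_id,
        "sku": billing_sku(max_staleness_seconds, live_hlr_issued),
        "quantity": 1,
        "occurredAt": occurred_at,
        "requestId": request_id,
        "msisdnHash": msisdn_hash,
    }
```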
UC-CacheWarmCold: Warm-on-deploy
Trigger: Pod startup (readiness gate OFF until warm completes) or cron numint-cache-warm hourly.
Steps:
1. Query top-N MSISDNs by lookup_count from numint.number_records (default N = 500 000, tuned per pod capacity).
2. Load into Redis via SETEX in pipelined batches of 1 000.
3. Emit numint.cache.refreshed.v1 { kind: "warm_on_deploy", keys: N, durationMs }.
4. Flip readiness to ready once ≥ 80 % of the target is loaded.
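The batching and the 80 % readiness threshold can be sketched as a generator that reports progress after each batch. This is a Python illustration only; warm_cache and load_batch are hypothetical names, and load_batch stands in for one pipelined round of SETEX commands.

```python
from typing import Callable, Iterator, List, Tuple


def warm_cache(keys: List[str],
               load_batch: Callable[[List[str]], None],
               batch_size: int = 1000,
               ready_ratio: float = 0.8) -> Iterator[Tuple[int, bool]]:
    """Yield (keys_loaded, ready) after each batch is handed to load_batch."""
    loaded = 0
    for start in range(0, len(keys), batch_size):
        batch = keys[start:start + batch_size]
        load_batch(batch)  # stand-in for a pipelined SETEX round-trip
        loaded += len(batch)
        # Ready as soon as the loaded share reaches the threshold.
        yield loaded, loaded >= ready_ratio * len(keys)
```

A readiness probe could flip on the first yielded tuple whose second element is true while the remaining batches continue to load.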
UC-TenantQuotaEnforce: Rate limit + monthly cap
Trigger: Every Public Lookup API call.
Steps:
1. Plan snapshot. Load TenantLookupQuota from Redis cache numint:quota:{tenantId} (TTL 60 s); on MISS read PG.
2. RPS bucket. Lua-eval Redis token bucket numint:tps:lookup:{tenantId}; capacity = plan RPS (default 10). On empty, 429 + Retry-After.
3. Monthly counter. INCR numint:quota:lookup:{tenantId}:{yyyymm}; expire at 1st of next month 00:00 Asia/Kabul; if result > monthlyQuota, 429.
4. Fresh-lookup separate bucket. For maxStaleness < 86400, enforce freshLookupRpsLimit separately (lower cap; defends SS7 quota).
5. Plan changes. Consume billing.tenant.plan.changed.v1 → invalidate local cache → new caps take effect within 60 s.
6. Quota audit. Emit audit.lookup.quota_exceeded.v1 when 429 occurs (for tenant churn analysis).
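The monthly counter key and its expiry instant can be computed as below. This Python sketch models Asia/Kabul as a fixed UTC+04:30 offset (Afghanistan observes no daylight saving time); deriving the {yyyymm} component from Kabul local time, so that it rolls over at the same instant the key expires, is an assumption.

```python
from datetime import datetime, timedelta, timezone

# Asia/Kabul is UTC+04:30 year-round, so a fixed offset is sufficient here.
KABUL = timezone(timedelta(hours=4, minutes=30))


def monthly_quota_key(tenant_id: str, now_utc: datetime) -> str:
    """Counter key for the Kabul-local month that contains now_utc."""
    local = now_utc.astimezone(KABUL)
    return f"numint:quota:lookup:{tenant_id}:{local:%Y%m}"


def monthly_quota_expiry(now_utc: datetime) -> int:
    """Unix timestamp of 00:00 Kabul time on the 1st of the next month."""
    local = now_utc.astimezone(KABUL)
    if local.month == 12:
        year, month = local.year + 1, 1
    else:
        year, month = local.year, local.month + 1
    return int(datetime(year, month, 1, tzinfo=KABUL).timestamp())
```

The returned timestamp is in the form Redis EXPIREAT accepts, so the counter disappears exactly when the next month's key starts being used.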
UC-AuditChainVerify: Daily hash-chain integrity check
Trigger: CronJob numint-audit-verifier at 04:30 Asia/Kabul daily (and on demand via admin endpoint).
Steps:
1. Acquire distributed lock numint:lock:audit_verifier.
2. For LookupAuditEntry and PortabilityRecord partitions updated in last 24 h: re-compute chain tail-to-head; compare recordHash with stored values.
3. On mismatch: insert AuditLog { action: AUDIT_INTEGRITY_BROKEN }; page on-call with NumIntAuditChainBroken CRITICAL; freeze writes (manual un-freeze only).
4. On success: emit numint.audit.chain_verified.v1 (ops marker).
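A hash-chain re-computation of this kind can be sketched in Python as below, using the recordHash = sha256(canonical(payload) || prevChainHash) form given for reconciliation runs. Sorted-key compact JSON as the canonical encoding, the empty-string genesis value, and the entry shape { payload, recordHash } are assumptions for the example. For simplicity the sketch walks oldest to newest and reports the first broken link.

```python
import hashlib
import json
from typing import List, Optional


def record_hash(payload: dict, prev_hash: str) -> str:
    """sha256 over the canonical payload followed by the previous chain hash."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8") + prev_hash.encode("ascii")).hexdigest()


def first_broken_index(entries: List[dict], genesis: str = "") -> Optional[int]:
    """Return the index of the first entry whose stored hash does not re-compute."""
    prev = genesis
    for index, entry in enumerate(entries):
        if record_hash(entry["payload"], prev) != entry["recordHash"]:
            return index
        prev = entry["recordHash"]
    return None  # the whole chain verified
```

A non-None result corresponds to the mismatch branch: the verifier would log AUDIT_INTEGRITY_BROKEN and page on-call.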
2. Performance Optimisation
2.1 Fast-path ordering (sub-5 ms budget)
Evaluation order in UC-Lookup is chosen so the majority of calls terminate at step 3:
- LRU (P50 0.2 ms): absorbs OTP-storm repeat lookups (same MSISDN hit 3-10× within seconds).
- Redis (P50 1.5 ms): per-region hot cache; 6-node cluster sized for ≥ 95 % hit ratio.
- Postgres (P50 6 ms): replica pool; monthly-partitioned number_records.
- Live HLR (P50 250 ms MAP / 80 ms REST): MNO-facing; TPS-governed.
2.2 Budget enforcement
Each ResolveMsisdn call has an internal 12 ms budget (leaves 3 ms for gRPC serialisation). Per-step budget sub-allocation:
| Step | Budget |
|---|---|
| Validation + hash | 0.1 ms |
| LRU get | 0.2 ms |
| Redis get | 3 ms |
| PG select | 8 ms |
| MNP overlay | 2 ms |
If the budget is exhausted mid-step, the step 6 default behaviour (return PG with confidence = LOW) applies.
2.3 Redis-Lua atomic TPS gate
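```lua
-- KEYS[1] = numint:tps:hlr:{mno}
-- ARGV[1] = capacity, ARGV[2] = refill_per_sec, ARGV[3] = now_ms
local capacity = tonumber(ARGV[1])
local refill   = tonumber(ARGV[2])
local now_ms   = tonumber(ARGV[3])

-- A missing bucket starts full, stamped with the caller's timestamp.
local bucket = redis.call('HMGET', KEYS[1], 'tokens', 'ts')
local tokens = tonumber(bucket[1]) or capacity
local ts     = tonumber(bucket[2]) or now_ms

-- Refill for the elapsed time (clamped at 0 against clock skew), capped at capacity.
local elapsed = math.max(0, now_ms - ts) / 1000.0
tokens = math.min(capacity, tokens + elapsed * refill)

-- Take one token if available: 1 = granted, 0 = throttled.
local granted = 0
if tokens >= 1 then
  tokens = tokens - 1
  granted = 1
end

-- Persist on both paths and refresh the idle expiry so throttled-only keys also age out.
redis.call('HSET', KEYS[1], 'tokens', tokens, 'ts', ARGV[3])
redis.call('EXPIRE', KEYS[1], 3600)
return granted
```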
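For unit-testing the refill arithmetic without a Redis instance, the same token-bucket logic can be mirrored in Python. This is a sketch of the algorithm only; the TokenBucket class is hypothetical and offers none of the atomicity the Lua script gets from running inside Redis.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    capacity: float
    refill_per_sec: float
    tokens: float
    ts_ms: int

    def try_take(self, now_ms: int) -> bool:
        """Refill for the elapsed time, then take one token if at least one is available."""
        elapsed = max(0, now_ms - self.ts_ms) / 1000.0
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.ts_ms = now_ms  # state is persisted on both the grant and the deny path
        if self.tokens < 1:
            return False
        self.tokens -= 1
        return True
```

With capacity 2 and a refill of 1 token per second, two immediate calls succeed, a third is throttled, and the next grant comes one full second after the bucket was emptied.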
3. MNP Conflict Resolution Heuristics
When two MNOs claim the same ported number within ±2 days:
| Heuristic | Weight |
|---|---|
| Which MNO's file has a more recent sourceFeed timestamp? | 30 % |
| Which candidate has the later portDate? | 25 % |
| Does a recent HlrProbe confirm one of the two MNOs as current? | 30 % |
| Does fraud-intel-service flag one side as a known conflict-prone MNO for this MSISDN range? | 15 % |
Weighted score > 0.7 → platform can propose an auto-resolution; lower → surfaces for manual review. Auto-resolutions are still reviewable (5-day undo window).
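One way to read the weighting is as a per-candidate score: a candidate collects the weight of each heuristic it wins. The Python sketch below follows that reading, which is an assumption; the signal names are hypothetical, the fraud-intel heuristic is credited to the side that is not flagged, and weights are kept as integer percentages so the 0.7 threshold compares exactly.

```python
# Weights from the table, as integer percentages.
WEIGHTS = {
    "newer_source_feed": 30,
    "later_port_date": 25,
    "hlr_probe_confirms": 30,
    "other_side_conflict_prone": 15,
}
THRESHOLD = 70  # "weighted score > 0.7"


def score(signals: dict) -> int:
    """Sum the weights of the heuristics this candidate wins."""
    return sum(weight for name, weight in WEIGHTS.items() if signals.get(name))


def propose(signals_a: dict, signals_b: dict) -> str:
    """Propose an auto-resolution only for a clear winner above the threshold."""
    a, b = score(signals_a), score(signals_b)
    if a > THRESHOLD and a > b:
        return "A_WINS"
    if b > THRESHOLD and b > a:
        return "B_WINS"
    return "MANUAL_REVIEW"
```

Under this reading a candidate needs the HlrProbe confirmation plus at least two other heuristics, or all three of the others short of exactly 70, to clear the bar; a score of exactly 70 still goes to manual review.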
4. SLA Budgets (summary)
| Use case | P50 | P95 | P99 |
|---|---|---|---|
| UC-Lookup cache-hit | 1 ms | 5 ms | 10 ms |
| UC-Lookup PG fallback | 6 ms | 15 ms | 30 ms |
| UC-Lookup live HLR forced | 250 ms | 600 ms | 1200 ms |
| UC-BulkLookup (500 entries, 95 % cache) | 40 ms | 80 ms | 150 ms |
| UC-EirCheck | 2 ms | 8 ms | 20 ms |
| UC-MnpReconciliationDaily per MNO | 30 min | 90 min | 4 h |
| UC-AuditChainVerify 24 h window | 60 s | 120 s | 5 min |
| Public Lookup GET (REST) | 8 ms | 200 ms | 500 ms |
| Public Lookup batch POST (100 entries) | 200 ms | 800 ms | 2000 ms |