Number Intelligence Service — Application Logic

Version: 1.0 · Status: Draft · Owner: Messaging Core · Last Updated: 2026-04-21
Companion: DOMAIN_MODEL · API_CONTRACTS · SYNC_CONTRACT · SECURITY_MODEL

1. Use Cases

Use cases are organised by caller plane: hot path (gRPC, sub-15 ms P95), batch (cron, SFTP intake), and tenant-facing (REST via Kong, billable).


UC-Lookup: ResolveMsisdn (gRPC hot path)

Trigger: Any authorised internal caller (routing-engine, sms-firewall-service, compliance-engine, channel-router-service, fraud-intel-service) invokes NumberIntelligenceService/ResolveMsisdn(e164, opts).

Input: ResolveMsisdnRequest { e164, scope, opts: { maxStalenessSeconds, forceFresh, tpsWaitMs }, traceId }

Output: MsisdnAttribution { mno, originalMno?, lineType, country, mnpStatus, riskFlags[], source, confidence, cachedAt, stalenessSeconds, tier }

SLA: P95 ≤ 15 ms aggregate (assumes ≥ 99 % cache hit), P99 ≤ 50 ms. Worst-case under forced live_hlr is bounded by the MAP timeout (1500 ms) or REST timeout (800 ms) per SERVICE_OVERVIEW §10.

Steps:

  1. Input validation. Normalise e164 to NFKC first, then require a full match against ^\+[1-9]\d{6,14}$; reject with INVALID_ARGUMENT on failure.
  2. Compute msisdnHash with the platform pepper.
  3. LRU probe. lru.get(msisdnHash) — in-process lru-cache v10; TTL 60 s; 100 000 entries/pod. On HIT return immediately with tier = LRU, source preserved from the cached record.
  4. Redis probe. GET numint:lookup:{hash}: one string key per msisdnHash holding the serialised record { mno, lineType, country, mnpStatus, vlr, cachedAt, source, confidence }. On HIT populate LRU and return with tier = REDIS.
  5. Postgres probe. SELECT * FROM numint.number_records WHERE msisdn_hash = $1 on the replica pool. On HIT with cachedAt >= now() - ttl_by_class(source):
    • Populate Redis with per-class TTL (LINE_TYPE 30 d, MNO 24 h, VLR 5 min; see SERVICE_OVERVIEW §9).
    • Populate LRU; return tier = PG.
  6. Stale or missing — apply refresh policy:
    • If opts.forceFresh = true OR opts.maxStalenessSeconds is violated by the PG row, proceed to UC-HlrProbe.
    • Else return PG row with confidence = LOW and tier = PG; do not probe live HLR.
  7. MNP overlay. Before returning, consult PortabilityRecord most-recent row for this msisdnHash: if a port is recorded after NumberRecord.cachedAt, overwrite mnoId with recipientMnoId and emit numint.mnp.divergence.v1 for fraud correlation (async).
  8. MNP divergence detection. When step 7 triggers, mark riskFlags += MNP_DIVERGENCE on the response.
  9. Write-through (async). On any live/MNP update, enqueue an outbox write (UC-WriteThrough) so the durable store grows toward steady-state coverage.
  10. Metric side-effects. Increment numint_lookup_total{tier, source, confidence}, record numint_lookup_duration_seconds histogram.
  11. Return to caller.
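
Steps 1 and 2 above can be sketched as follows. This is a minimal illustration, not the service's implementation: the source only says the hash uses "the platform pepper", so HMAC-SHA-256 is an assumption, and `InvalidArgument`, `normalise_e164`, and `msisdn_hash` are hypothetical names.

```python
import hashlib
import hmac
import re
import unicodedata

# re.ASCII keeps \d from matching non-ASCII digits that survive normalisation.
E164_RE = re.compile(r"\+[1-9]\d{6,14}", re.ASCII)


class InvalidArgument(ValueError):
    """Stand-in for the gRPC INVALID_ARGUMENT status (hypothetical name)."""


def normalise_e164(raw: str) -> str:
    # NFKC folds compatibility forms (e.g. full-width digits) to ASCII
    # before the pattern is applied.
    candidate = unicodedata.normalize("NFKC", raw)
    if E164_RE.fullmatch(candidate) is None:
        raise InvalidArgument(f"not a valid E.164 number: {raw!r}")
    return candidate


def msisdn_hash(e164: str, pepper: bytes) -> str:
    # Assumption: a keyed HMAC-SHA-256; the real construction may differ.
    return hmac.new(pepper, e164.encode("ascii"), hashlib.sha256).hexdigest()
```

Normalising before matching means a number typed with full-width characters is accepted in its ASCII form, while anything that still fails the pattern is rejected before a hash is ever computed.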

Error codes:

| gRPC status | Condition |
|---|---|
| INVALID_ARGUMENT | Malformed E.164 |
| RESOURCE_EXHAUSTED | Per-pod concurrency cap (default 10 000 in-flight) |
| DEADLINE_EXCEEDED | > 1 s default deadline (rare; the cache usually returns in under 1 ms) |
| UNAVAILABLE | PG and Redis both down AND no live HLR fallback succeeds → { source: "PREFIX_FALLBACK", confidence: UNKNOWN } preferred over error; only if even the prefix table load fails do we return UNAVAILABLE |
| INTERNAL | Unhandled exception |

Fail-degraded rationale. routing-engine has a prefix-table fallback; sms-firewall-service has allow-by-default for unknown origins; compliance-engine GEO_RESTRICTION treats UNKNOWN country as the most-restrictive class. A typed answer with LOW/UNKNOWN confidence is always more useful than an error.
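
The fail-degraded behaviour can be sketched as a tier loop that treats a dead tier as a miss and ends in a typed prefix-fallback answer. This is an illustrative sketch only: `Attribution`, `resolve`, `prefix_lookup`, the shape of the prefix table, and the use of `ConnectionError` to signal a dead tier are assumptions, as is answering with mno = "UNKNOWN" when the table is loaded but no prefix matches.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Attribution:
    mno: str
    source: str
    confidence: str
    tier: str = ""


Probe = Callable[[str], Optional[Attribution]]


def prefix_lookup(e164: str, prefix_table: Dict[str, str]) -> Optional[str]:
    # Longest-prefix match against a static table (hypothetical shape).
    for length in range(len(e164), 1, -1):
        mno = prefix_table.get(e164[:length])
        if mno is not None:
            return mno
    return None


def resolve(e164: str, msisdn_hash: str, tiers: List[Tuple[str, Probe]],
            prefix_table: Optional[Dict[str, str]]) -> Attribution:
    for tier_name, probe in tiers:
        try:
            hit = probe(msisdn_hash)
        except ConnectionError:
            continue  # tier is down: degrade to the next tier, do not fail
        if hit is not None:
            return replace(hit, tier=tier_name)
    if prefix_table is None:
        # Only a failed prefix-table load surfaces as an error.
        raise RuntimeError("UNAVAILABLE")
    mno = prefix_lookup(e164, prefix_table) or "UNKNOWN"  # assumption
    return Attribution(mno, "PREFIX_FALLBACK", "UNKNOWN", "FALLBACK")
```

The caller always receives an `Attribution` unless the last-resort table itself is missing, which matches the UNAVAILABLE row in the error table.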


UC-BulkLookup: ResolveBatch (gRPC server-streaming)

Trigger: sms-orchestrator bulk-submit pipeline or tenant SDK calls ResolveBatch(repeated e164).

Input: ResolveBatchRequest { entries: string[ ≤ 1000 ], opts, traceId }

Output: Stream of MsisdnAttribution in input order, one message per entry.

SLA: P95 ≤ 80 ms for 500-entry batch at ≥ 95 % cache-hit.

Steps:

  1. Validate size (> 1000 → RESOURCE_EXHAUSTED); validate each entry (invalid → emit one error slot for that index, not a whole-batch fail).
  2. Deduplicate. Group by msisdnHash; consult cascade once per unique hash; replicate into each slot.
  3. Parallel cascade. For unique entries: LRU batch-get → Redis MGET → Postgres WHERE msisdn_hash = ANY($1).
  4. Live-HLR fan-out (bounded). If forceFresh or stale beyond limit, enqueue live probes through the per-MNO TPS governor; entries exceeding the 2 s internal deadline return with source = FALLBACK_PREFIX, confidence = LOW, tier = FALLBACK.
  5. Emit in input order.

Error handling: Per-slot errors are returned inline; a partial failure never fails the whole batch unless input validation of the request envelope fails.
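
The deduplicate-then-replicate flow with per-slot errors can be sketched as below. It is a simplified, non-streaming illustration: `resolve_batch`, `to_hash`, and `resolve_unique` are hypothetical names, and the cascade is abstracted into a single callback that receives the unique hashes.

```python
from typing import Callable, Dict, List, Sequence

MAX_BATCH = 1000


def resolve_batch(entries: Sequence[str],
                  to_hash: Callable[[str], str],
                  resolve_unique: Callable[[List[str]], Dict[str, dict]]) -> List[dict]:
    if len(entries) > MAX_BATCH:
        # Envelope-level failure: the only case that fails the whole batch.
        raise ValueError("RESOURCE_EXHAUSTED")

    slots: List[dict] = [{} for _ in entries]
    indices_by_hash: Dict[str, List[int]] = {}
    for index, entry in enumerate(entries):
        try:
            key = to_hash(entry)  # validates and hashes one entry
        except ValueError as exc:
            # Invalid entry: fill only this slot with an inline error.
            slots[index] = {"error": "INVALID_ARGUMENT", "detail": str(exc)}
            continue
        indices_by_hash.setdefault(key, []).append(index)

    # One cascade pass per unique hash, however often it repeats in the input.
    results = resolve_unique(list(indices_by_hash))
    for key, indices in indices_by_hash.items():
        for index in indices:
            slots[index] = results[key]
    return slots  # same order and length as the input
```

Because slots are pre-allocated by input index, output order is preserved without a sort, and a repeated MSISDN costs one cascade lookup.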


UC-HlrProbe: Live HLR lookup via ni-hlr-gateway

Trigger: UC-Lookup or UC-BulkLookup escalates to live, OR admin explicitly invokes ProbeHlr(e164).

Steps:

  1. TPS gate. EVAL Lua against Redis token bucket numint:tps:hlr:{mnoHint} (capacity = MnoSnapshot.tpsLimit; refill rate = tpsLimit tokens per second). If the bucket is empty, wait up to opts.tpsWaitMs (default 200 ms) for a token. If none becomes available within that wait, emit HlrProbe { status: THROTTLED } and return the most-recent persisted answer with source = stale_throttled, confidence = MEDIUM.
  2. Transport selection. Read MnoSnapshot.hlrEndpoint.kind:
    • MAP → dispatch through ni-hlr-gateway LiveLookup(e164, mnoHint) gRPC. Gateway builds a MAP SendRoutingInfoForSM per 3GPP TS 29.002; application context shortMsgGatewayContext-v3; timeout 1500 ms.
    • REST → gateway issues POST {endpoint}/v1/hlr/lookup with client-credentials JWT; timeout 800 ms.
  3. Response normalisation. The gateway returns { imsi, vlr, lineType, mnoId }. NI derives country from the E.164 CC (MCC derivation from IMSI first 3 digits is used as a secondary confirmation).
  4. PCAP sampling. At 0.1 % sampling, the gateway captures the full MAP TCAP+SCCP PDU (encrypted with KMS key numint-pcap-kek) to MinIO numint-hlr-pcap/ for post-incident review.
  5. Write-through to PG + Redis + LRU (UC-WriteThrough).
  6. Record HlrProbe row (append-only) with status, durationMs, resultSnapshot.
  7. Emit numint.hlr_probe.completed.v1 (async, for fraud-intel VLR-change correlation).

Error codes:

| Condition | Behaviour |
|---|---|
| DEADLINE_EXCEEDED (MAP timeout) | Status TIMEOUT; caller gets last-known answer with confidence = LOW |
| MAP_ABORT (SS7 MAP abort) | Status MAP_ABORT; same as timeout |
| REST 5xx | Status REST_5XX; retry once then stale fallback |
| ADAPTER_DOWN | Gateway DaemonSet pod unreachable → retry against sibling pod; alert NumIntHlrAdapterDown |


UC-WriteThrough: Authoritative attribution UPSERT

Trigger: Any successful live HLR or MNP reconciliation result.

Steps:

  1. Begin PG transaction.
  2. INSERT INTO numint.number_records (...) ON CONFLICT (msisdn_hash) DO UPDATE SET mno_id = EXCLUDED.mno_id, line_type = EXCLUDED.line_type, vlr = EXCLUDED.vlr, imsi_prefix = EXCLUDED.imsi_prefix, last_seen = now(), lookup_count = number_records.lookup_count + 1, version = number_records.version + 1, cached_at = now() WHERE number_records.version = $expected_version. The existing-row columns are table-qualified because PostgreSQL rejects unqualified references in DO UPDATE as ambiguous with EXCLUDED.
  3. If mno_id or mnp_status changed vs prior, insert numint.outbox row for numint.attribution.changed.v1.
  4. SETEX numint:lookup:{hash} with per-class TTL.
  5. Commit; LRU update happens outside the transaction.
  6. The caller-facing response does not wait on write; failures retry through the outbox with exponential backoff (max 6 attempts).

Idempotency. Same (msisdn_hash, mno_id, line_type) as the current row → version bump only; no event emitted.
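
The version check and the event decision can be sketched in memory as below. This illustrates the rule only and is not the SQL path: `NumberRecord`, `write_through`, and `VersionConflict` are hypothetical names, a version mismatch (zero rows updated in SQL) is surfaced here as an exception, and treating a first insert as a change that emits an event is an assumption the source does not settle.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class NumberRecord:
    msisdn_hash: str
    mno_id: str
    line_type: str
    mnp_status: str
    version: int = 0
    lookup_count: int = 0


class VersionConflict(Exception):
    """The stored version no longer matches the caller's expected version."""


def write_through(current: Optional[NumberRecord], incoming: NumberRecord,
                  expected_version: int) -> Tuple[NumberRecord, bool]:
    """Return (row to store, whether numint.attribution.changed.v1 is emitted)."""
    if current is None:
        # Assumption: a first insert counts as a change.
        return replace(incoming, version=1, lookup_count=1), True
    if current.version != expected_version:
        raise VersionConflict(current.msisdn_hash)
    changed = ((current.mno_id, current.mnp_status)
               != (incoming.mno_id, incoming.mnp_status))
    updated = replace(incoming,
                      version=current.version + 1,
                      lookup_count=current.lookup_count + 1)
    return updated, changed
```

Replaying an identical result bumps version and lookup_count but returns `False` for the event flag, which is the idempotency property stated above.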


UC-MnpReconciliationDaily: Per-MNO SFTP intake

Trigger: Kubernetes CronJob mnp-recon at 02:30 Asia/Kabul daily; one job per MNO (fan-out). Runs only in kbl region.

Steps:

  1. Distributed lock. SET numint:lock:mnp_recon:{mnoId} NX EX 1800 — prevents concurrent runs during cron re-runs.
  2. Fetch. Pull sftp://{mno-sftp}/mnp/{yyyy-mm-dd}.csv (CSV: msisdn,donor_mno,recipient_mno,port_date,direction).
  3. Archive raw file. PUT s3://numint-mnp-raw/{mnoId}/{yyyy}/{mm}/{dd}.csv with sha256:{hash} tag.
  4. Validate each row: E.164 regex; valid mno_id; port_date <= today; CSV schema version header.
  5. Conflict detection pre-insert. For each row compute msisdnHash; read the most-recent existing PortabilityRecord for this MSISDN. If a different recipientMnoId with port_date within ±2 days exists, this is a conflict — insert into numint.reconciliation_conflicts and SKIP the port insert (the active NumberRecord.mnpStatus is not updated).
  6. Insert valid, non-conflicting rows into numint.portability_history via INSERT … ON CONFLICT DO NOTHING keyed on (msisdn_hash, port_date, recipient_mno_id, source_feed).
  7. Materialise NumberRecord. For each new port, UPDATE numint.number_records SET mno_id = :recipient, original_mno_id = COALESCE(original_mno_id, :donor), mnp_status = 'PORTED_IN', version = version + 1, cached_at = now() WHERE msisdn_hash = :msisdn_hash.
  8. Chain hash. Update ReconciliationRun.recordHash = sha256(canonical(run-payload) || prevChainHash) — per-MNO chain.
  9. Invalidate cache. DEL numint:lookup:{hash} for every changed MSISDN; emit numint.attribution.changed.v1 per row to warm subscriber caches (routing-engine, sms-firewall-service).
  10. Summary event. Publish numint.reconciliation.completed.v1 { runId, mnoId, totalRecords, accepted, rejected, conflictsCount, durationMs, fileSha256 }.
  11. Failure handling. If SFTP fetch fails, the job retries hourly until 23:00 same day; after that it escalates to P1 via NumIntMnpReconciliationStale alert.
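
The pre-insert conflict test in step 5 reduces to one predicate, sketched below. `Port` and `is_conflict` are hypothetical names, and reading "within ±2 days" as inclusive of exactly two days is an assumption.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass(frozen=True)
class Port:
    msisdn_hash: str
    recipient_mno_id: str
    port_date: date


CONFLICT_WINDOW = timedelta(days=2)


def is_conflict(latest: Optional[Port], incoming: Port) -> bool:
    """True when the incoming row disputes the most recent recorded port."""
    if latest is None:
        return False  # no prior port for this MSISDN: nothing to dispute
    different_recipient = latest.recipient_mno_id != incoming.recipient_mno_id
    # Assumption: the window is inclusive, so a gap of exactly 2 days conflicts.
    close_in_time = abs(latest.port_date - incoming.port_date) <= CONFLICT_WINDOW
    return different_recipient and close_in_time
```

A row for which this returns `True` goes to numint.reconciliation_conflicts and skips steps 6 and 7; a same-recipient repeat is not a conflict and is absorbed later by ON CONFLICT DO NOTHING.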

Budget. Each MNO's daily file is typically < 100 k rows; whole run completes P99 ≤ 4 h end-to-end per SERVICE_OVERVIEW §1.


UC-MnpConflictResolve: Admin dispute resolution

Trigger: Platform admin invokes POST /v1/admin/mnp/conflicts/{conflictId}/resolve with { resolution, note }.

Steps:

  1. Load ReconciliationConflict by id; reject if resolution IS NOT NULL (already resolved).
  2. Apply the chosen resolution:
    • A_WINS → insert the candidate A PortabilityRecord; materialise NumberRecord.
    • B_WINS → insert the candidate B PortabilityRecord; materialise NumberRecord.
    • KEEP_BOTH_PENDING_VENDOR_CONFIRM → leave state as-is; mark conflict as deferred; create a follow-up ticket.
    • DISCARDED → ignore both (e.g., both turned out to be erroneous reports).
  3. Write AuditLog entry { entityType: 'MNP_CONFLICT', action: 'RESOLVE', before, after }.
  4. Emit numint.mnp.changed.v1 if a record was committed.

UC-EirCheck: LookupEir(imei)

Trigger: sms-firewall-service or fraud-intel-service calls LookupEir(imei).

Steps:

  1. Validate Luhn per 3GPP TS 23.003 §6.2.1; reject malformed with INVALID_ARGUMENT.
  2. GET numint:eir:{imeiHash} (Redis) → HIT returns { status, reasonCode, reportedBy[], lastUpdated }.
  3. On MISS: SELECT … FROM numint.eir_records WHERE imei_hash = $1. On row return, populate Redis with 24 h TTL.
  4. On row absent: return { status: UNKNOWN }, never an error (known-unknowns are a legitimate response shape).
  5. Emit numint_eir_lookup_total{outcome} metric.
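
The Luhn check in step 1 can be sketched as follows for a 15-digit IMEI. `is_valid_imei` is a hypothetical helper name; whether the service also accepts other lengths (for example a 16-digit IMEISV, which carries no check digit) is not stated in the source and is not handled here.

```python
def is_valid_imei(imei: str) -> bool:
    """Check the 15-digit form: 14 digits plus a Luhn check digit."""
    if len(imei) != 15 or not (imei.isascii() and imei.isdigit()):
        return False
    total = 0
    # Walk right to left; double every second digit, starting with the one
    # immediately left of the check digit.
    for position, char in enumerate(reversed(imei)):
        digit = int(char)
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as summing the two digits of the product
        total += digit
    return total % 10 == 0
```

For instance, `is_valid_imei("490154203237518")` holds, while changing the final digit to 7 breaks the checksum and the value is rejected.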

UC-LookupBillingEvent: Per-call billing meter

Trigger: Public Lookup API REST call (GET /v1/lookup/{msisdn} or POST /v1/lookup/batch).

Steps:

  1. After successful response, determine SKU:
    • lookup.v1 — standard (cache or PG path; maxStaleness >= 86400).
    • lookup.fresh.v1 — forced live probe (maxStaleness < 86400 AND a live HLR was actually issued).
  2. Compute msisdnHash with per-tenant salt (tenantSalt loaded from Vault KV secret/ghasi/numint/tenant-salts/{tenantId}).
  3. Insert numint.outbox row with subject billing.metering.recorded.v1 and payload { tenantId, sku, quantity: 1, occurredAt, requestId, msisdnHash }.
  4. Insert matching numint.lookup_audit row (hash-chained).
  5. Response header X-Metering-Status: ok (or degraded if outbox insert failed; the call still returns 200 but the tenant is not billed — SRE alert fires).
  6. Internal gRPC callers (SPIFFE SAN in {routing-engine, sms-firewall-service, compliance-engine, channel-router-service, fraud-intel-service}) are not metered.
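
The SKU choice in step 1 and the internal-caller exemption in step 6 can be sketched as one function. `billing_sku` is a hypothetical name and the caller is passed as a plain string rather than a verified SPIFFE SAN. The source leaves one case open (maxStaleness < 86400 but no live probe actually issued); this sketch assumes it bills as the standard SKU.

```python
from typing import Optional

INTERNAL_CALLERS = frozenset({
    "routing-engine",
    "sms-firewall-service",
    "compliance-engine",
    "channel-router-service",
    "fraud-intel-service",
})
FRESH_THRESHOLD_SECONDS = 86400


def billing_sku(caller: str, max_staleness: int,
                live_probe_issued: bool) -> Optional[str]:
    """Return the SKU to meter, or None when the call is not billable."""
    if caller in INTERNAL_CALLERS:
        return None  # internal gRPC callers are not metered
    if max_staleness < FRESH_THRESHOLD_SECONDS and live_probe_issued:
        return "lookup.fresh.v1"
    # Assumption: a fresh request served without a live probe bills as standard.
    return "lookup.v1"
```

Keying the fresh SKU on a probe having actually been issued means a tenant is not charged the higher rate when a sufficiently recent cached answer satisfied the request.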

UC-CacheWarmCold: Warm-on-deploy

Trigger: Pod startup (readiness gate OFF until warm completes) or cron numint-cache-warm hourly.

Steps:

  1. Query top-N MSISDNs by lookup_count from numint.number_records (default N = 500 000, tuned per pod capacity).
  2. Load into Redis via SETEX in pipelined batches of 1 000.
  3. Emit numint.cache.refreshed.v1 { kind: "warm_on_deploy", keys: N, durationMs }.
  4. Flip readiness to ready once ≥ 80 % of the target is loaded.
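
The batching and the 80 % readiness gate can be sketched as below. `warm_cache`, `write_batch`, and `mark_ready` are hypothetical names; `write_batch` stands in for one pipelined group of SETEX calls, and marking an empty target as ready immediately is an assumption.

```python
from typing import Callable, Sequence


def warm_cache(keys: Sequence[str],
               write_batch: Callable[[Sequence[str]], None],
               mark_ready: Callable[[], None],
               batch_size: int = 1000,
               ready_percent: int = 80) -> int:
    """Load keys in batches; flip readiness once the threshold is reached."""
    target = len(keys)
    loaded = 0
    ready = False
    for start in range(0, target, batch_size):
        batch = keys[start:start + batch_size]
        write_batch(batch)  # one pipelined round trip per batch
        loaded += len(batch)
        # Integer arithmetic avoids float rounding at the exact threshold.
        if not ready and loaded * 100 >= ready_percent * target:
            mark_ready()
            ready = True
    if not ready:
        mark_ready()  # assumption: nothing to warm means ready immediately
    return loaded
```

Loading continues after readiness flips, so the pod starts serving at 80 % coverage while the remaining keys fill in behind it.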

UC-TenantQuotaEnforce: Rate limit + monthly cap

Trigger: Every Public Lookup API call.

Steps:

  1. Plan snapshot. Load TenantLookupQuota from Redis cache numint:quota:{tenantId} (TTL 60 s); on MISS read PG.
  2. RPS bucket. Lua-eval Redis token bucket numint:tps:lookup:{tenantId}; capacity = plan RPS (default 10). On empty, 429 + Retry-After.
  3. Monthly counter. INCR numint:quota:lookup:{tenantId}:{yyyymm}; expire at 1st of next month 00:00 Asia/Kabul; if result > monthlyQuota, 429.
  4. Fresh-lookup separate bucket. For maxStaleness < 86400, enforce freshLookupRpsLimit separately (lower cap — defends SS7 quota).
  5. Plan changes. Consume billing.tenant.plan.changed.v1 → invalidate local cache → new caps take effect within 60 s.
  6. Quota audit. Emit audit.lookup.quota_exceeded.v1 when 429 occurs (for tenant churn analysis).
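
The monthly counter key and its expiry instant in step 3 can be sketched as follows. `monthly_counter` is a hypothetical helper; deriving the {yyyymm} part from Kabul local time (rather than UTC) is an assumption, made so that the key rolls over at the same instant the counter expires. Asia/Kabul is modelled as a fixed UTC+04:30 offset, which holds because the zone observes no daylight saving time.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple

# Fixed offset for Asia/Kabul (UTC+04:30, no DST).
KABUL = timezone(timedelta(hours=4, minutes=30))


def monthly_counter(tenant_id: str, now_utc: datetime) -> Tuple[str, int]:
    """Return (Redis key, Unix time at which the key should expire)."""
    local = now_utc.astimezone(KABUL)
    key = f"numint:quota:lookup:{tenant_id}:{local:%Y%m}"
    if local.month == 12:
        year, month = local.year + 1, 1
    else:
        year, month = local.year, local.month + 1
    reset = datetime(year, month, 1, tzinfo=KABUL)  # 1st of next month, 00:00
    return key, int(reset.timestamp())
```

The returned timestamp is suitable for Redis EXPIREAT, set once after the first INCR of the month. A request at 20:00 UTC on 30 April already falls on 1 May in Kabul and is therefore counted against the May key.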

UC-AuditChainVerify: Daily hash-chain integrity check

Trigger: CronJob numint-audit-verifier at 04:30 Asia/Kabul daily (and on demand via admin endpoint).

Steps:

  1. Acquire distributed lock numint:lock:audit_verifier.
  2. For LookupAuditEntry and PortabilityRecord partitions updated in last 24 h: re-compute chain tail-to-head; compare recordHash with stored values.
  3. On mismatch: insert AuditLog { action: AUDIT_INTEGRITY_BROKEN }; page on-call with NumIntAuditChainBroken CRITICAL; freeze writes (manual un-freeze only).
  4. On success: emit numint.audit.chain_verified.v1 (ops marker).
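
The recomputation in step 2 can be sketched against the formula given for ReconciliationRun (recordHash = sha256(canonical(payload) || prevChainHash)). Several details here are assumptions: canonicalisation as sorted-key compact JSON, an all-zero genesis hash, the entry shape, and walking the chain oldest to newest. `record_hash`, `append_entry`, and `first_broken_index` are hypothetical names.

```python
import hashlib
import json
from typing import Dict, List, Optional

GENESIS_HASH = "0" * 64  # assumption: predecessor of the first entry


def record_hash(payload: Dict, prev_hash: str) -> str:
    # Assumption: canonical form is JSON with sorted keys and no whitespace.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((canonical + prev_hash).encode("utf-8")).hexdigest()


def append_entry(chain: List[Dict], payload: Dict) -> None:
    prev = chain[-1]["recordHash"] if chain else GENESIS_HASH
    chain.append({"payload": payload, "recordHash": record_hash(payload, prev)})


def first_broken_index(chain: List[Dict]) -> Optional[int]:
    """Return the index of the first entry whose stored hash does not match."""
    prev = GENESIS_HASH
    for index, entry in enumerate(chain):
        if record_hash(entry["payload"], prev) != entry["recordHash"]:
            return index
        prev = entry["recordHash"]
    return None  # chain is intact
```

Because each hash covers its predecessor's hash, editing any stored payload invalidates that entry's recordHash, and rewriting the hash to match would in turn invalidate every later entry.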

2. Performance Optimisation

2.1 Fast-path ordering (sub-5 ms budget)

Evaluation order in UC-Lookup is chosen so the majority of calls terminate at step 3 of that use case (the LRU probe). The tiers, in probe order:

  1. LRU (P50 0.2 ms) — absorbs OTP-storm repeat lookups (same MSISDN hit 3-10× within seconds).
  2. Redis (P50 1.5 ms) — per-region hot cache; 6-node cluster sized for ≥ 95 % hit ratio.
  3. Postgres (P50 6 ms) — replica pool; monthly-partitioned number_records.
  4. Live HLR (P50 250 ms MAP / 80 ms REST) — MNO-facing; TPS-governed.

2.2 Budget enforcement

Each ResolveMsisdn call has an internal 12 ms budget (leaves 3 ms for gRPC serialisation). Per-step budget sub-allocation:

| Step | Budget |
|---|---|
| Validation + hash | 0.1 ms |
| LRU get | 0.2 ms |
| Redis get | 3 ms |
| PG select | 8 ms |
| MNP overlay | 2 ms |

If the budget is exhausted mid-step, the step 6 default behaviour of UC-Lookup (return PG with confidence = LOW) applies.
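
One way to enforce the per-call budget is a small deadline tracker that caps each step's timeout by whatever remains of the 12 ms. This is a sketch under assumptions: `Budget` and its methods are hypothetical names, and the clock is injectable only so the behaviour can be shown deterministically.

```python
import time
from typing import Callable


def _monotonic_ms() -> float:
    return time.monotonic() * 1000.0


class Budget:
    """Tracks the remaining time of one call, in milliseconds."""

    def __init__(self, total_ms: float,
                 clock_ms: Callable[[], float] = _monotonic_ms) -> None:
        self._clock_ms = clock_ms
        self._deadline_ms = clock_ms() + total_ms

    def remaining_ms(self) -> float:
        return max(0.0, self._deadline_ms - self._clock_ms())

    def step_timeout_ms(self, step_budget_ms: float) -> float:
        # A step never gets more than its own allocation or what is left.
        return min(step_budget_ms, self.remaining_ms())

    def exhausted(self) -> bool:
        return self.remaining_ms() <= 0.0
```

With 5 ms already spent, a PG select allocated 8 ms would be given only the 7 ms that remain; once `exhausted()` is true the caller stops probing and applies the default behaviour.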

2.3 Redis-Lua atomic TPS gate

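-- Atomic token-bucket gate for live HLR probes. Returns 1 if a token was taken, 0 otherwise.
-- KEYS[1] = numint:tps:hlr:{mno}
-- ARGV[1] = capacity, ARGV[2] = refill_per_sec, ARGV[3] = now_ms
local capacity = tonumber(ARGV[1])
local refill_per_sec = tonumber(ARGV[2])
local now_ms = tonumber(ARGV[3])

local bucket = redis.call('HMGET', KEYS[1], 'tokens', 'ts')
local tokens = tonumber(bucket[1]) or capacity  -- a new bucket starts full
local ts = tonumber(bucket[2]) or now_ms

-- Refill for the elapsed time, capped at capacity; max(0, ...) guards against clock skew.
local elapsed = math.max(0, now_ms - ts) / 1000.0
tokens = math.min(capacity, tokens + elapsed * refill_per_sec)

local allowed = 0
if tokens >= 1 then
  tokens = tokens - 1
  allowed = 1
end

-- Persist state on both paths and refresh the idle TTL so rejected calls cannot leave a key without expiry.
redis.call('HSET', KEYS[1], 'tokens', tokens, 'ts', ARGV[3])
redis.call('EXPIRE', KEYS[1], 3600)
return allowed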
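
The refill arithmetic of the script is easier to follow with concrete numbers. The Python function below mirrors the same token-bucket logic against an in-memory dict; it is a worked illustration only (`take_token` and the `state` dict are hypothetical) and has none of the atomicity the Lua script gets from running inside Redis.

```python
from typing import Dict


def take_token(state: Dict[str, float], capacity: float,
               refill_per_sec: float, now_ms: int) -> bool:
    """Mirror of the Lua gate: refill by elapsed time, then try to take one."""
    tokens = state.get("tokens", capacity)  # a new bucket starts full
    ts = state.get("ts", now_ms)
    elapsed_s = max(0, now_ms - ts) / 1000.0
    tokens = min(capacity, tokens + elapsed_s * refill_per_sec)
    allowed = tokens >= 1
    if allowed:
        tokens -= 1
    state["tokens"] = tokens
    state["ts"] = now_ms
    return allowed
```

With capacity 2 and a refill of 2 tokens per second, two calls at t = 0 succeed and a third is refused; 500 ms later exactly one token has been refilled, so one further call succeeds and the next is refused again.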

3. MNP Conflict Resolution Heuristics

When two MNOs claim the same ported number within ±2 days:

| Heuristic | Weight |
|---|---|
| Which MNO's file has a more recent sourceFeed timestamp? | 30 % |
| Which candidate has the later portDate? | 25 % |
| Does a recent HlrProbe confirm one of the two MNOs as current? | 30 % |
| Does fraud-intel-service flag one side as a known conflict-prone MNO for this MSISDN range? | 15 % |

Weighted score > 0.7 → platform can propose an auto-resolution; lower → surfaces for manual review. Auto-resolutions are still reviewable (5-day undo window).
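
One plausible reading of the scoring rule is that each heuristic votes for candidate A, candidate B, or neither, and a candidate's score is the sum of the weights it wins. The sketch below follows that reading; the heuristic keys, `propose_resolution`, and the MANUAL_REVIEW label are hypothetical, and weights are kept as integer percentages so the 0.7 threshold compares exactly.

```python
from typing import Dict, Optional

# Weights from the table above, as integer percentages.
WEIGHTS = {
    "newer_source_feed": 30,
    "later_port_date": 25,
    "hlr_probe_confirms": 30,
    "fraud_intel_favours": 15,
}
AUTO_THRESHOLD = 70  # auto-resolution is proposed only when score > 70


def propose_resolution(favours: Dict[str, Optional[str]]) -> str:
    """favours maps heuristic name to "A", "B", or None (no signal)."""
    score = {"A": 0, "B": 0}
    for heuristic, weight in WEIGHTS.items():
        side = favours.get(heuristic)
        if side in score:
            score[side] += weight
    for side in ("A", "B"):
        if score[side] > AUTO_THRESHOLD:
            return f"{side}_WINS"
    return "MANUAL_REVIEW"
```

Since the weights total 100, at most one candidate can exceed 70, so the proposal is unambiguous. A candidate that wins the feed, port-date, and fraud-intel heuristics scores exactly 70 and still goes to manual review, because the threshold is strict.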

4. SLA Budgets (summary)

| Use case | P50 | P95 | P99 |
|---|---|---|---|
| UC-Lookup cache-hit | 1 ms | 5 ms | 10 ms |
| UC-Lookup PG fallback | 6 ms | 15 ms | 30 ms |
| UC-Lookup live HLR forced | 250 ms | 600 ms | 1200 ms |
| UC-BulkLookup (500 entries, 95 % cache) | 40 ms | 80 ms | 150 ms |
| UC-EirCheck | 2 ms | 8 ms | 20 ms |
| UC-MnpReconciliationDaily per MNO | 30 min | 90 min | 4 h |
| UC-AuditChainVerify 24 h window | 60 s | 120 s | 5 min |
| Public Lookup GET (REST) | 8 ms | 200 ms | 500 ms |
| Public Lookup batch POST (100 entries) | 200 ms | 800 ms | 2000 ms |