Number Intelligence Service — Failure Modes

Version: 1.0
Status: Draft
Owner: Messaging Core / Platform SRE
Last Updated: 2026-04-21
Companion: APPLICATION_LOGIC · OBSERVABILITY · SERVICE_RISK_REGISTER

1. Operating principle: fail-degraded

number-intelligence-service operates fail-degraded for all read paths:

A typed answer with confidence: LOW or UNKNOWN is always preferred over an error.

Downstream callers (routing-engine, sms-firewall-service, compliance-engine, channel-router-service, fraud-intel-service) each have their own fallbacks:

  • routing-engine falls back to its E.164 prefix-table.
  • sms-firewall-service defaults to allow for unknown origins.
  • compliance-engine GEO_RESTRICTION treats UNKNOWN country as the most-restrictive class.
  • channel-router-service rejects only on lineType = FIXED (deterministic from prefix); UNKNOWN passes the capability gate.

The single exception is LookupEir for BLACKLIST checking — sms-firewall-service may configure FAIL_CLOSED_ON_UNKNOWN for high-risk tenants (default OFF).
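
The exception above can be sketched as a small decision function. This is a minimal illustration, not the sms-firewall-service implementation: the `EirStatus` values other than BLACKLIST and UNKNOWN, the `TenantPolicy` type, and the `should_block` name are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class EirStatus(Enum):
    # WHITELIST is an assumed "known good" value; only BLACKLIST and
    # UNKNOWN appear in this document.
    WHITELIST = "WHITELIST"
    BLACKLIST = "BLACKLIST"
    UNKNOWN = "UNKNOWN"


@dataclass(frozen=True)
class TenantPolicy:
    # Mirrors the FAIL_CLOSED_ON_UNKNOWN switch; default OFF as stated above.
    fail_closed_on_unknown: bool = False


def should_block(status: EirStatus, policy: TenantPolicy) -> bool:
    """Decide whether to block, given a LookupEir answer and tenant policy."""
    if status is EirStatus.BLACKLIST:
        return True
    if status is EirStatus.UNKNOWN:
        # Fail-degraded by default; fail-closed only for opted-in tenants.
        return policy.fail_closed_on_unknown
    return False
```

With the default policy an UNKNOWN answer passes; only a tenant that has opted in sees UNKNOWN treated as a block.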

2. Failure-mode summary

# | Failure | Likelihood | Impact | Rating | Mitigation summary
--- | --- | --- | --- | --- | ---
FM-01 | Postgres unavailable (primary + standbys) | Low | High | HIGH | Stale-cache fallback; promote standby; degraded confidence
FM-02 | Redis unavailable | Medium | Medium | MEDIUM | PG direct; latency degrades; no functional break
FM-03 | Both Redis and Postgres unavailable | Very Low | High | HIGH | Prefix-table fallback; confidence: UNKNOWN
FM-04 | MNO MNP SFTP unreachable | Medium | Medium | MEDIUM | Use last-known MNP state; alert at 26 h stale; escalate at 48 h
FM-05 | HLR probe (SS7) unavailable | Medium | Medium | MEDIUM | Fall back to MNO MNP state + last persisted attribution
FM-06 | MNP reconciliation conflict spike | Low | Medium | MEDIUM | Triage queue; AI-assisted ranking; admin manual resolution
FM-07 | Audit hash-chain break detected | Very Low | Critical | HIGH | Freeze writes; investigate; never auto-resolve
FM-08 | Per-MNO TPS quota exhausted | Medium | Medium | MEDIUM | Token-bucket throttle returns stale answer; alert at 5 % denial rate
FM-09 | Tenant enumeration / quota abuse on Public Lookup | Medium | Medium | MEDIUM | RPS quota + monthly cap; Kong bot-detection; per-tenant audit
FM-10 | MSISDN normalisation edge case (Unicode confusable) | Low | Medium | MEDIUM | NFKC + strict regex at VO; property tests in CI
FM-11 | Cross-region replication conflict | Low | Medium | MEDIUM | Source-priority + version conflict policy; cross-region verifier
FM-12 | EIR source unreachable | Low | Medium | MEDIUM | Last-known EIR; LookupEir returns UNKNOWN for new IMEIs
FM-13 | Cache poisoning during MNP transition | Low | Medium | MEDIUM | DEL-on-MNP-change + outbox-driven event; subscriber warmers
FM-14 | NetworkPolicy mis-config exposes egress to offshore | Very Low | Critical | HIGH | Deploy-time residency test; Istio AuthorizationPolicy
FM-15 | Outbox publish stuck | Low | Medium | MEDIUM | Retry; manual replay; alert
FM-16 | ni-hlr-gateway adapter pod down | Low | Medium | MEDIUM | Sibling pod takes over; synthetic probe recovery < 30 s

3. Detailed failure modes

FM-01 — Postgres unavailable (primary + standbys)

Scenario. The Patroni primary and all standbys for the numint schema are unreachable in the active region.

Detection:

  • numint_pg_query_duration_seconds errors; /health/ready 503.
  • Patroni events; PostgresPrimaryDown alert.

Impact:

  • Hot path: serves from Redis only; cache misses fall through to in-memory prefix-table; confidence drops to LOW/UNKNOWN.
  • Writes (live HLR write-through, MNP recon) fail with 503.
  • Audit chain pauses.

Recovery:

  • Patroni auto-promotes a standby; RTO ≤ 90 s.
  • If the Kabul region is fully isolated, a manual cutover routes RW traffic to Mazar (Mazar already serves hot reads, so there is no visible outage on the read path).

Mitigation:

  • 3-node Patroni per region; PgBouncer in transaction mode shields per-pod connection churn.
  • Circuit breaker on PG client opens after 5 consecutive errors / 30 s; half-open after 60 s.
  • Mazar warm-active for hot reads.
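
The circuit-breaker thresholds above can be illustrated with a small state machine. This is a sketch under assumptions: "5 consecutive errors / 30 s" is read here as five consecutive errors within a 30 s window, the class and method names are hypothetical, and the half-open state lets calls through until one of them reports a result (a real client would usually limit it to a single trial call).

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Opens after N consecutive failures in a window; half-opens after a delay."""

    def __init__(
        self,
        failure_threshold: int = 5,
        window_s: float = 30.0,
        reset_after_s: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._failure_threshold = failure_threshold
        self._window_s = window_s
        self._reset_after_s = reset_after_s
        self._clock = clock
        self._consecutive = 0
        self._first_failure_at: Optional[float] = None
        self._opened_at: Optional[float] = None

    def allow(self) -> bool:
        """True if a call may proceed (closed, or half-open after the delay)."""
        if self._opened_at is None:
            return True
        return self._clock() - self._opened_at >= self._reset_after_s

    def record_success(self) -> None:
        # Any success closes the breaker and clears the failure streak.
        self._consecutive = 0
        self._first_failure_at = None
        self._opened_at = None

    def record_failure(self) -> None:
        now = self._clock()
        if self._opened_at is not None:
            # A failed trial call re-opens the breaker and restarts the delay.
            self._opened_at = now
            return
        if self._first_failure_at is None or now - self._first_failure_at > self._window_s:
            # Start a new streak when the previous one fell out of the window.
            self._first_failure_at = now
            self._consecutive = 0
        self._consecutive += 1
        if self._consecutive >= self._failure_threshold:
            self._opened_at = now
```

While the breaker is open the PG client would skip the query and let the lookup fall through to the next tier, which keeps connection attempts from piling up during FM-01.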

Runbook: numint-pg-down.md


FM-02 — Redis unavailable

Scenario. Redis cluster for the active region unreachable.

Detection:

  • numint_redis_op_duration_seconds errors; cache hit ratio drops to 0 %.
  • Miss ratio derived from numint_cache_misses_total approaches 100 %.

Impact:

  • With Redis down every lookup is a cache miss and falls through to Postgres; P95 latency rises from 5 ms to 15 ms.
  • Distributed locks (audit verifier, MNP recon, cache warm) cannot be acquired → workers skip cycle and emit numint_worker_skipped_no_lock_total.
  • Per-tenant rate-limit buckets cannot be enforced → tenants temporarily have unrestricted access (mitigated by Kong's own rate-limit-advanced fallback).

Recovery:

  • Redis cluster recovers; cache warms naturally on TTL-driven re-fill; manual numint-cache-warmer run accelerates.
  • HPA may scale up replicas under elevated PG load.

Mitigation:

  • 6-node Sentinel cluster per region; multi-AZ.
  • Postgres replica pool sized for 2 000 RPS sustained without cache (load-tested with 80 % headroom).

Runbook: numint-redis-down.md


FM-03 — Both Redis and Postgres unavailable

Scenario. Catastrophic dual-dependency outage in the active region.

Detection:

  • Both FM-01 and FM-02 alerts firing.
  • numint_lookup_total{tier="fallback"} rises sharply.

Impact:

  • ResolveMsisdn falls back to in-memory prefix table; source = PREFIX_FALLBACK, confidence = UNKNOWN.
  • routing-engine uses its own prefix table → routing continues with reduced MNP awareness.
  • Write paths fail entirely. The outbox cannot buffer in process memory because it requires a PG write, so live HLR results are dropped for the duration (acceptable: live HLR is opportunistic).
  • MNP & EIR reconciliation paused.

Recovery:

  • Restore PG; restore Redis; re-warm.

Mitigation:

  • Cross-region active-active means a regional outage simply reduces capacity; the other region serves.
  • Prefix table loaded at startup from Vault-pinned ATRA CSV; resident in process memory.
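
The fallback cascade described in this failure mode can be sketched as follows. It is an illustration only: the `Attribution` shape, the tier callables, and the prefix entries are hypothetical placeholders, and the real prefix table comes from the Vault-pinned ATRA CSV.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Sequence


@dataclass(frozen=True)
class Attribution:
    mno_id: str
    source: str
    confidence: str


# A tier takes an MSISDN and returns a hit, returns None on a miss,
# or raises when its backing dependency (Redis, Postgres) is down.
Tier = Callable[[str], Optional[Attribution]]


def longest_prefix_match(prefix_table: Dict[str, str], msisdn: str) -> Optional[str]:
    """Return the MNO for the longest prefix of msisdn present in the table."""
    for end in range(len(msisdn), 0, -1):
        mno = prefix_table.get(msisdn[:end])
        if mno is not None:
            return mno
    return None


def resolve_msisdn(
    msisdn: str, tiers: Sequence[Tier], prefix_table: Dict[str, str]
) -> Attribution:
    """Try each tier in order; never raise, degrade to the prefix table."""
    for tier in tiers:
        try:
            hit = tier(msisdn)
        except Exception:
            continue  # dependency unavailable: fall through to the next tier
        if hit is not None:
            return hit
    mno = longest_prefix_match(prefix_table, msisdn)
    return Attribution(
        mno_id=mno if mno is not None else "UNKNOWN",
        source="PREFIX_FALLBACK",
        confidence="UNKNOWN",
    )
```

The point of the sketch is that the caller always receives a typed answer: when both tiers raise, the result carries source = PREFIX_FALLBACK and confidence = UNKNOWN rather than an error.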

FM-04 — MNO MNP SFTP unreachable

Scenario. A specific MNO's SFTP endpoint refuses connection or returns no file.

Detection:

  • numint_mnp_recon_runs_total{mno_id, outcome="failed"} increments.
  • numint_mnp_recon_last_success_timestamp_seconds{mno_id} stops advancing.
  • NumIntMnpReconciliationStale HIGH at 26 h, CRITICAL at 48 h.
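
The two staleness thresholds can be expressed as a small classifier over the last-success timestamp. This is a sketch: the function name is hypothetical, and treating the thresholds as inclusive (at exactly 26 h or 48 h) is an assumption.

```python
from typing import Optional

HIGH_AFTER_S = 26 * 3600      # 26 h without a successful run
CRITICAL_AFTER_S = 48 * 3600  # 48 h without a successful run


def mnp_staleness_severity(last_success_ts: float, now_ts: float) -> Optional[str]:
    """Map the age of the last successful MNP run to an alert severity."""
    age_s = now_ts - last_success_ts
    if age_s >= CRITICAL_AFTER_S:
        return "CRITICAL"
    if age_s >= HIGH_AFTER_S:
        return "HIGH"
    return None  # within the expected daily cadence plus slack
```

The 26 h threshold leaves two hours of slack on top of a daily delivery, so a single late file does not alert.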

Impact:

  • New ports for that MNO not reflected in attribution; routing-engine may continue to dispatch to the donor MNO until next successful run.
  • Live HLR probes still work (independently); they may catch some new ports.

Recovery:

  • Retry hourly until 23:00 same day.
  • After 23:00 escalate to MNO operations contact (per MNO MoU).
  • Manual SFTP re-fetch via admin endpoint POST /v1/admin/numint/mnp/runs once the MNO restores service.

Mitigation:

  • MNO SLA agreements include MNP file delivery cadence and escalation path.
  • Backup MNP delivery via secure email (PGP-signed) for emergencies.

Tenant impact: Misrouting risk for ported numbers — message may be sent to donor MNO and rejected; orchestrator retries automatically. End-user may see a brief delivery delay (seconds) on ported numbers.

Runbook: numint-mnp-stale.md


FM-05 — HLR probe (SS7) unavailable

Scenario. ni-hlr-gateway adapter for an MNO is down (SIGTRAN association failed; REST endpoint 5xx).

Detection:

  • numint_hlr_adapter_health{mno_id} == 0.
  • numint_hlr_probes_total{mno_id, status!="OK"} rises.

Impact:

  • Forced-fresh requests cannot reach live HLR; receive stale persisted answer with confidence = LOW.
  • fraud-intel-service loses real-time VLR-change signal.
  • Tenant Public Lookup requests with maxStaleness < 86400 receive a stale answer; the SKU is adjusted accordingly.

Recovery:

  • Sibling DaemonSet pod takes over (DNS round-robin via headless service).
  • For SS7, M3UA association re-establishment takes 30-60 s.
  • Manual restart of the affected pod via runbook.

Mitigation:

  • DaemonSet ensures multiple pods; SS7 stack uses M3UA failover ASP groups.
  • Per-MNO REST adapter has multiple endpoints configured (primary + secondary).

FM-06 — MNP reconciliation conflict spike

Scenario. Daily MNP runs produce > 50 conflicts per MNO (baseline ≤ 10) — likely an MNO file format change or systemic regulatory event.

Detection:

  • numint_mnp_recon_conflicts_total{mno_id, severity="HIGH"} spike.
  • NumIntReconciliationConflictSpike HIGH alert.

Impact:

  • MSISDNs in conflict do not transition until manually resolved → routing may be wrong for those numbers.
  • Trust & safety analyst backlog.

Recovery:

  • AI-assisted triage ranks conflicts (see AI_INTEGRATION §2).
  • Admin reviews and resolves via POST /v1/admin/numint/mnp/conflicts/{conflictId}/resolve.
  • Coordinate with MNO if file format / process changed.

Mitigation:

  • Defensible audit trail per SECURITY_MODEL §4.
  • Conflict aging dashboard panels in numint-mnp-eir.json.

FM-07 — Audit hash-chain break detected

Scenario. Daily verifier finds record_hash mismatch in lookup_audit or portability_history.

Detection:

  • numint_audit_chain_breaks_detected_total > 0.
  • NumIntAuditChainBroken CRITICAL pages on-call.

Impact:

  • Loss of regulator-defensibility for the affected partition.
  • Possible undetected tampering — must be treated as a security incident.

Recovery:

  • Freeze writes (manual flag numint.write_freeze=true via ConfigMap).
  • Identify break boundary; preserve database state.
  • Forensic review by Security; possible Postgres-replica byte-comparison.
  • Restore from cold archive S3 if needed; re-link chain manually with documented gap.

Mitigation:

  • Append-only DB rules.
  • Two independent chain implementations (producer in TS, verifier in Python) cross-check each other; a divergence between them exposes an implementation bug before it ships.
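
For orientation, a hash chain and its verifier can be sketched in a few lines. This is not the service's chain format: the SHA-256 choice, the canonical-JSON serialisation, the genesis value, and the row layout are all assumptions made for the example; only the record_hash name comes from this document.

```python
import hashlib
import json
from typing import Dict, List, Optional

GENESIS = "0" * 64  # assumed anchor for the first row


def record_hash(prev_hash: str, payload: Dict) -> str:
    """Hash the previous row's hash together with a canonical payload encoding."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append(chain: List[Dict], payload: Dict) -> None:
    """Append a row linked to the current tail of the chain."""
    prev = chain[-1]["record_hash"] if chain else GENESIS
    chain.append(
        {"payload": payload, "prev_hash": prev, "record_hash": record_hash(prev, payload)}
    )


def first_break(chain: List[Dict]) -> Optional[int]:
    """Return the index of the first row that fails verification, else None."""
    prev = GENESIS
    for i, row in enumerate(chain):
        if row["prev_hash"] != prev or row["record_hash"] != record_hash(prev, row["payload"]):
            return i
        prev = row["record_hash"]
    return None
```

`first_break` gives the break boundary that the recovery steps ask for: every row before the returned index still verifies, and everything from that index onward needs forensic review.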

Runbook: numint-audit-chain-broken.md


FM-08 — Per-MNO TPS quota exhausted

Scenario. Tenant traffic + fraud-intel forced-fresh together exceed an MNO's contracted SS7 TPS.

Detection:

  • numint_hlr_tps_denied_total{mno_id} rises.
  • Per-MNO denial rate > 5 % over 5 min → NumIntHlrThrottling MEDIUM.

Impact:

  • Live HLR results return as STALE_THROTTLED; callers receive last-known persisted answer with confidence = MEDIUM.
  • Some divergence from real-world state for low-volume MNOs.

Recovery:

  • Lower freshLookupRpsLimit for the offending tenants.
  • Negotiate higher TPS with MNO if sustained business need.

Mitigation:

  • Token bucket per (mno, op) is mandatory.
  • Tenant SDK strongly discourages forceFresh on every call (documented).
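
The throttle-then-serve-stale behaviour can be sketched with a plain token bucket. The class, the `lookup_hlr` helper, and the rate and burst values are hypothetical; in the service one bucket would exist per (mno, op) key, and only the STALE_THROTTLED label is taken from this document.

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


class TokenBucket:
    """Refills at rate_per_s up to burst; each probe consumes one token."""

    def __init__(
        self, rate_per_s: float, burst: float, clock: Callable[[], float] = time.monotonic
    ) -> None:
        self._rate = float(rate_per_s)
        self._capacity = float(burst)
        self._tokens = float(burst)
        self._clock = clock
        self._last = clock()

    def try_acquire(self, cost: float = 1.0) -> bool:
        now = self._clock()
        # Credit tokens for the elapsed time, capped at the burst capacity.
        self._tokens = min(self._capacity, self._tokens + (now - self._last) * self._rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


def lookup_hlr(bucket: TokenBucket, probe: Callable[[], T], last_known: T) -> Tuple[str, T]:
    """Probe live HLR if a token is available, else return the persisted answer."""
    if bucket.try_acquire():
        return ("LIVE", probe())
    return ("STALE_THROTTLED", last_known)
```

A denied acquire never blocks the caller; it degrades to the last-known answer, which is what keeps the throttle consistent with the fail-degraded principle.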

FM-09 — Tenant enumeration / quota abuse on Public Lookup

Scenario. Malicious tenant or stolen credentials send 10× plan RPS attempting MSISDN enumeration.

Detection:

  • numint_public_lookup_quota_breach_total{tenant_id} sustained > 10/15 min → NumIntPublicLookupQuotaAbuse MEDIUM.
  • Kong bot-detection plugin flags JA3 fingerprint.

Impact:

  • Other tenants unaffected (per-tenant buckets isolate).
  • Audit log fills with the attacker's hash sequence (tenant-salted; not cross-tenant correlatable).

Recovery:

  • Tenant-side: 429 with Retry-After returned for the duration of breach.
  • Platform-side: alert prompts manual review; suspend tenant if abuse confirmed.
  • Rotate tenant API key + JWT issuer revocation if credentials suspected stolen.

Mitigation:

  • Per-tenant RPS + monthly + fresh-lookup buckets.
  • Anti-enumeration: response time uniform whether MSISDN is known or unknown.
  • Tenant-salted audit hash limits intelligence value of any leak.

FM-10 — MSISDN normalisation edge case

Scenario. Tenant submits an MSISDN with Unicode confusable characters (RTL marks, Arabic-Indic digits, look-alikes) that the regex initially accepts.

Detection:

  • Property-based test in CI catches first.
  • Production: would surface as cache key explosion (semantically same MSISDN keyed differently).

Impact:

  • Correctness regression — same MSISDN looked up twice produces different results / billing.

Recovery:

  • Regex/NFKC normalisation fix; redeploy.
  • Re-bill affected tenant calls (rare).

Mitigation:

  • NFKC normalisation BEFORE regex match.
  • Strict ASCII digit class ([0-9], not \d, which matches Unicode digits in many regex engines).
  • Property-based test asserts parse(canonicalise(s)) == parse(s) for any conformant input.
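
A minimal sketch of the normalise-then-match order, assuming an E.164-style pattern of "+" followed by up to 15 ASCII digits (the exact pattern used at the value object is not given here, and `canonicalise` is borrowed from the property above only as a name):

```python
import re
import unicodedata
from typing import Optional

# Assumed E.164-style shape: "+", a non-zero digit, then up to 14 more digits.
# [0-9] is deliberate: in Python, \d on str patterns also matches
# non-ASCII decimal digits such as Arabic-Indic ones.
_E164 = re.compile(r"\+[1-9][0-9]{1,14}")


def canonicalise(raw: str) -> Optional[str]:
    """NFKC-normalise first, then accept only the strict ASCII form."""
    s = unicodedata.normalize("NFKC", raw)
    return s if _E164.fullmatch(s) else None
```

NFKC folds compatibility forms such as fullwidth digits into ASCII, so those inputs collapse onto one cache key. Arabic-Indic digits and directional marks are not changed by NFKC, so the strict pattern rejects them instead of letting them through as a second key for the same number.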

FM-11 — Cross-region replication conflict

Scenario. Concurrent writes in Kabul and Mazar to the same NumberRecord (rare — typically only on admin override or simultaneous MNP recon).

Detection:

  • numint_pg_replication_conflict_total increments.
  • Cross-region audit verifier reports divergent recordHash.

Impact:

  • One write may be lost (last-write-wins on version); audit chain may fork.

Recovery:

  • Freeze writes; reconcile manually using source-priority.
  • Re-derive winning row; replay outbox.

Mitigation:

  • Per-aggregate conflict policy (see SYNC_CONTRACT §4).
  • Batch jobs run in Kabul only with leader-election lock.
  • Admin overrides require dual-control + are rare.

FM-12 — EIR source unreachable

Scenario. ATRA SFTP or per-MNO CEIR endpoint unreachable for a daily run.

Detection:

  • numint_eir_recon_runs_total{outcome="failed"} increments.
  • NumIntEirSyncStale MEDIUM at 26 h.

Impact:

  • New stolen-IMEI flags not reflected; LookupEir returns last-known.
  • sms-firewall-service may not block traffic to newly-flagged devices.

Recovery:

  • Retry hourly until 23:00.
  • Manual resync via POST /v1/admin/numint/eir/runs.

Mitigation:

  • Multi-source aggregation (ATRA + per-MNO) means single-source failure does not lose all signal.

FM-13 — Cache poisoning during MNP transition

Scenario. A live HLR probe lands a stale answer in cache milliseconds before MNP recon writes the new state, leaving cache poisoned for the cache TTL.

Detection:

  • numint.mnp.divergence.v1 events spike for newly-ported MSISDNs in the hours after recon.

Impact:

  • For up to 24 h (MNP TTL) some lookups return wrong MNO.

Recovery:

  • MNP recon writes both invalidate Redis (DEL) and emit numint.attribution.changed.v1 to warm subscribers.
  • If poisoning is detected, a manually triggered run of the numint-cache-flush cron job forces a full re-warm.

Mitigation:

  • Atomic UPSERT in PG with version conditional prevents the live-HLR write from clobbering a fresher MNP row.
  • MNP overlay step in UC-Lookup checks PortabilityRecord post-cache and overrides.
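
The version-conditional UPSERT can be demonstrated with SQLite standing in for Postgres (both accept a WHERE clause on ON CONFLICT ... DO UPDATE). The table and column names below are hypothetical, not the numint schema.

```python
import sqlite3

DDL = """
CREATE TABLE number_record (
    msisdn  TEXT PRIMARY KEY,
    mno_id  TEXT NOT NULL,
    source  TEXT NOT NULL,
    version INTEGER NOT NULL
)
"""

# The WHERE clause makes the update a no-op unless the incoming row is newer.
UPSERT = """
INSERT INTO number_record (msisdn, mno_id, source, version)
VALUES (?, ?, ?, ?)
ON CONFLICT (msisdn) DO UPDATE
SET mno_id = excluded.mno_id,
    source = excluded.source,
    version = excluded.version
WHERE excluded.version > number_record.version
"""


def upsert(conn: sqlite3.Connection, msisdn: str, mno_id: str, source: str, version: int) -> None:
    """Insert the row, or overwrite it only when the incoming version is higher."""
    conn.execute(UPSERT, (msisdn, mno_id, source, version))
    conn.commit()
```

If MNP recon has already written version 7, a late live-HLR write carrying version 6 leaves the row untouched, so the stale answer never reaches the system of record even if it briefly reached the cache.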

FM-14 — NetworkPolicy mis-config exposes egress to offshore

Scenario. A NetworkPolicy or AuthorizationPolicy change accidentally permits egress to a non-Afghan IP (cloud LLM, third-party telemetry).

Detection:

  • Deploy-time residency test fails.
  • Runtime: numint_egress_offshore_total (synthetic check) > 0.

Impact:

  • Potential PII leak to non-Afghan jurisdiction → regulatory violation.

Recovery:

  • Rollback the NetworkPolicy; re-deploy.
  • Audit which calls (if any) touched the offshore endpoint; report to DPO if PII was in transit.

Mitigation:

  • Deploy-time residency test is mandatory CI gate.
  • Istio AuthorizationPolicy as second layer.
  • Egress NetworkPolicy explicitly denies non-10.0.0.0/8 egress on the hot-path Deployment.
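
A residency check of the kind the CI gate performs could look like the sketch below. It only classifies destination addresses against the 10.0.0.0/8 range named above; the function names are hypothetical, and how the real test collects observed destinations is not specified in this document.

```python
import ipaddress
from typing import Iterable, List

# Only the range named in the egress policy above is treated as in-bounds.
ALLOWED_EGRESS = (ipaddress.ip_network("10.0.0.0/8"),)


def is_allowed_egress(dest: str) -> bool:
    """True if dest (an IP literal) falls inside an allowed network."""
    addr = ipaddress.ip_address(dest)
    return any(addr in net for net in ALLOWED_EGRESS)


def offshore_destinations(observed: Iterable[str]) -> List[str]:
    """Return every observed destination that the policy should have denied."""
    return [d for d in observed if not is_allowed_egress(d)]
```

A deploy-time gate would fail the pipeline whenever `offshore_destinations` returns a non-empty list; the runtime synthetic check can apply the same classification.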

FM-15 — Outbox publish stuck

Scenario. NATS unreachable or specific subject misconfigured; outbox rows accumulate.

Detection:

  • numint_outbox_oldest_unpublished_age_seconds > 60 → NumIntOutboxStuck HIGH.

Impact:

  • Subscriber services (routing-engine, sms-firewall, billing) miss state-change events; their caches drift.

Recovery:

  • Restore NATS connectivity; outbox relay drains.
  • Manual replay via POST /v1/admin/numint/outbox/replay.

Mitigation:

  • Outbox relay is per-replica with SELECT … FOR UPDATE SKIP LOCKED so multiple workers pick up backlog.
  • Dead-letter subjects retain failed messages for SRE inspection.

FM-16 — ni-hlr-gateway adapter pod down

Same as FM-05; documented separately because the gateway is the single network hop that owns SIGTRAN sockets — a pod restart drops live MAP dialogs (orphaned invokes return TIMEOUT to the caller; replay logic in NI handles this).


4. Failure-mode interaction matrix

Concurrent failure | Combined impact
--- | ---
FM-01 + FM-02 | FM-03 (prefix-table fallback)
FM-04 + FM-05 (MNP file + live HLR for same MNO) | Attribution stale for that MNO; routing-engine prefix-table fallback handles
FM-07 + FM-11 | Catastrophic — freeze all writes; declare incident; engage Security + Platform Arch
FM-09 + FM-08 | Tenant abuse drives SS7 quota exhaustion → throttle both at the gateway and at the per-tenant fresh-lookup bucket