Number Intelligence Service — Failure Modes

Version: 1.0
Status: Draft
Owner: Messaging Core / Platform SRE
Last Updated: 2026-04-21
Companion: APPLICATION_LOGIC · OBSERVABILITY · SERVICE_RISK_REGISTER

1. Operating principle: fail-degraded

number-intelligence-service operates fail-degraded for all read paths:

A typed answer with confidence: LOW or UNKNOWN is always preferred over an error.

Downstream callers (routing-engine, sms-firewall-service, compliance-engine, channel-router-service, fraud-intel-service) each have their own fallbacks:

routing-engine falls back to its E.164 prefix-table.
sms-firewall-service defaults to allow for unknown origins.
compliance-engine GEO_RESTRICTION treats UNKNOWN country as the most-restrictive class.
channel-router-service rejects only on lineType = FIXED (deterministic from prefix); UNKNOWN passes the capability gate.

The single exception is LookupEir for BLACKLIST checking — sms-firewall-service may configure FAIL_CLOSED_ON_UNKNOWN for high-risk tenants (default OFF).
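
The opt-in exception can be pictured as a small decision function. This is a minimal sketch, not the firewall's actual code: the `EirStatus` enum, the `CLEAR` member, and the `should_block` helper are hypothetical names introduced for illustration; only BLACKLIST, UNKNOWN, and the FAIL_CLOSED_ON_UNKNOWN flag come from the text above.

```python
from enum import Enum


class EirStatus(Enum):
    CLEAR = "CLEAR"          # hypothetical name for "not blacklisted"
    BLACKLIST = "BLACKLIST"
    UNKNOWN = "UNKNOWN"


def should_block(status: EirStatus, fail_closed_on_unknown: bool = False) -> bool:
    """Return True when the message should be blocked on the EIR result."""
    if status is EirStatus.BLACKLIST:
        return True
    if status is EirStatus.UNKNOWN:
        # Default OFF: UNKNOWN passes (fail-degraded).
        # High-risk tenants opt in to blocking on UNKNOWN.
        return fail_closed_on_unknown
    return False
```

With the flag left at its default, an UNKNOWN result behaves like every other read path and passes; only a tenant that sets the flag sees UNKNOWN treated as a block.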

2. Failure-mode summary

#	Failure	Likelihood	Impact	Rating	Mitigation summary
FM-01	Postgres unavailable (primary + standbys)	Low	High	HIGH	Stale-cache fallback; promote standby; degraded confidence
FM-02	Redis unavailable	Medium	Medium	MEDIUM	PG direct; latency degrades; no functional break
FM-03	Both Redis and Postgres unavailable	Very Low	High	HIGH	Prefix-table fallback; `confidence: UNKNOWN`
FM-04	MNO MNP SFTP unreachable	Medium	Medium	MEDIUM	Use last-known MNP state; alert at 26 h stale; escalate at 48 h
FM-05	HLR probe (SS7) unavailable	Medium	Medium	MEDIUM	Fall back to MNO MNP state + last persisted attribution
FM-06	MNP reconciliation conflict spike	Low	Medium	MEDIUM	Triage queue; AI-assisted ranking; admin manual resolution
FM-07	Audit hash-chain break detected	Very Low	Critical	HIGH	Freeze writes; investigate; never auto-resolve
FM-08	Per-MNO TPS quota exhausted	Medium	Medium	MEDIUM	Token-bucket throttle returns stale answer; alert at 5 % denial rate
FM-09	Tenant enumeration / quota abuse on Public Lookup	Medium	Medium	MEDIUM	RPS quota + monthly cap; Kong bot-detection; per-tenant audit
FM-10	MSISDN normalisation edge case (Unicode confusable)	Low	Medium	MEDIUM	NFKC + strict regex at VO; property tests in CI
FM-11	Cross-region replication conflict	Low	Medium	MEDIUM	Source-priority + version conflict policy; cross-region verifier
FM-12	EIR source unreachable	Low	Medium	MEDIUM	Last-known EIR; `LookupEir` returns UNKNOWN for new IMEIs
FM-13	Cache poisoning during MNP transition	Low	Medium	MEDIUM	DEL-on-MNP-change + outbox-driven event; subscriber warmers
FM-14	NetworkPolicy mis-config exposes egress to offshore	Very Low	Critical	HIGH	Deploy-time residency test; Istio AuthorizationPolicy
FM-15	Outbox publish stuck	Low	Medium	MEDIUM	Retry; manual replay; alert
FM-16	`ni-hlr-gateway` adapter pod down	Low	Medium	MEDIUM	Sibling pod takes over; synthetic probe recovery < 30 s

3. Detailed failure modes

FM-01 — Postgres unavailable (primary + standbys)

Scenario. All Patroni primaries and standbys for the numint schema unreachable in the active region.

Detection:

numint_pg_query_duration_seconds errors; /health/ready 503.
Patroni events; PostgresPrimaryDown alert.

Impact:

Hot path: serves from Redis only; cache misses fall through to in-memory prefix-table; confidence drops to LOW/UNKNOWN.
Writes (live HLR write-through, MNP recon) fail with 503.
Audit chain pauses.

Recovery:

Patroni auto-promotes a standby; RTO ≤ 90 s.
If Kabul region is fully isolated, manual cutover routes RW traffic to Mazar (Mazar already serves hot reads — no visible outage on the read path).

Mitigation:

3-node Patroni per region; PgBouncer in transaction mode shields per-pod connection churn.
Circuit breaker on PG client opens after 5 consecutive errors / 30 s; half-open after 60 s.
Mazar warm-active for hot reads.
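
The breaker thresholds above can be sketched as follows. This is an illustrative sketch under one assumption about the wording: "5 consecutive errors / 30 s" is read as five consecutive errors inside a 30-second window. The class name, method names, and state strings are hypothetical and do not describe the real PG client.

```python
import time


class CircuitBreaker:
    """Opens after N consecutive failures in a window; half-opens after a cooldown."""

    def __init__(self, failure_threshold=5, window_s=30.0, cooldown_s=60.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.window_s = window_s
        self.cooldown_s = cooldown_s
        self._clock = clock
        self._failures = []      # timestamps of consecutive failures
        self._opened_at = None   # None means the breaker is closed

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "CLOSED"
        if self._clock() - self._opened_at >= self.cooldown_s:
            return "HALF_OPEN"
        return "OPEN"

    def allow_request(self) -> bool:
        # OPEN rejects immediately; HALF_OPEN lets a trial request through.
        return self.state != "OPEN"

    def record_success(self) -> None:
        # Any success resets the consecutive-failure run and closes the breaker.
        self._failures.clear()
        self._opened_at = None

    def record_failure(self) -> None:
        now = self._clock()
        if self.state == "HALF_OPEN":
            self._opened_at = now  # trial request failed: re-open
            return
        # Keep only failures that are still inside the window.
        self._failures = [t for t in self._failures if now - t <= self.window_s]
        self._failures.append(now)
        if len(self._failures) >= self.failure_threshold:
            self._opened_at = now
```

While the breaker is open, the hot path would skip Postgres and go straight to the degraded answer instead of waiting on connection timeouts.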

Runbook: numint-pg-down.md

FM-02 — Redis unavailable

Scenario. Redis cluster for the active region unreachable.

Detection:

numint_redis_op_duration_seconds errors; cache hit ratio drops to 0 %.
Miss ratio derived from numint_cache_misses_total approaches 100 %.

Impact:

Every lookup falls through to Postgres because no cache tier is reachable; P95 latency rises from 5 ms to 15 ms.
Distributed locks (audit verifier, MNP recon, cache warm) cannot be acquired → workers skip cycle and emit numint_worker_skipped_no_lock_total.
Per-tenant rate-limit buckets cannot be enforced → tenants are temporarily unrestricted at the service level (mitigated by Kong's own rate-limit-advanced fallback).

Recovery:

Redis cluster recovers; cache warms naturally on TTL-driven re-fill; manual numint-cache-warmer run accelerates.
HPA may scale up replicas under elevated PG load.

Mitigation:

6-node Sentinel cluster per region; multi-AZ.
Postgres replica pool sized for 2 000 RPS sustained without cache (load-tested with 80 % headroom).

Runbook: numint-redis-down.md

FM-03 — Both Redis and Postgres unavailable

Scenario. Catastrophic dual-dependency outage in the active region.

Detection:

Both FM-01 and FM-02 alerts firing.
numint_lookup_total{tier="fallback"} rises sharply.

Impact:

ResolveMsisdn falls back to in-memory prefix table; source = PREFIX_FALLBACK, confidence = UNKNOWN.
routing-engine uses its own prefix table → routing continues with reduced MNP awareness.
Write paths fail entirely. The outbox cannot buffer in process memory because it requires a PG write, so live HLR results are dropped for the duration (acceptable: live HLR is opportunistic).
MNP & EIR reconciliation paused.

Recovery:

Restore PG; restore Redis; re-warm.

Mitigation:

Cross-region active-active means a regional outage simply reduces capacity; the other region serves.
Prefix table loaded at startup from Vault-pinned ATRA CSV; resident in process memory.
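
The last-resort tier can be sketched as a cascade that ends in a longest-prefix match. This is a sketch under stated assumptions: the `Attribution` type, the tier callables, and the prefixes and MNO ids in `PREFIX_TABLE` are hypothetical placeholders, not the real ATRA data; only `PREFIX_FALLBACK` and `UNKNOWN` come from the text above.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class Attribution:
    mno_id: Optional[str]
    source: str
    confidence: str


# Hypothetical prefixes and MNO ids, for illustration only.
PREFIX_TABLE = {"+9370": "mno-a", "+9379": "mno-b"}


def prefix_lookup(msisdn: str) -> Attribution:
    """Longest-prefix match against the in-memory table; never raises."""
    for length in range(len(msisdn), 0, -1):
        mno_id = PREFIX_TABLE.get(msisdn[:length])
        if mno_id is not None:
            return Attribution(mno_id, "PREFIX_FALLBACK", "UNKNOWN")
    return Attribution(None, "PREFIX_FALLBACK", "UNKNOWN")


def resolve_msisdn(
    msisdn: str,
    tiers: Iterable[Callable[[str], Optional[Attribution]]],
) -> Attribution:
    """Try each tier in order (e.g. Redis, then PG); fall back to the prefix table."""
    for tier in tiers:
        try:
            hit = tier(msisdn)
        except ConnectionError:
            continue  # tier unreachable: degrade rather than fail
        if hit is not None:
            return hit
    return prefix_lookup(msisdn)
```

The caller always receives a typed answer; the `source` and `confidence` fields are how it learns that the answer came from the degraded tier.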

FM-04 — MNO MNP SFTP unreachable

Scenario. A specific MNO's SFTP endpoint refuses connection or returns no file.

Detection:

numint_mnp_recon_runs_total{mno_id, outcome="failed"} increments.
numint_mnp_recon_last_success_timestamp_seconds{mno_id} stops advancing.
NumIntMnpReconciliationStale HIGH at 26 h, CRITICAL at 48 h.
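
The two staleness thresholds reduce to a comparison against the last successful run. The sketch below is illustrative only; in production this would presumably be an alert rule over numint_mnp_recon_last_success_timestamp_seconds rather than application code, and the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

HIGH_AFTER = timedelta(hours=26)
CRITICAL_AFTER = timedelta(hours=48)


def mnp_staleness_severity(last_success: datetime, now: datetime) -> Optional[str]:
    """Map the age of the last successful MNP run to an alert severity."""
    age = now - last_success
    if age >= CRITICAL_AFTER:
        return "CRITICAL"
    if age >= HIGH_AFTER:
        return "HIGH"
    return None  # still within the expected cadence
```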

Impact:

New ports for that MNO not reflected in attribution; routing-engine may continue to dispatch to the donor MNO until next successful run.
Live HLR probes still work (independently); they may catch some new ports.

Recovery:

Retry hourly until 23:00 same day.
After 23:00 escalate to MNO operations contact (per MNO MoU).
Manual SFTP re-fetch via admin endpoint POST /v1/admin/numint/mnp/runs once the MNO restores service.

Mitigation:

MNO SLA agreements include MNP file delivery cadence and escalation path.
Backup MNP delivery via secure email (PGP-signed) for emergencies.

Tenant impact: Misrouting risk for ported numbers — message may be sent to donor MNO and rejected; orchestrator retries automatically. End-user may see a brief delivery delay (seconds) on ported numbers.

Runbook: numint-mnp-stale.md

FM-05 — HLR probe (SS7) unavailable

Scenario. ni-hlr-gateway adapter for an MNO is down (SIGTRAN association failed; REST endpoint 5xx).

Detection:

numint_hlr_adapter_health{mno_id} == 0.
numint_hlr_probes_total{mno_id, status!="OK"} rises.

Impact:

Forced-fresh requests cannot reach live HLR; receive stale persisted answer with confidence = LOW.
fraud-intel-service loses real-time VLR-change signal.
Tenant Public Lookup requests with maxStaleness < 86400 receive a stale answer; the billed SKU is adjusted accordingly.

Recovery:

Sibling DaemonSet pod takes over (DNS round-robin via headless service).
For SS7, M3UA association re-establishment takes 30-60 s.
Manual restart of the affected pod via runbook.

Mitigation:

DaemonSet ensures multiple pods; SS7 stack uses M3UA failover ASP groups.
Per-MNO REST adapter has multiple endpoints configured (primary + secondary).

FM-06 — MNP reconciliation conflict spike

Scenario. Daily MNP runs produce > 50 conflicts per MNO (baseline ≤ 10) — likely an MNO file format change or systemic regulatory event.

Detection:

numint_mnp_recon_conflicts_total{mno_id, severity="HIGH"} spike.
NumIntReconciliationConflictSpike HIGH alert.

Impact:

MSISDNs in conflict do not transition until manually resolved → routing may be wrong for those numbers.
Trust & safety analyst backlog.

Recovery:

AI-assisted triage ranks conflicts (see AI_INTEGRATION §2).
Admin reviews and resolves via POST /v1/admin/numint/mnp/conflicts/{conflictId}/resolve.
Coordinate with MNO if file format / process changed.

Mitigation:

Defensible audit trail per SECURITY_MODEL §4.
Conflict aging dashboard panels in numint-mnp-eir.json.

FM-07 — Audit hash-chain break detected

Scenario. Daily verifier finds record_hash mismatch in lookup_audit or portability_history.

Detection:

numint_audit_chain_breaks_detected_total > 0.
NumIntAuditChainBroken CRITICAL pages on-call.

Impact:

Loss of regulator-defensibility for the affected partition.
Possible undetected tampering — must be treated as a security incident.

Recovery:

Freeze writes (manual flag numint.write_freeze=true via ConfigMap).
Identify break boundary; preserve database state.
Forensic review by Security; possible Postgres-replica byte-comparison.
Restore from cold archive S3 if needed; re-link chain manually with documented gap.

Mitigation:

Append-only DB rules.
Two independent chain implementations (producer in TS, verifier in Python) cross-check each other; a divergence between them surfaces an implementation bug before it ships.
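
A verifier pass over a hash chain can be sketched as below. This is a minimal sketch, assuming SHA-256 over the previous hash concatenated with a canonical JSON payload; the real record layout, hash function, and canonicalisation of lookup_audit and portability_history are not specified here, and the function and field names (other than record_hash) are hypothetical.

```python
import hashlib
import json
from typing import Optional

GENESIS = "0" * 64  # assumed sentinel for the first record's predecessor


def record_hash(prev_hash: str, payload: dict) -> str:
    """Hash the previous link together with a canonical form of the payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append(chain: list, payload: dict) -> None:
    prev = chain[-1]["record_hash"] if chain else GENESIS
    chain.append({
        "payload": payload,
        "prev_hash": prev,
        "record_hash": record_hash(prev, payload),
    })


def first_break(chain: list) -> Optional[int]:
    """Return the index of the first record that fails verification, else None."""
    prev = GENESIS
    for i, rec in enumerate(chain):
        expected = record_hash(prev, rec["payload"])
        if rec["prev_hash"] != prev or rec["record_hash"] != expected:
            return i
        prev = rec["record_hash"]
    return None
```

The returned index corresponds to the break boundary that the recovery steps ask the responder to identify; the sketch only detects the break and deliberately does nothing to repair it.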

Runbook: numint-audit-chain-broken.md

FM-08 — Per-MNO TPS quota exhausted

Scenario. Tenant traffic + fraud-intel forced-fresh together exceed an MNO's contracted SS7 TPS.

Detection:

numint_hlr_tps_denied_total{mno_id} rises.
Per-MNO denial rate > 5 % over 5 min → NumIntHlrThrottling MEDIUM.

Impact:

Live HLR results return as STALE_THROTTLED; callers receive last-known persisted answer with confidence = MEDIUM.
Some divergence from real-world state for low-volume MNOs.

Recovery:

Lower freshLookupRpsLimit for the offending tenants.
Negotiate higher TPS with MNO if sustained business need.

Mitigation:

Token bucket per (mno, op) is mandatory.
Tenant SDK strongly discourages forceFresh on every call (documented).
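
The throttle-then-serve-stale behaviour can be sketched with one token bucket per (mno, op) key. This is an illustrative sketch: the class names, the dict-shaped answers, and the rate and capacity values are hypothetical; STALE_THROTTLED and confidence MEDIUM are taken from the impact description above.

```python
import time


class TokenBucket:
    """Refills continuously at rate_per_s up to capacity."""

    def __init__(self, rate_per_s: float, capacity: int, clock=time.monotonic):
        self.rate = rate_per_s
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)
        self._last = clock()

    def try_acquire(self) -> bool:
        now = self._clock()
        elapsed = now - self._last
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        self._last = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False


class HlrThrottle:
    """One bucket per (mno_id, op); a denied probe returns the last-known answer."""

    def __init__(self, rate_per_s: float, capacity: int, clock=time.monotonic):
        self._rate, self._capacity, self._clock = rate_per_s, capacity, clock
        self._buckets = {}

    def probe_or_stale(self, mno_id: str, op: str, probe, last_known: dict) -> dict:
        key = (mno_id, op)
        if key not in self._buckets:
            self._buckets[key] = TokenBucket(self._rate, self._capacity, self._clock)
        if self._buckets[key].try_acquire():
            return probe()  # within quota: run the live HLR probe
        # Over quota: never call the MNO; serve the persisted answer instead.
        return {**last_known, "status": "STALE_THROTTLED", "confidence": "MEDIUM"}
```

Keying the bucket on both MNO and operation keeps one noisy operation type from consuming the whole contracted TPS of an MNO.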

FM-09 — Tenant enumeration / quota abuse on Public Lookup

Scenario. Malicious tenant or stolen credentials send 10× plan RPS attempting MSISDN enumeration.

Detection:

numint_public_lookup_quota_breach_total{tenant_id} sustained above 10 breaches per 15 min → NumIntPublicLookupQuotaAbuse MEDIUM.
Kong bot-detection plugin flags JA3 fingerprint.

Impact:

Other tenants unaffected (per-tenant buckets isolate).
Audit log fills with the attacker's hash sequence (tenant-salted; not cross-tenant correlatable).

Recovery:

Tenant-side: 429 with Retry-After returned for the duration of breach.
Platform-side: alert prompts manual review; suspend tenant if abuse confirmed.
Rotate the tenant API key and revoke the JWT issuer if credentials are suspected stolen.

Mitigation:

Per-tenant RPS + monthly + fresh-lookup buckets.
Anti-enumeration: response time uniform whether MSISDN is known or unknown.
Tenant-salted audit hash limits intelligence value of any leak.

FM-10 — MSISDN normalisation edge case

Scenario. Tenant submits an MSISDN with Unicode confusable characters (RTL marks, Arabic-Indic digits, look-alikes) that the regex initially accepts.

Detection:

Property-based test in CI catches first.
Production: would surface as cache key explosion (semantically same MSISDN keyed differently).

Impact:

Correctness regression — same MSISDN looked up twice produces different results / billing.

Recovery:

Regex/NFKC normalisation fix; redeploy.
Re-bill affected tenant calls (rare).

Mitigation:

NFKC normalisation BEFORE regex match.
Strict ASCII digit class ([0-9], not \d, which matches any Unicode decimal digit in many regex engines).
Property-based test asserts parse(canonicalise(s)) == parse(s) for any conformant input.
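
The order of operations matters: normalise first, then match against an ASCII-only pattern. The sketch below assumes a generic E.164 shape (a plus sign and up to 15 digits with a non-zero leading digit); the service's actual regex and the function names `canonicalise` and `parse_msisdn` as written here are hypothetical.

```python
import re
import unicodedata

# ASCII-only digit class. In Python, \d on str patterns also matches
# Arabic-Indic and other Unicode decimal digits, so it is avoided here.
_E164 = re.compile(r"\+[1-9][0-9]{1,14}")


def canonicalise(raw: str) -> str:
    """NFKC folds compatibility forms (e.g. fullwidth digits) to ASCII."""
    return unicodedata.normalize("NFKC", raw)


def parse_msisdn(raw: str) -> str:
    """Return the canonical MSISDN or raise ValueError."""
    candidate = canonicalise(raw)
    if _E164.fullmatch(candidate) is None:
        raise ValueError("not a canonical E.164 MSISDN")
    return candidate
```

NFKC folds fullwidth digits to ASCII, so those inputs converge on one cache key. Arabic-Indic digits and directional marks have no compatibility mapping, so they survive normalisation and are rejected by the strict pattern rather than silently creating a second key for the same number.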

FM-11 — Cross-region replication conflict

Scenario. Concurrent writes in Kabul and Mazar to the same NumberRecord (rare — typically only on admin override or simultaneous MNP recon).

Detection:

numint_pg_replication_conflict_total increments.
Cross-region audit verifier reports divergent recordHash.

Impact:

One write may be lost (last-write-wins on version); audit chain may fork.

Recovery:

Freeze writes; reconcile manually using source-priority.
Re-derive winning row; replay outbox.

Mitigation:

Per-aggregate conflict policy (see SYNC_CONTRACT §4).
Batch jobs run in Kabul only with leader-election lock.
Admin overrides require dual-control + are rare.

FM-12 — EIR source unreachable

Scenario. ATRA SFTP or per-MNO CEIR endpoint unreachable for a daily run.

Detection:

numint_eir_recon_runs_total{outcome="failed"} increments.
NumIntEirSyncStale MEDIUM at 26 h.

Impact:

New stolen-IMEI flags not reflected; LookupEir returns last-known.
sms-firewall-service may not block traffic to newly-flagged devices.

Recovery:

Retry hourly until 23:00.
Manual resync via POST /v1/admin/numint/eir/runs.

Mitigation:

Multi-source aggregation (ATRA + per-MNO) means single-source failure does not lose all signal.

FM-13 — Cache poisoning during MNP transition

Scenario. A live HLR probe lands a stale answer in cache milliseconds before MNP recon writes the new state, leaving cache poisoned for the cache TTL.

Detection:

numint.mnp.divergence.v1 events spike for newly-ported MSISDNs in the hours after recon.

Impact:

For up to 24 h (MNP TTL) some lookups return wrong MNO.

Recovery:

MNP recon writes both invalidate Redis (DEL) and emit numint.attribution.changed.v1 to warm subscribers.
If poisoning is detected, manually triggering the numint-cache-flush cron job forces a full re-warm.

Mitigation:

Atomic UPSERT in PG with version conditional prevents the live-HLR write from clobbering a fresher MNP row.
MNP overlay step in UC-Lookup checks PortabilityRecord post-cache and overrides.
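
The version-conditional UPSERT can be demonstrated with SQLite standing in for Postgres, since both accept `ON CONFLICT ... DO UPDATE ... WHERE`. This is a sketch under stated assumptions: the `number_record` table, its columns, and the way versions are assigned are hypothetical, and SQLite 3.24 or newer is required for the UPSERT syntax.

```python
import sqlite3

UPSERT = """
INSERT INTO number_record (msisdn, mno_id, source, version)
VALUES (?, ?, ?, ?)
ON CONFLICT (msisdn) DO UPDATE SET
    mno_id  = excluded.mno_id,
    source  = excluded.source,
    version = excluded.version
WHERE excluded.version > number_record.version
"""


def open_db() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE number_record ("
        "msisdn TEXT PRIMARY KEY, mno_id TEXT NOT NULL, "
        "source TEXT NOT NULL, version INTEGER NOT NULL)"
    )
    return db


def upsert(db: sqlite3.Connection, msisdn: str, mno_id: str,
           source: str, version: int) -> None:
    # A write carrying an older or equal version is silently skipped.
    db.execute(UPSERT, (msisdn, mno_id, source, version))
    db.commit()


def current(db: sqlite3.Connection, msisdn: str):
    return db.execute(
        "SELECT mno_id, source, version FROM number_record WHERE msisdn = ?",
        (msisdn,),
    ).fetchone()
```

Because the comparison runs inside the single statement, a late live-HLR write that carries an older version cannot overwrite the MNP row, regardless of which writer commits first.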

FM-14 — NetworkPolicy mis-config exposes egress to offshore

Scenario. A NetworkPolicy or AuthorizationPolicy change accidentally permits egress to a non-Afghan IP (cloud LLM, third-party telemetry).

Detection:

Deploy-time residency test fails.
Runtime: numint_egress_offshore_total (synthetic check) > 0.

Impact:

Potential PII leak to non-Afghan jurisdiction → regulatory violation.

Recovery:

Rollback the NetworkPolicy; re-deploy.
Audit which calls (if any) touched the offshore endpoint; report to DPO if PII was in transit.

Mitigation:

Deploy-time residency test is mandatory CI gate.
Istio AuthorizationPolicy as second layer.
Egress NetworkPolicy explicitly denies non-10.0.0.0/8 egress on the hot-path Deployment.
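
A residency check of this kind reduces to a CIDR membership test. The sketch below is illustrative only: how the deploy-time test collects candidate destinations is not described here, the function name is hypothetical, and the single 10.0.0.0/8 allowlist is taken from the NetworkPolicy rule above.

```python
import ipaddress
from typing import Iterable, List

ALLOWED_EGRESS = ipaddress.ip_network("10.0.0.0/8")


def offshore_destinations(destinations: Iterable[str]) -> List[str]:
    """Return every destination IP that falls outside the allowed egress range."""
    return [
        dest for dest in destinations
        if ipaddress.ip_address(dest) not in ALLOWED_EGRESS
    ]
```

A CI gate built on this idea would fail the deploy whenever the returned list is non-empty.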

FM-15 — Outbox publish stuck

Scenario. NATS unreachable or specific subject misconfigured; outbox rows accumulate.

Detection:

numint_outbox_oldest_unpublished_age_seconds > 60 → NumIntOutboxStuck HIGH.

Impact:

Subscriber services (routing-engine, sms-firewall, billing) miss state-change events; their caches drift.

Recovery:

Restore NATS connectivity; outbox relay drains.
Manual replay via POST /v1/admin/numint/outbox/replay.

Mitigation:

Outbox relay is per-replica with SELECT … FOR UPDATE SKIP LOCKED so multiple workers pick up backlog.
Dead-letter subjects retain failed messages for SRE inspection.

FM-16 — `ni-hlr-gateway` adapter pod down

Same as FM-05; documented separately because the gateway is the single network hop that owns SIGTRAN sockets — a pod restart drops live MAP dialogs (orphaned invokes return TIMEOUT to the caller; replay logic in NI handles this).

4. Failure-mode interaction matrix

Concurrent failure	Combined impact
FM-01 + FM-02	FM-03 (prefix-table fallback)
FM-04 + FM-05 (MNP file + live HLR for same MNO)	Attribution stale for that MNO; routing-engine's prefix-table fallback handles dispatch
FM-07 + FM-11	Catastrophic — freeze all writes; declare incident; engage Security + Platform Arch
FM-09 + FM-08	Tenant abuse drives SS7 quota exhaustion → throttle both at the gateway and at the per-tenant fresh-lookup bucket
