Number Intelligence Service — Observability
Version: 1.0; Status: Draft; Owner: Messaging Core / Platform SRE; Last Updated: 2026-04-21; Companion: FAILURE_MODES · SERVICE_READINESS · APPLICATION_LOGIC
1. SLIs / SLOs
Bound to the platform NFR catalog (EP-PLAT-NB-09) and SERVICE_OVERVIEW §10.
| SLI | SLO | Window | Burn-rate alert |
|---|---|---|---|
| ResolveMsisdn cache-hit P95 latency | ≤ 5 ms | 28 d | 14 × 1 h, 6 × 6 h |
| ResolveMsisdn end-to-end P99 latency (any path, excluding forced fresh) | ≤ 50 ms | 28 d | 14 × 1 h, 6 × 6 h |
| ResolveMsisdn availability (non-error, any-confidence) | ≥ 99.95 % | 28 d | 14 × 1 h, 6 × 6 h |
| Aggregate cache hit ratio (LRU+Redis / total) | ≥ 95 % sustained | 7 d | < 85 % for 30 min → MEDIUM |
| LRU hit ratio | ≥ 60 % sustained | 7 d | < 40 % for 30 min → LOW (tune) |
| LookupPorting P95 latency | ≤ 8 ms | 28 d | 14 × 1 h |
| LookupEir P95 latency | ≤ 8 ms | 28 d | 14 × 1 h |
| Public Lookup REST P95 latency | ≤ 200 ms | 28 d | 14 × 1 h |
| Public Lookup REST P99 latency | ≤ 500 ms | 28 d | 14 × 1 h |
| MNP reconciliation freshness (per MNO) | last successful run < 26 h | continuous | 26 h stale → HIGH; 48 h → CRITICAL |
| EIR reconciliation freshness | last successful run < 26 h | continuous | 26 h stale → HIGH |
| HLR probe success rate (per MNO) | ≥ 98 % | 7 d | < 95 % for 15 min → MEDIUM |
| MNP reconciliation conflict count per run | ≤ 10 per MNO per day (baseline) | 7 d | > 50 → HIGH (spike investigation) |
| Audit hash-chain integrity | Daily verifier returns OK | continuous | break → CRITICAL (page) |
| Outbox publish lag P95 | ≤ 5 s | 28 d | > 60 s for 5 min → HIGH |
| Tenant lookup quota breach rate | < 1 % of total RPS (steady-state) | 1 d | > 5 % → tenant-abuse investigation |
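
The burn-rate column can be read as "burn-rate multiple × evaluation window". The sketch below works through that arithmetic for the 99.95 % availability SLO. It assumes that "14 × 1 h" means a burn rate of 14 measured over 1 hour and "6 × 6 h" a burn rate of 6 over 6 hours; the function names are illustrative and not part of the service.

```python
# Assumption: "14 x 1 h, 6 x 6 h" = (burn-rate multiple, window) pairs.
THRESHOLDS = ((1.0, 14.0), (6.0, 6.0))  # (window in hours, burn-rate limit)


def error_budget(slo: float) -> float:
    """Fraction of requests allowed to fail, e.g. 0.0005 for a 99.95 % SLO."""
    return 1.0 - slo


def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / error_budget(slo)


def budget_consumed(rate: float, window_hours: float, slo_window_days: int = 28) -> float:
    """Share of the whole SLO-window budget spent if `rate` holds for `window_hours`."""
    return rate * window_hours / (slo_window_days * 24)


def firing_windows(error_ratios: dict[float, float], slo: float) -> list[float]:
    """Return the windows (in hours) whose measured burn rate reaches its limit."""
    return [w for w, limit in THRESHOLDS if burn_rate(error_ratios[w], slo) >= limit]


if __name__ == "__main__":
    # A burn rate of 14 held for 1 h spends about 2.1 % of a 28-day budget.
    print(round(budget_consumed(14, 1) * 100, 2), "% of budget")
    # 1 % errors over the last hour, 0.1 % over the last six hours.
    print(firing_windows({1.0: 0.01, 6.0: 0.001}, slo=0.9995))
```

Under these assumptions, a burn rate of 6 held for 6 hours spends roughly 5.4 % of the 28-day budget, which is why the longer window tolerates a lower multiple.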
2. Prometheus metrics
All metrics exposed at GET /metrics on port 3073. Service label service="number-intelligence-service".
2.1 ResolveMsisdn (hot path)
| Metric | Type | Labels | Description |
|---|---|---|---|
| numint_lookup_total | Counter | tier (lru/redis/pg/live/fallback), source, confidence | Lookups by outcome |
| numint_lookup_duration_seconds | Histogram | tier, result | End-to-end latency |
| numint_lookup_concurrency_in_flight | Gauge | — | In-flight gRPC calls |
| numint_lookup_invalid_e164_total | Counter | — | INVALID_ARGUMENT responses |
Histogram buckets for numint_lookup_duration_seconds (upper bounds, in seconds):
[0.0005, 0.001, 0.002, 0.003, 0.005, 0.008, 0.012, 0.020, 0.050, 0.100, 0.250, 0.500, 1.000, 2.000]
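
The bucket list includes 0.005, 0.008 and 0.050, which line up with the 5 ms, 8 ms and 50 ms SLO thresholds in section 1. The sketch below shows how a quantile is estimated from cumulative bucket counts by linear interpolation, in the style of PromQL's histogram_quantile. It is a simplified re-implementation for illustration, not the Prometheus source, and the sample counts are invented.

```python
import math

LOOKUP_BUCKETS_S = [0.0005, 0.001, 0.002, 0.003, 0.005, 0.008, 0.012, 0.020,
                    0.050, 0.100, 0.250, 0.500, 1.000, 2.000]


def histogram_quantile(q: float, upper_bounds: list[float],
                       cumulative_counts: list[int]) -> float:
    """Estimate the q-quantile (0 <= q <= 1) from cumulative bucket counts.

    `cumulative_counts` has one entry per upper bound plus a final +Inf entry.
    """
    total = cumulative_counts[-1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the target rank.
    idx = next(i for i, c in enumerate(cumulative_counts) if c >= rank)
    if idx == len(upper_bounds):
        # Quantile falls in the +Inf bucket: report the highest finite bound.
        return upper_bounds[-1]
    lower = upper_bounds[idx - 1] if idx > 0 else 0.0
    below = cumulative_counts[idx - 1] if idx > 0 else 0
    in_bucket = cumulative_counts[idx] - below
    if in_bucket == 0:
        return upper_bounds[idx]
    # Linear interpolation inside the bucket.
    return lower + (upper_bounds[idx] - lower) * (rank - below) / in_bucket


if __name__ == "__main__":
    # Invented sample: 100 lookups, all finished within 8 ms.
    counts = [10, 30, 60, 90, 96, 100] + [100] * 9
    p95 = histogram_quantile(0.95, LOOKUP_BUCKETS_S, counts)
    print(f"P95 ~ {p95 * 1000:.2f} ms")
```

Because the estimate interpolates within a bucket, it is only exact at bucket boundaries; placing boundaries on the SLO thresholds keeps the "above or below target" decision free of interpolation error.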
2.2 Cache tiers
| Metric | Type | Labels | Description |
|---|---|---|---|
| numint_cache_hits_total | Counter | tier (lru/redis/pg) | Cache hits by tier |
| numint_cache_misses_total | Counter | tier | Cache misses |
| numint_cache_hit_ratio | Gauge | tier | Rolling 5-min ratio (derived) |
| numint_lru_size | Gauge | — | Current LRU entry count per pod |
| numint_lru_evictions_total | Counter | — | LRU evictions |
| numint_redis_op_duration_seconds | Histogram | op (get/set/del/eval) | |
| numint_cache_warm_duration_seconds | Histogram | kind (cold/hourly) | |
| numint_cache_warm_keys_loaded | Counter | kind | |
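
Section 1 defines the aggregate cache hit ratio as (LRU + Redis) / total. The sketch below computes that ratio and the LRU-only ratio from per-tier lookup counts, then applies the 85 % and 40 % alert floors. It assumes the counts come from numint_lookup_total grouped by its tier label; the function names are illustrative, and the 30-minute "sustained" condition is deliberately left out.

```python
import math
from typing import Optional


def aggregate_hit_ratio(lookups_by_tier: dict[str, float]) -> float:
    """(lru + redis) / all lookups; NaN when there is no traffic."""
    total = sum(lookups_by_tier.values())
    if total == 0:
        return math.nan
    cached = lookups_by_tier.get("lru", 0) + lookups_by_tier.get("redis", 0)
    return cached / total


def lru_hit_ratio(lookups_by_tier: dict[str, float]) -> float:
    """lru / all lookups; NaN when there is no traffic."""
    total = sum(lookups_by_tier.values())
    return lookups_by_tier.get("lru", 0) / total if total else math.nan


def cache_alert(lookups_by_tier: dict[str, float]) -> Optional[str]:
    """Instantaneous check only; the real alerts also require 30 min of breach."""
    if aggregate_hit_ratio(lookups_by_tier) < 0.85:
        return "MEDIUM"
    if lru_hit_ratio(lookups_by_tier) < 0.40:
        return "LOW"
    return None


if __name__ == "__main__":
    sample = {"lru": 700, "redis": 260, "pg": 30, "live": 8, "fallback": 2}
    print(aggregate_hit_ratio(sample), lru_hit_ratio(sample), cache_alert(sample))
```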
2.3 MNP reconciliation
| Metric | Type | Labels | Description |
|---|---|---|---|
| numint_mnp_recon_runs_total | Counter | mno_id, outcome (success/failed) | Reconciliation runs |
| numint_mnp_recon_duration_seconds | Histogram | mno_id | Per-run duration |
| numint_mnp_recon_records_total | Counter | mno_id, result (accepted/rejected) | |
| numint_mnp_recon_conflicts_total | Counter | mno_id, severity | |
| numint_mnp_recon_last_success_timestamp_seconds | Gauge | mno_id | Unix epoch — drives (time() - this) > 26h alert |
| numint_mnp_registry_size | Gauge | mno_id | Current PortabilityRecord rows per recipient MNO |
| numint_mnp_registry_growth_rate | Gauge | mno_id | Rolling 7-day daily port delta |
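
The freshness check built on numint_mnp_recon_last_success_timestamp_seconds is a subtraction against the current time. The sketch below mirrors the 26 h (HIGH) and 48 h (CRITICAL) thresholds from section 1. The function name is illustrative, and the sketch ignores the hold-down period an alerting rule would normally apply.

```python
import time
from typing import Optional

HIGH_AFTER_S = 26 * 3600       # 93600 s
CRITICAL_AFTER_S = 48 * 3600   # 172800 s


def recon_staleness_severity(last_success_epoch_s: float,
                             now_epoch_s: Optional[float] = None) -> Optional[str]:
    """Classify how stale the last successful reconciliation is."""
    now = time.time() if now_epoch_s is None else now_epoch_s
    age_s = now - last_success_epoch_s
    if age_s > CRITICAL_AFTER_S:
        return "CRITICAL"
    if age_s > HIGH_AFTER_S:
        return "HIGH"
    return None


if __name__ == "__main__":
    now = 1_000_000.0
    for hours in (25, 27, 49):
        print(hours, "h ->", recon_staleness_severity(now - hours * 3600, now))
```

A 26 h limit on a daily job leaves two hours of slack, so a run that starts late or takes longer than usual does not page immediately.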
2.4 HLR probes
| Metric | Type | Labels | Description |
|---|---|---|---|
| numint_hlr_probes_total | Counter | mno_id, transport, status | Per-probe outcome |
| numint_hlr_probe_duration_seconds | Histogram | mno_id, transport | |
| numint_hlr_tps_admitted_total | Counter | mno_id | Token bucket admits |
| numint_hlr_tps_denied_total | Counter | mno_id | Token bucket denials |
| numint_hlr_adapter_health | Gauge (0/1) | mno_id, transport | DaemonSet adapter reachability |
| numint_hlr_map_dialog_active | Gauge | mno_id | Open TCAP dialogs on SIGTRAN |
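
The admitted/denied counters above describe a token-bucket limiter in front of each MNO. The sketch below is a minimal generic token bucket that keeps the two counts those metrics would export. The rate, capacity and class name are assumptions for illustration; the service's real limiter parameters are not stated in this document.

```python
class TokenBucket:
    """Generic token bucket; `admitted` / `denied` mirror the two counters."""

    def __init__(self, rate_per_s: float, capacity: float, now_s: float = 0.0):
        self.rate_per_s = rate_per_s
        self.capacity = capacity
        self.tokens = capacity      # start full
        self.last_s = now_s
        self.admitted = 0
        self.denied = 0

    def try_admit(self, now_s: float) -> bool:
        """Refill for the elapsed time, then spend one token if available."""
        elapsed = max(0.0, now_s - self.last_s)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate_per_s)
        self.last_s = now_s
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            self.admitted += 1
            return True
        self.denied += 1
        return False


if __name__ == "__main__":
    bucket = TokenBucket(rate_per_s=2.0, capacity=2.0)
    print([bucket.try_admit(t) for t in (0.0, 0.0, 0.0, 0.5)])
    print("admitted", bucket.admitted, "denied", bucket.denied)
```

A rising denied-to-admitted ratio for one mno_id suggests probe demand exceeds the configured rate for that operator.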
2.5 EIR
| Metric | Type | Labels | Description |
|---|---|---|---|
| numint_eir_lookups_total | Counter | outcome (blacklist/greylist/whitelist/unknown) | |
| numint_eir_recon_runs_total | Counter | source (atra/mno_id), outcome | |
| numint_eir_recon_last_success_timestamp_seconds | Gauge | source | |
| numint_eir_blacklist_size | Gauge | — | Current BLACKLIST IMEI count |
2.6 Public Lookup API (tenant-facing)
| Metric | Type | Labels | Description |
|---|---|---|---|
| numint_public_lookup_total | Counter | tenant_id, result_class, sku | |
| numint_public_lookup_duration_seconds | Histogram | tenant_id, tier | |
| numint_public_lookup_quota_breach_total | Counter | tenant_id, scope (rps/month/fresh_rps) | |
| numint_public_lookup_forced_fresh_total | Counter | tenant_id | Calls with maxStaleness < 86400 |
| numint_tenant_monthly_quota_used | Gauge | tenant_id | Current month usage |
2.7 Audit + outbox
| Metric | Type | Labels | Description |
|---|---|---|---|
| numint_lookup_audit_rows_total | Counter | tenant_id | |
| numint_audit_chain_verifier_runs_total | Counter | chain_kind, outcome (ok/broken) | |
| numint_audit_chain_verifier_last_ok_timestamp_seconds | Gauge | chain_kind | |
| numint_audit_chain_breaks_detected_total | Counter | chain_kind, partition | MUST equal 0 |
| numint_outbox_unpublished_count | Gauge | — | |
| numint_outbox_oldest_unpublished_age_seconds | Gauge | — | |
| numint_outbox_publish_total | Counter | subject, outcome | |
| numint_outbox_publish_duration_seconds | Histogram | subject | |
2.8 Dependencies / health
| Metric | Type | Labels | Description |
|---|---|---|---|
| numint_pg_pool_in_use | Gauge | pool (primary/replica) | PgBouncer pool |
| numint_pg_query_duration_seconds | Histogram | statement | |
| numint_vault_op_duration_seconds | Histogram | op | |
| numint_vault_errors_total | Counter | op | |
3. Alert catalogue
groups:
  - name: numint.slo
    rules:
      - alert: NumIntLookupLatencyHigh
        expr: histogram_quantile(0.95, sum by (le) (rate(numint_lookup_duration_seconds_bucket[5m]))) > 0.015
        for: 5m
        labels: { severity: HIGH, service: number-intelligence-service }
        annotations:
          summary: "ResolveMsisdn P95 > 15 ms"
          runbook: "https://runbooks.ghasi.gov.af/numint/lookup-latency.md"
      - alert: NumIntCacheHitRateLow
        expr: sum(rate(numint_cache_hits_total[10m])) / (sum(rate(numint_cache_hits_total[10m])) + sum(rate(numint_cache_misses_total[10m]))) < 0.85
        for: 30m
        labels: { severity: MEDIUM }
      - alert: NumIntMnpReconciliationStale
        expr: (time() - max by (mno_id) (numint_mnp_recon_last_success_timestamp_seconds)) > 93600 # 26h
        for: 10m
        labels: { severity: HIGH }
      - alert: NumIntMnpReconciliationCritical
        expr: (time() - max by (mno_id) (numint_mnp_recon_last_success_timestamp_seconds)) > 172800 # 48h
        for: 10m
        labels: { severity: CRITICAL }
      - alert: NumIntReconciliationConflictSpike
        expr: sum by (mno_id) (increase(numint_mnp_recon_conflicts_total[1h])) > 50
        for: 10m
        labels: { severity: HIGH }
      - alert: NumIntHlrProbeFailureHigh
        expr: sum by (mno_id) (rate(numint_hlr_probes_total{status!="OK"}[10m])) / sum by (mno_id) (rate(numint_hlr_probes_total[10m])) > 0.05
        for: 15m
        labels: { severity: MEDIUM }
      - alert: NumIntHlrAdapterDown
        expr: numint_hlr_adapter_health == 0
        for: 3m
        labels: { severity: HIGH }
      - alert: NumIntAuditChainBroken
        expr: increase(numint_audit_chain_breaks_detected_total[1d]) > 0
        for: 0m
        labels: { severity: CRITICAL }
        annotations: { runbook: "https://runbooks.ghasi.gov.af/numint/audit-chain-broken.md" }
      - alert: NumIntOutboxStuck
        expr: numint_outbox_oldest_unpublished_age_seconds > 60
        for: 5m
        labels: { severity: HIGH }
      - alert: NumIntPublicLookupQuotaAbuse
        expr: sum by (tenant_id) (rate(numint_public_lookup_quota_breach_total[15m])) > 10
        for: 15m
        labels: { severity: MEDIUM }
        annotations: { summary: "Tenant {{ $labels.tenant_id }} sustained quota breach — possible enumeration" }
      - alert: NumIntEventsDlqGrowing
        expr: sum by (subject) (nats_jetstream_stream_messages{subject=~".*\\.deadletter"}) > 100
        for: 10m
        labels: { severity: HIGH }
4. Structured logs
All logs are valid JSON (Pino), redacted per SECURITY_MODEL §3.3. LOG_LEVEL env var controls verbosity.
{ "level": "info", "time": "2026-04-21T11:00:00.000Z",
"event": "numint.lookup", "tenantId": null, "msisdnHash": "8d4f…", "scope": "BASIC",
"tier": "redis", "source": "redis", "confidence": "high", "latencyMs": 2,
"traceId": "00-abc-def-01" }
{ "level": "info", "event": "numint.mnp.recon.completed", "mnoId": "mtn-afghanistan",
"runId": "rcn_01HZX7…", "fileSha256": "bd…", "accepted": 47298, "rejected": 24,
"conflictsCount": 5, "durationMs": 287340 }
{ "level": "warn", "event": "numint.mnp.divergence", "msisdnMasked": "+93701***",
"mnpMno": "mtn-afghanistan", "hlrMno": "afghan-wireless",
"portDate": "2026-04-10", "severity": "HIGH", "traceId": "…" }
{ "level": "error", "event": "numint.audit.chain_broken", "verifierRunId": "cvr_…",
"chainKind": "LOOKUP_AUDIT", "partition": "lookup_audit_2026_04",
"firstBadSeq": 4837521, "auditId": "nia_…",
"expectedPrevHash": "ab12…", "actualPrevHash": "cd34…" }
{ "level": "info", "event": "numint.hlr_probe", "mnoId": "afghan-wireless",
"transport": "MAP_SRI_SM", "status": "OK", "durationMs": 284,
"msisdnHash": "8d4f…", "vlrChanged": true, "traceId": "…" }
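
The log samples carry msisdnHash and msisdnMasked rather than the raw number. The sketch below shows one way such fields could be derived before a line is emitted. The actual redaction scheme is defined in SECURITY_MODEL §3.3 and is not reproduced here, so the keyed HMAC-SHA-256 digest, the six-character visible prefix (chosen to match the "+93701***" sample) and the key value are all assumptions.

```python
import hashlib
import hmac
import json


def mask_msisdn(msisdn: str, visible: int = 6) -> str:
    """Keep a short prefix and replace the rest, e.g. '+93701***'."""
    return msisdn[:visible] + "***"


def hash_msisdn(msisdn: str, key: bytes) -> str:
    """Keyed digest so the value can be correlated but not trivially reversed."""
    return hmac.new(key, msisdn.encode("utf-8"), hashlib.sha256).hexdigest()


def lookup_log_line(msisdn: str, tier: str, latency_ms: int, key: bytes) -> str:
    """Build a numint.lookup log line that never contains the raw MSISDN."""
    return json.dumps({
        "level": "info",
        "event": "numint.lookup",
        "msisdnHash": hash_msisdn(msisdn, key),
        "tier": tier,
        "latencyMs": latency_ms,
    })


if __name__ == "__main__":
    demo_key = b"demo-key-not-for-production"  # hypothetical key
    print(mask_msisdn("+93701111111"))
    print(lookup_log_line("+93701111111", "redis", 2, demo_key))
```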
5. Distributed tracing
- OpenTelemetry SDK exporting OTLP to the platform collector.
- Parent span: grpc.server.ResolveMsisdn or http.server GET /v1/lookup/{msisdn}.
- Child spans: cache.lru.get, cache.redis.get, db.pg.select, hlr.gateway.LiveLookup, db.pg.upsert, outbox.insert.
- Propagation: W3C traceparent in gRPC metadata and HTTP headers; received trace IDs flow through into logs (traceId field).
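
A W3C traceparent value has four dash-separated fields: version, trace-id (32 hex characters), parent-id (16 hex characters) and trace-flags. The sketch below validates and splits a version-00 header so the trace ID can be attached to log lines. In practice the OpenTelemetry SDK handles this; the parser and its names are illustrative only, and the sample header is the one used in the W3C Trace Context specification.

```python
import re
from typing import NamedTuple, Optional

# Strict four-field form; headers with extra fields (future versions) are rejected.
_TRACEPARENT = re.compile(
    r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})"
)


class TraceContext(NamedTuple):
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Return the parsed context, or None if the header is malformed."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the specification.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return TraceContext(trace_id, parent_id, bool(int(flags, 16) & 0x01))


if __name__ == "__main__":
    ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
    print(ctx)
```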
6. Grafana dashboards
6.1 numint-hot-path.json
- Row 1 — Lookup throughput & latency: RPS by tier, P50/P95/P99 latency, LRU/Redis/PG hit ratios.
- Row 2 — Hot-path health: in-flight concurrency, pod CPU/memory, replica count, HPA signal.
- Row 3 — Error rates: INVALID_ARGUMENT, UNAVAILABLE, INTERNAL by tier.
6.2 numint-mnp-eir.json
- Row 1 — MNP reconciliation: per-MNO last success timestamp, accepted vs rejected, conflict count.
- Row 2 — MNP divergence: numint.mnp.divergence.v1 event rate by severity.
- Row 3 — EIR: daily IMEI flags by reporter; BLACKLIST-size trend.
6.3 numint-adapter.json
- Per-MNO adapter health (MAP dialog count, REST 5xx rate, token bucket admit/deny, latency).
- PCAP sample age / size (ops visibility into MinIO archive).
6.4 numint-public-api.json
- Tenant-facing view: per-tenant RPS, latency, quota usage, 429 rate, billing SKU distribution.
6.5 numint-audit.json
- Audit chain verifier run history; outbox lag; chain-break counter (must stay at 0).
7. Synthetic probes
- Every 30 s (numint-synth-probe cron in kbl and mzr): synthetic ResolveMsisdn(+93701111111) to a fixed test MSISDN; latency + confidence recorded as numint_synth_probe_duration_seconds / numint_synth_probe_confidence.
- Hourly: synthetic MNP reconciliation fixture against a mock MNO SFTP; verifies end-to-end pipeline.
- Daily: audit-chain tamper-detect drill (deliberately inject a canary tamper row; verifier MUST flag it; operator MUST acknowledge within 1 h).
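
The tamper-detect drill relies on each audit row committing to the hash of the row before it. The sketch below builds a small chain, alters one row, and shows a verifier reporting the first sequence number whose stored previous hash no longer matches. The row layout, the SHA-256 construction and all names are assumptions for illustration; the service's real chain format is not described in this document.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional

GENESIS = "0" * 64  # assumed previous-hash value for the first row


@dataclass
class AuditRow:
    seq: int
    payload: str
    prev_hash: str


def row_hash(row: AuditRow) -> str:
    """Hash over the row's own content plus the hash it chains from."""
    data = f"{row.prev_hash}|{row.seq}|{row.payload}".encode("utf-8")
    return hashlib.sha256(data).hexdigest()


def build_chain(payloads: list[str]) -> list[AuditRow]:
    rows, prev = [], GENESIS
    for seq, payload in enumerate(payloads, start=1):
        row = AuditRow(seq, payload, prev)
        rows.append(row)
        prev = row_hash(row)
    return rows


def first_bad_seq(rows: list[AuditRow]) -> Optional[int]:
    """Return the seq of the first row whose prev_hash is wrong, else None."""
    expected = GENESIS
    for row in rows:
        if row.prev_hash != expected:
            return row.seq
        expected = row_hash(row)
    return None


if __name__ == "__main__":
    chain = build_chain(["lookup-a", "lookup-b", "lookup-c", "lookup-d"])
    print(first_bad_seq(chain))          # intact chain
    chain[1].payload = "canary-tamper"   # alter row seq=2
    print(first_bad_seq(chain))          # break surfaces at the following row
```

In this construction a modified row is detected at its successor, so tampering with the newest row stays invisible until the next row is appended or the head hash is anchored elsewhere.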