numbering-service — Observability

Version: 1.0
Status: Draft
Owner: Commerce Engineering + Platform SRE
Last Updated: 2026-04-21
Companion: FAILURE_MODES · DEPLOYMENT_TOPOLOGY · SERVICE_READINESS

All metrics are exposed at GET /metrics on port 3021 in Prometheus text format. Logs are JSON (Pino). Traces are OTel (W3C Trace Context), exported to SigNoz / Tempo.


1. Prometheus Metrics

1.1 Hot-path (ValidateLease)

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| numbering_validate_lease_requests_total | Counter | valid, reason_code, region | Total ValidateLease calls |
| numbering_validate_lease_duration_seconds | Histogram | cache_hit (true/false), region | End-to-end gRPC latency |
| numbering_validate_lease_cache_hits_total | Counter | region | Redis cache hits |
| numbering_validate_lease_cache_misses_total | Counter | region | Redis cache misses → PG |
| numbering_validate_lease_unavailable_total | Counter | reason (pg_down, redis_and_pg_down) | Fail-closed triggers |
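
The cache-hit, cache-miss, and fail-closed counters suggest a lookup order of Redis first, then PG, then refusal. The sketch below is one possible reading of that flow, not the service's actual code: the function names, the synchronous signatures, and the rule that a Redis failure followed by a PG failure maps to redis_and_pg_down are all assumptions made for illustration.

```typescript
// Hypothetical types; the real service's lease shape is not given in this document.
type Lease = { valid: boolean };
type Outcome =
  | { status: "ok"; valid: boolean; cacheHit: boolean }
  | { status: "unavailable"; reason: "pg_down" | "redis_and_pg_down" };

// redisGet / pgSelect are injected stand-ins that throw when the backend is unreachable.
function validateLease(
  id: string,
  redisGet: (id: string) => Lease | null,
  pgSelect: (id: string) => Lease | null,
): Outcome {
  let redisUp = true;
  try {
    const cached = redisGet(id);
    // Cache hit: answer without touching PG.
    if (cached !== null) return { status: "ok", valid: cached.valid, cacheHit: true };
  } catch {
    // Redis unreachable: remember that and fall through to PG.
    redisUp = false;
  }
  try {
    const row = pgSelect(id);
    // Cache miss (or Redis down) answered from PG; a missing row is treated as invalid.
    return { status: "ok", valid: row !== null && row.valid, cacheHit: false };
  } catch {
    // No source of truth available: fail closed instead of guessing.
    return { status: "unavailable", reason: redisUp ? "pg_down" : "redis_and_pg_down" };
  }
}
```

Under these assumptions, each "unavailable" outcome would correspond to one increment of numbering_validate_lease_unavailable_total with the matching reason label.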

Histogram buckets for latency: [0.002, 0.005, 0.01, 0.02, 0.05, 0.1, 0.25, 0.5, 1.0].
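
To make the bucket layout concrete, the following sketch shows how a quantile can be estimated from cumulative bucket counts by linear interpolation, which is the approach histogram_quantile() takes for finite buckets. The helper names and the sample latencies are illustrative assumptions; only the bucket bounds come from this document.

```typescript
// Upper bounds in seconds, copied from the bucket list above (+Inf is implicit).
const BUCKETS = [0.002, 0.005, 0.01, 0.02, 0.05, 0.1, 0.25, 0.5, 1.0];

// Cumulative count per bucket: number of observations <= each upper bound.
function cumulativeCounts(latenciesSec: number[]): number[] {
  return BUCKETS.map((le) => latenciesSec.filter((v) => v <= le).length);
}

// Estimate quantile q by interpolating linearly inside the bucket that holds the target rank.
function estimateQuantile(q: number, cumulative: number[], total: number): number {
  const rank = q * total;
  for (let i = 0; i < BUCKETS.length; i++) {
    if (cumulative[i] >= rank) {
      const lower = i === 0 ? 0 : BUCKETS[i - 1];
      const below = i === 0 ? 0 : cumulative[i - 1];
      const inBucket = cumulative[i] - below;
      if (inBucket === 0) return BUCKETS[i];
      return lower + (BUCKETS[i] - lower) * ((rank - below) / inBucket);
    }
  }
  // Rank falls in the +Inf bucket: report the highest finite bound.
  return BUCKETS[BUCKETS.length - 1];
}

// Hypothetical sample: eight 4 ms calls, one 30 ms call, one 300 ms call.
const sample = [0.004, 0.004, 0.004, 0.004, 0.004, 0.004, 0.004, 0.004, 0.03, 0.3];
const p95 = estimateQuantile(0.95, cumulativeCounts(sample), sample.length);
console.log(`estimated P95: ${p95.toFixed(3)} s`);
```

Because the estimate is interpolated within a bucket, its resolution is limited by bucket width: the single 300 ms observation here lands in the (0.25, 0.5] bucket, so the reported P95 is a point inside that range, not the true sample value.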

1.2 Lifecycle operations

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| numbering_reserve_total | Counter | type, kind (RESERVE/HOLD), outcome | Reservation creations |
| numbering_reserve_duration_seconds | Histogram | type | Reservation create latency |
| numbering_assign_total | Counter | type, outcome, term | Lease creations |
| numbering_assign_duration_seconds | Histogram | type | Assign latency |
| numbering_release_total | Counter | reason | Explicit and TTL releases |
| numbering_recall_total | Counter | reason, type | Recall events |
| numbering_renewal_total | Counter | outcome (SUCCESS/BILLING_REJECTED/TENANT_SUSPENDED) | Auto-renewals |
| numbering_conflict_detected_total | Counter | kind (CAS_RACE/ORPHAN_LEASE/PREFIX_OVERLAP) | Conflict signals |
| numbering_invalid_transition_total | Counter | from, to | Attempted illegal transitions |

1.3 Inventory health

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| numbering_pool_total | Gauge | type, operator_id, state | Per-(type, operator, state) counts |
| numbering_pool_utilisation_pct | Gauge | type, operator_id, block_prefix | (leased + reserved + held) / total |
| numbering_pool_remaining | Gauge | type, operator_id, block_prefix | AVAILABLE count per block |
| numbering_short_code_scarcity_pct | Gauge | | Platform-wide: AVAILABLE / total short codes |
| numbering_vanity_available_total | Gauge | price_tier | Vanity short codes available |
| numbering_quarantine_queue_age_seconds | Gauge | | Age of the oldest item in QUARANTINE awaiting sweep |
| numbering_quarantine_backlog_total | Gauge | | Count of items past quarantineUntil not yet swept |
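
As a worked instance of the utilisation and scarcity ratios above, here is a small sketch. It assumes that the _pct gauges are scaled to 0–100 and that "total" is the sum over all states of a block; the PoolCounts shape and its field names are hypothetical.

```typescript
// Hypothetical per-block state counts; field names are illustrative only.
type PoolCounts = {
  available: number;
  reserved: number;
  held: number;
  leased: number;
  quarantine: number;
};

function totalOf(c: PoolCounts): number {
  return c.available + c.reserved + c.held + c.leased + c.quarantine;
}

// (leased + reserved + held) / total, scaled to percent (assumed scaling).
function poolUtilisationPct(c: PoolCounts): number {
  const total = totalOf(c);
  return total === 0 ? 0 : ((c.leased + c.reserved + c.held) * 100) / total;
}

// AVAILABLE / total, scaled to percent (assumed scaling).
function scarcityPct(c: PoolCounts): number {
  const total = totalOf(c);
  return total === 0 ? 0 : (c.available * 100) / total;
}

const block: PoolCounts = { available: 50, reserved: 10, held: 5, leased: 30, quarantine: 5 };
console.log(poolUtilisationPct(block), scarcityPct(block)); // prints 45 50
```

Note that quarantined numbers count toward the total but toward neither ratio's numerator, so under this reading utilisation and scarcity do not sum to 100 while a quarantine backlog exists.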

1.4 Reservations & TTL

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| numbering_reservations_active_total | Gauge | kind | Currently active reservations/holds |
| numbering_reservations_expired_total | Counter | kind | TTL-expiry transitions |
| numbering_reservation_cleanup_lag_seconds | Histogram | | Time between expiry and released_at being set |
| numbering_reservation_cleanup_tick_duration_seconds | Histogram | | Cleanup worker tick latency |

1.5 Imports & exports

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| numbering_lease_import_batches_total | Counter | operator_id, status | Completed imports |
| numbering_lease_import_rows_total | Counter | operator_id, outcome (inserted/duplicate/invalid) | Per-batch row accounting |
| numbering_lease_import_duration_seconds | Histogram | operator_id | Import batch duration |
| numbering_regulator_export_duration_seconds | Histogram | | Monthly export generation time |
| numbering_regulator_export_rows_total | Gauge | period | Row count in latest export |
| numbering_outbox_lag_seconds | Gauge | | Age of the oldest unpublished outbox row |

1.6 Integrity

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| numbering_audit_hash_chain_violations_total | Counter | | Daily reconciliation hash-chain breaks |
| numbering_reconciliation_divergences_total | Counter | kind | Cross-region / orphan-lease divergences |
| numbering_cross_region_lag_seconds | Gauge | region | Replication lag kbl ↔ mzr |

1.7 Caches

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| numbering_redis_cache_hit_ratio | Gauge | cache_key_prefix | Computed per-minute |
| numbering_quota_cache_misses_total | Counter | tenant_id_bucket (hashed to reduce cardinality) | |

2. Structured Logs

Logs are emitted as JSON by Pino; the level is configurable via LOG_LEVEL. All logs include traceId, spanId, and tenantId (when present).

2.1 Hot-path validation

{
  "level": "info",
  "time": "2026-04-21T10:15:00.123Z",
  "event": "numbering.validate_lease",
  "valid": true,
  "cacheHit": true,
  "identifierType": "MSISDN",
  "identifierHashed": "sha256:abc...",
  "tenantId": "a1b2-...",
  "latencyMs": 4,
  "traceId": "00-abc-..."
}

Identifier value is hashed in logs; full value appears only in numbering.audit and lifecycle events.
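
A minimal sketch of producing an identifierHashed value in the "sha256:<hex>" shape shown above. The source does not say whether the hash is salted or keyed, so the plain SHA-256 over the UTF-8 identifier used here is an assumption, and hashIdentifier is a hypothetical helper name.

```typescript
import { createHash } from "node:crypto";

// Assumption: unsalted SHA-256 of the raw identifier string, hex-encoded,
// prefixed with the algorithm name as in the sample log line.
function hashIdentifier(identifier: string): string {
  const digest = createHash("sha256").update(identifier, "utf8").digest("hex");
  return `sha256:${digest}`;
}

// The same input always yields the same token, so log lines for one
// identifier can still be correlated without exposing its value.
console.log(hashIdentifier("+93701234567"));
```

Since MSISDNs come from a small, enumerable space, an unsalted digest can be reversed by brute force; a keyed hash such as HMAC would be the usual hardening if that matters for the threat model.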

2.2 Lifecycle events

{
  "level": "info",
  "event": "numbering.assigned",
  "numberId": "num_...",
  "value": "+93701234567",
  "type": "MSISDN",
  "tenantId": "a1b2-...",
  "leaseId": "lease_...",
  "term": "P90D",
  "autoRenew": true,
  "actorUserId": "u1-...",
  "traceId": "00-abc-..."
}

2.3 Conflicts

{
  "level": "warn",
  "event": "numbering.conflict",
  "kind": "CAS_RACE",
  "identifier": "+93701234567",
  "type": "MSISDN",
  "winnerTenantId": "a1b2-...",
  "loserTenantId": "c3d4-...",
  "expectedVersion": 17,
  "actualVersion": 18
}
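
The expectedVersion / actualVersion pair in this log line is what a compare-and-swap on a version column would produce. The in-memory sketch below illustrates that mechanism only; the record shape, the casAssign name, and the Map-backed store are assumptions, and the real service performs the check in PG.

```typescript
// Hypothetical record shape: one row per identifier with an optimistic-lock version.
type NumberRecord = { value: string; tenantId: string | null; version: number };

type CasResult =
  | { ok: true; record: NumberRecord }
  | { ok: false; kind: "CAS_RACE"; expectedVersion: number; actualVersion: number };

// Assign succeeds only if the caller read the version that is still current.
function casAssign(
  store: Map<string, NumberRecord>,
  value: string,
  tenantId: string,
  expectedVersion: number,
): CasResult {
  const current = store.get(value);
  if (current === undefined) throw new Error(`unknown identifier ${value}`);
  if (current.version !== expectedVersion) {
    // Someone else wrote first: report the race instead of overwriting.
    return { ok: false, kind: "CAS_RACE", expectedVersion, actualVersion: current.version };
  }
  const next: NumberRecord = { value, tenantId, version: current.version + 1 };
  store.set(value, next);
  return { ok: true, record: next };
}

// Two tenants both read version 17; the first write wins and bumps it to 18.
const store = new Map<string, NumberRecord>([
  ["+93701234567", { value: "+93701234567", tenantId: null, version: 17 }],
]);
const winner = casAssign(store, "+93701234567", "a1b2", 17);
const loser = casAssign(store, "+93701234567", "c3d4", 17);
console.log(winner.ok, loser.ok); // prints true false
```

The loser's result carries expectedVersion 17 and actualVersion 18, matching the sample log entry.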

2.4 Errors

{
  "level": "error",
  "event": "numbering.error",
  "errorType": "pg_unavailable",
  "op": "Assign",
  "failClosed": true,
  "err": { "message": "ECONNREFUSED", "code": "ECONNREFUSED" }
}

2.5 Redaction rules

  • Pino redactor masks: identifier.value in trace-only log levels (debug), never in info+.
  • Full MSISDNs appear in info+ logs only for state-change events (operational need).
  • auth.token, authorization, and cookie headers are masked.
  • An ESLint rule forbids logger.info(..., { body }) patterns.
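
The header-masking rule can be illustrated with a small standalone function. This is not Pino's redaction API; it is a hand-rolled sketch that masks any field whose path ends in one of the listed names, and the mask string, the case-insensitive matching, and the suffix-matching rule are assumptions.

```typescript
const MASK = "[Redacted]";
// Paths taken from the rule list above; matching is by path suffix (an assumption).
const MASKED_PATHS = ["auth.token", "authorization", "cookie"];

function shouldMask(path: string[]): boolean {
  const lower = path.map((p) => p.toLowerCase());
  return MASKED_PATHS.some((masked) => {
    const parts = masked.split(".");
    if (parts.length > lower.length) return false;
    // Compare the tail of the current path against the masked path.
    return parts.every((part, i) => lower[lower.length - parts.length + i] === part);
  });
}

// Returns a deep copy with sensitive fields replaced by MASK; the input is left untouched.
function redact(value: unknown, path: string[] = []): unknown {
  if (Array.isArray(value)) return value.map((v) => redact(v, path));
  if (value !== null && typeof value === "object") {
    const out: Record<string, unknown> = {};
    for (const [key, child] of Object.entries(value as Record<string, unknown>)) {
      const childPath = [...path, key];
      out[key] = shouldMask(childPath) ? MASK : redact(child, childPath);
    }
    return out;
  }
  return value;
}

const entry = {
  event: "numbering.error",
  headers: { Authorization: "Bearer abc", cookie: "sid=1", accept: "application/json" },
  auth: { token: "secret", user: "u1" },
};
console.log(JSON.stringify(redact(entry)));
```

Copying instead of mutating keeps the original request object usable by the handler after the log call.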

3. OpenTelemetry Tracing

Parent span: numbering.<rpc> (e.g., numbering.ValidateLease, numbering.Assign).

| Span | Operation | Attributes |
| --- | --- | --- |
| numbering.redis.get | Redis GET | cache.hit, key.prefix |
| numbering.pg.select | PG read | rows, table |
| numbering.pg.cas | PG UPDATE with CAS | rows_affected, expected_version, actual_version |
| numbering.outbox.write | Outbox insert | subject, aggregate_id |
| numbering.grpc.external | External gRPC (sender-id-registry / billing) | target, method, status |
| numbering.cache.invalidate | NATS ephemeral publish | key_pattern |

Trace context is propagated from the grpc-trace-bin (gRPC) and traceparent (REST) headers.
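
For the REST side, the traceparent header follows the W3C Trace Context layout: version, trace ID, parent span ID, and flags, all lowercase hex and separated by hyphens. The parser below is a sketch that accepts only that four-field layout; parseTraceparent and TraceContext are hypothetical names, and in practice the OTel SDK's propagator does this work.

```typescript
type TraceContext = {
  version: string;
  traceId: string; // 32 hex chars
  parentSpanId: string; // 16 hex chars
  sampled: boolean; // lowest bit of the flags byte
};

const TRACEPARENT_RE = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(header: string): TraceContext | null {
  const m = TRACEPARENT_RE.exec(header.trim());
  if (m === null) return null;
  const [, version, traceId, parentSpanId, flags] = m;
  // Version ff is forbidden, and all-zero IDs are invalid per the spec.
  if (version === "ff") return null;
  if (/^0+$/.test(traceId) || /^0+$/.test(parentSpanId)) return null;
  return { version, traceId, parentSpanId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}

// Example header from the W3C Trace Context specification.
const ctx = parseTraceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01");
console.log(ctx?.traceId, ctx?.sampled); // prints 4bf92f3577b34da6a3ce929d0e0e4736 true
```

Returning null on malformed input lets the caller start a fresh trace, which is what the spec recommends when the header cannot be parsed.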


4. Alerts

| Alert | Condition | Severity | Runbook action |
| --- | --- | --- | --- |
| NumberingValidateLeaseP95High | histogram_quantile(0.95, rate(numbering_validate_lease_duration_seconds_bucket[5m])) > 0.05 | MEDIUM | Investigate PG pool, Redis health |
| NumberingValidateLeaseP99High | histogram_quantile(0.99, rate(numbering_validate_lease_duration_seconds_bucket[5m])) > 0.2 for 2m | HIGH | Page on-call |
| NumberingUnavailableRetries | rate(numbering_validate_lease_unavailable_total[5m]) > 1 | HIGH | sms-orchestrator is fail-closing; investigate immediately |
| NumberingShortCodeScarcityCritical | numbering_short_code_scarcity_pct < 10 | HIGH | Request additional allocation from ATRA (long lead time; plan ahead) |
| NumberingPoolExhaustionWarning | sum by (type, operator_id) (numbering_pool_remaining) / sum by (type, operator_id) (numbering_pool_total) < 0.05 for 5m | MEDIUM | Contact MNO for new block allocation |
| NumberingPoolExhaustionCritical | Same ratio as NumberingPoolExhaustionWarning < 0.01 for 2m | CRITICAL | Page commerce ops; tenants will be blocked imminently |
| NumberingLeaseImportFailed | numbering_lease_import_batches_total{status="FAILED"} > 0 | HIGH | Investigate CSV signature / format |
| NumberingConflictSpike | rate(numbering_conflict_detected_total[5m]) > 5 | MEDIUM | Likely client bug; check top losing tenants |
| NumberingQuarantineBacklog | numbering_quarantine_backlog_total > 100 for 10m | MEDIUM | Sweep cron may be stuck |
| NumberingAuditChainBroken | numbering_audit_hash_chain_violations_total > 0 | CRITICAL | SECURITY incident: halt regulator export; page security + on-call |
| NumberingOutboxLag | numbering_outbox_lag_seconds > 60 | HIGH | Outbox relay stuck; consumers missing events |
| NumberingRegulatorExportFailed | numbering_regulator_export_duration_seconds missing at month start + 24h | HIGH | Generate manually via admin endpoint |
| NumberingRenewalFailureSpike | rate(numbering_renewal_total{outcome="BILLING_REJECTED"}[1h]) > 10 | MEDIUM | Likely billing outage or fraud pattern |
| NumberingCrossRegionLag | numbering_cross_region_lag_seconds > 5 | HIGH | Multi-region replication falling behind; CAS races may surge |
| NumberingReservationCleanupLag | histogram_quantile(0.95, rate(numbering_reservation_cleanup_lag_seconds_bucket[5m])) > 5 | MEDIUM | Keyspace notifications dropped; rely on safety-net cron |

Alerts are delivered via PagerDuty (sev1 / sev2) and Slack (#num-ops) per platform escalation policy.
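
Several conditions carry a "for" duration, meaning the threshold must hold continuously for that long before the alert fires. The sketch below models that behaviour for the two pool-exhaustion tiers; the Sample type, the function names, and the 60-second sample spacing are assumptions, and real evaluation happens in the Prometheus rule engine.

```typescript
type Sample = { tsSec: number; value: number };

// True when `predicate` has held for every sample since some point at least
// `forSec` seconds before the latest sample (a rough model of pending -> firing).
function firesWithFor(samples: Sample[], predicate: (v: number) => boolean, forSec: number): boolean {
  const last = samples[samples.length - 1];
  if (last === undefined) return false;
  let pendingSince: number | null = null;
  for (const s of samples) {
    if (predicate(s.value)) {
      if (pendingSince === null) pendingSince = s.tsSec;
    } else {
      pendingSince = null; // any recovery resets the pending timer
    }
  }
  return pendingSince !== null && last.tsSec - pendingSince >= forSec;
}

// Thresholds and durations from the pool-exhaustion rows of the table above.
function poolExhaustionSeverity(ratioSamples: Sample[]): "CRITICAL" | "MEDIUM" | null {
  if (firesWithFor(ratioSamples, (v) => v < 0.01, 120)) return "CRITICAL";
  if (firesWithFor(ratioSamples, (v) => v < 0.05, 300)) return "MEDIUM";
  return null;
}

// Hypothetical remaining/total ratios sampled every 60 s.
const ratios: Sample[] = [0.04, 0.04, 0.04, 0.005, 0.005, 0.005].map((value, i) => ({
  tsSec: i * 60,
  value,
}));
console.log(poolExhaustionSeverity(ratios)); // prints CRITICAL
```

In this example the ratio has been below 0.01 for 120 seconds (samples at 180 s to 300 s), so the critical tier takes precedence even though the warning tier's condition also holds.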


5. Grafana Dashboard — dashboards/numbering-service.json

| Panel | Query | Visualisation |
| --- | --- | --- |
| ValidateLease rate | rate(numbering_validate_lease_requests_total[1m]) | Stacked time-series by valid |
| ValidateLease P50/P95/P99 | histogram_quantile(…) with three quantiles | Line chart |
| Reason-code distribution | sum by (reason_code) (rate(numbering_validate_lease_requests_total{valid="false"}[5m])) | Bar chart |
| Pool utilisation by operator | numbering_pool_utilisation_pct by operator_id | Heatmap |
| Short-code scarcity | numbering_short_code_scarcity_pct | Gauge + time-series |
| Lifecycle throughput | rate(numbering_reserve_total[1m]), numbering_assign_total, numbering_recall_total stacked | Stacked area |
| Active reservations | numbering_reservations_active_total by kind | Time series |
| Quarantine age | numbering_quarantine_queue_age_seconds | Gauge |
| MNO import rate | increase(numbering_lease_import_rows_total[1h]) by outcome | Bar |
| Outbox lag | numbering_outbox_lag_seconds | Stat + time series |
| Cross-region lag | numbering_cross_region_lag_seconds by region | Line |
| Hash-chain violations (should be flat at 0) | numbering_audit_hash_chain_violations_total | Stat with red threshold on any non-zero |

6. SLOs

Defined in slo/numbering-service.yaml:

| SLO | Target | Window |
| --- | --- | --- |
| ValidateLease P95 latency | ≤ 20 ms (cache hit) | Rolling 30 d |
| ValidateLease availability | ≥ 99.95 % | Rolling 30 d |
| Assign P95 latency | ≤ 200 ms | Rolling 30 d |
| Reserve P95 latency | ≤ 100 ms | Rolling 30 d |
| Reservation TTL precision | within ±2 s | Rolling 30 d |
| Quarantine sweep lag | ≤ 5 min past quarantineUntil | Rolling 7 d |
| Outbox publish lag | ≤ 5 s P95 | Rolling 7 d |
| Monthly regulator export generated on time | 100 % | Per-month |
| Cross-region replication lag | ≤ 2 s P95 | Rolling 7 d |

Error budget policy: if any SLO is breached in two consecutive measurement periods, the feature-freeze policy for numbering-service is triggered, per platform SRE policy.
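
To put the availability target in concrete terms, the sketch below converts a percentage target and window into a downtime budget and checks the two-consecutive-breaches rule. Both function names are hypothetical, and treating the budget as wall-clock minutes (as opposed to failed requests) is an assumption.

```typescript
// Downtime budget in minutes for an availability target over a rolling window.
function errorBudgetMinutes(targetPct: number, windowDays: number): number {
  return ((100 - targetPct) / 100) * windowDays * 24 * 60;
}

// True when an SLO is breached in two consecutive measurement periods.
function triggersFeatureFreeze(breachedByPeriod: boolean[]): boolean {
  for (let i = 1; i < breachedByPeriod.length; i++) {
    if (breachedByPeriod[i - 1] && breachedByPeriod[i]) return true;
  }
  return false;
}

// 99.95 % over 30 days leaves roughly 21.6 minutes of unavailability.
console.log(errorBudgetMinutes(99.95, 30).toFixed(1)); // prints 21.6
console.log(triggersFeatureFreeze([false, true, true])); // prints true
```

A single bad period followed by a good one does not trigger the freeze under this reading; only back-to-back breaches do.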


End of OBSERVABILITY.md