numbering-service — Observability
Version: 1.0 Status: Draft Owner: Commerce Engineering + Platform SRE Last Updated: 2026-04-21 Companion: FAILURE_MODES · DEPLOYMENT_TOPOLOGY · SERVICE_READINESS
All metrics are exposed at GET /metrics on port 3021 in Prometheus text format. Logs are JSON (Pino). Traces use OpenTelemetry (W3C Trace Context) and are exported to SigNoz / Tempo.
1. Prometheus Metrics
1.1 Hot-path (ValidateLease)
| Metric | Type | Labels | Description |
|---|---|---|---|
| numbering_validate_lease_requests_total | Counter | valid, reason_code, region | Total ValidateLease calls |
| numbering_validate_lease_duration_seconds | Histogram | cache_hit (true/false), region | End-to-end gRPC latency |
| numbering_validate_lease_cache_hits_total | Counter | region | Redis cache hits |
| numbering_validate_lease_cache_misses_total | Counter | region | Redis cache misses → PG |
| numbering_validate_lease_unavailable_total | Counter | reason (pg_down, redis_and_pg_down) | Fail-closed triggers |
Latency histogram bucket upper bounds, in seconds: [0.002, 0.005, 0.01, 0.02, 0.05, 0.1, 0.25, 0.5, 1.0].
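To illustrate how a P95 figure can be derived from these buckets, here is a minimal sketch of the linear interpolation that Prometheus applies to classic histograms. The function name and the sample counts are hypothetical; only the bucket bounds come from this document.

```python
import math

# Upper bounds in seconds, copied from the bucket list above.
# The final +Inf bucket is implicit, as in Prometheus classic histograms.
BUCKET_BOUNDS = [0.002, 0.005, 0.01, 0.02, 0.05, 0.1, 0.25, 0.5, 1.0]


def estimate_quantile(q, cumulative_counts, bounds=BUCKET_BOUNDS):
    """Estimate quantile q from cumulative bucket counts.

    cumulative_counts holds one entry per finite bound plus a last
    entry for the +Inf bucket (the total number of observations).
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    if len(cumulative_counts) != len(bounds) + 1:
        raise ValueError("expected one count per bound plus +Inf")
    total = cumulative_counts[-1]
    if total == 0:
        return math.nan
    rank = q * total
    # Index of the first bucket whose cumulative count reaches the rank.
    idx = next(i for i, c in enumerate(cumulative_counts) if c >= rank)
    if idx == len(bounds):
        # The rank falls in +Inf: report the highest finite bound.
        return bounds[-1]
    lower = bounds[idx - 1] if idx > 0 else 0.0
    below = cumulative_counts[idx - 1] if idx > 0 else 0
    in_bucket = cumulative_counts[idx] - below
    # Assume observations are spread evenly inside the bucket.
    return lower + (bounds[idx] - lower) * (rank - below) / in_bucket


# Hypothetical sample: 1000 calls, most of them under 20 ms.
sample = [100, 400, 700, 900, 980, 995, 999, 1000, 1000, 1000]
p95 = estimate_quantile(0.95, sample)  # lands in the (0.02, 0.05] bucket
```

Because the estimate is interpolated inside a bucket, its resolution is limited by bucket width: any P95 between 20 ms and 50 ms is resolved only within that single bucket.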
1.2 Lifecycle operations
| Metric | Type | Labels | Description |
|---|---|---|---|
| numbering_reserve_total | Counter | type, kind (RESERVE/HOLD), outcome | Reservation creations |
| numbering_reserve_duration_seconds | Histogram | type | Reservation create latency |
| numbering_assign_total | Counter | type, outcome, term | Lease creations |
| numbering_assign_duration_seconds | Histogram | type | Assign latency |
| numbering_release_total | Counter | reason | Explicit and TTL releases |
| numbering_recall_total | Counter | reason, type | Recall events |
| numbering_renewal_total | Counter | outcome (SUCCESS/BILLING_REJECTED/TENANT_SUSPENDED) | Auto-renewals |
| numbering_conflict_detected_total | Counter | kind (CAS_RACE/ORPHAN_LEASE/PREFIX_OVERLAP) | Conflict signals |
| numbering_invalid_transition_total | Counter | from, to | Attempted illegal transitions |
1.3 Inventory health
| Metric | Type | Labels | Description |
|---|---|---|---|
| numbering_pool_total | Gauge | type, operator_id, state | Per-(type, operator, state) counts |
| numbering_pool_utilisation_pct | Gauge | type, operator_id, block_prefix | (leased + reserved + held) / total |
| numbering_pool_remaining | Gauge | type, operator_id, block_prefix | AVAILABLE count per block |
| numbering_short_code_scarcity_pct | Gauge | — | Platform-wide: AVAILABLE / total short codes |
| numbering_vanity_available_total | Gauge | price_tier | Vanity short codes available |
| numbering_quarantine_queue_age_seconds | Gauge | — | Oldest item in QUARANTINE awaiting sweep |
| numbering_quarantine_backlog_total | Gauge | — | Count past quarantineUntil not yet swept |
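The two ratio gauges above can be sketched as plain arithmetic over per-state counts. This is an illustration only: the state names LEASED, RESERVED and HELD are assumptions inferred from the formula (AVAILABLE and QUARANTINE appear in this document), and the 0 to 100 scale is assumed from the _pct suffix.

```python
def utilisation_pct(state_counts):
    """(leased + reserved + held) / total, on an assumed 0-100 scale."""
    total = sum(state_counts.values())
    if total == 0:
        return 0.0
    # State names below are assumed, not confirmed by the service.
    in_use = sum(state_counts.get(s, 0) for s in ("LEASED", "RESERVED", "HELD"))
    return 100.0 * in_use / total


def scarcity_pct(state_counts):
    """AVAILABLE / total, on an assumed 0-100 scale."""
    total = sum(state_counts.values())
    if total == 0:
        return 0.0
    return 100.0 * state_counts.get("AVAILABLE", 0) / total


# Hypothetical block of 100 numbers.
block = {"AVAILABLE": 40, "LEASED": 45, "RESERVED": 5, "HELD": 5, "QUARANTINE": 5}
used = utilisation_pct(block)
free = scarcity_pct(block)
```

Numbers in QUARANTINE count toward the total but toward neither ratio, so the two values do not necessarily sum to 100.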
1.4 Reservations & TTL
| Metric | Type | Labels | Description |
|---|---|---|---|
| numbering_reservations_active_total | Gauge | kind | Currently active reservations/holds |
| numbering_reservations_expired_total | Counter | kind | TTL-expiry transitions |
| numbering_reservation_cleanup_lag_seconds | Histogram | — | Time between expiry and released_at set |
| numbering_reservation_cleanup_tick_duration_seconds | Histogram | — | Cleanup worker tick latency |
1.5 Imports & exports
| Metric | Type | Labels | Description |
|---|---|---|---|
| numbering_lease_import_batches_total | Counter | operator_id, status | Completed imports |
| numbering_lease_import_rows_total | Counter | operator_id, outcome (inserted/duplicate/invalid) | Per-batch row accounting |
| numbering_lease_import_duration_seconds | Histogram | operator_id | Import batch duration |
| numbering_regulator_export_duration_seconds | Histogram | — | Monthly export generation time |
| numbering_regulator_export_rows_total | Gauge | period | Row count in latest export |
| numbering_outbox_lag_seconds | Gauge | — | Oldest unpublished outbox row age |
1.6 Integrity
| Metric | Type | Labels | Description |
|---|---|---|---|
| numbering_audit_hash_chain_violations_total | Counter | — | Daily reconciliation hash-chain breaks |
| numbering_reconciliation_divergences_total | Counter | kind | Cross-region / orphan-lease divergences |
| numbering_cross_region_lag_seconds | Gauge | region | Replication lag kbl ↔ mzr |
1.7 Caches
| Metric | Type | Labels | Description |
|---|---|---|---|
| numbering_redis_cache_hit_ratio | Gauge | cache_key_prefix | Computed per-minute |
| numbering_quota_cache_misses_total | Counter | tenant_id_bucket (hashed to reduce cardinality) | Quota cache misses |
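One way a tenant_id_bucket label could be derived is to hash the tenant ID into a small, fixed set of buckets so the label cardinality stays bounded regardless of tenant count. The sketch below is an assumption about the approach; the function name, the bucket count of 64 and the label format are all hypothetical.

```python
import hashlib


def tenant_id_bucket(tenant_id: str, buckets: int = 64) -> str:
    """Map a tenant ID to one of `buckets` stable label values."""
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer, then fold into range.
    index = int.from_bytes(digest[:8], "big") % buckets
    return f"b{index:02d}"


# Many tenants collapse onto at most 64 label values.
labels = {tenant_id_bucket(f"tenant-{i}") for i in range(1000)}
```

The mapping is deterministic, so a given tenant always lands in the same bucket, but a bucket cannot be mapped back to a single tenant; per-tenant investigation still needs logs or traces.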
2. Structured Logs
Logs use Pino's JSON format; the level is configurable via LOG_LEVEL. All logs include traceId, spanId, and tenantId (when present).
2.1 Hot-path validation
{
"level": "info",
"time": "2026-04-21T10:15:00.123Z",
"event": "numbering.validate_lease",
"valid": true,
"cacheHit": true,
"identifierType": "MSISDN",
"identifierHashed": "sha256:abc...",
"tenantId": "a1b2-...",
"latencyMs": 4,
"traceId": "00-abc-..."
}
Identifier value is hashed in logs; full value appears only in numbering.audit and lifecycle events.
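A value in the shape of identifierHashed could be produced as in the sketch below. The function name and the optional key are hypothetical; the document only shows the "sha256:" prefix, not how the hash is computed.

```python
import hashlib
import hmac
from typing import Optional


def hash_identifier(value: str, key: Optional[bytes] = None) -> str:
    """Return 'sha256:<hex>' for an identifier such as an MSISDN.

    With a key, HMAC-SHA256 is used instead of a bare hash. Whether the
    service uses a key is not stated in this document; it is an option
    shown here for illustration.
    """
    data = value.encode("utf-8")
    if key is None:
        hex_digest = hashlib.sha256(data).hexdigest()
    else:
        hex_digest = hmac.new(key, data, hashlib.sha256).hexdigest()
    return "sha256:" + hex_digest


hashed = hash_identifier("+93701234567")
```

A bare SHA-256 of a phone number can be reversed by enumerating the number space, which is why a keyed variant is worth considering if log readers should not be able to recover the MSISDN.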
2.2 Lifecycle events
{
"level": "info",
"event": "numbering.assigned",
"numberId": "num_...",
"value": "+93701234567",
"type": "MSISDN",
"tenantId": "a1b2-...",
"leaseId": "lease_...",
"term": "P90D",
"autoRenew": true,
"actorUserId": "u1-...",
"traceId": "00-abc-..."
}
2.3 Conflicts
{
"level": "warn",
"event": "numbering.conflict",
"kind": "CAS_RACE",
"identifier": "+93701234567",
"type": "MSISDN",
"winnerTenantId": "a1b2-...",
"loserTenantId": "c3d4-...",
"expectedVersion": 17,
"actualVersion": 18
}
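The expectedVersion and actualVersion fields in this event come from a version check on write. The sketch below shows one way such a compare-and-swap could yield the event; it is an in-memory illustration with hypothetical names, not the service's PG implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NumberRow:
    tenant_id: Optional[str]
    version: int


def cas_assign(row: NumberRow, tenant_id: str, expected_version: int):
    """Assign the number if the version matches, else return a conflict event."""
    if row.version != expected_version:
        # Another writer got there first: report a CAS_RACE.
        return {
            "level": "warn",
            "event": "numbering.conflict",
            "kind": "CAS_RACE",
            "winnerTenantId": row.tenant_id,
            "loserTenantId": tenant_id,
            "expectedVersion": expected_version,
            "actualVersion": row.version,
        }
    row.tenant_id = tenant_id
    row.version += 1
    return None


# Two tenants read version 17; only the first write succeeds.
row = NumberRow(tenant_id=None, version=17)
first = cas_assign(row, "a1b2-...", 17)
second = cas_assign(row, "c3d4-...", 17)
```

In a database the check and the write would be a single conditional UPDATE, with the conflict inferred from zero affected rows.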
2.4 Errors
{
"level": "error",
"event": "numbering.error",
"errorType": "pg_unavailable",
"op": "Assign",
"failClosed": true,
"err": { "message": "ECONNREFUSED", "code": "ECONNREFUSED" }
}
2.5 Redaction rules
- Pino redactor masks:
  - identifier.value in trace-only log levels (debug), never in info+.
  - Full MSISDNs appear in info+ logs only for state-change events (operational need).
  - auth.token, authorization, cookie headers masked.
- ESLint rule forbids logger.info(..., { body }) patterns.
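Path-based masking of the kind described above can be sketched as follows. This covers only the always-masked credential fields; the exact paths (for example headers.authorization) and the mask string are assumptions, and the real service relies on Pino's redaction rather than this code.

```python
import copy

MASK = "[Redacted]"  # hypothetical placeholder string
# Dotted paths assumed for illustration; the real path names may differ.
MASKED_PATHS = ("auth.token", "headers.authorization", "headers.cookie")


def redact(record: dict, paths=MASKED_PATHS) -> dict:
    """Return a copy of `record` with the value at each dotted path masked."""
    out = copy.deepcopy(record)
    for path in paths:
        *parents, leaf = path.split(".")
        node = out
        for key in parents:
            node = node.get(key) if isinstance(node, dict) else None
            if node is None:
                break  # path is absent from this record
        else:
            if isinstance(node, dict) and leaf in node:
                node[leaf] = MASK
    return out


entry = {
    "event": "numbering.error",
    "auth": {"token": "secret"},
    "headers": {"authorization": "Bearer abc", "cookie": "s=1", "accept": "*/*"},
}
safe = redact(entry)
```

Masking a copy keeps the original request object intact for the handler while ensuring that only the redacted form reaches the log sink.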
3. OpenTelemetry Tracing
Parent span: numbering.<rpc> (e.g., numbering.ValidateLease, numbering.Assign).
| Span | Operation | Attributes |
|---|---|---|
| numbering.redis.get | Redis GET | cache.hit, key.prefix |
| numbering.pg.select | PG read | rows, table |
| numbering.pg.cas | PG UPDATE with CAS | rows_affected, expected_version, actual_version |
| numbering.outbox.write | Outbox insert | subject, aggregate_id |
| numbering.grpc.external | External gRPC (sender-id-registry / billing) | target, method, status |
| numbering.cache.invalidate | NATS ephemeral publish | key_pattern |
Trace context is propagated from the grpc-trace-bin (gRPC) and traceparent (REST) headers.
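For the REST case, a W3C traceparent header carries four hyphen-separated hex fields: version, trace ID, parent span ID and flags. The sketch below parses the version 00 shape; the function name and the returned keys are hypothetical, and in practice the OTel SDK propagator does this work.

```python
import re

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
_TRACEPARENT = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def parse_traceparent(header: str):
    """Parse a traceparent header; return None if it is malformed."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # The spec forbids version ff and all-zero trace or parent IDs.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "traceId": trace_id,
        "spanId": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
```

When the header is missing or malformed, the usual behaviour is to start a new trace rather than reject the request.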
4. Alerts
| Alert | Condition | Severity | Runbook action |
|---|---|---|---|
| NumberingValidateLeaseP95High | histogram_quantile(0.95, rate(numbering_validate_lease_duration_seconds_bucket[5m])) > 0.05 | MEDIUM | Investigate PG pool, Redis health |
| NumberingValidateLeaseP99High | histogram_quantile(0.99, rate(numbering_validate_lease_duration_seconds_bucket[5m])) > 0.2 for 2m | HIGH | Page on-call |
| NumberingUnavailableRetries | rate(numbering_validate_lease_unavailable_total[5m]) > 1 | HIGH | sms-orchestrator is fail-closing; investigate immediately |
| NumberingShortCodeScarcityCritical | numbering_short_code_scarcity_pct < 10 | HIGH | Request additional allocation from ATRA (long lead time; plan ahead) |
| NumberingPoolExhaustionWarning | numbering_pool_remaining{} / numbering_pool_total{} < 0.05 for 5m | MEDIUM | Contact MNO for new block allocation |
| NumberingPoolExhaustionCritical | numbering_pool_remaining{} / numbering_pool_total{} < 0.01 for 2m | CRITICAL | Page commerce ops; tenants will be blocked imminently |
| NumberingLeaseImportFailed | numbering_lease_import_batches_total{status="FAILED"} > 0 | HIGH | Investigate CSV signature / format |
| NumberingConflictSpike | rate(numbering_conflict_detected_total[5m]) > 5 | MEDIUM | Likely client bug; check top losing tenants |
| NumberingQuarantineBacklog | numbering_quarantine_backlog_total > 100 for 10m | MEDIUM | Sweep cron may be stuck |
| NumberingAuditChainBroken | numbering_audit_hash_chain_violations_total > 0 | CRITICAL | SECURITY incident: halt regulator export; page security + on-call |
| NumberingOutboxLag | numbering_outbox_lag_seconds > 60 | HIGH | Outbox relay stuck; consumers missing events |
| NumberingRegulatorExportFailed | numbering_regulator_export_duration_seconds missing at month start + 24h | HIGH | Generate manually via admin endpoint |
| NumberingRenewalFailureSpike | rate(numbering_renewal_total{outcome="BILLING_REJECTED"}[1h]) > 10 | MEDIUM | Likely billing outage or fraud pattern |
| NumberingCrossRegionLag | numbering_cross_region_lag_seconds > 5 | HIGH | Multi-region replication falling behind; CAS races may surge |
| NumberingReservationCleanupLag | histogram_quantile(0.95, rate(numbering_reservation_cleanup_lag_seconds_bucket[5m])) > 5 | MEDIUM | Keyspace notifications dropped; rely on safety-net cron |
Alerts are delivered via PagerDuty (sev1 / sev2) and Slack (#num-ops) per platform escalation policy.
5. Grafana Dashboard — dashboards/numbering-service.json
| Panel | Query | Visualisation |
|---|---|---|
| ValidateLease rate | rate(numbering_validate_lease_requests_total[1m]) | Stacked time-series by valid |
| ValidateLease P50/P95/P99 | histogram_quantile(…) with three quantiles | Line chart |
| Reason-code distribution | sum by (reason_code) (rate(numbering_validate_lease_requests_total{valid="false"}[5m])) | Bar chart |
| Pool utilisation by operator | numbering_pool_utilisation_pct by operator_id | Heatmap |
| Short-code scarcity | numbering_short_code_scarcity_pct | Gauge + time-series |
| Lifecycle throughput | rate(numbering_reserve_total[1m]), numbering_assign_total, numbering_recall_total stacked | Stacked area |
| Active reservations | numbering_reservations_active_total by kind | Time series |
| Quarantine age | numbering_quarantine_queue_age_seconds | Gauge |
| MNO import rate | increase(numbering_lease_import_rows_total[1h]) by outcome | Bar |
| Outbox lag | numbering_outbox_lag_seconds | Stat + time series |
| Cross-region lag | numbering_cross_region_lag_seconds by region | Line |
| Hash-chain violations (should be flat at 0) | numbering_audit_hash_chain_violations_total | Stat with red threshold on any non-zero |
6. SLOs
Defined in slo/numbering-service.yaml:
| SLO | Target | Window |
|---|---|---|
| ValidateLease P95 latency | ≤ 20 ms cache-hit | Rolling 30 d |
| ValidateLease availability | ≥ 99.95 % | Rolling 30 d |
| Assign P95 latency | ≤ 200 ms | Rolling 30 d |
| Reserve P95 latency | ≤ 100 ms | Rolling 30 d |
| Reservation TTL precision | within ±2 s | Rolling 30 d |
| Quarantine sweep lag | ≤ 5 min past quarantineUntil | Rolling 7 d |
| Outbox publish lag | ≤ 5 s P95 | Rolling 7 d |
| Monthly regulator export generated on time | 100 % | Per-month |
| Cross-region replication lag | ≤ 2 s P95 | Rolling 7 d |
Error budget policy: if any SLO is breached in two consecutive measurement periods, the feature freeze for numbering-service triggers, per platform SRE policy.
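As a worked example of what the availability target leaves as error budget, the sketch below converts a target percentage into an allowance over a window. The function name is hypothetical, and whether the SLO is measured in time or in requests is defined in slo/numbering-service.yaml, not here; both readings are shown.

```python
def error_budget(target_pct: float, window_units: float) -> float:
    """Allowed 'bad' units in a window for a given SLO target percentage."""
    return (100.0 - target_pct) / 100.0 * window_units


# Time-based reading: minutes of unavailability allowed in 30 days.
minutes_in_30d = 30 * 24 * 60
budget_minutes = error_budget(99.95, minutes_in_30d)  # about 21.6 minutes

# Request-based reading: failed calls allowed per 10 million requests
# (the request volume is a made-up figure for illustration).
budget_requests = error_budget(99.95, 10_000_000)  # about 5000 requests
```

Under the time-based reading, a single 25-minute ValidateLease outage would exhaust the 30-day budget on its own.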
End of OBSERVABILITY.md