
numbering-service — Sync Contract

Version: 1.0
Status: Draft
Owner: Commerce Engineering + Platform Engineering
Last Updated: 2026-04-21
Companion: DOMAIN_MODEL · API_CONTRACTS · EVENT_SCHEMAS

This document defines the synchronous contracts other services depend on from numbering-service, the dependencies the service itself has, and the multi-region sync/conflict strategy per ADR-0004 §14.


1. Consumers of numbering-service

| Consumer | Interface | Dependency type | SLA expectation |
|---|---|---|---|
| sms-orchestrator | gRPC ValidateLease | Synchronous hot path per-message | P95 ≤ 20 ms cache-hit, ≤ 50 ms PG fallback; availability 99.95 % |
| routing-engine | gRPC Lookup | Synchronous per-message routing decision | P95 ≤ 15 ms; availability 99.9 % |
| number-intelligence-service | gRPC Lookup | Synchronous enrichment | P95 ≤ 15 ms; availability 99.9 % |
| sender-id-registry-service | gRPC Recall, Lookup | Synchronous on alpha-ID revocation | P95 ≤ 250 ms; availability 99.9 % |
| compliance-engine | gRPC Recall | Asynchronous via NATS compliance.tenant.suspended.v1 → worker → Recall | Best-effort within 60 s of event |
| billing-service | gRPC Recall | Asynchronous via NATS billing.account.delinquent.v1 → worker | Best-effort within 60 s |
| customer-portal-bff | gRPC Reserve/Assign/Release + REST | Synchronous user-facing | P95 ≤ 200 ms; availability 99.5 % |
| admin-dashboard-bff | REST /v1/admin/numbering/* | Synchronous admin | P95 ≤ 500 ms; availability 99.5 % |

Fail-closed semantics

sms-orchestrator treats UNAVAILABLE, DEADLINE_EXCEEDED, and INTERNAL as do-not-send outcomes: the NATS consumer does not ACK the message, JetStream redelivers up to 3 times, and after retry exhaustion the message moves to sms.outbound.deadletter with reason numbering_unavailable. No message is dispatched without a positive valid: true response.
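The fail-closed rule can be sketched as a small decision function. This is an illustrative sketch, not the orchestrator's actual code: `LeaseCheck`, `Disposition`, and `decide` are hypothetical names, the `REJECT` outcome for a `valid: false` response is an assumption, and "redelivers up to 3 times" is read here as three redeliveries after the first delivery.

```typescript
// Hypothetical types for illustration; only the three failure statuses named
// in the contract are modelled here.
type GrpcStatus = 'OK' | 'UNAVAILABLE' | 'DEADLINE_EXCEEDED' | 'INTERNAL';

interface LeaseCheck {
  status: GrpcStatus;
  valid?: boolean; // present only when status is OK
  reasonCode?: string; // present when valid is false
}

type Disposition =
  | { action: 'SEND' }
  | { action: 'REJECT'; reason: string }
  | { action: 'RETRY' } // leave the message un-ACKed so JetStream redelivers
  | { action: 'DEADLETTER'; subject: string; reason: string };

// From the contract: JetStream redelivers up to 3 times.
const MAX_REDELIVERIES = 3;

// `redeliveries` is 0 on the first delivery (assumption about how the
// consumer counts attempts).
function decide(check: LeaseCheck, redeliveries: number): Disposition {
  if (check.status === 'OK') {
    // Only an explicit valid: true permits dispatch.
    if (check.valid === true) return { action: 'SEND' };
    return { action: 'REJECT', reason: check.reasonCode ?? 'UNSPECIFIED_FAIL' };
  }
  // UNAVAILABLE, DEADLINE_EXCEEDED, INTERNAL: never send.
  if (redeliveries < MAX_REDELIVERIES) return { action: 'RETRY' };
  return {
    action: 'DEADLETTER',
    subject: 'sms.outbound.deadletter',
    reason: 'numbering_unavailable',
  };
}
```

Keeping the decision pure (no NATS or gRPC calls inside it) makes the do-not-send rule easy to unit-test separately from the consumer loop.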


2. Dependencies of numbering-service

| Dependency | Interface | Failure mode |
|---|---|---|
| PostgreSQL numbering schema | Read/write SQL (pg pool, 15+) | Writes fail with UNAVAILABLE; reads degrade to Redis cache (60 s validity) |
| Redis (DB 7) | GET/SET/SETEX/DEL/ZADD, keyspace notifications | Cache miss fallback to PG; reservation cleanup relies on safety-net cron |
| NATS JetStream | Outbox relay publishes; COMPLIANCE_TENANT, BILLING_ACCOUNT, AUTH_TENANT consumer groups | Published events buffer in outbox; relay retries until ack |
| auth-service gRPC | GetTenant(tenantId) | 5 s timeout; on failure, fall back to cached tenant metadata (5 min TTL) |
| sender-id-registry-service gRPC | IsVerified(alphaId, tenantId) | 500 ms timeout; on failure, reject Assign with FAILED_PRECONDITION (fail-closed for alpha-ID path) |
| billing-service gRPC | PreviewCharge(tenantId, item, ref) (renewal) | 2 s timeout; on failure, skip renewal and emit number.renewal.failed.v1 {BILLING_REJECTED} |
| S3 / object-store | PUT (monthly export) | Retry with exponential backoff; on persistent failure, export row stays PENDING and operator-dashboard alert fires |
| Vault PKI | mTLS server + client cert issuance | Cert auto-rotates; on failure, existing certs serve until next rotation |
| Vault Transit | Signing monthly export | 3 retries; on failure, export stays GENERATED not SIGNED |
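As one illustration of these failure modes, the auth-service row (fall back to cached tenant metadata with a 5 min TTL) might look like the sketch below. The `Tenant` shape, `TenantCache`, and `resolveTenant` are hypothetical names, and the behaviour when the cache is also stale is not specified by this contract, so the sketch simply returns `undefined` and leaves that decision to the caller.

```typescript
// Hypothetical tenant shape; the real GetTenant response is defined elsewhere.
interface Tenant {
  tenantId: string;
  status: string;
}

const TENANT_CACHE_TTL_MS = 5 * 60 * 1000; // 5 min TTL from the table above

class TenantCache {
  private entries: Record<string, { tenant: Tenant; storedAt: number } | undefined> = {};

  put(tenant: Tenant, now: number): void {
    this.entries[tenant.tenantId] = { tenant, storedAt: now };
  }

  // Returns the cached tenant only while it is within the TTL.
  getFresh(tenantId: string, now: number): Tenant | undefined {
    const entry = this.entries[tenantId];
    if (entry === undefined || now - entry.storedAt > TENANT_CACHE_TTL_MS) {
      return undefined;
    }
    return entry.tenant;
  }
}

// `fetched` is the GetTenant result, or undefined when the call failed or hit
// its 5 s timeout. A successful fetch refreshes the cache; a failed one falls
// back to whatever is still fresh.
function resolveTenant(
  tenantId: string,
  fetched: Tenant | undefined,
  cache: TenantCache,
  now: number,
): Tenant | undefined {
  if (fetched !== undefined) {
    cache.put(fetched, now);
    return fetched;
  }
  return cache.getFresh(tenantId, now);
}
```

Passing `now` in explicitly, rather than calling `Date.now()` inside, keeps the TTL logic deterministic under test.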

3. Proto Definition

syntax = "proto3";
package ghasi.sms.numbering.v1;
option go_package = "github.com/ghasi/sms-gateway/numbering/v1";

import "google/protobuf/timestamp.proto";

service NumberingService {
  // Hot-path validation. Called per outbound message by sms-orchestrator.
  // P95 ≤ 20 ms cache-hit; ≤ 50 ms PG fallback. mTLS required.
  rpc ValidateLease(ValidateLeaseRequest) returns (ValidateLeaseResponse);

  // Metadata lookup for routing / intelligence. No tenant scope.
  rpc Lookup(LookupRequest) returns (LookupResponse);

  // Lifecycle transitions: all idempotent on idempotency_key.
  rpc Reserve(ReserveRequest) returns (ReserveResponse);
  rpc Assign(AssignRequest) returns (AssignResponse);
  rpc Release(ReleaseRequest) returns (ReleaseResponse);
  rpc Recall(RecallRequest) returns (RecallResponse);
}

// (Messages and enums as in API_CONTRACTS §1.)

Full message and enum definitions in API_CONTRACTS §1.


4. Integration Example — sms-orchestrator

// Inside the sms-orchestrator NATS consumer, after ingestion validation
async function validateSender(
  ctx: MessageContext,
): Promise<'ALLOW' | { reason: string }> {
  try {
    const res = await numberingClient.validateLease(
      {
        identifier: ctx.senderId,
        type: ctx.senderKind, // MSISDN | SHORT_CODE | ALPHA_ID
        tenantId: ctx.tenantId,
      },
      { deadline: Date.now() + 1_000 }, // 1 s deadline
    );
    if (res.valid) return 'ALLOW';
    return { reason: res.reasonCode };
  } catch (err) {
    // Fail-closed: do not ACK; JetStream redelivers
    metrics.increment('numbering_unavailable_retry_total');
    throw err;
  }
}

5. Conflict Resolution Policy (Multi-Region)

Per ADR-0004 §14, numbering-service operates multi-master across kbl and mzr for control-plane resilience. Conflict policy per table:

| Table | Policy | Rationale |
|---|---|---|
| numbers (state transitions) | server_authoritative with strict CAS | Every state transition includes version = :expected; the writing region uses synchronous cross-region quorum on the numbers row. No silent LWW: losers get CONFLICT. Prevents double-assignment. |
| leases | server_authoritative with strict CAS | Active-lease partial unique index is enforced across regions via synchronous replication on the number_id FK key. |
| reservations | region-local with anti-affinity | Reservations are TTL-bounded advisory state; a region failover may lose in-flight reservations, which is acceptable (TTL-expiry semantics). Tenants are advised that reservations are "best effort during maintenance". |
| tenant_pools | last-writer-wins (LWW) by updated_at | Quota changes are rare and idempotent; no harm in LWW on race. |
| lease_contracts | server_authoritative with approval workflow | Contract CRUD is admin-only and serialised through a single primary region per contract. |
| audit | append-only, partition-per-region | Audit rows are written only to the region that performed the state change; cross-region read aggregates on demand. Hash chain is per-region. |
| quarantine_records | server_authoritative | Partitioned by number_id; the same region that owns the number owns the quarantine row. |
| regulator_exports | single-region primary | Only kbl generates monthly exports; mzr reads for read-only admin display. |

CAS example (multi-region)

-- Region kbl issues the Reserve
BEGIN;
UPDATE numbering.numbers
   SET state = 'RESERVED',
       assigned_tenant_id = $1,
       version = version + 1,
       updated_at = now()
 WHERE number_id = $2
   AND state = 'AVAILABLE'
   AND version = $3;
-- Zero rows affected means the CAS lost: roll back and return CONFLICT.
-- Postgres logical replication propagates the row mutation to mzr
-- synchronously (quorum); the wait happens at COMMIT, which does not
-- return until the quorum has confirmed.
COMMIT;

If another Reserve arrives at mzr for the same number in the race window, the second write sees version != expected and gets zero rows affected → CONFLICT → client retries on a different candidate.
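The client-side half of this policy (retry on a different candidate after CONFLICT) can be sketched with an in-memory stand-in for the `numbers` table. Everything here is illustrative: `casReserve`, `reserveFirstAvailable`, and the row shape are hypothetical, and the real check is the SQL `UPDATE ... WHERE` above, not this function.

```typescript
type NumberState = 'AVAILABLE' | 'RESERVED';

interface NumberRow {
  numberId: string;
  state: NumberState;
  version: number;
  assignedTenantId?: string;
}

// In-memory stand-in for numbering.numbers, keyed by number_id.
type NumberTable = Record<string, NumberRow | undefined>;

// Mirrors UPDATE ... WHERE state = 'AVAILABLE' AND version = :expected.
// Returns the number of rows affected (0 or 1), as a SQL driver would.
function casReserve(
  table: NumberTable,
  numberId: string,
  expectedVersion: number,
  tenantId: string,
): number {
  const row = table[numberId];
  if (row === undefined || row.state !== 'AVAILABLE' || row.version !== expectedVersion) {
    return 0; // CAS lost, or the number is not reservable
  }
  table[numberId] = {
    ...row,
    state: 'RESERVED',
    assignedTenantId: tenantId,
    version: row.version + 1,
  };
  return 1;
}

// Tries each candidate once. Zero rows affected is treated as CONFLICT and
// the caller moves on to the next candidate rather than retrying the same one.
function reserveFirstAvailable(
  table: NumberTable,
  candidates: { numberId: string; expectedVersion: number }[],
  tenantId: string,
): string | undefined {
  for (const candidate of candidates) {
    const affected = casReserve(table, candidate.numberId, candidate.expectedVersion, tenantId);
    if (affected === 1) return candidate.numberId;
  }
  return undefined; // every candidate conflicted
}
```

Moving to a different candidate, instead of re-reading and retrying the same row, avoids two regions repeatedly contending for one number.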


6. Outbox Pattern

All state mutations write to numbering.outbox inside the same transaction as the aggregate row. A dedicated relay process:

  1. SELECT event_id, subject, payload FROM numbering.outbox WHERE published_at IS NULL ORDER BY created_at LIMIT 500 FOR UPDATE SKIP LOCKED;
  2. Publish to NATS JetStream with explicit ack.
  3. UPDATE numbering.outbox SET published_at = now() WHERE event_id = :id;
  4. On failure, increment attempts and record last_error; exponential backoff.
  5. After 10 attempts → move to outbox_deadletter and fire OutboxRelayStuck alert.

Ordering guarantee: outbox relay uses aggregate_id = number_id as an ordering key — events for the same number are published in created_at order, so consumers observe lifecycle transitions correctly.
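The relay's bookkeeping for steps 3 to 5 can be sketched as follows. This is a sketch under assumptions: the contract fixes only the 10-attempt limit and "exponential backoff", so `BASE_DELAY_MS`, `MAX_DELAY_MS`, the doubling factor, and the names `OutboxRow`, `backoffDelayMs`, and `recordPublishResult` are all hypothetical.

```typescript
const MAX_ATTEMPTS = 10; // from step 5
const BASE_DELAY_MS = 1_000; // ASSUMPTION: not specified by the contract
const MAX_DELAY_MS = 5 * 60_000; // ASSUMPTION: cap so delays stay bounded

// Hypothetical in-memory view of a numbering.outbox row.
interface OutboxRow {
  eventId: string;
  attempts: number;
  lastError?: string;
  publishedAt?: number;
}

type RelayOutcome =
  | { kind: 'PUBLISHED' }
  | { kind: 'RETRY_AFTER'; delayMs: number }
  | { kind: 'DEADLETTER' }; // move to outbox_deadletter, fire OutboxRelayStuck

// Doubles the delay per failed attempt: 1 s, 2 s, 4 s, ... up to the cap.
function backoffDelayMs(attempts: number): number {
  return Math.min(BASE_DELAY_MS * 2 ** (attempts - 1), MAX_DELAY_MS);
}

// Applies the result of one publish attempt to the row. `error` is undefined
// when JetStream acknowledged the publish.
function recordPublishResult(
  row: OutboxRow,
  error: string | undefined,
  now: number,
): RelayOutcome {
  if (error === undefined) {
    row.publishedAt = now; // step 3
    return { kind: 'PUBLISHED' };
  }
  row.attempts += 1; // step 4
  row.lastError = error;
  if (row.attempts >= MAX_ATTEMPTS) return { kind: 'DEADLETTER' }; // step 5
  return { kind: 'RETRY_AFTER', delayMs: backoffDelayMs(row.attempts) };
}
```

Because `published_at` is set only after the JetStream ack, a relay crash between steps 2 and 3 republishes the event, so delivery is at-least-once and consumers need to tolerate duplicates.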


7. Cache Invalidation Contract

numbering-service publishes to an ephemeral NATS subject, num.cache.invalidate.v1 (no persistence, 3-replica fanout, no durable consumers), after every state mutation:

{ "type": "MSISDN", "value": "+93701234567", "tenantIds": ["t_...","t_..."] }

Subscribers (primarily sms-orchestrator) delete matching Redis keys num:valid:*. Backstop: 60 s TTL ensures eventual consistency even if the invalidate message is lost.
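A subscriber's handler might look like the sketch below. The contract gives only the key prefix `num:valid:*`, so the full key layout used here (`num:valid:<type>:<value>:<tenantId>`) is an assumption for illustration, as are the names `keysFor` and `handleInvalidate`. A plain object stands in for Redis.

```typescript
interface InvalidateMessage {
  type: 'MSISDN' | 'SHORT_CODE' | 'ALPHA_ID';
  value: string;
  tenantIds: string[];
}

// ASSUMPTION: one cache key per (type, value, tenant). The real layout under
// num:valid:* is not defined in this document.
function keysFor(msg: InvalidateMessage): string[] {
  return msg.tenantIds.map(
    (tenantId) => `num:valid:${msg.type}:${msg.value}:${tenantId}`,
  );
}

// Parses one num.cache.invalidate.v1 payload and removes the matching keys
// from an in-memory stand-in for Redis. Returns how many keys were removed.
function handleInvalidate(
  rawPayload: string,
  cache: Record<string, unknown>,
): number {
  const msg = JSON.parse(rawPayload) as InvalidateMessage;
  let deleted = 0;
  for (const key of keysFor(msg)) {
    if (key in cache) {
      delete cache[key];
      deleted += 1;
    }
  }
  return deleted;
}
```

Deleting a key that is already absent is harmless, so the handler is safe to run on duplicate messages; a lost message is covered by the 60 s TTL backstop.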


8. Schema Stability Guarantees

gRPC

| Element | Stability |
|---|---|
| ValidateLeaseRequest required fields | Stable v1 |
| Response reason_code enum (string) | Additive only; consumers must treat unknown codes as UNSPECIFIED_FAIL |
| NumberType / NumberState / NumberSubtype enums | Additive only |
| New fields with proto3 default values | Non-breaking |
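On the consumer side, the additive-only rule for reason_code can be honoured with a small normaliser. The codes listed in `KNOWN_REASON_CODES` are placeholders invented for this sketch (the authoritative list is in API_CONTRACTS §1); only `UNSPECIFIED_FAIL` comes from this document.

```typescript
// PLACEHOLDER codes for illustration only, apart from UNSPECIFIED_FAIL.
const KNOWN_REASON_CODES = [
  'EXAMPLE_LEASE_EXPIRED',
  'EXAMPLE_TENANT_MISMATCH',
  'UNSPECIFIED_FAIL',
] as const;

type KnownReasonCode = (typeof KNOWN_REASON_CODES)[number];

// Maps any code this client build does not recognise to UNSPECIFIED_FAIL, so
// a newer server can add codes without breaking older consumers.
function normalizeReasonCode(raw: string): KnownReasonCode {
  const known = KNOWN_REASON_CODES as readonly string[];
  return known.indexOf(raw) !== -1 ? (raw as KnownReasonCode) : 'UNSPECIFIED_FAIL';
}
```

Downstream logic can then switch exhaustively over `KnownReasonCode` without a separate branch for codes introduced after the client was built.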

REST

  • /v1/admin/numbering/*, /v1/portal/numbering/* maintained for the major version.
  • Breaking changes require /v2/... with 90-day deprecation.

Events


9. Versioning Policy

  • gRPC package: ghasi.sms.numbering.v1.
  • Breaking → v2 in parallel with ≥ 90-day deprecation.
  • Client libraries: TypeScript + Go generated from .proto, published to internal registry; pinned in consumer package manifests.

End of SYNC_CONTRACT.md