numbering-service — Sync Contract
Version: 1.0
Status: Draft
Owner: Commerce Engineering + Platform Engineering
Last Updated: 2026-04-21
Companion: DOMAIN_MODEL · API_CONTRACTS · EVENT_SCHEMAS
This document defines the contracts that other services rely on numbering-service to honour, the dependencies of the service itself, and the multi-region sync and conflict-resolution strategy per ADR-0004 §14.
1. Consumers of numbering-service
| Consumer | Interface | Dependency type | SLA expectation |
|---|---|---|---|
| sms-orchestrator | gRPC ValidateLease | Synchronous hot path per-message | P95 ≤ 20 ms cache-hit, ≤ 50 ms PG fallback; availability 99.95 % |
| routing-engine | gRPC Lookup | Synchronous per-message routing decision | P95 ≤ 15 ms; availability 99.9 % |
| number-intelligence-service | gRPC Lookup | Synchronous enrichment | P95 ≤ 15 ms; availability 99.9 % |
| sender-id-registry-service | gRPC Recall, Lookup | Synchronous on alpha-ID revocation | P95 ≤ 250 ms; availability 99.9 % |
| compliance-engine | gRPC Recall | Asynchronous via NATS compliance.tenant.suspended.v1 → worker → Recall | Best-effort within 60 s of event |
| billing-service | gRPC Recall | Asynchronous via NATS billing.account.delinquent.v1 → worker | Best-effort within 60 s |
| customer-portal-bff | gRPC Reserve/Assign/Release + REST | Synchronous user-facing | P95 ≤ 200 ms; availability 99.5 % |
| admin-dashboard-bff | REST /v1/admin/numbering/* | Synchronous admin | P95 ≤ 500 ms; availability 99.5 % |
Fail-closed semantics
sms-orchestrator treats UNAVAILABLE, DEADLINE_EXCEEDED, and INTERNAL as do-not-send outcomes: the NATS consumer does not ACK the message, JetStream redelivers up to 3 times, and after retry exhaustion the message moves to sms.outbound.deadletter with reason numbering_unavailable. No message is dispatched without a positive valid: true response.
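
The fail-closed rule above can be read as a small decision function. The following is a minimal sketch, not the orchestrator's actual code: the `ValidateOutcome` and `Decision` types and the `decide` function are hypothetical, the `REJECT` action for a `valid: false` response is an assumption (the contract only says such messages are not dispatched), and `deliveryCount` is assumed to be 1 on first delivery, so three redeliveries give four attempts in total.

```typescript
// Hypothetical types for illustration; not part of the published contract.
type GrpcStatus = 'OK' | 'UNAVAILABLE' | 'DEADLINE_EXCEEDED' | 'INTERNAL';

interface ValidateOutcome {
  status: GrpcStatus;
  valid?: boolean; // only meaningful when status === 'OK'
  reasonCode?: string;
}

type Decision =
  | { action: 'DISPATCH' }
  | { action: 'REJECT'; reason: string }
  | { action: 'REDELIVER' }
  | { action: 'DEADLETTER'; subject: string; reason: string };

// From the contract: JetStream redelivers up to 3 times.
const MAX_REDELIVERIES = 3;

// deliveryCount is assumed to start at 1 for the first delivery.
function decide(outcome: ValidateOutcome, deliveryCount: number): Decision {
  if (outcome.status === 'OK') {
    // Only an explicit valid: true permits dispatch.
    return outcome.valid === true
      ? { action: 'DISPATCH' }
      : { action: 'REJECT', reason: outcome.reasonCode ?? 'UNSPECIFIED_FAIL' };
  }
  // UNAVAILABLE, DEADLINE_EXCEEDED, INTERNAL: never send.
  if (deliveryCount > MAX_REDELIVERIES) {
    return {
      action: 'DEADLETTER',
      subject: 'sms.outbound.deadletter',
      reason: 'numbering_unavailable',
    };
  }
  return { action: 'REDELIVER' }; // do not ACK; JetStream redelivers
}
```

Keeping the decision pure makes the "no dispatch without `valid: true`" invariant easy to unit-test independently of NATS and gRPC.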
2. Dependencies of numbering-service
| Dependency | Interface | Failure mode |
|---|---|---|
| PostgreSQL numbering schema | Read/write SQL (pg pool, 15+) | Writes fail with UNAVAILABLE; reads degrade to Redis cache (60 s validity) |
| Redis (DB 7) | GET/SET/SETEX/DEL/ZADD, keyspace notifications | Cache miss fallback to PG; reservation cleanup relies on safety-net cron |
| NATS JetStream | Outbox relay publishes; COMPLIANCE_TENANT, BILLING_ACCOUNT, AUTH_TENANT consumer groups | Published events buffer in outbox; relay retries until ack |
| auth-service gRPC | GetTenant(tenantId) | 5 s timeout; on failure, fall back to cached tenant metadata (5 min TTL) |
| sender-id-registry-service gRPC | IsVerified(alphaId, tenantId) | 500 ms timeout; on failure, reject Assign with FAILED_PRECONDITION (fail-closed for alpha-ID path) |
| billing-service gRPC | PreviewCharge(tenantId, item, ref) (renewal) | 2 s timeout; on failure, skip renewal and emit number.renewal.failed.v1 {BILLING_REJECTED} |
| S3 / object-store | PUT (monthly export) | Retry with exponential backoff; on persistent failure, export row stays PENDING and operator-dashboard alert fires |
| Vault PKI | mTLS server + client cert issuance | Cert auto-rotates; on failure, existing certs serve until next rotation |
| Vault Transit | Signing monthly export | 3 retries; on failure, export stays GENERATED not SIGNED |
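
To illustrate the auth-service row, here is a minimal sketch of the "fall back to cached tenant metadata (5 min TTL)" behaviour. The `TenantMetadata` shape, the `TenantCache` class, and `resolveTenant` are hypothetical names introduced for illustration; the real `GetTenant` response is defined by auth-service. The clock is passed in so the TTL logic can be tested without timers.

```typescript
// Hypothetical shape; the real GetTenant response is owned by auth-service.
interface TenantMetadata {
  tenantId: string;
  status: string; // illustrative field only
}

// 5 min TTL, as stated in the dependency table.
const TENANT_CACHE_TTL_MS = 5 * 60 * 1000;

class TenantCache {
  private entries = new Map<string, { value: TenantMetadata; storedAtMs: number }>();

  put(value: TenantMetadata, nowMs: number): void {
    this.entries.set(value.tenantId, { value, storedAtMs: nowMs });
  }

  // Returns the cached value only while it is younger than the TTL.
  getFresh(tenantId: string, nowMs: number): TenantMetadata | undefined {
    const entry = this.entries.get(tenantId);
    if (entry === undefined) return undefined;
    return nowMs - entry.storedAtMs < TENANT_CACHE_TTL_MS ? entry.value : undefined;
  }
}

// `live` is the GetTenant result, or undefined when the call failed
// or exceeded its 5 s timeout.
function resolveTenant(
  tenantId: string,
  live: TenantMetadata | undefined,
  cache: TenantCache,
  nowMs: number,
): TenantMetadata | undefined {
  if (live !== undefined) {
    cache.put(live, nowMs); // refresh the cache on every successful call
    return live;
  }
  return cache.getFresh(tenantId, nowMs); // undefined means no usable fallback
}
```

What the caller does when both the live call and the cache come back empty is not specified in this document, so the sketch simply returns `undefined`.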
3. Proto Definition
```protobuf
syntax = "proto3";

package ghasi.sms.numbering.v1;

option go_package = "github.com/ghasi/sms-gateway/numbering/v1";

import "google/protobuf/timestamp.proto";

service NumberingService {
  // Hot-path validation. Called per outbound message by sms-orchestrator.
  // P95 ≤ 20 ms cache-hit; ≤ 50 ms PG fallback. mTLS required.
  rpc ValidateLease(ValidateLeaseRequest) returns (ValidateLeaseResponse);

  // Metadata lookup for routing / intelligence. No tenant scope.
  rpc Lookup(LookupRequest) returns (LookupResponse);

  // Lifecycle transitions: all idempotent on idempotency_key.
  rpc Reserve(ReserveRequest) returns (ReserveResponse);
  rpc Assign(AssignRequest) returns (AssignResponse);
  rpc Release(ReleaseRequest) returns (ReleaseResponse);
  rpc Recall(RecallRequest) returns (RecallResponse);
}

// (Messages and enums as in API_CONTRACTS §1.)
```
Full message and enum definitions in API_CONTRACTS §1.
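
The proto comments state that lifecycle transitions are idempotent on `idempotency_key`. As a rough illustration of what that means for a caller, the sketch below replays the first response for a repeated key. It is an in-memory stand-in only: the request and response field names are hypothetical (the real messages live in API_CONTRACTS §1), and a real service would presumably persist the key alongside the state change rather than hold it in a `Map`.

```typescript
// Hypothetical message shapes; see API_CONTRACTS §1 for the real ones.
interface ReserveRequest {
  idempotencyKey: string;
  numberId: string;
  tenantId: string;
}

interface ReserveResponse {
  reservationId: string;
  numberId: string;
}

class IdempotentReserveHandler {
  private responsesByKey = new Map<string, ReserveResponse>();
  private counter = 0;

  reserve(req: ReserveRequest): ReserveResponse {
    // A repeated key returns the stored response without a second side effect.
    const prior = this.responsesByKey.get(req.idempotencyKey);
    if (prior !== undefined) return prior;

    this.counter += 1;
    const res: ReserveResponse = {
      reservationId: `r_${this.counter}`,
      numberId: req.numberId,
    };
    this.responsesByKey.set(req.idempotencyKey, res);
    return res;
  }
}
```

With this behaviour a client can safely retry a timed-out `Reserve` using the same key and observe at most one reservation.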
4. Integration Example — sms-orchestrator
```typescript
// Inside the sms-orchestrator NATS consumer, after ingestion validation.
// numberingClient, metrics, and MessageContext are assumed to be in scope here.
async function validateSender(
  ctx: MessageContext,
): Promise<'ALLOW' | { reason: string }> {
  try {
    const res = await numberingClient.validateLease(
      {
        identifier: ctx.senderId,
        type: ctx.senderKind, // MSISDN | SHORT_CODE | ALPHA_ID
        tenantId: ctx.tenantId,
      },
      { deadline: Date.now() + 1_000 }, // 1 s deadline
    );
    if (res.valid) return 'ALLOW';
    return { reason: res.reasonCode };
  } catch (err) {
    // Fail-closed: do not ACK; JetStream redelivers
    metrics.increment('numbering_unavailable_retry_total');
    throw err;
  }
}
```
5. Conflict Resolution Policy (Multi-Region)
Per ADR-0004 §14, numbering-service operates multi-master across kbl and mzr for control-plane resilience. Conflict policy per table:
| Table | Policy | Rationale |
|---|---|---|
| numbers (state transitions) | server_authoritative with strict CAS | Every state transition includes version = :expected; the writing region uses synchronous cross-region quorum on the numbers row. No silent LWW: losers get CONFLICT. Prevents double-assignment. |
| leases | server_authoritative with strict CAS | Active-lease partial unique index is enforced across regions via synchronous replication on the number_id foreign key. |
| reservations | region-local with anti-affinity | Reservations are TTL-bounded advisory state; a region failover may lose in-flight reservations, which is acceptable (TTL-expiry semantics). Tenants are advised that reservations are "best effort during maintenance". |
| tenant_pools | last-writer-wins (LWW) by updated_at | Quota changes are rare and idempotent; no harm in LWW on race. |
| lease_contracts | server_authoritative with approval workflow | Contract CRUD is admin-only and serialised through a single primary region per contract. |
| audit | append-only, partition-per-region | Audit rows are written only to the region that performed the state change; cross-region read aggregates on demand. Hash chain is per-region. |
| quarantine_records | server_authoritative | Partitioned by number_id; the same region that owns the number owns the quarantine row. |
| regulator_exports | single-region primary | Only kbl generates monthly exports; mzr reads for read-only admin display. |
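
For the tenant_pools row, last-writer-wins by updated_at can be sketched as a merge of two replicas of the same row. This is an illustration under assumptions, not the replication layer's implementation: the `TenantPoolRow` shape and its `quota` column are hypothetical, and the tie-break on region name is an added assumption so that both regions converge on the same winner when timestamps are equal.

```typescript
// Hypothetical row shape for illustration.
interface TenantPoolRow {
  tenantId: string;
  quota: number;
  updatedAtMs: number; // updated_at as epoch milliseconds
  region: 'kbl' | 'mzr'; // region that performed the write
}

// Last-writer-wins: the row with the later updated_at survives.
function mergeLww(a: TenantPoolRow, b: TenantPoolRow): TenantPoolRow {
  if (a.updatedAtMs !== b.updatedAtMs) {
    return a.updatedAtMs > b.updatedAtMs ? a : b;
  }
  // Assumed deterministic tie-break so the merge is order-independent.
  return a.region <= b.region ? a : b;
}
```

Because the merge is commutative, both regions reach the same value regardless of the order in which they see the two writes, which is why LWW is tolerable for rare quota changes but not for the numbers table.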
CAS example (multi-region)
```sql
-- Region kbl issues the Reserve
BEGIN;

UPDATE numbering.numbers
   SET state              = 'RESERVED',
       assigned_tenant_id = $1,
       version            = version + 1,
       updated_at         = now()
 WHERE number_id = $2
   AND state     = 'AVAILABLE'
   AND version   = $3;

-- Postgres logical replication propagates the row mutation to mzr
-- synchronously (quorum); COMMIT does not return until mzr has confirmed it.
COMMIT;
```
If another Reserve arrives at mzr for the same number in the race window, the second write finds that the row no longer matches its expected version, so the UPDATE affects zero rows. The service returns CONFLICT and the client retries on a different candidate.
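
The same compare-and-swap behaviour can be sketched in memory. The `NumberStore` class and `reserveFirstAvailable` helper below are hypothetical and stand in for the SQL above; `casReserve` mirrors the UPDATE's WHERE clause and returns the number of rows affected, and the helper shows a client moving to the next candidate after a conflict.

```typescript
type NumberState = 'AVAILABLE' | 'RESERVED';

interface NumberRow {
  numberId: string;
  state: NumberState;
  assignedTenantId: string | null;
  version: number;
}

// In-memory stand-in for numbering.numbers (illustration only).
class NumberStore {
  private rows = new Map<string, NumberRow>();

  constructor(initial: NumberRow[]) {
    for (const row of initial) this.rows.set(row.numberId, { ...row });
  }

  // Returns a copy so callers hold a snapshot, as a SELECT would.
  read(numberId: string): NumberRow | undefined {
    const row = this.rows.get(numberId);
    return row ? { ...row } : undefined;
  }

  // Mirrors: UPDATE ... WHERE number_id = ? AND state = 'AVAILABLE' AND version = ?
  // Returns rows affected (0 or 1).
  casReserve(numberId: string, tenantId: string, expectedVersion: number): number {
    const row = this.rows.get(numberId);
    if (!row || row.state !== 'AVAILABLE' || row.version !== expectedVersion) return 0;
    row.state = 'RESERVED';
    row.assignedTenantId = tenantId;
    row.version += 1;
    return 1;
  }
}

// On zero rows affected (CONFLICT), try the next candidate number.
function reserveFirstAvailable(
  store: NumberStore,
  candidates: string[],
  tenantId: string,
): string | undefined {
  for (const numberId of candidates) {
    const snapshot = store.read(numberId);
    if (snapshot === undefined) continue;
    if (store.casReserve(numberId, tenantId, snapshot.version) === 1) return numberId;
  }
  return undefined; // every candidate conflicted
}
```

The loser of a race is told explicitly that it lost, rather than having its write silently overwritten, which is the property the numbers table relies on to prevent double-assignment.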
6. Outbox Pattern
All state mutations write to numbering.outbox inside the same transaction as the aggregate row. A dedicated relay process:
1. `SELECT event_id, subject, payload FROM numbering.outbox WHERE published_at IS NULL ORDER BY created_at LIMIT 500 FOR UPDATE SKIP LOCKED;`
2. Publish to NATS JetStream with explicit ack.
3. `UPDATE numbering.outbox SET published_at = now() WHERE event_id = :id;`
4. On failure, increment `attempts` and record `last_error`; exponential backoff.
5. After 10 attempts → move to `outbox_deadletter` and fire `OutboxRelayStuck` alert.
Ordering guarantee: outbox relay uses aggregate_id = number_id as an ordering key — events for the same number are published in created_at order, so consumers observe lifecycle transitions correctly.
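
The per-row relay logic can be sketched as follows. This is a simplified, synchronous illustration rather than the relay's actual code: `OutboxRow`, `relayOne`, and `backoffMs` are hypothetical names, the `publish` callback is assumed to throw when JetStream does not acknowledge, and the backoff base (1 s) and cap (60 s) are assumptions, since the contract only says "exponential backoff". The 10-attempt dead-letter threshold comes from the text.

```typescript
// Hypothetical in-memory view of a numbering.outbox row.
interface OutboxRow {
  eventId: string;
  subject: string;
  payload: string;
  attempts: number;
  lastError: string | null;
  publishedAtMs: number | null;
}

const MAX_ATTEMPTS = 10; // from the contract
const BASE_BACKOFF_MS = 1000; // assumption
const MAX_BACKOFF_MS = 60000; // assumption

// Delay before the next try, doubling per failed attempt up to the cap.
function backoffMs(attempts: number): number {
  return Math.min(BASE_BACKOFF_MS * Math.pow(2, attempts - 1), MAX_BACKOFF_MS);
}

type RelayResult = 'PUBLISHED' | 'RETRY' | 'DEADLETTER';

// `publish` must throw if the broker does not acknowledge the message.
function relayOne(
  row: OutboxRow,
  publish: (subject: string, payload: string) => void,
  nowMs: number,
): RelayResult {
  try {
    publish(row.subject, row.payload);
    row.publishedAtMs = nowMs; // corresponds to SET published_at = now()
    return 'PUBLISHED';
  } catch (err) {
    row.attempts += 1;
    row.lastError = err instanceof Error ? err.message : String(err);
    // After 10 attempts the row moves to outbox_deadletter and the alert fires.
    return row.attempts >= MAX_ATTEMPTS ? 'DEADLETTER' : 'RETRY';
  }
}
```

A caller would schedule the next try after `backoffMs(row.attempts)` on `RETRY`, and process rows for the same `number_id` strictly in `created_at` order to preserve the ordering guarantee.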
7. Cache Invalidation Contract
numbering-service publishes an ephemeral NATS subject num.cache.invalidate.v1 (no persistence, 3-replica fanout, no durable consumers) after every state mutation:
```json
{ "type": "MSISDN", "value": "+93701234567", "tenantIds": ["t_...", "t_..."] }
```
Subscribers (primarily sms-orchestrator) delete matching Redis keys num:valid:*. Backstop: 60 s TTL ensures eventual consistency even if the invalidate message is lost.
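
A subscriber's handling of the invalidate message might look like the sketch below. The exact layout of the `num:valid:*` keys is not specified in this document, so the `num:valid:{type}:{value}:{tenantId}` pattern used here is an assumption, as are the `InvalidateMessage` type and the function names. A `Map` stands in for Redis.

```typescript
// Shape of the num.cache.invalidate.v1 payload shown above.
interface InvalidateMessage {
  type: string;
  value: string;
  tenantIds: string[];
}

// Assumed key layout: num:valid:{type}:{value}:{tenantId}
function keysToDelete(msg: InvalidateMessage): string[] {
  return msg.tenantIds.map(
    (tenantId) => `num:valid:${msg.type}:${msg.value}:${tenantId}`,
  );
}

// Deletes every matching key and returns how many were present.
function applyInvalidation(cache: Map<string, string>, rawPayload: string): number {
  const msg = JSON.parse(rawPayload) as InvalidateMessage;
  let deleted = 0;
  for (const key of keysToDelete(msg)) {
    if (cache.delete(key)) deleted += 1;
  }
  return deleted;
}
```

Deleting a key that is already absent is harmless, so duplicate or late invalidate messages need no special handling; the 60 s TTL covers the case where a message never arrives.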
8. Schema Stability Guarantees
gRPC
| Element | Stability |
|---|---|
| ValidateLeaseRequest required fields | Stable v1 |
| Response reason_code enum (string) | Additive only — consumers must treat unknown codes as UNSPECIFIED_FAIL |
| NumberType / NumberState / NumberSubtype enums | Additive only |
| New fields with proto3 default values | Non-breaking |
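
The "unknown codes as UNSPECIFIED_FAIL" rule can be enforced at the consumer boundary with a small normalising function. This is a minimal sketch: apart from `UNSPECIFIED_FAIL`, the reason codes listed below are hypothetical placeholders, since the real set is defined in API_CONTRACTS §1.

```typescript
// Placeholder codes for illustration; only UNSPECIFIED_FAIL is named in this document.
const KNOWN_REASON_CODES = ['NOT_ASSIGNED', 'LEASE_EXPIRED', 'UNSPECIFIED_FAIL'] as const;

type KnownReasonCode = typeof KNOWN_REASON_CODES[number];

// Maps any code this client version does not recognise to UNSPECIFIED_FAIL,
// so newly added server-side codes never break an older consumer.
function normaliseReasonCode(raw: string): KnownReasonCode {
  const known = KNOWN_REASON_CODES as readonly string[];
  return known.indexOf(raw) >= 0 ? (raw as KnownReasonCode) : 'UNSPECIFIED_FAIL';
}
```

Downstream code can then switch exhaustively over `KnownReasonCode` without a catch-all branch for strings it has never seen.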
REST
- `/v1/admin/numbering/*`, `/v1/portal/numbering/*` maintained for the major version.
- Breaking changes require `/v2/...` with 90-day deprecation.
Events
- Detailed in EVENT_SCHEMAS §7.
9. Versioning Policy
- gRPC package: `ghasi.sms.numbering.v1`.
- Breaking → `v2` in parallel with ≥ 90-day deprecation.
- Client libraries: TypeScript + Go generated from `.proto`, published to internal registry; pinned in consumer package manifests.
End of SYNC_CONTRACT.md