
routing-engine — Application Logic

Status: populated | Last updated: 2026-04-18

1. Use Cases

UC-01: SelectOperator (gRPC Handler)

Trigger: sms-orchestrator calls RoutingService/SelectOperator

Input: { to: string, accountId: string, messageType: MessageType }

Output: OperatorConfig { operatorId, host, port, systemId, tpsLimit }

Steps:

  1. Validate input fields; reject with INVALID_ARGUMENT if to is not a valid E.164 number.
  2. Build cache key: route:decision:{prefix}:{accountId}:{messageType}, where prefix is the result of the longest-prefix match on to (see step 5a).
  3. Attempt Redis GET on the cache key.
  4. Cache HIT → deserialize and return OperatorConfig immediately.
  5. Cache MISS:
    a. Run longest-prefix match against destination_prefixes table for the to number.
    b. If no prefix matches → return gRPC status NOT_FOUND with message "No routing rule for destination".
    c. Load all active routing_rules for (prefixId, accountId OR global). Prefer account-scoped over global rules.
    d. Load operator health states from Redis (MGET operator:health:{operatorId} for all candidate operators).
    e. Filter candidates to BOUND or FAILBACK status. If none → return UNAVAILABLE.
    f. Apply routing strategy:
      • COST: Sort by cost ASC; pick first.
      • PRIORITY: Sort by priority ASC; pick first.
      • FAILOVER: Iterate in priority order; return first healthy operator.
    g. Fetch full OperatorConfig from operators table for selected operatorId.
    h. Serialise decision and SET Redis cache key with TTL 300 s.
  6. Return OperatorConfig.
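
Steps 5e and 5f can be sketched as two small pure functions. This is a minimal sketch under stated assumptions, not the service's actual code: the Candidate shape, the function names, and the choice to treat a missing health entry as unhealthy are all hypothetical.

```typescript
// Hypothetical types: only operatorId, cost, priority and the status names
// BOUND / FAILBACK / UNBOUND come from the document.
type Strategy = "COST" | "PRIORITY" | "FAILOVER";
type HealthStatus = "BOUND" | "FAILBACK" | "UNBOUND";

interface Candidate {
  operatorId: string;
  cost: number;
  priority: number; // lower value is preferred (sorted ASC)
}

// Step 5e: keep only operators whose health status is BOUND or FAILBACK.
// Assumption: an operator with no health entry (e.g. expired key) is dropped.
function filterHealthy(
  candidates: Candidate[],
  health: Map<string, HealthStatus>,
): Candidate[] {
  return candidates.filter((c) => {
    const status = health.get(c.operatorId);
    return status === "BOUND" || status === "FAILBACK";
  });
}

// Step 5f: pick one operator according to the routing strategy.
// Returns null when there is nothing to pick; the caller maps that to UNAVAILABLE.
function selectOperator(strategy: Strategy, healthy: Candidate[]): Candidate | null {
  if (healthy.length === 0) return null;
  // Copy before sorting so the caller's array is not mutated.
  const sorted = [...healthy].sort((a, b) =>
    strategy === "COST" ? a.cost - b.cost : a.priority - b.priority,
  );
  return sorted[0];
}
```

Because the candidate list is already filtered by health in step 5e, FAILOVER and PRIORITY pick the same operator in this sketch. They would differ only if FAILOVER performed its own health check while iterating.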

Error codes:

| gRPC status | Condition |
|---|---|
| INVALID_ARGUMENT | to not E.164, missing accountId |
| NOT_FOUND | No prefix match or no routing rule |
| UNAVAILABLE | All candidate operators unhealthy |
| INTERNAL | Unexpected DB/Redis error (logged, not surfaced) |
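
The INVALID_ARGUMENT conditions and the cache key format from step 2 could look roughly like this. It is a minimal sketch: the MessageType values, the RoutingError shape, and the regular expression (a common approximation of E.164) are assumptions, not taken from the service.

```typescript
// Hypothetical MessageType values; the document does not list them.
type MessageType = "TRANSACTIONAL" | "PROMOTIONAL";

interface SelectOperatorRequest {
  to: string;
  accountId: string;
  messageType: MessageType;
}

// Hypothetical error shape carrying the gRPC status name.
interface RoutingError {
  code: "INVALID_ARGUMENT";
  message: string;
}

// Common E.164 approximation: "+", a non-zero digit, then 1 to 14 more digits.
const E164 = /^\+[1-9]\d{1,14}$/;

// Returns null when the request is valid, otherwise the error to send back.
function validateRequest(req: SelectOperatorRequest): RoutingError | null {
  if (!E164.test(req.to)) {
    return { code: "INVALID_ARGUMENT", message: "to is not a valid E.164 number" };
  }
  if (req.accountId.trim() === "") {
    return { code: "INVALID_ARGUMENT", message: "accountId is required" };
  }
  return null;
}

// Builds the decision cache key in the documented format.
function buildCacheKey(prefix: string, accountId: string, messageType: MessageType): string {
  return `route:decision:${prefix}:${accountId}:${messageType}`;
}
```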

UC-02: ConsumeOperatorHealthEvent (NATS Consumer)

Trigger: Message on NATS subject operator.health

Input: OperatorHealthEvent { operatorId, status, timestamp }

Steps:

  1. Deserialize and validate the event payload.
  2. Compute Redis key: operator:health:{operatorId}.
  3. Write { status, updatedAt } as a JSON string with SET ... EX 60.
  4. If status transitions to UNBOUND: proactively invalidate all route:decision:*:*:* cache keys that reference this operatorId (scan pattern route:decision:* and delete matching entries). This is a best-effort operation — cache TTL will eventually self-correct.
  5. Acknowledge NATS message (manual ack with JetStream).
  6. Emit structured log: { event: "health.updated", operatorId, status }.
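
Steps 2 to 4 can be sketched against an in-memory stand-in for Redis. Everything here is illustrative: FakeKv and handleHealthEvent are hypothetical names, a real Redis client would be asynchronous and would use SCAN rather than listing keys, and using the event timestamp as updatedAt is an assumption. Note that the cache key does not contain the operatorId, so the sketch has to read each cached value to decide whether it references the operator.

```typescript
interface OperatorHealthEvent {
  operatorId: string;
  status: string;
  timestamp: string;
}

// In-memory stand-in for Redis (hypothetical). TTLs are recorded, not enforced.
class FakeKv {
  private data = new Map<string, { value: string; ttlSeconds: number }>();

  set(key: string, value: string, ttlSeconds: number): void {
    this.data.set(key, { value, ttlSeconds });
  }
  get(key: string): string | null {
    const entry = this.data.get(key);
    return entry ? entry.value : null;
  }
  del(key: string): void {
    this.data.delete(key);
  }
  keysWithPrefix(prefix: string): string[] {
    return Array.from(this.data.keys()).filter((k) => k.startsWith(prefix));
  }
}

// Writes the health entry and, on a transition to UNBOUND, removes cached
// decisions that point at the operator. Returns the number of keys removed.
function handleHealthEvent(kv: FakeKv, event: OperatorHealthEvent): number {
  const healthKey = `operator:health:${event.operatorId}`;
  const previous = kv.get(healthKey);
  const previousStatus = previous ? JSON.parse(previous).status : null;

  // Step 3: SET ... EX 60 equivalent.
  kv.set(healthKey, JSON.stringify({ status: event.status, updatedAt: event.timestamp }), 60);

  // Step 4: best-effort invalidation, only when the status actually changes to UNBOUND.
  let removed = 0;
  if (event.status === "UNBOUND" && previousStatus !== "UNBOUND") {
    for (const key of kv.keysWithPrefix("route:decision:")) {
      const cached = kv.get(key);
      if (cached && JSON.parse(cached).operatorId === event.operatorId) {
        kv.del(key);
        removed++;
      }
    }
  }
  return removed;
}
```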

UC-03: Health / Readiness Probe

Trigger: Kubernetes kubelet HTTP GET /health or /ready

Steps for /ready:

  1. Attempt a PING to Redis.
  2. Attempt a lightweight SELECT 1 on the PostgreSQL read replica.
  3. Return HTTP 200 if both pass; HTTP 503 otherwise with { reason }.

Steps for /health (liveness):

  1. Return HTTP 200 with { status: "ok", uptime } — no dependency checks (avoid false restarts).
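
The two probes can be sketched independently of any HTTP framework. This is a hedged sketch: the Check type, the function names, and the 200 response body for /ready are assumptions. The document only specifies the status codes, the { reason } field, and the liveness body.

```typescript
// A dependency check resolves on success and rejects on failure (hypothetical type).
type Check = () => Promise<void>;

interface ProbeResponse {
  status: 200 | 503;
  body: Record<string, unknown>;
}

// Pure decision step for /ready: 200 only if no dependency failed.
function readinessResponse(failed: string[]): ProbeResponse {
  return failed.length === 0
    ? { status: 200, body: { status: "ready" } } // body content is an assumption
    : { status: 503, body: { reason: `unavailable: ${failed.join(", ")}` } };
}

// Runs all checks concurrently (e.g. Redis PING, PostgreSQL SELECT 1) and
// collects the names of the ones that rejected.
async function checkReady(checks: Record<string, Check>): Promise<ProbeResponse> {
  const failed: string[] = [];
  await Promise.all(
    Object.keys(checks).map(async (name) => {
      try {
        await checks[name]();
      } catch {
        failed.push(name);
      }
    }),
  );
  return readinessResponse(failed.sort());
}

// /health (liveness): no dependency checks, so a Redis or database outage
// does not cause the kubelet to restart the pod.
function livenessResponse(uptimeSeconds: number): ProbeResponse {
  return { status: 200, body: { status: "ok", uptime: uptimeSeconds } };
}
```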

2. Longest-Prefix Match Algorithm

The service must resolve a full E.164 number (e.g. +447911123456) to its most specific matching prefix (e.g. +4479 before +447 before +44).

    // Returns the most specific (longest) prefix that the number starts with, or null if none match.
    function findLongestPrefixMatch(to: string, prefixes: DestinationPrefix[]): DestinationPrefix | null {
      return prefixes
        .filter(p => to.startsWith(p.prefix))
        .sort((a, b) => b.prefix.length - a.prefix.length)[0] ?? null;
    }

In production, prefixes are loaded once at startup and re-fetched on a 60 s background interval. They are not stored in Redis — the working set is small enough to fit in process memory.
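
One way to hold the in-memory prefix set and refresh it on an interval is sketched below. PrefixStore, PrefixLoader, and the id field are hypothetical names, and the loader stands in for whatever query reads destination_prefixes. The sketch sorts longest-first once per refresh so that each lookup can return the first match instead of sorting on every call.

```typescript
interface DestinationPrefix {
  id: number; // hypothetical column
  prefix: string; // e.g. "+4479"
}

// Hypothetical loader, e.g. a SELECT against destination_prefixes.
type PrefixLoader = () => Promise<DestinationPrefix[]>;

class PrefixStore {
  private sorted: DestinationPrefix[] = [];
  private timer: ReturnType<typeof setInterval> | null = null;

  constructor(
    private readonly load: PrefixLoader,
    private readonly intervalMs = 60_000, // 60 s, as stated in the document
  ) {}

  // Swap in a new snapshot, sorted longest prefix first.
  replace(prefixes: DestinationPrefix[]): void {
    this.sorted = prefixes.slice().sort((a, b) => b.prefix.length - a.prefix.length);
  }

  // If load() rejects, replace() is never called and the old snapshot stays in use.
  async refresh(): Promise<void> {
    this.replace(await this.load());
  }

  // Load once at startup, then refresh in the background.
  async start(): Promise<void> {
    await this.refresh();
    this.timer = setInterval(() => {
      this.refresh().catch((err) => console.error("prefix refresh failed", err));
    }, this.intervalMs);
  }

  stop(): void {
    if (this.timer !== null) clearInterval(this.timer);
    this.timer = null;
  }

  // First match in a longest-first list is the longest-prefix match.
  match(to: string): DestinationPrefix | null {
    return this.sorted.find((p) => to.startsWith(p.prefix)) ?? null;
  }
}
```

Keeping the previous snapshot when a refresh fails is a design choice of this sketch; it trades freshness for availability during a database outage.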


3. Cache Invalidation Policy

| Trigger | Action |
|---|---|
| operator.health event with status UNBOUND | Scan and delete route:decision:* entries referencing that operatorId |
| Routing rule updated (future webhook) | Delete all route:decision:* keys for affected prefix+account pair |
| Cache TTL expiry (300 s) | Natural expiry; next call re-resolves from DB |
| Health TTL expiry (60 s) | Natural expiry; next health check re-reads or NATS event refreshes |
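
For the routing-rule trigger, the affected keys can be selected from the key itself, because the prefix and accountId are part of the key format. The helpers below are a hypothetical sketch; the function names are not from the service, and the webhook that would call them does not exist yet according to the table.

```typescript
// Glob pattern (usable with Redis SCAN ... MATCH) covering every messageType
// for one prefix+account pair.
function decisionKeyPattern(prefix: string, accountId: string): string {
  return `route:decision:${prefix}:${accountId}:*`;
}

// Local equivalent of that match, for a list of keys already in hand.
// The trailing ":" keeps "+44" from also matching keys for "+4479".
function keysToInvalidate(keys: string[], prefix: string, accountId: string): string[] {
  const head = `route:decision:${prefix}:${accountId}:`;
  return keys.filter((k) => k.startsWith(head));
}
```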