routing-engine — Service Overview

Status: populated · Owner: Platform Engineering · Last updated: 2026-04-18 · Companion: DOMAIN_MODEL, API_CONTRACTS, EVENT_SCHEMAS

1. Purpose

routing-engine is a low-latency gRPC microservice that selects the optimal downstream SMPP operator for every outbound SMS. It sits in the hot path between sms-orchestrator and the pool of smpp-connector instances and must return a routing decision with a P95 latency of at most 50 ms.

The service applies one of three configurable routing strategies (COST, PRIORITY, or FAILOVER) to route a message to, respectively, the cheapest, the most preferred, or the first available healthy operator for a given destination prefix, account, and message type.
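
As a rough illustration of how the three strategies differ, the sketch below picks one operator from a candidate list. The `Candidate` shape, its field names (`costPerSms`, `priority`, `healthy`), the convention that a lower `priority` number means "more preferred", and the reading of FAILOVER as "first healthy candidate in list order" are all assumptions made for the example, not the service's actual domain model.

```typescript
type Strategy = "COST" | "PRIORITY" | "FAILOVER";

// Hypothetical candidate shape; field names are illustrative only.
interface Candidate {
  operatorId: string;
  costPerSms: number; // assumed: lower is cheaper
  priority: number;   // assumed: lower number = more preferred
  healthy: boolean;
}

// Returns the chosen operator, or undefined when no healthy candidate exists.
function selectOperator(strategy: Strategy, candidates: Candidate[]): Candidate | undefined {
  const healthy = candidates.filter((c) => c.healthy);
  if (healthy.length === 0) return undefined;

  switch (strategy) {
    case "COST":
      // Cheapest healthy operator.
      return healthy.reduce((best, c) => (c.costPerSms < best.costPerSms ? c : best));
    case "PRIORITY":
      // Most preferred healthy operator.
      return healthy.reduce((best, c) => (c.priority < best.priority ? c : best));
    case "FAILOVER":
      // First healthy operator in the configured order.
      return healthy[0];
  }
}
```

All three strategies run over the same healthy subset, so an unhealthy operator is never selected even if it is the cheapest or most preferred.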

2. Bounded Context

| Dimension | Value |
| --- | --- |
| Domain | Operator Selection / Routing |
| Owner squad | Platform Engineering |
| Deployment unit | Kubernetes Deployment routing-engine |
| Communication style | Inbound: gRPC (synchronous) · Inbound consumer: NATS JetStream (operator.health) |
| Storage | PostgreSQL schema ops_routing (read-mostly) · Redis cache (write-through decisions) |

3. Responsibilities

| # | Responsibility |
| --- | --- |
| R1 | Accept SelectOperator gRPC calls from sms-orchestrator |
| R2 | Match the destination phone number to a destination_prefix record |
| R3 | Filter the candidate operator set to healthy operators (Redis health cache) |
| R4 | Apply the active routing strategy (COST / PRIORITY / FAILOVER) |
| R5 | Return a fully resolved OperatorConfig (host, port, credentials, TPS limit) |
| R6 | Cache routing decisions in Redis with 300 s TTL |
| R7 | Consume operator.health NATS events and update the Redis health cache |
| R8 | Expose /health, /metrics, /ready HTTP endpoints for Kubernetes probes |
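
Responsibilities R2 and R3 can be sketched as two small pure functions. The document does not say how a phone number is matched to a destination_prefix record; the example assumes longest-prefix matching, which is a common convention in telecom routing but is not confirmed here. The `DestinationPrefix` shape, the helper names, and the choice to treat an operator with no health entry as unhealthy are likewise hypothetical.

```typescript
// Hypothetical record shape; the real destination_prefix schema is defined elsewhere.
interface DestinationPrefix {
  prefix: string;        // digits only, e.g. "4479"
  operatorIds: string[]; // operators able to serve this prefix
}

// Strip everything but digits so "+44 7911 123456" and "447911123456" compare equal.
function normalizeMsisdn(raw: string): string {
  return raw.replace(/\D/g, "");
}

// Assumed rule: the longest prefix that the number starts with wins.
function matchPrefix(msisdn: string, prefixes: DestinationPrefix[]): DestinationPrefix | undefined {
  const digits = normalizeMsisdn(msisdn);
  let best: DestinationPrefix | undefined;
  for (const p of prefixes) {
    if (digits.startsWith(p.prefix) && (best === undefined || p.prefix.length > best.prefix.length)) {
      best = p;
    }
  }
  return best;
}

// R3: keep only operators that the health cache currently marks as healthy.
// Operators with no cache entry are treated as unhealthy here (an assumption).
function filterHealthy(operatorIds: string[], health: Map<string, boolean>): string[] {
  return operatorIds.filter((id) => health.get(id) === true);
}
```

A linear scan is enough to show the idea; with a large prefix table, a trie or an indexed database query would be the more likely choice on a hot path.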

4. Non-Responsibilities

  • Does not transmit SMS (handled by smpp-connector)
  • Does not manage or persist operator credentials (owned by operator-management-service)
  • Does not publish any NATS events — pure gRPC responder + NATS consumer
  • Does not perform billing or rate-limit enforcement

5. Upstream / Downstream Dependencies

| Direction | Service | Protocol | Purpose |
| --- | --- | --- | --- |
| Inbound caller | sms-orchestrator | gRPC | Requests operator selection per message |
| Inbound event | operator-management-service | NATS JetStream operator.health | Operator health state changes |
| Outbound read | PostgreSQL ops_routing schema | TCP (pg driver) | Read routing rules, prefixes, operators |
| Outbound cache | Redis | TCP | Cache routing decisions and health state |

6. High-Level Flow
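
One way to read the request path implied by responsibilities R1 to R6 is as a cache-aside lookup: check the decision cache first, fall back to rule resolution on a miss, then store the result with the 300 s TTL. The sketch below is reconstructed from Section 3 only. The `Deps` interface, the request and decision shapes, and the cache-key layout are hypothetical stand-ins for the real Redis and PostgreSQL access code, and the actual key may well be built from the matched prefix rather than the full number.

```typescript
// Hypothetical request/decision shapes, reduced to what the flow needs.
interface SelectOperatorRequest {
  accountId: string;
  destination: string; // normalized digits
  messageType: string;
}

interface Decision {
  operatorId: string;
}

// Assumed seams: a decision cache (Redis) and a resolver covering R2 to R4.
interface Deps {
  getCached(key: string): Promise<Decision | undefined>;
  putCached(key: string, value: Decision, ttlSeconds: number): Promise<void>;
  resolve(req: SelectOperatorRequest): Promise<Decision | undefined>;
}

const DECISION_TTL_SECONDS = 300; // R6

// Assumed key layout, for illustration only.
function decisionKey(req: SelectOperatorRequest): string {
  return `route:${req.accountId}:${req.messageType}:${req.destination}`;
}

async function handleSelectOperator(
  req: SelectOperatorRequest,
  deps: Deps,
): Promise<Decision | undefined> {
  const key = decisionKey(req);

  // Cache hit: skip the database read and strategy evaluation entirely.
  const cached = await deps.getCached(key);
  if (cached !== undefined) return cached;

  // Cache miss: prefix match, health filter, strategy (R2 to R4).
  const decision = await deps.resolve(req);
  if (decision !== undefined) {
    await deps.putCached(key, decision, DECISION_TTL_SECONDS);
  }
  return decision;
}
```

Only successful decisions are cached in this sketch; whether "no operator available" results should also be cached is a design choice the document does not address.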

7. Key Design Decisions

| Decision | Rationale |
| --- | --- |
| gRPC-only API (no REST) | sms-orchestrator calls this on every message; binary framing and HTTP/2 multiplexing keep latency minimal |
| Redis decision cache (TTL 300 s) | Routing rules change infrequently; cache absorbs DB read amplification at scale |
| Health cache separate TTL (60 s) | Operator health is volatile; short TTL ensures stale health is not used for more than 60 s |
| Read-only PostgreSQL access | Routing rules are managed via operator-management-service; this service never writes to ops_routing |
| NATS consumer (not polling) | Push-based health updates via NATS eliminate polling overhead and reduce detection latency |
| NestJS + @grpc/grpc-js | Consistent with the platform stack; @nestjs/microservices gRPC transport is first-class |
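
The two TTLs in the table above can be illustrated with a small in-memory cache that mimics key expiry. This is not a Redis client: `TtlCache`, the `health:` key prefix, and the `OperatorHealthEvent` payload shape are hypothetical, and the injectable clock exists only so the expiry behaviour can be exercised deterministically.

```typescript
const DECISION_TTL_MS = 300_000; // 300 s, decision cache
const HEALTH_TTL_MS = 60_000;    // 60 s, health cache

// Minimal in-memory stand-in for a key-value store with per-key expiry.
class TtlCache<V> {
  private readonly entries = new Map<string, { value: V; expiresAt: number }>();
  private readonly now: () => number;

  constructor(now: () => number = Date.now) {
    this.now = now;
  }

  set(key: string, value: V, ttlMs: number): void {
    this.entries.set(key, { value, expiresAt: this.now() + ttlMs });
  }

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    if (this.now() >= entry.expiresAt) {
      this.entries.delete(key); // lazily evict expired entries
      return undefined;
    }
    return entry.value;
  }
}

// Hypothetical payload for an operator.health event.
interface OperatorHealthEvent {
  operatorId: string;
  healthy: boolean;
}

// Each pushed event refreshes the operator's health entry and restarts its 60 s TTL.
function onOperatorHealth(cache: TtlCache<boolean>, event: OperatorHealthEvent): void {
  cache.set(`health:${event.operatorId}`, event.healthy, HEALTH_TTL_MS);
}
```

With this shape, an operator whose health events stop arriving drops out of the cache after 60 s. How the service treats a missing health entry (healthy or unhealthy) is not stated in this document and would need to be decided explicitly.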