routing-engine — Service Overview
Status: populated
Owner: Platform Engineering
Last updated: 2026-04-18
Companion: DOMAIN_MODEL · API_CONTRACTS · EVENT_SCHEMAS
1. Purpose
routing-engine is a low-latency gRPC microservice responsible for selecting the optimal downstream SMPP operator for every outbound SMS. It sits in the hot path between sms-orchestrator and the pool of smpp-connector instances and must return a routing decision with a P95 latency of at most 50 ms.
The service applies one of three configurable routing strategies (COST, PRIORITY, or FAILOVER), which select the cheapest, the most preferred, or the first available healthy operator, respectively, for a given destination prefix, account, and message type.
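As an illustration of how the three strategies differ, here is a minimal sketch of the selection step. The `Candidate` shape, its field names, and the "lower number means more preferred" priority convention are assumptions made for this example, not the service's actual model.

```typescript
// Hypothetical shapes: field names are illustrative, not taken from the service's schema.
type Strategy = "COST" | "PRIORITY" | "FAILOVER";

interface Candidate {
  id: string;
  costPerMessage: number; // assumed: price per SMS
  priority: number;       // assumed: lower number = more preferred
  healthy: boolean;
}

// Returns the operator chosen by the given strategy, or undefined if none is healthy.
function selectOperator(strategy: Strategy, candidates: Candidate[]): Candidate | undefined {
  const healthy = candidates.filter((c) => c.healthy);
  if (healthy.length === 0) return undefined;
  switch (strategy) {
    case "COST":
      // Cheapest healthy operator.
      return healthy.reduce((best, c) => (c.costPerMessage < best.costPerMessage ? c : best));
    case "PRIORITY":
      // Most preferred healthy operator.
      return healthy.reduce((best, c) => (c.priority < best.priority ? c : best));
    case "FAILOVER":
      // First healthy operator in the configured order.
      return healthy[0];
  }
}

// Sample data: op-a is the preferred first choice but is currently unhealthy.
const candidates: Candidate[] = [
  { id: "op-a", costPerMessage: 0.012, priority: 2, healthy: false },
  { id: "op-b", costPerMessage: 0.009, priority: 3, healthy: true },
  { id: "op-c", costPerMessage: 0.015, priority: 1, healthy: true },
];
```

With this sample data, COST picks `op-b`, PRIORITY picks `op-c`, and FAILOVER picks `op-b`, because `op-a` is filtered out as unhealthy before any strategy runs.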
2. Bounded Context
| Dimension | Value |
|---|---|
| Domain | Operator Selection / Routing |
| Owner squad | Platform Engineering |
| Deployment unit | Kubernetes Deployment — routing-engine |
| Communication style | Inbound: gRPC (synchronous) · Inbound consumer: NATS JetStream (operator.health) |
| Storage | PostgreSQL schema ops_routing (read-mostly) · Redis cache (write-through decisions) |
3. Responsibilities
| # | Responsibility |
|---|---|
| R1 | Accept SelectOperator gRPC calls from sms-orchestrator |
| R2 | Match the destination phone number to a destination_prefix record |
| R3 | Filter the candidate operator set to healthy operators (Redis health cache) |
| R4 | Apply the active routing strategy (COST / PRIORITY / FAILOVER) |
| R5 | Return a fully resolved OperatorConfig (host, port, credentials, TPS limit) |
| R6 | Cache routing decisions in Redis with 300 s TTL |
| R7 | Consume operator.health NATS events and update the Redis health cache |
| R8 | Expose /health, /metrics, /ready HTTP endpoints for Kubernetes probes and metrics scraping |
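Responsibilities R2 and R3 can be sketched as two small pure functions. This example assumes longest-prefix matching when several destination_prefix records overlap, which this document does not state explicitly. The `PrefixRecord` shape is hypothetical, and an in-memory `Map` stands in for the Redis health cache.

```typescript
// Hypothetical record shape; the real destination_prefix columns are not shown in this document.
interface PrefixRecord {
  prefix: string;        // digits only, e.g. "4479"
  operatorIds: string[]; // operators able to serve this prefix
}

// Strip "+" and any other non-digit formatting from a phone number.
function normalizeMsisdn(raw: string): string {
  return raw.replace(/\D/g, "");
}

// R2: longest-prefix match (an assumption about how overlapping prefixes are resolved).
function matchPrefix(msisdn: string, records: PrefixRecord[]): PrefixRecord | undefined {
  const digits = normalizeMsisdn(msisdn);
  let best: PrefixRecord | undefined;
  for (const r of records) {
    if (digits.startsWith(r.prefix) && (best === undefined || r.prefix.length > best.prefix.length)) {
      best = r;
    }
  }
  return best;
}

// R3: keep only operators marked healthy. The Map stands in for the Redis health cache.
function healthyCandidates(record: PrefixRecord, health: Map<string, boolean>): string[] {
  return record.operatorIds.filter((id) => health.get(id) === true);
}

// Sample data for illustration only.
const records: PrefixRecord[] = [
  { prefix: "44", operatorIds: ["op-a", "op-b"] },
  { prefix: "4479", operatorIds: ["op-b", "op-c"] },
];
const health = new Map<string, boolean>([
  ["op-a", true],
  ["op-b", false],
  ["op-c", true],
]);
```

Operators with no entry in the health map are treated as not healthy here. Whether the service does the same for unknown operators is a design choice this document does not specify.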
4. Non-Responsibilities
- Does not transmit SMS (handled by smpp-connector)
- Does not manage or persist operator credentials (owned by operator-management-service)
- Does not publish any NATS events — pure gRPC responder + NATS consumer
- Does not perform billing or rate-limit enforcement
5. Upstream / Downstream Dependencies
| Direction | Service | Protocol | Purpose |
|---|---|---|---|
| Inbound caller | sms-orchestrator | gRPC | Requests operator selection per message |
| Inbound event | operator-management-service | NATS JetStream operator.health | Operator health state changes |
| Outbound read | PostgreSQL ops_routing schema | TCP (pg driver) | Read routing rules, prefixes, operators |
| Outbound cache | Redis | TCP | Cache routing decisions and health state |
6. High-Level Flow
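A minimal sketch of the request path implied by responsibilities R1 to R6: look up a cached decision, resolve on a miss, then store the result with the 300 s TTL. Everything named here is hypothetical (`RoutingFlow`, `RouteRequest`, the cache key layout). A `Map` stands in for Redis, and the injected `resolve` callback stands in for prefix matching, health filtering, and strategy selection. The real service would be asynchronous; the sketch is synchronous for brevity.

```typescript
// All names here are hypothetical; a Map stands in for the Redis decision cache.
interface RouteRequest {
  prefix: string;
  accountId: string;
  messageType: string;
}
interface Decision {
  operatorId: string;
}
interface CacheEntry {
  value: Decision;
  expiresAtMs: number;
}

const DECISION_TTL_MS = 300_000; // 300 s, per R6

class RoutingFlow {
  private cache = new Map<string, CacheEntry>();
  private resolve: (req: RouteRequest) => Decision | undefined;
  private now: () => number;
  resolverCalls = 0; // exposed only so the example can show cache hits

  constructor(resolve: (req: RouteRequest) => Decision | undefined, now: () => number) {
    this.resolve = resolve;
    this.now = now;
  }

  select(req: RouteRequest): Decision | undefined {
    // Assumed key layout: one cached decision per account, message type, and prefix.
    const key = `${req.accountId}:${req.messageType}:${req.prefix}`;
    const hit = this.cache.get(key);
    if (hit && hit.expiresAtMs > this.now()) return hit.value; // cache hit

    // Cache miss: prefix match, health filter, and strategy run inside resolve().
    this.resolverCalls++;
    const decision = this.resolve(req);
    if (decision) {
      this.cache.set(key, { value: decision, expiresAtMs: this.now() + DECISION_TTL_MS });
    }
    return decision;
  }
}
```

Injecting the clock keeps the TTL behaviour testable without waiting: two calls inside the 300 s window invoke the resolver once, and a call at or after expiry invokes it again.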
7. Key Design Decisions
| Decision | Rationale |
|---|---|
| gRPC-only API (no REST) | sms-orchestrator calls this on every message — binary framing and HTTP/2 multiplexing keep latency minimal |
| Redis decision cache (TTL 300 s) | Routing rules change infrequently; cache absorbs DB read amplification at scale |
| Separate health cache TTL (60 s) | Operator health is volatile; the short TTL ensures a stale health state is used for at most 60 s |
| Read-only PostgreSQL access | Routing rules are managed via operator-management-service; this service never writes to ops_routing |
| NATS consumer (not polling) | Push-based health updates via NATS eliminate polling overhead and reduce detection latency |
| NestJS + @grpc/grpc-js | Consistent with the platform stack; @nestjs/microservices gRPC transport is first-class |
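The push-based health update and its 60 s TTL can be illustrated with a small in-memory sketch. The `OperatorHealthEvent` payload is hypothetical (the real operator.health schema is defined in EVENT_SCHEMAS), a `Map` stands in for Redis, and reporting expired entries as "unknown" is an assumption made for the example.

```typescript
// Hypothetical event payload; the real operator.health schema lives in EVENT_SCHEMAS.
interface OperatorHealthEvent {
  operatorId: string;
  healthy: boolean;
}

type HealthState = "healthy" | "unhealthy" | "unknown";

const HEALTH_TTL_MS = 60_000; // 60 s health cache TTL

class HealthCache {
  private entries = new Map<string, { healthy: boolean; expiresAtMs: number }>();
  private now: () => number;

  constructor(now: () => number = () => Date.now()) {
    this.now = now;
  }

  // Called once per operator.health message (R7); each event refreshes the TTL.
  apply(event: OperatorHealthEvent): void {
    this.entries.set(event.operatorId, {
      healthy: event.healthy,
      expiresAtMs: this.now() + HEALTH_TTL_MS,
    });
  }

  // Missing or expired entries are reported as "unknown" (an assumption of this sketch).
  state(operatorId: string): HealthState {
    const entry = this.entries.get(operatorId);
    if (!entry || entry.expiresAtMs <= this.now()) return "unknown";
    return entry.healthy ? "healthy" : "unhealthy";
  }
}
```

In a NestJS service, `apply` would typically be called from the NATS message handler. The caller then decides whether "unknown" counts as routable, which is a policy question outside this sketch.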