SMS Orchestrator — Service Overview
Status: populated · Owner: Platform Engineering · Last updated: 2026-04-18
Companion: DOMAIN_MODEL · API_CONTRACTS · EVENT_SCHEMAS · ADR-0001 Kong edge · 01 Architecture
1. Purpose
sms-orchestrator is the central processing engine for outbound SMS in Ghasi-SMS-Gateway. It:
- Accepts HTTP submission of outbound SMS from external clients (via Kong). Per ADR-0001, this responsibility moved from the retired custom api-gateway to this service.
- Validates, normalizes, and idempotency-checks each request at the HTTP boundary.
- Publishes sms.outbound.request to NATS JetStream for asynchronous pipeline processing.
- Consumes the same subject (it is both producer and consumer) and executes the five-stage pipeline: idempotency → validation → routing → operator publish → state persistence.
- Emits status domain events (sms.events.status) on every state transition and DLQ events (sms.outbound.deadletter) on terminal failures.
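One way to picture the five-stage pipeline is as an ordered list of stages that stops at the first failure. The block below is a minimal sketch under that assumption, not the service's actual code: the `Stage` and `PipelineContext` types, the `runPipeline` helper, the stub stage bodies, and the `operator-a` value are all hypothetical and exist only for illustration. The stages are synchronous here for brevity; real stages would perform I/O.

```typescript
// Hypothetical sketch: stage names follow the pipeline order listed above;
// every type, helper, and stub body here is an illustrative assumption.
type StageName =
  | "idempotency"
  | "validation"
  | "routing"
  | "operatorPublish"
  | "statePersistence";

interface PipelineContext {
  messageId: string;
  to: string;
  body: string;
  operator?: string; // filled in by the routing stage
  completed: StageName[];
}

type StageResult = { ok: true } | { ok: false; reason: string };

interface Stage {
  name: StageName;
  run(ctx: PipelineContext): StageResult;
}

type PipelineOutcome =
  | { status: "completed" }
  | { status: "failed"; failedAt: StageName; reason: string };

// Runs the stages in order and short-circuits on the first failure.
function runPipeline(stages: Stage[], ctx: PipelineContext): PipelineOutcome {
  for (const stage of stages) {
    const result = stage.run(ctx);
    if (!result.ok) {
      return { status: "failed", failedAt: stage.name, reason: result.reason };
    }
    ctx.completed.push(stage.name);
  }
  return { status: "completed" };
}

// Stub stages standing in for the real Redis, gRPC, NATS, and PG calls.
const stages: Stage[] = [
  { name: "idempotency", run: () => ({ ok: true }) },
  {
    name: "validation",
    run: (ctx) =>
      ctx.body.length > 0 ? { ok: true } : { ok: false, reason: "empty body" },
  },
  {
    name: "routing",
    run: (ctx) => {
      ctx.operator = "operator-a"; // placeholder for the routing-engine answer
      return { ok: true };
    },
  },
  {
    name: "operatorPublish",
    run: (ctx) =>
      ctx.operator !== undefined
        ? { ok: true }
        : { ok: false, reason: "no operator selected" },
  },
  { name: "statePersistence", run: () => ({ ok: true }) },
];
```

In this shape, a failed outcome carries the stage name, which is the kind of information a retry or dead-letter decision would need.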
2. Bounded Context
Outbound Messaging Pipeline — authoritative lifecycle owner of every SMS message from HTTP accept through carrier-bound publish. Classified as Core (business-critical; financial correctness + SLA depend on it).
3. Responsibilities
| Area | What orchestrator owns |
|---|---|
| HTTP submit API | POST /v1/sms/send, POST /v1/sms/bulk, GET /v1/sms/{messageId} — fronted by Kong |
| Idempotency | Idempotency-Key header replay window (Redis, 48h TTL) |
| Input validation | Zod schema for SMS payload; E.164 phone normalization; body length + segment count |
| Submission accept | Persists sms_messages row with QUEUED, publishes sms.outbound.request to NATS |
| Pipeline orchestration | Idempotency check → domain validation → routing (gRPC to routing-engine) → operator publish → PG state update |
| Retry logic | Exponential backoff (1s → 2s → 4s), max 3 attempts; attempt_count persisted in PG |
| DLQ routing | Publish sms.outbound.deadletter on terminal failure; ACK the original NATS message |
| Status events | Publish sms.events.status on every state transition |
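The retry row above can be illustrated with a small schedule function. This is a sketch under two assumptions that the table does not state explicitly: that the delay before retry n is 1 s × 2^(n−1), and that "max 3 attempts" means three retries after the initial try, after which the message is dead-lettered. The names `nextRetry`, `RetryDecision`, `BASE_DELAY_MS`, and `MAX_RETRIES` are hypothetical.

```typescript
// Hypothetical sketch of the 1s -> 2s -> 4s schedule; names are assumptions.
const BASE_DELAY_MS = 1000;
const MAX_RETRIES = 3;

type RetryDecision =
  | { action: "retry"; delayMs: number }
  | { action: "deadletter" };

// retriesSoFar is the persisted count of retries already performed.
function nextRetry(retriesSoFar: number): RetryDecision {
  if (!Number.isInteger(retriesSoFar) || retriesSoFar < 0) {
    throw new RangeError("retriesSoFar must be a non-negative integer");
  }
  if (retriesSoFar >= MAX_RETRIES) {
    // Retries exhausted: treat as a terminal failure.
    return { action: "deadletter" };
  }
  // Doubles on each retry: 1000, 2000, 4000 ms.
  return { action: "retry", delayMs: BASE_DELAY_MS * 2 ** retriesSoFar };
}
```

Because the function depends only on the stored count, the schedule can resume correctly after a process restart as long as that count is persisted.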
4. Non-Responsibilities
| Area | Owner | Why not orchestrator |
|---|---|---|
| AuthN (JWT / API key) | Kong (+ auth-service JWKS/consumer lookup) | Edge gateway layer |
| Rate limiting | Kong (rate-limiting-advanced plugin, Redis) | Edge concern |
| TLS termination | Cloudflare + Kong | Edge concern |
| Route selection (operator) | routing-engine (gRPC) | Separate bounded context (LCR + QoS) |
| SMPP submission | smpp-connector | Separate bounded context (protocol) |
| DLR correlation | dlr-processor | Separate bounded context (ingest) |
| Rating + billing | billing-service | Consumes domain events |
| Customer webhook delivery | webhook-dispatcher | Consumes domain events |
5. Dependencies
| Dependency | Kind | Purpose |
|---|---|---|
| Kong Gateway | Upstream (HTTP) | Proxies /v1/sms/* routes to this service |
| NATS JetStream | Event bus | Publish + consume sms.outbound.request; publish smpp.operator.*, sms.events.status, sms.outbound.retry, sms.outbound.deadletter |
| PostgreSQL (schema orch) | Data store | sms_messages, idempotency_keys |
| Redis | Cache | Idempotency-Key storage (orch:idem:*) |
| routing-engine | gRPC | Operator selection (P95 ≤ 50 ms) |
| auth-service | HTTP | Account metadata lookup when needed |
6. High-Level Flow
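As a rough illustration of the submit path described in sections 1 and 3 (validate and normalize, idempotency-check, persist as QUEUED, publish sms.outbound.request, respond 202), the sketch below uses in-memory stand-ins for Redis, PostgreSQL, and NATS. It is not the service's implementation. The class and field names, the `msg-N` id format, the 400 status for invalid input, and the choice to answer a replay with the original messageId are all assumptions. The normalization shown only strips separators before an E.164 shape check, which is simpler than real phone normalization.

```typescript
// Hypothetical in-memory sketch of the HTTP accept path; the real service
// uses Redis, PostgreSQL, and NATS JetStream for the three stores below.
interface SubmitRequest {
  idempotencyKey: string;
  to: string;
  body: string;
}

interface SubmitResponse {
  status: 202 | 400;
  messageId?: string;
  error?: string;
}

// E.164 shape: "+", a non-zero digit, then up to 14 more digits.
const E164 = /^\+[1-9]\d{1,14}$/;

class InMemoryOrchestrator {
  private idempotency = new Map<string, string>(); // key -> messageId
  readonly messages = new Map<string, { to: string; body: string; state: "QUEUED" }>();
  readonly published: { subject: string; messageId: string }[] = [];
  private seq = 0;

  submit(req: SubmitRequest): SubmitResponse {
    // 1. Validate and normalize (simplified: strip spaces, dashes, parentheses).
    const to = req.to.replace(/[\s\-()]/g, "");
    if (!E164.test(to) || req.body.length === 0) {
      return { status: 400, error: "invalid payload" };
    }

    // 2. Idempotency check: a replayed key returns the original message id.
    const existing = this.idempotency.get(req.idempotencyKey);
    if (existing !== undefined) {
      return { status: 202, messageId: existing };
    }

    // 3. Persist the message row in state QUEUED.
    this.seq += 1;
    const messageId = `msg-${this.seq}`;
    this.idempotency.set(req.idempotencyKey, messageId);
    this.messages.set(messageId, { to, body: req.body, state: "QUEUED" });

    // 4. Publish for asynchronous pipeline processing.
    this.published.push({ subject: "sms.outbound.request", messageId });

    // 5. Accept; the pipeline runs after the response is sent.
    return { status: 202, messageId };
  }
}
```

Submitting the same Idempotency-Key twice in this sketch produces one stored row and one published event, which is the property the replay window is meant to guarantee.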
7. Key Design Decisions
| Decision | Rationale | Trade-off |
|---|---|---|
| HTTP submit lives here, not in an edge gateway | Kong cannot own idempotency storage + Zod validation cleanly; these are application concerns | One more responsibility for this service — acceptable; it already owns the message lifecycle |
| Async pipeline after HTTP accept | Sub-second 202 response; pipeline absorbs operator latency | Extra NATS hop; mitigated by idempotency replay |
| Redis SET NX for idempotency, not PG | Atomic + fast (~1 ms) | Redis outage → fail open on idempotency; mitigated by NATS AckWait |
| Application-level retry (not NATS MaxDeliver) | NATS lacks per-attempt backoff | Must persist attempt_count to survive restarts |
| Status update AFTER operator publish ACK | Operator publish is the point of no return | Small crash window → reconciliation job detects ROUTED stuck rows |
| gRPC (sync) to routing-engine | Tight latency budget (P95 ≤ 500 ms E2E) | routing-engine outage handled by orchestrator retry |
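The SET NX decision and its fail-open trade-off can be sketched as follows. The block models set-if-absent-with-TTL semantics (what `SET key value NX EX 172800` provides in Redis) with an in-memory store so that it runs without a Redis client; a real client call would be asynchronous. The `IdempotencyStore` interface, `InMemoryIdempotencyStore`, `claimIdempotencyKey`, and the `Claim` values are hypothetical names. Only the `orch:idem:` prefix and the 48h TTL come from this document.

```typescript
// Hypothetical sketch; only the key prefix and the 48h TTL come from the doc.
const IDEM_PREFIX = "orch:idem:";
const IDEM_TTL_SECONDS = 48 * 60 * 60;

interface IdempotencyStore {
  // Mirrors SET key value NX EX ttl: true if the key was set,
  // false if a live key already existed. May throw if the store is down.
  setIfAbsent(key: string, value: string, ttlSeconds: number): boolean;
}

class InMemoryIdempotencyStore implements IdempotencyStore {
  private entries = new Map<string, { value: string; expiresAtMs: number }>();
  private now: () => number;

  constructor(now: () => number = () => Date.now()) {
    this.now = now; // injectable clock so expiry can be exercised in tests
  }

  setIfAbsent(key: string, value: string, ttlSeconds: number): boolean {
    const nowMs = this.now();
    const current = this.entries.get(key);
    if (current !== undefined && current.expiresAtMs > nowMs) {
      return false; // key still live: this is a replay
    }
    this.entries.set(key, { value, expiresAtMs: nowMs + ttlSeconds * 1000 });
    return true;
  }
}

type Claim = "first-seen" | "duplicate" | "store-unavailable";

function claimIdempotencyKey(
  store: IdempotencyStore,
  idempotencyKey: string,
  messageId: string,
): Claim {
  try {
    const wasSet = store.setIfAbsent(IDEM_PREFIX + idempotencyKey, messageId, IDEM_TTL_SECONDS);
    return wasSet ? "first-seen" : "duplicate";
  } catch {
    // Fail open: the caller proceeds without the duplicate check.
    return "store-unavailable";
  }
}
```

Returning a distinct `store-unavailable` value, rather than folding it into `first-seen`, lets the caller proceed while still logging or counting requests that bypassed the duplicate check.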
8. Status
Pipeline + pre-Kong HTTP submit: design approved (moved here from retired api-gateway per ADR-0001). See MIGRATION_PLAN for cutover, SERVICE_READINESS for gate checklist.