SMS Orchestrator — Jira-Ready Epics & User Stories

Status: populated
Owner: Platform Engineering
Last updated: 2026-04-18
Service prefix: ORCH
Scope: New epics/stories covering HTTP submit migration (from retired api-gateway per ADR-0001), pipeline orchestration, idempotency, retry/DLQ, and observability.

Epic Summary

Epic ID	Title	Stories	Points
EP-ORCH-01	HTTP Submit API (Kong-Fronted)	US-ORCH-001 – US-ORCH-006	34
EP-ORCH-02	Outbound Pipeline Orchestration	US-ORCH-010 – US-ORCH-016	40
EP-ORCH-03	Idempotency & Deduplication	US-ORCH-020 – US-ORCH-023	18
EP-ORCH-04	Retry & Dead-Letter Handling	US-ORCH-030 – US-ORCH-034	22
EP-ORCH-05	Observability & Readiness	US-ORCH-040 – US-ORCH-044	16
EP-ORCH-06	Priority-Lane Routing on Inbound Submit	US-ORCH-050 – US-ORCH-054	21
EP-ORCH-07	Trusted-Tenant Fast-Path Submit (signed-template short-circuit)	US-ORCH-060 – US-ORCH-064	21

EP-ORCH-01 · HTTP Submit API (Kong-Fronted)

Context: Per ADR-0001 the retired custom api-gateway is replaced by Kong. HTTP submit responsibility moves to sms-orchestrator. This epic covers implementing the HTTP-facing submit endpoints that Kong proxies.

US-ORCH-001 · Implement POST /v1/sms/send endpoint

Type: Feature | Points: 5

Description:
As a Kong upstream target, I need a POST /v1/sms/send endpoint in sms-orchestrator that accepts a single outbound SMS request so that clients can submit messages through Kong.

Acceptance Criteria:

POST /v1/sms/send accepts { to, from, body, messageId?, metadata? } JSON payload
Reads X-Tenant-Id and X-Request-Id headers injected by Kong
Returns 202 Accepted with { messageId, status: "QUEUED", acceptedAt } on success
Returns 400 Bad Request with structured error body on Zod validation failure
Returns 409 Conflict on idempotency key collision (duplicate Idempotency-Key header within 48h)
messageId auto-generated as UUID v4 if not provided by client
Integration test: valid payload → 202 with messageId

US-ORCH-002 · Implement POST /v1/sms/bulk endpoint

Type: Feature | Points: 8

Description:
As a Kong upstream target, I need a POST /v1/sms/bulk endpoint accepting up to 1,000 SMS submissions per request so that clients can submit bulk campaigns efficiently.

Acceptance Criteria:

Accepts { messages: Array<{ to, from, body, messageId? }> } with max 1,000 items
Returns 202 Accepted with { batchId, accepted: N, rejected: N, results: [...] }
Each message in results includes messageId and status (QUEUED or INVALID)
Invalid messages within a batch are rejected individually; valid ones proceed
Returns 400 if all messages fail validation
Returns 413 if array exceeds 1,000 items
E2E test: 500 messages, mix of valid/invalid → correct accepted/rejected counts
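
The per-item accept/reject behaviour above can be sketched as a pure function. This is a minimal illustration, not the service's implementation: `handleBulk`, `isValid`, and `newId` are hypothetical names, the validator and ID generator are injected stand-ins for the Zod schema and a UUID v4 generator, and the body returned with 413 is an assumption because the criteria only fix the status code.

```typescript
// Hypothetical types; field names follow the acceptance criteria.
interface BulkItem { to: string; from: string; body: string; messageId?: string }
interface BulkResult { messageId: string; status: "QUEUED" | "INVALID" }
interface BulkResponse {
  httpStatus: 202 | 400 | 413;
  batchId?: string;
  accepted: number;
  rejected: number;
  results: BulkResult[];
}

const MAX_BULK_ITEMS = 1000;

function handleBulk(
  messages: BulkItem[],
  isValid: (m: BulkItem) => boolean, // stand-in for the Zod schema check
  newId: () => string,               // stand-in for a UUID v4 generator
): BulkResponse {
  // Oversized batches are refused before any item is inspected.
  if (messages.length > MAX_BULK_ITEMS) {
    return { httpStatus: 413, accepted: 0, rejected: 0, results: [] };
  }
  const results: BulkResult[] = [];
  let accepted = 0;
  for (const m of messages) {
    const ok = isValid(m);
    if (ok) accepted++;
    results.push({ messageId: m.messageId ?? newId(), status: ok ? "QUEUED" : "INVALID" });
  }
  const rejected = messages.length - accepted;
  // No valid item at all (including an empty array) is treated as a 400.
  if (accepted === 0) {
    return { httpStatus: 400, accepted, rejected, results };
  }
  return { httpStatus: 202, batchId: newId(), accepted, rejected, results };
}
```

Keeping the decision logic free of I/O makes the 500-message mixed-batch case straightforward to cover in a unit test as well as in the E2E test.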

US-ORCH-003 · Implement GET /v1/sms/{messageId} status endpoint

Type: Feature | Points: 3

Description:
As a client, I need to poll GET /v1/sms/{messageId} to check the current status of a submitted message.

Acceptance Criteria:

Returns { messageId, status, tenantId, to, from, createdAt, updatedAt } for known message
Returns 404 for unknown messageId
Returns 403 if X-Tenant-Id header does not match message's tenantId
Response time P95 ≤ 50 ms (PG indexed query on messageId)

US-ORCH-004 · Zod schema validation middleware

Type: Feature | Points: 5

Description:
As the submit pipeline, I need all incoming payloads validated against Zod schemas before processing so that malformed requests are rejected at the HTTP boundary.

Acceptance Criteria:

E.164 regex validation on to field; returns field-level error path on failure
from (sender ID): 1–11 chars alphanumeric or 1–15 digit numeric
body: 1–1600 characters; segment count computed and returned in 202 response
messageId: UUID v4 format when provided
Validation errors return { errors: [{ field, message, code }] } array
Unit tests for all validation rules including boundary values
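
As a dependency-free sketch of the rules above (in the service these would be expressed as Zod schemas), the following function returns the `{ field, message, code }` error entries. The function name `validateSend`, the error `code` values, and the message strings are illustrative assumptions, not part of the specification.

```typescript
interface FieldError { field: string; message: string; code: string }

const E164 = /^\+[1-9]\d{1,14}$/;             // "+" then up to 15 digits, no leading zero
const ALPHA_SENDER = /^[A-Za-z0-9]{1,11}$/;    // 1-11 alphanumeric characters
const NUMERIC_SENDER = /^\d{1,15}$/;           // 1-15 digits
const UUID_V4 = /^[0-9a-f]{8}-[0-9a-f]{4}-4[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$/i;

function validateSend(p: { to?: unknown; from?: unknown; body?: unknown; messageId?: unknown }): FieldError[] {
  const errors: FieldError[] = [];
  // Error codes below are hypothetical placeholders.
  const fail = (field: string, code: string, message: string) => errors.push({ field, message, code });

  if (typeof p.to !== "string" || !E164.test(p.to)) {
    fail("to", "INVALID_E164", "must be an E.164 phone number");
  }
  if (typeof p.from !== "string" || !(ALPHA_SENDER.test(p.from) || NUMERIC_SENDER.test(p.from))) {
    fail("from", "INVALID_SENDER", "must be 1-11 alphanumeric characters or 1-15 digits");
  }
  if (typeof p.body !== "string" || p.body.length < 1 || p.body.length > 1600) {
    fail("body", "INVALID_BODY_LENGTH", "must be 1-1600 characters");
  }
  // messageId is optional; it is only checked when the client supplies it.
  if (p.messageId !== undefined && (typeof p.messageId !== "string" || !UUID_V4.test(p.messageId))) {
    fail("messageId", "INVALID_UUID", "must be a UUID v4");
  }
  return errors;
}
```

An empty array means the payload passed; otherwise the array can be returned as the `errors` member of the 400 response.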

US-ORCH-005 · Idempotency-Key header processing (HTTP layer)

Type: Feature | Points: 8

Description:
As the HTTP submit layer, I need to process Idempotency-Key headers so that duplicate requests within 48 hours return the original response without reprocessing.

Acceptance Criteria:

On first request: compute sha256(tenantId + ":" + Idempotency-Key), store in Redis orch:submit-idem:{hash} with 48h TTL, value = serialized 202 response
On replay: return stored 202 response with Idempotency-Replayed: true header, skip pipeline
On Redis unavailable: process request normally (fail open) + emit warn log
Key collision scenario tested: two concurrent requests with same key → only one processed
SET NX EX used atomically
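
The first-request, replay, and fail-open paths above can be sketched against a small store contract. This is an illustration under stated assumptions: `IdemStore`, `submitKey`, and `submitOnce` are hypothetical names, `setNx` stands in for Redis `SET key value NX EX ttl`, the SHA-256 function is injected rather than imported, and the calls are synchronous for brevity where a real Redis client would be asynchronous.

```typescript
// Minimal store contract; setNx returns true only if it created the key.
interface IdemStore {
  setNx(key: string, value: string, ttlSeconds: number): boolean;
  get(key: string): string | null;
}

interface SubmitOutcome { replayed: boolean; body: string }

const IDEM_TTL_SECONDS = 48 * 60 * 60; // 172800 s = 48 h

// Key layout from the criteria: orch:submit-idem:{sha256(tenantId + ":" + Idempotency-Key)}.
function submitKey(tenantId: string, idempotencyKey: string, sha256Hex: (s: string) => string): string {
  return "orch:submit-idem:" + sha256Hex(tenantId + ":" + idempotencyKey);
}

function submitOnce(
  store: IdemStore,
  key: string,
  response202: string,          // serialized 202 body for this request
  enqueue: () => void,          // hands the message to the pipeline
  warn: (msg: string) => void,
): SubmitOutcome {
  let created = false;
  try {
    // Atomic claim: of two concurrent requests, exactly one creates the key.
    created = store.setNx(key, response202, IDEM_TTL_SECONDS);
  } catch (err) {
    // Fail open: the store is unavailable, so process normally and log.
    warn("idempotency store unavailable: " + String(err));
    enqueue();
    return { replayed: false, body: response202 };
  }
  if (created) {
    enqueue();
    return { replayed: false, body: response202 };
  }
  // Replay: return what the first request stored and skip the pipeline.
  // If the key expired between the two calls, fall back to this request's body.
  const stored = store.get(key);
  return { replayed: true, body: stored !== null ? stored : response202 };
}
```

A caller would set the `Idempotency-Replayed: true` header whenever `replayed` is true.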

US-ORCH-006 · Kong route configuration for /v1/sms/* routes

Type: Configuration | Points: 5

Description:
As a platform operator, I need Kong routes for /v1/sms/send, /v1/sms/bulk, and /v1/sms/{messageId} pointing to sms-orchestrator so that client traffic reaches the correct upstream.

Acceptance Criteria:

Kong Service resource: sms-orchestrator, upstream http://sms-orchestrator:3001
Kong Route resources for all three paths with correct methods (POST, POST, GET)
jwt plugin applied (validates Bearer token from auth-service JWKS)
correlation-id plugin injects X-Request-Id
request-transformer plugin injects X-Tenant-Id from JWT sub claim
Declarative Kong configuration stored under services/api-gateway/kong/
Integration test through Kong: 401 without token, 202 with valid token + payload

EP-ORCH-02 · Outbound Pipeline Orchestration

Context: Core NATS consumer pipeline: idempotency → validation → routing → operator publish → state persistence.

US-ORCH-010 · NATS consumer setup (sms.outbound.request)

Type: Feature | Points: 5

Description:
As the pipeline, I need a durable NATS JetStream consumer on sms.outbound.request so that submitted messages are processed reliably with at-least-once delivery.

Acceptance Criteria:

Durable consumer name: orch-consumer
AckExplicit mode — NATS message acknowledged only after pipeline completion
AckWait 30s; MaxDeliver 3 (application handles retries, not NATS)
Configurable MAX_CONCURRENCY (default 10 in-flight messages)
Reconnect on NATS disconnect without losing in-flight messages
Metrics: nats_consumer_lag, nats_messages_in_flight exposed on /metrics

US-ORCH-011 · Pipeline idempotency check (NATS layer)

Type: Feature | Points: 3

Description:
As the NATS consumer pipeline, I need to check Redis for a processed messageId before executing pipeline stages so that NATS redeliveries don't double-process messages.

Acceptance Criteria:

Key pattern: orch:idem:{messageId} checked with Redis GET
On key present: ACK NATS message, emit warn log with duplicate: true, return
On key absent: SET NX with 48h TTL before processing
On Redis unavailable: proceed with processing, emit warn

US-ORCH-012 · Domain validation (pipeline stage)

Type: Feature | Points: 3

Description:
As the pipeline, I need domain-level validation of the NATS message payload so that structurally invalid messages are terminated early without retrying.

Acceptance Criteria:

E.164 to validation, non-empty from, body length ≤ 1600 chars, valid UUID messageId, non-empty tenantId
On failure: update PG status to FAILED, publish sms.outbound.deadletter, ACK NATS, no retry
Segment count computed and stored in sms_messages.segment_count
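
The story does not say how the segment count is derived. One common approach, sketched here as an assumption, uses the GSM 03.38 default alphabet: 160 septets fit in a single segment and 153 per segment once the message is concatenated, while any character outside the alphabet switches the whole message to UCS-2 (70 code units single, 67 concatenated). The function name `segmentCount` is hypothetical, and the sketch ignores the rule that a two-septet extension character cannot straddle a segment boundary, so it can undercount by one in rare cases.

```typescript
// GSM 03.38 default alphabet (one septet each) and extension table (two septets each).
const GSM7_BASIC =
  "@£$¥èéùìòÇ\nØø\rÅåΔ_ΦΓΛΩΠΨΣΘΞÆæßÉ !\"#¤%&'()*+,-./0123456789:;<=>?" +
  "¡ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÑÜ§¿abcdefghijklmnopqrstuvwxyzäöñüà";
const GSM7_EXT = "^{}\\[~]|€\f";

function segmentCount(body: string): number {
  if (body.length === 0) return 0;
  let septets = 0;
  let gsm7 = true;
  for (let i = 0; i < body.length; i++) {
    const ch = body.charAt(i);
    if (GSM7_BASIC.indexOf(ch) >= 0) septets += 1;
    else if (GSM7_EXT.indexOf(ch) >= 0) septets += 2;
    else { gsm7 = false; break; } // any other character forces UCS-2
  }
  if (gsm7) {
    return septets <= 160 ? 1 : Math.ceil(septets / 153);
  }
  // UCS-2: count UTF-16 code units.
  return body.length <= 70 ? 1 : Math.ceil(body.length / 67);
}
```

Under these rules a 1,600-character GSM-7 body, the maximum allowed by validation, works out to 11 segments.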

US-ORCH-013 · gRPC routing stage (routing-engine integration)

Type: Feature | Points: 8

Description:
As the pipeline, I need to call the routing-engine via gRPC to select an operator for each message so that messages are dispatched to the correct SMPP connector.

Acceptance Criteria:

gRPC call: SelectOperator(tenantId, to, from, messageType, messageId) → {operatorId, operatorSubject}
NO_ROUTE_FOUND error → permanent failure: FAILED status + DLQ, no retry
Transient gRPC error (timeout, UNAVAILABLE) → triggers retry mechanism (EP-ORCH-04)
P95 gRPC call latency ≤ 50 ms (measured via span)
operatorId and routeId stored in sms_messages on success
Update PG status to ROUTING before gRPC call, ROUTED on success

US-ORCH-014 · Operator NATS publish stage

Type: Feature | Points: 5

Description:
As the pipeline, I need to publish the SMS payload to smpp.operator.{operatorId} after routing so that the SMPP connector receives the message for carrier submission.

Acceptance Criteria:

Published subject: smpp.operator.{operatorId} with SmppOutboundMessage schema
X-Correlation-ID NATS header set to messageId
Original messageId, tenantId, and routing metadata included in payload
Update PG status to SENT only after successful NATS publish ACK
On NATS publish failure: triggers retry mechanism

US-ORCH-015 · Domain event emission (sms.events.status)

Type: Feature | Points: 8

Description:
As downstream consumers (billing, webhooks), I need status change events published to sms.events.status after every PG write so that consumers can react to message lifecycle transitions.

Acceptance Criteria:

Payload: { messageId, tenantId, previousStatus, newStatus, timestamp, metadata? }
Published after PG commit, not before
Publish failure logged but does not fail the pipeline stage
All transitions emitted: QUEUED→ROUTING, ROUTING→ROUTED, ROUTED→SENT, *→FAILED, *→DEAD_LETTER, *→RETRY
TypeScript interface defined in event-schemas.ts
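
A possible shape for that interface, together with a check that mirrors the transition list literally, is sketched below. The names `MessageStatus`, `StatusChangedEvent`, `isEmittedTransition`, and `buildStatusEvent` are assumptions, as is the ISO 8601 timestamp format; the actual definitions in event-schemas.ts may differ.

```typescript
type MessageStatus =
  | "QUEUED" | "ROUTING" | "ROUTED" | "SENT"
  | "FAILED" | "DEAD_LETTER" | "RETRY";

interface StatusChangedEvent {
  messageId: string;
  tenantId: string;
  previousStatus: MessageStatus;
  newStatus: MessageStatus;
  timestamp: string;                       // assumed ISO 8601
  metadata?: { [key: string]: unknown };
}

// Forward path: QUEUED -> ROUTING -> ROUTED -> SENT.
const FORWARD: { [from: string]: MessageStatus } = {
  QUEUED: "ROUTING",
  ROUTING: "ROUTED",
  ROUTED: "SENT",
};

// True for exactly the transitions listed in the acceptance criteria.
function isEmittedTransition(from: MessageStatus, to: MessageStatus): boolean {
  if (to === "FAILED" || to === "DEAD_LETTER" || to === "RETRY") return true; // "*" source
  return FORWARD[from] === to;
}

function buildStatusEvent(
  messageId: string,
  tenantId: string,
  previousStatus: MessageStatus,
  newStatus: MessageStatus,
  now: Date,
): StatusChangedEvent {
  return { messageId, tenantId, previousStatus, newStatus, timestamp: now.toISOString() };
}
```

Because the check follows the list as written, a move from RETRY back to ROUTING is not covered; whether that transition should also emit an event is left open by the criteria.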

US-ORCH-016 · Message state persistence (PostgreSQL)

Type: Feature | Points: 8

Description:
As the audit trail, I need all message state transitions written to orch.sms_messages atomically so that message history is reliable and queryable.

Acceptance Criteria:

INSERT on QUEUED (HTTP layer); UPDATE on all subsequent transitions
All status updates wrapped in PG transactions
status_updated_at updated on every transition
processed_at set on first SENT or DEAD_LETTER terminal transition
attempt_count incremented on each RETRY
last_error updated with failure reason on each failed attempt
PG partitioned by month (PARTITION BY RANGE (created_at)), 90-day retention policy

EP-ORCH-03 · Idempotency & Deduplication

US-ORCH-020 · Redis SET NX idempotency key creation

Type: Feature | Points: 3

Description:
As the pipeline, I need atomic SET NX operations for idempotency keys so that concurrent message deliveries don't result in double-processing in multi-replica deployments.

Acceptance Criteria:

SET NX EX 172800 used for all idempotency key writes
Key pattern orch:idem:{messageId} for pipeline-level; orch:submit-idem:{hash} for HTTP-level
Race condition test: two concurrent workers with same messageId → only one proceeds
Redis MULTI/EXEC not required (SET NX is atomic)

US-ORCH-021 · Idempotency TTL and expiry behaviour

Type: Feature | Points: 3

Description:
As the platform, I need idempotency keys to expire after 48 hours so that Redis memory is bounded and replay protection windows are well-defined.

Acceptance Criteria:

TTL = 172800 seconds (48h) set at key creation
Expired keys allow re-processing (new submission treated as fresh request)
TTL visible in Redis key inspection (TTL orch:idem:*)
Redis key count monitored in Prometheus: redis_key_count{prefix="orch:idem"}

US-ORCH-022 · Idempotency replay response for HTTP clients

Type: Feature | Points: 5

Description:
As an HTTP client, I need replayed requests to return the original 202 response body so that retrying clients receive consistent responses.

Acceptance Criteria:

Original 202 response body serialized and stored in Redis alongside idempotency key
Replay returns identical { messageId, status, acceptedAt } body
Response includes Idempotency-Replayed: true header
Storage overhead bounded: response body stored as compact JSON string

US-ORCH-023 · Idempotency Redis failover behaviour

Type: Feature | Points: 7

Description:
As the platform, I need graceful degradation when Redis is unavailable so that idempotency failures don't block message submission.

Acceptance Criteria:

Redis connection failure → warn log emitted with component: idempotency
Message processing continues (fail open) — no 503 to client
redis_idempotency_skip_total counter incremented on each skip
Alert rule: redis_idempotency_skip_total > 10 in 5m window → PagerDuty warning

EP-ORCH-04 · Retry & Dead-Letter Handling

US-ORCH-030 · Exponential backoff retry policy

Type: Feature | Points: 5

Description:
As the pipeline, I need transient failures to trigger an exponential backoff retry so that temporary operator or routing outages don't immediately dead-letter messages.

Acceptance Criteria:

Max 3 attempts (1 initial + 2 retries)
Delays: attempt 1 → 1s, attempt 2 → 2s, attempt 3 → 4s
attempt_count incremented in PG on each attempt
Status updated to RETRY with last_error populated
Retry timing enforced via NATS delayed NAK (nak(delay))
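
The delay schedule above follows delay = 1 s × 2^(attempt − 1). A minimal sketch of the schedule and of the retry-or-dead-letter decision is shown below; `retryDelayMs` and `afterTransientFailure` are hypothetical names. It assumes that a failure on the final allowed attempt goes to the DLQ (per US-ORCH-032), so with a maximum of 3 attempts the 4 s figure is computed but not used as a wait.

```typescript
const MAX_ATTEMPTS = 3;      // 1 initial + 2 retries
const BASE_DELAY_MS = 1000;

// attempt 1 -> 1000 ms, attempt 2 -> 2000 ms, attempt 3 -> 4000 ms
function retryDelayMs(attempt: number): number {
  return BASE_DELAY_MS * Math.pow(2, attempt - 1);
}

type NextStep =
  | { action: "retry"; delayMs: number }
  | { action: "dead-letter" };

// Called after the given attempt (1-based) failed with a transient error.
function afterTransientFailure(attempt: number): NextStep {
  if (attempt >= MAX_ATTEMPTS) return { action: "dead-letter" };
  return { action: "retry", delayMs: retryDelayMs(attempt) };
}
```

The `delayMs` value is what would be passed to the delayed NAK.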

US-ORCH-031 · sms.outbound.retry event on each retry

Type: Feature | Points: 3

Description:
As downstream consumers and operations, I need a sms.outbound.retry event emitted on each retry attempt so that retry patterns are observable.

Acceptance Criteria:

Published to sms.outbound.retry subject with { messageId, tenantId, attemptNumber, failureReason, nextRetryAt }
TypeScript interface defined in event-schemas
Published before delayed NAK
Unit test: 3 consecutive failures → 3 retry events emitted

US-ORCH-032 · Dead-letter queue routing after max retries

Type: Feature | Points: 5

Description:
As the platform, I need exhausted messages routed to sms.outbound.deadletter so that no message is silently lost and dead-letter consumers can handle reprocessing or alerting.

Acceptance Criteria:

After 3 failed attempts: publish to sms.outbound.deadletter
DLQ payload: { messageId, tenantId, to, from, body, attemptCount, failureReason, failedAt }
PG status updated to DEAD_LETTER
Original NATS message ACK'd only after successful DLQ publish
DLQ publish failure retried up to 3 times independently before logging error

US-ORCH-033 · Permanent failure handling (validation + no-route)

Type: Feature | Points: 5

Description:
As the pipeline, I need certain failure types (invalid payload, no route found) to skip retries and go directly to DLQ so that unretryable messages don't consume retry budget.

Acceptance Criteria:

Validation failure → immediate FAILED + DLQ, attemptCount: 1
NO_ROUTE_FOUND gRPC error → immediate FAILED + DLQ, attemptCount: 1
Failure reason encoded as structured object { code: "VALIDATION_FAILED" | "NO_ROUTE" | ..., detail: string }
Unit tests for each permanent failure path
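
One way to separate permanent from transient failures is a single classification function, sketched here. The `PipelineError` shape and the names `Disposition` and `classifyFailure` are hypothetical. Only the two failure codes named in the criteria are modelled, and treating every other error as transient is an assumption of this sketch.

```typescript
// Hypothetical error shape produced by the pipeline stages.
type PipelineError =
  | { kind: "validation"; detail: string }
  | { kind: "grpc"; status: string; detail: string } // e.g. "NO_ROUTE_FOUND", "UNAVAILABLE"
  | { kind: "publish"; detail: string };

interface FailureReason { code: "VALIDATION_FAILED" | "NO_ROUTE"; detail: string }

type Disposition =
  | { permanent: true; status: "FAILED"; attemptCount: 1; reason: FailureReason }
  | { permanent: false };

function classifyFailure(err: PipelineError): Disposition {
  if (err.kind === "validation") {
    return {
      permanent: true, status: "FAILED", attemptCount: 1,
      reason: { code: "VALIDATION_FAILED", detail: err.detail },
    };
  }
  if (err.kind === "grpc" && err.status === "NO_ROUTE_FOUND") {
    return {
      permanent: true, status: "FAILED", attemptCount: 1,
      reason: { code: "NO_ROUTE", detail: err.detail },
    };
  }
  // Assumption: anything else (timeouts, UNAVAILABLE, publish errors) enters the retry path.
  return { permanent: false };
}
```

A permanent disposition would lead to the FAILED update plus DLQ publish; a non-permanent one hands over to the backoff policy in US-ORCH-030.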

US-ORCH-034 · Stuck ROUTED row reconciliation job

Type: Feature | Points: 4

Description:
As the platform, I need a periodic reconciliation job that detects messages stuck in ROUTED status (operator publish confirmed, but the process crashed before the PG update) so that they can be recovered or an alert can be raised.

Acceptance Criteria:

Cron job runs every 5 minutes
Queries orch.sms_messages WHERE status = 'ROUTED' AND status_updated_at < NOW() - INTERVAL '2 minutes'
Emits warn log per stuck message; increments orch_stuck_routed_total counter
Configurable threshold: STUCK_ROUTED_THRESHOLD_SECONDS env var (default 120)
Does NOT auto-retry (manual intervention or separate DLQ consumer)
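
The selection logic of the job can be sketched in memory, mirroring the SQL predicate above. `MessageRow`, `findStuckRouted`, and `thresholdFromEnv` are hypothetical names; in the service the filter would run in PostgreSQL rather than over rows already loaded.

```typescript
interface MessageRow { messageId: string; status: string; statusUpdatedAt: Date }

const DEFAULT_THRESHOLD_SECONDS = 120;

// Parses STUCK_ROUTED_THRESHOLD_SECONDS; falls back to the default on missing or invalid input.
function thresholdFromEnv(value: string | undefined): number {
  if (value === undefined || value === "") return DEFAULT_THRESHOLD_SECONDS;
  const n = Number(value);
  return isFinite(n) && n > 0 ? n : DEFAULT_THRESHOLD_SECONDS;
}

// Equivalent of: status = 'ROUTED' AND status_updated_at < now - threshold
function findStuckRouted(rows: MessageRow[], now: Date, thresholdSeconds: number): MessageRow[] {
  const cutoff = now.getTime() - thresholdSeconds * 1000;
  return rows.filter((r) => r.status === "ROUTED" && r.statusUpdatedAt.getTime() < cutoff);
}
```

Each returned row would produce one warn log and one increment of orch_stuck_routed_total; the sketch deliberately takes no recovery action.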

EP-ORCH-05 · Observability & Readiness

US-ORCH-040 · Health and readiness endpoints

Type: Feature | Points: 2

Description:
As Kubernetes, I need /health/live and /health/ready endpoints so that liveness and readiness probes work correctly.

Acceptance Criteria:

GET /health/live → 200 always if process is running
GET /health/ready → 200 only if NATS, PG, Redis, and routing-engine gRPC are reachable
GET /health/ready → 503 with { dependencies: { nats: "down", ... } } if any dependency unhealthy
Response time ≤ 200 ms (with 1s timeout per dependency check)
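
How the individual dependency checks combine into the readiness response can be sketched as follows. The name `readiness`, the dependency key names other than `nats`, and the `"up"` value are assumptions (the criteria only show `"down"`); the probes themselves, each with its 1 s timeout, are out of scope here and are represented by their boolean results.

```typescript
type DepState = "up" | "down";

interface ReadyResponse {
  httpStatus: 200 | 503;
  body: { dependencies: { [name: string]: DepState } };
}

// checks maps a dependency name to whether its probe succeeded.
function readiness(checks: { [name: string]: boolean }): ReadyResponse {
  const dependencies: { [name: string]: DepState } = {};
  let allUp = true;
  for (const name of Object.keys(checks)) {
    dependencies[name] = checks[name] ? "up" : "down";
    if (!checks[name]) allUp = false;
  }
  // Any single unhealthy dependency makes the pod not ready.
  return { httpStatus: allUp ? 200 : 503, body: { dependencies } };
}
```

For example, `readiness({ nats: false, postgres: true, redis: true, routingEngine: true })` yields a 503 whose body reports `nats` as `"down"`.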

US-ORCH-041 · Prometheus metrics endpoint

Type: Feature | Points: 3

Description:
As Prometheus, I need a /metrics endpoint in OpenMetrics format so that all pipeline metrics are scrapable.

Acceptance Criteria:

Metrics exposed: orch_messages_submitted_total, orch_pipeline_duration_seconds, orch_retry_total, orch_dlq_total, orch_idempotency_hit_total, nats_consumer_lag
All metrics labeled with tenant_id, status, operator_id where applicable
Histogram buckets for orch_pipeline_duration_seconds: 50ms, 100ms, 200ms, 500ms, 1s, 2s
/metrics endpoint not exposed via Kong (internal only)

US-ORCH-042 · Structured JSON logging

Type: Feature | Points: 3

Description:
As the operations team, I need all logs emitted as structured JSON so that Loki can index and query them efficiently.

Acceptance Criteria:

Log fields: level, timestamp, messageId, tenantId, traceId, spanId, service: "sms-orchestrator", action, durationMs
pino logger (or equivalent) configured with JSON transport
Sensitive fields redacted: body (SMS content) not logged at INFO; only at DEBUG with explicit opt-in
Log level configurable via LOG_LEVEL env var

US-ORCH-043 · OpenTelemetry trace propagation

Type: Feature | Points: 5

Description:
As the observability platform, I need trace context propagated from Kong through sms-orchestrator to downstream services so that end-to-end request traces are available in Grafana Tempo.

Acceptance Criteria:

W3C TraceContext header propagated from Kong X-Request-Id → OTel span
Spans created for each pipeline stage: validate, idempotency, route, publish, persist
gRPC call to routing-engine propagates trace context
NATS publish includes traceparent header
OTel exporter configured via OTEL_EXPORTER_OTLP_ENDPOINT env var
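
For reference, the W3C `traceparent` header carried on the gRPC call and the NATS publish has the form `00-<32 hex trace-id>-<16 hex parent-id>-<2 hex flags>`. The sketch below parses and formats version `00` only; the names `TraceContext`, `parseTraceparent`, and `formatTraceparent` are hypothetical, and in practice the OpenTelemetry SDK propagator would normally handle this.

```typescript
interface TraceContext { traceId: string; spanId: string; sampled: boolean }

const TRACEPARENT = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;
const ALL_ZERO = /^0+$/;

// Returns null for anything that is not a valid version-00 traceparent.
function parseTraceparent(header: string): TraceContext | null {
  const m = TRACEPARENT.exec(header);
  if (!m) return null;
  // All-zero trace or parent IDs are invalid per the specification.
  if (ALL_ZERO.test(m[1]) || ALL_ZERO.test(m[2])) return null;
  // Bit 0 of the flags byte is the "sampled" flag.
  return { traceId: m[1], spanId: m[2], sampled: (parseInt(m[3], 16) & 1) === 1 };
}

function formatTraceparent(ctx: TraceContext): string {
  return "00-" + ctx.traceId + "-" + ctx.spanId + "-" + (ctx.sampled ? "01" : "00");
}
```

When publishing to NATS, the orchestrator would keep the incoming `traceId` and substitute the span ID of its own publish span before formatting the header.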

US-ORCH-044 · Kubernetes deployment manifest and HPA

Type: DevOps | Points: 3

Description:
As the platform, I need a Kubernetes Deployment, Service, and HPA for sms-orchestrator so that it scales horizontally under load.

Acceptance Criteria:

Deployment with replicas: 2 (HPA minReplicas: 2); resource requests cpu: 100m, memory: 128Mi; limits cpu: 500m, memory: 512Mi
HPA scaling on nats_consumer_lag > 1000 (KEDA) or CPU > 70%
Service ClusterIP on port 3001
livenessProbe and readinessProbe wired to health endpoints
PodDisruptionBudget ensuring at least 1 pod available during rolling updates

EP-ORCH-06 · Priority-Lane Routing on Inbound Submit

Context: Per EP-PLAT-NB-06 (national priority lanes), every inbound submit must be tagged into one of P0 (emergency), P1 (OTP), P2 (transactional), P3 (marketing), P4 (broadcast). The orchestrator is the single point where lane assignment happens, and lane assignment determines NATS subject (lane.p0.* … lane.p4.*), TPS budget, compliance treatment, and SLA target.

US-ORCH-050 · Accept and validate `X-Priority-Lane` header

Type: Feature | Points: 5

Description: As a tenant, I need to optionally hint the priority lane via X-Priority-Lane: P0|P1|P2|P3|P4 so that traffic class is explicit. The orchestrator must validate the tenant is authorised for the requested lane.

Acceptance Criteria:

Header values constrained to P0..P4; invalid values return 400 with code: "INVALID_LANE".
Authorisation table auth.tenant_lane_grants (tenantId, allowedLanes[]); enforced in orchestrator. Unauthorised lane → 403 code: "LANE_NOT_GRANTED".
Default lane assignment when header absent: per tenant tier (tier=ENTERPRISE → P2, tier=STANDARD → P3, tier=TRIAL → P3).
P0 requests require both X-Priority-Lane: P0 and a valid government PKI signature in X-Gov-Signature (else 403).
Unit test covers: each lane × authorised vs. unauthorised matrix (10 cases minimum).
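
The lane resolution rules above can be sketched as a pure function. `LaneRequest`, `LaneDecision`, and `resolveLane` are hypothetical names, the error code for a missing or invalid government signature is invented for illustration (the criteria specify only the 403), and the sketch assumes the tier default is applied without a grant check because the criteria do not say otherwise.

```typescript
type Lane = "P0" | "P1" | "P2" | "P3" | "P4";
type Tier = "ENTERPRISE" | "STANDARD" | "TRIAL";

interface LaneRequest {
  headerLane?: string;        // raw X-Priority-Lane value, if present
  tier: Tier;
  allowedLanes: Lane[];       // from auth.tenant_lane_grants
  govSignatureValid: boolean; // result of verifying X-Gov-Signature
}

type LaneDecision =
  | { ok: true; lane: Lane }
  | { ok: false; httpStatus: 400 | 403; code: string };

const LANES: Lane[] = ["P0", "P1", "P2", "P3", "P4"];
const DEFAULT_BY_TIER: { [T in Tier]: Lane } = { ENTERPRISE: "P2", STANDARD: "P3", TRIAL: "P3" };

function resolveLane(req: LaneRequest): LaneDecision {
  if (req.headerLane === undefined) {
    return { ok: true, lane: DEFAULT_BY_TIER[req.tier] };
  }
  const lane = LANES.filter((l) => l === req.headerLane)[0];
  if (lane === undefined) {
    return { ok: false, httpStatus: 400, code: "INVALID_LANE" };
  }
  if (req.allowedLanes.indexOf(lane) < 0) {
    return { ok: false, httpStatus: 403, code: "LANE_NOT_GRANTED" };
  }
  if (lane === "P0" && !req.govSignatureValid) {
    // Hypothetical code; the criteria only require a 403 here.
    return { ok: false, httpStatus: 403, code: "GOV_SIGNATURE_INVALID" };
  }
  return { ok: true, lane };
}
```

Iterating over the five lanes with and without a matching grant gives the ten-case matrix the unit test asks for.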

US-ORCH-051 · Lane-aware NATS subject routing

Type: Feature | Points: 5

Description: As the orchestrator, I need to publish sms.outbound.request to a lane-specific subject so that downstream consumers (compliance, routing, smpp) can apply lane-aware processing.

Acceptance Criteria:

Subjects: lane.p0.outbound.request, lane.p1.outbound.request, …, lane.p4.outbound.request.
Subject choice driven by the resolved lane (US-ORCH-050).
Each lane subject has its own JetStream consumer with independent ack/lag metrics.
Metric orch_publish_lane_total{lane} increments per publish.

US-ORCH-052 · Per-lane back-pressure and 429-shaping

Type: Feature | Points: 5

Description: As the orchestrator, I need to shed inbound P3/P4 traffic before P0/P1 if NATS lag spikes so that high-priority traffic always wins.

Acceptance Criteria:

Redis gauge lane:lag:{P} updated every 5 s by a sidecar that reads JetStream consumer lag.
If lag.P0 > 0 (any backlog) for 30 s → reject all P3/P4 with 429 code: "LANE_SHED".
If lag.P1 > 1000 for 30 s → reject P4 with 429 code: "LANE_SHED".
Shed responses include Retry-After: 60.
Telemetry: orch_lane_shed_total{shedLane,protectedLane} counter.
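
The shedding rules above depend on a lag condition holding continuously for 30 s. One way to model that, sketched here as an assumption, is to remember when each condition started to hold and compare against the current time. `LagState`, `trackBreach`, and `shedDecision` are hypothetical names; the lag samples themselves would come from the Redis gauges.

```typescript
type Lane = "P0" | "P1" | "P2" | "P3" | "P4";

// Timestamp (ms) since which each condition has held continuously, or null if it does not hold.
interface LagState {
  p0BacklogSinceMs: number | null; // lag.P0 > 0
  p1HighLagSinceMs: number | null; // lag.P1 > 1000
}

interface ShedDecision {
  shed: boolean;
  protectedLane?: "P0" | "P1";
  httpStatus?: 429;
  code?: "LANE_SHED";
  retryAfterSeconds?: number;
}

const SUSTAIN_MS = 30000;
const RETRY_AFTER_SECONDS = 60;

// Call on every lag sample to maintain the "since" timestamps.
function trackBreach(sinceMs: number | null, breached: boolean, nowMs: number): number | null {
  if (!breached) return null;
  return sinceMs === null ? nowMs : sinceMs;
}

function sustained(sinceMs: number | null, nowMs: number): boolean {
  return sinceMs !== null && nowMs - sinceMs >= SUSTAIN_MS;
}

function shedFor(protectedLane: "P0" | "P1"): ShedDecision {
  return { shed: true, protectedLane, httpStatus: 429, code: "LANE_SHED", retryAfterSeconds: RETRY_AFTER_SECONDS };
}

function shedDecision(lane: Lane, state: LagState, nowMs: number): ShedDecision {
  if ((lane === "P3" || lane === "P4") && sustained(state.p0BacklogSinceMs, nowMs)) return shedFor("P0");
  if (lane === "P4" && sustained(state.p1HighLagSinceMs, nowMs)) return shedFor("P1");
  return { shed: false };
}
```

The `protectedLane` field supplies the second label of orch_lane_shed_total, and `retryAfterSeconds` maps to the `Retry-After` header.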

US-ORCH-053 · Lane carried in NATS payload and DB row

Type: Feature | Points: 3

Description: The lane assignment is a permanent property of the message and must be persisted on orch.sms_messages.lane and included in every downstream NATS payload so it can be enforced and reported on.

Acceptance Criteria:

Schema migration adds lane VARCHAR(2) NOT NULL DEFAULT 'P3' to orch.sms_messages.
All downstream NATS payloads (sms.outbound.request, compliance.evaluate.request, sms.dispatch.command) include lane field.
GET /v1/sms/{messageId} response includes lane in the payload.

US-ORCH-054 · Lane-SLO Prometheus instrumentation

Type: Feature | Points: 3

Description: As SRE, I need Prometheus histograms per lane so that lane-specific SLO alerts can be wired (P1 OTP submit→DLR P95 ≤ 3 s, P2 transactional ≤ 10 s, …).

Acceptance Criteria:

Histograms orch_submit_to_dlr_seconds{lane} and orch_submit_to_ack_seconds{lane} registered.
Buckets tuned per lane (P1 buckets 0.5–10 s, P3 buckets 5–300 s).
Recording rules emit per-lane SLO compliance percentages.
Alert OrchLaneSloBreach{lane=P1} fires when 5-min P95 > 3 s.

EP-ORCH-07 · Trusted-Tenant Fast-Path Submit (signed-template short-circuit)

Context: Per EP-PLAT-NB-08 and EP-CE-13 (trusted-tenant fast-path), pre-vetted regulated tenants (banks, ministries, healthcare) submit using a pre-approved signed template; the orchestrator verifies fingerprint and routes with compliance in shadow mode rather than blocking. This delivers OTP-class latency without losing compliance evidence.

US-ORCH-060 · Accept `X-Template-Id` and variable bindings

Type: Feature | Points: 5

Description: As a trusted tenant, I want to submit by template ID + variable bindings so that the message body is reconstructed server-side from a pre-approved template.

Acceptance Criteria:

POST /v1/sms/send accepts { to, from, templateId, variables } instead of body.
Template fetched from compliance.approved_templates (cached in Redis 5 min); 404 with code: "TEMPLATE_NOT_FOUND" if missing or revoked.
Variables substituted via Mustache; unsupplied variables → 400 with code: "TEMPLATE_VARIABLE_MISSING".
Resulting body length recomputed; segment count returned in 202 response.
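
The substitution and missing-variable check can be illustrated with a small stand-in for Mustache that handles only plain `{{name}}` tags. `renderTemplate` and the `missing` field are hypothetical, and unlike real Mustache (which HTML-escapes `{{name}}` by default) this sketch inserts values verbatim, which is usually what an SMS body needs.

```typescript
type RenderResult =
  | { ok: true; body: string }
  | { ok: false; httpStatus: 400; code: "TEMPLATE_VARIABLE_MISSING"; missing: string[] };

const TAG = /\{\{\s*([A-Za-z_][A-Za-z0-9_]*)\s*\}\}/g;

function renderTemplate(template: string, variables: { [name: string]: string }): RenderResult {
  const missing: string[] = [];
  const body = template.replace(TAG, (_match: string, name: string) => {
    if (Object.prototype.hasOwnProperty.call(variables, name)) return variables[name];
    // Record each unsupplied variable once; the partial body is discarded below.
    if (missing.indexOf(name) < 0) missing.push(name);
    return "";
  });
  if (missing.length > 0) {
    return { ok: false, httpStatus: 400, code: "TEMPLATE_VARIABLE_MISSING", missing };
  }
  return { ok: true, body };
}
```

For example, rendering `"Your code is {{code}}"` with `{ code: "123456" }` produces `"Your code is 123456"`, after which the body length and segment count can be recomputed.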

US-ORCH-061 · Verify content fingerprint and tenant approval

Type: Feature | Points: 5

Description: As the orchestrator, I need to verify that the rendered body fingerprint matches the approved template hash and that the tenant is on the trusted-tenant allow-list for that template.

Acceptance Criteria:

Fingerprint = sha256(templateId || normalised(rendered_body)).
compliance.approved_templates.fingerprint_pattern is a regex (allows variable spans); orchestrator verifies the rendered body matches.
compliance.template_tenant_grants (templateId, tenantId, expiresAt); enforced.
Mismatch → fall back to full compliance evaluation; emit orch.fastpath.fallback.v1 event.

US-ORCH-062 · Compliance shadow-mode evaluation in fast path

Type: Feature | Points: 5

Description: Even when the fast path is taken, compliance must be evaluated in shadow mode so that any drift is detected without blocking delivery.

Acceptance Criteria:

Orchestrator publishes compliance.evaluate.shadow.v1 (non-blocking, fire-and-forget within ack budget).
Compliance verdict on shadow path is logged but does not alter delivery decision.
If shadow verdict ≠ ALLOW, alert ComplianceFastpathDrift{tenantId,templateId} fires.
1-in-1000 sample of fast-path messages is re-evaluated in blocking mode for drift detection.

US-ORCH-063 · Fast-path metrics and audit

Type: Feature | Points: 3

Description: As an auditor, I need metrics distinguishing fast-path from full-compliance traffic so that the share of bypass is monitorable.

Acceptance Criteria:

Metric orch_submit_path_total{path="fastpath"|"full"} increments per submit.
orch.sms_messages.compliance_path column (FAST_PATH | FULL) populated.
Audit event orch.fastpath.taken.v1 emitted with tenantId, templateId, messageId, fingerprintMatched.
Grafana panel: fast-path share by tenant per hour.

US-ORCH-064 · Per-tenant fast-path kill-switch

Type: Feature | Points: 3

Description: As a security incident responder, I need to disable fast-path for a specific tenant or template within 30 s so that an abused fast-path can be revoked instantly.

Acceptance Criteria:

POST /v1/internal/orch/fastpath/disable accepts { tenantId?, templateId? } and pushes a Redis key with TTL 24 h.
Orchestrator checks the kill-switch on every submit before granting fast path.
Re-enable via DELETE or TTL expiry.
Audit event orch.fastpath.killed.v1 published.
