routing-engine — Epics & User Stories
Last updated: 2026-04-18
Story point scale: 1 (trivial) · 2 (small) · 3 (medium) · 5 (large) · 8 (XL)
EP-RE-01: gRPC Service Foundation
Description: Bootstrap the NestJS gRPC microservice with proto definition, transport configuration, mTLS, and health/metrics endpoints.
US-RE-001 — Scaffold NestJS gRPC application
Title: As a platform engineer, I want a NestJS application bootstrapped with @nestjs/microservices gRPC transport so that routing-engine can receive gRPC calls.
Description: Create the NestJS app with the gRPC microservice transport configured on port 50051. Include the RoutingService proto definition and generate TypeScript bindings with ts-proto or @grpc/proto-loader.
Acceptance Criteria:
- main.ts bootstraps both a gRPC microservice (port 50051) and an HTTP app (port 3001)
- Proto file at src/proto/routing.proto matches the agreed service definition
- TypeScript types generated and committed to src/generated/
- grpcurl call against the local service returns a valid (empty/mock) response
- ESLint and TypeScript compiler report 0 errors
Story Points: 3
US-RE-002 — Implement mTLS for gRPC transport
Title: As a security engineer, I want gRPC connections to require mutual TLS so that only authorised callers (sms-orchestrator) can invoke SelectOperator.
Description: Configure the gRPC server with ServerCredentials.createSsl() (client certificate checking enabled) using a server cert, server key, and CA bundle loaded from mounted Kubernetes Secret files. Verify that a call without a valid client cert is rejected.
Acceptance Criteria:
- gRPC server rejects connections without a valid client certificate
- Cert paths are read from environment variables TLS_CERT_PATH, TLS_KEY_PATH, TLS_CA_PATH
- Service starts with GRPC_TLS_ENABLED=false in local dev (plain TCP)
- Integration test verifies TLS rejection with a self-signed test cert
Story Points: 3
US-RE-003 — Health, readiness, and metrics endpoints
Title: As a platform engineer, I want /health, /ready, and /metrics HTTP endpoints so that Kubernetes probes and Prometheus scraping work correctly.
Description: Implement the three HTTP management endpoints. /ready checks PostgreSQL and Redis connectivity. /metrics exposes the initial set of Prometheus counters and histograms.
Acceptance Criteria:
- GET /health returns 200 { status: "ok" } without checking dependencies
- GET /ready returns 200 when both Redis and PostgreSQL are reachable; 503 otherwise
- GET /metrics returns valid Prometheus text format
- Initial metrics registered: grpc_requests_total, grpc_request_duration_seconds
- Kubernetes liveness and readiness probes pass in staging deployment
Story Points: 2
EP-RE-02: Routing Rule Engine
Description: Implement the core routing logic — prefix matching, rule loading, strategy execution, and Redis caching.
US-RE-004 — Longest-prefix match for E.164 destination numbers
Title: As a routing engineer, I want the service to resolve a full E.164 number to its most specific matching destination prefix so that the correct routing rule is applied.
Description: Implement PrefixMatchingService with longest-prefix matching. Load all destination_prefixes into an in-process cache at startup and refresh every 60 s.
Acceptance Criteria:
- +447911123456 matches +4479 when both +447 and +4479 exist
- Returns null when no prefix matches any row in destination_prefixes
- Prefix cache is refreshed every 60 s via a background @Cron job
- Unit tests cover: exact match, longest-prefix selection, no-match, single-digit prefix
- Prefix cache size emitted as routing_engine_prefix_cache_size gauge
Story Points: 3
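The longest-prefix rule in US-RE-004 can be sketched as below. This is a minimal illustration, not the actual PrefixMatchingService: the function name and the use of a Set as the in-process cache are assumptions, and the destination is assumed to carry a leading "+".

```typescript
// Hypothetical helper; a Set stands in for the in-process prefix cache.
function longestPrefixMatch(destination: string, prefixes: Set<string>): string | null {
  // Try the full number first, then drop one trailing digit at a time.
  // Stop before the bare "+" so it is never treated as a prefix.
  for (let len = destination.length; len > 1; len--) {
    const candidate = destination.slice(0, len);
    if (prefixes.has(candidate)) {
      return candidate;
    }
  }
  return null;
}
```

With prefixes +447, +4479, and +1 loaded, +447911123456 resolves to +4479, and a number with no matching prefix resolves to null. Each lookup costs at most one Set probe per digit, which may be enough before a trie is considered.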
US-RE-005 — COST routing strategy
Title: As a platform engineer, I want a COST routing strategy that selects the operator with the lowest cost per message so that the platform minimises transmission costs.
Description: Implement the COST strategy in RoutingStrategyService. Accept a list of RoutingRuleOperator records and return the one with the minimum cost value (ties broken by priority ASC).
Acceptance Criteria:
- Given 3 operators with costs 0.005, 0.003, 0.007, returns the 0.003 operator
- Unhealthy operators (UNBOUND) are excluded before strategy evaluation
- Tie-breaking by priority number when costs are equal
- Unit tests cover: single operator, multiple operators, tie, all unhealthy
Story Points: 2
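A minimal sketch of the COST selection described in US-RE-005 follows. The field names on the record (operatorId, cost, priority, health) are assumptions; the story only names the RoutingRuleOperator type.

```typescript
// Assumed shape; only the type name appears in the story.
interface RoutingRuleOperator {
  operatorId: string;
  cost: number;
  priority: number;
  health: "BOUND" | "UNBOUND";
}

function selectByCost(operators: RoutingRuleOperator[]): RoutingRuleOperator | null {
  // Unhealthy operators are removed before the strategy is evaluated.
  const healthy = operators.filter((op) => op.health !== "UNBOUND");
  if (healthy.length === 0) {
    return null;
  }
  // Lowest cost wins; equal costs fall back to the lowest priority number.
  return healthy.reduce((best, op) =>
    op.cost < best.cost || (op.cost === best.cost && op.priority < best.priority) ? op : best
  );
}
```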
US-RE-006 — PRIORITY routing strategy
Title: As a platform engineer, I want a PRIORITY routing strategy that selects the operator with the lowest priority number so that preferred operators are used first.
Acceptance Criteria:
- Given operators with priorities 3, 1, 2, returns priority-1 operator
- Unhealthy operators excluded
- Unit tests mirror US-RE-005 pattern
Story Points: 1
US-RE-007 — FAILOVER routing strategy
Title: As a platform engineer, I want a FAILOVER routing strategy that tries operators in priority order and returns the first healthy one so that message delivery continues when primary operators are down.
Acceptance Criteria:
- Priority-1 operator is UNBOUND → falls through to priority-2
- Priority-1 and priority-2 UNBOUND → returns priority-3
- All operators UNBOUND → returns null (caller returns gRPC UNAVAILABLE)
- Integration test: seed 3 operators, mark top 2 UNBOUND via health cache, verify priority-3 selected
Story Points: 3
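The FAILOVER fall-through can be sketched as a sort by priority followed by a scan for the first healthy operator. The record shape and the isHealthy callback (standing in for the health cache lookup) are assumptions made for illustration.

```typescript
// Assumed shape for illustration.
interface FailoverCandidate {
  operatorId: string;
  priority: number;
}

function selectFailover(
  candidates: FailoverCandidate[],
  isHealthy: (operatorId: string) => boolean
): FailoverCandidate | null {
  // Copy before sorting so the caller's array is left untouched.
  const ordered = candidates.slice().sort((a, b) => a.priority - b.priority);
  for (const candidate of ordered) {
    if (isHealthy(candidate.operatorId)) {
      return candidate;
    }
  }
  // No healthy operator; the caller maps this to gRPC UNAVAILABLE.
  return null;
}
```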
US-RE-008 — Redis routing decision cache
Title: As a platform engineer, I want routing decisions cached in Redis with a 300 s TTL so that repeated calls for the same prefix/account/messageType return in < 5 ms.
Acceptance Criteria:
- Cache key format: route:decision:{prefix}:{accountId}:{messageType}
- Cache HIT returns serialised OperatorConfig without DB query
- Cache MISS resolves from DB, writes result to Redis with EX 300
- routing_engine_cache_hits_total and routing_engine_cache_misses_total counters increment correctly
- Integration test verifies second call is served from cache (DB mock not called)
Story Points: 3
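The cache-aside flow for US-RE-008 might look like the sketch below. To keep it runnable on its own, a synchronous in-memory TtlStore stands in for Redis (a real Redis client would be asynchronous); the class, the function names, and the injected clock are all assumptions.

```typescript
// In-memory stand-in for Redis; not a real client.
class TtlStore {
  private entries = new Map<string, { value: string; expiresAt: number }>();
  constructor(private now: () => number) {}

  get(key: string): string | null {
    const entry = this.entries.get(key);
    if (!entry || entry.expiresAt <= this.now()) {
      return null;
    }
    return entry.value;
  }

  setEx(key: string, ttlSeconds: number, value: string): void {
    this.entries.set(key, { value, expiresAt: this.now() + ttlSeconds * 1000 });
  }
}

// Key format taken from the acceptance criteria.
function decisionKey(prefix: string, accountId: string, messageType: string): string {
  return `route:decision:${prefix}:${accountId}:${messageType}`;
}

function resolveWithCache(
  store: TtlStore,
  key: string,
  loadFromDb: () => string
): { value: string; hit: boolean } {
  const cached = store.get(key);
  if (cached !== null) {
    return { value: cached, hit: true }; // HIT: no DB query
  }
  const value = loadFromDb();
  store.setEx(key, 300, value); // MISS: write back with a 300 s TTL
  return { value, hit: false };
}
```

The hit flag is where the routing_engine_cache_hits_total and routing_engine_cache_misses_total counters would be incremented.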
US-RE-009 — Complete SelectOperator gRPC handler
Title: As sms-orchestrator, I want to call SelectOperator and receive a resolved OperatorConfig so that I know which SMPP operator to dispatch the message through.
Description: Wire together PrefixMatchingService, routing rule loading, RoutingStrategyService, and Redis cache into the final SelectOperator handler. Implement correct gRPC error codes for all failure paths.
Acceptance Criteria:
- Happy path returns OperatorConfig with all required fields populated
- Invalid to returns INVALID_ARGUMENT
- No prefix match returns NOT_FOUND
- All operators UNBOUND returns UNAVAILABLE
- Unexpected error returns INTERNAL (error details logged, not surfaced)
- P95 latency ≤ 50 ms under 500 RPS load test
Story Points: 5
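One way to keep the failure paths of US-RE-009 consistent is a single mapping from internal failure kinds to gRPC status codes, sketched below. The RoutingFailure names and the E.164 regular expression are assumptions; the numeric values are the standard gRPC status codes.

```typescript
// Subset of the standard gRPC status codes used by this handler.
enum GrpcStatus {
  INVALID_ARGUMENT = 3,
  NOT_FOUND = 5,
  INTERNAL = 13,
  UNAVAILABLE = 14,
}

// Hypothetical internal failure kinds.
type RoutingFailure =
  | "INVALID_DESTINATION"
  | "NO_PREFIX_MATCH"
  | "ALL_OPERATORS_UNBOUND"
  | "UNEXPECTED";

// Assumed validation: "+", a non-zero leading digit, at most 15 digits in total.
function isValidE164(to: string): boolean {
  return /^\+[1-9]\d{1,14}$/.test(to);
}

function toGrpcStatus(failure: RoutingFailure): GrpcStatus {
  switch (failure) {
    case "INVALID_DESTINATION":
      return GrpcStatus.INVALID_ARGUMENT;
    case "NO_PREFIX_MATCH":
      return GrpcStatus.NOT_FOUND;
    case "ALL_OPERATORS_UNBOUND":
      return GrpcStatus.UNAVAILABLE;
    default:
      // Details are logged by the caller and not surfaced to the client.
      return GrpcStatus.INTERNAL;
  }
}
```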
EP-RE-03: Operator Health Cache Subscription
Description: Consume NATS operator.health events and keep the Redis health cache up-to-date.
US-RE-010 — NATS JetStream consumer setup
Title: As a platform engineer, I want routing-engine to subscribe to operator.health on NATS JetStream with a durable consumer so that health events survive service restarts.
Acceptance Criteria:
- Consumer group name: routing-engine-health
- Durable consumer configured with AckExplicit policy
- Service reconnects to NATS with exponential backoff on disconnect
- NATS credential file path read from NATS_CREDS_PATH env var
- Log event emitted on connect, disconnect, and reconnect
Story Points: 3
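The exponential backoff in US-RE-010 could be computed by a small pure function such as the one below, assuming the NATS client in use accepts a custom reconnect delay. The function name and the base and cap values are illustrative and do not come from the story.

```typescript
// Hypothetical helper: delay before reconnect attempt `attempt` (0-based).
// baseMs and capMs are illustrative defaults only.
function reconnectDelayMs(attempt: number, baseMs: number = 500, capMs: number = 30000): number {
  // Double the delay on each attempt, then cap it so retries never stall for too long.
  const exponential = baseMs * Math.pow(2, Math.max(0, attempt));
  return Math.min(capMs, exponential);
}
```

With these defaults the delays run 500 ms, 1 s, 2 s, 4 s and so on until they reach the 30 s cap. Adding random jitter is a common refinement when many replicas reconnect at once.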
US-RE-011 — Update operator health cache from NATS events
Title: As a routing engineer, I want operator health events to update the Redis health cache so that SelectOperator reflects the latest operator status within 60 s.
Acceptance Criteria:
- BOUND event: SET operator:health:{id} '{"status":"BOUND",...}' EX 60
- UNBOUND event: same write + triggers cache invalidation sweep for decision keys
- FAILBACK event: same write as BOUND
- NAK message on Redis write failure (NATS redelivers)
- Manual ACK on success
- Integration test: publish UNBOUND event → verify Redis key updated + affected decision cache keys deleted
Story Points: 3
EP-RE-04: Observability & Operational Readiness
Description: Full metrics, structured logging, distributed tracing, alerts, and deployment hardening.
US-RE-012 — Complete Prometheus metrics instrumentation
Title: As a platform engineer, I want all defined Prometheus metrics emitting correctly so that I can monitor routing-engine performance and health in Grafana.
Acceptance Criteria:
- All 9 metrics from OBSERVABILITY.md are registered and emitting
- routing_engine_grpc_request_duration_seconds histogram has correct buckets
- Metrics verified in staging Prometheus scrape
- Grafana dashboard panel definitions added to dashboards/routing-engine.json
Story Points: 3
US-RE-013 — Structured JSON logging with PII masking
Title: As a security engineer, I want all log events to be structured JSON with the destination phone number masked so that we meet data protection requirements.
Acceptance Criteria:
- All log output is valid JSON (verified by log aggregation parsing rules)
- to field appears as +447*** (prefix + asterisks) in all log events
- traceId and spanId propagated from incoming gRPC metadata to log fields
- Log level controllable via LOG_LEVEL env var at runtime
Story Points: 2
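The masking rule in US-RE-013 can be sketched as follows. The number of visible characters (4, which yields +447*** for a UK mobile number) is inferred from the example in the acceptance criteria and should be treated as an assumption, as should the function name.

```typescript
// Hypothetical helper: keep the first `visible` characters and replace the rest
// with a fixed "***" so the masked value does not reveal the number's length.
function maskDestination(to: string, visible: number = 4): string {
  if (to.length <= visible) {
    // Too short to mask partially; hide it completely.
    return "***";
  }
  return to.slice(0, visible) + "***";
}
```

Applying this in one place, for example a log serializer, makes it less likely that an individual log call leaks the full number.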
US-RE-014 — OpenTelemetry distributed tracing
Title: As a platform engineer, I want SelectOperator calls to produce OpenTelemetry traces so that I can identify latency bottlenecks in Jaeger/Tempo.
Acceptance Criteria:
- Parent span routing-engine.SelectOperator created per gRPC call
- Child spans for Redis GET, PostgreSQL query, Redis SET
- Trace context propagated from grpc-trace-bin header
- Traces visible in staging tracing backend
Story Points: 3
US-RE-015 — Kubernetes deployment and HPA configuration
Title: As a platform engineer, I want routing-engine deployed to Kubernetes with HPA so that the service scales automatically under load.
Acceptance Criteria:
- Deployment runs 3 replicas minimum in production
- HPA scales to 10 replicas at 60% CPU utilisation
- Rolling update completes with zero dropped gRPC requests (verified with ghz during deployment)
- NetworkPolicy restricts inbound to sms-orchestrator pods only
- Resource requests (200m CPU / 256Mi RAM) and limits (1000m CPU / 512Mi RAM) applied
Story Points: 3
EP-RE-05: Quality-Adaptive Routing (live operator quality scoring + ML-assisted weights)
Description: Move beyond static COST/PRIORITY/FAILOVER strategies to a quality-adaptive strategy that consumes live signals (delivery rate, DLR latency, ESME error rate, cost) and produces routing weights every 60 s. ML-assisted re-weighting is opt-in per route.
US-RE-016 — Operator quality signals collector
Title: As the routing-engine, I want to consume operator.quality.v1 events from analytics-service every 60 s so that route weights reflect current operator performance.
Acceptance Criteria:
- NATS consumer for operator.quality.v1 (durable, queue group routing-engine-quality).
- Quality fields ingested: deliveryRate, dlrLatencyP95, esmeErrorRate, inflightSubmits.
- Per-operator metrics persisted in Redis hash quality:{operatorId} with TTL 300 s.
- Stale data (TTL expired) downgrades operator to weight=0.5 of static value.
- Unit test: ingest 10 events; verify Redis hash matches.
Story Points: 5
US-RE-017 — Quality-weighted routing strategy
Title: As a platform engineer, I want a QUALITY_WEIGHTED routing strategy that picks operators using score = baseWeight × deliveryRate × (1 / (1 + dlrLatencyP95)) so that higher-quality operators get more traffic.
Acceptance Criteria:
- Strategy registered in RoutingStrategyService.
- Selection: weighted random sampling from score (Smooth Weighted Round Robin).
- Operators with deliveryRate < 0.8 excluded for that selection round.
- Unit tests cover: all healthy, one degraded, all degraded → fallback to FAILOVER.
Story Points: 5
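The score formula and the 0.8 exclusion threshold from US-RE-017 can be sketched as below. The criteria mention both weighted random sampling and Smooth Weighted Round Robin; this sketch shows only the random-sampling form, with the random source injected so tests can be deterministic. The record shape and function names are assumptions, and the unit of dlrLatencyP95 is left as whatever the quality events carry.

```typescript
// Assumed shape for illustration.
interface QualityOperator {
  operatorId: string;
  baseWeight: number;
  deliveryRate: number;
  dlrLatencyP95: number;
}

// score = baseWeight × deliveryRate × (1 / (1 + dlrLatencyP95))
function qualityScore(op: QualityOperator): number {
  return op.baseWeight * op.deliveryRate * (1 / (1 + op.dlrLatencyP95));
}

function pickWeighted(
  operators: QualityOperator[],
  rng: () => number = Math.random
): QualityOperator | null {
  // Operators below the delivery-rate threshold sit out this selection round.
  const scored = operators
    .filter((op) => op.deliveryRate >= 0.8)
    .map((op) => ({ op, score: qualityScore(op) }));
  const total = scored.reduce((sum, entry) => sum + entry.score, 0);
  if (scored.length === 0 || total <= 0) {
    return null; // caller falls back to FAILOVER
  }
  // Walk the cumulative scores until the random point is passed.
  let remaining = rng() * total;
  let last: QualityOperator | null = null;
  for (const entry of scored) {
    last = entry.op;
    remaining -= entry.score;
    if (remaining < 0) {
      return entry.op;
    }
  }
  return last; // guards against floating-point round-off at the upper edge
}
```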
US-RE-018 — Per-operator quality decay window
Title: As a platform engineer, I want a 5-minute exponentially-weighted moving average on quality signals so that transient outliers don't flip routing.
Acceptance Criteria:
- EWMA with α = 0.3 applied per signal.
- Decay computed in-process (not in Redis).
- Quality dashboard panel shows raw vs. EWMA.
- Unit tests verify EWMA correctness over 60 samples.
Story Points: 3
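The in-process EWMA from US-RE-018 can be sketched as one small class per signal. The class name is hypothetical; seeding the average with the first sample is an assumption, since the story does not say how the series starts.

```typescript
// Exponentially weighted moving average: next = α × sample + (1 − α) × previous.
class Ewma {
  private value: number | null = null;
  constructor(private readonly alpha: number = 0.3) {}

  update(sample: number): number {
    // Assumption: the first sample seeds the average directly.
    this.value =
      this.value === null ? sample : this.alpha * sample + (1 - this.alpha) * this.value;
    return this.value;
  }
}
```

With α = 0.3, a delivery rate that drops from 1.0 to 0.0 moves the average to 0.7 after one sample and to 0.49 after two, so a single bad sample cannot take an operator below the 0.8 exclusion threshold used by QUALITY_WEIGHTED unless the average was already near it.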
US-RE-019 — ML-assisted weight learning (opt-in per route)
Title: As a platform engineer, I want an opt-in ML model that adjusts routing weights based on historical delivery outcomes per (destinationPrefix, hourOfDay, operatorId) tuple.
Acceptance Criteria:
- Model: gradient-boosted regressor trained nightly on the prior 14 days from ClickHouse.
- Model artifact stored in routing.ml_models table; loaded on service start.
- Feature flag ROUTING_ML_ENABLED per route in routing.routes.ml_enabled.
- Predictions clamped to [0.5×, 2×] baseline weights to prevent runaway shifts.
- Shadow-mode comparison metric routing_ml_lift_ratio{prefix,hour} so impact can be measured before enabling.
Story Points: 8
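The clamp on model predictions in US-RE-019 reduces to one expression. The function name is hypothetical, and the sketch assumes the baseline weight is positive.

```typescript
// Keep the predicted weight within [0.5 × baseline, 2 × baseline].
// Assumes baseline > 0.
function clampMlWeight(baseline: number, predicted: number): number {
  return Math.min(2 * baseline, Math.max(0.5 * baseline, predicted));
}
```

For a baseline weight of 10, a prediction of 30 is clamped to 20, a prediction of 2 is raised to 5, and a prediction of 12 passes through unchanged.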
US-RE-020 — Cost-quality joint optimisation
Title: As a finance stakeholder, I want a JOINT_COST_QUALITY strategy that balances cost and quality with a configurable weight λ so that we don't blindly route to the cheapest unhealthy operator.
Acceptance Criteria:
- score = (1 - λ) × (1 / cost) + λ × deliveryRate × (1 / dlrLatencyP95).
- λ configurable per route (default 0.5).
- Unit tests cover λ=0 (pure cost), λ=1 (pure quality), λ=0.5 (balanced).
Story Points: 3
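A direct transcription of the JOINT_COST_QUALITY formula is sketched below. The function and parameter names are assumptions, and so is the guard against non-positive cost or latency, which the story does not address.

```typescript
// score = (1 − λ) × (1 / cost) + λ × deliveryRate × (1 / dlrLatencyP95)
function jointScore(
  cost: number,
  deliveryRate: number,
  dlrLatencyP95: number,
  lambda: number = 0.5
): number {
  // Assumed guard: both divisors must be positive for the score to be meaningful.
  if (cost <= 0 || dlrLatencyP95 <= 0) {
    throw new RangeError("cost and dlrLatencyP95 must be positive");
  }
  return (1 - lambda) * (1 / cost) + lambda * deliveryRate * (1 / dlrLatencyP95);
}
```

For cost = 0.004, deliveryRate = 0.9, and dlrLatencyP95 = 2, the cost term alone is 250 while the quality term is 0.45. With raw inputs on these scales the cost term dominates at λ = 0.5, so normalising both terms to a common range before mixing may be worth considering.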
US-RE-021 — Live quality dashboard
Title: As the NOC, I want a Grafana panel showing live operator quality scores and the routing weights derived from them so that I can debug routing decisions.
Acceptance Criteria:
- Panel: per-operator score over 24h with annotations on tier transitions.
- Panel: per-route weight distribution (heatmap).
- Linked alert RoutingQualityScoreCollapse when ≥ 2 operators drop below 0.5 simultaneously.
Story Points: 3
US-RE-022 — Strategy override audit trail
Title: As a compliance auditor, I want every QUALITY_WEIGHTED selection logged with the inputs used so that any disputed routing decision is reproducible.
Acceptance Criteria:
- Sampled at 1-in-100; logged to routing.selection_audit (persist OperatorScores JSONB).
- 30-day retention; archived to ClickHouse for long-term.
- Query GET /v1/internal/routing/audit?messageId=... returns the snapshot.
Story Points: 3
EP-RE-06: Per-Tenant Route Preferences, Exclusions, and Regulatory Restrictions
Description: Tenants need to declare preferred operators, excluded operators (e.g., for compliance reasons), and route restrictions (e.g., regulator-mandated paths for certain destination ranges).
US-RE-023 — Tenant preference table and resolution
Title: As an enterprise tenant, I want to declare preferred operators per destination prefix so that my traffic prefers a specific route.
Acceptance Criteria:
- Table routing.tenant_preferences(tenantId, prefix, preferredOperatorIds[], excludedOperatorIds[]).
- Resolution order: tenant preference → strategy → default.
- Excluded operators removed from candidate set before strategy runs.
- Admin REST CRUD with audit log.
Story Points: 5
US-RE-024 — Regulatory route restrictions
Title: As a compliance officer, I want regulator-mandated routes for specific destination ranges so that legal mandates are enforced at routing time.
Acceptance Criteria:
- Table routing.regulatory_routes(countryCode, destinationPrefix, mandatoryOperatorIds[], reason, regulatorRef).
- Regulatory routes override tenant preferences and strategy selection.
- Selection-audit log includes regulatoryOverride: true when applied.
- CRUD restricted to platform.compliance.admin.
Story Points: 5
US-RE-025 — Grey-route exclusion list
Title: As a Trust & Safety engineer, I want a grey-route exclusion list maintained by fraud-intel-service so that operators identified as grey routes are temporarily removed.
Acceptance Criteria:
- Consumer for fraud.grey_route.added.v1 and fraud.grey_route.removed.v1.
- In-memory grey-route set refreshed on event; persisted to Redis.
- Excluded operators removed from selection regardless of preferences.
Story Points: 3
US-RE-026 — Tenant-scoped strategy override
Title: As an enterprise tenant, I want to override the default routing strategy for my traffic (e.g., always FAILOVER for critical OTP).
Acceptance Criteria:
- Field routing.tenant_preferences.strategyOverride (COST|PRIORITY|FAILOVER|QUALITY_WEIGHTED|JOINT_COST_QUALITY).
- Validated against tenant tier (only ENTERPRISE may pin FAILOVER).
- Surfaced in customer portal route preferences page.
Story Points: 3
US-RE-027 — Per-priority-lane strategy mapping
Title: As the routing-engine, I want priority lanes (P0..P4) to map to default strategies so that emergency and OTP traffic always favours quality.
Acceptance Criteria:
- Lane→strategy defaults: P0 → FAILOVER, P1 → QUALITY_WEIGHTED, P2 → JOINT_COST_QUALITY (λ=0.7), P3 → COST, P4 → COST.
- Tenant strategy override may not relax beyond tier (e.g., P3 cannot select FAILOVER).
- Unit tests for lane→strategy mapping.
Story Points: 3
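The lane defaults from US-RE-027 fit naturally into a typed lookup table, sketched below. The type and constant names are assumptions; the mappings themselves come from the acceptance criteria.

```typescript
type Lane = "P0" | "P1" | "P2" | "P3" | "P4";
type Strategy = "COST" | "PRIORITY" | "FAILOVER" | "QUALITY_WEIGHTED" | "JOINT_COST_QUALITY";

interface LaneDefault {
  strategy: Strategy;
  lambda?: number; // only meaningful for JOINT_COST_QUALITY
}

// Record<Lane, …> makes the compiler reject a table that omits a lane.
const LANE_DEFAULTS: Record<Lane, LaneDefault> = {
  P0: { strategy: "FAILOVER" },
  P1: { strategy: "QUALITY_WEIGHTED" },
  P2: { strategy: "JOINT_COST_QUALITY", lambda: 0.7 },
  P3: { strategy: "COST" },
  P4: { strategy: "COST" },
};

function defaultStrategyFor(lane: Lane): LaneDefault {
  return LANE_DEFAULTS[lane];
}
```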
US-RE-028 — Route exclusion explanations in selection-audit
Title: As an auditor, I want every excluded operator paired with a reason so that I can explain why a specific operator was not chosen.
Acceptance Criteria:
- Selection-audit JSONB includes excluded: [{operatorId, reason}] array.
- Reasons enumerated: UNHEALTHY, GREY_ROUTE, TENANT_EXCLUDED, REGULATORY_BLOCKED, QUALITY_BELOW_THRESHOLD.
Story Points: 3
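Producing the excluded array for US-RE-028 might be done by partitioning candidates in one pass, as sketched below. The reason names come from the story. The record shapes, the function names, and in particular the order in which the checks run (regulatory, then grey-route, tenant, health, quality) are assumptions; the stories state that regulatory routes and grey-route exclusions take precedence over preferences but do not fix a full order.

```typescript
type ExclusionReason =
  | "UNHEALTHY"
  | "GREY_ROUTE"
  | "TENANT_EXCLUDED"
  | "REGULATORY_BLOCKED"
  | "QUALITY_BELOW_THRESHOLD";

// Assumed shapes for illustration.
interface AuditCandidate {
  operatorId: string;
  healthy: boolean;
  deliveryRate: number;
}

interface ExclusionInputs {
  mandatoryOperatorIds: string[]; // empty when no regulatory route applies
  greyRouteIds: string[];
  tenantExcludedIds: string[];
  minDeliveryRate: number;
}

// Returns the first reason that applies, or null when the candidate stays eligible.
function exclusionReason(c: AuditCandidate, inputs: ExclusionInputs): ExclusionReason | null {
  if (inputs.mandatoryOperatorIds.length > 0 && inputs.mandatoryOperatorIds.indexOf(c.operatorId) < 0) {
    return "REGULATORY_BLOCKED";
  }
  if (inputs.greyRouteIds.indexOf(c.operatorId) >= 0) return "GREY_ROUTE";
  if (inputs.tenantExcludedIds.indexOf(c.operatorId) >= 0) return "TENANT_EXCLUDED";
  if (!c.healthy) return "UNHEALTHY";
  if (c.deliveryRate < inputs.minDeliveryRate) return "QUALITY_BELOW_THRESHOLD";
  return null;
}

function partitionCandidates(candidates: AuditCandidate[], inputs: ExclusionInputs) {
  const eligible: AuditCandidate[] = [];
  const excluded: { operatorId: string; reason: ExclusionReason }[] = [];
  for (const c of candidates) {
    const reason = exclusionReason(c, inputs);
    if (reason === null) {
      eligible.push(c);
    } else {
      excluded.push({ operatorId: c.operatorId, reason });
    }
  }
  return { eligible, excluded };
}
```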
EP-RE-07: Time-of-Day / Hour-Bucket Cost Tables and Quiet-Window Honour
Description: MNO costs vary by hour and by day-of-week. Some destinations have regulator-mandated quiet windows (e.g., no marketing 22:00–06:00). The routing-engine must honour both.
US-RE-029 — Hour-bucket cost table
Title: As a finance stakeholder, I want operator costs configurable per hour-of-day per day-of-week so that off-peak savings are realised.
Acceptance Criteria:
- Table routing.operator_cost_buckets(operatorId, prefix, dayOfWeek, hourOfDay, cost, currency).
- Lookup at routing time uses tenant's IANA timezone (default Asia/Kabul).
- Fallback to routing.routes.cost if no bucket entry.
- Admin REST CRUD with bulk import via CSV.
Story Points: 5
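The timezone-aware bucket lookup in US-RE-029 can be sketched with the built-in Intl.DateTimeFormat, avoiding a date library. The function names, the Map key layout, and the dayOfWeek convention (0 = Sunday) are assumptions; the table definition does not state which convention it uses.

```typescript
const WEEKDAYS = ["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"];

// Convert an instant to the tenant's local day-of-week (0 = Sunday, assumed) and hour.
function localDayAndHour(instant: Date, timeZone: string): { dayOfWeek: number; hourOfDay: number } {
  const parts = new Intl.DateTimeFormat("en-US", {
    timeZone,
    weekday: "short",
    hour: "numeric",
    hour12: false,
  }).formatToParts(instant);
  let weekday = "";
  let hour = 0;
  parts.forEach((part) => {
    if (part.type === "weekday") weekday = part.value;
    // Some runtimes render midnight as "24" in 24-hour mode; normalise it to 0.
    if (part.type === "hour") hour = parseInt(part.value, 10) % 24;
  });
  return { dayOfWeek: WEEKDAYS.indexOf(weekday), hourOfDay: hour };
}

// Hypothetical key layout for an in-memory copy of routing.operator_cost_buckets.
function bucketKey(operatorId: string, prefix: string, dayOfWeek: number, hourOfDay: number): string {
  return `${operatorId}|${prefix}|${dayOfWeek}|${hourOfDay}`;
}

function resolveCost(
  buckets: Map<string, number>,
  operatorId: string,
  prefix: string,
  instant: Date,
  timeZone: string,
  fallbackCost: number
): number {
  const local = localDayAndHour(instant, timeZone);
  const hit = buckets.get(bucketKey(operatorId, prefix, local.dayOfWeek, local.hourOfDay));
  // No bucket entry: fall back to the route-level cost.
  return hit !== undefined ? hit : fallbackCost;
}
```

Asia/Kabul is UTC+04:30, so 2026-04-18T00:00:00Z falls in the Saturday 04:00 bucket there.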
US-RE-030 — Regulator quiet-window honour
Title: As a compliance officer, I want marketing (P3) traffic blocked during regulator quiet windows so that we don't violate national rules.
Acceptance Criteria:
- Table routing.quiet_windows(countryCode, lane, dayOfWeek, startHour, endHour, regulatorRef).
- Selection rejects (deferred to next allowed window) when lane=P3 and current local time is within window.
- Deferred messages re-published to lane.p3.outbound.deferred with notBefore timestamp.
- Customer portal shows expected delivery time when deferred.
Story Points: 5
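A quiet window such as 22:00–06:00 wraps past midnight, which a plain range comparison misses. The sketch below handles both cases; the function name, the exclusive end hour, and the treatment of startHour equal to endHour as an empty window are assumptions.

```typescript
// Is `hour` (0-23, local time) inside the window [startHour, endHour)?
function inQuietWindow(hour: number, startHour: number, endHour: number): boolean {
  if (startHour === endHour) {
    return false; // assumption: equal bounds mean no window
  }
  if (startHour < endHour) {
    // Same-day window, e.g. 09:00-17:00.
    return hour >= startHour && hour < endHour;
  }
  // Window wraps past midnight, e.g. 22:00-06:00.
  return hour >= startHour || hour < endHour;
}
```

For the 22:00–06:00 example, hours 23 and 5 fall inside the window while hours 6 and 21 do not. How dayOfWeek applies to the part of a window after midnight is not specified in the story and is left out here.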
US-RE-031 — Cost-bucket validation
Title: As a finance stakeholder, I want bucket entries validated to prevent overlap and gaps so that pricing is unambiguous.
Acceptance Criteria:
- On insert/update, system checks no other entry for same (operator, prefix, day, hour).
- Coverage report: GET /v1/admin/routing/cost-buckets/coverage?operatorId= returns missing day×hour cells.
- CI lint pass fails if coverage < 100% for production routes.
Story Points: 2
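The coverage report in US-RE-031 amounts to comparing the stored cells against the full 7 × 24 grid. A minimal sketch follows, with hypothetical names and dayOfWeek assumed to run 0 to 6.

```typescript
interface BucketCell {
  dayOfWeek: number; // 0-6 (assumed convention)
  hourOfDay: number; // 0-23
}

// Return every day×hour cell that has no bucket entry.
function missingCells(present: BucketCell[]): BucketCell[] {
  const seen = new Set<string>();
  present.forEach((cell) => seen.add(`${cell.dayOfWeek}:${cell.hourOfDay}`));
  const missing: BucketCell[] = [];
  for (let day = 0; day < 7; day++) {
    for (let hour = 0; hour < 24; hour++) {
      if (!seen.has(`${day}:${hour}`)) {
        missing.push({ dayOfWeek: day, hourOfDay: hour });
      }
    }
  }
  return missing;
}
```

Full coverage means an empty result. A coverage percentage for the CI check can be derived as (168 − missing.length) / 168.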
US-RE-032 — Cost-bucket admin dashboard
Title: As a finance stakeholder, I want a heatmap of operator costs by day×hour so I can spot anomalies.
Acceptance Criteria:
- Heatmap component in admin-dashboard (paired with EP-ADMDASH-09).
- Filters: operator, prefix, currency.
- Export CSV.
Story Points: 3