routing-engine — Epics & User Stories

Last updated: 2026-04-18

Story point scale: 1 (trivial) · 2 (small) · 3 (medium) · 5 (large) · 8 (XL)

EP-RE-01: gRPC Service Foundation

Description: Bootstrap the NestJS gRPC microservice with proto definition, transport configuration, mTLS, and health/metrics endpoints.

US-RE-001 — Scaffold NestJS gRPC application

Title: As a platform engineer, I want a NestJS application bootstrapped with @nestjs/microservices gRPC transport so that routing-engine can receive gRPC calls.

Description: Create the NestJS app with the gRPC microservice transport configured on port 50051. Include the RoutingService proto definition and generate TypeScript bindings with ts-proto or @grpc/proto-loader.

Acceptance Criteria:

main.ts bootstraps both a gRPC microservice (port 50051) and an HTTP app (port 3001)
Proto file at src/proto/routing.proto matches the agreed service definition
TypeScript types generated and committed to src/generated/
grpcurl call against the local service returns a valid (empty/mock) response
ESLint and TypeScript compiler report 0 errors

Story Points: 3

US-RE-002 — Implement mTLS for gRPC transport

Title: As a security engineer, I want gRPC connections to require mutual TLS so that only authorised callers (sms-orchestrator) can invoke SelectOperator.

Description: Configure the gRPC server with credentials.createSsl() using a server cert, server key, and CA bundle loaded from mounted Kubernetes Secret files. Verify that a call without a valid client cert is rejected.

Acceptance Criteria:

gRPC server rejects connections without a valid client certificate
Cert paths are read from environment variables TLS_CERT_PATH, TLS_KEY_PATH, TLS_CA_PATH
Service starts with GRPC_TLS_ENABLED=false in local dev (plain TCP)
Integration test verifies TLS rejection with a self-signed test cert

Story Points: 3

US-RE-003 — Health, readiness, and metrics endpoints

Title: As a platform engineer, I want /health, /ready, and /metrics HTTP endpoints so that Kubernetes probes and Prometheus scraping work correctly.

Description: Implement the three HTTP management endpoints. /ready checks PostgreSQL and Redis connectivity. /metrics exposes the initial set of Prometheus counters and histograms.

Acceptance Criteria:

GET /health returns 200 { status: "ok" } without checking dependencies
GET /ready returns 200 when both Redis and PostgreSQL are reachable; 503 otherwise
GET /metrics returns valid Prometheus text format
Initial metrics registered: grpc_requests_total, grpc_request_duration_seconds
Kubernetes liveness and readiness probes pass in staging deployment

Story Points: 2

EP-RE-02: Routing Rule Engine

Description: Implement the core routing logic — prefix matching, rule loading, strategy execution, and Redis caching.

US-RE-004 — Longest-prefix match for E.164 destination numbers

Title: As a routing engineer, I want the service to resolve a full E.164 number to its most specific matching destination prefix so that the correct routing rule is applied.

Description: Implement PrefixMatchingService with longest-prefix matching. Load all destination_prefixes into an in-process cache at startup and refresh every 60 s.

Acceptance Criteria:

+447911123456 matches +4479 when both +447 and +4479 exist
Returns null when no prefix matches any row in destination_prefixes
Prefix cache is refreshed every 60 s via a background @Cron job
Unit tests cover: exact match, longest-prefix selection, no-match, single-digit prefix
Prefix cache size emitted as routing_engine_prefix_cache_size gauge

Story Points: 3
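
Illustrative sketch for US-RE-004: one simple way to do longest-prefix matching is to test progressively shorter slices of the destination number against the in-process prefix set. The function name longestPrefixMatch and the Set-based cache are assumptions for illustration; the backlog names PrefixMatchingService but does not define its API.

```typescript
// Hypothetical helper, not the agreed PrefixMatchingService API.
function longestPrefixMatch(
  destination: string,
  prefixes: ReadonlySet<string>,
): string | null {
  // Try the full number first, then drop one trailing digit per step.
  // Stopping at length 2 keeps the shortest candidate as "+" plus one digit.
  for (let len = destination.length; len >= 2; len--) {
    const candidate = destination.slice(0, len);
    if (prefixes.has(candidate)) {
      return candidate;
    }
  }
  return null; // no row in the prefix set matches
}

const prefixCache = new Set(["+447", "+4479", "+1"]);
console.log(longestPrefixMatch("+447911123456", prefixCache)); // "+4479"
console.log(longestPrefixMatch("+33123456789", prefixCache)); // null
```

Each lookup costs at most 15 set probes for a valid E.164 number, so a trie is optional rather than required at this scale.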

US-RE-005 — COST routing strategy

Title: As a platform engineer, I want a COST routing strategy that selects the operator with the lowest cost per message so that the platform minimises transmission costs.

Description: Implement the COST strategy in RoutingStrategyService. Accept a list of RoutingRuleOperator records and return the one with the minimum cost value (ties broken by priority ASC).

Acceptance Criteria:

Given 3 operators with costs 0.005, 0.003, 0.007, returns the 0.003 operator
Unhealthy operators (UNBOUND) are excluded before strategy evaluation
Ties are broken by priority number (ascending) when costs are equal
Unit tests cover: single operator, multiple operators, tie, all unhealthy

Story Points: 2
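
Illustrative sketch for US-RE-005: the COST rule can be expressed as a filter followed by a single reduce. The CostCandidate shape and the selectByCost name are assumptions; the real RoutingRuleOperator record may carry different fields.

```typescript
// Hypothetical shape standing in for RoutingRuleOperator.
interface CostCandidate {
  operatorId: string;
  cost: number;
  priority: number;
  health: "BOUND" | "UNBOUND";
}

function selectByCost(operators: readonly CostCandidate[]): CostCandidate | null {
  // Unhealthy operators are removed before the strategy is evaluated.
  const healthy = operators.filter((op) => op.health !== "UNBOUND");
  if (healthy.length === 0) {
    return null;
  }
  // Lowest cost wins; equal costs fall back to the lowest priority number.
  return healthy.reduce((best, op) =>
    op.cost < best.cost || (op.cost === best.cost && op.priority < best.priority)
      ? op
      : best,
  );
}

const chosen = selectByCost([
  { operatorId: "a", cost: 0.005, priority: 1, health: "BOUND" },
  { operatorId: "b", cost: 0.003, priority: 2, health: "BOUND" },
  { operatorId: "c", cost: 0.007, priority: 3, health: "BOUND" },
]);
console.log(chosen?.operatorId); // "b"
```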

US-RE-006 — PRIORITY routing strategy

Title: As a platform engineer, I want a PRIORITY routing strategy that selects the operator with the lowest priority number so that preferred operators are used first.

Acceptance Criteria:

Given operators with priorities 3, 1, 2, returns priority-1 operator
Unhealthy operators excluded
Unit tests mirror US-RE-005 pattern

Story Points: 1

US-RE-007 — FAILOVER routing strategy

Title: As a platform engineer, I want a FAILOVER routing strategy that tries operators in priority order and returns the first healthy one so that message delivery continues when primary operators are down.

Acceptance Criteria:

Priority-1 operator is UNBOUND → falls through to priority-2
Priority-1 and priority-2 UNBOUND → returns priority-3
All operators UNBOUND → returns null (caller returns gRPC UNAVAILABLE)
Integration test: seed 3 operators, mark top 2 UNBOUND via health cache, verify priority-3 selected

Story Points: 3
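
Illustrative sketch for US-RE-007: FAILOVER reduces to sorting by priority and returning the first operator that the health cache does not report as UNBOUND. The isUnbound callback is a hypothetical stand-in for the Redis health lookup.

```typescript
// Hypothetical shape; only the fields the strategy needs.
interface FailoverCandidate {
  operatorId: string;
  priority: number;
}

function selectFailover(
  operators: readonly FailoverCandidate[],
  isUnbound: (operatorId: string) => boolean,
): FailoverCandidate | null {
  // Copy before sorting so the caller's array is left untouched.
  const ordered = [...operators].sort((a, b) => a.priority - b.priority);
  // null signals "all operators UNBOUND"; the caller maps it to gRPC UNAVAILABLE.
  return ordered.find((op) => !isUnbound(op.operatorId)) ?? null;
}

const down = new Set(["op-1", "op-2"]);
const picked = selectFailover(
  [
    { operatorId: "op-3", priority: 3 },
    { operatorId: "op-1", priority: 1 },
    { operatorId: "op-2", priority: 2 },
  ],
  (id) => down.has(id),
);
console.log(picked?.operatorId); // "op-3"
```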

US-RE-008 — Redis routing decision cache

Title: As a platform engineer, I want routing decisions cached in Redis with a 300 s TTL so that repeated calls for the same prefix/account/messageType return in < 5 ms.

Acceptance Criteria:

Cache key format: route:decision:{prefix}:{accountId}:{messageType}
Cache HIT returns serialised OperatorConfig without DB query
Cache MISS resolves from DB, writes result to Redis with EX 300
routing_engine_cache_hits_total and routing_engine_cache_misses_total counters increment correctly
Integration test verifies second call is served from cache (DB mock not called)

Story Points: 3
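
Illustrative sketch for US-RE-008: the cache-aside flow with the key format above. KeyValueStore is a hypothetical minimal interface standing in for the Redis client, and OperatorConfigSketch is a placeholder for the real OperatorConfig message.

```typescript
// Hypothetical abstraction over the Redis client (GET, and SET with EX).
interface KeyValueStore {
  get(key: string): Promise<string | null>;
  set(key: string, value: string, ttlSeconds: number): Promise<void>;
}

// Placeholder for the real OperatorConfig.
interface OperatorConfigSketch {
  operatorId: string;
}

const DECISION_TTL_SECONDS = 300;

function decisionKey(prefix: string, accountId: string, messageType: string): string {
  return `route:decision:${prefix}:${accountId}:${messageType}`;
}

async function resolveWithCache(
  store: KeyValueStore,
  key: string,
  loadFromDb: () => Promise<OperatorConfigSketch>,
): Promise<{ config: OperatorConfigSketch; cacheHit: boolean }> {
  const cached = await store.get(key);
  if (cached !== null) {
    // HIT: deserialise and return without touching the database.
    return { config: JSON.parse(cached) as OperatorConfigSketch, cacheHit: true };
  }
  // MISS: resolve from the database, then write back with the 300 s TTL.
  const config = await loadFromDb();
  await store.set(key, JSON.stringify(config), DECISION_TTL_SECONDS);
  return { config, cacheHit: false };
}
```

The cacheHit flag gives the caller a single place to increment routing_engine_cache_hits_total or routing_engine_cache_misses_total.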

US-RE-009 — Complete SelectOperator gRPC handler

Title: As sms-orchestrator, I want to call SelectOperator and receive a resolved OperatorConfig so that I know which SMPP operator to dispatch the message through.

Description: Wire together PrefixMatchingService, routing rule loading, RoutingStrategyService, and Redis cache into the final SelectOperator handler. Implement correct gRPC error codes for all failure paths.

Acceptance Criteria:

Happy path returns OperatorConfig with all required fields populated
An invalid to field (not a well-formed E.164 number) returns INVALID_ARGUMENT
No prefix match returns NOT_FOUND
All operators UNBOUND returns UNAVAILABLE
Unexpected error returns INTERNAL (error details logged, not surfaced)
P95 latency ≤ 50 ms under 500 RPS load test

Story Points: 5
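
Illustrative sketch for US-RE-009: keeping the failure-to-status mapping in one table makes the error paths easy to unit test. The outcome names and the isPlausibleE164 check are assumptions; the numeric codes are the standard gRPC status values.

```typescript
// Standard gRPC status code numbers for the four failure paths.
const GRPC_STATUS = {
  INVALID_ARGUMENT: 3,
  NOT_FOUND: 5,
  INTERNAL: 13,
  UNAVAILABLE: 14,
} as const;

type GrpcStatusName = keyof typeof GRPC_STATUS;

// Hypothetical internal outcome names for the handler's failure paths.
type SelectFailure =
  | "INVALID_DESTINATION"
  | "NO_PREFIX_MATCH"
  | "ALL_OPERATORS_UNBOUND"
  | "UNEXPECTED_ERROR";

const FAILURE_TO_STATUS: Record<SelectFailure, GrpcStatusName> = {
  INVALID_DESTINATION: "INVALID_ARGUMENT",
  NO_PREFIX_MATCH: "NOT_FOUND",
  ALL_OPERATORS_UNBOUND: "UNAVAILABLE",
  UNEXPECTED_ERROR: "INTERNAL",
};

function statusFor(failure: SelectFailure): { name: GrpcStatusName; code: number } {
  const name = FAILURE_TO_STATUS[failure];
  return { name, code: GRPC_STATUS[name] };
}

// Assumed validation rule: "+", a non-zero first digit, at most 15 digits in total.
function isPlausibleE164(to: string): boolean {
  return /^\+[1-9]\d{1,14}$/.test(to);
}

console.log(statusFor("NO_PREFIX_MATCH")); // { name: "NOT_FOUND", code: 5 }
console.log(isPlausibleE164("447911123456")); // false (missing "+")
```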

EP-RE-03: Operator Health Cache Subscription

Description: Consume NATS operator.health events and keep the Redis health cache up-to-date.

US-RE-010 — NATS JetStream consumer setup

Title: As a platform engineer, I want routing-engine to subscribe to operator.health on NATS JetStream with a durable consumer so that health events survive service restarts.

Acceptance Criteria:

Consumer group name: routing-engine-health
Durable consumer configured with AckExplicit policy
Service reconnects to NATS with exponential backoff on disconnect
NATS credential file path read from NATS_CREDS_PATH env var
Log event emitted on connect, disconnect, and reconnect

Story Points: 3

US-RE-011 — Update operator health cache from NATS events

Title: As a routing engineer, I want operator health events to update the Redis health cache so that SelectOperator reflects the latest operator status within 60 s.

Acceptance Criteria:

BOUND event: SET operator:health:{id} '{"status":"BOUND",...}' EX 60
UNBOUND event: same write + triggers cache invalidation sweep for decision keys
FAILBACK event: same write as BOUND
NAK message on Redis write failure (NATS redelivers)
Manual ACK on success
Integration test: publish UNBOUND event → verify Redis key updated + affected decision cache keys deleted

Story Points: 3

EP-RE-04: Observability & Operational Readiness

Description: Full metrics, structured logging, distributed tracing, alerts, and deployment hardening.

US-RE-012 — Complete Prometheus metrics instrumentation

Title: As a platform engineer, I want all defined Prometheus metrics emitting correctly so that I can monitor routing-engine performance and health in Grafana.

Acceptance Criteria:

All 9 metrics from OBSERVABILITY.md are registered and emitting
routing_engine_grpc_request_duration_seconds histogram has correct buckets
Metrics verified in staging Prometheus scrape
Grafana dashboard panel definitions added to dashboards/routing-engine.json

Story Points: 3

US-RE-013 — Structured JSON logging with PII masking

Title: As a security engineer, I want all log events to be structured JSON with the destination phone number masked so that we meet data protection requirements.

Acceptance Criteria:

All log output is valid JSON (verified by log aggregation parsing rules)
The to field appears as +447*** (leading prefix followed by asterisks) in all log events
traceId and spanId propagated from incoming gRPC metadata to log fields
Log level controllable via LOG_LEVEL env var at runtime

Story Points: 2
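
Illustrative sketch for US-RE-013: a masking helper that keeps a short leading prefix and replaces the rest with asterisks. The number of visible characters (4, giving +447***) is inferred from the example in the acceptance criteria and is an assumption, as is the function name.

```typescript
// Hypothetical helper; visibleChars = 4 reproduces the "+447***" example.
function maskDestination(to: string, visibleChars = 4): string {
  // Very short values are fully masked so nothing identifying leaks.
  if (to.length <= visibleChars) {
    return "***";
  }
  return `${to.slice(0, visibleChars)}***`;
}

console.log(maskDestination("+447911123456")); // "+447***"
```

A fixed-width mask also hides the length of the original number, which a per-digit mask would reveal.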

US-RE-014 — OpenTelemetry distributed tracing

Title: As a platform engineer, I want SelectOperator calls to produce OpenTelemetry traces so that I can identify latency bottlenecks in Jaeger/Tempo.

Acceptance Criteria:

Parent span routing-engine.SelectOperator created per gRPC call
Child spans for Redis GET, PostgreSQL query, Redis SET
Trace context propagated from grpc-trace-bin header
Traces visible in staging tracing backend

Story Points: 3

US-RE-015 — Kubernetes deployment and HPA configuration

Title: As a platform engineer, I want routing-engine deployed to Kubernetes with HPA so that the service scales automatically under load.

Acceptance Criteria:

Deployment runs 3 replicas minimum in production
HPA scales up to a maximum of 10 replicas, targeting 60% CPU utilisation
Rolling update completes with zero dropped gRPC requests (verified with ghz during deployment)
NetworkPolicy restricts inbound to sms-orchestrator pods only
Resource requests (200m CPU / 256Mi RAM) and limits (1000m CPU / 512Mi RAM) applied

Story Points: 3

EP-RE-05: Quality-Adaptive Routing (live operator quality scoring + ML-assisted weights)

Description: Move beyond static COST/PRIORITY/FAILOVER strategies to a quality-adaptive strategy that consumes live signals (delivery rate, DLR latency, ESME error rate, cost) and produces routing weights every 60 s. ML-assisted re-weighting is opt-in per route.

US-RE-016 — Operator quality signals collector

Title: As the routing-engine, I want to consume operator.quality.v1 events from analytics-service every 60 s so that route weights reflect current operator performance.

Acceptance Criteria:

NATS consumer for operator.quality.v1 (durable, queue group routing-engine-quality).
Quality fields ingested: deliveryRate, dlrLatencyP95, esmeErrorRate, inflightSubmits.
Per-operator metrics persisted in Redis hash quality:{operatorId} with TTL 300 s.
Stale data (TTL expired) downgrades the operator to 0.5× its static weight.
Unit test: ingest 10 events; verify Redis hash matches.

Story Points: 5

US-RE-017 — Quality-weighted routing strategy

Title: As a platform engineer, I want a QUALITY_WEIGHTED routing strategy that picks operators using score = baseWeight × deliveryRate × (1 / (1 + dlrLatencyP95)) so that higher-quality operators get more traffic.

Acceptance Criteria:

Strategy registered in RoutingStrategyService.
Selection: weighted random sampling proportional to score (Smooth Weighted Round Robin is a deterministic alternative).
Operators with deliveryRate < 0.8 excluded for that selection round.
Unit tests cover: all healthy, one degraded, all degraded → fallback to FAILOVER.

Story Points: 5
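
Illustrative sketch for US-RE-017: the score formula from the title combined with weighted random sampling (not Smooth Weighted Round Robin, which is a deterministic algorithm). The field names follow the story; the unit of dlrLatencyP95 (assumed seconds), the injectable random source, and the function names are assumptions.

```typescript
// Hypothetical candidate shape; dlrLatencyP95 is assumed to be in seconds.
interface QualityCandidate {
  operatorId: string;
  baseWeight: number;
  deliveryRate: number;
  dlrLatencyP95: number;
}

const MIN_DELIVERY_RATE = 0.8;

function qualityScore(op: QualityCandidate): number {
  return op.baseWeight * op.deliveryRate * (1 / (1 + op.dlrLatencyP95));
}

function pickQualityWeighted(
  operators: readonly QualityCandidate[],
  random: () => number = Math.random,
): QualityCandidate | null {
  // Operators below the delivery-rate floor sit out this selection round.
  const scored = operators
    .filter((op) => op.deliveryRate >= MIN_DELIVERY_RATE)
    .map((op) => ({ op, score: qualityScore(op) }));
  const total = scored.reduce((sum, entry) => sum + entry.score, 0);
  if (total <= 0) {
    return null; // nothing eligible: the caller falls back to FAILOVER
  }
  // Walk the cumulative scores until the random threshold is used up.
  let threshold = random() * total;
  let chosen: QualityCandidate | null = null;
  for (const { op, score } of scored) {
    chosen = op;
    threshold -= score;
    if (threshold < 0) {
      break;
    }
  }
  return chosen;
}
```

Passing the random source as a parameter keeps the unit tests deterministic.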

US-RE-018 — Per-operator quality decay window

Title: As a platform engineer, I want a 5-minute exponentially-weighted moving average on quality signals so that transient outliers don't flip routing.

Acceptance Criteria:

EWMA with α = 0.3 applied per signal.
Decay computed in-process (not in Redis).
Quality dashboard panel shows raw vs. EWMA.
Unit tests verify EWMA correctness over 60 samples.

Story Points: 3
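
Illustrative sketch for US-RE-018: the standard EWMA recurrence s_t = α·x_t + (1 − α)·s_{t−1} with α = 0.3. Seeding the average with the first sample is an assumption, since the story does not say how the series is initialised.

```typescript
// Hypothetical helper: previous === null means "no history yet".
function ewmaUpdate(previous: number | null, sample: number, alpha = 0.3): number {
  if (previous === null) {
    return sample; // assumed seeding rule: start from the first sample
  }
  return alpha * sample + (1 - alpha) * previous;
}

function smoothSeries(samples: readonly number[], alpha = 0.3): number[] {
  const smoothed: number[] = [];
  let current: number | null = null;
  for (const sample of samples) {
    current = ewmaUpdate(current, sample, alpha);
    smoothed.push(current);
  }
  return smoothed;
}

// A single bad sample moves the average by only 30% of the gap.
console.log(smoothSeries([1, 1, 0])); // third value is approximately 0.7
```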

US-RE-019 — ML-assisted weight learning (opt-in per route)

Title: As a platform engineer, I want an opt-in ML model that adjusts routing weights based on historical delivery outcomes per (destinationPrefix, hourOfDay, operatorId) tuple.

Acceptance Criteria:

Model: gradient-boosted regressor trained nightly on the prior 14 days from ClickHouse.
Model artifact stored in routing.ml_models table; loaded on service start.
Feature flag ROUTING_ML_ENABLED per route in routing.routes.ml_enabled.
Predictions clamped to [0.5×, 2×] baseline weights to prevent runaway shifts.
Shadow-mode comparison metric routing_ml_lift_ratio{prefix,hour} so impact can be measured before enabling.

Story Points: 8
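
Illustrative sketch for US-RE-019: the clamping rule bounds a predicted weight to between half and double the baseline. The function name is hypothetical, and a non-negative baseline is assumed.

```typescript
// Hypothetical helper implementing the [0.5x, 2x] clamp around the baseline weight.
function clampToBaseline(predicted: number, baseline: number): number {
  const lower = 0.5 * baseline;
  const upper = 2 * baseline;
  return Math.min(Math.max(predicted, lower), upper);
}

console.log(clampToBaseline(3, 1)); // 2
console.log(clampToBaseline(0.1, 1)); // 0.5
console.log(clampToBaseline(1.2, 1)); // 1.2
```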

US-RE-020 — Cost-quality joint optimisation

Title: As a finance stakeholder, I want a JOINT_COST_QUALITY strategy that balances cost and quality with a configurable weight λ so that we don't blindly route to the cheapest unhealthy operator.

Acceptance Criteria:

score = (1 - λ) × (1 / cost) + λ × deliveryRate × (1 / dlrLatencyP95).
λ configurable per route (default 0.5).
Unit tests cover λ=0 (pure cost), λ=1 (pure quality), λ=0.5 (balanced).

Story Points: 3
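
Illustrative sketch for US-RE-020: the joint score exactly as written in the acceptance criteria. The formula divides by both cost and dlrLatencyP95, so this sketch rejects non-positive values for either; that guard, the λ range check, and the function name are assumptions rather than stated requirements.

```typescript
// Hypothetical input shape; field names follow the story.
interface JointInput {
  cost: number;
  deliveryRate: number;
  dlrLatencyP95: number;
}

function jointScore(op: JointInput, lambda = 0.5): number {
  if (lambda < 0 || lambda > 1) {
    throw new RangeError("lambda must be within [0, 1]");
  }
  // Assumed guard: the formula is undefined when either divisor is zero.
  if (op.cost <= 0 || op.dlrLatencyP95 <= 0) {
    throw new RangeError("cost and dlrLatencyP95 must be positive");
  }
  const costTerm = 1 / op.cost;
  const qualityTerm = op.deliveryRate * (1 / op.dlrLatencyP95);
  return (1 - lambda) * costTerm + lambda * qualityTerm;
}

const sample = { cost: 0.004, deliveryRate: 0.9, dlrLatencyP95: 2 };
console.log(jointScore(sample, 0)); // pure cost term, about 250
console.log(jointScore(sample, 1)); // pure quality term, about 0.45
```

With these inputs the cost term is several hundred times larger than the quality term, so the two terms may need normalising before λ behaves as an intuitive balance.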

US-RE-021 — Live quality dashboard

Title: As the NOC, I want a Grafana panel showing live operator quality scores and the routing weights derived from them so that I can debug routing decisions.

Acceptance Criteria:

Panel: per-operator score over 24h with annotations on tier transitions.
Panel: per-route weight distribution (heatmap).
Linked alert RoutingQualityScoreCollapse when ≥ 2 operators drop below 0.5 simultaneously.

Story Points: 3

US-RE-022 — Strategy override audit trail

Title: As a compliance auditor, I want every QUALITY_WEIGHTED selection logged with the inputs used so that any disputed routing decision is reproducible.

Acceptance Criteria:

Sampled at 1-in-100; logged to routing.selection_audit (persistOperatorScores JSONB).
30-day retention; archived to ClickHouse for long-term storage.
Query GET /v1/internal/routing/audit?messageId=... returns the snapshot.

Story Points: 3

EP-RE-06: Per-Tenant Route Preferences, Exclusions, and Regulatory Restrictions

Description: Tenants need to declare preferred operators, excluded operators (e.g., for compliance reasons), and route restrictions (e.g., regulator-mandated paths for certain destination ranges).

US-RE-023 — Tenant preference table and resolution

Title: As an enterprise tenant, I want to declare preferred operators per destination prefix so that my traffic prefers a specific route.

Acceptance Criteria:

Table routing.tenant_preferences (tenantId, prefix, preferredOperatorIds[], excludedOperatorIds[]).
Resolution order: tenant preference → strategy → default.
Excluded operators removed from candidate set before strategy runs.
Admin REST CRUD with audit log.

Story Points: 5

US-RE-024 — Regulatory route restrictions

Title: As a compliance officer, I want regulator-mandated routes for specific destination ranges so that legal mandates are enforced at routing time.

Acceptance Criteria:

Table routing.regulatory_routes (countryCode, destinationPrefix, mandatoryOperatorIds[], reason, regulatorRef).
Regulatory routes override tenant preferences and strategy selection.
Selection-audit log includes regulatoryOverride: true when applied.
CRUD restricted to platform.compliance.admin.

Story Points: 5

US-RE-025 — Grey-route exclusion list

Title: As a Trust & Safety engineer, I want a grey-route exclusion list maintained by fraud-intel-service so that operators identified as grey routes are temporarily removed.

Acceptance Criteria:

Consumer for fraud.grey_route.added.v1 and fraud.grey_route.removed.v1.
In-memory grey-route set refreshed on event; persisted to Redis.
Excluded operators removed from selection regardless of preferences.

Story Points: 3

US-RE-026 — Tenant-scoped strategy override

Title: As an enterprise tenant, I want to override the default routing strategy for my traffic (e.g., always FAILOVER for critical OTP).

Acceptance Criteria:

Field routing.tenant_preferences.strategyOverride (COST|PRIORITY|FAILOVER|QUALITY_WEIGHTED|JOINT_COST_QUALITY).
Validated against tenant tier (only ENTERPRISE may pin FAILOVER).
Surfaced in customer portal route preferences page.

Story Points: 3

US-RE-027 — Per-priority-lane strategy mapping

Title: As the routing-engine, I want priority lanes (P0..P4) to map to default strategies so that emergency and OTP traffic always favours quality.

Acceptance Criteria:

Lane→strategy defaults: P0 → FAILOVER, P1 → QUALITY_WEIGHTED, P2 → JOINT_COST_QUALITY (λ=0.7), P3 → COST, P4 → COST.
Tenant strategy override may not relax beyond tier (e.g., P3 cannot select FAILOVER).
Unit tests for lane→strategy mapping.

Story Points: 3
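
Illustrative sketch for US-RE-027: the lane defaults expressed as a typed lookup table, so a missing lane becomes a compile-time error. The type and constant names are assumptions; the mapping values are taken from the acceptance criteria.

```typescript
type Lane = "P0" | "P1" | "P2" | "P3" | "P4";

type StrategyName =
  | "COST"
  | "PRIORITY"
  | "FAILOVER"
  | "QUALITY_WEIGHTED"
  | "JOINT_COST_QUALITY";

// lambda applies only to JOINT_COST_QUALITY.
interface LaneDefault {
  strategy: StrategyName;
  lambda?: number;
}

// Record<Lane, ...> forces an entry for every lane.
const LANE_DEFAULTS: Readonly<Record<Lane, LaneDefault>> = {
  P0: { strategy: "FAILOVER" },
  P1: { strategy: "QUALITY_WEIGHTED" },
  P2: { strategy: "JOINT_COST_QUALITY", lambda: 0.7 },
  P3: { strategy: "COST" },
  P4: { strategy: "COST" },
};

function defaultStrategyFor(lane: Lane): LaneDefault {
  return LANE_DEFAULTS[lane];
}

console.log(defaultStrategyFor("P2")); // { strategy: "JOINT_COST_QUALITY", lambda: 0.7 }
```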

US-RE-028 — Route exclusion explanations in selection-audit

Title: As an auditor, I want every excluded operator paired with a reason so that I can explain why a specific operator was not chosen.

Acceptance Criteria:

Selection-audit JSONB includes excluded: [{operatorId, reason}] array.
Reasons enumerated: UNHEALTHY, GREY_ROUTE, TENANT_EXCLUDED, REGULATORY_BLOCKED, QUALITY_BELOW_THRESHOLD.

Story Points: 3
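
Illustrative sketch for US-RE-028: partitioning candidates into eligible operators and an excluded array that pairs each operator with one reason. The candidate flags, the 0.8 quality threshold (borrowed from US-RE-017), and the order in which reasons are checked are all assumptions; the story does not say which reason wins when several apply.

```typescript
type ExclusionReason =
  | "UNHEALTHY"
  | "GREY_ROUTE"
  | "TENANT_EXCLUDED"
  | "REGULATORY_BLOCKED"
  | "QUALITY_BELOW_THRESHOLD";

// Hypothetical pre-computed flags for one candidate operator.
interface AuditCandidate {
  operatorId: string;
  unhealthy: boolean;
  greyRoute: boolean;
  tenantExcluded: boolean;
  regulatoryBlocked: boolean;
  deliveryRate: number;
}

// Assumed precedence: regulatory, grey route, tenant, health, quality.
function exclusionReason(c: AuditCandidate, minDeliveryRate: number): ExclusionReason | null {
  if (c.regulatoryBlocked) return "REGULATORY_BLOCKED";
  if (c.greyRoute) return "GREY_ROUTE";
  if (c.tenantExcluded) return "TENANT_EXCLUDED";
  if (c.unhealthy) return "UNHEALTHY";
  if (c.deliveryRate < minDeliveryRate) return "QUALITY_BELOW_THRESHOLD";
  return null;
}

function partitionCandidates(
  candidates: readonly AuditCandidate[],
  minDeliveryRate = 0.8,
): { eligible: string[]; excluded: { operatorId: string; reason: ExclusionReason }[] } {
  const eligible: string[] = [];
  const excluded: { operatorId: string; reason: ExclusionReason }[] = [];
  for (const c of candidates) {
    const reason = exclusionReason(c, minDeliveryRate);
    if (reason === null) {
      eligible.push(c.operatorId);
    } else {
      excluded.push({ operatorId: c.operatorId, reason });
    }
  }
  return { eligible, excluded };
}
```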

EP-RE-07: Time-of-Day / Hour-Bucket Cost Tables and Quiet-Window Honour

Description: MNO costs vary by hour and by day-of-week. Some destinations have regulator-mandated quiet windows (e.g., no marketing 22:00–06:00). The routing-engine must honour both.

US-RE-029 — Hour-bucket cost table

Title: As a finance stakeholder, I want operator costs configurable per hour-of-day per day-of-week so that off-peak savings are realised.

Acceptance Criteria:

Table routing.operator_cost_buckets (operatorId, prefix, dayOfWeek, hourOfDay, cost, currency).
Lookup at routing time uses tenant's IANA timezone (default Asia/Kabul).
Fallback to routing.routes.cost if no bucket entry.
Admin REST CRUD with bulk import via CSV.

Story Points: 5
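
Illustrative sketch for US-RE-029: deriving the bucket coordinates (day of week, hour of day) in a tenant's IANA timezone with the built-in Intl.DateTimeFormat, then falling back to the route cost when no bucket exists. The function names, the Map-based bucket store, its key format, and the Sunday = 0 convention are assumptions.

```typescript
const WEEKDAY_NAMES = ["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"];

// Returns the local day (0 = Sunday, assumed convention) and hour (0-23).
function localDayAndHour(
  instant: Date,
  timeZone = "Asia/Kabul",
): { dayOfWeek: number; hourOfDay: number } {
  const parts = new Intl.DateTimeFormat("en-US", {
    timeZone,
    weekday: "short",
    hour: "2-digit",
    hour12: false,
  }).formatToParts(instant);
  const weekday = parts.find((p) => p.type === "weekday")?.value ?? "";
  const hourText = parts.find((p) => p.type === "hour")?.value ?? "";
  const dayOfWeek = WEEKDAY_NAMES.indexOf(weekday);
  // Some engines format midnight as "24" when hour12 is false; normalise to 0.
  const hourOfDay = Number.parseInt(hourText, 10) % 24;
  if (dayOfWeek < 0 || Number.isNaN(hourOfDay)) {
    throw new Error(`could not resolve local time for zone ${timeZone}`);
  }
  return { dayOfWeek, hourOfDay };
}

// Hypothetical in-memory view of routing.operator_cost_buckets for one operator and prefix.
function costAt(
  buckets: ReadonlyMap<string, number>,
  instant: Date,
  fallbackCost: number,
  timeZone = "Asia/Kabul",
): number {
  const { dayOfWeek, hourOfDay } = localDayAndHour(instant, timeZone);
  return buckets.get(`${dayOfWeek}:${hourOfDay}`) ?? fallbackCost;
}

// Midnight UTC on 2026-04-18 is 04:30 on Saturday in Kabul (UTC+04:30).
console.log(localDayAndHour(new Date("2026-04-18T00:00:00Z"))); // { dayOfWeek: 6, hourOfDay: 4 }
```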

US-RE-030 — Regulator quiet-window honour

Title: As a compliance officer, I want marketing (P3) traffic blocked during regulator quiet windows so that we don't violate national rules.

Acceptance Criteria:

Table routing.quiet_windows (countryCode, lane, dayOfWeek, startHour, endHour, regulatorRef).
Selection is rejected, and the message deferred to the next allowed window, when lane=P3 and the current local time falls within a quiet window.
Deferred messages re-published to lane.p3.outbound.deferred with notBefore timestamp.
Customer portal shows expected delivery time when deferred.

Story Points: 5
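
Illustrative sketch for US-RE-030: a quiet window such as 22:00 to 06:00 wraps past midnight, so the hour check needs two cases. Treating endHour as exclusive and an equal start and end as an empty window are assumptions; how dayOfWeek applies to a window that spans two days is left open here.

```typescript
// Hypothetical subset of a routing.quiet_windows row.
interface QuietWindowHours {
  startHour: number; // inclusive, 0-23
  endHour: number; // exclusive, 0-23 (assumed)
}

function isWithinQuietWindow(localHour: number, window: QuietWindowHours): boolean {
  const { startHour, endHour } = window;
  if (startHour === endHour) {
    return false; // assumed: an empty window blocks nothing
  }
  if (startHour < endHour) {
    // Same-day window, for example 13:00 to 15:00.
    return localHour >= startHour && localHour < endHour;
  }
  // Window wraps past midnight, for example 22:00 to 06:00.
  return localHour >= startHour || localHour < endHour;
}

const night = { startHour: 22, endHour: 6 };
console.log(isWithinQuietWindow(23, night)); // true
console.log(isWithinQuietWindow(6, night)); // false
```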

US-RE-031 — Cost-bucket validation

Title: As a finance stakeholder, I want bucket entries validated to prevent overlap and gaps so that pricing is unambiguous.

Acceptance Criteria:

On insert/update, the system checks that no other entry exists for the same (operator, prefix, day, hour).
Coverage report: GET /v1/admin/routing/cost-buckets/coverage?operatorId= returns missing day×hour cells.
CI lint pass fails if coverage < 100% for production routes.

Story Points: 2
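
Illustrative sketch for US-RE-031: the coverage report amounts to enumerating all 7 × 24 = 168 day×hour cells and returning those with no bucket entry. The BucketCell shape, the 0-based day numbering, and the function names are assumptions.

```typescript
// Hypothetical cell coordinates: dayOfWeek 0-6, hourOfDay 0-23.
interface BucketCell {
  dayOfWeek: number;
  hourOfDay: number;
}

function missingCells(existing: readonly BucketCell[]): BucketCell[] {
  const seen = new Set(existing.map((c) => `${c.dayOfWeek}:${c.hourOfDay}`));
  const missing: BucketCell[] = [];
  for (let dayOfWeek = 0; dayOfWeek < 7; dayOfWeek++) {
    for (let hourOfDay = 0; hourOfDay < 24; hourOfDay++) {
      if (!seen.has(`${dayOfWeek}:${hourOfDay}`)) {
        missing.push({ dayOfWeek, hourOfDay });
      }
    }
  }
  return missing;
}

// Fraction of the 168 cells that have a bucket entry; 1 means full coverage.
function coverageRatio(existing: readonly BucketCell[]): number {
  return (168 - missingCells(existing).length) / 168;
}

console.log(missingCells([]).length); // 168
```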

US-RE-032 — Cost-bucket admin dashboard

Title: As a finance stakeholder, I want a heatmap of operator costs by day×hour so I can spot anomalies.

Acceptance Criteria:

Heatmap component in admin-dashboard (paired with EP-ADMDASH-09).
Filters: operator, prefix, currency.
Export CSV.

Story Points: 3
