routing-engine — Epics & User Stories

Last updated: 2026-04-18
Story point scale: 1 (trivial) · 2 (small) · 3 (medium) · 5 (large) · 8 (XL)


EP-RE-01: gRPC Service Foundation

Description: Bootstrap the NestJS gRPC microservice with proto definition, transport configuration, mTLS, and health/metrics endpoints.


US-RE-001 — Scaffold NestJS gRPC application

Title: As a platform engineer, I want a NestJS application bootstrapped with @nestjs/microservices gRPC transport so that routing-engine can receive gRPC calls.

Description: Create the NestJS app with the gRPC microservice transport configured on port 50051. Include the RoutingService proto definition and generate TypeScript bindings with ts-proto or @grpc/proto-loader.

Acceptance Criteria:

  • main.ts bootstraps both a gRPC microservice (port 50051) and an HTTP app (port 3001)
  • Proto file at src/proto/routing.proto matches the agreed service definition
  • TypeScript types generated and committed to src/generated/
  • grpcurl call against the local service returns a valid (empty/mock) response
  • ESLint and TypeScript compiler report 0 errors

Story Points: 3


US-RE-002 — Implement mTLS for gRPC transport

Title: As a security engineer, I want gRPC connections to require mutual TLS so that only authorised callers (sms-orchestrator) can invoke SelectOperator.

Description: Configure the gRPC server with credentials.createSsl() using a server cert, server key, and CA bundle loaded from mounted Kubernetes Secret files. Verify that a call without a valid client cert is rejected.

Acceptance Criteria:

  • gRPC server rejects connections without a valid client certificate
  • Cert paths are read from environment variables TLS_CERT_PATH, TLS_KEY_PATH, TLS_CA_PATH
  • Service starts with GRPC_TLS_ENABLED=false in local dev (plain TCP)
  • Integration test verifies TLS rejection with a self-signed test cert

Story Points: 3


US-RE-003 — Health, readiness, and metrics endpoints

Title: As a platform engineer, I want /health, /ready, and /metrics HTTP endpoints so that Kubernetes probes and Prometheus scraping work correctly.

Description: Implement the three HTTP management endpoints. /ready checks PostgreSQL and Redis connectivity. /metrics exposes the initial set of Prometheus counters and histograms.

Acceptance Criteria:

  • GET /health returns 200 { status: "ok" } without checking dependencies
  • GET /ready returns 200 when both Redis and PostgreSQL are reachable; 503 otherwise
  • GET /metrics returns valid Prometheus text format
  • Initial metrics registered: grpc_requests_total, grpc_request_duration_seconds
  • Kubernetes liveness and readiness probes pass in staging deployment

Story Points: 2


EP-RE-02: Routing Rule Engine

Description: Implement the core routing logic — prefix matching, rule loading, strategy execution, and Redis caching.


US-RE-004 — Longest-prefix match for E.164 destination numbers

Title: As a routing engineer, I want the service to resolve a full E.164 number to its most specific matching destination prefix so that the correct routing rule is applied.

Description: Implement PrefixMatchingService with longest-prefix matching. Load all destination_prefixes into an in-process cache at startup and refresh every 60 s.

Acceptance Criteria:

  • +447911123456 matches +4479 when both +447 and +4479 exist
  • Returns null when no prefix matches any row in destination_prefixes
  • Prefix cache is refreshed every 60 s via a background @Cron job
  • Unit tests cover: exact match, longest-prefix selection, no-match, single-digit prefix
  • Prefix cache size emitted as routing_engine_prefix_cache_size gauge

Story Points: 3
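
Example (illustrative): a minimal sketch of the longest-prefix lookup described above, assuming the in-process prefix cache is held as a Set of prefix strings. The function name longestPrefixMatch and the sample prefixes are hypothetical and not part of the agreed design.

```typescript
// Hypothetical helper: tries the longest candidate first, then shortens it.
function longestPrefixMatch(
  destination: string,
  prefixes: ReadonlySet<string>,
): string | null {
  // Stop at length 2 so the shortest candidate is "+" followed by one digit.
  for (let len = destination.length; len >= 2; len--) {
    const candidate = destination.slice(0, len);
    if (prefixes.has(candidate)) return candidate;
  }
  return null; // No row in the cache matches this number.
}

// Sample data (assumed): both +447 and +4479 exist, so +4479 must win.
const prefixCache = new Set(["+447", "+4479", "+1"]);
const matched = longestPrefixMatch("+447911123456", prefixCache); // "+4479"
const unmatched = longestPrefixMatch("+8613800000000", prefixCache); // null
```

Each lookup costs at most one Set probe per character of the number, so the cost stays bounded by the E.164 length limit regardless of how many prefixes are cached.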


US-RE-005 — COST routing strategy

Title: As a platform engineer, I want a COST routing strategy that selects the operator with the lowest cost per message so that the platform minimises transmission costs.

Description: Implement the COST strategy in RoutingStrategyService. Accept a list of RoutingRuleOperator records and return the one with the minimum cost value (ties broken by priority ASC).

Acceptance Criteria:

  • Given 3 operators with costs 0.005, 0.003, 0.007, returns the 0.003 operator
  • Unhealthy operators (UNBOUND) are excluded before strategy evaluation
  • When costs are equal, the operator with the lowest priority number wins
  • Unit tests cover: single operator, multiple operators, tie, all unhealthy

Story Points: 2
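
Example (illustrative): a minimal sketch of the COST selection, assuming each RoutingRuleOperator record carries cost, priority, and a health status. The field names and the selectByCost function are assumptions made for this sketch.

```typescript
type HealthStatus = "BOUND" | "UNBOUND";

// Assumed record shape; the real RoutingRuleOperator may differ.
interface RoutingRuleOperator {
  operatorId: string;
  cost: number;
  priority: number;
  status: HealthStatus;
}

function selectByCost(
  operators: readonly RoutingRuleOperator[],
): RoutingRuleOperator | null {
  // Unhealthy operators are removed before the strategy is evaluated.
  const healthy = operators.filter((op) => op.status !== "UNBOUND");
  if (healthy.length === 0) return null;
  // Lowest cost wins; equal costs fall back to the lowest priority number.
  return healthy.reduce((best, op) =>
    op.cost < best.cost ||
    (op.cost === best.cost && op.priority < best.priority)
      ? op
      : best,
  );
}

const cheapest = selectByCost([
  { operatorId: "op-a", cost: 0.005, priority: 1, status: "BOUND" },
  { operatorId: "op-b", cost: 0.003, priority: 2, status: "BOUND" },
  { operatorId: "op-c", cost: 0.007, priority: 3, status: "BOUND" },
]); // cheapest is the op-b record
```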


US-RE-006 — PRIORITY routing strategy

Title: As a platform engineer, I want a PRIORITY routing strategy that selects the operator with the lowest priority number so that preferred operators are used first.

Acceptance Criteria:

  • Given operators with priorities 3, 1, 2, returns priority-1 operator
  • Unhealthy operators excluded
  • Unit tests mirror US-RE-005 pattern

Story Points: 1


US-RE-007 — FAILOVER routing strategy

Title: As a platform engineer, I want a FAILOVER routing strategy that tries operators in priority order and returns the first healthy one so that message delivery continues when primary operators are down.

Acceptance Criteria:

  • Priority-1 operator is UNBOUND → falls through to priority-2
  • Priority-1 and priority-2 UNBOUND → returns priority-3
  • All operators UNBOUND → returns null (caller returns gRPC UNAVAILABLE)
  • Integration test: seed 3 operators, mark top 2 UNBOUND via health cache, verify priority-3 selected

Story Points: 3
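
Example (illustrative): a minimal sketch of the FAILOVER fall-through, assuming the health cache can be read as a map from operator id to status. Treating a missing health entry as unhealthy is an assumption of this sketch, not something the story specifies; selectFailover and the Candidate shape are hypothetical.

```typescript
type OperatorStatus = "BOUND" | "UNBOUND";

interface Candidate {
  operatorId: string;
  priority: number;
}

function selectFailover(
  candidates: readonly Candidate[],
  health: ReadonlyMap<string, OperatorStatus>,
): Candidate | null {
  // Copy before sorting so the caller's array is left untouched.
  const ordered = candidates.slice().sort((a, b) => a.priority - b.priority);
  // Assumption: an operator with no health entry is treated as unhealthy.
  const firstHealthy = ordered.find(
    (c) => health.get(c.operatorId) === "BOUND",
  );
  // null tells the caller to answer with gRPC UNAVAILABLE.
  return firstHealthy ?? null;
}

const failoverCandidates: Candidate[] = [
  { operatorId: "op-3", priority: 3 },
  { operatorId: "op-1", priority: 1 },
  { operatorId: "op-2", priority: 2 },
];
const healthCache = new Map<string, OperatorStatus>([
  ["op-1", "UNBOUND"],
  ["op-2", "UNBOUND"],
  ["op-3", "BOUND"],
]);
const chosen = selectFailover(failoverCandidates, healthCache); // op-3 record
```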


US-RE-008 — Redis routing decision cache

Title: As a platform engineer, I want routing decisions cached in Redis with a 300 s TTL so that repeated calls for the same prefix/account/messageType return in < 5 ms.

Acceptance Criteria:

  • Cache key format: route:decision:{prefix}:{accountId}:{messageType}
  • Cache HIT returns serialised OperatorConfig without DB query
  • Cache MISS resolves from DB, writes result to Redis with EX 300
  • routing_engine_cache_hits_total and routing_engine_cache_misses_total counters increment correctly
  • Integration test verifies second call is served from cache (DB mock not called)

Story Points: 3
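
Example (illustrative): a minimal cache-aside sketch for the decision cache, written against a small key-value interface rather than a specific Redis client. KeyValueStore, resolveWithCache, and the OperatorConfig shape are assumptions made for this sketch; only the key format and the 300 s TTL come from the criteria above.

```typescript
// Assumed minimal shape; the real OperatorConfig has more fields.
interface OperatorConfig {
  operatorId: string;
}

// Hypothetical abstraction over the Redis client (GET, and SET with EX).
interface KeyValueStore {
  get(key: string): Promise<string | null>;
  set(key: string, value: string, ttlSeconds: number): Promise<void>;
}

const DECISION_TTL_SECONDS = 300;

function decisionCacheKey(
  prefix: string,
  accountId: string,
  messageType: string,
): string {
  return `route:decision:${prefix}:${accountId}:${messageType}`;
}

async function resolveWithCache(
  store: KeyValueStore,
  key: string,
  loadFromDb: () => Promise<OperatorConfig>,
): Promise<{ config: OperatorConfig; hit: boolean }> {
  const cached = await store.get(key);
  if (cached !== null) {
    // Cache HIT: the database loader is never invoked.
    return { config: JSON.parse(cached) as OperatorConfig, hit: true };
  }
  // Cache MISS: resolve from the database, then write back with the TTL.
  const config = await loadFromDb();
  await store.set(key, JSON.stringify(config), DECISION_TTL_SECONDS);
  return { config, hit: false };
}
```

The hit flag is returned so a caller could increment the hit and miss counters in one place.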


US-RE-009 — Complete SelectOperator gRPC handler

Title: As sms-orchestrator, I want to call SelectOperator and receive a resolved OperatorConfig so that I know which SMPP operator to dispatch the message through.

Description: Wire together PrefixMatchingService, routing rule loading, RoutingStrategyService, and Redis cache into the final SelectOperator handler. Implement correct gRPC error codes for all failure paths.

Acceptance Criteria:

  • Happy path returns OperatorConfig with all required fields populated
  • Invalid to field returns INVALID_ARGUMENT
  • No prefix match returns NOT_FOUND
  • All operators UNBOUND returns UNAVAILABLE
  • Unexpected error returns INTERNAL (error details logged, not surfaced)
  • P95 latency ≤ 50 ms under 500 RPS load test

Story Points: 5
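
Example (illustrative): a minimal sketch of the failure-to-status mapping listed above. The numeric values follow the canonical gRPC status code list; the SelectFailure names, the toGrpcStatus function, and the E.164 check (a plus sign followed by up to 15 digits, first digit non-zero) are assumptions about what counts as an invalid to value.

```typescript
// Subset of the canonical gRPC status codes used by SelectOperator.
enum GrpcStatus {
  INVALID_ARGUMENT = 3,
  NOT_FOUND = 5,
  INTERNAL = 13,
  UNAVAILABLE = 14,
}

// Hypothetical internal failure categories.
type SelectFailure =
  | "INVALID_DESTINATION"
  | "NO_PREFIX_MATCH"
  | "ALL_OPERATORS_UNBOUND"
  | "UNEXPECTED";

// Assumed validation rule: "+" then 2 to 15 digits, no leading zero.
function isValidE164(to: string): boolean {
  return /^\+[1-9]\d{1,14}$/.test(to);
}

function toGrpcStatus(failure: SelectFailure): {
  code: GrpcStatus;
  message: string;
} {
  switch (failure) {
    case "INVALID_DESTINATION":
      return { code: GrpcStatus.INVALID_ARGUMENT, message: "invalid destination" };
    case "NO_PREFIX_MATCH":
      return { code: GrpcStatus.NOT_FOUND, message: "no matching prefix" };
    case "ALL_OPERATORS_UNBOUND":
      return { code: GrpcStatus.UNAVAILABLE, message: "no healthy operator" };
    case "UNEXPECTED":
      // Details are logged server-side; the caller sees a generic message.
      return { code: GrpcStatus.INTERNAL, message: "internal error" };
  }
}
```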


EP-RE-03: Operator Health Cache Subscription

Description: Consume NATS operator.health events and keep the Redis health cache up-to-date.


US-RE-010 — NATS JetStream consumer setup

Title: As a platform engineer, I want routing-engine to subscribe to operator.health on NATS JetStream with a durable consumer so that health events survive service restarts.

Acceptance Criteria:

  • Consumer group name: routing-engine-health
  • Durable consumer configured with AckExplicit policy
  • Service reconnects to NATS with exponential backoff on disconnect
  • NATS credential file path read from NATS_CREDS_PATH env var
  • Log event emitted on connect, disconnect, and reconnect

Story Points: 3


US-RE-011 — Update operator health cache from NATS events

Title: As a routing engineer, I want operator health events to update the Redis health cache so that SelectOperator reflects the latest operator status within 60 s.

Acceptance Criteria:

  • BOUND event: SET operator:health:{id} '{"status":"BOUND",...}' EX 60
  • UNBOUND event: same write + triggers cache invalidation sweep for decision keys
  • FAILBACK event: same write as BOUND
  • NAK message on Redis write failure (NATS redelivers)
  • Manual ACK on success
  • Integration test: publish UNBOUND event → verify Redis key updated + affected decision cache keys deleted

Story Points: 3


EP-RE-04: Observability & Operational Readiness

Description: Full metrics, structured logging, distributed tracing, alerts, and deployment hardening.


US-RE-012 — Complete Prometheus metrics instrumentation

Title: As a platform engineer, I want all defined Prometheus metrics emitting correctly so that I can monitor routing-engine performance and health in Grafana.

Acceptance Criteria:

  • All 9 metrics from OBSERVABILITY.md are registered and emitting
  • routing_engine_grpc_request_duration_seconds histogram has correct buckets
  • Metrics verified in staging Prometheus scrape
  • Grafana dashboard panel definitions added to dashboards/routing-engine.json

Story Points: 3


US-RE-013 — Structured JSON logging with PII masking

Title: As a security engineer, I want all log events to be structured JSON with the destination phone number masked so that we meet data protection requirements.

Acceptance Criteria:

  • All log output is valid JSON (verified by log aggregation parsing rules)
  • to field appears as +447*** (prefix + asterisks) in all log events
  • traceId and spanId propagated from incoming gRPC metadata to log fields
  • Log level controllable via LOG_LEVEL env var at runtime

Story Points: 2
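
Example (illustrative): a minimal sketch of the masking rule, assuming the first four characters stay visible and the rest is replaced by three asterisks, which reproduces the +447*** form above. The visible length, the function names, and the log field set are assumptions of this sketch.

```typescript
// Assumption: keep 4 leading characters, replace the remainder with "***".
function maskDestination(to: string, visibleChars = 4): string {
  if (to.length <= visibleChars) return "***"; // Too short to reveal anything.
  return `${to.slice(0, visibleChars)}***`;
}

// Hypothetical log event builder producing one JSON line per event.
function buildLogEvent(
  level: string,
  message: string,
  to: string,
  traceId: string,
  spanId: string,
): string {
  return JSON.stringify({
    level,
    message,
    to: maskDestination(to), // The raw number never reaches the log line.
    traceId,
    spanId,
  });
}

const masked = maskDestination("+447911123456"); // "+447***"
const logLine = buildLogEvent("info", "operator selected", "+447911123456", "t-1", "s-1");
```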


US-RE-014 — OpenTelemetry distributed tracing

Title: As a platform engineer, I want SelectOperator calls to produce OpenTelemetry traces so that I can identify latency bottlenecks in Jaeger/Tempo.

Acceptance Criteria:

  • Parent span routing-engine.SelectOperator created per gRPC call
  • Child spans for Redis GET, PostgreSQL query, Redis SET
  • Trace context propagated from grpc-trace-bin header
  • Traces visible in staging tracing backend

Story Points: 3


US-RE-015 — Kubernetes deployment and HPA configuration

Title: As a platform engineer, I want routing-engine deployed to Kubernetes with HPA so that the service scales automatically under load.

Acceptance Criteria:

  • Deployment runs 3 replicas minimum in production
  • HPA scales up to a maximum of 10 replicas, targeting 60% average CPU utilisation
  • Rolling update completes with zero dropped gRPC requests (verified with ghz during deployment)
  • NetworkPolicy restricts inbound to sms-orchestrator pods only
  • Resource requests (200m CPU / 256Mi RAM) and limits (1000m CPU / 512Mi RAM) applied

Story Points: 3


EP-RE-05: Quality-Adaptive Routing (live operator quality scoring + ML-assisted weights)

Description: Move beyond static COST/PRIORITY/FAILOVER strategies to a quality-adaptive strategy that consumes live signals (delivery rate, DLR latency, ESME error rate, cost) and produces routing weights every 60 s. ML-assisted re-weighting is opt-in per route.


US-RE-016 — Operator quality signals collector

Title: As the routing-engine, I want to consume operator.quality.v1 events from analytics-service every 60 s so that route weights reflect current operator performance.

Acceptance Criteria:

  • NATS consumer for operator.quality.v1 (durable, queue group routing-engine-quality).
  • Quality fields ingested: deliveryRate, dlrLatencyP95, esmeErrorRate, inflightSubmits.
  • Per-operator metrics persisted in Redis hash quality:{operatorId} with TTL 300 s.
  • Stale data (TTL expired) downgrades the operator to 0.5× its static weight.
  • Unit test: ingest 10 events; verify Redis hash matches.

Story Points: 5


US-RE-017 — Quality-weighted routing strategy

Title: As a platform engineer, I want a QUALITY_WEIGHTED routing strategy that picks operators using score = baseWeight × deliveryRate × (1 / (1 + dlrLatencyP95)) so that higher-quality operators get more traffic.

Acceptance Criteria:

  • Strategy registered in RoutingStrategyService.
  • Selection: weighted selection proportional to score, using Smooth Weighted Round Robin.
  • Operators with deliveryRate < 0.8 excluded for that selection round.
  • Unit tests cover: all healthy, one degraded, all degraded → fallback to FAILOVER.

Story Points: 5
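
Example (illustrative): a minimal sketch of the score formula, the 0.8 delivery-rate cut-off, and a Smooth Weighted Round Robin picker. This sketch uses the round-robin variant named in the criteria, which is deterministic rather than random. The unit of dlrLatencyP95 is assumed to be seconds, and all type, class, and function names are hypothetical.

```typescript
interface QualityOperator {
  operatorId: string;
  baseWeight: number;
  deliveryRate: number; // 0..1
  dlrLatencyP95: number; // assumed to be in seconds
}

interface ScoredOperator {
  operatorId: string;
  score: number;
}

const MIN_DELIVERY_RATE = 0.8;

function qualityScore(op: QualityOperator): number {
  return op.baseWeight * op.deliveryRate * (1 / (1 + op.dlrLatencyP95));
}

// Drops operators below the delivery-rate threshold, then scores the rest.
function toScoredOperators(ops: readonly QualityOperator[]): ScoredOperator[] {
  return ops
    .filter((op) => op.deliveryRate >= MIN_DELIVERY_RATE)
    .map((op) => ({ operatorId: op.operatorId, score: qualityScore(op) }));
}

class SmoothWeightedPicker {
  private readonly current = new Map<string, number>();

  pick(entries: readonly ScoredOperator[]): string | null {
    if (entries.length === 0) return null; // Caller falls back to FAILOVER.
    let total = 0;
    let best = entries[0];
    let bestValue = -Infinity;
    for (const entry of entries) {
      // Every operator gains its score on each round.
      const value = (this.current.get(entry.operatorId) ?? 0) + entry.score;
      this.current.set(entry.operatorId, value);
      total += entry.score;
      if (value > bestValue) {
        best = entry;
        bestValue = value;
      }
    }
    // The winner pays back the sum of all scores, which spreads its turns out.
    this.current.set(best.operatorId, bestValue - total);
    return best.operatorId;
  }
}
```

With scores 5, 1, and 1 the picker returns the first operator five times in every seven picks, interleaved with the other two rather than in one burst.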


US-RE-018 — Per-operator quality decay window

Title: As a platform engineer, I want a 5-minute exponentially-weighted moving average on quality signals so that transient outliers don't flip routing.

Acceptance Criteria:

  • EWMA with α = 0.3 applied per signal.
  • Decay computed in-process (not in Redis).
  • Quality dashboard panel shows raw vs. EWMA.
  • Unit tests verify EWMA correctness over 60 samples.

Story Points: 3
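
Example (illustrative): a minimal in-process EWMA sketch with α = 0.3. Seeding the average with the first sample is an assumption of this sketch; the class name Ewma is hypothetical.

```typescript
class Ewma {
  private value: number | null = null;

  constructor(private readonly alpha: number = 0.3) {}

  // Folds one raw sample into the average and returns the smoothed value.
  update(sample: number): number {
    if (this.value === null) {
      this.value = sample; // Assumption: the first sample seeds the average.
    } else {
      // Equivalent to alpha * sample + (1 - alpha) * previous.
      this.value = this.value + this.alpha * (sample - this.value);
    }
    return this.value;
  }
}

const deliveryRateEwma = new Ewma(0.3);
deliveryRateEwma.update(0.95);
// A single outlier of 0.50 only pulls the average down to about 0.815.
const afterOutlier = deliveryRateEwma.update(0.5);
```

One Ewma instance per operator and per signal keeps the raw and smoothed series separable for the dashboard panel.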


US-RE-019 — ML-assisted weight learning (opt-in per route)

Title: As a platform engineer, I want an opt-in ML model that adjusts routing weights based on historical delivery outcomes per (destinationPrefix, hourOfDay, operatorId) tuple.

Acceptance Criteria:

  • Model: gradient-boosted regressor trained nightly on the prior 14 days from ClickHouse.
  • Model artifact stored in routing.ml_models table; loaded on service start.
  • Feature flag ROUTING_ML_ENABLED per route in routing.routes.ml_enabled.
  • Predictions clamped to [0.5×, 2×] baseline weights to prevent runaway shifts.
  • Shadow-mode comparison metric routing_ml_lift_ratio{prefix,hour} so impact can be measured before enabling.

Story Points: 8
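
Example (illustrative): a minimal sketch of the clamp applied to a model prediction, assuming both the baseline and the prediction are expressed as positive routing weights. The function name clampPredictedWeight is hypothetical.

```typescript
// Keeps the predicted weight within [0.5 x baseline, 2 x baseline].
function clampPredictedWeight(baseline: number, predicted: number): number {
  const lower = 0.5 * baseline;
  const upper = 2 * baseline;
  return Math.min(Math.max(predicted, lower), upper);
}

const cappedHigh = clampPredictedWeight(10, 30); // 20
const cappedLow = clampPredictedWeight(10, 2); // 5
const unchanged = clampPredictedWeight(10, 12); // 12
```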


US-RE-020 — Cost-quality joint optimisation

Title: As a finance stakeholder, I want a JOINT_COST_QUALITY strategy that balances cost and quality with a configurable weight λ so that we don't blindly route to the cheapest unhealthy operator.

Acceptance Criteria:

  • score = (1 - λ) × (1 / cost) + λ × deliveryRate × (1 / dlrLatencyP95).
  • λ configurable per route (default 0.5).
  • Unit tests cover λ=0 (pure cost), λ=1 (pure quality), λ=0.5 (balanced).

Story Points: 3
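
Example (illustrative): a minimal sketch of the joint score exactly as written in the criteria. The formula divides by cost and by dlrLatencyP95, so this sketch rejects non-positive values for either; that guard, the λ range check, and the names used are assumptions rather than part of the story.

```typescript
interface JointInputs {
  cost: number; // cost per message, must be > 0 in this sketch
  deliveryRate: number; // 0..1
  dlrLatencyP95: number; // must be > 0 in this sketch
}

function jointScore(op: JointInputs, lambda = 0.5): number {
  if (lambda < 0 || lambda > 1) {
    throw new RangeError("lambda must be within [0, 1]");
  }
  if (op.cost <= 0 || op.dlrLatencyP95 <= 0) {
    // Assumed guard: the formula is undefined when either divisor is zero.
    throw new RangeError("cost and dlrLatencyP95 must be positive");
  }
  const costTerm = 1 / op.cost;
  const qualityTerm = op.deliveryRate * (1 / op.dlrLatencyP95);
  return (1 - lambda) * costTerm + lambda * qualityTerm;
}

const sample: JointInputs = { cost: 0.004, deliveryRate: 0.9, dlrLatencyP95: 2 };
const pureCost = jointScore(sample, 0); // about 250
const pureQuality = jointScore(sample, 1); // about 0.45
const balanced = jointScore(sample); // about 125.225
```

With these sample inputs the cost term is several hundred times larger than the quality term, so the two terms may need normalising before λ = 0.5 behaves as a balanced setting.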


US-RE-021 — Live quality dashboard

Title: As the NOC, I want a Grafana panel showing live operator quality scores and the routing weights derived from them so that I can debug routing decisions.

Acceptance Criteria:

  • Panel: per-operator score over 24h with annotations on tier transitions.
  • Panel: per-route weight distribution (heatmap).
  • Linked alert RoutingQualityScoreCollapse fires when the scores of ≥ 2 operators drop below 0.5 simultaneously.

Story Points: 3


US-RE-022 — Strategy override audit trail

Title: As a compliance auditor, I want every QUALITY_WEIGHTED selection logged with the inputs used so that any disputed routing decision is reproducible.

Acceptance Criteria:

  • Sampled at 1-in-100; logged to routing.selection_audit (persistOperatorScores JSONB).
  • 30-day retention; archived to ClickHouse for long-term storage.
  • Query GET /v1/internal/routing/audit?messageId=... returns the snapshot.

Story Points: 3


EP-RE-06: Per-Tenant Route Preferences, Exclusions, and Regulatory Restrictions

Description: Tenants need to declare preferred operators, excluded operators (e.g., for compliance reasons), and route restrictions (e.g., regulator-mandated paths for certain destination ranges).


US-RE-023 — Tenant preference table and resolution

Title: As an enterprise tenant, I want to declare preferred operators per destination prefix so that my traffic prefers a specific route.

Acceptance Criteria:

  • Table routing.tenant_preferences (tenantId, prefix, preferredOperatorIds[], excludedOperatorIds[]).
  • Resolution order: tenant preference → strategy → default.
  • Excluded operators removed from candidate set before strategy runs.
  • Admin REST CRUD with audit log.

Story Points: 5


US-RE-024 — Regulatory route restrictions

Title: As a compliance officer, I want regulator-mandated routes for specific destination ranges so that legal mandates are enforced at routing time.

Acceptance Criteria:

  • Table routing.regulatory_routes (countryCode, destinationPrefix, mandatoryOperatorIds[], reason, regulatorRef).
  • Regulatory routes override tenant preferences and strategy selection.
  • Selection-audit log includes regulatoryOverride: true when applied.
  • CRUD restricted to platform.compliance.admin.

Story Points: 5


US-RE-025 — Grey-route exclusion list

Title: As a Trust & Safety engineer, I want a grey-route exclusion list maintained by fraud-intel-service so that operators identified as grey routes are temporarily removed.

Acceptance Criteria:

  • Consumer for fraud.grey_route.added.v1 and fraud.grey_route.removed.v1.
  • In-memory grey-route set refreshed on event; persisted to Redis.
  • Excluded operators removed from selection regardless of preferences.

Story Points: 3


US-RE-026 — Tenant-scoped strategy override

Title: As an enterprise tenant, I want to override the default routing strategy for my traffic (e.g., always FAILOVER for critical OTP).

Acceptance Criteria:

  • Field routing.tenant_preferences.strategyOverride (COST|PRIORITY|FAILOVER|QUALITY_WEIGHTED|JOINT_COST_QUALITY).
  • Validated against tenant tier (only ENTERPRISE may pin FAILOVER).
  • Surfaced in customer portal route preferences page.

Story Points: 3


US-RE-027 — Per-priority-lane strategy mapping

Title: As the routing-engine, I want priority lanes (P0..P4) to map to default strategies so that emergency and OTP traffic always favours quality.

Acceptance Criteria:

  • Lane→strategy defaults: P0 → FAILOVER, P1 → QUALITY_WEIGHTED, P2 → JOINT_COST_QUALITY (λ=0.7), P3 → COST, P4 → COST.
  • Tenant strategy override may not relax beyond tier (e.g., P3 cannot select FAILOVER).
  • Unit tests for lane→strategy mapping.

Story Points: 3
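
Example (illustrative): a minimal sketch of the lane-to-strategy defaults as a lookup table. The lane and strategy names come from the stories above; the LaneDefault shape and the defaultStrategyFor function are hypothetical.

```typescript
type Lane = "P0" | "P1" | "P2" | "P3" | "P4";

type Strategy =
  | "COST"
  | "PRIORITY"
  | "FAILOVER"
  | "QUALITY_WEIGHTED"
  | "JOINT_COST_QUALITY";

interface LaneDefault {
  strategy: Strategy;
  lambda?: number; // Only meaningful for JOINT_COST_QUALITY.
}

// Typing the table as Record<Lane, ...> makes the compiler flag a missing lane.
const LANE_DEFAULTS: Readonly<Record<Lane, LaneDefault>> = {
  P0: { strategy: "FAILOVER" },
  P1: { strategy: "QUALITY_WEIGHTED" },
  P2: { strategy: "JOINT_COST_QUALITY", lambda: 0.7 },
  P3: { strategy: "COST" },
  P4: { strategy: "COST" },
};

function defaultStrategyFor(lane: Lane): LaneDefault {
  return LANE_DEFAULTS[lane];
}

const otpDefault = defaultStrategyFor("P1"); // { strategy: "QUALITY_WEIGHTED" }
```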


US-RE-028 — Route exclusion explanations in selection-audit

Title: As an auditor, I want every excluded operator paired with a reason so that I can explain why a specific operator was not chosen.

Acceptance Criteria:

  • Selection-audit JSONB includes excluded: [{operatorId, reason}] array.
  • Reasons enumerated: UNHEALTHY, GREY_ROUTE, TENANT_EXCLUDED, REGULATORY_BLOCKED, QUALITY_BELOW_THRESHOLD.

Story Points: 3
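
Example (illustrative): a minimal sketch of how candidates could be split into eligible and excluded sets with one reason each. The order in which the checks run (regulatory first, quality last) is an assumption of this sketch, as are the context shape and the function names; only the reason values come from the criteria above.

```typescript
type ExclusionReason =
  | "UNHEALTHY"
  | "GREY_ROUTE"
  | "TENANT_EXCLUDED"
  | "REGULATORY_BLOCKED"
  | "QUALITY_BELOW_THRESHOLD";

interface CandidateOperator {
  operatorId: string;
  healthy: boolean;
  deliveryRate: number;
}

// Hypothetical bundle of the lookups needed to explain an exclusion.
interface ExclusionContext {
  mandatoryOperatorIds: ReadonlySet<string>; // empty = no regulatory route
  greyRouteIds: ReadonlySet<string>;
  tenantExcludedIds: ReadonlySet<string>;
  minDeliveryRate: number;
}

function exclusionReason(
  op: CandidateOperator,
  ctx: ExclusionContext,
): ExclusionReason | null {
  // Assumed precedence: the first matching rule supplies the reason.
  if (ctx.mandatoryOperatorIds.size > 0 && !ctx.mandatoryOperatorIds.has(op.operatorId)) {
    return "REGULATORY_BLOCKED";
  }
  if (ctx.greyRouteIds.has(op.operatorId)) return "GREY_ROUTE";
  if (ctx.tenantExcludedIds.has(op.operatorId)) return "TENANT_EXCLUDED";
  if (!op.healthy) return "UNHEALTHY";
  if (op.deliveryRate < ctx.minDeliveryRate) return "QUALITY_BELOW_THRESHOLD";
  return null;
}

function partitionCandidates(
  candidates: readonly CandidateOperator[],
  ctx: ExclusionContext,
) {
  const eligible: CandidateOperator[] = [];
  const excluded: { operatorId: string; reason: ExclusionReason }[] = [];
  for (const op of candidates) {
    const reason = exclusionReason(op, ctx);
    if (reason === null) eligible.push(op);
    else excluded.push({ operatorId: op.operatorId, reason });
  }
  return { eligible, excluded }; // excluded maps onto the audit JSONB array
}
```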


EP-RE-07: Time-of-Day / Hour-Bucket Cost Tables and Quiet-Window Honour

Description: MNO costs vary by hour and by day-of-week. Some destinations have regulator-mandated quiet windows (e.g., no marketing 22:00–06:00). The routing-engine must honour both.


US-RE-029 — Hour-bucket cost table

Title: As a finance stakeholder, I want operator costs configurable per hour-of-day per day-of-week so that off-peak savings are realised.

Acceptance Criteria:

  • Table routing.operator_cost_buckets (operatorId, prefix, dayOfWeek, hourOfDay, cost, currency).
  • Lookup at routing time uses tenant's IANA timezone (default Asia/Kabul).
  • Fallback to routing.routes.cost if no bucket entry.
  • Admin REST CRUD with bulk import via CSV.

Story Points: 5
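
Example (illustrative): a minimal sketch of the bucket lookup, using the standard Intl.DateTimeFormat API to convert an instant into the tenant's local day and hour. The day numbering (0 = Sunday), the CostBucket shape, and the function names are assumptions of this sketch; the Asia/Kabul default and the fallback to the route cost come from the criteria above.

```typescript
interface CostBucket {
  operatorId: string;
  prefix: string;
  dayOfWeek: number; // assumed convention: 0 = Sunday ... 6 = Saturday
  hourOfDay: number; // 0..23
  cost: number;
}

const WEEKDAYS = ["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"];

function localDayAndHour(
  instant: Date,
  timeZone = "Asia/Kabul",
): { dayOfWeek: number; hourOfDay: number } {
  const parts = new Intl.DateTimeFormat("en-US", {
    timeZone,
    weekday: "short",
    hour: "2-digit",
    hour12: false,
  }).formatToParts(instant);
  const weekday = parts.find((p) => p.type === "weekday")?.value ?? "";
  const hour = parts.find((p) => p.type === "hour")?.value ?? "0";
  // Some runtimes render midnight as "24" with hour12: false, hence the modulo.
  return {
    dayOfWeek: WEEKDAYS.indexOf(weekday),
    hourOfDay: parseInt(hour, 10) % 24,
  };
}

function resolveCost(
  buckets: readonly CostBucket[],
  operatorId: string,
  prefix: string,
  instant: Date,
  routeCost: number, // fallback value from routing.routes.cost
  timeZone?: string,
): number {
  const { dayOfWeek, hourOfDay } = localDayAndHour(instant, timeZone);
  const bucket = buckets.find(
    (b) =>
      b.operatorId === operatorId &&
      b.prefix === prefix &&
      b.dayOfWeek === dayOfWeek &&
      b.hourOfDay === hourOfDay,
  );
  return bucket ? bucket.cost : routeCost;
}
```

Asia/Kabul is UTC+04:30, so midnight UTC on Saturday 2026-04-18 resolves to the Saturday 04:00 bucket in this sketch.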


US-RE-030 — Regulator quiet-window honour

Title: As a compliance officer, I want marketing (P3) traffic blocked during regulator quiet windows so that we don't violate national rules.

Acceptance Criteria:

  • Table routing.quiet_windows (countryCode, lane, dayOfWeek, startHour, endHour, regulatorRef).
  • When lane=P3 and the current local time falls within a quiet window, selection is rejected and the message is deferred to the next allowed window.
  • Deferred messages re-published to lane.p3.outbound.deferred with notBefore timestamp.
  • Customer portal shows expected delivery time when deferred.

Story Points: 5
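
Example (illustrative): a minimal sketch of the quiet-window hour check, including windows that wrap past midnight such as 22:00 to 06:00. Treating endHour as exclusive, ignoring minutes, and leaving day-of-week matching to the caller are assumptions of this sketch; the function names are hypothetical.

```typescript
interface QuietWindowHours {
  startHour: number; // 0..23, inclusive
  endHour: number; // 0..23, exclusive (assumed)
}

function isWithinQuietWindow(localHour: number, w: QuietWindowHours): boolean {
  if (w.startHour === w.endHour) return false; // Assumed: empty window.
  if (w.startHour < w.endHour) {
    // Same-day window, e.g. 13 to 15.
    return localHour >= w.startHour && localHour < w.endHour;
  }
  // Overnight window, e.g. 22 to 6: late evening or early morning.
  return localHour >= w.startHour || localHour < w.endHour;
}

// Whole hours from the start of localHour until the window ends.
// A caller could use this to derive the notBefore timestamp.
function hoursUntilWindowEnd(localHour: number, w: QuietWindowHours): number {
  return (w.endHour - localHour + 24) % 24;
}

const night: QuietWindowHours = { startHour: 22, endHour: 6 };
const blockedAt23 = isWithinQuietWindow(23, night); // true
const allowedAt6 = isWithinQuietWindow(6, night); // false
const waitFrom23 = hoursUntilWindowEnd(23, night); // 7
```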


US-RE-031 — Cost-bucket validation

Title: As a finance stakeholder, I want bucket entries validated to prevent overlap and gaps so that pricing is unambiguous.

Acceptance Criteria:

  • On insert/update, the system checks that no other entry exists for the same (operator, prefix, day, hour).
  • Coverage report: GET /v1/admin/routing/cost-buckets/coverage?operatorId= returns missing day×hour cells.
  • CI lint pass fails if coverage < 100% for production routes.

Story Points: 2
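
Example (illustrative): a minimal sketch of the duplicate check and the coverage report for one (operator, prefix) pair, assuming a full grid means 7 days × 24 hours = 168 cells. The BucketCell shape and the function names are hypothetical.

```typescript
interface BucketCell {
  dayOfWeek: number; // 0..6
  hourOfDay: number; // 0..23
}

const cellKey = (c: BucketCell): string => `${c.dayOfWeek}:${c.hourOfDay}`;

// True when two entries claim the same day and hour.
function hasDuplicateCells(entries: readonly BucketCell[]): boolean {
  return new Set(entries.map(cellKey)).size !== entries.length;
}

// Lists every day and hour combination that has no bucket entry.
function missingCells(entries: readonly BucketCell[]): BucketCell[] {
  const present = new Set(entries.map(cellKey));
  const missing: BucketCell[] = [];
  for (let dayOfWeek = 0; dayOfWeek < 7; dayOfWeek++) {
    for (let hourOfDay = 0; hourOfDay < 24; hourOfDay++) {
      if (!present.has(cellKey({ dayOfWeek, hourOfDay }))) {
        missing.push({ dayOfWeek, hourOfDay });
      }
    }
  }
  return missing;
}

// Coverage as a fraction of the 168-cell grid; 1 means fully covered.
function coverageRatio(entries: readonly BucketCell[]): number {
  return (168 - missingCells(entries).length) / 168;
}
```

A CI lint step could fail the build whenever coverageRatio returns less than 1 for a production route.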


US-RE-032 — Cost-bucket admin dashboard

Title: As a finance stakeholder, I want a heatmap of operator costs by day×hour so I can spot anomalies.

Acceptance Criteria:

  • Heatmap component in admin-dashboard (paired with EP-ADMDASH-09).
  • Filters: operator, prefix, currency.
  • Export CSV.

Story Points: 3