
Operator Management Service — Jira Epic & User Stories

Status: populated
Owner: Platform Engineering + Product
Last updated: 2026-04-18


EP-OPS-01: Operator Configuration Management

Epic summary: Build the authoritative operator configuration service — CRUD for SMPP operators, routing rules, and TPS limits — with Vault-backed credential management and NATS change propagation.

Goal: Replace the manual spreadsheet + LastPass operator inventory with an auditable, API-driven configuration store that routing-engine and smpp-connector consume automatically.

Definition of Done: All routing-engine and smpp-connector config reads come from this service; legacy spreadsheet archived; security review passed.


US-OPS-01: Create a new SMPP operator via admin API

As a carrier relations admin, I want to create a new SMPP operator record via the admin REST API, So that the platform can route messages to the new carrier without manual configuration.

Acceptance Criteria:

  • AC1: POST /v1/admin/operators with valid payload returns 201 with operator object (no password field).
  • AC2: SMPP password is stored in Vault at secret/ops/operators/{id}/credentials; absent from PG.
  • AC3: Duplicate (host, port, systemId) returns 409 DUPLICATE_OPERATOR.
  • AC4: Vault write failure returns 503 and PG row is compensated (deleted).
  • AC5: operator.config.created.v1 NATS event published within 500 ms.
  • AC6: Integration test create-operator.spec.ts passes.

Story points: 8
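
The create flow in AC1 to AC4 can be sketched as a write followed by a compensating delete. This is a minimal illustration, not the service's implementation: `InMemoryOperatorStore`, `SecretWriter`, and `createOperator` are hypothetical names, the store is an in-memory stand-in for PG, and the Vault call is synchronous only to keep the example short.

```typescript
// Illustrative sketch only: in-memory stand-ins for PostgreSQL and Vault.
// All type, class, and function names here are hypothetical.
type OperatorInput = { host: string; port: number; systemId: string; password: string };
type Operator = { id: string; host: string; port: number; systemId: string };
type CreateResult =
  | { status: 201; body: Operator }
  | { status: 409; code: "DUPLICATE_OPERATOR" }
  | { status: 503 };

// Synchronous for brevity; a real Vault client call would be asynchronous.
interface SecretWriter {
  write(path: string, secret: { password: string }): void;
}

class InMemoryOperatorStore {
  rows: Operator[] = [];
  private nextId = 1;

  exists(host: string, port: number, systemId: string): boolean {
    return this.rows.some((o) => o.host === host && o.port === port && o.systemId === systemId);
  }

  insert(input: OperatorInput): Operator {
    // The password is deliberately not copied into the stored row (AC2).
    const op: Operator = {
      id: `op-${this.nextId++}`,
      host: input.host,
      port: input.port,
      systemId: input.systemId,
    };
    this.rows.push(op);
    return op;
  }

  remove(id: string): void {
    this.rows = this.rows.filter((o) => o.id !== id);
  }
}

function createOperator(
  store: InMemoryOperatorStore,
  vault: SecretWriter,
  input: OperatorInput,
): CreateResult {
  if (store.exists(input.host, input.port, input.systemId)) {
    return { status: 409, code: "DUPLICATE_OPERATOR" }; // AC3
  }
  const op = store.insert(input);
  try {
    vault.write(`secret/ops/operators/${op.id}/credentials`, { password: input.password });
  } catch (_err) {
    store.remove(op.id); // AC4: compensate the row when the Vault write fails
    return { status: 503 };
  }
  return { status: 201, body: op }; // AC1: response carries no password field
}
```

Writing the row first and deleting it on Vault failure mirrors the wording of AC4; the NATS publish in AC5 is left out of the sketch.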


US-OPS-02: Update operator configuration and rotate password

As a carrier relations admin, I want to update operator settings (host, port, TPS, etc.) and optionally rotate the SMPP password, So that I can respond to carrier-side changes without downtime.

Acceptance Criteria:

  • AC1: PATCH /v1/admin/operators/:id with partial payload returns 200 with updated object.
  • AC2: Password field present → Vault PUT issued; password not returned in response.
  • AC3: changedFields correctly enumerated in operator.config.updated.v1 NATS event.
  • AC4: Duplicate check on (host, port, systemId) when those fields change.
  • AC5: Updating deleted operator returns 404.

Story points: 5
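
One way to derive `changedFields` (AC3) and decide when the duplicate check in AC4 is needed is to diff the partial payload against the stored record. The sketch below assumes a hypothetical `OperatorSettings` shape and an `applyPatch` helper; neither name comes from the service, and password handling is left out.

```typescript
// Hypothetical settings shape; the real operator schema is not given in this document.
type OperatorSettings = { host: string; port: number; systemId: string; maxTps: number };
type SettingsKey = keyof OperatorSettings;
type PatchOutcome = {
  updated: OperatorSettings;
  changedFields: SettingsKey[];
  needsDuplicateCheck: boolean;
};

// Fields that form the uniqueness tuple (host, port, systemId).
const ENDPOINT_FIELDS: SettingsKey[] = ["host", "port", "systemId"];

function applyPatch(current: OperatorSettings, patch: Partial<OperatorSettings>): PatchOutcome {
  const updated: OperatorSettings = { ...current };
  const changedFields: SettingsKey[] = [];
  for (const key of Object.keys(patch) as SettingsKey[]) {
    const value = patch[key];
    // Skip absent values and values identical to the stored one.
    if (value === undefined || value === current[key]) continue;
    (updated as Record<SettingsKey, string | number>)[key] = value;
    changedFields.push(key);
  }
  // The duplicate check only matters when part of the uniqueness tuple changed.
  const needsDuplicateCheck = changedFields.some((f) => ENDPOINT_FIELDS.indexOf(f) !== -1);
  return { updated, changedFields, needsDuplicateCheck };
}
```

Fields sent with an unchanged value are not reported, so the event lists only real changes.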


US-OPS-03: Soft-delete operator with audit trail

As a carrier relations admin, I want to deactivate an operator without permanently deleting its record, So that I have a full audit trail of all operators ever used on the platform.

Acceptance Criteria:

  • AC1: DELETE /v1/admin/operators/:id sets deleted_at, sets status = INACTIVE, returns 204.
  • AC2: GET /v1/admin/operators/:id on soft-deleted returns 404.
  • AC3: GET /v1/admin/operators list excludes soft-deleted by default; ?includeDeleted=true shows them.
  • AC4: operator.config.deleted.v1 NATS event published.
  • AC5: Re-create with same (host, port, systemId) after delete is allowed (creates new operatorId).

Story points: 3


US-OPS-04: Internal credentials endpoint for smpp-connector

As the smpp-connector service, I want to retrieve SMPP credentials (systemId + password) for an operator via a secure internal endpoint, So that I can bind to the carrier SMPP server without storing secrets locally.

Acceptance Criteria:

  • AC1: GET /v1/internal/operators/:id/credentials (mTLS) returns operator metadata + Vault credentials.
  • AC2: Response includes password field (this is the only endpoint that returns it).
  • AC3: Unauthenticated call (no client cert) rejected with TLS handshake error.
  • AC4: Vault unavailable returns 503; smpp-connector falls back to in-memory cache.
  • AC5: Integration test credentials-endpoint.spec.ts passes.

Story points: 5


US-OPS-05: Routing rules CRUD with prefix conflict detection

As a carrier relations admin, I want to manage destination prefix routing rules per operator, So that the routing-engine can select the correct operator for each destination.

Acceptance Criteria:

  • AC1: POST /v1/admin/operators/:id/routing-rules creates rule; returns 201.
  • AC2: Prefix overlap with existing active rule returns 409 PREFIX_CONFLICT.
  • AC3: PATCH /v1/admin/operators/:id/routing-rules/:ruleId updates priority/weight/cost.
  • AC4: DELETE /v1/admin/operators/:id/routing-rules/:ruleId hard-deletes (no audit required).
  • AC5: operator.config.updated.v1 published on rule create/update/delete.
  • AC6: Integration test routing-rules.spec.ts passes.

Story points: 5
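
The conflict check in AC2 depends on what counts as an overlap, which this document does not define. The sketch below assumes that two prefixes overlap when one is a leading substring of the other, and that only active rules are compared; `RoutingRule`, `prefixesOverlap`, and `findPrefixConflict` are hypothetical names.

```typescript
// Hypothetical rule shape; only the fields needed for the check are modelled.
type RoutingRule = { ruleId: string; prefix: string; active: boolean };

// Assumed overlap rule: two prefixes conflict when one is a leading
// substring of the other, so "9379" conflicts with "93" and with "93791".
function prefixesOverlap(a: string, b: string): boolean {
  const shorter = a.length <= b.length ? a : b;
  const longer = a.length <= b.length ? b : a;
  return longer.slice(0, shorter.length) === shorter;
}

// Returns the first active rule that conflicts with the candidate, if any.
// A non-undefined result would map to 409 PREFIX_CONFLICT.
function findPrefixConflict(existing: RoutingRule[], candidatePrefix: string): RoutingRule | undefined {
  for (const rule of existing) {
    if (rule.active && prefixesOverlap(rule.prefix, candidatePrefix)) return rule;
  }
  return undefined;
}
```

If nested prefixes are meant to be allowed (longest-prefix matching), the overlap predicate would reduce to exact equality instead.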


US-OPS-06: Operator health state management

As the platform, I want the service to ingest health signals from smpp-connector, maintain authoritative health state, and propagate changes via NATS and Redis, So that routing-engine always routes to healthy operators within 1 second of a state change.

Acceptance Criteria:

  • AC1: Health inbound event from smpp-connector updates ops:health:{operatorId} Redis key (TTL 60 s) within 200 ms.
  • AC2: Health state transition logged to ops.operator_health_log.
  • AC3: operator.health.v1 NATS event published on state change (not on no-change heartbeat).
  • AC4: All state transitions (UNKNOWN→HEALTHY, HEALTHY→DEGRADED, DEGRADED→UNHEALTHY, UNHEALTHY→HEALTHY, DEGRADED→HEALTHY) covered by unit tests on HealthStateReducer.
  • AC5: Redis miss → routing-engine internal API fallback documented and tested.

Story points: 8
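
A reducer for the health state in AC3 and AC4 could look like the sketch below. It encodes only the five transitions that AC4 spells out and, as an assumption of this example, rejects any other change by leaving the state untouched. The function name `reduceHealth` and the `Reduction` shape are hypothetical; the document names only `HealthStateReducer`.

```typescript
type HealthState = "UNKNOWN" | "HEALTHY" | "DEGRADED" | "UNHEALTHY";

// Transitions spelled out in AC4. Treating everything else as invalid
// is an assumption of this sketch, not a stated requirement.
const ALLOWED: Record<HealthState, HealthState[]> = {
  UNKNOWN: ["HEALTHY"],
  HEALTHY: ["DEGRADED"],
  DEGRADED: ["UNHEALTHY", "HEALTHY"],
  UNHEALTHY: ["HEALTHY"],
};

type Reduction = { next: HealthState; publish: boolean; rejected: boolean };

function reduceHealth(current: HealthState, reported: HealthState): Reduction {
  if (reported === current) {
    // No-change heartbeat: state is kept and no NATS event is published (AC3).
    return { next: current, publish: false, rejected: false };
  }
  if (ALLOWED[current].indexOf(reported) === -1) {
    // Transition not in the table: keep the current state and flag it.
    return { next: current, publish: false, rejected: true };
  }
  return { next: reported, publish: true, rejected: false };
}
```

Keeping the reducer a pure function makes each transition a one-line unit test, which fits the coverage requirement in AC4.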


US-OPS-07: Legacy operator migration from spreadsheet

As the carrier relations team, I want all existing operator configs migrated into the new service, So that we can decommission the spreadsheet on day 1 of production launch.

Acceptance Criteria:

  • AC1: Migration script parses CSV export; Zod-validates each row.
  • AC2: Valid rows create operators (PG + Vault) via CreateOperatorUseCase.
  • AC3: Invalid rows reported in migration report (not silently skipped).
  • AC4: Dry-run mode available (validates + reports without writing).
  • AC5: Post-migration: routing-engine bootstraps from this service in staging and routes correctly.
  • AC6: Ops team sign-off on config accuracy in staging before production run.

Story points: 5
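
The dry-run and reporting behaviour in AC1 to AC4 can be sketched as follows. The example swaps Zod for a small hand-written validator so that it stays dependency-free, and it assumes a hypothetical three-column CSV (`host,port,systemId`) with a header row and no quoted fields; the real export format is not described here.

```typescript
// Hypothetical row and report shapes for illustration.
type Row = { host: string; port: number; systemId: string };
type RowError = { line: number; reason: string };
type MigrationReport = { valid: Row[]; invalid: RowError[]; written: number };

// Hand-written stand-in for the Zod schema mentioned in AC1.
function validateRow(cells: string[], line: number): Row | RowError {
  if (cells.length !== 3) return { line, reason: "expected 3 columns" };
  const host = cells[0].trim();
  const port = Number(cells[1]);
  const systemId = cells[2].trim();
  if (host === "") return { line, reason: "host is empty" };
  if (!(port >= 1 && port <= 65535) || Math.floor(port) !== port) {
    return { line, reason: "port must be an integer in 1..65535" };
  }
  if (systemId === "") return { line, reason: "systemId is empty" };
  return { host, port, systemId };
}

// `write` stands in for the call to CreateOperatorUseCase (AC2).
function migrate(csv: string, dryRun: boolean, write: (row: Row) => void): MigrationReport {
  const report: MigrationReport = { valid: [], invalid: [], written: 0 };
  const lines = csv.split(/\r?\n/);
  for (let i = 1; i < lines.length; i++) { // index 0 is the header row
    if (lines[i].trim() === "") continue;
    const result = validateRow(lines[i].split(","), i + 1);
    if ("reason" in result) {
      report.invalid.push(result); // AC3: reported, never silently skipped
      continue;
    }
    report.valid.push(result);
    if (!dryRun) { // AC4: dry-run validates and reports without writing
      write(result);
      report.written++;
    }
  }
  return report;
}
```

Running the same function with `dryRun` set to `true` and then `false` keeps the validation path identical between the rehearsal and the real run.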


US-OPS-08: Service observability — metrics, traces, alerts

As the SRE team, I want full observability on the operator management service, So that I can detect and resolve incidents quickly.

Acceptance Criteria:

  • AC1: All metric families defined in OBSERVABILITY.md exposed at /metrics.
  • AC2: OTel spans for all use cases visible in Tempo/Jaeger with correct parent propagation.
  • AC3: All 6 alerts defined in OBSERVABILITY.md configured in Alertmanager with runbooks linked.
  • AC4: Grafana dashboard includes: operator CRUD rates, Vault latency, NATS publish success, health state map.
  • AC5: /health/ready returns non-200 when Vault unreachable.

Story points: 5


Total EP-OPS-01 story points: 44


EP-OPS-02: Operator-ID Renaming with Zero In-Flight Loss

Epic summary: When an MNO renames its operator ID (rare but real — happened with Roshan in 2018), the platform must atomically swap the config and reroute in-flight traffic without dropping a single message.


US-OPS-09: Atomic config swap with in-flight drain

As a carrier relations admin, I want to rename an operator ID with a zero-loss swap procedure, So that in-flight messages bound to the old ID are not lost when the MNO renames its endpoint.

Acceptance Criteria:

  • AC1: POST /v1/admin/operators/:id/rename accepts { newId, effectiveAt, drainSeconds }; defaults drainSeconds=120
  • AC2: At effectiveAt - drainSeconds: stop accepting new dispatches for the old ID; existing in-flight (NATS subjects + SMPP windows) drain
  • AC3: At effectiveAt: atomic UPDATE of ops.operators.id; new bind under new ID created; old bind unbound after drain confirmation
  • AC4: All in-flight messages preserved (zero loss) verified by integration test
  • AC5: operator.config.renamed.v1 event emitted with old and new IDs
  • AC6: Rollback procedure documented in runbook

Story points: 8
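
The timing rules in AC1 to AC3 reduce to two timestamps: the drain start (`effectiveAt - drainSeconds`) and `effectiveAt` itself. The sketch below derives a phase from the clock alone; the phase names and the helpers `drainStartMs`, `renamePhase`, and `acceptsNewDispatches` are hypothetical, and drain confirmation is not modelled.

```typescript
// Payload fields follow AC1; effectiveAt is assumed to be an ISO 8601 string.
type RenameRequest = { newId: string; effectiveAt: string; drainSeconds?: number };
// Hypothetical phase names used only in this sketch.
type RenamePhase = "SCHEDULED" | "DRAINING" | "SWAP_DUE";

const DEFAULT_DRAIN_SECONDS = 120; // AC1 default

function drainStartMs(req: RenameRequest): number {
  const drain = req.drainSeconds === undefined ? DEFAULT_DRAIN_SECONDS : req.drainSeconds;
  return Date.parse(req.effectiveAt) - drain * 1000;
}

function renamePhase(req: RenameRequest, nowMs: number): RenamePhase {
  if (nowMs >= Date.parse(req.effectiveAt)) return "SWAP_DUE"; // AC3: swap happens at effectiveAt
  if (nowMs >= drainStartMs(req)) return "DRAINING"; // AC2: drain window is open
  return "SCHEDULED";
}

// New dispatches for the old ID are accepted only before the drain window opens.
function acceptsNewDispatches(req: RenameRequest, nowMs: number): boolean {
  return renamePhase(req, nowMs) === "SCHEDULED";
}
```

For example, with `effectiveAt` at 12:00:00 and the default `drainSeconds`, dispatches to the old ID stop at 11:58:00.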


US-OPS-10: Pre-rename validation

As a carrier relations admin, I want the rename request validated before execution, So that I don't accidentally rename to a conflicting ID or leave the platform in a broken state.

Acceptance Criteria:

  • AC1: newId must not collide with any active operator
  • AC2: effectiveAt must be ≥ now + 5 min (gives lead time for SREs to be online)
  • AC3: System checks: at least one healthy bind currently exists under old ID (else reject — can't drain a dead bind safely)
  • AC4: Validation report returned in response

Story points: 3
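
The three checks in AC1 to AC3 and the report in AC4 can be sketched as a pure function that collects every failed check rather than stopping at the first. The issue codes and the `RenameCheckInput` shape are hypothetical; the document does not specify the report format.

```typescript
// Hypothetical issue codes, one per acceptance criterion.
type RenameIssue = "NEW_ID_COLLISION" | "EFFECTIVE_AT_TOO_SOON" | "NO_HEALTHY_BIND";
type RenameCheckInput = {
  newId: string;
  effectiveAtMs: number;
  nowMs: number;
  activeOperatorIds: string[];
  healthyBindsOnOldId: number;
};
type ValidationReport = { ok: boolean; issues: RenameIssue[] };

const MIN_LEAD_MS = 5 * 60 * 1000; // AC2: at least 5 minutes of lead time

function validateRename(input: RenameCheckInput): ValidationReport {
  const issues: RenameIssue[] = [];
  // AC1: the new ID must not collide with an active operator.
  if (input.activeOperatorIds.indexOf(input.newId) !== -1) issues.push("NEW_ID_COLLISION");
  // AC2: effectiveAt must be at or after now + 5 min.
  if (input.effectiveAtMs < input.nowMs + MIN_LEAD_MS) issues.push("EFFECTIVE_AT_TOO_SOON");
  // AC3: at least one healthy bind must exist under the old ID.
  if (input.healthyBindsOnOldId < 1) issues.push("NO_HEALTHY_BIND");
  return { ok: issues.length === 0, issues };
}
```

Returning all issues at once lets the admin fix the request in a single round trip.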


US-OPS-11: Rename observability and runbook

As the NOC, I want real-time visibility into the rename progress, So that I can intervene if drain stalls.

Acceptance Criteria:

  • AC1: NOC dashboard shows: drain start time, in-flight count, drain ETA
  • AC2: PagerDuty incident auto-opened with the rename ticket as a reference
  • AC3: Runbook runbooks/operator-rename.md covers normal flow + abort + rollback

Story points: 3


EP-OPS-03: MNO Onboarding Playbook & Runbook Generator

Epic summary: Each new MNO bind requires a consistent onboarding playbook covering credentials, IP whitelisting, TPS contract, escalation tree, and runbook generation. Currently each onboarding is bespoke.


US-OPS-12: MNO onboarding wizard (admin UI)

As a carrier relations admin, I want a step-by-step wizard to onboard a new MNO, So that nothing is missed and the SRE handover is consistent.

Acceptance Criteria:

  • AC1: Wizard steps: 1. MNO contact + escalation tree, 2. SMPP credentials (Vault), 3. IP whitelist exchange, 4. TPS contract, 5. test bind, 6. production bind
  • AC2: Each step blocks until checklist complete
  • AC3: Wizard generates a populated runbook stub at runbooks/mno/{mno-id}.md
  • AC4: Final step emits operator.onboarding.complete.v1

Story points: 8


US-OPS-13: Per-MNO runbook template

As the NOC, I want a standardised per-MNO runbook template that the wizard populates, So that the on-call has consistent troubleshooting steps for every MNO.

Acceptance Criteria:

  • AC1: Template covers: bind procedure, TPS adjustment, common errors, escalation tree (T1/T2/T3 contacts), SLA, change windows
  • AC2: Wizard fills in known fields; SRE reviews + completes
  • AC3: Templates versioned in repo

Story points: 3


US-OPS-14: MNO onboarding GameDay

As the SRE team, I want a documented GameDay procedure for a new MNO, So that failover/drain/rebind/throttle behaviours are validated before production traffic flows.

Acceptance Criteria:

  • AC1: GameDay script: bind, send 1000 test messages, throttle on MNO side, observe back-off, force-disconnect, verify rebind, send 1000 more
  • AC2: Pass criteria documented; failure aborts onboarding
  • AC3: GameDay report archived in runbooks/mno/{mno-id}-gameday-{date}.md

Story points: 5


EP-OPS-04: TPS-Contract Compliance Auditor (cron + alerts)

Epic summary: Each MNO contract specifies a TPS budget. Exceeding the budget can result in fines or bind suspension. The platform must continuously audit actual vs. contracted TPS.


US-OPS-15: Per-MNO TPS contract registry

As the finance and carrier relations teams, we want TPS contracts captured with effective dates and per-bind breakdowns, So that the auditor has a single source of truth.

Acceptance Criteria:

  • AC1: Table ops.tps_contracts (mnoId, bindDirection, contractedTps, effectiveFrom, effectiveTo, contractRef, fineSchedule JSONB)
  • AC2: Admin REST CRUD; mandatory contract document attached (S3 ref)
  • AC3: Validation: no overlap of active contracts per (mno, direction)

Story points: 5
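
The overlap rule in AC3 can be sketched as an interval check scoped to (mno, direction). This example assumes dates are `YYYY-MM-DD` strings, that a `null` `effectiveTo` means open-ended, and that `effectiveTo` is exclusive, so a contract may start on the day the previous one ends; none of these conventions are stated in this document. The direction values in the usage are placeholders.

```typescript
// Subset of the ops.tps_contracts columns from AC1 needed for the check.
type TpsContract = {
  mnoId: string;
  bindDirection: string;
  contractedTps: number;
  effectiveFrom: string;       // assumed YYYY-MM-DD, inclusive
  effectiveTo: string | null;  // assumed YYYY-MM-DD, exclusive; null = open-ended
};

// Same-format ISO dates compare correctly as plain strings.
function periodsOverlap(a: TpsContract, b: TpsContract): boolean {
  const aCoversBStart = a.effectiveTo === null || b.effectiveFrom < a.effectiveTo;
  const bCoversAStart = b.effectiveTo === null || a.effectiveFrom < b.effectiveTo;
  return aCoversBStart && bCoversAStart;
}

// Returns the first existing contract that overlaps the candidate, if any.
function findOverlap(existing: TpsContract[], candidate: TpsContract): TpsContract | undefined {
  for (const c of existing) {
    const sameScope = c.mnoId === candidate.mnoId && c.bindDirection === candidate.bindDirection;
    if (sameScope && periodsOverlap(c, candidate)) return c;
  }
  return undefined;
}
```

In PostgreSQL the same rule could also be enforced with a range exclusion constraint, which would guard against concurrent inserts as well.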


US-OPS-16: Hourly TPS-vs-contract auditor

As a finance stakeholder, I want an hourly comparison of actual TPS vs. contracted TPS, So that breaches are caught and reported within the hour.

Acceptance Criteria:

  • AC1: An hourly cron job reads Prometheus smpp_window_inflight per (mno, direction) and compares it to the contracted TPS
  • AC2: If exceeded → emit ops.tps_contract.breached.v1; create ops.tps_breaches row
  • AC3: Daily breach report emailed to finance + carrier relations
  • AC4: Tenant-portal dashboard for affected tenants if breach was tenant-driven

Story points: 5


US-OPS-17: TPS-budget tenant allocation enforcement

As the platform, I want the TPS budget allocated to tenants to be summed and enforced, so we never accept work that would breach the contract, So that breach prevention is built in, not just retroactive.

Acceptance Criteria:

  • AC1: Sum of tenant routing.tenant_preferences.reservedTps per MNO ≤ contracted TPS
  • AC2: New tenant reservation rejected if it would push the reserved total above the contracted TPS
  • AC3: Platform-wide on-demand pool sized as contractedTps - sum(reserved)
  • AC4: Alert if on-demand pool < 10% of contracted

Story points: 5
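
The arithmetic behind AC1 to AC4 is small enough to show directly. The sketch below uses hypothetical names (`Reservation`, `poolStatus`, `canReserve`) and treats the contracted TPS for one MNO as a single number.

```typescript
// Hypothetical shapes; reservedTps mirrors routing.tenant_preferences.reservedTps.
type Reservation = { tenantId: string; reservedTps: number };
type PoolStatus = { reservedTotal: number; onDemandPool: number; lowPoolAlert: boolean };

function poolStatus(contractedTps: number, reservations: Reservation[]): PoolStatus {
  let reservedTotal = 0;
  for (const r of reservations) reservedTotal += r.reservedTps;
  // AC3: on-demand pool = contractedTps - sum(reserved)
  const onDemandPool = contractedTps - reservedTotal;
  // AC4: alert when the pool is below 10% of contracted (integer-safe form).
  const lowPoolAlert = onDemandPool * 10 < contractedTps;
  return { reservedTotal, onDemandPool, lowPoolAlert };
}

// AC1/AC2: accept a new reservation only if the reserved total stays
// at or below the contracted TPS.
function canReserve(contractedTps: number, reservations: Reservation[], requestedTps: number): boolean {
  if (requestedTps <= 0) return false;
  return poolStatus(contractedTps, reservations).reservedTotal + requestedTps <= contractedTps;
}
```

With a contracted TPS of 1000 and reservations of 400 and 300, the on-demand pool is 300, a further reservation of 300 is accepted, and one of 301 is rejected.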