Operator Management Service — Jira Epic & User Stories
Status: populated | Owner: Platform Engineering + Product | Last updated: 2026-04-18
EP-OPS-01: Operator Configuration Management
Epic summary: Build the authoritative operator configuration service — CRUD for SMPP operators, routing rules, and TPS limits — with Vault-backed credential management and NATS change propagation.
Goal: Replace the manual spreadsheet + LastPass operator inventory with an auditable, API-driven configuration store that routing-engine and smpp-connector consume automatically.
Definition of Done: All routing-engine and smpp-connector config reads come from this service; legacy spreadsheet archived; security review passed.
US-OPS-01: Create a new SMPP operator via admin API
As a carrier relations admin, I want to create a new SMPP operator record via the admin REST API, So that the platform can route messages to the new carrier without manual configuration.
Acceptance Criteria:
- AC1: POST /v1/admin/operators with valid payload returns 201 with operator object (no password field).
- AC2: SMPP password is stored in Vault at secret/ops/operators/{id}/credentials; absent from PG.
- AC3: Duplicate (host, port, systemId) returns 409 DUPLICATE_OPERATOR.
- AC4: Vault write failure returns 503 and PG row is compensated (deleted).
- AC5: operator.config.created.v1 NATS event published within 500 ms.
- AC6: Integration test create-operator.spec.ts passes.
Story points: 8
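The compensation in AC4 can be sketched as follows. This is a minimal illustration, not the service's implementation: the OperatorRepo and SecretStore interfaces, the OperatorRow shape, and the VAULT_UNAVAILABLE error string are hypothetical names introduced here; only the Vault path and the 201/503 status codes come from the acceptance criteria.

```typescript
// Hypothetical ports; the real service would back these with PostgreSQL and Vault.
interface OperatorRow { id: string; host: string; port: number; systemId: string }
interface OperatorRepo {
  insert(row: OperatorRow): Promise<void>;
  remove(id: string): Promise<void>;
}
interface SecretStore {
  write(path: string, data: { password: string }): Promise<void>;
}

type CreateResult =
  | { status: 201; operator: OperatorRow }
  | { status: 503; error: string };

async function createOperator(
  repo: OperatorRepo,
  vault: SecretStore,
  row: OperatorRow,
  password: string,
): Promise<CreateResult> {
  await repo.insert(row);
  try {
    await vault.write(`secret/ops/operators/${row.id}/credentials`, { password });
  } catch {
    // Compensate: delete the row so no operator exists without credentials (AC4).
    await repo.remove(row.id);
    return { status: 503, error: "VAULT_UNAVAILABLE" }; // error string is an assumption
  }
  // The password is deliberately absent from the returned object (AC1).
  return { status: 201, operator: row };
}
```

Writing the row first and deleting it on failure keeps the password out of PG entirely; the opposite order would risk an orphaned Vault secret instead.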
US-OPS-02: Update operator configuration and rotate password
As a carrier relations admin, I want to update operator settings (host, port, TPS, etc.) and optionally rotate the SMPP password, So that I can respond to carrier-side changes without downtime.
Acceptance Criteria:
- AC1: PATCH /v1/admin/operators/:id with partial payload returns 200 with updated object.
- AC2: Password field present → Vault PUT issued; password not returned in response.
- AC3: changedFields correctly enumerated in operator.config.updated.v1 NATS event.
- AC4: Duplicate check on (host, port, systemId) when those fields change.
- AC5: Updating deleted operator returns 404.
Story points: 5
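One way to derive changedFields for AC3, and to decide when the AC4 duplicate check is needed, is sketched below. The OperatorSettings shape is a hypothetical subset (the real schema is not given here), and the function names are illustrative.

```typescript
// Hypothetical subset of operator settings; the real schema is not shown in this document.
interface OperatorSettings { host: string; port: number; systemId: string; tps: number }

// Returns the names of fields whose value in the patch differs from the current record.
function changedFields(
  current: OperatorSettings,
  patch: Partial<OperatorSettings>,
): (keyof OperatorSettings)[] {
  const keys = Object.keys(patch) as (keyof OperatorSettings)[];
  return keys.filter((k) => patch[k] !== undefined && patch[k] !== current[k]);
}

// The duplicate check is only needed when part of the (host, port, systemId) triple changed.
function needsDuplicateCheck(changed: (keyof OperatorSettings)[]): boolean {
  return changed.some((k) => k === "host" || k === "port" || k === "systemId");
}
```

A field that is present in the patch but carries the same value as the stored record is not reported as changed, which keeps the NATS event free of no-op entries.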
US-OPS-03: Soft-delete operator with audit trail
As a carrier relations admin, I want to deactivate an operator without permanently deleting its record, So that I have a full audit trail of all operators ever used on the platform.
Acceptance Criteria:
- AC1: DELETE /v1/admin/operators/:id sets deleted_at, sets status = INACTIVE, returns 204.
- AC2: GET /v1/admin/operators/:id on soft-deleted returns 404.
- AC3: GET /v1/admin/operators list excludes soft-deleted by default; ?includeDeleted=true shows them.
- AC4: operator.config.deleted.v1 NATS event published.
- AC5: Re-create with same (host, port, systemId) after delete is allowed (creates new operatorId).
Story points: 3
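The soft-delete semantics above can be sketched with three small functions. The StoredOperator shape and the ACTIVE status value are assumptions made for illustration; only INACTIVE, deleted_at, and includeDeleted appear in the criteria.

```typescript
// Hypothetical in-memory shape; "ACTIVE" is an assumed counterpart to INACTIVE.
interface StoredOperator {
  operatorId: string;
  host: string;
  port: number;
  systemId: string;
  status: "ACTIVE" | "INACTIVE";
  deletedAt: Date | null;
}

// AC1: mark the record deleted and inactive instead of removing it.
function softDelete(op: StoredOperator, now: Date): StoredOperator {
  return { ...op, status: "INACTIVE", deletedAt: now };
}

// AC3: soft-deleted operators are hidden unless includeDeleted is set.
function listOperators(all: StoredOperator[], includeDeleted = false): StoredOperator[] {
  return includeDeleted ? all : all.filter((op) => op.deletedAt === null);
}

// AC5: soft-deleted rows are ignored, so the triple can be reused after a delete.
function isDuplicate(all: StoredOperator[], host: string, port: number, systemId: string): boolean {
  return all.some(
    (op) => op.deletedAt === null && op.host === host && op.port === port && op.systemId === systemId,
  );
}
```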
US-OPS-04: Internal credentials endpoint for smpp-connector
As the smpp-connector service, I want to retrieve SMPP credentials (systemId + password) for an operator via a secure internal endpoint, So that I can bind to the carrier SMPP server without storing secrets locally.
Acceptance Criteria:
- AC1: GET /v1/internal/operators/:id/credentials (mTLS) returns operator metadata + Vault credentials.
- AC2: Response includes password field (this is the only endpoint that returns it).
- AC3: Unauthenticated call (no client cert) rejected with TLS handshake error.
- AC4: Vault unavailable returns 503; smpp-connector falls back to in-memory cache.
- AC5: Integration test credentials-endpoint.spec.ts passes.
Story points: 5
US-OPS-05: Routing rules CRUD with prefix conflict detection
As a carrier relations admin, I want to manage destination prefix routing rules per operator, So that the routing-engine can select the correct operator for each destination.
Acceptance Criteria:
- AC1: POST /v1/admin/operators/:id/routing-rules creates rule; returns 201.
- AC2: Prefix overlap with existing active rule returns 409 PREFIX_CONFLICT.
- AC3: PATCH /v1/admin/operators/:id/routing-rules/:ruleId updates priority/weight/cost.
- AC4: DELETE /v1/admin/operators/:id/routing-rules/:ruleId hard-deletes (no audit required).
- AC5: operator.config.updated.v1 published on rule create/update/delete.
- AC6: Integration test routing-rules.spec.ts passes.
Story points: 5
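The story does not define "prefix overlap" precisely. The sketch below assumes two prefixes overlap when one is a leading substring of the other (so 93 conflicts with 9379), and that only active rules participate. The RoutingRule shape and function names are hypothetical.

```typescript
// Hypothetical rule shape; priority/weight/cost are omitted for brevity.
interface RoutingRule { ruleId: string; prefix: string; active: boolean }

// Assumed definition: prefixes overlap when one is a leading substring of the other.
function prefixesOverlap(a: string, b: string): boolean {
  return a.startsWith(b) || b.startsWith(a);
}

// Returns the first active rule that conflicts with the candidate prefix, if any.
function findPrefixConflict(existing: RoutingRule[], candidate: string): RoutingRule | undefined {
  return existing.find((r) => r.active && prefixesOverlap(r.prefix, candidate));
}
```

If nested prefixes are meant to be allowed (for longest-prefix matching), the overlap test would reduce to plain equality; that decision belongs to the story owner.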
US-OPS-06: Operator health state management
As the platform, I want the service to ingest health signals from smpp-connector, maintain authoritative health state, and propagate changes via NATS and Redis, So that routing-engine routes only to healthy operators within 1 second of a state change.
Acceptance Criteria:
- AC1: Health inbound event from smpp-connector updates ops:health:{operatorId} Redis key (TTL 60 s) within 200 ms.
- AC2: Health state transition logged to ops.operator_health_log.
- AC3: operator.health.v1 NATS event published on state change (not on no-change heartbeat).
- AC4: All listed state transitions (UNKNOWN→HEALTHY, HEALTHY→DEGRADED, DEGRADED→UNHEALTHY, UNHEALTHY→HEALTHY, DEGRADED→HEALTHY) covered by unit tests on HealthStateReducer.
- AC5: Redis miss → routing-engine internal API fallback documented and tested.
Story points: 8
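A reducer covering the transitions enumerated in AC4 might look like the sketch below. It is not the actual HealthStateReducer: the function name, the result shape, and the choice to throw on any transition not listed in AC4 are assumptions.

```typescript
type HealthState = "UNKNOWN" | "HEALTHY" | "DEGRADED" | "UNHEALTHY";

// Only the transitions enumerated in AC4; anything else is treated as invalid in this sketch.
const ALLOWED_TRANSITIONS: ReadonlyArray<readonly [HealthState, HealthState]> = [
  ["UNKNOWN", "HEALTHY"],
  ["HEALTHY", "DEGRADED"],
  ["DEGRADED", "UNHEALTHY"],
  ["UNHEALTHY", "HEALTHY"],
  ["DEGRADED", "HEALTHY"],
];

interface ReduceResult { state: HealthState; changed: boolean }

function reduceHealth(current: HealthState, reported: HealthState): ReduceResult {
  // A heartbeat that reports the current state changes nothing, so no event is published (AC3).
  if (current === reported) return { state: current, changed: false };
  const allowed = ALLOWED_TRANSITIONS.some(([from, to]) => from === current && to === reported);
  if (!allowed) throw new Error(`Invalid health transition ${current} -> ${reported}`);
  return { state: reported, changed: true };
}
```

The caller would publish operator.health.v1 and append to ops.operator_health_log only when changed is true.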
US-OPS-07: Legacy operator migration from spreadsheet
As the carrier relations team, I want all existing operator configs migrated into the new service, So that we can decommission the spreadsheet on day 1 of production launch.
Acceptance Criteria:
- AC1: Migration script parses CSV export; Zod-validates each row.
- AC2: Valid rows create operators (PG + Vault) via CreateOperatorUseCase.
- AC3: Invalid rows reported in migration report (not silently skipped).
- AC4: Dry-run mode available (validates + reports without writing).
- AC5: Post-migration: routing-engine bootstraps from this service in staging and routes correctly.
- AC6: Ops team sign-off on config accuracy in staging before production run.
Story points: 5
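The dry-run and reporting behaviour (AC3, AC4) can be sketched as below. To keep the example dependency-free, a hand-written validateRow stands in for the Zod schema named in AC1. The CSV columns, the report shape, and the create callback (a stand-in for CreateOperatorUseCase) are all assumptions.

```typescript
// Assumed CSV columns; the real export layout is not specified in this document.
interface CsvRow { host: string; port: string; systemId: string; password: string }
interface MigrationReport { valid: number; invalid: { line: number; reason: string }[] }

// Stand-in for the Zod schema: returns a rejection reason, or null if the row is valid.
function validateRow(row: CsvRow): string | null {
  const port = Number(row.port);
  if (!row.host.trim()) return "host is empty";
  if (!Number.isInteger(port) || port < 1 || port > 65535) return "port is not a valid TCP port";
  if (!row.systemId.trim()) return "systemId is empty";
  if (!row.password) return "password is empty";
  return null;
}

function migrate(rows: CsvRow[], dryRun: boolean, create: (row: CsvRow) => void): MigrationReport {
  const report: MigrationReport = { valid: 0, invalid: [] };
  rows.forEach((row, i) => {
    const reason = validateRow(row);
    if (reason !== null) {
      // Report rather than skip silently; +2 accounts for the header line and 1-based numbering.
      report.invalid.push({ line: i + 2, reason });
      return;
    }
    report.valid += 1;
    if (!dryRun) create(row); // a dry run validates and reports without writing
  });
  return report;
}
```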
US-OPS-08: Service observability — metrics, traces, alerts
As the SRE team, I want full observability on the operator management service, So that I can detect and resolve incidents quickly.
Acceptance Criteria:
- AC1: All metric families defined in OBSERVABILITY.md exposed at /metrics.
- AC2: OTel spans for all use cases visible in Tempo/Jaeger with correct parent propagation.
- AC3: All 6 alerts defined in OBSERVABILITY.md configured in Alertmanager with runbooks linked.
- AC4: Grafana dashboard includes: operator CRUD rates, Vault latency, NATS publish success, health state map.
- AC5: /health/ready returns non-200 when Vault unreachable.
Story points: 5
Total EP-OPS-01 story points: 44
EP-OPS-02: Operator-ID Renaming with Zero In-Flight Loss
Epic summary: When an MNO renames its operator ID (rare but real — happened with Roshan in 2018), the platform must atomically swap the config and reroute in-flight traffic without dropping a single message.
US-OPS-09: Atomic config swap with in-flight drain
As a carrier relations admin, I want to rename an operator ID with a zero-loss swap procedure, So that in-flight messages bound to the old ID are not lost when the MNO renames its endpoint.
Acceptance Criteria:
- AC1: POST /v1/admin/operators/:id/rename accepts { newId, effectiveAt, drainSeconds }; drainSeconds defaults to 120
- AC2: At effectiveAt - drainSeconds: stop accepting new dispatches for the old ID; existing in-flight (NATS subjects + SMPP windows) drain
- AC3: At effectiveAt: atomic UPDATE of ops.operators.id; new bind under new ID created; old bind unbound after drain confirmation
- AC4: All in-flight messages preserved (zero loss) verified by integration test
- AC5: operator.config.renamed.v1 event emitted with old and new IDs
- AC6: Rollback procedure documented in runbook
Story points: 8
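The timeline implied by AC2 and AC3 reduces to a comparison against two instants. The sketch below is illustrative only; the phase names PENDING, DRAINING, and SWAP_DUE are hypothetical labels, not part of the story.

```typescript
// Hypothetical phase labels for the rename timeline.
type RenamePhase = "PENDING" | "DRAINING" | "SWAP_DUE";

function renamePhase(now: Date, effectiveAt: Date, drainSeconds = 120): RenamePhase {
  const drainStart = effectiveAt.getTime() - drainSeconds * 1000;
  if (now.getTime() < drainStart) return "PENDING"; // old ID still accepts new dispatches
  if (now.getTime() < effectiveAt.getTime()) return "DRAINING"; // no new dispatches; in-flight drains
  return "SWAP_DUE"; // atomic ID update and rebind are due
}
```

With effectiveAt at 00:10:00 and the default drainSeconds of 120, draining starts at 00:08:00.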
US-OPS-10: Pre-rename validation
As a carrier relations admin, I want the rename request validated before execution, So that I don't accidentally rename to a conflicting ID or leave the platform in a broken state.
Acceptance Criteria:
- AC1: newId must not collide with any active operator
- AC2: effectiveAt must be ≥ now + 5 min (gives lead time for SREs to be online)
- AC3: System checks: at least one healthy bind currently exists under old ID (else reject — can't drain a dead bind safely)
- AC4: Validation report returned in response
Story points: 3
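The three checks and the validation report (AC1 to AC4) can be sketched as a pure function. The request and report shapes and the failure codes are hypothetical; the 5-minute lead time comes from AC2.

```typescript
interface RenameRequest { newId: string; effectiveAt: Date }
interface ValidationReport { ok: boolean; failures: string[] }

const MIN_LEAD_MS = 5 * 60 * 1000; // AC2: effectiveAt must be at least now + 5 min

function validateRename(
  req: RenameRequest,
  activeIds: ReadonlySet<string>,
  healthyBindsOnOldId: number,
  now: Date,
): ValidationReport {
  const failures: string[] = []; // failure codes below are illustrative
  if (activeIds.has(req.newId)) failures.push("NEW_ID_COLLISION");
  if (req.effectiveAt.getTime() < now.getTime() + MIN_LEAD_MS) failures.push("EFFECTIVE_AT_TOO_SOON");
  if (healthyBindsOnOldId < 1) failures.push("NO_HEALTHY_BIND");
  return { ok: failures.length === 0, failures };
}
```

Collecting every failure instead of stopping at the first gives the admin a complete report in a single round trip.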
US-OPS-11: Rename observability and runbook
As the NOC, I want real-time visibility into the rename progress, So that I can intervene if drain stalls.
Acceptance Criteria:
- AC1: NOC dashboard shows: drain start time, in-flight count, drain ETA
- AC2: PagerDuty incident auto-opened with the rename ticket as a reference
- AC3: Runbook runbooks/operator-rename.md covers normal flow + abort + rollback
Story points: 3
EP-OPS-03: MNO Onboarding Playbook & Runbook Generator
Epic summary: Each new MNO bind requires a consistent onboarding playbook covering credentials, IP whitelisting, TPS contract, escalation tree, and runbook generation. Currently each onboarding is bespoke.
US-OPS-12: MNO onboarding wizard (admin UI)
As a carrier relations admin, I want a step-by-step wizard to onboard a new MNO, So that nothing is missed and the SRE handover is consistent.
Acceptance Criteria:
- AC1: Wizard steps: 1. MNO contact + escalation tree, 2. SMPP credentials (Vault), 3. IP whitelist exchange, 4. TPS contract, 5. test bind, 6. production bind
- AC2: Each step blocks until checklist complete
- AC3: Wizard generates a populated runbook stub at runbooks/mno/{mno-id}.md
- AC4: Final step emits operator.onboarding.complete.v1
Story points: 8
US-OPS-13: Per-MNO runbook template
As the NOC, I want a standardised per-MNO runbook template that the wizard populates, So that the on-call has consistent troubleshooting steps for every MNO.
Acceptance Criteria:
- AC1: Template covers: bind procedure, TPS adjustment, common errors, escalation tree (T1/T2/T3 contacts), SLA, change windows
- AC2: Wizard fills in known fields; SRE reviews + completes
- AC3: Templates versioned in repo
Story points: 3
US-OPS-14: MNO onboarding GameDay
As the SRE team, I want a documented GameDay procedure for a new MNO, So that failover/drain/rebind/throttle behaviours are validated before production traffic flows.
Acceptance Criteria:
- AC1: GameDay script: bind, send 1000 test messages, throttle on MNO side, observe back-off, force-disconnect, verify rebind, send 1000 more
- AC2: Pass criteria documented; failure aborts onboarding
- AC3: GameDay report archived in runbooks/mno/{mno-id}-gameday-{date}.md
Story points: 5
EP-OPS-04: TPS-Contract Compliance Auditor (cron + alerts)
Epic summary: Each MNO contract specifies a TPS budget. Exceeding the budget can result in fines or bind suspension. The platform must continuously audit actual vs. contracted TPS.
US-OPS-15: Per-MNO TPS contract registry
As the finance and carrier relations teams, I want TPS contracts captured with effective dates and per-bind breakdowns, So that the auditor has a single source of truth.
Acceptance Criteria:
- AC1: Table ops.tps_contracts (mnoId, bindDirection, contractedTps, effectiveFrom, effectiveTo, contractRef, fineSchedule JSONB)
- AC2: Admin REST CRUD; mandatory contract document attached (S3 ref)
- AC3: Validation: no overlap of active contracts per (mno, direction)
Story points: 5
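The overlap validation in AC3 is an interval-intersection test per (mno, direction). The sketch below assumes half-open validity intervals [effectiveFrom, effectiveTo) and treats a null effectiveTo as open-ended; neither convention is stated in the story. Only a subset of the table columns is modelled.

```typescript
// Subset of the ops.tps_contracts columns relevant to overlap detection.
interface TpsContract {
  mnoId: string;
  bindDirection: string;
  contractedTps: number;
  effectiveFrom: Date;
  effectiveTo: Date | null; // null is assumed to mean "no end date"
}

// Half-open intervals: a contract ending exactly when another starts does not overlap it.
function contractsOverlap(a: TpsContract, b: TpsContract): boolean {
  if (a.mnoId !== b.mnoId || a.bindDirection !== b.bindDirection) return false;
  const aEnd = a.effectiveTo?.getTime() ?? Infinity;
  const bEnd = b.effectiveTo?.getTime() ?? Infinity;
  return a.effectiveFrom.getTime() < bEnd && b.effectiveFrom.getTime() < aEnd;
}
```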
US-OPS-16: Hourly TPS-vs-contract auditor
As a finance stakeholder, I want an hourly comparison of actual TPS vs. contracted TPS, So that breaches are caught and reported within the hour.
Acceptance Criteria:
- AC1: Cron every hour reads Prometheus smpp_window_inflight per (mno, direction); compares to contracted TPS
- AC2: If exceeded → emit ops.tps_contract.breached.v1; create ops.tps_breaches row
- AC3: Daily breach report emailed to finance + carrier relations
- AC4: Tenant-portal dashboard for affected tenants if breach was tenant-driven
Story points: 5
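The comparison step of the auditor can be sketched as a pure function over values already read from Prometheus. The sample and breach shapes, the key format, and the choice to skip pairs with no registered contract are assumptions; querying Prometheus and emitting the NATS event are left out.

```typescript
interface TpsSample { mnoId: string; bindDirection: string; observedTps: number }
interface TpsBreach extends TpsSample { contractedTps: number }

// Hypothetical lookup key for the contracted-TPS table.
function contractKey(mnoId: string, bindDirection: string): string {
  return `${mnoId}:${bindDirection}`;
}

// Returns one breach per sample whose observed TPS exceeds the contracted TPS.
function findBreaches(samples: TpsSample[], contracted: Record<string, number>): TpsBreach[] {
  const breaches: TpsBreach[] = [];
  for (const s of samples) {
    const limit = contracted[contractKey(s.mnoId, s.bindDirection)];
    // Pairs without a registered contract are skipped in this sketch.
    if (limit !== undefined && s.observedTps > limit) {
      breaches.push({ ...s, contractedTps: limit });
    }
  }
  return breaches;
}
```

Each returned item would map to one ops.tps_breaches row and one ops.tps_contract.breached.v1 event.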
US-OPS-17: Tenant TPS-budget allocation enforcement
As the platform, I want tenant TPS reservations summed and enforced against the contracted budget so we never accept work that would breach the contract, So that breach prevention is built in, not just retroactive.
Acceptance Criteria:
- AC1: Sum of tenant routing.tenant_preferences.reservedTps per MNO ≤ contracted TPS
- AC2: New tenant reservation rejected if it would exceed the contracted TPS
- AC3: Platform-wide on-demand pool sized as contractedTps - sum(reserved)
- AC4: Alert if on-demand pool < 10% of contracted
Story points: 5
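The arithmetic behind these criteria is small enough to show directly. The function names below are illustrative; the formulas follow AC1 to AC4.

```typescript
// AC3: on-demand pool = contractedTps - sum(reserved).
function onDemandPool(contractedTps: number, reserved: number[]): number {
  return contractedTps - reserved.reduce((sum, r) => sum + r, 0);
}

// AC1/AC2: a new reservation is accepted only if the reserved total stays within the contract.
function canReserve(contractedTps: number, reserved: number[], requested: number): boolean {
  return onDemandPool(contractedTps, reserved) - requested >= 0;
}

// AC4: alert when the on-demand pool drops below 10% of the contracted TPS.
function poolAlert(contractedTps: number, reserved: number[]): boolean {
  return onDemandPool(contractedTps, reserved) < 0.1 * contractedTps;
}
```

For example, with a contracted TPS of 1000 and reservations of 300 and 500, the on-demand pool is 200: a further reservation of 200 is accepted, one of 201 is rejected, and no alert fires because 200 is above the 100 TPS threshold.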