Analytics Service — Jira Epic & User Stories

Status: populated Owner: Platform Engineering + Product Last updated: 2026-04-18

EP-ANLYT-01: Analytics Aggregation & Reporting Service

Epic summary: Build the analytics service that consumes billing and DLR events from NATS, aggregates them into hourly/daily PostgreSQL buckets, and exposes read-only internal REST endpoints for admin-dashboard and customer-portal.

Goal: Replace ad-hoc direct-DB queries from dashboards with a stable, performant, and well-defined analytics API. Provide per-operator, per-account, and platform-wide metrics at sub-second query latency.

Definition of Done: Admin-dashboard and customer-portal consume analytics API exclusively; ad-hoc orch/billing direct queries removed; security review passed.

US-ANLYT-01: Consume billing events and build hourly aggregates

As the platform, I want the analytics service to consume billing.events from NATS and upsert into hourly metric buckets, So that cost and message volume data is available for dashboards within 5 minutes of billing.

Acceptance Criteria:

AC1: sms.billed.v1 event consumed and upserted into anlyt.metrics_hourly for PLATFORM, OPERATOR, and ACCOUNT scopes.
AC2: Same eventId processed twice → counter unchanged after second process (idempotency).
AC3: BILLED status increments total_messages and total_cost; REFUNDED decrements.
AC4: NATS consumer lag ≤ 5 min under normal load.
AC5: Integration test billing-event-upsert.spec.ts passes.

Story points: 8

US-ANLYT-02: Consume DLR events and build operator performance aggregates

As the platform, I want the analytics service to consume sms.dlr.inbound events and aggregate delivery rates and latency per operator, So that operator health can be reported in the admin dashboard.

Acceptance Criteria:

AC1: DLR event with deliveryStatus=DELIVERED increments deliveredMessages in anlyt.operator_performance.
AC2: DLR event with deliveryStatus=FAILED increments failedMessages.
AC3: latencyMs accumulated for avg and P95 computation per hour bucket.
AC4: Idempotent: replayed DLR event does not change counters.
AC5: Integration test dlr-event-upsert.spec.ts passes.

Story points: 8

US-ANLYT-03: Hourly-to-daily rollup

As the platform, I want hourly metric buckets rolled up into daily buckets via a scheduled job, So that dashboards can query aggregated daily data without scanning hourly rows.

Acceptance Criteria:

AC1: RollUpHourlyToDailyUseCase runs every hour via CronJob.
AC2: Daily bucket sums match sum of corresponding hourly buckets exactly.
AC3: Rollup is idempotent — running twice for the same day produces the same result.
AC4: AnlytRollupFailed alert fires if job has not run in 3 h.
AC5: Integration test rollup.spec.ts passes.

Story points: 5

US-ANLYT-04: Platform summary API endpoint

As the admin dashboard, I want to call GET /v1/internal/analytics/summary to get platform-wide metrics for a date range, So that admins can see overall platform health at a glance.

Acceptance Criteria:

AC1: Returns totalMessages, deliveredMessages, failedMessages, deliveryRate, totalCost, activeAccounts, peakTps, avgLatencyMs, p95LatencyMs for requested window.
AC2: Defaults to today's data if no from/to supplied.
AC3: Window > 90 days returns 400 INVALID_WINDOW.
AC4: Response includes dataAsOf field indicating most recent aggregate timestamp.
AC5: P95 ≤ 500 ms for 90-day window query in load test.

Story points: 5

US-ANLYT-05: Operator performance API endpoint

As the admin dashboard, I want to call GET /v1/internal/analytics/operators/:id/performance for any operator, So that carrier relations team can review per-carrier performance and SLA compliance.

Acceptance Criteria:

AC1: Returns per-operator delivery rate, avg latency, P95 latency, error rate, peak TPS, total cost.
AC2: 404 if no data exists for the requested operatorId.
AC3: Supports granularity=hourly|daily.
AC4: Integration test summary-api.spec.ts covers operator endpoint.

Story points: 3

US-ANLYT-06: Account usage API endpoint

As the customer portal, I want to call GET /v1/internal/analytics/accounts/:id/usage to show customers their usage, So that customers can track their spend and delivery performance.

Acceptance Criteria:

AC1: Returns messagesSent, messagesDelivered, messagesFailed, deliveryRate, totalCost, avgCostPerMessage, daily breakdown array.
AC2: Caller cannot query a different account's data — X-Account-Id header mismatch returns 403.
AC3: No MSISDN or message body in response.
AC4: Integration test account-scope.spec.ts passes.

Story points: 5

US-ANLYT-07: Throughput and delivery-breakdown API endpoints

As the admin dashboard, I want throughput time series and delivery status breakdown endpoints, So that admins have a real-time view of platform activity and carrier split.

Acceptance Criteria:

AC1: GET /throughput returns time series with configurable resolution (1m, 5m, 1h).
AC2: GET /delivery-breakdown returns byStatus counts and byOperator table.
AC3: Both endpoints respond in P95 ≤ 500 ms.
AC4: 1m resolution documented as interpolated (not actual per-minute counters).

Story points: 5

US-ANLYT-08: Historical backfill from NATS stream replay

As the SRE team, I want to replay NATS stream history into analytics on initial deploy, So that dashboards have up to 7 days of historical data on day one.

Acceptance Criteria:

AC1: Seeking anlyt-billing-consumer to stream start causes service to process all retained events.
AC2: Idempotency prevents double-count if service was already partially deployed.
AC3: After backfill, consumer lag drops to ≤ 5 min.
AC4: Backfill procedure documented in MIGRATION_PLAN.md.

Story points: 3

US-ANLYT-09: Data accuracy reconciliation

As the finance team, I want analytics totals reconciled against billing totals daily, So that any discrepancy in reported costs is detected and investigated quickly.

Acceptance Criteria:

AC1: Daily CronJob compares anlyt.metrics_daily.total_cost (scope=PLATFORM) vs billing.daily_totals.
AC2: Divergence > 0.01% triggers AnlytReconciliationDivergence alert.
AC3: Reconciliation report written to anlyt.reconciliation_log table.
AC4: Manual re-run of reconciliation job documented in runbook.

Story points: 5

US-ANLYT-10: Service observability — metrics, traces, alerts

As the SRE team, I want full observability for the analytics service, So that processing lag and accuracy issues are detected before they affect dashboard users.

Acceptance Criteria:

AC1: All metric families in OBSERVABILITY.md exposed at /metrics.
AC2: OTel spans for all event processing and REST endpoints.
AC3: All 6 alerts configured with runbooks.
AC4: Consumer lag Grafana panel visible with alert threshold line.
AC5: Data freshness panel shows dataAsOf across all projections.

Story points: 5

Total EP-ANLYT-01 story points: 52

EP-ANLYT-02: ClickHouse Cold-Tier Analytics + Pre-Aggregated Cubes

Epic summary: Postgres is unfit for ad-hoc analytics over billions of CDR + DLR rows at national scale. ClickHouse provides sub-second queries on hot 90-d data and economical retention of 7 y of cold history. Pre-aggregated cubes reduce dashboard load.

US-ANLYT-11 — ClickHouse cluster bootstrap

As a data engineer, I want a 3-node ClickHouse cluster with replication and S3 cold-tier so we have a separate analytics store with predictable cost.

Acceptance Criteria:

AC1: Cluster: 3 nodes per region; ReplicatedMergeTree replication; ZooKeeper/Keeper quorum
AC2: S3-backed cold storage policy; lifecycle hot → warm (after 90 d) → cold (after 1 y)
AC3: Helm chart in infra/clickhouse/; secrets via Vault
AC4: SLO: P95 query < 2 s on 30-d window for tenant-scoped dashboards

Story points: 8

US-ANLYT-12 — DLR + CDR ingestion (NATS-to-ClickHouse bridge)

As the platform, I want a streaming ingest from NATS sms.dlr.inbound and cdr.generated.v1 into ClickHouse so data lag is < 60 s.

Acceptance Criteria:

AC1: Bridge service anlyt-ingest consumes durable NATS consumers
AC2: Batched inserts (1 s window or 10 000 rows) into anlyt.facts_dlr / anlyt.facts_cdr
AC3: Lag metric anlyt_ingest_lag_seconds < 60 s P95
AC4: At-least-once semantics with ClickHouse ReplacingMergeTree for dedup

Story points: 8

US-ANLYT-13 — Pre-aggregated cubes (per-tenant per-hour rollup)

As a dashboard developer, I want pre-computed hourly/daily cubes by (tenantId, mno, status, lane) so 90-d dashboards render instantly.

Acceptance Criteria:

AC1: Materialised views: cube_dlr_hourly, cube_cdr_daily, cube_compliance_hourly
AC2: Aggregation columns: count, sum, avg-latency, p50/p95/p99 (via quantileTDigest)
AC3: Cube freshness < 5 min behind raw
AC4: Storage budget: cubes ≤ 5% of raw table size

Story points: 5

US-ANLYT-14 — Tenant-scoped query API

As a customer-portal page, I want a query API that abstracts ClickHouse and enforces tenant scoping so portal devs don't write SQL.

Acceptance Criteria:

AC1: POST /v1/anlyt/query accepts a typed query DSL (no raw SQL)
AC2: Tenant scoping enforced server-side; row counts capped at 100 k per request
AC3: Cursor-based pagination
AC4: P95 ≤ 1 s for 30-d window queries

Story points: 5

US-ANLYT-15 — Cold-tier compliance audit query

As a regulator-portal-service user, I want to query 5-year compliance audit data with sub-30 s response for any (tenantId, dateRange) so audits are practical.

Acceptance Criteria:

AC1: cold ClickHouse partitions on S3; query latency < 30 s P95 for 5-y windows
AC2: Query API exposes audit slices to regulator-portal-service only (mTLS)
AC3: Result set capped 1 M rows; export-to-S3 for larger

Story points: 5

EP-ANLYT-03: Per-MNO Quality Dashboards (delivery rate, latency, cost)

Epic summary: The NOC and finance need at-a-glance visibility into per-MNO performance and cost.

US-ANLYT-16 — Per-MNO delivery-rate dashboard

As the NOC, I want per-MNO delivery rate per hour over the last 7 d so I can spot degradation early.

Acceptance Criteria:

AC1: Grafana dashboard dashboards/mno-quality.json
AC2: Panels: delivery rate (line), DLR latency P95 (heatmap), ESME error breakdown (stacked bar), TPS utilisation
AC3: One row per MNO; filters by lane, destination prefix
AC4: Alert links per panel to runbooks

Story points: 5

US-ANLYT-17 — `operator.quality.v1` publisher (consumed by routing-engine)

As the routing-engine, I want quality signals every 60 s per (mno, direction) so quality-weighted routing has live data.

Acceptance Criteria:

AC1: Cron 60 s aggregates from cube_dlr_hourly and emits operator.quality.v1
AC2: Payload: { mno, direction, deliveryRate, dlrLatencyP95, esmeErrorRate, sampleSize }
AC3: Consumed by EP-RE-05 (quality-adaptive routing)

Story points: 3

US-ANLYT-18 — Per-MNO cost vs. quality scatter

As finance, I want a daily scatter of per-MNO cost vs. quality so I can rebalance commercial allocation.

Acceptance Criteria:

AC1: Daily-refreshed scatter plot in finance dashboard
AC2: X axis: avg cost per delivered SMS; Y axis: delivery rate
AC3: Bubble size = volume

Story points: 3

US-ANLYT-19 — Tenant-facing MNO quality (anonymised)

As an enterprise tenant, I want to see anonymised per-MNO quality metrics so I can decide my route preferences.

Acceptance Criteria:

AC1: Customer-portal /insights/mno-quality shows anonymised MNO IDs (MNO-A, MNO-B, …)
AC2: Tenant can see only the MNOs their traffic actually uses
AC3: Real MNO names hidden unless tenant has signed a data-share addendum

Story points: 3

EP-ANLYT-01: Analytics Aggregation & Reporting Service​

US-ANLYT-01: Consume billing events and build hourly aggregates​

US-ANLYT-02: Consume DLR events and build operator performance aggregates​

US-ANLYT-03: Hourly-to-daily rollup​

US-ANLYT-04: Platform summary API endpoint​

US-ANLYT-05: Operator performance API endpoint​

US-ANLYT-06: Account usage API endpoint​

US-ANLYT-07: Throughput and delivery-breakdown API endpoints​

US-ANLYT-08: Historical backfill from NATS stream replay​

US-ANLYT-09: Data accuracy reconciliation​

US-ANLYT-10: Service observability — metrics, traces, alerts​

EP-ANLYT-02: ClickHouse Cold-Tier Analytics + Pre-Aggregated Cubes​

US-ANLYT-11 — ClickHouse cluster bootstrap​

US-ANLYT-12 — DLR + CDR ingestion (NATS-to-ClickHouse bridge)​

US-ANLYT-13 — Pre-aggregated cubes (per-tenant per-hour rollup)​

US-ANLYT-14 — Tenant-scoped query API​

US-ANLYT-15 — Cold-tier compliance audit query​

EP-ANLYT-03: Per-MNO Quality Dashboards (delivery rate, latency, cost)​

US-ANLYT-16 — Per-MNO delivery-rate dashboard​

US-ANLYT-17 — operator.quality.v1 publisher (consumed by routing-engine)​

US-ANLYT-18 — Per-MNO cost vs. quality scatter​

US-ANLYT-19 — Tenant-facing MNO quality (anonymised)​

EP-ANLYT-01: Analytics Aggregation & Reporting Service

US-ANLYT-01: Consume billing events and build hourly aggregates

US-ANLYT-02: Consume DLR events and build operator performance aggregates

US-ANLYT-03: Hourly-to-daily rollup

US-ANLYT-04: Platform summary API endpoint

US-ANLYT-05: Operator performance API endpoint

US-ANLYT-06: Account usage API endpoint

US-ANLYT-07: Throughput and delivery-breakdown API endpoints

US-ANLYT-08: Historical backfill from NATS stream replay

US-ANLYT-09: Data accuracy reconciliation

US-ANLYT-10: Service observability — metrics, traces, alerts

EP-ANLYT-02: ClickHouse Cold-Tier Analytics + Pre-Aggregated Cubes

US-ANLYT-11 — ClickHouse cluster bootstrap

US-ANLYT-12 — DLR + CDR ingestion (NATS-to-ClickHouse bridge)

US-ANLYT-13 — Pre-aggregated cubes (per-tenant per-hour rollup)

US-ANLYT-14 — Tenant-scoped query API

US-ANLYT-15 — Cold-tier compliance audit query

EP-ANLYT-03: Per-MNO Quality Dashboards (delivery rate, latency, cost)

US-ANLYT-16 — Per-MNO delivery-rate dashboard

US-ANLYT-17 — `operator.quality.v1` publisher (consumed by routing-engine)

US-ANLYT-18 — Per-MNO cost vs. quality scatter

US-ANLYT-19 — Tenant-facing MNO quality (anonymised)