Analytics Service — Jira Epic & User Stories
Status: populated Owner: Platform Engineering + Product Last updated: 2026-04-18
EP-ANLYT-01: Analytics Aggregation & Reporting Service
Epic summary: Build the analytics service that consumes billing and DLR events from NATS, aggregates them into hourly/daily PostgreSQL buckets, and exposes read-only internal REST endpoints for admin-dashboard and customer-portal.
Goal: Replace ad-hoc direct-DB queries from dashboards with a stable, performant, and well-defined analytics API. Provide per-operator, per-account, and platform-wide metrics at sub-second query latency.
Definition of Done: Admin-dashboard and customer-portal consume analytics API exclusively; ad-hoc orch/billing direct queries removed; security review passed.
US-ANLYT-01: Consume billing events and build hourly aggregates
As the platform,
I want the analytics service to consume billing.events from NATS and upsert into hourly metric buckets,
So that cost and message volume data is available for dashboards within 5 minutes of billing.
Acceptance Criteria:
- AC1:
sms.billed.v1event consumed and upserted intoanlyt.metrics_hourlyfor PLATFORM, OPERATOR, and ACCOUNT scopes. - AC2: Same
eventIdprocessed twice → counter unchanged after second process (idempotency). - AC3: BILLED status increments
total_messagesandtotal_cost; REFUNDED decrements. - AC4: NATS consumer lag ≤ 5 min under normal load.
- AC5: Integration test
billing-event-upsert.spec.tspasses.
Story points: 8
US-ANLYT-02: Consume DLR events and build operator performance aggregates
As the platform,
I want the analytics service to consume sms.dlr.inbound events and aggregate delivery rates and latency per operator,
So that operator health can be reported in the admin dashboard.
Acceptance Criteria:
- AC1: DLR event with
deliveryStatus=DELIVEREDincrementsdeliveredMessagesinanlyt.operator_performance. - AC2: DLR event with
deliveryStatus=FAILEDincrementsfailedMessages. - AC3:
latencyMsaccumulated for avg and P95 computation per hour bucket. - AC4: Idempotent: replayed DLR event does not change counters.
- AC5: Integration test
dlr-event-upsert.spec.tspasses.
Story points: 8
US-ANLYT-03: Hourly-to-daily rollup
As the platform, I want hourly metric buckets rolled up into daily buckets via a scheduled job, So that dashboards can query aggregated daily data without scanning hourly rows.
Acceptance Criteria:
- AC1:
RollUpHourlyToDailyUseCaseruns every hour via CronJob. - AC2: Daily bucket sums match sum of corresponding hourly buckets exactly.
- AC3: Rollup is idempotent — running twice for the same day produces the same result.
- AC4:
AnlytRollupFailedalert fires if job has not run in 3 h. - AC5: Integration test
rollup.spec.tspasses.
Story points: 5
US-ANLYT-04: Platform summary API endpoint
As the admin dashboard,
I want to call GET /v1/internal/analytics/summary to get platform-wide metrics for a date range,
So that admins can see overall platform health at a glance.
Acceptance Criteria:
- AC1: Returns
totalMessages,deliveredMessages,failedMessages,deliveryRate,totalCost,activeAccounts,peakTps,avgLatencyMs,p95LatencyMsfor requested window. - AC2: Defaults to today's data if no
from/tosupplied. - AC3: Window > 90 days returns 400
INVALID_WINDOW. - AC4: Response includes
dataAsOffield indicating most recent aggregate timestamp. - AC5: P95 ≤ 500 ms for 90-day window query in load test.
Story points: 5
US-ANLYT-05: Operator performance API endpoint
As the admin dashboard,
I want to call GET /v1/internal/analytics/operators/:id/performance for any operator,
So that carrier relations team can review per-carrier performance and SLA compliance.
Acceptance Criteria:
- AC1: Returns per-operator delivery rate, avg latency, P95 latency, error rate, peak TPS, total cost.
- AC2: 404 if no data exists for the requested
operatorId. - AC3: Supports
granularity=hourly|daily. - AC4: Integration test
summary-api.spec.tscovers operator endpoint.
Story points: 3
US-ANLYT-06: Account usage API endpoint
As the customer portal,
I want to call GET /v1/internal/analytics/accounts/:id/usage to show customers their usage,
So that customers can track their spend and delivery performance.
Acceptance Criteria:
- AC1: Returns
messagesSent,messagesDelivered,messagesFailed,deliveryRate,totalCost,avgCostPerMessage, daily breakdown array. - AC2: Caller cannot query a different account's data —
X-Account-Idheader mismatch returns 403. - AC3: No MSISDN or message body in response.
- AC4: Integration test
account-scope.spec.tspasses.
Story points: 5
US-ANLYT-07: Throughput and delivery-breakdown API endpoints
As the admin dashboard, I want throughput time series and delivery status breakdown endpoints, So that admins have a real-time view of platform activity and carrier split.
Acceptance Criteria:
- AC1:
GET /throughputreturns time series with configurable resolution (1m, 5m, 1h). - AC2:
GET /delivery-breakdownreturnsbyStatuscounts andbyOperatortable. - AC3: Both endpoints respond in P95 ≤ 500 ms.
- AC4:
1mresolution documented as interpolated (not actual per-minute counters).
Story points: 5
US-ANLYT-08: Historical backfill from NATS stream replay
As the SRE team, I want to replay NATS stream history into analytics on initial deploy, So that dashboards have up to 7 days of historical data on day one.
Acceptance Criteria:
- AC1: Seeking
anlyt-billing-consumerto stream start causes service to process all retained events. - AC2: Idempotency prevents double-count if service was already partially deployed.
- AC3: After backfill, consumer lag drops to ≤ 5 min.
- AC4: Backfill procedure documented in MIGRATION_PLAN.md.
Story points: 3
US-ANLYT-09: Data accuracy reconciliation
As the finance team, I want analytics totals reconciled against billing totals daily, So that any discrepancy in reported costs is detected and investigated quickly.
Acceptance Criteria:
- AC1: Daily CronJob compares
anlyt.metrics_daily.total_cost(scope=PLATFORM) vsbilling.daily_totals. - AC2: Divergence > 0.01% triggers
AnlytReconciliationDivergencealert. - AC3: Reconciliation report written to
anlyt.reconciliation_logtable. - AC4: Manual re-run of reconciliation job documented in runbook.
Story points: 5
US-ANLYT-10: Service observability — metrics, traces, alerts
As the SRE team, I want full observability for the analytics service, So that processing lag and accuracy issues are detected before they affect dashboard users.
Acceptance Criteria:
- AC1: All metric families in OBSERVABILITY.md exposed at
/metrics. - AC2: OTel spans for all event processing and REST endpoints.
- AC3: All 6 alerts configured with runbooks.
- AC4: Consumer lag Grafana panel visible with alert threshold line.
- AC5: Data freshness panel shows
dataAsOfacross all projections.
Story points: 5
Total EP-ANLYT-01 story points: 52
EP-ANLYT-02: ClickHouse Cold-Tier Analytics + Pre-Aggregated Cubes
Epic summary: Postgres is unfit for ad-hoc analytics over billions of CDR + DLR rows at national scale. ClickHouse provides sub-second queries on hot 90-d data and economical retention of 7 y of cold history. Pre-aggregated cubes reduce dashboard load.
US-ANLYT-11 — ClickHouse cluster bootstrap
As a data engineer, I want a 3-node ClickHouse cluster with replication and S3 cold-tier so we have a separate analytics store with predictable cost.
Acceptance Criteria:
- AC1: Cluster: 3 nodes per region; ReplicatedMergeTree replication; ZooKeeper/Keeper quorum
- AC2: S3-backed
coldstorage policy; lifecyclehot → warm (after 90 d) → cold (after 1 y) - AC3: Helm chart in
infra/clickhouse/; secrets via Vault - AC4: SLO: P95 query < 2 s on 30-d window for tenant-scoped dashboards
Story points: 8
US-ANLYT-12 — DLR + CDR ingestion (NATS-to-ClickHouse bridge)
As the platform,
I want a streaming ingest from NATS sms.dlr.inbound and cdr.generated.v1 into ClickHouse so data lag is < 60 s.
Acceptance Criteria:
- AC1: Bridge service
anlyt-ingestconsumes durable NATS consumers - AC2: Batched inserts (1 s window or 10 000 rows) into
anlyt.facts_dlr/anlyt.facts_cdr - AC3: Lag metric
anlyt_ingest_lag_seconds< 60 s P95 - AC4: At-least-once semantics with ClickHouse
ReplacingMergeTreefor dedup
Story points: 8
US-ANLYT-13 — Pre-aggregated cubes (per-tenant per-hour rollup)
As a dashboard developer,
I want pre-computed hourly/daily cubes by (tenantId, mno, status, lane) so 90-d dashboards render instantly.
Acceptance Criteria:
- AC1: Materialised views:
cube_dlr_hourly,cube_cdr_daily,cube_compliance_hourly - AC2: Aggregation columns:
count,sum,avg-latency,p50/p95/p99(viaquantileTDigest) - AC3: Cube freshness < 5 min behind raw
- AC4: Storage budget: cubes ≤ 5% of raw table size
Story points: 5
US-ANLYT-14 — Tenant-scoped query API
As a customer-portal page, I want a query API that abstracts ClickHouse and enforces tenant scoping so portal devs don't write SQL.
Acceptance Criteria:
- AC1:
POST /v1/anlyt/queryaccepts a typed query DSL (no raw SQL) - AC2: Tenant scoping enforced server-side; row counts capped at 100 k per request
- AC3: Cursor-based pagination
- AC4: P95 ≤ 1 s for 30-d window queries
Story points: 5
US-ANLYT-15 — Cold-tier compliance audit query
As a regulator-portal-service user,
I want to query 5-year compliance audit data with sub-30 s response for any (tenantId, dateRange) so audits are practical.
Acceptance Criteria:
- AC1:
coldClickHouse partitions on S3; query latency < 30 s P95 for 5-y windows - AC2: Query API exposes audit slices to
regulator-portal-serviceonly (mTLS) - AC3: Result set capped 1 M rows; export-to-S3 for larger
Story points: 5
EP-ANLYT-03: Per-MNO Quality Dashboards (delivery rate, latency, cost)
Epic summary: The NOC and finance need at-a-glance visibility into per-MNO performance and cost.
US-ANLYT-16 — Per-MNO delivery-rate dashboard
As the NOC, I want per-MNO delivery rate per hour over the last 7 d so I can spot degradation early.
Acceptance Criteria:
- AC1: Grafana dashboard
dashboards/mno-quality.json - AC2: Panels: delivery rate (line), DLR latency P95 (heatmap), ESME error breakdown (stacked bar), TPS utilisation
- AC3: One row per MNO; filters by lane, destination prefix
- AC4: Alert links per panel to runbooks
Story points: 5
US-ANLYT-17 — operator.quality.v1 publisher (consumed by routing-engine)
As the routing-engine, I want quality signals every 60 s per (mno, direction) so quality-weighted routing has live data.
Acceptance Criteria:
- AC1: Cron 60 s aggregates from
cube_dlr_hourlyand emitsoperator.quality.v1 - AC2: Payload:
{ mno, direction, deliveryRate, dlrLatencyP95, esmeErrorRate, sampleSize } - AC3: Consumed by EP-RE-05 (quality-adaptive routing)
Story points: 3
US-ANLYT-18 — Per-MNO cost vs. quality scatter
As finance, I want a daily scatter of per-MNO cost vs. quality so I can rebalance commercial allocation.
Acceptance Criteria:
- AC1: Daily-refreshed scatter plot in finance dashboard
- AC2: X axis: avg cost per delivered SMS; Y axis: delivery rate
- AC3: Bubble size = volume
Story points: 3
US-ANLYT-19 — Tenant-facing MNO quality (anonymised)
As an enterprise tenant, I want to see anonymised per-MNO quality metrics so I can decide my route preferences.
Acceptance Criteria:
- AC1: Customer-portal
/insights/mno-qualityshows anonymised MNO IDs (MNO-A,MNO-B, …) - AC2: Tenant can see only the MNOs their traffic actually uses
- AC3: Real MNO names hidden unless tenant has signed a data-share addendum
Story points: 3