Analytics Service — Service Overview

Status: populated
Owner: Platform Engineering
Last updated: 2026-04-18
Companion: DOMAIN_MODEL · API_CONTRACTS · EVENT_SCHEMAS

1. Purpose

analytics-service aggregates operational metrics from NATS event streams and exposes read-only REST endpoints for internal dashboards. It:

  • Consumes billing.events and sms.dlr.inbound NATS subjects.
  • Computes hourly and daily aggregates in PostgreSQL (anlyt schema) using idempotent upserts.
  • Serves a read-only internal REST API (no Kong route) consumed by admin-dashboard and customer-portal.
  • Provides per-operator metrics (delivery rate, latency, TPS), per-account metrics (messages sent/delivered/failed, cost), and platform summary metrics.
  • Optionally archives data older than 90 days to ClickHouse for long-term trend queries.
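
As a rough illustration of the aggregation described above, the sketch below buckets delivery-receipt events into hourly windows per operator and derives a delivery rate. The `DlrEvent` type, its field names, and the in-memory de-duplication by `event_id` are assumptions made for this example; the actual event shape is defined in EVENT_SCHEMAS, not here.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class DlrEvent:
    """Hypothetical event shape; field names are illustrative only."""
    event_id: str
    operator_id: str
    occurred_at: datetime  # assumed to be timezone-aware
    delivered: bool


def hour_window(ts: datetime) -> datetime:
    """Truncate a timestamp to the start of its UTC hour."""
    return ts.astimezone(timezone.utc).replace(minute=0, second=0, microsecond=0)


def aggregate_hourly(events):
    """Count events per (operator, hour), skipping redelivered duplicates."""
    seen = set()
    buckets = defaultdict(lambda: {"total": 0, "delivered": 0})
    for ev in events:
        if ev.event_id in seen:  # a redelivered event must not count twice
            continue
        seen.add(ev.event_id)
        key = (ev.operator_id, hour_window(ev.occurred_at))
        buckets[key]["total"] += 1
        buckets[key]["delivered"] += int(ev.delivered)
    return {
        key: {**counts, "delivery_rate": counts["delivered"] / counts["total"]}
        for key, counts in buckets.items()
    }
```

Each resulting `(operator_id, window_start)` key would map naturally onto one row of an hourly aggregate table.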

2. Bounded Context

Analytics & Reporting: the read side of the platform, with no write authority over any business aggregate. Classified as Supporting: dashboards rely on it, but the SMS pipeline does not depend on it for correctness or availability.

3. Responsibilities

| Area | What this service owns |
|---|---|
| Event consumption | billing.events, sms.dlr.inbound (durable NATS consumers) |
| Hourly aggregation | Idempotent UPSERT into anlyt.metrics_hourly per window |
| Daily roll-up | Idempotent UPSERT into anlyt.metrics_daily (computed from hourly rows) |
| Per-operator metrics | Delivery rate, avg latency, P95 latency, error rate, peak TPS |
| Per-account metrics | Messages sent/delivered/failed, total cost, avg cost per message |
| Platform summary | Totals, overall delivery rate, active accounts |
| Internal REST API | 5 read-only endpoints (no Kong route) |
| ClickHouse offload | Optional: rows older than 90 d migrated to ClickHouse |
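
The hourly upsert and daily roll-up can be sketched as follows. This is a minimal illustration, not the service's actual SQL: it uses Python's built-in `sqlite3` as a stand-in for PostgreSQL so that it runs without a server, and every column name is an assumption (only the table names `metrics_hourly` and `metrics_daily` come from this document). The point it shows is that overwriting a row with the recomputed aggregate, rather than incrementing it, makes reprocessing the same window harmless.

```python
import sqlite3

# Hypothetical columns; only the table names come from the overview.
SCHEMA = """
CREATE TABLE metrics_hourly (
    operator_id  TEXT NOT NULL,
    window_start TEXT NOT NULL,   -- ISO-8601 hour, e.g. 2026-04-18T10:00
    sent         INTEGER NOT NULL,
    delivered    INTEGER NOT NULL,
    PRIMARY KEY (operator_id, window_start)
);
CREATE TABLE metrics_daily (
    operator_id TEXT NOT NULL,
    day         TEXT NOT NULL,    -- ISO-8601 date, e.g. 2026-04-18
    sent        INTEGER NOT NULL,
    delivered   INTEGER NOT NULL,
    PRIMARY KEY (operator_id, day)
);
"""

# Overwrite (not increment) on conflict, so replaying a window is a no-op.
UPSERT_HOURLY = """
INSERT INTO metrics_hourly (operator_id, window_start, sent, delivered)
VALUES (?, ?, ?, ?)
ON CONFLICT (operator_id, window_start)
DO UPDATE SET sent = excluded.sent, delivered = excluded.delivered
"""

# Recompute the whole day from its hourly rows, then overwrite the daily row.
ROLLUP_DAILY = """
INSERT INTO metrics_daily (operator_id, day, sent, delivered)
SELECT operator_id, :day, SUM(sent), SUM(delivered)
FROM metrics_hourly
WHERE substr(window_start, 1, 10) = :day
GROUP BY operator_id
ON CONFLICT (operator_id, day)
DO UPDATE SET sent = excluded.sent, delivered = excluded.delivered
"""


def upsert_hourly(conn, operator_id, window_start, sent, delivered):
    with conn:  # commit on success, roll back on error
        conn.execute(UPSERT_HOURLY, (operator_id, window_start, sent, delivered))


def rollup_daily(conn, day):
    with conn:
        conn.execute(ROLLUP_DAILY, {"day": day})


def demo():
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    for _ in range(2):  # the second pass simulates a full replay
        upsert_hourly(conn, "op-a", "2026-04-18T10:00", 100, 97)
        upsert_hourly(conn, "op-a", "2026-04-18T11:00", 50, 50)
        rollup_daily(conn, "2026-04-18")
    return conn.execute(
        "SELECT operator_id, day, sent, delivered FROM metrics_daily"
    ).fetchall()
```

Running `demo()` leaves a single daily row whose totals are unchanged by the replay. In PostgreSQL the same `INSERT ... ON CONFLICT ... DO UPDATE` form applies, with `EXCLUDED` referring to the proposed row.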

4. Non-Responsibilities

| Area | Owner |
|---|---|
| Billing decisions / charge computation | billing-service |
| Real-time alerting on delivery rates | observability stack (Grafana/Prometheus) |
| SMPP delivery receipt parsing | dlr-processor (publishes sms.dlr.inbound) |
| Customer-facing invoices | billing-service |
| Data warehouse / BI export | future ETL pipeline, not this service |

5. Dependencies

| Dependency | Kind | Purpose |
|---|---|---|
| NATS JetStream | Event bus | Consume billing.events, sms.dlr.inbound |
| PostgreSQL (schema anlyt) | Data store | Aggregate tables |
| ClickHouse (optional) | Cold store | Long-term queries (> 90 d) |
| admin-dashboard | Caller | Reads summary + operator/account metrics |
| customer-portal | Caller | Reads per-account usage |

6. High-Level Flow

7. Key Design Decisions

| Decision | Rationale | Trade-off |
|---|---|---|
| Idempotent UPSERT, not INSERT | NATS redelivery is inevitable; double-counting would corrupt metrics | Slightly more complex SQL; acceptable |
| Hourly + daily separate tables (not materialised views) | Explicit control over rollup timing; can recompute if source events are replayed | Requires an extra rollup job, run on a cron schedule |
| No Kong route | Analytics is read-only and internal; exposing it via Kong adds no security benefit and could expose aggregate metrics to misconfigured consumers | Requires the caller to be on the cluster network |
| ClickHouse optional (> 90 d) | PostgreSQL is sufficient for 90 d at expected scale; ClickHouse avoids PG table bloat for long-term data | Operational complexity of running ClickHouse is added only when scale justifies it |
| P95 latency via approximate percentile | Storing every DLR latency sample is prohibitive; percentile_disc is computed over hourly bucketed samples instead | P95 is approximate within the hour window |
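
To make the percentile decision concrete, here is a small sketch of the discrete-percentile rule that PostgreSQL's `percentile_disc` follows: return the first value in sort order whose position, as a fraction of the set, equals or exceeds the requested fraction. The function below is a plain-Python illustration of that rule applied to one hour's samples. How the service chooses which latency samples to retain per hour is not specified in this overview, so the input list is an assumption.

```python
def percentile_disc(samples, fraction):
    """Discrete percentile: first sorted value whose rank/n >= fraction.

    Always returns a value that is actually present in `samples`
    (no interpolation between neighbours).
    """
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 <= fraction <= 1:
        raise ValueError("fraction must be between 0 and 1")
    ordered = sorted(samples)
    n = len(ordered)
    for rank, value in enumerate(ordered, start=1):
        if rank / n >= fraction:
            return value
    return ordered[-1]  # unreachable for valid input; kept as a safe fallback


# Hypothetical latency samples (milliseconds) retained for one hourly window.
hour_latencies_ms = [120, 95, 410, 130, 101, 99, 250, 115, 108, 1800]
p95_ms = percentile_disc(hour_latencies_ms, 0.95)
```

Because the result is computed from whatever subset of samples was kept for the hour, it approximates the true P95 for that window, which matches the trade-off noted in the table.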

8. Status

Design approved. Implementation in progress. See SERVICE_READINESS for gate checklist.