Analytics Service — Service Overview
Status: populated
Owner: Platform Engineering
Last updated: 2026-04-18
Companion: DOMAIN_MODEL · API_CONTRACTS · EVENT_SCHEMAS
1. Purpose
analytics-service aggregates operational metrics from NATS event streams and exposes read-only REST endpoints for internal dashboards. It:
- Consumes the billing.events and sms.dlr.inbound NATS subjects.
- Computes hourly and daily aggregates (idempotent upserts) in PostgreSQL (anlyt schema).
- Serves a read-only internal REST API (no Kong route) consumed by admin-dashboard and customer-portal.
- Provides per-operator metrics (delivery rate, latency, TPS), per-account metrics (messages sent/delivered/failed, cost), and platform summary metrics.
- Optionally archives data older than 90 days to ClickHouse for long-term trend queries.
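
The idempotent upsert mentioned above can be illustrated with a minimal sketch. It assumes the aggregator recomputes the full counters for a window and overwrites the row rather than incrementing it, so writing the same window twice leaves a single unchanged row. SQLite stands in for PostgreSQL here (the `ON CONFLICT ... DO UPDATE` clause has the same shape in both); the table layout, the column names, and the helpers `hour_window`, `open_store`, and `upsert_hourly` are hypothetical and not taken from the service.

```python
import sqlite3
from datetime import datetime, timezone


def hour_window(ts: datetime) -> str:
    """Truncate a timezone-aware timestamp to the start of its UTC hour."""
    start = ts.astimezone(timezone.utc).replace(minute=0, second=0, microsecond=0)
    return start.isoformat()


def open_store() -> sqlite3.Connection:
    """Create an in-memory stand-in for the hourly aggregate table (hypothetical columns)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE metrics_hourly (
            window_start TEXT    NOT NULL,
            operator_id  TEXT    NOT NULL,
            sent         INTEGER NOT NULL,
            delivered    INTEGER NOT NULL,
            PRIMARY KEY (window_start, operator_id)
        )
        """
    )
    return conn


def upsert_hourly(conn: sqlite3.Connection, window_start: str, operator_id: str,
                  sent: int, delivered: int) -> None:
    """Insert the window's row, or overwrite it if the window already exists.

    Overwriting (not incrementing) is what makes a repeated write harmless.
    """
    conn.execute(
        """
        INSERT INTO metrics_hourly (window_start, operator_id, sent, delivered)
        VALUES (?, ?, ?, ?)
        ON CONFLICT (window_start, operator_id)
        DO UPDATE SET sent = excluded.sent, delivered = excluded.delivered
        """,
        (window_start, operator_id, sent, delivered),
    )
    conn.commit()
```

Calling `upsert_hourly` twice with the same window and counters leaves one row with the same values, which is the property that matters when NATS redelivers a message.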
2. Bounded Context
Analytics & Reporting — read-side of the platform; no write authority over any business aggregate. Classified as Supporting (dashboards rely on it; SMS pipeline does not depend on it for correctness or availability).
3. Responsibilities
| Area | What this service owns |
|---|---|
| Event consumption | billing.events, sms.dlr.inbound — durable NATS consumers |
| Hourly aggregation | Idempotent UPSERT into anlyt.metrics_hourly per window |
| Daily roll-up | Idempotent UPSERT into anlyt.metrics_daily (computed from hourly rows) |
| Per-operator metrics | Delivery rate, avg latency, P95 latency, error rate, peak TPS |
| Per-account metrics | Messages sent/delivered/failed, total cost, avg cost per message |
| Platform summary | Totals, overall delivery rate, active accounts |
| Internal REST API | 5 read-only endpoints (no Kong route) |
| ClickHouse offload | Optional: rows older than 90 d migrated to ClickHouse |
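
As an illustration of the daily roll-up, the sketch below folds hourly rows into one daily row. It is a sketch under assumptions, not the service's schema: the field names, the `latency_sum_ms` column, and the definitions of delivery rate (delivered / sent) and error rate (failed / sent) are hypothetical. P95 is left out on purpose, because hourly percentiles cannot be combined exactly into a daily percentile.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class HourlyRow:
    sent: int
    delivered: int
    failed: int
    latency_sum_ms: float  # assumed: sum of DLR latencies over delivered messages
    peak_tps: float


@dataclass(frozen=True)
class DailyRow:
    sent: int
    delivered: int
    failed: int
    delivery_rate: float
    error_rate: float
    avg_latency_ms: float
    peak_tps: float


def roll_up_day(hours: Iterable[HourlyRow]) -> DailyRow:
    """Combine hourly rows into one daily row.

    Counters are summed, the average latency is weighted by delivered
    messages, and the daily peak TPS is the maximum hourly peak.
    """
    rows = list(hours)
    sent = sum(h.sent for h in rows)
    delivered = sum(h.delivered for h in rows)
    failed = sum(h.failed for h in rows)
    latency_sum = sum(h.latency_sum_ms for h in rows)
    return DailyRow(
        sent=sent,
        delivered=delivered,
        failed=failed,
        delivery_rate=delivered / sent if sent else 0.0,
        error_rate=failed / sent if sent else 0.0,
        avg_latency_ms=latency_sum / delivered if delivered else 0.0,
        peak_tps=max((h.peak_tps for h in rows), default=0.0),
    )
```

Because the result depends only on the hourly rows passed in, re-running the roll-up after a replay yields the same daily row, which fits the overwrite-style upsert described in the table above.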
4. Non-Responsibilities
| Area | Owner |
|---|---|
| Billing decisions / charge computation | billing-service |
| Real-time alerting on delivery rates | observability stack (Grafana/Prometheus) |
| SMPP delivery receipt parsing | dlr-processor (publishes sms.dlr.inbound) |
| Customer-facing invoices | billing-service |
| Data warehouse / BI export | future ETL pipeline, not this service |
5. Dependencies
| Dependency | Kind | Purpose |
|---|---|---|
| NATS JetStream | Event bus | Consume billing.events, sms.dlr.inbound |
| PostgreSQL (schema anlyt) | Data store | Aggregate tables |
| ClickHouse (optional) | Cold store | Long-term queries (> 90 d) |
| admin-dashboard | Caller | Reads summary + operator/account metrics |
| customer-portal | Caller | Reads per-account usage |
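
The 90 d boundary between PostgreSQL and ClickHouse can be sketched as a simple cutoff on the window start. This is only an illustration: the helper `split_for_offload` is hypothetical, and whether the real job compares against the window start, the window end, or an ingestion timestamp is an assumption.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, Tuple

# Retention in the hot store, taken from the 90 d figure in this document.
HOT_RETENTION = timedelta(days=90)


def split_for_offload(
    window_starts: Iterable[datetime], now: datetime
) -> Tuple[List[datetime], List[datetime]]:
    """Partition window start times into (keep_in_postgres, offload_to_clickhouse).

    A window is offloaded when it started more than 90 days before `now`.
    """
    cutoff = now - HOT_RETENTION
    keep: List[datetime] = []
    offload: List[datetime] = []
    for start in window_starts:
        (offload if start < cutoff else keep).append(start)
    return keep, offload
```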
6. High-Level Flow
7. Key Design Decisions
| Decision | Rationale | Trade-off |
|---|---|---|
| Idempotent UPSERT, not INSERT | NATS redelivery is inevitable; double-counting would corrupt metrics | Slightly more complex SQL; acceptable |
| Hourly + daily separate tables (not materialised views) | Explicit control over rollup timing; can recompute if source events are replayed | Requires an extra rollup job, run on a cron schedule |
| No Kong route | Analytics is read-only and internal; exposing it via Kong adds no security benefit and could expose aggregate metrics to misconfigured consumers | Requires callers to be on the cluster network |
| ClickHouse optional (> 90 d) | PostgreSQL is sufficient for 90 d at expected scale; ClickHouse avoids PG table bloat for long-term data | Operational complexity of running ClickHouse is added only when scale justifies it |
| P95 latency via approximate percentile | Storing every DLR latency sample is prohibitive; percentile_disc is computed over hourly bucketed samples instead | P95 is approximate within each hour window |
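
To make the P95 trade-off concrete, here is a minimal sketch of a discrete percentile over bucketed samples for one hour. It mirrors the semantics of PostgreSQL's `percentile_disc` (the first value in sort order whose cumulative share reaches the requested fraction). Representing the hour as `(latency_ms, count)` pairs is an assumption about how samples are bucketed, not a description of the actual schema.

```python
import math
from typing import Iterable, Tuple


def percentile_disc(buckets: Iterable[Tuple[float, int]], fraction: float) -> float:
    """Discrete percentile over (value, count) buckets.

    Returns the smallest bucket value whose cumulative count covers at
    least `fraction` of all samples. Raw samples are the special case
    where every count is 1.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    ordered = sorted((value, count) for value, count in buckets if count > 0)
    total = sum(count for _, count in ordered)
    if total == 0:
        raise ValueError("no samples in this window")
    # Rank (1-based) of the sample that satisfies the requested fraction.
    target = max(1, math.ceil(fraction * total))
    seen = 0
    for value, count in ordered:
        seen += count
        if seen >= target:
            return value
    return ordered[-1][0]
```

With buckets, the result can only be one of the bucket values, so the error is bounded by the bucket width. That is one way to read "approximate within each hour window".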
8. Status
Design approved. Implementation in progress. See SERVICE_READINESS for gate checklist.