Analytics Service — Observability
Status: populated | Owner: Platform Engineering + SRE | Last updated: 2026-04-18 | Companion: 12 Observability
1. SLIs / SLOs
| SLI | SLO | Window | Measurement |
|---|---|---|---|
| Event processing lag | ≤ 5 min behind real-time | 7 d | NATS consumer NumPending |
| REST API P95 latency | ≤ 500 ms | 30 d | All /v1/internal/analytics/* endpoints |
| REST API availability | ≥ 99% | 30 d | Non-5xx ratio |
| Daily rollup freshness | ≤ 2 h staleness | 7 d | Max age of metrics_daily.updated_at |
| Dedup success rate | ≥ 99.99% | 7 d | processed_events vs anlyt_events_processed_total |
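As a rough illustration of how the availability row above could be evaluated, the sketch below computes the non-5xx ratio over a window and compares it with the 99% target. The function name, the `SloResult` type, the error-budget field, and the choice to treat an empty window as "met" are all assumptions made for the example, not part of the service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloResult:
    sli: float          # measured ratio, 0.0..1.0
    target: float       # SLO target, 0.0..1.0
    met: bool           # True when sli >= target
    budget_left: float  # unspent fraction of the error budget (negative if overspent)


def availability(total_requests: int, server_errors: int,
                 target: float = 0.99) -> SloResult:
    """Non-5xx ratio over a window, compared with the availability target."""
    if total_requests <= 0:
        # Assumption: a window without traffic counts as meeting the SLO.
        return SloResult(1.0, target, True, 1.0)
    sli = (total_requests - server_errors) / total_requests
    allowed_errors = (1.0 - target) * total_requests
    if allowed_errors == 0:
        # A 100% target leaves no budget: any error spends all of it.
        budget_left = 1.0 if server_errors == 0 else 0.0
    else:
        budget_left = 1.0 - server_errors / allowed_errors
    return SloResult(sli, target, sli >= target, budget_left)
```

With hypothetical counts of 200,000 requests and 500 server errors in the 30 d window, the SLI is 0.9975 and about three quarters of the error budget remains.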
2. Metrics
Exposed at /metrics (Prometheus):
- anlyt_events_processed_total{type="billing|dlr", result="ok|dedup|error"}
- anlyt_events_processing_lag_seconds (time between event.at and processing time)
- anlyt_upsert_duration_seconds_bucket{table}
- anlyt_rollup_duration_seconds_bucket{granularity="daily"}
- anlyt_query_duration_seconds_bucket{endpoint}
- anlyt_nats_consumer_pending{consumer="billing|dlr"}
- anlyt_pg_errors_total{op="upsert|select"}
- anlyt_deserialization_errors_total{type}
- anlyt_processed_events_table_size_rows
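The `_bucket` series above are Prometheus histograms, so a P95 value is an estimate derived from cumulative bucket counts rather than an exact measurement. The sketch below shows that estimation with linear interpolation inside the matching bucket, in the spirit of PromQL's `histogram_quantile`; it is a simplified illustration, and the bucket bounds in the usage note are hypothetical.

```python
import math
from typing import Sequence, Tuple


def estimate_quantile(q: float, buckets: Sequence[Tuple[float, float]]) -> float:
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with a +Inf bucket,
    which is how Prometheus exposes histogram `le` buckets.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be within [0, 1]")
    if not buckets or not math.isinf(buckets[-1][0]):
        raise ValueError("buckets must end with a +Inf upper bound")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations in the window
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile lies in the +Inf bucket: report the highest finite bound.
                return prev_bound
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            # Assume observations are spread evenly within the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return prev_bound
```

For example, with cumulative counts of 50, 80, 95, 99, and 100 at bounds 0.1 s, 0.25 s, 0.5 s, 1 s, and +Inf, the estimated P95 is 0.5 s. The accuracy of the estimate depends on how finely the buckets are spaced around the SLO threshold.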
3. Traces
OpenTelemetry spans:
- anlyt.process.billingEvent: per event; includes eventId, accountId, bucketHour
- anlyt.process.dlrEvent: per event
- anlyt.rollup.daily: entire rollup run; includes bucketsProcessed, durationMs
- anlyt.api.summary / .operatorPerformance / .accountUsage / .throughput / .deliveryBreakdown
4. Logs (Pino → Loki)
Fields: level, ts, service=analytics-service, eventId, eventType, bucketHour, durationMs, traceId, spanId.
No MSISDN or message body in any log line.
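One way to enforce the "no MSISDN or message body" rule is to treat the field list above as an allowlist and drop everything else before a record is written. The sketch below illustrates that idea only; it assumes the documented fields are exhaustive, and the function names and the sample `msisdn` and `body` keys are hypothetical. In the service itself this would more likely live in the logger configuration.

```python
from typing import Any, Dict, FrozenSet, List

# Fields documented for analytics-service log lines.
ALLOWED_FIELDS: FrozenSet[str] = frozenset({
    "level", "ts", "service", "eventId", "eventType",
    "bucketHour", "durationMs", "traceId", "spanId",
})


def sanitize_log_record(record: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the record that keeps only allowlisted fields."""
    return {key: value for key, value in record.items() if key in ALLOWED_FIELDS}


def disallowed_fields(record: Dict[str, Any]) -> List[str]:
    """List the keys that the allowlist would drop, sorted for stable output."""
    return sorted(key for key in record if key not in ALLOWED_FIELDS)
```

An allowlist fails closed: a newly added field stays out of the logs until someone decides it is safe, whereas a denylist would let it through by default.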
5. Dashboards (Grafana)
- Analytics Pipeline: consumer lag, event processing rate, dedup rate, error rate
- Aggregation Health: hourly bucket fill rate, rollup job status, P95 query latency
- Data Freshness: metrics_daily.updated_at vs. now, per scope
- ClickHouse ETL (if enabled): ETL job success rate, rows transferred
6. Alerts
| Alert | Condition | Runbook |
|---|---|---|
| AnlytConsumerLag | NATS consumer pending > 10,000 for 5 m | runbooks/anlyt/consumer-lag.md |
| AnlytRollupFailed | Daily rollup job has not run in 3 h | runbooks/anlyt/rollup-failed.md |
| AnlytPgErrors | PG errors > 5/min | runbooks/anlyt/pg-down.md |
| AnlytDeserializationErrors | Deserialization errors > 10/min | runbooks/anlyt/schema-mismatch.md |
| AnlytQuerySlow | REST P95 > 1 s for 10 m | runbooks/anlyt/query-slow.md |
| AnlytProcessedEventsTableLarge | processed_events > 5M rows | runbooks/anlyt/purge-events.md |
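Conditions such as "pending > 10,000 for 5 m" fire only when the threshold is breached continuously for the whole hold period, so a single dip resets the clock. The sketch below illustrates that behaviour over a list of samples; it is a simplified model of a hold-duration check, and the function name and sample data are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def is_firing(samples: Sequence[Tuple[float, float]],
              threshold: float, hold_seconds: float) -> bool:
    """Decide whether a sustained-threshold alert should fire.

    samples are (unix_timestamp, value) pairs, oldest first. The alert fires
    when the most recent unbroken run of values above the threshold spans
    at least hold_seconds.
    """
    breach_start: Optional[float] = None
    for ts, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = ts  # start of a new breach run
        else:
            breach_start = None    # any recovery resets the hold timer
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start >= hold_seconds
```

For AnlytConsumerLag this would be called with `threshold=10_000` and `hold_seconds=300`. With one sample per minute, six consecutive breaching samples are needed before the alert fires.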