
Analytics Service — Observability

Status: populated
Owner: Platform Engineering + SRE
Last updated: 2026-04-18
Companion: 12 Observability

1. SLIs / SLOs

| SLI | SLO | Window | Measurement |
| --- | --- | --- | --- |
| Event processing lag | ≤ 5 min behind real-time | 7 d | NATS consumer NumPending |
| REST API P95 latency | ≤ 500 ms | 30 d | All /v1/internal/analytics/* endpoints |
| REST API availability | ≥ 99% | 30 d | Non-5xx ratio |
| Daily rollup freshness | ≤ 2 h staleness | 7 d | Max age of metrics_daily.updated_at |
| Dedup success rate | ≥ 99.99% | 7 d | processed_events vs anlyt_events_processed_total |
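
As an illustration of how the availability SLO can be read as an error budget, the following is a minimal sketch. The `RequestCounts` shape, the function names, and the sample numbers are assumptions made for this example; they are not taken from the service.

```typescript
// Hypothetical helper types and names, for illustration only.
interface RequestCounts {
  total: number;        // all requests observed in the 30 d window
  serverErrors: number; // responses with a 5xx status
}

// Availability SLI as the non-5xx ratio; an empty window counts as fully available.
function availability(c: RequestCounts): number {
  return c.total === 0 ? 1 : (c.total - c.serverErrors) / c.total;
}

// Fraction of the error budget still unspent for an SLO target (0.99 for 99%).
// A negative result means the budget is exhausted.
function errorBudgetRemaining(c: RequestCounts, target: number): number {
  const allowedErrors = c.total * (1 - target);
  if (allowedErrors === 0) return c.serverErrors === 0 ? 1 : -Infinity;
  return 1 - c.serverErrors / allowedErrors;
}

// Made-up sample numbers: 500 server errors out of 200,000 requests.
const counts: RequestCounts = { total: 200_000, serverErrors: 500 };
console.log(availability(counts));
console.log(errorBudgetRemaining(counts, 0.99));
```

With these sample numbers the service sits above the 99% target and has spent roughly a quarter of its 30 d budget.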

2. Metrics

Exposed at /metrics (Prometheus):

anlyt_events_processed_total{type="billing|dlr", result="ok|dedup|error"}
anlyt_events_processing_lag_seconds -- time between event.at and processing time
anlyt_upsert_duration_seconds_bucket{table}
anlyt_rollup_duration_seconds_bucket{granularity="daily"}
anlyt_query_duration_seconds_bucket{endpoint}
anlyt_nats_consumer_pending{consumer="billing|dlr"}
anlyt_pg_errors_total{op="upsert|select"}
anlyt_deserialization_errors_total{type}
anlyt_processed_events_table_size_rows
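
The `_bucket` series above are cumulative histograms, so a P95 has to be estimated from bucket counts. The sketch below shows one common approach, linear interpolation inside the bucket that contains the target rank, similar in spirit to PromQL's `histogram_quantile`. The `Bucket` type, the function name, and the sample bucket bounds are assumptions for this example.

```typescript
// Cumulative bucket as Prometheus exposes it: count of observations <= le.
interface Bucket {
  le: number;    // upper bound in seconds; Infinity for the +Inf bucket
  count: number; // cumulative count
}

// Estimate quantile q (0..1) by linear interpolation inside the bucket holding it.
// Buckets must be sorted by `le` ascending and end with the +Inf bucket.
function estimateQuantile(q: number, buckets: Bucket[]): number {
  if (buckets.length === 0) return NaN;
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;
  const rank = q * total;
  for (let i = 0; i < buckets.length; i++) {
    if (buckets[i].count >= rank) {
      // Rank falls in the +Inf bucket: report the highest finite bound.
      if (!Number.isFinite(buckets[i].le)) {
        return i > 0 ? buckets[i - 1].le : NaN;
      }
      const lowerBound = i > 0 ? buckets[i - 1].le : 0;
      const lowerCount = i > 0 ? buckets[i - 1].count : 0;
      const inBucket = buckets[i].count - lowerCount;
      if (inBucket === 0) return buckets[i].le;
      return lowerBound + (buckets[i].le - lowerBound) * ((rank - lowerCount) / inBucket);
    }
  }
  return NaN;
}

// Made-up sample for anlyt_query_duration_seconds_bucket on one endpoint.
const queryBuckets: Bucket[] = [
  { le: 0.1, count: 50 },
  { le: 0.25, count: 80 },
  { le: 0.5, count: 96 },
  { le: 1, count: 100 },
  { le: Infinity, count: 100 },
];
const p95 = estimateQuantile(0.95, queryBuckets);
console.log(p95 <= 0.5 ? "within the 500 ms SLO" : "over the 500 ms SLO");
```

The estimate is only as precise as the bucket layout, so bucket bounds near the SLO threshold (500 ms) and the alert threshold (1 s) make the result more trustworthy.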

3. Traces

OpenTelemetry spans:

  • anlyt.process.billingEvent — per event, includes eventId, accountId, bucketHour
  • anlyt.process.dlrEvent — per event
  • anlyt.rollup.daily — entire rollup run, includes bucketsProcessed, durationMs
  • anlyt.api.summary / .operatorPerformance / .accountUsage / .throughput / .deliveryBreakdown
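
The span attributes for `anlyt.process.billingEvent` can be sketched as a plain object built from the event. This example assumes that `bucketHour` is the event timestamp truncated to the start of its UTC hour and that the event carries an ISO 8601 `at` field; both are assumptions, as are the `BillingEvent` shape and the function names.

```typescript
// Hypothetical event shape; only the fields needed for the span attributes.
interface BillingEvent {
  eventId: string;
  accountId: string;
  at: string; // ISO 8601 event timestamp
}

const HOUR_MS = 3_600_000;

// Truncate a timestamp to the start of its UTC hour (assumed bucketHour format).
function bucketHour(at: string): string {
  const ms = Date.parse(at);
  if (Number.isNaN(ms)) throw new Error(`invalid timestamp: ${at}`);
  return new Date(Math.floor(ms / HOUR_MS) * HOUR_MS).toISOString();
}

// Attributes listed for the billing span: eventId, accountId, bucketHour.
function billingSpanAttributes(e: BillingEvent): Record<string, string> {
  return { eventId: e.eventId, accountId: e.accountId, bucketHour: bucketHour(e.at) };
}

const attrs = billingSpanAttributes({
  eventId: "evt-123", // made-up identifiers
  accountId: "acc-42",
  at: "2026-04-18T10:42:07Z",
});
console.log(attrs);
```

An object like this could be passed to whatever tracing API the service uses when it starts the span.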

4. Logs (Pino → Loki)

Fields: level, ts, service=analytics-service, eventId, eventType, bucketHour, durationMs, traceId, spanId. No MSISDN or message body in any log line.
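
One way to enforce the "no MSISDN or message body" rule is an allowlist applied before a record reaches the logger. The sketch below assumes records are flat objects; `toLogRecord` and the sample values are hypothetical and not part of the service.

```typescript
// Allowlist taken from the field list above; every other key is dropped.
const ALLOWED_FIELDS = [
  "level", "ts", "service", "eventId", "eventType",
  "bucketHour", "durationMs", "traceId", "spanId",
] as const;

type LogRecord = Record<string, unknown>;

// Copy only allowlisted keys, so fields such as msisdn or body never reach the sink.
// `service` is always set to the fixed service name.
function toLogRecord(input: LogRecord): LogRecord {
  const out: LogRecord = { service: "analytics-service" };
  for (const key of ALLOWED_FIELDS) {
    if (key !== "service" && input[key] !== undefined) out[key] = input[key];
  }
  return out;
}

const line = toLogRecord({
  level: "info",
  ts: "2026-04-18T10:15:00.000Z",
  eventId: "evt-123",     // made-up values
  eventType: "billing",
  msisdn: "+10000000000", // dropped: not on the allowlist
  body: "hello",          // dropped: not on the allowlist
});
console.log(JSON.stringify(line));
```

This filters top-level keys only and does not inspect free-text message strings. Pino's built-in `redact` option is an alternative that works with a denylist of paths; an allowlist fails closed when a new sensitive field appears.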

5. Dashboards (Grafana)

  • Analytics Pipeline — consumer lag, event processing rate, dedup rate, error rate
  • Aggregation Health — hourly bucket fill rate, rollup job status, P95 query latency
  • Data Freshness — metrics_daily.updated_at vs now per scope
  • ClickHouse ETL (if enabled) — ETL job success rate, rows transferred

6. Alerts

| Alert | Condition | Runbook |
| --- | --- | --- |
| AnlytConsumerLag | NATS consumer pending > 10,000 for 5 min | runbooks/anlyt/consumer-lag.md |
| AnlytRollupFailed | Daily rollup job has not run in 3 h | runbooks/anlyt/rollup-failed.md |
| AnlytPgErrors | PG errors > 5/min | runbooks/anlyt/pg-down.md |
| AnlytDeserializationErrors | Deserialization errors > 10/min | runbooks/anlyt/schema-mismatch.md |
| AnlytQuerySlow | REST P95 > 1 s for 10 min | runbooks/anlyt/query-slow.md |
| AnlytProcessedEventsTableLarge | processed_events > 5M rows | runbooks/anlyt/purge-events.md |
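
Conditions such as "pending > 10,000 for 5 min" fire only when the breach is sustained. The sketch below illustrates that semantics over a list of samples. It is a simplification and not how an alerting engine evaluates rules internally; `Sample`, `firesFor`, and the sample series are assumptions for this example.

```typescript
interface Sample {
  tsMs: number;  // sample timestamp, epoch milliseconds
  value: number; // e.g. anlyt_nats_consumer_pending
}

// True when the newest samples all exceed the threshold and that unbroken
// breach started at least `forMs` before `nowMs`.
function firesFor(samples: Sample[], threshold: number, forMs: number, nowMs: number): boolean {
  const sorted = [...samples].sort((a, b) => a.tsMs - b.tsMs);
  let breachStart: number | undefined;
  // Walk back from the newest sample while the condition still holds.
  for (let i = sorted.length - 1; i >= 0; i--) {
    if (sorted[i].tsMs > nowMs) continue; // ignore samples from the future
    if (sorted[i].value <= threshold) break;
    breachStart = sorted[i].tsMs;
  }
  return breachStart !== undefined && nowMs - breachStart >= forMs;
}

const MINUTE_MS = 60_000;
// Made-up series: one sample per minute, all above the 10,000 threshold.
const sustained: Sample[] = [0, 1, 2, 3, 4, 5].map((m) => ({ tsMs: m * MINUTE_MS, value: 12_000 }));
// Same series with a dip below the threshold at minute 2.
const interrupted: Sample[] = sustained.map((s, i) => (i === 2 ? { ...s, value: 9_000 } : s));

console.log(firesFor(sustained, 10_000, 5 * MINUTE_MS, 5 * MINUTE_MS));
console.log(firesFor(interrupted, 10_000, 5 * MINUTE_MS, 5 * MINUTE_MS));
```

The dip in the second series resets the breach window, which is the behavior that keeps short spikes in consumer lag from paging anyone.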