Observability
:::info Source
Sourced from services/analytics-service/OBSERVABILITY.md in the documentation repo.
:::
1. Logs
Events: `analytics.ingest.*`, `analytics.query.*`, `analytics.export.*`, `analytics.report.*`, `analytics.alert.*`, `analytics.ai.*`.
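As an illustration of how an event in one of these namespaces might be emitted, here is a minimal sketch using only the Python standard library. The source gives only the namespace prefixes, so the concrete event name `analytics.ingest.batch_received`, the field names, and the helpers `build_logger` and `log_event` are all hypothetical.

```python
import io
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "event": record.getMessage(),
            # `fields` is a hypothetical attribute attached via `extra=`.
            **getattr(record, "fields", {}),
        }
        return json.dumps(payload, sort_keys=True)


def build_logger(stream):
    """Return an 'analytics' logger that writes JSON lines to `stream`."""
    logger = logging.getLogger("analytics")
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


def log_event(logger, event, **fields):
    """Emit one structured event; `event` should use a namespace listed above."""
    logger.info(event, extra={"fields": fields})


if __name__ == "__main__":
    # Event name and fields are assumptions, not taken from the service.
    log_event(build_logger(sys.stderr), "analytics.ingest.batch_received",
              tenant_id="t-1", events=3)
```

Keeping the event name in a dedicated `event` field means a log query can filter on a prefix such as `analytics.ingest.` without parsing free-form messages.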
2. Metrics
RED
| Metric | Type |
|---|---|
| `analytics_ingest_total{event_type}` | counter |
| `analytics_ingest_duration_seconds` | histogram |
| `analytics_query_total{endpoint,status}` | counter |
| `analytics_query_duration_seconds` | histogram |
Domain
| Metric | Type | Notes |
|---|---|---|
| `analytics_ingestion_lag_seconds` | gauge | firehose → ClickHouse |
| `analytics_events_raw_bytes_total` | counter | |
| `analytics_dashboard_renders_total` | counter | |
| `analytics_exports_total{status}` | counter | |
| `analytics_export_bytes_total` | counter | |
| `analytics_alerts_triggered_total{severity}` | counter | |
USE
| Metric | Type |
|---|---|
| `analytics_ch_queue_depth` | gauge |
| `analytics_ch_disk_usage_bytes{tier=hot\|cold}` | gauge |
Cost
| Metric | Type |
|---|---|
| `analytics_storage_cost_estimate{tenant_id}` | gauge |
| `analytics_query_cost_seconds_total{tenant_id}` | counter |
| `analytics_ai_insight_cost_micro_usd_total{tenant_id}` | counter |
3. Traces
Spans: `analytics.ingest.kafka_connect`, `analytics.query.clickhouse`, `analytics.export.job`, `analytics.ai.generate_sql`.
4. Dashboards
- Ingestion throughput + lag.
- Query performance (p95 by endpoint).
- Export queue + success rate.
- Dashboard render cache hit rate.
- AI insight usage.
- Storage cost (per-tenant).
5. Alerts
| Alert | Threshold | Severity |
|---|---|---|
| ingest-lag | p99 > 60s | P2 |
| query-slow | p95 > 5s | P3 |
| export-failure-rate | > 5% | P2 |
| ch-disk-full | > 85% | P2 |
| cross-tenant-query-leak | any detected | P1 |
| ai-sql-generation-refused | rate spike | P3 |
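The fixed thresholds in the table above can be read as simple predicates over a metrics snapshot. The sketch below is one possible encoding, not the service's actual alerting configuration: the snapshot keys (`ingest_lag_p99_seconds` and so on) are hypothetical, ratios are assumed to be expressed as fractions between 0 and 1, and `ai-sql-generation-refused` is left out because the table does not quantify "rate spike".

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class AlertRule:
    name: str
    severity: str
    fires: Callable[[Dict[str, float]], bool]


# Thresholds come from the alert table; snapshot key names are assumptions.
RULES = [
    AlertRule("ingest-lag", "P2", lambda s: s["ingest_lag_p99_seconds"] > 60),
    AlertRule("query-slow", "P3", lambda s: s["query_p95_seconds"] > 5),
    AlertRule("export-failure-rate", "P2", lambda s: s["export_failure_ratio"] > 0.05),
    AlertRule("ch-disk-full", "P2", lambda s: s["ch_disk_used_ratio"] > 0.85),
    AlertRule("cross-tenant-query-leak", "P1", lambda s: s["cross_tenant_leaks"] > 0),
]


def evaluate(snapshot: Dict[str, float]) -> List[str]:
    """Return '<severity> <alert>' for every rule that fires, in table order."""
    return [f"{rule.severity} {rule.name}" for rule in RULES if rule.fires(snapshot)]


if __name__ == "__main__":
    snapshot = {
        "ingest_lag_p99_seconds": 75.0,
        "query_p95_seconds": 1.2,
        "export_failure_ratio": 0.02,
        "ch_disk_used_ratio": 0.90,
        "cross_tenant_leaks": 0,
    }
    for line in evaluate(snapshot):
        print(line)
```

In practice these rules would more likely live in the alerting system's own rule language; the Python form is only meant to make the comparison operators and units explicit.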
6. SLOs
| SLI | Target |
|---|---|
| Ingestion lag p99 | < 30s |
| Dashboard render p95 | < 2s |
| Metric query p95 | < 1s |
| Export start p95 | < 10s |
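To make the SLI targets above concrete, the following sketch checks a window of latency samples against them. It assumes a nearest-rank percentile over raw samples in seconds; a production setup would more plausibly derive percentiles from histogram buckets, and the SLI keys and the helpers `percentile` and `slo_met` are hypothetical names.

```python
import math
from typing import Dict, Sequence, Tuple

# (percentile, target in seconds) per SLI, taken from the SLO table.
# The dictionary keys are illustrative names, not identifiers from the service.
SLO_TARGETS: Dict[str, Tuple[int, float]] = {
    "ingestion_lag": (99, 30.0),
    "dashboard_render": (95, 2.0),
    "metric_query": (95, 1.0),
    "export_start": (95, 10.0),
}


def percentile(samples: Sequence[float], p: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% at or below it."""
    if not samples:
        raise ValueError("percentile of an empty sample set is undefined")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def slo_met(sli: str, samples: Sequence[float]) -> bool:
    """True when the SLI's percentile is strictly below its target."""
    p, target = SLO_TARGETS[sli]
    return percentile(samples, p) < target


if __name__ == "__main__":
    renders = [0.1] * 95 + [3.0] * 5   # 5% of renders are slow
    lags = [5.0] * 98 + [45.0] * 2     # 2% of batches lag badly
    print("dashboard_render:", slo_met("dashboard_render", renders))
    print("ingestion_lag:", slo_met("ingestion_lag", lags))
```

The two sample windows show why the percentile choice matters: a 5% slow tail does not move a p95 computed this way, while a 2% tail is enough to breach a p99 target.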
7. Business Metrics (Platform-Admin)
- Platform MAU, MRR, GMV.
- Tenant health scores.
- Per-tenant retention.