
Observability

:::info Source
Sourced from services/analytics-service/OBSERVABILITY.md in the documentation repo.
:::

1. Logs

Events: analytics.ingest.*, analytics.query.*, analytics.export.*, analytics.report.*, analytics.alert.*, analytics.ai.*.
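
The event families above share a dotted `analytics.<family>.<name>` namespace. As a minimal sketch of how such an event might be emitted as one structured JSON log line, using only the Python standard library: the `build_event` and `log_event` helpers, the field names, and the `analytics.ingest.batch_received` event name used in the usage note are illustrative assumptions, not taken from the service.

```python
import json
import logging

# Event families listed in this section; anything else is rejected.
EVENT_FAMILIES = ("ingest", "query", "export", "report", "alert", "ai")

logger = logging.getLogger("analytics")


def build_event(name: str, **fields) -> str:
    """Serialize a log event as one JSON line after validating its namespace."""
    parts = name.split(".")
    if len(parts) < 3 or parts[0] != "analytics" or parts[1] not in EVENT_FAMILIES:
        raise ValueError(f"unknown event namespace: {name}")
    return json.dumps({"event": name, **fields}, sort_keys=True)


def log_event(name: str, **fields) -> None:
    """Emit the event at INFO level (hypothetical helper)."""
    logger.info(build_event(name, **fields))
```

For example, `log_event("analytics.ingest.batch_received", tenant_id="t1", count=3)` would log a single JSON object, while a name outside the listed families raises `ValueError` before anything is written.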

2. Metrics

RED

analytics_ingest_total{event_type} counter
analytics_ingest_duration_seconds histogram
analytics_query_total{endpoint,status} counter
analytics_query_duration_seconds histogram
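
The duration histograms above are what a p95 or p99 latency figure would typically be derived from. The sketch below estimates a quantile from cumulative bucket counts by linear interpolation inside the matching bucket, which is similar in spirit to how Prometheus's `histogram_quantile` works. The bucket boundaries are made up for illustration, and the `+Inf` bucket is deliberately left out to keep the example short.

```python
from bisect import bisect_left


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound_seconds, cumulative_count), sorted by
    upper bound, with finite bounds only (an assumption of this sketch).
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    counts = [count for _, count in buckets]
    i = bisect_left(counts, rank)  # first bucket whose cumulative count reaches rank
    upper, count = buckets[i]
    lower, prev = (0.0, 0) if i == 0 else buckets[i - 1]
    if count == prev:
        return upper
    # Assume observations are spread evenly within the bucket.
    return lower + (upper - lower) * (rank - prev) / (count - prev)
```

With hypothetical buckets `[(0.5, 50), (1.0, 80), (5.0, 100)]`, the estimated p95 falls three quarters of the way through the last bucket. The estimate is only as precise as the bucket layout, so bucket boundaries near the latency targets matter.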

Domain

analytics_ingestion_lag_seconds gauge (firehose → ClickHouse)
analytics_events_raw_bytes_total counter
analytics_dashboard_renders_total counter
analytics_exports_total{status} counter
analytics_export_bytes_total counter
analytics_alerts_triggered_total{severity} counter

USE

analytics_ch_queue_depth gauge
analytics_ch_disk_usage_bytes{tier=hot|cold} gauge

Cost

analytics_storage_cost_estimate{tenant_id} gauge
analytics_query_cost_seconds_total{tenant_id} counter
analytics_ai_insight_cost_micro_usd_total{tenant_id} counter
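
The `_micro_usd_total` suffix suggests that AI insight cost is accumulated in integer micro-dollars (1 USD = 1,000,000 micro-USD), which keeps a monotonically increasing counter free of floating-point drift. A minimal sketch under that assumption follows; the `CostCounter` class is a hypothetical in-memory stand-in, not the service's metrics client.

```python
from decimal import Decimal

MICRO_USD_PER_USD = 1_000_000


class CostCounter:
    """In-memory stand-in for a per-tenant, monotonically increasing counter."""

    def __init__(self) -> None:
        self._totals = {}

    def add_usd(self, tenant_id: str, usd: str) -> None:
        # Parse via Decimal so "0.0025" becomes exactly 2500 micro-USD;
        # amounts below one micro-USD are truncated.
        micro = int(Decimal(usd) * MICRO_USD_PER_USD)
        if micro < 0:
            raise ValueError("counters must not decrease")
        self._totals[tenant_id] = self._totals.get(tenant_id, 0) + micro

    def total(self, tenant_id: str) -> int:
        return self._totals.get(tenant_id, 0)
```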

3. Traces

Spans: analytics.ingest.kafka_connect, analytics.query.clickhouse, analytics.export.job, analytics.ai.generate_sql.
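
To illustrate what wrapping an operation in one of these spans could look like, here is a standard-library sketch that records a span's name, attributes, duration, and error flag. It is not the service's tracing setup: a real deployment would more likely use a tracing SDK, and `span` and `finished_spans` are hypothetical names.

```python
import time
from contextlib import contextmanager

# Span names listed in this section.
KNOWN_SPANS = {
    "analytics.ingest.kafka_connect",
    "analytics.query.clickhouse",
    "analytics.export.job",
    "analytics.ai.generate_sql",
}

finished_spans = []  # stand-in for a trace exporter


@contextmanager
def span(name, **attributes):
    """Time the wrapped block and record it as a finished span."""
    if name not in KNOWN_SPANS:
        raise ValueError(f"unregistered span name: {name}")
    record = {"name": name, "attributes": attributes, "error": False}
    start = time.perf_counter()
    try:
        yield record
    except Exception:
        record["error"] = True
        raise
    finally:
        record["duration_seconds"] = time.perf_counter() - start
        finished_spans.append(record)
```

A query handler might then run its ClickHouse call inside `with span("analytics.query.clickhouse", tenant_id="t1"):` so that failures are flagged on the span and still propagate to the caller.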

4. Dashboards

Ingestion throughput + lag.
Query performance (p95 by endpoint).
Export queue + success rate.
Dashboard render cache hit rate.
AI insight usage.
Storage cost (per-tenant).

5. Alerts

| Alert | Threshold | Severity |
| --- | --- | --- |
| ingest-lag | p99 > 60s | P2 |
| query-slow | p95 > 5s | P3 |
| export-failure-rate | > 5% | P2 |
| ch-disk-full | > 85% | P2 |
| cross-tenant-query-leak | any detected | P1 |
| ai-sql-generation-refused | rate spike | P3 |
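
The numeric thresholds in the alert table can be expressed as data and evaluated against current measurements. The sketch below assumes hypothetical signal names (`ingest_lag_p99_seconds` and so on) and ratios in the range 0 to 1 for the percentage thresholds. `ai-sql-generation-refused` is omitted because the source gives no numeric threshold for "rate spike".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    name: str
    signal: str       # hypothetical key into the measurements dict
    threshold: float  # the alert fires when the signal is strictly greater
    severity: str


# Thresholds copied from the alert table.
RULES = [
    AlertRule("ingest-lag", "ingest_lag_p99_seconds", 60.0, "P2"),
    AlertRule("query-slow", "query_p95_seconds", 5.0, "P3"),
    AlertRule("export-failure-rate", "export_failure_ratio", 0.05, "P2"),
    AlertRule("ch-disk-full", "ch_disk_used_ratio", 0.85, "P2"),
    AlertRule("cross-tenant-query-leak", "cross_tenant_leaks_detected", 0.0, "P1"),
]


def firing(measurements):
    """Return (severity, name) pairs for exceeded rules, most severe first."""
    hits = [
        (rule.severity, rule.name)
        for rule in RULES
        if measurements.get(rule.signal, 0.0) > rule.threshold
    ]
    return sorted(hits)  # "P1" sorts before "P2" and "P3"
```

Missing signals are treated as zero here, so an absent measurement never fires an alert. A production rule set would probably want a separate "no data" alert instead.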

6. SLOs

| SLI | Target |
| --- | --- |
| Ingestion lag p99 | < 30s |
| Dashboard render p95 | < 2s |
| Metric query p95 | < 1s |
| Export start p95 | < 10s |
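
As a worked illustration of checking raw samples against these targets, the sketch below computes a nearest-rank percentile and compares it with the bound from the table. The SLI keys and the choice of nearest-rank (rather than an interpolated or histogram-based estimate) are assumptions made for this example.

```python
# Targets copied from the SLO table: (percentile, upper bound in seconds).
SLO_TARGETS = {
    "ingestion_lag": (99, 30.0),
    "dashboard_render": (95, 2.0),
    "metric_query": (95, 1.0),
    "export_start": (95, 10.0),
}


def percentile(samples, p):
    """Nearest-rank percentile of a non-empty list; p is an integer 1..100."""
    ordered = sorted(samples)
    # Integer ceiling of p * n / 100 avoids floating-point rounding surprises.
    rank = max(1, (p * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def meets_slo(sli, samples):
    """True when the SLI's percentile is strictly below its target."""
    p, bound = SLO_TARGETS[sli]
    return percentile(samples, p) < bound
```

For instance, 100 dashboard renders of which 95 take 0.1s and 5 take 3s still meet the p95 target, whereas 94 fast and 6 slow renders do not.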

7. Business Metrics (Platform-Admin)

Platform MAU, MRR, GMV.
Tenant health scores.
Per-tenant retention.
