Observability

:::info Source
Sourced from services/analytics-service/OBSERVABILITY.md in the documentation repo.
:::

1. Logs

Events: analytics.ingest.*, analytics.query.*, analytics.export.*, analytics.report.*, analytics.alert.*, analytics.ai.*.
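As an illustration of the dotted event naming above, the following is a minimal sketch of a JSON log formatter built on Python's standard `logging` module. The event name `analytics.ingest.batch_accepted`, the `fields` attribute, and the `tenant_id` field are hypothetical; the source only specifies the `analytics.<area>.*` prefixes.

```python
import json
import logging


class JsonEventFormatter(logging.Formatter):
    """Render each record as one JSON object keyed by a dotted event name."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "event": record.getMessage(),
            # `fields` is a hypothetical attribute supplied through `extra=`.
            **getattr(record, "fields", {}),
        }
        return json.dumps(payload, sort_keys=True)


def build_logger(name: str = "analytics") -> logging.Logger:
    """Return a logger that writes JSON events to stderr."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonEventFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger()
    # Hypothetical event under the analytics.ingest.* family.
    log.info("analytics.ingest.batch_accepted", extra={"fields": {"tenant_id": "t-1"}})
```

Using the event name as the log message keeps a prefix filter such as `analytics.ingest.` usable in any log backend that can match on a single field.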

2. Metrics

RED

  • analytics_ingest_total{event_type} counter
  • analytics_ingest_duration_seconds histogram
  • analytics_query_total{endpoint,status} counter
  • analytics_query_duration_seconds histogram
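The query pair above covers the three RED signals: rate and errors come from the counter's `status` label, and duration comes from the histogram. The sketch below shows that recording pattern with in-memory stand-ins from the standard library. It is not the service's instrumentation, and the `"ok"`/`"error"` status values and the endpoint string are assumptions.

```python
import time
from collections import Counter
from typing import Callable, List, TypeVar

T = TypeVar("T")

# In-memory stand-ins for the two query metrics (hypothetical; a real service
# would register these with a Prometheus client library instead).
query_total: Counter = Counter()          # keyed by (endpoint, status)
query_duration_seconds: List[float] = []  # raw observations, not buckets


def observe_query(endpoint: str, run: Callable[[], T]) -> T:
    """Run a query callable and record rate, errors, and duration for it."""
    start = time.perf_counter()
    status = "ok"  # assumed label value
    try:
        return run()
    except Exception:
        status = "error"  # assumed label value
        raise
    finally:
        # Runs on both success and failure, so every call is counted once.
        query_duration_seconds.append(time.perf_counter() - start)
        query_total[(endpoint, status)] += 1
```

Recording in `finally` means failed queries still contribute a duration sample, which keeps latency percentiles from looking better during an outage.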

Domain

  • analytics_ingestion_lag_seconds gauge (firehose → ClickHouse)
  • analytics_events_raw_bytes_total counter
  • analytics_dashboard_renders_total counter
  • analytics_exports_total{status} counter
  • analytics_export_bytes_total counter
  • analytics_alerts_triggered_total{severity} counter

USE

  • analytics_ch_queue_depth gauge
  • analytics_ch_disk_usage_bytes{tier=hot|cold} gauge

Cost

  • analytics_storage_cost_estimate{tenant_id} gauge
  • analytics_query_cost_seconds_total{tenant_id} counter
  • analytics_ai_insight_cost_micro_usd_total{tenant_id} counter
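Assuming the `micro_usd` suffix means millionths of a US dollar, an integer counter can accumulate very small per-request costs without floating-point drift. The helper below is a minimal sketch of that conversion. The function name and the half-up rounding rule are assumptions, not something the source specifies.

```python
from decimal import Decimal, ROUND_HALF_UP

MICRO_USD_PER_USD = Decimal(1_000_000)


def usd_to_micro_usd(amount_usd: str) -> int:
    """Convert a decimal USD string to whole micro-USD, rounding half up."""
    micro = Decimal(amount_usd) * MICRO_USD_PER_USD
    return int(micro.quantize(Decimal(1), rounding=ROUND_HALF_UP))
```

For example, `usd_to_micro_usd("0.00042")` returns `420`, which is the amount such a counter would be incremented by for that request.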

3. Traces

Spans: analytics.ingest.kafka_connect, analytics.query.clickhouse, analytics.export.job, analytics.ai.generate_sql.

4. Dashboards

  • Ingestion throughput + lag.
  • Query performance (p95 by endpoint).
  • Export queue + success rate.
  • Dashboard render cache hit rate.
  • AI insight usage.
  • Storage cost (per-tenant).

5. Alerts

| Alert | Threshold | Severity |
|---|---|---|
| ingest-lag | > 60s p99 | P2 |
| query-slow | p95 > 5s | P3 |
| export-failure-rate | > 5% | P2 |
| ch-disk-full | > 85% | P2 |
| cross-tenant-query-leak | any detected | P1 |
| ai-sql-generation-refused | rate spike | P3 |
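The numeric alert thresholds can be expressed as data and evaluated against current readings, as in the sketch below. The signal keys such as `ingest_lag_p99_seconds` are hypothetical names for wherever the readings come from, and ratios are assumed to be fractions between 0 and 1. The `ai-sql-generation-refused` alert is left out because "rate spike" has no numeric threshold in the source.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class AlertRule:
    name: str
    signal: str       # hypothetical key into the observed-values dict
    threshold: float  # rule fires when the observed value is strictly greater
    severity: str


RULES = [
    AlertRule("ingest-lag", "ingest_lag_p99_seconds", 60.0, "P2"),
    AlertRule("query-slow", "query_p95_seconds", 5.0, "P3"),
    AlertRule("export-failure-rate", "export_failure_ratio", 0.05, "P2"),
    AlertRule("ch-disk-full", "ch_disk_used_ratio", 0.85, "P2"),
    # "any detected" is modelled as a count greater than zero.
    AlertRule("cross-tenant-query-leak", "cross_tenant_leaks_detected", 0.0, "P1"),
]


def firing(observed: Dict[str, float]) -> List[AlertRule]:
    """Return the rules whose threshold is exceeded, most severe first."""
    hits = [r for r in RULES if observed.get(r.signal, 0.0) > r.threshold]
    # "P1" < "P2" < "P3" as strings, so a plain sort orders by severity.
    return sorted(hits, key=lambda r: r.severity)
```

Missing signals default to 0.0 here, so an absent reading never fires a rule. A production evaluator would more likely treat missing data as its own alert condition.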

6. SLOs

| SLI | Target |
|---|---|
| Ingestion lag p99 | < 30s |
| Dashboard render p95 | < 2s |
| Metric query p95 | < 1s |
| Export start p95 | < 10s |
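Each SLO target compares a latency percentile against a fixed bound. The sketch below checks a batch of samples against the targets using the nearest-rank percentile definition. The SLI keys and the choice of nearest-rank are assumptions; a histogram-backed metrics system would estimate percentiles from buckets instead.

```python
import math
from typing import Sequence

# Targets from the SLO list, in seconds, keyed by (hypothetical SLI key, percentile).
SLO_TARGETS_SECONDS = {
    ("ingestion_lag", 99): 30.0,
    ("dashboard_render", 95): 2.0,
    ("metric_query", 95): 1.0,
    ("export_start", 95): 10.0,
}


def percentile(samples: Sequence[float], q: int) -> float:
    """Nearest-rank percentile for an integer q with 0 < q <= 100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(q * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def meets_slo(sli: str, q: int, samples: Sequence[float]) -> bool:
    """True when the q-th percentile is strictly below the SLO target."""
    return percentile(samples, q) < SLO_TARGETS_SECONDS[(sli, q)]
```

With 20 samples, the p95 is the 19th smallest value, so a single slow outlier does not break a p95 target but two do.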

7. Business Metrics (Platform-Admin)

  • Platform MAU, MRR, GMV.
  • Tenant health scores.
  • Per-tenant retention.