Skip to main content

Data Model

:::info Source Sourced from services/analytics-service/DATA_MODEL.md in the documentation repo. :::

1. Databases

  • ClickHouse (OLAP) — primary warehouse.
  • Postgres schema analytics — metadata (dashboards, metric defs, export jobs, cohorts).

2. ClickHouse Tables

2.1 events_raw (partitioned monthly × tenant)

CREATE TABLE analytics.events_raw
(
event_id String,
event_type String,
event_version UInt8,
occurred_at DateTime64(3, 'UTC'),
ingested_at DateTime64(3, 'UTC'),
tenant_id String,
actor_type String,
actor_id_hash String,
correlation_id String,
causation_id String,
service String,
aggregate String,
payload String, -- JSON
data_residency Enum('us', 'eu', 'me', 'ap')
)
ENGINE = MergeTree
PARTITION BY (toYYYYMM(occurred_at), cityHash64(tenant_id) % 16)
ORDER BY (tenant_id, event_type, occurred_at)
TTL toDateTime(occurred_at) + INTERVAL 13 MONTH TO VOLUME 'cold';

2.2 Materialized Views (examples)

-- Per-tenant daily active learners
CREATE MATERIALIZED VIEW analytics.dau_learners
ENGINE = SummingMergeTree
PARTITION BY toYYYYMM(day)
ORDER BY (tenant_id, day)
AS SELECT
tenant_id,
toDate(occurred_at) AS day,
uniqState(actor_id_hash) AS uniq_learners
FROM analytics.events_raw
WHERE event_type = 'delivery.play_session.started.v1'
GROUP BY tenant_id, day;

-- Per-course completion funnel
CREATE MATERIALIZED VIEW analytics.course_completion_funnel
ENGINE = AggregatingMergeTree
ORDER BY (tenant_id, course_id, stage)
AS SELECT
tenant_id,
JSONExtractString(payload, 'courseId') AS course_id,
multiIf(event_type = 'enrollment.created.v1', 'enrolled',
event_type = 'delivery.play_session.started.v1', 'started',
event_type = 'progress.completion.recorded.v1', 'completed', 'other') AS stage,
count() AS n
FROM analytics.events_raw
GROUP BY tenant_id, course_id, stage;

-- GMV per tenant per day
CREATE MATERIALIZED VIEW analytics.gmv_daily
ENGINE = SummingMergeTree
ORDER BY (tenant_id, day)
AS SELECT
tenant_id,
toDate(occurred_at) AS day,
sum(JSONExtractInt(payload, 'amountMicro')) AS gmv_micro
FROM analytics.events_raw
WHERE event_type = 'marketplace.order.paid.v1' OR event_type = 'billing.payment.succeeded.v1';

2.3 cohort_members

CREATE TABLE analytics.cohort_members
(
tenant_id String,
cohort_id String,
actor_id_hash String,
joined_at DateTime,
criteria_snapshot String
)
ENGINE = ReplacingMergeTree(joined_at)
ORDER BY (tenant_id, cohort_id, actor_id_hash);

3. Postgres Tables (metadata)

CREATE TABLE analytics.metric_definitions (...);
CREATE TABLE analytics.dashboards (...);
CREATE TABLE analytics.reports (...);
CREATE TABLE analytics.export_jobs (...);
CREATE TABLE analytics.cohort_definitions (...);

4. Multi-Tenant Isolation

  • ClickHouse: tenant_id in ORDER BY; queries must filter; settings.max_bytes_before_external_group_by limit.
  • API layer injects WHERE tenant_id = ? before sending to ClickHouse.
  • Audit trail for cross-tenant platform-admin queries.

5. Retention

  • Raw events: 13 months hot → S3 Parquet (cold) for 5 years.
  • Materialized views: indefinite (small relative to raw).
  • Per-tenant retention override (e.g., EU tenants with strict GDPR rules).

6. Indexing

  • ClickHouse: skip indexes on event_type, tenant_id, partition pruning.
  • Postgres: standard indexes on metadata.

7. Performance

  • Ingestion: 100k events/sec/region.
  • Query: sub-second on indexed dimensions; minutes for cross-tenant aggregates (admin only).
  • Caching: Grafana caches dashboard panels 60s.

8. PII Handling

  • actor_id_hash = SHA256(actor_id + tenant_salt) — opaque but consistent.
  • Full PII only in audit tier (separate ClickHouse cluster with restricted access).
  • GDPR erasure: delete rows for user across hot + cold; anonymization tagged.