Data Model
:::info Source
Sourced from services/analytics-service/DATA_MODEL.md in the documentation repo.
:::
1. Databases
- ClickHouse (OLAP) — primary warehouse.
- Postgres schema
analytics— metadata (dashboards, metric defs, export jobs, cohorts).
2. ClickHouse Tables
2.1 events_raw (partitioned monthly × tenant)
CREATE TABLE analytics.events_raw
(
event_id String,
event_type String,
event_version UInt8,
occurred_at DateTime64(3, 'UTC'),
ingested_at DateTime64(3, 'UTC'),
tenant_id String,
actor_type String,
actor_id_hash String,
correlation_id String,
causation_id String,
service String,
aggregate String,
payload String, -- JSON
data_residency Enum('us', 'eu', 'me', 'ap')
)
ENGINE = MergeTree
PARTITION BY (toYYYYMM(occurred_at), cityHash64(tenant_id) % 16)
ORDER BY (tenant_id, event_type, occurred_at)
TTL toDateTime(occurred_at) + INTERVAL 13 MONTH TO VOLUME 'cold';
2.2 Materialized Views (examples)
-- Per-tenant daily active learners
CREATE MATERIALIZED VIEW analytics.dau_learners
ENGINE = SummingMergeTree
PARTITION BY toYYYYMM(day)
ORDER BY (tenant_id, day)
AS SELECT
tenant_id,
toDate(occurred_at) AS day,
uniqState(actor_id_hash) AS uniq_learners
FROM analytics.events_raw
WHERE event_type = 'delivery.play_session.started.v1'
GROUP BY tenant_id, day;
-- Per-course completion funnel
CREATE MATERIALIZED VIEW analytics.course_completion_funnel
ENGINE = AggregatingMergeTree
ORDER BY (tenant_id, course_id, stage)
AS SELECT
tenant_id,
JSONExtractString(payload, 'courseId') AS course_id,
multiIf(event_type = 'enrollment.created.v1', 'enrolled',
event_type = 'delivery.play_session.started.v1', 'started',
event_type = 'progress.completion.recorded.v1', 'completed', 'other') AS stage,
count() AS n
FROM analytics.events_raw
GROUP BY tenant_id, course_id, stage;
-- GMV per tenant per day
CREATE MATERIALIZED VIEW analytics.gmv_daily
ENGINE = SummingMergeTree
ORDER BY (tenant_id, day)
AS SELECT
tenant_id,
toDate(occurred_at) AS day,
sum(JSONExtractInt(payload, 'amountMicro')) AS gmv_micro
FROM analytics.events_raw
WHERE event_type = 'marketplace.order.paid.v1' OR event_type = 'billing.payment.succeeded.v1';
2.3 cohort_members
CREATE TABLE analytics.cohort_members
(
tenant_id String,
cohort_id String,
actor_id_hash String,
joined_at DateTime,
criteria_snapshot String
)
ENGINE = ReplacingMergeTree(joined_at)
ORDER BY (tenant_id, cohort_id, actor_id_hash);
3. Postgres Tables (metadata)
CREATE TABLE analytics.metric_definitions (...);
CREATE TABLE analytics.dashboards (...);
CREATE TABLE analytics.reports (...);
CREATE TABLE analytics.export_jobs (...);
CREATE TABLE analytics.cohort_definitions (...);
4. Multi-Tenant Isolation
- ClickHouse:
tenant_idin ORDER BY; queries must filter;settings.max_bytes_before_external_group_bylimit. - API layer injects
WHERE tenant_id = ?before sending to ClickHouse. - Audit trail for cross-tenant platform-admin queries.
5. Retention
- Raw events: 13 months hot → S3 Parquet (cold) for 5 years.
- Materialized views: indefinite (small relative to raw).
- Per-tenant retention override (e.g., EU tenants with strict GDPR rules).
6. Indexing
- ClickHouse: skip indexes on
event_type,tenant_id, partition pruning. - Postgres: standard indexes on metadata.
7. Performance
- Ingestion: 100k events/sec/region.
- Query: sub-second on indexed dimensions; minutes for cross-tenant aggregates (admin only).
- Caching: Grafana caches dashboard panels 60s.
8. PII Handling
actor_id_hash= SHA256(actor_id + tenant_salt) — opaque but consistent.- Full PII only in
audittier (separate ClickHouse cluster with restricted access). - GDPR erasure: delete rows for user across hot + cold; anonymization tagged.