Data Model

Source: services/search-service/DATA_MODEL.md in the documentation repo.

Companion: DOMAIN_MODEL.md · EVENT_SCHEMAS.md

Search-service persists four distinct data surfaces:

OpenSearch indices — primary lexical + facet store.
pgvector — embeddings, accessed via ai-gateway-service.
Postgres (search schema) — operational metadata: index policies, reindex jobs, inbox/outbox, recommendation snapshots, DLQ.
Redis — short-lived caches.

1. OpenSearch

1.1 Cluster Layout

Concern	Choice	Rationale
Version	OpenSearch 2.x	Apache 2.0 license; compatible with Elasticsearch APIs
Nodes	3 data + 3 master (prod); 1+1 (dev)	HA + split-brain avoidance
Replication	1 primary + 2 replicas on hot indices	Resilience to single AZ loss
Per-tenant strategy	Shared alias by default; dedicated alias for top-N tenants (shard count bumped)	Balances cost vs isolation
Lifecycle policy	Hot → warm at 30d, delete after 365d (tombstones only)	See §1.6

1.2 Index Naming

<env>-search-<scope>-<date>       e.g. prod-search-tenant_01HA...-2026-04-15
<env>-search-<scope>              alias, always points to the most recent physical index

scope = 'shared' → single physical index across many small tenants (they share it, filtered by tenantId).
scope = 'tenant_X' → dedicated to tenant X (large/enterprise).
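The naming scheme above can be sketched as a few string helpers. This is a minimal illustration, not the service's actual code: the function names and the `dedicated` flag are hypothetical, and the date suffix is assumed to be an ISO `YYYY-MM-DD` date as in the example.

```python
from datetime import date


def scope_for(tenant_id: str, dedicated: bool) -> str:
    """Dedicated (large/enterprise) tenants get their own scope; others share one."""
    return f"tenant_{tenant_id}" if dedicated else "shared"


def alias_name(env: str, scope: str) -> str:
    """Alias that always points at the most recent physical index."""
    return f"{env}-search-{scope}"


def physical_index_name(env: str, scope: str, created: date) -> str:
    """Physical index name: the alias plus an ISO date suffix."""
    return f"{alias_name(env, scope)}-{created.isoformat()}"
```

For a small tenant in prod, `alias_name("prod", scope_for("X", dedicated=False))` gives `prod-search-shared`, and queries against it must still filter by `tenantId`.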

1.3 Mapping (abridged)

{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": { "tokenizer": "standard", "filter": ["lowercase", "icu_folding"] },
        "ar":      { "tokenizer": "standard", "filter": ["lowercase", "arabic_normalization"] },
        "suggest": { "tokenizer": "standard", "filter": ["lowercase", "edge_ngram_2_12"] }
      }
    },
    "number_of_shards": 1,
    "number_of_replicas": 2
  },
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "id":                 { "type": "keyword" },
      "tenantId":           { "type": "keyword" },
      "type":               { "type": "keyword" },
      "source.service":     { "type": "keyword" },
      "source.aggregateId": { "type": "keyword" },
      "source.aggregateVersion": { "type": "long" },

      "title":   { "type": "object", "properties": {
        "en": { "type": "text", "analyzer": "default", "fields": { "kw": { "type": "keyword" }, "suggest": { "type": "completion" } } },
        "ar": { "type": "text", "analyzer": "ar" },
        "fr": { "type": "text", "analyzer": "default" }
      }},
      "body":    { "type": "object", "enabled": true, "properties": {
        "en": { "type": "text", "analyzer": "default" },
        "ar": { "type": "text", "analyzer": "ar" },
        "fr": { "type": "text", "analyzer": "default" }
      }},

      "tags":      { "type": "keyword" },
      "taxonomy":  { "type": "keyword" },
      "facets":    { "type": "flattened" },
      "visibility":{ "type": "keyword" },
      "audiences": { "type": "keyword" },

      "locale":    { "type": "keyword" },
      "region":    { "type": "keyword" },

      "quality":   { "type": "object", "properties": {
        "ratingAvg":       { "type": "float" },
        "enrollmentCount": { "type": "long" },
        "completionRate":  { "type": "float" }
      }},

      "publishedAt": { "type": "date" },
      "updatedAt":   { "type": "date" },
      "deletedAt":   { "type": "date" },

      "embeddingModelId": { "type": "keyword" },
      "embeddingHash":    { "type": "keyword" }
      /* embedding vector NOT stored in OpenSearch — lives in pgvector */
    }
  }
}

1.4 Shard Sizing Rule

Target shard size: 20–40 GB. A physical index is split (via the split API) when shard size exceeds 40 GB. A reindex job is triggered when the shard growth rate projects an overflow within 30 days.
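The rule can be read as a small decision function. The sketch below is an interpretation under stated assumptions: it treats the 40 GB threshold as a per-shard limit, uses a simple linear projection, and the `ShardStats` type and `growth_gb_per_day` metric are hypothetical names.

```python
from dataclasses import dataclass

MAX_SHARD_GB = 40.0    # upper bound of the 20-40 GB target
PROJECTION_DAYS = 30   # look-ahead window for the reindex trigger


@dataclass
class ShardStats:
    size_gb: float            # current size of the largest shard
    growth_gb_per_day: float  # hypothetical metric: observed daily growth


def sizing_action(stats: ShardStats) -> str:
    """Return 'split', 'reindex', or 'none' for one physical index."""
    if stats.size_gb > MAX_SHARD_GB:
        return "split"
    # Linear projection: current size plus 30 days of growth.
    projected = stats.size_gb + stats.growth_gb_per_day * PROJECTION_DAYS
    if projected > MAX_SHARD_GB:
        return "reindex"
    return "none"
```

A 30 GB shard growing at 0.5 GB/day projects to 45 GB in 30 days, so it would trigger a reindex before it ever needs a split.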

1.5 Zero-Downtime Rebuild Flow

1.6 ILM / ISM Policy

Phase	Trigger	Action
hot	default	1 primary + 2 replicas
warm	30d since last write	force-merge to 1 segment; 1 replica
delete	tombstone only + 30d	remove documents matching `deletedAt` older than 30d

2. pgvector (Embeddings)

Embeddings are stored in pgvector tables owned by ai-gateway-service. search-service never queries these tables directly; it calls the ai-gateway HTTP API:

POST /v1/vectors/upsert
POST /v1/vectors/knn
POST /v1/vectors/delete

Logical table shape (ai-gateway-side):

CREATE TABLE embeddings (
  tenant_id     UUID NOT NULL,
  doc_id        TEXT NOT NULL,
  doc_type      TEXT NOT NULL,
  vector        vector(1024) NOT NULL,
  model_id      TEXT NOT NULL,
  embedding_hash TEXT NOT NULL,
  updated_at    TIMESTAMPTZ NOT NULL DEFAULT now(),
  PRIMARY KEY (tenant_id, doc_id)
);
CREATE INDEX embeddings_vec_idx
  ON embeddings USING hnsw (vector vector_cosine_ops)
  WITH (m = 16, ef_construction = 128);
CREATE INDEX embeddings_tenant_type ON embeddings (tenant_id, doc_type);

Row-level security (RLS) enforces tenant isolation in pgvector, even within shared tables.

3. Postgres (`search` schema)

3.1 `search.index_policy`

CREATE TABLE search.index_policy (
  tenant_id           UUID PRIMARY KEY,
  alias               TEXT NOT NULL UNIQUE,
  physical_index      TEXT NOT NULL,
  primary_shards      INT  NOT NULL DEFAULT 1,
  replicas            INT  NOT NULL DEFAULT 2,
  embedding_model_id  TEXT NOT NULL,
  reindex_version     INT  NOT NULL DEFAULT 1,
  region              TEXT NOT NULL,
  created_at          TIMESTAMPTZ NOT NULL DEFAULT now(),
  last_reindex_at     TIMESTAMPTZ
);
ALTER TABLE search.index_policy ENABLE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON search.index_policy
  USING (tenant_id = current_setting('app.tenant_id')::uuid);

3.2 `search.reindex_job`

CREATE TABLE search.reindex_job (
  job_id          UUID PRIMARY KEY,
  tenant_id       UUID NOT NULL,
  scope           TEXT NOT NULL CHECK (scope IN ('tenant','global')),
  scope_target_id TEXT NOT NULL,
  status          TEXT NOT NULL CHECK (status IN ('queued','running','completed','failed')),
  phase           TEXT,
  include_embeddings BOOLEAN NOT NULL DEFAULT false,
  total_docs      BIGINT,
  processed_docs  BIGINT DEFAULT 0,
  started_at      TIMESTAMPTZ,
  completed_at    TIMESTAMPTZ,
  requested_by    TEXT NOT NULL,
  error           TEXT,
  created_at      TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE INDEX ON search.reindex_job (tenant_id, status);

3.3 `search.inbox` (idempotent consumption)

CREATE TABLE search.inbox (
  event_id    UUID PRIMARY KEY,          -- ULID stored as UUID
  tenant_id   UUID NOT NULL,
  subject     TEXT NOT NULL,
  received_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  handled_at  TIMESTAMPTZ
);
CREATE INDEX ON search.inbox (tenant_id, received_at);
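One way the inbox table can give idempotent consumption is to claim the `event_id` and run the handler in the same transaction. The sketch below uses SQLite as a stand-in so it runs anywhere; the simplified table, the `consume_once` helper, and the handler signature are assumptions, not the service's code. In Postgres the equivalent claim would be `INSERT ... ON CONFLICT DO NOTHING`.

```python
import sqlite3
from typing import Callable


def make_inbox(conn: sqlite3.Connection) -> None:
    # Simplified stand-in for search.inbox (SQLite has no UUID/TIMESTAMPTZ types).
    conn.execute(
        "CREATE TABLE inbox ("
        " event_id TEXT PRIMARY KEY,"
        " tenant_id TEXT NOT NULL,"
        " subject TEXT NOT NULL,"
        " received_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,"
        " handled_at TEXT)"
    )


def consume_once(
    conn: sqlite3.Connection,
    event_id: str,
    tenant_id: str,
    subject: str,
    handler: Callable[[], None],
) -> bool:
    """Run handler only for the first delivery of event_id; return True if it ran."""
    with conn:  # one transaction: claim the event, handle it, mark it handled
        cur = conn.execute(
            "INSERT OR IGNORE INTO inbox (event_id, tenant_id, subject)"
            " VALUES (?, ?, ?)",
            (event_id, tenant_id, subject),
        )
        if cur.rowcount == 0:
            return False  # duplicate delivery: the primary key already exists
        handler()  # if this raises, the claim is rolled back and redelivery retries
        conn.execute(
            "UPDATE inbox SET handled_at = CURRENT_TIMESTAMP WHERE event_id = ?",
            (event_id,),
        )
    return True
```

Because the claim and the handler share a transaction, a failed handler leaves no inbox row behind, so the broker's redelivery gets another attempt.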

3.4 `search.outbox` (internal events)

CREATE TABLE search.outbox (
  outbox_id   UUID PRIMARY KEY,
  tenant_id   UUID NOT NULL,
  subject     TEXT NOT NULL,
  payload     JSONB NOT NULL,
  created_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
  published_at TIMESTAMPTZ
);
CREATE INDEX ON search.outbox (published_at NULLS FIRST, created_at);
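The index on `(published_at NULLS FIRST, created_at)` suggests a relay that polls for unpublished rows in creation order. The following is a hedged sketch of such a relay, again on SQLite for portability; `relay_batch`, the `publish` callback, and the batch size are hypothetical, and delivery here is at-least-once because a row is marked only after publishing succeeds.

```python
import json
import sqlite3
from typing import Callable


def make_outbox(conn: sqlite3.Connection) -> None:
    # Simplified stand-in for search.outbox; payload is stored as JSON text.
    conn.execute(
        "CREATE TABLE outbox ("
        " outbox_id TEXT PRIMARY KEY,"
        " tenant_id TEXT NOT NULL,"
        " subject TEXT NOT NULL,"
        " payload TEXT NOT NULL,"
        " created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,"
        " published_at TEXT)"
    )


def relay_batch(
    conn: sqlite3.Connection,
    publish: Callable[[str, dict], None],
    limit: int = 100,
) -> int:
    """Publish up to `limit` unpublished rows, oldest first; return the count sent."""
    rows = conn.execute(
        "SELECT outbox_id, subject, payload FROM outbox"
        " WHERE published_at IS NULL"
        " ORDER BY created_at, outbox_id LIMIT ?",
        (limit,),
    ).fetchall()
    sent = 0
    for outbox_id, subject, payload in rows:
        publish(subject, json.loads(payload))  # may raise; the row then stays unpublished
        with conn:
            conn.execute(
                "UPDATE outbox SET published_at = CURRENT_TIMESTAMP"
                " WHERE outbox_id = ?",
                (outbox_id,),
            )
        sent += 1
    return sent
```

A crash between `publish` and the `UPDATE` causes a duplicate publish on the next poll, which is why consumers pair this pattern with an idempotent inbox.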

3.5 `search.recommendation`

CREATE TABLE search.recommendation (
  generation_id   UUID PRIMARY KEY,
  tenant_id       UUID NOT NULL,
  user_id         UUID NOT NULL,
  context         TEXT NOT NULL,
  model_version   TEXT NOT NULL,
  items           JSONB NOT NULL,
  generated_at    TIMESTAMPTZ NOT NULL DEFAULT now(),
  expires_at      TIMESTAMPTZ NOT NULL
);
CREATE INDEX ON search.recommendation (tenant_id, user_id, generated_at DESC);

3.6 `search.dlq`

CREATE TABLE search.dlq (
  id            UUID PRIMARY KEY,
  original_event JSONB NOT NULL,
  subject        TEXT NOT NULL,
  error          TEXT NOT NULL,
  delivery_count INT NOT NULL,
  received_at   TIMESTAMPTZ NOT NULL DEFAULT now(),
  replayed_at   TIMESTAMPTZ
);

3.7 `search.recommendation_feedback`

As implemented in Ghasi-EdTech (EP-11 migration 0003_ep11_feedback.sql — authoritative for OSS repo):

CREATE TABLE search.recommendation_feedback (
  id              TEXT PRIMARY KEY,
  tenant_id       UUID NOT NULL,
  user_id         TEXT NOT NULL,
  item_id         TEXT NOT NULL,  -- e.g. catalog URN `catalog:course:{courseId}`
  action          TEXT NOT NULL CHECK (action IN ('click','dismiss','convert','not_interested')),
  position        INT,
  context         TEXT,
  idempotency_key TEXT,
  created_at      TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE UNIQUE INDEX idx_search_rf_idempotency
  ON search.recommendation_feedback (tenant_id, idempotency_key)
  WHERE idempotency_key IS NOT NULL;
CREATE INDEX idx_search_rf_tenant_user_item
  ON search.recommendation_feedback (tenant_id, user_id, item_id, created_at DESC);
ALTER TABLE search.recommendation_feedback ENABLE ROW LEVEL SECURITY;
-- policies: SELECT/INSERT with tenant_id = current_setting('app.tenant_id')::uuid

generationId for analytics is carried on the feedback API body and on the emitted search.recommendation.feedback.recorded.v1 payload, not as a denormalized FK column in v1.
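The partial unique index on `(tenant_id, idempotency_key)` makes retried feedback submissions safe while still allowing rows without a key. The sketch below demonstrates that behaviour on SQLite, which also supports partial indexes; the trimmed table and the `record_feedback` helper are illustrative assumptions, not the migration itself.

```python
import sqlite3
from typing import Optional


def make_feedback(conn: sqlite3.Connection) -> None:
    # Trimmed stand-in for search.recommendation_feedback.
    conn.execute(
        "CREATE TABLE recommendation_feedback ("
        " id TEXT PRIMARY KEY,"
        " tenant_id TEXT NOT NULL,"
        " user_id TEXT NOT NULL,"
        " item_id TEXT NOT NULL,"
        " action TEXT NOT NULL CHECK (action IN"
        "   ('click','dismiss','convert','not_interested')),"
        " idempotency_key TEXT)"
    )
    conn.execute(
        "CREATE UNIQUE INDEX idx_rf_idempotency"
        " ON recommendation_feedback (tenant_id, idempotency_key)"
        " WHERE idempotency_key IS NOT NULL"
    )


def record_feedback(
    conn: sqlite3.Connection,
    row_id: str,
    tenant_id: str,
    user_id: str,
    item_id: str,
    action: str,
    idempotency_key: Optional[str] = None,
) -> bool:
    """Insert one feedback row; return False if it duplicates an existing row."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO recommendation_feedback"
                " (id, tenant_id, user_id, item_id, action, idempotency_key)"
                " VALUES (?, ?, ?, ?, ?, ?)",
                (row_id, tenant_id, user_id, item_id, action, idempotency_key),
            )
        return True
    except sqlite3.IntegrityError as exc:
        # Only uniqueness violations count as duplicates (a repeated primary key
        # lands here too); CHECK failures such as an unknown action propagate.
        if "UNIQUE" not in str(exc):
            raise
        return False
```

The same key reused by a different tenant is accepted, since uniqueness is scoped by `tenant_id`, and any number of rows may omit the key.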

4. Redis

Key pattern	TTL	Purpose
`search:q:{tenant}:{hash}`	30s	Query result cache
`search:sugg:{tenant}:{prefix}:{type}`	30s	Suggest cache
`search:rec:{tenant}:{user}:{ctx}`	3600s	Recommendation cache
`search:policy:{tenant}`	300s	IndexPolicy cache
`search:ratelimit:{actor}:{bucket}`	60s	Sliding window counter
`search:rebuild:lock:{tenant}`	3600s	Mutex
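Key construction for the patterns above can be centralised so TTLs and prefixes stay in one place. This is a minimal sketch: the helper names are hypothetical, and the way `{hash}` is derived (a truncated SHA-256 of the trimmed, lower-cased query) is an assumption, since the source does not specify it.

```python
import hashlib

# TTLs in seconds, copied from the table above.
TTL_SECONDS = {
    "q": 30,
    "sugg": 30,
    "rec": 3600,
    "policy": 300,
    "ratelimit": 60,
    "rebuild_lock": 3600,
}


def query_cache_key(tenant: str, query: str) -> str:
    # Assumed hashing scheme: normalise the query, then take 16 hex chars of SHA-256.
    digest = hashlib.sha256(query.strip().lower().encode("utf-8")).hexdigest()[:16]
    return f"search:q:{tenant}:{digest}"


def suggest_cache_key(tenant: str, prefix: str, doc_type: str) -> str:
    return f"search:sugg:{tenant}:{prefix}:{doc_type}"


def recommendation_cache_key(tenant: str, user: str, ctx: str) -> str:
    return f"search:rec:{tenant}:{user}:{ctx}"


def policy_cache_key(tenant: str) -> str:
    return f"search:policy:{tenant}"


def rebuild_lock_key(tenant: str) -> str:
    return f"search:rebuild:lock:{tenant}"
```

Every key embeds the tenant, so a cache hit can never cross tenant boundaries even though Redis itself has no notion of tenancy.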

5. Volumes and Sizing (targets)

Quantity	Estimate per 1k tenants	Notes
Docs per tenant	5k average, 500k for largest	Shared index handles up to ~1M docs comfortably
Lexical index size	~300 GB hot / 1k tenants
Embedding size	1024 × 4 B = 4 KB/doc + HNSW overhead	Roughly 2× raw
Reindex time	~1 s / 1k docs (without embeddings), ~10 s / 1k (with)
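The estimates in the table can be turned into a quick back-of-the-envelope calculator. The constants below come straight from the table (1024 dimensions, 4 bytes per dimension, roughly 2× for HNSW, 1 s or 10 s per 1k docs); the function names are illustrative and the results are planning targets, not measurements.

```python
DIMENSIONS = 1024
BYTES_PER_DIMENSION = 4        # 4-byte floats
HNSW_OVERHEAD_FACTOR = 2.0     # "roughly 2x raw" from the table


def embedding_bytes(doc_count: int) -> float:
    """Estimated pgvector storage, including HNSW overhead."""
    return doc_count * DIMENSIONS * BYTES_PER_DIMENSION * HNSW_OVERHEAD_FACTOR


def reindex_seconds(doc_count: int, with_embeddings: bool) -> float:
    """Estimated reindex time at ~1 s (or ~10 s with embeddings) per 1k docs."""
    seconds_per_thousand = 10.0 if with_embeddings else 1.0
    return doc_count / 1000 * seconds_per_thousand
```

For 1k tenants at the 5k-document average (5 million documents), `embedding_bytes(5_000_000)` is about 41 GB. For the largest tenant at 500k documents, a reindex takes roughly 500 s without embeddings and roughly 5,000 s (about 83 minutes) with them.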

6. Data Residency

Every document carries region. Physical indices and pgvector tables are provisioned in that region only. A tenant cannot be searched from another region. Cross-region searches for marketplace listings are handled by a dedicated global marketplace index (US-hosted, no PII, explicit opt-in).
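The residency rule can be expressed as a guard evaluated before any query is routed. The sketch below is one possible reading, not the service's implementation: the `SearchRequest` type, the `"marketplace"` scope label, and modelling the opt-in as a boolean on the request are all assumptions.

```python
from dataclasses import dataclass

MARKETPLACE_SCOPE = "marketplace"  # hypothetical label for the global index


@dataclass(frozen=True)
class SearchRequest:
    tenant_region: str           # region the tenant's data is provisioned in
    serving_region: str          # region of the node handling the request
    scope: str = "tenant"
    marketplace_opt_in: bool = False


def is_allowed(req: SearchRequest) -> bool:
    """Allow in-region tenant searches, or opted-in global marketplace searches."""
    if req.scope == MARKETPLACE_SCOPE:
        return req.marketplace_opt_in
    return req.tenant_region == req.serving_region
```

A tenant-scoped request served from a different region is rejected outright; only the marketplace path may cross regions, and only with the opt-in set.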

7. Backup & Recovery

OpenSearch snapshots → object storage daily (full) + every 6h (incremental).
Postgres search schema PITR (15 min RPO).
Indices can always be rebuilt from the event log, so the primary recovery strategy is a reindex from NATS rather than a snapshot restore.
