
search-aggregation-service — DEPLOYMENT_TOPOLOGY

Companion: SERVICE_OVERVIEW · DATA_MODEL · SECURITY_MODEL · OBSERVABILITY · FAILURE_MODES · ../../docs/02-enterprise-architecture.md

1. Cloud, regions, and SKUs

| Resource | Provider | Region(s) | SKU / size |
| --- | --- | --- | --- |
| Service runtime | GCP Cloud Run (gen2) | europe-west1 (primary), asia-south1 (warm secondary) | min instances 2 (prod), max 50; 2 vCPU / 4 GiB / cpu-always-allocated |
| Postgres projection | Cloud SQL for PostgreSQL 15 + PostGIS | europe-west1 HA + cross-region replica in asia-south1 | db-custom-4-16384, 200 GB SSD, IOPS auto-resize |
| OpenSearch | Aiven for OpenSearch 2.x (peered VPC in GCP) | europe-west1 cluster + asia-south1 cluster (separate, not replicated; populated by independent rebuild jobs) | 3 master + 3 data nodes, business-4 plan, 200 GB SSD per data node |
| Redis cache | Memorystore for Redis 7 (private services access) | europe-west1 HA + asia-south1 HA | M3 (4 GiB) |
| Pub/Sub | GCP Pub/Sub | global | per-topic; ordering enabled where required |
| Object archive (event replay) | BigQuery + GCS | europe-west1 (primary), GCS dual-region eu | per platform standard |
| Secret store | Google Secret Manager | global with regional replicas | per SECURITY_MODEL § 6 |
| CI/CD | Cloud Build + Artifact Registry | europe-west1 | platform standard |
| Edge / WAF | Cloud Armor + External HTTPS LB | global | per platform |

Cloud is GCP everywhere — no AWS, no on-prem. Desktop is Electron, but this service has no desktop surface (see SYNC_CONTRACT.md).

2. Topology diagram

                 ┌──────────────────────────────┐
                 │     Internet (anonymous)     │
                 └───────────────┬──────────────┘
                                 │
                  Cloud Armor (WAF + rate limit)
                                 │
                    External HTTPS LB (global)
                                 │
                       Apigee (API gateway)
                                 │
                 bff-consumer-service (Cloud Run)
                                 │
┌────────────────────────────────┴─────────────────────────────────┐
│                    search-aggregation-service                    │
│              (Cloud Run, region-pinned: EU + ASIA)               │
│ ┌──────────────┬─────────────┬─────────────────────────────────┐ │
│ │ presentation │ application │    infrastructure (adapters)    │ │
│ └──────┬───────┴──────┬──────┴────────────────┬────────────────┘ │
└────────┼──────────────┼───────────────────────┼──────────────────┘
         │              │                       │
  ┌──────┴──────┐ ┌─────┴─────┐ ┌───────────────┴───────────────┐
  │ Memorystore │ │ Cloud SQL │ │ Aiven OpenSearch (per region) │
  │ Redis       │ │ Postgres  │ │ alias: melmastoon-search-     │
  │ (region)    │ │ +PostGIS  │ │        current                │
  └─────────────┘ └─────┬─────┘ └───────────────┬───────────────┘
                        │                       │
             (transactional outbox)             │
                        │                       │
         Pub/Sub topics & subscriptions ────────┘
                        │
┌───────────────────────┴──────────────────────────────────────────┐
│ property-service   pricing-service   inventory-service           │
│ tenant-service     analytics-service                             │
└──────────────────────────────────────────────────────────────────┘

3. Process model

search-aggregation-service is a single Cloud Run revision that runs three processes inside one container, started by a shared Node.js process supervisor:

  1. HTTP server (NestJS): public read API, operator API, internal API.
  2. Outbox publisher: a single worker per pod that drains search.outbox to Pub/Sub. It claims rows with Postgres SELECT … FOR UPDATE SKIP LOCKED (row-level locks, not advisory locks), so concurrent pods skip rows another pod has already claimed and do not publish them twice.
  3. Pub/Sub pull subscribers: one subscriber per upstream subscription, started in-process. Concurrency is tuned per subscription:
| Subscription | Concurrency | Rationale |
| --- | --- | --- |
| melmastoon.property.* | 8 | Bursty on launch / publish flips |
| melmastoon.pricing.* | 16 | Highest steady-state volume |
| melmastoon.inventory.* | 16 | Highest steady-state volume |
| melmastoon.tenant.deleted.v1 | 1 | Rare, but expensive cascade |
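The outbox publisher's claim-publish-mark cycle can be sketched as follows. This is a minimal illustration, not the service's actual code: the `Tx` and `Publisher` interfaces, the function name `drainOnce`, and the column names (`id`, `topic`, `payload`, `published_at`, `created_at`) are all assumptions. Only the table name search.outbox and the SKIP LOCKED technique come from the text above.

```typescript
// Hypothetical minimal interfaces; a real service would wrap its own
// Postgres client and Pub/Sub SDK behind something similar.
interface OutboxRow { id: string; topic: string; payload: string; }
interface Tx {
  query(sql: string, params: unknown[]): Promise<{ rows: OutboxRow[] }>;
}
interface Publisher {
  publish(topic: string, payload: string): Promise<void>;
}

// Column names are assumptions. FOR UPDATE SKIP LOCKED takes row locks and
// silently skips rows that another transaction has already locked.
const CLAIM_SQL = `
  SELECT id, topic, payload
    FROM search.outbox
   WHERE published_at IS NULL
   ORDER BY created_at
   LIMIT $1
     FOR UPDATE SKIP LOCKED`;

const MARK_SQL =
  `UPDATE search.outbox SET published_at = now() WHERE id = ANY($1)`;

// Must run inside an open transaction: the row locks taken by CLAIM_SQL are
// held until commit, so other pods running the same query skip these rows.
async function drainOnce(
  tx: Tx,
  publisher: Publisher,
  batchSize: number = 100,
): Promise<number> {
  const { rows } = await tx.query(CLAIM_SQL, [batchSize]);
  for (const row of rows) {
    await publisher.publish(row.topic, row.payload);
  }
  if (rows.length > 0) {
    await tx.query(MARK_SQL, [rows.map((r) => r.id)]);
  }
  return rows.length; // number of rows published in this pass
}
```

Under this sketch, delivery is at-least-once: if the transaction fails to commit after a successful publish, the rows stay unpublished and are sent again on the next pass, so consumers need to be idempotent.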

All three processes share the SQL pool (max 30 connections per pod). Health endpoints reflect the worst of the three: /readyz returns HTTP 503 if outbox lag or any subscriber backlog exceeds its threshold.
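The worst-of readiness rule can be illustrated with a small pure function. The type names, field names, and the threshold values used in the usage note are hypothetical; the text above only says that a threshold exists for outbox lag and for subscriber backlog.

```typescript
// All names here are illustrative assumptions.
interface Thresholds {
  maxOutboxLagSec: number;
  maxSubscriberBacklog: number;
}
interface Snapshot {
  httpHealthy: boolean;
  outboxLagSec: number;
  subscriberBacklog: Record<string, number>; // backlog per subscription
}
interface Readiness {
  status: 200 | 503;
  failing: string[];
}

// Ready only if every in-container process is within its threshold.
function evaluateReadiness(s: Snapshot, t: Thresholds): Readiness {
  const failing: string[] = [];
  if (!s.httpHealthy) failing.push("http");
  if (s.outboxLagSec > t.maxOutboxLagSec) failing.push("outbox");
  for (const name of Object.keys(s.subscriberBacklog)) {
    if (s.subscriberBacklog[name] > t.maxSubscriberBacklog) {
      failing.push("subscriber:" + name);
    }
  }
  return { status: failing.length === 0 ? 200 : 503, failing };
}
```

For example, with assumed thresholds of 30 s outbox lag and 1 000 messages of backlog, a snapshot with a backlog of 5 000 on melmastoon.pricing.* yields status 503 with `failing` equal to `["subscriber:melmastoon.pricing.*"]`.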

4. Scaling parameters

| Knob | Value | Rationale |
| --- | --- | --- |
| Cloud Run min instances | 2 prod / 1 staging | Avoid cold starts in latency-sensitive read path |
| Cloud Run max instances | 50 | Paired with Postgres pool: 50 × 30 = 1 500 potential connections, guarded by the Cloud SQL Proxy connection limit; concurrent SQL connections are capped at 800 via a PgBouncer side-car |
| Cloud Run concurrency | 80 requests/instance | NestJS event-loop bound; keeps p95 stable |
| Cloud Run CPU | 2 vCPU, always-allocated | Background subscribers must run between requests |
| Memory | 4 GiB | OpenSearch client buffers + pgvector readers |
| Pub/Sub ack deadline | 60 s | Most projection writes < 200 ms; 60 s tolerates GC and brief Postgres latency |
| Pub/Sub max delivery | 7 attempts → DLQ | per-subscription policy |
| Outbox publish batch | 100 messages | Pub/Sub publish API max efficient batch |
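The connection arithmetic behind the max-instances row can be worked through in a few lines. The function and field names are illustrative; the inputs (50 instances, 30 connections per pod, 800-connection cap) are the values from the table.

```typescript
interface ConnectionBudget {
  demand: number;              // connections if every pod fills its pool
  overCapBy: number;           // how far demand exceeds the cap
  instancesAtFullPool: number; // pods that fit under the cap with full pools
}

function connectionBudget(
  instances: number,
  poolPerInstance: number,
  cap: number,
): ConnectionBudget {
  const demand = instances * poolPerInstance;
  return {
    demand,
    overCapBy: Math.max(0, demand - cap),
    instancesAtFullPool: Math.floor(cap / poolPerInstance),
  };
}

const budget = connectionBudget(50, 30, 800);
// budget.demand === 1500, budget.overCapBy === 700,
// budget.instancesAtFullPool === 26
```

In other words, once more than 26 instances saturate their pools at the same time, the 800-connection cap becomes the limiting factor rather than the per-pod pool size.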

5. Network & connectivity

  • Serverless VPC Connector vpc-conn-search-eu and vpc-conn-search-asia allow Cloud Run to reach private resources.
  • Cloud SQL: private IP only, accessed via Cloud SQL Auth Proxy side-car or cloud_sql_proxy static binding.
  • Memorystore: private services access (PSA) range.
  • Aiven OpenSearch: VPC peering to GCP project; firewall allow-list from VPC connector NAT IP only.
  • Egress to ai-orchestrator-service: internal HTTPS LB, mTLS via service mesh.
  • All cross-service calls carry a service-account-bound short-lived ID token (audience = receiver URL).
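One common way for a Cloud Run service to obtain such an ID token is the GCP metadata server's identity endpoint, which takes the audience as a query parameter. The sketch below assumes that mechanism; this document does not state how the service actually mints its tokens. The `HttpGet` type and the function names are hypothetical, and the HTTP client is injected so the example stays self-contained.

```typescript
// Hypothetical injected HTTP client, so the sketch has no runtime dependencies.
type HttpGet = (
  url: string,
  headers: Record<string, string>,
) => Promise<{ status: number; body: string }>;

// GCP metadata server endpoint that issues ID tokens for the runtime
// service account.
const METADATA_IDENTITY_URL =
  "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/identity";

// The audience is the receiver's URL, as the bullet above requires.
function identityTokenUrl(audience: string): string {
  return METADATA_IDENTITY_URL + "?audience=" + encodeURIComponent(audience);
}

async function fetchIdToken(get: HttpGet, audience: string): Promise<string> {
  // The metadata server rejects requests without this header.
  const res = await get(identityTokenUrl(audience), { "Metadata-Flavor": "Google" });
  if (res.status !== 200) {
    throw new Error("metadata server returned " + res.status);
  }
  return res.body.trim();
}

// Header to attach to the outgoing cross-service call.
function authHeader(token: string): Record<string, string> {
  return { Authorization: "Bearer " + token };
}
```

Tokens obtained this way are short-lived, so a caller would typically cache one per audience and refresh it before expiry.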

6. CI/CD pipeline

Branches → Cloud Build trigger → image build → tests → cosign sign → release pipeline.

| Stage | Action | Gate |
| --- | --- | --- |
| Pull request | unit + integration + contract + lint + typecheck + dep audit | green required to merge |
| Merge to main | full test matrix + coverage gate + perf smoke | green required to release |
| Build | hermetic build (BuildKit) + SBOM + cosign signature | image published to gcr.io/melmastoon-prod/search-aggregation-service |
| Migration plan | dry-run expand/backfill/contract against ephemeral Postgres | required for expand/contract PRs |
| Staging deploy | Argo CD promotes new revision; runs smoke synthetics | required pass before prod |
| Prod deploy | progressive rollout (see § 7) | manual approval + change ticket |

SERVICE_VERSION env is the git tag (e.g. v1.42.0); embedded into traces, logs, and the /readyz payload.

7. Progressive rollout

Cloud Run traffic splitting:

  1. Deploy new revision with --no-traffic.
  2. Run pre-rollout integration test against the new revision via tagged URL.
  3. Promote to 1 % of traffic for 10 min; SLO burn-rate alarms must remain quiet.
  4. 10 % for 10 min, then 50 % for 10 min, then 100 %.
  5. Hold previous revision for 24 h to enable instant rollback (gcloud run services update-traffic --to-revisions=<old>=100).
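The promotion ladder above can be expressed as a small decision function plus a helper that builds the `--to-revisions` flag for `gcloud run services update-traffic`. This is a sketch under assumptions: the names `nextDecision` and `toRevisionsFlag` and the revision names in the usage note are hypothetical, and a single boolean stands in for the SLO burn-rate alarms.

```typescript
const STEPS: number[] = [1, 10, 50, 100]; // traffic percentages from the steps above
const HOLD_MINUTES = 10;

type Decision =
  | { action: "promote"; percent: number; holdMinutes: number }
  | { action: "rollback" }
  | { action: "done" };

// currentPercent is 0 right after a --no-traffic deploy.
function nextDecision(currentPercent: number, burnRateAlarmFiring: boolean): Decision {
  if (burnRateAlarmFiring) return { action: "rollback" };
  for (const percent of STEPS) {
    if (percent > currentPercent) {
      // No hold is needed once the new revision takes all traffic.
      return { action: "promote", percent, holdMinutes: percent === 100 ? 0 : HOLD_MINUTES };
    }
  }
  return { action: "done" };
}

// Builds the flag value for `gcloud run services update-traffic`.
function toRevisionsFlag(newRev: string, oldRev: string, newPercent: number): string {
  if (newPercent >= 100) return "--to-revisions=" + newRev + "=100";
  if (newPercent <= 0) return "--to-revisions=" + oldRev + "=100";
  return "--to-revisions=" + newRev + "=" + newPercent + "," + oldRev + "=" + (100 - newPercent);
}
```

For instance, `toRevisionsFlag("rev-new", "rev-old", 10)` yields `--to-revisions=rev-new=10,rev-old=90`, and a rollback decision maps to `toRevisionsFlag("rev-new", "rev-old", 0)`, which sends all traffic back to the old revision.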

For DB schema changes, see MIGRATION_PLAN.md (expand → backfill → contract over multiple releases).

For OpenSearch index template changes, see § 9 (index swap runbook).

8. Disaster recovery

| Asset | RPO | RTO | Mechanism |
| --- | --- | --- | --- |
| Postgres projection | 5 min | 30 min | Cloud SQL HA + cross-region read replica + PITR (7 d) |
| OpenSearch | rebuild from canonical | 4 h (cold rebuild) / 30 min (snapshot restore) | Aiven nightly snapshots to GCS dual-region; rebuild from BigQuery archive |
| Memorystore | discardable | 1 min | New instance + cache warm-up via top-N golden queries |
| Pub/Sub | 7-day retention; replay supported | 0 | Subscriptions are durable; DLQ + replay topic |
| BigQuery event archive | 24 h | n/a | Authoritative replay source |
| Secrets | platform | platform | Secret Manager regional replicas |

A region-loss exercise (game day) is required quarterly: simulate europe-west1 Cloud Run + Postgres outage, fail over reads to asia-south1, and rebuild the EU OpenSearch cluster from the archive.

9. OpenSearch index swap runbook (summary)

Detailed runbook: ops/runbooks/search-aggregation-service/index-rebuild.md.

  1. POST /api/v1/search/index:rebuild { regions:["AF"], sinceTs:"2024-01-01T00:00:00Z" } ⇒ creates IndexBuild.
  2. Service creates new index melmastoon-search-v<n>-AF with the latest template.
  3. Service consumes BigQuery archive of canonical events from sinceTs, replaying through the same ProjectionAllowListPolicy and writing into the new index. Existing index keeps serving reads.
  4. Service enters catching_up phase: live consumption is dual-mirrored (current alias + new index) until they converge within index_lag_docs ≤ 100.
  5. Service enters swapping phase: it atomically repoints the alias melmastoon-search-current to the new index in a single _aliases call, then emits index.rebuilt.v1.
  6. Old index retained 48 h for instant rollback (single alias swap back). Then deleted by ILM.

Rollback: POST /api/v1/search/index:rollback { region:"AF" } if the new index shows higher error rate or worse top-N quality (ranking team's offline eval).
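The swap and its rollback both come down to one `_aliases` request whose `remove` and `add` actions are applied atomically. The sketch below builds that request body and the convergence check from step 4. Function names and the version numbers in the usage note are hypothetical. OpenSearch rejects index names containing uppercase letters, so the example lowercases the region suffix; how the service actually derives concrete index names is not specified here.

```typescript
interface AliasTarget { index: string; alias: string; }
interface AliasAction { remove?: AliasTarget; add?: AliasTarget; }

// Body for POST /_aliases; both actions take effect in one atomic step.
function aliasSwapBody(
  alias: string,
  fromIndex: string,
  toIndex: string,
): { actions: AliasAction[] } {
  return {
    actions: [
      { remove: { index: fromIndex, alias } },
      { add: { index: toIndex, alias } },
    ],
  };
}

// Step 4: swap only once the new index has converged (index_lag_docs ≤ 100).
function canSwap(indexLagDocs: number, maxLagDocs: number = 100): boolean {
  return indexLagDocs <= maxLagDocs;
}

// Hypothetical naming helper following melmastoon-search-v<n>-<region>.
function indexName(version: number, region: string): string {
  return "melmastoon-search-v" + version + "-" + region.toLowerCase();
}
```

A forward swap would be `aliasSwapBody("melmastoon-search-current", indexName(7, "AF"), indexName(8, "AF"))`; rollback within the 48 h retention window is the same call with the two index arguments exchanged.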

10. Configuration & feature flags

Configuration is loaded at boot from Secret Manager (SECRET_*) and from a ConfigMap-equivalent stored in melmastoon-prod GCS bucket (config/search-aggregation-service/<env>.json), watched for hot-reload by the ConfigWatcher.

Feature flags via the platform flags-service (read at boot + periodic refresh):

| Flag | Default (prod) | Purpose |
| --- | --- | --- |
| search.semantic_rerank.enabled | false | Phase 2 |
| search.sponsored_slots.enabled | false | Phase 3 |
| search.degrade_on_opensearch_error | true | Postgres-only fallback |
| search.intent_cache_ttl_sec | 2592000 | 30 d |
| search.region_pinning.strict | true | enforce region filter |
| search.index_rebuild.dry_run | false | for chaos tests |
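As an illustration of how the defaults above might be combined with values fetched from flags-service, the following sketch keeps the prod defaults in a typed object and overlays whatever the remote source returns. The `SearchFlags` interface and `resolveFlags` function are hypothetical, not the flags-service client API.

```typescript
interface SearchFlags {
  "search.semantic_rerank.enabled": boolean;
  "search.sponsored_slots.enabled": boolean;
  "search.degrade_on_opensearch_error": boolean;
  "search.intent_cache_ttl_sec": number;
  "search.region_pinning.strict": boolean;
  "search.index_rebuild.dry_run": boolean;
}

// Prod defaults from the table above; 30 d = 30 * 24 * 60 * 60 = 2 592 000 s.
const PROD_DEFAULTS: SearchFlags = {
  "search.semantic_rerank.enabled": false,
  "search.sponsored_slots.enabled": false,
  "search.degrade_on_opensearch_error": true,
  "search.intent_cache_ttl_sec": 30 * 24 * 60 * 60,
  "search.region_pinning.strict": true,
  "search.index_rebuild.dry_run": false,
};

// Remote values win; any flag the remote source omits keeps its default.
function resolveFlags(remote: Partial<SearchFlags>): SearchFlags {
  return { ...PROD_DEFAULTS, ...remote };
}
```

Calling `resolveFlags({})` at boot returns the defaults unchanged, so a flags-service outage at startup would leave the service on the documented prod behavior under this sketch.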

11. Cost guardrails

| Surface | Soft cap | Hard cap |
| --- | --- | --- |
| Cloud Run CPU·s/month | 80 % of monthly budget | 110 % ⇒ scale max instances down |
| Pub/Sub bytes/month | 80 % | 110 % ⇒ down-sample query.executed.v1 |
| OpenSearch storage | 80 % | 110 % ⇒ tighten ILM warm phase |
| AI orchestrator $$ | per AI_INTEGRATION § 5 | hard cap denies new calls |

Monthly cost report posted to #ops-billing and reviewed by the platform owner.
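The 80 % / 110 % thresholds can be checked with a simple classifier. The function name and the budget figures in the usage note are illustrative assumptions; only the two percentages come from the table. The comparison uses integer multiplication to avoid floating-point edge cases exactly at the thresholds.

```typescript
type GuardrailState = "ok" | "soft_cap" | "hard_cap";

// spend and monthlyBudget must be in the same unit (for example CPU-seconds
// or bytes) and measured over the same month.
function classifySpend(spend: number, monthlyBudget: number): GuardrailState {
  if (spend * 10 >= monthlyBudget * 11) return "hard_cap"; // >= 110 %
  if (spend * 10 >= monthlyBudget * 8) return "soft_cap";  // >= 80 %
  return "ok";
}
```

With an assumed budget of 1 000 units, a spend of 790 is `ok`, 800 reaches the soft cap, and 1 100 reaches the hard cap, at which point the per-surface action from the table applies.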

12. Service identity & SAs

| SA | Used by | Roles |
| --- | --- | --- |
| search-aggregation@<project>.iam.gserviceaccount.com | Cloud Run revision | roles/cloudsql.client, roles/pubsub.subscriber, roles/pubsub.publisher (own topics only), roles/secretmanager.secretAccessor (scoped), roles/redis.editor (Memorystore), roles/aiplatform.user (only when Phase 2+ direct calls are authorized; currently denied) |
| ci-search-aggregation@<project>.iam.gserviceaccount.com | Cloud Build | image push, deploy, run integration tests |
| index-builder@<project>.iam.gserviceaccount.com | Index rebuild Cloud Run Job | BigQuery read-only on events_raw.melmastoon_*, OpenSearch admin via Aiven token |

All SAs are managed in the platform IaC (Terraform gcp/iam/search-aggregation/*.tf).

13. Operational handover artifacts

  • Live dashboards URL set in ops/dashboards/search-aggregation-service.json (Grafana).
  • On-call rotation: search-aggregation PagerDuty schedule, primary + secondary, follow-the-sun.
  • Runbooks under ops/runbooks/search-aggregation-service/ (per OBSERVABILITY § 9).
  • Change calendar: weekly window Tue 09:00 UTC for non-urgent changes; emergency outside-window changes require platform owner approval.