search-aggregation-service — DEPLOYMENT_TOPOLOGY
Companion: SERVICE_OVERVIEW · DATA_MODEL · SECURITY_MODEL · OBSERVABILITY · FAILURE_MODES · ../../docs/02-enterprise-architecture.md
1. Cloud, regions, and SKUs
| Resource | Provider | Region(s) | SKU / size |
|---|---|---|---|
| Service runtime | GCP Cloud Run (gen2) | europe-west1 (primary), asia-south1 (warm secondary) | min instances 2 (prod), max 50; 2 vCPU / 4 GiB / cpu-always-allocated |
| Postgres projection | Cloud SQL for PostgreSQL 15 + PostGIS | europe-west1 HA + cross-region replica in asia-south1 | db-custom-4-16384, 200 GB SSD, IOPS auto-resize |
| OpenSearch | Aiven for OpenSearch 2.x (peered VPC in GCP) | europe-west1 cluster + asia-south1 cluster (separate, not replicated; populated by independent rebuild jobs) | 3 master + 3 data nodes, business-4 plan, 200 GB SSD per data node |
| Redis cache | Memorystore for Redis 7 (private services access) | europe-west1 HA + asia-south1 HA | M3 (4 GiB) |
| Pub/Sub | GCP Pub/Sub | global | per-topic; ordering enabled where required |
| Object archive (event replay) | BigQuery + GCS | europe-west1 (primary), GCS dual-region eu | per platform standard |
| Secret store | Google Secret Manager | global with regional replicas | per SECURITY_MODEL § 6 |
| CI/CD | Cloud Build + Artifact Registry | europe-west1 | platform standard |
| Edge / WAF | Cloud Armor + External HTTPS LB | global | per platform |
Cloud is GCP everywhere — no AWS, no on-prem. Desktop is Electron, but this service has no desktop surface (see SYNC_CONTRACT.md).
2. Topology diagram
        ┌──────────────────────────────────────────────────────────┐
        │                   Internet (anonymous)                   │
        └───────────────────────────┬──────────────────────────────┘
                                    │
                     Cloud Armor (WAF + rate limit)
                                    │
                       External HTTPS LB (global)
                                    │
                          Apigee (API gateway)
                                    │
                    bff-consumer-service (Cloud Run)
                                    │
┌───────────────────────────────────┴───────────────────────────────────┐
│                      search-aggregation-service                       │
│                 (Cloud Run, region-pinned: EU + ASIA)                 │
│ ┌──────────────┬─────────────────┬──────────────────────────────────┐ │
│ │ presentation │   application   │    infrastructure (adapters)     │ │
│ └──────┬───────┴────────┬────────┴─────────────────┬────────────────┘ │
└────────┼────────────────┼──────────────────────────┼──────────────────┘
         │                │                          │
  ┌──────┴──────┐  ┌──────┴──────┐ ┌─────────────────┴─────────────────┐
  │ Memorystore │  │  Cloud SQL  │ │   Aiven OpenSearch (per region)   │
  │    Redis    │  │  Postgres   │ │ alias: melmastoon-search-current  │
  │  (region)   │  │  +PostGIS   │ │                                   │
  └─────────────┘  └──────┬──────┘ └─────────────────┬─────────────────┘
                          │                          │
               (transactional outbox)                │
                          │                          │
           Pub/Sub topics & subscriptions ───────────┘
                          │
┌─────────────────────────┴─────────────────────────────────────────────┐
│    property-service      pricing-service      inventory-service       │
│    tenant-service        analytics-service                            │
└───────────────────────────────────────────────────────────────────────┘
3. Process model
search-aggregation-service is a single Cloud Run revision that runs three processes inside one container, started by a shared Node.js process supervisor:
- HTTP server (NestJS): public read API, operator API, internal API.
- Outbox publisher: a single concurrent worker per pod that drains `search.outbox` to Pub/Sub. It claims rows with Postgres `SELECT … FOR UPDATE SKIP LOCKED` row-level locks so that pods do not publish the same message twice.
- Pub/Sub pull subscribers: one subscriber per upstream subscription, started in-process. Concurrency is tuned per subscription:
| Subscription | Concurrency | Rationale |
|---|---|---|
| `melmastoon.property.*` | 8 | Bursty on launch / publish flips |
| `melmastoon.pricing.*` | 16 | Highest steady-state volume |
| `melmastoon.inventory.*` | 16 | Highest steady-state volume |
| `melmastoon.tenant.deleted.v1` | 1 | Rare, but expensive cascade |
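The per-subscription concurrency in the table could be held in a small lookup, sketched below. The names `SUBSCRIPTION_CONCURRENCY` and `concurrencyFor`, the rule that a trailing `*` matches any suffix, and the fallback of 1 are illustrative assumptions, not the service's actual configuration code.

```typescript
// Hypothetical config: subscription name patterns from the table mapped to worker concurrency.
const SUBSCRIPTION_CONCURRENCY: ReadonlyArray<readonly [string, number]> = [
  ["melmastoon.property.*", 8],
  ["melmastoon.pricing.*", 16],
  ["melmastoon.inventory.*", 16],
  ["melmastoon.tenant.deleted.v1", 1],
];

// Assumption: a trailing "*" matches any suffix; any other pattern must match exactly.
function matchesPattern(pattern: string, name: string): boolean {
  return pattern.endsWith("*")
    ? name.startsWith(pattern.slice(0, -1))
    : name === pattern;
}

// Returns the configured concurrency, or a conservative default of 1 (assumed) when nothing matches.
function concurrencyFor(subscription: string): number {
  const hit = SUBSCRIPTION_CONCURRENCY.find(([pattern]) => matchesPattern(pattern, subscription));
  return hit ? hit[1] : 1;
}
```

For example, a subscription named `melmastoon.pricing.rate-changed.v1` (a made-up name) would resolve to 16 workers under this rule.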
All three processes share the SQL pool (max 30 connections per pod). Health endpoints report the worst state of the three: /readyz returns 503 if outbox lag or any subscriber backlog exceeds its threshold.
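A minimal sketch of the worst-of readiness rule follows. The field names, the `Thresholds` shape, and the idea of measuring outbox lag in seconds are assumptions; the source only states that /readyz returns 503 when a lag or backlog threshold is exceeded.

```typescript
// Inputs sampled from the three in-container processes (field names are assumptions).
interface HealthSnapshot {
  httpServerUp: boolean;
  outboxLagSeconds: number;
  subscriberBacklogs: Record<string, number>; // subscription name -> undelivered messages
}

// Placeholder thresholds; the actual values are not given in this document.
interface Thresholds {
  maxOutboxLagSeconds: number;
  maxSubscriberBacklog: number;
}

// Worst-of rule: ready (200) only if every process is within its threshold, else 503.
function readyzStatus(s: HealthSnapshot, t: Thresholds): 200 | 503 {
  const outboxOk = s.outboxLagSeconds <= t.maxOutboxLagSeconds;
  const subscribersOk = Object.keys(s.subscriberBacklogs).every(
    (name) => s.subscriberBacklogs[name] <= t.maxSubscriberBacklog,
  );
  return s.httpServerUp && outboxOk && subscribersOk ? 200 : 503;
}
```

A single subscriber over its backlog threshold is enough to take the instance out of rotation, which is the intended effect of reporting the worst state.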
4. Scaling parameters
| Knob | Value | Rationale |
|---|---|---|
| Cloud Run min instances | 2 prod / 1 staging | Avoid cold starts in latency-sensitive read path |
| Cloud Run max instances | 50 | Paired with the Postgres pool: 50 × 30 = 1 500 potential connections, guarded by the Cloud SQL Proxy connection limit; a PgBouncer side-car caps concurrent SQL connections at 800 |
| Cloud Run concurrency | 80 requests/instance | NestJS event-loop bound; keeps p95 stable |
| Cloud Run CPU | 2 vCPU, always-allocated | Background subscribers must run between requests |
| Memory | 4 GiB | OpenSearch client buffers + pgvector readers |
| Pub/Sub ack deadline | 60 s | Most projection writes < 200 ms; 60 s tolerates GC and brief Postgres latency |
| Pub/Sub max delivery | 7 attempts → DLQ | per-subscription policy |
| Outbox publish batch | 100 messages | Pub/Sub publish API max efficient batch |
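The connection arithmetic behind the max-instances row can be checked with a short calculation. This sketch assumes the 800-connection PgBouncer cap applies across all instances combined; the `connectionBudget` helper and its field names are hypothetical.

```typescript
// Values taken from the scaling table and § 3.
const MAX_INSTANCES = 50;
const POOL_PER_INSTANCE = 30;
const PGBOUNCER_CAP = 800; // assumed to be a fleet-wide cap

interface ConnectionBudget {
  worstCaseDemand: number;      // every instance holding a full pool
  oversubscribed: boolean;      // demand exceeds the cap
  instancesAtFullPool: number;  // instances that fit under the cap with full pools
  fairSharePerInstance: number; // even split of the cap at max scale
}

function connectionBudget(maxInstances: number, poolPerInstance: number, cap: number): ConnectionBudget {
  const worstCaseDemand = maxInstances * poolPerInstance;
  return {
    worstCaseDemand,
    oversubscribed: worstCaseDemand > cap,
    instancesAtFullPool: Math.floor(cap / poolPerInstance),
    fairSharePerInstance: Math.floor(cap / maxInstances),
  };
}

const budget = connectionBudget(MAX_INSTANCES, POOL_PER_INSTANCE, PGBOUNCER_CAP);
```

Under that assumption, worst-case demand is 1 500 connections, 26 instances can hold full pools before the cap binds, and at 50 instances the even share is 16 connections each.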
5. Network & connectivity
- Serverless VPC Connector `vpc-conn-search-eu` and `vpc-conn-search-asia` allow Cloud Run to reach private resources.
- Cloud SQL: private IP only, accessed via Cloud SQL Auth Proxy side-car or `cloud_sql_proxy` static binding.
- Memorystore: private services access (PSA) range.
- Aiven OpenSearch: VPC peering to GCP project; firewall allow-list from VPC connector NAT IP only.
- Egress to `ai-orchestrator-service`: internal HTTPS LB, mTLS via service mesh.
- All cross-service calls carry a service-account-bound short-lived ID token (audience = receiver URL).
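For the ID-token rule in the last bullet, one common approach on Cloud Run is to request the token from the GCP metadata server with the receiver URL as the audience. The sketch below only builds the request and the outbound header. The helper names are hypothetical, and whether this service uses the metadata server directly or a client library is not stated here.

```typescript
// GCP metadata server endpoint that mints an ID token for the runtime service account.
const METADATA_IDENTITY_URL =
  "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/identity";

interface TokenRequest {
  url: string;
  headers: Record<string, string>;
}

// Builds the token request; the audience is the receiver's URL, as required above.
function buildIdTokenRequest(receiverUrl: string): TokenRequest {
  return {
    url: `${METADATA_IDENTITY_URL}?audience=${encodeURIComponent(receiverUrl)}`,
    headers: { "Metadata-Flavor": "Google" }, // required by the metadata server
  };
}

// Attaches a fetched token to an outbound call as a bearer credential.
function authHeaders(idToken: string): Record<string, string> {
  return { Authorization: `Bearer ${idToken}` };
}
```

Because the audience is bound into the token, a token minted for one receiver is rejected by any other service that validates the `aud` claim.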
6. CI/CD pipeline
Branches → Cloud Build trigger → image build → tests → cosign sign → release pipeline.
| Stage | Action | Gate |
|---|---|---|
| Pull request | unit + integration + contract + lint + typecheck + dep audit | green required to merge |
| Merge to main | full test matrix + coverage gate + perf smoke | green required to release |
| Build | hermetic build (BuildKit) + SBOM + cosign signature | image published to gcr.io/melmastoon-prod/search-aggregation-service |
| Migration plan | dry-run expand/backfill/contract against ephemeral Postgres | required for expand/contract PRs |
| Staging deploy | Argo CD promotes new revision; runs smoke synthetics | required pass before prod |
| Prod deploy | progressive rollout (see § 7) | manual approval + change ticket |
SERVICE_VERSION env is the git tag (e.g. v1.42.0); embedded into traces, logs, and the /readyz payload.
7. Progressive rollout
Cloud Run traffic splitting:
- Deploy new revision with `--no-traffic`.
- Run pre-rollout integration test against the new revision via tagged URL.
- Promote to 1 % of traffic for 10 min; SLO burn-rate alarms must remain quiet.
- 10 % for 10 min, then 50 % for 10 min, then 100 %.
- Hold previous revision for 24 h to enable instant rollback (`gcloud run services update-traffic --to-revisions=<old>=100`).
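The promotion schedule above can be expressed as a small decision function, sketched here. The `Decision` type and the `nextDecision` helper are hypothetical; treating any firing burn-rate alarm as an immediate rollback is an assumption about how "must remain quiet" is enforced.

```typescript
// Rollout stages from the list above: traffic percent and minimum hold time.
interface Stage {
  percent: number;
  holdMinutes: number;
}

const STAGES: ReadonlyArray<Stage> = [
  { percent: 1, holdMinutes: 10 },
  { percent: 10, holdMinutes: 10 },
  { percent: 50, holdMinutes: 10 },
  { percent: 100, holdMinutes: 0 },
];

type Decision =
  | { action: "promote"; percent: number }
  | { action: "hold" }
  | { action: "rollback" }
  | { action: "done" };

// Decides the next move from the current traffic share, minutes held, and alarm state.
function nextDecision(currentPercent: number, minutesHeld: number, alarmsQuiet: boolean): Decision {
  if (!alarmsQuiet) return { action: "rollback" }; // assumed policy: any alarm aborts
  const idx = STAGES.findIndex((s) => s.percent === currentPercent);
  if (idx === -1) return { action: "promote", percent: STAGES[0].percent }; // revision still at 0 %
  if (idx === STAGES.length - 1) return { action: "done" };
  if (minutesHeld < STAGES[idx].holdMinutes) return { action: "hold" };
  return { action: "promote", percent: STAGES[idx + 1].percent };
}
```

Each `promote` decision would map to one traffic update on the Cloud Run service, and `rollback` to routing 100 % back to the retained previous revision.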
For DB schema changes, see MIGRATION_PLAN.md (expand → backfill → contract over multiple releases).
For OpenSearch index template changes, see § 9 (index swap runbook).
8. Disaster recovery
| Asset | RPO | RTO | Mechanism |
|---|---|---|---|
| Postgres projection | 5 min | 30 min | Cloud SQL HA + cross-region read replica + PITR (7 d) |
| OpenSearch | rebuild from canonical | 4 h (cold rebuild) / 30 min (snapshot restore) | Aiven nightly snapshots to GCS dual-region; rebuild from BigQuery archive |
| Memorystore | discardable | 1 min | New instance + cache warm-up via top-N golden queries |
| Pub/Sub | 7-day retention; replay supported | 0 | Subscriptions are durable; DLQ + replay topic |
| BigQuery event archive | 24 h | n/a | Authoritative replay source |
| Secrets | platform | platform | Secret Manager regional replicas |
A region-loss exercise (game day) is required quarterly: simulate europe-west1 Cloud Run + Postgres outage, fail over reads to asia-south1, and rebuild the EU OpenSearch cluster from the archive.
9. OpenSearch index swap runbook (summary)
Detailed runbook: ops/runbooks/search-aggregation-service/index-rebuild.md.
- `POST /api/v1/search/index:rebuild { regions:["AF"], sinceTs:"2024-01-01T00:00:00Z" }` ⇒ creates `IndexBuild`.
- Service creates new index `melmastoon-search-v<n>-AF` with the latest template.
- Service consumes BigQuery archive of canonical events from `sinceTs`, replaying through the same `ProjectionAllowListPolicy` and writing into the new index. Existing index keeps serving reads.
- Service enters `catching_up` phase: live consumption is dual-mirrored (current alias + new index) until they converge within `index_lag_docs ≤ 100`.
- Service enters `swapping` phase: atomic alias `melmastoon-search-current` → new index in a single `_aliases` call; emits `index.rebuilt.v1`.
- Old index retained 48 h for instant rollback (single alias swap back). Then deleted by ILM.
Rollback: POST /api/v1/search/index:rollback { region:"AF" } if the new index shows higher error rate or worse top-N quality (ranking team's offline eval).
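The atomic swap and its rollback both come down to one `_aliases` request with a remove action and an add action. A minimal sketch of that request body follows. The helper names and the example index versions are hypothetical, and the lag gate simply restates the `index_lag_docs ≤ 100` condition from the `catching_up` phase.

```typescript
const SEARCH_ALIAS = "melmastoon-search-current";

interface AliasAction {
  remove?: { index: string; alias: string };
  add?: { index: string; alias: string };
}

// Body for POST /_aliases: OpenSearch applies all listed actions atomically.
function aliasSwapBody(fromIndex: string, toIndex: string): { actions: AliasAction[] } {
  return {
    actions: [
      { remove: { index: fromIndex, alias: SEARCH_ALIAS } },
      { add: { index: toIndex, alias: SEARCH_ALIAS } },
    ],
  };
}

// Convergence gate: swap only once the new index trails by at most 100 documents.
function readyToSwap(indexLagDocs: number): boolean {
  return indexLagDocs >= 0 && indexLagDocs <= 100;
}

// Forward swap and rollback differ only in argument order (index versions are made up).
const swap = aliasSwapBody("melmastoon-search-v7-AF", "melmastoon-search-v8-AF");
const rollback = aliasSwapBody("melmastoon-search-v8-AF", "melmastoon-search-v7-AF");
```

Because both actions travel in one call, readers never observe a moment where the alias points at no index or at both.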
10. Configuration & feature flags
Configuration is loaded at boot from Secret Manager (SECRET_*) and from a ConfigMap-equivalent stored in melmastoon-prod GCS bucket (config/search-aggregation-service/<env>.json), watched for hot-reload by the ConfigWatcher.
Feature flags via the platform flags-service (read at boot + periodic refresh):
| Flag | Default (prod) | Purpose |
|---|---|---|
| `search.semantic_rerank.enabled` | false | Phase 2 |
| `search.sponsored_slots.enabled` | false | Phase 3 |
| `search.degrade_on_opensearch_error` | true | Postgres-only fallback |
| `search.intent_cache_ttl_sec` | 2592000 | 30 d |
| `search.region_pinning.strict` | true | enforce region filter |
| `search.index_rebuild.dry_run` | false | for chaos tests |
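One way to make the flag table safe to consume is a typed set of defaults that remote values are merged over. The sketch below assumes the flags-service returns a flat key/value object; `PROD_DEFAULTS` and `resolveFlags` are hypothetical names, and ignoring wrongly typed values is an assumed policy.

```typescript
type SearchFlags = {
  "search.semantic_rerank.enabled": boolean;
  "search.sponsored_slots.enabled": boolean;
  "search.degrade_on_opensearch_error": boolean;
  "search.intent_cache_ttl_sec": number;
  "search.region_pinning.strict": boolean;
  "search.index_rebuild.dry_run": boolean;
};

// Production defaults copied from the table above.
const PROD_DEFAULTS: SearchFlags = {
  "search.semantic_rerank.enabled": false,
  "search.sponsored_slots.enabled": false,
  "search.degrade_on_opensearch_error": true,
  "search.intent_cache_ttl_sec": 2592000, // 30 days in seconds
  "search.region_pinning.strict": true,
  "search.index_rebuild.dry_run": false,
};

// Merges remote values over the defaults; unknown keys and wrongly typed values are ignored.
function resolveFlags(remote: Record<string, unknown>): SearchFlags {
  const resolved: Record<string, boolean | number> = { ...PROD_DEFAULTS };
  for (const key of Object.keys(resolved)) {
    const value = remote[key];
    if (typeof value === typeof resolved[key]) {
      resolved[key] = value as boolean | number;
    }
  }
  return resolved as unknown as SearchFlags;
}
```

With this shape, a malformed remote value such as a string TTL falls back to the default instead of reaching the cache layer.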
11. Cost guardrails
| Surface | Soft cap | Hard cap |
|---|---|---|
| Cloud Run CPU·s/month | 80 % of monthly budget | 110 % ⇒ scale max instances down |
| Pub/Sub bytes/month | 80 % | 110 % ⇒ down-sample query.executed.v1 |
| OpenSearch storage | 80 % | 110 % ⇒ tighten ILM warm phase |
| AI orchestrator $$ | per AI_INTEGRATION § 5 | hard cap denies new calls |
Monthly cost report posted to #ops-billing and reviewed by the platform owner.
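The 80 % / 110 % thresholds used by the first three rows of the table can be evaluated with a small classifier, sketched below. The `guardrailState` name and the choice to treat the thresholds as inclusive are assumptions; the AI orchestrator row follows its own policy and is not covered.

```typescript
type GuardrailState = "ok" | "soft_cap" | "hard_cap";

// Soft cap at 80 % and hard cap at 110 % of the monthly budget, per the table.
// Integer-style comparisons avoid floating-point error at the boundaries.
function guardrailState(spend: number, monthlyBudget: number): GuardrailState {
  if (monthlyBudget <= 0) throw new RangeError("monthlyBudget must be positive");
  if (spend * 10 >= monthlyBudget * 11) return "hard_cap"; // assumed inclusive threshold
  if (spend * 10 >= monthlyBudget * 8) return "soft_cap";  // assumed inclusive threshold
  return "ok";
}
```

A `hard_cap` result would trigger the surface-specific action in the table, for example lowering Cloud Run max instances or down-sampling `query.executed.v1`.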
12. Service identity & SAs
| SA | Used by | Roles |
|---|---|---|
| `search-aggregation@<project>.iam.gserviceaccount.com` | Cloud Run revision | roles/cloudsql.client, roles/pubsub.subscriber, roles/pubsub.publisher (own topics only), roles/secretmanager.secretAccessor (scoped), roles/redis.editor (Memorystore), roles/aiplatform.user (only when Phase 2+ direct calls are authorized; currently denied) |
| `ci-search-aggregation@<project>.iam.gserviceaccount.com` | Cloud Build | image push, deploy, run integration tests |
| `index-builder@<project>.iam.gserviceaccount.com` | Index rebuild Cloud Run Job | BigQuery read-only on `events_raw.melmastoon_*`, OpenSearch admin via Aiven token |
All SAs are managed in the platform IaC (Terraform gcp/iam/search-aggregation/*.tf).
13. Operational handover artifacts
- Live dashboards URL set in `ops/dashboards/search-aggregation-service.json` (Grafana).
- On-call rotation: `search-aggregation` PagerDuty schedule, primary + secondary, follow-the-sun.
- Runbooks under `ops/runbooks/search-aggregation-service/` (per OBSERVABILITY § 9).
- Change calendar: weekly window Tue 09:00 UTC for non-urgent changes; emergency outside-window changes require platform owner approval.