search-aggregation-service — DEPLOYMENT_TOPOLOGY

Companion: SERVICE_OVERVIEW · DATA_MODEL · SECURITY_MODEL · OBSERVABILITY · FAILURE_MODES · ../../docs/02-enterprise-architecture.md

1. Cloud, regions, and SKUs

Resource	Provider	Region(s)	SKU / size
Service runtime	GCP Cloud Run (gen2)	`europe-west1` (primary), `asia-south1` (warm secondary)	min instances 2 (prod), max 50; 2 vCPU / 4 GiB / cpu-always-allocated
Postgres projection	Cloud SQL for PostgreSQL 15 + PostGIS	`europe-west1` HA + cross-region replica in `asia-south1`	`db-custom-4-16384`, 200 GB SSD, IOPS auto-resize
OpenSearch	Aiven for OpenSearch 2.x (peered VPC in GCP)	`europe-west1` cluster + `asia-south1` cluster (separate, not replicated; populated by independent rebuild jobs)	3 master + 3 data nodes, `business-4` plan, 200 GB SSD per data node
Redis cache	Memorystore for Redis 7 (private services access)	`europe-west1` HA + `asia-south1` HA	M3 (4 GiB)
Pub/Sub	GCP Pub/Sub	global	per-topic; ordering enabled where required
Object archive (event replay)	BigQuery + GCS	`europe-west1` (primary), GCS dual-region `eu`	per platform standard
Secret store	Google Secret Manager	global with regional replicas	per SECURITY_MODEL § 6
CI/CD	Cloud Build + Artifact Registry	`europe-west1`	platform standard
Edge / WAF	Cloud Armor + External HTTPS LB	global	per platform

Cloud is GCP everywhere — no AWS, no on-prem. Desktop is Electron, but this service has no desktop surface (see SYNC_CONTRACT.md).

2. Topology diagram

                    ┌──────────────────────────────────────────────────────────┐
                    │                       Internet (anonymous)               │
                    └───────────────────────────┬──────────────────────────────┘
                                                │
                                  Cloud Armor (WAF + rate limit)
                                                │
                                External HTTPS LB (global)
                                                │
                                       Apigee (API gateway)
                                                │
                                bff-consumer-service (Cloud Run)
                                                │
            ┌───────────────────────────────────┴───────────────────────────────────┐
            │                       search-aggregation-service                      │
            │                  (Cloud Run, region-pinned: EU + ASIA)                │
            │  ┌──────────────┬─────────────┬───────────────────────────────────┐   │
            │  │ presentation │ application │ infrastructure (adapters)         │   │
            │  └────┬─────────┴───┬─────────┴─────────┬─────────────────────────┘   │
            └───────┼─────────────┼───────────────────┼─────────────────────────────┘
                    │             │                   │
            ┌───────┴────┐  ┌─────┴──────┐    ┌───────┴────────────────────────┐
            │ Memorystore│  │ Cloud SQL  │    │ Aiven OpenSearch (per region)  │
            │ Redis      │  │  Postgres  │    │  alias: melmastoon-search-     │
            │ (region)   │  │  +PostGIS  │    │         current                │
            └────────────┘  └─────┬──────┘    └───────────────┬────────────────┘
                                  │                           │
                       (transactional outbox)                 │
                                  │                           │
                           Pub/Sub topics & subscriptions ────┘
                                  │
            ┌─────────────────────┴────────────────────────────────────────────┐
            │  property-service   pricing-service   inventory-service          │
            │  tenant-service     analytics-service                            │
            └──────────────────────────────────────────────────────────────────┘

3. Process model

search-aggregation-service is a single Cloud Run revision that runs three processes inside one container, started by a shared Node.js process supervisor:

HTTP server (NestJS): public read API, operator API, internal API.
Outbox publisher: one worker per instance that drains search.outbox to Pub/Sub. It claims rows with Postgres SELECT … FOR UPDATE SKIP LOCKED (row-level locks, not advisory locks), so two instances never publish the same row.
Pub/Sub pull subscribers: one subscriber per upstream subscription, started in-process. Concurrency is tuned per subscription:

Subscription	Concurrency	Rationale
`melmastoon.property.*`	8	Bursty on launch / publish flips
`melmastoon.pricing.*`	16	Highest steady-state volume
`melmastoon.inventory.*`	16	Highest steady-state volume
`melmastoon.tenant.deleted.v1`	1	Rare, but expensive cascade

All three processes share the SQL pool (max 30 connections per instance). Health endpoints report the worst of the three: /readyz returns 503 if outbox lag or any subscriber backlog exceeds its threshold.
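To make the worst-of readiness rule concrete, here is a minimal TypeScript sketch. It is not the service's health module: the field names, the `evaluateReadiness` helper, and the threshold values (30 s outbox lag, 5 000 backlog messages) are assumptions introduced for illustration. The document only states that /readyz returns 503 when a threshold is exceeded.

```typescript
// Hypothetical shapes; names and thresholds are illustrative, not taken from the service.
interface ReadinessInputs {
  httpServerUp: boolean;
  outboxLagSeconds: number;
  subscriberBacklog: Record<string, number>; // subscription name -> undelivered messages
}

interface ReadinessThresholds {
  maxOutboxLagSeconds: number;
  maxSubscriberBacklog: number;
}

interface ReadinessResult {
  httpStatus: 200 | 503;
  failing: string[]; // components that breached a threshold
}

// Worst-of aggregation: a single failing component makes the whole instance unready.
function evaluateReadiness(inputs: ReadinessInputs, limits: ReadinessThresholds): ReadinessResult {
  const failing: string[] = [];
  if (!inputs.httpServerUp) failing.push("http");
  if (inputs.outboxLagSeconds > limits.maxOutboxLagSeconds) failing.push("outbox");
  for (const subscription of Object.keys(inputs.subscriberBacklog)) {
    if (inputs.subscriberBacklog[subscription] > limits.maxSubscriberBacklog) {
      failing.push(`subscriber:${subscription}`);
    }
  }
  return { httpStatus: failing.length === 0 ? 200 : 503, failing };
}

// Example: the pricing subscriber has fallen behind, so /readyz would answer 503.
const readinessSample = evaluateReadiness(
  {
    httpServerUp: true,
    outboxLagSeconds: 4,
    subscriberBacklog: { "melmastoon.pricing.*": 12000, "melmastoon.property.*": 3 },
  },
  { maxOutboxLagSeconds: 30, maxSubscriberBacklog: 5000 },
);
console.log(readinessSample.httpStatus, readinessSample.failing);
```

Returning the list of failing components alongside the status code keeps the /readyz payload useful for on-call triage, at the cost of exposing internal component names to whoever can reach the endpoint.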

4. Scaling parameters

Knob	Value	Rationale
Cloud Run min instances	2 prod / 1 staging	Avoid cold starts in latency-sensitive read path
Cloud Run max instances	50	Paired with the Postgres pool: 50 × 30 = 1 500 connections in the worst case. The Cloud SQL Proxy connection limit guards against this, and a PgBouncer side-car caps concurrent SQL connections at 800.
Cloud Run concurrency	80 requests/instance	NestJS event-loop bound; keeps p95 stable
Cloud Run CPU	2 vCPU, always-allocated	Background subscribers must run between requests
Memory	4 GiB	OpenSearch client buffers + pgvector readers
Pub/Sub ack deadline	60 s	Most projection writes < 200 ms; 60 s tolerates GC and brief Postgres latency
Pub/Sub max delivery	7 attempts → DLQ	per-subscription policy
Outbox publish batch	100 messages	Pub/Sub publish API max efficient batch
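The connection arithmetic behind the max-instances row can be checked with a short sketch. The `connectionBudget` helper and its field names are hypothetical, and the sketch assumes the 800-connection cap is a fleet-wide ceiling; the document does not say how the cap is divided between instances.

```typescript
// Hypothetical helper; treats the PgBouncer cap as one fleet-wide ceiling (an assumption).
interface PoolBudget {
  maxInstances: number;    // Cloud Run max instances
  poolPerInstance: number; // SQL pool size per instance
  connectionCap: number;   // concurrent SQL connections allowed by the pooler
}

function connectionBudget(budget: PoolBudget) {
  // Demand if every instance fills its pool at the same time.
  const worstCaseDemand = budget.maxInstances * budget.poolPerInstance;
  // How far the worst case exceeds the cap (values above 1 mean queueing at the pooler).
  const oversubscription = worstCaseDemand / budget.connectionCap;
  // Largest instance count whose full pools still fit under the cap.
  const instancesWithoutQueueing = Math.floor(budget.connectionCap / budget.poolPerInstance);
  return { worstCaseDemand, oversubscription, instancesWithoutQueueing };
}

const poolBudgetResult = connectionBudget({ maxInstances: 50, poolPerInstance: 30, connectionCap: 800 });
console.log(poolBudgetResult);
// worstCaseDemand = 1500, oversubscription = 1.875, instancesWithoutQueueing = 26
```

Under these assumptions, pools only contend at the pooler once more than 26 instances are saturated simultaneously, which is one way to reason about whether the 800 cap is tight enough.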

5. Network & connectivity

Serverless VPC Connector vpc-conn-search-eu and vpc-conn-search-asia allow Cloud Run to reach private resources.
Cloud SQL: private IP only, accessed via Cloud SQL Auth Proxy side-car or cloud_sql_proxy static binding.
Memorystore: private services access (PSA) range.
Aiven OpenSearch: VPC peering to GCP project; firewall allow-list from VPC connector NAT IP only.
Egress to ai-orchestrator-service: internal HTTPS LB, mTLS via service mesh.
All cross-service calls carry a service-account-bound short-lived ID token (audience = receiver URL).

6. CI/CD pipeline

Branches → Cloud Build trigger → image build → tests → cosign sign → release pipeline.

Stage	Action	Gate
Pull request	unit + integration + contract + lint + typecheck + dep audit	green required to merge
Merge to `main`	full test matrix + coverage gate + perf smoke	green required to release
Build	hermetic build (BuildKit) + SBOM + cosign signature	image published to `gcr.io/melmastoon-prod/search-aggregation-service`
Migration plan	dry-run `expand`/`backfill`/`contract` against ephemeral Postgres	required for `expand`/`contract` PRs
Staging deploy	Argo CD promotes new revision; runs smoke synthetics	required pass before prod
Prod deploy	progressive rollout (see § 7)	manual approval + change ticket

SERVICE_VERSION env is the git tag (e.g. v1.42.0); embedded into traces, logs, and the /readyz payload.

7. Progressive rollout

Cloud Run traffic splitting:

Deploy new revision with --no-traffic.
Run pre-rollout integration test against the new revision via tagged URL.
Promote to 1 % of traffic for 10 min; SLO burn-rate alarms must remain quiet.
10 % for 10 min, then 50 % for 10 min, then 100 %.
Hold the previous revision for 24 h to enable instant rollback (gcloud run services update-traffic <service> --to-revisions=<old>=100).
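The promotion ladder above can be modelled as a small decision function. This is a sketch under stated assumptions, not the release pipeline's code: `rolloutStages`, `nextRolloutStep`, and the single boolean alarm input are hypothetical, and the actual traffic shift would still be done through Cloud Run traffic splitting.

```typescript
// Stage percentages and hold times come from the steps above; the types are illustrative.
interface RolloutStage {
  percent: number;
  holdMinutes: number;
}

// Must stay sorted by ascending percent for nextRolloutStep to work.
const rolloutStages: RolloutStage[] = [
  { percent: 1, holdMinutes: 10 },
  { percent: 10, holdMinutes: 10 },
  { percent: 50, holdMinutes: 10 },
  { percent: 100, holdMinutes: 0 },
];

type RolloutDecision =
  | { action: "promote"; toPercent: number }
  | { action: "rollback"; toPercent: 0 }
  | { action: "done" };

// Decide what to do once the hold period at currentPercent has elapsed.
function nextRolloutStep(currentPercent: number, burnRateAlarmFiring: boolean): RolloutDecision {
  if (burnRateAlarmFiring) return { action: "rollback", toPercent: 0 };
  for (const stage of rolloutStages) {
    if (stage.percent > currentPercent) return { action: "promote", toPercent: stage.percent };
  }
  return { action: "done" };
}

// Minimum wall-clock time spent holding before reaching 100 %.
const totalHoldMinutes = rolloutStages.reduce((sum, stage) => sum + stage.holdMinutes, 0);
console.log(nextRolloutStep(0, false), nextRolloutStep(50, true), totalHoldMinutes);
```

A firing burn-rate alarm always wins over promotion, so a rollout never advances while an SLO is burning, regardless of which stage it is in.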

For DB schema changes, see MIGRATION_PLAN.md (expand → backfill → contract over multiple releases).

For OpenSearch index template changes, see § 9 (index swap runbook).

8. Disaster recovery

Asset	RPO	RTO	Mechanism
Postgres projection	5 min	30 min	Cloud SQL HA + cross-region read replica + PITR (7 d)
OpenSearch	rebuild from canonical	4 h (cold rebuild) / 30 min (snapshot restore)	Aiven nightly snapshots to GCS dual-region; rebuild from BigQuery archive
Memorystore	discardable	1 min	New instance + cache warm-up via top-N golden queries
Pub/Sub	7-day retention; replay supported	0	Subscriptions are durable; DLQ + replay topic
BigQuery event archive	24 h	n/a	Authoritative replay source
Secrets	platform	platform	Secret Manager regional replicas

A region-loss exercise (game day) is required quarterly: simulate europe-west1 Cloud Run + Postgres outage, fail over reads to asia-south1, and rebuild the EU OpenSearch cluster from the archive.

9. OpenSearch index swap runbook (summary)

Detailed runbook: ops/runbooks/search-aggregation-service/index-rebuild.md.

POST /api/v1/search/index:rebuild { regions:["AF"], sinceTs:"2024-01-01T00:00:00Z" } ⇒ creates IndexBuild.
Service creates a new index melmastoon-search-v<n>-af with the latest template (the region suffix is lowercased because OpenSearch index names must be lowercase).
Service consumes BigQuery archive of canonical events from sinceTs, replaying through the same ProjectionAllowListPolicy and writing into the new index. Existing index keeps serving reads.
Service enters catching_up phase: live events are written to both the index behind the current alias and the new index until the two converge (index_lag_docs ≤ 100).
Service enters swapping phase: repoints the alias melmastoon-search-current to the new index atomically in a single _aliases call, then emits index.rebuilt.v1.
Old index retained 48 h for instant rollback (single alias swap back), then deleted by the ISM (Index State Management) policy.

Rollback: POST /api/v1/search/index:rollback { region:"AF" } if the new index shows higher error rate or worse top-N quality (ranking team's offline eval).
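The swapping phase relies on the OpenSearch `_aliases` endpoint applying all actions in one request atomically. The sketch below builds such a request body and a convergence check. The helper names, the example index names, and the reading of index_lag_docs as a document-count difference are assumptions; the service's real implementation may differ.

```typescript
// Body shape of POST /_aliases: a list of add/remove actions applied atomically.
type AliasAction =
  | { remove: { index: string; alias: string } }
  | { add: { index: string; alias: string } };

// Hypothetical helper: move `alias` from oldIndex to newIndex in one request.
function buildAliasSwap(alias: string, oldIndex: string, newIndex: string): { actions: AliasAction[] } {
  return {
    actions: [
      { remove: { index: oldIndex, alias } },
      { add: { index: newIndex, alias } },
    ],
  };
}

// Assumption: index_lag_docs is the absolute difference in document counts.
function hasConverged(currentDocs: number, newDocs: number, maxLagDocs: number = 100): boolean {
  return Math.abs(currentDocs - newDocs) <= maxLagDocs;
}

// Index names here are illustrative only.
const swapBody = buildAliasSwap("melmastoon-search-current", "melmastoon-search-v6-af", "melmastoon-search-v7-af");
console.log(JSON.stringify(swapBody), hasConverged(1000000, 999950));
```

Rollback within the 48 h retention window is the same call with the two index arguments exchanged, which is why keeping the old index around makes the rollback a single request.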

10. Configuration & feature flags

Configuration is loaded at boot from Secret Manager (SECRET_*) and from a ConfigMap-equivalent JSON file in the melmastoon-prod GCS bucket (config/search-aggregation-service/<env>.json), which the ConfigWatcher watches for hot reload.

Feature flags via the platform flags-service (read at boot + periodic refresh):

Flag	Default (prod)	Purpose
`search.semantic_rerank.enabled`	`false`	Phase 2
`search.sponsored_slots.enabled`	`false`	Phase 3
`search.degrade_on_opensearch_error`	`true`	Postgres-only fallback
`search.intent_cache_ttl_sec`	`2592000`	30 d
`search.region_pinning.strict`	`true`	enforce region filter
`search.index_rebuild.dry_run`	`false`	for chaos tests
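One way to keep these defaults honest in code is a typed flag map merged with whatever the flags-service returns. The flag names and prod defaults below are copied from the table; the `SearchFlags` type, `PROD_FLAG_DEFAULTS`, and `resolveFlags` are hypothetical and do not describe the flags-service client.

```typescript
// Flag names and prod defaults mirror the table; the surrounding types are illustrative.
interface SearchFlags {
  "search.semantic_rerank.enabled": boolean;
  "search.sponsored_slots.enabled": boolean;
  "search.degrade_on_opensearch_error": boolean;
  "search.intent_cache_ttl_sec": number;
  "search.region_pinning.strict": boolean;
  "search.index_rebuild.dry_run": boolean;
}

const PROD_FLAG_DEFAULTS: SearchFlags = {
  "search.semantic_rerank.enabled": false,
  "search.sponsored_slots.enabled": false,
  "search.degrade_on_opensearch_error": true,
  "search.intent_cache_ttl_sec": 30 * 24 * 60 * 60, // 30 d = 2 592 000 s
  "search.region_pinning.strict": true,
  "search.index_rebuild.dry_run": false,
};

// Merge a possibly partial payload over the defaults, so a missing flag
// falls back to its documented prod value instead of undefined.
function resolveFlags(overrides: Partial<SearchFlags>): SearchFlags {
  return { ...PROD_FLAG_DEFAULTS, ...overrides };
}

const resolvedFlags = resolveFlags({ "search.semantic_rerank.enabled": true });
console.log(resolvedFlags["search.semantic_rerank.enabled"], resolvedFlags["search.intent_cache_ttl_sec"]);
```

Running the same merge at boot and on each periodic refresh would keep both code paths consistent if the flags-service omits a flag or is briefly unreachable.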

11. Cost guardrails

Surface	Soft cap	Hard cap
Cloud Run CPU·s/month	80 % of monthly budget	110 % ⇒ scale max instances down
Pub/Sub bytes/month	80 %	110 % ⇒ down-sample `query.executed.v1`
OpenSearch storage	80 %	110 % ⇒ tighten ISM policy warm transition
AI orchestrator $$	per AI_INTEGRATION § 5	hard cap denies new calls

Monthly cost report posted to #ops-billing and reviewed by the platform owner.
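The soft and hard caps in the table reduce to a simple threshold check. The sketch below is illustrative only: `capState` is a hypothetical helper, and treating both thresholds as inclusive (exactly 80 % counts as the soft cap) is an assumption the document does not settle.

```typescript
type CapState = "ok" | "soft_cap" | "hard_cap";

// Compare spend against 80 % and 110 % of budget using integer-friendly
// arithmetic, which avoids rounding surprises from computing a ratio.
function capState(spend: number, monthlyBudget: number): CapState {
  if (monthlyBudget <= 0) throw new RangeError("monthlyBudget must be positive");
  if (spend * 100 >= monthlyBudget * 110) return "hard_cap";
  if (spend * 100 >= monthlyBudget * 80) return "soft_cap";
  return "ok";
}

// Example with made-up numbers: 8 200 spent against a 10 000 budget.
console.log(capState(8200, 10000)); // soft_cap
```

The hard cap sits above 100 % on purpose in the table, so the enforcement action only triggers after the budget has actually been exceeded by a margin.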

12. Service identity & SAs

SA	Used by	Roles
`search-aggregation@<project>.iam.gserviceaccount.com`	Cloud Run revision	`roles/cloudsql.client`, `roles/pubsub.subscriber`, `roles/pubsub.publisher` (own topics only), `roles/secretmanager.secretAccessor` (scoped), `roles/redis.editor` (Memorystore), `roles/aiplatform.user` (only when Phase 2+ direct calls authorized — currently denied)
`ci-search-aggregation@<project>.iam.gserviceaccount.com`	Cloud Build	image push, deploy, run integration tests
`index-builder@<project>.iam.gserviceaccount.com`	Index rebuild Cloud Run Job	BigQuery read-only on `events_raw.melmastoon_*`, OpenSearch admin via Aiven token

All SAs are managed in the platform IaC (Terraform gcp/iam/search-aggregation/*.tf).

13. Operational handover artifacts

Live dashboard URLs are set in ops/dashboards/search-aggregation-service.json (Grafana).
On-call rotation: search-aggregation PagerDuty schedule, primary + secondary, follow-the-sun.
Runbooks under ops/runbooks/search-aggregation-service/ (per OBSERVABILITY § 9).
Change calendar: weekly window Tue 09:00 UTC for non-urgent changes; emergency outside-window changes require platform owner approval.
