search-aggregation-service — SECURITY_MODEL

Companion: SERVICE_OVERVIEW · DATA_MODEL · API_CONTRACTS · DEPLOYMENT_TOPOLOGY · ../../docs/07-security-compliance-tenancy.md · ../../docs/architecture/ADR-0002-multi-tenancy-model.md

1. Threat model — top three

  1. Cross-tenant PII leak through the search index. The service is uniquely permitted to read across tenants, so any field that escapes the allow-list reaches anonymous users worldwide. Mitigation: four defense layers (type, projection, schema, audit).
  2. Tenant data injection by a compromised upstream service publishing a malicious payload. Mitigation: strict event payload validation, allow-list filtering, signed event provenance.
  3. DoS via expensive queries (from + size > 10 000, very wide bbox, many facets). Mitigation: request validation, hard limits, per-IP and per-user-bucket rate limits, query-cost circuit breaker.

The full threat model is maintained in SERVICE_RISK_REGISTER.md.

2. Tenancy posture (inverted)

This service is the single exception to the tenant-isolation rule documented in ADR-0002. It is therefore subject to extra controls:

| Posture | Standard service | search-aggregation-service |
| --- | --- | --- |
| X-Tenant-Id required on consumer reads | yes | no (anonymous meta-search) |
| Postgres RLS scoped to tenant_id = current_setting('app.tenant_id') | yes | no: sentinel __cross_tenant__ allowed; tables have no tenant-scoped policy |
| Per-tenant Cloud SQL connection pool | yes | no: single pool with the cross-tenant SA |
| Field-level allow-list for indexable data | optional | mandatory and CI-enforced |
| OpenSearch index per tenant | n/a (only this service uses OpenSearch) | no: single index across tenants, by region |
| Cascade purge on tenant.deleted.v1 | optional | mandatory |
| Tenant-isolation integration test | mandatory | inverted: asserts cross-tenant reads succeed AND that no forbidden field appears in any document |
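
The inverted isolation test has two halves: cross-tenant reads must succeed, and no forbidden field may appear in any document. The following is a minimal sketch of both assertions over an in-memory result set. The allow-list here is an illustrative subset taken from the classification table in section 5, and `findForbiddenFields` is a hypothetical helper name, not the service's actual test code.

```typescript
// Illustrative allow-list (assumption); the real list lives in the service's policy module.
const ALLOWED_FIELDS: ReadonlySet<string> = new Set([
  "propertyId", "tenantId", "region", "geo", "amenities",
  "name", "description", "roomsAvailable",
]);

type IndexedDocument = Record<string, unknown>;

interface Violation {
  propertyId: string;
  field: string;
}

// Returns one violation per top-level key that is not on the allow-list.
function findForbiddenFields(docs: readonly IndexedDocument[]): Violation[] {
  const violations: Violation[] = [];
  for (const doc of docs) {
    for (const field of Object.keys(doc)) {
      if (!ALLOWED_FIELDS.has(field)) {
        violations.push({ propertyId: String(doc["propertyId"] ?? "unknown"), field });
      }
    }
  }
  return violations;
}

// Hypothetical search result spanning two tenants; the second document leaks a field.
const crossTenantDocs: IndexedDocument[] = [
  { propertyId: "p-1", tenantId: "tenant-a", region: "eu", description: "Hotel A" },
  { propertyId: "p-2", tenantId: "tenant-b", region: "eu", ownerEmail: "x@example.com" },
];

// Half 1: the read really did cross tenants.
const tenantsSeen = new Set(crossTenantDocs.map((doc) => doc["tenantId"]));
// Half 2: any entry here should fail the test.
const leaked = findForbiddenFields(crossTenantDocs);
```

In a real suite both checks would run against documents returned by the service rather than literals, and a non-empty `leaked` array would fail the build.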

3. Identity, authentication, authorization

3.1 Identity context

| Surface | Identity provider | Credential | Scopes |
| --- | --- | --- | --- |
| Public consumer routes (/api/v1/search/queries, /hotels/*, /suggest) | none | none | none (anonymous) |
| Public consumer click route (/api/v1/search/clicks) | optional opaque X-User-Bucket | none | none (anonymous) |
| Operator boost-rule routes | iam-service (JWT, RS256) | Authorization: Bearer <jwt>, X-Tenant-Id: <uuid> | search:boost-rule:read |
| Operator index routes | platform admin JWT | as above + X-Admin: true claim | search:index:rebuild |
| Internal endpoints (/internal/*) | mTLS via service-mesh + IAM SA on Cloud Run | x509 | per-route IAM bindings |
| Health (/healthz, /readyz) | none | none | none |

3.2 Authorization rules

  • Public routes: no authorization check; only rate limiting and request validation.
  • Boost-rule routes: an operator may only read/write boost rules whose tenantId matches the JWT tenantId claim. Cross-tenant boost-rule writes are rejected with MELMASTOON.SEARCH.BOOST_RULE_SCOPE_VIOLATION.
  • Index admin routes: gated by an OPA policy bundle search-admin.rego requiring role: platform.search_admin AND mfa: true claims. Source-IP allow-list (operator VPN ranges) enforced at Cloud Armor.
  • Internal routes: bound to caller SAs:
    • analytics-service@… → GET /internal/v1/projection/changes
    • bff-consumer-service@… → POST /internal/v1/cache:invalidate
    • other SAs → 403.
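
The boost-rule rule above reduces to a single comparison between the tenantId claim and the rule's tenantId. A minimal sketch follows, assuming the JWT has already been verified; the `OperatorClaims` and `BoostRule` shapes and the `checkBoostRuleScope` name are hypothetical, and only the error code comes from this document.

```typescript
const BOOST_RULE_SCOPE_VIOLATION = "MELMASTOON.SEARCH.BOOST_RULE_SCOPE_VIOLATION";

// Hypothetical shapes: only the fields needed for the scope decision.
interface OperatorClaims {
  sub: string;
  tenantId: string;
}

interface BoostRule {
  id: string;
  tenantId: string;
}

type ScopeDecision =
  | { allowed: true }
  | { allowed: false; code: string };

// An operator may only touch boost rules owned by the tenant in their JWT.
function checkBoostRuleScope(claims: OperatorClaims, rule: BoostRule): ScopeDecision {
  if (claims.tenantId !== rule.tenantId) {
    return { allowed: false, code: BOOST_RULE_SCOPE_VIOLATION };
  }
  return { allowed: true };
}

const operatorA: OperatorClaims = { sub: "op-1", tenantId: "tenant-a" };
const ownRule = checkBoostRuleScope(operatorA, { id: "rule-1", tenantId: "tenant-a" });
const foreignRule = checkBoostRuleScope(operatorA, { id: "rule-2", tenantId: "tenant-b" });
```

Returning a decision object rather than throwing keeps the check easy to unit-test; a route handler would map the rejected case to the HTTP error response.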

3.3 Token verification

  • JWKS pulled from iam-service and cached 10 minutes.
  • iss, aud, exp, nbf, tenantId, sub, mfa claims validated.
  • Clock skew tolerance: 60 s.
  • On verification failure: 401 + audit log entry + per-IP rate-limit decrement.
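
The claim checks can be sketched as a pure function over an already-decoded payload. This is an illustration under stated assumptions: signature verification against the cached JWKS is out of scope here, `aud` is treated as a single string, the `mfa` check only confirms that the claim is a boolean, and the `validateClaims` name and expected issuer and audience values are hypothetical.

```typescript
const CLOCK_SKEW_SECONDS = 60;

interface OperatorJwtClaims {
  iss: string;
  aud: string;
  exp: number; // seconds since epoch
  nbf: number; // seconds since epoch
  tenantId: string;
  sub: string;
  mfa: boolean;
}

type ClaimCheck = { ok: true } | { ok: false; reason: string };

function fail(reason: string): ClaimCheck {
  return { ok: false, reason };
}

// Validates the claims listed in section 3.3 with a 60 s clock-skew tolerance.
function validateClaims(
  claims: Partial<OperatorJwtClaims>,
  expected: { iss: string; aud: string },
  nowSeconds: number,
): ClaimCheck {
  if (claims.iss !== expected.iss) return fail("iss");
  if (claims.aud !== expected.aud) return fail("aud");
  // Expired once the current time reaches exp plus the skew allowance.
  if (typeof claims.exp !== "number" || nowSeconds >= claims.exp + CLOCK_SKEW_SECONDS) return fail("exp");
  // Not yet valid if the current time is before nbf minus the skew allowance.
  if (typeof claims.nbf !== "number" || nowSeconds < claims.nbf - CLOCK_SKEW_SECONDS) return fail("nbf");
  if (!claims.tenantId) return fail("tenantId");
  if (!claims.sub) return fail("sub");
  if (typeof claims.mfa !== "boolean") return fail("mfa");
  return { ok: true };
}

// Hypothetical issuer and audience values, for illustration only.
const expectedIssuer = { iss: "https://iam.example.test", aud: "search-aggregation-service" };
const sampleClaims: OperatorJwtClaims = {
  ...expectedIssuer, exp: 1000, nbf: 900, tenantId: "tenant-a", sub: "op-1", mfa: true,
};
const withinSkew = validateClaims(sampleClaims, expectedIssuer, 1059);
const pastSkew = validateClaims(sampleClaims, expectedIssuer, 1060);
```

A caller would turn any `ok: false` result into the 401, audit entry, and rate-limit decrement described above.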

4. Cross-tenant data exposure controls (the centerpiece)

| Layer | Control |
| --- | --- |
| L1: Type system | The HotelIndexEntry TypeScript type has no field for forbidden data, and an extends never static guard rejects any extra key. CI test type-allowlist.spec.ts walks the type via ts-morph and rejects PRs that add a non-allow-listed property. |
| L2: Projection policy | ProjectionAllowListPolicy filters every inbound event payload to the explicit allow-list set in domain/policies/allow-list.ts. Anything outside it is stripped, the counter projection_field_stripped_total{field} is incremented, and an alert fires when the counter is > 0 in steady state. |
| L3: Schema | Postgres search.hotel_index_entries columns and the OpenSearch index template (dynamic: "strict") declare only allow-listed fields. A schema-diff CI gate compares the live schema to the committed allow-list. |
| L4: Audit job | Nightly ProjectionExposureAuditor (DATA_MODEL § 7) scans all rows and all OpenSearch fields and pages security on any anomaly. |

Adding a new searchable field requires a PR that touches all four layers and includes a security review checkbox referencing the field's classification.
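
The L2 projection layer is the easiest to illustrate: copy allow-listed keys, drop everything else, and count what was dropped. This is a sketch, not the actual ProjectionAllowListPolicy; the allow-list is a small illustrative subset, and the in-memory `strippedCounter` map stands in for the projection_field_stripped_total{field} metric.

```typescript
// Illustrative subset of the allow-list (assumption, not the committed list).
const ALLOW_LIST = ["propertyId", "tenantId", "region", "description", "roomsAvailable"] as const;
type AllowedField = (typeof ALLOW_LIST)[number];

const ALLOWED: ReadonlySet<string> = new Set<string>(ALLOW_LIST);

// Stand-in for the projection_field_stripped_total{field} counter.
const strippedCounter = new Map<string, number>();

// Copies only allow-listed keys; every other key is dropped and counted.
function projectToAllowList(
  payload: Record<string, unknown>,
): Partial<Record<AllowedField, unknown>> {
  const projected: Partial<Record<AllowedField, unknown>> = {};
  for (const [key, value] of Object.entries(payload)) {
    if (ALLOWED.has(key)) {
      projected[key as AllowedField] = value;
    } else {
      strippedCounter.set(key, (strippedCounter.get(key) ?? 0) + 1);
    }
  }
  return projected;
}

// Hypothetical upstream event carrying a field that must never be indexed.
const projectedEntry = projectToAllowList({
  propertyId: "p-1",
  tenantId: "tenant-a",
  region: "eu",
  guestEmail: "guest@example.com",
});
```

Building the output from an empty object, rather than deleting keys from the input, means an unknown field can never pass through by omission.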

5. Data classification & handling

| Field group | Classification | At rest | In transit | In logs |
| --- | --- | --- | --- | --- |
| propertyId, tenantId, region, geo, amenities, name, description, hero*, priceFrom*, roomsAvailable | public | unencrypted column-level (CMEK at storage layer) | TLS 1.2+ | allowed |
| popularity*, boostMultiplier, freshnessBoost, qualityScore | internal | as above | TLS | allowed |
| search_queries.text, search_queries.user_bucket | restricted (potential PII in free text) | column CMEK + nullified at 30 d | TLS | hash only, never raw |
| click_events.user_bucket | restricted | nullified at 30 d | TLS | hash only |
| Any AI provenance | internal | jsonb | TLS | allowed |
| Anything else | forbidden (must not exist) | n/a | n/a | n/a |

Logs are routed to Cloud Logging with severity, traceId, spanId, eventId, tenantId (when present), and request.shape (only the request structure, never raw text or user identifiers). PII redaction middleware runs before the log appender.
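
To make "only the request structure" concrete, the sketch below replaces every value in a request body with its type name and hashes restricted identifiers before they reach a log line. Both helper names are hypothetical, SHA-256 is an assumed hash choice (the document only says "hash only"), and object keys are assumed to come from a validated schema so they carry no user data.

```typescript
import { createHash } from "node:crypto";

type Shape = string | Shape[] | { [key: string]: Shape };

// Replaces every leaf value with its type name, keeping only the structure.
function requestShape(value: unknown): Shape {
  if (value === null) return "null";
  if (Array.isArray(value)) return value.map((item) => requestShape(item));
  if (typeof value === "object") {
    const shape: { [key: string]: Shape } = {};
    for (const [key, inner] of Object.entries(value as Record<string, unknown>)) {
      shape[key] = requestShape(inner);
    }
    return shape;
  }
  return typeof value;
}

// Hashes a restricted value so the raw string never appears in a log line.
function hashForLog(raw: string): string {
  return createHash("sha256").update(raw, "utf8").digest("hex");
}

// Hypothetical request body and user bucket.
const loggedShape = requestShape({ text: "kabul hotels", page: { size: 20 } });
const loggedBucket = hashForLog("bucket-42");
```

Here `loggedShape` is `{ text: "string", page: { size: "number" } }`, which is enough to debug malformed requests without recording what the user typed.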

6. Secret handling

Secrets:

  • OPENSEARCH_BASIC_AUTH — Aiven managed credentials.
  • MEMORYSTORE_AUTH_STRING — Memorystore AUTH.
  • IAM_JWKS_URL and JWKS cache (no secret per se, hardened TLS only).
  • AI_ORCHESTRATOR_BASE_URL (no secret; SA-bound).

All secrets live in Google Secret Manager, accessed via Workload Identity Federation. The runtime SA search-aggregation@<project>.iam.gserviceaccount.com has roles/secretmanager.secretAccessor scoped to those secret names only. Rotation: 90 days for OpenSearch and Memorystore; automated via Cloud Scheduler + Secret Manager rotation hook.

No secret is ever read from environment variables in production. Local dev pulls from a synthetic secret bundle (see LOCAL_DEV_SETUP.md).

7. Network posture

  • Service runs on Cloud Run (region europe-west1 primary, asia-south1 secondary). Ingress: internal + Cloud Load Balancing only.
  • Public traffic enters via Apigee → Cloud Armor → external HTTPS LB → bff-consumer-service → this service. Direct internet ingress is denied at the LB.
  • Egress is restricted by Serverless VPC Connector to:
    • Cloud SQL Postgres private IP,
    • OpenSearch (Aiven peered VPC),
    • Memorystore Redis private IP,
    • Pub/Sub via Private Google Access,
    • Secret Manager via PGA,
    • ai-orchestrator-service via internal LB.
  • All other egress denied by VPC firewall rules.
  • Cloud Armor rules: WAF preconfigured rules (OWASP CRS), per-IP rate limit (60 rps burst / 10 rps sustained for /api/v1/search/queries), geographic block list (synced from sanctions/OFAC list weekly).
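
The "60 rps burst / 10 rps sustained" limit is enforced by Cloud Armor, not by service code. Purely to illustrate what burst versus sustained means, here is a token-bucket sketch with a capacity of 60 and a refill rate of 10 tokens per second. The token-bucket model and the `TokenBucket` class are assumptions for illustration; this document does not state which algorithm Cloud Armor applies.

```typescript
class TokenBucket {
  private tokens: number;
  private lastRefillMs: number;
  private readonly capacity: number;
  private readonly refillPerSecond: number;

  constructor(capacity: number, refillPerSecond: number, nowMs: number) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.tokens = capacity; // start full so an initial burst is allowed
    this.lastRefillMs = nowMs;
  }

  // Refills based on elapsed time, then spends one token if available.
  tryConsume(nowMs: number): boolean {
    const elapsedSeconds = Math.max(0, nowMs - this.lastRefillMs) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSeconds * this.refillPerSecond);
    this.lastRefillMs = Math.max(this.lastRefillMs, nowMs);
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}

// Counts how many of `attempts` requests arriving at the same instant are admitted.
function countAdmitted(bucket: TokenBucket, attempts: number, nowMs: number): number {
  let admitted = 0;
  for (let i = 0; i < attempts; i += 1) {
    if (bucket.tryConsume(nowMs)) admitted += 1;
  }
  return admitted;
}

const perIpBucket = new TokenBucket(60, 10, 0);
const admittedInBurst = countAdmitted(perIpBucket, 100, 0);
const admittedOneSecondLater = countAdmitted(perIpBucket, 100, 1000);
```

With these numbers a client can send 60 requests at once, after which it is held to roughly 10 per second until the bucket refills.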

8. Input validation & query safety

  • Request bodies validated with zod schemas per endpoint, max body size 32 KiB.
  • text field: max 256 characters, NFKC-normalized, control characters stripped, regex/wildcard syntax stripped before going to OpenSearch.
  • bbox: rejected if area > 250 000 km².
  • radiusKm: clamped to [0.1, 200].
  • page.size: hard max 50; page.cursor validated as opaque base64 of an HMAC-signed payload (rejects forged cursors).
  • Deep-paging guard: the OpenSearch from + size cap is enforced at the API, which returns MELMASTOON.SEARCH.PAGE_OUT_OF_RANGE once from + size exceeds 10 000.
  • Facet selection limited to 12 simultaneous facets per query.
  • Idempotency-Key validated as ULID/UUID; replay returns the previous response and tracks idempotency_replay_total.
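
Several of the checks above can be sketched without any validation library. The following is a hedged illustration, not the service's zod schemas. The set of stripped query-syntax characters, the decision to reject rather than truncate over-long text, the `body.mac` cursor wire format, and all function names are assumptions.

```typescript
import { Buffer } from "node:buffer";
import { createHmac, timingSafeEqual } from "node:crypto";

const MAX_TEXT_LENGTH = 256;
const MAX_RESULT_WINDOW = 10000;

// NFKC-normalizes, strips control characters and (assumed) query-syntax characters.
function sanitizeText(raw: string): string {
  const cleaned = raw
    .normalize("NFKC")
    .replace(/[\u0000-\u001F\u007F-\u009F]/g, "")
    .replace(/[*?~^"\\\/()\[\]{}]/g, "")
    .trim();
  if (Array.from(cleaned).length > MAX_TEXT_LENGTH) {
    throw new RangeError("text exceeds 256 characters");
  }
  return cleaned;
}

// Clamps radiusKm into [0.1, 200].
function clampRadiusKm(radiusKm: number): number {
  if (!Number.isFinite(radiusKm)) throw new RangeError("radiusKm must be finite");
  return Math.min(200, Math.max(0.1, radiusKm));
}

// Deep-paging guard: rejects windows that reach past 10 000 results.
function assertWithinResultWindow(from: number, size: number): void {
  if (from + size > MAX_RESULT_WINDOW) {
    throw new RangeError("MELMASTOON.SEARCH.PAGE_OUT_OF_RANGE");
  }
}

// Cursor = base64url(JSON payload) + "." + base64url(HMAC-SHA256 of that body).
function signCursor(payload: Record<string, unknown>, key: string): string {
  const body = Buffer.from(JSON.stringify(payload), "utf8").toString("base64url");
  const mac = createHmac("sha256", key).update(body).digest("base64url");
  return `${body}.${mac}`;
}

// Returns the payload for a valid cursor, or null if it was forged or altered.
function verifyCursor(cursor: string, key: string): Record<string, unknown> | null {
  const parts = cursor.split(".");
  const body = parts[0];
  const mac = parts[1];
  if (parts.length !== 2 || !body || !mac) return null;
  const expected = createHmac("sha256", key).update(body).digest();
  const given = Buffer.from(mac, "base64url");
  if (given.length !== expected.length || !timingSafeEqual(given, expected)) return null;
  return JSON.parse(Buffer.from(body, "base64url").toString("utf8"));
}

// Hypothetical usage with a throwaway key (never hard-code real keys).
const demoKey = "demo-key-not-a-real-secret";
const demoCursor = signCursor({ offset: 40 }, demoKey);
const demoPayload = verifyCursor(demoCursor, demoKey);
```

The MAC is compared with `timingSafeEqual` after a length check so that verification time does not reveal how many bytes of a forged signature were correct.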

9. AI surface controls

Per AI_INTEGRATION.md:

  • All LLM/embedding calls go through ai-orchestrator-service and are subject to platform-wide redaction, rate limits, and cost caps.
  • AI is never used to construct OpenSearch DSL or SQL.
  • Cache keys for AI responses use the redacted prompt; raw prompts never leak into Redis.
  • AI-generated user-visible text requires hitlReviewed=true to be served.

10. Auditability

| Event | Sink | Retention |
| --- | --- | --- |
| Boost rule create/activate/cancel | melmastoon.search.boost_rule.v1 topic + audit-service mirror | 7 years |
| Index rebuild start/complete/fail | melmastoon.search.index.v1 topic + audit | 2 years |
| Allow-list strip event | counter + projection.failed.v1 if persistent | 2 years |
| Auth failure (operator routes) | structured log + audit mirror | 1 year |
| Sensitive admin action (cache invalidate, index swap) | structured log + Slack #sec-ops notification | 1 year |
| Tenant cascade purge | melmastoon.tenant.purge_completed.v1 ack | 7 years |

Tamper-evidence: audit topic is append-only with subscriber audit-service writing to BigQuery + GCS Object Lock (90-day immutability).

11. Compliance hooks

  • Right-to-erasure (per tenant or per upstream subject): when a property is deleted upstream, the projection row is removed and the OpenSearch document is deleted within the SLO window. Search query logs and clicks for the affected propertyId are anonymized in the next run of the nightly nightly-anonymize-search-queries job.
  • Data residency: region-pinned indexes keep AF/TJ/IR property data in their regional indexes; cross-region reads from the consumer surface are an explicit user choice.
  • Sanctions screening: not enforced here (no transactions). Sanctioned tenants are handled upstream: tenant-service issues tenant.deleted.v1 (or .suspended.v1 in Phase 2+), which suppresses their index entries.

12. Service identity & supply chain

  • Container built from a hermetic Cloud Build pipeline; image signed with Sigstore cosign, attestation stored in Artifact Registry.
  • Cloud Run admission policy requires cosign.verified=true on the deployed image.
  • Base image: distroless gcr.io/distroless/nodejs20-debian12.
  • SBOM (CycloneDX) generated per build; CVE scan via Container Analysis. Critical CVEs block release.

13. Penetration test coverage (annual)

Scope items required for the annual pentest:

  1. Allow-list bypass attempts via crafted upstream events (replay malformed payloads in staging).
  2. OpenSearch DSL injection through text, filter.amenities, sort.key.
  3. Boost-rule scope violation (cross-tenant write attempts with valid JWT for tenant A targeting tenant B).
  4. Cursor forgery (re-use, modify, sign with wrong key).
  5. Rate-limit bypass (header smuggling, IPv6 expansion, Apigee header spoofing).
  6. Cache poisoning via Vary mismatch on X-Currency / X-Region / Accept-Language.
  7. Cross-tenant query log inference (can a logged query expose another tenant's identity?).