search-aggregation-service — FAILURE_MODES

Companion: OBSERVABILITY · DEPLOYMENT_TOPOLOGY · APPLICATION_LOGIC · SECURITY_MODEL · SERVICE_RISK_REGISTER

This document enumerates the failure modes of search-aggregation-service together with their detection signals, the automatic mitigations built into the service, and the operator runbook handle for each. For every mode, the user-visible blast radius is the consumer meta-search experience (bff-consumer-service); ranked hotel results are designed to degrade gracefully before they fail.

Severity legend

  • P1 — paging incident; user-visible outage on the meta-search surface.
  • P2 — paging incident; degraded experience or operator surface impact.
  • P3 — ticket; no user-visible impact yet, but trending toward one.

1. OpenSearch unavailable (cluster red, network partition, Aiven outage)

User impact: Search degrades to the Postgres-only path: no BM25 (no full-text relevance), no facets beyond what Postgres can compute, and no k-NN re-rank. Geo, filter, and sort by price/distance still work. Latency rises (p95 up to ~700 ms) but the service still returns HTTP 200.

Detection:

  • opensearch_cluster_status{color="red"} for 1 min ⇒ P1.
  • opensearch_query_duration_ms p95 > 1 500 ms for 5 min ⇒ P2.
  • Adapter circuit breaker opens on > 50 % errors over 30 s.

Auto-mitigation:

  • Circuit breaker opens; subsequent calls bypass OpenSearch and call PostgresFallbackSearchPort.
  • Response sets degradationLevel: "opensearch_unavailable"; query.executed.v1 carries the same.
  • Cache TTL doubled (60 s → 120 s) on the search route to reduce read load.

Operator handle: ops/runbooks/search-aggregation-service/opensearch-degraded.md.

Recovery: Aiven failover or fresh cluster + index rebuild via § 9 of DEPLOYMENT_TOPOLOGY.md. Circuit breaker auto-closes once health probe passes 3 consecutive checks.
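
The breaker behaviour described in this section (open on > 50 % errors over 30 s, bypass to the Postgres fallback, close after 3 consecutive healthy probes) can be sketched as follows. This is a minimal illustration, not the service's implementation: the class and function names, the injected clock, the min_calls guard against opening on a single failed call, and the None value for a healthy degradationLevel are all assumptions.

```python
from collections import deque
from typing import Callable, Deque, Tuple


class CircuitBreaker:
    """Opens when the error ratio inside a sliding window exceeds a threshold;
    closes again after N consecutive healthy probes."""

    def __init__(self, clock: Callable[[], float], window_s: float = 30.0,
                 error_ratio: float = 0.5, min_calls: int = 10,
                 probes_to_close: int = 3) -> None:
        self._clock = clock
        self._window_s = window_s
        self._error_ratio = error_ratio
        self._min_calls = min_calls          # assumption: avoid opening on tiny samples
        self._probes_to_close = probes_to_close
        self._calls: Deque[Tuple[float, bool]] = deque()  # (timestamp, ok)
        self._healthy_probes = 0
        self.is_open = False

    def record_call(self, ok: bool) -> None:
        now = self._clock()
        self._calls.append((now, ok))
        # Drop samples that fell out of the window.
        while self._calls and self._calls[0][0] < now - self._window_s:
            self._calls.popleft()
        errors = sum(1 for _, call_ok in self._calls if not call_ok)
        enough = len(self._calls) >= self._min_calls
        if enough and errors / len(self._calls) > self._error_ratio:
            self.is_open = True
            self._healthy_probes = 0

    def record_probe(self, healthy: bool) -> None:
        if not self.is_open:
            return
        # Any failed probe resets the streak.
        self._healthy_probes = self._healthy_probes + 1 if healthy else 0
        if self._healthy_probes >= self._probes_to_close:
            self.is_open = False
            self._calls.clear()


def search(breaker, primary, fallback, query):
    """Try the primary (OpenSearch) path unless the breaker is open."""
    if not breaker.is_open:
        try:
            hits = primary(query)
            breaker.record_call(ok=True)
            return {"hits": hits, "degradationLevel": None}
        except Exception:
            breaker.record_call(ok=False)
    return {"hits": fallback(query), "degradationLevel": "opensearch_unavailable"}
```

Note that a request that fails while the breaker is still closed is also answered from the fallback, so callers see a degraded response rather than an error.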


2. Postgres unavailable (Cloud SQL primary down, connection pool exhausted)

User impact: Search route returns 503 (no Postgres = no fallback); cache hits still serve. Detail route continues to serve from cache for 300 s. Projection writes stop ⇒ freshness degrades.

Detection:

  • db_connection_errors_total rate > 0 for 1 min ⇒ P1.
  • pg_stat_activity (scraped) saturation > 90 % ⇒ P2.
  • /readyz flips to 503 ⇒ Cloud Run stops sending new traffic to that revision (handled by Cloud LB).

Auto-mitigation:

  • Once /readyz reports not-ready, the HTTP server stops accepting new requests; the LB routes to surviving pods or to the asia-south1 revision (cross-region read replica only; writes are blocked until the primary recovers).
  • Subscribers nack messages; Pub/Sub retries with backoff; messages stay in topics.
  • Outbox publisher backs off but keeps the lock acquired so it doesn't churn.

Operator handle: ops/runbooks/postgres-failover.md (platform-shared).

Recovery: Cloud SQL HA failover (≤ 60 s typical); read-replica promotion if primary is lost.


3. Memorystore Redis unavailable

User impact: Cache misses go through to Postgres + OpenSearch. Latency p95 rises (~+80 ms on /queries). No correctness loss.

Detection:

  • cache_op_duration_ms p99 > 100 ms ⇒ P3.
  • cache_errors_total rate > 0 ⇒ P3.

Auto-mitigation: RedisCachePort returns null on error; calling code treats as cache miss. Reads continue normally; writes are skipped.
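
The fail-open behaviour above (errors read as cache misses, writes skipped) might look like the sketch below. FailOpenCache, cached_search, and the backend interface (get / set with a TTL) are hypothetical names used for illustration; only the "return null on error, treat as miss" contract comes from this document.

```python
from typing import Any, Callable, Optional


class FailOpenCache:
    """Wraps a cache backend so that backend errors never reach the caller."""

    def __init__(self, backend) -> None:
        self._backend = backend  # anything exposing get(key) and set(key, value, ttl_s)
        self.errors = 0          # stand-in for the cache_errors_total counter

    def get(self, key: str) -> Optional[Any]:
        try:
            return self._backend.get(key)
        except Exception:
            self.errors += 1
            return None          # caller sees an ordinary cache miss

    def set(self, key: str, value: Any, ttl_s: int) -> None:
        try:
            self._backend.set(key, value, ttl_s)
        except Exception:
            self.errors += 1     # write is skipped, request still succeeds


def cached_search(cache: FailOpenCache, key: str,
                  load: Callable[[], Any], ttl_s: int = 60) -> Any:
    hit = cache.get(key)
    if hit is not None:
        return hit
    value = load()               # falls through to Postgres + OpenSearch
    cache.set(key, value, ttl_s)
    return value
```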

Operator handle: ops/runbooks/memorystore-recovery.md.


4. Pub/Sub publish failure (outbox stuck)

User impact: Stale projection — index entries remain consistent but downstream consumers (analytics-service, cache invalidators) miss updates. After ~5 min, search results may serve slightly stale data via the cache (TTL fallback covers within 60 s once outbox catches up).

Detection:

  • outbox_pending > 5 000 for 5 min ⇒ P2 page.
  • outbox_publish_lag_ms p95 > 10 000 for 5 min ⇒ P2.

Auto-mitigation: Publisher uses exponential backoff + jitter; advisory lock prevents thundering herd; messages persist in search.outbox with last_error recorded.
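
As an illustration of the retry timing, here is one common way to compute exponential backoff with jitter. The document does not say which jitter variant the publisher uses; this sketch assumes "full jitter", and the base delay, cap, and function name are hypothetical.

```python
import random
from typing import Callable


def backoff_delay_s(attempt: int, base_s: float = 0.5, cap_s: float = 60.0,
                    rng: Callable[[], float] = random.random) -> float:
    """Full-jitter backoff: a uniform delay in [0, min(cap, base * 2**attempt)].

    `attempt` starts at 0 for the first retry. `rng` is injectable so the
    function can be tested deterministically.
    """
    ceiling = min(cap_s, base_s * (2 ** attempt))
    return rng() * ceiling
```

Randomizing the whole interval spreads retries from multiple publisher instances, which complements the advisory lock in avoiding synchronized retry bursts.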

Operator handle: ops/runbooks/search-aggregation-service/outbox-stuck.md — typically: bounce the publisher pod, check Pub/Sub quotas, verify the subscription's DLQ.


5. Pub/Sub consume failure (DLQ spike)

User impact: None immediately. Failed events sit in the DLQ, and the affected slice goes stale for a subset of properties. If a structural bug is sending every event to the DLQ, freshness collapses within minutes.

Detection:

  • pubsub_dlq_total rate > 0.1 % over 15 min per subscription ⇒ P2.
  • inbox_unprocessed > 10 000 ⇒ P2.

Auto-mitigation: None automatic — DLQ requires intentional drain.

Operator handle: ops/runbooks/search-aggregation-service/dlq-drain.md. Triage: inspect 5 sample messages; classify as transient, poison, or schema-incompat. Transient ⇒ replay to topic. Poison ⇒ leave in DLQ + open ticket. Schema-incompat ⇒ deploy schema-tolerant consumer + replay.


6. Out-of-order events from upstream

User impact: Without protection, a stale slice would overwrite a newer one ⇒ wrong price/availability. With protection: the stale event is dropped, no corruption.

Detection:

  • projection_skipped_stale_total{topic, slice} rate.
  • Per-topic SLI: rate < 1 % is normal; > 5 % indicates a producer regression.

Auto-mitigation: Vector-clock guard on every consumer; dropped_stale is recorded in inbox and the event is acked (NOT replayed).
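
A minimal sketch of such a guard follows. The shape of vectorClock (a map from producer node to counter), the event field names other than vectorClock and slice, the in-memory store and inbox, and the rule that concurrent clocks are applied and merged are all assumptions made for illustration.

```python
from typing import Dict

Clock = Dict[str, int]


def is_stale(incoming: Clock, stored: Clock) -> bool:
    """True when the stored clock already covers every component of incoming
    (an older event or an exact duplicate)."""
    return all(stored.get(node, 0) >= counter for node, counter in incoming.items())


def apply_event(store: dict, inbox: dict, event: dict) -> str:
    """Apply an event to its (propertyId, slice) projection unless it is stale.
    Stale events are recorded as dropped_stale and still acked, never replayed."""
    key = (event["propertyId"], event["slice"])
    current = store.get(key)
    if current is not None and is_stale(event["vectorClock"], current["vectorClock"]):
        inbox[event["eventId"]] = "dropped_stale"
        return "ack"
    # Merge clocks component-wise so later comparisons see the newest counters.
    merged = dict(current["vectorClock"]) if current else {}
    for node, counter in event["vectorClock"].items():
        merged[node] = max(merged.get(node, 0), counter)
    store[key] = {"vectorClock": merged, "payload": event["payload"]}
    inbox[event["eventId"]] = "processed"
    return "ack"
```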

Operator handle: If rate is unusually high, contact the upstream service owner; this is a producer-side problem (their vectorClock isn't monotonic).


7. AI orchestrator unavailable (or budget cap exceeded)

User impact: Multilingual intent parse falls back to keyword-only search; semantic re-rank skipped (Phase 2+). User experience slightly worse for non-English queries but service is fully functional.

Detection:

  • ai_search_intent_fallbacks_total / ai_search_intent_calls_total > 20 % over 15 min ⇒ P3.
  • ai.orchestrator span error rate > 50 % over 5 min ⇒ P3.

Auto-mitigation: 800 ms timeout, then fallback path. query.executed.v1 carries intentSource: "fallback". Cache continues to serve hits from the prior 30-day window.
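
The timeout-then-fallback path could be sketched like this. The function names, the thread-pool approach, the keyword-splitting fallback, and the "ai" value for intentSource are assumptions; only the 800 ms budget and intentSource: "fallback" come from this document.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

_executor = ThreadPoolExecutor(max_workers=4)


def parse_intent(text: str, call_orchestrator: Callable[[str], dict],
                 timeout_s: float = 0.8) -> dict:
    """Ask the AI orchestrator for an intent, falling back to keyword-only
    parsing if it errors or does not answer within the timeout."""
    future = _executor.submit(call_orchestrator, text)
    try:
        intent = future.result(timeout=timeout_s)
        return {"intent": intent, "intentSource": "ai"}
    except Exception:
        # Covers both orchestrator errors and the timeout. cancel() only helps
        # if the call has not started; a running call finishes in the background.
        future.cancel()
        return {"intent": {"keywords": text.split()}, "intentSource": "fallback"}
```

Because Python cannot interrupt a running thread, a production client would also want a network-level timeout on the orchestrator call itself so abandoned calls do not pile up.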

Operator handle: ops/runbooks/search-aggregation-service/ai-orchestrator-degraded.md.


8. Allow-list breach (forbidden field detected in projection)

User impact: Potentially severe, with a risk of cross-tenant data leak to anonymous public traffic. A breach does not necessarily mean a user saw the field, because the OpenSearch index template sets dynamic: "strict" and would reject the document. However, a Postgres breach combined with a code path that returns the row to the consumer is a P1 security incident.

Detection:

  • projection_field_stripped_total ≥ 1 ⇒ P1 security page.
  • Nightly ProjectionExposureAuditor finds an unexpected column or doc field ⇒ P1.

Auto-mitigation:

  • L2 projection-policy strips the field and records the counter; the persisted row is safe.
  • L3 OpenSearch rejects any document with an unknown field; the inbox marks the event dlq and a projection.failed.v1 event is emitted.
  • Read APIs return only allow-listed fields by construction (DTO is the type, not the row).
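
The strip-and-count step and the "DTO is the type, not the row" idea can be illustrated as follows. The field names, ALLOWED_FIELDS, PropertyHitDto, and the dict standing in for projection_field_stripped_total are hypothetical; the real allow-list lives in the service's projection policy.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict

# Hypothetical allow-list for illustration only.
ALLOWED_FIELDS = frozenset({"propertyId", "name", "city", "priceFrom"})


def strip_forbidden(doc: Dict[str, Any], stripped_total: Dict[str, int]) -> Dict[str, Any]:
    """Return a copy of doc containing only allow-listed fields and count
    every stripped field (any non-zero count would page security)."""
    safe: Dict[str, Any] = {}
    for field_name, value in doc.items():
        if field_name in ALLOWED_FIELDS:
            safe[field_name] = value
        else:
            stripped_total[field_name] = stripped_total.get(field_name, 0) + 1
    return safe


@dataclass(frozen=True)
class PropertyHitDto:
    """Read-side type: it can only ever carry the fields declared here."""
    propertyId: str
    name: str
    city: str
    priceFrom: float

    @classmethod
    def from_row(cls, row: Dict[str, Any]) -> "PropertyHitDto":
        # Copy declared fields by name; extra columns in the row are ignored.
        return cls(**{f.name: row[f.name] for f in fields(cls)})
```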

Operator handle: ops/runbooks/search-aggregation-service/allow-list-breach-ir.md. First action: page security on-call, freeze releases, rotate boost-rule admin tokens (precaution), capture the offending event(s) from inbox + DLQ for forensic analysis.


9. Cross-tenant query log inference

User impact: A logged SearchQuery.text could in theory carry tenant-identifiable strings (e.g. a brand name). At scale this could leak which tenants exist. That information is public anyway, but the exposure is quantifiable.

Detection: Manual quarterly review of search_queries.text distribution against a list of high-PII brand tokens.

Auto-mitigation: PII redaction on text before persistence; nightly anonymizer nullifies text and user_bucket at 30 d.

Operator handle: Tighten redaction list; consider stricter sampling.


10. Boost rule scope violation attempted

User impact: None to consumers (rejected). Indicates either an operator with the wrong tenant context or a JWT / OPA misconfiguration.

Detection: MELMASTOON.SEARCH.BOOST_RULE_SCOPE_VIOLATION count by tenantId > 0 ⇒ P3 ticket; sustained > 10/min ⇒ P2 page (potential malicious tenant operator).

Auto-mitigation: The service rejects the request with 403 and the canonical error code, writes an audit log entry, and sends a Slack #sec-ops notification.


11. Cursor forgery / replay attack

User impact: Forged cursor returns 400 (MELMASTOON.SEARCH.CURSOR_INVALID); replayed valid cursor is honored (idempotent paging).

Detection: cursor_invalid_total rate > 5 / min from a single IP ⇒ Cloud Armor IP-based rate-limit kicks in automatically.

Auto-mitigation: The service validates the cursor's HMAC signature against the current key; rotated-out keys remain valid for 24 h; any cursor that fails validation is rejected.
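
A sketch of signed cursors with a 24 h grace period for rotated keys is shown below. The token layout (base64url payload, a dot, base64url signature), HMAC-SHA256, and every function name are assumptions; the document only specifies that cursors are HMAC-signed, that rotated keys stay valid for 24 h, and that invalid cursors map to MELMASTOON.SEARCH.CURSOR_INVALID.

```python
import base64
import hashlib
import hmac
import json
from typing import List, Tuple

GRACE_S = 24 * 3600  # rotated keys stay valid this long


class CursorInvalid(ValueError):
    """Would be mapped to HTTP 400 / MELMASTOON.SEARCH.CURSOR_INVALID."""


def _b64(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).decode().rstrip("=")


def _unb64(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(key: bytes, payload: bytes) -> bytes:
    return hmac.new(key, payload, hashlib.sha256).digest()


def encode_cursor(state: dict, key: bytes) -> str:
    payload = json.dumps(state, sort_keys=True, separators=(",", ":")).encode()
    return _b64(payload) + "." + _b64(_sign(key, payload))


def decode_cursor(cursor: str, current_key: bytes,
                  retired_keys: List[Tuple[bytes, float]], now: float) -> dict:
    """retired_keys holds (key, retired_at_epoch_s) pairs."""
    try:
        payload_b64, sig_b64 = cursor.split(".")
        payload, sig = _unb64(payload_b64), _unb64(sig_b64)
    except Exception as exc:
        raise CursorInvalid("MELMASTOON.SEARCH.CURSOR_INVALID") from exc
    candidates = [current_key] + [
        key for key, retired_at in retired_keys if now - retired_at <= GRACE_S
    ]
    # compare_digest avoids leaking how many signature bytes matched.
    if any(hmac.compare_digest(sig, _sign(key, payload)) for key in candidates):
        return json.loads(payload)
    raise CursorInvalid("MELMASTOON.SEARCH.CURSOR_INVALID")
```

Replaying a valid cursor simply decodes to the same paging state, which matches the idempotent-paging behaviour described above.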


12. Index lag (Postgres canonical ahead of OpenSearch)

User impact: Stale results — a freshly published property may not appear until OpenSearch catches up.

Detection: index_lag_docs gauge > 5 000 for 5 min ⇒ P2; > 50 000 for 1 min ⇒ P1.

Auto-mitigation: OpenSearch mirror writer batches and retries; drift sweep job re-emits projection.updated.v1 for any row whose hash differs.
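
The drift sweep can be pictured as a hash comparison between the canonical rows and what the index last stored. In this sketch the canonical JSON plus SHA-256 hashing scheme, the in-memory maps, and the event shape beyond its projection.updated.v1 type are assumptions.

```python
import hashlib
import json
from typing import Dict, List


def content_hash(doc: dict) -> str:
    """Stable hash of a document: keys are sorted so field order is irrelevant."""
    canonical = json.dumps(doc, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def drift_sweep(canonical_rows: Dict[str, dict],
                index_hashes: Dict[str, str]) -> List[dict]:
    """Return one projection.updated.v1 event for every row whose hash is
    missing from, or different in, the index."""
    events = []
    for property_id, doc in canonical_rows.items():
        if index_hashes.get(property_id) != content_hash(doc):
            events.append({"type": "projection.updated.v1", "propertyId": property_id})
    return events
```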

Operator handle: ops/runbooks/search-aggregation-service/index-lag-reconcile.md.


13. Index rebuild failure (during a swap)

User impact: None — the existing alias keeps serving. The new index is left orphaned and reaped after 24 h.

Detection: IndexBuild.status='failed'; index.health_alert.v1 event.

Auto-mitigation: Build process aborts on phase failure, leaving alias unchanged. Partial new index marked failed in index_builds.

Operator handle: ops/runbooks/search-aggregation-service/index-rebuild.md — investigate phase logs (replay vs catching_up vs swapping); restart with adjusted parameters.


14. Tenant cascade purge incomplete

User impact: A deleted tenant's hotels remain in search until the purge completes. Compliance risk if the purge exceeds the 60-min SLO window.

Detection: Purge handler emits tenant.purge_completed.v1 carrying rowsDeleted, openSearchDocsDeleted, durationMs. If duration > 60 min ⇒ P2; if no completion event for an acked tenant.deleted.v1 within 90 min ⇒ P1.

Auto-mitigation: Handler partitions the cascade into chunks of 1 000 rows; resumable via tenant_id index.
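
The chunked, resumable cascade can be illustrated with an in-memory stand-in for the table. The function name, the dict-based "table", and the max_chunks parameter (used here only to simulate an interrupted run) are hypothetical; a real handler would issue a bounded DELETE per chunk against the tenant_id index.

```python
from typing import Dict, Optional


def purge_tenant(rows: Dict[str, str], tenant_id: str, chunk_size: int = 1000,
                 max_chunks: Optional[int] = None) -> int:
    """Delete a tenant's rows in chunks. `rows` maps row_id -> tenant_id.

    Each chunk re-selects whatever is still present for the tenant, so an
    interrupted purge resumes correctly just by calling the function again.
    Returns the number of rows deleted in this run.
    """
    deleted = 0
    chunks = 0
    while max_chunks is None or chunks < max_chunks:
        batch = [row_id for row_id, owner in rows.items() if owner == tenant_id][:chunk_size]
        if not batch:
            break  # nothing left for this tenant
        for row_id in batch:
            del rows[row_id]
        deleted += len(batch)
        chunks += 1
    return deleted
```

Keeping each chunk small bounds lock time and transaction size, and the per-run count can feed the rowsDeleted field of the completion event.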

Operator handle: ops/runbooks/search-aggregation-service/tenant-purge.md.


15. Hot query DoS (single repeated expensive query)

User impact: Targeted query saturates OpenSearch shards; other queries slow down.

Detection: opensearch_query_duration_ms p99 spike correlated with a single query.canonicalHash ⇒ P2.

Auto-mitigation:

  • Per-canonical-hash short circuit: if a query took > 1 500 ms in the last 60 s window, identical queries are served from a negative-cache (last result + degradationLevel="rate_limited") until the cool-down passes.
  • Cloud Armor per-IP rate limit absorbs naive amplification.
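
The per-canonical-hash short circuit might be sketched as below. HotQueryGuard, the injected clock, and the in-memory map are assumptions for illustration; the 1 500 ms threshold, the 60 s window, and degradationLevel="rate_limited" come from this section.

```python
from typing import Callable, Dict, Tuple


class HotQueryGuard:
    """Serves the last result for a query hash that was recently slow,
    instead of sending the same expensive query to OpenSearch again."""

    def __init__(self, clock: Callable[[], float],
                 slow_ms: float = 1500.0, window_s: float = 60.0) -> None:
        self._clock = clock
        self._slow_ms = slow_ms
        self._window_s = window_s
        self._slow: Dict[str, Tuple[float, dict]] = {}  # hash -> (seen_at, result)

    def run(self, canonical_hash: str, execute: Callable[[], dict]) -> dict:
        now = self._clock()
        entry = self._slow.get(canonical_hash)
        if entry is not None and now - entry[0] < self._window_s:
            # Cool-down still active: answer from the stored result.
            return {**entry[1], "degradationLevel": "rate_limited"}
        started = self._clock()
        result = execute()
        elapsed_ms = (self._clock() - started) * 1000.0
        if elapsed_ms > self._slow_ms:
            self._slow[canonical_hash] = (self._clock(), result)
        else:
            self._slow.pop(canonical_hash, None)
        return result
```

With several service instances the map would need to live in a shared store for the cool-down to apply fleet-wide; this sketch only protects a single process.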

Operator handle: Add the canonical-hash to a temporary block list via flags-service; engage product to reshape the query if it's a legitimate use case.


16. Pub/Sub topic / subscription misconfiguration

User impact: Either silent data loss (subscription deleted) or duplicate consumption (multiple subscriptions per consumer).

Detection: Synthetic event flow (probe.search.heartbeat.v1) every 5 min with end-to-end latency check; missing heartbeat for 15 min ⇒ P1.

Auto-mitigation: None — IaC drift is what creates this. Terraform plan on every release flags drift.

Operator handle: Reconcile via Terraform apply; drain DLQ if needed.


17. JWKS fetch failure (operator routes broken)

User impact: Operator boost-rule and admin routes return 503; consumer routes unaffected.

Detection: jwks_fetch_errors_total rate > 0 over 5 min ⇒ P2.

Auto-mitigation: JWKS cache (10 min) absorbs short outages; service starts a background refresh with backoff.

Operator handle: Verify iam-service health; if extended, manually rotate the cached JWKS via Secret Manager break-glass entry.


18. Region misconfiguration (property indexed in wrong region)

User impact: The hotel does not appear for users searching its actual region and appears for users searching the wrong one.

Detection: Auditor counter region_mismatch_total > 0 ⇒ P3 ticket; sustained > 50/h ⇒ P2.

Auto-mitigation: None — requires correct upstream region or a code fix in RegionPinningPolicy.

Operator handle: Patch policy / property; trigger a partial reindex of affected properties via POST /api/v1/search/index:rebuild with a property-id subset (admin extension).


19. Disk pressure on Cloud SQL (partitioned tables overflow)

User impact: Postgres slows down ⇒ projection lag ⇒ freshness SLO burn.

Detection: Cloud SQL disk usage > 80 % ⇒ P3, > 90 % ⇒ P2.

Auto-mitigation: Cloud SQL automatic storage resize is enabled. Range-partitioned tables shed old partitions monthly via the nightly-partition-prune job.

Operator handle: Verify partition prune is current; emergency: drop oldest click_events and search_queries partitions older than retention policy.


20. Ranking regression after an algorithm change

User impact: Subtle UX degradation — clicks per impression drops, conversion drops downstream.

Detection: Daily Looker report comparing CTR by query.canonicalHash cohort vs the prior 7-day baseline; > 10 % drop ⇒ P3 ticket.

Auto-mitigation: None automatic. The feature flag search.ranking.algorithm_version allows an instant manual rollback of ranking changes.

Operator handle: Toggle flag back to prior algorithm; collect samples; revert via PR.


21. Time skew on consumer pods

User impact: Wrong occurredAt on emitted events ⇒ analytics drift, possibly stale-event misclassification.

Detection: clock_skew_seconds (NTP-synced offset, gauge) > 5 s ⇒ P3; > 30 s ⇒ P2.

Auto-mitigation: Cloud Run hosts are NTP-synced; service refuses to start if clock_skew_seconds > 30 at boot.


22. Catastrophic data loss (region-wide outage, both Postgres + OpenSearch)

User impact: Search down in the affected region.

Detection: Synthetic checks fail in the region; Cloud Status page red.

Auto-mitigation: External LB routes traffic to the surviving region's Cloud Run revision. Surviving region operates with its own (possibly slightly behind) projection.

Operator handle: DR runbook (per DEPLOYMENT_TOPOLOGY § 8). Restore from PITR + rebuild OpenSearch from BigQuery archive in the recovered region. Quarterly game day rehearses this scenario.


Summary table

#  | Mode                       | Severity      | Auto-mitigation              | Runbook
---+----------------------------+---------------+------------------------------+------------------------------
1  | OpenSearch down            | P1            | Postgres fallback            | opensearch-degraded
2  | Postgres down              | P1            | failover                     | postgres-failover (platform)
3  | Redis down                 | P3            | bypass                       | memorystore-recovery
4  | Outbox stuck               | P2            | backoff                      | outbox-stuck
5  | DLQ spike                  | P2            | none                         | dlq-drain
6  | Out-of-order events        | n/a (handled) | vector-clock skip            | n/a
7  | AI down                    | P3            | fallback                     | ai-orchestrator-degraded
8  | Allow-list breach          | P1 sec        | strip + reject               | allow-list-breach-ir
9  | Query-log inference        | P4            | redact + anonymize           | n/a
10 | Boost-rule scope violation | P2/P3         | reject + audit               | n/a
11 | Cursor forgery             | P3            | reject                       | n/a
12 | Index lag                  | P2/P1         | retry + sweep                | index-lag-reconcile
13 | Index rebuild failure      | P3            | abort safely                 | index-rebuild
14 | Tenant purge incomplete    | P2/P1         | resumable cascade            | tenant-purge
15 | Hot-query DoS              | P2            | negative cache + Cloud Armor | n/a
16 | Pub/Sub misconfig          | P1            | none (IaC)                   | n/a
17 | JWKS fetch fail            | P2            | cache                        | n/a
18 | Region mismatch            | P3/P2         | none                         | n/a
19 | Disk pressure              | P3/P2         | auto-resize + prune          | n/a
20 | Ranking regression         | P3            | flag rollback                | n/a
21 | Clock skew                 | P3/P2         | NTP + boot guard             | n/a
22 | Region-wide outage         | P1            | LB failover + DR             | platform DR