search-aggregation-service — FAILURE_MODES
Companion: OBSERVABILITY · DEPLOYMENT_TOPOLOGY · APPLICATION_LOGIC · SECURITY_MODEL · SERVICE_RISK_REGISTER
This document enumerates the failure modes of search-aggregation-service, along with the detection signals, the automatic mitigations built into the service, and the operator runbook handle for each. For every mode, the user-visible blast radius is the consumer meta-search experience (bff-consumer-service); ranked hotel results are designed to degrade gracefully before they fail outright.
Severity legend
- P1 — paging incident; user-visible outage on the meta-search surface.
- P2 — paging incident; degraded experience or operator surface impact.
- P3 — ticket; no user-visible impact yet, but trending toward one.
1. OpenSearch unavailable (cluster red, network partition, Aiven outage)
User impact: Search degrades to the Postgres-only path: BM25 absent (no full-text), no facets beyond what Postgres can compute, no k-NN re-rank. Geo + filter + sort by price/distance still work. Latency rises (p95 up to ~700 ms) but the service keeps returning 200.
Detection:
- `opensearch_cluster_status{color="red"}` for 1 min ⇒ P1.
- `opensearch_query_duration_ms` p95 > 1 500 ms for 5 min ⇒ P2.
- Adapter circuit breaker opens on > 50 % errors over 30 s.
Auto-mitigation:
- Circuit breaker opens; subsequent calls bypass OpenSearch and call `PostgresFallbackSearchPort`.
- Response sets `degradationLevel: "opensearch_unavailable"`; `query.executed.v1` carries the same.
- Cache TTL is doubled (60 s → 120 s) on the search route to reduce read load.
Operator handle: ops/runbooks/search-aggregation-service/opensearch-degraded.md.
Recovery: Aiven failover or fresh cluster + index rebuild via § 9 of DEPLOYMENT_TOPOLOGY.md. Circuit breaker auto-closes once health probe passes 3 consecutive checks.
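The open/close behaviour described above (open on > 50 % errors over 30 s, close after 3 consecutive healthy probes) can be sketched roughly as follows. This is a minimal illustration, not the service's adapter: `createBreaker`, the `Breaker` interface, and the `minSamples` floor are assumptions introduced here.

```typescript
// Illustrative sketch only; createBreaker and its shape are hypothetical.
interface Breaker {
  isOpen(): boolean;
  record(ok: boolean, nowMs: number): void;
  recordProbe(healthy: boolean): void;
}

function createBreaker(
  windowMs = 30000,     // 30 s error window (from the text)
  errorThreshold = 0.5, // open above 50 % errors (from the text)
  probesToClose = 3,    // consecutive healthy probes needed to close (from the text)
  minSamples = 10,      // assumption: do not trip on a handful of calls
): Breaker {
  let samples: { at: number; ok: boolean }[] = [];
  let open = false;
  let healthyProbes = 0;
  return {
    isOpen: () => open,
    record(ok, nowMs) {
      // Keep only the calls that fall inside the sliding window.
      samples.push({ at: nowMs, ok });
      samples = samples.filter((s) => nowMs - s.at <= windowMs);
      if (open || samples.length < minSamples) return;
      const errors = samples.filter((s) => !s.ok).length;
      if (errors / samples.length > errorThreshold) {
        open = true;
        healthyProbes = 0;
      }
    },
    recordProbe(healthy) {
      if (!open) return;
      // Any failed probe resets the streak.
      healthyProbes = healthy ? healthyProbes + 1 : 0;
      if (healthyProbes >= probesToClose) {
        open = false;
        samples = [];
      }
    },
  };
}
```

While `isOpen()` is true, the caller would route to the Postgres fallback and tag the response with the degradation level; a separate health probe drives `recordProbe`.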
2. Postgres unavailable (Cloud SQL primary down, connection pool exhausted)
User impact: Search route returns 503 (no Postgres = no fallback); cache hits still serve. Detail route continues to serve from cache for 300 s. Projection writes stop ⇒ freshness degrades.
Detection:
- `db_connection_errors_total` rate > 0 for 1 min ⇒ P1.
- `pg_stat_activity` (scraped) saturation > 90 % ⇒ P2.
- `/readyz` flips to 503 ⇒ Cloud Run stops sending new traffic to that revision (handled by Cloud LB).
Auto-mitigation:
- HTTP server stops accepting new requests on `/readyz=false`; LB routes to surviving pods or to the `asia-south1` revision (cross-region read replica only; writes are blocked until the primary recovers).
- Subscribers nack messages; Pub/Sub retries with backoff; messages stay in topics.
- Outbox publisher backs off but keeps the lock acquired so it doesn't churn.
Operator handle: ops/runbooks/postgres-failover.md (platform-shared).
Recovery: Cloud SQL HA failover (≤ 60 s typical); read-replica promotion if primary is lost.
3. Memorystore Redis unavailable
User impact: Cache misses go through to Postgres + OpenSearch. Latency p95 rises (~+80 ms on /queries). No correctness loss.
Detection:
- `cache_op_duration_ms` p99 > 100 ms ⇒ P3.
- `cache_errors_total` rate > 0 ⇒ P3.
Auto-mitigation: RedisCachePort returns null on error; calling code treats as cache miss. Reads continue normally; writes are skipped.
Operator handle: ops/runbooks/memorystore-recovery.md.
4. Pub/Sub publish failure (outbox stuck)
User impact: Stale projection — index entries remain consistent but downstream consumers (analytics-service, cache invalidators) miss updates. After ~5 min, search results may serve slightly stale data via the cache (TTL fallback covers within 60 s once outbox catches up).
Detection:
- `outbox_pending > 5 000` for 5 min ⇒ P2 page.
- `outbox_publish_lag_ms` p95 > 10 000 for 5 min ⇒ P2.
Auto-mitigation: Publisher uses exponential backoff + jitter; advisory lock prevents thundering herd; messages persist in search.outbox with last_error recorded.
Operator handle: ops/runbooks/search-aggregation-service/outbox-stuck.md — typically: bounce the publisher pod, check Pub/Sub quotas, verify the subscription's DLQ.
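The publisher's "exponential backoff + jitter" could look something like the sketch below. It uses the common full-jitter variant; the source does not say which variant the service uses, and `backoffDelayMs`, `baseMs`, and `capMs` are hypothetical names and values chosen for illustration.

```typescript
// Full-jitter exponential backoff (assumed variant).
// baseMs and capMs are illustrative values, not taken from the service.
function backoffDelayMs(
  attempt: number,                   // 0 for the first retry
  baseMs = 500,
  capMs = 60000,
  rng: () => number = Math.random,   // injectable for deterministic tests
): number {
  // The ceiling doubles per attempt until it reaches the cap.
  const ceiling = Math.min(capMs, baseMs * Math.pow(2, attempt));
  // Pick uniformly in [0, ceiling) so concurrent publishers spread out.
  return Math.floor(rng() * ceiling);
}
```

Randomising the whole interval, rather than adding a small offset to a fixed delay, keeps retries from several publisher instances from landing at the same moment once Pub/Sub recovers.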
5. Pub/Sub consume failure (DLQ spike)
User impact: None immediately — failed events sit in DLQ; the affected slice on a subset of properties is stale. If a structural bug is shipping every event to DLQ, freshness collapses within minutes.
Detection:
- `pubsub_dlq_total` rate > 0.1 % over 15 min per subscription ⇒ P2.
- `inbox_unprocessed > 10 000` ⇒ P2.
Auto-mitigation: None automatic — DLQ requires intentional drain.
Operator handle: ops/runbooks/search-aggregation-service/dlq-drain.md. Triage: inspect 5 sample messages; classify as transient, poison, or schema-incompat. Transient ⇒ replay to topic. Poison ⇒ leave in DLQ + open ticket. Schema-incompat ⇒ deploy schema-tolerant consumer + replay.
6. Out-of-order events from upstream
User impact: Without protection, a stale slice would overwrite a newer one ⇒ wrong price/availability. With protection: the stale event is dropped, no corruption.
Detection:
- `projection_skipped_stale_total{topic, slice}` rate.
- Per-topic SLI: rate < 1 % is normal; > 5 % indicates a producer regression.
Auto-mitigation: Vector-clock guard on every consumer; dropped_stale is recorded in inbox and the event is acked (NOT replayed).
Operator handle: If rate is unusually high, contact the upstream service owner; this is a producer-side problem (their vectorClock isn't monotonic).
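A vector-clock guard of the kind described above might be sketched as follows. The clock representation (`Record<string, number>`), the function names, and the choice to apply concurrent updates rather than drop them are all assumptions made for illustration; the source only states that stale events are dropped, recorded as `dropped_stale`, and acked.

```typescript
// Hypothetical shapes; the real vectorClock format is not specified in this document.
type VectorClock = Record<string, number>;
type Outcome = "applied" | "dropped_stale";

// An event is stale when it carries nothing newer than what is already stored.
function isStale(incoming: VectorClock, stored: VectorClock | undefined): boolean {
  if (stored === undefined) return false;
  const nodes = Object.keys(incoming).concat(Object.keys(stored));
  for (const node of nodes) {
    if ((incoming[node] ?? 0) > (stored[node] ?? 0)) return false;
  }
  return true;
}

// Pointwise maximum, so concurrent updates do not lose components.
function mergeClocks(a: VectorClock, b: VectorClock): VectorClock {
  const merged: VectorClock = {};
  for (const node of Object.keys(a).concat(Object.keys(b))) {
    merged[node] = Math.max(a[node] ?? 0, b[node] ?? 0);
  }
  return merged;
}

function applySlice(
  clocks: Map<string, VectorClock>, // last applied clock per slice key
  sliceKey: string,
  incoming: VectorClock,
): Outcome {
  const stored = clocks.get(sliceKey);
  // Stale events are acked and recorded, never replayed.
  if (isStale(incoming, stored)) return "dropped_stale";
  clocks.set(sliceKey, stored === undefined ? incoming : mergeClocks(stored, incoming));
  return "applied";
}
```

Under this sketch a duplicate delivery (same clock as stored) is also classified as stale, which keeps the handler idempotent.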
7. AI orchestrator unavailable (or budget cap exceeded)
User impact: Multilingual intent parse falls back to keyword-only search; semantic re-rank skipped (Phase 2+). User experience slightly worse for non-English queries but service is fully functional.
Detection:
- `ai_search_intent_fallbacks_total / ai_search_intent_calls_total` > 20 % over 15 min ⇒ P3.
- `ai.orchestrator` span error rate > 50 % over 5 min ⇒ P3.
Auto-mitigation: 800 ms timeout, then fallback path. query.executed.v1 carries intentSource: "fallback". Cache continues to serve hits from the prior 30-day window.
Operator handle: ops/runbooks/search-aggregation-service/ai-orchestrator-degraded.md.
8. Allow-list breach (forbidden field detected in projection)
User impact: Potentially severe — risk of cross-tenant data leak to anonymous public traffic. The breach itself does not necessarily mean a user saw the field, because the OpenSearch template is dynamic: "strict" and would reject the document — but a Postgres breach combined with a code path that returns the row to the consumer is a P1 security incident.
Detection:
- `projection_field_stripped_total` ≥ 1 ⇒ P1 security page.
- Nightly `ProjectionExposureAuditor` finds an unexpected column or doc field ⇒ P1.
Auto-mitigation:
- L2 projection-policy strips the field and records the counter; the persisted row is safe.
- L3 OpenSearch rejects any document with an unknown field; the inbox marks the event `dlq` and a `projection.failed.v1` event is emitted.
- Read APIs return only allow-listed fields by construction (DTO is the type, not the row).
Operator handle: ops/runbooks/search-aggregation-service/allow-list-breach-ir.md. First action: page security on-call, freeze releases, rotate boost-rule admin tokens (precaution), capture the offending event(s) from inbox + DLQ for forensic analysis.
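The strip-and-count step of the projection policy can be illustrated with a small sketch. The function name, the callback, and the example field names are hypothetical; only the behaviour (drop anything not on the allow-list and increment a counter) comes from the text.

```typescript
// Hypothetical helper: keeps allow-listed fields, reports everything else.
function stripToAllowList(
  row: Record<string, unknown>,
  allowed: readonly string[],
  onStripped: (field: string) => void, // e.g. increments projection_field_stripped_total
): Record<string, unknown> {
  const safe: Record<string, unknown> = {};
  for (const field of Object.keys(row)) {
    if (allowed.indexOf(field) >= 0) {
      safe[field] = row[field];
    } else {
      // A forbidden field arrived from upstream: never persist it.
      onStripped(field);
    }
  }
  return safe;
}
```

Because the output object is built from the allow-list rather than by deleting known-bad keys, a field nobody anticipated is stripped by default.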
9. Cross-tenant query log inference
User impact: A logged SearchQuery.text could in theory carry tenant-identifiable strings (e.g. a brand name). At scale this could leak which tenants exist. That information is public anyway, but the exposure is quantifiable.
Detection: Manual quarterly review of search_queries.text distribution against a list of high-PII brand tokens.
Auto-mitigation: PII redaction on text before persistence; nightly anonymizer nullifies text and user_bucket at 30 d.
Operator handle: Tighten redaction list; consider stricter sampling.
10. Boost rule scope violation attempted
User impact: None to consumers (rejected). Indicates either an operator with the wrong tenant context or a JWT / OPA misconfiguration.
Detection: MELMASTOON.SEARCH.BOOST_RULE_SCOPE_VIOLATION count by tenantId > 0 ⇒ P3 ticket; sustained > 10/min ⇒ P2 page (potential malicious tenant operator).
Auto-mitigation: Service rejects the request with 403 and the canonical error code; an audit log entry is written and a Slack #sec-ops notification is sent.
11. Cursor forgery / replay attack
User impact: Forged cursor returns 400 (MELMASTOON.SEARCH.CURSOR_INVALID); replayed valid cursor is honored (idempotent paging).
Detection: cursor_invalid_total rate > 5 / min from a single IP ⇒ Cloud Armor IP-based rate-limit kicks in automatically.
Auto-mitigation: Cursor validates HMAC signature with current key; rotated keys retain validity 24 h; bad cursor rejected.
12. Index lag (Postgres canonical ahead of OpenSearch)
User impact: Stale results — a freshly published property may not appear until OpenSearch catches up.
Detection: index_lag_docs gauge > 5 000 for 5 min ⇒ P2; > 50 000 for 1 min ⇒ P1.
Auto-mitigation: OpenSearch mirror writer batches and retries; drift sweep job re-emits projection.updated.v1 for any row whose hash differs.
Operator handle: ops/runbooks/search-aggregation-service/index-lag-reconcile.md.
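The drift sweep's core comparison can be sketched as below, assuming each side exposes a per-row content hash keyed by document id. That representation and the name `findDriftedIds` are assumptions for illustration, not the job's actual interface.

```typescript
// Returns the ids whose mirror hash is missing or differs from the canonical hash.
// Both maps are id -> content hash (an assumed representation).
function findDriftedIds(
  canonical: Map<string, string>, // Postgres side
  mirror: Map<string, string>,    // OpenSearch side
): string[] {
  const drifted: string[] = [];
  canonical.forEach((hash, id) => {
    if (mirror.get(id) !== hash) drifted.push(id);
  });
  return drifted;
}
```

Each returned id would then have `projection.updated.v1` re-emitted so the mirror writer picks it up again.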
13. Index rebuild failure (during a swap)
User impact: None — the existing alias keeps serving. The new index is left orphaned and reaped after 24 h.
Detection: IndexBuild.status='failed'; index.health_alert.v1 event.
Auto-mitigation: Build process aborts on phase failure, leaving alias unchanged. Partial new index marked failed in index_builds.
Operator handle: ops/runbooks/search-aggregation-service/index-rebuild.md — investigate phase logs (replay vs catching_up vs swapping); restart with adjusted parameters.
14. Tenant cascade purge incomplete
User impact: A deleted tenant's hotels remain in search until the purge completes. Compliance risk if the purge exceeds the 60-min SLO window.
Detection: Purge handler emits tenant.purge_completed.v1 carrying rowsDeleted, openSearchDocsDeleted, durationMs. If duration > 60 min ⇒ P2; if no completion event for an acked tenant.deleted.v1 within 90 min ⇒ P1.
Auto-mitigation: Handler partitions the cascade into chunks of 1 000 rows; resumable via tenant_id index.
Operator handle: ops/runbooks/search-aggregation-service/tenant-purge.md.
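A chunked, resumable cascade along these lines might look like the following sketch. The `PurgeStore` interface, the `maxChunks` budget, and the return shape are hypothetical; the 1 000-row chunk size comes from the text.

```typescript
// Hypothetical storage port: deletes up to `limit` rows for the tenant
// (located via the tenant_id index) and returns how many were removed.
interface PurgeStore {
  deleteChunk(tenantId: string, limit: number): number;
}

function purgeTenant(
  store: PurgeStore,
  tenantId: string,
  chunkSize = 1000,     // from the text
  maxChunks = Infinity, // assumption: lets a run stop early and resume later
): { rowsDeleted: number; completed: boolean } {
  let rowsDeleted = 0;
  for (let i = 0; i < maxChunks; i++) {
    const deleted = store.deleteChunk(tenantId, chunkSize);
    rowsDeleted += deleted;
    // A short chunk means nothing is left for this tenant.
    if (deleted < chunkSize) return { rowsDeleted, completed: true };
  }
  return { rowsDeleted, completed: false };
}
```

Since every chunk deletes whatever rows still match the tenant, a restarted handler needs no saved offset: calling `purgeTenant` again continues from wherever the previous run stopped.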
15. Hot query DoS (single repeated expensive query)
User impact: Targeted query saturates OpenSearch shards; other queries slow down.
Detection: opensearch_query_duration_ms p99 spike correlated with a single query.canonicalHash ⇒ P2.
Auto-mitigation:
- Per-canonical-hash short circuit: if a query took > 1 500 ms in the last 60 s window, identical queries are served from a `negative-cache` (last result + `degradationLevel="rate_limited"`) until the cool-down passes.
- Cloud Armor per-IP rate limit absorbs naive amplification.
Operator handle: Add the canonical-hash to a temporary block list via flags-service; engage product to reshape the query if it's a legitimate use case.
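The per-canonical-hash short circuit could be sketched as follows. `createHotQueryGuard` and its methods are hypothetical names, and the sketch assumes the cool-down is the same 60 s as the observation window, which the text does not state explicitly.

```typescript
// Hypothetical guard keyed by query.canonicalHash.
interface HotEntry<T> {
  lastResult: T;
  slowAt: number; // when the slow execution was observed (ms)
}

function createHotQueryGuard<T>(slowMs = 1500, coolDownMs = 60000) {
  const hot = new Map<string, HotEntry<T>>();
  return {
    // Returns the cached answer while the hash is cooling down, else undefined.
    check(hash: string, nowMs: number): { result: T; degradationLevel: "rate_limited" } | undefined {
      const entry = hot.get(hash);
      if (entry === undefined) return undefined;
      if (nowMs - entry.slowAt >= coolDownMs) {
        hot.delete(hash); // cool-down passed; let the query reach OpenSearch again
        return undefined;
      }
      return { result: entry.lastResult, degradationLevel: "rate_limited" };
    },
    // Call after every real execution; only slow ones are remembered.
    observe(hash: string, durationMs: number, result: T, nowMs: number): void {
      if (durationMs > slowMs) hot.set(hash, { lastResult: result, slowAt: nowMs });
    },
  };
}
```

A route handler would call `check` before querying OpenSearch and `observe` afterwards, so one expensive query shape costs the cluster at most one execution per cool-down period.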
16. Pub/Sub topic / subscription misconfiguration
User impact: Either silent data loss (subscription deleted) or duplicate consumption (multiple subscriptions per consumer).
Detection: Synthetic event flow (probe.search.heartbeat.v1) every 5 min with end-to-end latency check; missing heartbeat for 15 min ⇒ P1.
Auto-mitigation: None — IaC drift is what creates this. Terraform plan on every release flags drift.
Operator handle: Reconcile via Terraform apply; drain DLQ if needed.
17. JWKS fetch failure (operator routes broken)
User impact: Operator boost-rule and admin routes return 503; consumer routes unaffected.
Detection: jwks_fetch_errors_total rate > 0 over 5 min ⇒ P2.
Auto-mitigation: JWKS cache (10 min) absorbs short outages; service starts a background refresh with backoff.
Operator handle: Verify iam-service health; if extended, manually rotate the cached JWKS via Secret Manager break-glass entry.
18. Region misconfiguration (property indexed in wrong region)
User impact: The hotel doesn't appear for users targeting its region; it appears for the wrong region instead.
Detection: Auditor counter region_mismatch_total > 0 ⇒ P3 ticket; sustained > 50/h ⇒ P2.
Auto-mitigation: None — requires correct upstream region or a code fix in RegionPinningPolicy.
Operator handle: Patch policy / property; trigger a partial reindex of affected properties via POST /api/v1/search/index:rebuild with a property-id subset (admin extension).
19. Disk pressure on Cloud SQL (partitioned tables overflow)
User impact: Postgres slows down ⇒ projection lag ⇒ freshness SLO burn.
Detection: Cloud SQL disk usage > 80 % ⇒ P3, > 90 % ⇒ P2.
Auto-mitigation: Auto-resize is enabled (Cloud SQL feature). Range-partitioned tables shed old partitions monthly via the nightly-partition-prune job.
Operator handle: Verify partition prune is current; emergency: drop oldest click_events and search_queries partitions older than retention policy.
20. Ranking regression after an algorithm change
User impact: Subtle UX degradation — clicks per impression drops, conversion drops downstream.
Detection: Daily Looker report comparing CTR by query.canonicalHash cohort vs the prior 7-day baseline; > 10 % drop ⇒ P3 ticket.
Auto-mitigation: None — feature flags allow instant rollback of ranking changes (search.ranking.algorithm_version).
Operator handle: Toggle flag back to prior algorithm; collect samples; revert via PR.
21. Time skew on consumer pods
User impact: Wrong occurredAt on emitted events ⇒ analytics drift, possibly stale-event misclassification.
Detection: clock_skew_seconds (NTP-synced offset, gauge) > 5 s ⇒ P3; > 30 s ⇒ P2.
Auto-mitigation: Cloud Run hosts are NTP-synced; service refuses to start if clock_skew_seconds > 30 at boot.
22. Catastrophic data loss (region-wide outage, both Postgres + OpenSearch)
User impact: Search down in the affected region.
Detection: Synthetic checks fail in the region; Cloud Status page red.
Auto-mitigation: External LB routes traffic to the surviving region's Cloud Run revision. Surviving region operates with its own (possibly slightly behind) projection.
Operator handle: DR runbook (per DEPLOYMENT_TOPOLOGY § 8). Restore from PITR + rebuild OpenSearch from BigQuery archive in the recovered region. Quarterly game day rehearses this scenario.
Summary table
| # | Mode | Severity | Auto-mitigation | Runbook |
|---|---|---|---|---|
| 1 | OpenSearch down | P1 | Postgres fallback | opensearch-degraded |
| 2 | Postgres down | P1 | failover | postgres-failover (platform) |
| 3 | Redis down | P3 | bypass | memorystore-recovery |
| 4 | Outbox stuck | P2 | backoff | outbox-stuck |
| 5 | DLQ spike | P2 | none | dlq-drain |
| 6 | Out-of-order events | n/a (handled) | vector-clock skip | n/a |
| 7 | AI down | P3 | fallback | ai-orchestrator-degraded |
| 8 | Allow-list breach | P1 sec | strip + reject | allow-list-breach-ir |
| 9 | Query-log inference | P4 | redact + anonymize | n/a |
| 10 | Boost-rule scope violation | P2/P3 | reject + audit | n/a |
| 11 | Cursor forgery | P3 | reject | n/a |
| 12 | Index lag | P2/P1 | retry + sweep | index-lag-reconcile |
| 13 | Index rebuild failure | P3 | abort safely | index-rebuild |
| 14 | Tenant purge incomplete | P2/P1 | resumable cascade | tenant-purge |
| 15 | Hot-query DoS | P2 | negative cache + Cloud Armor | n/a |
| 16 | Pub/Sub misconfig | P1 | none (IaC) | n/a |
| 17 | JWKS fetch fail | P2 | cache | n/a |
| 18 | Region mismatch | P3/P2 | none | n/a |
| 19 | Disk pressure | P3/P2 | auto-resize + prune | n/a |
| 20 | Ranking regression | P3 | flag rollback | n/a |
| 21 | Clock skew | P3/P2 | NTP + boot guard | n/a |
| 22 | Region-wide outage | P1 | LB failover + DR | platform DR |