search-aggregation-service — FAILURE_MODES

Companion: OBSERVABILITY · DEPLOYMENT_TOPOLOGY · APPLICATION_LOGIC · SECURITY_MODEL · SERVICE_RISK_REGISTER

This document enumerates the failure modes of search-aggregation-service and, for each one, the detection signals, the automatic mitigations built into the service, and the operator runbook handle. In every mode the user-visible blast radius is the consumer meta-search experience (bff-consumer-service); ranked hotel results are designed to degrade gracefully before they fail outright.

Severity legend

P1 — paging incident; user-visible outage on the meta-search surface.
P2 — paging incident; degraded experience or operator surface impact.
P3 — ticket; no user-visible impact yet, but trending toward one.

1. OpenSearch unavailable (cluster red, network partition, Aiven outage)

User impact: Search degrades to the Postgres-only path: no BM25 (so no full-text relevance), no facets beyond what Postgres can compute, and no k-NN re-rank. Geo, filter, and sort by price/distance still work. Latency rises (p95 up to ~700 ms) but the service keeps returning 200.

Detection:

opensearch_cluster_status{color="red"} for 1 min ⇒ P1.
opensearch_query_duration_ms p95 > 1 500 ms for 5 min ⇒ P2.
Adapter circuit breaker opens on > 50 % errors over 30 s.

Auto-mitigation:

Circuit breaker opens; subsequent calls bypass OpenSearch and call PostgresFallbackSearchPort.
Response sets degradationLevel: "opensearch_unavailable"; query.executed.v1 carries the same.
Cache TTL doubled (60 s → 120 s) on the search route to reduce read load.

Operator handle: ops/runbooks/search-aggregation-service/opensearch-degraded.md.

Recovery: Aiven failover or fresh cluster + index rebuild via § 9 of DEPLOYMENT_TOPOLOGY.md. Circuit breaker auto-closes once health probe passes 3 consecutive checks.
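The open and close rules above can be sketched as a small state machine. This is a minimal illustration rather than the adapter's actual code: the class name, method names, and the returned backend labels are hypothetical, and only the thresholds (more than 50 % errors over 30 s, 3 consecutive healthy probes) come from this section.

```python
from collections import deque


class CircuitBreaker:
    """Opens on a high error ratio in a sliding window; closes after N healthy probes."""

    def __init__(self, window_s=30.0, error_ratio=0.5, probes_to_close=3):
        self.window_s = window_s
        self.error_ratio = error_ratio
        self.probes_to_close = probes_to_close
        self.calls = deque()  # (timestamp, ok) pairs inside the window
        self.is_open = False
        self.healthy_probes = 0

    def record_call(self, now, ok):
        self.calls.append((now, ok))
        # Drop samples that fell out of the sliding window.
        while self.calls and self.calls[0][0] <= now - self.window_s:
            self.calls.popleft()
        errors = sum(1 for _, call_ok in self.calls if not call_ok)
        if errors / len(self.calls) > self.error_ratio:
            self.is_open = True
            self.healthy_probes = 0

    def record_probe(self, ok):
        if not self.is_open:
            return
        # Only consecutive successes count: a failed probe resets the streak.
        self.healthy_probes = self.healthy_probes + 1 if ok else 0
        if self.healthy_probes >= self.probes_to_close:
            self.is_open = False
            self.calls.clear()

    def backend(self):
        # Hypothetical labels for the two search paths.
        return "postgres_fallback" if self.is_open else "opensearch"
```

A production breaker would normally also require a minimum number of calls in the window before opening, so that a single failed request on a quiet route does not trip it; that guard is omitted here for brevity.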

2. Postgres unavailable (Cloud SQL primary down, connection pool exhausted)

User impact: Search route returns 503 (no Postgres = no fallback); cache hits still serve. Detail route continues to serve from cache for 300 s. Projection writes stop ⇒ freshness degrades.

Detection:

db_connection_errors_total rate > 0 for 1 min ⇒ P1.
pg_stat_activity (scraped) saturation > 90 % ⇒ P2.
/readyz flips to 503 ⇒ Cloud Run stops sending new traffic to that revision (handled by Cloud LB).

Auto-mitigation:

Once /readyz reports not-ready, the instance stops receiving new requests; the LB routes to surviving pods or to the asia-south1 revision (cross-region read replica only; writes are blocked until the primary recovers).
Subscribers nack messages; Pub/Sub retries with backoff; messages stay in topics.
Outbox publisher backs off but keeps the lock acquired so it doesn't churn.

Operator handle: ops/runbooks/postgres-failover.md (platform-shared).

Recovery: Cloud SQL HA failover (≤ 60 s typical); read-replica promotion if primary is lost.

3. Memorystore Redis unavailable

User impact: Cache misses go through to Postgres + OpenSearch. Latency p95 rises (~+80 ms on /queries). No correctness loss.

Detection:

cache_op_duration_ms p99 > 100 ms ⇒ P3.
cache_errors_total rate > 0 ⇒ P3.

Auto-mitigation: RedisCachePort returns null on error; calling code treats as cache miss. Reads continue normally; writes are skipped.
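The fail-open behaviour described above can be sketched as a thin wrapper around the cache client. The class and function names below are hypothetical stand-ins for RedisCachePort and its caller; the client is assumed to expose get(key) and set(key, value, ttl_s).

```python
class FailOpenCache:
    """Wraps a cache client so that any cache error looks like a miss."""

    def __init__(self, client, on_error=lambda: None):
        self.client = client
        self.on_error = on_error  # e.g. increments cache_errors_total

    def get(self, key):
        try:
            return self.client.get(key)
        except Exception:
            self.on_error()
            return None  # the caller treats None as a cache miss

    def set(self, key, value, ttl_s):
        try:
            self.client.set(key, value, ttl_s)
            return True
        except Exception:
            self.on_error()
            return False  # the write is skipped, never raised


def cached_search(cache, key, compute, ttl_s=60):
    hit = cache.get(key)
    if hit is not None:
        return hit
    result = compute()  # falls through to Postgres + OpenSearch
    cache.set(key, result, ttl_s)
    return result
```

Because errors never propagate, a Redis outage costs latency only, which matches the "no correctness loss" claim for this mode.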

Operator handle: ops/runbooks/memorystore-recovery.md.

4. Pub/Sub publish failure (outbox stuck)

User impact: Stale projection. Index entries remain consistent, but downstream consumers (analytics-service, cache invalidators) miss updates. After ~5 min, search may serve slightly stale data from the cache; once the outbox catches up, the TTL fallback clears the staleness within 60 s.

Detection:

outbox_pending > 5 000 for 5 min ⇒ P2 page.
outbox_publish_lag_ms p95 > 10 000 for 5 min ⇒ P2.

Auto-mitigation: Publisher uses exponential backoff + jitter; advisory lock prevents thundering herd; messages persist in search.outbox with last_error recorded.
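As an illustration of exponential backoff with jitter, the sketch below uses the "full jitter" variant. The base delay, cap, and function name are assumptions; this section does not state the publisher's actual parameters or which jitter variant it uses.

```python
import random


def backoff_delay_s(attempt, base_s=0.5, cap_s=60.0, rng=random.random):
    """Full jitter: a uniform delay in [0, min(cap_s, base_s * 2**attempt))."""
    ceiling = min(cap_s, base_s * (2 ** attempt))
    return rng() * ceiling
```

Randomizing the whole delay, rather than adding a small offset to a fixed schedule, spreads retries from several publisher instances so they do not hit Pub/Sub in lockstep after an outage.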

Operator handle: ops/runbooks/search-aggregation-service/outbox-stuck.md — typically: bounce the publisher pod, check Pub/Sub quotas, verify the subscription's DLQ.

5. Pub/Sub consume failure (DLQ spike)

User impact: None immediately. Failed events sit in the DLQ, so the affected slice is stale for a subset of properties. If a structural bug is sending every event to the DLQ, freshness collapses within minutes.

Detection:

pubsub_dlq_total rate > 0.1 % over 15 min per subscription ⇒ P2.
inbox_unprocessed > 10 000 ⇒ P2.

Auto-mitigation: None automatic — DLQ requires intentional drain.

Operator handle: ops/runbooks/search-aggregation-service/dlq-drain.md. Triage: inspect 5 sample messages; classify as transient, poison, or schema-incompat. Transient ⇒ replay to topic. Poison ⇒ leave in DLQ + open ticket. Schema-incompat ⇒ deploy schema-tolerant consumer + replay.

6. Out-of-order events from upstream

User impact: Without protection, a stale slice would overwrite a newer one ⇒ wrong price/availability. With protection: the stale event is dropped, no corruption.

Detection:

projection_skipped_stale_total{topic, slice} rate.
Per-topic SLI: rate < 1 % is normal; > 5 % indicates a producer regression.

Auto-mitigation: Vector-clock guard on every consumer; dropped_stale is recorded in inbox and the event is acked (NOT replayed).
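A minimal sketch of such a guard follows. The event shape (propertyId, slice, vectorClock, payload) and the in-memory inbox and projection are hypothetical, and the comparison rule is an assumption: an event is treated as stale when no component of its clock is ahead of the stored clock, while concurrent clocks are applied.

```python
def is_stale(incoming, stored):
    """True when `incoming` does not advance any component beyond `stored`."""
    keys = set(incoming) | set(stored)
    return all(incoming.get(k, 0) <= stored.get(k, 0) for k in keys)


def apply_event(inbox, projection, event):
    key = (event["propertyId"], event["slice"])
    stored = projection.get(key)
    if stored is not None and is_stale(event["vectorClock"], stored["vectorClock"]):
        # Recorded and acked; a stale event is never replayed.
        inbox.append((event["id"], "dropped_stale"))
        return False
    projection[key] = {
        "vectorClock": event["vectorClock"],
        "payload": event["payload"],
    }
    inbox.append((event["id"], "applied"))
    return True
```

Returning False without raising is what lets the consumer ack the message; raising would trigger a Pub/Sub redelivery of an event that can never be applied.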

Operator handle: If rate is unusually high, contact the upstream service owner; this is a producer-side problem (their vectorClock isn't monotonic).

7. AI orchestrator unavailable (or budget cap exceeded)

User impact: Multilingual intent parse falls back to keyword-only search; semantic re-rank skipped (Phase 2+). User experience slightly worse for non-English queries but service is fully functional.

Detection:

ai_search_intent_fallbacks_total / ai_search_intent_calls_total > 20 % over 15 min ⇒ P3.
ai.orchestrator span error rate > 50 % over 5 min ⇒ P3.

Auto-mitigation: 800 ms timeout, then fallback path. query.executed.v1 carries intentSource: "fallback". Cache continues to serve hits from the prior 30-day window.
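The timeout-then-fallback path might look like the sketch below. The function names and the keyword-only fallback shape are hypothetical; the 800 ms budget and the intentSource value "fallback" come from this section, while the value "ai" for the success case is an assumption.

```python
import asyncio


async def parse_intent(query, call_orchestrator, timeout_s=0.8):
    """Ask the AI orchestrator for an intent; fall back to keywords on timeout or error."""
    try:
        intent = await asyncio.wait_for(call_orchestrator(query), timeout=timeout_s)
        return {"intent": intent, "intentSource": "ai"}
    except Exception:
        # Covers asyncio.TimeoutError as well as orchestrator failures.
        return {"intent": {"keywords": query.split()}, "intentSource": "fallback"}
```

asyncio.wait_for cancels the pending orchestrator call when the budget expires, so a slow upstream does not keep consuming resources after the request has already fallen back.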

Operator handle: ops/runbooks/search-aggregation-service/ai-orchestrator-degraded.md.

8. Allow-list breach (forbidden field detected in projection)

User impact: Potentially severe: risk of a cross-tenant data leak to anonymous public traffic. A breach does not necessarily mean a user saw the field, because the OpenSearch template sets dynamic: "strict" and would reject the document. A Postgres breach combined with a code path that returns the row to the consumer, however, is a P1 security incident.

Detection:

projection_field_stripped_total ≥ 1 ⇒ P1 security page.
Nightly ProjectionExposureAuditor finds an unexpected column or doc field ⇒ P1.

Auto-mitigation:

L2 projection-policy strips the field and records the counter; the persisted row is safe.
L3 OpenSearch rejects any document with an unknown field; the inbox marks the event dlq and a projection.failed.v1 event is emitted.
Read APIs return only allow-listed fields by construction (DTO is the type, not the row).
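The L2 strip step can be illustrated as a pure function over an incoming document. The allow-list contents and the function name are hypothetical; the real projection policy and its field set live in the service and are not reproduced here.

```python
# Hypothetical allow-list, for illustration only.
ALLOWED_FIELDS = frozenset({"propertyId", "name", "city", "priceFrom"})


def strip_to_allow_list(doc, allowed=ALLOWED_FIELDS):
    """Return (safe_doc, stripped_field_names) for an incoming projection document."""
    kept = {k: v for k, v in doc.items() if k in allowed}
    stripped = sorted(set(doc) - set(allowed))
    return kept, stripped
```

In this sketch the caller would increment projection_field_stripped_total once per name in the second return value, which is what turns a silently dropped field into the P1 security page.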

Operator handle: ops/runbooks/search-aggregation-service/allow-list-breach-ir.md. First action: page security on-call, freeze releases, rotate boost-rule admin tokens (precaution), capture the offending event(s) from inbox + DLQ for forensic analysis.

9. Cross-tenant query log inference

User impact: A logged SearchQuery.text could in theory carry tenant-identifiable strings (e.g. a brand name). At scale this could reveal which tenants exist. That information is public anyway, but the exposure is quantifiable.

Detection: Manual quarterly review of search_queries.text distribution against a list of high-PII brand tokens.

Auto-mitigation: PII redaction on text before persistence; nightly anonymizer nullifies text and user_bucket at 30 d.

Operator handle: Tighten redaction list; consider stricter sampling.

10. Boost rule scope violation attempted

User impact: None to consumers (rejected). Indicates either an operator with the wrong tenant context or a JWT / OPA misconfiguration.

Detection: MELMASTOON.SEARCH.BOOST_RULE_SCOPE_VIOLATION count by tenantId > 0 ⇒ P3 ticket; sustained > 10/min ⇒ P2 page (potential malicious tenant operator).

Auto-mitigation: Service rejects the request with 403 and the canonical error code; an audit log entry is written and a Slack #sec-ops notification is sent.

11. Cursor forgery / replay attack

User impact: Forged cursor returns 400 (MELMASTOON.SEARCH.CURSOR_INVALID); replayed valid cursor is honored (idempotent paging).

Detection: cursor_invalid_total rate > 5 / min from a single IP ⇒ Cloud Armor IP-based rate-limit kicks in automatically.

Auto-mitigation: Cursor validates HMAC signature with current key; rotated keys retain validity 24 h; bad cursor rejected.
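A sketch of HMAC-signed cursors with a rotation grace list is shown below. The cursor wire format (base64 body, a dot, hex tag) and the function names are assumptions; the caller is assumed to pass only retired keys that are still inside the 24 h validity window.

```python
import base64
import hashlib
import hmac
import json


def sign_cursor(payload, key):
    body = base64.urlsafe_b64encode(json.dumps(payload, sort_keys=True).encode())
    tag = hmac.new(key, body, hashlib.sha256).hexdigest()
    return f"{body.decode()}.{tag}"


def verify_cursor(cursor, current_key, retired_keys=()):
    """Return the payload, or None when the signature matches no accepted key."""
    body, sep, tag = cursor.rpartition(".")
    if not sep:
        return None
    for key in (current_key, *retired_keys):
        expected = hmac.new(key, body.encode(), hashlib.sha256).hexdigest()
        # Constant-time comparison avoids leaking how much of the tag matched.
        if hmac.compare_digest(expected.encode(), tag.encode()):
            return json.loads(base64.urlsafe_b64decode(body))
    return None
```

A None result corresponds to the 400 MELMASTOON.SEARCH.CURSOR_INVALID response. Verification is stateless, which is why replaying a valid cursor is harmless: it simply returns the same page.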

12. Index lag (Postgres canonical ahead of OpenSearch)

User impact: Stale results — a freshly published property may not appear until OpenSearch catches up.

Detection: index_lag_docs gauge > 5 000 for 5 min ⇒ P2; > 50 000 for 1 min ⇒ P1.

Auto-mitigation: OpenSearch mirror writer batches and retries; drift sweep job re-emits projection.updated.v1 for any row whose hash differs.
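The drift sweep's comparison step can be sketched as follows. The hashing scheme (SHA-256 over canonical JSON) and the in-memory inputs are assumptions; this section only says that rows whose hash differs are re-emitted.

```python
import hashlib
import json


def doc_hash(doc):
    """Stable hash of a document: key order and whitespace do not affect it."""
    canonical = json.dumps(doc, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()


def find_drift(canonical_rows, index_hashes):
    """Ids whose canonical hash is missing from, or differs in, the search index."""
    return [
        row_id
        for row_id, doc in canonical_rows.items()
        if index_hashes.get(row_id) != doc_hash(doc)
    ]
```

Each returned id would trigger a projection.updated.v1 re-emit, so the sweep repairs both missing documents and documents that were indexed from an older row version.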

Operator handle: ops/runbooks/search-aggregation-service/index-lag-reconcile.md.

13. Index rebuild failure (during a swap)

User impact: None — the existing alias keeps serving. The new index is left orphaned and reaped after 24 h.

Detection: IndexBuild.status='failed'; index.health_alert.v1 event.

Auto-mitigation: Build process aborts on phase failure, leaving alias unchanged. Partial new index marked failed in index_builds.

Operator handle: ops/runbooks/search-aggregation-service/index-rebuild.md — investigate phase logs (replay vs catching_up vs swapping); restart with adjusted parameters.

14. Tenant cascade purge incomplete

User impact: A deleted tenant's hotels remain in search until the purge completes. Compliance risk if the purge exceeds the 60-minute SLO window.

Detection: Purge handler emits tenant.purge_completed.v1 carrying rowsDeleted, openSearchDocsDeleted, durationMs. If duration > 60 min ⇒ P2; if no completion event for an acked tenant.deleted.v1 within 90 min ⇒ P1.

Auto-mitigation: Handler partitions the cascade into chunks of 1 000 rows; resumable via tenant_id index.
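A chunked, resumable delete can be sketched as below. SQLite stands in for Postgres so the example is self-contained, and the properties table, its columns, and the max_chunks parameter are hypothetical. Each chunk is committed on its own, so an interrupted run resumes simply by calling the function again: the tenant_id index finds whatever rows remain.

```python
import sqlite3


def purge_tenant(conn, tenant_id, chunk_size=1000, max_chunks=None):
    """Delete a tenant's rows in committed chunks; returns the number of rows deleted."""
    deleted, chunks = 0, 0
    while max_chunks is None or chunks < max_chunks:
        cur = conn.execute(
            "DELETE FROM properties WHERE rowid IN "
            "(SELECT rowid FROM properties WHERE tenant_id = ? LIMIT ?)",
            (tenant_id, chunk_size),
        )
        conn.commit()  # one transaction per chunk keeps locks short
        if cur.rowcount == 0:
            break
        deleted += cur.rowcount
        chunks += 1
    return deleted
```

The returned count corresponds to the rowsDeleted field carried by tenant.purge_completed.v1; deleting the matching OpenSearch documents is a separate step not shown here.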

Operator handle: ops/runbooks/search-aggregation-service/tenant-purge.md.

15. Hot query DoS (single repeated expensive query)

User impact: Targeted query saturates OpenSearch shards; other queries slow down.

Detection: opensearch_query_duration_ms p99 spike correlated with a single query.canonicalHash ⇒ P2.

Auto-mitigation:

Per-canonical-hash short circuit: if a query took > 1 500 ms in the last 60 s window, identical queries are served from a negative-cache (last result + degradationLevel="rate_limited") until the cool-down passes.
Cloud Armor per-IP rate limit absorbs naive amplification.
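The per-canonical-hash short circuit can be sketched as a small guard object. The class name and the execute callback contract (returning a result and its duration in milliseconds) are hypothetical; the 1 500 ms threshold, 60 s window, and the degradationLevel value come from this section.

```python
class HotQueryGuard:
    """Serves the last result for queries that were recently slow."""

    def __init__(self, slow_ms=1500, cooldown_s=60.0):
        self.slow_ms = slow_ms
        self.cooldown_s = cooldown_s
        self.slow = {}  # canonical_hash -> (recorded_at, last_result)

    def run(self, canonical_hash, execute, now):
        entry = self.slow.get(canonical_hash)
        if entry is not None and now - entry[0] < self.cooldown_s:
            # Inside the cool-down: do not touch OpenSearch at all.
            return {"result": entry[1], "degradationLevel": "rate_limited"}
        self.slow.pop(canonical_hash, None)
        result, duration_ms = execute()
        if duration_ms > self.slow_ms:
            self.slow[canonical_hash] = (now, result)
        return {"result": result, "degradationLevel": None}
```

With several service instances the slow-query map would need to live in a shared store for the guard to hold across the fleet; the sketch keeps it in process memory for simplicity.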

Operator handle: Add the canonical-hash to a temporary block list via flags-service; engage product to reshape the query if it's a legitimate use case.

16. Pub/Sub topic / subscription misconfiguration

User impact: Either silent data loss (subscription deleted) or duplicate consumption (multiple subscriptions per consumer).

Detection: Synthetic event flow (probe.search.heartbeat.v1) every 5 min with end-to-end latency check; missing heartbeat for 15 min ⇒ P1.

Auto-mitigation: None — IaC drift is what creates this. Terraform plan on every release flags drift.

Operator handle: Reconcile via Terraform apply; drain DLQ if needed.

17. JWKS fetch failure (operator routes broken)

User impact: Operator boost-rule and admin routes return 503; consumer routes unaffected.

Detection: jwks_fetch_errors_total rate > 0 over 5 min ⇒ P2.

Auto-mitigation: JWKS cache (10 min) absorbs short outages; service starts a background refresh with backoff.

Operator handle: Verify iam-service health; if extended, manually rotate the cached JWKS via Secret Manager break-glass entry.

18. Region misconfiguration (property indexed in wrong region)

User impact: A hotel doesn't appear for users targeting its region and appears in the wrong region instead.

Detection: Auditor counter region_mismatch_total > 0 ⇒ P3 ticket; sustained > 50/h ⇒ P2.

Auto-mitigation: None — requires correct upstream region or a code fix in RegionPinningPolicy.

Operator handle: Patch policy / property; trigger a partial reindex of affected properties via POST /api/v1/search/index:rebuild with a property-id subset (admin extension).

19. Disk pressure on Cloud SQL (partitioned tables overflow)

User impact: Postgres slows down ⇒ projection lag ⇒ freshness SLO burn.

Detection: Cloud SQL disk usage > 80 % ⇒ P3, > 90 % ⇒ P2.

Auto-mitigation: Storage auto-resize is enabled (Cloud SQL feature). Range-partitioned tables shed old partitions monthly via the nightly-partition-prune job.

Operator handle: Verify partition prune is current; emergency: drop oldest click_events and search_queries partitions older than retention policy.

20. Ranking regression after an algorithm change

User impact: Subtle UX degradation — clicks per impression drops, conversion drops downstream.

Detection: Daily Looker report comparing CTR by query.canonicalHash cohort vs the prior 7-day baseline; > 10 % drop ⇒ P3 ticket.
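The comparison behind that report reduces to a relative CTR drop per cohort. The sketch below is an assumed formulation, not the Looker definition: inputs map each query.canonicalHash cohort to (clicks, impressions), and cohorts without data on either side are skipped.

```python
def ctr_regressions(baseline, today, max_drop=0.10):
    """Cohorts whose CTR fell by more than `max_drop` relative to the baseline."""
    flagged = {}
    for cohort, (base_clicks, base_impr) in baseline.items():
        clicks, impr = today.get(cohort, (0, 0))
        if base_clicks == 0 or base_impr == 0 or impr == 0:
            continue  # not enough data to compare
        base_ctr = base_clicks / base_impr
        ctr = clicks / impr
        drop = (base_ctr - ctr) / base_ctr
        if drop > max_drop:
            flagged[cohort] = round(drop, 4)
    return flagged
```

For example, a cohort moving from 100 clicks per 1 000 impressions to 8 clicks per 100 impressions has dropped from a CTR of 0.10 to 0.08, a relative drop of 20 %, and is flagged.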

Auto-mitigation: None automatic. The feature flag search.ranking.algorithm_version allows an operator to roll back a ranking change instantly.

Operator handle: Toggle flag back to prior algorithm; collect samples; revert via PR.

21. Time skew on consumer pods

User impact: Wrong occurredAt on emitted events ⇒ analytics drift, possibly stale-event misclassification.

Detection: clock_skew_seconds (NTP-synced offset, gauge) > 5 s ⇒ P3; > 30 s ⇒ P2.

Auto-mitigation: Cloud Run hosts are NTP-synced; service refuses to start if clock_skew_seconds > 30 at boot.

22. Catastrophic data loss (region-wide outage, both Postgres + OpenSearch)

User impact: Search down in the affected region.

Detection: Synthetic checks fail in the region; Cloud Status page red.

Auto-mitigation: External LB routes traffic to the surviving region's Cloud Run revision. Surviving region operates with its own (possibly slightly behind) projection.

Operator handle: DR runbook (per DEPLOYMENT_TOPOLOGY § 8). Restore from PITR + rebuild OpenSearch from BigQuery archive in the recovered region. Quarterly game day rehearses this scenario.

Summary table

#	Mode	Severity	Auto-mitigation	Runbook
1	OpenSearch down	P1	Postgres fallback	opensearch-degraded
2	Postgres down	P1	failover	postgres-failover (platform)
3	Redis down	P3	bypass	memorystore-recovery
4	Outbox stuck	P2	backoff	outbox-stuck
5	DLQ spike	P2	none	dlq-drain
6	Out-of-order events	n/a (handled)	vector-clock skip	n/a
7	AI down	P3	fallback	ai-orchestrator-degraded
8	Allow-list breach	P1 sec	strip + reject	allow-list-breach-ir
9	Query-log inference	P4	redact + anonymize	n/a
10	Boost-rule scope violation	P2/P3	reject + audit	n/a
11	Cursor forgery	P3	reject	n/a
12	Index lag	P2/P1	retry + sweep	index-lag-reconcile
13	Index rebuild failure	P3	abort safely	index-rebuild
14	Tenant purge incomplete	P2/P1	resumable cascade	tenant-purge
15	Hot-query DoS	P2	negative cache + Cloud Armor	n/a
16	Pub/Sub misconfig	P1	none (IaC)	n/a
17	JWKS fetch fail	P2	cache	n/a
18	Region mismatch	P3/P2	none	n/a
19	Disk pressure	P3/P2	auto-resize + prune	n/a
20	Ranking regression	P3	flag rollback	n/a
21	Clock skew	P3/P2	NTP + boot guard	n/a
22	Region-wide outage	P1	LB failover + DR	platform DR
