property-service — FAILURE_MODES

Companion: OBSERVABILITY · SYNC_CONTRACT · DEPLOYMENT_TOPOLOGY · API_CONTRACTS · ../../docs/standards/ERROR_CODES.md

This is the canonical failure catalog for property-service. Each entry lists the failure ID, what breaks, blast radius (consumer / tenant / backoffice), detection signal, mitigation / runbook, and the user-visible degradation. The table format is binding; tooling parses it for status-page mapping and on-call cards.

Severities: P0 = data loss / cross-tenant exposure / platform-wide outage. P1 = service down or major SLO breach. P2 = degraded but functional. P3 = cosmetic.


1. Catalog (summary)

# | Failure ID | Severity | Blast radius
F-01 | PG_PRIMARY_DOWN | P1 | Tenant: writes blocked; reads OK on replica
F-02 | PG_REPLICA_LAG_HIGH | P2 | Tenant: stale reads up to N seconds
F-03 | PG_RLS_POLICY_REGRESSION | P0 | Cross-tenant exposure
F-04 | OUTBOX_PUBLISHER_STALL | P2 | Downstream services drift; eventual consistency exceeded
F-05 | INBOX_DLQ_GROWTH | P2 | Tenant.deleted, housekeeping, maintenance signals delayed
F-06 | PUBSUB_REGIONAL_OUTAGE | P1 | Outbox grows; eventual consistency widens; no data loss
F-07 | MEMORYSTORE_DOWN | P2 | 2–3× read latency; SLO at risk
F-08 | SIGNED_URL_ISSUER_FAIL | P2 | New photo uploads blocked; existing photos served fine
F-09 | MEDIA_SCAN_TIMEOUT | P2 | New photos stuck uploaded; not surfaced publicly
F-10 | MEDIA_SCAN_INFECTED | per-row P3 | Single photo quarantined; operator notified
F-11 | AI_ORCHESTRATOR_UNAVAILABLE | P3 | AI suggestions unavailable; manual flows fine
F-12 | AI_OUTPUT_SCHEMA_MISMATCH | P2 | Specific capability disabled; alert on-call
F-13 | AI_QUOTA_EXHAUSTED (per tenant) | P3 | Tenant: AI suggestions throttled
F-14 | GEO_SERVICE_UNAVAILABLE | P2 | Address forms degrade to manual lat/lng entry
F-15 | IAM_JWKS_FETCH_FAIL | P1 | All authenticated routes 5xx until cache refreshes
F-16 | OPA_BUNDLE_STALE | P2 | Authorization decisions use last-good policy; alert
F-17 | BFF_SYNC_PUSH_STORM | P2 | Per-tenant write spike; backpressure kicks in
F-18 | SYNC_CURSOR_REGRESSION | P1 | Devices stuck or doing full resets
F-19 | CROSS_TENANT_REFERENCE_DETECTED | P0 | Writer aborted; alert fires; investigate
F-20 | IDEMPOTENCY_KEY_REPLAYED_DIFFERENT_BODY | P3 | Single client misuse; retried call rejected
F-21 | BULK_ROOM_CREATE_PARTIAL_FAILURE (transactional) | P3 | Atomic rollback; no partial state; client retries
F-22 | LOCK_DEVICE_BINDING_CONFLICT | P3 | Device id reuse rejected
F-23 | PROPERTY_PUBLISH_INVARIANT_VIOLATION | P3 | Publish rejected with explicit code; UI guides operator
F-24 | CLOUD_SQL_FAILOVER | P1 | 30–60 s of write blackout; reads OK
F-25 | MIGRATION_ROLLBACK_REQUIRED | P1 | Release rolled back; runbook executed
F-26 | AUDIT_MIRROR_LAG | P3 | audit-service mirror behind; local truth intact
F-27 | SEARCH_PROJECTION_LAG | P2 | Consumer meta layer shows stale property card briefly
F-28 | RESERVATION_PORT_TIMEOUT (room archive) | P3 | Archive request times out; safe, no data change
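
Since tooling parses the summary table, a small parser may help illustrate the row shape it would have to handle. This is a minimal sketch, assuming each row is pipe-delimited with four columns; `CatalogRow` and `parse_row` are hypothetical names and not the actual status-page tooling.

```python
import re
from dataclasses import dataclass


# Hypothetical row model; the real tooling's schema is not shown in this document.
@dataclass(frozen=True)
class CatalogRow:
    number: str        # e.g. "F-07"
    failure_id: str    # e.g. "MEMORYSTORE_DOWN"
    severity: str      # "P0".."P3"; qualifiers such as "per-row" are dropped
    blast_radius: str


_SEVERITY = re.compile(r"P[0-3]")


def parse_row(line: str) -> CatalogRow:
    """Parse one pipe-delimited catalog row (assumed format)."""
    cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
    if len(cells) != 4:
        raise ValueError(f"expected 4 columns, got {len(cells)}: {line!r}")
    number, failure_id, severity, blast_radius = cells
    match = _SEVERITY.search(severity)
    if match is None:
        raise ValueError(f"no severity found in {severity!r}")
    # Keep only the identifier; drop annotations like "(per tenant)".
    return CatalogRow(number, failure_id.split()[0], match.group(0), blast_radius)


row = parse_row(
    "F-10 | MEDIA_SCAN_INFECTED | per-row P3 | Single photo quarantined; operator notified"
)
```

Rejecting rows with an unexpected column count, rather than guessing, keeps a malformed catalog from silently producing wrong status-page mappings.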

2. Detail (one section per failure)

F-01 — PG_PRIMARY_DOWN

  • Detection. PropertyWriteAvailabilityBurn page; Cloud SQL HA event; database.connection.errors_total spike.
  • Effect. Writes return 503 MELMASTOON.GENERAL.SERVICE_DEGRADED. Reads continue via the replica, with data up to 2 s stale.
  • Mitigation. Cloud SQL HA auto-fails over within ~30–60 s. Service connection pool detects and reconnects.
  • User impact. Backoffice sees a non-blocking banner: "Saving temporarily unavailable, retrying…". Consumer meta is unaffected (read replica + cache).
  • Runbook. runbooks/property/pg-primary-down.md.

F-02 — PG_REPLICA_LAG_HIGH

  • Detection. Cloud SQL replica lag metric > 5 s for 5 min; user reports of "I just changed it but I don't see it".
  • Effect. Consumer meta and tenant booking surfaces may render slightly stale data.
  • Mitigation. Auto: route critical reads (operator console post-write) to primary via read_after_write flag for 30 s. Manual: investigate replication; consider scaling.
  • Runbook. runbooks/property/pg-replica-lag.md.
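
The automatic read_after_write mitigation can be pictured as a per-session routing decision. The sketch below is an illustration under assumptions: `ReadRouter`, the session-id keying, and the "primary"/"replica" labels are hypothetical; only the 30 s window comes from the mitigation text.

```python
import time
from typing import Dict, Optional

READ_AFTER_WRITE_WINDOW_S = 30.0  # window taken from the F-02 mitigation text


class ReadRouter:
    """Routes a session's reads to the primary for a short window after it writes."""

    def __init__(self, window_s: float = READ_AFTER_WRITE_WINDOW_S) -> None:
        self._window_s = window_s
        self._last_write: Dict[str, float] = {}  # session id -> time of last write

    def record_write(self, session_id: str, now: Optional[float] = None) -> None:
        self._last_write[session_id] = time.monotonic() if now is None else now

    def target_for_read(self, session_id: str, now: Optional[float] = None) -> str:
        now = time.monotonic() if now is None else now
        last = self._last_write.get(session_id)
        if last is not None and now - last < self._window_s:
            return "primary"   # the replica may not have caught up yet
        return "replica"


router = ReadRouter()
router.record_write("operator-session-1", now=100.0)
```

Only sessions that wrote recently pay the cost of hitting the primary, so the replica keeps absorbing the bulk of read traffic while lag is high.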

F-03 — PG_RLS_POLICY_REGRESSION

  • Detection. property_tenant_isolation_audit_failures_total > 0 (nightly + on-demand auditor).
  • Effect. Potential cross-tenant data exposure.
  • Mitigation (immediate). Pause traffic via the gateway kill switch; restore the prior schema migration; run a diff to identify any leaked rows; notify security; place the audit log under legal hold.
  • User impact. P0 incident. May require breach disclosure; security owns customer comms.
  • Runbook. runbooks/security/tenant-isolation-breach.md.

F-04 — OUTBOX_PUBLISHER_STALL

  • Detection. property_outbox_lag_seconds > 30 for 5 min.
  • Effect. Downstream search projection, BFF caches, and audit mirror lag.
  • Mitigation. Inspect publisher logs (outbox.publish failures); restart publisher worker; if Pub/Sub-side, see F-06.
  • User impact. Up to N minutes of stale property cards on consumer meta and tenant booking pages.
  • Runbook. runbooks/property/outbox-backlog.md.
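
Several detection signals in this catalog have the form "metric above a threshold for a sustained period" (here, property_outbox_lag_seconds > 30 for 5 min). In practice this is most likely expressed as a rule in the alerting system; the Python below only illustrates the semantics. `breached_for` is a hypothetical helper, and it assumes samples arrive sorted by timestamp.

```python
from typing import Sequence, Tuple


def breached_for(
    samples: Sequence[Tuple[float, float]],  # (timestamp_s, value), oldest first
    threshold: float,
    duration_s: float,
    now: float,
) -> bool:
    """True if the metric has stayed above `threshold` for at least `duration_s`."""
    streak_start = None
    # Walk back from the newest sample to find where the current breach began.
    for timestamp, value in reversed(samples):
        if value <= threshold:
            break
        streak_start = timestamp
    return streak_start is not None and streak_start <= now - duration_s


# One sample per minute; the lag stays at 45 s for the whole window.
steady = [(float(t), 45.0) for t in range(0, 301, 60)]
# Same series, but the lag dipped to 10 s at t = 120.
dipped = [(t, 10.0 if t == 120.0 else v) for t, v in steady]
```

A single dip below the threshold resets the streak, which avoids paging on a backlog that is already draining.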

F-05 — INBOX_DLQ_GROWTH

  • Detection. DLQ subscription depth > 50; property_inbox_processed_total{result="error"} rate sustained.
  • Effect. Auto-OOO from housekeeping or auto-RTS from maintenance can be missed.
  • Mitigation. Inspect DLQ message; common cause is schema-version skew → roll forward consumer; replay via replay-dlq admin script.
  • Runbook. runbooks/property/inbox-dlq.md.

F-06 — PUBSUB_REGIONAL_OUTAGE

  • Detection. Pub/Sub error spike across all topics; Cloud status board.
  • Effect. Outbox grows; reads + writes still succeed locally; downstream consistency widens.
  • Mitigation. Wait out; outbox absorbs. If duration > 1 h, switch publisher to a secondary regional topic per the Pub/Sub regional fallback runbook.
  • Runbook. runbooks/platform/pubsub-region-down.md.

F-07 — MEMORYSTORE_DOWN

  • Detection. Redis client errors; property_cache_total{result="miss"} ≈ 100 %.
  • Effect. Reads fall through to Postgres; latency 2–3× baseline; CPU pressure on Cloud SQL.
  • Mitigation. Restore Memorystore (auto in most cases); if extended, scale Cloud SQL temporarily.
  • Runbook. runbooks/property/memorystore-down.md.
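
The fall-through behaviour described for F-07 resembles a cache-aside read that tolerates cache errors. A minimal sketch follows, assuming a hypothetical `Cache` interface and a hypothetical `CacheUnavailable` error; the service's real cache client is not shown in this document.

```python
from typing import Callable, Optional, Protocol


class CacheUnavailable(Exception):
    """Hypothetical error raised by the cache client when the backend is down."""


class Cache(Protocol):
    def get(self, key: str) -> Optional[str]: ...
    def set(self, key: str, value: str) -> None: ...


def read_through(cache: Cache, key: str, load_from_db: Callable[[str], str]) -> str:
    """Serve from cache when possible; fall through to the database otherwise."""
    try:
        cached = cache.get(key)
        if cached is not None:
            return cached
    except CacheUnavailable:
        # Cache outage: answer from the database and skip the write-back.
        return load_from_db(key)
    value = load_from_db(key)
    try:
        cache.set(key, value)
    except CacheUnavailable:
        pass  # best effort; the read itself has already succeeded
    return value


class DownCache:
    """Stand-in for a cache client whose backend is unreachable."""

    def get(self, key: str) -> Optional[str]:
        raise CacheUnavailable(key)

    def set(self, key: str, value: str) -> None:
        raise CacheUnavailable(key)


value = read_through(DownCache(), "property:42", lambda key: f"row-for-{key}")
```

Every read then lands on Postgres, which is consistent with the 2–3× latency and Cloud SQL CPU pressure noted in the effect.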

F-08 — SIGNED_URL_ISSUER_FAIL

  • Detection. file-storage-service 5xx rate > 5 %.
  • Effect. New photo uploads blocked. Existing photos unaffected.
  • Mitigation. Coordinate with file-storage-service on-call; meanwhile UI surfaces "Photo upload temporarily unavailable" without blocking other operations.
  • Runbook. runbooks/file-storage/issuer-down.md.

F-09 — MEDIA_SCAN_TIMEOUT

  • Detection. property_photo_pipeline_seconds{stage="ready"} p95 > 60 s.
  • Effect. Photos remain in the uploaded state longer than usual and are not yet visible.
  • Mitigation. Investigate file-storage-service scanner; no client action required; the photo will transition once the scan event arrives.
  • Runbook. runbooks/property/photo-pipeline.md.

F-10 — MEDIA_SCAN_INFECTED

  • Detection. media.asset.scanned.v1 with verdict infected.
  • Effect. Photo is set to quarantined; never made visible.
  • Mitigation. Operator notified via in-app message; can re-upload a clean file.

F-11 — AI_ORCHESTRATOR_UNAVAILABLE

  • Detection. property_ai_runs_total{result="error"} spike; orchestrator health check.
  • Effect. "AI suggest" buttons disabled; manual flows continue.
  • Mitigation. Wait out; orchestrator owns recovery.

F-12 — AI_OUTPUT_SCHEMA_MISMATCH

  • Detection. property_ai_schema_violations_total > 0.
  • Effect. Specific capability rejected; suggestion not staged.
  • Mitigation. Disable capability via tenant-config flag; coordinate with orchestrator team.

F-13 — AI_QUOTA_EXHAUSTED

  • Detection. 429 MELMASTOON.AI.QUOTA_EXHAUSTED rate.
  • Effect. Tenant cannot generate further suggestions until the quota resets or is upgraded.
  • Mitigation. UX shows quota meter + "Try again at HH:MM" guidance.

F-14 — GEO_SERVICE_UNAVAILABLE

  • Detection. geo-service 5xx rate.
  • Effect. Geocoding and reverse geocoding fail; the address form falls back to manual lat/lng entry, with an AI fallback (per AI_INTEGRATION §7, gated on operator opt-in).
  • Mitigation. Wait; if the outage is extended, use an alternate provider behind geo-service.

F-15 — IAM_JWKS_FETCH_FAIL

  • Detection. Token verification failures; cache age past TTL.
  • Effect. Once the cached JWKS is no longer usable, all authenticated routes return 401 MELMASTOON.IAM.UNAUTHENTICATED until a refresh succeeds.
  • Mitigation. Service holds last-good JWKS for 24 h; alert if refresh fails > 15 min; coordinate with iam-service on-call.
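
The last-good JWKS behaviour can be sketched as a small cache that survives failed refreshes. This is an illustration under assumptions: `JwksCache` and `flaky_fetch` are hypothetical, and only the 24 h hold and the 15 min alert threshold come from the mitigation text.

```python
from dataclasses import dataclass
from typing import Callable, Optional

LAST_GOOD_MAX_AGE_S = 24 * 3600  # hold window from the F-15 mitigation
ALERT_AFTER_S = 15 * 60          # alert threshold from the F-15 mitigation


@dataclass
class JwksCache:
    fetch: Callable[[], dict]              # hypothetical fetcher; raises on failure
    keys: Optional[dict] = None
    fetched_at: Optional[float] = None
    failing_since: Optional[float] = None

    def refresh(self, now: float) -> None:
        try:
            fresh = self.fetch()
        except Exception:
            # Keep the last-good keys; remember when refreshes started failing.
            if self.failing_since is None:
                self.failing_since = now
            return
        self.keys, self.fetched_at, self.failing_since = fresh, now, None

    def current_keys(self, now: float) -> Optional[dict]:
        """Last-good keys, or None once they are older than the hold window."""
        if self.keys is None or self.fetched_at is None:
            return None
        if now - self.fetched_at > LAST_GOOD_MAX_AGE_S:
            return None
        return self.keys

    def should_alert(self, now: float) -> bool:
        return self.failing_since is not None and now - self.failing_since > ALERT_AFTER_S


_responses = iter([{"keys": ["k1"]}])


def flaky_fetch() -> dict:
    """First call succeeds; every later call simulates an unreachable issuer."""
    try:
        return next(_responses)
    except StopIteration:
        raise ConnectionError("iam-service unreachable")


cache = JwksCache(fetch=flaky_fetch)
cache.refresh(now=0.0)     # succeeds
cache.refresh(now=600.0)   # fails; last-good keys are retained
```

Separating "which keys to verify with" from "when to alert" lets token verification keep working long after on-call has been notified.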

F-16 — OPA_BUNDLE_STALE

  • Detection. Bundle age > 30 min; bundle fetch errors.
  • Effect. Authorization uses cached policy; new role grants delayed.
  • Mitigation. Investigate iam-service bundle endpoint; force re-fetch.

F-17 — BFF_SYNC_PUSH_STORM

  • Detection. Per-tenant push rate > baseline ×10.
  • Effect. Service applies backpressure (429, Retry-After); some operations queued client-side.
  • Mitigation. Investigate cause (often a misbehaving device or a re-sync after a long outage); throttle device via admin API if malicious.
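
One common way to produce the 429 plus Retry-After backpressure described above is a per-tenant token bucket. The sketch below assumes that approach; the document does not say which limiter the service uses, and the rate, capacity, `handle_push` handler, and 202 success status are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class TokenBucket:
    rate_per_s: float   # refill rate (hypothetical tuning value)
    capacity: float     # burst size (hypothetical tuning value)
    tokens: float
    updated_at: float

    def try_acquire(self, now: float) -> int:
        """Return 0 if admitted, otherwise the Retry-After value in whole seconds."""
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate_per_s)
        self.updated_at = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return 0
        # Time until one full token is available, rounded up.
        return math.ceil((1.0 - self.tokens) / self.rate_per_s)


def handle_push(
    buckets: Dict[str, TokenBucket], tenant_id: str, now: float
) -> Tuple[int, Dict[str, str]]:
    """Admit or reject one sync push for a tenant (status code, headers)."""
    bucket = buckets.setdefault(tenant_id, TokenBucket(5.0, 10.0, 10.0, now))
    retry_after = bucket.try_acquire(now)
    if retry_after:
        return 429, {"Retry-After": str(retry_after)}
    return 202, {}


buckets: Dict[str, TokenBucket] = {}
results = [handle_push(buckets, "tenant-a", now=0.0) for _ in range(11)]
```

Keying the buckets by tenant keeps one misbehaving device fleet from throttling other tenants.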

F-18 — SYNC_CURSOR_REGRESSION

  • Detection. Devices report mass conflicts after a deploy (rare).
  • Effect. Device may trigger a full reset (bandwidth spike, brief unavailability of offline data).
  • Mitigation. Roll back the suspect deploy; coordinate with desktop team; do not patch cursors in place.

F-19 — CROSS_TENANT_REFERENCE_DETECTED

  • Detection. Aggregate constructor or outbox writer raises CrossTenantReferenceError; counter spikes.
  • Effect. Single write aborted; alarm fires.
  • Mitigation. Treat as P0 unless reproducibly attributable to a known data fixture; investigate the request chain; review recent code paths.
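
The guard that raises CrossTenantReferenceError can be sketched as a check run before any write proceeds. Only the error name comes from this catalog; `Ref`, `assert_same_tenant`, and the field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


class CrossTenantReferenceError(Exception):
    """Raised when an aggregate references an entity owned by another tenant."""


@dataclass(frozen=True)
class Ref:
    tenant_id: str
    entity_id: str


def assert_same_tenant(owner_tenant_id: str, refs: Iterable[Ref]) -> None:
    """Fail closed: abort on a missing tenant context or any foreign reference."""
    if not owner_tenant_id:
        raise CrossTenantReferenceError("missing tenant context")
    for ref in refs:
        if ref.tenant_id != owner_tenant_id:
            raise CrossTenantReferenceError(
                f"{ref.entity_id} belongs to tenant {ref.tenant_id!r}, "
                f"not {owner_tenant_id!r}"
            )


def is_rejected(owner_tenant_id: str, refs: Iterable[Ref]) -> bool:
    try:
        assert_same_tenant(owner_tenant_id, refs)
    except CrossTenantReferenceError:
        return True
    return False


same = [Ref("tenant-a", "room-1"), Ref("tenant-a", "photo-9")]
mixed = [Ref("tenant-a", "room-1"), Ref("tenant-b", "photo-9")]
```

Treating an empty tenant context as a violation matches the fail-closed stance on multi-tenant safety stated under the cross-cutting patterns.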

F-20 — IDEMPOTENCY_KEY_REPLAYED_DIFFERENT_BODY

  • Detection. MELMASTOON.GENERAL.IDEMPOTENCY_KEY_REPLAYED_DIFFERENT_BODY rate spikes for one client.
  • Effect. Client retries fail safely; no data corruption.
  • Mitigation. Notify the client team; usually a code bug in the client.
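
The replay check behind this failure can be illustrated by storing a digest of the request body alongside each idempotency key. This is a sketch with assumptions: `IdempotencyStore`, the in-memory dict, and the choice of SHA-256 over canonical JSON are hypothetical; only the error code comes from this catalog.

```python
import hashlib
import json
from typing import Any, Callable, Dict, Tuple

ERROR_CODE = "MELMASTOON.GENERAL.IDEMPOTENCY_KEY_REPLAYED_DIFFERENT_BODY"


class IdempotencyConflict(Exception):
    """The same key was replayed with a different request body."""


class IdempotencyStore:
    def __init__(self) -> None:
        # key -> (body digest, stored response); in-memory for illustration only
        self._seen: Dict[str, Tuple[str, Any]] = {}

    def execute(self, key: str, body: dict, handler: Callable[[dict], Any]) -> Any:
        canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
        digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
        if key in self._seen:
            stored_digest, response = self._seen[key]
            if stored_digest != digest:
                raise IdempotencyConflict(ERROR_CODE)
            return response  # identical replay: return the original result
        response = handler(body)
        self._seen[key] = (digest, response)
        return response


calls = []


def create_room(body: dict) -> dict:
    calls.append(body)
    return {"id": f"room-{len(calls)}", "number": body["number"]}


store = IdempotencyStore()
first = store.execute("key-1", {"number": "101"}, create_room)
replay = store.execute("key-1", {"number": "101"}, create_room)
```

Serializing with sorted keys means two bodies that differ only in field order are treated as the same request.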

F-21 — BULK_ROOM_CREATE_PARTIAL_FAILURE (transactional, no partial state)

  • Detection. POST /properties/:id/rooms/bulk 5xx.
  • Effect. Whole batch rolled back; no rooms created.
  • Mitigation. Client retries with the same idempotency key after fixing root cause (typically a duplicate room number).
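
The all-or-nothing behaviour of the bulk endpoint can be demonstrated with a single transaction around the batch. The sketch uses SQLite from the Python standard library as a stand-in for Postgres; the `rooms` table, its columns, and `bulk_create_rooms` are hypothetical.

```python
import sqlite3
from typing import Sequence


def bulk_create_rooms(
    conn: sqlite3.Connection, property_id: str, room_numbers: Sequence[str]
) -> None:
    """Insert every room or none of them."""
    # Using the connection as a context manager commits on success and
    # rolls back the whole transaction if any statement raises.
    with conn:
        conn.executemany(
            "INSERT INTO rooms (property_id, room_number) VALUES (?, ?)",
            [(property_id, number) for number in room_numbers],
        )


def room_count(conn: sqlite3.Connection) -> int:
    return conn.execute("SELECT COUNT(*) FROM rooms").fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE rooms ("
    " property_id TEXT NOT NULL,"
    " room_number TEXT NOT NULL,"
    " UNIQUE (property_id, room_number))"
)

try:
    # The duplicate "101" violates the unique constraint part-way through.
    bulk_create_rooms(conn, "prop-1", ["101", "102", "101"])
except sqlite3.IntegrityError:
    pass
count_after_failure = room_count(conn)

bulk_create_rooms(conn, "prop-1", ["101", "102"])
count_after_retry = room_count(conn)
```

Because the failed batch leaves nothing behind, the client can fix the duplicate and resend the whole batch without first checking which rooms exist.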

F-22 — LOCK_DEVICE_BINDING_CONFLICT

  • Detection. lock-integration-service event for lock_device_id already bound to another room.
  • Effect. Binding rejected; lock-integration retains "unbound" view of the device.
  • Mitigation. Operator unbinds the conflicting room first.

F-23 — PROPERTY_PUBLISH_INVARIANT_VIOLATION

  • Detection. Publish endpoint returns MELMASTOON.PROPERTY.NO_ACTIVE_ROOMS_TO_PUBLISH / MISSING_DEFAULT_LOCALE_TRANSLATION / MISSING_HERO_PHOTO / MISSING_GEO.
  • Effect. Operator sees a guided checklist UI; not a service incident.
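
The publish preconditions can be sketched as a function that returns every violated code at once, which is what a guided checklist would need. `PropertyDraft` and its fields are hypothetical, and the sketch assumes all four codes share the MELMASTOON.PROPERTY prefix.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

PREFIX = "MELMASTOON.PROPERTY."  # assumed to apply to all four codes


@dataclass
class PropertyDraft:  # hypothetical shape, for illustration only
    active_room_count: int
    default_locale: str
    translations: Dict[str, str] = field(default_factory=dict)  # locale -> title
    hero_photo_id: Optional[str] = None
    lat: Optional[float] = None
    lng: Optional[float] = None


def publish_violations(draft: PropertyDraft) -> List[str]:
    """Return the error code of every publish invariant the draft violates."""
    checks = [
        (draft.active_room_count > 0, "NO_ACTIVE_ROOMS_TO_PUBLISH"),
        (bool(draft.translations.get(draft.default_locale)),
         "MISSING_DEFAULT_LOCALE_TRANSLATION"),
        (draft.hero_photo_id is not None, "MISSING_HERO_PHOTO"),
        (draft.lat is not None and draft.lng is not None, "MISSING_GEO"),
    ]
    return [PREFIX + code for ok, code in checks if not ok]


incomplete = PropertyDraft(
    active_room_count=0, default_locale="en", translations={"fr": "Maison"}
)
complete = PropertyDraft(
    active_room_count=3,
    default_locale="en",
    translations={"en": "Harbour House"},
    hero_photo_id="photo-1",
    lat=34.5,
    lng=69.2,
)
```

Collecting all violations instead of stopping at the first lets the operator fix everything in one pass.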

F-24 — CLOUD_SQL_FAILOVER

  • Detection. Cloud SQL HA event; brief write 5xx.
  • Effect. 30–60 s of write blackout; reads continue.
  • Mitigation. Auto.

F-25 — MIGRATION_ROLLBACK_REQUIRED

  • Detection. Migration job failure; smoke test failure post-deploy.
  • Effect. Cloud Deploy auto-rollback; service held on prior revision.
  • Mitigation. Apply paired down.sql; analyze; reissue migration following expand → backfill → contract.

F-26 — AUDIT_MIRROR_LAG

  • Detection. audit-service consumer lag.
  • Effect. Compliance UIs show recent actions late; local audit table remains source of truth.
  • Mitigation. Investigate audit consumer; non-blocking.

F-27 — SEARCH_PROJECTION_LAG

  • Detection. Echo metric from search-aggregation-service reports projection age.
  • Effect. Consumer meta layer shows older property card for up to N minutes.
  • Mitigation. Often a Pub/Sub or projection consumer issue; coordinate with search team.

F-28 — RESERVATION_PORT_TIMEOUT

  • Detection. reservation-service RPC timeout on archive precondition check.
  • Effect. Archive request fails fast (504 MELMASTOON.GENERAL.UPSTREAM_TIMEOUT); no data change.
  • Mitigation. Operator retries; no risk of orphan archives.
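
The reason a timeout here is safe is ordering: the precondition check runs before any state change. A minimal asyncio sketch of that ordering follows; `archive_room`, the callables it takes, the 2 s default timeout, and the 409 code are hypothetical, while the 504 code comes from the effect above.

```python
import asyncio
from typing import Awaitable, Callable, List, Tuple


async def archive_room(
    room_id: str,
    has_active_reservations: Callable[[str], Awaitable[bool]],
    do_archive: Callable[[str], Awaitable[None]],
    timeout_s: float = 2.0,  # hypothetical default
) -> Tuple[int, str]:
    try:
        # Precondition check first; nothing has been written at this point.
        blocked = await asyncio.wait_for(
            has_active_reservations(room_id), timeout=timeout_s
        )
    except asyncio.TimeoutError:
        return 504, "MELMASTOON.GENERAL.UPSTREAM_TIMEOUT"
    if blocked:
        return 409, "ROOM_HAS_ACTIVE_RESERVATIONS"  # hypothetical code
    await do_archive(room_id)
    return 200, "OK"


archived: List[str] = []


async def slow_check(room_id: str) -> bool:
    await asyncio.sleep(1.0)  # simulates a reservation-service call that hangs
    return False


async def record_archive(room_id: str) -> None:
    archived.append(room_id)


result = asyncio.run(
    archive_room("room-1", slow_check, record_archive, timeout_s=0.01)
)
```

Since the archive step is never reached on timeout, a retry starts from the same state as the first attempt.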

3. Cross-Cutting Patterns

  • Fail-closed on multi-tenant safety. Any uncertainty about tenant context aborts the operation.
  • Fail-open on AI. AI is non-essential to write paths; failure degrades UX, never blocks publish.
  • No silent retries on writes. Service emits explicit Retry-After and surfaces error codes; the client decides.
  • Graceful degradation for caches and projections. Staleness is acceptable as long as it is surfaced; never fabricate data.

Each entry above resolves to a runbook in runbooks/property/* (or runbooks/security/* for tenant-isolation-class incidents, runbooks/platform/* for F-06, and runbooks/file-storage/* for F-08). Alerts in OBSERVABILITY link to those same runbooks.