property-service — FAILURE_MODES
Companion: OBSERVABILITY · SYNC_CONTRACT · DEPLOYMENT_TOPOLOGY · API_CONTRACTS · ../../docs/standards/ERROR_CODES.md
This is the canonical failure catalog for property-service. Each entry: failure ID, what breaks, blast radius (consumer / tenant / backoffice), detection signal, mitigation / runbook, and the user-visible degradation. The table format is binding; tooling parses it for status-page mapping and on-call cards.
Severities: P0 = data loss / cross-tenant exposure / platform-wide outage. P1 = service down or major SLO breach. P2 = degraded but functional. P3 = cosmetic.
1. Catalog (summary)
| # | Failure ID | Severity | Blast radius |
|---|---|---|---|
| F-01 | PG_PRIMARY_DOWN | P1 | Tenant: writes blocked; reads OK on replica |
| F-02 | PG_REPLICA_LAG_HIGH | P2 | Tenant: stale reads up to N seconds |
| F-03 | PG_RLS_POLICY_REGRESSION | P0 | Cross-tenant exposure |
| F-04 | OUTBOX_PUBLISHER_STALL | P2 | Downstream services drift; eventual consistency exceeded |
| F-05 | INBOX_DLQ_GROWTH | P2 | Tenant.deleted, housekeeping, maintenance signals delayed |
| F-06 | PUBSUB_REGIONAL_OUTAGE | P1 | Outbox grows; eventual consistency widens; no data loss |
| F-07 | MEMORYSTORE_DOWN | P2 | 2–3× read latency; SLO at risk |
| F-08 | SIGNED_URL_ISSUER_FAIL | P2 | New photo uploads blocked; existing photos served fine |
| F-09 | MEDIA_SCAN_TIMEOUT | P2 | New photos stuck uploaded; not surfaced publicly |
| F-10 | MEDIA_SCAN_INFECTED | per-row P3 | Single photo quarantined; operator notified |
| F-11 | AI_ORCHESTRATOR_UNAVAILABLE | P3 | AI suggestions unavailable; manual flows fine |
| F-12 | AI_OUTPUT_SCHEMA_MISMATCH | P2 | Specific capability disabled; alert on-call |
| F-13 | AI_QUOTA_EXHAUSTED (per tenant) | P3 | Tenant: AI suggestions throttled |
| F-14 | GEO_SERVICE_UNAVAILABLE | P2 | Address forms degrade to manual lat/lng entry |
| F-15 | IAM_JWKS_FETCH_FAIL | P1 | All authenticated routes return 401 until cache refreshes |
| F-16 | OPA_BUNDLE_STALE | P2 | Authorization decisions use last-good policy; alert |
| F-17 | BFF_SYNC_PUSH_STORM | P2 | Per-tenant write spike; backpressure kicks in |
| F-18 | SYNC_CURSOR_REGRESSION | P1 | Devices stuck or doing full resets |
| F-19 | CROSS_TENANT_REFERENCE_DETECTED | P0 | Writer aborted; alert fires; investigate |
| F-20 | IDEMPOTENCY_KEY_REPLAYED_DIFFERENT_BODY | P3 | Single client misuse; retried call rejected |
| F-21 | BULK_ROOM_CREATE_PARTIAL_FAILURE (transactional) | P3 | Atomic rollback; no partial state; client retries |
| F-22 | LOCK_DEVICE_BINDING_CONFLICT | P3 | Device id reuse rejected |
| F-23 | PROPERTY_PUBLISH_INVARIANT_VIOLATION | P3 | Publish rejected with explicit code; UI guides operator |
| F-24 | CLOUD_SQL_FAILOVER | P1 | 30–60 s of write blackout; reads OK |
| F-25 | MIGRATION_ROLLBACK_REQUIRED | P1 | Release rolled back; runbook executed |
| F-26 | AUDIT_MIRROR_LAG | P3 | audit-service mirror behind; local truth intact |
| F-27 | SEARCH_PROJECTION_LAG | P2 | Consumer meta layer shows stale property card briefly |
| F-28 | RESERVATION_PORT_TIMEOUT (room archive) | P3 | Archive request times out; safe — no data change |
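Since tooling parses the summary table, the row format can be read roughly as follows. This is a minimal sketch, not the actual tooling: `CatalogRow` and `parse_catalog` are hypothetical names, and the code assumes the four-column layout shown above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogRow:
    number: str
    failure_id: str
    severity: str
    blast_radius: str


def parse_catalog(markdown: str) -> list:
    """Parse pipe-table rows into CatalogRow records (hypothetical helper)."""
    rows = []
    for line in markdown.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if len(cells) != 4:
            continue
        # Skip the header row and the |---|---| separator row.
        if cells[0] == "#" or set(cells[0]) <= {"-", ":"}:
            continue
        rows.append(CatalogRow(*cells))
    return rows


SAMPLE = """\
| # | Failure ID | Severity | Blast radius |
|---|---|---|---|
| F-03 | PG_RLS_POLICY_REGRESSION | P0 | Cross-tenant exposure |
| F-10 | MEDIA_SCAN_INFECTED | per-row P3 | Single photo quarantined; operator notified |
"""
```

A parser like this breaks if a cell ever contains a literal `|`, which is one reason to keep the table format fixed.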
2. Detail (one section per failure)
F-01 — PG_PRIMARY_DOWN
- Detection. `PropertyWriteAvailabilityBurn` page; Cloud SQL HA event; `database.connection.errors_total` spike.
- Effect. Writes return `503 MELMASTOON.GENERAL.SERVICE_DEGRADED`. Reads continue via replica with data stale by up to 2 s.
- Mitigation. Cloud SQL HA auto-fails over within ~30–60 s. Service connection pool detects and reconnects.
- User impact. Backoffice sees a non-blocking banner: "Saving temporarily unavailable, retrying…". Consumer meta is unaffected (read replica + cache).
- Runbook. runbooks/property/pg-primary-down.md.
F-02 — PG_REPLICA_LAG_HIGH
- Detection. Cloud SQL replica lag metric > 5 s for 5 min; user reports of "I just changed it but I don't see it".
- Effect. Consumer meta and tenant booking surfaces may render slightly stale data.
- Mitigation. Auto: route critical reads (operator console post-write) to primary via `read_after_write` flag for 30 s. Manual: investigate replication; consider scaling.
- Runbook. runbooks/property/pg-replica-lag.md.
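The automatic mitigation for F-02 can be pictured as a small router that pins a session's reads to the primary for 30 s after that session writes. This is a sketch under assumptions: `ReadRouter`, the session-id key, and the injected clock are hypothetical and not taken from the service code.

```python
import time

READ_AFTER_WRITE_WINDOW_S = 30.0  # window stated in the mitigation above


class ReadRouter:
    """Chooses 'primary' or 'replica' for a read (hypothetical helper)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._last_write = {}  # session id -> time of the most recent write

    def record_write(self, session_id):
        self._last_write[session_id] = self._clock()

    def target_for_read(self, session_id):
        last = self._last_write.get(session_id)
        if last is not None and self._clock() - last < READ_AFTER_WRITE_WINDOW_S:
            return "primary"  # the replica may not have caught up yet
        return "replica"
```

Keying the window on the writer rather than on the tenant keeps the extra primary load proportional to write activity.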
F-03 — PG_RLS_POLICY_REGRESSION
- Detection. `property_tenant_isolation_audit_failures_total > 0` (nightly + on-demand auditor).
- Effect. Potential cross-tenant data exposure.
- Mitigation. Immediate. Pause traffic via gateway kill switch; restore prior schema migration; run diff to identify any leaked rows; notify security; legal-hold the audit log.
- User impact. P0 incident. May require breach disclosure; security owns customer comms.
- Runbook. runbooks/security/tenant-isolation-breach.md.
F-04 — OUTBOX_PUBLISHER_STALL
- Detection. `property_outbox_lag_seconds > 30` for 5 min.
- Effect. Downstream search projection, BFF caches, and audit mirror lag.
- Mitigation. Inspect publisher logs (`outbox.publish` failures); restart publisher worker; if Pub/Sub-side, see F-06.
- User impact. Up to N minutes of stale property cards on consumer meta and tenant booking pages.
- Runbook. runbooks/property/outbox-backlog.md.
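Several detection signals in this catalog take the form "metric above a threshold for a sustained period", such as `property_outbox_lag_seconds > 30` for 5 min. The sketch below shows that evaluation logic in isolation. `SustainedThreshold` is a hypothetical name, and the real rule presumably lives in the alerting stack described in OBSERVABILITY rather than in service code.

```python
class SustainedThreshold:
    """Fires once a value has stayed above a threshold for hold_seconds."""

    def __init__(self, threshold, hold_seconds):
        self.threshold = threshold
        self.hold_seconds = hold_seconds
        self._breach_started = None  # time the current breach began

    def observe(self, value, now):
        if value <= self.threshold:
            self._breach_started = None  # any recovery resets the timer
            return False
        if self._breach_started is None:
            self._breach_started = now
        return now - self._breach_started >= self.hold_seconds


# Outbox lag rule from F-04: > 30 s of lag, sustained for 5 min.
outbox_rule = SustainedThreshold(threshold=30, hold_seconds=300)
```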
F-05 — INBOX_DLQ_GROWTH
- Detection. DLQ subscription depth > 50; `property_inbox_processed_total{result="error"}` rate sustained.
- Effect. Auto-OOO from housekeeping or auto-RTS from maintenance can be missed.
- Mitigation. Inspect DLQ message; common cause is schema-version skew → roll forward consumer; replay via `replay-dlq` admin script.
- Runbook. runbooks/property/inbox-dlq.md.
F-06 — PUBSUB_REGIONAL_OUTAGE
- Detection. Pub/Sub error spike across all topics; Cloud status board.
- Effect. Outbox grows; reads + writes still succeed locally; downstream consistency widens.
- Mitigation. Wait out; outbox absorbs. If duration > 1 h, switch publisher to a secondary regional topic per the Pub/Sub regional fallback runbook.
- Runbook. runbooks/platform/pubsub-region-down.md.
F-07 — MEMORYSTORE_DOWN
- Detection. Redis client errors; `property_cache_total{result="miss"}` ≈ 100 %.
- Effect. Reads fall through to Postgres; latency 2–3× baseline; CPU pressure on Cloud SQL.
- Mitigation. Restore Memorystore (auto in most cases); if extended, scale Cloud SQL temporarily.
- Runbook. runbooks/property/memorystore-down.md.
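The fall-through behaviour in F-07 amounts to treating a cache outage like a cache miss. A minimal sketch, assuming injected `cache_get` and `db_get` callables (hypothetical) and using the builtin `ConnectionError` as a stand-in for whatever exception the Redis client actually raises:

```python
def read_property(property_id, cache_get, db_get):
    """Return (value, source); a cache outage is handled like a miss."""
    try:
        cached = cache_get(property_id)
    except ConnectionError:
        cached = None  # Memorystore unreachable: fall through to Postgres
    if cached is not None:
        return cached, "cache"
    return db_get(property_id), "postgres"
```

Every request takes the Postgres path while the cache is down, which is where the 2–3× latency and the Cloud SQL CPU pressure come from.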
F-08 — SIGNED_URL_ISSUER_FAIL
- Detection. `file-storage-service` 5xx rate > 5 %.
- Effect. New photo uploads blocked. Existing photos unaffected.
- Mitigation. Coordinate with `file-storage-service` on-call; meanwhile UI surfaces "Photo upload temporarily unavailable" without blocking other operations.
- Runbook. runbooks/file-storage/issuer-down.md.
F-09 — MEDIA_SCAN_TIMEOUT
- Detection. `property_photo_pipeline_seconds{stage="ready"}` p95 > 60 s.
- Effect. Photos stay `uploaded` longer; not yet visible.
- Mitigation. Investigate `file-storage-service` scanner; no client action required; the photo will transition once the scan event arrives.
- Runbook. runbooks/property/photo-pipeline.md.
F-10 — MEDIA_SCAN_INFECTED
- Detection. `media.asset.scanned.v1` with verdict `infected`.
- Effect. Photo set `quarantined`; never made visible.
- Mitigation. Operator notified via in-app message; can re-upload a clean file.
F-11 — AI_ORCHESTRATOR_UNAVAILABLE
- Detection. `property_ai_runs_total{result="error"}` spike; orchestrator health check.
- Effect. "AI suggest" buttons disabled; manual flows continue.
- Mitigation. Wait out; orchestrator owns recovery.
F-12 — AI_OUTPUT_SCHEMA_MISMATCH
- Detection. `property_ai_schema_violations_total > 0`.
- Effect. Specific capability rejected; suggestion not staged.
- Mitigation. Disable capability via tenant-config flag; coordinate with orchestrator team.
F-13 — AI_QUOTA_EXHAUSTED
- Detection. `429 MELMASTOON.AI.QUOTA_EXHAUSTED` rate.
- Effect. Tenant cannot generate further suggestions until the quota resets or is upgraded.
- Mitigation. UX shows quota meter + "Try again at HH:MM" guidance.
F-14 — GEO_SERVICE_UNAVAILABLE
- Detection. `geo-service` 5xx rate.
- Effect. Geocoding/reverse-geocoding fails; address form falls back to manual lat/lng + AI fallback (per AI_INTEGRATION §7, gated on operator opt-in).
- Mitigation. Wait; alternate provider behind `geo-service` if extended.
F-15 — IAM_JWKS_FETCH_FAIL
- Detection. Token verification failures; cache age past TTL.
- Effect. All authenticated routes return `401 MELMASTOON.IAM.UNAUTHENTICATED`.
- Mitigation. Service holds last-good JWKS for 24 h; alert if refresh fails > 15 min; coordinate with `iam-service` on-call.
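The F-15 mitigation combines two timers: a 24 h hold on the last-good key set and a 15 min alert threshold on refresh failures. The sketch below illustrates how the two interact. `JwksCache`, the injected `fetch` callable, and the `LookupError` are assumptions made for illustration, not the service's actual implementation.

```python
import time


class JwksCache:
    HOLD_SECONDS = 24 * 3600       # last-good keys stay usable for 24 h
    ALERT_AFTER_SECONDS = 15 * 60  # alert once refresh has failed for > 15 min

    def __init__(self, fetch, clock=time.monotonic):
        self._fetch = fetch
        self._clock = clock
        self._keys = None
        self._fetched_at = None
        self._failing_since = None

    def refresh(self):
        """Try to fetch fresh keys; keep the last-good set on failure."""
        try:
            keys = self._fetch()
        except Exception:
            if self._failing_since is None:
                self._failing_since = self._clock()
            return False
        self._keys = keys
        self._fetched_at = self._clock()
        self._failing_since = None
        return True

    def keys(self):
        """Return last-good keys, or raise once they exceed the hold window."""
        if self._keys is None or self._clock() - self._fetched_at > self.HOLD_SECONDS:
            raise LookupError("no usable JWKS")
        return self._keys

    def should_alert(self):
        return (self._failing_since is not None
                and self._clock() - self._failing_since > self.ALERT_AFTER_SECONDS)
```

With these numbers, on-call has roughly a day between the first alert and the point where token verification can no longer proceed.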
F-16 — OPA_BUNDLE_STALE
- Detection. Bundle age > 30 min; bundle fetch errors.
- Effect. Authorization uses cached policy; new role grants delayed.
- Mitigation. Investigate `iam-service` bundle endpoint; force re-fetch.
F-17 — BFF_SYNC_PUSH_STORM
- Detection. Per-tenant push rate > baseline ×10.
- Effect. Service applies backpressure (`429`, `Retry-After`); some operations queued client-side.
- Mitigation. Investigate cause (often a misbehaving device or a re-sync after a long outage); throttle device via admin API if malicious.
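One common way to produce the `429` plus `Retry-After` backpressure described for F-17 is a per-tenant token bucket. The source does not say which algorithm the service uses, so treat this as an illustrative sketch. `TenantBucket`, its rate, and its burst size are hypothetical.

```python
import math


class TenantBucket:
    """Token bucket for one tenant's sync pushes (hypothetical)."""

    def __init__(self, rate_per_s, burst):
        self.rate = rate_per_s
        self.burst = burst
        self.tokens = float(burst)
        self.updated = None  # time of the last refill

    def try_push(self, now):
        """Return (status, headers) for a push arriving at time `now`."""
        if self.updated is not None:
            refill = (now - self.updated) * self.rate
            self.tokens = min(self.burst, self.tokens + refill)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return 200, {}
        # Tell the client how long until one token is available again.
        wait_s = math.ceil((1 - self.tokens) / self.rate)
        return 429, {"Retry-After": str(wait_s)}
```

The server never queues or retries the rejected write itself. The client decides when to resend, which matches the "no silent retries on writes" pattern in section 3.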
F-18 — SYNC_CURSOR_REGRESSION
- Detection. Devices report mass conflicts after a deploy; rare.
- Effect. Device may trigger a full reset (bandwidth spike, brief unavailability of offline data).
- Mitigation. Roll back the suspect deploy; coordinate with desktop team; do not patch cursors in place.
F-19 — CROSS_TENANT_REFERENCE_DETECTED
- Detection. Aggregate constructor or outbox writer raises `CrossTenantReferenceError`; counter spikes.
- Effect. Single write aborted; alarm fires.
- Mitigation. Treat as P0 unless reproducibly attributable to a known data fixture; investigate the request chain; review recent code paths.
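The guard behind F-19 can be sketched as a check that every referenced row belongs to the writing tenant, failing closed when tenant context is missing. Only the exception name `CrossTenantReferenceError` comes from this document. The function name and the shape of `references` are assumptions.

```python
class CrossTenantReferenceError(Exception):
    """Raised when an aggregate references a row owned by another tenant."""


def assert_same_tenant(tenant_id, references):
    """Abort unless every reference is owned by tenant_id.

    `references` maps a reference name to the id of the owning tenant.
    """
    if not tenant_id:
        # Fail closed: uncertainty about tenant context aborts the write.
        raise CrossTenantReferenceError("missing tenant context")
    for name, owner in references.items():
        if owner != tenant_id:
            raise CrossTenantReferenceError(f"{name} belongs to a different tenant")
```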
F-20 — IDEMPOTENCY_KEY_REPLAYED_DIFFERENT_BODY
- Detection. `MELMASTOON.GENERAL.IDEMPOTENCY_KEY_REPLAYED_DIFFERENT_BODY` rate spikes for one client.
- Effect. Client retries fail safely; no data corruption.
- Mitigation. Notify the client team; usually a code bug in the client.
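The F-20 rejection can be illustrated by storing a digest of the first body seen for each idempotency key and comparing later requests against it. This is a sketch with an in-memory dict. `IdempotencyStore` and the digest scheme are hypothetical, and only the error code string is taken from this document.

```python
import hashlib
import json

DIFFERENT_BODY = "MELMASTOON.GENERAL.IDEMPOTENCY_KEY_REPLAYED_DIFFERENT_BODY"


class IdempotencyStore:
    """Remembers the body digest first seen for each key (hypothetical)."""

    def __init__(self):
        self._seen = {}

    def check(self, key, body):
        # Canonical JSON so key order in the payload does not matter.
        canonical = json.dumps(body, sort_keys=True).encode("utf-8")
        digest = hashlib.sha256(canonical).hexdigest()
        prior = self._seen.get(key)
        if prior is None:
            self._seen[key] = digest
            return "accepted"
        if prior == digest:
            return "replayed"  # safe retry of the same request
        return DIFFERENT_BODY  # same key, different payload: reject
```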
F-21 — BULK_ROOM_CREATE_PARTIAL_FAILURE (transactional, no partial state)
- Detection. `POST /properties/:id/rooms/bulk` 5xx.
- Effect. Whole batch rolled back; no rooms created.
- Mitigation. Client retries with the same idempotency key after fixing root cause (typically a duplicate room number).
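The all-or-nothing behaviour of F-21 can be demonstrated with a single transaction around the batch. The sketch uses the standard-library `sqlite3` module as a stand-in for Postgres. The `rooms` table, its columns, and `bulk_create_rooms` are hypothetical.

```python
import sqlite3


def bulk_create_rooms(conn, property_id, room_numbers):
    """Insert all rooms or none; returns True on success."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.executemany(
                "INSERT INTO rooms (property_id, room_number) VALUES (?, ?)",
                [(property_id, number) for number in room_numbers],
            )
    except sqlite3.IntegrityError:
        return False  # e.g. duplicate room number; nothing was persisted
    return True


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE rooms ("
    " property_id TEXT NOT NULL,"
    " room_number TEXT NOT NULL,"
    " UNIQUE (property_id, room_number))"
)
```

Because the failed batch leaves no rows behind, the client can fix the duplicate and resend the whole batch.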
F-22 — LOCK_DEVICE_BINDING_CONFLICT
- Detection. `lock-integration-service` event for `lock_device_id` already bound to another room.
- Effect. Binding rejected; lock-integration retains "unbound" view of the device.
- Mitigation. Operator unbinds the conflicting room first.
F-23 — PROPERTY_PUBLISH_INVARIANT_VIOLATION
- Detection. Publish endpoint returns `MELMASTOON.PROPERTY.NO_ACTIVE_ROOMS_TO_PUBLISH` / `MISSING_DEFAULT_LOCALE_TRANSLATION` / `MISSING_HERO_PHOTO` / `MISSING_GEO`.
- Effect. Operator sees a guided checklist UI; not a service incident.
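A guided checklist is easiest to build if the publish check returns every violated invariant at once rather than stopping at the first. The sketch below assumes that behaviour, and also assumes that all four codes share the `MELMASTOON.PROPERTY.` prefix. `PropertyDraft` and its fields are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PropertyDraft:
    active_room_count: int = 0
    default_locale: str = "en"
    translations: dict = field(default_factory=dict)  # locale -> text
    hero_photo_id: Optional[str] = None
    lat: Optional[float] = None
    lng: Optional[float] = None


def publish_violations(draft):
    """Return every violated publish invariant as an error code."""
    codes = []
    if draft.active_room_count == 0:
        codes.append("NO_ACTIVE_ROOMS_TO_PUBLISH")
    if draft.default_locale not in draft.translations:
        codes.append("MISSING_DEFAULT_LOCALE_TRANSLATION")
    if draft.hero_photo_id is None:
        codes.append("MISSING_HERO_PHOTO")
    if draft.lat is None or draft.lng is None:
        codes.append("MISSING_GEO")
    return [f"MELMASTOON.PROPERTY.{code}" for code in codes]
```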
F-24 — CLOUD_SQL_FAILOVER
- Detection. Cloud SQL HA event; brief write 5xx.
- Effect. 30–60 s of write blackout; reads continue.
- Mitigation. Auto.
F-25 — MIGRATION_ROLLBACK_REQUIRED
- Detection. Migration job failure; smoke test failure post-deploy.
- Effect. Cloud Deploy auto-rollback; service held on prior revision.
- Mitigation. Apply paired `down.sql`; analyze; reissue migration following expand → backfill → contract.
F-26 — AUDIT_MIRROR_LAG
- Detection. `audit-service` consumer lag.
- Effect. Compliance UIs show recent actions late; local audit table remains source of truth.
- Mitigation. Investigate audit consumer; non-blocking.
F-27 — SEARCH_PROJECTION_LAG
- Detection. Echo metric from `search-aggregation-service` reports projection age.
- Effect. Consumer meta layer shows older property card for up to N minutes.
- Mitigation. Often a Pub/Sub or projection consumer issue; coordinate with search team.
F-28 — RESERVATION_PORT_TIMEOUT
- Detection. `reservation-service` RPC timeout on archive precondition check.
- Effect. Archive request fails fast (`504 MELMASTOON.GENERAL.UPSTREAM_TIMEOUT`); no data change.
- Mitigation. Operator retries; no risk of orphan archives.
3. Cross-Cutting Patterns
- Fail-closed on multi-tenant safety. Any uncertainty about tenant context aborts the operation.
- Fail-open on AI. AI is non-essential to write paths; failure degrades UX, never blocks publish.
- No silent retries on writes. Service emits explicit `Retry-After` and surfaces error codes; the client decides.
- Graceful degradation for caches and projections. Staleness is acceptable as long as it is surfaced; never fabricate data.
Each runbook referenced above lives under `runbooks/property/*`, with `runbooks/security/*` for tenant-isolation-class incidents and `runbooks/platform/*` or `runbooks/file-storage/*` for F-06 and F-08. Alerts in OBSERVABILITY link to those same runbooks.