maintenance-service · FAILURE_MODES

Catalog of how this service can fail: how we detect each failure, how we contain or compensate for it, the user impact, and the runbook and tests that prove the recovery.

1. Choreography failures (room block / relocation)

1.1 work_order.room_blocked.v1 not acknowledged by property-service

  • Detection: alert mnt.room_block_no_response (no property.room.taken_out_of_order.v1 for our correlationId within 60 s).
  • Compensation: worker re-publishes the request once after 60 s; if there is still no ack after 5 min, the WO is flagged room_block_unconfirmed=true (UI badge), the GM is notified, and we make no further automatic attempt to take the room out of order (OOO).
  • User impact: room may show as in-service while a high-severity issue is open; GM can manually OOO via property-service.
  • Runbook: runbook://maintenance/room-block — verify property-service health, replay event, manual OOO fallback.
  • Test: integration/room_block_no_ack.spec.ts.
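
The timing rule in 1.1 can be sketched as a pure decision function. This is a minimal illustration, not the worker's actual code: the type and function names are hypothetical, and it assumes the 5 min limit is measured from the first publish (the text does not say whether it runs from the first publish or from the retry).

```typescript
// Hypothetical names; only the 60 s and 5 min thresholds come from section 1.1.
type RoomBlockAction = "none" | "republish" | "flag_unconfirmed";

interface RoomBlockRequestState {
  publishedAtMs: number; // when work_order.room_blocked.v1 was first published
  republished: boolean;  // whether the single retry has already been sent
  acknowledged: boolean; // whether property.room.taken_out_of_order.v1 arrived
}

const REPUBLISH_AFTER_MS = 60 * 1000;
const GIVE_UP_AFTER_MS = 5 * 60 * 1000; // assumption: counted from the first publish

function nextRoomBlockAction(state: RoomBlockRequestState, nowMs: number): RoomBlockAction {
  if (state.acknowledged) return "none";
  const elapsedMs = nowMs - state.publishedAtMs;
  // Past the give-up threshold: flag the WO and notify the GM, no further retries.
  if (elapsedMs >= GIVE_UP_AFTER_MS) return "flag_unconfirmed";
  // Past the first threshold: re-publish, but only once.
  if (elapsedMs >= REPUBLISH_AFTER_MS && !state.republished) return "republish";
  return "none";
}
```

Keeping the decision separate from the publish and notify side effects lets a spec drive it with a fake clock.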

1.2 property-service rejects the block (room.block_rejected.v1)

  • Detection: inbox handler.
  • Compensation: set causedRoomBlock=false; emit WorkOrderEscalated with reason room_block_rejected; notify GM. WO continues normally.
  • Test: integration/room_block_rejected.spec.ts.

1.3 Relocation required but no replacement room available

  • Detection: reservation-service publishes reservation.modification.failed.v1 with kind=room_change, reason=no_inventory.
  • Compensation: auto-emit WorkOrderEscalated to GM; UI surfaces "guest waiting on relocation"; we do not auto-cancel the WO.
  • Test: integration/relocation_no_inventory.spec.ts.

2. Vendor & assignment failures

2.1 Vendor no-show (no acknowledgement after N reminders)

  • Detection: VendorReminderWorker after 3 cycles past vendorReminderMinutes.
  • Compensation: auto-escalate (escalated.v1 with reason=vendor_no_show); after the 4th cycle, revert the WO to assigned with the assignee cleared so a new vendor can be picked.
  • User impact: GM is paged; staff sees a "vendor unresponsive" badge.
  • Runbook: runbook://maintenance/vendor-no-show.
  • Test: integration/vendor_no_show.spec.ts.
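
The cycle thresholds in 2.1 can be sketched as a small mapping from the number of unacknowledged reminder cycles to an outcome. The type and function names are hypothetical; only the 3-cycle and 4-cycle thresholds and the vendor_no_show reason come from the text above.

```typescript
// Hypothetical shape for what VendorReminderWorker might decide per tick.
type VendorReminderOutcome =
  | { kind: "remind" }
  | { kind: "escalate"; reason: "vendor_no_show" }
  | { kind: "revert_to_assigned"; clearAssignee: true };

function vendorReminderOutcome(unackedCycles: number): VendorReminderOutcome {
  // From the 4th cycle on: hand the WO back so a new vendor can be picked.
  if (unackedCycles >= 4) return { kind: "revert_to_assigned", clearAssignee: true };
  // 3rd cycle: escalate to the GM.
  if (unackedCycles === 3) return { kind: "escalate", reason: "vendor_no_show" };
  // Earlier cycles: keep reminding.
  return { kind: "remind" };
}
```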

2.2 Vendor assigned with channelPreference = call_only and staff didn't manually ack

  • Detection: the assign use case rejects with VENDOR_CHANNEL_MISMATCH.
  • Compensation: UI shows the message: "this vendor must be reached by phone; record the manual acknowledgement." Staff records the acknowledgement via RecordVendorAcknowledgementUseCase.
  • Test: application/__tests__/assign_call_only.spec.ts.

2.3 Notification dispatch fails

  • Detection: notification-service returns 5xx; we log the failure and do not block the WO flow.
  • Compensation: notification-service has its own retry policy and DLQ; vendor.assigned.v1 event still fires, so analytics and audit are intact.
  • User impact: none for the WO itself; the vendor may not be informed digitally. The UI shows a "notification not delivered" badge once we consume the DLQ event that notification-service publishes.
  • Test: integration/notification_5xx.spec.ts.

3. Parts & cost failures

3.1 Part out of stock at resolve time

  • Detection: PartRepository.decrementOnHand raises PART_OUT_OF_STOCK.
  • Compensation: ResolveWorkOrderUseCase aborts, returns 409 to caller. UI prompts to update parts (purchase or override quantity). Optionally the WO is auto-blocked with reason=awaiting_part.
  • User impact: technician must reconcile parts before resolution.
  • Test: integration/parts_out_of_stock.spec.ts.

3.2 Cost-currency mismatch

  • Detection: domain validation at resolve.
  • Compensation: return 422; the UI constrains the currency to the tenant base currency via its dropdown.
  • Test: domain/__tests__/cost_currency.spec.ts.

4. Concurrency & ordering failures

4.1 OCC stale version

  • Detection: UPDATE … WHERE version = ? returns 0 rows; repository raises OccConflict.
  • Compensation: controller returns MELMASTOON.SYS.OCC_CONFLICT with current version + status; BFF retries up to 2× with re-fetch.
  • Test: every state transition has an OCC conflict spec.
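
A minimal in-memory sketch of the version check in 4.1, useful for reasoning about what an OCC conflict spec asserts. The repository class and row shape are hypothetical stand-ins for the real SQL-backed repository; the comparison mirrors the `UPDATE … WHERE version = ?` pattern named above.

```typescript
// Hypothetical error carrying the current version + status, as the controller response does.
class OccConflict extends Error {
  currentVersion: number;
  currentStatus: string;
  constructor(currentVersion: number, currentStatus: string) {
    super(`stale version; current is ${currentVersion}`);
    this.name = "OccConflict";
    this.currentVersion = currentVersion;
    this.currentStatus = currentStatus;
  }
}

interface WorkOrderRow {
  id: string;
  status: string;
  version: number;
}

class InMemoryWorkOrderRepository {
  private rows = new Map<string, WorkOrderRow>();

  insert(row: WorkOrderRow): void {
    this.rows.set(row.id, { ...row });
  }

  // Equivalent to an UPDATE guarded by the expected version:
  // a mismatch plays the role of "0 rows updated".
  updateStatus(id: string, expectedVersion: number, status: string): WorkOrderRow {
    const row = this.rows.get(id);
    if (!row) throw new Error(`work order ${id} not found`);
    if (row.version !== expectedVersion) {
      throw new OccConflict(row.version, row.status);
    }
    const next: WorkOrderRow = { ...row, status, version: row.version + 1 };
    this.rows.set(id, next);
    return next;
  }
}
```

The same check covers the simultaneous-resolve case in 4.3: both callers read the same version, the first update bumps it, and the second fails the comparison.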

4.2 Out-of-order Pub/Sub delivery

  • Detection: none (we don't try to detect; we make every consumer idempotent and order-tolerant).
  • Compensation: consumer handlers are idempotent (messageId in inbox) and pure functions of payload + current state; reordering of upstream events does not change the final state.
  • Test: integration/out_of_order_inbox.spec.ts shuffles 10 events for the same WO and asserts identical final state to the in-order run.
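
One way to get the order tolerance described in 4.2 is inbox dedupe plus a last-writer-wins rule keyed on the event's own timestamp. The sketch below assumes that approach; the event shape, the occurredAtMs field, and the projection are hypothetical, and the real handlers may reach order tolerance differently.

```typescript
// Hypothetical upstream event and local projection.
interface UpstreamEvent {
  messageId: string;
  occurredAtMs: number; // assumption: events carry a producer-side timestamp
  roomOutOfOrder: boolean;
}

interface WoProjection {
  lastAppliedAtMs: number;
  roomOutOfOrder: boolean;
}

// Pure function of payload + current state, apart from recording the messageId in the inbox.
function applyEvent(state: WoProjection, inbox: Set<string>, ev: UpstreamEvent): WoProjection {
  if (inbox.has(ev.messageId)) return state; // duplicate delivery: already processed
  inbox.add(ev.messageId);
  if (ev.occurredAtMs <= state.lastAppliedAtMs) return state; // older than what we hold: ignore
  return { lastAppliedAtMs: ev.occurredAtMs, roomOutOfOrder: ev.roomOutOfOrder };
}

function replay(events: UpstreamEvent[]): WoProjection {
  const inbox = new Set<string>();
  let state: WoProjection = { lastAppliedAtMs: 0, roomOutOfOrder: false };
  for (const ev of events) state = applyEvent(state, inbox, ev);
  return state;
}
```

Under these assumptions any permutation of the same events, with or without redeliveries, converges on the state carried by the newest event, which is the property the shuffle test asserts.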

4.3 Two staff resolve the same WO simultaneously

  • Detection: OCC; first wins.
  • Compensation: second gets 409; UI shows the resolved state.
  • Test: integration/concurrent_resolve.spec.ts.

4.4 Preventive scheduler tick overlap (two pods process same row)

  • Detection: unique constraint on preventive_fires(scheduleId, due_at_bucket_hour).
  • Compensation: second insert raises a unique violation → handler logs PREVENTIVE_DUPLICATE_FIRE at debug level and returns 200.
  • Test: integration/preventive_duplicate_fire.spec.ts.
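
The dedupe key in 4.4 can be illustrated with an in-memory stand-in for the preventive_fires unique constraint. The class and helper names are hypothetical, and the sketch assumes due_at_bucket_hour is the due time truncated to the UTC hour.

```typescript
// Assumption: the bucket is the UTC hour, e.g. 2024-05-01T10:42:07Z -> "2024-05-01T10".
function dueAtBucketHour(dueAt: Date): string {
  return dueAt.toISOString().slice(0, 13);
}

// In-memory stand-in for the unique constraint on (scheduleId, due_at_bucket_hour).
class PreventiveFires {
  private keys = new Set<string>();

  // Returns true when this call wins the insert, false where the DB would raise a unique violation.
  tryInsert(scheduleId: string, dueAt: Date): boolean {
    const key = `${scheduleId}|${dueAtBucketHour(dueAt)}`;
    if (this.keys.has(key)) return false;
    this.keys.add(key);
    return true;
  }
}
```

A pod that gets `false` back would log the duplicate and return success, since the other pod has already fired the schedule for that hour.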

5. Storage failures

5.1 Cloud SQL unavailability (planned failover)

  • Detection: connection errors; readiness probe fails; pods drained.
  • Compensation: Cloud Run waits with retries + circuit breaker; new requests return MELMASTOON.SYS.UPSTREAM_UNAVAILABLE (503). Clients retry with exponential backoff.
  • User impact: ~30 s blip during regional failover.
  • Test: chaos: pnpm chaos:db-failover.
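
For the client side of 5.1, a capped exponential backoff schedule can be sketched as below. The base delay and cap are illustrative assumptions; the document does not specify the values clients actually use.

```typescript
// Delay before retry number `attempt` (0-based): base * 2^attempt, capped.
// baseMs and capMs are assumed defaults, not values taken from the service.
function backoffDelayMs(attempt: number, baseMs: number = 200, capMs: number = 10000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Full schedule for a given number of retries, handy for logging or tests.
function backoffSchedule(retries: number): number[] {
  const delays: number[] = [];
  for (let attempt = 0; attempt < retries; attempt++) {
    delays.push(backoffDelayMs(attempt));
  }
  return delays;
}
```

In practice a random jitter is usually added to each delay so that clients recovering from the same failover do not retry in lockstep.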

5.2 Outbox relay stalled

  • Detection: mnt.outbox.lag_seconds > 60 for 5 min.
  • Compensation: alert P2; runbook restarts relay or scales it; events catch up.
  • User impact: downstream services see delayed events but no data loss.
  • Test: integration/outbox_relay_stalled.spec.ts.

5.3 Inbox dedupe corruption

  • Detection: assertion in handler: same messageId processed but inbox row missing.
  • Compensation: handler still treats the message as already processed for the current request (the natural dedupe via state checks holds) and emits an alert. Operator runs tools/rebuild-inbox.ts from the event log to repair the inbox.
  • Test: integration/inbox_corruption.spec.ts.

5.4 Disk full / Cloud SQL storage exhausted

  • Detection: alert db_storage_high at 80%, db_storage_critical at 90%.
  • Compensation: auto-grow enabled to 200 GB; archiver runs to evict closed WOs > 24 mo.
  • Test: simulated in staging with synthetic load.

6. Worker failures

6.1 Preventive scheduler stuck

  • Detection: mnt.preventive.due_pending_count rising without scheduler progress; worker last_tick_at lag > 5 min.
  • Compensation: restart Cloud Run revision; backlog drains in N ticks since dedupe is robust.
  • Test: chaos: kill the worker pod mid-tick and verify recovery.

6.2 SLA scanner double-counts breaches

  • Detection: breach_count jumping > 1 per minute.
  • Cause: missing dedupe within the same minute; the current code does not allow this, but a regression test covers it.
  • Test: integration/sla_double_count.spec.ts.

6.3 Asset health forecaster pollutes scores

  • Detection: alert if mnt.assets.health_index drops > 30 points across a property in one tick.
  • Compensation: auto-revert via RecordAssetHealthUpdateUseCase with override; suspend forecaster for the tenant; investigate model output.
  • Test: golden tests on the forecaster bound its output.

7. Auto-create path failures (inbound choreography)

7.1 Housekeeping flag arrives but no propertyId resolvable

  • Detection: handler validation; emit metric mnt.inbox.invalid_payload_total.
  • Compensation: push to DLQ for human review; do not retry forever.
  • Test: integration/inbox_invalid_payload.spec.ts.

7.2 Lock health alert references unknown deviceId

  • Detection: asset upsert proceeds (we register a new asset), but displayName is generic.
  • Compensation: UI surfaces a "needs naming" badge on the new asset; technician renames.
  • Test: integration/lock_unknown_device.spec.ts.

7.3 Auto-created WO collides with existing open one (invariant #4)

  • Detection: repository find-open returns a row.
  • Compensation: append a comment to the existing WO with the new source/note; the original WO id is returned in originRef.
  • Test: integration/auto_create_collision.spec.ts.

8. Sync failures

8.1 Push command with stale OCC

  • Detection: OCC conflict.
  • Compensation: server returns OCC_CONFLICT; client surfaces conflict view; technician re-fetches.
  • Test: integration/sync_push_occ.spec.ts.

8.2 Push commands batch partial failure

  • Detection: per-command result.
  • Compensation: other WOs proceed; failed WO's subsequent commands in the batch are skipped (server breaks at first failure per-WO).
  • Test: integration/sync_push_partial.spec.ts.
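
The per-WO break described in 8.2 can be sketched as follows. The command and result shapes are hypothetical, and `execute` stands in for whatever applies a single command on the server.

```typescript
// Hypothetical shapes for a pushed command and its per-command result.
interface PushCommand {
  workOrderId: string;
  name: string;
}

interface CommandResult {
  command: PushCommand;
  status: "applied" | "failed" | "skipped";
}

function processBatch(
  commands: PushCommand[],
  execute: (command: PushCommand) => boolean, // returns false when the command fails
): CommandResult[] {
  const failedWorkOrders = new Set<string>();
  return commands.map((command): CommandResult => {
    // Once a WO has failed, its later commands in this batch are skipped.
    if (failedWorkOrders.has(command.workOrderId)) return { command, status: "skipped" };
    if (execute(command)) return { command, status: "applied" };
    failedWorkOrders.add(command.workOrderId);
    return { command, status: "failed" };
  });
}
```

Commands for other WOs are unaffected, so one bad WO does not block the rest of the batch.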

8.3 Device clock far skewed

  • Detection: server compares deviceClock vs serverNow.
  • Compensation: if skew > 15 min, push is rejected with MELMASTOON.SYS.CLOCK_SKEW_EXCESSIVE; technician syncs OS clock.
  • Test: integration/sync_clock_skew.spec.ts.
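
The skew check in 8.3 is a single comparison. A minimal sketch, with a hypothetical function name and result shape, and assuming the skew is evaluated symmetrically (device ahead or behind):

```typescript
const MAX_CLOCK_SKEW_MS = 15 * 60 * 1000; // 15 min threshold from section 8.3

type ClockSkewCheck =
  | { ok: true }
  | { ok: false; code: "MELMASTOON.SYS.CLOCK_SKEW_EXCESSIVE"; skewMs: number };

function checkClockSkew(deviceClockMs: number, serverNowMs: number): ClockSkewCheck {
  // Assumption: a device running ahead and one running behind are treated the same.
  const skewMs = Math.abs(deviceClockMs - serverNowMs);
  if (skewMs > MAX_CLOCK_SKEW_MS) {
    return { ok: false, code: "MELMASTOON.SYS.CLOCK_SKEW_EXCESSIVE", skewMs };
  }
  return { ok: true };
}
```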

9. AI failures

9.1 Orchestrator timeout / 5xx

  • Compensation: fail-soft; WO created without AI assist; provenance logs failure.
  • Test: integration/ai_orchestrator_timeout.spec.ts.

9.2 Model returns invalid enum

  • Compensation: discard and log; do not persist; fall back to staff input.
  • Test: integration/ai_invalid_enum.spec.ts.
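
The discard rule in 9.2 can be sketched as a guard that accepts only known enum members. The severity values and function name here are hypothetical examples; the real enums live in the domain model.

```typescript
// Hypothetical enum values used for illustration only.
const SEVERITIES = ["low", "medium", "high", "critical"] as const;
type Severity = (typeof SEVERITIES)[number];

// Returns the value if the model produced a known member, otherwise undefined
// so the caller can log, skip persistence, and fall back to staff input.
function parseAiSeverity(raw: unknown): Severity | undefined {
  if (typeof raw !== "string") return undefined;
  return (SEVERITIES as readonly string[]).includes(raw) ? (raw as Severity) : undefined;
}
```

Because the guard takes `unknown`, malformed model output (wrong type, wrong casing, invented values) all takes the same discard path.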

9.3 Tenant AI budget exhausted

  • Compensation: fail-soft per capability matrix in AI_INTEGRATION.md §5; WO flow continues.
  • Test: integration/ai_budget_exhausted.spec.ts.

10. Edge-case operational failures

10.1 High-severity WO opened on a room with an active reservation, no replacement available, guest already at desk

  • Detection: reservation.modification.failed.v1 with kind=room_change.
  • Compensation: WO escalated to GM; UI displays "guest waiting" banner; check-in saga (reservation-service) handles guest reaccommodation manually.
  • User impact: guest waits; GM intervenes.
  • Test: scripted E2E in staging.

10.2 Vendor invoice file upload fails

  • Detection: signed-URL upload returns 4xx; the BFF surfaces the error; we never persist the WO change because the file ref was never confirmed.
  • Compensation: retry; or skip file upload, just record amount.
  • Test: BFF E2E.

10.3 Multiple lock health alerts for the same device in 1 second

  • Detection: inbox dedupe + (deviceId, alertCode, dayBucket) natural dedupe.
  • Compensation: at most one WO per dedupe bucket; subsequent alerts append a comment.
  • Test: integration/lock_alert_dedupe.spec.ts.

10.4 Generator run-hours regress (sensor or human input goes backward)

  • Detection: Asset.runHours is merged as max-of under the sync contract, so synced values cannot regress; a direct cloud-side REST update to a lower value is rejected with MELMASTOON.MAINTENANCE.ASSET_RUN_HOURS_REGRESSION.
  • Compensation: UI prompts: "current value is X, you entered Y. Confirm reset?" If confirmed, the UI calls RecordAssetHealthUpdateUseCase with force=true (audited).
  • Test: integration/asset_run_hours_regression.spec.ts.
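
The two update paths in 10.4 can be sketched side by side: the sync path merges with max-of, while the direct REST path rejects a lower value unless forced. Function names and the result shape are hypothetical; the error code and the force flag come from the text above.

```typescript
type RunHoursUpdate =
  | { accepted: true; runHours: number; audited: boolean }
  | { accepted: false; code: "MELMASTOON.MAINTENANCE.ASSET_RUN_HOURS_REGRESSION"; current: number };

// Sync path: max-of merge, so a stale device value can never lower the counter.
function mergeSyncedRunHours(current: number, incoming: number): number {
  return Math.max(current, incoming);
}

// Direct REST path: a lower value is rejected unless the caller confirmed the reset.
function applyRestRunHours(current: number, incoming: number, force: boolean = false): RunHoursUpdate {
  if (incoming >= current) return { accepted: true, runHours: incoming, audited: false };
  if (force) return { accepted: true, runHours: incoming, audited: true }; // forced resets are audited
  return { accepted: false, code: "MELMASTOON.MAINTENANCE.ASSET_RUN_HOURS_REGRESSION", current };
}
```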

10.5 Schedule nextDueAt regression after manual override

  • Detection: invariant #9.
  • Compensation: request rejected; UI explains that next-due cannot move backward without manually completing the in-flight schedule.
  • Test: domain/__tests__/schedule_next_due_regression.spec.ts.

11. Generic failure principles

  • Fail-soft on advisory dependencies (AI, notification dispatch, asset health forecaster) — primary flow continues, badge surfaced.
  • Fail-fast on authoritative dependencies (DB, auth, RLS) — return 5xx, let client retry.
  • Audit everything — every failure produces a structured log + a metric increment. Silent failures are bugs.
  • No partial state — every command writes WO + outbox in one transaction; either both happen or neither.
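
The "no partial state" principle can be illustrated with an in-memory stand-in for a database transaction: the WO write and the outbox write are staged on a copy and swapped in together. The store shape, helper names, and the event type string are hypothetical; the real service relies on a database transaction for the same guarantee.

```typescript
interface OutboxEvent {
  type: string;
  workOrderId: string;
}

interface Store {
  workOrders: Map<string, string>; // workOrderId -> status
  outbox: OutboxEvent[];
}

// Stage all writes on a copy; commit only if the body completes without throwing.
function runInTransaction(store: Store, body: (tx: Store) => void): void {
  const tx: Store = { workOrders: new Map(store.workOrders), outbox: [...store.outbox] };
  body(tx); // a throw here leaves `store` untouched
  store.workOrders = tx.workOrders;
  store.outbox = tx.outbox;
}

function resolveWorkOrder(store: Store, workOrderId: string): void {
  runInTransaction(store, (tx) => {
    tx.workOrders.set(workOrderId, "resolved");
    // Hypothetical event name, used only for this sketch.
    tx.outbox.push({ type: "work_order.resolved.v1", workOrderId });
  });
}
```

If the outbox write fails, the status change is discarded with it, so downstream consumers never miss an event for a state change that was committed.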