maintenance-service · FAILURE_MODES

Catalog of how this service can fail: how each failure is detected, how it is contained or compensated, what the user sees, and which runbook and tests prove the recovery.

1. Choreography failures (room block / relocation)

1.1 `work_order.room_blocked.v1` not acknowledged by `property-service`

Detection: alert mnt.room_block_no_response (no property.room.taken_out_of_order.v1 for our correlationId within 60 s).
Compensation: worker re-publishes the request once after 60 s; if there is still no ack after 5 min, the WO is flagged room_block_unconfirmed=true (UI badge), the GM is notified, and we do not attempt to take the room out of order (OOO) again automatically.
User impact: room may show as in-service while a high-severity issue is open; GM can manually OOO via property-service.
Runbook: runbook://maintenance/room-block — verify property-service health, replay event, manual OOO fallback.
Test: integration/room_block_no_ack.spec.ts.
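The timing rules in the compensation above can be read as a small decision function. The following is a minimal sketch, not the worker's actual code: the names `RoomBlockAction` and `decideRoomBlockAction` are hypothetical, and it assumes elapsed time is measured from the first publish.

```typescript
// Hypothetical sketch of the 1.1 compensation timeline; names are illustrative.
type RoomBlockAction = "wait" | "republish" | "flag_unconfirmed";

const REPUBLISH_AFTER_MS = 60 * 1000; // re-publish once after 60 s
const GIVE_UP_AFTER_MS = 5 * 60 * 1000; // flag the WO after 5 min without an ack

function decideRoomBlockAction(
  elapsedMs: number,
  alreadyRepublished: boolean,
): RoomBlockAction {
  if (elapsedMs >= GIVE_UP_AFTER_MS) return "flag_unconfirmed";
  if (elapsedMs >= REPUBLISH_AFTER_MS && !alreadyRepublished) return "republish";
  return "wait";
}
```

Keeping the decision pure (no timers, no I/O) makes it straightforward to cover each branch in a unit test alongside the integration spec.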

1.2 `property-service` rejects the block (`room.block_rejected.v1`)

Detection: inbox handler.
Compensation: set causedRoomBlock=false; emit WorkOrderEscalated with reason room_block_rejected; notify GM. WO continues normally.
Test: integration/room_block_rejected.spec.ts.

1.3 Relocation required but no replacement room available

Detection: reservation-service publishes reservation.modification.failed.v1 with kind=room_change, reason=no_inventory.
Compensation: auto-emit WorkOrderEscalated to GM; UI surfaces "guest waiting on relocation"; we do not auto-cancel the WO.
Test: integration/relocation_no_inventory.spec.ts.

2. Vendor & assignment failures

2.1 Vendor no-show (no acknowledgement after N reminders)

Detection: VendorReminderWorker after 3 cycles past vendorReminderMinutes.
Compensation: auto-escalate (escalated.v1 with reason=vendor_no_show); revert WO to assigned with assignee cleared after 4th cycle so a new vendor can be picked.
User impact: GM is paged; staff sees a "vendor unresponsive" badge.
Runbook: runbook://maintenance/vendor-no-show.
Test: integration/vendor_no_show.spec.ts.
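As an illustration of the cycle thresholds above, the sketch below maps unacknowledged reminder cycles to the next step. It assumes cycles are fixed-length windows of `vendorReminderMinutes` counted from assignment; the function names are hypothetical and not taken from VendorReminderWorker.

```typescript
// Hypothetical sketch of the 2.1 escalation ladder; names are illustrative.
type VendorNoShowStep = "remind" | "escalate" | "revert_to_assigned";

// Assumption: a cycle is one full vendorReminderMinutes window since assignment.
function unackedReminderCycles(
  minutesSinceAssignment: number,
  vendorReminderMinutes: number,
): number {
  return Math.floor(minutesSinceAssignment / vendorReminderMinutes);
}

function vendorNoShowStep(unackedCycles: number): VendorNoShowStep {
  if (unackedCycles >= 4) return "revert_to_assigned"; // clear assignee, pick a new vendor
  if (unackedCycles >= 3) return "escalate"; // reason=vendor_no_show
  return "remind";
}
```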

2.2 Vendor assigned with `channelPreference = call_only` and staff didn't manually ack

Detection: the assign use case rejects with VENDOR_CHANNEL_MISMATCH.
Compensation: UI shows the message: "this vendor must be reached by phone; record the manual acknowledgement." Staff records the acknowledgement via RecordVendorAcknowledgementUseCase.
Test: application/__tests__/assign_call_only.spec.ts.

2.3 Notification dispatch fails

Detection: notification-service returns 5xx; we log the failure and do not block the assignment.
Compensation: notification-service has its own retry policy and DLQ; the vendor.assigned.v1 event still fires, so analytics and audit are intact.
User impact: none for the WO itself; the vendor may not be informed digitally. The UI shows a "notification not delivered" badge once we consume the DLQ event that notification-service publishes.
Test: integration/notification_5xx.spec.ts.

3. Parts & cost failures

3.1 Part out of stock at resolve time

Detection: PartRepository.decrementOnHand raises PART_OUT_OF_STOCK.
Compensation: ResolveWorkOrderUseCase aborts, returns 409 to caller. UI prompts to update parts (purchase or override quantity). Optionally the WO is auto-blocked with reason=awaiting_part.
User impact: technician must reconcile parts before resolution.
Test: integration/parts_out_of_stock.spec.ts.

3.2 Cost-currency mismatch

Detection: domain validation at resolve.
Compensation: request is rejected with 422; UI restricts cost entry to the tenant base currency dropdown.
Test: domain/__tests__/cost_currency.spec.ts.

4. Concurrency & ordering failures

4.1 OCC stale version

Detection: UPDATE … WHERE version = ? returns 0 rows; repository raises OccConflict.
Compensation: controller returns MELMASTOON.SYS.OCC_CONFLICT with current version + status; BFF retries up to 2× with re-fetch.
Test: every state transition has an OCC conflict spec.
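The version check and the re-fetch retry can be sketched with an in-memory table. This is an illustration under assumptions, not the repository's code: `InMemoryWorkOrderTable`, `OccResult`, and `updateWithRefetch` are hypothetical names, and a result union stands in for the OccConflict exception.

```typescript
// Hypothetical in-memory sketch of optimistic concurrency control (OCC).
interface VersionedWorkOrder {
  id: string;
  version: number;
  state: string;
}

type OccResult =
  | { ok: true; row: VersionedWorkOrder }
  | { ok: false; code: "OCC_CONFLICT"; current: VersionedWorkOrder };

class InMemoryWorkOrderTable {
  private rows = new Map<string, VersionedWorkOrder>();

  insert(row: VersionedWorkOrder): void {
    this.rows.set(row.id, { ...row });
  }

  // Mirrors: UPDATE ... SET state = ?, version = version + 1 WHERE id = ? AND version = ?
  updateIfVersion(id: string, expectedVersion: number, state: string): OccResult {
    const current = this.rows.get(id);
    if (!current) throw new Error(`unknown work order ${id}`);
    if (current.version !== expectedVersion) {
      // "0 rows updated": report the current version so the caller can re-fetch.
      return { ok: false, code: "OCC_CONFLICT", current: { ...current } };
    }
    const next = { ...current, state, version: current.version + 1 };
    this.rows.set(id, next);
    return { ok: true, row: { ...next } };
  }
}

// BFF-style retry: one initial attempt, then up to maxRetries attempts with the re-fetched version.
function updateWithRefetch(
  table: InMemoryWorkOrderTable,
  id: string,
  knownVersion: number,
  state: string,
  maxRetries = 2,
): OccResult {
  let result: OccResult = table.updateIfVersion(id, knownVersion, state);
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    if (result.ok) return result;
    result = table.updateIfVersion(id, result.current.version, state);
  }
  return result;
}
```

A real caller would presumably re-validate the command against the re-fetched status before retrying, since the transition may no longer be legal; the sketch omits that step.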

4.2 Out-of-order Pub/Sub delivery

Detection: none (we don't try to detect; we make every consumer idempotent and order-tolerant).
Compensation: consumer handlers are idempotent (messageId in inbox) and pure functions of payload + current state; reordering of upstream events does not change the final state.
Test: integration/out_of_order_inbox.spec.ts shuffles 10 events for the same WO and asserts identical final state to the in-order run.
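One way to get the order tolerance described above is to let a handler only ever move state forward along a ranked lifecycle, combined with messageId dedupe. The sketch below assumes a hypothetical four-step lifecycle and hypothetical names (`UpstreamEvent`, `handleUpstreamEvent`); the actual handlers may achieve the same property differently.

```typescript
// Hypothetical sketch: idempotent, order-tolerant consumer.
const LIFECYCLE = ["open", "assigned", "in_progress", "resolved"] as const;
type LifecycleStatus = (typeof LIFECYCLE)[number];

interface UpstreamEvent {
  messageId: string;
  status: LifecycleStatus;
}

interface ConsumerState {
  status: LifecycleStatus;
  inbox: Set<string>; // messageIds already processed
}

function handleUpstreamEvent(state: ConsumerState, ev: UpstreamEvent): ConsumerState {
  if (state.inbox.has(ev.messageId)) return state; // duplicate delivery: no-op
  const inbox = new Set(state.inbox).add(ev.messageId);
  // Only advance; an older event arriving late cannot move the status backward.
  const advances = LIFECYCLE.indexOf(ev.status) > LIFECYCLE.indexOf(state.status);
  return { status: advances ? ev.status : state.status, inbox };
}

function replayEvents(events: UpstreamEvent[]): LifecycleStatus {
  const initial: ConsumerState = { status: "open", inbox: new Set<string>() };
  return events.reduce(handleUpstreamEvent, initial).status;
}
```

Because the merge is effectively a maximum over ranks, any permutation of the same events, with or without duplicates, ends in the same status.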

4.3 Two staff resolve the same WO simultaneously

Detection: OCC; first wins.
Compensation: second gets 409; UI shows the resolved state.
Test: integration/concurrent_resolve.spec.ts.

4.4 Preventive scheduler tick overlap (two pods process same row)

Detection: unique constraint on preventive_fires(scheduleId, due_at_bucket_hour).
Compensation: second insert raises unique violation → handler logs PREVENTIVE_DUPLICATE_FIRE debug and returns 200.
Test: integration/preventive_duplicate_fire.spec.ts.
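The unique constraint can be mimicked in memory to show why a second pod's insert is harmless. This sketch assumes the hour bucket is the UTC hour of the due time; `PreventiveFireLedger` and `dueAtBucketHour` are hypothetical names.

```typescript
// Hypothetical sketch of the (scheduleId, due_at_bucket_hour) uniqueness rule.
function dueAtBucketHour(dueAt: Date): string {
  return dueAt.toISOString().slice(0, 13); // "YYYY-MM-DDTHH" in UTC (assumed bucket shape)
}

class PreventiveFireLedger {
  private keys = new Set<string>();

  // Returns false where the database would raise a unique violation.
  tryInsert(scheduleId: string, dueAt: Date): boolean {
    const key = `${scheduleId}|${dueAtBucketHour(dueAt)}`;
    if (this.keys.has(key)) return false; // duplicate fire: log at debug, treat as success
    this.keys.add(key);
    return true;
  }
}
```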

5. Storage failures

5.1 Cloud SQL unavailability (planned failover)

Detection: connection errors; readiness probe fails; pods drained.
Compensation: Cloud Run waits with retries + circuit breaker; new requests return MELMASTOON.SYS.UPSTREAM_UNAVAILABLE (503). Clients retry with exponential backoff.
User impact: ~30 s blip during regional failover.
Test: chaos: pnpm chaos:db-failover.

5.2 Outbox relay stalled

Detection: mnt.outbox.lag_seconds > 60 for 5 min.
Compensation: alert P2; runbook restarts relay or scales it; events catch up.
User impact: downstream services see delayed events but no data loss.
Test: integration/outbox_relay_stalled.spec.ts.

5.3 Inbox dedupe corruption

Detection: assertion in handler: same messageId processed but inbox row missing.
Compensation: handler still treats the message as already processed for the current request (the natural dedupe via state checks holds) and emits an alert. Operator runs tools/rebuild-inbox.ts against the event log to repair the inbox.
Test: integration/inbox_corruption.spec.ts.

5.4 Disk full / Cloud SQL storage exhausted

Detection: alert db_storage_high at 80%, db_storage_critical at 90%.
Compensation: auto-grow enabled to 200 GB; archiver runs to evict closed WOs > 24 mo.
Test: simulated in staging with synthetic load.

6. Worker failures

6.1 Preventive scheduler stuck

Detection: mnt.preventive.due_pending_count rising without scheduler progress; worker last_tick_at lag > 5 min.
Compensation: restart the Cloud Run revision; the backlog drains over the next N ticks, and reprocessing is safe because duplicate fires are deduplicated (see 4.4).
Test: chaos: kill the worker pod mid-tick and verify recovery.

6.2 SLA scanner double-counts breaches

Detection: breach_count jumping > 1 per minute.
Cause: missing dedupe within the same minute. This is not possible with the current code, but it is covered by a regression test.
Test: integration/sla_double_count.spec.ts.

6.3 Asset health forecaster pollutes scores

Detection: alert if mnt.assets.health_index drops > 30 points across a property in one tick.
Compensation: auto-revert via RecordAssetHealthUpdateUseCase with override; suspend forecaster for the tenant; investigate model output.
Test: golden tests on the forecaster bound its output.

7. Auto-create path failures (inbound choreography)

7.1 Housekeeping flag arrives but no `propertyId` resolvable

Detection: handler validation; emit metric mnt.inbox.invalid_payload_total.
Compensation: push to DLQ for human review; do not retry forever.
Test: integration/inbox_invalid_payload.spec.ts.

7.2 Lock health alert references unknown deviceId

Detection: asset upsert proceeds (we register a new asset), but displayName is generic.
Compensation: UI surfaces a "needs naming" badge on the new asset; technician renames.
Test: integration/lock_unknown_device.spec.ts.

7.3 Auto-created WO collides with existing open one (invariant #4)

Detection: repository find-open returns a row.
Compensation: comment append on existing WO with the new source/note; original WO id returned in originRef.
Test: integration/auto_create_collision.spec.ts.

8. Sync failures

8.1 Push command with stale OCC

Detection: OCC conflict.
Compensation: server returns OCC_CONFLICT; client surfaces conflict view; technician re-fetches.
Test: integration/sync_push_occ.spec.ts.

8.2 Push commands batch partial failure

Detection: per-command result.
Compensation: other WOs proceed; failed WO's subsequent commands in the batch are skipped (server breaks at first failure per-WO).
Test: integration/sync_push_partial.spec.ts.
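The per-WO break rule can be sketched as follows. The types and the `apply` callback are hypothetical stand-ins for the real command handlers; the point is only that a failure poisons later commands for the same WO and nothing else.

```typescript
// Hypothetical sketch of per-WO "break at first failure" batch handling.
interface PushCommand {
  commandId: string;
  workOrderId: string;
}

type PushOutcome = "applied" | "failed" | "skipped";

function processPushBatch(
  commands: PushCommand[],
  apply: (command: PushCommand) => boolean, // true when the command was applied
): Map<string, PushOutcome> {
  const failedWorkOrders = new Set<string>();
  const outcomes = new Map<string, PushOutcome>();
  for (const command of commands) {
    if (failedWorkOrders.has(command.workOrderId)) {
      outcomes.set(command.commandId, "skipped"); // an earlier command for this WO failed
      continue;
    }
    if (apply(command)) {
      outcomes.set(command.commandId, "applied");
    } else {
      failedWorkOrders.add(command.workOrderId);
      outcomes.set(command.commandId, "failed");
    }
  }
  return outcomes;
}
```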

8.3 Device clock far skewed

Detection: server compares deviceClock vs serverNow.
Compensation: if skew > 15 min, push is rejected with MELMASTOON.SYS.CLOCK_SKEW_EXCESSIVE; technician syncs OS clock.
Test: integration/sync_clock_skew.spec.ts.
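A minimal sketch of the skew check, assuming skew is measured as an absolute difference so that fast and slow device clocks are treated alike (the source does not say which direction matters). `checkSyncClockSkew` is a hypothetical name.

```typescript
// Hypothetical sketch of the 15-minute clock-skew guard.
const MAX_CLOCK_SKEW_MS = 15 * 60 * 1000;

type ClockSkewResult =
  | { ok: true }
  | { ok: false; code: "MELMASTOON.SYS.CLOCK_SKEW_EXCESSIVE"; skewMs: number };

function checkSyncClockSkew(deviceClock: Date, serverNow: Date): ClockSkewResult {
  const skewMs = Math.abs(deviceClock.getTime() - serverNow.getTime());
  if (skewMs > MAX_CLOCK_SKEW_MS) {
    return { ok: false, code: "MELMASTOON.SYS.CLOCK_SKEW_EXCESSIVE", skewMs };
  }
  return { ok: true };
}
```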

9. AI failures

9.1 Orchestrator timeout / 5xx

Compensation: fail-soft; WO created without AI assist; provenance logs failure.
Test: integration/ai_orchestrator_timeout.spec.ts.

9.2 Model returns invalid enum

Compensation: discard and log; do not persist; fall back to staff input.
Test: integration/ai_invalid_enum.spec.ts.
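The discard-and-fall-back rule amounts to validating model output against a closed set before it touches the WO. In the sketch below the severity values and the function names are hypothetical; only the pattern is taken from the text above.

```typescript
// Hypothetical sketch: accept a model suggestion only if it is a known enum value.
const SEVERITY_VALUES = ["low", "medium", "high"] as const; // illustrative values
type Severity = (typeof SEVERITY_VALUES)[number];

function parseModelEnum<T extends string>(allowed: readonly T[], raw: unknown): T | null {
  if (typeof raw !== "string") return null;
  for (const value of allowed) {
    if (value === raw) return value;
  }
  return null; // invalid: caller discards, logs, and does not persist
}

function resolveSeverity(modelOutput: unknown, staffInput: Severity): Severity {
  const suggested = parseModelEnum(SEVERITY_VALUES, modelOutput);
  return suggested === null ? staffInput : suggested;
}
```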

9.3 Tenant AI budget exhausted

Compensation: fail-soft per capability matrix in AI_INTEGRATION.md §5; WO flow continues.
Test: integration/ai_budget_exhausted.spec.ts.

10. Edge-case operational failures

10.1 High-severity WO opened on a room with an active reservation, no replacement available, guest already at desk

Detection: reservation.modification.failed.v1 with kind=room_change.
Compensation: WO escalated to GM; UI displays "guest waiting" banner; check-in saga (reservation-service) handles guest reaccommodation manually.
User impact: guest waits; GM intervenes.
Test: scripted E2E in staging.

10.2 Vendor invoice file upload fails

Detection: signed-URL upload returns 4xx and the BFF surfaces the error; we never persist the WO change because the file reference was never confirmed.
Compensation: retry the upload, or skip it and record only the amount.
Test: BFF E2E.

10.3 Multiple lock health alerts for the same device in 1 second

Detection: inbox dedupe + (deviceId, alertCode, dayBucket) natural dedupe.
Compensation: at most one WO per dedupe bucket; subsequent alerts append a comment.
Test: integration/lock_alert_dedupe.spec.ts.
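The natural dedupe key can be sketched as below. It assumes the day bucket is the UTC calendar day of the alert; `LockAlertIntake` and the comment text are hypothetical, and the messageId inbox dedupe that runs first is left out.

```typescript
// Hypothetical sketch of (deviceId, alertCode, dayBucket) natural dedupe.
interface LockAlert {
  deviceId: string;
  alertCode: string;
  occurredAt: Date;
}

interface LockWorkOrder {
  id: string;
  comments: string[];
}

class LockAlertIntake {
  private byBucket = new Map<string, LockWorkOrder>();

  handle(alert: LockAlert): { created: boolean; workOrder: LockWorkOrder } {
    const dayBucket = alert.occurredAt.toISOString().slice(0, 10); // "YYYY-MM-DD" in UTC
    const key = [alert.deviceId, alert.alertCode, dayBucket].join("|");
    const existing = this.byBucket.get(key);
    if (existing) {
      // Same bucket: no new WO, only a comment on the existing one.
      existing.comments.push(`repeat alert at ${alert.occurredAt.toISOString()}`);
      return { created: false, workOrder: existing };
    }
    const workOrder: LockWorkOrder = { id: `wo-${this.byBucket.size + 1}`, comments: [] };
    this.byBucket.set(key, workOrder);
    return { created: true, workOrder };
  }
}
```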

10.4 Generator run-hours regress (sensor or human input goes backward)

Detection: on the sync path, Asset.runHours is merged as max-of per the sync contract, so a lower value never wins. A direct cloud-side REST update to a lower value is rejected with MELMASTOON.MAINTENANCE.ASSET_RUN_HOURS_REGRESSION.
Compensation: UI prompts: "current value is X, you entered Y. Confirm reset?" — if confirmed, RecordAssetHealthUpdateUseCase with force=true (audited).
Test: integration/asset_run_hours_regression.spec.ts.
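The two paths behave differently, which the sketch below makes explicit. The function names are hypothetical; the `force` flag corresponds to the confirmed reset described above, and auditing is omitted.

```typescript
// Hypothetical sketch of run-hours handling on the sync path vs the direct REST path.
type RunHoursResult =
  | { ok: true; runHours: number }
  | { ok: false; code: "MELMASTOON.MAINTENANCE.ASSET_RUN_HOURS_REGRESSION"; current: number };

// Sync path: max-of merge, so a lower incoming value is silently ignored.
function mergeSyncedRunHours(current: number, incoming: number): number {
  return Math.max(current, incoming);
}

// Direct REST path: a lower value is rejected unless the caller confirms a reset.
function updateRunHours(current: number, requested: number, force = false): RunHoursResult {
  if (requested < current && !force) {
    return { ok: false, code: "MELMASTOON.MAINTENANCE.ASSET_RUN_HOURS_REGRESSION", current };
  }
  return { ok: true, runHours: requested };
}
```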

10.5 Schedule `nextDueAt` regression after manual override

Detection: invariant #9.
Compensation: request rejected; UI explains that next-due cannot move backward without manually completing the in-flight schedule.
Test: domain/__tests__/schedule_next_due_regression.spec.ts.

11. Generic failure principles

Fail-soft on advisory dependencies (AI, notification dispatch, asset health forecaster) — primary flow continues, badge surfaced.
Fail-fast on authoritative dependencies (DB, auth, RLS) — return 5xx, let client retry.
Audit everything — every failure produces a structured log + a metric increment. Silent failures are bugs.
No partial state — every command writes WO + outbox in one transaction; either both happen or neither.
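The "no partial state" principle can be illustrated with an in-memory stand-in for a database transaction: WO changes and outbox rows are staged together and become visible only if the whole command succeeds. `TinyStore` and its method names are hypothetical and only demonstrate the all-or-nothing behaviour.

```typescript
// Hypothetical in-memory sketch: WO write and outbox write commit together or not at all.
interface OutboxRow {
  topic: string;
  payload: string;
}

interface StoreTx {
  setStatus(workOrderId: string, status: string): void;
  enqueue(row: OutboxRow): void;
}

class TinyStore {
  workOrders = new Map<string, string>(); // workOrderId -> status
  outbox: OutboxRow[] = [];

  // Runs fn against staged copies; commits both on success, discards both on error.
  transaction(fn: (tx: StoreTx) => void): boolean {
    const stagedWorkOrders = new Map(this.workOrders);
    const stagedOutbox = this.outbox.slice();
    try {
      fn({
        setStatus: (workOrderId, status) => {
          stagedWorkOrders.set(workOrderId, status);
        },
        enqueue: (row) => {
          stagedOutbox.push(row);
        },
      });
    } catch {
      return false; // rollback: staged copies are dropped
    }
    this.workOrders = stagedWorkOrders;
    this.outbox = stagedOutbox;
    return true;
  }
}
```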
