housekeeping-service — FAILURE_MODES

Catalog of how this service fails, what the user sees, what the system does automatically, and what an on-call should do. Each row is a runbook anchor.

1. Storage failures

1.1 Cloud SQL primary unavailable

Detection: repository errors spike; Cloud SQL connection failures metric > 0; healthcheck /health/ready flips to 503.
System behaviour: Cloud Run drains traffic from instances failing healthchecks; HTTPS LB returns 503; Pub/Sub push subscribers nack and back off; the outbox relay halts, since its poll for rows with published_at IS NULL cannot run.
User-visible: REST 503 with Retry-After: 5; desktop sync queues operations locally.
Auto-mitigation: Cloud SQL HA failover within ~60 s; healthcheck recovers; Pub/Sub redelivers backlog with exponential backoff.
Manual: on-call confirms failover via the Cloud SQL console, then checks that outbox.unpublished_rows has returned to baseline within 15 min; if not, triggers the outbox-backlog runbook.

1.2 Slow queries / partition explosion

Detection: melmastoon.housekeeping.api.latency_ms p99 > 250 ms; Cloud SQL slow query log spikes.
Root causes: missing partition prune (no created_at predicate), missing index after a schema change, runaway analyst query.
Manual: check EXPLAIN plan against partition-pruning.spec.ts baselines; verify pg_partman jobs ran; revert offending query/route.

1.3 Disk pressure

Detection: Cloud SQL storage > 80% (auto-grow on, but slow).
Manual: verify monthly partitions are being detached + archived; run extra archive job; resize.

2. Outbox / relay failures

2.1 Outbox backlog

Detection: outbox.unpublished_rows > 1000 for 5 min.
System behaviour: events delayed but not lost.
Manual: check outbox.append_to_publish_lag_ms; verify the Pub/Sub topic exists and the SA has pubsub.publisher; check relay process health; if stuck, restart Cloud Run instances; see the outbox-backlog.md runbook.
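
The detection rule above ("> 1000 for 5 min") is a sustained-threshold condition. The sketch below shows one way to reason about it offline, for example when replaying metric samples during a post-mortem. The Sample shape and the sustainedAbove name are assumptions for illustration; the real evaluation happens in the alerting stack, not in this service.

```typescript
interface Sample {
  atMs: number; // sample timestamp, epoch milliseconds
  value: number; // e.g. outbox.unpublished_rows at that time
}

// Hypothetical helper: true when the most recent unbroken run of samples above
// the threshold has lasted at least windowMs. Samples must be sorted by atMs ascending.
function sustainedAbove(
  samples: Sample[],
  threshold: number,
  windowMs: number,
  nowMs: number,
): boolean {
  let breachStart: number | null = null;
  // Walk backwards from the newest sample until the value drops to or below the threshold.
  for (let i = samples.length - 1; i >= 0; i--) {
    if (samples[i].value <= threshold) break;
    breachStart = samples[i].atMs;
  }
  return breachStart !== null && nowMs - breachStart >= windowMs;
}
```

A single dip to or below the threshold resets the run, so a backlog that briefly drains and refills would not fire until it has been above 1000 for a fresh 5 minutes.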

2.2 Relay double-publish

Possible cause: crash between publish and UPDATE published_at.
Mitigation: consumers are idempotent on (subject, event_id); effect = ack-only on duplicate.
Manual: none required.
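
A minimal sketch of the idempotency described above, assuming an envelope with subject and eventId fields. The IdempotentConsumer class and the in-memory Set are hypothetical stand-ins; a real consumer would need a durable record of processed keys that survives restarts.

```typescript
type Outcome = "processed" | "duplicate";

interface EventEnvelope {
  subject: string;
  eventId: string;
  payload: unknown;
}

// Hypothetical consumer wrapper: de-duplicates on the (subject, event_id) pair.
class IdempotentConsumer {
  // In-memory stand-in for a durable processed-events store (assumption).
  private readonly seen = new Set<string>();
  private readonly handler: (event: EventEnvelope) => void;

  constructor(handler: (event: EventEnvelope) => void) {
    this.handler = handler;
  }

  handle(event: EventEnvelope): Outcome {
    // JSON-encode the pair so distinct (subject, eventId) pairs cannot collide.
    const key = JSON.stringify([event.subject, event.eventId]);
    if (this.seen.has(key)) {
      return "duplicate"; // ack only, no side effects are re-run
    }
    this.handler(event);
    this.seen.add(key); // record only after the handler succeeded
    return "processed";
  }
}
```

Because the key is recorded only after the handler returns, a handler that throws leaves the event eligible for redelivery.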

2.3 Outbox table corruption (extremely unlikely)

Manual: restore from PITR to a checkpoint table; replay from there.

3. Pub/Sub consumer failures

3.1 Consumer lag

Detection: subscription/oldest_unacked_message_age > 60 s for 5 min.
Causes: handler error storm, downstream dependency slow, instance count too low.
System behaviour: Pub/Sub retries with backoff; messages eventually DLQ after max_delivery_attempts=10.
Manual: check Sentry for handler errors; consider raising max instances; see consumer-lag.md.

3.2 DLQ growth

Detection: melmastoon.dlq.housekeeping > 10 in 15 min.
Manual: inspect DLQ payloads; classify (poison message vs persistent dependency outage); drain via the dlq-replay Cloud Run Job after fixing root cause.

3.3 Wrong / missing OIDC token

Detection: /internal/events/* returns 401; auth-failure metric spikes.
Manual: verify the push subscription's oidc_token.audience matches the audience configured in this service and that oidc_token.service_account_email is the expected service account; confirm that SA has roles/run.invoker on this Cloud Run service.
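
When debugging a 401 on /internal/events/*, it can help to check the decoded token claims one at a time. The sketch below assumes the token's signature has already been verified against Google's public keys and only compares claims; checkPushClaims and the ExpectedPushIdentity shape are hypothetical names, not this service's actual middleware.

```typescript
// Subset of the claims carried by a Pub/Sub push OIDC token.
interface PushTokenClaims {
  aud: string;
  email: string;
  email_verified: boolean;
  exp: number; // expiry, seconds since epoch
}

// What this service is configured to accept (assumed configuration shape).
interface ExpectedPushIdentity {
  audience: string;
  serviceAccountEmail: string;
}

type ClaimCheck = { ok: true } | { ok: false; reason: string };

// Hypothetical claim check; signature verification is assumed to happen before this.
function checkPushClaims(
  claims: PushTokenClaims,
  expected: ExpectedPushIdentity,
  nowSeconds: number,
): ClaimCheck {
  if (claims.exp <= nowSeconds) {
    return { ok: false, reason: "token_expired" };
  }
  if (claims.aud !== expected.audience) {
    return { ok: false, reason: "audience_mismatch" };
  }
  if (!claims.email_verified || claims.email !== expected.serviceAccountEmail) {
    return { ok: false, reason: "service_account_mismatch" };
  }
  return { ok: true };
}
```

Logging the reason string alongside the auth-failure metric would make it easier to tell a misconfigured audience from a wrong service account.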

4. Domain / use-case failures

4.1 Concurrency conflict storm

Detection: 409 CONCURRENCY_CONFLICT rate > 10/s for 60 s on the same aggregate.
Cause: two clients (desktop + automation) racing on the same task; bug in retry logic.
System behaviour: API returns 409; clients retry with backoff.
Manual: identify offending client via requestId correlation; throttle if needed.
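
Since a bug in client retry logic is one of the listed causes, a reference for what well-behaved backoff looks like may be useful when reviewing a client. The sketch below uses exponential backoff with full jitter; the jitter, the retryDelayMs name, and the base and cap values in the usage note are assumptions, not the documented client behaviour.

```typescript
// Hypothetical client-side helper: exponential backoff with full jitter.
// The delay is uniform in [0, min(capMs, baseMs * 2^attempt)).
function retryDelayMs(
  attempt: number, // 0 for the first retry
  baseMs: number,
  capMs: number,
  random: () => number = Math.random, // injectable for deterministic tests
): number {
  const ceiling = Math.min(capMs, baseMs * Math.pow(2, attempt));
  return Math.floor(random() * ceiling);
}
```

With baseMs = 100 and capMs = 5000, the upper bound doubles from 100 ms per attempt until it reaches the cap. The randomisation spreads two racing clients apart, so they are less likely to collide on the same aggregate again.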

4.2 `MELMASTOON.HOUSEKEEPING.STAFF_UNAVAILABLE`

Cause: stale shift cache; staff went off duty before the shift-end event propagated.
Mitigation: cache TTL is 60 s; refresh on shift events.
Manual: if persistent, force cache flush; verify staff.shift.ended.v1 consumer is processing.
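
The mitigation above combines a 60 s TTL with event-driven refresh. A minimal sketch of that pattern follows, assuming a cache keyed by staff ID; the ShiftCache class and its method names are hypothetical and only illustrate why a stale entry can live for at most one TTL when the shift event is missed.

```typescript
// Hypothetical TTL cache keyed by staff ID. The clock is passed in to keep it testable.
class ShiftCache<V> {
  private readonly entries = new Map<string, { value: V; expiresAtMs: number }>();
  private readonly ttlMs: number;

  constructor(ttlMs: number = 60000) {
    this.ttlMs = ttlMs;
  }

  get(staffId: string, nowMs: number): V | undefined {
    const entry = this.entries.get(staffId);
    if (entry === undefined) return undefined;
    if (nowMs >= entry.expiresAtMs) {
      this.entries.delete(staffId); // expired: the caller must reload from the source
      return undefined;
    }
    return entry.value;
  }

  set(staffId: string, value: V, nowMs: number): void {
    this.entries.set(staffId, { value, expiresAtMs: nowMs + this.ttlMs });
  }

  // Called when a shift event for this staff member arrives.
  invalidate(staffId: string): void {
    this.entries.delete(staffId);
  }

  // Manual "force cache flush" from the runbook step.
  flush(): void {
    this.entries.clear();
  }
}
```

If the staff.shift.ended.v1 consumer stops processing, invalidate is never called and staleness is bounded only by the TTL, which is why the manual step checks the consumer.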

4.3 `MELMASTOON.HOUSEKEEPING.LINEN_OUT_OF_STOCK` blocks task

System behaviour: assignment refused; alert fires.
Manual: supervisor adjusts on-hand via POST /linen/{id}/issue after physical resupply, or marks linen_required=false on the affected task (audit-flagged).

4.4 `MELMASTOON.HOUSEKEEPING.ROOM_STATE_CONFLICT`

Cause: illegal manual flip or lagging room state.
System behaviour: API returns 422; sync push surfaces conflict toast on desktop.
Manual: investigate via room_status_audit history; one-shot manual flip with reason if legitimate.

5. Sync failures

5.1 Cursor expired (`410 SYNC_CURSOR_EXPIRED`)

Cause: desktop offline > snapshot retention window.
System behaviour: desktop falls back to full re-sync.
User-visible: "Refreshing data" indicator; takes seconds to a minute depending on volume.
Manual: none expected.
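
The fallback described above can be sketched as a single sync step that switches to a full snapshot when the server answers 410. The SyncServer interface, the snapshot method, and the row representation are assumptions for illustration; the real desktop client is presumably asynchronous and talks to SQLite rather than to in-memory arrays.

```typescript
type PullResult =
  | { status: 200; changes: string[]; nextCursor: string }
  | { status: 410; code: "SYNC_CURSOR_EXPIRED" };

// Hypothetical, synchronous view of the sync API.
interface SyncServer {
  pull(cursor: string): PullResult;
  snapshot(): { rows: string[]; cursor: string };
}

interface LocalState {
  rows: string[];
  cursor: string;
}

// One sync step: incremental when the cursor is valid, full re-sync on 410.
function syncOnce(
  server: SyncServer,
  local: LocalState,
): { state: LocalState; mode: "incremental" | "full" } {
  const result = server.pull(local.cursor);
  if (result.status === 410) {
    // Cursor is older than the retention window: discard local rows and start over.
    const snap = server.snapshot();
    return { state: { rows: snap.rows, cursor: snap.cursor }, mode: "full" };
  }
  return {
    state: { rows: [...local.rows, ...result.changes], cursor: result.nextCursor },
    mode: "incremental",
  };
}
```

The mode field is what a client could use to decide whether to show the "Refreshing data" indicator.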

5.2 Push retry storm

Detection: sync.push.ops{outcome=deferred} > 100/s.
Cause: server back-pressure; renderer not honoring backoff.
Manual: verify renderer version; bump server min-instances temporarily.

5.3 Local SQLite corruption

Detection: desktop logs SQLITE_CORRUPT.
Mitigation: desktop deletes housekeeping.db and full re-syncs.
Manual: support ticket; root-cause investigation if pattern.

6. AI port failures

6.1 Routing port timeout

Detection: ai.routing.fallback counter > 0.
System behaviour: empty suggestion; manual mode.
Manual: if persistent (> 30 min), confirm ai-orchestrator-service health; consider disabling auto-apply gate via tenant settings.

6.2 Routing applied bad assignment

Cause: routing model returned invalid staffId (off-shift, wrong tenant).
System behaviour: application layer rejects via STAFF_UNAVAILABLE; row dropped from suggestion; audit row created.
Manual: capture suggestion ID and report to AI team for model regression test.

7. Multi-service interactions

7.1 `reservation.checked_out.v1` arrives but room not in `property-service`

Cause: event-out-of-order (room archived just before checkout).
System behaviour: CreateTaskUseCase looks up room, gets 404 → emits task.cancelled.v1 with reason=room_archived; logs WARN.
Manual: none required; investigated only if frequent.
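
A rough sketch of the behaviour above, with the room lookup and the logger injected as plain functions. The handleCheckedOut name, the RoomLookup shape, and the task.created.v1 event on the happy path are assumptions; only task.cancelled.v1 with reason=room_archived and the WARN log come from this section.

```typescript
type RoomLookup = { found: true; roomId: string } | { found: false };

type EmittedEvent =
  | { type: "task.created.v1"; roomId: string } // hypothetical happy-path event
  | { type: "task.cancelled.v1"; roomId: string; reason: "room_archived" };

// Hypothetical core of the use case: decide which event to emit for a checkout.
function handleCheckedOut(
  roomId: string,
  lookupRoom: (id: string) => RoomLookup, // stands in for the property-service call
  warn: (message: string) => void,
): EmittedEvent {
  const room = lookupRoom(roomId);
  if (!room.found) {
    // A 404 from the lookup is treated as "room archived", not as a retryable failure.
    warn(`room ${roomId} not found in property-service; cancelling task`);
    return { type: "task.cancelled.v1", roomId, reason: "room_archived" };
  }
  return { type: "task.created.v1", roomId };
}
```

Treating the missing room as a terminal outcome keeps the message from being retried into the DLQ for a room that will never reappear.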

7.2 `maintenance.work_order.completed.v1` for room with no prior maintenance task

Cause: maintenance opened standalone (not via cleaning hand-off).
System behaviour: unblock room, create post_maintenance task.
Manual: none.

7.3 `staff.shift.ended.v1` while staff has open task

Cause: shift ended before completion.
System behaviour: open tasks reset to pending; router re-suggests; task.reassigned.v1 emitted on next assignment; escalated.v1 if no replacement found within 5 min.
Manual: supervisor action via board.
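
The reset and escalation rules above can be sketched as two small pure functions. The task status names other than pending, the Task shape, and both function names are assumptions for illustration; the 5 min escalation window is the one stated in this section.

```typescript
interface Task {
  id: string;
  status: "pending" | "assigned" | "in_progress" | "done"; // names other than "pending" are assumed
  assigneeId: string | null;
}

// Hypothetical: put the departing staff member's open tasks back into the pending pool.
function releaseTasksOnShiftEnd(tasks: Task[], staffId: string): Task[] {
  return tasks.map((task): Task => {
    const isOpen = task.status === "assigned" || task.status === "in_progress";
    if (task.assigneeId === staffId && isOpen) {
      return { ...task, status: "pending", assigneeId: null };
    }
    return task; // finished tasks and other people's tasks are untouched
  });
}

const ESCALATION_WINDOW_MS = 5 * 60 * 1000;

// Hypothetical: escalate when a released task still has no assignee after the window.
function shouldEscalate(task: Task, releasedAtMs: number, nowMs: number): boolean {
  return task.assigneeId === null && nowMs - releasedAtMs >= ESCALATION_WINDOW_MS;
}
```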

8. Security events

8.1 RLS deny in production

Detection: any RLS deny in prod = page on-call.
Manual: open SEV-2; this should never happen given our current_tenant_id() setup. Investigate the offending request immediately.

8.2 Unexpected JWT role escalation

Detection: auth.role.unexpected (a non-supervisor calling supervisor-only routes) > 0.
Manual: rotate JWT signing key; force iam-service re-issuance.

9. Data integrity

9.1 Outbox event with invalid payload

Detection: consumer rejects with schema-validation error.
Mitigation: event versioning; producer must emit valid vN+1 only after consumer migration.
Manual: if the producer is ours, hot-fix the payload field; for an external producer, coordinate via its owner.

9.2 `room_status` row missing

Cause: brand-new room not yet projected.
Mitigation: FlipRoomStatusUseCase and the room-event consumer create the row on first touch.
Manual: none.
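
"Create the row on first touch" is a get-or-create pattern. The sketch below shows the idea against an in-memory Map; the RoomStatusRow shape, the getOrCreateRoomStatus name, and the initial status value are assumptions. Against the real database, this would more likely be a single insert that tolerates an existing row, so that two concurrent first touches do not conflict.

```typescript
interface RoomStatusRow {
  roomId: string;
  status: string;
}

// Hypothetical get-or-create: returns the existing row, or creates one with initialStatus.
function getOrCreateRoomStatus(
  store: Map<string, RoomStatusRow>, // in-memory stand-in for the room_status table
  roomId: string,
  initialStatus: string,
): RoomStatusRow {
  const existing = store.get(roomId);
  if (existing !== undefined) {
    return existing; // never overwrite a row that is already projected
  }
  const created: RoomStatusRow = { roomId, status: initialStatus };
  store.set(roomId, created);
  return created;
}
```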

10. Operational checklist for incident response

Confirm the alert in the Grafana housekeeping folder.
Check outbox.unpublished_rows, consumer lag, and 5xx rate dashboards.
Pull last 50 ERROR logs filtered to the affected route/use case.
Identify if the issue is internal (code bug) or external (DB, Pub/Sub, dependency).
If incident exceeds 15 min, post in #hk-ops, escalate per platform standard, start an incident doc.
Once mitigated, file a post-mortem and link it to the runbook that was updated as part of the fix.

11. Cross-link

SLOs and alerts: OBSERVABILITY.md.
Risk register (longer-horizon): SERVICE_RISK_REGISTER.md.
Topology context: DEPLOYMENT_TOPOLOGY.md.
