housekeeping-service — FAILURE_MODES

Catalog of how this service fails, what the user sees, what the system does automatically, and what an on-call should do. Each row is a runbook anchor.


1. Storage failures

1.1 Cloud SQL primary unavailable

  • Detection: repository errors spike; Cloud SQL connection failures metric > 0; healthcheck /health/ready flips to 503.
  • System behaviour: Cloud Run drains traffic from instances failing healthchecks; HTTPS LB returns 503; Pub/Sub push subscribers nack and back off; the outbox relay halts, leaving rows with published_at IS NULL unpublished until the database returns.
  • User-visible: REST 503 with Retry-After: 5; desktop sync queues operations locally.
  • Auto-mitigation: Cloud SQL HA failover within ~60 s; healthcheck recovers; Pub/Sub redelivers backlog with exponential backoff.
  • Manual: on-call confirms failover via the Cloud SQL console, then checks that outbox.unpublished_rows returns to baseline within 15 min; if not, follows the outbox-backlog runbook.
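
The 503 with Retry-After: 5 described above implies that clients wait before retrying. A minimal sketch of how a client might compute that wait is shown here; retryDelayMs and its parameters are hypothetical names, not part of this service or the desktop client, and the 5000 ms fallback simply mirrors the documented header value.

```typescript
// Hypothetical client-side helper; the name and signature are assumptions.
// Returns how long to wait before retrying, or null when no retry applies.
function retryDelayMs(
  status: number,
  retryAfter: string | null,
  nowMs: number,
  fallbackMs: number = 5000, // mirrors the documented "Retry-After: 5"
): number | null {
  if (status !== 503) return null; // this sketch only retries 503
  if (retryAfter === null || retryAfter.trim() === "") return fallbackMs;

  // Form 1: a delay in seconds, e.g. "5".
  const seconds = Number(retryAfter.trim());
  if (isFinite(seconds)) return seconds >= 0 ? seconds * 1000 : fallbackMs;

  // Form 2: an HTTP-date, e.g. "Thu, 01 Jan 1970 00:00:10 GMT".
  const dateMs = Date.parse(retryAfter);
  if (!isNaN(dateMs)) return Math.max(0, dateMs - nowMs);

  return fallbackMs; // header present but unparseable
}
```

While the returned delay is pending, a desktop client would keep its operations in the local queue, as the user-visible bullet describes.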

1.2 Slow queries / partition explosion

  • Detection: melmastoon.housekeeping.api.latency_ms p99 > 250 ms; Cloud SQL slow query log spikes.
  • Root causes: missing partition prune (no created_at predicate), missing index after a schema change, runaway analyst query.
  • Manual: check EXPLAIN plan against partition-pruning.spec.ts baselines; verify pg_partman jobs ran; revert offending query/route.

1.3 Disk pressure

  • Detection: Cloud SQL storage > 80% (automatic storage increase is on, but it can lag behind rapid growth).
  • Manual: verify monthly partitions are being detached + archived; run extra archive job; resize.

2. Outbox / relay failures

2.1 Outbox backlog

  • Detection: outbox.unpublished_rows > 1000 for 5 min.
  • System behaviour: events delayed but not lost.
  • Manual: check outbox.append_to_publish_lag_ms; verify the Pub/Sub topic exists and the relay's SA has roles/pubsub.publisher; check relay process health; if stuck, restart Cloud Run instances; follow the outbox-backlog.md runbook.
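
Several detections in this catalog take the form "metric above a threshold for N minutes". The sketch below shows one way such a sustained-breach rule could be evaluated for outbox.unpublished_rows > 1000 for 5 min. It is illustrative only: the real rule lives in the alerting system, and MetricSample, breachDurationMs, and backlogAlertFires are hypothetical names.

```typescript
interface MetricSample {
  atMs: number;  // sample timestamp, epoch milliseconds
  value: number; // e.g. outbox.unpublished_rows at that time
}

// Length of the unbroken run of above-threshold samples that ends at the
// newest sample. Samples are assumed to be sorted oldest-first.
function breachDurationMs(samples: MetricSample[], threshold: number): number {
  let runStartMs: number | null = null;
  for (let i = samples.length - 1; i >= 0; i--) {
    if (samples[i].value <= threshold) break; // the run is interrupted here
    runStartMs = samples[i].atMs;
  }
  if (runStartMs === null) return 0;
  return samples[samples.length - 1].atMs - runStartMs;
}

// Defaults follow the documented rule: > 1000 rows for 5 minutes.
function backlogAlertFires(
  samples: MetricSample[],
  threshold: number = 1000,
  forMs: number = 5 * 60 * 1000,
): boolean {
  return breachDurationMs(samples, threshold) >= forMs;
}
```

A single dip to or below the threshold resets the run, so a brief recovery silences the alert until the breach has again lasted the full window.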

2.2 Relay double-publish

  • Possible cause: crash between publish and UPDATE published_at.
  • Mitigation: consumers are idempotent on (subject, event_id); a duplicate is acked with no further effect.
  • Manual: none required.
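
The idempotency rule above can be sketched as a consumer that remembers each (subject, event_id) pair and only acks on a repeat. This is an in-memory illustration with hypothetical names (IdempotentConsumer, HandleOutcome). A production consumer would presumably persist the key in the same transaction as the side effect, which this sketch does not attempt.

```typescript
type HandleOutcome = "processed" | "duplicate";

class IdempotentConsumer<T> {
  // Keys already handled; in-memory only for the purpose of this sketch.
  private readonly seen = new Set<string>();
  private readonly apply: (payload: T) => void;

  constructor(apply: (payload: T) => void) {
    this.apply = apply;
  }

  handle(subject: string, eventId: string, payload: T): HandleOutcome {
    // JSON-encode the pair so no delimiter choice can make two keys collide.
    const key = JSON.stringify([subject, eventId]);
    if (this.seen.has(key)) return "duplicate"; // ack-only, no side effect
    this.apply(payload);
    this.seen.add(key); // recorded only after the effect succeeded
    return "processed";
  }
}
```

Both outcomes map to an ack toward Pub/Sub, which is why a relay double-publish needs no manual action.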

2.3 Outbox table corruption (extremely unlikely)

  • Manual: restore from PITR to a checkpoint table; replay from there.

3. Pub/Sub consumer failures

3.1 Consumer lag

  • Detection: subscription/oldest_unacked_message_age > 60 s for 5 min.
  • Causes: handler error storm, downstream dependency slow, instance count too low.
  • System behaviour: Pub/Sub retries with backoff; messages are dead-lettered to the DLQ after max_delivery_attempts=10.
  • Manual: check Sentry for handler errors; consider raising max instances; see consumer-lag.md.

3.2 DLQ growth

  • Detection: melmastoon.dlq.housekeeping > 10 in 15 min.
  • Manual: inspect DLQ payloads; classify (poison message vs persistent dependency outage); drain via the dlq-replay Cloud Run Job after fixing root cause.

3.3 Wrong / missing OIDC token

  • Detection: /internal/events/* returns 401; auth-failure metric spikes.
  • Manual: verify the push subscription's oidc_token.audience matches the audience configured in this service and that oidc_token.service_account_email is the expected SA; confirm that SA has roles/run.invoker on this Cloud Run service.
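
When debugging a 401 on /internal/events/*, it can help to compare the claims of a rejected token against what the service expects. The sketch below assumes the token's signature has already been verified elsewhere and that its payload has been decoded into an object. pushTokenMismatches and both interfaces are hypothetical names, and the values in the usage note are placeholders.

```typescript
// Claims carried by a Google-signed OIDC token on a Pub/Sub push request.
interface PushTokenClaims {
  iss: string;
  aud: string;
  email: string;
  email_verified: boolean;
}

// What this service is configured to accept (assumed configuration shape).
interface ExpectedPushIdentity {
  audience: string;
  serviceAccountEmail: string;
}

// Returns a human-readable list of mismatches; an empty list means the
// claims match the expected identity.
function pushTokenMismatches(
  claims: PushTokenClaims,
  expected: ExpectedPushIdentity,
): string[] {
  const problems: string[] = [];
  const issuers = ["https://accounts.google.com", "accounts.google.com"];
  if (issuers.indexOf(claims.iss) === -1) {
    problems.push(`unexpected iss: ${claims.iss}`);
  }
  if (claims.aud !== expected.audience) {
    problems.push(`aud is ${claims.aud}, expected ${expected.audience}`);
  }
  if (claims.email !== expected.serviceAccountEmail) {
    problems.push(`email is ${claims.email}, expected ${expected.serviceAccountEmail}`);
  }
  if (!claims.email_verified) {
    problems.push("email_verified is false");
  }
  return problems;
}
```

For example, a token whose aud is https://old.example.test checked against an expected audience of https://hk.example.test yields a single aud mismatch, which points at the subscription's audience setting rather than at the service account.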

4. Domain / use-case failures

4.1 Concurrency conflict storm

  • Detection: 409 CONCURRENCY_CONFLICT rate > 10/s for 60 s on the same aggregate.
  • Cause: two clients (desktop + automation) racing on the same task, or a bug in client retry logic.
  • System behaviour: API returns 409; clients retry with backoff.
  • Manual: identify offending client via requestId correlation; throttle if needed.
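
The 409 above is the usual outcome of optimistic concurrency control. The source does not say how the aggregate is versioned, so the sketch below assumes a numeric version field that each writer must echo back. VersionedTask, UpdateResult, and updateStatus are hypothetical names.

```typescript
interface VersionedTask {
  id: string;
  version: number; // assumed: incremented on every successful write
  status: string;
}

type UpdateResult =
  | { ok: true; task: VersionedTask }
  | { ok: false; httpStatus: 409; code: "CONCURRENCY_CONFLICT"; currentVersion: number };

// Applies the write only if the caller saw the latest version.
function updateStatus(
  current: VersionedTask,
  expectedVersion: number,
  status: string,
): UpdateResult {
  if (current.version !== expectedVersion) {
    return {
      ok: false,
      httpStatus: 409,
      code: "CONCURRENCY_CONFLICT",
      currentVersion: current.version, // lets the client re-read and retry
    };
  }
  return {
    ok: true,
    task: { id: current.id, version: current.version + 1, status: status },
  };
}
```

If desktop and automation both read version 3, the first write succeeds and produces version 4; the second still carries 3 and receives the conflict. A client that retries without re-reading would repeat the 409 indefinitely, which is one way a retry bug can turn into a storm.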

4.2 MELMASTOON.HOUSEKEEPING.STAFF_UNAVAILABLE

  • Cause: stale shift cache; staff went off-duty before the shift-end event propagated.
  • Mitigation: cache TTL is 60 s; refresh on shift events.
  • Manual: if persistent, force cache flush; verify staff.shift.ended.v1 consumer is processing.
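
The mitigation above combines a 60 s TTL with refresh on shift events. A minimal in-memory sketch of that combination follows. ShiftCache, its loader callback, and the method names are hypothetical, and the clock is passed in explicitly to keep the example deterministic.

```typescript
class ShiftCache {
  private readonly entries = new Map<string, { onShift: boolean; fetchedAtMs: number }>();
  private readonly load: (staffId: string) => boolean;
  private readonly ttlMs: number;

  // load stands in for whatever source of truth holds shift state.
  constructor(load: (staffId: string) => boolean, ttlMs: number = 60 * 1000) {
    this.load = load;
    this.ttlMs = ttlMs;
  }

  isOnShift(staffId: string, nowMs: number): boolean {
    const hit = this.entries.get(staffId);
    if (hit !== undefined && nowMs - hit.fetchedAtMs < this.ttlMs) {
      return hit.onShift; // may be stale for up to ttlMs
    }
    const onShift = this.load(staffId);
    this.entries.set(staffId, { onShift: onShift, fetchedAtMs: nowMs });
    return onShift;
  }

  // Called from the shift-event consumer (e.g. staff.shift.ended.v1).
  onShiftEvent(staffId: string): void {
    this.entries.delete(staffId);
  }
}
```

If the shift-event consumer stops processing, entries can only expire through the TTL, which is the window in which STAFF_UNAVAILABLE can fire on stale data.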

4.3 MELMASTOON.HOUSEKEEPING.LINEN_OUT_OF_STOCK blocks task

  • System behaviour: assignment refused; alert fires.
  • Manual: supervisor adjusts on-hand via POST /linen/{id}/issue after physical resupply, or marks linen_required=false on the affected task (audit-flagged).

4.4 MELMASTOON.HOUSEKEEPING.ROOM_STATE_CONFLICT

  • Cause: illegal manual flip or lagging room state.
  • System behaviour: API returns 422; sync push surfaces conflict toast on desktop.
  • Manual: investigate via room_status_audit history; one-shot manual flip with reason if legitimate.

5. Sync failures

5.1 Cursor expired (410 SYNC_CURSOR_EXPIRED)

  • Cause: desktop offline > snapshot retention window.
  • System behaviour: desktop falls back to full re-sync.
  • User-visible: "Refreshing data" indicator; takes seconds to a minute depending on volume.
  • Manual: none expected.
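
The cursor-expiry path can be sketched as a server-side decision plus the client's reaction. The source does not describe the cursor format, so this sketch assumes the cursor is a timestamp compared against a retention window. decidePull, nextClientAction, and PullDecision are hypothetical names.

```typescript
type PullDecision =
  | { httpStatus: 200; mode: "delta"; sinceMs: number }
  | { httpStatus: 410; code: "SYNC_CURSOR_EXPIRED" };

// Server side: a cursor older than the retention window cannot be served
// incrementally, because the changes it needs may already be gone.
function decidePull(cursorMs: number, nowMs: number, retentionMs: number): PullDecision {
  if (nowMs - cursorMs > retentionMs) {
    return { httpStatus: 410, code: "SYNC_CURSOR_EXPIRED" };
  }
  return { httpStatus: 200, mode: "delta", sinceMs: cursorMs };
}

// Client side: a 410 means "discard the cursor and start over".
function nextClientAction(decision: PullDecision): "apply_delta" | "full_resync" {
  return decision.httpStatus === 410 ? "full_resync" : "apply_delta";
}
```

With a retention window of seven days (an illustrative figure, not taken from the source), a desktop that was offline for eight days would get the 410 and show the "Refreshing data" indicator while it re-syncs.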

5.2 Push retry storm

  • Detection: sync.push.ops{outcome=deferred} > 100/s.
  • Cause: server back-pressure; renderer not honoring backoff.
  • Manual: verify renderer version; bump server min-instances temporarily.
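
A renderer that honors backoff typically spaces its retries with capped exponential delays plus jitter, so that many desktops do not retry in lockstep. The source does not specify the renderer's policy; the sketch below shows the common "full jitter" variant under that assumption. backoffDelayMs and its parameters are hypothetical.

```typescript
// Delay before retry number `attempt` (0-based). The ceiling doubles each
// attempt up to capMs, and the actual delay is drawn uniformly below it.
// `random` is injectable so the function can be tested deterministically.
function backoffDelayMs(
  attempt: number,
  baseMs: number,
  capMs: number,
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(capMs, baseMs * Math.pow(2, attempt));
  return Math.floor(random() * ceiling);
}
```

A renderer version that retries deferred operations immediately, with no such delay, would show up as the sync.push.ops{outcome=deferred} spike described above.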

5.3 Local SQLite corruption

  • Detection: desktop logs SQLITE_CORRUPT.
  • Mitigation: desktop deletes housekeeping.db and full re-syncs.
  • Manual: open a support ticket; start a root-cause investigation if a pattern emerges.

6. AI port failures

6.1 Routing port timeout

  • Detection: ai.routing.fallback counter > 0.
  • System behaviour: the port returns an empty suggestion; assignment falls back to manual mode.
  • Manual: if persistent (> 30 min), confirm ai-orchestrator-service health; consider disabling auto-apply gate via tenant settings.

6.2 Routing applied bad assignment

  • Cause: routing model returned invalid staffId (off-shift, wrong tenant).
  • System behaviour: application layer rejects via STAFF_UNAVAILABLE; row dropped from suggestion; audit row created.
  • Manual: capture suggestion ID and report to AI team for model regression test.
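
The application-layer check above can be sketched as a filter that drops invalid rows from a suggestion and records an audit row for each. All type and function names here (StaffRecord, SuggestionRow, AuditRow, vetSuggestion) are hypothetical, and the staff directory is modeled as a plain Map.

```typescript
interface StaffRecord {
  staffId: string;
  tenantId: string;
  onShift: boolean;
}

interface SuggestionRow {
  taskId: string;
  staffId: string;
}

interface AuditRow {
  taskId: string;
  staffId: string;
  code: "STAFF_UNAVAILABLE";
  detail: string;
}

// Splits a routing suggestion into rows that may be applied and audit rows
// for the ones that were rejected.
function vetSuggestion(
  rows: SuggestionRow[],
  tenantId: string,
  staff: Map<string, StaffRecord>,
): { accepted: SuggestionRow[]; audit: AuditRow[] } {
  const accepted: SuggestionRow[] = [];
  const audit: AuditRow[] = [];
  for (const row of rows) {
    const record = staff.get(row.staffId);
    let detail: string | null = null;
    if (record === undefined) detail = "unknown staffId";
    else if (record.tenantId !== tenantId) detail = "wrong tenant";
    else if (!record.onShift) detail = "off shift";

    if (detail === null) {
      accepted.push(row);
    } else {
      audit.push({ taskId: row.taskId, staffId: row.staffId, code: "STAFF_UNAVAILABLE", detail: detail });
    }
  }
  return { accepted: accepted, audit: audit };
}
```

The audit rows give the AI team concrete cases to attach to the suggestion ID when a model regression is reported.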

7. Multi-service interactions

7.1 reservation.checked_out.v1 arrives but room not in property-service

  • Cause: events arrived out of order (room archived just before checkout).
  • System behaviour: CreateTaskUseCase looks up room, gets 404 → emits task.cancelled.v1 with reason=room_archived; logs WARN.
  • Manual: none required; investigated only if frequent.
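
The behaviour above can be sketched as a handler that treats a missing room as an expected outcome instead of an error. This is not the actual CreateTaskUseCase: handleCheckedOut, OutboundEvent, and the injected callbacks are hypothetical, and task.created.v1 is an assumed name for the happy-path event, which the source does not give.

```typescript
interface OutboundEvent {
  type: string;
  roomId: string;
  reason?: string;
}

function handleCheckedOut(
  roomId: string,
  findRoom: (roomId: string) => { roomId: string } | null, // null stands in for a 404
  emit: (event: OutboundEvent) => void,
  warn: (message: string) => void,
): void {
  const room = findRoom(roomId);
  if (room === null) {
    // Room was archived before the checkout event arrived: cancel, don't fail.
    warn(`room ${roomId} not found in property-service; cancelling task`);
    emit({ type: "task.cancelled.v1", roomId: roomId, reason: "room_archived" });
    return;
  }
  emit({ type: "task.created.v1", roomId: roomId }); // assumed event name
}
```

Returning normally in the 404 branch means the message is acked and not retried, which fits the "none required" manual step.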

7.2 maintenance.work_order.completed.v1 for room with no prior maintenance task

  • Cause: maintenance opened standalone (not via cleaning hand-off).
  • System behaviour: unblock room, create post_maintenance task.
  • Manual: none.

7.3 staff.shift.ended.v1 while staff has open task

  • Cause: shift ended before completion.
  • System behaviour: open tasks reset to pending; router re-suggests; task.reassigned.v1 emitted on next assignment; escalated.v1 if no replacement found within 5 min.
  • Manual: supervisor action via board.
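
The shift-end handling above has two parts: resetting the departing staff member's open tasks to pending, and escalating any that stay unassigned past 5 minutes. A sketch of both as pure functions follows. BoardTask, its status values, and the function names are hypothetical, and emitting the actual events is left out.

```typescript
interface BoardTask {
  id: string;
  status: "pending" | "in_progress" | "done";
  assigneeId: string | null;
  pendingSinceMs: number | null; // set when a task is reset by a shift end
}

// Part 1: open tasks of the departing staff member go back to pending.
function resetTasksOnShiftEnd(tasks: BoardTask[], staffId: string, nowMs: number): BoardTask[] {
  return tasks.map((t): BoardTask =>
    t.assigneeId === staffId && t.status !== "done"
      ? { id: t.id, status: "pending", assigneeId: null, pendingSinceMs: nowMs }
      : t,
  );
}

// Part 2: IDs of reset tasks that still have no replacement after graceMs.
function tasksToEscalate(
  tasks: BoardTask[],
  nowMs: number,
  graceMs: number = 5 * 60 * 1000,
): string[] {
  return tasks
    .filter(
      (t) =>
        t.status === "pending" &&
        t.assigneeId === null &&
        t.pendingSinceMs !== null &&
        nowMs - t.pendingSinceMs >= graceMs,
    )
    .map((t) => t.id);
}
```

Completed tasks are left untouched, so a shift end never reopens finished work.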

8. Security events

8.1 RLS deny in production

  • Detection: any RLS deny in prod = page on-call.
  • Manual: open SEV-2; this should never happen given our current_tenant_id() setup. Investigate the offending request immediately.

8.2 Unexpected JWT role escalation

  • Detection: auth.role.unexpected (a non-supervisor calling supervisor-only routes) > 0.
  • Manual: rotate JWT signing key; force iam-service re-issuance.

9. Data integrity

9.1 Outbox event with invalid payload

  • Detection: consumer rejects with schema-validation error.
  • Mitigation: event versioning; producers may emit vN+1 only after consumers have migrated.
  • Manual: if the producer is ours, hot-fix the payload field; for an external producer, coordinate with its owner.

9.2 room_status row missing

  • Cause: brand-new room not yet projected.
  • Mitigation: FlipRoomStatusUseCase and the room-event consumer create the row on first touch.
  • Manual: none.

10. Operational checklist for incident response

  1. Confirm the alert in Grafana housekeeping folder.
  2. Check outbox.unpublished_rows, consumer lag, and 5xx rate dashboards.
  3. Pull last 50 ERROR logs filtered to the affected route/use case.
  4. Identify if the issue is internal (code bug) or external (DB, Pub/Sub, dependency).
  5. If incident exceeds 15 min, post in #hk-ops, escalate per platform standard, start an incident doc.
  6. Once mitigated, file a post-mortem and link it to the runbook that was updated as part of the fix.