housekeeping-service — FAILURE_MODES
Catalog of how this service fails, what the user sees, what the system does automatically, and what an on-call should do. Each row is a runbook anchor.
1. Storage failures
1.1 Cloud SQL primary unavailable
- Detection: repository errors spike; Cloud SQL connection failures metric > 0; healthcheck /health/ready flips to 503.
- System behaviour: Cloud Run drains traffic from instances failing healthchecks; HTTPS LB returns 503; Pub/Sub push subscribers nack and back off; outbox relay halts on published_at IS NULL.
- User-visible: REST 503 with Retry-After: 5; desktop sync queues operations locally.
- Auto-mitigation: Cloud SQL HA failover within ~60 s; healthcheck recovers; Pub/Sub redelivers backlog with exponential backoff.
- Manual: On-call confirms failover via Cloud SQL console; checks outbox.unpublished_rows returned to baseline within 15 min; if not, triggers outbox-backlog runbook.
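As an illustration of the user-visible behaviour above, a client could derive its wait time from the 503 response roughly as follows. This is a minimal sketch, not the desktop client's actual code: the function name retryDelayMs and the 5 s fallback are assumptions chosen to mirror the documented Retry-After: 5.

```typescript
// Hypothetical client-side helper (not part of the service): how long to wait
// before retrying after a 503. Only the delta-seconds form of Retry-After is
// parsed; the HTTP-date form falls back to the default.
export function retryDelayMs(
  status: number,
  retryAfter: string | null,
  fallbackMs: number = 5_000, // assumption: mirrors the documented Retry-After: 5
): number | null {
  if (status !== 503) return null; // not an outage response, no retry hint
  if (retryAfter === null) return fallbackMs;
  const trimmed = retryAfter.trim();
  if (!/^\d+$/.test(trimmed)) return fallbackMs; // HTTP-date or malformed value
  return Number(trimmed) * 1000;
}
```

A caller would queue the operation locally and schedule the next attempt after the returned delay; a null result means the response was not a 503 and other handling applies.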
1.2 Slow queries / partition explosion
- Detection: melmastoon.housekeeping.api.latency_ms p99 > 250 ms; Cloud SQL slow query log spikes.
- Root causes: missing partition prune (no created_at predicate), missing index after a schema change, runaway analyst query.
- Manual: check EXPLAIN plan against partition-pruning.spec.ts baselines; verify pg_partman jobs ran; revert offending query/route.
1.3 Disk pressure
- Detection: Cloud SQL storage > 80% (auto-grow on, but slow).
- Manual: verify monthly partitions are being detached + archived; run extra archive job; resize.
2. Outbox / relay failures
2.1 Outbox backlog
- Detection: outbox.unpublished_rows > 1000 for 5 min.
- System behaviour: events delayed but not lost.
- Manual: check outbox.append_to_publish_lag_ms; verify Pub/Sub topic exists and SA has pubsub.publisher; check relay process health; if stuck, restart Cloud Run instances; see the outbox-backlog.md runbook.
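The "> 1000 for 5 min" condition means the value must stay above the threshold for the whole window, not merely touch it once. The sketch below shows one way to evaluate that over a series of samples. It is an assumption about the alert semantics, not the actual alert rule, and the names MetricSample and sustainedBreach are hypothetical.

```typescript
interface MetricSample {
  atMs: number; // sample timestamp in milliseconds
  value: number; // e.g. outbox.unpublished_rows at that time
}

// Returns true when the most recent run of samples above the threshold
// started at least windowMs before nowMs. Samples are assumed to be sorted
// by atMs in ascending order.
export function sustainedBreach(
  samples: MetricSample[],
  nowMs: number,
  threshold: number = 1000,
  windowMs: number = 5 * 60_000,
): boolean {
  let breachStart: number | null = null;
  for (const s of samples) {
    if (s.value > threshold) {
      if (breachStart === null) breachStart = s.atMs; // a new breach begins
    } else {
      breachStart = null; // any dip back to baseline resets the clock
    }
  }
  return breachStart !== null && nowMs - breachStart >= windowMs;
}
```

A single dip below the threshold restarts the window, which is why a briefly recovering backlog does not page.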
2.2 Relay double-publish
- Possible cause: crash between publish and UPDATE published_at.
- Mitigation: consumers are idempotent on (subject, event_id); effect = ack-only on duplicate.
- Manual: none required.
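The idempotency guarantee above can be sketched as a consumer that records each (subject, event_id) pair and acknowledges duplicates without running the handler again. This is an in-memory illustration only: a real consumer would presumably keep the dedupe record in durable storage, and the names HkEvent and IdempotentConsumer are hypothetical.

```typescript
interface HkEvent {
  subject: string;
  eventId: string;
  payload: unknown;
}

type HandleOutcome = "processed" | "duplicate";

export class IdempotentConsumer {
  // Assumption: an in-memory set stands in for a durable dedupe table.
  private readonly seen = new Set<string>();
  private readonly handler: (event: HkEvent) => void;

  constructor(handler: (event: HkEvent) => void) {
    this.handler = handler;
  }

  handle(event: HkEvent): HandleOutcome {
    // JSON-encode the pair so that separators inside the ids cannot collide.
    const key = JSON.stringify([event.subject, event.eventId]);
    if (this.seen.has(key)) return "duplicate"; // ack-only, no side effects
    this.handler(event); // if this throws, the key is not recorded
    this.seen.add(key);
    return "processed";
  }
}
```

Because the key is recorded only after the handler returns, a handler failure leaves the event eligible for redelivery, while a relay double-publish results in a plain ack.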
2.3 Outbox table corruption (extremely unlikely)
- Manual: restore from PITR to a checkpoint table; replay from there.
3. Pub/Sub consumer failures
3.1 Consumer lag
- Detection: subscription/oldest_unacked_message_age > 60 s for 5 min.
- Causes: handler error storm, downstream dependency slow, instance count too low.
- System behaviour: Pub/Sub retries with backoff; messages eventually DLQ after max_delivery_attempts=10.
- Manual: check Sentry for handler errors; consider raising max instances; see consumer-lag.md.
3.2 DLQ growth
- Detection: melmastoon.dlq.housekeeping > 10 in 15 min.
- Manual: inspect DLQ payloads; classify (poison message vs persistent dependency outage); drain via the dlq-replay Cloud Run Job after fixing root cause.
3.3 Wrong / missing OIDC token
- Detection: /internal/events/* returns 401; auth-failure metric spikes.
- Manual: verify push subscription's oidc_token.service_account_email matches the audience configured in this service; confirm the SA has roles/run.invoker on this Cloud Run service.
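When triaging a 401 on the push endpoints, it can help to compare the decoded token claims against what the service expects. The sketch below assumes the token signature and expiry were already verified elsewhere and checks only the aud, email, and email_verified claims. The function name and the ExpectedPushIdentity shape are hypothetical, not this service's actual configuration.

```typescript
// Claims assumed to be present in a decoded Pub/Sub push OIDC token.
interface PushTokenClaims {
  aud: string;
  email: string;
  email_verified: boolean;
}

// Hypothetical shape of the identity this service expects on push requests.
interface ExpectedPushIdentity {
  audience: string;
  serviceAccountEmail: string;
}

// Returns a list of human-readable mismatches; an empty list means the
// claims line up with the expected identity.
export function pushClaimProblems(
  claims: PushTokenClaims,
  expected: ExpectedPushIdentity,
): string[] {
  const problems: string[] = [];
  if (claims.aud !== expected.audience) problems.push("audience mismatch");
  if (claims.email !== expected.serviceAccountEmail) problems.push("service account mismatch");
  if (!claims.email_verified) problems.push("email not verified");
  return problems;
}
```

Logging the returned list alongside the 401 makes it quicker to tell a misconfigured subscription from a misconfigured service.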
4. Domain / use-case failures
4.1 Concurrency conflict storm
- Detection: 409 CONCURRENCY_CONFLICT rate > 10/s for 60 s on the same aggregate.
- Cause: two clients (desktop + automation) racing on the same task; bug in retry logic.
- System behaviour: API returns 409; clients retry with backoff.
- Manual: identify offending client via requestId correlation; throttle if needed.
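A conflict storm is often a sign that clients retry in lockstep. One common way to avoid that is capped exponential backoff with full jitter, sketched below. The document does not specify the clients' actual backoff policy, so the base, cap, and function name here are assumptions.

```typescript
// Delay before the next retry after a 409, using capped exponential backoff
// with full jitter: a uniform random value in [0, min(cap, base * 2^attempt)).
export function conflictBackoffMs(
  attempt: number, // 0 for the first retry
  baseMs: number = 100, // assumption: illustrative base delay
  capMs: number = 5_000, // assumption: illustrative upper bound
  random: () => number = Math.random, // injectable for deterministic tests
): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}
```

The jitter spreads two racing clients apart in time, so they are less likely to collide again on the same aggregate.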
4.2 MELMASTOON.HOUSEKEEPING.STAFF_UNAVAILABLE
- Cause: stale shift cache; staff off-duty before shift end propagated.
- Mitigation: cache TTL is 60 s; refresh on shift events.
- Manual: if persistent, force cache flush; verify staff.shift.ended.v1 consumer is processing.
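The mitigation above combines a 60 s TTL with event-driven refresh. A minimal sketch of that pattern follows, assuming an in-process cache keyed by staff ID. The class name ShiftCache and its methods are hypothetical and do not describe the service's real cache.

```typescript
interface ShiftEntry {
  onShift: boolean;
  expiresAtMs: number;
}

export class ShiftCache {
  private readonly entries = new Map<string, ShiftEntry>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(ttlMs: number = 60_000, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  set(staffId: string, onShift: boolean): void {
    this.entries.set(staffId, { onShift, expiresAtMs: this.now() + this.ttlMs });
  }

  // Returns undefined on a miss or an expired entry, so the caller reloads.
  get(staffId: string): boolean | undefined {
    const entry = this.entries.get(staffId);
    if (entry === undefined) return undefined;
    if (this.now() >= entry.expiresAtMs) {
      this.entries.delete(staffId);
      return undefined;
    }
    return entry.onShift;
  }

  // Called by the shift-event consumer: overwrite without waiting for the TTL.
  onShiftEvent(staffId: string, onShift: boolean): void {
    this.set(staffId, onShift);
  }

  // Manual remediation: drop everything.
  flush(): void {
    this.entries.clear();
  }
}
```

If the shift-event consumer stalls, staleness is bounded only by the TTL, which is why the manual step checks that the consumer is processing.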
4.3 MELMASTOON.HOUSEKEEPING.LINEN_OUT_OF_STOCK blocks task
- System behaviour: assignment refused; alert fires.
- Manual: supervisor adjusts on-hand via POST /linen/{id}/issue after physical resupply, or marks linen_required=false on the affected task (audit-flagged).
4.4 MELMASTOON.HOUSEKEEPING.ROOM_STATE_CONFLICT
- Cause: illegal manual flip or lagging room state.
- System behaviour: API returns 422; sync push surfaces conflict toast on desktop.
- Manual: investigate via room_status_audit history; one-shot manual flip with reason if legitimate.
5. Sync failures
5.1 Cursor expired (410 SYNC_CURSOR_EXPIRED)
- Cause: desktop offline > snapshot retention window.
- System behaviour: desktop falls back to full re-sync.
- User-visible: "Refreshing data" indicator; takes seconds to a minute depending on volume.
- Manual: none expected.
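The fallback described above can be sketched as a single sync step that switches to a snapshot when the server reports an expired cursor. The SyncServer interface, the row representation, and the synchronous calls are simplifying assumptions for illustration, not the desktop client's real API.

```typescript
type PullResult =
  | { status: 200; changes: string[]; nextCursor: string }
  | { status: 410; code: "SYNC_CURSOR_EXPIRED" };

// Hypothetical server facade; real calls would be asynchronous HTTP requests.
interface SyncServer {
  pull(cursor: string): PullResult;
  snapshot(): { rows: string[]; cursor: string };
}

interface LocalState {
  cursor: string;
  rows: string[];
}

export function syncOnce(
  server: SyncServer,
  local: LocalState,
): "incremental" | "full-resync" {
  const result = server.pull(local.cursor);
  if (result.status === 410) {
    // Cursor is older than the retention window: discard local rows and
    // rebuild from a fresh snapshot.
    const snap = server.snapshot();
    local.rows = [...snap.rows];
    local.cursor = snap.cursor;
    return "full-resync";
  }
  local.rows.push(...result.changes);
  local.cursor = result.nextCursor;
  return "incremental";
}
```

The return value lets the UI decide whether to show the "Refreshing data" indicator.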
5.2 Push retry storm
- Detection: sync.push.ops{outcome=deferred} > 100/s.
- Cause: server back-pressure; renderer not honoring backoff.
- Manual: verify renderer version; bump server min-instances temporarily.
5.3 Local SQLite corruption
- Detection: desktop logs SQLITE_CORRUPT.
- Mitigation: desktop deletes housekeeping.db and performs a full re-sync.
- Manual: support ticket; root-cause investigation if a pattern emerges.
6. AI port failures
6.1 Routing port timeout
- Detection: ai.routing.fallback counter > 0.
- System behaviour: empty suggestion; manual mode.
- Manual: if persistent (> 30 min), confirm ai-orchestrator-service health; consider disabling auto-apply gate via tenant settings.
6.2 Routing applied bad assignment
- Cause: routing model returned invalid staffId (off-shift, wrong tenant).
- System behaviour: application layer rejects via STAFF_UNAVAILABLE; row dropped from suggestion; audit row created.
- Manual: capture suggestion ID and report to AI team for model regression test.
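The rejection step above amounts to screening each suggested row against tenant and shift state before anything is applied. A minimal sketch follows. The record shapes and the function name screenSuggestions are assumptions, and in the real service the rejected IDs would feed the audit rows rather than be returned in a list.

```typescript
interface StaffRecord {
  staffId: string;
  tenantId: string;
  onShift: boolean;
}

interface RoutingSuggestion {
  suggestionId: string;
  taskId: string;
  staffId: string;
}

// Splits suggestions into rows that are safe to apply and the IDs of rows
// that would be rejected as STAFF_UNAVAILABLE.
export function screenSuggestions(
  tenantId: string,
  suggestions: RoutingSuggestion[],
  staffById: Map<string, StaffRecord>,
): { accepted: RoutingSuggestion[]; rejectedIds: string[] } {
  const accepted: RoutingSuggestion[] = [];
  const rejectedIds: string[] = [];
  for (const s of suggestions) {
    const staff = staffById.get(s.staffId);
    const valid =
      staff !== undefined && staff.tenantId === tenantId && staff.onShift;
    if (valid) accepted.push(s);
    else rejectedIds.push(s.suggestionId); // unknown, wrong tenant, or off shift
  }
  return { accepted, rejectedIds };
}
```

The rejected suggestion IDs are what the manual step asks on-call to capture for the AI team.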
7. Multi-service interactions
7.1 reservation.checked_out.v1 arrives but room not in property-service
- Cause: event-out-of-order (room archived just before checkout).
- System behaviour: CreateTaskUseCase looks up room, gets 404 → emits task.cancelled.v1 with reason=room_archived; logs WARN.
- Manual: none required; investigate only if frequent.
7.2 maintenance.work_order.completed.v1 for room with no prior maintenance task
- Cause: maintenance opened standalone (not via cleaning hand-off).
- System behaviour: unblock room, create post_maintenance task.
- Manual: none.
7.3 staff.shift.ended.v1 while staff has open task
- Cause: shift ended before completion.
- System behaviour: open tasks reset to pending; router re-suggests; task.reassigned.v1 emitted on next assignment; escalated.v1 if no replacement found within 5 min.
- Manual: supervisor action via board.
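The reset-then-escalate flow above can be sketched as two pure functions: one that returns a departing staff member's open tasks to pending, and one that lists tasks still unassigned after the 5-minute grace period. The task shape, the "completed" status value, and the function names are assumptions for illustration. Event emission is left out.

```typescript
interface HkTask {
  taskId: string;
  assigneeId: string | null;
  status: string; // assumption: "pending" and "completed" are among the values
  unassignedAtMs: number | null; // when the task lost its assignee
}

// Resets every non-completed task held by the given staff member to pending.
export function resetTasksForEndedShift(
  tasks: HkTask[],
  staffId: string,
  nowMs: number,
): HkTask[] {
  return tasks.map(t =>
    t.assigneeId === staffId && t.status !== "completed"
      ? { ...t, assigneeId: null, status: "pending", unassignedAtMs: nowMs }
      : t,
  );
}

// Lists task IDs that have waited for a replacement at least graceMs.
export function tasksToEscalate(
  tasks: HkTask[],
  nowMs: number,
  graceMs: number = 5 * 60_000,
): string[] {
  return tasks
    .filter(
      t =>
        t.status === "pending" &&
        t.assigneeId === null &&
        t.unassignedAtMs !== null &&
        nowMs - t.unassignedAtMs >= graceMs,
    )
    .map(t => t.taskId);
}
```

Running the escalation check on a timer after the reset gives the supervisor board a list of tasks that need manual reassignment.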
8. Security events
8.1 RLS deny in production
- Detection: any RLS deny in prod = page on-call.
- Manual: open SEV-2; this should never happen given our current_tenant_id() setup. Investigate the offending request immediately.
8.2 Unexpected JWT role escalation
- Detection: auth.role.unexpected (a non-supervisor calling supervisor-only routes) > 0.
- Manual: rotate JWT signing key; force iam-service re-issuance.
9. Data integrity
9.1 Outbox event with invalid payload
- Detection: consumer rejects with schema-validation error.
- Mitigation: event versioning; producer must emit valid vN+1 only after consumer migration.
- Manual: if the consumer is ours, hot-fix the payload field; for an external producer, coordinate via its owner.
9.2 room_status row missing
- Cause: brand-new room not yet projected.
- Mitigation: FlipRoomStatusUseCase and the room-event consumer create the row on first touch.
- Manual: none.
10. Operational checklist for incident response
- Confirm the alert in Grafana housekeeping folder.
- Check outbox.unpublished_rows, consumer lag, and 5xx rate dashboards.
- Pull last 50 ERROR logs filtered to the affected route/use case.
- Identify if the issue is internal (code bug) or external (DB, Pub/Sub, dependency).
- If incident exceeds 15 min, post in #hk-ops, escalate per platform standard, start an incident doc.
- Once mitigated, file a post-mortem and link to the runbook updated as part of the fix.
11. Cross-link
- SLOs and alerts: OBSERVABILITY.md.
- Risk register (longer-horizon): SERVICE_RISK_REGISTER.md.
- Topology context: DEPLOYMENT_TOPOLOGY.md.