housekeeping-service — OBSERVABILITY

Logs (Cloud Logging), metrics (Cloud Monitoring + Datadog), traces (Cloud Trace + OpenTelemetry), errors (Sentry). SLOs declared in slo.yaml; alert routes in alerts.yaml. Aligned with the platform standard.

1. SLOs

SLO	Target	Window	Indicator
Turnover task auto-create latency	p95 < 2 s	7-day rolling	`event_received_at` → `outbox_appended_at` for `task.created.v1` from `reservation.checked_out.v1`
Board read latency	p99 < 250 ms	7-day rolling	`GET /board` server time
Board write latency	p99 < 400 ms	7-day rolling	`POST /tasks/*/assign` server time including outbox commit
API availability	99.9%	30-day rolling	(1 - 5xx_rate) on `/api/v1/housekeeping/*`
Event consumer lag	p99 < 30 s	1-day rolling	Pub/Sub oldest unacked age per consumed subscription
Sync push success	≥ 99%	7-day rolling	`(applied + conflict_recoverable) / total`

Error budgets and burn-rate alerts (1h/6h/3d) are defined in slo.yaml.
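
As a minimal sketch of what a burn-rate check computes, assuming the common definition (observed error ratio divided by the error budget, where the budget is 1 minus the SLO target): the function names below are hypothetical, and the 14× and 6× thresholds are the ones listed for `ErrorBudgetBurnFast` and `ErrorBudgetBurnSlow` in section 7. The authoritative definitions remain in slo.yaml.

```typescript
// Burn rate = observed error ratio / error budget ratio (budget = 1 - SLO target).
// A burn rate of 1 means the budget would be used up exactly at the end of the window.
function burnRate(errors: number, total: number, sloTarget: number): number {
  if (total === 0) return 0; // no traffic consumes no budget
  const budget = 1 - sloTarget;
  if (budget <= 0) throw new Error("sloTarget must be below 1");
  return errors / total / budget;
}

// Thresholds copied from the alerts table in section 7.
const FAST_BURN_1H = 14;
const SLOW_BURN_6H = 6;

// Hypothetical helper: maps the two burn rates to the route named in section 7.
function classifyBurn(rate1h: number, rate6h: number): "page" | "ticket" | "ok" {
  if (rate1h > FAST_BURN_1H) return "page";
  if (rate6h > SLOW_BURN_6H) return "ticket";
  return "ok";
}
```

With the 99.9% availability target, a 1-hour window with 15 errors in 1,000 requests gives a burn rate of about 15, which is above the 14× paging threshold.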

2. Logs

Structured JSON, one line per request or event. Mandatory fields:

{
  "ts":"2026-04-22T12:00:00.123Z",
  "level":"INFO",
  "service":"housekeeping-service",
  "version":"1.4.2",
  "tenantId":"tnt_…",          // never the raw value in lower envs; hashed in production "tnt_h_<sha256-prefix>"
  "requestId":"req_…",
  "traceId":"…",
  "spanId":"…",
  "actor":{"type":"user","id":"stf_…"},
  "route":"POST /tasks/:id/complete",
  "useCase":"CompleteTaskUseCase",
  "taskId":"hkt_…",
  "roomId":"rom_…",
  "outcome":"ok",
  "latencyMs":143,
  "msg":"task completed"
}

Levels: DEBUG (off in prod), INFO (default), WARN (recoverable), ERROR (use-case failure), FATAL (process-killing). Sensitive fields redacted by redactor.middleware.ts per SECURITY_MODEL.md §8. Retention: 30 days hot, 365 days archived in GCS.
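
The line format and level filter above can be sketched as follows. This is an illustration only, not the service's logger: `formatLogLine`, `LogFields`, and the numeric level ordering are hypothetical names, the interface requires only a subset of the fields shown in the sample (an assumption), and redaction is left to redactor.middleware.ts.

```typescript
type LogLevel = "DEBUG" | "INFO" | "WARN" | "ERROR" | "FATAL";

// Assumed ordering, used only to compare a line's level against the minimum.
const LEVEL_ORDER: Record<LogLevel, number> = {
  DEBUG: 10, INFO: 20, WARN: 30, ERROR: 40, FATAL: 50,
};

// Subset of the mandatory fields; extra keys (taskId, roomId, ...) pass through.
interface LogFields {
  level: LogLevel;
  service: string;
  version: string;
  tenantId: string; // expected to arrive already hashed where hashing applies
  requestId: string;
  traceId: string;
  spanId: string;
  outcome: string;
  latencyMs: number;
  msg: string;
  [extra: string]: unknown;
}

// Returns one JSON line, or null when the level is below the configured minimum
// (for example DEBUG when the minimum is INFO, as in production).
function formatLogLine(fields: LogFields, minLevel: LogLevel, now: Date = new Date()): string | null {
  if (LEVEL_ORDER[fields.level] < LEVEL_ORDER[minLevel]) return null;
  return JSON.stringify({ ts: now.toISOString(), ...fields });
}
```

`JSON.stringify` without an indent argument never emits newlines, which keeps each request or event on a single line.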

3. Metrics

Naming: melmastoon.housekeeping.<area>.<measure> with labels (tenant_hash, property_id, outcome, route|use_case|topic). Histograms preferred over averages.
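
A small helper can keep metric names on this convention. The sketch below is hypothetical (`metricName` and the segment pattern are not part of the service), and it assumes segments are lowercase snake_case, which matches every name listed in this section.

```typescript
const METRIC_PREFIX = "melmastoon.housekeeping";

// Assumed segment shape: lowercase letter first, then lowercase letters, digits, underscores.
const METRIC_SEGMENT = /^[a-z][a-z0-9_]*$/;

// Builds melmastoon.housekeeping.<area>.<measure>; nested areas such as
// ("ai", "routing", "fallback") are passed as additional segments.
function metricName(...segments: string[]): string {
  if (segments.length < 2) {
    throw new Error("expected at least <area> and <measure>");
  }
  for (const segment of segments) {
    if (!METRIC_SEGMENT.test(segment)) {
      throw new Error(`invalid metric segment: ${segment}`);
    }
  }
  return [METRIC_PREFIX, ...segments].join(".");
}
```

For example, `metricName("api", "latency_ms")` yields `melmastoon.housekeeping.api.latency_ms`, and a segment such as `"API"` is rejected.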

3.1 Hot-path

melmastoon.housekeeping.api.requests (counter; labels: route, status_code)
melmastoon.housekeeping.api.latency_ms (histogram; labels: route)
melmastoon.housekeeping.use_case.duration_ms (histogram; labels: use_case, outcome)
melmastoon.housekeeping.outbox.append_to_publish_lag_ms (histogram)
melmastoon.housekeeping.outbox.unpublished_rows (gauge; alert > 1k for 5m)
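
To illustrate why the latency metrics above are histograms rather than averages, here is a minimal bucket histogram. It is a teaching sketch with hypothetical names and bucket bounds; in practice the metrics backend performs this aggregation.

```typescript
class BucketHistogram {
  private readonly counts: number[] = [];
  private total = 0;

  // bounds: ascending upper bucket limits in ms; one overflow bucket is appended.
  constructor(private readonly bounds: number[]) {
    for (let i = 0; i <= bounds.length; i++) this.counts.push(0);
  }

  observe(ms: number): void {
    let index = this.bounds.length; // overflow bucket by default
    for (let i = 0; i < this.bounds.length; i++) {
      if (ms <= this.bounds[i]) { index = i; break; }
    }
    this.counts[index] += 1;
    this.total += 1;
  }

  // Upper bound of the bucket that contains the q-quantile (0 < q <= 1).
  quantileUpperBound(q: number): number {
    if (this.total === 0) return NaN;
    const rank = Math.ceil(q * this.total);
    let seen = 0;
    for (let i = 0; i < this.counts.length; i++) {
      seen += this.counts[i];
      if (seen >= rank) return i < this.bounds.length ? this.bounds[i] : Infinity;
    }
    return Infinity;
  }
}
```

With 98 requests at 50 ms and 2 at 2,000 ms, the mean is 89 ms and looks healthy against a 250 ms target, while the p99 lands in the slowest occupied bucket. Only the histogram exposes that tail.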

3.2 Domain

melmastoon.housekeeping.tasks.created (counter; labels: kind, source)
melmastoon.housekeeping.tasks.completed (counter; labels: kind, with_inspection)
melmastoon.housekeeping.tasks.duration_minutes (histogram; labels: kind)
melmastoon.housekeeping.room.time_to_ready_minutes (histogram; labels: property_hash)
melmastoon.housekeeping.tasks.escalated (counter; labels: reason)
melmastoon.housekeeping.tasks.failed (counter; labels: reason)
melmastoon.housekeeping.linen.on_hand (gauge; labels: property_hash, line)
melmastoon.housekeeping.linen.low_stock_alerts_emitted (counter)
melmastoon.housekeeping.shift.staffing_gap_detected (counter)
melmastoon.housekeeping.inspection.outcome (counter; labels: outcome)

3.3 Sync

melmastoon.housekeeping.sync.pull.row_count (histogram; labels: aggregate)
melmastoon.housekeeping.sync.push.ops (counter; labels: outcome)
melmastoon.housekeeping.sync.conflicts (counter; labels: aggregate, field)

3.4 AI port

melmastoon.housekeeping.ai.routing.requests (counter; labels: outcome, hitl_mode)
melmastoon.housekeeping.ai.routing.latency_ms (histogram)
melmastoon.housekeeping.ai.routing.applied_rows (histogram)
melmastoon.housekeeping.ai.routing.fallback (counter)

3.5 System

Cloud Run native metrics (CPU, mem, instance count, request concurrency).
Cloud SQL native metrics (connections, slow queries, replication lag, CPU).
Pub/Sub: subscription/oldest_unacked_message_age, delivery_attempts.

4. Traces

OpenTelemetry. Every request and event handler is a root span. Standard span names:

http.request (attributes: route, status, request_id)
housekeeping.use_case.<name> (attributes: tenant_hash, aggregate_id)
housekeeping.repo.<table>.<op> (attributes: rows, version)
housekeeping.outbox.append
housekeeping.pubsub.consume.<subject>
housekeeping.ai.routing.suggest

traceparent propagated via the event envelope; consumers continue the trace.
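
The sketch below shows one way a consumer could continue a trace from a W3C `traceparent` value (version `00`, 32-hex trace id, 16-hex parent span id, 2-hex flags). The function names and the `traceparent` field on the envelope are assumptions for illustration; in the service this is expected to be handled by the OpenTelemetry propagator rather than hand-written code.

```typescript
interface TraceContext {
  traceId: string;
  parentSpanId: string;
  sampled: boolean;
}

const TRACEPARENT_RE = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;
const ALL_ZERO = /^0+$/;

// Parses a version-00 traceparent value; returns null when it is malformed.
function parseTraceparent(value: string): TraceContext | null {
  const match = TRACEPARENT_RE.exec(value);
  if (!match) return null;
  const traceId = match[1];
  const parentSpanId = match[2];
  // All-zero ids are invalid per the W3C Trace Context specification.
  if (ALL_ZERO.test(traceId) || ALL_ZERO.test(parentSpanId)) return null;
  const sampled = (parseInt(match[3], 16) & 1) === 1;
  return { traceId, parentSpanId, sampled };
}

function formatTraceparent(traceId: string, spanId: string, sampled: boolean): string {
  return `00-${traceId}-${spanId}-${sampled ? "01" : "00"}`;
}

// Consumer side: keep the producer's trace id, use a fresh span id for the handler span.
// Returns null when the envelope carries no usable context (the handler then starts a new trace).
function continueTrace(envelope: { traceparent?: string }, newSpanId: string): string | null {
  if (!envelope.traceparent) return null;
  const ctx = parseTraceparent(envelope.traceparent);
  if (!ctx) return null;
  return formatTraceparent(ctx.traceId, newSpanId, ctx.sampled);
}
```

Keeping the trace id while replacing the span id is what lets the `housekeeping.pubsub.consume.<subject>` span appear under the producer's trace.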

Sample rate: 100% for ERROR, 10% for INFO under load, 100% on canary.
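
A minimal sketch of the sampling policy, under two assumptions the text does not state: traces without errors are kept at 100% when the service is not under load, and the keep/drop decision is derived deterministically from the trace id so every service reaches the same decision for a given trace. Keeping 100% of error traces also implies the decision is made once the outcome is known (tail-based). All names are hypothetical.

```typescript
interface SamplingInput {
  hasError: boolean;
  isCanary: boolean;
  underLoad: boolean;
}

// Rates from the policy above; the non-loaded, non-error case is an assumption.
function traceSampleRate(input: SamplingInput): number {
  if (input.hasError || input.isCanary) return 1;
  return input.underLoad ? 0.1 : 1;
}

// Deterministic decision: map the last 8 hex chars of the trace id to [0, 1)
// and keep the trace when that value falls below the rate.
function shouldSample(traceId: string, rate: number): boolean {
  if (rate >= 1) return true;
  if (rate <= 0) return false;
  const bucket = parseInt(traceId.slice(-8), 16) / 0x100000000;
  return bucket < rate;
}
```

A caller would combine the two, for example `shouldSample(traceId, traceSampleRate({ hasError, isCanary, underLoad }))`.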

5. Errors (Sentry)

All ERROR-level logs auto-shipped to Sentry with the trace context.
Event handlers tag errors with topic, message_id, delivery_attempt.
Domain errors are not Sentry events; they are expected business outcomes (4xx). Only 5xx responses, panics, and concurrency conflict storms (10+ in 60 s on the same aggregate) page on-call.
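
The conflict-storm rule (10 or more conflicts within 60 s on the same aggregate) can be sketched as a per-aggregate sliding window. `ConflictStormDetector` is a hypothetical, in-memory illustration; the actual detection may well live in the metrics or alerting layer instead.

```typescript
class ConflictStormDetector {
  // aggregate id -> timestamps (ms) of recent concurrency conflicts
  private readonly hits = new Map<string, number[]>();

  constructor(
    private readonly threshold: number = 10,
    private readonly windowMs: number = 60_000,
  ) {}

  // Records one conflict and returns true when the aggregate is in a storm,
  // i.e. at least `threshold` conflicts fall within the trailing window.
  record(aggregateId: string, nowMs: number): boolean {
    const cutoff = nowMs - this.windowMs;
    const previous = this.hits.get(aggregateId) ?? [];
    const recent = previous.filter((t) => t > cutoff); // drop expired entries
    recent.push(nowMs);
    this.hits.set(aggregateId, recent);
    return recent.length >= this.threshold;
  }
}
```

A single conflict stays an expected outcome and produces no page; only the tenth conflict inside the window on one aggregate returns true.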

6. Dashboards

Grafana folder housekeeping:

Overview — request rate, latency p50/p95/p99, 4xx/5xx, error budget burn.
Turnover Saga — tasks created/started/completed/min, time-to-ready histogram, current open count by status.
Board — GET /board p99, board cache hit ratio, sync push outcomes.
Linen — on-hand by line per property, low-stock alerts, runway minutes.
Shifts & Routing — staffing-gap events, routing requests/applied/fallback, hitl mode by tenant.
Database — Cloud SQL CPU, slow query top 10, partition sizes, RLS denies.
Pub/Sub — consumer lag per subscription, DLQ size, retry counts.

7. Alerts

Alert	Condition	Route
`OutboxBacklog`	unpublished rows > 1000 for 5 min	on-call (housekeeping)
`BoardLatencyHigh`	board p99 > 400 ms for 10 min	on-call
`TurnoverCreateLagHigh`	p95 > 5 s for 10 min	on-call
`ConsumerLagHigh`	oldest unacked > 60 s for 5 min on any subscription	on-call
`DLQGrowth`	DLQ size > 10 in 15 min	on-call
`ErrorBudgetBurnFast`	1h burn rate > 14×	on-call (page)
`ErrorBudgetBurnSlow`	6h burn rate > 6×	on-call (ticket)
`RLSDenyStorm`	RLS denies > 0 for 5 min	security on-call (this should never happen)
`RoutingFallbackPersistent`	fallback counter > 0 for 30 min	engineering Slack (warn)
`LinenLowStockSpike`	low-stock alerts > 5/property in 1 h	analytics Slack (warn)

Pages go to PagerDuty housekeeping; warnings to Slack #hk-ops.
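
Several conditions in the table have the form "value above a threshold for N minutes". The sketch below illustrates that semantics with a hypothetical helper in which any sample at or below the threshold resets the timer. Actual evaluation happens in the monitoring backend configured through alerts.yaml, and its exact semantics may differ.

```typescript
interface MetricSample {
  atMs: number;  // sample timestamp in milliseconds
  value: number;
}

// True when every sample since some point at least `durationMs` ago exceeds the threshold.
// Samples are assumed to be sorted by atMs in ascending order.
function sustainedAbove(
  samples: MetricSample[],
  threshold: number,
  durationMs: number,
  nowMs: number,
): boolean {
  let breachStart: number | null = null;
  for (const sample of samples) {
    if (sample.value > threshold) {
      if (breachStart === null) breachStart = sample.atMs;
    } else {
      breachStart = null; // any recovery resets the timer
    }
  }
  return breachStart !== null && nowMs - breachStart >= durationMs;
}
```

For `OutboxBacklog` this would be called with a threshold of 1000 and a duration of 5 minutes (300,000 ms) over the `unpublished_rows` gauge samples.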

8. Runbooks

Each alert has a runbook in runbooks/:

outbox-backlog.md
board-latency.md
turnover-create-lag.md
consumer-lag.md
dlq-growth.md
rls-deny.md
routing-fallback.md

Runbooks include: how to confirm, immediate mitigation, escalation, post-mortem template.

9. Audit visibility

audit_events is queryable through the internal admin tool (/internal/admin/audit?aggregate_id=…), restricted to the tenant_admin and platform support roles. Events are also mirrored to audit-service over Pub/Sub.

10. Cross-link

Platform observability defaults: docs/observability.md (project-wide).
Failure modes & runbooks: FAILURE_MODES.md.
Capacity planning inputs: DATA_MODEL.md §6.
