
housekeeping-service — OBSERVABILITY

Logs (Cloud Logging), metrics (Cloud Monitoring + Datadog), traces (Cloud Trace + OpenTelemetry), errors (Sentry). SLOs declared in slo.yaml; alert routes in alerts.yaml. Aligned with the platform standard.


1. SLOs

| SLO | Target | Window | Indicator |
|---|---|---|---|
| Turnover task auto-create latency | p95 < 2 s | 7-day rolling | event_received_at → outbox_appended_at for task.created.v1 from reservation.checked_out.v1 |
| Board read latency | p99 < 250 ms | 7-day rolling | GET /board server time |
| Board write latency | p99 < 400 ms | 7-day rolling | POST /tasks/*/assign server time including outbox commit |
| API availability | 99.9% | 30-day rolling | (1 - 5xx_rate) on /api/v1/housekeeping/* |
| Event consumer lag | p99 < 30 s | 1-day rolling | Pub/Sub oldest unacked age per consumed subscription |
| Sync push success | ≥ 99% | 7-day rolling | (applied + conflict_recoverable) / total |

Error budgets and burn-rate alerts (1h/6h/3d) are defined in slo.yaml.
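
As a rough illustration of what a burn-rate threshold means, the sketch below uses the common definition: burn rate is the observed error ratio divided by the error budget (1 minus the SLO target). The names `WindowCounts` and `burnRate` are hypothetical, and the actual definitions in slo.yaml may differ.

```typescript
// Burn rate = observed error ratio / error budget, where budget = 1 - SLO target.
// Names are illustrative only; the real definitions live in slo.yaml.
interface WindowCounts {
  total: number;  // requests observed in the window
  failed: number; // requests counted against the SLO (for example 5xx)
}

function burnRate(counts: WindowCounts, sloTarget: number): number {
  if (counts.total === 0) return 0; // no traffic, no budget consumed
  const errorRatio = counts.failed / counts.total;
  const errorBudget = 1 - sloTarget;
  return errorRatio / errorBudget;
}

// API availability SLO (99.9%) with a made-up one-hour sample: 1.5% of requests failed.
const lastHour: WindowCounts = { total: 200000, failed: 3000 };
console.log(burnRate(lastHour, 0.999).toFixed(1)); // prints "15.0"
```

A burn rate of 1 spends the budget exactly over the SLO window. At 15, a 30-day budget would be exhausted in about two days if the rate held.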

2. Logs

Structured JSON, one line per request or event. Mandatory fields:

{
"ts":"2026-04-22T12:00:00.123Z",
"level":"INFO",
"service":"housekeeping-service",
"version":"1.4.2",
"tenantId":"tnt_…", // never the raw value in lower envs; hashed in production "tnt_h_<sha256-prefix>"
"requestId":"req_…",
"traceId":"…",
"spanId":"…",
"actor":{"type":"user","id":"stf_…"},
"route":"POST /tasks/:id/complete",
"useCase":"CompleteTaskUseCase",
"taskId":"hkt_…",
"roomId":"rom_…",
"outcome":"ok",
"latencyMs":143,
"msg":"task completed"
}

Levels: DEBUG (off in prod), INFO (default), WARN (recoverable), ERROR (use-case failure), FATAL (process-killing). Sensitive fields redacted by redactor.middleware.ts per SECURITY_MODEL.md §8. Retention: 30 days hot, 365 days archived in GCS.
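
A minimal sketch of how a log line in this shape could be produced, assuming the level ordering listed above and a caller that passes an already hashed tenantId. `formatLogLine`, `LogFields` and `sampleFields` are hypothetical names, the field set is a subset of the mandatory fields for brevity, and redaction (redactor.middleware.ts) is not modelled.

```typescript
type Level = "DEBUG" | "INFO" | "WARN" | "ERROR" | "FATAL";

const LEVEL_ORDER: Level[] = ["DEBUG", "INFO", "WARN", "ERROR", "FATAL"];

// Subset of the mandatory fields; version, actor, taskId and roomId are omitted here.
interface LogFields {
  tenantId: string; // expected to be hashed by the caller before it reaches the logger
  requestId: string;
  traceId: string;
  spanId: string;
  route: string;
  useCase: string;
  outcome: string;
  latencyMs: number;
  msg: string;
}

// Returns one JSON line, or null when the level is below the configured minimum
// (for example DEBUG when the minimum is INFO).
function formatLogLine(level: Level, fields: LogFields, minLevel: Level, now: Date): string | null {
  if (LEVEL_ORDER.indexOf(level) < LEVEL_ORDER.indexOf(minLevel)) return null;
  const record = { ts: now.toISOString(), level, service: "housekeeping-service", ...fields };
  return JSON.stringify(record); // no indent argument, so the output stays on one line
}

// Placeholder values for illustration only.
const sampleFields: LogFields = {
  tenantId: "tnt_h_example",
  requestId: "req_example",
  traceId: "trace-example",
  spanId: "span-example",
  route: "POST /tasks/:id/complete",
  useCase: "CompleteTaskUseCase",
  outcome: "ok",
  latencyMs: 143,
  msg: "task completed",
};

console.log(formatLogLine("INFO", sampleFields, "INFO", new Date("2026-04-22T12:00:00.123Z")));
```

Passing the clock in as `now` keeps the function deterministic under test.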

3. Metrics

Naming: melmastoon.housekeeping.<area>.<measure> with labels (tenant_hash, property_id, outcome, route|use_case|topic). Histograms preferred over averages.
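
The naming scheme can be enforced with a small builder. The sketch below is hypothetical: `metricName` is not part of the service, and the lowercase snake_case rule is an assumption inferred from the metric names listed in this section.

```typescript
const METRIC_PREFIX = "melmastoon.housekeeping";

// Assumed rule: dot-separated segments, each lowercase snake_case starting with a letter.
const SEGMENTS = /^[a-z][a-z0-9_]*(\.[a-z][a-z0-9_]*)*$/;

function metricName(area: string, measure: string): string {
  for (const part of [area, measure]) {
    if (!SEGMENTS.test(part)) {
      throw new Error('invalid metric segment: "' + part + '"');
    }
  }
  return METRIC_PREFIX + "." + area + "." + measure;
}

console.log(metricName("api", "latency_ms"));      // melmastoon.housekeeping.api.latency_ms
console.log(metricName("ai.routing", "fallback")); // melmastoon.housekeeping.ai.routing.fallback
```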

3.1 Hot-path

  • melmastoon.housekeeping.api.requests (counter; labels: route, status_code)
  • melmastoon.housekeeping.api.latency_ms (histogram; labels: route)
  • melmastoon.housekeeping.use_case.duration_ms (histogram; labels: use_case, outcome)
  • melmastoon.housekeeping.outbox.append_to_publish_lag_ms (histogram)
  • melmastoon.housekeeping.outbox.unpublished_rows (gauge; alert > 1k for 5m)

3.2 Domain

  • melmastoon.housekeeping.tasks.created (counter; labels: kind, source)
  • melmastoon.housekeeping.tasks.completed (counter; labels: kind, with_inspection)
  • melmastoon.housekeeping.tasks.duration_minutes (histogram; labels: kind)
  • melmastoon.housekeeping.room.time_to_ready_minutes (histogram; labels: property_hash)
  • melmastoon.housekeeping.tasks.escalated (counter; labels: reason)
  • melmastoon.housekeeping.tasks.failed (counter; labels: reason)
  • melmastoon.housekeeping.linen.on_hand (gauge; labels: property_hash, line)
  • melmastoon.housekeeping.linen.low_stock_alerts_emitted (counter)
  • melmastoon.housekeeping.shift.staffing_gap_detected (counter)
  • melmastoon.housekeeping.inspection.outcome (counter; labels: outcome)

3.3 Sync

  • melmastoon.housekeeping.sync.pull.row_count (histogram; labels: aggregate)
  • melmastoon.housekeeping.sync.push.ops (counter; labels: outcome)
  • melmastoon.housekeeping.sync.conflicts (counter; labels: aggregate, field)

3.4 AI port

  • melmastoon.housekeeping.ai.routing.requests (counter; labels: outcome, hitl_mode)
  • melmastoon.housekeeping.ai.routing.latency_ms (histogram)
  • melmastoon.housekeeping.ai.routing.applied_rows (histogram)
  • melmastoon.housekeeping.ai.routing.fallback (counter)

3.5 System

  • Cloud Run native metrics (CPU, mem, instance count, request concurrency).
  • Cloud SQL native metrics (connections, slow queries, replication lag, CPU).
  • Pub/Sub: subscription/oldest_unacked_message_age, delivery_attempts.

4. Traces

OpenTelemetry. Every request and event handler is a root span. Standard span names:

  • http.request (attributes: route, status, request_id)
  • housekeeping.use_case.<name> (attributes: tenant_hash, aggregate_id)
  • housekeeping.repo.<table>.<op> (attributes: rows, version)
  • housekeeping.outbox.append
  • housekeeping.pubsub.consume.<subject>
  • housekeeping.ai.routing.suggest

traceparent propagated via the event envelope; consumers continue the trace.
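
For illustration, here is a sketch of reading a W3C Trace Context traceparent value (version 00) from an event envelope so a consumer can continue the trace. The envelope shape and the `parseTraceparent` helper are assumptions. In practice the OpenTelemetry propagators would normally handle this rather than hand-written parsing.

```typescript
interface TraceParent {
  traceId: string;      // 32 lowercase hex characters
  parentSpanId: string; // 16 lowercase hex characters
  sampled: boolean;     // lowest bit of the trace-flags byte
}

// Version 00 layout: 00-<trace-id>-<parent-id>-<trace-flags>
const TRACEPARENT = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(header: string): TraceParent | null {
  const m = TRACEPARENT.exec(header);
  if (m === null) return null;
  const traceId = m[1];
  const parentSpanId = m[2];
  const flags = parseInt(m[3], 16);
  // All-zero trace and span ids are invalid per the W3C specification.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentSpanId)) return null;
  return { traceId, parentSpanId, sampled: (flags & 1) === 1 };
}

// Hypothetical envelope shape; the field name `traceparent` is an assumption.
interface EventEnvelope {
  type: string;
  traceparent?: string;
}

const envelope: EventEnvelope = {
  type: "reservation.checked_out.v1",
  traceparent: "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
};

const parentCtx = envelope.traceparent ? parseTraceparent(envelope.traceparent) : null;
console.log(parentCtx); // a consumer span would use traceId and parentSpanId as its parent context
```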

Sample rate: 100% for ERROR, 10% for INFO under load, 100% on canary.

5. Errors (Sentry)

  • All ERROR-level logs auto-shipped to Sentry with the trace context.
  • Event handlers tag errors with topic, message_id, delivery_attempt.
  • Domain errors are not Sentry events: they are expected business outcomes (4xx). Only 5xx, panics, and concurrency conflict storms (10+ in 60 s on the same aggregate) page on-call.
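
The "10+ in 60 s on the same aggregate" rule can be read as a sliding-window count per aggregate. The sketch below shows one way to detect it. `ConflictStormDetector` is a hypothetical name, and how the service actually detects storms is not described here.

```typescript
// Sliding-window counter keyed by aggregate id.
// Defaults mirror the rule in the text: 10 conflicts within 60 seconds.
class ConflictStormDetector {
  private readonly hits = new Map<string, number[]>();
  private readonly threshold: number;
  private readonly windowMs: number;

  constructor(threshold: number = 10, windowMs: number = 60000) {
    this.threshold = threshold;
    this.windowMs = windowMs;
  }

  // Records one concurrency conflict and returns true when the aggregate
  // has reached the threshold inside the window ending at nowMs.
  record(aggregateId: string, nowMs: number): boolean {
    const cutoff = nowMs - this.windowMs;
    const recent = (this.hits.get(aggregateId) || []).filter((t) => t > cutoff);
    recent.push(nowMs);
    this.hits.set(aggregateId, recent);
    return recent.length >= this.threshold;
  }
}

const detector = new ConflictStormDetector();
let storm = false;
for (let i = 0; i < 10; i++) {
  storm = detector.record("hkt_example", i * 1000); // one conflict per second
}
console.log(storm); // true: the tenth conflict falls inside the same 60 s window
```

Timestamps older than the window are dropped on each call, so memory per aggregate stays bounded by the conflict rate.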

6. Dashboards

Grafana folder housekeeping:

  1. Overview — request rate, latency p50/p95/p99, 4xx/5xx, error budget burn.
  2. Turnover Saga — tasks created/started/completed/min, time-to-ready histogram, current open count by status.
  3. Board — GET /board p99, board cache hit ratio, sync push outcomes.
  4. Linen — on-hand by line per property, low-stock alerts, runway minutes.
  5. Shifts & Routing — staffing-gap events, routing requests/applied/fallback, hitl mode by tenant.
  6. Database — Cloud SQL CPU, slow query top 10, partition sizes, RLS denies.
  7. Pub/Sub — consumer lag per subscription, DLQ size, retry counts.

7. Alerts

| Alert | Condition | Route |
|---|---|---|
| OutboxBacklog | unpublished rows > 1000 for 5 min | on-call (housekeeping) |
| BoardLatencyHigh | board p99 > 400 ms for 10 min | on-call |
| TurnoverCreateLagHigh | p95 > 5 s for 10 min | on-call |
| ConsumerLagHigh | oldest unacked > 60 s for 5 min on any subscription | on-call |
| DLQGrowth | DLQ size > 10 in 15 min | on-call |
| ErrorBudgetBurnFast | 1h burn rate > 14× | on-call (page) |
| ErrorBudgetBurnSlow | 6h burn rate > 6× | on-call (ticket) |
| RLSDenyStorm | RLS denies > 0 for 5 min | security on-call (this should never happen) |
| RoutingFallbackPersistent | fallback counter > 0 for 30 min | engineering Slack (warn) |
| LinenLowStockSpike | low-stock alerts > 5/property in 1 h | analytics Slack (warn) |

Pages go to PagerDuty housekeeping; warnings to Slack #hk-ops.
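
Several conditions above have the form "value > threshold for N min". The sketch below illustrates that semantics under the assumption that the condition must hold continuously up to the latest sample. The real evaluation is done by the alerting backend from alerts.yaml, and `sustainedAbove` is a hypothetical helper.

```typescript
interface Sample {
  atMs: number;  // sample timestamp in milliseconds
  value: number; // observed metric value
}

// True when the series has stayed above `threshold` for at least `forMs`,
// judged at the latest sample. Samples are assumed to be sorted by time.
function sustainedAbove(samples: Sample[], threshold: number, forMs: number): boolean {
  if (samples.length === 0) return false;
  let breachStart: number | null = null;
  for (const s of samples) {
    if (s.value > threshold) {
      if (breachStart === null) breachStart = s.atMs; // start of the current breach
    } else {
      breachStart = null; // any dip resets the timer
    }
  }
  const latest = samples[samples.length - 1];
  return breachStart !== null && latest.atMs - breachStart >= forMs;
}

// OutboxBacklog: unpublished rows > 1000 for 5 min, sampled once per minute (made-up values).
const MINUTE = 60000;
const steady = [1200, 1300, 1250, 1100, 1400, 1500].map((value, i) => ({ atMs: i * MINUTE, value }));
const dipped = [1200, 1300, 1250, 900, 1400, 1500].map((value, i) => ({ atMs: i * MINUTE, value }));

console.log(sustainedAbove(steady, 1000, 5 * MINUTE)); // true
console.log(sustainedAbove(dipped, 1000, 5 * MINUTE)); // false: the dip at minute 3 resets the timer
```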

8. Runbooks

Each alert has a runbook in runbooks/:

  • outbox-backlog.md
  • board-latency.md
  • turnover-create-lag.md
  • consumer-lag.md
  • dlq-growth.md
  • rls-deny.md
  • routing-fallback.md

Runbooks include: how to confirm, immediate mitigation, escalation, post-mortem template.

9. Audit visibility

audit_events is queryable via the internal admin tool (/internal/admin/audit?aggregate_id=…) — restricted to tenant_admin and platform support roles. Mirrored to audit-service over Pub/Sub.