maintenance-service · OBSERVABILITY
Stack: OpenTelemetry SDK (Node) → OTLP/gRPC → Cloud Operations + SigNoz. Logs are written as structured JSON to stdout → Cloud Logging. Metrics are exported via OTLP. Traces are sampled at a 10% baseline, and at 100% on errors and for properties on the propertyId debug allow-list.
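The sampling policy above can be sketched as a single decision function. This is a minimal illustration, not the service's actual sampler: the `SamplingInput` shape, `shouldSample`, and the allow-list argument are hypothetical names, and the sketch assumes the error flag is already known when the decision is made (in practice, sampling on errors usually requires a tail-sampling step, because the outcome is not known when the span starts).

```typescript
// Hypothetical input shape; none of these names come from the service code.
interface SamplingInput {
  traceId: string;      // 32 hex characters, as in W3C trace context
  isError: boolean;     // assumption: known at decision time
  propertyId?: string;
}

const BASELINE_RATIO = 0.1; // 10% baseline from the stack description

function shouldSample(
  input: SamplingInput,
  debugAllowList: ReadonlyArray<string>
): boolean {
  // Errors are always kept.
  if (input.isError) return true;
  // Properties on the debug allow-list are always kept.
  if (
    input.propertyId !== undefined &&
    debugAllowList.indexOf(input.propertyId) !== -1
  ) {
    return true;
  }
  // Deterministic ratio sampling: map the first 8 hex characters of the
  // trace ID onto [0, 1) so the same trace always gets the same decision.
  const bucket = parseInt(input.traceId.slice(0, 8), 16) / 0x100000000;
  return bucket < BASELINE_RATIO;
}
```

Deriving the decision from the trace ID rather than a random draw keeps it stable across retries of the same trace.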
1. Required span attributes
Every server span (HTTP, Pub/Sub handler, worker tick) must carry:
| Attribute | Source | Notes |
|---|---|---|
| service.name | resource | maintenance-service |
| service.namespace | resource | melmastoon |
| service.version | resource | from build env BUILD_VERSION |
| tenant.id | request middleware | from JWT / event envelope |
| property.id | request middleware | when applicable |
| correlation.id | header / envelope | propagated end-to-end |
| user.id | JWT | hashed for analytics export |
| device.id | JWT | desktop sync only |
| work_order.id | use case | when applicable |
| asset.id | use case | |
| vendor.id | use case | |
| wo.severity, wo.category, wo.source, wo.status_from, wo.status_to | transitions | |
| saga.step | reserved | choreography hop label, e.g., mnt.room_block_request |
| outbox.event.type, outbox.event.id | outbox publish span | |
| inbox.event.type, inbox.message_id | inbox handler span | |
| ai.capability, ai.model, ai.score, ai.human_accepted | AI calls | |
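A check like the following could be run in tests or in a span processor to catch spans that lack the attributes from the table. It is a sketch under stated assumptions: `missingRequiredAttributes` and `ALWAYS_REQUIRED` are hypothetical names, the input is assumed to be the merged view of resource and span attributes, and only the rows that are not conditional ("when applicable") or tied to a specific span kind are treated as mandatory everywhere.

```typescript
type SpanAttributes = Record<string, string | number | boolean | undefined>;

// Assumption: these rows from the table apply to every server span.
// Conditional rows (property.id, work_order.id, ...) are not enforced here.
const ALWAYS_REQUIRED: ReadonlyArray<string> = [
  "service.name",
  "service.namespace",
  "service.version",
  "tenant.id",
  "correlation.id",
];

// Returns the names of mandatory attributes that are absent or empty.
function missingRequiredAttributes(attrs: SpanAttributes): string[] {
  const missing: string[] = [];
  for (const key of ALWAYS_REQUIRED) {
    const value = attrs[key];
    if (value === undefined || value === "") missing.push(key);
  }
  return missing;
}
```

An empty result means the span satisfies the unconditional part of the table; the conditional attributes still need a check in the code path that sets them.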
2. Structured log fields
All logs must be JSON with at minimum:
{
  "timestamp": "2026-04-22T14:03:21.812Z",
  "level": "INFO",
  "msg": "work order resolved",
  "service": "maintenance-service",
  "tenantId": "tnt_...",
  "propertyId": "prop_...",
  "correlationId": "01HXY...",
  "spanId": "abc123",
  "traceId": "def456",
  "workOrderId": "mnt_...",
  "fields": { "severity": "high", "category": "hvac", "costAmountMicro": "150000000", "currency": "AFN" }
}
PII never appears in logs. Vendor phone numbers / emails are referenced by vendorId only.
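One way to produce log lines in this shape is a small formatter that owns the required fields and drops known PII keys before serialising. This is an illustrative sketch only: `formatLogLine`, `LogContext`, and the `PII_KEYS` deny-list are hypothetical, and a key-based deny-list is an assumption about how the PII rule might be enforced, not a description of the service's logger.

```typescript
type LogLevel = "DEBUG" | "INFO" | "WARN" | "ERROR";

interface LogContext {
  tenantId: string;
  propertyId?: string;
  correlationId: string;
  spanId: string;
  traceId: string;
  workOrderId?: string;
}

// Hypothetical deny-list; vendors are referenced by vendorId instead.
const PII_KEYS: ReadonlyArray<string> = ["phone", "email"];

function formatLogLine(
  ctx: LogContext,
  level: LogLevel,
  msg: string,
  fields: Record<string, string | number | boolean>,
  now: Date = new Date()
): string {
  // Copy only the fields that are not on the PII deny-list.
  const safeFields: Record<string, string | number | boolean> = {};
  for (const key of Object.keys(fields)) {
    if (PII_KEYS.indexOf(key) === -1) safeFields[key] = fields[key];
  }
  // JSON.stringify omits properties whose value is undefined,
  // so optional context fields disappear when they are not set.
  return JSON.stringify({
    timestamp: now.toISOString(),
    level: level,
    msg: msg,
    service: "maintenance-service",
    tenantId: ctx.tenantId,
    propertyId: ctx.propertyId,
    correlationId: ctx.correlationId,
    spanId: ctx.spanId,
    traceId: ctx.traceId,
    workOrderId: ctx.workOrderId,
    fields: safeFields,
  });
}
```

Monetary amounts such as costAmountMicro are passed as strings, as in the example above, so large values are not rounded by JSON number handling.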
3. SLIs / SLOs
| SLI | SLO target | Measurement window |
|---|---|---|
| API availability (non-5xx / total) | ≥ 99.9% | 30-day rolling |
| API p95 latency for GET /work-orders (filtered, ≤ 5k rows) | ≤ 400 ms | 30-day rolling |
| API p95 latency for POST /work-orders (no AI) | ≤ 600 ms | 30-day rolling |
| API p99 latency for state transitions | ≤ 1 s | 30-day rolling |
| Outbox publish lag (enqueue → published) p99 | ≤ 5 s | 7-day rolling |
| Preventive scheduler lag (due → draft WO created) p99 | ≤ 60 s | 7-day rolling |
| SLA breach detection lag p99 | ≤ 60 s | 7-day rolling |
| Inbox handler success rate | ≥ 99.5% | 7-day rolling |
| Pub/Sub DLQ rate | ≤ 0.05% | 7-day rolling |
Alerts fire when 50% and 80% of the error budget has been consumed within the measurement window.
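As a worked example of the budget arithmetic: with a 99.9% availability SLO, 1,000,000 requests in the window allow roughly 1,000 failed requests, so 600 failures consume about 60% of the budget and cross the 50% threshold but not the 80% one. The sketch below shows that calculation; the function names are hypothetical and the request-count formulation is an assumption about how the budget is measured.

```typescript
// Fraction of the error budget consumed: failures divided by the number
// of failures the SLO allows for the observed request volume.
function errorBudgetConsumed(
  totalRequests: number,
  failedRequests: number,
  sloTarget: number
): number {
  const allowedFailures = totalRequests * (1 - sloTarget);
  if (allowedFailures <= 0) return failedRequests > 0 ? Infinity : 0;
  return failedRequests / allowedFailures;
}

// Alert thresholds from the text above: 50% and 80% of the budget.
const BUDGET_ALERT_THRESHOLDS: ReadonlyArray<number> = [0.5, 0.8];

// Returns the thresholds that the given consumption has reached.
function crossedThresholds(consumed: number): number[] {
  return BUDGET_ALERT_THRESHOLDS.filter((t) => consumed >= t);
}
```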
4. RED + USE metrics
RED (per route + per Pub/Sub subscription + per worker)
- mnt.requests_total{route, code}
- mnt.request_duration_seconds_bucket{route} (histogram)
- mnt.errors_total{route, code}
- mnt.inbox_messages_total{subscription, result} where result ∈ {processed, dedup, failed}
- mnt.worker_ticks_total{worker, result}
USE (resource)
- mnt.db.connections{pool=primary} gauge
- mnt.db.query_duration_seconds{op} histogram
- mnt.cache.hit_total{key} / mnt.cache.miss_total{key}
- mnt.outbox.lag_seconds gauge (now − oldest unpublished enqueued_at)
- mnt.outbox.unpublished_count gauge
- mnt.preventive.due_pending_count gauge
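The two outbox gauges above are derived from the outbox table rather than counted in process. A minimal sketch of the calculation, assuming a hypothetical in-memory `OutboxRow` shape (the real values would more likely come from a SQL query over enqueued_at):

```typescript
// Hypothetical row shape; only enqueued_at is named in the metric definition.
interface OutboxRow {
  eventId: string;
  enqueuedAt: Date;
  publishedAt: Date | null; // null while the event is still unpublished
}

// mnt.outbox.lag_seconds: now minus the oldest unpublished enqueued_at.
// Reports 0 when nothing is waiting to be published.
function outboxLagSeconds(rows: ReadonlyArray<OutboxRow>, now: Date): number {
  let oldest: number | null = null;
  for (const row of rows) {
    if (row.publishedAt !== null) continue;
    const enqueued = row.enqueuedAt.getTime();
    if (oldest === null || enqueued < oldest) oldest = enqueued;
  }
  if (oldest === null) return 0;
  return Math.max(0, (now.getTime() - oldest) / 1000);
}

// mnt.outbox.unpublished_count: number of rows not yet published.
function outboxUnpublishedCount(rows: ReadonlyArray<OutboxRow>): number {
  let count = 0;
  for (const row of rows) {
    if (row.publishedAt === null) count += 1;
  }
  return count;
}
```

Measuring lag from the oldest unpublished row means a single stuck event keeps the gauge rising even when newer events are still flowing.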
Domain-specific
- mnt.work_orders.created_total{source, severity, category} counter
- mnt.work_orders.resolved_total{category, severity} counter
- mnt.work_orders.sla_breached_total{category, severity} counter
- mnt.work_orders.room_blocked_total{} counter
- mnt.work_orders.relocations_required_total{} counter
- mnt.work_orders.escalations_total{hop} counter
- mnt.preventive.fired_total{cadence_kind} counter
- mnt.parts.out_of_stock_total{partId} counter
- mnt.assets.health_index gauge labelled (propertyId, class) (sampled, not per-asset)
- mnt.ai.calls_total{capability, accepted} counter
- mnt.ai.cost_micro_usd_total{capability} counter
- mnt.ai.latency_seconds_bucket{capability} histogram
5. Dashboards
5.1 Service health (default)
- Request rate, error rate, p50/p95/p99 latency per route.
- Outbox lag and unpublished count.
- DB connection saturation, query p95 by op.
- Cache hit ratios for mnt:open:* and mnt:vendors:*.
- Pod CPU/mem, request count per pod.
5.2 Operations
- Open WOs by status × severity × category (heatmap).
- Median time-in-status per category.
- SLA breach count + open breach count.
- Top 10 assets by WO count last 30 days.
- Vendor responsiveness: avg time assigned → acknowledged per vendor.
- Preventive due / overdue counts.
- Cost per resolved WO (rolling 30 day) by category.
5.3 AI
- Calls per capability, accept rate, avg score, avg latency, cost per call, daily total cost vs budget.
- HITL rejection rate (suggestion vs final value diff).
5.4 Sync
- Pull / push rates, push failure rate by error code, conflicts detected by aggregate, average device clock skew.
6. Alerts (selected; full catalog in runbook repo)
| Alert | Condition | Severity | Runbook |
|---|---|---|---|
| mnt.api_5xx_burn_fast | error budget burn rate ≥ 14.4× over the last 1 h | P1 | runbook://maintenance/api-5xx |
| mnt.outbox_lag_high | mnt.outbox.lag_seconds > 60 for 5 min | P2 | runbook://maintenance/outbox-lag |
| mnt.outbox_unpublished_high | mnt.outbox.unpublished_count > 1000 for 10 min | P2 | runbook://maintenance/outbox-lag |
| mnt.preventive_lag_high | mnt.preventive.due_pending_count > 50 for 5 min | P2 | runbook://maintenance/preventive-lag |
| mnt.sla_breach_storm | mnt.work_orders.sla_breached_total rate ≥ 3× the 7-day baseline | P3 | runbook://maintenance/sla-storm |
| mnt.dlq_rate_high | DLQ rate > 0.5% over 30 min | P2 | runbook://maintenance/dlq |
| mnt.ai_budget_exhausted | tenant budget used 100% | P3 (per tenant) | runbook://maintenance/ai-budget |
| mnt.sync_push_failure_high | push failure rate > 5% over 5 min | P2 | runbook://maintenance/sync-push |
| mnt.db_connection_saturation | connection pool ≥ 90% used for 5 min | P2 | runbook://maintenance/db-connections |
| mnt.room_block_no_response | room_blocked request not acknowledged by property-service within 60 s | P3 | runbook://maintenance/room-block |
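The 14.4 figure in mnt.api_5xx_burn_fast is easier to read with the arithmetic spelled out. Burn rate is the observed error ratio divided by the ratio the SLO allows, so against a 99.9% target a burn rate of 14.4 corresponds to about 1.44% of requests failing. Sustained for one hour of a 30-day (720 h) window, that spends 14.4 × 1 / 720 = 2% of the budget. The helpers below are a hypothetical sketch of that calculation, not the alerting rule itself.

```typescript
// Burn rate: how many times faster than "exactly on budget" errors occur.
function burnRate(errorRatio: number, sloTarget: number): number {
  return errorRatio / (1 - sloTarget);
}

// Fraction of the whole SLO window's budget spent if the given burn rate
// is sustained for alertWindowHours.
function budgetFractionSpent(
  rate: number,
  alertWindowHours: number,
  sloWindowHours: number
): number {
  return (rate * alertWindowHours) / sloWindowHours;
}

// Assumed fast-burn threshold, taken from the alert table above.
const FAST_BURN_THRESHOLD = 14.4;

function isFastBurn(errorRatio: number, sloTarget: number): boolean {
  return burnRate(errorRatio, sloTarget) >= FAST_BURN_THRESHOLD;
}
```

For example, `isFastBurn(0.02, 0.999)` is true because a 2% error ratio is a burn rate of about 20, while `isFastBurn(0.005, 0.999)` is false at a burn rate of about 5.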
7. Tracing patterns
7.1 Auto-create WO from housekeeping flag
[housekeeping-service] hk.flag.publish ──► causationId
[Pub/Sub] transport
[maintenance-service] inbox.handler ──► correlationId (= housekeeping causationId)
└── mnt.create_work_order
├── ai.severity-suggestion (optional)
├── ai.category-classify (optional)
├── repo.work_order.save (tx)
├── outbox.append (work_order.created.v1)
├── outbox.append (work_order.room_blocked.v1) // if applicable
└── outbox.append (work_order.relocation_required.v1) // if applicable
[outbox-relay] publishes → property-service inbox → ...
The same correlationId flows through every hop. Span links tie each span to the causationId of its upstream event, which yields the full causal graph.
7.2 Saga: room block → relocation
Reservation-service's room_change.saga.step1.evaluate span includes a span link back to our work_order.relocation_required.v1 outbox publish span. The full causal graph is reconstructable in SigNoz.
8. Replay & backfill observability
When the outbox-relay replays events from a checkpoint, every replayed publish span carries replay=true. Dashboards filter these out by default to avoid noise.
When migrating data (see MIGRATION_PLAN.md), the importer worker emits its own metric series prefixed mnt.migrate.* with labels (tenantId, source) so we can watch progress and pause if errors spike.
9. Synthetic checks
Cloud Monitoring uptime checks (every 60 s, from 3 regions):
- GET /healthz (liveness)
- GET /readyz (readiness; checks DB + Pub/Sub auth + outbox-relay heartbeat)
- GET /api/v1/maintenance/work-orders?limit=1 with synthetic test tenant
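The /readyz check above combines three dependency checks into one answer. A minimal sketch of that aggregation follows; the check names, the `evaluateReadiness` function, the heartbeat freshness helper, and the use of HTTP 503 for "not ready" are all assumptions made for illustration.

```typescript
type CheckName = "db" | "pubsubAuth" | "outboxRelayHeartbeat";

interface ReadinessResult {
  status: 200 | 503;     // assumption: 503 signals "not ready"
  failing: CheckName[];  // which dependencies failed, for the response body
}

// Ready only when every dependency check passed.
function evaluateReadiness(checks: Record<CheckName, boolean>): ReadinessResult {
  const names: CheckName[] = ["db", "pubsubAuth", "outboxRelayHeartbeat"];
  const failing = names.filter((name) => !checks[name]);
  return { status: failing.length === 0 ? 200 : 503, failing: failing };
}

// Hypothetical freshness rule for the outbox-relay heartbeat: the last
// beat must be no older than maxAgeSeconds.
function isHeartbeatFresh(lastBeat: Date, now: Date, maxAgeSeconds: number): boolean {
  return (now.getTime() - lastBeat.getTime()) / 1000 <= maxAgeSeconds;
}
```

Returning the list of failing checks lets the uptime check report which dependency caused the failure instead of a bare status code.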
Per-tenant synthetic ("smoke") flow runs every 15 minutes in staging:
- Create WO → assert created
- Assign to test staff → assert assigned
- Start → resolve → verify → assert verified and OOO released
Failures page on-call.
10. Data export to BigQuery
- All events archived 1× daily to melmastoon_events_v1.maintenance_* partitioned by producedAt date.
- Operational metrics exported via Cloud Operations sink → BigQuery for long-term trend analysis.
- Sync conflict logs exported nightly for cross-tenant pattern detection.
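For the event archive in the first bullet, each event has to be mapped to the daily partition of its producedAt timestamp. The sketch below shows one way to derive that date. It assumes producedAt is an ISO 8601 string and that partitions follow UTC calendar days; `partitionDate` is a hypothetical helper, not part of the export job.

```typescript
// Returns the UTC calendar date (YYYY-MM-DD) for an event's producedAt.
// Assumption: daily partitions are aligned to UTC, so timestamps carrying
// a local offset are normalised to UTC before the date is taken.
function partitionDate(producedAt: string): string {
  const parsed = new Date(producedAt);
  if (isNaN(parsed.getTime())) {
    throw new Error("invalid producedAt: " + producedAt);
  }
  return parsed.toISOString().slice(0, 10);
}
```

An event produced shortly after local midnight in a zone ahead of UTC therefore lands in the previous day's partition, which matters when reconciling daily counts against local-time reports.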