
maintenance-service · SERVICE_OVERVIEW

Bounded Context: Maintenance (Supporting) · Aggregate root: WorkOrder · Owner: PMS Operations · Storage: Cloud SQL Postgres + outbox · Runtime: Node 20 + NestJS · Catalog: docs/03-microservices/maintenance-service.md

maintenance-service is the canonical home for everything in a Ghasi Melmastoon property that breaks, wears out, or needs to be kept alive on a schedule. It is the only service that mints WorkOrder aggregates, advances their lifecycle, attaches assignees (internal staff or external vendors), reconciles parts and labour cost, runs the preventive-schedule cron, and owns the Asset registry per property.

It is intentionally a supporting context — it does not own room state, folio postings, or the lock vendor relationship. Instead, it publishes events that the right owning context reacts to. Its centre of gravity is the WorkOrder lifecycle and the surrounding choreography with housekeeping-service, property-service, lock-integration-service, and (occasionally) reservation-service.

1. Bounded context

| Concern | In maintenance-service | Lives elsewhere |
| --- | --- | --- |
| Lifecycle of a maintenance issue | WorkOrder aggregate | |
| OOO/OOS state of a Room | ❌ (we only request) | property-service (Room.status) |
| Folio posting for a vendor invoice | ❌ (we only record) | billing-service |
| Lock device pairing / vendor secret | ❌ (we only consume health) | lock-integration-service |
| Housekeeping turn-down task | ❌ (we re-queue via event) | housekeeping-service |
| Reservation re-accommodation flow | ❌ (we only emit relocation_required.v1) | reservation-service |
| Preventive cadence per asset | PreventiveSchedule | |
| Asset registry per property | Asset (HVAC, generator, water tank, lock device, linen, furniture) | |
| Light parts inventory | Part + PartUsage | Heavy multi-warehouse stock would belong to a future inventory-service |
| Vendor records (no-email allowed) | Vendor | |
| SLA timers + auto-escalation | | Tenant-level escalation tree comes from tenant-service settings |
| Cost roll-up onto WO | ✅ (labour + parts + vendor invoice) | analytics-service rolls cost-per-room-night |

2. Aggregates

| Aggregate | Identity | Responsibilities | Notes |
| --- | --- | --- | --- |
| WorkOrder (root) | mnt_<ULID> | State machine, severity, category, asset link, assignee, cost rollup, SLA timer, OCC version | Same conceptual entity as MaintenanceTicket in NAMING.md; mnt_ prefix preserved |
| MaintenanceTask | mtk_<ULID> | Sub-step inside a WO ("drain → refill → test"); ordered, can be checked off independently | Optional; only used when staff opt for a multi-step plan |
| PreventiveSchedule | psch_<ULID> | Recurring rule keyed to an Asset or asset class; emits preventive.due.v1 | Cron windows in tenant timezone |
| Asset | ast_<ULID> | HVAC unit, generator, water tank, lock device, linen lot, furniture lot, IT device | healthIndex 0–100 maintained by AI signals |
| PartUsage | pus_<ULID> | A part consumed against a WO (qty, unit cost) | Decrements Part.onHand |
| Vendor | vnd_<ULID> | External contractor or supplier | email optional; channelPreference ∈ {whatsapp, sms, email, call_only} |
| MaintenanceCategory | mcat_<ULID> | Tenant-overridable taxonomy on top of 9 canonical defaults | Defaults seeded per tenant on provisioning |
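
The identity column lends itself to a small validation helper. The following is a minimal TypeScript sketch, not the service's actual code: the prefixes come from the table, while `ID_PREFIXES`, `ULID_BODY`, and `isIdOf` are hypothetical names. It assumes the canonical uppercase ULID form (26 characters of Crockford Base32) and does not check the timestamp range.

```typescript
// Prefixes are taken from the aggregate table; all names here are illustrative.
const ID_PREFIXES = {
  WorkOrder: "mnt",
  MaintenanceTask: "mtk",
  PreventiveSchedule: "psch",
  Asset: "ast",
  PartUsage: "pus",
  Vendor: "vnd",
  MaintenanceCategory: "mcat",
} as const;

type AggregateName = keyof typeof ID_PREFIXES;

// Crockford Base32 alphabet: digits plus A-Z without I, L, O, U.
const ULID_BODY = /^[0-9A-HJKMNP-TV-Z]{26}$/;

// True when `id` looks like `<prefix>_<ULID>` for the given aggregate.
function isIdOf(aggregate: AggregateName, id: string): boolean {
  const prefix = ID_PREFIXES[aggregate] + "_";
  if (id.slice(0, prefix.length) !== prefix) return false;
  return ULID_BODY.test(id.slice(prefix.length));
}
```

A check like this could sit at the API boundary so that a vendor ID passed where a work-order ID is expected fails fast rather than surfacing as a missing row.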

3. State machine — WorkOrder.status

(cancel from any non-terminal state ──▶ cancelled, terminal)

┌──────────┐   assign    ┌──────────┐   start   ┌─────────────┐
│   open   ├────────────▶│ assigned ├──────────▶│ in_progress │
└──────────┘             └──────────┘           └──┬───────▲──┘
                             block (reason, eta?)  │       │  resume
                                                ┌──▼───────┴──┐
                                                │   blocked   │
                                                └─────────────┘

┌─────────────┐   resolve   ┌──────────┐   verify   ┌──────────┐
│ in_progress ├────────────▶│ resolved ├───────────▶│ verified │ (terminal)
└─────────────┘             └──────────┘            └──────────┘

Allowed transitions (full table in DOMAIN_MODEL.md §state machine):

| From | To | Command |
| --- | --- | --- |
| open | assigned | assign(assignee) |
| open | cancelled | cancel(reason) |
| assigned | open | unassign() |
| assigned | in_progress | start() |
| assigned | cancelled | cancel(reason) |
| in_progress | blocked | block(reason, eta?) |
| in_progress | resolved | resolve(resolution, costLines, parts[]) |
| in_progress | cancelled | cancel(reason) |
| blocked | in_progress | resume(note) |
| blocked | cancelled | cancel(reason) |
| resolved | verified | verify(verifierId, note?) |
| resolved | in_progress | reopen(reason) |
| verified | (terminal) | |
| cancelled | (terminal) | |
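
The transition table above can be encoded as a lookup so that any command not listed for the current status is rejected. This is a minimal sketch under stated assumptions: `TRANSITIONS`, `applyCommand`, and the result shape are hypothetical names, command arguments are omitted, and only the error code string is taken from this document.

```typescript
type Status =
  | "open" | "assigned" | "in_progress" | "blocked"
  | "resolved" | "verified" | "cancelled";

type Command =
  | "assign" | "unassign" | "start" | "block" | "resume"
  | "resolve" | "verify" | "reopen" | "cancel";

// One entry per row of the transition table; terminal states allow nothing.
const TRANSITIONS: Record<Status, Partial<Record<Command, Status>>> = {
  open: { assign: "assigned", cancel: "cancelled" },
  assigned: { unassign: "open", start: "in_progress", cancel: "cancelled" },
  in_progress: { block: "blocked", resolve: "resolved", cancel: "cancelled" },
  blocked: { resume: "in_progress", cancel: "cancelled" },
  resolved: { verify: "verified", reopen: "in_progress" },
  verified: {},
  cancelled: {},
};

const INVALID_TRANSITION = "MELMASTOON.MAINTENANCE.INVALID_STATUS_TRANSITION";

type TransitionResult =
  | { ok: true; status: Status }
  | { ok: false; code: typeof INVALID_TRANSITION };

// Pure function: looks up the next status, or reports the rejection code.
function applyCommand(from: Status, command: Command): TransitionResult {
  const to = TRANSITIONS[from][command];
  if (to === undefined) return { ok: false, code: INVALID_TRANSITION };
  return { ok: true, status: to };
}

// Example: a direct open → resolved jump is rejected.
const skipAttempt = applyCommand("open", "resolve");
```

Keeping the matrix as data rather than as branching logic makes it straightforward to diff against the full table in DOMAIN_MODEL.md.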

Re-opens increment WorkOrder.reopenCount. The matching event, work_order.reopened.v1, is planned for Phase 1.1 and is not in the v1 event registry yet; reopenCount is kept as a forward-compatible field until then.

4. Choreography (ASCII)

┌──────────────────┐                                             ┌──────────────┐
│   housekeeping   │  housekeeping.room.maintenance_required.v1  │              │
│                  ├────────────────────────────────────────────▶│              │
└──────────────────┘                                             │              │
┌──────────────────┐  device.health                              │              │
│ lock-integration │  alert.v1                                   │ maintenance  │
│                  ├────────────────────────────────────────────▶│  -service    │
└──────────────────┘                                             │              │
                      guest complaint (Phase 2)                  │              │
                    ────────────────────────────────────────────▶│              │
┌──────────────────┐                                             │              │
│  bff-backoffice  │  POST /work-orders                          │              │
│                  ├────────────────────────────────────────────▶│              │
└──────────────────┘                                             └──────────────┘

work_order.created.v1 ────────────────────────────────┤
work_order.assigned.v1 ───────────────────────────────┤
work_order.room_blocked.v1 ─────────────▶ property-service (Room → OOO)
work_order.relocation_required.v1 ──────▶ reservation-service (room_change saga)
work_order.resolved.v1 ─────────────────▶ analytics-service, housekeeping-service (re-queue)
vendor.invoice_recorded.v1 ─────────────▶ billing-service (post to ledger)
part_usage.recorded.v1 ─────────────────▶ analytics-service, search-aggregation
preventive.due.v1 ◀──── cron worker ──── PreventiveSchedule
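
Since the service stores to Postgres with an outbox, the outbound half of this choreography can be pictured as a relay that drains unpublished rows. The sketch below is an in-memory illustration only: the event names and consumers are copied from the list above, while `OutboxRow`, its fields, and `relayPending` are hypothetical, and no Pub/Sub client is used.

```typescript
type EventType =
  | "work_order.room_blocked.v1"
  | "work_order.relocation_required.v1"
  | "work_order.resolved.v1"
  | "vendor.invoice_recorded.v1"
  | "part_usage.recorded.v1";

// Reference map only: with Pub/Sub the publisher does not address subscribers.
// work_order.created.v1, work_order.assigned.v1 and preventive.due.v1 are left
// out because this overview lists no consumer for them.
const CONSUMERS: Record<EventType, string[]> = {
  "work_order.room_blocked.v1": ["property-service"],
  "work_order.relocation_required.v1": ["reservation-service"],
  "work_order.resolved.v1": ["analytics-service", "housekeeping-service"],
  "vendor.invoice_recorded.v1": ["billing-service"],
  "part_usage.recorded.v1": ["analytics-service", "search-aggregation"],
};

// Assumed row shape; the real columns live in DATA_MODEL.md.
interface OutboxRow {
  eventType: EventType;
  aggregateId: string;
  payload: string; // JSON-serialised event body
  publishedAt: string | null; // null until the relay has sent it
}

// Publishes every pending row, marks it as sent, and returns the count.
// A crash between publish and mark re-sends the row, so delivery is
// at-least-once and consumers need to be idempotent.
function relayPending(
  rows: OutboxRow[],
  publish: (row: OutboxRow) => void,
  nowIso: string,
): number {
  let sent = 0;
  for (const row of rows) {
    if (row.publishedAt !== null) continue;
    publish(row);
    row.publishedAt = nowIso;
    sent++;
  }
  return sent;
}
```

In a real deployment the rows would be written in the same transaction as the WorkOrder change, which is what keeps state and events from drifting apart.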

5. Key invariants

  1. No status skip. Transitions must follow the matrix in §3. Direct open → resolved is rejected with MELMASTOON.MAINTENANCE.INVALID_STATUS_TRANSITION.
  2. Asset-scoped severity. Severity critical requires either assetId set or roomId set; "critical with no thing affected" is rejected.
  3. Room block is severity-gated. Only severity ∈ {high, critical} on a room or room-attached asset publishes work_order.room_blocked.v1. Lower severities never auto-OOO.
  4. One open WO per (asset, category). Creating a new WO for the same asset+category collapses to a comment append on the open one, unless allowDuplicate=true is passed. (Prevents 14 separate "AC not cooling" tickets for the same room.)
  5. Vendor channel must match record. If Vendor.channelPreference = call_only, attempting to send a WhatsApp template fails with MELMASTOON.MAINTENANCE.VENDOR_CHANNEL_MISMATCH.
  6. Cost lines tenant-bound. costLines.currency must match tenant.baseCurrency; cross-currency conversion is analytics-service's job, not ours.
  7. Preventive completion is idempotent. A schedule firing twice within the same window must not create two draft WOs (dedupe key = (scheduleId, dueAt-truncated-to-hour)).
  8. Verify is owner/GM-only. RBAC requires the caller to hold at least one of the roles gm or owner (roles ∩ {gm, owner} ≠ ∅) on POST /verify.
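
Invariants 2, 3, and 7 are simple enough to express as pure functions. The following is a sketch under stated assumptions rather than the service's code: only high and critical are named in this document, so the low and medium levels are assumed; `assetRoomId` is a hypothetical field for a room-attached asset; and the hour truncation is done in UTC, which may differ from the tenant-timezone windows for offsets that are not whole hours.

```typescript
// "low" and "medium" are assumed names for the lower severities.
type Severity = "low" | "medium" | "high" | "critical";

interface WorkOrderDraft {
  severity: Severity;
  roomId?: string;
  assetId?: string;
  assetRoomId?: string; // hypothetical: the room an asset is attached to
}

// Invariant 2: critical requires an affected asset or room.
function hasCriticalTarget(draft: WorkOrderDraft): boolean {
  if (draft.severity !== "critical") return true;
  return draft.assetId !== undefined || draft.roomId !== undefined;
}

// Invariant 3: only high/critical on a room or room-attached asset
// publishes work_order.room_blocked.v1.
function shouldPublishRoomBlocked(draft: WorkOrderDraft): boolean {
  const severe = draft.severity === "high" || draft.severity === "critical";
  const touchesRoom = draft.roomId !== undefined || draft.assetRoomId !== undefined;
  return severe && touchesRoom;
}

// Invariant 7: dedupe key = (scheduleId, dueAt truncated to the hour).
// Truncation here is in UTC, which is an assumption of this sketch.
function preventiveDedupeKey(scheduleId: string, dueAt: Date): string {
  const hour = new Date(dueAt.getTime());
  hour.setUTCMinutes(0, 0, 0);
  return `${scheduleId}:${hour.toISOString()}`;
}
```

Two firings of the same schedule at 10:15 and 10:59 UTC produce the same key, so a unique constraint on that key would be one way to reject the second draft WO.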

6. Hot read paths (cache strategy)

| Read | Where | TTL / Invalidation |
| --- | --- | --- |
| Open WO list per property (backoffice landing) | Memorystore Redis (mnt:open:<propertyId>) | TTL 30 s + write-through on every status transition |
| Asset health snapshot per property | Memorystore Redis (mnt:assets:<propertyId>) | TTL 5 min + write-through on asset.health_changed.v1 |
| WO detail | Postgres direct | |
| Vendor lookup (assignment dropdown) | Memorystore Redis (mnt:vendors:<tenantId>) | TTL 5 min + write-through on vendor CRUD |
| SLA breach candidates (worker) | Postgres index ix_work_orders_sla_due | scanned every 60 s |
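
The TTL-plus-write-through strategy for the cached reads can be sketched as follows. The key formats and TTLs are the ones listed above; `TtlCache` is an in-memory stand-in for Redis so the example runs without a client library, and `refreshOpenWorkOrders` is a hypothetical name for the hook that would run after a status transition commits.

```typescript
// Key builders mirror the formats in the cache table.
const cacheKeys = {
  openWorkOrders: (propertyId: string): string => `mnt:open:${propertyId}`,
  assets: (propertyId: string): string => `mnt:assets:${propertyId}`,
  vendors: (tenantId: string): string => `mnt:vendors:${tenantId}`,
};

const TTL_SECONDS = { openWorkOrders: 30, assets: 5 * 60, vendors: 5 * 60 };

interface CacheEntry<T> {
  value: T;
  expiresAtMs: number;
}

// Minimal TTL cache; the clock is passed in so behaviour is deterministic.
class TtlCache<T> {
  private readonly entries: Record<string, CacheEntry<T>> = Object.create(null);

  set(key: string, value: T, ttlSeconds: number, nowMs: number): void {
    this.entries[key] = { value, expiresAtMs: nowMs + ttlSeconds * 1000 };
  }

  get(key: string, nowMs: number): T | undefined {
    const entry = this.entries[key];
    if (entry === undefined) return undefined;
    if (nowMs >= entry.expiresAtMs) {
      delete this.entries[key]; // expired: behaves like a TTL eviction
      return undefined;
    }
    return entry.value;
  }
}

// Write-through: overwrite the cached list right after a status transition
// instead of waiting for the 30 s TTL to lapse.
function refreshOpenWorkOrders(
  cache: TtlCache<string[]>,
  propertyId: string,
  openWorkOrderIds: string[],
  nowMs: number,
): void {
  const key = cacheKeys.openWorkOrders(propertyId);
  cache.set(key, openWorkOrderIds, TTL_SECONDS.openWorkOrders, nowMs);
}
```

The TTL acts as a safety net here: if a write-through is missed, the landing page is stale for at most 30 seconds before the next read falls back to Postgres.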

7. Cost & scale envelope (Phase 0 → Phase 1)

| Dimension | Phase 0 (50 properties, 100k room-nights/yr) | Phase 1 (500 properties, 1M room-nights/yr) |
| --- | --- | --- |
| WOs created/day (all tenants) | ~150 | ~1,500 |
| Open WOs at any moment | < 1k | < 10k |
| Preventive schedules total | < 5k | < 50k |
| Assets total | < 20k | < 200k |
| Pub/Sub publishes/day | ~600 | ~6,000 |
| Storage (12 months) | ~1.5 GB | ~15 GB |
| Cloud Run instances (api) | min 2, max 4 | min 4, max 12 |
| Cloud Run worker (preventive + SLA) | min 1, max 2 | min 1, max 3 |

8. Decision log (entry pointers — full ADRs in docs/architecture/)

  • WorkOrder is the aggregate; MaintenanceTicket is its alias in NAMING.md. We kept the existing mnt_ ULID prefix to avoid breaking the registry; WorkOrder is the implementation/domain name.
  • Room OOO is published, never directly written. Property-service is the only writer for Room.status. We send a request event; they decide. Avoids the classic "two services own one column" anti-pattern.
  • Parts inventory stays light. Heavy SKU/distributor logic is deferred to a future inventory-service; we only need on-hand decrement and reorder threshold for the operator.
  • Vendor email is optional. Many service providers in the markets we operate in are phone-only, so the WhatsApp/SMS bridge via notification-service is first-class.
  • AI-suggested severity, never AI-decided OOO. ai-orchestrator-service may suggest severity = critical based on description embeddings, but the OOO publish is a deterministic function of severity after a human accepts (or after the auto-policy threshold).
  • Schedule cron is its own Cloud Run service. Same image as the API (NestJS modules), separate entrypoint; isolates noisy-neighbour risk and allows independent scaling.

9. Out of scope (explicit non-goals)

  • Workforce scheduling / rostering — staff-service owns the roster; we only consume staff.shift.started.v1.
  • Predictive failure modeling beyond Asset.healthIndex — heavy ML lives in analytics-service / ai-orchestrator-service.
  • Procurement workflows for parts (PO approval, receiving) — out of scope; only consumption tracking.
  • Long-term capex planning — out of scope; we hand cost data to analytics-service.

10. Where to next

  • DOMAIN_MODEL.md — pure TS interfaces, value objects, full state matrix, invariants with error codes.
  • APPLICATION_LOGIC.md — use cases, ports, sagas (preventive cron, SLA scanner, vendor reminder).
  • API_CONTRACTS.md — full REST surface with examples.
  • EVENT_SCHEMAS.md — every event with TS interface and example payload.
  • DATA_MODEL.md — Postgres DDL with RLS.
  • SYNC_CONTRACT.md — desktop replication scope and conflict policy.
  • AI_INTEGRATION.md — capabilities (severity suggestion, category classification, root-cause hints, asset-health forecast).
  • SECURITY_MODEL.md, OBSERVABILITY.md, TESTING_STRATEGY.md, DEPLOYMENT_TOPOLOGY.md, FAILURE_MODES.md, LOCAL_DEV_SETUP.md, SERVICE_READINESS.md, SERVICE_RISK_REGISTER.md, MIGRATION_PLAN.md.