maintenance-service · SERVICE_OVERVIEW
Bounded Context: Maintenance (Supporting) · Aggregate root: WorkOrder · Owner: PMS Operations · Storage: Cloud SQL Postgres + outbox · Runtime: Node 20 + NestJS · Catalog: docs/03-microservices/maintenance-service.md
maintenance-service is the canonical home for everything in a Ghasi Melmastoon property that breaks, wears out, or needs to be kept alive on a schedule. It is the only service that mints WorkOrder aggregates, advances their lifecycle, attaches assignees (internal staff or external vendors), reconciles parts and labour cost, runs the preventive-schedule cron, and owns the Asset registry per property.
It is intentionally a supporting context — it does not own room state, folio postings, or the lock vendor relationship. Instead, it publishes events that the right owning context reacts to. Its centre of gravity is the WorkOrder lifecycle and the surrounding choreography with housekeeping-service, property-service, lock-integration-service, and (occasionally) reservation-service.
1. Bounded context
| Concern | In maintenance-service | Lives elsewhere |
|---|---|---|
| Lifecycle of a maintenance issue | ✅ WorkOrder aggregate | — |
| OOO/OOS state of a Room | ❌ (we only request) | property-service (Room.status) |
| Folio posting for a vendor invoice | ❌ (we only record) | billing-service |
| Lock device pairing / vendor secret | ❌ (we only consume health) | lock-integration-service |
| Housekeeping turn-down task | ❌ (we re-queue via event) | housekeeping-service |
| Reservation re-accommodation flow | ❌ (we only emit relocation_required.v1) | reservation-service |
| Preventive cadence per asset | ✅ PreventiveSchedule | — |
| Asset registry per property | ✅ Asset (HVAC, generator, water tank, lock device, linen, furniture) | — |
| Light parts inventory | ✅ Part + PartUsage | Heavy multi-warehouse stock would belong to a future inventory-service |
| Vendor records (no-email allowed) | ✅ Vendor | — |
| SLA timers + auto-escalation | ✅ | Tenant-level escalation tree comes from tenant-service settings |
| Cost roll-up onto WO | ✅ (labour + parts + vendor invoice) | analytics-service rolls cost-per-room-night |
2. Aggregates
| Aggregate | Identity | Responsibilities | Notes |
|---|---|---|---|
| WorkOrder (root) | mnt_<ULID> | State machine, severity, category, asset link, assignee, cost rollup, SLA timer, OCC version | Same conceptual entity as MaintenanceTicket in NAMING.md; mnt_ prefix preserved |
| MaintenanceTask | mtk_<ULID> | Sub-step inside a WO ("drain → refill → test"); ordered, can be checked off independently | Optional; only used when staff opt for a multi-step plan |
| PreventiveSchedule | psch_<ULID> | Recurring rule keyed to an Asset or asset class; emits preventive.due.v1 | Cron windows in tenant timezone |
| Asset | ast_<ULID> | HVAC unit, generator, water tank, lock device, linen lot, furniture lot, IT device | healthIndex 0–100 maintained by AI signals |
| PartUsage | pus_<ULID> | A part consumed against a WO (qty, unit cost) | Decrements Part.onHand |
| Vendor | vnd_<ULID> | External contractor or supplier | email optional; channelPreference ∈ whatsapp \| sms \| email \| call_only |
| MaintenanceCategory | mcat_<ULID> | Tenant-overridable taxonomy on top of 9 canonical defaults | Defaults seeded per tenant on provisioning |
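
The Identity column implies a fixed prefix per aggregate followed by a ULID. The following is a minimal sketch of how such identifiers could be parsed and validated. The prefixes come from the table; the names `ID_PREFIX`, `AggregateKind`, and `parseId` are illustrative assumptions rather than the contents of DOMAIN_MODEL.md, and the regex assumes the canonical 26-character Crockford base32 ULID encoding.

```typescript
// Prefixes are taken from the Aggregates table; all helper names are hypothetical.
const ID_PREFIX = {
  WorkOrder: "mnt",
  MaintenanceTask: "mtk",
  PreventiveSchedule: "psch",
  Asset: "ast",
  PartUsage: "pus",
  Vendor: "vnd",
  MaintenanceCategory: "mcat",
} as const;

type AggregateKind = keyof typeof ID_PREFIX;

// Canonical ULID text form: 26 Crockford base32 characters (no I, L, O, U).
const ULID_RE = /^[0-9A-HJKMNP-TV-Z]{26}$/;

interface ParsedId {
  kind: AggregateKind;
  ulid: string;
}

// Returns the aggregate kind and ULID part, or null when the ID is malformed
// or carries a prefix that is not in the table.
function parseId(id: string): ParsedId | null {
  const sep = id.indexOf("_");
  if (sep < 0) return null;
  const prefix = id.slice(0, sep);
  const ulid = id.slice(sep + 1);
  if (!ULID_RE.test(ulid)) return null;
  const kind = (Object.keys(ID_PREFIX) as AggregateKind[]).find(
    (k) => ID_PREFIX[k] === prefix,
  );
  return kind ? { kind, ulid } : null;
}
```

Rejecting unknown prefixes at the edge keeps a `vnd_` identifier from being passed where a `mnt_` one is expected, which is one plausible reason for carrying the prefix at all.
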
3. State machine — WorkOrder.status

              (cancel from any non-terminal)
           ┌──────────────────────────────────┐
           │                                  │
           │                                  ▼
    ┌──────┴───┐   assign     ┌──────────┐  start   ┌──────────────┐
    │   open   ├─────────────▶│ assigned ├─────────▶│ in_progress  │
    └──────────┘              └──────────┘          └──────┬───────┘
                                                           │
                                     block (reason, eta?)  │
                                                           ▼
                                                      ┌──────────┐
                                          resume      │ blocked  │
                                   ┌──────────────────┤          │
                                   │                  └──────────┘
                                   ▼
                            ┌──────────────┐  resolve   ┌──────────┐  verify   ┌──────────┐
                            │ in_progress  ├───────────▶│ resolved ├──────────▶│ verified │
                            └──────────────┘            └──────────┘           └──────────┘
                                                                                    │
                                                                       (terminal)   ▼

Allowed transitions (full table in DOMAIN_MODEL.md §state machine):
| From | To | Command |
|---|---|---|
| open | assigned | assign(assignee) |
| open | cancelled | cancel(reason) |
| assigned | open | unassign() |
| assigned | in_progress | start() |
| assigned | cancelled | cancel(reason) |
| in_progress | blocked | block(reason, eta?) |
| in_progress | resolved | resolve(resolution, costLines, parts[]) |
| in_progress | cancelled | cancel(reason) |
| blocked | in_progress | resume(note) |
| blocked | cancelled | cancel(reason) |
| resolved | verified | verify(verifierId, note?) |
| resolved | in_progress | reopen(reason) |
| verified | — | terminal |
| cancelled | — | terminal |
Re-opens are tracked as WorkOrder.reopenCount and emit work_order.reopened.v1 (Phase 1.1; not in v1 event registry yet — kept as a forward-compat field).
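
The allowed-transitions table maps naturally onto a lookup keyed by current status and command. The sketch below is one possible encoding, not the implementation in DOMAIN_MODEL.md. The type and function names (`Status`, `Command`, `TRANSITIONS`, `nextStatus`) are assumptions; the error code is the one quoted in §5. The table lists no `cancel` row for `resolved`, so this sketch rejects that command even though the diagram caption says cancel applies from any non-terminal state.

```typescript
type Status =
  | "open" | "assigned" | "in_progress" | "blocked"
  | "resolved" | "verified" | "cancelled";

type Command =
  | "assign" | "unassign" | "start" | "block" | "resume"
  | "resolve" | "verify" | "reopen" | "cancel";

// One entry per row of the allowed-transitions table; terminal states map to {}.
const TRANSITIONS: Record<Status, Partial<Record<Command, Status>>> = {
  open: { assign: "assigned", cancel: "cancelled" },
  assigned: { unassign: "open", start: "in_progress", cancel: "cancelled" },
  in_progress: { block: "blocked", resolve: "resolved", cancel: "cancelled" },
  blocked: { resume: "in_progress", cancel: "cancelled" },
  resolved: { verify: "verified", reopen: "in_progress" },
  verified: {},
  cancelled: {},
};

// Hypothetical error class carrying the error code named in the invariants.
class InvalidStatusTransitionError extends Error {
  readonly code = "MELMASTOON.MAINTENANCE.INVALID_STATUS_TRANSITION";
  constructor(from: Status, command: Command) {
    super(`Command "${command}" is not allowed from status "${from}"`);
  }
}

// Returns the target status, or throws when the table has no matching row.
function nextStatus(from: Status, command: Command): Status {
  const to = TRANSITIONS[from][command];
  if (to === undefined) throw new InvalidStatusTransitionError(from, command);
  return to;
}
```

Keeping the matrix as data rather than as branching logic makes it straightforward to diff against the table whenever a transition is added.
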
4. Choreography (ASCII)

    ┌────────────────────┐                                                               ┌──────────────────┐
    │    housekeeping    │ housekeeping.room.maintenance_required.v1                     │ lock-integration │
    └─────────┬──────────┘ ─────────────────────────────────────────┐ ┌──────────────────▶  device.health   │
              │                                                     ▼ │                  │     alert.v1     │
              │                                             ┌─────────┴────┐             └──────────────────┘
              │   guest complaint                           │              │
              │   (Phase 2)                                 │ maintenance  │
              ├───────────────────────────────────────────▶ │              │
              │                                             │  -service    │
              │                                             │              │
       ┌──────▼──────────┐                                  └─────────┬────┘
       │ bff-back        │  POST /work-orders                         │
       │ office          │ ─────────────────────────────────────────▶ │
       └─────────────────┘                                            │
                                                                      │
                work_order.created.v1 ────────────────────────────────┤
                work_order.assigned.v1 ───────────────────────────────┤
                work_order.room_blocked.v1 ─────────────▶ property-service (Room → OOO)
                work_order.relocation_required.v1 ──────▶ reservation-service (room_change saga)
                work_order.resolved.v1 ─────────────────▶ analytics-service, housekeeping-service (re-queue)
                vendor.invoice_recorded.v1 ─────────────▶ billing-service (post to ledger)
                part_usage.recorded.v1 ─────────────────▶ analytics-service, search-aggregation
                preventive.due.v1 ◀──── cron worker ──── PreventiveSchedule

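
The outbound half of the choreography can also be read as a routing table from event name to the consumers named in the diagram. The sketch below records it as data, for example as a basis for contract tests. The constant and function names are hypothetical, and an empty list only means the diagram names no consumer for that event.

```typescript
// Outbound routing as drawn in the choreography diagram.
// EVENT_CONSUMERS and eventsConsumedBy are illustrative names, not service code.
const EVENT_CONSUMERS: Record<string, readonly string[]> = {
  "work_order.created.v1": [],
  "work_order.assigned.v1": [],
  "work_order.room_blocked.v1": ["property-service"],
  "work_order.relocation_required.v1": ["reservation-service"],
  "work_order.resolved.v1": ["analytics-service", "housekeeping-service"],
  "vendor.invoice_recorded.v1": ["billing-service"],
  "part_usage.recorded.v1": ["analytics-service", "search-aggregation"],
};

// Lists every event a given downstream service is expected to react to,
// in the order the events appear in the diagram.
function eventsConsumedBy(service: string): string[] {
  return Object.entries(EVENT_CONSUMERS)
    .filter(([, consumers]) => consumers.includes(service))
    .map(([event]) => event);
}
```

For instance, `eventsConsumedBy("analytics-service")` yields the two events that feed cost analytics: `work_order.resolved.v1` and `part_usage.recorded.v1`.
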
5. Key invariants
- No status skip. Transitions must follow the matrix in §3. Direct `open → resolved` is rejected with `MELMASTOON.MAINTENANCE.INVALID_STATUS_TRANSITION`.
- Asset-scoped severity. Severity `critical` requires either `assetId` set or `roomId` set; "critical with no thing affected" is rejected.
- Room block is severity-gated. Only `severity ∈ {high, critical}` on a room or room-attached asset publishes `work_order.room_blocked.v1`. Lower severities never auto-OOO.
- One open WO per (asset, category). Creating a new WO for the same asset+category collapses to a comment append on the open one, unless `allowDuplicate=true` is passed. (Prevents 14 separate "AC not cooling" tickets for the same room.)
- Vendor channel must match record. If `Vendor.channelPreference = call_only`, attempting to send a WhatsApp template fails with `MELMASTOON.MAINTENANCE.VENDOR_CHANNEL_MISMATCH`.
- Cost lines tenant-bound. `costLines.currency` must match `tenant.baseCurrency`; cross-currency conversion is `analytics-service`'s job, not ours.
- Preventive completion is idempotent. A schedule firing twice within the same window must not create two draft WOs (dedupe key = `(scheduleId, dueAt-truncated-to-hour)`).
- Verify is owner/GM-only. RBAC enforces `roles ⊇ {gm, owner}` on `POST /verify`.
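
Three of the invariants above are pure functions of the work order or schedule, and can be sketched without any infrastructure. The sketch makes several assumptions: only `high` and `critical` appear in this document, so `low` and `medium` are guessed severity levels; `assetAttachedToRoom` is a hypothetical flag; and the dedupe key truncates in UTC. A tenant on a half-hour UTC offset would need truncation in the tenant timezone instead.

```typescript
// Assumption: "low" and "medium" are illustrative; the source names only high and critical.
type Severity = "low" | "medium" | "high" | "critical";

interface WorkOrderDraft {
  severity: Severity;
  assetId?: string;
  roomId?: string;
  assetAttachedToRoom?: boolean; // hypothetical: true when the asset sits in a room
}

// Asset-scoped severity: critical needs an asset or a room to point at.
function isCriticalScopeValid(wo: WorkOrderDraft): boolean {
  if (wo.severity !== "critical") return true;
  return wo.assetId !== undefined || wo.roomId !== undefined;
}

// Room block is severity-gated: only high/critical on a room or room-attached
// asset should lead to work_order.room_blocked.v1 being published.
function shouldPublishRoomBlocked(wo: WorkOrderDraft): boolean {
  const severe = wo.severity === "high" || wo.severity === "critical";
  const touchesRoom =
    wo.roomId !== undefined ||
    (wo.assetId !== undefined && wo.assetAttachedToRoom === true);
  return severe && touchesRoom;
}

// Preventive idempotency: dedupe key = (scheduleId, dueAt truncated to the hour).
// Truncation is done in UTC here, which is an assumption of this sketch.
function preventiveDedupeKey(scheduleId: string, dueAt: Date): string {
  const hour = new Date(dueAt.getTime());
  hour.setUTCMinutes(0, 0, 0); // zero minutes, seconds and milliseconds
  return `${scheduleId}:${hour.toISOString()}`;
}
```

Two firings of the same schedule at 09:17 and 09:59 UTC produce the same key, so a unique constraint on that key (or an equivalent lookup) would stop the second draft WO from being created.
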
6. Hot read paths (cache strategy)
| Read | Where | TTL / Invalidation |
|---|---|---|
| Open WO list per property (backoffice landing) | Memorystore Redis (mnt:open:<propertyId>) | TTL 30 s + write-through on every status transition |
| Asset health snapshot per property | Memorystore Redis (mnt:assets:<propertyId>) | TTL 5 min + write-through on asset.health_changed.v1 |
| WO detail | Postgres direct | — |
| Vendor lookup (assignment dropdown) | Memorystore Redis (mnt:vendors:<tenantId>) | TTL 5 min + write-through on vendor CRUD |
| SLA breach candidates (worker) | Postgres index ix_work_orders_sla_due | scanned every 60 s |
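
The TTL-plus-write-through rows can be illustrated with a small cache port. The sketch below is an assumption-laden example, not the service's cache layer. The key patterns and TTLs come from the table, while `CacheStore`, `InMemoryCache`, and `writeThroughOpenList` are hypothetical names, and an in-memory map stands in for Memorystore Redis so the example runs without a server.

```typescript
// Key patterns and TTLs come from the table above; every type and function
// name in this block is hypothetical.
interface CacheStore {
  get(key: string): Promise<string | null>;
  set(key: string, value: string, ttlSeconds: number): Promise<void>;
}

const cacheKeys = {
  openWorkOrders: (propertyId: string): string => `mnt:open:${propertyId}`,
  assetHealth: (propertyId: string): string => `mnt:assets:${propertyId}`,
  vendors: (tenantId: string): string => `mnt:vendors:${tenantId}`,
};

const TTL_SECONDS = { openWorkOrders: 30, assetHealth: 300, vendors: 300 };

// In-memory stand-in for Redis with an injectable clock for testing.
class InMemoryCache implements CacheStore {
  private readonly entries = new Map<string, { value: string; expiresAt: number }>();
  private readonly now: () => number;

  constructor(now: () => number = () => Date.now()) {
    this.now = now;
  }

  async get(key: string): Promise<string | null> {
    const hit = this.entries.get(key);
    if (hit === undefined) return null;
    if (hit.expiresAt <= this.now()) {
      this.entries.delete(key); // TTL elapsed: behave like an expired key
      return null;
    }
    return hit.value;
  }

  async set(key: string, value: string, ttlSeconds: number): Promise<void> {
    this.entries.set(key, { value, expiresAt: this.now() + ttlSeconds * 1000 });
  }
}

// Write-through: meant to run after a status transition commits, so readers see
// the new list at once; the 30 s TTL only bounds staleness if a write is missed.
async function writeThroughOpenList(
  cache: CacheStore,
  propertyId: string,
  openWorkOrderIds: string[],
): Promise<void> {
  await cache.set(
    cacheKeys.openWorkOrders(propertyId),
    JSON.stringify(openWorkOrderIds),
    TTL_SECONDS.openWorkOrders,
  );
}
```

Pairing a short TTL with write-through means the cache is normally fresh after every transition, and a dropped write self-heals within the TTL rather than requiring manual invalidation.
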
7. Cost & scale envelope (Phase 0 → Phase 1)
| Dimension | Phase 0 (50 properties, 100k room-nights/yr) | Phase 1 (500 properties, 1M room-nights/yr) |
|---|---|---|
| WOs created/day (all tenants) | ~150 | ~1,500 |
| Open WOs at any moment | < 1k | < 10k |
| Preventive schedules total | < 5k | < 50k |
| Assets total | < 20k | < 200k |
| Pub/Sub publishes/day | ~600 | ~6,000 |
| Storage (12 months) | ~1.5 GB | ~15 GB |
| Cloud Run instances (api) | min 2, max 4 | min 4, max 12 |
| Cloud Run worker (preventive + SLA) | min 1, max 2 | min 1, max 3 |
8. Decision log (entry pointers — full ADRs in docs/architecture/)
- WorkOrder is the aggregate; `MaintenanceTicket` is its alias in `NAMING.md`. We kept the existing `mnt_` ULID prefix to avoid breaking the registry; `WorkOrder` is the implementation/domain name.
- Room OOO is published, never directly written. Property-service is the only writer for `Room.status`. We send a request event; they decide. Avoids the classic "two services own one column" anti-pattern.
- Parts inventory stays light. Heavy SKU/distributor logic is deferred to a future `inventory-service`; we only need on-hand decrement and reorder threshold for the operator.
- Vendor email is optional. Many service providers in operating markets are phone-only. WhatsApp/SMS bridge via `notification-service` is first-class.
- AI-suggested severity, never AI-decided OOO. `ai-orchestrator-service` may suggest `severity = critical` based on description embeddings, but the OOO publish is a deterministic function of severity after a human accepts (or after the auto-policy threshold).
- Schedule cron is its own Cloud Run service. Same image as the API (NestJS modules), separate entrypoint; isolates noisy-neighbour risk and allows independent scaling.
9. Out of scope (explicit non-goals)
- Workforce scheduling / rostering: `staff-service` owns the roster; we only consume `staff.shift.started.v1`.
- Predictive failure modeling beyond `Asset.healthIndex`: heavy ML lives in `analytics-service` / `ai-orchestrator-service`.
- Procurement workflows for parts (PO approval, receiving): out of scope; only consumption tracking.
- Long-term capex planning: out of scope; we hand cost data to `analytics-service`.
10. Where to next
- `DOMAIN_MODEL.md`: pure TS interfaces, value objects, full state matrix, invariants with error codes.
- `APPLICATION_LOGIC.md`: use cases, ports, sagas (preventive cron, SLA scanner, vendor reminder).
- `API_CONTRACTS.md`: full REST surface with examples.
- `EVENT_SCHEMAS.md`: every event with TS interface and example payload.
- `DATA_MODEL.md`: Postgres DDL with RLS.
- `SYNC_CONTRACT.md`: desktop replication scope and conflict policy.
- `AI_INTEGRATION.md`: capabilities (severity suggestion, category classification, root-cause hints, asset-health forecast).
- `SECURITY_MODEL.md`, `OBSERVABILITY.md`, `TESTING_STRATEGY.md`, `DEPLOYMENT_TOPOLOGY.md`, `FAILURE_MODES.md`, `LOCAL_DEV_SETUP.md`, `SERVICE_READINESS.md`, `SERVICE_RISK_REGISTER.md`, `MIGRATION_PLAN.md`.