
maintenance-service · SERVICE_RISK_REGISTER

Risks are rated for Likelihood (L) and Impact (I), each on a 1–5 scale; Severity = L × I (1–25). The register is re-reviewed quarterly. New risks must be added here before they are addressed in code.
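As a worked illustration of the scoring rule, the sketch below computes Severity = L × I for two catalog entries. The type and function names (Score, RiskRating, severity) are hypothetical and exist only for this example; the formula, the 1–5 scale, and the R2 and R23 ratings come from the register.

```typescript
// Illustrative sketch: Score, RiskRating and severity() are hypothetical names.
// Only the formula (Severity = L × I) and the 1–5 scale come from the register.
type Score = 1 | 2 | 3 | 4 | 5;

interface RiskRating {
  id: string;
  likelihood: Score; // L
  impact: Score;     // I
}

// Severity is the plain product, so it ranges from 1 (1 × 1) to 25 (5 × 5).
function severity(rating: RiskRating): number {
  return rating.likelihood * rating.impact;
}

// Pre-mitigation ratings taken from the catalog rows for R2 and R23.
const r2: RiskRating = { id: "R2", likelihood: 4, impact: 4 };
const r23: RiskRating = { id: "R23", likelihood: 1, impact: 5 };

console.log(r2.id, severity(r2));   // R2 16
console.log(r23.id, severity(r23)); // R23 5
```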

1. Risk catalog

| # | Risk | L | I | Pre-mit Severity | Mitigation | Residual Severity | Owner |
|---|---|---|---|---|---|---|---|
| R1 | Auto-OOO loops if property-service rejection event itself triggers re-creation of WO that re-emits block | 3 | 4 | 12 | Idempotent inbox dedupe; block_rejected handler clears causedRoomBlock and does NOT re-emit; integration/room_block_rejected.spec.ts | 4 | Tech lead |
| R2 | Vendor no-show with active reservation leaves room OOO indefinitely | 4 | 4 | 16 | Auto-escalation to GM after N reminders; manual override path; UI surfaces "guest waiting" badge | 6 | Product |
| R3 | Preventive scheduler stops firing silently due to worker crash loop | 3 | 5 | 15 | Liveness/readiness probes; mnt.preventive.due_pending_count alert > 50 for 5 min; second worker replica | 4 | SRE |
| R4 | Cross-tenant data leak through forgotten tenant_id filter | 2 | 5 | 10 | RLS as last-line defence; tenant-isolation.spec.ts mandatory; lint rule no-unscoped-repository-call; PR review checklist | 2 | Security |
| R5 | Cost-currency mismatch at scale (multi-currency tenants) leading to wrong cost rollups | 3 | 4 | 12 | Domain invariant #5 rejects mismatch; tenant base currency resolved per request and asserted | 3 | Domain owner |
| R6 | Outbox lag spike under burst load (e.g., mass preventive fan-out) blocks downstream services | 3 | 4 | 12 | Outbox relay scales; mnt.outbox.lag_seconds alert at 60 s; preventive scheduler batch-size cap | 6 | SRE |
| R7 | Lock-device alert storm floods us with WOs (e.g., entire BLE mesh dropping) | 3 | 4 | 12 | Per-device dedupe (deviceId, alertCode, dayBucket); per-property cap on auto-create per hour with overflow → single "mesh degraded" WO | 6 | Tech lead |
| R8 | AI model regression in severity-suggestion causes systematic over-OOO | 2 | 5 | 10 | Always HITL; provenance + replay; per-tenant budget; per-capability circuit breaker | 4 | AI/ML |
| R9 | Race between auto-create and manual create for the same housekeeping flag produces duplicate open WOs | 4 | 3 | 12 | Invariant #4 (one open per asset+category); UNIQUE partial index ux_work_orders_one_open_per_asset_category | 4 | Domain owner |
| R10 | Date-time bugs in cadence next-due across DST transitions in Asia/Kabul (UTC+04:30, no DST) and tenant tzs that do have DST | 2 | 4 | 8 | Use luxon with explicit timezone; unit tests for spring-forward/fall-back; integration test in Europe/Berlin tenant | 2 | Domain owner |
| R11 | PII leak via event payloads (description containing guest name) | 3 | 5 | 15 | PII redaction filter on description before emit; periodic sample audit; AI-orchestrator-side redactor | 6 | Security |
| R12 | Migration data corruption during legacy import | 2 | 5 | 10 | Dry-run mandatory; staging shadow import; idempotency by legacy_external_id; rollback by tenant | 4 | DBA |
| R13 | Sync push from compromised desktop issues commands as another user | 2 | 5 | 10 | Device JWT bound to user; user JWT short-lived; commands carry actor.userId validated against device's bound user; audit trail | 4 | Security |
| R14 | Generator run-hours regression wipes preventive cadence (next-due jumps far into future) | 3 | 4 | 12 | Invariant #9 + Asset max-of policy on runHours; explicit reset flow with audit | 4 | Domain owner |
| R15 | Vendor invoice posted twice (once by us, once by manual entry in billing) | 3 | 3 | 9 | vendor_invoice_posted_to_folio flag from billing.vendor_invoice.posted.v1; UI shows posted status; audit reconciliation | 6 | Finance |
| R16 | Notification fan-out fails for entire tenant (notification-service outage) | 2 | 4 | 8 | Notification dispatch is best-effort; events still emitted; UI shows "notification not delivered" badges from notification-service DLQ events | 4 | SRE |
| R17 | PartUsage atomic decrement race at desktop sync time produces negative on-hand | 3 | 4 | 12 | PartRepository.decrementOnHand is row-locked + CHECK on_hand >= 0; sync push surfaces failure to client | 4 | DBA |
| R18 | Vendor-message-draft AI generates inappropriate content in local language | 2 | 4 | 8 | Always HITL; staff edits before sending; tenant tone preset capped to whitelisted styles; audit of last 100 sends per tenant per month | 4 | AI/ML, Compliance |
| R19 | WhatsApp/SMS bridge cost spike from auto-reminders (e.g., vendor list full of bad numbers) | 3 | 3 | 9 | Per-vendor reminder cap (3); per-tenant daily reminder budget; alert on cost spike | 4 | Finance |
| R20 | Schedule deduplication false-negative (two schedules for same asset class fire simultaneously, both materialise drafts) | 3 | 3 | 9 | Invariant #4 catches at WO create; explicit "shadow schedule" detection in admin UI | 4 | Domain owner |
| R21 | Asset healthIndex drift from AI forecaster wrong way for too long | 3 | 3 | 9 | Bound on per-tick delta (auto-suspend if drop > 30); manual reset endpoint; daily report comparing against ground truth WO count | 4 | AI/ML |
| R22 | Reservation projection staleness causes us to miss relocation requirement | 3 | 4 | 12 | Projection refreshed on reservation.checked_in/modified/cancelled events; on miss, escalate via daily reconciliation job | 6 | Tech lead |
| R23 | Migration of mnt_* ULID prefix if we rename WorkOrder → MaintenanceTicket | 1 | 5 | 5 | Decision frozen; mnt_ is the canonical prefix forever; MaintenanceTicket is an alias in NAMING.md | 2 | Tech lead |
| R24 | Cloud SQL regional outage longer than RTO | 1 | 5 | 5 | DR drilled quarterly; cross-region replica; runbook tested | 3 | SRE |
| R25 | Bundle drift between docs and implementation (events declared but not emitted, or vice versa) | 4 | 3 | 12 | CI lint compares NestJS event emissions vs EVENT_SCHEMAS.md registry; quarterly audit using epic-spec-implementation-audit skill | 4 | Tech lead |
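The R7 mitigation (per-device dedupe on deviceId, alertCode, dayBucket, plus a per-property hourly cap that overflows into a single "mesh degraded" WO) could be sketched roughly as follows. This is a minimal in-memory illustration, not the service's implementation: LockAlert, AlertGate, Decision, maxPerHour, and the UTC-based bucketing are all assumptions, and a real version would presumably keep this state in a shared store rather than in process memory.

```typescript
// Sketch only: LockAlert, AlertGate, Decision and maxPerHour are hypothetical names.
interface LockAlert {
  propertyId: string;
  deviceId: string;
  alertCode: string;
  at: Date;
}

type Decision = "create" | "duplicate" | "mesh-degraded" | "suppressed";

// Assumption: buckets are derived from the UTC timestamp, e.g. "2024-03-10".
function dayBucket(at: Date): string {
  return at.toISOString().slice(0, 10);
}

// Hour bucket for the per-property cap, e.g. "2024-03-10T14".
function hourBucket(at: Date): string {
  return at.toISOString().slice(0, 13);
}

class AlertGate {
  private readonly maxPerHour: number;
  private readonly seen = new Set<string>();            // dedupe keys already handled
  private readonly perHour = new Map<string, number>(); // auto-created WOs per property+hour
  private readonly overflowRaised = new Set<string>();  // property+hour with a mesh-degraded WO

  constructor(maxPerHour: number) {
    this.maxPerHour = maxPerHour;
  }

  decide(alert: LockAlert): Decision {
    // 1. Per-device dedupe on (deviceId, alertCode, dayBucket).
    const dedupeKey = [alert.deviceId, alert.alertCode, dayBucket(alert.at)].join("|");
    if (this.seen.has(dedupeKey)) return "duplicate";
    this.seen.add(dedupeKey);

    // 2. Per-property cap on auto-created WOs per hour.
    const capKey = `${alert.propertyId}|${hourBucket(alert.at)}`;
    const count = this.perHour.get(capKey) ?? 0;
    if (count < this.maxPerHour) {
      this.perHour.set(capKey, count + 1);
      return "create";
    }

    // 3. Overflow: raise one "mesh degraded" WO, then suppress the rest of the hour.
    if (!this.overflowRaised.has(capKey)) {
      this.overflowRaised.add(capKey);
      return "mesh-degraded";
    }
    return "suppressed";
  }
}

// Usage: with a cap of 2, the third distinct device in the same hour overflows.
const gate = new AlertGate(2);
const at = new Date("2024-03-10T14:05:00Z");
const mk = (deviceId: string): LockAlert => ({ propertyId: "p1", deviceId, alertCode: "BLE_LOST", at });
const decisions = ["d1", "d1", "d2", "d3", "d4"].map((id) => gate.decide(mk(id)));
console.log(decisions.join(", "));
```

Checking the dedupe key before the cap means a repeated alert from the same device never consumes hourly budget; whether that ordering matches the actual handler is an assumption of this sketch.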

2. Heat-map (post-mitigation)

Impact → (columns 1–5); Likelihood ↓ (rows 1–5)
L = 1: R23, R24
L = 2: R10, R4
L = 3: R15, R19, R3, R8, R11
L = 4: R20, R5, R9, R14, R17, R18, R22, R6, R7, R16, R25, R2
L = 5: R12, R13, R21

3. Top 5 risks to monitor weekly

  1. R2 — Vendor no-show with active reservation — operational impact on guest experience.
  2. R3 — Preventive scheduler silent stop — safety-critical (generator service, water tank).
  3. R6 — Outbox lag spike — knock-on to entire platform.
  4. R7 — Lock-device alert storm — high blast radius on auto-creation.
  5. R22 — Reservation projection staleness — silent miss of guest impact.

4. Review cadence

  • Weekly in service standup: top 5 risks status; any near-misses logged.
  • Monthly with SRE + Security: full register review; new risks added; severity re-rated.
  • Quarterly with Product + Domain owner: business-impact recalibration.
  • After every incident: relevant risks re-rated; new mitigations linked.

5. Risk acceptance log

| Date | Risk # | Action | Approver |
(to be added per quarterly review)

6. Linked artefacts

  • Threat model: SECURITY_MODEL.md §11
  • Failure modes: FAILURE_MODES.md
  • SLO + alert catalog: OBSERVABILITY.md §3, §6
  • DR plan: DEPLOYMENT_TOPOLOGY.md §10
  • Migration safeguards: MIGRATION_PLAN.md