12 — Risks & Tradeoffs

Companion: 02 Enterprise Architecture · 07 Security, Compliance & Tenancy · 09 Lock & Key Integration · 10 Payments Architecture · ADR-0001 Core stack · ADR-0003 Electron offline-first desktop

This document is the platform-wide register of identified risks and deliberate tradeoffs for Ghasi Melmastoon. It is the single source of truth that every service-level SERVICE_RISK_REGISTER.md, every ADR, every release readiness gate, and every quarterly governance review reads against. Per-service risks may add detail; they may not contradict the entries here.

The shape of the document is deliberately operational rather than narrative. We assume the reader is an engineer, SRE, security reviewer, finance reviewer, or platform-product owner deciding whether to ship, hold, or escalate. Risks are scored, owned, and dated; tradeoffs are explicit, justified, and bound to mitigation work.


1. Purpose

Ghasi Melmastoon operates at the intersection of multi-tenant SaaS, event-driven microservices on GCP, AI-assisted operations, payment rails in low-trust markets, lock hardware in long-tail vendor configurations, and an offline-first Electron desktop application that runs daily hotel operations even when the internet does not. The combinatorics of these surfaces produce a non-trivial risk surface that cannot be discovered after launch.

The purpose of this register is to:

  1. Catalogue every known risk across technical, operational, security, regulatory, market, AI, vendor, and people-and-process categories — at least 60 entries — with consistent scoring, ownership, mitigation, and watchpoint triggers.
  2. Make every accepted tradeoff explicit so that future engineers, auditors, partners, and acquirers can find the rationale where it should be: in the spec, not in someone's head.
  3. Bind risks to mitigations and to release gates so the readiness checklists in docs/roadmap/ and the per-service SERVICE_READINESS.md files can deny a release that drifts from the agreed posture.
  4. Drive governance cadence — monthly platform review, per-release pre-flight, per-incident postmortem feedback into this register, and ADR creation when a tradeoff changes.

A risk that is not in this register but is operationally relevant is a defect in this document. Adding a risk costs nothing; missing one costs incidents.


2. Risk Scoring Model

We use a simple two-axis qualitative scoring model. Both axes are scored 1–5 and multiplied for a composite score in the range 1–25. We pick a qualitative model deliberately: quantitative risk modelling at our scale (early SMB pilots, regional rollout) over-fits to noise. The cadence of review compensates for the coarseness of the scale.

2.1 Likelihood (L)

| L | Label | Approximate frequency |
|---|-------|-----------------------|
| 1 | Rare | Plausible at most once across the platform's lifetime; requires an unlikely combination of failures |
| 2 | Unlikely | Possible once per year per tenant or per service; precedent in similar systems but not in ours |
| 3 | Possible | Expected to occur multiple times per year across the fleet |
| 4 | Likely | Expected to occur monthly across the fleet, or quarterly per tenant |
| 5 | Almost certain | Expected weekly across the fleet, or monthly per tenant |

2.2 Impact (I)

| I | Label | Operational meaning |
|---|-------|---------------------|
| 1 | Negligible | Cosmetic; no operator workaround needed; no SLO breach |
| 2 | Minor | Operator workaround exists; one-tenant impact; below SLO threshold |
| 3 | Moderate | Multi-tenant impact, single-region; partial SLO breach; recoverable within shift |
| 4 | Major | Significant data correctness, financial, security, or reputational impact; multi-day recovery |
| 5 | Severe | Existential: sustained data leakage, mass financial loss, regulator escalation, multi-week outage, brand-defining incident |

2.3 Score thresholds

| Score (L × I) | Posture | Required action |
|---------------|---------|-----------------|
| 1–6 | Monitor | Owner records baseline; review during quarterly cadence; no action required unless trigger fires |
| 7–14 | Mitigate | Mitigation must be in flight or scheduled within the next release; owner reports status monthly |
| 15–25 | Escalate | Mitigation must be in production before next release; reviewed by CTO and Security lead; ADR or runbook required |

A risk that crosses a threshold upward (e.g., Score moves from 12 to 16) auto-escalates. A risk that crosses downward (mitigation lands and verification holds for two consecutive review cycles) may be downgraded.
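
The scoring arithmetic and thresholds above can be sketched in a few lines. This is an illustrative sketch only: the function and type names (`riskScore`, `posture`, `crossesUpward`) are hypothetical and not part of any existing tooling described in this document.

```typescript
// Likelihood and impact are integers from 1 to 5 (sections 2.1 and 2.2).
type Level = 1 | 2 | 3 | 4 | 5;

// Posture bands from section 2.3.
type Posture = "Monitor" | "Mitigate" | "Escalate";

// Composite score: likelihood multiplied by impact, in the range 1 to 25.
function riskScore(likelihood: Level, impact: Level): number {
  return likelihood * impact;
}

// Maps a composite score onto its posture band.
function posture(score: number): Posture {
  if (score >= 15) return "Escalate";
  if (score >= 7) return "Mitigate";
  return "Monitor";
}

// True when a rescore moves the risk into a higher band (auto-escalation).
function crossesUpward(previousScore: number, newScore: number): boolean {
  const rank: Record<Posture, number> = { Monitor: 0, Mitigate: 1, Escalate: 2 };
  return rank[posture(newScore)] > rank[posture(previousScore)];
}
```

For example, `riskScore(3, 5)` gives 15, which `posture` places in the Escalate band, and `crossesUpward(12, 16)` is true, matching the 12 to 16 example in the paragraph above. A downgrade is deliberately not automated here, since it depends on verification holding for two review cycles.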

2.4 Status lifecycle

open → mitigated → accepted → closed
↘ deferred ↗
  • open — identified, mitigation not yet effective.
  • mitigated — mitigation effective; residual risk acknowledged; ongoing monitoring.
  • accepted — explicitly accepted by CTO + business owner; written justification required; revisited each quarter.
  • deferred — mitigation planned for a named future release; rationale recorded.
  • closed — risk no longer applies (e.g., capability removed, vendor replaced, regulation withdrawn).

2.5 Review cadence

  • Per-PR — service-level risks attached to changes touching their domain.
  • Per-release pre-flight — every risk with score ≥ 7 reviewed against release scope.
  • Monthly — platform-wide review of every open risk; status update per row.
  • Quarterly — full register review; tradeoffs revisited; any accepted item re-justified.
  • Per-incident — postmortems must declare which existing risks materialized and propose new ones.

3. Risk Register

The full register lives below. Each risk has a stable ID R-MEL-NNN. IDs are never reused; closed risks remain in the register for archaeology.

3.1 Technical risks — narrative for the top 3

R-MEL-001 — Multi-tenant data leakage from missing RLS guard

A new endpoint or new query that omits the tenant_id predicate, or a Postgres role that is granted broader access than intended, can expose tenant A's data to tenant B. This is the single largest existential risk for a multi-tenant SaaS. The blast radius is regulatory (GDPR-class breach notifications), commercial (chain operators terminate), and reputational (recovery measured in years). Postgres Row-Level Security (RLS) is our primary defense; the secondary defense is the application-layer RequestContext middleware that injects tenant_id into every repository call.

Mitigation stack:

  1. RLS policies on every table that holds tenant data (USING (tenant_id = current_setting('app.tenant_id')::uuid)), including read-only projections and analytics views. RLS enabled by ALTER TABLE … ENABLE ROW LEVEL SECURITY and FORCE ROW LEVEL SECURITY so even table owners cannot bypass.
  2. CI test pattern (docs/standards/RLS_TEST_PATTERN.md): every PR touching SQL or repositories runs a two-tenant fixture test that authenticates as tenant A and asserts a 0-row result for tenant B's records on every endpoint reachable from the BFF. The test harness fails the build on any cross-tenant leak.
  3. No raw SQL outside the repository layer. Drizzle (or pg in hot paths) is the only allowed access; raw SQL strings are linted out.
  4. Quarterly external pen-test with a tenant-isolation category.
  5. search-aggregation-service is the only service permitted to query across tenants, and only against a denormalized read model with explicit aggregation predicates. All other services treat cross-tenant queries as a defect class.

Watchpoint trigger: any cross-tenant test failure in CI; any pen-test finding in the tenant-isolation category; any production query log showing repository calls without tenant_id predicate.
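
As a rough illustration of mitigation 1, the sketch below generates the three statements a migration would need for one tenant-scoped table. The helper name `rlsStatements` and the policy naming convention are assumptions made for this example; the policy expression and the ENABLE/FORCE pair are taken from the text above.

```typescript
// Only plain lowercase identifiers are accepted, so the table name can be
// interpolated into DDL safely. Schema-qualified names are out of scope here.
const IDENTIFIER = /^[a-z_][a-z0-9_]*$/;

// Hypothetical helper: emits the RLS DDL for one table that holds tenant data.
function rlsStatements(table: string): string[] {
  if (!IDENTIFIER.test(table)) {
    throw new Error(`unsafe table identifier: ${table}`);
  }
  const policy = `${table}_tenant_isolation`; // assumed naming convention
  return [
    `ALTER TABLE ${table} ENABLE ROW LEVEL SECURITY;`,
    // FORCE makes the policy apply to the table owner as well.
    `ALTER TABLE ${table} FORCE ROW LEVEL SECURITY;`,
    `CREATE POLICY ${policy} ON ${table} ` +
      `USING (tenant_id = current_setting('app.tenant_id')::uuid);`,
  ];
}
```

Note that FORCE ROW LEVEL SECURITY covers table owners, but Postgres superusers and roles with the BYPASSRLS attribute still bypass policies, so the application role should hold neither.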

R-MEL-002 — Saga compensation gaps

The booking saga (Booking → Inventory hold → Payment authorize → Reservation confirm → Key issuance → Notification) is multi-step, multi-service, and crosses the offline boundary on the desktop. A compensation that is incorrect, missing, or non-idempotent leaves the platform in a state where inventory says a room is sold but billing has no folio, or payment is captured but the reservation is cancelled. The financial and trust damage is direct.

Mitigation stack:

  1. Each saga is an explicit state machine in reservation-service with named steps, named compensations, deterministic transitions, and a persisted journal. No implicit "if this fails, undo the previous step" logic.
  2. Compensations are idempotent and outbox-driven — every compensation is replayable; idempotency keys are derived from {saga_id, step}.
  3. Saga inspector UI in the control plane lets platform staff see in-flight sagas, trigger replay, or force-close with audit trail.
  4. Chaos tests inject failures at every step boundary in CI.
  5. Provisional-state UX on the desktop: any folio, key credential, or reservation in a transitional state is labelled with a sync-pending badge; operators cannot act on provisional state in irreversible ways.

Watchpoint trigger: any saga journal entry stuck for > 30 minutes without progress; any payment captured without a corresponding folio entry within 5 minutes; any reservation in pending longer than the configured policy window.
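
The first two mitigations (named steps with named compensations, and idempotency keys derived from {saga_id, step}) can be sketched as follows. This is a minimal in-memory sketch, not the reservation-service implementation: the types, the `runSaga` and `compensateOnce` names, and the in-memory journal are all assumptions for illustration, and a real journal would be persisted.

```typescript
interface SagaStep {
  name: string;
  run: () => void;        // throws on failure
  compensate: () => void; // must be safe to replay
}

interface JournalEntry {
  key: string; // idempotency key derived from {saga_id, step}
  kind: "completed" | "compensated";
}

function idempotencyKey(sagaId: string, step: string): string {
  return `${sagaId}:${step}`;
}

// Runs a compensation at most once per {saga_id, step}; replays are no-ops.
function compensateOnce(sagaId: string, step: SagaStep, journal: JournalEntry[]): boolean {
  const key = idempotencyKey(sagaId, step.name);
  if (journal.some((e) => e.key === key && e.kind === "compensated")) return false;
  step.compensate();
  journal.push({ key, kind: "compensated" });
  return true;
}

// Executes steps in order; on failure, compensates completed steps in reverse.
function runSaga(
  sagaId: string,
  steps: SagaStep[],
  journal: JournalEntry[],
): "confirmed" | "compensated" {
  const completed: SagaStep[] = [];
  for (const step of steps) {
    try {
      step.run();
      journal.push({ key: idempotencyKey(sagaId, step.name), kind: "completed" });
      completed.push(step);
    } catch {
      for (const done of [...completed].reverse()) {
        compensateOnce(sagaId, done, journal);
      }
      return "compensated";
    }
  }
  return "confirmed";
}
```

Because every compensation is looked up in the journal before it runs, a replay triggered from the saga inspector cannot release the same inventory hold twice.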

R-MEL-003 — Sync conflict bugs corrupting offline data

The Electron desktop app holds a 60-day window of operational data in SQLite via better-sqlite3. When a property runs offline for hours or days (or a chain operator works from multiple devices), the sync engine's per-aggregate conflict resolution decides which version of a folio, room status, housekeeping task, or key credential wins. A bug in conflict resolution can silently overwrite a valid local mutation with stale server state, or vice versa. The damage manifests as "the cash drawer count was right yesterday and is wrong today" — hard to detect, hard to recover, trust-destroying.

Mitigation stack:

  1. Per-aggregate conflict policy declared in services/<svc>/SYNC_CONTRACT.md — never global last-write-wins. Monetary state (folios, payments) requires server-authoritative resolution; operational state (room status, housekeeping) uses domain-specific merge rules; reference data (rate plans) uses LWW.
  2. Outbox + idempotency keys on every mutation, derived from {device_id, local_aggregate_version, mutation_seq} — replay is safe.
  3. Property-based tests in CI that generate random divergent histories and assert merge convergence.
  4. Conflict log table persisted server-side; conflicts requiring operator decision surface as a tray notification on the desktop with a side-by-side diff view and a "restore my version" affordance.
  5. Pre-merge backups of the local aggregate are kept for 7 days locally so an operator can recover their version if the merge UI was misused.

Watchpoint trigger: sync conflict rate > 5 per 1k mutations per tenant per week; any operator complaint of "lost work" matched to a sync window; any property-based test failure in the sync harness.
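
A minimal sketch of mitigations 1 and 2 follows, assuming a fixed set of aggregate names and an in-memory set of seen keys. The type names, the `POLICY` table, and the `:` separator are assumptions for this example (device ids are assumed not to contain `:`); the actual per-aggregate policies live in each service's SYNC_CONTRACT.md.

```typescript
type Aggregate = "folio" | "payment" | "room_status" | "housekeeping_task" | "rate_plan";
type ConflictPolicy = "server_authoritative" | "domain_merge" | "last_write_wins";

// Per-aggregate policy, never a global last-write-wins.
const POLICY: Record<Aggregate, ConflictPolicy> = {
  folio: "server_authoritative",     // monetary state
  payment: "server_authoritative",   // monetary state
  room_status: "domain_merge",       // operational state
  housekeeping_task: "domain_merge", // operational state
  rate_plan: "last_write_wins",      // reference data
};

interface Mutation {
  deviceId: string;
  localAggregateVersion: number;
  mutationSeq: number;
}

// Idempotency key derived from {device_id, local_aggregate_version, mutation_seq}.
function mutationKey(m: Mutation): string {
  return [m.deviceId, m.localAggregateVersion, m.mutationSeq].join(":");
}

// Applies a mutation at most once, so replaying an outbox batch is safe.
function applyOnce(seen: Set<string>, m: Mutation, apply: () => void): boolean {
  const key = mutationKey(m);
  if (seen.has(key)) return false;
  apply();
  seen.add(key);
  return true;
}
```

Declaring the policy as a `Record<Aggregate, ConflictPolicy>` means that adding a new aggregate type without choosing a policy fails type checking, which is one way to keep last-write-wins from becoming the silent default.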

3.2 Technical risks — full table

| ID | Description | L | I | Score | Mitigation | Owner | Status | Watchpoint |
|----|-------------|---|---|-------|------------|-------|--------|------------|
| R-MEL-001 | Multi-tenant data leakage from missing RLS guard | 3 | 5 | 15 | RLS on every table; two-tenant CI suite; quarterly pen-test; no raw SQL | Platform Lead | open | Cross-tenant test fails in CI |
| R-MEL-002 | Saga compensation gap leaves inconsistent state | 3 | 5 | 15 | Explicit saga state machine; idempotent compensations; chaos tests; saga inspector | Reservation owner | open | Saga stuck > 30 min |
| R-MEL-003 | Sync conflict bug corrupts offline data | 3 | 5 | 15 | Per-aggregate policy; idempotency; pre-merge backup; conflict UI | Sync Lead | open | Conflict rate > 5/1k/week |
| R-MEL-004 | ONNX model integrity tampering on desktop | 2 | 4 | 8 | Signed model artifacts; signature verified on load; key rotation; tamper telemetry | AI Lead | mitigated | Verification failure on app start |
| R-MEL-005 | Electron auto-update signature breach (rogue update) | 2 | 5 | 10 | electron-updater with code-signing certificate; staged rollout; rollback channel; cert pinning | Desktop Lead | mitigated | Unsigned update event |
| R-MEL-006 | SQLite/SQLCipher key loss on device wipe | 3 | 3 | 9 | Key in OS keychain via keytar; recovery via re-pair to tenant; documented re-pair flow | Desktop Lead | mitigated | Re-pair count anomaly |
| R-MEL-007 | Pub/Sub backlog after downstream outage causes consumer overload | 3 | 4 | 12 | Subscriber concurrency caps; dead-letter topics; per-topic ack deadline; outbox replay tools | SRE | open | Backlog age > 1h on any topic |
| R-MEL-008 | Cross-tenant query in search-aggregation surface | 2 | 5 | 10 | Single explicit service; denormalized read model; cross-tenant queries forbidden elsewhere | Platform Lead | mitigated | New service requesting cross-tenant read |
| R-MEL-009 | GCP region outage (asia-south1) | 2 | 5 | 10 | Multi-region from R2 (asia-south1 + asia-southeast1); read replicas; static site CDN | SRE | deferred | Region SLA breach |
| R-MEL-010 | Cloud SQL HA failover gap (sub-minute write loss) | 3 | 3 | 9 | HA primary + standby; PITR; outbox tolerates failover; chaos drill quarterly | SRE | mitigated | Failover RTO > 60s |
| R-MEL-011 | pgvector index size growth degrades writes on hot DB | 3 | 3 | 9 | Move embeddings to a separate *-vector schema; HNSW index discipline; nightly REINDEX window | AI Lead | open | Vector table > 50% of DB size |
| R-MEL-012 | BigQuery cost runaway from analytics queries | 3 | 3 | 9 | Slot reservation; per-tenant query budget; partition pruning enforced; BI reviewer rota | Finance + SRE | open | Daily slot-hour > budget |
| R-MEL-013 | Cold start on Cloud Run for ai-gateway (latency spike) | 4 | 2 | 8 | Min instances ≥ 1; CPU-always-allocated for AI surfaces; warm-up endpoint hit per region | AI Lead | mitigated | First-token p95 > 1.5s |
| R-MEL-014 | Memorystore eviction loses idempotency keys mid-flow | 2 | 4 | 8 | Idempotency keys persisted to Postgres; Redis is cache only; TTL > saga horizon | Platform Lead | mitigated | Duplicate mutation accepted |
| R-MEL-015 | Drizzle migration applied out of order across services | 2 | 4 | 8 | Per-service migrations; CI verifies linear order; production migrate gated on review | Platform Lead | open | Out-of-order migration in CI |
| R-MEL-016 | Outbox table growth pressures hot transactions | 3 | 3 | 9 | Outbox archival job; partition by month; dead-letter compaction | SRE | open | Outbox > 1M unprocessed rows |
| R-MEL-017 | better-sqlite3 native build fails on operator OS variant | 3 | 2 | 6 | Pre-built binaries for Win/macOS/Linux x64+arm64; install diagnostics; fallback installer | Desktop Lead | mitigated | Install failure rate > 1% |
| R-MEL-018 | Pub/Sub message ordering not preserved across partitions | 3 | 3 | 9 | Use ordering key on aggregate id where order matters; document where order is irrelevant | Platform Lead | mitigated | Out-of-order event in saga |
| R-MEL-019 | Electron preload security boundary leak | 2 | 5 | 10 | contextIsolation true; nodeIntegration false; typed window.melmastoon; CSP; review checklist | Desktop Lead | mitigated | New API exposed without review |
| R-MEL-020 | Drift between OpenAPI spec and BFF implementation | 3 | 2 | 6 | Generated client from spec; CI contract test fails on drift | Platform Lead | mitigated | Contract test fail in CI |

3.3 Operational risks — narrative for the top 3

R-MEL-101 — Front-desk operator forgets cash-drawer close

In cash-heavy markets, the end-of-day cash-drawer reconciliation is the financial truth. An operator who forgets to close the drawer (shift change, surprise checkout rush, power loss mid-flow) leaves the next shift inheriting an unbounded variance and the property's daily revenue ledger broken until manually reconstructed. Existing PMS tools either silently allow it or produce alerts the operator dismisses.

Mitigation stack:

  1. Forced close on logout — the desktop app blocks operator logout while the drawer is open with a clear "close drawer" CTA and a count helper.
  2. Auto-close at midnight tenant-local with a "needs review" status on the next-day open; the morning operator must reconcile before processing new payments.
  3. Anomaly callout if the drawer is left open > 8 hours.
  4. EOD report flags every drawer that was auto-closed vs. operator-closed.
  5. Training playbook in the operator onboarding doc with a one-page laminated quick-card.

Watchpoint trigger: > 5% of drawers auto-closed at midnight per tenant per week; > 1 day with EOD variance > 2× tenant baseline.

R-MEL-102 — Manual override abuse without audit trail

Hotel staff need manual overrides — the rate plan does not match the walk-in's negotiated rate, the lock failed and a master key was issued, the folio was adjusted to compensate a complaint. Without an audit trail, manual overrides become the route for fraud, kickbacks, and silent revenue leakage. Without a usable audit trail, the audit is theater.

Mitigation stack:

  1. Every override emits an event (<service>.override.applied.v1) with reason code, free-text justification, operator id, and traceparent.
  2. Override reasons are a closed list per surface (no free-text-only reasons); free-text justification is required and retained.
  3. Override reports in the GM dashboard with per-operator and per-property aggregations and trend lines.
  4. Threshold alerts to the owner persona when an operator's override rate exceeds 2× the property baseline.
  5. Override types touching money require a second-operator approval on the desktop (4-eyes gate).

Watchpoint trigger: any operator with override rate > 2× baseline; any 4-eyes gate bypassed; overrides with reason "other" exceeding 5% of total.
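
The 2× baseline alert can be sketched as below. The text does not define how the property baseline is computed, so this example assumes it is the pooled override rate across all operators at the property over the same period; the type and function names are hypothetical.

```typescript
interface OperatorStats {
  operatorId: string;
  overrides: number;    // overrides applied in the period
  transactions: number; // all transactions handled in the period
}

function overrideRate(s: OperatorStats): number {
  return s.transactions === 0 ? 0 : s.overrides / s.transactions;
}

// Returns operators whose override rate exceeds multiplier x property baseline.
// Assumption: the baseline is the pooled rate across every operator supplied.
function operatorsOverBaseline(stats: OperatorStats[], multiplier = 2): string[] {
  const totalOverrides = stats.reduce((n, s) => n + s.overrides, 0);
  const totalTransactions = stats.reduce((n, s) => n + s.transactions, 0);
  if (totalTransactions === 0) return [];
  const baseline = totalOverrides / totalTransactions;
  return stats
    .filter((s) => overrideRate(s) > multiplier * baseline)
    .map((s) => s.operatorId);
}
```

A pooled baseline includes the outlier's own overrides, which raises the bar slightly at properties with few operators; a median or a trailing historical baseline would be a reasonable alternative if that proves too lenient.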

R-MEL-103 — Offline window exceeds grace period and access is lost

The Electron desktop app caches a 60-day operational window so a property can run offline for days. But there are limits: lock vendor offline-issuance windows expire (TTLock dynamic codes are time-bounded), payment authorizations cannot be captured offline, and the device's own access tokens expire (refresh tokens have a 30-day default). A property that goes offline for longer than the configured grace period loses front-desk operability.

Mitigation stack:

  1. Documented grace periods per capability in docs/frontend/desktop/06-desktop-app-specification.md; defaults: 30 days for token refresh, 7 days for TTLock dynamic codes (vendor-specific), 60 days for the operational data window.
  2. Proactive warnings at 75% and 90% of grace; visible in the connectivity bar; emailed to GM at 90%.
  3. Last-resort manual mode that allows mechanical-key fallback and paper-folio capture for later digitization, with a forced reconciliation on reconnect.
  4. Per-property "offline duration" KPI in the SRE dashboard.

Watchpoint trigger: any property offline > 14 days; any token refresh failure on reconnect; any TTLock offline-issuance failure on a configured property.
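
The default grace periods and the 75% / 90% warnings can be sketched as a small lookup. The capability keys, the `graceWarning` name, and the returned labels are assumptions for this example; the day counts and percentages come from the mitigation list above.

```typescript
type Capability = "token_refresh" | "ttlock_dynamic_codes" | "operational_data_window";

// Default grace periods in days, per capability.
const GRACE_DAYS: Record<Capability, number> = {
  token_refresh: 30,
  ttlock_dynamic_codes: 7, // vendor-specific
  operational_data_window: 60,
};

type GraceWarning = "none" | "warn_75" | "warn_90_email_gm" | "expired";

// Integer arithmetic avoids floating-point edge cases at the exact thresholds.
function graceWarning(capability: Capability, offlineDays: number): GraceWarning {
  const grace = GRACE_DAYS[capability];
  if (offlineDays >= grace) return "expired";
  if (offlineDays * 100 >= grace * 90) return "warn_90_email_gm";
  if (offlineDays * 100 >= grace * 75) return "warn_75";
  return "none";
}
```

With these defaults, a property that has been offline for 5 days sees no warning for TTLock dynamic codes (the 75% mark is 5.25 days), while day 6 moves into the 75% warning and day 7 is expired, well before the 30-day token grace becomes relevant.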

3.4 Operational risks — full table

| ID | Description | L | I | Score | Mitigation | Owner | Status | Watchpoint |
|----|-------------|---|---|-------|------------|-------|--------|------------|
| R-MEL-101 | Front-desk forgets cash-drawer close | 4 | 3 | 12 | Forced close on logout; auto-close midnight; EOD flagging; training | Ops Lead | open | Auto-closed > 5%/week |
| R-MEL-102 | Manual override abuse without audit | 3 | 4 | 12 | Override event + reason; 4-eyes for money; trend alerts | Finance Lead | open | Operator > 2× baseline override |
| R-MEL-103 | Offline window > grace; access lost | 2 | 4 | 8 | Documented grace; proactive warnings; manual fallback | Ops Lead | mitigated | Offline > 14 days |
| R-MEL-104 | Encoder hardware failure strands guests | 2 | 4 | 8 | Spare encoder per property; fallback to mobile-key + mechanical; vendor SLA | Ops Lead | open | Encoder offline > 4h |
| R-MEL-105 | Lock device battery death without alerting | 3 | 3 | 9 | Battery telemetry from lock-integration; tray alert at 20%; vendor heartbeat schedule | Ops Lead | open | Battery silent > 30 days |
| R-MEL-106 | Receipt printer hardware failure | 4 | 2 | 8 | Email/SMS receipt fallback; manual receipt template; spare printer recommendation | Ops Lead | mitigated | Print failure > 5%/day |
| R-MEL-107 | Staff training gap on multi-tenant features (chain) | 3 | 3 | 9 | Per-role onboarding course; in-app tour; chain-operator playbook | Ops Lead | open | Support tickets in onboarding category |
| R-MEL-108 | On-call coverage gaps non-business hours | 3 | 4 | 12 | 24/7 rota across 2 timezones; PagerDuty escalation; runbooks per service | SRE | open | Page acknowledged > 15 min |
| R-MEL-109 | Backup verification skipped (untested DR) | 2 | 5 | 10 | Quarterly DR drill; restore time measured; documented runbook | SRE | mitigated | Drill skipped or fail |
| R-MEL-110 | Deploy outside maintenance window in front-desk hours | 3 | 3 | 9 | Tenant-timezone-aware deploy schedule; pre-deploy notification; rollback ≤ 5 min | SRE | mitigated | Deploy in front-desk hours |
| R-MEL-111 | Tenant data export request mishandled | 2 | 4 | 8 | Documented export workflow; per-service export tooling; legal review on cross-border | Legal + Platform | open | Request open > 30 days |
| R-MEL-112 | Operator runs old desktop version offline (drift) | 4 | 2 | 8 | Auto-update enforced when online; min-version gate on sync; in-app banner | Desktop Lead | mitigated | Old version sync attempt |

3.5 Security risks — narrative for the top 3

R-MEL-201 — Credential phishing on staff

Staff in the target market mix high-trust hospitality culture with low password hygiene and shared devices. A phishing email impersonating "Ghasi Support" asking the front desk to "verify your password" succeeds more often than we would like. The blast radius depends on the role — a front-desk credential exposes one property; a chain-operator credential exposes many.

Mitigation stack:

  1. WebAuthn / passkey as the default second factor for desktop login; a passkey is bound to the legitimate origin, so a look-alike page cannot capture and replay it.
  2. TOTP fallback for environments where passkeys are not yet supported, but flagged as weaker in the security dashboard.
  3. Device binding for the desktop app — a stolen credential cannot be used from an unbound device without re-pair, which requires an out-of-band code issued via the GM's verified phone.
  4. In-app "we will never ask" reminder during onboarding and printed on the operator quick-card.
  5. Suspicious-login telemetry — new device, new geo, new ASN — surfaces in the GM dashboard with a one-click revoke.
  6. Quarterly phishing drill for chain operators.

Watchpoint trigger: any account compromised by phishing; any login from a new geo without prior device pair; any TOTP-only chain-operator account.

R-MEL-202 — Lost desktop device with cached PII

A lost or stolen laptop with the Electron desktop installed contains 60 days of operational data (guest names, ID-document references, partial payment instruments, lock state). The legal and reputational damage is shaped by what is on disk and how well it is encrypted.

Mitigation stack:

  1. SQLCipher on the local SQLite store; key in OS keychain via keytar, never written to disk in plaintext.
  2. Device-bound key derivation — the SQLCipher key is derived from a server-issued device key plus a local secret; pulling the file off the disk yields encrypted bytes.
  3. Remote revocation: once the device is reported lost, the device key is revoked server-side, so the device's next sync attempt is rejected; subsequent re-pair requires GM out-of-band approval.
  4. Auto-purge after configurable inactivity — by default, 14 days without sync triggers a local-data wipe on next launch.
  5. Remote-wipe API for chain operators (executes on first reconnect; cannot guarantee execution if the device never reconnects).
  6. No PAN data ever stored locally (see R-MEL-212).

Watchpoint trigger: any reported lost device; any device with no sync > 14 days.

R-MEL-203 — Insider threat (chain operator role)

A chain operator role has access to multiple properties' financial and guest data. A malicious insider — disgruntled employee, social-engineered admin — can extract data, manipulate folios, or sabotage operations across the chain. The blast radius is larger than any external attack.

Mitigation stack:

  1. ABAC scoping — chain operators see only the properties on their attribute set; access scoped per property, never global.
  2. Audit log on every read of guest PII at the chain-operator level (not at the front-desk level, which would be too noisy); reviewed weekly.
  3. 4-eyes on financial overrides at the chain level (any override > a configured amount requires a second chain-operator's approval).
  4. Bulk-export rate limit — exporting > 100 guest records in a 24h window triggers a security review notification.
  5. Just-in-time elevation for sensitive actions (delete guest, refund > threshold) — the operator requests, a second operator approves, the elevation is time-bounded and audited.

Watchpoint trigger: chain-operator bulk export > threshold; any chain-operator action without 4-eyes for financial overrides; any abnormal access pattern (off-hours, from new geo).
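
Mitigation 4 (more than 100 guest records exported in a 24h window triggers a security review notification) can be sketched as a sliding-window sum. The event shape and function names are assumptions for this example, and the window is taken here to be a rolling 24 hours rather than a calendar day.

```typescript
interface ExportEvent {
  operatorId: string;
  records: number; // guest records included in this export
  at: number;      // epoch milliseconds
}

const WINDOW_MS = 24 * 60 * 60 * 1000;
const RECORD_THRESHOLD = 100;

// Sums the records one operator exported in the 24 hours ending at `now`.
function recordsInWindow(events: ExportEvent[], operatorId: string, now: number): number {
  return events
    .filter((e) => e.operatorId === operatorId && e.at > now - WINDOW_MS && e.at <= now)
    .reduce((total, e) => total + e.records, 0);
}

// True when the operator's rolling 24h export volume exceeds the threshold.
function needsSecurityReview(events: ExportEvent[], operatorId: string, now: number): boolean {
  return recordsInWindow(events, operatorId, now) > RECORD_THRESHOLD;
}
```

Summing records rather than counting export requests matters: two exports of 60 and 50 records each stay under any per-request limit but together cross the 100-record line.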

3.6 Security risks — full table

| ID | Description | L | I | Score | Mitigation | Owner | Status | Watchpoint |
|----|-------------|---|---|-------|------------|-------|--------|------------|
| R-MEL-201 | Credential phishing on staff | 4 | 4 | 16 | Passkeys default; device binding; suspicious-login telemetry; phishing drills | Security | open | New geo login w/o pair |
| R-MEL-202 | Lost device with cached PII | 3 | 4 | 12 | SQLCipher; device-bound key; remote revocation; auto-purge | Security | mitigated | Lost device report |
| R-MEL-203 | Insider threat (chain operator) | 2 | 5 | 10 | ABAC; audit on PII read; 4-eyes; export rate limit; JIT elevation | Security | open | Bulk export > threshold |
| R-MEL-204 | RFID card cloning | 3 | 3 | 9 | Vendor-recommended encryption; reservation-bound credentials; revoke on checkout | Security + Lock | mitigated | Duplicate-credential anomaly |
| R-MEL-205 | Mobile-key shared with non-guest | 3 | 3 | 9 | Time-bound credentials; per-stay key; vendor abuse signals; audit on extra-device pair | Security + Lock | open | Multiple device pair per credential |
| R-MEL-206 | Weak tenant-admin passwords | 3 | 4 | 12 | Password policy; HIBP check at set; passkey nudge; admin MFA mandatory | Security | mitigated | Admin without MFA |
| R-MEL-207 | SQL injection in custom queries | 1 | 5 | 5 | No raw SQL rule; parameterized queries; lint; pen-test | Platform Lead | mitigated | Lint failure or finding |
| R-MEL-208 | XSS in tenant-provided content blocks | 3 | 3 | 9 | Content blocks rendered through sanitizer; no arbitrary HTML; CSP; per-tenant CSP isolation in R2 | Frontend Lead | mitigated | CSP report-only violations |
| R-MEL-209 | CSRF on BFF without proper double-submit | 2 | 4 | 8 | SameSite=Lax cookies; double-submit token on state-changing routes; OWASP review | Platform Lead | mitigated | Missing token in route audit |
| R-MEL-210 | Supply-chain compromise of npm dependency | 3 | 4 | 12 | Renovate + audit; lockfile review; pinned versions; Snyk + Socket.dev scanning | Security | open | Critical advisory in deps |
| R-MEL-211 | KMS misconfiguration leaks key material | 2 | 5 | 10 | Per-tenant DEKs wrapped by KEK; CMEK option; least-privilege IAM; audit log | Security | mitigated | KMS access by non-allowlisted SA |
| R-MEL-212 | PAN/PCI data leaks into logs | 2 | 5 | 10 | No PAN in our perimeter; SAQ-A scope; log scrubber; redaction tests | Security + Payments | mitigated | Redaction test fail |
| R-MEL-213 | OAuth / OIDC misconfiguration on chain SSO | 2 | 4 | 8 | OIDC discovery doc; nonce + state validation; tested per IdP; chain-onboarding playbook | Security | open | New IdP without test pass |
| R-MEL-214 | Replay of webhook signatures | 2 | 3 | 6 | HMAC + timestamp + nonce window; replay corpus tested | Platform Lead | mitigated | Replay accepted in test |
| R-MEL-215 | Electron renderer breaks contextBridge boundary | 2 | 5 | 10 | Periodic Electron security audit; CSP in renderer; preload review | Desktop Lead | mitigated | New preload export w/o review |

3.7 Regulatory risks — narrative for the top 3

R-MEL-301 — Daily guest-registration mandate change in target jurisdictions

Afghanistan, Tajikistan, and Iran each have police-registration regimes for hotel guests with country-specific frequency, format, and enforcement. The format and the deadline can change with little notice. A platform that cannot respond within days becomes the reason a tenant gets fined or shut down.

Mitigation stack:

  1. Per-jurisdiction registration adapter as a port in reservation-service; new format = new adapter, no schema change.
  2. Manual fallback — the desktop app can always export the registration in the latest known format as PDF + CSV for paper submission.
  3. Per-jurisdiction toggle and field set in tenant configuration so we can enable/disable a field without a deploy.
  4. Quarterly regulatory review with a local counsel partner per jurisdiction.

Watchpoint trigger: any tenant report of registration rejection by authorities; any local-counsel quarterly report of changed mandate.

R-MEL-302 — Data residency requirements changing (especially Iran)

Data residency in Iran is the most variable surface in our regulatory landscape. A residency mandate that arrives mid-quarter could require us to host all Iranian-tenant data inside Iran or to route it through specific carriers, neither of which is feasible on GCP today.

Mitigation stack:

  1. Hexagonal architecture — every infrastructure dependency is behind a port, so a future port to a different cloud or to a co-located deployment is feasible without service rewrites.
  2. Per-tenant data classification — what is PII vs. operational vs. analytic — so a residency mandate can be enforced selectively.
  3. Iran exploratory deployment in R2 under a sanctions-compliant boundary, with a clear escape hatch (do not onboard Iranian tenants if residency cannot be honored).
  4. CMEK option enabled per-tenant for sensitive data.
  5. Documented "Plan B" — a co-located Postgres + Cloud Run deployment topology kept in IaC even if not deployed.

Watchpoint trigger: any residency mandate published; any sanctions-list change affecting GCP availability in Iran.

R-MEL-303 — KYC mandates for foreign guests

Foreign guests trigger heavier KYC in many target jurisdictions: passport, visa, departure date, accompanying-persons declaration. The exact list is jurisdiction-specific and changes. The GDPR-class minimization principle pushes against retaining more PII than needed; the regulatory pressure pushes the other way.

Mitigation stack:

  1. Per-jurisdiction guest-document schema in reservation-service; tenants in that jurisdiction collect the required set, others do not.
  2. Document storage with short retention by default; per-jurisdiction retention overrides codified.
  3. Encrypted at rest with per-tenant DEK; access audited.
  4. Operator-facing affordance to mark a document as "verified" rather than retaining the image, where the jurisdiction allows.

Watchpoint trigger: any KYC mandate change; any tenant request to retain documents past default retention; any document-storage volume anomaly.

3.8 Regulatory risks — full table

| ID | Description | L | I | Score | Mitigation | Owner | Status | Watchpoint |
|----|-------------|---|---|-------|------------|-------|--------|------------|
| R-MEL-301 | Daily guest-registration mandate change | 4 | 3 | 12 | Per-jurisdiction adapter; manual fallback; quarterly counsel | Compliance | open | Counsel reports change |
| R-MEL-302 | Data residency change (esp. Iran) | 3 | 5 | 15 | Hexagonal ports; per-tenant classification; Plan B IaC; CMEK | Platform + Compliance | open | Mandate published |
| R-MEL-303 | KYC mandate for foreign guests | 4 | 3 | 12 | Per-jurisdiction schema; short retention; encrypted at rest; verify-not-retain | Compliance | open | Mandate or storage anomaly |
| R-MEL-304 | Tax rate changes mid-month | 4 | 2 | 8 | Effective-dated rate table; per-jurisdiction; audit on change | Finance | mitigated | Rate change without audit |
| R-MEL-305 | Sanctions-list update affects payment availability | 3 | 4 | 12 | OFAC + UN + EU lists synced daily; tenant onboarding screen; payment-route fallback | Compliance + Payments | open | Tenant on updated list |
| R-MEL-306 | GDPR-style request handling delay | 2 | 4 | 8 | Per-service export + erase tooling; 30-day SLA; legal queue | Compliance | open | Request open > 25 days |
| R-MEL-307 | PCI scope creep if PAN slips into our perimeter | 2 | 5 | 10 | SAQ-A scope; PCI scanner in CI; log redaction; quarterly scope review | Security + Payments | mitigated | PAN in any service log |
| R-MEL-308 | Currency-control rules change for cross-border settlement | 3 | 3 | 9 | Per-jurisdiction settlement rules; FX provider compliance; audit log | Finance | open | Settlement bounce by bank |
| R-MEL-309 | Local hospitality-licensing change | 3 | 3 | 9 | Per-jurisdiction property metadata; counsel review per market entry | Compliance | open | License renewal failure |
| R-MEL-310 | Anti-trafficking flagging mandates | 3 | 3 | 9 | Risk-flagging field on guest aggregate; compliance contact per jurisdiction | Compliance | open | Mandate published |
| R-MEL-311 | Cross-border data transfer restrictions (EU expansion) | 3 | 4 | 12 | SCCs; in-EU region for EU tenants from R3; DPIA per high-risk feature | Compliance | deferred | EU pilot signed |

3.9 Market & business risks — narrative for the top 3

R-MEL-401 — Small target tenant size (financial fragility)

The modal target tenant is an 8–50 room independent property in a market where seasonal revenue swings of 50%+ are normal. A run of bad months and they cannot pay our subscription. We absorb the churn or we lose the relationship. Either way, the unit economics are tight.

Mitigation stack:

  1. Tiered pricing sized for SMB independents — no per-room USD model.
  2. Outcomes-aligned commercial terms in the chain segment — a percentage of incremental direct-booking revenue we capture.
  3. Local-currency billing (AFN, TJS, IRR via local rails) where banking allows, with FX-risk pricing.
  4. Pause-not-cancel option during off-season; data retained, app inactive, low monthly fee.
  5. Diversified geography in R1 to avoid single-market concentration.

Watchpoint trigger: monthly churn > 3%; > 30% of tenants on pause-not-cancel; concentration on a single market > 50%.

R-MEL-402 — Slow PMS migration culture in target markets

Hoteliers in the target market have run their property the same way for years. WhatsApp + Excel + a paper register works. Convincing them to change is a multi-month, high-touch sale. A go-to-market plan that assumes self-serve onboarding will run aground.

Mitigation stack:

  1. Assisted onboarding by a local field rep for every tenant in R1 and R2.
  2. In-language training material — printed quick-cards and short videos in Pashto, Dari, Tajik, Persian.
  3. "Run alongside" pilot mode — tenants run Ghasi alongside their existing process for 30 days; we earn the switch.
  4. Per-region rep model in R2; reseller channel in R3 (white-label reseller program).
  5. Local advocacy — first 5 tenants per market are referenceable case studies.

Watchpoint trigger: average onboarding time > 90 days; first-30-day usage rate < 60%; pilot abandonment > 20%.

R-MEL-403 — Inability to take cards in target markets due to sanctions

Cards are not the dominant rail in our markets, but they matter for foreign guests and for the subset of properties that serve diaspora. Stripe and PayPal availability fluctuates with sanctions postures. A platform that hard-codes Stripe as the only card processor breaks when a tenant's market becomes unavailable.

Mitigation stack:

  1. Pluggable payment-gateway-service with adapters for Stripe, PayPal, MFS providers, cash-on-arrival, and bank transfer.
  2. Per-tenant payment-method enable/disable with jurisdiction defaults.
  3. Cash-on-arrival as first-class — the dominant rail in our markets is treated as a primary, not a workaround.
  4. MFS expansion in R2 — M-PESA, EasyPaisa, AfghanPaisa, Pamir-Pay — to give every market at least one electronic rail.
  5. Reconciliation tooling for cash and bank-transfer flows.

Watchpoint trigger: any payment provider's regional suspension; any tenant with no working electronic rail.
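
The per-tenant enable/disable logic with jurisdiction defaults could be sketched roughly as follows. This is a minimal illustration, not the payment-gateway-service code: the method names, the `ELECTRONIC_RAILS` set, and the choice to exclude bank transfer from "electronic rails" are all assumptions made for the example.

```python
# Hypothetical method identifiers; the real service may name them differently.
ELECTRONIC_RAILS = frozenset({"stripe", "paypal", "mfs"})  # assumption: bank transfer not counted


def enabled_methods(defaults: set, overrides: dict, suspended: set) -> set:
    """Resolve the payment methods a tenant can offer right now.

    defaults:  methods enabled by default for the tenant's jurisdiction
    overrides: per-tenant switches, method -> True (enable) / False (disable)
    suspended: methods currently unavailable in the tenant's market
    """
    methods = set(defaults)
    for method, enabled in overrides.items():
        if enabled:
            methods.add(method)
        else:
            methods.discard(method)
    methods -= suspended
    if not methods:
        # Cash-on-arrival acts as the fallback when nothing else survives.
        methods.add("cash_on_arrival")
    return methods


def has_electronic_rail(methods: set) -> bool:
    """Watchpoint helper: does the tenant still have a working electronic rail?"""
    return bool(methods & ELECTRONIC_RAILS)
```

A tenant whose only card processor is suspended would resolve to cash-on-arrival alone, and `has_electronic_rail` would return `False`, which corresponds to the second watchpoint above.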

3.10 Market & business risks — full table

| ID | Description | L | I | Score | Mitigation | Owner | Status | Watchpoint |
|---|---|---|---|---|---|---|---|---|
| R-MEL-401 | Small target tenant financial fragility | 4 | 3 | 12 | Tiered pricing; outcomes terms; local-currency billing; pause-not-cancel | Commercial | open | Monthly churn > 3% |
| R-MEL-402 | Slow PMS migration culture | 4 | 3 | 12 | Assisted onboarding; in-language training; "run alongside" pilot; reseller channel | Commercial | open | Onboarding > 90 days |
| R-MEL-403 | Card unavailability due to sanctions | 4 | 3 | 12 | Pluggable gateway; per-tenant enable; cash first-class; MFS expansion | Payments + Compliance | open | Provider regional suspension |
| R-MEL-404 | Competition from regional incumbents | 3 | 3 | 9 | Differentiation (meta+direct, offline, RTL, AI); reference customers; community | Commercial | open | Major incumbent regional play |
| R-MEL-405 | Currency volatility (AFN, IRR, TJS) | 4 | 3 | 12 | FX snapshot at confirm; multi-currency folio; daily FX feed; settlement-currency choice | Finance | mitigated | FX swing > 5%/day |
| R-MEL-406 | Informal-channel payments not auditable | 4 | 3 | 12 | Cash + bank-transfer reconciliation; documented "informal payment" capture path | Finance | open | Tenant complaint on reconciliation |
| R-MEL-407 | Tenant cohort revenue concentrated on 1 market | 3 | 4 | 12 | Geographic diversification target per release; market-mix dashboard | Commercial | open | Single market > 50% revenue |
| R-MEL-408 | Pricing perceived as expensive in local currency | 3 | 3 | 9 | Local-currency pricing; per-market adjustment; outcomes terms | Commercial | open | Local pricing complaint trend |
| R-MEL-409 | OTA push-back on direct-booking thesis | 2 | 3 | 6 | Tenant-owned listings; OTA channel manager R3 anyway; reputation focus | Commercial | open | OTA delisting threats |
| R-MEL-410 | Regulator perceives meta layer as OTA | 2 | 4 | 8 | Counsel review per market; clear tenant-of-record on every booking; legal positioning doc | Compliance | open | Regulator inquiry |

3.11 AI risks — narrative for the top 3

R-MEL-501 — Model drift on dynamic pricing causes revenue loss

Dynamic pricing is the highest-leverage AI capability we ship. It is also the one with the most asymmetric downside: a 5% price drift across a 50-tenant fleet, sustained over a quarter, is real money. Drift can come from changed demand patterns (a holiday calendar shift), changed competitor behavior, or upstream model changes from Vertex AI.

Mitigation stack:

  1. Suggestions, not auto-apply — every pricing change is HITL by default until tenant-level acceptance rate clears the readiness bar (≥ 60% over 30 days).
  2. Per-tenant baseline — pricing model is bounded by a tenant-configured rate band; suggestions outside the band are flagged.
  3. A/B and shadow-model evaluation — every model version runs in shadow alongside the live model for 14 days before promotion.
  4. Per-tenant revenue impact dashboard — RevPAR / GPAR delta vs. baseline, per model version.
  5. One-click rollback to the previous model version per tenant.
  6. Quarterly model accuracy eval per tenant cohort.

Watchpoint trigger: RevPAR delta < -5% across a tenant cohort over a week; acceptance rate below 40%; eval scores deviating > 1σ.
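
The interaction between the rate band and the HITL readiness bar can be illustrated with a short sketch. It assumes that an out-of-band suggestion always needs operator review, which the text implies ("flagged") but does not spell out; the type and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RateBand:
    floor: float    # lowest nightly rate the tenant allows
    ceiling: float  # highest nightly rate the tenant allows


@dataclass(frozen=True)
class PricingSuggestion:
    rate: float
    within_band: bool
    requires_review: bool


def evaluate_suggestion(
    suggested_rate: float,
    band: RateBand,
    acceptance_rate_30d: float,
    readiness_bar: float = 0.60,  # the 60% over 30 days bar from item 1
) -> PricingSuggestion:
    """Decide whether a model suggestion must go through a human."""
    within = band.floor <= suggested_rate <= band.ceiling
    # Review is required while the tenant is below the readiness bar,
    # and (assumption) whenever the suggestion falls outside the band.
    requires_review = (not within) or acceptance_rate_30d < readiness_bar
    return PricingSuggestion(suggested_rate, within, requires_review)
```

Under these assumptions a tenant at 70% acceptance would see in-band suggestions applied without review, while anything outside the band would still surface for a decision.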

R-MEL-502 — Hallucinated guest message text

AI-drafted guest messages (confirmations, pre-arrival, late-checkout) reach the guest in their language. A hallucination — wrong room number, wrong date, wrong policy — damages trust at the worst moment. The risk is amplified by translation: the operator may not be able to verify a Pashto draft by reading it.

Mitigation stack:

  1. HITL by default for all guest-facing AI output — operator must explicitly send.
  2. Structured generation — the prompt is built with the verified facts (dates, room, name, amount); the model fills the prose around them.
  3. Per-template glossary — locale-specific glossary pinned to the prompt; brand voice consistency.
  4. Round-trip verification — translate-back-to-English check on long messages; mismatch flagged for operator review.
  5. Refusal on low confidence — model declines to draft if the inputs are inconsistent.
  6. Audit log of every draft, accepted or rejected, with provenance.

Watchpoint trigger: any guest complaint linked to AI-drafted text; any round-trip mismatch above a configured threshold; any operator override rate > 30%.
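
One cheap guard that complements structured generation is to check that every verified fact still appears in the draft before the operator sees it. The sketch below is an assumption about how such a check might look, not the documented implementation; it uses exact string matching, so localized numerals or date formats would need to be normalized first.

```python
def missing_facts(draft: str, verified_facts: dict) -> list:
    """Return the names of verified facts whose exact value is absent from the draft.

    verified_facts maps a fact name (e.g. "room") to the string the
    system of record holds for it (e.g. "204").
    """
    return [name for name, value in verified_facts.items() if value not in draft]


def should_flag_for_review(draft: str, verified_facts: dict) -> bool:
    """Flag the draft when any verified fact went missing or was altered."""
    return len(missing_facts(draft, verified_facts)) > 0
```

For example, a draft that says "room 240" when the reservation holds "204" would come back with `["room"]` and be flagged, in the same spirit as a round-trip mismatch.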

R-MEL-503 — AI cost runaway from prompt explosion

A bug, a feature change, or a misconfigured retry policy can multiply AI calls without warning. A 10× spike on Vertex AI is recoverable; a 100× spike for a week is not. The risk is amplified by edge inference fallback that hides degradation.

Mitigation stack:

  1. Per-tenant AI budget — soft-degrade at 80% (cheaper model, less context); hard-stop at 100% (HITL only) until top-up or new period.
  2. Per-feature quotas layered on top of tenant budget.
  3. Cache by prompt hash — identical prompts in a 1h window served from cache.
  4. AI cost dashboard with per-tenant, per-feature, per-model breakdown; alert on > 2× baseline.
  5. Default-off for net-new AI features; tenant opts in.
  6. Vertex AI batch APIs where latency tolerates.

Watchpoint trigger: daily cost > 2× 7-day moving average; budget hit > 80% before mid-period; per-prompt cost > model average.
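
The prompt-hash cache from item 3 could look roughly like the sketch below. It is an in-memory illustration only; including the tenant id and model in the hash is an assumption made here so that cached output never crosses tenants, and the class and method names are hypothetical.

```python
import hashlib
import time
from typing import Callable, Dict, Optional, Tuple


class PromptCache:
    """Serve identical prompts from cache within a fixed window (1 hour by default)."""

    def __init__(self, ttl_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, str]] = {}  # key -> (stored_at, response)

    @staticmethod
    def key(tenant_id: str, model: str, prompt: str) -> str:
        # A separator byte keeps ("ab", "c") and ("a", "bc") from colliding.
        digest = hashlib.sha256()
        for part in (tenant_id, model, prompt):
            digest.update(part.encode("utf-8"))
            digest.update(b"\x00")
        return digest.hexdigest()

    def get(self, tenant_id: str, model: str, prompt: str) -> Optional[str]:
        k = self.key(tenant_id, model, prompt)
        entry = self._entries.get(k)
        if entry is None:
            return None
        stored_at, response = entry
        if self._clock() - stored_at >= self._ttl:
            del self._entries[k]  # expired: drop it and report a miss
            return None
        return response

    def put(self, tenant_id: str, model: str, prompt: str, response: str) -> None:
        k = self.key(tenant_id, model, prompt)
        self._entries[k] = (self._clock(), response)
```

A production version would more likely sit in a shared store such as Memorystore so that every gateway instance sees the same entries; that placement is a guess, not something this register specifies.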

3.12 AI risks — full table

| ID | Description | L | I | Score | Mitigation | Owner | Status | Watchpoint |
|---|---|---|---|---|---|---|---|---|
| R-MEL-501 | Pricing model drift causes revenue loss | 3 | 4 | 12 | HITL; rate band; shadow model; rollback; quarterly eval | AI Lead | open | RevPAR < -5% cohort |
| R-MEL-502 | Hallucinated guest message text | 3 | 4 | 12 | HITL default; structured gen; glossary; round-trip; refusal; audit | AI Lead | open | Guest complaint or override > 30% |
| R-MEL-503 | AI cost runaway from prompt explosion | 3 | 4 | 12 | Tenant budget; cache; dashboard; default-off; batch APIs | AI Lead + Finance | open | Cost > 2× baseline |
| R-MEL-504 | AI suggestion latency > tolerance | 3 | 3 | 9 | Min instances; warm-up; edge fallback; per-surface budget | AI Lead | mitigated | First-token > 1.5s p95 |
| R-MEL-505 | Embedding leak via vector index reconstruction | 2 | 4 | 8 | Per-tenant partitioning of pgvector; query filter mandatory; pen-test category | AI Lead + Security | mitigated | Pen-test finding |
| R-MEL-506 | Hosted-model provider deprecation | 3 | 3 | 9 | Multi-provider abstraction in ai-gateway; eval suite vs alternatives; deprecation calendar | AI Lead | open | Vendor deprecation announced |
| R-MEL-507 | Edge model fairness across locales (Pashto/Dari/Persian) | 4 | 3 | 12 | Per-locale eval suite; locale-specific fine-tunes; HITL stricter on weak locales | AI Lead | open | Locale eval > 1σ from EN |
| R-MEL-508 | Prompt injection from tenant content | 3 | 4 | 12 | Pre-call classifier; system prompt isolation; structured tools; allowlist | AI Lead + Security | mitigated | Injection detected in fuzz |
| R-MEL-509 | Bias in upsell or anomaly recommendations | 3 | 3 | 9 | Fairness eval suite; HITL; quarterly review; documented audit | AI Lead | open | Fairness metric drift |
| R-MEL-510 | Provenance lost on AI artifact (audit gap) | 2 | 4 | 8 | Domain refuses persistence without provenance; export includes provenance | AI Lead | mitigated | Persisted artifact w/o provenance |
| R-MEL-511 | Edge ONNX runtime breaking change on app upgrade | 2 | 3 | 6 | Pin runtime version; integration test on each release; staged rollout | Desktop Lead | mitigated | Runtime version mismatch |

3.13 Vendor risks — narrative for the top 3

R-MEL-601 — TTLock or Salto API breaking change

Lock vendors are mid-tier SaaS providers with their own release cadence. A breaking change to their API in production has happened to other PMS vendors and will happen to us. The blast radius is per-tenant per-vendor; the recovery is hours of unplanned engineering and a window of degraded operations.

Mitigation stack:

  1. Vendor adapter pattern — every vendor lives behind the LockPort interface; a breaking change touches one adapter, not the domain.
  2. Adapter contract tests run nightly against vendor sandboxes.
  3. Vendor-version pinning where APIs offer it; explicit vendor-version metadata in every key event.
  4. Manual fallback — issue a mechanical key, capture the lock event for later sync, do not block check-in.
  5. Generic Wiegand adapter as a backstop for vendors that lose their cloud entirely.

Watchpoint trigger: vendor sandbox contract test fails; vendor advisory of breaking change; any key-issuance success rate drop > 5% per vendor.
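
The adapter pattern behind LockPort can be sketched as below. Only the name `LockPort` comes from this document; the method name `issue_key`, the event fields, and the fake adapter are hypothetical stand-ins to show the shape of the boundary.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class KeyIssued:
    """Key event carrying explicit vendor-version metadata (item 3)."""
    room: str
    vendor: str
    vendor_api_version: str


class LockPort(Protocol):
    """The only lock surface the domain is allowed to depend on."""

    def issue_key(self, room: str, guest_id: str) -> KeyIssued: ...


class FakeTTLockAdapter:
    """Stand-in adapter; a real one would call the vendor API here."""

    vendor = "ttlock"
    vendor_api_version = "v3"  # hypothetical pinned version

    def issue_key(self, room: str, guest_id: str) -> KeyIssued:
        return KeyIssued(room=room, vendor=self.vendor,
                         vendor_api_version=self.vendor_api_version)


def check_in(lock: LockPort, room: str, guest_id: str) -> KeyIssued:
    # Domain code sees only the port, so a vendor breaking change
    # is contained in the adapter that implements it.
    return lock.issue_key(room, guest_id)
```

Because `check_in` depends only on the protocol, swapping in a Salto or Wiegand adapter would not require touching the domain function.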

R-MEL-602 — Stripe / PayPal regional restriction tightening

Payment providers periodically tighten rules in our markets. The risk is that a tenant who relied on Stripe one month cannot use it the next. The product impact is direct revenue loss for that tenant and a support escalation for us.

Mitigation stack:

  1. Pluggable payment adapters — Stripe and PayPal are two of many.
  2. Per-tenant payment-method config — disable a method per tenant without deploy.
  3. MFS coverage — at least one electronic rail per market.
  4. Cash-on-arrival as the universal fallback.
  5. Compliance alert subscription to provider terms changes; quarterly review.

Watchpoint trigger: provider TOS change in target market; provider account restriction notice; tenant payment-method failure spike.

R-MEL-603 — Vertex AI model deprecation

Vertex AI deprecates models on its own schedule. A deprecation that lands during a release window forces an unplanned migration of every prompt that targeted the deprecated model.

Mitigation stack:

  1. Single AI gateway — model identifier changes happen in one place.
  2. Eval suite per prompt — re-runs against the new model before swap.
  3. Multi-provider abstraction — fallback to a different provider where eval permits.
  4. Subscription to Vertex AI deprecation calendar; quarterly model-currency audit.
  5. Prompt registry with version history so old prompts can be re-evaluated against new models.

Watchpoint trigger: deprecation announced; eval suite drift on a target prompt > 10%.

3.14 Vendor risks — full table

| ID | Description | L | I | Score | Mitigation | Owner | Status | Watchpoint |
|---|---|---|---|---|---|---|---|---|
| R-MEL-601 | TTLock or Salto API breaking change | 3 | 4 | 12 | Adapter pattern; nightly contract tests; manual fallback | Lock Lead | open | Sandbox contract fail |
| R-MEL-602 | Stripe / PayPal regional restriction tighten | 3 | 4 | 12 | Pluggable adapters; per-tenant config; MFS; cash-on-arrival | Payments Lead | open | Provider TOS change |
| R-MEL-603 | Vertex AI model deprecation | 3 | 3 | 9 | AI gateway centralization; per-prompt eval; multi-provider; deprecation calendar | AI Lead | open | Deprecation announced |
| R-MEL-604 | GCP pricing change (egress, Cloud Run, Pub/Sub) | 3 | 3 | 9 | FinOps dashboard; quarterly cost review; reserve commits where stable | Finance + SRE | open | Pricing announcement |
| R-MEL-605 | OpenSearch licensing complications | 1 | 3 | 3 | Not adopted; meta search uses Postgres + indexes; OpenSearch deferred | Platform Lead | closed | If adopted, re-open |
| R-MEL-606 | ONNX runtime breaking change | 2 | 3 | 6 | Version pinning; release-test matrix; vendor changelog watch | Desktop Lead | mitigated | Major version bump |
| R-MEL-607 | Twilio (SMS) outage | 3 | 3 | 9 | Multi-provider notification adapter; queue + retry; in-app fallback | Notification Lead | open | Twilio incident |
| R-MEL-608 | WhatsApp Business platform policy change | 3 | 3 | 9 | Per-tenant template approval; manual fallback to SMS / email; abuse-rate monitoring | Notification Lead | open | Policy change |
| R-MEL-609 | Resend / SendGrid email outage | 3 | 2 | 6 | Multi-provider; outbox + retry; deliverability dashboard | Notification Lead | mitigated | Provider incident |
| R-MEL-610 | electron-builder release / signing infra outage | 2 | 3 | 6 | Local fallback signing; release pipeline self-host option; staged rollout | Desktop Lead | mitigated | Pipeline failure |
| R-MEL-611 | Vendor sub-processor change (under DPA) | 3 | 3 | 9 | DPA review per vendor; sub-processor change notice; legal queue | Compliance | open | Sub-processor change notice |

3.15 People & process risks — narrative for the top 2

R-MEL-701 — Hiring locale-fluent QA

QA engineers who are fluent in Pashto, Dari, Persian, or Tajik and who also know the operational domain are scarce; the hiring pipeline is thin. Without them, our acceptance tests for RTL content and AI translations regress silently.

Mitigation stack:

  1. Local QA contract pool in Kabul, Dushanbe, Mashhad, Herat for paid validation cycles.
  2. Per-locale acceptance gate in CI for translated content.
  3. Tenant beta program — pilot tenants are paid feedback partners.
  4. Internal locale champions — engineers and PMs with native fluency rotate review duty.

Watchpoint trigger: translation defect found in production; per-locale eval drift; QA pipeline > 1 week cycle.

R-MEL-702 — Documentation rot

The 17-doc-per-service standard generates a lot of paper. Without enforcement, the docs drift from code; new engineers stop trusting them; the moat erodes.

Mitigation stack:

  1. SERVICE_READINESS.md audit at every release per service; documented gaps block the release.
  2. Service readiness audit skill runs in CI per service touched.
  3. Quarterly platform doc audit — random sample reviewed.
  4. PR template requires confirming which doc(s) were updated.
  5. ADR creation policy — if a tradeoff is changed, an ADR is mandatory.

Watchpoint trigger: service readiness score drops > 10 points; ADR backlog > 3 months; PR with code change but no doc check.

3.16 People & process risks — full table

| ID | Description | L | I | Score | Mitigation | Owner | Status | Watchpoint |
|---|---|---|---|---|---|---|---|---|
| R-MEL-701 | Hiring locale-fluent QA | 4 | 3 | 12 | Local QA pool; per-locale CI gate; beta program; champions | People + QA | open | Translation defect in prod |
| R-MEL-702 | Documentation rot | 4 | 3 | 12 | SERVICE_READINESS audit; skill in CI; quarterly review; ADR policy | Platform Lead | open | Readiness drop > 10 |
| R-MEL-703 | Cross-cultural design (RTL nuances) | 3 | 3 | 9 | Locale champions; native review on UI changes; mirrored screenshots in CI | Frontend Lead | open | UX regression in RTL |
| R-MEL-704 | On-call burnout | 3 | 3 | 9 | Rotation across timezones; runbook quality; auto-remediation; weekly health pulse | SRE Lead | open | Page count > target/operator |
| R-MEL-705 | ADR drift (decision changed without ADR) | 3 | 3 | 9 | ADR policy in PR template; quarterly ADR review; spec-vs-implementation audit | Platform Lead | open | Code drifted from ADR |
| R-MEL-706 | Founder bus factor | 2 | 5 | 10 | Doc-heavy culture; pair-on-decision rule; succession plan; access shared | CTO | open | Single-owner critical path > 2 |
| R-MEL-707 | Hiring senior engineers in target geographies | 3 | 4 | 12 | Remote-friendly; local hubs; partnerships with universities; relocation support | People | open | Open senior role > 90 days |

3.17 Register summary

  • Total risks identified: 67
  • Escalate (≥ 15): 5 — R-MEL-001, R-MEL-002, R-MEL-003, R-MEL-201, R-MEL-302
  • Mitigate (7–14): 50
  • Monitor (≤ 6): 12
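
The Score column in the tables is consistent with likelihood multiplied by impact, and the bands above partition that score. A minimal sketch of the arithmetic, assuming both inputs sit on a 1 to 5 scale (the function names are illustrative):

```python
def risk_score(likelihood: int, impact: int) -> int:
    """Score as likelihood times impact, matching the L, I, Score columns."""
    if not (1 <= likelihood <= 5 and 1 <= impact <= 5):
        raise ValueError("likelihood and impact are assumed to be on a 1-5 scale")
    return likelihood * impact


def risk_band(score: int) -> str:
    """Map a score to the summary bands: escalate, mitigate, monitor."""
    if score >= 15:
        return "escalate"
    if score >= 7:
        return "mitigate"
    return "monitor"
```

For instance, R-MEL-501 (L = 3, I = 4) scores 12 and lands in the mitigate band, while R-MEL-605 (L = 1, I = 3) scores 3 and is monitored.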

Distribution by category:

| Category | Count |
|---|---|
| Technical | 20 |
| Operational | 12 |
| Security | 15 |
| Regulatory | 11 |
| Market & business | 10 |
| AI | 11 |
| Vendor | 11 |
| People & process | 7 |

(Counts include narrative top-3 plus tables; some risks span categories — the canonical category is the one in the table heading.)


4. Tradeoffs Register

Tradeoffs are decisions where we deliberately accepted a downside in exchange for an upside. Each entry names the alternative, the upside we chose, the downside we accepted, the mitigation that bounds the downside, and the watchpoint that would force us to revisit. Cross-references to the relevant ADR or spec section are included.

TR-MEL-01 — Single shared schema with RLS for most domain data

  • Alternative considered: Schema-per-tenant for every service.
  • Decision: Shared schema + tenant_id column + Postgres RLS for iam, tenant, property, reservation, pricing, inventory, housekeeping, maintenance, staff, theme-config, notification, reporting, analytics, lock-integration, search-aggregation. Schema-per-tenant for billing and payment-gateway only.
  • Why we chose this: Operational simplicity (one migration set per service vs. one per tenant per service), lower cost (Postgres connection pooling stays sane, with no per-tenant connection sprawl), simpler analytics (aggregations across tenants for the platform team without proxying through every tenant DB), and preserved isolation via RLS + application context middleware + CI tests.
  • What we gave up: Maximal isolation. A bug that bypasses RLS exposes more than a bug that bypasses a schema boundary.
  • Mitigation: Two-tenant CI test suite that runs on every PR; mandatory RequestContext middleware; a lint rule that bans raw SQL; quarterly pen-test; schema-per-tenant for the two services where the financial blast radius justified the operational cost.
  • Cross-reference: ADR-0002 multi-tenancy model; R-MEL-001.
  • Watchpoint to revisit: any cross-tenant CI failure; any pen-test finding in tenant isolation; any single tenant exceeding 5% of total platform load (consider schema-per-tenant promotion path); regulatory mandate for stronger data segregation in a target market.

TR-MEL-02 — Electron over Tauri for the desktop backoffice

  • Alternative considered: Tauri (Rust + WebView), which offers a 30 MB bundle and a smaller memory footprint vs. Electron's 100 MB+ bundle.
  • Decision: Electron. Locked, with substitution requiring an explicit ADR and unanimous architecture-team approval.
  • Why we chose this: The lock vendors we must integrate (TTLock, Salto, Assa Abloy) ship Node bindings, not Rust crates. better-sqlite3 is a first-class Node ecosystem package; keytar for OS keychain is Node-native; ONNX Runtime Node is mature; electron-builder + electron-updater deliver one-click signed installers across Windows/macOS/Linux that hotel IT can deploy without an extra toolchain. Hiring profile in our target geographies favors JS/TS over Rust by an order of magnitude. Bundle size of 100 MB+ is irrelevant for a staff-installed line-of-business app that is downloaded once and updated incrementally.
  • What we gave up: Bundle size, memory footprint, and the security advantages of a smaller native attack surface.
  • Mitigation: Strict Electron security configuration (contextIsolation: true, nodeIntegration: false, narrow typed window.melmastoon surface via preload + contextBridge, CSP in renderer, periodic security audit); incremental auto-updates so the 100 MB ships only once.
  • Cross-reference: ADR-0003 Electron offline-first desktop; R-MEL-019, R-MEL-215.
  • Watchpoint to revisit: Tauri 2.x maturity in 2 years specifically around Node-binding interop; any lock vendor that ships Rust-first; any sustained operator complaint about install size on metered networks.

TR-MEL-03 — GCP-only, multi-cloud avoidance

  • Alternative considered: AWS-equivalent stack from day one or active multi-cloud (GCP + AWS).
  • Decision: GCP-only. Cloud Run + Cloud SQL + Pub/Sub + Memorystore + Vertex AI + Cloud Storage + KMS + Secret Manager + Cloud Logging/Monitoring/Trace.
  • Why we chose this: Faster delivery (one cloud's IAM, one set of IaC patterns), Vertex AI co-location matters for our AI-first thesis, and the cost competitiveness at our scale is real.
  • What we gave up: Cloud independence. We accepted vendor lock-in to GCP: a GCP pricing surprise, a GCP regional outage, or an Iran-availability change forces a port.
  • Mitigation: Hexagonal architecture — every infrastructure dependency is behind a port. A future port to AWS or Azure exercises new adapters, not domain rewrites. Plan B IaC kept current for a co-located fallback (esp. for Iran residency). Quarterly FinOps review.
  • Cross-reference: ADR-0001 §6; R-MEL-009, R-MEL-302, R-MEL-604.
  • Watchpoint to revisit: GCP regional outage > 4h; pricing change > 20% on a hot service; sanctions blocking GCP availability in a target market; an enterprise tenant whose contract forbids GCP.

TR-MEL-04 — Single AI gateway with provider routing

  • Alternative considered: Direct calls from each service to its preferred model provider.
  • Decision: Single ai-orchestrator-service as the only egress to Vertex AI or any external provider; ONNX Runtime on the desktop is the only edge inference allowed.
  • Why we chose this: Cost control (per-tenant budgets, per-feature quotas, prompt-hash caching live in one place), provenance (every AI artifact has { model, version, promptId, traceId, reviewedBy?, local } from the gateway), HITL governance (one place to enforce that irreversible AI actions go through a human), and vendor-portability (multi-provider routing without service-by-service refactor).
  • What we gave up: The simplicity of "just call the SDK". In exchange we accepted the latency overhead of a centralized hop and a central failure point.
  • Mitigation: Min instances ≥ 1 per region; warm-up endpoints; multi-region deployment from R2; per-feature circuit breakers; explicit "AI degraded" UX that hides AI affordances rather than fabricating output.
  • Cross-reference: docs/08-ai-architecture.md; R-MEL-013, R-MEL-503, R-MEL-506.
  • Watchpoint to revisit: sustained gateway latency p95 > 1.5 s; gateway availability < 99.9%; new feature where centralization is provably worse for cost or latency.
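
The provenance tuple named above, together with the rule that the domain refuses to persist an AI artifact without it (R-MEL-510), could be modelled along these lines. The field names are snake-cased versions of the ones in the text; the store, exception, and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Provenance:
    model: str
    version: str
    prompt_id: str
    trace_id: str
    local: bool                        # True when produced by edge inference
    reviewed_by: Optional[str] = None  # set once a human has reviewed it


class MissingProvenance(Exception):
    """Raised when an AI artifact arrives without provenance."""


def persist_artifact(store: List[Tuple[str, Provenance]],
                     content: str,
                     provenance: Optional[Provenance]) -> None:
    """Append the artifact to the store, refusing anything without provenance."""
    if provenance is None:
        raise MissingProvenance("AI artifact rejected: no provenance attached")
    store.append((content, provenance))
```

Putting the refusal in the persistence path rather than in each caller means a service that forgets to attach provenance fails loudly instead of creating an audit gap.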

TR-MEL-05 — Single React Native consumer mobile for browse + post-booking

  • Alternative considered: Two apps — a browse-and-book consumer app and a post-booking management app for guests during stay.
  • Decision: One React Native consumer app with feature-flag gating per tenant for in-stay management.
  • Why we chose this: Shared codebase, shared design tokens, shared auth, single store presence, lower acquisition cost.
  • What we gave up: Bundle size and complexity in a single binary; per-tenant store-presence customization is harder; the in-stay surface lives at the mercy of the consumer-app release cadence.
  • Mitigation: Feature flags per tenant; lazy-loaded modules per surface; in-stay UI behind a tab that does not affect cold-start.
  • Cross-reference: docs/frontend/01-web-and-mobile-specification.md.
  • Watchpoint to revisit: in-stay surface drives > 30% of app sessions and competes with browse for screen real estate; tenant requests for white-label mobile presence (R3 reseller program may force a split).

TR-MEL-06 — PostgreSQL as default datastore

  • Alternative considered: Polyglot persistence — Cassandra for write-heavy aggregates, ElasticSearch for search, dedicated vector DB.
  • Decision: PostgreSQL on Cloud SQL for OLTP, with pgvector for embeddings and Postgres GIN/GIST indexes for search where feasible. Firestore for sync cursors only. BigQuery for analytical sink. Cloud Storage for blobs.
  • Why we chose this: Operational simplicity; one set of backup, restore, IAM, RLS, migration patterns; team expertise; mature tooling; Cloud SQL HA managed.
  • What we gave up: Pure write throughput at scale; some workloads (vector search at very large scale, full-text search at scale) may force later movement to specialized stores.
  • Mitigation: Read replicas; per-aggregate index discipline documented in DATA_MODEL.md; pgvector partitioned per tenant; OpenSearch deferred unless evidence forces it.
  • Cross-reference: docs/06-data-models.md; R-MEL-011.
  • Watchpoint to revisit: vector index size > 50% of DB size on hot service; full-text query p95 > 300 ms after index tuning; tenant-cohort write throughput > 5k TPS sustained.

TR-MEL-07 — No GraphQL on BFFs (REST only)

  • Alternative considered: GraphQL gateway (Apollo or similar) at the BFF layer.
  • Decision: REST + BFF.
  • Why we chose this: Tooling familiarity, simpler edge cache (HTTP cache headers do real work), smaller dependency surface, easier observability (per-route metrics), and the surfaces have small enough response shapes that GraphQL's flexibility does not pay for itself.
  • What we gave up: Per-surface query flexibility; some BFF endpoints will be chatty for surfaces with deep relations.
  • Mitigation: Per-surface BFF resolvers can be added without touching domain services; GraphQL is not banned for internal exploration if a future surface (e.g., advanced reporting) has a strong fit.
  • Cross-reference: docs/05-api-design.md; ADR-0001 Alternatives table.
  • Watchpoint to revisit: any BFF endpoint averaging > 5 round-trips per page-load over a quarter; reporting surface in R3 demands flexible aggregation queries.

TR-MEL-08 — No native iOS / Android backoffice in Phase 1

  • Alternative considered: Native staff app on iOS and Android in parallel with the desktop.
  • Decision: Electron desktop only in R1. React Native consumer app does not carry staff workflows. A React Native staff sub-mode is deferred to R3.
  • Why we chose this: Cost. Two more codebases to ship, two more app-store cycles to maintain, two more security audits. The desktop covers the operational core; mobile is for the consumer.
  • What we gave up: Field-ops mobility — a housekeeper updating room status from the room itself, a maintenance technician from the basement.
  • Mitigation: The desktop UI is touch-friendly so a tablet works; the consumer app's offline cache covers the in-stay guest case; R3 plan includes a React Native staff sub-mode and a kiosk mode.
  • Cross-reference: docs/frontend/01-web-and-mobile-specification.md; R-MEL-403 (deferred).
  • Watchpoint to revisit: > 30% of housekeeping operators using the tablet form-factor over a quarter; R3 reseller channel demands a mobile staff app.

TR-MEL-09 — Single Electron desktop binary per tenant install

  • Alternative considered: Multi-tenant binary with chain-operator switcher in R1.
  • Decision: Single-tenant install in R1; chain multi-tenant switcher added in R2.
  • Why we chose this: Simpler ops in R1 (one device = one tenant = one keychain entry = one sync cursor); the chain-operator persona is a small fraction of R1 tenants.
  • What we gave up: Friction for chain operators in R1 — they install per-property.
  • Mitigation: Documented per-property install playbook; chain-switcher is an R2 commitment.
  • Cross-reference: docs/frontend/desktop/06-desktop-app-specification.md.
  • Watchpoint to revisit: > 10% of R1 tenants are chain operators; chain operator pilots start before R2.

TR-MEL-10 — Custom tenant booking flow config (declarative)

  • Alternative considered: Full WYSIWYG theme editor with arbitrary HTML/CSS per tenant.
  • Decision: Declarative configuration — token model + layout presets + content blocks. No arbitrary HTML.
  • Why we chose this: Cheaper to build; protects accessibility (presets are reviewed); protects performance (no tenant ships an unbounded asset); protects security (no tenant injects a script); covers the 90% of customization needs we observe in the target market.
  • What we gave up: The 10% of customization needs that require arbitrary markup. Some tenants will ask for "but my website has X".
  • Mitigation: Content-block library expands per quarter based on tenant requests; presets grow from 3 (R1) to 8+ (R2); R3 introduces an advanced "block authoring" surface for tenants who pass a vetting gate.
  • Cross-reference: docs/frontend/02-theming-and-tenant-config.md; R-MEL-208.
  • Watchpoint to revisit: > 20% of tenant onboarding requests blocked by a missing block; a tenant lost to a competitor on theming flexibility.

TR-MEL-11 — Cash-on-arrival as a first-class payment method

  • Alternative considered: Cash-on-arrival as a workaround under "manual" or "offline" payment.
  • Decision: Cash-on-arrival is a first-class method with a full reconciliation surface, drawer accounting, audit trail, and FX-aware folio handling.
  • Why we chose this: It is the dominant rail in our beachhead markets. Treating it as a workaround would mean treating our majority customer as an exception.
  • What we gave up: Accounting complexity in billing-service and payment-gateway-service; reconciliation features that competitors do not need to build.
  • Mitigation: Drawer-close enforced on logout; auto-close at midnight; EOD variance reporting; per-operator override audit.
  • Cross-reference: docs/10-payments-architecture.md; R-MEL-101, R-MEL-406.
  • Watchpoint to revisit: card share in target market crosses 50% of bookings; regulator mandates electronic-only payments.

TR-MEL-12 — Hexagonal architecture as a pre-paid escape hatch

  • Alternative considered: Direct framework / cloud calls in services for speed.
  • Decision: Hexagonal everywhere. Domain is framework-free. Every infra dependency is a port.
  • Why we chose this: Cheap insurance for cloud port (R-MEL-302), vendor swap (R-MEL-601, R-MEL-603), and AI provider swap (R-MEL-506). The cost is real but bounded; the option value across a 5-year horizon is large.
  • What we gave up: Some boilerplate; some code that "just works" with the framework's defaults must be plumbed through a port.
  • Mitigation: Service template enforces structure; review checklist flags direct framework use in domain.
  • Cross-reference: ADR-0001 §7; R-MEL-302, R-MEL-601, R-MEL-603.
  • Watchpoint to revisit: team reports hexagonal overhead > 10% of feature time on a service; ADR proposing exception.

5. Mitigation Catalog

These are the named, reusable mitigation patterns referenced from the risk register above. Each pattern is a one-page protocol with an owner. The full pattern docs live in docs/standards/mitigations/; this catalog is the index.

MIT-01 — RLS test pattern

For every PR touching SQL, repositories, or BFF route handlers: a two-tenant fixture is loaded; an authenticated request as tenant A hits every endpoint reachable from the BFF; the response is asserted to contain zero rows belonging to tenant B. The test harness fails the build on any cross-tenant leak. Maintained by the Platform Lead.

Used by: R-MEL-001, R-MEL-008.
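
The core assertion of the two-tenant test can be sketched independently of the database. The example below assumes each endpoint can be driven as a callable that returns rows carrying a `tenant_id` field; the real harness would issue authenticated HTTP requests against the BFF, and the helper names are hypothetical.

```python
from typing import Callable, Dict, List


def assert_no_cross_tenant_rows(rows: List[dict], caller_tenant_id: str) -> None:
    """Fail if any returned row belongs to a tenant other than the caller."""
    leaked = [r for r in rows if r["tenant_id"] != caller_tenant_id]
    if leaked:
        raise AssertionError(f"{len(leaked)} row(s) leaked from other tenants")


def run_isolation_check(endpoints: Dict[str, Callable[[str], List[dict]]],
                        caller_tenant_id: str) -> List[str]:
    """Call every endpoint as the given tenant and collect leak reports."""
    failures = []
    for name, call in endpoints.items():
        try:
            assert_no_cross_tenant_rows(call(caller_tenant_id), caller_tenant_id)
        except AssertionError as exc:
            failures.append(f"{name}: {exc}")
    return failures
```

A non-empty result from `run_isolation_check` would fail the build, in line with the pattern described above.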

MIT-02 — Sync conflict UX pattern

When a per-aggregate merge produces a conflict that requires operator decision: the desktop app surfaces a tray notification with a side-by-side diff; the operator sees "your version", "server version", and a "merged proposal"; the operator picks; a pre-merge backup is retained for 7 days. Conflicts that the merge engine resolves automatically are logged but not surfaced. The conflict UI is unified across aggregates so operators learn it once.

Used by: R-MEL-003.

MIT-03 — AI HITL gate pattern

Every AI action in the irreversible or guest-facing class flows through a single HITL component: the AI suggestion is rendered alongside its provenance; the user sees a "Try this?" CTA (not "Apply"); both accepting and rejecting write telemetry; accepted suggestions are persisted with provenance attached. Bypassing HITL requires an explicit per-tenant policy override and emits an audit event.

Used by: R-MEL-501, R-MEL-502, R-MEL-509.

MIT-04 — Encoder failure-fallback pattern

When the lock vendor adapter fails to issue a key during check-in: the desktop offers an immediate fallback path — issue a mechanical key and capture the lock event for later sync; the front-desk operator is shown the next-step protocol; the failure is queued for retry; an alert goes to the GM. Check-in is never blocked by encoder failure.
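
The fallback path can be sketched as a wrapper around the encoder call. This is a simplified, synchronous illustration: `issueKeyWithFallback` and `KeyIssueResult` are hypothetical names, and a real vendor adapter call would most likely be asynchronous.

```typescript
interface KeyIssueResult {
  keyType: "encoded" | "mechanical";
  retryQueued: boolean;  // failure queued for a later retry
  gmAlerted: boolean;    // alert sent to the GM
  checkInBlocked: false; // the type itself records that check-in is never blocked
}

// `encode` stands in for the lock vendor adapter call and throws on failure.
function issueKeyWithFallback(encode: () => void): KeyIssueResult {
  try {
    encode();
    return { keyType: "encoded", retryQueued: false, gmAlerted: false, checkInBlocked: false };
  } catch {
    // Encoder failed: fall back to a mechanical key, queue the retry, alert the GM.
    return { keyType: "mechanical", retryQueued: true, gmAlerted: true, checkInBlocked: false };
  }
}
```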

Used by: R-MEL-104, R-MEL-601.

MIT-05 — Override audit pattern

Every operator override (rate, folio adjustment, late checkout, manual key issuance) emits a <service>.override.applied.v1 event with a closed-list reason code, free-text justification, operator id, and traceparent. Aggregations roll up to the GM dashboard with per-operator and per-property trend lines. Overrides above a threshold require a 4-eyes approval.
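
A sketch of how such an event might be built. The reason codes, the `magnitude` and `fourEyesThreshold` fields, and the `buildOverrideEvent` helper are all hypothetical; the document fixes only the event name shape and the four payload fields.

```typescript
// Hypothetical closed list; the real reason codes are not enumerated in this document.
const REASON_CODES = ["guest_recovery", "manager_discretion", "system_error"] as const;
type ReasonCode = (typeof REASON_CODES)[number];

interface OverrideInput {
  service: string;
  reasonCode: string;
  justification: string;
  operatorId: string;
  traceparent: string;
  magnitude: number;         // assumption: overrides have a comparable size
  fourEyesThreshold: number; // assumption: threshold supplied by configuration
}

interface OverrideEvent {
  type: string;
  reasonCode: ReasonCode;
  justification: string;
  operatorId: string;
  traceparent: string;
  requiresFourEyes: boolean;
}

function buildOverrideEvent(input: OverrideInput): OverrideEvent {
  const known = REASON_CODES.find((code) => code === input.reasonCode);
  if (known === undefined) {
    throw new Error(`reason code not in closed list: ${input.reasonCode}`);
  }
  return {
    type: `${input.service}.override.applied.v1`,
    reasonCode: known,
    justification: input.justification,
    operatorId: input.operatorId,
    traceparent: input.traceparent,
    requiresFourEyes: input.magnitude > input.fourEyesThreshold,
  };
}
```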

Used by: R-MEL-102, R-MEL-203.

MIT-06 — Per-tenant AI budget pattern

Every tenant has a monthly AI budget per feature. At 80% the system soft-degrades (cheaper model, less context); at 100% it hard-stops (HITL only, no automated AI calls) until top-up or new period. Cost dashboard shows daily burn vs. budget per tenant per feature. The default-off setting on net-new AI features prevents quiet adoption.
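
The two thresholds can be sketched as a pure function over spend and budget. `BudgetMode` and `budgetMode` are hypothetical names, and treating a zero or negative budget as a hard stop is an assumption, not something the pattern specifies.

```typescript
type BudgetMode = "normal" | "soft-degraded" | "hard-stopped";

// Maps a tenant's month-to-date spend for one feature onto the pattern's three modes.
function budgetMode(spent: number, monthlyBudget: number): BudgetMode {
  if (monthlyBudget <= 0) {
    return "hard-stopped"; // assumption: no budget configured means no automated AI calls
  }
  // Integer-style comparisons avoid floating-point surprises at the exact thresholds.
  if (spent >= monthlyBudget) {
    return "hard-stopped"; // 100%: HITL only until top-up or new period
  }
  if (spent * 100 >= monthlyBudget * 80) {
    return "soft-degraded"; // 80%: cheaper model, less context
  }
  return "normal";
}
```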

Used by: R-MEL-503.

MIT-07 — Vendor adapter contract test pattern

Every adapter (lock, payment, AI provider, notification) has a contract test suite that runs nightly against the vendor's sandbox. Failures page the vendor owner. The contract test asserts every method we use; vendor-side breaking changes are detected before they hit production.

Used by: R-MEL-601, R-MEL-602, R-MEL-603, R-MEL-607, R-MEL-608.

MIT-08 — Offline grace-period warning pattern

The desktop app monitors the freshness of every cached capability that has a vendor-side time bound: token refresh window, TTLock dynamic-code window, key-credential horizon. At 75% and 90% of the configured window, an in-app banner warns the operator; at 90%, an email is sent to the GM. At 100%, the affected capability degrades gracefully (mechanical-key fallback, paper-folio capture, etc.) rather than silently failing.
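
The thresholds above can be sketched as a small classifier. `FreshnessLevel`, `freshnessLevel`, and `actionsFor` are hypothetical names, and treating a non-positive window as already expired is an assumption.

```typescript
type FreshnessLevel = "fresh" | "warn-75" | "warn-90" | "expired";

// Classifies how much of a vendor-side time window has elapsed.
function freshnessLevel(elapsedMs: number, windowMs: number): FreshnessLevel {
  if (windowMs <= 0 || elapsedMs >= windowMs) return "expired"; // assumption for windowMs <= 0
  if (elapsedMs * 100 >= windowMs * 90) return "warn-90";
  if (elapsedMs * 100 >= windowMs * 75) return "warn-75";
  return "fresh";
}

interface FreshnessActions {
  banner: boolean;  // in-app banner for the operator
  emailGm: boolean; // email to the GM
  degrade: boolean; // switch the capability to its fallback path
}

function actionsFor(level: FreshnessLevel): FreshnessActions {
  switch (level) {
    case "warn-75":
      return { banner: true, emailGm: false, degrade: false };
    case "warn-90":
      return { banner: true, emailGm: true, degrade: false };
    case "expired":
      return { banner: false, emailGm: false, degrade: true };
    default:
      return { banner: false, emailGm: false, degrade: false };
  }
}
```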

Used by: R-MEL-103.

MIT-09 — Provenance enforcement pattern

The domain layer of every service that consumes AI output refuses to persist an AI artifact without a complete { model, version, promptId, traceId, reviewedBy?, reviewedAt?, local } provenance object. The TypeScript type system enforces this at compile time; the database constraint enforces it at write time. The export path always includes provenance.
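
A minimal sketch of the compile-time and runtime halves of this check. The field names come from the provenance object above; the field types, the `isCompleteProvenance` guard, and the `persistAiArtifact` helper are assumptions made for illustration, and the array stands in for the real store.

```typescript
// Field names are from the pattern; the types are assumed.
interface Provenance {
  model: string;
  version: string;
  promptId: string;
  traceId: string;
  reviewedBy?: string;
  reviewedAt?: string;
  local: boolean;
}

// Runtime guard for data that arrives untyped (for example, from a queue or an import).
function isCompleteProvenance(value: unknown): value is Provenance {
  if (typeof value !== "object" || value === null) return false;
  const p = value as Record<string, unknown>;
  const nonEmpty = (v: unknown): boolean => typeof v === "string" && v.length > 0;
  const optional = (v: unknown): boolean => v === undefined || nonEmpty(v);
  return (
    nonEmpty(p["model"]) &&
    nonEmpty(p["version"]) &&
    nonEmpty(p["promptId"]) &&
    nonEmpty(p["traceId"]) &&
    typeof p["local"] === "boolean" &&
    optional(p["reviewedBy"]) &&
    optional(p["reviewedAt"])
  );
}

// Refuses to persist an artifact whose provenance is missing or incomplete.
function persistAiArtifact(artifact: { body: string; provenance: unknown }, store: unknown[]): void {
  if (!isCompleteProvenance(artifact.provenance)) {
    throw new Error("AI artifact rejected: incomplete provenance");
  }
  store.push(artifact);
}
```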

Used by: R-MEL-510.

MIT-10 — Lost-device protocol

When a device is reported lost: the device key is revoked server-side; the next sync attempt from that device fails; the SQLCipher store is unreadable on a different machine because its encryption key is held in an OS keychain bound to the original machine; remote-wipe is queued and executes on first reconnect; re-pair on a new device requires GM out-of-band approval. The protocol is documented in the operator playbook and the security runbook.

Used by: R-MEL-202.

MIT-11 — Two-tenant CI fixture

Every service ships a two-tenant fixture (tenant A and tenant B with overlapping data shapes) used by the test harness for isolation, sync, and event tests. Fixtures are seeded into the CI Postgres and Firestore emulators per test run.

Used by: R-MEL-001, MIT-01.

MIT-12 — DR drill cadence

Every quarter, an SRE-led DR drill restores Cloud SQL from PITR into a sibling project, replays Pub/Sub from a known cursor, validates outbox idempotency, and measures RTO + RPO. Results are recorded; deltas trigger remediation.

Used by: R-MEL-009, R-MEL-010, R-MEL-109.

MIT-13 — Per-jurisdiction adapter pattern

Regulatory surfaces (guest registration, KYC, tax) are exposed as ports. Each jurisdiction has an adapter. New mandate = new adapter; no schema change, no domain rewrite. Per-tenant config selects the adapter and the field set.
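
A sketch of the port-and-adapter shape for guest registration. Everything named here is hypothetical: the `GuestRegistrationPort` interface, the placeholder jurisdiction codes `XA` and `XB`, and the field lists are illustrative, not actual mandates.

```typescript
// The port: the only surface the domain layer sees.
interface GuestRegistrationPort {
  readonly jurisdiction: string;
  missingFields(guest: Record<string, string | undefined>): string[];
}

// One simple way to build an adapter from a jurisdiction's required field set.
function fieldSetAdapter(jurisdiction: string, required: readonly string[]): GuestRegistrationPort {
  return {
    jurisdiction,
    missingFields: (guest) => required.filter((field) => !guest[field]),
  };
}

// New mandate = new entry here; the domain and schema stay untouched.
const adapters = new Map<string, GuestRegistrationPort>([
  ["XA", fieldSetAdapter("XA", ["fullName", "passportNumber"])],
  ["XB", fieldSetAdapter("XB", ["fullName", "nationalId", "arrivalDate"])],
]);

// Per-tenant config selects the adapter.
function adapterForTenant(config: { jurisdiction: string }): GuestRegistrationPort {
  const adapter = adapters.get(config.jurisdiction);
  if (adapter === undefined) {
    throw new Error(`no guest-registration adapter for ${config.jurisdiction}`);
  }
  return adapter;
}
```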

Used by: R-MEL-301, R-MEL-303.

MIT-14 — Phishing-resistant auth

WebAuthn / passkey is the default second factor; TOTP is fallback. Device binding adds a per-device cryptographic identity. Suspicious-login telemetry surfaces in the GM dashboard with one-click revoke.

Used by: R-MEL-201, R-MEL-206.

MIT-15 — Outbox + idempotency

Every state-changing API call carries an idempotency key derived from {device_id, local_aggregate_version, mutation_seq} (desktop) or {client_id, request_id} (web). The server-side outbox table dedupes; replays on retry are no-ops. Idempotency keys live in Postgres, not Redis (R-MEL-014).
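
A minimal sketch of key derivation and dedupe. The `MutationOrigin` type, `idempotencyKey`, and `OutboxSketch` are hypothetical; the in-memory `Set` stands in for the Postgres outbox table, and serialising the fields with `JSON.stringify` is just one unambiguous way to derive a key.

```typescript
type MutationOrigin =
  | { kind: "desktop"; deviceId: string; localAggregateVersion: number; mutationSeq: number }
  | { kind: "web"; clientId: string; requestId: string };

// Derives a deterministic key; JSON array encoding avoids delimiter collisions.
function idempotencyKey(origin: MutationOrigin): string {
  return origin.kind === "desktop"
    ? JSON.stringify(["desktop", origin.deviceId, origin.localAggregateVersion, origin.mutationSeq])
    : JSON.stringify(["web", origin.clientId, origin.requestId]);
}

// In-memory stand-in for the server-side outbox table (assumption: Postgres in production).
class OutboxSketch {
  private readonly seen = new Set<string>();

  // Runs the mutation once per key; a replayed request is a no-op.
  apply(origin: MutationOrigin, mutate: () => void): "applied" | "duplicate" {
    const key = idempotencyKey(origin);
    if (this.seen.has(key)) return "duplicate";
    mutate();
    this.seen.add(key);
    return "applied";
  }
}
```

Because the key is recorded only after `mutate` returns, a mutation that throws can be retried with the same key.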

Used by: R-MEL-002, R-MEL-014.


6. Risk Review Cadence & Governance

6.1 Cadence

| Cadence | Scope | Owner | Output |
| --- | --- | --- | --- |
| Per-PR | Service-level risks touched by the change | Service owner | PR description references affected risk IDs; CI runs targeted mitigations (RLS test, contract test) |
| Per-release pre-flight | Every risk with score ≥ 7 | Release Captain | Go / hold decision per release; mitigation status snapshot |
| Monthly | Platform-wide; every open risk | Platform Lead | Status update per row; new risks from incidents promoted into register |
| Quarterly | Full register + every tradeoff | CTO + Security Lead + Compliance | Re-justify each "accepted" item; close obsolete; create ADRs for changed decisions |
| Per-incident | Risks the incident materialized | Incident Commander | Postmortem identifies which risks fired; proposes new ones; updates mitigations |
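
The per-release pre-flight scope (every risk with score ≥ 7) and its mitigation status snapshot can be sketched as below. The `RiskEntry` shape and the three status values are hypothetical, and the sketch deliberately stops at the snapshot: the go / hold decision stays with the Release Captain.

```typescript
// Hypothetical status values; the register's real status vocabulary may differ.
type MitigationStatus = "in-place" | "in-progress" | "missing";

interface RiskEntry {
  id: string;
  score: number;
  mitigationStatus: MitigationStatus;
}

const PREFLIGHT_MIN_SCORE = 7;

interface PreflightSnapshot {
  inScope: string[];
  byStatus: Record<MitigationStatus, number>;
}

// Selects the risks the pre-flight must review and counts them by mitigation status.
function preflightSnapshot(risks: RiskEntry[]): PreflightSnapshot {
  const byStatus: Record<MitigationStatus, number> = { "in-place": 0, "in-progress": 0, missing: 0 };
  const inScope: string[] = [];
  for (const risk of risks) {
    if (risk.score >= PREFLIGHT_MIN_SCORE) {
      inScope.push(risk.id);
      byStatus[risk.mitigationStatus] += 1;
    }
  }
  return { inScope, byStatus };
}
```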

6.2 Roles and ownership

  • Platform Lead — owner of register; final say on risk scoring; chairs monthly review.
  • Security Lead — owner of all security-category risks; chairs security pen-test cadence.
  • AI Lead — owner of AI-category risks; chairs eval cadence and provider review.
  • SRE Lead — owner of operational and infra risks; chairs DR drill cadence.
  • Compliance Lead — owner of regulatory risks; chairs jurisdictional review with local counsel.
  • Service owner — owner of service-level risks; reflects platform-level risks into SERVICE_RISK_REGISTER.md.
  • Release Captain — rotating role; runs the per-release pre-flight against this register.

6.3 ADR creation policy

A tradeoff is changed (a TR-MEL-xx entry is reversed, narrowed, or replaced) only via an ADR. The ADR cites the previous tradeoff, the trigger, the new decision, the new mitigation, and the watchpoint that would force a future revisit. The risk register is updated in the same PR.

6.4 Postmortem feedback loop

Every postmortem template includes:

  • Which R-MEL-xxx risks materialized?
  • Which mitigations were ineffective?
  • What new risks does this incident reveal?
  • Which TR-MEL-xx tradeoffs are implicated?

The answers are merged into this register in the same week as the postmortem. The register is a living document.

6.5 Why this matters

A risk register that is only used as theatre is worse than no register: it lulls the team into thinking the risks are managed. The cadence above makes the register real. Every entry is owned, dated, and re-examined. Tradeoffs are not "design intuitions" — they are written down with their alternatives, their rationales, and the watchpoints that would force us to reverse them. When someone asks "why did we do X instead of Y?", the answer is in this document.


7. Cross-References

This document supersedes any prior risk discussion in service bundles; service-level SERVICE_RISK_REGISTER.md files inherit and extend the entries here, never contradict them.