12 — Risks & Tradeoffs
Companion: 02 Enterprise Architecture · 07 Security, Compliance & Tenancy · 09 Lock & Key Integration · 10 Payments Architecture · ADR-0001 Core stack · ADR-0003 Electron offline-first desktop
This document is the platform-wide register of identified risks and deliberate tradeoffs for Ghasi Melmastoon. It is the single source of truth that every service-level SERVICE_RISK_REGISTER.md, every ADR, every release readiness gate, and every quarterly governance review reads against. Per-service risks may add detail; they may not contradict the entries here.
The shape of the document is deliberately operational rather than narrative. We assume the reader is an engineer, SRE, security reviewer, finance reviewer, or platform-product owner deciding whether to ship, hold, or escalate. Risks are scored, owned, and dated; tradeoffs are explicit, justified, and bound to mitigation work.
1. Purpose
Ghasi Melmastoon operates at the intersection of multi-tenant SaaS, event-driven microservices on GCP, AI-assisted operations, payment rails in low-trust markets, lock hardware in long-tail vendor configurations, and an offline-first Electron desktop application that runs daily hotel operations even when the internet does not. The combination of these surfaces produces a non-trivial risk surface that we cannot afford to discover only after launch.
The purpose of this register is to:
- Catalogue every known risk across technical, operational, security, regulatory, market, AI, vendor, and people-and-process categories — at least 60 entries — with consistent scoring, ownership, mitigation, and watchpoint triggers.
- Make every accepted tradeoff explicit so that future engineers, auditors, partners, and acquirers can find the rationale where it should be: in the spec, not in someone's head.
- Bind risks to mitigations and to release gates so the readiness checklists in `docs/roadmap/` and the per-service `SERVICE_READINESS.md` files can deny a release that drifts from the agreed posture.
- Drive governance cadence: monthly platform review, per-release pre-flight, per-incident postmortem feedback into this register, and ADR creation when a tradeoff changes.
A risk that is not in this register but is operationally relevant is a defect in this document. Adding a risk costs nothing; missing one costs incidents.
2. Risk Scoring Model
We use a simple two-axis qualitative scoring model. Both axes are scored 1–5 and multiplied for a composite score in the range 1–25. We pick a qualitative model deliberately: quantitative risk modelling at our scale (early SMB pilots, regional rollout) over-fits to noise. The cadence of review compensates for the coarseness of the scale.
2.1 Likelihood (L)
| L | Label | Approximate frequency |
|---|---|---|
| 1 | Rare | Plausible at most once across the platform's lifetime; requires an unlikely combination of failures |
| 2 | Unlikely | Possible once per year per tenant or per service; precedent in similar systems but not in ours |
| 3 | Possible | Expected to occur multiple times per year across the fleet |
| 4 | Likely | Expected to occur monthly across the fleet, or quarterly per tenant |
| 5 | Almost certain | Expected weekly across the fleet, or monthly per tenant |
2.2 Impact (I)
| I | Label | Operational meaning |
|---|---|---|
| 1 | Negligible | Cosmetic; no operator workaround needed; no SLO breach |
| 2 | Minor | Operator workaround exists; one-tenant impact; below SLO threshold |
| 3 | Moderate | Multi-tenant impact, single-region; partial SLO breach; recoverable within shift |
| 4 | Major | Significant data correctness, financial, security, or reputational impact; multi-day recovery |
| 5 | Severe | Existential — sustained data leakage, mass financial loss, regulator escalation, multi-week outage, brand-defining incident |
2.3 Score thresholds
| Score (L × I) | Posture | Required action |
|---|---|---|
| 1–6 | Monitor | Owner records baseline; review during quarterly cadence; no action required unless trigger fires |
| 7–14 | Mitigate | Mitigation must be in flight or scheduled within the next release; owner reports status monthly |
| 15–25 | Escalate | Mitigation must be in production before next release; CTO and Security lead review required; ADR or runbook required |
A risk that crosses a threshold upward (e.g., Score moves from 12 to 16) auto-escalates. A risk that crosses downward (mitigation lands and verification holds for two consecutive review cycles) may be downgraded.
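The scoring rules in 2.1 to 2.3 are small enough to express as code. The following is a minimal sketch, not the platform's implementation; the function names (`compositeScore`, `postureFor`, `crossesUpward`) are hypothetical and only the 1–5 axes and the score bands come from this document.

```typescript
// Hypothetical helper names; thresholds follow the table in 2.3.
type Posture = "Monitor" | "Mitigate" | "Escalate";

function checkAxis(name: string, value: number): void {
  if (!Number.isInteger(value) || value < 1 || value > 5) {
    throw new RangeError(`${name} must be an integer from 1 to 5, got ${value}`);
  }
}

// Composite score = L x I, in the range 1 to 25.
function compositeScore(likelihood: number, impact: number): number {
  checkAxis("likelihood", likelihood);
  checkAxis("impact", impact);
  return likelihood * impact;
}

function postureFor(score: number): Posture {
  if (score >= 15) return "Escalate";
  if (score >= 7) return "Mitigate";
  return "Monitor";
}

// True when a re-score moves a risk into a stricter posture band.
function crossesUpward(oldScore: number, newScore: number): boolean {
  const rank: Record<Posture, number> = { Monitor: 0, Mitigate: 1, Escalate: 2 };
  return rank[postureFor(newScore)] > rank[postureFor(oldScore)];
}
```

For example, a risk scored L = 3, I = 5 has a composite score of 15 and lands in the Escalate band, and a move from 12 to 16 is reported as an upward crossing.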
2.4 Status lifecycle
open → mitigated → accepted → closed
↘ deferred ↗
- open — identified, mitigation not yet effective.
- mitigated — mitigation effective; residual risk acknowledged; ongoing monitoring.
- accepted — explicitly accepted by CTO + business owner; written justification required; revisited each quarter.
- deferred — mitigation planned for a named future release; rationale recorded.
- closed — risk no longer applies (e.g., capability removed, vendor replaced, regulation withdrawn).
2.5 Review cadence
- Per-PR — service-level risks attached to changes touching their domain.
- Per-release pre-flight — every risk with score ≥ 7 reviewed against release scope.
- Monthly — platform-wide review of every open risk; status update per row.
- Quarterly — full register review; tradeoffs revisited; any accepted item re-justified.
- Per-incident — postmortems must declare which existing risks materialized and propose new ones.
3. Risk Register
The full register lives below. Each risk has a stable ID R-MEL-NNN. IDs are never reused; closed risks remain in the register for archaeology.
3.1 Technical risks — narrative for the top 3
R-MEL-001 — Multi-tenant data leakage from missing RLS guard
A new endpoint or new query that omits the tenant_id predicate, or a Postgres role that is granted broader access than intended, can expose tenant A's data to tenant B. This is the single largest existential risk for a multi-tenant SaaS. The blast radius is regulatory (GDPR-class breach notifications), commercial (chain operators terminate), and reputational (recovery measured in years). Postgres Row-Level Security (RLS) is our primary defense; the secondary defense is the application-layer RequestContext middleware that injects tenant_id into every repository call.
Mitigation stack:
- RLS policies on every table that holds tenant data (`USING (tenant_id = current_setting('app.tenant_id')::uuid)`), including read-only projections and analytics views. RLS is enabled with `ALTER TABLE … ENABLE ROW LEVEL SECURITY` and `FORCE ROW LEVEL SECURITY` so even table owners cannot bypass it.
- CI test pattern (`docs/standards/RLS_TEST_PATTERN.md`): every PR touching SQL or repositories runs a two-tenant fixture test that authenticates as tenant A and asserts a 0-row result for tenant B's records on every endpoint reachable from the BFF. The test harness fails the build on any cross-tenant leak.
- No raw SQL outside the repository layer. Drizzle (or `pg` in hot paths) is the only allowed access; raw SQL strings are linted out.
- Quarterly external pen-test with a tenant-isolation category.
- `search-aggregation-service` is the only service permitted to query across tenants, and only against a denormalized read model with explicit aggregation predicates. All other services treat cross-tenant queries as a defect class.
Watchpoint trigger: any cross-tenant test failure in CI; any pen-test finding in the tenant-isolation category; any production query log showing repository calls without tenant_id predicate.
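To illustrate the application-layer half of this defense and the two-tenant test idea, here is a minimal in-memory sketch. It is an assumption-laden stand-in: `makeRepository` and `leakedIds` are hypothetical names, and a real test would run against Postgres with RLS enabled rather than an array.

```typescript
interface RequestContext { tenantId: string }
interface Row { id: string; tenantId: string }

// Hypothetical repository factory: every read goes through the tenant predicate,
// and there is deliberately no unscoped read method.
function makeRepository<T extends Row>(rows: readonly T[]) {
  const list = (ctx: RequestContext): T[] => {
    if (!ctx.tenantId) throw new Error("request context has no tenant_id");
    return rows.filter((r) => r.tenantId === ctx.tenantId);
  };
  const findById = (ctx: RequestContext, id: string): T | undefined =>
    list(ctx).find((r) => r.id === id);
  return { list, findById };
}

// Two-tenant check: acting as one tenant, none of the other tenant's ids
// may be reachable. A non-empty result should fail the build.
function leakedIds(
  repo: { findById(ctx: RequestContext, id: string): unknown },
  actingAs: RequestContext,
  foreignIds: readonly string[],
): string[] {
  return foreignIds.filter((id) => repo.findById(actingAs, id) !== undefined);
}
```

The design choice worth noting is that the context is a required argument rather than ambient state, so a query that forgets the tenant cannot be written at all.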
R-MEL-002 — Saga compensation gaps
The booking saga (Booking → Inventory hold → Payment authorize → Reservation confirm → Key issuance → Notification) is multi-step, multi-service, and crosses the offline boundary on the desktop. A compensation that is incorrect, missing, or non-idempotent leaves the platform in a state where inventory says a room is sold but billing has no folio, or payment is captured but the reservation is cancelled. The financial and trust damage is direct.
Mitigation stack:
- Each saga is an explicit state machine in `reservation-service` with named steps, named compensations, deterministic transitions, and a persisted journal. No implicit "if this fails, undo the previous step" logic.
- Compensations are idempotent and outbox-driven: every compensation is replayable; idempotency keys are derived from `{saga_id, step}`.
- Saga inspector UI in the control plane lets platform staff see in-flight sagas, trigger replay, or force-close with audit trail.
- Chaos tests inject failures at every step boundary in CI.
- Provisional-state UX on the desktop: any folio, key credential, or reservation in a transitional state is labelled with a sync-pending badge; operators cannot act on provisional state in irreversible ways.
Watchpoint trigger: any saga journal entry stuck for > 30 minutes without progress; any payment captured without a corresponding folio entry within 5 minutes; any reservation in pending longer than the configured policy window.
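The shape of an explicit saga with named compensations and a journal can be sketched as follows. This is a simplified, synchronous, in-memory illustration under stated assumptions: `SagaStep`, `runSaga`, and `compensate` are hypothetical names, and the real journal would be persisted and outbox-driven rather than an array.

```typescript
interface SagaStep {
  name: string;
  run: () => void;        // throws on failure
  compensate: () => void; // expected to be idempotent
}

interface JournalEntry {
  key: string;
  event: "done" | "failed" | "compensated";
}

// Idempotency key derived from {saga_id, step}.
function idempotencyKey(sagaId: string, step: string): string {
  return `${sagaId}:${step}`;
}

// Safe to call again after a crash: already-journaled compensations are skipped.
function compensate(sagaId: string, completed: readonly SagaStep[], journal: JournalEntry[]): void {
  for (const step of [...completed].reverse()) {
    const key = idempotencyKey(sagaId, step.name);
    const already = journal.some((e) => e.key === key && e.event === "compensated");
    if (already) continue;
    step.compensate();
    journal.push({ key, event: "compensated" });
  }
}

function runSaga(sagaId: string, steps: readonly SagaStep[], journal: JournalEntry[]): "confirmed" | "compensated" {
  const completed: SagaStep[] = [];
  for (const step of steps) {
    try {
      step.run();
      completed.push(step);
      journal.push({ key: idempotencyKey(sagaId, step.name), event: "done" });
    } catch {
      journal.push({ key: idempotencyKey(sagaId, step.name), event: "failed" });
      compensate(sagaId, completed, journal); // undo in reverse order
      return "compensated";
    }
  }
  return "confirmed";
}
```

Compensations run in reverse order of completion, and replaying `compensate` against the same journal does nothing, which is the property the chaos tests would need to assert at each step boundary.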
R-MEL-003 — Sync conflict bugs corrupting offline data
The Electron desktop app holds a 60-day window of operational data in SQLite via better-sqlite3. When a property runs offline for hours or days (or a chain operator works from multiple devices), the sync engine's per-aggregate conflict resolution decides which version of a folio, room status, housekeeping task, or key credential wins. A bug in conflict resolution can silently overwrite a valid local mutation with stale server state, or vice versa. The damage manifests as "the cash drawer count was right yesterday and is wrong today" — hard to detect, hard to recover, trust-destroying.
Mitigation stack:
- Per-aggregate conflict policy declared in `services/<svc>/SYNC_CONTRACT.md`, never global last-write-wins. Monetary state (folios, payments) requires server-authoritative resolution; operational state (room status, housekeeping) uses domain-specific merge rules; reference data (rate plans) uses LWW.
- Outbox + idempotency keys on every mutation, derived from `{device_id, local_aggregate_version, mutation_seq}`, so replay is safe.
- Property-based tests in CI that generate random divergent histories and assert merge convergence.
- Conflict log table persisted server-side; conflicts requiring operator decision surface as a tray notification on the desktop with a side-by-side diff view and a "restore my version" affordance.
- Pre-merge backups of the local aggregate are kept for 7 days locally so an operator can recover their version if the merge UI was misused.
Watchpoint trigger: sync conflict rate > 5 per 1k mutations per tenant per week; any operator complaint of "lost work" matched to a sync window; any property-based test failure in the sync harness.
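A per-aggregate policy table, as opposed to global last-write-wins, might look like the sketch below. The aggregate names follow the examples in the mitigation list; the type names, the `resolveConflict` signature, and the tie-breaking rule (server wins on equal timestamps) are assumptions for illustration.

```typescript
type AggregateKind = "folio" | "payment" | "room_status" | "housekeeping_task" | "rate_plan";
type Policy = "server_authoritative" | "domain_merge" | "last_write_wins";

const POLICY_BY_KIND: Record<AggregateKind, Policy> = {
  folio: "server_authoritative",
  payment: "server_authoritative",
  room_status: "domain_merge",
  housekeeping_task: "domain_merge",
  rate_plan: "last_write_wins",
};

interface Versioned<T> { value: T; updatedAt: number } // epoch milliseconds

// Idempotency key derived from {device_id, local_aggregate_version, mutation_seq}.
function mutationKey(deviceId: string, localAggregateVersion: number, mutationSeq: number): string {
  return `${deviceId}:${localAggregateVersion}:${mutationSeq}`;
}

function resolveConflict<T>(
  kind: AggregateKind,
  local: Versioned<T>,
  server: Versioned<T>,
  merge?: (localValue: T, serverValue: T) => T,
): T {
  const policy = POLICY_BY_KIND[kind];
  if (policy === "server_authoritative") return server.value;
  if (policy === "last_write_wins") {
    // Assumption: the server wins ties.
    return local.updatedAt > server.updatedAt ? local.value : server.value;
  }
  if (!merge) throw new Error(`no merge rule registered for ${kind}`);
  return merge(local.value, server.value);
}
```

Because the policy is looked up by aggregate kind, a newer local folio edit still loses to the server, while a newer local rate-plan edit wins.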
3.2 Technical risks — full table
| ID | Description | L | I | Score | Mitigation | Owner | Status | Watchpoint |
|---|---|---|---|---|---|---|---|---|
| R-MEL-001 | Multi-tenant data leakage from missing RLS guard | 3 | 5 | 15 | RLS on every table; two-tenant CI suite; quarterly pen-test; no raw SQL | Platform Lead | open | Cross-tenant test fails in CI |
| R-MEL-002 | Saga compensation gap leaves inconsistent state | 3 | 5 | 15 | Explicit saga state machine; idempotent compensations; chaos tests; saga inspector | Reservation owner | open | Saga stuck > 30 min |
| R-MEL-003 | Sync conflict bug corrupts offline data | 3 | 5 | 15 | Per-aggregate policy; idempotency; pre-merge backup; conflict UI | Sync Lead | open | Conflict rate > 5/1k/week |
| R-MEL-004 | ONNX model integrity tampering on desktop | 2 | 4 | 8 | Signed model artifacts; signature verified on load; key rotation; tamper telemetry | AI Lead | mitigated | Verification failure on app start |
| R-MEL-005 | Electron auto-update signature breach (rogue update) | 2 | 5 | 10 | electron-updater with code-signing certificate; staged rollout; rollback channel; cert pinning | Desktop Lead | mitigated | Unsigned update event |
| R-MEL-006 | SQLite/SQLCipher key loss on device wipe | 3 | 3 | 9 | Key in OS keychain via keytar; recovery via re-pair to tenant; documented re-pair flow | Desktop Lead | mitigated | Re-pair count anomaly |
| R-MEL-007 | Pub/Sub backlog after downstream outage causes consumer overload | 3 | 4 | 12 | Subscriber concurrency caps; dead-letter topics; per-topic ack deadline; outbox replay tools | SRE | open | Backlog age > 1h on any topic |
| R-MEL-008 | Cross-tenant query in search-aggregation surface | 2 | 5 | 10 | Single explicit service; denormalized read model; cross-tenant queries forbidden elsewhere | Platform Lead | mitigated | New service requesting cross-tenant read |
| R-MEL-009 | GCP region outage (asia-south1) | 2 | 5 | 10 | Multi-region from R2 (asia-south1 + asia-southeast1); read replicas; static site CDN | SRE | deferred | Region SLA breach |
| R-MEL-010 | Cloud SQL HA failover gap (sub-minute write loss) | 3 | 3 | 9 | HA primary + standby; PITR; outbox tolerates failover; chaos drill quarterly | SRE | mitigated | Failover RTO > 60s |
| R-MEL-011 | pgvector index size growth degrades writes on hot DB | 3 | 3 | 9 | Move embeddings to a separate *-vector schema; HNSW index discipline; nightly REINDEX window | AI Lead | open | Vector table > 50% of DB size |
| R-MEL-012 | BigQuery cost runaway from analytics queries | 3 | 3 | 9 | Slot reservation; per-tenant query budget; partition pruning enforced; BI reviewer rota | Finance + SRE | open | Daily slot-hour > budget |
| R-MEL-013 | Cold start on Cloud Run for ai-gateway (latency spike) | 4 | 2 | 8 | Min instances ≥ 1; CPU-always-allocated for AI surfaces; warm-up endpoint hit per region | AI Lead | mitigated | First-token p95 > 1.5s |
| R-MEL-014 | Memorystore eviction loses idempotency keys mid-flow | 2 | 4 | 8 | Idempotency keys persisted to Postgres; Redis is cache only; TTL > saga horizon | Platform Lead | mitigated | Duplicate mutation accepted |
| R-MEL-015 | Drizzle migration applied out of order across services | 2 | 4 | 8 | Per-service migrations; CI verifies linear order; production migrate gated on review | Platform Lead | open | Out-of-order migration in CI |
| R-MEL-016 | Outbox table growth pressures hot transactions | 3 | 3 | 9 | Outbox archival job; partition by month; dead-letter compaction | SRE | open | Outbox > 1M unprocessed rows |
| R-MEL-017 | better-sqlite3 native build fails on operator OS variant | 3 | 2 | 6 | Pre-built binaries for Win/macOS/Linux x64+arm64; install diagnostics; fallback installer | Desktop Lead | mitigated | Install failure rate > 1% |
| R-MEL-018 | Pub/Sub message ordering not preserved across partitions | 3 | 3 | 9 | Use ordering key on aggregate id where order matters; document where order is irrelevant | Platform Lead | mitigated | Out-of-order event in saga |
| R-MEL-019 | Electron preload security boundary leak | 2 | 5 | 10 | contextIsolation true; nodeIntegration false; typed window.melmastoon; CSP; review checklist | Desktop Lead | mitigated | New API exposed without review |
| R-MEL-020 | Drift between OpenAPI spec and BFF implementation | 3 | 2 | 6 | Generated client from spec; CI contract test fails on drift | Platform Lead | mitigated | Contract test fail in CI |
3.3 Operational risks — narrative for the top 3
R-MEL-101 — Front-desk operator forgets cash-drawer close
In cash-heavy markets, the end-of-day cash-drawer reconciliation is the financial truth. An operator who forgets to close the drawer (shift change, surprise checkout rush, power loss mid-flow) leaves the next shift inheriting an unbounded variance and the property's daily revenue ledger broken until manually reconstructed. Existing PMS tools either silently allow it or produce alerts the operator dismisses.
Mitigation stack:
- Forced close on logout — the desktop app blocks operator logout while the drawer is open with a clear "close drawer" CTA and a count helper.
- Auto-close at midnight tenant-local with a "needs review" status on the next-day open; the morning operator must reconcile before processing new payments.
- Anomaly callout if the drawer is left open > 8 hours.
- EOD report flags every drawer that was auto-closed vs. operator-closed.
- Training playbook in the operator onboarding doc with a one-page laminated quick-card.
Watchpoint trigger: > 5% of drawers auto-closed at midnight per tenant per week; > 1 day with EOD variance > 2× tenant baseline.
R-MEL-102 — Manual override abuse without audit trail
Hotel staff need manual overrides — the rate plan does not match the walk-in's negotiated rate, the lock failed and a master key was issued, the folio was adjusted to compensate a complaint. Without an audit trail, manual overrides become the route for fraud, kickbacks, and silent revenue leakage. Without a usable audit trail, the audit is theater.
Mitigation stack:
- Every override emits an event (`<service>.override.applied.v1`) with reason code, free-text justification, operator id, and traceparent.
- Override reasons are a closed list per surface (no free-text-only reasons); free-text justification is required and retained.
- Override reports in the GM dashboard with per-operator and per-property aggregations and trend lines.
- Threshold alerts to the owner persona when an operator's override rate exceeds 2× the property baseline.
- Override types touching money require a second-operator approval on the desktop (4-eyes gate).
Watchpoint trigger: any operator with override rate > 2× baseline; any 4-eyes gate bypassed; any override with reason "other" > 5% of total.
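The override event contract and the 2× baseline alert can be sketched as below. The field names, the `validateOverride` helper, and the event-type pattern are assumptions; only the event fields listed above and the 2× threshold come from this document.

```typescript
interface OverrideEvent {
  type: string;          // "<service>.override.applied.v1"
  reasonCode: string;    // must come from the closed list for the surface
  justification: string; // free text, required
  operatorId: string;
  traceparent: string;
}

// Returns a list of problems; an empty list means the event is acceptable.
function validateOverride(event: OverrideEvent, allowedReasons: ReadonlySet<string>): string[] {
  const problems: string[] = [];
  if (!/^[a-z][a-z0-9-]*\.override\.applied\.v1$/.test(event.type)) {
    problems.push("unexpected event type");
  }
  if (!allowedReasons.has(event.reasonCode)) problems.push("reason code not in closed list");
  if (event.justification.trim() === "") problems.push("justification is required");
  if (event.operatorId === "") problems.push("operator id is required");
  return problems;
}

// Threshold alert: operator override rate above factor x property baseline.
function exceedsBaseline(operatorRate: number, propertyBaseline: number, factor = 2): boolean {
  return operatorRate > factor * propertyBaseline;
}
```

Rates here are whatever unit the GM dashboard uses (for example overrides per 100 folios); the comparison only requires that both values use the same unit.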
R-MEL-103 — Offline window exceeds grace period and access is lost
The Electron desktop app caches a 60-day operational window so a property can run offline for days. But there are limits: lock vendor offline-issuance windows expire (TTLock dynamic codes are time-bounded), payment authorizations cannot be captured offline, and the device's own access tokens expire (refresh tokens have a 30-day default). A property that goes offline for longer than the configured grace period loses front-desk operability.
Mitigation stack:
- Documented grace periods per capability in `docs/frontend/desktop/06-desktop-app-specification.md`; defaults: 30 days for token refresh, 7 days for TTLock dynamic codes (vendor-specific), 60 days for the operational data window.
- Proactive warnings at 75% and 90% of grace; visible in the connectivity bar; emailed to GM at 90%.
- Last-resort manual mode that allows mechanical-key fallback and paper-folio capture for later digitization, with a forced reconciliation on reconnect.
- Per-property "offline duration" KPI in the SRE dashboard.
Watchpoint trigger: any property offline > 14 days; any token refresh failure on reconnect; any TTLock offline-issuance failure on a configured property.
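The 75% and 90% warning levels can be computed per capability from the default grace periods. The sketch below uses hypothetical names (`graceLevel`, `capabilityLevels`) and assumes a capability counts as expired once offline time reaches the grace period; integer comparisons avoid floating-point edge cases at the exact thresholds.

```typescript
type GraceLevel = "ok" | "warn_75" | "warn_90" | "expired";

// Defaults as documented above, in days.
const DEFAULT_GRACE_DAYS = {
  tokenRefresh: 30,
  ttlockDynamicCodes: 7,
  operationalDataWindow: 60,
} as const;

type Capability = keyof typeof DEFAULT_GRACE_DAYS;

function graceLevel(offlineDays: number, graceDays: number): GraceLevel {
  if (offlineDays >= graceDays) return "expired";          // assumption: boundary counts as expired
  if (offlineDays * 10 >= graceDays * 9) return "warn_90"; // 90% of grace used
  if (offlineDays * 4 >= graceDays * 3) return "warn_75";  // 75% of grace used
  return "ok";
}

function capabilityLevels(offlineDays: number): Record<Capability, GraceLevel> {
  return {
    tokenRefresh: graceLevel(offlineDays, DEFAULT_GRACE_DAYS.tokenRefresh),
    ttlockDynamicCodes: graceLevel(offlineDays, DEFAULT_GRACE_DAYS.ttlockDynamicCodes),
    operationalDataWindow: graceLevel(offlineDays, DEFAULT_GRACE_DAYS.operationalDataWindow),
  };
}
```

After 6 days offline, only the TTLock capability is in its 75% warning band; after 28 days, TTLock codes have expired and token refresh is in its 90% band while the operational data window is still fine.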
3.4 Operational risks — full table
| ID | Description | L | I | Score | Mitigation | Owner | Status | Watchpoint |
|---|---|---|---|---|---|---|---|---|
| R-MEL-101 | Front-desk forgets cash-drawer close | 4 | 3 | 12 | Forced close on logout; auto-close midnight; EOD flagging; training | Ops Lead | open | Auto-closed > 5%/week |
| R-MEL-102 | Manual override abuse without audit | 3 | 4 | 12 | Override event + reason; 4-eyes for money; trend alerts | Finance Lead | open | Operator > 2× baseline override |
| R-MEL-103 | Offline window > grace; access lost | 2 | 4 | 8 | Documented grace; proactive warnings; manual fallback | Ops Lead | mitigated | Offline > 14 days |
| R-MEL-104 | Encoder hardware failure strands guests | 2 | 4 | 8 | Spare encoder per property; fallback to mobile-key + mechanical; vendor SLA | Ops Lead | open | Encoder offline > 4h |
| R-MEL-105 | Lock device battery death without alerting | 3 | 3 | 9 | Battery telemetry from lock-integration; tray alert at 20%; vendor heartbeat schedule | Ops Lead | open | Battery silent > 30 days |
| R-MEL-106 | Receipt printer hardware failure | 4 | 2 | 8 | Email/SMS receipt fallback; manual receipt template; spare printer recommendation | Ops Lead | mitigated | Print failure > 5%/day |
| R-MEL-107 | Staff training gap on multi-tenant features (chain) | 3 | 3 | 9 | Per-role onboarding course; in-app tour; chain-operator playbook | Ops Lead | open | Support tickets in onboarding category |
| R-MEL-108 | On-call coverage gaps non-business hours | 3 | 4 | 12 | 24/7 rota across 2 timezones; PagerDuty escalation; runbooks per service | SRE | open | Page acknowledged > 15 min |
| R-MEL-109 | Backup verification skipped (untested DR) | 2 | 5 | 10 | Quarterly DR drill; restore time measured; documented runbook | SRE | mitigated | Drill skipped or fail |
| R-MEL-110 | Deploy outside maintenance window in front-desk hours | 3 | 3 | 9 | Tenant-timezone-aware deploy schedule; pre-deploy notification; rollback ≤ 5 min | SRE | mitigated | Deploy in front-desk hours |
| R-MEL-111 | Tenant data export request mishandled | 2 | 4 | 8 | Documented export workflow; per-service export tooling; legal review on cross-border | Legal + Platform | open | Request open > 30 days |
| R-MEL-112 | Operator runs old desktop version offline (drift) | 4 | 2 | 8 | Auto-update enforced when online; min-version gate on sync; in-app banner | Desktop Lead | mitigated | Old version sync attempt |
3.5 Security risks — narrative for the top 3
R-MEL-201 — Credential phishing on staff
Staff in the target market mix high-trust hospitality culture with low password hygiene and shared devices. A phishing email impersonating "Ghasi Support" asking the front desk to "verify your password" succeeds more often than we would like. The blast radius depends on the role — a front-desk credential exposes one property; a chain-operator credential exposes many.
Mitigation stack:
- WebAuthn / passkey as the default second factor for desktop login; passkeys cannot be phished by a fake page.
- TOTP fallback for environments where passkeys are not yet supported, but flagged as weaker in the security dashboard.
- Device binding for the desktop app — a stolen credential cannot be used from an unbound device without re-pair, which requires an out-of-band code issued via the GM's verified phone.
- In-app "we will never ask" reminder during onboarding and printed on the operator quick-card.
- Suspicious-login telemetry — new device, new geo, new ASN — surfaces in the GM dashboard with a one-click revoke.
- Quarterly phishing drill for chain operators.
Watchpoint trigger: any account compromised by phishing; any login from a new geo without prior device pair; any TOTP-only chain-operator account.
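The suspicious-login signals (new device, new geo, new ASN) reduce to set-membership checks against what has been seen before for the account. This is a minimal sketch with hypothetical type and function names; how the known profile is built and stored is out of scope here.

```typescript
interface LoginAttempt { deviceId: string; country: string; asn: number }

interface KnownProfile {
  deviceIds: ReadonlySet<string>;
  countries: ReadonlySet<string>;
  asns: ReadonlySet<number>;
}

type Signal = "new_device" | "new_geo" | "new_asn";

function suspiciousSignals(attempt: LoginAttempt, profile: KnownProfile): Signal[] {
  const signals: Signal[] = [];
  if (!profile.deviceIds.has(attempt.deviceId)) signals.push("new_device");
  if (!profile.countries.has(attempt.country)) signals.push("new_geo");
  if (!profile.asns.has(attempt.asn)) signals.push("new_asn");
  return signals;
}

// Watchpoint from the text: a login from a new geo without a prior device pair.
function hitsWatchpoint(signals: readonly Signal[]): boolean {
  return signals.includes("new_geo") && signals.includes("new_device");
}
```

A known device roaming to a new country raises a signal for the GM dashboard but does not by itself hit the watchpoint.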
R-MEL-202 — Lost desktop device with cached PII
A lost or stolen laptop with the Electron desktop installed contains 60 days of operational data (guest names, ID-document references, partial payment instruments, lock state). The legal and reputational damage is shaped by what is on disk and how well it is encrypted.
Mitigation stack:
- SQLCipher on the local SQLite store; key in OS keychain via `keytar`, never written to disk in plaintext.
- Device-bound key derivation: the SQLCipher key is derived from a server-issued device key plus a local secret; pulling the file off the disk yields encrypted bytes.
- Remote revocation: once the device is reported lost, the device key is revoked server-side, so the device's next sync attempt fails; subsequent re-pair requires GM out-of-band approval.
- Auto-purge after configurable inactivity — by default, 14 days without sync triggers a local-data wipe on next launch.
- Remote-wipe API for chain operators (executes on first reconnect; cannot guarantee execution if the device never reconnects).
- No PAN data ever stored locally; see R-MEL-212.
Watchpoint trigger: any reported lost device; any device with no sync > 14 days.
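The auto-purge rule is a single comparison made at launch. The sketch below is illustrative only: `DeviceState` and `shouldWipeOnLaunch` are hypothetical names, and treating exactly 14 days as a trigger is an assumption about the boundary.

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

interface DeviceState {
  lastSuccessfulSyncAt: number; // epoch milliseconds
  inactivityLimitDays?: number; // configurable; defaults to 14
}

// Decided on launch: wipe local data if the device has not synced within the limit.
function shouldWipeOnLaunch(state: DeviceState, now: number): boolean {
  const limitDays = state.inactivityLimitDays ?? 14;
  return now - state.lastSuccessfulSyncAt >= limitDays * DAY_MS;
}
```

Because the check depends on the local clock, a production version would presumably also need to guard against clock rollback; that is not covered here.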
R-MEL-203 — Insider threat (chain operator role)
A chain operator role has access to multiple properties' financial and guest data. A malicious insider — disgruntled employee, social-engineered admin — can extract data, manipulate folios, or sabotage operations across the chain. The blast radius is larger than any external attack.
Mitigation stack:
- ABAC scoping — chain operators see only the properties on their attribute set; access scoped per property, never global.
- Audit log on every read of guest PII at the chain-operator level (not at the front-desk level, which would be too noisy); reviewed weekly.
- 4-eyes on financial overrides at the chain level (any override > a configured amount requires a second chain-operator's approval).
- Bulk-export rate limit — exporting > 100 guest records in a 24h window triggers a security review notification.
- Just-in-time elevation for sensitive actions (delete guest, refund > threshold) — the operator requests, a second operator approves, the elevation is time-bounded and audited.
Watchpoint trigger: chain-operator bulk export > threshold; any chain-operator action without 4-eyes for financial overrides; any abnormal access pattern (off-hours, from new geo).
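The bulk-export limit (more than 100 guest records in a 24h window) can be checked with a rolling-window sum. This is a sketch under assumptions: the `ExportEvent` shape and function names are hypothetical, and the window is treated as rolling rather than calendar-day.

```typescript
interface ExportEvent { operatorId: string; at: number; recordCount: number } // at: epoch ms

const WINDOW_MS = 24 * 60 * 60 * 1000;

// Sum of records exported by one operator in the 24 hours ending at `now`.
function recordsExportedInWindow(events: readonly ExportEvent[], operatorId: string, now: number): number {
  return events
    .filter((e) => e.operatorId === operatorId && e.at > now - WINDOW_MS && e.at <= now)
    .reduce((sum, e) => sum + e.recordCount, 0);
}

// True when the operator has exported more than `limit` records in the window.
function needsSecurityReview(
  events: readonly ExportEvent[],
  operatorId: string,
  now: number,
  limit = 100,
): boolean {
  return recordsExportedInWindow(events, operatorId, now) > limit;
}
```

Summing record counts rather than counting export calls means many small exports trip the same notification as one large one.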
3.6 Security risks — full table
| ID | Description | L | I | Score | Mitigation | Owner | Status | Watchpoint |
|---|---|---|---|---|---|---|---|---|
| R-MEL-201 | Credential phishing on staff | 4 | 4 | 16 | Passkeys default; device binding; suspicious-login telemetry; phishing drills | Security | open | New geo login w/o pair |
| R-MEL-202 | Lost device with cached PII | 3 | 4 | 12 | SQLCipher; device-bound key; remote revocation; auto-purge | Security | mitigated | Lost device report |
| R-MEL-203 | Insider threat (chain operator) | 2 | 5 | 10 | ABAC; audit on PII read; 4-eyes; export rate limit; JIT elevation | Security | open | Bulk export > threshold |
| R-MEL-204 | RFID card cloning | 3 | 3 | 9 | Vendor-recommended encryption; reservation-bound credentials; revoke on checkout | Security + Lock | mitigated | Duplicate-credential anomaly |
| R-MEL-205 | Mobile-key shared with non-guest | 3 | 3 | 9 | Time-bound credentials; per-stay key; vendor abuse signals; audit on extra-device pair | Security + Lock | open | Multiple device pair per credential |
| R-MEL-206 | Weak tenant-admin passwords | 3 | 4 | 12 | Password policy; HIBP check at set; passkey nudge; admin MFA mandatory | Security | mitigated | Admin without MFA |
| R-MEL-207 | SQL injection in custom queries | 1 | 5 | 5 | No raw SQL rule; parameterized queries; lint; pen-test | Platform Lead | mitigated | Lint failure or finding |
| R-MEL-208 | XSS in tenant-provided content blocks | 3 | 3 | 9 | Content blocks rendered through sanitizer; no arbitrary HTML; CSP; per-tenant CSP isolation in R2 | Frontend Lead | mitigated | CSP report-only violations |
| R-MEL-209 | CSRF on BFF without proper double-submit | 2 | 4 | 8 | SameSite=Lax cookies; double-submit token on state-changing routes; OWASP review | Platform Lead | mitigated | Missing token in route audit |
| R-MEL-210 | Supply-chain compromise of npm dependency | 3 | 4 | 12 | Renovate + audit; lockfile review; pinned versions; Snyk + Socket.dev scanning | Security | open | Critical advisory in deps |
| R-MEL-211 | KMS misconfiguration leaks key material | 2 | 5 | 10 | Per-tenant DEKs wrapped by KEK; CMEK option; least-privilege IAM; audit log | Security | mitigated | KMS access by non-allowlisted SA |
| R-MEL-212 | PAN/PCI data leaks into logs | 2 | 5 | 10 | No PAN in our perimeter; SAQ-A scope; log scrubber; redaction tests | Security + Payments | mitigated | Redaction test fail |
| R-MEL-213 | OAuth / OIDC misconfiguration on chain SSO | 2 | 4 | 8 | OIDC discovery doc; nonce + state validation; tested per IdP; chain-onboarding playbook | Security | open | New IdP without test pass |
| R-MEL-214 | Replay of webhook signatures | 2 | 3 | 6 | HMAC + timestamp + nonce window; replay corpus tested | Platform Lead | mitigated | Replay accepted in test |
| R-MEL-215 | Electron renderer breaks contextBridge boundary | 2 | 5 | 10 | Periodic Electron security audit; CSP in renderer; preload review | Desktop Lead | mitigated | New preload export w/o review |
3.7 Regulatory risks — narrative for the top 3
R-MEL-301 — Daily guest-registration mandate change in target jurisdictions
Afghanistan, Tajikistan, and Iran each have police-registration regimes for hotel guests with country-specific frequency, format, and enforcement. The format and the deadline can change with little notice. A platform that cannot respond within days becomes the reason a tenant gets fined or shut down.
Mitigation stack:
- Per-jurisdiction registration adapter as a port in `reservation-service`; new format = new adapter, no schema change.
- Manual fallback: the desktop app can always export the registration in the latest known format as PDF + CSV for paper submission.
- Per-jurisdiction toggle and field set in tenant configuration so we can enable/disable a field without a deploy.
- Quarterly regulatory review with a local counsel partner per jurisdiction.
Watchpoint trigger: any tenant report of registration rejection by authorities; any local-counsel quarterly report of changed mandate.
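The "new format = new adapter" idea can be sketched as a port interface plus a registry keyed by jurisdiction. Everything named here is hypothetical, and the `XX` adapter uses a made-up column layout purely for illustration; it does not represent any real government format.

```typescript
interface GuestRegistration { guestName: string; documentNumber: string; checkInDate: string }

// The port: one implementation per jurisdiction and format version.
interface RegistrationAdapter {
  jurisdiction: string;
  formatVersion: string;
  header(): string[];
  row(reg: GuestRegistration): string[];
}

const adapters = new Map<string, RegistrationAdapter>();

function registerAdapter(adapter: RegistrationAdapter): void {
  adapters.set(adapter.jurisdiction, adapter);
}

// Quote every field and double embedded quotes, per common CSV practice.
function csvField(value: string): string {
  return `"${value.replace(/"/g, '""')}"`;
}

function exportCsv(jurisdiction: string, regs: readonly GuestRegistration[]): string {
  const adapter = adapters.get(jurisdiction);
  if (!adapter) throw new Error(`no registration adapter for ${jurisdiction}`);
  const lines = [adapter.header(), ...regs.map((r) => adapter.row(r))];
  return lines.map((cols) => cols.map(csvField).join(",")).join("\n");
}

// Made-up layout for illustration only; not a real government format.
registerAdapter({
  jurisdiction: "XX",
  formatVersion: "example-1",
  header: () => ["name", "document", "check_in"],
  row: (r) => [r.guestName, r.documentNumber, r.checkInDate],
});
```

A mandate change then becomes a new adapter registration under the same jurisdiction key, while the guest-registration data model stays untouched.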
R-MEL-302 — Data residency requirements changing (especially Iran)
Data residency in Iran is the most variable surface in our regulatory landscape. A residency mandate that arrives mid-quarter could require us to host all Iranian-tenant data inside Iran or to route it through specific carriers, neither of which is feasible on GCP today.
Mitigation stack:
- Hexagonal architecture — every infrastructure dependency is behind a port, so a future port to a different cloud or to a co-located deployment is feasible without service rewrites.
- Per-tenant data classification — what is PII vs. operational vs. analytic — so a residency mandate can be enforced selectively.
- Iran exploratory deployment in R2 under sanctions-compliant boundary, with a clear escape hatch (do not onboard Iranian tenants if residency cannot be honored).
- CMEK option enabled per-tenant for sensitive data.
- Documented "Plan B" — a co-located Postgres + Cloud Run deployment topology kept in IaC even if not deployed.
Watchpoint trigger: any residency mandate published; any sanctions-list change affecting GCP availability in Iran.
R-MEL-303 — KYC mandates for foreign guests
Foreign guests trigger heavier KYC in many target jurisdictions: passport, visa, departure date, accompanying-persons declaration. The exact list is jurisdiction-specific and changes. The GDPR-class minimization principle pushes against retaining more PII than needed; the regulatory pressure pushes the other way.
Mitigation stack:
- Per-jurisdiction guest-document schema in `reservation-service`; tenants on that jurisdiction collect the required set, others do not.
- Document storage with short retention by default; per-jurisdiction retention overrides codified.
- Encrypted at rest with per-tenant DEK; access audited.
- Operator-facing affordance to mark a document as "verified" rather than retaining the image, where the jurisdiction allows.
Watchpoint trigger: any KYC mandate change; any tenant request to retain documents past default retention; any document-storage volume anomaly.
3.8 Regulatory risks — full table
| ID | Description | L | I | Score | Mitigation | Owner | Status | Watchpoint |
|---|---|---|---|---|---|---|---|---|
| R-MEL-301 | Daily guest-registration mandate change | 4 | 3 | 12 | Per-jurisdiction adapter; manual fallback; quarterly counsel | Compliance | open | Counsel reports change |
| R-MEL-302 | Data residency change (esp. Iran) | 3 | 5 | 15 | Hexagonal ports; per-tenant classification; Plan B IaC; CMEK | Platform + Compliance | open | Mandate published |
| R-MEL-303 | KYC mandate for foreign guests | 4 | 3 | 12 | Per-jurisdiction schema; short retention; encrypted at rest; verify-not-retain | Compliance | open | Mandate or storage anomaly |
| R-MEL-304 | Tax rate changes mid-month | 4 | 2 | 8 | Effective-dated rate table; per-jurisdiction; audit on change | Finance | mitigated | Rate change without audit |
| R-MEL-305 | Sanctions-list update affects payment availability | 3 | 4 | 12 | OFAC + UN + EU lists synced daily; tenant onboarding screen; payment-route fallback | Compliance + Payments | open | Tenant on updated list |
| R-MEL-306 | GDPR-style request handling delay | 2 | 4 | 8 | Per-service export + erase tooling; 30-day SLA; legal queue | Compliance | open | Request open > 25 days |
| R-MEL-307 | PCI scope creep if PAN slips into our perimeter | 2 | 5 | 10 | SAQ-A scope; PCI scanner in CI; log redaction; quarterly scope review | Security + Payments | mitigated | PAN in any service log |
| R-MEL-308 | Currency-control rules change for cross-border settlement | 3 | 3 | 9 | Per-jurisdiction settlement rules; FX provider compliance; audit log | Finance | open | Settlement bounce by bank |
| R-MEL-309 | Local hospitality-licensing change | 3 | 3 | 9 | Per-jurisdiction property metadata; counsel review per market entry | Compliance | open | License renewal failure |
| R-MEL-310 | Anti-trafficking flagging mandates | 3 | 3 | 9 | Risk-flagging field on guest aggregate; compliance contact per jurisdiction | Compliance | open | Mandate published |
| R-MEL-311 | Cross-border data transfer restrictions (EU expansion) | 3 | 4 | 12 | SCCs; in-EU region for EU tenants from R3; DPIA per high-risk feature | Compliance | deferred | EU pilot signed |
3.9 Market & business risks — narrative for the top 3
R-MEL-401 — Small target tenant size (financial fragility)
The modal target tenant is an 8–50 room independent property in a market where seasonal revenue swings of 50%+ are normal. A run of bad months and they cannot pay our subscription. We absorb the churn or we lose the relationship. Either way, the unit economics are tight.
Mitigation stack:
- Tiered pricing sized for SMB independents — no per-room USD model.
- Outcomes-aligned commercial terms in the chain segment — a percentage of incremental direct-booking revenue we capture.
- Local-currency billing (AFN, TJS, IRR via local rails) where banking allows, with FX-risk pricing.
- Pause-not-cancel option during off-season; data retained, app inactive, low monthly fee.
- Diversified geography in R1 to avoid single-market concentration.
Watchpoint trigger: monthly churn > 3%; > 30% of tenants on pause-not-cancel; concentration on a single market > 50%.
R-MEL-402 — Slow PMS migration culture in target markets
Hoteliers in the target market have run their property the same way for years. WhatsApp + Excel + a paper register works. Convincing them to change is a multi-month, high-touch sale. A go-to-market plan that assumes self-serve onboarding will run aground.
Mitigation stack:
- Assisted onboarding by a local field rep for every tenant in R1 and R2.
- In-language training material — printed quick-cards and short videos in Pashto, Dari, Tajik, Persian.
- "Run alongside" pilot mode — tenants run Ghasi alongside their existing process for 30 days; we earn the switch.
- Per-region rep model in R2; reseller channel in R3 (white-label reseller program).
- Local advocacy: first 5 tenants per market are referenceable case studies.
Watchpoint trigger: average onboarding time > 90 days; first-30-day usage rate < 60%; pilot abandonment > 20%.
R-MEL-403 — Inability to take cards in target markets due to sanctions
Cards are not the dominant rail in our markets, but they matter for foreign guests and for the subset of properties that serve diaspora. Stripe and PayPal availability fluctuates with sanctions postures. A platform that hard-codes Stripe as the only card processor breaks when a tenant's market becomes unavailable.
Mitigation stack:
- Pluggable `payment-gateway-service` with adapters for Stripe, PayPal, MFS providers, cash-on-arrival, and bank transfer.
- Per-tenant payment-method enable/disable with jurisdiction defaults.
- Cash-on-arrival as first-class — the dominant rail in our markets is treated as a primary, not a workaround.
- MFS expansion in R2 — M-PESA, EasyPaisa, AfghanPaisa, Pamir-Pay — to give every market at least one electronic rail.
- Reconciliation tooling for cash and bank-transfer flows.
Watchpoint trigger: any payment provider's regional suspension; any tenant with no working electronic rail.
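The per-tenant enable/disable logic and the "no working electronic rail" watchpoint can be sketched as follows. This is an illustration under assumptions: the method identifiers, the `TenantPaymentConfig` shape, and the choice to count Stripe, PayPal, and MFS (but not bank transfer) as electronic rails are not taken from the payments spec.

```typescript
type PaymentMethod = "stripe" | "paypal" | "mfs" | "cash_on_arrival" | "bank_transfer";

// Hypothetical config shape: jurisdiction defaults plus per-tenant overrides.
interface TenantPaymentConfig {
  jurisdictionDefaults: PaymentMethod[];
  enabled?: PaymentMethod[];  // methods the tenant adds on top of the defaults
  disabled?: PaymentMethod[]; // methods switched off, e.g. after a provider suspension
}

// Assumption: which methods count as an "electronic rail" for the watchpoint.
const ELECTRONIC_RAILS: PaymentMethod[] = ["stripe", "paypal", "mfs"];

// Defaults plus tenant additions, de-duplicated, minus anything disabled.
function availableMethods(cfg: TenantPaymentConfig): PaymentMethod[] {
  const candidates = [...cfg.jurisdictionDefaults, ...(cfg.enabled ?? [])];
  const disabled = cfg.disabled ?? [];
  return candidates.filter(
    (m, i) => candidates.indexOf(m) === i && disabled.indexOf(m) === -1,
  );
}

// True when at least one electronic rail survives the overrides.
function hasElectronicRail(cfg: TenantPaymentConfig): boolean {
  return availableMethods(cfg).some((m) => ELECTRONIC_RAILS.indexOf(m) !== -1);
}
```

Because the disable list is plain configuration, a provider suspension can be handled per tenant without a deploy, and a periodic job could call `hasElectronicRail` over all tenants to raise the watchpoint.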
3.10 Market & business risks — full table
| ID | Description | L | I | Score | Mitigation | Owner | Status | Watchpoint |
|---|---|---|---|---|---|---|---|---|
| R-MEL-401 | Small target tenant financial fragility | 4 | 3 | 12 | Tiered pricing; outcomes terms; local-currency billing; pause-not-cancel | Commercial | open | Monthly churn > 3% |
| R-MEL-402 | Slow PMS migration culture | 4 | 3 | 12 | Assisted onboarding; in-language training; "run alongside" pilot; reseller channel | Commercial | open | Onboarding > 90 days |
| R-MEL-403 | Card unavailability due to sanctions | 4 | 3 | 12 | Pluggable gateway; per-tenant enable; cash first-class; MFS expansion | Payments + Compliance | open | Provider regional suspension |
| R-MEL-404 | Competition from regional incumbents | 3 | 3 | 9 | Differentiation (meta+direct, offline, RTL, AI); reference customers; community | Commercial | open | Major incumbent regional play |
| R-MEL-405 | Currency volatility (AFN, IRR, TJS) | 4 | 3 | 12 | FX snapshot at confirm; multi-currency folio; daily FX feed; settlement-currency choice | Finance | mitigated | FX swing > 5%/day |
| R-MEL-406 | Informal-channel payments not auditable | 4 | 3 | 12 | Cash + bank-transfer reconciliation; documented "informal payment" capture path | Finance | open | Tenant complaint on reconciliation |
| R-MEL-407 | Tenant cohort revenue concentrated on 1 market | 3 | 4 | 12 | Geographic diversification target per release; market-mix dashboard | Commercial | open | Single market > 50% revenue |
| R-MEL-408 | Pricing perceived as expensive in local currency | 3 | 3 | 9 | Local-currency pricing; per-market adjustment; outcomes terms | Commercial | open | Local pricing complaint trend |
| R-MEL-409 | OTA push-back on direct-booking thesis | 2 | 3 | 6 | Tenant-owned listings; OTA channel manager R3 anyway; reputation focus | Commercial | open | OTA delisting threats |
| R-MEL-410 | Regulator perceives meta layer as OTA | 2 | 4 | 8 | Counsel review per market; clear tenant-of-record on every booking; legal positioning doc | Compliance | open | Regulator inquiry |
3.11 AI risks — narrative for the top 3
R-MEL-501 — Model drift on dynamic pricing causes revenue loss
Dynamic pricing is the highest-leverage AI capability we ship. It is also the one with the most asymmetric downside: a 5% price drift across a 50-tenant fleet, sustained over a quarter, is real money. Drift can come from changed demand patterns (a holiday calendar shift), changed competitor behavior, or upstream model changes from Vertex AI.
Mitigation stack:
- Suggestions, not auto-apply — every pricing change is HITL by default until tenant-level acceptance rate clears the readiness bar (≥ 60% over 30 days).
- Per-tenant baseline — pricing model is bounded by a tenant-configured rate band; suggestions outside the band are flagged.
- A/B and shadow-model evaluation — every model version runs in shadow alongside the live model for 14 days before promotion.
- Per-tenant revenue impact dashboard: RevPAR / GOPPAR delta vs. baseline, per model version.
- One-click rollback to the previous model version per tenant.
- Quarterly model accuracy eval per tenant cohort.
Watchpoint trigger: RevPAR delta < -5% across a tenant cohort over a week; acceptance rate below 40%; eval scores deviating > 1σ.
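The interaction between the tenant rate band and the HITL readiness bar can be sketched as a single decision function. The type and function names are hypothetical, and the sketch assumes that clearing the 60% bar makes a suggestion eligible for auto-apply; the text does not say whether auto-apply then becomes automatic or remains a tenant choice.

```typescript
interface RateBand {
  min: number; // tenant-configured floor
  max: number; // tenant-configured ceiling
}

type Disposition = "needs_review" | "eligible_for_auto_apply" | "flagged_out_of_band";

// Readiness bar from the mitigation list: acceptance rate >= 60% over 30 days.
const READINESS_BAR = 0.6;

function disposeSuggestion(
  proposedRate: number,
  band: RateBand,
  acceptanceRate30d: number,
): Disposition {
  // Out-of-band suggestions are always flagged, whatever the acceptance history.
  if (proposedRate < band.min || proposedRate > band.max) return "flagged_out_of_band";
  // In-band suggestions stay human-in-the-loop until the tenant clears the bar.
  return acceptanceRate30d >= READINESS_BAR ? "eligible_for_auto_apply" : "needs_review";
}
```

Checking the band before the acceptance rate means a drifting model cannot push prices outside the tenant's limits even for tenants who have otherwise earned auto-apply.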
R-MEL-502 — Hallucinated guest message text
AI-drafted guest messages (confirmations, pre-arrival, late-checkout) reach the guest in their language. A hallucination — wrong room number, wrong date, wrong policy — damages trust at the worst moment. The risk is amplified by translation: the operator may not be able to verify a Pashto draft by reading it.
Mitigation stack:
- HITL by default for all guest-facing AI output — operator must explicitly send.
- Structured generation — the prompt is built with the verified facts (dates, room, name, amount); the model fills the prose around them.
- Per-template glossary — locale-specific glossary pinned to the prompt; brand voice consistency.
- Round-trip verification — translate-back-to-English check on long messages; mismatch flagged for operator review.
- Refusal on low confidence — model declines to draft if the inputs are inconsistent.
- Audit log of every draft, accepted or rejected, with provenance.
Watchpoint trigger: any guest complaint linked to AI-drafted text; any round-trip mismatch above a configured threshold; any operator override rate > 30%.
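One way to enforce structured generation is to keep the verified facts out of the model's hands entirely: the model writes prose with named placeholders and the platform substitutes the facts afterwards. This is a variant of the approach described above, not necessarily the one the AI architecture doc prescribes; the `{{name}}` placeholder syntax and the function `fillDraft` are assumptions for illustration.

```typescript
// Verified facts come from the reservation record, never from the model.
type VerifiedFacts = Record<string, string>;

// Substitutes {{placeholders}} in a model draft with verified facts.
// A placeholder with no matching fact is treated as an inconsistency,
// so the draft is refused rather than sent with a guess.
function fillDraft(draft: string, facts: VerifiedFacts): string {
  return draft.replace(/\{\{(\w+)\}\}/g, (_match: string, key: string) => {
    if (!Object.prototype.hasOwnProperty.call(facts, key)) {
      throw new Error("unverified placeholder: " + key);
    }
    return facts[key];
  });
}
```

With this shape, a hallucinated room number or date cannot reach the guest, because those values are never generated; the operator only has to judge the surrounding prose.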
R-MEL-503 — AI cost runaway from prompt explosion
A bug, a feature change, or a misconfigured retry policy can multiply AI calls without warning. A 10× spike on Vertex AI is recoverable; a 100× spike for a week is not. The risk is amplified by edge inference fallback that hides degradation.
Mitigation stack:
- Per-tenant AI budget — soft-degrade at 80% (cheaper model, less context); hard-stop at 100% (HITL only) until top-up or new period.
- Per-feature quotas layered on top of tenant budget.
- Cache by prompt hash — identical prompts in a 1h window served from cache.
- AI cost dashboard with per-tenant, per-feature, per-model breakdown; alert on > 2× baseline.
- Default-off for net-new AI features; tenant opts in.
- Vertex AI batch APIs where latency tolerates.
Watchpoint trigger: daily cost > 2× 7-day moving average; budget hit > 80% before mid-period; per-prompt cost > model average.
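The budget thresholds and the one-hour prompt cache can be sketched as below. The names `aiModeFor` and `PromptCache` are hypothetical, the cache is an in-memory stand-in for whatever store the gateway uses, and the caller is assumed to supply an already-computed prompt hash and the current time.

```typescript
type AiMode = "normal" | "soft_degrade" | "hard_stop";

// 80% of budget: cheaper model, less context. 100%: HITL only until top-up or new period.
function aiModeFor(spent: number, budget: number): AiMode {
  if (budget <= 0 || spent >= budget) return "hard_stop";
  return spent / budget >= 0.8 ? "soft_degrade" : "normal";
}

const CACHE_TTL_MS = 60 * 60 * 1000; // identical prompts within 1h are served from cache

class PromptCache {
  private readonly entries: Record<string, { response: string; storedAt: number }> = {};

  // `now` is passed in (milliseconds) so the expiry logic is easy to test.
  get(promptHash: string, now: number): string | undefined {
    const hit = this.entries[promptHash];
    if (hit === undefined) return undefined;
    if (now - hit.storedAt >= CACHE_TTL_MS) {
      delete this.entries[promptHash]; // expired: drop and report a miss
      return undefined;
    }
    return hit.response;
  }

  put(promptHash: string, response: string, now: number): void {
    this.entries[promptHash] = { response, storedAt: now };
  }
}
```

A misconfigured retry loop that resends the same prompt would hit the cache rather than the provider, and a tenant that still burns through budget degrades in steps instead of failing abruptly.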
3.12 AI risks — full table
| ID | Description | L | I | Score | Mitigation | Owner | Status | Watchpoint |
|---|---|---|---|---|---|---|---|---|
| R-MEL-501 | Pricing model drift causes revenue loss | 3 | 4 | 12 | HITL; rate band; shadow model; rollback; quarterly eval | AI Lead | open | RevPAR < -5% cohort |
| R-MEL-502 | Hallucinated guest message text | 3 | 4 | 12 | HITL default; structured gen; glossary; round-trip; refusal; audit | AI Lead | open | Guest complaint or override > 30% |
| R-MEL-503 | AI cost runaway from prompt explosion | 3 | 4 | 12 | Tenant budget; cache; dashboard; default-off; batch APIs | AI Lead + Finance | open | Cost > 2× baseline |
| R-MEL-504 | AI suggestion latency > tolerance | 3 | 3 | 9 | Min instances; warm-up; edge fallback; per-surface budget | AI Lead | mitigated | First-token > 1.5s p95 |
| R-MEL-505 | Embedding leak via vector index reconstruction | 2 | 4 | 8 | Per-tenant partitioning of pgvector; query filter mandatory; pen-test category | AI Lead + Security | mitigated | Pen-test finding |
| R-MEL-506 | Hosted-model provider deprecation | 3 | 3 | 9 | Multi-provider abstraction in ai-gateway; eval suite vs alternatives; deprecation calendar | AI Lead | open | Vendor deprecation announced |
| R-MEL-507 | Edge model fairness across locales (Pashto/Dari/Persian) | 4 | 3 | 12 | Per-locale eval suite; locale-specific fine-tunes; HITL stricter on weak locales | AI Lead | open | Locale eval > 1σ from EN |
| R-MEL-508 | Prompt injection from tenant content | 3 | 4 | 12 | Pre-call classifier; system prompt isolation; structured tools; allowlist | AI Lead + Security | mitigated | Injection detected in fuzz |
| R-MEL-509 | Bias in upsell or anomaly recommendations | 3 | 3 | 9 | Fairness eval suite; HITL; quarterly review; documented audit | AI Lead | open | Fairness metric drift |
| R-MEL-510 | Provenance lost on AI artifact (audit gap) | 2 | 4 | 8 | Domain refuses persistence without provenance; export includes provenance | AI Lead | mitigated | Persisted artifact w/o provenance |
| R-MEL-511 | Edge ONNX runtime breaking change on app upgrade | 2 | 3 | 6 | Pin runtime version; integration test on each release; staged rollout | Desktop Lead | mitigated | Runtime version mismatch |
3.13 Vendor risks — narrative for the top 3
R-MEL-601 — TTLock or Salto API breaking change
Lock vendors are mid-tier SaaS providers with their own release cadence. A breaking change to their API in production has happened to other PMS vendors and will happen to us. The blast radius is per-tenant per-vendor; the recovery is hours of unplanned engineering and a window of degraded operations.
Mitigation stack:
- Vendor adapter pattern: every vendor lives behind the `LockPort` interface; a breaking change touches one adapter, not the domain.
- Adapter contract tests run nightly against vendor sandboxes.
- Vendor-version pinning where APIs offer it; explicit vendor-version metadata in every key event.
- Manual fallback — issue a mechanical key, capture the lock event for later sync, do not block check-in.
- Generic Wiegand adapter as a backstop for vendors that lose their cloud entirely.
Watchpoint trigger: vendor sandbox contract test fails; vendor advisory of breaking change; any key-issuance success rate drop > 5% per vendor.
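The adapter pattern and the manual fallback can be sketched together. Only the name `LockPort` comes from the text; its members, the `KeyEvent` shape, and the function `issueKey` are assumptions, and the port is synchronous here purely to keep the sketch short (a real vendor call would be asynchronous).

```typescript
interface KeyRequest {
  roomId: string;
  guestId: string;
}

// Every key event carries explicit vendor-version metadata.
interface KeyEvent extends KeyRequest {
  vendor: string;
  vendorVersion: string;
  method: "digital" | "mechanical";
  pendingSync: boolean; // true when the lock event still has to be synced later
}

// Hypothetical port: the domain depends on this, never on a vendor SDK.
interface LockPort {
  readonly vendor: string;
  readonly vendorVersion: string;
  issueDigitalKey(req: KeyRequest): void; // throws when the vendor call fails
}

// Check-in is never blocked: if the vendor adapter fails, record a
// mechanical-key issuance and mark it for later sync.
function issueKey(port: LockPort, req: KeyRequest): KeyEvent {
  const base = { ...req, vendor: port.vendor, vendorVersion: port.vendorVersion };
  try {
    port.issueDigitalKey(req);
    return { ...base, method: "digital", pendingSync: false };
  } catch {
    return { ...base, method: "mechanical", pendingSync: true };
  }
}
```

A vendor API change then surfaces as a rise in `mechanical` events for that vendor, which is the same signal the key-issuance success-rate watchpoint monitors.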
R-MEL-602 — Stripe / PayPal regional restriction tightening
Payment providers periodically tighten rules in our markets. The risk is that a tenant who relied on Stripe one month cannot use it the next. The product impact is direct revenue loss for that tenant and a support escalation for us.
Mitigation stack:
- Pluggable payment adapters — Stripe and PayPal are two of many.
- Per-tenant payment-method config — disable a method per tenant without deploy.
- MFS coverage — at least one electronic rail per market.
- Cash-on-arrival as the universal fallback.
- Compliance alert subscription to provider terms changes; quarterly review.
Watchpoint trigger: provider TOS change in target market; provider account restriction notice; tenant payment-method failure spike.
R-MEL-603 — Vertex AI model deprecation
Vertex AI deprecates models on its own schedule. A deprecation that lands during a release window forces an unplanned migration of every prompt that targeted the deprecated model.
Mitigation stack:
- Single AI gateway — model identifier changes happen in one place.
- Eval suite per prompt — re-runs against the new model before swap.
- Multi-provider abstraction — fallback to a different provider where eval permits.
- Subscription to Vertex AI deprecation calendar; quarterly model-currency audit.
- Prompt registry with version history so old prompts can be re-evaluated against new models.
Watchpoint trigger: deprecation announced; eval suite drift on a target prompt > 10%.
3.14 Vendor risks — full table
| ID | Description | L | I | Score | Mitigation | Owner | Status | Watchpoint |
|---|---|---|---|---|---|---|---|---|
| R-MEL-601 | TTLock or Salto API breaking change | 3 | 4 | 12 | Adapter pattern; nightly contract tests; manual fallback | Lock Lead | open | Sandbox contract fail |
| R-MEL-602 | Stripe / PayPal regional restriction tighten | 3 | 4 | 12 | Pluggable adapters; per-tenant config; MFS; cash-on-arrival | Payments Lead | open | Provider TOS change |
| R-MEL-603 | Vertex AI model deprecation | 3 | 3 | 9 | AI gateway centralization; per-prompt eval; multi-provider; deprecation calendar | AI Lead | open | Deprecation announced |
| R-MEL-604 | GCP pricing change (egress, Cloud Run, Pub/Sub) | 3 | 3 | 9 | FinOps dashboard; quarterly cost review; reserve commits where stable | Finance + SRE | open | Pricing announcement |
| R-MEL-605 | OpenSearch licensing complications | 1 | 3 | 3 | Not adopted; meta search uses Postgres + indexes; OpenSearch deferred | Platform Lead | closed | If adopted, re-open |
| R-MEL-606 | ONNX runtime breaking change | 2 | 3 | 6 | Version pinning; release-test matrix; vendor changelog watch | Desktop Lead | mitigated | Major version bump |
| R-MEL-607 | Twilio (SMS) outage | 3 | 3 | 9 | Multi-provider notification adapter; queue + retry; in-app fallback | Notification Lead | open | Twilio incident |
| R-MEL-608 | WhatsApp Business platform policy change | 3 | 3 | 9 | Per-tenant template approval; manual fallback to SMS / email; abuse-rate monitoring | Notification Lead | open | Policy change |
| R-MEL-609 | Resend / SendGrid email outage | 3 | 2 | 6 | Multi-provider; outbox + retry; deliverability dashboard | Notification Lead | mitigated | Provider incident |
| R-MEL-610 | electron-builder release / signing infra outage | 2 | 3 | 6 | Local fallback signing; release pipeline self-host option; staged rollout | Desktop Lead | mitigated | Pipeline failure |
| R-MEL-611 | Vendor sub-processor change (under DPA) | 3 | 3 | 9 | DPA review per vendor; sub-processor change notice; legal queue | Compliance | open | Sub-processor change notice |
3.15 People & process risks — narrative for the top 2
R-MEL-701 — Hiring locale-fluent QA
Pashto/Dari/Persian/Tajik QA fluent in the operational domain is a thin pipeline. Without it, our acceptance tests for RTL content and AI translations regress silently.
Mitigation stack:
- Local QA contract pool in Kabul, Dushanbe, Mashhad, Herat for paid validation cycles.
- Per-locale acceptance gate in CI for translated content.
- Tenant beta program — pilot tenants are paid feedback partners.
- Internal locale champions — engineers and PMs with native fluency rotate review duty.
Watchpoint trigger: translation defect found in production; per-locale eval drift; QA pipeline > 1 week cycle.
R-MEL-702 — Documentation rot
The 17-doc-per-service standard generates a lot of paper. Without enforcement, the docs drift from code; new engineers stop trusting them; the moat erodes.
Mitigation stack:
- `SERVICE_READINESS.md` audit at every release per service; documented gaps block the release.
- Service readiness audit skill runs in CI per service touched.
- Quarterly platform doc audit — random sample reviewed.
- PR template requires confirming which doc(s) were updated.
- ADR creation policy — if a tradeoff is changed, an ADR is mandatory.
Watchpoint trigger: service readiness score drops > 10 points; ADR backlog > 3 months; PR with code change but no doc check.
3.16 People & process risks — full table
| ID | Description | L | I | Score | Mitigation | Owner | Status | Watchpoint |
|---|---|---|---|---|---|---|---|---|
| R-MEL-701 | Hiring locale-fluent QA | 4 | 3 | 12 | Local QA pool; per-locale CI gate; beta program; champions | People + QA | open | Translation defect in prod |
| R-MEL-702 | Documentation rot | 4 | 3 | 12 | SERVICE_READINESS audit; skill in CI; quarterly review; ADR policy | Platform Lead | open | Readiness drop > 10 |
| R-MEL-703 | Cross-cultural design (RTL nuances) | 3 | 3 | 9 | Locale champions; native review on UI changes; mirrored screenshots in CI | Frontend Lead | open | UX regression in RTL |
| R-MEL-704 | On-call burnout | 3 | 3 | 9 | Rotation across timezones; runbook quality; auto-remediation; weekly health pulse | SRE Lead | open | Page count > target/operator |
| R-MEL-705 | ADR drift (decision changed without ADR) | 3 | 3 | 9 | ADR policy in PR template; quarterly ADR review; spec-vs-implementation audit | Platform Lead | open | Code drifted from ADR |
| R-MEL-706 | Founder bus factor | 2 | 5 | 10 | Doc-heavy culture; pair-on-decision rule; succession plan; access shared | CTO | open | Single-owner critical path > 2 |
| R-MEL-707 | Hiring senior engineers in target geographies | 3 | 4 | 12 | Remote-friendly; local hubs; partnerships with universities; relocation support | People | open | Open senior role > 90 days |
3.17 Register summary
- Total risks identified: 67
- Escalate (≥ 15): 5 — R-MEL-001, R-MEL-002, R-MEL-003, R-MEL-201, R-MEL-302
- Mitigate (7–14): 50
- Monitor (≤ 6): 12
Distribution by category:
| Category | Count |
|---|---|
| Technical | 20 |
| Operational | 12 |
| Security | 15 |
| Regulatory | 11 |
| Market & business | 10 |
| AI | 11 |
| Vendor | 11 |
| People & process | 7 |
(Counts include narrative top-3 plus tables; some risks span categories — the canonical category is the one in the table heading.)
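The scoring used in the tables above (Score = L × I, banded as Escalate ≥ 15, Mitigate 7–14, Monitor ≤ 6) can be expressed as a small helper, for example to validate a register export in CI. The type and function names are hypothetical.

```typescript
type Band = "escalate" | "mitigate" | "monitor";

interface RiskEntry {
  id: string;
  likelihood: number; // L column, 1 to 5
  impact: number;     // I column, 1 to 5
}

function riskScore(r: RiskEntry): number {
  return r.likelihood * r.impact;
}

// Bands as listed in the register summary.
function riskBand(score: number): Band {
  if (score >= 15) return "escalate";
  if (score >= 7) return "mitigate";
  return "monitor";
}
```

For example, R-MEL-302 (L = 3, I = 5) scores 15 and lands in the escalate band, while R-MEL-409 (L = 2, I = 3) scores 6 and is monitored.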
4. Tradeoffs Register
Tradeoffs are decisions where we deliberately accepted a downside in exchange for an upside. Each entry names the alternative, the upside we chose, the downside we accepted, the mitigation that bounds the downside, and the watchpoint that would force us to revisit. Cross-references to the relevant ADR or spec section are included.
TR-MEL-01 — Single shared schema with RLS for most domain data
- Alternative considered: Schema-per-tenant for every service.
- Decision: Shared schema + `tenant_id` column + Postgres RLS for `iam`, `tenant`, `property`, `reservation`, `pricing`, `inventory`, `housekeeping`, `maintenance`, `staff`, `theme-config`, `notification`, `reporting`, `analytics`, `lock-integration`, `search-aggregation`. Schema-per-tenant for `billing` and `payment-gateway` only.
- Why we chose this: Operational simplicity (one migration set per service vs. one per tenant per service), lower cost (Postgres connection pooling stays sane, no per-tenant connection sprawl), simpler analytics (aggregations across tenants for the platform team without proxying through every tenant DB), and preserved isolation via RLS + application context middleware + CI tests.
- What we gave up: Maximal isolation. A bug that bypasses RLS exposes more than a bug that bypasses a schema boundary.
- Mitigation: Two-tenant CI test suite that runs on every PR; mandatory `RequestContext` middleware; `no raw SQL` lint; quarterly pen-test; schema-per-tenant for the two services where the financial blast radius justifies the operational cost.
- Cross-reference: ADR-0002 multi-tenancy model; R-MEL-001.
- Watchpoint to revisit: any cross-tenant CI failure; any pen-test finding in tenant isolation; tenant > 5% of total platform load (consider schema-per-tenant promotion path); regulatory mandate for stronger data segregation in a target market.
TR-MEL-02 — Electron over Tauri for the desktop backoffice
- Alternative considered: Tauri (Rust + WebView), with a 30 MB bundle and a smaller memory footprint, vs. Electron's 100 MB+ bundle.
- Decision: Electron. Locked, with substitution requiring an explicit ADR and unanimous architecture-team approval.
- Why we chose this: The lock vendors we must integrate (TTLock, Salto, Assa Abloy) ship Node bindings, not Rust crates. `better-sqlite3` is a first-class Node ecosystem package; `keytar` for OS keychain is Node-native; ONNX Runtime Node is mature; `electron-builder` + `electron-updater` deliver one-click signed installers across Windows/macOS/Linux that hotel IT can deploy without an extra toolchain. Hiring profile in our target geographies favors JS/TS over Rust by an order of magnitude. Bundle size of 100 MB+ is irrelevant for a staff-installed line-of-business app that is downloaded once and updated incrementally.
- What we gave up: Bundle size, memory footprint, and the security advantages of a smaller native attack surface.
- Mitigation: Strict Electron security configuration (`contextIsolation: true`, `nodeIntegration: false`, narrow typed `window.melmastoon` surface via preload + `contextBridge`, CSP in renderer, periodic security audit); incremental auto-updates so the 100 MB ships only once.
- Cross-reference: ADR-0003 Electron offline-first desktop; R-MEL-019, R-MEL-215.
- Watchpoint to revisit: Tauri 2.x maturity in 2 years specifically around Node-binding interop; any lock vendor that ships Rust-first; any sustained operator complaint about install size on metered networks.
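The strict security configuration named in the mitigation could be checked mechanically, for instance in a release gate. The sketch below declares a local interface whose field names mirror Electron's `BrowserWindow` `webPreferences` options so that it runs without Electron installed; the function `hardeningViolations`, and the policy of requiring explicit values rather than relying on Electron's defaults, are assumptions.

```typescript
// Subset of Electron's webPreferences, declared locally for this sketch.
interface WebPreferencesSubset {
  contextIsolation?: boolean;
  nodeIntegration?: boolean;
  preload?: string; // script that exposes the narrow typed surface via contextBridge
}

// Returns human-readable violations; an empty array means the window
// configuration matches the posture described in TR-MEL-02.
function hardeningViolations(prefs: WebPreferencesSubset): string[] {
  const violations: string[] = [];
  if (prefs.contextIsolation !== true) violations.push("contextIsolation must be true");
  if (prefs.nodeIntegration !== false) violations.push("nodeIntegration must be false");
  if (!prefs.preload) violations.push("a preload script must be configured");
  return violations;
}
```

Requiring the flags to be stated explicitly makes the posture visible in code review and independent of any future change to framework defaults.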
TR-MEL-03 — GCP-only, multi-cloud avoidance
- Alternative considered: AWS-equivalent stack from day one or active multi-cloud (GCP + AWS).
- Decision: GCP-only. Cloud Run + Cloud SQL + Pub/Sub + Memorystore + Vertex AI + Cloud Storage + KMS + Secret Manager + Cloud Logging/Monitoring/Trace.
- Why we chose this: Faster delivery (one cloud's IAM, one set of IaC patterns), Vertex AI co-location matters for our AI-first thesis, and the cost competitiveness at our scale is real.
- What we gave up: Vendor-lock-in risk to GCP. A GCP pricing surprise, a GCP regional outage, or an Iran-availability change forces a port.
- Mitigation: Hexagonal architecture — every infrastructure dependency is behind a port. A future port to AWS or Azure exercises new adapters, not domain rewrites. Plan B IaC kept current for a co-located fallback (esp. for Iran residency). Quarterly FinOps review.
- Cross-reference: ADR-0001 §6; R-MEL-009, R-MEL-302, R-MEL-604.
- Watchpoint to revisit: GCP regional outage > 4h; pricing change > 20% on a hot service; sanctions blocking GCP availability in a target market; an enterprise tenant whose contract forbids GCP.
TR-MEL-04 — Single AI gateway with provider routing
- Alternative considered: Direct calls from each service to its preferred model provider.
- Decision: Single `ai-orchestrator-service` as the only egress to Vertex AI or any external provider; ONNX Runtime on the desktop is the only edge inference allowed.
- Why we chose this: Cost control (per-tenant budgets, per-feature quotas, prompt-hash caching live in one place), provenance (every AI artifact has `{ model, version, promptId, traceId, reviewedBy?, local }` from the gateway), HITL governance (one place to enforce that irreversible AI actions go through a human), and vendor-portability (multi-provider routing without service-by-service refactor).
- What we gave up: Latency overhead of a centralized hop, a central failure point, and the simplicity of "just call the SDK".
- Mitigation: Min instances ≥ 1 per region; warm-up endpoints; multi-region deployment from R2; per-feature circuit breakers; explicit "AI degraded" UX that hides AI affordances rather than fabricating output.
- Cross-reference: docs/08-ai-architecture.md; R-MEL-013, R-MEL-503, R-MEL-506.
- Watchpoint to revisit: sustained gateway latency p95 > 1.5 s; gateway availability < 99.9%; new feature where centralization is provably worse for cost or latency.
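The provenance record attached by the gateway can be typed directly from the field list in this entry, and a guard can refuse persistence when it is incomplete (the control listed for R-MEL-510). The field types, the reading of `local` as "produced by on-device inference", and the function `requireProvenance` are assumptions.

```typescript
interface Provenance {
  model: string;
  version: string;
  promptId: string;
  traceId: string;
  reviewedBy?: string; // optional: set once a human has reviewed the artifact
  local: boolean;      // assumed meaning: true for on-device (edge) inference
}

// Throws unless every mandatory provenance field is present, so callers
// cannot persist an AI artifact with an audit gap.
function requireProvenance(p: Partial<Provenance> | undefined): Provenance {
  if (p === undefined) throw new Error("AI artifact has no provenance");
  const { model, version, promptId, traceId, local } = p;
  if (!model || !version || !promptId || !traceId || typeof local !== "boolean") {
    throw new Error("AI artifact provenance is incomplete");
  }
  const complete: Provenance = { model, version, promptId, traceId, local };
  if (p.reviewedBy !== undefined) complete.reviewedBy = p.reviewedBy;
  return complete;
}
```

Placing the guard in the domain layer rather than in each repository keeps the rule in one place, in the same spirit as routing all AI egress through one service.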
TR-MEL-05 — Single React Native consumer mobile for browse + post-booking
- Alternative considered: Two apps — a browse-and-book consumer app and a post-booking management app for guests during stay.
- Decision: One React Native consumer app with feature-flag gating per tenant for in-stay management.
- Why we chose this: Shared codebase, shared design tokens, shared auth, single store presence, lower acquisition cost.
- What we gave up: Bundle size and complexity in a single binary; per-tenant store-presence customization is harder; the in-stay surface lives at the mercy of the consumer-app release cadence.
- Mitigation: Feature flags per tenant; lazy-loaded modules per surface; in-stay UI behind a tab that does not affect cold-start.
- Cross-reference: docs/frontend/01-web-and-mobile-specification.md.
- Watchpoint to revisit: in-stay surface drives > 30% of app sessions and competes with browse for screen estate; tenant requests for white-label mobile presence (R3 reseller program may force a split).
TR-MEL-06 — PostgreSQL as default datastore
- Alternative considered: Polyglot persistence — Cassandra for write-heavy aggregates, ElasticSearch for search, dedicated vector DB.
- Decision: PostgreSQL on Cloud SQL for OLTP, with pgvector for embeddings and Postgres GIN/GIST indexes for search where feasible. Firestore for sync cursors only. BigQuery for analytical sink. Cloud Storage for blobs.
- Why we chose this: Operational simplicity; one set of backup, restore, IAM, RLS, migration patterns; team expertise; mature tooling; Cloud SQL HA managed.
- What we gave up: Pure write throughput at scale; some workloads (vector search at very large scale, full-text search at scale) may force later movement to specialized stores.
- Mitigation: Read replicas; per-aggregate index discipline documented in `DATA_MODEL.md`; pgvector partitioned per tenant; OpenSearch deferred unless evidence forces it.
- Cross-reference: docs/06-data-models.md; R-MEL-011.
- Watchpoint to revisit: vector index size > 50% of DB size on hot service; full-text query p95 > 300 ms after index tuning; tenant-cohort write throughput > 5k TPS sustained.
TR-MEL-07 — No GraphQL on BFFs (REST only)
- Alternative considered: GraphQL gateway (Apollo or similar) at the BFF layer.
- Decision: REST + BFF.
- Why we chose this: Tooling familiarity, simpler edge cache (HTTP cache headers do real work), smaller dependency surface, easier observability (per-route metrics), and the surfaces have small enough response shapes that GraphQL's flexibility does not pay for itself.
- What we gave up: Per-surface query flexibility; some BFF endpoints will be chatty for surfaces with deep relations.
- Mitigation: Per-surface BFF resolvers can be added without touching domain services; GraphQL is not banned for internal exploration if a future surface (e.g., advanced reporting) has a strong fit.
- Cross-reference: docs/05-api-design.md; ADR-0001 Alternatives table.
- Watchpoint to revisit: any BFF endpoint averaging > 5 round-trips per page-load over a quarter; reporting surface in R3 demands flexible aggregation queries.
TR-MEL-08 — No native iOS / Android backoffice in Phase 1
- Alternative considered: Native staff app on iOS and Android in parallel with the desktop.
- Decision: Electron desktop only in R1. React Native consumer app does not carry staff workflows. A React Native staff sub-mode is deferred to R3.
- Why we chose this: Cost. Two more codebases to ship, two more app-store cycles to maintain, two more security audits. The desktop covers the operational core; mobile is for the consumer.
- What we gave up: Field-ops mobility — a housekeeper updating room status from the room itself, a maintenance technician from the basement.
- Mitigation: The desktop UI is touch-friendly so a tablet works; the consumer app's offline cache covers the in-stay guest case; R3 plan includes a React Native staff sub-mode and a kiosk mode.
- Cross-reference: docs/frontend/01-web-and-mobile-specification.md; R-MEL-403 (deferred).
- Watchpoint to revisit: > 30% of housekeeping operators using the tablet form-factor over a quarter; R3 reseller channel demands a mobile staff app.
TR-MEL-09 — Single Electron desktop binary per tenant install
- Alternative considered: Multi-tenant binary with chain-operator switcher in R1.
- Decision: Single-tenant install in R1; chain multi-tenant switcher added in R2.
- Why we chose this: Simpler ops in R1 (one device = one tenant = one keychain entry = one sync cursor); the chain-operator persona is a small fraction of R1 tenants.
- What we gave up: Friction for chain operators in R1 — they install per-property.
- Mitigation: Documented per-property install playbook; chain-switcher is a R2 commitment.
- Cross-reference: docs/frontend/desktop/06-desktop-app-specification.md.
- Watchpoint to revisit: > 10% of R1 tenants are chain operators; chain operator pilots start before R2.
TR-MEL-10 — Custom tenant booking flow config (declarative)
- Alternative considered: Full WYSIWYG theme editor with arbitrary HTML/CSS per tenant.
- Decision: Declarative configuration — token model + layout presets + content blocks. No arbitrary HTML.
- Why we chose this: Cheaper to build; protects accessibility (presets are reviewed); protects performance (no tenant ships an unbounded asset); protects security (no tenant injects a script); covers the 90% of customization needs we observe in the target market.
- What we gave up: The 10% of customization needs that require arbitrary markup. Some tenants will ask for "but my website has X".
- Mitigation: Content-block library expands per quarter based on tenant requests; presets grow from 3 (R1) to 8+ (R2); R3 introduces an advanced "block authoring" surface for tenants who pass a vetting gate.
- Cross-reference: docs/frontend/02-theming-and-tenant-config.md; R-MEL-208.
- Watchpoint to revisit: > 20% of tenant onboarding requests blocked by missing block; competitive tenant lost on theming flexibility.
TR-MEL-11 — Cash-on-arrival as a first-class payment method
- Alternative considered: Cash-on-arrival as a workaround under "manual" or "offline" payment.
- Decision: Cash-on-arrival is a first-class method with a full reconciliation surface, drawer accounting, audit trail, and FX-aware folio handling.
- Why we chose this: It is the dominant rail in our beachhead markets. Treating it as a workaround would mean treating our majority customer as an exception.
- What we gave up: Accounting complexity in `billing-service` and `payment-gateway-service`; reconciliation features that competitors do not need to build.
- Mitigation: Drawer-close enforced on logout; auto-close at midnight; EOD variance reporting; per-operator override audit.
- Cross-reference: docs/10-payments-architecture.md; R-MEL-101, R-MEL-406.
- Watchpoint to revisit: card share in target market crosses 50% of bookings; regulator mandates electronic-only payments.
TR-MEL-12 — Hexagonal architecture as a pre-paid escape hatch
- Alternative considered: Direct framework / cloud calls in services for speed.
- Decision: Hexagonal everywhere. Domain is framework-free. Every infra dependency is a port.
- Why we chose this: Cheap insurance for cloud port (R-MEL-302), vendor swap (R-MEL-601, R-MEL-603), and AI provider swap (R-MEL-506). The cost is real but bounded; the option value across a 5-year horizon is large.
- What we gave up: Some boilerplate; some code that "just works" with the framework's defaults must be plumbed through a port.
- Mitigation: Service template enforces structure; review checklist flags direct framework use in domain.
- Cross-reference: ADR-0001 §7; R-MEL-302, R-MEL-601, R-MEL-603.
- Watchpoint to revisit: team reports hexagonal overhead > 10% of feature time on a service; ADR proposing exception.
5. Mitigation Catalog
These are the named, reusable mitigation patterns referenced from the risk register above. Each pattern is a one-page protocol with an owner. The full pattern docs live in docs/standards/mitigations/; this catalog is the index.
MIT-01 — RLS test pattern
For every PR touching SQL, repositories, or BFF route handlers: a two-tenant fixture is loaded; an authenticated request as tenant A hits every endpoint reachable from the BFF; the response is asserted to contain zero rows belonging to tenant B. The test harness fails the build on any cross-tenant leak. Maintained by the Platform Lead.
Used by: R-MEL-001, R-MEL-008.
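The core assertion of this pattern can be sketched as follows. This is a minimal illustration, assuming every tenant-scoped row exposes a tenantId field; the TenantRow shape and the function names are hypothetical and are not taken from the real harness.

```typescript
// Hypothetical row shape: every tenant-scoped row carries its tenant id.
interface TenantRow {
  tenantId: string;
  [field: string]: unknown;
}

// Returns the rows that do not belong to the requesting tenant.
function findCrossTenantLeaks(requestingTenant: string, rows: TenantRow[]): TenantRow[] {
  return rows.filter((row) => row.tenantId !== requestingTenant);
}

// Throws if a response contains any foreign-tenant row; in CI this
// exception is what would fail the build.
function assertNoCrossTenantLeak(
  endpoint: string,
  requestingTenant: string,
  rows: TenantRow[],
): void {
  const leaks = findCrossTenantLeaks(requestingTenant, rows);
  if (leaks.length > 0) {
    throw new Error(`${endpoint}: ${leaks.length} row(s) leaked from another tenant`);
  }
}
```

In a real harness the rows would come from each BFF endpoint called as tenant A against the two-tenant fixture; here they are passed in directly to keep the sketch self-contained.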
MIT-02 — Sync conflict UX pattern
When a per-aggregate merge produces a conflict that requires operator decision: the desktop app surfaces a tray notification with a side-by-side diff; the operator sees "your version", "server version", and a "merged proposal"; the operator picks; a pre-merge backup is retained for 7 days. Conflicts that the merge engine resolves automatically are logged but not surfaced. The conflict UI is unified across aggregates so operators learn it once.
Used by: R-MEL-003.
MIT-03 — AI HITL gate pattern
Every AI action in the irreversible or guest-facing class flows through a single HITL component: the AI suggestion is rendered alongside its provenance; the user sees a "Try this?" CTA (not "Apply"); accepting and rejecting both write telemetry; accepted suggestions are persisted with provenance attached. Bypassing HITL requires an explicit per-tenant policy override and emits an audit event.
Used by: R-MEL-501, R-MEL-502, R-MEL-509.
MIT-04 — Encoder failure-fallback pattern
When the lock vendor adapter fails to issue a key during check-in: the desktop offers an immediate fallback path — issue a mechanical key and capture the lock event for later sync; the front-desk operator is shown the next-step protocol; the failure is queued for retry; an alert goes to the GM. Check-in is never blocked by encoder failure.
Used by: R-MEL-104, R-MEL-601.
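The control flow of the fallback can be sketched as below. This is an illustrative sketch only: the LockEncoderPort interface, the outcome type, and the in-memory retry and alert queues are assumptions, and the encoder call is synchronous for brevity where a real vendor adapter would almost certainly be asynchronous.

```typescript
// Hypothetical port for the lock vendor adapter; issueKey throws on failure.
interface LockEncoderPort {
  issueKey(roomId: string): string;
}

type KeyOutcome =
  | { kind: "encoded"; credential: string }
  | { kind: "mechanical-fallback"; roomId: string };

// Never throws: an encoder failure falls back to a mechanical key,
// queues the room for retry, and records an alert for the GM.
function issueKeyAtCheckIn(
  roomId: string,
  encoder: LockEncoderPort,
  retryQueue: string[],
  gmAlerts: string[],
): KeyOutcome {
  try {
    return { kind: "encoded", credential: encoder.issueKey(roomId) };
  } catch {
    retryQueue.push(roomId);
    gmAlerts.push(`Encoder failure for room ${roomId}; mechanical key issued`);
    return { kind: "mechanical-fallback", roomId };
  }
}
```

The point of returning a tagged outcome instead of throwing is that the check-in caller cannot be blocked by the encoder: both branches yield a usable result.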
MIT-05 — Override audit pattern
Every operator override (rate, folio adjustment, late checkout, manual key issuance) emits a <service>.override.applied.v1 event with a closed-list reason code, free-text justification, operator id, and traceparent. Aggregations roll up to the GM dashboard with per-operator and per-property trend lines. Overrides above a threshold require a 4-eyes approval.
Used by: R-MEL-102, R-MEL-203.
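A possible shape for the override event is sketched below. The reason codes, the amount field, and the threshold parameter are hypothetical placeholders; this document does not list the real closed list or say how the 4-eyes threshold is measured.

```typescript
// Hypothetical closed list; the real reason codes are not given here.
const OVERRIDE_REASONS = ["GUEST_COMPLAINT", "RATE_MATCH", "SYSTEM_ERROR"] as const;
type OverrideReason = (typeof OVERRIDE_REASONS)[number];

interface OverrideAppliedEvent {
  type: string; // "<service>.override.applied.v1"
  reasonCode: OverrideReason;
  justification: string;
  operatorId: string;
  traceparent: string;
  requiresFourEyes: boolean;
}

function isOverrideReason(code: string): code is OverrideReason {
  return OVERRIDE_REASONS.some((reason) => reason === code);
}

// Builds the event, rejecting any reason code outside the closed list.
// "amount" and "fourEyesThreshold" are assumed to share one unit.
function buildOverrideEvent(input: {
  service: string;
  reasonCode: string;
  justification: string;
  operatorId: string;
  traceparent: string;
  amount: number;
  fourEyesThreshold: number;
}): OverrideAppliedEvent {
  if (!isOverrideReason(input.reasonCode)) {
    throw new Error(`Unknown override reason code: ${input.reasonCode}`);
  }
  return {
    type: `${input.service}.override.applied.v1`,
    reasonCode: input.reasonCode,
    justification: input.justification,
    operatorId: input.operatorId,
    traceparent: input.traceparent,
    requiresFourEyes: input.amount > input.fourEyesThreshold,
  };
}
```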
MIT-06 — Per-tenant AI budget pattern
Every tenant has a monthly AI budget per feature. At 80% the system soft-degrades (cheaper model, less context); at 100% it hard-stops (HITL only, no automated AI calls) until top-up or new period. Cost dashboard shows daily burn vs. budget per tenant per feature. The default-off setting on net-new AI features prevents quiet adoption.
Used by: R-MEL-503.
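The two thresholds reduce to a small pure function. The sketch below assumes spend and budget are tracked as integers in the same minor currency unit; the names are illustrative, and treating a zero or negative budget as a hard stop is an assumption rather than a stated rule.

```typescript
type BudgetMode = "normal" | "soft-degrade" | "hard-stop";

// Maps month-to-date spend to the degradation mode described above:
// soft-degrade from 80% of budget, hard-stop from 100%.
// Integer comparison avoids floating-point edge cases at the thresholds.
function budgetMode(spentMinorUnits: number, budgetMinorUnits: number): BudgetMode {
  if (budgetMinorUnits <= 0) return "hard-stop"; // assumption: no budget means no automated calls
  if (spentMinorUnits >= budgetMinorUnits) return "hard-stop";
  if (spentMinorUnits * 100 >= budgetMinorUnits * 80) return "soft-degrade";
  return "normal";
}
```

A caller would evaluate this per tenant and per feature before each automated AI call, switching to the cheaper model in soft-degrade and routing to HITL only in hard-stop.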
MIT-07 — Vendor adapter contract test pattern
Every adapter (lock, payment, AI provider, notification) has a contract test suite that runs nightly against the vendor's sandbox. Failures page the vendor owner. The contract test asserts every method we use; vendor-side breaking changes are detected before they hit production.
Used by: R-MEL-601, R-MEL-602, R-MEL-603, R-MEL-607, R-MEL-608.
MIT-08 — Offline grace-period warning pattern
The desktop app monitors the freshness of every cached capability that has a vendor-side time bound: token refresh window, TTLock dynamic-code window, key-credential horizon. At 75% and 90% of the configured window, an in-app banner warns the operator; at 90%, an email is sent to the GM. At 100%, the affected capability degrades gracefully (mechanical-key fallback, paper-folio capture, etc.) rather than silently failing.
Used by: R-MEL-103.
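The escalation ladder can be expressed as a single function over elapsed time and the configured window. This is a sketch under the thresholds stated above; the action names are hypothetical, and treating a non-positive window as already expired is an assumption.

```typescript
type GraceAction = "none" | "banner" | "banner-and-gm-email" | "degrade";

// Returns the action for a cached capability given how much of its
// vendor-side window has elapsed: banner at 75%, banner plus GM email
// at 90%, graceful degradation at 100%.
function graceAction(elapsedMs: number, windowMs: number): GraceAction {
  if (windowMs <= 0) return "degrade"; // assumption: an unset window counts as expired
  if (elapsedMs >= windowMs) return "degrade";
  if (elapsedMs * 100 >= windowMs * 90) return "banner-and-gm-email";
  if (elapsedMs * 100 >= windowMs * 75) return "banner";
  return "none";
}
```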
MIT-09 — Provenance enforcement pattern
The domain layer of every service that consumes AI output refuses to persist an AI artifact without a complete { model, version, promptId, traceId, reviewedBy?, reviewedAt?, local } provenance object. The TypeScript type system enforces this at compile time; the database constraint enforces it at write time. The export path always includes provenance.
Used by: R-MEL-510.
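A minimal sketch of the compile-time and runtime halves of this rule follows. The field names come from the provenance object above; the field types (strings, an ISO timestamp for reviewedAt, a boolean for local) are assumptions, as are the function names and the in-memory store standing in for the database.

```typescript
interface Provenance {
  model: string;
  version: string;
  promptId: string;
  traceId: string;
  reviewedBy?: string;
  reviewedAt?: string; // assumed ISO-8601 timestamp
  local: boolean; // assumed: true when produced by an on-device model
}

// Runtime guard for data crossing a trust boundary (sync payloads, imports).
function isCompleteProvenance(value: unknown): value is Provenance {
  if (typeof value !== "object" || value === null) return false;
  const candidate = value as Record<string, unknown>;
  const requiredStrings = ["model", "version", "promptId", "traceId"];
  const stringsPresent = requiredStrings.every((key) => {
    const field = candidate[key];
    return typeof field === "string" && field.length > 0;
  });
  return stringsPresent && typeof candidate["local"] === "boolean";
}

// The domain layer refuses to persist an artifact without full provenance.
function persistAiArtifact(store: unknown[], content: string, provenance: unknown): void {
  if (!isCompleteProvenance(provenance)) {
    throw new Error("Refusing to persist AI artifact without complete provenance");
  }
  store.push({ content, provenance });
}
```

Code that constructs a Provenance literal directly gets the compile-time check from the interface; the guard covers values whose type is not known statically.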
MIT-10 — Lost-device protocol
When a device is reported lost: the device key is revoked server-side; the next sync attempt from that device fails; the SQLCipher store is unreadable on a different machine because the OS keychain that unlocks it is bound to the original machine; remote-wipe is queued and executes on first reconnect; re-pair on a new device requires GM out-of-band approval. The protocol is documented in the operator playbook and the security runbook.
Used by: R-MEL-202.
MIT-11 — Two-tenant CI fixture
Every service ships a two-tenant fixture (tenant A and tenant B with overlapping data shapes) used by the test harness for isolation, sync, and event tests. Fixtures are seeded into the CI Postgres and Firestore emulators per test run.
Used by: R-MEL-001, MIT-01.
MIT-12 — DR drill cadence
Every quarter, an SRE-led DR drill restores Cloud SQL from PITR into a sibling project, replays Pub/Sub from a known cursor, validates outbox idempotency, and measures RTO + RPO. Results are recorded; deltas trigger remediation.
Used by: R-MEL-009, R-MEL-010, R-MEL-109.
MIT-13 — Per-jurisdiction adapter pattern
Regulatory surfaces (guest registration, KYC, tax) are exposed as ports. Each jurisdiction has an adapter. New mandate = new adapter; no schema change, no domain rewrite. Per-tenant config selects the adapter and the field set.
Used by: R-MEL-301, R-MEL-303.
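The port-and-adapter arrangement can be sketched as below for guest registration. The jurisdiction identifiers, field names, and the GuestRegistrationPort interface are all hypothetical; the sketch only shows how a new mandate becomes a new registry entry and leaves the domain code untouched.

```typescript
// Hypothetical port: the domain only knows this interface.
interface GuestRegistrationPort {
  readonly jurisdiction: string;
  missingFields(guest: Record<string, string>): string[];
}

// Builds an adapter from a jurisdiction's required field set.
function makeAdapter(jurisdiction: string, requiredFields: string[]): GuestRegistrationPort {
  return {
    jurisdiction,
    missingFields: (guest) => requiredFields.filter((field) => !guest[field]),
  };
}

// Adding a jurisdiction means adding an entry here; nothing else changes.
const adapters = new Map<string, GuestRegistrationPort>([
  ["jurisdiction-a", makeAdapter("jurisdiction-a", ["fullName", "passportNumber"])],
  ["jurisdiction-b", makeAdapter("jurisdiction-b", ["fullName", "nationalId", "visaNumber"])],
]);

// Per-tenant config selects the adapter.
function adapterForTenant(config: { jurisdiction: string }): GuestRegistrationPort {
  const adapter = adapters.get(config.jurisdiction);
  if (!adapter) {
    throw new Error(`No guest-registration adapter for ${config.jurisdiction}`);
  }
  return adapter;
}
```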
MIT-14 — Phishing-resistant auth
WebAuthn / passkey is the default second factor; TOTP is fallback. Device binding adds a per-device cryptographic identity. Suspicious-login telemetry surfaces in the GM dashboard with one-click revoke.
Used by: R-MEL-201, R-MEL-206.
MIT-15 — Outbox + idempotency
Every state-changing API call carries an idempotency key derived from {device_id, local_aggregate_version, mutation_seq} (desktop) or {client_id, request_id} (web). The server-side outbox table dedupes; replays on retry are no-ops. Idempotency keys live in Postgres, not Redis (R-MEL-014).
Used by: R-MEL-002, R-MEL-014.
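The key derivation and the replay-as-no-op behaviour can be sketched as follows. The key string format and the class name are assumptions, and the in-memory Map is only a stand-in for the Postgres-backed table so that the example runs on its own.

```typescript
// Desktop key: derived from {device_id, local_aggregate_version, mutation_seq}.
function desktopIdempotencyKey(
  deviceId: string,
  localAggregateVersion: number,
  mutationSeq: number,
): string {
  return `desktop:${deviceId}:${localAggregateVersion}:${mutationSeq}`;
}

// Web key: derived from {client_id, request_id}.
function webIdempotencyKey(clientId: string, requestId: string): string {
  return `web:${clientId}:${requestId}`;
}

// In-memory stand-in for the server-side dedupe table.
class IdempotentExecutor<T> {
  private readonly results = new Map<string, T>();

  // Runs the mutation once per key; a replay returns the stored result
  // without running the mutation again.
  execute(key: string, mutation: () => T): { result: T; replayed: boolean } {
    if (this.results.has(key)) {
      return { result: this.results.get(key) as T, replayed: true };
    }
    const result = mutation();
    this.results.set(key, result);
    return { result, replayed: false };
  }
}
```

In production the lookup and the state change would need to commit in one database transaction; the sketch does not model that.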
6. Risk Review Cadence & Governance
6.1 Cadence
| Cadence | Scope | Owner | Output |
|---|---|---|---|
| Per-PR | Service-level risks touched by the change | Service owner | PR description references affected risk IDs; CI runs targeted mitigations (RLS test, contract test) |
| Per-release pre-flight | Every risk with score ≥ 7 | Release Captain | Go / hold decision per release; mitigation status snapshot |
| Monthly | Platform-wide; every open risk | Platform Lead | Status update per row; new risks from incidents promoted into register |
| Quarterly | Full register + every tradeoff | CTO + Security Lead + Compliance | Re-justify each "accepted" item; close obsolete; create ADRs for changed decisions |
| Per-incident | Risks the incident materialized | Incident Commander | Postmortem identifies which risks fired; proposes new ones; updates mitigations |
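The per-release pre-flight selection can be sketched as a filter over the register. The RiskEntry shape and the mitigation status values are hypothetical; only the score ≥ 7 threshold comes from the table above, and the go / hold decision itself stays with the Release Captain.

```typescript
// Hypothetical register row; the real register has more columns.
interface RiskEntry {
  id: string;
  score: number;
  owner: string;
  mitigationStatus: "planned" | "in-progress" | "in-place";
}

const PREFLIGHT_SCORE_THRESHOLD = 7;

// Returns the mitigation status snapshot for the pre-flight review:
// every risk with score >= 7, highest score first, ties broken by id.
function preflightSnapshot(register: RiskEntry[]): RiskEntry[] {
  return register
    .filter((risk) => risk.score >= PREFLIGHT_SCORE_THRESHOLD)
    .sort((a, b) => b.score - a.score || a.id.localeCompare(b.id));
}
```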
6.2 Roles and ownership
- Platform Lead — owner of register; final say on risk scoring; chairs monthly review.
- Security Lead — owner of all security-category risks; chairs security pen-test cadence.
- AI Lead — owner of AI-category risks; chairs eval cadence and provider review.
- SRE Lead — owner of operational and infra risks; chairs DR drill cadence.
- Compliance Lead — owner of regulatory risks; chairs jurisdictional review with local counsel.
- Service owner — owner of service-level risks; reflects platform-level risks into SERVICE_RISK_REGISTER.md.
- Release Captain — rotating role; runs the per-release pre-flight against this register.
6.3 ADR creation policy
A tradeoff is changed (a TR-MEL-xx entry is reversed, narrowed, or replaced) only via an ADR. The ADR cites the previous tradeoff, the trigger, the new decision, the new mitigation, and the watchpoint that would force a future revisit. The risk register is updated in the same PR.
6.4 Postmortem feedback loop
Every postmortem template includes:
- Which R-MEL-xxx risks materialized?
- Which mitigations were ineffective?
- What new risks does this incident reveal?
- Which TR-MEL-xx tradeoffs are implicated?
These items are merged into this register in the same week as the postmortem. The register is a living document.
6.5 Why this matters
A risk register that is only used as theatre is worse than no register: it lulls the team into thinking the risks are managed. The cadence above makes the register real. Every entry is owned, dated, and re-examined. Tradeoffs are not "design intuitions" — they are written down with their alternatives, their rationales, and the watchpoints that would force us to reverse them. When someone asks "why did we do X instead of Y?", the answer is in this document.
7. Cross-References
- Architectural realization of every mitigation surface: docs/02-enterprise-architecture.md
- Security and tenancy contract underlying many mitigations: docs/07-security-compliance-tenancy.md
- Lock-integration deep spec backing R-MEL-104, R-MEL-204, R-MEL-205, R-MEL-601: docs/09-lock-and-key-integration.md
- Payments deep spec backing R-MEL-403, R-MEL-405, R-MEL-602, R-MEL-307: docs/10-payments-architecture.md
- AI architecture backing R-MEL-501..R-MEL-511: docs/08-ai-architecture.md
- Roadmap with release-by-release risk emphasis: docs/roadmap/README.md
- ADR index: docs/architecture/
This document supersedes any prior risk discussion in service bundles; service-level SERVICE_RISK_REGISTER.md files inherit and extend the entries here, never contradict them.