tenant-service — SERVICE_RISK_REGISTER
Living document, reviewed monthly. Each risk has: ID, description, likelihood (L/M/H), impact (L/M/H), severity (L×I), mitigation, owner, status, last-review date.
1. Top Risks
R-001 — Cross-Tenant Data Leak via Missed RLS
- Description: A new query path or migration omits WHERE tenant_id = ? and the RLS policy is bypassed (e.g. a BYPASSRLS admin path used outside its scope).
- Likelihood: Low (defense in depth + CI gate)
- Impact: High (breach; reputation; regulatory)
- Severity: High
- Mitigation: Two-tenant simulator on every PR; nightly canary; RLS-on by default; BYPASSRLS granted only to a separate Postgres role used by scheduled jobs; PR template asks "did you touch RLS or admin queries?"; security review required for any RLS change.
- Owner: service tech lead + security on-call
- Status: monitored
- Last review: 2026-04-15
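The two-tenant simulator idea in R-001 can be sketched as a check that runs a query path as one tenant and reports any rows that belong to another. This is a minimal illustration, not the service's actual simulator: the Row shape, the TenantQuery signature, and the in-memory table are all assumptions.

```typescript
// Hypothetical row shape and query signature; not taken from tenant-service.
type Row = { tenantId: string; payload: string };
type TenantQuery = (tenantId: string) => Row[];

// Returns the rows a query leaked from other tenants (empty means no leak).
function findLeakedRows(query: TenantQuery, actingTenant: string): Row[] {
  return query(actingTenant).filter((row) => row.tenantId !== actingTenant);
}

// Two simulated tenants sharing one table.
const table: Row[] = [
  { tenantId: "tenant-a", payload: "a-1" },
  { tenantId: "tenant-b", payload: "b-1" },
];

// A correctly scoped path filters by tenant; the buggy path omits the predicate.
const scopedQuery: TenantQuery = (t) => table.filter((r) => r.tenantId === t);
const unscopedQuery: TenantQuery = (_t) => table.slice();
```

A CI gate built on this shape would fail the PR whenever findLeakedRows returns a non-empty list for any registered query path.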
R-002 — Role Escalation via Bug in PolicyEngine
- Description: A bug in PolicyEngine or RoleEscalationGuard allows a member to assign a role they do not hold.
- Likelihood: Low
- Impact: High (privilege escalation across the platform)
- Severity: High
- Mitigation: Property-based ABAC fuzz tests; > 200 hand-curated unit cases; AuthZ matrix snapshot diffed every PR; RBAC matrix in SECURITY_MODEL §3.1 is authoritative.
- Owner: service tech lead
- Status: monitored
- Last review: 2026-04-15
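The escalation property behind R-002 can be checked exhaustively over a small role universe. The sketch below assumes the rule implied by the description (an actor may grant only a role the actor already holds). The assignRole function and the role names used with it are illustrative, not the real PolicyEngine or RoleEscalationGuard API.

```typescript
type Role = string;

// Assumed rule, inferred from the risk description: an actor may grant a
// role only if the actor already holds it. Otherwise the target is unchanged.
function assignRole(actor: Role[], target: Role[], role: Role): Role[] {
  if (!actor.includes(role) || target.includes(role)) return target.slice();
  return [...target, role];
}

// All subsets of a small role universe, used as exhaustive test input.
function subsets(universe: Role[]): Role[][] {
  const out: Role[][] = [];
  for (let mask = 0; mask < (1 << universe.length); mask++) {
    out.push(universe.filter((_, i) => (mask & (1 << i)) !== 0));
  }
  return out;
}

// Property: after any assignment, every role on the target was already held
// by the actor or by the target, so no new privilege appears.
function noEscalation(universe: Role[]): boolean {
  const all = subsets(universe);
  return all.every((actor) =>
    all.every((target) =>
      universe.every((role) =>
        assignRole(actor, target, role).every(
          (r) => actor.includes(r) || target.includes(r),
        ),
      ),
    ),
  );
}
```

A property-based fuzzer would generate the actor and target role sets randomly instead of enumerating them, which scales to larger role catalogs.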
R-003 — Last-Owner Removal Race Leaving Tenant Unmanaged
- Description: Concurrent owner removals both succeed, leaving a tenant with zero owners.
- Likelihood: Low
- Impact: Medium (operational lock-out; recoverable via super-admin)
- Severity: Medium
- Mitigation: SERIALIZABLE isolation on remove path; OwnerProtectionService re-counts in-tx; integration test for concurrent removes.
- Owner: service tech lead
- Status: mitigated
- Last review: 2026-04-10
R-004 — Tenant Cascade Delete Stalled by Unresponsive Downstream
- Description: A downstream service fails to ack the close-tenant saga, leaving PII in place beyond the GDPR window.
- Likelihood: Medium (we have ten downstream services to coordinate)
- Impact: High (regulatory)
- Severity: High
- Mitigation: Saga timeout alerts at day 3 and day 6; per-service "force cascade" runbook; quarterly full-cascade rehearsal in staging; explicit SLA contract from each consuming service.
- Owner: platform tech lead
- Status: monitored
- Last review: 2026-04-05
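The day-3 and day-6 saga timeout alerts in R-004 reduce to a small function of elapsed time and outstanding acks. This is a sketch under assumptions: the alert level names and the function signature are hypothetical, and real alerting would likely live in the monitoring stack rather than in application code.

```typescript
// Hypothetical alert levels for the close-tenant saga.
type SagaAlert = "none" | "day3-warning" | "day6-critical";

const DAY_MS = 24 * 60 * 60 * 1000;

// pendingServices lists downstream services that have not yet acked.
function sagaAlertLevel(startedAt: Date, now: Date, pendingServices: string[]): SagaAlert {
  if (pendingServices.length === 0) return "none"; // saga is complete
  const elapsedDays = (now.getTime() - startedAt.getTime()) / DAY_MS;
  if (elapsedDays >= 6) return "day6-critical";
  if (elapsedDays >= 3) return "day3-warning";
  return "none";
}
```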
R-005 — PDP Outage Cascades to Platform-Wide Write Outage
- Description: tenant-service outage takes the platform's PDP offline; downstream services fail closed and writes stop platform-wide.
- Likelihood: Low
- Impact: High (platform-wide)
- Severity: High
- Mitigation: Min 2 instances per region; multi-region; aggressive HPA; PDP-emergency runbook (cache widening + revision pinning); 99.99 % SLO with burn-rate alerts at 14× / 30 min.
- Owner: service tech lead + platform tech lead
- Status: monitored
- Last review: 2026-04-15
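For the burn-rate alert in R-005, the arithmetic can be sketched as follows. It reads "14× / 30 min" as a burn-rate threshold of 14 evaluated over a 30-minute window, which is an interpretation rather than a confirmed alert definition. Burn rate is the observed error ratio divided by the error budget (1 minus the SLO target).

```typescript
const SLO_TARGET = 0.9999;       // 99.99 % from the register
const BURN_RATE_THRESHOLD = 14;  // assumed meaning of "14x"

// Burn rate = observed error ratio / error budget.
function burnRate(failed: number, total: number, sloTarget: number = SLO_TARGET): number {
  if (total === 0) return 0; // no traffic, nothing burned
  const errorBudget = 1 - sloTarget;
  return failed / total / errorBudget;
}

// failed30m and total30m are request counts over the trailing 30 minutes.
function shouldPage(failed30m: number, total30m: number): boolean {
  return burnRate(failed30m, total30m) >= BURN_RATE_THRESHOLD;
}
```

With a 99.99 % target the budget is 0.01 % of requests, so 20 failures in 10,000 requests is a burn rate of roughly 20 and would page, while 5 in 10,000 (roughly 5) would not.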
R-006 — Invitation Token Compromise
- Description: Invitation token leaked via email scraping, screenshot, or transport interception.
- Likelihood: Medium (email is not end-to-end encrypted)
- Impact: Medium (single-tenant compromise; mitigated by single-use + short TTL)
- Severity: Medium
- Mitigation: 256-bit entropy; SHA-256 hash storage; constant-time compare; per-IP per-invitationId rate limit; auto-revoke after 50 failed attempts; magic-link delivery via notification-service (provider with TLS; DKIM/SPF/DMARC enforced); operator can resend (does not extend TTL).
- Owner: security on-call
- Status: monitored
- Last review: 2026-04-12
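Two of the R-006 mitigations, constant-time compare and auto-revoke after 50 failed attempts, can be sketched together. The verifier factory below is hypothetical and takes digests that are assumed to be computed elsewhere (SHA-256 of the token). It keeps state in memory only for illustration.

```typescript
// Compares two digests without an early exit on the first mismatch, so
// timing does not reveal how many leading bytes were correct.
function constantTimeEqual(a: Uint8Array, b: Uint8Array): boolean {
  if (a.length !== b.length) return false; // digest length is not secret
  let diff = 0;
  for (let i = 0; i < a.length; i++) {
    diff |= a[i] ^ b[i];
  }
  return diff === 0;
}

const MAX_FAILED_ATTEMPTS = 50;

// Hypothetical per-invitation verifier; storedDigest is the stored token hash.
function createInvitationVerifier(storedDigest: Uint8Array) {
  let failed = 0;
  let revoked = false;
  return {
    // True only for a matching digest on an invitation that is not revoked.
    verify(candidateDigest: Uint8Array): boolean {
      if (revoked) return false;
      if (constantTimeEqual(storedDigest, candidateDigest)) return true;
      failed += 1;
      if (failed >= MAX_FAILED_ATTEMPTS) revoked = true;
      return false;
    },
    isRevoked(): boolean {
      return revoked;
    },
  };
}
```

On Node.js, crypto.timingSafeEqual is the usual choice for the comparison. The hand-written loop is shown only to make the technique visible.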
R-007 — ltree Path Corruption on Move
- Description: A bug in MoveProperty saga leaves the org tree with an invalid path, breaking ancestor queries.
- Likelihood: Low
- Impact: Medium (some property queries return wrong tree slice; recoverable by re-derivation)
- Severity: Medium
- Mitigation: OrgTreeIntegrityService validates path/cycles in-tx; nightly integrity job recomputes paths and compares; saga test covers move + integrity assertion.
- Owner: service tech lead
- Status: mitigated
- Last review: 2026-04-08
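The path and cycle checks named in R-007 can be sketched for ltree-style dot-separated paths. The OrgNode shape and the convention that a node's last path label equals its id are assumptions made for the example. The real OrgTreeIntegrityService may derive labels differently.

```typescript
// ltree-style paths are dot-separated labels, e.g. "org.north.hotel1".
// Assumption: a node's last label is its own id.
type OrgNode = { id: string; parentId: string | null; path: string };

// A move creates a cycle when the new parent is the node itself or one of
// its descendants, i.e. the parent's path equals or extends the node's path.
function moveCreatesCycle(nodePath: string, newParentPath: string): boolean {
  return newParentPath === nodePath || newParentPath.startsWith(nodePath + ".");
}

// Recomputes each expected path from the parent and returns ids that differ.
function findInvalidPaths(nodes: OrgNode[]): string[] {
  const byId = new Map<string, OrgNode>();
  nodes.forEach((n) => byId.set(n.id, n));
  return nodes
    .filter((n) => {
      if (n.parentId === null) return n.path !== n.id;
      const parent = byId.get(n.parentId);
      return parent === undefined || n.path !== `${parent.path}.${n.id}`;
    })
    .map((n) => n.id);
}
```

The trailing dot in the prefix test matters: without it, "org.northern" would be mistaken for a descendant of "org.north".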
R-008 — Slug Hijack / Phishing Resemblance
- Description: A new tenant slug closely resembles an existing one and is used to phish guests via the tenant's booking subdomain.
- Likelihood: Medium
- Impact: Medium (brand confusion; phishing risk for guests of high-profile tenants)
- Severity: Medium
- Mitigation: Levenshtein similarity check at provision; manual review for similar slugs; reserved-slug list; trademark holders may pre-claim slugs via support ticket.
- Owner: trust & safety lead
- Status: monitored
- Last review: 2026-04-15
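The Levenshtein similarity check in R-008 can be sketched with the standard two-row dynamic-programming formulation. The distance threshold of 2 is a hypothetical value chosen for the example. The register does not state the actual cutoff.

```typescript
// Edit distance: minimum number of single-character insertions, deletions,
// or substitutions needed to turn a into b.
function levenshtein(a: string, b: string): number {
  let prev: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr.push(Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost));
    }
    prev = curr;
  }
  return prev[b.length];
}

// Hypothetical threshold: slugs within maxDistance edits go to manual review.
function similarSlugs(candidate: string, existing: string[], maxDistance: number = 2): string[] {
  return existing.filter((slug) => levenshtein(candidate, slug) <= maxDistance);
}
```

Plain edit distance does not catch every lookalike (for example, homoglyphs from other scripts), which is presumably why the register pairs it with manual review and a reserved-slug list.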
R-009 — Subscription-Driven Auto-Suspend False Positive
- Description: A delayed subscription.cancelled.v1 followed by a delayed …reactivated.v1 puts the tenant into a brief suspended state.
- Likelihood: Low
- Impact: Low (brief downtime per affected tenant)
- Severity: Low
- Mitigation: 14-day grace window; reactivation event cancels pending suspend job; operator can override via super-admin; alerting on suspension/reactivation rate.
- Owner: billing tech lead
- Status: mitigated
- Last review: 2026-04-01
R-010 — Drift Between Platform Permission Registry and Per-Tenant Role Catalog
- Description: New permissions added to the canonical registry are not seeded into existing tenants, causing inconsistent UX and functionality.
- Likelihood: Medium
- Impact: Low (cosmetic + missing functionality; not a security risk)
- Severity: Low
- Mitigation: Weekly RoleCatalogReconciler opens drift report; new-permission PR template asks for the seed migration; runbook for pnpm migrate:role-catalog --tenants all.
- Owner: service tech lead
- Status: monitored
- Last review: 2026-04-10
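The drift report in R-010 amounts to a set difference between the canonical registry and each tenant's catalog. The sketch below assumes permissions are plain strings and catalogs are keyed by tenant id. The function name and data shapes are illustrative, not the RoleCatalogReconciler interface.

```typescript
type DriftEntry = { tenantId: string; missing: string[] };

// Lists, per tenant, the registry permissions not yet seeded into that
// tenant's catalog. Tenants with no drift are omitted from the report.
function roleCatalogDrift(
  registry: string[],
  tenantCatalogs: Record<string, string[]>,
): DriftEntry[] {
  const report: DriftEntry[] = [];
  for (const tenantId of Object.keys(tenantCatalogs).sort()) {
    const seeded = new Set(tenantCatalogs[tenantId] ?? []);
    const missing = registry.filter((permission) => !seeded.has(permission));
    if (missing.length > 0) report.push({ tenantId, missing });
  }
  return report;
}
```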
R-011 — Memorystore Stale Reads During Cache TTL Window
- Description: A tenant config update is published, but a downstream cache (Memorystore + service-local in-memory) serves stale data for up to TTL seconds, causing inconsistent quotes or check-in times.
- Likelihood: Medium
- Impact: Low (briefly inconsistent UX)
- Severity: Low
- Mitigation: tenant.config_updated.v1 triggers cache invalidation across regions; TTL kept low (60 s); snapshot included in event so downstream can refresh without an extra REST call.
- Owner: platform tech lead
- Status: mitigated
- Last review: 2026-04-12
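A downstream consumer's side of the R-011 mitigation might look like the sketch below: a 60-second TTL cache whose entries are replaced directly from the snapshot carried by tenant.config_updated.v1. The cache factory, its method names, and the config shape used with it are assumptions. Time is passed in explicitly to keep the example deterministic.

```typescript
const TTL_MS = 60_000; // 60 s TTL from the register

type CacheEntry<T> = { value: T; storedAt: number };

// Hypothetical service-local cache keyed by tenant id.
function createTenantConfigCache<T>() {
  const entries = new Map<string, CacheEntry<T>>();
  return {
    // Returns undefined on a miss or once the entry has reached the TTL.
    get(tenantId: string, nowMs: number): T | undefined {
      const entry = entries.get(tenantId);
      if (entry === undefined || nowMs - entry.storedAt >= TTL_MS) return undefined;
      return entry.value;
    },
    // Handler for tenant.config_updated.v1: the event's snapshot replaces
    // the cached value, so no REST call is needed to refresh.
    onConfigUpdated(tenantId: string, snapshot: T, nowMs: number): void {
      entries.set(tenantId, { value: snapshot, storedAt: nowMs });
    },
  };
}
```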
R-012 — Mass Invitation Abuse
- Description: A compromised owner account or a malicious inviter sends thousands of invitations to harvest verification, spam, or phish.
- Likelihood: Low
- Impact: Medium (deliverability impact on the platform; reputation)
- Severity: Medium
- Mitigation: Rate limit (50/hour per tenant; 5/min per IP); AI invite-abuse classifier holds suspicious sends; per-domain anomaly detection; suspension flow for abusive tenants.
- Owner: trust & safety lead
- Status: monitored
- Last review: 2026-04-15
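The two rate limits in R-012 (50/hour per tenant, 5/min per IP) can be sketched with a sliding-window limiter. Only the limits come from the register. The limiter factory and the allowInvite wrapper are hypothetical, and a production limiter would keep its counters in a shared store rather than in process memory.

```typescript
// Sliding-window limiter: allows at most `limit` hits per key in any
// window of `windowMs` milliseconds.
function createSlidingWindowLimiter(limit: number, windowMs: number) {
  const hits = new Map<string, number[]>();
  return {
    tryAcquire(key: string, nowMs: number): boolean {
      const recent = (hits.get(key) ?? []).filter((t) => nowMs - t < windowMs);
      const allowed = recent.length < limit;
      if (allowed) recent.push(nowMs);
      hits.set(key, recent);
      return allowed;
    },
  };
}

const perTenantLimiter = createSlidingWindowLimiter(50, 60 * 60 * 1000); // 50/hour
const perIpLimiter = createSlidingWindowLimiter(5, 60 * 1000);           // 5/min

// The IP check runs first so a throttled IP does not consume tenant quota.
// Note: an invite that passes the IP check but fails the tenant check still
// counts against the IP window in this simple version.
function allowInvite(tenantId: string, ip: string, nowMs: number): boolean {
  return perIpLimiter.tryAcquire(ip, nowMs) && perTenantLimiter.tryAcquire(tenantId, nowMs);
}
```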
R-013 — Cross-Region Replica Lag on Tenant Directory
- Description: Tenant provisioning is regional; the global directory replica lags, causing the gateway to reject requests for a freshly-created tenant.
- Likelihood: Low
- Impact: Low (transient; resolves within seconds)
- Severity: Low
- Mitigation: Directory lag SLO (≤ 5 s); tenant provisioning UX explicitly says "your tenant will be live in a few seconds"; gateway has a 10-second retry-with-backoff for unknown-tenant on the immediately-following request.
- Owner: platform tech lead
- Status: monitored
- Last review: 2026-04-12
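The gateway's 10-second retry-with-backoff in R-013 can be illustrated by computing an exponential delay schedule that fits inside the budget. The base delay of 250 ms and the doubling factor are assumptions. The register specifies only the 10-second total.

```typescript
// Returns exponential backoff delays (ms) whose sum stays within budgetMs.
function backoffSchedule(budgetMs: number, baseMs: number, factor: number): number[] {
  if (baseMs <= 0 || factor < 1) return []; // guard against a non-terminating loop
  const delays: number[] = [];
  let next = baseMs;
  let spent = 0;
  while (spent + next <= budgetMs) {
    delays.push(next);
    spent += next;
    next *= factor;
  }
  return delays;
}

// Hypothetical gateway settings: 10 s budget, 250 ms base, doubling.
const unknownTenantRetries = backoffSchedule(10_000, 250, 2);
```

With these values the schedule is 250, 500, 1000, 2000, and 4000 ms (7.75 s in total), which comfortably covers the 5-second directory lag SLO.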
R-014 — Schema Evolution Breaks Long-Running Sync Clients
- Description: A breaking change to a sync aggregate breaks Electron clients running an older app version.
- Likelihood: Medium
- Impact: Medium (offline backoffice users blocked until update)
- Severity: Medium
- Mitigation: Additive-only changes within a major; major version bumps run side-by-side for ≥ 90 d; client User-Agent carries app version; gateway-side compat shim translates older shapes for one minor version.
- Owner: desktop tech lead
- Status: monitored
- Last review: 2026-04-15
R-015 — AI Misclassification Holds Legitimate Invites
- Description: A false positive from the invite-abuse classifier holds invitations to legitimate hires, blocking onboarding.
- Likelihood: Medium
- Impact: Low (operator can override with one click)
- Severity: Low
- Mitigation: Always advisory; one-click override with reason; monthly false-positive review per TLD; auto-pause prompt if FPR > 10 %.
- Owner: AI orchestrator tech lead
- Status: monitored
- Last review: 2026-04-12
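The FPR threshold in R-015 can be sketched as a per-TLD check. This assumes the 10 % threshold is applied to each TLD's monthly counts and that FPR is defined as legitimate invites held divided by all legitimate invites. Neither is confirmed by the register, and the TldStats shape is hypothetical.

```typescript
// Monthly counts per TLD. falsePositives are legitimate invites the
// classifier held; trueNegatives are legitimate invites it let through.
type TldStats = { tld: string; falsePositives: number; trueNegatives: number };

const FPR_PAUSE_THRESHOLD = 0.1; // 10 % from the register

function falsePositiveRate(stats: TldStats): number {
  const legitimate = stats.falsePositives + stats.trueNegatives;
  return legitimate === 0 ? 0 : stats.falsePositives / legitimate;
}

// TLDs whose FPR strictly exceeds the threshold get the auto-pause prompt.
function tldsNeedingPausePrompt(all: TldStats[]): string[] {
  return all.filter((s) => falsePositiveRate(s) > FPR_PAUSE_THRESHOLD).map((s) => s.tld);
}
```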
2. Risk Review Cadence
- Monthly: scan all open risks; update Last review; promote/demote severity as warranted.
- After every P1 incident: add a new risk if root cause exposed a previously-unmanaged failure mode.
- Annual: full re-baseline by tech lead + security on-call; archive mitigated risks with > 6 months of clean operation.
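The cadence rules above can be sketched as two checks over register entries: one flags a risk whose last review is more than a month old, the other flags a mitigated risk with more than six months of clean operation. The RiskEntry shape and the cleanSince field are assumptions. The register itself records only status and last-review date.

```typescript
type RiskEntry = {
  id: string;
  status: "monitored" | "mitigated";
  lastReview: string; // ISO date, e.g. "2026-04-15"
  cleanSince: string; // hypothetical: date of the last related incident
};

// Adds calendar months to an ISO date, evaluated in UTC.
function addMonths(isoDate: string, months: number): Date {
  const d = new Date(isoDate + "T00:00:00Z");
  d.setUTCMonth(d.getUTCMonth() + months);
  return d;
}

// Monthly cadence: overdue once more than one month has passed.
function reviewOverdue(risk: RiskEntry, today: Date): boolean {
  return today.getTime() > addMonths(risk.lastReview, 1).getTime();
}

// Annual re-baseline: mitigated risks with > 6 months of clean operation.
function archivable(risk: RiskEntry, today: Date): boolean {
  return risk.status === "mitigated" && today.getTime() > addMonths(risk.cleanSince, 6).getTime();
}
```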