tenant-service — SERVICE_RISK_REGISTER

Living document, reviewed monthly. Each risk has: ID, description, likelihood (L/M/H), impact (L/M/H), severity (L×I), mitigation, owner, status, last-review date.

1. Top Risks

R-001 — Cross-Tenant Data Leak via Missed RLS

Description: A new query path or migration omits the WHERE tenant_id = ? filter while RLS is also bypassed (e.g. a BYPASSRLS admin path used outside its scope), so rows belonging to another tenant are returned.
Likelihood: Low (defense in depth + CI gate)
Impact: High (breach; reputation; regulatory)
Severity: High
Mitigation: Two-tenant simulator on every PR; nightly canary; RLS-on by default; BYPASSRLS granted only to a separate Postgres role used by scheduled jobs; PR template asks "did you touch RLS or admin queries?"; security review required for any RLS change.
Owner: service tech lead + security on-call
Status: monitored
Last review: 2026-04-15

R-002 — Role Escalation via Bug in PolicyEngine

Description: A bug in PolicyEngine or RoleEscalationGuard allows a member to assign a role they do not hold.
Likelihood: Low
Impact: High (privilege escalation across the platform)
Severity: High
Mitigation: Property-based ABAC fuzz tests; > 200 hand-curated unit cases; AuthZ matrix snapshot diffed every PR; RBAC matrix in SECURITY_MODEL §3.1 is authoritative.
Owner: service tech lead
Status: monitored
Last review: 2026-04-15

R-003 — Last-Owner Removal Race Leaving Tenant Unmanaged

Description: Two concurrent owner removals each see another owner still present, both succeed, and the tenant ends with zero owners.
Likelihood: Low
Impact: Medium (operational lock-out; recoverable via super-admin)
Severity: Medium
Mitigation: SERIALIZABLE isolation on remove path; OwnerProtectionService re-counts in-tx; integration test for concurrent removes.
Owner: service tech lead
Status: mitigated
Last review: 2026-04-10
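
Illustrative sketch of the in-transaction re-count described in the mitigation. The Member shape, the role names other than "owner", and the removeMember function are hypothetical; the real OwnerProtectionService is not shown in this register. The check only protects the invariant if the member rows are read inside the same SERIALIZABLE transaction that performs the delete.

```typescript
// Hypothetical shapes, for illustration only.
type Role = "owner" | "admin" | "member";
type Member = { userId: string; role: Role };
type RemoveResult = { ok: boolean; members: Member[]; reason?: string };

// Re-count owners from the rows read in the current transaction and
// refuse any removal that would leave the tenant with zero owners.
function removeMember(members: Member[], userId: string): RemoveResult {
  const target = members.find((m) => m.userId === userId);
  if (!target) return { ok: false, members, reason: "not a member" };

  const remaining = members.filter((m) => m.userId !== userId);
  const ownersLeft = remaining.filter((m) => m.role === "owner").length;

  if (target.role === "owner" && ownersLeft === 0) {
    return { ok: false, members, reason: "last owner" };
  }
  return { ok: true, members: remaining };
}
```

With two owners, the first removal succeeds and the second is refused with reason "last owner".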

R-004 — Tenant Cascade Delete Stalled by Unresponsive Downstream

Description: A downstream service fails to ack the close-tenant saga, leaving PII in place beyond the GDPR window.
Likelihood: Medium (we have ten downstream services to coordinate)
Impact: High (regulatory)
Severity: High
Mitigation: Saga timeout alerts at day 3 and day 6; per-service "force cascade" runbook; quarterly full-cascade rehearsal in staging; explicit SLA contract from each consuming service.
Owner: platform tech lead
Status: monitored
Last review: 2026-04-05

R-005 — PDP Outage Cascades to Platform-Wide Write Outage

Description: tenant-service outage takes the platform's PDP offline; downstream services fail closed and writes stop platform-wide.
Likelihood: Low
Impact: High (platform-wide)
Severity: High
Mitigation: Min 2 instances per region; multi-region; aggressive HPA; PDP-emergency runbook (cache widening + revision pinning); 99.99 % SLO with burn-rate alerts at 14× / 30 min.
Owner: service tech lead + platform tech lead
Status: monitored
Last review: 2026-04-15
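
For reference, a burn-rate alert compares the observed error ratio with the error budget implied by the SLO. The sketch below uses the 99.99 % SLO and the 14× threshold from the mitigation. The function names are hypothetical, and it assumes the failed/total counts come from the 30-minute window; the actual alert rules are not shown in this register.

```typescript
// Burn rate: how many times faster than "exactly on budget" the error
// budget is being spent. errorBudget = 1 - SLO.
function burnRate(failed: number, total: number, slo: number): number {
  if (total <= 0) return 0; // no traffic in the window, nothing burned
  return failed / total / (1 - slo);
}

// Assumed wiring: counts are taken over a 30-minute window.
function shouldAlert(
  failed: number,
  total: number,
  slo: number = 0.9999,
  threshold: number = 14,
): boolean {
  return burnRate(failed, total, slo) >= threshold;
}
```

At a 99.99 % SLO the error budget is 0.01 % of requests, so 20 failures in 10,000 requests is a burn rate of roughly 20 and would alert, while 10 failures in 10,000 (roughly 10) would not.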

R-006 — Invitation Token Compromise

Description: Invitation token leaked via email scraping, screenshot, or transport interception.
Likelihood: Medium (email is not end-to-end encrypted)
Impact: Medium (single-tenant compromise; mitigated by single-use + short TTL)
Severity: Medium
Mitigation: 256-bit entropy; SHA-256 hash storage; constant-time compare; per-IP per-invitationId rate limit; auto-revoke after 50 failed attempts; magic-link delivery via notification-service (provider with TLS; DKIM/SPF/DMARC enforced); operator can resend (does not extend TTL).
Owner: security on-call
Status: monitored
Last review: 2026-04-12
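
A minimal sketch of the first three mitigations (256-bit entropy, SHA-256 hash storage, constant-time compare) using Node's built-in crypto module. The function names and the hex encoding are assumptions for illustration; the service's actual token format is not specified here.

```typescript
import { createHash, randomBytes, timingSafeEqual } from "node:crypto";

function sha256(value: string): Buffer {
  return createHash("sha256").update(value, "utf8").digest();
}

// Issue a token with 32 random bytes (256 bits). Only the hash is meant
// to be persisted; the raw token goes into the invitation link.
function issueInvitationToken(): { token: string; tokenHash: Buffer } {
  const token = randomBytes(32).toString("hex");
  return { token, tokenHash: sha256(token) };
}

// Hash the presented token and compare in constant time.
// timingSafeEqual throws on length mismatch, hence the length guard.
function verifyInvitationToken(presented: string, storedHash: Buffer): boolean {
  const presentedHash = sha256(presented);
  return (
    presentedHash.length === storedHash.length &&
    timingSafeEqual(presentedHash, storedHash)
  );
}
```

Storing only the hash means a database read alone does not yield a usable token. Rate limiting, auto-revoke, and TTL handling are outside this sketch.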

R-007 — `ltree` Path Corruption on Move

Description: A bug in MoveProperty saga leaves the org tree with an invalid path, breaking ancestor queries.
Likelihood: Low
Impact: Medium (some property queries return wrong tree slice; recoverable by re-derivation)
Severity: Medium
Mitigation: OrgTreeIntegrityService validates path/cycles in-tx; nightly integrity job recomputes paths and compares; saga test covers move + integrity assertion.
Owner: service tech lead
Status: mitigated
Last review: 2026-04-08
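
The cycle check mentioned in the mitigation can be sketched on dot-separated ltree-style paths. The function names below are hypothetical and this is not the OrgTreeIntegrityService implementation. It only shows why a node must not be moved under itself or under one of its own descendants.

```typescript
// True when `candidate` equals `ancestor` or sits beneath it.
// The appended "." prevents "root.ab" from matching ancestor "root.a".
function isSelfOrDescendant(candidate: string, ancestor: string): boolean {
  return candidate === ancestor || candidate.startsWith(ancestor + ".");
}

// Plan moving the subtree rooted at `nodePath` under `newParentPath`.
// Returns the node's new path, or null if the move would create a cycle.
function planMove(nodePath: string, newParentPath: string): string | null {
  if (isSelfOrDescendant(newParentPath, nodePath)) return null;
  const label = nodePath.slice(nodePath.lastIndexOf(".") + 1);
  return newParentPath + "." + label;
}

// Rewrite a descendant's path after its ancestor moved from oldPrefix to newPrefix.
function rebasePath(path: string, oldPrefix: string, newPrefix: string): string {
  if (!isSelfOrDescendant(path, oldPrefix)) return path; // unrelated node
  return newPrefix + path.slice(oldPrefix.length);
}
```

For example, moving root.a.b under root.c yields root.c.b, and a descendant root.a.b.x is rebased to root.c.b.x, while moving root.a under root.a.b is rejected.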

R-008 — Slug Hijack / Phishing Resemblance

Description: A new tenant slug closely resembles an existing one to phish guests via tenant booking subdomain.
Likelihood: Medium
Impact: Medium (brand confusion; phishing risk for guests of high-profile tenants)
Severity: Medium
Mitigation: Levenshtein similarity check at provision; manual review for similar slugs; reserved-slug list; trademark holders may pre-claim slugs via support ticket.
Owner: trust & safety lead
Status: monitored
Last review: 2026-04-15
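
A sketch of the Levenshtein check applied at provision time. The maxDistance threshold of 2 and the function names are assumptions; the register does not state the threshold used in production.

```typescript
// Classic dynamic-programming edit distance, keeping one previous row.
function levenshtein(a: string, b: string): number {
  let prev: number[] = [];
  for (let j = 0; j <= b.length; j++) prev.push(j);

  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a.charAt(i - 1) === b.charAt(j - 1) ? 0 : 1;
      curr.push(
        Math.min(
          prev[j] + 1, // deletion
          curr[j - 1] + 1, // insertion
          prev[j - 1] + cost, // substitution
        ),
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// Existing slugs close enough to the candidate to warrant manual review.
// An exact match (distance 0) is assumed to be rejected earlier as a duplicate.
function similarSlugs(candidate: string, existing: string[], maxDistance: number = 2): string[] {
  return existing.filter((slug) => {
    const distance = levenshtein(candidate, slug);
    return distance > 0 && distance <= maxDistance;
  });
}
```

Plain edit distance does not catch visually confusable characters (for example "rn" versus "m" is two edits but "0" versus "o" is one), which is one reason the manual review step remains part of the mitigation.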

R-009 — Subscription-Driven Auto-Suspend False Positive

Description: A delayed subscription.cancelled.v1 followed by a delayed …reactivated.v1 puts the tenant into a brief suspended state.
Likelihood: Low
Impact: Low (brief downtime per affected tenant)
Severity: Low
Mitigation: 14-day grace window; reactivation event cancels pending suspend job; operator can override via super-admin; alerting on suspension/reactivation rate.
Owner: billing tech lead
Status: mitigated
Last review: 2026-04-01

R-010 — Drift Between Platform Permission Registry and Per-Tenant Role Catalog

Description: New permissions added to the canonical registry are not seeded into existing tenants, causing inconsistent UX and functionality.
Likelihood: Medium
Impact: Low (cosmetic + missing functionality; not a security risk)
Severity: Low
Mitigation: Weekly RoleCatalogReconciler opens drift report; new-permission PR template asks for the seed migration; runbook for pnpm migrate:role-catalog --tenants all.
Owner: service tech lead
Status: monitored
Last review: 2026-04-10
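
The drift report amounts to a set difference between the canonical registry and each tenant's catalog. The sketch below assumes permissions are plain string keys and uses hypothetical function names; it is not the RoleCatalogReconciler itself.

```typescript
// Permissions present in the registry but not seeded into one tenant.
function missingPermissions(registry: string[], tenantCatalog: string[]): string[] {
  const seeded = new Set(tenantCatalog);
  return registry.filter((permission) => !seeded.has(permission));
}

// Map of tenantId -> missing permissions, listing only tenants that drifted.
function buildDriftReport(
  registry: string[],
  catalogs: Record<string, string[]>,
): Record<string, string[]> {
  const report: Record<string, string[]> = {};
  for (const tenantId of Object.keys(catalogs)) {
    const missing = missingPermissions(registry, catalogs[tenantId]);
    if (missing.length > 0) report[tenantId] = missing;
  }
  return report;
}
```

An empty report means every tenant has all registry permissions seeded; permissions that exist only in a tenant catalog are deliberately ignored here.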

R-011 — Memorystore Stale Reads During Cache TTL Window

Description: A tenant config update is published, but a downstream cache (Memorystore + service-local in-memory) serves stale data for up to TTL seconds, causing inconsistent quotes or check-in times.
Likelihood: Medium
Impact: Low (briefly inconsistent UX)
Severity: Low
Mitigation: tenant.config_updated.v1 triggers cache invalidation across regions; TTL kept low (60 s); snapshot included in event so downstream can refresh without an extra REST call.
Owner: platform tech lead
Status: mitigated
Last review: 2026-04-12
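
A sketch of how a service-local cache might combine the 60 s TTL with event-driven refresh. The class name, the event payload shape ({ tenantId, snapshot }), and the injectable clock are assumptions for illustration; the register only states that the event carries a snapshot.

```typescript
class TenantConfigCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();
  private ttlMs: number;
  private now: () => number;

  constructor(ttlMs: number = 60000, now: () => number = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  set(tenantId: string, value: V): void {
    this.entries.set(tenantId, { value, expiresAt: this.now() + this.ttlMs });
  }

  // Returns undefined on a miss or once the TTL has elapsed.
  get(tenantId: string): V | undefined {
    const entry = this.entries.get(tenantId);
    if (!entry) return undefined;
    if (this.now() >= entry.expiresAt) {
      this.entries.delete(tenantId);
      return undefined;
    }
    return entry.value;
  }

  // Handler for tenant.config_updated.v1: refresh from the snapshot in
  // the event so no extra REST call is needed.
  onConfigUpdated(event: { tenantId: string; snapshot: V }): void {
    this.set(event.tenantId, event.snapshot);
  }
}
```

The TTL bounds staleness when an event is lost or delayed, while the event handler shortens it in the common case.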

R-012 — Mass Invitation Abuse

Description: A compromised owner account or a malicious inviter sends thousands of invitations to harvest verification, spam, or phish.
Likelihood: Low
Impact: Medium (deliverability impact on the platform; reputation)
Severity: Medium
Mitigation: Rate limit (50/hour per tenant; 5/min per IP); AI invite-abuse classifier holds suspicious sends; per-domain anomaly detection; suspension flow for abusive tenants.
Owner: trust & safety lead
Status: monitored
Last review: 2026-04-15
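
The two limits in the mitigation (50/hour per tenant, 5/min per IP) can be sketched as a pair of sliding-window counters that must both allow a send. This is an in-memory illustration with hypothetical names; a multi-instance deployment would need shared state, which is not covered here.

```typescript
class SlidingWindowLimiter {
  private hits = new Map<string, number[]>();
  private limit: number;
  private windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Timestamps for `key` that still fall inside the window at `nowMs`.
  private recent(key: string, nowMs: number): number[] {
    return (this.hits.get(key) || []).filter((t) => nowMs - t < this.windowMs);
  }

  wouldAllow(key: string, nowMs: number): boolean {
    return this.recent(key, nowMs).length < this.limit;
  }

  record(key: string, nowMs: number): void {
    const kept = this.recent(key, nowMs);
    kept.push(nowMs);
    this.hits.set(key, kept);
  }
}

// Returns a gate that allows an invite only when both limits have room.
// Hits are recorded only for allowed sends, so a denial by one limiter
// does not consume a slot in the other.
function createInviteGate(now: () => number = Date.now) {
  const perTenant = new SlidingWindowLimiter(50, 60 * 60 * 1000);
  const perIp = new SlidingWindowLimiter(5, 60 * 1000);
  return (tenantId: string, ip: string): boolean => {
    const t = now();
    if (!perTenant.wouldAllow(tenantId, t) || !perIp.wouldAllow(ip, t)) return false;
    perTenant.record(tenantId, t);
    perIp.record(ip, t);
    return true;
  };
}
```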

R-013 — Cross-Region Replica Lag on Tenant Directory

Description: Tenant provisioning is regional; the global directory replica lags, causing the gateway to reject requests for a freshly-created tenant.
Likelihood: Low
Impact: Low (transient; resolves within seconds)
Severity: Low
Mitigation: Directory lag SLO (≤ 5 s); tenant provisioning UX explicitly says "your tenant will be live in a few seconds"; gateway has a 10-second retry-with-backoff for unknown-tenant on the immediately-following request.
Owner: platform tech lead
Status: monitored
Last review: 2026-04-12
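
A sketch of a retry-with-backoff loop bounded by the 10-second budget from the mitigation. The base delay (250 ms), the doubling factor, and the function names are assumptions; the gateway's actual schedule is not given in this register.

```typescript
// Exponential delays (ms) whose sum never exceeds budgetMs.
function backoffSchedule(
  budgetMs: number = 10000,
  baseMs: number = 250,
  factor: number = 2,
): number[] {
  if (baseMs <= 0 || factor < 1) return [];
  const delays: number[] = [];
  let total = 0;
  for (let delay = baseMs; total + delay <= budgetMs; delay *= factor) {
    delays.push(delay);
    total += delay;
  }
  return delays;
}

// Retry a directory lookup until it returns a value or the schedule runs out.
// `lookup` resolves to undefined while the tenant is still unknown.
async function lookupWithRetry<T>(
  lookup: () => Promise<T | undefined>,
  delays: number[] = backoffSchedule(),
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T | undefined> {
  let found = await lookup();
  for (const delay of delays) {
    if (found !== undefined) break;
    await sleep(delay);
    found = await lookup();
  }
  return found;
}
```

With the assumed defaults the schedule is 250, 500, 1000, 2000, 4000 ms (7.75 s in total), which comfortably covers the ≤ 5 s directory lag SLO.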

R-014 — Schema Evolution Breaks Long-Running Sync Clients

Description: A breaking change to a sync aggregate breaks Electron clients running an older app version.
Likelihood: Medium
Impact: Medium (offline backoffice users blocked until update)
Severity: Medium
Mitigation: Additive-only changes within a major; major version bumps run side-by-side for ≥ 90 d; client User-Agent carries app version; gateway-side compat shim translates older shapes for one minor version.
Owner: desktop tech lead
Status: monitored
Last review: 2026-04-15

R-015 — AI Misclassification Holds Legitimate Invites

Description: Invite-abuse classifier false-positive holds legitimate hires, blocking onboarding.
Likelihood: Medium
Impact: Low (operator can override with one click)
Severity: Low
Mitigation: Always advisory; one-click override with reason; monthly false-positive review per TLD; auto-pause prompt if FPR > 10 %.
Owner: AI orchestrator tech lead
Status: monitored
Last review: 2026-04-12
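
A sketch of the false-positive check behind the auto-pause prompt. It assumes FPR means held legitimate invites divided by all legitimate invites, as judged in the monthly review; the register does not define the metric, and the ReviewedInvite shape and function names are hypothetical.

```typescript
// One invite after human review: was it held, and was it actually legitimate?
type ReviewedInvite = { tld: string; held: boolean; legitimate: boolean };

// FPR = legitimate invites that were held / all legitimate invites.
function falsePositiveRate(invites: ReviewedInvite[]): number {
  const legitimate = invites.filter((invite) => invite.legitimate);
  if (legitimate.length === 0) return 0;
  return legitimate.filter((invite) => invite.held).length / legitimate.length;
}

// Per-TLD breakdown for the monthly review.
function falsePositiveRateByTld(invites: ReviewedInvite[]): Record<string, number> {
  const byTld: Record<string, ReviewedInvite[]> = {};
  for (const invite of invites) {
    (byTld[invite.tld] = byTld[invite.tld] || []).push(invite);
  }
  const rates: Record<string, number> = {};
  for (const tld of Object.keys(byTld)) {
    rates[tld] = falsePositiveRate(byTld[tld]);
  }
  return rates;
}

// Prompt to pause the classifier when the overall FPR exceeds 10 %.
function shouldPromptAutoPause(invites: ReviewedInvite[], threshold: number = 0.1): boolean {
  return falsePositiveRate(invites) > threshold;
}
```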

2. Risk Review Cadence

Monthly: scan all open risks; update Last review; promote/demote severity as warranted.
After every P1 incident: add a new risk if root cause exposed a previously-unmanaged failure mode.
Annual: full re-baseline by tech lead + security on-call; archive mitigated risks with > 6 months of clean operation.
