tenant-service — SERVICE_RISK_REGISTER

Living document, reviewed monthly. Each risk has: ID, description, likelihood (L/M/H), impact (L/M/H), severity (likelihood × impact), mitigation, owner, status, last-review date.


1. Top Risks

R-001 — Cross-Tenant Data Leak via Missed RLS

  • Description: A new query path or migration omits WHERE tenant_id = ? while the RLS policy is bypassed (e.g. a BYPASSRLS admin path used outside its scope), so rows belonging to another tenant are exposed.
  • Likelihood: Low (defense in depth + CI gate)
  • Impact: High (breach; reputation; regulatory)
  • Severity: High
  • Mitigation: Two-tenant simulator on every PR; nightly canary; RLS-on by default; BYPASSRLS granted only to a separate Postgres role used by scheduled jobs; PR template asks "did you touch RLS or admin queries?"; security review required for any RLS change.
  • Owner: service tech lead + security on-call
  • Status: monitored
  • Last review: 2026-04-15
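
The two-tenant check named in the mitigation can be pictured with a minimal in-memory sketch. The row shape, the QueryFn type, and findLeaks are hypothetical names for illustration only; the real simulator presumably runs against Postgres with RLS enabled.

```typescript
// Hypothetical row shape and helper names; not taken from the tenant-service codebase.
interface Row {
  tenantId: string;
  payload: string;
}

type QueryFn = (tenantId: string) => Row[];

// Runs a query path once per tenant and collects every returned row
// that belongs to a different tenant than the caller.
function findLeaks(query: QueryFn, tenants: string[]): Row[] {
  const leaks: Row[] = [];
  for (const tenant of tenants) {
    for (const row of query(tenant)) {
      if (row.tenantId !== tenant) leaks.push(row);
    }
  }
  return leaks;
}

const table: Row[] = [
  { tenantId: "tenant-a", payload: "a-booking" },
  { tenantId: "tenant-b", payload: "b-booking" },
];

// A correctly scoped path, and a path that forgot the tenant filter.
const scoped: QueryFn = (t) => table.filter((r) => r.tenantId === t);
const unscoped: QueryFn = () => table;

console.log(findLeaks(scoped, ["tenant-a", "tenant-b"]).length); // 0
console.log(findLeaks(unscoped, ["tenant-a", "tenant-b"]).length); // 2
```

A PR gate built on this idea would fail whenever findLeaks returns a non-empty list for any query path.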

R-002 — Role Escalation via Bug in PolicyEngine

  • Description: A bug in PolicyEngine or RoleEscalationGuard allows a member to assign a role they do not hold.
  • Likelihood: Low
  • Impact: High (privilege escalation across the platform)
  • Severity: High
  • Mitigation: Property-based ABAC fuzz tests; > 200 hand-curated unit cases; AuthZ matrix snapshot diffed every PR; RBAC matrix in SECURITY_MODEL §3.1 is authoritative.
  • Owner: service tech lead
  • Status: monitored
  • Last review: 2026-04-15
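
The invariant behind this risk is that an actor may only assign roles they already hold. A minimal sketch of that single rule follows; Member and canAssignRole are hypothetical names, and the real PolicyEngine and RoleEscalationGuard likely evaluate more context than this.

```typescript
// Hypothetical types; the real PolicyEngine and RoleEscalationGuard are not shown in this register.
type Role = string;

interface Member {
  id: string;
  roles: ReadonlySet<Role>;
}

// Allows the assignment only when the actor already holds the role being assigned.
function canAssignRole(actor: Member, role: Role): boolean {
  return actor.roles.has(role);
}

const manager: Member = { id: "m-1", roles: new Set(["manager", "front-desk"]) };
console.log(canAssignRole(manager, "front-desk")); // true
console.log(canAssignRole(manager, "owner")); // false
```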

R-003 — Last-Owner Removal Race Leaving Tenant Unmanaged

  • Description: Two or more concurrent owner removals all succeed, leaving the tenant with zero owners.
  • Likelihood: Low
  • Impact: Medium (operational lock-out; recoverable via super-admin)
  • Severity: Medium
  • Mitigation: SERIALIZABLE isolation on remove path; OwnerProtectionService re-counts in-tx; integration test for concurrent removes.
  • Owner: service tech lead
  • Status: mitigated
  • Last review: 2026-04-10
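
The in-transaction re-count can be sketched as a pure check over the membership rows visible to the transaction. Membership and canRemove are hypothetical names; the actual OwnerProtectionService is not shown in this register.

```typescript
// Hypothetical membership shape for illustration.
interface Membership {
  memberId: string;
  role: "owner" | "member";
}

// Re-counts owners among the rows visible inside the transaction and refuses
// any removal that would leave the tenant with zero owners.
function canRemove(members: Membership[], removeId: string): boolean {
  const target = members.find((m) => m.memberId === removeId);
  if (!target) return false; // unknown member: nothing to remove
  if (target.role !== "owner") return true; // non-owners never affect the owner count
  const ownerCount = members.filter((m) => m.role === "owner").length;
  return ownerCount > 1;
}

const team: Membership[] = [
  { memberId: "u-1", role: "owner" },
  { memberId: "u-2", role: "member" },
];
console.log(canRemove(team, "u-2")); // true
console.log(canRemove(team, "u-1")); // false
```

The check is only safe when paired with SERIALIZABLE isolation (or equivalent locking). Under weaker isolation, two transactions removing different owners could each count two owners and both pass.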

R-004 — Tenant Cascade Delete Stalled by Unresponsive Downstream

  • Description: A downstream service fails to ack the close-tenant saga, leaving PII in place beyond the GDPR window.
  • Likelihood: Medium (we have ten downstream services to coordinate)
  • Impact: High (regulatory)
  • Severity: High
  • Mitigation: Saga timeout alerts at day 3 and day 6; per-service "force cascade" runbook; quarterly full-cascade rehearsal in staging; explicit SLA contract from each consuming service.
  • Owner: platform tech lead
  • Status: monitored
  • Last review: 2026-04-05

R-005 — PDP Outage Cascades to Platform-Wide Write Outage

  • Description: tenant-service outage takes the platform's PDP offline; downstream services fail closed and writes stop platform-wide.
  • Likelihood: Low
  • Impact: High (platform-wide)
  • Severity: High
  • Mitigation: Min 2 instances per region; multi-region; aggressive HPA; PDP-emergency runbook (cache widening + revision pinning); 99.99 % SLO with burn-rate alerts at 14× / 30 min.
  • Owner: service tech lead + platform tech lead
  • Status: monitored
  • Last review: 2026-04-15
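
For reference, the burn-rate arithmetic works out as follows, assuming the "14× / 30 min" figure means a burn-rate multiplier of 14 evaluated over a 30-minute window. With a 99.99 % SLO the error budget is 0.01 % of requests, so a burn rate of 14 corresponds to roughly 0.14 % of requests failing in the window. The function names below are illustrative.

```typescript
const SLO = 0.9999; // from the mitigation above
const BURN_RATE_THRESHOLD = 14; // from the mitigation above

// Burn rate = observed error ratio divided by the error budget (1 - SLO).
function burnRate(failed: number, total: number, slo: number = SLO): number {
  if (total === 0) return 0; // no traffic, no budget consumed
  return failed / total / (1 - slo);
}

// True when the counts observed in one evaluation window exceed the threshold.
function shouldPage(failed: number, total: number): boolean {
  return burnRate(failed, total) >= BURN_RATE_THRESHOLD;
}

console.log(shouldPage(20, 10000)); // true  (0.20 % failing, burn rate near 20)
console.log(shouldPage(10, 10000)); // false (0.10 % failing, burn rate near 10)
```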

R-006 — Invitation Token Compromise

  • Description: Invitation token leaked via email scraping, screenshot, or transport interception.
  • Likelihood: Medium (email is not end-to-end encrypted)
  • Impact: Medium (single-tenant compromise; mitigated by single-use + short TTL)
  • Severity: Medium
  • Mitigation: 256-bit entropy; SHA-256 hash storage; constant-time compare; per-IP per-invitationId rate limit; auto-revoke after 50 failed attempts; magic-link delivery via notification-service (provider with TLS; DKIM/SPF/DMARC enforced); operator can resend (does not extend TTL).
  • Owner: security on-call
  • Status: monitored
  • Last review: 2026-04-12
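
The first three mitigations (256-bit entropy, SHA-256 hash storage, constant-time compare) can be sketched with Node's built-in crypto module. The function names are hypothetical and the hex encoding is an assumption; the register does not say how tokens are encoded.

```typescript
import { createHash, randomBytes, timingSafeEqual } from "crypto";

// 32 random bytes = 256 bits of entropy; hex encoding is an assumption.
function generateToken(): string {
  return randomBytes(32).toString("hex");
}

// Only the SHA-256 digest is persisted; the raw token lives only in the email link.
function hashToken(token: string): Buffer {
  return createHash("sha256").update(token, "utf8").digest();
}

// Compares digests in constant time. The length guard matters because
// timingSafeEqual throws when the two buffers differ in length.
function verifyToken(presented: string, storedHash: Buffer): boolean {
  const candidate = hashToken(presented);
  return storedHash.length === candidate.length && timingSafeEqual(candidate, storedHash);
}

const token = generateToken();
const stored = hashToken(token);
console.log(verifyToken(token, stored)); // true
console.log(verifyToken(generateToken(), stored)); // false
```

Comparing digests rather than raw tokens keeps both inputs at a fixed 32 bytes, so the comparison time does not depend on what the caller presented.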

R-007 — ltree Path Corruption on Move

  • Description: A bug in the MoveProperty saga leaves the org tree with an invalid path, breaking ancestor queries.
  • Likelihood: Low
  • Impact: Medium (some property queries return the wrong tree slice; recoverable by re-derivation)
  • Severity: Medium
  • Mitigation: OrgTreeIntegrityService validates path/cycles in-tx; nightly integrity job recomputes paths and compares; saga test covers move + integrity assertion.
  • Owner: service tech lead
  • Status: mitigated
  • Last review: 2026-04-08
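
The nightly "recompute paths and compare" job can be sketched as below. The OrgNode shape is hypothetical, and using node ids as ltree labels is an assumption made to keep the example short.

```typescript
// Hypothetical node shape: each node stores its parent and a dot-separated ltree-style path.
interface OrgNode {
  id: string;
  parentId: string | null;
  path: string;
}

// Rebuilds the expected path by walking parent pointers up to the root.
// Throws on a cycle or on a parent that does not exist.
function derivePath(id: string, byId: Map<string, OrgNode>): string {
  const labels: string[] = [];
  const seen = new Set<string>();
  let current: string | null = id;
  while (current !== null) {
    if (seen.has(current)) throw new Error(`cycle at ${current}`);
    seen.add(current);
    const node: OrgNode | undefined = byId.get(current);
    if (!node) throw new Error(`missing node ${current}`);
    labels.unshift(node.id);
    current = node.parentId;
  }
  return labels.join(".");
}

// Returns the ids of nodes whose stored path differs from the derived one.
function findCorruptPaths(nodes: OrgNode[]): string[] {
  const byId = new Map<string, OrgNode>();
  for (const n of nodes) byId.set(n.id, n);
  return nodes.filter((n) => derivePath(n.id, byId) !== n.path).map((n) => n.id);
}

const tree: OrgNode[] = [
  { id: "chain", parentId: null, path: "chain" },
  { id: "hotel", parentId: "chain", path: "chain.hotel" },
  { id: "room", parentId: "hotel", path: "chain.room" }, // stale path left by a bad move
];
console.log(findCorruptPaths(tree)); // [ 'room' ]
```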

R-008 — Slug Hijack / Phishing Resemblance

  • Description: A new tenant slug closely resembles an existing one and is used to phish guests via the tenant's booking subdomain.
  • Likelihood: Medium
  • Impact: Medium (brand confusion; phishing risk for guests of high-profile tenants)
  • Severity: Medium
  • Mitigation: Levenshtein similarity check at provision; manual review for similar slugs; reserved-slug list; trademark holders may pre-claim slugs via support ticket.
  • Owner: trust & safety lead
  • Status: monitored
  • Last review: 2026-04-15
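
A minimal sketch of the Levenshtein check at provision time follows. The distance threshold of 2 and the function names are assumptions; the register does not state the cut-off that triggers manual review.

```typescript
// Classic dynamic-programming edit distance, keeping only two rows in memory.
function levenshtein(a: string, b: string): number {
  let prev: number[] = [];
  for (let j = 0; j <= b.length; j++) prev.push(j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Returns the existing slugs that the candidate resembles closely enough
// to route it to manual review. maxDistance = 2 is a hypothetical threshold.
function similarSlugs(candidate: string, existing: string[], maxDistance = 2): string[] {
  return existing.filter((slug) => slug !== candidate && levenshtein(candidate, slug) <= maxDistance);
}

console.log(levenshtein("kitten", "sitting")); // 3
console.log(similarSlugs("grand-hote1", ["grand-hotel", "seaside-inn"])); // [ 'grand-hotel' ]
```

Plain edit distance treats every substitution equally. Catching lookalike characters such as "1" for "l" more aggressively would need a weighted variant, which is outside this sketch.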

R-009 — Subscription-Driven Auto-Suspend False Positive

  • Description: A delayed subscription.cancelled.v1 followed by a delayed …reactivated.v1 puts the tenant into a brief suspended state.
  • Likelihood: Low
  • Impact: Low (brief downtime per affected tenant)
  • Severity: Low
  • Mitigation: 14-day grace window; reactivation event cancels pending suspend job; operator can override via super-admin; alerting on suspension/reactivation rate.
  • Owner: billing tech lead
  • Status: mitigated
  • Last review: 2026-04-01
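
The interaction between the grace window and the reactivation event can be sketched as a fold over events in arrival order. The short type tags "cancelled" and "reactivated" are hypothetical stand-ins for the real event names, and timestamps are plain epoch milliseconds.

```typescript
const GRACE_MS = 14 * 24 * 60 * 60 * 1000; // 14-day grace window from the mitigation

// Hypothetical, simplified event shape.
type SubscriptionEvent =
  | { type: "cancelled"; at: number }
  | { type: "reactivated"; at: number };

// Folds events in arrival order into the time at which suspension becomes due,
// or null when no suspend job is pending. A reactivation clears the pending job.
function pendingSuspendAt(events: SubscriptionEvent[]): number | null {
  let dueAt: number | null = null;
  for (const e of events) {
    dueAt = e.type === "cancelled" ? e.at + GRACE_MS : null;
  }
  return dueAt;
}

function shouldSuspend(events: SubscriptionEvent[], now: number): boolean {
  const dueAt = pendingSuspendAt(events);
  return dueAt !== null && now >= dueAt;
}

const DAY = 24 * 60 * 60 * 1000;
console.log(shouldSuspend([{ type: "cancelled", at: 0 }], 15 * DAY)); // true
console.log(
  shouldSuspend([{ type: "cancelled", at: 0 }, { type: "reactivated", at: 10 * DAY }], 15 * DAY),
); // false
```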

R-010 — Drift Between Platform Permission Registry and Per-Tenant Role Catalog

  • Description: New permissions added to the canonical registry are not seeded into existing tenants, causing inconsistent UX and functionality.
  • Likelihood: Medium
  • Impact: Low (cosmetic + missing functionality; not a security risk)
  • Severity: Low
  • Mitigation: Weekly RoleCatalogReconciler run opens a drift report; new-permission PR template asks for the seed migration; runbook for pnpm migrate:role-catalog --tenants all.
  • Owner: service tech lead
  • Status: monitored
  • Last review: 2026-04-10

R-011 — Memorystore Stale Reads During Cache TTL Window

  • Description: A tenant config update is published, but a downstream cache (Memorystore + service-local in-memory) serves stale data for up to TTL seconds, causing inconsistent quotes or check-in times.
  • Likelihood: Medium
  • Impact: Low (briefly inconsistent UX)
  • Severity: Low
  • Mitigation: tenant.config_updated.v1 triggers cache invalidation across regions; TTL kept low (60 s); snapshot included in event so downstream can refresh without an extra REST call.
  • Owner: platform tech lead
  • Status: mitigated
  • Last review: 2026-04-12

R-012 — Mass Invitation Abuse

  • Description: A compromised owner account or a malicious inviter sends thousands of invitations to harvest verification, spam, or phish.
  • Likelihood: Low
  • Impact: Medium (deliverability impact on the platform; reputation)
  • Severity: Medium
  • Mitigation: Rate limit (50/hour per tenant; 5/min per IP); AI invite-abuse classifier holds suspicious sends; per-domain anomaly detection; suspension flow for abusive tenants.
  • Owner: trust & safety lead
  • Status: monitored
  • Last review: 2026-04-15
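
The per-tenant limit (50/hour) could be enforced with a sliding-window limiter such as the sketch below. The register does not say which algorithm is used, so the sliding-window log and the class name are assumptions.

```typescript
// Sliding-window log limiter: remembers the timestamps of accepted attempts per key.
class SlidingWindowLimiter {
  private readonly hits: Map<string, number[]> = new Map();
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true and records the attempt when the key is under its limit.
  // Rejected attempts are not recorded, so they do not extend the lock-out.
  tryAcquire(key: string, now: number): boolean {
    const recent = (this.hits.get(key) || []).filter((t) => now - t < this.windowMs);
    const allowed = recent.length < this.limit;
    if (allowed) recent.push(now);
    this.hits.set(key, recent);
    return allowed;
  }
}

// Small numbers for readability: 2 sends per 1000 ms.
const demo = new SlidingWindowLimiter(2, 1000);
console.log(demo.tryAcquire("tenant-a", 0)); // true
console.log(demo.tryAcquire("tenant-a", 10)); // true
console.log(demo.tryAcquire("tenant-a", 20)); // false
console.log(demo.tryAcquire("tenant-a", 1000)); // true (the send at t=0 has left the window)
```

The 50/hour tenant limit would be `new SlidingWindowLimiter(50, 60 * 60 * 1000)` keyed by tenant id. The 5/min per-IP limit would be a second instance keyed by IP.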

R-013 — Cross-Region Replica Lag on Tenant Directory

  • Description: Tenant provisioning is regional; the global directory replica lags, causing the gateway to reject requests for a freshly created tenant.
  • Likelihood: Low
  • Impact: Low (transient; resolves within seconds)
  • Severity: Low
  • Mitigation: Directory lag SLO (≤ 5 s); tenant provisioning UX explicitly says "your tenant will be live in a few seconds"; when the request immediately following provisioning hits unknown-tenant, the gateway retries with backoff for up to 10 seconds.
  • Owner: platform tech lead
  • Status: monitored
  • Last review: 2026-04-12
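
The gateway's bounded retry can be sketched as a backoff schedule capped by a total budget. Exponential doubling and the 100 ms initial delay are assumptions; only the 10-second budget comes from the mitigation above. Function names are hypothetical.

```typescript
// Builds the list of pauses: each doubles the previous one, and the final
// pause is trimmed so the total never exceeds the budget.
function backoffSchedule(budgetMs: number, initialDelayMs: number): number[] {
  if (initialDelayMs <= 0) throw new Error("initialDelayMs must be positive");
  const delays: number[] = [];
  let remaining = budgetMs;
  let delay = initialDelayMs;
  while (remaining > 0) {
    const pause = Math.min(delay, remaining);
    delays.push(pause);
    remaining -= pause;
    delay *= 2;
  }
  return delays;
}

// Tries the lookup once immediately, then once after each pause in the schedule.
// The budget covers time spent waiting, not time spent inside the lookup itself.
async function lookupWithRetry<T>(
  lookup: () => Promise<T | undefined>,
  delays: number[],
): Promise<T | undefined> {
  const first = await lookup();
  if (first !== undefined) return first;
  for (const pause of delays) {
    await new Promise<void>((resolve) => setTimeout(resolve, pause));
    const found = await lookup();
    if (found !== undefined) return found;
  }
  return undefined;
}

console.log(backoffSchedule(10000, 100)); // [ 100, 200, 400, 800, 1600, 3200, 3700 ]
```

With the directory lag SLO at 5 s or less, this schedule makes several attempts before the 5-second mark and still leaves headroom up to the 10-second cap.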

R-014 — Schema Evolution Breaks Long-Running Sync Clients

  • Description: A breaking change to a sync aggregate breaks Electron clients running an older app version.
  • Likelihood: Medium
  • Impact: Medium (offline backoffice users blocked until update)
  • Severity: Medium
  • Mitigation: Additive-only changes within a major; major version bumps run side-by-side for ≥ 90 d; client User-Agent carries app version; gateway-side compat shim translates older shapes for one minor version.
  • Owner: desktop tech lead
  • Status: monitored
  • Last review: 2026-04-15

R-015 — AI Misclassification Holds Legitimate Invites

  • Description: A false positive from the invite-abuse classifier holds invitations to legitimate hires, blocking onboarding.
  • Likelihood: Medium
  • Impact: Low (operator can override with one click)
  • Severity: Low
  • Mitigation: Always advisory; one-click override with reason; monthly false-positive review per TLD; auto-pause prompt if FPR > 10 %.
  • Owner: AI orchestrator tech lead
  • Status: monitored
  • Last review: 2026-04-12

2. Risk Review Cadence

  • Monthly: scan all open risks; update Last review; promote/demote severity as warranted.
  • After every P1 incident: add a new risk if the root cause exposed a previously unmanaged failure mode.
  • Annual: full re-baseline by tech lead + security on-call; archive mitigated risks with > 6 months of clean operation.