tenant-service — SERVICE_READINESS
Cross-cutting readiness conventions live in SERVICE_TEMPLATE. This file records production-readiness criteria for tenant-service.
tenant-service is a tier-1 dependency for every other service on the platform. Its readiness gates are stricter than for non-PDP services; nothing ships to prod without all canonical gates green.
1. Milestone Plan
| Milestone | Scope | Exit criteria |
|---|---|---|
| M0 — Skeleton | NestJS scaffold, domain primitives, OpenAPI scaffold, CI pipeline | /healthz + /readyz return 200 |
| M1 — Tenant + Config | Tenant and TenantConfig aggregates; provision, suspend, reactivate, close (no saga); REST + events; outbox/inbox | G1, G2, G3, G6 green |
| M2 — Memberships + Roles | Membership, Role, RoleAssignment, Invitation; system role seed; RoleEscalationGuard; OwnerProtectionService | G1-G6 green; two-tenant simulator green; ABAC fuzz green |
| M3 — Org Tree | OrganizationUnit with ltree; chain restructuring saga (MoveProperty); property-scope ABAC | G1-G6 green; saga test suite green |
| M4 — Operational Robustness | Cascade-delete saga; sync surface for desktop; feature flags; billing-contact; AI advisory hooks | G1-G8 green; load test passes; chaos drill green |
| M5 — GA | All canonical gates green; pen-test passed; runbooks complete; on-call trained | Production cutover signed off by platform team |
2. Canonical Gates (G1-G8)
The platform standards define eight canonical gates for every service. The evidence pointers for tenant-service are listed below.
G1 — Domain
- All aggregates from DOMAIN_MODEL implemented with invariants.
- State machines covered by unit tests (≥ 1 per illegal transition).
- `OwnerProtectionService` and `RoleEscalationGuard` tested.
- Property-based tests for `OrgUnit.move`, `Invitation.accept`, `RoleAssignment` scope narrowing.
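The "one test per illegal transition" criterion can be made mechanical by deriving the illegal pairs from a transition table. The following is a minimal sketch, assuming a hypothetical tenant lifecycle with states `active`, `suspended`, and `closed`; the state names, `TRANSITIONS`, `applyCommand`, and `illegalPairs` are illustrative and are not taken from DOMAIN_MODEL.

```typescript
// Hypothetical tenant lifecycle; state and command names are assumptions.
type TenantState = "active" | "suspended" | "closed";
type TenantCommand = "suspend" | "reactivate" | "close";

// Legal transitions only; any pair absent from this table is illegal.
const TRANSITIONS: Record<TenantState, Partial<Record<TenantCommand, TenantState>>> = {
  active: { suspend: "suspended", close: "closed" },
  suspended: { reactivate: "active", close: "closed" },
  closed: {},
};

// Returns the next state, or throws when the transition is not in the table.
function applyCommand(state: TenantState, command: TenantCommand): TenantState {
  const next = TRANSITIONS[state][command];
  if (next === undefined) {
    throw new Error(`Illegal transition: ${command} from ${state}`);
  }
  return next;
}

// Enumerates every (state, command) pair the table does not allow, so a
// test suite can assert one rejection per illegal transition.
function illegalPairs(): Array<[TenantState, TenantCommand]> {
  const states: TenantState[] = ["active", "suspended", "closed"];
  const commands: TenantCommand[] = ["suspend", "reactivate", "close"];
  const pairs: Array<[TenantState, TenantCommand]> = [];
  for (const s of states) {
    for (const c of commands) {
      if (TRANSITIONS[s][c] === undefined) pairs.push([s, c]);
    }
  }
  return pairs;
}
```

A unit test could loop over `illegalPairs()` and expect `applyCommand` to throw for each entry, which keeps the test count in step with the table as states are added.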
G2 — API
- Every endpoint in API_CONTRACTS implemented.
- OpenAPI spec emitted, committed, and validated against contract tests.
- Pact provider tests pass for `bff-backoffice-service`, `bff-tenant-booking-service`, `notification-service`, `billing-service`.
- Problem+JSON conformance for every error code in API_CONTRACTS §13.
G3 — Events
- Every published event in EVENT_SCHEMAS has a JSON Schema, fixture, and contract test.
- Outbox writes in same tx as domain mutation (verified by `outbox.spec.ts`).
- Inbox dedup verified (`inbox.spec.ts`).
- Subscriber for `iam.user.registered.v1`, `iam.user.deleted.v1`, `billing.subscription.cancelled.v1`, `billing.subscription.reactivated.v1` implemented and idempotent.
- DLQ topic configured per source topic; alerts routed.
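Inbox dedup and subscriber idempotency come down to recording each event ID once and skipping redeliveries. The sketch below uses an in-memory set in place of an inbox table; `EventEnvelope`, `InMemoryInbox`, and `idempotent` are hypothetical names, and the envelope fields are assumptions rather than the shapes defined in EVENT_SCHEMAS.

```typescript
// Minimal event envelope; field names are assumptions.
interface EventEnvelope<T> {
  eventId: string;
  type: string;
  payload: T;
}

// In-memory stand-in for an inbox table with a unique constraint on eventId.
class InMemoryInbox {
  private readonly seen = new Set<string>();

  // Returns true the first time an eventId is recorded, false on redelivery.
  markIfNew(eventId: string): boolean {
    if (this.seen.has(eventId)) return false;
    this.seen.add(eventId);
    return true;
  }

  // Simulates a transaction rollback of the inbox row.
  forget(eventId: string): void {
    this.seen.delete(eventId);
  }
}

// Wraps a handler so redelivered events are acknowledged without re-running
// side effects. If the handler throws, the inbox mark is undone so the
// event can be retried, mirroring a shared database transaction.
function idempotent<T>(
  inbox: InMemoryInbox,
  handler: (event: EventEnvelope<T>) => void,
): (event: EventEnvelope<T>) => "processed" | "duplicate" {
  return (event) => {
    if (!inbox.markIfNew(event.eventId)) return "duplicate";
    try {
      handler(event);
    } catch (err) {
      inbox.forget(event.eventId);
      throw err;
    }
    return "processed";
  };
}
```

In a real service the inbox insert and the handler's writes would presumably share one database transaction, which is what the `forget` call imitates here.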
G4 — Sync
- Pull surface for `tenant_config`, `org_units`, `my_membership`, `role_catalog`, `my_role_assignments`, `feature_flags` works against the desktop fixture.
- Push surface accepts the limited TenantConfig PATCH and FeatureFlag toggle; rejects role/membership writes with `MELMASTOON.SYNC.ONLINE_REQUIRED`.
- Cursor-too-old returns `410` with full-resync hint.
- Conflict policy `lww+diff` for feature flags audited.
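The pull and push rules above can be illustrated with a small sketch. Only the `410` status and the `MELMASTOON.SYNC.ONLINE_REQUIRED` code come from the criteria; the response shapes, `ChangeLog`, `minServableCursor`, `pull`, and `push` are hypothetical and do not describe the actual sync API.

```typescript
// Hypothetical response shapes for the pull surface.
type PullResult =
  | { status: 200; changes: string[]; nextCursor: number }
  | { status: 410; hint: "full_resync" };

interface ChangeLog {
  // Oldest cursor from which incremental pulls can still be served.
  minServableCursor: number;
  entries: Array<{ cursor: number; change: string }>;
}

function pull(log: ChangeLog, cursor: number): PullResult {
  // A cursor behind the retention window cannot be served incrementally.
  if (cursor < log.minServableCursor) {
    return { status: 410, hint: "full_resync" };
  }
  const fresh = log.entries.filter((e) => e.cursor > cursor);
  // Advance to the highest cursor returned, or stay put when nothing is new.
  const nextCursor = fresh.reduce((max, e) => Math.max(max, e.cursor), cursor);
  return { status: 200, changes: fresh.map((e) => e.change), nextCursor };
}

// Entities a client may try to push; only the first two are writable offline.
type PushEntity = "tenant_config" | "feature_flag" | "role" | "membership";
const OFFLINE_WRITABLE: PushEntity[] = ["tenant_config", "feature_flag"];

type PushResult =
  | { accepted: true }
  | { accepted: false; code: "MELMASTOON.SYNC.ONLINE_REQUIRED" };

function push(entity: PushEntity): PushResult {
  if (OFFLINE_WRITABLE.indexOf(entity) === -1) {
    return { accepted: false, code: "MELMASTOON.SYNC.ONLINE_REQUIRED" };
  }
  return { accepted: true };
}
```

A desktop-fixture test could pull with a stale cursor and assert the `410` branch, then push a role write and assert the rejection code.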
G5 — AI
- `AIClient` adapter wired to `ai-orchestrator-service`.
- Invite-abuse classifier integrated as advisory.
- Bulk-removal anomaly detector wired.
- Provenance recorded for every AI-influenced action.
- Feature-flag `aiEnabled` correctly disables every surface.
G6 — Observability
- Metrics from OBSERVABILITY §2 emitted and visible in Cloud Monitoring.
- All six dashboards built and shared with the team.
- All P1/P2 alerts wired with PagerDuty routing.
- Synthetic monitor running every 60 s; two-tenant canary running nightly.
- Runbooks committed in `services/tenant-service/runbooks/` for every alert.
G7 — Performance
- k6 scenarios from TESTING_STRATEGY §8 green at SLO budgets.
- `authz_check` p95 ≤ 15 ms (cache hit), ≤ 50 ms (miss) at 5 000 rps.
- `tenant_config_read` p95 ≤ 25 ms at 2 000 rps.
- Outbox lag p95 ≤ 2 s at 1 000 events/s.
- No memory leak across a 4-hour soak test.
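For reference, a p95 budget check can be expressed in a few lines. This sketch uses the nearest-rank percentile definition, which is an assumption; k6 and Cloud Monitoring may interpolate differently, so treat it as an illustration of the check rather than the gate's actual computation. `percentile`, `LatencyBudget`, and `withinBudget` are hypothetical names.

```typescript
// Nearest-rank percentile: the smallest sample such that at least
// p percent of all samples are less than or equal to it.
function percentile(samplesMs: number[], p: number): number {
  if (samplesMs.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new Error("p must be in (0, 100]");
  const sorted = samplesMs.slice().sort((a, b) => a - b);
  // Multiply before dividing so integer inputs give an exact rank.
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[rank - 1] as number;
}

interface LatencyBudget {
  name: string;
  p95BudgetMs: number;
}

// True when the observed p95 is within the budget.
function withinBudget(samplesMs: number[], budget: LatencyBudget): boolean {
  return percentile(samplesMs, 95) <= budget.p95BudgetMs;
}
```

For example, `withinBudget(samples, { name: "authz_check (cache hit)", p95BudgetMs: 15 })` would express the first budget in the list above.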
G8 — Security
- Two-tenant simulator green on every PR + nightly canary.
- ABAC fuzz green.
- OWASP ASVS L2 self-assessment passed.
- External pen-test report received and findings addressed (or accepted with explicit risk).
- CMEK enforced on Cloud SQL; secrets in Secret Manager only.
- `pgaudit` enabled on sensitive tables.
- Audit-event sink to BigQuery verified.
3. SLOs (recap)
| SLI | Target |
|---|---|
| tenant.config availability | 99.95 % |
| authz.check availability | 99.99 % |
| authz.check p95 (cache hit) | ≤ 15 ms |
| tenant.config p95 (cache hit) | ≤ 25 ms |
| tenant.config PATCH p95 | ≤ 200 ms |
| Outbox dispatch lag p95 | ≤ 2 s |
| Two-tenant isolation | 100 % (zero cross-tenant reads) |
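To make the availability targets concrete, they can be converted into an allowed-downtime budget. The sketch below assumes a 30-day window, which this document does not specify; `errorBudgetMinutes` is a hypothetical helper.

```typescript
// Allowed downtime, in minutes, for an availability target over a window.
// The window length is an assumption; the SLO table does not state one.
function errorBudgetMinutes(targetPercent: number, windowDays: number): number {
  if (targetPercent < 0 || targetPercent > 100) {
    throw new Error("targetPercent must be between 0 and 100");
  }
  const windowMinutes = windowDays * 24 * 60;
  return (windowMinutes * (100 - targetPercent)) / 100;
}

// Over 30 days, 99.95 % allows roughly 21.6 minutes of downtime
// and 99.99 % allows roughly 4.3 minutes.
const tenantConfigBudget = errorBudgetMinutes(99.95, 30);
const authzCheckBudget = errorBudgetMinutes(99.99, 30);
```

The results are floating-point values, so comparisons against them should use a tolerance rather than strict equality.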
4. Definition of Done (per PR)
- All checks in TESTING_STRATEGY §11 green.
- OpenAPI spec regenerated and committed if the API changed.
- Event JSON Schemas updated and fixtures regenerated if events changed.
- Migration applied locally and rolled back cleanly if it has a `down` script.
- Two-tenant simulator green.
- At least one new test added if behavior changed.
- Runbook updated if operational behavior changed.
- Doc updated (this bundle) if any contract changed.
5. Release Checklist (per Cloud Run revision)
- CI is green on the merge commit.
- Image signed via Binary Authorization, scanned by Trivy with no high-severity findings.
- Migration scripts applied in staging without errors; tested rollback path documented.
- Smoke tests against staging green.
- Synthetic monitor on staging green for ≥ 15 min.
- Canary at 5 % production traffic for ≥ 30 min with no SLO regression.
- Canary at 50 % for ≥ 30 min.
- Full rollout to 100 %.
- Release tagged in Git; release notes posted to platform channel.
- On-call notified.
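The canary steps in this checklist can be read as a small staged-promotion rule. The sketch below encodes the 5 % and 50 % stages with their 30-minute minimum soak times from the checklist; rolling back on any SLO regression at every stage is an assumption, and `CanaryStage`, `STAGES`, and `decide` are hypothetical names rather than part of any deployment tooling.

```typescript
interface CanaryStage {
  trafficPercent: number;
  minSoakMinutes: number;
}

// Stages from the release checklist: 5 % for 30 min, 50 % for 30 min, then 100 %.
const STAGES: CanaryStage[] = [
  { trafficPercent: 5, minSoakMinutes: 30 },
  { trafficPercent: 50, minSoakMinutes: 30 },
  { trafficPercent: 100, minSoakMinutes: 0 },
];

type Decision =
  | { action: "hold" }
  | { action: "promote"; toPercent: number }
  | { action: "rollback" }
  | { action: "done" };

// Decides the next step for the stage at stageIndex.
// Assumption: any SLO regression triggers a rollback regardless of stage.
function decide(stageIndex: number, soakedMinutes: number, sloRegressed: boolean): Decision {
  const stage = STAGES[stageIndex];
  if (stage === undefined) throw new Error("unknown stage index " + stageIndex);
  if (sloRegressed) return { action: "rollback" };
  if (soakedMinutes < stage.minSoakMinutes) return { action: "hold" };
  const next = STAGES[stageIndex + 1];
  if (next === undefined) return { action: "done" };
  return { action: "promote", toPercent: next.trafficPercent };
}
```

Keeping the stages in data rather than in control flow makes it straightforward to review them against the checklist when either changes.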
6. Production Cutover Sign-off (M5 → GA)
Required signatures (recorded in releases/v1.0.md):
- Service tech lead
- Platform tech lead
- Security on-call
- Compliance officer
- Customer-success lead (for first chain customers)
Without all five, GA is not declared.
7. Post-GA Operating Rhythm
- Monthly SLO review.
- Quarterly chaos drill (Cloud SQL failover, region failover, outbox burst).
- Quarterly security review of role/permission registry.
- Quarterly bias review of AI invite-classifier outputs.
- Annual external pen-test.