
Service Readiness Gates — Ghasi Melmastoon

Companion: Roadmap Index · team-capacity-model.md · jira-epic-based-sprint-ruleset.md · docs/standards/DEFINITION_OF_DONE.md · docs/standards/REQUIREMENTS_GUARD_RAILS.md

A service readiness gate is the bar a service must clear before downstream consumers (other services, BFFs, frontends) are allowed to depend on it for production traffic. Gates exist so that an upstream service's gaps cannot silently become a downstream surface's bug.

This document defines the five readiness phases, the exit criteria for each, the gate review process, and the dependency direction rules a sprint planner must respect.

The gates extend (and never replace) the per-story Definition of Done. DoD is per-ticket; readiness gates are per-service-version.


1. The five phases

| Phase | Label (Jira fixVersion / sub-label) | Meaning | Who can depend on this service? |
|---|---|---|---|
| P0 (Stub) | service-readiness:p0-stub | Service exists in the monorepo as a scaffolded NestJS app with /health/live returning 200. No business logic. | Nobody. Local-dev only. |
| P1 (Contract-only) | service-readiness:p1-contract | OpenAPI + AsyncAPI specs published in pkg-contracts; bff-* services and integration tests can use the mock server and the generated client. No real persistence. | Other teams may code against this contract; may not run integration in shared envs. |
| P2 (Functional alpha) | service-readiness:p2-alpha | Core happy path implemented with persistence, outbox, RLS, and basic tests; deployed to the dev GCP project. | Other services may integrate in dev. Not for staging or prod. |
| P3 (Production-ready beta) | service-readiness:p3-beta | All §3 P3 criteria met; deployed to staging; under SLO observation. | Other services may use in staging. One internal pilot tenant in prod allowed with explicit risk waiver. |
| P4 (General availability) | service-readiness:p4-ga | All §3 P4 criteria met; on-call rotation active; runbooks complete; SLOs steady ≥ 14 days in prod. | All services and surfaces may depend on it in prod. |

A regression in any criterion demotes the service one phase until restored.


2. Visual: dependency direction rules

P4 GA ──── may depend on ────► P4 GA (✅ allowed)
P4 GA ──── may depend on ────► P3 (⚠️ pilot tenant only, with waiver)
P3 ──── may depend on ────► P3+ (✅ allowed in staging only)
P3 ──── may depend on ────► P2 (❌ blocked; lift upstream first)
P2 ──── may depend on ────► P2+ (✅ allowed in dev only)
P2 ──── may depend on ────► P1 (❌ blocked; lift upstream first)
P1 ──── may depend on ────► P1+ (✅ contract-only integration)
P0 ──── may depend on ────► anything (local-dev only)

The sprint planner (jira-sprint-distribution-rules.md Rule E) refuses to commit a story whose upstreams are at a lower readiness than the story requires.
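The direction rules above can be expressed as a small pure function. The following is a minimal sketch, not the planner's actual implementation: `checkDependency` and the `Verdict` type are hypothetical names, and it assumes a P1 consumer depending on a P0 upstream is blocked, which the diagram implies but does not state.

```typescript
// Readiness phases in ascending order (labels from section 1).
type Phase = "P0" | "P1" | "P2" | "P3" | "P4";
type Verdict = "allowed" | "waiver-required" | "blocked";

const RANK: Record<Phase, number> = { P0: 0, P1: 1, P2: 2, P3: 3, P4: 4 };

// Hypothetical helper: may a consumer at phase `consumer` depend on an
// upstream at phase `upstream`?
function checkDependency(consumer: Phase, upstream: Phase): Verdict {
  // P0 services run in local dev only, so any upstream is acceptable.
  if (consumer === "P0") return "allowed";
  // An upstream at the same or a higher phase is allowed (within the
  // environments the consumer's own phase permits).
  if (RANK[upstream] >= RANK[consumer]) return "allowed";
  // Carve-out: a GA consumer may use a P3 upstream for a pilot tenant, with a waiver.
  if (consumer === "P4" && upstream === "P3") return "waiver-required";
  // Everything else: lift the upstream first.
  return "blocked";
}
```

A verdict of "waiver-required" would still need a human decision; the sketch only classifies the pair.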


3. Phase exit criteria

Each phase has hard exits. AI agents (epic-spec-implementation-audit, ghasi-service-readiness-audit) verify these criteria before raising the readiness label.

3.1 P0 → P1 (Stub → Contract-only)

| Area | Criterion |
|---|---|
| Repo | Service scaffolded via commands/scaffold-service.md. |
| Contracts | OpenAPI 3.1 spec for HTTP routes published in pkg-contracts. |
| Events | AsyncAPI 2.6 spec for produced/consumed events published in pkg-contracts. |
| Generated clients | pkg-contracts exports typed TS client; CI publishes preview. |
| Mock server | pkg-contracts ships a mock server consumable by BFFs in tests. |
| Health | /health/live returns 200; /health/ready is exposed and returns 503 until the service reaches P2. |
| Docs | services/<name>/README.md exists with the bundle index. |

3.2 P1 → P2 (Contract-only → Functional alpha)

| Area | Criterion |
|---|---|
| Domain | At least 1 happy-path use case fully implemented (per spec story). |
| Persistence | Drizzle schema + migrations applied; RLS enabled on every tenant-scoped table. |
| Outbox/Inbox | Outbox writes recorded; outbox dispatcher operational; inbox dedup with idempotency keys. |
| API | All P1 contract routes implemented; invalid requests are rejected with the standard error envelope. |
| Tests | Unit coverage ≥ 70 %; integration with Testcontainers; tenant-isolation forever-passing test green. |
| Deploy | Cloud Run service deployed to dev GCP project via Turbo + Cloud Build pipeline. |
| Observability | Structured logs (Pino) + traces (OTel) emitting; service-level dashboard stub exists. |
| Security | Threat model linked in services/<name>/security.md; secrets via Secret Manager. |
| Idempotency | All write routes accept Idempotency-Key; a duplicate request replays the original response. |
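The Idempotency criterion can be illustrated with a minimal in-memory sketch. `IdempotencyStore` and `StoredResponse` are hypothetical names, not part of any service here; a real implementation would presumably persist keys (scoped per tenant) alongside the write rather than keep them in process memory.

```typescript
// Shape of a response that is recorded and replayed (illustrative only).
interface StoredResponse {
  status: number;
  body: string;
}

// Hypothetical in-memory store keyed by the Idempotency-Key header value.
class IdempotencyStore {
  private readonly seen = new Map<string, StoredResponse>();

  // Runs `handler` at most once per key; later calls with the same key
  // return the stored response without re-executing the write.
  execute(key: string, handler: () => StoredResponse): StoredResponse {
    const cached = this.seen.get(key);
    if (cached !== undefined) return cached;
    const response = handler();
    this.seen.set(key, response);
    return response;
  }
}
```

The sketch ignores concurrent requests with the same key and key expiry, both of which a production version would have to handle.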

3.3 P2 → P3 (Functional alpha → Production-ready beta)

| Area | Criterion |
|---|---|
| Functionality | All R stories merged and Done; no open S1/S2 bugs. |
| Coverage | Unit ≥ 80 %, integration ≥ 70 %, contract tests passing in CI per Testing Standards §3.3. |
| SLO | SLO defined (latency p95, error rate, availability); breach alerts wired to Cloud Monitoring → Slack/PagerDuty. |
| Performance | Load test executed (k6) at 2× expected R peak; no SLO breach. |
| Resilience | Chaos test: kill one Pub/Sub subscriber and one Cloud Run instance; the system recovers. |
| Migrations | Forward + rollback rehearsed; no destructive operations without explicit approval. |
| Backups | Cloud SQL automated backups enabled with PITR; restore rehearsal documented. |
| Multi-tenant | RLS regression suite green; cross-tenant exfiltration test green. |
| Audit | All write actions emit audit-service events; PII redacted in logs. |
| Docs | Runbook (services/<name>/runbook.md) covers incident triage, common alerts, rollback. |
| Deploy | Deployed to staging with at least 14 days of soak and zero P1 (priority-1) incidents. |
| AI/HITL | If service has AI affordances: HITL surface present, refusal/abstention paths implemented (see frontend/04-frontend-design-guidelines.md and ai-orchestrator-service spec). |

3.4 P3 → P4 (Production-ready beta → GA)

| Area | Criterion |
|---|---|
| SLO history | ≥ 14 days continuous SLO compliance in prod under real load. |
| On-call | Service is on the on-call rotation; ≥ 2 engineers can respond to a P1 page. |
| Runbooks | Runbook validated by an on-call drill; incident-rehearsal report attached. |
| Cost | FinOps dashboard shows cost-per-tenant within budget; no runaway anomalies. |
| Compliance | If touching PII / payments / secrets: data-classification map signed off; KMS rotation policy active. |
| Localization | All user-facing strings flow through ICU; en/ps/dr/ar pseudo-locale tested. |
| Accessibility | Frontends consuming this service pass axe-core + manual a11y check at the affected screens. |
| Docs | docs/services/<name>/ bundle complete; docs/13-traceability-matrix.md rows all green. |
| Two-team review | One reviewer from a non-owning squad signs off the readiness ticket. |
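The "SLO history" criterion asks for an unbroken run of compliant days. One way to check that from daily snapshots is sketched below; `DailySlo`, `currentCompliantStreak`, and `meetsSloHistory` are hypothetical names, and the sketch assumes one entry per calendar day, oldest first, with no gaps.

```typescript
// One entry per calendar day, oldest first. `compliant` means every
// defined SLO (latency p95, error rate, availability) held that day.
interface DailySlo {
  date: string;
  compliant: boolean;
}

// Length of the unbroken run of compliant days ending at the most recent day.
function currentCompliantStreak(days: DailySlo[]): number {
  let streak = 0;
  for (let i = days.length - 1; i >= 0; i--) {
    if (days[i]?.compliant !== true) break;
    streak++;
  }
  return streak;
}

// True when the most recent `requiredDays` days are all compliant.
function meetsSloHistory(days: DailySlo[], requiredDays = 14): boolean {
  return currentCompliantStreak(days) >= requiredDays;
}
```

A single non-compliant day resets the streak, which matches the "continuous" wording of the criterion.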

3.5 Maintenance criteria (P4 ongoing)

| Area | Criterion |
|---|---|
| SLO error budget | Burn-rate alerts wired; 28-day budget breach triggers a freeze on new features for the service. |
| Dependency hygiene | Renovate/Dependabot PRs merged within 14 days of being opened. |
| Security | All npm audit and Snyk Critical/High issues triaged within 7 days. |
| Drift | No undocumented infra drift (Terraform plan = empty diff). |
| Doc freshness | Spec ↔ implementation audit clean every quarter (epic-spec-implementation-audit). |

A failed maintenance criterion that persists ≥ 14 days demotes the service to P3.
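The 14-day demotion rule reduces to a date comparison. This is a minimal sketch under stated assumptions: `MaintenanceFailure` and `shouldDemoteToP3` are hypothetical names, and each failure is assumed to carry the timestamp at which the criterion first started failing.

```typescript
// A maintenance criterion from section 3.5 that is currently failing.
interface MaintenanceFailure {
  criterion: string;
  failingSince: Date;
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// True when any criterion has been failing for at least `thresholdDays`.
function shouldDemoteToP3(
  failures: MaintenanceFailure[],
  now: Date,
  thresholdDays = 14,
): boolean {
  return failures.some(
    (f) => (now.getTime() - f.failingSince.getTime()) / MS_PER_DAY >= thresholdDays,
  );
}
```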


4. Gate review process

A readiness phase change is tracked as a Jira sub-task (Type=Task, label gate-review) on the service's component, with the phase transition in the title (e.g., [gate] reservation-service P2 → P3).

4.1 Inputs

  • Latest ghasi-service-readiness-audit report for the service.
  • Latest epic-spec-implementation-audit report covering the service's epics.
  • SLO snapshot (last 14/28 days).
  • Open S1/S2 bug list filtered by component.
  • Cost-per-tenant trend.
  • On-call drill report (only for P3 → P4).

4.2 Reviewers

| Phase change | Required reviewers |
|---|---|
| P0 → P1 | Owning EM + 1 SRE |
| P1 → P2 | Owning EM + 1 SRE + 1 SecEng (if PII / payments / secrets) |
| P2 → P3 | Owning EM + Platform & SRE EM + Release Captain + 1 SecEng (if applicable) |
| P3 → P4 | Owning EM + Platform & SRE EM + 1 PM + 1 Designer (if user-facing) + Head of Engineering sign-off |
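The reviewer table can be encoded as a lookup so that automation can check a gate-review ticket for missing sign-offs. This is only a sketch: `requiredReviewers`, `Transition`, and `ServiceTraits` are hypothetical names, and it assumes "if applicable" for P2 → P3 means the same PII / payments / secrets condition as P1 → P2.

```typescript
type Transition = "P0->P1" | "P1->P2" | "P2->P3" | "P3->P4";

// Traits that make conditional reviewers mandatory.
interface ServiceTraits {
  sensitiveData: boolean; // touches PII, payments, or secrets
  userFacing: boolean;
}

// Returns the reviewer roles required for a phase change.
function requiredReviewers(t: Transition, s: ServiceTraits): string[] {
  const secEng = s.sensitiveData ? ["SecEng"] : [];
  const designer = s.userFacing ? ["Designer"] : [];
  switch (t) {
    case "P0->P1":
      return ["Owning EM", "SRE"];
    case "P1->P2":
      return ["Owning EM", "SRE", ...secEng];
    case "P2->P3":
      // Assumption: "if applicable" means the sensitive-data condition.
      return ["Owning EM", "Platform & SRE EM", "Release Captain", ...secEng];
    case "P3->P4":
      return ["Owning EM", "Platform & SRE EM", "PM", ...designer, "Head of Engineering"];
  }
}
```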

4.3 Outputs

  • Updated service-readiness:pX label on the Jira component (we model labels per service via Jira automation).
  • Diff in docs/13-traceability-matrix.md cells for that service.
  • Comment posted to the parent epic(s) with the new phase.

4.4 Cadence

  • Gate reviews run on demand (when criteria are believed met) and at wave checkpoints (end of each wave).
  • A service that has not advanced for 4+ sprints triggers a mandatory gate review, whose outcome is to advance the service, demote it, or document the blocking dependency.

5. Per-service initial gates (R1 starting state)

The R1 wave plan starts with this readiness map. It is updated at each wave checkpoint.

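| Service | Start of R1 | Target by R1-S06 | Target by R1 end (R1-S12) |
|---|---|---|---|
| reservation-service | P0 | P2 | P3 |
| inventory-service | P0 | P2 | P3 |
| pricing-service | P0 | P1 | P2 |
| payment-gateway-service | P0 | P2 | P3 |
| lock-integration-service | P0 | P1 | P2 |
| audit-service | P0 | P2 | P3 |
| housekeeping-service | P0 | P1 | P2 |
| notification-service | P0 | P2 | P3 |
| file-storage-service | P0 | P1 | P2 |
| staff-service | P0 | P2 | P3 |
| tax-service | P0 | P1 | P2 |
| feature-flag-service | P0 | P2 | P3 |
| theme-config-service | P0 | P1 | P2 |
| bff-consumer-service | P0 | P1 | P2 |
| bff-tenant-booking-service | P0 | P2 | P3 |
| bff-backoffice-service | P0 | P2 | P3 |
| app-web-meta | P0 | P1 | P2 |
| app-web-tenant-booking | P0 | P2 | P3 |
| app-desktop-backoffice | P0 | P2 | P3 |
| app-mobile | P0 | P1 | P2 |
| maintenance-service | (R2) | | |
| analytics-service | (R2) | | |
| ai-orchestrator-service | (R2) | | |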

R1 explicitly does not target P4 GA for any service; we observe P3 in staging and pilot prod. P4 begins in R2.


6. Cross-service dependency map (R1)

Used by the planner to enforce dependency direction.

| Consumer (story / surface) | Required upstream phases |
|---|---|
| bff-tenant-booking-service confirm story | reservation-service ≥ P2; inventory-service ≥ P2; pricing-service ≥ P1; payment-gateway-service ≥ P2 (for confirm with payment); tax-service ≥ P1 |
| app-web-tenant-booking checkout | bff-tenant-booking-service ≥ P2; the same upstreams as the confirm story, at the same minimum phases |
| bff-backoffice-service housekeeping board | housekeeping-service ≥ P2; inventory-service ≥ P2; staff-service ≥ P2 |
| app-desktop-backoffice arrivals/departures | bff-backoffice-service ≥ P2; reservation-service ≥ P2 |
| app-mobile partial-offline check-in | bff-backoffice-service ≥ P3; sync engine ready (sub-system ≥ P2) |
| Any payment-touching surface in prod | payment-gateway-service ≥ P3; audit-service ≥ P3 |
| Any lock-touching surface in prod | lock-integration-service ≥ P3 |

The map is the source of truth for the planner. It also lives in pkg-contracts/dependency-map.ts as a typed object, so that a story violating it fails fast in CI.
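What such a typed object and its CI check might look like is sketched below. This is an illustration, not the actual contents of pkg-contracts/dependency-map.ts: `DEPENDENCY_MAP` and `findViolations` are hypothetical names, only two rows of the table are copied, and a service missing from the current-phase input is assumed to be at P0.

```typescript
type Phase = "P0" | "P1" | "P2" | "P3" | "P4";
const RANK: Record<Phase, number> = { P0: 0, P1: 1, P2: 2, P3: 3, P4: 4 };

// Minimum upstream phase per consumer; two rows copied from the table above.
const DEPENDENCY_MAP: Record<string, Record<string, Phase>> = {
  "bff-backoffice-service housekeeping board": {
    "housekeeping-service": "P2",
    "inventory-service": "P2",
    "staff-service": "P2",
  },
  "app-desktop-backoffice arrivals/departures": {
    "bff-backoffice-service": "P2",
    "reservation-service": "P2",
  },
};

// Lists every upstream whose current phase is below the required minimum.
// Assumption: a service absent from `current` is treated as P0.
function findViolations(consumer: string, current: Record<string, Phase>): string[] {
  const required = DEPENDENCY_MAP[consumer] ?? {};
  return Object.keys(required)
    .filter((svc) => RANK[current[svc] ?? "P0"] < RANK[required[svc] ?? "P0"])
    .map((svc) => `${svc} is ${current[svc] ?? "P0"}, needs >= ${required[svc]}`);
}
```

A CI step could fail the build whenever `findViolations` returns a non-empty list for a story's consumer.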


7. Tooling: how readiness is asserted

| Tool | Phase coverage | What it checks |
|---|---|---|
| commands/scaffold-service.md | P0 only | Generates the stub. |
| pkg-contracts CI | P0 → P1 | Spec validity + generated client compiles. |
| Service unit + integration tests | P1 → P2 | Coverage + tenant-isolation + outbox tests. |
| epic-spec-implementation-audit | P2 → P3 | Spec deviation register; recommends fixes. |
| ghasi-service-readiness-audit | P2 → P3 / P3 → P4 | Full readiness scoring against this document. |
| Cloud Monitoring SLO recorders | P3 → P4 | 14-day SLO compliance. |
| On-call drill template | P3 → P4 | Validated runbook. |

When evidence for any criterion is missing, the audit refuses to recommend a phase advance.
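The refuse-on-missing-evidence behaviour can be sketched as a function that never returns a positive recommendation unless every required item is present. `recommend` and `Recommendation` are hypothetical names, and evidence is modelled simply as a map from item name to a link or report ID; the real audits may represent it differently.

```typescript
// Either a positive recommendation, or a refusal listing what is missing.
type Recommendation =
  | { advance: true }
  | { advance: false; missing: string[] };

// `evidence` maps an item name to a link or report ID.
// An absent key or an empty string counts as missing.
function recommend(
  required: string[],
  evidence: Record<string, string | undefined>,
): Recommendation {
  const missing = required.filter((name) => !evidence[name]);
  return missing.length === 0 ? { advance: true } : { advance: false, missing };
}
```

Returning the list of missing items, rather than a bare refusal, tells the owning team what to attach to the gate-review ticket.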


8. Anti-patterns

  • "Ship to prod first, fix readiness later" — only acceptable for the explicit pilot-tenant carve-out.
  • Skipping P1 because "we'll just code straight against the service" — destroys mock-driven contract development.
  • Letting one service jump to P4 while its consumers stay at P2 — creates load and feature traps the consumers cannot withstand.
  • Maintaining a gate review template but not running gates at wave checkpoints — gates rot.
  • Using readiness phases as performance ratings of teams — they are about service surface area, not team output.
  • Allowing exceptions to the dependency direction rules without an ADR.

9. Cross-references


10. Versioning

Same governance as docs/standards/CODING_STANDARDS.md §19. Phase criteria changes require sign-off from Platform & SRE EM + at least one service-owning EM.