Service Readiness Gates — Ghasi Melmastoon
Companion: Roadmap Index · team-capacity-model.md · jira-epic-based-sprint-ruleset.md · docs/standards/DEFINITION_OF_DONE.md · docs/standards/REQUIREMENTS_GUARD_RAILS.md
A service readiness gate is the bar a service must clear before downstream consumers (other services, BFFs, frontends) are allowed to depend on it for production traffic. Gates exist so that an upstream service's gaps cannot silently become a downstream surface's bug.
This document defines the five readiness phases, the exit criteria for each, the gate review process, and the dependency direction rules a sprint planner must respect.
The gates extend (and never replace) the per-story Definition of Done. DoD is per-ticket; readiness gates are per-service-version.
1. The five phases
| Phase | Label (Jira fixVersion / sub-label) | Meaning | Who can depend on this service? |
|---|---|---|---|
| P0 — Stub | service-readiness:p0-stub | Service exists in the monorepo as scaffolded NestJS app with /health/live returning 200. No business logic. | Nobody. Local-dev only. |
| P1 — Contract-only | service-readiness:p1-contract | OpenAPI + AsyncAPI specs published in pkg-contracts; mock server in bff-* and integration tests can use generated client. No real persistence. | Other teams may code against this contract; may not run integration in shared envs. |
| P2 — Functional alpha | service-readiness:p2-alpha | Core happy-path implemented with persistence, outbox, RLS, basic tests, deployed to dev GCP project. | Other services may integrate in dev. Not for staging or prod. |
| P3 — Production-ready beta | service-readiness:p3-beta | All §3 P3 criteria met; deployed to staging; under SLO observation. | Other services may use in staging. One internal pilot tenant in prod allowed with explicit risk waiver. |
| P4 — General availability | service-readiness:p4-ga | All §3 P4 criteria met; on-call rotation active; runbooks complete; SLOs steady ≥ 14 days in prod. | All services and surfaces may depend on it in prod. |
A regression in any criterion demotes the service one phase until restored.
2. Visual: dependency direction rules
P4 GA ──── may depend on ────► P4 GA (✅ allowed)
P4 GA ──── may depend on ────► P3 (⚠️ pilot tenant only, with waiver)
P3 ──── may depend on ────► P3+ (✅ allowed in staging only)
P3 ──── may depend on ────► P2 (❌ blocked; lift upstream first)
P2 ──── may depend on ────► P2+ (✅ allowed in dev only)
P2 ──── may depend on ────► P1 (❌ blocked; lift upstream first)
P1 ──── may depend on ────► P1+ (✅ contract-only integration)
P0 ──── may depend on ────► anything (local-dev only)
The sprint planner (jira-sprint-distribution-rules.md Rule E) refuses to commit a story whose upstreams are at a lower readiness than the story requires.
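The direction rules above can be sketched as a single check. This is a minimal illustration, not the planner's actual implementation: `Phase`, `Verdict`, `SCOPE`, and `checkDependency` are hypothetical names, and pairs the diagram does not list (for example P4 on P2) are assumed to be blocked.

```typescript
// Readiness phases, ordered from P0 (stub) to P4 (GA).
type Phase = 0 | 1 | 2 | 3 | 4;

type Verdict =
  | { allowed: true; scope: string }
  | { allowed: false; reason: string };

// Scope in which a consumer at each phase may use its upstreams,
// taken from the diagram above.
const SCOPE: Record<Phase, string> = {
  0: "local-dev only",
  1: "contract-only integration",
  2: "dev only",
  3: "staging only",
  4: "prod",
};

// Hypothetical helper: applies the direction rules to one consumer/upstream pair.
function checkDependency(consumer: Phase, upstream: Phase, hasWaiver = false): Verdict {
  // P0 may depend on anything, but only in local dev.
  if (consumer === 0) return { allowed: true, scope: SCOPE[0] };
  // An upstream at the same or a higher phase is always acceptable.
  if (upstream >= consumer) return { allowed: true, scope: SCOPE[consumer] };
  // The only carve-out: P4 on P3 for a pilot tenant with an explicit waiver.
  if (consumer === 4 && upstream === 3) {
    return hasWaiver
      ? { allowed: true, scope: "pilot tenant only" }
      : { allowed: false, reason: "P4 on P3 needs a pilot-tenant risk waiver" };
  }
  // Assumption: every other combination is blocked until the upstream is lifted.
  return { allowed: false, reason: `lift upstream to P${consumer} first` };
}
```

For example, `checkDependency(3, 2)` is rejected with "lift upstream to P3 first", while `checkDependency(4, 3, true)` is allowed with the pilot-tenant scope.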
3. Phase exit criteria
Each phase has hard exits. AI agents (epic-spec-implementation-audit, ghasi-service-readiness-audit) verify these criteria before raising the readiness label.
3.1 P0 → P1 (Stub → Contract-only)
| Area | Criterion |
|---|---|
| Repo | Service scaffolded via commands/scaffold-service.md. |
| Contracts | OpenAPI 3.1 spec for HTTP routes published in pkg-contracts. |
| Events | AsyncAPI 2.6 spec for produced/consumed events published in pkg-contracts. |
| Generated clients | pkg-contracts exports typed TS client; CI publishes preview. |
| Mock server | pkg-contracts ships a mock server consumable by BFFs in tests. |
| Health | /health/live returns 200; /health/ready is exposed but returns 503 until P2. |
| Docs | services/<name>/README.md exists with the bundle index. |
3.2 P1 → P2 (Contract-only → Functional alpha)
| Area | Criterion |
|---|---|
| Domain | At least 1 happy-path use case fully implemented (per spec story). |
| Persistence | Drizzle schema + migrations applied; RLS enabled on every tenant-scoped table. |
| Outbox/Inbox | Outbox writes recorded; outbox dispatcher operational; inbox dedup with idempotency keys. |
| API | All P1 contract routes implemented; invalid requests are rejected with the standard error envelope. |
| Tests | Unit ≥ 70 %; integration with Testcontainers; tenant-isolation forever-passing test green. |
| Deploy | Cloud Run service deployed to dev GCP project via Turbo + Cloud Build pipeline. |
| Observability | Structured logs (Pino) + traces (OTel) emitting; service-level dashboard stub exists. |
| Security | Threat model linked in services/<name>/security.md; secrets via Secret Manager. |
| Idempotency | All write routes accept Idempotency-Key; a duplicate request returns the original response. |
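
The Idempotency criterion in the table above can be illustrated with a small replay cache. This is a sketch under stated assumptions: `IdempotencyCache` and `StoredResponse` are hypothetical names, the store is an in-memory `Map` standing in for a database table, and scoping keys per tenant is an assumption rather than something this document specifies.

```typescript
interface StoredResponse {
  status: number;
  body: string;
}

// In-memory stand-in for a persistent idempotency store (assumption: a real
// service would persist this alongside its data, in the same transaction).
class IdempotencyCache {
  private readonly seen = new Map<string, StoredResponse>();

  // Runs `handler` once per (tenantId, key) pair; later calls with the same
  // pair get the stored response back and the handler is not invoked again.
  execute(tenantId: string, key: string, handler: () => StoredResponse): StoredResponse {
    const cacheKey = `${tenantId}:${key}`;
    const previous = this.seen.get(cacheKey);
    if (previous !== undefined) return previous;
    const response = handler();
    this.seen.set(cacheKey, response);
    return response;
  }
}
```

The sketch is synchronous and ignores concurrent duplicates; a production version would need a uniqueness constraint or lock so that two simultaneous requests with the same key cannot both run the handler.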
3.3 P2 → P3 (Functional alpha → Production-ready beta)
| Area | Criterion |
|---|---|
| Functionality | All R |
| Coverage | Unit ≥ 80 %, integration ≥ 70 %, contract tests passing in CI per Testing Standards §3.3. |
| SLO | SLO defined (latency p95, error rate, availability); breach alerts wired to Cloud Monitoring → Slack/PagerDuty. |
| Performance | Load test executed (k6) at 2× expected R |
| Resilience | Chaos test: kill one Pub/Sub subscriber, kill one Cloud Run instance — system recovers. |
| Migrations | Forward + rollback rehearsed; no destructive operations without explicit approval. |
| Backups | Cloud SQL automated backups enabled with PITR; restore-rehearsal documented. |
| Multi-tenant | RLS regression suite green; cross-tenant exfiltration test green. |
| Audit | All write actions emit audit-service events; PII redacted in logs. |
| Docs | Runbook (services/<name>/runbook.md) covers incident triage, common alerts, rollback. |
| Deploy | Deployed to staging with at least 14 days of soak and 0 P1 (priority-1) incidents. |
| AI/HITL | If service has AI affordances: HITL surface present, refusal/abstention paths implemented (see frontend/04-frontend-design-guidelines.md and ai-orchestrator-service spec). |
3.4 P3 → P4 (Production-ready beta → GA)
| Area | Criterion |
|---|---|
| SLO history | ≥ 14 days continuous SLO compliance in prod under real load. |
| On-call | Service is on the on-call rotation; ≥ 2 engineers can respond to a P1 page. |
| Runbooks | Runbook validated by an on-call drill; incident-rehearsal report attached. |
| Cost | FinOps dashboard shows cost-per-tenant within budget; no runaway anomalies. |
| Compliance | If touching PII / payments / secrets: data-classification map signed off; KMS rotation policy active. |
| Localization | All user-facing strings flow through ICU; en/ps/dr/ar pseudo-locale tested. |
| Accessibility | Frontends consuming this service pass axe-core + manual a11y check at the affected screens. |
| Docs | docs/services/<name>/ bundle complete; docs/13-traceability-matrix.md rows all green. |
| Two-team review | One reviewer from a non-owning squad signs off the readiness ticket. |
3.5 Maintenance criteria (P4 ongoing)
| Area | Criterion |
|---|---|
| SLO error budget | Burn-rate alerts wired; 28-day budget breach triggers a freeze on new features for the service. |
| Dependency hygiene | Renovate/Dependabot PRs merged within 14 days of being opened. |
| Security | All npm audit and Snyk Critical/High issues triaged within 7 days. |
| Drift | No undocumented infra drift (Terraform plan = empty diff). |
| Doc freshness | Spec ↔ implementation audit clean every quarter (epic-spec-implementation-audit). |
A failed maintenance criterion that persists ≥ 14 days demotes the service to P3.
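The 28-day error-budget freeze in the SLO error budget row comes down to simple arithmetic, sketched below. The function names and the convention of expressing the SLO target as a success ratio (for example 0.999) are assumptions for illustration; the actual alerting lives in Cloud Monitoring and may be computed differently.

```typescript
// Fraction of the error budget used in a window (1.0 means fully spent).
// sloTarget is the success-ratio objective, e.g. 0.999 (assumed format).
function budgetConsumed(sloTarget: number, totalRequests: number, failedRequests: number): number {
  if (sloTarget <= 0 || sloTarget >= 1) {
    throw new RangeError("sloTarget must be strictly between 0 and 1");
  }
  if (totalRequests === 0) return 0; // no traffic, nothing burned
  const allowedFailureRatio = 1 - sloTarget;
  const observedFailureRatio = failedRequests / totalRequests;
  return observedFailureRatio / allowedFailureRatio;
}

// Hypothetical policy hook: freeze new features once the 28-day budget is exceeded.
function shouldFreezeFeatures(sloTarget: number, totalRequests: number, failedRequests: number): boolean {
  return budgetConsumed(sloTarget, totalRequests, failedRequests) > 1;
}
```

With a 99 % target, 50 failures in 10,000 requests use roughly half of the budget, while 200 failures use roughly twice the budget and would trigger the freeze.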
4. Gate review process
A readiness phase change is a Jira sub-task (Type=Task, label gate-review) on the service's component, with the new phase in the title (e.g., [gate] reservation-service P2 → P3).
4.1 Inputs
- Latest ghasi-service-readiness-audit report for the service.
- Latest epic-spec-implementation-audit report covering the service's epics.
- SLO snapshot (last 14/28 days).
- Open S1/S2 bug list filtered by component.
- Cost-per-tenant trend.
- On-call drill report (only for P3 → P4).
4.2 Reviewers
| Phase change | Required reviewers |
|---|---|
| P0 → P1 | Owning EM + 1 SRE |
| P1 → P2 | Owning EM + 1 SRE + 1 SecEng (if PII / payments / secrets) |
| P2 → P3 | Owning EM + Platform & SRE EM + Release Captain + 1 SecEng (if applicable) |
| P3 → P4 | Owning EM + Platform & SRE EM + 1 PM + 1 Designer (if user-facing) + Head of Engineering sign-off |
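
If the reviewer table above is automated (for example, to pre-fill a gate-review ticket), it could be encoded roughly as follows. This is only a sketch: `requiredReviewers`, `Transition`, and `ServiceTraits` are hypothetical names, and reading "if applicable" for P2 → P3 as the same PII / payments / secrets condition used for P1 → P2 is an assumption.

```typescript
type Transition = "P0->P1" | "P1->P2" | "P2->P3" | "P3->P4";

interface ServiceTraits {
  sensitiveData: boolean; // touches PII, payments, or secrets
  userFacing: boolean;
}

// Returns the reviewer roles from the table for a given phase change.
function requiredReviewers(transition: Transition, traits: ServiceTraits): string[] {
  const secEng = traits.sensitiveData ? ["SecEng"] : [];
  const designer = traits.userFacing ? ["Designer"] : [];
  switch (transition) {
    case "P0->P1":
      return ["Owning EM", "SRE"];
    case "P1->P2":
      return ["Owning EM", "SRE", ...secEng];
    case "P2->P3":
      // Assumption: "if applicable" means the same sensitive-data condition.
      return ["Owning EM", "Platform & SRE EM", "Release Captain", ...secEng];
    case "P3->P4":
      return ["Owning EM", "Platform & SRE EM", "PM", ...designer, "Head of Engineering"];
  }
}
```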
4.3 Outputs
- Updated service-readiness:pX label on the Jira component (we model labels per service via Jira automation).
- Diff in docs/13-traceability-matrix.md cells for that service.
- Comment posted to the parent epic(s) with the new phase.
4.4 Cadence
- Gate reviews run on demand (when criteria are believed met) and at wave checkpoints (end of each wave).
- A service that has not advanced for 4+ sprints triggers a mandatory gate review, which must either advance it, demote it, or document the blocking dependency.
5. Per-service initial gates (R1 starting state)
The R1 wave plan starts with this readiness map. Updated at each wave checkpoint.
| Service | Start of R1 | Target by R1-S06 | Target by R1 end (R1-S12) |
|---|---|---|---|
| reservation-service | P0 | P2 | P3 |
| inventory-service | P0 | P2 | P3 |
| pricing-service | P0 | P1 | P2 |
| payment-gateway-service | P0 | P2 | P3 |
| lock-integration-service | P0 | P1 | P2 |
| audit-service | P0 | P2 | P3 |
| housekeeping-service | P0 | P1 | P2 |
| notification-service | P0 | P2 | P3 |
| file-storage-service | P0 | P1 | P2 |
| staff-service | P0 | P2 | P3 |
| tax-service | P0 | P1 | P2 |
| feature-flag-service | P0 | P2 | P3 |
| theme-config-service | P0 | P1 | P2 |
| bff-consumer-service | P0 | P1 | P2 |
| bff-tenant-booking-service | P0 | P2 | P3 |
| bff-backoffice-service | P0 | P2 | P3 |
| app-web-meta | P0 | P1 | P2 |
| app-web-tenant-booking | P0 | P2 | P3 |
| app-desktop-backoffice | P0 | P2 | P3 |
| app-mobile | P0 | P1 | P2 |
| maintenance-service | (R2) | — | — |
| analytics-service | (R2) | — | — |
| ai-orchestrator-service | (R2) | — | — |
R1 explicitly does not target P4 GA for any service; we observe P3 in staging and pilot prod. P4 begins in R2.
6. Cross-service dependency map (R1)
Used by the planner to enforce dependency direction.
| Consumer (story / surface) | Required upstream phases |
|---|---|
| bff-tenant-booking-service confirm story | reservation-service ≥ P2; inventory-service ≥ P2; pricing-service ≥ P1; payment-gateway-service ≥ P2 (for confirm with payment); tax-service ≥ P1 |
| app-web-tenant-booking checkout | bff-tenant-booking-service ≥ P2; the same upstreams as the confirm story, at the same minimum phases |
| bff-backoffice-service housekeeping board | housekeeping-service ≥ P2; inventory-service ≥ P2; staff-service ≥ P2 |
| app-desktop-backoffice arrivals/departures | bff-backoffice-service ≥ P2; reservation-service ≥ P2 |
| app-mobile partial-offline check-in | bff-backoffice-service ≥ P3; sync engine ready (sub-system ≥ P2) |
| Any payment-touching surface in prod | payment-gateway-service ≥ P3; audit-service ≥ P3 |
| Any lock-touching surface in prod | lock-integration-service ≥ P3 |
The map is the planner's source of truth; it also lives in pkg-contracts/dependency-map.ts as a typed object so that stories violating it fail fast in CI.
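As a rough idea of what such a typed object and its CI check might look like, the sketch below encodes two rows of the table above. The real contents of pkg-contracts/dependency-map.ts are not shown in this document, so the shape of `dependencyMap`, the use of the consumer description as the key, and the `violations` helper are all assumptions.

```typescript
// Readiness phases, ordered from P0 (stub) to P4 (GA).
type Phase = 0 | 1 | 2 | 3 | 4;

// Minimum upstream phase per consumer story; two rows copied from the table.
const dependencyMap: Record<string, Record<string, Phase>> = {
  "bff-backoffice-service housekeeping board": {
    "housekeeping-service": 2,
    "inventory-service": 2,
    "staff-service": 2,
  },
  "app-desktop-backoffice arrivals/departures": {
    "bff-backoffice-service": 2,
    "reservation-service": 2,
  },
};

// Hypothetical CI check: lists every upstream that is below its required phase.
// A service missing from `current` is treated as P0.
function violations(story: string, current: Record<string, Phase>): string[] {
  const required = dependencyMap[story] ?? {};
  return Object.entries(required)
    .filter(([service, minimum]) => (current[service] ?? 0) < minimum)
    .map(([service, minimum]) => `${service} must be >= P${minimum}`);
}
```

A CI step could call `violations` for each story in a sprint candidate and fail the build when the returned list is non-empty.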
7. Tooling: how readiness is asserted
| Tool | Phase coverage | What it checks |
|---|---|---|
| commands/scaffold-service.md | P0 only | Generates the stub. |
| pkg-contracts CI | P0 → P1 | Spec validity + generated client compiles. |
| Service unit + integration tests | P1 → P2 | Coverage + tenant-isolation + outbox tests. |
| epic-spec-implementation-audit | P2 → P3 | Spec deviation register; recommends fixes. |
| ghasi-service-readiness-audit | P2 → P3 / P3 → P4 | Full readiness scoring against this document. |
| Cloud Monitoring SLO recorders | P3 → P4 | 14-day SLO compliance. |
| On-call drill template | P3 → P4 | Validated runbook. |
When evidence is missing, the audit refuses to recommend phase advance.
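The "no evidence, no advance" rule can be sketched as follows. The types and the `recommend` function are hypothetical and do not describe the real output of ghasi-service-readiness-audit; the point is only that a missing result blocks advancement in the same way a failing one does.

```typescript
type Evidence = "pass" | "fail" | "missing";

interface CriterionResult {
  area: string; // e.g. "Coverage", "SLO", "Backups"
  evidence: Evidence;
}

interface Recommendation {
  advance: boolean;
  blockers: string[];
}

// Recommends a phase advance only when every criterion has passing evidence.
// An empty result list counts as "no evidence" and therefore never advances.
function recommend(results: CriterionResult[]): Recommendation {
  const blockers = results
    .filter((result) => result.evidence !== "pass")
    .map((result) => `${result.area}: ${result.evidence}`);
  return { advance: results.length > 0 && blockers.length === 0, blockers };
}
```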
8. Anti-patterns
- "Ship to prod first, fix readiness later" — only acceptable for the explicit pilot-tenant carve-out.
- Skipping P1 because "we'll just code straight against the service" — destroys mock-driven contract development.
- Letting one service jump to P4 while its consumers stay at P2 — creates load and feature traps the consumers cannot withstand.
- Maintaining a gate review template but not running gates at wave checkpoints — gates rot.
- Using readiness phases as performance ratings of teams — they are about service surface area, not team output.
- Allowing exceptions to the dependency direction rules without an ADR.
9. Cross-references
- docs/standards/DEFINITION_OF_DONE.md
- docs/standards/TESTING_STANDARDS.md
- docs/standards/REQUIREMENTS_GUARD_RAILS.md
- jira-sprint-distribution-rules.md
- team-capacity-model.md
- jira-epic-based-sprint-ruleset.md
10. Versioning
Same governance as docs/standards/CODING_STANDARDS.md §19. Phase criteria changes require sign-off from Platform & SRE EM + at least one service-owning EM.