maintenance-service · SERVICE_READINESS

Production cutover gate. Every box must be ticked and signed off by the named role before traffic is permitted on a fresh region or before declaring the service GA.

1. Documentation

  • docs/03-microservices/maintenance-service.md published and linked from index
  • All 17 service-bundle files present, no TODO markers
  • SYNC_CONTRACT.md cross-referenced from desktop's docs index
  • Error codes registered in docs/standards/ERROR_CODES.md with HTTP, retriability, i18n, runbooks
  • ULID prefixes registered in docs/standards/NAMING.md
  • Topics + schemas registered in docs/04-event-driven-architecture.md topic registry
  • OpenAPI v1 published to schema registry; consumers notified
  • Runbooks linked from every alert in OBSERVABILITY.md §6 — links resolve
  • Sign-off: Tech writer, Domain owner

2. Code quality

  • No TODO/FIXME comments in domain/ or application/
  • No TypeScript any type in domain/ or public DTOs
  • All public functions JSDoc-documented at the use case level
  • No dead code (verified by ts-prune)
  • No console.log; structured logger only
  • eslint, tsc --noEmit, prettier --check all green on main
  • No circular dependencies (madge)
  • No imports from infrastructure/ inside application/ or domain/
  • Sign-off: Engineering lead

3. Test coverage and gates

  • Branch coverage: domain/ ≥ 95%, application/ ≥ 90%, infrastructure/ ≥ 70%, controllers/ ≥ 80%
  • Mandatory specs present and green:
    • integration/tenant-isolation.spec.ts
    • integration/outbox.spec.ts
    • integration/inbox.spec.ts
  • State machine: every transition (allowed and denied) covered
  • Saga compensation: room-block rejection, vendor no-show, relocation no-inventory, OCC concurrent resolve
  • Concurrency: at least one fast-check property test on cost rollup; explicit OCC race spec
  • Sync push: OCC, idempotency, partial failure, clock skew specs
  • AI: orchestrator timeout, invalid enum, budget exhaustion specs
  • Performance: locust scripts pass NFR targets in OBSERVABILITY.md §3
  • Pact contracts published and verified against all consumers (property, reservation, housekeeping, notification, billing, analytics, audit, bff-backoffice, sync)
  • OpenAPI diff against previous version: no breaking changes
  • Event schema golden tests: TS / JSON Schema / sample payload all in lock-step
  • Sign-off: Engineering lead, QA

4. API & event hygiene

  • OpenAPI spec auto-generated; CI fails on uncommitted diff
  • All public endpoints documented with examples
  • All error codes return RFC 7807 envelope
  • Idempotency keys honoured on every state-changing endpoint
  • Rate limits configured at Kong: 50 RPS per tenant for BFF
  • Each event has TS interface + JSON Schema + golden sample + retention class set
  • No event has "raw PII" — vendor identifiers used, no plaintext numbers
  • Outbox + inbox correctness verified by chaos run
  • Sign-off: API governance, Eventing team

5. Storage & migrations

  • All tables have RLS enabled with tenant_id policy
  • All money columns are bigint micro-units; never numeric decimal
  • All CREATE INDEX statements use CONCURRENTLY in production migrations
  • Expand-then-contract migrations only; no destructive ops in single step
  • PgBouncer in transaction mode in front of Cloud SQL
  • Backups: daily full + 7 d PITR; cross-region replica in europe-west4
  • DR runbook exists and was last drilled in staging within the last 90 days
  • Sign-off: DBA, SRE

6. Security

  • All endpoints behind JWT + tenant header check at Kong
  • RLS verified by tenant-isolation.spec.ts
  • Workload Identity used for all GCP API calls; no static SA keys
  • Secrets in Secret Manager, never in env vars
  • CMEK enabled on Cloud SQL and melmastoon-vendor-invoices/ bucket
  • AI calls go through orchestrator; no model SDK in service
  • PII redaction verified before any orchestrator call
  • Audit events emitted on every state mutation
  • Threat model in SECURITY_MODEL.md §11 reviewed in last 12 months
  • Penetration test report on file (last 12 months)
  • Sign-off: Security
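A migration fragment that satisfies the storage gates in section 5 (tenant_id RLS policy, bigint micro-units, CONCURRENTLY index builds, expand-only step) together with the RLS verification gate in section 6. This is an added example, not taken from the service's own migrations; the maintenance.work_order table, its columns, the app.tenant_id setting and the sample values are assumptions.

```sql
-- Added example, not the service's own migration. Table, columns and the
-- app.tenant_id setting are assumed names for illustration.

-- Expand step only: additive DDL, nothing dropped or rewritten in this step.
CREATE TABLE IF NOT EXISTS maintenance.work_order (
    id          text        PRIMARY KEY,           -- ULID, prefix per NAMING.md
    tenant_id   text        NOT NULL,
    status      text        NOT NULL,
    cost_micros bigint      NOT NULL DEFAULT 0,    -- money in micro-units, never numeric
    created_at  timestamptz NOT NULL DEFAULT now()
);

-- Enable RLS and FORCE it so the table owner (the migration role) is not exempt.
ALTER TABLE maintenance.work_order ENABLE ROW LEVEL SECURITY;
ALTER TABLE maintenance.work_order FORCE ROW LEVEL SECURITY;

-- Tenant scoping from a per-transaction setting. Behind PgBouncer in
-- transaction mode the application must use SET LOCAL inside each transaction;
-- a plain SET would stick to the pooled connection and leak to the next client.
-- current_setting(..., true) yields NULL when the setting is absent, so an
-- unscoped session matches no rows rather than seeing every tenant.
CREATE POLICY work_order_tenant_isolation ON maintenance.work_order
    USING      (tenant_id = current_setting('app.tenant_id', true))
    WITH CHECK (tenant_id = current_setting('app.tenant_id', true));

-- CONCURRENTLY cannot run inside BEGIN/COMMIT, so this statement belongs in
-- its own non-transactional migration file.
CREATE INDEX CONCURRENTLY IF NOT EXISTS work_order_tenant_status_idx
    ON maintenance.work_order (tenant_id, status);

-- Verification in the spirit of tenant-isolation.spec.ts. Run as the
-- application role: superusers and BYPASSRLS roles skip the policy entirely.
BEGIN;
SET LOCAL app.tenant_id = 'tenant_a';
INSERT INTO maintenance.work_order (id, tenant_id, status, cost_micros)
    VALUES ('wo_EXAMPLEULID0000000001', 'tenant_a', 'open', 1250000);
-- A row tagged for another tenant fails the WITH CHECK clause:
-- INSERT INTO maintenance.work_order VALUES ('wo_...', 'tenant_b', 'open', 0, now());
COMMIT;

BEGIN;
SET LOCAL app.tenant_id = 'tenant_b';
-- The isolation check: tenant_a's row must not be visible to this session.
SELECT id, tenant_id, status FROM maintenance.work_order;
COMMIT;
```

The spec should also assert the no-setting case, where current_setting returns NULL and the SELECT returns nothing, since a missing tenant header at Kong must fail closed rather than widen the query.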

7. Observability

  • OTel tracing enabled with required attributes (per OBSERVABILITY.md §1)
  • All logs structured JSON; no PII in logs
  • Dashboards published: Service health, Operations, AI, Sync
  • Alerts configured per OBSERVABILITY.md §6 with runbook links
  • SLOs published in error-budget tracking system
  • Synthetic checks running every 60 s from 3 regions
  • BigQuery archive sink connected and ingesting
  • Sign-off: SRE

8. Deployment

  • Two Cloud Run services (api + workers) deployed via CI from main only
  • Canary release: 10% → 50% → 100% traffic shift with bake intervals
  • Migration job runs pre-deploy; failure halts pipeline
  • Rollback runbook tested (last quarter)
  • Health and readiness probes configured
  • Resource limits set (CPU, mem) per DEPLOYMENT_TOPOLOGY.md §2
  • Cloud Scheduler entries deployed for all worker cron jobs
  • Outbox relay configured for maintenance.outbox table
  • Sign-off: SRE, Platform

9. Desktop & sync integration

  • SYNC_CONTRACT.md accepted by Desktop team
  • Pull payload size verified at < 200 KB compressed for typical property
  • Push commands accept the canonical command set; rejected commands return clear error codes
  • Conflict log entry surfaced in desktop "Sync issues" view
  • Offline UX walkthrough recorded and approved
  • Clock skew warning UX implemented
  • Sign-off: Desktop lead

10. AI integration

  • All capabilities registered with ai-orchestrator-service
  • Provenance persisted on every AI-influenced field
  • HITL flow implemented in BFF for severity and vendor-message-draft
  • Per-tenant budget configured; soft-cap and hard-cap behaviour verified
  • PII redaction verified in audit trail samples
  • Model rollback procedure documented and tested
  • Sign-off: AI/ML lead

11. Operations

  • On-call rotation defined; PagerDuty integration tested
  • Runbooks linked from every alert
  • Tenant onboarding runbook tested in staging
  • Tenant offboarding (data export + erasure) runbook tested
  • Customer-support runbooks for common issues drafted (vendor no-show, OOO not lifting, AI suggestion wrong)
  • Dependency map up to date in service catalog
  • Sign-off: Support, On-call

12. Compliance

  • Data retention configured per DATA_MODEL.md §5
  • Vendor invoice files retained 7 years (regulated)
  • Audit events retained per platform policy
  • GDPR: erasure flow tested for vendor records
  • Data residency: confirmed all production storage in europe-west1
  • DPIA on file (Data Protection Impact Assessment)
  • Sign-off: Compliance, Legal

13. Capacity & cost

  • Capacity plan in DEPLOYMENT_TOPOLOGY.md §9 reviewed
  • Cloud SQL sized for 2× expected peak
  • Cloud Run min/max instances set per environment
  • AI cost guardrails configured per tenant
  • BigQuery storage cost projection within budget
  • Sign-off: Finance ops, SRE

14. Communication

  • Service announcement drafted for internal Slack
  • Customer-facing changelog entry drafted
  • Backoffice product team trained on the new operations
  • Vendor outreach playbook updated (WhatsApp/SMS templates approved)
  • Sign-off: Product, Marketing ops

Final cutover sign-off: Engineering lead + SRE + Security + Product + Domain owner (5 signatures). The signed checklist is committed at services/maintenance-service/READINESS_<YYYY-MM-DD>.md and referenced in the release tag annotation.