SERVICE_READINESS — notification-service
Sibling: DEPLOYMENT_TOPOLOGY · OBSERVABILITY · TESTING_STRATEGY · SERVICE_RISK_REGISTER
Strategic anchors: docs/standards/DEFINITION_OF_DONE · docs/standards/SERVICE_TEMPLATE
This service is ready for production when every gate below is green and every owner has signed off. Sign-offs are recorded in the rollout ticket; gaps remain in SERVICE_RISK_REGISTER with a remediation plan.
Current overall status: Beta. Staged rollout to early design-partner tenants on gcp-asia-south1 is under way; rollout to additional regions follows the phases in §10.
1. Documentation completeness
| Document | Status |
|---|---|
| docs/03-microservices/notification-service.md (catalog summary) | ✅ |
| services/notification-service/SERVICE_OVERVIEW.md | ✅ |
| services/notification-service/DOMAIN_MODEL.md | ✅ |
| services/notification-service/APPLICATION_LOGIC.md | ✅ |
| services/notification-service/API_CONTRACTS.md | ✅ |
| services/notification-service/EVENT_SCHEMAS.md | ✅ |
| services/notification-service/DATA_MODEL.md | ✅ |
| services/notification-service/SYNC_CONTRACT.md | ✅ |
| services/notification-service/AI_INTEGRATION.md | ✅ |
| services/notification-service/SECURITY_MODEL.md | ✅ |
| services/notification-service/OBSERVABILITY.md | ✅ |
| services/notification-service/TESTING_STRATEGY.md | ✅ |
| services/notification-service/DEPLOYMENT_TOPOLOGY.md | ✅ |
| services/notification-service/FAILURE_MODES.md | ✅ |
| services/notification-service/LOCAL_DEV_SETUP.md | ✅ |
| services/notification-service/SERVICE_RISK_REGISTER.md | ✅ |
| services/notification-service/MIGRATION_PLAN.md | ✅ |
2. Code & implementation gates
| Gate | Required state |
|---|---|
| Tag v1.0.0 cut from main | ⏳ pending GA |
| Coverage thresholds met (lines ≥ 85, branches ≥ 80, domain ≥ 95) | ✅ |
| Zero TODO/FIXME in src/ flagged release-blocker | ✅ |
| Lint, format, typecheck clean | ✅ |
| OpenAPI ↔ implementation drift | ✅ none |
| Event-schema ↔ payload drift | ✅ none |
| Webhook contract tests for all configured vendors | ✅ |
| Migration dry-run on staging snapshot | ✅ |
3. Security
| Control | Required state |
|---|---|
| Threat model reviewed (SECURITY_MODEL §12) | ✅ |
| RLS enforced on every tenant-scoped table; no BYPASSRLS for app role | ✅ |
| CMEK on Cloud SQL, GCS, backups | ✅ |
| Vendor credentials only in Secret Manager (no .env references in CI/CD) | ✅ |
| HMAC verification per vendor, replay window enforced | ✅ |
| Opt-out tokens single-use, hashed-at-rest | ✅ |
| PII never in logs / events / Electron replicas | ✅ |
| AI provenance present on every AI-derived artefact | ✅ |
| HITL gate enforced for ai_drafted template publish | ✅ |
| Secret rotation documented + automated | ✅ |
| Annual third-party pentest scope confirmed | ✅ scoped; pentest scheduled for Q3 |
| OWASP top-10 covered by SAST/DAST/dependency scans | ✅ |
| GDPR Art 17 (erasure) propagation tested | ✅ |
| Data-residency routing tested with cross-region canary | ✅ |
Open security items: 0 high, 1 medium (R-NTF-04 in SERVICE_RISK_REGISTER), 3 low.
4. Reliability & SLOs
| SLI | Target | Current (last 30 d staging) |
|---|---|---|
| Enqueue p95 | ≤ 250 ms | 178 ms ✅ |
| Dispatch p95 (transactional) | ≤ 5 s | 2.1 s ✅ |
| Dispatch p95 (operational/reminder) | ≤ 30 s | 11 s ✅ |
| Webhook ingestion success | ≥ 99.9 % | 99.97 % ✅ |
| WS feed availability | ≥ 99.5 % | 99.78 % ✅ |
| Outbox publish lag p95 | ≤ 1 s | 320 ms ✅ |
| Pub/Sub consumer lag p95 | ≤ 5 s | 1.4 s ✅ |
| Render success | ≥ 99.95 % | 99.97 % ✅ |
| API availability | ≥ 99.9 % | 99.93 % ✅ |
| Email delivered rate | ≥ 95 % | 96.4 % ✅ |
| SMS delivered rate | ≥ 92 % | 93.1 % ✅ |
| WhatsApp delivered rate | ≥ 95 % | 95.6 % ✅ |
| Push delivered rate | ≥ 90 % | 91.0 % ✅ |
Error budget burn-down dashboard: Notifications Overview; alert routing per OBSERVABILITY §7.
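For the ratio-based SLIs above (availability, success, and delivered rates), the share of error budget already consumed follows directly from the target and the measured value. The helper below is a sketch of that arithmetic only; the function name is hypothetical, and it is not a description of how the Notifications Overview dashboard computes burn. It does not apply to the latency SLIs.

```python
def error_budget_consumed(target_pct: float, actual_pct: float) -> float:
    """Fraction of the error budget used: observed failure / allowed failure.

    Both arguments are success percentages, e.g. 99.9 and 99.93.
    A result above 1.0 means the SLO is breached for the period.
    """
    allowed_failure = 100.0 - target_pct
    if allowed_failure <= 0:
        raise ValueError("target must be below 100 %")
    return (100.0 - actual_pct) / allowed_failure


# Values taken from the table above (last 30 days, staging).
api = error_budget_consumed(99.9, 99.93)       # about 0.70 of the budget
webhook = error_budget_consumed(99.9, 99.97)   # about 0.30 of the budget
render = error_budget_consumed(99.95, 99.97)   # about 0.60 of the budget
```

Read this way, API availability is green but has used roughly 70 % of its 30-day budget in staging, which leaves less headroom than the webhook ingestion figure does.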
5. Observability
| Item | Required |
|---|---|
| OpenTelemetry traces, metrics, logs configured and exported | ✅ |
| All RED metrics emitted | ✅ |
| Domain metrics in OBSERVABILITY §3 emitted | ✅ |
| 6 dashboards provisioned in Grafana / Cloud Monitoring | ✅ |
| All alerts in OBSERVABILITY §7 configured with runbook links | ✅ |
| Synthetic monitors green for ≥ 14 days | ✅ |
| Audit signals to audit-service confirmed end-to-end | ✅ |
| AI telemetry joined and dashboarded | ✅ |
| Cost telemetry per tenant available | ✅ |
6. Operational readiness
| Item | Required |
|---|---|
| Runbooks for every alert (in FAILURE_MODES) | ✅ |
| On-call rotation defined for notification-service | ✅ team-comms |
| DR plan tested in staging (full cross-region failover) | ✅ Q2 drill passed |
| Backup/restore drilled (Cloud SQL PITR) | ✅ Q1 drill passed |
| Incident comms templates ready (status-page + tenant in-app) | ✅ |
| Capacity model validated against §7 of DEPLOYMENT_TOPOLOGY | ✅ |
| Cost guardrails active (budget alerts at 50/80/100 %) | ✅ |
| Rollback procedure tested in staging | ✅ |
7. Compliance
| Item | Required |
|---|---|
| Data Processing Agreement coverage with each vendor | ✅ SendGrid, Twilio, Meta WhatsApp, Infobip, FCM |
| GDPR Art 7 (consent for marketing) records visible per tenant | ✅ |
| GDPR Art 17 (erasure) wired to iam.user.deleted.v1 | ✅ |
| WhatsApp Business policy compliance (template approval enforced) | ✅ |
| Local SMS regulator (PK PTA, AF, IR) registered sender ID enforcement | ✅ |
| Tenant data-residency boundaries enforced | ✅ tested |
| Audit log retention 7 y for regulated subjects | ✅ |
8. Cross-team dependencies
| Upstream | Status |
|---|---|
| iam-service (JWT issuance + JWKS) | ✅ |
| tenant-service (tenant projection + theme + policy + DKIM verification workflow) | ✅ |
| reservation-service (events + projection client) | ✅ |
| billing-service (invoice attachment client + payment events) | ✅ |
| lock-integration-service (mobile-key token client) | ✅ |
| ai-orchestrator-service (HITL workflow + capability tools) | ✅ |
| audit-service (audit sink) | ✅ |
| bff-backoffice-service (UI integration) | ✅ |
| bff-tenant-booking-service (guest feed + opt-out UI) | ✅ |
| sync-service (Electron pull/push wiring) | ✅ |

| Downstream | Status |
|---|---|
| All 11 consumers of our published events have integration tests | ✅ |
| Pact contracts established | ✅ |
9. Tenants & rollout state
| Tenant tier | Status |
|---|---|
| Synthetic test tenants | ✅ in dev/staging/prod |
| Design-partner tenants (3) | ✅ live in asia-south1 since 2026-03-15 |
| Early-access tenants (~20) | 🔶 onboarding through 2026-05 |
| General availability | ⏳ targeted 2026-Q3 (after voice phase 3 cut) |
| ME-residency tenants | ⏳ pending me-central1 regional cutover (2026-Q3) |
10. Phased feature rollout
| Phase | Scope | State |
|---|---|---|
| Phase 1 | Email + SMS + in-app + WhatsApp transactional flows; templates; preferences; suppression; webhook ingestion; trigger map; outbox/CDC; basic backoffice | ✅ |
| Phase 2 | Push (web + mobile); marketing batches; AI personalisation (suggest_only); AI translation HITL; cost analytics; me-central1 regional rollout | 🔶 in progress |
| Phase 3 | Voice/IVR; SMS short codes per region; tenant-self-serve template marketplace; auto_send AI for marketing/reminder | ⏳ planned |
Feature flags for phase-2 controls are listed in DEPLOYMENT_TOPOLOGY §6.
11. Sign-offs
Required signatures before each major rollout:
12. What "GA" means here
For us to declare GA, all of the following must be true for 30 consecutive days:
- SLOs in §4 met or exceeded across all enabled regions and tenants.
- Zero P1 incidents attributable to our service.
- ≤ 1 P2 incident per week with no recurring root cause.
- AI HITL queue cleared within tenant-policy TTL on ≥ 95 % of items.
- Webhook ingestion HMAC failure rate < 0.01 %.
- Cost per tenant within 10 % of budget model.
- Audit findings tracked in SERVICE_RISK_REGISTER all in green or accepted-with-mitigation.
Until then, we operate as Beta with explicit tenant onboarding gates and a daily on-call standup focused on this service.
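The GA criteria above can be read as a predicate evaluated once per day over a 30-day window. The sketch below is one possible encoding under stated assumptions: the `DailySnapshot` type and its field names are hypothetical, "≤ 1 P2 incident per week" is interpreted as a trailing 7-day count, and "within 10 % of budget model" is interpreted as an absolute deviation of at most 10 %.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DailySnapshot:
    """One day of GA-gate inputs (hypothetical shape, for illustration)."""
    slos_met: bool                  # every SLO in §4 met across enabled regions/tenants
    p1_incidents: int               # P1 incidents attributable to this service
    p2_incidents_trailing_7d: int   # assumed reading of "per week"
    p2_recurring_root_cause: bool
    hitl_within_ttl_ratio: float    # 0.0 to 1.0
    hmac_failure_rate: float        # 0.0 to 1.0
    cost_deviation_ratio: float     # (actual - model) / model
    unresolved_audit_findings: int  # neither green nor accepted-with-mitigation


def day_passes(d: DailySnapshot) -> bool:
    """True if a single day satisfies every GA criterion."""
    return (
        d.slos_met
        and d.p1_incidents == 0
        and d.p2_incidents_trailing_7d <= 1
        and not d.p2_recurring_root_cause
        and d.hitl_within_ttl_ratio >= 0.95
        and d.hmac_failure_rate < 0.0001   # 0.01 %
        and abs(d.cost_deviation_ratio) <= 0.10
        and d.unresolved_audit_findings == 0
    )


def ga_ready(days: list[DailySnapshot], window: int = 30) -> bool:
    """True only if the most recent `window` days exist and all pass."""
    return len(days) >= window and all(day_passes(d) for d in days[-window:])
```

Because the window is consecutive, a single failing day resets the count: `ga_ready` stays false until 30 passing days have accumulated after it.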