SERVICE_READINESS — notification-service

Sibling: DEPLOYMENT_TOPOLOGY · OBSERVABILITY · TESTING_STRATEGY · SERVICE_RISK_REGISTER

Strategic anchors: docs/standards/DEFINITION_OF_DONE · docs/standards/SERVICE_TEMPLATE

This service is ready for production when every gate below is green and every owner has signed off. Sign-offs are recorded in the rollout ticket; any remaining gaps are tracked in SERVICE_RISK_REGISTER with a remediation plan.

Current overall status: Beta, in staged rollout to early design-partner tenants on gcp-asia-south1. Rollout to additional regions follows the phases in §10.


1. Documentation completeness

| Document | Status |
|---|---|
| docs/03-microservices/notification-service.md (catalog summary) | |
| services/notification-service/SERVICE_OVERVIEW.md | |
| services/notification-service/DOMAIN_MODEL.md | |
| services/notification-service/APPLICATION_LOGIC.md | |
| services/notification-service/API_CONTRACTS.md | |
| services/notification-service/EVENT_SCHEMAS.md | |
| services/notification-service/DATA_MODEL.md | |
| services/notification-service/SYNC_CONTRACT.md | |
| services/notification-service/AI_INTEGRATION.md | |
| services/notification-service/SECURITY_MODEL.md | |
| services/notification-service/OBSERVABILITY.md | |
| services/notification-service/TESTING_STRATEGY.md | |
| services/notification-service/DEPLOYMENT_TOPOLOGY.md | |
| services/notification-service/FAILURE_MODES.md | |
| services/notification-service/LOCAL_DEV_SETUP.md | |
| services/notification-service/SERVICE_RISK_REGISTER.md | |
| services/notification-service/MIGRATION_PLAN.md | |

2. Code & implementation gates

| Gate | Required state |
|---|---|
| Tag v1.0.0 cut from main | ⏳ pending GA |
| Coverage thresholds met (lines ≥ 85, branches ≥ 80, domain ≥ 95) | |
| Zero TODO/FIXME in src/ flagged release-blocker | |
| Lint, format, typecheck clean | |
| OpenAPI ↔ implementation drift | ✅ none |
| Event-schema ↔ payload drift | ✅ none |
| Webhook contract tests for all configured vendors | |
| Migration dry-run on staging snapshot | |
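
The coverage gate above can be checked mechanically in CI. The following is a minimal sketch, assuming the coverage tool's output has already been reduced to one percentage per metric; the function name and the dictionary shape are hypothetical, and how the "domain" figure is measured is not specified in this document.

```python
# Thresholds from the coverage gate above, in percent.
THRESHOLDS = {"lines": 85.0, "branches": 80.0, "domain": 95.0}


def coverage_failures(measured: dict, thresholds: dict = THRESHOLDS) -> list:
    """Return one message per metric that is missing or below its threshold."""
    failures = []
    for metric, minimum in thresholds.items():
        value = measured.get(metric)
        if value is None:
            # A missing measurement fails the gate rather than passing silently.
            failures.append(f"{metric}: no measurement reported")
        elif value < minimum:
            failures.append(f"{metric}: {value:.1f} % < {minimum:.1f} %")
    return failures
```

An empty list means the gate is green; a CI step could print the messages and exit non-zero otherwise.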

3. Security

| Control | Required state |
|---|---|
| Threat model reviewed (SECURITY_MODEL §12) | |
| RLS enforced on every tenant-scoped table; no BYPASSRLS for app role | |
| CMEK on Cloud SQL, GCS, backups | |
| Vendor credentials only in Secret Manager (no .env references in CI/CD) | |
| HMAC verification per vendor, replay window enforced | |
| Opt-out tokens single-use, hashed-at-rest | |
| PII never in logs / events / Electron replicas | |
| AI provenance present on every AI-derived artefact | |
| HITL gate enforced for ai_drafted template publish | |
| Secret rotation documented + automated | |
| Annual third-party pentest scope confirmed | ✅ scoped; pentest scheduled for Q3 |
| OWASP top-10 covered by SAST/DAST/dependency scans | |
| GDPR Art 17 (erasure) propagation tested | |
| Data-residency routing tested with cross-region canary | |

Open security items: 0 high, 1 medium (R-NTF-04 in SERVICE_RISK_REGISTER), 3 low.
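
To illustrate the "HMAC verification per vendor, replay window enforced" control, here is a minimal sketch of one common scheme: the vendor signs `timestamp + "." + body` with HMAC-SHA256 and sends the timestamp and hex signature alongside the payload. The signing format, the 300-second window, and the function name are assumptions for illustration; each real vendor defines its own header names and signing rules, which SECURITY_MODEL would govern.

```python
import hashlib
import hmac
import time
from typing import Optional

REPLAY_WINDOW_SECONDS = 300  # assumed value; the real window is a policy decision


def verify_webhook(secret: bytes, timestamp: str, body: bytes,
                   signature_hex: str, now: Optional[float] = None) -> bool:
    """Check the replay window first, then the HMAC over timestamp and body."""
    now = time.time() if now is None else now
    try:
        sent_at = int(timestamp)
    except ValueError:
        return False  # malformed timestamp
    if abs(now - sent_at) > REPLAY_WINDOW_SECONDS:
        return False  # too old, or too far in the future
    expected = hmac.new(secret, timestamp.encode() + b"." + body,
                        hashlib.sha256).hexdigest()
    # compare_digest runs in constant time, so the mismatch position does not leak.
    return hmac.compare_digest(expected.encode(), signature_hex.encode())
```

A timestamp window only bounds how long a captured request stays valid; rejecting replays inside the window would additionally need deduplication on a vendor-supplied event ID.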


4. Reliability & SLOs

| SLI | Target | Current (last 30 d, staging) |
|---|---|---|
| Enqueue p95 | ≤ 250 ms | 178 ms ✅ |
| Dispatch p95 (transactional) | ≤ 5 s | 2.1 s ✅ |
| Dispatch p95 (operational/reminder) | ≤ 30 s | 11 s ✅ |
| Webhook ingestion success | ≥ 99.9 % | 99.97 % ✅ |
| WS feed availability | ≥ 99.5 % | 99.78 % ✅ |
| Outbox publish lag p95 | ≤ 1 s | 320 ms ✅ |
| Pub/Sub consumer lag p95 | ≤ 5 s | 1.4 s ✅ |
| Render success | ≥ 99.95 % | 99.97 % ✅ |
| API availability | ≥ 99.9 % | 99.93 % ✅ |
| Email delivered rate | ≥ 95 % | 96.4 % ✅ |
| SMS delivered rate | ≥ 92 % | 93.1 % ✅ |
| WhatsApp delivered rate | ≥ 95 % | 95.6 % ✅ |
| Push delivered rate | ≥ 90 % | 91.0 % ✅ |

Error budget burn-down dashboard: Notifications Overview; alert routing per OBSERVABILITY §7.
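
For the ratio-style SLIs (availability, success, and delivered rates), the error budget behind a target can be worked out directly from the two percentages in the table. The sketch below uses the standard definition (observed failure share divided by allowed failure share); the function name is hypothetical, and it does not apply to the latency SLIs.

```python
def budget_consumed(target_pct: float, observed_pct: float) -> float:
    """Fraction of the error budget used: observed failures / allowed failures."""
    allowed = 100.0 - target_pct
    if allowed <= 0:
        raise ValueError("a 100 % target leaves no error budget")
    return (100.0 - observed_pct) / allowed
```

With the staging figures in the table, `budget_consumed(99.9, 99.93)` comes to about 0.70 for API availability and `budget_consumed(99.95, 99.97)` to about 0.60 for render success, so both meet their targets while having used more than half of the 30-day budget.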


5. Observability

| Item | Required |
|---|---|
| OpenTelemetry traces, metrics, logs configured and exported | |
| All RED metrics emitted | |
| Domain metrics in OBSERVABILITY §3 emitted | |
| 6 dashboards provisioned in Grafana / Cloud Monitoring | |
| All alerts in OBSERVABILITY §7 configured with runbook links | |
| Synthetic monitors green for ≥ 14 days | |
| Audit signals to audit-service confirmed end-to-end | |
| AI telemetry joined and dashboarded | |
| Cost telemetry per tenant available | |

6. Operational readiness

| Item | Required |
|---|---|
| Runbooks for every alert (in FAILURE_MODES) | |
| On-call rotation defined for notification-service | ✅ team-comms |
| DR plan tested in staging (full cross-region failover) | ✅ Q2 drill passed |
| Backup/restore drilled (Cloud SQL PITR) | ✅ Q1 drill passed |
| Incident comms templates ready (status-page + tenant in-app) | |
| Capacity model validated against §7 of DEPLOYMENT_TOPOLOGY | |
| Cost guardrails active (budget alerts at 50/80/100 %) | |
| Rollback procedure tested in staging | |
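
The 50/80/100 % cost guardrail reduces to a simple threshold check. In practice these alerts would normally be configured in the cloud provider's budgeting tooling rather than in service code, so the sketch below only illustrates the rule; the function name and inputs are hypothetical.

```python
ALERT_THRESHOLDS_PCT = (50, 80, 100)  # from the cost-guardrail item above


def crossed_thresholds(spend: float, budget: float) -> list:
    """Return every alert threshold (in percent of budget) that spend has reached."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    used_pct = spend / budget * 100
    return [t for t in ALERT_THRESHOLDS_PCT if used_pct >= t]
```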

7. Compliance

| Item | Required |
|---|---|
| Data Processing Agreement coverage with each vendor | ✅ SendGrid, Twilio, Meta WhatsApp, Infobip, FCM |
| GDPR Art 7 (consent for marketing) records visible per tenant | |
| GDPR Art 17 (erasure) wired to iam.user.deleted.v1 | |
| WhatsApp Business policy compliance (template approval enforced) | |
| Local SMS regulator (PK PTA, AF, IR) registered sender ID enforcement | |
| Tenant data-residency boundaries enforced | ✅ tested |
| Audit log retention 7 y for regulated subjects | |

8. Cross-team dependencies

| Upstream | Status |
|---|---|
| iam-service (JWT issuance + JWKS) | |
| tenant-service (tenant projection + theme + policy + DKIM verification workflow) | |
| reservation-service (events + projection client) | |
| billing-service (invoice attachment client + payment events) | |
| lock-integration-service (mobile-key token client) | |
| ai-orchestrator-service (HITL workflow + capability tools) | |
| audit-service (audit sink) | |
| bff-backoffice-service (UI integration) | |
| bff-tenant-booking-service (guest feed + opt-out UI) | |
| sync-service (Electron pull/push wiring) | |

| Downstream | Status |
|---|---|
| All 11 consumers of our published events have integration tests | |
| Pact contracts established | |

9. Tenants & rollout state

| Tenant tier | Status |
|---|---|
| Synthetic test tenants | ✅ in dev/staging/prod |
| Design-partner tenants (3) | ✅ live in asia-south1 since 2026-03-15 |
| Early-access tenants (~20) | 🔶 onboarding through 2026-05 |
| General availability | ⏳ targeted 2026-Q3 (after voice phase 3 cut) |
| ME-residency tenants | ⏳ pending me-central1 regional cutover (2026-Q3) |

10. Phased feature rollout

| Phase | Scope | State |
|---|---|---|
| Phase 1 | Email + SMS + in-app + WhatsApp transactional flows; templates; preferences; suppression; webhook ingestion; trigger map; outbox/CDC; basic backoffice | |
| Phase 2 | Push (web + mobile); marketing batches; AI personalisation (suggest_only); AI translation HITL; cost analytics; me-central1 regional rollout | 🔶 in progress |
| Phase 3 | Voice/IVR; SMS short codes per region; tenant-self-serve template marketplace; auto_send AI for marketing/reminder | ⏳ planned |

Feature flags for phase-2 controls are listed in DEPLOYMENT_TOPOLOGY §6.


11. Sign-offs

Required signatures before each major rollout:

  • Service tech lead
  • Platform architect
  • Security lead
  • SRE on-call lead
  • DPO/Compliance lead
  • Tenant ops lead (for tenant-onboarding cohort)
  • CTO (for GA rollout)

12. What "GA" means here

For us to declare GA, all of the following must be true for 30 consecutive days:

  1. SLOs in §4 met or exceeded across all enabled regions and tenants.
  2. Zero P1 incidents attributable to our service.
  3. ≤ 1 P2 incident per week with no recurring root cause.
  4. AI HITL queue cleared within tenant-policy TTL on ≥ 95 % of items.
  5. Webhook ingestion HMAC failure rate < 0.01 %.
  6. Cost per tenant within 10 % of budget model.
  7. All audit findings tracked in SERVICE_RISK_REGISTER are green or accepted-with-mitigation.

Until then, we operate as Beta with explicit tenant onboarding gates and a daily on-call standup focused on this service.
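
The GA criteria in this section can be evaluated over a daily history. The sketch below is one possible reading, not the team's actual tooling: `DayRecord` and its fields are hypothetical, "per week" is interpreted as a rolling 7-day window, "within 10 %" is treated as symmetric around the budget model using a single worst-case ratio, and the "no recurring root cause" clause is left to human review.

```python
from dataclasses import dataclass

WINDOW_DAYS = 30  # criteria must hold for 30 consecutive days


@dataclass
class DayRecord:
    slos_met: bool            # criterion 1: all §4 SLOs met that day
    p1_incidents: int         # criterion 2
    p2_incidents: int         # criterion 3 (summed over a rolling 7 days)
    hitl_within_ttl: float    # criterion 4: fraction cleared within TTL, 0..1
    hmac_failure_rate: float  # criterion 5: fraction of webhooks, 0..1
    cost_vs_budget: float     # criterion 6: actual cost / modelled cost
    open_findings: int        # criterion 7: findings neither green nor accepted


def day_passes(history: list, i: int) -> bool:
    """True if day i satisfies every criterion, given the days before it."""
    d = history[i]
    p2_last_week = sum(r.p2_incidents for r in history[max(0, i - 6): i + 1])
    return (d.slos_met
            and d.p1_incidents == 0
            and p2_last_week <= 1
            and d.hitl_within_ttl >= 0.95
            and d.hmac_failure_rate < 0.0001   # 0.01 %
            and 0.90 <= d.cost_vs_budget <= 1.10
            and d.open_findings == 0)


def ga_ready(history: list) -> bool:
    """True if the most recent WINDOW_DAYS days all pass."""
    if len(history) < WINDOW_DAYS:
        return False
    start = len(history) - WINDOW_DAYS
    return all(day_passes(history, i) for i in range(start, len(history)))
```

A single failing day inside the window resets the clock, which matches the "30 consecutive days" wording; a failing day older than the window no longer counts.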