DEPLOYMENT_TOPOLOGY — payment-gateway-service

Sibling: OBSERVABILITY · SECURITY_MODEL · LOCAL_DEV_SETUP · SERVICE_READINESS

This service runs on Google Cloud Platform (per platform standard; Electron is the only desktop technology, never Tauri). Production workloads run on GKE Autopilot; data lives in Cloud SQL Postgres 16; events flow through Pub/Sub; secrets in Secret Manager; key material in Cloud KMS.

1. Environments

Env	GCP project	Purpose	Vendor mode
`dev`	`gm-payments-dev`	Local-merge integration	All processors in sandbox; `hesabpay-mock`
`staging`	`gm-payments-staging`	Pre-prod validation, contract tests	Stripe/PayPal sandbox; HesabPay sandbox
`prod`	`gm-payments-prod`	Live	Stripe/PayPal live; HesabPay live (per-region per-tenant)

Per-tenant Cloud SQL schemas exist in each environment; tenants in dev/staging are synthetic.

2. Cluster topology

Regional GKE Autopilot clusters in each environment (us-central1 primary; europe-west1 warm standby for EU tenants).
Two deployments:
1. payment-api — REST + webhook receivers, HPA on RPS + p95 latency, min 3 pods, max 60.
2. payment-worker — webhook dispatcher, reconciliation jobs, outbox flush, HPA on Pub/Sub backlog, min 2 pods, max 30.
A scheduled CronJob payments-reconciliation-daily runs at 02:00 UTC per tenant per processor.
A scheduled CronJob payments-idempotency-gc runs hourly, dropping expired idempotency keys.
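
As a rough illustration of how the HPA bounds above interact with load, the sketch below applies the standard Kubernetes HPA rule (desired = ceil(current replicas × current metric / target metric)) and clamps the result to the configured min and max. The per-pod RPS target used in the usage lines is a hypothetical value, not a documented setting, and the real controller also applies a tolerance band and stabilization windows that are omitted here.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, min_pods: int, max_pods: int) -> int:
    """Core HPA rule: ceil(current * metric / target), clamped to [min, max]."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_pods, min(max_pods, raw))


# Bounds taken from the deployment list above.
API_MIN, API_MAX = 3, 60        # payment-api
WORKER_MIN, WORKER_MAX = 2, 30  # payment-worker

# Hypothetical target: 100 RPS per pod (an assumption for illustration only).
print(desired_replicas(10, 250.0, 100.0, API_MIN, API_MAX))
```

When several metrics are configured (here RPS and p95 latency), the HPA evaluates each one separately and uses the largest resulting replica count.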

3. Network

All inbound API traffic enters via the platform gateway (api.melmastoon.ghasi.io) → Cloud Armor → Internal HTTPS LB → payment-api.
Webhook traffic enters via a separate subdomain webhooks.payments.melmastoon.ghasi.io → Cloud Armor (with vendor IP allowlists per SECURITY_MODEL §4.3) → payment-api webhook receivers.
Egress to vendors (Stripe, PayPal, HesabPay) goes through a Cloud NAT with a fixed public IP, allowlisted on each vendor's side where they support it.
Egress to the FX provider goes through the same NAT.
Service-to-service traffic (Pub/Sub, Cloud SQL, Secret Manager, KMS) goes over Private Service Connect, with no public IPs.

                                   ┌────────────────────────────┐
                                   │ api.melmastoon.ghasi.io    │
   guests / staff / s2s ─────────▶│   Cloud Armor + LB         │──▶ payment-api ──┬─▶ Cloud SQL (Postgres 16, schema-per-tenant)
                                   └────────────────────────────┘                  │
                                                                                   ├─▶ Pub/Sub (events out + saga in)
                                   ┌─────────────────────────────────┐             │
   stripe / paypal / hesabpay ──▶│ webhooks.payments.melmastoon.…  │──▶ payment-api│
                                   │   Cloud Armor + WAF + IP allow │              │
                                   └─────────────────────────────────┘             │
                                                                                   ├─▶ Secret Manager (vendor creds)
   payment-worker ◀── Pub/Sub backlog (HPA) ── outbox / webhook inbox flush        │
                                                                                   └─▶ Cloud KMS (CMEK envelope decrypt)

   Cloud NAT (fixed egress IP) ──▶ Stripe / PayPal / HesabPay / FX
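
The per-vendor IP allowlist on the webhook path is enforced by Cloud Armor, not by application code. The sketch below only illustrates the matching rule itself, using Python's standard `ipaddress` module. The CIDR ranges are RFC 5737 documentation ranges chosen as placeholders, not real vendor addresses, and `VENDOR_ALLOWLIST` and `is_allowed` are hypothetical names.

```python
import ipaddress

# Placeholder ranges (RFC 5737 documentation blocks), NOT real vendor IPs.
VENDOR_ALLOWLIST = {
    "stripe": ["192.0.2.0/24"],
    "paypal": ["198.51.100.0/24"],
    "hesabpay": ["203.0.113.16/28"],
}


def is_allowed(vendor: str, source_ip: str) -> bool:
    """Return True if source_ip falls inside any range allowlisted for vendor."""
    addr = ipaddress.ip_address(source_ip)
    networks = VENDOR_ALLOWLIST.get(vendor, [])  # unknown vendor: nothing allowed
    return any(addr in ipaddress.ip_network(cidr) for cidr in networks)
```

An unknown vendor, or a source address outside every listed range, is rejected by default.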

4. Data plane

Cloud SQL Postgres 16, regional HA (primary + standby), private IP only.
Read replica for analytics queries (no app traffic).
Backups: continuous PITR (35-day retention), daily snapshots (90-day retention), monthly archive snapshot to Coldline GCS bucket (7-year retention).
Maintenance window: Sundays 03:00–05:00 UTC.
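
To make the three retention tiers concrete, the following sketch reports which backup tiers could still hold a restore point of a given age. It assumes inclusive boundaries and approximates 7 years as 7 × 365 days; both are assumptions, and the function name is hypothetical. The monthly archive only contains monthly points, so a hit on that tier does not guarantee an exact point in time.

```python
# Retention periods from the backup policy above, in days.
PITR_DAYS = 35
DAILY_SNAPSHOT_DAYS = 90
MONTHLY_ARCHIVE_DAYS = 7 * 365  # assumption: 7 years approximated as 2555 days


def recovery_sources(age_days: float) -> list[str]:
    """List the backup tiers whose retention still covers data of this age."""
    sources = []
    if age_days <= PITR_DAYS:
        sources.append("pitr")
    if age_days <= DAILY_SNAPSHOT_DAYS:
        sources.append("daily_snapshot")
    if age_days <= MONTHLY_ARCHIVE_DAYS:
        sources.append("monthly_archive")
    return sources
```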

5. Configuration & secrets

App config via Kubernetes ConfigMaps for non-secret env (FX provider id, default circuit-breaker thresholds, feature flags).
Secrets via External Secrets Operator pulling from Secret Manager. Pods receive secrets as env vars or mounted files; vendor private keys are always mounted as files, never exposed as env vars.
Workload Identity binds the payment-api and payment-worker GSAs to namespace-scoped permissions (secretmanager.secretAccessor, cloudkms.cryptoKeyDecrypter, cloudsql.client, pubsub.publisher, pubsub.subscriber).
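
One way an application could uphold the "mounted, never env var" rule for vendor private keys is sketched below. The mount directory, file name, and environment-variable naming scheme are all hypothetical; the actual paths are defined by the External Secrets and pod specs, which this document does not list.

```python
import os
from pathlib import Path

# Hypothetical mount point; the real path comes from the pod spec.
DEFAULT_MOUNT_DIR = "/etc/payment-gateway/vendor-keys"


def load_vendor_private_key(vendor: str, mount_dir: str = DEFAULT_MOUNT_DIR) -> bytes:
    """Read a vendor private key from its mounted file.

    Refuses to start if the key also appears as an environment variable,
    since env vars can leak through process listings and crash dumps.
    """
    env_name = f"{vendor.upper()}_PRIVATE_KEY"  # hypothetical naming scheme
    if env_name in os.environ:
        raise RuntimeError(f"{env_name} must not be set; keys are file-mounted only")
    return (Path(mount_dir) / vendor / "private_key.pem").read_bytes()
```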

6. Release pipeline

CI on PR: lint + unit + integration + contract + PCI scan + OpenAPI diff + event-schema diff.
Merge to main: build container image (Wolfi-based distroless), sign with Cosign, push to Artifact Registry, deploy to dev.
Promotion to staging: manual approval; runs e2e:sandbox suite in staging post-deploy.
Promotion to prod: manual approval (2-engineer); progressive rollout with canary at 5% for 30 min, then 25% for 30 min, then 100%; auto-rollback on SLO burn or error-rate spike.
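
The prod rollout can be read as a small state machine: hold at each traffic level, check health, and either advance or roll back. The sketch below models that flow under stated assumptions. `is_healthy` is a hypothetical hook standing in for the SLO-burn and error-rate checks, and the real pipeline tooling is not specified in this document.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Stage:
    traffic_percent: int
    hold_minutes: int


# Canary schedule from the prod promotion step above.
STAGES = (Stage(5, 30), Stage(25, 30), Stage(100, 0))


def run_rollout(is_healthy: Callable[[Stage], bool]) -> tuple[str, int]:
    """Walk the stages in order.

    is_healthy(stage) is a hypothetical callback that would block for
    stage.hold_minutes and report whether SLO burn and error rate stayed
    within limits. Returns (outcome, traffic percent reached).
    """
    reached = 0
    for stage in STAGES:
        reached = stage.traffic_percent
        if not is_healthy(stage):
            return ("rolled_back", reached)
    return ("promoted", reached)
```

With this schedule, a healthy release reaches 100% after 60 minutes of canary hold time.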

7. Observability stack

Metrics: OpenTelemetry → Cloud Operations.
Traces: OpenTelemetry → Cloud Trace.
Logs: stdout JSON → Cloud Logging.
Alerts: routed via notification-service to PagerDuty.
Dashboards: Cloud Monitoring workspace payment-gateway-service (see OBSERVABILITY).

8. Disaster recovery

Scenario	RPO	RTO	Strategy
Cluster failure	0	15 m	Failover to warm standby region (DNS swap)
DB primary failure	0 (HA standby in same region)	5 m	Cloud SQL automatic failover
Region-wide outage	≤ 5 m (PITR)	60 m	Restore PITR to standby region; webhook ingestion paused via 503 (vendors retry); events re-issued on recovery
Pub/Sub outage	n/a	Duration of outage	Outbox holds events; flushing resumes when Pub/Sub returns
Secret Manager outage	Covered by 5 m read cache	Duration of outage	Cached secrets serve in-flight requests; new tenants paused

Quarterly DR drills include a region failover and a full restore test.
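
The "outbox holds" behaviour in the Pub/Sub outage row can be sketched as follows. This is a minimal in-memory model for illustration: the real outbox is presumably a database table, and `Outbox`, `flush`, and the use of `ConnectionError` to signal an unavailable broker are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Outbox:
    pending: list[dict] = field(default_factory=list)

    def add(self, event: dict) -> None:
        """Record an event to be published later."""
        self.pending.append(event)

    def flush(self, publish: Callable[[dict], None]) -> int:
        """Publish pending events in order; stop at the first failure.

        Events that were not published stay in the outbox, so nothing is
        lost while the broker is down. Returns the number of events sent.
        """
        sent = 0
        while self.pending:
            try:
                publish(self.pending[0])
            except ConnectionError:
                break  # broker unavailable: keep remaining events for the next flush
            self.pending.pop(0)
            sent += 1
        return sent
```

Because an event is removed only after a successful publish, a crash between publish and removal can re-send it, so consumers should treat delivery as at-least-once.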

9. Tenant onboarding & offboarding flow (deploy-side)

Onboarding: tenant-service calls payments-admin-cli provision-tenant <id>, which creates the schema, seeds vendor pointers, and runs initial migrations.
Offboarding: 30-day soft-delete window; after that, the schema is dropped, secrets are destroyed, and the archived snapshot is retained per legal hold rules (SECURITY_MODEL §11).
