FAILURE_MODES — staff-service

Sibling: OBSERVABILITY · DEPLOYMENT_TOPOLOGY · SECURITY_MODEL · SYNC_CONTRACT

Strategic anchors: 02 §12 Resilience

A practical catalog of how staff-service can fail, which surfaces degrade, and how on-call recovers. Each entry lists symptoms, detection (alert / SLI), blast radius, automatic mitigation, manual mitigation, the canonical runbook reference, and data risk.


1. Postgres Unavailable (Cloud SQL primary failover)

Aspect | Detail
Symptoms | /health/ready returns 503; punch latency p95 spikes; outbox depth grows
Detection | staff-service Cloud Run health-check failures; alert "Punch error rate > 0.5 %"
Blast radius | All writes blocked; reads degrade to cache where possible (capacity snapshot stays warm for 30 s)
Auto mitigation | Cloud SQL HA failover (~ 60–120 s); circuit breakers in the API short-circuit to 503 with Retry-After: 5
Manual | Confirm failover in the Cloud SQL console; if PITR is needed, follow the runbook
Runbook | runbooks/staff/cloud-sql-failover.md
Data risk | No loss: punches in flight at failover are retried by clients (Electron) with the same Idempotency-Key
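
The short-circuit behavior in the auto-mitigation row can be sketched as a small circuit breaker. This is a minimal illustration, not the service's actual implementation: the class name, the failure threshold, the cool-down length, and the injected clock are all assumptions. Only the 503 status and the Retry-After: 5 header come from the table.

```typescript
// Hypothetical sketch: names, threshold, and cool-down are assumptions.
interface HttpResult {
  status: number;
  headers: Record<string, string>;
}

// The only values taken from the catalog: 503 with Retry-After: 5.
function unavailable(): HttpResult {
  return { status: 503, headers: { "Retry-After": "5" } };
}

class DbCircuitBreaker {
  private isOpen = false;
  private consecutiveFailures = 0;
  private openedAtMs = 0;
  private readonly failureThreshold: number;
  private readonly openForMs: number;

  constructor(failureThreshold: number, openForMs: number) {
    this.failureThreshold = failureThreshold;
    this.openForMs = openForMs;
  }

  // Runs `op` unless the breaker is open. Any thrown error counts as a DB failure.
  call(op: () => HttpResult, nowMs: number): HttpResult {
    if (this.isOpen) {
      if (nowMs - this.openedAtMs < this.openForMs) {
        return unavailable(); // short-circuit without touching the database
      }
      // Cool-down elapsed: allow one probe; a single failure re-opens the breaker.
      this.isOpen = false;
      this.consecutiveFailures = this.failureThreshold - 1;
    }
    try {
      const result = op();
      this.consecutiveFailures = 0;
      return result;
    } catch {
      this.consecutiveFailures += 1;
      if (this.consecutiveFailures >= this.failureThreshold) {
        this.isOpen = true;
        this.openedAtMs = nowMs;
      }
      return unavailable();
    }
  }
}
```

Passing the clock in as `nowMs` keeps the sketch deterministic. A real handler would more likely read a monotonic clock and wrap an asynchronous database call.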

2. Pub/Sub Publisher Stalled (Outbox Backed Up)

Aspect | Detail
Symptoms | Outbox depth > 1000 sustained; downstream consumers (housekeeping, maintenance) miss capacity events
Detection | Alert "Outbox depth > 1000 sustained 5 min" (P1)
Blast radius | Eventual consistency degraded (assignments still work via REST); no data loss
Auto mitigation | Worker retries with exponential backoff; scale-up rule on outbox depth
Manual | Inspect Pub/Sub for quota / IAM errors; restart staff-worker; if the topic is misconfigured, fix it and replay
Runbook | runbooks/staff/outbox-stalled.md
Data risk | None: the outbox is durable; events publish in order once Pub/Sub recovers
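
The in-order guarantee in the data-risk row can be illustrated with a drain loop that stops at the first failed publish. This is a sketch under assumptions: the `OutboxRow` shape, the `drainOutbox` name, and the boolean `publish` callback are hypothetical stand-ins for the real worker and Pub/Sub client.

```typescript
// Hypothetical row shape; the real outbox schema is not shown in this document.
interface OutboxRow {
  id: number;
  topic: string;
  payload: string;
}

interface DrainResult {
  published: number[];
  remaining: OutboxRow[];
}

// Publishes pending rows oldest-first and stops at the first failure,
// so a later event can never overtake an earlier one.
function drainOutbox(
  pending: OutboxRow[],
  publish: (row: OutboxRow) => boolean, // assumed: true when the broker acknowledged
): DrainResult {
  const ordered = pending.slice().sort((a, b) => a.id - b.id);
  const published: number[] = [];
  let index = 0;
  for (const row of ordered) {
    if (!publish(row)) {
      // Leave this row and everything after it for the next (backed-off) attempt.
      return { published, remaining: ordered.slice(index) };
    }
    published.push(row.id);
    index += 1;
  }
  return { published, remaining: [] };
}
```

Rows left in `remaining` are untouched, so a retry after Pub/Sub recovers resumes from the same position.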

3. Inbox Consumer Stalled / DLQ Growing

Aspect | Detail
Symptoms | Inbox lag p99 > 5 min; DLQ depth > 0
Detection | Alert "Inbox lag p99 > 5 min for 5 min" (P1); alert "Inbox DLQ depth > 0" (P2)
Blast radius | Consumed events (IAM user-registered linkage, tenant membership cascade) are delayed; the delay is not immediately visible to operators
Auto mitigation | Consumer acks on success; failures retry with exponential backoff before going to the DLQ
Manual | Read DLQ payloads via gcloud pubsub; on schema mismatch, hot-fix the consumer or reject upstream; replay from the DLQ via the admin endpoint
Runbook | runbooks/staff/inbox-dlq.md
Data risk | Stale linkage; cascade actions can be delayed by minutes

4. KMS Unavailable (PIN Verification Path)

Aspect | Detail
Symptoms | PIN punch returns 503; manager-override punches succeed (no KMS call)
Detection | KMS error_count metric; readiness probe fails after 30 s
Blast radius | PIN clock-ins fail once the per-instance pepper cache expires and until KMS recovers; JWT punches via Electron continue
Auto mitigation | Per-instance pepper cache (5 min TTL); PIN verification keeps working while the cache is valid
Manual | For a region-wide KMS outage, document it with property GMs and ensure manager-override is used; do NOT cache the pepper for > 5 min
Runbook | runbooks/staff/kms-outage.md
Data risk | None directly; operational impact only
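
The interplay between the 5 min pepper cache and a KMS outage can be sketched as follows. Everything except the 5 min TTL is an assumption: `PepperCache`, `resolvePepper`, and the synchronous `fetchFromKms` callback are hypothetical names, and a real KMS client would be asynchronous.

```typescript
const PEPPER_TTL_MS = 5 * 60 * 1000; // 5 min TTL, as stated in the table

// Hypothetical per-instance cache; holds at most one pepper value.
class PepperCache {
  private value: string | undefined = undefined;
  private fetchedAtMs = 0;

  // Returns the pepper while it is younger than the TTL, otherwise undefined.
  get(nowMs: number): string | undefined {
    if (this.value !== undefined && nowMs - this.fetchedAtMs < PEPPER_TTL_MS) {
      return this.value;
    }
    this.value = undefined; // never serve a pepper past its TTL
    return undefined;
  }

  put(pepper: string, nowMs: number): void {
    this.value = pepper;
    this.fetchedAtMs = nowMs;
  }
}

// Uses the cache when valid; otherwise asks KMS. If KMS throws, the error
// propagates and the caller is expected to answer the PIN punch with 503.
function resolvePepper(
  cache: PepperCache,
  fetchFromKms: () => string, // assumed stand-in for the real KMS call
  nowMs: number,
): string {
  const cached = cache.get(nowMs);
  if (cached !== undefined) {
    return cached;
  }
  const fresh = fetchFromKms();
  cache.put(fresh, nowMs);
  return fresh;
}
```

With this shape, an instance that fetched the pepper shortly before the outage keeps verifying PINs for at most five minutes, which matches the "do NOT cache the pepper for > 5 min" constraint.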

5. Memorystore Redis Unavailable

Aspect | Detail
Symptoms | Capacity GET latency p95 climbs; PIN attempt counter falls back to DB
Detection | redis.connection_failures metric; readiness probe partial-fail
Blast radius | Capacity reads slower (~ 250 ms vs 80 ms); PIN brute-force protection degraded to a per-staff DB increment
Auto mitigation | Service degrades gracefully: reads bypass the cache; the PIN attempt counter writes to staff.staff.clock_in_pin_failed directly
Manual | Restart the Memorystore replica and await failover; warm the cache via staff.cron.warm_capacity
Runbook | runbooks/staff/redis-outage.md
Data risk | None
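
The counter fallback in the auto-mitigation row might look roughly like this. The `AttemptStore` interface and `recordFailedPin` function are hypothetical. The sketch assumes the Redis client throws on connection failure and that the DB-backed store increments staff.staff.clock_in_pin_failed.

```typescript
// Hypothetical abstraction over "something that counts failed PIN attempts".
interface AttemptStore {
  increment(staffId: string): number; // returns the new attempt count
}

interface AttemptRecord {
  attempts: number;
  source: "redis" | "db";
}

// Tries Redis first; any error falls back to the DB-backed counter so that
// brute-force protection keeps working, only more slowly.
function recordFailedPin(
  staffId: string,
  redis: AttemptStore,
  db: AttemptStore,
): AttemptRecord {
  try {
    return { attempts: redis.increment(staffId), source: "redis" };
  } catch {
    return { attempts: db.increment(staffId), source: "db" };
  }
}
```

Returning the `source` makes it easy to emit a metric whenever the fallback path is taken. How counts are reconciled once Redis returns is not covered by this sketch.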

6. IAM Revoke Cascade Failure on Termination

Aspect | Detail
Symptoms | iam.session.revoke.failure_count alert (P1)
Detection | After exponential backoff is exhausted (12 attempts, ~ 1 h), audit row iam.session.revoke.failed is written
Blast radius | Security risk: terminated staff retain an active IAM session until token expiry (≤ 1 h for access; up to 30 d for refresh)
Auto mitigation | Retries; staff.terminated.v1 event is still published (subscribers act on staff state)
Manual | Manually call the iam-service admin revoke endpoint; force a tenant-side session sweep
Runbook | runbooks/staff/iam-revoke-failed.md
Data risk | Operational/security; the staff record remains terminated regardless
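
To see how 12 attempts can add up to roughly an hour, here is one possible backoff schedule. The base delay (2 s) and the doubling factor are assumptions chosen for illustration. The document only states the attempt count and the approximate total.

```typescript
// Delays between consecutive attempts: `attempts` tries have `attempts - 1` waits.
function backoffDelaysSeconds(
  attempts: number,
  baseSeconds: number,
  factor: number,
): number[] {
  const delays: number[] = [];
  for (let i = 0; i < attempts - 1; i++) {
    delays.push(baseSeconds * Math.pow(factor, i));
  }
  return delays;
}

// Assumed parameters: 2 s base, doubling each time.
const revokeDelays = backoffDelaysSeconds(12, 2, 2);
const revokeTotalSeconds = revokeDelays.reduce((sum, d) => sum + d, 0);
// 2 + 4 + ... + 2048 = 4094 s, a little over 68 min
```

Under these assumed parameters, the window before iam.session.revoke.failed is written lasts a little over an hour. A production schedule would typically add jitter and possibly a cap.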

7. Multi-Device Punch Collision

Aspect | Detail
Symptoms | Audit event MELMASTOON.STAFF.MULTI_DEVICE_PUNCH_DETECTED; capacity snapshot inconsistency
Detection | Counter multi_device_punch_detected > 0 (P2)
Blast radius | A staff member appears punched in on two devices; an auto-out is emitted for the second
Auto mitigation | Per SYNC_CONTRACT §6.3
Manual | GM reviews via backoffice and corrects via manager-override if needed
Runbook | runbooks/staff/multi-device-punch.md
Data risk | None (audit trail intact); operator confusion

8. Termination of Currently-Clocked-In Staff

Aspect | Detail
Symptoms | Normal flow
Detection | clock.system_auto event; staff.terminated.v1 cascade
Blast radius | One auto clock-out and one terminated event in close succession
Auto mitigation | TerminateStaff use case auto-closes the open punch (source system_auto) before emitting staff.terminated.v1
Manual | None
Runbook | n/a (designed behavior)
Data risk | None
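
The ordering guarantee in the auto-mitigation row (close the open punch first, then emit the terminated event) can be sketched as a pure function. The `StaffState` and `DomainEvent` shapes are hypothetical. Only the event names and the ordering come from the table.

```typescript
// Hypothetical state and event shapes for illustration only.
interface StaffState {
  staffId: string;
  openPunchId: string | null;
  status: "active" | "terminated";
}

type DomainEvent =
  | { type: "clock.system_auto"; staffId: string; punchId: string }
  | { type: "staff.terminated.v1"; staffId: string };

interface TerminationResult {
  staff: StaffState;
  events: DomainEvent[];
}

// Emits the auto clock-out (if any) strictly before the terminated event,
// so subscribers never see a terminated staff member with an open punch.
function terminateStaff(staff: StaffState): TerminationResult {
  const events: DomainEvent[] = [];
  if (staff.openPunchId !== null) {
    events.push({
      type: "clock.system_auto",
      staffId: staff.staffId,
      punchId: staff.openPunchId,
    });
  }
  events.push({ type: "staff.terminated.v1", staffId: staff.staffId });
  return {
    staff: { staffId: staff.staffId, openPunchId: null, status: "terminated" },
    events,
  };
}
```

Because both events are produced in one call, they can be written to the outbox in a single transaction, which is one way to get the "close succession" described above.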

9. Property Deactivated With Active Shifts

Aspect | Detail
Symptoms | property.deactivated.v1 consumed; future scheduled shifts cancelled
Detection | Inbox event; cascade emits one staff.shift.cancelled.v1 per shift
Blast radius | Future shifts at the property cancelled with cancelReason='property_deactivated'; active staff at the property NOT auto-terminated (operator decides)
Auto mitigation | Per APPLICATION_LOGIC §4.4
Manual | Tenant admin reviews staff list and decides on transfer or termination
Runbook | runbooks/staff/property-deactivation.md
Data risk | None

10. Schedule Generation Idempotency Drift

Aspect | Detail
Symptoms | GenerateShifts attempts to create duplicate shifts on retry
Detection | EXCLUDE constraint on (tenant_id, pattern_id, local_date) rejects duplicates with SQLSTATE 23P01
Blast radius | A single use-case retry returns 409 OCC_CONFLICT; no data corruption
Auto mitigation | Idempotency-Key 24 h dedupe + DB constraint
Manual | Operator retries with a new key
Runbook | n/a
Data risk | None
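
The two layers in the auto-mitigation row (Idempotency-Key dedupe, then the DB constraint) can be sketched as two small helpers. The function names, the `ApiError` shape, and the fallback `INTERNAL` code are assumptions. SQLSTATE 23P01 is PostgreSQL's exclusion_violation, and the 409 OCC_CONFLICT mapping comes from the table.

```typescript
interface ApiError {
  status: number;
  code: string;
}

const EXCLUSION_VIOLATION = "23P01"; // PostgreSQL SQLSTATE for exclusion_violation
const DEDUPE_WINDOW_MS = 24 * 60 * 60 * 1000; // 24 h, as stated in the table

// First layer: a key seen within the last 24 h is treated as a replay.
function isReplay(firstSeenAtMs: number | undefined, nowMs: number): boolean {
  return firstSeenAtMs !== undefined && nowMs - firstSeenAtMs < DEDUPE_WINDOW_MS;
}

// Second layer: if a duplicate still reaches the database, the EXCLUDE
// constraint rejects it and the error is surfaced as a conflict, not a 500.
function mapDbError(err: unknown): ApiError {
  if (
    typeof err === "object" &&
    err !== null &&
    "code" in err &&
    (err as { code: unknown }).code === EXCLUSION_VIOLATION
  ) {
    return { status: 409, code: "OCC_CONFLICT" };
  }
  return { status: 500, code: "INTERNAL" }; // hypothetical catch-all code
}
```

The `code` property check mirrors how common Node PostgreSQL drivers expose SQLSTATE on thrown errors. Adjust it to whatever error shape the actual data layer produces.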

11. Edge Anomaly Model Drift / FPR Spike

Aspect | Detail
Symptoms | Manager overrides spike on Electron front-desk; staff complain about flagging
Detection | Weekly FPR > 12 % alert (P3); manager-override count per actor > 5/day (P3)
Blast radius | Operator friction; potential PIN clock-in latency from the extra confirmation step
Auto mitigation | None
Manual | Pin the Electron auto-update channel to the last-known-good model; coordinate with ai-orchestrator to retrain
Runbook | runbooks/staff/edge-anomaly-drift.md
Data risk | None

12. AI Orchestrator Unavailable

Aspect | Detail
Symptoms | GET /forecast/staffing-gaps returns degraded=true; POST /ai/draft-staff-tags returns 503
Detection | Outbound HTTP 5xx counter; circuit breaker trip
Blast radius | AI surfaces degraded; baseline workflows continue
Auto mitigation | Per AI_INTEGRATION §3, §6
Manual | None for short outages; if extended, notify GMs that AI suggestions are paused
Runbook | runbooks/staff/ai-orchestrator-down.md
Data risk | None

13. Sync Service Unable to Push (Electron Offline > 24 h)

Aspect | Detail
Symptoms | Device queue grows; on push, the server flags late-replay punches
Detection | staff.sync.reconcile job alerts on devices > 7 d behind
Blast radius | Data exists only on the device; operators rely on manual records
Auto mitigation | Sync push is accepted with flagged_late_replay, per SYNC_CONTRACT §6.2
Manual | Field tech checks device connectivity; tenant admin is notified if the problem persists
Runbook | runbooks/staff/device-sync-stale.md
Data risk | Time skew on punches; operator reviews flagged entries

14. Migration Failure in Production

Aspect | Detail
Symptoms | Deploy pipeline aborts; staff-api rollout halted
Detection | Cloud Deploy alert (P1)
Blast radius | New version not rolled out; current version unaffected
Auto mitigation | None (pipeline gate)
Manual | Per the runbook: investigate the Flyway error, fix the migration, redeploy. Forward-fix only: do NOT roll back the schema
Runbook | runbooks/staff/migration-failure.md
Data risk | None if the pipeline stopped before the traffic shift

15. Cross-Region Replica Lag (M2)

Aspect | Detail
Symptoms | Reads in the secondary region serve stale capacity; writes remain primary-only
Detection | Cloud SQL replica lag metric > 30 s
Blast radius | Read users in the secondary region see a slightly stale schedule
Auto mitigation | Read traffic falls back to the primary if lag > threshold (Cloud Run header-based routing)
Manual | Investigate the primary write rate; consider scaling up the replica
Runbook | runbooks/staff/replica-lag.md
Data risk | Operational; no data loss
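
The fallback decision in the auto-mitigation row reduces to a threshold check. The sketch below shows only that decision, not the Cloud Run header-based routing itself. It assumes the routing threshold equals the 30 s alert threshold, which the document does not state explicitly, and it treats an unknown lag as unsafe.

```typescript
type ReadTarget = "replica" | "primary";

// Assumption: the routing threshold matches the 30 s alert threshold.
const LAG_THRESHOLD_SECONDS = 30;

// Picks where a read in the secondary region should go.
function chooseReadTarget(replicaLagSeconds: number | null): ReadTarget {
  if (replicaLagSeconds === null) {
    return "primary"; // lag unknown: be conservative (assumption)
  }
  return replicaLagSeconds > LAG_THRESHOLD_SECONDS ? "primary" : "replica";
}
```

For example, `chooseReadTarget(12)` keeps the read on the replica, while `chooseReadTarget(45)` sends it to the primary.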

16. Audit Append-Only Trigger Bypass Attempt

Aspect | Detail
Symptoms | DB error 0L000 from a use case that issues UPDATE/DELETE against clock_entries / handoff_notes / audit_events
Detection | Application logs at ERROR; CI integration test append-only-trigger.spec.ts
Blast radius | Use case fails fast with 500; no data corruption
Auto mitigation | DB trigger raises
Manual | Code review; the only legitimate UPDATE/DELETE on these tables is via a migration with app.role='platform_admin'
Runbook | n/a
Data risk | None

17. Recovery Time Objectives

Failure type | RTO | RPO
Cloud SQL HA failover | 2 min | 0
Pub/Sub regional outage | 5 min | 0 (durable outbox)
KMS regional outage | 10 min | 0
Region failover (M2) | 15 min | ≤ 30 s
Backup restore (PITR) | 1 h | 7 d max

18. Postmortem Trigger

A postmortem is required for:

  • Any P1 alert
  • Data loss in clock_entries, audit_events, staff (any row)
  • Cross-tenant data leak (RLS bypass)
  • IAM revoke cascade failure that leaves a session active for > 30 min after termination
  • DSAR overdue
  • Staffing gap that left a property unstaffed for > 30 min and was caused by a service-side bug

Postmortems live in runbooks/staff/postmortems/YYYY-MM-DD-<slug>.md.