FAILURE_MODES — staff-service
Sibling: OBSERVABILITY · DEPLOYMENT_TOPOLOGY · SECURITY_MODEL · SYNC_CONTRACT
Strategic anchors: 02 §12 Resilience
A practical catalog of how staff-service can fail, what surfaces degrade, and how on-call recovers. Each entry includes detection (alert / SLI), blast radius, automatic mitigation, manual mitigation, and the canonical runbook reference.
1. Postgres Unavailable (Cloud SQL primary failover)
| Aspect | Detail |
|---|---|
| Symptoms | /health/ready 503; punch latency p95 spikes; outbox depth grows |
| Detection | staff-service Cloud Run health failures; alert "Punch error rate > 0.5 %" |
| Blast radius | All writes blocked; reads degrade to cache where possible (capacity snapshot stays warm 30 s) |
| Auto mitigation | Cloud SQL HA failover (~ 60–120 s); circuit breakers in API short-circuit to 503 with Retry-After: 5 |
| Manual | Confirm failover in Cloud SQL console; if PITR needed, follow runbook |
| Runbook | runbooks/staff/cloud-sql-failover.md |
| Data risk | Punches in flight at failover are retried by clients (Electron) on Idempotency-Key; no loss |
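
The short-circuit behaviour in the "Auto mitigation" row can be sketched as a small in-process breaker. This is a minimal illustration, not the service's actual implementation: the class name, the failure threshold, and the cool-down period are assumptions. Only the 503 status and the `Retry-After: 5` header come from the table above.

```typescript
// Hypothetical circuit-breaker sketch. Names and thresholds are assumptions;
// only the 503 status and Retry-After value of 5 s come from the table.
type BreakerState = "closed" | "open";

interface ShortCircuit {
  status: 503;
  headers: { "Retry-After": string };
}

class DbCircuitBreaker {
  private failures = 0;
  private openedAt = 0;
  private state: BreakerState = "closed";
  private readonly failureThreshold: number; // assumed tuning knob
  private readonly openForMs: number;        // assumed tuning knob
  private readonly now: () => number;

  constructor(failureThreshold: number, openForMs: number, now: () => number = Date.now) {
    this.failureThreshold = failureThreshold;
    this.openForMs = openForMs;
    this.now = now;
  }

  // Returns a 503 response while the breaker is open, otherwise null.
  check(): ShortCircuit | null {
    if (this.state === "open" && this.now() - this.openedAt >= this.openForMs) {
      // Cool-down elapsed: close again so the next request probes the database.
      this.state = "closed";
      this.failures = 0;
    }
    return this.state === "open"
      ? { status: 503, headers: { "Retry-After": "5" } }
      : null;
  }

  // Call after a database error; opens the breaker at the threshold.
  recordFailure(): void {
    this.failures += 1;
    if (this.failures >= this.failureThreshold) {
      this.state = "open";
      this.openedAt = this.now();
    }
  }

  // Call after a successful database round trip.
  recordSuccess(): void {
    this.failures = 0;
  }
}
```

A request handler would call `check()` first and return the 503 as-is when it is non-null. Clients that resend with the same Idempotency-Key after the `Retry-After` interval do not create duplicate punches.
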
2. Pub/Sub Publisher Stalled (Outbox Backed Up)
| Aspect | Detail |
|---|---|
| Symptoms | Outbox depth > 1000 sustained; downstream consumers (housekeeping, maintenance) miss capacity events |
| Detection | Alert "Outbox depth > 1000 sustained 5 min" (P1) |
| Blast radius | Eventual-consistency degraded (assignments still work via REST); no data loss |
| Auto mitigation | Worker exponential-backoff retry; scale-up rule on outbox depth |
| Manual | Inspect Pub/Sub for quota / IAM error; restart staff-worker; if topic misconfigured, fix and replay |
| Runbook | runbooks/staff/outbox-stalled.md |
| Data risk | None — outbox is durable; events publish in-order once Pub/Sub recovers |
3. Inbox Consumer Stalled / DLQ Growing
| Aspect | Detail |
|---|---|
| Symptoms | Inbox lag p99 > 5 min; DLQ depth > 0 |
| Detection | Alert "Inbox lag p99 > 5 min for 5 min" (P1); "Inbox DLQ depth > 0" (P2) |
| Blast radius | Consumed events (IAM user-registered linkage, tenant membership cascade) delayed; no immediate operator visibility |
| Auto mitigation | Consumer auto-ack on success; failures retry with exponential backoff before DLQ |
| Manual | Read DLQ payloads via gcloud pubsub; if schema mismatch, hot-fix consumer or reject upstream; replay from DLQ via admin endpoint |
| Runbook | runbooks/staff/inbox-dlq.md |
| Data risk | Stale linkage; cascade actions can be delayed by minutes |
4. KMS Unavailable (PIN Verification Path)
| Aspect | Detail |
|---|---|
| Symptoms | PIN punch returns 503; manager-override punches succeed (no KMS call) |
| Detection | KMS error_count metric; readiness probe fails after 30 s |
| Blast radius | All PIN clock-ins fail until KMS recovers; JWT punches via Electron continue |
| Auto mitigation | Per-instance pepper cache (5 min TTL); during cache validity, PIN verifies still work |
| Manual | If the KMS outage is region-wide, notify property GMs and ensure manager-override is used; do NOT cache the pepper for longer than 5 min |
| Runbook | runbooks/staff/kms-outage.md |
| Data risk | None directly; operational impact only |
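
The per-instance pepper cache can be pictured as a single-value TTL cache that refuses to serve a stale entry. The sketch below is illustrative only: `PepperCache`, `store`, and `read` are hypothetical names, and the injectable clock exists purely to make the behaviour testable. The 5-minute TTL is the only value taken from the table.

```typescript
// Hypothetical single-value TTL cache for the PIN pepper.
// Only the 5-minute TTL is taken from the table; everything else is assumed.
const PEPPER_TTL_MS = 5 * 60 * 1000;

class PepperCache {
  private pepper: string | null = null;
  private storedAt = 0;
  private readonly now: () => number;

  constructor(now: () => number = Date.now) {
    this.now = now;
  }

  // Called after a successful KMS fetch.
  store(pepper: string): void {
    this.pepper = pepper;
    this.storedAt = this.now();
  }

  // Returns the pepper while it is younger than the TTL, otherwise null.
  // A null result means the caller must go back to KMS; the stale value is
  // deliberately never served past the TTL.
  read(): string | null {
    if (this.pepper === null) return null;
    return this.now() - this.storedAt < PEPPER_TTL_MS ? this.pepper : null;
  }
}
```

With this shape, a KMS outage shorter than the remaining TTL is invisible to PIN punches. Once `read()` returns null and KMS is still down, the PIN path returns 503 as described in the Symptoms row.
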
5. Memorystore Redis Unavailable
| Aspect | Detail |
|---|---|
| Symptoms | Capacity GET latency p95 climbs; PIN attempt counter falls back to DB |
| Detection | redis.connection_failures metric; readiness probe partial-fail |
| Blast radius | Capacity reads slower (~ 250 ms vs 80 ms); PIN brute-force protection degraded to per-staff DB increment |
| Auto mitigation | Service degrades gracefully: reads bypass cache; PIN attempt counter writes to staff.staff.clock_in_pin_failed directly |
| Manual | Restart Memorystore replica, await failover; warm cache via staff.cron.warm_capacity |
| Runbook | runbooks/staff/redis-outage.md |
| Data risk | None |
6. IAM Revoke Cascade Failure on Termination
| Aspect | Detail |
|---|---|
| Symptoms | iam.session.revoke.failure_count alert (P1) |
| Detection | After exponential backoff (12 attempts, ~ 1 h), audit row iam.session.revoke.failed |
| Blast radius | Terminated staff retains an active IAM session until token expiry (≤ 1 h for access tokens; up to 30 d for refresh tokens); this is a security risk |
| Auto mitigation | Retries; staff.terminated.v1 event still published (subscribers act on staff state) |
| Manual | Manually call iam-service admin revoke endpoint; force tenant-side session sweep |
| Runbook | runbooks/staff/iam-revoke-failed.md |
| Data risk | Operational/security; staff record remains terminated regardless |
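
To get a feel for how 12 attempts can span roughly an hour, the sketch below computes the waits of a plain doubling backoff. The 2-second base delay and the factor of 2 are assumptions chosen for illustration, and jitter is omitted. The table above specifies only the attempt count and the approximate total.

```typescript
// Waits between attempts for an uncapped exponential backoff.
// baseMs and factor are assumed values, not taken from the service config.
function backoffDelaysMs(attempts: number, baseMs: number, factor: number = 2): number[] {
  const delays: number[] = [];
  // N attempts are separated by N - 1 waits.
  for (let i = 0; i < attempts - 1; i++) {
    delays.push(baseMs * Math.pow(factor, i));
  }
  return delays;
}

const delays = backoffDelaysMs(12, 2000);
const totalMs = delays.reduce((sum, d) => sum + d, 0);
// 11 waits from 2 s up to 2048 s; totalMs is 4,094,000 ms (about 68 min).
```

Under these assumed parameters the final wait alone is about 34 minutes, so most of the exposure window described in the "Blast radius" row accrues in the last two or three retries.
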
7. Multi-Device Punch Collision
| Aspect | Detail |
|---|---|
| Symptoms | Audit MELMASTOON.STAFF.MULTI_DEVICE_PUNCH_DETECTED; capacity snapshot inconsistency |
| Detection | Counter multi_device_punch_detected > 0 (P2) |
| Blast radius | A staff member appears punched in on two devices at once; an auto-out is emitted for the second punch |
| Auto mitigation | Per SYNC_CONTRACT §6.3 |
| Manual | GM reviews via backoffice; corrects via manager-override if needed |
| Runbook | runbooks/staff/multi-device-punch.md |
| Data risk | None (audit trail intact); operator confusion |
8. Termination of Currently-Clocked-In Staff
| Aspect | Detail |
|---|---|
| Symptoms | Normal flow |
| Detection | clock.system_auto event; staff.terminated.v1 cascade |
| Blast radius | One auto clock-out + one terminated event in close succession |
| Auto mitigation | TerminateStaff use case auto-closes the open punch (source system_auto) before emitting terminated.v1 |
| Manual | None |
| Runbook | n/a (designed behavior) |
| Data risk | None |
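
The ordering guarantee in the "Auto mitigation" row (close the open punch first, then emit the termination event) can be sketched as a pure function. The types and field names below are hypothetical. Only the `system_auto` source and the `clock.system_auto` and `staff.terminated.v1` event names come from the table.

```typescript
// Hypothetical shapes; the real aggregate and event payloads are not shown
// in this document.
interface StaffState {
  staffId: string;
  status: "active" | "terminated";
  openPunchId: string | null;
}

interface DomainEvent {
  type: string;
  staffId: string;
  source?: string;
}

function terminateStaff(staff: StaffState): { staff: StaffState; events: DomainEvent[] } {
  const events: DomainEvent[] = [];
  // Step 1: auto-close any open punch before the termination is announced.
  if (staff.openPunchId !== null) {
    events.push({ type: "clock.system_auto", staffId: staff.staffId, source: "system_auto" });
  }
  // Step 2: only then emit the termination event.
  events.push({ type: "staff.terminated.v1", staffId: staff.staffId });
  const next: StaffState = { staffId: staff.staffId, status: "terminated", openPunchId: null };
  return { staff: next, events };
}
```

Emitting in this order means a subscriber to `staff.terminated.v1` never observes a terminated staff member who still has an open punch.
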
9. Property Deactivated With Active Shifts
| Aspect | Detail |
|---|---|
| Symptoms | property.deactivated.v1 consumed; future scheduled shifts cancelled |
| Detection | Inbox event; cascade emits one staff.shift.cancelled.v1 per shift |
| Blast radius | Future shifts at the property cancelled with cancelReason='property_deactivated'; active staff at the property NOT auto-terminated (operator decides) |
| Auto mitigation | Per APPLICATION_LOGIC §4.4 |
| Manual | Tenant admin reviews staff list and decides on transfer or termination |
| Runbook | runbooks/staff/property-deactivation.md |
| Data risk | None |
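
As a rough illustration of the cascade described above, the function below selects future scheduled shifts at the deactivated property and produces one cancellation event per shift. The `Shift` and `ShiftCancelled` shapes are assumptions. The event name and the `cancelReason` value are the ones quoted in the table, and the authoritative rules remain those in APPLICATION_LOGIC §4.4.

```typescript
// Hypothetical shapes for illustration only.
interface Shift {
  shiftId: string;
  propertyId: string;
  startsAt: Date;
  status: "scheduled" | "cancelled" | "completed";
}

interface ShiftCancelled {
  type: "staff.shift.cancelled.v1";
  shiftId: string;
  cancelReason: "property_deactivated";
}

function cancelFutureShifts(shifts: Shift[], propertyId: string, now: Date): ShiftCancelled[] {
  return shifts
    // Only shifts at this property that are still scheduled and start in the future.
    .filter((s) =>
      s.propertyId === propertyId &&
      s.status === "scheduled" &&
      s.startsAt.getTime() > now.getTime())
    // One event per cancelled shift.
    .map((s): ShiftCancelled => ({
      type: "staff.shift.cancelled.v1",
      shiftId: s.shiftId,
      cancelReason: "property_deactivated",
    }));
}
```

Staff records are not touched here, which matches the table: active staff at the property are not auto-terminated.
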
10. Schedule Generation Idempotency Drift
| Aspect | Detail |
|---|---|
| Symptoms | GenerateShifts creates duplicate shifts on retry |
| Detection | EXCLUDE constraint on (tenant_id, pattern_id, local_date) rejects duplicates with 23P01 |
| Blast radius | Single use-case retry returns 409 OCC_CONFLICT; no data corruption |
| Auto mitigation | Idempotency-Key 24 h dedupe + DB constraint |
| Manual | Operator retries with new key |
| Runbook | n/a |
| Data risk | None |
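
The translation from the database rejection to the API response can be sketched as below. SQLSTATE `23P01` is PostgreSQL's `exclusion_violation`. The sketch assumes the driver exposes the SQLSTATE as a `code` string on the error object (as node-postgres does), and the function name and the 500 fallback body are hypothetical.

```typescript
// Minimal error-mapping sketch. PgLikeError and the fallback response are
// assumptions; 23P01 -> 409 OCC_CONFLICT is the mapping described in the table.
interface PgLikeError {
  code?: string;
}

interface ApiError {
  status: number;
  error: string;
}

const EXCLUSION_VIOLATION = "23P01";

function mapGenerateShiftsError(err: PgLikeError): ApiError {
  if (err.code === EXCLUSION_VIOLATION) {
    // The EXCLUDE constraint caught a duplicate (tenant_id, pattern_id, local_date).
    return { status: 409, error: "OCC_CONFLICT" };
  }
  // Anything else is not a known duplicate and is surfaced as a server error.
  return { status: 500, error: "INTERNAL" };
}
```

The constraint acts as a backstop: the Idempotency-Key dedupe is expected to catch most retries before they reach the database.
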
11. Edge Anomaly Model Drift / FPR Spike
| Aspect | Detail |
|---|---|
| Symptoms | Manager overrides spike on Electron front-desk; staff complain about flagging |
| Detection | Weekly FPR > 12 % alert (P3); manager-override count per actor > 5/day (P3) |
| Blast radius | Operator friction; potential PIN-clock latency from extra confirmation step |
| Auto mitigation | None |
| Manual | Pin Electron auto-update channel to last-known-good model; coordinate with ai-orchestrator to retrain |
| Runbook | runbooks/staff/edge-anomaly-drift.md |
| Data risk | None |
12. AI Orchestrator Unavailable
| Aspect | Detail |
|---|---|
| Symptoms | GET /forecast/staffing-gaps returns degraded=true; POST /ai/draft-staff-tags returns 503 |
| Detection | Outbound HTTP 5xx counter; circuit breaker trip |
| Blast radius | AI surfaces degraded; baseline workflows continue |
| Auto mitigation | Per AI_INTEGRATION §3, §6 |
| Manual | None for short outages; if extended, notify GMs that AI suggestions are paused |
| Runbook | runbooks/staff/ai-orchestrator-down.md |
| Data risk | None |
13. Sync Service Unable to Push (Electron Offline > 24 h)
| Aspect | Detail |
|---|---|
| Symptoms | Device queue grows; on push, server flags late-replay punches |
| Detection | staff.sync.reconcile job alerts on devices > 7 d behind |
| Blast radius | Local-only data on device; operators rely on manual records |
| Auto mitigation | Sync push accepts with flagged_late_replay; per SYNC_CONTRACT §6.2 |
| Manual | Field tech checks device connectivity; tenant admin notified if persistent |
| Runbook | runbooks/staff/device-sync-stale.md |
| Data risk | Time-skew on punches; operator reviews flagged entries |
14. Migration Failure in Production
| Aspect | Detail |
|---|---|
| Symptoms | Deploy pipeline aborts; staff-api rollout halted |
| Detection | Cloud Deploy alert (P1) |
| Blast radius | New version not rolled out; current version unaffected |
| Auto mitigation | None (pipeline gate) |
| Manual | Per runbooks/staff/migration-failure.md: investigate Flyway error, fix migration, redeploy. Forward-fix only — do NOT roll back schema |
| Runbook | runbooks/staff/migration-failure.md |
| Data risk | None if pipeline stopped before traffic-shift |
15. Cross-Region Replica Lag (M2)
| Aspect | Detail |
|---|---|
| Symptoms | Reads in secondary region serve stale capacity; writes still primary-only |
| Detection | Cloud SQL replica lag metric > 30 s |
| Blast radius | Read users in secondary region see slightly stale schedule |
| Auto mitigation | Read traffic falls back to primary if lag > threshold (Cloud Run header-based routing) |
| Manual | Investigate primary write rate; consider scaling up replica |
| Runbook | runbooks/staff/replica-lag.md |
| Data risk | Operational; no data loss |
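
The fallback decision itself is a simple threshold check, sketched below. The sketch assumes the routing threshold equals the 30 s detection threshold in the table, which the document does not state explicitly. It also treats an unknown lag as a reason to use the primary, which is a conservative assumption rather than documented behaviour. How the lag value reaches the router (the header-based routing mentioned above) is out of scope here.

```typescript
// Hypothetical routing helper. The 30 s value mirrors the detection threshold
// and is assumed to be the routing threshold as well.
type ReadTarget = "replica" | "primary";

const LAG_THRESHOLD_SECONDS = 30;

function chooseReadTarget(replicaLagSeconds: number | null): ReadTarget {
  // Unknown lag: prefer correctness over locality (assumption).
  if (replicaLagSeconds === null) return "primary";
  return replicaLagSeconds > LAG_THRESHOLD_SECONDS ? "primary" : "replica";
}
```
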
16. Audit Append-Only Trigger Bypass Attempt
| Aspect | Detail |
|---|---|
| Symptoms | DB error 0L000 from a use case writing to clock_entries / handoff_notes / audit_events via UPDATE/DELETE |
| Detection | Application logs ERROR; CI integration test append-only-trigger.spec.ts |
| Blast radius | Use case fails fast with 500; no data corruption |
| Auto mitigation | DB trigger raises |
| Manual | Code review; the only legitimate UPDATE/DELETE on these tables is via migration with app.role='platform_admin' |
| Runbook | n/a |
| Data risk | None |
17. Recovery Time Objectives
| Failure type | RTO | RPO |
|---|---|---|
| Cloud SQL HA failover | 2 min | 0 |
| Pub/Sub regional outage | 5 min | 0 (durable outbox) |
| KMS regional outage | 10 min | 0 |
| Region failover (M2) | 15 min | ≤ 30 s |
| Backup restore (PITR) | 1 h | 7 d max |
18. Postmortem Trigger
A postmortem is required for:
- Any P1 alert
- Data loss in clock_entries, audit_events, staff (any row)
- Cross-tenant data leak (RLS bypass)
- IAM revoke cascade failure that leaves a session active for more than 30 min after termination
- DSAR overdue
- Staffing gap that left a property unstaffed for > 30 min and was caused by a service-side bug
Postmortems live in runbooks/staff/postmortems/YYYY-MM-DD-<slug>.md.