
OBSERVABILITY — staff-service

Sibling: DEPLOYMENT_TOPOLOGY · FAILURE_MODES · SECURITY_MODEL

Strategic anchors: 02 §12 Resilience · 02 §14 Observability · standards/SERVICE_TEMPLATE

staff-service emits telemetry through the canonical platform stack (OpenTelemetry → Cloud Trace / Cloud Logging / Cloud Monitoring; Pub/Sub → BigQuery for events). The dashboards and SLOs below are operationally focused: capacity-signal freshness, clock-in correctness, and outbox health are the three signals that must always alert.


1. SLOs

| SLO | Target | Window | Error budget |
|---|---|---|---|
| Punch availability (POST /clock/punch 2xx) | 99.95 % | 30 d | 21.6 min |
| Punch latency p95 | < 400 ms | 30 d | n/a (latency) |
| Capacity freshness | ≤ 60 s end-to-end (event-to-cache) | 30 d | 1 % > 60 s |
| Outbox publish lag p99 | < 30 s | 30 d | 0.1 % > 30 s |
| Inbox processing lag p99 | < 60 s | 30 d | 0.1 % > 60 s |
| Schedule grid availability (GET /shifts 2xx) | 99.9 % | 30 d | 43.2 min |
| DSAR export turnaround | < 48 h | 90 d | n/a (compliance) |

Error budget exhaustion triggers the freeze policy in DOD §Error budget.
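
The availability budgets in the table follow directly from the target and the window. A minimal sketch of that arithmetic is shown here; the function name errorBudgetMinutes is illustrative and not part of the service.

```typescript
// Allowed unavailability, in minutes, for an availability target over a rolling window.
function errorBudgetMinutes(targetPercent: number, windowDays: number): number {
  const windowMinutes = windowDays * 24 * 60;
  const budgetFraction = (100 - targetPercent) / 100;
  // Round to 0.1 min to hide floating-point noise.
  return Math.round(windowMinutes * budgetFraction * 10) / 10;
}

const punchBudget = errorBudgetMinutes(99.95, 30); // 21.6 (punch availability)
const gridBudget = errorBudgetMinutes(99.9, 30);   // 43.2 (schedule grid availability)
```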


2. SLIs (Cloud Monitoring)

| SLI name | Source | Notes |
|---|---|---|
| staff.api.punch.success_rate | API HTTP metrics, code in [200, 299] | Excludes intentional 4xx (PIN failures) |
| staff.api.punch.latency.p95 | API HTTP metrics | Per-region |
| staff.api.shifts.success_rate | API HTTP metrics | |
| staff.outbox.depth | Custom gauge, staff.outbox count where published_at IS NULL | Sampled every 15 s |
| staff.outbox.publish.latency | Outbox relay span timing | |
| staff.inbox.lag | Pub/Sub oldest_unacked_message_age | Per subscription |
| staff.capacity.cache.hit_rate | Custom counter | Target ≥ 90 % |
| staff.capacity.event_to_cache_ms | Trace span | Punch-event publish to cache invalidate |
| staff.pin.failure_rate.per_property | Custom counter | Excessive → security alert |
| staff.shift.staffing_gap_count | Counter from staffing_gap_detected.v1 | Trended in dashboard, not alerted (informational) |
| staff.iam.revoke.failure_count | Counter | Alert if > 0 in 5 min window |
| staff.ai.shift_suggestion.applied_rate | Counter | Health of AI value loop |
| staff.ai.edge.anomaly_rate | Per-property gauge from edge telemetry | Drift signal |
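
To illustrate how a success-rate SLI can exclude intentional 4xx responses, here is a small sketch. Which status codes count as PIN failures is not stated in this document, so the excluded set is passed in by the caller; the value used in the example (401) is only an assumption.

```typescript
interface StatusCount {
  status: number;
  count: number;
}

// Fraction of counted requests that returned 2xx, ignoring excluded status codes.
function successRate(counts: StatusCount[], excluded: ReadonlySet<number>): number {
  let good = 0;
  let total = 0;
  for (const { status, count } of counts) {
    if (excluded.has(status)) continue; // intentional failures do not count either way
    total += count;
    if (status >= 200 && status <= 299) good += count;
  }
  // With no eligible traffic, report 1 so an idle period does not burn budget.
  return total === 0 ? 1 : good / total;
}

// Assumption for illustration: PIN failures surface as 401.
const punchSuccess = successRate(
  [
    { status: 200, count: 9990 },
    { status: 401, count: 50 },
    { status: 500, count: 10 },
  ],
  new Set([401]),
);
```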

3. Dashboards

3.1 staff/overview (Cloud Monitoring)

  • Punch RPS, success rate, latency p50/p95/p99
  • Schedule grid RPS + latency
  • Capacity freshness (event_to_cache_ms p95 + cache hit rate)
  • Active shift count by property
  • DB CPU, connections, slow queries

3.2 staff/clock-in

  • Punches per minute, by source (electron_pin, electron_jwt, mobile_jwt, web_jwt, manager_override, offline_replay, system_auto)
  • PIN failure rate per property + per device
  • Manager-override count per actor (top 10)
  • Multi-device-collision count (rare, alert at > 0)
  • Offline-replay age distribution

3.3 staff/outbox

  • Outbox depth (gauge)
  • Publish latency p95
  • Failed publish count (DLQ candidates)
  • Per-topic publish rate

3.4 staff/inbox

  • Per-consumed-subject lag p99
  • DLQ count (alert at > 0)
  • Reprocessing count (replays)

3.5 staff/leave

  • Leave requests by status, type, day
  • Approval latency p95 (request → decided)
  • Force-unassigned count per approval

3.6 staff/ai

  • Shift suggestion: viewed, applied, rejected (counts)
  • Forecast call count + degraded rate
  • Edge anomaly rate, model version distribution

4. Tracing

OpenTelemetry instrumentation:

  • HTTP server spans (http.method, http.route, http.status_code, tenant.id, property.id, actor.id).
  • DB spans (db.system=postgresql, db.statement.template, parameter cardinality only — no values).
  • Pub/Sub spans (publish + subscribe; correlation via traceparent carried in event envelope metadata.traceparent).
  • Outbound HTTP to iam-service, property-service, ai-orchestrator-service.
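
The Pub/Sub correlation described above relies on the W3C traceparent value travelling in metadata.traceparent. The sketch below shows one way to write and read that field. Only the metadata.traceparent path comes from this document; the envelope shape, the helper names, and the restriction to version 00 are assumptions.

```typescript
interface TraceContext {
  traceId: string; // 32 lowercase hex chars
  spanId: string;  // 16 lowercase hex chars
  sampled: boolean;
}

// Hypothetical envelope shape; only metadata.traceparent is taken from this document.
interface EventEnvelope<T> {
  type: string;
  payload: T;
  metadata: { traceparent?: string };
}

function formatTraceparent(ctx: TraceContext): string {
  return `00-${ctx.traceId}-${ctx.spanId}-${ctx.sampled ? "01" : "00"}`;
}

const TRACEPARENT_RE = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Returns undefined for malformed values so the consumer can start a fresh trace.
function parseTraceparent(value: string): TraceContext | undefined {
  const m = TRACEPARENT_RE.exec(value);
  if (!m) return undefined;
  const [, traceId, spanId, flags] = m;
  // All-zero ids are invalid in W3C Trace Context.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return undefined;
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

// Publisher side: copy the envelope and attach the current trace context.
function withTraceContext<T>(envelope: EventEnvelope<T>, ctx: TraceContext): EventEnvelope<T> {
  return { ...envelope, metadata: { ...envelope.metadata, traceparent: formatTraceparent(ctx) } };
}
```

In practice the OpenTelemetry propagator would normally do this work; the sketch only makes the wire format explicit.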

Trace exemplars are linked from latency dashboards. Sampling:

  • 100 % for 5xx and for 4xx ∈ {401, 403, 409, 422, 423, 429}
  • 5 % head-based for all other requests
  • 1 % adaptive for high-volume clock.punch (always sampled when the status is non-2xx)
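
The sampling rules can be read as a single keep/drop decision per trace. The sketch below is one interpretation under stated assumptions: the decision is evaluated once the response status is known, the adaptive 1 % rate for clock.punch is treated as a fixed 1 %, the route string /clock/punch identifies punches, and the probabilistic part is derived deterministically from the trace id. None of these details are confirmed by this document.

```typescript
const ALWAYS_SAMPLED_4XX: ReadonlySet<number> = new Set([401, 403, 409, 422, 423, 429]);

// Maps the last 8 hex chars of a trace id to a number in [0, 1).
function traceIdFraction(traceId: string): number {
  return parseInt(traceId.slice(-8), 16) / 0x100000000;
}

function shouldSample(route: string, status: number, traceId: string): boolean {
  const isPunch = route === "/clock/punch";
  // Errors of interest are always kept.
  if (status >= 500 || ALWAYS_SAMPLED_4XX.has(status)) return true;
  // clock.punch is always kept when the status is non-2xx.
  if (isPunch && (status < 200 || status > 299)) return true;
  // Everything else is sampled at a fixed rate (assumed: 1 % punch, 5 % other).
  const rate = isPunch ? 0.01 : 0.05;
  return traceIdFraction(traceId) < rate;
}
```

Deriving the decision from the trace id keeps it consistent across services that see the same trace, which is the usual reason for preferring it over a random draw.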


5. Logging

Logs are JSON-structured, one line per record, with the following fields:

| Field | Source |
|---|---|
| timestamp | RFC 3339 |
| severity | INFO / WARN / ERROR / CRITICAL |
| service | 'staff-service' |
| tenantId | from app.tenant_id |
| actorId | JWT subject |
| propertyId | header |
| route | HTTP route template |
| method | HTTP |
| status | HTTP |
| latencyMs | request duration |
| traceId | W3C |
| spanId | W3C |
| event | one of request.complete, outbox.publish.ok, outbox.publish.fail, inbox.process.ok, inbox.process.fail, pin.fail, pin.lock, clock.override, iam.revoke.fail, audit.write, ai.degraded, migration.run |
| metadata | event-specific |

Logs are PII-clean: emails, phone numbers, and emergency contacts are never logged, and PIN values are never logged. A CI lint rule (@ghasi/eslint-pii) flags string templates that contain forbidden field names.
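
As an illustration of the record shape, the following sketch types the fields listed above and serializes a record to a single JSON line. The runtime metadata filter and the forbidden key names are hypothetical additions for the example; this document only confirms the CI lint rule, not a runtime check.

```typescript
type Severity = "INFO" | "WARN" | "ERROR" | "CRITICAL";

interface LogRecord {
  timestamp: string; // RFC 3339
  severity: Severity;
  service: "staff-service";
  tenantId?: string;
  actorId?: string;
  propertyId?: string;
  route?: string;
  method?: string;
  status?: number;
  latencyMs?: number;
  traceId?: string;
  spanId?: string;
  event: string;
  metadata?: Record<string, unknown>;
}

// Hypothetical key names; the real forbidden list lives in the lint rule.
const FORBIDDEN_METADATA_KEYS: ReadonlySet<string> = new Set([
  "email",
  "phone",
  "emergencyContact",
  "pin",
]);

// Serializes to one line; JSON.stringify escapes any newlines inside strings.
function toLogLine(record: LogRecord): string {
  const metadata: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(record.metadata ?? {})) {
    if (!FORBIDDEN_METADATA_KEYS.has(key)) metadata[key] = value;
  }
  return JSON.stringify({ ...record, metadata });
}

const exampleLine = toLogLine({
  timestamp: new Date().toISOString(),
  severity: "WARN",
  service: "staff-service",
  event: "pin.fail",
  metadata: { deviceId: "device-1", pin: "1234" },
});
```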


6. Alerts

Pager-grade alerts route to the PagerDuty service staff-service; warn-grade alerts route to Slack #staff-svc-warnings.

| Alert | Severity | Trigger | Runbook |
|---|---|---|---|
| Punch error rate > 0.5 % for 5 min | P1 | SLO burn 14× normal | runbooks/staff/clock-error-spike.md |
| Outbox depth > 1000 sustained 5 min | P1 | Publishers stuck | runbooks/staff/outbox-stalled.md |
| Outbox publish lag p99 > 60 s for 10 min | P2 | Pub/Sub or DB pressure | same as above |
| Inbox lag p99 > 5 min for 5 min | P1 | Consumer backed up | runbooks/staff/inbox-stalled.md |
| Inbox DLQ depth > 0 | P2 | Schema mismatch or consumer bug | runbooks/staff/inbox-dlq.md |
| iam.revoke.failure_count > 0 in 5 min | P1 | Termination cascade incomplete | runbooks/staff/iam-revoke-failed.md |
| multi_device_punch_detected > 0 in 1 h | P2 | Reconciliation needed | runbooks/staff/multi-device-punch.md |
| PIN failure rate > 1 / s sustained 2 min for any property | P2 | Possible brute-force | runbooks/staff/pin-bruteforce.md |
| Capacity freshness p95 > 120 s for 10 min | P2 | Cache invalidation stuck | runbooks/staff/capacity-stale.md |
| Edge anomaly model drift (FPR > 12 % weekly) | P3 | Model needs retraining | runbooks/staff/edge-anomaly-drift.md |
| DSAR export overdue > 48 h | P2 | Compliance breach risk | runbooks/staff/dsar-overdue.md |
| Migration failed in prod | P1 | Flyway non-zero exit | runbooks/staff/migration-failure.md |

Each alert page includes the dashboard panel link and the runbook link. Runbooks live in runbooks/staff/ in the docs repo and ship with the service.
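
Several alerts use a "sustained for N minutes" condition. In production that evaluation belongs to the Cloud Monitoring alert policy; the sketch below only spells out the semantics, using the outbox-depth alert as the example. The function name and the coverage rule (data must reach back to the start of the window, within one 15 s sample interval) are assumptions.

```typescript
interface Sample {
  atMs: number;  // sample timestamp, epoch milliseconds
  value: number; // gauge value, e.g. staff.outbox.depth
}

// True when the trailing window is covered by samples and every one exceeds the threshold.
function sustainedAbove(
  samples: Sample[],
  threshold: number,
  windowMs: number,
  nowMs: number,
  sampleIntervalMs = 15000,
): boolean {
  const windowStart = nowMs - windowMs;
  const inWindow = samples.filter((s) => s.atMs >= windowStart && s.atMs <= nowMs);
  if (inWindow.length === 0) return false;
  // Require data near the start of the window so a short spike cannot fire the alert.
  const earliest = Math.min(...inWindow.map((s) => s.atMs));
  const coversWindow = earliest <= windowStart + sampleIntervalMs;
  return coversWindow && inWindow.every((s) => s.value > threshold);
}

// Outbox depth > 1000 sustained 5 min:
// sustainedAbove(depthSamples, 1000, 5 * 60 * 1000, Date.now())
```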


7. Health Endpoints

| Endpoint | Returns |
|---|---|
| GET /health/startup | 200 once Flyway migrations complete and JWKS is loaded (deadline 30 s) |
| GET /health/live | 200 if process can serve any request (no deps) |
| GET /health/ready | 200 if Postgres reachable + Redis reachable + KMS reachable + Pub/Sub publisher healthy + outbox depth < 5000 |

/health/ready returns a JSON body with per-dep status for Cloud Run probes and on-call diagnosis.
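
A readiness evaluation along these lines could look like the sketch below. The dependency list and the outbox threshold of 5000 come from the table above; the JSON field names, the "ok"/"fail" values, and the 503 status for the not-ready case are assumptions, since this document only specifies when 200 is returned.

```typescript
type CheckStatus = "ok" | "fail";

// Results of the individual dependency probes, gathered by the caller.
interface DependencyProbe {
  postgres: boolean;
  redis: boolean;
  kms: boolean;
  pubsubPublisher: boolean;
  outboxDepth: number;
}

interface ReadinessResponse {
  httpStatus: number;
  body: { status: "ready" | "not_ready"; checks: Record<string, CheckStatus> };
}

const MAX_READY_OUTBOX_DEPTH = 5000;

function evaluateReadiness(probe: DependencyProbe): ReadinessResponse {
  const toStatus = (ok: boolean): CheckStatus => (ok ? "ok" : "fail");
  const checks: Record<string, CheckStatus> = {
    postgres: toStatus(probe.postgres),
    redis: toStatus(probe.redis),
    kms: toStatus(probe.kms),
    pubsubPublisher: toStatus(probe.pubsubPublisher),
    outbox: toStatus(probe.outboxDepth < MAX_READY_OUTBOX_DEPTH),
  };
  const ready = Object.values(checks).every((status) => status === "ok");
  return {
    httpStatus: ready ? 200 : 503, // 503 is an assumed convention
    body: { status: ready ? "ready" : "not_ready", checks },
  };
}
```

Keeping the per-dependency map in the body even when the probe passes lets on-call compare a healthy and an unhealthy response side by side.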


8. Edge Telemetry (Anomaly Model Drift)

The Electron edge anomaly model (AI_INTEGRATION §4) emits a small heartbeat to bff-backoffice-service containing:

  • Aggregate count of scores in three buckets (< 0.3, 0.3..0.7, ≥ 0.7)
  • Model version
  • Property + device

staff-service aggregates these heartbeats as staff.ai.edge.anomaly_rate. A weekly batch job recomputes the false-positive rate against ground truth (did a manager override actually happen for that punch?) and alerts if FPR > 12 %.
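
The weekly drift check can be sketched as follows. This document does not define the exact formula, so the example uses the textbook false-positive rate, FP / (FP + TN), treating "a manager override happened" as the positive ground truth. The record shape and the notion of a punch being "flagged" by the edge model are assumptions.

```typescript
// One punch joined with its ground truth (hypothetical shape).
interface PunchOutcome {
  flagged: boolean;         // edge model considered the punch anomalous
  managerOverride: boolean; // ground truth: an override actually happened
}

const FPR_ALERT_THRESHOLD = 0.12;

// FPR = flagged punches without an override / all punches without an override.
function falsePositiveRate(outcomes: PunchOutcome[]): number {
  let falsePositives = 0;
  let trueNegatives = 0;
  for (const outcome of outcomes) {
    if (outcome.managerOverride) continue; // ground-truth positives are not part of FPR
    if (outcome.flagged) falsePositives++;
    else trueNegatives++;
  }
  const negatives = falsePositives + trueNegatives;
  return negatives === 0 ? 0 : falsePositives / negatives;
}

function exceedsDriftThreshold(outcomes: PunchOutcome[]): boolean {
  return falsePositiveRate(outcomes) > FPR_ALERT_THRESHOLD;
}
```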


9. Audit Log Visibility

  • Audit rows are queryable via bff-backoffice-service admin endpoints (capability staff.audit.read).
  • BigQuery cold export: melmastoon.staff.audit_events table; partitioned daily, clustered by tenant_id.
  • Suspicious-action analytics dashboards are owned by security-service, not by us.

10. Cost Telemetry

| Counter | Use |
|---|---|
| Pub/Sub publish bytes per topic | Pub/Sub cost attribution |
| BigQuery export rows per day | Cold storage cost |
| Cloud SQL row growth per table | DB sizing forecast |
| AI orchestrator calls per surface | AI cost attribution (cross-checked with ai-orchestrator) |
| Egress bytes per BFF | Network cost |

A monthly tenant cost report is computed from BigQuery and surfaced via tenant-service (we do not bill).