OBSERVABILITY — staff-service

Sibling: DEPLOYMENT_TOPOLOGY · FAILURE_MODES · SECURITY_MODEL

Strategic anchors: 02 §12 Resilience · 02 §14 Observability · standards/SERVICE_TEMPLATE

staff-service emits telemetry through the canonical platform stack (OpenTelemetry → Cloud Trace / Cloud Logging / Cloud Monitoring; Pub/Sub → BigQuery for events). The dashboards and SLOs below are operationally focused: capacity-signal freshness, clock-in correctness, and outbox health are the three signals that must always alert.

1. SLOs

SLO	Target	Window	Error budget
Punch availability (`POST /clock/punch` 2xx)	99.95 %	30 d	21.6 min
Punch latency p95	< 400 ms	30 d	n/a (latency)
Capacity freshness	≤ 60 s end-to-end (event-to-cache)	30 d	1 % > 60 s
Outbox publish lag p99	< 30 s	30 d	0.1 % > 30 s
Inbox processing lag p99	< 60 s	30 d	0.1 % > 60 s
Schedule grid availability (`GET /shifts` 2xx)	99.9 %	30 d	43.2 min
DSAR export turnaround	< 48 h	90 d	n/a (compliance)

Error budget exhaustion triggers the freeze policy in DOD §Error budget.
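
The availability budgets in the table follow directly from the target and the window. A minimal sketch of that arithmetic (the function name is illustrative, not part of the service):

```python
def error_budget_minutes(target_pct: float, window_days: int) -> float:
    """Allowed unavailability, in minutes, for an availability target over a window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - target_pct) / 100.0


punch_budget = error_budget_minutes(99.95, 30)   # POST /clock/punch
shifts_budget = error_budget_minutes(99.9, 30)   # GET /shifts
print(f"punch: {punch_budget:.1f} min, shifts: {shifts_budget:.1f} min")
```

For the two availability SLOs this reproduces the 21.6 min and 43.2 min figures listed above.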

2. SLIs (Cloud Monitoring)

SLI name	Source	Notes
`staff.api.punch.success_rate`	API HTTP metrics, code in [200, 299]	Excludes intentional 4xx (PIN failures)
`staff.api.punch.latency.p95`	API HTTP metrics	Per-region
`staff.api.shifts.success_rate`	API HTTP metrics
`staff.outbox.depth`	Custom gauge, `staff.outbox` count where `published_at IS NULL`	Sampled every 15 s
`staff.outbox.publish.latency`	Outbox relay span timing
`staff.inbox.lag`	Pub/Sub `oldest_unacked_message_age`	Per subscription
`staff.capacity.cache.hit_rate`	Custom counter	Target ≥ 90 %
`staff.capacity.event_to_cache_ms`	Trace span	Measured from punch-event publish to cache invalidation
`staff.pin.failure_rate.per_property`	Custom counter	Excessive → security alert
`staff.shift.staffing_gap_count`	Counter from `staffing_gap_detected.v1`	Trended in dashboard, not alerted (informational)
`staff.iam.revoke.failure_count`	Counter	Alert if > 0 in 5 min window
`staff.ai.shift_suggestion.applied_rate`	Counter	Health of AI value loop
`staff.ai.edge.anomaly_rate`	Per-property gauge from edge telemetry	Drift signal
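
As an illustration of how `staff.api.punch.success_rate` treats intentional 4xx responses, the sketch below computes the ratio from raw responses. It assumes each response carries a flag marking it as a PIN failure; that flag and the function name are hypothetical, and the real SLI is derived from API HTTP metrics in Cloud Monitoring.

```python
from typing import Iterable, Tuple


def punch_success_rate(responses: Iterable[Tuple[int, bool]]) -> float:
    """Share of eligible responses with a status code in [200, 299].

    Each item is (http_status, is_pin_failure). PIN failures that returned
    a 4xx are intentional and are dropped from both numerator and denominator.
    """
    good = 0
    eligible = 0
    for status, is_pin_failure in responses:
        if is_pin_failure and 400 <= status <= 499:
            continue  # intentional 4xx: excluded from the SLI
        eligible += 1
        if 200 <= status <= 299:
            good += 1
    # With no eligible traffic there is nothing to count against the SLO.
    return good / eligible if eligible else 1.0


sample = [(200, False), (201, False), (401, True), (500, False)]
rate = punch_success_rate(sample)
```

In this sample the 401 PIN failure is ignored, so the rate is 2 of 3 eligible responses.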

3. Dashboards

3.1 `staff/overview` (Cloud Monitoring)

Punch RPS, success rate, latency p50/p95/p99
Schedule grid RPS + latency
Capacity freshness (event_to_cache_ms p95 + cache hit rate)
Active shift count by property
DB CPU, connections, slow queries

3.2 `staff/clock-in`

Punches per minute, by source (electron_pin, electron_jwt, mobile_jwt, web_jwt, manager_override, offline_replay, system_auto)
PIN failure rate per property + per device
Manager-override count per actor (top 10)
Multi-device-collision count (rare, alert at > 0)
Offline-replay age distribution

3.3 `staff/outbox`

Outbox depth (gauge)
Publish latency p95
Failed publish count (DLQ candidates)
Per-topic publish rate

3.4 `staff/inbox`

Per-consumed-subject lag p99
DLQ count (alert at > 0)
Reprocessing count (replays)

3.5 `staff/leave`

Leave requests by status, type, day
Approval latency p95 (request → decided)
Force-unassigned count per approval

3.6 `staff/ai`

Shift suggestion: viewed, applied, rejected (counts)
Forecast call count + degraded rate
Edge anomaly rate, model version distribution

4. Tracing

OpenTelemetry instrumentation:

HTTP server spans (http.method, http.route, http.status_code, tenant.id, property.id, actor.id).
DB spans (db.system=postgresql, db.statement.template, parameter cardinality only — no values).
Pub/Sub spans (publish + subscribe; correlation via traceparent carried in event envelope metadata.traceparent).
Outbound HTTP to iam-service, property-service, ai-orchestrator-service.
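
To make the Pub/Sub correlation concrete, the sketch below validates a W3C `traceparent` value and places it in an event envelope under `metadata.traceparent`. Only that field path comes from this document; the `payload` key and both function names are assumptions for illustration.

```python
import re
from typing import Optional

# W3C Trace Context, version 00: version-traceid-parentid-flags, lowercase hex.
_TRACEPARENT = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def parse_traceparent(value: str) -> Optional[dict]:
    """Return the parsed fields, or None if the header is malformed."""
    match = _TRACEPARENT.fullmatch(value)
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero ids are invalid per the specification.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def build_envelope(payload: dict, traceparent: str) -> dict:
    """Wrap an event payload, carrying the trace context in metadata (hypothetical shape)."""
    return {"metadata": {"traceparent": traceparent}, "payload": payload}


header = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
event = build_envelope({"type": "punch"}, header)
context = parse_traceparent(event["metadata"]["traceparent"])
```

A subscriber that parses the same field can start its span as a child of the publisher's span, which is what links the publish and subscribe sides in one trace.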

Trace exemplars are linked from latency dashboards. Sampling: 100 % for 5xx and 4xx∈{401,403,409,422,423,429}; 5 % head-based for the rest; 1 % adaptive for high-volume clock.punch (always sampled when status non-2xx).
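
The sampling rules above can be read as a small decision table. The sketch below encodes them under two simplifications: the adaptive rate for clock.punch is treated as a fixed 1 %, and the response status is assumed to be known when the decision is made (a real head-based sampler decides before the response exists). Route names and function names are illustrative.

```python
import random
from typing import Callable

ALWAYS_SAMPLED_4XX = {401, 403, 409, 422, 423, 429}


def sample_rate(route: str, status: int, punch_rate: float = 0.01) -> float:
    """Probability of keeping a trace for this route and response status."""
    if status >= 500 or status in ALWAYS_SAMPLED_4XX:
        return 1.0
    if route == "clock.punch":
        # High-volume route: keep every non-2xx, a small share of successes.
        return punch_rate if 200 <= status <= 299 else 1.0
    return 0.05


def should_sample(route: str, status: int,
                  rng: Callable[[], float] = random.random) -> bool:
    """Draw once against the rate; rng is injectable for deterministic tests."""
    return rng() < sample_rate(route, status)
```

Keeping the rate calculation separate from the random draw makes the policy easy to check without relying on statistics.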

5. Logging

JSON-structured, single-line, fields:

Field	Source
`timestamp`	RFC 3339
`severity`	`INFO` / `WARN` / `ERROR` / `CRITICAL`
`service`	`'staff-service'`
`tenantId`	from `app.tenant_id`
`actorId`	JWT subject
`propertyId`	header
`route`	HTTP route template
`method`	HTTP
`status`	HTTP
`latencyMs`	request duration
`traceId`	W3C
`spanId`	W3C
`event`	one of `request.complete`, `outbox.publish.ok`, `outbox.publish.fail`, `inbox.process.ok`, `inbox.process.fail`, `pin.fail`, `pin.lock`, `clock.override`, `iam.revoke.fail`, `audit.write`, `ai.degraded`, `migration.run`
`metadata`	event-specific

PII-clean: emails, phone numbers, and emergency contacts are never logged. PIN values are never logged. The codebase has a CI lint rule (@ghasi/eslint-pii) that flags string templates containing forbidden field names.
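
A minimal sketch of a log line in the shape described above, with a runtime scrub of forbidden metadata keys. The scrub and the key list are hypothetical and only illustrate the PII-clean intent; the documented enforcement is the CI lint rule, not this function.

```python
import json
from datetime import datetime, timezone
from typing import Optional

# Hypothetical list of metadata keys that must never reach the log.
FORBIDDEN_KEYS = {"email", "phone", "emergencyContact", "pin"}


def log_line(severity: str, event: str, *, now: Optional[datetime] = None,
             metadata: Optional[dict] = None, **fields) -> str:
    """Serialize one structured record as a single JSON line."""
    now = now or datetime.now(timezone.utc)
    clean_metadata = {k: v for k, v in (metadata or {}).items()
                      if k not in FORBIDDEN_KEYS}
    record = {
        "timestamp": now.isoformat(),  # RFC 3339 for timezone-aware values
        "severity": severity,
        "service": "staff-service",
        "event": event,
        **fields,                      # tenantId, actorId, route, status, ...
        "metadata": clean_metadata,
    }
    # json.dumps escapes embedded newlines, so the output stays on one line.
    return json.dumps(record, separators=(",", ":"))


line = log_line(
    "WARN", "pin.fail",
    now=datetime(2024, 1, 1, tzinfo=timezone.utc),
    tenantId="t1", propertyId="p1",
    metadata={"deviceId": "d1", "pin": "1234"},
)
```

Here the `pin` entry is dropped before serialization, while `deviceId` is kept.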

6. Alerts

Pager-grade alerts route to PagerDuty service staff-service; warn-grade route to Slack #staff-svc-warnings.

Alert	Severity	Trigger	Runbook
Punch error rate > 0.5 % for 5 min	P1	SLO burn ≈ 10× (0.5 % errors against the 0.05 % budget)	`runbooks/staff/clock-error-spike.md`
Outbox depth > 1000 sustained 5 min	P1	Publishers stuck	`runbooks/staff/outbox-stalled.md`
Outbox publish lag p99 > 60 s for 10 min	P2	Pub/Sub or DB pressure	same as above
Inbox lag p99 > 5 min for 5 min	P1	Consumer backed up	`runbooks/staff/inbox-stalled.md`
Inbox DLQ depth > 0	P2	Schema mismatch or consumer bug	`runbooks/staff/inbox-dlq.md`
`staff.iam.revoke.failure_count` > 0 in 5 min	P1	Termination cascade incomplete	`runbooks/staff/iam-revoke-failed.md`
`multi_device_punch_detected` > 0 in 1 h	P2	Reconciliation needed	`runbooks/staff/multi-device-punch.md`
PIN failure rate > 1 / s sustained 2 min for any property	P2	Possible brute-force	`runbooks/staff/pin-bruteforce.md`
Capacity freshness p95 > 120 s for 10 min	P2	Cache invalidation stuck	`runbooks/staff/capacity-stale.md`
Edge anomaly model drift (FPR > 12 % weekly)	P3	Model needs retraining	`runbooks/staff/edge-anomaly-drift.md`
DSAR export overdue > 48 h	P2	Compliance breach risk	`runbooks/staff/dsar-overdue.md`
Migration failed in prod	P1	Flyway non-zero exit	`runbooks/staff/migration-failure.md`

Each alert page includes the dashboard panel link and the runbook link. Runbooks live in runbooks/staff/ in the docs repo and ship with the service.
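
Several triggers above use a "sustained" condition, for example outbox depth > 1000 for 5 min. One way to read that, sketched here, is that every sample in the trailing window must exceed the threshold. The function is illustrative; in practice the condition would be expressed in the Cloud Monitoring alert policy rather than in service code.

```python
from typing import Sequence


def sustained_breach(samples: Sequence[float], threshold: float,
                     sample_interval_s: int, sustain_s: int) -> bool:
    """True if every sample in the trailing sustain window exceeds the threshold."""
    needed = sustain_s // sample_interval_s
    if needed <= 0 or len(samples) < needed:
        return False  # not enough history to call it sustained
    return all(value > threshold for value in samples[-needed:])


# Outbox depth is sampled every 15 s, so 5 min corresponds to 20 samples.
stuck = sustained_breach([1200.0] * 20, threshold=1000,
                         sample_interval_s=15, sustain_s=300)
recovering = sustained_breach([1200.0] * 19 + [900.0], threshold=1000,
                              sample_interval_s=15, sustain_s=300)
```

A single sample at or below the threshold resets the condition, which avoids paging on a backlog that is already draining.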

7. Health Endpoints

Endpoint	Returns
`GET /health/startup`	200 once Flyway migrations complete and JWKS is loaded (deadline 30 s)
`GET /health/live`	200 if process can serve any request (no deps)
`GET /health/ready`	200 if Postgres reachable + Redis reachable + KMS reachable + Pub/Sub publisher healthy + outbox depth < 5000

/health/ready returns a JSON body with per-dep status for Cloud Run probes and on-call diagnosis.
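
A sketch of how the readiness result could be assembled from per-dependency checks plus the outbox depth limit. The body shape, the 503 status for "not ready", and the check names are assumptions; the source only states that the body carries per-dep status.

```python
import json
from typing import Callable, Dict, Tuple

OUTBOX_DEPTH_LIMIT = 5000  # from the /health/ready row above


def readiness(checks: Dict[str, Callable[[], bool]],
              outbox_depth: int) -> Tuple[int, str]:
    """Run each dependency check and return (http_status, json_body)."""
    deps: Dict[str, str] = {}
    for name, check in checks.items():
        try:
            deps[name] = "ok" if check() else "fail"
        except Exception:
            # A probe that raises counts as an unhealthy dependency.
            deps[name] = "fail"
    deps["outbox"] = "ok" if outbox_depth < OUTBOX_DEPTH_LIMIT else "fail"
    ready = all(state == "ok" for state in deps.values())
    body = json.dumps({"status": "ready" if ready else "not_ready", "deps": deps})
    return (200 if ready else 503), body


status, body = readiness(
    {"postgres": lambda: True, "redis": lambda: True,
     "kms": lambda: True, "pubsub": lambda: True},
    outbox_depth=42,
)
```

Reporting each dependency separately lets on-call see which one failed without opening logs.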

8. Edge Telemetry (Anomaly Model Drift)

The Electron edge anomaly model (AI_INTEGRATION §4) emits a small heartbeat to bff-backoffice-service containing:

Aggregate count of scores in three buckets (< 0.3, 0.3..0.7, ≥ 0.7)
Model version
Property + device

staff-service aggregates these heartbeats into staff.ai.edge.anomaly_rate. A weekly batch job recomputes the false-positive rate against ground truth (did a manager override actually happen for that punch?) and alerts if FPR > 12 %.
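
The weekly check can be sketched as follows. It assumes the batch job has per-punch scores joined with the override ground truth (the heartbeat alone only carries bucket counts), and that a punch counts as flagged when its score is ≥ 0.7, borrowing the top bucket boundary. Both assumptions and the function name are hypothetical.

```python
from typing import Iterable, Tuple

FPR_ALERT_THRESHOLD = 0.12  # alert if weekly FPR exceeds 12 %


def false_positive_rate(punches: Iterable[Tuple[float, bool]],
                        flag_threshold: float = 0.7) -> float:
    """FPR = flagged punches without an override / all punches without an override.

    Each item is (anomaly_score, manager_override_happened).
    """
    false_positives = 0
    true_negatives = 0
    for score, manager_override in punches:
        if manager_override:
            continue  # actual positives do not enter the FPR
        if score >= flag_threshold:
            false_positives += 1
        else:
            true_negatives += 1
    negatives = false_positives + true_negatives
    return false_positives / negatives if negatives else 0.0


week = [(0.9, False), (0.8, True), (0.1, False), (0.2, False), (0.5, False)]
fpr = false_positive_rate(week)
drifted = fpr > FPR_ALERT_THRESHOLD
```

In this sample one of four non-override punches was flagged, so the rate is 0.25 and the drift alert would fire.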

9. Audit Log Visibility

Audit rows are queryable via bff-backoffice-service admin endpoints (capability staff.audit.read).
BigQuery cold export: melmastoon.staff.audit_events table; partitioned daily, clustered by tenant_id.
Suspicious-action analytics dashboards are owned by security-service, not by us.

10. Cost Telemetry

Counter	Use
Pub/Sub publish bytes per topic	Pub/Sub cost attribution
BigQuery export rows per day	Cold storage cost
Cloud SQL row growth per table	DB sizing forecast
AI orchestrator calls per surface	AI cost attribution (cross-checked with `ai-orchestrator`)
Egress bytes per BFF	Network cost

A monthly tenant cost report is computed from BigQuery and surfaced via tenant-service (we do not bill).
