compliance-engine — Observability
Status: populated | Last updated: 2026-04-18
1. Prometheus Metrics
All metrics are exposed at GET /metrics on port 3002 in Prometheus text format.
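To make the exposition format concrete, the sketch below renders one counter family in Prometheus text format using only the standard library. The `render_counter` helper and the sample values are hypothetical; a real service would normally use a Prometheus client library rather than formatting lines by hand.

```python
from typing import Dict, Tuple

# A label set is an ordered tuple of (name, value) pairs; () means "no labels".
LabelSet = Tuple[Tuple[str, str], ...]


def escape_label_value(value: str) -> str:
    """Escape backslash, double quote, and newline inside a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_counter(name: str, help_text: str, samples: Dict[LabelSet, float]) -> str:
    """Render one counter family as HELP/TYPE lines plus one line per series."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples.items():
        label_str = ",".join(f'{k}="{escape_label_value(v)}"' for k, v in labels)
        series = f"{name}{{{label_str}}}" if label_str else name
        lines.append(f"{series} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical sample: 42 ALLOW verdicts for an illustrative tenant "t1".
output = render_counter(
    "compliance_evaluations_total",
    "Total evaluations by verdict",
    {(("verdict", "ALLOW"), ("tenant_id", "t1")): 42},
)
```

Each series becomes a single `name{labels} value` line, which is what a scraper reads from GET /metrics.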
1.1 Evaluation Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| compliance_evaluations_total | Counter | verdict, tenant_id, rule_set_id | Total evaluations by verdict |
| compliance_evaluation_duration_seconds | Histogram | verdict | End-to-end gRPC evaluation latency |
| compliance_evaluation_budget_exceeded_total | Counter | tenant_id | Number of times the 70 ms budget was exceeded mid-evaluation |
| compliance_evaluation_errors_total | Counter | error_type | Evaluation errors (db_error, redis_error, ai_error) |
| compliance_unavailable_retry_total | Counter | — | Number of times the sms-orchestrator NATS consumer deferred a message because compliance was unavailable (fail-closed redelivery) |
Histogram buckets for compliance_evaluation_duration_seconds (upper bounds, in seconds):
[0.005, 0.01, 0.025, 0.05, 0.075, 0.1, 0.15, 0.25, 0.5, 1.0]
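As an illustration of how these bounds are used, the sketch below records a few observations into cumulative buckets, which is how Prometheus histograms count: every bucket whose upper bound is at or above the observed value is incremented, plus the implicit +Inf bucket. The latencies are made-up sample values.

```python
from bisect import bisect_left
from typing import List

BUCKETS = [0.005, 0.01, 0.025, 0.05, 0.075, 0.1, 0.15, 0.25, 0.5, 1.0]


def observe(cumulative_counts: List[int], seconds: float) -> None:
    """Increment every bucket whose upper bound is >= the observation."""
    # Index of the first bucket whose (inclusive) upper bound covers the value.
    first = bisect_left(BUCKETS, seconds)
    for i in range(first, len(cumulative_counts)):
        cumulative_counts[i] += 1


# One slot per finite bucket plus a final slot for the implicit +Inf bucket.
counts = [0] * (len(BUCKETS) + 1)
for latency in (0.012, 0.045, 0.074, 2.0):  # hypothetical latencies in seconds
    observe(counts, latency)
```

With these bounds the 70 ms budget falls inside the bucket between 0.05 and 0.075, so the histogram alone cannot distinguish a 69 ms evaluation from a 74 ms one; the budget counter in the table above covers that case.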
1.2 Rule Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| compliance_rule_matches_total | Counter | rule_type, action, rule_id | Rule match counts (identifies hot rules) |
| compliance_rule_cache_hits_total | Counter | — | Redis rule set cache hits |
| compliance_rule_cache_misses_total | Counter | — | Redis rule set cache misses |
| compliance_active_rules_total | Gauge | rule_type, action | Current count of active rules |
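The hit and miss counters are typically combined into a hit ratio. A minimal sketch, assuming the two inputs are the increases of compliance_rule_cache_hits_total and compliance_rule_cache_misses_total over the same time window (the function name is hypothetical):

```python
from typing import Optional


def cache_hit_ratio(hits: float, misses: float) -> Optional[float]:
    """Return hits / (hits + misses), or None when there were no lookups."""
    lookups = hits + misses
    if lookups == 0:
        return None  # no traffic in the window, so the ratio is undefined
    return hits / lookups
```

Returning None for an idle window avoids reporting a misleading 0% or 100% ratio when nothing was looked up.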
1.3 Hold Queue Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| compliance_hold_queue_pending_total | Gauge | — | Total messages currently in PENDING status |
| compliance_hold_queue_size_by_tenant | Gauge | tenant_id | Per-tenant pending hold count |
| compliance_hold_reviews_total | Counter | action (RELEASE / REJECT / AUTO_EXPIRED) | Hold review decisions |
| compliance_hold_age_seconds | Histogram | — | Age of messages when reviewed |
1.4 Tenant Score Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| compliance_tenant_score | Gauge | tenant_id, risk_tier | Current compliance score per tenant |
| compliance_tier_distribution | Gauge | tier | Count of tenants per risk tier |
| compliance_tier_transitions_total | Counter | from_tier, to_tier | Tier change events |
| compliance_scoring_cycle_duration_seconds | Histogram | — | Time to complete one full scoring cycle |
1.5 AI Classification Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| compliance_ai_requests_total | Counter | provider, status (success/error/cached) | AI API call volume |
| compliance_ai_duration_seconds | Histogram | provider | LLM API call latency |
| compliance_ai_cache_hits_total | Counter | — | AI result cache hits (body hash) |
| compliance_ai_categories_total | Counter | category | Category match counts for AI rules |
| compliance_ai_fallback_total | Counter | fallback_action | Times AI fallback was applied |
1.6 DLR Stats Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| compliance_dlr_events_consumed_total | Counter | — | DLR events processed |
| compliance_dlr_failure_rate | Gauge | tenant_id, account_id, window | Current DLR failure rate by window |
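One way a per-window failure rate such as compliance_dlr_failure_rate could be derived is a sliding window over DLR events. The sketch below is an assumption about the approach, not the engine's actual implementation; the class name, the in-memory deque, and the requirement that events arrive in timestamp order are all illustrative.

```python
from collections import deque
from typing import Deque, Tuple


class FailureRateWindow:
    """Sliding-window failure rate; assumes events arrive in timestamp order."""

    def __init__(self, window_seconds: float) -> None:
        self.window_seconds = window_seconds
        self._events: Deque[Tuple[float, bool]] = deque()
        self._failures = 0

    def record(self, timestamp: float, failed: bool) -> None:
        """Add one DLR event and drop events that fell out of the window."""
        self._events.append((timestamp, failed))
        if failed:
            self._failures += 1
        self._evict(timestamp)

    def _evict(self, now: float) -> None:
        cutoff = now - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            _, failed = self._events.popleft()
            if failed:
                self._failures -= 1

    def rate(self, now: float) -> float:
        """Fraction of events in the window that failed (0.0 when empty)."""
        self._evict(now)
        return self._failures / len(self._events) if self._events else 0.0


window = FailureRateWindow(window_seconds=60)
for ts, failed in [(0, True), (10, False), (20, False), (30, True)]:
    window.record(ts, failed)
```

A gauge per (tenant_id, account_id, window) label combination would then be set from `rate()` on each update or scrape.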
2. Structured Log Events
All log output is valid JSON (Pino format). The log level is controlled by the LOG_LEVEL environment variable.
2.1 Evaluation Events
{
  "level": "info",
  "time": "2026-04-18T14:22:00.123Z",
  "event": "compliance.evaluated",
  "messageId": "d4e55f11-...",
  "tenantId": "a1b22c33-...",
  "accountId": "b2c33d44-...",
  "verdict": "ALLOW",
  "evaluationId": "e5f66g22-...",
  "ruleSetId": "ruleset-uuid",
  "findingsCount": 0,
  "latencyMs": 12,
  "traceId": "abc123",
  "spanId": "def456"
}

{
  "level": "warn",
  "event": "compliance.evaluated",
  "verdict": "BLOCK",
  "findingsCount": 2,
  "ruleTypes": ["KEYWORD", "AI_CLASSIFICATION"],
  "to": "+937***",
  "latencyMs": 45
}
Note: the to field is masked as +CCNNN*** (country code, then the first 3 digits, then asterisks). The message body is never logged.
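The masking rule can be sketched as follows. Both function names are hypothetical, the country-code length is passed in explicitly because deriving it requires a numbering-plan lookup that is out of scope here, and mapping every non-ALLOW verdict to "warn" is an assumption based only on the two examples above.

```python
import json


def mask_msisdn(number: str, country_code_len: int) -> str:
    """Keep '+', the country code, and the next three digits; mask the rest."""
    digits = number.lstrip("+")
    return "+" + digits[: country_code_len + 3] + "***"


def evaluated_event(verdict: str, to: str, country_code_len: int, latency_ms: int) -> str:
    """Build a compliance.evaluated log line; the message body is never a parameter."""
    event = {
        # Assumption: ALLOW logs at info, any other verdict at warn.
        "level": "info" if verdict == "ALLOW" else "warn",
        "event": "compliance.evaluated",
        "verdict": verdict,
        "to": mask_msisdn(to, country_code_len),
        "latencyMs": latency_ms,
    }
    return json.dumps(event)


# Hypothetical number with a one-digit country code.
line = evaluated_event("BLOCK", "+14155550123", 1, 45)
```

Keeping the body out of the function signature altogether is a simple way to make accidental body logging harder.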
2.2 Hold Queue Events
{
  "level": "info",
  "event": "compliance.held",
  "holdId": "c3d44e00-...",
  "messageId": "d4e55f11-...",
  "tenantId": "a1b22c33-...",
  "triggerCount": 1,
  "reviewPriority": 87,
  "autoExpiresAt": "2026-04-19T14:22:00Z"
}

{
  "level": "info",
  "event": "compliance.hold.reviewed",
  "holdId": "c3d44e00-...",
  "action": "RELEASE",
  "reviewedBy": "admin-uuid",
  "ageSeconds": 2580
}
2.3 Tier Change Events
{
  "level": "warn",
  "event": "compliance.tier.changed",
  "tenantId": "a1b22c33-...",
  "previousTier": "MONITOR",
  "newTier": "RESTRICTED",
  "overallScore": 55.3,
  "calculatedAt": "2026-04-18T14:00:00Z"
}

{
  "level": "error",
  "event": "compliance.tier.changed",
  "tenantId": "a1b22c33-...",
  "previousTier": "RESTRICTED",
  "newTier": "SUSPENDED",
  "overallScore": 22.1
}
2.4 AI Classification Events
{
  "level": "info",
  "event": "compliance.ai.classified",
  "bodyHash": "sha256:abcdef...",
  "provider": "claude",
  "categories": {
    "SPAM": 0.12,
    "PHISHING": 0.87,
    "FINANCIAL_FRAUD": 0.05
  },
  "cacheHit": false,
  "latencyMs": 380
}
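The bodyHash field suggests that classification results are cached under a SHA-256 digest of the message body, so the body itself never appears in logs or cache keys. A minimal sketch of that idea, assuming UTF-8 encoding and an in-memory dict as the cache (the real store, key layout, and the `classify` callable are not specified in this document):

```python
import hashlib
from typing import Callable, Dict, Tuple

Categories = Dict[str, float]


def body_hash(body: str) -> str:
    """Return the digest in the 'sha256:<hex>' form used by the log event."""
    return "sha256:" + hashlib.sha256(body.encode("utf-8")).hexdigest()


def classify_with_cache(
    body: str,
    cache: Dict[str, Categories],
    classify: Callable[[str], Categories],
) -> Tuple[Categories, bool]:
    """Return (categories, cache_hit); call the classifier only on a miss."""
    key = body_hash(body)
    if key in cache:
        return cache[key], True
    result = classify(body)
    cache[key] = result
    return result, False
```

The boolean in the return value maps directly onto the cacheHit field, and a hit would also increment compliance_ai_cache_hits_total.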
2.5 Error Events
{
  "level": "error",
  "event": "compliance.evaluation.error",
  "errorType": "db_unavailable",
  "failOpenApplied": true,
  "messageId": "d4e55f11-...",
  "err": { "message": "Connection timeout", "code": "ECONNREFUSED" }
}

{
  "level": "warn",
  "event": "compliance.evaluation.budget_exceeded",
  "messageId": "d4e55f11-...",
  "budgetMs": 70,
  "elapsedMs": 74,
  "skippedRuleTypes": ["AI_CLASSIFICATION"]
}
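The budget_exceeded event implies a check between rule stages: once elapsed time passes the budget, remaining rule types are skipped and reported. The sketch below shows one way this could work; the stage list, the function name, and the injectable clock are assumptions made for illustration and testing.

```python
import time
from typing import Callable, Iterable, List, Tuple

BUDGET_MS = 70


def evaluate_with_budget(
    stages: Iterable[Tuple[str, Callable[[], None]]],
    budget_ms: float = BUDGET_MS,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[float, List[str]]:
    """Run stages in order; skip any stage that would start after the budget is spent.

    Returns (elapsed_ms, skipped_rule_types).
    """
    start = clock()
    skipped: List[str] = []
    for rule_type, run in stages:
        elapsed_ms = (clock() - start) * 1000
        if elapsed_ms > budget_ms:
            skipped.append(rule_type)  # budget already exceeded: do not start this stage
            continue
        run()
    return (clock() - start) * 1000, skipped
```

Note that this only prevents new stages from starting; a stage that is already running is not interrupted, which matches an event where elapsedMs (74) is slightly above budgetMs (70).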
3. OpenTelemetry Tracing
Parent span: compliance-engine.EvaluateCompliance
| Span | Operation | Attributes |
|---|---|---|
| compliance.cache.check | Redis GET eval:cache | cache.hit |
| compliance.ruleset.load | Redis GET or PG query | cache.hit, rule_count |
| compliance.rules.evaluate | In-process evaluation | verdict, rules_evaluated |
| compliance.ai.classify | LLM API call | provider, cache.hit, categories |
| compliance.hold.insert | PG INSERT hold_queue | hold_id, priority |
| compliance.audit.write | PG INSERT + NATS publish | evaluation_id |
Trace context is propagated from incoming gRPC metadata: W3C Trace Context uses the traceparent header, while grpc-trace-bin is the older binary (OpenCensus) format.
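For reference, the text form of W3C Trace Context is the `traceparent` header, `version-traceid-parentid-flags`, all lowercase hex. The sketch below parses that text form only; it does not decode the binary grpc-trace-bin encoding, and in practice an OpenTelemetry propagator would do this work. The type and function names are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Strict version-00 layout: 2 hex, 32 hex, 16 hex, 2 hex, separated by dashes.
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceContext(NamedTuple):
    trace_id: str
    parent_span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Parse a traceparent value; return None if it is malformed or invalid."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the specification.
    if version == "ff" or trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return TraceContext(trace_id, span_id, bool(int(flags, 16) & 0x01))
```

The parsed trace_id and parent_span_id correspond to the traceId and spanId lineage that appears in the structured log events.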
4. Alerting Rules
| Alert | Condition | Severity | Action |
|---|---|---|---|
| ComplianceHoldQueueHigh | compliance_hold_queue_pending_total > 500 for 5 min | HIGH | Page compliance team |
| ComplianceHoldQueueCritical | compliance_hold_queue_pending_total > 2000 for 2 min | CRITICAL | Page on-call |
| TenantSuspended | increase(compliance_tier_transitions_total{to_tier="SUSPENDED"}[5m]) > 0 | HIGH | Notify compliance team |
| ComplianceEvalP95High | histogram_quantile(0.95, sum by (le) (rate(compliance_evaluation_duration_seconds_bucket[5m]))) > 0.1 | MEDIUM | Investigate |
| ComplianceEvalErrorHigh | rate(compliance_evaluation_errors_total[5m]) > 0.001 | HIGH | Investigate |
| ComplianceUnavailableRetries | rate(compliance_unavailable_retry_total[5m]) > 1 | HIGH | Investigate compliance-engine health; messages are being held by fail-closed NATS redelivery |
| AIServiceUnavailable | rate(compliance_ai_fallback_total[5m]) > 0.1 | HIGH | Notify platform team |
| HoldQueueAutoExpiring | rate(compliance_hold_reviews_total{action="AUTO_EXPIRED"}[1h]) > 50 | MEDIUM | Notify compliance team |
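The "for 5 min" and "for 2 min" conditions mean the threshold must hold continuously before the alert fires; a single dip resets the timer. The following is a simplified model of that behavior, assuming evenly spaced evaluations, and is not how Prometheus itself is implemented. The function name and sample data are hypothetical.

```python
from typing import Iterable, Optional, Tuple


def firing_since(
    samples: Iterable[Tuple[float, float]],
    threshold: float,
    for_seconds: float,
) -> Optional[float]:
    """Return the timestamp at which the alert first fires, or None.

    samples are (timestamp_seconds, value) pairs in chronological order.
    """
    pending_since: Optional[float] = None
    for ts, value in samples:
        if value > threshold:
            if pending_since is None:
                pending_since = ts  # condition just became true: start pending
            if ts - pending_since >= for_seconds:
                return ts  # condition held for the whole "for" duration
        else:
            pending_since = None  # any dip below the threshold resets the timer
    return None


# Hypothetical hold queue depth sampled every 60 s; the dip at t=180 resets the timer.
depths = [400, 600, 700, 450, 600, 600, 600, 600, 600, 600]
series = [(i * 60.0, d) for i, d in enumerate(depths)]
fired_at = firing_since(series, threshold=500, for_seconds=300)
```

In this sample the queue first crosses 500 at t=60, but the alert only fires at t=540 because the condition was interrupted at t=180.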
5. Grafana Dashboard Panels
Dashboard: dashboards/compliance-engine.json
| Panel | Query | Visualization |
|---|---|---|
| Evaluation Rate | sum by (verdict) (rate(compliance_evaluations_total[5m])) | Stacked area |
| Evaluation P95 Latency | histogram_quantile(0.95, ...) | Time series |
| Verdict Distribution | sum by (verdict) (increase(compliance_evaluations_total[24h])) | Pie chart |
| Hold Queue Depth | compliance_hold_queue_pending_total | Gauge + time series |
| Hold Queue by Tenant | topk(10, compliance_hold_queue_size_by_tenant) | Bar chart |
| Tenant Risk Tier Distribution | sum by (tier) (compliance_tier_distribution) | Pie chart |
| Tier Transitions (24h) | increase(compliance_tier_transitions_total[24h]) | Heatmap |
| AI Classification Rate | rate(compliance_ai_requests_total[5m]) | Time series |
| AI Cache Hit Rate | sum(rate(compliance_ai_cache_hits_total[5m])) / sum(rate(compliance_ai_requests_total[5m])) | Gauge |
| Top Triggered Rules | topk(10, compliance_rule_matches_total) | Bar chart |
| DLR Failure Rate by Tenant | compliance_dlr_failure_rate{window="24h"} | Table |
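The P95 latency panel relies on histogram_quantile, which estimates a quantile by linear interpolation inside the bucket that contains the target rank. The sketch below reproduces that estimate for a single set of cumulative bucket counts; it is a simplified illustration with made-up counts, and it assumes at least one finite bucket followed by a +Inf bucket.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    buckets must be sorted by upper bound and end with the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for upper, cumulative in buckets:
        if cumulative >= rank:
            if math.isinf(upper):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0]
            # Interpolate linearly between the bucket's lower and upper bound.
            fraction = (rank - prev_count) / (cumulative - prev_count)
            return prev_bound + (upper - prev_bound) * fraction
        prev_bound, prev_count = upper, cumulative
    return math.nan


# Hypothetical cumulative counts for a subset of the latency buckets.
sample = [(0.05, 50), (0.075, 90), (0.1, 98), (math.inf, 100)]
p95 = histogram_quantile(0.95, sample)
```

Because the result is interpolated, its precision is limited by the bucket width around the quantile, which is worth keeping in mind when comparing the P95 panel against the 0.1 s alert threshold.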