
compliance-engine — Observability

Status: populated | Last updated: 2026-04-18

1. Prometheus Metrics

All metrics are exposed at GET /metrics on port 3002 in Prometheus text format.
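As a rough illustration of what a scraper receives from /metrics, the sketch below renders one counter family in the Prometheus text exposition format. The helper names (renderCounter, escapeLabelValue) and the sample label values are hypothetical; the service presumably relies on a Prometheus client library rather than hand-rolled formatting.

```typescript
// Hypothetical helper, not taken from the service: renders one counter
// family in the Prometheus text exposition format.
type Labels = Record<string, string>;

// Label values must escape backslash, double quote, and newline.
// Backslashes are escaped first so later escapes are not doubled.
function escapeLabelValue(value: string): string {
  return value
    .replace(/\\/g, "\\\\")
    .replace(/"/g, '\\"')
    .replace(/\n/g, "\\n");
}

function renderCounter(
  name: string,
  help: string, // assumed to contain no backslashes or newlines
  samples: Array<{ labels: Labels; value: number }>,
): string {
  const lines = [`# HELP ${name} ${help}`, `# TYPE ${name} counter`];
  for (const { labels, value } of samples) {
    const pairs = Object.entries(labels)
      .map(([key, val]) => `${key}="${escapeLabelValue(val)}"`)
      .join(",");
    lines.push(pairs ? `${name}{${pairs}} ${value}` : `${name} ${value}`);
  }
  return lines.join("\n") + "\n";
}

// Illustrative sample only; the label values are made up.
const exampleOutput = renderCounter(
  "compliance_evaluations_total",
  "Total evaluations by verdict",
  [{ labels: { verdict: "ALLOW", tenant_id: "t1", rule_set_id: "rs1" }, value: 42 }],
);
```

Each sample becomes one line of the form name{label="value",...} value, preceded by the HELP and TYPE comment lines for its metric family.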

1.1 Evaluation Metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| compliance_evaluations_total | Counter | verdict, tenant_id, rule_set_id | Total evaluations by verdict |
| compliance_evaluation_duration_seconds | Histogram | verdict | End-to-end gRPC evaluation latency |
| compliance_evaluation_budget_exceeded_total | Counter | tenant_id | Times 70 ms budget was exceeded mid-evaluation |
| compliance_evaluation_errors_total | Counter | error_type | Evaluation errors (db_error, redis_error, ai_error) |
| compliance_unavailable_retry_total | Counter | - | Times sms-orchestrator NATS consumer deferred a message due to compliance unavailability (fail-closed redelivery) |

Histogram buckets for compliance_evaluation_duration_seconds: [0.005, 0.01, 0.025, 0.05, 0.075, 0.1, 0.15, 0.25, 0.5, 1.0]
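The bucket boundaries limit how precisely a latency quantile can be estimated. The sketch below is a simplified version of the linear interpolation that Prometheus's histogram_quantile applies to cumulative bucket counts; the function name estimateQuantile and the sample counts are invented for illustration.

```typescript
// Upper bounds in seconds, as listed above; the +Inf bucket is implicit.
const BUCKETS = [0.005, 0.01, 0.025, 0.05, 0.075, 0.1, 0.15, 0.25, 0.5, 1.0];

// cumulative[i] is the number of observations <= BUCKETS[i];
// total is the overall observation count (the +Inf bucket).
function estimateQuantile(q: number, cumulative: number[], total: number): number {
  if (total === 0) return NaN;
  const rank = q * total;
  for (let i = 0; i < BUCKETS.length; i++) {
    if (cumulative[i] >= rank) {
      const lower = i === 0 ? 0 : BUCKETS[i - 1];
      const below = i === 0 ? 0 : cumulative[i - 1];
      const inBucket = cumulative[i] - below;
      if (inBucket === 0) return lower;
      // Assume observations are spread evenly inside the bucket.
      return lower + (BUCKETS[i] - lower) * ((rank - below) / inBucket);
    }
  }
  // Rank falls in the +Inf bucket: report the highest finite bound.
  return BUCKETS[BUCKETS.length - 1];
}

// Invented counts for 100 observations.
const exampleCumulative = [10, 30, 60, 80, 90, 94, 97, 99, 100, 100];
const exampleP95 = estimateQuantile(0.95, exampleCumulative, 100);
```

With these made-up counts the 95th percentile lands inside the (0.1, 0.15] bucket, so the estimate is above the 0.1 s threshold even though no individual observation is known more precisely than its bucket.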

1.2 Rule Metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| compliance_rule_matches_total | Counter | rule_type, action, rule_id | Rule match counts (identifies hot rules) |
| compliance_rule_cache_hits_total | Counter | - | Redis rule set cache hits |
| compliance_rule_cache_misses_total | Counter | - | Redis rule set cache misses |
| compliance_active_rules_total | Gauge | rule_type, action | Current count of active rules |

1.3 Hold Queue Metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| compliance_hold_queue_pending_total | Gauge | - | Total messages currently in PENDING status |
| compliance_hold_queue_size_by_tenant | Gauge | tenant_id | Per-tenant pending hold count |
| compliance_hold_reviews_total | Counter | action (RELEASE / REJECT / AUTO_EXPIRED) | Hold review decisions |
| compliance_hold_age_seconds | Histogram | - | Age of messages when reviewed |

1.4 Tenant Score Metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| compliance_tenant_score | Gauge | tenant_id, risk_tier | Current compliance score per tenant |
| compliance_tier_distribution | Gauge | tier | Count of tenants per risk tier |
| compliance_tier_transitions_total | Counter | from_tier, to_tier | Tier change events |
| compliance_scoring_cycle_duration_seconds | Histogram | - | Time to complete one full scoring cycle |

1.5 AI Classification Metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| compliance_ai_requests_total | Counter | provider, status (success/error/cached) | AI API call volume |
| compliance_ai_duration_seconds | Histogram | provider | LLM API call latency |
| compliance_ai_cache_hits_total | Counter | - | AI result cache hits (body hash) |
| compliance_ai_categories_total | Counter | category | Category match counts for AI rules |
| compliance_ai_fallback_total | Counter | fallback_action | Times AI fallback was applied |

1.6 DLR Stats Metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| compliance_dlr_events_consumed_total | Counter | - | DLR events processed |
| compliance_dlr_failure_rate | Gauge | tenant_id, account_id, window | Current DLR failure rate by window |

2. Structured Log Events

All log output is valid JSON (Pino format). The log level is controlled by the LOG_LEVEL environment variable.

2.1 Evaluation Events

{
"level": "info",
"time": "2026-04-18T14:22:00.123Z",
"event": "compliance.evaluated",
"messageId": "d4e55f11-...",
"tenantId": "a1b22c33-...",
"accountId": "b2c33d44-...",
"verdict": "ALLOW",
"evaluationId": "e5f66g22-...",
"ruleSetId": "ruleset-uuid",
"findingsCount": 0,
"latencyMs": 12,
"traceId": "abc123",
"spanId": "def456"
}
{
"level": "warn",
"event": "compliance.evaluated",
"verdict": "BLOCK",
"findingsCount": 2,
"ruleTypes": ["KEYWORD", "AI_CLASSIFICATION"],
"to": "+937***",
"latencyMs": 45
}

Note: the to field is masked as +CCNNN*** (country code, then the first 3 digits of the national number, then asterisks). The message body is NEVER logged.
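A minimal sketch of the masking rule described in the note. The helper name maskRecipient is hypothetical, and the caller is assumed to supply the country code length, since E.164 country codes are 1 to 3 digits long and cannot be inferred from digit positions alone.

```typescript
// Hypothetical helper: keeps the country code plus the next 3 digits
// and replaces the remainder with a fixed "***" suffix.
// countryCodeLength is an assumption of this sketch; a real
// implementation would likely derive it from a phone-number library.
function maskRecipient(e164: string, countryCodeLength: number): string {
  const digits = e164.replace(/\D/g, ""); // drop "+", spaces, dashes
  const visible = digits.slice(0, countryCodeLength + 3);
  return `+${visible}***`;
}

// Illustrative numbers only.
const maskedUs = maskRecipient("+14155550123", 1); // "+1415***"
const maskedUk = maskRecipient("+447911123456", 2); // "+44791***"
```

The suffix has a fixed length so the masked value does not reveal how many digits were removed.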

2.2 Hold Queue Events

{
"level": "info",
"event": "compliance.held",
"holdId": "c3d44e00-...",
"messageId": "d4e55f11-...",
"tenantId": "a1b22c33-...",
"triggerCount": 1,
"reviewPriority": 87,
"autoExpiresAt": "2026-04-19T14:22:00Z"
}
{
"level": "info",
"event": "compliance.hold.reviewed",
"holdId": "c3d44e00-...",
"action": "RELEASE",
"reviewedBy": "admin-uuid",
"ageSeconds": 2580
}

2.3 Tier Change Events

{
"level": "warn",
"event": "compliance.tier.changed",
"tenantId": "a1b22c33-...",
"previousTier": "MONITOR",
"newTier": "RESTRICTED",
"overallScore": 55.3,
"calculatedAt": "2026-04-18T14:00:00Z"
}
{
"level": "error",
"event": "compliance.tier.changed",
"tenantId": "a1b22c33-...",
"previousTier": "RESTRICTED",
"newTier": "SUSPENDED",
"overallScore": 22.1
}

2.4 AI Classification Events

{
"level": "info",
"event": "compliance.ai.classified",
"bodyHash": "sha256:abcdef...",
"provider": "claude",
"categories": {
"SPAM": 0.12,
"PHISHING": 0.87,
"FINANCIAL_FRAUD": 0.05
},
"cacheHit": false,
"latencyMs": 380
}

2.5 Error Events

{
"level": "error",
"event": "compliance.evaluation.error",
"errorType": "db_unavailable",
"failOpenApplied": true,
"messageId": "d4e55f11-...",
"err": { "message": "Connection timeout", "code": "ECONNREFUSED" }
}
{
"level": "warn",
"event": "compliance.evaluation.budget_exceeded",
"messageId": "d4e55f11-...",
"budgetMs": 70,
"elapsedMs": 74,
"skippedRuleTypes": ["AI_CLASSIFICATION"]
}
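The budget_exceeded event above suggests that rule types still pending when the 70 ms budget runs out are skipped and reported. One way this could work is sketched below, assuming the budget is checked before each stage; the Stage interface, the function name, and the injected clock are hypothetical, and the real engine may instead use deadlines or cancellation.

```typescript
// Hypothetical stage: run() stands in for evaluating one rule type.
interface Stage {
  ruleType: string;
  run: () => void;
}

interface BudgetResult {
  elapsedMs: number;
  skippedRuleTypes: string[];
  budgetExceeded: boolean;
}

// Runs stages in order and skips any stage that would start after the
// budget is already spent. The clock is injected so the behaviour can
// be exercised without real delays.
function evaluateWithinBudget(
  stages: Stage[],
  budgetMs: number,
  now: () => number,
): BudgetResult {
  const start = now();
  const skippedRuleTypes: string[] = [];
  for (const stage of stages) {
    if (now() - start >= budgetMs) {
      skippedRuleTypes.push(stage.ruleType);
      continue;
    }
    stage.run();
  }
  const elapsedMs = now() - start;
  return { elapsedMs, skippedRuleTypes, budgetExceeded: elapsedMs > budgetMs };
}

// Fake clock reproducing the sample event: keyword rules take 74 ms,
// so the AI stage is never started.
let fakeClockMs = 0;
const budgetExample = evaluateWithinBudget(
  [
    { ruleType: "KEYWORD", run: () => { fakeClockMs += 74; } },
    { ruleType: "AI_CLASSIFICATION", run: () => { fakeClockMs += 380; } },
  ],
  70,
  () => fakeClockMs,
);
```

In this sketch a stage that has already started is allowed to finish, which is why elapsedMs can overshoot budgetMs, as it does in the sample event (74 against 70).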

3. OpenTelemetry Tracing

Parent span: compliance-engine.EvaluateCompliance

| Span | Operation | Attributes |
|---|---|---|
| compliance.cache.check | Redis GET eval:cache | cache.hit |
| compliance.ruleset.load | Redis GET or PG query | cache.hit, rule_count |
| compliance.rules.evaluate | In-process evaluation | verdict, rules_evaluated |
| compliance.ai.classify | LLM API call | provider, cache.hit, categories |
| compliance.hold.insert | PG INSERT hold_queue | hold_id, priority |
| compliance.audit.write | PG INSERT + NATS publish | evaluation_id |

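Trace context is propagated from the grpc-trace-bin gRPC metadata header. Note: grpc-trace-bin is the binary OpenCensus encoding; W3C Trace Context proper is carried in the text traceparent header.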
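W3C Trace Context carries the trace and span IDs in a text header named traceparent, formatted as version-traceid-parentid-flags. The sketch below parses the version 00 layout only and does not cover the binary grpc-trace-bin encoding; parseTraceparent and the TraceContext interface are hypothetical names, and in practice an OpenTelemetry propagator would do this work.

```typescript
interface TraceContext {
  traceId: string; // 32 lowercase hex characters
  spanId: string; // 16 lowercase hex characters (the parent span)
  sampled: boolean; // lowest bit of the flags byte
}

// Strict four-field layout, as defined for version 00.
const TRACEPARENT_PATTERN =
  /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Hypothetical helper: returns null for anything that is not a valid
// traceparent value.
function parseTraceparent(header: string): TraceContext | null {
  const match = TRACEPARENT_PATTERN.exec(header.trim());
  if (!match) return null;
  const [, version, traceId, spanId, flags] = match;
  if (version === "ff") return null; // version ff is forbidden
  // All-zero IDs are invalid per the specification.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

// Example value taken from the W3C specification.
const parsedExample = parseTraceparent(
  "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
);
```

The resulting traceId and spanId correspond to the traceId and spanId fields shown in the evaluation log events.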


4. Alerting Rules

| Alert | Condition | Severity | Action |
|---|---|---|---|
| ComplianceHoldQueueHigh | compliance_hold_queue_pending_total > 500 for 5 min | HIGH | Page compliance team |
| ComplianceHoldQueueCritical | compliance_hold_queue_pending_total > 2000 for 2 min | CRITICAL | Page on-call |
| TenantSuspended | increase(compliance_tier_transitions_total{to_tier="SUSPENDED"}[5m]) > 0 | HIGH | Notify compliance team |
| ComplianceEvalP95High | histogram_quantile(0.95, sum by (le) (rate(compliance_evaluation_duration_seconds_bucket[5m]))) > 0.1 | MEDIUM | Investigate |
| ComplianceEvalErrorHigh | rate(compliance_evaluation_errors_total[5m]) > 0.001 | HIGH | Investigate |
| ComplianceUnavailableRetries | rate(compliance_unavailable_retry_total[5m]) > 1 | HIGH | Investigate compliance-engine health; messages are being held by fail-closed NATS redelivery |
| AIServiceUnavailable | rate(compliance_ai_fallback_total[5m]) > 0.1 | HIGH | Notify platform team |
| HoldQueueAutoExpiring | increase(compliance_hold_reviews_total{action="AUTO_EXPIRED"}[1h]) > 50 | MEDIUM | Notify compliance team |
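Conditions such as "> 500 for 5 min" fire only when the threshold stays breached for the whole duration, not on a single spike. The sketch below approximates that behaviour over a series of evaluation samples; the function name firesWithFor and the sample values are invented, and Prometheus's own pending and firing state machine has more detail than this.

```typescript
interface Sample {
  timestampMs: number;
  value: number;
}

// Returns true when the most recent unbroken run of samples above the
// threshold spans at least forMs. Any sample at or below the threshold
// resets the run, mirroring how a pending alert is cleared.
function firesWithFor(samples: Sample[], threshold: number, forMs: number): boolean {
  if (samples.length === 0) return false;
  let breachStart: number | null = null;
  for (const sample of samples) {
    if (sample.value > threshold) {
      if (breachStart === null) breachStart = sample.timestampMs;
    } else {
      breachStart = null;
    }
  }
  if (breachStart === null) return false;
  return samples[samples.length - 1].timestampMs - breachStart >= forMs;
}

const MINUTE_MS = 60_000;

// Queue depth sampled once a minute, above 500 throughout.
const sustained: Sample[] = [0, 1, 2, 3, 4, 5].map((m) => ({
  timestampMs: m * MINUTE_MS,
  value: 600,
}));

// Same series with a dip below the threshold at minute 3.
const interrupted: Sample[] = sustained.map((s, i) =>
  i === 3 ? { ...s, value: 400 } : s,
);

const sustainedFires = firesWithFor(sustained, 500, 5 * MINUTE_MS);
const interruptedFires = firesWithFor(interrupted, 500, 5 * MINUTE_MS);
```

The sustained series fires, while the interrupted one does not because the dip restarts the five-minute window.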

5. Grafana Dashboard Panels

Dashboard: dashboards/compliance-engine.json

| Panel | Query | Visualization |
|---|---|---|
| Evaluation Rate | sum by (verdict) (rate(compliance_evaluations_total[5m])) | Stacked area |
| Evaluation P95 Latency | histogram_quantile(0.95, ...) | Time series |
| Verdict Distribution | sum by (verdict) (increase(compliance_evaluations_total[24h])) | Pie chart |
| Hold Queue Depth | compliance_hold_queue_pending_total | Gauge + time series |
| Hold Queue by Tenant | topk(10, compliance_hold_queue_size_by_tenant) | Bar chart |
| Tenant Risk Tier Distribution | sum by (tier) (compliance_tier_distribution) | Pie chart |
| Tier Transitions (24h) | increase(compliance_tier_transitions_total[24h]) | Heatmap |
| AI Classification Rate | rate(compliance_ai_requests_total[5m]) | Time series |
| AI Cache Hit Rate | sum(rate(compliance_ai_cache_hits_total[5m])) / sum(rate(compliance_ai_requests_total[5m])) | Gauge |
| Top Triggered Rules | topk(10, compliance_rule_matches_total) | Bar chart |
| DLR Failure Rate by Tenant | compliance_dlr_failure_rate{window="24h"} | Table |