Fraud Intelligence Service — Sync Contract
Version: 1.0 Status: Draft Owner: Trust and Safety Last Updated: 2026-04-21 Companion: API_CONTRACTS · EVENT_SCHEMAS · DATA_MODEL
This document defines what other services depend on from fraud-intel-service and what fraud-intel-service depends on from others. It also specifies per-aggregate conflict policy and the outbox/inbox patterns used for cross-service consistency.
1. Consumers of Fraud Intelligence
| Service | Interface | Dependency type | SLA expectation |
|---|---|---|---|
compliance-engine | gRPC FraudIntelService/Score | Synchronous read in async pipeline | P95 ≤ 50 ms; availability 99.5% |
compliance-engine | NATS consumer fraud.tenant_score.updated.v1, fraud.detected.otp_grinding.v1 | Async event-driven | Event lag ≤ 60 s |
routing-engine | gRPC FraudIntelService/Score | Synchronous read for high-risk sender check | P95 ≤ 50 ms; availability 99.5% |
sms-firewall-service | NATS consumer fraud.detected.simbox.v1, .ait_ring.v1, .greyroute.v1, .simbox_network.v1 | Async event-driven | Event lag ≤ 60 s |
sender-id-registry-service | NATS consumer fraud.case.action_dispatched.v1 (filter SUSPEND_SENDER_ID) | Async event-driven | Event lag ≤ 60 s |
noc-dashboard | HTTP REST /v1/fraud/dashboard/*, SSE on fraud.detected.> | Synchronous + push | P95 ≤ 500 ms; SSE latency ≤ 2 s |
regulator-portal-service | HTTP REST /v1/internal/fraud/feed/import (push); consumer of fraud.feed.exported.v1 for SFTP mirror | Mixed | Daily batch; alert on > 30 min lag |
analytics-service | NATS consumer fraud.audit.v1, fraud.detected.> | Long-term archival | Best-effort |
Async contract semantics
The fraud-intel-service is detection-plane, not enforcement-plane: every detection is published as a NATS event consumed by the appropriate enforcement service. Consumers are responsible for their own enforcement state and idempotency. fraud-intel-service does not directly mutate firewall.peer_quarantine, sender_id.lifecycle_state, or any other consumer's data.
The synchronous Score gRPC is fail-closed-with-default: callers treat UNAVAILABLE / DEADLINE_EXCEEDED / INTERNAL as tier = PROBATION (neutral) and proceed. This is the fraud-intel-service's deliberate choice not to cascade outages into a compliance freeze.
2. Dependencies of Fraud Intelligence
| Dependency | Interface | Failure mode if unavailable |
|---|---|---|
PostgreSQL fraud schema | Read/write SQL via PgBouncer | Service returns INTERNAL; readiness probe fails → HPA-replaced; events queue in outbox |
ClickHouse fraud_features cluster | TCP / native protocol | Stream ingestion buffers to on-disk WAL (1 h × 10K eps); pipelines skip the run with alert; Score gRPC uses Postgres L2 fallback |
| Redis | GET/SET/ZADD/ZCOUNT/BITOP | L1 cache miss → DB lookup; OTP-grinding streaming aggregator degrades to per-pod state with cross-replica drift |
| NATS JetStream | Consumer + publish | Outbox absorbs publish failures; consumer durable from last ACK on recovery |
| Triton Inference Server | gRPC ModelInfer | Pipeline falls back to rule-based pattern matchers (FraudPattern rows for that category); fraud.alert.model.unavailable.v1 emitted |
| MinIO (model artifacts, feed exports) | S3 | Model promotion blocked; feed export deferred; alert |
| Vault Transit / PKI | mTLS cert + signing key access | Cert rotation blocked (existing certs valid until expiry); HSM signing failure → MISP export deferred + CRITICAL alert |
| HSM (PKCS#11) | Library call | MISP export deferred; alert |
sender-id-registry-service (lookup) | gRPC LookupSenderId | OTP-destination-class set to GENERIC (downgrade); detection still proceeds |
consent-ledger-service | NATS consumer | OTP-harvest pipeline runs without revocation join; downgraded to flag-only output |
cdr-mediation-service | NATS consumer | Grey-route pipeline runs with reduced fidelity |
3. Proto definition
syntax = "proto3";
package ghasi.sms.fraud.v1;
option go_package = "github.com/ghasi/sms-gateway/fraud/v1";
import "google/protobuf/timestamp.proto";
import "google/protobuf/struct.proto";
service FraudIntelService {
// Hot path. Called by compliance-engine, routing-engine, sender-id-registry.
rpc Score(ScoreRequest) returns (ScoreResponse);
// Server-streaming bulk score. Max 1000 entries per request.
rpc BulkScore(BulkScoreRequest) returns (stream ScoreResponse);
// Read recent fraud signals for a subject. Used by NOC dashboards.
rpc GetSignals(GetSignalsRequest) returns (GetSignalsResponse);
}
enum ScoreScope {
SCORE_SCOPE_UNSPECIFIED = 0;
TENANT = 1;
SENDER_ID = 2;
MSISDN = 3;
PEER_ASN = 4;
}
enum FraudTier {
FRAUD_TIER_UNSPECIFIED = 0;
SAFE = 1;
WATCH = 2;
RISKY = 3;
HIGH_RISK = 4;
PROBATION = 5;
}
message ScoreRequest {
ScoreScope scope = 1;
string id = 2;
string trace_id = 3;
}
message ScoreResponse {
string subject_id = 1;
ScoreScope scope = 2;
float score = 3; // 0..1
FraudTier tier = 4;
repeated ContributingFactor contributing_factors = 5;
string model_id = 6;
string model_version = 7;
google.protobuf.Timestamp computed_at = 8;
int32 stale_seconds = 9;
string trace_id = 10;
}
message ContributingFactor {
string category = 1;
float weight = 2;
string detection_id = 3;
}
message BulkScoreRequest {
repeated ScoreRequest entries = 1;
string trace_id = 2;
}
message GetSignalsRequest {
ScoreScope scope = 1;
string id = 2;
google.protobuf.Timestamp since = 3;
int32 limit = 4;
}
message GetSignalsResponse {
repeated FraudSignal signals = 1;
string next_cursor = 2;
}
message FraudSignal {
string signal_id = 1;
google.protobuf.Timestamp event_ts = 2;
string source_stream = 3;
string category = 4;
google.protobuf.Struct evidence = 5;
}
4. Per-aggregate conflict policy
| Aggregate | Conflict policy | Rationale |
|---|---|---|
FraudSignal | append_only | Raw evidence is immutable. Replicas append independently; reads union. |
FraudDetection | append_only | Detections are immutable findings; revisions create new detections. |
FraudCase | server_authoritative (kbl-primary) | Workflow state with referential integrity to decisions; only kbl-primary mutates; mzr is read-only replica. |
FraudFeed.Indicators | idempotent_upsert on (feedId, sourceUuid) | Re-import is the norm; (source, source_uuid) is the natural key; latest write wins per indicator. |
MlModel, ModelVersion | server_authoritative (kbl-primary) | Promotion/rollback must be atomic; mzr replicas read-only and lag by < 30 s. |
EntityScore | server_authoritative (kbl-primary) | Computed by kbl scoring worker; mzr serves Score gRPC reads with eventual consistency. |
Allowlist | server_authoritative + two-person rule | Mutations blocked unless addedBy != approvedBy. |
AuditLog | append_only | Regulatory immutability. |
Outbox | server_authoritative (kbl-primary) | Each region has its own outbox; only kbl publishes events globally to avoid duplicates. |
5. Multi-region replication
Per ADR-0004 §3:
- kbl = primary write region for all aggregates above. ClickHouse cluster spans both regions for analytical reads but writes are kbl-primary.
- mzr = warm standby. Reads
entity_scores,detections,signalsvia replication; serves regionalScoregRPC traffic with read-from-replica policy. - Failover. On
kbloutage, mzr promotes within 5 minutes (manual operator confirmation per ADR-0004 §3.4 to avoid split-brain). Pipelines pause during failover; backfill runs on recovery. - Cross-region event flow.
FRAUD_EVENTS,FRAUD_TENANT_SCORE,FRAUD_MODELstreams are mirrored kbl → mzr via NATS Leaf Node bridging. Mirror lag P95 ≤ 5 s.
6. Outbox pattern
All state-changing transactions in fraud-intel-service write to fraud.outbox in the same Postgres transaction. A relay process polls unpublished rows every 100 ms (in-process scheduler) and publishes to NATS with:
Nats-Msg-Id: {eventId}for JetStream dedup window.- Exponential backoff: 100 ms, 500 ms, 2 s, 10 s, 60 s, then DLQ.
published_atset on first successful ack; row retained 7 days for audit.
This guarantees:
- No emitted event without a persisted state change (ACID-bound).
- No unpublished state change for more than a few seconds (relay polling).
- Idempotent consumers (NATS dedup window 2 min + downstream idempotency keys).
7. Inbox pattern
For consumed events (firewall.audit.v1, sms.dlr.inbound.v1, etc.) we use NATS JetStream durable consumers with AckExplicit. The consumer:
- Receives a message and computes
inboxKey = sha256(subject + msgId). - Attempts
INSERT INTO fraud.inbox (inbox_key, processed_at) VALUES (...). PK violation → already processed → ack and skip. - Processes the message in a Postgres transaction (signal projection + outbox emit if applicable).
- ACKs NATS on transaction commit.
This guarantees exactly-once-effective processing for consumed events, even on consumer restart or NATS redelivery.
CREATE TABLE fraud.inbox (
inbox_key TEXT PRIMARY KEY,
source_stream TEXT NOT NULL,
processed_at TIMESTAMPTZ NOT NULL DEFAULT now()
) PARTITION BY RANGE (processed_at);
-- Daily partitions; retention 7 days.
8. Schema stability guarantees
gRPC proto
| Field | Stability |
|---|---|
ScoreRequest.scope, id, trace_id | Stable |
ScoreResponse.score, tier, model_id, model_version | Stable |
ScoreResponse.contributing_factors | Stable shape; new categories may appear (additive) |
FraudTier enum values | Stable; new values additive; consumers must handle UNSPECIFIED |
BulkScore streaming order | Preserves request order |
REST API
- Routes under
/v1/fraud/*maintain backwards compatibility within the major version. - Breaking changes require
/v2/fraud/prefix and 90-day deprecation window.
9. Versioning policy
- gRPC package:
ghasi.sms.fraud.v1. Breaking change →v2package, 90-day deprecation, dual-publish during overlap. - REST API:
/v1/fraud/*→/v2/fraud/*withSunsetheader on/v1during deprecation. - Event subjects:
fraud.<topic>.v1→.v2per EVENT_SCHEMAS §7. - Model artifact format versioned in
ModelVersion.evaluation_metrics.artifact_format_version.