sender-id-registry-service — Observability

Version: 1.0
Status: Draft
Owner: Trust & Safety + Platform SRE
Last Updated: 2026-04-21
Companion: SERVICE_OVERVIEW · APPLICATION_LOGIC · FAILURE_MODES


1. Prometheus Metrics

All metrics are exposed at GET /metrics on port 3091 in the Prometheus text format.

1.1 Verify gRPC (hot path)

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| sid_verify_requests_total | Counter | status, caller_service | Verify requests by status (ACTIVE/SUSPENDED/REVOKED/UNKNOWN/TENANT_MISMATCH/PENDING/error) |
| sid_verify_duration_seconds | Histogram | status | End-to-end gRPC Verify latency |
| sid_verify_cache_hit_total | Counter | cache_layer (redis_l1, local_l2) | Cache hit counts |
| sid_verify_cache_miss_total | Counter | - | DB lookups (cache miss) |
| sid_verify_db_unavail_total | Counter | - | Verify responses that fall back to UNKNOWN because the DB is down |
| sid_get_reputation_requests_total | Counter | - | GetReputation calls |
| sid_get_reputation_duration_seconds | Histogram | - | GetReputation latency |
| sid_batch_verify_requests_total | Counter | batch_size_bucket | BatchVerify calls |

Histogram buckets for sid_verify_duration_seconds (upper bounds in seconds, spanning 0.5 ms to 500 ms to resolve sub-millisecond to low-millisecond targets): [0.0005, 0.001, 0.002, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5]
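
To illustrate how these bucket bounds shape the latency percentiles used later in this document, here is a minimal sketch of cumulative bucket counting and a simplified histogram_quantile-style linear interpolation. The function names are hypothetical, and the real PromQL function works on rates of bucket counters rather than on raw observations.

```typescript
// Bucket upper bounds in seconds, copied from the list above.
const VERIFY_BUCKETS: number[] = [0.0005, 0.001, 0.002, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5];

// Build cumulative counts the way a Prometheus histogram stores them:
// each bucket counts every observation <= its upper bound, and the
// final entry is the implicit +Inf bucket (all observations).
function cumulativeCounts(observationsSec: number[], bounds: number[]): number[] {
  const counts = bounds.map((le) => observationsSec.filter((v) => v <= le).length);
  counts.push(observationsSec.length);
  return counts;
}

// Simplified quantile estimate: locate the bucket containing the target
// rank, then interpolate linearly between its lower and upper bound.
function estimateQuantile(q: number, bounds: number[], counts: number[]): number {
  const total = counts[counts.length - 1];
  if (total === 0) return NaN;
  const rank = q * total;
  for (let i = 0; i < bounds.length; i++) {
    if (counts[i] >= rank) {
      const lower = i === 0 ? 0 : bounds[i - 1];
      const below = i === 0 ? 0 : counts[i - 1];
      const inBucket = counts[i] - below;
      if (inBucket === 0) return lower;
      return lower + (bounds[i] - lower) * ((rank - below) / inBucket);
    }
  }
  // Rank falls in the +Inf bucket: report the highest finite bound.
  return bounds[bounds.length - 1];
}
```

With 90 observations of 0.8 ms and 10 observations of 4 ms, the estimated P95 lands in the 0.002 to 0.005 bucket at roughly 3.5 ms. The estimate can never be finer than the bucket that contains it, which is why the bounds are dense around the latency targets.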

1.2 Registration & KYC

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| sid_submissions_total | Counter | type, category, restricted_pattern_matched | New submissions |
| sid_kyc_review_decisions_total | Counter | decision (APPROVE/REJECT/REQUEST_INFO), reviewer_role | Reviewer actions |
| sid_kyc_review_age_seconds | Histogram | decision | Time from submission to decision |
| sid_kyc_review_sla_breach_total | Counter | - | Submissions past 5-business-day SLA |
| sid_kyc_doc_view_total | Counter | viewer_role | KYC document inline views |
| sid_kyc_doc_view_per_reviewer | Gauge | reviewer_user_id | Daily KYC view count per reviewer (for outlier detection) |
| sid_kyc_doc_tamper_detected_total | Counter | - | Nightly hash-mismatch alerts |
| sid_kyc_upload_failures_total | Counter | reason (s3_unavailable/encryption_failed/hash_mismatch/too_large) | KYC upload failures |
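
sid_kyc_review_sla_breach_total counts submissions past the 5-business-day SLA. The sketch below shows one way such a check might count business days. It is not the service's implementation: the function names, the use of UTC day boundaries, the default Saturday/Sunday weekend, and the omission of public holidays are all assumptions.

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

// Count whole business days elapsed between two instants.
// weekendDays uses Date#getUTCDay numbering (0 = Sunday ... 6 = Saturday).
// Assumption: the weekend is configurable and holidays are ignored.
function businessDaysBetween(from: Date, to: Date, weekendDays: number[] = [0, 6]): number {
  let count = 0;
  // Step forward one full day at a time; each completed 24 h period
  // counts when it ends on a non-weekend day.
  for (let t = from.getTime() + DAY_MS; t <= to.getTime(); t += DAY_MS) {
    if (weekendDays.indexOf(new Date(t).getUTCDay()) === -1) count++;
  }
  return count;
}

// True once more than slaBusinessDays business days have elapsed.
function isPastKycSla(
  submittedAt: Date,
  now: Date,
  slaBusinessDays: number = 5,
  weekendDays: number[] = [0, 6]
): boolean {
  return businessDaysBetween(submittedAt, now, weekendDays) > slaBusinessDays;
}
```

For a submission on Monday 2026-04-13 at 09:00 UTC, the check stays false through Monday 2026-04-20 at 09:00 UTC (five business days) and turns true one day later.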

1.3 Verification

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| sid_verifications_started_total | Counter | method | Verification challenges initiated |
| sid_verifications_succeeded_total | Counter | method | Successful verifications |
| sid_verifications_failed_total | Counter | method, reason | Failed verifications |
| sid_otp_attempts_total | Counter | result (success/wrong/expired/max_attempts) | OTP submission attempts |
| sid_otp_issuance_rate_limited_total | Counter | - | OTP rate-limited (3/h cap) |
| sid_dns_check_duration_seconds | Histogram | resolver | DNS-TXT resolution latency |
| sid_dns_check_failures_total | Counter | reason (nxdomain/timeout/mismatch/sole_resolver) | DNS verification failures |
| sid_notarised_dual_control_blocked_total | Counter | - | Same-reviewer dual-control attempts blocked |
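
The 3/h cap behind sid_otp_issuance_rate_limited_total could be enforced with a sliding-window limiter. The following is a minimal in-memory sketch, not the service's implementation: the class name, the choice of key (for example a sender-ID or a contact number), and the in-process storage are assumptions.

```typescript
// Sliding-window limiter: at most `limit` issuances per `windowMs` per key.
class OtpIssuanceLimiter {
  private issuedAt: { [key: string]: number[] } = {};
  private limit: number;
  private windowMs: number;

  constructor(limit: number = 3, windowMs: number = 60 * 60 * 1000) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true and records the issuance when allowed, false when capped.
  tryIssue(key: string, nowMs: number): boolean {
    // Keep only timestamps that are still inside the window.
    const recent = (this.issuedAt[key] || []).filter((t) => nowMs - t < this.windowMs);
    if (recent.length >= this.limit) {
      this.issuedAt[key] = recent;
      // A caller would increment sid_otp_issuance_rate_limited_total here.
      return false;
    }
    recent.push(nowMs);
    this.issuedAt[key] = recent;
    return true;
  }
}
```

Passing the clock in as nowMs keeps the limiter deterministic under test. A multi-instance deployment would need shared storage for the timestamps, which this sketch does not cover.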

1.4 State machine

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| sid_state_transitions_total | Counter | from_state, to_state, trigger (manual/auto/system) | All state transitions |
| sid_active_count | Gauge | category | Currently ACTIVE sender-IDs by category |
| sid_suspended_count | Gauge | reason_class (manual/reputation/fraud) | Currently SUSPENDED sender-IDs |
| sid_revoked_count | Gauge | - | Currently REVOKED (12-mo reservation) |
| sid_pending_review_count | Gauge | - | SUBMITTED + KYC_REVIEW + INFO_REQUESTED |

1.5 Reputation

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| sid_reputation_score | Gauge | band (AUTO_SUSPEND/POOR/NEUTRAL/GOOD/EXCELLENT) | Distribution histogram |
| sid_reputation_band_distribution | Gauge | band | Count of sender-IDs per band |
| sid_reputation_band_transitions_total | Counter | from_band, to_band | Band crossings |
| sid_reputation_cron_duration_seconds | Histogram | - | Daily cron cycle time |
| sid_reputation_cron_failures_total | Counter | reason | Failed cron runs |
| sid_reputation_intraday_deltas_total | Counter | source (fraud/compliance/firewall/complaint) | Intra-day delta events processed |
| sid_auto_suspend_total | Counter | - | Sender-IDs auto-suspended due to reputation < 30 |

1.6 Public search

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| sid_public_search_requests_total | Counter | result (hit/miss/rate_limited/tarpit) | Public search calls |
| sid_public_search_duration_seconds | Histogram | cache_layer | Latency |
| sid_public_search_per_ip_distinct_queries | Gauge | - | Top-N IPs by distinct-query count |
| sid_public_search_abuse_blocks_total | Counter | - | IPs soft-blocked for abuse pattern |

1.7 Regulator export

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| sid_regulator_export_total | Counter | triggered_by, result | Export generations |
| sid_regulator_export_duration_seconds | Histogram | - | Generation time |
| sid_regulator_export_row_count | Histogram | - | Rows per export |
| sid_regulator_export_sftp_failures_total | Counter | - | SFTP transmission failures |
| sid_regulator_export_signing_duration_seconds | Histogram | - | HSM signing latency |
| sid_regulator_export_lag_seconds | Gauge | - | Time since last successful export |

1.8 Outbox / NATS

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| sid_outbox_publish_total | Counter | result (ok/retry/dead) | Outbox publishes |
| sid_outbox_lag_rows | Gauge | - | Unpublished rows count |
| sid_outbox_lag_seconds | Gauge | - | Oldest unpublished row age |
| sid_inbox_processed_total | Counter | subject | Consumed events |
| sid_inbox_dedup_hit_total | Counter | subject | Duplicate events ignored |
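
The two outbox lag gauges describe the same backlog from different angles: how many rows are waiting, and how old the oldest one is. A minimal sketch of deriving both from a set of outbox rows follows. The row shape and function name are hypothetical; in the service these values would presumably come from a query against the outbox table.

```typescript
// Hypothetical outbox row shape; publishedAtMs is null until published.
interface OutboxRowSketch {
  eventId: string;
  createdAtMs: number;
  publishedAtMs: number | null;
}

// Derive values matching sid_outbox_lag_rows and sid_outbox_lag_seconds.
function outboxLag(rows: OutboxRowSketch[], nowMs: number): { lagRows: number; lagSeconds: number } {
  const pending = rows.filter((r) => r.publishedAtMs === null);
  if (pending.length === 0) return { lagRows: 0, lagSeconds: 0 };
  // Age of the oldest unpublished row drives the seconds gauge.
  const oldest = pending.reduce((min, r) => Math.min(min, r.createdAtMs), pending[0].createdAtMs);
  return { lagRows: pending.length, lagSeconds: (nowMs - oldest) / 1000 };
}
```

A small row count with a large age suggests one stuck event, while a large row count with a small age suggests a burst that the publisher is still draining.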

1.9 AI assist

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| sid_ai_kyc_ocr_duration_seconds | Histogram | - | OCR latency |
| sid_ai_llm_kyc_validation_duration_seconds | Histogram | - | LLM analysis latency |
| sid_ai_impersonation_flag_total | Counter | confidence_bucket | AI impersonation detection |
| sid_ai_restricted_fuzzy_match_total | Counter | - | Embedding-based restricted matches |
| sid_ai_llm_invalid_output_total | Counter | - | LLM output failed schema validation |
| sid_ai_reviewer_override_total | Counter | ai_signal_type, agreed | Reviewer agreement with AI signal |

2. Structured Log Events

All log output is valid JSON (Pino format). The log level is controlled by LOG_LEVEL.

2.1 Verify

{
  "level": "debug",
  "time": "2026-04-21T08:14:02.123Z",
  "event": "sid.verify",
  "senderIdValue": "BANK-XYZ",
  "senderIdType": "ALPHA",
  "tenantId": "t_...",
  "callerService": "compliance-engine",
  "status": "ACTIVE",
  "currentLevel": "NOTARISED",
  "hasDomainDns": true,
  "reputationScore": 87,
  "cacheHit": true,
  "latencyMs": 1.2,
  "traceId": "abc",
  "spanId": "def"
}

2.2 KYC review decision

{
  "level": "info",
  "event": "sid.kyc.decision",
  "senderIdInternalId": "sid_...",
  "value": "BANK-XYZ",
  "decision": "APPROVE",
  "reviewerUserId": "u_...",
  "reviewerRole": "platform.sid.reviewer",
  "reviewAgeSeconds": 86400,
  "restrictedPatternMatched": true,
  "kycDocCount": 3,
  "aiSignals": {
    "impersonationRiskScore": 0.42,
    "forgeryIndicatorsCount": 0
  }
}

2.3 Suspension

{
  "level": "warn",
  "event": "sid.suspended",
  "senderIdInternalId": "sid_...",
  "value": "SOMEBRAND",
  "trigger": "AUTO_REPUTATION",
  "reputationAtSuspension": 22,
  "reason": "LOW_REPUTATION",
  "previousState": "ACTIVE"
}

2.4 KYC document view (audit-class)

{
  "level": "info",
  "event": "sid.kyc.view",
  "kycDocId": "kyc_...",
  "senderIdInternalId": "sid_...",
  "viewerUserId": "u_...",
  "viewerRole": "platform.sid.reviewer",
  "viewerIp": "10.4.2.18",
  "watermarkApplied": true,
  "watermarkArtefactKey": "s3://ghasi-sid-kyc-views-kbl/u_.../kyc_....pdf"
}

2.5 Public search abuse detection

{
  "level": "warn",
  "event": "sid.public_search.abuse",
  "ip": "203.0.113.42",
  "distinctQueries1h": 1247,
  "rateRpm": 158,
  "action": "TARPIT_AND_ALERT"
}

2.6 Regulator export

{
  "level": "info",
  "event": "sid.regulator.export",
  "exportId": "exp_...",
  "triggeredBy": "CRON",
  "windowFrom": "2026-04-20T00:00:00Z",
  "windowTo": "2026-04-21T00:00:00Z",
  "rowCount": 12482,
  "format": "JSONL",
  "signedAtMs": 412,
  "transmittedToAtraMs": 8421,
  "acknowledgmentRef": "ATRA-RX-2026-04-21-0001"
}

The Pino redactor masks registrantContactMsisdn and registrantContactEmail. The fields kyc_doc_content and otp_plaintext are forbidden from log output entirely.
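
The difference between masking a field and forbidding it can be sketched with a plain recursive function. This is an illustration only and does not use Pino's own redaction API: the function name, the "[REDACTED]" placeholder, and the choice to match keys at any nesting depth are assumptions.

```typescript
// Field names taken from the redaction rule above.
const MASKED_FIELDS: string[] = ["registrantContactMsisdn", "registrantContactEmail"];
const FORBIDDEN_FIELDS: string[] = ["kyc_doc_content", "otp_plaintext"];

type LogValue = string | number | boolean | null | LogValue[] | { [key: string]: LogValue };

// Return a copy of the record with masked fields replaced by a
// placeholder and forbidden fields dropped, at every nesting level.
function redactLogRecord(value: LogValue): LogValue {
  if (Array.isArray(value)) return value.map((item) => redactLogRecord(item));
  if (value !== null && typeof value === "object") {
    const out: { [key: string]: LogValue } = {};
    for (const key of Object.keys(value)) {
      if (FORBIDDEN_FIELDS.indexOf(key) !== -1) continue; // never emitted
      out[key] = MASKED_FIELDS.indexOf(key) !== -1 ? "[REDACTED]" : redactLogRecord(value[key]);
    }
    return out;
  }
  return value; // primitives pass through unchanged
}
```

A masked field stays visible as a key, so readers can tell the value existed, whereas a forbidden field leaves no trace in the emitted record.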


3. OpenTelemetry Tracing

Parent span: sid.Verify (or sid.SubmitRegistration, sid.KycDecision, sid.RegulatorExport).

| Span | Operation | Attributes |
| --- | --- | --- |
| sid.cache.check | Redis GET | cache.hit, cache.layer |
| sid.db.lookup | PG SELECT sender_ids | db.rows, state |
| sid.kyc.encrypt | Vault Transit encrypt | tenant_id, kek_ref |
| sid.kyc.upload | S3 PUT | bytes, mime |
| sid.dns.resolve | DoT TXT lookup | domain, resolver, consensus |
| sid.ai.ocr | OCR sidecar | model, confidence |
| sid.ai.llm | local-llm | model_version, cache_hit, latency_ms |
| sid.outbox.write | PG INSERT outbox | subject, event_id |
| sid.outbox.publish | NATS publish | subject, attempt, result |
| sid.regulator.sign | HSM sign | key_slot, latency_ms |
| sid.regulator.sftp | SFTP put | host, bytes, latency_ms |

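Trace context is propagated from incoming gRPC metadata. Note that grpc-trace-bin carries the binary (OpenCensus-style) encoding, whereas W3C Trace Context is carried in the traceparent and tracestate keys; these are two distinct formats.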
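
For reference, a W3C Trace Context traceparent value has the form version-traceid-parentid-flags in lowercase hex. The sketch below parses that text form. The function and interface names are hypothetical, it handles only the four-field layout defined for version 00, and it does not decode the binary grpc-trace-bin encoding.

```typescript
interface TraceParentSketch {
  version: string;
  traceId: string;  // 32 hex chars
  parentId: string; // 16 hex chars (the caller's span id)
  sampled: boolean; // lowest bit of the flags byte
}

// Parse a traceparent value; returns null when it is malformed.
function parseTraceparent(header: string): TraceParentSketch | null {
  const m = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header.trim());
  if (m === null) return null;
  const version = m[1];
  const traceId = m[2];
  const parentId = m[3];
  const flags = m[4];
  if (version === "ff") return null; // version ff is invalid
  // All-zero trace ids and parent ids are invalid.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  return { version, traceId, parentId, sampled: (parseInt(flags, 16) & 1) === 1 };
}
```

The parsed traceId and parentId correspond to the traceId and spanId fields seen in the structured log events, which is what lets a log line be joined to its trace.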


4. Alerting Rules

groups:
  - name: sender-id-registry
    rules:
      - alert: SidVerifyLatencyHigh
        expr: histogram_quantile(0.95, sum(rate(sid_verify_duration_seconds_bucket[5m])) by (le)) > 0.010
        for: 5m
        labels: { severity: high, team: platform-sre }
        annotations:
          summary: "sid.Verify P95 > 10ms (target ≤ 5ms)"
          runbook: "https://runbooks.ghasi.local/sid/verify-latency"

      - alert: SidVerifyLatencyCritical
        expr: histogram_quantile(0.99, sum(rate(sid_verify_duration_seconds_bucket[5m])) by (le)) > 0.050
        for: 2m
        labels: { severity: critical, team: platform-sre }

      - alert: SidVerifyDbUnavailable
        expr: rate(sid_verify_db_unavail_total[5m]) > 1
        for: 2m
        labels: { severity: critical, team: platform-sre }
        annotations:
          summary: "Verify falling back to UNKNOWN due to DB unavailability; compliance HOLD storm imminent"

      - alert: SidKycReviewSlaBreach
        # A raw counter stays above 0 forever after the first breach, so alert
        # on recent growth instead. The 1d window is an assumption.
        expr: increase(sid_kyc_review_sla_breach_total[1d]) > 0
        for: 0m
        labels: { severity: medium, team: trust-safety }
        annotations:
          summary: "{{ $value }} registrations passed the 5-business-day SLA in the last 24h"

      - alert: SidKycBulkAccessAlert
        expr: sid_kyc_doc_view_per_reviewer > 100
        for: 10m
        labels: { severity: high, team: security }
        annotations:
          summary: "Reviewer {{ $labels.reviewer_user_id }} viewed > 100 KYC docs / day; investigate"

      - alert: SidPublicSearchAbuse
        expr: rate(sid_public_search_requests_total{result="rate_limited"}[5m]) > 10
        for: 5m
        labels: { severity: medium, team: trust-safety }

      - alert: SidReputationCollapse
        # increase() gives the count over the window; rate() would be per second.
        expr: increase(sid_auto_suspend_total[1h]) > 50
        for: 10m
        labels: { severity: high, team: trust-safety }
        annotations:
          summary: "Auto-suspension storm: > 50 sender-IDs auto-suspended in 1h. Possible reputation-formula bug or coordinated attack."

      - alert: SidReputationCronStale
        # 90000 s = 25 h, matching the daily cron and the freshness SLO in
        # section 6. A 2 h threshold would fire between every daily run.
        expr: time() - sid_reputation_cron_last_success_timestamp > 90000
        for: 5m
        labels: { severity: high, team: trust-safety }
        annotations:
          summary: "Reputation cron has not completed in > 25 h"

      - alert: SidRegulatorExportLag
        expr: sid_regulator_export_lag_seconds > 172800
        for: 5m
        labels: { severity: high, team: regulator-liaison }
        annotations:
          summary: "No successful regulator export for > 48h"

      - alert: SidExportSignerDown
        expr: rate(sid_regulator_export_total{result="signing_failed"}[15m]) > 0
        for: 5m
        labels: { severity: critical, team: security }
        annotations:
          summary: "HSM signing failing for regulator export"

      - alert: SidOutboxLag
        expr: sid_outbox_lag_rows > 1000
        for: 5m
        labels: { severity: high, team: platform-sre }

      - alert: SidNotarisedDualControlAttempt
        expr: rate(sid_notarised_dual_control_blocked_total[1h]) > 0
        for: 0m
        labels: { severity: high, team: security }
        annotations:
          summary: "Same-reviewer attempted both notarised approvals; possible insider abuse"

      - alert: SidSplitBrainConflict
        expr: rate(sid_replication_conflict_total[15m]) > 0
        for: 0m
        labels: { severity: critical, team: platform-sre }
        annotations:
          summary: "Multi-region conflict detected on sender-IDs; manual reconciliation needed"
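
When a threshold is phrased as "N events in a window", it matters that PromQL's rate() returns a per-second average while increase() returns the total growth over the window. The sketch below shows the arithmetic on two counter samples. The names are hypothetical, and it deliberately ignores counter resets and the extrapolation that Prometheus applies.

```typescript
interface CounterSample {
  tSec: number;  // sample timestamp in seconds
  value: number; // cumulative counter value
}

// Total growth between the first and last sample (like increase()).
function simpleIncrease(samples: CounterSample[]): number {
  if (samples.length < 2) return 0;
  return samples[samples.length - 1].value - samples[0].value;
}

// Average per-second growth over the sampled span (like rate()).
function simpleRate(samples: CounterSample[]): number {
  if (samples.length < 2) return 0;
  const spanSec = samples[samples.length - 1].tSec - samples[0].tSec;
  return spanSec > 0 ? simpleIncrease(samples) / spanSec : 0;
}
```

For a counter that grows from 100 to 160 over one hour, the increase is 60 but the rate is only about 0.017 per second. A condition such as "more than 50 in 1h" is therefore satisfied by the increase and nowhere near satisfied by the rate.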

5. Grafana Dashboard Panels

Dashboard: dashboards/sender-id-registry.json.

| Panel | Query | Visualization |
| --- | --- | --- |
| Verify P50/P95/P99 latency | histogram_quantile(0.5/0.95/0.99, ...) | Multi-line time series |
| Verify rate by status | sum by (status) (rate(sid_verify_requests_total[5m])) | Stacked area |
| Verify cache hit rate | sum(rate(sid_verify_cache_hit_total[5m])) / (sum(rate(sid_verify_cache_hit_total[5m])) + sum(rate(sid_verify_cache_miss_total[5m]))) | Gauge |
| Submissions per day | increase(sid_submissions_total[1d]) | Bar chart |
| KYC review queue depth | sid_pending_review_count | Time series + threshold |
| KYC review SLA tracker | sid_kyc_review_age_seconds heatmap | Heatmap |
| KYC views per reviewer (daily) | topk(10, sid_kyc_doc_view_per_reviewer) | Bar chart |
| State distribution | sid_active_count, sid_suspended_count, sid_revoked_count | Pie chart |
| Reputation band distribution | sum by (band) (sid_reputation_band_distribution) | Pie chart |
| Reputation band transitions (24h) | increase(sid_reputation_band_transitions_total[24h]) | Heatmap |
| Auto-suspension rate | rate(sid_auto_suspend_total[1h]) | Time series |
| Verification success rate by method | sum by (method) (sid_verifications_succeeded_total) / sum by (method) (sid_verifications_started_total) | Gauge × 4 |
| OTP issuance rate-limited | rate(sid_otp_issuance_rate_limited_total[5m]) | Time series |
| Public search abuse blocks | rate(sid_public_search_abuse_blocks_total[1h]) | Time series |
| Regulator export status | sid_regulator_export_lag_seconds + last success ts | Stat + gauge |
| Outbox lag | sid_outbox_lag_rows, sid_outbox_lag_seconds | Time series |
| AI assist agreement rate | sum by (ai_signal_type) (sid_ai_reviewer_override_total{agreed="true"}) / sum by (ai_signal_type) (sid_ai_reviewer_override_total) | Gauge per signal type |

6. SLI / SLO

| SLI | SLO | Window |
| --- | --- | --- |
| Verify P95 latency | ≤ 5 ms | 7 days |
| Verify P99 latency | ≤ 15 ms | 7 days |
| Verify availability (non-error rate) | ≥ 99.99% | 30 days |
| GetReputation P95 | ≤ 5 ms | 7 days |
| KYC review SLA | 95% reviewed within 5 business days | 30 days |
| Reputation cron freshness | last successful run ≤ 25 h old | continuous |
| Regulator export freshness | ≤ 48 h since last successful export | continuous |
| Outbox lag | < 1000 unpublished rows | continuous |
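
An availability SLO defined as a non-error rate implies an error budget: the share of requests that may fail within the window. The sketch below computes how much of that budget has been consumed. The function name and the request-count inputs are hypothetical; in practice the counts would be derived from sid_verify_requests_total over the 30-day window.

```typescript
// Fraction of the error budget consumed: 0 means untouched,
// 1 means exactly exhausted, above 1 means the SLO is violated.
function errorBudgetConsumed(errorRequests: number, totalRequests: number, sloPercent: number): number {
  if (totalRequests <= 0) return 0;
  const observedErrorRate = errorRequests / totalRequests;
  const allowedErrorRate = 1 - sloPercent / 100; // 0.0001 for 99.99%
  return observedErrorRate / allowedErrorRate;
}
```

At 99.99%, one request in 10,000 may fail. For example, 50 errors in 1,000,000 Verify requests consume about half of the budget, and 100 errors exhaust it.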

These SLOs are bound to the platform NFR catalog (docs/15-nfr-sla-catalog.md).