Real-Time Fraud Decisioning: How Payment AI Makes Sub-100ms Calls
You have roughly 100 milliseconds to score a transaction and authorize or block it. Here is how production fraud decisioning systems are actually built — feature stores, model cascades, latency budgets, and the accuracy tradeoff that nobody talks about.
100ms total window. ML scoring: 10–50ms. Feature store must be sub-millisecond (Redis, Aerospike). Cascade: fast model filters 90%+ volume, accurate model scores the rest. Labeling lag: chargebacks arrive 60–120 days after the transaction.
Every payment authorization runs on a clock that most operators never think about. From the moment a customer taps, swipes, or clicks pay, the card network gives roughly 100 milliseconds for the entire authorization chain to complete — network routing, fraud scoring, acquirer processing, issuer response, and return. Your fraud model gets a slice of that 100ms. In production systems, that slice is 10 to 50 milliseconds.
This is a different problem from building an accurate fraud model. Accuracy is a data science problem. Latency is an infrastructure and architecture problem. Production fraud decisioning requires both to be solved simultaneously, and the constraints are harder than most ML engineering contexts because the timeout is not a performance SLA — it is the card network authorization window. Miss it and the transaction either fails or routes around your fraud check.
This article covers the architecture of real-time fraud decisioning: the feature store, the cascade model design, the latency budget, and the labeling lag problem that makes evaluating production fraud ML structurally difficult. For the broader AI fraud detection landscape and vendor overview, see the companion article covering the full stack. For what happens downstream when fraud gets through, see AI-powered chargeback representment.
The 100ms Budget
A useful way to think about the authorization window is as a fixed budget that must be allocated across several steps with hard ordering constraints.
The breakdown in a typical card-not-present authorization:
- Network transmission (customer → PSP): 10–20ms depending on geography
- Feature retrieval from the feature store: must be sub-millisecond at scale
- Fraud model inference: 10–50ms for gradient boosting; more for neural networks
- Routing and acquirer submission: 10–20ms
- Issuer authorization: 30–50ms (outside your control)
- Response transmission back to the merchant: 5–10ms
The issuer authorization step — which you don’t control — consumes 30–50ms of the budget. That leaves roughly 50ms for everything you do control: feature retrieval, fraud scoring, and routing. At scale, sub-millisecond feature retrieval is not a luxury; it is the prerequisite that makes everything else possible.
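The budget arithmetic above can be sketched as a quick sanity check. The component labels and the split into controllable vs. uncontrollable steps are illustrative, following the breakdown above:

```python
# Illustrative latency budget for a card-not-present authorization.
# Millisecond ranges follow the breakdown above; the labels are ours.
BUDGET_MS = 100

COMPONENTS = {
    "network_to_psp":    (10, 20),
    "feature_retrieval": (0.5, 1),   # must stay sub-millisecond at scale
    "fraud_inference":   (10, 50),
    "routing_acquirer":  (10, 20),
    "issuer_auth":       (30, 50),   # outside your control
    "response_return":   (5, 10),
}

def worst_case_total(parts=COMPONENTS):
    """Sum every component at its upper bound."""
    return sum(hi for _, hi in parts.values())

def slack_after_issuer(parts=COMPONENTS, budget=BUDGET_MS):
    """The rule of thumb from the text: assume issuer auth lands at
    its upper bound and see what remains for everything you control."""
    return budget - parts["issuer_auth"][1]

print(worst_case_total())    # 151 -> worst case breaches the window
print(slack_after_issuer())  # 50 -> ms left for retrieval, scoring, routing
```

The worst-case sum exceeding 100ms is the point: you cannot budget each step at its upper bound, which is why feature retrieval and scoring are engineered toward their lower bounds.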
The constraint is different for different payment types. Real-time bank-to-bank payments (Pix, UPI, SEPA Instant) may have tighter windows depending on scheme rules. Card-present transactions at POS terminals have slightly more flexibility in some implementations. But for card-not-present e-commerce — the highest-fraud-risk payment context — the 100ms window is the relevant constraint for most operators.
The Feature Store: Why You Can’t Compute Features at Inference Time
A fraud model that scores a transaction on card number, amount, and merchant alone is a weak fraud model. Effective fraud scoring requires behavioral and historical context: how many transactions has this card made in the last 5 minutes? What is the historical decline rate for this BIN on this acquirer? Has this device fingerprint appeared in previous fraud cases? What is the average transaction amount for this customer, and how much does today’s transaction deviate from it?
Computing these features from raw data at inference time would take seconds, not milliseconds — you would need to query transaction history databases, aggregate across time windows, join device fingerprint data, and calculate rolling statistics, all within the scoring window. It is not feasible.
A feature store solves this by pre-computing features continuously and caching them for sub-millisecond retrieval. The architecture has two components: an offline pipeline that computes batch features (30-day transaction history, customer lifetime statistics, BIN-level performance metrics) on a scheduled basis, and an online pipeline that maintains real-time streaming features (last-5-minute velocity, current session behavior) with near-instant updates. At inference time, the model retrieves the pre-computed feature vector for the transaction in under a millisecond and passes it to the model.
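A minimal sketch of the online (streaming) side — a last-5-minutes velocity counter. Production systems keep this state in Redis or Aerospike rather than process memory, and the class and method names here are illustrative:

```python
from collections import deque
import time

class VelocityCounter:
    """Sliding-window transaction count per card: a toy version of a
    streaming feature pipeline. State lives in-process here; in
    production it lives in the shared feature store."""

    def __init__(self, window_seconds=300):  # 5-minute window
        self.window = window_seconds
        self.events = {}  # card_id -> deque of event timestamps

    def record(self, card_id, ts=None):
        """Append a transaction timestamp for this card."""
        ts = time.time() if ts is None else ts
        self.events.setdefault(card_id, deque()).append(ts)

    def count(self, card_id, now=None):
        """Count transactions inside the window, evicting stale ones."""
        now = time.time() if now is None else now
        q = self.events.get(card_id)
        if not q:
            return 0
        while q and now - q[0] > self.window:
            q.popleft()
        return len(q)
```

The eviction-on-read pattern keeps writes O(1); the equivalent Redis idiom is a sorted set per card with `ZREMRANGEBYSCORE` plus `ZCARD`.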
Redis and Aerospike are the standard infrastructure choices for feature stores in payment fraud systems. Both are in-memory data stores optimized for sub-millisecond reads at high throughput. Redis is more commonly used at mid-scale; Aerospike is preferred at very high transaction volumes where memory-to-disk hybrid storage is needed to manage feature storage costs. The feature store is an infrastructure investment — you cannot build a real-time fraud scoring system without one.
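The read path can be sketched as a single hash lookup per transaction. The key scheme and field names below are illustrative, and the client is duck-typed so the same function runs against a real `redis.Redis` (created with `decode_responses=True`) or an in-memory stub:

```python
def feature_key(card_token: str) -> str:
    # Illustrative key scheme: one hash per tokenized card.
    return f"feat:card:{card_token}"

FEATURE_FIELDS = ("txn_count_5m", "avg_amount_30d", "bin_decline_rate")

def fetch_features(client, card_token, default=0.0):
    """One round-trip read of the pre-computed feature vector.
    `client` is anything exposing a Redis-style hgetall; fields missing
    from the hash fall back to a default so brand-new cards still score."""
    raw = client.hgetall(feature_key(card_token))
    return [float(raw.get(field, default)) for field in FEATURE_FIELDS]

class InMemoryStore:
    """Dict-backed stand-in for Redis, for local testing only."""
    def __init__(self, data):
        self.data = data
    def hgetall(self, key):
        return self.data.get(key, {})
```

In production `fetch_features` runs unchanged against the real client; the stub exists only so the read path can be exercised without a server.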
The Cascade: Solving the Accuracy-Latency Tradeoff
The naive approach to fraud scoring is to run every transaction through the most accurate model available. The problem is that accuracy and latency trade off. More accurate models are larger, require more features, model more complex interactions, and take longer to execute. A deep neural network that achieves state-of-the-art fraud detection accuracy may take 40–80ms on CPU — consuming most of your budget before you have scored the transaction.
The production answer is a cascade architecture, sometimes called a waterfall. Transactions flow through models in increasing order of complexity:
Stage 1: Fast filter. A lightweight model — logistic regression, or gradient boosting with a shallow depth limit — scores every transaction in 2–5ms. Its job is to clear the 85–95% of volume that is clearly not fraud and route it to approval without further scoring, which means it must be tuned for very high recall on fraud: any fraud it misses here is approved without a second look. The residual population flagged as potentially suspicious escalates to Stage 2.
Stage 2: Accurate scorer. A deeper gradient boosting ensemble or neural network scores the flagged population in 20–40ms. This model uses the full feature set and is tuned for precision — it needs to distinguish genuine fraud from the false positives the Stage 1 filter produced. The output is a fraud score that determines whether the transaction is approved, flagged for manual review, or declined.
Stage 3 (optional): Rules engine. Hard business rules — block known fraudulent device fingerprints, enforce velocity limits, apply geographic restrictions — can run before Stage 1 (blocking obvious fraud in microseconds) or after Stage 2 (enforcing constraints that the ML model should not override). See rule engines vs ML hybrid architecture for the design pattern. The pre-filter layer is also the primary defense against card testing and enumeration attacks, which generate high authorization-attempt velocity that ML models alone cannot block fast enough.
The cascade reduces average scoring latency because most volume never reaches the expensive Stage 2 model. At a 90% pass-through rate on Stage 1, 90% of transactions complete scoring in under 10ms. The overall average latency across all transactions stays well within budget even though the Stage 2 model alone would exceed it.
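The two-stage flow and the averaging arithmetic above can be sketched in a few lines. The escalation threshold is illustrative, and the models are plain callables returning a fraud probability:

```python
def cascade_score(features, fast_model, accurate_model,
                  escalate_threshold=0.10):
    """Two-stage cascade: the fast model clears obvious-legit volume,
    and only scores at or above the escalation threshold pay for the
    expensive model."""
    score = fast_model(features)
    if score < escalate_threshold:
        return score, "stage1"        # ~90% of volume stops here
    return accurate_model(features), "stage2"

# Average scoring latency at a 90% Stage 1 pass-through rate,
# with ~5ms for Stage 1 and ~30ms more for Stage 2:
avg_latency_ms = 0.9 * 5 + 0.1 * (5 + 30)
print(avg_latency_ms)  # 8.0 -> well inside the scoring budget
```

The average stays near the Stage 1 cost even though a Stage 2-only design would sit near the top of the budget; the tail (escalated transactions) is what you watch at p99.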
Model Architecture Choices
Gradient boosting (XGBoost, LightGBM) is the industry standard for fraud scoring. It handles the class imbalance inherent in fraud datasets (where fraud cases are 0.1–1% of transactions) well with class weighting or cost-sensitive learning. It is interpretable relative to neural networks — feature importance scores tell you which signals drive each prediction, which matters for regulatory explainability requirements. And it runs fast on CPU: a well-tuned gradient boosting model with 100–300 trees scores a transaction in 2–10ms without GPU infrastructure.
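The class weighting mentioned above reduces to a single ratio. `scale_pos_weight` is the actual XGBoost parameter name; the counts below are illustrative:

```python
def imbalance_weight(n_negative, n_positive):
    """Standard class weight for imbalanced fraud data: up-weight the
    rare positive (fraud) class by the negative:positive ratio.
    XGBoost accepts this directly as its scale_pos_weight parameter."""
    return n_negative / n_positive

# A 0.2% fraud rate across one million transactions:
w = imbalance_weight(998_000, 2_000)
print(w)  # 499.0
# e.g. xgb.XGBClassifier(scale_pos_weight=w, ...)  # illustrative usage
```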
Neural networks and transformers achieve higher accuracy by modeling complex interactions between features that gradient boosting misses. Stripe’s transition to TabTransformer+ (and now the Payments Foundation Model) is the most public example: the transformer architecture captured feature interactions that the prior gradient boosting model missed, recovering $6 billion in falsely declined transactions in 2024 through better retry scoring. The cost is infrastructure: transformer inference at production scale requires GPU hardware, NVIDIA partnerships, and dedicated serving infrastructure that is not accessible to operators running in-house models.
Gradient boosting with neural embeddings is a practical middle path: embed categorical features (BIN, merchant category, device type) with learned representations from a neural network, then feed those embeddings into a gradient boosting model. This captures some of the interaction modeling benefits of transformers while keeping inference latency within the 10–20ms range on CPU.
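A minimal sketch of the hybrid vector assembly, with embedding tables as plain dicts standing in for a trained network's lookup layer (table contents, field names, and the `<unk>` fallback convention are all illustrative):

```python
def build_feature_vector(numeric_feats, categorical_ids, embedding_tables):
    """Concatenate numeric features with learned embeddings for
    high-cardinality categoricals (BIN, MCC, device type). The tables
    would be exported from a separately trained neural network; unseen
    categories fall back to an '<unk>' row."""
    vec = list(numeric_feats)
    for name, cat_id in categorical_ids.items():
        table = embedding_tables[name]
        vec.extend(table.get(cat_id, table["<unk>"]))
    return vec
```

The resulting flat vector feeds the gradient boosting model as ordinary columns, so inference stays on CPU and within the 10–20ms range cited above.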
For operators building in-house fraud scoring, gradient boosting is the right starting architecture. The marginal accuracy gain from transformer models does not justify the infrastructure investment unless you are processing at Stripe-scale transaction volumes.
The Labeling Lag Problem
The hardest operational challenge in production fraud ML is not the model or the infrastructure. It is the feedback signal.
Fraud labels come from chargebacks — cardholders disputing transactions as unauthorized. Visa and Mastercard give cardholders up to 120 days to file a dispute. The practical labeling lag is 60–120 days from the transaction date to confirmed fraud label. A model that scores transactions in May won’t have fraud labels for those transactions until July or August.
This creates a structural blindspot. A model can degrade significantly in the gap between transaction and label — new fraud tactics, a new fraud ring, a change in attack patterns — and the degradation won’t show up in accuracy metrics until the labels arrive months later. By the time you detect the problem, the fraud has already run through.
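One practical consequence for training and evaluation: only transactions older than the dispute window have trustworthy labels. A small sketch of the cutoff logic, with the 120-day window taken from the text:

```python
from datetime import date, timedelta

def label_mature_cutoff(as_of, dispute_window_days=120):
    """Transactions newer than the dispute window may still receive
    chargebacks, so their 'not fraud' labels are unreliable. Train and
    evaluate only on transactions dated before this cutoff."""
    return as_of - timedelta(days=dispute_window_days)

print(label_mature_cutoff(date(2025, 9, 1)))  # 2025-05-04
```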
Proxy metrics help but don’t solve the problem. Chargeback rate can signal fraud volume trends. Decline rate shifts can indicate the model is scoring differently. Manual review rates can flag increased uncertainty. None of these directly measure model accuracy — they measure downstream effects that may have multiple causes. The payment AI MLOps article covers the full monitoring architecture for fraud model drift, including PSI-based drift detection that can catch distribution shifts before labels arrive.
What This Means for Operators
If you are using a fraud platform (Sift, Kount, Stripe Radar, Adyen Protect): the feature store and cascade architecture are handled for you. The platform's infrastructure serves features at sub-millisecond latency and cascades models internally. Your operational work is calibrating score thresholds (what score triggers a block vs a flag vs a review), monitoring your false positive rate, and instrumenting the feedback loop so the platform's models can learn from your specific merchant context.
If you are building in-house scoring: the feature store is the prerequisite investment, not the model. Start with the data infrastructure — get transaction history, device signals, and velocity features into an in-memory store with sub-millisecond retrieval before training a model. A simple logistic regression with good features outperforms a sophisticated model with stale batch features.
On the accuracy-latency tradeoff: benchmark your model inference time in the production environment, not a development laptop. GPU-optimized development environments hide latency that appears in CPU production deployments. Measure the 99th-percentile latency, not the mean — the tail latency is what breaches the 100ms budget.
On labeling lag: instrument proxy metrics from day one. Track chargeback rate, dispute rate by card type and geography, and model score distributions on a daily basis. Set threshold alerts for distribution shifts (PSI above 0.1 is the standard trigger for investigation). Do not wait for the 90-day label window to discover model degradation.
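The PSI calculation itself is small enough to sketch. Inputs are pre-binned proportions of a distribution (for example, last month's model score histogram vs. today's); the binning scheme is up to you:

```python
import math

def psi(expected, actual, eps=1e-6):
    """Population Stability Index between two binned distributions.
    Each input is a list of bin proportions summing to ~1. Rule of
    thumb: PSI > 0.1 means investigate; > 0.25 means material drift."""
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)   # guard against empty bins
        total += (a - e) * math.log(a / e)
    return total

baseline = [0.25, 0.25, 0.25, 0.25]
print(psi(baseline, baseline))                      # 0.0 -> no drift
print(round(psi(baseline, [0.4, 0.3, 0.2, 0.1]), 3))  # 0.228 -> above the 0.1 trigger
```

Because PSI needs only the score distribution, not labels, it fires months before the chargeback-based accuracy metrics can.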
On the authorization rate connection: fraud scoring and authorization optimization interact. A fraud model tuned too aggressively increases false positives — legitimate transactions declined — which directly reduces authorization rate. The right operating point is not minimum fraud; it is maximum revenue net of fraud losses and false decline costs. At high transaction volumes, a 0.1% false positive rate improvement is worth more in recovered revenue than a 0.1% fraud rate reduction in prevented losses.
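The operating-point claim can be made concrete with toy numbers. Every parameter below is illustrative, and the lifetime-value multiplier on false declines (a falsely declined customer often takes future purchases elsewhere) is our assumption, not a figure from the text:

```python
def recovered_revenue(volume, avg_ticket, fp_rate_cut=0.001,
                      ltv_multiplier=3.0):
    """Revenue recovered by cutting the false positive rate by
    fp_rate_cut. The LTV multiplier models churn from false declines
    (illustrative assumption)."""
    return volume * fp_rate_cut * avg_ticket * ltv_multiplier

def prevented_losses(volume, avg_ticket, fraud_rate_cut=0.001,
                     chargeback_fee=15.0):
    """Direct losses avoided by cutting the fraud rate by the same amount."""
    return volume * fraud_rate_cut * (avg_ticket + chargeback_fee)

vol, ticket = 1_000_000, 80.0
print(recovered_revenue(vol, ticket))  # 240000.0
print(prevented_losses(vol, ticket))   # 95000.0
```

Whether the inequality holds depends on ticket size and how much lifetime value a false decline actually destroys; the point is that this comparison, not fraud minimization alone, should set the score threshold.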
Sources
- Payment systems have roughly 100ms to pull data, score fraud risk, route transactions, and get authorization
- ML models score transactions for fraud risk in the 10–50ms range in high-performance systems
- Fraud models typically need 20–100+ features per prediction within a 100ms window; sub-millisecond feature store retrieval required
- Sift fraud platform intercepts payments, scores in real time (sub-100ms), returns approve, flag, or block before settlement completes
- Visa and Mastercard chargeback windows up to 120 days — ground truth labeling lag of 60–120 days from transaction
- Stripe Payments Foundation Model trains on tens of billions of transactions with NVIDIA partnership for real-time inference; increased large-merchant attack detection 64%
- HyperVerge: Real-Time Fraud Detection AI-Ready Guide for 2026 — gradient boosting with class weighting standard for imbalanced fraud datasets
- Dedicated bare metal servers provide predictable sub-100ms latencies for fraud detection; NVMe + high-bandwidth networking required for production scale
Source types explained in our Methodology.