Real-Time Fraud Decisioning: How Payment AI Makes Sub-100ms Calls
You have roughly 100 milliseconds to score a transaction and authorize or block it. Here is how production fraud decisioning systems are actually built — feature stores, model cascades, latency budgets, and the accuracy tradeoff that nobody talks about.
100ms total window. ML scoring: 10–50ms. Feature store must be sub-millisecond (Redis, Aerospike). Cascade: fast model filters 90%+ volume, accurate model scores the rest. Labeling lag: chargebacks arrive 60–120 days after the transaction.
Every payment authorization runs on a clock that most operators never think about. From the moment a customer taps, swipes, or clicks pay, the card network gives roughly 100 milliseconds for the entire authorization chain to complete — network routing, fraud scoring, acquirer processing, issuer response, and return. Your fraud model gets a slice of that 100ms. In production systems, that slice is 10 to 50 milliseconds.
This is a different problem from building an accurate fraud model. Accuracy is a data science problem. Latency is an infrastructure and architecture problem. Production fraud decisioning requires both to be solved simultaneously, and the constraints are harder than most ML engineering contexts because the timeout is not a performance SLA — it is the card network authorization window. Miss it and the transaction either fails or routes around your fraud check.
This article covers the architecture of real-time fraud decisioning: the feature store, the cascade model design, the latency budget, and the labeling lag problem that makes evaluating production fraud ML structurally difficult. For the broader AI fraud detection landscape and vendor overview, see the companion article covering the full stack. For what happens downstream when fraud gets through, see AI-powered chargeback representment.
The 100ms Budget
A useful way to think about the authorization window is as a fixed budget that must be allocated across several steps with hard ordering constraints.
The breakdown in a typical card-not-present authorization:
- Network transmission (customer → PSP): 10–20ms depending on geography
- Feature retrieval from the feature store: must be sub-millisecond at scale
- Fraud model inference: 10–50ms for gradient boosting; more for neural networks
- Routing and acquirer submission: 10–20ms
- Issuer authorization: 30–50ms (outside your control)
- Response transmission back to the merchant: 5–10ms
The issuer authorization step — which you don’t control — consumes 30–50ms of the budget. That leaves roughly 50ms for everything you do control: feature retrieval, fraud scoring, and routing. At scale, sub-millisecond feature retrieval is not a luxury; it is the prerequisite that makes everything else possible.
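The budget arithmetic above can be sketched as a quick sanity check. The component labels and the split into controllable vs. uncontrollable steps are illustrative, following the breakdown above:

```python
# Illustrative latency budget for a card-not-present authorization.
# Millisecond ranges follow the breakdown above; the labels are ours.
BUDGET_MS = 100

COMPONENTS = {
    "network_to_psp":    (10, 20),
    "feature_retrieval": (0.5, 1),   # must stay sub-millisecond at scale
    "fraud_inference":   (10, 50),
    "routing_acquirer":  (10, 20),
    "issuer_auth":       (30, 50),   # outside your control
    "response_return":   (5, 10),
}

def worst_case_total(parts=COMPONENTS):
    """Sum every component at its upper bound."""
    return sum(hi for _, hi in parts.values())

def slack_after_issuer(parts=COMPONENTS, budget=BUDGET_MS):
    """The rule of thumb from the text: assume issuer auth lands at
    its upper bound and see what remains for everything you control."""
    return budget - parts["issuer_auth"][1]

print(worst_case_total())    # 151 -> worst case breaches the window
print(slack_after_issuer())  # 50 -> ms left for retrieval, scoring, routing
```

The worst-case sum exceeding 100ms is the point: you cannot budget each step at its upper bound, which is why feature retrieval and scoring are engineered toward their lower bounds.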
The constraint is different for different payment types. Real-time bank-to-bank payments (Pix, UPI, SEPA Instant) may have tighter windows depending on scheme rules. Card-present transactions at POS terminals have slightly more flexibility in some implementations. But for card-not-present e-commerce — the highest-fraud-risk payment context — the 100ms window is the relevant constraint for most operators.
The Feature Store: Why You Can’t Compute Features at Inference Time
A fraud model that scores a transaction on card number, amount, and merchant alone is a weak fraud model. Effective fraud scoring requires behavioral and historical context: how many transactions has this card made in the last 5 minutes? What is the historical decline rate for this BIN on this acquirer? Has this device fingerprint appeared in previous fraud cases? What is the average transaction amount for this customer, and how much does today’s transaction deviate from it?
Computing these features from raw data at inference time would take seconds, not milliseconds — you would need to query transaction history databases, aggregate across time windows, join device fingerprint data, and calculate rolling statistics, all within the scoring window. It is not feasible.
A feature store solves this by pre-computing features continuously and caching them for sub-millisecond retrieval. The architecture has two components: an offline pipeline that computes batch features (30-day transaction history, customer lifetime statistics, BIN-level performance metrics) on a scheduled basis, and an online pipeline that maintains real-time streaming features (last-5-minute velocity, current session behavior) with near-instant updates. At inference time, the model retrieves the pre-computed feature vector for the transaction in under a millisecond and passes it to the model.
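A minimal sketch of the online (streaming) side — a last-5-minutes velocity counter. Production systems keep this state in Redis or Aerospike rather than process memory, and the class and method names here are illustrative:

```python
from collections import deque
import time

class VelocityCounter:
    """Sliding-window transaction count per card: a toy version of a
    streaming feature pipeline. State lives in-process here; in
    production it lives in the shared feature store."""

    def __init__(self, window_seconds=300):  # 5-minute window
        self.window = window_seconds
        self.events = {}  # card_id -> deque of event timestamps

    def record(self, card_id, ts=None):
        """Append a transaction timestamp for this card."""
        ts = time.time() if ts is None else ts
        self.events.setdefault(card_id, deque()).append(ts)

    def count(self, card_id, now=None):
        """Count transactions inside the window, evicting stale ones."""
        now = time.time() if now is None else now
        q = self.events.get(card_id)
        if not q:
            return 0
        while q and now - q[0] > self.window:
            q.popleft()
        return len(q)
```

The eviction-on-read pattern keeps writes O(1); the equivalent Redis idiom is a sorted set per card with `ZREMRANGEBYSCORE` plus `ZCARD`.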
Redis and Aerospike are the standard infrastructure choices for feature stores in payment fraud systems. Both are in-memory data stores optimized for sub-millisecond reads at high throughput. Redis is more commonly used at mid-scale; Aerospike is preferred at very high transaction volumes where memory-to-disk hybrid storage is needed to manage feature storage costs. The feature store is an infrastructure investment — you cannot build a real-time fraud scoring system without one.
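The read path can be sketched as a single hash lookup per transaction. The key scheme and field names below are illustrative, and the client is duck-typed so the same function runs against a real `redis.Redis` (created with `decode_responses=True`) or an in-memory stub:

```python
def feature_key(card_token: str) -> str:
    # Illustrative key scheme: one hash per tokenized card.
    return f"feat:card:{card_token}"

FEATURE_FIELDS = ("txn_count_5m", "avg_amount_30d", "bin_decline_rate")

def fetch_features(client, card_token, default=0.0):
    """One round-trip read of the pre-computed feature vector.
    `client` is anything exposing a Redis-style hgetall; fields missing
    from the hash fall back to a default so brand-new cards still score."""
    raw = client.hgetall(feature_key(card_token))
    return [float(raw.get(field, default)) for field in FEATURE_FIELDS]

class InMemoryStore:
    """Dict-backed stand-in for Redis, for local testing only."""
    def __init__(self, data):
        self.data = data
    def hgetall(self, key):
        return self.data.get(key, {})
```

In production `fetch_features` runs unchanged against the real client; the stub exists only so the read path can be exercised without a server.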
The Cascade: Solving the Accuracy-Latency Tradeoff
The naive approach to fraud scoring is to run every transaction through the most accurate model available. The problem is that accuracy and latency trade off. More accurate models are larger, require more features, model more complex interactions, and take longer to execute. A deep neural network that achieves state-of-the-art fraud detection accuracy may take 40–80ms on CPU — consuming most of your budget before you have scored the transaction.
The production answer is a cascade architecture, sometimes called a waterfall. Transactions flow through models in increasing order of complexity:
Stage 1: Fast filter. A lightweight model — logistic regression, or gradient boosting with a shallow depth limit — scores every transaction in 2–5ms. Its job is to clear the 85–95% of volume that is clearly not fraud and route it to approval without further scoring, which means it must be tuned for very high recall on fraud: any fraud it misses here is approved without a second look. The residual population flagged as potentially suspicious escalates to Stage 2.
Stage 2: Accurate scorer. A deeper gradient boosting ensemble or neural network scores the flagged population in 20–40ms. This model uses the full feature set and is tuned for precision — it needs to distinguish genuine fraud from the false positives the Stage 1 filter produced. The output is a fraud score that determines whether the transaction is approved, flagged for manual review, or declined.
Stage 3 (optional): Rules engine. Hard business rules — block known fraudulent device fingerprints, enforce velocity limits, apply geographic restrictions — can run before Stage 1 (blocking obvious fraud in microseconds) or after Stage 2 (enforcing constraints that the ML model should not override). See rule engines vs ML hybrid architecture for the design pattern. The pre-filter layer is also the primary defense against card testing and enumeration attacks, which generate high authorization-attempt velocity that ML models alone cannot block fast enough.
The cascade reduces average scoring latency because most volume never reaches the expensive Stage 2 model. At a 90% pass-through rate on Stage 1, 90% of transactions complete scoring in under 10ms. The overall average latency across all transactions stays well within budget even though the Stage 2 model alone would exceed it.
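The two-stage flow and the averaging arithmetic above can be sketched in a few lines. The escalation threshold is illustrative, and the models are plain callables returning a fraud probability:

```python
def cascade_score(features, fast_model, accurate_model,
                  escalate_threshold=0.10):
    """Two-stage cascade: the fast model clears obvious-legit volume,
    and only scores at or above the escalation threshold pay for the
    expensive model."""
    score = fast_model(features)
    if score < escalate_threshold:
        return score, "stage1"        # ~90% of volume stops here
    return accurate_model(features), "stage2"

# Average scoring latency at a 90% Stage 1 pass-through rate,
# with ~5ms for Stage 1 and ~30ms more for Stage 2:
avg_latency_ms = 0.9 * 5 + 0.1 * (5 + 30)
print(avg_latency_ms)  # 8.0 -> well inside the scoring budget
```

The average stays near the Stage 1 cost even though a Stage 2-only design would sit near the top of the budget; the tail (escalated transactions) is what you watch at p99.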
Model Architecture Choices
Gradient boosting (XGBoost, LightGBM) is the industry standard for fraud scoring. It handles the class imbalance inherent in fraud datasets (where fraud cases are 0.1–1% of transactions) well with class weighting or cost-sensitive learning. It is interpretable relative to neural networks — feature importance scores tell you which signals drive each prediction, which matters for regulatory explainability requirements. And it runs fast on CPU: a well-tuned gradient boosting model with 100–300 trees scores a transaction in 2–10ms without GPU infrastructure.
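The class weighting mentioned above reduces to a single ratio. `scale_pos_weight` is the actual XGBoost parameter name; the counts below are illustrative:

```python
def imbalance_weight(n_negative, n_positive):
    """Standard class weight for imbalanced fraud data: up-weight the
    rare positive (fraud) class by the negative:positive ratio.
    XGBoost accepts this directly as its scale_pos_weight parameter."""
    return n_negative / n_positive

# A 0.2% fraud rate across one million transactions:
w = imbalance_weight(998_000, 2_000)
print(w)  # 499.0
# e.g. xgb.XGBClassifier(scale_pos_weight=w, ...)  # illustrative usage
```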
Neural networks and transformers achieve higher accuracy by modeling complex interactions between features that gradient boosting misses. Stripe’s transition to TabTransformer+ (and now the Payments Foundation Model) is the most public example: the transformer architecture captured feature interactions that the prior gradient boosting model missed, recovering $6 billion in falsely declined transactions in 2024 through better retry scoring. The cost is infrastructure: transformer inference at production scale requires GPU hardware, NVIDIA partnerships, and dedicated serving infrastructure that is not accessible to operators running in-house models.
Gradient boosting with neural embeddings is a practical middle path: embed categorical features (BIN, merchant category, device type) with learned representations from a neural network, then feed those embeddings into a gradient boosting model. This captures some of the interaction modeling benefits of transformers while keeping inference latency within the 10–20ms range on CPU.
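A minimal sketch of the hybrid vector assembly, with embedding tables as plain dicts standing in for a trained network's lookup layer (table contents, field names, and the `<unk>` fallback convention are all illustrative):

```python
def build_feature_vector(numeric_feats, categorical_ids, embedding_tables):
    """Concatenate numeric features with learned embeddings for
    high-cardinality categoricals (BIN, MCC, device type). The tables
    would be exported from a separately trained neural network; unseen
    categories fall back to an '<unk>' row."""
    vec = list(numeric_feats)
    for name, cat_id in categorical_ids.items():
        table = embedding_tables[name]
        vec.extend(table.get(cat_id, table["<unk>"]))
    return vec
```

The resulting flat vector feeds the gradient boosting model as ordinary columns, so inference stays on CPU and within the 10–20ms range cited above.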
For operators building in-house fraud scoring, gradient boosting is the right starting architecture. The marginal accuracy gain from transformer models does not justify the infrastructure investment unless you are processing at Stripe-scale transaction volumes.
The Labeling Lag Problem
The hardest operational challenge in production fraud ML is not the model or the infrastructure. It is the feedback signal.
Fraud labels come from chargebacks — cardholders disputing transactions as unauthorized. Visa and Mastercard give cardholders up to 120 days to file a dispute. The practical labeling lag is 60–120 days from the transaction date to confirmed fraud label. A model that scores transactions in May won’t have fraud labels for those transactions until July or August.
This creates a structural blindspot. A model can degrade significantly in the gap between transaction and label — new fraud tactics, a new fraud ring, a change in attack patterns — and the degradation won’t show up in accuracy metrics until the labels arrive months later. By the time you detect the problem, the fraud has already run through.
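One practical consequence for training and evaluation: only transactions older than the dispute window have trustworthy labels. A small sketch of the cutoff logic, with the 120-day window taken from the text:

```python
from datetime import date, timedelta

def label_mature_cutoff(as_of, dispute_window_days=120):
    """Transactions newer than the dispute window may still receive
    chargebacks, so their 'not fraud' labels are unreliable. Train and
    evaluate only on transactions dated before this cutoff."""
    return as_of - timedelta(days=dispute_window_days)

print(label_mature_cutoff(date(2025, 9, 1)))  # 2025-05-04
```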
Proxy metrics help but don’t solve the problem. Chargeback rate can signal fraud volume trends. Decline rate shifts can indicate the model is scoring differently. Manual review rates can flag increased uncertainty. None of these directly measure model accuracy — they measure downstream effects that may have multiple causes. The payment AI MLOps article covers the full monitoring architecture for fraud model drift, including PSI-based drift detection that can catch distribution shifts before labels arrive.
What This Means for Operators
If you are using a fraud platform (Sift, Kount, Stripe Radar, Adyen Protect): the feature store and cascade architecture are handled for you. The platform's infrastructure serves features at sub-millisecond latency and cascades models internally. Your operational work is calibrating score thresholds (what score triggers a block vs a flag vs a review), monitoring your false positive rate, and instrumenting the feedback loop so the platform's models can learn from your specific merchant context.
If you are building in-house scoring: the feature store is the prerequisite investment, not the model. Start with the data infrastructure — get transaction history, device signals, and velocity features into an in-memory store with sub-millisecond retrieval before training a model. A simple logistic regression with good features outperforms a sophisticated model with stale batch features.
On the accuracy-latency tradeoff: benchmark your model inference time in the production environment, not a development laptop. GPU-optimized development environments hide latency that appears in CPU production deployments. Measure the 99th-percentile latency, not the mean — the tail latency is what breaches the 100ms budget.
On labeling lag: instrument proxy metrics from day one. Track chargeback rate, dispute rate by card type and geography, and model score distributions on a daily basis. Set threshold alerts for distribution shifts (PSI above 0.1 is the standard trigger for investigation). Do not wait for the 90-day label window to discover model degradation.
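The PSI calculation itself is small enough to sketch. Inputs are pre-binned proportions of a distribution (for example, last month's model score histogram vs. today's); the binning scheme is up to you:

```python
import math

def psi(expected, actual, eps=1e-6):
    """Population Stability Index between two binned distributions.
    Each input is a list of bin proportions summing to ~1. Rule of
    thumb: PSI > 0.1 means investigate; > 0.25 means material drift."""
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)   # guard against empty bins
        total += (a - e) * math.log(a / e)
    return total

baseline = [0.25, 0.25, 0.25, 0.25]
print(psi(baseline, baseline))                      # 0.0 -> no drift
print(round(psi(baseline, [0.4, 0.3, 0.2, 0.1]), 3))  # 0.228 -> above the 0.1 trigger
```

Because PSI needs only the score distribution, not labels, it fires months before the chargeback-based accuracy metrics can.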
On the authorization rate connection: fraud scoring and authorization optimization interact. A fraud model tuned too aggressively increases false positives — legitimate transactions declined — which directly reduces authorization rate. The right operating point is not minimum fraud; it is maximum revenue net of fraud losses and false decline costs. At high transaction volumes, a 0.1% false positive rate improvement is worth more in recovered revenue than a 0.1% fraud rate reduction in prevented losses.
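The operating-point claim can be made concrete with toy numbers. Every parameter below is illustrative, and the lifetime-value multiplier on false declines (a falsely declined customer often takes future purchases elsewhere) is our assumption, not a figure from the text:

```python
def recovered_revenue(volume, avg_ticket, fp_rate_cut=0.001,
                      ltv_multiplier=3.0):
    """Revenue recovered by cutting the false positive rate by
    fp_rate_cut. The LTV multiplier models churn from false declines
    (illustrative assumption)."""
    return volume * fp_rate_cut * avg_ticket * ltv_multiplier

def prevented_losses(volume, avg_ticket, fraud_rate_cut=0.001,
                     chargeback_fee=15.0):
    """Direct losses avoided by cutting the fraud rate by the same amount."""
    return volume * fraud_rate_cut * (avg_ticket + chargeback_fee)

vol, ticket = 1_000_000, 80.0
print(recovered_revenue(vol, ticket))  # 240000.0
print(prevented_losses(vol, ticket))   # 95000.0
```

Whether the inequality holds depends on ticket size and how much lifetime value a false decline actually destroys; the point is that this comparison, not fraud minimization alone, should set the score threshold.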
Sources
- Payment systems have roughly 100ms to pull data, score fraud risk, route transactions, and get authorization
- ML models score transactions for fraud risk in the 10–50ms range in high-performance systems
- Fraud models typically need 20–100+ features per prediction within a 100ms window; sub-millisecond feature store retrieval required
- Sift fraud platform intercepts payments, scores in real time (sub-100ms), returns approve, flag, or block before settlement completes
- Visa and Mastercard chargeback windows up to 120 days — ground truth labeling lag of 60–120 days from transaction
- Stripe Payments Foundation Model trains on tens of billions of transactions with NVIDIA partnership for real-time inference; increased large-merchant attack detection 64%
- HyperVerge: Real-Time Fraud Detection AI-Ready Guide for 2026 — gradient boosting with class weighting standard for imbalanced fraud datasets
- Dedicated bare metal servers provide predictable sub-100ms latencies for fraud detection; NVMe + high-bandwidth networking required for production scale
Source types explained in our Methodology.