AI and Automation · 8 min read

Why You Still Need Rule Engines in 2026

ML fraud models outperform rules on aggregate metrics. But rules still win on regulatory explainability, instant deployment, and edge cases where training data is sparse. The best fraud stacks are hybrid — here's how to architect the handoff.

By Shaun Toh
TL;DR

ML fraud models beat rules on aggregate metrics, but rules win on explainability, instant deployment for novel attacks, and sparse-data edge cases. Pure fraud detection is excluded from EU AI Act Annex III high-risk — hybrid waterfall stacks are the production standard.

Every fraud vendor’s pitch deck contains the same slide: ML model outperforms rules by 2x on fraud detection, 3x on false positive reduction. The numbers are usually real. The conclusion — “move to ML, deprecate rules” — is not.

Production fraud stacks at Stripe, Adyen, PayPal, and every serious payment operator that has invested in this problem run hybrid architectures. Rules aren’t a legacy component being tolerated until the ML model matures. They’re a distinct capability that solves problems ML cannot, and the best-performing fraud stacks are the ones that understand exactly which problems each approach owns.

Where ML Genuinely Wins

The performance advantage of ML in fraud is real and quantifiable in specific areas:

Aggregate false positive rate. A well-tuned gradient boosting model operating on a large feature set produces 2-4x fewer false positives than a comparable rule set on card-not-present authorization decisions. False positives — declining legitimate transactions — cost merchants revenue and damage customer relationships. Stripe Radar published data showing 8% authorization rate uplift for merchants migrating from rule-only to Radar ML models. This is the most commercially significant advantage of ML.

Drift detection. Fraud patterns change constantly. When a new attack vector emerges — a new BIN targeted, a new synthetic identity pattern, a new device spoofing technique — ML models that are regularly retrained detect the pattern from transaction data faster than humans can write rules for it. A model retrained weekly incorporates new signals automatically. A rule set requires a fraud analyst to identify the pattern, write a rule, test it, and deploy it — a process that takes days at best.

High-dimensional signal integration. Device fingerprint, session behaviour, graph relationships between accounts, merchant category history, IP velocity, BIN issuer risk — integrating these signals into rule logic requires writing an exponentially growing number of rules as signal count increases. ML handles arbitrary feature counts natively. The practical result: ML models can use 200+ features effectively; rule sets typically saturate around 30-50 rules before false positive rates become unmanageable.

Behavioural anomaly detection. A cardholder who normally shops in London at grocery and coffee merchants, then suddenly initiates a $3,000 electronics purchase from a US IP address at 3am local time — this is detectable by an ML model trained on their transaction history. Writing a rule that captures this pattern for every cardholder permutation is impossible.
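What that looks like in practice is feature engineering against the cardholder's own history rather than a global rule. A minimal sketch, assuming a hypothetical `Txn` record — three features standing in for the hundreds a production pipeline would generate:

```python
from dataclasses import dataclass
from statistics import mean, stdev

@dataclass
class Txn:
    amount: float
    country: str   # ISO country code
    hour: int      # local hour of day, 0-23

def anomaly_features(history: list[Txn], txn: Txn) -> dict:
    """Per-cardholder anomaly signals for a model to consume.

    Illustrative only: a real feature pipeline would also cover device,
    merchant category, IP velocity, and graph signals.
    """
    amounts = [t.amount for t in history]
    mu = mean(amounts)
    sigma = stdev(amounts) if len(amounts) > 1 else 1.0
    return {
        # how far this amount deviates from the cardholder's own norm
        "amount_zscore": (txn.amount - mu) / (sigma or 1.0),
        # 1.0 if this country never appears in their history
        "new_country": float(txn.country not in {t.country for t in history}),
        # crude "middle of the night, local time" signal
        "odd_hour": float(txn.hour < 6),
    }
```

For the London cardholder above, a $3,000 purchase from a US IP at 3am yields a large `amount_zscore` plus `new_country = 1.0` and `odd_hour = 1.0` — signals a model weighs jointly, per cardholder, where an enumerated rule set cannot.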

Where Rules Still Win

The ML marketing never mentions these cases, but they’re operationally critical:

Regulatory explainability. EU law requires explainable automated decisions: the GDPR already guarantees “meaningful information about the logic involved” in automated decision-making, and the EU AI Act, in force since August 2024, adds a right to explanation for decisions made by high-risk AI systems. A fraud decline is an automated decision affecting an individual. A rule — “declined because card country (Nigeria) does not match billing country (Germany) and transaction amount exceeds €500” — is fully explainable. A gradient boosting model output of “fraud probability 0.73” is not, without additional interpretability tooling. For operators serving EU consumers, explainability is a compliance requirement, not a nice-to-have. Rules provide it for free; ML requires SHAP values, LIME, or similar explainability frameworks that add complexity and still produce outputs that regulators may not accept.

Instant deployment. A novel fraud attack — a specific BIN range compromised in a breach, a merchant category being used for card testing, a velocity pattern from a specific IP subnet — requires an immediate response. Writing and deploying a rule takes minutes. Retraining an ML model, validating it, and deploying it takes hours to days. For the gap between attack detection and ML response, rules are the only defence.
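Minutes-not-days deployment usually comes from treating rules as data rather than code: an analyst pushes a declarative rule to a store and a generic engine evaluates it, with no release cycle. A minimal sketch — the operators and field names are a hypothetical mini-DSL, not any vendor's format:

```python
# Condition operators the generic engine understands.
OPS = {
    "eq":     lambda a, b: a == b,
    "gt":     lambda a, b: a is not None and a > b,
    "in":     lambda a, b: a in b,
    "prefix": lambda a, b: str(a).startswith(b),
}

def matches(txn: dict, rule: dict) -> bool:
    """True if every (field, op, value) condition holds for this txn."""
    return all(OPS[op](txn.get(field), value)
               for field, op, value in rule["conditions"])

# Pushed minutes after a BIN compromise is reported -- no code deploy:
block_compromised_bin = {
    "id": "block-bin-412345",
    "action": "decline",
    "conditions": [("card_bin", "prefix", "412345")],
}
```

The rule exists only as data, so shipping it is a config write, while the ML model covering the same pattern is still days away from its next retrain.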

Low-volume edge cases. ML models require training data. A fraud type with 50 historical examples cannot produce a reliable model — you don’t have the label density. Recurring corporate card fraud patterns, specific merchant category attacks, card-present skimming at specific terminal types — these may appear in volumes too low to train on but are still exploitable at meaningful dollar value. Rules are the only tool for low-data-density patterns.

Hard business rules that should never be ML. Certain decisions should not be probabilistic. Transactions from specific OFAC-sanctioned countries must be blocked — not “probably blocked based on risk score.” Transactions above a specific amount threshold requiring manual review are a compliance requirement, not a risk preference. Hard rules ensure these decisions are deterministic and auditable.

The Waterfall Architecture

The production architecture that handles all of these requirements is a waterfall: rules run first, then ML scores the remainder, then rules run again for overrides.

Stage 1: Pre-filter rules (fast blocks and allows). This layer handles obvious cases that don’t need ML scoring: hard blocks (OFAC, velocity limits, known bad BINs, specific fraud flags), and fast allows (transactions from known-good customers below a threshold, recurring charges from established relationships). This stage processes in microseconds and removes ~20-30% of volume from ML scoring. The purpose is efficiency as much as fraud prevention — not paying ML compute costs on obvious cases.

Stage 2: ML scoring. The remaining 70-80% of volume goes through the ML model. The model outputs a fraud probability score (0.0-1.0) and may also output feature importance scores for explainability. The score is then mapped to a decision: approve (low score), review (medium score), decline (high score). Threshold calibration is critical and market-specific.

Stage 3: Post-score override rules. Rules run again after ML scoring for business-specific overrides: approve high-value customers even if ML score is borderline (explicit whitelist), escalate certain merchant categories regardless of score (high-risk merchant category rules), apply 3DS2 challenge at medium scores rather than decline (friction as alternative to decline).

Stage 4: Review queue. Medium-score transactions that aren’t auto-decided go to a review queue with a defined SLA. This is where the explainability requirement becomes operationally important — reviewers need to understand why a transaction was flagged.
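The four stages can be sketched end to end. Everything here is an illustrative placeholder — thresholds, field names, and block lists — and real systems calibrate the score bands per market:

```python
OFAC_BLOCKED = {"KP", "IR"}     # illustrative sanctioned-country codes
HIGH_RISK_MCCS = {"6051"}       # illustrative merchant category codes

def decide(txn: dict, ml_score) -> str:
    """Rule -> ML -> rule waterfall. ml_score is a callable returning
    a fraud probability in [0.0, 1.0]."""
    # Stage 1: pre-filter rules -- hard blocks and fast allows skip ML.
    if txn.get("country") in OFAC_BLOCKED:
        return "decline"                    # deterministic, auditable
    if txn.get("known_good_customer") and txn.get("amount", 0) < 50:
        return "approve"                    # no ML compute spent

    # Stage 2: ML scoring on the remaining volume.
    score = ml_score(txn)

    # Stage 3: post-score business overrides.
    if txn.get("whitelisted"):
        return "approve"
    if txn.get("merchant_category") in HIGH_RISK_MCCS:
        return "review"
    if 0.4 <= score < 0.7:
        return "challenge_3ds2"             # friction instead of decline

    # Stage 4: map score to decision; the mid band feeds the review
    # queue with its SLA.
    if score >= 0.7:
        return "decline"
    if score >= 0.2:
        return "review"
    return "approve"
```

The handoff logic is the point: each stage only sees what the previous one did not decide, which is what keeps ML compute off the obvious cases and keeps the overrides deterministic.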

EU AI Act Implications for Fraud Teams

The EU AI Act classifies “AI systems used to evaluate the creditworthiness of natural persons or establish their credit score” as high-risk under Annex III. Pure fraud detection is explicitly excluded from the high-risk category under Annex III Point 5(b) — standalone fraud scoring that does not simultaneously make access or credit decisions is not high-risk AI. However, systems that combine fraud scoring with access or credit decisioning may still qualify as high-risk depending on implementation, and full obligations for high-risk systems apply from the August 2026 enforcement date.

High-risk AI systems under the Act require: technical documentation, a conformity assessment, a fundamental rights impact assessment, human oversight measures, and transparency to affected individuals. For fraud specifically, this means your ML model needs documented feature sets, training data provenance, performance metrics, and bias assessments. It also means you need to be able to explain a decline to an affected customer in plain language — the “right to explanation.”

The practical implication for architecture: your rule engine provides the explainability layer for decisions where the ML model is the primary signal. Rather than “declined by ML fraud model,” the customer communication becomes “declined because our system detected unusual account activity” — which is both accurate and explainable.
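That translation layer can be as simple as a lookup keyed by reason code — a sketch with hypothetical codes; the point is that the customer sees plain language, never a raw model score:

```python
# Internal reason codes (rule IDs or top model features) mapped to
# plain-language, customer-safe explanations. Codes are illustrative.
CUSTOMER_MESSAGES = {
    "ml_high_risk":     "our system detected unusual account activity",
    "country_mismatch": "the card details did not match the billing details",
    "velocity_limit":   "too many recent attempts were made on this card",
}

def explain_decline(reason_code: str) -> str:
    """Customer-facing decline message, with a safe fallback for codes
    that have no mapping yet."""
    msg = CUSTOMER_MESSAGES.get(
        reason_code, "the transaction could not be verified")
    return f"Your payment was declined because {msg}."
```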

Operators must also distinguish between AI systems that support a human decision (lower regulatory burden) versus systems that make autonomous decisions (higher burden). If every fraud decline is reviewed by a human before being communicated to the customer, the autonomy threshold is not met. Many operators are adding human review gates at medium-score thresholds specifically to manage AI Act exposure.

Tooling Reality

Enterprise (Stripe, Adyen, PayPal, in-house): Both rule engines and ML models are custom-built and deeply integrated. Adyen’s RevenueProtect and Stripe Radar both expose rule configuration interfaces to merchants on top of their ML core — this is the waterfall model made product. Merchants get rules as a product, ML as infrastructure.

Mid-market (Sift, Sardine, Unit21): Purpose-built fraud platforms offering both rule configuration and ML scoring. Sardine specifically focuses on the behavioural biometrics layer that augments both rules and ML with device and session signals. Unit21 is particularly strong on the rule engine and case management side.

Startups: At low volume, hosted ML fraud scoring via Stripe Radar or Sift is sufficient. A custom rule engine makes sense above ~$50M annual GMV when merchant-specific patterns require tuning that off-the-shelf models can’t deliver.

The Practical Framework

A tiered decision for how much to invest in each layer:

Under $10M GMV: Use your PSP’s built-in fraud tooling (Stripe Radar, Adyen RevenueProtect). The default ML models are well-calibrated for most merchant types. Add 3-5 merchant-specific rules for your obvious patterns. Don’t build custom ML.

$10M–$100M GMV: Invest in rule configuration on your PSP’s platform. Review fraud analytics weekly. Consider a supplementary tool (Sift, Sardine) for device fingerprinting if your fraud rate is elevated. Still not worth building custom ML.

$100M+ GMV: Hybrid architecture is warranted. Custom ML models trained on your transaction data outperform generic models at this volume. Rule engines are essential for the explainability, instant response, and hard-rule cases described above. Invest in the handoff logic — the rule-ML-rule waterfall — as a first-class architectural component.

The operators who treat the rule-ML debate as binary — “we’re AI-native now, rules are deprecated” — are the ones who get caught by the novel attack pattern their model hasn’t seen yet, or the EU AI Act audit that asks for an explanation their model can’t give.

Rules and ML are not in competition. They solve different problems. The waterfall architecture is the right answer precisely because both capabilities are genuinely necessary.

By Shaun Toh · Director, Digital Payments · Razer
