AI and Automation · 7 min read

Reconciliation Automation: Where LLMs Actually Move the Needle

Most 'AI for finance ops' pitches are vapor. Reconciliation and exception matching make up one area where LLMs genuinely work — but only for specific sub-problems. Here's what ships and what doesn't.

By Shaun Toh
TL;DR

LLMs work for fuzzy merchant-name matching and exception triage in reconciliation, but fail at high-precision matching where money is on the line. Production architecture: deterministic matching for clean data, LLM triage for exceptions.

The “AI for finance operations” pitch deck typically shows a robot automating everything and a CFO smiling. The production reality is narrower and more interesting. For payment reconciliation specifically, large language models are genuinely useful — but for a well-defined subset of the problem. Understanding which subset determines whether your automation project ships or becomes a cautionary tale.

The Reconciliation Problem at Volume

Payment reconciliation means matching what your payment processor says happened against what your internal systems say happened, then investigating the gaps. At scale, this involves three data sources running constantly: settlement files from acquirers and PSPs, internal transaction records from your order management system or database, and bank statements showing actual cash movements.

A mid-size e-commerce operator processing 100,000 transactions per day will generate roughly 2-3 million lines of settlement data weekly across multiple payment methods and currencies. The mechanical process — match PSP transaction ID to internal order ID, confirm amount, confirm timing — handles the vast majority cleanly. But even a 1% exception rate generates 1,000 unmatched items per day that require human attention.

At larger volumes the numbers are more dramatic. A payment facilitator processing 5 million transactions daily at 0.5% exceptions produces 25,000 exception items per day. Each one requires someone to look at it, understand why it didn’t match, and either resolve it or escalate it. Multiply that by analyst cost and you have a meaningful finance operations problem.

Where Deterministic Matching Works — And Where It Breaks Down

Deterministic reconciliation — exact-match logic on transaction identifiers, amounts, and timestamps — handles somewhere between 95% and 98% of volume cleanly for well-run operations. The logic is simple: if PSP transaction ID X for $125.00 at 14:32:07 matches internal order Y for $125.00 submitted at 14:31:58, they’re the same transaction. No AI needed.
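The deterministic layer is essentially a keyed join plus a timing window. A minimal sketch with pandas — column names, the five-minute window, and the toy records are illustrative, not a real settlement schema (and production systems should hold amounts in minor units or `Decimal`, not floats):

```python
import pandas as pd

# Illustrative settlement and internal records; real files vary by PSP.
psp = pd.DataFrame({
    "psp_txn_id": ["X1", "X2"],
    "amount": [125.00, 49.99],
    "settled_at": pd.to_datetime(["2024-09-15 14:32:07", "2024-09-15 09:10:00"]),
})
internal = pd.DataFrame({
    "psp_txn_id": ["X1", "X3"],
    "order_id": ["Y1", "Y9"],
    "amount": [125.00, 12.00],
    "submitted_at": pd.to_datetime(["2024-09-15 14:31:58", "2024-09-15 08:00:00"]),
})

# Exact match on ID and amount; indicator flags rows with no counterpart.
merged = psp.merge(internal, on=["psp_txn_id", "amount"],
                   how="left", indicator=True)

# Timestamps must agree within a tolerance window (5 min here, arbitrarily).
within_window = (merged["settled_at"] - merged["submitted_at"]).abs() \
    <= pd.Timedelta("5min")

matched = merged[(merged["_merge"] == "both") & within_window]
exceptions = merged[~merged.index.isin(matched.index)]
```

Everything in `matched` is done — nothing else touches it. Only `exceptions` flows downstream.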

The remaining 2-5% breaks for structural reasons that deterministic logic can’t resolve:

Merchant name variation across acquirers. The same merchant appears as “STARBUCKS COFFEE 1247 LONDON”, “STBCKS CNRY WHRF”, and “STARBUCKS CORP” depending on the terminal and acquirer. If you’re reconciling against a merchant master database by name, these won’t match deterministically.

Rounding and FX differences. A transaction settled in EUR and converted to GBP at slightly different rates at the PSP versus your treasury system produces a $0.02 discrepancy that fails exact-match logic. It’s the same transaction; the amounts are genuinely different.

Timing mismatches. A T+1 settlement PSP and a T+2 bank credit create legitimate cross-period gaps. The transaction is real; the timing disagrees with the matching window.

Partial settlements and split captures. A $500 order partially fulfilled and partially refunded may produce a $300 settlement, a $200 refund, and three internal records. Matching these across settlement files is a combinatorial problem, not a lookup.

Reference field corruption. PSP reference fields get truncated, prefixed, or re-encoded differently depending on terminal firmware and acquirer processing. “ORD-20240915-84721” becomes “ORD20240915847” somewhere in the transmission chain.

These cases are where humans spend their time — and where automation is worth building.
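Two of these failure modes — rounding/FX gaps and reference corruption — can often be absorbed deterministically with tolerance bands and normalisation before anything reaches a human or an LLM. A sketch; the $0.05 tolerance and 10-character overlap are illustrative thresholds, not recommendations:

```python
def within_tolerance(psp_amount: float, internal_amount: float,
                     tolerance: float = 0.05) -> bool:
    """Treat sub-tolerance FX/rounding gaps as the same amount."""
    return abs(psp_amount - internal_amount) <= tolerance

def normalise_ref(ref: str) -> str:
    """Strip separators so 'ORD-20240915-84721' and a truncated
    'ORD20240915847' reduce to comparable alphanumeric strings."""
    return "".join(ch for ch in ref if ch.isalnum()).upper()

def refs_compatible(a: str, b: str, min_overlap: int = 10) -> bool:
    """A truncated reference is compatible if it is a sufficiently
    long prefix of the full one after normalisation."""
    a, b = normalise_ref(a), normalise_ref(b)
    shorter, longer = sorted((a, b), key=len)
    return len(shorter) >= min_overlap and longer.startswith(shorter)
```

These heuristics shrink the exception queue; what survives them is the genuinely ambiguous residue.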

The Specific LLM Opportunity

LLMs don’t solve the reconciliation problem broadly. They solve three specific sub-problems better than the alternatives that came before them:

Fuzzy merchant name matching. String-distance algorithms (Levenshtein distance, Jaro-Winkler) are decent but produce false positives on short strings and false negatives on semantically equivalent long strings. LLMs understand that “MCU PICTURES ONLINE” and “MARVEL CINEMA UNLIMITED” might be the same entity in a specific context. More practically, an LLM understands that “SBUX #4721 GATWICK” means Starbucks at Gatwick Airport — a purely algorithmic matcher doesn’t have that semantic layer. In practice: feeding pairs of unmatched merchant names to an LLM with a classification prompt (“same entity, different entity, or uncertain?”) consistently outperforms string-distance approaches, particularly on abbreviated and location-suffixed merchant names.
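The classification-prompt approach can be sketched as a prompt builder plus a defensive parser. The prompt wording is illustrative, and the actual model call is vendor-specific and omitted; the key design point is that any malformed or out-of-vocabulary response degrades to "uncertain", never to a silent match:

```python
import json

PROMPT_TEMPLATE = """You are matching merchant descriptors from card settlement files.
Are these two strings the same merchant entity?

A: {name_a}
B: {name_b}

Answer with one JSON object: {{"verdict": "same" | "different" | "uncertain", "reason": "..."}}"""

def build_match_prompt(name_a: str, name_b: str) -> str:
    return PROMPT_TEMPLATE.format(name_a=name_a, name_b=name_b)

def parse_verdict(raw: str) -> str:
    """Defensive parse of the model's reply: anything unexpected
    is 'uncertain', which routes to a human rather than auto-matching."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return "uncertain"
    if not isinstance(obj, dict):
        return "uncertain"
    verdict = obj.get("verdict")
    return verdict if verdict in {"same", "different", "uncertain"} else "uncertain"
```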

Exception narrative generation. Most reconciliation exceptions end up in a human review queue with a bare record: “Item ID 4829471, $340.00, unmatched, status: open.” A human analyst opens it, finds the PSP record, finds the internal record, identifies the discrepancy type, and writes a triage note. This is routine analytical work that LLMs handle well. Feeding the exception’s structured data to an LLM generates a natural-language summary: “PSP settled $340.00 on 2024-09-15 for transaction ref TXN-9821 with MCU in merchant name; closest internal match is order ORD-4821 for $340.00 placed 2024-09-14 with Marvel Cinema Subscription. Likely timing mismatch, recommend check T+1 batch.” That note, automatically generated, can reduce analyst triage time by 60-70% on routine exceptions.
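Assembling the structured exception data into that prompt is mundane but worth showing, because the quality of the candidate-match context largely determines the quality of the narrative. A sketch with hypothetical field names:

```python
# Field names (ref, amount, settled_at, order_id, placed_at) are
# illustrative, not a production schema.
def build_triage_prompt(exc: dict, candidates: list[dict]) -> str:
    lines = [
        "Summarise this reconciliation exception for an analyst.",
        "State the likely discrepancy type and a recommended next check.",
        "",
        f"Exception: ref={exc['ref']} amount={exc['amount']} "
        f"settled={exc['settled_at']}",
        "Closest internal candidates:",
    ]
    for c in candidates:
        lines.append(
            f"- order={c['order_id']} amount={c['amount']} placed={c['placed_at']}"
        )
    return "\n".join(lines)
```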

Exception triage and routing. Classifying exceptions by root cause — timing mismatch, amount discrepancy, reference field issue, genuine unmatched transaction — determines which team handles them. LLMs classify exception types reliably from the structured exception data, allowing automatic routing to the right queue without manual triage.

Where LLMs Fail in Reconciliation

The failure modes are as important as the wins, because deploying LLMs in the wrong places breaks reconciliation rather than fixing it.

High-precision amount matching. LLMs hallucinate. For tasks where the answer is “does $1,247,832.44 equal $1,247,832.44,” no LLM is reliable enough. The answer is a database comparison, not a language model prompt. Any reconciliation workflow that routes amount-matching decisions through an LLM is one hallucination away from a material accounting error. This line is absolute: amounts and reference fields are deterministic-matching territory only.

Audit trail generation. SOX compliance and financial audit requirements demand that reconciliation decisions be reproducible and explainable. LLM outputs are not deterministic — the same input can produce different outputs at different temperatures, and the reasoning is not fully auditable. For any reconciliation that touches audit-critical records, human review or deterministic logic is required regardless of what the LLM suggests.

Multi-leg settlement chains. Complex reconciliation — netting arrangements, multi-currency triangulation, funds-in-transit across multiple settlement windows — requires precise multi-step logic that LLMs handle unreliably. The combinatorial search space is better handled by purpose-built matching engines.

The Architecture That Actually Ships

The production architecture that works combines deterministic and LLM components with clear handoff rules:

Layer 1: Deterministic matching engine. Exact match on transaction ID, amount, and timestamp within a tolerance window. This handles the 95-98% clean volume. Nothing else touches these records once matched. Tools: Python with pandas, or a purpose-built reconciliation platform (AutoRek, ReconArt, or Aurum depending on ERP integration requirements).

Layer 2: LLM triage for exceptions. Unmatched records from Layer 1 flow into an LLM classification step. The LLM receives the exception data and the closest candidate matches, classifies the exception type, generates a narrative, and routes to the appropriate queue. It does not make financial decisions — it makes routing and summarisation decisions. Tools: GPT-4o or Claude Sonnet via API, with structured output format enforced.
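"Structured output format enforced" is worth making concrete: the triage layer should validate the model's reply against an allow-list of exception types, and anything outside the schema falls back to manual triage rather than a guessed route. A minimal sketch, with the type vocabulary taken from the categories discussed above:

```python
import json

ALLOWED_TYPES = {"timing_mismatch", "amount_discrepancy",
                 "reference_issue", "unmatched"}

def parse_triage(raw: str) -> dict:
    """Validate an LLM triage reply. A malformed or out-of-vocabulary
    response routes to manual triage, never to a guessed queue."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return {"type": "manual", "narrative": ""}
    if not isinstance(obj, dict) or obj.get("type") not in ALLOWED_TYPES:
        narrative = obj.get("narrative", "") if isinstance(obj, dict) else ""
        return {"type": "manual", "narrative": str(narrative)}
    return {"type": obj["type"], "narrative": str(obj.get("narrative", ""))}
```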

Layer 3: Human review for amount discrepancies. Any exception involving an amount difference above a defined threshold (typically $1 for low-volume, $10 for high-volume) routes directly to a human reviewer, bypassing LLM involvement entirely. The LLM can summarise context; it cannot approve a resolution.
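The routing rule is a one-liner, which is exactly the point — it must be deterministic code, not a model decision. Using the thresholds above:

```python
def route_exception(amount_diff: float, volume_tier: str = "high") -> str:
    """Amount discrepancies above threshold bypass the LLM entirely.
    Thresholds per the architecture: $1 low-volume, $10 high-volume."""
    threshold = 1.00 if volume_tier == "low" else 10.00
    return "human_review" if abs(amount_diff) > threshold else "llm_triage"
```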

Layer 4: Escalation and ageing. Exceptions unresolved after a defined SLA (24 hours for routine, 4 hours for above-threshold) escalate automatically regardless of LLM classification.
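The ageing check is equally deterministic — a scheduled job comparing open time against the SLA, regardless of what the LLM classified the item as. A sketch using the SLAs above:

```python
from datetime import datetime, timedelta

# SLAs from the architecture: 24h routine, 4h above-threshold.
SLA = {"routine": timedelta(hours=24), "above_threshold": timedelta(hours=4)}

def is_breached(opened_at: datetime, now: datetime, severity: str) -> bool:
    """True once an exception has aged past its SLA and must escalate."""
    return now - opened_at > SLA[severity]
```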

Build vs Buy

The build case: your PSP settlement file formats are non-standard, your internal transaction IDs have unusual structure, you process in unusual currency combinations, or you have netting arrangements that commercial tools don’t handle. If you’re rebuilding the deterministic layer anyway, adding LLM triage on top of your own pipeline is straightforward.

The buy case: standard PSP integrations, standard ERP (NetSuite, SAP), and exception volumes below ~500/day. Commercial reconciliation tools have invested significantly in ML-assisted matching — AutoRek’s AI matching, ReconArt’s exception handling — and are now integrating LLM-based narrative generation. For most operators, getting a commercial tool configured correctly delivers faster than a custom build.

The decision is often driven by the settlement file zoo. If you’re ingesting files from Stripe, Adyen, PayPal, and three regional acquirers, each in proprietary format, the format-normalisation layer alone is a substantial project. At that point, building the LLM triage layer on top of your normalisation effort makes more sense.

What to Measure

The metric that matters is exception resolution time, not exception rate. Exception rate is largely determined by your transaction mix and PSP quality, not by your reconciliation stack. Resolution time — how quickly an exception is identified, routed, and closed — is directly improved by LLM triage and routing.

A well-implemented LLM triage layer typically reduces analyst time-per-exception by 40-60% on the triage step (reading the exception, understanding the context, writing the first note). That’s meaningful for finance operations teams processing hundreds of exceptions daily, and it’s achievable without introducing the hallucination risks that come with asking LLMs to make matching decisions.

The discipline is using LLMs where their capabilities match the task. Semantic understanding, narrative generation, classification — yes. Financial arithmetic, audit trails, amount matching — no. The reconciliation stack that ships is the one that respects that boundary.

By Shaun Toh · Director, Digital Payments · Razer
