
Fraud Incident Response Runbook for Payment Operators

A live fraud spike is an attack in progress, not a metric. An incident-command runbook to confirm, classify, contain with reversible controls, and recover.

By Shaun Toh
TL;DR

A live fraud attack is an incident, not a dashboard number. This runbook is the command workflow: confirm it is real, classify the attack type, contain with reversible controls while watching the false-positive cost, then recover by rolling those controls back deliberately.

Operator Summary

A fraud incident is a live attack in progress — a spike of fraudulent auths, abuse, or account takeovers — not a steady-state metric, so run it as an incident. First confirm it is real, not a model blip or a legitimate surge, then classify the attack type and contain with proportional, reversible controls: velocity tightening, step-up auth, temporary payment-method, BIN, country, or device-cluster restrictions, payout holds, reward freezes, and manual-review escalation. Watch the false-positive cost throughout — over-blocking legitimate users is its own incident. Assign an incident owner and war-room roles, preserve evidence, communicate internally, recover by rolling back the temporary rules safely, and run a blameless post-mortem. Per-attack detection lives in the detection articles; this runbook is the command workflow.

A fraud incident is not a number on a dashboard. It is an attack in progress: a sudden wave of fraudulent authorizations against your checkout, a burst of account takeovers, a promo campaign being drained by a multi-accounting ring, or a payout path being routed toward an attacker. The fraud-operations scorecard tells you, over weeks, whether your controls are healthy. An incident is the hour in which they are being actively beaten, and the clock is running.

The detection articles on this site tell you how to spot each of these attacks — the signals, the models, the thresholds. This runbook is about what you do once one is live: how to confirm it, classify it, contain it without breaking your good customers, and recover cleanly afterward. Detection is the smoke alarm; this is the fire drill. The two are different disciplines, and conflating them is how operators end up staring at a screaming dashboard with no agreed process for who decides what.

The structure here is deliberately the standard security incident-response lifecycle — detection and analysis, containment, recovery, and a post-incident review — adapted to payments fraud. That lifecycle is well-trodden public ground (NIST SP 800-61 and the SANS PICERL process), so this runbook does not reinvent it; it translates it into the specific levers, roles, and metrics a payments operator reaches for when fraud, rather than a server, is the thing on fire.

What counts as a fraud incident

Not every alert is an incident. The single most expensive mistake at the start is mobilizing a war-room for something that was never an attack — and the second is dismissing a real attack as noise. Before you declare, separate three things that can all look alike on a chart:

  • A real attack. A coordinated, often automated, adversary is exploiting a specific weakness — testing stolen cards, taking over accounts, farming a promo, pushing fraudulent disbursements. The signature is usually a sharp, structured spike: concentrated in particular BINs, devices, IP ranges, account clusters, or a single endpoint.
  • A model blip or alerting artifact. A retrained model, a threshold change, a data-pipeline lag, or a downstream outage can make fraud signals jump without any real attacker. Check whether anything in your stack changed before you assume an adversary did.
  • A legitimate surge. A viral moment, an influencer post, a flash sale, or expansion into a new country produces a spike in volume — and sometimes in declines — that is real demand, not fraud. Block it and you have manufactured an incident out of your best day.

Confirm it is real before you mobilize. Severity tiers help here, and the ones below are illustrative — calibrate them to your own volume and risk appetite:

  • SEV-1 — material, ongoing loss or a live takeover wave; payout integrity or large auth volume at risk. Full war-room, incident owner, immediate containment.
  • SEV-2 — a confirmed attack with contained blast radius (one segment, one method). Owner assigned, targeted controls, monitored.
  • SEV-3 — a suspicious but unconfirmed pattern. Investigate and quantify; do not yet apply customer-facing blocks.

Declare an incident when the pattern is confirmed structured and it is causing — or is about to cause — real loss or abuse. Declaring promotes the situation from “an analyst is looking at it” to “someone owns it, and there is a process.”
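Written tier definitions are easier to apply under pressure when they are encoded as a rule rather than argued in the moment. The following is a minimal sketch of the three tiers above, not a calibrated policy: the `Assessment` fields and both threshold constants are hypothetical and would need to be replaced with your own figures.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    SEV1 = "SEV-1"
    SEV2 = "SEV-2"
    SEV3 = "SEV-3"


@dataclass(frozen=True)
class Assessment:
    confirmed_attack: bool     # structured pattern confirmed, not a blip or a surge
    loss_per_hour: float       # estimated money at risk per hour
    payout_path_at_risk: bool  # disbursements could reach the attacker
    segments_affected: int     # distinct segments or methods under attack


# Hypothetical thresholds; calibrate to your own volume and risk appetite.
SEV1_LOSS_PER_HOUR = 10_000.0
SEV1_MIN_SEGMENTS = 2


def classify(a: Assessment) -> Severity:
    """Map a triage assessment onto the illustrative severity tiers."""
    if not a.confirmed_attack:
        # Suspicious but unconfirmed: investigate and quantify, no blocks yet.
        return Severity.SEV3
    if (a.payout_path_at_risk
            or a.loss_per_hour >= SEV1_LOSS_PER_HOUR
            or a.segments_affected >= SEV1_MIN_SEGMENTS):
        # Material ongoing loss, payout integrity at risk, or a wide blast radius.
        return Severity.SEV1
    # Confirmed attack with a contained blast radius.
    return Severity.SEV2
```

An unconfirmed pattern stays at SEV-3 no matter how large the apparent loss, which mirrors the rule that customer-facing blocks wait for confirmation.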

The incident lifecycle

The workflow below is the standard incident-response lifecycle applied to fraud: detect → triage → contain → communicate → recover → post-mortem. It maps directly onto the public frameworks — NIST SP 800-61’s detection-and-analysis, containment/eradication/recovery, and post-incident-activity phases, and the SANS PICERL sequence (Preparation, Identification, Containment, Eradication, Recovery, Lessons Learned). Nothing about the shape of fraud response is novel; what is payments-specific is the content of each phase. The rest of this runbook walks them in order, starting with the triage decision that routes the whole incident.

Triage: classify without re-teaching detection

Triage in an incident is not the place to re-derive how each attack is detected — that work lives in the detection articles, and during an incident you do not have time to redo it. The triage job is narrower and faster: take the observed signal, name the most likely attack type, and route it to the containment lever that fits. Get the classification wrong and you apply the wrong control — tightening card-auth velocity does nothing to a promo-farming ring.

The table maps the common cases. The detection reference is where the how-do-I-know lives; the containment lever is what you reach for once you do.

| Observed signal | Likely attack type | Detection reference | Primary containment lever |
|---|---|---|---|
| Burst of low-value auths/declines across many cards on one endpoint | Card testing / enumeration | card testing | Velocity tightening, CAPTCHA, BIN/IP rate limits |
| Spike in logins, credential-stuffing patterns, profile/payout changes | Account takeover | account takeover | Step-up authentication, session/device challenges, payout holds |
| Surge in signups, referrals, or reward claims clustered by device/instrument | Promo / referral abuse | promo/referral abuse | Reward freeze, eligibility hardening, cluster rules |
| Rising disputes/refunds from real customers’ own transactions | First-party / friendly fraud | first-party fraud | Evidence capture, dispute-defense routing (not blunt blocking) |
| New accounts with fabricated identities passing onboarding, then transacting | Synthetic identity | synthetic identity | Onboarding step-up, manual-review escalation, account-cluster rules |
| Fraudulent or coerced real-time transfers / push payments | Authorized push payment (APP) fraud | APP fraud | Payout/transfer holds, beneficiary checks, manual review |

If the signal does not fit cleanly, treat it as a SEV-3 and quantify before acting — a misclassified incident is worse than a slow one.
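The triage table can be kept as data so that classification routes straight to a lever. This is a sketch only: the `AttackType` enum, the `PRIMARY_LEVERS` mapping, and the `route` function are hypothetical names, and the lever strings simply restate the table.

```python
from enum import Enum
from typing import Optional


class AttackType(Enum):
    CARD_TESTING = "card testing"
    ACCOUNT_TAKEOVER = "account takeover"
    PROMO_ABUSE = "promo/referral abuse"
    FIRST_PARTY = "first-party fraud"
    SYNTHETIC_IDENTITY = "synthetic identity"
    APP_FRAUD = "APP fraud"


# Primary containment levers per attack type, copied from the triage table.
PRIMARY_LEVERS: dict[AttackType, list[str]] = {
    AttackType.CARD_TESTING: ["velocity tightening", "CAPTCHA", "BIN/IP rate limits"],
    AttackType.ACCOUNT_TAKEOVER: ["step-up authentication", "session/device challenges", "payout holds"],
    AttackType.PROMO_ABUSE: ["reward freeze", "eligibility hardening", "cluster rules"],
    AttackType.FIRST_PARTY: ["evidence capture", "dispute-defense routing"],
    AttackType.SYNTHETIC_IDENTITY: ["onboarding step-up", "manual-review escalation", "account-cluster rules"],
    AttackType.APP_FRAUD: ["payout/transfer holds", "beneficiary checks", "manual review"],
}


def route(attack_type: Optional[AttackType]) -> list[str]:
    """Return the levers for a classified attack, or the SEV-3 fallback."""
    if attack_type is None:
        # Signal does not fit cleanly: quantify before acting.
        return ["treat as SEV-3", "quantify before acting"]
    return PRIMARY_LEVERS[attack_type]
```

Passing `None` for an unclassified signal makes the "quantify first" fallback explicit instead of leaving it to whoever is on call.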

First-hour actions

The first hour decides whether the incident stays contained or spreads. Work these in order; the sequence matters more than speed on any single step.

  1. Declare and assign the incident owner. One named person owns the incident end to end — the decision-maker, not necessarily the most senior person in the room. Everything else hangs off this.
  2. Open the war-room. A single channel and a running, timestamped log. From this moment, every signal seen, control applied, and decision made is written down as it happens.
  3. Quantify before blocking. Size the attack (volume, rate, targeted segments, money at risk) before you apply customer-facing controls. Blocking blind can miss the attack entirely while harming good traffic. The same principle drives the card-testing instance in the Incident Response section of the card testing article, which is the tactical card-testing playbook (BIN concentration, CAPTCHA, PSP fraud-team engagement); this runbook generalizes it across attack types.
  4. Preserve evidence early. Snapshot the attack signature and the state of your rules before you start changing things. Once you tighten controls, the original signal is gone (see the evidence section below).
  5. Apply proportional containment. Reach for the narrowest control that fits the classified attack type, time-boxed from the moment you switch it on.
  6. Communicate. Tell the internal stakeholders — risk, payments, support, finance, and where relevant compliance/legal — that an incident is live, what is known, and who owns it.
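The running, timestamped log from step 2 can be as simple as an append-only list of entries. The sketch below is one hypothetical shape for it; the `IncidentLog` and `LogEntry` names and the four entry kinds are assumptions, not a prescribed schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class LogEntry:
    at: datetime
    kind: str   # hypothetical kinds: "signal", "control", "decision", "comms"
    actor: str
    note: str


@dataclass
class IncidentLog:
    owner: str  # the single named incident owner
    entries: list[LogEntry] = field(default_factory=list)

    def record(self, kind: str, actor: str, note: str,
               at: Optional[datetime] = None) -> LogEntry:
        """Append an entry, stamping it with the current UTC time by default."""
        stamp = at if at is not None else datetime.now(timezone.utc)
        entry = LogEntry(stamp, kind, actor, note)
        self.entries.append(entry)
        return entry

    def timeline(self) -> list[str]:
        """Render entries in chronological order for updates and the post-mortem."""
        ordered = sorted(self.entries, key=lambda e: e.at)
        return [f"{e.at.isoformat()} [{e.kind}] {e.actor}: {e.note}" for e in ordered]
```

Entries are frozen so that a record, once written, is not quietly edited later; corrections go in as new entries.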

Containment controls

Containment is where incidents are won or lost on the false-positive axis. The governing rule: every control here is proportional and reversible, and every control is time-boxed from the moment it goes on. Nothing in this section is a permanent setting; each is a temporary tourniquet you will deliberately remove.
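One way to make "time-boxed from the moment it goes on" enforceable is to attach an expiry to every control at creation. The sketch below assumes a hypothetical `TemporaryControl` record; in this version a lapsed control stops being active and is flagged for review, which is a design choice rather than a recommendation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class TemporaryControl:
    name: str                 # e.g. "velocity_tighten" (hypothetical identifier)
    scope: str                # the segment the control is limited to
    enabled_at: datetime      # written down when the control is switched on
    ttl: timedelta            # the time box agreed in the war-room
    removed_at: Optional[datetime] = None

    @property
    def expires_at(self) -> datetime:
        return self.enabled_at + self.ttl

    def is_active(self, now: datetime) -> bool:
        """Enforced only while inside the time box and not yet rolled back."""
        return self.removed_at is None and now < self.expires_at

    def needs_review(self, now: datetime) -> bool:
        """True once the time box has lapsed without an explicit rollback."""
        return self.removed_at is None and now >= self.expires_at

    def roll_back(self, now: datetime) -> None:
        """Record the deliberate removal of the control."""
        self.removed_at = now
```

If failing open on expiry is too risky mid-attack, the alternative is to keep enforcing past `expires_at` and page the incident owner whenever `needs_review` turns true.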

  • Velocity tightening. Lower the rate limits a velocity check enforces — attempts per card, per device, per IP, per account, per window — below normal for the duration. The fastest blunt instrument against automated attacks; also the easiest to over-apply, so scope it to the attacked segment where you can.
  • Step-up authentication. Force additional verification on risky sessions or transactions — a challenge, a re-auth, a stronger factor. Where card rails apply, this can mean routing more traffic through SCA and 3DS2 so the issuer carries more of the authentication. Reversible: you relax the trigger when the attack subsides.
  • Temporary payment-method restrictions. Disable or hard-gate a method being abused (a specific card type, wallet, or rail) while the attack runs. Coarse, so prefer it only when the abuse is concentrated in one method.
  • BIN / issuer / country / device / account-cluster rules. Targeted blocks or step-ups scoped to the dimension where the attack actually concentrates — a BIN range, an issuer, a country, a device cluster, or a linked-account cluster. Cluster-level rules are the surgical option; country and BIN bans are the blunt option and the most likely to catch good customers.
  • Payout holds. Pause disbursements or withdrawals to the suspect accounts or routes while you verify. This sits directly on the payout path, so coordinate with the payout/disbursement failure runbook — holding genuine payouts is itself a customer-impacting event.
  • Promo / reward freezes. Freeze the incentive (the bonus, credit, or referral payout) rather than the user’s own funds. This is the abuse instance of a reversible rollback; the proportionality, evidence, and appeal discipline for it lives in the Operational response section of the promo/referral abuse controls article, which this runbook points to rather than restating.
  • Manual-review-queue escalation. Divert suspect traffic to human reviewers instead of auto-declining it. Slower and capacity-bound, but it preserves good customers and generates evidence — the right lever when the false-positive cost of an auto-block is high.

Apply the narrowest lever that contains the attack, write down when you switched it on, and set the expectation that it comes off again.
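Velocity tightening scoped to the attacked segment can be sketched as a sliding-window counter with per-segment overrides. Everything here is illustrative: the `VelocityCheck` class, the segment labels, and the limits are assumptions, and a production check would live in a shared store rather than in process memory.

```python
from collections import defaultdict, deque


class VelocityCheck:
    """Sliding-window attempt counter with temporary per-segment limits."""

    def __init__(self, limit: int, window_seconds: float) -> None:
        self.limit = limit                    # normal attempts allowed per window
        self.window_seconds = window_seconds
        self._overrides: dict[str, int] = {}  # segment -> tightened limit
        self._events: dict[str, deque] = defaultdict(deque)

    def tighten(self, segment: str, limit: int) -> None:
        """Apply a lower limit to one segment only, for the incident."""
        self._overrides[segment] = limit

    def restore(self, segment: str) -> None:
        """Roll the segment back to the normal limit."""
        self._overrides.pop(segment, None)

    def allow(self, key: str, segment: str, now: float) -> bool:
        """Return True if this attempt fits under the limit for its segment."""
        events = self._events[key]
        # Drop attempts that have aged out of the window.
        while events and now - events[0] >= self.window_seconds:
            events.popleft()
        limit = self._overrides.get(segment, self.limit)
        if len(events) >= limit:
            return False
        events.append(now)
        return True
```

Because the override is keyed by segment, traffic outside the attacked BIN or device cluster keeps the normal limit, and `restore` is the documented way to take the control off again.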

Avoiding over-blocking legitimate users

This section is load-bearing, because the most common self-inflicted wound in a fraud incident is over-blocking. Over-blocking legitimate users is its own incident — it just shows up on the approval-rate and revenue side of the ledger instead of the loss side, which is exactly why it gets ignored under pressure.

When an attack is live, the instinct is to reach for the broadest, most reassuring control — ban the country, block the BIN, slam the velocity limits to the floor. Those blunt rules feel decisive and they do stop the attack, but they also decline the good customers who happen to share that country, that BIN, or that traffic window. The damage is invisible on the fraud dashboard and very visible in next week’s approval rate.

Three disciplines keep this in check:

  • Watch the approval-rate impact live. Put the approval rate — segmented, against baseline — on the same screen as the attack signal. A containment control that drops good-customer approvals is not free; you are trading one loss for another and should know the exchange rate.
  • Prefer targeted cluster rules over broad bans. A rule scoped to the device cluster, account ring, or BIN-plus-behavior signature that defines the attack will catch far fewer good customers than a whole-country or whole-BIN ban that merely happens to contain it.
  • Escalate to step-up or manual review before you hard-decline. A challenge or a review queue lets a real customer through and only stops the attacker; a hard decline stops both. When in doubt, add friction before you add a wall.

The goal of containment is to stop the attacker at the lowest cost to everyone else — not to drive the fraud number to zero at any price.
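The "exchange rate" of a control can be made concrete by comparing segmented approval rates against baseline. The sketch below assumes you can supply `(approved, attempted)` counts per segment for traffic you believe is legitimate; the function names and the segment keys are hypothetical.

```python
def approval_rate(approved: int, attempted: int) -> float:
    """Share of attempts approved; 0.0 when there is no traffic."""
    return approved / attempted if attempted else 0.0


def approval_impact(
    baseline: dict[str, tuple[int, int]],
    current: dict[str, tuple[int, int]],
) -> dict[str, float]:
    """Per-segment change in approval rate, in percentage points.

    Both inputs map a segment to (approved, attempted) counts. A negative
    value means good-customer approvals are being lost relative to baseline.
    """
    impact: dict[str, float] = {}
    for segment, (approved, attempted) in current.items():
        base_approved, base_attempted = baseline[segment]
        delta = (approval_rate(approved, attempted)
                 - approval_rate(base_approved, base_attempted))
        impact[segment] = round(delta * 100, 2)
    return impact


baseline = {"SG": (900, 1000), "US": (850, 1000)}
during_incident = {"SG": (450, 600), "US": (425, 500)}
print(approval_impact(baseline, during_incident))  # {'SG': -15.0, 'US': 0.0}
```

In this made-up example the control costs 15 points of approval in one segment and nothing in the other, which is the signal to narrow its scope rather than to remove it.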

Incident roles and the war-room

An incident without assigned roles is a crowd watching a chart. The war-room exists so that decisions have an owner and information has a single home. The roles below are functions, not headcount — in a small team one person wears several hats, but each function still has to be covered.

| Role | Owns during the incident |
|---|---|
| Incident owner / commander | The decision of record: declares, sets severity, approves containment, calls recovery and stand-down. The single accountable person. |
| Fraud / risk analyst | Classifying the attack, sizing it, choosing and tuning the specific controls, watching the false-positive cost. |
| Payments / engineering | Applying controls in the systems, pulling telemetry, and ensuring changes are reversible and logged. |
| Support lead | Front-line customer impact: prepared responses, “was I blocked?” handling, feeding customer reports back as signal. |
| Finance | Quantifying money at risk and money lost; payout-hold and reconciliation implications. |
| Compliance / legal liaison | The owner of any notification or reporting question, which is not a runbook decision (see communication, below). |
| Comms | Internal cadence and, where needed, external messaging principles: factual, no over-promising, no premature blame. |

The owner runs the room; everyone else reports into it. The log is the room’s memory.

Communication: internal and external

Communication during an incident is mostly internal, and mostly about keeping a shared, current picture. Run a regular internal cadence — a short, timestamped update at a fixed interval — across risk, payments, support, finance, and compliance/legal, so no one is acting on a stale view. State what is known, what is being done, what is uncertain, and what changed since the last update.

External communication, where it is needed at all, follows principles rather than a script: be factual, do not over-promise a fix time or a “no one was affected” assurance you cannot stand behind, and do not assign blame before the root cause is confirmed. Support gets a prepared, honest line for customers who were caught by a control.

On notifications and regulatory reporting: route the question to compliance and legal — do not answer it from this runbook. Whether a given incident triggers a customer-notification, breach-notification, or regulatory-reporting obligation depends on jurisdiction, the nature of the incident, and your specific regulatory status, and it changes over time. This runbook deliberately does not prescribe universal obligations and does not give legal advice. The operational rule is simply: flag it early to counsel, and let them own the determination.

Evidence preservation

Evidence is the first casualty of a fast containment, so capture it before you change anything. The moment you tighten rules, the attack signature you were seeing is gone — and you will want it for the post-mortem, for disputes, and for audit. Capture, at minimum:

  • The attack signature — the BINs, devices, IPs, account clusters, endpoints, and patterns that defined it, as they looked at peak.
  • The timeline — first signal, detection, declaration, each containment action with a timestamp, and stand-down.
  • The rules you applied and when — exactly which controls went on, scoped how, at what time, and when each came off.
  • The decisions and their rationale — who decided what, and why, while it was fresh.
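A snapshot is more useful in disputes and audits if it can later be shown to be unaltered. The sketch below freezes the signature and rule state into a canonical JSON payload and stores a SHA-256 digest alongside it; the function names and the payload fields are assumptions, and where the snapshot is stored is out of scope.

```python
import hashlib
import json
from datetime import datetime


def _canonical(payload: dict) -> str:
    # Sorted keys and fixed separators give a stable byte representation.
    return json.dumps(payload, sort_keys=True, separators=(",", ":"))


def snapshot_evidence(signature: dict, active_rules: list, captured_at: datetime) -> dict:
    """Freeze the attack signature and rule state before any control changes."""
    canonical = _canonical({
        "captured_at": captured_at.isoformat(),
        "signature": signature,
        "active_rules": active_rules,
    })
    return {
        # Re-parse so the snapshot does not share objects with the caller.
        "payload": json.loads(canonical),
        "sha256": hashlib.sha256(canonical.encode("utf-8")).hexdigest(),
    }


def verify_snapshot(snapshot: dict) -> bool:
    """Check that the payload still matches the digest taken at capture time."""
    canonical = _canonical(snapshot["payload"])
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest() == snapshot["sha256"]
```

Taking the snapshot before the first control goes on preserves the signature at peak; the digest only proves integrity if it is stored somewhere the payload's editors cannot also rewrite.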

For card-scheme monitoring in particular, documented attack events matter: a well-documented, externally caused, remediated spike is context that acquirers and Visa can weigh in discussions under the Visa Acquirer Monitoring Program (VAMP). The mechanics of that live in the chargeback-rules reference and the documentation point in card testing’s incident-response section; this runbook’s job is only to make sure the evidence exists to use them. Evidence underpins disputes, the post-mortem, and any audit that follows.

Recovery: rolling back temporary controls safely

This is the most-skipped step in the entire lifecycle, and the one that quietly does the most long-term damage. The temporary controls you applied during the incident — the tightened velocity, the country block, the step-up trigger, the payout hold — must be rolled back deliberately, or they silently become permanent. A velocity limit slammed to the floor during an attack and never restored is now a standing false-positive source that nobody remembers creating.

Recover in stages, not all at once:

  • Verify the attack has actually stopped — not just paused. Confirm the signal has returned to baseline and stayed there, accounting for the attacker probing whether you have relaxed.
  • Roll back one control at a time, monitored. Loosen the most customer-impacting controls first, watch for the attack resuming and for approval rate recovering, and pause if the signal returns.
  • Document which controls stay. Some incident learnings become permanent controls deliberately — that is fine, but it should be a recorded decision with an owner, not an accident of a rollback that never happened.

A clean recovery restores the approval rate as carefully as the containment protected the loss rate. An incident is not closed when the attack stops; it is closed when the temporary controls are off and the system is back to its normal operating point.
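The staged rollback can be expressed as a loop that removes one control, checks the signal, and stops if the attack resumes. This is a sketch under assumptions: `remove`, `restore`, and `attack_resumed` are hypothetical callables you would wire to your own rule engine and monitoring, and in practice `attack_resumed` would wait out a monitoring window before answering.

```python
from typing import Callable


def staged_rollback(
    controls: list[str],
    remove: Callable[[str], None],
    restore: Callable[[str], None],
    attack_resumed: Callable[[], bool],
) -> list[str]:
    """Remove controls one at a time, in the order given.

    Returns the controls that were removed and stayed off. If the attack
    signal returns after a removal, that control is re-applied and the
    rollback pauses with the remaining controls still in place.
    """
    removed: list[str] = []
    for control in controls:
        remove(control)
        if attack_resumed():
            restore(control)
            break
        removed.append(control)
    return removed
```

Ordering `controls` with the most customer-impacting first matches the guidance above, and the returned list is what goes into the log as rolled back.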

Post-incident review

Every incident ends in a blameless post-mortem — blameless because the goal is to fix the system, and people hide information when the exercise is about assigning fault. Reconstruct the timeline from the log: when the attack started, when the first signal appeared, when you detected it, when you contained it, and when you recovered. Then find the control gap that let the attack in or let it grow before you saw it.

The single most actionable number is usually the gap between first signal and detection: the time the attack was running before anyone declared. Closing that gap is almost always higher-leverage than adding another control. Feed the fixes back: the detection threshold that should have fired sooner, the containment lever that was missing or untested, the classification that took too long. And update this runbook itself; a runbook that is not revised after each incident decays into fiction.

Incident KPI scorecard

These are incident-lifecycle metrics — they measure how well you ran this incident, and they are distinct from the steady-state fraud scorecard. The cross-cutting loss, detection, friction, and model-quality metrics live in fraud operations KPIs, which is the sibling scorecard you track continuously; the metrics below sit alongside it and are measured per incident.

| Metric | Definition | Why it matters |
|---|---|---|
| Time to detect | First attack signal → incident detected/declared | The most actionable gap; every minute here is unmonitored attack time |
| Time to contain | Detected → attack materially stopped | How fast the war-room turned a declaration into effective controls |
| Attack approval leakage | Fraudulent volume/value that was approved during the incident window | The direct loss the incident let through before containment held |
| False-positive impact | Good-customer approvals lost to incident controls vs. baseline | The over-blocking cost made visible: the other side of the ledger |
| Manual-review backlog | Cases queued by incident escalation vs. review capacity | Whether the manual-review lever is absorbing or drowning |
| Chargeback spillover | Fraud that got approved and later lands as fraud-category chargebacks | The delayed financial tail of the incident; ties to scheme monitoring |
| Recovery time | Stand-down → all temporary controls rolled back to baseline | Measures the most-skipped step; long recovery time means lingering false positives |

Each is a per-incident number; tracked across incidents, they show whether your response is getting faster and cheaper or not.
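The three duration metrics fall straight out of the logged timeline. The sketch below assumes five hypothetical timestamps per incident, named after the definitions in the table; the remaining metrics need transaction and chargeback data and are not covered here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class IncidentTimeline:
    first_signal: datetime      # earliest attack signal found in hindsight
    detected: datetime          # incident detected/declared
    contained: datetime         # attack materially stopped
    stand_down: datetime        # incident owner calls stand-down
    controls_cleared: datetime  # last temporary control rolled back


def lifecycle_kpis(t: IncidentTimeline) -> dict[str, timedelta]:
    """Compute the per-incident duration metrics from the timeline."""
    return {
        "time_to_detect": t.detected - t.first_signal,
        "time_to_contain": t.contained - t.detected,
        "recovery_time": t.controls_cleared - t.stand_down,
    }
```

Storing one `IncidentTimeline` per incident is enough to chart these three numbers across incidents and see whether the response is getting faster.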

Operator readiness checklist

The work that determines how an incident goes happens before the attack. Have these in place ahead of time:

  • Declared severity tiers — written definitions of SEV-1/2/3 (or your equivalent) with the declaration criteria, agreed before you need them.
  • A named incident owner and on-call roster — one accountable owner role and a roster covering the war-room functions around the clock.
  • A war-room process — a known channel, a logging standard, and a stand-up cadence, so the room assembles itself rather than being invented mid-attack.
  • Pre-built and tested containment levers with a rollback plan — every lever above built, scoped, switchable, and exercised, each with a documented way to take it off again.
  • An evidence-capture standard — a defined list of what to snapshot, and the habit of capturing it before changing rules.
  • Comms templates: internal update format and external/support lines drafted in advance, not written under pressure.
  • A blameless post-mortem cadence — a standing expectation that every declared incident gets a review, with the timeline and the control gap as required outputs.
  • Detection references mapped to containment levers — the triage table above, kept current, so classification routes to the right control without re-deriving detection.

Scope note

The lifecycle in this runbook is adapted from public incident-response frameworks — NIST SP 800-61 (Rev. 2 for the four-phase lifecycle; Rev. 3, the current revision, reframes incident response around the NIST CSF 2.0 functions) and the SANS PICERL process — not from any payments-specific standard. Per-attack detection lives in the linked detection articles; this runbook covers the command workflow once an attack is confirmed live. Severity tiers, thresholds, roles, levers, and KPI definitions here are illustrative operator synthesis — calibrate them to your own volume, risk appetite, contracts, and authority. This is operational guidance, not legal advice. Customer-notification and regulatory-reporting obligations vary by jurisdiction, incident type, and regulatory status, change over time, and are a question for your counsel — they are deliberately not prescribed here. The throughline is proportional, reversible controls; preserved evidence; auditability; and active management of the false-positive cost, because over-blocking legitimate users is its own incident.

For term definitions — velocity check, SCA, 3DS2, BIN, and issuer — see the Payments Glossary.

Sources & methodology (5)

The incident-response lifecycle adapted here is the classic four-phase process defined in NIST SP 800-61 Revision 2, the Computer Security Incident Handling Guide (August 2012): preparation; detection and analysis; containment, eradication, and recovery; and post-incident activity

Rev. 2 is the source of the four-phase lifecycle cited here. It was superseded on 3 April 2025 by Rev. 3, which reframes incident response around the NIST CSF 2.0 functions rather than restating the lifecycle; cited for the lifecycle, not as the current revision.

NIST SP 800-61 Revision 3 (final, April 2025), Incident Response Recommendations and Considerations for Cybersecurity Risk Management: A CSF 2.0 Community Profile, is the current revision; it superseded Rev. 2 and maps incident-response guidance to the CSF 2.0 functions (Govern, Identify, Protect, Detect, Respond, Recover)

Cited to state the current revision accurately; the lifecycle framing in this runbook traces to Rev. 2, which Rev. 3 superseded.

The SANS Incident Handler's Handbook defines the six-phase PICERL incident-handling process — Preparation, Identification, Containment, Eradication, Recovery, and Lessons Learned — a widely used public IR process that aligns with the lifecycle used here

Secondary public anchor for the lifecycle; the 'Lessons Learned' phase corresponds to the blameless post-mortem in this runbook.

Acquirers and Visa accept documented attack events as context in monitoring-programme remediation discussions — documenting an externally caused, remediated attack spike (start date, volume, actions) is the operator practice that ties incident evidence to scheme monitoring exposure

Internal synthesis; the scheme-monitoring context is summarised, not quoted as a published scheme rule.

The payments-specific containment levers, incident roles, severity tiers, incident-lifecycle KPIs, and readiness checklist in this runbook are PaymentBrief operator synthesis — illustrative frameworks, not scheme rules or vendor commitments, and must be calibrated against your own contracts, baselines, and authority

By Shaun Toh · Director, Digital Payments · Razer
