Fraud Incident Response Runbook for Payment Operators
A live fraud attack is an incident, not a dashboard number. This runbook is the command workflow: confirm it is real, classify the attack type, contain with reversible controls while watching the false-positive cost, then recover by rolling those controls back deliberately.
A fraud incident is a live attack in progress — a spike of fraudulent auths, abuse, or account takeovers — not a steady-state metric, so run it as an incident. First confirm it is real, not a model blip or a legitimate surge, then classify the attack type and contain with proportional, reversible controls: velocity tightening, step-up auth, temporary payment-method, BIN, country, or device-cluster restrictions, payout holds, reward freezes, and manual-review escalation. Watch the false-positive cost throughout — over-blocking legitimate users is its own incident. Assign an incident owner and war-room roles, preserve evidence, communicate internally, recover by rolling back the temporary rules safely, and run a blameless post-mortem. Per-attack detection lives in the detection articles; this runbook is the command workflow.
A fraud incident is not a number on a dashboard. It is an attack in progress: a sudden wave of fraudulent authorizations against your checkout, a burst of account takeovers, a promo campaign being drained by a multi-accounting ring, or a payout path being routed toward an attacker. The fraud-operations scorecard tells you, over weeks, whether your controls are healthy. An incident is the hour in which they are being actively beaten, and the clock is running.
The detection articles on this site tell you how to spot each of these attacks — the signals, the models, the thresholds. This runbook is about what you do once one is live: how to confirm it, classify it, contain it without breaking your good customers, and recover cleanly afterward. Detection is the smoke alarm; this is the fire drill. The two are different disciplines, and conflating them is how operators end up staring at a screaming dashboard with no agreed process for who decides what.
The structure here is deliberately the standard security incident-response lifecycle — detection and analysis, containment, recovery, and a post-incident review — adapted to payments fraud. That lifecycle is well-trodden public ground (NIST SP 800-61 and the SANS PICERL process), so this runbook does not reinvent it; it translates it into the specific levers, roles, and metrics a payments operator reaches for when fraud, rather than a server, is the thing on fire.
What counts as a fraud incident
Not every alert is an incident. The single most expensive mistake at the start is mobilizing a war-room for something that was never an attack — and the second is dismissing a real attack as noise. Before you declare, separate three things that can all look alike on a chart:
- A real attack. A coordinated, often automated, adversary is exploiting a specific weakness — testing stolen cards, taking over accounts, farming a promo, pushing fraudulent disbursements. The signature is usually a sharp, structured spike: concentrated in particular BINs, devices, IP ranges, account clusters, or a single endpoint.
- A model blip or alerting artifact. A retrained model, a threshold change, a data-pipeline lag, or a downstream outage can make fraud signals jump without any real attacker. Check whether anything in your stack changed before you assume an adversary did.
- A legitimate surge. A viral moment, an influencer post, a flash sale, or expansion into a new country produces a spike in volume — and sometimes in declines — that is real demand, not fraud. Block it and you have manufactured an incident out of your best day.
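One quick way to test whether a spike is "structured" is to measure how concentrated the suspicious events are in a single dimension. The sketch below is illustrative only: the event fields (`bin`, `endpoint`) and the sample data are hypothetical, and no threshold is implied.

```python
from collections import Counter

def top_share(events, key):
    """Return (value, share) for the most common value of `key` across events."""
    counts = Counter(event[key] for event in events)
    if not counts:
        return None, 0.0
    value, count = counts.most_common(1)[0]
    return value, count / sum(counts.values())

# Hypothetical declined-auth events from the spike window.
declines = [
    {"bin": "411111", "endpoint": "/checkout"},
    {"bin": "411111", "endpoint": "/checkout"},
    {"bin": "411111", "endpoint": "/checkout"},
    {"bin": "550000", "endpoint": "/add-card"},
]

top_bin, bin_share = top_share(declines, "bin")
print(top_bin, bin_share)  # 411111 0.75
```

A share that is far above the same measure at baseline points toward a structured attack; a spike that keeps its usual mix across BINs, devices, and endpoints looks more like a legitimate surge or a pipeline artifact. Treat the comparison as one input to the decision, not a verdict.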
Confirm it is real before you mobilize. Severity tiers help here, and the ones below are illustrative — calibrate them to your own volume and risk appetite:
- SEV-1 — material, ongoing loss or a live takeover wave; payout integrity or large auth volume at risk. Full war-room, incident owner, immediate containment.
- SEV-2 — a confirmed attack with contained blast radius (one segment, one method). Owner assigned, targeted controls, monitored.
- SEV-3 — a suspicious but unconfirmed pattern. Investigate and quantify; do not yet apply customer-facing blocks.
Declare an incident when the pattern is confirmed as structured and it is causing, or is about to cause, real loss or abuse. Declaring promotes the situation from “an analyst is looking at it” to “someone owns it, and there is a process.”
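Written down as a decision rule, the illustrative tiers above might look like the sketch below. The `Assessment` fields are hypothetical names for the judgments the tiers describe; what counts as "material" or "large" is for you to calibrate.

```python
from dataclasses import dataclass

@dataclass
class Assessment:
    confirmed_attack: bool              # pattern confirmed as a structured attack
    material_ongoing_loss: bool         # loss is material and still accruing
    live_takeover_wave: bool            # account takeovers in progress
    payout_or_large_auth_at_risk: bool  # payout integrity or large auth volume exposed

def severity(a: Assessment) -> str:
    """Map an assessment onto the illustrative SEV tiers."""
    if not a.confirmed_attack:
        return "SEV-3"  # suspicious but unconfirmed: investigate and quantify
    if a.material_ongoing_loss or a.live_takeover_wave or a.payout_or_large_auth_at_risk:
        return "SEV-1"  # full war-room, immediate containment
    return "SEV-2"      # confirmed, contained blast radius

print(severity(Assessment(True, False, False, False)))  # SEV-2
```

Keeping the rule this small is deliberate: the tier should be decidable in seconds from judgments the analyst has already made.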
The incident lifecycle
The workflow below is the standard incident-response lifecycle applied to fraud: detect → triage → contain → communicate → recover → post-mortem. It maps directly onto the public frameworks — NIST SP 800-61’s detection-and-analysis, containment/eradication/recovery, and post-incident-activity phases, and the SANS PICERL sequence (Preparation, Identification, Containment, Eradication, Recovery, Lessons Learned). Nothing about the shape of fraud response is novel; what is payments-specific is the content of each phase. The rest of this runbook walks them in order, starting with the triage decision that routes the whole incident.
Triage: classify without re-teaching detection
Triage in an incident is not the place to re-derive how each attack is detected — that work lives in the detection articles, and during an incident you do not have time to redo it. The triage job is narrower and faster: take the observed signal, name the most likely attack type, and route it to the containment lever that fits. Get the classification wrong and you apply the wrong control — tightening card-auth velocity does nothing to a promo-farming ring.
The table maps the common cases. The detection reference is where the how-do-I-know lives; the containment lever is what you reach for once you do.
| Observed signal | Likely attack type | Detection reference | Primary containment lever |
|---|---|---|---|
| Burst of low-value auths/declines across many cards on one endpoint | Card testing / enumeration | card testing | Velocity tightening, CAPTCHA, BIN/IP rate limits |
| Spike in logins, credential-stuffing patterns, profile/payout changes | Account takeover | account takeover | Step-up authentication, session/device challenges, payout holds |
| Surge in signups, referrals, or reward claims clustered by device/instrument | Promo / referral abuse | promo/referral abuse | Reward freeze, eligibility hardening, cluster rules |
| Rising disputes/refunds from real customers’ own transactions | First-party / friendly fraud | first-party fraud | Evidence capture, dispute-defense routing (not blunt blocking) |
| New accounts with fabricated identities passing onboarding, then transacting | Synthetic identity | synthetic identity | Onboarding step-up, manual-review escalation, account-cluster rules |
| Fraudulent or coerced real-time transfers / push payments | Authorized push payment (APP) fraud | APP fraud | Payout/transfer holds, beneficiary checks, manual review |
If the signal does not fit cleanly, treat it as a SEV-3 and quantify before acting — a misclassified incident is worse than a slow one.
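The triage table can be kept as a lookup so that classification routes straight to a lever, with anything unrecognized falling back to SEV-3 and quantification. This is a minimal sketch; the dictionary keys and the shape of the returned record are hypothetical, while the levers are copied from the table.

```python
# Attack type -> primary containment levers, taken from the triage table.
CONTAINMENT_LEVERS = {
    "card_testing": ["velocity tightening", "CAPTCHA", "BIN/IP rate limits"],
    "account_takeover": ["step-up authentication", "session/device challenges", "payout holds"],
    "promo_abuse": ["reward freeze", "eligibility hardening", "cluster rules"],
    "first_party_fraud": ["evidence capture", "dispute-defense routing"],
    "synthetic_identity": ["onboarding step-up", "manual-review escalation", "account-cluster rules"],
    "app_fraud": ["payout/transfer holds", "beneficiary checks", "manual review"],
}

def route(attack_type):
    """Return the next action for a classified (or unclassified) signal."""
    levers = CONTAINMENT_LEVERS.get(attack_type)
    if levers is None:
        # Signal does not fit cleanly: treat as SEV-3 and quantify before acting.
        return {"action": "quantify", "severity": "SEV-3", "levers": []}
    return {"action": "contain", "levers": levers}

print(route("promo_abuse")["levers"][0])  # reward freeze
```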
First-hour actions
The first hour decides whether the incident stays contained or spreads. Work these in order; the sequence matters more than speed on any single step.
- Declare and assign the incident owner. One named person owns the incident end to end — the decision-maker, not necessarily the most senior person in the room. Everything else hangs off this.
- Open the war-room. A single channel and a running, timestamped log. From this moment, every signal seen, control applied, and decision made is written down as it happens.
- Quantify before blocking. Size the attack (volume, rate, targeted segments, money at risk) before you apply customer-facing controls. The same principle drives the card-testing-specific instance in card testing’s Incident Response section, which is the tactical card-testing playbook (BIN concentration, CAPTCHA, PSP fraud-team engagement); this runbook generalizes it across attack types. Blocking blind can miss the attack entirely while harming good traffic.
- Preserve evidence early. Snapshot the attack signature and the state of your rules before you start changing things. Once you tighten controls, the original signal is gone (see the evidence section below).
- Apply proportional containment. Reach for the narrowest control that fits the classified attack type, time-boxed from the moment you switch it on.
- Communicate. Tell the internal stakeholders — risk, payments, support, finance, and where relevant compliance/legal — that an incident is live, what is known, and who owns it.
Containment controls
Containment is where incidents are won or lost on the false-positive axis. The governing rule: every control here is proportional and reversible, and every control is time-boxed from the moment it goes on. Nothing in this section is a permanent setting; each is a temporary tourniquet you will deliberately remove.
- Velocity tightening. Lower the rate limits a velocity check enforces — attempts per card, per device, per IP, per account, per window — below normal for the duration. The fastest blunt instrument against automated attacks; also the easiest to over-apply, so scope it to the attacked segment where you can.
- Step-up authentication. Force additional verification on risky sessions or transactions — a challenge, a re-auth, a stronger factor. Where card rails apply, this can mean routing more traffic through SCA and 3DS2 so the issuer carries more of the authentication. Reversible: you relax the trigger when the attack subsides.
- Temporary payment-method restrictions. Disable or hard-gate a method being abused (a specific card type, wallet, or rail) while the attack runs. Coarse, so prefer it only when the abuse is concentrated in one method.
- BIN / issuer / country / device / account-cluster rules. Targeted blocks or step-ups scoped to the dimension where the attack actually concentrates — a BIN range, an issuer, a country, a device cluster, or a linked-account cluster. Cluster-level rules are the surgical option; country and BIN bans are the blunt option and the most likely to catch good customers.
- Payout holds. Pause disbursements or withdrawals to the suspect accounts or routes while you verify. This sits directly on the payout path, so coordinate with the payout/disbursement failure runbook — holding genuine payouts is itself a customer-impacting event.
- Promo / reward freezes. Freeze the incentive (the bonus, credit, or referral payout) rather than the user’s own funds. This is the abuse instance of a reversible rollback; the proportionality, evidence, and appeal discipline for it lives in the Operational response section of the promo/referral abuse controls article, which this runbook points to rather than restating.
- Manual-review-queue escalation. Divert suspect traffic to human reviewers instead of auto-declining it. Slower and capacity-bound, but it preserves good customers and generates evidence, which makes it the right lever when the false-positive cost of an auto-block is high.
Apply the narrowest lever that contains the attack, write down when you switched it on, and set the expectation that it comes off again.
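One way to make "time-boxed from the moment it goes on" enforceable is to record every temporary control with its scope, switch-on time, and review deadline. The sketch below assumes a hypothetical in-memory record; the control names, scopes, and durations are examples, not recommendations.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass(frozen=True)
class TemporaryControl:
    name: str
    scope: str             # the segment the control is limited to
    enabled_at: datetime   # when it was switched on
    time_box: timedelta    # how long before it must be reviewed or removed

    @property
    def review_due(self) -> datetime:
        return self.enabled_at + self.time_box

def due_for_review(controls, now):
    """Names of controls whose time box has elapsed."""
    return [c.name for c in controls if now >= c.review_due]

start = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
controls = [
    TemporaryControl("velocity_tightening", "endpoint=/add-card", start, timedelta(hours=2)),
    TemporaryControl("payout_hold", "account_cluster=ring-17", start, timedelta(hours=24)),
]
print(due_for_review(controls, start + timedelta(hours=3)))  # ['velocity_tightening']
```

A control that reaches its deadline is not removed automatically here; it is surfaced so the incident owner either extends it on the record or takes it off.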
Avoiding over-blocking legitimate users
This section is load-bearing, because the most common self-inflicted wound in a fraud incident is over-blocking. Over-blocking legitimate users is its own incident — it just shows up on the approval-rate and revenue side of the ledger instead of the loss side, which is exactly why it gets ignored under pressure.
When an attack is live, the instinct is to reach for the broadest, most reassuring control — ban the country, block the BIN, slam the velocity limits to the floor. Those blunt rules feel decisive and they do stop the attack, but they also decline the good customers who happen to share that country, that BIN, or that traffic window. The damage is invisible on the fraud dashboard and very visible in next week’s approval rate.
Three disciplines keep this in check:
- Watch the approval-rate impact live. Put the approval rate — segmented, against baseline — on the same screen as the attack signal. A containment control that drops good-customer approvals is not free; you are trading one loss for another and should know the exchange rate.
- Prefer targeted cluster rules over broad bans. A rule scoped to the device cluster, account ring, or BIN-plus-behavior signature that is the attack will catch far fewer good customers than a whole-country or whole-BIN ban that merely contains it.
- Escalate to step-up or manual review before you hard-decline. A challenge or a review queue lets a real customer through and only stops the attacker; a hard decline stops both. When in doubt, add friction before you add a wall.
The goal of containment is to stop the attacker at the lowest cost to everyone else — not to drive the fraud number to zero at any price.
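The "exchange rate" can be estimated per segment by comparing approvals during the incident with what the baseline approval rate would have produced on the same traffic. This is a rough sketch with hypothetical segment names and numbers; it is only meaningful on segments you believe are mostly good customers, because attacker traffic in the denominator inflates the estimate.

```python
def lost_good_approvals(baseline_rate, attempted, approved):
    """Approvals expected at baseline minus approvals observed, floored at zero."""
    expected = baseline_rate * attempted
    return max(0.0, expected - approved)

# Hypothetical segments of returning customers during the incident window.
segments = {
    "returning_customers_DE": {"baseline_rate": 0.90, "attempted": 1000, "approved": 820},
    "returning_customers_FR": {"baseline_rate": 0.90, "attempted": 500, "approved": 455},
}

impact = {name: lost_good_approvals(**s) for name, s in segments.items()}
for name, lost in impact.items():
    print(f"{name}: about {lost:.0f} approvals below baseline")
```

In this example the first segment is running well below baseline while the second is unaffected, which suggests the control's scope is hitting one segment harder than intended.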
Incident roles and the war-room
An incident without assigned roles is a crowd watching a chart. The war-room exists so that decisions have an owner and information has a single home. The roles below are functions, not headcount — in a small team one person wears several hats, but each function still has to be covered.
| Role | Owns during the incident |
|---|---|
| Incident owner / commander | The decision of record: declares, sets severity, approves containment, calls recovery and stand-down. The single accountable person. |
| Fraud / risk analyst | Classifying the attack, sizing it, choosing and tuning the specific controls, watching the false-positive cost. |
| Payments / engineering | Applying controls in the systems, pulling telemetry, and ensuring changes are reversible and logged. |
| Support lead | Front-line customer impact: prepared responses, “was I blocked?” handling, feeding customer reports back as signal. |
| Finance | Quantifying money at risk and money lost; payout-hold and reconciliation implications. |
| Compliance / legal liaison | The owner of any notification or reporting question — not a runbook decision (see communication, below). |
| Comms | Internal cadence and, where needed, external messaging principles — factual, no over-promising, no premature blame. |
The owner runs the room; everyone else reports into it. The log is the room’s memory.
Communication: internal and external
Communication during an incident is mostly internal, and mostly about keeping a shared, current picture. Run a regular internal cadence — a short, timestamped update at a fixed interval — across risk, payments, support, finance, and compliance/legal, so no one is acting on a stale view. State what is known, what is being done, what is uncertain, and what changed since the last update.
External communication, where it is needed at all, follows principles rather than a script: be factual, do not over-promise a fix time or a “no one was affected” assurance you cannot stand behind, and do not assign blame before the root cause is confirmed. Support gets a prepared, honest line for customers who were caught by a control.
On notifications and regulatory reporting: route the question to compliance and legal — do not answer it from this runbook. Whether a given incident triggers a customer-notification, breach-notification, or regulatory-reporting obligation depends on jurisdiction, the nature of the incident, and your specific regulatory status, and it changes over time. This runbook deliberately does not prescribe universal obligations and does not give legal advice. The operational rule is simply: flag it early to counsel, and let them own the determination.
Evidence preservation
Evidence is the first casualty of a fast containment, so capture it before you change anything. The moment you tighten rules, the attack signature you were seeing is gone — and you will want it for the post-mortem, for disputes, and for audit. Capture, at minimum:
- The attack signature — the BINs, devices, IPs, account clusters, endpoints, and patterns that defined it, as they looked at peak.
- The timeline — first signal, detection, declaration, each containment action with a timestamp, and stand-down.
- The rules you applied and when — exactly which controls went on, scoped how, at what time, and when each came off.
- The decisions and their rationale — who decided what, and why, while it was fresh.
For card-scheme monitoring in particular, documented attack events matter: a well-documented, externally caused, remediated spike is the context acquirers and Visa will consider, and it feeds the VAMP discussion. The mechanics of that live in the chargeback-rules reference and the documentation point in card testing’s incident-response section — this runbook’s job is only to make sure the evidence exists to use them. Evidence underpins disputes, the post-mortem, and any audit that follows.
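A snapshot taken before any rule changes can be as simple as a timestamped record serialized with a content hash, so that later readers can check it was not edited. The sketch below is an assumption about format, not a standard: the field names and sample values are hypothetical, and where the record is stored is left to your own tooling.

```python
import hashlib
import json
from datetime import datetime, timezone

def snapshot(attack_signature, active_rules, taken_at):
    """Serialize the pre-change state and return (payload, sha256 hex digest)."""
    record = {
        "taken_at": taken_at.isoformat(),
        "attack_signature": attack_signature,
        "rules_before_change": active_rules,
    }
    # sort_keys makes the serialization, and therefore the digest, deterministic.
    payload = json.dumps(record, sort_keys=True)
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return payload, digest

payload, digest = snapshot(
    {"bins": ["411111"], "endpoints": ["/add-card"]},   # hypothetical signature
    {"velocity_per_card_per_hour": 10},                 # hypothetical rule state
    datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc),
)
print(len(digest))  # 64
```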
Recovery: rolling back temporary controls safely
This is the most-skipped step in the entire lifecycle, and the one that quietly does the most long-term damage. The temporary controls you applied during the incident — the tightened velocity, the country block, the step-up trigger, the payout hold — must be rolled back deliberately, or they silently become permanent. A velocity limit slammed to the floor during an attack and never restored is now a standing false-positive source that nobody remembers creating.
Recover in stages, not all at once:
- Verify the attack has actually stopped — not just paused. Confirm the signal has returned to baseline and stayed there, accounting for the attacker probing whether you have relaxed.
- Roll back one control at a time, monitored. Loosen the most customer-impacting controls first, watch for the attack resuming and for approval rate recovering, and pause if the signal returns.
- Document which controls stay. Some incident learnings become permanent controls deliberately — that is fine, but it should be a recorded decision with an owner, not an accident of a rollback that never happened.
A clean recovery restores the approval rate as carefully as the containment protected the loss rate. An incident is not closed when the attack stops; it is closed when the temporary controls are off and the system is back to its normal operating point.
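The staged rollback above can be sketched as a loop that removes one control at a time, most customer-impacting first, and stops as soon as the attack signal reappears. The `remove` and `signal_returned` callables are hypothetical hooks into your own systems; in practice the check would run over a monitoring window rather than return immediately.

```python
def staged_rollback(controls, remove, signal_returned):
    """Remove controls one at a time, highest customer impact first.

    controls: list of (name, customer_impact) pairs.
    Returns (removed_names, paused_after), where paused_after is the control
    whose removal was followed by the attack signal, or None if none was.
    """
    ordered = sorted(controls, key=lambda c: c[1], reverse=True)
    removed = []
    for name, _impact in ordered:
        remove(name)
        removed.append(name)
        if signal_returned():
            return removed, name  # pause here; the owner decides what to re-apply
    return removed, None

controls = [("velocity_tightening", 2), ("country_block", 5), ("step_up_trigger", 3)]
log = []
checks = iter([False, True])  # quiet after the first removal, signal after the second
removed, paused_after = staged_rollback(controls, log.append, lambda: next(checks))
print(removed, paused_after)  # ['country_block', 'step_up_trigger'] step_up_trigger
```

The function only pauses; whether to re-apply the last control is left as a recorded decision for the incident owner.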
Post-incident review
Every incident ends in a blameless post-mortem — blameless because the goal is to fix the system, and people hide information when the exercise is about assigning fault. Reconstruct the timeline from the log: when the attack started, when the first signal appeared, when you detected it, when you contained it, and when you recovered. Then find the control gap that let the attack in or let it grow before you saw it.
The single most actionable number is usually the gap between first signal and detected — the time the attack was running before anyone declared. Closing that gap is almost always higher-leverage than adding another control. Feed the fixes back: the detection threshold that should have fired sooner, the containment lever that was missing or untested, the classification that took too long. And update this runbook itself — a runbook that is not revised after each incident decays into fiction.
Incident KPI scorecard
These are incident-lifecycle metrics — they measure how well you ran this incident, and they are distinct from the steady-state fraud scorecard. The cross-cutting loss, detection, friction, and model-quality metrics live in fraud operations KPIs, which is the sibling scorecard you track continuously; the metrics below sit alongside it and are measured per incident.
| Metric | Definition | Why it matters |
|---|---|---|
| Time to detect | First attack signal → incident detected/declared | The most actionable gap; every minute here is unmonitored attack time |
| Time to contain | Detected → attack materially stopped | How fast the war-room turned a declaration into effective controls |
| Attack approval leakage | Fraudulent volume/value that was approved during the incident window | The direct loss the incident let through before containment held |
| False-positive impact | Good-customer approvals lost to incident controls vs. baseline | The over-blocking cost made visible — the other side of the ledger |
| Manual-review backlog | Cases queued by incident escalation vs. review capacity | Whether the manual-review lever is absorbing or drowning |
| Chargeback spillover | Fraud that got approved and later lands as fraud-category chargebacks | The delayed financial tail of the incident; ties to scheme monitoring |
| Recovery time | Stand-down → all temporary controls rolled back to baseline | Measures the most-skipped step; long recovery time means lingering false positives |
Each is a per-incident number; tracked across incidents, they show whether your response is getting faster and cheaper or not.
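The three time-based metrics in the scorecard fall directly out of the incident log's timestamps. A minimal sketch, assuming a hypothetical timeline dictionary whose keys mirror the definitions in the table:

```python
from datetime import datetime, timedelta

def lifecycle_kpis(ts):
    """Derive the time-based incident KPIs from logged timestamps."""
    return {
        "time_to_detect": ts["detected"] - ts["first_signal"],
        "time_to_contain": ts["contained"] - ts["detected"],
        "recovery_time": ts["controls_rolled_back"] - ts["stand_down"],
    }

# Hypothetical timeline reconstructed from the war-room log.
timeline = {
    "first_signal": datetime(2025, 1, 1, 9, 0),
    "detected": datetime(2025, 1, 1, 9, 40),
    "contained": datetime(2025, 1, 1, 10, 25),
    "stand_down": datetime(2025, 1, 1, 14, 0),
    "controls_rolled_back": datetime(2025, 1, 2, 14, 0),
}

kpis = lifecycle_kpis(timeline)
print(kpis["time_to_detect"])  # 0:40:00
```

Because these are computed from the log, the quality of the numbers depends on the timestamps being written down as events happen.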
Operator readiness checklist
The work that determines how an incident goes happens before the attack. Have these in place ahead of time:
- Declared severity tiers — written definitions of SEV-1/2/3 (or your equivalent) with the declaration criteria, agreed before you need them.
- A named incident owner and on-call roster — one accountable owner role and a roster covering the war-room functions around the clock.
- A war-room process — a known channel, a logging standard, and a stand-up cadence, so the room assembles itself rather than being invented mid-attack.
- Pre-built and tested containment levers with a rollback plan — every lever above built, scoped, switchable, and exercised, each with a documented way to take it off again.
- An evidence-capture standard — a defined list of what to snapshot, and the habit of capturing it before changing rules.
- Comms templates — internal update format and external/support lines drafted in advance, not written under pressure.
- A blameless post-mortem cadence — a standing expectation that every declared incident gets a review, with the timeline and the control gap as required outputs.
- Detection references mapped to containment levers — the triage table above, kept current, so classification routes to the right control without re-deriving detection.
Scope note
The lifecycle in this runbook is adapted from public incident-response frameworks — NIST SP 800-61 (Rev. 2 for the four-phase lifecycle; Rev. 3, the current revision, reframes incident response around the NIST CSF 2.0 functions) and the SANS PICERL process — not from any payments-specific standard. Per-attack detection lives in the linked detection articles; this runbook covers the command workflow once an attack is confirmed live. Severity tiers, thresholds, roles, levers, and KPI definitions here are illustrative operator synthesis — calibrate them to your own volume, risk appetite, contracts, and authority. This is operational guidance, not legal advice. Customer-notification and regulatory-reporting obligations vary by jurisdiction, incident type, and regulatory status, change over time, and are a question for your counsel — they are deliberately not prescribed here. The throughline is proportional, reversible controls; preserved evidence; auditability; and active management of the false-positive cost, because over-blocking legitimate users is its own incident.
Related references
- Card-Testing and Enumeration Attacks — the detection playbook and the card-testing-specific tactical Incident Response instance this runbook generalizes.
- Account Takeover Detection in Payments — how to detect the takeover wave this runbook tells you to contain.
- Promo and Referral Abuse Payment Controls — the reward-freeze, clawback, and appeal discipline for the promo-abuse containment instance.
- First-Party and Friendly Fraud — the dispute-heavy attack type that needs evidence, not blunt blocking.
- Synthetic Identity Fraud Patterns — detecting the fabricated-identity attack routed to onboarding step-up and manual review.
- Authorized Push Payment Fraud — the real-time-rails attack contained with payout and transfer holds.
- Fraud Operations KPIs — the steady-state scorecard that runs continuously; this runbook’s incident-lifecycle metrics sit alongside it.
- Payout and Disbursement Failure Runbook — the payout path that payout holds sit on top of during an incident.
- Visa VCR and Mastercard Chargeback Rules 2026 — where documented attack evidence feeds scheme-monitoring and VAMP discussions.
For term definitions — velocity check, SCA, 3DS2, BIN, and issuer — see the Payments Glossary.
Sources & methodology (5)
The incident-response lifecycle adapted here (preparation, then detection and analysis, then containment, eradication, and recovery, then post-incident activity) is the classic four-phase process defined in NIST SP 800-61 Revision 2, the Computer Security Incident Handling Guide (August 2012)
Rev. 2 is the source of the four-phase lifecycle cited here. It was superseded on 3 April 2025 by Rev. 3, which reframes incident response around the NIST CSF 2.0 functions rather than restating the lifecycle; cited for the lifecycle, not as the current revision.
NIST SP 800-61 Revision 3 (final, April 2025), Incident Response Recommendations and Considerations for Cybersecurity Risk Management: A CSF 2.0 Community Profile, is the current revision; it superseded Rev. 2 and maps incident-response guidance to the CSF 2.0 functions (Govern, Identify, Protect, Detect, Respond, Recover)
Cited to state the current revision accurately; the lifecycle framing in this runbook traces to Rev. 2, which Rev. 3 superseded.
The SANS Incident Handler's Handbook defines the six-phase PICERL incident-handling process — Preparation, Identification, Containment, Eradication, Recovery, and Lessons Learned — a widely used public IR process that aligns with the lifecycle used here
Secondary public anchor for the lifecycle; the 'Lessons Learned' phase corresponds to the blameless post-mortem in this runbook.
Acquirers and Visa accept documented attack events as context in monitoring-programme remediation discussions — documenting an externally caused, remediated attack spike (start date, volume, actions) is the operator practice that ties incident evidence to scheme monitoring exposure
Internal synthesis; the scheme-monitoring context is summarised, not quoted as a published scheme rule.
The payments-specific containment levers, incident roles, severity tiers, incident-lifecycle KPIs, and readiness checklist in this runbook are PaymentBrief operator synthesis — illustrative frameworks, not scheme rules or vendor commitments, and must be calibrated against your own contracts, baselines, and authority
Source types explained in our Methodology.