PSP and Acquirer Outage Failover Runbook: Detecting Degradation, Failing Over Cleanly, and Recovering Without Double Charges

When a PSP or acquirer degrades, every second means failed payments. A runbook to detect the outage, fail over cleanly, and recover without double charges.

By Shaun Toh
TL;DR

PSPs and acquirers go down, and multi-PSP is not automatic failover. This runbook covers detecting gray failures from your own telemetry, the failover decision and its risks, idempotency keys against double charges, what breaks across two processors, and a readiness checklist.

Operator Summary

In the first five minutes of a suspected PSP or acquirer outage, trust your own telemetry over the vendor status page — auth-rate drop, latency and timeout spikes, error-rate climbs, and decline-code clustering against your own baselines are the signals that justify action, and status pages lag the incident. Confirm it is the processor and not your integration, then fail over at the orchestration layer to a tested backup acquirer. The single most important safeguard before retrying or re-routing any transaction is an idempotency key on every payment request, so a transaction that already succeeded is not charged twice during failover. Treat in-flight or uncertain transactions as charged until you verify status — do not blindly retry. Open the comms and incident-log process immediately.

PSPs and acquirers go down. Not often, and rarely for long — but a processor or acquirer outage during your peak hour is, minute for minute, one of the most expensive things that can happen to a payments operation. The 2018 Visa outage failed roughly 5.2 million card payments across the UK and Ireland from a single rare hardware fault, and Visa itself acknowledged it had no software in place to detect the failure automatically. If a card network can be caught flat-footed, so can your stack.

The dangerous assumption is that running more than one PSP means you are covered. It does not. Multi-PSP is a procurement state; failover is an operational capability. Having a second acquirer in a contract folder does nothing on its own — traffic does not reroute itself, tokens do not follow the cardholder across processors for free, and the decision to fail over is a judgment call someone has to make under pressure. Building the routing capability is the strategic counterpart to this runbook; see multi-acquirer routing for the architecture. This runbook is what you reach for when the routing exists and something has just started to break.

The cost of an unmanaged outage is not only the declined transactions during the incident. It is the double charges from panicked retries, the reconciliation load from transactions split across two processors, the chargebacks that land on the wrong entity, and the customer trust that erodes faster than any dashboard shows. Most of that damage is self-inflicted during a clumsy failover — which is exactly what a runbook exists to prevent.

Figure: Surviving a PSP or acquirer outage without double charges: detect from your own telemetry, classify the failure, fail over with idempotency keys, and manage what breaks across two processors afterward (reconciliation, tokens, 3DS/SCA, chargeback ownership).

What an outage actually looks like

“Outage” is the wrong mental model because it implies a clean binary. Real processor degradation lives on a spectrum:

  • Full outage — the processor returns errors or nothing at all for effectively all traffic. This is the easy case: it is obvious, it is unambiguous, and the action is clear.
  • Partial degradation — a subset of traffic fails. Maybe one card scheme, one region, one BIN range, or one acquiring bank behind the PSP. The processor’s overall numbers may look fine while your specific slice is on fire.
  • Elevated declines — authorizations are returning, but the authorization rate has dropped. Cards that should approve are getting soft declines. The plumbing works; the outcomes are wrong.
  • Latency and timeout creep — responses still come back, but slowly. Some requests cross your timeout threshold and fail; the rest convert at a lower rate because checkout feels broken. This often precedes a full outage.

The hard cases are the last three: the gray failures. A clean outage trips every alarm you have. Partial degradation hides inside normal variance: a two-point auth-rate dip at 3 a.m. looks like noise, and elevated latency looks like a slow afternoon until it does not. Gray failures are harder to detect, harder to attribute, and far easier to dismiss, which is why detection, not failover, is the part of this runbook that earns its keep.

Detection: the signals that justify failover

You cannot fail over on a hunch and you cannot wait for certainty. The job is to assemble enough corroborating signals — measured against your own baselines — to justify action. The signals, roughly in order of usefulness:

  • Authorization-rate drop. The cleanest business-level signal. A sustained drop in approvals against your normal rate for the same hour, day, and traffic mix is the strongest single indicator that something is wrong downstream. Baseline matters: a 2 a.m. auth rate and a Black Friday auth rate are different distributions.
  • Latency and timeout spike. Rising p95/p99 response times and a climbing timeout count often lead the auth-rate drop. Latency creep is your earliest warning of a degrading processor.
  • HTTP and error-rate climb. 5xx responses, connection resets, and gateway errors at the API layer. A spike here that is not explained by your own deploy points outward.
  • Decline-code clustering. Watch the shape of declines, not just the count. A sudden concentration in specific decline codes — especially issuer-unavailable, do-not-honor surges, or a wave of soft declines that are normally rare — signals a systemic problem rather than ordinary cardholder-level failures.
  • Webhook silence. If a processor’s event stream goes quiet when volume says it should not, the processor may be degraded even while the synchronous API still answers.
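To make "watch the shape of declines, not just the count" concrete, here is a minimal sketch that compares each decline code's share of all declines in the current window against a baseline window. The function name, the `min_share_jump` parameter, and its default are hypothetical placeholders, and the sample codes in the usage note are illustrative only.

```python
from collections import Counter


def decline_code_shift(baseline_codes, current_codes, min_share_jump=0.15):
    """Return decline codes whose share of all declines rose sharply.

    baseline_codes / current_codes: iterables of decline-code strings.
    min_share_jump: hypothetical placeholder; calibrate against your own data.
    """
    base = Counter(baseline_codes)
    cur = Counter(current_codes)
    base_total = sum(base.values()) or 1  # avoid division by zero
    cur_total = sum(cur.values()) or 1

    flagged = {}
    for code, count in cur.items():
        # A code missing from the baseline has a baseline share of 0.
        jump = count / cur_total - base[code] / base_total
        if jump >= min_share_jump:
            flagged[code] = round(jump, 3)
    return flagged
```

Called with a baseline of 20% "05", 60% "51", 20% "54" and a current window of 30% "05", 30% "51", 40% "91", only "91" is flagged: a code that was absent from the baseline now accounts for a large share of declines, which is the clustering pattern described above.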

The governing principle: your own telemetry beats the vendor status page. Status pages are updated by humans after an incident is confirmed internally, so they lag the failure your customers are already feeling — sometimes by many minutes. By the time a status page turns yellow, your auth-rate chart has been screaming for a while. Use the status page as confirmation, never as your trigger.

For this to work you need baselines and monitoring windows defined before the incident: what is normal auth rate by hour and segment, what p99 latency is acceptable, and over what rolling window a deviation has to persist before it counts. A short window catches problems fast but fires on noise; a long window is calm but slow. Pick deliberately and tune against history. For the metric definitions underneath this, see payment routing KPIs.
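The window trade-off can be sketched as a small monitor that only reports a breach when every bucket in a full rolling window sits below the floor. This is one possible reading of "persist", not the only one; the class name, the bucket shape, and all numbers in the usage note are assumptions for illustration.

```python
from collections import deque


class AuthRateMonitor:
    """Flags a breach only when the deviation persists across the whole window."""

    def __init__(self, baseline_rate, max_drop, window_size):
        # Floor = baseline for this hour/segment minus the tolerated drop.
        self.floor = baseline_rate - max_drop
        # Each bucket is (approved, attempted) for one time slice, e.g. a minute.
        self.buckets = deque(maxlen=window_size)

    def record(self, approved, attempted):
        self.buckets.append((approved, attempted))

    def breached(self):
        # Not enough history yet: never fire on a partially filled window.
        if len(self.buckets) < self.buckets.maxlen:
            return False
        # Buckets with no traffic do not count as evidence of a drop.
        return all(t > 0 and a / t < self.floor for a, t in self.buckets)
```

With a hypothetical baseline of 0.90, a tolerated drop of 0.05, and a three-bucket window, a single good bucket resets the signal; three consecutive buckets under 0.85 trip it. A smaller `window_size` fires sooner and on more noise, which is the trade described above.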

The failover decision

Once detection says “something is wrong,” someone or something has to decide to fail over. Two models:

  • Automatic triggers. Your orchestration layer reroutes when a metric breaches a threshold — for example, auth rate below a floor and error rate above a ceiling, both sustained across the monitoring window. Fast and consistent, but only as good as the thresholds, and dangerous if it flaps.
  • Manual triggers. A human confirms the signals and pulls the lever. Slower, but resistant to false positives and able to weigh context a metric cannot.

Many mature operators run a hybrid: automatic failover for unambiguous breaches, a human-confirmed path for the gray zone.

The thresholds you choose are yours to own. As an illustrative starting point only — not a prescription, because the right numbers depend entirely on your volume, mix, and contracts — an operator might treat a sustained auth-rate drop of a meaningful margin below baseline, combined with an error or timeout rate well above normal, held across the monitoring window, as a failover trigger. Treat any specific number you see (including any in this article) as a placeholder to calibrate against your own data, never as a setting to copy.
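A hybrid trigger of the kind described above might look like the following sketch: automatic failover only when both conditions breach and the deviation is sustained, and a human page for the gray zone. The `Thresholds` fields and the return labels are hypothetical, and every number in the usage note is a placeholder rather than a recommendation.

```python
from dataclasses import dataclass


@dataclass
class Thresholds:
    auth_floor: float     # auto-failover needs auth rate below this...
    error_ceiling: float  # ...and error rate above this
    auth_warn: float      # gray zone: auth rate below this pages a human
    error_warn: float     # gray zone: error rate above this pages a human


def decide(auth_rate, error_rate, sustained, t):
    """Return 'auto_failover', 'page_human', or 'hold'."""
    # Nothing acts on a deviation that has not held across the monitoring window.
    if not sustained:
        return "hold"
    # Unambiguous breach: both signals agree.
    if auth_rate < t.auth_floor and error_rate > t.error_ceiling:
        return "auto_failover"
    # Gray zone: one signal is off, so a human weighs the context.
    if auth_rate < t.auth_warn or error_rate > t.error_warn:
        return "page_human"
    return "hold"
```

For example, with placeholder thresholds `Thresholds(0.70, 0.20, 0.85, 0.05)`, an auth rate of 0.60 with an error rate of 0.30 returns `auto_failover`, while an auth rate of 0.80 with a normal error rate returns `page_human`. Requiring `sustained` before any action is one simple guard against flapping.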

The reason to be disciplined about thresholds is that premature failover has real costs:

  • Double charges — re-routing in-flight transactions without idempotency means charging customers twice on two processors.
  • Reconciliation load — splitting a window of traffic across two processors multiplies the reconciliation work and the break types you have to chase.
  • Token gaps — the backup acquirer may not have the stored credentials or network tokens to convert returning customers as well as the primary, dropping your auth rate even though the processor is “up.”

Failing over too early to escape a blip you would have ridden out can cost more than the blip. The decision is a trade, and the runbook’s job is to make sure it is a considered trade.

Failover mechanics

How the reroute actually happens determines how much breaks.

Orchestration-layer failover vs config or DNS. Prefer rerouting inside a payment orchestration layer that holds connections to multiple processors and can switch per-transaction in real time. DNS or config-file failover is coarse, slow to propagate, and hard to roll back cleanly. An orchestration layer can shift a defined slice of traffic and reverse it instantly. Orchestration platforms such as Spreedly, Primer, and Gr4vy exist precisely to own this switching, and ML-driven approaches add automated decisioning on top — see AI payment routing and ML orchestration.
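As a rough illustration of per-transaction switching, the sketch below keeps a set of failed-over segments and consults it on every routing decision, so a slice can be shifted and reversed without touching the rest of the traffic. The `Router` class, the `(scheme, region)` segment shape, and the processor labels are hypothetical and do not reflect any particular orchestration platform's API.

```python
class Router:
    """Routes each transaction to primary or backup based on failed-over segments."""

    def __init__(self, primary, backup):
        self.primary = primary
        self.backup = backup
        # Segments are (scheme, region) tuples; None acts as a wildcard.
        self.failed_over = set()

    def fail_over(self, segment):
        self.failed_over.add(segment)

    def restore(self, segment):
        # discard() is a no-op if the segment was never failed over.
        self.failed_over.discard(segment)

    def route(self, scheme, region):
        # A transaction matches its exact segment or any wildcard covering it.
        candidates = {(scheme, region), (scheme, None), (None, region), (None, None)}
        return self.backup if candidates & self.failed_over else self.primary
```

Failing over `("visa", "GB")` moves only that slice; `(None, None)` moves everything. Because the decision is evaluated per transaction, calling `restore` takes effect on the next request, which is the rollback property that DNS or config-file failover lacks.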

Idempotency keys to prevent double charges. This is the single most important safeguard in a failover. Attach a unique idempotency key to every payment request. If a request times out or fails ambiguously and you retry on the same processor, the key lets that processor recognize the repeat and return the original result instead of creating a second charge. Stripe, for example, saves the status code and body of the first request for a given idempotency key and returns that same saved result on any subsequent request carrying the same key, so a retried request after a connection error does not perform the operation twice. A different processor has never seen the key, so when you re-route, the key protects you only if your own orchestration layer records it and confirms the original attempt's outcome before sending the payment elsewhere. Without idempotency, every retry during an outage is a potential double charge.
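The save-and-replay behaviour can be sketched with an in-memory stand-in for a processor. `FakeProcessor` and `new_idempotency_key` are hypothetical names, and the response shape is invented for illustration; a real processor's API and retention rules will differ.

```python
import uuid


class FakeProcessor:
    """Hypothetical in-memory stand-in for a processor that honours idempotency keys."""

    def __init__(self):
        self.saved = {}    # idempotency key -> first response returned for it
        self.charges = []  # every charge actually created

    def charge(self, idempotency_key, amount_minor, currency):
        # A repeated key returns the saved result without charging again.
        if idempotency_key in self.saved:
            return self.saved[idempotency_key]
        charge_id = f"ch_{len(self.charges) + 1}"
        self.charges.append((charge_id, amount_minor, currency))
        response = {"status": 200, "charge_id": charge_id}
        self.saved[idempotency_key] = response
        return response


def new_idempotency_key():
    # One key per payment attempt, generated once and reused on every retry.
    return str(uuid.uuid4())
```

Sending the same key twice yields one charge and two identical responses; a fresh key creates a second charge. The key must therefore be generated once per payment attempt and stored with the order, not regenerated inside the retry loop.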

In-flight and uncertain transactions. The most dangerous transactions in an outage are the ones whose outcome you do not know — the request went out and no clean response came back. Treat an uncertain transaction as potentially charged until you have verified its status. Query the processor’s transaction-status API (or wait for the webhook) before retrying or re-routing. Blind retries on uncertain transactions are how outages turn into double-charge incidents.
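A verify-before-retry rule might be sketched as follows. `lookup_status` stands in for whatever status query your processor exposes, and the status strings and return labels are assumptions; a production version would also wait between checks rather than polling back to back.

```python
def resolve_uncertain(lookup_status, payment_id, max_checks=3):
    """Decide what to do with a payment whose outcome is unknown.

    lookup_status: callable taking a payment id and returning
    'succeeded', 'failed', or anything else for 'still unknown'.
    """
    for _ in range(max_checks):
        status = lookup_status(payment_id)
        if status == "succeeded":
            # Already charged: retrying or re-routing would double charge.
            return "do_not_retry"
        if status == "failed":
            # Confirmed not charged: safe to send to the backup.
            return "safe_to_reroute"
    # Still unknown after all checks: treat as potentially charged.
    return "hold_for_review"
```

The important property is the default: a payment that never resolves falls through to `hold_for_review`, not to a retry.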

Token continuity. This is the part operators most often get wrong, so be careful: network tokens improve cross-PSP continuity, but they are provisioned per token requestor and acquirer, and portability across acquirers is limited and still evolving — do not assume a token minted under your primary processor will simply work on your backup. Visa’s Token Service requires you to register as a token requestor before requesting tokens, and Mastercard’s MDES treats wallets, merchants, and PSPs as distinct token requestors whose tokens are managed through requestor-initiated APIs. The practical consequence is that your backup acquirer may not be able to use the primary’s stored network tokens, which can depress conversion on returning customers during failover even when the backup is perfectly healthy. Plan token strategy deliberately; for the distinction that matters here, see network tokens vs PSP tokens.

Outage-type, signal, action

The citable core of this runbook. Each row maps a degradation type to its leading signal, the immediate action, and the caveat that keeps the action from backfiring. Thresholds and timings referenced are illustrative — calibrate to your own baselines and contracts.

| Outage type | Primary signal | Immediate action | Caveat |
|---|---|---|---|
| Full outage | Error/timeout rate near total; webhook stream silent | Fail over all affected traffic to backup acquirer | Confirm it is the processor, not your own integration or network, before rerouting |
| Partial degradation (scheme/region/BIN) | Auth-rate drop and errors isolated to one segment | Fail over only the affected slice; keep healthy traffic on primary | Over-broad failover moves healthy traffic onto a backup with weaker token coverage |
| Elevated declines | Decline-code clustering; auth rate down with normal latency | Investigate decline shape; route affected segment if systemic | May be issuer-side, not processor-side; failover will not fix issuer outages |
| Latency / timeout creep | Rising p95/p99; climbing timeout count | Tighten timeouts; pre-stage failover; watch the window | Premature failover on a transient blip you would have ridden out costs more than it saves |
| Webhook / event-stream silence | Expected events absent at normal volume | Switch to status-API polling; verify charge outcomes | Synchronous API may still work; do not assume a full outage from webhook silence alone |
| Settlement/reporting only | Auth path healthy; settlement files or reports missing/late | Do not fail over auth; open settlement reconciliation track | Re-routing live auth does nothing for a back-office settlement problem |
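The table can also be encoded so that an alert carries its recommended action with it. The sketch below is one possible encoding: the signal names are hypothetical, and the order in which rows are checked is an assumption (most severe first), not something the table itself prescribes.

```python
# Immediate actions, copied from the table rows above.
ACTIONS = {
    "full_outage": "Fail over all affected traffic to backup acquirer",
    "partial_degradation": "Fail over only the affected slice; keep healthy traffic on primary",
    "elevated_declines": "Investigate decline shape; route affected segment if systemic",
    "latency_creep": "Tighten timeouts; pre-stage failover; watch the window",
    "webhook_silence": "Switch to status-API polling; verify charge outcomes",
    "settlement_only": "Do not fail over auth; open settlement reconciliation track",
}


def classify(signals):
    """Map a dict of boolean signals to an outage type; missing keys count as False."""

    def on(name):
        return bool(signals.get(name, False))

    if on("errors_near_total"):
        return "full_outage"
    if on("errors_isolated_to_segment"):
        return "partial_degradation"
    # Elevated declines: clustering while latency stays normal.
    if on("decline_clustering") and not on("latency_rising"):
        return "elevated_declines"
    if on("latency_rising"):
        return "latency_creep"
    if on("webhooks_silent"):
        return "webhook_silence"
    if on("settlement_late"):
        return "settlement_only"
    return "no_match"
```

`ACTIONS[classify(signals)]` then yields the immediate action for any matched type. The caveat column is deliberately left out of the code: those are judgment checks for the incident lead, not conditions a classifier can confirm.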

What breaks during failover

Failover is never free. The moment you split traffic across two processors, a set of downstream problems activates:

  • Reconciliation across two processors. A single window of traffic now settles through two settlement reports, two ID schemes, and two payout cadences. Every reconciliation break type gets multiplied. When the dust settles, work through the PSP reconciliation failure runbook for the matching-ID and break taxonomy.
  • Chargeback ownership. A transaction authorized on one processor and any related activity on another can muddy which entity owns a dispute. Track which processor handled each transaction so chargebacks route to the right place.
  • Tokenization gaps. As above — stored credentials and network tokens may not transfer to the backup, lowering conversion on returning customers exactly when you need it most.
  • 3DS and SCA continuity. Strong-customer-authentication exemptions, soft-decline retry logic, and 3DS flows can differ between processors. A transaction soft-declined on the primary and retried on the backup may face a different authentication path.
  • Reporting and settlement-file splits. Daily reporting, finance feeds, and settlement files now arrive from two sources for the same period — your finance team needs to expect the split rather than treat it as a data error.
  • Retry interactions. Retry logic tuned for one processor can misbehave against another, and aggressive retries during an outage interact badly with failover. Coordinate retry policy with failover so the two do not compound. For the optimization context, see authorization optimization and card acceptance.
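Several of the items above come down to recording which processor handled each transaction. A minimal sketch of such a record, with helpers for the reconciliation split and for spotting orders that touched both processors, is shown below; the field names and processor labels are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PaymentRecord:
    order_id: str
    processor: str         # which processor handled this attempt
    processor_txn_id: str  # that processor's own transaction ID
    amount_minor: int      # amount in minor units, e.g. pence


def totals_by_processor(records):
    """Count and sum per processor, to compare against each settlement report."""
    totals = defaultdict(lambda: {"count": 0, "amount_minor": 0})
    for r in records:
        totals[r.processor]["count"] += 1
        totals[r.processor]["amount_minor"] += r.amount_minor
    return dict(totals)


def processors_for_order(records, order_id):
    """List every processor that touched an order; more than one needs a closer look."""
    return sorted({r.processor for r in records if r.order_id == order_id})
```

An order that returns two processors from `processors_for_order` is a candidate for both a double-charge check and a dispute-ownership check, since activity on it spans the failover.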

Customer and merchant communications

Technical recovery is half the incident; communication is the other half.

  • Status page. If you maintain a public status page, update it honestly and promptly. Acknowledge degradation without over-promising a fix time you cannot guarantee.
  • Support scripts. Give support staff a prepared script: what to tell customers seeing failed payments, whether to advise a retry, and how to handle “was I charged?” questions. The honest answer to the last one depends on idempotency and your ability to verify charge status — which is why the mechanics above matter for comms too.
  • What not to promise. Do not promise that no one was double-charged before you have verified it, do not give a precise restoration ETA you do not control, and do not blame the vendor publicly before the root cause is confirmed.
  • Internal comms and escalation. Open an incident channel, assign an incident lead, and keep a timestamped log from the first signal. The log is what makes the post-mortem and any SLA claim possible.

After the incident

When traffic is stable again, the incident is not over.

  • RCA and post-mortem. Reconstruct the timeline from your logs: when the first signal appeared, when you detected it, when you failed over, and when you recovered. The gap between first signal and detection is usually the most actionable number you will find.
  • Vendor SLA and credit claims. If your contract includes uptime commitments, your timestamped evidence supports a credit claim. Whether you are owed anything depends entirely on the specific contract terms — confirm with your account team rather than assuming.
  • Feed fixes back into the runbook. Every incident exposes a threshold that was wrong, an alert that did not fire, or a step that was unclear under pressure. Update the runbook while it is fresh. A runbook that is not revised after each incident decays into fiction.
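The timeline reconstruction reduces to three intervals once the four timestamps are pulled from the incident log. The sketch below computes them; the function name and the interval labels are hypothetical.

```python
from datetime import datetime, timedelta


def incident_gaps(first_signal, detected, failed_over, recovered):
    """Return the three intervals of an incident as timedeltas."""
    return {
        "time_to_detect": detected - first_signal,    # usually the most actionable
        "time_to_failover": failed_over - detected,   # decision and execution lag
        "time_to_recover": recovered - failed_over,   # time spent on the backup path
    }
```

Passing `datetime` values for each event returns `timedelta` values that can go straight into the post-mortem. Tracking `time_to_detect` across incidents shows whether threshold and alert changes are actually shortening detection.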

Operator readiness checklist

The work that determines how a failover goes happens before the outage. Have these in place ahead of time:

  • A tested backup acquirer — not just contracted, but actually exercised with live traffic so you know it works and at what conversion.
  • A deliberate token strategy — know whether and how network tokens and stored credentials carry to the backup, and what conversion hit to expect if they do not.
  • Idempotency keys on every payment request — non-negotiable; this is your double-charge insurance during retries and failover.
  • Monitoring and alert thresholds with defined baselines — auth rate, latency, error rate, and decline clustering, each with a documented normal range and monitoring window.
  • A documented runbook with a named owner — this document, kept current, with one accountable person responsible for it.
  • Prepared comms templates — status-page copy, support scripts, and an internal escalation path drafted in advance, not written during the incident.
  • Periodic failover drills — scheduled game-days where you actually fail over to the backup. An untested failover path is a hypothesis, not a capability.
  • A clear routing economics view — know the cost difference between primary and backup so failover decisions account for margin, not just availability; see least-cost routing.

Scope note

The thresholds, monitoring windows, and timings in this runbook are illustrative operator guidance, not prescriptions — the right numbers depend on your transaction volume, traffic mix, and processor contracts, and must be calibrated against your own baselines. Processor-specific mechanics (idempotency behavior, token provisioning) are cited to official Stripe, Visa, and Mastercard documentation as of June 2026; the 2018 Visa outage is included as a documented real example, not as a basis for any uptime or SLA claim. This is operational guidance, not legal, contractual, or SLA advice — verify settlement terms, SLA commitments, and token portability with your specific PSP and acquirer before relying on them.

For term definitions — PSP, acquirer, authorization, and payment orchestration — see the Payments Glossary.

Sources & methodology (6)

Idempotency keys let a client safely retry a request after a connection error without performing the operation twice; Stripe saves the status code and body of the first request for a given key and returns the same result on subsequent requests with that key, preventing duplicate charges on retry or failover

Stripe publishes a public system status page reporting the operational state of its payment services and incidents — evidence that processor outages occur and that vendor status reporting exists as a (lagging) signal

To request network tokens from the Visa Token Service you must first register with Visa as a token requestor; tokens can be limited to a specific merchant or device, which constrains portability across acquirers and requestors

Cited to support the portability caveat, not to claim seamless cross-acquirer reuse.

Mastercard MDES treats wallets, merchants, and payment service providers as token requestors; token lifecycle APIs are initiated by the token requestor — confirming network tokens are provisioned and managed per token requestor rather than being globally portable

Used for the network-token continuity caveat, not to overclaim portability.

On 1 June 2018 a rare partial hardware failure in a switch at a Visa European data centre prevented the backup switch from activating, causing roughly 5.2 million card payments to fail across the UK and Ireland; Visa later acknowledged it had no software in place to automatically detect the failure

A documented, well-reported network-level outage; cited as a real example, not for any uptime/SLA claim.

Failover thresholds, monitoring windows, and timing guidance in this runbook are PaymentBrief operator synthesis — illustrative decision frameworks, not processor commitments or scheme rules; specific numbers must be confirmed against your own contracts and baselines

Source types explained in our Methodology.

By Shaun Toh · Director, Digital Payments · Razer
