Edge Cases in Complex Payment Operations

Payment operations are complicated. Technical challenges can arise for myriad reasons. This journal entry details one edge case that we dealt with recently, how we responded, and what we're doing to ensure it doesn't happen again.

July 1, 2022

Sam Aarons, Co-Founder & CTO

Payments are complicated. It seems like it should all be so simple: someone wants to pay or charge someone else, they click a button, and money moves.

But sometimes things go wrong. Earlier this year, we encountered an edge case at the intersection of state machines and ingesting asynchronous information from a bank. This bug affected a single customer and resulted in a single duplicate payment. At the time of discovery we did not know the impact radius, so we treated it as a P0 (highest-priority) incident.

This is our first public post about a bug. But following in the footsteps of great infrastructure companies, we’d prefer to be open and transparent about what we encounter and how we tackle technical challenges on behalf of our customers.

The Problem

It started simple: one of our customers sent a wire out to a counterparty.

Modern Treasury makes it easy to send payments across our supported payment rails and currencies in a single interface. For wires, the exact information required to complete a payment depends on the underlying bank’s technology. It’s common for one working combination of recipient information at one bank to be incompatible with another bank. If a payment instruction cannot be completed, the wire will fail. As this is a common scenario, Modern Treasury makes it easy to redraft the failed wire payment order, and transmit the instruction again with the corrected information.

Behind the scenes, Modern Treasury uses an internal object to represent every single payment attempt. These are called “Payment Order Attempts.” Every time a Payment Order is created, Modern Treasury stores a Payment Order Attempt. When a Payment Order is redrafted, Modern Treasury creates a new Payment Order Attempt. Tracking these attempts enables us to keep a full record of all information we need to transmit payment instructions.
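The relationship between a Payment Order and its attempts can be pictured with a small sketch. This is only an illustration under stated assumptions: the class names mirror the terms above, but every field, method, and status value here is hypothetical and not taken from Modern Treasury's actual implementation.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class AttemptStatus(Enum):
    # Hypothetical status values, chosen for illustration only.
    PENDING = "pending"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


@dataclass
class PaymentOrderAttempt:
    attempt_number: int
    instructions: Dict[str, str]  # what was transmitted to the bank for this attempt
    status: AttemptStatus = AttemptStatus.PENDING


@dataclass
class PaymentOrder:
    order_id: str
    attempts: List[PaymentOrderAttempt] = field(default_factory=list)

    def _new_attempt(self, instructions: Dict[str, str]) -> PaymentOrderAttempt:
        # Copy the instructions so each attempt keeps its own immutable-ish record.
        attempt = PaymentOrderAttempt(len(self.attempts) + 1, dict(instructions))
        self.attempts.append(attempt)
        return attempt

    @classmethod
    def create(cls, order_id: str, instructions: Dict[str, str]) -> "PaymentOrder":
        # Creating a Payment Order stores its first attempt.
        order = cls(order_id)
        order._new_attempt(instructions)
        return order

    def redraft(self, corrections: Dict[str, str]) -> PaymentOrderAttempt:
        # A redraft creates a new attempt; earlier attempts are left untouched.
        merged = {**self.attempts[-1].instructions, **corrections}
        return self._new_attempt(merged)
```

Because a redraft appends rather than overwrites, the full history of what was sent to the bank stays available for each attempt.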

This particular wire was to an international counterparty and required more details than a domestic wire, including foreign exchange information. The customer sent out the payment on day 1 (payment attempt #1). It failed that day. Our customer redrafted the payment and sent it again (payment attempt #2). This second payment succeeded. However, after it succeeded, a second alert about a wire failure arrived from the bank.

The second alert was a repeat reference to the original payment (payment attempt #1), but because it arrived after the redraft (payment attempt #2), we ingested the information and marked the second wire as failed, even though that wire had already succeeded. The customer redrafted a second time and sent the payment again, and this attempt also succeeded (payment attempt #3). Because this was the second wire to succeed, the customer had sent out funds twice.
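The sequence above can be replayed in a deliberately simplified sketch. It assumes that each bank event is applied to the most recent attempt, which is one way a late, repeated failure alert could land on the wrong wire. The function name and event labels are hypothetical and exist only for illustration.

```python
def replay_naive(events):
    """Replay bank events, applying each one to the most recent attempt.

    Returns the recorded status of each attempt and the number of wires
    that actually settled at the bank.
    """
    attempts = ["pending"]  # attempt #1 has already been sent
    wires_settled = 0
    for event in events:
        if event == "success":
            attempts[-1] = "succeeded"
            wires_settled += 1  # money really moved
        elif event == "failure":
            # No identifier ties the alert to a specific attempt in this
            # simplified model, so it lands on whichever attempt is latest.
            attempts[-1] = "failed"
        elif event == "redraft":
            # The customer can only redraft an attempt that looks failed.
            if attempts[-1] == "failed":
                attempts.append("pending")
    return attempts, wires_settled


timeline = [
    "failure",  # attempt #1 fails
    "redraft",  # attempt #2 created
    "success",  # attempt #2 succeeds
    "failure",  # repeated alert about attempt #1, misapplied to attempt #2
    "redraft",  # attempt #3 created
    "success",  # attempt #3 succeeds
]
recorded, settled = replay_naive(timeline)
```

In this model the recorded statuses end as failed, failed, succeeded, while two wires have settled. The records look consistent on their own, which is what makes this kind of duplicate hard to spot.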

It would have been easy to detect the duplicate if there were a single transaction ID end-to-end. But bank core systems are often distinct and separate, and they do not maintain unique trace identifiers for payments with multiple attempts. Because of this limitation, we have built this state machine in-house.

This is exactly why we emphasize idempotency in our API design. Messages can land at any time, sometimes repeatedly, and as this bug shows, it’s critical to understand what they refer to. Otherwise, you may pay someone twice.
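As a rough illustration of what idempotency buys you, here is a minimal in-memory sketch. It assumes the caller supplies a unique key per intended payment; the class, method, and field names are hypothetical and do not describe Modern Treasury's API.

```python
class PaymentService:
    """Toy service that creates at most one payment per idempotency key."""

    def __init__(self):
        self._by_key = {}   # idempotency key -> payment already created
        self.created = []   # every payment that was actually created

    def create_payment(self, idempotency_key, amount_cents, recipient):
        # A repeated request with a known key returns the original payment
        # instead of creating (and paying) a second one.
        if idempotency_key in self._by_key:
            return self._by_key[idempotency_key]
        payment = {
            "id": f"pay_{len(self.created) + 1}",
            "amount_cents": amount_cents,
            "recipient": recipient,
        }
        self.created.append(payment)
        self._by_key[idempotency_key] = payment
        return payment


service = PaymentService()
first = service.create_payment("key-123", 50_000, "ACME Ltd")
retry = service.create_payment("key-123", 50_000, "ACME Ltd")  # e.g. a network retry
```

Here the retry returns the same payment as the first call, and only one payment is ever created. The bank alerts in this incident carried no comparable key, which is why the same protection was not available on the inbound side.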

In this specific case, there’s a layered challenge. It’s common for a bank to run different payment rails (e.g., ACH and wire) on different core systems, each with its own alerting setup. And at this particular bank, things are divided yet again: domestic and international wires run on different payment systems.

The Solution

Once we realized there was a bug, we backtested all Payment Order Attempts in the history of Modern Treasury, and luckily, found that no other customer or bank had experienced this combination of events.
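A backtest of this kind could look something like the sketch below. It assumes a status history is retained for each attempt, so that an attempt that once succeeded can still be found after being marked failed; that assumption, the data shape, and the function name are all hypothetical.

```python
def find_suspect_orders(history):
    """Return ids of orders where more than one attempt ever succeeded.

    `history` maps an order id to a list of attempts, and each attempt is
    the ordered list of statuses it has held.
    """
    suspects = []
    for order_id, attempts in history.items():
        # Count attempts that reached "succeeded" at any point, even if a
        # later event moved them to another status.
        ever_succeeded = sum(1 for statuses in attempts if "succeeded" in statuses)
        if ever_succeeded > 1:
            suspects.append(order_id)
    return sorted(suspects)


history = {
    # Ordinary redraft: first attempt failed, second succeeded.
    "po_a": [["pending", "failed"], ["pending", "succeeded"]],
    # The pattern from this incident: attempt #2 succeeded, was later
    # marked failed, and attempt #3 succeeded as well.
    "po_b": [
        ["pending", "failed"],
        ["pending", "succeeded", "failed"],
        ["pending", "succeeded"],
    ],
}
suspects = find_suspect_orders(history)
```

Checking the history rather than only the current status matters here, because the current status of the misreported attempt is "failed".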

After that was cleared up, we implemented a series of changes. First, we instituted zero tolerance for any payment having two successful payment attempts. We added processing guards and alerts to prevent payments from being redrafted when they have a previous successful attempt. We also added alerting for unexpected payment transitions from terminal or failure states to success states. Within months, these alerts had caught problems that we were able to rectify before a duplicate payment was initiated.
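The two kinds of safeguard described above, a redraft guard and a transition alert, might be sketched as follows. This is an assumption-laden illustration, not the production logic: the status names, the set of terminal states, and all class and method names are hypothetical.

```python
class RedraftBlocked(Exception):
    """Raised when a redraft is refused because an attempt already succeeded."""


# Hypothetical set of terminal or failure states.
TERMINAL_STATES = {"failed", "cancelled"}


class GuardedOrder:
    def __init__(self):
        self.attempts = [{"status": "pending", "ever_succeeded": False}]
        self.alerts = []

    def apply_status(self, index, new_status):
        attempt = self.attempts[index]
        old_status = attempt["status"]
        # Alert on an unexpected move from a terminal state to success.
        if old_status in TERMINAL_STATES and new_status == "succeeded":
            self.alerts.append(
                f"attempt {index + 1}: unexpected {old_status} -> succeeded"
            )
        if new_status == "succeeded":
            # Remember the success even if a later event overwrites the status.
            attempt["ever_succeeded"] = True
        attempt["status"] = new_status

    def redraft(self):
        # Guard: never create a new attempt once any attempt has succeeded.
        if any(a["ever_succeeded"] for a in self.attempts):
            raise RedraftBlocked("a previous attempt already succeeded")
        self.attempts.append({"status": "pending", "ever_succeeded": False})


order = GuardedOrder()
order.apply_status(0, "failed")     # attempt #1 fails
order.redraft()                     # attempt #2
order.apply_status(1, "succeeded")  # attempt #2 succeeds
order.apply_status(1, "failed")     # late alert misapplied to attempt #2
try:
    order.redraft()                 # would have become attempt #3
    blocked = False
except RedraftBlocked:
    blocked = True
```

With the guard in place, replaying this incident's sequence stops at the second redraft, because the earlier success is remembered separately from the current status.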

For this individual payment, we worked with the customer and the bank to get the funds returned.

In the future, our goal is to build a diagnostics timeline viewer for internal teams to dig into circumstances like this one. Our ultimate goal is to make this information viewable by users and customers as well, at a lower level of event detail.

Edge Cases

This bug teaches a lesson about payment operations and why they are inherently difficult. Every bank sends error messages differently: they have varying architectures, syntaxes, frequencies, and retry patterns.

Certain sequences of events that lead to edge cases only happen once in a while. It took us over $30B in payment volume to get to this edge case with this particular bank. In each such circumstance, including this one, we codify that operational knowledge into our platform to ensure the incident won’t be repeated. Beyond the fix itself, we wanted to share this experience to be transparent about how we handle bugs when they happen.

If you’re curious about exception management in payments, reach out to us.