[The API Patterns] Atomicity of Bulk Changes

May 23, 2023

With this post, I’m continuing publishing the v2 of my book dedicated to APIs. If you like this book, please rate it on GitHub, Amazon, or Goodreads and

Chapter 23. Atomicity of Bulk Changes

Let's transition from webhooks back to developing direct-call APIs. The design of the orders/bulk-status-change endpoint, as described in the previous chapter, raises an interesting question: what should we do if some changes were successfully processed by our backend while others were not?

Let's consider a scenario where the partner notifies us about status changes that have occurred for two orders:

POST /v1/orders/bulk-status-change
{
  "status_changes": [{
    "order_id": "1",
    "new_status": "accepted",
    // Other relevant data,
    // such as estimated
    // preparation time
    …
  }, {
    "order_id": "2",
    "new_status": "rejected",
    "reason"
  }]
}
→
500 Internal Server Error

In this case, if changing the status of one order results in an error, how should we organize this “umbrella” endpoint (which acts as a proxy to process a list of nested sub-requests)? We can propose at least four different options:

A. Guarantee atomicity and idempotency. If any of the sub-requests fail, none of the changes are applied.
B. Guarantee idempotency but not atomicity. If some sub-requests fail, repeating the call with the same idempotency key results in no action and leaves the system exactly in the same state (i.e., unsuccessful calls will never be executed, even if the obstacles are resolved, until a new call with a new idempotency key is made).
C. Guarantee neither idempotency nor atomicity and process the sub-requests independently.
D. Do not guarantee atomicity and completely prohibit retries by requiring the inclusion of the actual resource revision in the request (see the “Synchronization Strategies” chapter).

From a general standpoint, it appears that the first option is most suitable for public APIs: if you can guarantee atomicity (despite it potentially poses scalability challenges), it is advisable to do so. In the first revision of this book, we unconditionally recommended adhering to this solution.

However, if we consider the situation from the partner's perspective, we realize that the decision is not as straightforward as one might initially think. Let's imagine that the partner has implemented the following functionality:

The partner's backend processes notifications about incoming orders through a webhook.
The backend makes inquiries to coffee shops regarding whether they can fulfill the orders.
Periodically, let's say once every 10 seconds, the partner collects all the status changes (i.e., responses from the coffee shops) and calls the bulk-status-change endpoint with the list of changes.

Now, let's consider a scenario where the partner receives an error from the API endpoint during the third step. What would developers do in such a situation? Most probably, one of the following solutions might be implemented in the partner's code:

Unconditional retry of the request:

// Retrieve the ongoing orders
const pendingOrders = await api
  .getPendingOrders();
// The partner checks the status
// of every order in its system
// and prepares the list of
// changes to perform
const changes = 
  await prepareStatusChanges(
    pendingOrders
  );

let result;
let tryNo = 0;
let timeout =
  DEFAULT_RETRY_TIMEOUT;
while (
  result && 
  tryNo++ < MAX_RETRIES
) {
  try {
    // Send the list
    // of changes
    result = await api
      .bulkStatusChange(
        changes,
        // Provide the newest
        // known revision
        pendingOrders.revision
      );
  } catch (e) {
    // If there is an error,
    // repeat the request
    // with some delay
    logger.error(e);
    await wait(timeout);
    timeout = min(
      timeout * 2,
      MAX_TIMEOUT
    );
  }
}

NB: in the code sample above, we provide the “right” retry policy with exponentially increasing delays and a total limit on the number of retries, as we recommended earlier in the “Describing Final Interfaces” chapter. However, be warned that real partners' code may frequently lack such precautions. For the sake of readability, we will skip this bulky construct in the following code samples.

Retrying only failed sub-requests:

const pendingOrders = await api
  .getPendingOrders();
let changes = 
  await prepareStatusChanges(
    pendingOrders
  );

let result;
while (changes.length) {
  let failedChanges = [];
  try {
    result = await api
      .bulkStatusChange(
        changes,
        pendingOrders.revision
      );
  } catch (e) {
    let i = 0;
    // Assuming that the `e.changes`
    // field contains the errors
    // breakdown
    for (
      i < e.changes.length; i++
    ) {
      if (e.changes[i].status ==
        'failed') {
        failedChanges.push(
          changes[i]
        );
      }
    }
  }
  // Prepare a new request
  // comprising only the failed
  // sub-requests
  changes = failedChanges;
}

Restarting the entire pipeline. In this case, the partner retrieves the list of pending orders anew and forms a new bulk change request:

do {
  const pendingOrders = await api
    .getPendingOrders();
  const changes = 
    await prepareStatusChanges(
      pendingOrders
    );
  // Request changes,
  // if there are any
  if (changes.length) {
    await api.bulkStatusChange(
      changes,
      pendingOrders.revision
    );
  }
} while (pendingOrders.length);

If we examine the possible combinations of client and server implementation options, we will discover that approaches (B) and (D) are incompatible with solution (1). Retrying the same request after a partial failure will never succeed, and the server will repeatedly attempt the failing request until it exhausts the remaining retry attempts.

Now, let's introduce another crucial condition to the problem statement: imagine that certain issues with a sub-request can not be resolved by retrying it. For example, if the partner attempts to confirm an order that has already been canceled by the customer. If a bulk status change request contains such a sub-request, the atomic server that implements paradigm (A) will immediately “penalize” the partner. Regardless of how many times and in what order the set of sub-requests is repeated, valid sub-requests will never be executed if there is even a single invalid one. On the other hand, a non-atomic server will at least continue processing the valid parts of bulk requests.

This leads us to a seemingly paradoxical conclusion: in order to ensure the partners' code continues to function somehow and to allow them time to address their invalid sub-requests we should adopt the least strict non-idempotent non-atomic approach to the design of the bulk state change endpoint. However, we consider this conclusion to be incorrect: the “zoo” of possible client and server implementations and the associated problems demonstrate that bulk state change endpoints are inherently undesirable. Such endpoints require maintaining an additional layer of logic in both server and client code, and the logic itself is quite non-obvious. The non-atomic non-idempotent bulk state changes will very soon result in nasty issues:

// A partner issues a refund
// and cancels the order
POST /v1/bulk-status-change
{
  "changes": [{
    "operation": "refund",
    "order_id"
  }, {
    "operation": "cancel",
    "order_id"
  }]
}
→
// During bulk change execution,
// the user was able to walk in
// and fetch the order
{
  "changes": [{
    // The refund is successful…
    "status": "success"
  }, {
    // …while canceling the order
    // is not
    "status": "fail",
    "reason": "already_served"
  }]
}

If sub-operations in the list depend on each other (as in the example above: the partner needs both refunding and canceling the order to succeed as there is no sense to fulfill only one of them) or the execution order is important, non-atomic endpoints will constantly lead to new problems. And if you think that in your subject area, there are no such problems, it might turn out at any moment that you have overlooked something.

So, our recommendations for bulk modifying endpoints are:

If you can avoid creating such endpoints — do it. In server-to-server integrations, the profit is marginal. In modern networks that support QUIC and request multiplexing, it's also dubious.
If you can not, make the endpoint atomic and provide SDKs to help partners avoid typical mistakes.
If implementing an atomic endpoint is not possible, elaborate on the API design thoroughly, keeping in mind the caveats we discussed.

One of the approaches that helps minimize potential issues is developing a “mixed” endpoint, in which the operations that can affect each other are grouped:

POST /v1/bulk-status-change
{
  "changes": [{
    "order_id": <first id>
    // Operations related
    // to a specific endpoint
    // are grouped in a single
    // structure and executed
    // atomically
    "operations": [
      "refund",
      "cancel"
    ]
  }, {
    // Operation sets for
    // different orders might
    // be executed in parallel
    // and non-atomically
    "order_id": <second id>
    …
  }]
}

Let us also stress that nested operations (or sets of operations) must be idempotent per se. If they are not, you need to somehow deterministically generate internal idempotency tokens for each operation. The simplest approach is to consider the internal token equal to the external one if it is possible within the subject area. Otherwise, you will need to employ some constructed tokens — in our case, let's say, in the <order_id>:<external_token> form.

This is Chapter 23 of “The API” book being written by Sergey Konstantinov. I also have a book on the history of beer and historical beer styles, a Telegram channel on interesting classical music recordings, a travel photo blog on Unsplash, and a website with ranking fantasy & science fiction novels based on awards they received.