[The API Patterns] Asynchronicity and Time Management

May 01, 2023

With this post, I’m continuing publishing the v2 of my book dedicated to APIs. If you like this book, please rate it on GitHub, Amazon, or Goodreads and

Chapter 19. Asynchronicity and Time Management

Let's continue working with the previous example: the application retrieves some system state upon start-up, perhaps not the most recent one. What else does the probability of collision depend on, and how can we lower it?

We remember that this probability is equal to the ratio of time periods: getting an actual state versus starting an app and making an order. The latter is almost out of our control (unless we deliberately introduce additional waiting periods in the API initialization function, which we consider an extreme measure). Let's then talk about the former.

Our usage scenario looks like this:

const pendingOrders = await api.
  getOngoingOrders();
if (pendingOrders.length == 0) {
  const order = await api
    .createOrder(…);
}
// App restart happens here,
// and all the same requests
// are repeated
const pendingOrders = await api.
  getOngoingOrders(); // → []
if (pendingOrders.length == 0) {
  const order = await api
    .createOrder(…);
}

Therefore, we're trying to minimize the following interval: network latency to deliver the createOrder call plus the time of executing the createOrder plus the time needed to propagate the newly created order to the replicas. We don't control the first summand (but we might expect the network latencies to be more or less constant during the session duration, so the next getOngoingOrders call will be delayed for roughly the same time period). The third summand depends on the infrastructure of the backend. Let's talk about the second one.

As we can see if the order creation itself takes a lot of time (meaning that it is comparable to the app restart time) then all our previous efforts were useless. The end user must wait until they get the server response back and might just restart the app to make a second createOrder call. It is in our best interest to ensure this never happens.

However, what we could do to improve this timing remains unclear. Creating an order might indeed take a lot of time as we need to carry out necessary checks and wait for the payment gateway response and confirmation from the coffee shop.

What could help us here is the asynchronous operations pattern. If our goal is to reduce the collision rate, there is no need to wait until the order is actually created as we need to quickly propagate the knowledge that the order is accepted for creation. We might employ the following technique: create a task for order creation and return its identifier, not the order itself.

const pendingOrders = await api.
  getOngoingOrders();
if (pendingOrders.length == 0) {
  // Instead of creating an order,
  // put the task for the creation
  const task = await api
    .putOrderCreationTask(…);
}
// App restart happens here,
// and all the same requests
// are repeated
const pendingOrders = await api.
  getOngoingOrders(); 
  // → { tasks: [task] }

Here we assume that task creation requires minimal checks and doesn't wait for any lingering operations, and therefore, it is carried out much faster. Furthermore, this operation (of creating an asynchronous task) might be isolated as a separate backend service for performing abstract asynchronous tasks. By having the functionality of creating tasks and retrieving the list of ongoing tasks we can significantly narrow the “gray zones” when clients can't learn the actual system state precisely.

Thus we naturally came to the pattern of organizing asynchronous APIs through task queues. Here we use the term “asynchronous” logically meaning the absence of mutual logical locks: the party that makes a request gets a response immediately and does not wait until the requested procedure is fully carried out being able to continue to interact with the API. Technically in modern application environments, locking (of both the client and server) almost universally doesn't happen during long-responding calls. However, logically allowing users to work with the API while waiting for a response from a modifying endpoint is error-prone and leads to collisions like the one we described above.

The asynchronous call pattern is useful for solving other practical tasks as well:

caching operation results and providing links to them (implying that if the client needs to reread the operation result or share it with another client, it might use the task identifier to do so)
ensuring operation idempotency (through introducing the task confirmation step we will actually get the draft-commit system as discussed in the “Describing Final Interfaces” chapter)
naturally improving resilience to peak loads on the service as the new tasks will be queuing up (possibly prioritized) in fact implementing the “token bucket” technique
organizing interaction in the cases of very long-lasting operations that require more time than reasonable timeouts (which are tens of seconds in the case of network calls) or can take unpredictable time.

Also, asynchronous communication is more robust from a future API development point of view: request handling procedures might evolve towards prolonging and extending the asynchronous execution pipelines whereas synchronous handlers must retain reasonable execution times which puts certain restrictions on possible internal architecture.

NB: in some APIs, an ambivalent decision is implemented where endpoints feature a double interface that might either return a result or a link to a task. Although from the API developer's point of view, this might look logical (if the request was processed “quickly”, e.g., served from cache, the result is to be returned immediately; otherwise, the asynchronous task is created), for API consumers, this solution is quite inconvenient as it forces them to maintain two execution branches in their code. Sometimes, a concept of providing a double set of endpoints (synchronous and asynchronous ones) is implemented, but this simply shifts the burden of making decisions onto partners.

The popularity of the asynchronicity pattern is also driven by the fact that modern microservice architectures “under the hood” operate in asynchronous mode through event queues or pub/sub middleware. Implementing an analogous approach in external APIs is the simplest solution to the problems caused by asynchronous internal architectures (the unpredictable and sometimes very long latencies of propagating changes). Ultimately, some API vendors make all API methods asynchronous (including the read-only ones) even if there are no real reasons to do so.

However, we must stress that excessive asynchronicity, though appealing to API developers, implies several quite objectionable disadvantages:

If a single queue service is shared by all endpoints, it becomes a single point of failure for the system. If unpublished events are piling up and/or the event processing pipeline is overloaded, all the API endpoints start to suffer. Otherwise, if there is a separate queue service instance for every functional domain, the internal architecture becomes much more complex, making monitoring and troubleshooting increasingly costly.
For partners, writing code becomes more complicated. It is not only about the physical volume of code (creating a shared component to communicate with queues is not that complex of an engineering task) but also about anticipating every endpoint to possibly respond slowly. With synchronous endpoints, we assume by default that they respond within a reasonable time, less than a typical response timeout (which, for client applications, means that just a spinner might be shown to a user). With asynchronous endpoints, we don't have such a guarantee as it's simply impossible to provide one.
Employing task queues might lead to some problems specific to the queue technology itself, i.e., not related to the business logic of the request handler:
- tasks might be “lost” and never processed
- events might be received in the wrong order or processed twice, which might affect public interfaces
- under the task identifier, wrong data might be published (corresponding to some other task) or the data might be corrupted.
These issues will be totally unexpected by developers and will lead to bugs in applications that are very hard to reproduce.
As a result of the above, the question of the viability of such an SLA level arises. With asynchronous tasks, it's rather easy to formally make the API uptime 100.00% — just some requests will be served in a couple of weeks when the maintenance team finds the root cause of the delay. Of course, that's not what API consumers want: their users need their problems solved now or at least in a reasonable time, not in two weeks.

Therefore, despite all the advantages of the approach, we tend to recommend applying this pattern only to those cases when they are really needed (as in the example we started with when we needed to lower the probability of collisions) and having separate queues for each case. The perfect task queue solution is the one that doesn't look like a task queue. For example, we might simply make the “order creation task is accepted and awaits execution” state a separate order status and make its identifier the future identifier of the order itself:

const pendingOrders = await api.
  getOngoingOrders();
if (pendingOrders.length == 0) {
  // Don't call it a “task”,
  // just create an order
  const order = await api
    .createOrder(…);
}
// App restart happens here,
// and all the same requests
// are repeated
const pendingOrders = await api.
  getOngoingOrders(); 
  /* → { orders: [{
    order_id: <task identifier>,
    status: "new"
  }]} */

NB: let us also mention that in the asynchronous format, it's possible to provide not only binary status (task done or not) but also execution progress as a percentage if needed.

This is Chapter 19 of “The API” book being written by Sergey Konstantinov. I also have a book on the history of beer and historical beer styles, a Telegram channel on interesting classical music recordings, a travel photo blog on Unsplash, and a website with ranking fantasy & science fiction novels based on awards they received.

[The API Patterns] Asynchronicity and Time Management

Chapter 19. Asynchronicity and Time Management

Discussion about this post