[The API Patterns] Lists and Accessing Them

May 08, 2023

With this post, I’m continuing publishing the v2 of my book dedicated to APIs. If you like this book, please rate it on GitHub, Amazon, or Goodreads and

Chapter 20. Lists and Accessing Them

In the previous chapter, we concluded with the following interface that allows minimizing collisions while creating orders:

const pendingOrders = await api.
  getOngoingOrders(); 
→
{ orders: [{
    order_id: <task identifier>,
    status: "new"
}, …]}

However, an attentive reader might notice that this interface violates the recommendation we previously gave in the “Describing Final Interfaces” chapter: the returned data volume must be limited, but there are no restrictions in our design. This problem was already present in the previous versions of the endpoint, but abolishing asynchronous order creation makes it much worse. The task creation operation must work as quickly as possible, and therefore, almost all limit checks are to be executed asynchronously. As a result, a client might easily create a large number of ongoing tasks which would potentially inflate the size of the getOngoingOrders response.

NB: having no limit at all on order task creation is unwise, and there must be some (involving as lightweight checks as possible). Let us, however, focus on the response size issue in this chapter.

Fixing this problem is rather simple: we might introduce a limit for the items returned in the response, and allow passing filtering and sorting parameters, like this:

api.getOngoingOrders({
  // The `limit` parameter 
  // is optional, but there is
  // a reasonable default value
  "limit": 100,
  "parameters": {
    "order_by": [{
      "field": "created_iso_time",
      "direction": "desc"
    }]
  }
})

However, introducing limits leads to another issue: if the number of items to return is higher than the limit, how would clients access them?

The standard approach is to add an offset parameter or a page number:

api.getOngoingOrders({
  // The `limit` parameter 
  // is optional, but there is
  // a reasonable default value
  "limit": 100,
  // The default value is 0
  "offset": 100,
  "parameters"
})

With this approach, however, other problems arise. Let us imagine three orders are being processed on behalf of the user:

[{
  "id": 3,
  "created_iso_time": "2022-12-22T15:35",
  "status": "new"
}, {
  "id": 2,
  "created_iso_time": "2022-12-22T15:34",
  "status": "new"
}, {
  "id": 1,
  "created_iso_time": "2022-12-22T15:33",
  "status": "new"
}]

A partner application requested the first page of the list:

api.getOrders({
  "limit": 2,
  "parameters": {
    "order_by": [{
      "field": "created_iso_time",
      "direction": "desc"
    }]
  }
})
→
{
  "orders": [{
    "id": 3, …
  }, {
    "id": 2, …
  }]
}

Then the application requests the second page ("limit": 2, "offset": 2) and expects to retrieve the order with "id": 1. However, during the interval between the requests, another order, with "id": 4, happened.

[{
  "id": 4,
  "created_iso_time": "2022-12-22T15:36",
  "status": "new"
}, {
  "id": 3,
  "created_iso_time": "2022-12-22T15:35",
  "status": "new"
}, {
  "id": 2,
  "created_iso_time": "2022-12-22T15:34",
  "status": "ready"
}, {
  "id": 1,
  "created_iso_time": "2022-12-22T15:33",
  "status": "new"
}]

Then upon requesting the second page of the order list, instead of getting exactly one order with "id": 1, the application will get the "id": 2 order once again:

api.getOrders({
  "limit": 2,
  "offset": 2
  "parameters"
})
→
{
  "orders": [{
    "id": 2, …
  }, {
    "id": 1, …
  }]
}

These permutations are rather inconvenient in user interfaces (if let's say, the partner's accountant is requesting orders to calculate fees, they might easily overlook the duplicate identifiers and process one order twice). But in the case of programmable integrations, the situation becomes even more complicated: the application developer needs to write rather unobvious code (which preserves the information regarding which pages were already processed) to carry out this enumeration correctly.

The problem might easily become even more sophisticated. For example, if we add sorting by two fields, creation date and order status:

api.getOrders({
  "limit": 2,
  "parameters": {
    "order_by": [{
      "field": "status",
      "direction": "desc"
    }, {
      "field": "created_iso_time",
      "direction": "desc"
    }]
  }
})
→
{
  "orders": [{
    "id": 3,
    "status": "new"
  }, {
    "id": 2,
    "status": "new"
  }]
}

Imagine, that in between requesting the first and the second pages, the "id": 1 order changed its status and moved to the top of the list. Upon requesting the second page, the partner application will only receive the "id": 2 order (for the second time) and miss the "id": 1 completely — and there is no method to learn this fact!

Let us reiterate: this approach works poorly with visual interfaces, but with program ones, it inevitably leads to mistakes. An API must provide methods of traversing large lists that guarantee clients can retrieve the full and consistent dataset.

If we don't go into implementation details, we can identify three main patterns of realizing such traversing, depending on how the data itself is organized.

Immutable Lists

The easiest case is with immutable lists, i.e., when the set of items never changes. The limit/offset scheme then works perfectly and no additional tricks are needed. Unfortunately, this rarely happens in real subject areas.

Additive Lists, Immutable Data

The case of a list with immutable items and the operation of adding new ones is more typical. Most notably, we talk about event queues containing, for example, new messages or notifications. Let's imagine there is an endpoint in our coffee API that allows partners to retrieve the history of offers:

GET /v1/partners/{id}/offers/history⮠
  limit=<limit>
→
{
  "offer_history": [{
    // A list item identifier
    "id",
    // An identifier of the user
    // that got the offer
    "user_id",
    // Date and time of the search
    "occurred_at",
    // The search parameter values
    // set by the user
    "search_parameters",
    // The offers that the user got
    "offers"
  }]
}

The data returned from this endpoint is naturally immutable because it reflects a completed action: a user searched for offers and received a response. However, new items are continuously added to the list, potentially in large chunks, as users might make multiple searches in succession.

Partners can utilize this data to implement various features, such as:

Real-time user behavior analysis (e.g., sending push notifications with discount codes to encourage users to convert offers to orders)
Statistical analysis (e.g., calculating conversion rates per hour).

To enable these scenarios, we need to expose through the API two operations with the offer history:

For the first task, the real-time fetching of new offers that were made since the last request.
For the second task, traversing the list, i.e., retrieving all queries until some condition is reached (possibly, the end of the list).

Both scenarios are covered with the limit/offset approach but require significant effort to write code properly as partners need to somehow align their requests with the rate of incoming queries. Additionally, note that using the limit/offset scheme makes caching impossible as repeating requests with the same limit/offset values will emit different results.

To solve this issue, we need to rely not on an attribute that constantly changes (such as the item position in the list) but on other anchors. The important rule is that this attribute must provide the possibility to unambiguously tell which list elements are “newer” compared to the given one (i.e., precede it in the list) and which are “older”.

If the data storage we use for keeping list items offers the possibility of using monotonically increased identifiers (which practically means two things: (1) the DB supports auto-incremental columns and (2) there are insert locks that guarantee inserts are performed sequentially), then using the monotonous identifier is the most convenient way of organizing list traversal:

// Retrieve the records that precede
// the one with the given id
GET /v1/partners/{id}/offers/history⮠
  newer_than=<item_id>&limit=<limit>
// Retrieve the records that follow
// the one with the given id
GET /v1/partners/{id}/offers/history⮠
  older_than=<item_id>&limit=<limit>

The first request format allows for implementing the first scenario, i.e., retrieving the fresh portion of the data. Conversely, the second format makes it possible to consistently iterate over the data to fulfill the second scenario. Importantly, the second request is cacheable as the tail of the list never changes.

NB: in the “Describing Final Interfaces” chapter we recommended avoiding exposing incremental identifiers in publicly accessible APIs. Note that the scheme described above might be augmented to comply with this rule by exposing some arbitrary secondary identifiers. The requirement is that these identifiers might be unequivocally converted into monotonous ones.

Another possible anchor to rely on is the record creation date. However, this approach is harder to implement for the following reasons:

Creation dates for two records might be identical, especially if the records are mass-generated programmatically. In the worst-case scenario, it might happen that at some specific moment, more records were created than one request page contains making it impossible to traverse them.
If the storage supports parallel writing to several nodes, the most recently created record might have a slightly earlier creation date than the second-recent one because clocks on different nodes might tick slightly differently, and it is challenging to achieve even microsecond-precision coherence[1]. This breaks the monotonicity invariant, which makes it poorly fit for use in public APIs. If there is no other choice but relying on such storage, one of two evils is to be chosen:
- Introducing artificial delays, i.e., returning only items created earlier than N seconds ago, selecting this N to be certainly less than the clock irregularity. This technique also works in the case of asynchronously populated lists. Keep in mind, however, that this solution is probabilistic, and wrong data will be served to clients in case of backend synchronization problems.
- Describe the instability of ordering list items in the docs (and thus make partners responsible for solving arising issues).

Often, the interfaces of traversing data through stating boundaries are generalized by introducing the concept of a “cursor”:

// Initiate list traversal
POST /v1/partners/{id}/offers/history⮠
  search
{
  "order_by": [{
    "field": "created",
    "direction": "desc"
  }]
}
→
{
  "cursor": "TmluZSBQcmluY2VzIGluIEFtYmVy"
}

// Get the next data chunk
GET /v1/partners/{id}/offers/history⮠
  ?cursor=TmluZSBQcmluY2VzIGluIEFtYmVy⮠
  &limit=100
→
{
  "items": […],
  // Pointer to the next data chunk
  "cursor": "R3VucyBvZiBBdmFsb24"
}

A cursor might be just an encoded identifier of the last record or it might comprise all the searching parameters. One advantage of using cursors instead of exposing raw monotonous fields is the possibility to change the underlying technology. For example, you might switch from using an auto-incremental key to using the date of the last known record's creation without breaking backward compatibility. (That's why cursors are usually opaque strings: providing readable cursors would mean that you now have to maintain the cursor format even if you never documented it. It's better to return cursors encrypted or at least coded in a form that will not arise the desire to decode it and experiment with parameters.)

The cursor-based approach also allows adding new filters and sorting directions in a backward-compatible manner — provided you organize the data in a way that cursor-based traversal will continue working.

// Initialize list traversal
POST /v1/partners/{id}/offers/history⮠
  search
{
  // Add a filter by the recipe
  "filter": {
    "recipe": "americano"
  },
  // Add a new sorting mode
  // by the distance from some
  // location
  "order_by": [{
    "mode": "distance",
    "location": [-86.2, 39.8]
  }]
}
→
{
  "items": […],
  "cursor": 
    "Q29mZmVlIGFuZCBDb250ZW1wbGF0aW9u"
}

A small footnote: sometimes, the absence of the next-page cursor in the response is used as a flag to signal that iterating is over and there are no more elements in the list. However, we would rather recommend not using this practice and always returning a cursor even if it points to an empty page. This approach allows for adding the functionality of dynamically inserting new items at the end of the list.

NB: in some articles, organizing list traversals through monotonous identifiers / creation dates / cursors is not recommended because it is impossible to show a page selection to the end user and allow them to choose the desired result page. However, we should consider the following:

This case, of showing a pager and selecting a page, makes sense for end-user interfaces only. It's unlikely that an API would require access to random data pages.
If we talk about the internal API for an application that provides the UI control element with a pager, the proper approach is to prepare the data for this control element on the server side, including generating links to pages.
The boundary-based approach doesn't mean that using limit/offset parameters is prohibited. It is quite possible to have a double interface that would respond to both GET /items?cursor=… and GET /items?offset=…&limit=… queries.
Finally, if the need to have access to an arbitrary data page in the UI exists, we need to ask ourselves a question: what is the user's problem that we're solving with this UI? Most likely, users are searching for something, such as a specific list item or where they were the last time they worked with the list. Specific UI control elements to help them will be likely more convenient than a pager.

The General Case

Unfortunately, it is not universally possible to organize the data in a way that would not require mutable lists. For example, we cannot paginate the list of ongoing orders consistently because orders change their status and randomly enter and leave this list. In these general scenarios, we need to focus on the use cases for accessing the data.

Sometimes, the task can be reduced to an immutable list if we create a snapshot of the data. In many cases, it is actually more convenient for partners to work with a snapshot that is current for a specific date as it eliminates the necessity of taking ongoing changes into account. This approach works well with accessing “cold” data storage by downloading chunks of data and putting them into “hot” storage upon request.

POST /v1/orders/archive/retrieve
{
  "created_iso_date": {
    "from": "1980-01-01",
    "to": "1990-01-01"
  }
}
→
{
  "task_id": <an identifier of
    a task to retrieve the data>
}

The disadvantage of this approach is also clear: it requires additional (sometimes quite considerable) computational resources to create and store a snapshot (and therefore requires a separate tariff). And we actually haven't solved the problem: though we don't expose the real-time traversal functionality in public APIs, we still need to implement it internally to be able to make a snapshot.

The inverse approach to the problem is to never provide more than one page of data, meaning that partners can only access the “newest” data chunk. This technique is viable in one of three cases:

If the endpoint features a search algorithm that fetches the most relevant data. As we are well aware, nobody needs a second search result page.
If the endpoint is needed to modify data. For example, the partner's service retrieves all “new” orders to transit them into the “accepted” status; then pagination is not needed at all as with each request the partner is removing items from the top of the list.
- The important case for such modifications is marking the received data as “read”.
finally, if the endpoint is needed to access only real-time “raw” data while the processed and classified data are available through other interfaces.

If none of the approaches above works, our only solution is changing the subject area itself. If we can't consistently enumerate list elements, we need to find a facet of the same data that we can enumerate. In our example with the ongoing orders we might make an ordered list of the events of creating new orders:

// Retrieve all the events older
// than the one with the given id
GET /v1/orders/created-history⮠
  older_than=<item_id>&limit=<limit>
→
{
  "orders_created_events": [{
    "id": <event id>,
    "occured_at",
    "order_id"
  }, …]
}

Events themselves and the order of their occurrence are immutable. Therefore, it's possible to organize traversing the list. It is important to note that the order creation event is not the order itself: when a partner reads an event, the order might have already changed its status. However, accessing all new orders is ultimately doable, although not in the most efficient manner.

NB: in the code samples above, we omitted passing metadata for responses, such as the number of items in the list, the has_more_items flag, etc. Although this metadata is not mandatory (i.e., clients will learn the list size when they retrieve it fully), having it makes working with the API more convenient for developers. Therefore we recommend adding it to responses.

This is Chapter 20 of “The API” book being written by Sergey Konstantinov. I also have a book on the history of beer and historical beer styles, a Telegram channel on interesting classical music recordings, a travel photo blog on Unsplash, and a website with ranking fantasy & science fiction novels based on awards they received.

[The API Patterns] Lists and Accessing Them

Chapter 20. Lists and Accessing Them

Immutable Lists

Additive Lists, Immutable Data

The General Case

Discussion about this post