[HTTP APIs & REST] Working with HTTP API Errors. Final Provisions and General Recommendations

Jul 03, 2023

With this post, I’m continuing to publish the v2 of my book dedicated to APIs. If you like this book, please rate it on GitHub, Amazon, or Goodreads.

Chapter 39. Working with HTTP API Errors

The examples of organizing HTTP APIs discussed in the previous chapters were mostly about “happy paths,” i.e., the direct path of working with an API in the absence of obstacles. It's now time to talk about the opposite case: how HTTP APIs should work with errors and how the standard and the REST architectural principles can help us.

Imagine that some actor (a client or a gateway) tries to create a new order:

POST /v1/orders?user_id=<user_id> HTTP/1.1
Authorization: Bearer <token>
If-Match: <revision>

{ /* order parameters */ }

What problems could potentially happen while handling the request? Off the top of our heads, it might be:

1. The request cannot be parsed (invalid symbols, syntax violation, etc.)
2. The authorization token is missing
3. The authorization token is invalid
4. The token is valid, but the user is not permitted to create new orders
5. The user is deleted or deactivated
6. The user identifier is invalid or does not exist
7. The revision is missing
8. The revision does not match the actual one
9. Some required fields are missing in the request body
10. A value of a field exceeds the allowed boundaries
11. The limit for the number of requests has been reached
12. The server is overloaded and cannot respond
13. Unknown server error (i.e., the server is broken to the extent that it's impossible to understand why the error happened).

From general considerations, the natural idea is to assign a status code for each mistake. Obviously, the 403 Forbidden code fits well for mistake #4, and the 429 Too Many Requests for #11. However, let's not be rash and first ask: for what purpose are we assigning codes to errors?

Generally speaking, there are three kinds of actors in the system: the user, the application (a client), and the server. Each of these actors needs to understand several important things about the error (and the answers could actually differ for each of them):

1. Who made the mistake: the end user, the developer of the client, the backend developer, or another interim agent such as the network stack programmer?
- And let's not forget about the possibility of the mistake being deliberately made by either an end user or a client developer while trying to brute-force hijack the account of another user.
2. Is it possible to fix the error by just repeating the request?
- If yes, then after what period of waiting?
3. If it is not the case, is it still possible to fix it by reformulating the request?
4. If the error cannot be resolved, what should be done about it?

One of these questions is easily answered in the HTTP API paradigm: the desired interval of repeating the request might be indicated in a Retry-After header. Also, HTTP helps with question #1: to understand which side is the cause of the error, the first digit in the HTTP status code is used (see below).
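The two protocol-level hints mentioned above could be read on the client side roughly as follows. This is a minimal sketch, not a definitive implementation: the function names `error_side` and `retry_delay_seconds` are our own assumptions, and the only facts taken from the standard are that the first digit denotes the error class and that Retry-After carries either a number of seconds or an HTTP date.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def error_side(status: int) -> str:
    """Map the first digit of a status code to the side presumed at fault."""
    if 400 <= status <= 499:
        return "client"
    if 500 <= status <= 599:
        return "server"
    return "none"


def retry_delay_seconds(
    retry_after: Optional[str], now: Optional[datetime] = None
) -> Optional[float]:
    """Interpret a Retry-After value given as delay seconds or an HTTP date.

    Returns None if the header is absent or cannot be interpreted.
    """
    if retry_after is None:
        return None
    value = retry_after.strip()
    if value.isdigit():
        return float(value)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        # Assumption: treat a date without a zone as UTC
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means "retry immediately"
    return max(0.0, (when - now).total_seconds())
```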

With the other questions, the situation is unfortunately much more complicated.

Client Errors

Status codes that start with the digit 4 indicate that it was the user or the client who made a mistake, or at least the server decided so. Usually, repeating a request that resulted in a 4xx error is meaningless: the request will never be fulfilled unless some additional actions are performed. However, there are notable exceptions, most importantly 429 Too Many Requests and 404 Not Found. The latter implies some “uncertainty state” according to the standard: the server could use it if exposing the real cause of the error is undesirable. After receiving a 404, the request might be repeated, possibly yielding a different outcome. To indicate the persistent non-existence of a resource, a separate 410 Gone status is used.
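As an illustration of the exceptions just described, a client might classify 4xx responses as sketched below. The `Advice` enumeration and the function name are hypothetical; the mapping merely restates the paragraph above and is not a normative rule.

```python
from enum import Enum


class Advice(Enum):
    RETRY_LATER = "retry_later"  # repeat after a pause (429)
    MAY_RETRY = "may_retry"      # uncertain state, a later attempt may differ (404)
    GIVE_UP = "give_up"          # the resource is persistently absent (410)
    FIX_REQUEST = "fix_request"  # repeating the same request is meaningless


def advice_for_client_error(status: int) -> Advice:
    """Suggest what to do with a request that failed with a 4xx status."""
    if not 400 <= status <= 499:
        raise ValueError(f"not a 4xx status code: {status}")
    if status == 429:
        return Advice.RETRY_LATER
    if status == 404:
        return Advice.MAY_RETRY
    if status == 410:
        return Advice.GIVE_UP
    return Advice.FIX_REQUEST
```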

A more interesting question is what the client can (or must) do if such an error is received. As we discussed in the “Isolating Responsibility Areas” chapter, if the error can be resolved, there must be a machine-readable description for the client to interpret. In the case it cannot, human-readable instructions should be provided for the user (even “Try restarting the application” is a better user experience than “Unknown error happened”) and for the client developer.

If we try to apply this principle to HTTP APIs, we will soon learn that the situation is complicated. On the one hand, the protocol includes a lot of codes that indicate specific problems with using the protocol, such as 405 Method Not Allowed (indicates that the verb in the request cannot be applied to the requested resource), 406 Not Acceptable (the server cannot return a representation that satisfies the Accept* headers in the request), 411 Length Required, 414 URI Too Long, etc. The client code might process these errors and sometimes even perform some actions to mitigate them (for example, add a Content-Length header in case of a 411 error). However, this is hardly applicable to business logic. If the server returns a 429 Too Many Requests when some limit is exceeded, there are no standardized means of indicating which exact limit was hit.

Sometimes, the absence of a common approach to describing business logic errors is circumvented by using different codes with almost identical semantics (or just randomly chosen codes) to distinguish between different causes of the error. One notable example is the widely adopted usage of the 401 Unauthorized status code to indicate the absence or the invalid value of authorization headers, which is a signal for an application to ask the user to log in. This usage contradicts the standard (which requires that a 401 response must contain the WWW-Authenticate header that describes the methods of authorization; we are unaware of a single API that follows this requirement), but it has become a de facto standard itself.

Even if we choose this approach, there are very few status codes that can reflect different aspects of the same error type. In fact, we face the situation that all the multiplicity of business-bound errors is to be returned using a very limited set of status codes:

400 Bad Request for all the errors related to request validation issues. (Some purists insist that 400 corresponds to format violations such as invalid JSON. For logical errors, the 422 Unprocessable Content code is to be used. This actually changes nothing regarding the discussed problem.)
403 Forbidden for any problems related to authorizing the user's actions.
404 Not Found if any of the entities referred to in the request are non-existent or if exposing the real cause of the error is undesirable.
409 Conflict if data integrity is violated.
410 Gone if the resource was deleted.
429 Too Many Requests if some quotas are exceeded.

The editors of the specification are very well aware of this problem as they state that “the server SHOULD send a representation containing an explanation of the error situation, and whether it is a temporary or permanent condition.” This, however, contradicts the entire idea of a uniform machine-readable interface (and so does the idea of using arbitrary status codes). (Let us additionally emphasize that this lack of standard tools to describe business logic-bound errors is one of the reasons we consider the REST architectural style as described by Fielding in his 2008 article non-viable. The client must possess prior knowledge of error formats and how to work with them. Otherwise, it could restore its state after an error only by restarting the application.)

Additionally, there is a third dimension to this problem in the form of webserver software for monitoring system health that often relies on status codes to plot charts and emit notifications. However, two errors represented with the same status code — let's say, wrong password and expired token — might be very different. The increased rate of the former might indicate brute-forcing of accounts, while an unusually high frequency of the latter could be a result of a client error if a new version of an application wrongly caches authorization tokens.
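To make the monitoring argument concrete, here is a small sketch that aggregates the same stream of errors in two ways. The event format (a pair of status code and error kind), the kind names `wrong_password` and `token_expired`, and the choice of the 401 status for both are assumptions made purely for illustration.

```python
from collections import Counter
from typing import Iterable, Tuple

ErrorEvent = Tuple[int, str]  # (HTTP status, machine-readable error kind)


def count_by_status(events: Iterable[ErrorEvent]) -> Counter:
    """What a status-code-only dashboard sees."""
    return Counter(status for status, _kind in events)


def count_by_kind(events: Iterable[ErrorEvent]) -> Counter:
    """What becomes visible once the error kind is available."""
    return Counter(events)


events = [(401, "wrong_password")] * 3 + [(401, "token_expired")] * 2
# By status alone, the two very different problems merge into one number
print(count_by_status(events))
# With the error kind, each of them can get its own chart and alert threshold
print(count_by_kind(events))
```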

All these observations naturally lead us to the following conclusion: if we want to use errors for diagnostics and (possibly) helping clients to recover, we need to include machine-readable metadata about the error subtype and, possibly, additional properties to the error body with a detailed description of the error. For example, as we proposed in the “Describing Final Interfaces” chapter:

POST /v1/coffee-machines/search HTTP/1.1

{
  "recipes": ["lngo"],
  "position": {
    "latitude": 110,
    "longitude": 55
  }
}
→ 
HTTP/1.1 400 Bad Request
X-OurCoffeeAPI-Error-Kind: wrong_parameter_value

{
  "reason": "wrong_parameter_value",
  "localized_message": "Something is wrong. Contact the developer of the app.",
  "details": {
    "checks_failed": [
      {
        "field": "recipes",
        "error_type": "wrong_value",
        "message": "Unknown value: 'lngo'. Did you mean 'lungo'?"
      },
      {
        "field": "position.latitude",
        "error_type": "constraint_violation",
        "constraints": {
          "min": -90,
          "max": 90
        },
        "message": "'position.latitude' value must fall within the [-90, 90] interval"
      }
    ]
  }
}

Let us also remind the reader that the client must treat unknown 4xx status codes as a 400 Bad Request error. Therefore, the (meta)data format for the 400 error must be as general as possible.
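On the client side, this rule might be implemented as sketched below. The field names `reason`, `localized_message`, and `details` come from the sample above; the `ApiError` class, the set of codes the client is assumed to know, and the fallback values are our own assumptions.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict

# Assumption: the 4xx codes this particular client knows how to handle
KNOWN_4XX = frozenset({400, 403, 404, 409, 410, 429})


@dataclass
class ApiError:
    status: int
    reason: str
    localized_message: str
    details: Dict[str, Any] = field(default_factory=dict)


def effective_status(status: int) -> int:
    """Treat any unknown 4xx status code as a generic 400."""
    if 400 <= status <= 499 and status not in KNOWN_4XX:
        return 400
    return status


def parse_error(status: int, body: str) -> ApiError:
    """Build an ApiError, tolerating an empty or malformed error body."""
    try:
        data = json.loads(body) if body else {}
    except json.JSONDecodeError:
        data = {}
    if not isinstance(data, dict):
        data = {}
    details = data.get("details")
    if not isinstance(details, dict):
        details = {}
    return ApiError(
        status=effective_status(status),
        reason=str(data.get("reason", "unknown_error")),
        localized_message=str(data.get("localized_message", "")),
        details=details,
    )
```

Because an unknown code falls back to 400, the parser cannot rely on any field being present, which is why every field has a default here.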

Server Errors

5xx errors indicate that the client did everything right, and the problem is server-bound. For the client, the only important thing about the server error is whether it makes sense to repeat the request (and if yes, then when). Keeping in mind that in publicly available APIs, the real reason for the error is usually not exposed, having just the 500 Internal Server Error and 503 Service Unavailable codes is enough for most subject areas. (The latter is needed to indicate that the denial of service state is temporary, and it could be replaced by simply adding a Retry-After header to a 500 error.)

However, for internal systems, this argumentation is wrong. To build proper monitoring and notification systems, server errors must contain machine-readable error subtypes, just like the client errors. The same approaches are applicable (either using arbitrary status codes and/or passing error kind as a header); however, this data must be stripped off by a gateway that marks the border between external and internal systems and replaced with general instructions for both developers and end users, describing actions that need to be performed upon receiving an error.

POST /v1/orders/?user_id=<user_id> HTTP/1.1
If-Match: <revision>

{ parameters }
→
// The response the gateway received
// from the server, the metadata
// of which will be used for
// monitoring and diagnostics
HTTP/1.1 500 Internal Server Error
// Error kind: timeout from the DB
X-OurCoffeeAPI-Error-Kind: db_timeout
{ /*
   * Additional data, such as
   * which host returned an error
   */ }

// The response as returned to
// the client. The details regarding
// the server error are removed
// and replaced with instructions
// for the client. As at the gateway
// level it is unknown whether
// order creation succeeded, the client
// is advised to repeat the request 
// and/or retrieve the actual state.
HTTP/1.1 500 Internal Server Error
Retry-After: 5

{ 
  "reason": "internal_server_error",
  "localized_message": "Cannot get a response from the server. Please try repeating the operation or reload the page.",
  "details": {
    "can_be_retried": true,
    "is_operation_failed": "unknown"
  }
}
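The gateway behavior shown in the sample above could be sketched as follows. The header name and the public response body are taken from the sample; the function name, the in-memory `monitoring_log` list standing in for a real monitoring pipeline, and the default retry interval are assumptions.

```python
import json
from typing import Dict, List, Tuple

INTERNAL_KIND_HEADER = "X-OurCoffeeAPI-Error-Kind"


def sanitize_server_error(
    status: int,
    headers: Dict[str, str],
    monitoring_log: List[Tuple[int, str]],
    retry_after_seconds: int = 5,
) -> Tuple[int, Dict[str, str], str]:
    """Record the internal error kind, then return a response safe to expose."""
    # Header names are case-insensitive, so normalize before the lookup
    lowered = {name.lower(): value for name, value in headers.items()}
    kind = lowered.get(INTERNAL_KIND_HEADER.lower(), "unknown")
    monitoring_log.append((status, kind))

    # Nothing from the internal response is copied to the public one
    public_headers = {
        "Content-Type": "application/json",
        "Retry-After": str(retry_after_seconds),
    }
    public_body = {
        "reason": "internal_server_error",
        "localized_message": (
            "Cannot get a response from the server. "
            "Please try repeating the operation or reload the page."
        ),
        "details": {
            "can_be_retried": True,
            # The gateway does not know whether the order was created
            "is_operation_failed": "unknown",
        },
    }
    return status, public_headers, json.dumps(public_body)
```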

However, we are on a slippery slope here. The contemporary practice of implementing HTTP API clients allows for repeating safe requests (e.g., GET, HEAD, and OPTIONS methods). In the case of unsafe methods, developers need to write code to repeat the request, and to do so they need to read the documentation very carefully to check if it is the desired behavior and if it is actually safe.

Theoretically, with idempotent PUT and DELETE it should be more convenient. In practice, since many developers are unaware of this property, frameworks for working with HTTP APIs will likely not repeat these requests. Still, we can get some benefit from following the standards as the signature itself indicates that the request can be retried.
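A retry policy built on the method semantics alone might look like the sketch below. The function name and the `retry_idempotent` switch are hypothetical; the method sets simply restate which methods the standard defines as safe and idempotent.

```python
SAFE_METHODS = frozenset({"GET", "HEAD", "OPTIONS"})
# Every safe method is also idempotent; PUT and DELETE are idempotent but unsafe
IDEMPOTENT_METHODS = SAFE_METHODS | {"PUT", "DELETE"}


def can_auto_retry(method: str, status: int, retry_idempotent: bool = True) -> bool:
    """Decide whether a request that failed with a 5xx may be repeated blindly.

    With retry_idempotent=False the policy mimics clients that
    only repeat safe requests.
    """
    if not 500 <= status <= 599:
        return False
    allowed = IDEMPOTENT_METHODS if retry_idempotent else SAFE_METHODS
    return method.upper() in allowed
```

A POST is never repeated by this policy: for such operations the decision has to come from the documentation or from the error body.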

As for more complex operations, to make developers aware that they can repeat a potentially unsafe operation, we could introduce a format describing the possible actions in the error response itself… However, developers seldom expect to find such instructions in the error body, probably because programmers rarely see 5xx errors during development, unlike their 4xx counterparts, and testing environments usually do not provide capabilities to emulate server errors. All in all, you will have to describe the desirable actions in the documentation. (Be aware that this instruction will likely be ignored. This is the way.)

Organizing HTTP API Error Nomenclature in Practice

As is obvious from what was discussed above, there are essentially three approaches to working with errors in HTTP APIs:

1. Applying an “extended interpretation” to the status code nomenclature, or in plain words, selecting or inventing a new status code for each new type of error introduced. (The author of this book has frequently observed an approach to API development that included choosing a status code based on wording resembling the error cause, disregarding its description in the standard completely.)
2. Abolishing the use of status codes and developing a format for errors enveloped in a 200 HTTP response. Most RPC frameworks choose this direction.
- 2a. A subvariant of this strategy is using just two status codes (400 for every client error, 500 for every server error), optionally complemented by a third one (404 to indicate situations of uncertainty).
3. Employing a mixed approach, i.e., using status codes in accordance with their semantics to indicate an error family with additional (meta)data being passed in a specially developed format (similar to the code samples we gave above).

Obviously, only approach #3 could be considered compliant with the standard. Let us be honest and say that the benefits of following it (especially compared to option #2a) are not very significant and only comprise better readability of logs and transparency for intermediate proxies.
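On the server side, the mixed approach could be sketched like this: the status code is derived from the error family, while the exact cause travels in a header and in the body. The header and the body fields follow the samples given earlier; the reason names in the table (other than `wrong_parameter_value`) and the function name are invented for illustration.

```python
import json
from typing import Any, Dict, Optional, Tuple

# Hypothetical error kinds mapped to the status code of their family
REASON_TO_STATUS = {
    "wrong_parameter_value": 400,
    "action_not_permitted": 403,
    "entity_not_found": 404,
    "data_integrity_violation": 409,
    "rate_limit_exceeded": 429,
}


def build_error_response(
    reason: str,
    localized_message: str,
    details: Optional[Dict[str, Any]] = None,
) -> Tuple[int, Dict[str, str], str]:
    """Return (status, headers, body) for a known error kind.

    Unknown kinds are reported as a generic 500 without exposing the kind.
    """
    status = REASON_TO_STATUS.get(reason)
    if status is None:
        status, reason = 500, "internal_server_error"
    headers = {
        "Content-Type": "application/json",
        "X-OurCoffeeAPI-Error-Kind": reason,
    }
    body = {
        "reason": reason,
        "localized_message": localized_message,
        "details": details or {},
    }
    return status, headers, json.dumps(body)
```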

Chapter 40. Final Provisions and General Recommendations

Let's summarize what was discussed in the previous chapters. To design a fine HTTP API one needs to:

1. Describe a happy path, i.e., draw a diagram of all HTTP calls that occur during a normal work cycle of an application.
2. Interpret every call as an operation executed on a resource and assemble a nomenclature of URLs and applicable methods accordingly.
3. Enumerate errors that might occur during operation execution and determine paths to restore the application state for clients after receiving an error.
4. Decide which functionality will be communicated at the HTTP protocol level, i.e., which standard protocol capabilities to use in conjunction with what tools and software and the extent of their usage.
5. Develop a detailed specification regarding the aforementioned list points.
6. Check yourself: elaborate on items 1-3 to write pseudo-code for the application's business logic in accordance with the specification, and evaluate the convenience, understandability, and readability of your API.

Additionally, we'd like to provide some code style advice:

Do not differentiate paths with trailing / and without it. Employ a default policy (we would rather recommend ending paths with / for a simple reason: it allows for referring to operations on the domain root resource in a readable manner as VERB /). If you decide to prohibit one of the variants (let's say, all URLs must end with a trailing slash), make a redirect or provide a very readable error message if a developer tries to call a URL formatted otherwise.
Include common headers (such as Date, Content-Type, Content-Encoding, Content-Length, Cache-Control, Retry-After, etc.) in the responses and generally avoid relying on clients to guess default protocol parameters correctly.
Support the OPTIONS method and the CORS protocol just in case your API needs to be accessed from a Web browser.
Choose a casing rule and a rule for transforming casing while moving a parameter from one part of an HTTP request to another.
Always leave an opportunity for backward-compatible extension of an API method. In particular, always return a JSON object as the endpoint response root as objects can always be extended with a new field, unlike arrays and primitives.
- Let us also note that an empty string is invalid JSON, so you need to return an empty object {} in 200 responses even if it doesn't have a specific meaning. Alternatively, you can use the 204 No Content status code with an empty body, which is not extensible.
For every GET response, provide explicit caching parameters (otherwise, there is always a chance that a client or an intermediate agent invents them on their own).
Do not employ known possibilities to serve requests in violation of the standard and avoid exploiting “gray zones” of the protocol. In particular:
- Do not place unsafe operations behind the GET verb, and do not place non-idempotent operations behind the PUT / DELETE methods.
- Maintain the GET / PUT / DELETE operations symmetry.
- Do not allow GET / HEAD / DELETE requests to have a body and do not provide bodies in response to HEAD requests or alongside the 204 status code.
- Do not invent your own standards for passing arrays and nested objects as query parameters. It is better to use an HTTP verb that allows having a body, or as a last resort pass the parameter as a Base64-encoded JSON-stringified value.
- Do not put parameters that require escaping (i.e., non-alphanumeric ones) in a path or a domain of a URL. Use query or body parameters for this purpose.
Familiarize yourself with at least the basics of typical vulnerabilities in HTTP APIs used by attackers, and include protection against these attack vectors at the webserver software level. The OWASP community provides a good cheatsheet on the best HTTP API security practices.
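Two of the recommendations above, the trailing slash policy and passing a complex parameter as a Base64-encoded JSON value, lend themselves to a short sketch. All function names are hypothetical, and the choice of the URL-safe Base64 alphabet is our assumption (the standard alphabet contains + and /, which would need escaping in a query string).

```python
import base64
import json
from typing import Any, Optional
from urllib.parse import urlsplit, urlunsplit


def canonical_url(url: str) -> Optional[str]:
    """Return the trailing-slash form of a URL to redirect to.

    Returns None if the path already ends with a slash.
    """
    parts = urlsplit(url)
    if parts.path.endswith("/"):
        return None
    return urlunsplit(parts._replace(path=parts.path + "/"))


def encode_query_value(value: Any) -> str:
    """Serialize a nested structure into a single query parameter value."""
    raw = json.dumps(value, separators=(",", ":")).encode("utf-8")
    # Note: the result may end with '=' padding characters
    return base64.urlsafe_b64encode(raw).decode("ascii")


def decode_query_value(encoded: str) -> Any:
    """Reverse encode_query_value."""
    return json.loads(base64.urlsafe_b64decode(encoded.encode("ascii")))
```

A server following the first recommendation would answer a request to a non-canonical URL with a redirect to the value returned by `canonical_url`, or with a readable error naming it.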

In conclusion, we would like to make the following statement: building an HTTP API is relying on the common knowledge of HTTP call semantics and drawing benefits from it by leveraging various software built upon this paradigm, from client frameworks to server gateways, and developers reading and understanding API specifications. In this sense, the HTTP ecosystem provides probably the most comprehensive vocabulary, both in terms of profoundness and adoption, compared to other technologies, allowing for describing many different situations that may arise in client-server communication. While the technology is not perfect and has its flaws, for a public API vendor, it is the default choice, and as of today, it is opting for other technologies that needs to be substantiated.

This is Chapters 39 and 40 of “The API” book being written by Sergey Konstantinov.