Ivan Kusalic - home page

Designing Resilient Web APIs: A practical guide to enabling API evolution

Designing APIs is hard. At the same time, it has far-reaching consequences for your product.

Will you be able to evolve the product? Who is already using the current version of the API that you just now learned is broken? Can you fix it without backward incompatible changes? Or do you need to coordinate with all the stakeholders? Do you have a way of tracking them all down and coordinating the change with all of them? How long will this coordination take before you can make new features available? Wait a minute, there are actually cars out there that depend on your current API and will be using it for the next 12 years?! How much does this all cost?!

Yes, API design is hard and has a huge impact on your product. If you are a developer, it influences your daily happiness as well.

In this book, I’ll share some of the lessons I’ve learned over the years about API design. Some of those will be applicable in your situation, some will not, some will be opinionated, some might even be just plain wrong from your perspective. Still, remember: you may need to live with a bad API for 12 years, and it could have been prevented.

Context

General API design is too big a topic to hope to cover in a book such as this one, so some context is in order.

I’ll focus predominantly on Web API design for backend services.

The term REST API is heavily abused nowadays and does not convey the real semantics to most people, so I’ll stick with the term Web API for the rest of the book.

Even though backend services are in the spotlight, a lot of reasoning will be applicable to design of libraries or different user interfaces.

Another point worth mentioning is that I’ll write about APIs intended to be used by numerous users who are not completely under your control. Think enterprise setup more than walking-skeleton exploration phase in a 3-person startup. This doesn’t mean that the points raised in the book do not apply to small projects – they do. What it does mean is that you hopefully already know a bit about the domain your API will cover, instead of purely trying to explore it via throwaway prototypes.

Finally, the presumed setup implies that the API will be hard to change once it’s published and out in the world – you are trying to guard against this pain. Maybe you are even willing to do a bit of design upfront.

Some of the points I’ll mention are trivial to follow, while others require considerable effort to be put into practice. Whatever your situation may be, I hope you’ll find something of value in the book.

A small note regarding the book itself – it reflects my preference for a top-down perspective. The book generally shifts from more abstract perspective to more concrete topics as it progresses. Even so, I’ve tried to make sure that there are some examples mixed in throughout it all. Note that individual chapters and sections also tend to follow the same pattern.

Without further ado – let’s start.

Table of Contents

Have Guiding Design Principles and Values

You should start the API design by taking the time to decide on design principles your API will follow and values it strives to provide. Those will serve as a foundation of your API and will guide the design.

Consistency

Having an explicit set of design principles will help you ensure consistency across the API as a whole. You’ll be able to refer to the principles while you are working on the initial API design. You’ll also use them later when you need to add new features or modify existing ones.

API design principles and values are particularly useful when you have new people working on the API at some later point in time. They likely won’t have the full background story and so design principles form a basis of a shared context.

Making Tradeoffs

When working on an API, you constantly need to make tradeoffs. Having guiding principles comes in handy in those situations, as it allows you to compare different potential outcomes against the chosen principles. This will often bring clarity to the situation and help you make the right decision.

Understanding API

You should document design principles to provide the overarching context for the whole API.

Make that documentation accessible to all the users of the API, not just your developers. Users familiar with only some API elements will have an easier time adapting to elements they haven’t seen before, because the whole API is consistent.

Having a consistent API with documented reasoning behind it will significantly reduce the frustration users experience when getting to know the API. It will additionally decrease the resources needed for customer support!

Have Values

For all the above reasons, you should have guiding values as well – what you are striving to achieve and will not easily compromise on.

The values are similar to principles, but are vaguer as they don’t say how to achieve something, only that you will strive to achieve it. They state, in the end, what value your customers are going to get. In a way, design principles are values made a bit more actionable.

The consistency of user data is a good example of such a value. Maybe it is critical that your users can depend on consistency, e.g. maybe the data is going to be used to navigate autonomous cars, where any inconsistency could result in costly and even fatal outcomes. Another example would be usability, if your system will be used by humans rather than being integrated with by other software systems.

You can’t just choose your wish-list of 20 values – they will be at odds with each other and lose any meaning. Instead, think about what it is you are fundamentally trying to provide to your users. Always keep the values in mind when working on the API and the system behind it. Let them guide your design. Be aware that the values you choose must matter; otherwise you’ll compromise on them, and they will lose all meaning, as is the case with any value system that is not lived by.

This sounds quite abstract. And indeed, it is. Let’s proceed and see how design principles and values can be put into practice.

Design Principles to Use

Which design principles should you use? That varies based on the intended audience of the API, product strategy, environment you are in, values and messages you’re trying to convey, etc.

Although you could use a different set of principles for each API, I seem to be using almost the same set most of the time. Or, said more generally: the beauty of software design lies in the fact that regardless of the scope and granularity of the design, the principles stay the same. Mostly, you are just giving them different weights to suit the situation at hand.

What follows in this section are the principles I use the most, together with the concrete consequences they have on Web API design. In a way, this also defines what I consider to be good API design. Your mileage may vary, so be sure to interpret the design principles I mention in your own business and technical context.

Semantically Strong Vocabulary

I place a huge importance on having clear semantics on all the levels of software design, APIs included.

When someone says they are using the Observer Pattern, I immediately assume that some kind of a subject is maintaining a collection of observers and is actively notifying them of interesting events. If this were not the case, the speaker would not have said that they are using the Observer Pattern. The semantic core of the Observer Pattern lies in the notification mechanism, and as such it must be applicable.

The value of having dedicated terms for different ideas is in the ability to quickly convey complex concepts. For that reason, the way we refer to things needs to be precise, with clear semantic meaning!

Having a semantically strong vocabulary helps clients understand your API, use it correctly, and build knowledge of your domain. You should probably provide a glossary defining all non-standard terms you use.

Avoid Code-Names and Abbreviations

Avoid using internal code-names at all times. In my experience, they start innocently and are even fun to use. It feels like you are having a private joke. But sooner or later you’ll leak those terms to customers. Maybe you’ll have a stray reference in the documentation. Or a package or a class name will use code-names. Or they will leak into business domain, and someone talking to a customer will use those terms. Better to not use them at all!

Code-names almost certainly don’t mean anything to anybody but you – users don’t have the context in which those terms were created. They are just confusing. It’s frustrating when you’re not in on the joke, right?

Once introduced, a term is very hard to replace with a better one. So, avoid this mess and just put a tiny bit of effort to come up with good terms that have clear semantics from the start.

This is all a bit abstract, so let me give you two terms used in the real world: Eggbox and Tofubox. Do those mean anything to you? Of course not. Well, the story behind them is that a certain team was writing a service that manages customer data. As it was managing all the data, it was “a basket containing all the eggs”. That led to the term Eggbox. That service was a prototype and was rewritten a year or two later. The main person driving the rewrite was a vegan. Hence Tofubox. Clever? Yes, but also a bad name to put in front of a customer. Would you like to suddenly be wondering what this Tofubox is, without the above context? Then don’t use confusing terms in your domain.

For the same reason, be careful to avoid all non-standard abbreviations as much as possible. And if you must use them, always put them in a glossary!

Example

What do I mean when I ask you to use terms with clear semantic meaning?

Let’s say you are referring to a particular kind of a data collection you call a Catalog in the API. You must have a clear meaning of what the idea of a Catalog entails! A user should be able to go through the definition of this Catalog you mention and correctly decide whether a certain collection of data they have is a Catalog or not. A side note – I’ll be using the Catalog example throughout the book.

If you are interested in a case study on conveying sound semantics at the lower level, in the code, check out this post.

Expose Minimal Useful Information

Let’s continue talking about semantics: Make sure you provide the user only the data they need and not more!

Let’s say that data in the aforementioned Catalog collection is versioned. Let’s further assume that the user has no need to know that versions are incrementally produced, and instead only needs to be able to know if they have the newest data or not. Even though you are using versions internally, do not expose them to the user – use checksums instead. The semantics of incremental versions are much stronger than that of checksums. Versions imply ordering between individual versions, support for ranges, particular versioning scheme being used, etc.
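To make the version-versus-checksum distinction concrete, here is a minimal sketch in Python. The element structure and the 16-character truncation are illustrative assumptions, not a prescribed scheme:

```python
import hashlib
import json

def element_checksum(element):
    """Derive an opaque checksum from an element's content.

    Unlike a version number, a checksum implies no ordering, no ranges,
    and no versioning scheme; it only lets a client detect change.
    """
    # Serialize deterministically so identical content always hashes the same.
    canonical = json.dumps(element, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]

# Expose this value in the API instead of the internal version number.
checksum = element_checksum({"id": "item-7", "name": "widget", "price": 3})
```

A client compares the checksum it stored with the one it receives; equality means the data is unchanged, and nothing more can be inferred.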

You can’t assume that users will depend only on the semantic meaning you want them to. They will depend on all the semantics that can possibly be implied. If you have a moderately successful product, someone will depend on wrong semantics you carelessly exposed through the API!

Continuing the Catalog example, let’s say that you’ve gone and exposed the versions everywhere, including in individual elements that are contained in a Catalog collection. Now that the system is successful, you need to prevent data from growing infinitely and so you are also retiring the data – you are retiring versions. Let’s say the collection reuses an element if it has not been modified. So now you have only versions 42 to 100 for a given catalog in the system, but you have an element that hasn’t been modified since version 5. The user now sees version 5 on that element and tries to query the catalog for version 5. And it fails because version 5 has been retired and does not exist anymore! All because you really should have used checksums dedicated to the individual elements of a catalog and not reused the version concept.

One thing to note from the previous example: what likely caused the problem is the reuse of code artifacts somewhere in the system where the semantics did not fit. Avoid this at all levels of software design: reuse of implementation artifacts, like fields, methods or classes, is only sound when the concepts/knowledge/semantics can be reused as well!

Or said differently: you can apply the Don’t Repeat Yourself (DRY) principle only when there is a duplication of knowledge.

Don’t Expose System Internals

Take care to abstract away all the internal implementation details from the user.

For example, you should not directly return AWS payloads from the API but should abstract them away. You might think that you’ll never move to in-house infrastructure or to a different cloud provider, so why should you care? Even if this turns out to be true, and you really can’t be sure, you could suddenly face a big customer that wants to work with you but refuses to work with Amazon under any circumstances. This is actually not an uncommon business situation.

You should always have separate API endpoints for administrative or operator users. Those endpoints change faster than endpoints used by regular users, and you potentially don’t even need to version the operator-only APIs. Note that even other internal teams should not use operator APIs directly – for more details, check out Amazon’s SOA history humorously exposed in Steve Yegge’s infamous post.

Be especially careful when deciding which HTTP status code to use. I’ve often seen systems return status codes that do not make sense from the user’s perspective but are instead a consequence of internal architecture and communication patterns. When deciding on a status code, always think about what you want a user to do as a result. Is the action they will perform, based on the semantics of the received status code, also the action you want them to take? Is the problem temporary? Should they just retry? Should they change parameters? Do they need to try a different request?

Continuing our running example with catalogs, let’s say that internally you need to connect to a database to retrieve catalog information. What response code should a user get when you have trouble connecting to the database? I’ve seen systems that in analogous situations return 404 (Not Found) as they can’t find the catalog data. From the user’s perspective, 404 is clearly the wrong status code – logically, the catalog still exists; it’s just that your system is malfunctioning. One of the 5xx status codes, signaling a server-side error, would be a better choice.
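As a sketch of the right behavior, a handler can translate internal failures into a user-appropriate status code at the API boundary. The exception and function names below are hypothetical stand-ins, not part of any real framework:

```python
class DatabaseUnavailable(Exception):
    """Raised internally when the catalog store cannot be reached."""

def fetch_catalog(catalog_id, db_ok=True):
    # Stand-in for a real database lookup.
    if not db_ok:
        raise DatabaseUnavailable()
    return {"id": catalog_id, "items": []}

def get_catalog_response(catalog_id, db_ok=True):
    """Return an (HTTP status, body) pair appropriate for the user."""
    try:
        return 200, fetch_catalog(catalog_id, db_ok)
    except DatabaseUnavailable:
        # The catalog still logically exists; the system is malfunctioning.
        # Returning 404 here would wrongly tell the user the data is gone.
        return 503, {"error": "Catalog service temporarily unavailable, please retry later."}
```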

This kind of broken semantics exposed via status codes, or more generally via wrong responses, is often found when serving requests that span multiple services. In case of a failure, the user often gets a response from one of the internal services instead of a response appropriate for the system as a whole, from the perspective of the user-system separation.

As an example, let’s assume that your catalog service internally has a master-slave architecture. You are currently experiencing problems and have failed the service over via DNS to a slave. The user tries to write data, but they are now hitting a slave. Given that slaves are read-only, the request fails. And the slave returns 403 (Forbidden). This is again wrong – while from the perspective of an internal slave component the action is indeed forbidden, it is not an appropriate return code for the user. There is nothing that the user is doing wrong or should do differently, so 4xx (Client Error) codes are not applicable. Use 503 (Service Unavailable) with a good error message instead. Regarding the error message, don’t leak system internals to the user by referring to the master-slave setup; say that write operations are temporarily unavailable instead.
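The translation itself can live in a thin layer at the system boundary. A minimal sketch, assuming a gateway that sees the replica’s raw response before the user does:

```python
def user_facing_response(internal_status, body):
    """Translate a status code from an internal component into one that
    makes sense from the user's perspective."""
    if internal_status == 403:
        # A read-only replica rejecting a write is a temporary server-side
        # condition, not a client error, so the 403 must not leak through.
        # Note the message avoids mentioning the master-slave setup.
        return 503, {"error": "Write operations are temporarily unavailable, please retry later."}
    return internal_status, body
```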

While I advise that you rigorously avoid exposing system internals to users, take your context into consideration. If you are designing upfront anywhere at all, it should be in the API, but be aware that this often requires a non-trivial amount of effort. If you are exposing the API to external B2B customers, you had better be rigorous. On the other hand, if you are prototyping a showcase that can make or break your startup, apply this principle in the situations where the effort is trivial and think carefully about the others.

Do Not Surprise Clients

Whenever I’m designing a public-facing API, I always keep the Principle of Least Surprise in mind. This principle is even more important when designing backend service APIs than in everyday software development.

Your users can’t just poke behind the API to gain additional understanding of what it was you intended them to do. They can only rely on the intrinsics of the API itself and the documentation you provide with it.

Note that the documentation only helps to a certain degree. It would be somewhat naive to assume that all your clients will carefully read all the possible corner cases listed in the API documentation, understand them fully, and only then write their code. They are more likely to just play around until they get something that reasonably looks like the expected output. While you definitely should have good documentation, it will only help so much if your API isn’t easy to understand by itself.

There is another reason why the Principle of Least Surprise is so important: an API is by definition an explicit integration boundary to a different system. Your users are switching from their regular and familiar context to the unknown context of your system and are more likely to be confused.

When followed properly, this principle can significantly reduce the frustration users encounter when starting to integrate with your system. This means that you are more likely to have happy users. You may even gain an evangelist for the product instead of losing a customer to the competition. Needing to spend less effort and resources on customer support is just icing on the cake.

In context of Web API design, as a consequence of this principle, you should have simple and easy-to-understand API endpoints that do only one thing.

A few questions to help you start thinking about the user:

  • Does a user need to get an overview of the system before using the API? If so, how are they going to get that context?
  • How easy is it for a user to use new functionality, given that they have already used a significant portion of the API? Are you ever referring to the same concept in an inconsistent manner?

Consistency will help you a lot when tackling API usability issues. Make sure that you are referring to the same concept always in the same way across all of the API. This should apply to naming endpoints, naming query parameters, using the same HTTP methods consistently, reusing any additional abstractions you may have introduced, etc.

If possible, there should be only one way to do an action in your API. For any given action, it should be obvious to the user which endpoint to use. They should never need to relearn how to do something once they know how to do it in the context of your API.

Providing multiple ways to do the same thing can lead to a paralysis of choice as well.

Note that most users just want to solve their tasks at hand and won’t take the time to admire your API, no matter how beautiful it is. They are unlikely to carefully evaluate different options and use the best possible approach. They will instead do the first thing they can make work. Once it works they won’t care anymore. The option they choose will often be suboptimal; sometimes you’ll even be surprised by their “creativity”. It is generally better to prevent this situation from happening by having only one way to do things in the API.

To summarize, in API design you should ensure that correct actions are easy to do while wrong actions are unlikely to be done accidentally.

Single Responsibility Principle

The Single Responsibility Principle always applies to software design and API design is not an exception. Unfortunately, it’s also the principle that is the hardest to apply because it’s so overarching and abstract.

One consequence of this principle is that you should think carefully about enriching existing endpoints with additional functionality, e.g. by adding new query parameters. Make sure that the endpoint still retains its original semantics. An example of the question to ask in this situation: Are the semantics of the result changing based on the parameters used in different requests?

As an example, filtering out certain results returned by an endpoint based on the query parameters used is probably ok, while changing the type or schema of the returned object is not.
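A small sketch of the distinction, with illustrative field names: the optional filter narrows the result set, but every response keeps the same shape, so clients can parse all of them with the same code:

```python
def list_catalog_items(items, status=None):
    """List items, optionally filtered by status.

    The filter changes which items are returned, never the schema of
    the response itself.
    """
    if status is not None:
        items = [item for item in items if item["status"] == status]
    return {"items": items, "count": len(items)}
```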

In general, you should think about responsibilities of individual API endpoints and how they fit together. On a higher level, think how all the different APIs you might have fit together as well.

Design API for Well-Behaved Clients

You can’t make every wrong action impossible, merely near-impossible to do accidentally. You can’t prevent users from shooting themselves in the foot if they choose to do so. So, don’t design the API around wrong user behavior. Focus instead on the majority of users who are trying to do the right thing and make it easier for them to do so.

This in no way means that you should not care about security – you really should! It’s your responsibility to prevent any single user from interfering with other users and with the system itself. That is a given. But don’t set up unnecessary hurdles that well-behaved users need to overcome when performing regular actions, only to make wrong behavior harder. That would make the API less usable while a fraction of users would still do wrong things.

Be particularly mindful of doing security by obscurity. Aside from being a bad security practice, it often has a negative impact on the usability of the API.

Design your APIs to guide well-intended clients to correct usage-patterns and behaviors.

Enable API Evolution

If a project is actively being worked on, i.e. new features are being added over time, the API will need changes as well. There is no going around it – APIs evolve together with the services they belong to.

The question to ask yourself is not how to prevent APIs from changing, as you cannot, but how to constantly be evolving the API without breaking any of the existing clients.

Let’s separate users into two broad categories:

  1. Well-behaving clients – those that follow the API specification, respect HTTP protocol intrinsics, and implement the best practices from the documentation.
  2. Misbehaving clients – those that somehow depend on your current API but do not follow the specification, documentation or best practices.

In a perfect world, all your users are in the well-behaving category. Unfortunately, that does not reflect reality, where if your product is successful at all, you’ll have plenty of users in the 2nd category.

You almost certainly don’t have the luxury to just ignore misbehaving users – they can be important customers, can generate negative publicity, can be taught to behave better toward your API, and are probably well-intended, nice people overall.

So, you are trying to evolve the API in such a way as to never break any client, including those in the misbehaving category.

What follows in this chapter are some techniques that let you constantly evolve the product, and with it the API, without negatively impacting users.

Evolve API Scope

We have already established that your API will be changing over time as the product evolves to meet customer needs. The risk of needing a backward-incompatible change is bigger the less you know about your technical and business domains. I recommend you explore those domains as much as is feasible, as early as you can. I won’t go over the ways to do it, except to say that you should first tackle the areas you know the least about and that pose the most serious risk.

Try to always have a clear understanding of what the concepts in your domains are, which forces are influencing them, and how you are modeling it all. It’s better to have this knowledge written down somewhere, in addition to the thoughts in your head. Don’t forget to always update the models as you learn more.

To make this more concrete, let’s continue with our catalog example. Let’s say that catalogs are versioned and that there is an API endpoint that returns the changes in data that happened between two given versions of a catalog. When this endpoint was first introduced, a developer thought that users would most often ask for the changes between a version they have already used and the newest version. To make it easier on users, the endpoint required a user to always pass a startVersion parameter, but the endVersion parameter was made optional, defaulting to the newest version available. This sounds like a reasonable design. But it turned out later that it’s not. The consistency of catalog data is of utmost importance and users of catalogs rely on this fact. What can happen? A user can request subsets of data from this endpoint in two consecutive calls, but a version can be published between them. So, one call would get data up to version 42, while the next call would return data up to version 43. Mixing data from those two calls would produce corrupt results. In the end, it turns out that the endVersion parameter can’t be optional and inferred on the server side. The consequences of this design error are so serious that there is no way around making backward-incompatible changes to the API.
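The fixed design can be sketched as a handler that refuses to infer the range. The data layout below, a map from version number to the changes published in that version, is a stand-in for the real catalog storage:

```python
def catalog_changes(published, start_version, end_version):
    """Return the changes between two explicitly pinned catalog versions.

    end_version is deliberately mandatory: if the server inferred
    "newest", two consecutive calls could straddle a new publication
    and return mutually inconsistent data.
    """
    if start_version not in published or end_version not in published:
        raise ValueError("unknown catalog version")
    changes = []
    for version in range(start_version + 1, end_version + 1):
        changes.extend(published[version])
    return changes
```

Because the client pins both ends of the range, mixing results from two calls with the same pinned range is safe by construction.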

A concrete lesson from the example above is to make mandatory all the information that is critical for decision-making, instead of trying to infer it.

This example also highlights how delicate API design really is. Be sure to explore and understand your domain as much as possible before trying to design a stable API.

Be conservative about exposing too powerful and too flexible endpoints. Or said differently: minimize the surface you are exposing and make sure it’s semantically sound.

I recommend you start conservatively and later relax the constraints as you learn more about the domain and discover what users actually need. Relaxing existing constraints is feasible; imposing new ones where they were not in place is not!

On Introducing Abstractions

If you are working with a response format that is not too expressive, consider introducing additional abstractions.

As an example, JSON does not have a typed map as a built-in data type – a map that always has the same type for all the values. Introducing this concept in a JSON API can make sense to give your clients a chance to benefit from new features that will be added in the future. Let’s say that you are returning descriptions of the items for sale. There are going to be a few standard fields you will have for each item. But there could also be a varying number of tags you may want to return together with an item. An individual tag may have the following information: name, value, detailed description and a link. You want your users to display those tags in a generic way. By doing so, they will automatically benefit when a new tag is added to the system. Therefore, you provide back a typed map of tags, where you guarantee that all the elements in the map will be of the same type – in this case, the tag type.
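A sketch of what the abstraction buys you, with illustrative tag names and fields: because every value in the map is guaranteed to be a tag, a client can validate and render the whole map generically:

```python
TAG_FIELDS = {"name", "value", "description", "link"}

def is_typed_tag_map(candidate):
    """Check the custom 'typed map' invariant: a JSON object whose values
    all share one shape (here, a tag). JSON itself gives no such
    guarantee; the guarantee is part of our API contract."""
    return isinstance(candidate, dict) and all(
        isinstance(tag, dict) and set(tag) == TAG_FIELDS
        for tag in candidate.values()
    )

# A response fragment using the abstraction (field names are illustrative):
item = {
    "id": "item-7",
    "name": "widget",
    "tags": {
        "origin": {"name": "origin", "value": "EU",
                   "description": "Region of origin", "link": "/docs/tags/origin"},
        "eco": {"name": "eco", "value": "A+",
                "description": "Eco rating", "link": "/docs/tags/eco"},
    },
}
```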

While adding an abstraction like the typed map might be very valuable, it comes with a set of tradeoffs. Having an abstraction that is not baked into the standard format you are using introduces non-trivial complexity that every user needs to deal with. It will take time and effort to educate each and every user about your new construct. You must document it prominently and in great detail. Make sure those abstractions are semantically strong and sound. It makes more sense to introduce such an abstraction if you will be reusing it heavily in multiple endpoints. And whatever you do, keep the number of special abstractions low. Otherwise, the exploding complexity will be too costly to warrant any potential gains in expressiveness and semantic strength. This is often the case even for the first potential abstraction you are considering.

Regardless of whether you are introducing new abstractions, be sure to use the format’s existing facilities properly. As an example, do not return comma-delimited strings when you can use an array. Also, do not return a numerical value as a string when you can use a number directly.

Enforce Validations

You should enforce all applicable validations from the beginning. Do not just put them in the API spec or documentation with the intent of implementing the enforcement later. By not enforcing some validations right away, you are setting yourself up for a much more painful situation in the future when you try to enable the enforcement. You will break some clients, and they won’t be happy with you, regardless of what your documentation says.

Add requirements liberally and be strict when validating user input. Just as it’s easy to make a mandatory field optional but not the other way around, removing or relaxing requirements is easy, while introducing new requirements or making existing ones stricter is hard. So it’s better to err on the side of having more validations. Be sure to validate all the data that business logic depends on for making decisions, i.e. all the data you are manipulating, not just the data you are serving to the user.

Let me explicitly mention validating a few specific kinds of requirements:

  • Validate any and all user inputs – check them both for type and for valid values. E.g. do not allow a user to pass you a string if you are expecting an integer, regardless of whether the string contains an integer or not. If you let it slide, eventually you are bound to receive an invalid string as well.
  • Have throttling enforced from the start. Your resources are not infinite and you should make users understand this. If you do this late in the game, you’ll break badly-behaved clients, in addition to having spent considerable resources in vain. Introducing throttling too late can potentially even break your business model if resource consumption is too high for the benefits provided.
  • If you are dealing with a lot of user data, always limit the amount of data an individual user can have in the system. E.g. if you store user backups, limit the maximal space this data can take; or if you are storing a continuously growing amount of data over time, have retirement policies and enforce them. An example of a simple retirement policy is the removal of data older than 2 weeks.
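A minimal sketch of the first point, strict type and value validation of a hypothetical limit query parameter:

```python
def parse_limit(raw):
    """Strictly validate a 'limit' query parameter.

    Reject the string "10" even though it contains an integer: accepting
    it once means clients will rely on it, and eventually you are bound
    to receive a string that is not a valid integer at all.
    """
    if not isinstance(raw, int) or isinstance(raw, bool):
        raise ValueError("limit must be an integer")
    if not 1 <= raw <= 100:
        raise ValueError("limit must be between 1 and 100")
    return raw
```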

Note that you don’t need to implement all the implications of the validations right away – only the consequences observable by users. If you are removing too-old data, you don’t need the full cleanup system in place; you only need to make sure users can’t access data that should have been cleaned up. It is, of course, better to have the full setup in place from the start, but you likely need to consider available resources and competing priorities.

Have Chaos Monkeys

Until now we have focused on the general desirability of enforcing validations. The Chaos Monkey is a particular tool I find invaluable in preventing users from accidentally depending on the API in a way that blocks future API evolution.

The original term comes from Netflix’s approach to testing infrastructure where an evil chaos monkey makes random parts of the infrastructure fail. The purpose is to learn how to deal with infrastructure failures on a regular basis so that they become business as usual by the time a real failure occurs. As it inevitably will.

I’m using the term chaos monkey only loosely here to mean a piece of software that non-deterministically causes unusual actions to ensure they would be dealt with appropriately if they were to happen in the future. Or said differently: it’s a system that trains a target group how to deal with problematic situations. In this section, I’ll focus on chaos monkeys that train users how to use the API without preventing its future evolution.

Chaos Monkey for Responses

If your API is responding with JSON, having a chaos monkey for responses is a must! The idea is simple – modify some of the responses in a way that will support future API evolution, in this case by allowing changes to response format.

Though JSON APIs are the most common, note that this applies not just to JSON responses, but generally in any case of an API returning a structured format that will be parsed on the user side. Examples of other formats are XML and YAML.

The goal is to teach users that:

  1. New fields can be added at any point in time.
  2. All optional fields are, in fact, optional.

A common outcome of adding a new feature to a JSON API is the introduction of new fields in an existing response. Have a chaos monkey that generates new fields in the response for a fraction of requests. Generated fields should have unpredictable names and value structures. You want to generate new number, string, array and object fields – or, if you’re not using JSON, fields of any other data type supported by the format. Generated fields should appear at the top level as well as nested deeper in the response. Generated object substructures should vary in depth and complexity.

If you have introduced additional abstractions that are not native to the format you are using, be careful not to break them with too liberal field generation. E.g. if you have introduced the aforementioned typed map abstraction to a JSON response for certain fields, be sure not to generate new fields directly in the map itself, as that would break the abstraction. Or if you are returning a percentage to represent a fractional value, make sure any generated value is indeed valid.

The second type of action the response chaos monkey should perform is making certain that users are aware that some fields are optional. Maybe your current dataset has values for all the optional fields, so users currently never see them missing. If you do not make them aware that a field can indeed be missing, they could start depending on the existence of that field. Have the chaos monkey randomly remove some optional fields for a fraction of responses.
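Both behaviors can be sketched in a few lines. This is a minimal illustration assuming plain Python dictionaries as JSON responses; the function and parameter names are made up, and a real implementation would also inject fields into nested substructures, not only at the top level:

```python
import random
import string

# Sketch of a response chaos monkey (all names are illustrative).
# With probability inject_p it adds a field with an unpredictable name and
# value; with probability drop_p per field it removes optional fields.
def chaos_monkey(response: dict, optional_fields, inject_p=0.1, drop_p=0.1) -> dict:
    result = dict(response)  # never mutate the original response
    if random.random() < inject_p:
        name = "x" + "".join(random.choices(string.ascii_lowercase, k=8))
        value = random.choice([42, "surprise", [1, 2], {"nested": True}])
        result[name] = value
    for field in optional_fields:
        if field in result and random.random() < drop_p:
            del result[field]
    return result

payload = {"id": 7, "name": "foobar", "description": "optional text"}
mutated = chaos_monkey(payload, optional_fields=["description"])
```

Note that required fields are never touched – only generated fields appear and only declared-optional fields disappear, so well-behaved clients keep working.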

How do you go about enabling the chaos monkey? You want to make sure that all your users are exposed to the consequences of its interactions. What this means concretely depends on your setup. For example, if all new users use an integration environment before they are granted access to the production system, have the chaos monkey enabled there. Have it interact with maybe 10% of the responses. You want this percentage to be configurable, and the same is true for the likelihood of individual interactions inside a single response. You should also enable it in the production system, though probably with more conservative probabilities.

Enabling the chaos monkey in production can feel a bit scary, as it can impact customers in their business. But be aware that it’s much better to have that pain on a smaller scale and with controlled timing, rather than postponing it until you are forced to make a change with a more disruptive impact, at a point in time you can’t control.

Chaos Monkey for Redirects

The purpose of the second chaos monkey I want to mention is to ensure that users can react properly to different HTTP protocol events.

The goal here is to make sure that users can deal with redirects. Users respecting redirects is critically important for allowing you to change the system’s architecture and enable optimizations in the future. As an example, maybe you are serving some resources directly now, but in the future you’ll serve them from a CDN. Or maybe you’ll change your system so that different endpoints hit different services. For that to be possible, users need to respect redirects.

How would this chaos monkey work? Have your system redirect a fraction of requests back to itself via the system’s DNS name.

What kinds of requests should you redirect, and how? That depends on the endpoints you have and the methods used to access them. Generally:

  • Redirect GET / HEAD requests via 307 (Temporary Redirect)
  • Redirect some POST / PUT / DELETE requests via 307 (Temporary Redirect). Don’t forget to think about idempotency of the requests.
  • If you have the same functionality exposed as a GET endpoint, redirect some POST / PUT / DELETE requests to it via 303 (See Other)
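These rules can be sketched as a tiny request filter. The helper below is framework-agnostic pseudostructure rather than real middleware for any particular library, and the public DNS name is an assumption:

```python
import random

# Sketch of a redirect chaos monkey. A fraction of safe (GET/HEAD) requests
# is redirected back to the system via its public DNS name using
# 307 (Temporary Redirect). The name below is hypothetical.
PUBLIC_NAME = "https://api.catalogs.example"

def maybe_redirect(method: str, path: str, redirect_p: float = 0.05):
    """Return (status, headers) for a redirect, or None to proceed normally."""
    if method in ("GET", "HEAD") and random.random() < redirect_p:
        return 307, {"Location": PUBLIC_NAME + path}
    return None
```

Extending the sketch to POST/PUT/DELETE via 307, or to 303 (See Other) where an equivalent GET endpoint exists, follows the same shape – but, as noted above, think about idempotency before redirecting non-safe methods.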

You should expand this chaos monkey to include any other HTTP event your users need to be able to react to. In addition to 3xx (Redirection), the 4xx (Client Error) group of responses is of particular interest, as those are the responses where the user is doing something wrong and needs to change their behavior in reaction to the response received.

To summarize this section – make sure your users can deal with any potential HTTP protocol events that could happen in your system. Whenever you are introducing a new feature, think about what unexpected events could happen as a consequence, and augment the chaos monkey to support them. For example, if you are introducing authorization to your system, inspect all possible 4xx HTTP responses that could be applicable. As usual with validations, it’s better to err on the side of caution and have more validations rather than fewer.

Chaos Monkey for Failures

The final flavor of chaos monkey I want to present is the chaos monkey for failures.

Again, the goal here is simple – teach users that communication over networks is, by definition, unreliable, that sometimes a request will fail, and that they should retry failed requests. This will make your clients’ implementations more resilient and will save you a lot of headache in the future. It’s no fun having to explain to a client that networks can indeed fail, that requests get lost or delayed, and that it’s outside of your control. You don’t want to have conversations about where the failure is happening and who’s responsible – is it in their system, in the network between you two, or on your side? If they have a particularly brittle implementation, they could experience a serious failure or even an outage from just a request or two failing. And guess who’s going to be blamed? Another thing to point out is that debugging reported network failures is particularly cumbersome. In some cases, you won’t have any trace of the failure happening – no service or access logs, nothing to go by. Having this situation arise less often is always a win.

A final remark on this chaos monkey – there could be cases where you want to make sure it’s not activated, e.g. for your system’s health check requests. You could exclude a certain category of requests from chaos monkey influence by having a dedicated query parameter, by recognizing a request as authorized to skip chaos monkey interactions, etc.
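Putting those points together, a failure chaos monkey might look like the following sketch – the exemption header name and the health check path are purely illustrative assumptions:

```python
import random

# Sketch of a failure chaos monkey. Exempt requests (e.g. health checks, or
# requests carrying a hypothetical exemption header) are never touched;
# otherwise a small fraction fails with 503 so clients learn to retry.
EXEMPT_HEADER = "X-Chaos-Exempt"  # illustrative header name

def maybe_fail(path: str, headers: dict, failure_p: float = 0.01):
    """Return an error status to send, or None to handle the request normally."""
    if path == "/health" or EXEMPT_HEADER in headers:
        return None
    if random.random() < failure_p:
        return 503  # Service Unavailable: clients should retry with backoff
    return None
```

A 503 with a Retry-After header is one reasonable choice here; the key is that failures happen often enough that clients cannot ignore them.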

Use DNS for Flexibility

Another tool to consider using is DNS. The Domain Name System has surprisingly versatile applications that are generally underutilized. Those include an easy and cheap load-balancing mechanism, architecture flexibility (e.g. providing cluster-level names in multi-cluster, highly-available systems), and, our focus here, flexibility that enables API evolution.

Let’s start with the most basic premise – never, in any form, expose individual IPs or internal DNS names to users. In doing so, you are tying yourself to current implementation details or accidental consequences of the current setup. Let’s say there is a mobile application using your API. If it ever references your IP or an internal DNS name (e.g. an AWS ELB name) directly, you can’t change them without breaking that client application. So, always use a dedicated external DNS name, fully under your control, to refer to the system.

But DNS can do much more in terms of providing you with flexibility for the future – you can have a dedicated name for any client or individual resource.

Let’s revisit our catalogs example. Let’s say there are thousands of catalogs in the system. Instead of allowing direct access to the system with the catalog name or id passed to the desired endpoints, allow access only via a DNS name with a subdomain dedicated to an individual catalog. If your system name is catalogs.com and there is a catalog named foobar, allow access to that catalog only via the foobar.catalogs.com subdomain. This way a user does not need to specify the catalog name every time, but, more importantly, you can point their DNS name at whatever target you may need, at any point in the future.

Having dedicated DNS names per resource, or per user, or even per combination of a resource and a user, enables you to move resources and users between different infrastructure resources, different user environments, and even different system versions. The cost of this incredible power and flexibility boils down to only a bit of additional complexity. There is no good excuse to not have implemented this before your first real user, even if you have to configure the DNS names manually at that point.

There is one caveat to this approach though – you must never share intermediate DNS names, like catalogs.com, with any user!

It is critically important to expose to your customers only dedicated DNS names that provide indirection capability. This includes documentation, examples, and all client communication. As soon as you expose or leak intermediate DNS names, the value of this approach decreases – you now have less confidence that nobody is using your, supposedly private, names directly. Always use dedicated names internally, so you can be sure you are not leaking private DNS names. In this case, you really should eat your own dog food!

If you plan on using resource names in DNS from the start, keep in mind that only some characters are valid in DNS names. Either have resource names that allow only those characters or think of an appropriate mapping mechanism between resource and subdomain names.
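As an illustration of such a mapping mechanism, here is a small sketch that normalizes an arbitrary resource name into a valid DNS label (letters, digits and hyphens, at most 63 characters). It is only a starting point – a production version would also have to handle collisions where different resource names map to the same label:

```python
import re

# Sketch: map an arbitrary resource name onto a valid DNS label.
# DNS labels allow letters, digits and hyphens, must not start or end
# with a hyphen, and are limited to 63 characters.
def to_dns_label(resource_name: str) -> str:
    label = resource_name.lower()
    label = re.sub(r"[^a-z0-9-]+", "-", label)  # replace invalid runs with '-'
    label = label.strip("-")[:63].rstrip("-")   # trim hyphens, enforce length
    if not label:
        raise ValueError("resource name yields an empty DNS label")
    return label

to_dns_label("Foo Bar_2")  # "foo-bar-2"
```

If you instead restrict resource names to DNS-safe characters from the start, the mapping becomes the identity function and the collision problem disappears.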

I’ll point out once more a particular consequence of having correctly adopted this mechanism – you may move a customer to a cluster with a different system version. This means that if you have a customer that is too tied to an older API version, you may separate them out from all the other users that will use the new API. This is a last-resort approach that shouldn’t be taken lightly, but it allows you to keep evolving the product for all the other users that can adapt, while fulfilling your contractual obligations, potentially spanning years, to a problematic customer.

To conclude, it always makes sense to refer to your services with an additional level of DNS indirection. Make sure you can utilize this power when you need it!

On Documentation

Let’s talk a bit about the documentation accompanying your API.

I’ve already said that you should expose only minimal useful information to users. This applies to documentation as well. Do not expose internal system details or anything that is of no interest to clients.

You might think “What is the problem with more documentation? More is better, right?”. And you’d be wrong. While it’s critically important that you have great documentation, more often does not mean better.

It is hard to keep documentation up-to-date and make sure it reflects the truth about the system. The only thing worse than not having documented something important is having documented it when it’s no longer true. Inaccurate, or worse, plain wrong documentation can really frustrate and infuriate users. Keeping documentation up-to-date is not a trivial task at all, and you should put effort into tools and processes that make working with documentation easier.

Documenting everything possible without clear structure is also bad from the user perspective – it makes it hard to discern what is important and what is less so. Like I said before, most users won’t read all your documentation. So, you better make sure the parts they are more likely to read are delivering the most important information.

Let’s pretend you could document everything possible and have it nicely structured at the same time. Would that make sense? Well, the answer is not clear-cut; there would still be a downside – it would feel overwhelming. Imagine you are buying a fancy pen and realize it comes with a 500-page manual. What would you think? It’s worse with APIs, as users have less confidence when judging the complexity of the associated domains and underlying systems. Not to mention the wastefulness of it all.

So, what should you do? Have documentation clearly structured. Also, apply the DRY principle. As in software design, it’s hard to keep the information up-to-date if you are writing the same thing over and over again. Repetition also clutters the documentation for the users who are already aware of a concept. The solution to this problem is the same as in software design – use references to existing explanations of concepts, instead of repeating full explanation every single time. Personally, I find it interesting that design principles are again relevant and that you can think about designing documentation.

Speaking of references, never use forward references. They are confusing, irritating and distract the user from the current topic at hand. It’s also just bad manners to use them.

What should the documentation contain? A clear API specification, definitions of domain terms, a getting-started guide, and examples demonstrating common use-cases, at a minimum. Try to really think about the documentation from a design perspective, instead of just randomly adding new descriptions wherever they seem to fit – but where nobody can find them later.

As mentioned before, be careful to not use code-names or any unnecessary non-standard abbreviations. Be careful what you put into documentation. A trivial example – make sure that whenever you are mentioning a way to contact the support, it’s done in a consistent way. Do not use internal, or worse individual emails. Also, make sure that all examples are clear and self-contained. E.g. don’t allow the use of developer’s internal usernames as part of any resource names in the examples.

You should also think about Separation of Concerns and how it applies to the documentation. One area is separating the documentation for different kinds of uses and audiences.

Let’s return to our catalog example. As mentioned before, catalogs are collections of data. Let’s assume that a fraction of users publish new data, while the majority of users only read published data. In this case, there are two different target audiences for your system – publishers and consumers. Make sure you separate the documentation for those two. Consumers do not care about the details of data publishing; they only want to know when and how to get the data they need. So, don’t expose the publishing APIs to consumers. It would just unnecessarily confuse them and detract from their main objective. Do make this information available to publishers, as they need it. To make it more interesting, notice that publishers can be consumers as well, e.g. they can be enriching data published by someone else.

Lastly, I want to point out a particular benefit of writing documentation from the start: it will help you to have a better API design as well. By needing to document the API in a way that’s accessible to users, you’ll notice concepts that are surprising and harder to explain, before you put them out in the wild. Needing to explain or teach something also makes us understand the topic better, so you are likely to come up with more elegant designs while you’re working on the documentation.

Have Playground

One thing you should consider is having a playground for your API. What do I mean by that?

A playground is an environment where users can experiment with different API calls. They can explore and visualize payloads, see and modify examples that use the API, and can get the feel for the API itself. So, a playground is in a way a web client for the API, that users can play with for demonstration and exploration purposes. At the same time, it is also a form of live API documentation.

Aside from being a great education tool for users, having a playground has the benefit of allowing you to eat your own dog food. As you evolve the API you will be evolving the playground as well. That way you are using the API and noticing any usability problems that may exist. This is quite a big advantage – many teams just expose APIs without using them at all, and thus at best produce subpar API designs.

How do you manage a playground? Have it as a part of the system and deploy it together with the API. That way the playground will be able to directly use the underlying system and you’ll keep it in sync with the API.

By now you must be wondering what such an API playground would look like. You can find a great example here.

Collect Minimal User Information

You should make sure to know a bit about your users. At the very minimum, you must be able to identify different users and know how to contact them. Hopefully you’ll never need to do so, but better to have this information if you ever need to coordinate a change across your customers.

In addition to contact information, you should know:

  • Is a user internal or external customer of yours? To which groups of users do they belong? In the catalog example, are they a publisher or a consumer?
  • What are they using you for? What is their product?
  • Don’t forget any other mandatory information specific to your system.

You may have multiple environments, e.g. nightly preview environment, customer integration environment, different production environments, etc. In that case, make sure you can correlate the same user across all those environments.

It goes without saying that you should be able to correlate any invocation of the API with the user that made it.

Make sure your customer onboarding process collects needed information. If you are allowing resource creation via the API itself, require this information in relevant API endpoints.

Have Life Cycle Management

Consider having a Life Cycle Management setup early on. You could start with a simple and lightweight process that is progressively adapted as the context changes. If the product succeeds, you will need Life Cycle Management anyway, at the latest when you have contracts that include SLAs.

Life Cycle Management is too big a topic to adequately tackle at the same time as the API design, so I’ll just point out two topics that are relevant from the start in the context of the API design.

On Versioning

There are three different aspects to versioning a service:

  • System version is the implementation version of artifacts comprising the service when they are deployed. Service users generally do not care about this at all. It matters to you for various technical reasons, hopefully grouped mostly under the Continuous Delivery umbrella.
  • API version is exclusively concerned with backward incompatible changes. You increase the version when you are breaking backward compatibility. For versioning schema, it is enough to have only a single (major) version number.
  • Business version is dealing with the value provided to customer, packaging of features, pricing, contract details, etc.

Those three versions refer to different concepts and you must not combine them – they evolve at different pace and for different reasons. Combining them would indeed be perilous, as it would only limit flexibility without any significant gains.

Often those different versioning aspects do not even have one-to-one relationships. It is common for the same system (with its one implementation version) to run multiple or even all of your API versions. Anything else is likely to be wasteful of infrastructure resources and harder to manage.

Note that versioning of services is thus quite different from versioning of shippable software products, like libraries or applications. When you are versioning shippable software, you generally care only about getting it to the client. Once they have the software, it is their responsibility to do something with it. You are not running that software for them, as is the case with services. With shippable software, you can mostly just adopt classic semantic versioning. As for the business version, you could express the difference even in the name, e.g. by using a “community edition” label.

A word about adding new features. Features and their packaging are in the realm of business versioning, not API versioning. Nonetheless, it is useful to tackle this topic, particularly as it can have consequences on the technical context. While new features do not change the API version, it is useful to think about the feature maturity. You may want to have new features go through various (business) maturity stages, e.g. feature being in beta, being released to a certain subset of customers, or being generally available.

Define Compatibility Concepts

As we saw in the previous section, API versioning is concerned with compatibility – and indeed compatibility is a crucial topic we have been exploring so far.

Therefore, it should come as no surprise that in the context of Life Cycle Management you should care about having strong definitions of what compatibility means – what constitutes a breaking change. In fact, this is the most important aspect of any Life Cycle Management effort.

You must define both aspects of compatibility:

  • Backwards compatibility is concerned with service provider’s obligations, i.e. your obligations, guaranteeing that an existing user’s system will not be broken by a change in your system.
  • Forwards compatibility is concerned with licensee’s (user’s) responsibilities. The goal here is to allow an existing user’s system to gracefully integrate while your system is evolving.

You should also take care that your Life Cycle Management explicitly allows for the following two categories of changes:

  • Changes caused by the legal environment
  • Critical security updates

Don’t forget to think about quality and coverage of the data commitments in case this is relevant in your context.

Be Aware of Distributed Computing Challenges

Designing and implementing distributed systems is a hard problem. And that is quite an understatement. As I can’t hope to cover the topic in any depth, this section is intended to raise awareness of the complexity of the challenge.

Let me open the topic by quoting some famous writings on the topic.

First, we have Peter Deutsch’s Eight Fallacies of Distributed Computing:

Essentially everyone, when they first build a distributed application, makes the following eight assumptions. All prove to be false in the long run and all cause big trouble and painful learning experiences.

  1. The network is reliable
  2. Latency is zero
  3. Bandwidth is infinite
  4. The network is secure
  5. Topology doesn’t change
  6. There is one administrator
  7. Transport cost is zero
  8. The network is homogeneous

Then there is the CAP theorem, which states that it is impossible for a distributed system to simultaneously provide all three of the following guarantees: Consistency, Availability and Partition Tolerance. You can choose at most two, and even that is not easy at all.

You should also read The Twelve Networking Truths. An April Fools’ joke in the form of a tiny RFC? Definitely worth reading.

Reasons for Complexity of Distributed Systems

The trouble with distributed system and architecture design is that there are a lot of hidden complexities. If you haven’t worked on such a system before, your intuition will often work against you. You’re likely to assume that things are much simpler than they really are.

Let’s scratch the surface of a particular topic in this domain – read-after-write consistency. The topic appears almost trivial – you write some data, the operation succeeds, and the data is available the next instant. Except that this guarantee is extremely hard to provide in a sizable system under high load. Can a network partition occur? What if a data center goes down? Does your chosen storage system even provide that guarantee? Maybe you are using AWS as your cloud provider? Surely S3 provides such a guarantee? Well no, in fact it does not. What S3 does guarantee is only read-after-create consistency. If you directly access an object after it’s created, it should be consistent and visible. On the other hand, if you try to list that object after creation, it might not be visible. And if you are updating an existing object, there is no such guarantee. It gets worse: what S3 actually provides is not read-after-create consistency, but read-after-create-if-you-haven’t-tried-reading-before. So, if you’ve tried to access an object in S3, but it didn’t exist at that point, and you then create it, it might still not be visible. Why? It turns out that S3 internally uses a negative cache, caching 404s, so even once an object is created, you might hit that cache, which states that the object in question does not exist.

As I’ve already mentioned AWS, and I do so only because it’s the cloud infrastructure provider I’m most familiar with, let me point out one more thing: basically none of the storage systems in AWS provide strong guarantees across multiple regions that span the globe. Even if you are using Amazon-hosted services, once you need to be in multiple regions, you’ll need to build on top of primitives that work only in one region. And if your product is successful, you’ll most likely need to do so. Do clients want 99.9% availability of your system? Too bad, you can’t even get 99.0% in a single AWS region.

Let’s explore a particularly common misconception: if you have a failure, you’ll fail over the system to another data center or region and be done with it. Actually, you can’t do that – there is simply no strategy that will allow you to do a full failover in 100% of situations. Why? You might not be able to connect to a server and kill it. You might think that’s ok – it means the server is not available anyway. Well, what if it’s reachable from a customer’s network, but not from yours? Or what if it’s not reachable at all for a while, but then, once it becomes reachable, it starts serving a customer with outdated data? There are countless possible situations that break your assumption of having only total failures. Partial failures are the norm. The vast majority of cloud infrastructure failures are indeed partial – a useful fact to keep in mind. But it won’t happen to you? Yes, it most likely will. The impact will depend on the architecture you have and how important high availability, durability, consistency, etc. are in your business domain.

Designing systems on top of any form of distributed infrastructure is hard. You might have a lot of assumptions that in reality do not hold 100% of the time. Inevitably, you will hit Joel Spolsky’s Law of Leaky Abstractions, just like in the S3 example above. If you haven’t read this classic essay, you really should check it out.

For additional insights on the topic of distributed systems, take a look at Jepsen tests. Jepsen puts distributed systems through a series of complex interactions, emulating situations that can and do happen in the real world. A good starting point is The network is reliable post, which lists a dozen outages that happened in the real world and had serious consequences. Next, read a Jepsen report about the storage system you are using. It will likely be a frightening but enlightening experience.

What does all of this mean in terms of API design? Be aware of the complexity that comes with distributed systems. Don’t commit to things that are too hard or even impossible to do. Before you commit to any kind of SLAs, review your system and the API to see if it’s feasible to do so. An unfortunate consequence of that review might be a need for a new API version, and that’s ok, much better than making commitments you have no way of upholding. Hopefully you won’t need to go there if you’re following guidelines presented in this book, like requiring redirect support or having all data that’s necessary to make a decision in the request itself.

Miscellaneous Tips and Tricks

What follows in this chapter are miscellaneous useful techniques, tips, or tricks, that are related to the API design and maintenance.

Support Secure Connections

I won’t go on a rant about why you should enable secure communication. If your API uses HTTP as a communication protocol, you should support HTTPS. In fact, you can’t avoid it – customers will request that you support HTTPS. But I’m proposing to go further and have secure connections be the only way to access the system.

A few years ago, the overhead associated with HTTPS had a bit of a stigma in the programming community, but even that is gone nowadays. No matter who your customers are, if they can use HTTP, they can also use HTTPS.

Using HTTPS as the only way to access the system from the start will not in any way slow you down, and instead it will ensure you don’t have problems trying to enable it in the future. In fact, as you are not supporting HTTP, but only HTTPS, you will remove a tiny bit of complexity from the system as well.

This is the one easy step to make your system more secure from the start. Having only HTTPS will communicate to your clients that you do care about security. But security is not there for show – you really need to take care that your system is secure. Otherwise you could acquire a negative reputation, lose clients, face legal issues or even fail as a business. And the first major step to reduce this risk is so easy to do!

Polling for Status

The majority of interactions your clients have with the API are likely to be blocking: a client issues a call and blocks on that call until it returns or a timeout is reached. It’s easier for a client to correctly implement logic around blocking API interactions than it would be if they needed to deal with non-blocking operations.

If the blocking approach is preferred in most situations, when would you want non-blocking behavior? There are two major questions to consider:

  1. How long does it take you to produce a result?
  2. Are connections to the API likely to be flaky?

If you need significant time to produce a result, say 5 minutes, you almost certainly do not want the call to be blocking. A client should not be forced to waste time waiting for you to finish. The more time a call takes, the more likely it is to fail at the network level. And the likelihood of failure increases the more unreliable the network between you and the client is.

Be aware that even if the network is relatively reliable, you’ll still have issues if the time to calculate a response is too long – you are likely to hit timeouts at some level of the infrastructure, e.g. in the client’s system, at a proxy, at a load balancer, or in your server-side implementation. It is sometimes possible to mitigate those issues by tweaking keep-alive, idle-connection and other settings, but not always. There is often not much you can do if a component between a user and your system is not configured to support long-lasting connections. Note that the tradeoff involved is not trivial either – if you configure the system to support long-lasting connections, detection of genuine errors will be delayed, often for all interactions, not just the long-lasting ones.

How do you go about having a non-blocking operation in the API? By implementing a polling mechanism. Instead of one blocking endpoint, you will have at least two endpoints, each still individually blocking, that will together provide a non-blocking operation. One endpoint will be used to start the operation. The other, polling, endpoint will be used to check if the operation has finished. You are connecting those two endpoints to a logical single operation by using some kind of a token – this token is returned to the user when the operation is first initiated, and is used when polling for the operation status.

There are quite a few nuances to consider when implementing a non-blocking, polling operation:

  • If a user does not care about the resulting payload, but only that the operation has succeeded, the aforementioned two endpoints are enough. This makes particular sense if the user has logically issued a command, as in the Command–query separation principle.
  • If a user needs to get some kind of payload as a result of the operation, it can be provided as a result of the polling call or there could be a third endpoint to fetch the result.
  • If the operation’s result is likely to be needed again you should consider implementing a caching mechanism.
  • What constitutes a token? Is it just a UUID? Is it a small JSON object with a few fields? Should the user know those fields and values, or should you encode them? Where do you store tokens internally in your system?
  • Things become a bit more difficult if the system is distributed, as you now most likely need to account for the possibility of the starting call and the polling calls hitting different servers or even clusters.
  • How do you educate clients? The logic on their side is also going to be more complex than in the case of blocking operations.
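To make the two-endpoint mechanics concrete, here is a minimal in-memory sketch, with plain functions standing in for HTTP handlers and a queue standing in for the background worker. All names are illustrative, and in a distributed system the token-to-state mapping would have to live in shared storage, as noted above.

```python
import uuid
from collections import deque

_operations = {}   # token -> {"status": ..., "result": ..., "input": ...}
_queue = deque()   # tokens awaiting processing

def start_operation(payload):
    """First endpoint: accept the request and return a token right away."""
    token = str(uuid.uuid4())
    _operations[token] = {"status": "pending", "result": None, "input": payload}
    _queue.append(token)
    return {"token": token}

def poll_operation(token):
    """Second endpoint: report the status (and result, once available)."""
    op = _operations.get(token)
    if op is None:
        return {"status": "unknown"}
    return {"status": op["status"], "result": op["result"]}

def process_one():
    """Background worker step; stands in for the long-running computation."""
    token = _queue.popleft()
    op = _operations[token]
    op["status"] = "done"
    op["result"] = op["input"].upper()  # placeholder "work"
```

A client would call the start endpoint once, then poll periodically until the status changes from pending to done.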

As you can see, implementing non-blocking operations is more complex than regular blocking calls. Still, it’s sometimes the right thing to do.

You should be aware that there are sometimes other alternatives, like WebSocket or Chunked transfer encoding, though they also come with unique tradeoffs.

On the plus side, once you have implemented the first non-blocking operation, adding another one is easier. You will be able to reuse some of the implementation mechanisms and have an easier time explaining how it all works, while a client will be more familiar with the mechanics of non-blocking operations and can potentially also reuse implementation on their side.

Header Tricks

Let me mention a few things about headers.

You should have a response header with some form of system version information. This ensures that, when a client reports a problem, you can ask them to provide that information back and correlate their problem to a particular implementation state on your side.

In the simplest case, the system version information can be your internal system version. But you may not want to expose that directly. You have plenty of other options, including the git commit hash of the deployed system or a continuous delivery pipeline execution identifier.

Another simple use of response headers is a deprecation header – if a user is using deprecated functionality, you can let them know via a header that also contains a link to more information and required actions. This is in a way similar to libraries logging warnings when deprecated functionality is used – there is no guarantee that every user will see it, but it’s still helpful: even if only a few users react to this information, it will already be worth the trivial effort needed to implement it.

The final header trick I’ll mention is allowing headers to be expressed via query parameters. This is particularly useful for testing, but can also help users with limited HTTP libraries, e.g. if your API is used from an embedded system. If you decide to do this, always define what happens when both the query parameter and the header are specified in the same call, e.g. that the query parameter overrides the header when provided. Whether you expose this mechanism to clients or only use it internally is for you to decide – it has benefits, but it can also be confusing and you’d need to educate your clients.
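The precedence rule just described can be sketched in a few lines. The X-&lt;Name&gt; header convention here is an assumption for illustration, not a prescription:

```python
def resolve_option(name, headers, query_params):
    """Return the effective value for an option that may arrive either as
    a header or as a query parameter. The query parameter overrides the
    header when both are present (one possible policy; pick one and
    document it)."""
    if name in query_params:
        return query_params[name]
    return headers.get("X-" + name.capitalize())
```

Keeping this logic in a single shared helper ensures every endpoint resolves the conflict the same way.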

Correlating User Interactions

If individual actions of a user can’t be taken in isolation, you’ll want some kind of session. There are a few options to consider:

  1. Explicit session id with session replication. The user provides an id they’ve gotten from you in all subsequent API calls. You use that id to access data belonging to the session. You replicate both session ids and data across all of your servers or clusters.
  2. Session affinity without session replication. You are not replicating sessions, but are ensuring that a user always reaches the same server. Session affinity is also known as a sticky session – the session sticks to the same server. If that server is not available, the user ends up on a different server where there is no session data and needs to start a new session.
  3. Session affinity with session replication. All requests in the same session generally go to the same server. If that server is not available, the user gets to a different server, but as all the session data is replicated, you can still serve the request without forcing the user to start a new session.
  4. The user always provides all the session data. This is not feasible when your users are humans, but can work when the API is accessed in an automated manner. Instead of providing just a session id to a client, you provide them with all the session data.

So, what are the tradeoffs involved? When dealing with backend services you should generally avoid option (2), as it can lead to the loss of session data, forcing the client to start a new session. This causes more problems when clients are not real human users – humans would likely get somewhat frustrated, but could still react properly in the event the session is lost. When the API is accessed in an automated manner, there are bound to be some clients that won’t deal with the dropping of a session in a robust manner.

The benefit of sticky sessions is that you can keep session data in the memory of a particular server and thus access it quickly. But there is another problem with sticky sessions in general – they typically rely on cookies to be implemented. While basically all browsers deal with cookies without any user interaction, this is not the case with HTTP client libraries. Some support cookies but won’t use them without being explicitly configured to do so, while others have no cookie support at all.

If your clients do support cookies and the time required to access session data matters, you should go with option (3), where you can still recover session data from a shared storage accessible from any of the servers.

If sticky sessions are problematic, you should probably go with option (1). Even so, be aware that any approach relying on session replication can be problematic when your system is highly distributed. How do you replicate the session data? Do you even have storage that can be accessed from all your servers distributed across the globe? Does the session storage need to be highly available? Can you access it fast enough, or do you need to cache the data for faster access? Can you ensure the cache is up-to-date?

Option (4) avoids scaling problems because clients always provide all the necessary data. Putting the usability annoyance aside, you still face a few constraints. What if you can’t just create the session data once, but need to update it as the user performs more actions? You’d need to deliver updated session data on each API call. How do you deliver it? How do you ensure that the session data the user sends you is fresh and not from a few calls back?
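One way to address the tampering and freshness concerns of option (4) is to sign the session data together with a sequence number before handing it to the client. The sketch below uses HMAC; the field names and the way the expected sequence number is tracked server-side are assumptions for illustration:

```python
import base64
import hashlib
import hmac
import json

SECRET = b"server-side-secret"  # illustrative; keep real secrets out of code

def issue_session(data, sequence):
    """Serialize session data plus a sequence number and sign it, so the
    server can later verify both integrity and freshness."""
    payload = json.dumps({"seq": sequence, "data": data}, sort_keys=True).encode()
    sig = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return base64.urlsafe_b64encode(payload).decode() + "." + sig

def verify_session(token, expected_sequence):
    """Return the session data, or None for tampered or stale tokens."""
    encoded, sig = token.rsplit(".", 1)
    payload = base64.urlsafe_b64decode(encoded)
    expected_sig = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected_sig):
        return None  # signature mismatch: data was tampered with
    envelope = json.loads(payload)
    if envelope["seq"] != expected_sequence:
        return None  # stale: session data from a few calls back
    return envelope["data"]
```

The server still has to remember the expected sequence number per session, which is far less state than the full session data – that is the tradeoff this scheme buys you.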

Note that the need for session tracking can often be avoided if your API is an actual REST API that relies on hypermedia to keep the context of consecutive user actions. This approach is known as HATEOAS. The tradeoffs here are all about complexity: it is not easy to work with such an API, and educating clients and making sure they use the API in the right way is a challenge. Unfortunately, there is going to be some client hate associated with HATEOAS.

As you can see, even though I’ve mentioned cookies, there is unfortunately no free lunch.

Examples of Limitations

Let’s talk about a few concrete limitations and how to avoid them.

Query Parameters

The first one mostly concerns GET requests and URL length. When a user issues a GET request, they need to encode all of their data in query parameters. If there is a lot of data to be encoded, the URL can easily grow to thousands of characters. While there is no theoretical limit on URL length, in practice you should not depend on URLs having more than 2000 characters – some browsers and libraries can’t deal with anything longer.

Be aware of this limitation when designing an API, and provide an alternative way to perform any operation whose query parameters would require an overly long URL. Note that you can exceed 2000 characters even with only a handful of query parameters, if their values are long enough.

You can avoid this limitation by having a POST endpoint that encodes the same user-supplied data in a JSON payload instead of a GET endpoint that relies only on query parameters. Be aware that you are making tradeoffs though: while POST endpoints generally don’t have a problem with URL length, payloads sent to them are usually not logged anywhere, at least not unless you explicitly make an effort to log them. Another tradeoff is the semantics of the HTTP method – POST is more confusing and unexpected than GET when a user is merely querying the system for information.

Speaking of query parameters, do not repeat them. Yes, it is possible, and some systems can deal with the repetition, but it bloats the URL and makes it harder to understand, particularly when the repeats are intermingled with other query parameters. Instead, list all the values under the same parameter name. Say users are using query parameters to get only items that have desired labels. Instead of expressing the support for multiple labels by repeating the label query parameter, e.g. label=foo&label=bar, use a labels query parameter with all the values, in this case labels=foo,bar.
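Server-side, parsing such a comma-separated parameter is straightforward. A tolerant sketch that ignores stray whitespace and empty entries:

```python
def parse_multi_valued(query_params, name):
    """Split a comma-separated query parameter (labels=foo,bar) into a
    list of values, dropping empty entries and surrounding whitespace."""
    raw = query_params.get(name, "")
    return [value.strip() for value in raw.split(",") if value.strip()]
```

Whichever convention you choose, document it and apply it uniformly across all endpoints.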

Returning JSON Arrays

Consider not using JSON arrays directly as a resulting payload; embed them inside a JSON object instead. So instead of returning [a, b, c] directly, return { "values": [a, b, c] }.

Why? Many of your users will be surprised that a bare array, not embedded in an object, is even a valid JSON payload. But more importantly, returning a JSON array directly prevents you from ever adding anything else to the payload. The endpoint thereby carries a high risk of causing a breaking change in the future.

If you return an object with the array embedded, you can later add more top-level elements to the object. With a bare array, that is simply not possible.
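A tiny illustration of the point: with an envelope, a hypothetical "nextPage" field can be added later without breaking clients that only read "values". The field names here are illustrative.

```python
import json

# Version 1 of the payload, and a later version that grew a new field.
v1 = {"values": [1, 2, 3]}
v2 = {"values": [1, 2, 3], "nextPage": "/items?page=2"}

def read_values(payload):
    """A client that only looks at "values" works against both versions."""
    return json.loads(payload)["values"]
```

Had v1 been the bare array [1, 2, 3], introducing pagination would have forced a breaking change of the payload shape.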

Creating and Deleting Resources

When you are dealing with some kind of a unique resource in the system that can be created and deleted by users, there are a few possible strategies to implement this:

  1. Allow creation of a resource with the same name or id immediately after it was deleted.
  2. Allow creation of a resource with the same name or id at some later point in time after it was deleted.
  3. Do not allow reuse of resource names or ids.

The reason I mention this at all is that option (1) can be hard to guarantee, particularly in distributed systems. Options (2) and (3) are largely the same from this perspective, and which one you choose depends mostly on the specifics of your domain.

You may be asking yourself why option (1) could be problematic. Maybe the underlying storage system does not guarantee read-after-write consistency? Cache invalidation could be a non-trivial issue as well.

A particular sign that you should not go for option (1) is when your system generally does not allow users to (re)create that kind of resource on their own, but does allow it in an integration environment for testing purposes. If you go for option (2) there, users may be unhappy that resource recreation does not work right away. So you should probably go for option (3). If the resource (re)creation is particularly costly or time-consuming, you could create an appropriate number of resources in advance and provide them to users for testing, e.g. via a dedicated endpoint.

Have Internal Endpoints

Introducing internal or operator endpoints is relatively cheap when compared to adding new endpoints to any customer-facing part of the API.

With internal endpoints, I’m referring to endpoints that are fully under your control – not even exposed to other teams in the company. They are used by the system operator (you) to manage the system.

You can have system diagnostics endpoints to get insights into the current state of the system. As the system grows, you’ll need to add more of them. Those endpoints can also be used by any monitoring tools fully under your control. You might want to:

  • Run health checks to see if the system is responsive
  • Run smoke tests to check if upstream systems you depend on are reachable and behaving as expected
  • Get upstream systems latencies and failure rates
  • Get response times distribution for individual endpoints
  • Get error rates distributions across endpoints or error types
  • Get insights into environment of a single server, like CPU, hard disk, memory, or network usage.

You should also know how users are behaving, including:

  • What interactions with the system is an individual user doing
  • What are the errors they are seeing
  • How much throttling quota does a user have left

For any kind of statistical distribution used in metrics, you want to be able to quickly assess the state with only a few numbers. You are particularly interested in the negative extremes, e.g. the latencies of the slowest 2% of all requests. You could report the distribution of latencies with: median, average, p90, p95, p98 and p99.
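A sketch of such a summary, using the nearest-rank percentile variant (real monitoring tools often interpolate between samples instead):

```python
def latency_summary(latencies_ms):
    """Summarize a latency distribution with the few numbers suggested
    above: median, average, p90, p95, p98 and p99."""
    ordered = sorted(latencies_ms)

    def percentile(p):
        # Nearest-rank: smallest value such that p% of samples are <= it.
        rank = -(-len(ordered) * p // 100)  # ceiling division
        return ordered[max(0, rank - 1)]

    return {
        "median": percentile(50),
        "average": sum(ordered) / len(ordered),
        "p90": percentile(90),
        "p95": percentile(95),
        "p98": percentile(98),
        "p99": percentile(99),
    }
```

Reporting p99 alongside the median makes it obvious when the tail is degrading even though the typical request is still fast.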

Assuming that an endpoint is indeed used only internally, the rules relax a bit from what we’ve been exploring so far – you may not need API versioning and you can change endpoints as needed. You may also prioritize ease of use over defensive design a bit. This is of course not an excuse to make a mess or have semantically broken endpoints. Merely that the consequences of getting things a bit wrong are contained in the team, and so you can fix them easily while moving faster to satisfy immediate business or technical needs.

You can go a bit further to enable ease of use by having an internal web page that allows you to interact with all the internal endpoints in a convenient manner, e.g. by providing you with forms to access them, or with graphs instead of only a few numbers approximating statistical distributions. This does not require a lot of effort and allows you to get the most out of internal endpoints. And the less tedious it is to do something, the more likely it is you’ll be doing it often.

To enable easy-to-use internal pages – and, for that matter, the customer-facing playground I mentioned earlier – you may want to provide different response types for the same data. For example, you could normally return JSON payloads to a customer, but return an HTML representation for the playground and the internal endpoints page, enhancing usability. An example of such an enhancement would be a JSON viewer that can collapse sections, or that renders any abstractions you provide more appropriately. Be careful not to expose these different representations carelessly to customers. In fact, you most likely want to provide them with only one representation. Otherwise you are confusing customers and risking that they integrate with an inappropriate response representation, e.g. with volatile HTML instead of the more controlled JSON representation.

On Providing SDKs

You may want to provide Software Development Kits for certain programming languages to enable easier integration with the service. While this is not the main topic of this book, it still deserves brief consideration.

The first question to tackle is: should your SDK mirror the Web API?

The main benefit you would gain by mirroring the web API endpoints one-to-one in the SDK is the ease of generating the SDK, potentially for multiple programming languages.

A benefit of not mirroring the Web API is much stronger abstraction capability. Basically all mainstream programming languages are much more expressive than Web services. They provide stronger encapsulation, higher-level abstractions, more expressive and numerous data types, and a less constrained flow of information – a single SDK operation can combine multiple Web endpoints instead of mapping to individual HTTP calls.

Another benefit is easier evolution of the product – by having an SDK at a higher abstraction level than the Web API, you gain another level of indirection where you can manage changes in the API. You can often preserve the SDK’s backward compatibility while changing how the SDK interacts with your service. That way customers can update their SDK libraries without any changes to their code and immediately benefit from enhancements you made, or seamlessly transition to a newer API endpoint without seeing a backward incompatible change at the Web API level.

By having an SDK that runs on the client’s side, you gain capability to further enhance the service experience. You can do client-side caching, optimize the flow of data by flexibly choosing batching or streaming approaches, abstract away any session handling requirements, etc.

Note that you can also decide to make SDKs the main, or even the only, way of integrating with your service. This changes Web API design considerably and would require another book to cover appropriately. A switch of focus from the Web API to SDK APIs is likely to happen later in the service’s maturity anyway. And if it happens, you’ll still benefit from the careful API design we are focusing on.

Gain Insights Through Logs

Understanding user behavior and the impact on your system is critical. The easiest way to gain insights on this topic is by logging any relevant information.

You should produce and store all kinds of logs from the start. At this point you likely don’t know all the applications you’ll have for this data in the future. And that’s fine, just preserve it so you can run analysis in the future.

You should log all the requests and interesting details associated with them. That includes HTTP method, path, query parameters, headers, user information, timestamp, response times, etc. Consider logging the response as well, but be aware that it is often not feasible. The same goes for POST or PUT payloads.

You should have additional types of logs, not just access log. Those should include:

  • System log that is used to log all the interesting operations that happen internally while requests are being processed, or that are done as part of the system’s internal operations. What were the execution paths through the system? Which operations were retried? Where did the system fall back to a secondary source of data? Which intermediate results are interesting?
  • Upstream log that is used to log all interactions with upstream systems and can have data analogous to the access log. This log is particularly useful when you need to quickly establish whether a problem originated inside your system or not, and act accordingly.
  • Incident log that is used to collect, in one place, all the interactions that resulted in errors. Consider exposing various analytics and raw log content through internal endpoints or an internal page.

Now that you have multiple logs, you’ll want to be able to track individual API interactions through all of them. This can be done using transaction ids. A transaction id is a UUID unique to each request. Generate it as close to the system entry point as possible, e.g. at the load balancer level, and propagate it through all operations in the system. At the load balancer level, you can use headers to encode it. How you propagate it through the system largely depends on the technology used in the implementation. Don’t forget to encode transaction ids in any upstream interactions as well.
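A minimal sketch of the generate-or-reuse logic; the X-Transaction-Id header name is an assumption for illustration:

```python
import uuid

TRANSACTION_HEADER = "X-Transaction-Id"  # illustrative header name

def ensure_transaction_id(incoming_headers):
    """Reuse the transaction id if an upstream component (e.g. the load
    balancer) already assigned one; otherwise generate it here, as close
    to the entry point as possible."""
    return incoming_headers.get(TRANSACTION_HEADER) or str(uuid.uuid4())

def upstream_headers(transaction_id):
    """Headers to attach to any upstream call so the id propagates on."""
    return {TRANSACTION_HEADER: transaction_id}
```

Every log line your system emits for the request would then include this id, making it possible to join access, system, upstream, and incident logs for a single interaction.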

In general, err heavily on the side of logging too much rather than too little. You should, of course, consider the constraints you are working with, e.g. you don’t want to impact user experience because logging is straining your system too much. Or worse, crash servers because you don’t have any free disk space left. Prioritize what you log by the usefulness of the data, and then log as much as is still meaningful.

What does logging have to do with API design? Now that you are collecting all this data, you can analyze it to gain insights into the different usage patterns of your users. You’ll be able to discover wrong or unexpected patterns and address them in the API, e.g. by introducing new endpoints, by writing better documentation, or by introducing stricter throttling rules.

Analyzing usage patterns may also result in positive surprises – maybe your system is providing value to a customer in a way you haven’t even considered. If that is the case, maybe there is an even better way to enable the newly-discovered use case through the API.

Preventing Accidental Changes with Log-Replay Tests

Let’s conclude this chapter with a powerful technique that is unfortunately not used as much as it should be – log replay tests.

The basic idea is simple: record access logs to the service, then later replay requests from those logs against a new candidate version of the service and compare its output to the results produced by the version that is in use and working as expected.

Or said differently, you are recording the user input to the system and then recreating those interactions and validating that the output matches your expectations. This technique can be applied to vast numbers of different systems, but it is particularly suitable for web services as those tend to have simple input (recorded access logs) and structured outputs that are easy to compare (e.g. JSON).

The main goal of log-replay tests is to prevent accidental regressions of the system. In the context of resilient web APIs, log-replay tests are particularly relevant to prevent unexpected changes that could break existing clients. But they can also be used to stress-test the system by replaying the logs faster. Or to model various usage patterns by replaying different mixtures of requests. Or to prevent regression of the system’s performance over time. As you can see, this is an extremely versatile tool you really should be using.

Why is log-replay testing strategy so powerful? Because it tests a candidate system with real users’ interactions, as opposed to having only test scenarios developers deemed worth testing. Once put in place, the log-replay tests will automatically adapt to new usage patterns as the service evolves and so always be up-to-date. They form a perfect last testing step in your Continuous Delivery pipeline – running log-replay tests takes a bit of time, but they catch virtually any regression you might have introduced.

How can you go about implementing log-replay tests? First you need to sample the requests. This can be done in multiple ways, including randomly sampling a small percentage of logs on various servers. If the servers comprising the service are all uniform, you can even just choose one server and continuously collect its logs for a period of time. Your goal in this phase is to collect logs that represent all the interactions clients are doing. At the very least, make sure you have logs for all the endpoints used by customers. Note that this should not be a one-off task; it should be automated to keep log samples up-to-date with newly introduced endpoints or system behavior.

Once you have a new candidate version of the system, you replay those logs against the service that is in use and presumably behaving as expected, and against your candidate service. Both produce results that are then compared. This comparison is done by some kind of a matcher that can ignore unimportant differences like timestamps. And that’s it. Depending on the complexity of your system and maturity of your pipelines, you may automatically decide to put the candidate service in use or present the report for manual operator inspection.
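Such a matcher can be sketched as a recursive comparison that skips fields known to differ legitimately between runs. The ignored field names are, of course, specific to your payloads:

```python
def responses_match(current, candidate, ignored_fields=("timestamp",)):
    """Compare two JSON-like response payloads, ignoring fields (such as
    timestamps) that legitimately differ between runs. Recurses into
    nested objects and arrays."""
    if isinstance(current, dict) and isinstance(candidate, dict):
        keys = (set(current) | set(candidate)) - set(ignored_fields)
        return all(
            key in current and key in candidate
            and responses_match(current[key], candidate[key], ignored_fields)
            for key in keys
        )
    if isinstance(current, list) and isinstance(candidate, list):
        return len(current) == len(candidate) and all(
            responses_match(a, b, ignored_fields)
            for a, b in zip(current, candidate)
        )
    return current == candidate
```

In practice you would also want the matcher to report *where* the payloads diverge, not just that they do, so the operator inspecting the report can judge the difference quickly.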

There are a few caveats to keep in mind when working on log-replay tests:

  • Implementing log-replay tests is simpler for stateless services where each request can be considered in isolation.
  • GET requests are easier to handle than POST / PUT / DELETE requests because replaying them does not change the system’s state.
  • You need to record request payloads to be able to replay POST or PUT requests.
  • Don’t forget to mark requests used in tests and exclude them from logging, or otherwise make sure they are not used in analytics. If you are not doing this, you are contaminating your data.

While it may take some effort to implement log-replay tests, they are definitely worth it. They can be your final safety net that ensures customers are not impacted negatively.

Conclusion

This marks the end of this brief book on building resilient web APIs. I hope you’ve found something useful along the way and are now ready to better tackle the problem of designing Web APIs.

I’d appreciate if you send any questions, comments, or suggestions you might have, to api-book@ikusalic.com. I’ll try my best to answer them all.

I wish you a great success in all your future design endeavors.


  1. Yes, premature optimization is the root of all evil. No, that is not an excuse to write bad code when the effort to do it better would have been the same. I’ve seen so many people abuse this quote by not understanding what premature optimization is…