all that jazz

james' blog about scala and all that jazz

CQRS increases consistency

The problem with CQRS is that it makes things more complex because it decreases consistency.

I hear this argument a lot. As the author of a framework that strongly encourages CQRS, I am obviously biased against this opinion, and disagree with it.

Of course, the statement and my disagreement need context and qualification. If you have a traditional monolithic system, where all data operations are done on a single database, and that single database supports ACID transactions, then yes, switching to CQRS will decrease consistency. However, that’s not the context in which CQRS usually comes up, and it’s not the context that I recommend using it in. The context is microservices, and more and more industry experts are recommending that people move away from monoliths and move to microservices.

§Microservices decrease consistency

A core principle of microservices is that they should own all their own data. Another core principle is that they should be able to act autonomously. This means they shouldn’t need to depend on other services - in particular making synchronous calls to other services - in order to implement the functions they are responsible for.

If a service is to be autonomous then, it needs to store all the data necessary for it to achieve its functions. Sometimes this data may be owned by other services - and so in order to function, those services need to publish that data for this service to consume and update its own store. This means in a microservices system, the same data is going to be stored in many places. Will this data be strongly consistent across the entire system? Unless every time you do an operation, you stop all other requests to the entire system, and process the operation until it succeeds, the answer is no, it will often be inconsistent.

So, a switch to microservices is by nature a switch to decreased consistency. Decreased consistency is a consequence of microservices.

§Inconsistency can be very bad

Because we are using microservices, our system can become inconsistent, and we need to deal with that. How inconsistent our system can get depends on how we deal with it. The typical, very high level hand wavy approach to dealing with inconsistency in a distributed system is to use eventual consistency. But how is that actually achieved?

Let’s imagine service B needs some data owned by service A to implement its responsibility. In this scenario, A is responsible for writing the data, it’s the thing that handles the command, while B is responsible for reading the data, it does the query. In order for B to act autonomously though, it can’t query A directly, it needs to have A push it the data so that it can update its own query store, and then when the time comes to use it, query it locally.

Service A could do this synchronously, when an operation is done on service A, it will make a call on service B to update it with the results. Now this works fine on paper, but what happens when service B is down at that time? Does service A have to keep retrying until service B comes back up? What if service A then goes down?

As you can see we now have a consistency problem, one that will not eventually become consistent. Service A has been updated, but service B failed to update. It will stay inconsistent forever, we could say that it is terminally inconsistent, and will require manual intervention in order to fix. The reason we have this problem is that we combined our command responsibility and query responsibilities in the one operation, and since that operation wasn’t atomic, a partial failure of the operation will lead to terminal inconsistency.

§CQRS gives you consistency back

CQRS means separating command responsibilities from query responsibilities. In our scenario, using CQRS, service A handles the command, and updates its state, and is done. Then, in a separate operation, another process will take the result of the operation on A, and asynchronously push it to service B.

Since the operation is asynchronous, service B doesn’t need to be up at the time - for example it can use an at least once messaging queue to handle the message. If service B isn’t up at the time, then the system will be inconsistent for a period of time. However, when service B comes back up, and then processes the message, the system will become consistent again, it will be eventually consistent.
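To make the flow concrete, here is a minimal in-memory sketch of it. Everything in it is an assumption made for illustration: the `ServiceA`, `ServiceB` and `relay` names are hypothetical, and the `outbox` deque merely stands in for a durable at least once message queue. Because at least once delivery can redeliver a message, service B remembers the ids it has already applied.

```python
from collections import deque


class ServiceB:
    """Query side: keeps its own local copy of data owned by service A."""

    def __init__(self):
        self.up = True
        self.store = {}      # local query store
        self.seen = set()    # ids of messages already applied (idempotency)

    def handle(self, message):
        if not self.up:
            raise ConnectionError("service B is down")
        msg_id, key, value = message
        if msg_id not in self.seen:   # a redelivered message is ignored
            self.seen.add(msg_id)
            self.store[key] = value


class ServiceA:
    """Command side: updates its own state, records a message, and is done."""

    def __init__(self):
        self.state = {}
        self.outbox = deque()   # stand-in for a durable at least once queue
        self.next_id = 0

    def handle_command(self, key, value):
        self.state[key] = value
        self.outbox.append((self.next_id, key, value))
        self.next_id += 1


def relay(a, b):
    """Separate operation: push pending messages to B, keeping any that fail."""
    while a.outbox:
        try:
            b.handle(a.outbox[0])
        except ConnectionError:
            return False        # message stays queued, to be retried later
        a.outbox.popleft()      # acknowledge only after B has processed it
    return True
```

If `relay` runs while service B is down, the message stays in the outbox and the two services disagree for a while. Running `relay` again once B is back up brings `b.store` in line with `a.state`.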

So by employing CQRS, we are able to get back some of the consistency guarantees that we lost when we moved to microservices. We don’t have a globally consistent system, but we can guarantee an eventually consistent system.

§CQRS is a necessary evil

CQRS is complex, far more complex than handling both the command and query responsibilities of data in single operations on a strongly consistent database. However, in a microservices world, we don’t have the luxury of relying on a single strongly consistent database. In that world, inconsistency is a given. If someone says using CQRS in microservices means you lose consistency - they have failed to acknowledge that they lost consistency the moment they started using microservices, it was not CQRS that lost them that consistency. Rather, CQRS is a very powerful tool that allows us to address the inherent inconsistency of microservices to give us eventual consistency instead of terminal inconsistency.

Events as facts - trade-offs between resilience and consistency

In architecting a system, a question that you often have to ask is how much information should be included in an event? The more information you include, the more the services that receive the event will be able to process it without looking up information from other services, which will make your system more resilient and scalable. For example, if a user submits an order, the system might publish an order submitted event. But should it include the list of items that were ordered in that event? If it does, then services consuming that event will not need to go back to the order service to find out what items the order has, they’ll be able to process that event in isolation from the order service and in doing so be more resilient.

On the other hand, if you include too much information in your event, your event becomes very large, which will impact the throughput of your messaging infrastructure and the cost of processing events. While this is true, it’s tempting to consider the problem as primarily a resilience vs throughput problem, and this is how I used to think of it. However, I no longer think of that as the major trade off.

One way to view events is to see them as facts. The words event and fact have very similar meanings, and can often be used interchangeably. An event is something that happened (or will happen, but in the context of systems architecture we always use it to refer to something that has happened). A fact is something that happened. However they have a different focus, and consequently when we think of events vs thinking of facts, we apply slightly different thought processes and constraints. When referring to something as an event, the emphasis is placed on what happened. When referring to something as a fact, the emphasis is placed on the truth of the thing that happened.

This subtle difference can change the way we think about the messages that convey this information. A fact is indisputable, so if an event is a fact, then it should only contain information that is indisputable. The user submitted the order at this time. That is indisputable, it is a fact that the user did that. The order contains these items. That is disputable, the order may have been changed since it was submitted due to the availability of items in it changing.

The property of our system that is in question here is consistency. If we treat all events as facts, containing only indisputable information, then we improve the consistency of our system. A service can’t make the mistake of trusting information in an event that might later become false, since the event won’t contain that information. And this consistency issue is, in general, a bigger issue than the throughput of your message broker. There are many things that can be tuned to increase throughput, but addressing inconsistency requires a lot more than just tuning.

And so I’ve realised that the biggest concerns when deciding how much information to include in events are resilience and consistency. To include more information beyond the fact that happened is to increase resilience at the cost of consistency.

Of course, it is a trade-off, and one that only the business requirements can decide where the appropriate balance lies. If I have an email notification service that is subscribing to order events, and it needs the list of order items in order to render an email notification, it is fine for it to simply read the items from the event to generate the email to send.

On the other hand, depending on the items from the event will cause a problem for the inventory service, which needs to adjust its inventory levels according to what the order contains when it’s eventually dispatched. There are two ways this could be addressed while still maintaining resilience - the inventory service could receive all events regarding changes to items in the order from the start, through to when the order is submitted, and right through to dispatch. Or it may consider the order submitted event with its list of items as the starting state of the order, and subscribe to subsequent change events for the order.
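As a rough sketch of the second option, the inventory service could fold events into its own local view of the order, treating the item list in the order submitted event only as a starting state. The event shapes and names used here (`OrderSubmitted`, `ItemQuantityChanged`, `OrderDispatched`, `items_to_deduct`) are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class OrderView:
    """The inventory service's local picture of one order."""
    items: dict = field(default_factory=dict)   # sku -> quantity
    dispatched: bool = False


def apply(view, event):
    """Fold one event into the local view and return the view."""
    kind = event["type"]
    if kind == "OrderSubmitted":
        # The item list is only the starting state, it may change later.
        view.items = dict(event["items"])
    elif kind == "ItemQuantityChanged":
        if event["quantity"] > 0:
            view.items[event["sku"]] = event["quantity"]
        else:
            view.items.pop(event["sku"], None)
    elif kind == "OrderDispatched":
        view.dispatched = True
    return view


def items_to_deduct(events):
    """Inventory is adjusted from the state at dispatch, not at submission."""
    view = OrderView()
    for event in events:
        apply(view, event)
    return view.items if view.dispatched else {}
```

The inventory service never calls back to the order service, so it keeps its resilience, and it only deducts stock based on what the order contained once it was dispatched.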

Semantic discoverability security

A well known security issue in HTTP is that if you return 404 Not Found for a resource that doesn’t exist, but return 401 Unauthorized or 403 Forbidden for a resource that you’re not allowed to access, then you might be giving away information that an attacker could use, that is, that the resource exists. While in some cases that’s no big deal, in other cases it’s a security problem.

The usual solution to this is when you’re not permitted to access something, the server should respond as if it doesn’t exist, by returning 404 Not Found. Thus, you receive the same response whether the resource exists or not, and are not able to determine whether it does exist. I’d like to argue that this is the wrong approach, it is not semantic.

Just before we go on, I just want to clarify, while returning 401 Unauthorized may be a semantic way to tell a client that you need to authenticate before you can continue, on the web, it’s not a practical way to tell the client that, since the HTTP spec requires that it be paired with some user agent aware authentication method, such as HTTP BASIC authentication. But typically, sites get people to log in using forms and cookies, and so the practical response to tell a web user that they need to authenticate before they can continue is to send a redirect to the login page. For the remainder of this blog post, when I refer to the 401 status code, I mean informing the user that they need to authenticate, which could be done by actually sending a redirect.

So, let’s think about what we’re trying to achieve here when we say we want to protect discoverability of resources. What we’re saying is that we want to prevent users who don’t have discoverability permission from finding out if something does or doesn’t exist. The 404 Not Found status code says that a resource doesn’t exist. So if I’m not allowed to know if a resource doesn’t exist, but the server sends me a 404 Not Found, that’s a contradiction, isn’t it? It’s certainly not semantic. Of course, it’s not a security issue because the server will also send me a 404 Not Found if the resource does exist, but that’s not semantic either, the server is in fact lying to me then.

Sending a 404 Not Found in every case is not the only solution, there’s another solution where the server can say what it really means, i.e. be semantic, while still protecting discoverability. If a resource doesn’t exist, but I’m not allowed to find out whether a resource does or doesn’t exist, then the semantic response is not to tell me it doesn’t exist, it’s to tell me that I’m not allowed to find out if it exists or not. This would mean, if I’m not authenticated, sending a 401 Unauthorized, or if I am authenticated but am still not allowed to find out, to send a 403 Forbidden. The server has told me the truth, you’re not allowed to know anything about this resource that may or may not exist. And, if it does exist, and I’m not allowed to access it, the server will do the same, sending a 401 or 403 response. In either case, whether the resource exists or not, the response code will be the same, and so can’t be exploited to discover resources.
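The ordering of the checks is what matters here, and it can be sketched in a few lines. This is only an illustration under stated assumptions: `respond` and `may_discover` are hypothetical names, and `may_discover` is assumed to be decidable without looking up whether the resource exists (for example, from the path alone).

```python
from http import HTTPStatus


def respond(user, path, resources, may_discover):
    """Pick a status code without revealing whether `path` exists.

    user          -- None when the request is unauthenticated
    resources     -- set of paths that actually exist
    may_discover  -- callable(user, path) -> bool (hypothetical permission
                     check that must not depend on the resource existing)
    """
    # The permission check comes first, so the answer is identical for
    # existing and non-existing resources.
    if not may_discover(user, path):
        if user is None:
            return HTTPStatus.UNAUTHORIZED   # in practice, a redirect to login
        return HTTPStatus.FORBIDDEN
    # Only users who are allowed to know ever get to see a 404.
    if path not in resources:
        return HTTPStatus.NOT_FOUND
    return HTTPStatus.OK
```

With this ordering, a user without discoverability permission gets the same 401 or 403 for a path that exists as for one that doesn’t, while a permitted user gets an honest 404 for a missing resource.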

In this way, we have implemented discoverability protection, but we have done so semantically. We haven’t lied to the user, we haven’t told them that something doesn’t exist that actually does, we have simply told them that they’re not allowed to find out if it does or doesn’t exist. And we haven’t seemingly violated our own contract by telling a user something is not found when they’re not allowed to know if it’s not found or not.

In practice, using this approach also provides a much better user experience. If I am not logged in, but I click a link somewhere to the resource, it would be much better for me to be redirected to the user login screen so that I can log in and be redirected back to the resource, than for me to be just told that the resource doesn’t exist. If the resource doesn’t actually exist, then after logging in, I can be redirected back to the resource where I’ll get a 404 Not Found, this is semantic and makes sense, it’s not until I log in that I can actually get a Not Found out of the server, that’s what discoverability means. This is exactly my argument in a Jenkins issue, where Jenkins is currently returning a 404 Not Found screen (with no option to click log in) for builds that I don’t have permission to access until I log in.

Open Source - it's not about free as in beer or speech

I’ve been thinking a bit about open source software, and why I’m so attracted to it. Often when people talk about the thing that makes open source so great, they use the phrase “it’s free as in speech”. Of course it’s also free as in beer, that is, it’s free as in you don’t have to pay for it, and that certainly is an advantage of open source software, but the point that they’re making, in contrast to free as in beer, is that the real value of it is that you are free to do with the software as you choose.

But when I think about what attracts me to open source software, like free as in beer, free as in speech is only an advantage, a nice to have, it’s not the primary attraction.

So firstly, why isn’t free as in beer the primary attraction to me? The answer is it’s about risk. Dijkstra famously stated that we should count lines of code as “lines spent” rather than “lines produced”. Every line of code in your system is a line of code that you must understand and maintain for the life of that system. The more lines of code, the more complexity, the harder the system is to understand, the more expensive it is to maintain.

This doesn’t just apply to the code that you write - it applies to the third party libraries that you’re using. Every library that you use is another layer of code, another part of the system that you have to maintain, and so counts towards the number of lines spent. It may be a commercial library that you’re paying someone else to maintain - but it’s still your responsibility to ensure that it is maintained since it’s your system that the library is running in. When you pay someone to maintain a library for you, you are only delegating that responsibility, not transferring it.

So, let’s say there was a library that I wanted to use, and it was free, as in beer, but there was no source code available for it. I could take the binaries, call them my own, use them in my own system, they are free, but I could never get the source code, I had to rely on the maintainer. Would I use such a library? Never. The risk is simply too high, since that library is part of my system, it’s my responsibility to maintain it. But I can’t maintain it myself, since I don’t have the source code. And because I’m not paying the maintainer for it, I can’t delegate the responsibility to them, because they have no obligation to me to fulfil that responsibility. Hence, I cannot use the library since I cannot be held responsible for it. So, while free as in beer is great, it is not the primary advantage of open source, there are more important things than free as in beer.

So what about free as in speech? An open source library allows me the freedom to maintain and modify it myself, as a last resort I can always fork it. This is an advantage since it means I can always assume the maintenance of that library myself, and that means I can take responsibility for it. But is that the primary advantage? Let’s consider you had a choice of two libraries. One of them was open source, let’s say ASF licensed, but it had no active community around it, the maintainer didn’t accept patches, and only worked on things that he or she was interested in. The other library was commercial, you had to pay for it, and you were restricted in how you could use it, however once you paid for it, you were given full access to source code, and access to an active community of developers, not only within the company that produced the library but also developers from other companies that were using the library and building on it. Which would you go with?

Now I don’t expect everyone to answer the same here, there’s no one right answer, but my answer is the commercial one. Why? Because I think the biggest value in open source is the community. And so an open source project with no community and no opportunity for community involvement is missing the biggest value of open source, which means a non open source project that has community and active community involvement will trump it.

Why is a community so valuable? As I said before, when you decide to use a library in your system, you assume responsibility for maintaining that library, regardless of whether it’s open source or not. When that library comes with an active community with the like minded goal of improving that library, then when you decide to take responsibility for it, you are immediately joined by a community of people who will help you to do that. Ask any open source maintainer, a community member who sees themselves as part of the solution to any problems they have with an open source library is a peer. They are someone who the maintainer wants to work with, and so when you take that responsibility, you get a whole community of people who are wanting to work with you to solve your problems. This is the great thing about open source, and in my opinion, it’s far greater than the freedom to use the library in any way you want.

Now perhaps you might be thinking that the example of a commercial library with an active developer community is contrived - that would never happen, it’s only because of the freedom that open source affords that such communities exist. Well, that’s simply not true, I have worked in such a community before. Atlassian is a company that sells commercial software with commercial licenses, you are not free to do with Atlassian’s software as you please, you may only use it as the license stipulates. But when you do purchase a license, you get full access to their source code, and there is a vibrant plugin development community that you can join, one that is fostered by Atlassian, where knowledge is shared by both outside and Atlassian community members, and people work together to make the entire platform better. This meets all of my requirements of a developer community, and addresses the risk of including an Atlassian product as part of my system.

So if an Atlassian product was the right fit to be part of a system that I was maintaining, I would not hesitate to choose it. The fact that it isn’t free as in speech or beer wouldn’t faze me in the slightest, I would choose it for the same primary reason that I choose most open source software - that is that I get a whole community of like minded developers working to help me take the responsibility of maintaining my system. And that’s how I know that the primary value of open source to me isn’t that it’s free as in speech, it’s the community that comes with it.

Eventual Consistency is not safe and that's ok

A number of years ago, Peter Bailis wrote an excellent post titled Safety and Liveness: Eventual Consistency Is Not Safe. This post is short and to the point, and is a very salient reality check for anyone that is using eventual consistency. With the prevalence of eventually consistent architectures and systems today, I recommend that all developers read it, in spite of the post’s age, it is just as important today as it was when it was written.

A member of our marketing department recently came across the post, and sent an email to our engineers because he was somewhat confused about it. At Lightbend, we are advocates of eventual consistency, and much of our tech embraces it. However he was alarmed by the statement that eventual consistency is not safe. This statement led him to the conclusion that the post was saying that eventual consistency must be bad.

I don’t think this is an unreasonable conclusion for someone not familiar with eventual consistency to reach, whether that is a marketing person or a developer who is completely new to eventual consistency. And so I think the statement needs some further explanation.

So to start with, an analogy. The sun is unsafe. If you were to somehow “stand on” the sun, before you could begin to comprehend your environment, you would cease to exist in any recognisable form. This property of the sun, that standing on it and existing are incompatible, is not a bad thing, in fact it’s not a good thing either, it’s neither bad nor good, it’s just a property. It doesn’t become a good or bad thing until you try to rely on the property or its absence. It simply means, don’t stand on the sun.

Likewise, the unsafeness of eventual consistency is not a bad thing, nor is it a good thing, it’s just a property.

Now the bulk of Peter’s post was about how eventual consistency alone is not useful, it must be combined with other guarantees, such as what versions of the data can be returned when it’s not consistent, and what version of the data will eventually be returned. These guarantees help to make an eventually consistent system more safe, but they don’t stop it from being unsafe.

So what does this unsafety look like in practice, and why isn’t it necessarily a bad thing?

Let’s consider a very common eventually consistent architectural practice - Command Query Responsibility Segregation (CQRS). A CQRS implementation is usually eventually consistent, and offers further guarantees such as that the read side will always return some version of the data that matches some version in the write side’s history, and that the convergent version that will eventually be returned will always be the most recent version of the write side.

So, what is not safe about it? Let’s say you have a blog that supports full text search. The search is implemented using a read side view of the data that puts all the posts into a natural text index. Since this is using CQRS, the index is updated asynchronously, and it may take some time for the new post to reach the read side. This means, immediately after publishing a new post, let’s say it’s a post about eventual consistency, if you search for “eventual consistency”, the post won’t come back in the search results. This is in spite of the fact that you can browse to the post, load it, and see the words eventual consistency in the text of the post. The system is in an inconsistent state, when we load the post it has the words “eventual consistency”, when we ask the search index for all the posts with the words “eventual consistency” in them, the index doesn’t return that post. This is the unsafety, it is not safe to rely on every part of the system returning the same answer to whether our post contains the words “eventual consistency”.
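The window of inconsistency described above can be reproduced with a toy model. This is a sketch only, assuming a hypothetical `Blog` class in which the write side is a dictionary of posts, the read side is a simple word index, and `run_indexer` plays the role of the asynchronous process that eventually updates the index.

```python
from collections import deque


class Blog:
    def __init__(self):
        self.posts = {}          # write side: post id -> text
        self.pending = deque()   # published posts the read side hasn't seen yet
        self.index = {}          # read side: word -> set of post ids

    def publish(self, post_id, text):
        """Command: update the write side and queue work for the indexer."""
        self.posts[post_id] = text
        self.pending.append((post_id, text))

    def run_indexer(self):
        """Asynchronous read side update, run some time after publishing."""
        while self.pending:
            post_id, text = self.pending.popleft()
            for word in text.lower().split():
                self.index.setdefault(word, set()).add(post_id)

    def search(self, phrase):
        """Query: ids of posts containing every word in the phrase."""
        words = phrase.lower().split()
        hits = [self.index.get(word, set()) for word in words]
        return set.intersection(*hits) if hits else set()
```

Between `publish` and `run_indexer`, loading the post from `posts` shows the words “eventual consistency” while `search("eventual consistency")` returns nothing. Once the indexer has run, the two parts of the system agree again.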

So, now we’ve seen an example of the unsafety of eventual consistency, why is this not bad? The answer is simply, there is no requirement for the text search to be updated immediately after the blog post is published. There is a requirement that it be eventually updated, and in a somewhat timely manner, and CQRS allows us to meet that requirement. But no one cares if, for a few seconds or even minutes after publishing a blog post, the text search on their blog doesn’t return the new post. So the unsafety of eventual consistency is not bad because we’re not relying on it to be safe.

How does this play out in a broader context? It’s easy to see how safety is not a requirement for a search index, but that’s a very specific scenario, is this one of the few places where eventual consistency can be used? The answer to that is necessarily no, if you’re writing a sufficiently large system, with even modest scaling and availability requirements by today’s standards, you will probably find yourself having to use eventual consistency in order to scale your system and maintain availability. The upshot is that you must embrace the unsafeness property, and not put strong consistency requirements on the parts of your system that are eventually consistent. You must design your system to handle the unsafeness.

There are many ways to do this - for example, by using asynchronous message passing, messages can be queued to be handled when the system is consistent and ready to handle them, in contrast to synchronous RPC calls, which require the system to be consistent before the calls can be made. Another technique is to just store events, or facts, things that happened, and not calculate the state until later, when the system has converged, for example, when the state is queried. And yet another is to assume consistency, but then have a process in place to detect when that assumption was not true, and trigger some recovery operation.
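The last of these techniques, assuming consistency and then detecting and recovering when the assumption didn’t hold, might look something like the following sketch. It assumes, purely for illustration, that both sides keep a version number alongside each value. The `reconcile` name and the `(version, value)` layout are hypothetical.

```python
def reconcile(write_side, read_side):
    """Detect entries where the read side lags the write side and repair them.

    Both arguments map a key to a (version, value) pair.
    Returns the list of keys that had to be repaired.
    """
    repaired = []
    for key, (version, value) in write_side.items():
        # A missing key counts as version -1, so it is always repaired.
        read_version = read_side.get(key, (-1, None))[0]
        if read_version < version:
            read_side[key] = (version, value)   # the recovery operation
            repaired.append(key)
    return repaired
```

A process like this could run periodically. When everything has already converged it finds nothing to repair, and when an update was lost it brings the read side back in line with the write side.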

In all these cases, the unsafeness of eventual consistency isn’t a bad thing, it’s just a property that has been carefully considered and addressed in the design of the system. It is not safe to rely on the consistency of the system, and that’s ok.