Events as facts - trade-offs between resilience and consistency

In architecting a system, a question that you often have to ask is how much information should be included in an event. The more information you include, the better able the services that receive the event will be to process it without looking up information from other services, which will make your system more resilient and scalable. For example, if a user submits an order, the system might publish an order submitted event. But should that event include the list of items that were ordered? If it does, then services consuming the event won’t need to go back to the order service to find out what items the order has; they’ll be able to process the event in isolation from the order service, and in doing so be more resilient.

On the other hand, if you include too much information in your event, the event becomes very large, which will impact the throughput of your messaging infrastructure and the cost of processing events. Because of this, it’s tempting to see the problem primarily as a resilience vs throughput trade-off, and that’s how I used to think of it. However, I no longer think of that as the major trade-off.

One way to view events is to see them as facts. The words event and fact have very similar meanings and can often be used interchangeably. An event is something that happened (or will happen, but in the context of systems architecture we always use it to refer to something that has happened). A fact is also something that happened. However, they have a different focus, and consequently, when we think of events versus facts, we apply slightly different thought processes and constraints. When referring to something as an event, the emphasis is placed on what happened. When referring to something as a fact, the emphasis is placed on the truth of the thing that happened.

This subtle difference can change the way we think about the messages that convey this information. A fact is indisputable, so if an event is a fact, then it should only contain information that is indisputable. The user submitted the order at this time. That is indisputable: it is a fact that the user did that. The order contains these items. That is disputable: the order may have been changed since it was submitted, for example because the availability of the items in it changed.

The property of our system that is in question here is consistency. If we treat all events as facts, containing only indisputable information, then we improve the consistency of our system. A service can’t make the mistake of trusting information in an event that might later become false, since the event won’t contain that information. And consistency is, in general, a bigger concern than the throughput of your message broker. There are many things that can be tuned to increase throughput, but addressing inconsistency requires a lot more than just tuning.

And so I’ve realised that the biggest concerns when deciding how much information to include in events are resilience and consistency. To include more information beyond the fact that happened is to increase resilience at the cost of consistency.

Of course, it is a trade-off, and only the business requirements can decide where the appropriate balance lies. If I have an email notification service that subscribes to order events, and it needs the list of order items in order to render an email notification, it is fine for it to simply read the items from the event to generate the email to send.

On the other hand, depending on the items in the event will cause a problem for the inventory service, which needs to adjust its inventory levels according to what the order contains when it’s eventually dispatched. There are two ways this could be addressed while still maintaining resilience: the inventory service could receive all events regarding changes to the items in the order, from when the order is created, through to when it is submitted, and right through to dispatch. Or it could treat the order submitted event, with its item list, as the starting state of the order, and subscribe to subsequent change events for the order.
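
To make the trade-off concrete, here’s a minimal sketch of the two event shapes discussed above, a fact-only event and an enriched event. The names are illustrative, not from any particular system:

import java.time.Instant

// Fact-only event: everything in it is indisputably true and can never become
// stale. Consumers that need the items must query the order service.
case class OrderSubmitted(orderId: String, userId: String, at: Instant)

// Enriched event: consumers can process it in isolation from the order service
// (more resilient), but the embedded item list may be disputed later if the
// order changes (less consistent).
case class OrderItem(itemId: String, quantity: Int)
case class OrderSubmittedWithItems(
  orderId: String,
  userId: String,
  at: Instant,
  items: List[OrderItem]
)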

Semantic discoverability security

A well known security issue in HTTP is that if you return 404 Not Found for a resource that doesn’t exist, but return 401 Unauthorized or 403 Forbidden for a resource that you’re not allowed to access, then you might be giving away information that an attacker could use, that is, that the resource exists. While in some cases that’s no big deal, in other cases it’s a security problem.

The usual solution to this is that when you’re not permitted to access something, the server should respond as if it doesn’t exist, by returning 404 Not Found. Thus, you receive the same response whether the resource exists or not, and are not able to determine whether it exists. I’d like to argue that this is the wrong approach: it is not semantic.

Before we go on, I want to clarify one thing: while returning 401 Unauthorized may be a semantic way to tell a client that it needs to authenticate before it can continue, on the web it’s not a practical way to do so, since the HTTP spec requires that it be paired with a user agent aware authentication method, such as HTTP Basic authentication. Typically, though, sites get people to log in using forms and cookies, and so the practical response to tell a web user that they need to authenticate before they can continue is a redirect to the login page. For the remainder of this blog post, when I refer to the 401 status code, I mean informing the user that they need to authenticate, which could be done by actually sending a redirect.

So, let’s think about what we’re trying to achieve when we say we want to protect the discoverability of resources. What we’re saying is that we want to prevent users who don’t have discoverability permission from finding out whether something does or doesn’t exist. The 404 Not Found status code says that a resource doesn’t exist. So if I’m not allowed to know whether a resource exists, but the server sends me a 404 Not Found, that’s a contradiction, isn’t it? It’s certainly not semantic. Of course, it’s not a security issue, because the server will also send me a 404 Not Found if the resource does exist, but that’s not semantic either: the server is in fact lying to me.

Sending a 404 Not Found in every case is not the only solution. There’s another solution where the server can say what it really means, i.e. be semantic, while still protecting discoverability. If a resource doesn’t exist, but I’m not allowed to find out whether it does or doesn’t exist, then the semantic response is not to tell me it doesn’t exist, it’s to tell me that I’m not allowed to find out whether it exists. This means, if I’m not authenticated, sending a 401 Unauthorized, or if I am authenticated but am still not allowed to find out, sending a 403 Forbidden. The server has told me the truth: I’m not allowed to know anything about this resource that may or may not exist. And if the resource does exist, and I’m not allowed to access it, the server will do the same, sending a 401 or 403 response. In either case, whether the resource exists or not, the response code will be the same, and so can’t be exploited to discover resources.
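
To illustrate, here’s a minimal sketch of this decision logic, assuming a simple Option-based resource lookup and a user type with a discoverability check. All of the names here are illustrative:

sealed trait Status
case object Ok extends Status           // 200
case object NotFound extends Status     // 404
case object Unauthorized extends Status // 401 (in practice, a redirect to log in)
case object Forbidden extends Status    // 403

case class User(canDiscover: String => Boolean)

def fetch(id: String, user: Option[User], lookup: String => Option[String]): Status =
  user match {
    // Not authenticated: ask for authentication before revealing anything
    case None => Unauthorized
    // Authenticated, but not allowed to find out whether the resource exists
    case Some(u) if !u.canDiscover(id) => Forbidden
    // Allowed to discover: only now does Not Found come into play
    case Some(_) =>
      lookup(id) match {
        case Some(_) => Ok
        case None => NotFound
      }
  }

Whether or not the resource exists, a caller without discoverability permission gets the same response, so existence can’t be probed.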

In this way, we have implemented discoverability protection, but we have done so semantically. We haven’t lied to the user; we haven’t told them that something doesn’t exist that actually does. We have simply told them that they’re not allowed to find out whether it exists. And we haven’t violated our own contract by telling a user something is not found when they’re not allowed to know whether it can be found.

In practice, using this approach also provides a much better user experience. If I am not logged in, but I click a link somewhere to the resource, it is much better for me to be redirected to the login screen so that I can log in and be redirected back to the resource, than to simply be told that the resource doesn’t exist. If the resource doesn’t actually exist, then after logging in I can be redirected back to it and get a 404 Not Found. That is semantic and makes sense: it’s not until I log in that I can actually get a Not Found out of the server, and that’s what discoverability protection means. This is exactly my argument in a Jenkins issue, where Jenkins is currently returning a 404 Not Found screen (with no option to click log in) for builds that I don’t have permission to access until I log in.

Open Source - it's not about free as in beer or speech

I’ve been thinking a bit about open source software, and why I’m so attracted to it. Often when people talk about the thing that makes open source so great, they use the phrase “it’s free as in speech”. Of course it’s also free as in beer, that is, you don’t have to pay for it, and that certainly is an advantage of open source software. But the point they’re making, in contrast to free as in beer, is that the real value is that you are free to do with the software as you choose.

But when I think about what attracts me to open source software, like free as in beer, free as in speech is only an advantage, a nice to have, it’s not the primary attraction.

So firstly, why isn’t free as in beer the primary attraction to me? The answer is it’s about risk. Dijkstra famously stated that we should count lines of code as “lines spent” rather than “lines produced”. Every line of code in your system is a line of code that you must understand and maintain for the life of that system. The more lines of code, the more complexity, the harder the system is to understand, the more expensive it is to maintain.

This doesn’t just apply to the code that you write - it applies to the third party libraries that you’re using. Every library that you use is another layer of code, another part of the system that you have to maintain, and so counts towards the number of lines spent. It may be a commercial library that you’re paying someone else to maintain - but it’s still your responsibility to ensure that it is maintained, since it’s your system that the library is running in. When you pay someone to maintain a library for you, you are only delegating that responsibility, not transferring it.

So, let’s say there was a library that I wanted to use, and it was free, as in beer, but there was no source code available for it. I could take the binaries, call them my own, and use them in my own system - they are free - but I could never get the source code; I would have to rely on the maintainer. Would I use such a library? Never. The risk is simply too high: since that library is part of my system, it’s my responsibility to maintain it. But I can’t maintain it myself, since I don’t have the source code. And because I’m not paying the maintainer for it, I can’t delegate the responsibility to them, because they have no obligation to me to fulfil that responsibility. Hence, I cannot use the library, since I cannot take responsibility for it. So, while free as in beer is great, it is not the primary advantage of open source; there are more important things than free as in beer.

So what about free as in speech? An open source library allows me the freedom to maintain and modify it myself, as a last resort I can always fork it. This is an advantage since it means I can always assume the maintenance of that library myself, and that means I can take responsibility for it. But is that the primary advantage? Let’s consider you had a choice of two libraries. One of them was open source, let’s say ASF licensed, but it had no active community around it, the maintainer didn’t accept patches, and only worked on things that he or she was interested in. The other library was commercial, you had to pay for it, and you were restricted in how you could use it, however once you paid for it, you were given full access to source code, and access to an active community of developers, not only within the company that produced the library but also developers from other companies that were using the library and building on it. Which would you go with?

Now I don’t expect everyone to answer the same here, there’s no one right answer, but my answer is the commercial one. Why? Because I think the biggest value in open source is the community. And so an open source project with no community and no opportunity for community involvement is missing the biggest value of open source, which means a non open source project that has a community and active community involvement will trump it.

Why is a community so valuable? As I said before, when you decide to use a library in your system, you assume responsibility for maintaining that library, regardless of whether it’s open source or not. When that library comes with an active community with the like-minded goal of improving that library, then when you take that responsibility, you are immediately joined by a community of people who will help you to do so. Ask any open source maintainer: a community member who sees themselves as part of the solution to any problems they have with an open source library is a peer, someone the maintainer wants to work with. And so when you take that responsibility, you get a whole community of people who want to work with you to solve your problems. This is the great thing about open source, and in my opinion, it’s far greater than the freedom to use the library in any way you want.

Now perhaps you might be thinking that the example of a commercial library with an active developer community is contrived - that would never happen, it’s only because of the freedom that open source affords that such communities exist. Well, that’s simply not true; I have worked in such a community before. Atlassian is a company that sells commercial software with commercial licenses: you are not free to do with Atlassian’s software as you please, you may only use it as the license stipulates. But when you purchase a license, you get full access to their source code, and there is a vibrant plugin development community that you can join, one that is fostered by Atlassian, where knowledge is shared by community members both inside and outside Atlassian, and people work together to make the entire platform better. This meets all of my requirements of a developer community, and addresses the risk of including an Atlassian product as part of my system.

So if an Atlassian product was the right fit for a system that I was maintaining, I would not hesitate to choose it. The fact that it isn’t free as in speech or beer wouldn’t faze me in the slightest; I would choose it for the same primary reason that I choose most open source software - that I get a whole community of like-minded developers working to help me take on the responsibility of maintaining my system. And that’s how I know that the primary value of open source to me isn’t that it’s free as in speech, it’s the community that comes with it.

Eventual Consistency is not safe and that's ok

A number of years ago, Peter Bailis wrote an excellent post titled Safety and Liveness: Eventual Consistency Is Not Safe. The post is short and to the point, and is a very salient reality check for anyone using eventual consistency. With the prevalence of eventually consistent architectures and systems today, I recommend that all developers read it; in spite of the post’s age, it is just as important today as it was when it was written.

A member of our marketing department recently came across the post, and sent an email to our engineers because he was somewhat confused about it. At Lightbend, we are advocates of eventual consistency, and much of our tech embraces it. However he was alarmed by the statement that eventual consistency is not safe. This statement led him to the conclusion that the post was saying that eventual consistency must be bad.

I don’t think this is an unreasonable conclusion for someone unfamiliar with eventual consistency to draw - not just a marketing person, but also a developer who is completely new to the concept. And so I think the statement needs some further explanation.

So to start with, an analogy. The sun is unsafe. If you were to somehow “stand on” the sun, before you could begin to comprehend your environment, you would cease to exist in any recognisable form. This property of the sun, that standing on it and existing are incompatible, is not a bad thing, in fact it’s not a good thing either, it’s neither bad nor good, it’s just a property. It doesn’t become a good or bad thing until you try to rely on the property or its absence. It simply means, don’t stand on the sun.

Likewise, the unsafeness of eventual consistency is not a bad thing, nor is it a good thing, it’s just a property.

Now the bulk of Peter’s post was about how eventual consistency alone is not useful, it must be combined with other guarantees, such as what versions of the data can be returned when it’s not consistent, and what version of the data will eventually be returned. These guarantees help to make an eventually consistent system more safe, but they don’t stop it from being unsafe.

So what does this unsafety look like in practice, and why isn’t it necessarily a bad thing?

Let’s consider a very common eventually consistent architectural practice - Command Query Responsibility Segregation (CQRS). A CQRS implementation is usually eventually consistent, and offers further guarantees, such as that the read side will always return some version of the data that matches some version in the write side’s history, and that the version eventually converged on will be the most recent version of the write side.

So, what is not safe about it? Let’s say you have a blog that supports full text search. The search is implemented using a read side view of the data that puts all the posts into a full text index. Since this is using CQRS, the index is updated asynchronously, and it may take some time for a new post to reach the read side. This means that immediately after publishing a new post - let’s say it’s a post about eventual consistency - if you search for “eventual consistency”, the post won’t come back in the search results. This is in spite of the fact that you can browse to the post, load it, and see the words eventual consistency in its text. The system is in an inconsistent state: when we load the post, it has the words “eventual consistency”; when we ask the search index for all the posts with the words “eventual consistency” in them, the index doesn’t return that post. This is the unsafety - it is not safe to rely on every part of the system returning the same answer to whether our post contains the words “eventual consistency”.
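
Here’s a toy sketch of that scenario, assuming an in-memory write side and a read side index that is updated asynchronously on a background thread. None of the names are any particular framework’s API:

import java.util.concurrent.Executors
import scala.collection.concurrent.TrieMap

object BlogSearchDemo extends App {
  val posts = TrieMap.empty[String, String]      // write side: post id -> body
  val index = TrieMap.empty[String, Set[String]] // read side: word -> post ids
  val readSide = Executors.newSingleThreadExecutor()

  def publish(id: String, body: String): Unit = {
    posts.put(id, body) // the write side sees the post immediately
    readSide.execute { () =>
      Thread.sleep(100) // simulate propagation delay to the read side
      for (word <- body.split("\\s+"))
        index.updateWith(word)(existing => Some(existing.getOrElse(Set.empty) + id))
    }
  }

  def search(word: String): Set[String] = index.getOrElse(word, Set.empty)

  publish("post-1", "eventual consistency is not safe")
  println(search("eventual")) // likely Set(): the index hasn't converged yet
  Thread.sleep(200)
  println(search("eventual")) // Set(post-1): the read side has converged
  readSide.shutdown()
}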

So, now that we’ve seen an example of the unsafety of eventual consistency, why is this not bad? The answer is simple: there is no requirement for the text search to be updated immediately after the blog post is published. There is a requirement that it be eventually updated, in a somewhat timely manner, and CQRS allows us to meet that requirement. But no one cares if, for a few seconds or even minutes after publishing a blog post, the text search on their blog doesn’t return the new post. So the unsafety of eventual consistency is not bad, because we’re not relying on it to be safe.

How does this play out in a broader context? It’s easy to see how safety is not a requirement for a search index, but that’s a very specific scenario - is this one of the few places where eventual consistency can be used? The answer is no: if you’re writing a sufficiently large system, with even modest scaling and availability requirements by today’s standards, you will probably find yourself having to use eventual consistency in order to scale your system and maintain availability. The upshot is that you must embrace the unsafeness property, and not put strong consistency requirements on the parts of your system that are eventually consistent. You must design your system to handle the unsafeness.

There are many ways to do this. For example, by using asynchronous message passing, messages can be queued to be handled when the system is consistent and ready to handle them, in contrast to synchronous RPC calls, which require the system to be consistent before the calls can be made. Another technique is to just store events, or facts - things that happened - and not calculate the state until later, when the system has converged, for example when the state is queried. And yet another is to assume consistency, but have a process in place to detect when that assumption was not true, and trigger some recovery operation.
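
As a small illustration of the second of these techniques - storing facts and deriving the state later - here’s a minimal sketch. The event types are hypothetical, not any particular library’s API:

sealed trait AccountEvent
case class Deposited(amount: BigDecimal) extends AccountEvent
case class Withdrawn(amount: BigDecimal) extends AccountEvent

// Only the facts are stored as they happen...
val events: List[AccountEvent] = List(Deposited(100), Withdrawn(30), Deposited(5))

// ...and the state is derived when it is queried, once the events have converged.
def balance(events: List[AccountEvent]): BigDecimal =
  events.foldLeft(BigDecimal(0)) {
    case (acc, Deposited(amount)) => acc + amount
    case (acc, Withdrawn(amount)) => acc - amount
  }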

In all these cases, the unsafeness of eventual consistency isn’t a bad thing, it’s just a property that has been carefully considered and addressed in the design of the system. It is not safe to rely on the consistency of the system, and that’s ok.

The Noop Monad - doing nothing safely

If you’re a fan of functional programming, as I am, you’ll know that one of the great things about it is how useful it is. But that isn’t the only great thing about functional programming, functional programming is also great for when you want to do nothing at all. Some might even say that doing nothing at all is where functional programming really shines.

So today I’m going to introduce a monad that surprisingly isn’t talked about a lot - the noop monad. The noop monad does nothing at all, but unlike noops in other programming paradigms, the noop monad does nothing safely.

§A demo

For this demonstration, I’m going to use Scala, with Scalaz to implement the monad. Let’s start off with the Noop type:

/**
 * A noop of type T
 */
sealed trait Noop[T] {

  /**
   * Run the noop
   */
  def run: Unit
}

object Noop {
  /** Create a noop of type T */
  def apply[T]: Noop[T] = new Noop[T] {
    def run: Unit = ()
  }
}

As you can see, the Noop type has a type parameter, so we can do nothing of various types. We can also see the run function, which returns Unit. Now, typically in functional programming, returning Unit is considered a bad thing, because Unit carries no information, so any pure function that returns Unit must have done nothing. But since Noop actually does do nothing, this is the one exception to that rule. So the run function can be evaluated to do the nothing of the type that this particular Noop does.

Now, let’s say I have a method that calculates all the primes up to a given number. Here’s its signature:

def calculatePrimes(upTo: Int): List[Int]
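
(For completeness, here’s one possible implementation - a naive trial division sketch, and purely hypothetical, since the whole point is that we’re never going to evaluate it.)

def calculatePrimes(upTo: Int): List[Int] =
  (2 to upTo).filter { n =>
    // n is prime if no number from 2 up to its square root divides it
    (2 to math.sqrt(n.toDouble).toInt).forall(n % _ != 0)
  }.toList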

And let’s say I want to get a list of all the Int primes, I can use the above method like so:

calculatePrimes(Int.MaxValue)

But wait, you say! That code is going to be very expensive to run, it’s likely to take a very, very long time, and you have better things to do. So, you want to ensure that the code doesn’t run. This is where the noop monad comes on the scene: using the point method, you can ensure that it safely doesn’t run:

val noopAllIntPrimes = calculatePrimes(Int.MaxValue).point[Noop]

And then, when you actually don’t want to run it, you can do that by evaluating the run function:

noopAllIntPrimes.run

For those unfamiliar with scalaz and functional programming, a monad is an applicative, and an applicative is something that lets you create an instance of the applicative from a value. The method on Applicative for doing this is called point, in other languages it’s also called pure.
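
For completeness, here’s a minimal sketch - not spelled out in this form in the original - of the Scalaz Monad instance that makes the .point[Noop] syntax and the for comprehension below compile. Its point and bind are exactly the implementations we’ll examine in the advantages section:

import scalaz._
import Scalaz._

// The Monad instance for Noop: point creates a noop, bind composes noops
implicit val noopMonad: Monad[Noop] = new Monad[Noop] {
  def point[A](a: => A): Noop[A] = Noop[A]
  def bind[A, B](fa: Noop[A])(f: A => Noop[B]): Noop[B] = Noop[B]
}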

So, we can see that Noop is an applicative, but can we flatMap it? What if you don’t want to sum all those prime numbers, and then you certainly don’t want to convert that result to a String? The noop monad lets you do that:

val summedPrimesString = for {
  primes <- noopAllIntPrimes
  summed <- primes.reduce(_ + _).point[Noop]
  asString <- summed.toString.point[Noop]
} yield asString

And so then to ensure that we don’t actually do all this expensive computation, we can run it as before:

summedPrimesString.run

§Advantages

We can see how the noop monad can be used to do nothing, but what are the advantages of using the noop monad compared to some other methods of doing nothing? I’m going to highlight three advantages that I think really demonstrate the value of doing nothing in a monadic way.

§Runtime optimisation

This is often an advantage of functional programming in general, but the noop monad is the exemplar of optimisation in functional programming. Let’s have a look at the implementation of the noop monad’s point method:

def point[A](a: => A): Noop[A] = Noop[A]

Here we can see that not only is the passed in value not evaluated, it’s not even referenced in the returned Noop. But how can the noop monad do this? Since the noop monad knows that you don’t want to do anything at all, it can infer that it will never need to evaluate the value, and therefore it doesn’t need to hold a reference to the passed in value. But this advanced optimisation doesn’t stop there; let’s have a look at the implementation of bind:

def bind[A, B](fa: Noop[A])(f: A => Noop[B]): Noop[B] = Noop[B]

Here we can see a double optimisation. First, the passed in Noop is not referenced. The noop monad can do this because it infers that since you don’t want to do anything, you don’t need the nothing that you passed in. Secondly, the passed in bind function is never evaluated. As with the other parameter, the noop monad can infer that since the passed in Noop does nothing, there will be nothing to pass to the passed in function, and therefore, the function will never be evaluated.

As you can see, particularly for performance minded developers, the noop monad is incredibly powerful in its ability to optimise your code at runtime to do as little of nothing as possible.

§Code optimisation

But performance isn’t the only place that the noop monad can help with optimisation; it can also help optimise your code to ensure it is as simple and concise as possible.

Let’s take our previous example of summing primes:

(for {
  primes <- calculatePrimes(Int.MaxValue).point[Noop]
  summed <- primes.reduce(_ + _).point[Noop]
  asString <- summed.toString.point[Noop]
} yield asString).run

Now, this isn’t bad looking code, but it does feel a little too complex when all we wanted to do in the first place was nothing. So how can we simplify it? Well, firstly, you’ll notice that we don’t want to convert the summed result to a string - you can tell this by the .point[Noop] after it. Based on the rules of the noop monad, we can optimise our code to remove this:

(for {
  primes <- calculatePrimes(Int.MaxValue).point[Noop]
  summed <- primes.reduce(_ + _).point[Noop]
} yield summed).run

Is this safe to do? In fact it is, because we have actually replaced our intention of doing nothing, with nothing. We can do the same for summing all the primes:

(for {
  primes <- calculatePrimes(Int.MaxValue).point[Noop]
} yield primes).run

Now for the final step in code optimisation, and this is the hardest to follow, so bear with me: we can actually remove the not calculating of the primes itself, and simultaneously remove the run function on that Noop. But how is this so? You may remember that I explained earlier that if a pure function returns Unit, then it must do nothing. Our Noop.run is a pure function, and it does nothing. So since evaluating run does nothing, we can safely replace it with nothing. Finding it hard to follow? This is what it looks like in code:



As you can see, we’ve gone from five reasonably complex lines of code, to absolutely no code at all! This is the embodiment of what Dijkstra meant when he said:

If we wish to count lines of code, we should not regard them as “lines produced” but as “lines spent”.

The noop monad has allowed us to spend zero lines of code in doing nothing.

§Teaching monads

Teaching monads has proven to be the unicorn of evangelising functional programming, no matter how hard anyone tries, no one seems to be able to teach them to a newcomer. The noop monad solves this by grounding monads in a context that all students can relate to - doing nothing.

In particular, the noop monad does a great job of picking up the pieces of a failed attempt to teach a student monads. For example, consider the following situations:

  • A student has been told that monads are just monoids in the category of endofunctors. What does that even mean? But if I say the noop monoid in the category of endofunctors is just something that does nothing, simple!
  • A student has been told that monads are burritos. What does that even mean? But if I say the noop burrito is just something that does nothing, simple!

§Conclusion

So today I’ve introduced you to the noop monad. As you can see, it’s in the noop monad that functional programming is made complete, fulfilling everything that every functional programmer has ever wanted to do, that is, nothing at all.