all that jazz

james' blog about scala and all that jazz

Semantic discoverability security

A well-known security issue in HTTP is that if you return 404 Not Found for a resource that doesn’t exist, but 401 Unauthorized or 403 Forbidden for a resource that you’re not allowed to access, then you might be giving away information that an attacker could use: namely, that the resource exists. While in some cases that’s no big deal, in other cases it’s a security problem.

The usual solution to this is that when you’re not permitted to access something, the server should respond as if it doesn’t exist, by returning 404 Not Found. Thus, you receive the same response whether the resource exists or not, and are not able to determine whether it exists. I’d like to argue that this is the wrong approach: it is not semantic.

Before we go on, I want to clarify: while returning 401 Unauthorized may be a semantic way to tell a client that it needs to authenticate before it can continue, on the web it’s not a practical way to do so, since the HTTP spec requires that it be paired with some user agent aware authentication method, such as HTTP Basic authentication. But typically, sites get people to log in using forms and cookies, and so the practical response to tell a web user that they need to authenticate before they can continue is to send a redirect to the login page. For the remainder of this blog post, when I refer to the 401 status code, I mean informing the user that they need to authenticate, which could be done by actually sending a redirect.

So, let’s think about what we’re trying to achieve here when we say we want to protect discoverability of resources. What we’re saying is that we want to prevent users who don’t have discoverability permission from finding out whether something does or doesn’t exist. The 404 Not Found status code says that a resource doesn’t exist. So if I’m not allowed to know whether a resource exists, but the server sends me a 404 Not Found, that’s a contradiction, isn’t it? It’s certainly not semantic. Of course, it’s not a security issue, because the server will also send me a 404 Not Found if the resource does exist, but that’s not semantic either; the server is in fact lying to me then.

Sending a 404 Not Found in every case is not the only solution. There’s another solution where the server can say what it really means, i.e. be semantic, while still protecting discoverability. If a resource doesn’t exist, but I’m not allowed to find out whether a resource does or doesn’t exist, then the semantic response is not to tell me it doesn’t exist, it’s to tell me that I’m not allowed to find out whether it exists or not. This would mean, if I’m not authenticated, sending a 401 Unauthorized, or if I am authenticated but am still not allowed to find out, sending a 403 Forbidden. The server has told me the truth: you’re not allowed to know anything about this resource that may or may not exist. And if it does exist, and I’m not allowed to access it, the server will do the same, sending a 401 or 403 response. In either case, whether the resource exists or not, the response code will be the same, and so can’t be exploited to discover resources.
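To make that decision logic concrete, here’s a minimal sketch in Scala. The User, Resource and canDiscover types are hypothetical placeholders rather than any particular framework’s API, and in practice the 401 case would be the login redirect discussed above.

case class User(name: String)
case class Resource(content: String)

// Hypothetical permission check: is this user allowed to know whether the resource exists?
def canDiscover(user: User): Boolean = user.name == "admin"

def statusFor(user: Option[User], resource: Option[Resource]): Int = user match {
  case None                       => 401 // not authenticated: in practice, redirect to the login page
  case Some(u) if !canDiscover(u) => 403 // authenticated, but not allowed to find out if this exists
  case Some(_)                    => resource.fold(404)(_ => 200) // only now is 404 semantic
}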

In this way, we have implemented discoverability protection, but we have done so semantically. We haven’t lied to the user, we haven’t told them that something doesn’t exist that actually does, we have simply told them that they’re not allowed to find out whether it does or doesn’t exist. And we haven’t violated our own contract by telling a user that something is not found when they’re not allowed to know whether it’s found or not.

In practice, using this approach also provides a much better user experience. If I am not logged in, but I click a link somewhere to the resource, it is much better for me to be redirected to the login screen, so that I can log in and be redirected back to the resource, than to simply be told that the resource doesn’t exist. If the resource doesn’t actually exist, then after logging in I can be redirected back to it and get a 404 Not Found; this is semantic and makes sense, since it’s not until I log in that I can actually get a Not Found out of the server - that’s what discoverability protection means. This is exactly my argument in a Jenkins issue, where Jenkins is currently returning a 404 Not Found screen (with no option to click log in) for builds that I don’t have permission to access until I log in.

Open Source - it's not about free as in beer or speech

I’ve been thinking a bit about open source software, and why I’m so attracted to it. Often when people talk about the thing that makes open source so great, they use the phrase “it’s free as in speech”. Of course it’s also free as in beer, that is, it’s free as in you don’t have to pay for it, and that certainly is an advantage of open source software, but the point that they’re making, in contrast to free as in beer, is that the real value of it is that you are free to do with the software as you choose.

But when I think about what attracts me to open source software, like free as in beer, free as in speech is only an advantage, a nice to have, it’s not the primary attraction.

So firstly, why isn’t free as in beer the primary attraction to me? The answer is it’s about risk. Dijkstra famously stated that we should count lines of code as “lines spent” rather than “lines produced”. Every line of code in your system is a line of code that you must understand and maintain for the life of that system. The more lines of code, the more complexity, the harder the system is to understand, the more expensive it is to maintain.

This doesn’t just apply to the code that you write - it applies to the third party libraries that you’re using. Every library that you use is another layer of code, another part of the system that you have to maintain, and so counts towards the number of lines spent. It may be a commercial library that you’re paying someone else to maintain - but it’s still your responsibility to ensure that it is maintained since it’s your system that the library is running in. When you pay someone to maintain a library for you, you are only delegating that responsibility, not transferring it.

So, let’s say there was a library that I wanted to use, and it was free, as in beer, but there was no source code available for it. I could take the binaries, call them my own, use them in my own system, they are free, but I could never get the source code; I would have to rely on the maintainer. Would I use such a library? Never. The risk is simply too high: since that library is part of my system, it’s my responsibility to maintain it. But I can’t maintain it myself, since I don’t have the source code. And because I’m not paying the maintainer for it, I can’t delegate the responsibility to them, because they have no obligation to me to fulfil that responsibility. Hence, I cannot use the library, since I cannot be held responsible for it. So, while free as in beer is great, it is not the primary advantage of open source, there are more important things than free as in beer.

So what about free as in speech? An open source library allows me the freedom to maintain and modify it myself, as a last resort I can always fork it. This is an advantage since it means I can always assume the maintenance of that library myself, and that means I can take responsibility for it. But is that the primary advantage? Let’s consider you had a choice of two libraries. One of them was open source, let’s say ASF licensed, but it had no active community around it, the maintainer didn’t accept patches, and only worked on things that he or she was interested in. The other library was commercial, you had to pay for it, and you were restricted in how you could use it, however once you paid for it, you were given full access to source code, and access to an active community of developers, not only within the company that produced the library but also developers from other companies that were using the library and building on it. Which would you go with?

Now I don’t expect everyone to answer the same here, there’s no one right answer, but my answer is the commercial one. Why? Because I think the biggest value in open source is the community. And so an open source project with no community and no opportunity for community involvement is missing the biggest value of open source, which means a non open source project that has community and active community involvement will trump it.

Why is a community so valuable? As I said before, when you decide to use a library in your system, you assume responsibility for maintaining that library, regardless of whether it’s open source or not. When that library comes with an active community with the like minded goal of improving that library, then when you decide to take responsibility for it, you are immediately joined by a community of people who will help you to do that. Ask any open source maintainer, a community member who sees themselves as part of the solution to any problems they have with an open source library is a peer. They are someone who the maintainer wants to work with, and so when you take that responsibility, you get a whole community of people who are wanting to work with you to solve your problems. This is the great thing about open source, and in my opinion, it’s far greater than the freedom to use the library in any way you want.

Now perhaps you might be thinking that the example of a commercial library with an active developer community is contrived - that would never happen, it’s only because of the freedom that open source affords that such communities exist. Well, that’s simply not true, I have worked in such a community before. Atlassian is a company that sells commercial software with commercial licenses, you are not free to do with Atlassian’s software as you please, you may only use it as the license stipulates. But when you do purchase a license, you get full access to their source code, and there is a vibrant plugin development community that you can join, one that is fostered by Atlassian, where knowledge is shared by both outside and Atlassian community members, and people work together to make the entire platform better. This meets all of my requirements of a developer community, and addresses the risk of including an Atlassian product as part of my system.

So if an Atlassian product was the right fit to be part of a system that I was maintaining, I would not hesitate to choose it. The fact that it isn’t free as in speech or beer wouldn’t faze me in the slightest, I would choose it for the same primary reason that I choose most open source software - that is, that I get a whole community of like minded developers working to help me take the responsibility of maintaining my system. And that’s how I know that the primary value of open source to me isn’t that it’s free as in speech, it’s the community that comes with it.

Eventual Consistency is not safe and that's ok

A number of years ago, Peter Bailis wrote an excellent post titled Safety and Liveness: Eventual Consistency Is Not Safe. This post is short and to the point, and is a very salient reality check for anyone who is using eventual consistency. With the prevalence of eventually consistent architectures and systems today, I recommend that all developers read it; in spite of the post’s age, it is just as important today as it was when it was written.

A member of our marketing department recently came across the post, and sent an email to our engineers because he was somewhat confused about it. At Lightbend, we are advocates of eventual consistency, and much of our tech embraces it. However he was alarmed by the statement that eventual consistency is not safe. This statement led him to the conclusion that the post was saying that eventual consistency must be bad.

I don’t think this is an unreasonable conclusion for someone not familiar with eventual consistency to make, whether that’s a marketing person or a developer who is completely new to eventual consistency. And so I think the statement needs some further explanation.

So to start with, an analogy. The sun is unsafe. If you were to somehow “stand on” the sun, before you could begin to comprehend your environment, you would cease to exist in any recognisable form. This property of the sun, that standing on it and existing are incompatible, is not a bad thing, in fact it’s not a good thing either, it’s neither bad nor good, it’s just a property. It doesn’t become a good or bad thing until you try to rely on the property or its absence. It simply means, don’t stand on the sun.

Likewise, the unsafeness of eventual consistency is not a bad thing, nor is it a good thing, it’s just a property.

Now the bulk of Peter’s post was about how eventual consistency alone is not useful, it must be combined with other guarantees, such as what versions of the data can be returned when it’s not consistent, and what version of the data will eventually be returned. These guarantees help to make an eventually consistent system more safe, but they don’t stop it from being unsafe.

So what does this unsafety look like in practice, and why isn’t it necessarily a bad thing?

Let’s consider a very common eventually consistent architectural practice - Command Query Responsibility Segregation (CQRS). A CQRS implementation is usually eventually consistent, and offers further guarantees, such as that the read side will always return some version of the data that matches some version in the write side’s history, and that the convergent version that will eventually be returned will always be the most recent version from the write side.

So, what is not safe about it? Let’s say you have a blog that supports full text search. The search is implemented using a read side view of the data that puts all the posts into a full text index. Since this is using CQRS, the index is updated asynchronously, and it may take some time for a new post to reach the read side. This means that immediately after publishing a new post, let’s say it’s a post about eventual consistency, if you search for “eventual consistency”, the post won’t come back in the search results. This is in spite of the fact that you can browse to the post, load it, and see the words “eventual consistency” in its text. The system is in an inconsistent state: when we load the post, it has the words “eventual consistency”; when we ask the search index for all the posts containing the words “eventual consistency”, the index doesn’t return that post. This is the unsafety, it is not safe to rely on every part of the system returning the same answer to whether our post contains the words “eventual consistency”.
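To make the window of inconsistency concrete, here’s a minimal sketch of such a read side. It’s not taken from any particular framework (and ignores thread safety); the event and index types are made up for illustration.

import scala.collection.mutable
import scala.concurrent.{ExecutionContext, Future}

case class PostPublished(id: String, title: String, body: String)

class SearchIndex(implicit ec: ExecutionContext) {
  private val index = mutable.Map.empty[String, String]

  // Called asynchronously, some time after the write side has persisted the event
  def handle(event: PostPublished): Future[Unit] = Future {
    index.update(event.id, s"${event.title} ${event.body}")
  }

  // Immediately after publishing, this may not yet find the new post
  def search(text: String): Seq[String] =
    index.collect { case (id, content) if content.contains(text) => id }.toSeq
}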

So, now we’ve seen an example of the unsafety of eventual consistency, why is this not bad? The answer is simple: there is no requirement for the text search to be updated immediately after the blog post is published. There is a requirement that it be eventually updated, and in a somewhat timely manner, and CQRS allows us to meet that requirement. But no one cares if, for a few seconds or even minutes after publishing a blog post, the text search on their blog doesn’t return the new post. So the unsafety of eventual consistency is not bad, because we’re not relying on it to be safe.

How does this play out in a broader context? It’s easy to see how safety is not a requirement for a search index, but that’s a very specific scenario, is this one of the few places where eventual consistency can be used? The answer to that is no: if you’re writing a sufficiently large system, with even modest scaling and availability requirements by today’s standards, you will probably find yourself having to use eventual consistency in order to scale your system and maintain availability. The upshot is that you must embrace the unsafeness property, and not put strong consistency requirements on the parts of your system that are eventually consistent. You must design your system to handle the unsafeness.

There are many ways to do this - for example, by using asynchronous message passing, messages can be queued to be handled when the system is consistent and ready to handle them, in contrast to synchronous RPC calls, which require the system to be consistent before the calls can be made. Another technique is to just store events, or facts (things that happened), and not calculate the state until later, when the system has converged, for example, when the state is queried. And yet another is to assume consistency, but then have a process in place to detect when that assumption was not true, and trigger some recovery operation.
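As an illustration of the second of those techniques, here’s a tiny sketch (with made-up event types) that stores facts as they happen and only calculates state when it’s queried:

sealed trait AccountEvent
case class Deposited(amount: BigDecimal) extends AccountEvent
case class Withdrawn(amount: BigDecimal) extends AccountEvent

// No running balance is maintained as events arrive; it is calculated from the
// stored facts at query time, by which point the events of interest have converged.
def balance(events: Seq[AccountEvent]): BigDecimal =
  events.foldLeft(BigDecimal(0)) {
    case (total, Deposited(amount)) => total + amount
    case (total, Withdrawn(amount)) => total - amount
  }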

In all these cases, the unsafeness of eventual consistency isn’t a bad thing, it’s just a property that has been carefully considered and addressed in the design of the system. It is not safe to rely on the consistency of the system, and that’s ok.

The Noop Monad - doing nothing safely

If you’re a fan of functional programming, as I am, you’ll know that one of the great things about it is how useful it is. But that isn’t the only great thing about functional programming, functional programming is also great for when you want to do nothing at all. Some might even say that doing nothing at all is where functional programming really shines.

So today I’m going to introduce a monad that surprisingly isn’t talked about a lot - the noop monad. The noop monad does nothing at all, but unlike noops in other programming paradigms, the noop monad does nothing safely.

§A demo

For this demonstration, I’m going to use Scala, with Scalaz to implement the monad. Let’s start off with the Noop type:

/**
 * A noop of type T
 */
sealed trait Noop[T] {

  /**
   * Run the noop
   */
  def run: Unit
}

As you can see, the Noop type has a type parameter, so we can do nothing of various types. We can also see the run function, and it returns Unit. Now typically in functional programming, returning Unit is considered a bad thing, because Unit is not a value, so any pure function that returns Unit must have done nothing. But since Noop actually does do nothing, this is the one exception to that rule. So the run function can be evaluated to do the nothing of the type that this particular Noop does.
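If you want to follow along at home, you’ll also need a way to construct a Noop. Here’s one possible (entirely hypothetical) companion object, whose apply gives you a Noop whose run does precisely nothing:

object Noop {
  // The only reasonable implementation: running it does nothing
  def apply[T]: Noop[T] = new Noop[T] {
    def run: Unit = ()
  }
}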

Now, let’s say I have a method that calculates all the primes up to a given number. Here’s its signature:

def calculatePrimes(upTo: Int): List[Int]

And let’s say I want to get a list of all the Int primes, I can use the above method like so:

calculatePrimes(Int.MaxValue)

But wait, you say! That code is going to be very expensive to run, it’s likely to take a very, very long time, and you have better things to do. So, you want to ensure that the code doesn’t run. This is where the noop monad comes on the scene: using the point method, you can ensure that it safely doesn’t run:

val noopAllIntPrimes = calculatePrimes(Int.MaxValue).point[Noop]

And then, when you actually don’t want to run it, you can do that by evaluating the run function:

noopAllIntPrimes.run

For those unfamiliar with scalaz and functional programming, a monad is an applicative, and an applicative is something that lets you create an instance of the applicative from a value. The method on Applicative for doing this is called point, in other languages it’s also called pure.

So, we can see that Noop is an applicative, but can we flatMap it? What if you don’t want to sum all those prime numbers, and then you certainly don’t want to convert that result to a String? The noop monad lets you do that:

val summedPrimesString = for {
  primes <- noopAllIntPrimes
  summed <- primes.reduce(_ + _).point[Noop]
  asString <- summed.toString.point[Noop]
} yield asString

And so then to ensure that we don’t actually do all this expensive computation, we can run it as before:

summedPrimesString.run

§Advantages

We can see how the noop monad can be used to do nothing, but what are the advantages of using the noop monad compared to some other methods of doing nothing? I’m going to highlight three advantages that I think really demonstrate the value of doing nothing in a monadic way.

§Runtime optimisation

This is often an advantage of functional programming in general, but the noop monad is the exemplar of optimisation in functional programming. Let’s have a look at the implementation of the noop monad’s point method:

def point[A](a: => A): Noop[A] = Noop[A]

Here we can see that not only is the passed in value not evaluated, it’s not even referenced in the returned Noop. But how can the noop monad do this? Since the noop monad knows that you don’t want to do anything at all, it is able to infer that therefore it will not need to evaluate the value, and therefore it doesn’t need to hold a reference to the passed in value. But this advanced optimisation doesn’t stop there, let’s have a look at the implementation of bind:

def bind[A, B](fa: Noop[A])(f: A => Noop[B]): Noop[B] = Noop[B]

Here we can see a double optimisation. First, the passed in Noop is not referenced. The noop monad can do this because it infers that since you don’t want to do anything, you don’t need the nothing that you passed in. Secondly, the passed in bind function is never evaluated. As with the other parameter, the noop monad can infer that since the passed in Noop does nothing, there will be nothing to pass to the passed in function, and therefore, the function will never be evaluated.

As you can see, particularly for performance minded developers, the noop monad is incredibly powerful in its ability to optimise your code at runtime to do as little of nothing as possible.

§Code optimisation

But performance isn’t the only place that the noop monad can help with optimisation, the noop monad can also help optimise your code to ensure it is as simple and concise as possible.

Let’s take our previous example of summing primes:

(for {
  primes <- calculatePrimes(Int.MaxValue).point[Noop]
  summed <- primes.reduce(_ + _).point[Noop]
  asString <- summed.toString.point[Noop]
} yield asString).run

Now, this isn’t bad looking code, but it does feel a little too complex when all we wanted to do in the first place was nothing. So how can we simplify it? Well firstly, you’ll notice that we don’t want to convert the summed result to a string, you can tell this by the .point[Noop] after it. Based on the rules of the noop monad, we can optimise our code to remove this:

(for {
  primes <- calculatePrimes(Int.MaxValue).point[Noop]
  summed <- primes.reduce(_ + _).point[Noop]
} yield summed).run

Is this safe to do? In fact it is, because we have actually replaced our intention of doing nothing, with nothing. We can do the same for summing all the primes:

(for {
  primes <- calculatePrimes(Int.MaxValue).point[Noop]
} yield primes).run

Now for the final step in code optimisation, and this is the hardest to follow so bear with me: we can actually remove the not-calculating of the primes itself, and simultaneously remove the run function on that Noop. But how is this so? You may remember that I explained earlier that if a pure function returns Unit, then it must do nothing. Our Noop.run is a pure function, and it does nothing. So since evaluating run does nothing, we can safely replace it with nothing. Finding it hard to follow? This is what it looks like in code:



As you can see, we’ve gone from five reasonably complex lines of code, to absolutely no code at all! This is the embodiment of what Dijkstra meant when he said:

If we wish to count lines of code, we should not regard them as “lines produced” but as “lines spent”.

The noop monad has allowed us to spend zero lines of code in doing nothing.

§Teaching monads

Teaching monads has proven to be the unicorn of evangelising functional programming: no matter how hard anyone tries, no one seems to be able to teach them to a newcomer. The noop monad solves this by grounding monads in a context that all students can relate to - doing nothing.

In particular, the noop monad does a great job for picking up the pieces of a failed attempt to teach a student monads. For example, consider the following situations:

  • A student has been told that monads are just monoids in the category of endofunctors. What does that even mean? But if I say the noop monoid in the category of endofunctors is just something that does nothing, simple!
  • A student has been told that monads are burritos. What does that even mean? But if I say the noop burrito is just something that does nothing, simple!

§Conclusion

So today I’ve introduced you to the noop monad. As you can see, it’s in the noop monad that functional programming is made complete, fulfilling everything that every functional programmer has ever wanted to do, that is, nothing at all.

sbt - A declarative DSL

This is my second post in a series of posts about sbt. The first post, sbt - A task engine, looked at sbt’s task engine, how it is self documenting, making tasks discoverable, and how scopes allow hierarchical fallbacks. In this post we’ll take a look at how the sbt task engine is declared, again taking a top down approach, rooted in practical examples.

§Settings are not settings

In the previous post, we were introduced to settings, being a specialisation of tasks that are only executed once at the start of an sbt session. File that bit of knowledge away, and now reset your definition of setting to nothing. I’m guessing this is a relic of past versions of sbt, but the word setting in sbt can mean two distinct things: one is a task that’s executed at the start of the session, the other is a task (or setting, as in executed at the start of the session) declaration. No clearer?

Let’s introduce another bit of terminology: a task key. A task key is a string and a type; the string is the name of the task, and is how you reference it on the command line. The type is the type that the task produces. So the sources task has a name of "sources", and a type of Seq[File]. It is defined in sbt.Keys:

val sources = TaskKey[Seq[File]]("sources", "All sources, both managed and unmanaged.", BTask)

You can see it also has a description and a rank, those are not really important to us now. The thing that uniquely defines this task key is the sources string. You could define another sources key elsewhere, as long as they have the same name, they will be considered the key for the same task. Of course, if you define two tasks using the same key name, but different key types, that’s not allowed, and sbt will give you an error.

In addition to TaskKeys there are also SettingKeys; these are settings as in only executed once per session. Now, these keys by themselves do nothing. They only do something when you declare some behaviour for them. So, a setting as in a task declaration is a task or setting key, potentially scoped, with some associated behaviour. For the remainder of this post, when I say setting, that’s what I’m referring to.
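For comparison, setting keys are declared in the same way; something roughly like the following (simplified - the real definition in sbt.Keys also includes a rank) defines the name setting:

val name = SettingKey[String]("name", "Project name.")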

§Settings are executed like a program

Defining sbt’s task engine is done by giving sbt a series of settings, each setting declaring a task implementation. sbt then executes those settings in order. Tasks can be declared multiple times by multiple settings, the last one to execute wins.

So where do these settings come from? They come from many places. Most obviously, they come from the build file. But most settings don’t come from there - as you’re aware, sbt defines many, many fine grained tasks, but you don’t have to declare settings for these yourself. Most of the settings come from sbt plugins. One in particular that is enabled by default is the JvmPlugin. This contains all the settings necessary for building a Java or Scala program, including settings that declare the sources task that we saw in the previous post. Plugin settings are executed before the settings in your build file, which means that any settings you declare in your build file will override the settings declared by the plugins.

This ordering of settings is important to note: it means settings have to be read from top to bottom. I have handled a number of support cases and mailing list questions where people haven’t realised this; they have declared a setting, and then after that included a sequence of settings from elsewhere in their build that redeclares the setting. They expected their setting to take precedence, but since their setting came before the setting from the sequence, the setting from the sequence overwrites it.
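Here’s a minimal sketch of that gotcha in a build.sbt (the keys and values are made up for illustration). Both expressions are added to the build in order, so the sequence included second wins:

lazy val commonSettings = Seq(
  scalacOptions := Seq("-deprecation")
)

scalacOptions := Seq("-feature") // declared first, expected to take precedence...

commonSettings                   // ...but this redeclares scalacOptions, so "-deprecation" is what you get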

§sbt build file syntax

We’re about to get into some concrete examples of declaring settings, so before we do that we’d better cover the basics of the sbt build file. sbt builds can either be specified in plain *.scala files, or in sbt’s own *.sbt file format. As of sbt 0.13.7, the sbt format has become powerful enough that there is really not much that you can’t do with it, so we’re only going to look at that.

An sbt file may have any name; convention is that a project’s main build file be called build.sbt, but that is only a convention. The file may contain a series of Scala statements and expressions, and it’s important here to distinguish between statements and expressions. What’s the difference? A statement doesn’t return a value. For example:

val foo = "bar"

This is a statement, it has assigned the val foo to "bar", but this assignment doesn’t return a value. In contrast:

5 + 4

This is an expression, it returns a value of type Int.

Expressions in an sbt file must have a type of either Setting[_] or Seq[Setting[_]]. sbt will evaluate all these expressions, and add them to the settings for your build. Any expression in your sbt file that isn’t of one of those types will generate an error.

Statements can be anything. They can be utility methods, vals, lazy vals, whatever. In most cases, sbt ignores them, but that doesn’t make them useless, you can use them in other expressions or statements, to help you define your build. There is one type of statement though that sbt doesn’t ignore, and that is statements that assign a val to a project; this is how projects are defined:

lazy val sbtFunProject = project in file(".")

The final thing to know about sbt build files is that sbt automatically brings in a number of imports. For example, sbt._ is imported, as is sbt.Keys._, so you have access to the full range of built in task keys that sbt defines without having to import them yourself. sbt also brings in imports declared by plugins, making it straightforward to use those plugins.

§Declaring a setting

The process of declaring a setting is done by taking a task key, optionally applying a scope to it, and then declaring an implementation for that task. Here’s a very basic example:

name := "sbt-fun"

In this case we’re declaring the implementation of the name task to simply be a static value of "sbt-fun". Note that the above is an expression, not a statement. := is not a Scala language feature, it is actually a method that sbt has provided. sbt’s syntax for declaring settings is a pure Scala DSL. If this syntax confuses you, then I strongly recommend that you read a post I wrote a few years ago called Three things I had to learn about Scala before it made sense. This post explains how DSLs are implemented in Scala, and is essential reading before you read on in this post if you don’t understand that already.

What if we want to declare our own implementation of the sources task? Remembering that we want it scoped to compile, we can do this:

sources in Compile := Seq(file("src/main/scala/MySource.scala"))

Again we’re only setting a static value for this task to return, but this time you can see how we’ve scoped the sources task in the compile scope. Note that configurations such as compile and test are available through capitalised vals, in scope in your build.

§Back to first principles

What if we want to declare a dependency on another task? Let’s say we want to declare sources to be, as it’s described, all managed and unmanaged sources. If you’ve used sbt before, you probably know that you can use this syntax:

sources := managedSources.value ++ unmanagedSources.value

This was introduced in sbt 0.13, and it’s actually implemented by a macro that does some magic for you. It’s great, I use that syntax all the time, and so should you. However, as with anything that does magic for you, if you don’t understand what it’s doing for you and how it does it, you can run into trouble.

As I described in the last post, sbt is a task engine, and tasks declare dependencies that are executed before, and provided as input, to them. In the above example, it doesn’t look like this is happening at all, what it looks like is that when the sources task is executed, it executes the managedSources task by calling value, and the unmanagedSources task by calling value, and then concatenates their results together. There is a macro that is transforming this code to something that does declare dependencies, and takes the inputs of those dependencies and passes them to the implementation.

So in order to understand what the macro is doing for us, let’s implement this ourselves manually - let’s declare this setting from first principles.

Firstly, we’re going to use the <<= operator instead, this is how to say that I am declaring this task to be dependent on other tasks. Now, we could do a very straight forward declaration to another task:

sources <<= unmanagedSources

This will say that the sources task has a dependency on unmanagedSources, and will take the output of unmanagedSources as is, and return it as the output of sources. What if we wanted to change that value before returning it? We can do that using the map method:

sources <<= unmanagedSources.map(files => files.filterNot(_.getName.startsWith("_")))

So now we’ve filtered out all the files that start with _ (note that sbt already provides an excludeFilter key that can be used to configure this, this is just an example).

At this point let’s take a step back and think about what the code above has done. For one, nothing has yet been executed, at least not the task implementation. That <<= method returns an object of type Setting. This setting has the following attributes:

  • The key (potentially scoped) that it is the task declaration for, in this case, sources.
  • The keys (potentially scoped) of tasks that it depends on, in this case, unmanagedSources.
  • A function that takes the output of the tasks that it depends on as input parameters, executes the task, and returns the output of the task being declared (that is, the function we passed to the map method, that filters out all files that start with _).

You can see here that we haven’t actually executed anything in the task, we have only declared how the task is implemented. So when sbt goes to execute the sources task, it will find this declaration, execute the dependencies, and then execute the callback. This is why I’ve called this blog post “sbt - A declarative DSL”. All our settings just declare how tasks are implemented, they don’t actually execute anything.
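If it helps to picture it, you can think of that Setting as a value carrying those three attributes. This is purely illustrative - these are not sbt’s real internal types:

case class IllustrativeKey(name: String)

case class IllustrativeSetting[T](
  key: IllustrativeKey,               // the (potentially scoped) key being declared, e.g. sources
  dependencies: Seq[IllustrativeKey], // the (potentially scoped) keys it depends on, e.g. unmanagedSources
  evaluate: Seq[Any] => T             // run later by the task engine, with the dependencies' outputs
)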

So, what if we want to depend on two different tasks? Through the magic of the sbt DSL, we can put them in a tuple, and then map the tuple:

sources <<= (unmanagedSources, managedSources).map { (unmanaged, managed) => unmanaged ++ managed }

And now we actually have our first principles implementation of the sources task. Sort of, we haven’t scoped it to compile, but that’s not hard to do:

sources in Compile <<= (unmanagedSources in Compile, managedSources in Compile).map(_ ++ _)

For brevity I’ve used a shorter syntax for concatenating the two sources sequences.

§sbt uses macros heavily

So now that we’ve seen how to declare tasks from first principles, let’s see how the macros work. We have our declaration from before:

sources := { managedSources.value ++ unmanagedSources.value }

I’ve inserted the curly braces to make it clear what is being passed to the := macro. The := macro will go through the block of code passed to it, and find all the instances of where value is called, and gather all the keys that it is invoked on. It will then generate Scala code (or rather AST) that builds those keys as a tuple, and then invokes map. To the map call, it will pass the original code block, but replacing all the keys that had value on them with parameters that are taken as the input arguments to the function passed to map. Essentially, it builds exactly the same code that we implemented in our first principles implementation.
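In other words, the declaration above gets rewritten into roughly this (not the literal generated AST, but the same shape as our first principles implementation):

sources <<= (managedSources, unmanagedSources).map { (managed, unmanaged) =>
  managed ++ unmanaged
}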

Now, it’s important to understand how these macros work, because when you try to use the value call outside of the context of a macro, you will obviously run into problems. An important thing to realise is that the code generated by the macro never actually invokes value; value is just a placeholder method used to tell the macro to extract these keys out to be dependencies that get mapped. The value method itself is in fact a macro, one that if you invoke it outside of the context of another macro, will result in a compile time error, the exact error message being “value can only be used within a task or setting macro, such as :=, +=, ++=, Def.task, or Def.setting.”. And you can see why: since sbt settings are entirely declarative, you can’t access the value of a task from the key, it doesn’t make sense to do that.

From now on in this post we’ll switch to using the macros, but remember what these macros compile to.

§Redeclaring tasks

So far we’ve seen how to completely overwrite a task. What if you don’t want to overwrite the task, you just want to modify its output? sbt allows you to make a task depend on itself; if you do that, the task will depend on the old implementation of itself, giving the output of that implementation to you as your input. In the previous blog post, I brought up the possibility of only compiling source files with a certain annotation inside them; let’s say we’re only going to compile source files that contain the text "COMPILE ME". Here’s how you might implement that, depending on the existing sources implementation:

sources := {
  sources.value.filter { sourceFile =>
    IO.read(sourceFile).contains("COMPILE ME")
  }
}

sbt also provides a short hand for doing this, the ~= operator, which takes a function that takes the old value and returns the new value:

sources ~= _.filter { sourceFile =>
  IO.read(sourceFile).contains("COMPILE ME")
}

Another convenient shorthand that sbt provides for modifying the old value of a task, and that you have likely come across before, is the += and ++= operators. These take the old value, and append the item or sequence of items produced by your new implementation to it. So, to add a file to the sources:

sources += file("src/other/scala/Other.scala")

Or to add multiple files:

sources ++= Seq(
  file("src/other/scala/Other.scala"),
  file("src/other/scala/OtherOther.scala")
)

These of course can depend on other tasks through the value macro, just like when you use :=:

sources ++= (sourceDirectory.value / "other" / "scala").***.get

The *** method selects every file from a directory recursively, and get evaluates that selection to a sequence of files.

§Scope me up

We’ve talked a little bit about scopes, but most of our examples so far have excluded them for brevity. So let’s take a look at how to scope settings and their dependencies.

To apply a scope to a setting, you can use the in method:

sources in Compile += file("src/other/scala/Other.scala")

Applying multiple scopes can be done by using multiple in calls, for example:

excludeFilter in sbtFunProject in unmanagedSources in Compile := "_*"

Or, they can also be done by passing multiple scopes to the in method, in the order project, configuration then task:

excludeFilter in (sbtFunProject, Compile, unmanagedSources) := "_*"

The same syntax can be used when depending on settings, though make sure you put parentheses around the whole scoped setting in order to invoke the value method on it:

(sources in Compile) := 
  (managedSources in Compile).value ++ 
  (unmanagedSources in Compile).value

§Conclusion

In the first post in this series, we were introduced to the concepts behind sbt and its task engine, and how to explore and discover the task graphs that sbt provides. In this post we saw the practical side of how task dependencies and implementations are declared, using both the map method to map dependency keys, as well as macros. We saw how to modify existing task declarations, as well as how to use scopes.

One thing I’ve avoided here is showing cookbooks of how to do specific tasks, for example, how to add a source generator. The sbt documentation is really not bad for this, especially for cookbook type documentation, but I also hope that after reading these posts, you aren’t as dependent on copying and pasting sbt configuration, but rather can use the tools built in to sbt to discover the structure of your build, and modify it accordingly.

About

Hi! My name is James Roper, and I am a software developer with a particular interest in open source development and trying new things. I program in Scala, Java, Go, PHP, Python and Javascript, and I work for Lightbend as the architect of Kalix. I also have a full life outside the world of IT, enjoy playing a variety of musical instruments and sports, and currently I live in Canberra.