all that jazz

james' blog about scala and all that jazz

My complaint against Haskell

I have nothing against Haskell. I've been getting my teeth stuck into some Scala over the recent months, and I am loving functional programming, immutable types, and am even starting to see what's so good about monads. So Haskell, with its reputation of being a good pure functional language, is interesting to me. I did Haskell when I was in university. I didn't learn it, if I had learnt it, I would have understood it. I just did it. We learnt really useful things like doubling every integer in a list with a simple map function. Like curried and uncurried functions, like zipping, folding, reducing... all within the context of trivial contrived examples that did absolutely nothing to help us in the real world. In my 15 years of programming, I have only once ever needed to double every integer in a list, that was in my Haskell lab at uni. Very useful.

My problem is not with Haskell. My problem is with the people that teach and promote Haskell. They have never presented me with an example of how Haskell can help me in the real world. And today I'd like to highlight this, by looking at the first piece of code you'll encounter if you read the documentation on Haskell on the Haskell website. If you visit that website, and you want to know what Haskell is, you'll very quickly notice the link "What is Haskell". Clicking on it, you'll be taken to a page, and as you scroll down, you'll come across your first piece of Haskell code. It's quicksort:

quicksort :: Ord a => [a] -> [a]
quicksort []     = []
quicksort (p:xs) = (quicksort lesser) ++ [p] ++ (quicksort greater)
    where
        lesser  = filter (< p) xs
        greater = filter (>= p) xs

If you've ever implemented quicksort in a procedural language before, you'll notice that that code is very simple in comparison. It has some syntax you might not recognise, but it's not a soup of crazy characters like Perl, it looks ordered, and easy to read. It's impressive. But is it practical?

What is quicksort? The name implies that it is quick. So what's its running time? O(n2). Wait, hang on, that's not quick, there are faster algorithms out there, like heap sort, it runs in O(n log(n)) time. Well, it turns out, the average running time of quicksort is n log(n), but why would you use it over another algorithm that guarantees n log(n) running time? The reason is, the actual implementation of quicksort is really simple from a computation perspective, it does very little per element of the array, making its actual running time (on average) about 3 times faster than heap sort or merge sort. This makes quicksort very useful, if you want an algorithm that sorts an array as fast as possible, at the cost of in some circumstances not being very fast, then quicksort is perfect for you. This is why quicksort is called quicksort, it is quick. It may not mathematically have the best running time, but practically, it is quick.

This brings me to quicksort in Haskell. One of the reasons why quicksort, implemented in a procedural language, is so quick is that it does very close to the minimum necessary swap operations needed to sort the array. It sorts in place, it doesn't require building supplementary data structures, like two separate arrays in merge sort or the maximum heap in heap sort. It plays very nicely with hardware optimisations like CPU caches, maximising cache hits regardless of the size of the array and cache. However, the implementation above in Haskell does use additional structures. It partitions the list by creating two new lists on each recursion of the algorithm, requiring many more memory writes than the in place implementation. It uses linked lists, which play very poorly with CPU caches. It still runs in n log(n) on average, but it's now doing a lot more operations with very little help from hardware. This added overhead means that quicksort is no longer 3 times faster than merge or heap sort. Which means it now, on average, has similar run times to the O(n log(n)) algorithms, and still with the possibility of running in n2 for some inputs. It no longer is quick, and hence is not practical to use in the real world, you may as well use merge sort or heap sort.

To make matters worse, this implementation will always perform poorly on lists that are already k sorted, which is not an uncommon real world input to sort algorithms. A procedural implementation may also perform poorly on k sorted lists, but it's a simple change to make it perform well, just choose a random pivot, or a pivot from somewhere in the middle of the list. In Haskell, you can't do that without adding yet another overhead, because your lists are linked lists, and so accessing a random pivot takes n time.

Now, none of this means that Haskell is a bad language. Merge sort in Haskell is also very simple, and is also a very practical sorting algorithm in many cases, and so could have been used as an example instead. But why did the authors of this documentation choose a piece of code that is completely useless in the real world as the first taste of Haskell that many people will ever see? This is my complaint about Haskell, it's never been promoted to me in a way that bears any practical usefulness. I'm sure it's a great language. Pity about the people that promote it.

Hashbangs on the new Google Groups

Hash-bang URLs have been a contentious issue at my work over the past week, so much so that we completely got rid of them in the project that I'm working on. Coming out of the experience, I'm not convinced that hash-bangs are always a bad idea, however, there are some types of sites where they are very bad. Google groups I believe falls firmly into this category. And it's too late to do anything about it.

Google Groups Logo

The major issue with hash-bangs is not that it's a hack - many people would have called dynamically generating documents on each request a hack when websites first started doing that almost 20 years ago. It's not that it's ugly. It's not that some users don't have Javascript enabled (HTML5 mandates various Javascript APIs, if it doesn't do Javascript, it doesn't do HTML, Javascript is a core building block of the web - get over it). The big issue is maintainability of URLs.

For some sites, this is irrelevant. Take a site that is your own personal dashboard to managing your own private things - a webmail site for example. Nothing is going to link to it, so there is no need to maintain URLs. For other sites, it's an issue, but it's not so important. I know the folks at Twitter are considering it a big issue at the moment, but really, if a link breaks to someones 140 character word of wisdom... does it really matter?

But then there is a category of sites for which it is really important. These are the sites that maintain permanent records for future use. Mailing lists are one such example. I have an open source project, called the mongo jackson mapper. It has a small user base, and I provide support to people when they asked questions. About a month ago, I noticed that I was getting a number of emails from people asking for help. I was more than happy to answer these, however, I was concerned that if one person has a question, then other people are likely to have the same question, and I would be doing a lot of repeat work in answering people questions. So, I setup a Google group for questions.

Now, when someone asks the same question twice, I can simply link to the old answer, no need to rewrite the answer. Furthermore, people on other sites, such as stack overflow, can link to my answers. The URL to that answer will stand as a link in the chain to a record in time that will help many people help themselves in future. However, the usefulness is dependent on that link still working in future. Let's have a look at an example link:

https://groups.google.com/forum/?fromgroups#!topic/mongo-jackson-mapper/AeFF5Cqqolw

As you can see, the ID of the message only appears after the hash-bang. This means, it doesn't get sent to the server when you request it, the only way Google can know what message is being request is to serve up some Javascript that parses it out of the URL. And that's exactly what they do. But what happens, in future, when they decide to rewrite Google groups again? And again? And again... If the resource, https://groups.google.com/forum/?fromgroups, doesn't continue to serve up that Javascript, then all links to all posts that people have copied from their browser address bar and pasted elsewhere will break. The permanent record of networked help pages that people use to solve their every day problems will be broken.

In contrast, if they didn't use hash-bang URLs, then they could make whatever changes to their URL structure that they liked, and all they have to do to maintain the URLs is implement a simple redirect on the server side. No need to serve Javascript up at a particular resource.

Google groups does have a permalink feature, but really, who uses that? It feels unnatural for me even though I know and understand the difference between permalinks and what is in my address bar. When I want to copy a URL, without thinking, I copy it straight from the address bar, because this is what I always do for every other site.

So, Google groups, really? Hash-bangs? Are you not thinking of the future? Are you seriously committing to maintaining Javascript at arbitrary URLs for the lifetime of the internet? Unfortunately, the horse has bolted, these hash-bang links to Google groups are already starting to find themselves around the web, so you have no choice. And I'm guessing that someone in future will forget this, and and so you'll just break all those links.

ScalaFlect: Statically typed query DSLs

I've been hard at work over the past few weeks redesigning my Mongo object mapper, the Mongo Jackson Mapper, to support Jackson 2.0, be simpler to use, and be far more powerful with less surprises to users. Something that has been at the front of my mind is how to make it nicer to use in Scala, and the biggest thing that I'd like to do here is implement the query and update APIs using a DSL that makes the queries look like ordinary code. But there's a problem. If you want your classes to be ordinary case classes, then when querying, you're going to have to refer to the properties using strings. Strings that are very easy to make typos in, strings that don't get updated automatically when you refactor, strings that don't throw compile errors if they are wrong, and in the case of MongoDB queries, strings that don't even throw runtime errors if they are wrong. Scala has no support for referring to a field, rather than its value, so unfortunately, strings are the only option. That is, until now.

Today I started a new open source project called ScalaFlect. It provides exactly what every query DSL implementor has been wanting, statically typed reflection. It allows you to reference a field or a method, and the result being that you get the java.lang.reflect.Field/Method associated with it. Before I go into the details of how I did this, let's first look at an example DSL that I've written to demonstrate the capabilities so far.

Example DSL usage

So to start off, I have some case classes, a BlogPost, and a Comment, with a one to many relationship between blog posts and comments:

case class BlogPost(author: String, title: String, published: Boolean,
  tags: Set[String], comments: List[Comment])
case class Comment(author: String, text: String)

Now I have implemented my DSL by implementing an abstract class that provides the basic elements of my DSL. This class is called ObjectListDao, and as the name suggests, it stores objects in a list. But it could just as easily be storing objects in a database like MongoDB. Let's look at a DAO that uses this DSL:

class BlogPostDao extends ObjectListDao(classOf[BlogPost]) {
  import au.id.jazzy.scalaflect.ScalaFlect.toTraverser

  def findByAuthor(author: String) = {
    query(
      $(_.author) %= author
    )
  }

  def findPublishedByTag(tag: String) = {
    query(
      $(_.tags.$) %= tag,
      $(_.published) %= true
    )
  }

  def findPostsCommentedOnByAuthor(author: String) = {
    query(
      $(_.comments.$.author) %= author
    )
  }
}

So the first thing you notice is the static access to properties. The $ method accepts a function that accepts a parameter of type BlogPost, and the function should return the property that you want to reference. You can also see that using the $ operator on a collection, you can refer to properties in that collections type.

Example DSL definition

So how have I implemented my DSL? Quite simply. Here is the code:

class ObjectListDao[T](val clazz: Class[T]) {
  private val scalaFlect = new ScalaFlect(clazz)
  private val list = new ListBuffer[T]

  def $[R](f : T => R): QueryField[R] = new QueryField(scalaFlect.reflect(f))
  
  def query(exps: QueryExpression*) = {
    var results = list.toList
    for (exp <- exps) {
      results = results.filter(exp.matches(_))
    }
    results.toList
  }

  def save(obj: T) {
    list += obj
  }
}

class QueryField[R](val field: Member[R]) {
  def %=(value: R): QueryExpression = new QueryExpression {
    def matches(obj: Any) = !(field.get(obj) filter {v: Any => v == value} isEmpty)
  }
}

trait QueryExpression {
  def matches(obj: Any): Boolean
}

Here we have an instance of ScalaFlect for our class (in the example usage, this was BlogPost). The $ method simply invokes reflect() on this, which is what does the reflection. reflect() returns a Member. A Member is a reference to a series of methods and fields that were invoked by the reflection function. Using it, you can get access to the Java reflection API classes that represent these methods and fields.

The other thing to note is the implementation of matches. This is where the Member class is actually used. If this was a DSL for something like MongoDB, then at this point the Memmber and its parents would be inspected to build a query object, which would then be sent to the database. In our case though we're just operating on a list of objects that we already have in memory, and Member provides a convenient get function that gets all the values of that property for a given object. Note that I said values, not value, ScalaFlect allows stepping into collections using the $ operator, so if the reference goes through a collection, then multiple values might be returned.

Implementation

At this point, especially if you've thought about how to do this before, you're probably wondering how on earth I've implemented this. The passed in reflection function is never executed. An important thing to remember in Scala is that functions are actually classes themselves. ScalaFlect simply takes the function class, decompiles it, finds the apply method, and then does some fairly simple analysis on the code to find out what properties were referenced. By fairly simple, I mean I wrote it in 2 hours. The result is cached so each function is only ever analysed once.

Ok, so this sounds like a hack, and it is. And it certainly has its limitations. The behaviour if you put complex code into your reflection function is completely undefined (it will most likely throw an exception, in future I'll add some decent error handling). However, it's a hack that will work well - there is nothing magic about how a Scala function that invokes some methods/accesses some fields on the passed in argument and returns the result is compiled down to bytecode. I don't expect that anything will be added to Scala in future that might break this.

One limitation that I do anticipate though is working with hot swapped code. I imagine, if you update a query, and hotswap it into a running JVM, there will be no way for ScalaFlect to either detect that, or even analyse it if it could. This is because the way ScalaFlect decompiles the bytecode is by loading the class from the reflection functions classloader. I actually have no idea how hot swapping works, but I suspect that it doesn't update the bytecode that the classloader sees. The same goes for any runtime generated and enhanced code, but maybe that issue is not as likely to ever cause a problem.

The future

I read somewhere that Scala 2.10 is getting improved reflection, but I don't know what that means. The need for this library can be completely removed if Scala added language level features for referring to a field, and not it's value, in code. My ideal outcome from implementing ScalaFlect is that the value of such a feature will be realised, and hence added to Scala. In the meantime, ScalaFlect will work nicely at giving DSL writers statically typed reflection.

So check out the library on GitHub, and feel free to comment on my approach. I haven't cut a release yet, but I'll get to that and deploy a beta to Maven central in the coming days (depending on how fast Sonatype will process my request to deploy to my domains groupId).

Update

It's just been pointed out to me that Scala 2.10 will (might?) have macro support, which would achieve this (and potentially much more) in a much cleaner fashion. This is great news for writing query DSLs. I don't know though what the state of the macro support is, it's marked as experimental and doesn't seem to be really mentioned in any Scala 2.10 feature lists. So maybe ScalaFlect might have some time to serve some useful purpose yet.

Three things I had to learn about Scala before it made sense

Last year I sat down in a meeting room with a few other colleagues to learn from one of Atlassian's biggest Scala advocates about Scala. I kept on hearing good stuff about Scala, and I like trying new things, so I was eager to have my eyes opened so that I could learn how awesome this thing was. Unfortunately, I was disapponted. Most of the lesson went straight over my head. But to my dismay, the other people in the room seemed to get it. It was on this day that I realised an interesting thing about my learning style. I remembered back at uni, I was exactly the same in lectures... I never got anything I was taught in lectures. Which is probably why by the end of uni, I had pretty much stopped attending all CS lectures. It wasn't until I was given a real problem in a lab that I had to solve that I got it. This is why I never really got functional languages at uni, the lecturers that taught functional languages at my uni weren't the most practical of people. Actually, I don't think they ever even considered whether their work had any practical implications at all. So they taught things like map, reduce, filter, fold, curried functions etc like this "Look, here's something you've never heard of! Behold its awesomeness! Now here's a trivial mathematical use case with no apparent real world application! Look at how practical it is!"

Some people have no problems learning this way, and I envy them. For me, I need to have a problem that I understand and want to solve today, that the thing that I'm learning can solve, in order to learn it. So if you're reading this and thinking "That is so me!", and you want to learn Scala, then hopefully this is the blog post for you. I'm going to start with a practical problem that hopefully you will understand and want to solve, and then show you how three particular Scala features allow solving this really elegantly and simply. Hopefully after that, you'll have a good understanding of those features in Scala, and be able to understand a lot more of Scala code when you encounter it.

The problem

The problem that I want to solve is dependency management. If you've ever used maven, you'll know how verbose this can be. For those that haven't encountered maven, here's what specifying a single dependency looks like:

<dependency>
    <groupId>net.vz.mongodb.jackson</groupId>
    <artifactId>mongo-jackson-mapper</artifactId>
    <version>1.4.1</version>
</dependency>

In English, this means a dependency on the artifact mongo-jackson-mapper version 1.4.1 from the organisation net.vz.mongodb.jackson. The English explanation is shorter than the code! Maybe the Maven authors could have made things nicer, by placing it on one line, and that would be clearer too, for example, something like this:

<dependency>net.vz.mongodb.jackson:mongo-jackson-mapper:1.4.1</dependency>

There are however some advantages to the more verbose model, the language (XML) that the DSL (maven pom.xml format) uses is able to fully express every element of the dependency, rather than the elements being defined in some other format XML is unaware of. There are also other attributes (such as scope and classifier) that the more verbose solution can easily unambiguously specify, these become very hard if it's just an arbitrary string of characters.

So here is the problem. I want a DSL to specify dependencies with as little noise as possible from the DSL, so that my dependency specification is short, unambiguous, and its format should be enforced by the language it's specified in. The language I'm going to use is Scala.

And just to ground this problem even more solidly in reality, this example is not contrived. SBT, the Scala Build Tool, uses Scala to specify its build configuration, including dependencies. The solution I'm going to show you is the solution that SBT uses. So by the end of this you will also understand a little about how to use SBT, which you will no doubt need to know as you delve into the world of Scala.

How would we do it with no weird Scala tricks?

First of all I will give a solution that uses language constructs that you should already be familiar with, using classes, fields, methods etc. If the following syntax makes no sense to you, you should probably do some reading on basic Scala syntax.

def groupId(groupId: String) = new GroupId(groupId)

class GroupId(val groupId: String) {
  def artifact(artifactId: String) = new Artifact(groupId, artifactId)
}

class Artifact(val groupId: String, val artifactId: String) {
  def version(version: String) = new VersionedArtifact(groupId, artifactId, version)
}

class VersionedArtifact(val groupId: String, val artifactId: String, val version: String) {
}

Here we have three classes, a GroupId, which has a method for creating an Artifact from that group id, which in turn has a method for creating a VersionedArtifact, which captures our dependency. There's also a factory method for creating the GroupId. To use it, we can do this:

groupId("net.vz.mongodb.jackson")
    .artifact("mongo-jackson-mapper")
    .version("1.4.1")

Already it's looking better than the XML version, but this could be done in Java too. Scala has a lot more to offer yet. As we go through the following lessons, it's important to remember that each change we make, we are still basically achieving the same as the above.

Lesson One: Scala methods can be made of operator characters

To Java developers such as myself, this is a bit weird, though in other languages this is quite common. There's some strict rules in Scala over exactly what a method name can be, but one of them is that it can be made of just operator characters (ie, +, =, *, /, %, -). So, the first change that we're going to make to our above code is to get rid of the method names. It makes things more concise, but other than that it may seem like a weird first step. Bear with me.

def groupId(groupId: String) = new GroupId(groupId)

class GroupId(val groupId: String) {
  def %(artifactId: String) = new Artifact(groupId, artifactId)
}

class Artifact(val groupId: String, val artifactId: String) {
  def %(version: String) = new VersionedArtifact(groupId, artifactId, version)
}

class VersionedArtifact(val groupId: String, val artifactId: String, val version: String) {
}

And so now our code looks like this:

groupId("net.vz.mongodb.jackson").%("mongo-jackson-mapper").%("1.4.1")

Lesson Two: Scala lets you call methods without dots or braces

There are some strict rules governing this, including what happens when obvious ambiguities arise, and you can read about that in your own research. Simply put, if a method has only one parameter, you can call it without using the dot or the braces, rather replace them with whitespace:

groupId "net.vz.mongodb.jackson" % "mongo-jackson-mapper" % "1.4.1"

This is called "infix" notation, which you may have also come across in other languages. We're looking much more concise now, yet the language is still enforcing the rules of what a dependency needs, and we still end up with a strongly typed VersionedArtifact. But we're not finished yet.

Lesson Three: Scala lets you implicitly convert any type to any other type

For me, this was the weirdest feature of Scala when I first learnt it. It sounded so cool and dangerous at the same time. Most languages, when you call a method on an object, if that method doesn't exist on that object, will throw an error. For the statically typed languages, this will be a compile error, for dynamic languages it will be a runtime error. However, Scala is a bit different. When it can't find a method on an object (at compile time, because Scala is statically typed), it has an extra step before it gives up with an error.

Scala has a concept of implicit methods (and fields, but right now let's only worry about methods). When a method that you call doesn't exist on the object that you're trying to invoke it on, it will check the current scope for any implicit methods that take the type of your object as an argument. If the implicit methods result has a method that matches the method you are trying to invoke, then Scala wraps your object in that implicit method call. Have I lost you? Let's look at some code:

implicit def groupId(groupId: String) = new GroupId(groupId)

So the only change I've made to the groupId method is that it is now implicit. When I define my dependency, I do this:

"net.vz.mongodb.jackson" % "mongo-jackson-mapper" % "1.4.1"

Notice that the call to groupId has completely disappeared. If at this point our newly learned Scala syntax is a bit overwhelming for you, maybe it would help to see what would happen if we applied the implicit rule first, before we got rid of our dots and braces, and before we changed our method names to percent signs:

"net.vz.mongodb.jackson".artifact("mongo-jackson-mapper").version("1.4.1")

What it looks like is that the String class has a method called artifact. But we know it doesn't, and the Scala compiler knows it doesn't. When it sees that, it says "are there any implicit methods available that accept a String as an argument, and return a type that has an artifact method?" And so it finds the groupId method, which matches that criteria, and converts the code to this:

groupId("net.vz.mongodb.jackson").artifact("mongo-jackson-mapper").version("1.4.1")

When I first learnt about this feature, I thought that's going to cause massive amounts of confusion. And, if used irresponsibly, it certainly can. But when used carefully, it provides a massive amount of power, for example, in creating DSLs like the one above, or for adding methods to third party/legacy libraries.

On a side note, Scala actually uses this to add functional methods to the Java collections API, by providing implicit conversion methods to convert Java collections to Scala collections. The confusion that could arise can be avoided by being careful to only import the implicit methods you need, where you need them. Scala allows you to put an import statement anywhere in code, so if you have one method in a class that needs to work with Java collections, you can import the implicit collection conversions into just that block of code. In the case of DSLs, it's usually the case that the use of these are separated from the rest of your code by normal encapsulation, so a DSL for accessing a database would only appear in a DAO, where you're expecting it to appear, so there would be no confusion.

Putting it into context

There is one last step, and that is to show what we actually do with the dependency we've declared. In maven, our XML ends up being part of the rest of the overly verbose POM file. In SBT, there is a file called build.sbt that contains snippets of Scala, which build up the configuration. Available to the snippets are certain variables, for example, libraryDependencies, which contain the configuration. So adding my dependency to my SBT project means adding a line to the build.sbt file, like this:

libraryDependencies += "net.vz.mongodb.jackson" % "mongo-jackson-mapper" % "1.4.1"

You may notice here that there is another use of infix operator named methods, += is a method defined in mutable collections for adding elements. Multiple dependencies can be added more tersely like this:

libraryDependencies ++= Seq(
    "net.vz.mongodb.jackson" % "mongo-jackson-mapper" % "1.4.1",
    "net.vz.mongodb.jackson" % "play-mongo-jackson-mapper" % "1.0.0"
)

So there we have it. We have a very terse way of specifying dependencies, that is strongly typed, and so validated at compile time. I hope that the problem I have presented and solved is a problem that you have understood and can relate to, and that you've understood how Scala can be used to help solve it, and that in learning how Scala helps solve it, you've learnt and understood some important features of the Scala language that you're now keen to apply to other problems that you have.

Open Source Developers

Since everyone's doing it, I thought I'd do it too. My contribution to the What people think I do meme:

About

Hi! My name is James Roper, and I am a software developer with a particular interest in open source development and trying new things. I program in Scala, Java, Go, PHP, Python and Javascript, and I work for Lightbend as the architect of Kalix. I also have a full life outside the world of IT, enjoy playing a variety of musical instruments and sports, and currently I live in Canberra.