all that jazz

james' blog about scala and all that jazz

A sad day for open source

This month is a month of mixed emotions for me.

It is my 10 year anniversary of working at Lightbend, formerly Typesafe. I’ve loved working for Lightbend; I certainly would not still be working here if I didn’t. I almost passed up the opportunity. At the time Typesafe contacted me I was living in Germany, but I was about to return to Australia. I had taken one year’s leave without pay from Atlassian when I moved to Germany, and I was really looking forward to going back to Atlassian; it was and still is a great company.

Out of the blue I got an email from Typesafe asking if I’d like to work full time on Play Framework. I had been contributing to Play a lot in my spare time, and I ran the Berlin Play User Group. I was flattered by the offer, but I did a quick lookup of where Typesafe was based - San Francisco - and I really didn’t want to move to the US, so I politely declined, saying I was about to move back to Australia. The response was “we’re all remote, you can work in Australia if you want.”

Initially I thought damn, now I have to come up with a proper excuse for why I didn’t want to work there. But then I remembered something. Open source had been a passion of mine ever since I was in university. I had dreamed of having a job where I could contribute to open source full time. This job offer was actually for my dream job. So, I took the job, and it didn’t disappoint: working full time in open source was everything I had hoped it would be and more.

All this is to say that open source is something that I’m passionate about. I love the open source model, I love working on it, I love using it, I love being involved with communities of people from all around the world focussed on solving technical problems. Which is why this month is also a very sad month for me. Lightbend is no longer what I would call an open source company. With the relicensing of Akka to the Business Source License (BSL), the donation of Play Framework to the Open Collective, and our smaller contributions to Scala over the years, I can’t in good faith say I work for an open source company.

Of course, the BSL isn’t a terrible license. It reverts to Apache 2 after 3 years. It allows companies with less than US$25 million in annual revenue to still use Akka for free. And there are exclusions allowing it to still be used as part of Play Framework, so Play users aren’t impacted by the relicensing. But it’s not in the spirit of open source as I dreamed of it all those years ago.

Also, I don’t think Lightbend is making the wrong choice. We’re not the first company to adopt this style of license. Akka is not just another library that people use; it, and the architectural approach that it allows, forms a core platform on which scalable and resilient systems are built. Akka puts incredibly powerful and complex distributed systems tools into the hands of everyday programmers who don’t have PhDs in distributed computing, allowing them to build scalable and resilient systems easily. Developing and maintaining Akka requires some of the best distributed systems experts our industry has. The open source model simply isn’t working when it comes to maintaining that, and hence, in order to continue to offer and develop this unique and powerful tool, Lightbend needs to look at an alternative model. I think the BSL is a good fit for this.

Nevertheless, I don’t feel good about it at all. It’s not what I signed up for. It’s not the dream that I was hoping to live. But I don’t think this problem is unique to Lightbend. I think the dream of the open source business model, across the industry, is proving impossible. The main open source projects that succeed are ones that are side effects of another company’s business, and hence the project itself isn’t its developers’ prime focus; the company’s other business interests are. Which is ok, but also isn’t really what open source is about. So, what I’m sad about is not Lightbend’s move, but rather what open source has become in general.

Don’t get me wrong, open source is not going away. The world will continue to run on open source software, forever. But the days of teams of open source developers whose primary focus was on serving their open source communities, funded by businesses whose business interests were well aligned with the interests of their open source communities, those days are over. Projects like Kubernetes will be run by businesses who are not very aligned with their communities in terms of goals and motivations, but will manage to produce just enough to keep the project useful to the masses. Projects like Akka will be developed mostly in a proprietary manner, with decisions made based on what customers will pay for. If I’m being honest, Akka has been developed that way for a long time now, even if it was technically open source.

As for my future, I’m not leaving Lightbend anytime soon. I do still enjoy my job; it’s different now, working with production systems as the architect of Kalix. One complaint of many open source developers is that when they spend too long working on open source, they forget what it’s like to actually run software in production. So it’s good to be back getting direct exposure to what it’s like running the software I’ve been developing all these years.

Securing Akka cluster communication in Kubernetes

A feature of Akka that we’ve been using in production for some time now but haven’t made a big deal about is Akka remoting’s support for using mTLS certificates that are frequently rotated. This support is designed to work with cert-manager and other Kubernetes based secret providers with an absolute minimum of configuration. All you need is two lines of configuration in Akka’s configuration file, and you’re ready to go on the Akka side.

This feature is important for secure Akka deployments. Akka clusters use a proprietary protocol to communicate with each other. This protocol by default contains no authentication or encryption, and so to prevent malicious hosts from joining your Akka clusters or eavesdropping on your Akka cluster communication, you need to use mTLS to secure the communication.

In a Kubernetes environment, many people turn to a service mesh such as Istio, Linkerd or Consul to authenticate and encrypt their network communications, but unfortunately this is not an option for Akka cluster communication. The goal of a service mesh is to ensure that services do not need to be aware of where and how the services they talk to are deployed. The mesh hides this, so a service thinks it’s only talking to one logical service, while the mesh handles concerns such as load balancing, encryption, authentication and authorization, canary and A/B deploys, and so on. Akka clusters, however, need to understand how and where they are deployed in order to implement their stateful features such as sharding, replication and P2P messaging. So, when deploying a service that uses Akka clustering to a service mesh, the Akka cluster communication must bypass the service mesh.

§Prerequisites

In this blog post, I’ll explain how to provision the certificates needed by Akka using cert-manager. I’ll assume you have a Kubernetes cluster with a standard cert-manager installation.

§Understanding the certificates

First off, let me explain a few concepts. cert-manager has a concept of Certificates and Issuers. A Certificate is a CRD that you deploy that cert-manager will reconcile into a Kubernetes Secret containing a TLS certificate. The Certificate references an Issuer, and the Issuer describes how Certificates that reference it should be issued.

In order to support frequently rotated certificates, Akka can’t just use a self signed certificate, since self signed certificates need to be the same at both ends to authenticate each other properly, and during the time when the certificate is being rotated, two different Akka nodes may have different certificates. Instead, Akka needs certificates issued by a certificate authority (CA). Certificates are trusted because they are signed by the CA, so during rotation the old and the new certificate can work together, because both are signed by the same CA. So, when we issue our certificates, we’ll use cert-manager’s CA Issuer type.

The CA Issuer itself needs a certificate to do its signing, and this certificate we’ll also provision using cert-manager. That certificate we’re not going to rotate - its private key never gets shared with anything outside of cert-manager, and so rotating it is not as necessary. Because of this, it will use a self signed certificate, and provisioning that certificate can be done by using a cert-manager self signed Issuer type.

So, in total, we’re going to have two issuers, a self signed issuer that issues certificates for the CA issuer, and then that CA issuer will issue certificates that are frequently rotated for our Akka service to use. The self signed issuer, certificate, and CA issuer can be reused across different Akka deployments - more on that later.

§Kubernetes resources

First we deploy the self signed issuer:

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: self-signed-issuer
spec:
  selfSigned: {}

We’re creating this for the whole cluster; self signed issuers don’t have any state or configuration, so there’s no reason to have more than one for your entire cluster.

Next we create a self signed certificate for our CA issuer to use that references this issuer:

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: akka-tls-ca-certificate
  namespace: default
spec:
  issuerRef:
    name: self-signed-issuer
    kind: ClusterIssuer
  secretName: akka-tls-ca-certificate
  commonName: default.akka.cluster.local
  # 100 years
  duration: 876000h
  # 99 years
  renewBefore: 867240h
  isCA: true
  privateKey:
    rotationPolicy: Always

We’ve created this in the default namespace, which will be the same namespace that our Akka service is deployed to. If you’re using a different namespace, you’ll need to update accordingly.

The commonName isn’t very important, it’s not actually used anywhere, though it may be useful for debugging purposes if you’re ever looking into why a particular certificate isn’t trusted by a service. We use a naming convention for common names and DNS names that follows the pattern <service-name>.<namespace>.akka.cluster.local. The CA uses the same convention without the service name. This convention doesn’t need to be followed, but it makes it easy to reason about the purpose of any given certificate.

Now we create the CA issuer:

apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: akka-tls-ca-issuer
  namespace: default
spec:
  ca:
    secretName: akka-tls-ca-certificate

This uses the secret that we configured to be provisioned in the certificate above. Finally, we provision the certificate that our Akka service is going to use - we’re assuming that the name of the service in this case is my-service:

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: my-service-akka-tls-certificate
  namespace: default
spec:
  issuerRef:
    name: akka-tls-ca-issuer
  secretName: my-service-akka-tls-certificate
  dnsNames:
  - my-service.default.akka.cluster.local
  duration: 24h
  renewBefore: 16h
  privateKey:
    rotationPolicy: Always

The actual dnsName configured isn’t important, as long as it’s unique to the service within the issuer, since Akka cluster does not actually use these names for looking up the service. Akka’s mTLS support will verify that the DNS name supplied by an incoming connection matches the DNS name supplied in its own secret, and reject it otherwise. Again, we’re using the naming convention for the dnsName mentioned above.

This certificate is configured to last for 24 hours, and to be renewed 16 hours before it expires.

§Configuring Akka

To configure your Akka application, you need to have artery based remoting enabled, which will be the case if you’ve followed the Akka guide for configuring cluster bootstrap in Kubernetes, with the following additional configuration:

akka.remote.artery {
  transport = tls-tcp
  ssl.ssl-engine-provider = "akka.remote.artery.tcp.ssl.RotatingKeysSSLEngineProvider"
}

This instructs Akka to use TLS, with the RotatingKeysSSLEngineProvider, an SSL engine provider that is designed to pick up Kubernetes TLS secrets, and poll the file system for when they get rotated. It also applies authorization by matching the incoming DNS name with the DNS name of its own certificate.

§Configuring the Akka deployment

Having configured Akka and built a new Docker image, you can now configure your Akka deployment. To do this, you need to mount the certificate at the path /var/run/secrets/akka-tls/rotating-keys-engine. This is the default path that the RotatingKeysSSLEngineProvider uses to pick up its certificates. So, add the following volume to your pod:

      volumes:
      - name: akka-tls
        secret:
          secretName: my-service-akka-tls-certificate

And then you can mount that in your container:

        volumeMounts:
        - name: akka-tls
          mountPath: /var/run/secrets/akka-tls/rotating-keys-engine

Your complete deployment YAML, configured as described in our Kubernetes deployment guide, might look like this:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-service
  namespace: default
  labels:
    app: my-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-service
  template:
    metadata:
      labels:
        app: my-service
    spec:
      containers:
      - name: my-service
        image: my-service:latest
        readinessProbe:
          httpGet:
            path: /ready
            port: management
        livenessProbe:
          httpGet:
            path: /alive
            port: management
        ports:
        - name: management
          containerPort: 8558
          protocol: TCP
        - name: http
          containerPort: 8080
          protocol: TCP
        resources:
          limits:
            memory: 1024Mi
          requests:
            cpu: 2
            memory: 1024Mi
        volumeMounts:
        - name: akka-tls
          mountPath: /var/run/secrets/akka-tls/rotating-keys-engine
      volumes:
      - name: akka-tls
        secret:
          secretName: my-service-akka-tls-certificate

And now you have secured your Akka cluster communication with mTLS using frequently rotated certificates. This will prevent both eavesdropping and malicious services trying to join your Akka cluster.

Note that if you apply this to an existing running cluster deployment for the first time, you will need to do an Akka cluster restart. The new nodes will be attempting to speak TLS to the old nodes, while the old nodes are not configured to speak TLS, so they will be unable to connect to each other. The easiest way to restart the Akka cluster is to scale the deployment down to 0, and then back up to what it was before.

If you have more Akka services that you wish to deploy in the same namespace, you can reuse the same CA Issuer; you only need to deploy an additional Certificate for each service.

Rotating secrets in Kubernetes

I’m building a multitenant SaaS on top of Kubernetes at the moment, and one principle we’ve gone with is that all secrets should be rotated regularly. I’m surprised by the distinct lack of documentation on best practices for how to do this in Kubernetes.

Of course, when it comes to certificates, it’s fairly straightforward to rotate a certificate using cert-manager, but even that isn’t quite solved - while you can rotate a service’s certificate, there’s no straightforward way to rotate the certificate of the CA that it, and the services it authenticates with, trust.

When it comes to authentication that is not based on certificates, such as symmetric encryption keys, passwords, or asymmetric keys used for things like JWTs, there is nothing really out there saying how to do this.

Of course, it’s absolutely possible to do it, and I can come up with multiple different ways in my head to do this without requiring downtime. There are also multiple third party solutions, such as HashiCorp Vault, that make it possible. But I’d like a Kubernetes native mechanism - after all, Kubernetes does provide a secret management API, so it should be possible to use this in a way that supports secret rotation.

So, I’m going to propose a convention, that perhaps could become a best practice, for how to manage secrets in Kubernetes in a way that is compatible with secret rotation. What I’m proposing is all manual, but it shouldn’t be too hard to build tooling around this approach to make it automatic.

§A few principles

Firstly, for secret rotation to work, I need to outline a few principles.

§Secrets should be read, and reread, from the filesystem

One of the great things about Kubernetes secrets is that when mounted as filesystem volumes, an update to the secret is immediately available to the pods consuming it. No restart or redeploy is needed, all you have to do is update the secret. However, to take advantage of this, the code consuming the secret must read it from the filesystem. It doesn’t work for secrets passed using environment variables. Additionally, the code must, at least periodically, reread the secret from the filesystem, otherwise it won’t pick up the changes.

We’ve had success in Akka configuring this for encryption keys and certificates. When we read the secret, we track a timestamp of when we read it. When we next access the certificate (next time we receive or make a new TLS connection), if the timestamp is more than 5 minutes old, we reread it. This way, we can rotate secrets with zero interruption to the service.
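
To make that concrete, here’s a minimal sketch in Scala of a secret that is reread from the filesystem when its cached copy is more than five minutes old. The path, class and field names are purely illustrative - this is not the actual Akka implementation, just the pattern it follows.

import java.nio.file.{Files, Path, Paths}
import java.time.{Duration, Instant}

// A secret that is reread from a mounted volume at most every five minutes,
// so a rotated Kubernetes secret is picked up without a restart.
class ReloadingSecret(path: Path, maxAge: Duration = Duration.ofMinutes(5)) {
  private case class Cached(value: Array[Byte], readAt: Instant)
  @volatile private var cached: Option[Cached] = None
  def get(): Array[Byte] = {
    val now = Instant.now()
    cached match {
      case Some(c) if Duration.between(c.readAt, now).compareTo(maxAge) < 0 =>
        // Recent enough, use the cached value
        c.value
      case _ =>
        // Stale or never read, reread from the filesystem
        val value = Files.readAllBytes(path)
        cached = Some(Cached(value, now))
        value
    }
  }
}

// Usage (hypothetical mount path):
// val secret = new ReloadingSecret(Paths.get("/var/run/secrets/my-secret/secret.key"))
// val keyBytes = secret.get()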

§Secrets must be identified by an id

When it comes to TLS certificates, the id of the secret is built in to the certificate, and TLS implementations can typically be configured with multiple trusted certificates to use. For JWTs, there is a semi built in mechanism: you set the key id in the JWT’s kid header parameter. However, most JWT implementations don’t make you set this by default, and sometimes provide only limited, if any, support for dynamically selecting a key to use based on the passed in key id.
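
As an illustration, here’s roughly what that might look like with the Nimbus JOSE+JWT library (my choice for the example, not something this convention requires), signing with the primary key and embedding its id in the kid header, then selecting the verification key by the incoming token’s kid:

import com.nimbusds.jose.{JWSAlgorithm, JWSHeader, JWSObject, Payload}
import com.nimbusds.jose.crypto.{MACSigner, MACVerifier}

// keys: map of key id -> raw key bytes (>= 256 bits for HS256),
// primaryKeyId: the id of the key currently marked as primary.
def sign(keys: Map[String, Array[Byte]], primaryKeyId: String, claims: String): String = {
  // Embed the primary key's id in the kid header so verifiers can select it
  val header = new JWSHeader.Builder(JWSAlgorithm.HS256).keyID(primaryKeyId).build()
  val jws = new JWSObject(header, new Payload(claims))
  jws.sign(new MACSigner(keys(primaryKeyId)))
  jws.serialize()
}

def verify(keys: Map[String, Array[Byte]], token: String): Boolean = {
  val parsed = JWSObject.parse(token)
  // Look up the key named by the token's kid; unknown ids are rejected
  keys.get(parsed.getHeader.getKeyID).exists(key => parsed.verify(new MACVerifier(key)))
}

The important part is that the verifier never assumes a single key; it always selects by id, which is what makes rotation seamless.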

When encrypting arbitrary data, there’s not usually any built in mechanism to indicate the id of the key used. In my case, when we encrypt small amounts of data, we are encoding using the format <keyid>:<base64ed initialization vector>:<base64ed cypher text>. This allows us to associate each piece of encrypted data with the key that was used to encrypt it.
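
As a rough sketch of how that format can be produced and consumed, assuming AES-GCM as the cipher (the convention doesn’t mandate any particular algorithm):

import java.security.SecureRandom
import java.util.Base64
import javax.crypto.Cipher
import javax.crypto.spec.{GCMParameterSpec, SecretKeySpec}

// Encrypt data with the given key, producing "<keyid>:<base64 iv>:<base64 ciphertext>"
def encrypt(keyId: String, key: Array[Byte], plaintext: Array[Byte]): String = {
  val iv = new Array[Byte](12)
  new SecureRandom().nextBytes(iv)
  val cipher = Cipher.getInstance("AES/GCM/NoPadding")
  cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"), new GCMParameterSpec(128, iv))
  val ciphertext = cipher.doFinal(plaintext)
  val b64 = Base64.getEncoder
  s"$keyId:${b64.encodeToString(iv)}:${b64.encodeToString(ciphertext)}"
}

// Decrypt, looking up the key to use from the key id embedded in the encoded string
def decrypt(keys: Map[String, Array[Byte]], encoded: String): Array[Byte] = {
  val Array(keyId, ivB64, ciphertextB64) = encoded.split(":", 3)
  val b64 = Base64.getDecoder
  val cipher = Cipher.getInstance("AES/GCM/NoPadding")
  cipher.init(Cipher.DECRYPT_MODE, new SecretKeySpec(keys(keyId), "AES"),
    new GCMParameterSpec(128, b64.decode(ivB64)))
  cipher.doFinal(b64.decode(ciphertextB64))
}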

For passwords, it gets harder again, because generally, only one password can work at a time. One possibility here is to support multiple usernames, and use the username as the key id.

§Rotation mechanism

Having set out the principles by which the code that consumes our secrets will abide, we can now propose a mechanism for configuring and rotating secrets in Kubernetes.

Kubernetes secrets allow multiple key value pairs. We can utilise this. When these secrets get mounted as volumes, the filename corresponds to the key in the key value pair. Given the name clash between secret keys and key value keys, I’m going to refer to the key value keys as filenames.

Consider the case where you might have a symmetric key, perhaps used for signing JWTs. Each key will get an identifier, this could simply be a counter, a timestamp, a UUID, etc. When there is only one key in use, the filename might be <key-id>.key. When your code loads the key, it looks in the directory of the volume mount for any files called *.key, and will load them all, storing them in a map of key ids to the actual key contents.

Now, this works great when validating JWTs, since you have a key id before you start, and you just need to select the secret for that key id. However, when creating a JWT, if you have multiple keys configured, which one do you use? It’s important that you choose the right one: if the secret has only just been updated to add a second key, other pods may not yet have picked up that second key, and so if you use that second key to sign a JWT and send it to those pods, they may fail to validate it. So, to handle this, we will also support specifying a primary secret, by naming it <key-id>.key.primary. There must only be one primary secret, and it will always be the one used to sign or encrypt data.
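
Here’s a sketch of what loading such a directory might look like, following the filename convention described above (the helper and type names are just illustrative):

import java.nio.file.{Files, Path}
import scala.jdk.CollectionConverters._

case class LoadedKey(id: String, primary: Boolean, bytes: Array[Byte])

// Load every *.key and *.key.primary file from the mounted secret directory
// into a map of key id -> key, remembering which one is the primary.
def loadKeys(dir: Path): Map[String, LoadedKey] = {
  Files.list(dir).iterator().asScala
    .map(_.getFileName.toString)
    .collect {
      case name if name.endsWith(".key.primary") => (name.stripSuffix(".key.primary"), true)
      case name if name.endsWith(".key")         => (name.stripSuffix(".key"), false)
    }
    .map { case (id, primary) =>
      val fileName = if (primary) s"$id.key.primary" else s"$id.key"
      id -> LoadedKey(id, primary, Files.readAllBytes(dir.resolve(fileName)))
    }
    .toMap
}

// The primary key is the one used to sign or encrypt new data
def primaryKey(keys: Map[String, LoadedKey]): Option[LoadedKey] =
  keys.values.find(_.primary)

Combined with the periodic reread described earlier, this gives each pod an up to date view of all configured keys and which one is currently primary.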

So, given this set up, this will be our rotation mechanism:

  • A given secret starts with a single key configured, let’s say its id is r1. The id can be anything, we’re using r to stand for revision, and 1 to indicate it’s the first key, but the id could be a UUID or anything else. The key will be placed in a Kubernetes secret, with a filename of r1.key.primary.
  • When it comes time to rotate the key, a new key will be generated, and a new id assigned, r2. The Kubernetes secret will be updated so that it now has both keys, with r1.key.primary being the filename for the old key, and r2.key being the filename for the second key.
  • Once we are sure that all nodes have picked up the new key, we can now change the new key to be the primary. So, the secret is again updated, with r1.key being the filename for the old key, and r2.key.primary being the filename for the new key.
  • After some time, eg, once we are sure that all JWTs signed by the old key have expired, we will delete the old key, updating the secret so it only contains one key, with a filename of r2.key.primary.

This approach can also work when a secret comes in multiple parts, such as asymmetric keys, or self signed certificates, simply replace key with the name of the thing, so for example, I might have r2.private and r2.public or r2.crt.

§Why bother?

It’s not too hard for someone to implement the above themselves, but why bother proposing it as a best practice? If the above approach was to be adopted by many different people, this would open the following possibilities:

§Secret consumption support

Secret consumers could provide built in support for this convention. For example, a JWT implementation might offer it, users would just need to pass it the directory to find the keys, and it would use them, periodically rereading and consuming the new keys. HTTP servers and clients could do likewise, as could database drivers. Generic libraries that implement the convention could be provided so that arbitrary secret usage, such as for encryption, could easily consume the keys.

§Automatic rotation

If enough consumers were using the convention, tools for automatic secret rotation could be implemented. This could be as simple as an operator that would allow you to configure a secret to be generated and rotated, given parameters such as how frequently to rotate secrets, how long to wait before making the new secret the primary, and then how long to wait before deleting the old secret. Such a tool could also be configured to create and wait for migration jobs, to allow data encrypted at rest to be decrypted and re-encrypted using the new key.

§Pros and cons

§Pros

Many of the pros are self evident in the explanation above, but here are a few more that I can think of:

§Not Kubernetes specific

The way code consumes keys is not at all specific to Kubernetes. It can work on any platform that can pass keys using a filesystem. This includes development environments where perhaps static keys might be used.

§Kubernetes native

In spite of not being specific to Kubernetes, this mechanism is native to the way Kubernetes works. It uses the in built secret mechanism, and it’s workable without any third party tooling. It will work on any Kubernetes distribution, and if you already understand how Kubernetes secrets work, it’s straight forward to understand how this works.

§No vendor lock in

This also provides for a vendor neutral way to rotate and consume certificates. Today, if using HashiCorp Vault for example, you need to use the Vault client in your code to connect to the Vault server to get keys, which ties your code to Vault. This convention allows whatever is managing the keys to be decoupled from the consumers. This can also be advantageous in development and test environments - you might not be able to run your vendor’s secrets manager on your local machine, for example for licensing or cost reasons, so you can substitute in a different one in those environments.

§Cons

The convention is not without its cons of course.

§Reliance on the filesystem

Some people may object to using the filesystem for distributing secrets, preferring to only pass them through authenticated connections. Of course, at some point some secret needs to be passed to the code - if the code is going to authenticate with a third party secret manager to retrieve secrets, the secret for that authentication needs to be stored somewhere, such as the filesystem or an environment variable.

§Reliance on Kubernetes secrets

Some people may be concerned with the way Kubernetes stores secrets itself. As I understand it, this is either pluggable or can be encrypted (I know GKE supports integration with Cloud KMS to encrypt the secrets stored by Kubernetes, for example). But in some circumstances this might not be good enough for people.

§Changes to the way code consumes secrets

The requirement to read secrets from the filesystem may be disruptive for libraries that typically consume secrets from configuration files or environment variables.

§Conclusion

So, does this convention sound useful? Please comment if you have anything to add!

Testing sbt 1.0 cross builds

sbt 1.0 is now released, and everyone in the sbt community is hard at work upgrading their plugins. Because many sbt plugins depend on each other, there will be a short period of time (that we’re in now) where people won’t be able to upgrade their builds to sbt 1.0 because the plugins their builds use aren’t yet upgraded and released. However, that doesn’t mean you can’t cross build your plugin for sbt 1.0 now, simply upgrade to sbt 0.13.16 and use its sbt plugin cross building support.

I had a small problem yesterday though when working on the sbt-web upgrade: part of my plugin needed to be substantially rewritten for sbt 1.0 (sbt 1.0’s caching API now uses sjson-new rather than sbinary, so all the formats needed to be rewritten). I didn’t want to rewrite this without an IDE because I knew nothing about sjson-new and needed to be able to easily browse and navigate its source code to discover how to use it, and I wanted the immediate feedback that IDEs give you on whether something will compile or not. The problem with doing this is that my build was still using sbt 0.13.16, and I couldn’t upgrade it because not all the plugins I depended on supported sbt 1.0. So, I came up with a small workaround that I’m posting here for anyone that might find it useful: before reimporting the project into IntelliJ, I added the following configuration to my build.sbt:

sbtVersion in pluginCrossBuild := "1.0.0"
scalaVersion := "2.12.2"

Unfortunately it seems that you can’t leave this in the build file to ensure that sbt 1.0 is always the default, because the sbt cross building support doesn’t override that setting (this is possibly a bug). But if you add it to your build.sbt right before you import into IntelliJ, then remove it later when you’re done developing for sbt 1.0, it’s a nice workaround.

socket.io - The good, the bad and the ugly

I have recently finished an implementation of socket.io server-side support for Play Framework. Naturally, I’ve become intimately familiar with the protocol, and I’ve formed a few opinions of it during the course of my efforts, which I’m going to share.

All reviews of technology are relative, relative to the things that the reviewer values in technology. Not everyone values the same things, and so this means that a review by someone who values different things to you is probably irrelevant to you. My reason for publishing this review is that there are a good number of people out there who share similar technology values to me, and so they will find this review useful as they evaluate whether socket.io is right for them. For people that don’t share the same values as I do, there’s not really much point in reading this blog post, you’ll probably disagree with me, but the disagreement will most likely be a disagreement in values, not on socket.io’s ability to meet those values. We can debate all day what set of values is right for software development, but this is not the blog post for that.

So what are my values? Here are a selection of values that are relevant to this review:

  • I am a strong proponent of reactive systems, that is, systems that are responsive, resilient, scalable and message driven. In this context, streaming with backpressure is something that I see as important - a resilient system needs to propagate backpressure to ensure parts of the system don’t get overloaded.
  • I’m strongly for tools, libraries and frameworks that enable high productivity software development.
  • I think it’s important to have well defined standards and interfaces to maximise compatibility between decoupled implementations.

§The good

Why should you use socket.io? I was initially sceptical that socket.io offered much value at all, but over the course of implementing it, I’ve changed my view on that.

A lot of people are saying that with all major browsers supporting WebSockets, there’s no need for socket.io anymore. This is based on the assumption that all socket.io offers is a fallback mechanism to long polling when WebSockets are not available. This is completely false, most pertinently because socket.io doesn’t even provide that; the fallback to long polling is provided by a protocol that socket.io sits on top of, called engine.io. Here’s basically what engine.io provides:

  1. Multiple underlying transports (WebSockets and long polling), able to deal with disparate browser capabilities and also able to detect and deal with disparate proxy capabilities, with seamless switching between transports.
  2. Liveness detection and maintenance.
  3. Reconnection in the case of errors.

The split between engine.io and socket.io is actually a great thing, engine.io implements layer 5 of the OSI model (the session layer), while socket.io implements layer 6, the presentation layer. I don’t know if the authors of engine.io/socket.io were aware of how closely their split mapped to the OSI model, but they did a great job here.

Now, even if you say that wide support of WebSockets makes engine.io redundant, it only makes half of the first point above redundant - it doesn’t address proxies that don’t support WebSockets (although this can be somewhat worked around by using TLS), and then it doesn’t have any mechanism for liveness detection (yes, WebSockets support ping/pong but there’s no way to force a browser to send pings or monitor for liveness, so the feature is useless), and browser WebSocket APIs have no in built reconnection features.

So that’s engine.io, back to my earlier point, transparent fallback of transports is not a socket.io feature. So what exactly does socket.io provide then? Socket.io provides three main features:

  1. Multiplexing of multiple namespaced streams down a single engine.io session.
  2. Encoding of messages as named JSON and/or binary events.
  3. Acknowledgement callbacks per event.

The first two of these I think are important features, the third will feature in what I think is bad about socket.io.

§Namespaces

Multiple namespaces I think is a great feature of socket.io. If you have a rich application that does a lot of push communication back and forth with the server, then you will likely have more than one concern that you’re communicating about. For example, I’ve been working on a monitoring console that dynamically subscribes to many different streams based on what is currently in view: sometimes it needs to view logs, sometimes state changes, sometimes events. These different concerns are rendered in different components and should be kept separate: their lifecycles are separate, the backend implementations of the streams are separate, etc. What we don’t want is one WebSocket connection to the server for each of these streams, as that is going to balloon out the number of WebSocket connections. Instead, we want to multiplex them down the one connection, but have both the client and the server treat them as if they were separate connections, able to start and stop independently. This is what socket.io allows. socket.io namespaces can be thought of as (and look like) RESTful URL paths. A chat application may encode a connection to a chat room as /chat/rooms/<room-name>, for example. And then a client can connect to multiple rooms simultaneously, disconnect them independently, and handle their events independently.

Without this feature of socket.io, if you did want to multiplex multiple streams down one connection, you would have to encode your own multiplexing protocol, implementing join room and leave room messages for example, and then on the server side you would have to carefully manage these messages, ensuring that subscriptions are cleaned up properly. There is often a lot more to making this work cleanly than you might think. socket.io pushes all this hard work into the client/server libraries, so that you as the application developer can just focus on your business problem.

§Event encoding

There are two important advantages to event encoding, one I think is less important, and the other is more important.

The less important one is that when the protocol itself encodes the naming of events, libraries that implement the protocol can then understand more about your events, and provide features on top of this. The JavaScript socket.io implementation does just that, you register handlers per event, passing the library a function to handle each event of a particular name. Without this, you’d have to implement that subscription mechanism yourself, you’d probably encode the name inside a JSON object that each message sent down the wire would have to conform to, and then you’d have to provide a mechanism for registering callbacks for that event.

The reason why I think this is the less important advantage is because I think callbacks are a bad way to deal with streams of events communicated over the network. Callbacks are great where the list of possible events is far greater than the number of events that you’re actually interested in, such as in UIs, because they allow you to subscribe to events on an ad hoc basis. But when events are expensive, such as when they’re communicated over a network, then usually you are interested in all events. In those cases, you also often need higher level mechanisms like backpressure, delivery tracking and guarantees, and lifecycle management, callbacks don’t allow this, but many streaming approaches like reactive streams do.

So, what’s the more important advantage? It allows the event name to be independent of the actual encoding mechanism, and this is of particular importance for binary messages. It’s easy to put the event name inside a JSON object, but what happens when you want to send a binary message? You could encode it inside the JSON object, but that requires using base 64 and is inefficient. You could send it as a binary message, but then how do you attach metadata to it, like the name of the event? You’d have to come up with your own encoding to encode the name into the binary, along with the binary payload. socket.io does this for you (it actually splits binary messages into multiple messages: a header text message that contains the name, and then a linked binary message). Without this feature of socket.io, I think it’s impractical to use binary messages over WebSockets unless you’re streaming nothing but binary messages.

§The bad

So, we’ve covered the good, next is the bad.

§Callback centric

I’ve already touched on this, but in my opinion the callback centric nature of socket.io is a real downside. As I said earlier, one of my values is reactive streaming, as this allows resilience properties such as backpressure and message guarantees to be implemented. Callbacks offer no such guarantees: if a callback needs to do a whole bunch of asynchronous tasks, there’s no way for it to go back to the caller and say “hey, can you wait a minute before you invoke me again, I just have to go and store this to a database”. And so, an over zealous client can overwhelm a server by sending events at a higher rate than it can process, causing the server to run out of memory or CPU. Likewise, when sending events, there’s no way for the emitter to say to the caller “hey, this consumer has a slow network, can you hold off on emitting more events?”, and so if the server is producing events too quickly for the consumer, it can run out of memory as it buffers them.

Of course, whether callbacks or streams are used to implement socket.io servers and clients is completely an implementation concern, and has nothing to do with socket.io. The implementation of socket.io that I wrote is completely streams based, using Akka streams to send events, and so backpressure is supported. On the client side, in the systems I’ve worked on, we use ngrx to handle events, which once again is stream based. And the authors of socket.io cannot be entirely faulted for implementing a callback based library, the browser WebSocket APIs only support a callback based mechanism with no support for backpressure.

Nevertheless, the callback centric design of socket.io manifests itself in the socket.io protocol - socket.io events are not single messages, but lists of messages, akin to argument lists that get passed to a callback, which is an awkward format to work with when streaming. As a socket.io library implementer that wants to provide full support for the socket.io protocol, this makes defining encoders/decoders for events awkward, because you can’t just supply an API for encoding/decoding a simple JSON structure, you have to allow decoding/encoding a list of JSON structures. For end users though, this doesn’t have to be a major concern - simply design your protocols to only use single argument socket.io events. So I wouldn’t treat this as a reason not to use socket.io, it’s just a feature (ie, supplying multiple messages rather than single messages in a single event) that I think you shouldn’t use.

§Acknowledgements

socket.io allows each event to carry an acknowledgement, which is essentially a callback attached to the event. It can be invoked by the remote side, which will result in the callback on the local side being invoked with the arguments passed by the remote side. They’re convenient because they allow you to essentially attach local context to an event that doesn’t get sent over the wire (typically, this context is closed over by the callback function), and then when it’s invoked you have that context to handle the invocation.

Acknowledgements are wrong for all the same reasons that socket.io’s callback centric approach is wrong, acknowledgements subvert back pressure and provide no guarantees (with a stream you can have causal ordering and use that for tracking to implement at least once delivery guarantees, but acknowledgements have no such ordering and hence no such guarantees can be implemented).

Once again, this isn’t a reason not to use socket.io, you can simply not use this feature of socket.io, it doesn’t get in the way if you don’t use it.

§No message guarantees

Again, I’ve touched on this already, but I think it needs to be stated: socket.io offers no mechanism for message guarantees. The acknowledgements feature can be used to acknowledge that a message was received, but this requires the server to track what messages have been received. A better approach would be more Kafka like, where the client is responsible for its own tracking, and it does this by tracking message offsets; then, when a stream is resumed after a disconnect for whatever reason, the open message sent to the namespace would include the offset of the last message it received, allowing the server to resume sending messages from that point. This feature incidentally is built into Server Sent Events, so it would be nice to see it in socket.io too.

Of course, this can be built on top of socket.io, but I think it’s something that the transport should provide.

§The ugly

At a high level, I don’t think anything about socket.io is really that ugly. My ugly concerns about socket.io mostly exist at a low level, they’re things that only an implementer of the protocol will see, but some of them can impact end users.

§No specification

There is no specification for the socket.io wire protocol. This page seems to think it’s a specification, but all it describes is some of the abstract concepts, it says nothing about how socket.io packets get mapped down onto the wire.

Specifications are important for interoperability. The only way I could implement socket.io support is by reverse engineering the protocol, sometimes I had to do this with wireshark, since browser debugging tools don’t show the contents of binary WebSocket frames or binary request payloads.

Now, I’ve reverse engineered it and implemented tests, so that’s a one time problem, right? Wrong. Unless the socket.io developers never release another version of socket.io again, there will be incompatibilities in future. New features may be added. The developers might do it in a way that is backwards compatible with their own implementation, but because there’s no specification, other implementations may have implemented their parsing differently, which will cause them to break with the new feature. There may be edge cases that I didn’t come across. And so on.

Although the lack of specification primarily impacts me now, it will negatively impact users of socket.io in future, and this needs to be considered when deciding whether socket.io is right for your project or not.

§Weird encodings

The way socket.io and engine.io are encoded, especially to binary, is very weird in places. For example, there are about 5 different ways that integers get encoded, including one very odd one where each decimal digit is encoded as an octet of value 0 to 9, with a value of 255 used as a terminator for the number. Which might sort of make sense (but not really) if you had a use case for arbitrary precision integers, but this is for encoding the length of a payload, where a fixed length word, like 32 bit unsigned network byte order, would have done just fine. I mean, it’s not the end of the world to have all these different ways to encode integers, but it really doesn’t inspire a lot of confidence in the designers’ ability to design network protocols that will be tolerant to future evolution.

People much smarter than us have, over many years, come up with standard ways to encode things, which overcome many gotchas of communicating over a network, including performance concerns, efficiency concerns and compatibility concerns. There are then many libraries that implement these standard ways of encoding things to facilitate writing compatible implementations. Why forgo all that knowledge, experience and available technology, and instead come up with new ways to encode integers? It just seems very odd to me.

§Unnecessary binary overhead

There’s not a lot of use cases for sending binary messages, but for many of the use cases I can think of I would want as little overhead as possible, such as streaming voice/video. Binary messages in socket.io require sending two messages: one text message that looks like a regular text message, including the name and namespace of the event and a JSON encoded placeholder for the binary message (it literally looks something like {"_placeholder":true,"num":1}), and then the binary message gets sent immediately after. This seems to me to be a lot of overhead; it would have been better to encode the entire event into one message, using a separate binary encoding for the namespace/name and then placing the binary message in that.

I can understand a little why it is the way it is - because events contain multiple messages, you can mix binary and text messages. Having one reference the other with placeholders is a sensible way to encode that. But, this all comes back to the callback centric nature of socket.io, the reason events contain multiple messages is that callbacks can have multiple arguments, if each event only contained one message then this wouldn’t be an issue.

§Conclusion

I think socket.io is a very useful piece of technology, and is incredibly relevant today in spite of the popular view that widespread support for WebSockets makes it redundant. I would recommend that it be used for highly interactive applications, its namespacing in particular is its strongest point.

When using it, I recommend not taking advantage of the multi-argument and acknowledgement features, rather, simply use plain single argument events. This allows it to integrate well with any reactive streaming technology that you want to use, and allows you to use causal ordering based tracking of events to implement messaging guarantees.

The underlying protocol is a bit of a mess, and this is compounded by the lack of a specification. That particular point can be fixed, and I hope will be, but it will require more than just someone like myself writing a spec for it, it requires the maintainers of the socket.io reference implementations to be committed to working within the spec, and ensuring compatibility of new features with all implementations going forward. There’s no point in having a spec if it’s not followed. This will introduce friction into how quickly new features can be added, but this is a natural consequence of more collaboration, and more collaboration is a good thing.

About

Hi! My name is James Roper, and I am a software developer with a particular interest in open source development and trying new things. I program in Scala, Java, Go, PHP, Python and Javascript, and I work for Lightbend as the architect of Kalix. I also have a full life outside the world of IT, enjoy playing a variety of musical instruments and sports, and currently I live in Canberra.