Securing Akka cluster communication in Kubernetes

A feature of Akka that we’ve been using in production for some time now but haven’t made a big deal about is Akka remoting’s support for using mTLS certificates that are frequently rotated. This support is designed to work with cert-manager and other Kubernetes based secret providers with an absolute minimum of configuration. All you need is two lines of configuration in Akka’s configuration file, and you’re ready to go on the Akka side.

This feature is important for secure Akka deployments. Akka cluster nodes communicate with each other using a proprietary protocol which, by default, is neither authenticated nor encrypted. To prevent malicious hosts from joining your Akka cluster or eavesdropping on its communication, you need to secure that communication with mTLS.

In a Kubernetes environment, many people turn to a service mesh such as Istio, Linkerd or Consul to authenticate and encrypt their network communications. Unfortunately, this is not an option for Akka cluster communication. The goal of a service mesh is to ensure that services do not need to be aware of where and how the services they talk to are deployed. The mesh hides this, so that each service thinks it’s only talking to one logical service, while the mesh handles concerns such as load balancing, encryption, authentication and authorization, canary and A/B deploys, and so on. Akka clusters, however, need to understand how and where they are deployed in order to implement their stateful features, such as sharding, replication and P2P messaging. So, when deploying a service that uses Akka clustering to a service mesh, the Akka cluster communication must bypass the mesh.

§Prerequisites

In this blog post, I’ll explain how to provision the certificates needed by Akka using cert-manager. I’ll assume you have a Kubernetes cluster with a standard cert-manager installation.

§Understanding the certificates

First off, let me explain a few concepts. cert-manager has a concept of Certificates and Issuers. A Certificate is a CRD that you deploy, which cert-manager reconciles into a Kubernetes Secret containing a TLS certificate. The Certificate references an Issuer, and the Issuer describes how Certificates that reference it should be issued.

In order to support frequently rotated certificates, Akka can’t just use a self-signed certificate, since self-signed certificates need to be the same at both ends to authenticate each other properly, and while a certificate is being rotated, two different Akka nodes may hold different certificates. Instead, Akka needs certificates issued by a certificate authority (CA). Each node verifies the other’s certificate against the CA, so during rotation the old and the new certificate can work side by side, because both are signed by the same CA. So, when we issue our certificates, we’ll use cert-manager’s CA Issuer type.

The CA Issuer itself needs a certificate to do its signing, and we’ll provision this certificate using cert-manager too. That certificate we’re not going to rotate - its private key is never shared with anything outside of cert-manager, so rotating it is far less critical. Because of this, it will use a self-signed certificate, and provisioning that certificate can be done using cert-manager’s self-signed Issuer type.

So, in total, we’re going to have two issuers: a self-signed issuer that issues the certificate for the CA issuer, and the CA issuer itself, which issues the frequently rotated certificates that our Akka service will use. The self-signed issuer, its certificate, and the CA issuer can be reused across different Akka deployments - more on that later.

§Kubernetes resources

First we deploy the self-signed issuer:

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: self-signed-issuer
spec:
  selfSigned: {}

We’re creating this for the whole cluster: self-signed issuers don’t have any state or configuration, so there’s no reason to have more than one for your entire cluster.

Next we create a self-signed certificate for our CA issuer to use, which references the self-signed issuer:

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: akka-tls-ca-certificate
  namespace: default
spec:
  issuerRef:
    name: self-signed-issuer
    kind: ClusterIssuer
  secretName: akka-tls-ca-certificate
  commonName: default.akka.cluster.local
  # 100 years
  duration: 876000h
  # 99 years
  renewBefore: 867240h
  isCA: true
  privateKey:
    rotationPolicy: Always

We’ve created this in the default namespace, which will be the same namespace that our Akka service is deployed to. If you’re using a different namespace, you’ll need to update accordingly.

The commonName isn’t very important - it’s not actually used anywhere - though it may be useful for debugging purposes if you’re ever looking into why a particular certificate isn’t trusted by a service. We use a naming convention for common names and DNS names that follows the pattern <service-name>.<namespace>.akka.cluster.local. The CA uses the same convention without the service name. This convention doesn’t need to be followed, but it makes it easy to reason about the purpose of any given certificate.
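
At this point you can check that cert-manager has reconciled the Certificate into a Secret - for example, assuming kubectl is pointed at the default namespace:

kubectl get certificate akka-tls-ca-certificate
kubectl get secret akka-tls-ca-certificate

The Certificate should report Ready, and the Secret should contain ca.crt, tls.crt and tls.key entries.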

Now we create the CA issuer:

apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: akka-tls-ca-issuer
  namespace: default
spec:
  ca:
    secretName: akka-tls-ca-certificate

This uses the secret that we configured to be provisioned in the certificate above. Finally, we provision the certificate that our Akka service is going to use - we’re assuming that the name of the service in this case is my-service:

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: my-service-akka-tls-certificate
  namespace: default
spec:
  issuerRef:
    name: akka-tls-ca-issuer
  secretName: my-service-akka-tls-certificate
  dnsNames:
  - my-service.default.akka.cluster.local
  duration: 24h
  renewBefore: 16h
  privateKey:
    rotationPolicy: Always

The actual dnsName configured isn’t important, as long as it’s unique to the service within the issuer, since Akka cluster does not actually use these names for looking up the service. Akka’s mTLS support will verify that the DNS name supplied by an incoming connection matches the DNS name in its own secret, and reject the connection otherwise. Again, we’re using the naming convention for the dnsName mentioned above.

This certificate is configured to last for 24 hours, and to be renewed when it has 16 hours of validity remaining - that is, it’s rotated every 8 hours.
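
You can see the expiry and scheduled renewal of the current certificate by inspecting the Certificate resource - for example:

kubectl describe certificate my-service-akka-tls-certificate

The status shows the certificate’s Not After time and the Renewal Time at which cert-manager will rotate it.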

§Configuring Akka

To configure your Akka application, you need to have artery-based remoting enabled - which will be the case if you’ve followed the Akka guide for configuring cluster bootstrap in Kubernetes - plus the following additional configuration:

akka.remote.artery {
  transport = tls-tcp
  ssl.ssl-engine-provider = "akka.remote.artery.tcp.ssl.RotatingKeysSSLEngineProvider"
}

This instructs Akka to use TLS with the RotatingKeysSSLEngineProvider, an SSL engine provider that is designed to pick up Kubernetes TLS secrets and poll the file system to detect when they are rotated. It also applies authorization, by matching the DNS name of an incoming connection’s certificate against the DNS name of its own certificate.
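
If you haven’t set up cluster bootstrap yet, the surrounding configuration might look something like this - a minimal sketch, assuming the Akka Management and Akka Discovery Kubernetes dependencies are on your classpath:

akka {
  actor.provider = cluster

  remote.artery {
    transport = tls-tcp
    ssl.ssl-engine-provider = "akka.remote.artery.tcp.ssl.RotatingKeysSSLEngineProvider"
  }

  # Discover contact points for cluster bootstrap via the Kubernetes API
  management.cluster.bootstrap.contact-point-discovery.discovery-method = kubernetes-api
}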

§Configuring the Akka deployment

Having configured Akka and built a new Docker image, you can now configure your Akka deployment. To do this, you need to mount the certificate at the path /var/run/secrets/akka-tls/rotating-keys-engine. This is the default path that the RotatingKeysSSLEngineProvider uses to pick up its certificates. So, add the following volume to your pod:

      volumes:
      - name: akka-tls
        secret:
          secretName: my-service-akka-tls-certificate

And then you can mount that in your container:

        volumeMounts:
        - name: akka-tls
          mountPath: /var/run/secrets/akka-tls/rotating-keys-engine

Your complete deployment YAML, configured as described in our Kubernetes deployment guide, might look like this:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-service
  namespace: default
  labels:
    app: my-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-service
  template:
    metadata:
      labels:
        app: my-service
    spec:
      containers:
      - name: my-service
        image: my-service:latest
        readinessProbe:
          httpGet:
            path: /ready
            port: management
        livenessProbe:
          httpGet:
            path: /alive
            port: management
        ports:
        - name: management
          containerPort: 8558
          protocol: TCP
        - name: http
          containerPort: 8080
          protocol: TCP
        resources:
          limits:
            memory: 1024Mi
          requests:
            cpu: 2
            memory: 1024Mi
        volumeMounts:
        - name: akka-tls
          mountPath: /var/run/secrets/akka-tls/rotating-keys-engine
      volumes:
      - name: akka-tls
        secret:
          secretName: my-service-akka-tls-certificate

And now you have secured your Akka cluster communication with mTLS using frequently rotated certificates. This will prevent both eavesdropping and malicious services trying to join your Akka cluster.

Note that if you apply this to an existing running cluster deployment for the first time, you will need to do a full Akka cluster restart: the new nodes will attempt to speak TLS to the old nodes, while the old nodes are not configured to speak TLS, so the two will be unable to connect. The easiest way to restart the Akka cluster is to scale the deployment down to 0, and then back up to what it was before.
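
For example, with the deployment above, which has 3 replicas:

kubectl scale deployment my-service --replicas=0
kubectl scale deployment my-service --replicas=3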

If you have more Akka services that you wish to deploy in the same namespace, you can reuse the same CA Issuer - you only need to deploy an additional Certificate for each service.
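
For example, a hypothetical second service called my-other-service would need only its own Certificate, following the same conventions as before:

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: my-other-service-akka-tls-certificate
  namespace: default
spec:
  issuerRef:
    name: akka-tls-ca-issuer
  secretName: my-other-service-akka-tls-certificate
  dnsNames:
  - my-other-service.default.akka.cluster.local
  duration: 24h
  renewBefore: 16h
  privateKey:
    rotationPolicy: Always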

Fun doesn't mean compromising scalability

Today I read an interesting piece on InfoWorld about Meteor, "Meteor aims to make JavaScript programming fun again". It is an interview with Matt DeBergalis, a co-author of Meteor, about Meteor and why a developer would choose it. The title in particular resonated with me - "making programming fun again" is a catchphrase I have often used in presentations I've given about Play Framework.

As the demands on the applications we write shift, the technologies we use start to make it harder to meet them, and pretty soon we feel like we are always working against the technologies that are supposed to be helping us. By taking a step back, rethinking the technologies, and creating new ones better suited to today's demands, we can continue being productive writing modern applications, and it's then that development becomes fun again. Though not always the case, how much fun you have working with a particular technology is often well correlated with how well suited it is to the problems you are trying to solve, so there is some merit to switching to technologies that are more fun.

In this light, Meteor is not a bad framework; its approach to making web applications responsive to data updates is particularly interesting. Writing apps in it will definitely, at least initially, be very fun. But my reason for writing this post is one main gripe with the article: DeBergalis continually likened what Meteor achieves to Facebook, implying that Facebook could be implemented using Meteor. This couldn't be further from the truth.

While the end results are very similar - both are applications that update instantly as people interact with them - the approach that Facebook takes to writing their apps is the complete opposite of Meteor's. Meteor places a massive emphasis on "don't worry about how data is communicated, let the framework deal with that for you". Although I have not worked on Facebook myself, I am sure that their approach is all about how the data is communicated - they don't just let the framework deal with it for them.

The problem with Meteor's approach to web development is that it makes the same mistakes as some very old technologies that many people now loathe. I am going to highlight two such technologies.

The first is relational databases. The promise of relational databases was that you didn't have to worry about how your data was accessed - just store it in a normalised form, and let the database handle whatever load you throw at it, tuning with indexes where necessary. But the problem we found on the web is that that approach did not scale: denormalisation and caching became necessary in any app with even a modest load. And that's when NoSQL databases started popping up. NoSQL databases intentionally limited what you could do in them, forcing you to take a different perspective on your data - namely, how is it going to be read and written? They forced you to make decisions that would allow you to scale early in the design process, and we found that making these decisions early was key to successfully scaling a web application.

The second technology is n-tier application servers. The promise of application servers was that you didn't have to worry about deployment - you just wrote your applications, and let the application server worry about scalability and resilience. This led to people writing massive monolithic apps, where almost every function in the app depended on every other function, killing any chance of ever having either resilience or scalability. When performance became an issue, clustering was "turned on", and often performance went down. And that's when containerless microservice solutions started becoming popular - small services that could be individually scaled. These new architectures forced you to think about scalability up front, making those decisions early.

Are you seeing a pattern here? Letting the technology handle resilience and scaling for you is bad; forcing you to address them up front is good. But Meteor seems to be making the exact same mistakes that relational databases and n-tier application servers made. It's trying to hide those concerns from you, in the name of "making programming fun again". While fun at first, this is certainly not going to be fun when your site gets popular and starts falling over under the load it gets.

But maybe the Meteor developers have come up with a smart way to scale it. There are apparently two ways you can run multiple Meteor nodes, and the apparently better one is described here. The approach? Have each Meteor node tail the MongoDB oplog. Or, in plain English: make every write operation in the system go to every node in the cluster. I'll let you decide whether you think making that approach scale is fun.

As I said at the start, the title of the article resonated with me - but it seems that I have a very different idea of what's fun from what the authors of Meteor have. In my opinion, hiding the details of hard scaling problems is not fun. Rather, putting them in your face, and giving you the tools to solve them at the right time - now that's fun. This is exactly what Play Framework and Akka do - particularly Akka, where the assumption when you program is that every other part of the app is likely down or not responding, and you are forced to deal with what happens when that's the case. Using these technologies to solve these hard problems is not only fun, it's very satisfying - and seeing an app with 50,000 concurrent users broadcasting updates every second scale on only 10 nodes is exciting too!

The fun approach to hard problems is not to run away from them to something that pretends they don't exist. It's to embrace them head on, using technologies that are designed to help you do so.
