Calling Go from Elixir with a CNode (in Crystal!)

At Mozi, we needed to connect a new Elixir Phoenix LiveView app to an existing Go backend. This is how we did it.

Background

We have a backend built in Go, which is fully-evented along the lines of the patterns we used at Community, described in a previous post. In order to support all of that, we have some hefty internal libraries and existing patterns, and get a lot for free by building on top of them.

Previously, the only frontend to our application was an iOS app, which is also event-sourced and evented. Tom Patterer and I wanted to add a web app to the mix to support scenarios outside of the iOS app, either because they work better on the web or so that we can support Android and desktop browsers, to a limited extent for now. We chose Phoenix LiveView for the web frontend because it is a great fit for this kind of web app, the main backend developers at Mozi already know Elixir well, and the comparable Go live implementation is not as robust or complete.

However, we really didn’t want to have to rewrite/duplicate a lot of the code that handles events in our existing Go services, or maintain two different stacks. It would be great if the Elixir app could just call the Go code.

Solutions We Didn’t Use

I have actually done this before and it works: you can compile the Go code into a C ABI library and then call it from Elixir via NIFs (Native Implemented Functions). If you aren’t familiar with the BEAM ecosystem (Erlang VM on which Elixir runs), NIFs are foreign function interface glue that allows you to call C code from code running on the BEAM. There are some problems with this approach (not in order of importance):

  1. You have two runtimes with complex internal schedulers running in the same process, potentially competing with each other for resources.

  2. One of the great things about the BEAM is that it is super robust and with OTP applications (OTP is the framework built originally to run phone switches), you get a lot of fault tolerance built in. But now you have some C code in the BEAM and that has to be exactly right or your Elixir app will crash.

  3. Compiles and builds become a mess because your build for your Elixir app is either linked against the C (Go) library, or you create an Elixir lib that wraps the C library. In either case you end up with a painful build somewhere.

There are probably some other issues I haven’t mentioned. It’s not a great solution.

Ports are another option. A port is a sub-process running on the other end of a pipe that you control from the BEAM process. This is a bit better because the Go code runs in a separate process, so there is no worry about two schedulers competing in the same process. However, the overhead is higher and you are still not fully decoupled because the BEAM process has to manage the running port and sub-process. This was the most viable option other than the one we chose, which allows even more decoupling.

C Nodes to the Rescue

The option we actually chose was to implement what is called a “C Node” in the Erlang ecosystem. There is a library that ships with the Erlang distribution called erl_interface that allows you to implement a BEAM distribution node in C. What that means is that you can write C code that will talk to the Elixir/Erlang application over the native distribution protocol used to connect BEAM nodes together.

This is a great option because, while it does introduce more overhead than NIFs, it allows you to fully decouple the codebases from each other at both compile time and runtime. All you need to do is write a lightweight wrapper library on the Elixir side that makes it easy to call the remote node using native Elixir functions like send/2. On the C side, you use the library to decode the distribution messages and process them as needed, then call the library to return data back or make other remote calls. If the nodes are connected, sending to the remote node feels just like making a normal function call from the Elixir side.

What we did was build the Go code as a C ABI library. We then wrote a small C wrapper that processes some CLI args and environment variables, and starts looping on the inbound messages. It calls the Go code as needed from the C code. In this setup, main() is in the C code and the Go code is initialized and called from there.

The way it works is that the C code starts up and connects to the Elixir app on a well-known local address. This then begins distribution with the BEAM running the Elixir app. You can tell on the Elixir side if the C node is connected by calling Node.list(:hidden). On the C side the connect call either succeeds or fails, so you can easily manage retries as needed. In our case, embracing the “let it crash” philosophy, the process exits cleanly on failure, shutting down the Go code. It is then restarted by S6 running inside the container. Because we use a well-known name for the C node, it’s easy to tell if it is connected or not.

Discovering the remote node is handled in application.ex like this:

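# This is used when finding the events-sidecar node, overridden in tests
Application.put_env(:elmozi, :find_events_sidecar, fn ->
  # Returns the first hidden node whose name contains "events_sidecar".
  # hd/1 raises if nothing matches, i.e. the sidecar is not connected.
  Node.list(:hidden)
  |> Enum.filter(fn n -> String.contains?(Atom.to_string(n), "events_sidecar") end)
  |> hd()
end)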

As the comment says, this makes it easy to override the sidecar in tests, using a mock or stub as needed. Note that we did not implement support for Node.ping/1 on the C node; it is not necessary.

A Short Interlude

A small number of you (of the 5 of you who got this far) may now be asking… but why use C when there is a Go implementation of the BEAM distribution protocol? I had been watching this implementation, called Ergo, for a while. I wrote some simple stuff using it. There was always a bit of an issue with making sure it supported the latest OTP version. In the past I steered away from it because we couldn’t be sure that we would be able to upgrade Elixir and not suffer issues talking to Go. As of last fall, in a major departure, that project no longer supports the native distribution protocol. Instead, you must run a separate implementation on the BEAM side. And it’s now a commercial offering. Fair enough, the developer deserves to make some money, but I am glad we didn’t build anything serious on it.

Crystal Upgrade

Back to our C node. While I can write and maintain C code, I’m one of the few people here who can. So in order to improve the maintainability of the codebase, I decided to rewrite the C code in Crystal, a strongly typed language that looks and feels a lot like Ruby but is compiled to native machine code. This was not a major effort. It took some work to build the wrappers for the erl_interface library, but it wasn’t too bad. We did get this for free in C, but the Crystal wrappers are thin, and in the end the total line count (as a basic measure of complexity) of the Crystal code is still less than the C code. The result is a three language mash-up that actually feels pretty slick and fairly natural. It is certainly nicer to work on than the C code was.

To make life a little easier, we exposed some additional functions from Go to allow the Crystal code to log using our same Go logging setup and a few other basic infrastructure bits that we then don’t have to duplicate in Crystal. There is a performance penalty every time you cross the C/Go boundary, but for our use case, it’s not a big deal.

This is roughly what it looks like to work with the erl_interface library. Note that Elgo is the Crystal module wrapping the Go functions.

def handle_message(xbuf : Erlang::EiXBuff)
  # We predeclare these because we need to pass pointers to them to the Erlang code
  index = 0
  version = 0
  arity = 0
  pid = uninitialized Erlang::ErlPid
  atom_buf = StaticArray(UInt8, COMMAND_ATOM_MAX_SIZE).new(0)

  if Erlang.ei_decode_version(xbuf.buff, pointerof(index), pointerof(version)) != 0 ||
     Erlang.ei_decode_tuple_header(xbuf.buff, pointerof(index), pointerof(arity)) != 0 || arity != 2 ||
     Erlang.ei_decode_pid(xbuf.buff, pointerof(index), pointerof(pid)) != 0
    Elgo.elgo_log_error("Invalid message format".to_unsafe)
    return
  end

  if Erlang.ei_decode_tuple_header(xbuf.buff, pointerof(index), pointerof(arity)) != 0 || arity != 2
    Elgo.elgo_log_error("Failed to decode command tuple header".to_unsafe)
    return
  end

  if Erlang.ei_decode_atom(xbuf.buff, pointerof(index), atom_buf.to_unsafe) != 0
    Elgo.elgo_log_error("Failed to decode command".to_unsafe)
    return
  end

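  # atom_buf is zero-filled and the whole slice is copied, so command keeps
  # trailing NUL bytes; the starts_with? matching below tolerates that
  command = String.new(atom_buf.to_slice)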

  type = 0
  size = 0
  if Erlang.ei_get_type(xbuf.buff, pointerof(index), pointerof(type), pointerof(size)) != 0
    Elgo.elgo_log_error("Failed to get message type".to_unsafe)
    return
  end

  case type
  when ErlDefs::ERL_BINARY_EXT
    bin_buf = Bytes.new(size)
    bin_size = size.to_i64
    if Erlang.ei_decode_binary(xbuf.buff, pointerof(index), bin_buf, pointerof(bin_size)) == 0
      Elgo.elgo_log_info("Received binary message".to_unsafe)

      # Main switch for the functions we support
      case
      when command.starts_with?("allow_publish_event")
        # Call the Go code
        Elgo.elgo_add_allowed_publish_event(bin_buf)
      when command.starts_with?("notify_publish_ready")
        Elgo.elgo_notify_publish_ready(bin_buf)
      else
        Elgo.elgo_log_error("Unsupported command: #{command}".to_unsafe)
      end

    else
      Elgo.elgo_log_error("Failed to decode binary".to_unsafe)
    end
  else
    Elgo.elgo_log_error("Unsupported message type".to_unsafe)
  end
end
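
For readers who have not seen the wire format, the ei_decode_* calls above walk a buffer in the Erlang external term format: a version byte followed by tagged terms. The Go sketch below decodes a simplified message by hand to show what those calls are doing. It is an illustration only: it handles just a {command_atom, binary} tuple without the outer tuple and sender pid, it accepts only the two UTF-8 atom tags, and the names decoder and decodeCommand are made up for this example. Real code should keep using erl_interface.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// Tags from the Erlang external term format.
const (
	tagVersion       = 131 // leading version byte
	tagSmallTuple    = 104 // SMALL_TUPLE_EXT: 1-byte arity
	tagBinary        = 109 // BINARY_EXT: 4-byte big-endian length, then data
	tagAtomUTF8      = 118 // ATOM_UTF8_EXT: 2-byte length, then name
	tagSmallAtomUTF8 = 119 // SMALL_ATOM_UTF8_EXT: 1-byte length, then name
)

var errTruncated = errors.New("term is truncated")

// decoder plays the role of the buffer plus index pair passed to ei_decode_*.
type decoder struct {
	buf []byte
	pos int
}

// take returns the next n bytes and advances the index.
func (d *decoder) take(n int) ([]byte, error) {
	if n < 0 || n > len(d.buf)-d.pos {
		return nil, errTruncated
	}
	b := d.buf[d.pos : d.pos+n]
	d.pos += n
	return b, nil
}

// expect consumes one byte and checks that it equals want.
func (d *decoder) expect(want byte) error {
	b, err := d.take(1)
	if err != nil {
		return err
	}
	if b[0] != want {
		return fmt.Errorf("expected byte %d, got %d", want, b[0])
	}
	return nil
}

// atom decodes an atom name encoded with either UTF-8 atom tag.
func (d *decoder) atom() (string, error) {
	tag, err := d.take(1)
	if err != nil {
		return "", err
	}
	lenBytes := 1
	switch tag[0] {
	case tagSmallAtomUTF8:
	case tagAtomUTF8:
		lenBytes = 2
	default:
		return "", fmt.Errorf("unsupported atom tag %d", tag[0])
	}
	l, err := d.take(lenBytes)
	if err != nil {
		return "", err
	}
	n := int(l[0])
	if lenBytes == 2 {
		n = int(binary.BigEndian.Uint16(l))
	}
	name, err := d.take(n)
	return string(name), err
}

// bin decodes a BINARY_EXT term and returns its payload.
func (d *decoder) bin() ([]byte, error) {
	if err := d.expect(tagBinary); err != nil {
		return nil, err
	}
	l, err := d.take(4)
	if err != nil {
		return nil, err
	}
	return d.take(int(binary.BigEndian.Uint32(l)))
}

// decodeCommand decodes the version byte and a 2-tuple {atom, binary}.
func decodeCommand(msg []byte) (string, []byte, error) {
	d := &decoder{buf: msg}
	for _, b := range []byte{tagVersion, tagSmallTuple, 2} {
		if err := d.expect(b); err != nil {
			return "", nil, err
		}
	}
	cmd, err := d.atom()
	if err != nil {
		return "", nil, err
	}
	payload, err := d.bin()
	if err != nil {
		return "", nil, err
	}
	return cmd, payload, nil
}

func main() {
	// Hand-built bytes for the term {:notify_publish_ready, "hi"}.
	name := "notify_publish_ready"
	msg := append([]byte{tagVersion, tagSmallTuple, 2, tagSmallAtomUTF8, byte(len(name))}, name...)
	msg = append(msg, tagBinary, 0, 0, 0, 2, 'h', 'i')

	cmd, payload, err := decodeCommand(msg)
	fmt.Println(cmd, string(payload), err) // notify_publish_ready hi <nil>
}
```

Each method advances a shared position, just as each ei_decode_* call advances the index it is handed, so a failed step leaves a clear point at which to log and bail out.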

How It Works In Practice

It’s quite solid! We build and deploy the Crystal/Go code as a single Docker container, running in the same Kubernetes pod as the Elixir app. They are independently built and are only coupled by the deploy-time configuration that specifies which version of each to deploy in the pod.

Because both Crystal and Go are able to build easily on both macOS and Linux, we can develop locally on macOS and build and deploy on Linux. It took a bit of fiddling to find a Linux distribution to build on that has easy support for Crystal and that Go would run on properly.

The only hard part here was a temporary issue: in the end I had to make a custom build of the Go 1.24 compiler on Alpine Linux with a patch to properly support musl libc when starting under Cgo. This won’t be necessary for long. I did not write this patch myself; it was contributed to the Go project but has not yet shipped.

If there is enough interest, I will work to open source the Crystal wrapper we wrote to erl_interface so others can use it as well. Hit me up on Mastodon to let me know you are interested!