
Karl Matthias

VP of Architecture at Community.com. Co-Author of "Docker: Up and Running" from O'Reilly Media. Dublin, Ireland.


Building Tiny, Reliable Docker Container Images

Building good, clean Docker container images is a bit of an art, and there is a lot of conflicting advice out there about how to build them properly. I'd like to share some thoughts gained from two years of running Docker containers in production at New Relic. Some of this is discussed in the O'Reilly book Docker: Up and Running, which I co-wrote with Sean Kane. There are all kinds of best practices we could talk about; here I'll focus on a few aimed at making images small and reliable.

Size Matters

At scale you're going to be shipping around tens, hundreds, or thousands of copies of your image. Docker's distribution mechanism means that each host needs a local copy of all of the layers of your image. If your image is 800MB, then that's 800MB of data every host has to pull at least once. There are a few prongs of attack to get that size down to a minimum, and some of them also make your images more reliable, as you'll see.

Use Standard Base Images

Savings: Depends. Big environments will save a lot, smaller ones won’t.

Build all of your production containers on top of a simple set of base images. If your base image is big (usually meaning it’s based on a full OS distro), then with this pattern you still have to pull all of those layers, but you don’t have to do it for every application on each deployment. Since the layers are shared between apps, the overhead is reduced. There are all kinds of commonality benefits to be had here, too, so it’s a good pattern even if you don’t need the space.

I recommend constructing an image hierarchy, with a build job that rebuilds, re-tags, and re-pushes all of the affected base images whenever any upstream image in the tree changes. This pattern has worked out well in my experience. Because every affected node in the tree is rebuilt when a change is pushed, anyone building an application image derived from one of those base layers gets the newest version of the base they depend on, even when the change happened further upstream. It also means you detect breaking changes to downstream images immediately, not when an application build fails down the road.

Here’s a linear example:

OS base -> Webapp base -> Ruby webapp -> Your Application

The behavior you want is that if security updates are applied to the "OS base" image, then "Webapp base" and "Ruby webapp" are automatically rebuilt, re-tagged, and re-pushed. Or imagine that the Nginx config in "Webapp base" is improved: you want all future builds of "Your Application" to pick that up, even though it's built from "Ruby webapp" and not "Webapp base" directly.
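As a concrete sketch, the Dockerfiles for that chain might look like the following. All of the image names, tags, and package choices here are hypothetical:

    # os-base/Dockerfile -- the root of the tree
    FROM ubuntu:14.04
    # Security updates land here and flow down to every descendant
    RUN apt-get update && apt-get -y upgrade

    # webapp-base/Dockerfile
    FROM mycompany/os-base:latest
    RUN apt-get install -y nginx
    COPY nginx.conf /etc/nginx/nginx.conf

    # ruby-webapp/Dockerfile
    FROM mycompany/webapp-base:latest
    RUN apt-get install -y ruby

    # your-application/Dockerfile
    FROM mycompany/ruby-webapp:latest
    COPY . /app

A build job watching os-base then only has to rebuild and re-push webapp-base and ruby-webapp, in that order, for every future application build to inherit the change.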

Don’t Ship a Whole Linux Distribution

Savings: Huge.

If you're building and deploying Docker containers FROM ubuntu or FROM centos and the like, you may be causing yourself a lot of unnecessary pain. In some cases this is the right thing to do, but for many applications you can get away with much less. I'll talk about the most minimal option in the next section. For now, let's assume that you need a shell and maybe some other tools to bootstrap your application; that's why you're using one of the big distro base images. The good news is that there are great alternatives out there. I won't go into all of them, I'll just tell you about my favorite: Alpine Linux. This is a tiny distribution, aimed at embedded systems and other small installations. It's perfect for containers because it has a full package manager, a lot of available packages, and a well-maintained Docker base image.

So, next time try FROM gliderlabs/alpine:3.3 and see if it works for you.
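Here's a minimal sketch of what that can look like; the package and binary names are illustrative:

    FROM gliderlabs/alpine:3.3

    # apk is Alpine's package manager; --no-cache keeps the
    # package index out of the image layer
    RUN apk add --no-cache ca-certificates

    COPY myapp /usr/local/bin/myapp
    CMD ["/usr/local/bin/myapp"]

The resulting base weighs in at a few megabytes rather than hundreds.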

Don't Ship Any Distribution at All

Savings: Potentially huge. In many cases, no benefit.

If you can get away with it, you can ship the most minimal image of all: just your application binary and assorted supporting files. Rather than even using Alpine Linux as your base, here you declare FROM scratch. This is a great way to ship applications written in Go, Rust, C, or other compiled languages where the artifact can be a statically linked binary. If you're running a JVM, Python, Ruby, or Node app, then this is probably not a solution for you. But if you can get away with it, there is basically 12KB of overhead on top of your application. That's pretty minimal! Your Dockerfile then shrinks to a few lines: FROM scratch, then adding your application and its configs.
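For example, here's a sketch for a Go application; the binary and config names are hypothetical. Building with CGO_ENABLED=0 is one way to make sure the binary is statically linked and doesn't expect a libc inside the container:

    # Build the static binary first, outside the image:
    #   CGO_ENABLED=0 go build -o myapp .

    FROM scratch
    COPY myapp /myapp
    COPY config.json /config.json
    # There is no shell in a scratch image, so CMD must use the
    # exec (JSON) form rather than a shell string
    CMD ["/myapp"]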

Process Management is Important

You need a program at the top of your container that is designed to be run as PID 1 on a Linux system. On the major distributions that process is SysV init, Upstart, or systemd. You need something like that in your container, too. Phusion wrote a good post explaining why this is important, so I won't rehash it here. Suffice it to say that you need a real PID 1 process, one that handles signals and reaps orphaned children, at the top of your container's process tree. But there are other reasons, too.

There is a very worthy goal in the Docker community of running as little in your container as possible, and I applaud that. But reality is a harsh master, and it turns out that in widespread production deployments lots of things go wrong no matter how well you build them. And as deployments get more fluid, with changes flying constantly, they break more often. The best defense is a platform that schedules your containers onto hosts and manages their life cycle: Mesos with Marathon, Kubernetes, Deis, and friends. Even so, it's often the case that you need to run more than one process in your container, and so you must have something inside the container that keeps those processes running and maintains the dependencies between them. Docker and outside schedulers can see the container, but if one process inside it dies, they may not notice. Don't go overboard running lots of things in a container; your container should do one thing. But doing one thing doesn't mean you must have exactly one process.

Enter the process manager. A lot of alternatives exist in this space, but you want something that can act as a real PID 1 and also does active process management. Phusion recommends runit, which is a perfectly fine solution. If you are running a big distro image, you can easily use Upstart or systemd. I have used Supervisor extensively and can recommend its robustness, but it doesn't handle the other duties of PID 1 (reaping children, signals, etc.). Another negative is that it needs a whole Python environment, which can add 25-40MB to your container size.

My new favorite in this space is Skaware's s6, a not-that-well-known alternative that is dead simple to use and configure. It's minimalist without giving up capability. The binaries are statically linked, very small, and each does one thing (take that, systemd). Installing it in your container image involves unpacking a single tarball and dropping in a config file.
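A sketch of that setup follows. The tarball name, install paths, and service name are all illustrative; check the s6 documentation for the current release and layout:

    # In the Dockerfile: ADD auto-extracts a local tarball into /
    ADD s6-x.y.z-linux-amd64-bin.tar.gz /

    # One directory per supervised service, each containing an
    # executable "run" script
    COPY services/myapp/run /etc/s6/myapp/run

    # s6-svscan runs as PID 1 and supervises everything under /etc/s6
    CMD ["/bin/s6-svscan", "/etc/s6"]

And the run script itself is just a few lines of shell:

    #!/bin/sh
    # /etc/s6/myapp/run -- s6 restarts this automatically if it dies
    exec /usr/local/bin/myapp

If myapp crashes, its s6 supervisor restarts it immediately, without Docker or an outside scheduler ever seeing the container die.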

One final point in this section: if you're not running a distributed scheduler, you owe it to yourself to use a process manager even if there's only one other process in your container. Without something managing your application's health, you're relying on Docker to restart your dead container for you, and my experience says that's much worse than a 50/50 proposition. A process manager will make sure your container stays where you put it. Without one, you're on an upside-down roller coaster with no safety bar.

Conclusion

That's some coverage of a few simple ideas that can have a big impact on the reliability and sustainability of your Docker images. Docker is a great tool, but getting the platform around it right is not trivial. This advice is based on two years of real production Docker use, during which we shipped 75+ application deployments on Docker per day. It should stand you in good stead.

