CRFS: Container Registry Filesystem

Discussion: https://github.com/golang/go/issues/30829

Overview

CRFS is a read-only FUSE filesystem that lets you mount a container image, served directly from a container registry (such as gcr.io), without pulling it all locally first.

Background

Starting a container should be fast. Currently, however, starting a container in many environments requires doing a pull operation from a container registry to read the entire container image from the registry and write the entire container image to the local machine's disk. It's pretty silly (and wasteful) that a read operation becomes a write operation. For small containers, this problem is rarely noticed. For larger containers, though, the pull operation quickly becomes the slowest part of launching a container, especially on a cold node. Contrast this with launching a VM on major cloud providers: even with a VM image that's hundreds of gigabytes, the VM boots in seconds. That's because the hypervisors' block devices are reading from the network on demand. The cloud providers all have great internal networks. Why aren't we using those great internal networks to read our container images on demand?

Why does Go want this?

Go's continuous build system tests Go on many operating systems and architectures, using a mix of containers (mostly for Linux) and VMs (for other operating systems). We prioritize fast builds, targetting 5 minute turnaround for pre-submit tests when testing new changes. For isolation and other reasons, we run all our containers in a single-use fresh VMs. Generally our containers do start quickly, but some of our containers are very large and take a long time to start. To work around that, we've automated the creation of VM images where our heavy containers are pre-pulled. This is all a silly workaround. It'd be much better if we could just read the bytes over the network from the right place, without the all the hoops.

Tar files

One reason that reading the bytes directly from the source on demand is somewhat non-trivial is that container images are, somewhat regrettably, represented by tar.gz files, and tar files are unindexed, and gzip streams are not seekable. This means that trying to read 1KB out of a file named /var/lib/foo/data still involves pulling hundreds of gigabytes to uncompress the stream, to decode the entire tar file until you find the entry you're looking for. You can't look it up by its path name.

Introducing Stargz

Fortunately, we can fix the fact that tar.gz files are unindexed and unseekable, while still making the file a valid tar.gz file by taking advantage of the fact that two gzip streams can be concatenated and still be a valid gzip stream. So you can just make a tar file where each tar entry is its own gzip stream.

We introduce a format, Stargz, a Seekable tar.gz format that's still a valid tar.gz file for everything else that's unaware of these details.

In summary:

That traditional *.tar.gz format is: Gzip(TarF(file1) + TarF(file2) + TarF(file3) + TarFooter))
Stargz's format is: Gzip(TarF(file1)) + Gzip(TarF(file2)) + Gzip(TarF(file3_chunk1)) + Gzip(F(file3_chunk2)) + Gzip(F(index of earlier files in magic file), TarFooter), where the trailing ZIP-like index contains offsets for each file/chunk's GZIP header in the overall stargz file.

This makes images a few percent larger (due to more gzip headers and loss of compression context between files), but it's plenty acceptable.

Converting images

If you're using docker push to push to a registry, you can't use CRFS to mount the image. Maybe one day docker push will push stargz files (or something with similar properties) by default, but not yet. So for now we need to convert the storage image layers from tar.gz into stargz. There is a tool that does that. TODO: examples

Operation

When mounting an image, the FUSE filesystem makes a couple Docker Registry HTTP API requests to the container registry to get the metadata for the container and all its layers.

It then does HTTP Range requests to read just the stargz index out of the end of each of the layers. The index is stored similar to how the ZIP format's TOC is stored, storing a pointer to the index at the very end of the file. Generally it takes 1 HTTP request to read the index, but no more than 2. In any case, we're assuming a fast network (GCE VMs to gcr.io, or similar) with low latency to the container registry. Each layer needs these 1 or 2 HTTP requests, but they can all be done in parallel.

From that, we keep the index in memory, so readdir, stat, and friends are all served from memory. For reading data, the index contains the offset of each file's GZIP(TAR(file data)) range of the overall stargz file. To make it possible to efficiently read a small amount of data from large files, there can actually be multiple stargz index entries for large files. (e.g. a new gzip stream every 16MB of a large file).

Union/overlay filesystems

CRFS can do the aufs/overlay2-ish unification of multiple read-only stargz layers, but it will stop short of trying to unify a writable filesystem layer atop. For that, you can just use the traditional Linux filesystems.

Using with Docker, without modifying Docker

Ideally container runtimes would support something like this whole scheme natively, but in the meantime a workaround is that when converting an image into stargz format, the converter tool can also produce an image variant that only has metadata (environment, entrypoints, etc) and no file contents. Then you can bind mount in the contents from the CRFS FUSE filesystem.

That is, the convert tool can do:

Input: gcr.io/your-proj/container:v2

Output: gcr.io/your-proj/container:v2meta + gcr.io/your-proj/container:v2stargz

What you actually run on Docker or Kubernetes then is the v2meta version, so your container host's docker pull or equivalent only pulls a few KB. The gigabytes of remaining data is read lazily via CRFS from the v2stargz layer directly from the container registry.

Status

WIP. Enough parts are implemented & tested for me to realize this isn't crazy. I'm publishing this document first for discussion while I finish things up. Maybe somebody will point me to an existing implementation, which would be great.

Discussion

See https://github.com/golang/go/issues/30829