gotoz / runq
run regular Docker images in KVM/Qemu
runq is a hypervisor-based Docker runtime based on runc that runs regular Docker images in a lightweight KVM/Qemu virtual machine. The focus is on solving real problems, not on the number of features.
The key difference to a regular runc container is that the application runs inside a lightweight VM with its own guest kernel:
      runc container                runq container
+-------------------------+   +-------------------------+
|                         |   | +---------------------+ |
|                         |   | | VM                  | |
|                         |   | |                     | |
|                         |   | |                     | |
|       application       |   | |     application     | |
|                         |   | |                     | |
|                         |   | |                     | |
|                         |   | +---------------------+ |
|                         |   | |    guest kernel     | |
|                         |   | +---------------------+ |
|                         |   | qemu                    |
+-------------------------+   +-------------------------+
-----------------------------------------------------------
                       host kernel
The easiest way to build runq and to put all dependencies together is to use Docker. For fast development cycles a regular build environment might be more efficient. For this, refer to the section Developing runq below.
# get the source
git clone https://github.com/gotoz/runq.git
cd runq
# compile and create a release tar file in Docker container
make release
# install runq to `/var/lib/runq`
make release-install
Register runq as a Docker runtime with appropriate defaults. See daemon.json for more options.
/etc/docker/daemon.json
{
  "runtimes": {
    "runq": {
      "path": "/var/lib/runq/runq",
      "runtimeArgs": [
        "--cpu", "1",
        "--mem", "256",
        "--dns", "8.8.8.8,8.8.4.4"
      ]
    }
  }
}
reload Docker config
systemctl reload docker.service
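To check that the runtime has been registered, the list of runtimes known to the daemon can be inspected; one quick way (the exact output format varies with the Docker version):
docker info | grep -i runtimes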
Note: To deploy runq on further Docker hosts, only /var/lib/runq and /etc/docker/daemon.json must be copied.
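A minimal sketch of such a deployment, assuming SSH root access and the placeholder host name host2:
# copy the runq installation and the Docker config to the new host
scp -pr /var/lib/runq root@host2:/var/lib/
scp -p /etc/docker/daemon.json root@host2:/etc/docker/
# activate the new runtime
ssh root@host2 systemctl reload docker.service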
the simplest example
docker run --runtime runq -ti busybox sh
custom VM with 512MiB memory and 2 CPUs
docker run --runtime runq -e RUNQ_MEM=512 -e RUNQ_CPU=2 -ti busybox sh
allow loading of extra kernel modules by adding the SYS_MODULE capability
docker run --runtime runq --cap-add sys_module -ti busybox sh -c "modprobe brd && lsmod"
full example: PostgreSQL with custom storage
dd if=/dev/zero of=data.img bs=1M count=200
mkfs.ext4 -F data.img
docker run \
--runtime runq \
--name pgserver \
-e RUNQ_CPU=2 \
-e RUNQ_MEM=512 \
-e POSTGRES_PASSWORD=mysecret \
-v $PWD/data.img:/dev/disk/writeback/ext4/var/lib/postgresql \
-d postgres:alpine
sleep 10
docker run \
--runtime runq \
--link pgserver:postgres \
--rm \
-e PGPASSWORD=mysecret \
postgres:alpine psql -h postgres -U postgres -c "select 42 as answer;"
# answer
# --------
# 42
# (1 row)
docker cli
   dockerd engine
      docker-containerd-shim
         runq container
          +--------------------------------------------------------+
          |                                                        |
 docker0  |                        +--------------------------+    |
 veth <------> veth                |  VM                      |    |
          |    `<--- macvtap ------|-------> eth0             |    |
          |                        |                          |    |
          |   proxy                |         init             |    |
          |                        |                          |    |
          |   msg, signals <-------|-------> vport            |    |
          |                        |                          |    |
          |   /overlayfs <---------|-------> /app             |    |
          |                        |                          |    |
          |   block dev <----------|-------> /dev/xvda        |    |
          |                        |                          |    |
          |                        +--------------------------+    |
          |                        |       guest kernel       |    |
          |                        +--------------------------+    |
          |  qemu                                                  |
          |                                                        |
          +--------------------------------------------------------+
--------------------------------------------------------------------------
                                 host kernel
runq consists of the following components:
cmd/runq
cmd/proxy
cmd/init
qemu (populates /var/lib/runq/qemu)
initrd
pkg/vm
pkg/util
runq runs Qemu and the Linux kernel from the /var/lib/runq/qemu directory on the host. This directory is populated by make -C qemu. For simplicity, Qemu and the Linux kernel are taken from the Ubuntu 18.04 LTS Docker base image. See qemu/x86_64/Dockerfile for details. This makes runq independent of the Linux distribution on the host, and Qemu does not need to be installed on the host.
The kernel modules directory (/var/lib/runq/qemu/lib/modules) is bind-mounted into every container to /lib/modules. This allows the loading of extra kernel modules in any container if needed. For this, the SYS_MODULE capability is required (--cap-add sys_module).
runq uses Macvtap devices to connect the Qemu VirtIO network interfaces to Docker bridges. By default, a single Ethernet interface is created. Multiple networks can be used by connecting a container to the networks before start, as sketched below. See test/integration/net.sh as a complete example.
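A sketch of this pattern, with placeholder network and container names:
# create two custom networks
docker network create net1
docker network create net2
# create the container on the first network, connect the second one
# before start; the VM then gets one Ethernet interface per network
docker create --runtime runq --network net1 --name multinic -ti busybox sh
docker network connect net2 multinic
docker start -ai multinic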
Docker uses an embedded DNS server (127.0.0.11) for containers that are connected to custom networks. This IP is not reachable from within the VM. Therefore, DNS for runq containers must be configured separately.
DNS configuration can be done globally via runtime options specified in daemon.json (see the example above) or via environment variables for each container at container start. The environment variables are RUNQ_DNS, RUNQ_DNS_OPT and RUNQ_DNS_SEARCH. Environment variables have priority over global options.
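For example, to override the global DNS configuration for a single container (server address and search domain are placeholders):
docker run --runtime runq \
  -e RUNQ_DNS=9.9.9.9 \
  -e RUNQ_DNS_SEARCH=example.com \
  -ti busybox cat /etc/resolv.conf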
Extra storage can be added in the form of Qcow2 images, raw file images or regular block devices. Devices will be mounted automatically if they contain a supported filesystem and a mountpoint has been specified. Supported filesystems are ext2, ext3, ext4, xfs and btrfs.
The mount point must be prefixed with /dev/disk and one of the supported cache types (writeback, writethrough, none or unsafe). See man qemu(1) for details.
Syntax:
--volume <image name>:/dev/disk/<cache type>/<filesystem type>/<mountpoint>
--device <device name>:/dev/disk/<cache type>/<filesystem type>/<mountpoint>
Mount the existing Qcow2 image /data.qcow2 that contains an xfs filesystem to /mnt/data:
docker run --volume /data.qcow2:/dev/disk/writeback/xfs/mnt/data ...
Attach the host device /dev/sdb1 with an ext4 filesystem to /mnt/data2:
docker run --device /dev/sdb1:/dev/disk/writethrough/ext4/mnt/data2 ...
Attach the host device /dev/sdb2 but disable automatic mounting: use none as filesystem type and any unique ID to distinguish multiple devices. The device will show up as /dev/vda inside the container:
docker run --device /dev/sdb2:/dev/disk/writethrough/none/0001 ...
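For reference, the Qcow2 image from the first example could be created and formatted on the host with qemu-img and qemu-nbd, e.g. (a sketch; it assumes the nbd kernel module is available on the host):
# create a 200 MiB Qcow2 image
qemu-img create -f qcow2 /data.qcow2 200M
# attach it as a network block device and create the xfs filesystem
sudo modprobe nbd
sudo qemu-nbd --connect=/dev/nbd0 /data.qcow2
sudo mkfs.xfs /dev/nbd0
sudo qemu-nbd --disconnect /dev/nbd0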
By default, runq drops all capabilities except those needed, the same as regular Docker does. The whitelist of the remaining capabilities is provided by the Docker engine:
AUDIT_WRITE CHOWN DAC_OVERRIDE FOWNER FSETID KILL MKNOD NET_BIND_SERVICE NET_RAW SETFCAP SETGID SETPCAP SETUID SYS_CHROOT
See man capabilities for a list of all available capabilities. Additional capabilities can be added to the whitelist at container start:
docker run --cap-add SYS_TIME --cap-add SYS_MODULE ...
runq supports the default Docker seccomp profile as well as custom profiles.
docker run --security-opt seccomp=<profile-file> ...
The default profile is defined by the Docker daemon and is applied automatically. Note: Only the runq init binary is statically linked against libseccomp. Therefore, libseccomp is needed only at compile time. If the host operating system on which runq is being built does not provide static libseccomp libraries, libseccomp can also be built and installed from source.
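A sketch of such a build from the upstream sources (the exact steps may differ between libseccomp releases):
git clone https://github.com/seccomp/libseccomp.git
cd libseccomp
./autogen.sh
./configure --enable-static
make
sudo make install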
Seccomp can be disabled at container start:
docker run --security-opt seccomp=unconfined ...
Note: Some Docker daemons don't support custom seccomp profiles. Run docker info to verify that seccomp is supported by your daemon. If it is supported, the output of docker info looks like this:
Security Options:
seccomp
Profile: default
When sigusr is enabled, the directory /var/lib/runq/qemu/.runq will be bind-mounted into the container VM under /.runq (read-only). Sending the signal SIGUSR1 or SIGUSR2 to the container will then trigger the execution of /.runq/SIGUSR1 or /.runq/SIGUSR2 inside the VM. This feature must be enabled explicitly via the --sigusr runtime option (see daemon.json). The sigusr command runs with UID 0 and GID 0 and without environment variables. The seccomp profile and the capabilities are the same as for the application process. If this feature is not enabled, the signals SIGUSR1 and SIGUSR2 are forwarded to the application process as usual. Note: The default behavior of a process receiving SIGUSR1 or SIGUSR2 is to terminate.
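For illustration, a hypothetical SIGUSR1 handler could be installed on the host as follows (the path is fixed by the feature, the script contents are placeholders):
# handler scripts live in /var/lib/runq/qemu/.runq on the host
sudo mkdir -p /var/lib/runq/qemu/.runq
sudo tee /var/lib/runq/qemu/.runq/SIGUSR1 >/dev/null <<'EOF'
#!/bin/sh
echo "SIGUSR1 received at $(date)" >> /tmp/sigusr.log
EOF
sudo chmod 755 /var/lib/runq/qemu/.runq/SIGUSR1
The handler is then triggered by sending the signal to a running container: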
docker kill --signal SIGUSR1 <container ID>
Most Docker commands and options work as expected. However, because the target application runs inside a Qemu VM, which itself runs inside a Docker container, and because of the minimalistic design principle of runq, some Docker commands and options don't work.
The following common options of docker run are supported:
--attach          --name
--cap-add         --network
--cap-drop        --publish
--cpus            --restart
--cpuset-cpus     --rm
--detach          --runtime
--entrypoint      --sysctl
--env             --security-opt seccomp=unconfined
--env-file        --security-opt no-new-privileges
--expose          --security-opt seccomp=<filter-file>
--group-add       --tmpfs
--help            --tty
--hostname        --ulimit
--interactive     --user
--ip              --volume
--link            --volumes-from
--mount           --workdir
A nested VM is a virtual machine that runs inside a virtual machine. In plain KVM this feature is considered working but not meant for production use. Running KVM guests inside guests of other hypervisors such as VMware might not work as expected or might not work at all.
However, to try out runq in a VM guest, the (experimental) runq runtime configuration parameter --nestedvm can be used. It modifies the parameters of the Qemu process.
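For example, the flag could be added to the runtime arguments in daemon.json, analogous to the configuration example above (a sketch; the other arguments can be combined with it):
{
  "runtimes": {
    "runq": {
      "path": "/var/lib/runq/runq",
      "runtimeArgs": [
        "--nestedvm"
      ]
    }
  }
}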
For fast development cycles runq can be built on the host as follows:
Prerequisites:
/var/lib/runq must be writable by the current user
libseccomp-dev (Ubuntu) or libseccomp-static (Fedora) must be installed
# get the sources
go get -d -u github.com/opencontainers/runc
go get -d -u github.com/gotoz/runq
# install Qemu and the guest kernel to /var/lib/runq/qemu
cd $GOPATH/src/github.com/gotoz/runq
make -C qemu
# compile and install the runq components to /var/lib/runq
make install
sudo chown -R root:root /var/lib/runq
See CONTRIBUTING for details.
The code is licensed under the Apache License 2.0.
See LICENSE for further details.