gotoz / runq
run regular Docker images in KVM/Qemu
runq is a hypervisor-based Docker runtime based on runc that runs regular Docker images in a lightweight KVM/Qemu virtual machine. The focus is on solving real problems, not on the number of features.
The key difference to a regular runc container is that the application runs inside a lightweight VM with its own guest kernel:
      runc container                runq container
+-------------------------+   +-------------------------+
|                         |   | +---------------------+ |
|                         |   | | VM                  | |
|                         |   | |                     | |
|                         |   | |                     | |
|       application       |   | |     application     | |
|                         |   | |                     | |
|                         |   | |                     | |
|                         |   | +---------------------+ |
|                         |   | |    guest kernel     | |
|                         |   | +---------------------+ |
|                         |   | qemu                    |
+-------------------------+   +-------------------------+
-----------------------------------------------------------
                       host kernel
The easiest way to build runq and to put all dependencies together is to use Docker. For fast development cycles a regular build environment might be more efficient. For this, refer to the section Developing runq below.
# get the source
git clone https://github.com/gotoz/runq.git
cd runq
# compile and create a release tar file in Docker container
make release
# install runq to `/var/lib/runq`
make release-install
Register runq as a Docker runtime with appropriate defaults. See daemon.json for more options.
/etc/docker/daemon.json
{
  "runtimes": {
    "runq": {
      "path": "/var/lib/runq/runq",
      "runtimeArgs": [
        "--cpu", "1",
        "--mem", "256",
        "--dns", "8.8.8.8,8.8.4.4"
      ]
    }
  }
}
reload Docker config
systemctl reload docker.service
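To check that the runtime has been registered, the list of runtimes known to the daemon can be inspected; one quick way (the exact output format varies with the Docker version):
docker info | grep -i runtimes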
Note: To deploy runq on further Docker hosts, only /var/lib/runq and /etc/docker/daemon.json must be copied.
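A minimal sketch of such a deployment, assuming SSH root access and the placeholder host name host2:
# copy the runq installation and the Docker config to the new host
scp -pr /var/lib/runq root@host2:/var/lib/
scp -p /etc/docker/daemon.json root@host2:/etc/docker/
# activate the new runtime
ssh root@host2 systemctl reload docker.service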
the simplest example
docker run --runtime runq -ti busybox sh
custom VM with 512MiB memory and 2 CPUs
docker run --runtime runq -e RUNQ_MEM=512 -e RUNQ_CPU=2 -ti busybox sh
allow loading of extra kernel modules by adding the SYS_MODULE capability
docker run --runtime runq --cap-add sys_module -ti busybox sh -c "modprobe brd && lsmod"
full example: PostgreSQL with custom storage
dd if=/dev/zero of=data.img bs=1M count=200
mkfs.ext4 -F data.img
docker run \
--runtime runq \
--name pgserver \
-e RUNQ_CPU=2 \
-e RUNQ_MEM=512 \
-e POSTGRES_PASSWORD=mysecret \
-v $PWD/data.img:/dev/disk/writeback/ext4/var/lib/postgresql \
-d postgres:alpine
sleep 10
docker run \
--runtime runq \
--link pgserver:postgres \
--rm \
-e PGPASSWORD=mysecret \
postgres:alpine psql -h postgres -U postgres -c "select 42 as answer;"
# answer
# --------
# 42
# (1 row)
docker cli
   dockerd engine
      docker-containerd-shim
         runq container
          +--------------------------------------------------------+
          |                                                        |
 docker0  |                        +--------------------------+    |
 veth <------> veth                |  VM                      |    |
          |    `<--- macvtap ------|-------> eth0             |    |
          |                        |                          |    |
          |   proxy                |         init             |    |
          |                        |                          |    |
          |   msg, signals <-------|-------> vport            |    |
          |                        |                          |    |
          |   /overlayfs <---------|-------> /app             |    |
          |                        |                          |    |
          |   block dev <----------|-------> /dev/xvda        |    |
          |                        |                          |    |
          |                        +--------------------------+    |
          |                        |       guest kernel       |    |
          |                        +--------------------------+    |
          |  qemu                                                  |
          |                                                        |
          +--------------------------------------------------------+
--------------------------------------------------------------------------
                                 host kernel
runq consists of the following components:
cmd/runq
cmd/proxy
cmd/init
qemu (populates /var/lib/runq/qemu)
initrd
pkg/vm
pkg/util
runq runs Qemu and the Linux kernel from the /var/lib/runq/qemu directory on the host. This directory is populated by make -C qemu. For simplicity, Qemu and the Linux kernel are taken from the Ubuntu 18.04 LTS Docker base image. See qemu/x86_64/Dockerfile for details. This makes runq independent of the Linux distribution on the host, and Qemu does not need to be installed on the host.
The kernel modules directory (/var/lib/runq/qemu/lib/modules) is bind-mounted into every container to /lib/modules. This allows the loading of extra kernel modules in any container if needed. For this, the SYS_MODULE capability is required (--cap-add sys_module).
runq uses Macvtap devices to connect the Qemu VirtIO network interfaces to Docker bridges. By default, a single Ethernet interface is created. Multiple networks can be used by connecting a container to the networks before start, as sketched below. See test/integration/net.sh as a complete example.
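A sketch of this pattern, with placeholder network and container names:
# create two custom networks
docker network create net1
docker network create net2
# create the container on the first network, connect the second one
# before start; the VM then gets one Ethernet interface per network
docker create --runtime runq --network net1 --name multinic -ti busybox sh
docker network connect net2 multinic
docker start -ai multinic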
Docker uses an embedded DNS server (127.0.0.11) for containers that are connected to custom networks. This IP is not reachable from within the VM. Therefore, DNS for runq containers must be configured separately.
DNS configuration can be done globally via runtime options specified in daemon.json (see the example above) or via environment variables for each container at container start. The environment variables are RUNQ_DNS, RUNQ_DNS_OPT and RUNQ_DNS_SEARCH. Environment variables have priority over global options.
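For example, to override the global DNS configuration for a single container (server address and search domain are placeholders):
docker run --runtime runq \
  -e RUNQ_DNS=9.9.9.9 \
  -e RUNQ_DNS_SEARCH=example.com \
  -ti busybox cat /etc/resolv.conf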
Extra storage can be added in the form of Qcow2 images, raw file images or regular block devices. Devices will be mounted automatically if they contain a supported filesystem and a mountpoint has been specified. Supported filesystems are ext2, ext3, ext4, xfs and btrfs.
The mount point must be prefixed with /dev/disk and one of the supported cache types (writeback, writethrough, none or unsafe). See man qemu(1) for details.
Syntax:
--volume <image name>:/dev/disk/<cache type>/<filesystem type>/<mountpoint>
--device <device name>:/dev/disk/<cache type>/<filesystem type>/<mountpoint>
Mount the existing Qcow2 image /data.qcow2 that contains an xfs filesystem to /mnt/data:
docker run --volume /data.qcow2:/dev/disk/writeback/xfs/mnt/data ...
Attach the host device /dev/sdb1 with an ext4 filesystem to /mnt/data2:
docker run --device /dev/sdb1:/dev/disk/writethrough/ext4/mnt/data2 ...
Attach the host device /dev/sdb2 but disable automatic mounting: use none as filesystem type and any unique ID to distinguish multiple devices. The device will show up as /dev/vda inside the container:
docker run --device /dev/sdb2:/dev/disk/writethrough/none/0001 ...
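For reference, the Qcow2 image from the first example could be created and formatted on the host with qemu-img and qemu-nbd, e.g. (a sketch; it assumes the nbd kernel module is available on the host):
# create a 200 MiB Qcow2 image
qemu-img create -f qcow2 /data.qcow2 200M
# attach it as a network block device and create the xfs filesystem
sudo modprobe nbd
sudo qemu-nbd --connect=/dev/nbd0 /data.qcow2
sudo mkfs.xfs /dev/nbd0
sudo qemu-nbd --disconnect /dev/nbd0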
By default, runq drops all capabilities except those needed, the same as regular Docker does. The whitelist of the remaining capabilities is provided by the Docker engine:
AUDIT_WRITE CHOWN DAC_OVERRIDE FOWNER FSETID KILL MKNOD NET_BIND_SERVICE NET_RAW SETFCAP SETGID SETPCAP SETUID SYS_CHROOT
See man capabilities for a list of all available capabilities. Additional capabilities can be added to the whitelist at container start:
docker run --cap-add SYS_TIME --cap-add SYS_MODULE ...
runq supports the default Docker seccomp profile as well as custom profiles.
docker run --security-opt seccomp=<profile-file> ...
The default profile is defined by the Docker daemon and is applied automatically. Note: Only the runq init binary is statically linked against libseccomp. Therefore, libseccomp is needed only at compile time. If the host operating system on which runq is being built does not provide static libseccomp libraries, libseccomp can also be built and installed from source.
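A sketch of such a build from the upstream sources (the exact steps may differ between libseccomp releases):
git clone https://github.com/seccomp/libseccomp.git
cd libseccomp
./autogen.sh
./configure --enable-static
make
sudo make install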
Seccomp can be disabled at container start:
docker run --security-opt seccomp=unconfined ...
Note: Some Docker daemons don't support custom seccomp profiles. Run docker info to verify that seccomp is supported by your daemon. If it is supported, the output of docker info looks like this:
Security Options:
seccomp
Profile: default
When sigusr is enabled, the directory /var/lib/runq/qemu/.runq will be bind-mounted into the container VM under /.runq (read-only). Sending the signal SIGUSR1 or SIGUSR2 to the container will then trigger the execution of /.runq/SIGUSR1 or /.runq/SIGUSR2 inside the VM. This feature must be enabled explicitly via the --sigusr runtime option (see daemon.json). The sigusr command runs with UID 0 and GID 0 and without environment variables. The seccomp profile and the capabilities are the same as for the application process. If this feature is not enabled, the signals SIGUSR1 and SIGUSR2 are forwarded to the application process as usual. Note: The default behavior of a process receiving SIGUSR1 or SIGUSR2 is to terminate.
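For illustration, a hypothetical SIGUSR1 handler could be installed on the host as follows (the path is fixed by the feature, the script contents are placeholders):
# handler scripts live in /var/lib/runq/qemu/.runq on the host
sudo mkdir -p /var/lib/runq/qemu/.runq
sudo tee /var/lib/runq/qemu/.runq/SIGUSR1 >/dev/null <<'EOF'
#!/bin/sh
echo "SIGUSR1 received at $(date)" >> /tmp/sigusr.log
EOF
sudo chmod 755 /var/lib/runq/qemu/.runq/SIGUSR1
The handler is then triggered by sending the signal to a running container: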
docker kill --signal SIGUSR1 <container ID>
Most Docker commands and options work as expected. However, because the target application runs inside a Qemu VM, which itself runs inside a Docker container, and because of the minimalistic design principle of runq, some Docker commands and options don't work.
The following common options of docker run are supported:
--attach          --name
--cap-add         --network
--cap-drop        --publish
--cpus            --restart
--cpuset-cpus     --rm
--detach          --runtime
--entrypoint      --sysctl
--env             --security-opt seccomp=unconfined
--env-file        --security-opt no-new-privileges
--expose          --security-opt seccomp=<filter-file>
--group-add       --tmpfs
--help            --tty
--hostname        --ulimit
--interactive     --user
--ip              --volume
--link            --volumes-from
--mount           --workdir
A nested VM is a virtual machine that runs inside a virtual machine. In plain KVM this feature is considered working but not meant for production use. Running KVM guests inside guests of other hypervisors such as VMware might not work as expected or might not work at all.
However, to try out runq in a VM guest, the (experimental) runq runtime configuration parameter --nestedvm can be used. It modifies the parameters of the Qemu process.
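For example, the flag could be added to the runtime arguments in daemon.json, analogous to the configuration example above (a sketch; the other arguments can be combined with it):
{
  "runtimes": {
    "runq": {
      "path": "/var/lib/runq/runq",
      "runtimeArgs": [
        "--nestedvm"
      ]
    }
  }
}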
For fast development cycles runq can be built on the host as follows:
Prerequisites:
/var/lib/runq must be writable by the current user
libseccomp-dev (Ubuntu) or libseccomp-static (Fedora) must be installed
# get the sources
go get -d -u github.com/opencontainers/runc
go get -d -u github.com/gotoz/runq
# install Qemu and the guest kernel to /var/lib/runq/qemu
cd $GOPATH/src/github.com/gotoz/runq
make -C qemu
# compile and install the runq components to /var/lib/runq
make install
sudo chown -R root:root /var/lib/runq
See CONTRIBUTING for details.
The code is licensed under the Apache License 2.0.
See LICENSE for further details.