When talking to people who have worked with Docker containers, you can learn a lot about the how of the concept: how to build one with Docker and do all sorts of fun things with it. But what struck me is that quite a few of us developers have never had a chance to lift the curtain and ask: what is Docker actually doing for us?
I'm a curious mind, so I dug a little deeper to understand what's really happening behind the scenes. In this article I'll go over creating containers without using Docker (except for the first-level Ubuntu host that I'll spin up with Docker, so we're effectively nesting containers) and manually running the commands that eventually give us a containerized environment. Think of it as a low-level, academic exercise. So sit tight!
Before we start, we've got to introduce...
Containers!
We know from Docker's website that containers are defined as "a standard unit of software that packages up code and all its dependencies so that the application runs quickly and reliably from one computing environment to another."
But if we were to oversimplify, we could say that at the core of a container are a few Linux kernel features bundled together that help achieve isolation of processes inside a host. We'll be going over three of them: chroot, namespaces and cgroups.
Cool! Let's start with chroot and take a crack at building our container manually.
chroot aka Linux jails:
chroot is a Linux command that allows you to set the root directory of a new process.
Consider the use case where two different parties are running processes/containers on the same host OS. As a security measure, it's important to make sure neither of them can see files outside of their respective containers. This is where chroot can help.
Let's dive in and see how it works, stepwise:
1. First, we need to sort out our host. I'll be using a Docker container running Ubuntu as my host. To do this, install Docker and simply run:
$ docker run -it --name docker-host --rm --privileged ubuntu:bionic
This will start up an Ubuntu container and drop us inside of it. I'd suggest not worrying about the details of the command above, but if you are curious, Google away!
2. After step 1, your prompt should show that you are at the root of the container you just created. Our aim here is to create a process that's limited to one folder of our system. So go ahead, make a new directory and cd into it:
$ mkdir eg-new-root
$ cd eg-new-root
3. To chroot, we use the command like so:
$ chroot . bash
This basically says: change the root to the current directory and run bash afterwards. But... this fails.
This is because bash is a program that we haven't installed yet in this new environment we just created.
To solve this, do the following:
$ cd ..
$ mkdir /eg-new-root/bin
$ cp /bin/bash /eg-new-root/bin/
$ chroot /eg-new-root bash
But... it still fails.
This is because programs like bash, cat, etc. need some shared libraries to run. We copied the binary but didn't bring over the libraries it needs yet. So let's do that. Running ldd shows the libraries required:
$ ldd /bin/bash
linux-vdso.so.1 (0x00007fffa89d8000)
libtinfo.so.5 => /lib/x86_64-linux-gnu/libtinfo.so.5 (0x00007f6fb8a07000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f6fb8803000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f6fb8412000)
/lib64/ld-linux-x86-64.so.2 (0x00007f6fb8f4b000)
The ones with no paths in front of them can be ignored, but the others need to be copied under proper directories, which we shall make now:
$ mkdir /eg-new-root/lib{,64}
Let's copy the libraries, making sure you get the file-to-directory mapping correct, so lib → /eg-new-root/lib and lib64 → /eg-new-root/lib64:
$ cp /lib/x86_64-linux-gnu/libtinfo.so.5 /lib/x86_64-linux-gnu/libdl.so.2 /lib/x86_64-linux-gnu/libc.so.6 /eg-new-root/lib
$ cp /lib64/ld-linux-x86-64.so.2 /eg-new-root/lib64
And voilà! Now chroot /eg-new-root bash should work beautifully and land you inside of bash in eg-new-root.
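As a quick sanity check (a rough sketch; your output may look slightly different depending on what you've copied in so far), the new shell should report / as its root and see nothing beyond it:
# inside the chroot-ed bash
$ pwd
/
$ echo *        # poor man's ls, using bash's own glob expansion
bin lib lib64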
Keep in mind, some commands like pwd come built in, so they will work just fine inside the new bash, but others like ls, cat, etc. won't. Now you know what to do when that happens: run through the same process as above for those binaries, starting with ldd /bin/ls or ldd /bin/cat.
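For example, here's a minimal sketch of bringing ls into the jail. The library paths below are what ldd printed for me on ubuntu:bionic; copy whatever it prints for you rather than trusting my list:
# back on the host (exit the chroot first)
$ cp /bin/ls /eg-new-root/bin/
$ ldd /bin/ls        # check which libraries ls actually needs on your system
$ cp /lib/x86_64-linux-gnu/libselinux.so.1 /lib/x86_64-linux-gnu/libc.so.6 \
     /lib/x86_64-linux-gnu/libpcre.so.3 /lib/x86_64-linux-gnu/libdl.so.2 \
     /lib/x86_64-linux-gnu/libpthread.so.0 /eg-new-root/lib
$ chroot /eg-new-root bash
$ ls /               # now it works inside the jail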
In our container use case, we just set the root directory to wherever the new container's root directory should be. And now, because the new file system has no visibility outside of its root, we are all good, right?
Wrong; we still have a major security flaw. We hid the two parties' file systems from each other, but what about their processes? Those are still exposed, so one party could, maliciously or by accident, kill one of the other party's processes. Enter namespaces!
Namespaces:
Namespaces allow you to hide one set of processes from another. If we give each chroot-ed environment different namespaces, each gets a whole different set of PIDs (process IDs), networking layer, etc., and hence processes in one environment are completely isolated from the other, solidifying security further.
Note: There's a lot more depth to namespaces than what I've outlined here. What we just described mainly concerns the PID and network namespaces; there are several more (mount, UTS, IPC, user, etc.) that help these containers stay isolated from each other.
The command we will be using here is unshare. It creates new, isolated namespaces for us.
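Before wiring unshare into our chroot-ed environment, here's a tiny standalone sketch of what it does, assuming the util-linux unshare that ships with Ubuntu (the output you see will vary):
# spin up a throwaway PID namespace on the host
$ unshare --pid --fork --mount-proc bash
$ ps aux             # only this bash (as PID 1) and ps itself are visible
$ exit               # back to the host, which still sees everything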
Let's start implementing this:
1. Make sure you are in the root directory of your host (type exit to leave your chroot-ed env).
2. We are going to create a new chroot-ed environment, which we will call improved-root, to demo namespaces. Here you can repeat the steps above to create improved-root, or use a tool called debootstrap that creates a totally new, chroot-able environment:
# update and install debootstrap
$ apt-get update -y
$ apt-get install debootstrap -y
$ debootstrap --variant=minbase bionic /improved-root
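As a quick check (listing abridged; the exact contents depend on your debootstrap version), the new root should now look like a miniature Ubuntu file system rather than the two hand-made directories we had before:
# peek at the bootstrapped root
$ ls /improved-root
bin  boot  dev  etc  home  lib  lib64  ...  sbin  sys  tmp  usr  var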
3. Now, if we cd into this new improved-root and run chroot . bash, everything just works and we have programs like ls, cat, etc. available to us.
Make sure to exit out of this bash before you proceed to the next step.
4. Now, from the host, we list everything we want to unshare for this container, like so:
# run this from the host; unshare also chroot's into /improved-root for us
$ unshare --mount --uts --ipc --net --pid --fork --user --map-root-user chroot /improved-root bash
# then, inside the new environment, mount the standard pseudo-filesystems
$ mount -t proc none /proc # so ps and friends see only this PID namespace
$ mount -t sysfs none /sys # kernel and device info
$ mount -t tmpfs none /tmp # in-memory scratch space
With this, we have created a new environment that's isolated on the system with its own PIDs, mounts (like storage and volumes), and network stack. Beware, though, that the host can still see and reach the processes in these unshare-d environments: what we did was restrict the improved-root container's ability to see processes outside of itself, not the other way around.
This is how namespaces restrict a container's ability to interfere with other containers on the same host.
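You can see both sides of that for yourself (a rough sketch; PIDs will differ on your machine):
# inside the unshare-d environment: only a handful of processes, bash is PID 1
$ ps aux
# meanwhile, on the host (another terminal): the containerized bash is still visible
$ ps aux | grep bash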
The two kernel features we've discussed so far have been built into Linux for a long time, but the last one we'll go over is relatively new and originated at Google.
cgroups:
Cool, now we've successfully containerized the file systems and processes. Yay! But hold up, this still isn't good enough. There is one major issue we haven't addressed yet: the resource-sharing rules.
Say one of the parties in our running example is an e-commerce website and a major sale day like Black Friday hits. The site floods with more visits than usual and ends up using all the resources available, bringing the other party's site/server down. Other scenarios might even include one party maliciously trying to bring down the other's servers by pegging all the resources at 100%. As you can see, we have a major problem at hand. Enter cgroups!
On Wikipedia, cgroups (abbreviated from control groups) are defined as a Linux kernel feature that limits, accounts for, and isolates the resource usage (CPU, memory, disk I/O, network, etc.) of a collection of processes.
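On a cgroup v1 setup like our Ubuntu bionic host, each resource controller appears as its own directory under /sys/fs/cgroup; a quick peek (listing abridged) looks something like this:
# each controller gets its own hierarchy on cgroup v1
$ ls /sys/fs/cgroup
blkio  cpu  cpuacct  cpuset  devices  freezer  memory  net_cls  pids  ...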
This is a bit more technical than the aforementioned features but let's take a crack at it:
1. We will hop outside of our unshare-d environment (back to the host) and grab some tools:
$ apt-get install -y cgroup-tools htop
htop is just another tool that helps us better visualize the processes running on the system. It's a personal preference and can be skipped.
2. Next we use cgcreate to create new cgroups. (Note: sandbox is just a name; you can choose a different one.)
# create new cgroups
$ cgcreate -g cpu,memory,blkio,devices,freezer:/sandbox
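If you want to confirm it worked (purely a sanity check, and the listing is abridged), each controller we named now has a sandbox directory full of tunable knobs:
# the new cgroup appears under every controller we asked for
$ ls /sys/fs/cgroup/cpu/sandbox
cgroup.procs  cpu.cfs_period_us  cpu.cfs_quota_us  cpu.shares  tasks  ...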
3. Note here that cgroups are applied, i.e. containers are limited, from the host; a container cannot limit itself. To put this into action, start the unshare-d environment (which chroots into /improved-root for us) again with the command:
$ unshare --mount --uts --ipc --net --pid --fork --user --map-root-user chroot /improved-root bash # this also chroot's for us and starts a bash with a PID that we can inspect from the host
4. Now hop back to the host (in a separate terminal, so the new bash keeps running) and add our unshare-d environment to the cgroup we created above.
# add our unshare'd env to our cgroup
$ ps aux # grab the bash PID that's right after the unshare one
$ cgclassify -g cpu,memory,blkio,devices,freezer:sandbox <PID> #PID of the bash right after the unshare
Because processes run in a tree-like structure, the command above added bash and all its child processes to the cgroup we created. But that's not all!
5. You can check which processes are in your cgroup with the cat command, as follows:
# list tasks associated to the sandbox cpu group, we should see the above PID
$ cat /sys/fs/cgroup/cpu/sandbox/tasks
# show the cpu share of the sandbox cpu group; this is the number that determines priority between competing resources, higher is higher priority
$ cat /sys/fs/cgroup/cpu/sandbox/cpu.shares
6. Now, to actually limit the resources, e.g. capping processing power for this sandbox cgroup at 5% and memory at 80M, we do:
# Limit usage at 5% for a multi core system
$ cgset -r cpu.cfs_period_us=100000 -r cpu.cfs_quota_us=$[ 5000 * $(getconf _NPROCESSORS_ONLN) ] sandbox
# Set a limit of 80M
$ cgset -r memory.limit_in_bytes=80M sandbox
# You can cross-check and get memory stats used by the cgroup
$ cgget -r memory.stat sandbox
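To convince yourself the cap is real, here's a rough experiment (numbers will wobble, and yes is just a convenient CPU burner): peg the CPU inside the sandboxed environment and watch it from the host.
# inside the cgroup-ed, unshare-d environment: burn CPU on purpose
$ yes > /dev/null
# on the host, in another terminal: watch the yes process
$ htop               # its CPU% should hover near the cap we set, not at 100%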
And congrats! With this forever-long process, you have now reached the absolute bare minimum of what can be called a container.
Some things we still haven't taken care of: how volumes would be handled, how networks would be managed, and a whole plethora of other things.
Enter Docker, the popular kid on the block
Now see this:
$ docker run --interactive --tty alpine:3.10 # or, to be shorter: docker run -it alpine:3.10
If you have Docker installed and run the one line above, it will drop you into an Alpine ash shell inside a container, as the root user of that container. When you're done, just run exit or hit CTRL+D.
So basically: everything we did so far, plus a whole world of other things, in just this one line. Mind blown yet? Mine is.
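And the knobs we just turned by hand map onto plain Docker flags; something like the following (an illustrative approximation, not an exact recreation of our sandbox) asks Docker to set up similar cgroup limits for you:
# roughly the limits we set manually, now as docker flags
$ docker run -it --memory=80m --cpus=0.05 alpine:3.10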
Conclusion
My aim with this article is to tickle your curious mind a little. And hopefully, the next time you run into unexpected behaviour while working with this awesome tool called Docker and it makes you want to throw your machine at the wall, you'll be a little more forgiving, knowing everything it's doing for you. Happy coding!