Do you know what container layers are? How are they stored on the file system? How are they used to run a program? It’s time to answer these questions.
Containers are isolated environments for programs, and the foundation of programs is files. The program itself is an executable file, and almost every program also needs libc (the libc.so file), the time zone database (the /usr/share/zoneinfo directory), and a dynamic linker (the ld-linux.so file).
Containers are also self-sufficient: you can download one and run it without installing anything on the host system. To preserve this property, we must isolate the container’s files from the host’s files.
Chroot
The easiest way to achieve this isolation is chroot(2). We’ll use this syscall via a command with the same name: chroot(1).
Let’s build our first container with bash and ls.
$ which bash ls
/usr/bin/bash
/usr/bin/ls
$ mkdir -p ./container/usr/bin
$ cp /usr/bin/bash /usr/bin/ls ./container/usr/bin/
$ sudo chroot ./container /usr/bin/bash
chroot: failed to run command ‘/usr/bin/bash’: No such file or directory
No such file or directory? That doesn’t sound right. Let’s check it again.
$ ls ./container/usr/bin/bash
./container/usr/bin/bash
$ file /usr/bin/bash
/usr/bin/bash: ELF 64-bit LSB pie executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=33a5554034feb2af38e8c75872058883b2988bc5, for GNU/Linux 3.2.0, stripped
Oh, bash is a dynamically linked program. The kernel cannot find its dynamic linker: /lib64/ld-linux-x86-64.so.2. We need to copy its dependencies and the dynamic linker.
$ ldd /usr/bin/bash
linux-vdso.so.1 (0x00007fff3cff9000)
libtinfo.so.6 => /lib/x86_64-linux-gnu/libtinfo.so.6 (0x00007ffbb7b03000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007ffbb78db000)
/lib64/ld-linux-x86-64.so.2 (0x00007ffbb7c9f000)
$ ldd /usr/bin/ls
linux-vdso.so.1 (0x00007ffffc78d000)
libselinux.so.1 => /lib/x86_64-linux-gnu/libselinux.so.1 (0x00007f3f2ac51000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f3f2aa29000)
libpcre2-8.so.0 => /lib/x86_64-linux-gnu/libpcre2-8.so.0 (0x00007f3f2a992000)
/lib64/ld-linux-x86-64.so.2 (0x00007f3f2acaa000)
Libraries may need other libraries. Let’s check the dependencies of libtinfo.so.6.
$ ldd /lib/x86_64-linux-gnu/libtinfo.so.6
linux-vdso.so.1 (0x00007ffec9b42000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f786236a000)
/lib64/ld-linux-x86-64.so.2 (0x00007f78625cd000)
No new dependencies. The other libraries give the same result on Ubuntu 22.04.
Let’s put all dependencies into our container directory and try it again.
$ mkdir -p ./container/lib/x86_64-linux-gnu
$ cp \
/lib/x86_64-linux-gnu/libtinfo.so.6 \
/lib/x86_64-linux-gnu/libselinux.so.1 \
/lib/x86_64-linux-gnu/libpcre2-8.so.0 \
/lib/x86_64-linux-gnu/libc.so.6 \
./container/lib/x86_64-linux-gnu/
$ mkdir -p ./container/lib64
$ cp /lib64/ld-linux-x86-64.so.2 ./container/lib64
$ sudo chroot ./container /usr/bin/bash
bash-5.1# ls -al /
total 20
drwxrwxr-x 5 1000 1000 4096 Jan 30 13:06 .
drwxrwxr-x 5 1000 1000 4096 Jan 30 13:06 ..
drwxrwxr-x 3 1000 1000 4096 Jan 30 13:06 lib
drwxrwxr-x 2 1000 1000 4096 Jan 30 13:06 lib64
drwxrwxr-x 3 1000 1000 4096 Jan 30 12:58 usr
bash-5.1# exit
exit
It works! We can pack it, send it to another x86-64 Linux machine, and it’ll work there too.
$ tree ./container
./container
├── lib
│   └── x86_64-linux-gnu
│       ├── libc.so.6
│       ├── libpcre2-8.so.0
│       ├── libselinux.so.1
│       └── libtinfo.so.6
├── lib64
│   └── ld-linux-x86-64.so.2
└── usr
    └── bin
        ├── bash
        └── ls
5 directories, 7 files
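The manual copying above can be sketched as a pair of shell helpers. This is only an illustration: the function names deps_from_ldd and copy_with_deps are made up for this example, and the sketch assumes the ldd output format shown earlier. Since ldd already reports transitive dependencies, one pass per binary is enough.

```shell
# deps_from_ldd reads `ldd` output on stdin and prints the absolute paths of
# the shared objects it names, one per line. linux-vdso has no path and is skipped.
deps_from_ldd() {
    awk '
        $2 == "=>" && $3 ~ /^\// { print $3 }  # libc.so.6 => /lib/.../libc.so.6 (0x...)
        $1 ~ /^\//               { print $1 }  # /lib64/ld-linux-x86-64.so.2 (0x...)
    ' | LC_ALL=C sort -u
}

# copy_with_deps <rootdir> <binary> copies the binary and everything ldd
# reports for it into <rootdir>, keeping the same absolute paths.
copy_with_deps() {
    local root="$1" bin="$2" f
    { echo "$bin"; ldd "$bin" | deps_from_ldd; } | while read -r f; do
        mkdir -p "$root$(dirname "$f")"
        cp "$f" "$root$f"   # cp follows symlinks, so we get the real file
    done
}
```

With these helpers, `copy_with_deps ./container /usr/bin/bash` followed by `copy_with_deps ./container /usr/bin/ls` should produce the same tree as the manual steps.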
Chroot is not a security feature
Some programs need /proc. Let’s mount it and check what we have there.
$ mkdir ./container/proc
$ sudo mount -t proc proc ./container/proc
$ sudo chroot ./container /usr/bin/bash
bash-5.1# ls -al /proc/1/cwd/
total 72
drwxr-xr-x 20 0 0 4096 Jan 20 23:11 .
drwxr-xr-x 20 0 0 4096 Jan 20 23:11 ..
lrwxrwxrwx 1 0 0 7 Jan 10 02:08 bin -> usr/bin
drwxr-xr-x 3 0 0 4096 Jan 21 11:17 boot
drwxr-xr-x 17 0 0 3840 Jan 20 23:10 dev
drwxr-xr-x 95 0 0 4096 Jan 29 22:21 etc
drwxr-xr-x 4 0 0 4096 Jan 20 23:10 home
lrwxrwxrwx 1 0 0 7 Jan 10 02:08 lib -> usr/lib
lrwxrwxrwx 1 0 0 9 Jan 10 02:08 lib32 -> usr/lib32
lrwxrwxrwx 1 0 0 9 Jan 10 02:08 lib64 -> usr/lib64
lrwxrwxrwx 1 0 0 10 Jan 10 02:08 libx32 -> usr/libx32
drwx------ 2 0 0 16384 Jan 10 02:10 lost+found
drwxr-xr-x 2 0 0 4096 Jan 10 02:08 media
drwxr-xr-x 2 0 0 4096 Jan 10 02:08 mnt
drwxr-xr-x 2 0 0 4096 Jan 10 02:08 opt
dr-xr-xr-x 167 0 0 0 Jan 20 23:10 proc
drwx------ 4 0 0 4096 Jan 24 09:44 root
drwxr-xr-x 34 0 0 1040 Jan 30 12:52 run
lrwxrwxrwx 1 0 0 8 Jan 10 02:08 sbin -> usr/sbin
drwxr-xr-x 6 0 0 4096 Jan 10 02:09 snap
drwxr-xr-x 2 0 0 4096 Jan 10 02:08 srv
dr-xr-xr-x 13 0 0 0 Jan 20 23:10 sys
drwxrwxrwt 11 0 0 4096 Jan 30 12:52 tmp
drwxr-xr-x 14 0 0 4096 Jan 10 02:08 usr
drwxr-xr-x 1 1000 1000 160 Jan 23 01:42 vagrant
drwxr-xr-x 13 0 0 4096 Jan 10 02:09 var
bash-5.1#
Oops, we’ve got access to the host files.
Chroot is not a security feature, and this is not the only way to break out of it. You can read more in these articles:
- Breaking out of CHROOT Jailed Shell Environment by Gurkirat Singh
- chw00t by Balazs Bucsay et al.
Even though our container isn’t secure at all, it’s good enough for demonstration purposes, so let’s clean up and move on.
$ sudo umount ./container/proc
$ rm -rf ./container
BusyBox
We can avoid all this dynamic-library fuss if we use BusyBox.
$ mkdir -p ./container/bin
$ cp "$(which busybox)" ./container/bin/sh
$ sudo chroot ./container /bin/sh
BusyBox v1.30.1 (Ubuntu 1:1.30.1-7ubuntu3) built-in shell (ash)
Enter 'help' for a list of built-in commands.
/ # ls /
bin
/ # exit
BusyBox is a statically linked executable, so it doesn’t need any dependencies. If you don’t know about BusyBox, I’d recommend reading its man page.
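To tell the two kinds of binaries apart before copying them, we can look at the description printed by file(1), like the one we saw for bash. A minimal sketch; the helper name is_dynamic is hypothetical, and it relies only on the "dynamically linked" wording seen in the output above.

```shell
# is_dynamic <description> succeeds if a file(1) description reports a
# dynamically linked executable, and fails otherwise.
is_dynamic() {
    case "$1" in
        *"dynamically linked"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage (assumes file(1) is installed; -L follows symlinks, -b omits the file name):
#   is_dynamic "$(file -bL /usr/bin/bash)" && echo "needs its libraries and the dynamic linker"
```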
Layers
Even putting a single binary like bash into an isolated root directory is quite a challenge. If you want to add something that isn’t already on your machine (a library, say), it’s even more difficult.
There is one thing that can help us: package managers. If we have a package manager in a container, we can use it to install additional programs and libraries.
We can create a root directory with only a package manager. This base root directory may be relatively small, but if we need to run a lot of containers, things can quickly get worse: each container needs its own copy of these base files, and creating them (i.e. creating a container) takes time and takes up disk space.
Luckily, we have a solution: the overlay filesystem. It can merge multiple directories into a single directory tree, and keep the base directory untouched.
mount \
-t overlay \
-o lowerdir=./lower1:./lower2,upperdir=./upper,workdir=./work \
overlay \
./mergeddir
If you read a file from ./mergeddir, it is looked up first in ./upper, then in ./lower1, and then in ./lower2; the first match wins. Writes go only to the ./upper directory. When you delete a file or directory, a special marker (a whiteout) is created in the upper directory. The workdir is an empty directory that the kernel uses internally; it must be on the same filesystem as upperdir. You can read more about overlayfs in the kernel documentation.
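The lookup order can be imitated in userspace with a few lines of shell. This is a simplified sketch, not how the kernel implements it: the function name overlay_resolve is made up, it only resolves individual files (real overlayfs also merges directories), and it treats any character device as a whiteout, whereas overlayfs uses a character device with device number 0/0.

```shell
# overlay_resolve <path> <upperdir> <lowerdir1> [<lowerdir2> ...]
# Prints the file that a read of <path> in the merged directory would use.
overlay_resolve() {
    local rel="$1" dir
    shift
    for dir in "$@"; do
        if [ -c "$dir/$rel" ]; then
            return 1   # whiteout: the file was deleted in this layer
        fi
        if [ -e "$dir/$rel" ] || [ -L "$dir/$rel" ]; then
            echo "$dir/$rel"
            return 0   # first match wins, lower layers are not consulted
        fi
    done
    return 1           # not present in any layer
}
```

For example, `overlay_resolve etc/passwd ./upper ./lower1 ./lower2` prints the path of the copy that would be visible through ./mergeddir.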
Overlayfs lets us keep a single copy of the base files and share it between containers. We can then run a container on top of the base files, install some programs, and take a snapshot of the resulting files (the contents of the upper directory). These snapshots are layers.
This concept of layers is at the heart of containers. Note that layers are numbered in the opposite order to lowerdir: lowerdir lists the topmost layer first, while the first layer is the bottom one. The second layer overrides the content of the first layer.
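Because the order is easy to get wrong, it can help to build the lowerdir value from a list of layers given bottom-up. A small sketch; the function name lowerdir_for and the directory names in the usage line are hypothetical.

```shell
# lowerdir_for <layer1> [<layer2> ...]
# Layers are given oldest (bottom) first; the result lists the newest (top)
# layer first, colon-separated, which is the order lowerdir= expects.
lowerdir_for() {
    local lowerdir="" layer
    for layer in "$@"; do
        # Prepend each newer layer; add ":" only if something is already there.
        lowerdir="$layer${lowerdir:+:$lowerdir}"
    done
    echo "$lowerdir"
}

# Usage:
#   lowerdir_for ./base ./deps ./app   # prints ./app:./deps:./base
```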
Creating a container with multiple layers
Let’s create the layers of our container. We’ll use BusyBox this time.
mkdir ./data ./data/layer1 ./data/layer2 ./data/layer3
# layer 1
mkdir ./data/layer1/bin ./data/layer1/etc
cp "$(which busybox)" ./data/layer1/bin/sh
echo 'root:x:0:0:root:/:/bin/sh' >./data/layer1/etc/passwd
# layer 2
cat >./data/layer2/myapp.sh <<'END'
#!/bin/sh -x
echo .* *
. /.env
echo "$MSG"
END
chmod +x ./data/layer2/myapp.sh
# layer 3
mkdir ./data/layer3/etc
cp ./data/layer1/etc/passwd ./data/layer3/etc/passwd
echo 'user:x:1000:1000:user:/:/bin/sh' >>./data/layer3/etc/passwd
echo 'MSG="Hello, world!"' >./data/layer3/.env
Note that to add only one line, we copied the entire file. We would need to copy it even if we wanted to change permissions or other attributes.
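This "copy the whole file first, then change it" step can be written as a helper. The sketch below is only an illustration with a made-up name, copy_up; overlayfs does something similar on the first write to a file from a lower layer, which the kernel documentation calls copy-up.

```shell
# copy_up <path> <newlayer> <lowerdirN> [... <lowerdir1>]
# Copies <path> from the topmost lower layer that has it into <newlayer>,
# preserving mode and timestamps, so that it can be modified there.
copy_up() {
    local rel="$1" new="$2" dir
    shift 2
    for dir in "$@"; do
        if [ -f "$dir/$rel" ]; then
            mkdir -p "$(dirname "$new/$rel")"
            cp -p "$dir/$rel" "$new/$rel"
            return 0
        fi
    done
    echo "copy_up: $rel not found in any layer" >&2
    return 1
}

# Usage, equivalent to the mkdir and cp commands for layer 3:
#   copy_up etc/passwd ./data/layer3 ./data/layer2 ./data/layer1
```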
We also need an upperdir (we’ll call it diff), workdir, and mergeddir.
mkdir ./data/container ./data/container/diff ./data/container/work \
./data/container/merged
Let’s run it.
LOWERDIR="./data/layer3:./data/layer2:./data/layer1"
UPPERDIR="./data/container/diff"
WORKDIR="./data/container/work"
sudo mount \
-t overlay \
-o "lowerdir=$LOWERDIR,upperdir=$UPPERDIR,workdir=$WORKDIR" \
overlay \
./data/container/merged
sudo chroot ./data/container/merged /myapp.sh
The last command should print:
+ echo . .. .env bin etc myapp.sh
. .. .env bin etc myapp.sh
+ . /.env
+ MSG='Hello, world!'
+ echo 'Hello, world!'
Hello, world!
We’ve created a container that has 3 layers.
As a final step, we must clean up after ourselves.
sudo umount ./data/container/merged
rm -rf ./data
Tool for creating layers and running containers
Let’s create a bash script wiac (what-is-a-container) that automates everything we did before.
wiac:
#!/bin/bash -eu

TMPDIR="${TMPDIR:-/tmp}"
DATADIR="${DATADIR:-"$TMPDIR/wiac-data"}"

err() { echo "$*" >&2; exit 1; }

# copy puts the given sources into the diff directory of a layer.
copy() {
    if [[ $# -lt 3 ]]; then
        err "usage: copy <layerdir> <src1> [<src2> ...] <dst>"
    fi
    local layerdir="$1"
    local src=("${@:2:$#-2}")
    local dst="${@:$#}"
    mkdir -p "$layerdir/diff/${dst%/*}"
    cp -R "${src[@]}" "$layerdir/diff/$dst"
}

# run mounts an overlay on top of the given layers and runs the entrypoint in it.
run() {
    if [[ $# -lt 3 ]]; then
        err "usage: run <lowerdirN>[...:<lowerdir1>] <containerlayer> <entrypoint...>"
    fi
    mkdir -p "$2/diff" "$2/work" "$2/merged"
    local mergeddir="$2/merged"
    sudo mount -t overlay \
        -o "lowerdir=$1,upperdir=$2/diff,workdir=$2/work" \
        overlay "$mergeddir"
    shift 2 # remove lowerdir and containerlayer from args
    (
        # unmount when the subshell exits, even if the entrypoint fails
        trap 'sudo umount "$mergeddir"' EXIT
        sudo chroot "$mergeddir" "$@"
    )
}

# preprocess_dockerfile removes comments and handles line continuations.
preprocess_dockerfile() {
    sed -n '
        : again
        /^#/d
        /\\$/ {
            N
            s/\\\n//
            t again
        }
        /^[A-Z]/p
    ' "$@"
}

# build creates one layer per COPY or RUN instruction and prints the lowerdir.
build() {
    if [[ $# != 1 ]]; then
        err "usage: build <path>"
    fi
    cd "$1"
    local n=1 lowerdir="" cmd
    while read -r line; do
        # Split the line into words. $line is deliberately unquoted, so it is
        # also subject to pathname expansion in the build directory.
        set -- $line
        cmd=$1; shift
        case "$cmd" in
        FROM)
            # accepted but ignored: every build starts from scratch
            ;;
        COPY)
            copy "$DATADIR/layer$n" "$@"
            if [ -z "$lowerdir" ]; then
                lowerdir="$DATADIR/layer$n/diff"
            else
                lowerdir="$DATADIR/layer$n/diff:$lowerdir"
            fi
            n=$((n+1))
            ;;
        RUN)
            # the upper directory of this run becomes the next layer
            run "$lowerdir" "$DATADIR/layer$n" /bin/sh -c "$*"
            lowerdir="$DATADIR/layer$n/diff:$lowerdir"
            n=$((n+1))
            ;;
        *)
            err "unexpected command $cmd"
            ;;
        esac
    done < <(preprocess_dockerfile ./Dockerfile)
    echo "$lowerdir"
}

cmd=$1; shift
case "$cmd" in
copy|run|build)
    "$cmd" "$@"
    ;;
*)
    err "unknown command $cmd"
    ;;
esac
It can build a container from scratch.
mkdir -p ./src/bin ./src/etc
cp "$(which busybox)" ./src/bin/sh
echo 'root:x:0:0:root:/:/bin/sh' >./src/etc/passwd
cat >./src/Dockerfile <<'END'
FROM scratch
COPY . /
RUN printf '#!/bin/sh -x\necho .* *\n. ./.env\necho $MSG\n' >./myapp.sh && \
chmod +x ./myapp.sh
RUN echo "user:x:1000:1000:user:/:/bin/sh" >>/etc/passwd && \
echo 'MSG="Hello, world!"' >/.env
END
~$ cd ./src
~/src$ ../wiac build .
/tmp/wiac-data/layer3/diff:/tmp/wiac-data/layer2/diff:/tmp/wiac-data/layer1/diff
~/src$ ../wiac run /tmp/wiac-data/layer3/diff:/tmp/wiac-data/layer2/diff:/tmp/wiac-data/layer1/diff /tmp/upperlayer /myapp.sh
+ echo . .. Dockerfile bin etc myapp.sh
. .. Dockerfile bin etc myapp.sh
+ . ./.env
+ MSG='Hello, world!'
+ echo Hello, 'world!'
Hello, world!
~/src$
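Reusing podman layers
Podman also uses overlayfs, so the layers of an image it has pulled show up as the lowerdir of the container’s root mount.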
$ podman run --rm -i -t node mount
...
overlay on / type overlay (rw,relatime,context="system_u:object_r:container_file_t:s0:c539,c879",lowerdir=/home/obulatov.linux/.local/share/containers/storage/overlay/l/PGV3FCQXCAYYYHBPDQLHTXJPWX:/home/obulatov.linux/.local/share/containers/storage/overlay/l/Q4GEF43UFNU4FNEAS7MRBU2FQC:/home/obulatov.linux/.local/share/containers/storage/overlay/l/ED34UM2AKWBESQJIMUQDTQQELT:/home/obulatov.linux/.local/share/containers/storage/overlay/l/KSWWLEPB242QCOOU3B5RS6HC5K:/home/obulatov.linux/.local/share/containers/storage/overlay/l/QU2NNNGE6BF434EXHCPCR4IQDX:/home/obulatov.linux/.local/share/containers/storage/overlay/l/VC2QU23NKXJVC45HPNSUIYS6NY:/home/obulatov.linux/.local/share/containers/storage/overlay/l/DGNGMQ7DLOSBUDMSIEAS2ZW7AB:/home/obulatov.linux/.local/share/containers/storage/overlay/l/4VV5PCPETKU3GOAMABZB544SK5:/home/obulatov.linux/.local/share/containers/storage/overlay/l/6VQMGZHRJAWC374ZJWASTEPROT,upperdir=/home/obulatov.linux/.local/share/containers/storage/overlay/1e236df8934fbff8de521d81813e79054c67e963daa6fa91c678d3b01ea55d6f/diff,workdir=/home/obulatov.linux/.local/share/containers/storage/overlay/1e236df8934fbff8de521d81813e79054c67e963daa6fa91c678d3b01ea55d6f/work,volatile,userxattr)
...
$ ./wiac run /home/obulatov.linux/.local/share/containers/storage/overlay/l/PGV3FCQXCAYYYHBPDQLHTXJPWX:/home/obulatov.linux/.local/share/containers/storage/overlay/l/Q4GEF43UFNU4FNEAS7MRBU2FQC:/home/obulatov.linux/.local/share/containers/storage/overlay/l/ED34UM2AKWBESQJIMUQDTQQELT:/home/obulatov.linux/.local/share/containers/storage/overlay/l/KSWWLEPB242QCOOU3B5RS6HC5K:/home/obulatov.linux/.local/share/containers/storage/overlay/l/QU2NNNGE6BF434EXHCPCR4IQDX:/home/obulatov.linux/.local/share/containers/storage/overlay/l/VC2QU23NKXJVC45HPNSUIYS6NY:/home/obulatov.linux/.local/share/containers/storage/overlay/l/DGNGMQ7DLOSBUDMSIEAS2ZW7AB:/home/obulatov.linux/.local/share/containers/storage/overlay/l/4VV5PCPETKU3GOAMABZB544SK5:/home/obulatov.linux/.local/share/containers/storage/overlay/l/6VQMGZHRJAWC374ZJWASTEPROT /tmp/upperdir npm --version
9.3.1
The wiac tool can run it! If only it could pull images… Stay tuned.
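Copying that long lowerdir value by hand is error-prone, so here is a sketch of a filter that extracts it from mount output. The function name lowerdir_of_root is hypothetical, and the pattern assumes the line format shown in the transcript above and layer paths without commas.

```shell
# lowerdir_of_root reads `mount` output on stdin and prints the value of the
# lowerdir option of the overlay mounted on /.
lowerdir_of_root() {
    sed -n 's/^overlay on \/ type overlay .*[(,]lowerdir=\([^,)]*\).*/\1/p'
}

# Usage (assumes the paths printed inside the container are valid on the
# host, as they are in the transcript above):
#   ./wiac run "$(podman run --rm node mount | lowerdir_of_root)" /tmp/upperdir npm --version
```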