Unveiling Whiteout Files: Do you know how file deletions are handled between layers of a Docker image?

Union file systems are a mechanism for merging two or more file systems and presenting them to the user as one, under a single mount point.

The main idea behind this mechanism is to be able to alter the contents of the first file system (e.g. the contents of a CD-ROM) by writing all changes (additions, deletions, modifications) to the second (which could be a disk partition, a USB stick, …).

While adding and modifying may seem trivial, deleting is not. So let’s explore in this article what whiteout files are and how they can simulate the deletion of a file.

Another common use of union file systems is in containers: container images are made up of layers. If you launch a php container and then an nginx container, both from images based on debian, you will only download the underlying debian image once. Files from the debian image may still be modified or deleted by an image such as php or nginx, and a union file system is what makes this possible.

Understanding Union File System

Union file systems share a number of concepts, which we will illustrate with the following diagram:

File access by layer

Here we see a two-layer file system; in the jargon, the layers are called branches. They are denoted Lower for the lowest layer, Upper for the layer inserted on top of it, and Merged for the resulting view. Some implementations support more than two branches, sometimes with complex access and modification policies.

When a file is deleted from the union, a so-called whiteout file is placed in the upper layer to indicate that this file should no longer be displayed in the merged layer. The same concept applies to folders, which are then referred to as opaque directories.

When accessing a file in the lower branch that has not been modified in upper, the lower file is accessed directly.

When a file is modified, its entire contents are copied from the lower branch to the upper branch. A file that is added, overwritten or modified will therefore have its entire contents in the upper layer.
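The lookup, copy-up and deletion rules described above can be modelled in a few lines. This is only a toy sketch, not how any kernel implements a union: branches are plain Python dictionaries, and the `UnionView` class and `WHITEOUT` sentinel are names invented for this illustration.

```python
# Toy model of a two-branch union: dictionaries stand in for file systems.
WHITEOUT = object()  # hypothetical marker meaning "hidden in the merged view"


class UnionView:
    def __init__(self, lower):
        self.lower = dict(lower)  # read-only branch, never modified
        self.upper = {}           # read/write branch, receives every change

    def read(self, path):
        # The upper branch always wins; a whiteout hides the lower file.
        if path in self.upper:
            if self.upper[path] is WHITEOUT:
                raise FileNotFoundError(path)
            return self.upper[path]
        if path in self.lower:
            return self.lower[path]
        raise FileNotFoundError(path)

    def write(self, path, data):
        self.upper[path] = data

    def append(self, path, data):
        # Copy-up: the whole file lands in upper, not just the difference.
        self.upper[path] = self.read(path) + data

    def delete(self, path):
        self.read(path)  # raises if the file is not visible
        if path in self.lower:
            self.upper[path] = WHITEOUT  # must keep hiding the lower file
        else:
            del self.upper[path]         # only existed in upper: just drop it

    def listing(self):
        names = set(self.lower) | set(self.upper)
        return sorted(n for n in names if self.upper.get(n) is not WHITEOUT)


view = UnionView({"/etc/issue": "Debian\n", "/etc/adduser.conf": "# conf\n"})
view.append("/etc/issue", "Welcome\n")
view.delete("/etc/adduser.conf")
print(view.listing())      # ['/etc/issue']
print(sorted(view.upper))  # ['/etc/adduser.conf', '/etc/issue']
```

Note that the lower dictionary is never touched: the deleted file is still there, it is simply masked by the marker stored in the upper branch.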

History

The concept of whiteout file has its origins in the early development of file system unions.

Translucent File System is undoubtedly the first implementation of the whiteout file concept. Developed by David Hendricks in the 1980s for SunOS 3, the idea was to allow users of a machine to take advantage of the base system, making modifications without impacting other users, and without having access to other users’ files.

The first union mounts were implemented in 4.4BSD, in the 1990s.

The best-known implementation today is UnionFS, by Erez Zadok. It was meant to become the implementation used in the Linux kernel but, like aufs, neither its code nor its design convinced the kernel developers enough to be merged.

It wasn’t until 2014 that a union mount was integrated into the Linux kernel. This is OverlayFS. It arrived in kernel 3.18, after more than 4 years of rewrites and structural improvements, to reach the demanding and uncompromising level required for its integration into the official kernel.

What issues complicate the implementation of a union file system?

One of the trickiest problems is finding a way to represent file and folder deletions: it has to be a valid file (with or without metadata), as the information needs to be stored in a concrete way. In many implementations, a .wh.<filename> file serves as a whiteout file, which can create conflicts with the user’s own file names (or reduce the user’s choice of file names).

A similar problem applies to folders: should you delete every file contained in the folder, or does the mere presence of an opaque directory prevent discovery?

Memory usage can quickly get out of hand, especially if the implementation allows a lot of branches, because if you want the system to perform well you’ll need to have the topologies of each file system in memory.

Implementing mmap(2) is necessarily a nightmare: when a file is modified by two processes that mmap(2), we normally expect to see the modifications in both processes, but the first to make a modification creates a new file in the writeable branch. This makes it difficult to reconcile the pointers of the two processes.

Similarly, think about hard links management: all pointers to updated content should be modified in the write layer, but there is no pointer index, so it’s not easy to find the files to be updated.

And let’s not forget that the underlying file systems of each branch don’t necessarily have the same constraints (file name sizes, extended attributes, metadata, accent encoding, etc.), so you have to juggle between them, while returning consistent errors where appropriate.

And there are many more besides, not least readdir(2), which needs to return stable results despite the changes that can occur between two calls.

See this series of articles summarizing the different implementations, their choices and differences: https://lwn.net/Articles/325369/, https://lwn.net/Articles/327738/.

In what follows, we’ll be concentrating mainly on how OverlayFS operates, trying as far as possible to draw parallels with the other implementations.

Whiteout Files in Practice

First of all, you need to know how to set up such a file system. Here’s a general example of how to create a simple union between a read-only and a read/write file system:

mount -t overlay -olowerdir=/lower,upperdir=/upper,workdir=/work ignored /merged

The type to use is overlay. The lowerdir option gives the location of the folder(s) to be combined in read-only mode (separated by : when there are several), and the upperdir option gives the directory containing the read/write file system. Don’t forget the workdir option: it must point to an empty directory on the same partition as the upperdir.

We end the call by giving the source device, which is useless in our case (ignored or any other string will do), and finally the folder to which our union will be mounted: /merged in the example.
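When several read-only branches are involved, the option string quickly becomes hard to write by hand. Here is a minimal sketch of a helper that assembles the same mount command; the function name `overlay_mount_command` and the `/lower1`, `/lower2` paths are assumptions made for this example, and the command is only printed, not executed (running it requires root).

```python
import shlex


def overlay_mount_command(lowerdirs, upperdir, workdir, target):
    """Build the argument list of a `mount -t overlay` call.

    lowerdirs is ordered from the top-most branch to the bottom-most one,
    which is the order OverlayFS expects between the ':' separators.
    """
    if not lowerdirs:
        raise ValueError("at least one lowerdir is required")
    for path in [*lowerdirs, upperdir, workdir]:
        # ':' and ',' are option separators; this sketch does not escape them.
        if ":" in path or "," in path:
            raise ValueError("unsupported character in path: " + path)
    options = ",".join([
        "lowerdir=" + ":".join(lowerdirs),
        "upperdir=" + upperdir,
        "workdir=" + workdir,
    ])
    # The source device is unused by overlay, so any string will do.
    return ["mount", "-t", "overlay", "-o", options, "ignored", target]


cmd = overlay_mount_command(["/lower2", "/lower1"], "/upper", "/work", "/merged")
print(shlex.join(cmd))
```

Returning a list rather than a single string keeps the result directly usable with `subprocess.run` without going through a shell.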

Usage in Containerization

Let’s analyze a running Docker container to learn more.

First, we check that we’re using the overlay2 storage driver:

42sh$ docker info | grep "Storage Driver"
 Storage Driver: overlay2

This is the case (depending on your kernel configuration, Docker may have chosen a different driver), so let’s start the analysis:

42sh$ docker container run --rm -it debian
  incntr$ mount | grep "on / "
  overlay on / type overlay (rw,relatime,lowerdir=/var/lib/docker/overlay2/l/B62UNV3UB3X4TBWQMM6XCMM6W5:/var/lib/docker/overlay2/l/V6HGFN3C3PEW6CZ6XWRSHHDKJH,upperdir=/var/lib/docker/overlay2/2a353708e5b16ea7775cf1a33dd23ce31430faaa504bcde5508691b230f9d700/diff,workdir=/var/lib/docker/overlay2/2a353708e5b16ea7775cf1a33dd23ce31430faaa504bcde5508691b230f9d700/work)

Note that 2 lowerdir are used. These are symbolic links pointing to the folders identifying the layers (the names of the links are random, the aim being to have a shortened path to the layer’s file system, as the number of characters that can be passed to the mount(2) system call is limited).

The lowest branch (furthest to the right of the lowerdir parameter) contains the single layer of our debian image, while the branch furthest to the left overlays a number of configuration files required to run the container (/etc/hosts, resolv.conf, …).

The read/write branch is also registered in the /var/lib/docker/overlay2 folder, and its identifier can be seen. The upperdir is in the diff folder, while the workdir is in the work folder, under the same layer ID.
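To pick these branches out of a mount line without reading it by eye, the option string can be split mechanically. The following is a small sketch: `parse_overlay_options` is a name chosen for this example, and the shortened `AAAA`, `BBBB` and `xyz` identifiers are placeholders, not real layer names.

```python
def parse_overlay_options(options):
    """Split the option string of an overlay mount into its branches."""
    parsed = {}
    for item in options.split(","):
        key, sep, value = item.partition("=")
        if sep:  # flags such as "rw" or "relatime" carry no value
            parsed[key] = value
    lower = parsed.get("lowerdir", "")
    return {
        # Left-most entry is the top-most read-only branch.
        "lowerdirs": [d for d in lower.split(":") if d],
        "upperdir": parsed.get("upperdir"),
        "workdir": parsed.get("workdir"),
    }


opts = (
    "rw,relatime,"
    "lowerdir=/var/lib/docker/overlay2/l/AAAA:/var/lib/docker/overlay2/l/BBBB,"
    "upperdir=/var/lib/docker/overlay2/xyz/diff,"
    "workdir=/var/lib/docker/overlay2/xyz/work"
)
branches = parse_overlay_options(opts)
for depth, path in enumerate(branches["lowerdirs"]):
    print(depth, path)
print("upper:", branches["upperdir"])
```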

We can also see the folders used by inspecting our container:

42sh$ docker container inspect youthful_wilbur | jq .[0].GraphDriver.Data

{
  "LowerDir": "/var/lib/docker/overlay2/22753d0d81...8706f1a31-init/diff:/var/lib/docker/overlay2/2cc3656c06...c0fb91d6/diff",
  "MergedDir": "/var/lib/docker/overlay2/22753d0d81...8706f1a31/merged",
  "UpperDir": "/var/lib/docker/overlay2/22753d0d81...8706f1a31/diff",
  "WorkDir": "/var/lib/docker/overlay2/22753d0d81...8706f1a31/work"
}

If you test with an image with more layers, you’ll get more lowerdir, one per layer. Feel free to run the same series of commands with the python image, for example.

Adding files

At this point, if we look at the contents of our upperdir folder, we can see that it’s empty. This is normal, since we haven’t made any changes.

In our previously launched container, let’s make a modification by adding a file:

incntr$ echo "newfile" > /root/foobar

42sh$ tree /var/lib/docker/overlay2/2a353708e5...91b230f9d700/diff
/var/lib/docker/overlay2/2a353708e5...91b230f9d700/diff
└── root
    └── foobar

Our new file, which is of course not the only one visible in the container’s tree, has been added to the read/write branch, as you’d expect.

Modifying files

If we make a change to a file, for example by adding a line, it’s not just the difference that is stored in the write branch, but the whole file, as it has been modified:

incntr$ echo "Bienvenue dans le conteneur" >> /etc/issue

42sh$ tree /var/lib/docker/overlay2/2a353708e5...91b230f9d700/diff
/var/lib/docker/overlay2/2a353708e5...91b230f9d700/diff
└── etc
    └── issue

42sh$ cat /var/lib/docker/overlay2/2a353708e5...91b230f9d700/diff/etc/issue
Debian GNU/Linux 11 \n \l
Bienvenue dans le conteneur

Deleting files

When you want to delete a file you’ve just added, there is nothing special to do: deleting the file from the write branch is enough to make it disappear from the mounted tree.

When it comes to deleting a file from a read-only branch, you need to be able to hide the file using a marker. This marker differs depending on the storage driver: in OverlayFS, a deletion is materialized by a character special file of the same name, with device number 0,0.

incntr$ rm /etc/adduser.conf

42sh$ tree /var/lib/docker/overlay2/2a353708e5...91b230f9d700/diff
/var/lib/docker/overlay2/2a353708e5...91b230f9d700/diff
└── etc
    └── adduser.conf

42sh$ cat /var/lib/docker/overlay2/2a353708e5...91b230f9d700/diff/etc/adduser.conf
cat: No such device or address

42sh$ stat /var/lib/docker/overlay2/2a353708e5...91b230f9d700/diff/etc/adduser.conf
  File: /var/lib/docker/overlay2/2a353708e5...91b230f9d700/diff/etc/adduser.conf
  Size: 0         	Blocks: 0          IO Block: 4096   character special file
Device: fe0bh/65035d	Inode: 515773      Links: 2     Device type: 0,0

Note here Device type: 0,0.
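The same check can be done programmatically. The sketch below tests whether a path is a character device with major and minor numbers 0, which is what the stat output above shows; the function names `is_overlay_whiteout` and `find_whiteouts` are invented for this example.

```python
import os
import stat


def is_overlay_whiteout(path):
    """Return True if path looks like an OverlayFS whiteout (char device 0,0)."""
    st = os.lstat(path)  # lstat: do not follow symbolic links
    return (
        stat.S_ISCHR(st.st_mode)
        and os.major(st.st_rdev) == 0
        and os.minor(st.st_rdev) == 0
    )


def find_whiteouts(upperdir):
    """Yield the path, relative to upperdir, of every whiteout found below it."""
    for root, _dirs, files in os.walk(upperdir):
        # os.walk lists device nodes among the non-directory entries.
        for name in files:
            full = os.path.join(root, name)
            if is_overlay_whiteout(full):
                yield os.path.relpath(full, upperdir)
```

Called on the diff folder of the container above (which normally requires root to read), `find_whiteouts` would be expected to report etc/adduser.conf.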

To create a similar file ourselves, we would need to use:

42sh$ mkdir /var/lib/docker/overlay2/2a353708e5...91b230f9d700/diff/bin
42sh$ mknod /var/lib/docker/overlay2/2a353708e5...91b230f9d700/diff/bin/sh c 0 0

Caution, undefined behavior!

Running this mknod command while the union file system is mounted elsewhere will not make the /bin/sh file disappear, as any modifications made to the branches from outside the mounted system lead to explicitly undefined results.

Deletion on unionfs and AuFS

The concept of a whiteout file, as we have seen, differs depending on the file system. Even though OverlayFS is the implementation that eventually made it into the Linux kernel, after many ups and downs, the format Docker specifies for the archives used to distribute layers represents deletions in the AuFS way. It is therefore important to know it too.

Instead of using a special file, AuFS creates a standard file .wh.<filename>, where <filename> is the name of the file to be hidden.

In order to adapt to the storage driver, when the archive is decompressed, Docker converts¹ the whiteout files it encounters into the format expected by that driver.
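To give an idea of what such a conversion has to decide, here is a hedged sketch that classifies archive member names according to the AuFS-style convention. It is not Docker's code: `classify_entry` is a name made up for this example, and the `.wh..wh..opq` opaque-directory marker, which is not discussed above, is included on the assumption that it follows the same convention.

```python
import posixpath

WHITEOUT_PREFIX = ".wh."
OPAQUE_MARKER = ".wh..wh..opq"  # assumed marker for an opaque directory


def classify_entry(path):
    """Classify one archive member name.

    Returns (kind, target), where kind is "opaque", "whiteout" or "regular"
    and target is the path the entry applies to.
    """
    directory, name = posixpath.split(path)
    if name == OPAQUE_MARKER:
        # Hide everything the lower layers put in this directory.
        return ("opaque", directory)
    if name.startswith(WHITEOUT_PREFIX):
        # Hide the single entry whose name follows the prefix.
        hidden = name[len(WHITEOUT_PREFIX):]
        return ("whiteout", posixpath.join(directory, hidden))
    return ("regular", path)


for member in ["etc/issue", "etc/.wh.adduser.conf", "var/cache/.wh..wh..opq"]:
    print(member, "->", classify_entry(member))
```

With an OverlayFS-backed driver, a "whiteout" result would then translate into a 0,0 character device created at the target path, as shown earlier with mknod.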

Conclusion

Just when you thought you didn’t want to know what whiteout files were all about, I’m sure that reading this article has given you a glimpse into the complexity of both union mounts and the software that takes advantage of their different implementations.

Now you know why, in particular, it’s pointless to delete a large file in a layer other than the one that contributed it, for example:

RUN wget https://dumps.wikimedia.org/enwiki/enwiki-pages-articles-multistream.xml.bz2

RUN ... # some other stuff

RUN rm enwiki-pages-articles-multistream.xml.bz2

Each RUN creates a separate layer, so our enwiki-pages-articles-multistream.xml.bz2 file will be distributed with the first layer of our image, then a whiteout file will be inserted in the layer corresponding to the third RUN.
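The effect on image size can be illustrated with a short calculation. This is a simplified model, not a measurement: the `image_sizes` function is invented for this example, the 20 GB size for the dump and the `result.txt` file are arbitrary assumptions, and the cost of the whiteout itself is counted as zero.

```python
def image_sizes(layers):
    """Return (bytes distributed, bytes visible) for a stack of layers.

    Each layer maps a path to its size in bytes, or to None for a whiteout.
    Layers are listed from the first (lowest) to the last (top-most).
    """
    distributed = 0
    visible = {}
    for layer in layers:
        for path, size in layer.items():
            if size is None:
                visible.pop(path, None)  # whiteout: hidden, nothing reclaimed
            else:
                distributed += size      # every layer is shipped in full
                visible[path] = size
    return distributed, sum(visible.values())


DUMP = "enwiki-pages-articles-multistream.xml.bz2"
three_runs = [{DUMP: 20_000_000_000}, {"result.txt": 1_000}, {DUMP: None}]
single_run = [{"result.txt": 1_000}]

print(image_sizes(three_runs))  # (20000001000, 1000)
print(image_sizes(single_run))  # (1000, 1000)
```

Both images show the same files once merged, but only the second one, where download, processing and removal happen within a single RUN, avoids shipping the dump.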

¹ See the source code: https://github.com/moby/moby/blob/master/pkg/archive/archive_linux.go#L27