This time we take a look at a very famous bug: the Dirty COW vulnerability.


CVE-2016-5195, better known as Dirty COW, is an infamous race-condition bug in the Linux kernel.

After some googling, I found that this blog post by chao-tic does the best job of elaborating the details. I recommend watching the video by LiveOverflow first, then reading this post before you dive into chao-tic's.

If you are interested in its internals or want to exploit it, this post should be helpful :P

This post uses the Linux 4.4 kernel as an example; other versions may differ.

Understand The Bug

The most practical way to understand the bug is to learn how the exploit works. Before doing that, we need to review some kernel-related knowledge.

Copy On Write (COW)

Copy on write is an optimization technique adopted in Linux memory management. For example, when a process requests memory, the kernel initially just adjusts some metadata (a vm_area_struct, for instance); the real page is not allocated until a page fault is triggered.

Shared resource management, databases, and other systems also rely on this technique.


Most people know about the /proc filesystem, but /proc/self/mem is a relatively uncommon corner of it. In general, /proc/pid/mem is a binary image representing the process’s virtual memory[^4]. One concise answer I like comes from Stack Exchange:

/proc/$pid/mem shows the contents of $pid’s memory mapped the same way as in the process, i.e., the byte at offset x in the pseudo-file is the same as the byte at address x in the process.

Generally, we are not bothered with this /proc/$$/mem stuff directly; ptrace and other debugger machinery wrap it for us. The relevant file operations are defined in proc_mem_operations in the Linux kernel.

static const struct file_operations proc_mem_operations = {
	.llseek		= mem_lseek,
	.read		= mem_read,
	.write		= mem_write,
	.open		= mem_open,
	.release	= mem_release,
};

Okay, before digging into the internals of writing to /proc/self/mem, here is another important component of this vulnerability: the madvise syscall.


Be sure that your kernel is compiled with CONFIG_ADVISE_SYSCALLS.

Here is an excerpt from the system call manual:

The madvise() system call is used to give advice or directions to the kernel about the address range beginning at address addr and with size length bytes. Initially, the system call supported a set of “conventional” advice values, which are also available on several other implementations. (Note, though, that madvise() is not specified in POSIX.) Subsequently, a number of Linux-specific advice values have been added.

Hmm, advice from the user sounds quite dangerous.

MADV_DONTNEED: Do not expect access in the near future. (For the time being, the application is finished with the given range, so the kernel can free resources associated with it.)

So far so good: the user who owns the resource can advise the kernel that they won't use it in the near future, so the kernel can optimize its resource management accordingly.

Let’s look at the comments in the source code (mm/madvise.c):

/*
 * Application no longer needs these pages.  If the pages are dirty,
 * it's OK to just throw them away.  The app will be more careful about
 * data it wants to keep.  Be sure to free swap resources too.  The
 * zap_page_range call sets things up for shrink_active_list to actually free
 * these pages later if no one else has touched them in the meantime,
 * although we could add these pages to a global reuse list for
 * shrink_active_list to pick up before reclaiming other pages.
 * NB: This interface discards data rather than pushes it out to swap,
 * as some implementations do.  This has performance implications for
 * applications like large transactional databases which want to discard
 * pages in anonymous maps after committing to backing store the data
 * that was kept in them.  There is no reason to write this data out to
 * the swap area if the application is discarding it.
 * An interface that causes the system to free clean pages and flush
 * dirty pages is already available as msync(MS_INVALIDATE).
 */
static long madvise_dontneed(struct vm_area_struct *vma,
			     struct vm_area_struct **prev,
			     unsigned long start, unsigned long end)

In conclusion, if the pages advised with the MADV_DONTNEED flag are dirty, the kernel will just throw them away. You can check this blog if you are not familiar with dirty pages.

Cool, let’s combine what we have in hand.


With the help of dynamic debugging, we find that writing to /proc/self/mem leads to the __get_user_pages() function, defined in mm/gup.c.

mem_write() => mem_rw() => access_remote_vm() => __access_remote_vm() => get_user_pages() => get_user_pages_locked() => __get_user_pages()

Let’s have a glance at this function.

// mm/gup.c __get_user_pages()
long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
		unsigned long start, unsigned long nr_pages,
		unsigned int gup_flags, struct page **pages,
		struct vm_area_struct **vmas, int *nonblocking)
{
	/* ...skip... */
	do {
		struct page *page;
		/* ...skip... */
		/*
		 * If we have a pending SIGKILL, don't keep faulting pages and
		 * potentially allocating memory.
		 */
		if (unlikely(fatal_signal_pending(current)))
			return i ? i : -ERESTARTSYS;
		page = follow_page_mask(vma, start, foll_flags, &page_mask);
		if (!page) {
			int ret;
			ret = faultin_page(tsk, vma, start, &foll_flags,
					nonblocking);
			switch (ret) {
			case 0:
				goto retry;
			case -EFAULT:
			case -ENOMEM:
			case -EHWPOISON:
				return i ? i : ret;
			case -EBUSY:
				return i;
			case -ENOENT:
				goto next_page;
			}
		} else if (PTR_ERR(page) == -EEXIST) {
			/*
			 * Proper page table entry exists, but no corresponding
			 * struct page.
			 */
			goto next_page;
		} else if (IS_ERR(page)) {
			return i ? i : PTR_ERR(page);
		}
		if (pages) {
			pages[i] = page;
			flush_anon_page(vma, page, start);
			page_mask = 0;
		}
		/* ...skip... */
	} while (nr_pages);
	/* ...skip... */
}

The page is returned by the follow_page_mask() function, whose comment is quoted below. This function walks the page tables to find the page descriptor.

follow_page_mask - look up a page descriptor from a user-virtual address

Supposing we are writing to a read-only region, follow_page_mask() (via its helper follow_page_pte()) reaches this check:

if ((flags & FOLL_WRITE) && !pte_write(pte)) {
	pte_unmap_unlock(ptep, ptl);
	return NULL;
}

That is to say, a NULL pointer is returned, and __get_user_pages() then calls faultin_page(). The follow-up process is rather complex, so we skip it and focus on the trace supplied by the CVE details.

        FAULT_FLAG_WRITE && !pte_write
	    PageAnon() <- this is CoWed page already
	    reuse_swap_page <- page is exclusively ours
	      maybe_mkwrite <- dirty but RO again
	      ret = VM_FAULT_WRITE

In brief, writing to the read-only area causes the faultin_page() function to create a COW page marked dirty, so the write does not affect the original content.

Question: why not just raise an error when writing to a read-only page? The answer is debugger support: /proc/pid/mem accesses use FOLL_FORCE, which lets ptrace-based debuggers write through read-only mappings.

When the faultin_page() function finishes its job, it removes the FOLL_WRITE flag from foll_flags to avoid another call to faultin_page() after the retry in __get_user_pages(). That is to say, with FOLL_WRITE removed, the next follow_page_mask() call returns the freshly created COW page instead of a NULL pointer.

// mm/gup.c faultin_page()
	/*
	 * The VM_FAULT_WRITE bit tells us that do_wp_page has broken COW when
	 * necessary, even if maybe_mkwrite decided not to set pte_write. We
	 * can thus safely do subsequent page lookups as if they were reads.
	 * But only do so when looping for pte_write is futile: in some cases
	 * userspace may also be wanting to write to the gotten user page,
	 * which a read fault here might prevent (a readonly page might get
	 * reCOWed by userspace write).
	 */
	if ((ret & VM_FAULT_WRITE) && !(vma->vm_flags & VM_WRITE))
		*flags &= ~FOLL_WRITE;
	return 0;

Without the FOLL_WRITE flag, follow_page_mask() goes deeper and calls vm_normal_page() to obtain the newly allocated copy page.

Well, we know there are tons of gotos in kernel source code. This flag-removal trick is pretty Linux, but also dangerous. Imagine the following scenario: one thread keeps writing to a read-only mapping while another thread keeps calling madvise to abandon it. The expected sequence is shown below.

write(f, str, strlen(str))  |
          ||                |
          \/                |
__get_user_pages(...)       |
          ||                |
          \/                |
follow_page_mask(...)       |
 [return NULL]              |
          ||                |
          \/                |
faultin_page(...)           |
 [create dirty cow]         |
 [eliminate FOLL_WRITE]     |
          ||                |
          \/                |
follow_page_mask(...)       |
 [return COWed  page]       |
                            | madvise(map,100,MADV_DONTNEED)
                            |              ||
                            |              \/
                            |      zap_page_range(...)
                            |   [Drop the dirty COWed page]

However, without any lock protecting this flow, an attacker can race it and produce the sequence below.

write(f, str, strlen(str))  |
          ||                |
          \/                |
__get_user_pages(...)       |
          ||                |
          \/                |
follow_page_mask(...)       |
 [return NULL]              |
          ||                |
          \/                |
faultin_page(...)           |
 [create dirty cow]         |
 [eliminate FOLL_WRITE]     |
                            | madvise(map,100,MADV_DONTNEED)
                            |              ||
                            |              \/
                            |      zap_page_range(...)
                            |   [Drop the dirty COWed page]
follow_page_mask(...)       |
[get the real pagecache!]   | 

That is because the re-entered follow_page_mask() only checks (flags & FOLL_WRITE) && !pte_write(pte) to decide whether to continue obtaining the page. FOLL_WRITE has been removed, so it returns the real page!

Exploit The Bug

There are many PoC scripts supplied on the official website that you can use to attack a vulnerable machine. Here we pick the classic one. To emulate an OS, we can use qemu (a real virtual machine also works). In addition, you can use buildroot to quickly set up a qemu+Linux environment.

Preparing Busybox

Here we target the ARM architecture. This blog will help you set up everything you need, like the cross-compile toolchain, an initramfs, and so on.

Booting Linux

I just show the commands here; make sure you have everything settled first.

$ wget -c
$ tar xvf v4.4.tar.gz
$ cd linux-4.4
$ make ARCH=arm CROSS_COMPILE=arm-linux-gnueabi- vexpress_defconfig
$ make ARCH=arm CROSS_COMPILE=arm-linux-gnueabi- all -j4

After you get the zImage, you can cross-compile the dirtyc0w exploit; remember to put it into your initramfs.

$ arm-linux-gnueabi-gcc -static dirtyc0w.c -o dirtyc0w -lpthread
$ find . | cpio -H newc -o > {TARGET_INITRAMFS_PATH}

Okay, we then use qemu to start booting.

qemu-system-arm -M vexpress -dtb linux-4.4/arch/arm/boot/dts/vexpress-v2p-ca9.dtb -kernel {ZIMAGE PATH} -initrd {TARGET_INITRAMFS_PATH} -nographic -append "earlyprintk=serial,ttyS0 console=ttyAMA0"

Once you have a Linux shell, create a file that can only be written by root, then su to the unprivileged hacker user.

$ echo 'aaaaa' > foo
$ chmod 744 foo
$ su hacker
$ echo 'bbb' > foo
sh: can't create foo: Permission denied
$ cat foo
$ ./dirtyc0w foo bbb
$ cat foo

Boom! We write to that file with our own payloads.


The patch for this CVE is rather simple.

diff --git a/include/linux/mm.h b/include/linux/mm.h
index e9caec6..ed85879 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2232,6 +2232,7 @@ static inline struct page *follow_page(struct vm_area_struct *vma,
 #define FOLL_TRIED	0x800	/* a retry, previous pass started an IO */
 #define FOLL_MLOCK	0x1000	/* lock present pages */
 #define FOLL_REMOTE	0x2000	/* we are working on non-current tsk/mm */
+#define FOLL_COW	0x4000	/* internal GUP flag */
 typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
 			void *data);
diff --git a/mm/gup.c b/mm/gup.c
index 96b2b2f..22cc22e 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -60,6 +60,16 @@ static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address,
 	return -EEXIST;
 }
+
+/*
+ * FOLL_FORCE can write to even unwritable pte's, but only
+ * after we've gone through a COW cycle and they are dirty.
+ */
+static inline bool can_follow_write_pte(pte_t pte, unsigned int flags)
+{
+	return pte_write(pte) ||
+		((flags & FOLL_FORCE) && (flags & FOLL_COW) && pte_dirty(pte));
+}
+
 static struct page *follow_page_pte(struct vm_area_struct *vma,
 		unsigned long address, pmd_t *pmd, unsigned int flags)
@@ -95,7 +105,7 @@ retry:
 	if ((flags & FOLL_NUMA) && pte_protnone(pte))
 		goto no_page;
-	if ((flags & FOLL_WRITE) && !pte_write(pte)) {
+	if ((flags & FOLL_WRITE) && !can_follow_write_pte(pte, flags)) {
 		pte_unmap_unlock(ptep, ptl);
 		return NULL;
 	}
@@ -412,7 +422,7 @@ static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
 	 * reCOWed by userspace write).
 	 */
 	if ((ret & VM_FAULT_WRITE) && !(vma->vm_flags & VM_WRITE))
-		*flags &= ~FOLL_WRITE;
+	        *flags |= FOLL_COW;
 	return 0;

The key idea of this patch is to no longer remove the FOLL_WRITE flag when faultin_page() finishes; instead, a new flag FOLL_COW records that the COW cycle happened. The check in follow_page_pte() now calls can_follow_write_pte(), which returns true only if the COW page is still present and dirty (and false when that page has been dropped by madvise).