In this article we discuss changes in the Linux kernel slab allocator implementation and the exploitation challenges associated with kernel heap-related vulnerabilities. We focus on the SLUB (unqueued slab allocator) implementation, since it is the most common allocator, enabled by default on most Linux distributions and Android devices.
We discuss some major changes in modern kernels that affect the exploitability of heap-related vulnerabilities. Some of these changes/features were intentional, while others were side-effects hindering or aiding the exploitation process.
The two common dynamic object allocation mechanisms in the kernel are:
- General-purpose allocations performed using the kmalloc/kzalloc/... API
- Special-purpose allocations via kmem_cache_create/kmem_cache_alloc
Special-purpose caches are generally created for frequently allocated/used objects such as task_struct, cred, inode, sock, etc.
A standard kernel general-purpose allocation may look similar to the following:
kmalloc(sizeof(struct some_struct), GFP_KERNEL)
There are other flags besides GFP_KERNEL that can make the allocation atomic, request high memory, etc., but GFP_KERNEL indicates a normal kernel allocation and represents the most common case for the allocation of a vulnerable object in UAF / heap overflow (heapovf) related vulnerabilities.
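As a rough illustration (the structure and helper names below are hypothetical), the same object may be allocated with different flags depending on the calling context:

#include <linux/slab.h>

/* Illustrative sketch only: some_struct and the helpers are hypothetical. */
struct some_struct {
    int foo;
    char buf[32];
};

static struct some_struct *alloc_in_process_ctx(void)
{
    /* GFP_KERNEL: normal allocation, may sleep - the common case. */
    return kmalloc(sizeof(struct some_struct), GFP_KERNEL);
}

static struct some_struct *alloc_in_atomic_ctx(void)
{
    /* GFP_ATOMIC: never sleeps, e.g. for interrupt context. */
    return kmalloc(sizeof(struct some_struct), GFP_ATOMIC);
}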
On a standard Linux system, SLUB general-purpose caches are named kmalloc-* and range from 8-byte object caches up to 8k, in power-of-two increments, with the exception of the 96- and 192-byte caches:
# cat /proc/slabinfo | grep ^kmalloc
kmalloc-8192      40    40  8192   4  8 : tunables  0  0  0 : slabdata   10   10   0
kmalloc-4096     127   136  4096   8  8 : tunables  0  0  0 : slabdata   17   17   0
kmalloc-2048     256   256  2048   8  4 : tunables  0  0  0 : slabdata   32   32   0
kmalloc-1024     896   896  1024   8  2 : tunables  0  0  0 : slabdata  112  112   0
kmalloc-512      537   608   512   8  1 : tunables  0  0  0 : slabdata   76   76   0
kmalloc-256     1613  1680   256  16  1 : tunables  0  0  0 : slabdata  105  105   0
kmalloc-192     1525  1596   192  21  1 : tunables  0  0  0 : slabdata   76   76   0
kmalloc-128     1184  1184   128  32  1 : tunables  0  0  0 : slabdata   37   37   0
kmalloc-96      1260  1260    96  42  1 : tunables  0  0  0 : slabdata   30   30   0
kmalloc-64      5760  5760    64  64  1 : tunables  0  0  0 : slabdata   90   90   0
kmalloc-32      3072  3072    32 128  1 : tunables  0  0  0 : slabdata   24   24   0
kmalloc-16      1792  1792    16 256  1 : tunables  0  0  0 : slabdata    7    7   0
kmalloc-8       2048  2048     8 512  1 : tunables  0  0  0 : slabdata    4    4   0
When a new allocation is requested, e.g., kmalloc(24, GFP_KERNEL), the allocation gets rounded up to the next closest cache size. In this case, the requested 24 bytes will be served from the kmalloc-32 general-purpose cache. Objects of the same size (or that round up to the same general-purpose cache size) allocated via the k*alloc API end up in the same cache, assuming they are served by the same core (if SMP). This makes refills / heap sprays trivial if both the vulnerable and target / refill objects are allocated via k*alloc.
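As a simple sketch (variable names are arbitrary), a freed 24-byte allocation can be refilled by any other k*alloc allocation that rounds up to the same cache:

/* Both allocations round up to kmalloc-32, so the second one can reuse the
 * slot freed by the first (assuming it is served by the same CPU). */
void *vuln   = kmalloc(24, GFP_KERNEL);   /* served from kmalloc-32 */
kfree(vuln);                              /* slot returns to the kmalloc-32 freelist */
void *refill = kmalloc(30, GFP_KERNEL);   /* likely lands in the freed slot */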
For comparison, on Android/arm64 the smallest general-purpose cache is kmalloc-128. From the exploitation point of view this could be an advantage or a disadvantage:
- If there are fewer general-purpose caches, there are more candidates for a refill. For example, on Android, all k*alloc allocations less than or equal to 128 bytes are served from the kmalloc-128 cache.
- This could be a disadvantage as well. If all allocations ≤ 128 bytes are served by kmalloc-128, the cache becomes hot and refills become less reliable, i.e., it is more likely that some other unintended allocation takes the slab spot of the freed vulnerable object before the actual refill.
Finally, special-purpose caches / allocations are created using the kmem_cache_create/kmem_cache_alloc API. For example, creating a special-purpose cache test_cache for objects of size 86 bytes and requesting a single allocation:
struct kmem_cache *kmem_cache_create(const char *name, size_t size,
            size_t align, unsigned long flags,
            void (*ctor)(void *));
...
struct kmem_cache *s = kmem_cache_create("test_cache", 86, 0,
                                         SLAB_HWCACHE_ALIGN, NULL);
void *test_obj = kmem_cache_alloc(s, GFP_KERNEL);
There are no restrictions on special-purpose cache sizes - they can be arbitrary, without any specific alignment. SLAB_HWCACHE_ALIGN indicates that the cache objects need to be aligned on the cache line size (64 bytes on both x86_64 and aarch64), and almost all special-purpose caches use this flag. This flag is also important from the exploitation perspective, since it may influence cache aliasing/mergeability with one of the general-purpose caches (depending on the kernel version, see below).
From the exploitation perspective, it is important to determine if a vulnerable object and its refill (UAF) or vulnerable and target objects (heapovf) are allocated in the same cache/slab.
As discussed above, if both vulnerable and target objects are allocated via k*alloc, the allocations get served from the same general-purpose cache (subject to size requirements). What if a vulnerable or target object is allocated in a special-purpose cache? For example, the vulnerable object A of size 86 bytes is allocated in a special-purpose cache and its 128-byte refill B is allocated via k*alloc (a very common exploitation scenario). For this particular example, successful exploitation depends on whether A's cache is merged or aliased with B's general-purpose cache.
SLUB cache aliasing is used to reduce kernel memory fragmentation by merging caches with similar characteristics. For example, when a special-purpose cache is created without any specific flags such as SLAB_ACCOUNT (more on this below), this cache may get aliased with one of the general-purpose caches or possibly several other special-purpose caches. The following function determines cache mergeability when a special-purpose cache is created:
struct kmem_cache *find_mergeable(size_t size, size_t align,
        slab_flags_t flags, const char *name, void (*ctor)(void *))
{
    struct kmem_cache *s;

    if (slab_nomerge)
        return NULL;

    if (ctor)
        return NULL;

    size = ALIGN(size, sizeof(void *));
    align = calculate_alignment(flags, align, size);        [1]
    size = ALIGN(size, align);                              [2]
    flags = kmem_cache_flags(size, flags, name, NULL);

    if (flags & SLAB_NEVER_MERGE)
        return NULL;

    list_for_each_entry_reverse(s, &slab_root_caches, root_caches_node) {   [3]
        if (slab_unmergeable(s))
            continue;

        if (size > s->size)
            continue;

        if ((flags & SLAB_MERGE_SAME) != (s->flags & SLAB_MERGE_SAME))
            continue;
        /*
         * Check if alignment is compatible.
         * Courtesy of Adrian Drzewiecki
         */
        if ((s->size & ~(align - 1)) != s->size)
            continue;

        if (s->size - size >= sizeof(void *))
            continue;

        if (IS_ENABLED(CONFIG_SLAB) && align &&
            (align > s->align || s->align % align))
            continue;

        return s;
    }
    return NULL;
}
Object alignment and size calculation are performed in [1] and [2]. The loop in [3] then iterates over all available caches, checking several conditions such as size, alignment and certain flags to determine cache mergeability. The calculate_alignment() function is shown below:
unsigned long calculate_alignment(slab_flags_t flags,
        unsigned long align, unsigned long size)
{
    /*
     * If the user wants hardware cache aligned objects then follow that
     * suggestion if the object is sufficiently large.
     *
     * The hardware cache alignment cannot override the specified
     * alignment though. If that is greater then use it.
     */
    if (flags & SLAB_HWCACHE_ALIGN) {
        unsigned long ralign = cache_line_size();
        while (size <= ralign / 2)
            ralign /= 2;
        align = max(align, ralign);
    }

    if (align < ARCH_SLAB_MINALIGN)
        align = ARCH_SLAB_MINALIGN;

    return ALIGN(align, sizeof(void *));
}
As mentioned above, almost all special-purpose caches are aligned on the cache line size (64 bytes). The alignment returned by calculate_alignment() is then used to calculate the cache object size in [2]. In our example above, the test_obj size aligned on the cache line size becomes 128 bytes. On kernels prior to 4.16, the test_cache special-purpose cache will get aliased with the kmalloc-128 general-purpose cache since their sizes match after alignment and there are no special flags that prevent cache aliasing. On the other hand, if the test_cache object size were 286 bytes, the cache would not get aliased with any general-purpose cache since its size after alignment on the cache line size becomes 320 bytes.
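The effect of this calculation can be sketched in a few lines of user-space C (assuming a 64-byte cache line), showing why an 86-byte object may alias with kmalloc-128 while a 286-byte object matches no general-purpose cache:

#include <stdio.h>

#define ALIGN(x, a) (((x) + (a) - 1) & ~((unsigned long)(a) - 1))

/* Mimics the SLAB_HWCACHE_ALIGN path of calculate_alignment() followed by
 * the object size alignment in [2]; the cache line is assumed to be 64 bytes. */
static unsigned long aligned_object_size(unsigned long size, unsigned long cache_line)
{
    unsigned long ralign = cache_line;

    /* Small objects are only aligned to a fraction of the cache line. */
    while (size <= ralign / 2)
        ralign /= 2;

    return ALIGN(size, ralign);
}

int main(void)
{
    printf("%lu\n", aligned_object_size(86, 64));  /* 128 -> matches kmalloc-128 */
    printf("%lu\n", aligned_object_size(286, 64)); /* 320 -> no kmalloc-* match  */
    return 0;
}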
Cache aliasing information is available in /sys/kernel/slab/. For example, the following caches are all merged together:
# ls -al /sys/kernel/slab | grep ':t-0000128'
lrwxrwxrwx 1 root root 0 May 16 21:30 aio_kiocb -> :t-0000128
lrwxrwxrwx 1 root root 0 May 16 21:30 btree_node -> :t-0000128
lrwxrwxrwx 1 root root 0 May 16 21:30 cifs_mpx_ids -> :t-0000128
lrwxrwxrwx 1 root root 0 May 16 21:30 ecryptfs_key_tfm_cache -> :t-0000128
lrwxrwxrwx 1 root root 0 May 16 21:30 eventpoll_epi -> :t-0000128
lrwxrwxrwx 1 root root 0 May 16 21:30 fib6_nodes -> :t-0000128
lrwxrwxrwx 1 root root 0 May 16 21:30 ip6_mrt_cache -> :t-0000128
lrwxrwxrwx 1 root root 0 May 16 21:30 ip_mrt_cache -> :t-0000128
lrwxrwxrwx 1 root root 0 May 16 21:30 kmalloc-128 -> :t-0000128
lrwxrwxrwx 1 root root 0 May 16 21:30 pid -> :t-0000128
lrwxrwxrwx 1 root root 0 May 16 21:30 scsi_sense_cache -> :t-0000128
drwxr-xr-x 3 root root 0 May 16 21:30 :t-0000128
lrwxrwxrwx 1 root root 0 May 16 21:30 uid_cache -> :t-0000128
All special-purpose caches above (aio_kiocb, btree_node, etc.) are merged/aliased with kmalloc-128. There is also a slabinfo tool shipped with the kernel source that can print the aliasing information in a nicer format (i.e., slabinfo -a) by parsing /sys/kernel/slab/:
# ./slabinfo -a
:at-0000104 <- ext4_prealloc_space buffer_head
:at-0000256 <- dquot jbd2_transaction_s
:t-0000016  <- kmalloc-16 ecryptfs_file_cache
:t-0000024  <- numa_policy scsi_data_buffer
:t-0000032  <- dnotify_struct ecryptfs_dentry_info_cache sd_ext_cdb kmalloc-32
:t-0000040  <- khugepaged_mm_slot Acpi-Namespace ext4_system_zone
:t-0000048  <- fasync_cache shared_policy_node ip_fib_trie ftrace_event_field ksm_mm_slot jbd2_inode
:t-0000056  <- fanotify_event_info Acpi-Parse uhci_urb_priv dm_io ip_fib_alias zswap_entry file_lock_ctx nsproxy
:t-0000064  <- ecryptfs_key_sig_cache dmaengine-unmap-2 io secpath_cache anon_vma_chain task_delay_info tcp_bind_bucket ksm_stable_node kmalloc-64 ksm_rmap_item fanotify_perm_event_info fs_cache ecryptfs_global_auth_tok_cache
:t-0000072  <- eventpoll_pwq Acpi-Operand
:t-0000080  <- Acpi-State fsnotify_mark Acpi-ParseExt
:t-0000088  <- trace_event_file dnotify_mark inotify_inode_mark
:t-0000112  <- dm_rq_target_io flow_cache
:t-0000120  <- kernfs_node_cache cfq_io_cq
:t-0000128  <- uid_cache eventpoll_epi pid ecryptfs_key_tfm_cache ip6_mrt_cache kmalloc-128 scsi_sense_cache btree_node aio_kiocb ip_mrt_cache cifs_mpx_ids fib6_nodes
:t-0000192  <- bio_integrity_payload cred_jar inet_peer_cache dmaengine-unmap-16 kmalloc-192 key_jar
...
The SLAB_ACCOUNT flag is used to account all objects of a particular kmem cache. When kmem accounting is enabled (or disabled but compiled in), special-purpose caches created with SLAB_ACCOUNT will not get aliased with any other caches (special- or general-purpose) that are created without the SLAB_ACCOUNT flag. For example, the test_cache created with SLAB_ACCOUNT will not get merged with kmalloc-128 this time:
struct kmem_cache *s = kmem_cache_create("test_cache", 86, 0, SLAB_HWCACHE_ALIGN|SLAB_ACCOUNT, NULL); void *test_obj = kmem_cache_alloc(s, GFP_KERNEL);
From the exploitation perspective, if the vulnerable object is allocated in a kmem accounted cache, and it does not contain any function pointers or other useful data for privilege escalation, this vulnerability may be unexploitable. Since the cache becomes standalone, the refill can only be performed by the same object type.
For example, struct cred was often used for privilege escalation, since the cred_jar special-purpose cache used to be mergeable (after cache line alignment) with kmalloc-192: sizeof(struct cred) was around 168-176 bytes depending on the kernel version. struct cred was typically used in heap overflow vulnerabilities in 192-byte caches - it was trivial to trigger a cred allocation using the standard set*uid/set*gid syscalls, and overflowing its first reference counter (the first 4 bytes) with zeros had no major impact on stability.
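A minimal user-space sketch of this allocation primitive (no actual privilege change is required) might look as follows:

/* Illustrative sketch: each set*uid/set*gid syscall makes the kernel allocate
 * a fresh struct cred via prepare_creds() before committing it, so it can be
 * used to drop a cred into a recently freed slot of the target cache. */
#include <sys/types.h>
#include <unistd.h>

static int trigger_cred_alloc(void)
{
    /* A no-op change to the current IDs still allocates a new cred. */
    if (setgid(getgid()) != 0)
        return -1;
    return setuid(getuid());
}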
However, in kernel versions after 4.4, SLAB_ACCOUNT was added to the cred_jar cache:
void __init cred_init(void)
{
    /* allocate a slab in which we can store credentials */
    cred_jar = kmem_cache_create("cred_jar", sizeof(struct cred), 0,
            SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, NULL);
}
cred_jar became standalone and this direct exploitation vector was closed.
That being said, there are still ways to refill objects when dealing with standalone caches. For example, consider a UAF where the vulnerable object is allocated in general-purpose cache A of size Y and the target object is allocated in special-purpose cache B of size X. Sizes Y and X can be almost arbitrary. The general technique is to:
- Spray cache A (which should be trivial given it is a general-purpose cache) filling in all partial slabs.
- Trigger freeing the vulnerable object and then free all sprayed objects at the same time.
- Spray objects from cache B.
- Trigger the UAF.
Once objects from cache A start getting freed in step 2, entire slabs (which span one or more pages) become free. We then get those freed pages reallocated as new slabs belonging to cache B in step 3, before triggering the UAF. This exact technique was used in this LPE, where freed slabs were refilled with slabs belonging to the cred_jar cache and the UAF overwrote a few members of struct cred with zeros. The technique itself is obviously less reliable since it relies on an entire slab being reallocated to another cache, but given there are no memory restrictions, it can achieve a high success rate.
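The overall flow, with hypothetical helper functions standing in for the bug-specific spray/free/trigger primitives, might look like this:

/* All helpers below are hypothetical placeholders for bug-specific user-space
 * primitives; only the ordering of the steps matters. */
#define SPRAY_COUNT 4096

static void spray_cache_a(int n)          { /* e.g. many k*alloc-backed objects */ }
static void free_vulnerable_object(void)  { /* trigger the premature free       */ }
static void free_sprayed_cache_a(void)    { /* free all sprayed objects at once */ }
static void spray_cache_b(int n)          { /* e.g. fork() to spray cred_jar    */ }
static void trigger_uaf(void)             { /* use the dangling reference       */ }

static void cross_cache_refill(void)
{
    /* 1. Fill all partial slabs of cache A so the vulnerable object lands
     *    in a freshly allocated slab. */
    spray_cache_a(SPRAY_COUNT);

    /* 2. Free the vulnerable object, then free all sprayed objects so that
     *    entire slabs are returned to the page allocator. */
    free_vulnerable_object();
    free_sprayed_cache_a();

    /* 3. Allocate many objects from special-purpose cache B; its new slabs
     *    are likely to reuse the pages just released by cache A. */
    spray_cache_b(SPRAY_COUNT);

    /* 4. Trigger the use-after-free, which now hits an object of cache B. */
    trigger_uaf();
}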
As discussed above, prior to 4.16, special-purpose caches created via kmem_cache_create/kmem_cache_alloc were mergeable with general-purpose caches as long as their size matched one of the general-purpose caches and there was no kmem accounting used.
Hardened usercopy (CONFIG_HARDENED_USERCOPY) was introduced as a mitigation against heap overflows/infoleaks when copying data from/to user space. It was introduced prior to 4.16, but starting from 4.16 two new members, useroffset and usersize, were added to struct kmem_cache.
These two members provide more fine-grained access control for user-/kernel-space data transfer functions such as copy_to_user/copy_from_user, etc. For example, instead of marking the entire object as user accessible via copy_to_user(), only part of it can be exposed to prevent infoleaks. Furthermore, not all caches should be user-space accessible - kmem_cache_create_usercopy was introduced specifically for creating user-space accessible caches, and the original kmem_cache_create was kept for all other caches. The signatures of both functions are shown below:
struct kmem_cache *kmem_cache_create(const char *name, unsigned int size,
            unsigned int align, slab_flags_t flags,
            void (*ctor)(void *));

struct kmem_cache *kmem_cache_create_usercopy(const char *name,
            unsigned int size, unsigned int align, slab_flags_t flags,
            unsigned int useroffset, unsigned int usersize,
            void (*ctor)(void *));
kmem_cache_create_usercopy takes the useroffset and usersize parameters and initialises the corresponding struct kmem_cache members with those values. kmem_cache_create, on the other hand, sets these values in struct kmem_cache to 0, making the cache inaccessible to user space via copy_to/from_user(), etc.
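A hypothetical example of a cache that exposes only part of each object to user space might look like the following (the structure, field layout and cache name are made up for illustration):

#include <linux/init.h>
#include <linux/list.h>
#include <linux/slab.h>
#include <linux/stddef.h>

/* Only the "data" region of each object may be copied to/from user space;
 * hardened usercopy rejects transfers that touch the surrounding fields. */
struct msg_obj {
    struct list_head node;     /* kernel-only */
    char data[64];             /* user-visible payload */
    void (*handler)(void *);   /* kernel-only */
};

static struct kmem_cache *msg_cache;

static int __init msg_cache_init(void)
{
    msg_cache = kmem_cache_create_usercopy("msg_obj_cache",
                                           sizeof(struct msg_obj), 0,
                                           SLAB_HWCACHE_ALIGN,
                                           offsetof(struct msg_obj, data),
                                           sizeof(((struct msg_obj *)0)->data),
                                           NULL);
    return msg_cache ? 0 : -ENOMEM;
}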
One side effect of these API changes (starting from 4.16) is that slab_unmergeable() now has a check for usersize in [4]:
int slab_unmergeable(struct kmem_cache *s)
{
    if (slab_nomerge || (s->flags & SLAB_NEVER_MERGE))
        return 1;

    if (!is_root_cache(s))
        return 1;

    if (s->ctor)
        return 1;

    if (s->usersize)                                         [4]
        return 1;

    /*
     * We may have set a slab to be unmergeable during bootstrap.
     */
    if (s->refcount < 0)
        return 1;

    return 0;
}
If usersize is non-zero (i.e., the cache is user-space accessible), the cache is not merged with any other cache on the system. This is a major downside from the exploitation perspective, since all general-purpose caches are now marked as user-space accessible in create_boot_cache(), where useroffset is set to 0 and usersize to the entire cache/object size. As a result, general-purpose caches are no longer mergeable with special-purpose caches. This is true even if CONFIG_HARDENED_USERCOPY is disabled!
# cat /sys/kernel/slab/kmalloc-128/aliases
0
It is still possible to cross the general- / special-purpose cache boundary when refilling/spraying using the technique discussed in the previous section (i.e., freeing entire slabs and reallocating with slabs belonging to different caches).
Prior to 5.9 kernels, kmem cache accounting for general-purpose allocations was implemented using standalone caches. The following general-purpose allocation with the GFP_KERNEL_ACCOUNT flag set would get served from a standalone cache:
void *obj1 = kmalloc(..., GFP_KERNEL_ACCOUNT);
However, significant changes were introduced in 5.9 kernels and this limitation was removed. For example, the following two general-purpose allocations are now served from the same kmalloc-512 cache:
void *obj1 = kmalloc(286, GFP_KERNEL);
void *obj2 = kmalloc(286, GFP_KERNEL_ACCOUNT);
This is obviously an advantage from the exploitation perspective since there are several public refills/sprays using the GFP_KERNEL_ACCOUNT flag. However, starting from 5.14, separate caches for accounted objects were reintroduced and, as of today, 5.14+ kernels keep accounted and unaccounted objects in separate caches (the kmalloc-cg-* caches).
Freelist pointer randomisation (CONFIG_SLAB_FREELIST_RANDOM) was introduced in 4.8 and is now enabled by default on most modern distributions.
A common misconception is that freelist pointer randomisation is a mitigation against UAF since it affects the order in which objects are allocated. The latter part of this statement is partially true, i.e., when a new slab is allocated (consisting of one or several pages), allocations from this slab are no longer sequential. This was done primarily as a mitigation against kernel heap overflow vulnerabilities, where vulnerable and target objects need to be placed next to each other in the same slab before triggering the overflow. A common technique to shape the heap was to exhaust the cache by allocating objects of the same size as the target cache (taking into account object/cache alignments). When all partial/fragmented slabs are filled, new slabs get allocated and all allocations from those slabs become sequential. However, when freelist pointer randomisation is enabled, allocations from a new empty slab are randomised, preventing deterministic placement of vulnerable and target objects on the heap.
When a new cache is created, a random pre-computed list/sequence (the random_seq int array member of the kmem_cache struct) is allocated [5] for that cache and shuffled using the Fisher-Yates algorithm [6]:
int cache_random_seq_create(struct kmem_cache *cachep, unsigned int count,
                gfp_t gfp)
{
    struct rnd_state state;

    if (count < 2 || cachep->random_seq)
        return 0;

    cachep->random_seq = kcalloc(count, sizeof(unsigned int), gfp);    [5]
    if (!cachep->random_seq)
        return -ENOMEM;

    /* Get best entropy at this stage of boot */
    prandom_seed_state(&state, get_random_long());

    freelist_randomize(&state, cachep->random_seq, count);             [6]
    return 0;
}
This code path is common for both SLAB and SLUB implementations. For SLUB specifically, random_seq is converted from an array of freelist indices to an array of offsets within the new slab (based on the cache object size). When a new slab is allocated, a random starting position/index within random_seq for the first object is generated [7] and then all other objects are assigned an allocation sequence based on that starting position [8]:
static bool shuffle_freelist(struct kmem_cache *s, struct page *page)
{
    void *start;
    void *cur;
    void *next;
    unsigned long idx, pos, page_limit, freelist_count;

    if (page->objects < 2 || !s->random_seq)
        return false;

    freelist_count = oo_objects(s->oo);
    pos = get_random_int() % freelist_count;                      [7]

    page_limit = page->objects * s->size;
    start = fixup_red_left(s, page_address(page));

    /* First entry is used as the base of the freelist */
    cur = next_freelist_entry(s, page, &pos, start, page_limit,
                freelist_count);
    page->freelist = cur;

    for (idx = 1; idx < page->objects; idx++) {                   [8]
        setup_object(s, page, cur);
        next = next_freelist_entry(s, page, &pos, start, page_limit,
                freelist_count);
        set_freepointer(s, cur, next);
        cur = next;
    }
    setup_object(s, page, cur);
    set_freepointer(s, cur, NULL);

    return true;
}
where next_freelist_entry() simply returns the address of the next object based on the next index value (or offset within the slab) in the pre-computed list and wraps around if the current position reaches the end of the pre-computed list:
static void *next_freelist_entry(struct kmem_cache *s, struct page *page,
                unsigned long *pos, void *start,
                unsigned long page_limit,
                unsigned long freelist_count)
{
    unsigned int idx;

    ...

    do {
        idx = s->random_seq[*pos];
        *pos += 1;
        if (*pos >= freelist_count)
            *pos = 0;
    } while (unlikely(idx >= page_limit));

    return (char *)start + idx;
}
For example, for kmalloc-8, a pre-computed index sequence with starting pos = 2 results in a shuffled slab layout rather than a sequential one.
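As a rough illustration (the random_seq values below are made up), the wrap-around walk performed by next_freelist_entry() can be simulated in a few lines of user-space C:

#include <stdio.h>

int main(void)
{
    /* Hypothetical pre-computed sequence for a slab holding 8 objects. */
    unsigned int random_seq[8] = { 5, 1, 7, 3, 0, 6, 2, 4 };
    unsigned int pos = 2;   /* random starting position, as in [7] */

    /* Allocations walk random_seq from pos, wrapping around at the end,
     * so the resulting slot order here is 7, 3, 0, 6, 2, 4, 5, 1. */
    for (int i = 0; i < 8; i++) {
        printf("allocation %d -> slab slot %u\n", i, random_seq[pos]);
        pos = (pos + 1) % 8;
    }
    return 0;
}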
In the example above, there is no way to deterministically place target and vulnerable objects next to each other (when a new kmalloc-8 slab is created) by simply performing sequential heap allocations. Instead, a common technique to exploit heap overflows with freelist pointer randomisation enabled is to:
- Exhaust the cache by allocating objects of the right size to fill in all partial slabs and start allocating new slabs.
- Start filling in new slabs with target objects.
- Free one target object and allocate the vulnerable object.
- Perform the overflow and check which target object was modified.
Checking which target object was overflown in step 4 may or may not be possible depending on the overflow itself, the target cache and/or the chosen target object. To improve the reliability of this technique, a few "holes" are often made in the slab(s) in step 3 (instead of freeing only one target object) and refilled with multiple vulnerable objects.
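Put together, and with hypothetical primitives standing in for the bug-specific allocations, the technique reads roughly as follows:

/* All helpers below are hypothetical placeholders for bug-specific primitives. */
static void exhaust_partial_slabs(void)   { /* step 1: fill existing partial slabs */ }
static int  alloc_target_object(void)     { return 0; /* id/handle for a later free */ }
static void free_target_object(int id)    { /* frees one sprayed target object     */ }
static void alloc_vulnerable_object(void) { /* allocation that can be overflowed   */ }
static void trigger_overflow(void)        { /* perform the out-of-bounds write     */ }

static void heap_overflow_with_random_freelist(void)
{
    int targets[512];

    exhaust_partial_slabs();                        /* step 1 */

    for (int i = 0; i < 512; i++)                   /* step 2 */
        targets[i] = alloc_target_object();

    /* Step 3: punch several holes (not just one) and refill each with a
     * vulnerable object to improve reliability. */
    for (int i = 0; i < 512; i += 64) {
        free_target_object(targets[i]);
        alloc_vulnerable_object();
    }

    trigger_overflow();                             /* step 4 */
    /* Step 4 (cont.): check which target object was corrupted, if possible. */
}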
As mentioned above, freelist pointer randomisation is not a mitigation against UAF vulnerabilities and does not affect the order of the refill - it is always performed in the FILO order with or without freelist pointer randomisation. This means that the last freed spot in the slab will get allocated first.
Modern kernels introduced changes that hinder successful exploitation of kernel heap-related vulnerabilities. Some of these changes were intentional (such as freelist pointer randomisation) and others were a mere side-effect.
Cache aliasing is one of the most important features from the exploitation perspective. Newer kernels keep general-purpose kmalloc caches unmergeable as a side-effect of the hardened usercopy implementation (regardless of CONFIG_HARDENED_USERCOPY being enabled or disabled).
5.9+ kernels (up to 5.14), however, made kmem accounted (GFP_KERNEL_ACCOUNT) general-purpose allocations mergeable with unaccounted allocations, turning some previously unexploitable memory corruption bugs into exploitable vulnerabilities.