Plash: the Principle of Least Authority Shell

Mark Seaborn
mrs35@cantab.net
http://plash.beasts.org


Introduction
============

Typically Unix programs run with all the authority of the user who is running them. Programs have access to all the user's files -- they can modify files, delete files, and send copies to other computers on the Internet.

Plash is a Unix shell which lets you run Unix programs with access only to the files and directories they need to run. Programs are given access to files which were passed as command line arguments.

In order to implement this the filesystem is virtualised. Each process can have its own namespace -- its own root directory -- which can contain a subset of your files.

This is implemented by modifying GNU libc and replacing the system calls that use filenames. For example, open() is changed so that it sends a message to a file server via a socket. If the request is successful, the server sends the client a file descriptor via the socket as a result.

Processes are run in a chroot jail, under freshly-allocated user IDs. They can't access any files using the usual system calls and so must go through the file server instead.

This approach doesn't require modifying the kernel at all. Plash can run Linux binaries unmodified, provided they are dynamically linked with libc, which is almost always the case.


Shell overview
==============

The syntax of Plash is similar to Unix shells such as the Bourne shell or Bash. Here are some examples of command invocations using Plash:

    ls .

Arguments that were implicit before must now be made explicit. With the Bourne shell or Bash you can write `ls' to list the current directory's contents. With Plash you must add `.' to grant access to the current directory.

    gcc -c foo.c => -o foo.o

Files are passed to the program as read-only by default. Adding the `=>' operator to a command invocation allows you to grant write access to a file.
Files that appear to the right of `=>' are passed to the program with write access. Directories to the left of `=>' will be passed as *recursive* (or *transitive*) read-only: files and directories that they contain will also be read-only.

    make => + .

If you want to grant access to a file or directory without passing the filename as an argument, you can use the `+' operator. Files that appear to the right of a `+' are attached to the namespace of the process being run, but the filename is not included in the argument list.

    echo "Hello, world!"

The shell distinguishes between filename arguments and plain string arguments so that it can tell which files to grant access to. Arguments beginning with a hyphen (`-') are interpreted as plain strings, but otherwise you must quote arguments to prevent them from being interpreted as filenames.

    tar -cvzf { => foo.tar.gz } dir1

If you want to put a read-write file before a file that should only be read-only in the argument list, you can limit the scope of the `=>' operator by enclosing arguments in curly brackets { ... }.

    xclock + ~/.Xauthority => + /tmp/.X11-unix
  OR
    xclock + { ~/.Xauthority => /tmp/.X11-unix }

You can run X Windows programs if you give them access to ~/.Xauthority, which contains a password generated by the X server, and /tmp/.X11-unix, which contains the socket for connecting to the X server. Programs must be given write access to a socket in order to connect to it. Note that `+' binds more tightly than `=>'.

    grep 'pattern' file | less

Pipes work as in conventional shells.

    !!bash

If you want to execute a command in the conventional way, without running the process with a virtualised filesystem, in a chroot jail, etc., you can prefix it with "!!". This can be applied to individual command invocations in a pipeline. The syntax for command invocations is the same whether "!!" is used or not, but when it is used, files listed after the "+" operator are ignored.

    cd directory

Changing directory works as before.

Redirecting file descriptors to files and to other FDs works as in Bash.

Globbing is also implemented: you can use `*' as a wildcard in filenames. It does not function as a wildcard in quoted strings. (The `{...}' wildcard is not implemented.)

A number of standard shell features are not implemented yet. This includes expanding and setting environment variables.


Shell semantics
---------------

The shell uses the argument list to construct a filesystem for the process that is to be run. The filesystem is populated with the files named in the argument list.

If a pathname argument contains symbolic links, they are also included in the filesystem. For example, if the argument `/a/b/c' is given, and `/a/b' is a symlink to `/x/y', the objects `/a/b' and `/x/y/c' will both be included in the filesystem.

Directories with the same names as the parent directories will be created in the new filesystem, but the parent directories themselves (with their full contents) will not be included unless they were given as separate arguments.

The filesystem is also populated with `/usr', `/etc', `/bin', `/lib', `/dev/tty' and `/dev/null'. (This is not configurable yet.)

One future extension would be to allow the user to attach an existing file or directory at a different location in the new filesystem (i.e. with a different pathname).


Other commands
--------------

    fg
    bg

Used for job control: puts jobs in the foreground or background.

    opts

Opens an options window (using Gtk). This provides some options for debugging processes under Plash. You can get the file server to print a log of (some of) the calls the processes make.


Architecture overview
=====================

Plash limits the ability of a process to open files by running it in a chroot environment, under dynamically-allocated user IDs. The chroot environment only contains one file, an executable to exec to start the program running in the process.

Rather than using the open() syscall to open files, the client process sends messages to a server process.
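
As a rough illustration of this exchange, the sketch below has a client ask a server to open a file and receive an open file descriptor back over a Unix socket. It is written in Python for brevity and is not Plash's code: the function names `serve_one_open' and `request_open', the `allowed' set and the "ok"/"denied" replies are all hypothetical, and Plash's real wire protocol is different.

```python
import os
import socket


def serve_one_open(sock, allowed):
    """Server side: answer one open request read from sock.

    `allowed` is a hypothetical set of pathnames the client may open.
    """
    path = sock.recv(4096).decode()
    if path not in allowed:
        sock.sendall(b"denied")
        return
    fd = os.open(path, os.O_RDONLY)
    try:
        # The descriptor travels as SCM_RIGHTS ancillary data (see cmsg(3)).
        socket.send_fds(sock, [b"ok"], [fd])
    finally:
        os.close(fd)  # the receiver now holds its own copy


def request_open(sock, path):
    """Client side: ask the server to open `path`; return the received FD."""
    sock.sendall(path.encode())
    data, fds, _flags, _addr = socket.recv_fds(sock, 16, 1)
    if data != b"ok" or not fds:
        raise PermissionError(path)
    return fds[0]
```

The two ends would be connected with something like socket.socketpair(), with the server running in another process or thread. socket.send_fds() and socket.recv_fds() require Python 3.9 or later. The sketch assumes each request arrives in a single recv(), which a real protocol would not rely on.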
One of the file descriptors that the client is started with is a socket which is connected to the server; the environment variable COMM_FD gives the file descriptor number. The server can send the client open file descriptors across the socket in response to `open' requests (see cmsg(3)).

The server can handle multiple connections. If the client wishes to fork() off another process, it first asks the server to send it another socket for a duplicate connection.

GNU libc is re-linked so that open() etc. send requests to the server rather than using the usual Unix system calls. The dynamic linker (/lib/ld.so or, equivalently, /lib/ld-linux.so.2) is similarly re-linked.

execve() is changed so that it always invokes the dynamic linker directly, since the chroot environment does not contain the main executable and the kernel does not provide an fexecve() call. The dynamic linker is passed the executable via a file descriptor.

The file server uses its own filesystem object abstraction internally. Filesystem objects may be files, directories or symbolic links on the underlying filesystem provided by the Unix kernel. They may also be implemented entirely in the server. The server has its own functions for resolving pathnames and following symbolic links which do not use the kernel's facility for following symbolic links.

The shell starts up a new server process for each command the user enters. The shell and the file server are linked into the same executable and the shell uses the same filesystem object abstraction. The shell simply uses fork() to start a new server.

User IDs are allocated by the setuid program `run-as-anonymous'. It picks IDs in the range 0x100000 to 0x200000 (configurable by changing config.sh), and opens lock files in the lock directory "/usr/lib/plash-chroot-jail/plash-uid-locks" so that the same UID is not allocated twice.
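
One way such lock-file allocation could work is sketched below in Python. This is an assumption-laden illustration, not the logic of `run-as-anonymous': it assumes one lock file per UID, named after the decimal UID and created with exclusive-create semantics, and the function name `allocate_uid' is hypothetical.

```python
import os


def allocate_uid(lock_dir, first, last):
    """Claim the lowest UID in [first, last) that has no lock file yet.

    Assumption: a UID counts as allocated exactly when a file named after
    it exists in lock_dir.
    """
    for uid in range(first, last):
        lock_path = os.path.join(lock_dir, str(uid))
        try:
            # O_CREAT | O_EXCL fails if the file already exists, so two
            # allocators racing for the same UID cannot both succeed.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o600)
        except FileExistsError:
            continue
        os.close(fd)
        return uid
    raise RuntimeError("no free UID in range")
```

Calling allocate_uid() repeatedly on the same directory hands out successive UIDs until the range is exhausted.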

The lock directory goes inside the chroot jail so that the sandboxed processes can also spawn processes with reduced authority (though this is not done yet). Therefore `chroot-jail' needs to go on a writable filesystem, so you may need to move it.

The setuid program `gc-uid-locks' will garbage collect and remove UID lock files for UIDs that are no longer in use. It works by scanning the `/proc' filesystem to list currently-running processes and their UIDs. In future it will be run automatically by the shell.


Installation
============

The simplest way to install Plash is to install the Debian or RPM packages. Installation is straightforward because these only depend on libreadline. See for the packages.

Although Plash includes a copy of libc, this does not replace or interfere with your existing installation of libc. Plash installs it in a separate directory.

Building:

On Debian, you can run "debian/rules". Otherwise, you can do:

    ./make-glibc.sh
    ./make-dirs.sh
    ./make.sh libc
    ./make.sh shell
    ./docs/make-docs.sh
    (cd setuid && ./make-setuid.sh)

You may need to set "CC=gcc-3.3".


Filesystem semantics
====================

Symbolic links:

If we pass a directory as an argument to a program, it may contain symbolic links to anywhere. Since processes may now have different namespaces, we have a choice of namespaces in which to resolve the destinations of the symbolic links. Do we resolve them in the user's namespace, or the process's namespace?

If we resolve symlinks in the user's namespace, and we allow the process to create symlinks to arbitrary destinations, it could create a symlink to `/' and thereby grant itself access to all of the user's filesystem.

Instead, we could try to restrict the ability of a process to create symlinks, so that it can only create symlinks to files and directories that it already has access to. But since symlinks are interpreted relative to their position in the filesystem, which can change, it would be difficult to make this robust.
Furthermore, the problem of pre-existing symlinks remains. A user should be able to tell what files and directories they're granting access to based on the command invocation. Granting access also to files and directories that are symlinked to, perhaps from deep inside a directory, violates this, because there is little constraint on the destinations of symlinks.

Resolving symlinks in the process's namespace makes more sense. It follows the normal semantics of symlinks under Unix, which is that symlinks are simply a convenience that *could* be implemented by the process itself rather than by the kernel.

Ultimately, the solution is to do away with symbolic links and replace them with object references.

Implementation:

If we are to implement these semantics, we must be careful not to use the kernel's ability to follow symlinks. There is not a straightforward option for turning off following symlinks in the underlying filesystem. When we give a pathname such as `a/b/c' to the kernel, if `a/b' is a symbolic link the kernel will always follow it, interpreting it in its namespace.

The approach used in the file server is to set the current working directory to each component of the pathname in turn. For each component, do:

 * lstat() on the leaf name. If it's a symlink, do readlink() and interpret the link.
 * Otherwise, if it's a directory, do open(leaf, O_NOFOLLOW | O_DIRECTORY). If O_NOFOLLOW or O_DIRECTORY are not supported, we can do fstat() to check that the object opened is the same as the one we lstat()'d (it may have changed between the system calls).
 * Do fchdir() to set the current directory to the directory.

Obviously this requires more system calls than allowing the kernel to resolve symlinks.

Note that the server must never send the clients FDs for directories. A client could use a directory FD to break out of its chroot jail.

Parent directories: the semantics of `..':

A directory may have different parent directories in different namespaces.
Furthermore, a directory may appear multiple times in the same namespace, and so have multiple parents in that namespace. `..' does not fit well into a system based on object references. However, it is widely used by Unix programs, so we have to support it.

Rather than using the `..' parent directory facility provided by the underlying filesystem, the file server interprets `..' itself. The semantics is that the parent of a directory is the directory that it was reached through, after symlinks have been expanded.

This means that the filename resolver maintains a stack of directory object references. When resolving the pathname `/a/b/..', it will first push the root directory onto the stack, then directory objects for `/a' and `/a/b', and then it will pop `/a/b' off the stack, leaving `/a' at the top of the stack as the result.

If `/a/b' is a symlink to `g/h', however, the filename resolver does not push `/a/b' onto the stack (since it's not a directory object). It pushes `/a/g' and then `/a/g/h' onto the stack. Then, when it interprets `..' in the pathname, it pops `/a/g/h' off the stack to leave `/a/g' (the result) at the top.

Directory objects that correspond to a directory on the underlying filesystem are implemented as an open directory FD.

The server represents the current working directory as one of these directory stacks. One of the consequences of these semantics is that if the current working directory is renamed or moved, the result of getcwd() will not reflect this.

Remaining problems:

The Unix kernel can be regarded as providing a set of capability registers (file descriptors) that can contain directory object references, along with a special capability register (the current working directory) relative to which pathnames are resolved. References can be copied from a normal register to the special register using fchdir(). References can be copied from the special register to the normal registers using open(".").
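
The register model above, combined with the directory stack used for `..', can be sketched in Python as follows. This is a simplified illustration rather than the file server's code: the names `resolve_dir' and `current_dir_ref' are hypothetical, symlinks are refused outright instead of being read and interpreted, and `..' at the root is assumed to stay at the root.

```python
import os

# O_NOFOLLOW makes open() fail on a symlink instead of following it;
# O_DIRECTORY makes it fail on anything that is not a directory.
DIR_FLAGS = os.O_RDONLY | os.O_DIRECTORY | os.O_NOFOLLOW


def resolve_dir(root_fd, path):
    """Resolve `path` below root_fd; return a stack (list) of directory FDs.

    The last element is the result. The caller must close every FD in the
    list. Side effect: changes the process's current working directory.
    """
    stack = [os.dup(root_fd)]
    try:
        for name in path.split("/"):
            if name in ("", "."):
                continue
            if name == "..":
                # The parent is the directory we came through, not the
                # kernel's idea of "..". Assumption: root's parent is root.
                if len(stack) > 1:
                    os.close(stack.pop())
                continue
            os.fchdir(stack[-1])  # normal register -> special register
            # Open the next component relative to the special register.
            stack.append(os.open(name, DIR_FLAGS))
    except OSError:
        for fd in stack:
            os.close(fd)
        raise
    return stack


def current_dir_ref():
    """Copy the special register (cwd) into a normal register: open(".")."""
    return os.open(".", os.O_RDONLY | os.O_DIRECTORY)
```

Resolving `/a/b/..' with resolve_dir() leaves two FDs on the stack, the root and `/a'. A component that is a symlink makes the open() call fail, so the kernel never gets the chance to follow it.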

Unfortunately, this model falls down in two places:

 * Directories with `execute' but not `read' permission cannot be opened with open(). One can chdir() into them, but not fchdir() into them. Arguably, Unix should let you open() such directories but not read their contents using the resulting FD. This could be worked around, but no workaround is implemented yet.
 * link() is unusual in that it takes two pathname arguments. It is difficult to use safely (without the kernel following symlinks) for linking a file into a different directory, because we have no guarantee that the source file (or destination) is the one we intended to link. Any check will be vulnerable to race conditions.


Implementation overview
=======================

Shared between client (libc) and server:

region.[ch]
    This provides:
     * Region-based memory management: blocks of memory can be allocated from a region; the whole region is deallocated in one go.
     * An abstraction for building messages using concatenation.
comms.[ch] -- Send and receive messages (with FDs) across sockets
cap-protocol.[ch] -- Object-capability protocol, for transferring references to objects and FDs across a socket
cap-call-return.c -- Implements the convention for receiving results from invoking an object using the object-capability protocol

libc:

make-link-def.sh
    Deals with hiding and renaming symbols in object files so that libc is linked properly. Interprets `EXPORT' declarations inside comments in C files.
files-to-link.sh
libc-comms.[ch]
libc-connect.c -- provides the connect() and bind() functions
libc-fork-exec.c -- provides fork() and execve()
libc-utime.c -- utime(), utimes(), lutimes()
libc-truncate.c -- truncate()
libc-misc.c -- provides all the other functions to put into libc

server:

parse-filename.[ch]
filesysobj.[ch] -- Filesystem object abstraction: files, dirs, symlinks. Also provides a more general object model for use via the object-capability protocol.
filesysobj-real.c -- Implements "real" filesystem objects, i.e. those based on objects implemented by the kernel
filesysobj-fab.[ch] -- "Fabricated" objects, implemented entirely in the server, which may contain references to other objects.
filesysslot.[ch] -- "Slot" objects: these are used to represent a single entry in a directory.
resolve-filename.[ch] -- Traverses the filesystem to look up a pathname and return a filesystem object.
fs-operations.[ch] -- Implements all of the Unix filesystem operations (open() etc.) in terms of the filesystem objects above.
server.[ch] -- Earlier version of the protocol.

shell:

build-fs.[ch] -- Constructs a filesystem given the files that are to be included.
shell.[ch]
shell.gram -- Grammar (lexerless) for shell; produces a packrat parser (actually, the packrat bit isn't done: there's no caching of results).
shell-globbing.c -- Deals with wildcard expansion in filenames.
make-variants.pl, shell-variants.[ch] -- Type definition for shell's abstract syntax.

setuid programs:

run-as-nobody.c -- changes UID to "nobody" before running program.
run-as-nobody+chroot.c -- changes UID to "nobody" and chroot()s before running program.
run-as-anonymous.c -- allocates a new UID and chroot()s.
gc-uid-locks.c -- deletes lock files for unused UIDs.

other files:

config.sh -- Says where files are installed
install.sh
debug.c
utils.c


Future work
===========

link() and rename() are only partially implemented: they only work in the same-directory case.

Opening directories to give directory FDs is not implemented (to be done properly, this requires changing all functions that deal with FDs).

Setuid executables do not run setuid. This should be replaced by another mechanism.

A number of common shell features are not yet implemented. Some features raise questions about the direction Plash should take as a language. Should it become a full programming language?

Plash now handles job control.
Another direction would be to replace the terminal emulator so that the shell itself creates a virtual terminal for each command it starts. These would be combined into one window the way they are now, but it would provide some isolation between programs (e.g. preventing them from pretending to be the shell), and it would allow the shell to provide GUI-based features.

Tracing facilities: displaying the server log, attaching strace and gdb to the server and the client. It would be convenient to make these accessible through a GUI.

The X protocol allows X clients to interfere with each other. Implement a proxy for the X protocol to prevent this.

Prevent the client from making system calls other than recvmsg, sendmsg etc. There are a number of possible mechanisms for doing this:

 * Use ptrace(). Unfortunately, this doesn't work securely with fork(). It's also very slow.
 * Systrace. Requires a kernel patch. It's more complicated than necessary, because it allows another process to handle the syscall. That makes it less maintainable and less likely to be available for the kernel version you want.
 * Andrea Arcangeli's /proc/PID/seccomp which is in 2.6.something. This only allows read(), write() and close() to work (and maybe one syscall for signals). That's too restrictive. The patch looks very simple to modify, though, except perhaps for handling fork(). I'm not sure how easy it would be to adapt to Linux 2.4.
 * Ostia's kernel module. This has the advantage that it's a module, so it would be simpler to build and load.

Provide a facility for building namespaces (filesystems) in the shell. Provide a C API for doing this.

Powerboxes for GUI programs to request files.