Protocol for messages with file descriptors
===========================================

Implemented by comms.c.

The first protocol is used to send messages over a socket. It simply
divides the stream into messages. Each message may contain data and
file descriptors.

Each message comprises:

 * int32: "MSG!"
 * int32: size of data payload in bytes (not necessarily word-aligned)
 * int32: number of FDs
 * data payload, padded to word (4 byte) boundary

See the man pages sendmsg(2), recvmsg(2) and cmsg(3) for details about
how file descriptors are sent across sockets.


Object-capability protocol
==========================

Implemented by cap-protocol.c.

This is layered on top of the message protocol. It allows references
to an arbitrary number of objects to be exported across a connection.
Objects can be invoked, passing data, file descriptors and references
to other objects as arguments. Object references can be dropped,
allowing the storage associated with the reference to be reclaimed;
the storage associated with the object itself can also potentially be
reclaimed.

There are two endpoints to a connection. Each may export object
references to the other. The protocol is symmetric -- it doesn't
distinguish between client and server. For the sake of explanation,
however, let us call the endpoints A and B. Everything below would
still hold true if A and B were swapped.

At any given point, A is exporting a set of object references to B.
Each reference has an integer ID. These are typically interpreted as
indexes into an array (the `export table'), so that when A receives an
invocation request from B, it can look up the object in the array and
invoke it.

The set of object references that A exports to B may be extended with
new references by messages that A sends to B (but not by messages B
sends to A). These object references may be removed by messages that
B sends to A (but not by messages A sends to B).

Messages in the protocol contain object IDs, which contain two parts.
The lower 8 bits are a `namespace ID'. This indicates whether the
reference is to an object exported by A or by B, and whether a
newly-exported reference is single-use. The rest of the object ID is
the reference ID (an index into an export table).

The possible namespace IDs are:

#define CAPP_NAMESPACE_RECEIVER           0
#define CAPP_NAMESPACE_SENDER             1
#define CAPP_NAMESPACE_SENDER_SINGLE_USE  2

The messages that A may send to B are:

 * "Invk" cap/int no_cap_args/int cap_args data + FDs

   Invokes an object X. X is denoted by the object ID `cap'. X must
   be an object that B exports to A, so `cap' may only use the
   RECEIVER namespace. (Since A is sending, B is the receiver.)

   `cap_args' is an array of object IDs of length `no_cap_args'.
   These denote objects to be passed as arguments to X. These object
   IDs may use any of the three namespace IDs:

    * For the RECEIVER namespace, this refers to an object that B
      exports to A.

    * For the SENDER namespace, this indicates that A has added a new
      reference to the set of objects it exports to B. From this
      point on, B may send A messages referring to this ID (except
      that B will refer to the object with the RECEIVER namespace
      instead of SENDER).

    * The SENDER_SINGLE_USE namespace works the same as SENDER, except
      that it indicates to B that the reference is single use. B may
      only invoke this object once. Once B invokes the object, the
      reference becomes invalid. (However, B may pass the object as
      an argument without this restriction.)

   The message may include file descriptors to pass as arguments to X.

   When B receives this message, it invokes X with the specified
   arguments. If `cap' is a reference that is exported as single-use,
   B removes the reference from its export table.

 * "Drop" cap/int

   Drops a reference. `cap' is an object ID that B exports to A, so
   `cap' may only use the RECEIVER namespace.

   When B receives this message, it removes the reference `cap' from
   its export table.
   B may also delete the object X that `cap' denotes if there are no
   other references to X.


Closing the connection
----------------------

Violations: If either end receives a message that is illegal, such as
messages that contain illegal object IDs, it may choose to terminate
the connection. This would mean closing the file descriptor for the
socket. Assuming there are no other copies of this file descriptor in
the system (in this or other processes), the other end will get an
error when it tries to read from its socket, and also regard the
connection as broken.

Having closed the connection, an endpoint is free to delete its export
table, and possibly free the objects it contained references to.

In general, endpoints are free to close the connection anyway, if they
want to.

When no references are exported from B to A or A to B, it is
conventional to close the connection, because it is of no use: no
messages can legally be sent on it.

Conventionally, A will close the connection rather than sending a
"Drop" message for the last reference that B exports to A, when A
exports nothing to B.


Conventions
-----------

Initial state of a newly-created connection: A and B start off holding
socket file descriptors connected to each other (typically created by
socketpair()). A exports M references to B; these are given IDs 0 to
M-1. B exports N references to A; these are given IDs 0 to N-1. The
numbers M and N must be made known to both A and B by some means
outside of the protocol, just as the file descriptors are obtained by
some means outside of the protocol. If A and B have differing views
about what M and N are, one will probably send messages that the other
sees as a protocol violation, and the latter may close the connection.

Of course, M and N and the file descriptors can be sent in invocations
using the protocol. See the conn_maker object. Also see the
PLASH_CAPS environment variable.

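The object ID layout (namespace ID in the lower 8 bits, reference ID in
the remaining bits) and the initial numbering 0 to M-1 can be combined
in a small sketch. This is only an illustration, not the code in
cap-protocol.c: the helper names capp_make_id, capp_id_namespace,
capp_id_index and capp_initial_cap_ok are hypothetical, and the sketch
assumes that object IDs are 32-bit unsigned values. Only the
CAPP_NAMESPACE_* values are taken from the protocol description.

```c
#include <assert.h>
#include <stdint.h>

/* Namespace IDs as listed in the protocol description. */
#define CAPP_NAMESPACE_RECEIVER           0
#define CAPP_NAMESPACE_SENDER             1
#define CAPP_NAMESPACE_SENDER_SINGLE_USE  2

/* Hypothetical helper: pack a reference ID (export table index) and a
   namespace ID into one object ID.  Assumes 32-bit object IDs. */
static uint32_t capp_make_id(uint32_t index, uint32_t ns)
{
  assert(ns <= 0xff);         /* namespace occupies the lower 8 bits */
  assert(index <= 0xffffff);  /* index must fit in the remaining 24 bits */
  return (index << 8) | ns;
}

/* Hypothetical helpers: split an object ID into its two parts. */
static uint32_t capp_id_namespace(uint32_t id) { return id & 0xff; }
static uint32_t capp_id_index(uint32_t id)     { return id >> 8; }

/* Hypothetical check for a freshly created connection: the `cap' of an
   incoming "Invk" or "Drop" must use the RECEIVER namespace and must
   name one of the `exported' references (IDs 0 to exported-1) that
   this endpoint started out exporting. */
static int capp_initial_cap_ok(uint32_t id, uint32_t exported)
{
  return capp_id_namespace(id) == CAPP_NAMESPACE_RECEIVER
      && capp_id_index(id) < exported;
}
```

With M = 2, an incoming `cap' with reference ID 0 or 1 in the RECEIVER
namespace passes this check; reference ID 2, or any ID using a SENDER
namespace, would count as a violation. A real endpoint would also have
to track references added and dropped later, which this sketch ignores.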
Call-return: With most invocations, you want to receive a result
(even if it's just an indication of success or failure). In these
cases, an object X is invoked with a message starting with "Call".
The first object argument is a `return continuation', C. When it has
finished, X invokes C with arguments containing the results.

What happens if C is never invoked? This might happen if a connection
is broken. C will get freed in this case, perhaps as a result of a
"Drop" message, and this can be used to indicate to the caller that
the call failed.

What happens if C is invoked more than once? C should simply ignore
any invocations after the first one.

A return continuation is typically exported as a single-use
capability. This is not so much to stop it being invoked more than
once (because subsequent invocations can easily be ignored), but more
to prevent the build-up of exported references:

 * When A repeatedly calls B, B might fail to drop the references to
   the return continuations that A passes it after invoking them.
   This would cause A's export table to fill up with useless
   references. A could not legally re-use the IDs for these
   references according to the protocol.

   However, if A passes the return continuations to B as single-use
   references, B cannot legally use their IDs after invoking them, so
   A can re-use the IDs and free up space in its export table. (If B
   does invoke an already-invoked single-use reference, it is
   violating the protocol and A might close the connection as a
   result.)

However, this was not the immediate motivation for adding single-use
references to the protocol. More importantly:

 * libc.so and ld-linux.so (the dynamic linker) both need to make
   calls to objects in order to open files, etc. So they both need to
   pass return continuations, and allocate IDs for them. If the
   return continuations' IDs are invalidated after each call, libc.so
   and ld-linux.so can allocate the IDs without regard to each other.
   It is much simpler when they don't need to co-ordinate but can
   still share the same connection. Each return continuation can be
   exported with the same ID; these are the only objects exported from
   this end of the connection.

   The same issue arises when passing control to a new process image
   using "exec".

   Without single-use references, ld-linux.so might make a call and
   receive an "Invk" message as a result (but not wait for any further
   messages). Then libc.so might make a call, then listen for a
   result and receive a "Drop" message for a reference it never
   exported. libc.so would treat this as a protocol violation and
   shut down the connection.

   With single-use references, the "Drop" message is unnecessary,
   because it is implied.


Future extensions
-----------------

The protocol does not provide a facility for message pipelining, i.e.
letting A invoke the result of a call to B before the call returns
(saving the time of a round trip). Such a facility involves letting
A's messages add entries to B's export tables. A would be able to
choose IDs for references that B exports. It would no longer be the
case that B allocates all the IDs that it exports.


PLASH_COMM_FD and PLASH_CAPS
============================

These environment variables are used to set up the connection and
objects for standard services, like access to the filesystem.

PLASH_COMM_FD contains the number of a file descriptor for a
connection to a server.

PLASH_CAPS says how many objects are exported by the server over the
connection, and what they are. It is a semicolon-separated list of
names for services. The index of a service name in the list is the
object ID for the service.

For example, "fs_op;conn_maker;;;something_else" says that conn_maker
has object ID 1 and something_else has object ID 4.

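Looking up a service in a PLASH_CAPS value amounts to counting
semicolons. The following is a minimal sketch of that lookup, not the
code Plash itself uses; the function name plash_caps_index is
hypothetical. It returns the list index, which the text above
identifies with the object ID for the service.

```c
#include <string.h>

/* Hypothetical helper: return the position of `name' in the
   semicolon-separated list `caps', or -1 if it is not present.
   Empty entries (as in ";;") still occupy an index. */
static int plash_caps_index(const char *caps, const char *name)
{
  size_t name_len = strlen(name);
  const char *p = caps;
  int index = 0;

  if (name_len == 0)
    return -1;  /* an empty name would match unused slots */

  for (;;) {
    const char *end = strchr(p, ';');
    size_t len = end ? (size_t) (end - p) : strlen(p);

    /* Compare whole entries, so "fs_op" does not match "fs_op_maker". */
    if (len == name_len && memcmp(p, name, len) == 0)
      return index;
    if (!end)
      return -1;  /* reached the last entry without a match */
    p = end + 1;
    index++;
  }
}
```

A caller would typically pass the result of getenv("PLASH_CAPS") as
`caps', after checking that it is not NULL.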
Standard services are:

 * fs_op
 * conn_maker
 * fs_op_maker
 * union_dir_maker
 * fab_dir_maker
 * return_cont (this is passed by the "capcmd" expression)


conn_maker object
=================

This has one method:

"Mkco" M/int + (N objects) =>
  "Okay" + FD + (M objects)

This creates a new connection on which the N objects are exported. It
returns "Okay" and a file descriptor for the new connection. The new
connection also imports M objects. The method call returns these M
objects. So far this is only used with M = 0.


fs_op_maker object
==================

This has one method:

"Mkfs" + root_dir/obj =>
  "Okay" + fs_op/obj

This creates an fs_op object (see below) with root_dir as the root
directory. The current working directory is initially unset; you can
set it with the "Chdr" (chdir) method below.


fs_op object
============

This object implements all the standard Unix filesystem calls that
operate on pathnames: open(), mkdir(), unlink() and so on. You can
construct one of these objects given a root directory.

This object has one piece of state: the current working directory
(cwd). This is allowed to be unset, in which case any operation that
is relative to the cwd will return an error.

Notation:

 * The request is given before "=>"; possible replies come after.
 * "+ FD" indicates that a message includes a file descriptor argument.
 * "+ foo/obj" indicates that a message includes an object reference.

Methods:

// duplicate the connection -- called before the fork() syscall
// (now obsolete; will be removed)
"Fork" =>
  "RFrk" + FD

"Copy" =>
  "Okay" + fs_op/obj

"Gdir" pathname =>
  "Okay" + dir/obj
Resolves `pathname' to get a directory, and returns the directory object.

"Grtd" =>
  "Okay" + dir/obj
Same as <<"Gdir" "/">>.

"Gobj" pathname =>
  "Okay" + obj
Resolves `pathname' to get any object; will follow symlinks.

// open() call
"Open" flags/int mode/int filename =>
  "ROpn" + FD
  "RDfd" + FD + dir_stack/obj
    // This is returned when open() is used on a directory.
    // FD is for /dev/null, and the object is a dir_stack.
  "Fail" errno/int

// stat() and lstat() calls
"Stat" nofollow/int pathname =>
  "RSta" stat
  "Fail" errno/int

// readlink() call
"Rdlk" pathname =>
  "RRdl" string
  "Fail" errno/int

// chdir() call
"Chdr" pathname =>
  "RSuc"
  "Fail" errno/int

// fchdir() call: takes a dir_stack object as returned by open()
"Fchd" + dir_stack/obj =>
  "Okay"
  "Fail" errno/int

// getcwd() call
"Gcwd" =>
  "RCwd" pathname
  "Fail" errno/int

// list contents of directories: opendir() + readdir() + closedir()
"Dlst" pathname =>
  // same as `struct dirent' format:
  "RDls" (inode/int type/int name_size/int name)*
  "Fail" errno/int

// access() call
"Accs" mode/int pathname =>
  "RAcc"
  "Fail" errno/int

// mkdir()
"Mkdr" mode/int pathname =>
  "RMkd"
  "Fail" errno/int

// chmod() call
"Chmd" mode/int pathname =>
  "RChm"
  "Fail" errno/int

// utime()/utimes()/lutimes() calls
"Utim" nofollow/int
       atime_sec/int atime_usec/int
       mtime_sec/int mtime_usec/int
       pathname =>
  "RUtm"
  "Fail" errno/int

// rename() call
"Renm" newpath-length/int newpath oldpath =>
  "RRnm"
  "Fail" errno/int

// link() call
"Link" newpath-length/int newpath oldpath =>
  "RLnk"
  "Fail" errno/int

// symlink() call
"Syml" newpath-length/int newpath oldpath =>
  "RSym"
  "Fail" errno/int

// unlink() call
"Unlk" pathname =>
  "RUnl"
  "Fail" errno/int

// rmdir() call
"Rmdr" pathname =>
  "RRmd"
  "Fail" errno/int

// connect() on Unix domain sockets
"Fcon" pathname + FD =>
  "RFco"
  "Fail" errno/int

// bind() on Unix domain sockets
"Fbnd" pathname + FD =>
  "RFbd"
  "Fail" errno/int

// part of execve() call
// The RExe result tells the client what it should pass to the exec syscall.
// The client allocates a spare FD slot; it tells the server the number.
// The server can then use this FD number in the arguments it returns.
// The client receives an FD; it must copy it into that slot using "dup2".
// This will be extended so that the server can also carry out the work of
// the new process.
// The RExo result returns an executable object which the client must invoke
// with full arguments, including the root directory.
"Exec" fd-number/int cmd-len/int cmd argc/int (arg-len/int arg)* =>
  "RExe" cmd-len/int cmd argc/int (arg-len/int arg)* + FD
  "RExo" + CAP
  "Fail" errno/int


Executable objects accept the following methods:

// Test whether this is an executable object.
// Executables that are just files will not respond to this.
"Exep" =>
  "Okay"

"Exeo" ref/int data =>
  "Okay" return_code/int
The data is an array of pairs:
 * ("Argv", x): x is an array of strings representing argv
 * ("Env.", x): x is an array of strings representing the environment
   (usually each string is of the form "X=y")
 * ("Fds.", x): x is an array of (i, FD)
 * ("Root", obj): obj is the root directory
 * ("Cwd.", string): pathname of current working directory
   (this can be omitted, in which case process will have no defined cwd)
 * ("Pgid", int): process group ID to set for the new process
   (this is optional, but reading from the console won't work without
   setting it, and neither will Ctrl-C or Ctrl-Z)
The invocation returns when the process started has exited.
It returns the exit code that `wait' returns for the process.


where:
stat = dev ino mode nlink uid gid rdev size blksize blocks atime mtime ctime
  (all ints)


Filesystem objects
==================

Only the following is marshalled so far:

// type
"Otyp" =>
  "Okay" type/int

// stat
"Osta" =>
  "Okay" stat
  "Fail" errno/int
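
As an illustration of the `stat' layout listed above, a reply to "Osta"
might be decoded along the following lines. This is a sketch under
stated assumptions, not the marshalling code Plash uses: it assumes
each int is a 32-bit value in the host's byte order, packed with no
padding directly after the 4-byte reply tag. The names PSTAT_*,
struct plash_stat and decode_osta_reply are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Field order follows the `stat = dev ino mode ...' line above. */
enum {
  PSTAT_DEV, PSTAT_INO, PSTAT_MODE, PSTAT_NLINK, PSTAT_UID, PSTAT_GID,
  PSTAT_RDEV, PSTAT_SIZE, PSTAT_BLKSIZE, PSTAT_BLOCKS,
  PSTAT_ATIME, PSTAT_MTIME, PSTAT_CTIME,
  PSTAT_FIELDS  /* number of fields: 13 */
};

/* Hypothetical container for the decoded fields. */
struct plash_stat {
  int32_t field[PSTAT_FIELDS];
};

/* Hypothetical decoder for the data payload of an "Osta" reply.
   Returns 0 and fills in *out for "Okay", the positive errno value
   for "Fail", and -1 if the payload is malformed.
   Assumes 32-bit, host-byte-order ints (not stated in the text). */
static int decode_osta_reply(const unsigned char *buf, size_t len,
                             struct plash_stat *out)
{
  if (len >= 4 + sizeof(out->field) && memcmp(buf, "Okay", 4) == 0) {
    memcpy(out->field, buf + 4, sizeof(out->field));
    return 0;
  }
  if (len >= 8 && memcmp(buf, "Fail", 4) == 0) {
    int32_t err;
    memcpy(&err, buf + 4, sizeof(err));
    return err > 0 ? err : -1;
  }
  return -1;
}
```

After a successful decode, out->field[PSTAT_MODE] holds the mode and
out->field[PSTAT_SIZE] the size, and so on for the other fields.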