For the purposes of this paper, we assume that each copy of the user application code installed on a compute node explicitly declares the data structures that are to be shared. Furthermore, we assume that such shared data structures are protected by locks, which guarantee mutual exclusion among the possibly many readers and writers, and that each carries at least two flags. The first flag indicates whether the data are valid to read; the second indicates whether the data can be overwritten.
When the process running on the compute node needs to read from or write to its own shared space, it tests the flags without acquiring the lock; as a result, it tests its own, presumably cached, copy of the flag. Because no lock is acquired and the flag is presumably cached, this test is inexpensive.
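To make the assumed layout concrete, the following is a minimal sketch of a shared data structure with its lock and two flags, together with a local access that tests the flags without taking the lock. The names `SharedRegion`, `local_read`, and `local_write` are hypothetical, and the flag transitions performed after a local write are our assumption for illustration; the text above does not prescribe them.

```python
import threading
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class SharedRegion:
    """Hypothetical shared data structure: payload, lock, and two flags."""
    data: Any = None
    valid_to_read: bool = False   # first flag: the data are valid to read
    can_overwrite: bool = True    # second flag: the data can be overwritten
    # The lock is used for accesses that need mutual exclusion; the local
    # fast path below deliberately does not touch it.
    lock: threading.Lock = field(default_factory=threading.Lock)


def local_read(region: SharedRegion) -> Optional[Any]:
    """Local read: test the read flag without acquiring the lock."""
    if not region.valid_to_read:
        return None               # nothing valid to read yet
    return region.data


def local_write(region: SharedRegion, value: Any) -> bool:
    """Local write: test the overwrite flag without acquiring the lock."""
    if not region.can_overwrite:
        return False              # current contents must not be overwritten
    region.data = value
    # Assumed transition: after a write the data become readable and are
    # protected from being overwritten.
    region.can_overwrite = False
    region.valid_to_read = True
    return True
```

In this sketch a failed flag test simply returns to the caller, which may retry later; whether the compute process retries, spins, or does other work in the meantime is left open.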
When the compute process needs to issue a remote read or remote write request, it places the request in a FIFO queue in shared memory. To do so, it must first obtain the lock for the queue; this operation is assumed to be blocking for the compute process. Once the request has been issued, the compute process can continue its execution.
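The request path can be sketched as a lock-protected FIFO queue. This is an illustration under stated assumptions, not the actual implementation: `RemoteRequest`, its fields, and `RequestQueue` are hypothetical names, and the `take` method stands in for whichever agent services the queue.

```python
import threading
from collections import deque
from dataclasses import dataclass
from typing import Any, Deque, Optional


@dataclass
class RemoteRequest:
    """Hypothetical request record; the field names are assumptions."""
    op: str                 # "read" or "write"
    target_node: int
    address: int
    payload: Any = None


class RequestQueue:
    """FIFO queue of remote requests, protected by a single lock."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._items: Deque[RemoteRequest] = deque()

    def post(self, request: RemoteRequest) -> None:
        """Called by the compute process; blocks until the lock is held."""
        with self._lock:
            self._items.append(request)
        # The lock is released here and the caller continues executing.

    def take(self) -> Optional[RemoteRequest]:
        """Called by the servicing side; returns the oldest request, if any."""
        with self._lock:
            return self._items.popleft() if self._items else None
```

The compute process blocks only for the duration of the enqueue, so the cost of issuing a request is bounded by contention on the queue lock rather than by the latency of the remote operation.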
In the case of blocking communication, where a process can have at most one outstanding request, the compute process waits for notification that the issued request has completed. In the case of nonblocking communication, a process may have multiple outstanding remote requests and thus must probe the appropriate shared-memory flags for completion.
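The difference between the two modes can be illustrated with a per-request completion flag. The sketch below is an assumption-laden illustration: the `Completion` class and the `pending` helper are hypothetical, and a `threading.Event` stands in for the shared-memory completion flag.

```python
import threading
from typing import List


class Completion:
    """Hypothetical completion flag attached to one remote request."""

    def __init__(self) -> None:
        self._done = threading.Event()

    def mark_done(self) -> None:
        """Set by the servicing side when the request has completed."""
        self._done.set()

    def wait(self) -> None:
        """Blocking mode: suspend until completion is signalled."""
        self._done.wait()

    def probe(self) -> bool:
        """Nonblocking mode: test the flag and return immediately."""
        return self._done.is_set()


def pending(handles: List[Completion]) -> List[Completion]:
    """Return the outstanding requests that have not yet completed."""
    return [h for h in handles if not h.probe()]
```

A blocking caller issues one request and calls `wait` on its handle; a nonblocking caller keeps a list of handles and periodically calls `pending` to discover which requests are still outstanding.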