The basic abstraction provided to clients by the NMS system is that
of an NMS virtual unit, which is essentially an array of virtual
memory pages that behaves much like a disk and which can be accessed through
the usual Unix block device driver interface.
A single client system can access many virtual units simultaneously --
these are identified via separate minor device numbers in the usual way.
At any given time, an NMS virtual unit has a fixed size, but the system
permits a virtual unit to be resized at any time.
Besides the block device interface, the NMS system also provides a character
interface parallel to the block interface, so that each NMS virtual
unit has both a block and a character device associated with it.
The character interfaces are used to configure and control the virtual
units via ioctl() calls, and also serve as sources of
event streams, which give clients information about
NMS-related events that may be useful for active processing.
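Our prototype will simultaneously support two distinct usage models: a "sophisticated" model, in which client applications explicitly open an NMS virtual unit and map it into their address space, and a "naive" model, in which an NMS virtual unit serves as a system-wide swap device.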
Because of differences in the way system swap devices are handled in Linux, these two modes of use impose somewhat distinct requirements on the implementation. Although we expect the system to perform best under the "sophisticated" model, we feel that supporting the naive model is important to allow unmodified client applications to take advantage of network memory.
The protocol by which a process running on a client system obtains access to an NMS virtual unit is as follows. As mentioned above, each NMS virtual unit has both a block and a character device associated with it. We refer to the character device as the control device for the virtual unit. The control devices are exclusive-open devices, and opening the control device associated with a particular NMS virtual unit number causes the NMS system to allocate and initialize the virtual unit, and causes the opening process to become the controlling process for that virtual unit. The virtual unit will continue to exist and function as long as the controlling process continues to hold open the control device. Once the control device is closed, the virtual unit is deallocated, and any data stored in it is lost.
Exactly how the controlling process is related to client applications that actually use the NMS virtual unit depends on the particular usage model. Under the "sophisticated" model, the controlling process might actually be the client application process itself. In this case, the client application itself assumes all responsibility for managing the virtual unit. In some cases, it may be more convenient for the virtual unit to be managed by a separate active processing daemon, distinct from the application process that is actually using the virtual unit. Finally, in the naive model in which the virtual unit is being used as a system-wide swap device, the controlling process would be a system daemon whose job is to maintain the virtual unit for this purpose.
In any of the above modes of use, an NMS virtual unit is allocated and
initialized when a process successfully opens the associated control device,
thereby becoming the controlling process for that NMS unit.
Once the control device has been successfully opened, the controlling
process will issue ioctl() calls to configure the desired
size of the virtual unit, as well as to set parameters such
as the desired replication strategy and degree of replication.
The NMS virtual unit is then ready for access
by client processes, by the kernel, or by both.
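The open-then-configure sequence described above might look roughly like the following sketch. The request codes NMS_IOC_SET_SIZE and NMS_IOC_SET_REPL, the struct nms_repl_cfg layout, and the function name are all hypothetical, since the actual ioctl() interface is not specified here; only open(), ioctl(), and close() are real system calls.

```c
#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Hypothetical request codes and argument layout; the real NMS driver
 * interface is not given in this document. */
struct nms_repl_cfg {
    uint32_t strategy;  /* hypothetical replication strategy identifier */
    uint32_t degree;    /* hypothetical number of replicas */
};
#define NMS_IOC_SET_SIZE _IOW('N', 1, uint64_t)             /* size in pages */
#define NMS_IOC_SET_REPL _IOW('N', 2, struct nms_repl_cfg)

/* Open the control device at ctl_path and configure the virtual unit.
 * Returns the open descriptor, which the caller must keep open for as
 * long as the unit should exist, or -1 on failure. */
int nms_unit_create(const char *ctl_path, uint64_t npages,
                    uint32_t strategy, uint32_t degree)
{
    struct nms_repl_cfg repl = { strategy, degree };
    int fd = open(ctl_path, O_RDWR);  /* exclusivity is enforced by the driver */

    if (fd < 0)
        return -1;
    if (ioctl(fd, NMS_IOC_SET_SIZE, &npages) < 0 ||
        ioctl(fd, NMS_IOC_SET_REPL, &repl) < 0) {
        close(fd);  /* closing the control device deallocates the unit */
        return -1;
    }
    return fd;
}
```

A controlling process would call nms_unit_create() once, hold on to the returned descriptor, and close it only when the unit and its data are no longer needed.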
A client process wishing to access the virtual unit under the
sophisticated model will do so by opening the associated block device,
and then using the mmap() system call to map the virtual
unit into its address space.
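As a minimal sketch of this step, a client process could map the unit as shown below. The function name and the device path passed to it are assumptions for illustration; the code uses only the standard open(), mmap(), and close() calls and assumes the unit has already been configured by its controlling process.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map len bytes of the block device at blk_path into the caller's
 * address space, starting at offset 0 of the unit.
 * Returns the mapping, or NULL on failure. */
void *nms_unit_map(const char *blk_path, size_t len)
{
    void *p;
    int fd = open(blk_path, O_RDWR);

    if (fd < 0)
        return NULL;
    /* MAP_SHARED so that stores reach the underlying unit rather than a
     * private copy. */
    p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);  /* a mapping stays valid after its descriptor is closed */
    return p == MAP_FAILED ? NULL : p;
}
```

The caller releases the mapping with munmap() when it is finished with the unit.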
It is possible for multiple processes on a single client system to
simultaneously access the same virtual unit. However, NMS data
is never shared between client processes running on different client
systems.
Under the naive model, NMS virtual units to be used for system swap
would be opened during system initialization by a dedicated daemon
designed for this purpose. Once the units have been opened and configured,
the daemon would arrange for the kernel to add the associated block
device to the system swap pool in the usual fashion.
The controlling process for an NMS unit also has access to an
event stream which is generated by the NMS subsystem in
the kernel. The purpose of this event stream is to make available
to the controlling process information about the occurrence of
events within the system, which it might need to know in order to
perform active memory service, to monitor paging performance,
or to do debugging. The event stream consists of a sequence of
packets, each of which contains a type, a length, and a timestamp,
and possibly other data that depends on the type.
Examples of events that would be provided via this mechanism are the
occurrence of a page fault for a process at a virtual address that
is mapped to an NMS unit, and the arrival of a page of data from an
NMS server over the high-speed network.
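To make the packet structure concrete, here is one possible layout together with a routine that walks a buffer of packets. Only the presence of a type, a length, and a timestamp comes from the description above; the field widths, the reserved padding field, the event type codes, and all names are assumptions.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical fixed header for one event packet. */
struct nms_event_hdr {
    uint16_t type;       /* event type code (values below are invented) */
    uint16_t length;     /* total packet length in bytes, header included */
    uint32_t reserved;   /* padding so the timestamp is 8-byte aligned */
    uint64_t timestamp;  /* time of the event; units are unspecified */
};

enum { NMS_EV_PAGE_FAULT = 1, NMS_EV_PAGE_ARRIVAL = 2 };  /* hypothetical */

typedef void (*nms_event_fn)(const struct nms_event_hdr *hdr,
                             const unsigned char *payload,
                             size_t payload_len, void *ctx);

/* Walk the complete packets in buf[0..len) and call fn for each one.
 * Returns the number of bytes consumed; a trailing partial packet is
 * left unconsumed. Returns (size_t)-1 if a header carries a length
 * smaller than the header itself. */
size_t nms_events_parse(const unsigned char *buf, size_t len,
                        nms_event_fn fn, void *ctx)
{
    size_t off = 0;

    while (len - off >= sizeof(struct nms_event_hdr)) {
        struct nms_event_hdr hdr;

        memcpy(&hdr, buf + off, sizeof hdr);  /* avoids unaligned access */
        if (hdr.length < sizeof hdr)
            return (size_t)-1;
        if (hdr.length > len - off)
            break;                            /* incomplete packet */
        fn(&hdr, buf + off + sizeof hdr, hdr.length - sizeof hdr, ctx);
        off += hdr.length;
    }
    return off;
}

/* Example callback: counts events by incrementing *(size_t *)ctx. */
void nms_count_event(const struct nms_event_hdr *hdr,
                     const unsigned char *payload,
                     size_t payload_len, void *ctx)
{
    (void)hdr; (void)payload; (void)payload_len;
    ++*(size_t *)ctx;
}
```

Carrying the total length in every header lets a consumer skip event types it does not understand, which matters if new event types are added later.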
The controlling process for an NMS unit accesses the event stream
using the read() system call on the control device.
The controlling process has the responsibility of reading the
event stream frequently enough that the kernel does not have to discard
events to avoid a buffer overrun.
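One way for a controlling process to keep up with the stream is to wait for data with poll() and then drain it with a single large read(), as in the sketch below. It assumes, without confirmation from the driver design, that the control device supports poll() and that each read() returns whole packets; the function and callback names are hypothetical.

```c
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <unistd.h>

typedef void (*nms_chunk_fn)(const unsigned char *data, size_t len, void *ctx);

/* Wait up to timeout_ms for event data on ctl_fd, then read whatever is
 * available and pass it to fn. Returns the number of bytes delivered,
 * 0 on timeout or end of stream, or -1 on error. */
long nms_events_poll(int ctl_fd, int timeout_ms, nms_chunk_fn fn, void *ctx)
{
    unsigned char buf[16384];  /* a large buffer means fewer system calls */
    struct pollfd pfd = { .fd = ctl_fd, .events = POLLIN };
    ssize_t n;
    int r;

    do {
        r = poll(&pfd, 1, timeout_ms);
    } while (r < 0 && errno == EINTR);  /* retry if interrupted by a signal */
    if (r <= 0)
        return r;                       /* 0 = timeout, -1 = error */

    do {
        n = read(ctl_fd, buf, sizeof buf);
    } while (n < 0 && errno == EINTR);
    if (n > 0)
        fn(buf, (size_t)n, ctx);
    return (long)n;
}

/* Example callback: adds the chunk length to *(size_t *)ctx. */
void nms_sum_bytes(const unsigned char *data, size_t len, void *ctx)
{
    (void)data;
    *(size_t *)ctx += len;
}
```

A controlling process would call nms_events_poll() in its main loop and keep the callback short, so that time spent processing one batch does not cause the kernel to discard later events.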
The controlling process can perform ioctl() calls
on the control device in order to set filters that limit
the types of events that the kernel will send to the event stream.
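A filter of this kind could be expressed as a bitmask with one bit per event type, as sketched below. The mask encoding, the NMS_IOC_SET_FILTER request code, and the function names are assumptions made for illustration; the document does not say how filters are actually represented.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/ioctl.h>

/* Hypothetical request code: the argument is a mask of wanted events. */
#define NMS_IOC_SET_FILTER _IOW('N', 3, uint32_t)

/* Build a mask with bit t set for each event type t in types[0..n).
 * Types outside the range 0..31 are ignored. */
uint32_t nms_filter_mask(const unsigned *types, size_t n)
{
    uint32_t mask = 0;

    for (size_t i = 0; i < n; i++)
        if (types[i] < 32)
            mask |= (uint32_t)1 << types[i];
    return mask;
}

/* Ask the kernel to deliver only the event types selected in mask.
 * Returns 0 on success or -1 on failure, as ioctl() does. */
int nms_set_filter(int ctl_fd, uint32_t mask)
{
    return ioctl(ctl_fd, NMS_IOC_SET_FILTER, &mask);
}
```

Filtering in the kernel rather than in the controlling process reduces the volume of the stream, which in turn makes a buffer overrun less likely.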
(A link to a more detailed description of the various operations that can be made by the controlling process on the control device should go here. Also, a list of all events that the controlling process might elect to receive should go here.)
The overall organization of the network memory server is shown below:
The NMS system has the following components:
This module is responsible for most aspects of NMS operation on the client side. It implements the block and character pseudo-device driver interfaces, it has the responsibility for handling client page faults that require NMS service, and it handles pageout requests originating from the kernel swap daemon. It also generates the event streams consumed by the CSD and APDs, and it manages a small cache consisting of data pages recently pushed by servers in response to prepaging requests from APDs.
This process is responsible for those aspects of client-side operation for which kernel implementation is not essential or desirable. It maintains TCP connections with all servers that have been used for paging in the past, or are candidates for paging. It tracks the status of such servers, and is responsible for declaring them "down" when the TCP connection is broken.
This module is responsible for handling pagein and pageout requests arriving from client systems and responding to them with minimum possible latency. It consults the SSCM to determine whether pages are currently resident in the NMS cache. It also generates an event stream consumed by the SSD.
This process controls NMS operation in the kernel. It uses the event stream originating from the SSKM to track the contents of the NMS cache. It manages the backing store used by the NMS system, and it is responsible for deciding when to transfer data between the cache and backing store. It tracks the status of clients that own data stored on the server, and responds to requests from clients to free this data.
This module sequesters a large amount of system RAM on the server side and maintains it as a cache for data pages from clients. It provides a mechanism for quickly determining if a page referenced in a client pagein request is resident in the cache.
The high-speed network (HSN) interfaces sit between the NMS system and the Myrinet API. They provide a datagram facility for communication between clients and servers over the high-speed network.
Active processing daemons are used for performance monitoring and debugging, as well as for experimental prepaging heuristics. They execute on a client or a server system, process an event stream originating from the kernel, communicate with active processing daemons elsewhere in the system, and take actions based on this information.
This module emulates the Myrinet API, and is used for debugging on systems that don't have Myrinet.
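The server-side cache described above needs a fast test for whether a requested page is resident. The document does not say how this is done; the sketch below uses a fixed-size, open-addressing hash table as one plausible choice. The key fields (client, unit, page), the table size, and all names are assumptions, and eviction is omitted because removing entries from this kind of table needs extra bookkeeping.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NMS_CACHE_SLOTS 1024u  /* power of two; illustrative size only */

struct nms_page_key {
    uint32_t client;  /* hypothetical client identifier */
    uint32_t unit;    /* virtual unit number on that client */
    uint64_t page;    /* page index within the unit */
};

struct nms_cache_slot {
    struct nms_page_key key;
    void *frame;      /* cached page frame; NULL marks an empty slot */
};

/* Must be zero-initialized before use. */
struct nms_cache_index {
    struct nms_cache_slot slots[NMS_CACHE_SLOTS];
};

static uint32_t nms_hash(const struct nms_page_key *k)
{
    uint64_t h = k->page * 0x9E3779B97F4A7C15ull;  /* multiplicative mix */

    h ^= ((uint64_t)k->client << 32) | k->unit;
    h *= 0x9E3779B97F4A7C15ull;
    return (uint32_t)(h >> 32) & (NMS_CACHE_SLOTS - 1);
}

static bool nms_key_eq(const struct nms_page_key *a,
                       const struct nms_page_key *b)
{
    return a->client == b->client && a->unit == b->unit && a->page == b->page;
}

/* Record that frame holds the page named by key.
 * Returns false if the table is full. */
bool nms_cache_insert(struct nms_cache_index *ix,
                      struct nms_page_key key, void *frame)
{
    uint32_t start = nms_hash(&key);

    for (uint32_t i = 0; i < NMS_CACHE_SLOTS; i++) {
        struct nms_cache_slot *s =
            &ix->slots[(start + i) & (NMS_CACHE_SLOTS - 1)];
        if (s->frame == NULL || nms_key_eq(&s->key, &key)) {
            s->key = key;
            s->frame = frame;
            return true;
        }
    }
    return false;
}

/* Return the cached frame for key, or NULL if the page is not resident. */
void *nms_cache_lookup(const struct nms_cache_index *ix,
                       struct nms_page_key key)
{
    uint32_t start = nms_hash(&key);

    for (uint32_t i = 0; i < NMS_CACHE_SLOTS; i++) {
        const struct nms_cache_slot *s =
            &ix->slots[(start + i) & (NMS_CACHE_SLOTS - 1)];
        if (s->frame == NULL)
            return NULL;  /* an empty slot ends the probe sequence */
        if (nms_key_eq(&s->key, &key))
            return s->frame;
    }
    return NULL;
}
```

With a lightly loaded table a lookup touches only a few adjacent slots, which suits a pagein path that must respond with minimum latency.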