II. Design Overview

Application Interface

The basic abstraction that the NMS system provides to clients is the NMS virtual unit: essentially an array of virtual memory pages that behaves much like a disk and is accessed through the usual Unix block device driver interface. A single client system can access many virtual units simultaneously; they are distinguished by separate minor device numbers in the usual way. A virtual unit has a fixed size at any given moment, but the system permits it to be resized at any time. In parallel with the block interface, the NMS system also provides a character interface, so that each NMS virtual unit has both a block device and a character device associated with it. The character devices are used to configure and control the virtual units via ioctl() calls. They also serve as sources of event streams, which give clients information about NMS-related events that may be useful in performing active processing.

Our prototype will simultaneously support two distinct usage models:

  1. Naive model: One or more NMS virtual units are used as swap devices in place of the normal disk swap devices. This mode of use permits non-NMS-aware clients to run under the NMS system without any recompilation or relinking.

  2. "Sophisticated" model: Client applications use the mmap() system call to map NMS virtual units into their address spaces. Paging operations on the range of virtual addresses covered by the mapping will then be forwarded to the NMS subsystem, instead of being handled by the normal system swap mechanism.

Because of differences in the way system swap devices are handled in Linux, these two modes of use impose somewhat distinct requirements on the implementation. Although we expect the system to perform best under the "sophisticated" model, we feel that supporting the naive model is important to allow unmodified client applications to take advantage of network memory.

A process running on a client system obtains access to an NMS virtual unit by the following protocol. We refer to the character device associated with a virtual unit as the control device for that unit. Control devices are exclusive-open devices. Opening the control device associated with a particular NMS virtual unit number causes the NMS system to allocate and initialize the virtual unit, and makes the opening process the controlling process for that virtual unit. The virtual unit continues to exist and function for as long as the controlling process holds the control device open. Once the control device is closed, the virtual unit is deallocated and any data stored in it is lost.

Exactly how the controlling process is related to the client applications that actually use the NMS virtual unit depends on the usage model. Under the "sophisticated" model, the controlling process might be the client application process itself, in which case the client application assumes all responsibility for managing the virtual unit. In some cases, it may be more convenient for the virtual unit to be managed by a separate active processing daemon, distinct from the application process that is actually using the virtual unit. Finally, under the naive model, in which the virtual unit is used as a system-wide swap device, the controlling process would be a system daemon whose job is to maintain the virtual unit for this purpose.
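
The lifetime rule above (the unit exists exactly as long as the control device is held open) can be sketched in a few lines of C. This is only an illustration under stated assumptions: the device path /dev/nms0c and the helper names are hypothetical, and the use of EBUSY to report that a unit already has a controlling process is a common convention for exclusive-open devices, not something this document specifies.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical control-device node; this document does not name device files. */
#define NMS_CTL_PATH "/dev/nms0c"

/*
 * Open the control device, which (per the protocol above) allocates the
 * virtual unit and makes the caller its controlling process.
 * Returns the descriptor, or -1 with errno set by open().
 */
int nms_acquire_unit(const char *ctl_path)
{
    assert(ctl_path != NULL);

    int fd = open(ctl_path, O_RDWR);
    if (fd < 0) {
        int saved = errno;  /* fprintf may clobber errno */
        if (saved == EBUSY) /* assumed errno for "already controlled" */
            fprintf(stderr, "%s: unit already has a controlling process\n",
                    ctl_path);
        errno = saved;
        return -1;
    }
    return fd;
}

/*
 * Close the control device. Per the protocol above, this deallocates the
 * virtual unit and discards its contents, so call it only when done.
 */
int nms_release_unit(int ctl_fd)
{
    return close(ctl_fd);
}
```

A controlling daemon would call nms_acquire_unit(NMS_CTL_PATH) once at startup and keep the descriptor open for the whole period during which clients or the kernel use the unit.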

In any of the above modes of use, an NMS virtual unit is allocated and initialized when a process successfully opens the associated control device, thereby becoming the controlling process for that unit. The controlling process then issues ioctl() calls to configure the desired size of the virtual unit and to set parameters such as the replication strategy and the degree of replication. The NMS virtual unit is then ready for access by client processes, by the kernel, or by both.

A client process wishing to access the virtual unit under the "sophisticated" model does so by opening the associated block device and then using the mmap() system call to map the virtual unit into its address space. Multiple processes on a single client system may access the same virtual unit simultaneously. However, NMS data is never shared between client processes running on different client systems.

Under the naive model, NMS virtual units to be used for system swap would be opened during system initialization by a dedicated daemon designed for this purpose. Once the units have been opened and configured, the daemon would arrange for the kernel to add the associated block devices to the system swap pool in the usual fashion.
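
As a rough illustration of the configuration and mapping steps just described, the following C sketch shows one way they might look from user space. Everything NMS-specific here is an assumption made for the example: the struct nms_config layout, the NMS_IOC_CONFIGURE request code, and the helper names are hypothetical, since this document does not define the ioctl() interface. Only open(), ioctl(), mmap(), and close() are standard calls.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical configuration record passed to the control device. */
struct nms_config {
    uint64_t size_pages;    /* desired size of the virtual unit, in pages */
    uint32_t repl_strategy; /* replication strategy selector (assumed) */
    uint32_t repl_degree;   /* number of replicas (assumed) */
};

/* Hypothetical ioctl request code; the real driver would define its own. */
#define NMS_IOC_CONFIGURE _IOW('N', 1, struct nms_config)

/* Controlling process: set size and replication parameters.
 * Returns 0 on success, -1 with errno set on failure. */
int nms_configure(int ctl_fd, uint64_t size_pages,
                  uint32_t strategy, uint32_t degree)
{
    struct nms_config cfg = { size_pages, strategy, degree };
    return ioctl(ctl_fd, NMS_IOC_CONFIGURE, &cfg);
}

/* Client process: open the block device and map len bytes of the unit.
 * Returns the mapping, or MAP_FAILED with errno set. */
void *nms_map_unit(const char *blk_path, size_t len)
{
    assert(blk_path != NULL && len > 0);

    int fd = open(blk_path, O_RDWR);
    if (fd < 0)
        return MAP_FAILED;

    /* MAP_SHARED so that paging of this range goes to the mapped device. */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    close(fd); /* a mapping remains valid after its descriptor is closed */
    return p;
}
```

A client would then treat the returned pointer as ordinary memory and call munmap() when finished. Whether an NMS block device tolerates closing the descriptor while the mapping is live is a property of the driver; a cautious client could simply keep the descriptor open instead.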

The controlling process for an NMS unit also has access to an event stream generated by the NMS subsystem in the kernel. The purpose of this event stream is to inform the controlling process of events within the system that it might need to know about in order to perform active memory service, to monitor paging performance, or to debug. The event stream consists of a sequence of packets, each of which contains a type, a length, and a timestamp, possibly followed by other data that depends on the type. Examples of events that would be reported in this way are a page fault taken by a process at a virtual address that is mapped to an NMS unit, and the arrival of a page of data from an NMS server over the high-speed network.

The controlling process reads the event stream using the read() system call on the control device. It is responsible for reading the stream frequently enough that the kernel does not have to discard events to avoid a buffer overrun. The controlling process can also perform ioctl() calls on the control device to set filters that limit the types of events that the kernel will send to the event stream.
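
To make the packet structure concrete, here is a minimal C sketch of how a controlling process might pull packets out of a buffer filled by read() on the control device. The layout is an assumption: the document says only that each packet carries a type, a length, a timestamp, and possibly type-dependent data. The field widths, the field order, the interpretation of the length as the size of the trailing data, and the event type values are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical packet header; fields are in host byte order. */
struct nms_event_hdr {
    uint32_t type;      /* event type */
    uint32_t length;    /* assumed: bytes of type-specific data that follow */
    uint64_t timestamp; /* units not specified by this document */
};

/* Hypothetical event type values for the two examples in the text. */
enum {
    NMS_EV_PAGE_FAULT   = 1, /* fault at an address mapped to an NMS unit */
    NMS_EV_PAGE_ARRIVAL = 2  /* page of data arrived from an NMS server */
};

/*
 * Decode the packet at the start of buf, which holds avail valid bytes.
 * On success, fills *hdr, points *payload at the type-specific data, and
 * returns the total number of bytes consumed. Returns 0 if buf does not
 * yet contain a complete packet, in which case the caller should read more.
 */
size_t nms_next_event(const unsigned char *buf, size_t avail,
                      struct nms_event_hdr *hdr,
                      const unsigned char **payload)
{
    assert(buf != NULL && hdr != NULL && payload != NULL);

    if (avail < sizeof *hdr)
        return 0; /* header incomplete */

    memcpy(hdr, buf, sizeof *hdr); /* memcpy avoids unaligned access */

    if (hdr->length > avail - sizeof *hdr)
        return 0; /* payload incomplete */

    *payload = buf + sizeof *hdr;
    return sizeof *hdr + hdr->length;
}
```

A reader loop would call read() on the control device into a buffer, call nms_next_event() repeatedly while it returns a nonzero count, and carry any trailing partial packet over to the next read(). Draining the descriptor promptly in this way is what keeps the kernel from having to discard events.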

(A link to a more detailed description of the various operations that can be made by the controlling process on the control device should go here. Also, a list of all events that the controlling process might elect to receive should go here.)

System Internals

The overall organization of the network memory server is shown below:

The NMS system has the following components:


Gene Stark
Last modified: Tue Jul 23 09:13:16 EDT 2002