A network memory server (NMS) is a device that provides clients with access to a large amount of RAM via a fast network memory paging service, in analogy to the way in which a network file server provides clients with access to a large amount of disk storage. The ready availability of high-speed (>1Gb/sec), low-latency (<10us) networking hardware now makes it feasible to use commodity components to construct a network memory server that can offer a memory paging service two orders of magnitude faster than paging to and from local disks. Such a server could provide a seemingly "infinite" memory resource for memory-hungry applications on client systems, in such a way that the client applications can execute nearly as fast using the network memory server as they would if all the memory were local to the client.
This document describes the design of a prototype NMS that we intend to build. The target platform for the prototype will be a cluster of 64-bit Alpha 21164-based PC-class machines, each equipped with at least 1GB of RAM and approximately 10GB of disk backing store, and connected by a high-speed (1.2Gb/sec) Myrinet interconnect dedicated to the network paging function. Each machine will also be equipped with a standard 100Mb/sec Fast Ethernet interface for general-purpose communication using IP. Server machines in the cluster will run the Red Hat Linux operating system. Initially, client machines will also run Red Hat Linux, though we have attempted to make the design flexible enough to permit clients to run other Unix systems.
The system's functionality will be logically partitioned between clients and servers. Clients execute application programs that request virtual memory resources from the NMS and issue paging requests to it. Servers manage physical memory resources, allocate virtual memory resources to clients, and respond to paging requests generated by clients. A single physical node of the NMS may act as a dedicated client, as a dedicated server, or as both. However, we anticipate that in normal use, machines will be dedicated either to the server role or to the client role.
A key to good performance from the NMS will be extremely low-latency paging over the network. As a design target for the prototype NMS, we hope eventually to obtain average paging latencies of no more than 200us, measured from the time a pagein request is passed to the NMS subsystem by the page fault handler on a client to the time the 8K-byte page is made available to the page fault handler. In our design, we have avoided introducing major sources of latency at each stage of the critical path followed by a pagein request: (1) from the time the request leaves the page fault handler until it is forwarded to a server via the high-speed network; (2) from the time the request is received by a server system until the corresponding page of data is placed on the wire; (3) from the time the data page is received by the client system until the page becomes available to the page fault handler. In particular, in the normal case this critical path is executed entirely in the bottom half of the kernel and does not require any context switches.
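To put the 200us target in perspective, the following back-of-the-envelope sketch computes how long an 8K-byte page occupies the 1.2Gb/sec link. It assumes that "8K" means 8192 bytes and that "Gb" means 10^9 bits, and it ignores packet headers and protocol overhead; the function name wire_time_us is illustrative only.

```python
# Rough serialization time for one page on the dedicated interconnect.
# Figures from the text: 8K-byte page, 1.2Gb/sec link, 200us latency target.
# Assumptions: 8K = 8192 bytes, 1Gb = 10**9 bits, no header or protocol overhead.

PAGE_BYTES = 8 * 1024
LINK_BITS_PER_SEC = 1.2e9
TARGET_LATENCY_US = 200.0


def wire_time_us(nbytes: int, bits_per_sec: float) -> float:
    """Return the time in microseconds to clock nbytes onto the link."""
    return nbytes * 8 / bits_per_sec * 1e6


page_wire_us = wire_time_us(PAGE_BYTES, LINK_BITS_PER_SEC)
remaining_us = TARGET_LATENCY_US - page_wire_us

print(f"wire time for one page: {page_wire_us:.1f} us")
print(f"budget left for all other stages: {remaining_us:.1f} us")
```

Under these assumptions the wire time alone consumes roughly a quarter of the budget, which suggests why the remaining stages of the critical path need to avoid context switches and extra copies.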
Our design also attempts to make no copies of the data page in transit from the server to the client, other than the DMA copies necessary to transfer the data from RAM on the server to the NIC for transmission, and from the NIC on the receiver into RAM. We are able to achieve this "zero copy" goal under Linux, assuming that NMS-aware client applications access the facility by using explicit mmap() calls to map NMS memory into their address spaces, and assuming a 4K-byte page size on client machines. For clients with 8K-byte page sizes, a "50% copy" is needed, because the 8K packets must be fragmented to meet the MTU limitations of the stock Myrinet code provided by Myricom. With modifications to the Myrinet Control Program, the MTU can be increased, so that "zero copy" can also be achieved for 8K-byte page sizes. If the NMS system is used in a naive way by a client system as "just another system swap device", then "one copy" is necessary for both 4K-byte and 8K-byte page sizes, unless very intrusive changes are made to the Linux memory management code.
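The fragmentation step can be pictured with the minimal sketch below. The 4096-byte payload limit is a hypothetical value chosen for the example, not the actual Myrinet MTU, and fragment_page and reassemble are illustrative names rather than part of the design. On the sending side the fragments are memoryview slices, so fragmenting copies no page data; the receiving-side join does copy, standing in for the reassembly cost that a larger MTU would avoid.

```python
from typing import Iterable, List

PAGE_BYTES = 8 * 1024
MAX_PAYLOAD = 4096  # hypothetical per-packet payload limit, for illustration only


def fragment_page(page: bytes, max_payload: int) -> List[memoryview]:
    """Split a page into payload-sized slices without copying the data."""
    if max_payload <= 0:
        raise ValueError("max_payload must be positive")
    view = memoryview(page)
    return [view[off:off + max_payload] for off in range(0, len(view), max_payload)]


def reassemble(fragments: Iterable[memoryview]) -> bytes:
    """Join fragments back into one contiguous page (this step copies)."""
    return b"".join(fragments)


page = bytes(range(256)) * (PAGE_BYTES // 256)
fragments = fragment_page(page, MAX_PAYLOAD)
restored = reassemble(fragments)

print(len(fragments), "fragments;", "round trip ok:", restored == page)
```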
In spite of the attention we have paid to reducing obvious latencies in the design, we do not expect the initial prototype to achieve the 200us latency target for an 8K pagein. Achieving that target will probably require additional latency-reduction techniques such as the "cut-through delivery" used in the Duke Trapeze system. We expect to be able to incorporate such techniques into our design, but are not targeting them for the initial prototype because they will require modifications to the Myrinet Control Program firmware. Our strategy is first to build a robust, working platform, and then to introduce such modifications to reduce latencies.
An important goal of the project is to support research into so-called active memory services, in which customized memory services are designed to take advantage of application characteristics to hide paging latency and improve performance. The NMS prototype will support this goal by providing flexible mechanisms for tracking memory-related events associated with client applications, and for prepaging data from servers to clients under the control of active processing daemons that can run on any node of the network.
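As one illustration of what an active processing daemon might do with tracked events, the sketch below watches a stream of page-fault events and proposes pages to prepage when it sees a repeated stride. The stride policy, the class name StridePrefetcher, and its on_fault method are all assumptions made for this example; the design does not prescribe any particular prepaging policy.

```python
from typing import List, Optional


class StridePrefetcher:
    """Hypothetical prepaging policy: predict pages along a repeated stride."""

    def __init__(self, depth: int = 2) -> None:
        self.depth = depth                      # how many pages ahead to propose
        self.last_page: Optional[int] = None    # page number of the previous fault
        self.last_stride: Optional[int] = None  # distance between the last two faults

    def on_fault(self, page: int) -> List[int]:
        """Record a fault and return page numbers worth prepaging, if any."""
        predictions: List[int] = []
        if self.last_page is not None:
            stride = page - self.last_page
            # Only predict once the same nonzero stride has been seen twice in a row.
            if stride != 0 and stride == self.last_stride:
                predictions = [page + stride * i for i in range(1, self.depth + 1)]
            self.last_stride = stride
        self.last_page = page
        return predictions


prefetcher = StridePrefetcher(depth=2)
trace = [10, 11, 12, 20]
proposals = [prefetcher.on_fault(p) for p in trace]
print(proposals)
```

A daemon built along these lines would hand the proposed page numbers to whatever prepaging mechanism the NMS exposes; an application-specific service could replace the stride rule with knowledge of its own access pattern.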
Another important goal of our design is to make a robust, reliable system that can be used for "production" applications that might run for a long time. Achieving this goal requires that our design include mechanisms for handling server crashes and media failures, and that there be provisions for shutting down and bringing up servers without interrupting any applications that might be executing. In our design, replication of data among servers is used for these purposes. Our system will permit client applications to specify the type and degree of replication to be used in paging out data to servers. By mirroring data on more than one server, a client application can continue to execute when a server crashes or is shut down. In addition, the NMS subsystems on clients will track the status of servers to which data has been sent, and when a down server is detected, the client will automatically and transparently restore the specified degree of replication by making new copies of the affected data on other servers.
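The client-side bookkeeping described above can be sketched as follows. This is a minimal illustration, not the planned implementation: the class ReplicaTracker, its methods, and the policy of choosing replacement servers in sorted name order are all hypothetical, and the sketch only tracks where copies should live without moving any data.

```python
from typing import Dict, List, Set


class ReplicaTracker:
    """Hypothetical client-side record of which servers hold each page."""

    def __init__(self, servers: List[str], degree: int) -> None:
        if degree < 1 or degree > len(servers):
            raise ValueError("degree must be between 1 and the number of servers")
        self.up: Set[str] = set(servers)          # servers believed to be alive
        self.degree = degree                      # requested number of copies
        self.placement: Dict[int, Set[str]] = {}  # page id -> servers holding it
        self.lost: Set[int] = set()               # pages with no surviving copy

    def _pick(self, count: int, exclude: Set[str]) -> Set[str]:
        """Choose `count` live servers not in `exclude` (naive sorted-order policy)."""
        candidates = sorted(self.up - exclude)
        if len(candidates) < count:
            raise RuntimeError("not enough live servers to meet replication degree")
        return set(candidates[:count])

    def page_out(self, page_id: int) -> Set[str]:
        """Record a pageout, mirroring the page on `degree` servers."""
        targets = self._pick(self.degree, exclude=set())
        self.placement[page_id] = targets
        return targets

    def server_down(self, server: str) -> Dict[int, Set[str]]:
        """Mark a server as down and restore the replication degree.

        Returns, for each affected page, the servers that need a new copy.
        """
        self.up.discard(server)
        repaired: Dict[int, Set[str]] = {}
        for page_id, holders in self.placement.items():
            if server not in holders:
                continue
            holders.discard(server)
            if not holders:
                self.lost.add(page_id)  # no surviving replica to copy from
                continue
            new_targets = self._pick(self.degree - len(holders), exclude=holders)
            holders |= new_targets
            repaired[page_id] = new_targets
        return repaired


tracker = ReplicaTracker(["s1", "s2", "s3"], degree=2)
initial = set(tracker.page_out(7))
repaired = tracker.server_down("s1")
print(initial, repaired, tracker.placement[7])
```

With a replication degree of 1, a server failure leaves no copy to restore from, which the sketch records in `lost`; this is why mirroring on more than one server is what allows an application to keep running through a crash.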