The following sections give an overview of various facets of the NMS system operation.
The NMS system makes use of two distinct types of communication channels: high-speed datagram communication over Myrinet using custom protocols, and low-speed stream and datagram communication over Ethernet using IP. All pagein and pageout requests and acknowledgements are sent via high-speed datagram communication. Besides the data being paged between clients and servers, these datagrams may contain additional piggybacked information, such as server load information, which must be updated frequently. Low-speed stream connections (using TCP/IP) are maintained by each client between itself and each server it is actively using to store NMS data. If the stream connection between a client and server fails, each will assume that the other has crashed. A client who believes that a server has crashed will initiate a procedure for restoring the required degree of replication of any data for which a replica might have been lost. This will be done by paging in the affected data from other servers that are not down, and then paging it out again to a new server. A server who believes that a client has crashed will take steps to flush all data stored on behalf of that client, to avoid a storage leak. Low-speed stream connections are also used by clients to instruct servers explicitly to free data pages stored on their behalf. Finally, low-speed datagram communications using UDP/IP multicast are used by active processing daemons to communicate information about events occurring on one host to other hosts where this information is needed.
The high-speed datagram service provides two different modes of transmission: reliable and unreliable. In unreliable transmission, datagrams may be lost or duplicated. In reliable transmission, sequence numbers and a timeout and retransmission scheme is used to guarantee that a datagram will be delivered exactly once. (I am not sure that the reliable tranmission mode is needed any more.)
Client and server hosts in the NMS system identify each other using globally unique two-byte (16-bit) host identifiers. Each client and server maintains a table that maps host identifiers to Myrinet and IP addresses. In the initial versions of the system, this table will be statically configured, but in later versions there will be a provision for dynamic updating of these tables using an ARP-like protocol.
NMS servers provide storage for data pages. The size of a page is not a predetermined constant; rather the size of each particular page stored on a server is determined by an NMS client at the time that page is transmitted to the server for storage. Typically, a client will transmit data in units of its own native page size, usually 4K bytes (Intel) or 8K bytes (Alpha). Each page of data stored on an NMS server is identified by a an eight-byte global page identifier (GPI), which is constructed as the concatenation of a two-byte client host ID, a two-byte NMS virtual unit number, and a four-byte index that identifies the page within the virtual unit, with a granularity equal to the native page size on that particular client. Servers maintain a mapping from GPIs to their own local storage indices, which allows them to retrieve a page by GPI. Clients always use GPIs to identify data pages stored on servers; the advantage of this scheme is that any particular data page will be identified by the same GPI on any server to which it is sent. Thus, a client does not need to maintain any mappings for locating data pages, other than a page location table (PLT) that takes (virtual unit, page index) information to an identifier for a server replica group, and a replica group table (RGT) that maps server replica group IDs to the list of host identifiers for the individual servers in the replica group. The tables required to perform these mappings are sufficiently small that they can be kept resident on the client systems. For example, the PLT on a client with 4K-byte native page size will require RAM equal to 1/2048 of the total NMS memory being managed.
NMS clients wishing to page out data to servers simply select a single server (no replication) or group of servers (replicated case) and send a pageout request containing the data to the selected servers. The pageout requests are sent "blindly", in the sense that the client does not know for sure in advance whether the selected servers have sufficient space to satisfy the request, or even whether the servers currently have network buffers available in which to receive the data. However, clients track current server loads at relatively frequent intervals, and use this information to scatter requests among servers in order to achieve load balancing. A server that is unable to handle a request due to resource limitations is permitted simply to drop the request, in which case the client will have to retransmit the request to a different server. As dropping of requests will cause serious degradation of performance, it is important that servers maintain a sufficient supply of network buffers for receiving pageout requests, and that the load balancing strategy used by clients does a good job of spreading the load out among the available servers.
A pageout request generated by a client will contain the GPI of the data page to be stored. The server will store the data by entering a (GPI, data) mapping into its cache. If there was a previous mapping for the indicated GPI, the associated data storage area is overwritten with the new data. If there was no previous mapping for the GPI, new storage is allocated. Transmission of a pageout request from a client is done by multicasting the pageout request to all members of a replica group as an unreliable datagram. Since pageout requests are unreliable, the client does not know for sure whether a server to whom it has sent a pageout request has in fact stored the data, however since entering a (GPI, data) mapping is an idempotent operation, the client is free to timeout and retransmit the request if it likes. Since the client must eventually be sure of having sent the page to a sufficient number of servers to satisfy replication requirements, once a server has stored data sent by a client, the server is required to send an acknowledgement back to the client. This acknowledgement will also be sent as an unreliable datagram. To avoid the possibility of confusing acknowledgements for old pageout requests with newer ones, each pageout request issued by a client will contain a sequence number, which will be returned by the server in the acknowledgement message to permit the matching up of requests with acknowledgements.
Once a server has received a pageout request from a client and saved the resulting (GPI, data) pair, it will retain this information indefinitely until one of the following three situations occurs: (1) the server receives explicit instructions from the client to delete either this particular pair, or more generally any (GPI, data) pair for this particular client or for a particular virtual unit on this particular client; (2) the server discovers that the NMS subsystem on the client has been reinitialized (perhaps by a crash or reboot); (3) the server determines that the client is down or incommunicado. Case (1) is the usual case. When an NMS virtual unit on a client is deallocated, the client will issue to any server storing data on behalf of that client and virtual unit specific instructions to delete all such data. These instructions are sent via reliable stream communication, so that the client can be sure that the server has flushed all data associated with a virtual unit that has been deallocated, before the client reallocates the virtual unit and begins generating additional, unrelated, paging traffic to it.
When a client has not received an acknowledgement for a pageout request to a particular server, it should instruct the server to delete the particular page of data since the client can never be sure whether that server has actually stored the data or not. Hence a page deletion message is necessary to avoid storage leaks on the server. After sending the page deletion message, client removes the server from the list of servers on which the page is replicated. If a server receives deletion instructions for a GPI for which it is not currently storing a mapping, it simply ignores the instructions.
The NMS system uses replication of data on multiple servers in order to tolerate server crashes, network failure, and media errors. In addition, replication makes it possible to shut down and bring up servers without interrupting the execution of long-running client applications; something which is an essential feature of a system intended for "production" use. The initial version of the system will probably support two-way replication to guard against failure or shutdown of any single server. We expect that subsequent versions of the system will support application-configurable N-way replication, as well as more sophisticated replication methods. In all cases, selection of the degree and method of replication, as well as the responsibility for carrying out the replication algorithm, is the sole responsibility of the client systems.
When pageout is required, the CSKM will select a replica group of servers to receive the pageout request. The selection of a replica group will take into account the current load on each of the available servers, in order to avoid overloading any particular server or group of servers. The pageout request will be multicast to all the servers in the replica group. This is possible because all servers will use the same GPI to identify the data. Servers who successfully enter the (GPI, data) pair in their cache will acknowledge its receipt. If an acknowledgement is not forthcoming from a particular server, the client will timeout and either retransmit the request to the same server, or else select another replica group that contains the servers from which acknowledgements have been received, and which does not contain the servers who didn't respond, and re-issue the request to the new servers. Once the client has received acknowledgements from all members of the selected replica group, it can regard the pageout as complete.
The choice of replication strategy and degree of replication will configurable on a per-virtual-unit basis by the controlling process for an NMS virtual unit. The actual implementation of the replication strategy is carried out by the CSKM as described above. Under certain circumstances, due either to the crash of a server or the controlling process requesting an increased degree of replication, the number of servers known to have copies of a particular data page may drop below the requested degree of replication for that page. When this situation is detected by the CSKM, it restores the degree of replication of the affected page by simply performing a pagein operation to retrieve one of the copies of the page that still exists, and then performing an immediate pageout operation on that page to achieve the proper number of replicas.
The central aspect of system operation is handling page faults on
client systems, in cases where the retrieval of data from servers is
required.
Occurrence of a page fault on a client system causes a trap to
the page fault handler in the usual way. The actions taken by the
page fault handler depend on the type of mapping associated with
the faulting virtual address.
For a client running Linux, there are two distinct situations to consider.
In the first situation, the faulting address is covered by a mapping
(vm_area_struct) that was created by using
mmap() to map an NMS virtual unit.
In this case, the page fault handler invokes the appropriate
method (swapin() or nopage())
n the vm_operations_struct object provided
by the NMS virtual unit, to request the CSKM to retrieve
the required data. The page fault handler provides the CSKM
with the virtual address of the fault, and other information
about the faulting process can be obtained from the current context.
In this case, the CSKM is free to return the data in
whatever page of memory it likes.
In the second type of fault situation, the faulting address is one that is backed by the normal system swap mechanisms. In this case, the page fault handler invokes the block I/O subsystem of Linux to check for the desired data in the buffer cache and to fetch the data if it is not resident. If an NMS virtual unit is being used as the swap device, then ultimately the buffer cache will request retrieval of the data by the CSKM. The information available to the CSKM is the NMS virtual unit number, the offset within that unit of the desired data, and the address of a buffer into which the data is to be placed. The virtual address causing the fault is not readily available, nor is the CSKM free to place the data somewhere other than the specified buffer. In addition, requests coming from the buffer cache are for 1K byte blocks, not full pages. The CSKM caches recently retrieved pages for a short period to avoid re-requesting the page several times in quick succession as each 1K-byte piece of the page is requested by the buffer cache. (QUESTION: Who figures out that all the 1K-byte pieces have been retrieved, and that it is OK to tell the servers to release the data?)
Another Linux feature that is relevant to page fault handling is the fact that when pages are paged out to a system swap device, Linux stores a swap device/block number pair that identifies the location of the paged-out page directly in the PTE for that page. On the Intel architecture, PTEs are 32 bits. One bit is reserved by the hardware for the validity flag, and the remaining 31 bits are partitioned into 24 bits for the block number and 7 bits to identify the swap device. On the Alpha, even though PTEs are 64 bits, the same fixed format is used, so that there are still only 24 bits for the block number. As swap blocks are 1K bytes long, the maximum size of a system swap device is limited to 16GB. In fact, there is currently a smaller limit of 2GB imposed on the size of a system swap device, based on the size of a bitmap used to record the allocation status of the blocks on the swap device.
In either of the two types of page-fault-handling situations described above, the CSKM is able to extract from the context the NMS virtual unit number from which the data is to be retrieved, as well as the offset within this virtual unit at which the data exists. This offset can be turned into a page index, resulting in a (virtual unit, page index) pair. The client then concatenates this pair with its own host identifier to produce the GPI of the page to be retrieved. The client checks its own small cache to see if perhaps the page was recently pushed from a server and can be returned without any I/O. If not, then it then consults its memory resident PLT to obtain the identifier of the replica group that was used when the data was last paged out. Another table is consulted to map the replica group identifiers to the list of host identifiers for the servers making up the replica group. One of the servers in the replica group is selected to receive the pagein request, and the request is transmitted over the high-speed network to that server. If the data does not arrive within a reasonable period, the client is free to reissue the request, either to the same server or to another server in the replica group.
When a server response to a pagein request is received by a client,
it is queued by the high-speed network module, which then arranges
for a bottom-half handler in the CSKM to be invoked to process the
message.
The high-speed network module arranges things so that the data is
DMA'ed directly into a page-size buffer by the Myrinet hardware.
(Actually, limitations on datagram size imposed by the Myricom-supplied
Myrinet Control Program (MCP) make this possible only for 4K-byte pages.
For 8K-byte pages to be DMA'ed directly into an 8K-byte page can be
done, but it requires custom modifications to the MCP.
So, in our initial prototype, we will have to transmit each 8K page
as two Myrinet datagrams, and to pay for one 4K-byte copy operation
to recombine the two halves of a page.)
The bottom-half handler that processes the message examines it to
determine whether it represents data solicited by a client
process which is now blocked in the page fault handler, or whether
it represents unsolicited data pushed as a result of a
prepaging request by an active processing daemon at the server.
In case of unsolicited data, the data is simply entered in the
CSKMs cache for possible later use.
In case of solicited data, identifying information that was sent
with the initial request to the server and returned by the server
with the response is used to identify the process awaiting the
data, and whether the request came via the buffer cache or
through a vm_operations_struct associated with a
memory-mapped region.
If the original request came through the buffer cache, then it was fragmented by the buffer cache module into 1K-byte blocks. The response that has just arrived is for the first of these 1K fragments. In this case, it is necessary to copy 1K bytes of data, from the just-received page into the particular target location specified by the buffer cache. This means that, in our prototype system, requests coming through the buffer cache will incur the overhead of one copy, from the receive buffer into the buffer cache. The complete page is retained for a short while in the CSKMs cache, so that the remaining fragments can be satisfied from the cache rather than requiring another communication with the server. As there is a chance the page will have to be expired from the CSKM cache before it can be determined with certainty that all fragments have been requested, when an NMS virtual unit is used as a swap device, either a page must be marked "dirty" when it is first accepted into the cache, or else servers must retain copies of the page for as long as the virtual unit remains active.
If the original request came via a memory-mapped region, then since the
swapin operation returns the new PTE to be entered in the
page tables for the faulting process, the page in which the data was
received can be mapped directly, without any copy being required.
Thus, on an Intel client, with a 4K-byte page size, our prototype will
be able to achieve zero copy overhead on the client side.
On an Alpha client, with an 8K-byte page size, our prototype will
have to copy half of each page, due to the limitations of the Myrinet
Control Program discussed above. With a modified MCP, we should be
able achieve the zero copy goal on the Alpha as well.
As already indicated, the CSKM maintains a cache of pages that were recently received from a server. Such pages arrived either in response to an explicit pagein request to the server, or else arrived unsolicited, as a result of a prepaging request made at the server. This cache has to be large enough that a solicited page requested via the buffer cache can remain long enough that the faulting process can copy out all its 1K-byte blocks. In addition, the size of this cache will also impact the effectiveness of "active processing," prepaging strategies. We expect that a complex replacement policy will not be necessary for the CSKM cache. One simple mechanism would be to maintain this cache as a simple LRU list. When free pages are required, the oldest pages are examined, "laundered" by paging them out to servers if they are dirty (if in fact it is possible for dirty pages to exist in this cache, it may well be that it is not), and then added to the free list. In contrast to this global replacement policy, a more useful policy from the point of view of active processing might be a local replacement policy that segregates the pages in the CSKM cache by process, so that processes that perform highly aggressive prepaging do not negatively impact the performance of more conservative processes. In any case, we expect that only a small number of simple replacement policies need be provided, and that these policies would be easily implemented in the kernel.
The handling of pagein requests at the server is also a critical aspect of system operation. When a pagein request arrives at a server, the high-speed network module queues this request and schedules a bottom-half handler in the SSKM to process it. When the bottom-half handler runs, it examines the request, and extracts the GPI of the requested page for use as a lookup key. First, the SSCM is consulted in order to determine quickly whether a data page associated with the specified GPI is currently resident in the cache. The SSCM performs this lookup by hashing the GPI and indexing into a hash table that contains one entry for each (GPI, data) pair with the data currently resident in main memory. If a resident page is found in the SSCM hash table, it is queued immediately for transmission back to the client. This sequence of operations is the normal case, and as it runs completely at interrupt level and no copies are made of the data, very little latency should be incurred.
Note that the amount of storage required for the SSCM hash table is large, but not excessive. For example, 8GB of RAM divided into 8KB pages would mean 1M pages. A hash table of 2M entries would give a comfortable loading factor of 50%. Assuming each entry in the table contains an 8-byte pointer means that the hash table would be 16MB. Each hash entry would require 8 bytes for the GPI, 8 or fewer bytes to indicate the physical address of the data if it is resident, and 8 bytes for a link to the next element of the same bucket, for a total of 24 bytes. The total amount of storage required by the table when 8GB of RAM is fully utilized is thus 16MB + (24*2MB) = 64MB.
If the SSCM reports that there is a data page associated with the specified GPI, but that it is currently located on secondary storage, then I/O is required to bring the data into main memory on the server, before it can be transmitted back to the client. Since substantial latency is now unavoidable, no further attempt is made to handle this request within the kernel, and the request is placed in the event stream destined for processing by the user-level SSD. The SSD maintains data structures with a complete record of the GPIs for which the server currently has data, together with information about the location of the data, either cached in main memory, or else swapped out to secondary storage. Unlike the SSCMs modest-sized hash table, whose size is dependent on the size of the main memory, the SSDs data structures depend on the total amount of virtual storage supported by the server. This might be eight or ten times larger than the amount of RAM. One way to maintain this table would be as a B-tree mapping GPIs to the associated data. Supposing the server manages 64GB of backing store, a table of 8M entries would be required. We might expect such a database to consume several hundred megabytes of virtual memory on the sever system.
When the server receives a pagein request forwarded from the SSKM,
it creates a separate thread to handle the request. A separate thread
is used because in general I/O will be required to looking up the
GPI and to bring the data in from the backing store. Using multiple
threads would permit the SSD to handle many such requests concurrently.
Special ioctl() calls are provided by the SSKM to the SSD
for the purpose of transferring data between the cache and the backing
store, and for causing a response to be transmitted back to the
client once the requested data is resident in main memory.
Paging out of NMS data from the client is somewhat different than paging in, for the following reasons: (1) Paging in is done by a client application process from within the page fault handler, whereas paging out is done by the kernel swap daemon. The kernel swap daemon generally will have access to less information about the way in which a memory page is used than the client application will. For example, the kernel swap daemon does not have easy access to the virtual address at which the page is mapped by a given process. (2) Although pageout traffic may consume a considerable amount of communications bandwidth, it differs from pagein in the sense that it is not a latency-critical operation. Thus, for pageout, we need not be overly concerned with latency-reduction techniques, as long as we can deliver the required throughput.
When the kernel swap daemon wants to evict a page that is backed
by an NMS virtual unit, there are two distinct situations, just
as there are in the pagein case:
(1) The NMS virtual unit is part of the normal system swap pool,
in which case the pageout request will arrive at the CSKM via
the system buffer cache.
(2) The NMS virtual unit is being used to back a region of virtual
address space that was mapped using the mmap() call.
In this case, the kernel swap daemon makes the pageout request
by calling the swapout() operation of the
associated vm_operations_struct.
Though the kernel swap daemon will in general not have easy access
to the virtual address(es) at which client process(es) have the
page mapped, in both case (1) and case (2) when the requests arrives
at the CSKM it will be possible to extract from the arguments to
the call the NMS virtual unit number and offset information,
so that the CSKM will be able to construct the GPI for the page.
Once the CSKM has constructed the GPI for the data page, it selects a server replica group to be used for pageout. The replica group is chosen to achieve the desired degree of replication of the data page, and also to achieve load balancing among the available servers. Load information is updated frequently by the CSKM as information arrives piggybacked on paging datagrams. A replica group identifier (RGI) is constructed to identify the particular replica group chosen. Since the RGI must be stored in the PLT, which must be maintained resident in kernel memory, it is important for the RGI to be small. We use a 16-bit RGI, the first byte of which identifies one of 256 possible banks of servers, where each bank can consist of up to eight servers, and the second byte of which serves as a bitmap to identify which of the servers in the bank have acknowledged storing copies of the page. The number of active banks and their membership is maintained by the CSD. Banks are permitted to have overlapping membership. A small system with eight servers would probably use just one bank. A somewhat larger system with, say, 64 servers, might define a number of overlapping banks to make it more likely that a bank can always be found that has a sufficiently large set of its members currently up. With completely disjoint banks, up to 2048 servers can be accomodated.
Once the replica group for pageout has been chosen, the CSKM records its 16-bit replica group ID in the PLT so that the page can be retrieved later. The pageout request is then multicast as an unreliable datagram over the high-speed network to all servers in the replica group. The CSKM then collects acknowledgements until all servers in the replica group have responded or until a timeout occurs. In the normal case, acknowledgements will be received from all servers, at which point the pageout operation is complete and the local copy of the data can be safely deleted. If a timeout occurs before acknowledgements have been received from all servers in a replica group, then the CSKM selects replacements for them from the same bank. The pageout request is then resent to the replacement servers. The nonresponding servers are deleted from the bitmap portion of the replica group ID, and events are queued for the CSD, which inform the CSD about the nonresponding servers. Since the servers might just be slow, when the CSD receives these events, it will request, over the TCP connection, that they delete any copies of the page that they happened to receive. A timestamp is used to identify the particular version of the page to be delete, to avoid race conditions.
If it is impossible to select sufficiently many replacement servers from the original bank to achieve the required degree of replication, then that bank is abandoned, events are queued for the CSD so that it can instruct servers in the bank to delete copies of the page, and the pageout is restarted with a new bank.
Basic ideas here: "load" on servers is measured by the number of free buffers available for receiving pageout requests. Servers piggyback load information on other messages, and also send periodic updates. According to Michael Bender, there are recent theoretical results that show that very good load balancing can be achieved if, for example, a set of candidate servers is selected at random, and then the least-loaded among the candidates are chosen to be the actual servers to receive the service request.
Every client in the system tracks the up/down status of all servers it has used or is considering using for paging operations. Every server in the system tracks the status of all clients for whom it has stored data. The correct functioning of the NMS system does not depend on there being any sort of agreement between hosts in the system as to which clients and servers are up or down.
Clients and servers use the following mechanism for tracking each other's status. When a client begins considering a server for use in paging operations, it initiates a TCP connection to this server over the Ethernet. This connection supplies a context for all further communication between the client and the server. As long as the connection exists, the client and server regard each other as "up". If either end detects that the connection is broken, then it determines that the other host is "down".
When a client or server detects that its TCP connection with a server is broken, it closes its own end of the connection and flags the other host as "down". When a client flags a server as "down", it unilaterally abandons all data that might have been stored on that server. This action may trigger paging activity to other servers in order to restore the required degree of replication of data that the "down" server had stored. When a server flags a client as down, it takes steps to flush all data it has stored on behalf of that client, to avoid storage leaks. The client may attempt to re-establish contact with the server, however when such new contact is made, both client and server regard each other as "new acquaintances", as if both had just contacted each other for the first time.
When a client or server at one end of a TCP connection declares the host at the other end to be "down", care must be taken to cease processing of any further paging requests or other messages that were sent in the context of the now-terminated session. In order to arrange this, when a TCP connection between a client and server is established, the two negotiate a session ID which is guaranteed to be distinct from any session ID previously used by either of the two. For example, this could be done by having the client and server exchange 64-bit values representing their individual idea of the current time of day, and then choosing the maximum of these values as the new session ID. The current session ID is then placed in every subsequent message issued by either of the two hosts, regardless of the destination. Messages arriving at a receiving host that contain a session ID older than that negotiated when the current TCP connection with the sender was established, are simply dropped. Messages arriving with newer session IDs are accepted. This scheme allows all messages, both point-to-point as well as multicast, that were sent in the context of the current TCP connection between the sender and receiver, to be accepted, and ensures that all messages sent in the context of a previous TCP connection will be dropped.
Since it is the CSD and the SSD who create maintain the TCP connection
between client and server, they are the entities responsible for
detecting when this connection has been broken and declaring the
other end down. The CSD and SSD inform the CSKM and SSKM about
changes in the up/down status of a peer, as well as changes in
session IDs, by ioctl() calls.
Active processing daemons receive an event stream from the kernel
of the host on which they run. They can set filters so that they
are not overloaded with lots of events they don't care about.
They communication with active processing daemons on other hosts
using IP multicast communication over Ethernet.
An active processing daemon that is the controlling process for an
NMS virtual unit can issue ioctl() calls on that
unit to request prepaging and set configuration options.