SAMSON Implementation Notes

These notes describe how the existing prototype implementation of the NMS system differs from the description in the Spring 1999 design document.

Client-side cache:
This is the biggest difference between the design document and the prototype implementation.

The original design of the client-side kernel module (CSKM) included "a small cache of pages that were recently prepaged from servers and which it is expected might be used again soon." This cache was originally envisioned as a relatively simple mechanism for retaining pages for a short period of time. In fact, this client-side cache portion of the CSKM has become the most central and complex portion of the entire system. The reason is that the original design did not adequately treat the issue of coherence of NMS pages. It was noted in the original design that race conditions affecting coherence were possible, but concrete mechanisms were not proposed for dealing with them.

In the prototype implementation, the client-side cache enforces coherence by serving as the central clearinghouse for pages in transit between the client and server systems. When the client initiates a paging operation, an entry is placed in the cache. Responses from server systems look up the corresponding entry in the cache and use the information recorded there to alert the waiting client process. The client-side cache is also the main synchronization mechanism used by client processes waiting for responses from servers. When a client process wishes to block pending the arrival of responses from servers, it calls a routine provided by the client-side cache module, which spins for a short time hoping for the response to arrive quickly. If the response is delayed, exclusion on the CSKM is released, and the client process sleeps pending the arrival of the response using the usual kernel scheduling mechanism. No blocking occurs while exclusion is held on the CSKM other than via this mechanism.

The client-side cache has a timeout-based mechanism for expiring entries. When an entry is inserted into the cache, it is set to expire after a certain amount of time has gone by. The default timeout is set to 500ms. A "cleaner" task runs periodically (every 100ms) as a bottom-half handler. The cleaner is responsible for finding expired entries and alerting any client processes that are waiting for them. Synchronization issues are simplified by the stipulation that no pointers to cache entries are held outside the cache module, and that the cleaner task is the only entity that removes entries from the cache.
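
The expiry rules above can be illustrated with a small user-space sketch in Python. This is not the kernel code: the names ClientCache, CacheEntry, complete and clean are hypothetical, time is passed in explicitly so the behaviour is easy to follow, and "alerting a waiter" is reduced to appending to a list. Only the 500ms and 100ms figures and the two stipulations (callers never hold entry pointers, only the cleaner removes entries) come from the notes.

```python
from dataclasses import dataclass

DEFAULT_TIMEOUT_S = 0.5   # default entry lifetime (500ms, from the notes)
CLEANER_PERIOD_S = 0.1    # how often clean() would be scheduled (100ms, from the notes)


@dataclass
class CacheEntry:
    page_id: int
    expires_at: float
    done: bool = False    # set when a server response has arrived


class ClientCache:
    """Hypothetical model: callers refer to entries by page id only."""

    def __init__(self, timeout_s=DEFAULT_TIMEOUT_S):
        self.timeout_s = timeout_s
        self._entries = {}
        self.notified = []    # stand-in for waking a waiting client process

    def __len__(self):
        return len(self._entries)

    def insert(self, page_id, now):
        # Called when the client initiates a paging operation.
        self._entries[page_id] = CacheEntry(page_id, now + self.timeout_s)

    def complete(self, page_id):
        # Called when a server response arrives; the entry is NOT removed here.
        entry = self._entries.get(page_id)
        if entry is None:
            return False      # entry already expired and was cleaned
        entry.done = True
        self.notified.append((page_id, "response"))
        return True

    def clean(self, now):
        # The cleaner is the only code path that removes entries.
        expired = [pid for pid, e in self._entries.items() if e.expires_at <= now]
        for pid in expired:
            entry = self._entries.pop(pid)
            if not entry.done:
                self.notified.append((pid, "timeout"))
        return expired
```

Because completion never deletes an entry, a late response for an expired page simply finds nothing and returns False, which is one way to read the simplification the notes describe.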

Synchronization within the CSKM and SSKM:
The original design did not treat the issue of synchronization within the CSKM or SSKM. The prototype implementation handles synchronization by using a three-layered approach, described below.

The lowest synchronization layer is the interrupt-handling level, which is concerned with handling HSN interrupts. The sole function of the HSN interrupt routines is to take a packet that has just arrived and place it on a queue for subsequent processing by a task that runs as a bottom-half handler (software interrupt routine). Access to these message queues is synchronized by globally disabling interrupts for the duration of the operation. The CSKM and SSKM also maintain various pools of pre-allocated data structures, to avoid excessive latencies and potential deadlocks associated with calls to the normal kernel allocator. Those pools of data structures that are accessed by interrupt handlers are also synchronized by globally disabling interrupts.

The highest synchronization layer consists of processes in the top half of the kernel, executing in the NMS code. To make the synchronization issues understandable, processes executing in this layer arrange for exclusive access to the NMS data structures as long as they are actively executing (i.e. not waiting for a response from a server). This exclusion is achieved by "pausing" processing of NMS messages coming from the HSN. Pausing is an inexpensive operation that amounts to incrementing a counter and disabling the bottom-half handler that processes incoming messages. When a process in the top half needs to block, processing of NMS messages is "resumed". After a process in the top half has blocked and woken up again, it is necessary (as usual) to re-check important information such as the location of a page and the status of paging operations on a page, because the messages processed in the meantime might have changed this information.

At the middle synchronization layer is the bottom-half handler that is responsible for processing incoming messages from the HSN. When this bottom-half handler runs, it has exclusive access to the NMS data structures. This is analogous to a normal interrupt handler in a device driver, and the synchronization is achieved in a similar way: the bottom-half handler has priority over top-half activities, and the top half disables the bottom half while the former is actively executing in the NMS code.
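
The pause/resume relationship between the top half and the message-processing bottom half can be modelled in a few lines of Python. This is a single-threaded illustration, not the kernel mechanism: the class name MessageGate and its methods are hypothetical, and the assumption that pauses nest (the counter may exceed one) is ours; the notes only say that pausing increments a counter and disables the handler.

```python
from collections import deque


class MessageGate:
    """Hypothetical model of 'pausing' NMS message processing."""

    def __init__(self, handler):
        self._handler = handler      # processes one incoming message
        self._pause_depth = 0        # > 0 means the bottom half is disabled
        self._queue = deque()        # filled at "interrupt level"

    def enqueue(self, msg):
        # Interrupt level: only queue the packet, then let the bottom half try.
        self._queue.append(msg)
        self._run_bottom_half()

    def pause(self):
        # Top half entering NMS code: cheap, just bump the counter.
        self._pause_depth += 1

    def resume(self):
        # Top half about to block or leave: re-enable message processing.
        if self._pause_depth == 0:
            raise RuntimeError("resume() without matching pause()")
        self._pause_depth -= 1
        self._run_bottom_half()

    def _run_bottom_half(self):
        if self._pause_depth:
            return                   # top half holds exclusion; defer
        while self._queue:
            self._handler(self._queue.popleft())
```

Messages that arrive while the gate is paused stay queued and are drained on resume, which is why a process must re-check page state after it has blocked.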

This structure for synchronization in the NMS code was chosen to keep things simple and understandable. It is possible that optimum performance will not be achieved while the NMS client-side bottom-half handler is disabled any time a process is active in the top half of the NMS code. Once the system is functioning reliably, we can experiment with more complex synchronization schemes that relax this requirement.

Memory allocation issues:
At the time we did the original design, we did not realize the significant difficulties that would result if calls to the normal kernel memory allocator were made from within the CSKM code. The problem is that, by its nature, the NMS system inherently creates a severe memory shortage on the client-side system. Furthermore, the NMS system itself is involved in alleviating the shortage by paging data out over the network. This creates all kinds of nasty deadlock possibilities and performance problems.

Here is an example of one kind of serious problem that we observed with early versions of the implementation. A process takes a page fault and enters the NMS code. The process of paging the data in from a server requires memory allocation, and the process calls the kernel memory allocator. In Linux, when the memory allocator is called with insufficient available memory, pageout is attempted to free up some memory. As a result of the pageout, the NMS code is re-entered recursively. This creates either a deadlock or a difficult synchronization issue. The Linux allocator takes flags to prevent recursion, but then one runs the risk that NMS pagein operations will fail frequently and unpredictably due to the memory shortage, which is a normal byproduct of executing under NMS conditions.

We solved the allocation problems by maintaining private pools of pre-allocated memory for data structures critical to NMS operation. Note that simply setting up these pools doesn't immediately solve the problem, since on the client side there is a net flow of memory out of the NMS system as a result of pagein operations. Pageout operations do not compensate by releasing memory to the NMS system, because the pages that are cleaned by NMS pageout are freed not by the NMS system but rather at a higher level, either by a process in the top half that needed memory, or else by the kernel swap daemon. The way memory flows back into the NMS system is via allocation of pages in which to place data coming in over the high-speed network. However, under conditions of severe memory shortage, it is frequently impossible to allocate memory at interrupt level where the HSN code runs.

The solution we chose to the allocation problem is for processes entering the top half of the NMS code to "top off" the HSN free page pool before obtaining exclusion on the NMS data structures and proceeding with paging operations. A process in the top half will block at the entrance to the NMS pagein code until sufficient memory has been sequestered. This protocol tends naturally to throttle memory-consuming NMS paging operations until memory has been freed by paging out. For data structures (such as the client-side cache entries) that are naturally recycled, rather than consumed as a result of NMS operations, we simply preallocate a sufficient quantity of the data structures when the NMS system is initialized.
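
The "top off" protocol can be sketched as follows. This is a Python model under stated assumptions, not the kernel code: PagePool, enter_pagein, the target pool size and the two callbacks (an allocator that may fail, and a wait that returns once memory may have been freed) are all hypothetical names introduced for illustration.

```python
class PagePool:
    """Hypothetical free page pool consumed at interrupt level."""

    def __init__(self, target, allocate):
        self.target = target          # desired number of free pages (assumed)
        self._allocate = allocate     # returns a page, or None when memory is short
        self._free = []

    def free_count(self):
        return len(self._free)

    def top_off(self):
        # Top half only: refill the pool; report whether the target was reached.
        while len(self._free) < self.target:
            page = self._allocate()
            if page is None:
                return False
            self._free.append(page)
        return True

    def take(self):
        # Interrupt level: never calls the allocator, only draws from the pool.
        return self._free.pop() if self._free else None


def enter_pagein(pool, wait_for_memory):
    # Block at the entrance to the pagein path until the pool is full.
    while not pool.top_off():
        wait_for_memory()
```

The loop in enter_pagein is where the throttling described above would occur: a pagein cannot start until pageout elsewhere has freed enough memory to refill the pool.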

Tracking NMS page status:
The NMS system needs to track the location of each page of data it manages. This is no problem between the time a page is presented for pageout and the time the page is subsequently paged in as a result of a page fault. However, the Linux VM system has no provision for notifying lower levels of the system when a clean page that is backed by a memory-mapped device is discarded. This means that it is possible for a pagein request to be presented to the NMS system for a page which the NMS system currently believes to be mapped in some local process' address space.

The solution we use for this problem is as follows: when a pagein request arrives for a page that was thought to be locally mapped, but for which a paged-out copy exists on server systems, the NMS system discards its belief that the page is local, and begins believing instead that the page is located remotely on servers. The pagein operation then proceeds normally.
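
The rule above amounts to a small state correction at the start of a pagein. The Python sketch below illustrates it; PageTracker and its method names are hypothetical, and it assumes that the server copy is retained after a pagein, which the notes imply but do not state. The case of a "local" page with no server copy is not described in the notes, so the sketch simply leaves the belief unchanged there.

```python
from enum import Enum


class Location(Enum):
    LOCAL = "local"
    REMOTE = "remote"


class PageTracker:
    """Hypothetical record of where NMS believes each page lives."""

    def __init__(self):
        self._location = {}
        self._has_server_copy = set()

    def location(self, page):
        return self._location.get(page)

    def mark_local(self, page):
        self._location[page] = Location.LOCAL

    def pageout_done(self, page):
        self._location[page] = Location.REMOTE
        self._has_server_copy.add(page)

    def pagein_done(self, page):
        # Assumption: the paged-out copy stays on the servers after pagein.
        self._location[page] = Location.LOCAL

    def reconcile_on_pagein(self, page):
        # A pagein request for a "local" page means the VM silently discarded
        # it; if servers still hold a copy, switch the belief to REMOTE.
        if (self._location.get(page) is Location.LOCAL
                and page in self._has_server_copy):
            self._location[page] = Location.REMOTE
            return True
        return False
```

After reconcile_on_pagein returns, the pagein would proceed exactly as for any page known to be remote.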

Application interface:
The application interface is pretty much as described in the design document. We have so far focused our attention on the "sophisticated" model, which requires the application to mmap() the NMS virtual unit, since the "naive" model, in which the NMS units are used as swap devices, is less efficient and could potentially destabilize other activities on the client system if the NMS system is not completely reliable.

Server hardware:
The prototype we actually constructed uses Compaq ProLiant ML370 (Intel Pentium III) systems, each with 3GB RAM and Myrinet, as the server cluster, rather than Alpha-based systems. Client applications currently run on Alpha 21164-based workstations, with 1GB RAM. The reason for using Pentium III systems for the servers, rather than Alphas, is that we were not satisfied with the performance and reliability of the Alphas we purchased for development. In contrast to the client systems, which need a 64-bit architecture to support applications with larger than 4GB address space, it is not necessary for the server systems to be 64-bit machines.

Paging latencies:
Preliminary performance numbers for the prototype show an average paging latency of under 300us for an 8K page. Other than designing the system to avoid memory-to-memory copies and obvious sources of latency, we have not yet spent much effort in optimizing the performance.

One optimization we did find useful was for clients to wait for paging responses with a short spin loop, rather than blocking via the system scheduler. Calling the system scheduler can result in a very significant delay before a response is noticed and acted on, so we only do this if a response has not been received from the server after spinning for several hundred microseconds. The idea is to catch most responses in the spin loop, and only sleep if a response has been significantly delayed for some reason.
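
The spin-then-sleep idea can be shown in user space with Python's threading.Event standing in for the arrival of a server response. This is only an analogy to the kernel mechanism: the function name wait_for_response, the 300us spin budget and the 0.5s sleep timeout are assumptions chosen to match the orders of magnitude mentioned in these notes.

```python
import threading
import time


def wait_for_response(arrived, spin_s=300e-6, sleep_timeout_s=0.5):
    """Return 'spin', 'sleep' or 'timeout' depending on how the wait ended."""
    deadline = time.perf_counter() + spin_s
    # Phase 1: busy-wait briefly, hoping the response arrives quickly.
    while True:
        if arrived.is_set():
            return "spin"
        if time.perf_counter() >= deadline:
            break
    # Phase 2: give up the CPU and let the scheduler wake us on arrival.
    if arrived.wait(sleep_timeout_s):
        return "sleep"
    return "timeout"


if __name__ == "__main__":
    response = threading.Event()
    # Simulate a response that is delayed well beyond the spin budget.
    threading.Timer(0.05, response.set).start()
    print(wait_for_response(response))
```

The return value makes it easy to count how often the fast path is taken, which is the number one would want to watch when tuning the spin budget.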

Event stream:
The design document describes an event stream mechanism, by which events generated in the kernel are transmitted to the user-level CSD or SSD daemon processes, and then possibly distributed to other nodes in the system. The event mechanism is currently only a stub consisting of a collection of macros for various events. At present, these macros simply generate debugging printout for the console or system log. Since the event mechanism is essential to some aspects of system operation (e.g. flushing unneeded pages on servers after an aborted pageout operation), we will soon have to implement more of the full event mechanism.

Hierarchical structure of page location table:
(issue: it was impossible to allocate arrays larger than 64K)
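
One common way around such an allocation limit is a two-level table whose leaves each stay within the limit. The Python sketch below illustrates the indexing only; it is not taken from the prototype. The 4-byte entry size, the lazy allocation of leaves, and the names PageLocationTable, get and set are assumptions made for the example.

```python
from array import array

MAX_ALLOC_BYTES = 64 * 1024                      # allocation limit from the note
ENTRY_BYTES = 4                                  # hypothetical size of one entry
LEAF_ENTRIES = MAX_ALLOC_BYTES // ENTRY_BYTES    # entries per leaf


class PageLocationTable:
    """Hypothetical two-level table: a directory of fixed-size leaves."""

    def __init__(self, num_pages):
        self.num_pages = num_pages
        num_leaves = (num_pages + LEAF_ENTRIES - 1) // LEAF_ENTRIES
        self._leaves = [None] * num_leaves       # leaves are created on first write

    def _slot(self, page):
        if not 0 <= page < self.num_pages:
            raise IndexError(page)
        return divmod(page, LEAF_ENTRIES)        # (leaf number, offset in leaf)

    def get(self, page):
        leaf_no, offset = self._slot(page)
        leaf = self._leaves[leaf_no]
        return 0 if leaf is None else leaf[offset]

    def set(self, page, value):
        leaf_no, offset = self._slot(page)
        if self._leaves[leaf_no] is None:
            # Each leaf holds LEAF_ENTRIES zero-initialised unsigned entries.
            self._leaves[leaf_no] = array("I", [0]) * LEAF_ENTRIES
        self._leaves[leaf_no][offset] = value

    def allocated_leaves(self):
        return sum(leaf is not None for leaf in self._leaves)
```

In a real kernel the directory itself would also have to respect the limit, which bounds the number of pages a two-level table of this shape can cover.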

Page-out initiated both by user processes and swap daemon:
(issue: we originally thought pageout was the sole province of the kswapd, as in BSD. This is not true.)

Page fragmentation:
(issue: the Myrinet drivers had an MTU of 4160 bytes, which was only long enough for a 4K page payload. This made life complicated when paging between a Pentium and an Alpha. We implemented a crude fragmentation scheme. Newer Myrinet versions support 9K MTUs, so we expect to be able to go back and remove the kludgy fragmentation code.)
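
To make the problem concrete, the Python sketch below splits an 8K page into 4K fragments that fit a 4160-byte MTU and reassembles them on the other side. It does not reproduce the prototype's scheme: the 12-byte header layout (page id, fragment index, fragment count) and the names fragment and Reassembler are hypothetical.

```python
import struct

MTU = 4160                          # Myrinet driver MTU from the note
MAX_PAYLOAD = 4096                  # one 4K page fits in a single packet
HEADER = struct.Struct("!QHH")      # hypothetical: page id, index, count (12 bytes)


def fragment(page_id, data):
    """Split one page into packets no larger than the MTU."""
    count = (len(data) + MAX_PAYLOAD - 1) // MAX_PAYLOAD
    return [
        HEADER.pack(page_id, i, count) + data[i * MAX_PAYLOAD:(i + 1) * MAX_PAYLOAD]
        for i in range(count)
    ]


class Reassembler:
    """Collects fragments per page and returns the page once complete."""

    def __init__(self):
        self._partial = {}

    def receive(self, packet):
        page_id, index, count = HEADER.unpack_from(packet)
        parts = self._partial.setdefault(page_id, {})
        parts[index] = packet[HEADER.size:]
        if len(parts) < count:
            return None             # still waiting for other fragments
        del self._partial[page_id]
        return page_id, b"".join(parts[i] for i in range(count))
```

With a 9K MTU the whole 8K page would fit in one packet and this layer could be dropped, as the note anticipates.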

UDP stub module:
(issue: we did originally implement a version of the HSN layer that used UDP/IP for debugging and development purposes. However, the Linux IP stack is not well suited to being called from other portions of the kernel or at interrupt level. So, once we got the Myrinet driver working, this code started to rot and has not been updated to the current Linux kernel version we are using.)

Access to more than 3GB of RAM on Pentium:
(issue: Linux assumes that all RAM is mapped into kernel space. This makes it difficult to access more than 3GB RAM on a 32-bit system. FreeBSD should not have this problem, because it does not try to map all the physical memory at once.)

SAMSON Project
Last modified: Sun Jan 28 18:08:11 EST 2001