These notes describe how the existing prototype implementation of the NMS system differs from the description in the Spring 1999 design document.
The original design of the client-side kernel module (CSKM) included "a small cache of pages that were recently prepaged from servers and which it is expected might be used again soon." This cache was originally envisioned as a relatively simple mechanism for retaining pages for a short period of time. In fact, this client-side cache portion of the CSKM has become the most central and complex portion of the entire system. The reason is that the original design did not adequately treat the issue of coherence of NMS pages. It was noted in the original design that race conditions affecting coherence were possible, but concrete mechanisms were not proposed for dealing with them.
In the prototype implementation, the client-side cache enforces coherence by serving as the central clearinghouse for pages in transit between the client and server systems. When the client initiates a paging operation, an entry is placed in the cache. Responses from server systems look up the corresponding entry in the cache and use the information recorded there to alert the waiting client process. The client-side cache is also the main synchronization mechanism used by client processes waiting for responses from servers. When a client process wishes to block pending the arrival of responses from servers, it calls a routine provided by the client-side cache module that spins for a short time, hoping for the response to arrive quickly. If the response is delayed, exclusion on the CSKM is released, and the client process sleeps pending the arrival of the response using the usual kernel scheduling mechanism. No blocking occurs while exclusion is held on the CSKM other than via this mechanism.
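The clearinghouse role described above can be illustrated with a small user-space sketch. This is not the CSKM code: the class `PendingCache`, its method names, and the use of a Python lock to stand in for exclusion on the CSKM are all assumptions made for illustration.

```python
import threading

class PendingCache:
    """Toy model of the client-side cache as a clearinghouse for pages in transit.

    Hypothetical names throughout; the lock stands in for exclusion on the CSKM.
    """

    def __init__(self):
        self._lock = threading.Lock()
        self._entries = {}  # page id -> {"done": Event, "data": payload}

    def begin_pagein(self, page_id):
        """Client side: record that a paging operation is outstanding."""
        with self._lock:
            self._entries[page_id] = {"done": threading.Event(), "data": None}

    def deliver(self, page_id, data):
        """Response path: look up the entry and alert the waiting client."""
        with self._lock:
            entry = self._entries.get(page_id)
            if entry is None:
                return False  # no matching operation; the response is dropped
            entry["data"] = data
            entry["done"].set()
            return True

    def wait(self, page_id, timeout=None):
        """Client side: wait for the response without holding the lock."""
        with self._lock:
            entry = self._entries[page_id]
        if not entry["done"].wait(timeout):
            return None
        return entry["data"]
```

Note that the waiter looks up its entry under the lock but waits with the lock released, mirroring the rule that exclusion is dropped before a client process sleeps.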
The client-side cache has a timeout-based mechanism for expiring entries. When an entry is inserted into the cache, it is set to expire after a certain amount of time has gone by. The default timeout is set to 500ms. A "cleaner" task runs periodically (every 100ms) as a bottom-half handler. The cleaner is responsible for finding expired entries and alerting any client processes that are waiting for them. Synchronization issues are simplified by the stipulation that no pointers to cache entries are held outside the cache module, and that the cleaner task is the only entity that removes entries from the cache.
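As a rough illustration of the expiry scheme, the sketch below models one pass of the cleaner. The class and method names are hypothetical, time is passed in explicitly so the behavior is deterministic, and the callback `on_expire` stands in for whatever mechanism alerts waiting client processes.

```python
DEFAULT_TIMEOUT = 0.5  # seconds; the 500ms default mentioned in the text

class ExpiringCache:
    """Toy model of timeout-based expiry (hypothetical names)."""

    def __init__(self, timeout=DEFAULT_TIMEOUT):
        self.timeout = timeout
        self.entries = {}  # page id -> absolute expiry time

    def insert(self, page_id, now):
        """Insert an entry that expires `timeout` seconds after `now`."""
        self.entries[page_id] = now + self.timeout

    def clean(self, now, on_expire):
        """One cleaner pass: remove expired entries and alert their waiters.

        Only this method removes entries, matching the stipulation that the
        cleaner is the sole entity that removes entries from the cache.
        """
        expired = [p for p, t in self.entries.items() if t <= now]
        for page_id in expired:
            del self.entries[page_id]
            on_expire(page_id)
        return expired
```

In the prototype the cleaner runs every 100ms as a bottom-half handler; here a caller would simply invoke `clean` at that interval.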
The lowest synchronization layer is the interrupt-handling level, which is concerned with handling interrupts from the high-speed network (HSN). The sole function of the HSN interrupt routines is to take a packet that has just arrived and place it on a queue for subsequent processing by a task that runs as a bottom-half handler (software interrupt routine). Access to these message queues is synchronized by globally disabling interrupts for the duration of the operation. The CSKM and SSKM also maintain various pools of pre-allocated data structures, to avoid the excessive latencies and potential deadlocks associated with calls to the normal kernel allocator. Those pools of data structures that are accessed by interrupt handlers are also synchronized by globally disabling interrupts.
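The division of labor between the interrupt routine and the bottom half can be sketched in user space as follows. The names are hypothetical, and a lock is used as a stand-in for globally disabling interrupts, which has no direct equivalent outside the kernel.

```python
import collections
import threading

class PacketQueue:
    """Toy model of the HSN receive queue (hypothetical names)."""

    def __init__(self):
        self._irq_off = threading.Lock()  # models "interrupts globally disabled"
        self._packets = collections.deque()

    def interrupt(self, packet):
        """Interrupt routine: enqueue the packet and do nothing else."""
        with self._irq_off:
            self._packets.append(packet)

    def bottom_half(self, handle):
        """Drain the queue, handling each packet outside the critical section."""
        handled = 0
        while True:
            with self._irq_off:
                if not self._packets:
                    return handled
                packet = self._packets.popleft()
            handle(packet)
            handled += 1
```

Keeping the critical section down to a single enqueue or dequeue reflects the text's point that interrupts are disabled only for the duration of the queue operation.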
The highest synchronization layer consists of processes in the top half of the kernel, executing in the NMS code. To make the synchronization issues understandable, processes executing in this layer arrange for exclusive access to the NMS data structures as long as they are actively executing (i.e., not waiting for a response from a server). This exclusion is achieved by "pausing" processing of NMS messages coming from the HSN. Pausing is an inexpensive operation that amounts to incrementing a counter and disabling the bottom-half handler that processes incoming messages. When a process in the top half needs to block, processing of NMS messages is "resumed". After a process in the top half has blocked and resumed execution, it is necessary (as usual) to re-check important information such as the location of a page and the status of paging operations on that page, because the processing of messages in the interim might have changed this information.
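A minimal sketch of the pause/resume idea follows. The class `MessageGate` is hypothetical, and the nesting behavior (messages are processed again only when the counter returns to zero) is an assumption suggested by the use of a counter, not something the notes state explicitly.

```python
class MessageGate:
    """Toy model of pausing and resuming NMS message processing."""

    def __init__(self):
        self.pause_count = 0
        self.backlog = []    # messages queued while processing is paused
        self.processed = []  # messages the "bottom half" has handled

    def pause(self):
        """Top half enters NMS code: cheap, just bump the counter."""
        self.pause_count += 1

    def resume(self):
        """Top half is about to block or leave: let messages flow again."""
        assert self.pause_count > 0, "resume without matching pause"
        self.pause_count -= 1
        if self.pause_count == 0:
            self._drain()

    def message_arrived(self, msg):
        """Queue a message; process it at once only if not paused."""
        self.backlog.append(msg)
        if self.pause_count == 0:
            self._drain()

    def _drain(self):
        while self.backlog:
            self.processed.append(self.backlog.pop(0))
```

Because messages queued during a pause are processed as soon as the gate reopens, any state a top-half process read before blocking may be stale when it wakes, which is why the re-check is needed.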
At the middle synchronization layer is the bottom-half handler that is responsible for processing incoming messages from the HSN. When this bottom-half handler runs, it has exclusive access to the NMS data structures. This is analogous to a normal interrupt handler in a device driver, and the synchronization is achieved in a similar way: the bottom-half handler has priority over top-half activities, and the top half disables the bottom half while the former is actively executing in the NMS code.
This structure for synchronization in the NMS code was chosen to keep things simple and understandable. Disabling the NMS client-side bottom-half handler any time a process is active in the top half of the NMS code may not yield optimum performance. Once the system is functioning reliably, we can experiment with more complex synchronization schemes that relax this requirement.
Here is an example of one kind of serious problem that we observed with early versions of the implementation. A process takes a page fault and enters the NMS code. The process of paging the data in from a server requires memory allocation, and the process calls the kernel memory allocator. In Linux, when the memory allocator is called with insufficient available memory, pageout is attempted to free up some memory. As a result of the pageout, the NMS code is re-entered recursively. This creates either a deadlock or a difficult synchronization issue. The Linux allocator takes flags to prevent recursion, but then one runs the risk that NMS pagein operations will fail frequently and unpredictably due to the memory shortage which is a normal byproduct of executing under NMS conditions.
We solved the allocation problems by maintaining private pools of pre-allocated memory for data structures critical to NMS operation. Note that simply setting up these pools doesn't immediately solve the problem, since on the client side there is a net flow of memory out of the NMS system as a result of pagein operations. Pageout operations do not compensate by releasing memory to the NMS system, because the pages that are cleaned by NMS pageout are freed not by the NMS system but rather at a higher level, either by a process in the top half that needed memory, or else by the kernel swap daemon. The way memory flows back into the NMS system is via allocation of pages in which to place data coming in over the high-speed network. However, under conditions of severe memory shortage, it is frequently impossible to allocate memory at interrupt level where the HSN code runs.
The solution we chose to the allocation problem is for processes entering the top half of the NMS code to "top off" the HSN free page pool before obtaining exclusion on the NMS data structures and proceeding with paging operations. A process in the top half will block at the entrance to the NMS pagein code until sufficient memory has been sequestered. This protocol tends naturally to throttle memory-consuming NMS paging operations until memory has been freed by paging out. For data structures (such as the client-side cache entries) that are naturally recycled, rather than consumed as a result of NMS operations, we simply preallocate a sufficient quantity of the data structures when the NMS system is initialized.
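The "top off" protocol can be sketched as below. All names are hypothetical: `allocate` stands in for a kernel allocation that may fail under memory shortage (returning `None`), and `wait_for_memory` stands in for blocking until pageout has freed some memory.

```python
class PagePool:
    """Toy model of the HSN free page pool (hypothetical names)."""

    def __init__(self, target):
        self.target = target  # number of pages to sequester before proceeding
        self.free = []

    def top_off(self, allocate):
        """Refill the pool to its target; return False if allocation fails."""
        while len(self.free) < self.target:
            page = allocate()
            if page is None:
                return False
            self.free.append(page)
        return True


def enter_pagein(pool, allocate, wait_for_memory):
    """Block at the entrance to the pagein path until the pool is full.

    Only after this returns would the caller obtain exclusion on the NMS
    data structures and proceed with the paging operation.
    """
    while not pool.top_off(allocate):
        wait_for_memory()
```

Refilling before exclusion is taken means the process may block freely here, and the wait itself throttles pagein until memory has been released.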
The solution we use for this problem is as follows: when a pagein request arrives for a page that was thought to be locally mapped, but for which a paged-out copy exists on server systems, the NMS system discards its belief that the page is local, and begins believing instead that the page is located remotely on servers. The pagein operation then proceeds normally.
mmap() the NMS virtual unit, since the "naive" model, in which the NMS units are used as swap devices, is less efficient and could potentially destabilize other activities on the client system if the NMS system is not completely reliable.
One optimization we did find useful was for clients to wait for paging responses with a short spin loop, rather than blocking via the system scheduler. Calling the system scheduler can result in a very significant delay before a response is noticed and acted on, so we only do this if a response has not been received from the server after spinning for several hundred microseconds. The idea is to catch most responses in the spin loop, and only sleep if a response has been significantly delayed for some reason.
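The spin-then-sleep strategy can be illustrated with the following user-space sketch. The function name, the 300-microsecond spin budget, and the use of `threading.Event` in place of the kernel scheduler are assumptions; the notes say only that the spin lasts several hundred microseconds.

```python
import threading
import time

SPIN_SECONDS = 300e-6  # assumed value for "several hundred microseconds"

def wait_for_response(done, spin_seconds=SPIN_SECONDS, sleep_timeout=None):
    """Wait for `done` to be set, spinning briefly before sleeping.

    Returns "spin" if the response was caught in the spin loop, "sleep" if
    it arrived after blocking, or None if sleep_timeout elapsed first.
    """
    deadline = time.perf_counter() + spin_seconds
    while True:
        if done.is_set():
            return "spin"
        if time.perf_counter() >= deadline:
            break
    # The response is late: give up the CPU and block until it arrives.
    if done.wait(sleep_timeout):
        return "sleep"
    return None
```

The return value makes it easy to count how many responses are caught by the spin loop, which is the quantity one would tune the spin budget against.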