Overview
The high-speed network driver (HSN) is responsible for sending unreliable datagrams over the high-speed Myrinet network.
Interfaces
The HSN exposes a single unreliable datagram interface, which both the CSKM and SSKM modules use. A single physical node can sometimes act as both client and server, so the HSN interface should be able to support the CSKM and SSKM modules at the same time.
Data Structures
The HSN maintains the following principal data structures.
Packet Size
One main issue for the HSN layer is the size of the packet it can handle. The maximum MTU for the Myrinet Control Program provided with the Myrinet distribution is 8K bytes. Hence, without modifying the Myrinet Control Program, it is not possible to send an 8K page in a single packet: each page is accompanied by some control information, and together they exceed 8K bytes. If the page size on the client machine is 8K, the HSN layer handles the fragmentation of the page. Apart from fragmentation, it is a goal of the NMS server to avoid page copies whenever possible, so the HSN layer should arrange the packet so that as few copy operations as possible are needed.
Regarding page size, there can be two possibilities:
case 1: Client Page Size 4K, Server Page size 8K
case 2: Client Page Size 8K, Server Page size 8K
In case 1, there are two operations to consider: page-in and page-out.
For the page-in operation, the server side SSKM layer delivers a 4K page along with some control information (the SSKM header) to the HSN layer. The SSKM layer should make sure that the information is delivered with the page first and the header after it, as in the <8K page, SSKM header> format used in case 2.
For the page-out operation in case 1, the client side CSKM layer delivers a 4K page along with some control information to the HSN layer. This should be in the same page-first format.
In case 2, again, we have page-in and page-out operations. For the page-in operation, the server side SSKM layer delivers an 8K page along with the control information to the HSN layer in <8K page, SSKM header> format. Since the HSN layer cannot transfer it as a single packet, it fragments the message into two packets, each carrying its own HSN header.
The page-out operation of case 2 is very similar to the page-in operation of case 2. The client side CSKM layer delivers <8K page, CSKM header>, which is fragmented into two packets: <4K page, HSN header> and <4K page, CSKM header, HSN header>. At the server side, the packets are received in two 8K receive buffers. The second half of the 8K page is again copied into the second half of the first 8K buffer, and the result is given to the SSKM layer. In this case, too, half of every page is copied.
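The send-side fragmentation described above can be illustrated with a minimal sketch in C. The names hsn_frag, hsn_fragment and HSN_FRAG_DATA are hypothetical, and the 4K payload per fragment is taken from the case 2 description. Each fragment only points into the caller's page, so nothing is copied on the sending side, and the upper layer (CSKM/SSKM) header travels with the last fragment only.

```c
#include <assert.h>
#include <stddef.h>

#define HSN_FRAG_DATA 4096u   /* assumed page payload carried per fragment */

/* Hypothetical descriptor for one outgoing fragment. */
struct hsn_frag {
    const unsigned char *data;  /* points into the caller's page, no copy */
    size_t data_len;
    const void *upper_hdr;      /* CSKM/SSKM header, last fragment only */
    size_t upper_hdr_len;
    unsigned frag_no;           /* 0 or 1 */
    int last_frag;              /* nonzero on the final fragment */
};

/* Split <page, upper header> into at most two fragments.
 * Returns the number of fragments, or 0 if the page does not fit. */
static size_t hsn_fragment(const unsigned char *page, size_t page_len,
                           const void *upper_hdr, size_t upper_hdr_len,
                           struct hsn_frag out[2])
{
    size_t n = 0, off = 0;

    if (page_len == 0 || page_len > 2 * HSN_FRAG_DATA)
        return 0;
    while (off < page_len) {
        size_t chunk = page_len - off;
        if (chunk > HSN_FRAG_DATA)
            chunk = HSN_FRAG_DATA;
        out[n].data = page + off;
        out[n].data_len = chunk;
        out[n].frag_no = (unsigned)n;
        off += chunk;
        out[n].last_frag = (off == page_len);
        out[n].upper_hdr = out[n].last_frag ? upper_hdr : NULL;
        out[n].upper_hdr_len = out[n].last_frag ? upper_hdr_len : 0;
        n++;
    }
    return n;
}
```

With a 4K page the function yields a single fragment that is both fragment 0 and the last fragment, which corresponds to case 1.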
Buffer Management
The HSN layer should keep enough buffers ready for the network interface
to receive packets. The HSN layer uses the native page size of the machine
for these buffers. The HSN layer can take the responsibility of allocating
these buffers (Was there a problem with that?).
On the client side, when a page is received, CSKM has the responsibility of
freeing that page if necessary and returning it to the system pool of buffers.
On the server side, similarly, the HSN layer allocates the buffers, receives
the page in such a buffer, and gives it to SSKM. SSKM puts it directly in the
cache. It is again the responsibility of the server to free a cache page and
return it to the system pool of memory when it is no longer needed.
Multicast
For a page-out operation with a replication degree greater than 1, the client
needs to multicast the page to several servers. Ideally, the HSN layer should
support multicast. However, the multicast groups in this case change
dynamically, so supporting multicast would require setting up the multicast
group before every multicast operation. This would be quite time consuming.
Hence, the CSKM layer should send a multicast message to the HSN layer as
multiple unicast messages.
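As a rough illustration of replacing multicast with repeated unicast, the following C sketch loops over the replica servers and sends the same page to each. The names cskm_send_replicas, hsn_send_fn and stub_send are hypothetical and only stand in for whatever send entry point the HSN actually exposes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical unicast send entry point: returns 0 on success. */
typedef int (*hsn_send_fn)(int dest_host, const void *msg, size_t len);

/* Send the same message to every replica server, one unicast each.
 * Returns the number of sends that were accepted. */
static size_t cskm_send_replicas(hsn_send_fn send,
                                 const int *servers, size_t nservers,
                                 const void *msg, size_t len)
{
    size_t ok = 0;
    for (size_t i = 0; i < nservers; i++) {
        if (send(servers[i], msg, len) == 0)
            ok++;
    }
    return ok;
}

/* Stub sender used only to exercise the loop: records destinations. */
static int stub_sent[8];
static size_t stub_count;

static int stub_send(int dest_host, const void *msg, size_t len)
{
    (void)msg;
    (void)len;
    if (stub_count >= 8)
        return -1;
    stub_sent[stub_count++] = dest_host;
    return 0;
}
```

Because the page itself is not copied per destination, the extra cost over true multicast is one send request per replica.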
HSN Operation
Initialization
Refill the receive queue with enough
buffers.
Interrupt handler
When the HSN layer receives an interrupt, it first checks whether there are
any pending packets in the interface. If there are several, the interrupt
handler processes only the first few so that it does not create a
bottleneck. For each pending packet that it processes, it calls the lower
level routine receive_packet().
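The bounded processing in the interrupt handler might look like the sketch below. The batch limit HSN_RX_BATCH, the hsn_dev structure and the stand-in body of receive_packet() are assumptions made for illustration; the design only says that the first few pending packets are processed.

```c
#include <assert.h>

#define HSN_RX_BATCH 4   /* assumed per-interrupt limit, not from the design */

/* Hypothetical device state: counters stand in for the real queues. */
struct hsn_dev {
    int rx_pending;   /* packets waiting in the interface */
    int delivered;    /* packets handed to receive_packet() */
};

/* Stand-in for the lower level receive routine. */
static void receive_packet(struct hsn_dev *d)
{
    d->rx_pending--;
    d->delivered++;
}

/* Process at most HSN_RX_BATCH pending packets, then return.
 * Returns the number of packets handled in this invocation. */
static int hsn_interrupt(struct hsn_dev *d)
{
    int handled = 0;
    while (d->rx_pending > 0 && handled < HSN_RX_BATCH) {
        receive_packet(d);
        handled++;
    }
    return handled;
}
```

Packets left over after the limit is reached stay in the interface and are picked up by a later invocation.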
Sending a Datagram
When the CSKM/SSKM layer gives a datagram to the HSN layer, the HSN first puts the datagram
in the pending message queue and then calls the lower level routine
output_packet().
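The enqueue-then-send path can be sketched as follows. The ring buffer, its capacity, the busy flag that stays set until a simulated completion, and the names hsn_tx, hsn_send and hsn_send_done are all assumptions for illustration; the output routine is called output_packet here, after the heading used later in this document.

```c
#include <assert.h>
#include <stddef.h>

#define HSN_PENDING_MAX 8u   /* assumed queue capacity */

/* Hypothetical transmit state. */
struct hsn_tx {
    const void *pending[HSN_PENDING_MAX];  /* FIFO ring of queued datagrams */
    unsigned head, count;
    int busy;                              /* nonzero while a send is in flight */
    unsigned sent;                         /* datagrams handed to the interface */
};

/* Hand the next pending datagram to the interface unless it is busy. */
static void output_packet(struct hsn_tx *t)
{
    if (t->busy || t->count == 0)
        return;
    t->head = (t->head + 1) % HSN_PENDING_MAX;   /* dequeue the oldest entry */
    t->count--;
    t->busy = 1;
    t->sent++;                                   /* stand-in for the real transfer */
}

/* Entry point for CSKM/SSKM: enqueue first, then try to send. */
static int hsn_send(struct hsn_tx *t, const void *datagram)
{
    if (t->count == HSN_PENDING_MAX)
        return -1;                               /* pending queue is full */
    t->pending[(t->head + t->count) % HSN_PENDING_MAX] = datagram;
    t->count++;
    output_packet(t);
    return 0;
}

/* Simulated send completion: release the device and drain the queue. */
static void hsn_send_done(struct hsn_tx *t)
{
    t->busy = 0;
    output_packet(t);
}
```

Always going through the queue keeps datagrams in order even when the device is busy at the time of the call.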
Internal Operation
receive_packet()
receive_packet() first checks the fragment number and last fragment fields
of the packet. If this is the last fragment of a message and the fragment
number is 0, the message consists of only one packet. In that case, the HSN
layer strips the HSN header and delivers the packet to the upper layer
(CSKM/SSKM) through the interrupt function provided by that layer.
output_packet()
This function performs the low level operations needed to transfer the packet to the Myrinet interface. It first checks whether the device is busy because another invocation of output_packet() is in progress. If the device is not busy, it dequeues the message and creates two gather buffers, one containing the nms_data and the other containing the headers. The HSN puts the data first and the headers last so as to avoid costly copy operations when the data is a 4K page. Putting the data first
is a way to make sure that the data is always page aligned.
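A minimal sketch of the two-entry gather list follows. The hsn_gather structure and hsn_build_gather function are hypothetical; the actual gather descriptor format of the Myrinet interface is not given in this document. The point is only the ordering: the page comes first and is referenced in place, and the headers follow in a second, small buffer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical gather entry: one contiguous region to transmit. */
struct hsn_gather {
    const void *addr;
    size_t len;
};

/* Build a two-entry gather list: page data first, headers last.
 * Neither region is copied. Returns the total packet length. */
static size_t hsn_build_gather(const void *data, size_t data_len,
                               const void *hdrs, size_t hdrs_len,
                               struct hsn_gather g[2])
{
    g[0].addr = data;   /* stays at offset 0, so the page remains page aligned */
    g[0].len = data_len;
    g[1].addr = hdrs;   /* upper layer header and HSN header follow the data */
    g[1].len = hdrs_len;
    return data_len + hdrs_len;
}
```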
Note: This HSN driver design closely follows the design of the driver provided with the Myrinet package.
The HSN header carries the following fields:
1: Host ID
2: Unique host sequence number
3: Fragment number (0/1)
4: Last fragment?
The last field is necessary to distinguish a message consisting of a single
packet from a message consisting of two packets.
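The four fields listed above could be represented in C roughly as follows. The structure name, field names and field widths are assumptions; the document does not specify a wire layout.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout; the field widths are assumptions. */
struct hsn_header {
    uint32_t host_id;     /* 1: host ID */
    uint32_t host_seq;    /* 2: unique host sequence number */
    uint8_t  frag_no;     /* 3: fragment number (0 or 1) */
    uint8_t  last_frag;   /* 4: nonzero if this is the last fragment */
};

/* A message is a single packet exactly when fragment 0 is also the last one. */
static int hsn_is_single_packet(const struct hsn_header *h)
{
    return h->frag_no == 0 && h->last_frag != 0;
}
```

Fragment 0 with the last fragment flag cleared is the first half of a two-packet message, which is why the fragment number alone is not enough.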
For the page-in operation of case 2, when the message is received at the
client side, the client has allocated 8K buffers for receiving messages.
Hence, the two packets are stored in two 8K buffers. Notice that we now need
a copy operation to copy the second half of the 8K page into the free second
half of the first 8K buffer (we can discard the HSN header of the first
packet). This makes the fragmentation totally transparent to the CSKM/SSKM
layer. One problem with this approach, though, is that in naive mode, i.e.
when using the buffer cache, CSKM needs to copy the whole 8K page anyway from
the HSN buffer to the kernel buffer cache. Hence, the second half of the 8K
page is copied more than once.
At initialization, the HSN also sets all the buffers in the send queue to
FREE status.
Before returning from the interrupt handler, the HSN checks whether there is
any pending message in the pending message queue. If there is a message, it
calls the routine output_packet().
If a received packet is part of a larger message, the HSN first checks the
fragment queue to find out whether the other part of the message has already
been received. If it has, the HSN layer combines the two packets into a single
message by discarding the first HSN header, copying the second half of the 8K
page to the second half of the first buffer, and copying the CSKM/SSKM header
to the beginning of the second buffer. It then delivers these two buffers as a
single message to the upper layer. If the other part of the message is not
present, the HSN layer enqueues this packet in the fragment queue in the hope
that the other half will be received soon. Notice that the HSN layer needs to
set a timer for every packet in the fragment queue so that, if the other part
is lost, the HSN can delete the received half as well after some time.
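The reassembly step can be sketched in C as below. This is a simplified illustration under several assumptions: a single fragment slot stands in for the fragment queue, a tick counter stands in for the timer, each 8K receive buffer holds its fragment's 4K of page data at offset 0, and the copy of the CSKM/SSKM header into the second buffer is left out. All names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define HSN_HALF 4096u   /* page data carried by each fragment */

/* Hypothetical one-entry fragment queue. */
struct frag_slot {
    int in_use;
    uint32_t host_id, host_seq;   /* identify the message */
    unsigned frag_no;             /* which half is waiting */
    unsigned char *buf;           /* 8K receive buffer holding that half */
    unsigned age;                 /* ticks since the fragment was queued */
};

/* Copy the second half of the page behind the first half. */
static unsigned char *hsn_merge(unsigned char *first, const unsigned char *second)
{
    memcpy(first + HSN_HALF, second, HSN_HALF);
    return first;
}

/* Returns the buffer holding the complete page, or NULL if the other
 * half has not arrived yet. A waiting fragment of a different message
 * is simply replaced in this simplified sketch. */
static unsigned char *hsn_reassemble(struct frag_slot *s, uint32_t host_id,
                                     uint32_t host_seq, unsigned frag_no,
                                     unsigned char *buf)
{
    if (s->in_use && s->host_id == host_id && s->host_seq == host_seq &&
        s->frag_no != frag_no) {
        s->in_use = 0;
        return frag_no == 1 ? hsn_merge(s->buf, buf) : hsn_merge(buf, s->buf);
    }
    s->in_use = 1;
    s->host_id = host_id;
    s->host_seq = host_seq;
    s->frag_no = frag_no;
    s->buf = buf;
    s->age = 0;
    return NULL;
}

/* Timer tick: drop a waiting fragment whose partner never arrived. */
static void hsn_frag_tick(struct frag_slot *s, unsigned max_age)
{
    if (s->in_use && ++s->age > max_age)
        s->in_use = 0;
}
```

The merge copies 4K of an 8K page, which is the 50% copy cost mentioned in the packet size discussion.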
Finally, output_packet() puts the dequeued message in the send queue and
marks its status as BUSY. It also goes through all the buffers in the send
queue to detect the ones that have completed the DMA operation. The status
of these buffers is set to FREE, and the function provided by the CSKM/SSKM
module is called to indicate the completion of the send. The CSKM/SSKM module
can then free the data corresponding to that message.
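The scan of the send queue for completed DMA operations might be written as in this sketch. The hsn_send_buf structure, the dma_done flag and the callback type are hypothetical; how the real interface reports DMA completion is not described here.

```c
#include <assert.h>
#include <stddef.h>

enum hsn_status { HSN_FREE, HSN_BUSY };

/* Hypothetical send queue entry. */
struct hsn_send_buf {
    enum hsn_status status;
    int dma_done;   /* assumed flag set once the DMA has completed */
    void *msg;      /* upper layer message owning the data */
};

/* Completion callback supplied by the CSKM/SSKM module. */
typedef void (*hsn_send_done_fn)(void *msg);

/* Mark every completed buffer FREE and notify the upper layer.
 * Returns the number of buffers reclaimed. */
static size_t hsn_reclaim(struct hsn_send_buf *q, size_t n, hsn_send_done_fn done)
{
    size_t freed = 0;
    for (size_t i = 0; i < n; i++) {
        if (q[i].status == HSN_BUSY && q[i].dma_done) {
            q[i].status = HSN_FREE;
            q[i].dma_done = 0;
            done(q[i].msg);     /* upper layer may now free its data */
            q[i].msg = NULL;
            freed++;
        }
    }
    return freed;
}

/* Counting callback used only to exercise the scan. */
static int done_calls;

static void count_done(void *msg)
{
    (void)msg;
    done_calls++;
}
```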
After sending a packet, output_packet()
checks if there is any message in the pending message queue. If there is
a message, it tries to send that message.
Last modified: Friday May 7 12:50:29 EDT 1999