High-Speed Network Driver

Overview

The high-speed network driver (HSN) is responsible for sending unreliable datagrams over the high-speed Myrinet network.

Interfaces

The HSN exposes a single unreliable datagram interface, shared by the CSKM and SSKM modules. A single physical node can sometimes act as both client and server, so the HSN interface must be able to support the CSKM and SSKM modules at the same time.

Data Structures

The HSN maintains the following principal data structures.

  • A pending message queue that stores the messages delivered from the CSKM/SSKM layer that are waiting to be put on the Myrinet interface with a myriApiSend() call.
  • A send queue containing all the messages being transferred to the Myrinet interface, i.e. those for which myriApiSend() has been called but the DMA to the network interface is not yet complete.
  • A receive queue containing a pool of buffers to be used by the Myrinet interface when a packet is received.
  • A fragment queue containing the packets that are part of a larger message, so that the HSN can combine them.

    Packet Size

    One main issue for the HSN layer is the size of the packet it can handle. The maximum MTU for the Myrinet Control Program provided with the Myrinet distribution is 8K bytes. Hence, without modifying the Myrinet Control Program, it is not possible to send an 8K page in a single packet: each page carries some control information, and together they exceed 8K bytes. If the page size on the client machine is 8K, the HSN layer fragments the page. Apart from fragmentation, it is a goal of the NMS server to avoid page copies whenever possible, so the HSN layer should arrange each packet so that as few copy operations as possible are needed.

    Regarding page size, there can be two possibilities:
    case 1: Client Page Size 4K, Server Page size 8K
    case 2: Client Page Size 8K, Server Page size 8K

    In case 1 there are two operations to consider: page-in and page-out. For a page-in operation, the server-side SSKM layer delivers a 4K page along with some control information (the SSKM header) to the HSN layer. The SSKM layer should make sure that the information is delivered as <page, SSKM header> instead of <SSKM header, page>. This ensures that when the message arrives at the client side, the page is received on a 4K page boundary and no extra copy operation is needed. The HSN layer adds its own control information to the end of the message and transfers it to the client as <page, SSKM header, HSN header>. On the client side, the HSN layer allocates 4K buffers. Hence, the page is received in one buffer and the headers are received in another.

    For a page-out operation in case 1, the client-side CSKM layer delivers a 4K page along with some control information to the HSN layer. This should be of the format <page, CSKM header>. Again the HSN layer adds its header to this message and transfers <page, CSKM header, HSN header> to the server side. The server-side HSN layer allocates 8K buffers to receive the packets. Here copying cannot be avoided altogether. The server-side cache is expected to keep two 4K client pages in one 8K server page to save space. If the server cache decides to put this 4K page in the first half of an 8K page, there is no copy; if it decides to put the page in the second half, there is a copy operation. Hence, in this case roughly 50% of the pages are copied.

    In case 2 we again have page-in and page-out operations. For a page-in operation, the server-side SSKM layer delivers an 8K page along with the control information to the HSN layer in <8K page, SSKM header> format. Since the HSN layer cannot transfer it as a single packet, it fragments the message into two packets, <4K page, HSN header> and <4K page, SSKM header, HSN header>. Notice that there is no guarantee that the two packets will arrive at the client consecutively. Hence the HSN layer needs to put some extra information in the HSN header so that it can combine the two packets at the client side. The following information is sufficient to handle this:
    1: Host ID
    2: Unique host sequence number
    3: fragment number (0/1)
    4: Last fragment?
    The last field is necessary to distinguish a message with a single packet from a message with two packets.
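
    The four header fields listed above could be laid out as in the following minimal sketch. The struct name, field names and field widths are assumptions made for illustration; the design only fixes which pieces of information the header carries.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: names and widths are assumptions, not the real format. */
struct hsn_header {
    uint32_t host_id;    /* 1: host ID of the sender */
    uint32_t host_seq;   /* 2: unique per-host sequence number */
    uint8_t  frag_no;    /* 3: fragment number, 0 or 1 */
    uint8_t  last_frag;  /* 4: nonzero if this is the last fragment */
};

/* A message that fits in one packet is fragment 0 and also the last fragment. */
static bool hsn_is_single_packet(const struct hsn_header *h)
{
    assert(h != NULL);
    return h->frag_no == 0 && h->last_frag != 0;
}

/* Two packets belong to the same message when host ID and sequence number match. */
static bool hsn_same_message(const struct hsn_header *a,
                             const struct hsn_header *b)
{
    assert(a != NULL && b != NULL);
    return a->host_id == b->host_id && a->host_seq == b->host_seq;
}
```
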
    When the message is received at the client side, the client has allocated 8K buffers for receiving messages, so the two packets are stored in two 8K buffers. A copy operation is then needed to move the second half of the 8K page into the free second half of the first 8K buffer (the HSN header of the first packet can be discarded). This makes the fragmentation operation totally transparent to the CSKM/SSKM layer. One problem with this approach is that in naive mode, i.e. using the buffer cache, the CSKM needs to copy the whole 8K page from the HSN buffer to the kernel buffer cache anyway. Hence, the second half of the 8K page is copied more than once.

    The page-out operation of case 2 is very similar to the page-in operation of case 2. The client-side CSKM layer delivers <8K page, CSKM header>, which is fragmented into two packets, <4K page, HSN header> and <4K page, CSKM header, HSN header>. On the server side, the packets are received in two 8K receive buffers. The second half of the 8K page is again copied into the second half of the first 8K buffer, which is then given to the SSKM layer. In this case, again, half of every page is copied.
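
    The send-side fragmentation described in this section can be sketched as follows. This is an illustration under assumptions, not the driver's code: hsn_fragment(), struct hsn_pkt and the header layout are hypothetical names, and each packet is described by pointers into the caller's page so that no data is copied while fragmenting.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HSN_HALF_PAGE 4096u

/* Hypothetical HSN header; see the four fields listed in the text. */
struct hsn_hdr { uint32_t host_id, host_seq; uint8_t frag_no, last_frag; };

/* One outgoing packet, described by pointers into existing memory (no copy). */
struct hsn_pkt {
    const uint8_t *data;   size_t data_len;   /* page data, placed first */
    const void    *upper;  size_t upper_len;  /* CSKM/SSKM header, may be empty */
    struct hsn_hdr hdr;                       /* HSN header, placed last */
};

/* Describe a 4K or 8K page as one or two packets; returns the packet count. */
static int hsn_fragment(const uint8_t *page, size_t page_len,
                        const void *upper, size_t upper_len,
                        uint32_t host_id, uint32_t seq, struct hsn_pkt out[2])
{
    assert(page != NULL);
    assert(page_len == HSN_HALF_PAGE || page_len == 2 * HSN_HALF_PAGE);
    int n = (page_len > HSN_HALF_PAGE) ? 2 : 1;
    for (int i = 0; i < n; i++) {
        int last = (i == n - 1);
        out[i].data = page + (size_t)i * HSN_HALF_PAGE;
        out[i].data_len = HSN_HALF_PAGE;
        /* Only the last fragment carries the upper-layer header. */
        out[i].upper = last ? upper : NULL;
        out[i].upper_len = last ? upper_len : 0;
        out[i].hdr = (struct hsn_hdr){ host_id, seq, (uint8_t)i, (uint8_t)last };
    }
    return n;
}
```

    A 4K page yields a single packet marked as fragment 0 and last fragment; an 8K page yields two packets, with the upper-layer header travelling in the second one.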

    Buffer Management

    The HSN layer should keep enough buffers ready for the network interface to receive packets. The HSN layer uses the native page size of the machine for these buffers. The HSN layer can take responsibility for allocating these buffers. (Was there a problem with that?) On the client side, when a page is received, the CSKM is responsible for freeing that page when necessary and returning it to the system pool of buffers. On the server side, similarly, the HSN layer allocates the buffers, receives the page in one of them and gives it to the SSKM, which puts it directly in the cache. It is again the responsibility of the server to free a cache page and return it to the system pool of memory when it is no longer needed.

    Multicast

    For a page-out operation with a replication degree greater than 1, the client needs to send the page to several servers. Ideally, the HSN layer would support multicast. However, the multicast groups in this case change dynamically, so supporting multicast would require setting up the multicast group before every multicast operation, which would be quite time consuming. Hence, the CSKM layer should send a multicast message to the HSN layer as multiple unicast messages.
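
    A minimal sketch of this replication loop is shown below. The names cskm_send_replicas() and hsn_send_fn are hypothetical, as is the convention that a send returns 0 on success; demo_send() is a stub that only records the destinations it was given.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical unicast hook standing in for the HSN send entry point. */
typedef int (*hsn_send_fn)(int server_id, const void *msg, size_t len);

/* Send the same message to every replica server, one unicast at a time.
 * Returns the number of sends that reported success. */
static int cskm_send_replicas(hsn_send_fn send, const int *servers, int nservers,
                              const void *msg, size_t len)
{
    assert(send != NULL && servers != NULL && nservers >= 0);
    int ok = 0;
    for (int i = 0; i < nservers; i++) {
        if (send(servers[i], msg, len) == 0)
            ok++;
    }
    return ok;
}

/* Stub used only to exercise the loop: records each destination. */
static int demo_dests[8];
static int demo_count;

static int demo_send(int server_id, const void *msg, size_t len)
{
    (void)msg;
    (void)len;
    if (demo_count >= 8)
        return -1;
    demo_dests[demo_count++] = server_id;
    return 0;
}
```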

    HSN Operation

    Initialization

    Fill the receive queue with enough buffers.
    Set the status of every buffer in the send queue to FREE.

    Interrupt handler

    When the HSN layer receives an interrupt, it first checks whether there are any pending packets in the interface. If there are several, the interrupt handler processes only the first few so that it does not become a bottleneck. For each pending packet it calls the lower-level routine receive_packet().
    Before returning from the interrupt handler, the HSN checks whether there is a message in the pending message queue. If there is, it calls the routine output_packet().
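
    The bounded receive loop can be sketched as below. The budget of 4 packets per interrupt is an assumed value (the design only says "first few"), struct hsn_dev is hypothetical, and the queues are reduced to counters so that the control flow stands out; the send routine is written as output_packet() here.

```c
#include <assert.h>
#include <stddef.h>

#define HSN_RX_BUDGET 4   /* assumed cap on packets handled per interrupt */

/* Hypothetical device state: counters stand in for the real queues. */
struct hsn_dev {
    int rx_pending;   /* packets waiting in the interface */
    int tx_pending;   /* messages in the pending message queue */
    int rx_handled;   /* number of receive_packet() calls made */
    int tx_kicked;    /* number of output_packet() calls made */
};

/* Placeholders for the lower-level routines described in this document. */
static void receive_packet(struct hsn_dev *d) { d->rx_pending--; d->rx_handled++; }
static void output_packet(struct hsn_dev *d)  { d->tx_pending--; d->tx_kicked++; }

static void hsn_interrupt(struct hsn_dev *d)
{
    assert(d != NULL);
    /* Handle at most HSN_RX_BUDGET packets; the rest wait for the next interrupt. */
    for (int i = 0; i < HSN_RX_BUDGET && d->rx_pending > 0; i++)
        receive_packet(d);
    /* Before returning, start a send if a message is waiting. */
    if (d->tx_pending > 0)
        output_packet(d);
}
```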

    Sending a Datagram

    When the CSKM/SSKM layer gives a datagram to the HSN layer, the HSN first puts the datagram in the pending message queue and then calls the lower-level routine output_packet().

    Internal Operation

    receive_packet()

    receive_packet() first checks the fragment number and last fragment fields of the packet. If this is the last fragment of a message and the fragment number is 0, the message contains only one packet. In that case, the HSN layer strips the HSN header and delivers the packet to the upper layer (CSKM/SSKM) through the interrupt function provided by that layer.
    If this packet is part of a larger message, the HSN first checks the fragment queue to see whether the other part of the message has already been received. If it has, the HSN layer combines the two packets into a single message by discarding the first HSN header, copying the second half of the 8K page to the second half of the first buffer, and copying the CSKM/SSKM header to the beginning of the second buffer. It then delivers these two buffers as a single packet to the upper layer. If the other part of the message is not present, the HSN layer enqueues this packet in the fragment queue in the expectation that the other half will be received soon. Notice that the HSN layer needs to set a timer for every packet in the fragment queue so that, if the other part is lost, the HSN can delete the received half after some time.
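
    The reassembly path can be sketched as follows. This is a simplified illustration under assumptions: hsn_reassemble(), struct rx_pkt and the fixed-size fragment queue are hypothetical, the per-fragment timer and the relocation of the CSKM/SSKM header are left out, and each receive buffer is assumed to be 8K with the fragment's page data in its first 4K.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define HALF 4096u
#define FRAGQ_SLOTS 8

/* Hypothetical HSN header carrying the four fields named in this document. */
struct hsn_hdr { uint32_t host_id, host_seq; uint8_t frag_no, last_frag; };

/* A received packet: parsed header plus its 8K receive buffer. */
struct rx_pkt { struct hsn_hdr hdr; uint8_t *buf; };

struct frag_slot { int used; struct rx_pkt pkt; };
static struct frag_slot fragq[FRAGQ_SLOTS];

/* Returns the buffer holding the complete page, or NULL when the packet
 * was parked (or dropped) because its other half has not arrived yet. */
static uint8_t *hsn_reassemble(struct rx_pkt *p)
{
    assert(p != NULL && p->buf != NULL);
    if (p->hdr.frag_no == 0 && p->hdr.last_frag)
        return p->buf;                     /* single-packet message */

    for (int i = 0; i < FRAGQ_SLOTS; i++) {
        struct frag_slot *s = &fragq[i];
        if (!s->used ||
            s->pkt.hdr.host_id  != p->hdr.host_id ||
            s->pkt.hdr.host_seq != p->hdr.host_seq ||
            s->pkt.hdr.frag_no  == p->hdr.frag_no)
            continue;
        /* Other half found: copy fragment 1's data behind fragment 0's. */
        struct rx_pkt *first  = (p->hdr.frag_no == 0) ? p : &s->pkt;
        struct rx_pkt *second = (p->hdr.frag_no == 0) ? &s->pkt : p;
        memcpy(first->buf + HALF, second->buf, HALF);
        s->used = 0;
        return first->buf;
    }
    for (int i = 0; i < FRAGQ_SLOTS; i++) {
        if (!fragq[i].used) {              /* park until the other half arrives */
            fragq[i].used = 1;
            fragq[i].pkt = *p;
            return NULL;
        }
    }
    return NULL;                           /* queue full: dropped in this sketch */
}
```

    Matching on host ID and sequence number rather than on arrival order is what lets the two halves arrive non-consecutively, or even in reverse order.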

    output_packet()

    This function performs the low-level operations needed to transfer a packet to the Myrinet interface. It first checks whether the device is busy because another output_packet() call is in progress. If the device is not busy, it dequeues the message and creates two gather buffers, one containing the nms_data and the other containing the headers. The HSN puts the data first and the header last so as to avoid costly copy operations when the data is a 4K page. Putting the data first ensures that the data is always page aligned.
    Finally, the HSN puts this message in the send queue and marks its status as BUSY. It also goes through all the buffers in the send queue to detect the ones whose DMA operation has completed. The status of these buffers is set to FREE, and the function provided by the CSKM/SSKM module is called to indicate completion of the send. The CSKM/SSKM module can then free the data corresponding to that message.
    After sending a packet, output_packet() checks whether there is any message in the pending message queue. If there is, it tries to send that message.
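
    The send-queue bookkeeping described above might look like the following sketch. All names are hypothetical, including the dma_done flag, which stands in for whatever completion indication the Myrinet interface actually provides; count_done() plays the role of the completion function supplied by the CSKM/SSKM module.

```c
#include <assert.h>
#include <stddef.h>

#define SENDQ_SLOTS 4

enum slot_status { SLOT_FREE = 0, SLOT_BUSY };

struct send_slot {
    enum slot_status status;
    int   dma_done;   /* assumed flag: set once the DMA for this slot finishes */
    void *msg;        /* message owned by CSKM/SSKM until completion is signalled */
};

typedef void (*send_done_fn)(void *msg);

/* Claim a FREE slot for msg and mark it BUSY; returns its index, or -1 if none. */
static int sendq_put(struct send_slot q[], void *msg)
{
    assert(q != NULL && msg != NULL);
    for (int i = 0; i < SENDQ_SLOTS; i++) {
        if (q[i].status == SLOT_FREE) {
            q[i].status = SLOT_BUSY;
            q[i].dma_done = 0;
            q[i].msg = msg;
            return i;
        }
    }
    return -1;
}

/* Mark completed slots FREE and notify the upper layer; returns slots reclaimed. */
static int sendq_reap(struct send_slot q[], send_done_fn done)
{
    assert(q != NULL && done != NULL);
    int n = 0;
    for (int i = 0; i < SENDQ_SLOTS; i++) {
        if (q[i].status == SLOT_BUSY && q[i].dma_done) {
            q[i].status = SLOT_FREE;
            q[i].dma_done = 0;
            done(q[i].msg);   /* upper layer may now free the message */
            n++;
        }
    }
    return n;
}

/* Stub completion callback: counts how many sends were reported complete. */
static int completed;
static void count_done(void *msg) { (void)msg; completed++; }
```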


    Note: This HSN driver design closely follows the design of the driver provided with the Myrinet package.



    Last modified: Friday May 7 12:50:29 EDT 1999