Forget the benchmarks; look at real performance. Two other papers by Gilad, "Single Points of Performance" and "Optimum Connectivity in the Multi-core Environment", expand on this idea.
Scientists, engineers and analysts in virtually every field are turning to high-performance computing to solve today's vital and complex problems. Simulations are increasingly replacing expensive physical testing, as more complex environments can be modeled and, in some cases, fully simulated.
High-performance computing encompasses advanced computation over parallel processing, enabling faster execution of highly compute-intensive tasks such as climate research, molecular modeling, physical simulations, cryptanalysis, geophysical modeling, automotive and aerospace design, financial modeling, data mining and more. HPC clusters have become the most common building blocks for high-performance computing, not only because they are affordable, but because they provide the needed flexibility and deliver superior price/performance compared to proprietary symmetric multiprocessing (SMP) systems, with the simplicity and value of industry-standard computing.
Real-world application performance depends on the performance of the cluster's key elements: the processor, the memory, and the interconnect. The interconnect controls the data transfer between servers and has a strong influence on CPU efficiency and memory utilization.

Transport-offload interconnect architectures, unlike "on-loading" ones, eliminate the need to handle protocol processing in the CPU and therefore increase the number of cycles available for computational tasks. If the CPU is busy moving data and handling network protocol processing, it cannot perform computational work, and the overall productivity of the system is severely degraded.
The memory copy overhead includes the resources required to copy data buffers from the network device to kernel memory and then from kernel memory to application memory. This approach requires multiple memory accesses before the data is placed in its final destination. While this is not a major problem for small data transfers, it is a big problem for larger ones. This is where the interconnect's zero-copy capabilities eliminate the memory bandwidth bottleneck, without involving the CPU in the network data transfer.
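As a rough illustration of why the copy path hurts large transfers, the sketch below counts the bytes that cross the memory bus to land one received message. The model is an assumption made for illustration (it ignores caches, headers and partial buffers), and the function name and its parameters are hypothetical.

```python
def memory_traffic_bytes(message_size: int, zero_copy: bool) -> int:
    """Bytes moved over the memory bus to land one received message.

    Assumed first-order model (not taken from the article):
      - the network device always writes the message into memory once;
      - without zero-copy, the CPU then reads the kernel buffer and writes
        the application buffer, adding two more passes over the data.
    """
    device_write = message_size
    if zero_copy:
        # Data lands directly in application memory; no CPU copy.
        return device_write
    kernel_to_app_copy = 2 * message_size  # one read plus one write by the CPU
    return device_write + kernel_to_app_copy


# The two ends of the message-size range mentioned in the article.
for size in (64, 4 * 1024 * 1024):
    copied = memory_traffic_bytes(size, zero_copy=False)
    direct = memory_traffic_bytes(size, zero_copy=True)
    print(f"{size:>8} B message: {copied} B with copies, {direct} B zero-copy")
```

Under this model the ratio between the two paths is constant, but the absolute cost scales with message size: a 64-byte message costs a few hundred bytes of extra traffic, while a 4-megabyte message costs several extra megabytes of memory bandwidth and the CPU time to move them.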

Interconnect bandwidth and latency have traditionally been used as the two metrics for assessing the performance of a system's interconnect fabric. However, these two metrics are typically not sufficient to determine the performance of real-world applications. Typical real-world applications send messages ranging from 64 bytes to 4 megabytes, using not only point-to-point communication but a diverse mixture of communication patterns, including collective and reduction patterns in the case of MPI. In some cases, interconnect vendors create artificial benchmarks, such as message rate, and apply bombastic marketing slogans to them, such as "Hypermessaging". Message rate is yet another single point on the point-to-point bandwidth graph. If the traditional interconnect bandwidth figure indicates the maximum available bandwidth (a single point), message rate indicates the bandwidth for a message size of zero or 2 bytes.
These single points of data give some indication of interconnect performance, but they are far from describing real-world application performance. The interaction of those points, together with others (CPU overhead, zero copy, etc.), determines the overall capability of the connectivity solution.
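To make the "single point" argument concrete, here is a minimal sketch using a simple linear cost model, in which the time to send a message is a fixed latency plus the size divided by the peak bandwidth. The model, the function names and the numbers (2 microseconds, 1 GByte/s) are assumptions chosen for illustration, not measurements of any product; the model also ignores the pipelining of back-to-back messages that real message-rate benchmarks exploit.

```python
def transfer_time(size_bytes: float, latency_s: float, peak_bw: float) -> float:
    """Assumed linear cost model: fixed latency plus size / peak bandwidth."""
    return latency_s + size_bytes / peak_bw


def effective_bandwidth(size_bytes: float, latency_s: float, peak_bw: float) -> float:
    """Bytes per second actually achieved at this message size."""
    return size_bytes / transfer_time(size_bytes, latency_s, peak_bw)


def message_rate(size_bytes: float, latency_s: float, peak_bw: float) -> float:
    """Messages per second at this message size."""
    return 1.0 / transfer_time(size_bytes, latency_s, peak_bw)


# Hypothetical interconnect parameters, for illustration only.
LATENCY_S = 2e-6   # 2 microseconds
PEAK_BW = 1e9      # 1 GByte/s

for size in (2, 64, 4096, 4 * 1024 * 1024):
    bw = effective_bandwidth(size, LATENCY_S, PEAK_BW)
    rate = message_rate(size, LATENCY_S, PEAK_BW)
    # rate * size equals bw: message rate is the same curve in other units.
    print(f"{size:>8} B: {bw / 1e6:9.2f} MByte/s, {rate:12.0f} msg/s")
```

Because message rate times message size equals the bandwidth at that size, quoting the rate for 2-byte messages describes only the far left end of the curve, just as peak bandwidth describes only the far right end. Neither says much about the 64-byte to 4-megabyte range in between.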
The difference between theoretical power and what is actually delivered is measured as processor efficiency. The more CPU cycles are used to get the data out the door by "filling the wire" due to protocol and data transfer inefficiencies, the fewer cycles are available for the application. When comparing the latencies of different interconnects, one needs to pay attention to the interconnect architecture. Choosing between a 1 usec latency "on-loading" interconnect and a 2 usec latency "off-load" solution is similar to choosing between two cars with the same horsepower (i.e. the same CPU). Both engines are capable of 200 miles per hour, but the first car, due to "on-loading", limits the usable engine power to 75 miles per hour (the rest of the engine power must be used for other tasks). The second car has no limitation on the engine, but its wheels can tolerate only 150 miles per hour. Knowledge of the wheels' tolerance (i.e. latency), as a single point of data, is definitely misleading.
There are attempts to present real-world application performance when comparing different interconnects, but in most cases the "comparison" is biased, using different systems and/or conditions, which makes a true comparison difficult. There have been recent cases comparing 10 Gigabit Ethernet to InfiniBand in which the InfiniBand adapters were tested with PCIe x4, which is limited to ~700 MByte/s bandwidth (due to limitations in currently available systems), while the 10 Gigabit Ethernet cards were PCI-X, which is capable of higher bandwidth (~850-900 MByte/s). Other cases compare InfiniBand on PCIe x4 to other interconnects with a PCIe x8 host interface (the only valid conclusion one can draw is that PCIe x8 has more lanes than PCIe x4). Another paper compared QLogic InfiniPath on an Intel 3 GHz CPU-based system to Mellanox InfiniBand on a 2.2 GHz Opteron-based system. Any attempt to compare different interconnects in this manner is deceptive.
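The car analogy can be written down as a tiny calculation: the delivered speed is the smaller of what the engine can still provide and what the wheels tolerate. The function below is a hypothetical sketch of that reasoning. The first car's wheel limit is not given in the text, so the value of 300 miles per hour is an assumption (chosen only so that its wheels look better than the second car's, as a lower latency figure would).

```python
def delivered_speed(engine_mph: float, usable_fraction: float,
                    wheel_limit_mph: float) -> float:
    """Speed actually reached: limited by usable engine power and by the wheels."""
    return min(engine_mph * usable_fraction, wheel_limit_mph)


# Car 1 ("on-loading"): only 75 of 200 mph of engine power is usable.
# The wheel limit of 300 mph is an assumed value, not stated in the article.
on_load = delivered_speed(200, 75 / 200, wheel_limit_mph=300)

# Car 2 ("off-load"): the full engine is available, wheels tolerate 150 mph.
off_load = delivered_speed(200, 1.0, wheel_limit_mph=150)

print(f"on-loading car: {on_load:.0f} mph")
print(f"off-load car:   {off_load:.0f} mph")
```

Judged by wheel tolerance alone, the first car looks like the better choice; judged by delivered speed, the second car is twice as fast. That is the sense in which a lone latency figure misleads.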
 
  
