Ethernet vs. Infiniband

I am trying to understand the performance difference between Ethernet and Infiniband.

I am testing a program that requires a lot of communication between GPUs on different nodes. I tested it on two clusters. One of them uses Infiniband as the communication backbone, while the other uses Ethernet. My program ran much slower on the one with Ethernet. Will upgrading the Ethernet to 10GbE help? Or is the communication more sensitive to the low latency of Infiniband?

Where two-way communication is required (i.e. pretty much all practical use cases in compute applications), high latency will have a negative impact on effective throughput. This is one reason why low-latency interconnects, and Infiniband in particular, dominate at the high end of the supercomputer space.
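To make the latency point concrete, here is a rough back-of-the-envelope sketch using the simple model "transfer time = latency + message size / bandwidth". The function name and all the numbers (a 10 Gbit/s link, 50 microseconds vs. 2 microseconds of latency) are my own assumptions chosen purely for illustration. They are not measurements from either of your clusters or from any specific hardware.

```python
def effective_throughput(message_bytes, latency_s, bandwidth_bytes_per_s):
    """Effective throughput in bytes/s for a single message.

    Simple model: time = fixed latency + message size / link bandwidth.
    """
    transfer_time = latency_s + message_bytes / bandwidth_bytes_per_s
    return message_bytes / transfer_time


# Assumed, illustrative values only (not measured on any real system).
LINK_BANDWIDTH = 10e9 / 8   # 10 Gbit/s expressed in bytes per second
HIGH_LATENCY = 50e-6        # hypothetical higher-latency interconnect
LOW_LATENCY = 2e-6          # hypothetical lower-latency interconnect

# Same bandwidth for both links, so only the latency differs.
for size in (1024, 65536, 16 * 1024 * 1024):
    slow = effective_throughput(size, HIGH_LATENCY, LINK_BANDWIDTH)
    fast = effective_throughput(size, LOW_LATENCY, LINK_BANDWIDTH)
    print(f"{size:>10d} B: {slow / 1e6:8.1f} MB/s vs {fast / 1e6:8.1f} MB/s")
```

Under this model, small messages are dominated by the latency term, so the two hypothetical links differ widely even at identical bandwidth, while very large messages approach the link bandwidth in both cases. If your application mostly exchanges small messages, a bandwidth upgrade alone may not help much.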

Any real life case will likely also involve cost (or cost effectiveness) as a decision motivator, so as a first step you might want to profile your application on the two clusters you mentioned, paying particular attention to communication patterns. There may be various node characteristics that significantly impact the relative performance of the two clusters (e.g. amount of system memory per node, CPU/GPU balance in a node), so make sure you control for such effects as tightly as possible instead of chalking up performance differences simply to the characteristics of the interconnect.

This is outside my area of expertise, but the following (slightly dated) comparison of 10GbE and IB by the HPC Advisory Council shows how the impact of lower-latency IB can differ substantially by application, which is why it is important to understand the characteristics of your app(s):

As far as I know, the HPC Advisory Council is an organization that counts pretty much all major equipment suppliers in the HPC space (including NVIDIA) among its members, so this likely provides a fair comparison.