MCX516A-CCAT low performance on Supermicro X10DRH-CT


I’m running roughly 8 machines through a single Mellanox Onyx MSN2100. Some are X11DPH-T machines with dual Xeon Gold 6144 CPUs, and a few are X10DRH-CT machines with Xeon E5-2640 v3 CPUs.

I’m running Debian 10 on all of them, using upstream packages for RDMA/RoCE and drivers. All devices have the same settings on the switch, and every NIC is running the same configuration. I’ve tested each fibre terminator and cable to make sure the issue doesn’t originate from them, and they can all pull near line rate (100 Gb/s) when they are connected to the X11 systems.

Now, the rub. Whenever I run a performance test (iperf or ib_send_bw), I get slow speeds if any of the tested servers is an X10 board.

ib_send_bw gets a VERY consistent 27 Gb/s whenever at least one of the two endpoints is an X10 machine.

On the X11 machines with identical settings, it’s getting 97 Gb/s.

Similarly, iperf gets around 15 Gb/s with a single worker, and the aggregate across multiple workers tops out at around 30 Gb/s.

I’m fairly perplexed by the massive performance difference between the systems, with the X10 series managing only about 30% of the X11’s numbers. I’ve turned off hyperthreading, made sure the IRQ affinity was spread across the NUMA node the card is on, and made sure the CPU has all the clocks it needs, but I still can’t break past 27 Gb/s. I even profiled ib_send_bw with perf to see whether it was stalling in a particular routine, but the only thing I learned was that the high-performing systems spent a LOT more time in pthread_spin_lock and unlock than the slower systems (108973 samples vs. 5116 samples).
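For anyone wanting to repeat the NUMA/IRQ placement check on their own hardware, it can be scripted roughly as follows. This is only a minimal sketch and not the exact procedure I used: the function names `parse_cpulist` and `nic_numa_report` are made up for illustration, the interface name is a placeholder, and it assumes the standard Linux sysfs/procfs layout (`numa_node`, `local_cpulist` and `msi_irqs` under the NIC's PCI device, and `smp_affinity_list` under `/proc/irq`).

```python
from pathlib import Path


def parse_cpulist(text):
    """Expand a kernel CPU list such as '0-3,8,10-11' into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def nic_numa_report(iface, sys_root="/sys", proc_root="/proc"):
    """Report the NIC's NUMA node and any of its IRQs allowed on off-node CPUs.

    sys_root and proc_root are parameters only so the function can be pointed
    at a copy of the tree; on a live system the defaults are what you want.
    """
    dev = Path(sys_root) / "class" / "net" / iface / "device"
    node = int((dev / "numa_node").read_text().strip())
    local_cpus = set(parse_cpulist((dev / "local_cpulist").read_text()))

    # Each entry in msi_irqs is named after an IRQ number owned by the device.
    msi_dir = dev / "msi_irqs"
    irqs = sorted(int(p.name) for p in msi_dir.iterdir()) if msi_dir.is_dir() else []

    off_node = {}
    for irq in irqs:
        aff = Path(proc_root) / "irq" / str(irq) / "smp_affinity_list"
        if not aff.exists():
            continue
        stray = sorted(set(parse_cpulist(aff.read_text())) - local_cpus)
        if stray:
            # This IRQ may be serviced by CPUs that are not local to the NIC.
            off_node[irq] = stray

    return {
        "numa_node": node,
        "local_cpus": sorted(local_cpus),
        "irqs": irqs,
        "off_node": off_node,
    }
```

Calling `nic_numa_report("<interface>")` with the real interface name returns a dictionary; an empty `off_node` entry means every IRQ of the card is restricted to CPUs on the card's own NUMA node. A `numa_node` of -1 means the kernel did not report a node for the device.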

What else can I do to troubleshoot this performance loss and hopefully figure out how to get the performance I’d expect out of the X10 systems? I’ve about hit the end of my expertise/supply of readily available support documentation on the web.

Thanks for your time!