ConnectX-7 NIC in DGX Spark

So far I’ve only been able to get just over 18 GB/s across both 200Gb links with NCCL all_gather_perf (from nccl-tests), using four-rail RoCE v2. I’m not sure why you would use LACP or XOR bonding here: I think getting the lowest latency with RoCE v2 is really important, and RDMA multi-rail should outperform bonding the links.

To me it’s clear (whether you’re getting 18 or 26 GB/s, etc.) that the ConnectX-7 200Gb NICs are bottlenecked by the SoC PCIe design (x4 maximum width per link). I think the second (P2) pair of logical links is a hack to try to get more bandwidth, but they seem to run much slower when using multi-rail. They sit on the #2 bus, which I understand is at the GPU end of the C2C link rather than the CPU end, so the added latency of traversing the bus just destroys performance (e.g. in my tests NCCL refuses to use them in the ring algorithm unless I get really hacky).
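
To put rough numbers on that bottleneck, here is a back-of-the-envelope sketch. It assumes the x4 links run at PCIe Gen5 speed (32 GT/s per lane), which I have not confirmed, and it ignores protocol overhead, so treat the result as an upper bound. The function names are just made up for illustration.

```shell
# Usable one-direction bandwidth of a PCIe link in GB/s.
# $1 = transfer rate per lane in GT/s, $2 = lane count.
# Assumes 128b/130b encoding (PCIe Gen3 through Gen5) and ignores
# TLP/DLLP protocol overhead, so real throughput will be lower.
pcie_gbytes_per_s() {
    awk -v gt="$1" -v lanes="$2" 'BEGIN { printf "%.2f\n", gt * lanes * 128 / 130 / 8 }'
}

# Convert an Ethernet line rate in Gb/s to GB/s.
link_gbytes_per_s() {
    awk -v gbit="$1" 'BEGIN { printf "%.2f\n", gbit / 8 }'
}

pcie_gbytes_per_s 32 4    # Gen5 x4 (ASSUMED generation): prints 15.75
link_gbytes_per_s 200     # one 200Gb port: prints 25.00
```

If that assumption holds, a single x4 link cannot feed a 200Gb port at line rate, which would fit with the idea that the extra logical links exist to add PCIe lanes.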

Something interesting is that RDMA latency between two Sparks is around 1-1.5 ms by default, which is really horrible. This seems to be related to the power-management C-states, which are set to be really sleepy by default on these units. I disabled the idle states and switched to the performance governor with:

sudo cpupower idle-set -D 0
sudo cpupower frequency-set -g performance

Disabling the idle states and switching to the performance governor brings ping times between the Sparks down to around 0.05 ms, which is a substantial improvement.
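
If you want to check what state a unit is actually in, the standard Linux sysfs entries for cpuidle and cpufreq can be read directly. This is a minimal sketch; the helper names are my own, and I'm assuming the Spark exposes the usual `/sys/devices/system/cpu/cpuN` layout.

```shell
# Print each cpuidle state of one CPU with its exit latency and disable flag.
# The optional argument is the CPU's sysfs directory (defaults to cpu0).
show_idle_states() {
    cpu_dir="${1:-/sys/devices/system/cpu/cpu0}"
    for state in "$cpu_dir"/cpuidle/state*; do
        [ -d "$state" ] || continue
        printf '%s latency_us=%s disabled=%s\n' \
            "$(cat "$state/name")" "$(cat "$state/latency")" "$(cat "$state/disable")"
    done
}

# Print the active cpufreq governor for the same CPU.
show_governor() {
    cat "${1:-/sys/devices/system/cpu/cpu0}/cpufreq/scaling_governor"
}
```

Run `show_idle_states` and `show_governor` before and after the cpupower commands to confirm they took effect. `sudo cpupower idle-set -E` re-enables all idle states if you want to go back. As far as I know these settings do not survive a reboot.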

Before and after:

ceradmin@sean1:~$ ping -I enp1s0f0np0 192.168.102.2 -c 3
PING 192.168.102.2 (192.168.102.2) from 192.168.102.1 enP2p1s0f1np0: 56(84) bytes of data.
64 bytes from 192.168.102.2: icmp_seq=1 ttl=64 time=1.34 ms
64 bytes from 192.168.102.2: icmp_seq=2 ttl=64 time=0.915 ms
64 bytes from 192.168.102.2: icmp_seq=3 ttl=64 time=0.984 ms

ceradmin@sean1:~$ ping -I enp1s0f0np0 192.168.102.2 -c 4
PING 192.168.102.2 (192.168.102.2) from 192.168.102.1 enP2p1s0f0np0: 56(84) bytes of data.
64 bytes from 192.168.102.2: icmp_seq=1 ttl=64 time=0.108 ms
64 bytes from 192.168.102.2: icmp_seq=2 ttl=64 time=0.028 ms
64 bytes from 192.168.102.2: icmp_seq=3 ttl=64 time=0.026 ms
64 bytes from 192.168.102.2: icmp_seq=4 ttl=64 time=0.022 ms
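
To compare runs like the two above without eyeballing them, the per-packet times can be averaged with a small awk filter. This is just a sketch: `avg_rtt_ms` is a made-up helper name, and it assumes the usual Linux ping reply format with `time=<value> ms`.

```shell
# Read ping output on stdin and print the mean round-trip time in ms.
# Only reply lines containing "time=" are counted; header and summary
# lines are ignored. Prints nothing if no replies were seen.
avg_rtt_ms() {
    awk -F'time=' '
        /time=/ { split($2, parts, " "); sum += parts[1]; n++ }
        END     { if (n) printf "%.3f\n", sum / n }
    '
}

# Usage (interface and address are the ones from this post):
#   ping -I enp1s0f0np0 192.168.102.2 -c 4 | avg_rtt_ms
```

The first packet after idle tends to be the slow one, so a larger `-c` count gives a fairer average.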

I’m very much still learning here (as I assume we all are), so I may have made some incorrect assumptions. I’m following this thread closely.
