DGX A100: bandwidth drops when all 8 IB network cards run ib_write_bw at the same time

I have two DGX A100 machines, each with eight 200G single-port IB network cards. When I use the following command to test the bandwidth between the two machines on several cards at the same time, I do not get the result I expected. I thought all eight network cards could reach 200Gb/s simultaneously, but the actual rate is much lower.
Both machines are connected to the same QM8790 (using 200G AOC cables).
Client: run_perftest_multi_devices -d mlx5_0,mlx5_1,mlx5_2 -C 0,1,2 -cmd "ib_write_bw --report_gbits" --remote 172.32.255.21


I think the problem may be in how I am using the command, so I would appreciate some guidance on the correct way to test the bandwidth of all 16 IB network cards on the two DGXs at the same time.

Please help.

Hi,

Have you tried pinning the HCA devices to the nearest cores?

Please use ‘mst status -v’ to see each device’s local NUMA node;

Check which cores belong to that NUMA node using lscpu (or similar);

Use the closest core(s) for each HCA.

-c, --cores Pin each device to a specific core using taskset

e.g. --cores 0,1 pins the dev1 command to core 0 and the dev2 command to core 1
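
The steps above could be scripted roughly as follows. This is only a sketch, not part of perftest: it assumes the standard Linux sysfs layout, where /sys/class/infiniband/<device>/device/local_cpulist lists the cores local to that HCA, and the helper names (parse_cpulist, pick_cores, pinning_args) are made up for illustration. It picks a different local core for each device, so two HCAs on the same NUMA node do not end up on the same core.

```python
from pathlib import Path


def parse_cpulist(text):
    """Expand a kernel cpulist string such as "48-63,176-191" into a list of ints."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue  # an empty string means no CPUs were reported
        lo, _, hi = part.partition("-")
        cpus.extend(range(int(lo), int(hi or lo) + 1))
    return cpus


def pick_cores(devices, sysfs_root="/sys/class/infiniband"):
    """Return one distinct NUMA-local core per device, in device order.

    sysfs_root is a parameter only so the logic can be tried without hardware.
    """
    used = set()
    cores = []
    for dev in devices:
        cpulist = (Path(sysfs_root) / dev / "device" / "local_cpulist").read_text()
        free = [c for c in parse_cpulist(cpulist) if c not in used]
        if not free:
            raise RuntimeError(f"no free local core left for {dev}")
        used.add(free[0])
        cores.append(free[0])
    return cores


def pinning_args(devices, sysfs_root="/sys/class/infiniband"):
    """Format the device and core lists as matching -d and -c arguments."""
    cores = pick_cores(devices, sysfs_root)
    return "-d {} -c {}".format(",".join(devices), ",".join(str(c) for c in cores))
```

Calling pinning_args(["mlx5_0", "mlx5_1", "mlx5_2"]) on the DGX would then give a -d/-c pair to paste into the multi-device command. It is still worth comparing the chosen cores against the output of ‘mst status -v’ and lscpu before trusting them.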
