cudaMemcpy bandwidth between two A100 GPUs is slow

When I train a deep learning neural network on a single A100 GPU, training works normally. But when I train on two A100 GPUs at the same time, the Python process freezes.
I used a test program to measure the speed of cudaMemcpy and cudaMemcpyAsync; the results are shown below:

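For reference, the kind of device-to-device measurement described above can be sketched roughly as follows (a minimal illustration, not the exact test program used; the 256 MiB transfer size and device IDs 0/1 are assumptions):

```cpp
// Sketch: time a peer copy from GPU 0 to GPU 1 using CUDA events.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 256ull << 20;  // 256 MiB (illustrative size)
    void *src, *dst;

    cudaSetDevice(0);
    cudaMalloc(&src, bytes);
    cudaSetDevice(1);
    cudaMalloc(&dst, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up copy, then a timed copy bracketed by events.
    cudaMemcpyPeer(dst, 1, src, 0, bytes);
    cudaEventRecord(start);
    cudaMemcpyPeer(dst, 1, src, 0, bytes);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);
    // bytes / (ms * 1e6) converts bytes-per-millisecond to GB/s.
    printf("GPU0 -> GPU1: %.2f GB/s\n", bytes / (ms * 1e6));
    return 0;
}
```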
The output of the command “nvidia-smi topo -m” is shown below:

My server has two GPUs and two CPUs, with each GPU attached to a different CPU.
Why is the cudaMemcpy bandwidth between the two A100 GPUs only 0.25 GB/s, and how can I solve this problem?
I would really appreciate any help.

I would first encourage you to make sure that your server has the latest firmware installed.

This behavior is a function of the server design as well as the CPUs involved. Having each PCIe GPU attached to a separate CPU is not optimal for this type of work: every device-to-device copy must traverse the inter-socket (CPU-to-CPU) link, which limits the achievable bandwidth.

If you want the best possible performance:

  1. Put both GPUs in the system so they are attached to the same CPU, preferably in adjacent slots for item 2 below.
  2. Purchase and install the NVLink bridges (a set of 3) for these GPUs.

Whether or not the NVLink bridges can be used (i.e., whether there is mechanical clearance) is a function of your server design. Not all servers that accept A100 GPUs can meet these requirements (adjacent slots, with clearance for the NVLink bridges).
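Once the GPUs are moved under one CPU (or bridged with NVLink), it is worth verifying that a direct peer path is actually available. A hedged sketch of that check, using the standard CUDA runtime peer-access calls (device IDs 0/1 assumed):

```cpp
// Sketch: query and enable peer-to-peer access between GPU 0 and GPU 1.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int canAccess01 = 0, canAccess10 = 0;
    cudaDeviceCanAccessPeer(&canAccess01, 0, 1);
    cudaDeviceCanAccessPeer(&canAccess10, 1, 0);
    printf("GPU0 -> GPU1 P2P: %s\n", canAccess01 ? "yes" : "no");
    printf("GPU1 -> GPU0 P2P: %s\n", canAccess10 ? "yes" : "no");

    if (canAccess01 && canAccess10) {
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);  // second argument (flags) must be 0
        cudaSetDevice(1);
        cudaDeviceEnablePeerAccess(0, 0);
        // Subsequent cudaMemcpyPeer / device-to-device copies can now take
        // the direct path (NVLink if bridged) instead of staging through
        // host memory.
    }
    return 0;
}
```

If `cudaDeviceCanAccessPeer` reports 0 in either direction, copies fall back to staging through system memory, which is consistent with the low bandwidth observed across sockets.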