cudaMemcpy speed between two A100 GPUs is slow

When training a deep learning network on a single A100 GPU, training works normally. But when training on two A100 GPUs at the same time, the Python process freezes.
I used simpleP2P_PingPong.cu to test the speed of cudaMemcpy and cudaMemcpyAsync between the two GPUs; the results are shown below:


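For context, a very rough sketch of the kind of ping-pong bandwidth measurement that sample performs is below. This is not the actual simpleP2P_PingPong.cu code; the buffer size, iteration count, and device IDs (0 and 1) are illustrative.

```cpp
// p2p_pingpong_sketch.cu -- rough illustration only, not the actual
// simpleP2P_PingPong.cu; buffer size and iteration count are arbitrary.
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

#define CHECK(call)                                                        \
    do {                                                                   \
        cudaError_t err_ = (call);                                         \
        if (err_ != cudaSuccess) {                                         \
            std::printf("CUDA error: %s\n", cudaGetErrorString(err_));     \
            return 1;                                                      \
        }                                                                  \
    } while (0)

int main() {
    const size_t bytes = 64ull << 20;  // 64 MiB per copy (illustrative)
    const int iters = 100;

    // Enable peer access in both directions if the platform supports it.
    int can01 = 0, can10 = 0;
    CHECK(cudaDeviceCanAccessPeer(&can01, 0, 1));
    CHECK(cudaDeviceCanAccessPeer(&can10, 1, 0));
    std::printf("P2P access 0->1: %d, 1->0: %d\n", can01, can10);

    float *buf0 = nullptr, *buf1 = nullptr;
    CHECK(cudaSetDevice(0));
    CHECK(cudaMalloc(&buf0, bytes));
    if (can01) CHECK(cudaDeviceEnablePeerAccess(1, 0));
    CHECK(cudaSetDevice(1));
    CHECK(cudaMalloc(&buf1, bytes));
    if (can10) CHECK(cudaDeviceEnablePeerAccess(0, 0));

    // Ping-pong: copy GPU0 -> GPU1, then GPU1 -> GPU0, repeatedly.
    // cudaMemcpyDefault lets the runtime pick the path (direct P2P if
    // enabled, otherwise staged through host memory).
    CHECK(cudaSetDevice(0));
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) {
        CHECK(cudaMemcpy(buf1, buf0, bytes, cudaMemcpyDefault));
        CHECK(cudaMemcpy(buf0, buf1, bytes, cudaMemcpyDefault));
    }
    CHECK(cudaDeviceSynchronize());
    auto t1 = std::chrono::steady_clock::now();

    double seconds = std::chrono::duration<double>(t1 - t0).count();
    double gigabytes = 2.0 * iters * bytes / 1e9;  // total data moved
    std::printf("Average bandwidth: %.2f GB/s\n", gigabytes / seconds);

    CHECK(cudaFree(buf1));
    CHECK(cudaFree(buf0));
    return 0;
}
```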
The result of the command “nvidia-smi topo -m” is shown below:

My server has two GPUs and two CPUs. Each GPU is attached to a different CPU.
Why is the cudaMemcpy bandwidth between the two A100 GPUs only 0.25 GB/s? How can I solve this problem?
Any help would be greatly appreciated.

I would first encourage you to make sure that your server has the latest firmware installed.

This behavior is a function of the server design as well as the CPUs involved. Having each PCIe GPU attached to a separate CPU is not optimal for this type of work: every copy between the GPUs then has to cross the CPU-to-CPU interconnect, and may be staged through system memory, which sharply limits bandwidth.

If you want the best possible performance:

  1. Put both GPUs in the system so they are attached to the same CPU, preferably in adjacent slots for item 2 below.
  2. Purchase and install the NVLink bridges (a set of 3) for these GPUs.

Whether or not the NVLink bridges can be used (i.e., whether there is mechanical clearance) is a function of your server design. Not all servers that accept A100 GPUs can meet these requirements (adjacent slots, with clearance for NVLink bridges).
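Once the GPUs hang off the same CPU (and, ideally, the NVLink bridges are installed), “nvidia-smi topo -m” should show an NV# link between them rather than SYS, and a quick check like the sketch below (device IDs 0 and 1 assumed) can confirm that CUDA sees a usable peer path:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Can GPU 0 access GPU 1's memory directly (peer-to-peer)?
    int access = 0;
    cudaDeviceCanAccessPeer(&access, 0, 1);
    std::printf("GPU0 can access GPU1 directly: %s\n", access ? "yes" : "no");

    // Relative quality of the P2P path between the two devices
    // (lower rank indicates a better path).
    int rank = 0;
    cudaDeviceGetP2PAttribute(&rank, cudaDevP2PAttrPerformanceRank, 0, 1);
    std::printf("P2P performance rank 0->1: %d\n", rank);
    return 0;
}
```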