How to reduce the overhead from cudaStreamSynchronize?

Hi, I am working on a distributed training project. On each GPU device I have a vector of receive buffers (recv_buff), one per GPU, used to receive data from the other devices via point-to-point communication. Once the data is ready, I need to copy it into a vector of output buffers allocated by TensorFlow.
Currently, I implement this copy like so:

CUDA_CALL(cudaMemcpyAsync(output, recv_buff, count, cudaMemcpyDeviceToDevice, copy_stream));
CUDA_CALL(cudaStreamSynchronize(copy_stream));
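In context, the per-peer loop looks roughly like this (a sketch; `outputs`, `recv_buffs`, `counts`, `num_gpus`, and `copy_stream` stand in for my actual variables):

```cpp
// Sketch of the current pattern: one copy plus one blocking sync per peer.
for (int peer = 0; peer < num_gpus; ++peer) {
  CUDA_CALL(cudaMemcpyAsync(outputs[peer], recv_buffs[peer], counts[peer],
                            cudaMemcpyDeviceToDevice, copy_stream));
  // Blocking the host here serializes everything: the next copy cannot
  // even be enqueued until this one has finished.
  CUDA_CALL(cudaStreamSynchronize(copy_stream));
}
```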

However, profiling shows that the overhead is very large: with 64 GPU processes in total, I have to call cudaMemcpyAsync 64 times and cudaStreamSynchronize 64 times.

Is there any way to optimize my implementation?

@Robert_Crovella Hi Robert, do you have any suggestions?

Use NCCL.
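For the exchange itself, NCCL's grouped point-to-point API can fuse all of the per-peer transfers into one operation on a single stream, so only one synchronization is needed afterwards. A minimal sketch, assuming `comm` and `stream` are an already-initialized `ncclComm_t` and `cudaStream_t`, NCCL >= 2.7 (which introduced ncclSend/ncclRecv), and placeholder buffer/count names:

```cpp
#include <nccl.h>

// Sketch: exchange a buffer with every peer in a single grouped call.
// send_buffs, recv_buffs, counts, num_gpus, my_rank are placeholders.
ncclGroupStart();
for (int peer = 0; peer < num_gpus; ++peer) {
  if (peer == my_rank) continue;
  ncclSend(send_buffs[peer], counts[peer], ncclFloat, peer, comm, stream);
  ncclRecv(recv_buffs[peer], counts[peer], ncclFloat, peer, comm, stream);
}
ncclGroupEnd();  // all sends/recvs are launched together as one operation
CUDA_CALL(cudaStreamSynchronize(stream));  // one sync instead of 64
```

If every rank exchanges equal-sized buffers with every other rank, ncclAllGather may express the same pattern even more directly.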

Perhaps more generally, you want to think about overlapping communication with computation.
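Concretely, one option is to enqueue all the device-to-device copies on one stream and then, instead of blocking the host at all, make the compute stream wait on them with an event. A sketch with placeholder names (`outputs`, `recv_buffs`, `counts`, `num_gpus`, `copy_stream`, `compute_stream`):

```cpp
// Enqueue every copy asynchronously, then let the compute stream wait on
// them via an event; the host never calls cudaStreamSynchronize here.
cudaEvent_t copies_done;
CUDA_CALL(cudaEventCreateWithFlags(&copies_done, cudaEventDisableTiming));

for (int peer = 0; peer < num_gpus; ++peer) {
  CUDA_CALL(cudaMemcpyAsync(outputs[peer], recv_buffs[peer], counts[peer],
                            cudaMemcpyDeviceToDevice, copy_stream));
}
CUDA_CALL(cudaEventRecord(copies_done, copy_stream));

// Work launched on compute_stream after this point will start only once
// the copies have finished, so copies and unrelated compute can overlap.
CUDA_CALL(cudaStreamWaitEvent(compute_stream, copies_done, 0));
```

The key point is that the dependency is enforced on the GPU rather than by blocking the CPU, which removes the 64 host-side synchronizations entirely.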