How to reduce the overhead from cudaStreamSynchronize?

Hi, I am working on a distributed training project. On each GPU device I have a vector of receive buffers (recv_buff), one per GPU, used to receive data from the other devices via point-to-point communication. Once the data is ready, I need to copy it into a vector of output buffers allocated by TensorFlow.
Currently, I implement this copy like so:

CUDA_CALL(cudaMemcpyAsync(output, recv_buff, count, cudaMemcpyDeviceToDevice, copy_stream));
CUDA_CALL(cudaStreamSynchronize(copy_stream));
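In context, the per-peer loop looks roughly like this (a sketch; `outputs`, `recv_buffs`, `counts`, `num_gpus`, and `copy_stream` stand in for my actual variables):

```cpp
// Sketch of the current pattern: one copy plus one blocking sync per peer.
for (int peer = 0; peer < num_gpus; ++peer) {
  CUDA_CALL(cudaMemcpyAsync(outputs[peer], recv_buffs[peer], counts[peer],
                            cudaMemcpyDeviceToDevice, copy_stream));
  // Blocking the host here serializes everything: the next copy cannot
  // even be enqueued until this one has finished.
  CUDA_CALL(cudaStreamSynchronize(copy_stream));
}
```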

However, profiling shows that the overhead is very large: with 64 GPU processes in total, I have to call cudaMemcpyAsync 64 times and cudaStreamSynchronize 64 times.

Is there any way to optimize my implementation?

@Robert_Crovella Hi Robert, do you have any suggestions?

Use NCCL.
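For the exchange itself, NCCL's grouped point-to-point API can fuse all of the per-peer transfers into one operation on a single stream, so only one synchronization is needed afterwards. A minimal sketch, assuming `comm` and `stream` are an already-initialized `ncclComm_t` and `cudaStream_t`, NCCL >= 2.7 (which introduced ncclSend/ncclRecv), and placeholder buffer/count names:

```cpp
#include <nccl.h>

// Sketch: exchange a buffer with every peer in a single grouped call.
// send_buffs, recv_buffs, counts, num_gpus, my_rank are placeholders.
ncclGroupStart();
for (int peer = 0; peer < num_gpus; ++peer) {
  if (peer == my_rank) continue;
  ncclSend(send_buffs[peer], counts[peer], ncclFloat, peer, comm, stream);
  ncclRecv(recv_buffs[peer], counts[peer], ncclFloat, peer, comm, stream);
}
ncclGroupEnd();  // all sends/recvs are launched together as one operation
CUDA_CALL(cudaStreamSynchronize(stream));  // one sync instead of 64
```

If every rank exchanges equal-sized buffers with every other rank, ncclAllGather may express the same pattern even more directly.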

Perhaps more generally, you want to think about overlapping communication with computation.
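Concretely, one option is to enqueue all the device-to-device copies on one stream and then, instead of blocking the host at all, make the compute stream wait on them with an event. A sketch with placeholder names (`outputs`, `recv_buffs`, `counts`, `num_gpus`, `copy_stream`, `compute_stream`):

```cpp
// Enqueue every copy asynchronously, then let the compute stream wait on
// them via an event; the host never calls cudaStreamSynchronize here.
cudaEvent_t copies_done;
CUDA_CALL(cudaEventCreateWithFlags(&copies_done, cudaEventDisableTiming));

for (int peer = 0; peer < num_gpus; ++peer) {
  CUDA_CALL(cudaMemcpyAsync(outputs[peer], recv_buffs[peer], counts[peer],
                            cudaMemcpyDeviceToDevice, copy_stream));
}
CUDA_CALL(cudaEventRecord(copies_done, copy_stream));

// Work launched on compute_stream after this point will start only once
// the copies have finished, so copies and unrelated compute can overlap.
CUDA_CALL(cudaStreamWaitEvent(compute_stream, copies_done, 0));
```

The key point is that the dependency is enforced on the GPU rather than by blocking the CPU, which removes the 64 host-side synchronizations entirely.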