Hi, I am working on a distributed training project. On each GPU device, I have a vector of
recv_buff buffers used to receive data from the other devices via point-to-point communication; the vector's size equals the total number of GPUs. Once the data is ready, I need to copy it into a vector of output buffers allocated by TensorFlow.
Currently, I implement this like so (for each buffer):

    CUDA_CALL(cudaMemcpyAsync(output, recv_buff, count,
                              cudaMemcpyDeviceToDevice, *copy_stream));
    CUDA_CALL(cudaStreamSynchronize(*copy_stream));
However, after profiling, I found that the overhead is very large: with 64 GPU processes in total, I have to call cudaMemcpyAsync 64 times and cudaStreamSynchronize 64 times.
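To make the pattern concrete, here is a minimal sketch of the per-peer loop described above. All names (num_gpus, outputs, recv_buffs, counts, copy_streams, CUDA_CALL) are illustrative assumptions about my setup, not the exact code:

```cuda
// Hypothetical sketch of the current pattern: one async copy plus one
// stream synchronization per peer GPU, i.e. num_gpus iterations.
for (int peer = 0; peer < num_gpus; ++peer) {
  CUDA_CALL(cudaMemcpyAsync(outputs[peer], recv_buffs[peer],
                            counts[peer],
                            cudaMemcpyDeviceToDevice,
                            copy_streams[peer]));
  // Synchronizing inside the loop stalls the host once per peer,
  // which is where the profiled overhead accumulates.
  CUDA_CALL(cudaStreamSynchronize(copy_streams[peer]));
}
```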
Is there any way to optimize my implementation?