Assume I have a large number (dozens to hundreds) of consecutive H2D (Host-to-Device) cudaMemcpyAsync operations, along with concurrent D2H (Device-to-Host) cudaMemcpyAsync operations. Each copy operation involves the same number of bytes, but the amount of data per copy is small, insufficient to fully utilize the memory bandwidth and not enough to hide the launch overhead. Additionally, the source and destination addresses are not sequential, meaning that it’s not possible to combine these copies into a single cudaMemcpyAsync call.
Is there a way to coalesce these copy operations and leverage the DMA copy engine so that compute kernels, D2H, and H2D transfers can overlap efficiently?
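For reference, a minimal sketch of the pattern described above (all names and the one-stream-per-direction layout are illustrative assumptions):

```cuda
#include <cuda_runtime.h>

// Many small, non-contiguous async copies issued per direction.
// Each call moves only a few bytes, so per-call launch overhead dominates
// and the copy engines are underutilized.
void issue_copies(int** d_dst, int** h_src,   // irregular per-chunk pointers
                  int** h_dst, int** d_src,
                  int n_chunks, size_t bytes_per_chunk,
                  cudaStream_t h2d_stream, cudaStream_t d2h_stream)
{
    for (int i = 0; i < n_chunks; ++i) {
        cudaMemcpyAsync(d_dst[i], h_src[i], bytes_per_chunk,
                        cudaMemcpyHostToDevice, h2d_stream);
        cudaMemcpyAsync(h_dst[i], d_src[i], bytes_per_chunk,
                        cudaMemcpyDeviceToHost, d2h_stream);
    }
}
```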
Thank you for your reply.
However, I believe that launching an equal number of D2D (Device-to-Device) cudaMemcpyAsync operations would incur the same overhead. Does the "gather all data in a single buffer" approach you mentioned refer to doing that?
Are these data chunks equally spaced? Multiple same-size data chunks with fixed spacing can be copied in a single cudaMemcpy2DAsync call. The spacing can be different on the host and device sides.
2D copies can be quite inefficient in their use of physical memory, so I would expect at best incremental performance gains if you go down that route.
If the spacing of the data chunks is irregular, the calls cannot be combined this way. If this portion is crucial to performance, re-think the design at a higher conceptual level, e.g. the algorithmic level.
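To illustrate the equal-spacing case: cudaMemcpy2DAsync treats each chunk as one "row" of a 2D copy, and the (possibly different) host and device pitches encode the spacing. A minimal sketch, with assumed chunk size and strides:

```cuda
#include <cuda_runtime.h>

// Sketch: copy 100 chunks of 256 bytes each in one call. The chunks are
// assumed to start every 1024 bytes on the host and every 512 bytes on
// the device; the pitch parameters express these strides.
void strided_h2d(char* d_base, const char* h_base, cudaStream_t stream)
{
    const size_t chunk_bytes = 256;   // width:  bytes per chunk (per "row")
    const size_t n_chunks    = 100;   // height: number of chunks ("rows")
    const size_t h_pitch     = 1024;  // stride between chunk starts on the host
    const size_t d_pitch     = 512;   // stride between chunk starts on the device
    cudaMemcpy2DAsync(d_base, d_pitch, h_base, h_pitch,
                      chunk_bytes, n_chunks,
                      cudaMemcpyHostToDevice, stream);
}
```

Note that the pitch must be at least as large as the width, and only the width bytes of each row are actually transferred.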
Thank you for your response. In my scenario, the source and destination pointers of these data blocks are not equally spaced; they are irregular and change dynamically during runtime. I will reconsider my design based on your suggestions. Thank you again for your help.
With combined buffers, you need only a single cudaMemcpy to transfer all data, and a kernel to copy data from the combined buffer to the individual pointers and vice-versa.
cuda::std::array<int*, 100> h_ptr_array; //assume 1 int per buffer
cuda::std::array<int*, 100> d_ptr_array;
//copy all h_ptrs to d_ptrs
std::vector<int> h_contiguous(100);
for(int i = 0; i < 100; i++) h_contiguous[i] = h_ptr_array[i][0];
int* d_contiguous; cudaMalloc(&d_contiguous, sizeof(int) * 100);
cudaMemcpy(d_contiguous, h_contiguous.data(), sizeof(int) * 100, cudaMemcpyHostToDevice);
copy_kernel<<<...>>>(d_ptr_array, d_contiguous); // copy data from the combined buffer to the individual buffers
If all host pointers point to pinned host memory, you don't even need a memcpy; a kernel can directly copy between h_ptr_array and d_ptr_array.
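A hypothetical copy_kernel matching the snippet above could look like this: one thread per buffer scatters from the contiguous staging buffer to the individual device pointers (1 int per buffer, as assumed above):

```cuda
#include <cuda/std/array>

// Scatter: element i of the contiguous staging buffer goes to the i-th
// target buffer. cuda::std::array is trivially copyable, so it can be
// passed to the kernel by value.
__global__ void copy_kernel(cuda::std::array<int*, 100> d_ptr_array,
                            const int* d_contiguous)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < 100) {
        d_ptr_array[i][0] = d_contiguous[i];
    }
}
// Launch example: copy_kernel<<<1, 128>>>(d_ptr_array, d_contiguous);
```

For the D2H direction the roles reverse: a gather kernel collects the scattered device data into d_contiguous, followed by a single cudaMemcpy back to the host.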
Thank you. This is a good approach, especially if copy_kernel<<<...>>>(d_ptr_array, d_contiguous) can be implemented efficiently. I will try it out later. Thank you again for your help.