Zero-copy from host to device decreases cudaMemcpyAsync device-to-host performance

Hi all,

I have run into a performance problem when running a zero-copy kernel that transfers data from the host to the device concurrently with DMA copies (cudaMemcpyAsync) from device to host. More specifically, if I run a device-to-host cudaMemcpyAsync by itself, or even concurrently with another cudaMemcpyAsync transferring data in the opposite direction, each transfer shows ~12 GB/s in the Visual Profiler. However, if I replace the host-to-device cudaMemcpyAsync with a zero-copy kernel, the device-to-host cudaMemcpyAsync drops to ~8-9 GB/s.

Here is the example code:

__global__ void zeroCopyHostToDevice(unsigned int* d_odata, const unsigned int* __restrict__ h_idata, unsigned int memSize32)
{
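    // Each thread copies one 32-bit word straight from mapped (zero-copy) host memory into device memory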
    const unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= memSize32)
    {
        return;
    }
    d_odata[ idx ] = h_idata[ idx ];
}

...
    const unsigned int nstreams = 4;
    cudaStream_t streams[nstreams];
...
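    // memSize is in bytes; the kernel copies 32-bit words, hence memSize / 4 elements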
    unsigned int memSize32 = memSize / 4;
    dim3 block(256);
    dim3 grid((memSize32 + block.x - 1) / block.x);

    for (unsigned int i = 0; i < MEMCOPY_ITERATIONS; i++)
    {
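        // Zero-copy H2D: the kernel pulls its input from the mapped host buffer over PCIe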
        zeroCopyHostToDevice<<<grid, block, 0, streams[i % nstreams]>>>((unsigned int*)d_idata, (const unsigned int*)h_odata, memSize32);
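        // DMA D2H: asynchronous copy back to the pinned host buffer; it overlaps with kernels and copies issued on the other streams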
        checkCudaErrors(cudaMemcpyAsync(h_odata, d_idata, memSize, cudaMemcpyDeviceToHost, streams[i % nstreams]));
    }
...
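
For completeness, the buffers and streams are set up roughly like this (a simplified sketch, assuming UVA so the mapped host pointer can be passed straight to the kernel; without UVA, cudaHostGetDevicePointer() would be used instead):

    // Enable mapped pinned allocations before the context is created
    // (required on non-UVA platforms, harmless on UVA systems)
    checkCudaErrors(cudaSetDeviceFlags(cudaDeviceMapHost));

    // Mapped (zero-copy) pinned host buffer that the kernel reads over PCIe,
    // plus a regular device buffer that the kernel writes and the async copy reads back
    unsigned char* h_odata = NULL;
    unsigned char* d_idata = NULL;
    checkCudaErrors(cudaHostAlloc((void**)&h_odata, memSize, cudaHostAllocMapped));
    checkCudaErrors(cudaMalloc((void**)&d_idata, memSize));

    for (unsigned int i = 0; i < nstreams; i++)
    {
        checkCudaErrors(cudaStreamCreate(&streams[i]));
    }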

My assumption is that, to perform the zero-copy reads, the GPU issues PCIe read requests to the host, and those requests occupy bandwidth in the same direction as the DMA writes to the host. Is this a correct assumption, or is there another explanation?
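
In case it helps frame the question: one control experiment I am considering is to keep the same kernel but feed it from an extra device buffer instead of the mapped host buffer, so that no PCIe read requests are generated; if the device-to-host copy then goes back to ~12 GB/s, that would point at the request traffic. A sketch (d_staging is a hypothetical extra buffer, not in my current code):

    unsigned char* d_staging = NULL;
    checkCudaErrors(cudaMalloc((void**)&d_staging, memSize));
    checkCudaErrors(cudaMemcpy(d_staging, h_odata, memSize, cudaMemcpyHostToDevice));

    for (unsigned int i = 0; i < MEMCOPY_ITERATIONS; i++)
    {
        // Same kernel, but the source is now device memory, so the only PCIe
        // traffic left is the device-to-host DMA copy
        zeroCopyHostToDevice<<<grid, block, 0, streams[i % nstreams]>>>((unsigned int*)d_idata, (const unsigned int*)d_staging, memSize32);
        checkCudaErrors(cudaMemcpyAsync(h_odata, d_idata, memSize, cudaMemcpyDeviceToHost, streams[i % nstreams]));
    }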

Thanks!