Zero-copy from host to device decreases cudaMemcpyAsync device-to-host performance

Hi all,

I have run into a performance problem when running a zero-copy kernel that transfers data from the host to the device concurrently with DMA copies (cudaMemcpyAsync) from device to host. More specifically, if I run a device-to-host cudaMemcpyAsync by itself, or even concurrently with another cudaMemcpyAsync transferring data in the opposite direction, each transfer shows ~12 GB/s in the Visual Profiler. However, if I replace the host-to-device cudaMemcpyAsync with a zero-copy kernel, the device-to-host cudaMemcpyAsync drops to ~8-9 GB/s.

Here is the example code:

__global__ void zeroCopyHostToDevice(unsigned int* d_odata, const unsigned int* __restrict__ h_idata, unsigned int memSize32)
{
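    // Each thread copies one 32-bit word straight from mapped (zero-copy) host memory into device memory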
    const unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= memSize32)
    {
        return;
    }
    d_odata[ idx ] = h_idata[ idx ];
}

...
    const unsigned int nstreams = 4;
    cudaStream_t streams[nstreams];
...
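    // memSize is in bytes; the kernel copies 32-bit words, hence memSize / 4 elements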
    unsigned int memSize32 = memSize / 4;
    dim3 block(256);
    dim3 grid((memSize32 + block.x - 1) / block.x);

    for (unsigned int i = 0; i < MEMCOPY_ITERATIONS; i++)
    {
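        // Zero-copy H2D: the kernel pulls its input from the mapped host buffer over PCIe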
        zeroCopyHostToDevice<<<grid, block, 0, streams[i % nstreams]>>>((unsigned int*)d_idata, (const unsigned int*)h_odata, memSize32);
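        // DMA D2H: asynchronous copy back to the pinned host buffer; it overlaps with kernels and copies issued on the other streams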
        checkCudaErrors(cudaMemcpyAsync(h_odata, d_idata, memSize, cudaMemcpyDeviceToHost, streams[i % nstreams]));
    }
...
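
For completeness, the buffers and streams are set up roughly like this (a simplified sketch, assuming UVA so the mapped host pointer can be passed straight to the kernel; without UVA, cudaHostGetDevicePointer() would be used instead):

    // Enable mapped pinned allocations before the context is created
    // (required on non-UVA platforms, harmless on UVA systems)
    checkCudaErrors(cudaSetDeviceFlags(cudaDeviceMapHost));

    // Mapped (zero-copy) pinned host buffer that the kernel reads over PCIe,
    // plus a regular device buffer that the kernel writes and the async copy reads back
    unsigned char* h_odata = NULL;
    unsigned char* d_idata = NULL;
    checkCudaErrors(cudaHostAlloc((void**)&h_odata, memSize, cudaHostAllocMapped));
    checkCudaErrors(cudaMalloc((void**)&d_idata, memSize));

    for (unsigned int i = 0; i < nstreams; i++)
    {
        checkCudaErrors(cudaStreamCreate(&streams[i]));
    }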

My assumption is that, to perform the zero-copy reads, the GPU issues PCIe read requests to the host, and those requests occupy bandwidth in the same direction as the DMA writes to the host. Is this a correct assumption, or is there another explanation?
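
In case it helps frame the question: one control experiment I am considering is to keep the same kernel but feed it from an extra device buffer instead of the mapped host buffer, so that no PCIe read requests are generated; if the device-to-host copy then goes back to ~12 GB/s, that would point at the request traffic. A sketch (d_staging is a hypothetical extra buffer, not in my current code):

    unsigned char* d_staging = NULL;
    checkCudaErrors(cudaMalloc((void**)&d_staging, memSize));
    checkCudaErrors(cudaMemcpy(d_staging, h_odata, memSize, cudaMemcpyHostToDevice));

    for (unsigned int i = 0; i < MEMCOPY_ITERATIONS; i++)
    {
        // Same kernel, but the source is now device memory, so the only PCIe
        // traffic left is the device-to-host DMA copy
        zeroCopyHostToDevice<<<grid, block, 0, streams[i % nstreams]>>>((unsigned int*)d_idata, (const unsigned int*)d_staging, memSize32);
        checkCudaErrors(cudaMemcpyAsync(h_odata, d_idata, memSize, cudaMemcpyDeviceToHost, streams[i % nstreams]));
    }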

Thanks!