cudaMemcpyAsync with pageable memory overlaps with kernel

I just started using CUDA, and I learned that cudaMemcpyAsync needs pinned memory to overlap with a kernel. But I found that when my program only has H2D transfers and a kernel, the copy overlaps with the kernel even when it uses pageable memory. Am I misunderstanding something?

The code looks like this:

  auto async_func1 = [&](float *a)
  {
    memset(a, 0, bytes);
    // Issue one chunk of the copy plus its kernel on each stream.
    for (int i = 0; i < nStreams; ++i)
    {
      int offset = i * streamSize;
      cudaMemcpyAsync(&d_a[offset], &a[offset], streamBytes,
                      cudaMemcpyHostToDevice, stream[i]);
      kernel<<<streamSize / blockSize, blockSize, 0, stream[i]>>>(d_a, offset);
    }
    // Wait for all streams to finish.
    for (int i = 0; i < nStreams; ++i)
    {
      cudaStreamSynchronize(stream[i]);
    }
  };

If the transfer size (streamBytes) is small enough, the pageable transfer can proceed roughly as if the memory were pinned.

More specifically, with pageable memory, the first step in the process is to copy the data to a runtime-maintained pinned buffer. Then a DMA transfer to device memory is initiated. Once in the runtime-maintained pinned buffer, the transfer proceeds as if it were pinned:

The function will return once the pageable buffer has been copied to the staging memory for DMA transfer to device memory, but the DMA to final destination may not have completed.

As a programmer, this behavior is difficult to rely on. Therefore the programming advice is to use a pinned buffer when you want a copy operation to overlap with a compute operation.
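As a rough sketch (not from the thread; it assumes the same nStreams, streamSize, streamBytes, blockSize, d_a, stream[], and kernel are already set up as in the code above), the pinned-memory variant of that loop could look like:

```cuda
// Sketch only: same pattern as async_func1, but with pinned host memory.
float *a;
cudaMallocHost(&a, bytes);   // page-locked (pinned) host allocation
memset(a, 0, bytes);

for (int i = 0; i < nStreams; ++i)
{
    int offset = i * streamSize;
    // With pinned memory, the async copy is a true DMA directly from host
    // memory, so it returns immediately and reliably overlaps with kernels
    // running in other streams.
    cudaMemcpyAsync(&d_a[offset], &a[offset], streamBytes,
                    cudaMemcpyHostToDevice, stream[i]);
    kernel<<<streamSize / blockSize, blockSize, 0, stream[i]>>>(d_a, offset);
}
for (int i = 0; i < nStreams; ++i)
{
    cudaStreamSynchronize(stream[i]);
}

cudaFreeHost(a);             // pinned memory must be freed with cudaFreeHost
```

Note that memory allocated with cudaMallocHost must be released with cudaFreeHost, not free().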


Thank you very much for your help
