Idea: a new memcpy from device to host for gain performance

Sometimes a kernel shrink the data count inside a data-array. In this case we will calculate an array with counts in the device side too. After that we want to copy the data-array to the host side.
The problem is:

  • I have to copy all data (without shrinking) from the array,
  • or I have to copy the counts from the device to the host first, then synchronize, then can copy the data-array, then synchronize again to use that on the host side.
    This is not effective if the original data-array is very big and after the processing the data will be shrunk to a little size. These cause unnecessary copy (more delay) or lots of synchronization (delay too).
    We know the data will be ready after when the kernel will be finished, so if we have a copy which can works with the counts-data which is in the gpu (we can calculate the size to this array), then we can overlap more codes.

For example:

after the kernel run:
the int data-array-GPU[39]: //the shrank data (output)
1 3 5 7 9 - - - -
1 5 9 - - - - - -
2 6 7 9 - - - - -
the count-array: { 5, 3, 4 }
the size-array: { 5 * sizeof(int), 3
sizeof(int), 4 * sizeof(int) } (so: { 20, 12, 16 } )

kernel<3, 32, streamX>(data-array, count-array, size-array, stride(9) );
for(int i = 0; i < 3; ++i)
newMemcpyAsyncDevToHost( data-array-CPU + i * stride,
data-array-GPU + i * stride * sizeof(int),
size-array + i * sizeof(size_t), //size-array is a ptr in the device to the value, which will be known just after the kernel run.
streamX); //streamX: cuda stream, the same as at the kernel
… (overlapped codes)
//use the data on host side:

If you are willing to map the memory, you can drive the copy from the GPU side (have kernel code store into host memory).
Since cudaMemcpyAsync() already requires page-locked memory, the additional overhead of also mapping the memory is reduced compared to a generic cudaMemcpy()…

I had a similar problem. Most of the data was resident on the GPU. Throughput was limited by output bandwidth, clocking in at ~ 15G/sec, near the PCI bus limit.

I solved this by having the CPU-side make an informed guess at the actual output size.

If the guess was too small, it did a synchronous copy to get the rest of the data. Obviously, this is expensive, so the guess was strongly biased towards being too large. This was ok for my particular problem because there was a huge margin - multiple orders of magnitude - between the full and required output sizes. Guessing too large still reduced the output bandwidth by 99%. I also had some empirical data to guide the size heuristic.

If you don’t have that kind of margin, or have nothing to inform the size heuristic, then this method probably won’t work for you.

managed memory

let cpu page faults drive the return data copy.