Sometimes a kernel shrinks the number of valid elements inside a data array. In that case the kernel also computes an array of per-row counts on the device side. Afterwards we want to copy the data array back to the host.
The problem is:
- either I have to copy the whole array (without shrinking) from the device,
- or I have to copy the counts from the device to the host first, then synchronize, then copy the data array, then synchronize again before using it on the host side.
Neither is efficient when the original data array is very large and the processing shrinks it to a small size: the first causes an unnecessarily large copy (extra delay), the second causes extra synchronization (also delay).
We know the data will be ready once the kernel has finished, so if there were a copy call that could take its size from the counts data still resident on the GPU (the kernel can compute a size array from the counts), we could overlap more work.
For example:
after the kernel run:
the int data-array-GPU[27]: // the shrunk data (output), 3 rows with stride 9, '-' marks unused slots
1 3 5 7 9 - - - -
1 5 9 - - - - - -
2 6 7 9 - - - - -
the count-array: { 5, 3, 4 }
the size-array: { 5 * sizeof(int), 3 * sizeof(int), 4 * sizeof(int) } (so: { 20, 12, 16 } )
kernel<<<3, 32, 0, streamX>>>(data-array, count-array, size-array, stride /* = 9 */);
for(int i = 0; i < 3; ++i)
{
newMemcpyAsyncDevToHost( data-array-CPU + i * stride, //int* destination row on the host
data-array-GPU + i * stride, //int* source row; pointer arithmetic already scales by sizeof(int)
size-array + i, //size-array is a device pointer; the value it points to is only known after the kernel has run
streamX); //streamX: CUDA stream, the same one the kernel runs in
}
… (other work overlapped with the copies)
cudaStreamSynchronize(streamX);
//use the data on host side:
…