The following function performs an exclusive prefix sum (scan) on each of dim integer arrays, one after another.
```cuda
#include <cub/cub.cuh>

// dataPtr points to the start of dim contiguous integer arrays, each of length size,
// whose prefix sums are to be computed; radixPtr receives the results in the same layout.
// Example: [x1, x2 ... xN, y1, y2 ... yN, z1, z2 ... zN]
//
void func( int* radixPtr , int* dataPtr , short size , short dim )
{
    void* tmpStorage = 0;
    size_t offset = 0 , tmpStorageSize = 0;
    // With tmpStorage == 0 this call only queries the required temporary storage size.
    cub::DeviceScan::ExclusiveSum( tmpStorage , tmpStorageSize , dataPtr , radixPtr , size );
    allocateDeviceMemory( &tmpStorage , tmpStorageSize ); // wrapper for cudaMalloc()
    // Sequential. TODO: parallelize
    for( short i = 0; i < dim; ++i )
    {
        cub::DeviceScan::ExclusiveSum( tmpStorage , tmpStorageSize , dataPtr + offset , radixPtr + offset , size );
        offset += size;
    }
    cudaFree( tmpStorage ); // release the temporary storage
}
```
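To pin down what the function computes for the layout in the comment, here is a CPU reference sketch. referenceExclusiveScans is a hypothetical name introduced here, not part of the original code. It restarts the running sum at the start of each array, which is exactly the independence that should make parallelizing possible, and it is handy for checking any parallel GPU version against.

```cuda
#include <cstddef>
#include <vector>

// Hypothetical host-side reference: computes on the CPU the same result that
// func() produces on the GPU. data holds dim arrays of length size, back to back.
std::vector<int> referenceExclusiveScans(const std::vector<int>& data,
                                         int size, int dim)
{
    std::vector<int> out(data.size());
    for (int d = 0; d < dim; ++d) {
        int running = 0;  // each array starts its own scan at 0
        for (int i = 0; i < size; ++i) {
            std::size_t idx = static_cast<std::size_t>(d) * size + i;
            out[idx] = running;  // exclusive: sum of the elements before i in array d
            running += data[idx];
        }
    }
    return out;
}
```

For example, data {1, 2, 3, 4, 5, 6} with size = 3 and dim = 2 gives {0, 1, 3, 0, 4, 9}.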
As each array is independent, it ought to be possible to compute the prefix sums in parallel instead of in series. How would I go about doing this? I know streams are an option; what I’d like to know is whether there are others.
Kernels are executed serially per stream. The only way to have two kernels executing simultaneously is via different streams. I’m not even sure GPUs are capable of this though. As of now, I’m only aware that you can execute a kernel and copy data to and from the device at the same time.
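GPUs of compute capability 2.0 and later can run kernels from different streams concurrently when resources allow (cudaDeviceProp::concurrentKernels reports this), so the streams option can be sketched roughly as below. funcStreams is a hypothetical name, and the sketch assumes dim is small enough that one stream and one temporary buffer per array is acceptable. The main difference from func() is that every scan that may be in flight at the same time needs its own temporary storage: a single shared tmpStorage buffer is only safe when the scans run one after another in the same stream.

```cuda
#include <cub/cub.cuh>
#include <cuda_runtime.h>
#include <vector>

// Hypothetical streams-based variant of func(): one stream and one temp buffer per array.
void funcStreams(int* radixPtr, const int* dataPtr, int size, int dim)
{
    // The required temp size depends only on the item type and count, so query it once.
    size_t tmpBytes = 0;
    cub::DeviceScan::ExclusiveSum(nullptr, tmpBytes, dataPtr, radixPtr, size);

    // Create streams and allocate all buffers up front; cudaMalloc can synchronize
    // the device, so it is kept out of the launch loop.
    std::vector<cudaStream_t> streams(dim);
    std::vector<void*> tmp(dim, nullptr);
    for (int i = 0; i < dim; ++i) {
        cudaStreamCreateWithFlags(&streams[i], cudaStreamNonBlocking);
        cudaMalloc(&tmp[i], tmpBytes);
    }

    // Launch each scan in its own stream; the GPU may overlap them if resources allow.
    for (int i = 0; i < dim; ++i) {
        size_t offset = static_cast<size_t>(i) * size;
        cub::DeviceScan::ExclusiveSum(tmp[i], tmpBytes, dataPtr + offset,
                                      radixPtr + offset, size, streams[i]);
    }

    // Wait for every scan, then release the per-stream resources.
    for (int i = 0; i < dim; ++i) {
        cudaStreamSynchronize(streams[i]);
        cudaStreamDestroy(streams[i]);
        cudaFree(tmp[i]);
    }
}
```

Whether the scans actually overlap depends on how much of the GPU each one occupies: with short arrays each scan launches only a few blocks, which is when streams are most likely to help; with long arrays a single scan may already fill the device.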
The scan computations are independent of each other; the only connection is that the input arrays are contiguous in memory. Assuming the problem isn’t too large, or it’s running on a high-end GPU, is there some way to dispatch these ‘tasks’ and have CUDA execute as many as possible in parallel, or at worst serially? Is this something that can be done with CUB, or do I have to use streams? I am not objecting to the use of streams; I just want to be sure that I’m reaching for the simplest possible solution.
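One single-launch option that stays within CUB is a small custom kernel that assigns one thread block per array and uses the block-level cub::BlockScan. The hardware then schedules as many of those blocks concurrently as it can, which is close to the "dispatch tasks and let CUDA run as many as possible" model described above. This is a sketch under assumptions: segmentedExclusiveScan and funcOneLaunch are hypothetical names, the block size of 256 is an arbitrary choice, and each block sweeps its array in tiles, carrying the running total from one tile to the next.

```cuda
#include <cub/cub.cuh>
#include <cuda_runtime.h>

// Hypothetical kernel: block b computes the exclusive scan of array b.
template <int BLOCK_THREADS>
__global__ void segmentedExclusiveScan(const int* dataPtr, int* radixPtr, int size)
{
    using BlockScan = cub::BlockScan<int, BLOCK_THREADS>;
    __shared__ typename BlockScan::TempStorage tempStorage;

    const int* in  = dataPtr  + static_cast<size_t>(blockIdx.x) * size;
    int*       out = radixPtr + static_cast<size_t>(blockIdx.x) * size;

    int running = 0;  // sum of all tiles already processed in this array
    // The loop bounds are the same for every thread, as the collective scan requires.
    for (int tile = 0; tile < size; tile += BLOCK_THREADS) {
        int i = tile + threadIdx.x;
        int x = (i < size) ? in[i] : 0;  // pad the last partial tile with zeros
        int y, tileSum;
        // Block-wide exclusive sum; tileSum is the total of this tile, given to every thread.
        BlockScan(tempStorage).ExclusiveSum(x, y, tileSum);
        if (i < size) out[i] = running + y;
        running += tileSum;
        __syncthreads();  // tempStorage is reused in the next iteration
    }
}

// Hypothetical host wrapper: one kernel launch covers all dim arrays.
void funcOneLaunch(int* radixPtr, const int* dataPtr, int size, int dim,
                   cudaStream_t stream = 0)
{
    constexpr int kThreads = 256;
    segmentedExclusiveScan<kThreads><<<dim, kThreads, 0, stream>>>(dataPtr, radixPtr, size);
}
```

The trade-off is that each array is handled by a single block, so with only a handful of very long arrays most of the GPU would sit idle and the per-array cub::DeviceScan calls (sequential or in streams) are likely faster. Another single-call alternative, outside CUB, is Thrust's thrust::exclusive_scan_by_key with a key of index / size for each element, which avoids writing a kernel at all.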