Computing Prefix Sum/Scan on different arrays (with CUB) in parallel

The following function sequentially computes a prefix sum (scan) on each of dim arrays of integers.

#include <cub/cub.cuh>

// dataPtr points to the start of dim contiguous arrays of integers, each of
// length size, for which the prefix sums are to be computed; the results are
// written to radixPtr.
// Example (dim = 3, size = N): [x1, x2 ... xN, y1, y2 ... yN, z1, z2 ... zN]
void func( int* radixPtr , int* dataPtr , short size , short dim )
{
    void* tmpStorage = 0;
    size_t offset = 0 , tmpStorageSize = 0;

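    // First call with tmpStorage == NULL is only a size query: it sets
    // tmpStorageSize to the scratch space required and performs no scan.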
    cub::DeviceScan::ExclusiveSum( tmpStorage , tmpStorageSize , dataPtr , radixPtr , size );
    allocateDeviceMemory( &tmpStorage , tmpStorageSize );  // wrapper for cudaMalloc()

    // Sequential. TODO: parallelize
    for( ushort i = 0; i < dim; ++i )
    {
        cub::DeviceScan::ExclusiveSum( tmpStorage , tmpStorageSize , dataPtr + offset , radixPtr + offset , size );
        offset += size;
    }

    cudaFree( tmpStorage );  // release the temporary scratch buffer
}

As each array is independent, it ought to be possible to compute the prefix sums in parallel instead of in series. How would I go about doing this? I know streams are an option; what I'd like to know is whether there are others.
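For concreteness, here's roughly what I picture the stream version looking like (just a sketch, untested; funcStreams and the per-stream temp buffers are my own assumptions, since I don't believe concurrent scans can share a single scratch buffer):

#include <vector>
#include <cub/cub.cuh>

void funcStreams( int* radixPtr , int* dataPtr , short size , short dim )
{
    size_t tmpStorageSize = 0;

    // Size query: with a NULL temp-storage pointer no scan is performed,
    // only tmpStorageSize is set.
    cub::DeviceScan::ExclusiveSum( nullptr , tmpStorageSize , dataPtr , radixPtr , size );

    std::vector<cudaStream_t> streams( dim );
    std::vector<void*> tmpStorage( dim );

    size_t offset = 0;
    for( ushort i = 0; i < dim; ++i )
    {
        cudaStreamCreate( &streams[i] );
        cudaMalloc( &tmpStorage[i] , tmpStorageSize );  // each concurrent scan needs its own scratch buffer

        // Same scan as before, but issued into its own stream so the
        // dim scans are free to overlap on the device.
        cub::DeviceScan::ExclusiveSum( tmpStorage[i] , tmpStorageSize ,
                                       dataPtr + offset , radixPtr + offset , size , streams[i] );
        offset += size;
    }

    cudaDeviceSynchronize();
    for( ushort i = 0; i < dim; ++i )
    {
        cudaFree( tmpStorage[i] );
        cudaStreamDestroy( streams[i] );
    }
}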

Kernels are executed serially per stream. The only way to have two kernels executing simultaneously is via different streams. I’m not even sure GPUs are capable of this though. As of now, I’m only aware that you can execute a kernel and copy data to and from the device at the same time.

For what purpose? What benefit are you expecting from that?

If size is large, then the device scan should occupy the device, and there is likely little or no benefit to exposing more parallelism.

If size is small, then CUB provides block-level and even warp-level primitives that could be used to parallelize this.
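For example, something along these lines (just a sketch, untested; it assumes size fits in a single thread block with one item per thread; for larger size each thread would need to process multiple items, e.g. with cub::BlockLoad):

#include <cub/cub.cuh>

// One thread block per array; each block computes its own exclusive scan.
// Assumes size <= BLOCK_THREADS (one item per thread).
template< int BLOCK_THREADS >
__global__ void multiScan( int* radixPtr , int* dataPtr , short size )
{
    typedef cub::BlockScan<int, BLOCK_THREADS> BlockScan;
    __shared__ typename BlockScan::TempStorage tempStorage;

    int* in  = dataPtr  + blockIdx.x * size;
    int* out = radixPtr + blockIdx.x * size;

    // Threads past the end of the array contribute zeros to the scan.
    int item = ( threadIdx.x < size ) ? in[threadIdx.x] : 0;

    BlockScan( tempStorage ).ExclusiveSum( item , item );

    if( threadIdx.x < size )
        out[threadIdx.x] = item;
}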

Speed.

The scan computations are independent and have nothing to do with each other, apart from the input arrays being contiguous. Assuming the problem isn't too large or it's running on a high-end GPU, is there some way to dispatch these 'tasks' and have CUDA execute as many as possible in parallel, or at worst serially? Is this something that can be done with CUB, or do I have to use streams? I am not objecting to the use of streams; I just want to be sure that I'm reaching for the simplest possible solution.

BTW, all the data is already in GPU memory.

it’s easy - run <<<dim>>> instances of the kernel instead of a sequential “for” loop
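Presumably meaning something like this, assuming a block-scan kernel along the lines of the multiScan sketch above (one block per array):

// One launch covers all dim arrays: blockIdx.x selects which array each block scans.
multiScan<256><<< dim , 256 >>>( radixPtr , dataPtr , size );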

Yeah, but won’t I have to specify a different stream when I do that?