Computing Prefix Sum/Scan on different arrays (with CUB) in parallel

The following function sequentially computes a prefix sum (scan) on each of dim arrays of integers.

#include <cub/cub.cuh>

// dataPtr points to the start of dim contiguous arrays of integers, each of
// length size, for which the prefix sums are to be computed; the results are
// written to radixPtr.
// Example (dim = 3, size = N): [x1, x2 ... xN, y1, y2 ... yN, z1, z2 ... zN]
void func( int* radixPtr , int* dataPtr , short size , short dim )
{
    void* tmpStorage = 0;
    size_t offset = 0 , tmpStorageSize = 0;

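    // First call with tmpStorage == NULL is only a size query: it sets
    // tmpStorageSize to the scratch space required and performs no scan.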
    cub::DeviceScan::ExclusiveSum( tmpStorage , tmpStorageSize , dataPtr , radixPtr , size );
    allocateDeviceMemory( &tmpStorage , tmpStorageSize );  // wrapper for cudaMalloc()

    // Sequential. TODO: parallelize
    for( ushort i = 0; i < dim; ++i )
    {
        cub::DeviceScan::ExclusiveSum( tmpStorage , tmpStorageSize , dataPtr + offset , radixPtr + offset , size );
        offset += size;
    }

    cudaFree( tmpStorage );  // release the temporary scratch buffer
}

As each array is independent, it ought to be possible to compute the prefix sums in parallel instead of in series. How would I go about doing this? I know streams are an option; what I'd like to know is whether there are others.
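For concreteness, here's roughly what I picture the stream version looking like (just a sketch, untested; funcStreams and the per-stream temp buffers are my own assumptions, since I don't believe concurrent scans can share a single scratch buffer):

#include <vector>
#include <cub/cub.cuh>

void funcStreams( int* radixPtr , int* dataPtr , short size , short dim )
{
    size_t tmpStorageSize = 0;

    // Size query: with a NULL temp-storage pointer no scan is performed,
    // only tmpStorageSize is set.
    cub::DeviceScan::ExclusiveSum( nullptr , tmpStorageSize , dataPtr , radixPtr , size );

    std::vector<cudaStream_t> streams( dim );
    std::vector<void*> tmpStorage( dim );

    size_t offset = 0;
    for( ushort i = 0; i < dim; ++i )
    {
        cudaStreamCreate( &streams[i] );
        cudaMalloc( &tmpStorage[i] , tmpStorageSize );  // each concurrent scan needs its own scratch buffer

        // Same scan as before, but issued into its own stream so the
        // dim scans are free to overlap on the device.
        cub::DeviceScan::ExclusiveSum( tmpStorage[i] , tmpStorageSize ,
                                       dataPtr + offset , radixPtr + offset , size , streams[i] );
        offset += size;
    }

    cudaDeviceSynchronize();
    for( ushort i = 0; i < dim; ++i )
    {
        cudaFree( tmpStorage[i] );
        cudaStreamDestroy( streams[i] );
    }
}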

Kernels are executed serially per stream. The only way to have two kernels executing simultaneously is via different streams. I’m not even sure GPUs are capable of this though. As of now, I’m only aware that you can execute a kernel and copy data to and from the device at the same time.

For what purpose? What benefit are you expecting from that?

If size is large, then the device scan should occupy the device, and there is likely little or no benefit to exposing more parallelism.

If size is small, then CUB provides block-level and even warp-level primitives that could be used to parallelize this.
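For example, something along these lines (just a sketch, untested; it assumes size fits in a single thread block with one item per thread; for larger size each thread would need to process multiple items, e.g. with cub::BlockLoad):

#include <cub/cub.cuh>

// One thread block per array; each block computes its own exclusive scan.
// Assumes size <= BLOCK_THREADS (one item per thread).
template< int BLOCK_THREADS >
__global__ void multiScan( int* radixPtr , int* dataPtr , short size )
{
    typedef cub::BlockScan<int, BLOCK_THREADS> BlockScan;
    __shared__ typename BlockScan::TempStorage tempStorage;

    int* in  = dataPtr  + blockIdx.x * size;
    int* out = radixPtr + blockIdx.x * size;

    // Threads past the end of the array contribute zeros to the scan.
    int item = ( threadIdx.x < size ) ? in[threadIdx.x] : 0;

    BlockScan( tempStorage ).ExclusiveSum( item , item );

    if( threadIdx.x < size )
        out[threadIdx.x] = item;
}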

Speed.

The scan computations are independent and have nothing to do with each other, apart from the input arrays being contiguous. Assuming the problem isn't too large or it's running on a high-end GPU, is there some way to dispatch these 'tasks' and have CUDA execute as many as possible in parallel, or at worst serially? Is this something that can be done with CUB, or do I have to use streams? I am not objecting to the use of streams; I just want to be sure that I'm reaching for the simplest possible solution.

BTW, all the data is already in GPU memory.

it’s easy - run <<<dim>>> instances of the kernel instead of a sequential “for” loop
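Presumably meaning something like this, assuming a block-scan kernel along the lines of the multiScan sketch above (one block per array):

// One launch covers all dim arrays: blockIdx.x selects which array each block scans.
multiScan<256><<< dim , 256 >>>( radixPtr , dataPtr , size );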

Yeah, but won’t I have to specify a different stream when I do that?