Device Level Prefix Sum?

sedona · March 10, 2014, 12:53am

I have a sizeable kernel that does various compression manipulations on image data, using almost all of the available shared memory. At one point, I do a segmented prefix sum on some of the data. The kernel then goes on to further manipulations.
Fine, everything works, but I’m all too aware that my home-grown segmented scan is years old, and prone to bank conflicts and inefficiencies. I’m at a loss as to how to make use of more modern implementations from deep inside the kernel. I can’t leave the kernel without incurring the overhead of writing all the data out to global memory first, so CUDPP or Thrust seem not to be options. Nor do I want the pain of pulling apart CUDPP (or going back to theory) and writing my own device-level scan calls mid-kernel. Am I missing something obvious here?

Robert_Crovella · March 10, 2014, 12:59am

CUB can do a block level prefix sum, and provides primitives that can be called from within a kernel:

http://nvlabs.github.io/cub/classcub_1_1_block_scan.html

Gregory_Diamos · March 10, 2014, 1:05am

CUB is an attempt at solving similar problems, where you want to be able to embed a well-known collective operation inside an existing kernel. It doesn’t include a segmented scan, but it does include optimized scan primitives, and maybe you could modify one into a segmented scan. It would also be useful to take a look at the interface to see if it is a good fit for your problem. Presumably it would be possible to extend CUB with an implementation of segmented scan in the future.

CUDA dynamic parallelism could also be an option if your arrays are big enough.
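One common way to turn an ordinary scan into a segmented one is to scan (value, head-flag) pairs with an associative operator that restarts the running sum at each segment head. The sketch below shows only that operator, checked against a sequential reference scan on the host in plain C++. The names `FlaggedValue`, `SegmentedSum`, `inclusive_scan_reference` and `segmented_inclusive_sum` are hypothetical and are not part of CUB or any other library. Whether such a functor fits a particular block-level scan primitive (for example as a custom scan operator, marked `__host__ __device__`) is an assumption to check against that library's interface.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical pair type: a value plus a flag marking the first
// element of a segment.
struct FlaggedValue {
    int  value;
    bool head;  // true if this element starts a new segment
};

// Associative combine for segmented sums. If the right operand starts
// a segment, the running sum restarts there; otherwise the sums add.
// The head flags are OR-ed so that a combined range remembers whether
// it contains a segment start.
struct SegmentedSum {
    FlaggedValue operator()(const FlaggedValue& a,
                            const FlaggedValue& b) const {
        return FlaggedValue{b.head ? b.value : a.value + b.value,
                            a.head || b.head};
    }
};

// Sequential reference inclusive scan, standing in for a parallel
// scan primitive that accepts a user-supplied associative operator.
template <typename T, typename Op>
std::vector<T> inclusive_scan_reference(const std::vector<T>& in, Op op) {
    std::vector<T> out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        out.push_back(i == 0 ? in[i] : op(out[i - 1], in[i]));
    }
    return out;
}

// Convenience wrapper: pack values and flags, scan, unpack the sums.
// Assumes values and heads have the same length.
std::vector<int> segmented_inclusive_sum(const std::vector<int>& values,
                                         const std::vector<bool>& heads) {
    std::vector<FlaggedValue> pairs;
    pairs.reserve(values.size());
    for (std::size_t i = 0; i < values.size(); ++i) {
        pairs.push_back(FlaggedValue{values[i], static_cast<bool>(heads[i])});
    }
    const std::vector<FlaggedValue> scanned =
        inclusive_scan_reference(pairs, SegmentedSum{});
    std::vector<int> sums;
    sums.reserve(scanned.size());
    for (const FlaggedValue& p : scanned) {
        sums.push_back(p.value);
    }
    return sums;
}
```

Because the operator is associative, a parallel scan is free to combine partial results in any grouping and should still agree with this sequential reference. The cost is that each scanned element carries an extra flag, which matters when shared memory is already nearly full.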

sedona · March 10, 2014, 3:15am

Ah yes, I’d forgotten about CUB, thanks. But I can’t see any support for segmented scans?

sedona · March 10, 2014, 3:16am

(Sorry, missed the reply above.)

sedona · March 10, 2014, 3:20am

Thanks Gregory, most helpful. Frustratingly I can see lots of test segmented scan code in the CUB development branches from a few years back, and even some probably relevant structures in the current release (“SegmentedOp”). I’ll look deeper and see what’s involved.

Uncle_Joe · March 11, 2014, 1:32am

The CUDPP library has device-level scans. It used to be part of the NVIDIA SDK examples, but they dropped it in favor of Thrust, which doesn’t have device-level functions.

allanmac · March 11, 2014, 1:34am

@sedona, the NVLabs Modern GPU library might also be worth looking at. It has a segmented reduction that might give you some ideas for implementing a segmented scan.
