Device Level Prefix Sum?

I have a sizeable kernel that does various compression manipulations on image data, using almost all of the available shared memory. At one point, I do a segmented prefix sum on some of the data. The kernel then goes on to further manipulations.
Fine, everything works, but I’m all too aware that my home-grown segmented scan is years old, and prone to bank conflicts and inefficiencies. But I’m at a loss as to how to make use of more modern implementations from deep in the kernel. I can’t leave the kernel without incurring the overhead of sending all the data into the global memory first, so CUDPP or Thrust seem not to be options. But nor do I want the pain of pulling apart CUDPP (or going back to theory) and doing my own device level scan calls mid kernel. Am I missing something obvious here?

CUB can do a block level prefix sum, and provides primitives that can be called from within a kernel:

http://nvlabs.github.io/cub/classcub_1_1_block_scan.html

CUB is an attempt at solving similar problems, where you want to embed a well-known collective operation inside an existing kernel. It doesn’t include a segmented scan, but it does include optimized scan primitives, and you may be able to adapt one into a segmented scan. It would also be worth looking at the interface to see whether it is a good fit for your problem. Presumably CUB could be extended with a segmented scan implementation in the future.
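For what it's worth, one common way to get a segmented scan out of a plain scan primitive is to scan (value, head flag) pairs with a custom associative operator. Below is a minimal sketch of that idea on top of cub::BlockScan, assuming int data, a single block, and at most 128 elements. The names FlaggedValue, SegmentedSum, segmented_scan_kernel and run_segmented_scan are made up for illustration and are not part of CUB, and I haven't tuned or profiled any of this.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>
#include <cub/cub.cuh>

// Hypothetical helper type (not part of CUB): a value plus a flag that
// marks the first element of a segment.
struct FlaggedValue {
    int value;
    int head;  // 1 if this element starts a new segment, else 0
};

// Associative combine: if the right operand starts a segment, the running
// sum restarts there; otherwise the left sum is carried into it.
struct SegmentedSum {
    __host__ __device__ FlaggedValue operator()(const FlaggedValue &a,
                                                const FlaggedValue &b) const {
        FlaggedValue r;
        r.value = b.head ? b.value : a.value + b.value;
        r.head = a.head | b.head;
        return r;
    }
};

constexpr int BLOCK_THREADS = 128;  // assumed block size for this sketch

// One element per thread, single block. Global memory is used here only to
// keep the example standalone; inside a larger kernel the inputs could come
// from registers or shared memory instead.
__global__ void segmented_scan_kernel(const int *values, const int *heads,
                                      int *out, int n) {
    using BlockScan = cub::BlockScan<FlaggedValue, BLOCK_THREADS>;
    __shared__ typename BlockScan::TempStorage temp_storage;

    int i = threadIdx.x;
    FlaggedValue item;
    item.value = (i < n) ? values[i] : 0;
    item.head = (i < n) ? heads[i] : 1;  // padding threads form their own segments

    FlaggedValue result;
    BlockScan(temp_storage).InclusiveScan(item, result, SegmentedSum());

    if (i < n) out[i] = result.value;
}

// Host-side driver for trying the kernel out (illustrative only, no error checks).
std::vector<int> run_segmented_scan(const std::vector<int> &values,
                                    const std::vector<int> &heads) {
    assert(values.size() == heads.size());
    int n = static_cast<int>(values.size());
    assert(n > 0 && n <= BLOCK_THREADS);

    size_t bytes = n * sizeof(int);
    int *d_values, *d_heads, *d_out;
    cudaMalloc(&d_values, bytes);
    cudaMalloc(&d_heads, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_values, values.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_heads, heads.data(), bytes, cudaMemcpyHostToDevice);

    segmented_scan_kernel<<<1, BLOCK_THREADS>>>(d_values, d_heads, d_out, n);

    std::vector<int> out(n);
    cudaMemcpy(out.data(), d_out, bytes, cudaMemcpyDeviceToHost);  // waits for the kernel
    cudaFree(d_values);
    cudaFree(d_heads);
    cudaFree(d_out);
    return out;
}
```

With values {1, 2, 3, 4, 5, 6} and head flags {1, 0, 0, 1, 0, 0}, the two segments sum independently, so the result should be {1, 3, 6, 4, 9, 15}. Note that the scan's temporary storage takes its own slice of shared memory, which matters if the kernel is already close to the limit, and a __syncthreads() is needed before that storage is reused for another collective.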

CUDA dynamic parallelism could also be an option if your arrays are big enough.

Ah yes, I’d forgotten about CUB, thanks. But I can’t see any support for segmented scans?

(Sorry, I missed the reply above.)

Thanks Gregory, most helpful. Frustratingly I can see lots of test segmented scan code in the CUB development branches from a few years back, and even some probably relevant structures in the current release (“SegmentedOp”). I’ll look deeper and see what’s involved.

The CUDPP library has device level scans. It used to be part of the NVIDIA SDK examples, but they dropped it in favor of Thrust, which doesn’t have device level functions.

@sedona, the NVLabs Modern GPU library might also be worth looking at. It has a segmented reduction that might give you some ideas for implementing a segmented scan.