High-performance prefix sum / scan function in CUDA — looking for a Thrust / cuDPP library alternative

I’m looking for a high-performance multiscan / multi prefix-sum (many rows in one kernel execution) function for my CUDA project.

I’ve tried the one from the Thrust library, but it’s way too slow. The Thrust functions also crash after being compiled with the nvcc debug flags (-g -G).

After my failure with Thrust I focused on the cuDPP library, which used to be part of the CUDA Toolkit. cuDPP’s performance is really good, but the library is not up to date with the latest CUDA 5.5, and there are some global memory violation issues in the cudppMultiScan() function when debugging with the memory checker.
Same issue: https://groups.google.com/forum/#!topic/cudpp/5f4iPT8cPJ8
(CUDA 5.5, Nsight 3.1, Visual Studio 2010, GTX 260, compute capability 1.3)

Does anybody have any idea what to use instead of these two libraries?

R.

Scans and reductions are among the first things learned in GPU computing, so why not write your own?

Also, you generally should not use the -g and -G flags, and you have a very old GPU (which cannot use the new optimized libraries effectively).

CudaaduC is definitely right.

The -g -G options are used when compiling a CUDA program for debugging. They disable most compiler optimizations and include symbolic debugging information in the object file so the application can be debugged.

To write your own prefix scan, you may refer to:

  1. The scan example of the CUDA SDK;
  2. Chapter 13 of N. Wilt, “The CUDA Handbook”;
  3. Chapter 6 of S. Cook, “CUDA Programming, A Developer’s Guide to Parallel Computing with GPUs”;
  4. http://developer.download.nvidia.com/compute/cuda/1.1-Beta/x86_website/projects/scan/doc/scan.pdf

To do a multi prefix-sum you can launch the same kernel many times, or try to achieve concurrency with CUDA streams, although I do not know if this will work effectively on your card.

From our experience, we generally prefer cuDPP (or writing a specialized routine) over the Thrust library, as in our use cases cuDPP provided better performance (GPU: GTX 480). You could also take a look at the CUB library: http://nvlabs.github.io/cub/.

@CudaaduC/JFSebastian: A possible motivation for not writing everything anew is that usually (at least in a company) you want to get a project done (some algorithm ported) within a certain time, so it really helps if the most basic building blocks (I consider scan/reduction/compaction etc. to be these) are already available in a performant form …