High-performance prefix sum / scan function in CUDA — looking for a Thrust / cuDPP library alternative

I’m looking for a high-performance multiscan / multi prefix-sum (many rows in one kernel execution) function for my CUDA project.

I’ve tried the one from the Thrust library, but it’s way too slow. The Thrust functions also crash after being compiled with the nvcc debug flags (-g -G).

After my failure with Thrust I focused on the cuDPP library, which used to be part of the CUDA Toolkit. cuDPP’s performance is really good, but the library is not up to date with the latest CUDA 5.5, and there are some global memory violation issues in the cudppMultiScan() function when debugging with the memory checker.
Same issue: https://groups.google.com/forum/#!topic/cudpp/5f4iPT8cPJ8
(CUDA 5.5, Nsight 3.1, Visual Studio 2010, GTX 260, compute capability 1.3)

Does anybody have any idea what to use instead of these two libraries?

R.

Scans and reductions are among the first things learned in GPU computing, so why not write your own?

Also, you generally should not use the -g and -G flags, and you have a very old GPU (which cannot use the new optimized libraries effectively).

CudaaduC is definitely right.

The -g -G options are used when compiling a CUDA program for debugging. They disable most compiler optimizations and include symbolic debugging information in the object file so the application can be debugged.

To write your own prefix scan, you may refer to:

  1. The scan example of the CUDA SDK;
  2. Chapter 13 of N. Wilt, “The CUDA Handbook”;
  3. Chapter 6 of S. Cook, “CUDA Programming, A Developer’s Guide to Parallel Computing with GPUs”;
  4. http://developer.download.nvidia.com/compute/cuda/1.1-Beta/x86_website/projects/scan/doc/scan.pdf

To do a multi prefix-sum you can launch the same kernel many times, or try to achieve concurrency with CUDA streams, although I do not know if this will work effectively on your card.

From our experience, we generally prefer cuDPP (or writing a specialized routine) over the Thrust library, as in our use cases cuDPP provided better performance (GPU: GTX 480). You could also take a look at the CUB library: http://nvlabs.github.io/cub/.

@CudaaduC/JFSebastian: A possible motivation for not writing everything anew is that usually (at least in a company) you want to get a project done (some algorithm ported) within a certain time, so it really helps if the most basic building blocks (I consider scan/reduction/compaction etc. to be these) are already available in a performant form …