Efficient stream compaction?

Hi!

I have the following problem:

My program consists of two kernel launches.
In the first launch, each block outputs a single value which is either 0 or non-zero
(here 0 indicates that the computation was unsuccessful and the result should be ignored).
The number of blocks in the first kernel is about 50-100 thousand,
while a 0 result occurs very rarely: typically, only 1-5 blocks output 0.

In the second kernel, the number of blocks is about 50-100, and it works only with the non-zero entries.
Therefore, I need some way to eliminate the zeros from the input sequence…

One solution I can think of is to run a stream compaction kernel between these two kernel launches, but this would incur
an extra kernel call, which is undesirable and probably inefficient, because the number of zero entries is tiny compared to the overall data size.
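(For reference, a minimal sketch of what that separate pass could look like using Thrust's copy_if; the buffer names, the function name, and the int value type are my assumptions, not anything fixed by the problem.)

```cpp
#include <thrust/copy.h>
#include <thrust/device_ptr.h>

// Predicate: keep entries whose value is non-zero.
struct is_nonzero
{
    __host__ __device__
    bool operator()(int x) const { return x != 0; }
};

// d_results:   per-block outputs of the first kernel (device memory)
// d_compacted: destination buffer for the surviving entries (device memory)
// n:           number of entries (= number of blocks in the first kernel)
// Returns the number of non-zero entries copied.
int compact_nonzero(const int* d_results, int* d_compacted, int n)
{
    thrust::device_ptr<const int> in(d_results);
    thrust::device_ptr<int>       out(d_compacted);
    thrust::device_ptr<int>       end =
        thrust::copy_if(in, in + n, out, is_nonzero());
    return static_cast<int>(end - out);
}
```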

Another solution could be to eliminate the zero entries right in the first kernel launch: i.e., a block writes its result to global memory only if
the computation produces a non-zero result, and the write index is controlled by a global counter which is incremented atomically
each time a write to global memory occurs…
However, atomic operations are expensive, so I am not entirely sure that this is a good solution.
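(To make the idea concrete, here is a rough sketch of that in-kernel variant; the kernel name, the int value type, and the stand-in computation are placeholders. Note that only thread 0 of each block issues the atomic, so a full launch costs at most one atomicAdd per block, i.e. 50-100 thousand atomics in total, and the order of the surviving results is not preserved.)

```cpp
// Device-global write index for the compacted output.
__device__ unsigned int d_write_count = 0;

__global__ void first_kernel(int* d_out /*, ...inputs... */)
{
    __shared__ int block_result;

    // ... the block's actual computation goes here, with the final
    // per-block value ending up in block_result ...
    if (threadIdx.x == 0)
        block_result = 1;  // stand-in for the real result
    __syncthreads();

    // Append the result only if it is non-zero; a single thread per
    // block performs the one atomic increment and the write.
    if (threadIdx.x == 0 && block_result != 0)
    {
        unsigned int idx = atomicAdd(&d_write_count, 1u);
        d_out[idx] = block_result;
    }
}
```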

I’d highly appreciate it if anyone could suggest a better solution to this problem.

thanks

A kernel call does not take that much time.
An algorithm that I am currently implementing launches 600 kernels in less than 100 ms. I am in the process of reducing the number of calls, yes, but one extra kernel call is not that bad ;)
There are some very fast compaction algorithms available; check these out:
paper: [url=“http://www.cse.chalmers.se/~billeter/papers.html”]http://www.cse.chalmers.se/~billeter/papers.html[/url]
template library: [url=“http://www.cse.chalmers.se/~billeter/pub/pp/index.html”]http://www.cse.chalmers.se/~billeter/pub/pp/index.html[/url] (same authors)

An extra kernel call could be saved through a hack for GPU-wide synchronisation, but that goes against the CUDA programming model, and some internal driver/hardware bugs require careful settings when using it.

The only time an extra kernel call is really a performance penalty is when you have to wait for the first kernel to complete, copy something back, and conditionally launch another kernel based on the result of the copy.
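(For illustration, a sketch of that costly pattern, reusing the placeholder names from the sketch above, redeclared here so the snippet stands alone; the synchronous copy is what stalls the pipeline.)

```cpp
__device__ unsigned int d_write_count;  // filled by the first kernel

__global__ void first_kernel(int* d_out);
__global__ void second_kernel(int* d_out, unsigned int count);

void run_both(int* d_out, int num_blocks, int threads_per_block)
{
    first_kernel<<<num_blocks, threads_per_block>>>(d_out);

    // This copy blocks the host until the first kernel has finished.
    unsigned int h_count = 0;
    cudaMemcpyFromSymbol(&h_count, d_write_count, sizeof(h_count));

    // Only now can the second kernel be launched, conditioned on the
    // copied-back count.
    if (h_count > 0)
        second_kernel<<<64, threads_per_block>>>(d_out, h_count);
}
```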

OK, I see. Thanks for the replies!

I will look through the stream compaction algorithms and report back on my experience.

Someone suggested to me that you may find this paper helpful:

[url=“http://www.cse.chalmers.se/~billeter/pub/pp/paper.pdf”]http://www.cse.chalmers.se/~billeter/pub/pp/paper.pdf[/url]

That is the same paper I pointed to 3 posts earlier :)

Whoops! Shows what happens when I just blatantly repost my email :)