Efficient stream compaction?

Hi!

I have the following problem:

My program consists of two kernel launches.
In the first launch, each block outputs a single value which is either 0 or non-zero
(here 0 indicates that the computation was unsuccessful and the result should be ignored).
The number of blocks in the first kernel is about 50-100 thousand,
while a 0 result occurs very rarely: typically, only 1-5 blocks output 0.

In the second kernel, the number of blocks is about 50-100, and it works only with the non-zero entries.
Therefore, I need some way to eliminate the zeros from the input sequence…

One solution I can think of is to run a stream compaction kernel between these two kernel launches, but this would incur
an extra kernel call, which is undesirable and probably inefficient, because the number of zero entries is tiny compared to the overall data size.
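(For reference, a minimal sketch of what that separate pass could look like using Thrust's copy_if; the buffer names, the function name, and the int value type are my assumptions, not anything fixed by the problem.)

```cpp
#include <thrust/copy.h>
#include <thrust/device_ptr.h>

// Predicate: keep entries whose value is non-zero.
struct is_nonzero
{
    __host__ __device__
    bool operator()(int x) const { return x != 0; }
};

// d_results:   per-block outputs of the first kernel (device memory)
// d_compacted: destination buffer for the surviving entries (device memory)
// n:           number of entries (= number of blocks in the first kernel)
// Returns the number of non-zero entries copied.
int compact_nonzero(const int* d_results, int* d_compacted, int n)
{
    thrust::device_ptr<const int> in(d_results);
    thrust::device_ptr<int>       out(d_compacted);
    thrust::device_ptr<int>       end =
        thrust::copy_if(in, in + n, out, is_nonzero());
    return static_cast<int>(end - out);
}
```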

Another solution could be to eliminate the zero entries right in the first kernel launch: i.e., a block writes its result to global memory only if
the computation produces a non-zero result, and the write index is controlled by a global counter which is incremented atomically
each time a write to global memory occurs…
However, atomic operations are expensive, so I am not entirely sure that this is a good solution.
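(To make the idea concrete, here is a rough sketch of that in-kernel variant; the kernel name, the int value type, and the stand-in computation are placeholders. Note that only thread 0 of each block issues the atomic, so a full launch costs at most one atomicAdd per block, i.e. 50-100 thousand atomics in total, and the order of the surviving results is not preserved.)

```cpp
// Device-global write index for the compacted output.
__device__ unsigned int d_write_count = 0;

__global__ void first_kernel(int* d_out /*, ...inputs... */)
{
    __shared__ int block_result;

    // ... the block's actual computation goes here, with the final
    // per-block value ending up in block_result ...
    if (threadIdx.x == 0)
        block_result = 1;  // stand-in for the real result
    __syncthreads();

    // Append the result only if it is non-zero; a single thread per
    // block performs the one atomic increment and the write.
    if (threadIdx.x == 0 && block_result != 0)
    {
        unsigned int idx = atomicAdd(&d_write_count, 1u);
        d_out[idx] = block_result;
    }
}
```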

I’d highly appreciate it if anyone could suggest a better solution to this problem.

thanks

A kernel call does not take that much time.
An algorithm that I am currently implementing launches 600 kernels in less than 100 ms. I am in the process of reducing the number of calls, yes, but one extra kernel call is not that bad ;)
There are some very fast compaction algorithms available; check these out:
paper: [url=“http://www.cse.chalmers.se/~billeter/papers.html”]http://www.cse.chalmers.se/~billeter/papers.html[/url]
template library: [url=“http://www.cse.chalmers.se/~billeter/pub/pp/index.html”]http://www.cse.chalmers.se/~billeter/pub/pp/index.html[/url] (same authors)

An extra kernel call could be saved through a hack for GPU-wide synchronisation, but that goes against the CUDA programming model, and some internal driver/hardware bugs require careful settings when using it.

The only time an extra kernel call is really a performance penalty is when you have to wait for the first kernel to complete, copy something back, and conditionally launch another kernel based on the result of the copy.
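(For illustration, a sketch of that costly pattern, reusing the placeholder names from the sketch above, redeclared here so the snippet stands alone; the synchronous copy is what stalls the pipeline.)

```cpp
__device__ unsigned int d_write_count;  // filled by the first kernel

__global__ void first_kernel(int* d_out);
__global__ void second_kernel(int* d_out, unsigned int count);

void run_both(int* d_out, int num_blocks, int threads_per_block)
{
    first_kernel<<<num_blocks, threads_per_block>>>(d_out);

    // This copy blocks the host until the first kernel has finished.
    unsigned int h_count = 0;
    cudaMemcpyFromSymbol(&h_count, d_write_count, sizeof(h_count));

    // Only now can the second kernel be launched, conditioned on the
    // copied-back count.
    if (h_count > 0)
        second_kernel<<<64, threads_per_block>>>(d_out, h_count);
}
```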

OK, I see. Thanks for the replies!

I will look through the stream compaction algorithms and report back on my experience.

Someone suggested to me that you may find this paper helpful:

[url=“http://www.cse.chalmers.se/~billeter/pub/pp/paper.pdf”]http://www.cse.chalmers.se/~billeter/pub/pp/paper.pdf[/url]

That is the same paper I pointed to 3 posts earlier :)

Whoops! Shows what happens when I just blatantly repost my email :)