Hi!

I have the following problem:

My program consists of two kernel launches. In the first launch, each block outputs a single value, which is either zero or non-zero (zero indicates that the computation was unsuccessful and the result should be ignored).

The first kernel runs about 50,000 to 100,000 blocks, and a zero result is very rare: typically only 1-5 blocks output 0.

In the second kernel the number of blocks is about 50-100, and I work only with the *non-zero* entries. Therefore, I need some way to eliminate the zeros from the input sequence.

One solution I can think of is to run a stream compaction kernel between the two launches. However, this would incur an extra kernel call, which is undesirable and probably inefficient, because the number of zero entries is tiny compared to the overall data size.
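To make the first option concrete, here is a rough sketch of a compaction pass using Thrust's `thrust::copy_if`. The helper name `compact_nonzero`, the predicate `IsNonZero`, and the `int` value type are my own assumptions for illustration; my real kernels are not shown here.

```cuda
#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <vector>

// Predicate (assumed name): keep only non-zero block results.
struct IsNonZero {
    __host__ __device__ bool operator()(int v) const { return v != 0; }
};

// Hypothetical helper: copies the per-block results to the device,
// compacts them there, and returns the non-zero entries in input order.
std::vector<int> compact_nonzero(const std::vector<int>& block_results) {
    thrust::device_vector<int> d_in(block_results.begin(), block_results.end());
    thrust::device_vector<int> d_out(d_in.size());

    // copy_if is stable: surviving entries keep their relative order.
    auto out_end = thrust::copy_if(d_in.begin(), d_in.end(),
                                   d_out.begin(), IsNonZero());
    d_out.resize(out_end - d_out.begin());

    std::vector<int> out(d_out.size());
    thrust::copy(d_out.begin(), d_out.end(), out.begin());
    return out;
}
```

In a real pipeline the data would already live on the device, so only the `copy_if` call would sit between the two kernels. It still costs at least one extra launch plus a full pass over the data, which is the overhead I would like to avoid.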

Another solution would be to eliminate the zero entries right in the first kernel: a block writes its result to global memory *only* if the computation produces a non-zero result, and the write index comes from a global counter that is incremented atomically each time such a write occurs.

However, atomic operations are expensive, so I am not sure that this is a good solution.
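Here is a minimal sketch of this second idea, so it is clear what I mean. The names `first_kernel`, `compute_block_result`, and `run_first_kernel` are placeholders, the per-block computation is replaced by a trivial stand-in, the value type is assumed to be `int`, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Stand-in for the real per-block computation (hypothetical):
// returns 0 when the computation "fails".
__device__ int compute_block_result(const int* input, int block) {
    return input[block];
}

__global__ void first_kernel(const int* input, int* out, unsigned int* count) {
    // Only one thread per block publishes the block's result.
    if (threadIdx.x == 0) {
        int r = compute_block_result(input, blockIdx.x);
        if (r != 0) {
            // atomicAdd returns the old counter value, which is
            // this block's private slot in the output array.
            unsigned int slot = atomicAdd(count, 1u);
            out[slot] = r;
        }
    }
}

// Hypothetical host wrapper: one block per input element.
std::vector<int> run_first_kernel(const std::vector<int>& input, int threads = 64) {
    int n = static_cast<int>(input.size());
    if (n == 0) return {};

    int *d_in = nullptr, *d_out = nullptr;
    unsigned int* d_count = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMalloc(&d_count, sizeof(unsigned int));
    cudaMemcpy(d_in, input.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_count, 0, sizeof(unsigned int));

    first_kernel<<<n, threads>>>(d_in, d_out, d_count);

    // The blocking copy below also waits for the kernel to finish.
    unsigned int h_count = 0;
    cudaMemcpy(&h_count, d_count, sizeof(unsigned int), cudaMemcpyDeviceToHost);
    std::vector<int> out(h_count);
    cudaMemcpy(out.data(), d_out, h_count * sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(d_in); cudaFree(d_out); cudaFree(d_count);
    return out;
}
```

Two things to note about this sketch: the order of the output entries depends on block scheduling and is not deterministic, and there is exactly one `atomicAdd` per successful block, so roughly 50,000 to 100,000 atomics in total. I have not measured how much that costs relative to the rest of the kernel.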

I’d highly appreciate it if anyone could suggest a better solution to this problem.

Thanks