I have an array of 128 positive and negative ints in shared memory array and I would like CUDA to return as quickly as possible the sign of the leftmost non-negative number (+1, -1 or 0 if all numbers are zero).
Which would be the proper algorithm to use? I suspect it’s a modified reduction algorithm, but how would it have to be modified so it fits the problem description?
On compute capability 2.x you can probably come up with a fast implementation using [font=“Courier New”]__ballot()[/font] and [font=“Courier New”]__clz()[/font].
Hmm, I target compute 1.1 and later. My original data of 128 ints is in array A which must not be destroyed. So I use another array B of half the size to store intermediates. The idea is to successively compare two neighbor values and take the sign of the right if the left argument is 0. Repeat until a single value is left. Everything is unrolled like in the reduction CUDA samples. The volatile trick forces 2tid and 2tid+1 into registers, where I expect them to stay and get reused.
The 2-way bank conflicts are unfortunate. But I can’t do it like in the SDK samples because I always need to compare immediate neighbors. The EMUSYNC is for emulation mode only (CUDA SDK 2.3), just like in the SDK samples.
To avoid the 2-way band conflict you could reorder the B array in shared memory. However __brev() is 2.x stuff as well, so I’d guess you won’t end up any faster.