Reduction + Threadfence = does not work! Related to : GPU raytracer

For some reason, I need to count the rays that are in a specific state, and store the intermediary values in some kind of pyramidal structure (helping me find the nth ray in log2 time).

So the operation I need to perform is :

1 0 1 1 1 0 1 1 (A set of 8 rays, the 1s represent rays in the state of interest)

It can easily be done by calling a kernel log2(nb_rays) times, but by using __threadfence() and __syncthreads() it should be doable in one single kernel. However, it seems the __threadfence() and/or __syncthreads() don't work correctly in this case... and I can't figure out why! (for already too many hours!)

So far, here's how it behaves:

If I call the kernel multiple times, it ends up giving the right answer (after around 20 calls!)
If I sandwitch the global mem write between many (5) __threadfence() it gives up the right value after only 5 calls!
I tried marking empty elements with -1, and do a while(read!=-1) (active wait)... gives the right sum after 7 calls!

PS : I know shared memory should help with speed, and it should be doable in 2 passes for 1M rays. But I'd still love to know what I'm doing wrong here!

Thanks in advance!

gtx 580 + cuda 4.1 VS 2010.

__global__ void kernel_reduc2(int* vrayreduc_nv, int vrayreducsize)
[indent]	volatile int* vrayreduc= vrayreduc_nv;

	int id = threadIdx.x + threadIdx.y*blockDim.x + blockIdx.x*blockDim.x*blockDim.y;

	int id0last =0;
	int id0 = 0;
	int size_w = vrayreducsize/2;

	int lcnt=0;

[indent]		id0+= size_w;

		if(id>=size_w)return;				//continue;	// Half the threads will return each loop. Only id==0 will make it to the last sum.
		vrayreduc[id + id0] = vrayreduc[ 2*id + id0last] + vrayreduc[ 2*id + id0last +1];

		id0last = id0;




This looks like a typical reduction scheme. Take a look at the CUDA reduction documentation on how to do this better.

Also, you could also try doing reductions using ArrayFire. The following code does reduction calling ArrayFire ‘sum()

using namespace af;

int main(){

   array A = randu(4,4); // Array of size 4x4 initialized with random data

   print(A); // Display contents of A

   array B = sum(A); // Does reduction along rows

   array C = sum(A,1);// Does reduction along columns




If you call __syncthreads() or __threadfence() within some conditional code (i.e., code which not all threads will reach), the resulting behavior is undefined. You’ll need to modify your control flow a bit to avoid that.

Thanks a lot,

I heard about __syncthreads() within conditional code. Is it really a problem if a thread returns in between? I’d like to do some more testing around this… or maybe someone already did?

Now I’ve found the answer to my question (indeed in the cuda reduction paper!):

page 4: “But CUDA has no global synchronization.”

And they explain why:

__threadfence() allows your writes to be visible by all threads running at that specific time on the GPU. The GPU can hold a limited number of living threads, and other threads can be waiting to be spawned. In that case, it’s oubviously not possible to expect global synchrnonization (like I did!), and there is no way for the first running threads to read what the last one wrote!

And no, the reduction I’m looking for is not just a typical reduction scheme; I need to keep the intermediary values.

Then you’re looking for a “scan”, not a reduction. There’s an example project for this in the CUDA SDK, and I believe Thrust includes “scan” functionality as well.


I am not sure what is the problem in your case, but I have successfully used the example from the programming guide (section with the threadfence() functions) for doing a sum.

First there is summation (reduction operation) inside each block, then use thread fence to write the results (one number for one block to the global array). Now there is an array of size equal to the number of blocks, which I sum again to give one number using on block only.