__syncthreads() code execution hangs


I am computing the dot product, similar to the example (nvidia projects).

// Tree-like reduction

if (thx < i) {

    for (int stride = i / 2; stride > 0; stride >>= 1) {

        __syncthreads();
        //shared_h[thx] += shared_h[stride + thx];

    }

}

In my version the vector length need not be a power of two. The tree-like reduction assumes a power-of-two length, so I added the condition thx < i.

The problem is that the code hangs when the number of threads exceeds 16.

Why is that?

In the Programming Guide it says that

I am not really sure what that means.

Thanks in advance.


It means that __syncthreads() MUST be reached by every thread in the block. Your __syncthreads() is inside an if, so the threads that fail the condition never get there and the others wait for them forever. You can fix it by moving the if (thx < i) inside the for loop, so that the __syncthreads() itself stays outside the if.
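Here is a minimal sketch of that fix, not the original poster's code. It assumes a single block whose size BLOCK is a power of two and a vector length n (the i of the question) of at most BLOCK. The names dotKernel, dotProduct and BLOCK are made up for illustration. Note one assumption on my side: threads past the end of the vectors store 0 in shared memory, and the guard inside the loop is thx < stride rather than thx < i, so the reduction always runs over a power-of-two range. Error checking is left out to keep it short.

```cuda
#include <cassert>
#include <cuda_runtime.h>

#define BLOCK 256  // assumed block size, must be a power of two

// Single-block dot product of two length-n vectors, n <= BLOCK.
__global__ void dotKernel(const float* a, const float* b, float* out, int n)
{
    __shared__ float shared_h[BLOCK];
    int thx = threadIdx.x;

    // Threads past the end contribute 0, so n need not be a power of two.
    shared_h[thx] = (thx < n) ? a[thx] * b[thx] : 0.0f;
    __syncthreads();

    for (int stride = BLOCK / 2; stride > 0; stride >>= 1) {
        if (thx < stride) {
            shared_h[thx] += shared_h[thx + stride];
        }
        // Outside the if: every thread of the block reaches this barrier.
        __syncthreads();
    }

    if (thx == 0) {
        *out = shared_h[0];
    }
}

// Hypothetical host helper: copies the inputs over, launches one block,
// and returns the result.
float dotProduct(const float* hA, const float* hB, int n)
{
    assert(n > 0 && n <= BLOCK);
    size_t bytes = n * sizeof(float);
    float *dA = nullptr, *dB = nullptr, *dOut = nullptr;
    cudaMalloc((void**)&dA, bytes);
    cudaMalloc((void**)&dB, bytes);
    cudaMalloc((void**)&dOut, sizeof(float));
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    dotKernel<<<1, BLOCK>>>(dA, dB, dOut, n);

    float result = 0.0f;
    cudaMemcpy(&result, dOut, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dOut);
    return result;
}
```

The point is only the placement of the barrier: the condition decides who does the addition, but no thread is allowed to skip the __syncthreads().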


This makes sense!

OK, then I have another question:

How come some of the example code contains a conditional like

    if (thx == 0) {
        // do something
    }


as in the scalar product example? I thought the order in which warps execute is undefined.

Thanks in advance.

In the scalar product example, the entire block calculates only a single result. It’s bad practice to have multiple threads writing to the same memory location, so the if (thx == 0) is there to make sure that only one thread performs the memory write.

It is true that the order of warp execution is undefined, so an if (thx == 0) could in principle run into race conditions. In the scalarProd example, a __syncthreads() call makes sure all threads have caught up before accumResult is updated. Thread 0 then writes out the value of accumResult[0], and since thread 0 itself updated accumResult[0] a few lines earlier, there cannot be a race on that element.

In any of the examples that use if (thx == 0), you should see __syncthreads() used in the appropriate places to prevent race conditions.
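To make the pattern concrete, here is a small sketch under my own assumptions; it is not the scalarProd source. Each block sums its slice of an array, and only thread 0 writes the block's result to global memory. The names blockSum, sumOnDevice and THREADS are hypothetical. Here thread 0 reads values written by other threads, so the __syncthreads() before the if is what prevents the race. Error checking is omitted.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

#define THREADS 128  // assumed block size for this sketch

// Each block sums up to THREADS elements; thread 0 alone writes the result.
__global__ void blockSum(const float* in, float* out, int n)
{
    __shared__ float partial[THREADS];
    int thx = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + thx;

    partial[thx] = (idx < n) ? in[idx] : 0.0f;

    // Every thread reaches this barrier, so all writes to partial[]
    // are complete before thread 0 starts reading them.
    __syncthreads();

    if (thx == 0) {
        float sum = 0.0f;
        for (int k = 0; k < THREADS; ++k) {
            sum += partial[k];
        }
        out[blockIdx.x] = sum;  // one writer per output location
    }
}

// Hypothetical host helper: runs the kernel and adds the per-block sums.
float sumOnDevice(const float* hIn, int n)
{
    assert(n > 0);
    int blocks = (n + THREADS - 1) / THREADS;
    float *dIn = nullptr, *dOut = nullptr;
    cudaMalloc((void**)&dIn, n * sizeof(float));
    cudaMalloc((void**)&dOut, blocks * sizeof(float));
    cudaMemcpy(dIn, hIn, n * sizeof(float), cudaMemcpyHostToDevice);

    blockSum<<<blocks, THREADS>>>(dIn, dOut, n);

    std::vector<float> partial(blocks);
    cudaMemcpy(partial.data(), dOut, blocks * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);

    float total = 0.0f;
    for (float p : partial) {
        total += p;
    }
    return total;
}
```

Note that the barrier sits before the if (thx == 0), not inside it. Putting it inside would bring back the hang from the first question.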

I see, thanks a lot!

I have another question:

I am writing a QR decomposition for small matrix sizes. I read the data from global memory, do the processing, and then try to write the result back to the same global memory, since the QR factorization is an iterative process.

I have checked my calculation for a single iteration step and I get weird errors. But when I write to a different global memory location, there seems to be no calculation error. I have checked all the intermediate results and they are correct.

Is there some delay needed before I can write back to the same memory?

I don't have a clue what's wrong!

Thanks in advance.