__syncthreads() code execution hangs

sicb0161 · September 18, 2007, 8:17am

Hello,

I am computing the dot product, similar to the example (nvidia projects).

// Tree-like reduction
if (thx < i) {
    for (int stride = i / 2; stride > 0; stride >>= 1) {
        __syncthreads();
        //shared_h[thx] += shared_h[stride + thx];
    }
}

In my version the vector lengths need not be a power of two, so I added the condition thx < i, because the tree-like reduction requires a length that is a power of two.

The problem is that the code hangs when the number of threads exceeds 16.

Why is that?

In the Programming Guide it says that

I am not really sure what that means.

Thanks in advance.

Cem

MisterAnderson42 · September 18, 2007, 12:51pm

It means that __syncthreads() MUST be called by all threads in the entire block. You have the __syncthreads() inside an if, so some threads never reach it. You can fix it by putting the if (thx < i) inside the for loop.
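A minimal sketch of that fix, assuming a single block whose size is a power of two and an input no longer than the block. The names dot_kernel, dot_on_gpu and BLOCK are hypothetical and not from the original post; thx and shared_h follow the question. The barrier sits outside the conditional so that every thread reaches it, and threads past the end of the vectors contribute zero, which lets the length be arbitrary. Error checking is left out for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define BLOCK 256  // assumed block size; must be a power of two

// Dot product of a and b (n <= BLOCK elements), computed by one block.
__global__ void dot_kernel(const float *a, const float *b, float *out, int n)
{
    __shared__ float shared_h[BLOCK];
    int thx = threadIdx.x;

    // Pad with zeros so the vector length need not be a power of two.
    shared_h[thx] = (thx < n) ? a[thx] * b[thx] : 0.0f;

    for (int stride = BLOCK / 2; stride > 0; stride >>= 1) {
        __syncthreads();          // reached by every thread in the block
        if (thx < stride)         // only the addition is conditional
            shared_h[thx] += shared_h[thx + stride];
    }

    // Thread 0 performed the last addition into shared_h[0] itself.
    if (thx == 0)
        *out = shared_h[0];
}

// Hypothetical host helper; assumes a.size() == b.size() and 0 < size <= BLOCK.
float dot_on_gpu(const std::vector<float> &a, const std::vector<float> &b)
{
    int n = (int)a.size();
    size_t bytes = n * sizeof(float);
    float *d_a, *d_b, *d_out, result = 0.0f;

    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_a, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b.data(), bytes, cudaMemcpyHostToDevice);

    dot_kernel<<<1, BLOCK>>>(d_a, d_b, d_out, n);

    // This blocking copy waits for the kernel to finish.
    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_out);
    return result;
}
```

Calling dot_on_gpu with two vectors of 100 elements exercises the non-power-of-two case: threads 100 to 255 only load zeros, yet they still take part in every barrier.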

sicb0161 · September 18, 2007, 2:19pm

Wonderful,

this makes sense!

Okay, then I have another question:

How come some of the program code has a conditional like

if (thx == 0) {
    // do something
}

as in the scalar product example? I thought that the order in which warps are executed is undefined.

Thanks in advance

MisterAnderson42 · September 18, 2007, 2:34pm

In the scalar product example, the entire block calculates only a single result. It’s bad practice to have multiple threads writing to the same memory location, so the if (thx == 0) is there to make sure that only one thread performs the memory write.

It is true that the order of warp execution is undefined, so the if (thx == 0) could have race condition issues. In the scalarProd example, there has been a __syncthreads() call to make sure all threads are caught up, and then accumResult is updated. Since thread 0 is writing the value from accumResult[0], there cannot be any race condition to access it since thread 0 also updated accumResult[0] a few lines of code up!

In any of the examples that use if (thx == 0), you should see syncthreads used in appropriate locations to prevent race conditions.
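A small sketch of that pattern, not the scalarProd source itself: every thread stores one value in shared memory, a barrier outside any branch makes all of those stores visible, and only then does thread 0 combine them and perform the single global write for the block. The names block_sum_kernel, block_sums_on_gpu, partial and THREADS are hypothetical, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define THREADS 64  // assumed number of threads per block

// Each block sums up to THREADS consecutive inputs and stores one result.
__global__ void block_sum_kernel(const float *in, float *block_out, int n)
{
    __shared__ float partial[THREADS];
    int thx = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + thx;

    partial[thx] = (idx < n) ? in[idx] : 0.0f;

    // Called by all threads; afterwards every partial[] entry is visible
    // to thread 0, regardless of the order in which the warps ran.
    __syncthreads();

    // One writer per block, so no two threads store to the same address.
    if (thx == 0) {
        float sum = 0.0f;
        for (int k = 0; k < THREADS; ++k)
            sum += partial[k];
        block_out[blockIdx.x] = sum;
    }
}

// Hypothetical host helper; assumes a non-empty input.
std::vector<float> block_sums_on_gpu(const std::vector<float> &in)
{
    int n = (int)in.size();
    int blocks = (n + THREADS - 1) / THREADS;
    std::vector<float> out(blocks);
    float *d_in, *d_out;

    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, blocks * sizeof(float));
    cudaMemcpy(d_in, in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    block_sum_kernel<<<blocks, THREADS>>>(d_in, d_out, n);

    cudaMemcpy(out.data(), d_out, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return out;
}
```

Without the barrier, thread 0 could read partial[] entries that threads in other warps have not written yet. The serial loop in thread 0 is chosen for clarity here; a tree-like reduction would be the faster choice.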

sicb0161 · September 20, 2007, 4:26pm

I see, thanks a lot!

I have another question:

I am writing a QR decomposition for smaller matrix sizes. I read from global memory, do the processing, and then try to write the result back to the same global memory after the calculation, as the QR factorization is an iterative process.

I have checked my calculation for a single iteration step and I get weird errors. But when I write to another global memory location, then there seems to be no calculation error. I have checked all the intermediate results and they are correct.

Is there any amount of time needed before I can write back to the same memory?

I don't have a clue what's wrong!

thanks in advance,

Cem
