32-thread block doesn't need __syncthreads()?

Since each warp executes in exact SIMD lockstep, does that mean that if each block has exactly 32 threads (one warp), we don't need __syncthreads()?
Take an if…else…converge pattern, for example: all threads in the warp will take the same path, so there's no need to add __syncthreads() after they converge. Right?

You may still need it. Things don’t seem to converge back after a divergent for loop.

Isn't that somewhat contradictory to what Mark Harris says in his slides from the Supercomputing 07 Tutorial (slide 56)?

Simple example: I want to double every third array element and store the result in reversed order.

__shared__ float array[32];

// distributed load of array data
array[TID] = gmem[TID];

if ((TID % 3) == 0)
    array[TID] *= 2;

gmem[TID] = array[31 - TID];

So do I need __syncthreads() if I use simple if-statements like the one above?

Will more complex calculations make a __syncthreads() necessary?

What's the technical background?



Interesting question. I suppose this should be the case: as each warp has only one instruction decode unit, you might indeed not need __syncthreads().

Any divergence like that the compiler/hardware will automagically re-sync for you, including for loops and any kind of branch. You don't have to worry about it at all (unless that divergence is really causing you significant performance problems… but even then you don't need to sync).

__syncthreads() is a BLOCK WIDE barrier only to be used to avoid shared memory race conditions.
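For contrast, here is a minimal sketch (kernel and variable names are my own, not from this thread) of a case where the barrier IS required: the same reversal pattern, but with 128 threads per block, so the reversed read touches elements written by a different warp.

```cuda
// Hypothetical sketch: with 128 threads (4 warps) per block, the reversed
// read below reads elements written by threads in a *different* warp,
// so a block-wide barrier is required between the write and the read.
__global__ void reverse_kernel(float *gmem)
{
    __shared__ float array[128];

    array[threadIdx.x] = gmem[threadIdx.x];

    __syncthreads();  // cross-warp shared-memory dependency

    gmem[threadIdx.x] = array[127 - threadIdx.x];
}
```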

Thanks a lot for the quick replies,

__shared__ unsigned sArray[256];

__global__ void test_kernel(unsigned* gMem)
{
    sArray[threadIdx.x] = gMem[threadIdx.x];

    for (unsigned i = 0; i < threadIdx.x; i++)
        sArray[threadIdx.x] += 1;

    gMem[threadIdx.x] = sArray[31 - threadIdx.x];  // assumes a 32-thread block
}


really works. That saves my project…

True! You don't need __syncthreads() for 32 threads per block. It actually saves a lot of time and results in a faster implementation.

It also means that you don't have to worry about race conditions and double-buffering solutions.

I re-coded the binomial tree implementation from the NVIDIA SDK for 32 threads and got a speedup of 1.3x (265 ms before, reduced to 195 ms using 32 threads).

Check this forum link:


I have posted the changed code towards the end of the page.

I feel compelled to post a warning on this thread.

NVIDIA makes no guarantee that the warp size of future GPU architectures will always remain 32 threads. While it’s probable that it will remain 32 for quite a while, we can’t provide guarantees.

Therefore, your code should really use cudaGetDeviceProperties() to query the warp size of the present GPU. Then make sure you always use __syncthreads() in code that has shared-memory dependencies between threads not in the same warp (in other words write your code so it sets the granularity of its SIMD computations to equal the warp size from cudaGetDeviceProperties()).
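A minimal host-side sketch of the query Mark describes (device 0 and the commented-out kernel launch are assumptions, not from this thread):

```cuda
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query properties of device 0

    // Use the queried warp size instead of hard-coding 32.
    dim3 block(prop.warpSize);
    // my_kernel<<<1, block>>>(...);     // hypothetical launch at warp granularity

    return 0;
}
```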

If you don’t do this, your code may break on future architectures.


Thanks for the warning Mark.

And as you hinted, I think programmers can still be happy with one warp and no __syncthreads(). All you need to do is change the block dimension to the queried warp size and get going.

Usually, I think it is a bad idea to write code such that your kernel depends on your block and grid dimensions. But yes, certain applications use special dimensions to make their computation optimal, and those apps have to be extra careful about the kernel code itself.
Applications that are written independent of block and grid dimensions can merely set the block dimension correctly (as you said) when launching the kernel.

Once again, thanks for the input.

Another note:

There is one situation where you can’t rely on correct behavior without __syncthreads(). That is when shared variables are cached in registers until a __syncthreads() is hit. Syncthreads has a second meaning: flush all cached shared variables.

That is, if there is a shared variable called “x”, and there is a spin wait on x, then there must be a __syncthreads, or each thread will see its own copy of x and the loop will never exit. Another way around this is to declare x as volatile, forcing the shared variable to be coherent in shared memory.
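A minimal sketch of the spin-wait case described above (all names are assumptions for illustration):

```cuda
// Hypothetical sketch: without 'volatile', each thread may spin on a
// stale register copy of 'flag' and never exit the loop.
__shared__ volatile int flag;   // 'volatile' forces every read to go to shared memory

if (threadIdx.x == 0)
    flag = 0;

// ... later, some other thread sets flag = 1 ...

while (flag == 0)
    ;  // with 'volatile', this re-reads shared memory on every iteration
```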


Sure, thanks! The CUDA 1.1 manual has a note on "volatile" (the 1.0 manual does not). If you have only one warp in your block, you don't need "volatile" either.


EDIT: You actually do need "volatile" even for 32 threads. I was enlightened by the posts below. Read on…

Actually you might need the volatile even for one warp, for the reason he states: shared memory values can be temporarily cached in registers.

Because they are stored in registers to perform computations with them, if I understood correctly?

That is because the 1.0 compiler did not honor the volatile keyword :))

Only the 1.1 manual talks about “volatile” and 1.0 does NOT.

Also note that the PTX 1.1 ISA manual talks about "ld.volatile" and "st.volatile" instructions. So it is possible that loads generated by the "volatile" keyword could again be filtered out when the PTX code is translated to machine code. So it is better to stick with "volatile" in a 1.1 environment.

Wumpus, you are right! You might require "volatile" even in the case of 32 threads. That depends on the application; the programmer has to decide.

Thanks for correcting.

For the sake of everyone's clarity, a possible example would be:

sarray[threadIdx.x] = 25;

sarray[threadIdx.x + 1] = steps[threadIdx.x] + 5;

if (sarray[threadIdx.x] == 25)
{
    // ...
}
Now, if the 25 that was stored was cached in a register, the compiler would still use it for the comparison, and the IF statement would succeed even though a neighbouring thread has overwritten sarray[threadIdx.x], which would be wrong.

You need a “volatile” in such cases.
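A sketch of the same example with the fix applied (the array size and the undefined "steps" array are carried over from the post above):

```cuda
// With 'volatile', the comparison re-reads sarray[threadIdx.x] from
// shared memory, so it sees the neighbouring thread's overwrite instead
// of the register-cached 25.
volatile __shared__ int sarray[33];   // size is an assumption

sarray[threadIdx.x] = 25;

sarray[threadIdx.x + 1] = steps[threadIdx.x] + 5;

if (sarray[threadIdx.x] == 25)
{
    // without 'volatile' this branch would be taken incorrectly
}
```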

Yes, because 1.0 did not honor the keyword.