32-thread block doesn't need __syncthreads()?

Hi,
since each warp executes in exact SIMD lockstep, if each block has exactly 32 threads (one warp), does that mean we don’t need __syncthreads()?
Take if…else…reconverge, for example: all threads in the warp end up back on the same path after the branch, so there’s no need to add __syncthreads() after the reconvergence point. Right?
Thanks!

You may still need it. Things don’t seem to converge back after a divergent for loop.

Isn’t that somewhat contradictory to what Mark Harris says in his slides from the Supercomputing 07 Tutorial (slide 56)?

simple example:

I want to double every third array element and store the result in reversed order:

__global__ void double_every_third(float* gmem)   // hypothetical kernel name; launched with 32 threads
{
    const unsigned TID = threadIdx.x;
    __shared__ float array[32];

    // distributed load of array data
    array[TID] = gmem[TID];

    // double every third element
    if ((TID % 3) == 0)
    {
        array[TID] *= 2;
    }

    // store in reversed order
    gmem[TID] = array[31 - TID];
}

So do I need __syncthreads() if I use simple if-statements like the one above?

Will more complex calculations make a __syncthreads() necessary?

What’s the technical background?

thanks,

tomschi

Interesting question. I suppose this should be the case: since each warp has only one instruction decode unit, you might indeed not need __syncthreads().

Any divergence like that is automatically re-synchronized for you by the compiler/hardware, including for loops and any kind of branch. You don’t have to worry about it at all (unless said divergence is really causing you significant performance problems…, but even then you still don’t need to sync).

__syncthreads() is a BLOCK WIDE barrier only to be used to avoid shared memory race conditions.
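
For example, a 64-thread variant of the snippet above (hypothetical kernel, two warps per block) does need the barrier, because the reversed read crosses warp boundaries:

__global__ void reverse64(float* gmem)
{
    __shared__ float array[64];
    const unsigned tid = threadIdx.x;

    array[tid] = gmem[tid];

    // thread 0 reads array[63], written by a thread in the other warp,
    // so a block-wide barrier is required before the read
    __syncthreads();

    gmem[tid] = array[63 - tid];
}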

thanks a lot for the quick replies,

__global__ void test_kernel(unsigned* gMem)
{
    // __shared__ variables belong inside the kernel body;
    // only the first 32 entries are used with a 32-thread block
    __shared__ unsigned sArray[256];

    sArray[threadIdx.x] = gMem[threadIdx.x];

    // divergent loop: each thread iterates a different number of times
    for (unsigned i = 0; i < threadIdx.x; i++)
    {
        sArray[threadIdx.x] += 1;
    }

    // read a value written by another thread in the same warp
    gMem[threadIdx.x] = sArray[31 - threadIdx.x];
}

really works. that saves my project…

True! You don’t need __syncthreads() for 32 threads per block. It saves a lot of time actually and results in a faster implementation.

It also means that you don’t have to worry about race conditions and double-buffering solutions.

I re-coded the binomial tree implementation from the NVIDIA SDK for 32 threads and got a speed-up of 1.3x (265 ms before, reduced to 195 ms using 32 threads).

Check this forum link:

http://forums.nvidia.com/index.php?showtopic=54875

I have posted the changed code towards the end of the page.

I feel compelled to post a warning on this thread.

NVIDIA makes no guarantee that the warp size of future GPU architectures will always remain 32 threads. While it’s probable that it will remain 32 for quite a while, we can’t provide guarantees.

Therefore, your code should really use cudaGetDeviceProperties() to query the warp size of the present GPU. Then make sure you always use __syncthreads() in code that has shared-memory dependencies between threads not in the same warp (in other words, write your code so that it sets the granularity of its SIMD computations to the warp size returned by cudaGetDeviceProperties()).

If you don’t do this, your code may break on future architectures.
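
For example, a minimal host-side sketch of such a query (device 0 assumed):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);         // properties of device 0
    printf("warp size: %d\n", prop.warpSize);  // 32 on current hardware, but do not hard-code it
    return 0;
}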

Thanks,
Mark

Thanks for the warning, Mark.

And as you hinted, I think programmers can still be happy with one warp and no __syncthreads(). All you need to do is change the block dimension (“block.x”) to the warp size and get going.

Usually, I think it is a bad idea to write code such that your kernel depends on your “blockdim” and “griddim”. But yes, certain applications use special dimensions to make their computation optimal. Those apps have to be extra careful about their kernel code itself.
But those applications which are written independent of the block and grid dimensions can simply set the block dimension correctly (as you said) while launching the kernel, along the lines of the sketch below.
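
A sketch of that pattern, with placeholder kernel and buffer names: the kernel indexes with blockDim.x instead of a hard-coded 32, and the launch sets the block size to the queried warp size.

#include <cuda_runtime.h>

// hypothetical kernel: written against blockDim.x, it never assumes 32 threads
__global__ void scale(float* data, unsigned n)
{
    unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}

int main()
{
    const unsigned n = 1024;
    float* d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    unsigned block = prop.warpSize;             // one warp per block
    unsigned grid  = (n + block - 1) / block;   // enough blocks to cover n
    scale<<<grid, block>>>(d_data, n);

    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}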

Once again, Thanks for the input.

Another note:

There is one situation where you can’t rely on correct behavior without __syncthreads(): when shared variables are cached in registers until a __syncthreads() is hit. __syncthreads() has a second meaning: flush all cached shared variables.

That is, if there is a shared variable called “x”, and there is a spin wait on x, then there must be a __syncthreads, or each thread will see its own copy of x and the loop will never exit. Another way around this is to declare x as volatile, forcing the shared variable to be coherent in shared memory.
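
A minimal sketch of that spin-wait case (hypothetical kernel; thread 0 releases the rest of the block, which may sit in other warps):

__global__ void spin_wait_demo(int* out)
{
    // volatile: every read of 'flag' goes to shared memory instead of a stale register copy
    __shared__ volatile int flag;

    if (threadIdx.x == 0)
        flag = 0;
    __syncthreads();              // make the initial value visible block-wide

    if (threadIdx.x == 0)
        flag = 1;                 // thread 0 releases the waiting threads

    // without 'volatile' (or a __syncthreads()), each thread could keep
    // re-testing its own cached copy of 'flag' and spin forever
    while (flag == 0)
        ;

    out[threadIdx.x] = flag;
}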

Mark

Sure, thanks! The 1.1 CUDA manual has a note on “volatile” (the 1.0 manual does not). If you have only one warp in your block, you don’t need that “volatile” either.

PS:

You actually do need “volatile” for 32 threads. I got enlightened by the posts below. Read on…

Actually you might need the volatile even for one warp, for the reason he states: shared memory values can be temporarily cached in registers.

Because they are stored in registers to perform computations on them, if I understood correctly?

That is because the 1.0 compiler did not honor the volatile keyword :))

Only the 1.1 manual talks about “volatile”; the 1.0 manual does NOT.

Also note that the PTX 1.1 ISA manual describes “ld.volatile” and “st.volatile” instructions. So it is possible that loads generated by the volatile keyword could still be optimized away when the PTX code is translated, unless they are emitted as ld.volatile/st.volatile. It is therefore better to stick with “volatile” in a 1.1 environment.

Wumpus, you are right! You might require “volatile” even in the case of 32 threads. That depends on the application; the programmer has to decide on that.

Thanks for correcting.

For the sake of everyone’s clarity:

A possible example would be:

sarray[threadIdx.x] = 25;
 .....
sarray[threadIdx.x + 1] = steps[threadIdx.x] + 5;
 ......
if (sarray[threadIdx.x] == 25)
{
       ......
}

Now, if the 25 that was stored had been cached in a register, the compiler would still use that register copy in the comparison, and the if statement would succeed, which would be wrong.

You need a “volatile” in such cases.
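
A complete, hedged version of that pattern (hypothetical kernel; “steps” is assumed to be a plain input array and the block a single warp of 32 threads):

__global__ void volatile_demo(const unsigned* steps, unsigned* out)
{
    // volatile forces every read of sarray to go back to shared memory,
    // so a thread sees the later store made by its neighbor
    volatile __shared__ unsigned sarray[33];   // 33 entries: thread 31 writes sarray[32]

    const unsigned tid = threadIdx.x;

    sarray[tid] = 25;

    // thread T-1 overwrites sarray[T]; within one warp this store is assumed
    // visible without __syncthreads() (warp-synchronous execution)
    sarray[tid + 1] = steps[tid] + 5;

    // without volatile, the compiler may reuse the 25 still held in a register
    // and this branch could be taken incorrectly
    if (sarray[tid] == 25)
        out[tid] = 1;
    else
        out[tid] = 0;
}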

Yes, because 1.0 did not honor the keyword.