about the __syncwarp() in P100

I know __syncwarp() is a new feature in CUDA 9, and according to the programming guide it should only be needed on Volta, not on the P100. However, I implemented a reduction operation within a warp:

if (threadIdx.x < WARP_SIZE / 2)
    if (vector[threadIdx.x] > vector[threadIdx.x + WARP_SIZE / 2])
    {
        vector[threadIdx.x] = vector[threadIdx.x + WARP_SIZE / 2];
        index[threadIdx.x] = index[threadIdx.x + WARP_SIZE / 2];
    }
__syncwarp();
if (threadIdx.x < WARP_SIZE / 4)
    if (vector[threadIdx.x] > vector[threadIdx.x + WARP_SIZE / 4])
    {
        vector[threadIdx.x] = vector[threadIdx.x + WARP_SIZE / 4];
        index[threadIdx.x] = index[threadIdx.x + WARP_SIZE / 4];
    }
__syncwarp();
if (threadIdx.x < WARP_SIZE / 8)
    if (vector[threadIdx.x] > vector[threadIdx.x + WARP_SIZE / 8])
    {
        vector[threadIdx.x] = vector[threadIdx.x + WARP_SIZE / 8];
        index[threadIdx.x] = index[threadIdx.x + WARP_SIZE / 8];
    }
__syncwarp();
if (threadIdx.x < WARP_SIZE / 16)
    if (vector[threadIdx.x] > vector[threadIdx.x + WARP_SIZE / 16])
    {
        vector[threadIdx.x] = vector[threadIdx.x + WARP_SIZE / 16];
        index[threadIdx.x] = index[threadIdx.x + WARP_SIZE / 16];
    }
__syncwarp();
if (threadIdx.x < WARP_SIZE / 32)
    if (vector[threadIdx.x] > vector[threadIdx.x + WARP_SIZE / 32])
    {
        vector[threadIdx.x] = vector[threadIdx.x + WARP_SIZE / 32];
        index[threadIdx.x] = index[threadIdx.x + WARP_SIZE / 32];
    }

When I delete the __syncwarp() calls, the results are incorrect on the P100 (CUDA 9.0), but the results are correct on the K80 and 780 Ti (CUDA 8.0). In my opinion __syncwarp() should be unnecessary on the P100, so why does this happen? Thank you for your help.

You are showing only a snippet, not compilable code. Presumably your code implements a reduction using shared memory storage. If so, the code has inter-thread data dependencies on that shared memory data. These dependencies are implicit, so the compiler has no knowledge of them, and in the absence of the __syncwarp() calls it may re-arrange the non-dependent memory accesses as it sees fit. This may or may not result in machine code that behaves as intended.

The classical way to implement the reduction correctly, compliant with the CUDA execution model, is to use a __syncthreads() call after each stage of the reduction. The classical, faster but hacky, way to implement the reduction is to declare the shared memory storage volatile which indirectly forces the compiler to arrange the loads and stores in code order.
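As a rough sketch of that volatile variant (the kernel name and argument names here are illustrative, not taken from the original post):

```cuda
#define WARP_SIZE 32

// Sketch of the "volatile" variant: declaring the shared arrays volatile
// forces the compiler to emit the loads and stores in program order, so the
// intermediate results cannot be held back in registers.
__global__ void warpArgMin(const float *valsIn, const int *idxIn,
                           float *outVal, int *outIdx)
{
    __shared__ volatile float vector[WARP_SIZE];
    __shared__ volatile int   index[WARP_SIZE];

    int t = threadIdx.x;          // assumes a single 32-thread warp
    vector[t] = valsIn[t];
    index[t]  = idxIn[t];

    // Tree reduction within one warp; volatile keeps each store visible.
    for (int s = WARP_SIZE / 2; s > 0; s >>= 1) {
        if (t < s && vector[t] > vector[t + s]) {
            vector[t] = vector[t + s];
            index[t]  = index[t + s];
        }
    }

    if (t == 0) { *outVal = vector[0]; *outIdx = index[0]; }
}
```

Note that this relies on the warp executing in lockstep; on Volta and later, with Independent Thread Scheduling, volatile alone is no longer a guarantee, and __syncwarp() is the supported mechanism.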

Just to amplify on the point raised by njuffa:

You may be focused on synchronized execution. But the issue may actually be the effect that the function call (__syncwarp() or __syncthreads()) has as a memory barrier. The compiler is free to optimize shared memory locations into registers. This will break the situations where inter-thread communication occurs in shared memory. The memory barrier effect of e.g. __syncthreads() can prevent this “breakage”.

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#synchronization-functions

“Executing __syncwarp() guarantees memory ordering among threads participating in the barrier. Thus, threads within a warp that wish to communicate via memory can store to memory, execute __syncwarp(), and then safely read values stored by other threads in the warp.”

This could possibly explain the P100 behavior.

This optimization of shared memory into registers is optional for the compiler; it may choose to do it or not. That could explain the behavior on CUDA 8 on the K80.
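The pattern quoted from the programming guide (store, __syncwarp(), then read) can be illustrated with a minimal sketch (kernel name and data are made up for illustration):

```cuda
// Minimal illustration of the store -> __syncwarp() -> load pattern.
// Without the barrier, the compiler is free to keep a thread's value in a
// register, so the neighboring thread's read could see stale shared memory.
__global__ void neighborExchange(int *out)
{
    __shared__ int buf[32];
    int lane = threadIdx.x % 32;

    buf[lane] = lane * lane;           // store my value to shared memory
    __syncwarp();                      // memory barrier: makes the store visible
    out[lane] = buf[(lane + 1) % 32];  // safely read a neighbor's value
}
```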

Thank you for your answer, I agree with your opinion. My question is: does __syncwarp() work on GPUs other than Volta, say the P100 or K80, if CUDA 9 is employed? Thank you

Where does it say that in the programming guide? I think that is a false claim. I’ve already given you a link to the programming guide section which defines syncwarp, and it explicitly discusses the expected behavior for syncwarp on architectures other than sm_70.

Starting with the Volta architecture, Independent Thread Scheduling allows a warp to remain diverged outside of the data-dependent conditional block. An explicit __syncwarp() can be used to guarantee that the warp has reconverged for subsequent instructions.

https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html
please search for __syncwarp for that sentence

Q1 BTW, could you please kindly point out the difference between synchronized execution and a memory barrier?
Q2 I know a memory barrier is necessary if I use shared memory for reduction operations, but if the array size is less than 32, specifically if one warp can handle the array, is a memory barrier still necessary? I think the threads in a warp are synchronous.

Thank you

Thank you for your answer. I know a memory barrier is necessary if I use shared memory for reduction operations when the shared array is larger than the warp size and needs to be processed by several warps. But if the array size is less than 32, specifically if one warp can handle the array, is a memory barrier still necessary? I think the threads in a warp are synchronous.

It’s true that Independent Thread Scheduling is a Volta feature. The next sentence says a __syncwarp() may also be important in this context. That sentence does not say __syncwarp() is a Volta-only feature. The first sentence and the second sentence are separate ideas. I’ve already given you a doc reference that describes __syncwarp() behavior on architectures prior to Volta.

I’ve already given you a documentation link that describes both execution barriers and memory barriers. An execution barrier synchronizes execution. For example it guarantees that warps have reconverged. A memory barrier causes the results of thread activity to be visible in memory to other threads. Without a memory barrier, even though your code appears to be updating shared memory, it may only have results stored in registers local to that thread. The memory barrier forces those results to actually be written out to memory, where they will be visible to other threads.

Yes, a memory barrier may be necessary even in the case of synchronized execution.

Thanks so much. I now get the difference between an execution barrier and a memory barrier. You are right, my issue should be due to the memory barrier. Let me explain it, and please kindly point out whether it is correct.

If I want to sum the values in shared memory (length = 32), then although a warp executes synchronously, the threads may write their registers back to shared memory at slightly different times, depending on the compiler's optimizations. Therefore the sum is undefined. When using __syncthreads() or __syncwarp(), the warp stops until all the data has successfully been written to shared memory, and then resumes. The fact that this issue does not happen on the K80 and 780 Ti with CUDA 8.0 does not mean that my code is correct.

Is that correct? Thank you for your help, it really helps.

Yes, that is approximately correct. Without a memory barrier (or the use of volatile keyword) the compiler can choose when it writes data out to shared memory. Results may sit in a thread-local register until the instruction to write the data to shared memory actually occurs in the instruction stream.
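As a side note, not from the discussion above: since CUDA 9 the same single-warp reduction can also be written with warp shuffle intrinsics, which exchange values through registers and sidestep the shared-memory visibility question entirely. This is a sketch under the same single-warp assumption; the kernel name and parameters are illustrative:

```cuda
// Sketch: argmin over 32 (value, index) pairs using __shfl_down_sync
// (CUDA 9+). Each step pulls a value from a lane 'offset' positions higher;
// no shared memory is involved, so no volatile or memory barrier is needed.
__global__ void warpArgMinShfl(const float *vals, const int *idxIn,
                               float *outVal, int *outIdx)
{
    int lane = threadIdx.x;            // assumes a single 32-thread warp
    float v = vals[lane];
    int   i = idxIn[lane];

    for (int offset = 16; offset > 0; offset /= 2) {
        float v2 = __shfl_down_sync(0xffffffff, v, offset);
        int   i2 = __shfl_down_sync(0xffffffff, i, offset);
        if (v2 < v) { v = v2; i = i2; }
    }

    if (lane == 0) { *outVal = v; *outIdx = i; }  // lane 0 holds the result
}
```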