about the __syncwarp() in P100

I know __syncwarp() is a new feature in CUDA 9, and according to the programming guide it should only be needed on Volta, not on the P100. However, I implemented a reduction operation within a warp:

if (threadIdx.x < WARP_SIZE / 2)
    if (vector[threadIdx.x] > vector[threadIdx.x + WARP_SIZE / 2])
    {
        vector[threadIdx.x] = vector[threadIdx.x + WARP_SIZE / 2];
        index[threadIdx.x] = index[threadIdx.x + WARP_SIZE / 2];
    }
__syncwarp();
if (threadIdx.x < WARP_SIZE / 4)
    if (vector[threadIdx.x] > vector[threadIdx.x + WARP_SIZE / 4])
    {
        vector[threadIdx.x] = vector[threadIdx.x + WARP_SIZE / 4];
        index[threadIdx.x] = index[threadIdx.x + WARP_SIZE / 4];
    }
__syncwarp();
if (threadIdx.x < WARP_SIZE / 8)
    if (vector[threadIdx.x] > vector[threadIdx.x + WARP_SIZE / 8])
    {
        vector[threadIdx.x] = vector[threadIdx.x + WARP_SIZE / 8];
        index[threadIdx.x] = index[threadIdx.x + WARP_SIZE / 8];
    }
__syncwarp();
if (threadIdx.x < WARP_SIZE / 16)
    if (vector[threadIdx.x] > vector[threadIdx.x + WARP_SIZE / 16])
    {
        vector[threadIdx.x] = vector[threadIdx.x + WARP_SIZE / 16];
        index[threadIdx.x] = index[threadIdx.x + WARP_SIZE / 16];
    }
__syncwarp();
if (threadIdx.x < WARP_SIZE / 32)
    if (vector[threadIdx.x] > vector[threadIdx.x + WARP_SIZE / 32])
    {
        vector[threadIdx.x] = vector[threadIdx.x + WARP_SIZE / 32];
        index[threadIdx.x] = index[threadIdx.x + WARP_SIZE / 32];
    }

When I delete the __syncwarp() calls, the results are incorrect on the P100 (CUDA 9.0), but the results are correct on the K80 and 780 Ti (CUDA 8.0). In my opinion __syncwarp() should be unnecessary on the P100, so why does this happen? Thank you for your help.

You are showing only a snippet, not compilable code. Presumably your code implements a reduction using shared memory storage. If so, the code has inter-thread data dependencies on that shared memory data. These dependencies are implicit, so the compiler has no knowledge of them, and in the absence of the __syncwarp() calls it may re-arrange the non-dependent memory accesses as it sees fit. This may or may not result in machine code that behaves as intended.

The classical way to implement the reduction correctly, compliant with the CUDA execution model, is to use a __syncthreads() call after each stage of the reduction. The classical, faster but hacky, way to implement the reduction is to declare the shared memory storage volatile which indirectly forces the compiler to arrange the loads and stores in code order.
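As a rough sketch of that volatile variant (the kernel name and argument names here are illustrative, not taken from the original post):

```cuda
#define WARP_SIZE 32

// Sketch of the "volatile" variant: declaring the shared arrays volatile
// forces the compiler to emit the loads and stores in program order, so the
// intermediate results cannot be held back in registers.
__global__ void warpArgMin(const float *valsIn, const int *idxIn,
                           float *outVal, int *outIdx)
{
    __shared__ volatile float vector[WARP_SIZE];
    __shared__ volatile int   index[WARP_SIZE];

    int t = threadIdx.x;          // assumes a single 32-thread warp
    vector[t] = valsIn[t];
    index[t]  = idxIn[t];

    // Tree reduction within one warp; volatile keeps each store visible.
    for (int s = WARP_SIZE / 2; s > 0; s >>= 1) {
        if (t < s && vector[t] > vector[t + s]) {
            vector[t] = vector[t + s];
            index[t]  = index[t + s];
        }
    }

    if (t == 0) { *outVal = vector[0]; *outIdx = index[0]; }
}
```

Note that this relies on the warp executing in lockstep; on Volta and later, with Independent Thread Scheduling, volatile alone is no longer a guarantee, and __syncwarp() is the supported mechanism.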

Just to amplify on the point raised by njuffa:

You may be focused on synchronized execution. But the issue may actually be the effect that the function call (__syncwarp() or __syncthreads()) has as a memory barrier. The compiler is free to optimize shared memory locations into registers. This will break the situations where inter-thread communication occurs in shared memory. The memory barrier effect of e.g. __syncthreads() can prevent this “breakage”.

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#synchronization-functions

“Executing __syncwarp() guarantees memory ordering among threads participating in the barrier. Thus, threads within a warp that wish to communicate via memory can store to memory, execute __syncwarp(), and then safely read values stored by other threads in the warp.”

This could possibly explain the P100 behavior.

This optimization of shared memory into registers is optional for the compiler; it may choose to do it or not. That could explain the behavior on CUDA 8 on the K80.
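The pattern quoted from the programming guide (store, __syncwarp(), then read) can be illustrated with a minimal sketch (kernel name and data are made up for illustration):

```cuda
// Minimal illustration of the store -> __syncwarp() -> load pattern.
// Without the barrier, the compiler is free to keep a thread's value in a
// register, so the neighboring thread's read could see stale shared memory.
__global__ void neighborExchange(int *out)
{
    __shared__ int buf[32];
    int lane = threadIdx.x % 32;

    buf[lane] = lane * lane;           // store my value to shared memory
    __syncwarp();                      // memory barrier: makes the store visible
    out[lane] = buf[(lane + 1) % 32];  // safely read a neighbor's value
}
```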

Thank you for your answer, I agree with your opinion. My question is: does __syncwarp() work on GPUs other than Volta, say the P100 or K80, if CUDA 9 is employed? Thank you

Where does it say that in the programming guide? I think that is a false claim. I’ve already given you a link to the programming guide section which defines syncwarp, and it explicitly discusses the expected behavior for syncwarp on architectures other than sm_70.

Starting with the Volta architecture, Independent Thread Scheduling allows a warp to remain diverged outside of the data-dependent conditional block. An explicit __syncwarp() can be used to guarantee that the warp has reconverged for subsequent instructions.

https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html
please search for __syncwarp for that sentence

Q1 BTW, could you please kindly point out the difference between synchronized execution and a memory barrier?
Q2 I know a memory barrier is necessary if I use shared memory for reduction operations, but if the array size is less than 32, specifically if one warp can handle the array, is a memory barrier still necessary? I think the threads in a warp are synchronous.

Thank you

Thank you for your answer. I know a memory barrier is necessary if I use shared memory for reduction operations when the shared array is larger than the warp size and needs to be processed by several warps. But if the array size is less than 32, specifically if one warp can handle the array, is a memory barrier still necessary? I think the threads in a warp are synchronous.

It’s true that Independent Thread Scheduling is a Volta feature. The next sentence says a __syncwarp() may also be important in this context. That sentence does not say __syncwarp() is a Volta-only feature. The first sentence and the second sentence are separate ideas. I’ve already given you a doc reference that describes __syncwarp() behavior on architectures prior to Volta.

I’ve already given you a documentation link that describes both execution barriers and memory barriers. An execution barrier synchronizes execution. For example it guarantees that warps have reconverged. A memory barrier causes the results of thread activity to be visible in memory to other threads. Without a memory barrier, even though your code appears to be updating shared memory, it may only have results stored in registers local to that thread. The memory barrier forces those results to actually be written out to memory, where they will be visible to other threads.

Yes, a memory barrier may be necessary even in the case of synchronized execution.

Thanks so much. I now get the difference between an execution barrier and a memory barrier. You are right, my issue should be due to the memory barrier. Let me explain it, and please kindly point out whether it is correct.

If I want to sum the values in shared memory (length = 32), then although a warp executes synchronously, the threads may write their registers back to shared memory at slightly different times, depending on the compiler's optimizations. Therefore the sum is undefined. When using __syncthreads() or __syncwarp(), the warp stops until all the data has successfully been written to shared memory, and then resumes. The fact that this issue does not happen on the K80 and 780 Ti with CUDA 8.0 does not mean that my code is correct.

Is that correct? Thank you for your help, it really helps.

Yes, that is approximately correct. Without a memory barrier (or the use of volatile keyword) the compiler can choose when it writes data out to shared memory. Results may sit in a thread-local register until the instruction to write the data to shared memory actually occurs in the instruction stream.
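As a side note, not from the discussion above: since CUDA 9 the same single-warp reduction can also be written with warp shuffle intrinsics, which exchange values through registers and sidestep the shared-memory visibility question entirely. This is a sketch under the same single-warp assumption; the kernel name and parameters are illustrative:

```cuda
// Sketch: argmin over 32 (value, index) pairs using __shfl_down_sync
// (CUDA 9+). Each step pulls a value from a lane 'offset' positions higher;
// no shared memory is involved, so no volatile or memory barrier is needed.
__global__ void warpArgMinShfl(const float *vals, const int *idxIn,
                               float *outVal, int *outIdx)
{
    int lane = threadIdx.x;            // assumes a single 32-thread warp
    float v = vals[lane];
    int   i = idxIn[lane];

    for (int offset = 16; offset > 0; offset /= 2) {
        float v2 = __shfl_down_sync(0xffffffff, v, offset);
        int   i2 = __shfl_down_sync(0xffffffff, i, offset);
        if (v2 < v) { v = v2; i = i2; }
    }

    if (lane == 0) { *outVal = v; *outIdx = i; }  // lane 0 holds the result
}
```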