Hi,
I was developing a CUDA library (for CUDA tuning, maths, …) when I noticed a strange fact: it seems that in some cases a kernel execution terminates without waiting for all of its threads.
In the following code (a “performance test”), launched with 1024 threads in one block,
__global__ void testSynchSynchthreads()
{
    __shared__ long vect[2048];
    const int id = threadIdx.x;
    // index of the corresponding thread one warp earlier (wrapping around the block)
    const int idPrev = (blockDim.x + threadIdx.x - 32) % blockDim.x;

    vect[id] = id * 58 + 5;

    for (long i = 10000000L; i >= 0; i--)
    {
        vect[id] = vect[idPrev] + 1;
        // only the first two warps (64 threads) reach the barrier
        if (threadIdx.x < 64) __syncthreads();
    }

    //asm volatile("bar.sync %0;" : : "r"(4) : "memory");   // the "last line": a separate named barrier (id 4)
}
the execution terminates in less than 2 seconds, without waiting for the first two warps.
Enabling the last line (the asm volatile…), the execution hangs as expected, because of the first 2 warps.
Any explanation?
I used bar.sync so as not to share the barrier used by __syncthreads() (I also swapped __syncthreads and bar.sync, with the same behaviour).
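To make the idea explicit: __syncthreads() maps to barrier 0, so issuing a different barrier id through inline PTX keeps the two synchronisations on separate hardware barriers. A small helper in that spirit (the name namedBarrierSync is mine, just for illustration) would be:

__device__ __forceinline__ void namedBarrierSync(int barrierId)
{
    // bar.sync on an explicit barrier id (0-15); __syncthreads() uses id 0,
    // so any other id (e.g. 4, as in the kernel above) does not share it.
    asm volatile("bar.sync %0;" : : "r"(barrierId) : "memory");
}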
@Greg: good point. In my case, I put two eventRecord calls around the kernel launch; the measured time was more than 1 second (so the kernel was not removed by the compiler).
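Roughly, the measurement looked like this (just a sketch, no error checking):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
testSynchSynchthreads<<<1, 1024>>>();    // one block of 1024 threads
cudaEventRecord(stop);
cudaEventSynchronize(stop);              // wait until the kernel has finished

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // here: more than 1000 ms, so the kernel really ran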
@tera: nothing in particular, I am just trying to get to know the “hidden behaviour” of CUDA better.
Devices from the Fermi architecture onwards might be using the active warp count instead of the total warp count to prevent deadlocks of this kind with early-exiting threads. IIRC compute capability 1.x devices would indeed deadlock on such code.
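A minimal illustration of the effect (the kernel name is just for the example; note this violates the programming model, which requires __syncthreads() to be reached by all threads of the block, so the outcome is not guaranteed):

// Warps 1..31 exit immediately; only warp 0 reaches the barrier.
// On compute capability 2.x+ this typically completes, because the barrier
// seems to wait only for warps that are still resident.
// On 1.x devices the same code would hang.
__global__ void earlyExitBarrier()
{
    if (threadIdx.x >= 32)
        return;
    __syncthreads();
}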
@tera: thank you very much for your reply, it makes sense. In the first scenario (with the bar.sync line commented out), it is possible that warps 0 and 1 are only executed after all the other warps have terminated, so there is no stall. This scenario depends on how the warps are scheduled.
When the bar.sync line is not commented out, it is executed with a warp count of 1024/32 = 32 warps, and the __syncthreads() inside the loop also always sees 1024/32 active warps (no warps have terminated yet), so neither barrier can ever be satisfied and the kernel hangs.