Strange behaviour of __syncthreads()

Hi,
I was developing a CUDA library (CUDA tuning, maths, …) when I noticed a strange fact: it seems that in some cases the kernel execution terminates without waiting for all of its threads.
In the following code (a performance test), launched with 1024 threads in one block,

__global__ void testSynchSynchthreads()
{
	__shared__ long vect[2048];
	const int id = threadIdx.x;
	// index one warp (32 threads) back, wrapping around the block
	const int idPrev = (blockDim.x + threadIdx.x - 32) % blockDim.x;
	vect[id] = id * 58 + 5;
	for (long i = 10000000L; i >= 0; i--)
	{
		vect[id] = vect[idPrev] + 1;
		if (threadIdx.x < 64) __syncthreads();	// only warps 0 and 1 reach the barrier
	}
	//asm volatile("bar.sync %0;" : : "r"(4) : "memory");
}

the execution terminates in less than 2 seconds, without waiting for the first two warps.
Uncommenting the last line (asm volatile…), the execution hangs as expected, stalled by the first two warps.
Any explanation?

I used bar.sync so as not to share the barrier with __syncthreads() (I also swapped __syncthreads() and bar.sync, with the same behaviour).
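For context, here is a minimal sketch of what I mean by a non-shared barrier. Assumptions on my part: __syncthreads() compiles to “bar.sync 0”, so syncing on any other barrier ID (each block has barriers 0–15) uses a separate barrier resource; the helper name namedBarrierSync is made up for this example.

```cuda
// Sketch only: __syncthreads() maps to "bar.sync 0", so using another
// barrier ID avoids sharing its barrier. Helper name is hypothetical.
__device__ __forceinline__ void namedBarrierSync(int barrierId)
{
    // bar.sync waits at the given barrier for the block's (active) threads
    asm volatile("bar.sync %0;" : : "r"(barrierId) : "memory");
}

__global__ void twoBarriers()
{
    if (threadIdx.x < 64)
        __syncthreads();    // barrier 0: only warps 0 and 1 arrive
    namedBarrierSync(4);    // barrier 4: all warps, independent of barrier 0
}
```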

Thanks for your time,
Marco

Is there anything particular you are trying to achieve, other than hanging your kernel by relying on undocumented behavior?

The kernel has no side effects, so the compiler may turn it into an empty kernel.

@Greg: good point. In my case, I put two cudaEventRecord calls around the kernel launch; the measured time was more than 1 second, so the kernel was not removed by the compiler.

@tera: nothing particular, just trying to get to know the “hidden behaviour” of CUDA better.

Devices from the Fermi architecture onwards might be using an active-warp count instead of the total warp count to prevent deadlocks of this kind with early-exiting threads. IIRC, compute capability 1.x devices would indeed deadlock on such code.
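To illustrate: a sketch of the behaviour I mean. Note that per the CUDA C Programming Guide, calling __syncthreads() in divergent code is undefined behaviour, so this only reflects what has been observed on Fermi-and-later hardware, not a guarantee.

```cuda
// Illustration only: __syncthreads() under divergence is undefined
// behaviour per the CUDA C Programming Guide. On Fermi+ the barrier has
// been observed to release once all *active* warps arrive, so this kernel
// does not hang even though most warps never reach the barrier.
__global__ void earlyExitBarrier(int *out)
{
    if (threadIdx.x >= 64)
        return;                 // warps 2..31 exit early

    __syncthreads();            // only warps 0 and 1 arrive; on Fermi+ the
                                // barrier releases once both are present
    out[threadIdx.x] = threadIdx.x;
}
```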

@tera: thank you very much for your reply, it makes sense. In the first scenario (line 12 commented), it is possible that warps 0 and 1 are executed after all the other warps have terminated, so there is no stall. This scenario depends on the warp scheduling.

When line 12 is not commented, it is executed with a warp count of 1024/32. Also, line 9 always sees 1024/32 active warps (no warps have terminated yet).