Hi,
I was developing a CUDA library (for CUDA tuning, maths, …) when I noticed a strange fact: it seems that in some cases a kernel execution terminates without waiting for all of its threads.
In the following code (a “performance test”), launched with 1024 threads in one block,
__global__ void testSynchSynchthreads()
{
    __shared__ long vect[2048];
    const int id = threadIdx.x;
    // index of the corresponding thread one warp earlier (wrapping around the block)
    const int idPrev = (blockDim.x + threadIdx.x - 32) % blockDim.x;

    vect[id] = id * 58 + 5;

    for (long i = 10000000L; i >= 0; i--)
    {
        vect[id] = vect[idPrev] + 1;
        // only the first two warps (64 threads) reach the barrier
        if (threadIdx.x < 64) __syncthreads();
    }

    //asm volatile("bar.sync %0;" : : "r"(4) : "memory");   // the "last line": a separate named barrier (id 4)
}
the execution terminates in less than 2 seconds, without waiting for the first two warps.
Enabling the last line (the asm volatile…), the execution hangs as expected, because of the first 2 warps.
Any explanation?
I used bar.sync so as not to share the barrier used by __syncthreads() (I also swapped __syncthreads and bar.sync, with the same behaviour).
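To make the idea explicit: __syncthreads() maps to barrier 0, so issuing a different barrier id through inline PTX keeps the two synchronisations on separate hardware barriers. A small helper in that spirit (the name namedBarrierSync is mine, just for illustration) would be:

__device__ __forceinline__ void namedBarrierSync(int barrierId)
{
    // bar.sync on an explicit barrier id (0-15); __syncthreads() uses id 0,
    // so any other id (e.g. 4, as in the kernel above) does not share it.
    asm volatile("bar.sync %0;" : : "r"(barrierId) : "memory");
}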
@Greg: good point. In my case, I put two eventRecord calls around the kernel launch; the measured time was more than 1 second (so the kernel was not removed by the compiler).
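Roughly, the measurement looked like this (just a sketch, no error checking):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
testSynchSynchthreads<<<1, 1024>>>();    // one block of 1024 threads
cudaEventRecord(stop);
cudaEventSynchronize(stop);              // wait until the kernel has finished

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // here: more than 1000 ms, so the kernel really ran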
@tera: nothing in particular, I am just trying to get to know the “hidden behaviour” of CUDA better.
Devices from the Fermi architecture onwards might be using the active warp count instead of the total warp count to prevent deadlocks of this kind with early-exiting threads. IIRC compute capability 1.x devices would indeed deadlock on such code.
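A minimal illustration of the effect (the kernel name is just for the example; note this violates the programming model, which requires __syncthreads() to be reached by all threads of the block, so the outcome is not guaranteed):

// Warps 1..31 exit immediately; only warp 0 reaches the barrier.
// On compute capability 2.x+ this typically completes, because the barrier
// seems to wait only for warps that are still resident.
// On 1.x devices the same code would hang.
__global__ void earlyExitBarrier()
{
    if (threadIdx.x >= 32)
        return;
    __syncthreads();
}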
@tera: thank you very much for your reply, it makes sense. In the first scenario (with the bar.sync line commented out), it is possible that warps 0 and 1 are only executed after all the other warps have terminated, so there is no stall. This scenario depends on how the warps are scheduled.
When the bar.sync line is not commented out, it is executed with a warp count of 1024/32 = 32 warps, and the __syncthreads() inside the loop also always sees 1024/32 active warps (no warps have terminated yet), so neither barrier can ever be satisfied and the kernel hangs.