Syncthreads and Stalling Kernels

It is my understanding that all threads in a block are required to hit the same number of __syncthreads() calls. What is the expected behavior if this is not the case?

I stumbled upon this issue by accident. In a kernel where I do not follow this rule, everything works fine on a GTX 280, but on an old 8800 GTX the application hangs when trying to read memory back to the host after running the kernel (A) and never appears to recover; I must kill the application. The single modification in code B allows the code to run fine. I had someone try this on an 8800 GS and both versions ran fine for them. I'm not sure if this is a driver issue or an architecture issue.

A.

if(tid<length)
{
	for (unsigned int s=1; s < blockDim.x; s *= 2)
	{
		int index = 2 * s * threadIdx.x;
		if (index < blockDim.x && index+s <= length)
			sdata[index] += sdata[index+s];
		__syncthreads();
	}
}

B.

for (unsigned int s=1; s < blockDim.x; s *= 2)
{
	if(tid<length)
	{
		int index = 2 * s * threadIdx.x;
		if (index < blockDim.x && index+s <= length)
			sdata[index] += sdata[index+s];
	}
	__syncthreads();
}
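
A minimal way to confirm that it is the kernel (and not the copy) that hangs would be to synchronize and check the error right after the launch. This is a sketch only; "reduceKernel", d_data, h_data, gridSize, blockSize, length and size are placeholders, not my actual code:

// Sketch: placeholder kernel name and arguments.
reduceKernel<<<gridSize, blockSize>>>(d_data, length);

// Wait for the kernel before copying back; if the kernel deadlocks,
// it is this call (not the memcpy) that never returns.
cudaError_t err = cudaThreadSynchronize();
if (err != cudaSuccess)
	printf("kernel failed: %s\n", cudaGetErrorString(err));

// Only then copy the result back to the host.
cudaMemcpy(h_data, d_data, size, cudaMemcpyDeviceToHost);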

Any ideas?

It's not just the same number of __syncthreads() calls; it is best to ensure that all threads execute the same __syncthreads(). If this is not done, just assume that your kernel is going to deadlock and never return. It should do so on all hardware generations; I'm not sure why you get different behavior on different cards.
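
To illustrate the distinction (hypothetical kernel, not from the code above; assumes a 128-thread block): both branches below contain one __syncthreads(), so every thread executes the same number of barriers, but they are not the same __syncthreads() - exactly the case to avoid:

__global__ void mismatched_barriers(const float* a, const float* b, float* out)
{
	__shared__ float smem[128];

	if (threadIdx.x < 64)
	{
		smem[threadIdx.x] = a[threadIdx.x];
		__syncthreads();   // barrier reached only by threads 0-63
	}
	else
	{
		smem[threadIdx.x] = b[threadIdx.x];
		__syncthreads();   // a textually different barrier, reached by threads 64-127
	}

	// Same number of barriers per thread, but not the same barrier:
	// behaviour is undefined, and the block may deadlock here or later.
	out[threadIdx.x] = smem[threadIdx.x];
}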

AFAIK the exact behaviour is undefined, so you can't rely on your kernel continuing to run. In practice it hasn't really been an issue; I know that on my GTX 285 this was unproblematic too. Just try to avoid it for the sake of portability.

Makes sense. I'm still curious about the nuts-and-bolts explanation of why a kernel deadlocks over this, and what apparently changed in the newer cards that avoids the issue.

Also, it would be nice if the compiler had some way of noticing syncthreads that will not be reached by all threads and warning or failing at compile time. Doesn’t sound like a horrible challenge for a compiler writer, but what do I know?

Uh, that’s the halting problem.

So add it to your To Do list. That way, whenever someone comes along with a task you wish to avoid, you can legitimately say that you’re busy with another project which will keep you occupied until the end of eternity :D

Yes I’ve been noticing the same thing when branching at warp level…

This example code runs without any problem:

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>

__global__ void test_kernel(float* in)
{
	__shared__ float smem[128];
	float val = 1.0f;

	smem[threadIdx.x] = in[threadIdx.x];
	__syncthreads();

	if(threadIdx.x < 64)
	{
		smem[threadIdx.x] += smem[threadIdx.x + 64];

		__syncthreads();

		if(threadIdx.x < 32)
		{
			smem[threadIdx.x] += smem[threadIdx.x + 32];
			smem[threadIdx.x] += smem[threadIdx.x + 16];
			smem[threadIdx.x] += smem[threadIdx.x + 8];
			smem[threadIdx.x] += smem[threadIdx.x + 4];
			smem[threadIdx.x] += smem[threadIdx.x + 2];
			smem[threadIdx.x] += smem[threadIdx.x + 1];
		}

		__syncthreads();

		// just some manipulation to make syncthreads() matter
		val = smem[0]*(threadIdx.x+1);
	}

	__syncthreads();

	// just some manipulation to make syncthreads() matter
	val *= smem[0]*(threadIdx.x+1);
	in[threadIdx.x] = val; // Result for in[0] is arraySum^2
}

int main(int argc, char* argv[])
{
	int n = 128;
	int size = n*sizeof(float);
	float* d_in;
	float* in = (float*)malloc(size);
	float sum = 0.0f;

	for(int i = 0; i < n; i++)
	{
		in[i] = i;
		sum += i;
	}

	cudaMalloc((void**)&d_in, size);
	cudaMemcpy(d_in, in, size, cudaMemcpyHostToDevice);
	test_kernel<<< 1, n >>>(d_in);
	cudaMemcpy(in, d_in, size, cudaMemcpyDeviceToHost);

	printf("\n GPU value: %0.3f\n", in[0]);
	printf("\n CPU value: %0.3f\n", sum*sum);
	printf("\n %s\n", cudaGetErrorString(cudaGetLastError()));

	return 0;
}

It's basically

__syncthreads();
if(tid < 64)
{
	…code…
	sync
	…code…
	sync
}
sync
…code…

It's a one-block reduction on 128 values that produces arraySum^2 … I guess the behaviour of this code is unpredictable on different machines?
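
For comparison, here is an untested sketch (not the code I'm asking about) of the same reduction restructured so that every __syncthreads() is executed by the whole block:

__global__ void test_kernel_uniform(float* in)
{
	__shared__ float smem[128];
	float val = 1.0f;

	smem[threadIdx.x] = in[threadIdx.x];
	__syncthreads();

	// Each step guards the work, but the barrier is reached by all 128 threads.
	for (unsigned int s = 64; s > 0; s >>= 1)
	{
		if (threadIdx.x < s)
			smem[threadIdx.x] += smem[threadIdx.x + s];
		__syncthreads();
	}

	// Same post-reduction manipulation as above; no further barrier is
	// needed because the loop already ended with one.
	if (threadIdx.x < 64)
		val = smem[0]*(threadIdx.x+1);

	val *= smem[0]*(threadIdx.x+1);
	in[threadIdx.x] = val; // in[0] should again be arraySum^2
}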

Yes, I'd suppose that your original (divergent) version would fail, at least on my magical GeForce 8800 GTX card. I'll try it sometime and see if it does for sure.

Tmurray, I suppose you are right. Silly me, I've forgotten last semester's compiler class already. Go work on it anyway. :)

Oh, and also about this… There is a utility out there (http://www.cse.iitk.ac.in/users/lca-gpgpu-I/li.pdf) that recognizes this synchronization-barrier issue and runs pretty quickly; it's the only one I have found so far that catches this error in code. Though I think it may actually run the code to find it, so it would possibly stall when analyzing it on the suspect machine. Hm…
