Syncthreads and Stalling Kernels

It is my understanding that all threads in a block are required to hit the same number of __syncthreads() calls. What is the expected behavior if this is not the case?

I stumbled upon this issue by accident. In a kernel where I do not follow this rule, everything works fine on a GTX 280, but on an old 8800 GTX the application hangs when trying to read memory back to the host after running the kernel (A) and never appears to recover; I must kill the application. The single modification in code B allows the code to run fine. I had someone try this on an 8800 GS and both versions ran fine for them. I'm not sure if this is a driver issue or an architecture issue.

A.

if(tid<length)
{
	for (unsigned int s=1; s < blockDim.x; s *= 2)
	{
		int index = 2 * s * threadIdx.x;
		if (index < blockDim.x && index+s <= length)
			sdata[index] += sdata[index+s];
		__syncthreads();
	}
}

B.

for (unsigned int s=1; s < blockDim.x; s *= 2)
{
	if(tid<length)
	{
		int index = 2 * s * threadIdx.x;
		if (index < blockDim.x && index+s <= length)
			sdata[index] += sdata[index+s];
	}
	__syncthreads();
}
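
A minimal way to confirm that it is the kernel (and not the copy) that hangs would be to synchronize and check the error right after the launch. This is a sketch only; "reduceKernel", d_data, h_data, gridSize, blockSize, length and size are placeholders, not my actual code:

// Sketch: placeholder kernel name and arguments.
reduceKernel<<<gridSize, blockSize>>>(d_data, length);

// Wait for the kernel before copying back; if the kernel deadlocks,
// it is this call (not the memcpy) that never returns.
cudaError_t err = cudaThreadSynchronize();
if (err != cudaSuccess)
	printf("kernel failed: %s\n", cudaGetErrorString(err));

// Only then copy the result back to the host.
cudaMemcpy(h_data, d_data, size, cudaMemcpyDeviceToHost);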

Any ideas?

It's not just the same number of __syncthreads() calls; it is best to ensure that all threads execute the same __syncthreads(). If this is not done, just assume that your kernel is going to deadlock and never return. It should do so on all hardware generations; I'm not sure why you get different behavior on different cards.
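
To illustrate the distinction (hypothetical kernel, not from the code above; assumes a 128-thread block): both branches below contain one __syncthreads(), so every thread executes the same number of barriers, but they are not the same __syncthreads() - exactly the case to avoid:

__global__ void mismatched_barriers(const float* a, const float* b, float* out)
{
	__shared__ float smem[128];

	if (threadIdx.x < 64)
	{
		smem[threadIdx.x] = a[threadIdx.x];
		__syncthreads();   // barrier reached only by threads 0-63
	}
	else
	{
		smem[threadIdx.x] = b[threadIdx.x];
		__syncthreads();   // a textually different barrier, reached by threads 64-127
	}

	// Same number of barriers per thread, but not the same barrier:
	// behaviour is undefined, and the block may deadlock here or later.
	out[threadIdx.x] = smem[threadIdx.x];
}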

AFAIK the exact behaviour is undefined, so you can't rely on your kernel continuing to run. In practice it hasn't really been an issue; I know that on my GTX 285 this was unproblematic too. Just try to avoid it for the sake of portability.

Makes sense. I'm still curious about the nuts-and-bolts explanation of why a kernel deadlocks over this, and what apparently changed in the newer cards that avoids the issue.

Also, it would be nice if the compiler had some way of noticing syncthreads that will not be reached by all threads and warning or failing at compile time. Doesn’t sound like a horrible challenge for a compiler writer, but what do I know?

Uh, that’s the halting problem.

So add it to your To Do list. That way, whenever someone comes along with a task you wish to avoid, you can legitimately say that you’re busy with another project which will keep you occupied until the end of eternity :D

Yes I’ve been noticing the same thing when branching at warp level…

This example code runs without any problem:

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>

__global__ void test_kernel(float* in)
{
	__shared__ float smem[128];
	float val = 1.0f;

	smem[threadIdx.x] = in[threadIdx.x];
	__syncthreads();

	if(threadIdx.x < 64)
	{
		smem[threadIdx.x] += smem[threadIdx.x + 64];

		__syncthreads();

		if(threadIdx.x < 32)
		{
			smem[threadIdx.x] += smem[threadIdx.x + 32];
			smem[threadIdx.x] += smem[threadIdx.x + 16];
			smem[threadIdx.x] += smem[threadIdx.x + 8];
			smem[threadIdx.x] += smem[threadIdx.x + 4];
			smem[threadIdx.x] += smem[threadIdx.x + 2];
			smem[threadIdx.x] += smem[threadIdx.x + 1];
		}

		__syncthreads();

		// just some manipulation to make syncthreads() matter
		val = smem[0]*(threadIdx.x+1);
	}

	__syncthreads();

	// just some manipulation to make syncthreads() matter
	val *= smem[0]*(threadIdx.x+1);
	in[threadIdx.x] = val; // Result for in[0] is arraySum^2
}

int main(int argc, char* argv[])
{
	int n = 128;
	int size = n*sizeof(float);
	float* d_in;
	float* in = (float*)malloc(size);
	float sum = 0.0f;

	for(int i = 0; i < n; i++)
	{
		in[i] = i;
		sum += i;
	}

	cudaMalloc((void**)&d_in, size);
	cudaMemcpy(d_in, in, size, cudaMemcpyHostToDevice);
	test_kernel<<< 1, n >>>(d_in);
	cudaMemcpy(in, d_in, size, cudaMemcpyDeviceToHost);

	printf("\n GPU value: %0.3f\n", in[0]);
	printf("\n CPU value: %0.3f\n", sum*sum);
	printf("\n %s\n", cudaGetErrorString(cudaGetLastError()));

	return 0;
}

It's basically

__syncthreads();
if(tid < 64)
{
	…code…
	sync
	…code…
	sync
}
sync
…code…

It's a one-block reduction on 128 values that produces arraySum^2 … I guess the behaviour of this code is unpredictable on different machines?
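
For comparison, here is an untested sketch (not the code I'm asking about) of the same reduction restructured so that every __syncthreads() is executed by the whole block:

__global__ void test_kernel_uniform(float* in)
{
	__shared__ float smem[128];
	float val = 1.0f;

	smem[threadIdx.x] = in[threadIdx.x];
	__syncthreads();

	// Each step guards the work, but the barrier is reached by all 128 threads.
	for (unsigned int s = 64; s > 0; s >>= 1)
	{
		if (threadIdx.x < s)
			smem[threadIdx.x] += smem[threadIdx.x + s];
		__syncthreads();
	}

	// Same post-reduction manipulation as above; no further barrier is
	// needed because the loop already ended with one.
	if (threadIdx.x < 64)
		val = smem[0]*(threadIdx.x+1);

	val *= smem[0]*(threadIdx.x+1);
	in[threadIdx.x] = val; // in[0] should again be arraySum^2
}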

Yes, I'd suppose that your original (divergent) version would fail, at least on my magical GeForce 8800 GTX card. I'll try it sometime and see if it does for sure.

Tmurray, I suppose you are right. Silly me, I've forgotten last semester's compiler class already. Go work on it anyway. :)

Oh, and also about this… There is a utility out there (http://www.cse.iitk.ac.in/users/lca-gpgpu-I/li.pdf) that recognizes this synchronization-barrier issue and runs pretty quickly; it's the only one I have found so far that catches this error in code. Though I think it may actually run the code to find it, so it would possibly stall when analyzing it on the suspect machine. Hm…
