Does __syncthreads not work across multiple warps?

robosmith · April 28, 2014, 7:42pm

I have a single block with 30x34, threads which communicate using shared memory protected by multiple __syncthreads() calls.

There appears to be some kind of race condition giving me different answers (sometimes correct) depending on the call (identical inputs).

Basically it is just doing a complex matrix multiplication followed by division by the mean(abs) in a loop 10 times.

It appears that some threads are getting ahead of others in a random fashion, despite the __syncthreads() barriers.

None of __syncthreads() calls are in the binary reduction conditionals.

Does a block have to be confined to one warp for __syncthreads to work?

Robert_Crovella · April 28, 2014, 7:49pm

No, __syncthreads() is a barrier for all threads in a block. The documentation has additional information:

[url]Programming Guide :: CUDA Toolkit Documentation

Specifically for shared memory race conditions, the cuda-memcheck tool has some options to do race-condition checking. Use:

cuda-memcheck --help

to learn the command line switches to use it. (–tool racecheck)

njuffa · April 28, 2014, 7:49pm

__syncthreads() works across all warps in a thread block. Since there is no code for us to look at: two of the most frequent programming mistakes related to the use of __syncthreads() are:

(1) __syncthreads() is used in a non-uniform code branch, leading to undefined behavior that can look like a race condition
(2) There is an actual race condition because there is no __synchthreads() call guarding control flow along the backward branch of a loop.

cuda-memcheck has a race-checking tool that can help you find race conditions.

robosmith · April 28, 2014, 8:03pm

Thanks for your advice.

I enabled the memory checker and made “synchronize memory accesses” true.

It now gives the correct answer consistently. Does that mean the debugger was causing the race condition?

Or is there some other problem now hidden?

Robert_Crovella · April 28, 2014, 8:24pm

I guess the memory checking you are referring to is the one in nsight VSE. When you enable this (I think it’s essentially the same functionality as in the cuda-memcheck standalone tool) the actual execution order of threadblocks, (and perhaps even warps within threadblocks – not sure), can be modified by the tool. In a well designed CUDA program, this should not matter, as the results are supposed to be correct independent of order of execution of threadblocks (and to some degree, warps).

If you get correct results with memory checker enabled, I think it’s just an initial datapoint indicating an order-dependent race condition.

I would suggest trying the standalone cuda-memcheck tool, or else debug the race condition directly.

I don’t think debuggers should be able to “cause race conditions” in parallel code, unless the code already has the latent possibility of a race condition.

If your code gives correct results in nsight VSE with memory checker enabled, but incorrect results in other cases, it still has a problem.

robosmith · April 28, 2014, 11:36pm

Here is my kernel code for complex matrix * vector. I’ve used this binary reduction many times w/o this sort of problem. 30x34 threads, 1 block with 30x34 float2 shared memory.

Can anyone spot the race condition?

__global__ void 
kernel(float2 *pfcM, float2 *pfcrand)
{
	int tx = threadIdx.x, ty = threadIdx.y;
	int w = blockDim.x, h = blockDim.y;
	int tid = tx * h + ty;

	float2 fcM = pfcM[tid];
	// init rand in, pfcM 1st 12 is out
	float2 fcrand = pfcrand[tx];
	__syncthreads();

	int sid = tx * h + ty;
	extern __shared__ float2 fcshared[];																//12 x 12 

	fcshared[sid].x = fcrand.x * fcM.x - fcrand.y * fcM.y;
	fcshared[sid].y = fcrand.x * fcM.y + fcrand.y * fcM.x;
	__syncthreads();

	//compute binary reduction for block rows = sum(Mrow x Vcol (really tx))
	unsigned int sidhalf = w >> 1;
	unsigned int nodd = (w & 0x00000001);
	unsigned int ncompare = sidhalf;
	sidhalf += nodd;

	int nsidoffs = sid + sidhalf * h;
	float2 fc, fcp, fz = {0.0,0.0};

	do
	{
		fc = fcshared[sid];
		fcp = tx < ncompare ? fcshared[nsidoffs] : fz;
		fc.x += fcp.x;
		fc.y += fcp.y;

		// write back sum of sid & sid + binary offs
		fcshared[sid] = fc;

		nodd = (sidhalf & 0x00000001);
		sidhalf = sidhalf > 1 ? nodd + (sidhalf >> 1) : 0;
		ncompare = sidhalf - nodd;
		nsidoffs = sid + sidhalf * h;
		__syncthreads();
	} while (sidhalf > 0);
	__syncthreads();

	// swap the 1st shared (v) column to rows of threads for subsequent M*v in loop
	fcrand = fcshared[tx];
	__syncthreads();

	if(ty == 0)
	{
		// write out the vector (product of matrix and vector)
		pfcM[tx] = fcrand;
	}
}

tera · April 28, 2014, 11:47pm

You are missing a __syncthreads() between the read in line 32 and the write in line 37 (write-after-read hazard).

robosmith · April 29, 2014, 5:20am

Thanks for your advice. I will check that out, but I was under the impression that since only one thread reads and writes to that location (sid), there would be no conflict between different threads.

FWIW, I hope you’re right.

robosmith · April 29, 2014, 3:19pm

Unfortunately, that does not fix the problem.

robosmith · April 30, 2014, 4:15pm

Well, it turns out there was nothing wrong with the CUDA kernel.

The problem was that I believed the docs for rand() which provided one of the inputs.

Supposedly the sequence is the same when init with srand(const), but it clearly is not always.

Thanks for the help.

Topic		Replies	Views
__syncthreads thread syncronization CUDA Programming and Performance	7	18618	October 27, 2009
Semantics of __syncthreads CUDA Programming and Performance	18	18022	January 2, 2008
Problems with __syncthreads() CUDA Programming and Performance	2	896	May 4, 2013
How do some threads in a block get past __syncthreads, while others are still before it? CUDA Programming and Performance	9	1704	July 23, 2015
why I do not have a problem with __syncthreads ? CUDA Programming and Performance	10	7167	May 26, 2010
Problem with __syncthreads() It does not work for threads > 64 CUDA Programming and Performance	4	896	April 11, 2012
apparent shared memory race condition despite using syncthreads CUDA Programming and Performance	5	14386	July 15, 2010
The result is unpredictable. CUDA Programming and Performance	6	1093	October 25, 2013
A stupid question on __syncthread() function CUDA Programming and Performance	5	5400	May 17, 2022
Cuda: threads over 2 warps not synchronising correctly Legacy PGI Compilers	5	6908	May 26, 2011

Does __syncthreads not work across multiple warps?

Related topics