Does __syncthreads not work across multiple warps?

I have a single block with 30x34, threads which communicate using shared memory protected by multiple __syncthreads() calls.

There appears to be some kind of race condition giving me different answers (sometimes correct) depending on the call (identical inputs).

Basically it is just doing a complex matrix multiplication followed by division by the mean(abs) in a loop 10 times.

It appears that some threads are getting ahead of others in a random fashion, despite the __syncthreads() barriers.

None of __syncthreads() calls are in the binary reduction conditionals.

Does a block have to be confined to one warp for __syncthreads to work?

No, __syncthreads() is a barrier for all threads in a block. The documentation has additional information:

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#synchronization-functions

Specifically for shared memory race conditions, the cuda-memcheck tool has some options to do race-condition checking. Use:

cuda-memcheck --help

to learn the command line switches to use it. (–tool racecheck)

__syncthreads() works across all warps in a thread block. Since there is no code for us to look at: two of the most frequent programming mistakes related to the use of __syncthreads() are:

(1) __syncthreads() is used in a non-uniform code branch, leading to undefined behavior that can look like a race condition
(2) There is an actual race condition because there is no __synchthreads() call guarding control flow along the backward branch of a loop.

cuda-memcheck has a race-checking tool that can help you find race conditions.

Thanks for your advice.

I enabled the memory checker and made “synchronize memory accesses” true.

It now gives the correct answer consistently. Does that mean the debugger was causing the race condition?

Or is there some other problem now hidden?

I guess the memory checking you are referring to is the one in nsight VSE. When you enable this (I think it’s essentially the same functionality as in the cuda-memcheck standalone tool) the actual execution order of threadblocks, (and perhaps even warps within threadblocks – not sure), can be modified by the tool. In a well designed CUDA program, this should not matter, as the results are supposed to be correct independent of order of execution of threadblocks (and to some degree, warps).

If you get correct results with memory checker enabled, I think it’s just an initial datapoint indicating an order-dependent race condition.

I would suggest trying the standalone cuda-memcheck tool, or else debug the race condition directly.

I don’t think debuggers should be able to “cause race conditions” in parallel code, unless the code already has the latent possibility of a race condition.

If your code gives correct results in nsight VSE with memory checker enabled, but incorrect results in other cases, it still has a problem.

Here is my kernel code for complex matrix * vector. I’ve used this binary reduction many times w/o this sort of problem. 30x34 threads, 1 block with 30x34 float2 shared memory.

Can anyone spot the race condition?

__global__ void 
kernel(float2 *pfcM, float2 *pfcrand)
{
	int tx = threadIdx.x, ty = threadIdx.y;
	int w = blockDim.x, h = blockDim.y;
	int tid = tx * h + ty;

	float2 fcM = pfcM[tid];
	// init rand in, pfcM 1st 12 is out
	float2 fcrand = pfcrand[tx];
	__syncthreads();

	int sid = tx * h + ty;
	extern __shared__ float2 fcshared[];																//12 x 12 

	fcshared[sid].x = fcrand.x * fcM.x - fcrand.y * fcM.y;
	fcshared[sid].y = fcrand.x * fcM.y + fcrand.y * fcM.x;
	__syncthreads();

	//compute binary reduction for block rows = sum(Mrow x Vcol (really tx))
	unsigned int sidhalf = w >> 1;
	unsigned int nodd = (w & 0x00000001);
	unsigned int ncompare = sidhalf;
	sidhalf += nodd;

	int nsidoffs = sid + sidhalf * h;
	float2 fc, fcp, fz = {0.0,0.0};

	do
	{
		fc = fcshared[sid];
		fcp = tx < ncompare ? fcshared[nsidoffs] : fz;
		fc.x += fcp.x;
		fc.y += fcp.y;

		// write back sum of sid & sid + binary offs
		fcshared[sid] = fc;

		nodd = (sidhalf & 0x00000001);
		sidhalf = sidhalf > 1 ? nodd + (sidhalf >> 1) : 0;
		ncompare = sidhalf - nodd;
		nsidoffs = sid + sidhalf * h;
		__syncthreads();
	} while (sidhalf > 0);
	__syncthreads();

	// swap the 1st shared (v) column to rows of threads for subsequent M*v in loop
	fcrand = fcshared[tx];
	__syncthreads();

	if(ty == 0)
	{
		// write out the vector (product of matrix and vector)
		pfcM[tx] = fcrand;
	}
}

You are missing a __syncthreads() between the read in line 32 and the write in line 37 (write-after-read hazard).

Thanks for your advice. I will check that out, but I was under the impression that since only one thread reads and writes to that location (sid), there would be no conflict between different threads.

FWIW, I hope you’re right.

Unfortunately, that does not fix the problem.

Well, it turns out there was nothing wrong with the CUDA kernel.

The problem was that I believed the docs for rand() which provided one of the inputs.

Supposedly the sequence is the same when init with srand(const), but it clearly is not always.

Thanks for the help.