__syncthreads() synchronizes across all warps in a thread block. Since there is no code for us to look at, I'll list two of the most frequent programming mistakes related to the use of __syncthreads():
(1) __syncthreads() is called inside a non-uniform (divergent) code branch, so not all threads of the block reach the barrier. This is undefined behavior and can look like a race condition.
(2) There is an actual race condition because there is no __syncthreads() call guarding control flow along the backward branch of a loop, i.e. a later iteration can read data another warp has not finished writing.
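Both mistakes can be sketched in a few lines. This is an illustrative toy kernel, not anyone's real code; sdata, tid, and the block size of 256 are placeholder names/values:

```cuda
__global__ void bad_sync_examples(float *out)
{
    __shared__ float sdata[256];
    int tid = threadIdx.x;
    sdata[tid] = (float)tid;
    __syncthreads();

    // Mistake (1): barrier inside a divergent branch.
    // Threads with tid >= 128 never reach this __syncthreads():
    // undefined behavior that often manifests like a race.
    if (tid < 128) {
        sdata[tid] += sdata[tid + 128];
        __syncthreads();   // WRONG: not reached by all threads
    }

    // Mistake (2): no barrier on the backward branch of the loop.
    // Iteration s/2 may read values another warp has not yet
    // written in iteration s.
    for (int s = 64; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        // MISSING: __syncthreads();
    }
    out[tid] = sdata[tid];
}
```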
cuda-memcheck has a race-checking tool (cuda-memcheck --tool racecheck your_app) that can help you find shared-memory race conditions.
I guess the memory checking you are referring to is the one in Nsight VSE. When you enable it (I believe it is essentially the same functionality as the standalone cuda-memcheck tool), the tool may modify the actual execution order of thread blocks (and perhaps even of warps within a thread block; I'm not sure). In a well-designed CUDA program this should not matter: the results are supposed to be correct independent of the execution order of thread blocks (and, to some degree, of warps).
If you get correct results only with the memory checker enabled, I think that is just an initial data point indicating an order-dependent race condition.
I would suggest trying the standalone cuda-memcheck tool, or else debugging the race condition directly.
I don't think debuggers should be able to "cause" race conditions in parallel code, unless the code already has the latent possibility of a race condition.
If your code gives correct results in Nsight VSE with the memory checker enabled but incorrect results otherwise, it still has a problem.
Here is my kernel code for a complex matrix * vector product. I've used this binary reduction many times without this sort of problem. 30x34 threads, one block, with 30x34 float2 shared memory.
Can anyone spot the race condition?
__global__ void
kernel(float2 *pfcM, float2 *pfcrand)
{
    int tx = threadIdx.x, ty = threadIdx.y;
    int w = blockDim.x, h = blockDim.y;
    int tid = tx * h + ty;
    float2 fcM = pfcM[tid];
    // init rand in, pfcM 1st 12 is out
    float2 fcrand = pfcrand[tx];
    __syncthreads();
    int sid = tx * h + ty;
    extern __shared__ float2 fcshared[]; // 12 x 12
    fcshared[sid].x = fcrand.x * fcM.x - fcrand.y * fcM.y;
    fcshared[sid].y = fcrand.x * fcM.y + fcrand.y * fcM.x;
    __syncthreads();
    // compute binary reduction for block rows = sum(Mrow x Vcol (really tx))
    unsigned int sidhalf = w >> 1;
    unsigned int nodd = (w & 0x00000001);
    unsigned int ncompare = sidhalf;
    sidhalf += nodd;
    int nsidoffs = sid + sidhalf * h;
    float2 fc, fcp, fz = {0.0f, 0.0f};
    do
    {
        fc = fcshared[sid];
        fcp = tx < ncompare ? fcshared[nsidoffs] : fz;
        fc.x += fcp.x;
        fc.y += fcp.y;
        // write back sum of sid & sid + binary offs
        fcshared[sid] = fc;
        nodd = (sidhalf & 0x00000001);
        sidhalf = sidhalf > 1 ? nodd + (sidhalf >> 1) : 0;
        ncompare = sidhalf - nodd;
        nsidoffs = sid + sidhalf * h;
        __syncthreads();
    } while (sidhalf > 0);
    __syncthreads();
    // swap the 1st shared (v) column to rows of threads for subsequent M*v in loop
    fcrand = fcshared[tx];
    __syncthreads();
    if (ty == 0)
    {
        // write out the vector (product of matrix and vector)
        pfcM[tx] = fcrand;
    }
}
Thanks for your advice. I will check that out, but I was under the impression that, since only one thread reads from and writes to that location (sid), there would be no conflict between different threads.
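For comparison, note that the distinction isn't only who writes a location, but whether a location one thread reads can be written by a different thread in the same phase, with no barrier between the read and the write. The standard race-free pattern lets only the lower half of the threads write in each phase and puts the barrier between phases. A generic sketch (not the posted kernel; row_reduce, row, and the power-of-two width w are illustrative assumptions):

```cuda
// Reduce one row of float2 values in shared memory.
// Assumes w is a power of two and all threads of the block
// call this function so __syncthreads() is reached uniformly.
__device__ void row_reduce(float2 *row, int tx, int w)
{
    for (int s = w >> 1; s > 0; s >>= 1) {
        if (tx < s) {
            // Only threads tx < s write, and row[tx + s] is not
            // written by any thread this phase, so the read of
            // row[tx + s] cannot race with a write.
            row[tx].x += row[tx + s].x;
            row[tx].y += row[tx + s].y;
        }
        __syncthreads(); // phase-s writes visible before phase s/2 reads
    }
}
```

In the posted kernel, by contrast, every thread writes fcshared[sid] back each iteration, while other threads read fcshared[nsidoffs] in the same iteration with no barrier between the read at the top of the loop and the write-back.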