I’ve got a little question about working with global memory.
My kernel needs to write a variable to global memory when some condition holds. I can guarantee that the condition evaluates to TRUE on no more than one thread in the whole grid.
The problem is that even when the condition is TRUE, the write never occurs unless I call __syncthreads() beforehand. Sample code is shown below:
uint4 vec = CalcVec( data );
// If __syncthreads() is present, the write is performed
__syncthreads();
if( vec.x == myConst )
    *( (unsigned int*) pdOut ) = tid;
I really cannot understand why this __syncthreads() is needed: my kernel does not use local or shared memory at all (.cubin says: lmem=0, smem=20, reg=14).
I ran into this once before. Making each thread write through a different pointer solved my problem that time. You could also try declaring pdOut volatile.
However, later investigation showed that my program had buffer overflows elsewhere, so I’m not sure what exactly caused my problem. You should check your code for similar issues as well.
The behavior you describe shouldn’t happen, so it’s a bug in either your code or CUDA. Please post minimal code that reproduces the problem as a single .cu file that compiles with nvcc and has no other dependencies.
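For reference, a single-file reproduction of the kind requested here might look roughly like the sketch below. It is only a skeleton under assumptions: the real CalcVec() and its input data were not posted, so the CalcVec() body, MY_CONST, NO_MATCH, findMatch, runRepro, the target parameter, and the 64x128 launch configuration are all hypothetical stand-ins. The kernel performs the conditional write without __syncthreads(), and the host checks every CUDA call so that a failed launch is not mistaken for a missing write.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define MY_CONST 0xDEADBEEFu   // hypothetical stand-in for myConst
#define NO_MATCH 0xFFFFFFFFu   // sentinel: "no thread wrote anything"

// Hypothetical stand-in for CalcVec(): only thread `target` produces vec.x == MY_CONST.
__device__ uint4 CalcVec(unsigned int tid, unsigned int target)
{
    unsigned int x = (tid == target) ? MY_CONST : tid;
    return make_uint4(x, 0u, 0u, 0u);
}

__global__ void findMatch(unsigned int* pdOut, unsigned int target)
{
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    uint4 vec = CalcVec(tid, target);
    // Deliberately no __syncthreads(): at most one thread takes this branch.
    if (vec.x == MY_CONST)
        *pdOut = tid;
}

// Launches the kernel and returns the value read back from *pdOut,
// or NO_MATCH if nothing was written or a CUDA call failed.
unsigned int runRepro(unsigned int target)
{
    unsigned int* pdOut = nullptr;
    unsigned int hOut = NO_MATCH;

    cudaError_t err = cudaMalloc((void**)&pdOut, sizeof(unsigned int));
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc: %s\n", cudaGetErrorString(err));
        return NO_MATCH;
    }

    // Initialise the output to the sentinel so a missing write is detectable.
    err = cudaMemcpy(pdOut, &hOut, sizeof(hOut), cudaMemcpyHostToDevice);

    if (err == cudaSuccess) {
        findMatch<<<64, 128>>>(pdOut, target);   // 8192 threads in total
        err = cudaGetLastError();                // launch errors
    }
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();           // errors during execution
    if (err == cudaSuccess)
        err = cudaMemcpy(&hOut, pdOut, sizeof(hOut), cudaMemcpyDeviceToHost);

    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        hOut = NO_MATCH;
    }
    cudaFree(pdOut);
    return hOut;
}

int main()
{
    const unsigned int target = 4321u;           // any tid below 64 * 128
    unsigned int result = runRepro(target);
    if (result == NO_MATCH)
        printf("no write observed\n");
    else
        printf("written tid = %u\n", result);
    return (result == target) ? 0 : 1;
}
```

This should build with something like `nvcc repro.cu -o repro` (the file name is arbitrary). If a skeleton like this reports the write while the real kernel does not, the difference probably lies in the code that was left out, for example an out-of-bounds access, which a memory checker such as compute-sanitizer (cuda-memcheck on older toolkits) may help locate.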