__syncthreads() + shared memory issue

I’ve got an issue I’ve been banging my head against for a couple of days, and I’m reaching the point of suspecting CUDA itself is at fault.

Reason being, I’ve printed each thread’s ID plus the value of a shared variable directly after a __syncthreads() call, and one thread has a different value from the others.

How is this possible if __syncthreads() is supposed to make the latest value in shared memory visible to all threads??

I’ve also tried making the shared boolean ‘volatile’, but without success.
Also note the ‘incorrect use of __syncthreads()’ error is because thread 9 enters a branch containing a __syncthreads() call, since its dataToLoad value is 1.

Here’s the code and its output:

CODE:
__shared__ bool dataToLoad;
__syncthreads();
printf("Thread:%2d dataLoad:%d\n", threadIdx.x, dataToLoad);

OUTPUT:
Thread: 0 dataLoad:0
Thread: 1 dataLoad:0
Thread: 2 dataLoad:0
Thread: 3 dataLoad:0
Thread: 4 dataLoad:0
Thread: 5 dataLoad:0
Thread: 6 dataLoad:0
Thread: 7 dataLoad:0
Thread: 8 dataLoad:0
Thread: 9 dataLoad:1
Error:incorrect use of __syncthreads()

From the NVIDIA Programming Guide, p. 21:
“Only after the execution of a __syncthreads() (Section 4.4.2) are writes to
shared variables guaranteed to be visible by other threads.”

As far as I can see, this should be impossible; I don’t see how any code or errors on my part could cause the above output. A shared variable should have the same value amongst all threads after a __syncthreads() call…

ps. I can’t paste the whole code since this is work for a company, top secret hush hush stuff ;)

Device emulation does not model the underlying hardware correctly!

Shared memory variables need to be initialized! Operations on uninitialized variables are just undefined.
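For reference, the usual initialization pattern is something like this (a minimal sketch; the flag name is taken from the snippet above, everything else is made up):

    __global__ void kernel()
    {
        __shared__ bool dataToLoad;    // undefined until some thread writes it

        if (threadIdx.x == 0)
            dataToLoad = false;        // exactly one thread gives the flag a defined value

        __syncthreads();               // the write is now guaranteed visible to every thread

        // ... from here on, all threads read the same, defined value of dataToLoad ...
    }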

Cheers for the input Sarnath, but I’ve initialised the variable. It’s not until well into the kernel’s execution that the issue arises.

I don’t suppose anyone knows of documentation on __syncthreads() beyond the few sentences in the programming guide?
I’m wondering whether threads sync on specific calls to __syncthreads(), or whether they simply block until all threads have hit a call, rather than the same one?

It seems that whenever I set this shared boolean to true (which makes the ‘if’ following the code I pasted evaluate to true), I get the CUDA error on the first subsequent occurrence of __syncthreads(). I wonder if the CUDA compiler is erroneously assuming that, since the variable is linked to a potentially divergent ‘if’, threads will diverge and miss __syncthreads() calls? Though as far as I can see, if __syncthreads() does what it’s supposed to, no threads should diverge.

Sorry if this is too basic, but:

Is one thread executing __syncthreads() more times than the others? That would be bad.

Also, can you reproduce this behavior OUTSIDE the emulator?

“Also note the ‘incorrect use of __syncthreads()’ error is because thread 9 enters a branch containing a __syncthreads() call, since its dataToLoad value is 1.”

Yeah, you can’t have __syncthreads() in divergent branches.
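Roughly the difference, as a minimal sketch (not your actual code):

    __global__ void kernel()
    {
        __shared__ bool dataToLoad;

        if (threadIdx.x == 0)
            dataToLoad = true;       // stand-in for however the flag really gets set
        __syncthreads();

        // Broken: if dataToLoad ever differed between threads, only some of them
        // would reach this barrier, and the behaviour is undefined.
        // if (dataToLoad) {
        //     ... load data into shared memory ...
        //     __syncthreads();
        // }

        // Safe: keep the barrier outside the branch so every thread reaches it,
        // whether or not it did any loading.
        if (dataToLoad) {
            // ... load data into shared memory ...
        }
        __syncthreads();
    }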

I believe I ran into a similar problem of incorrect thread synchronization in emulation mode. I am using a shared array as well. My code goes like this:

    extern __shared__ char array[];

    int tx = threadIdx.x, ty = threadIdx.y;
    int offset = SIZE * blockIdx.x + tx + DIM * ty;

    array[offset] = 3;
    __syncthreads();

When I run this program in debug+emulation mode, all threads except one seem to block on __syncthreads(). The rebel thread just comes through somehow, so the program deadlocks. The program runs fine on the GPU.

I haven’t really tested whether the shared memory is to blame, but what worries me is that debugging isn’t possible (or can I debug kernels without emulation?).

I am using Ubuntu, CUDA 2.0 beta, and NetBeans 6.1.

I tried commenting out the shared memory write in my code above and the program runs fine. As soon as I uncomment it, it deadlocks again. So this must be some emulation bug, right?
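One thing worth ruling out before blaming the emulator: in emulation mode the kernel runs as ordinary host code, so an out-of-bounds write to array[] is just a stray write into host memory and can corrupt the emulator’s own bookkeeping, which can look exactly like one thread falling through a barrier. A quick sanity check might look like this (SIZE and DIM stand in for your real values, and smemBytes is a made-up parameter carrying the dynamic shared-memory size from the launch):

    #include <assert.h>

    #define SIZE 256                 // placeholders: substitute your real values
    #define DIM  16

    __global__ void kernel(int smemBytes)
    {
        extern __shared__ char array[];

        int tx = threadIdx.x, ty = threadIdx.y;
        int offset = SIZE * blockIdx.x + tx + DIM * ty;

        // Shared memory is per-block, so an offset that grows with blockIdx.x
        // is worth double-checking against the size you actually allocate.
        // In emulation the kernel runs as host code, so assert() works here
        // (it won't compile for the real GPU under CUDA 2.0).
        assert(offset >= 0 && offset < smemBytes);

        array[offset] = 3;
        __syncthreads();
    }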

“Yeah, you can’t have __syncthreads() in divergent branches.” - tmurray

Yeah, this is true; however, my code shouldn’t diverge, as all threads’ conditional branches operate on the same shared memory, so they should all evaluate to true/false identically, provided __syncthreads() is making the values in shared memory visible as it should.

smokyboy, I think you’re describing the same problem I’m having. I’ve managed to get around it by moving the shared boolean to outside the branch. I still haven’t figured out why it deadlocks.

I’m starting to wonder if there’s some data structure running in the background that describes the dependencies between variables and conditional logic across threads, and it’s erroneously flagging a deadlock situation.

I’ve tried putting a __syncthreads() before/after every line that operates on shared memory, so I don’t see how user error could be causing a deadlock if every operation should be synchronised. The only way there could be a deadlock then is if I were using the threadIdx variable in an ‘if’, which I’m not.

I’m hoping NVIDIA had the foresight to sync up individual __syncthreads() calls, and not simply keep a counter; otherwise, if one thread called __syncthreads() one more time than the others, that would explain why it falls through a call while the others hold. But if they were prudent and ID’d each __syncthreads() call, it should flag the error upon the next call to __syncthreads(). Would be nice if NVIDIA gave a bit more documentation about the guts of CUDA…