Atomic Operations in CUDA

I am trying to use the functionality of atomic operations at shared memory level for syncronizing between threads inside a block. I have a shared data structure and this structure should only be modified by one thread at a time .

I have used following code

global void …

shared float SHDATA[100];

shared int access;

if(threadIdx.x == 0)

access = 0;


int access_local = 1;

while(access_local !=0)
access_local = atomicExch(&access, access_local)


access_local = atomicExch(&access,access_local)


The above results into hanging of the GPU. The GPU needs to be rebooted in order to run any other program on the device. Do I need to modify some things in order to use Atomic operations. I have a GeForce GTX280 device which has compute capability 1.3 and hence should support atomic operations at shared mem level.

Strangely, when I reboot the computer and run this code, I get a print message which is called after the function and program hangs without completing the operations. No code runs on the device further unless I reboot it…Surely need some help here,

I believe when a warp diverges, one branch executes while the other branch goes “on hold”. When the executing branch reaches the join point, it then waits while the other branch resumes. When both branches have reached the join point, the warp is no longer divergent and resumes executing. If one branch is on hold, and the other branch is waiting for a lock which the other branch will never release (because it’s on hold), then this causes deadlock.

More here and here.

Perhaps you can do something like this:

for (int i=0; i < 32; i++) {

  if (threadIdx.x % 32 == i) {

	// get lock

	// critical section

	// release lock



If no two threads within a warp are contending for the resource then they can’t deadlock due to branch scheduling. The for loop serializes the threads within a warp. Threads in different warps will still be allowed in the critical section one at a time.

Thanks… this removes the deadlock, but I have some problem understanding the mechanism in which above code removes the deadlock. Suppose, initially three threads in a warp were contending for the resource ( let id: 1, 4,5)

With above code, threadId1 would get into the critical section and suppose we have enough instructions inside critical section so that in meantime threadId4 and threadId5 get through the comparison of threadID to i, and then both are waiting on get lock. Hence, two threads from same warp are contending for resource, so shouldn’t this also create deadlock??

I am trying to understand the basic mechanism here, as adding this code had made my code correct but the code is running too slow


From the programming guide:

The problem has to do with the SIMT mechanism and the way threads within a warp are effectively synchronized. When a branch occurs, the entire warp executes one branch (with some threads masked out so the operations have no effect), and then the entire warp executes the other branch (with the other threads masked out). So while thread 1 is in the critical section, it’s not possible for threads 4 and 5 to go ahead and attempt to get the lock, because they are stalled waiting for the branch to complete. This sequential execution of branches is the reason for the deadlock in the first place, because if all threads but one are waiting for the lock, (and if the waiting threads happen to be scheduled first) then the thread with the lock cannot proceed and release the lock.

Thanks, this is much more informative than the CUDA documentation :). But as blockSize gets larger this doesn’t guarantee consistency if shared data will be used on global scale afterwards. In my tests it was stable when blocksize is smaller than 128. It’s not because of your code, it needs a barrier mechanism whatsoever.

Also you can use warpSize instead of 32 in the code.

For this kind of locking threads, I’ve posted an example that can work for 65535x512 threads:…st&p=549687