Atomic Operations in CUDA

jadoo · April 30, 2009, 7:10am

I am trying to use the functionality of atomic operations at shared memory level for syncronizing between threads inside a block. I have a shared data structure and this structure should only be modified by one thread at a time .

I have used following code

global void …
{

…
shared float SHDATA[100];

shared int access;

if(threadIdx.x == 0)

access = 0;

__syncthreads();

int access_local = 1;
…
for()
{
while(access_local !=0)
access_local = atomicExch(&access, access_local)

…
//CRITICAL SECTION
…

access_local = atomicExch(&access,access_local)

}

The above results into hanging of the GPU. The GPU needs to be rebooted in order to run any other program on the device. Do I need to modify some things in order to use Atomic operations. I have a GeForce GTX280 device which has compute capability 1.3 and hence should support atomic operations at shared mem level.

Strangely, when I reboot the computer and run this code, I get a print message which is called after the function and program hangs without completing the operations. No code runs on the device further unless I reboot it…Surely need some help here,

Jamie_K · April 30, 2009, 1:31pm

I believe when a warp diverges, one branch executes while the other branch goes “on hold”. When the executing branch reaches the join point, it then waits while the other branch resumes. When both branches have reached the join point, the warp is no longer divergent and resumes executing. If one branch is on hold, and the other branch is waiting for a lock which the other branch will never release (because it’s on hold), then this causes deadlock.

More here and here.

Perhaps you can do something like this:

for (int i=0; i < 32; i++) {

  if (threadIdx.x % 32 == i) {

	// get lock

	// critical section

	// release lock

  }

}

If no two threads within a warp are contending for the resource then they can’t deadlock due to branch scheduling. The for loop serializes the threads within a warp. Threads in different warps will still be allowed in the critical section one at a time.

jadoo · May 10, 2009, 6:51am

I believe when a warp diverges, one branch executes while the other branch goes “on hold”. When the executing branch reaches the join point, it then waits while the other branch resumes. When both branches have reached the join point, the warp is no longer divergent and resumes executing. If one branch is on hold, and the other branch is waiting for a lock which the other branch will never release (because it’s on hold), then this causes deadlock.

More here and here.

Perhaps you can do something like this:
for (int i=0; i < 32; i++) {

  if (threadIdx.x % 32 == i) {

	// get lock

	// critical section

	// release lock

  }

}
If no two threads within a warp are contending for the resource then they can’t deadlock due to branch scheduling. The for loop serializes the threads within a warp. Threads in different warps will still be allowed in the critical section one at a time.

Thanks… this removes the deadlock, but I have some problem understanding the mechanism in which above code removes the deadlock. Suppose, initially three threads in a warp were contending for the resource ( let id: 1, 4,5)

With above code, threadId1 would get into the critical section and suppose we have enough instructions inside critical section so that in meantime threadId4 and threadId5 get through the comparison of threadID to i, and then both are waiting on get lock. Hence, two threads from same warp are contending for resource, so shouldn’t this also create deadlock??

I am trying to understand the basic mechanism here, as adding this code had made my code correct but the code is running too slow

thanks

Jamie_K · May 10, 2009, 7:41pm

From the programming guide:

The problem has to do with the SIMT mechanism and the way threads within a warp are effectively synchronized. When a branch occurs, the entire warp executes one branch (with some threads masked out so the operations have no effect), and then the entire warp executes the other branch (with the other threads masked out). So while thread 1 is in the critical section, it’s not possible for threads 4 and 5 to go ahead and attempt to get the lock, because they are stalled waiting for the branch to complete. This sequential execution of branches is the reason for the deadlock in the first place, because if all threads but one are waiting for the lock, (and if the waiting threads happen to be scheduled first) then the thread with the lock cannot proceed and release the lock.

yaksari · June 9, 2009, 7:18am

I believe when a warp diverges, one branch executes while the other branch goes “on hold”. When the executing branch reaches the join point, it then waits while the other branch resumes. When both branches have reached the join point, the warp is no longer divergent and resumes executing. If one branch is on hold, and the other branch is waiting for a lock which the other branch will never release (because it’s on hold), then this causes deadlock.

More here and here.

Perhaps you can do something like this:
for (int i=0; i < 32; i++) {

  if (threadIdx.x % 32 == i) {

	// get lock

	// critical section

	// release lock

  }

}
If no two threads within a warp are contending for the resource then they can’t deadlock due to branch scheduling. The for loop serializes the threads within a warp. Threads in different warps will still be allowed in the critical section one at a time.

Thanks, this is much more informative than the CUDA documentation :). But as blockSize gets larger this doesn’t guarantee consistency if shared data will be used on global scale afterwards. In my tests it was stable when blocksize is smaller than 128. It’s not because of your code, it needs a barrier mechanism whatsoever.

Also you can use warpSize instead of 32 in the code.

cvnguyen · June 9, 2009, 7:36am

For this kind of locking threads, I’ve posted an example that can work for 65535x512 threads:
[url=“http://forums.nvidia.com/index.php?showtopic=98638&view=findpost&p=549687”]http://forums.nvidia.com/index.php?showtop...st&p=549687[/url]

Topic		Replies	Views
why this deadlocks? try to invoke a critical area CUDA Programming and Performance	11	6078	November 6, 2009
Shared mem atomics Help needed to Fix this hang CUDA Programming and Performance	12	11759	June 8, 2009
Problem with lock using atomicCAS CUDA Programming and Performance	3	3496	July 19, 2014
kernels communication and concurrency CUDA Programming and Performance	11	1525	June 8, 2015
A stupid question on __syncthread() function CUDA Programming and Performance	5	4564	May 17, 2022
DeadLock on this piece of code. Require your help. CUDA Programming and Performance	13	4726	March 17, 2010
Useful Arbitrary Atomic Operation Hack CUDA Programming and Performance	0	10053	July 20, 2008
Interactions among blocks CUDA Programming and Performance	11	11461	February 6, 2010
Synchronize all blocks in CUDA CUDA Programming and Performance	12	44114	October 25, 2013
coalescing problem CUDA Programming and Performance	4	1064	August 8, 2011

Atomic Operations in CUDA

Related topics