atomicCAS issue (possible deadlock)

Hello,

I have recently come across an atomicCAS issue. To demonstrate it I am giving the code below which hangs:

__global__ void test( unsigned int *d_acc ){
	d_acc[0] = 0;        // every thread writes 0 before anyone increments
	__syncthreads();

	for (int i = 0; i < 1000; i++){
		unsigned int oldVal = d_acc[0];
		unsigned int assumedVal;
		unsigned int newVal;
		do {
			assumedVal = oldVal;
			newVal = assumedVal + 1;

			// atomicCAS returns the value that was actually at d_acc[0];
			// if it differs from assumedVal, another thread won the race and we retry
			oldVal = atomicCAS(&d_acc[0], assumedVal, newVal);
		} while (assumedVal != oldVal);
	}
}

I call the kernel using one block of 512 threads:
dim3 threadBlock(512,1);
dim3 blockGrid(1,1);
test<<<blockGrid, threadBlock>>>( (unsigned int*) d_accSpaces );

I suspect that atomicCAS is causing a deadlock, but I do not understand why. If I instead launch the kernel with 32 threads (i.e. 1 warp) there is no problem. It might be something related to the concurrent execution of more than one warp…

If I replace atomicCAS with atomicAdd (removing the do-while loop) it works fine.
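
Roughly, the atomicAdd variant I mean looks like this (a sketch, not my exact code; the kernel name is just for illustration):

__global__ void test_add( unsigned int *d_acc ){
	d_acc[0] = 0;
	__syncthreads();

	for (int i = 0; i < 1000; i++){
		atomicAdd(&d_acc[0], 1u);   // single hardware atomic, no retry loop
	}
}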

It also works fine for a smaller number of iterations in the for loop, but I suspect that there is an element of randomness concerning warp scheduling.

Does anybody see a reason why this use of atomicCAS could lead to a deadlock? (This use of atomicCAS seems pretty straightforward, as it is the one suggested by the manual too.)

Thank you very much!

Are you sure the code hangs indefinitely, or might it just take a very long time? In your example d_acc[0] is extremely contended.
If you have just one warp contending for the same variable, every pass through the loop is guaranteed to make progress for one thread, so the loop finishes after 32 iterations (linear in the number of threads). If you have more than one warp, only one of the contending warps is guaranteed to make progress at a time, so the total number of loop iterations across all warps grows roughly with the square of the number of contending threads.
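
A very rough back-of-envelope estimate for your numbers: 512 threads × 1000 increments each is 512,000 successful CAS operations, and with roughly one failed attempt per other contending thread around each success you end up on the order of 512 × 512 × 1000 ≈ 2.6 × 10^8 serialized atomic operations on a single word.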

thanks a lot for your reply.

I launch the kernel successfully and then I call CUDAU_CHECK_ERROR, which should wait until the kernel exits (it calls cudaThreadSynchronize()). An exception is then thrown with the message: “the launch timed out and was terminated”.

It seems to me that there is a deadlock because of the use of atomicCAS, probably related to having more than one warp, but I do not understand why…

I know the code does not do anything useful; I wrote it to demonstrate the problem. In the real case I am using atomicCAS to perform addition on shorts (and the stores there should not be that contended). That kernel also shows the same behavior (“the launch timed out and was terminated”).
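
In case it helps, the pattern I use for the shorts is essentially the standard way of emulating a 16-bit atomic add on top of the 32-bit atomicCAS. A rough sketch (the helper name is just for illustration, not my exact code):

__device__ void atomicAddShort(unsigned short *address, unsigned short val)
{
	// pointer to the aligned 32-bit word that contains the 16-bit target
	unsigned int *base = (unsigned int *)((size_t)address & ~(size_t)2);
	// position of the target half-word within that word (little-endian GPU)
	unsigned int shift = ((size_t)address & 2) ? 16 : 0;

	unsigned int old = *base, assumed;
	do {
		assumed = old;
		unsigned short cur = (unsigned short)((assumed >> shift) & 0xffff);
		unsigned int updated = (assumed & ~(0xffffu << shift))
		                     | ((unsigned int)(unsigned short)(cur + val) << shift);
		old = atomicCAS(base, assumed, updated);
	} while (assumed != old);
}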

I would be very grateful if you have an insight about this.

One more observation:

This behavior can also happen with just one warp. I reduced the number of threads to 32 and increased the number of iterations in the for loop to 50000, and I get the same kernel timeout exception.

Once again replacing the do-while and atomicCAS with atomicAdd works perfectly fine.

That’s what I suspected: There is nothing wrong with your code apart from the fact that it is very slow. On systems where the GPU also drives the user interface, kernels are subject to a runtime limit of between 2 and 5 seconds to make sure a runaway kernel does not render the machine unusable. Schedule less work per kernel invocation, or run the kernels on a dedicated CUDA card to avoid having your kernel terminated prematurely.
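
If you want to keep the algorithm as it is, one simple way to stay under the watchdog is to split the iterations over several shorter launches. A host-side sketch, assuming a variant of your kernel (here called test_chunk, just for illustration) that takes the per-launch iteration count as a parameter and does not reset d_acc[0] itself:

const int totalIterations = 50000;
const int itersPerLaunch  = 1000;

cudaMemset(d_accSpaces, 0, sizeof(unsigned int));   // initialise the counter once from the host
for (int done = 0; done < totalIterations; done += itersPerLaunch) {
	test_chunk<<<blockGrid, threadBlock>>>( (unsigned int*) d_accSpaces, itersPerLaunch );
	cudaThreadSynchronize();   // each launch now finishes well inside the watchdog limit
}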

wow I was not expecting it to be that slow.

thanks a lot, that thing was driving me crazy:)