atomicCAS issue (possible deadlock)

Hello,

I have recently come across an atomicCAS issue. To demonstrate it I am giving the code below which hangs:

__global__ void test( unsigned int *d_acc ){
	d_acc[0] = 0;        // every thread writes 0 before anyone increments
	__syncthreads();

	for (int i = 0; i < 1000; i++){
		unsigned int oldVal = d_acc[0];
		unsigned int assumedVal;
		unsigned int newVal;
		do {
			assumedVal = oldVal;
			newVal = assumedVal + 1;

			// atomicCAS returns the value that was actually at d_acc[0];
			// if it differs from assumedVal, another thread won the race and we retry
			oldVal = atomicCAS(&d_acc[0], assumedVal, newVal);
		} while (assumedVal != oldVal);
	}
}

I call the kernel using one block of 512 threads:
dim3 threadBlock(512,1);
dim3 blockGrid(1,1);
test<<<blockGrid, threadBlock>>>( (unsigned int*) d_accSpaces );

I suspect that atomicCAS is causing a deadlock, but I do not understand why. If I instead launch the kernel with 32 threads (i.e. 1 warp) there is no problem. It might be something related to the concurrent execution of more than one warp…

If I replace atomicCAS with atomicAdd (removing the do-while loop) it works fine.
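
Roughly, the atomicAdd variant I mean looks like this (a sketch, not my exact code; the kernel name is just for illustration):

__global__ void test_add( unsigned int *d_acc ){
	d_acc[0] = 0;
	__syncthreads();

	for (int i = 0; i < 1000; i++){
		atomicAdd(&d_acc[0], 1u);   // single hardware atomic, no retry loop
	}
}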

It also works fine for a smaller number of iterations in the for loop, but I suspect that there is an element of randomness concerning warp scheduling.

Does anybody see a reason why this use of atomicCAS could lead to a deadlock? (This use of atomicCAS seems pretty straightforward, as it is the one suggested by the manual too.)

Thank you very much!

Are you sure the code hangs indefinitely, or might it just take a very long time? In your example d_acc[0] is extremely contended.
If you have just one warp contending for the same variable, every pass through the loop is guaranteed to make progress for one thread, so the loop finishes after 32 iterations (linear in the number of threads). If you have more than one warp, only one of the contending warps is guaranteed to make progress at a time, so the total number of loop iterations across all warps grows roughly with the square of the number of contending threads.
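
A very rough back-of-envelope estimate for your numbers: 512 threads × 1000 increments each is 512,000 successful CAS operations, and with roughly one failed attempt per other contending thread around each success you end up on the order of 512 × 512 × 1000 ≈ 2.6 × 10^8 serialized atomic operations on a single word.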

thanks a lot for your reply.

I launch the kernel successfully and then I call CUDAU_CHECK_ERROR, which should wait until the kernel exits (it calls cudaThreadSynchronize()). An exception is then thrown with the message: “the launch timed out and was terminated”.

It seems to me that there is a deadlock because of the use of atomicCAS, probably related to having more than one warp, but I do not understand why…

I know the code does not do anything useful; I wrote it to demonstrate the problem. In the real case I am using atomicCAS to perform addition on shorts (and the stores there should not be that contended). That kernel also shows the same behavior (“the launch timed out and was terminated”).
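
In case it helps, the pattern I use for the shorts is essentially the standard way of emulating a 16-bit atomic add on top of the 32-bit atomicCAS. A rough sketch (the helper name is just for illustration, not my exact code):

__device__ void atomicAddShort(unsigned short *address, unsigned short val)
{
	// pointer to the aligned 32-bit word that contains the 16-bit target
	unsigned int *base = (unsigned int *)((size_t)address & ~(size_t)2);
	// position of the target half-word within that word (little-endian GPU)
	unsigned int shift = ((size_t)address & 2) ? 16 : 0;

	unsigned int old = *base, assumed;
	do {
		assumed = old;
		unsigned short cur = (unsigned short)((assumed >> shift) & 0xffff);
		unsigned int updated = (assumed & ~(0xffffu << shift))
		                     | ((unsigned int)(unsigned short)(cur + val) << shift);
		old = atomicCAS(base, assumed, updated);
	} while (assumed != old);
}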

I would be very grateful if you have an insight about this.

One more observation:

This behavior can also happen with just one warp. I reduced the number of threads to 32 and increased the number of iterations in the for loop to 50000, and I get the same kernel timeout exception.

Once again replacing the do-while and atomicCAS with atomicAdd works perfectly fine.

That’s what I suspected: There is nothing wrong with your code apart from the fact that it is very slow. On systems where the GPU also drives the user interface, kernels are subject to a runtime limit of between 2 and 5 seconds to make sure a runaway kernel does not render the machine unusable. Schedule less work per kernel invocation, or run the kernels on a dedicated CUDA card to avoid having your kernel terminated prematurely.
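
If you want to keep the algorithm as it is, one simple way to stay under the watchdog is to split the iterations over several shorter launches. A host-side sketch, assuming a variant of your kernel (here called test_chunk, just for illustration) that takes the per-launch iteration count as a parameter and does not reset d_acc[0] itself:

const int totalIterations = 50000;
const int itersPerLaunch  = 1000;

cudaMemset(d_accSpaces, 0, sizeof(unsigned int));   // initialise the counter once from the host
for (int done = 0; done < totalIterations; done += itersPerLaunch) {
	test_chunk<<<blockGrid, threadBlock>>>( (unsigned int*) d_accSpaces, itersPerLaunch );
	cudaThreadSynchronize();   // each launch now finishes well inside the watchdog limit
}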

wow I was not expecting it to be that slow.

thanks a lot, that thing was driving me crazy:)