atomicCAS does NOT seem to work – hardware bug or improper use?? (Tesla C1060)

The following code which tries to implement a spinlock just hangs my TESLA C1060.

No thread seems to see the unlocked state (which is the initial state). Every thread sees it only in the locked state, and hence the hang.

Flipping the condition (assuming a documentation bug) results in a race condition.

Can someone tell me what is wrong? I'd appreciate it if someone from NVIDIA could test this out. Many THANKS!

The behaviour is the same whether result is “volatile” or not.

#include <stdio.h>

__global__ void mykernel(int *lock, int *variable)
{
	volatile int result;

	do
	{
		/* "0" means unlocked. "FF" means locked */
		result = atomicCAS(lock, 0, 0xFF);
	} while (result == 0xFF);

	*variable = *variable + 1;
	*lock = 0;
}

int main(void)
{
	int *dev_variable, init = 0;

	cudaMalloc((void **)&dev_variable, 2 * sizeof(int));
	cudaMemcpy(dev_variable, &init, sizeof(int), cudaMemcpyHostToDevice);
	init = 0;
	cudaMemcpy(dev_variable + 1, &init, sizeof(int), cudaMemcpyHostToDevice);
	mykernel<<< 1000, 512 >>>(dev_variable + 1, dev_variable);
	cudaThreadSynchronize();
	cudaMemcpy(&init, dev_variable, sizeof(int), cudaMemcpyDeviceToHost);
	printf("%d\n", init);
	return 0;
}

Whoa! I think that is because the “else” part gets executed first during a warp divergence… and hence the warp scheduler is busy with the loop and never executes “*lock = 0”.

Let me leverage this concept and see if that clicks (thanks to PDan (iirc) for raising that interesting topic).

Great ideas usually occur only in restrooms. I think it is high time scientists research this phenomenon.

where is the “else” in a do while loop? ;-)

Have you examined the PTX code yet?

The variable result is thread-local, so whether it is volatile has no meaning.

I suspect the race condition is due to multiple threads writing to the same location. The program definitely hangs once you reverse the while condition. Can you try this:

......
} while (result != 0xFF);

__syncthreads();

if (threadIdx.x == 0)
{
	atomicAdd(variable, 1);
	*lock = 0;
}

Yep, I tried that __syncthreads() thing and that too did NOT work.

Here is an alternate code that does NOT work - using atomicExch()

__global__ void mykernel(int *lock, int *variable)
{
	volatile int result;

spin:
	result = atomicExch(lock, 0xFF);
	__syncthreads();
	if (result == 0)
	{
		*variable = *variable + 1;
		atomicExch(lock, 0);
	}

	__syncthreads();
	if (result != 0)
	{
		goto spin;
	}
}

This one works for 10 blocks with 512 threads – but it is DEAD slow!

But it does NOT work with 20 blocks… The kernel times out.

There is something seriously wrong, I guess…

At this point, I think the problem is related to “starvation” – due to bad scheduling…

The atomicExch(…, 0) – the one that releases the lock – is NOT scheduled promptly because of the incessant spinning from the other threads…

I think one should NOT tight-spin… That's the moral. There should be a way to wait…

Oh yes… I remember the same problem while writing CPU code as well…

TRY first and, only if that succeeds, LOCK… That was the secret of performance… Otherwise cache lines were unnecessarily invalidated, resulting in CPU cache ping-pong between multiple CPUs…

Maybe I should try doing that.

Nonetheless, I think NVIDIA's hardware must be smart enough to find the isolated transaction and schedule it fast (because only one thread uses things atomically, and that lone memory transaction should be honoured ahead of the atomic locking requests).
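The “TRY first and, only if that succeeds, LOCK” idea above is the classic test-and-test-and-set pattern. A minimal, untested sketch of what that would look like in CUDA (helper names are made up; same 0 / 0xFF convention as the code above):

```cuda
/* Test-and-test-and-set sketch: 0 = unlocked, 0xFF = locked. */
__device__ void acquire(volatile int *lock)
{
	do
	{
		/* TRY: spin on a plain read first, so the atomic unit
		   is not hammered while the lock is known to be held. */
		while (*lock != 0);
		/* LOCK: attempt the atomic only once the lock looks free;
		   retry the whole thing if someone else beat us to it. */
	} while (atomicCAS((int *)lock, 0, 0xFF) != 0);
}

__device__ void release(volatile int *lock)
{
	atomicExch((int *)lock, 0);
}
```

On CPUs this reduces contention on the coherence traffic; whether it helps on the atomic units here is exactly what would need testing.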

Of course your code is going to deadlock. One thread will succeed and progress past both __syncthreads(), while every other thread in the CTA is going to go back and hit a __syncthreads() that not all threads in the CTA hit. There is no vagary of scheduling here; your code is just wrong.

(and for the record, atomics absolutely work)

I’ve tried the following code on my GTX260 and there is no deadlock. The displayed result is 1000.

#include <stdio.h>

__global__ void mykernel(int *lock, int *variable)
{
	int result;

	do
	{
		/* "0" means unlocked. "FF" means locked */
		result = atomicCAS(lock, 0, 0xFF);
	} while (result != 0xFF);

	if (threadIdx.x == 0)
	{
		atomicAdd(variable, 1);
		*lock = 0;
	}
}

int main(void)
{
	int *dev_variable, init = 0;

	cudaMalloc((void **)&dev_variable, 2 * sizeof(int));
	cudaMemcpy(dev_variable, &init, sizeof(int), cudaMemcpyHostToDevice);
	init = 0;
	cudaMemcpy(dev_variable + 1, &init, sizeof(int), cudaMemcpyHostToDevice);
	mykernel<<< 1000, 512 >>>(dev_variable + 1, dev_variable);
	cudaThreadSynchronize();
	cudaMemcpy(&init, dev_variable, sizeof(int), cudaMemcpyDeviceToHost);
	printf("%d\n", init);
	return 0;
}

Do you mean to say “__syncthreads()” waits for DEAD THREADS ??? (Note: the thread that unlocks exits the kernel)

–edit–

The code works for 10x512… How is that possible? And it takes seconds to complete for 10x512 and gives the correct result… This is PLAIN SLOW!

I will give you a complete program in a following post.

The locking code is absolutely wrong… You keep on spinning while “result != 0xFF”, i.e. whenever it is unlocked, you spin… That is wrong.

You have to spin only when it is locked (result == 0xFF). atomicCAS returns the old value present in that memory location. Please check the documentation of “atomicCAS”.

Also, I don't understand why you have used the “threadIdx.x == 0” condition. Only the thread that acquires the lock must do the operation.
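To spell out the point about atomicCAS's return value (a minimal sketch wrapping the acquire loop from the very first post in a hypothetical helper, same 0 / 0xFF convention):

```cuda
/* atomicCAS returns the OLD value at *lock: 0xFF means the lock was
   already held, so we must retry; 0 means we just took it. */
__device__ void acquire(int *lock)
{
	int result;
	do
	{
		result = atomicCAS(lock, 0, 0xFF);  /* 0 = unlocked, 0xFF = locked */
	} while (result == 0xFF);               /* spin only while it WAS locked */
}
```

So the spin condition must be `== 0xFF`, not `!= 0xFF`.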

Here is the code that works for 10x512 perfectly. It takes 2.xxx seconds to complete, though. See the attachment for the required header file “PerformanceCounter.h”.

It does NOT work for 20x512 – it just times out. On a few occasions I have seen it pass without hitting the watchdog; it took 12 seconds in those cases.

Here is a sample output.

"

CUDA Kernel Launch Error Value = 0 (0 means success)

Kernel ran for 12.644320 seconds

10240

"

Without the __syncthreads(), the code JUST hangs! This is because of the warp divergence problem, I think – over which programmers have no control.

#include <stdio.h>
#include "PerformanceCounter.h"

__global__ void mykernel(int *lock, int *variable)
{
	volatile int result;

spin:
	result = atomicExch(lock, 0xFF);
	if (result == 0)
	{
		*variable = *variable + 1;
		atomicExch(lock, 0);
	}

	__syncthreads();
	if (result != 0)
	{
		goto spin;
	}
}

int main(void)
{
	int *dev_variable, init = 0;
	cudaError_t err;
	HPTimer profiler;

	SetThreadAffinityMask(GetCurrentThread(), 1);
	cudaMalloc((void **)&dev_variable, 2 * sizeof(int));
	init = 0;
	cudaMemcpy(dev_variable, &init, sizeof(int), cudaMemcpyHostToDevice);
	init = 0;
	cudaMemcpy(dev_variable + 1, &init, sizeof(int), cudaMemcpyHostToDevice);
	profiler.start();
	mykernel<<< 10, 512 >>>(dev_variable + 1, dev_variable);
	err = cudaThreadSynchronize();
	profiler.stop();
	printf("CUDA Kernel Launch Error Value = %d (0 means success)\n", err);
	if (err == 0)
	{
		printf("Kernel ran for %f seconds\n", profiler.TimeInSeconds());
	}
	cudaMemcpy(&init, dev_variable, sizeof(int), cudaMemcpyDeviceToHost);
	printf("%d\n", init);
	return 0;
}

If not all threads in a block hit a __syncthreads(), behavior is undefined. Your code may work. It probably won’t. It certainly isn’t guaranteed to work consistently.

Do you mean to say __syncthreads() waits for “exited threads”??? I think we discussed this before. Let me google and find out.

Meanwhile, I'd appreciate any tips on how to create a “spinlock” using atomics – it would be useful.

Oh yes, we discussed this issue here: http://forums.nvidia.com/index.php?showtop…mp;#entry483753

__syncthreads() waits for exited threads. Grrrrr…

Ok…

So, without the __syncthreads() the above code hangs. How else can I get this working???

Let me tweak PTX to see if I can get something working… But definitely, there should be a better way.

Wow, guys, this isn’t that hard.

// keep track of ownership by storing blockIdx.x
// -1 implies unlocked
__device__ volatile int lock = -1;

__global__ void spinlol()
{
  // hey what's up it's a spinlock
  while (atomicCAS(&lock, -1, blockIdx.x) != -1);

  /* do a bunch of stuff */

  // unlock
  atomicExch(&lock, -1);
}

completely off the top of my head and totally untested because I am super lazy, but it should work fine. add if threadIdx.x == 0 or warp ID things as necessary.

(edit: okay, to avoid pedantry, that should be volatile although since it’s only being accessed via atomics it won’t matter)

Tim,

Your code times out. Find below the full CU code. I removed “volatile” because there is no overload of “atomicCAS” that takes “volatile int *” as an argument (i.e. compilation fails).

Technically, there is no difference between your code and my code. It's all the same. Locking with “blockIdx” does NOT guarantee that all threads in the block will pass through.

However, if you had used “atomicCAS(&lock, blockIdx.x, blockIdx.x)” – that would guarantee that all (participating) threads of a single block pass through. However, the initial value of “lock” must then be some block ID – and that means you are assuming something about the scheduling order of blocks, which is unsupported.

Even if all threads of a block participate, I have to again use a lock in shared memory to synchronize among them. And they too will suffer the __syncthreads problem…

But I think this is all too much info at the moment. I request you to test the code below on your hardware and reason out why it times out (I am sure it's because of the warp divergence over which the programmer has no control). And this is the point I have been making since yesterday night.

Thanks,

#include <stdio.h>

// keep track of ownership by storing blockIdx.x
// -1 implies unlocked
__device__ int lock = -1;
__device__ int variable = 0;

__global__ void spinlol()
{
  // hey what's up it's a spinlock
  while (atomicCAS(&lock, -1, blockIdx.x) != -1);
  variable++;
  // unlock
  atomicExch(&lock, -1);
}

int main()
{
	cudaError_t err;
	int host = 0;

	spinlol<<<10, 512>>>();
	err = cudaThreadSynchronize();
	printf("CUDA Launch Error = %d (0 means success)\n", err);
	cudaMemcpyFromSymbol(&host, "variable", sizeof(int), 0, cudaMemcpyDeviceToHost);
	printf("Variable = %d\n", host);
}

You don’t understand how atomicCAS works. One block will return -1, every other block will return the first block’s blockIdx.x. The first block has acquired the lock, the rest attempt to acquire the lock. Eventually, the first block will decide it’s finished, insert -1, and finish. One of the n - 1 spinlocking blocks will then get -1 from atomicCAS while every other remaining block will get that block’s blockIdx.x, repeat.

Of course this code doesn’t work in any reasonable length of time, and if you think it doesn’t eventually complete because it times out congrats on solving the halting problem. Add a if (!threadIdx.x && !threadIdx.y) conditional around the lock/unlock if you want it to execute cleanly. Whatever scheduling policy you think you’ve discovered is irrelevant in this case. Spinlocks are not hard.

(your test case is wrong anyway without inserting a threadfence after the variable++)

well your test case is wrong for lots of other reasons, namely that’s a per-CTA mutex and not a per-thread mutex (but the same principles apply)

anyway:

#include <stdio.h>
#include <stdlib.h>

__device__ volatile int lock = -1;
__device__ volatile int counter = 0;

__global__ void spinlol()
{
  if (!threadIdx.x && !threadIdx.y)
	while (atomicCAS((int*)&lock, -1, blockIdx.x) != -1);
  __syncthreads();

  if (!threadIdx.x && !threadIdx.y)
	{
	  counter++;
	  __threadfence();
	}
  __syncthreads();

  if (!threadIdx.x && !threadIdx.y)
	atomicExch((int*)&lock, -1);
}

int main(int argc, char** argv)
{
  int hostcounter = -1;
  spinlol<<<30, 512>>>();
  cudaThreadSynchronize();
  printf("err = %s\n", cudaGetErrorString(cudaGetLastError()));
  cudaMemcpyFromSymbol(&hostcounter, "counter", sizeof(int), 0, cudaMemcpyDeviceToHost);
  printf("counter = %d\n", hostcounter);
}

tim@thor:~/hurf$ !nvcc
nvcc -o hurf hurf.cu -arch sm_12
tim@thor:~/hurf$ ./hurf
err = no error
counter = 30

Atomics work at thread level and not block level. When one thread acquires the lock, the value is already -1 out there… other threads of the block will only see -1. So only one thread passes through in that block while the others spin.

Well, that's what I get from the manual. Moreover, as an insider, you might know more. Can you confirm?

Thanks for the heads-up. It should affect only the value of variable, though, and not the time-out problem. I will fix that.

But I don't see any such reference in the manual. My code actually wants to implement a per-thread mutex. So ultimately I want the counter to be gridDim.x * blockDim.x and not just gridDim.x.

For example, the atomicAdd() function can be used to get gridDim.x * blockDim.x. It works at a per-thread level. So what is wrong if I try to implement a spin-lock at the per-thread level?

What I wrote earlier is a per-CTA mutex and is way too slow to be used as a per-thread mutex. It’s not necessarily wrong, but it obviously is unbelievably slow. Better way is to do an inter-CTA mutex and an intra-CTA mutex. I do that here to count warps, extend to whatever you like.

#include <stdio.h>
#include <stdlib.h>

__device__ volatile int lock = -1;
__device__ volatile int counter = 0;

__global__ void spinlol()
{
  __shared__ int intraCTAlock;

  if (!threadIdx.x && !threadIdx.y)
	intraCTAlock = -1;
  __syncthreads();

  if (!threadIdx.x && !threadIdx.y)
	while (atomicCAS((int*)&lock, -1, blockIdx.x) != -1);
  __syncthreads();

  if (threadIdx.x % 32 == 0)
	{
	  while (atomicCAS(&intraCTAlock, -1, 12) != -1);
	  counter++;
	  __threadfence();
	  atomicExch(&intraCTAlock, -1);
	}
  __syncthreads();

  if (!threadIdx.x && !threadIdx.y)
	atomicExch((int*)&lock, -1);
}

int main(int argc, char** argv)
{
  int hostcounter = -1;
  spinlol<<<60, 512>>>();
  cudaThreadSynchronize();
  printf("err = %s\n", cudaGetErrorString(cudaGetLastError()));
  cudaMemcpyFromSymbol(&hostcounter, "counter", sizeof(int), 0, cudaMemcpyDeviceToHost);
  printf("counter = %d\n", hostcounter);
}

tim@thor:~/hurf$ nvcc -o hurf hurf.cu -arch sm_12
tim@thor:~/hurf$ ./hurf
err = no error
counter = 960

None of these cases have taken me long to write, and I’m not using any secret knowledge that can only be obtained if you pass through the ring of fire in NVIDIA HQ. Basic mutexes aren’t hard, just potentially very slow. If I ever thought I needed a per-thread mutex, I would probably decide I was doing something very wrong.
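For the record, for a plain counter like the one in this thread no mutex is needed at all – a minimal sketch using the atomicAdd() mentioned earlier (kernel and variable names are made up; the result should be gridDim.x * blockDim.x):

```cuda
#include <stdio.h>

__device__ int variable = 0;

// Lock-free per-thread counting: atomicAdd serializes the
// read-modify-write in hardware, so no spinlock is required.
__global__ void countkernel()
{
	atomicAdd(&variable, 1);
}

int main(void)
{
	int host = 0;
	countkernel<<<60, 512>>>();
	cudaThreadSynchronize();
	cudaMemcpyFromSymbol(&host, "variable", sizeof(int), 0, cudaMemcpyDeviceToHost);
	printf("%d\n", host);  /* expect 60 * 512 = 30720 */
	return 0;
}
```

Whenever the critical section reduces to a single read-modify-write like this, the dedicated atomic beats any lock.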