atomicCAS does NOT seem to work – hardware bug or improper use?? (Tesla C1060)

The following code which tries to implement a spinlock just hangs my TESLA C1060.

No thread seems to see the unlocked state (which is the initial state). Every thread sees it only in the locked state, and hence the hang.

Flipping the condition (assuming a documentation bug) results in a race condition.

Can someone tell me what is wrong? I'd appreciate it if someone from NVIDIA could test this out. Many THANKS!

The behaviour is the same whether result is “volatile” or not.

#include <stdio.h>

__global__ void mykernel(int *lock, int *variable)
{
	volatile int result;

	do
	{
		/* "0" means unlocked. "FF" means locked */
		result = atomicCAS(lock, 0, 0xFF);
	} while (result == 0xFF);

	*variable = *variable + 1;
	*lock = 0;
}

int main(void)
{
	int *dev_variable, init = 0;

	cudaMalloc((void **)&dev_variable, 2 * sizeof(int));
	cudaMemcpy(dev_variable, &init, sizeof(int), cudaMemcpyHostToDevice);
	init = 0;
	cudaMemcpy(dev_variable + 1, &init, sizeof(int), cudaMemcpyHostToDevice);
	mykernel<<< 1000, 512 >>>(dev_variable + 1, dev_variable);
	cudaThreadSynchronize();
	cudaMemcpy(&init, dev_variable, sizeof(int), cudaMemcpyDeviceToHost);
	printf("%d\n", init);
	return 0;
}

Whoa! I think that is because the “else” part gets executed first during a warp divergence… and hence the warp scheduler is busy with the loop and never executes “*lock = 0”.

Let me leverage this concept and see if that clicks (thanks to PDan (iirc) for raising that interesting topic).

Great ideas usually occur only in restrooms. I think it is high time scientists research this phenomenon.

where is the “else” in a do while loop? ;-)

Have you examined the PTX code yet?

The variable result is thread-local, so whether it is volatile has no meaning.

I suspect the race condition is due to multiple threads writing to the same location. The program definitely hangs once you reverse the while condition. Can you try this:

......
} while (result != 0xFF);

__syncthreads();

if (threadIdx.x == 0)
{
	atomicAdd(variable, 1);
	*lock = 0;
}

Yep, I tried that __syncthreads() thing and that too did NOT work.

Here is an alternate code that does NOT work - using atomicExch()

__global__ void mykernel(int *lock, int *variable)
{
	volatile int result;

spin:
	result = atomicExch(lock, 0xFF);
	__syncthreads();
	if (result == 0)
	{
		*variable = *variable + 1;
		atomicExch(lock, 0);
	}

	__syncthreads();
	if (result != 0)
	{
		goto spin;
	}
}

This one works for 10 blocks with 512 threads – but it is DEAD slow!

But it does NOT work with 20 blocks… The kernel times out.

There is something seriously wrong, I guess…

At this point, I think the problem is related to “starvation” – due to bad scheduling…

The atomicExch(…, 0) – the one that releases the lock – is NOT scheduled promptly because of the incessant spinning from the other threads…

I think one should NOT tight-spin… That's the moral. There should be a way to wait…

Oh yes… I remember the same problem while writing CPU code as well…

TRY first and, only if that succeeds, LOCK… That was the secret of performance… Otherwise cache lines were unnecessarily invalidated, resulting in CPU cache ping-pong between multiple CPUs…

Maybe I should try doing that.

Nonetheless, I think NVIDIA's hardware must be smart enough to find the isolated transaction and schedule it fast (because only one thread uses things atomically, and that lone memory transaction should be honoured ahead of the atomic locking requests).
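The “TRY first and, only if that succeeds, LOCK” idea above is the classic test-and-test-and-set pattern. A minimal, untested sketch of what that would look like in CUDA (helper names are made up; same 0 / 0xFF convention as the code above):

```cuda
/* Test-and-test-and-set sketch: 0 = unlocked, 0xFF = locked. */
__device__ void acquire(volatile int *lock)
{
	do
	{
		/* TRY: spin on a plain read first, so the atomic unit
		   is not hammered while the lock is known to be held. */
		while (*lock != 0);
		/* LOCK: attempt the atomic only once the lock looks free;
		   retry the whole thing if someone else beat us to it. */
	} while (atomicCAS((int *)lock, 0, 0xFF) != 0);
}

__device__ void release(volatile int *lock)
{
	atomicExch((int *)lock, 0);
}
```

On CPUs this reduces contention on the coherence traffic; whether it helps on the atomic units here is exactly what would need testing.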

Of course your code is going to deadlock. One thread will succeed and progress past both __syncthreads(), while every other thread in the CTA is going to go back and hit a __syncthreads() that not all threads in the CTA hit. There is no vagary of scheduling here; your code is just wrong.

(and for the record, atomics absolutely work)

I’ve tried the following code on my GTX260 and there is no deadlock. The displayed result is 1000.

#include <stdio.h>

__global__ void mykernel(int *lock, int *variable)
{
	int result;

	do
	{
		/* "0" means unlocked. "FF" means locked */
		result = atomicCAS(lock, 0, 0xFF);
	} while (result != 0xFF);

	if (threadIdx.x == 0)
	{
		atomicAdd(variable, 1);
		*lock = 0;
	}
}

int main(void)
{
	int *dev_variable, init = 0;

	cudaMalloc((void **)&dev_variable, 2 * sizeof(int));
	cudaMemcpy(dev_variable, &init, sizeof(int), cudaMemcpyHostToDevice);
	init = 0;
	cudaMemcpy(dev_variable + 1, &init, sizeof(int), cudaMemcpyHostToDevice);
	mykernel<<< 1000, 512 >>>(dev_variable + 1, dev_variable);
	cudaThreadSynchronize();
	cudaMemcpy(&init, dev_variable, sizeof(int), cudaMemcpyDeviceToHost);
	printf("%d\n", init);
	return 0;
}

Do you mean to say “__syncthreads()” waits for DEAD THREADS ??? (Note: the thread that unlocks exits the kernel)

–edit–

The code works for 10x512… How is that possible? And it takes seconds to complete for 10x512 and gives the correct result… This is PLAIN SLOW!

I will give you a complete program in a following post.

The locking code is absolutely wrong… You keep on spinning while “result != 0xFF”, i.e. whenever it is unlocked, you spin… That is wrong.

You have to spin only when it is locked (result == 0xFF). atomicCAS returns the old value present in that memory location. Please check the documentation of “atomicCAS”.

Also, I don't understand why you have used the “threadIdx.x == 0” condition. Only the thread that acquires the lock must do the operation.
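To spell out the point about atomicCAS's return value (a minimal sketch wrapping the acquire loop from the very first post in a hypothetical helper, same 0 / 0xFF convention):

```cuda
/* atomicCAS returns the OLD value at *lock: 0xFF means the lock was
   already held, so we must retry; 0 means we just took it. */
__device__ void acquire(int *lock)
{
	int result;
	do
	{
		result = atomicCAS(lock, 0, 0xFF);  /* 0 = unlocked, 0xFF = locked */
	} while (result == 0xFF);               /* spin only while it WAS locked */
}
```

So the spin condition must be `== 0xFF`, not `!= 0xFF`.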

Here is the code that works for 10x512 perfectly. It takes 2.xxx seconds to complete, though. See the attachment for the required header file “PerformanceCounter.h”.

It does NOT work for 20x512 – it just times out. On a few occasions I have seen it pass without hitting the watchdog; it took 12 seconds in those cases.

Here is a sample output.

"

CUDA Kernel Launch Error Value = 0 (0 means success)

Kernel ran for 12.644320 seconds

10240

"

Without the __syncthreads(), the code JUST hangs! This is because of the warp divergence problem, I think – over which programmers have no control.

#include <stdio.h>
#include "PerformanceCounter.h"

__global__ void mykernel(int *lock, int *variable)
{
	volatile int result;

spin:
	result = atomicExch(lock, 0xFF);
	if (result == 0)
	{
		*variable = *variable + 1;
		atomicExch(lock, 0);
	}

	__syncthreads();
	if (result != 0)
	{
		goto spin;
	}
}

int main(void)
{
	int *dev_variable, init = 0;
	cudaError_t err;
	HPTimer profiler;

	SetThreadAffinityMask(GetCurrentThread(), 1);
	cudaMalloc((void **)&dev_variable, 2 * sizeof(int));
	init = 0;
	cudaMemcpy(dev_variable, &init, sizeof(int), cudaMemcpyHostToDevice);
	init = 0;
	cudaMemcpy(dev_variable + 1, &init, sizeof(int), cudaMemcpyHostToDevice);
	profiler.start();
	mykernel<<< 10, 512 >>>(dev_variable + 1, dev_variable);
	err = cudaThreadSynchronize();
	profiler.stop();
	printf("CUDA Kernel Launch Error Value = %d (0 means success)\n", err);
	if (err == 0)
	{
		printf("Kernel ran for %f seconds\n", profiler.TimeInSeconds());
	}
	cudaMemcpy(&init, dev_variable, sizeof(int), cudaMemcpyDeviceToHost);
	printf("%d\n", init);
	return 0;
}

If not all threads in a block hit a __syncthreads(), behavior is undefined. Your code may work. It probably won’t. It certainly isn’t guaranteed to work consistently.

Do you mean to say __syncthreads() waits for “exited threads”??? I think we discussed this before. Let me google and find out.

Meanwhile, I'd appreciate any tips on how to create a “spinlock” using atomics – it would be useful.

Oh yes, we discussed this issue here: http://forums.nvidia.com/index.php?showtop…mp;#entry483753

__syncthreads() waits for exited threads. Grrrrr…

Ok…

So, without the __syncthreads() the above code hangs. How else can I get this working???

Let me tweak PTX to see if I can get something working… But definitely, there should be a better way.

Wow, guys, this isn’t that hard.

// keep track of ownership by storing blockIdx.x
// -1 implies unlocked
__device__ volatile int lock = -1;

__global__ void spinlol()
{
  // hey what's up it's a spinlock
  while (atomicCAS(&lock, -1, blockIdx.x) != -1);

  /* do a bunch of stuff */

  // unlock
  atomicExch(&lock, -1);
}

completely off the top of my head and totally untested because I am super lazy, but it should work fine. add if threadIdx.x == 0 or warp ID things as necessary.

(edit: okay, to avoid pedantry, that should be volatile although since it’s only being accessed via atomics it won’t matter)

Tim,

Your code times out. Find below the full CU code. I removed “volatile” because there is no overload of “atomicCAS” that takes “volatile int *” as an argument (i.e. compilation fails).

Technically, there is no difference between your code and my code. It's all the same. Locking with “blockIdx” does NOT guarantee that all threads in the block will pass through.

However, if you had used “atomicCAS(&lock, blockIdx.x, blockIdx.x)” – that would guarantee that all (participating) threads of a single block pass through. However, the initial value of “lock” must then be some block ID – and that means you are assuming something about the scheduling order of blocks, which is unsupported.

Even if all threads of a block participate, I have to again use a lock in shared memory to synchronize among them. And they too will suffer the __syncthreads problem…

But I think this is all too much info at the moment. I request you to test the code below on your hardware and reason out why it times out (I am sure it's because of the warp divergence over which the programmer has no control). And this is the point I have been making since yesterday night.

Thanks,

#include <stdio.h>

// keep track of ownership by storing blockIdx.x
// -1 implies unlocked
__device__ int lock = -1;
__device__ int variable = 0;

__global__ void spinlol()
{
  // hey what's up it's a spinlock
  while (atomicCAS(&lock, -1, blockIdx.x) != -1);
  variable++;
  // unlock
  atomicExch(&lock, -1);
}

int main()
{
	cudaError_t err;
	int host = 0;

	spinlol<<<10, 512>>>();
	err = cudaThreadSynchronize();
	printf("CUDA Launch Error = %d (0 means success)\n", err);
	cudaMemcpyFromSymbol(&host, "variable", sizeof(int), 0, cudaMemcpyDeviceToHost);
	printf("Variable = %d\n", host);
}

You don’t understand how atomicCAS works. One block will return -1, every other block will return the first block’s blockIdx.x. The first block has acquired the lock, the rest attempt to acquire the lock. Eventually, the first block will decide it’s finished, insert -1, and finish. One of the n - 1 spinlocking blocks will then get -1 from atomicCAS while every other remaining block will get that block’s blockIdx.x, repeat.

Of course this code doesn’t work in any reasonable length of time, and if you think it doesn’t eventually complete because it times out congrats on solving the halting problem. Add a if (!threadIdx.x && !threadIdx.y) conditional around the lock/unlock if you want it to execute cleanly. Whatever scheduling policy you think you’ve discovered is irrelevant in this case. Spinlocks are not hard.

(your test case is wrong anyway without inserting a threadfence after the variable++)

well your test case is wrong for lots of other reasons, namely that’s a per-CTA mutex and not a per-thread mutex (but the same principles apply)

anyway:

#include <stdio.h>
#include <stdlib.h>

__device__ volatile int lock = -1;
__device__ volatile int counter = 0;

__global__ void spinlol()
{
  if (!threadIdx.x && !threadIdx.y)
	while (atomicCAS((int*)&lock, -1, blockIdx.x) != -1);
  __syncthreads();

  if (!threadIdx.x && !threadIdx.y)
	{
	  counter++;
	  __threadfence();
	}
  __syncthreads();

  if (!threadIdx.x && !threadIdx.y)
	atomicExch((int*)&lock, -1);
}

int main(int argc, char** argv)
{
  int hostcounter = -1;
  spinlol<<<30, 512>>>();
  cudaThreadSynchronize();
  printf("err = %s\n", cudaGetErrorString(cudaGetLastError()));
  cudaMemcpyFromSymbol(&hostcounter, "counter", sizeof(int), 0, cudaMemcpyDeviceToHost);
  printf("counter = %d\n", hostcounter);
}

tim@thor:~/hurf$ !nvcc
nvcc -o hurf hurf.cu -arch sm_12
tim@thor:~/hurf$ ./hurf
err = no error
counter = 30

Atomics work at thread level and not block level. When one thread acquires the lock, the value is already -1 out there… other threads of the block will only see -1. So only one thread passes through in that block while the others spin.

Well, that's what I get from the manual. Moreover, as an insider, you might know more. Can you confirm?

Thanks for the heads-up. It should affect only the value of variable, though, and not the time-out problem. I will fix that.

But I don't see any such reference in the manual. My code actually wants to implement a per-thread mutex. So ultimately I want the counter to be gridDim.x * blockDim.x and not just gridDim.x.

For example, the atomicAdd() function can be used to get gridDim.x * blockDim.x. It works at a per-thread level. So what is wrong if I try to implement a spin-lock at the per-thread level?

What I wrote earlier is a per-CTA mutex and is way too slow to be used as a per-thread mutex. It’s not necessarily wrong, but it obviously is unbelievably slow. Better way is to do an inter-CTA mutex and an intra-CTA mutex. I do that here to count warps, extend to whatever you like.

#include <stdio.h>
#include <stdlib.h>

__device__ volatile int lock = -1;
__device__ volatile int counter = 0;

__global__ void spinlol()
{
  __shared__ int intraCTAlock;

  if (!threadIdx.x && !threadIdx.y)
	intraCTAlock = -1;
  __syncthreads();

  if (!threadIdx.x && !threadIdx.y)
	while (atomicCAS((int*)&lock, -1, blockIdx.x) != -1);
  __syncthreads();

  if (threadIdx.x % 32 == 0)
	{
	  while (atomicCAS(&intraCTAlock, -1, 12) != -1);
	  counter++;
	  __threadfence();
	  atomicExch(&intraCTAlock, -1);
	}
  __syncthreads();

  if (!threadIdx.x && !threadIdx.y)
	atomicExch((int*)&lock, -1);
}

int main(int argc, char** argv)
{
  int hostcounter = -1;
  spinlol<<<60, 512>>>();
  cudaThreadSynchronize();
  printf("err = %s\n", cudaGetErrorString(cudaGetLastError()));
  cudaMemcpyFromSymbol(&hostcounter, "counter", sizeof(int), 0, cudaMemcpyDeviceToHost);
  printf("counter = %d\n", hostcounter);
}

tim@thor:~/hurf$ nvcc -o hurf hurf.cu -arch sm_12
tim@thor:~/hurf$ ./hurf
err = no error
counter = 960

None of these cases have taken me long to write, and I’m not using any secret knowledge that can only be obtained if you pass through the ring of fire in NVIDIA HQ. Basic mutexes aren’t hard, just potentially very slow. If I ever thought I needed a per-thread mutex, I would probably decide I was doing something very wrong.
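For the record, for a plain counter like the one in this thread no mutex is needed at all – a minimal sketch using the atomicAdd() mentioned earlier (kernel and variable names are made up; the result should be gridDim.x * blockDim.x):

```cuda
#include <stdio.h>

__device__ int variable = 0;

// Lock-free per-thread counting: atomicAdd serializes the
// read-modify-write in hardware, so no spinlock is required.
__global__ void countkernel()
{
	atomicAdd(&variable, 1);
}

int main(void)
{
	int host = 0;
	countkernel<<<60, 512>>>();
	cudaThreadSynchronize();
	cudaMemcpyFromSymbol(&host, "variable", sizeof(int), 0, cudaMemcpyDeviceToHost);
	printf("%d\n", host);  /* expect 60 * 512 = 30720 */
	return 0;
}
```

Whenever the critical section reduces to a single read-modify-write like this, the dedicated atomic beats any lock.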