atomicCAS does NOT seem to work. Hardware bug? Or improper use? (TESLA C1060)

Well, Tim,

You now have a per-warp mutex, and this one avoids the warp-divergence issue again.
Also, the use of __syncthreads means it can't be used inside conditionals. Hmm…

Ok,

I will follow in your footsteps, implement a pure per-thread mutex, and get back with results.

Meanwhile, thanks for your time.

The next time I steal your time, it will be for a very, very valid reason. Thanks!

there's warp divergence in there, with the threadIdx.x % 32 == 0 and all. it's just that by doing this I keep contention very low and the machine stays happy and things stay fast. doing these kinds of things at a true per-thread level is always wrong: synchronize within a warp, then within a block, then within the grid. don't just try to synchronize between arbitrary threads in the grid and ignore multi-level synchronization, because your performance will always suck.
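
the per-warp leader shape I mean is roughly this (just a sketch, names made up, not my exact code from before):

__device__ volatile int lock = -1;
__device__ volatile int counter = 0;

__global__ void warpLeaderIncrement()
{
  // only one leader per warp contends for the global lock; the other 31 lanes skip it
  if (threadIdx.x % 32 == 0)
  {
    while (atomicCAS((int*)&lock, -1, 1) != -1);  // spin until this warp's leader owns it
    counter++;                                    // the warp's critical section
    __threadfence();
    atomicExch((int*)&lock, -1);                  // release
  }
}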

anyway, I’m done, I just like atomic operations and race conditions

Ok, I changed your code as below to count the number of threads (instead of warps) and it times out.

And I think the "while" loop (the changed one) is the cause. The warp divergence it causes always results in the spinning split-warp being scheduled (my guess), and thus the shared-memory lock never gets released.

#include <stdio.h>
#include <stdlib.h>

__device__ volatile int lock = -1;      /* global (inter-block) lock, -1 == free */
__device__ volatile int counter = 0;

__global__ void spinlol()
{
  __shared__ int intraCTAlock;          /* per-block lock, -1 == free */

  if (!threadIdx.x && !threadIdx.y)
    intraCTAlock = -1;
  __syncthreads();

  /* one thread per block acquires the global lock */
  if (!threadIdx.x && !threadIdx.y)
    while (atomicCAS((int*)&lock, -1, blockIdx.x) != -1);
  __syncthreads();

  /* ATTENTION: CHANGED 12 to threadIdx.x -- every thread now contends for the per-block lock */
  while (atomicCAS(&intraCTAlock, -1, threadIdx.x) != -1);
  counter++;
  __threadfence();
  atomicExch(&intraCTAlock, -1);

  __syncthreads();
  if (!threadIdx.x && !threadIdx.y)
    atomicExch((int*)&lock, -1);
}

int main(int argc, char** argv)
{
  int hostcounter = -1;

  spinlol<<<60, 512>>>();
  cudaThreadSynchronize();
  printf("err = %s\n", cudaGetErrorString(cudaGetLastError()));

  cudaMemcpyFromSymbol(&hostcounter, "counter", sizeof(int), 0, cudaMemcpyDeviceToHost);
  printf("counter = %d\n", hostcounter);

  return 0;
}

and again, “times out” does not necessarily imply incorrect, it implies super slow and potentially incorrect…

(also the value you put in the lock doesn't matter at all unless you want to keep track of who owns the mutex for some reason; there are reasons to do that, but not here)
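
for example, these two acquire loops behave identically here (just a sketch):

// either of these acquire loops works; the stored value is never looked at again
while (atomicCAS((int*)&lock, -1, blockIdx.x) != -1);  // records the owner's block id
// while (atomicCAS((int*)&lock, -1, 1) != -1);        // ...or store any constant you like

// the release simply puts the "free" value back
atomicExch((int*)&lock, -1);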

Truly appreciate your time and advice. THANKS!

But I still think it is not done. See my post above :-)

Ok, I will provide another version that works fast and another that times out – that brings the warp-divergence issue to the fore.

But seriously, tell me: do you expect the code above that times out to take more than 5 seconds to run?

It might actually be depending on undefined behavior. I need to think about it a bit.

(after I sleep of course)

Ahh… Goooood Night! Truly appreciate your time on this!

btw,
I have reproduced the timeout with 1 block of 3 threads… I will post more info later.

Goood Night for now!

Ok, Tim, here is code that probably makes a good test case. (Almost the same as the previous per-thread, per-CTA logic; I have just added argument support so that the number of blocks/threads can be specified on the command line. Usage is documented below.)

#include <stdio.h>
#include <stdlib.h>

__device__ volatile int lock = -1;      /* global (inter-block) lock, -1 == free */
__device__ volatile int counter = 0;

__global__ void spinlol()
{
  __shared__ int intraCTAlock;          /* per-block lock, -1 == free */

  if (!threadIdx.x && !threadIdx.y)
    intraCTAlock = -1;
  __syncthreads();

  /* one thread per block acquires the global lock */
  if (!threadIdx.x && !threadIdx.y)
    while (atomicCAS((int*)&lock, -1, blockIdx.x) != -1);
  __syncthreads();

  /* ATTENTION: Changed 12 to threadIdx.x -- every thread now contends for the per-block lock */
  while (atomicCAS(&intraCTAlock, -1, threadIdx.x) != -1);
  counter++;
  __threadfence();
  atomicExch(&intraCTAlock, -1);

  __syncthreads();
  if (!threadIdx.x && !threadIdx.y)
    atomicExch((int*)&lock, -1);
}

int main(int argc, char** argv)
{
  int hostcounter = -1;
  dim3 grid, block;

  if (argc != 3)
  {
    printf("Usage: <executable-name> N_Blocks N_Threads\n");
    return -1;
  }

  grid.x  = atoi(argv[1]);
  block.x = atoi(argv[2]);

  printf("Spawning kernel with %d blocks, %d threads_per_block\n", grid.x, block.x);
  spinlol<<<grid, block>>>();
  cudaThreadSynchronize();
  printf("err = %s\n", cudaGetErrorString(cudaGetLastError()));

  cudaMemcpyFromSymbol(&hostcounter, "counter", sizeof(int), 0, cudaMemcpyDeviceToHost);
  printf("counter = %d\n", hostcounter);

  return 0;
}

Here is sample output:

This is reproducible consistently.

If I comment out the "while" loop acquiring intraCTAlock, everything works fine (of course, the output is wrong due to the race).

I wish I could do more work on this. But I am feeling ill today and gotta go home. We will continue this tomorrow!

Thanks for all your time,

Best Regards,

Sarnath

OK now I understand what your “spinlock” means.

You can spinlock blocks of a kernel using atomic functions because each block is run independently of the others.

You cannot spinlock threads of a block using atomic functions because threads are always synchronized at warp level. For instance, in a warp, the first thread evaluating atomicCAS() can meet the condition to get out of the spinlock loop, but it cannot proceed to unlock the others (in the same block) because it MUST wait for the others (in the same warp) to get out of the spinlock loop; this is certainly a deadlock. The fact that your code above can work with 2 threads per block is weird and inconsistent with the CUDA specification; probably this is another CUDA bug.

My recommendation is to spinlock only one thread in each warp as follows:

if (!(threadIdx.x % 32))
    while (atomicCAS(&intraCTAlock, -1, threadIdx.x) != -1);

atomicAdd((int*)&counter, 1);
__threadfence();
atomicExch(&intraCTAlock, -1);

This is what I have been saying as well. And it all depends on which split-warp is scheduled first. In an earlier thread raised by PDan (iirc), the order of execution of serialized warps was discussed. It was found that the ELSE part executes first and the IF part later.

I did generate code (verified by inspecting the PTX) that had the success path (lock acquired successfully) in the ELSE part so that it gets executed first. But that did NOT work as expected. So, there is definitely something more to this.

The fact that your code above can work with 2 threads per block is weird and inconsistent with the CUDA specification;

What does the CUDA spec say about 2 threads?

if (!(threadIdx.x % 32))
    while (atomicCAS(&intraCTAlock, -1, threadIdx.x) != -1);

atomicAdd((int*)&counter, 1);
__threadfence();
atomicExch(&intraCTAlock, -1);

If I am using "atomicAdd", I don't need the spinlock at all.

"atomicAdd" takes care of per-thread semantics very well. It works flawlessly.

The spinlocking code tries to achieve the same effect but fails because of the warp-divergence deadlock issue.

So, there is something available in the hardware that is NOT being exposed to programmers, and hence we can't develop per-thread spinlocks easily.
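
Just to be explicit: for this particular counter, the whole kernel reduces to something like this (a sketch, using the same global counter as above; the kernel name is made up):

__global__ void countThreads()
{
    /* no locks needed at all when the critical section is a single increment */
    atomicAdd((int*)&counter, 1);
}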

I cannot see how the execution order of the IF/ELSE paths affects your above code.

“no other thread can access this address until the operation is complete”

“accessing” also implies “reading”.

You’ve raised this issue before: why your code works with 2 threads but not with 3 threads.

OK, what is the point of using a spinlock at all? To replace atomic functions on shared memory for 1.1 hardware? There should be some way to do that; I was thinking about warp vote support (edited: sorry, warp vote is not available on 1.1 hardware either).
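
One known way to do per-warp shared-memory updates without atomics is the thread-tag trick used in the SDK histogram sample; roughly this sketch (the function name, tag width and mask here are my assumptions):

/* Each warp owns its own region of shared memory. A thread tags its write with its
   lane id in the upper 5 bits and retries until the tagged value it wrote is the one
   that stuck; within a warp, exactly one write to the same address survives each pass. */
__device__ void warpIncrement(volatile unsigned int *s_warpData,
                              unsigned int index, unsigned int threadTag)
{
    unsigned int count;
    do {
        count = s_warpData[index] & 0x07FFFFFFU;   /* strip the old tag            */
        count = threadTag | (count + 1);           /* new value, tagged with my id */
        s_warpData[index] = count;
    } while (s_warpData[index] != count);          /* lost the race? try again     */
}

/* threadTag would be computed once per thread, e.g.
   unsigned int threadTag = (threadIdx.x & 31) << 27; */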

Well, I meant conditional statements causing warp divergence. In this case, it is the spinning "while" that is causing warp divergence. So, one sub-warp needs to be executed first, followed by the other. My contention is that the hardware keeps picking the spinning-while sub-warp and thus never gets to finish it.

NVIDIA hardware is not designed for this kind of thing – at least that is what it looks like.

Yes. I did not quote that as an issue. I merely reported that behaviour.

A spinlock can be used for many different purposes, for example to update linked-list information, links in a binary search tree, and so on… These are rarely encountered in GPU programming, but that does NOT mean we don't need them – that's why atomics were introduced in the first place.

With a basic set of atomics, you can implement higher-level features like spinlocks, read-write locks, sleep-locks (not possible on a GPU), semaphores, mutexes, etc…

Out of the commonly known synchronization primitives, the spinlock looks to be the one that is implementable on GPUs. Probably, read-write locks can also be tried.

In general, locks can be used to protect read-write access to big data structures (they could span anything), provided that all code that accesses them takes the lock first.

Well, if CUDA really executed the IF path first and then the ELSE path, your code would still deadlock. Note that all threads in a warp must converge after EVERY iteration of the while loop; otherwise, a deadlock will happen.

If you want to do that, you will need to serialize the warp locking code. Following is a naive example (it only works within a block; you still need the inter-multiprocessor spinlock code):

#if (1)
    if (!threadIdx.x)
        intraCTAlock = 0;           /* the lock is reused here as a turn counter */
    __syncthreads();

    /* every thread keeps re-reading the turn counter; the warps reconverge at the
       while condition, and exactly one thread per iteration runs its critical section.
       Note: intraCTAlock should be declared volatile (or read through a volatile
       pointer) so that the spinning threads actually see the increment. */
    while (intraCTAlock < blockDim.x)
    {
        if (intraCTAlock == threadIdx.x)
        {
            counter++;
            __threadfence();
            intraCTAlock++;
        }
    }
#endif
    __syncthreads();


Why should they converge after every iteration? I don't understand that part.

The point of convergence of an IF-ELSE block is JUST outside the IF-ELSE block.

The point of convergence of FOR or WHILE loops is JUST outside the loop.

And that is probably the reason why the hardware ALWAYS finishes the WHILE/FOR first and only then recombines with the other sub-warp that is waiting outside the loop. That is probably why we are seeing a DEADLOCK here.

My understanding has now improved. Previously I thought the other sub-warp goes on executing stuff. But yeah, it's so stupid: the other sub-warp just waits for the loop to complete… Though this theory does NOT quite explain why the "2 threads" case ever worked.

I think if we replace the WHILE loop with an IF-ELSE check and a GOTO as yesterday, coupled with __syncthreads(), it might work… I think that could be fast too – which would put an end to this thread. I will try that tomorrow.

cvn,
Thanks for your interest and time.

OK, let’s examine a typical while loop:

while (check_condition)
{
    bla bla bla
}

I would say that whatever is inside the WHILE loop belongs to a conditional path. However, all threads in a warp MUST evaluate check_condition at the same time (converging), because check_condition belongs to the "common" path. If one thread completes the loop before the others in the same warp do, a deadlock surely happens.

In the case of a FOR loop, the story is no different.

You’re welcome. I’ve learned quite a lot from this topic, too.

Yep, this depends on undefined behavior. You're assuming that a thread will yield when it cannot make progress (the atomicCAS on shmem), but that's not guaranteed to be the case at all when there's warp divergence.

Cvn,

When "check_condition" is evaluated, some threads pass through and some do NOT. (In our case, only one thread gets OUT and all the other threads stay inside the while loop.) The exited threads form a sub-warp and wait OUTSIDE the while loop for the other threads to finish it.

So, the idea should be to make the thread that acquires the lock do its work INSIDE the WHILE (or IF) body, while all the others spin on a bigger outer loop. I will post proper spinlock code once I reach the office. Hope that will help some people.
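
Something along these lines is what I have in mind (an untested sketch, reusing the same global lock and counter as before; the kernel name is made up). Both the lock attempt and the release live inside the loop body, so the whole warp reconverges at the loop condition on every iteration:

__global__ void perThreadIncrement()
{
    bool done = false;
    while (!done)
    {
        /* try ONCE per iteration instead of spinning right here */
        if (atomicCAS((int*)&lock, -1, (int)threadIdx.x) == -1)
        {
            counter++;                     /* critical section              */
            __threadfence();
            atomicExch((int*)&lock, -1);   /* release before leaving the IF */
            done = true;
        }
        /* the losers fall through and the whole warp re-evaluates !done together */
    }
}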

Man, This is such a simple thing… And, I was jusss going on and on… Grr…

In retrospect, I feel so dumb to have wasted many people's time… extremely sorry about that…

But I can only console myself that this is NOT as obvious a thing as it might appear.

Thanks for all your time.

Best Regards,

Sarnath