Shared memory atomics (reposted topic)

All,

I am posting this again because the previous thread has scrolled out of visibility.

Here is a simple program that just hangs on my Tesla C1060.

This has NOTHING to do with warp scheduling/divergence or a dependence on any other undocumented feature.

The code works if I use a lock in global memory. It fails if I use a lock in shared memory (even for 1 block of 32 threads).

You may comment out the “#define SHARED_LOCK” line to use the lock in global memory (no other changes to the code are required). This should make experimenting easier.

Also, the code below contains a commented-out, no-brainer snippet which, when uncommented, makes the shared-lock version work as well. I can't make any sense of this, and examining the PTX gives no clue either.

Here is the complete code. You can just copy-paste and compile with “nvcc -arch=sm_13 xxxx.cu”:

#include <stdio.h>

__device__ int result = -123;
__device__ int g_lock = -1;

#define SHARED_LOCK // Comment this line out to use g_lock; the program will then work fine.

#ifdef SHARED_LOCK
#define LOCK sharedLock
#else
#define LOCK g_lock
#endif

__global__ void checkKernel(void)
{
	__shared__ volatile int totalParticipation;
#ifdef SHARED_LOCK
	__shared__ int sharedLock;
#endif
	int lockResult;

	if (threadIdx.x == 0)
	{
		atomicExch(&LOCK, -1);    // -1 means the lock is free
		totalParticipation = 0;
		__threadfence();
	}
	__syncthreads();

	// Spin until at least one thread has passed through the critical section.
	for (int i = 0; totalParticipation < 1; i++)
	{
		__syncthreads();
		lockResult = atomicCAS(&LOCK, -1, (int)threadIdx.x);
		if (lockResult == -1)
		{
			totalParticipation++;
			__threadfence();
			atomicExch(&LOCK, -1);    // release the lock
		}
		__syncthreads();
		__threadfence();
		/*
		// Uncomment the following brain-dead code and you will find that the
		// code works even in the case of the shared-memory lock.
		if (totalParticipation == 0)
			break;
		*/
	}

	if (threadIdx.x == 0)
	{
		result = totalParticipation;
	}
	__syncthreads();
	return;
}

int main()
{
	cudaError_t err;
	int host = -123;

	checkKernel<<<1, 32>>>();
	err = cudaThreadSynchronize();
	if (err != 0)
	{
		printf("Error launching kernel, err = %d\n", err);
		return -1;
	}

	cudaMemcpyFromSymbol(&host, "result", sizeof(int), 0, cudaMemcpyDeviceToHost);
	printf("host = %d\n", host);
	return 0;
}

Thanks,

Best Regards,

Sarnath

On my GTX260, the program hangs even with 2 threads per block.

If I add the following snippet at the end of the for loop:

	if (++cnt == 100)
	{
		totalParticipation = 1005;
		break;
	}
	__syncthreads();

the result is amazing: host = 1006.

Maybe you should do a __syncthreads() before your ++cnt code…

Your code already has one.

CVN,

I tested what you said, taking “cnt” as a local variable initialized to 0 at the start of the day…

You are right… I get 1006… :-(

This is just unbelievable…

And what's more: if I change to the GLOBAL LOCK, I get 1005 if I check “cnt” against 100, and I get 5 if I check “cnt” against 1000. Meaning: there is a delay before the actual value of “totalParticipation” is seen.

This is just crazy… I am just not able to figure out what is going on.

For all the skeptics, I am re-posting the code that creates the 1006 magic.

#include <stdio.h>

__device__ int result = -123;
__device__ int g_lock = -1;

#define SHARED_LOCK // Comment this line out to use g_lock; the program will then work fine.

#ifdef SHARED_LOCK
#define LOCK sharedLock
#else
#define LOCK g_lock
#endif

__global__ void checkKernel(void)
{
	__shared__ volatile int totalParticipation;
#ifdef SHARED_LOCK
	__shared__ int sharedLock;
#endif
	int lockResult;
	int cnt;

	if (threadIdx.x == 0)
	{
		atomicExch(&LOCK, -1);
		totalParticipation = 0;
		__threadfence();
	}
	cnt = 0;
	__syncthreads();

	for (int i = 0; totalParticipation < 5; i++)
	{
		__syncthreads();
		lockResult = atomicCAS(&LOCK, -1, (int)threadIdx.x);
		if (lockResult == -1)
		{
			totalParticipation++;
			__threadfence();
			atomicExch(&LOCK, -1);
		}
		__syncthreads();
		__threadfence();
		if (++cnt == 100)    // bail out after 100 iterations
		{
			totalParticipation = 1005;
			break;
		}
	}
	__syncthreads();

	if (threadIdx.x == 0)
	{
		result = totalParticipation;
	}
	__syncthreads();
	return;
}

int main()
{
	cudaError_t err;
	int host = -123;

	checkKernel<<<1, 32>>>();
	err = cudaThreadSynchronize();
	if (err != 0)
	{
		printf("Error launching kernel, err = %d\n", err);
		return -1;
	}

	cudaMemcpyFromSymbol(&host, "result", sizeof(int), 0, cudaMemcpyDeviceToHost);
	printf("host = %d\n", host);
	return 0;
}

Cool bug! ;)

Looking at decuda’s output, the problem is triggered by the conditional:

    if (lockResult == -1) { … }

In the global-memory case, it is compiled as:

000150: 308103fd 6c4147d8 set.ne.s32 $p1|$o127, $r1, c1[0x0004]
000158: a004c003 00000000 join.label label2
000160: 1004c003 00001280 @$p1.ne bra.label label2
...
000260: f0000001 e0000002 label2: nop.join
000268: 861ffe03 00000000 bar.sync.u32 0x00000000

While in the shared-memory case, it is:

0001b0: 308101fd 6c4147d8 set.ne.s32 $p1|$o127, $r0, c1[0x0004]
0001b8: 1005c003 00001280 @$p1.ne bra.label label11
...
0002e0: 861ffe03 00000000 label11: bar.sync.u32 0x00000000

So the compiler did not include the reconvergence point in the latter case.

Remember that when an if is encountered, the else part is executed first; that is, conditional forward jumps are followed first.

In this case, the inside of the if is not executed until a reconvergence point is reached. In the shared-memory version, this only happens after the end of the loop, hence the 1006 result.

(Actually, I don't even understand why it works at all. It should deadlock when trying to conditionally execute the __syncthreads().)

What do you get if you add an else branch with some dummy code inside? I don’t have a GT200 at hand to test it myself…
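The modified conditional inside checkKernel would look something like this (the dummy statement is just illustrative, and the compiler might well optimize it away, so no guarantees):

	lockResult = atomicCAS(&LOCK, -1, (int)threadIdx.x);
	if (lockResult == -1)
	{
		totalParticipation++;
		__threadfence();
		atomicExch(&LOCK, -1);    // release the lock
	}
	else
	{
		// dummy code, hopefully forcing the compiler to emit a reconvergence point
		lockResult = -2;
	}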

So the question is why the compiler didn’t include the reconvergence information in the shared-mem case. This looks like a compiler bug to me.

By the way, I thought __threadfence was implemented as an assembly instruction. It’s not.

I suppose the current implementation means that NVIDIA doesn't expect to release a GPU with a memory bus wider than 512 bits in the near future. :)

And that's the conclusion I reached last week when I did exactly what you just did. :)

If somebody wants to write a shorter/more obvious repro case, it’d save me some time.

Hello,

This brings up a good question I was just discussing with someone. Why would __threadfence be implemented as a CUBIN instruction when the SMs are in-order architectures? I thought __threadfence's purpose was only to prevent the compiler (nvopencc and ptxas) from reordering memory operations around the fence. In other words, would you still need a fence instruction at the CUBIN level when your architecture is in-order?

Thanks,

Rodrigo

There's no guaranteed ordering for memory transactions without __threadfence. The fact that the SM is in-order doesn't really matter.

My guess is (elaborating on Tim’s answer :) ) that even if the SMs are in-order (actually they have some out-of-order capabilities), the memory controllers and especially the interconnect are not.

So if you have two memory requests accessing different memory partitions and there is congestion on one partition, the request to the other partition may be serviced earlier even though it was issued later.

The __threadfence instruction sends a read request to every memory partition, then waits for their completion. This ensures memory ordering, at the cost of a few thousand cycles…
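To make that concrete, this is the textbook producer pattern where the fence matters (a minimal sketch; without the __threadfence, the write to flag could become visible to other blocks before the write to data):

__device__ int data;
__device__ volatile int flag = 0;

__global__ void producer(void)
{
	data = 42;          // this write may go to one memory partition...
	__threadfence();    // ...so wait until it is visible device-wide...
	flag = 1;           // ...before signalling readiness through another partition
}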

Man, why do I even answer architecture questions, you should just do it since it seems like you’ve figured all this out anyway :)

OK, tmurray. Since you've figured it out, could you please advise us on how to avoid such compiler bugs in the meantime?

Not really, because I haven't had a chance to reduce it to a smaller repro. There are still a whole lot of side effects in the posted code that make it pretty painful for the compiler team to debug.

Once I do, I’ll let you know.

Tim, Sylvain, CVN

Thank you for your time and effort on this.

I have never decuda'ed a cubin, so it's heartening to see someone get down to this level and find a bug. Thanks!

Also, Tim,
I thought this was the simplest repro (repro stands for reproduce, right?). I don't understand what you mean by “side effects”, etc…

By the way, how do “global” memory versus “shared” memory affect the IF-ELSE reconvergence part of the CUBIN?

That's strange…

I am just raising this because it might serve as a clue for the compiler team…

Hi, Sarnath

From my PTX analysis, there is no problem with the IF-ELSE reconvergence. When LOCK is declared as a global variable, the compiler generates appropriate code. When LOCK is declared as a shared variable, the compiler only reads it at start-up (before the loop) and then treats it as a register variable.
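If so, that would be the classic non-volatile shared-memory pitfall. As an illustration (my own sketch, not taken from the generated code):

	__shared__ int flag;    // note: not volatile

	// The compiler may legally load 'flag' once and spin on the register copy,
	// so updates made by other threads are never observed:
	while (flag != 1)       // effectively: r0 = flag; while (r0 != 1) {}
		;                   // hangs forever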

Yeah, maybe… But people are looking at the CUBIN, so I would guess that something is happening at some intermediate stage…

Since I’m now in charge of answering architecture questions… ;)

Most likely, it’s due to the fact that shared-memory atomics are not natively supported in hardware, and they use spinlocks to acquire a cell of shared memory. In pseudocode:

int atomicAdd(shared int *addr, int n)
{
	while (!acquire(addr)) {}   // spin until this shared-memory cell is locked
	// inside the critical section
	int old = *addr;
	*addr += n;
	release(addr);              // unlock the cell
	return old;                 // atomicAdd returns the previous value
}

Where acquire and release are assembly instructions supported starting from the GT200.

So each time you use a shared-mem atomic, the compiler inserts a loop, and that may disturb other parts of the branch-generation system that are not aware of it (just guessing).

By the way, I'm not convinced shared atomics are of any use given the current implementation. They actually seem to be slower than global-memory atomics (~2000 cycles), unless there are very few threads executing the atomic instruction.
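If anyone wants to measure it, a rough clock()-based sketch along these lines should do (my own harness, not measured numbers; per-thread timings overlap between warps, so take the result with a grain of salt):

__global__ void timeSharedAtomic(int *out)
{
	__shared__ int s;
	if (threadIdx.x == 0) s = 0;
	__syncthreads();

	clock_t start = clock();
	atomicAdd(&s, 1);                  // one shared-memory atomic per thread
	clock_t stop = clock();

	__syncthreads();
	if (threadIdx.x == 0)
		out[0] = (int)(stop - start); // cycles observed by thread 0
}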

@Sylvain,

Do you mean to say that the PTX instruction

atom.shared.cas.b32 	%rv1, [%r3], %r8, %r1;

translates to a while loop in the CUBIN???

Yes, exactly:

join.label label7
label5: mov.acquire.b32 $p1|$r0, s[0x0010]
@$p1.lt bra.label label6
bra.label label5
label6: set.eq.u32 $p1|$o127, $r0, c1[0x0004]
mov.b32 $r1, $r0
@$p1.ne mov.b32 $r1, $r2
mov.b32 s[0x0010], $r1
mov.release.b32 s[0x0010], $r1
label7: nop.join

where:

s[0x0010] is [%r3]
c1[0x0004] is %r8 = -1
$r2 is %r1
$r1 is %rv1
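So, putting it together, the single PTX atom.shared.cas expands to something like this (a C-style sketch using the acquire/release pseudocode from my earlier post, not an exact instruction-by-instruction rendering):

int sharedCAS(int *addr, int compare, int val)
{
	while (!acquire(addr)) {}   // spin on the lock bit (the label5 loop)
	int old = *addr;            // the value loaded by mov.acquire
	if (old == compare)         // set.eq against c1[0x0004], i.e. -1
		*addr = val;            // the conditional swap
	release(addr);              // mov.release frees the lock bit
	return old;                 // returned in %rv1
}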