Shared memory atomics (reposted topic)

All,

I am posting this again because the previous thread has scrolled out of visibility.

Here is a simple program that just hangs on my Tesla C1060.

This has NOTHING to do with warp scheduling/divergence or a dependence on any other undocumented feature.

The code works if I use a lock in global memory. It fails if I use a lock in shared memory (even for 1 block of 32 threads).

You may comment out the “#define SHARED_LOCK” line to use the lock in global memory (no other changes to the code are required). This should make experimenting easier.

Also, the code below contains a commented-out, no-brainer snippet which, when uncommented, makes the shared-lock version work as well. I can't make any sense of this, and examining the PTX gives no clue either.

Here is the complete code. You can just copy-paste and compile with “nvcc -arch=sm_13 xxxx.cu”:

#include <stdio.h>

__device__ int result = -123;
__device__ int g_lock = -1;

#define SHARED_LOCK // Comment this line out to use g_lock; the program will then work fine.

#ifdef SHARED_LOCK
#define LOCK sharedLock
#else
#define LOCK g_lock
#endif

__global__ void checkKernel(void)
{
	__shared__ volatile int totalParticipation;
#ifdef SHARED_LOCK
	__shared__ int sharedLock;
#endif
	int lockResult;

	if (threadIdx.x == 0)
	{
		atomicExch(&LOCK, -1);    // -1 means the lock is free
		totalParticipation = 0;
		__threadfence();
	}
	__syncthreads();

	// Spin until at least one thread has passed through the critical section.
	for (int i = 0; totalParticipation < 1; i++)
	{
		__syncthreads();
		lockResult = atomicCAS(&LOCK, -1, (int)threadIdx.x);
		if (lockResult == -1)
		{
			totalParticipation++;
			__threadfence();
			atomicExch(&LOCK, -1);    // release the lock
		}
		__syncthreads();
		__threadfence();
		/*
		// Uncomment the following brain-dead code and you will find that the
		// code works even in the case of the shared-memory lock.
		if (totalParticipation == 0)
			break;
		*/
	}

	if (threadIdx.x == 0)
	{
		result = totalParticipation;
	}
	__syncthreads();
	return;
}

int main()
{
	cudaError_t err;
	int host = -123;

	checkKernel<<<1, 32>>>();
	err = cudaThreadSynchronize();
	if (err != 0)
	{
		printf("Error launching kernel, err = %d\n", err);
		return -1;
	}

	cudaMemcpyFromSymbol(&host, "result", sizeof(int), 0, cudaMemcpyDeviceToHost);
	printf("host = %d\n", host);
	return 0;
}

Thanks,

Best Regards,

Sarnath

On my GTX260, the program hangs even with 2 threads per block.

If I add the following snippet at the end of the for loop:

	if (++cnt == 100)
	{
		totalParticipation = 1005;
		break;
	}
	__syncthreads();

the result is amazing: host = 1006.

Maybe you should do a __syncthreads() before your ++cnt code…

Your code already has one.

CVN,

I tested what you said, taking “cnt” as a local variable initialized to 0 at the start of the day…

You are right… I get 1006… :-(

This is just unbelievable…

And what's more: if I change to the GLOBAL LOCK, I get 1005 if I check “cnt” against 100, and I get 5 if I check “cnt” against 1000. Meaning: there is a delay before the actual value of “totalParticipation” is seen.

This is just crazy… I am just not able to figure out what is going on.

For all the skeptics, I am re-posting the code that creates the 1006 magic.

#include <stdio.h>

__device__ int result = -123;
__device__ int g_lock = -1;

#define SHARED_LOCK // Comment this line out to use g_lock; the program will then work fine.

#ifdef SHARED_LOCK
#define LOCK sharedLock
#else
#define LOCK g_lock
#endif

__global__ void checkKernel(void)
{
	__shared__ volatile int totalParticipation;
#ifdef SHARED_LOCK
	__shared__ int sharedLock;
#endif
	int lockResult;
	int cnt;

	if (threadIdx.x == 0)
	{
		atomicExch(&LOCK, -1);
		totalParticipation = 0;
		__threadfence();
	}
	cnt = 0;
	__syncthreads();

	for (int i = 0; totalParticipation < 5; i++)
	{
		__syncthreads();
		lockResult = atomicCAS(&LOCK, -1, (int)threadIdx.x);
		if (lockResult == -1)
		{
			totalParticipation++;
			__threadfence();
			atomicExch(&LOCK, -1);
		}
		__syncthreads();
		__threadfence();
		if (++cnt == 100)    // bail out after 100 iterations
		{
			totalParticipation = 1005;
			break;
		}
	}
	__syncthreads();

	if (threadIdx.x == 0)
	{
		result = totalParticipation;
	}
	__syncthreads();
	return;
}

int main()
{
	cudaError_t err;
	int host = -123;

	checkKernel<<<1, 32>>>();
	err = cudaThreadSynchronize();
	if (err != 0)
	{
		printf("Error launching kernel, err = %d\n", err);
		return -1;
	}

	cudaMemcpyFromSymbol(&host, "result", sizeof(int), 0, cudaMemcpyDeviceToHost);
	printf("host = %d\n", host);
	return 0;
}

Cool bug! ;)

Looking at decuda’s output, the problem is triggered by the conditional:

    if (lockResult == -1) { … }

In the global-memory case, it is compiled as:

000150: 308103fd 6c4147d8 set.ne.s32 $p1|$o127, $r1, c1[0x0004]
000158: a004c003 00000000 join.label label2
000160: 1004c003 00001280 @$p1.ne bra.label label2
...
000260: f0000001 e0000002 label2: nop.join
000268: 861ffe03 00000000 bar.sync.u32 0x00000000

While in the shared-memory case, it is:

0001b0: 308101fd 6c4147d8 set.ne.s32 $p1|$o127, $r0, c1[0x0004]
0001b8: 1005c003 00001280 @$p1.ne bra.label label11
...
0002e0: 861ffe03 00000000 label11: bar.sync.u32 0x00000000

So the compiler did not include the reconvergence point in the latter case.

Remember that when an if is encountered, the else part is executed first; that is, conditional forward jumps are followed first.

In this case, the inside of the if is not executed until a reconvergence point is reached. In the shared-memory version, this only happens after the end of the loop, hence the 1006 result.

(Actually, I don't even understand why it works at all. It should deadlock when trying to conditionally execute the __syncthreads().)

What do you get if you add an else branch with some dummy code inside? I don’t have a GT200 at hand to test it myself…
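The modified conditional inside checkKernel would look something like this (the dummy statement is just illustrative, and the compiler might well optimize it away, so no guarantees):

	lockResult = atomicCAS(&LOCK, -1, (int)threadIdx.x);
	if (lockResult == -1)
	{
		totalParticipation++;
		__threadfence();
		atomicExch(&LOCK, -1);    // release the lock
	}
	else
	{
		// dummy code, hopefully forcing the compiler to emit a reconvergence point
		lockResult = -2;
	}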

So the question is why the compiler didn’t include the reconvergence information in the shared-mem case. This looks like a compiler bug to me.

By the way, I thought __threadfence was implemented as an assembly instruction. It’s not.

I suppose the current implementation means that NVIDIA doesn't expect to release a GPU with a memory bus wider than 512 bits in the near future. :)

And that's the conclusion I reached last week when I did exactly what you just did. :)

If somebody wants to write a shorter/more obvious repro case, it’d save me some time.

Hello,

This brings up a good question I was just discussing with someone. Why would __threadfence be implemented as a CUBIN instruction when the SMs are in-order architectures? I thought __threadfence's purpose was only to prevent the compiler (nvopencc and ptxas) from reordering memory operations around the fence. In other words, would you still need a fence instruction at the CUBIN level when your architecture is in-order?

Thanks,

Rodrigo

There's no guaranteed ordering for memory transactions without __threadfence. The fact that the SM is in-order doesn't really matter.

My guess is (elaborating on Tim’s answer :) ) that even if the SMs are in-order (actually they have some out-of-order capabilities), the memory controllers and especially the interconnect are not.

So if you have two memory requests accessing different memory partitions and there is congestion on one partition, the request to the other partition may be serviced earlier even though it was issued later.

The __threadfence instruction sends a read request to every memory partition, then waits for their completion. This ensures memory ordering, at the cost of a few thousand cycles…
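To make that concrete, this is the textbook producer pattern where the fence matters (a minimal sketch; without the __threadfence, the write to flag could become visible to other blocks before the write to data):

__device__ int data;
__device__ volatile int flag = 0;

__global__ void producer(void)
{
	data = 42;          // this write may go to one memory partition...
	__threadfence();    // ...so wait until it is visible device-wide...
	flag = 1;           // ...before signalling readiness through another partition
}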

Man, why do I even answer architecture questions, you should just do it since it seems like you’ve figured all this out anyway :)

OK, tmurray. Since you've figured it out, could you please advise us on how to avoid such compiler bugs in the meantime?

Not really, because I haven't had a chance to reduce it to a smaller repro. There are still a whole lot of side effects in the posted code that make it pretty painful for the compiler team to debug.

Once I do, I’ll let you know.

Tim, Sylvain, CVN

Thank you for your time and effort on this.

I have never decuda'ed a cubin, so it's heartening to see someone get down to this level and find a bug. Thanks!

Also, Tim,
I thought this was the simplest repro (repro stands for reproduce, right?). I don't understand what you mean by “side effects”, etc…

By the way, how do “global” memory versus “shared” memory affect the IF-ELSE reconvergence part of the CUBIN?

That's strange…

I am just raising this because it might serve as a clue for the compiler team…

Hi, Sarnath

From my PTX analysis, there is no problem with the IF-ELSE reconvergence. When LOCK is declared as a global variable, the compiler generates appropriate code. When LOCK is declared as a shared variable, the compiler only reads it at start-up (before the loop) and then treats it as a register variable.
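If so, that would be the classic non-volatile shared-memory pitfall. As an illustration (my own sketch, not taken from the generated code):

	__shared__ int flag;    // note: not volatile

	// The compiler may legally load 'flag' once and spin on the register copy,
	// so updates made by other threads are never observed:
	while (flag != 1)       // effectively: r0 = flag; while (r0 != 1) {}
		;                   // hangs forever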

Yeah, maybe… But people are looking at the CUBIN, so I would guess that something is happening at some intermediate stage…

Since I’m now in charge of answering architecture questions… ;)

Most likely, it’s due to the fact that shared-memory atomics are not natively supported in hardware, and they use spinlocks to acquire a cell of shared memory. In pseudocode:

int atomicAdd(shared int *addr, int n)
{
	while (!acquire(addr)) {}   // spin until this shared-memory cell is locked
	// inside the critical section
	int old = *addr;
	*addr += n;
	release(addr);              // unlock the cell
	return old;                 // atomicAdd returns the previous value
}

Where acquire and release are assembly instructions supported starting from the GT200.

So each time you use a shared-mem atomic, the compiler inserts a loop, and that may disturb other parts of the branch-generation system that are not aware of it (just guessing).

By the way, I'm not convinced shared atomics are of any use given the current implementation. They actually seem to be slower than global-memory atomics (~2000 cycles), unless there are very few threads executing the atomic instruction.
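If anyone wants to measure it, a rough clock()-based sketch along these lines should do (my own harness, not measured numbers; per-thread timings overlap between warps, so take the result with a grain of salt):

__global__ void timeSharedAtomic(int *out)
{
	__shared__ int s;
	if (threadIdx.x == 0) s = 0;
	__syncthreads();

	clock_t start = clock();
	atomicAdd(&s, 1);                  // one shared-memory atomic per thread
	clock_t stop = clock();

	__syncthreads();
	if (threadIdx.x == 0)
		out[0] = (int)(stop - start); // cycles observed by thread 0
}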

@Sylvain,

Do you mean to say that the PTX instruction

atom.shared.cas.b32 	%rv1, [%r3], %r8, %r1;

translates to a while loop in the CUBIN???

Yes, exactly:

join.label label7
label5: mov.acquire.b32 $p1|$r0, s[0x0010]
@$p1.lt bra.label label6
bra.label label5
label6: set.eq.u32 $p1|$o127, $r0, c1[0x0004]
mov.b32 $r1, $r0
@$p1.ne mov.b32 $r1, $r2
mov.b32 s[0x0010], $r1
mov.release.b32 s[0x0010], $r1
label7: nop.join

where:

s[0x0010] is [%r3]
c1[0x0004] is %r8 = -1
$r2 is %r1
$r1 is %rv1
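So, putting it together, the single PTX atom.shared.cas expands to something like this (a C-style sketch using the acquire/release pseudocode from my earlier post, not an exact instruction-by-instruction rendering):

int sharedCAS(int *addr, int compare, int val)
{
	while (!acquire(addr)) {}   // spin on the lock bit (the label5 loop)
	int old = *addr;            // the value loaded by mov.acquire
	if (old == compare)         // set.eq against c1[0x0004], i.e. -1
		*addr = val;            // the conditional swap
	release(addr);              // mov.release frees the lock bit
	return old;                 // returned in %rv1
}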