Shared mem atomics (repeat topic)

Well Sylvain,

This is very interesting. Thanks for posting.

We had an “atomicCAS” thread some time back where deadlocks were caused by “spin” loops like the one you have mentioned above.

The thing was: the warp scheduler kept scheduling the “spinning” warps first, and the one that acquired the lock never got scheduled. That was a classic deadlock. I don't know whether that explanation holds water for this CUBIN code as well :-(

But with the “cnt” logic in place, I would not believe there is a deadlock here. Visibility of shared memory updates takes some time; that's the most confusing thing. Even with the “global lock” in place, the results were correct only if “cnt” was checked against 1000. The results were wrong if “cnt” was checked against 100. Oh man…
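For anyone joining the thread late, this is roughly the shape of the pattern being discussed. The kernel below is only a sketch with made-up names (lock, counter, the 1000-iteration bound); it is not the original test case, just an illustration of a bounded spin on a shared-memory lock:

[codebox]// Sketch only: every thread of the block tries to take a shared-memory lock
// with atomicCAS and gives up after a fixed number of retries ("cnt").
// Requires compute capability 1.2+ for shared-memory atomics.
__global__ void spinSketch(int *out)
{
    __shared__ int lock;      // 0 = free, 1 = taken
    __shared__ int counter;   // protected by the lock

    if (threadIdx.x == 0)
    {
        lock = 0;
        counter = 0;
    }
    __syncthreads();

    // Bounded spin: when all 32 lanes of a warp contend, the failing lanes
    // may be scheduled ahead of the lane that owns the lock, so an unbounded
    // loop can hang and a bounded one can simply miss updates.
    for (int cnt = 0; cnt < 1000; cnt++)
    {
        if (atomicCAS(&lock, 0, 1) == 0)
        {
            counter++;                // critical section
            __threadfence_block();    // make the update visible to the block
            atomicExch(&lock, 0);     // release
            break;
        }
    }
    __syncthreads();

    if (threadIdx.x == 0)
        out[blockIdx.x] = counter;
}[/codebox]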

What exactly makes shared atomics require the higher compute capability, then, if they just translate to a loop?

Good point…

Shared mem atomics require Compute 1.2 hardware.

but atomics were introduced in Compute 1.1 hardware.

Brilliant catch!

Hm… Maybe that

mov.acquire

started working for smem on Compute 1.2 hardware… Maybe…

This could be tested by patching a cubin test case that uses shared atomics to request compute capability 1.1 instead of 1.2, and then executing it on G92 hardware.

OK, my PTX analysis was wrong. That register contains the address of LOCK, not its value.

I believe that, if shared-memory atomic instructions are not intrinsic, deadlocks are usually unavoidable when more than one thread in the same warp executes one in a conditional path. That would not be a compiler bug. We should not abuse shared-memory atomic instructions in such heavy checking code.

32 threads vying for a lock is abuse? :-(

I think using them for an intra-warp “spinlock” is abuse, Sarnath.
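To make the point concrete: when the critical section only updates a counter (as the totalParticipation check later in this thread suggests), the lock can usually be dropped and the shared-memory atomic used directly, which avoids the divergent spin entirely. A minimal sketch, with assumed variable names:

[codebox]// Sketch: count participation with a single shared-memory atomicAdd
// instead of guarding a counter with a spinlock. Compute 1.2+ required.
__global__ void countWithoutLock(int *out)
{
    __shared__ int totalParticipation;

    if (threadIdx.x == 0)
        totalParticipation = 0;
    __syncthreads();

    atomicAdd(&totalParticipation, 1);   // every thread reports once, no lock
    __syncthreads();

    if (threadIdx.x == 0)
        out[blockIdx.x] = totalParticipation;   // equals blockDim.x
}[/codebox]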

CVN, maybe…

Let us wait and see what Tim and his compiler team say.

They have acknowledged that there is a compiler bug (I can't make any sense of what exactly the bug is…). Maybe I should run decuda and see it for myself.

Just a side question: is __threadfence_block() also not intrinsic, like the shared-memory atomic functions?

I don't have a device of compute capability 1.3 to test on at the moment, but I was just wondering what happens if you change

for(int i=0; totalParticipation<1; i++)

to

while(totalParticipation<1)

It shouldn't make a difference, but you never know; it looks like stranger things have happened already :)

N.

Nico,

I tested what you suggested.

Same hang… no change…

Thanks for your interest!

Here’s another interesting test case. I just isolated the last part of your kernel:

[codebox]#include <stdio.h>

__device__ int result = -123;

__global__ void checkKernel(void)
{
    if (threadIdx.x == 0)
    {
        result = 12;
        __threadfence();
    }
    __syncthreads();
}

int main()
{
    cudaError_t err;
    int host = -456;

    checkKernel<<<1, 64>>>();

    err = cudaThreadSynchronize();
    if (err != 0)
    {
        printf("Error launching kernel, err = %d\n", err);
        return -1;
    }

    cudaMemcpyFromSymbol(&host, "result", sizeof(int), 0, cudaMemcpyDeviceToHost);
    printf("host = %d\n", host);

    return 0;
}[/codebox]

This results in a host value of 12, but if you remove the call to __threadfence(), the result is -123.

EDIT: Also, if you remove the call to __threadfence() but change the comparison to (threadIdx.x == 31), the result is 12 once again, which indicates the last thread of a warp is responsible for this.

But changing the comparison to (threadIdx.x == 63) for the last thread of the second warp results in -123 once again :blink:

N.

Nico,

This is true. It has been discussed in another thread, and it looks like a compiler bug.
Check this thread: [url=“http://forums.nvidia.com/index.php?showtopic=98640”]http://forums.nvidia.com/index.php?showtopic=98640[/url]
The thread analyzes the PTX and zeroes in on the problem.

Best regards,
Sarnath

I take it that inserting the __threadfence() call in the code of this thread does not solve the problem, and that it really has to do with the shared vs. global lock variable?

N.

It’s implemented as no instruction (that is, not implemented), which means that memory ordering is always consistent inside a block on current architectures, but that may change in the future.

Thanks for your response. However, I believe that, without __threadfence_block(), a multiprocessor does not wait until the current write completes before proceeding to the next instruction. __threadfence_block() does have certain effects, so it would already be “implemented”.

Yes, __threadfence() does NOT solve this forum thread's problem. The problem you mention is a separate compiler bug that is (hopefully) being tracked in the URL above.

I agree. This was my experience as well, even inside a warp.

But it doesn’t necessarily mean that the reads and writes will be inconsistent.

Knowing that:

  • instructions are started in order

  • global memory is fully consistent from 1 thread

(that is, a read from t[i] after a write to t[i] will always return the value that was just written.)

It seems natural that the mechanism used to maintain consistency inside a warp is also used across warps inside an SM or TPC (it would probably be more expensive to do otherwise).

Relaxing consistency further would allow potential performance improvements, so I suppose NVIDIA introduced this instruction to be able to perform such optimizations in the future while maintaining compatibility.
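For what it's worth, here is a sketch of the kind of intra-block handoff __threadfence_block() is meant to order: one thread publishes a value and then raises a flag, and a thread in another warp polls the flag before reading the value. On current hardware the fence may compile to nothing, as noted above, but it expresses the ordering the code relies on if the consistency model is ever relaxed. The names and the two-warp layout are my assumptions:

[codebox]// Sketch: producer/consumer handoff inside one block. Launch with at least
// 64 threads so the writer (thread 0) and the reader (thread 32) sit in
// different warps; polling within a single warp could diverge and hang.
__global__ void handoffSketch(int *out)
{
    __shared__ volatile int data;    // payload
    __shared__ volatile int ready;   // 0 = not published, 1 = published

    if (threadIdx.x == 0)
        ready = 0;
    __syncthreads();

    if (threadIdx.x == 0)
    {
        data = 42;                   // 1. write the payload
        __threadfence_block();       // 2. order the write before the flag
        ready = 1;                   // 3. publish
    }
    else if (threadIdx.x == 32)
    {
        while (ready == 0)           // poll until the flag is raised
            ;
        out[blockIdx.x] = data;      // payload is now visible
    }
}[/codebox]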

Hi Sylvain,

I have been meaning to take a closer look at this bug and I finally got some time today. :)

However, when I try to use decuda 0.4.2 with the shared mem lock version I get an error:

$ ~/decuda-0.4.2/decuda bug99521.sm_13.cubin
// Disassembling _Z11checkKernelv (0)
Traceback (most recent call last):
  File "/home/rdomingu/decuda-0.4.2/decuda", line 89, in <module>
    main()
  File "/home/rdomingu/decuda-0.4.2/decuda", line 86, in main
    kernel.disassemble(sys.stdout, formatter)
  File "/home/rdomingu/decuda-0.4.2/CubinFile.py", line 116, in disassemble
    instructions.append(disa.decode(base, inst))
  File "/home/rdomingu/decuda-0.4.2/Disass.py", line 121, in decode
    i.decode()
  File "/home/rdomingu/decuda-0.4.2/Opcodes.py", line 122, in decode
    dtype = (OP_SIGN_NONE,size,OP_TYPE_INT)
UnboundLocalError: local variable 'size' referenced before assignment

Did you get something like this?

Thank you,

Rodrigo

Yes, decuda doesn’t currently support shared atomics…

You need to perform the following modifications in Opcodes.py

# Line 101
class stsha(Instruction):
	"""Store a value to shared memory"""
	def decode(self):
		super(stsha, self).decode()
		self.base = "mov"

		type = self.bits(1,0x00600000) # dst width
		if type == 0:
			size = 16
		elif type == 1:
			size = 32
		elif type == 2:
			size = 8
		else: # ??
			size = 32
			self.modifiers.append(".?%i?" % type)

		atom = self.bits(1,0x00800000)
		if atom:
			self.modifiers.append(".atom")

		#flag =  self.bits(0,0x08000000)
...

# Line 186 (after edit)
class ldshar(Instruction):
	"""Load data between registers, constants, ..."""
	def decode(self):
		super(ldshar,self).decode()

		self.base = "mov"

		atom = self.bits(1,0x00800000)
		if atom:
			self.modifiers.append(".atom")

		if self.subsubop <= 0x3:
...

One day, I will put these changes together and send a patch to Wumpus. One day…