Shared mem atomics (repeat topic)

Well Sylvain,

This is very interesting. Thanks for posting.

We had an “atomicCAS” thread some time back where deadlocks were caused by “spin” loops like the one you have mentioned above.

The thing was: the warp scheduler kept scheduling the “spinning” warps first, and the one that acquired the lock never got scheduled. That was a classic deadlock. I don't know whether that explanation holds water for this CUBIN code as well :-(

But with the “cnt” logic in place, I would not believe there is a deadlock here. Visibility of shared memory updates takes some time; that's the most confusing thing. Even with the “global lock” in place, the results were correct only if “cnt” was checked against 1000. The results were wrong if “cnt” was checked against 100. Oh man…
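For anyone joining the thread late, this is roughly the shape of the pattern being discussed. The kernel below is only a sketch with made-up names (lock, counter, the 1000-iteration bound); it is not the original test case, just an illustration of a bounded spin on a shared-memory lock:

[codebox]// Sketch only: every thread of the block tries to take a shared-memory lock
// with atomicCAS and gives up after a fixed number of retries ("cnt").
// Requires compute capability 1.2+ for shared-memory atomics.
__global__ void spinSketch(int *out)
{
    __shared__ int lock;      // 0 = free, 1 = taken
    __shared__ int counter;   // protected by the lock

    if (threadIdx.x == 0)
    {
        lock = 0;
        counter = 0;
    }
    __syncthreads();

    // Bounded spin: when all 32 lanes of a warp contend, the failing lanes
    // may be scheduled ahead of the lane that owns the lock, so an unbounded
    // loop can hang and a bounded one can simply miss updates.
    for (int cnt = 0; cnt < 1000; cnt++)
    {
        if (atomicCAS(&lock, 0, 1) == 0)
        {
            counter++;                // critical section
            __threadfence_block();    // make the update visible to the block
            atomicExch(&lock, 0);     // release
            break;
        }
    }
    __syncthreads();

    if (threadIdx.x == 0)
        out[blockIdx.x] = counter;
}[/codebox]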

What exactly makes shared atomics require the higher compute capability, then, if they just translate to a loop?

Good point…

Shared mem atomics require Compute 1.2 hardware.

but atomics were introduced in Compute 1.1 hardware.

Brilliant catch!

Hm… Maybe that

mov.acquire

started working for smem on Compute 1.2 hardware… Maybe…

This could be tested by patching a cubin test case that uses shared atomics to request compute capability 1.1 instead of 1.2, and then executing it on G92 hardware.

OK, my PTX analysis was wrong. That register contains the address of LOCK, not its value.

I believe that, if shared-memory atomic instructions are not intrinsic, deadlocks are usually unavoidable when more than one thread in the same warp executes one in a conditional path. That would not be a compiler bug. We should not abuse shared-memory atomic instructions in such heavy checking code.

32 threads vying for a lock is abuse? :-(

I think using them for an intra-warp “spinlock” is abuse, Sarnath.
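To make the point concrete: when the critical section only updates a counter (as the totalParticipation check later in this thread suggests), the lock can usually be dropped and the shared-memory atomic used directly, which avoids the divergent spin entirely. A minimal sketch, with assumed variable names:

[codebox]// Sketch: count participation with a single shared-memory atomicAdd
// instead of guarding a counter with a spinlock. Compute 1.2+ required.
__global__ void countWithoutLock(int *out)
{
    __shared__ int totalParticipation;

    if (threadIdx.x == 0)
        totalParticipation = 0;
    __syncthreads();

    atomicAdd(&totalParticipation, 1);   // every thread reports once, no lock
    __syncthreads();

    if (threadIdx.x == 0)
        out[blockIdx.x] = totalParticipation;   // equals blockDim.x
}[/codebox]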

CVN, maybe…

Let us wait and see what Tim and his compiler team say.

They have acknowledged that there is a compiler bug (I can't make any sense of what exactly the bug is…). Maybe I should run decuda and see it for myself.

Just a side question: is __threadfence_block() also not intrinsic, like the shared-memory atomic functions?

I don't have a device of compute capability 1.3 to test on at the moment, but I was just wondering what happens if you change

for(int i=0; totalParticipation<1; i++)

to

while(totalParticipation<1)

It shouldn't make a difference, but you never know; it looks like stranger things have happened already :)

N.

Nico,

I tested what you suggested.

Same hang… no change…

Thanks for your interest!

Here’s another interesting test case. I just isolated the last part of your kernel:

[codebox]#include <stdio.h>

__device__ int result = -123;

__global__ void checkKernel(void)
{
    if (threadIdx.x == 0)
    {
        result = 12;
        __threadfence();
    }
    __syncthreads();
}

int main()
{
    cudaError_t err;
    int host = -456;

    checkKernel<<<1, 64>>>();

    err = cudaThreadSynchronize();
    if (err != 0)
    {
        printf("Error launching kernel, err = %d\n", err);
        return -1;
    }

    cudaMemcpyFromSymbol(&host, "result", sizeof(int), 0, cudaMemcpyDeviceToHost);
    printf("host = %d\n", host);

    return 0;
}[/codebox]

This results in a host value of 12, but if you remove the call to __threadfence(), the result is -123.

EDIT: Also, if you remove the call to __threadfence() but change the comparison to (threadIdx.x == 31), the result is 12 once again, which indicates the last thread of a warp is responsible for this.

But changing the comparison to (threadIdx.x == 63) for the last thread of the second warp results in -123 once again :blink:

N.

Nico,

This is true. It has been discussed in another thread, and it looks like a compiler bug.
Check this thread: [url=“http://forums.nvidia.com/index.php?showtopic=98640”]http://forums.nvidia.com/index.php?showtopic=98640[/url]
The thread analyzes the PTX and zeroes in on the problem.

Best regards,
Sarnath

I take it that inserting the __threadfence() call in the code of this thread does not solve the problem, and that it really has to do with the shared vs. global lock variable?

N.

It’s implemented as no instruction (that is, not implemented), which means that memory ordering is always consistent inside a block on current architectures, but that may change in the future.

Thanks for your response. However, I believe that, without __threadfence_block(), a multiprocessor does not wait until the current write completes before proceeding to the next instruction. __threadfence_block() does have certain effects, so it would already be “implemented”.

Yes, __threadfence() does NOT solve this forum thread's problem. The problem you mention is a separate compiler bug that is (hopefully) being tracked in the URL above.

I agree. This was my experience as well, even inside a warp.

But it doesn’t necessarily mean that the reads and writes will be inconsistent.

Knowing that:

  • instructions are started in order

  • global memory is fully consistent from 1 thread

(that is, a read from t[i] after a write to t[i] will always return the value that was just written.)

It seems natural that the mechanism used to maintain consistency inside a warp is also used across warps inside an SM or TPC (it would probably be more expensive to do otherwise).

Relaxing consistency further would allow potential performance improvements, so I suppose NVIDIA introduced this instruction to be able to perform such optimizations in the future while maintaining compatibility.
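For what it's worth, here is a sketch of the kind of intra-block handoff __threadfence_block() is meant to order: one thread publishes a value and then raises a flag, and a thread in another warp polls the flag before reading the value. On current hardware the fence may compile to nothing, as noted above, but it expresses the ordering the code relies on if the consistency model is ever relaxed. The names and the two-warp layout are my assumptions:

[codebox]// Sketch: producer/consumer handoff inside one block. Launch with at least
// 64 threads so the writer (thread 0) and the reader (thread 32) sit in
// different warps; polling within a single warp could diverge and hang.
__global__ void handoffSketch(int *out)
{
    __shared__ volatile int data;    // payload
    __shared__ volatile int ready;   // 0 = not published, 1 = published

    if (threadIdx.x == 0)
        ready = 0;
    __syncthreads();

    if (threadIdx.x == 0)
    {
        data = 42;                   // 1. write the payload
        __threadfence_block();       // 2. order the write before the flag
        ready = 1;                   // 3. publish
    }
    else if (threadIdx.x == 32)
    {
        while (ready == 0)           // poll until the flag is raised
            ;
        out[blockIdx.x] = data;      // payload is now visible
    }
}[/codebox]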

Hi Sylvain,

I have been meaning to take a closer look at this bug and I finally got some time today. :)

However, when I try to use decuda 0.4.2 with the shared mem lock version I get an error:

$ ~/decuda-0.4.2/decuda bug99521.sm_13.cubin
// Disassembling _Z11checkKernelv (0)
Traceback (most recent call last):
  File "/home/rdomingu/decuda-0.4.2/decuda", line 89, in <module>
    main()
  File "/home/rdomingu/decuda-0.4.2/decuda", line 86, in main
    kernel.disassemble(sys.stdout, formatter)
  File "/home/rdomingu/decuda-0.4.2/CubinFile.py", line 116, in disassemble
    instructions.append(disa.decode(base, inst))
  File "/home/rdomingu/decuda-0.4.2/Disass.py", line 121, in decode
    i.decode()
  File "/home/rdomingu/decuda-0.4.2/Opcodes.py", line 122, in decode
    dtype = (OP_SIGN_NONE,size,OP_TYPE_INT)
UnboundLocalError: local variable 'size' referenced before assignment

Did you get something like this?

Thank you,

Rodrigo

Yes, decuda doesn’t currently support shared atomics…

You need to perform the following modifications in Opcodes.py

# Line 101
class stsha(Instruction):
	"""Store a value to shared memory"""
	def decode(self):
		super(stsha, self).decode()
		self.base = "mov"

		type = self.bits(1,0x00600000) # dst width
		if type == 0:
			size = 16
		elif type == 1:
			size = 32
		elif type == 2:
			size = 8
		else: # ??
			size = 32
			self.modifiers.append(".?%i?" % type)

		atom = self.bits(1,0x00800000)
		if atom:
			self.modifiers.append(".atom")

		#flag =  self.bits(0,0x08000000)
...

# Line 186 (after edit)
class ldshar(Instruction):
	"""Load data between registers, constants, ..."""
	def decode(self):
		super(ldshar,self).decode()

		self.base = "mov"

		atom = self.bits(1,0x00800000)
		if atom:
			self.modifiers.append(".atom")

		if self.subsubop <= 0x3:
...

One day, I will put these changes together and send a patch to Wumpus. One day…