We had an “atomicCAS” thread some time back where deadlocks were caused by “spin” loops like the one you mention above.
The thing was: the warp scheduler kept scheduling the “spinning” warps first, and the warp that had acquired the lock never got scheduled. That was a classic deadlock. I don't know whether that holds water in this CUBIN code as well :-(
But with the “cnt” logic in place, I would not believe there is a deadlock here. Visibility of shared-memory updates takes some time, and that is the most confusing part. Even with the “global lock” in place, the results were correct only if “cnt” was checked against 1000; they were wrong if “cnt” was checked against 100. Oh man…
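For reference, here is a minimal sketch of the shared-memory spin-lock pattern with a bounded “cnt” retry that we are talking about. It is not the kernel from this thread; the kernel name, the total counter and the retry limit of 1000 are made up for illustration.

__global__ void spinLockSketch(int *out)
{
    __shared__ int lock;              // hypothetical lock word: 0 = free, 1 = taken
    __shared__ volatile int total;    // hypothetical value protected by the lock
    if (threadIdx.x == 0) { lock = 0; total = 0; }
    __syncthreads();

    // Bounded retry (the "cnt" logic): give up after a fixed number of attempts
    // instead of spinning forever. An unbounded while (atomicCAS(...)) loop can
    // hang when threads of the same warp diverge on the lock, because the
    // scheduler may keep replaying the spinning side of the branch.
    bool done = false;
    for (int cnt = 0; cnt < 1000 && !done; ++cnt) {
        if (atomicCAS(&lock, 0, 1) == 0) {   // try to take the lock
            total += 1;                      // critical section (placeholder work)
            __threadfence_block();           // push the update out before releasing
            atomicExch(&lock, 0);            // release the lock
            done = true;
        }
    }

    __syncthreads();
    if (threadIdx.x == 0)
        out[blockIdx.x] = total;             // blockDim.x if every thread got the lock
}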
OK, my PTX analysis was wrong. That register contains the address of LOCK, not its value.
I believe that, if shared-memory atomic instructions are not intrinsic, deadlocks are usually unavoidable when more than one thread in the same warp executes one in a conditional path. That would not be a compiler bug. We should not abuse shared-memory atomic instructions in such heavy checking code.
This results in a host value of 12, but if you remove the call to __threadfence(), the result is -123.
EDIT: Also, if you remove the call to __threadfence() but change the comparison to (threadIdx.x == 31), the result is 12 once again, which indicates that the last thread of a warp is responsible for this.
But changing the comparison to (threadIdx.x == 63), the last thread of the second warp, results in -123 once again :blink:
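For context (the actual test kernel is earlier in the thread and is not reproduced here), the behaviour above is the kind of thing you would probe with a producer/consumer pattern along these lines; the names value, flag and fenceTest, and the choice of threads 0 and 32, are illustrative only, with 12 and -123 standing in for the written and the initial value.

__device__ volatile int value = -123;    // read back by the host; -123 means stale

__global__ void fenceTest(int *out)
{
    __shared__ volatile int flag;
    if (threadIdx.x == 0) flag = 0;
    __syncthreads();

    if (threadIdx.x == 0) {              // producer (warp 0)
        value = 12;                      // publish the real result
        __threadfence();                 // the call the experiment above removes
        flag = 1;                        // signal the consumer
    }

    if (threadIdx.x == 32) {             // consumer in another warp
        while (flag == 0)                // spin until signalled
            ;
        *out = value;                    // 12 if the write was visible, else -123
    }
}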
I take it that inserting the __threadfence() call in the code of this thread does not solve the problem, and that it really comes down to the shared vs. global lock variable?
It’s implemented as no instruction at all (that is, not really implemented), which means that memory ordering is always consistent inside a block on current architectures; that may change in the future, though.
Thanks for your response. However, I believe that, without __threadfence_block(), a multiprocessor does not wait for the current write to complete before proceeding to the next instruction. __threadfence_block() does have certain effects, so in that sense it would already be “implemented”.
Yes, __threadfence() does NOT solve this forum thread’s problem. The problem you mention is a separate compiler bug that is (hopefully) being tracked at the URL above.
I agree. This was my experience as well, even inside a warp.
But it doesn’t necessarily mean that the reads and writes will be inconsistent.
Knowing that:
- instructions are started in order, and
- global memory is fully consistent as seen from a single thread (that is, a read of t[i] after a write to t[i] always returns the value that was just written),
it seems natural that the mechanism used to maintain consistency inside a warp is also used across warps inside an SM or TPC (it would probably be more expensive to do otherwise).
Relaxing consistency further would allow for performance improvements, so I suppose NVIDIA introduced this instruction to be able to apply such optimizations in the future while maintaining compatibility.
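To illustrate that last point with a concrete (made-up) example: in a within-block handoff like the sketch below, __threadfence_block() sits between the data write and the flag write. Even if it compiles to no instruction today, keeping it there means the code stays correct if the consistency model is relaxed on future architectures.

__global__ void handoff(int *out)
{
    __shared__ volatile int data;     // hypothetical payload
    __shared__ volatile int ready;    // hypothetical flag
    if (threadIdx.x == 0) { data = 0; ready = 0; }
    __syncthreads();

    if (threadIdx.x == 0) {           // writer, warp 0
        data = 42;
        __threadfence_block();        // order the data write before the flag write
        ready = 1;
    }
    if (threadIdx.x == 32) {          // reader, warp 1
        while (ready == 0)            // spin until the writer raises the flag
            ;
        *out = data;                  // intended to always observe 42
    }
}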
I have been meaning to take a closer look at this bug and I finally got some time today. :)
However, when I try to use decuda 0.4.2 with the shared-memory lock version, I get an error:
$ ~/decuda-0.4.2/decuda bug99521.sm_13.cubin
// Disassembling _Z11checkKernelv (0)
Traceback (most recent call last):
File "/home/rdomingu/decuda-0.4.2/decuda", line 89, in <module>
main()
File "/home/rdomingu/decuda-0.4.2/decuda", line 86, in main
kernel.disassemble(sys.stdout, formatter)
File "/home/rdomingu/decuda-0.4.2/CubinFile.py", line 116, in disassemble
instructions.append(disa.decode(base, inst))
File "/home/rdomingu/decuda-0.4.2/Disass.py", line 121, in decode
i.decode()
File "/home/rdomingu/decuda-0.4.2/Opcodes.py", line 122, in decode
dtype = (OP_SIGN_NONE,size,OP_TYPE_INT)
UnboundLocalError: local variable 'size' referenced before assignment
Yes, decuda doesn’t currently support shared atomics…
You need to make the following modifications to Opcodes.py:
# Line 101
class stsha(Instruction):
"""Store a value to shared memory"""
def decode(self):
super(stsha, self).decode()
self.base = "mov"
type = self.bits(1,0x00600000) # dst width
if type == 0:
size = 16
elif type == 1:
size = 32
elif type == 2:
size = 8
else: # ??
size = 32
self.modifiers.append(".?%i?" % type)
atom = self.bits(1,0x00800000)
if atom:
self.modifiers.append(".atom")
#flag = self.bits(0,0x08000000)
...
# Line 186 (after edit)
class ldshar(Instruction):
"""Load data between registers, constants, ..."""
def decode(self):
super(ldshar,self).decode()
self.base = "mov"
atom = self.bits(1,0x00800000)
if atom:
self.modifiers.append(".atom")
if self.subsubop <= 0x3:
...
One day, I will put these changes together and send a patch to Wumpus. One day…