Been staring at this for best part of 3 days now, and think have narrowed it down to a compiler bug…
I have a stack in shared memory implemented using shared atomic operations. Each object has a ‘next in stack’ pointer which I exchange with the top of stack pointer using atomic operations. The push operation appears to work fine. The pop operation appears to generate PTX code which reads the top of stack again after the atomic operation (if the stack is empty, the operation will dereference a null pointer), either resulting in the kernel failing, or the machine locking up…
The only slightly odd thing I can think of that I’m doing is abusing the atomicCAS for unsigned ints to read/write pointers. Don’t think there’s really any alternative tho, since there aren’t any atomic ops for pointers?
Here’s the simplest kernel I’ve managed to come up with that shows the problem:
[codebox]
struct Node
{
int some_random_data;
Node* stack;
};
device Node global_node;
device Node* result;
template
inline device T* atomic_cas(T** val, T* cmp, T* set)
{
return (T*)(atomicCAS((unsigned int*)(val), (unsigned int)(cmp), (unsigned int)(set)));
}
device
void push_stack(Node** stack, Node* n)
{
do
{
n -> stack = *stack;
} while (atomic_cas(stack, n → stack, n) != n → stack);
}
device
Node* pop_stack(Node** stack)
{
Node* n;
do
{
n = *stack;
if (!n)
return 0;
}
while (atomic_cas(stack, n, n → stack) != n);
n → stack = 0;
return n;
}
global
void test()
{
shared Node* stack;
if (threadIdx.x == 0)
{
global_node.stack = 0;
push_stack(&stack, &global_node);
result = pop_stack(&stack);
}
}
[/codebox]
The PTX for the pop is here:
[codebox]
$Lt_0_5378:
// Loop body line 37
.loc 2 37 0
ld.shared.u32 %r9, [__cuda_stack12];
mov.u32 %r10, 0;
setp.ne.u32 %p3, %r9, %r10;
@%p3 bra $Lt_0_5634;
mov.u32 %r11, 0;
bra.uni $Lt_0_258;
$Lt_0_5634:
// Part of loop body line 37, head labeled $Lt_0_5378
ld.shared.u32 %r9, [__cuda_stack12];
.loc 2 40 0
ld.global.u32 %r12, [%r9+4];
.loc 2 17 0
atom.shared.cas.b32 %rv1, [%r4], %r9, %r12;
mov.s32 %r13, %rv1;
.loc 2 40 0
setp.ne.u32 %p4, %r13, %r9;
@%p4 bra $Lt_0_5378;
.loc 2 43 0
mov.u32 %r14, 0;
.loc 2 37 0
ld.shared.u32 %r9, [__cuda_stack12]; <-- This looks wrong to me...?
.loc 2 43 0
st.global.u32 [%r9+4], %r14;
mov.s32 %r11, %r9;
$Lt_0_258:
.loc 2 59 0
st.global.u32 [result], %r11;
[/codebox]
Have I missed something obvious? Usually find after writing something like this I’ll spot a stupid bug 5 mins after hitting post :)
This is with CUDA 2.3 - will try with the 3.0 beta and see if it’s any different…
Thanks,
Rich.