Compiler bug with shared atomics?

Been staring at this for best part of 3 days now, and think have narrowed it down to a compiler bug…

I have a stack in shared memory implemented using shared atomic operations. Each object has a ‘next in stack’ pointer which I exchange with the top of stack pointer using atomic operations. The push operation appears to work fine. The pop operation appears to generate PTX code which reads the top of stack again after the atomic operation (if the stack is empty, the operation will dereference a null pointer), either resulting in the kernel failing, or the machine locking up…

The only slightly odd thing I can think of that I’m doing is abusing the atomicCAS for unsigned ints to read/write pointers. Don’t think there’s really any alternative tho, since there aren’t any atomic ops for pointers?

Here’s the simplest kernel I’ve managed to come up with that shows the problem:

[codebox]

struct Node

{

int some_random_data;

Node* stack;

};

device Node global_node;

device Node* result;

template

inline device T* atomic_cas(T** val, T* cmp, T* set)

{

return (T*)(atomicCAS((unsigned int*)(val), (unsigned int)(cmp), (unsigned int)(set)));

}

device

void push_stack(Node** stack, Node* n)

{

do

{

  n -> stack = *stack;

} while (atomic_cas(stack, n → stack, n) != n → stack);

}

device

Node* pop_stack(Node** stack)

{

Node* n;

do

{

  n = *stack;

  if (!n)

     return 0;

}

while (atomic_cas(stack, n, n → stack) != n);

n → stack = 0;

return n;

}

global

void test()

{

shared Node* stack;

if (threadIdx.x == 0)

{

  global_node.stack = 0;

  push_stack(&stack, &global_node);

  result = pop_stack(&stack);

}

}

[/codebox]

The PTX for the pop is here:

[codebox]

$Lt_0_5378:

// Loop body line 37

.loc	2	37	0

ld.shared.u32 	%r9, [__cuda_stack12];

mov.u32 	%r10, 0;

setp.ne.u32 	%p3, %r9, %r10;

@%p3 bra 	$Lt_0_5634;

mov.u32 	%r11, 0;

bra.uni 	$Lt_0_258;

$Lt_0_5634:

// Part of loop body line 37, head labeled $Lt_0_5378

ld.shared.u32 	%r9, [__cuda_stack12];

.loc	2	40	0

ld.global.u32 	%r12, [%r9+4];

.loc	2	17	0

atom.shared.cas.b32 	%rv1, [%r4], %r9, %r12;

mov.s32 	%r13, %rv1;

.loc	2	40	0

setp.ne.u32 	%p4, %r13, %r9;

@%p4 bra 	$Lt_0_5378;

.loc	2	43	0

mov.u32 	%r14, 0;

.loc	2	37	0

ld.shared.u32 	%r9, [__cuda_stack12];       <-- This looks wrong to me...?

.loc	2	43	0

st.global.u32 	[%r9+4], %r14;

mov.s32 	%r11, %r9;

$Lt_0_258:

.loc	2	59	0

st.global.u32 	[result], %r11;

[/codebox]

Have I missed something obvious? Usually find after writing something like this I’ll spot a stupid bug 5 mins after hitting post :)

This is with CUDA 2.3 - will try with the 3.0 beta and see if it’s any different…

Thanks,

Rich.

If you are working on a 64 bit host machine, then I don’t think it is going to work - CUDA pointers are the same size as host pointers, so for x86_64 sizeof(void *) != sizeof(unsigned int) and I wouldn’t expect your atomicCAS operations to work as you would hope. If you are on IA32, then the idea is at least feasible. The implementation details I can’t comment on.