Implementing an atomic stack

I want to make an atomic-access stack in global device memory. The point is to coordinate between the threads that push items needing further work and the (lone) worker thread that grabs items off the stack to process them. However, designing such a stack is far from trivial!

So I have a flat array that’s [N] long. I obviously can’t write past this limit of N… if the stack is full, it’s OK, I’ll handle the work a different way.

So how would you implement the stack?

The obvious answer is to use a population counter: atomically increment it (atomicAdd) to reserve your data slot.

You also need to check if the increment ends up past the [N] size, and decrement it back down to repair your overflowed increment.

This works OK if there are no simultaneous reads. The code would look something like this:

[codebox]
unsigned int newIndex = atomicAdd(&currentSize, 1);  // reserve a slot

if (newIndex < max_n) {          // success!
    data[newIndex] = myData;     // safe to write into MY guaranteed slot
}
else {                           // oh no, output array is full
    atomicSub(&currentSize, 1);  // remove the extra count I added
}
[/codebox]

But if you’re reading at the same time… it’s just a mess, even with the assumption of only one reader but multiple writers.
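
Just to make the mess concrete, here is a naive reader sketch (my own illustration, not anyone's working solution); even with a single reader, the pop can grab a slot that a writer has reserved but not yet filled:

[codebox]
// naive (broken) single reader
unsigned int count = currentSize;    // snapshot of the population counter
if (count > 0) {
    atomicSub(&currentSize, 1);      // claim the top element
    int value = data[count - 1];     // RACE: a writer may have done its atomicAdd
                                     // for this slot but not yet stored into
                                     // data[count - 1], so we read garbage; or a
                                     // writer that overflowed may still be about
                                     // to atomicSub its own increment, so our
                                     // snapshot of count was already stale.
    process(value);                  // hypothetical consumer, placeholder name
}
[/codebox]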

It’s a big puzzle. Maybe you need two pointers and to use the array as a circular buffer: read off the “head” of the list and write only to the tail. But even then I keep expecting races and errors, because a writer adds its increment, then realizes it’s too much and has to repair itself, and there’s a race between those two accesses.

Maybe you need a third atomic variable to track size, and then only if you atomically pass the “size” test are you allowed to (atomically) increment the head pointer?

But then maybe you need a FOURTH variable to track the number of valid entries so that your reader doesn’t try to read the memory slot you’ve reserved but have not filled yet?
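
For what it’s worth, a common way to patch that “reserved but not filled” window is exactly such per-slot state. A rough writer-side sketch, assuming a hypothetical valid[] array alongside data[] (and it still leaves the overflow back-out race):

[codebox]
// writer: reserve, fill, then publish via a per-slot flag
unsigned int slot = atomicAdd(&currentSize, 1);  // reserve
if (slot < max_n) {
    data[slot] = myData;                         // fill the reserved slot
    __threadfence();                             // make the payload visible before the flag
    atomicExch(&valid[slot], 1);                 // publish: reader only trusts slots with valid[i] == 1
}
else {
    atomicSub(&currentSize, 1);                  // back out, with the same repair race as before
}
[/codebox]

The single reader would then only pop a slot whose valid flag is set, clearing it afterwards; that flag plays essentially the same role as the reserved 0 value in the reply below.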

I am not concerned about performance here; this is for shuffling (comparatively) rare accumulated work.

I think you could get it working with the 64-bit version of atomicCAS.

The idea is the following:

For your data array you reserve some value (e.g. 0) which means that the slot has not yet been written by the writer (after it has incremented the population counter). Initially your data array should be all 0s.

Also you need your ‘currentSize’ variable to be a 64-bit value: the low half holds the population counter, and the high half holds another counter that is incremented on every push/pop (this is needed to avoid the ABA problem).

The code would look like this:

[codebox]
// currentSize packs two 32-bit counters into one 64-bit word:
// low half = number of elements, high half = a counter bumped on every
// push/pop so a stale snapshot can never be mistaken for a fresh one (ABA guard).
__device__ unsigned long long int currentSize = 0;
__device__ int Data[N];                           // all slots start at 0 == "empty"

// writer (any number of threads)
for (;;)
{
    unsigned long long int idx = currentSize;     // snapshot of both counters
    unsigned int n = idx & 0xffffffff;            // current element count
    if (n >= N)                                   // stack is full: handle the work another way
        return;
    int d = Data[n];                              // the next free slot
    if (!d && idx == atomicCAS(&currentSize, idx, idx + 0x100000001ULL)) // add 1 to both parts
    {
        Data[n] = NewValue;                       // fill the slot we just reserved
        return;
    }
}

// reader (a single thread)
for (;;)
{
    unsigned long long int idx = currentSize;     // snapshot of both counters
    unsigned int n = idx & 0xffffffff;            // current element count
    if (n == 0)                                   // no elements
        return 0;
    int d = Data[n - 1];                          // the top element
    if (d && idx == atomicCAS(&currentSize, idx, idx + 0xffffffffULL)) // add 1 to high part, subtract 1 from low part
    {
        Data[n - 1] = 0;                          // mark the slot empty again
        return d;
    }
}
[/codebox]

I don’t have much experience with atomics on CUDA, so hopefully this code works; at least I can’t see any problems with it.

And I’d also use a singly linked list in this case, because it doesn’t have this two-step writer problem (you’d end up with slightly faster code), you don’t have the ‘out of slots’ situation, and you don’t need the special 0 value anymore.
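
A minimal sketch of that idea, assuming 64-bit device pointers, a single reader, and nodes pre-allocated somewhere in device memory (all names invented for illustration):

[codebox]
struct Node { int value; Node* next; };

__device__ Node* head = nullptr;   // top of the stack, null when empty

// push: any number of writers, lock-free CAS loop on the head pointer
__device__ void push(Node* n)
{
    unsigned long long assumed, old = (unsigned long long)head;
    do {
        assumed = old;
        n->next = (Node*)assumed;                    // link the new node to the current top
        __threadfence();                             // publish value/next before swinging head
        old = atomicCAS((unsigned long long*)&head,  // 64-bit CAS on the pointer itself
                        assumed, (unsigned long long)n);
    } while (old != assumed);
}

// pop: single reader only
__device__ Node* pop()
{
    unsigned long long assumed, old = (unsigned long long)head;
    do {
        if (old == 0) return nullptr;                // empty
        assumed = old;
        Node* top = (Node*)assumed;
        old = atomicCAS((unsigned long long*)&head,  // unlink the top node
                        assumed, (unsigned long long)top->next);
    } while (old != assumed);
    return (Node*)assumed;
}
[/codebox]

With only one popping thread, the classic ABA problem on pop can’t occur, which is why no version counter is needed in this variant.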