I want to make an atomic-access stack in global device memory. This is to coordinate between the many threads producing work items and the (lone) worker who grabs the items to process them. However, designing such a stack is far from trivial!
So I have a flat array that’s [N] long. I obviously can’t write past this limit of N… if the stack is full, it’s OK, I’ll handle the work a different way.
So how would you implement the stack?
The obvious answer is to use a population counter: atomicAdd it to reserve your data slot.
You also need to check whether the returned index is past the [N] size, and atomicSub it back down to repair your overflowed increment.
This works OK if there are no simultaneous reads. The code would look something like this:

unsigned int newIndex = atomicAdd(&currentSize, 1);
if (newIndex < max_n) {            // success!
    data[newIndex] = myData;       // safe to write into MY guaranteed slot
}
else {                             // oh no, output array is full
    atomicSub(&currentSize, 1);    // undo the extra increment I added
}
But if you’re reading at the same time… it’s just a mess, even with the assumption of only one reader but multiple writers.
It’s a big puzzle. It may be you need two pointers and to treat the array as a circular buffer: read off the “head” of the list and write only to the tail. But even then I keep expecting races and errors as writers add their increment, realize it’s too much, and have to repair themselves… there’s a race window between those two accesses.
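One standard way to dodge the reserve-then-repair race entirely is to reserve with a compare-and-swap loop instead of an unconditional increment: the counter is only bumped if there is actually room, so there is never an overshoot to undo. Here's a minimal sketch, modeled on the host with std::atomic so it's easy to test; the compare_exchange loop plays the role of a CUDA atomicCAS(&tail, t, t + 1) retry loop, and all names (tryPush, MAX_N, tail) are illustrative, not from the original post.

```cpp
#include <atomic>

// Illustrative host-side model of a bounded push; in device code,
// std::atomic<unsigned>::compare_exchange maps to atomicCAS on an
// unsigned int in global memory.
constexpr unsigned MAX_N = 4;
int data[MAX_N];
std::atomic<unsigned> tail{0};   // index of the next free slot

// Reserve a slot only if one exists. The CAS loop never pushes the
// counter past MAX_N, so there is no decrement-to-repair step and
// therefore no race window between reserving and repairing.
bool tryPush(int value) {
    unsigned t = tail.load();
    do {
        if (t >= MAX_N) return false;  // full: bail out, nothing to undo
    } while (!tail.compare_exchange_weak(t, t + 1));
    data[t] = value;                   // slot t is exclusively mine
    return true;
}
```

The trade-off is that under heavy contention the CAS can retry, but since this is for comparatively rare work-shuffling, that shouldn't matter.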
Maybe you need a third atomic variable to track size, and then only if you atomically pass the “size” test are you allowed to (atomically) increment the head pointer?
But then maybe you need a FOURTH variable to track the number of valid entries so that your reader doesn’t try to read the memory slot you’ve reserved but have not filled yet?
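Instead of a fourth counter, one common alternative is a per-slot "ready" flag: the writer fills the slot first and only then publishes the flag, so a reserved-but-unfilled slot is simply invisible to the reader. A hedged host-side sketch with std::atomic (all names illustrative); in CUDA the release store would correspond to a __threadfence() before writing the flag, and the acquire load to a __threadfence() after reading it. Note this sketch addresses only the reader-visibility problem; the full-array check still uses the add-then-repair pattern from above.

```cpp
#include <atomic>

// Illustrative host-side model; `ready` plays the role of per-slot
// commit flags in global device memory.
constexpr unsigned MAX_N = 4;
int data[MAX_N];
std::atomic<bool> ready[MAX_N];  // zero-initialized: nothing committed yet
std::atomic<unsigned> tail{0};

bool tryPush(int value) {
    unsigned t = tail.fetch_add(1);                   // reserve a slot
    if (t >= MAX_N) { tail.fetch_sub(1); return false; }
    data[t] = value;                                  // fill it first...
    ready[t].store(true, std::memory_order_release);  // ...then commit
    return true;
}

// Single reader: consume slot i only once its flag says it's committed.
bool tryPop(unsigned i, int* out) {
    if (i >= MAX_N || !ready[i].load(std::memory_order_acquire))
        return false;  // slot is reserved-but-unfilled, or was never used
    *out = data[i];
    return true;
}
```

The key point is that the reader never keys off the tail counter at all, so it can't observe a slot that was reserved but not yet written.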
I am not concerned about performance here; this is for shuffling (comparatively) rare accumulated work.