How to implement a stack

I have a stack in shared memory. I want the threads in a thread block to each pull an item off the stack and process it. Processing an item requires __syncthreads(). I have something like:

while(StackCount > 0)
{

if(threadId < StackCount)
{
    data = Stack[StackCount-1-threadId];
    atomicDec(&StackCount);

    // I think I need a __syncthreads() here, but it is not supposed to go inside an if.

    // process can push onto the stack, also writes to global memory, and these writes
    // need to be synced as well.
    Process(&data, &Stack, &StackCount, ...);
}

}

I am getting hangs, which I believe are caused by calling __syncthreads() inside a conditional statement.

What is the CUDA pattern for this?

As a side note: most of the threads in a warp will diverge, since each one processes a different data item. Would it be better to use just one thread per warp? I know that is super wasteful, but it might still be better than downloading to the CPU and then uploading again for more processing.

Here’s the trivial solution with next to no overhead.

while(StackCount > 0) // assuming all threads participate here
{

    bool cond = (threadId < StackCount);

    if(cond)
    {
        data = Stack[StackCount-1-threadId];
        atomicDec(&StackCount);
    }

    __syncthreads();

    if(cond)
    {
        // process can push onto the stack, also writes to global memory, and these writes
        // need to be synced as well.
        Process(&data, &Stack, &StackCount, ...);
    }

}
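One thing to watch in the loop above: StackCount is read in the while test, in cond, and in the index expression while other threads are decrementing it, so threads can disagree about its value. Below is a minimal sketch of one possible fix, assuming every thread of the block takes part in every round. The count is snapshotted behind a barrier and a single thread subtracts the number of popped items. The names popRound, drainKernel, runDrainDemo and STACK_CAP are hypothetical, and the "processing" is just a sum so the result can be checked.

```cuda
#include <cuda_runtime.h>

#define STACK_CAP 256  // hypothetical capacity, not from the thread

// One pop round. Every thread of the block must call this, because it contains
// barriers. Returns the count all threads agreed on; thread t received an item
// in *out only if t < that count.
__device__ unsigned int popRound(const int* stack, unsigned int* count, int* out)
{
    __syncthreads();                          // pushes from the previous round are visible
    unsigned int n = *count;                  // every thread reads the same value
    if (threadIdx.x < n)
        *out = stack[n - 1 - threadIdx.x];    // top of the stack goes to thread 0
    __syncthreads();                          // all reads are done before the count changes
    if (threadIdx.x == 0)
        *count = n - (n < blockDim.x ? n : blockDim.x);  // single writer, no atomics
    __syncthreads();                          // new count is visible before anyone pushes
    return n;
}

__global__ void drainKernel(const int* input, int numItems, int* sumOut)
{
    __shared__ int stack[STACK_CAP];
    __shared__ unsigned int count;

    for (int i = threadIdx.x; i < numItems; i += blockDim.x)
        stack[i] = input[i];
    if (threadIdx.x == 0)
        count = numItems;                     // popRound starts with a barrier

    int local = 0;
    while (true) {
        int item = 0;
        unsigned int n = popRound(stack, &count, &item);
        if (n == 0) break;                    // uniform exit: all threads saw the same n
        if (threadIdx.x < n) local += item;   // stand-in for Process()
    }
    atomicAdd(sumOut, local);
}

// Host helper: pushes 1..100, drains with 32 threads, returns the sum of popped items.
int runDrainDemo()
{
    const int numItems = 100;
    int h[numItems], hSum = 0;
    for (int i = 0; i < numItems; ++i) h[i] = i + 1;

    int *dIn, *dSum;
    cudaMalloc(&dIn, sizeof(h));
    cudaMalloc(&dSum, sizeof(int));
    cudaMemcpy(dIn, h, sizeof(h), cudaMemcpyHostToDevice);
    cudaMemset(dSum, 0, sizeof(int));
    drainKernel<<<1, 32>>>(dIn, numItems, dSum);
    cudaMemcpy(&hSum, dSum, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dSum);
    return hSum;  // 5050 if every item was popped exactly once
}
```

Calling runDrainDemo() from main should give 5050. The third barrier only matters once the processing step pushes new items, but leaving it out is an easy way to lose an update to the count.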

Yes, that part is pretty easy to work around, but what about:

if(cond)
{
    // process can push onto the stack, also writes to global memory, and these writes
    // need to be synced as well.
    Process(&data, &Stack, &StackCount, ...);
}

I could probably play the same game inside Process, but Process itself is inside if(cond). Maybe I’ll pass cond into Process and see if I can use a similar strategy.

So Process() can push work onto the stack if some condition is met. And we also only want the threads that actually hold an item to do the processing, i.e., if(cond).

Process has logic like:

while(condA)
{
   ...
   if(condB)
   {
       int oldIndex = atomicInc(stackCount);
       stack[oldIndex] = newData;
   }

   if(condC)
   {
       int oldIndex = atomicInc(stackCount);
       stack[oldIndex] = newData;
   }
}

and this is still wrapped inside if(cond) {…}.
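The pushes above rely on an atomic increment that hands back the old count as the slot index. In CUDA, atomicInc takes a second argument (a wrap-around limit), so a plain counter is more commonly bumped with atomicAdd, which also returns the old value. Here is a small sketch of such a push, assuming a fixed capacity. pushItem, pushDemo, runPushDemo and STACK_CAP are made-up names for illustration.

```cuda
#include <cuda_runtime.h>

#define STACK_CAP 256  // hypothetical capacity, not from the thread

// Reserve a slot with atomicAdd (it returns the value before the add), then fill it.
__device__ bool pushItem(int* stack, unsigned int* count, int value)
{
    unsigned int slot = atomicAdd(count, 1u);
    if (slot >= STACK_CAP) {         // full: undo the reservation and report failure
        atomicSub(count, 1u);
        return false;
    }
    stack[slot] = value;             // other threads may rely on it only after a barrier
    return true;
}

__global__ void pushDemo(int* countOut, int* sumOut)
{
    __shared__ int stack[STACK_CAP];
    __shared__ unsigned int count;
    if (threadIdx.x == 0) count = 0;
    __syncthreads();

    if (threadIdx.x % 2 == 0)        // stand-in for condB: only some threads push
        pushItem(stack, &count, (int)threadIdx.x);
    __syncthreads();                 // outside the branch, so every thread reaches it

    if (threadIdx.x == 0) {          // check what ended up on the stack
        int sum = 0;
        for (unsigned int i = 0; i < count; ++i) sum += stack[i];
        *countOut = (int)count;
        *sumOut = sum;
    }
}

// Host helper: true if 64 threads pushed exactly the 32 even thread ids (sum 992).
bool runPushDemo()
{
    int *d, h[2] = {0, 0};
    cudaMalloc(&d, 2 * sizeof(int));
    pushDemo<<<1, 64>>>(d, d + 1);
    cudaMemcpy(h, d, 2 * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d);
    return h[0] == 32 && h[1] == 992;
}
```

The push itself needs no barrier. What needs one is the point where other threads start reading the new entries, and that barrier can sit after the branch.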

Are you sure you don’t need a __syncthreads() either inside Process() or after calling it? Some threads might already have left Process and reentered the outer loop, reevaluating bool cond = (threadId < StackCount); while other threads are still pushing.

Christian

Yes, Process() does need __syncthreads() internally, which is what I was trying to say. I think I need a __syncthreads() whenever I push onto the stack and whenever I write to global memory (which Process does), once per loop iteration. But the pushes sit inside branches, and Process itself is wrapped in a conditional, so I cannot easily use __syncthreads(). Is the only option to restructure all the code so that every __syncthreads() ends up outside the conditional statements?

Is there a way to just mask off threads? That is, the threads that aren’t processing a queue item would simply keep pace with the others and do nothing, so that __syncthreads() still works.
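One way to get that masking effect, sketched here under some assumptions: call Process() from every thread and pass it a flag saying whether the thread holds an item, then use __syncthreads_or() as the loop test. It is a barrier that returns nonzero to all threads if any thread passed a nonzero predicate, so idle threads keep reaching the barrier until the busiest thread is done. The names processMasked, drainMasked, runMaskedDemo and STACK_CAP are hypothetical, and the work rule (an item v takes v iterations and pushes r/2 whenever the remaining count r is even) is only a stand-in for condA and condB.

```cuda
#include <cuda_runtime.h>

#define STACK_CAP 256  // hypothetical capacity; this demo never gets close to it

// Stand-in for Process(). Every thread of the block calls it; `active` says
// whether this thread actually holds an item.
__device__ void processMasked(bool active, int item, int* stack,
                              unsigned int* count, int* visited)
{
    int remaining = active ? item : 0;      // idle threads start with the condition false
    // Barrier plus block-wide OR: all threads leave the loop in the same iteration.
    while (__syncthreads_or(remaining > 0)) {
        if (remaining > 0) {                // divergence stays between barriers
            atomicAdd(visited, 1);          // stand-in for the global-memory write
            if (remaining % 2 == 0) {       // stand-in for condB
                unsigned int slot = atomicAdd(count, 1u);
                if (slot < STACK_CAP) stack[slot] = remaining / 2;
            }
            --remaining;
        }
    }
}

__global__ void drainMasked(const int* input, int numItems, int* visited)
{
    __shared__ int stack[STACK_CAP];
    __shared__ unsigned int count;

    for (int i = threadIdx.x; i < numItems; i += blockDim.x)
        stack[i] = input[i];
    if (threadIdx.x == 0) count = numItems;

    while (true) {
        __syncthreads();                    // pushes and count are visible
        unsigned int n = count;             // same snapshot in every thread
        bool active = threadIdx.x < n;
        int item = active ? stack[n - 1 - threadIdx.x] : 0;
        __syncthreads();                    // all reads done before the count changes
        if (n == 0) break;                  // uniform exit
        if (threadIdx.x == 0)
            count = n - (n < blockDim.x ? n : blockDim.x);
        // The first __syncthreads_or() inside processMasked is a barrier, so the
        // new count is visible before any thread pushes.
        processMasked(active, item, stack, &count, visited);
    }
}

// Host helper: starts with items 1..8 and returns the number of loop iterations done.
int runMaskedDemo()
{
    int h[8] = {1, 2, 3, 4, 5, 6, 7, 8}, hVisited = 0;
    int *dIn, *dVisited;
    cudaMalloc(&dIn, sizeof(h));
    cudaMalloc(&dVisited, sizeof(int));
    cudaMemcpy(dIn, h, sizeof(h), cudaMemcpyHostToDevice);
    cudaMemset(dVisited, 0, sizeof(int));
    drainMasked<<<1, 32>>>(dIn, 8, dVisited);
    cudaMemcpy(&hVisited, dVisited, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dVisited);
    return hVisited;  // 78 for this work rule
}
```

With this work rule the total comes to 78 iterations over four rounds. The cost of the approach is that every thread spins through as many inner iterations as the slowest item needs, so it pays off mainly when trip counts within a block are similar.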