Implementing a stack in CUDA

Hi all,

I am new to CUDA. I have to implement a recursive algorithm in CUDA. I know that the CUDA platform doesn’t provide recursion, and one way around that is to use a stack. Can anyone provide insight into how to use a stack for recursion? The stack has to be accessed by many threads.
Any suggestion would be really helpful.

Thanks in advance

Some more details of what you need to do would be helpful. Must the stack be global (accessed by all blocks) or local to the threads in a block? Assuming it is only local and that you have a limit on the maximum size of the stack, you could allocate an array of stack elements of that maximum size in shared memory. Then use a counter to hold the number of elements on the stack, with atomicInc and atomicDec implementing a push and pop that can be run concurrently by multiple threads in the block.

Three options really:

  1. If all of the threads in a CTA always push-pop the same data on the stack then you can maintain one stack per CTA and just have one thread read/write to it.

If this is not the case for your application, take a moment to think about re-writing your algorithm or picking a different one; the next two options are complex to implement and slow.

For both of these you should create a struct for each stack entry. It might be helpful to define it in C++ with device push and pop methods, or in C with modifier functions. Also, to avoid pointer-aliasing problems, you probably want to use a base pointer to the stack plus a per-thread index as the stack pointer.

  2. Global/local memory stack: Determine the max stack size that you need per thread (multiplied by the number of threads per CTA if you are using global memory) and allocate that much memory. If you don’t know the max size, modify your algorithm so that the max size is bounded (for example, non-recursive versions of quicksort can bound the stack size to logN by always processing the smaller partition first and pushing only the larger one). Initialize the stack pointer for each thread’s stack and push the first entry onto the stack. Then start the kernel, pushing and popping the stack in a loop until the stack is empty for each thread.

  3. Shared-memory stack: Do the same thing as option 2, but keep the stack in shared memory. This is tricky since there is a limited amount of shared memory.

I would strongly recommend against this method on sm_13 GPUs, as stack accesses to global or local memory will likely not be coalesced, and there may be a large amount of divergence among threads.

Local memory accesses are coalesced (they made it so in some version of CUDA); the compiler takes care of the hard work. So a local memory stack is the best choice for performance.

On the other hand, I think changing the algorithm is the best option. If you are trying to solve some brute-force problem, think about changing the algorithm…

I don’t think this is true in all cases. Consider a max stack size of N entries with M threads. Assume that each entry is one word (say 32 bits) and that local memory is laid out such that entry i for thread j is placed adjacent to entry i for thread j+1 (this is the best the compiler can do, assuming all threads access the entry with the same index). If all threads have a different stack pointer, and the stack pointers are uniformly distributed, then the probability of a perfectly coalesced access is (1/N)^(M-1). For a 10-entry stack and 32 threads, that is about one chance in 10^31. The uniform distribution is probably unrealistic (it is the worst case), but it really depends on your algorithm. I would not expect the access patterns of all threads to be perfectly correlated…

Yeah… Good point…

As long as there is no warp divergence, coalescing is still maintained, with the compiler doing the extra work. And with compute 1.3, it gets still better…
but then, assuming no warp divergence is too big an assumption…

Thanks for bringing up this point

On the other hand, if you’re implementing a call stack, you could act as if disabled threads push empty elements and keep the stack synchronized across the warp (so you need only one stack pointer per warp).

The performance and memory footprint won’t be worse than what the mask stack already requires anyway.

In pseudo-CUDA:

__shared__ int stackIndex[BLOCK_SIZE_Y];

__device__ int stack[STACK_SIZE * BLOCK_SIZE_Y * WARP_SIZE];

__device__ void Push(int v)
{
	// blockDim.x == WARP_SIZE, so threadIdx.y selects the warp
	stack[stackIndex[threadIdx.y] + threadIdx.x] = v;  // Broadcast from shared + coalesced global write
	stackIndex[threadIdx.y] += WARP_SIZE;  // All threads in a warp write the same value
}

__device__ int Pop()
{
	stackIndex[threadIdx.y] -= WARP_SIZE;
	return stack[stackIndex[threadIdx.y] + threadIdx.x];  // Broadcast + coalesced global read
}
Nvidia really needs to clarify their “local memory is always coalesced” statement. I’m still confused about it, and judging by the number of times I’ve seen it debated on the forums, I think others are too.