How to implement a stack inside a kernel?


I have a sort of specific task that requires a stack-like structure in memory.

My kernel has two inputs: an array of ‘instructions’ and an array of source data.
The instruction array is a list in reverse Polish notation that represents a mathematical expression; for brevity, let’s say the expression has a single variable X and a set of simple math operations (+ - * /) applied to X.

For example: X * X + (X / X)
The same in reverse polish notation: X X * X X / +

The second input - an array of source data - contains X values. The goal is to evaluate the mathematical expression for a huge number of X values as fast as possible. Each thread evaluates the same expression for a different X value (the index into the source data array is computed from the thread number; the expression is the same for all running threads).

Originally, the task was to evaluate the mathematical expression represented as a binary tree; however, that requires recursion. Reverse Polish notation (the tree transformed into a linear list) avoids recursive calls and other unpleasant things, but a stack is still required. When evaluating the sample expression (X X * X X / +), it is necessary to store two X values somewhere before the multiplication, and then to store the result of the multiplication as well.

I would appreciate any suggestions very much … I can still hardly imagine how to use shared memory (or some other approach?) for this purpose.

Thanks in advance.

Shared memory is probably your best bet:

You could access stack items like this:

__global__ void kernel(int max_stack_depth, ...)
{
    extern __shared__ float stack_sdata[];
    int stack_depth = 0;

    // push
    stack_sdata[threadIdx.x * max_stack_depth + stack_depth] = value;
    stack_depth++;

    // pop
    top = stack_sdata[threadIdx.x * max_stack_depth + stack_depth - 1];
    stack_depth--;
}

The function that calls the kernel will need to allocate the proper amount of shared memory: max_stack_depth * blockDim.x * sizeof(float) bytes. This will need to be less than 16 KiB, so that limits the maximum stack depth somewhat.
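Putting the fragments together, a sketch of a complete kernel (the names and the opcode encoding, with 0 meaning "push X", are made up for the example):

```cuda
// Sketch: RPN evaluation with a per-thread stack in shared memory.
__global__ void eval_kernel(const int *instr, int n_instr,
                            const float *x, float *out, int max_stack_depth)
{
    extern __shared__ float stack_sdata[];
    float *stack = stack_sdata + threadIdx.x * max_stack_depth;  // this thread's slice
    int depth = 0;

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float xi = x[tid];

    for (int i = 0; i < n_instr; i++) {
        int op = instr[i];
        if (op == 0) {
            stack[depth++] = xi;           // push X
        } else {
            float b = stack[--depth];      // pop right operand
            float a = stack[--depth];      // pop left operand
            float r;
            switch (op) {
            case '+': r = a + b; break;
            case '-': r = a - b; break;
            case '*': r = a * b; break;
            default:  r = a / b; break;
            }
            stack[depth++] = r;            // push result
        }
    }
    out[tid] = stack[0];
}

// Launch: the third <<<>>> argument is the dynamic shared memory size in bytes.
// eval_kernel<<<grid, block, block.x * max_stack_depth * sizeof(float)>>>
//     (d_instr, n_instr, d_x, d_out, max_stack_depth);
```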

Thank you! Yes, your answer is definitely valuable …

I have a couple of questions though:

  1. Is there a chance that the 16K limitation will be extended in the next generation of NVIDIA hardware that is about to be announced?

  2. Say 16K is critical and shared memory can’t be utilized … what is a possible alternative?

Thank you in advance.

NVIDIA doesn’t announce any hardware details before the big announcement, so your guess is as good as mine.

Local memory. If you instead declare the stack as

float stack_sdata[BLOCK_DIM * MAX_STACK_DEPTH];

where BLOCK_DIM and MAX_STACK_DEPTH are compile-time constants, it should still work. Performance will suffer compared to shared memory, but it won’t be terrible. Since all of your threads are working on the same expression, all threads within a warp will be at the same position in the stack. Hence, if MAX_STACK_DEPTH is a multiple of 32, you will get coalesced memory accesses and should attain ~70 GiB/s on an 8800 GTX.

If you need a variable block dimension / max stack depth, you could allocate the stack in global memory and index with blockIdx.x * blockDim.x + threadIdx.x instead of just threadIdx.x. However, with global memory I’m not certain whether a read after a write is guaranteed to return the written value; someone who knows more of the low-level details might be able to answer this.
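A sketch of the global-memory variant, assuming the host allocates the stack buffer up front (names are illustrative):

```cuda
// d_stack must be allocated on the host with
//   cudaMalloc(&d_stack, num_blocks * block_dim * max_stack_depth * sizeof(float));
__global__ void kernel(float *d_stack, int max_stack_depth /*, ... */)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float *stack = d_stack + tid * max_stack_depth;  // this thread's private slice
    int depth = 0;

    // push
    stack[depth++] = 1.0f;  // placeholder value

    // pop: a thread reading back its *own* earlier write to global memory
    // does see the written value; only writes from *other* threads have no
    // visibility guarantee without synchronization.
    float top = stack[--depth];
    (void)top;
}
```

Since each thread only ever touches its own slice, the cross-thread visibility question does not actually arise for this use case.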

Thank you very much!
Now I have a direction to work on.

Should that not be float stack_data[MAX_STACK_DEPTH]? Local memory is allocated per thread, and the compiler tries to optimize things for coalescing, I believe.

I also have never used local memory, but I remember something about the compiler optimizing access.

Oops, you are right: no BLOCK_DIM in the local-memory declaration, because it is allocated per thread.

I read that the compiler optimizes local-memory access for coalescing too, but it’s not obvious how it does so. Maybe that means MAX_STACK_DEPTH doesn’t need to be a multiple of 32.
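With that correction applied, a sketch of the per-thread local-memory version (MAX_STACK_DEPTH is an assumed compile-time constant; the compiler places the array in local memory when it cannot keep it in registers):

```cuda
#define MAX_STACK_DEPTH 32  // assumed compile-time constant

__global__ void kernel(/* ... */)
{
    // Per-thread array: no BLOCK_DIM factor and no threadIdx offset,
    // since local memory is private to each thread.
    float stack[MAX_STACK_DEPTH];
    int depth = 0;

    // push
    stack[depth++] = 1.0f;  // placeholder value

    // pop
    float top = stack[--depth];
    (void)top;
}
```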

My guess would be that it allocates a block of global memory ceil(num_threads_per_block / 32) * 32 * local_array_size elements big and lays it out like:

warp1-element0 (32 elements)

…

warpN-element0 (also 32 elements, even when the warp is not full, i.e. num_threads is not a multiple of 32)

That way you are always coalesced, I believe? And then it does not need to be a multiple of 32.