How to implement a stack inside a kernel?

Hello,

I have a rather specific task that requires a stack-like structure in memory.

My kernel has two inputs: an array of ‘instructions’ and an array of source data.
The array of instructions is a list in reverse Polish notation that represents a mathematical expression; for the sake of brevity, let's say the expression has a single variable X and a set of simple math operations (+ - * /) applied to X.

For example: X * X + (X / X)
The same in reverse Polish notation: X X * X X / +

The second input, an array of source data, contains the X values. The goal is to evaluate the expression for a huge number of X values as fast as possible. Each thread should evaluate the same expression for a different X value (the index into the source data array is computed from the thread number; the expression is the same for all running threads).

Originally, the task was to parse the mathematical expression represented as a binary tree; however, this requires recursion. Reverse Polish notation (the tree is transformed into a linear list) makes it possible to avoid recursive calls and other unpleasant things, but a stack is still required. When evaluating the sample expression (X X * X X / +) it is necessary to store two X values somewhere before the multiplication, and then store the result of the multiplication as well.
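On the CPU the evaluation itself is straightforward, roughly something like this (the OP_* codes and MAX_DEPTH are just placeholders for my instruction encoding):

float stack[MAX_DEPTH];
int depth = 0;
for (int i = 0; i < num_instructions; i++) {
    switch (instructions[i]) {
    case OP_X:   stack[depth++] = x; break;                   // push the variable
    case OP_ADD: depth--; stack[depth - 1] += stack[depth]; break;
    case OP_SUB: depth--; stack[depth - 1] -= stack[depth]; break;
    case OP_MUL: depth--; stack[depth - 1] *= stack[depth]; break;
    case OP_DIV: depth--; stack[depth - 1] /= stack[depth]; break;
    }
}
float result = stack[0];   // final value of the expression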

I would appreciate any suggestions very much… I still can hardly imagine how to use shared memory (or some other approach?) for this purpose.

Thanks in advance.

Shared memory is probably your best bet:

You could access stack items like this:

__global__ void kernel(int max_stack_depth, ...)
{
    // One stack per thread, carved out of dynamically allocated shared memory.
    extern __shared__ float stack_sdata[];
    int stack_depth = 0;

    // push
    stack_sdata[threadIdx.x * max_stack_depth + stack_depth] = value;
    stack_depth++;

    // pop
    float top = stack_sdata[threadIdx.x * max_stack_depth + stack_depth - 1];
    stack_depth--;
}

The function that calls the kernel will need to allocate the proper amount of shared memory: max_stack_depth * blockDim.x * sizeof(float) bytes. This will need to be less than 16 KiB, so that limits the maximum stack depth somewhat.
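For example, with the kernel signature above the launch would be something along these lines (the block size here is just an example):

int block_dim = 256;                        // example block size
size_t smem_bytes = max_stack_depth * block_dim * sizeof(float);
kernel<<<num_blocks, block_dim, smem_bytes>>>(max_stack_depth, ...);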

Thank you! Yes, your answer is definitely valuable …

I have a couple of questions though:

  1. Is there a chance that the 16 KiB limitation will be extended in the next generation of NVIDIA hardware that is about to be announced?

  2. Say the 16 KiB limit is critical and shared memory can't be utilized… what is a possible alternative?

Thank you in advance.

NVIDIA doesn’t announce any hardware details before the big announcement, so your guess is as good as mine.

Local memory. If you instead declare the stack as

float stack_sdata[BLOCK_DIM * MAX_STACK_DEPTH];

where BLOCK_DIM and MAX_STACK_DEPTH are compile-time constants, it should still work. Performance will suffer compared to shared memory, but it won't be terrible. Since all of your threads are working with the same expression, all threads within a warp will be at the same position in the stack. Hence, if MAX_STACK_DEPTH is a multiple of 32 you will get coalesced memory accesses and should attain ~70 GiB/s on an 8800 GTX.

If you need a variable block dimension / max stack depth you could allocate the stack in global memory and index with blockIdx.x * blockDim.x + threadIdx.x instead of just threadIdx.x. However, with global memory I'm not certain if a read after a write is guaranteed to read the written value: someone who knows more of the low-level details might be able to answer this.
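A rough, untested sketch of that global-memory variant, assuming the stack buffer is allocated with cudaMalloc and passed in as an argument:

__global__ void kernel(float *stack_gdata, int max_stack_depth, ...)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int stack_depth = 0;

    // push
    stack_gdata[tid * max_stack_depth + stack_depth] = value;
    stack_depth++;

    // pop
    float top = stack_gdata[tid * max_stack_depth + stack_depth - 1];
    stack_depth--;
}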

Thank you very much!
Now I have a direction to work on.

Should that not be float stack_data[MAX_STACK_DEPTH]? Local memory is allocated per thread & the compiler tries to optimize things for coalescing I believe.

I also have never used local memory, but I remember something about the compiler optimizing access.

Oops. You are right, no BLOCK_DIM in the local memory case because it is allocated per thread.
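So the per-thread version would just be something like:

float stack_data[MAX_STACK_DEPTH];   // local memory, one stack per thread
int stack_depth = 0;

// push
stack_data[stack_depth] = value;
stack_depth++;

// pop
float top = stack_data[stack_depth - 1];
stack_depth--;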

I read that the compiler optimizes for coalesced local memory too, but it’s not obvious what it would do. Maybe that means MAX_STACK_DEPTH doesn’t need to be a multiple of 32.

My guess would be that it allocates a big chunk of global memory, ceil(num_threads_per_block / 32) * 32 * local_array_size elements in size, and accesses it like:

warp1-element0 (32 elements)
warp2-element0
…
warpN-element0 (also 32 elements, even when the warp is not full, i.e. num_threads not a multiple of 32)

warp1-element1
warp2-element1
…
warpN-element1

…

warp1-elementM
warp2-elementM
…
warpN-elementM

That way you are always coalesced, I believe? And then MAX_STACK_DEPTH does not need to be a multiple of 32.
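In other words, my guess at the addressing for a thread's stack element e would be something along these lines (again, just a guess at what the compiler generates):

int warp_id   = threadIdx.x / 32;
int lane_id   = threadIdx.x % 32;
int num_warps = (blockDim.x + 31) / 32;
// element e of this thread's local stack:
int addr = (e * num_warps + warp_id) * 32 + lane_id;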