Three options really:
- If all of the threads in a CTA always push-pop the same data on the stack then you can maintain one stack per CTA and just have one thread read/write to it.
If this is not the case for your application, take a moment to think about re-writing your algorithm or picking a different algorithm, the next two will be complex to implement and slow.
For both of these you should create some struct for each stack entry. It might be helpful to create it in C++ and implement device push and pop methods or in C with modifier functions. Also, to avoid pointer aliasing problems, you probably want to use a base pointer to the stack and then an index to represent the stack pointer for each thread.
Global/Local Memory Stack : Determine the max stack size that you need per thread (multiply this by the number of threads per CTA if you are using global memory), allocate that much memory. If you don’t know the max size, modify your algorithm such that the max size is bounded (for example, for non-recursive versions of quicksort, you can bound the stack size to logN by only pushing the smaller set onto the stack). Initialize the stack pointer for each thread’s stack. Push the first entry onto the stack. Start the kernel, pushing and poping the stack, loop until the stack is empty for each thread.
Shared memory: do the same thing as 2, but maintain the stack in shared memory, this is tricky since there is a limited amount of shared memory.
I would strongly recommend against this method for sm_13 gpus as stack to out of global or local memory will likely not be coalesced and there may be a large amount of divergence among threads.