Alternative for recursion?

What would be the best alternative to implement “recursion” since this is not supported on the GPU?

What I want to do (backtracking) on the GPU normally uses a lot of recursive routines.

Any hint?

You can implement the same functionality with a while loop and a stack. Indeed, you can always transform recursive calls into iterative loops.

How would you implement a stack on a gpu? It is the memory addressing I am struggling with. Writing to fixed addresses and then a sort and/or scan (to calculate the address) doesn’t seem that efficient, followed by a write. Any suggestions?

There are a couple of practical issues to using a stack-based recursion on the GPU. In CUDA these stack arrays are allocated in local memory, which has terrible latency. One can work around it by allocating a block of shared memory and divvying it up, one substack per thread, though it would be more logical for the local memory to be small but fast, serving as an L1 cache. In some cases, e.g. sorting where the stack depth has a provable finite limit, say 32 elements or less, one can use macros/templates to generate a sequence of method/subroutine names, e.g. sort calls sort1, which calls sort2, which calls sort3, etc. I’ve not tried this, although I have written a paper on a fully vectorized quicksort, but it at least offers the possibility that registers rather than local memory will be used to “stack” key variables.

Care to share a reference? Although when you implement thread specific stacks, you no longer benefit from load balancing (but you do get around the memory addressing issue and the problem might have been balanced to begin with), is that right?

Thanks