Map algorithms from CPU-GPU: recursive ans stack

Linh_Ha · November 18, 2007, 7:01pm

I have two questions about how to map some ideas from CPU to GPU

The fact that CUDA does not support recursive function limit the ability to use CUDA solve many problems, like that related to tree that the tree traversal normally perform in recursive manner. So i wonder what is the common strategy that people use to solve this problem.
One way to solve recursive is to use the stack. I don’t know what is the efficient way to build the stack in CUDA. Some one say i should use share memory, but since there’s no thread lock
in CUDA, how different threads can access and update stack concurrently.

Is there any reference about how to build in CUDA especially with shared memory, because global memory may be too slow with some thing we frequently access.

Any idea is appreciate, i’m a novice in this field and and want to improve my understanding about cuda

e.ping · November 18, 2007, 9:45pm

I strongly suggest you do some reading about data-parallel algorithms before you start trying to port CPU algorithms directly to the GPU. Many problems can be recast as parallel algorithms, but trying to just port across recursive algorithms on linked data structures may not work very well for you.

Google 'data-parallel algorithms". While it is a very old paper, the Hillis/Steele paper is quite good, and there are many other good references and tutorials out there. GPU Gems 2 and 3 also contain a number of useful papers.

Geoff.

asadafag · November 19, 2007, 3:08am

In our experience, when recursion is very thread-wise coherent and absolutely unavoidable, local memory makes a reasonably good stack. Shared memory is indeed faster, but it’s usually too small for a stack purpose.
By the way, why do you need accessing and updating a stack concurrently?

Linh_Ha · November 19, 2007, 5:25am

I think 16K shared memory is enough for most of simple stack purpose like ray tracing, or tree traversal. Normally i see the size of stack depend on the depth of the tree, and for balanced tree (like kd tree) we don’t use that much. I just don’t understand how to build the stack in the shared memory since i don’t find out the way that may thread can access and update stack concurrently.

The reason is some time we need all the thread has the ability to access and update the stack (like tree traversal).

I think one way is to use atomic function, however it is suitable only for Cuda 1.1, and global mem is not the best choice in term of speed.

asadafag · November 19, 2007, 5:56am

In my experience, thread usually doesn’t access each other’s stack, and there’s no need of concurrency. It should also be noted that the precious 16k is shared by all threads. Even just for 50% occupancy, each thread only get 42 bytes. That’s barely enough for a 10-leveled tree.

Linh_Ha · November 19, 2007, 10:05am

What happen if we need one stack or one stream of task for many thread

wumpus · November 19, 2007, 10:49am

Indeed, a per-warp or even per-block stack would be better, both to prevent divergence and to keep the shared memory from filling too much

MisterAnderson42 · November 19, 2007, 1:31pm

If you need a stack per block (or per group of threads), then just have one thread only update the stack:
if (threadIdx.x == 0)
// update stack
__synthreads();
continue processing, all threads can read the updated stack

I too cannot think of a use case where multiple threads would absolutely need to try and update the same stack in parallel. And if that is the case, the updates would have to be serialized in some fashion anyways and you’re non longer running in parallel.

Linh_Ha · November 19, 2007, 6:48pm

That is a good paper. I think the idea is pretty similar to “scan” technique, it work for the recursive algorithm with predetermined number of step. I wonder what is the answer in the case of “branch and bound” algorithm

hieronymus · December 3, 2007, 11:28am

I succesfully implemented recursion with a stack, works great

Linh_Ha · December 5, 2007, 8:42pm

Can you report your result. How much it is faster than serial version. what is your problem, is that branch and bound problem

hieronymus · December 6, 2007, 9:06am

I implemented a chess engine, which uses the Alpha/Beta recursive algorithm. It was quite hard to convert it to a stack mechanism. My biggest problem is the amount of “if” statements in the code, I try to minimize that.

But the engine works, I can examine 31.000 Nodes/s using 1 block running 64 threads on a single position. When I use more blocks the total Nodes/s goes to one million.

Albitex · August 11, 2008, 12:44am

Hi , I’d like what you help me, for CUDA and chess. I would like to increase the speed of to open source chess engine. How have you implemented CUDA on your engine?

Please, write me the code API . External Media

Romant · August 11, 2008, 6:06am

In fact, problem of stacks and recursions has been discussed many times.

As to the recursion - it can always be algorithmically eliminated. As to the stacks - it is not hard to implement a stack using shared or local memory when all threads should have their own stacks (I can hardly imagine the task when all threads should modify the same stack concurrently).

Search this forum by ‘stack’ keyword - you should find my questions about stacks with more than informative answers of MisterAnderson42 and other guys.

I can say a couple of words on my task, if you are interested. In my task I evaluate expressions that are previously converted from tree form to reverse polish form. RPN expressions evaluation is what I’ve actually done with CUDA using stacks in shared memory.

In fact, I can’t say that CUDA gives outstanding results (not x100 or x200 speed up, as it can be seen from prospects) on this task. For simple and not too long trees my implementation runs as fast as my 3 years old Athlon X2 4800+ (both cores are used) on 8500 GT (not too modern and fast GPU that I use for expreiments). However, as long as expressions become longer and more complex and when the number of test cases the expressions should be evaluated on grows up GPU shows much lower evaluation time increase and becomes 2-3 times faster than 4800+ with both cores loaded (and I believe that really complex expressions may show even more significant GPU advantage). I’m not sure whether GPU performance for this sort of task scales linearly for latest GPU’s or not.

aakova · August 11, 2008, 10:38pm

You are suggesting:

   if (ThreadIdx.x == 0) {

        // update stack

    }

    __syncthreads();

not:

   if (ThreadIdx.x == 0) {

        // update stack

        __syncthreads();

    }

correct ?

E.D_Riedijk · August 12, 2008, 5:24am

The second piece of code will deadlock. Only thread 0 can reach the syncthreads