Prefix-form trees evaluation using CUDA

Romant · February 12, 2008, 8:23pm

Hello!

I have a task that is trivial for ‘normal’ CPU and not so trivial for GPU as there is no stack and no recursion possible.

The task is to evaluate an expression that is kept as a tree in prefix notation.

For example, the x + 2 * x expression will look like this: +(x *(2 x)).
In the tree form this expression may be visualized in the following manner (dots are used as substitutes for spaces as all extra spaces are cutted away from the message):

…+
…/…
x…*
…/…
…2…x

I have a HUGE number of such trees, they all are independent and can be evaluated in parallel. On CPU it is a simple recursion task, not sure if it is possible with CUDA …

I’d really appreciate any suggestions and comments, thanks in advance.

wildcat4096 · February 12, 2008, 8:50pm

Does the topology of the tree change or simply the values of the constants and variables referenced by the trees? If the topologies change, is there a only small number of topologies that will ever be observed?

Romant · February 12, 2008, 9:00pm

Each tree may have any topology, the only limitation is a tree depth (say, 20). It is not uncommon when all the trees a totally different in topology.

However, they all work with the same set of constants, functions and variables. The main goal of all these trees is to combine available stuff (constants, functions and variables) in the ‘best’ manner, the result of the tree evaluation (it is a double precision number) determines how good the tree is.

wildcat4096 · February 12, 2008, 9:18pm

This is just my opinion but it seems you have a very challenging problem. CUDA work best for data parallel applications where (roughly) the same set of instructions is applied to many points. By having diversity in you tree topologies I don’t see how you will be able to satisfy this requirement.

I would be inclined to look at OpenMP. Your problem would be cool to try to implement on the Cell BE processor as well.

Hopefully someone else know some way to map your problem to CUDA.

BTW: You have me intrigued. Can you say what problem you are solving?

Romant · February 12, 2008, 9:32pm

This is a part of problem solver based on genetic programming engine that uses tree-based genome. CUDA-based trees evaluation is an attempt to speed things up.

As to the CELL - the more cores we have - the better for this task, I understand that … However, I hope to clarify if there is a good sense in trying to use Tesla or not.

wildcat4096 · February 12, 2008, 10:38pm

There certainly are examples available where people have implemented GPs on GPUs

[url=“http://www.cs.mun.ca/~banzhaf/papers/eurogp07.pdf”]http://www.cs.mun.ca/~banzhaf/papers/eurogp07.pdf[/url]
[url=“http://cswww.essex.ac.uk/technical-reports/2007/csm_470.pdf”]http://cswww.essex.ac.uk/technical-reports/2007/csm_470.pdf[/url]

I suspect you will find that you must have significant workload per tree to see much benefit. I.e. you may need to have a large training/testing data set which is evaluated by each tree. Alternatively, if you may be able to collapse trees with identical topologies by parameterizing their constants. Testing various constants in parallel along with a large dataset would help to turn the problem into a more data parallel one.

Romant · February 13, 2008, 9:06am

Many thanks for the link! Yes, trully interesting … my case is definitely large data sets related.

wumpus · February 13, 2008, 9:14am

I’d set it up like this:

every warp (tid/32 threads) evaluates one tree. blocks consist of blockDim/32 warps which can all process their own tree.
each thread (tid%32) evaluates that tree on different dataset

So if you can evaluate on at least 32 datasets at once, you can circumvent the entire issue of divergence.

Romant · February 13, 2008, 10:00am

I understand the point, but this are my typical conditions:

What if I normally have about 10 large datasets each tree must evaluate ? In general, I’d say that I have about 50000 trees and only about 10-15 datasets each tree must be evaluated on.
Another scenario is also possible - 50000 trees and 5000 datasets, but this task is not what I’m currently working on.

In general, my question about tree evaluation is mostly tactical - not strategical. I don’t have much experience working with CUDA and first I’d simply like to find out of it is possible to implement the stack inside the kernel code and evaluate a single tree as I can hardly imagine how to evaluate the tree without a stack …

wumpus · February 13, 2008, 12:08pm

Well in the case of 10-15 datasets you can still use my method, you will have a little bit of divergence but still not as much as you’d have if every thread has a different tree.

You can put a stack in shared or local memory. The first is fastest. Also, you don’t neccesarily need a stack for tree evaluation if you can serialize the operations somehow.

shiftreduce · July 22, 2009, 1:51pm

Sorry to bump an old thread, but I’m working on something very similar where I need to perform trees evaluation too.

Did you manage to accomplish it?

Topic		Replies	Views
decision tree classifier in CUDA.. some doubts CUDA Programming and Performance	8	2614	April 6, 2024
Map algorithms from CPU-GPU: recursive ans stack CUDA Programming and Performance	15	14355	August 12, 2008
smart ideas for an interesting problem CUDA Programming and Performance	21	9564	December 10, 2008
Optimizing kind of a binary tree please advise on speeding up CUDA Programming and Performance	9	5101	September 30, 2011
Dynamic functions; Expression tree or other alternatives? CUDA Programming and Performance	1	1267	February 22, 2009
What kind of data structures do you use? CUDA Programming and Performance	14	9121	August 4, 2009
Optimize - Many small operations (CPU is faster for now?) CUDA Programming and Performance	2	526	July 11, 2019
Some questions about CUDA and my problem CUDA Programming and Performance	3	2359	June 18, 2008
A (not so) hypothetical question CUDA Programming and Performance	6	1650	March 24, 2009
Can CUDA do permutations CUDA Programming and Performance	11	12648	March 30, 2012

Prefix-form trees evaluation using CUDA

Related topics