Prefix-form trees evaluation using CUDA

Hello!

I have a task that is trivial for ‘normal’ CPU and not so trivial for GPU as there is no stack and no recursion possible.

The task is to evaluate an expression that is kept as a tree in prefix notation.

For example, the x + 2 * x expression will look like this: +(x *(2 x)).
In the tree form this expression may be visualized in the following manner (dots are used as substitutes for spaces as all extra spaces are cutted away from the message):

…+
…/…
x…*
…/…
…2…x

I have a HUGE number of such trees, they all are independent and can be evaluated in parallel. On CPU it is a simple recursion task, not sure if it is possible with CUDA …

I’d really appreciate any suggestions and comments, thanks in advance.

Does the topology of the tree change or simply the values of the constants and variables referenced by the trees? If the topologies change, is there a only small number of topologies that will ever be observed?

Each tree may have any topology, the only limitation is a tree depth (say, 20). It is not uncommon when all the trees a totally different in topology.

However, they all work with the same set of constants, functions and variables. The main goal of all these trees is to combine available stuff (constants, functions and variables) in the ‘best’ manner, the result of the tree evaluation (it is a double precision number) determines how good the tree is.

This is just my opinion but it seems you have a very challenging problem. CUDA work best for data parallel applications where (roughly) the same set of instructions is applied to many points. By having diversity in you tree topologies I don’t see how you will be able to satisfy this requirement.

I would be inclined to look at OpenMP. Your problem would be cool to try to implement on the Cell BE processor as well.

Hopefully someone else know some way to map your problem to CUDA.

BTW: You have me intrigued. Can you say what problem you are solving?

This is a part of problem solver based on genetic programming engine that uses tree-based genome. CUDA-based trees evaluation is an attempt to speed things up.

As to the CELL - the more cores we have - the better for this task, I understand that … However, I hope to clarify if there is a good sense in trying to use Tesla or not.

There certainly are examples available where people have implemented GPs on GPUs

http://www.cs.mun.ca/~banzhaf/papers/eurogp07.pdf
http://cswww.essex.ac.uk/technical-reports/2007/csm_470.pdf

I suspect you will find that you must have significant workload per tree to see much benefit. I.e. you may need to have a large training/testing data set which is evaluated by each tree. Alternatively, if you may be able to collapse trees with identical topologies by parameterizing their constants. Testing various constants in parallel along with a large dataset would help to turn the problem into a more data parallel one.

Many thanks for the link! Yes, trully interesting … my case is definitely large data sets related.

I’d set it up like this:

  • every warp (tid/32 threads) evaluates one tree. blocks consist of blockDim/32 warps which can all process their own tree.
  • each thread (tid%32) evaluates that tree on different dataset

So if you can evaluate on at least 32 datasets at once, you can circumvent the entire issue of divergence.

I understand the point, but this are my typical conditions:

  1. What if I normally have about 10 large datasets each tree must evaluate ? In general, I’d say that I have about 50000 trees and only about 10-15 datasets each tree must be evaluated on.

  2. Another scenario is also possible - 50000 trees and 5000 datasets, but this task is not what I’m currently working on.

In general, my question about tree evaluation is mostly tactical - not strategical. I don’t have much experience working with CUDA and first I’d simply like to find out of it is possible to implement the stack inside the kernel code and evaluate a single tree as I can hardly imagine how to evaluate the tree without a stack …

Well in the case of 10-15 datasets you can still use my method, you will have a little bit of divergence but still not as much as you’d have if every thread has a different tree.

You can put a stack in shared or local memory. The first is fastest. Also, you don’t neccesarily need a stack for tree evaluation if you can serialize the operations somehow.

Sorry to bump an old thread, but I’m working on something very similar where I need to perform trees evaluation too.

Did you manage to accomplish it?