I have a task that is trivial for a ‘normal’ CPU but not so trivial for a GPU, as there is no stack and no recursion available there.
The task is to evaluate an expression that is kept as a tree in prefix notation.
For example, the expression x + 2 * x will look like this: +(x *(2 x)).
In tree form this expression may be visualized in the following manner (dots are used as substitutes for spaces, as all extra spaces are stripped from the message):

....+
.../.\
..x...*
...../.\
....2...x
I have a HUGE number of such trees; they are all independent and can be evaluated in parallel. On a CPU this is a simple recursion task, but I’m not sure whether it is possible with CUDA …
I’d really appreciate any suggestions and comments, thanks in advance.
Does the topology of the tree change, or simply the values of the constants and variables referenced by the trees? If the topologies change, is there only a small number of topologies that will ever be observed?
Each tree may have any topology; the only limitation is the tree depth (say, 20). It is not uncommon for all the trees to be totally different in topology.
However, they all work with the same set of constants, functions, and variables. The main goal of all these trees is to combine the available building blocks (constants, functions, and variables) in the ‘best’ manner; the result of the tree evaluation (a double-precision number) determines how good the tree is.
This is just my opinion, but it seems you have a very challenging problem. CUDA works best for data-parallel applications where (roughly) the same set of instructions is applied to many data points. With the diversity in your tree topologies, I don’t see how you will be able to satisfy this requirement.
I would be inclined to look at OpenMP. Your problem would be cool to try to implement on the Cell BE processor as well.
Hopefully someone else knows a way to map your problem to CUDA.
BTW: You have me intrigued. Can you say what problem you are solving?
I suspect you will find that you must have a significant workload per tree to see much benefit, i.e. you may need a large training/testing data set which is evaluated by each tree. Alternatively, you may be able to collapse trees with identical topologies by parameterizing their constants. Testing various constants in parallel, along with a large dataset, would help turn the problem into a more data-parallel one.
I understand the point, but these are my typical conditions:
What if I normally have about 10 large datasets that each tree must evaluate? In general, I’d say that I have about 50000 trees and only about 10-15 datasets on which each tree must be evaluated.
Another scenario is also possible - 50000 trees and 5000 datasets - but that task is not what I’m currently working on.
In general, my question about tree evaluation is mostly tactical, not strategic. I don’t have much experience working with CUDA, and first I’d simply like to find out whether it is possible to implement a stack inside the kernel code and evaluate a single tree, as I can hardly imagine how to evaluate a tree without a stack …