OpenMP translation

I have a code that usually runs on a cluster, with MPI connecting the nodes and OpenMP running on the slave processors within each node. Would it be easy to recode the OpenMP parts of the code into CUDA? In effect I would do away with MPI: the master would be the host CPU and the slaves would run on the GPU. I have seen a paper on a compiler that automates OpenMP-to-CUDA translation… Anyway, let me know what you think. Thanks!

Can you say a little more about what tasks the slaves perform? Converting OpenMP to CUDA is more straightforward if every thread goes through roughly the same sequence of instructions. Threads that diverge incur essentially no performance penalty on a multicore CPU, but quite a large one in CUDA.

It’s a Monte Carlo-type code, so each thread executes exactly the same instructions (the only differences arise from the random numbers). Or what do you mean by diverge? I guess it’s possible for one thread to run longer than another if its sequence of random numbers says it should…

Actually, this is a key point. Do the random numbers affect the control flow in the code? Threads in CUDA are bundled into groups of 32 (the “warp”), and if threads within a warp go different directions at a branch, the instruction scheduler basically has to run the warp twice, once for each branch. (It can be a little more subtle than that, but that’s the basic idea.) Moreover, you generally want all the threads in your kernel to finish at roughly the same time, because the CUDA schedulers are not nearly as dynamic as a CPU scheduler: a thread (as part of a block) cannot migrate from one multiprocessor to another to balance the load. Wherever a thread starts running is where it will finish.
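To make the divergence point concrete, here’s a toy kernel (the names, branch condition, and numbers are just placeholders I made up, not from your code): if some threads of a warp take the “scatter” path and others the “absorb” path, both paths get executed for the whole warp, with the inactive threads masked off each time.

```cpp
// Minimal illustration of warp divergence (placeholder names and physics).
__global__ void step_kernel(const float *rand_vals, float *weight, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    if (rand_vals[i] < 0.5f) {
        weight[i] *= 0.9f;   // "scatter" branch
    } else {
        weight[i]  = 0.0f;   // "absorb" branch
    }
    // Both branches are short here, so the divergence cost is small;
    // the cost grows with the amount of work inside each branch.
}
```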

There are ways to handle this (and don’t be afraid of a little divergence if you can’t avoid it), but it’s hard for me to guess without more detail. Only a narrow subset of the space of OpenMP algorithms maps directly and efficiently to CUDA, and it sounds like you are slightly outside of it. Some rethinking of the problem will probably be required to get a good fit.

Yes, it’s a particle transport code, so the random numbers decide what happens to the particle. Am I approaching the problem incorrectly? Basically, there is a while loop that randomly samples probability distributions and decides what happens to a particle. If it samples a “kill particle” event, the while loop terminates and the particle history is returned. I was thinking each thread would be an independent particle, but maybe that is not the correct approach if each thread should complete at about the same time.

One option is to assign one thread per particle, but launch a kernel per step (or per couple of steps). Periodically you’ll have to compact the particle list to remove the dead particles. Once the number of particles drops below some threshold, you’ll want to finish on the CPU.
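Just as a sketch of what I mean (the Particle layout, the placeholder “physics” inside the kernel, and the Thrust-based compaction are all my assumptions, not a drop-in implementation):

```cpp
// Host-side driver: one kernel launch per step, compaction between launches,
// and a CPU finish once the surviving population is small.
#include <thrust/device_vector.h>
#include <thrust/remove.h>

struct Particle {
    float x, y, z;      // position
    float u, v, w;      // direction
    float weight;
    int   alive;        // 1 = still being tracked
};

struct is_dead {
    __host__ __device__ bool operator()(const Particle &p) const {
        return p.alive == 0;
    }
};

__global__ void advance_one_step(Particle *p, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n || !p[i].alive) return;

    // Placeholder physics: in the real code this is where you sample the
    // distance to the next collision, move the particle, and tally.
    p[i].weight *= 0.7f;
    if (p[i].weight < 1e-3f)
        p[i].alive = 0;            // history terminated
}

void run_histories(thrust::device_vector<Particle> &particles, int cpu_cutoff)
{
    const int block = 256;
    while ((int)particles.size() > cpu_cutoff) {
        int n = (int)particles.size();
        advance_one_step<<<(n + block - 1) / block, block>>>(
            thrust::raw_pointer_cast(particles.data()), n);

        // Compact every launch (or every few launches) so the kernel stays busy.
        particles.erase(
            thrust::remove_if(particles.begin(), particles.end(), is_dead()),
            particles.end());
    }
    // ...copy the stragglers back and finish their histories on the CPU
}
```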

Another option is to keep the one kernel launch per step, but replace dead particles with fresh ones between launches, so you don’t need the compaction steps.
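A rough sketch of that variant, reusing the Particle struct from the sketch above (the source definition and the histories_remaining counter are hypothetical):

```cpp
// Keep the array length fixed and re-source dead slots between launches.
// histories_remaining is a device-side counter of histories not yet started.
__global__ void refill_dead_slots(Particle *p, int n, int *histories_remaining)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n || p[i].alive) return;

    // Atomically claim one of the remaining histories, if any are left.
    if (atomicSub(histories_remaining, 1) > 0) {
        // Placeholder source definition: origin, +z direction, unit weight.
        Particle fresh = { 0.f, 0.f, 0.f,  0.f, 0.f, 1.f,  1.f, 1 };
        p[i] = fresh;
    }
}
```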

At this point, you’ll probably need to write a simple kernel and compare the options to see what works (and whether the number of particle-steps per second can be competitive with your current system).

Hmm, interesting. But for coding the kernel part, can you call external functions inside a kernel (ones that aren’t simple mathematical operations)? I.e., one that samples a distribution and tells the particle whether or not it’s dead?

You can write arbitrarily complex device functions. If you put them in an external file, you will need to #include that file into your main CUDA source file, as device functions are almost always inlined for speed reasons.

Edit: You need to label these functions with __device__, much like __global__ for the main kernel.
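For example (the file and function names here are made up, just to show the pattern):

```cpp
// --- sample_collision.cuh (illustrative name) ---
#ifndef SAMPLE_COLLISION_CUH
#define SAMPLE_COLLISION_CUH

// Returns true if the sampled interaction kills the particle.
// xi is a uniform random number in [0,1); absorb_prob is a made-up parameter.
__device__ inline bool collision_kills(float xi, float absorb_prob)
{
    return xi < absorb_prob;
}

#endif

// --- main .cu file ---
#include "sample_collision.cuh"

__global__ void history_kernel(const float *xi, float *weight, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (collision_kills(xi[i], 0.2f))   // device function, inlined into the kernel
        weight[i] = 0.0f;               // particle absorbed
}
```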

Generating random numbers on the device is a little tricky, due to the massive parallelism. Take a look at the Mersenne Twister example in the SDK, or search for “CUDA random number generator.” (Just be aware that RNGs are easy to get wrong, so look carefully at whatever code you pick.)
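If your toolkit includes the cuRAND library, its device API is one way to give each thread its own generator state; a minimal sketch (the launch configuration and seeding scheme are up to you):

```cpp
#include <curand_kernel.h>

// One RNG state per thread; seed once, then reuse across kernel launches.
__global__ void init_rng(curandState *states, unsigned long long seed, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // Same seed, different subsequence per thread => independent streams.
    curand_init(seed, i, 0, &states[i]);
}

__global__ void use_rng(curandState *states, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    curandState local = states[i];        // work on a register copy
    out[i] = curand_uniform(&local);      // uniform float in (0, 1]
    states[i] = local;                    // save state for the next launch
}
```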