I’m investigating whether porting a Monte Carlo radiation transport code to CUDA would result in a significant speed improvement. I’m looking for 5x or better compared to a single-processor 3.4 GHz Pentium. I’ve read through the programming guide and this forum and have some questions.
First, I found the previous thread where the original poster asked the same question, but that thread veered somewhat off topic into a discussion of random number generators. I’ve profiled my code and about 20% of the CPU time is spent in the RNG, so even if the RNG cost dropped to zero, the remaining 80% would cap the overall speedup at roughly 1.25x. Just moving the RNG onto the GPU therefore wouldn’t give me the speedup I’m looking for.
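(For reference, the RNG offload I’m dismissing here would amount to something like the sketch below: each thread runs its own generator and writes a batch of uniform numbers back to global memory. The simple LCG and its constants are just placeholders, not the generator my code actually uses.)

// Each thread steps its own simple LCG and writes per_thread uniform
// random numbers to global memory. The LCG is only a placeholder.
__global__ void fill_uniform(unsigned int *state, float *out, int per_thread)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int s = state[tid];
    for (int i = 0; i < per_thread; ++i) {
        s = 1664525u * s + 1013904223u;                      // LCG step
        out[tid * per_thread + i] = s * 2.3283064365e-10f;   // scale to [0,1)
    }
    state[tid] = s;   // keep the state for the next launch
}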
The way the code works is that it tracks the history of photons. Each photon can lose energy through a number of different mechanisms, and it can generate secondary photons or electrons that also need to be tracked. At each step the dice are rolled (a random number is drawn) and the particle may undergo an event (Compton or Rayleigh scattering, photoelectric absorption, etc.). So there is a lot of branching in the code that depends on the value of a random number, roughly as sketched below. My understanding is that this kind of branching is not well suited to CUDA. Is that correct?
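To make the branching concrete, the per-step logic has a shape something like this. The thresholds and what each event does are made-up placeholders, not my actual physics; the point is only that each thread draws its own random number and so takes its own branch.

struct Photon { float energy; int alive; };

// Per-step event selection. r is the uniform random number drawn for this
// step; the thresholds and the effect of each event are placeholders.
__device__ void do_step(Photon *p, float r)
{
    if (r < 0.40f) {
        p->energy *= 0.7f;   // e.g. Compton scatter: photon continues with less energy
    } else if (r < 0.55f) {
        // Rayleigh scatter: direction changes, energy unchanged
    } else if (r < 0.70f) {
        p->energy = 0.0f;    // photoelectric absorption: this history ends
        p->alive = 0;
    } else {
        // no interaction on this step
    }
}

Neighbouring threads would draw different values of r, so threads in the same warp would end up in different branches.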
One thought I had was to run the complete code on the GPU, with each multiprocessor running one copy of the simulation. For example, with the Tesla C870 card I would have 16 copies and could follow the histories of 16 particles simultaneously. From the reading I’ve done, I don’t see how to do this programmatically; a rough sketch of what I’m picturing follows this paragraph. If I were to launch 16 threads, would I be guaranteed that one thread runs on each multiprocessor? Is it possible to programmatically tell the GPU to put one thread on each multiprocessor?
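Here is roughly what I have in mind, written as one block per simulation copy with a single thread each. simulate_history() is just a stand-in for my full tracking loop, not working code.

// Stand-in for tracking one complete photon history; the real code would
// loop over interaction steps until all the energy is deposited.
__device__ float simulate_history(unsigned int *rng_state)
{
    float deposited = 0.0f;
    // ... full tracking loop would go here ...
    return deposited;
}

// One copy of the simulation per block, one thread per block, and 16 blocks
// to match the 16 multiprocessors on the C870.
__global__ void run_simulations(unsigned int *rng_state, float *deposited)
{
    int sim = blockIdx.x;   // which copy of the simulation this is
    deposited[sim] = simulate_history(&rng_state[sim]);
}

// Host launch -- is this guaranteed to put one copy on each multiprocessor?
//   run_simulations<<<16, 1>>>(d_rng_state, d_deposited);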
If I could do the above, there is still the problem that all 16 particles wouldn’t finish at the same time, because one initial photon may undergo one or many interactions before all its energy is deposited. My next thought was to have each processor track a number of initial particles, e.g. 10,000. If each processor works through a batch of particles, the average number of interactions per particle should be similar and all the processors should finish at about the same time. Then I’d pull the data out of the GPU and repeat until I’ve processed the total number of particles requested, along the lines of the sketch below. Does this sound like a reasonable approach?
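A rough host-side driver for that batching idea might look like the following. track_batch would be the kernel from the previous sketch extended to loop over a batch of histories (it’s only declared here), and NUM_COPIES, HISTORIES_PER_COPY, and the dose arrays are placeholder names.

#include <cuda_runtime.h>

#define NUM_COPIES         16      // one simulation copy per multiprocessor on the C870
#define HISTORIES_PER_COPY 10000   // initial particles per copy per launch

// Defined elsewhere: each block tracks n_histories initial particles.
__global__ void track_batch(unsigned int *rng_state, float *deposited, int n_histories);

void run_all(long long total_histories, unsigned int *d_rng_state,
             float *d_deposited, float *h_deposited)
{
    long long done = 0;
    while (done < total_histories) {
        track_batch<<<NUM_COPIES, 1>>>(d_rng_state, d_deposited, HISTORIES_PER_COPY);
        cudaMemcpy(h_deposited, d_deposited, NUM_COPIES * sizeof(float),
                   cudaMemcpyDeviceToHost);   // waits for the kernel to finish
        // ... accumulate h_deposited into the host-side tally here ...
        done += (long long)NUM_COPIES * HISTORIES_PER_COPY;
    }
}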