Is CUDA suitable for Monte Carlo transport?

I’m investigating if porting a Monte Carlo radiation transport code to CUDA would result in a significant speed improvement. I’m looking for 5x or better compared to a single processor 3.4 GHz Pentium. I’ve read through the programming guide and this forum and have some questions.

First, I found the previous thread where the original poster asked the same question, but the thread veered somewhat off topic to discuss random number generators. I’ve profiled the code and about 20% of the CPU time is spent in the RNG. So just putting the RNG into the GPU wouldn’t give me the speedup I’m looking for.

The way the code works is that it tracks the history of photons. Each photon can lose energy by a number of different mechanisms, and it can generate secondary photons or electrons that need to be tracked also. At each step the dice is thrown (a random number is selected) and the particle may undergo an event (Compton or Rayleigh scattering, photoelectric absorption, etc). So there is a lot of branching in the code depending on the value of a random number. My understanding is that this kind of branching is not suitable for CUDA. Is that correct?

One thought I had was to have the complete code run on the GPU, with each multiprocessor running one copy of the simulation. For example, with the Tesla C870 card, I would have 16 copies, and follow the history of 16 particles simultaneously. From the reading I’ve done, I don’t see how to do this programmatically. If I was to have 16 threads, would I be guaranteed that one thread would run on each processor? Is it possible to programmatically tell the GPU to have one thread per processor?

If I could do the above, then there is the problem that all 16 particles wouldn’t finish at the same time. This is because one initial photon may have 1 or many interactions before all it’s energy is deposited. My next thought was to have each processor track a number of initial particles, e.g. 10,000. By having each processor do a number of particles, then the average number of interactions per particle should be similar and each processor will finish at about the same time. Then I’d pull the data out of the GPU and repeat until I’ve processed the total number of particles requested. Does this sound like a reasonable approach?

Sounds like an interesting problem, but there’s not quite enough to say what’s right. I suspect that the initial inputs are a set of particles with given energies. But what do you want for the outputs? Will it be a complete history of each input particle, or a set of surviving particles, or some other set? Are there particle interactions, or are they independent (after forking)? How are you planning to represent particles producing additional particles? The answers are likely to affect the factoring of the problem.

Note that divergence (e.g. branching) is only a problem for threads within a block. Different blocks can execute different code without penalty. Even within a block you only pay a penalty if the threads within a warp diverge or when you do a syncthreads call.

Another approach might be “blocked parallelism,” in which you cycle through a series of kernels over and over:

  1. takeStep - Advance all particles by one step, tagging them with an interaction code that occurs at the step endpoint. (Attenuation, refraction at material boundary, etc.)
  2. compactStream - Sort particles by interaction code, removing attenuated particles.
  3. applyInteraction - alter particle momentum and energy based on interaction code. Since the particles are grouped by interaction, the branching won’t be as bad.
  4. Go to 1.

The most painful part of this scheme is step 2, which involves a lot of shuffling, and some thinking to maximize the speed of the sort. The number of active particles will slowly drop until there is no point in executing step 2, or maybe even using the GPU at all. Some kind of bailout condition where the remaining particles are pulled back to the CPU and finished there might help the last few iterations.

I think that seibert’s approach works reasonably well if the particles do not interact.

Assuming that, what is likely to be costly is global memory access. You can do as suggested in two nested loops to reduce this. The inner loop works within a block of threads, updating status just in shared memory.

You start out each block partly full (fraction dependent on forking probability). For simplicity, a fork always writes to tx and tx*2 (where tx is the thread index). When you finish one step you compact the results in the block. You stop when the block is empty, or is too full to fork, or after some fixed number of iterations (TBD). Then you write out the state as suggested, compact the results globally, and repeat the outer loop. You should be able to save a lot of global reads and writes this way.

To your questions:
The problem is a set of radioactive sources embedded in a 3D volume. I’m using a cartesian coordinate system. Usually the volume is about 10cmx10cmx10cm with a 1mm^3 voxel size (1e6 voxels). Each source is isotropic, i.e. it emits photons into the volume with a random direction and an energy randomly chosen from the known spectrum of the source. The output I want is the total energy deposited in each voxel from all initial particles. I don’t need each particle history. There is a flag to output particle histories in the current code, but this is only used as a debugging tool. When a particle exits the volume, that’s the end of tracking it.

The particles don’t interact with each other, only with the material in each voxel.

When one particle produces another particle the energy and direction of the new particle is determined by the mechanism that created it. In general it will have a different direction and lower energy than the initial particle. There is a cutoff energy, so when a particle falls below this energy the tracking is terminated and the remaining energy is deposited into the current voxel. In the existing code when a new particle is created it’s added to a stack. When the history of the original particle is finished, then particles are pulled off the stack and their history is tracked. This is done until the stack is empty. Then a new particle is created and the process repeats.

I understand that memory accesses are costly. The current code keeps a 3D array in memory and when energy is deposited into a voxel, the additional energy is added to the current value in the voxel. Implementing this in the GPU wouldn’t work because of the access time issue for global memory and the memory collision problem. The idea I had for storing the energy deposition is for each thread to create a list of pairs of energy deposited and voxel number. When all threads are done, pull out the data and have the host add up all the energy in each voxel. My guesstimate of the GPU memory requirements for this is what led me to say that each thread would do 10,000 particles.

I’m trying to understand seibert’s approach. I think it’ll work but I’m not sure I understand it well enough at the moment.

With CUDA, it is always easiest to try the “keep it simple” method first, it is often the fastest. So, for the moment completely forget that divergent warps are “bad” and just implement 1 thread per particle. See what kind of performance you get and come back here for ideas. Or N particles per thread would work too, you really need about 10,000 threads to get into the linear performance regime of the card.

Your list idea for the counting particle-cell events is a good one: there may even be a way to perform a reduction of those values on the GPU, though it might be faster on the CPU.

Anyways, here are a few ideas for boosting the performance within the context of 1 particle per thread. I’m calling your “inner loop” the one that moves a particle and then decides if an event occurs.

  1. If the major branches in your inner loop just update i.e. velocity, energy and flags but the guts of the inner loop remains the same, the cost of divergent branches will be minimized.

  2. Does an event that needs to be recorded in your list happen on every iteration of the inner loop? Or just on ones when the right random numbers come up? If you would write to the list every time, then you can make your writes coalesced easily. If the list adds are random… well you won’t get coalescing and that will hurt.

  3. This idea will be the most complicated to code, it might not even be possible, but it could reduce your memory requirements and fix the coalescing even when the list writes are random. Instead of building one list per thread in global memory, do it in shared memory. Obviously, the shared memory can only handle a few (maybe tens, I don’t know) iterations of the inner loop before it needs to be flushed to global mem. If everything is setup right, you can reduce the list to write one list per block and perform the global memory write coalesced.

Ask me to elaborate if anything doesn’t make sense. This is just me rambling off a bunch of ideas I have in my head.

Thanks for the help and suggestions.

To Mr.Anderson42, re: #2, the answer is no for the current code, although there may be a way of rewriting it so that this is true.

It’s clear that I need to do some more reading and understanding of CUDA, especially the idea of coalesced writes. I’ve got no way of evaluating which suggestions would result in the best optimization for this application. The code consists of two parts, one defining the geometry and one the physics. I think what I’m going to do is implement a simplified geometry, i.e. the same material in all voxels, and one or two physics functions. I’ll try out the different ideas here and see which produces the fastest results.

Currently I’m doing a similar project too. My MC simulation don’t require to calculate the new particles(only one particle per each path). However, I also need to do the sum in CPU, since the collision and scatter issue(I’m using Cg, right now I’m thinking I really shoud implement it first in CUDA :yes: )


I’m not familiar with Cg Monte Carlo. Are you referring to the MORSE-CG package?

It’s probably Cg as in the GPU shader language Cg. It is for graphics and much more limited than CUDA.

Not quit sure what MORSE-CG is. But, right now I don’t think using Cg is a good idea, since it would be a difficult task even use CUDA.

Any progress about your CUDA implementation?

Very slow progress. I got pulled off of the project to work on a “crisis”. If I’m lucky I can spend an hour a day on it. Hopefully, I can get back to the MC project in a few weeks.

BruceD and liushusen,

Nice to see so many people interested in accelerating Monte Carlo simulation of radiative transport. I came here for the same reason and I’m hoping to speed up the simulation time of my future simulations quite a bit.

What applications are you looking working on (if you can reveal that)? I’m in the biomedical optics field and is currently interested in photon transport in scattering media such as tissue or pharmaceutical tablets. If any of you are doing anything similar it would be interesting to exchange knowledge and experience. Either way, the problem is basically the same so I’ll keep watching this thread. Keep us posted on you progress.

I’m working on using MC in the radiation therapy field. Photon energies go up to the MeV range. So there’s different physics than at optical energies! I’ll add to this thread when I have something useful to report.

Wow~Seems that we are working on the same project! I’m trying to implement a GPU version of MCML(MC multi-layer simulation program wrote by Lihong Wang in ASCII C), right now, the same as Bruce, I’m also stacked by other things so the progress is kind of slow. In addition, I’ve switched to CUDA.

May be we can wrote it together, if possible. Contact me if you’re interested.

My email: Liushusen.11(at)

Haha, me too. Sat down to start this a while ago, but haven’t picked it back up. Looking forward to hearing what you guys come up with, esp. re: decisions when crossing layer boundaries. I got stuck on a non-branching strategy that just looked up the local optical properties at the interaction site, which I had stored in memory. But I also had to hold the n-1 location to figure out what fraction of the photon path had been in one layer or the other, which required an if/else to avoid doing this at every interaction step.

Please let me know what you guys are thinking…would love to help. We’re actually working on an open-source “Virtual Photonics” toolbox for Biomedical Optics here at the Beckman Laser Institute at UC Irvine. Maybe you’d be interested in contributing?

You can contact me at david(dot)cuccia(at)modulatedimaging(dot)com


I found this site with the source codes of an implementation of Monte Carlo simulation of photon migration, could be helpful for you:

Hi again everyone!

One year later, I’m finally finished with a first implementation of MCML using CUDA, including partial documentation of the code as well as description of the bottlenecks and implementation considerations.
You may download source code, documentation as well as a Windows executable at our webpage:…pu_monte_carlo/

There we also offer our time-resolved Monte Carlo code (as used in our GPGPU Photon Migration Monte Carlo paper). This code is the same as the one linked to by a_grau in the post above, but located at our new and updated webpage.



Thanks Erik! Congrats on the release. :)

Hi everybody,

I just finished a CUDA program to transport x rays in the GPU using a Monte Carlo algorithm. :yes:

Essentially, I translated the atomic physics models from the PENELOPE package to CUDA. In the simulation code, the geometry is described with voxels and efficiently traced using Woodbum tracking.

I will use this code to simulate realistic projections of the human anatomy (one x ray at a time).

I have submitted a short paper describing the code, I hope it will be published soon. Meanwhile, I will be happy to share a pre-print to anyone who may be interested.

The source code will be eventually released for free, but I still need some time to validate the accuracy of the simulation.

I will try to answer some of the comments from the original post, according to my experience:

In my case, the same code runs 27 times faster in the GPU (1 gpu, NVIDIA GTX 295) than in the CPU (1 core, intel quad-code2), using single precision and NVCC fast_math. Using some double precision operations and accurate math, the speed up is reduced to a factor 7.6 (not too bad). In my imaging simulations, the precision doesn’t seem to affect the result significantly.

I think it was me who messed the discussion talking about random numbers… sorry :-)

This branching is not too bad: it will affect only the threads within each warp (ie, each thread will be slowed down only by the diverging threads in its warp), you will still have many warps working in parallel in your device. Furthermore, if you simulate low energy photons, as I do, the most probable interaction is photoelectric and therefore most particles are quickly absorbed and there is not so much divergence in the tracks.

The “blocked parallelism” that ‘seibert’ suggested could potentially speed up the simulation, but if the implementation is too complicated it can actually slow it down. Saving registers and reducing memory accesses is a key factor in the speed, maybe even more important than reducing branching!

You need hundreds of threads in each processor to hide the latency. In my case, I usually launch 384 threads in each multiprocessor and, according to the NVIDIA profiler, access to memory doesn’t take too much execution time (~15%).