Monte Carlo Transport

I am thinking of moving a Monte Carlo transport algorithm to the GPU using CUDA.

It seems to me that I should be starting one kernal, but let each thread block have its own runtime tallys and not move back into the main CPU until I have finished the total number of simulations needed in each thread block. But, as I am not originally a GPU programmer, I am having difficulty visualizing the program architecture that I need.

In general, Monte Carlo algorithms are very simple to parallelize as each simulation (or tally) has no interaction with previous or following simulations (or tallys), but they typically need fast (in this case parallel) random number generators. Built-in random number generators are typically terrible (both performance and mathematically–not VERY random).

My eventual goal is to run Monte Carlo across many different machines each having a GPUs to maximize performance. Therefore, I figure that I will need to stretch threading across machines as well as across different GPUs.

Any suggestions on working with Monte Carlo simulations in CUDA?

How much work is involved in each step? This would have an impact on whether to try to push the entire problem into the GPU, or just a part.

I’ve been interested in doing photon transport in a particle physics simulation. We have tens of thousands of photons in a given event which do not interact with each other. Each propagation step is pretty much just finding the intersection between the photon ray and the next geometric volume boundary. Then you can revert back to the CPU to do the actual boundary physics, adjust the photon position and direction, and repeat.

The way I see it, why go back to the CPU at all unless you have to?



I quite agree. But, I am trying to figure out how to split the computations. Think of have a multiple dimension integration (maybe 6 or 8 or more dimensions of freedom). I can pick a point in all of the dimensions, add the value to a running sum, and rinse and repeat until I have the variance low enough. My conceptual issue, then, is how to break these completely separate calculations into the differing threads and thread blocks. In serial modes, the one point in common is the pseudo-random number generator. If these use the same data, then we are doing redundant (and misleading) calculations. Also, my data requirements for these “dimensions” is not completely trivial (it might be 10 to 100 MB). I would REALLY like to do the entire calculation in the GPU.


For my Monte Carlo application, I have assigned one path per thread.

I first considered precomputing a large numberof random values and storing them in global memory. The problems with this approach are:

  1. The lack of inter-block communication necessitates a separate pool of random numbers for each thread block.

  2. This method doesn’t scale well: if I have 1 million paths and 1000 steps per path, I’m looking at 1 billion random numbers. Because of memory limitations, one must recycle the pool of numbers.

  3. The global reads carry substantial overhead. Maybe there are some clever ways to do stuff between reads that absorbs some of this latency…

In the end, I scrapped that approach, and imlpemented a random number generator as a device function. Here’s one reference:…pd/rpb132tr.pdf

I’ve heard there are some fast implementations of the Mersenne Twister for SIMD machines, but haven’t got that far.

I have the CPU pass an array of seeds for the uniform random number generators (one for each thread). The seeds are stored in shared memory, and updated with each call to the uniform random number generator.

This has worked out well. But I imagine there’s more than one way to skin this cat…

The use case I’m imagining would be something to plug into the GEANT4 framework (, which is standard across most of particle physics for simulation. Decoupling transport (which is a geometry question) and doing it in the GPU separate from the physics process part in the CPU would only be a code maintenance benefit. Geometry is relatively easy and doesn’t change often, but the physics part is a bunch of C++ classes that library users are can (and often do) swap in and out with their own code.

So yeah, I agree. Don’t go back to the GPU unless necessary. But physics software (at least the stuff I work on) can be kind of unruly, and sectioning off CUDA code in a highly used, but rarely changed part of the library is easier to manage. :)

(Anyway, this is all blue sky thinking anyway. I won’t be doing any GEANT4 hacking until after more important things, like graduating, occur…)

You are quite correct about many ways to skin cats… After quite a bit more research, this is a good starter website that will illuminate all of the awfulness that is pseudo-random number generators (and even worse in parallel applications).…0.html#q210.6.1

I didn’t see any SIMD Mersenne Twister algorithms, but this may just be my neglectful looking. I will be working to implement a simple Monte Carlo integration just to prove locally that we can make a pseudo-random number generator work in CUDA. When (if) I get it finished, I will post it here or on another thread.

Google search on “simd mersenne twister” returned:…SFMT/index.html

Also, I posted some RNG code:…ndpost&p=174164

Caveat: The RNG code I posted is linear congruential, so suffers from all the known problems of such algorithms. That said, it’s fast and efficient, and is sufficiently random for many applications.

Besides, it’s better than rand() or RANDU… :P

I would strongly suggest all people playing with RNGs to read L’Ecuyer and Simard’s TestU01 paper:

Mersenne Twister keeps a large state vector making it a bad choice for memory constrained or cache-concious implementations. I tested MT in my multithreaded rendering code and watched my performance drop quite a bit due to cache misses. MT has a huge period and good statistical properties, but I think that some of the xorshift RNGs are a better choice for a GPU. I hope to work on this in some detail in a month when I’m past some pressing deadlines. In the mean time, I recommend that RNG implementors sift through the results table in the TestU01 paper and decide which generators have properties they can live with, and which also work well on the GPU. For a GPU I’d strongly suggest avoiding implementations that use modulo operations or large state vectors. L’Ecuyer and Panneton have another interesting paper that has a section that delves into efficient skipping for parallel streams without using much storage through the use of combined generators:



Thanks John.

The tradeoff between speed and randomness is often a difficult one.

Mod 2^32 is very efficient! :D

Marsaglia’s Multiply-With-Carry Generator is a happy medium:

It’s simple, pretty straightforward to implement (even with 32-bit integers), and has reasonably good properties for many applications.

Also, on Multiply-With-Carry, see this paper:

Xorshift seems like a good approach, too.


Yeah, the real pain about the speed/randomness tradeoff is that you have to re-evaluate it for every application/use-case. The right RNG for one code might be a disastrous choice for another…


I am also trying to implement a Monte Carlo radiation transport code with CUDA; it seems to be a natural approach to create milions of random particle tracks.

I have been working with PENELOPE and DPM. These codes are conceptually simpler than GEANT4 and I think their physics routines can be adapted to GPU (but I don’t know how yet).

The main problem is tabulating the cross-sections and sampling them using uniform random numbers. These codes spend most of their time interpolating the cross section database, so maybe implementing them as textures and using the build-in interpolation could be faster. I don’t know if this make sense… any comment/suggestion?

For the random numbers a simple congruential generator may be enough. PENELOPE uses the “ancient” RANECU and it works pretty well. It uses only 5 lines of code and it is possible to parallelize this generators using the cycle splitting method: we can simply compute the seeds that will be generated after N calls to the generator and therefore initiate the generator in each thread N places ahead (N should be big of course).

You can read the paper from L’Ecuyer: “Efficient and portable combined random number generators” (1988) -->…ACM&coll=GUIDEç


I have worked with Monte Carlo codes for some years but I am completely new to GPUs (I can’t understand why they made CUDA an extension to C and not an extension to Fortran77?!, I am kidding… :) ).

It would be great to collaborate in this project with other developers, is anybody interested? :thumbup:

I should have put a pointer in this topic a while ago… and it is in the 0.9 SDK as well now.

I am interested in Monte Carlo implementation on GPU. I have already written a GPU raytracing algorithm that could be part of a geometry engine.

Cross-sections can be parametrized, I guess it’s better than interpolation. The memory footprint remains small, which is good for GPUs.

I’m also thinking about ways to speed up my MC radiation transport code. The trouble with running the entire algorithm on the GPU is that my geometry is an adaptive-mesh grid, so tracking rays through this leads to less-than-optimal memory access patterns. It doesn’t seem like a particularly good fit for the uncached GPU memory model.

I’ll probably start by porting the intensity(wavelength) vector operations, we’ll see if the overhead becomes prohibitive.