I am thinking of moving a Monte Carlo transport algorithm to the GPU using CUDA.

It seems to me that I should start one kernel, let each thread block keep its own runtime tallies, and not copy anything back to the CPU until each block has finished its share of the total number of simulations. But, as I am not originally a GPU programmer, I am having difficulty visualizing the program architecture I need.
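Here is roughly what I am picturing, as a sketch rather than working code. The kernel name `run_simulations`, the shared-memory tally layout, and the placeholder for the physics are all my own assumptions:

```cuda
#include <cuda_runtime.h>

// Sketch: each block keeps its own running tally in shared memory; results
// are copied back to the host only once, after every simulation assigned to
// the block has finished.
__global__ void run_simulations(unsigned long long sims_per_thread,
                                double *block_tallies)
{
    extern __shared__ double tally[];   // one slot per thread in this block
    unsigned tid = threadIdx.x;
    tally[tid] = 0.0;

    for (unsigned long long i = 0; i < sims_per_thread; ++i) {
        // Placeholder for the actual transport physics; each history
        // is independent of all the others.
        double score = 1.0; /* simulate_one_history(...); */
        tally[tid] += score;
    }
    __syncthreads();

    // Tree reduction within the block (blockDim.x assumed a power of two).
    for (unsigned s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) tally[tid] += tally[tid + s];
        __syncthreads();
    }
    if (tid == 0) block_tallies[blockIdx.x] = tally[0];
}
```

The host would then launch this once, e.g. `run_simulations<<<blocks, threads, threads * sizeof(double)>>>(...)`, copy `block_tallies` back with a single `cudaMemcpy`, and sum the per-block results on the CPU. Is this the right general shape?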

In general, Monte Carlo algorithms are very easy to parallelize, since each simulation (or tally) has no interaction with previous or following simulations (or tallies), but they typically need fast (in this case parallel) random number generators. Built-in random number generators are typically terrible, both in performance and mathematically (not very random).
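For the RNG, I understand CUDA ships the cuRAND device library, which gives each thread its own generator state. Something like the following is what I have in mind (the seed and the one-state-per-thread layout are arbitrary choices on my part):

```cuda
#include <curand_kernel.h>

// Seeding every thread with the same seed but a distinct sequence number
// is cuRAND's way of producing statistically independent streams.
__global__ void init_rng(curandState *states, unsigned long long seed)
{
    unsigned id = blockIdx.x * blockDim.x + threadIdx.x;
    curand_init(seed, /*sequence=*/id, /*offset=*/0, &states[id]);
}

__global__ void use_rng(curandState *states, float *out)
{
    unsigned id = blockIdx.x * blockDim.x + threadIdx.x;
    curandState local = states[id];    // work on a register copy
    out[id] = curand_uniform(&local);  // uniform draw in (0, 1]
    states[id] = local;                // save state for the next launch
}
```

Is this the accepted approach, or do people roll their own counter-based generators for transport codes?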

My eventual goal is to run the Monte Carlo code across many machines, each with one or more GPUs, to maximize performance. I therefore figure I will need to distribute the work across machines as well as across GPUs.
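For the multi-machine part, the pattern I am assuming (CUDA itself does not provide this) is one MPI rank per GPU: each rank runs its share of histories locally and a single reduction combines the tallies at the end. A bare outline of what I mean:

```c
#include <mpi.h>

/* Hypothetical outline: one MPI rank per GPU across the cluster. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* cudaSetDevice(rank % gpus_per_node);  bind this rank to a GPU   */
    double local_tally = 0.0;   /* filled in from the GPU kernel result */

    double global_tally = 0.0;
    MPI_Reduce(&local_tally, &global_tally, 1, MPI_DOUBLE,
               MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) {
        /* report global_tally normalized by the total history count */
    }
    MPI_Finalize();
    return 0;
}
```

Does this layering (MPI across machines, one kernel per GPU, per-block tallies inside the kernel) make sense, or is there a better-established architecture?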

Any suggestions on working with Monte Carlo simulations in CUDA?