Sounds very interesting indeed! I would certainly like a copy of your pre-print.
I have been toying with the idea of MC radiation transport on the GPU as well, but for photon energies used in therapy (6–15 MV). Maybe your code could serve as a framework for the higher-energy case as well. The open questions there, of course, are how to deal with the much more active secondary particle production and the accuracy of single precision.
Monte Carlo radiation transport algorithms fit quite well on a massively parallel architecture such as a GPU, and I don’t think there is any problem in simulating high-energy photons and charged particles.
Obviously, simulating electron tracks will require more time than photon tracks because electrons interact much more often, but this is equally true on the CPU and on the GPU. Using only double precision operations, a CUDA code is about 10 times slower than using single precision; but even in this situation GPUs can still be faster than CPUs (at least this is true for my x-ray transport code).
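For anyone who wants to quantify the single vs. double precision gap on their own kernels, one simple approach is to template the kernel on the floating-point type and time both instantiations on the same code base. A minimal sketch (the kernel and its stub update are hypothetical, not anyone’s actual transport code):

```cpp
#include <cuda_runtime.h>

// Templating the kernel on the floating-point type makes it easy to time
// the same transport step in single and double precision.
template <typename Real>
__global__ void stepKernel(Real *x, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) x[tid] = x[tid] * Real(0.5) + Real(1.0);  // stub update
}

int main()
{
    const int n = 1 << 20;
    float  *xf; cudaMalloc(&xf, n * sizeof(float));
    double *xd; cudaMalloc(&xd, n * sizeof(double));

    stepKernel<float> <<<(n + 255) / 256, 256>>>(xf, n);   // single precision
    stepKernel<double><<<(n + 255) / 256, 256>>>(xd, n);   // double precision
    cudaDeviceSynchronize();  // in real timing, wrap each launch in CUDA events

    cudaFree(xf); cudaFree(xd);
    return 0;
}
```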
I really think that GPUs are already a good alternative to CPUs for MC simulations, in terms of computing capability, power consumption, and hardware cost (a 480-processor GeForce GTX 295 costs about $500: just ~$1 per processor!).
It looks like there are many people interested in GPU-based MC simulation; it would be great to find a way to discuss this topic in more detail. Do you know of any “radiation-physics friendly” GPU conference/workshop where we could meet?
I will surely present my work at the SPIE Medical Imaging Symposium.
Would anybody be interested in attending a roundtable about MC simulation at the upcoming NVIDIA Research Summit?
The deadline for submitting roundtable proposals is July 7.
That’d be very interesting. You may want to distinguish the categories of MC apps… there’s the science community dealing with neutron transport, x-ray imaging scatter, atmospheric simulation, etc., and then there are the graphics guys dealing with visual simulation of light transport. There’s huge overlap in techniques between the two camps, of course, and such a roundtable would likely benefit from that.
I’d be interested in that roundtable. I’m in the radiotherapy dose calculation field. I’m still not sure I’ll be attending the conference, but that could be a push in the right direction for someone to fork over the funds!
On a side note, I’ll be attending the AAPM conference (and presenting about GPUs) if anyone around here plans on going.
I am porting a Monte Carlo radiation transport code to CUDA, for a similar project as discussed in this thread.
For my application precision is important, so I have to use double precision.
My problem is that I cannot use more than 192 threads per block on compute capability 1.3 (as I need double precision); beyond that I get the runtime error “too many resources requested for launch”.
Also, to keep the SPs occupied, I have implemented a conditional loop inside a for loop, just like the code I downloaded from http://www.atomic.physics.lu.se/biophotoni...pu_monte_carlo/
But the problem is that when I compile for compute capability 1.3, the else part of the conditional loop causes the same runtime error as above.
As you all have been working on similar things for quite a while, I thought I would post my problem here.
Can anyone help me out with a solution?
Currently I only get a 3x speed-up, but I need to increase that to at least 10x.
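For what it’s worth, on CC 1.3 “too many resources requested for launch” usually means threads-per-block × registers-per-thread exceeds the 16384-register block limit, and double precision roughly doubles register pressure. A minimal sketch of how one might check and cap register use (the kernel is a hypothetical stand-in):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical transport kernel. __launch_bounds__(192) tells the compiler
// to keep register use low enough for 192-thread blocks; on CC 1.3 a block
// has 16384 registers, so 192 threads allows at most 85 registers/thread.
__global__ void __launch_bounds__(192) transportKernel(double *dose, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) dose[tid] = 0.0;   // placeholder for the real transport loop
}

int main()
{
    // Query what the compiler actually produced for this kernel.
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, (const void *)transportKernel);
    printf("registers/thread: %d, local mem: %zu bytes\n",
           attr.numRegs, attr.localSizeBytes);
    return 0;
}
```

Alternatively, compiling with nvcc --maxrregcount=N caps register use globally, at the cost of spilling the excess to local memory.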
Interesting, I should have done a search for this back in January!
I am just about finishing up my work on a neutron transport MC code. It is not at all oriented towards production; instead, it is a ‘proof-of-concept’ for a master’s thesis.
The code currently sees about an 8.8x speed-up (single precision) compared to running on the CPU, and considering I am not a good programmer, that isn’t too shabby at all.
How is everyone else’s project coming along?
I initially looked at the event-based algorithm (coincidentally discussed in one of the earlier posts: each particle goes through an event, then the list of particles gets sorted and sent to different kernels depending on what happens next), and the most I could get (albeit as a very naive CUDA programmer at the time) was a ~1.13x speed-up. Not good at all.
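For reference, a minimal sketch of the event-based bookkeeping described above, using Thrust to sort particles by their next event type so that each per-event kernel can run over a contiguous range (the names and the Particle struct are hypothetical):

```cpp
#include <thrust/device_vector.h>
#include <thrust/sort.h>

// Hypothetical particle state for the event-based scheme.
struct Particle { float x, y, z, u, v, w, energy; };

// After a flight kernel fills 'eventType' (e.g. 0 = scatter, 1 = absorb, ...),
// sort the particle list by event type so each per-event kernel can then be
// launched over one contiguous sub-range of the array.
void regroupByEvent(thrust::device_vector<int>&      eventType,
                    thrust::device_vector<Particle>& particles)
{
    thrust::sort_by_key(eventType.begin(), eventType.end(), particles.begin());
    // thrust::lower_bound / upper_bound on eventType would then give the
    // [begin, end) range to launch each event kernel over.
}
```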
Finally, after I was ready to say CUDA wasn’t worth it, I decided to try the thread-per-particle algorithm (which I had neglected at first because, from what I had read and understood, divergent branches were the CUDA killer). In my very first implementation, without using constant/shared memory or textures, I had a ~2x speed-up. Encouraging. I kept at it, and here I am now at 8.8x!
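For concreteness, a minimal sketch of the thread-per-particle pattern, where one thread follows one particle history from birth to death (the RNG and physics are crude stand-ins, not the actual thesis code):

```cpp
// One thread per particle history. Branches diverge inside the loop, but
// every thread stays busy until its own particle dies.
__global__ void historyKernel(float *tallyPerThread, unsigned long long seed,
                              int nHistories)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= nHistories) return;

    unsigned long long rng = seed + tid;   // per-thread LCG state (toy RNG)
    float energy = 2.0f;                   // placeholder source energy
    float score  = 0.0f;

    while (energy > 0.01f) {               // follow until the particle "dies"
        rng = rng * 6364136223846793005ULL + 1442695040888963407ULL;
        float xi = (unsigned int)(rng >> 40) * (1.0f / 16777216.0f);  // U[0,1)

        float deposited = 0.1f * energy * (xi + 0.5f);   // stub interaction
        energy -= deposited;
        score  += deposited;
    }
    tallyPerThread[tid] = score;  // one scoring slot per thread; reduce on host later
}
```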
Now I’m about ready to start writing my thesis, and I plan on presenting at PHYSOR 2010 (Advances in Reactor Physics; Pittsburgh, May 2010).
Was that roundtable ever held at the NVIDIA conference?
If anyone wants to discuss further, PM me and I will send you my email address.
Adam
I know this is an old post, but if anybody is interested, we are currently distributing our Monte Carlo simulation code MC-GPU on the website http://code.google.com/p/mcgpu/.
We get a ~30-fold speed-up with this code on a Fermi GPU (using single precision), compared to a state-of-the-art CPU core.
I am currently working on a similar problem, i.e. one where computation times per thread differ significantly and there is a considerable amount of thread divergence. To use the GPU efficiently, I think block-parallel processing is the only way to go. The question I have is how one would go about solving point 2, compactStream. I was wondering if anyone has a pointer to good implementations, or ideas on how to redistribute the workload for the next work block efficiently. If one implements a single stack for each task type in the next work block, prohibitively expensive atomic operations are required, even if there are only, say, 5 options. Several stacks for one and the same option require sorting and rearranging. I’m sure there’s something that I’m missing… Help, anyone? Any hints are very much appreciated.
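Not a full answer, but one atomics-free option is scan-based stream compaction: Thrust’s copy_if does the prefix sum internally, so with a handful of task types you can build each dense queue in one ordered O(n) pass per type. A minimal sketch (all names here are hypothetical):

```cpp
#include <thrust/copy.h>
#include <thrust/device_vector.h>

struct Task { int particleId; int type; };

// Predicate selecting tasks of one type.
struct HasType {
    int wanted;
    __host__ __device__ bool operator()(const Task &t) const {
        return t.type == wanted;
    }
};

// Build a dense queue of all tasks of one type without atomics: copy_if is
// a scan-based stream compaction, so the pass is O(n) and preserves order.
void compactStream(const thrust::device_vector<Task> &in,
                   thrust::device_vector<Task>       &out, int type)
{
    out.resize(in.size());
    HasType pred = { type };
    auto end = thrust::copy_if(in.begin(), in.end(), out.begin(), pred);
    out.resize(end - out.begin());
}
```

With more than a couple of task types, a single sort by type (as in the event-based approach sketched earlier in the thread) may beat repeated copy_if passes.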
My stream compaction method did work, somewhat, but it became less and less worthwhile through hardware revisions: useful on CC 1.1, but not so much on 2.0.