PVM codes

Hello, I have a PVM/MPI code written in Fortran that I would like to run on my GPU as opposed to the cluster it normally runs on. Would I have to recode the entire application with CUDA-specific functions, or can I simply recompile it with a CUDA compiler?

Thanks!

You would have to recode, if not redesign, the whole approach to parallelism (message passing vs. shared memory).

I see, maybe I’m looking at the wrong thing. MPI and PVM enable the code to run in parallel while distributed across a network, right? I do not need this when running on a GPU. The code also has a built-in ability to spawn threads to make use of a multicore processor (not using PVM/MPI) via shared memory. Could I take advantage of this on a GPU (i.e., recompile with CUDA Fortran and use the built-in thread-spawning ability to use all the individual stream processors)?

Despite using terms like “thread” and “processor” in the documentation, the CUDA software and hardware models are not at all like the pthreads model on CPUs. Depending on how exactly the algorithm is structured, it might be an easy or a hard port to CUDA.

You could use your MPI code as-is on a GPU cluster or a multi-GPU machine, if you have one. Each MPI process then drives one GPU.
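The usual way to wire up that “one MPI process per GPU” pattern is a simple modulo over the ranks on a node. A minimal sketch in plain C; the function name is mine, and it assumes ranks are packed contiguously per node (in a real MPI+CUDA code you would feed the result to cudaSetDevice()):

```c
/* Hypothetical helper for the "one MPI process drives one GPU" pattern.
   Assumes MPI ranks are assigned contiguously within each node; the
   returned device index would be passed to cudaSetDevice() at startup. */
int rank_to_device(int node_local_rank, int gpus_per_node) {
    return node_local_rank % gpus_per_node;
}
```

With, say, 4 ranks on a node holding a dual-GPU card, ranks 0 and 2 land on device 0 and ranks 1 and 3 on device 1.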

Then you would rewrite the hot parts of the code to speed them up on the GPU. You could use PGI CUDA Fortran or, better yet, rewrite them in CUDA C.

quite a bit more work than ‘recompile’, eh?

Yeah, no kidding. It’s a Monte Carlo code, so what would I have to change to “speed it up” on a GPU? If the answer is really complicated, could you point me to some literature?

seibert - I see that you’re from LANL, so maybe you’d be able to answer my question more specifically. I’d like to run MCNP5 on a gpu cluster. Would that be very hard to do?

The project I’m working on now is taking an existing Monte Carlo code (not neutron transport, but similar in that there are particle walks with fixed, complex geometry).

I basically ended up using none of the CPU code and reimplemented the algorithm from scratch… sending each stage of the simulation to and from the CPU is just not practical. It (like most MC) is embarrassingly parallel, so everything got reorganized to keep tens of thousands of walks in flight at once: loading local geometry voxels as needed, repartitioning groups of walkers into coherent bundles, and keeping the bookkeeping for all the results.

The CPU code is very straightforward and deals with one walker at a time. The GPU code is considerably more complex, and literally 80% of it is information management: loading cells, tallying walk results, launching groups of fresh walkers, etc. The GPU “walk” core code is just as simple as the CPU’s, though.
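The reorganization described above boils down to this: instead of walking one particle to completion, you advance a whole batch one step per pass, so each pass is a uniform data-parallel sweep (what a kernel launch would do on the GPU). A minimal CPU-side sketch, not the poster’s code; the `Walker` struct and the 1-D fixed-step “transport” rule are invented purely for illustration:

```c
#include <stddef.h>

/* Hypothetical walker state: position along a 1-D track plus a live flag.
   A real transport code tracks 3-D position, energy, voxel id, tallies... */
typedef struct {
    double pos;
    int    alive;
} Walker;

/* CPU style: advance ONE walker until it terminates. */
double walk_one(double start, double step, double boundary) {
    double pos = start;
    while (pos < boundary)
        pos += step;                  /* one transport step */
    return pos;
}

/* GPU-style organization: advance the whole BATCH by one step per pass.
   On a GPU each loop iteration would be one thread; here it's serial.
   Returns the number of walkers still in flight. */
size_t step_batch(Walker *w, size_t n, double step, double boundary) {
    size_t in_flight = 0;
    for (size_t i = 0; i < n; ++i) {
        if (!w[i].alive)
            continue;                 /* a real code would compact/refill here */
        w[i].pos += step;
        if (w[i].pos >= boundary)
            w[i].alive = 0;           /* walker terminated: tally its result */
        else
            ++in_flight;
    }
    return in_flight;
}
```

The host then just calls `step_batch` in a loop until nothing is in flight, refilling finished slots with fresh walkers to keep occupancy high; that refill/compaction logic is exactly the bookkeeping that ends up dominating the GPU source.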

Speeds on the GPU are quite nice: roughly 40X the throughput of one i7 core.

I have found that having the CPU implementation was very useful for debugging: its results (and partial results at different stages of the walk progression) have been great for finding discrepancies (invariably due to bugs). GPU-run discrepancies in fact exposed quite a few bugs in the existing production CPU code, which everyone had thought was a robust, gold-reference computation.

What kind of GPU did you run your code on?

Baseline (for me) is using one of the two GPUs of a GTX295, versus one core of an i7 CPU. In “real” runs I usually use a machine with 2 or 3 GTX295s, so throughput is 4 or 6 times that baseline. On the CPU, in practice I run with 4 threads on a quad-core i7.

I assume if you used a dedicated computation card like a C1060, C2050, or C2070, the performance gain per GPU would be substantially greater?

I have never used MCNP5, so I can’t answer that, unfortunately. (I do neutrinos and dark matter stuff…)

A few years ago LANL completed their monster cluster built from high-end Cell processors, so I expect there is someone in this place porting MCNP to Cell. (Sadly, they might be behind a security fence for all I know, so we may never hear about it.)

Nope. The GTX295 is the fastest-per-slot card you can get, since it’s got 2 GPUs. I can partition my problem so it stays under 700 MB, so I don’t need the larger RAM of the Teslas.

The GTX480 may change that, though… for me it will be close in performance to the GTX295 (90%? 100%? 110%?). Likely it’ll be superior overall just because of the larger RAM and the potential Compute 2.0 features… plus it’s getting harder to find GTX295s in stores.