CUDA Edit for CFD Solver - NASA Project

I am working to determine the feasibility of editing a 31k-line Fortran 95 CFD solver to implement CUDA. This is the first time I've dealt with CUDA, so I could use a little help getting started. I've read the programming guide, so as far as syntax is concerned I believe I'm fairly well off. I have a few specific questions and a few general ones:

  • If I decide to convert the 'do loop' in a subroutine to a kernel subroutine, how do I handle other routines called within that loop? I'm currently looking at coding a new module just for the loop/kernel and including a "device copy" of every subroutine it calls. This is getting rather tedious, however, as I'm having to track down every used/imported variable and redefine it in device global memory, and then edit every variable defined within those subroutines to sit in either shared or global memory. Any tips would be appreciated.

  • The GPU purchased for this project is a Tesla C1060. Will I be able to use the following attributes in device code: allocatable, save, target, pointer?

  • The tree of subroutine calls originating from the loop/kernel is three nodes deep and three nodes wide at its widest and deepest, including a total of six subroutines to be called from the kernel. Is this reasonable? Should CUDA be used to parallelize even smaller chunks? For reference, the loop is contained within a subroutine which is itself part of a much larger loop containing many subroutine calls. That level of the software is distributed using MPI.

  • Regarding MPI, will I need to limit the run to one CPU core while testing with one GPU? Or can I let MPI run as usual, with each CPU process calling the kernel subroutine on its own? Ideally, would there be one GPU per CPU core?
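To make the first question concrete, here is a minimal CUDA Fortran sketch (hypothetical names, placeholder physics) of the pattern I'm attempting: the kernel and a "device copy" of one called routine live together in a new module, so the host versions are left untouched.

```fortran
module flux_kernels
  use cudafor
  implicit none
contains
  ! Device copy of a routine that was called inside the original do loop.
  attributes(device) subroutine point_flux(q, f)
    real, intent(in)  :: q
    real, intent(out) :: f
    f = 0.5 * q * q   ! placeholder physics
  end subroutine point_flux

  ! The original do loop, converted to a kernel: one thread per index i.
  attributes(global) subroutine flux_kernel(q, f, n)
    real, device   :: q(*), f(*)
    integer, value :: n
    integer :: i
    i = (blockidx%x - 1) * blockdim%x + threadidx%x
    if (i <= n) call point_flux(q(i), f(i))
  end subroutine flux_kernel
end module flux_kernels
```

The host side would then hold `q`/`f` in arrays declared with the `device` attribute and launch with the chevron syntax, e.g. `call flux_kernel<<<grid, block>>>(q_d, f_d, n)`.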
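For the MPI question, my assumption is that each rank would bind itself to a GPU at startup with something like the following sketch (untested, hypothetical names); with a single C1060 I would run one rank during kernel testing, and with one GPU per rank each process selects its own device:

```fortran
subroutine bind_rank_to_gpu()
  use mpi
  use cudafor
  implicit none
  integer :: rank, ndev, ierr
  ! Map MPI rank -> CUDA device, wrapping if there are more ranks than GPUs.
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  ierr = cudaGetDeviceCount(ndev)
  ierr = cudaSetDevice(mod(rank, ndev))
end subroutine bind_rank_to_gpu
```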

I apologize for the long post, but it's always difficult to get started with something completely new, though I feel like I've done sufficient data gathering to make helping me fairly easy for anyone willing. Thanks in advance for any help.

If you are looking at double precision, the Tesla C1060 is not going to give you much. Theoretical DP throughput is around 78 GFLOPS (or 93?)… You will hardly be 10x faster than MKL on a single core (and even if you are getting peak performance on CUDA, memcopies are going to take away some of those FLOPS).

What we do is write the code in C and call CUDA from it – very convenient for us, avoiding all the Fortran gymnastics.
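On the Fortran side, the bridge back to such a C/CUDA wrapper can be a thin ISO_C_BINDING interface – a sketch with hypothetical names, assuming the kernels sit behind an extern "C" launcher compiled by nvcc:

```fortran
module cuda_bridge
  use iso_c_binding
  implicit none
  interface
    ! Matches a C function: void launch_flux(float *q, float *f, int n);
    subroutine launch_flux(q, f, n) bind(c, name="launch_flux")
      import :: c_float, c_int
      real(c_float)         :: q(*), f(*)
      integer(c_int), value :: n
    end subroutine launch_flux
  end interface
end module cuda_bridge
```

The solver then calls `launch_flux` like any other subroutine, and all device-memory management stays on the C side.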

OR

You can purchase the CUDA Fortran compiler from PGI – at least get a trial version and evaluate it.

We're currently working with the PGI Fortran compiler, so accessing the architecture isn't a problem.