Translating FORTRAN to C++ to CUDA advice

Hi,

I’m new and planning how to convert a LARGE serial Fortran program to run on a GPU. The code is a finite difference model and, I suspect, very parallelizable. Advice on the points below would be much appreciated.

My plan is to convert the FORTRAN to C++, check that everything works in Visual Studio Express, then move to CUDA and test different methods for speedup. I saw that Portland Group have released CUDA Fortran, which sounds like it would skip the onerous step 1; however, this is (£?)800 and beyond my budget (unless the speedup is extreme, which I don’t know yet).

I have tried an (automated) OpenMP implementation with the Intel compiler on the Fortran code and have had limited success. Certainly not the speedup I was hoping for. I am certain I have got the options correct, but suspect CPU parallelisation (at least letting the compiler decide where to parallelize) is not the answer. This is why I’m looking at GPU/CUDA.

I want to develop and initially test on my ‘good last year’ gaming machine and, if initial results look promising, upgrade to a more powerful system.

When complete, I will make the code open source.

A few specific questions:

I’m willing to learn C, but would rather use C++. For the purposes of CUDA, would C++ be usable?

The program runs in double precision (required). Do I need a more specialist card?

Do you see any ‘not good plans’ in the above?

I am new to CUDA and would appreciate any advice.
Cheers
Al

Some comments:

If double precision is a must, then you’ll get a mediocre improvement, at best, with the current generation of hardware. But what you described sounds like a rather demanding task, so Fermi hardware will probably be released along the way, and good double-precision performance should be expected from it.

If you know Fortran well, then I’d suggest sticking with Fortran and going the CUDA Fortran route, using PGI compilers (these are costly indeed, but then again the man-hours for the development you described are probably going to cost much more). On the other hand, do not expect the compiler to do much automated parallelization for you: CUDA is a completely different programming model from anything else, so you’ll have to identify the bottlenecks in the existing code, and then design and implement parallelized replacements yourself.

Hi Al,

Another option is to use PGI’s directive based “Accelerator” model (see: http://www.pgroup.com/resources/accel.htm). Basically, you just add directives around the code you wish to offload to the GPU and the compiler takes care of the details.

If you have a specific question about CUDA Fortran or the PGI Accelerator model, please feel free to post on the PGI Users Forum where I’m the moderator.

Thanks,
Mat

If you understand the numerics of the code then you might want to simply consider rewriting the actual solver with OpenCurrent. http://code.google.com/p/opencurrent/

How large is large - does the code currently use MPI?

Thanks all for the advice. I’ve been reading around and am thinking:

CGORAC - yes, double precision is a must. I’m going to get a mid-range NVIDIA card for the moment and, if successful, will upgrade when Fermi is released. I will take your advice on staying with Fortran, thanks.
mkcolg - thanks for the heads up. I’ve been looking through the PGI website (the demo videos were very useful) and I think Accelerator is likely my best option, cost dependent. Will email sales.
eelson - again thanks. OpenCurrent looks ideal, however I need to catch up on the maths of the solver I’m using at the moment. I would like to come back to OpenCurrent, but for the moment plan 1 will be to implement CUDA as is.

To answer general questions:
The code size is 10,000 or so lines (the solver is probably only a few hundred, and is where the main gains can be made). Model sizes are typically 650 x 650 x 11 nodes, with ~1,000,000 iterations for the whole runtime.

So conclusion on options:

CUDA C - conversion from FORTRAN to C, plus learning the low-level syntax etc. of CUDA, will be a lot of work, but is possibly the most rewarding with respect to speedup, and the cheapest.
PGI CUDA Fortran - £(?)899 is too much to risk given the uncertainty over whether it will work. If it’s £(?)250 for a 1 year license I can afford this, and it sounds like a good option.
PGI Accelerator (CUDA Fortran) - I’ve used OpenMP directives and loved the simplicity. OpenMP itself was only partly successful, however, as I got a 50% speedup on CPU only. Accelerator sounds far more promising. Does this come with the above (i.e. a license for £250)? I will email PGI sales.
OpenCurrent - looks like it was made exactly for this application. I understand the preconditioned conjugate gradient algorithm but not the maths behind it. OpenCurrent sounds like it may be beyond me at the moment, but I will look at it again after CUDA.
Intel Ct - for C and C++, appears to be for CPU only, and in beta?
Intel Parallel Studio - $799, contains Composer, Inspector and Amplifier ($366 each), for C and C++; again looks like CPU only. I don’t think these options are for me.

Does the PGI Accelerator decision seem wise? (This will be my first time with GPU ‘programming’.)

A concern I have from cgorac’s post: do you think I will NOT get a speed improvement by moving processing to the GPU if the calcs are in double precision?

Thanks all
Al

Be aware that the range of “mid range” cards with double precision capability is limited - only the GTX260/GTX275/GTX280/GTX285/GTX295 amongst the consumer cards support double precision, and the peak performance of all the current NVIDIA GPUs is rather modest in double precision (roughly 80 Gflops). That is only about twice what a reasonable quad core desktop CPU yields in double precision - in BLAS3 dgemm() I used to get about 32 Gflops peak from a Q6600, about 38 Gflops from a Q9550, and about 42 Gflops from a Phenom II 945 (all at stock clock). If you are looking for a spectacular double precision speedup, you are going to have to work a lot harder to get it than you would with a single precision code.

Thanks avidday

Good to know, and you have helped make up my mind that the GTX275 is the best card I can get with my personal R&D budget.

Much of the program can be converted to single precision, but the solver will need to remain double. I’ll convert as much to single as is numerically sensible and check I haven’t damaged the results before doing the CUDA!

Cheers
Al

Sorry for not having PGI Accelerator mentioned in my first post - I don’t like OpenMP, so while I tried Accelerator and it seems indeed able to provide very good performance improvements, it is just not listed in my book; but great to have mkcolg from PGI hanging around, and indeed if you are used to coding in OpenMP, using Accelerator could be your best bet.

As for double precision being a must: you may wish to search for papers from Prof. Jack Dongarra and others about using “mixed” precision, which means using single precision for most of the calculations, and then using double precision only for the final refinement. I haven’t tried these techniques myself, and I don’t know whether they are appropriate for your problem at hand, but I remember nice improvements reported in these papers.
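To make the mixed-precision idea a bit more concrete, here is a minimal sketch of iterative refinement on a small dense system. It is only an illustration under my own assumptions, not the method from any particular paper: the names `solve_in`, `residual`, `max_abs` and `refine` are hypothetical, the solver is plain Gaussian elimination, and a real code would factor the matrix once and reuse it rather than re-solving on every sweep. The point it shows is the split: the expensive solves run in `float`, while only the residual and the accumulated solution are kept in `double`.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Solve A x = b by Gaussian elimination with partial pivoting,
// carrying out all arithmetic in type T (float = the fast, low precision).
template <typename T>
std::vector<double> solve_in(const Matrix& A, const std::vector<double>& b) {
    const std::size_t n = b.size();
    assert(A.size() == n);
    std::vector<std::vector<T>> m(n, std::vector<T>(n + 1));
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) m[i][j] = static_cast<T>(A[i][j]);
        m[i][n] = static_cast<T>(b[i]);  // augmented column holds b
    }
    for (std::size_t k = 0; k < n; ++k) {
        std::size_t p = k;  // row with the largest pivot candidate
        for (std::size_t i = k + 1; i < n; ++i)
            if (std::fabs(m[i][k]) > std::fabs(m[p][k])) p = i;
        std::swap(m[k], m[p]);
        for (std::size_t i = k + 1; i < n; ++i) {
            const T f = m[i][k] / m[k][k];
            for (std::size_t j = k; j <= n; ++j) m[i][j] -= f * m[k][j];
        }
    }
    std::vector<T> y(n);
    for (std::size_t i = n; i-- > 0;) {  // back substitution
        T s = m[i][n];
        for (std::size_t j = i + 1; j < n; ++j) s -= m[i][j] * y[j];
        y[i] = s / m[i][i];
    }
    return std::vector<double>(y.begin(), y.end());
}

// r = b - A x, evaluated entirely in double precision.
std::vector<double> residual(const Matrix& A, const std::vector<double>& x,
                             const std::vector<double>& b) {
    std::vector<double> r(b);
    for (std::size_t i = 0; i < b.size(); ++i)
        for (std::size_t j = 0; j < x.size(); ++j) r[i] -= A[i][j] * x[j];
    return r;
}

// Largest absolute component of a vector.
double max_abs(const std::vector<double>& v) {
    double m = 0.0;
    for (double e : v) m = std::fmax(m, std::fabs(e));
    return m;
}

// Single-precision solve, then a few double-precision correction sweeps.
std::vector<double> refine(const Matrix& A, const std::vector<double>& b, int sweeps) {
    std::vector<double> x = solve_in<float>(A, b);
    for (int s = 0; s < sweeps; ++s) {
        const std::vector<double> r = residual(A, x, b);      // double
        const std::vector<double> d = solve_in<float>(A, r);  // float
        for (std::size_t i = 0; i < x.size(); ++i) x[i] += d[i];
    }
    return x;
}
```

Whether this converges depends on how well conditioned the matrix is, so it would need checking against the real solver before relying on it.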

Why do so many people bitch about paying for stuff? It costs money to create things like CUDA Fortran, and the piper has to be paid. You get what you pay for!

Pay up and get on with the acceleration/conversion!!

MMB

For several PDE solvers, the limiting factor is memory bandwidth, not double precision performance.
Even if the GT200 has a double precision peak that is only 2x the CPU one, it has between 4x (against Nehalem) and 10x (against Harpertown) more memory bandwidth.
It is not unusual to have an overall speedup of 5x-10x for PDE solvers in double precision.
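The bandwidth argument can be put into numbers with a simple "roofline" style bound: a loop can run no faster than the arithmetic peak, and no faster than memory bandwidth times the flops it does per byte moved. The sketch below is only a back-of-envelope model. The function names are hypothetical, and the inputs in the usage note (8 flops and 16 bytes per grid point, 140 GB/s for the GPU, one tenth of that for the CPU) are illustrative assumptions rather than measurements; only the 80 and roughly 40 Gflops peaks come from earlier in this thread.

```cpp
#include <algorithm>
#include <cassert>

// Upper bound on achieved GFLOP/s: the kernel is limited either by the
// arithmetic peak or by how fast memory can feed it.
double attainable_gflops(double peak_gflops, double bandwidth_gb_s,
                         double flops_per_byte) {
    assert(peak_gflops > 0.0 && bandwidth_gb_s > 0.0 && flops_per_byte > 0.0);
    return std::min(peak_gflops, bandwidth_gb_s * flops_per_byte);
}

// Ratio of the two bounds for the same kernel on two machines.
double bound_ratio(double gpu_peak, double gpu_bw, double cpu_peak,
                   double cpu_bw, double flops_per_byte) {
    return attainable_gflops(gpu_peak, gpu_bw, flops_per_byte) /
           attainable_gflops(cpu_peak, cpu_bw, flops_per_byte);
}
```

With an assumed intensity of 0.5 flops per byte, `bound_ratio(80.0, 140.0, 40.0, 14.0, 0.5)` is limited by bandwidth on both machines, so the ratio follows the 10x bandwidth gap rather than the 2x gap in peak flops.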

I like CUDA Fortran a lot, it is more elegant than CUDA C and is the best tool to add GPU acceleration to CPU Fortran code IMHO.

Where, exactly, did you find me “bitching about paying for stuff”, so that you found my message deserved quoting before making the comment above? For the record: I’m perfectly happy with paying for software, especially for high quality software like PGI compilers. But on the other hand, it is a simple fact that at $700 the CUDA Fortran tool-chain is costly (well: probably not for all, but certainly for many people - for example, the OP himself mentioned in his first message that this is beyond his budget) when compared to the CUDA C tool-chain ($0, at least on Linux).

Sure, it sounds like that… D=H allocates and sends array H on the host to device array D, what more can you wish for. I’m a native Fortran programmer. But what do you think about the fact that there are now so many wonderful software tools in C++: parallel primitives libraries, Thrust, not just FFT/BLAS? I guess there’s nothing comparable in Fortran?

Also, I still remember the pain of having to deal with supposedly wonderful paid-for compilers which suddenly turn out somehow not able to link to great software configured for GNU compilers, or RPMs compiled with those.

Pawel

PS. Alastair:

I still somewhat hesitate… I will port serious hydro to CUDA one way or another soon. I won’t use OpenCurrent, as I seem to need more advanced numerical methods (if I’m correct, OpenCurrent implements standard second-order methods; I effectively need higher-order Godunov, Riemann solvers and so on, which by doing a lot of subgrid physics are compute intensive (good) and give way better resolution in the end). Graphics hardware has little RAM, and will continue to for some time, on consumer cards at least, so for 3-D applications where CUDA is a necessity, we can’t beat the resolution problem by just declaring finer grids and letting CUDA work fast on them.

If implemented, multigrids might help a bit, though adaptive meshes generate junk at resolution jumps (I work on problems where those kinds of errors will flow back through the computational domain, come back and amplify: rotating gas flows).

Alastair,

From my personal experience, you will be disappointed if you do not understand your code well. I would suggest that, before deciding on any action, you profile your code to find the hotspots worth optimizing, and have a close look at whether those loops are compute-bound or memory-bound. Also try to find out how much of the peak performance you currently get in these loops.
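One rough way to find out how much of the peak a loop achieves is to time it and divide the flops and bytes it touches by the elapsed time. The sketch below does this for a stand-in loop. Everything in it is hypothetical (the `LoopStats` struct, the `time_update_loop` function, and the toy update itself), and the flop and byte counts per element are hand counts for this particular loop, so a real hot loop would need its own counts. Comparing the two rates against the machine’s peak flops and memory bandwidth gives a hint as to which limit the loop is closer to.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

struct LoopStats {
    double seconds;   // wall-clock time for all repeats
    double gb_per_s;  // achieved memory traffic, 0 if too fast to time
    double gflops;    // achieved arithmetic rate, 0 if too fast to time
    double checksum;  // using the result keeps the loop from being discarded
};

// Time a simple array update and convert the elapsed time into rates.
LoopStats time_update_loop(std::size_t n, int repeats) {
    assert(n > 0 && repeats > 0);
    std::vector<double> a(n, 1.0), b(n, 2.0);
    const auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < repeats; ++r)
        for (std::size_t i = 0; i < n; ++i)
            a[i] += 0.5 * b[i];  // 2 flops; reads a[i], b[i], writes a[i]: 24 bytes
    const auto t1 = std::chrono::steady_clock::now();
    const double s = std::chrono::duration<double>(t1 - t0).count();
    const double work = static_cast<double>(n) * repeats;
    LoopStats out{s, 0.0, 0.0, a[0] + a[n - 1]};
    if (s > 0.0) {
        out.gb_per_s = 24.0 * work / s / 1e9;
        out.gflops = 2.0 * work / s / 1e9;
    }
    return out;
}
```

For meaningful numbers the arrays should be much larger than the CPU caches and the run long enough to time reliably; a proper profiler will still give a fuller picture.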

Another piece of advice for gaining a better understanding of the code is to spend some more time on OpenMP directive-based parallelization.
That your quad core only achieves a 50% speedup seems to indicate that in its current state the code does not parallelize well and would not gain much from a direct translation to GPU code. Was this speedup achieved with Intel’s auto-parallelization? That would match my (some years old) experience and would not tell us much about the code.
If, on the other hand, the 50% comes from full OpenMP directive-based parallelization, then the code definitely needs improvement. If it can’t even use four threads, it will not be able to gain from the massive number of threads on the GPU (think Amdahl’s law here). The simpler nature of each CUDA core will likely make your program run even slower on the GPU than on the CPU, unless you really understand what is going on.
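The Amdahl’s law point can be checked with a couple of lines of arithmetic. The sketch below assumes that “50% speedup” means the code ran 1.5x faster on 4 cores, and that the law’s simple model (a fixed serial fraction, perfect scaling of the rest) applies; both are assumptions, and the function names are hypothetical.

```cpp
#include <cassert>

// Amdahl's law: speedup on n workers when a fraction p of the runtime
// parallelizes perfectly and the rest stays serial.
double amdahl_speedup(double p, double n) {
    assert(p >= 0.0 && p <= 1.0 && n >= 1.0);
    return 1.0 / ((1.0 - p) + p / n);
}

// Inverse: the parallel fraction implied by an observed speedup s on n workers.
double parallel_fraction(double s, double n) {
    assert(s >= 1.0 && n > 1.0);
    return (1.0 - 1.0 / s) / (1.0 - 1.0 / n);
}
```

Under those assumptions, `parallel_fraction(1.5, 4.0)` comes out at 4/9, so only about 44% of the runtime is running in parallel, and `amdahl_speedup` stays below 1.8 however many threads are added. Until the serial part shrinks, extra GPU threads cannot help much.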

In summary, I would suggest profiling your code and OpenMP-parallelizing the hot spots first. Only then, I believe, will you have learned enough about your code to make the right decision on whether to use CUDA.

Tera,

Sage advice thanks.

A mate profiled the code and found all the time was being spent in two areas (the solver and a module with simple looped calculations). We used directive-based OpenMP first with limited success, and then auto (similar result). I entirely take your point and will re-examine these areas to see if memory dependence, IO etc. is causing the bottleneck.

That said, I do still suspect the program could lend itself to CUDA, after some tweaking for the above.

I’ll post again after I have reprofiled and had a think about the code layout.

Cheers

Al

OpenCurrent implements finite difference methods (up to 2nd order currently, I think) whereas Godunov and Riemann methods are finite volume methods. They are typically only first order though, in that the solution inside each cell is a single value and the global solution is piecewise constant. Using sub-grid physics doesn’t make them higher order.

That sounds promising. Hope you will easily find what keeps the code from parallelizing well.

Hi

Fortran + C
What about using mixed-language programming?

I have the same problem: I need to convert a lot of FORTRAN code to CUDA C. However, this is not the best way. The CUDA-enabled PGI Fortran compiler may be a solution, but it costs a lot.

My suggestion is to mix Fortran and C. Most of your Fortran code stays the same; what you need to do is choose a small part of the code (a subroutine) and write it in CUDA C.

Use a profiling tool like gprof to find the most time-consuming code and write this part in CUDA C.

Use a Fortran compiler such as the Intel Fortran compiler to compile the Fortran code into object files. Use nvcc to compile the .cu file into object files. Then link these object files into an executable program.
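As a rough sketch of what the C side of such a mixed-language setup might look like, the function below has the shape a Fortran caller needs: C linkage, and every argument passed by reference. The routine name `scale_add` and its arguments are hypothetical, and a plain CPU loop stands in for the device allocation, copies and kernel launch that a real .cu file would contain. The Fortran interface in the comment assumes a compiler that supports the Fortran 2003 ISO C binding; without it, the symbol name the Fortran compiler expects (often lowercase with a trailing underscore) is compiler-dependent and should be checked.

```cpp
#include <cassert>

// Fortran passes arguments by reference by default, so every parameter
// is a pointer. extern "C" switches off C++ name mangling so the symbol
// can be found when the Fortran object files are linked in.
//
// Matching Fortran 2003 interface (assumed, not taken from any real code):
//   interface
//     subroutine scale_add(n, alpha, x, y) bind(C, name="scale_add")
//       use iso_c_binding
//       integer(c_int), intent(in) :: n
//       real(c_double), intent(in) :: alpha, x(*)
//       real(c_double), intent(inout) :: y(*)
//     end subroutine scale_add
//   end interface
extern "C" void scale_add(const int* n, const double* alpha,
                          const double* x, double* y) {
    assert(n && alpha && x && y);
    // Stand-in for the GPU work: y = y + alpha * x.
    // In a .cu file this is where the host-to-device copies, the kernel
    // launch and the copy back to the host would go.
    for (int i = 0; i < *n; ++i) y[i] += (*alpha) * x[i];
}
```

Getting this CPU version to link and give the right answer from Fortran first would separate the interop problems from the CUDA ones.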

I have tested this approach on my team’s GPU workstation, and it works.

Someone claims that using g95 and nvcc together also works; however, I have not tested it.

Mixed-language programming may not be the best way, but it is a really good option.


I have been doing this for > 5 months now. I have created a wrapper using the ISO C binding feature of Fortran 2003 to make direct calls to CUDA functions like cudaMemcpy etc.

I want to release this small library if anyone wants to use it, but I am having problems writing a wrapper for the CUDA stream functions and asynchronous calls. Maybe you can help me here. I am more experienced in FORTRAN than C, hence these problems.

Thanks,

Nitin

Nobody has yet mentioned flagon. I haven’t had any need to do Fortran interop with CUDA for a while, but it seemed to work just fine with the Intel Fortran compiler when I tried it circa CUDA 2.1.

All interesting and promising sounding stuff. Sorry for the delay in getting back; I’m still profiling and amending the FORTRAN to make it more suitable for parallelization (at least with OpenMP). The GPU should be arriving in a few days and I’m hoping to start next week…

Also, the mixed-language route does sound possible, although I haven’t done this before. Will look into it closer to the time…

Cheers
Al