Survey on the PGI Fortran compiler. Thanks!

Hi, everyone!

I want to buy the PGI Fortran compiler to accelerate my code with CUDA, since in my field we always program in Fortran rather than C.

So I would like to know how the acceleration achieved with the PGI Fortran compiler compares to CUDA C.

Has anyone here used PGI Fortran who can share some acceleration examples?

Thanks very much for your kind answer!

---- a student from China

You can get a 30-day free trial from PGI. Why not take advantage of it, and see for yourself? Let us know what you find, please!

MMB


I use PGI’s CUDA Fortran in my work and I can share a bit. First, I’ll readily admit I’m not much of a CUDA C programmer. I’ve done a bit with it (the usual matmul examples and all) and can read it, but I have been and always shall be a Fortran programmer at heart, mainly because my graduate work, postdoctoral work, and current work all deal with codes ranging from FORTRAN 77 to Fortran 2003.

But, that said, I find programming in CUDA Fortran quite nice. If you come from a Fortran background and are used to arrays rather than pointers, and to the use of MODULEs, I think you’ll like it too. If nothing else, it is simply easier to read. For example, say you have a device array dev_work(1024) and a host array host_work(1024): instead of calling cudaMemcpy(dev_work, host_work, 1024), you can just write dev_work = host_work. The same goes for moving values to constant memory. (Note, though, that you can still use the cudaMemcpy-style API calls if you want; most of them are supported, e.g. cudaMemcpy2DAsync.)
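To make that concrete, here is a minimal sketch of the assignment style (the names and the size 1024 are just the example above, not code from a real application):

program copy_demo
  use cudafor
  implicit none
  real, device :: dev_work(1024)   ! array in GPU global memory
  real :: host_work(1024)          ! ordinary host array

  host_work = 1.0
  dev_work = host_work             ! host-to-device copy, no cudaMemcpy call
  host_work = dev_work             ! device-to-host copy back
end program copy_demo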

Likewise, allocating memory on the device is as simple as using allocate() instead of cudaMalloc, and deallocate() rather than cudaFree. And depending on how you declared your host variable, allocating on the host will give you pinned/page-locked memory as well:

real, pinned, allocatable, dimension(:) :: host_work

is allocated as pinned. (Again, though, if you need/want to use cudaMallocPitch, you can!)
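A short sketch of that allocation style, reusing the pinned declaration above (the names are illustrative):

real, device, allocatable, dimension(:) :: dev_work
real, pinned, allocatable, dimension(:) :: host_work

allocate(dev_work(1024))    ! device memory, in place of cudaMalloc
allocate(host_work(1024))   ! pinned (page-locked) host memory
! ... transfers and kernel launches go here ...
deallocate(host_work)
deallocate(dev_work)        ! in place of cudaFree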

Indeed, most of what you want to do in CUDA C, you can do in CUDA Fortran. A few things like Textures haven’t been implemented yet, but most of the language is there, and if you find something that isn’t, PGI is usually pretty good at getting it implemented in a future release.

I’ll also say that I’ve done some programming with PGI’s Accelerator pragmas (think OpenMP-style pragmas, but for GPUs), and have found them to be quite good. More recently, the Accelerator model can identify and correctly generate code for things like reductions, which are a bit tricky to program by hand in CUDA C/Fortran.
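As a rough sketch of what an Accelerator region looks like, here is a sum reduction in the !$acc region syntax (the array and names are mine, purely illustrative):

real :: a(100000), total
integer :: i

total = 0.0
!$acc region
do i = 1, 100000
   total = total + a(i)*a(i)   ! the compiler detects this sum reduction
end do
!$acc end region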

For more on what programming with PGI compilers and GPUs can look like, I recommend reading some of the PGI Insider articles. For example, this article on porting WRF shows that the pragma model can essentially match hand-tuned CUDA C code, with what I’m sure was much less effort, and that was with a compiler almost a year old.


Hi,

I have been using the PGI Accelerator compiler rather than CUDA Fortran, but I thought I would add my thoughts. I must admit I haven’t used CUDA itself much, so I can’t really draw any comparisons there.

Aside from sometimes having to state explicitly what you want done with arrays, and taking care of the loop scheduling (usually the compiler will pick the optimal option for you), there is not much to be concerned about as a step up from regular Fortran, and those issues are common to all accelerator languages; the sketch after this paragraph shows what those clauses look like.
I also find the option to easily remove the accelerator pragmas useful for debugging, but I am not sure what the situation is with other options.
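For instance, a minimal sketch of both points in the PGI Accelerator directive syntax, with explicit data clauses and a loop schedule (the names a, b, and n are illustrative, not from my actual code):

!$acc region copyin(a(1:n)) copyout(b(1:n))
!$acc do parallel, vector(256)
do i = 1, n
   b(i) = 2.0*a(i)
end do
!$acc end region

Compile without the accelerator target (e.g. leave off -ta=nvidia) and the directives are simply ignored, which is what makes the fall-back-to-plain-Fortran debugging path so painless.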

I can’t say anything definitive about the speedup, as I don’t have a CUDA version of the code to compare against. However, the results compared to a similar software package are very promising so far. That, combined with the ease of sticking to a language I know, makes this the best choice for me.

Also, as TheMatt said, PGI are very good at implementing feature requests and also provide excellent support.

I hope this is of some help,

Karl


The PGI Accelerator compiler is only suitable for simple loops.

You are not supposed to have function calls inside the loop.

The loops can’t be deeply nested or have loop-carried dependences.

You can’t use pinned memory, atomic operations, or streams.

Basically, you have no control over anything.