General question about PGI Visual Fortran

Hi

I just heard about PGI and would like to know whether PGI Visual Fortran would give a good speedup for our commercial software.

FYI, our software uses the finite element method to solve engineering problems. In general, the size of the matrix to be solved ranges from a few hundred to more than ten thousand. We also have an outer loop that repeats the same procedure anywhere from a few to a few thousand times, depending on the inputs.

We are currently using the Intel Fortran compiler. Some time ago we applied OpenMP to the outer loop, which cuts the computation time to 1/N of the serial time (N being the number of cores used). We are quite happy with OpenMP.

Now, reading your description of PGI Visual Fortran, it seems that we can speed the code up even more if an accelerator is attached to the host CPU. We are definitely very interested in this and have the following questions:
(1) We use LAPACK in our code. Does PGI Visual Fortran support LAPACK?
(2) About your statement "PGI Unified Binary™ technology provides the ability to generate a single executable file with code sequences optimized for multiple AMD, Intel and NVIDIA processors." Do you mean that we can compile the code into one executable for different end users, regardless of whether they have AMD, Intel or NVIDIA accelerators? How does performance compare across the different accelerators when using the same executable?
(3) Is switching from Intel Fortran 2012 to PGI Visual Fortran straightforward?
(4) As I mentioned before, most of the CPU time is spent setting up and solving a large matrix, and repeating that process over and over again. Therefore, we were able to put OpenMP directives on the outermost loop (the part that repeats the process), which is the most efficient approach. However, depending on the inputs, the number of repetitions might be anywhere from a few to a few thousand. For OpenMP this is fine, since the number of cores is limited, fewer than 10 in most cases. For a graphics chip, however, the number of cores can be a few hundred. If the outermost loop runs for only a few iterations, will I still see the expected speedup, or do I need to carefully modify the code?
(5) About memory. Our code requires a lot of memory. How is the memory required for the computation allocated between the host and the GPU?

Thank you so much.

Hi Woods,

(1) We use LAPACK in our code. Does PGI Visual Fortran support LAPACK?

Certainly. We ship a basic LAPACK library (a direct build of the netlib code) as well as ACML. You can also use IMSL or MKL, though those are purchased separately. And you are more than welcome to build your own version, since external libraries can be linked. Granted, that's not as convenient, but it is certainly supported.
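As a hedged example (exact library names can vary by release), linking against the bundled netlib LAPACK build from the command line looks something like:

pgfortran myapp.f90 -o myapp -llapack -lblas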

(2) About your statement "PGI Unified Binary™ technology provides the ability to generate a single executable file with code sequences optimized for multiple AMD, Intel and NVIDIA processors." Do you mean that we can compile the code into one executable for different end users, regardless of whether they have AMD, Intel or NVIDIA accelerators? How does performance compare across the different accelerators when using the same executable?

Yes, that is the idea. For the host side, you would use the "-tp" (target processor) flag to define a list of CPU architectures. When the compiler optimizes the code and finds spots where a section would benefit from different optimizations for different targets (including differing hardware instructions), multiple versions of that routine are generated. At run time, the most appropriate path for the given target is executed.

On the accelerator side, the "-ta" (target accelerator) flag performs a similar function, where multiple versions of each OpenACC compute and data region are created, one per target. However, we currently only support the "host" (CPU) and "nvidia" targets. We should have AMD support, starting with a beta product, in the near future, and support for the Intel Xeon Phi some time after that.
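For example (a hedged sketch; the exact sub-options depend on your PGI release), a command line along these lines builds one executable with optimized code paths for both AMD and Intel x86-64 CPUs ("-tp=x64") plus an NVIDIA GPU version and a host fallback version of each OpenACC region ("-ta=nvidia,host"):

pgfortran -fast -tp=x64 -ta=nvidia,host myapp.f90 -o myapp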

(3) Is switching from Intel Fortran 2012 to PGI Visual Fortran straightforward?

The code itself typically isn't a problem, so long as you're not relying on any extensions, Intel-specific libraries, or system routines. You will need to recreate your project environment, which may take some work depending on its complexity.

(4) As I mentioned before, most of the CPU time is spent setting up and solving a large matrix, and repeating that process over and over again. Therefore, we were able to put OpenMP directives on the outermost loop (the part that repeats the process), which is the most efficient approach. However, depending on the inputs, the number of repetitions might be anywhere from a few to a few thousand. For OpenMP this is fine, since the number of cores is limited, fewer than 10 in most cases. For a graphics chip, however, the number of cores can be a few hundred. If the outermost loop runs for only a few iterations, will I still see the expected speedup, or do I need to carefully modify the code?

For the accelerator, it's more about the size of the matrix than the number of iterations. Small matrices see little speed-up, while larger ones do. The more work that can be put on the accelerator, the greater the speed-up.

For the outer loop, you'll want to move the data over to the accelerator before entering it. The data movement cost then becomes fixed, so it can be amortized over the number of iterations. So yes, a smaller number of iterations means a higher data movement cost per iteration, but one copy really doesn't cost much; data movement only starts to be a problem when you're copying data on every iteration.
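As a hedged sketch (the arrays A, B, and X are hypothetical stand-ins for your real data), an OpenACC data region around the outer loop keeps the arrays resident on the device so they are copied only once:

!$acc data copyin(A, B) copyout(X)
DO I = 1, MAXP
!$acc kernels
   DO J = 1, N
      X(J) = A(J,I) * B(J)   ! stand-in for the real per-iteration work
   ENDDO
!$acc end kernels
ENDDO
!$acc end data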

(5) About memory. Our code requires a lot of memory. How is the memory required for the computation allocated between the host and the GPU?

If you have an individual array greater than 2GB, then you'll need to add the flag "-Mlarge_arrays". Other than that, you're only limited by the amount of memory on the device. Last I looked, NVIDIA was up to 6GB, but they may have increased that. If you need more memory than what's available on the device, you'll either need to split your problem so that only parts are computed at a time, or split the problem across multiple accelerators using OpenMP or MPI. OpenMP is a bit trickier, since it assumes a shared memory system while accelerators have discrete memories, so you need to manually decompose your problem across the accelerators. It's not particularly difficult, just not natural OpenMP. Because of this, I prefer using MPI, since it assumes discrete memory and the domain decomposition is inherent in how you program.
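To give the flavor of the OpenMP approach, here is a hedged sketch (the array A and its sizes are hypothetical) in which each OpenMP thread binds itself to its own GPU and works only on its own block of the data:

PROGRAM MULTI_GPU_SKETCH
   USE openacc
   USE omp_lib
   IMPLICIT NONE
   INTEGER, PARAMETER :: N = 100000, NGPU = 2  ! hypothetical sizes
   REAL(8) :: A(N, NGPU)
   INTEGER :: TID, J
   A = 1.0D0
!$omp parallel num_threads(NGPU) private(TID, J)
   TID = OMP_GET_THREAD_NUM()
   ! bind this OpenMP thread to its own GPU (0-based device numbers, as with CUDA)
   CALL ACC_SET_DEVICE_NUM(TID, ACC_DEVICE_NVIDIA)
   ! each thread computes only its own column block of A on its own device
!$acc parallel loop copy(A(:, TID+1))
   DO J = 1, N
      A(J, TID+1) = 2.0D0 * A(J, TID+1)
   ENDDO
!$omp end parallel
END PROGRAM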

I also wrote an article on using multiple accelerators with MPI and OpenACC; it's Linux-based, but you may still find it useful.

  • Mat

Mat, thank you so much for your quick reply. I think I was not clear enough about my question (4), and I do not quite understand your answer.

Our Fortran code looks like this:

DO I = 1, MAXP
   a) evaluate every component of the NxN matrix and the N vector
   b) assemble the left- and right-hand sides of the NxN system
   c) solve the NxN system
ENDDO

Inside the DO loop there are many loops over NxN or N, which require a lot of memory and computational time. MAXP ranges from 10 to a few thousand. The computational time for each outer-loop iteration is about the same, and the computation of each outer iteration is independent of the others. So we applied OpenMP to the outer loop, which gives the performance we expected with almost no overhead. In general, a common CPU has fewer than 10 cores, so all CPU cores can be utilized.
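Roughly, our OpenMP usage looks like the following (simplified; WORK is a hypothetical per-thread scratch array):

!$omp parallel do private(WORK)
DO I = 1, MAXP
   ! steps a), b) and c) for iteration I, using scratch array WORK
ENDDO
!$omp end parallel do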

We do not know anything about PGI Visual Fortran. My understanding is that if MAXP is larger than the number of cores in the GPU, then each core can perform the computation for one iteration, similar to OpenMP. Is this right? Another question: in the case where MAXP = 10, N = 100000, and the GPU has 100 cores, how can all 100 cores in the GPU be utilized? I mean, will the computation of each iteration be carried out on 10 cores, so that all 100 cores are used to accelerate the code?
If so, how difficult is it to implement this?

My understanding is that if MAXP is larger than the number of cores in the GPU, then each core can perform the computation for one iteration, similar to OpenMP. Is this right?

You shouldn’t think of the GPU in terms of OpenMP, i.e., dividing the problem and mapping each piece onto a single core. At least for NVIDIA GPUs, you should be thinking in terms of threads, warps, and blocks. The limiter is the number of possible threads, but since this can get into the billions on the newer cards, I doubt it will be a problem.

In some cases the compiler may strip-mine the problem and have each thread perform multiple iterations of I; in other cases it may be better to have each thread perform one iteration. It depends on the data access pattern.

You might find the "5x in 5 Hours" article on the PGI website helpful. It's a few years old, but still relevant.

Another question: in the case where MAXP = 10, N = 100000, and the GPU has 100 cores, how can all 100 cores in the GPU be utilized?

Again, don’t think in terms of cores; think threads, and we want lots and lots of threads. Hence, in the typical case where one iteration maps to one thread, accelerating only the outer loop when MAXP is small would lead to poor performance. I see three possible options.

First, only accelerate the inner loops. As long as you have a data region that spans the multiple compute regions, you can share the data between them without copying it back and forth.

Second, make the outer loop use a “gang” schedule (which corresponds to a CUDA block) and the inner loop use a “vector” schedule (which maps to CUDA threads).
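A hedged sketch of this option (the arrays X, A, and B are hypothetical stand-ins for your real work):

!$acc parallel loop gang
DO I = 1, MAXP        ! each outer iteration maps to a gang (CUDA block)
!$acc loop vector
   DO J = 1, N        ! inner iterations map to vector lanes (CUDA threads)
      X(J,I) = A(J,I) * B(J)
   ENDDO
ENDDO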

The best option, though, would be to collapse the inner and outer loops so the total number of iterations is 10x100000. However, collapsing requires tightly nested loops, so what the compiler may do instead is split the problem into multiple outer/inner loop nests, where each nest becomes a kernel launch:

!$acc kernels
DO I = 1, MAXP
   a) evaluate every component of the NxN matrix and the N vector
ENDDO
DO I = 1, MAXP
   b) assemble the left- and right-hand sides of the NxN system
ENDDO
DO I = 1, MAXP
   c) solve the NxN system
ENDDO
!$acc end kernels

(This transformation would be done by the compiler; your code would remain unchanged.)
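When the loops are tightly nested, the collapse clause expresses this directly. A hedged sketch, again with hypothetical arrays:

!$acc parallel loop collapse(2)
DO I = 1, MAXP
   DO J = 1, N
      X(J,I) = A(J,I) * B(J)   ! MAXP*N iterations spread across GPU threads
   ENDDO
ENDDO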

Without the actual code, I can’t tell you which option is best, nor which options are even available (you might have a dependency that precludes splitting the loops). Whether to maximize the number of threads or to give each thread more work (small versus big kernels) all depends on the code, and you may not know what’s best until you try multiple options.

If so, how difficult is it to implement this?

If your code is already in a form that’s data parallel, then it’s very straightforward. Typically the user’s time is spent optimizing data movement and experimenting with schedules. The "5x in 5 Hours" article I referred to in my last post is a good step-by-step guide to porting an already-parallel code.

If your code is not already data parallel, then much of your time will be spent rewriting the algorithm to make it parallel. The compiler gives good feedback about why it can’t parallelize code, which is helpful during this process.
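For instance, compiling with PGI’s "-Minfo=accel" flag:

pgfortran -acc -Minfo=accel myapp.f90

prints the compiler’s parallelization decisions for each loop nest, including the reasons when a loop cannot be parallelized.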

  • Mat