I just heard about PGI and would like to know whether PGI Visual Fortran would give a good speedup for our commercial software.
FYI, our software uses the finite element method to solve engineering problems. In general, the size of the matrices to be solved ranges
from a few hundred to more than ten thousand. We also have an outer loop that repeats the same procedure anywhere from a few to a few thousand times, depending on the inputs.
We are currently using the Intel Fortran compiler. Some time ago we implemented OpenMP on the outer loop, which reduces the computation time to roughly 1/N (where N is the number of cores used). We are quite happy with OpenMP.
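For reference, here is a minimal, self-contained sketch of the pattern we use (the random data and the DGESV call just stand in for our actual FEM assembly and solver, which are of course more involved):

    program outer_demo
      implicit none
      integer, parameter :: ncase = 8, n = 500
      integer :: i, info
      integer, allocatable :: ipiv(:)
      real(8), allocatable :: a(:,:), b(:)

      ! Each outer iteration assembles and solves one independent
      ! system; random data stands in for our real FEM assembly.
      !$omp parallel do private(a, b, ipiv, info)
      do i = 1, ncase
         allocate(a(n,n), b(n), ipiv(n))
         call random_number(a)
         call random_number(b)
         call dgesv(n, 1, a, n, ipiv, b, n, info)   ! LAPACK solve
         deallocate(a, b, ipiv)
      end do
      !$omp end parallel do
    end program outer_demo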
Now, reading your description of PGI Visual Fortran, it seems that
we could speed the code up even more if an accelerator is attached
to the host CPU. We are definitely very interested in this and have the following questions:
(1) We use LAPACK in our code. Does PGI Visual Fortran support LAPACK?
(2) Regarding your statement "PGI Unified Binary™ technology provides the ability to generate a single executable file with code sequences optimized for multiple AMD, Intel and NVIDIA processors": do you mean that we can compile the code into one executable and ship it to different end users, no matter whether they have AMD, Intel or NVIDIA accelerators? How does performance compare across the different accelerators when running the same executable?
(3) Is switching from Intel Fortran 2012 to PGI Visual Fortran straightforward?
(4) As I mentioned before, most of the CPU time is spent setting up and solving a large matrix and repeating that process over and over. We were therefore able to put the OpenMP directives on the outermost loop (the part that repeats the process), which is the most efficient place. However, depending on the inputs, the number of repetitions can be anywhere from a few to a few thousand. For OpenMP this is fine, since the number of cores is limited, fewer than 10 in most cases. On a graphics chip, though, the number of cores can be a few hundred. If the outermost loop runs only a few iterations, will I still see the expected speedup, or do I need to modify the code carefully? (See the first sketch after the questions for what we have in mind.)
(5) About memory: our code requires a lot of memory. How is the memory required for the computation allocated and transferred between the host and the GPU? (See the second sketch after the questions.)
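To make question (4) concrete: if the outer case loop has too few iterations to fill a GPU, we assume the parallelism would have to come from the matrix work inside each iteration instead. Here is our guess at what that would look like with OpenACC-style directives; we have only read your product overview, so this is just how we imagine it, not working code:

    program acc_demo
      implicit none
      integer, parameter :: n = 2000
      integer :: i, j
      real(8), allocatable :: a(:,:), x(:)

      allocate(a(n,n), x(n))
      call random_number(x)

      ! Illustrative only: a dense assembly-style loop nest where the
      ! GPU parallelism comes from within one outer iteration rather
      ! than across outer iterations.
      !$acc parallel loop collapse(2) copyin(x) copyout(a)
      do j = 1, n
         do i = 1, n
            a(i,j) = exp(-abs(x(i) - x(j)))
         end do
      end do

      print *, 'a(1,1) =', a(1,1)
    end program acc_demo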
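And to make question (5) concrete: would we be expected to manage host-to-GPU transfers explicitly, for example with a data region like the following? Again, the directives are just our guess at the style:

    program mem_demo
      implicit none
      integer, parameter :: n = 2000, nsteps = 5
      integer :: i, j, it
      real(8), allocatable :: a(:,:), x(:)

      allocate(a(n,n), x(n))
      call random_number(x)

      ! x is copied to the GPU once; a is allocated on the GPU and
      ! only copied back to the host when the region ends, instead
      ! of being transferred at every kernel launch.
      !$acc data copyin(x) copyout(a)
      do it = 1, nsteps
         !$acc parallel loop collapse(2)
         do j = 1, n
            do i = 1, n
               a(i,j) = it * exp(-abs(x(i) - x(j)))
            end do
         end do
      end do
      !$acc end data

      print *, 'a(1,1) =', a(1,1)
    end program mem_demo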
Thank you so much.