When compiling code for performance, we have a switch called
-Minfo, which gives the most information:
it reports when the compiler performs optimizations or changes
the code flow or organization. Try some programs with loops,
and compile with -Minfo
to get some idea of what the compiler is doing.
But you are asking about the GPU, and PGI does not have many
switches to inform you of GPU optimizations.
The greatest performance gains on the GPUs are from calling well designed CUDA routines to perform the operations.
Since GPUs are becoming the standard for performance in compute centers, understanding how to program CUDA routines gives greater insight into how it all works.
A very good book on this is “CUDA by Example”, which goes through the logic and mathematics of CUDA: how you turn a compute-intensive loop into a series of CUDA calls that run tremendously faster (when you have a tremendous amount of computation to do, this is good; when not, it can be underwhelming).
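As a rough sketch of the pattern the book teaches (my own minimal
example, not taken from it), a serial loop y[i] = a*x[i] + y[i]
becomes a kernel, two copies, and a launch:

```cuda
#include <cuda_runtime.h>

/* Each thread handles one element; the grid replaces the loop. */
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

void saxpy_on_gpu(int n, float a, const float *x, float *y) {
    float *dx, *dy;
    cudaMalloc(&dx, n * sizeof(float));
    cudaMalloc(&dy, n * sizeof(float));
    cudaMemcpy(dx, x, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dy, y, n * sizeof(float), cudaMemcpyHostToDevice);
    /* one thread per element, 256 threads per block */
    saxpy<<<(n + 255) / 256, 256>>>(n, a, dx, dy);
    cudaMemcpy(y, dy, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dx);
    cudaFree(dy);
}
```

Note how the two cudaMemcpy calls bracket the launch: for small n,
that data movement can cost more than the computation saves, which
is where the "underwhelming" cases come from.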
OpenACC is an easier path: you add directives to code that
already works on your CPU, and the compiler takes care of generating GPU code and moving data to and from the GPU.
There is a quick tutorial about moving code from a CPU to a GPU
with OpenACC.