Large Executable Sizes with PGI: How to reduce?


I’m wondering if anyone knows how to reduce the size of PGI’s executables? For example, my main executable, GEOSgcm.x, has these sizes with various compilers:

Intel 15.0.2:       96 MB (-O3)
Intel 16.0.2:       89 MB (-O3)
Intel 16.0.2:      301 MB (full debugging flags)
GCC 6.0.1:          63 MB (-O3)
PGI 16.5 (CPU):   1604 MB (-fast)
PGI 16.5 (CPU):   2359 MB (-fast -g)
PGI 16.5 (GPU):   1609 MB (-fast)

where the "full debugging flags" Intel line is the build compiled with our complete set of debug options. The complete builds of the model (without cleaning, i.e., keeping the .o files, etc.) are:

Intel 15.0.2:        7 GB (-O3)
Intel 16.0.2:        7 GB (-O3)
Intel 16.0.2:       14 GB (full debugging flags)
GCC 6.0.1:           5 GB (-O3)
PGI 16.5 (CPU):     84 GB (-fast)
PGI 16.5 (CPU):    115 GB (-fast -g)
PGI 16.5 (GPU):     84 GB (-fast)

I fully understand this might be my fault with my flags. The main ones I run with are (not counting any GPU target):

-fast -Kieee -Mbyteswapio -fpic -Mbackslash -Ktrap=fp -tp=px-64

as indicated by "-fast" above. For profiling purposes I usually just add '-g' to the optimized flags (often under a profiler), which is the "-fast -g" line above, so I thought I should check whether '-g' was the main culprit and built the model without it. It helped, yes, but not much.

Does anyone know why the executable (and object files, etc.) is so large? Is it the -tp=px-64? We mainly use that so we aren’t hit by a “compile on Haswell, run on Sandy Bridge” issue, but if I can make it smaller by targeting a single processor, I will.
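One way to see where the bloat actually lives (a sketch; GEOSgcm.x here stands in for whatever binary you built) is to ask the binutils tools which sections and symbols dominate:

```shell
# Per-section breakdown: a huge .data section points at statically
# initialized arrays, huge .debug_* sections point at -g.
size -A GEOSgcm.x | sort -k2 -n | tail

# Largest individual symbols, size first -- names the worst offenders.
nm --size-sort --print-size GEOSgcm.x | tail -20
```

If a handful of module arrays top the `nm` list, that points at static initialization rather than the -tp choice.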


The difference in executable size between

real*8 :: array(2000000) = 4.0

for example, and

real*8, allocatable :: array(:)

allocate(array(2000000))

is quite a bit.

Switch from declaring and initializing arrays statically to doing it
dynamically at runtime. CPUs are fast.

If you have to compile for both Haswell and Sandy Bridge, either
compile for Sandy Bridge only (which will also run on Haswell),
or use a unified binary (-tp=haswell,sandybridge), which will
create a larger program as well. But that extra code may be dwarfed by
your static array declarations. -tp=px-64 removes all sorts of
optimization choices (it might even use x87 instructions - yuk).
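For concreteness, the two options might look like this on your flag set (a sketch only; file names are hypothetical and the rest of your flags are carried over unchanged):

```shell
# Single target: Sandy Bridge code runs on Haswell too.
pgfortran -fast -Kieee -Mbyteswapio -fpic -Mbackslash -Ktrap=fp \
          -tp=sandybridge -o GEOSgcm.x main.f90

# Unified binary: one executable carrying code paths for both targets,
# chosen at run time -- larger on disk than a single-target build.
pgfortran -fast -Kieee -Mbyteswapio -fpic -Mbackslash -Ktrap=fp \
          -tp=haswell,sandybridge -o GEOSgcm.x main.f90
```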



Well, I guess what confuses me is why PGI’s is so much bigger than Intel’s or GNU’s? They are all compiling exactly the same Fortran code.

I mean, they’d all be compiling in the same static arrays (of which there aren’t that many; our code is pointers first, with automatics and allocatables second, probably), right? Perhaps PGI has some default behavior at compile time that is the opposite of Intel/GNU? (Sort of like pre-allocating buffers versus allocating on demand in some MPI stacks I’ve used.)

And, yeah, we did once try westmere,sandybridge,haswell and the executables got rather…immense. I mainly target generic x86-64 for ease of portability on clusters where multiple hardware generations can occur. If someone wants a build fully optimized for one chip, that’s easy enough to do. :)