Does performance degrade when using local allocatable arrays?

In preparation for an MPI implementation of our photochemical model, I converted each array whose size depends on user-defined parameters to an allocatable array. Now I am seeing a significant performance hit relative to the statically allocated F77 version. I investigated, and this is what I found: if I leave all of the original F77 code intact and change one subroutine to use allocatable arrays for its local storage, that routine takes more than twice as long as the original. To be clear, the only difference in this comparison is that one version declares its local arrays with sizes fixed by a PARAMETER statement, while the other uses allocatable arrays sized from a value passed in through the argument list. I also tried automatic arrays, declaring the arrays with a statement like:

real array(isize)

where isize is passed through the argument list. I get a similar performance penalty. Is this something that others have experienced when using allocatable arrays? Are there programming practices or compiler implementations that can deal with this? Do you have any thoughts on this?

I am using pgf90 6.0-5 with a 32-bit target on x86 Linux. I also ran the comparisons with an Intel compiler. There, the allocatable arrays produced the same slowdown, but the performance of the automatic arrays was similar to that of the static arrays.
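For concreteness, the three variants being compared might look like the sketch below. This is only an illustration of the declaration styles described above: the routine names, MAXSIZE, and the loop body are hypothetical and are not taken from the photochemical model.

```fortran
! Hypothetical sketch of the three local-storage variants under comparison.
! All names (work_static, work_alloc, work_auto, MAXSIZE) are illustrative.

! Variant 1: F77 style, local array sized by a compile-time PARAMETER.
subroutine work_static(n, total)
  implicit none
  integer, parameter :: MAXSIZE = 1000   ! assumed upper bound; n must not exceed it
  integer, intent(in) :: n
  real, intent(out) :: total
  real :: array(MAXSIZE)
  integer :: i
  do i = 1, n
     array(i) = real(i)
  end do
  total = sum(array(1:n))
end subroutine work_static

! Variant 2: allocatable local array, allocated and freed on every call.
subroutine work_alloc(isize, total)
  implicit none
  integer, intent(in) :: isize
  real, intent(out) :: total
  real, allocatable :: array(:)
  integer :: i
  allocate(array(isize))
  do i = 1, isize
     array(i) = real(i)
  end do
  total = sum(array)
  deallocate(array)
end subroutine work_alloc

! Variant 3: automatic array, sized by a dummy argument at entry.
subroutine work_auto(isize, total)
  implicit none
  integer, intent(in) :: isize
  real, intent(out) :: total
  real :: array(isize)
  integer :: i
  do i = 1, isize
     array(i) = real(i)
  end do
  total = sum(array)
end subroutine work_auto

program compare
  implicit none
  real :: t1, t2, t3
  call work_static(100, t1)
  call work_alloc(100, t2)
  call work_auto(100, t3)
  print *, t1, t2, t3
end program compare
```

All three routines compute the same result; only the way the local storage is obtained differs, which is the one variable in the timing comparison.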



There is extra overhead in managing and accessing arrays declared as allocatable. Each time you enter and exit a routine with local allocatable arrays, the arrays must be allocated and deallocated. The compiler also creates a descriptor for each allocatable array (dimensions, bounds, etc.), which can degrade performance as well. In pgf90, automatic arrays are treated like allocatable arrays, with the compiler inserting calls to the allocate/deallocate functions. The Intel-generated code for the automatic array case is probably placing these arrays on the stack, so there is no need to allocate and deallocate them.
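One possible way to avoid the per-call allocate/deallocate cost is to keep the buffer alive between calls and reallocate only when a larger size is requested. The following is a minimal sketch of that idea, not a tested fix for this model: the routine name work_reuse and its body are hypothetical, and it assumes the routine is not called from multiple threads, since a SAVEd buffer is shared across calls. The descriptor overhead mentioned above still applies.

```fortran
! Hypothetical sketch: allocate the local work array once and reuse it.
subroutine work_reuse(isize, total)
  implicit none
  integer, intent(in) :: isize
  real, intent(out) :: total
  real, allocatable, save :: array(:)   ! SAVE keeps the buffer between calls
  integer :: i

  ! Drop the existing buffer only if it is too small for this call.
  if (allocated(array)) then
     if (size(array) < isize) deallocate(array)
  end if
  if (.not. allocated(array)) allocate(array(isize))

  do i = 1, isize
     array(i) = real(i)
  end do
  ! The buffer may be larger than isize, so use an explicit section.
  total = sum(array(1:isize))
end subroutine work_reuse

program demo
  implicit none
  real :: total
  integer :: k
  do k = 1, 3
     call work_reuse(100, total)   ! only the first call allocates
  end do
  print *, total
end program demo
```

An alternative with the same effect is to allocate the work array once in the caller and pass it down as an explicit-shape dummy argument, which also removes the descriptor from the inner routine.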