Align *allocatable* arrays with pgf90

How do I ensure that allocatable arrays are aligned on 16-byte boundaries
when compiling Fortran90 code with pgf90?

Please note that I mean allocatable arrays, not just static ones.
That is, when the program runs and requests memory,
the allocation should return memory at aligned addresses.

In particular, I want the allocatable arrays to
benefit from SIMD (sse,sse2,…) instructions.

I found a few compiler flags, but it is unclear if they do this:
-Mcache_align,
-Mdalign,
-Mvect=sse


Thank you,
Gus

Hi Gus,

How do I ensure that allocatable arrays are aligned on 16-byte boundaries
when compiling Fortran90 code with pgf90?

The alignment of dynamically allocated memory is OS dependent. 64-bit Linux always returns 16-byte aligned memory. Otherwise, you need to pad your arrays to force alignment.
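For anyone who wants to check this on their own system, here is a minimal sketch in C rather than Fortran. It assumes that the Fortran runtime's ALLOCATE ultimately gets its memory from the system malloc; that is an assumption, not something confirmed in this thread. The helper names `is_aligned` and `alloc_padded` are made up for illustration. The first tests an address against a boundary, and the second shows the padding idea: ask for a few extra bytes and round the start address up.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Returns 1 if p lies on an `alignment`-byte boundary, 0 otherwise.
 * `alignment` must be a power of two (16 for SSE). */
int is_aligned(const void *p, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return ((uintptr_t)p & (alignment - 1)) == 0;
}

/* Hypothetical helper: pad the request by `alignment` bytes, then round the
 * start address up to the next boundary. The pointer actually returned by
 * malloc is handed back through `raw`; that is the one to pass to free(). */
double *alloc_padded(size_t n, size_t alignment, void **raw)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    *raw = malloc(n * sizeof(double) + alignment);
    if (*raw == NULL)
        return NULL;
    uintptr_t addr = ((uintptr_t)*raw + alignment - 1) & ~(uintptr_t)(alignment - 1);
    return (double *)addr;
}
```

Calling `is_aligned(ptr, 16)` on a pointer returned by plain malloc tells you what your OS and C library actually hand out. Where that check fails, the padded version wastes at most `alignment` bytes per array in exchange for a guaranteed boundary.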

In particular, I want the allocatable arrays to benefit from SIMD (sse,sse2,…) instructions.

Vectorization is not dependent upon memory alignment. The compiler will generate multiple versions of your loop. At runtime it will detect whether the memory is aligned on a 16-byte boundary. If it is, a single movapd instruction is used; otherwise, two movupd are issued to fetch the data.
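To make the "multiple versions of the loop" idea concrete, here is a hand-written sketch in C with SSE2 intrinsics. It is not the code pgf90 generates, only an illustration of the same runtime test: check the addresses once, then run either an aligned or an unaligned version of the loop. The function name `daxpy_dispatch` is made up for this example, and the exact instructions a C compiler picks for each intrinsic may differ from the movapd/movupd pair described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Computes y[i] += a * x[i] for i = 0..n-1.
 * With SSE2 available, the alignment of x and y is tested once at run time
 * and the matching version of the vector loop is used. */
void daxpy_dispatch(size_t n, double a, const double *x, double *y)
{
    size_t i = 0;
    assert(n == 0 || (x != NULL && y != NULL));
#if defined(__SSE2__)
    const __m128d va = _mm_set1_pd(a);
    if ((((uintptr_t)x | (uintptr_t)y) & 15) == 0) {
        /* Both arrays start on a 16-byte boundary: aligned loads and stores. */
        for (; i + 2 <= n; i += 2) {
            __m128d vx = _mm_load_pd(x + i);
            __m128d vy = _mm_load_pd(y + i);
            _mm_store_pd(y + i, _mm_add_pd(vy, _mm_mul_pd(va, vx)));
        }
    } else {
        /* At least one array is misaligned: unaligned loads and stores. */
        for (; i + 2 <= n; i += 2) {
            __m128d vx = _mm_loadu_pd(x + i);
            __m128d vy = _mm_loadu_pd(y + i);
            _mm_storeu_pd(y + i, _mm_add_pd(vy, _mm_mul_pd(va, vx)));
        }
    }
#endif
    /* Residual elements (or the whole loop when SSE2 is not available). */
    for (; i < n; i++)
        y[i] += a * x[i];
}
```

Both branches give the same results; alignment only decides which load and store instructions are used, which is why vectorization itself does not depend on it.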

To enable vectorization, you can use the flag “-Mvect=sse”, but I would recommend using the aggregate flag “-fastsse”, which includes other optimization flags as well.

Hope this helps,
Mat

Hi Mat

Thank you very much for your very helpful answer.

So, in Linux 64-bit I suppose I can take a cavalier approach and
assume that my allocatable arrays will be 16-byte aligned,
hence, can potentially use SIMD instructions effectively for vectorization, correct?

The only problem I have with the -fast and -fastsse aggregate optimization
flags is that they include -static.
Unfortunately, most codes link to shared libraries and don’t build with these flags.
Hence, I have been trying to use the parts of the aggregate flags, except -static:
-Mvect=sse, when I can find out what to use.

Is there an easy way to find all the flags that are included in
aggregates like -fastsse, -O3, etc?

Many thanks,
Gus

Hi Gus,

So, in Linux 64-bit I suppose I can take a cavalier approach and
assume that my allocatable arrays will be 16-byte aligned,
hence, can potentially use SIMD instructions effectively for vectorization, correct?

Correct.

The only problem I have with the -fast and -fastsse aggregate optimization
flags is that they include -static.

You’re confusing us with the Intel compilers. The PGI “-fast” flags do not include “-Bstatic”.

Hence, I have been trying to use the parts of the aggregate flags, except -static:
-Mvect=sse, when I can find out what to use. Is there an easy way to find all the flags that are included in aggregates like -fastsse, -O3, etc?

“-fast” will be system dependent so the best way to see what flags it comprises is to use “pgfortran -help -fast”.

% pgfortran -help -fast
Reading rcfile /usr/pgi/linux86-64/10.9/bin/.pgfortranrc
-fast               Common optimizations; includes -O2 -Munroll=c:1 -Mnoframe -Mlre -Mautoinline
                    == -Mvect=sse -Mscalarsse -Mcache_align -Mflushz -Mpre
-M[no]vect[=[no]altcode|[no]assoc|cachesize:<c>|[no]fuse|[no]gather|[no]idiom|levels:<n>|[no]partial|[no]sizelimit[:n]|prefetch|[no]short|[no]sse|[no]uniform]
                    Control automatic vector pipelining
    [no]altcode     Generate appropriate alternative code for vectorized loops
    [no]assoc       Allow [disallow] reassociation
    cachesize:<c>   Optimize for cache size c
    [no]fuse        Enable [disable] loop fusion
    [no]gather      Enable [disable] vectorization of indirect array references
    [no]idiom       Enable [disable] idiom recognition
    levels:<n>      Maximum nest level of loops to optimize
    [no]partial     Enable [disable] partial loop vectorization via inner loop distribution
    [no]sizelimit[:n]
                    Limit size of vectorized loops
    prefetch        Generate prefetch instructions
    [no]short       Enable [disable] short vector operations
    [no]sse         Generate [don't generate] SSE instructions
    [no]uniform     Perform consistent optimizations in both vectorized and residual loops; this may affect the performance of the residual loop
-M[no]scalarsse     Generate scalar sse code with xmm registers; implies -Mflushz
-Mcache_align       Align long objects on cache-line boundaries
-M[no]flushz        Set SSE to flush-to-zero mode
-M[no]pre           Enable partial redundancy elimination

Hope this helps,
Mat

Hi Mat
Thank you for your help again.
You are right, I confused the PGI -fast with Intel’s -fast.
Much better to keep static as a separate flag.
Thank you for the tip on how to get the actual meaning of -fast and -fastsse.
Gus