How do I ensure that allocatable arrays are aligned on 16-byte boundaries
when compiling Fortran90 code with pgf90?
Please note: allocatable arrays, not just static ones.
I.e., when the program runs and the memory allocation is requested,
it should provide memory at aligned addresses.
In particular, I want the allocatable arrays to
benefit from SIMD (sse,sse2,…) instructions.
I found a few compiler flags, but it is unclear if they do this:
-Mcache_align,
-Mdalign,
-Mvect=sse
Thank you,
Gus
Hi Gus,
How do I ensure that allocatable arrays are aligned on 16-byte boundaries
when compiling Fortran90 code with pgf90?
The alignment of dynamically allocated memory is OS dependent. 64-bit Linux always returns 16-byte aligned memory. Otherwise, you need to pad your arrays to force alignment.
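As a rough illustration of the point above, here is a minimal sketch that checks where an allocatable array actually starts and, if needed, pads it to reach a 16-byte boundary. It assumes your compiler provides the Fortran 2003 ISO_C_BINDING module and that the allocator returns memory that is at least 8-byte aligned. The names buf, x, skip, and boundary are made up for this example and are not part of any PGI interface.

```fortran
program aligned_alloc_sketch
  use iso_c_binding, only: c_loc, c_intptr_t, c_double
  implicit none

  integer, parameter :: n = 1000        ! elements actually needed
  integer, parameter :: boundary = 16   ! required alignment in bytes
  real(c_double), allocatable, target :: buf(:)
  real(c_double), pointer :: x(:)
  integer(c_intptr_t) :: addr, misalign
  integer :: skip

  ! Over-allocate by one 8-byte element so an aligned start always exists.
  allocate(buf(n + 1))

  ! Address of the first element, converted to an integer.
  addr = transfer(c_loc(buf(1)), addr)
  misalign = mod(addr, int(boundary, c_intptr_t))
  print '(a,i0)', 'raw start mod 16 = ', misalign

  ! Elements to skip to reach the next 16-byte boundary. This is 0 or 1
  ! under the assumption that the allocator is at least 8-byte aligned.
  skip = 0
  if (misalign /= 0) skip = int((boundary - misalign) / 8)

  ! Work through an aligned view of the padded buffer.
  x => buf(1 + skip : n + skip)

  addr = transfer(c_loc(x(1)), addr)
  print '(a,i0)', 'view start mod 16 = ', mod(addr, int(boundary, c_intptr_t))

  x = 0.0_c_double
  nullify(x)
  deallocate(buf)
end program aligned_alloc_sketch
```

If the first line printed is already 0 on your system, the padding step is unnecessary there, and a plain allocate is enough.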
In particular, I want the allocatable arrays to benefit from SIMD (sse,sse2,…) instructions.
Vectorization does not depend on memory alignment. The compiler generates multiple versions of your loop and checks at runtime whether the memory is aligned on a 16-byte boundary. If it is, a single movapd instruction is used; otherwise, two movupd instructions are issued to fetch the data.
To enable vectorization, you can use the flag “-Mvect=sse”, but I would recommend the aggregate flag “-fastsse”, which includes other optimization flags as well.
Hope this helps,
Mat
Hi Mat
Thank you very much for your very helpful answer.
So, in Linux 64-bit I suppose I can take a cavalier approach and
assume that my allocatable arrays will be 16-byte aligned,
hence, can potentially use SIMD instructions effectively for vectorization, correct?
The only problem I have with the -fast and -fastsse aggregate optimization
flags is that they include -static.
Unfortunately, most codes link to shared libraries and don’t build with -static.
Hence, I have been trying to use the parts of the aggregate flags, except -static:
-Mvect=sse, when I can find out what to use.
Is there an easy way to find all the flags that are included in
aggregates like -fastsse, -O3, etc?
Many thanks,
Gus
Hi Gus,
So, in Linux 64-bit I suppose I can take a cavalier approach and
assume that my allocatable arrays will be 16-byte aligned,
hence, can potentially use SIMD instructions effectively for vectorization, correct?
Correct.
The only problem I have with the -fast and -fastsse aggregate optimization
flags is that they include -static.
You’re confusing us with the Intel compilers. The PGI “-fast” flags do not include “-Bstatic”.
Hence, I have been trying to use the parts of the aggregate flags, except -static:
-Mvect=sse, when I can find out what to use. Is there an easy way to find all the flags that are included in aggregates like -fastsse, -O3, etc?
“-fast” will be system dependent so the best way to see what flags it comprises is to use “pgfortran -help -fast”.
% pgfortran -help -fast
Reading rcfile /usr/pgi/linux86-64/10.9/bin/.pgfortranrc
-fast               Common optimizations; includes -O2 -Munroll=c:1 -Mnoframe -Mlre -Mautoinline
                    == -Mvect=sse -Mscalarsse -Mcache_align -Mflushz -Mpre
-M[no]vect[=[no]altcode|[no]assoc|cachesize:<c>|[no]fuse|[no]gather|[no]idiom|levels:<n>|[no]partial|[no]sizelimit[:n]|prefetch|[no]short|[no]sse|[no]uniform]
                    Control automatic vector pipelining
    [no]altcode     Generate appropriate alternative code for vectorized loops
    [no]assoc       Allow [disallow] reassociation
    cachesize:<c>   Optimize for cache size c
    [no]fuse        Enable [disable] loop fusion
    [no]gather      Enable [disable] vectorization of indirect array references
    [no]idiom       Enable [disable] idiom recognition
    levels:<n>      Maximum nest level of loops to optimize
    [no]partial     Enable [disable] partial loop vectorization via inner loop distribution
    [no]sizelimit[:n]  Limit size of vectorized loops
    prefetch        Generate prefetch instructions
    [no]short       Enable [disable] short vector operations
    [no]sse         Generate [don't generate] SSE instructions
    [no]uniform     Perform consistent optimizations in both vectorized and residual loops; this may affect the performance of the residual loop
-M[no]scalarsse     Generate scalar sse code with xmm registers; implies -Mflushz
-Mcache_align       Align long objects on cache-line boundaries
-M[no]flushz        Set SSE to flush-to-zero mode
-M[no]pre           Enable partial redundancy elimination
Hope this helps,
Mat
Hi Mat
Thank you for your help again.
You are right, I confused the PGI -fast with Intel’s -fast.
Much better to keep static as a separate flag.
Thank you for the tip on how to get the actual meaning of -fast and -fastsse.
Gus