Hi,
I am working on an OpenACC code optimized to run on NVIDIA GPUs that I would like to also run on Intel CPUs to demonstrate its portability, but I am getting quite poor performance.
Analyzing the compiler output when compiling with -ta=multicore and -tp=haswell, I noticed that the most compute-intensive functions were not parallelized/vectorized.
By contrast, they are when using -ta=tesla.
In particular, the core of these functions is made up of 4 nested loops such as:
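Schematically, it looks like the following (hypothetical names, bounds, and body, not the actual code; the line numbers 50/53/56/59 in the compiler report correspond to the four `for` lines):

```c
#include <stddef.h>

#define N 8  /* hypothetical extent; the real bounds differ */

/* Hypothetical sketch of the 4-deep loop nest. */
void my_function(double *restrict out, const double *restrict in)
{
    #pragma acc parallel loop gang                   /* line 50 */
    for (int i = 0; i < N; i++) {
        #pragma acc loop independent                 /* line 53 */
        for (int j = 0; j < N; j++) {
            #pragma acc loop independent             /* line 56 */
            for (int k = 0; k < N; k++) {
                #pragma acc loop independent vector  /* line 59 */
                for (int l = 0; l < N; l++) {
                    size_t idx = (((size_t)i * N + j) * N + k) * N + l;
                    out[idx] = 2.0 * in[idx];
                }
            }
        }
    }
}
```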
and for all of them the compiler output is similar to:
my_function:
    50, Loop is parallelizable
        Generating Multicore code
        50, #pragma acc loop gang
    50, Loop not vectorized/parallelized: too deeply nested
    53, Loop is parallelizable
    56, Loop is parallelizable
    59, Loop is parallelizable
        Loop not vectorized: data dependency
Why does the compiler report “#pragma acc loop gang” as the pragma associated with the outer loop, whatever pragma I try? For example, if I try “#pragma acc loop independent collapse(3)”, it still reports “#pragma acc loop gang”. Is it ignoring the specified pragmas?
Moreover, why does the compiler report “Loop not vectorized: data dependency” for the inner loop even though I specified “#pragma acc loop independent vector”? Shouldn’t the “independent” clause be enough to tell it not to check for data dependencies between loop iterations?
Currently, for multicore targets, the compiler is limited to a single gang loop. We do plan on expanding this. Also, we currently use classic auto-vectorization; in the future we expect the compiler to use the “vector” clause as an indication to vectorize the loop. For now, consider using the C99 “restrict” keyword and/or the “-Msafeptr” flag to help the compiler determine the independence of loop iterations.
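For example (a hypothetical function, not your code), “restrict” promises the compiler that the output pointer never aliases the inputs, so loop iterations are independent and the auto-vectorizer is free to act:

```c
#include <stddef.h>

/* Hypothetical example: 'restrict' guarantees out never aliases a or b,
 * so the iterations of this loop carry no hidden dependences. */
void axpy(size_t n, double alpha,
          const double *restrict a,
          const double *restrict b,
          double *restrict out)
{
    for (size_t i = 0; i < n; i++)
        out[i] = alpha * a[i] + b[i];
}
```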
We managed to resolve the “Loop not vectorized/parallelized: too deeply nested” warning by adding this option to the compiler flags (4 would probably be enough):
-Mvect=levels:5
Moreover, we updated the pragmas to use the tile clause, in order to adhere more strictly to the OpenACC specifications:
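Schematically (again hypothetical names and bounds), the updated pragmas look like this: “tile” strip-mines the tightly nested loops into tiles, and “independent” asserts the iterations carry no dependences.

```c
#include <stddef.h>

#define N 8  /* hypothetical extent */

/* Hypothetical sketch of the updated pragmas: tile(4,4,4,4) covers the
 * whole tightly nested 4-deep loop nest. */
void my_function(double *restrict out, const double *restrict in)
{
    #pragma acc parallel loop independent tile(4, 4, 4, 4)
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                for (int l = 0; l < N; l++) {
                    size_t idx = (((size_t)i * N + j) * N + k) * N + l;
                    out[idx] = in[idx] + 1.0;
                }
}
```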
Despite that, and although all the pointers passed as arguments to the function were declared __restrict, we still got the “Loop not vectorized: data dependency” warning.
Using -Msafeptr as a compiler option changed the compiler output, but it is still not vectorizing the code.
The output is now:
    50, Loop is parallelizable
        Generating Multicore code
        50, #pragma acc loop gang
    53, Loop is parallelizable
    56, Loop is parallelizable
    59, Loop is parallelizable
        Loop not vectorized: mixed data types
Do you have any other suggestions?
P.S.
The same code compiles and runs perfectly fine on GPUs, and just by changing the pragmas it can also be vectorized by the Intel compiler targeting CPUs, so I guess it should be “just” a matter of persuading the PGI compiler that it can vectorize it.
We’ve improved the compiler vectorization over the last few years and do support mixed data types, in particular things like mixed int and double. I’m not sure why we’re not able to vectorize it in this case.
Can you post a snippet of the code, including the data types of the variables being used? Also, what version of the compiler are you using?
The code is quite complex, but maybe the important information is that in the body of the loop, apart from some integer computations for array indexing, all computations are performed on array elements of user-defined data structures, which actually contain the C99 “double complex” data type.
Maybe the compiler has difficulty understanding that double complex elements actually just map to pairs of doubles?
We are able to vectorize the double complex data type (which is a 16-byte type; doubles are 8 bytes). However, we currently can’t vectorize 16-byte types mixed with 4-byte types. Do you have any floats in the expression? If so, can you make them doubles to see if that works around the problem?
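For instance (a hypothetical illustration, not your code), a single float constant or temporary in a double complex expression is enough to mix a 4-byte type with a 16-byte one; promoting it to double keeps every operand at 8 or 16 bytes:

```c
#include <complex.h>
#include <stddef.h>

/* Hypothetical illustration of the mixed-type issue: a float 'scale'
 * would mix a 4-byte type into a 16-byte double complex expression;
 * declaring it double avoids the mix. */
void scale_field(size_t n, double complex *restrict z)
{
    const double scale = 0.5;  /* was: const float scale = 0.5f; */
    for (size_t i = 0; i < n; i++)
        z[i] = scale * z[i];
}
```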
Is this limitation still present, or do newer PGI versions allow collapsing two (or more) outer loops to increase the number of gangs?
I’ve been talking to our compiler engineers about this. According to them collapse should be working with a gang loop when targeting multicore. However, I have seen cases where the runtime profiles appear as if the collapse was ineffective.
I’ll continue to press them since it should be working but may not be in all cases. Are you able to send me a reproducing example (trs@pgroup.com)? That may help.
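For reference, a minimal sketch of the pattern in question (hypothetical names and bounds): collapse(2) fuses the i and j loops into one iteration space, so the multicore target can schedule NI*NJ gang iterations instead of only NI.

```c
#define NI 32  /* hypothetical extents */
#define NJ 32

/* Hypothetical sketch: two outer loops collapsed into one gang loop. */
void update(double *restrict a)
{
    #pragma acc parallel loop gang collapse(2)
    for (int i = 0; i < NI; i++)
        for (int j = 0; j < NJ; j++)
            a[i * NJ + j] *= 2.0;
}
```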
Hi,
I guess in our code it is working, since we see different execution times when using it or not… I was just asking to know whether it was worth giving it a try ;)
Anyhow, how would you profile the resulting executable to identify how the loops get divided across threads?
I tried pgprof and then the visual profiler, but I was not able to find this information… I was expecting to find it presented similarly to grids and blocks when targeting GPUs. Do I need to pass any particular flags to pgprof?
For OpenACC multicore targets, we use our OpenMP back-end, where each “gang” gets mapped to an OpenMP thread. With more recent versions of pgprof, you can profile OpenACC multicore and it will show up on the timeline. You can also see the times for each of the individual threads from the “CPU” tab’s thread drop-down box.
You’ll only be able to see the gang-level parallelism, since we use classical CPU vectorization for the vector loops.
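As a quick sketch (the binary name is hypothetical), the number of threads backing the gangs on a multicore target can be controlled with the ACC_NUM_CORES environment variable before running under the profiler:

```shell
# Hypothetical usage sketch: ACC_NUM_CORES sets how many OpenMP threads
# back the OpenACC gangs on a multicore target.
export ACC_NUM_CORES=8
# Then profile the (hypothetical) binary to see per-thread times:
#   pgprof ./my_app
echo "ACC_NUM_CORES=$ACC_NUM_CORES"
```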