Loop "too deeply nested" and "data dependency

Hi,
I am working on an OpenACC code optimized to run on NVIDIA GPUs that I would also like to run on Intel CPUs to demonstrate its portability, but I am getting quite poor performance.

Analyzing the compiler output when compiling with -ta=multicore and -tp=haswell, I noticed that the most compute-demanding functions were not parallelized/vectorized.
By contrast, they are when using -ta=tesla.

In particular, the core of these functions is made up of four nested loops, such as:

#pragma acc kernels present(...)
    #pragma acc loop independent gang
    for(...) {
        #pragma acc loop independent gang
        for(...) {
            #pragma acc loop independent gang
            for(...) {
                #pragma acc loop independent vector
                for(...) {
                    ...
                }
            }
        }
    }

and for all of them the compiler output is similar to:

my_function:
     50, Loop is parallelizable
         Generating Multicore code
         50, #pragma acc loop gang
     50, Loop not vectorized/parallelized: too deeply nested
     53, Loop is parallelizable
     56, Loop is parallelizable
     59, Loop is parallelizable
         Loop not vectorized: data dependency

Why does the compiler report “#pragma acc loop gang” as the pragma associated with the outer loop, whatever pragma I try? For example, if I try “#pragma acc loop independent collapse(3)”, it still reports “#pragma acc loop gang”. Is it ignoring the specified pragmas?

Moreover, why does the compiler report “Loop not vectorized: data dependency” for the inner loop, although I specified “#pragma acc loop independent vector”? Shouldn’t the “independent” clause be enough to tell the compiler not to check for data dependencies between loop iterations?


Thanks and Best Regards,

Enrico

Hi Enrico,

Currently, for multicore targets the compiler is limited to a single gang loop. We do plan on expanding this. Also, we currently use classic auto-vectorization, but in the future we expect the compiler to use the “vector” clause as an indication to vectorize the loop. For now, consider using the C99 “restrict” keyword and/or the “-Msafeptr” flag to help the compiler determine the independence of loop iterations.
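
To illustrate, here is a minimal sketch of what that might look like (the function, array names, and bounds are hypothetical, not taken from your code):

void my_kernel(double *restrict a, const double *restrict b, int n)
{
    /* "restrict" promises the compiler that a and b never alias, so the
       auto-vectorizer does not have to assume a loop-carried dependency. */
    #pragma acc kernels present(a[0:n], b[0:n])
    {
        #pragma acc loop independent vector
        for (int i = 0; i < n; ++i) {
            a[i] += 2.0 * b[i];
        }
    }
}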

  • Mat

We managed to resolve the “Loop not vectorized/parallelized: too deeply nested” warning by adding this option to the compiler (4 would probably be enough):

-Mvect=levels:5
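
For reference, the full compile line now looks roughly like this (the exact flag set and file name here are illustrative, not copied verbatim from our build):

pgcc -acc -ta=multicore -tp=haswell -Mvect=levels:5 -Minfo=accel,vect -O2 -c my_function.c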

Moreover, we updated the pragmas to use the tile clause, in order to adhere more strictly to the OpenACC specifications:

#pragma acc kernels present(...)
    #pragma acc loop independent gang
    for(...) {
        #pragma acc loop independent vector tile(x,y,z)
        for(...) {
            for(...) {
                for(...) {
                    ...
                }
            }
        }
    }

Despite that, and although all the pointers passed as arguments to the function were declared with the restrict keyword, we still got the “Loop not vectorized: data dependency” warning.

Using -Msafeptr as a compiler option changed the compiler output, but it is still not vectorizing the code.

The output is now:

    50, Loop is parallelizable
         Generating Multicore code
        50, #pragma acc loop gang
    53, Loop is parallelizable
    56, Loop is parallelizable
    59, Loop is parallelizable
         Loop not vectorized: mixed data types

Do you have any other suggestions?

P.S.
The same code compiles and runs perfectly fine on GPUs, and with just a change of pragmas it can also be vectorized by the Intel compiler targeting CPUs, so I guess it should “just” be a matter of persuading the PGI compiler that it can vectorize it.

Thanks and Best Regards,

Enrico

Hi Enrico,

Loop not vectorized: mixed data types

We’ve improved the compiler vectorization over the last few years and do support mixed data types, in particular things like mixed int and double. I’m not sure why we’re not able to vectorize it in this case.

Can you post a snippet of the code, including the data types of the variables being used? Also, what version of the compiler are you using?

  • Mat

Hi,
we are using PGI 16.10.

The code is quite complex, but maybe the important information is that in the body of the loop, apart from some integer computations for array indexing, all computations are performed on array elements of user-defined data structures, which actually contain the C99 “double complex” data type.

Maybe the compiler has difficulties understanding that double complex elements actually just map to doubles?
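
A much simplified, hypothetical sketch of what the data layout and loop body look like (names are illustrative, not our actual code):

#include <complex.h>

typedef struct {
    double complex *field;              /* C99 double complex elements */
} field_t;                              /* hypothetical structure name */

void update(field_t *restrict out, const field_t *restrict in,
            int n, int stride)
{
    for (int i = 0; i < n; ++i) {
        int idx = i * stride;                     /* integer index arithmetic  */
        out->field[idx] += 2.0 * in->field[idx];  /* double complex arithmetic */
    }
}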


Thanks and Best Regards,

Enrico

Hi Enrico,

We are able to vectorize the double complex data type (which is a 16-byte type; doubles are 8 bytes). However, we can’t currently vectorize 16-byte types mixed with 4-byte types. Do you have any floats in the expression? If so, can you change them to doubles to see if that works around the problem?
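
For example, a hypothetical pattern like the commented-out line below would currently block vectorization, while the double version should not:

#include <complex.h>

void scale(double complex *restrict v, int n)
{
    float  scale_f = 0.5f;   /* 4-byte float mixed with 16-byte complex */
    double scale_d = 0.5;    /* promoted to an 8-byte double instead    */

    for (int i = 0; i < n; ++i) {
        /* v[i] *= scale_f; */    /* 16-byte mixed with 4-byte: not vectorized */
        v[i] *= scale_d;          /* 16-byte mixed with 8-byte only            */
    }
}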

  • Mat

Hi,
is this limitation still present, or do newer PGI versions allow collapsing two (or more) outer loops to increase the number of gangs?

Doing something like this:

#pragma acc loop independent collapse(2) gang
for (...) {
    for (...) {
        #pragma acc loop independent vector tile(TILE_0,TILE_1)
        for (...) {
            for (...) {
                ...
            }
        }
    }
}

Thanks and Best Regards,

Enrico

Hi EnricoC,

is this limitation still present, or do newer PGI versions allow collapsing two (or more) outer loops to increase the number of gangs?

I’ve been talking to our compiler engineers about this. According to them collapse should be working with a gang loop when targeting multicore. However, I have seen cases where the runtime profiles appear as if the collapse was ineffective.
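
For what it’s worth, conceptually collapse(2) just merges the iteration spaces of the two loops into one gang loop, roughly like this hand-fused sketch (NI, NJ, and the body are hypothetical):

void fused(double *restrict a, int NI, int NJ)
{
    #pragma acc parallel loop gang present(a[0:NI*NJ])
    for (int ij = 0; ij < NI * NJ; ++ij) {
        int i = ij / NJ;          /* recover the original outer index */
        int j = ij % NJ;          /* recover the original inner index */
        a[i * NJ + j] *= 2.0;     /* stand-in for the real loop body  */
    }
}

so a working collapse should give the multicore back-end NI*NJ gang iterations to distribute across threads instead of NI.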

I’ll continue to press them since it should be working but may not be in all cases. Are you able to send me a reproducing example (trs@pgroup.com)? That may help.

Thanks,
Mat

Hi,
I guess in our code it is working, since we can see different execution times when using it or not… I was just asking to know whether it was worth giving it a try ;)

Anyhow, how would you profile the resulting executable to identify how the loops get divided across threads?
I tried to use pgprof and then the visual profiler, but I was not able to find this information… I was expecting to find it in a fashion similar to grid and blocks when targeting GPUs. Do I need to specify some particular flags to pgprof?

Hi Enrico,

For OpenACC multicore targets, we use our OpenMP back-end, where each “gang” gets mapped to an OpenMP thread. For more recent versions of pgprof, you can profile OpenACC multicore code and it will show up on the timeline. You can also see the times for each individual thread from the “CPU” tab’s thread drop-down box.
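
As a rough example of what that looks like in practice (the executable name is just a placeholder), with the thread count for the multicore target set via the ACC_NUM_CORES environment variable:

export ACC_NUM_CORES=8      # number of cores used by the multicore target
pgprof ./my_app             # multicore OpenACC activity shows up on the timeline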

You’ll only be able to see the gang-level parallelism, since we’ll use classical CPU vectorization for the vector loops.

-Mat