Hi,
I am working on an OpenACC code optimized to run on NVIDIA GPUs that I would like to also run on Intel CPUs to demonstrate its portability, but I am getting quite poor performance.
Analyzing the compiler output when compiling with -ta=multicore and -tp=haswell, I noticed that the most compute-intensive functions were not parallelized/vectorized.
By contrast, they are when using -ta=tesla.
In particular, the core of these functions is made up of 4 nested loops such as:
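Schematically, it looks like the following (hypothetical names, bounds, and body, not the actual code; the line numbers 50/53/56/59 in the compiler report correspond to the four `for` lines):

```c
#include <stddef.h>

#define N 8  /* hypothetical extent; the real bounds differ */

/* Hypothetical sketch of the 4-deep loop nest. */
void my_function(double *restrict out, const double *restrict in)
{
    #pragma acc parallel loop gang                   /* line 50 */
    for (int i = 0; i < N; i++) {
        #pragma acc loop independent                 /* line 53 */
        for (int j = 0; j < N; j++) {
            #pragma acc loop independent             /* line 56 */
            for (int k = 0; k < N; k++) {
                #pragma acc loop independent vector  /* line 59 */
                for (int l = 0; l < N; l++) {
                    size_t idx = (((size_t)i * N + j) * N + k) * N + l;
                    out[idx] = 2.0 * in[idx];
                }
            }
        }
    }
}
```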
and for all of them the compiler output is similar to:
my_function:
    50, Loop is parallelizable
        Generating Multicore code
        50, #pragma acc loop gang
    50, Loop not vectorized/parallelized: too deeply nested
    53, Loop is parallelizable
    56, Loop is parallelizable
    59, Loop is parallelizable
        Loop not vectorized: data dependency
Why does the compiler report “#pragma acc loop gang” as the pragma associated with the outer loop, whatever pragma I try? For example, if I try “#pragma acc loop independent collapse(3)”, it still reports “#pragma acc loop gang”. Is it ignoring the specified pragmas?
Moreover, why does the compiler report “Loop not vectorized: data dependency” for the inner loop even though I specified “#pragma acc loop independent vector”? Shouldn’t the “independent” clause be enough to tell it not to check for data dependencies between loop iterations?
Currently, for multicore targets, the compiler is limited to a single gang loop. We do plan on expanding this. Also, we currently use classic auto-vectorization; in the future we expect the compiler to use the “vector” clause as an indication to vectorize the loop. For now, consider using the C99 “restrict” keyword and/or the “-Msafeptr” flag to help the compiler determine the independence of loop iterations.
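For example (a hypothetical function, not your code), “restrict” promises the compiler that the output pointer never aliases the inputs, so loop iterations are independent and the auto-vectorizer is free to act:

```c
#include <stddef.h>

/* Hypothetical example: 'restrict' guarantees out never aliases a or b,
 * so the iterations of this loop carry no hidden dependences. */
void axpy(size_t n, double alpha,
          const double *restrict a,
          const double *restrict b,
          double *restrict out)
{
    for (size_t i = 0; i < n; i++)
        out[i] = alpha * a[i] + b[i];
}
```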
We managed to resolve the “Loop not vectorized/parallelized: too deeply nested” warning by adding this option to the compiler flags (4 would probably be enough):
-Mvect=levels:5
Moreover, we updated the pragmas to use the tile clause, in order to adhere more strictly to the OpenACC specifications:
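Schematically (again hypothetical names and bounds), the updated pragmas look like this: “tile” strip-mines the tightly nested loops into tiles, and “independent” asserts the iterations carry no dependences.

```c
#include <stddef.h>

#define N 8  /* hypothetical extent */

/* Hypothetical sketch of the updated pragmas: tile(4,4,4,4) covers the
 * whole tightly nested 4-deep loop nest. */
void my_function(double *restrict out, const double *restrict in)
{
    #pragma acc parallel loop independent tile(4, 4, 4, 4)
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                for (int l = 0; l < N; l++) {
                    size_t idx = (((size_t)i * N + j) * N + k) * N + l;
                    out[idx] = in[idx] + 1.0;
                }
}
```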
Despite that, and although all the pointers passed as arguments to the function were declared __restrict, we still got the “Loop not vectorized: data dependency” warning.
Using -Msafeptr as a compiler option changed the compiler output, but it is still not vectorizing the code.
The output is now:
    50, Loop is parallelizable
        Generating Multicore code
        50, #pragma acc loop gang
    53, Loop is parallelizable
    56, Loop is parallelizable
    59, Loop is parallelizable
        Loop not vectorized: mixed data types
Do you have any other suggestions?
P.S.
The same code compiles and runs perfectly fine on GPUs, and just by changing the pragmas it can also be vectorized by the Intel compiler targeting CPUs, so I guess it should be “just” a matter of persuading the PGI compiler that it can vectorize it.
We’ve improved the compiler vectorization over the last few years and do support mixed data types, in particular things like mixed int and double. I’m not sure why we’re not able to vectorize it in this case.
Can you post a snippet of the code, including the data types of the variables being used? Also, what version of the compiler are you using?
The code is quite complex, but maybe the important information is that in the body of the loop, apart from some integer computations for array indexing, all computations are performed on array elements of user-defined data structures, which actually contain the C99 “double complex” data type.
Maybe the compiler has difficulty understanding that double complex elements actually just map to pairs of doubles?
We are able to vectorize the double complex data type (which is a 16-byte type; doubles are 8 bytes). However, we currently can’t vectorize 16-byte types mixed with 4-byte types. Do you have any floats in the expression? If so, can you make them doubles to see if that works around the problem?
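For instance (a hypothetical illustration, not your code), a single float constant or temporary in a double complex expression is enough to mix a 4-byte type with a 16-byte one; promoting it to double keeps every operand at 8 or 16 bytes:

```c
#include <complex.h>
#include <stddef.h>

/* Hypothetical illustration of the mixed-type issue: a float 'scale'
 * would mix a 4-byte type into a 16-byte double complex expression;
 * declaring it double avoids the mix. */
void scale_field(size_t n, double complex *restrict z)
{
    const double scale = 0.5;  /* was: const float scale = 0.5f; */
    for (size_t i = 0; i < n; i++)
        z[i] = scale * z[i];
}
```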
Is this limitation still present, or do newer PGI versions allow collapsing two (or more) outer loops to increase the number of gangs?
I’ve been talking to our compiler engineers about this. According to them collapse should be working with a gang loop when targeting multicore. However, I have seen cases where the runtime profiles appear as if the collapse was ineffective.
I’ll continue to press them since it should be working but may not be in all cases. Are you able to send me a reproducing example (trs@pgroup.com)? That may help.
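For reference, a minimal sketch of the pattern in question (hypothetical names and bounds): collapse(2) fuses the i and j loops into one iteration space, so the multicore target can schedule NI*NJ gang iterations instead of only NI.

```c
#define NI 32  /* hypothetical extents */
#define NJ 32

/* Hypothetical sketch: two outer loops collapsed into one gang loop. */
void update(double *restrict a)
{
    #pragma acc parallel loop gang collapse(2)
    for (int i = 0; i < NI; i++)
        for (int j = 0; j < NJ; j++)
            a[i * NJ + j] *= 2.0;
}
```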
Hi,
I guess in our code it is working, since we see different execution times when using it or not… I was just asking to know whether it was worth giving it a try ;)
Anyhow, how would you profile the resulting executable to identify how the loops get divided across threads?
I tried pgprof and then the visual profiler, but I was not able to find this information… I was expecting to find it presented similarly to grids and blocks when targeting GPUs. Do I need to pass any particular flags to pgprof?
For OpenACC multicore targets, we use our OpenMP back-end, where each “gang” gets mapped to an OpenMP thread. With more recent versions of pgprof, you can profile OpenACC multicore and it will show up on the timeline. You can also see the times for each of the individual threads from the “CPU” tab’s thread drop-down box.
You’ll only be able to see the gang-level parallelism, since we use classical CPU vectorization for the vector loops.
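As a quick sketch (the binary name is hypothetical), the number of threads backing the gangs on a multicore target can be controlled with the ACC_NUM_CORES environment variable before running under the profiler:

```shell
# Hypothetical usage sketch: ACC_NUM_CORES sets how many OpenMP threads
# back the OpenACC gangs on a multicore target.
export ACC_NUM_CORES=8
# Then profile the (hypothetical) binary to see per-thread times:
#   pgprof ./my_app
echo "ACC_NUM_CORES=$ACC_NUM_CORES"
```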