Force a loop to vectorize

mravishankar · August 31, 2014, 2:41am

I have a loop body with a lot of statements that refuses to vectorize; the -Minfo output says it is not beneficial to vectorize. I want to see if this is really the case, so I want to force the compiler to generate vector instructions. Is there a way to bypass the compiler cost model?

I have tried to add the following pragmas to the loop
#pragma loop ivdep
#pragma loop vector

I have also tried to add the following flags to the compilation: "-fast -fastsse -Mvect=simd:256 -Mvect=nosizelimit".

I am not posting the code here since the code is auto-generated and is too messy to read, and is not really relevant to my query.

MatColgrove · September 3, 2014, 2:54pm

Hi Mahesh,

It appears that you tried all the usual ways to force vectorization. At this point we would need to see the code to better understand what’s going on. Can you send a reproducing example to PGI Customer Service (trs@pgroup.com)?

Thanks,
Mat

iomagkanaris · July 22, 2022, 12:03pm

Hello Mat,

I’m reviving this thread just to check whether there is something different with the latest NVHPC compilers.
I have created an example kernel that has the same computation pattern as the original complex kernel we are trying to generate vectorized code for here:

struct Instance {
    double* __restrict__ a;
    double* __restrict__ b;
    double* __restrict__ c;
    double* __restrict__ d;
    int* __restrict__ b_index;
    int* __restrict__ node_index;
    int node_count;
};

void kernel(void* __restrict__ mech) {
    auto inst = static_cast<Instance*>(mech);
    int id;
    int node_id, b_id;

    #pragma omp simd
    for (id = 0; id < inst->node_count; id++) {
        node_id = inst->node_index[id];
        b_id = inst->b_index[id];
        inst->b[b_id] = inst->b[b_id] + inst->c[id];
        inst->a[node_id] = inst->a[node_id] + 10;
    }
}

I’m compiling the above code using nvc++ 22.3 and the following command trying to target an Intel CascadeLake CPU:

nvc++ -fast -O3 -mp=autopar -tp=skylake -Msafeptr=all -Minfo -Mvect=simd:512,gather,prefetch,nosizelimit nvhpc_vect_test.cpp -fpic -shared -o nvhpc_vect_test_lib.so

But I get the following:

kernel(void *):
     17, Loop not vectorized: unprofitable for target

I’ve also tried the flags and options mentioned above with the same result.
Is there any other flag I could use with the nvc++ compiler to force vectorization by ignoring the internal cost model?

Thank you very much in advance for your answer.

Ioannis

MatColgrove · July 22, 2022, 7:41pm

Hi Ioannis,

“Is there any other flag I could use with the nvc++ compiler to force vectorization by ignoring the internal cost model?”

Nope, still no way to force the compiler to vectorize.

In this case, I agree with the compiler that you wouldn’t want to vectorize it. The problem is the non-consecutive indices used to access the data. The vectorized code would end up having to do 4 or 8 separate loads (depending on the SIMD size), which would be no different than if the loop was run sequentially (non-vectorized). If anything, it would be slower given the need for a residual loop. If “id” was used as the index for all accesses, then it would vectorize.
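
To make the distinction concrete, here is a minimal sketch of the two access patterns. The function names are hypothetical, and it uses std::vector instead of the Instance struct from the post above; whether a given compiler actually vectorizes either loop still depends on the flags and the target.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Indirect variant: same access pattern as the posted kernel.
// b is read and written through b_index, so a vectorized version
// would need a gather for the load and a scatter for the store.
void update_indirect(std::vector<double>& b,
                     const std::vector<double>& c,
                     const std::vector<int>& b_index) {
    assert(b_index.size() == c.size());
    const std::size_t n = c.size();
    for (std::size_t i = 0; i < n; ++i) {
        b[b_index[i]] += c[i];
    }
}

// Contiguous variant: the loop counter indexes every array, so
// consecutive iterations touch consecutive memory locations.
void update_contiguous(std::vector<double>& b,
                       const std::vector<double>& c) {
    assert(b.size() == c.size());
    const std::size_t n = c.size();
    for (std::size_t i = 0; i < n; ++i) {
        b[i] += c[i];
    }
}
```

The two functions give the same result only when b_index is the identity mapping; the point of the sketch is the memory access pattern, not a drop-in replacement for the original kernel.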

-Mat

iomagkanaris · July 25, 2022, 9:55am

Hello Mat,

Thank you for your response and for confirming that there is no special option to force the vectorization.

It’s true that in the above simple example there will be gather/scatter instructions and vectorization may not be efficient.

Just some more context regarding my question. We are analyzing the performance of different compilers (Intel, Clang, NVHPC, GCC) for some of our kernels in the NEURON simulator, and we see that for one specific kernel NVHPC is about 3x slower than the other compilers. For this one kernel, the other compilers are able to vectorize the loop but NVHPC refuses to do so. Hence I was wondering if I could somehow force the vectorization and analyze the performance.

Ioannis

MatColgrove · July 25, 2022, 7:23pm

Hmm, I did look at the compiler feedback information with g++ (v11.3), but it gives messages indicating that it can’t vectorize this loop either:

% g++ -O3 -fopenmp -ftree-vectorize -fopt-info-vec -fopt-info-missed-vec test.cpp -c
test.cpp:17:21: missed: couldn't vectorize loop
test.cpp:20:37: missed: not vectorized: no vectype for stmt: _11 = *_10;
 scalar_type: double

Often if there’s a 3x difference in performance of a loop, it’s due to lack of vectorization. But how are you determining that the other compilers are indeed vectorizing? Could the performance delta be due to something else?

iomagkanaris · July 26, 2022, 10:42am

Hello Mat,

Indeed, GCC 11.3 with the above flags doesn’t manage to vectorize the loop. If we provide certain CPU target architecture flags (-march=skylake-avx512 -mtune=skylake), GCC manages to vectorize the code.
We’ve made sure that the code is vectorized by looking at the assembly code generated by the various compilers in godbolt. We can see in this link from godbolt that the code generated by GCC 11.3 and Clang 13.0.0 uses vector instructions and 512-, 256- or 128-bit registers (zmm, ymm and xmm), which also translates to improved measured performance.

Ioannis
