Force a loop to vectorize

I have a loop body with many statements that is refusing to vectorize; the output with -Minfo says that it's not beneficial to vectorize. I want to see if this is really the case, so I want to force the compiler to generate vector instructions. Is there a way to bypass the compiler's cost model?

I have tried adding the following pragmas to the loop:
#pragma loop ivdep
#pragma loop vector

I have also tried adding the following flags to the compilation: -fast -fastsse -Mvect=simd:256 -Mvect=nosizelimit.

I am not posting the code here since it is auto-generated, too messy to read, and not really relevant to my query.

Hi Mahesh,

It appears that you tried all the usual ways to force vectorization. At this point we would need to see the code to better understand what’s going on. Can you send a reproducing example to PGI Customer Service (


Hello Mat,

I’m reviving this thread just to check whether there is something different with the latest NVHPC compilers.
I have created an example kernel that has the same computation pattern as the original, more complex kernel for which we are trying to generate vectorized code:

struct Instance {
    double* __restrict__ a;
    double* __restrict__ b;
    double* __restrict__ c;
    double* __restrict__ d;
    int* __restrict__ b_index;
    int* __restrict__ node_index;
    int node_count;
};

void kernel(void* __restrict__ mech) {
    auto inst = static_cast<Instance*>(mech);
    int id;
    int node_id, b_id;

    #pragma omp simd
    for (id = 0; id < inst->node_count; id++) {
        node_id = inst->node_index[id];
        b_id = inst->b_index[id];
        inst->b[b_id] = inst->b[b_id] + inst->c[id];
        inst->a[node_id] = inst->a[node_id] + 10;
    }
}
I’m compiling the above code using nvc++ 22.3 and the following command trying to target an Intel CascadeLake CPU:

nvc++ -fast -O3 -mp=autopar -tp=skylake -Msafeptr=all -Minfo -Mvect=simd:512,gather,prefetch,nosizelimit nvhpc_vect_test.cpp -fpic -shared -o

But I get the following:

kernel(void *):
     17, Loop not vectorized: unprofitable for target

I’ve also tried the flags and options mentioned above with the same result.
Is there any other flag I could use with the nvc++ compiler to force vectorization by ignoring the internal cost model?

Thank you very much in advance for your answer.


Hi Ioannis,

Is there any other flag I could use with the nvc++ compiler to force vectorization by ignoring the internal cost model?

Nope, still no way to force the compiler to vectorize.

In this case, I agree with the compiler that you wouldn't want to vectorize it. The problem is the non-consecutive indices being used to access the data. The compiler would end up having to do 4 or 8 separate loads (depending on the SIMD width), which would be no different than running the loop sequentially (non-vectorized). If anything, it would be slower given the need for a residual loop. If "id" were used as the index for all accesses, then it would vectorize.
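For illustration, here is a hedged sketch (the function name and signature are hypothetical, not from the original kernel) of the unit-stride variant described above, where `id` indexes every array directly. This form needs no gather/scatter and is the shape compilers will typically vectorize:

```cpp
// Hypothetical unit-stride variant: every array is indexed directly by `id`,
// so all loads and stores are contiguous and the loop can be vectorized
// without gather/scatter instructions.
void kernel_contiguous(double* __restrict__ a, double* __restrict__ b,
                       const double* __restrict__ c, int n) {
    #pragma omp simd
    for (int id = 0; id < n; id++) {
        b[id] = b[id] + c[id];   // unit-stride load/store
        a[id] = a[id] + 10.0;    // unit-stride load/store
    }
}
```

With the indirect `node_index`/`b_index` lookups removed, -Minfo would be expected to report the loop as vectorized on a SIMD target.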


Hello Mat,

Thank you for your response and for confirming that there is no special option to force the vectorization.

It’s true that in the above simple example there will be gather/scatter instructions and vectorization may not be efficient.

Just some more context on my question: we are analyzing the performance of different compilers (Intel, Clang, NVHPC, GCC) on some of our kernels in the NEURON simulator, and we see that for one specific kernel NVHPC is about 3x slower than the others. For this one kernel, the other compilers are able to vectorize the loop but NVHPC refuses to do so, hence I was wondering if I could somehow force the vectorization and analyze the performance.


Hmm, I did look at the compiler feedback information with g++ (v11.3), but it gives messages indicating that it can’t vectorize this loop either:

% g++ -O3 -fopenmp -ftree-vectorize -fopt-info-vec -fopt-info-missed-vec test.cpp -c
test.cpp:17:21: missed: couldn't vectorize loop
test.cpp:20:37: missed: not vectorized: no vectype for stmt: _11 = *_10;
 scalar_type: double

Often if there’s a 3x difference in performance of a loop, it’s due to lack of vectorization. But how are you determining that the other compilers are indeed vectorizing? Could the performance delta be due to something else?

Hello Mat,

Indeed, GCC 11.3 with the above flags doesn't manage to vectorize the loop. If we provide a specific CPU target architecture (-march=skylake-avx512 -mtune=skylake), GCC does vectorize the code.
We've verified the vectorization by looking at the assembly generated by the various compilers on godbolt. We can see in this godbolt link that the code generated by GCC 11.3 and Clang 13.0.0 uses vector instructions and 512-, 256-, or 128-bit registers (zmm, ymm, and xmm), which also translates into improved measured performance.