Nvc not vectorizing inner loop due to index calculation

Hello,

I have some C code with OpenACC pragmas that I’m compiling with nvc, and I’m wondering whether I’m missing out on some performance because the inner loop isn’t vectorized. My code looks like the following (just an example, but I think it has all the relevant structure):

#pragma acc loop collapse(3) gang vector
for (int i…)
  for (int j…)
    for (int k…) {
      idx = row * col * i + col * j + k;
      array_1[idx] = array_2[idx] + array_3[idx];
    }

My variable idx is an int, and the arrays are double or float.

When I compile with -Minfo, the compiler tells me the interior loop isn’t vectorized due to mixed data types. If I replace every “idx” with “row * col * i + col * j + k” and get rid of the idx variable, the message goes away, but my code actually runs slower, I assume because of the additional cost of computing the index for all three arrays. It seems I’m missing out on some of the performance benefit of vectorizing the interior loop, but if the overhead of computing the index is greater than the benefit of vectorization, then taking this line out doesn’t make sense. Does anyone have any suggestions to solve this problem?

Hi luke_long,

This message comes from the host compilation, not the device compilation, so it can be ignored (unless you’re also trying to optimize host performance).

Compiling with “-Minfo=accel” will give you just the compiler feedback for the device compilation.

-Mat

Thanks for the response, Mat.

Indeed, things look OK on the accelerator. That said, if I did want things to work on the host, what changes would I need to make? Seems odd that the compiler can optimize for the accelerator but not the host.

> Seems odd that the compiler can optimize for the accelerator but not the host.

With the OpenACC “vector” clause, you’re telling the compiler to vectorize the loop on the device, so it doesn’t need to perform a dependency analysis. If you had the compiler try to auto-vectorize the loop for the device, you’d see a similar message.
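
To make the contrast concrete, here is a minimal sketch (not code from the original post) of the same flattened-index loop written with an explicit parallel loop directive. The function name add_val, the copy clause, and the array shape are assumptions for illustration. Because the directive itself asserts that the iterations are independent, the compiler takes the programmer's word for it instead of analyzing idx:

```c
#include <assert.h>

/* Hypothetical helper: adds val to every element of a flattened
   nx*ny*nz array. Name and data clause are illustrative assumptions. */
void add_val(float *a, int nx, int ny, int nz, float val) {
    assert(nx >= 0 && ny >= 0 && nz >= 0);
    /* "parallel loop ... gang vector" is the programmer's promise that
       no two iterations touch the same element. */
    #pragma acc parallel loop gang vector collapse(3) copy(a[0:nx*ny*nz])
    for (int i = 0; i < nx; i++) {
        for (int j = 0; j < ny; j++) {
            for (int k = 0; k < nz; k++) {
                int idx = ny * nz * i + nz * j + k;  /* computed at runtime */
                a[idx] = a[idx] + val;
            }
        }
    }
}
```

Built without -acc, the pragma is ignored and the loop runs as plain host code.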

> That said, if I did want things to work on the host, what changes would I need to make?

The problem is the computed index. Since it is evaluated at runtime, the compiler can’t tell whether the indices overlap between iterations, so it must assume that they do, which makes the loop unsafe to vectorize. You can override the dependency check with the ivdep pragma on the inner loop, but the compiler still won’t auto-vectorize the loop, since it can’t tell whether the index is contiguous across the stride-1 dimension of the array.
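
A minimal sketch of that override, with one extra step that is not from this thread: the row offset is computed once per (i, j) pair so that the inner loop indexes with k alone, and the pointers are marked restrict. The function name add_rows and its parameters are hypothetical, restrict is only valid if the three arrays really don't overlap, and whether nvc then reports the inner loop as vectorized would need to be confirmed with -Minfo:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical variant of the flat-array update: out = a + b.
   Assumes out, a and b do not overlap (that is what restrict claims). */
void add_rows(float *restrict out, const float *restrict a,
              const float *restrict b, int nx, int ny, int nz) {
    assert(nx >= 0 && ny >= 0 && nz >= 0);
    for (int i = 0; i < nx; i++) {
        for (int j = 0; j < ny; j++) {
            /* Offset of row (i, j), computed once instead of per element. */
            size_t base = ((size_t)i * ny + j) * nz;
            const float *pa = a + base;
            const float *pb = b + base;
            float *po = out + base;
            /* Asks the compiler to skip its dependence check on this loop. */
            #pragma ivdep
            for (int k = 0; k < nz; k++) {
                po[k] = pa[k] + pb[k];  /* stride-1 in k */
            }
        }
    }
}
```

Hoisting the offset also avoids recomputing the full index for each of the three arrays in every iteration, which was the overhead concern in the original question.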

To fix this, don’t use computed indices; make this a 3D array instead.

Example:

% cat test.c
#include <stdio.h>
#include <stdlib.h>

void foo(float ***A3, float * A1, int nx, int ny, int nz, float val) {

    int i,j,k;
    int idx;
#pragma acc kernels loop default(present)
    for (i=0;i<nx;i++) {
      for (j=0;j<ny;j++) {
       for (k=0;k<nz;k++) {
         idx = (ny*nz*i) + (nz*j) + k;
         A1[idx] = A1[idx]+val;
    }}}

#pragma acc kernels loop default(present)
    for (i=0;i<nx;i++) {
      for (j=0;j<ny;j++) {
       for (k=0;k<nz;k++) {
         A3[i][j][k] = A3[i][j][k]+val;
    }}}


}
% nvc -c test.c -Minfo -fast
foo:
      9, Loop not fused: dependence chain to sibling loop
     11, Loop not vectorized: data dependency  
         Loop unrolled 4 times
     19, Generated vector simd code for the loop  
% nvc -c test.c -Minfo=accel -fast -acc
foo:
      9, Loop carried dependence due to exposed use of A1[:*] prevents parallelization
         Accelerator serial kernel generated
         Generating Tesla code
          9, #pragma acc loop seq
         10, #pragma acc loop seq
         11, #pragma acc loop seq
      9, Generating default present(A1[:])
     10, Loop carried dependence due to exposed use of A1[:*] prevents parallelization
     11, Complex loop carried dependence of A1-> prevents parallelization
         Loop carried dependence due to exposed use of A1[:*] prevents parallelization
     17, Loop is parallelizable
         Generating default present(A3[:nx][:ny][:nz])
     18, Loop is parallelizable
     19, Loop is parallelizable
         Generating Tesla code
         17, #pragma acc loop gang, vector(128) collapse(3) /* blockIdx.x threadIdx.x */
         18,   /* blockIdx.x threadIdx.x auto-collapsed */
         19,   /* blockIdx.x threadIdx.x auto-collapsed */
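
For completeness, the float ***A3 argument above has to be built somewhere. One possible way, sketched here with hypothetical helpers alloc3d and free3d that are not part of the original example, is to back the pointer tables with a single contiguous block so that A3[i][j][k] walks memory in stride-1 order along k:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: builds a zero-initialized nx x ny x nz array that
   can be indexed as a[i][j][k]. Returns NULL if an allocation fails. */
float ***alloc3d(int nx, int ny, int nz) {
    assert(nx > 0 && ny > 0 && nz > 0);
    float ***planes = malloc((size_t)nx * sizeof *planes);
    float **rows = malloc((size_t)nx * ny * sizeof *rows);
    float *data = calloc((size_t)nx * ny * nz, sizeof *data);
    if (!planes || !rows || !data) {
        free(planes);
        free(rows);
        free(data);
        return NULL;
    }
    for (int i = 0; i < nx; i++) {
        planes[i] = rows + (size_t)i * ny;
        for (int j = 0; j < ny; j++) {
            /* Each row pointer aims into the single contiguous data block. */
            planes[i][j] = data + ((size_t)i * ny + j) * nz;
        }
    }
    return planes;
}

/* Releases an array created by alloc3d. */
void free3d(float ***a) {
    if (a) {
        free(a[0][0]);  /* contiguous data block */
        free(a[0]);     /* row pointer table */
        free(a);        /* plane pointer table */
    }
}
```

Typical use would be float ***A3 = alloc3d(nx, ny, nz); followed by free3d(A3); once the array is no longer needed. The default(present) clause in the example assumes the array is already on the device; that data movement is not shown here.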