Seems odd that the compiler can optimize for the accelerator but not the host.
With the OpenACC “vector” clause, you’re telling the compiler to vectorize the loop on the device, so it doesn’t need to perform a dependency analysis. If you had the compiler try to auto-vectorize the loop for the device, you’d see a similar message.
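To illustrate, here is a minimal sketch (not your code) of a loop with a runtime-computed index where the schedule is stated explicitly. The function name add_indexed, the idx array, and the data clauses are assumptions made up for this example, and the loop is only correct if idx holds no repeated values. Because the programmer asserts the schedule, the compiler takes it on trust instead of proving independence itself.

```c
#include <assert.h>

/* Hypothetical example: a[] has m elements, idx[] has n entries.
   Assumption: idx[] contains no duplicates, so iterations never touch
   the same element of a[]. The compiler cannot verify this. */
void add_indexed(float *a, const int *idx, int n, int m, float val) {
    assert(a && idx && n >= 0 && m >= 0);

    /* The explicit gang/vector schedule is the programmer's request;
       without -acc the pragma is ignored and the loop runs serially. */
#pragma acc parallel loop gang vector copy(a[0:m]) copyin(idx[0:n])
    for (int i = 0; i < n; i++) {
        a[idx[i]] += val;
    }
}
```

If idx did contain duplicates, the device version would have a race, which is exactly the case the host auto-vectorizer is guarding against.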
That said, if I did want things to work on the host, what changes would I need to make?
The problem is the computed index. Since it is evaluated at runtime, the compiler can’t tell whether the indices overlap between iterations, so it must presume that they do, which makes the loop unsafe to vectorize. You can override the dependency check by using the ivdep pragma on the inner loop, but the compiler still won’t auto-vectorize the loop since it can’t tell whether the index is contiguous across the stride-1 dimension of the array.
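For reference, this is roughly where the ivdep pragma would go. It is a sketch only: the function name add_val_flat is made up, and it assumes the conventional row-major flattened index (ny*nz*i) + (nz*j) + k. The pragma only asserts that there are no loop-carried dependences; as noted above, it does not by itself guarantee that the loop gets vectorized.

```c
#include <assert.h>

/* Hypothetical helper: adds val to every element of a flattened
   nx*ny*nz array using a computed index. */
void add_val_flat(float *A1, int nx, int ny, int nz, float val) {
    assert(A1 && nx >= 0 && ny >= 0 && nz >= 0);
    int i, j, k;
    int idx;

    for (i = 0; i < nx; i++) {
        for (j = 0; j < ny; j++) {
            /* Tells the compiler to ignore assumed dependences in the
               inner loop; compilers that do not know it ignore it. */
#pragma ivdep
            for (k = 0; k < nz; k++) {
                idx = (ny * nz * i) + (nz * j) + k; /* assumed layout */
                A1[idx] = A1[idx] + val;
            }
        }
    }
}
```

Check the -Minfo output after adding the pragma to see what the compiler actually did with the loop.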
To fix this, don’t use computed indices; make this a 3D array instead.
Example:
% cat test.c
#include <stdio.h>
#include <stdlib.h>

void foo(float ***A3, float * A1, int nx, int ny, int nz, float val) {
  int i,j,k;
  int idx;

#pragma acc kernels loop default(present)
  for (i=0;i<nx;i++) {
    for (j=0;j<ny;j++) {
      for (k=0;k<nz;k++) {
        idx = (ny*nz*i) + (nz*k) + k;
        A1[idx] = A1[idx]+val;
  }}}

#pragma acc kernels loop default(present)
  for (i=0;i<nx;i++) {
    for (j=0;j<ny;j++) {
      for (k=0;k<nz;k++) {
        A3[i][j][k] = A3[i][j][k]+val;
  }}}
}
% nvc -c test.c -Minfo -fast
foo:
      9, Loop not fused: dependence chain to sibling loop
     11, Loop not vectorized: data dependency
         Loop unrolled 4 times
     19, Generated vector simd code for the loop
% nvc -c test.c -Minfo=accel -fast -acc
foo:
      9, Loop carried dependence due to exposed use of A1[:*] prevents parallelization
         Accelerator serial kernel generated
         Generating Tesla code
          9, #pragma acc loop seq
         10, #pragma acc loop seq
         11, #pragma acc loop seq
      9, Generating default present(A1[:])
     10, Loop carried dependence due to exposed use of A1[:*] prevents parallelization
     11, Complex loop carried dependence of A1-> prevents parallelization
         Loop carried dependence due to exposed use of A1[:*] prevents parallelization
     17, Loop is parallelizable
         Generating default present(A3[:nx][:ny][:nz])
     18, Loop is parallelizable
     19, Loop is parallelizable
         Generating Tesla code
         17, #pragma acc loop gang, vector(128) collapse(3) /* blockIdx.x threadIdx.x */
         18,   /* blockIdx.x threadIdx.x auto-collapsed */
         19,   /* blockIdx.x threadIdx.x auto-collapsed */
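If you’re wondering how to build the float *** array on the host, one possible way is sketched below. The names alloc3d and free3d are made up for this example, and this is just one common layout: a single contiguous data block plus two levels of pointer tables, so A3[i][j][k] walks stride-1 memory in k. It covers host allocation only; getting the array onto the device for default(present) is a separate step that isn’t shown here.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: allocates an nx*ny*nz float array, zeroed,
   addressable as A3[i][j][k]. Returns NULL on allocation failure. */
float ***alloc3d(int nx, int ny, int nz) {
    assert(nx > 0 && ny > 0 && nz > 0);

    float ***A3  = malloc((size_t)nx * sizeof *A3);        /* plane pointers */
    float **rows = malloc((size_t)nx * ny * sizeof *rows); /* row pointers   */
    float *data  = calloc((size_t)nx * ny * nz, sizeof *data);
    if (!A3 || !rows || !data) {
        free(A3);
        free(rows);
        free(data);
        return NULL;
    }

    for (int i = 0; i < nx; i++) {
        A3[i] = rows + (size_t)i * ny;
        for (int j = 0; j < ny; j++) {
            /* Each row points into the one contiguous data block. */
            A3[i][j] = data + ((size_t)i * ny + j) * nz;
        }
    }
    return A3;
}

/* Frees an array created by alloc3d. */
void free3d(float ***A3) {
    if (!A3) return;
    free(A3[0][0]); /* data block */
    free(A3[0]);    /* row pointer table */
    free(A3);       /* plane pointer table */
}
```

With this you’d call something like A3 = alloc3d(nx, ny, nz), pass A3 to foo, and release it with free3d(A3) when done.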