Hi,

In an application I have two nested for-loops.

The outer loop has n=128 iterations, the inner one k=512000.

I measured the best performance when I parallelised the outer one with gang and the inner one with vector.

However, when I skip the first m iterations of the outer loop, I get performance behaviour I don’t understand:

For m=0 and m=1 the time for the whole kernel is about the same.

For m = 2 the time drops to about half and then stays roughly constant for m = 3, …, m = 127.

Here is a runnable example:

```
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h>

int main() {
    const int n = 128;
    const int k = 512000;
    /* n * k doubles = 512 MB -- far too large for the stack, so heap-allocate */
    double *mat = (double *) malloc(sizeof(double) * n * k);
    double res[n];
    int i;
    for (i = 0; i < n * k; i++) {
        mat[i] = 2.1337;
    }
    #pragma acc data copyout(res[0:n]) copyin(mat[0:n*k])
    {
        int m;
        for (m = 0; m < n; m++) {
            double start = omp_get_wtime();
            #pragma acc parallel present(res[0:n], mat[0:n*k])
            #pragma acc loop gang
            for (i = m; i < n; i++) {
                int j;
                double sum = 0.0;
                #pragma acc loop vector reduction(+:sum)
                for (j = 0; j < k; j++) {
                    sum += pow(mat[i * k + j], i);
                }
                res[i] = sum;
            }
            double end = omp_get_wtime();
            printf("m = %d, time = %f ms\n", m, (end - start) * 1.0e3);
        }
    } /* acc data */
    free(mat);
    return 0;
}
```

In a similar CUDA implementation the time decreases with increasing m, as I would expect.

Why does OpenACC behave differently here?

Thanks,

Fabian