Hi,
In an application I have two nested for-loops.
The outer loop has n=128 iterations, the inner one k=512000.
I measured the best performance when I parallelised the outer one with gang and the inner one with vector.
However, when I skip the first m iterations of the outer loop, I see performance behaviour that I don’t understand:
For m = 0 and m = 1 the time for the whole kernel is about the same.
For m = 2 the time drops to about half and then stays roughly the same for m = 3, …, m = 127.
Here is a runnable example:
#include <math.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main () {
    const int n = 128;
    const int k = 512000;
    /* n * k doubles are about 524 MB, too large for the stack, so use the heap */
    double *mat = malloc((size_t)n * k * sizeof *mat);
    double res[n];
    int i;
    if (mat == NULL) {
        return 1;
    }
    for (i = 0; i < n * k; i++) {
        mat[i] = 2.1337;
    }
    #pragma acc data copyout(res[0:n]) copyin(mat[0:n*k])
    {
        int m;
        /* m is the number of outer iterations skipped in this run */
        for (m = 0; m < n; m++) {
            double start = omp_get_wtime();
            #pragma acc parallel present(res[0:n])
            #pragma acc loop gang
            for (i = m; i < n; i++) {
                int j;
                double sum = 0.0;
                #pragma acc loop vector reduction(+:sum)
                for (j = 0; j < k; j++) {
                    sum += pow(mat[i * k + j], i);
                }
                res[i] = sum;
            }
            double end = omp_get_wtime();
            printf("m = %d, time = %fms\n", m, (end - start) * 1.0e3);
        }
    } /* acc data */
    free(mat);
    return 0;
}
In a similar CUDA implementation the time decreases with increasing m, as I would expect.
Why does OpenACC behave differently here?
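To make the expectation concrete, here is a minimal sketch of the scaling I would naively expect. The helper names expected_fraction and expected_time_ms are hypothetical and not part of the program above. The sketch assumes that every outer iteration costs the same and that time scales linearly with the number of remaining iterations, which may not hold exactly (pow(x, i) can cost different amounts for different i, and the GPU may not be fully occupied).

```c
#include <assert.h>

/* Hypothetical helper: fraction of the outer-loop work that remains when
   the first m of n iterations are skipped. Assumes equal cost per outer
   iteration, which is only an approximation for the kernel above. */
double expected_fraction(int n, int m) {
    assert(n > 0 && m >= 0 && m <= n);
    return (double)(n - m) / (double)n;
}

/* Naive estimate of the kernel time for a given m, scaled from the time
   t0_ms measured at m = 0. */
double expected_time_ms(double t0_ms, int n, int m) {
    return t0_ms * expected_fraction(n, m);
}
```

Under these assumptions, n = 128 and m = 2 leave 126/128 of the work, so the time should barely change. A halving would only be expected around m = 64, which is why the drop at m = 2 surprises me.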
Thanks,
Fabian