However, I think that a reduction on tmp_x, tmp_y and tmp_z would be required. If I uncomment the reduction clause, I get exactly the same compiler feedback (i.e. nothing about added reductions), but the results are wrong.
In order to parallelize the inner loop, the compiler must be generating the reductions automatically; otherwise you’d be getting wrong answers without the reduction clause. I’m not sure why adding the reduction clause would then yield wrong answers. It seems like a compiler error.
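For reference, here is a minimal sketch of what an explicit reduction on the three scalars might look like. Your actual loop isn't shown in this post, so the function name sum_components, the input arrays x, y, z and the out parameter are all made up for illustration; only tmp_x, tmp_y and tmp_z come from your code. Without an OpenACC-enabled compiler the pragma is ignored and the loop simply runs serially.

```c
/* Hypothetical example: sums each component of n 3-D points.
   Only the names tmp_x, tmp_y, tmp_z are taken from the thread. */
void sum_components(int n, const float *x, const float *y, const float *z,
                    float out[3])
{
    float tmp_x = 0.0f, tmp_y = 0.0f, tmp_z = 0.0f;
    int j;

    /* Each scalar is accumulated across iterations, so each one is
       listed in the reduction clause with the + operator. */
    #pragma acc parallel loop reduction(+:tmp_x,tmp_y,tmp_z) \
        copyin(x[0:n], y[0:n], z[0:n])
    for (j = 0; j < n; ++j) {
        tmp_x += x[j];
        tmp_y += y[j];
        tmp_z += z[j];
    }

    out[0] = tmp_x;
    out[1] = tmp_y;
    out[2] = tmp_z;
}
```

If a version like this gives different answers with and without the reduction clause, it would make a good reproducing example.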
Can you either post or send to PGI Customer Support (trs@pgroup.com) a reproducing example?
Oops :) Thanks for catching that. Yes, compiler feedback about it would be nice.
Maybe another question:
I’m curious whether the private(i,j) clause is really necessary. In OpenMP it would be (at least private(j)). Is that true for OpenACC as well?
Because if I don’t use the private clause, my program runs 15% faster and still gives the same results.
I’m curious if the private(i,j) clause is really necessary
Scalars are privatized by default, so no, privatizing i and j is not necessary. As you found out, privatizing them can actually slow down your code. Privatizing a scalar variable will create an array of the variables, one for each thread, in global memory. If it’s not privatized, the variable is declared locally in the kernel and thus more likely to be stored in a register which is much faster to access.
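To make the comparison concrete, here is a small sketch of the two variants. The row-sum kernel and the names row_sums_explicit and row_sums_implicit are hypothetical, since your loop nest isn't shown here; the only difference between the two functions is the private(i, j) clause. The inner loop is marked seq so that no reduction is involved.

```c
/* Hypothetical kernel: sums each row of a rows x cols matrix stored
   row-major in a. Variant 1 privatizes the loop scalars explicitly. */
void row_sums_explicit(int rows, int cols, const float *a, float *sums)
{
    int i, j;
    #pragma acc parallel loop private(i, j) \
        copyin(a[0:rows*cols]) copyout(sums[0:rows])
    for (i = 0; i < rows; ++i) {
        float s = 0.0f;
        #pragma acc loop seq
        for (j = 0; j < cols; ++j)
            s += a[i * cols + j];
        sums[i] = s;
    }
}

/* Variant 2 leaves i and j to the compiler's default handling of
   scalars, so no private clause is given. */
void row_sums_implicit(int rows, int cols, const float *a, float *sums)
{
    int i, j;
    #pragma acc parallel loop \
        copyin(a[0:rows*cols]) copyout(sums[0:rows])
    for (i = 0; i < rows; ++i) {
        float s = 0.0f;
        #pragma acc loop seq
        for (j = 0; j < cols; ++j)
            s += a[i * cols + j];
        sums[i] = s;
    }
}
```

Both variants should produce the same sums; any timing difference between them depends on the compiler and target, so it is worth measuring on your own code.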
Is that true for the PGI compiler or for OpenACC in general?
I’m not sure. I believe it’s true for Cray, but I don’t know about CAPS. The OpenACC 2.0 spec does clear this up a bit by adding a “default(none)” clause, which requires the user to explicitly specify which variables are private. The exception is the loop index variables, which are always private.
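As a rough sketch of how that might look with a 2.0-capable compiler: the function scale_array and its parameters are invented for illustration, and I'm assuming the compiler accepts default(none) on the combined parallel loop directive. Every variable used in the region is listed in a data or firstprivate clause, except the loop index i.

```c
/* Hypothetical example of default(none): the compiler is expected to
   reject the region if a variable used inside it has no explicit
   clause. The loop index i needs no clause. */
void scale_array(int n, float factor, const float *in, float *out)
{
    int i;
    #pragma acc parallel loop default(none) \
        copyin(in[0:n]) copyout(out[0:n]) firstprivate(n, factor)
    for (i = 0; i < n; ++i)
        out[i] = factor * in[i];
}
```

Removing, say, factor from the firstprivate clause should then produce a compile-time error instead of a silently chosen default.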