DO J = 0, N2 - 1
  DO K = 0, N3 - 1
    TEMP1(1:N1) = DP(K,J,0:N1 - 1)
    CALL COSFFT(TEMP1, N1, -1)
    DP(K,J,0:N1 - 1) = TEMP1(1:N1)
  END DO
END DO
I have created memory for the DP array, and I want the array TEMP1 to be private to each thread. As I understand it, I should not use a data clause in this case.
The subroutine COSFFT is declared as:
!$acc routine(cosfft) seq
All other subroutines called from COSFFT carry the same directive. COSFFT performs a 1D Fourier transform of the vector TEMP1 and returns the result in the same array.
The problem is that when I do something like this:
!$acc parallel loop private(temp1) independent vector_length(N3)
DO J = 0, N2 - 1
  !$acc loop vector
  DO K = 0, N3 - 1
    TEMP1(1:N1) = DP(K,J,0:N1 - 1)
    CALL COSFFT(TEMP1, N1, -1)
    DP(K,J,0:N1 - 1) = TEMP1(1:N1)
  END DO
END DO
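it works fine on the CPU, but the GPU results are incorrect. What am I doing wrong?
The “private” clause applies to the loop on which it’s used. So here “TEMP1” is private to the outer loop but shared within the inner loop: each iteration of the J loop gets its own copy, but all iterations of the vector K loop running under it share that one copy and overwrite each other’s data. To fix this, try moving the private clause to the inner loop: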
!$acc parallel loop
DO J = 0, N2 - 1
  !$acc loop vector private(temp1)
  DO K = 0, N3 - 1
    TEMP1(1:N1) = DP(K,J,0:N1 - 1)
    CALL COSFFT(TEMP1, N1, -1)
    DP(K,J,0:N1 - 1) = TEMP1(1:N1)
  END DO
END DO
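For anyone who wants to check this pattern in isolation, here is a minimal, self-contained sketch. It is not the original code: scale_shift is a hypothetical stand-in for COSFFT, the array sizes are arbitrary, and the copy(dp) clause is only there because this small program has no enclosing data region. Built with or without OpenACC, it should print "match".

```fortran
! Minimal sketch: scratch array made private on the vector loop.
! scale_shift is a hypothetical stand-in for COSFFT; sizes are arbitrary.
module work_mod
  implicit none
contains
  subroutine scale_shift(v, n)
    !$acc routine seq
    integer, intent(in) :: n
    real, intent(inout) :: v(n)
    integer :: i
    do i = 1, n
      v(i) = 2.0 * v(i) + 1.0
    end do
  end subroutine scale_shift
end module work_mod

program private_demo
  use work_mod
  implicit none
  integer, parameter :: n1 = 8, n2 = 4, n3 = 16
  real :: dp(0:n3-1, 0:n2-1, 0:n1-1)
  real :: ref(0:n3-1, 0:n2-1, 0:n1-1)
  real :: temp1(n1)
  integer :: i, j, k

  ! Fill dp with small integers so the comparison below is exact.
  do i = 0, n1 - 1
    do j = 0, n2 - 1
      do k = 0, n3 - 1
        dp(k, j, i) = real(k + n3 * (j + n2 * i))
      end do
    end do
  end do
  ref = 2.0 * dp + 1.0   ! expected result, computed on the host

  ! copy(dp) is an assumption: the original code manages DP elsewhere.
  !$acc parallel loop copy(dp)
  do j = 0, n2 - 1
    ! Each K iteration gets its own temp1.
    !$acc loop vector private(temp1)
    do k = 0, n3 - 1
      temp1(1:n1) = dp(k, j, 0:n1-1)
      call scale_shift(temp1, n1)
      dp(k, j, 0:n1-1) = temp1(1:n1)
    end do
  end do

  if (all(dp == ref)) then
    print *, 'match'
  else
    print *, 'MISMATCH'
  end if
end program private_demo
```

Moving private(temp1) back to the outer directive in this sketch is a quick way to reproduce the sharing problem on a GPU target, although whether a race actually shows up in the output can vary from run to run.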
Note that “independent” is the default for loops in “parallel” regions, so there is no need to add it here. Also, I’d advise not setting the vector length, at least not for the initial port. As a last step, you can go back and see if setting the vector length helps, but the compiler typically does a good job of finding the optimal vector length for the particular target device.
GR4EM sent me the example code and I was able to trace the problem down to a DO WHILE loop in one of the device routines. The compiler is incorrectly testing the exit condition for this loop, causing it to execute one fewer time than the CPU version. I have created a problem report (TPR#25542) and sent it to our compiler engineers for further investigation.
As a workaround, I advised him to change:
DO WHILE(N .GT. MMAX)
to
DO WHILE(N .GE. MMAX)
This gets the GPU code to execute the last iteration of the loop.
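To illustrate what the two exit tests mean when the compiler evaluates them correctly, here is a small host-only sketch. The loop body (doubling MMAX from 2) and N = 16 are assumptions made for illustration, since the real loop body is not shown in this thread.

```fortran
! Sketch only: counts iterations of a doubling loop under both exit tests.
! The loop body (mmax = 2 * mmax) and n = 16 are assumptions.
program while_count
  implicit none
  integer :: n, mmax, count_gt, count_ge

  n = 16

  ! Original condition: stop as soon as mmax reaches n.
  mmax = 2
  count_gt = 0
  do while (n .gt. mmax)
    count_gt = count_gt + 1
    mmax = 2 * mmax
  end do

  ! Workaround condition: also run the iteration where mmax equals n.
  mmax = 2
  count_ge = 0
  do while (n .ge. mmax)
    count_ge = count_ge + 1
    mmax = 2 * mmax
  end do

  print *, 'iterations with .GT.:', count_gt
  print *, 'iterations with .GE.:', count_ge
end program while_count
```

With these assumed values, a correctly compiled build runs 3 iterations with .GT. and 4 with .GE., because MMAX lands exactly on N. If the real loop behaves similarly, the .GE. form only compensates for the miscompiled device code, so it may be worth limiting the change to the GPU build and reverting it once the compiler fix is available.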