OpenACC on GPU help

Hi All,

I’m trying to run several loops using OpenACC.

Here is one of them

DO J = 0, N2 - 1
   DO K = 0, N3 - 1
      TEMP1(1:N1) = DP(K,J,0:N1 - 1)
      CALL COSFFT(TEMP1, N1, -1)
      DP(K,J,0:N1 - 1) = TEMP1(1:N1)
   END DO
END DO

I have already created device memory for the DP array, and I want the TEMP1 array to be private to each thread. As I understand it, I should not use a data clause for TEMP1 in this case.

Subroutine COSFFT is declared with

!$acc routine(cosfft) seq

All of the subroutines called from within COSFFT carry the same directive. COSFFT performs a 1D Fourier transform of the vector TEMP1 and returns the result in the same array.

The problem is that when I do something like this

!$acc parallel loop private(temp1) independent vector_length(N3)
      DO J = 0, N2 - 1
!$acc loop vector
         DO K = 0, N3 - 1
            TEMP1(1:N1) = DP(K,J,0:N1 - 1)
            CALL COSFFT(TEMP1, N1, -1)
            DP(K,J,0:N1 - 1) = TEMP1(1:N1)
         END DO
      END DO

it works fine on the CPU, but the GPU results are incorrect. What am I doing wrong?

Hi GR4EM,

The “private” clause applies to the loop on which it’s used, so here “TEMP1” is private to the outer loop but shared among the iterations of the inner loop. To fix this, try moving the private clause to the inner loop:

!$acc parallel loop 
      DO J = 0, N2 - 1 
!$acc loop vector private(temp1)
         DO K = 0, N3 - 1 
            TEMP1(1:N1) = DP(K,J,0:N1 - 1) 
            CALL COSFFT(TEMP1, N1, -1) 
            DP(K,J,0:N1 - 1) = TEMP1(1:N1) 
         END DO 
      END DO

Note that “independent” is the default for “parallel” regions, so there’s no need to add it here. Also, I’d advise against setting the vector length, at least for the initial port. As a last step you can go back and see whether setting the vector length helps, but the compiler typically does a good job of finding the optimal vector length for the particular target device.

-Mat

Hi Mat,

thanks for your response. The values of DP have finally changed, but they are still not correct. Here is the full forward transform:

!$acc parallel loop
      DO J = 0, N2 - 1
!$acc loop vector private(temp1)
         DO K = 0, N3 - 1
            TEMP1(1:N1) = DP(K,J,0:N1 - 1)
            CALL COSFFT(TEMP1(1:N1), N1, 1)
            DP(K,J,0:N1 - 1) = TEMP1(1:N1)
         END DO
      END DO
!$acc parallel loop
      DO I = 0, N1 - 1
!$acc loop vector private(temp2)
         DO K = 0, N3 - 1
            TEMP2(1:N2) = DP(K,0:N2 - 1,I)
            CALL COSFFT(TEMP2(1:N2), N2, 1)
            DP(K,0:N2 - 1,I) = TEMP2(1:N2)
         END DO
      END DO
!$acc parallel loop
      DO I = 0, N1 - 1
!$acc loop vector private(temp3)
         DO J = 0, N2 - 1
            TEMP3(1:N3) = DP(0:N3 - 1,J,I)
            CALL COSFFT(TEMP3(1:N3), N3, 1)
            DP(0:N3 - 1,J,I) = TEMP3(1:N3)
         END DO
      END DO
!$acc update self(DP(10:10,10:10,10:10))
      PRINT*,'dir = ',DP(10,10,10)

At the end of the transform I check its correctness:

CPU dir = -30186.98952259555
GPU dir = 3456.776083129177

Something is clearly wrong here, and I can’t pin down exactly what. Any suggestions?

I don’t see anything obvious. Can you post or send a full reproducing example to PGI Customer Service (trs@pgroup.com) so I can take a look?

-Mat

FYI,

GR4EM sent me the example code, and I was able to trace the problem to a DO WHILE loop in one of the device routines. The compiler is incorrectly testing the exit condition of this loop, causing it to execute one fewer iteration than the CPU version. I have created a problem report (TPR#25542) and sent it to our compiler engineers for further investigation.

As a workaround, I advised him to change:

DO WHILE(N .GT. MMAX)
to
DO WHILE(N .GE. MMAX)

This gets the GPU code to execute the last iteration of the loop.

-Mat