OpenACC on GPU help

Hi All,

I’m trying to run several loops using OpenACC.

Here is one of them

DO J = 0, N2 - 1
   DO K = 0, N3 - 1
      TEMP1(1:N1) = DP(K,J,0:N1 - 1)
      CALL COSFFT(TEMP1, N1, -1)
      DP(K,J,0:N1 - 1) = TEMP1(1:N1)
   END DO
END DO

I have already created device memory for the DP array, and I want the TEMP1 array to be private to each thread. As I understand it, I should not use a data clause for TEMP1 in this case.

Subroutine COSFFT is declared with

!$acc routine(cosfft) seq

All of the subroutines called from within COSFFT carry the same directive. COSFFT performs a 1D Fourier transform of the vector TEMP1 and returns the result in the same array.

The problem is that when I do something like this

!$acc parallel loop private(temp1) independent vector_length(N3)
      DO J = 0, N2 - 1
!$acc loop vector
         DO K = 0, N3 - 1
            TEMP1(1:N1) = DP(K,J,0:N1 - 1)
            CALL COSFFT(TEMP1, N1, -1)
            DP(K,J,0:N1 - 1) = TEMP1(1:N1)
         END DO
      END DO

it works fine on the CPU, but the GPU results are incorrect. What am I doing wrong?

Hi GR4EM,

The “private” clause applies to the loop on which it’s used, so here “TEMP1” is private to the outer loop but shared among the iterations of the inner loop. To fix this, try moving the private clause to the inner loop:

!$acc parallel loop 
      DO J = 0, N2 - 1 
!$acc loop vector private(temp1)
         DO K = 0, N3 - 1 
            TEMP1(1:N1) = DP(K,J,0:N1 - 1) 
            CALL COSFFT(TEMP1, N1, -1) 
            DP(K,J,0:N1 - 1) = TEMP1(1:N1) 
         END DO 
      END DO

Note that “independent” is the default for “parallel” regions, so there’s no need to add it here. Also, I’d advise against setting the vector length, at least for the initial port. As a last step you can go back and see whether setting the vector length helps, but the compiler typically does a good job of finding the optimal vector length for the particular target device.

-Mat

Hi Mat,

thanks for your response. The values of DP have finally changed, but they are still not correct. Here is the full forward transform:

!$acc parallel loop
      DO J = 0, N2 - 1
!$acc loop vector private(temp1)
         DO K = 0, N3 - 1
            TEMP1(1:N1) = DP(K,J,0:N1 - 1)
            CALL COSFFT(TEMP1(1:N1), N1, 1)
            DP(K,J,0:N1 - 1) = TEMP1(1:N1)
         END DO
      END DO
!$acc parallel loop
      DO I = 0, N1 - 1
!$acc loop vector private(temp2)
         DO K = 0, N3 - 1
            TEMP2(1:N2) = DP(K,0:N2 - 1,I)
            CALL COSFFT(TEMP2(1:N2), N2, 1)
            DP(K,0:N2 - 1,I) = TEMP2(1:N2)
         END DO
      END DO
!$acc parallel loop
      DO I = 0, N1 - 1
!$acc loop vector private(temp3)
         DO J = 0, N2 - 1
            TEMP3(1:N3) = DP(0:N3 - 1,J,I)
            CALL COSFFT(TEMP3(1:N3), N3, 1)
            DP(0:N3 - 1,J,I) = TEMP3(1:N3)
         END DO
      END DO
!$acc update self(DP(10:10,10:10,10:10))
      PRINT*,'dir = ',DP(10,10,10)

At the end of the transform I check its correctness:

CPU dir = -30186.98952259555
GPU dir = 3456.776083129177

Something is clearly wrong here, and I can’t pin down exactly what. Any suggestions?

I don’t see anything obvious. Can you post or send a full reproducing example to PGI Customer Service (trs@pgroup.com) so I can take a look?

-Mat

FYI,

GR4EM sent me the example code, and I was able to trace the problem to a DO WHILE loop in one of the device routines. The compiler is incorrectly testing the exit condition of this loop, causing it to execute one fewer iteration than the CPU version. I have created a problem report (TPR#25542) and sent it to our compiler engineers for further investigation.

As a workaround, I advised him to change:

DO WHILE(N .GT. MMAX)
to
DO WHILE(N .GE. MMAX)

This gets the GPU code to execute the last iteration of the loop.

-Mat