OpenACC - Parallelization of small sub-loops present inside large main loops

hemanthgrylls · October 25, 2021, 1:15am

Hi,

I am trying to parallelize a kernel with the following configuration:

integer :: i,j,k,nbl,II
double precision :: A(NI,NJ,NK,nblocks), rho,c
double precision, dimension(-2:3) :: Wp, Wp2

    !$acc parallel loop gang vector collapse(4) private(Wp) default(present)
    DO nbl = 1,nblocks
    DO k = 1,NK
    DO j = 1,NJ
    DO i = 1,NI

        rho = i*j
        c   = j*k

        !--------------------------------------------------------------------------------------------

        !$acc loop seq
        DO II=-2,3

            Wp(II) = rho-c

        ENDDO

        rho = i
        c   = k

        !$acc loop seq
        DO II=-2,3

            Wp2(II) = rho*Wp(II) + c

        ENDDO

        A(i,j,k,nbl) = SUM(Wp(-2:3)) + SUM(Wp2(-2:3))

    ENDDO
    ENDDO
    ENDDO
    ENDDO

There is a good amount of parallelism in the outer loops at the i, j, k and nbl levels, but the small inner loops (with index II) still run serially on each thread. Is there a way to parallelize these inner loops as well?

To distribute the parallelism, I changed the directive
‘!$acc parallel loop gang vector collapse(4) private(Wp) default(present)’ to
‘!$acc parallel loop gang vector_length(32) collapse(4) private(Wp) default(present)’ and changed the inner loop directives from ‘!$acc loop seq’ to ‘!$acc loop vector’, but performance got worse.
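For anyone who wants to reproduce the comparison, the two schedules described above can be sketched side by side as a small standalone program. This is only an illustration under assumptions, not the original application: the array sizes, the second array B, and the copyout clauses (used in place of default(present) so the program is self-contained) are hypothetical, and Wp2, rho and c are added to the private clause on the assumption that each thread or gang should have its own copy.

```fortran
program subloop_schedules
  implicit none
  ! Hypothetical problem sizes, chosen only for illustration.
  integer, parameter :: NI = 16, NJ = 16, NK = 16, nblocks = 2
  integer :: i, j, k, nbl, II
  double precision :: rho, c
  double precision, dimension(-2:3) :: Wp, Wp2
  double precision, allocatable :: A(:,:,:,:), B(:,:,:,:)

  allocate(A(NI,NJ,NK,nblocks), B(NI,NJ,NK,nblocks))

  ! Variant 1: gang + vector parallelism over the four collapsed outer
  ! loops; each thread runs the two 6-iteration inner loops sequentially.
  !$acc parallel loop gang vector collapse(4) private(Wp, Wp2, rho, c) copyout(A)
  do nbl = 1, nblocks
  do k = 1, NK
  do j = 1, NJ
  do i = 1, NI
     rho = i*j
     c   = j*k
     !$acc loop seq
     do II = -2, 3
        Wp(II) = rho - c
     end do
     rho = i
     c   = k
     !$acc loop seq
     do II = -2, 3
        Wp2(II) = rho*Wp(II) + c
     end do
     A(i,j,k,nbl) = sum(Wp) + sum(Wp2)
  end do
  end do
  end do
  end do

  ! Variant 2: gangs over the collapsed outer loops, vector lanes over the
  ! inner loops. Wp and Wp2 are private per gang and shared by its lanes.
  !$acc parallel loop gang collapse(4) vector_length(32) private(Wp, Wp2, rho, c) copyout(B)
  do nbl = 1, nblocks
  do k = 1, NK
  do j = 1, NJ
  do i = 1, NI
     rho = i*j
     c   = j*k
     !$acc loop vector
     do II = -2, 3
        Wp(II) = rho - c
     end do
     rho = i
     c   = k
     !$acc loop vector
     do II = -2, 3
        Wp2(II) = rho*Wp(II) + c
     end do
     B(i,j,k,nbl) = sum(Wp) + sum(Wp2)
  end do
  end do
  end do
  end do

  ! Both variants are meant to compute the same values.
  print *, 'max abs difference between variants:', maxval(abs(A - B))
end program subloop_schedules
```

Note that in the second variant each inner loop has only six iterations, so with vector_length(32) at most six of the 32 lanes in a gang have work inside those loops. Timing both variants with the flags used for the real code would show whether that matches the slowdown seen in the application.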

Also, am I correct in saying that the benefit of parallelizing the inner sub-loops becomes insignificant as NI, NJ and NK grow, since the device will already be fully occupied by the four outer loops?

Yuki_Ni · October 25, 2021, 3:13pm

Please follow the instructions in “Getting Help with CUDA NVCC Compiler” to file a ticket with us, and include the exact compile command line.

hemanthgrylls · October 25, 2021, 10:12pm

Compiler: pgfortran (21.9)
Flags: -fast -acc -ta=tesla:managed -Minfo=accel
Cuda toolkit version: 11.4

I don’t have any bugs or issues to report; I am looking for help optimizing the loop scheduling to get faster run times.
