Hi,
I am trying to parallelize a kernel with the following configuration:
integer :: i,j,k,nbl,II
double precision :: A(NI,NJ,NK,nblocks), rho,c
double precision, dimension(-2:3) :: Wp, Wp2
!$acc parallel loop gang vector collapse(4) private(Wp) default(present)
DO nbl = 1,nblocks
  DO k = 1,NK
    DO j = 1,NJ
      DO i = 1,NI
        rho = i*j
        c = j*k
        !--------------------------------------------------------------------------------------------
        !$acc loop seq
        DO II=-2,3
          Wp(II) = rho-c
        ENDDO
        rho = i
        c = k
        !$acc loop seq
        DO II=-2,3
          ! Wp is rank 1, so it takes a single subscript
          Wp2(II) = rho*Wp(II) +c
        ENDDO
        A(i,j,k,nbl) = SUM(Wp(-2:3)) + SUM(Wp2(-2:3))
      ENDDO
    ENDDO
  ENDDO
ENDDO
There is a good amount of parallelism in the outer loops over i, j, k and nbl, but the small inner loops (index II) still run serially within each thread. Is there a way to parallelize these inner loops as well?
To distribute the parallelism, I changed the directive
‘!$acc parallel loop gang vector collapse(4) private(Wp) default(present)’ to
‘!$acc parallel loop gang vector_length(32) collapse(4) private(Wp) default(present)’ and changed the inner loop directives from ‘!$acc loop seq’ to ‘!$acc loop vector’, but performance got worse.
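For reference, the modified variant described above would look roughly like the sketch below. This is only an illustration: the program wrapper, the small array sizes, the enclosing data region, and the addition of Wp2, rho and c to the private clause are assumptions made so that the example is self-contained, and are not part of the original kernel.

```fortran
program inner_vector_sketch
  implicit none
  ! Illustrative sizes only (assumption); the real NI, NJ, NK, nblocks are larger
  integer, parameter :: NI = 8, NJ = 8, NK = 8, nblocks = 2
  integer :: i, j, k, nbl, II
  double precision :: A(NI,NJ,NK,nblocks), rho, c
  double precision, dimension(-2:3) :: Wp, Wp2

  ! Data region added so that default(present) finds A on the device (assumption)
  !$acc data copyout(A)

  ! Outer four loops are spread across gangs; each gang gets its own Wp, Wp2, rho, c
  !$acc parallel loop gang collapse(4) vector_length(32) &
  !$acc   private(Wp, Wp2, rho, c) default(present)
  do nbl = 1, nblocks
    do k = 1, NK
      do j = 1, NJ
        do i = 1, NI
          rho = i*j
          c = j*k
          ! Inner loop mapped to the vector lanes of the gang
          !$acc loop vector
          do II = -2, 3
            Wp(II) = rho - c
          end do
          rho = i
          c = k
          !$acc loop vector
          do II = -2, 3
            Wp2(II) = rho*Wp(II) + c
          end do
          A(i,j,k,nbl) = sum(Wp) + sum(Wp2)
        end do
      end do
    end do
  end do

  !$acc end data

  ! Print one element as a quick sanity check
  print *, A(1,1,1,1)
end program inner_vector_sketch
```

Each II loop has only 6 iterations, so with vector_length(32) at most 6 of the 32 vector lanes have work inside those loops, which may be related to the slowdown I am seeing.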
Also, am I correct in saying that the benefit of parallelizing the inner loops becomes insignificant as NI, NJ and NK grow, since the device will already be fully occupied by the four outer loops?