OpenACC - Parallelization of small sub-loops present inside large main loops


I am trying to parallelize a kernel with the following configuration:

integer :: i, j, k, nbl, II
double precision :: A(NI,NJ,NK,nblocks), rho, c
double precision, dimension(-2:3) :: Wp, Wp2

    !$acc parallel loop gang vector collapse(4) private(Wp,Wp2) default(present)
    DO nbl = 1,nblocks
    DO k = 1,NK
    DO j = 1,NJ
    DO i = 1,NI
        rho = i*j
        c   = j*k
        !$acc loop seq
        DO II = -2,3
            Wp(II) = rho - c
        END DO

        rho = i
        c   = k

        !$acc loop seq
        DO II = -2,3
            Wp2(II) = rho*Wp(II) + c
        END DO

        A(i,j,k,nbl) = SUM(Wp(-2:3)) + SUM(Wp2(-2:3))
    END DO
    END DO
    END DO
    END DO

There is a good amount of parallelism in the exterior loops at the i, j, k, and nbl levels, but the small inner loops (with index II) still run serially on each thread. Is there a way to parallelize these inner loops as well?

With the aim of distributing the parallelism, I changed the directive
‘!$acc parallel loop gang vector collapse(4) private(Wp) default(present)’ to
‘!$acc parallel loop gang vector_length(32) collapse(4) private(Wp) default(present)’, and changed the inner loop directives from ‘!$acc loop seq’ to ‘!$acc loop vector’, but performance deteriorated with that change.
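For reference, the alternative schedule described above would look like the sketch below (directive placement only; the loop bodies are unchanged). Note that with only 6 iterations in each inner II loop, a 32-lane vector leaves 26 lanes idle per gang, which is one plausible explanation for the slowdown:

    !$acc parallel loop gang collapse(4) vector_length(32) private(Wp,Wp2) default(present)
    DO nbl = 1,nblocks
    DO k = 1,NK
    DO j = 1,NJ
    DO i = 1,NI
        ...
        ! inner loops now spread across the 32 vector lanes,
        ! but only 6 of the 32 lanes do useful work
        !$acc loop vector
        DO II = -2,3
            Wp(II) = rho - c
        END DO
        ...

This is a sketch of the configuration under discussion, not a recommended schedule.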

Also, am I correct in saying: ‘the performance gain from parallelizing the inner sub-loops becomes insignificant for large values of NI, NJ, and NK, since the device will already be fully occupied by the outer four loops alone’?

Please follow the instructions here to file a ticket: Getting Help with CUDA NVCC Compiler, and include the exact compile command line.

Compiler: pgfortran (21.9)
Flags: -fast -acc -ta=tesla:managed -Minfo=accel
CUDA toolkit version: 11.4

I don’t have any bugs or issues to report; I am looking for help optimizing the loop scheduling for faster runtimes.