OpenACC - Parallelization of small sub-loops present inside large main loops


I am trying to parallelize a kernel with the following configuration:

integer :: i, j, k, nbl, II
double precision :: A(NI,NJ,NK,nblocks), rho, c
double precision, dimension(-2:3) :: Wp, Wp2

!$acc parallel loop gang vector collapse(4) private(Wp) default(present)
DO nbl = 1,nblocks
DO k = 1,NK
DO j = 1,NJ
DO i = 1,NI
    rho = i*j
    c   = j*k
    ! first small inner loop: fill the 6-element stencil Wp
    !$acc loop seq
    DO II = -2,3
        Wp(II) = rho - c
    END DO

    rho = i
    c   = k

    ! second small inner loop: build Wp2 from Wp
    !$acc loop seq
    DO II = -2,3
        Wp2(II) = rho*Wp(II) + c
    END DO

    A(i,j,k,nbl) = SUM(Wp(-2:3)) + SUM(Wp2(-2:3))
END DO
END DO
END DO
END DO

There is a good amount of parallelism in the outer loops over i, j, k and nbl. However, the small inner loops (index II) still run serially within each thread. Is there a way to parallelize these inner loops as well?

To distribute the parallelism further, I changed the directive
‘!$acc parallel loop gang vector collapse(4) private(Wp) default(present)’ to
‘!$acc parallel loop gang vector_length(32) collapse(4) private(Wp) default(present)’ and changed the inner loop directives from ‘!$acc loop seq’ to ‘!$acc loop vector’, but performance got worse.
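
For anyone who wants to reproduce the comparison, here is a minimal, self-contained sketch of the modified variant described above. Several things in it are assumptions and not taken from the real application: the program name, the array sizes, the copyout(A) clause (used instead of default(present) so the sketch does not depend on managed memory or an enclosing data region), and the addition of Wp2 to the private clause. The final loop nest is only a host-side correctness check against the closed form of the two sums.

```fortran
program inner_vector_variant
  implicit none
  ! Hypothetical sizes, chosen only so the sketch runs quickly.
  integer, parameter :: NI = 16, NJ = 16, NK = 16, nblocks = 2
  integer :: i, j, k, nbl, II
  double precision :: rho, c, expected
  double precision, dimension(-2:3) :: Wp, Wp2
  double precision, allocatable :: A(:,:,:,:)

  allocate(A(NI,NJ,NK,nblocks))

  ! Outer loops: gang only. Inner loops: vector, 32 lanes per gang.
  ! Assumption: Wp2 is privatized together with Wp.
  !$acc parallel loop gang vector_length(32) collapse(4) private(Wp, Wp2) copyout(A)
  do nbl = 1, nblocks
  do k = 1, NK
  do j = 1, NJ
  do i = 1, NI
    rho = i*j
    c   = j*k
    !$acc loop vector
    do II = -2, 3
      Wp(II) = rho - c
    end do

    rho = i
    c   = k
    !$acc loop vector
    do II = -2, 3
      Wp2(II) = rho*Wp(II) + c
    end do

    A(i,j,k,nbl) = sum(Wp) + sum(Wp2)
  end do
  end do
  end do
  end do

  ! Host-side check: both sums have a simple closed form.
  do nbl = 1, nblocks
  do k = 1, NK
  do j = 1, NJ
  do i = 1, NI
    expected = 6.0d0*(i*j - j*k) + 6.0d0*(dble(i)*(i*j - j*k) + k)
    if (abs(A(i,j,k,nbl) - expected) > 1.0d-9) error stop 'mismatch'
  end do
  end do
  end do
  end do

  print *, 'ok'
end program inner_vector_variant
```

Built without OpenACC support, the directives are treated as comments and the program runs serially, which is a convenient way to check the arithmetic before timing the accelerated build.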

Also, am I correct in saying that the benefit of parallelizing the inner sub-loops becomes insignificant as NI, NJ and NK grow, since the device will already be fully occupied by the four outer loops alone?
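
As a rough back-of-the-envelope check on that reasoning, the sketch below compares the amount of outer-loop parallelism with how much of a 32-lane vector a 6-iteration inner loop can use. The problem sizes are hypothetical, picked only for illustration; the inner trip count (II = -2..3) and the vector length of 32 come from the directives quoted above.

```fortran
program lane_utilization
  implicit none
  ! Hypothetical problem sizes, for illustration only.
  integer, parameter :: NI = 64, NJ = 64, NK = 64, nblocks = 4
  integer, parameter :: inner_trips = 6   ! II = -2..3
  integer, parameter :: vlen = 32         ! vector_length(32)
  integer(kind=8) :: outer_iters

  ! Independent iterations available from the four collapsed outer loops.
  outer_iters = int(NI, 8) * NJ * NK * nblocks

  print '(a,i0)', 'collapsed outer iterations: ', outer_iters
  ! Share of vector lanes that have work inside each inner loop.
  print '(a,f6.2,a)', 'active lanes per inner vector loop: ', &
        100.0d0 * inner_trips / vlen, ' %'
end program lane_utilization
```

With these sizes the outer loops already expose about a million independent iterations, while an inner vector loop with 6 iterations can keep at most 6 of the 32 lanes busy (under 20 %). That is at least consistent with the slowdown observed for the gang/vector split, though actual occupancy depends on the device and on register and memory pressure.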

Compiler: pgfortran (21.9)
Flags: -fast -acc -ta=tesla:managed -Minfo=accel
CUDA toolkit version: 11.4

I don’t have any bugs or issues to report; I am looking for help optimizing the loop scheduling to get faster run times.