OpenACC kernel running slower than expected

Hi, I am running a code on a V100 using the following compiler and flags:

Compiler= pgfortran
Flags = -acc -fast -mcmodel=medium -ta=tesla:managed -Minfo=accel

While executing the whole code, one loop structure runs far slower than expected. Other loops with the same kind of loop scheduling and almost the same number of floating-point operations run roughly 2x faster than this nested loop structure. I want to know why it is slow.

Some of my observations:
My first suspect is the creation of the arrays Visc_Fflux, Visc_Gflux, Visc_Hflux, and Total_Flux at the start using the data clause. These arrays are needed only in this subroutine. I also noticed a similar problem in another subroutine: an array was created at the start, and that subroutine also runs slowly. Another observation is that both subroutines run fine for NK = 1 (other parameters remaining the same). I could declare these variables as global, but there are already so many global variables.

I also noticed that the GPU is idle and no computations are running while the arrays (Visc_Fflux, Visc_Gflux, Visc_Hflux, Total_Flux) are being deallocated at the end, at "!$acc end data".

I think this may also be because of the large number of large 5D arrays present in it (Ix, Qp_I, Jx, Qp_J, Kx, ...). Could latency in reading data from RAM be the cause?

parameters:

nblocks = 1
NK = 59
NJ = 255
NI = 599

The openacc region that is running slow:

!$acc data create(Visc_Fflux,Visc_Gflux,Visc_Hflux,Total_Flux)  ! dimensions of these arrays: NI x NJ x NK x nblocks x 4

!$acc loop seq
	DO nbl = 1,nblocks	
	!$acc parallel loop gang collapse(2) default(present)
	DO k = 1, NK(nbl)
	DO j = 1, NJ(nbl)
	!$acc loop vector
	DO i = 1, NI(nbl)
		
		rx = Ix(i,j,k,nbl)*Qp_I(i,j,k,nbl,1) + Jx(i,j,k,nbl)*Qp_J(i,j,k,nbl,1) + Kx(i,j,k,nbl)*Qp_K(i,j,k,nbl,1)
		ux = Ix(i,j,k,nbl)*Qp_I(i,j,k,nbl,2) + Jx(i,j,k,nbl)*Qp_J(i,j,k,nbl,2) + Kx(i,j,k,nbl)*Qp_K(i,j,k,nbl,2)
		vx = Ix(i,j,k,nbl)*Qp_I(i,j,k,nbl,3) + Jx(i,j,k,nbl)*Qp_J(i,j,k,nbl,3) + Kx(i,j,k,nbl)*Qp_K(i,j,k,nbl,3)
		wx = Ix(i,j,k,nbl)*Qp_I(i,j,k,nbl,4) + Jx(i,j,k,nbl)*Qp_J(i,j,k,nbl,4) + Kx(i,j,k,nbl)*Qp_K(i,j,k,nbl,4)
		px = Ix(i,j,k,nbl)*Qp_I(i,j,k,nbl,5) + Jx(i,j,k,nbl)*Qp_J(i,j,k,nbl,5) + Kx(i,j,k,nbl)*Qp_K(i,j,k,nbl,5)
		                                                                       
		ry = Iy(i,j,k,nbl)*Qp_I(i,j,k,nbl,1) + Jy(i,j,k,nbl)*Qp_J(i,j,k,nbl,1) + Ky(i,j,k,nbl)*Qp_K(i,j,k,nbl,1)
		uy = Iy(i,j,k,nbl)*Qp_I(i,j,k,nbl,2) + Jy(i,j,k,nbl)*Qp_J(i,j,k,nbl,2) + Ky(i,j,k,nbl)*Qp_K(i,j,k,nbl,2)
		vy = Iy(i,j,k,nbl)*Qp_I(i,j,k,nbl,3) + Jy(i,j,k,nbl)*Qp_J(i,j,k,nbl,3) + Ky(i,j,k,nbl)*Qp_K(i,j,k,nbl,3)
		wy = Iy(i,j,k,nbl)*Qp_I(i,j,k,nbl,4) + Jy(i,j,k,nbl)*Qp_J(i,j,k,nbl,4) + Ky(i,j,k,nbl)*Qp_K(i,j,k,nbl,4)
		py = Iy(i,j,k,nbl)*Qp_I(i,j,k,nbl,5) + Jy(i,j,k,nbl)*Qp_J(i,j,k,nbl,5) + Ky(i,j,k,nbl)*Qp_K(i,j,k,nbl,5)
		
		rz = Iz(i,j,k,nbl)*Qp_I(i,j,k,nbl,1) + Jz(i,j,k,nbl)*Qp_J(i,j,k,nbl,1) + Kz(i,j,k,nbl)*Qp_K(i,j,k,nbl,1)
		uz = Iz(i,j,k,nbl)*Qp_I(i,j,k,nbl,2) + Jz(i,j,k,nbl)*Qp_J(i,j,k,nbl,2) + Kz(i,j,k,nbl)*Qp_K(i,j,k,nbl,2)
		vz = Iz(i,j,k,nbl)*Qp_I(i,j,k,nbl,3) + Jz(i,j,k,nbl)*Qp_J(i,j,k,nbl,3) + Kz(i,j,k,nbl)*Qp_K(i,j,k,nbl,3)
		wz = Iz(i,j,k,nbl)*Qp_I(i,j,k,nbl,4) + Jz(i,j,k,nbl)*Qp_J(i,j,k,nbl,4) + Kz(i,j,k,nbl)*Qp_K(i,j,k,nbl,4)
		pz = Iz(i,j,k,nbl)*Qp_I(i,j,k,nbl,5) + Jz(i,j,k,nbl)*Qp_J(i,j,k,nbl,5) + Kz(i,j,k,nbl)*Qp_K(i,j,k,nbl,5)
		
		! ===============================
		
		T =  Gamma*Mach**2*Qp(i,j,k,nbl,5)/Qp(i,j,k,nbl,1)
		mu_L = (T**1.5d0)*(1.4045988d0)/(T + 0.4045988d0)/Re
		
		! ===============================
		
		Blk_Visc_Term = (-2.d0/3.d0)*(ux + vy + wz)
		
		tau_xx = mu_L*(2.d0*ux + Blk_Visc_Term)		;	tau_xy = mu_L*(uy + vx)
		tau_yy = mu_L*(2.d0*vy + Blk_Visc_Term)		;	tau_yz = mu_L*(vz + wy)
		tau_zz = mu_L*(2.d0*wz + Blk_Visc_Term)		;	tau_zx = mu_L*(wx + uz)
	
		bx = (tau_xx*Qp(i,j,k,nbl,2) + tau_xy*Qp(i,j,k,nbl,3) + tau_zx*Qp(i,j,k,nbl,4)) &
		+ mu_L*funGamma*(px/Qp(i,j,k,nbl,1) - Qp(i,j,k,nbl,5)/(Qp(i,j,k,nbl,1))**2.d0*rx)
		by = (tau_xy*Qp(i,j,k,nbl,2) + tau_yy*Qp(i,j,k,nbl,3) + tau_yz*Qp(i,j,k,nbl,4)) &
		+ mu_L*funGamma*(py/Qp(i,j,k,nbl,1) - Qp(i,j,k,nbl,5)/(Qp(i,j,k,nbl,1))**2.d0*ry)
		bz = (tau_zx*Qp(i,j,k,nbl,2) + tau_yz*Qp(i,j,k,nbl,3) + tau_zz*Qp(i,j,k,nbl,4)) &
		+ mu_L*funGamma*(pz/Qp(i,j,k,nbl,1) - Qp(i,j,k,nbl,5)/(Qp(i,j,k,nbl,1))**2.d0*rz)
		
		Visc_Fflux(i,j,k,nbl,1) =(Ix(i,j,k,nbl)*tau_xx+Iy(i,j,k,nbl)*tau_xy + Iz(i,j,k,nbl)*tau_zx)/Jac(i,j,k,nbl)
		Visc_Fflux(i,j,k,nbl,2) =(Ix(i,j,k,nbl)*tau_xy+Iy(i,j,k,nbl)*tau_yy + Iz(i,j,k,nbl)*tau_yz)/Jac(i,j,k,nbl)
		Visc_Fflux(i,j,k,nbl,3) =(Ix(i,j,k,nbl)*tau_zx+Iy(i,j,k,nbl)*tau_yz + Iz(i,j,k,nbl)*tau_zz)/Jac(i,j,k,nbl)*(1-f2D)
		Visc_Fflux(i,j,k,nbl,4) =(Ix(i,j,k,nbl)*bx    +Iy(i,j,k,nbl)*by     + Iz(i,j,k,nbl)*bz    )/Jac(i,j,k,nbl)
		
		Visc_Gflux(i,j,k,nbl,1) =(Jx(i,j,k,nbl)*tau_xx+Jy(i,j,k,nbl)*tau_xy + Jz(i,j,k,nbl)*tau_zx)/Jac(i,j,k,nbl)
		Visc_Gflux(i,j,k,nbl,2) =(Jx(i,j,k,nbl)*tau_xy+Jy(i,j,k,nbl)*tau_yy + Jz(i,j,k,nbl)*tau_yz)/Jac(i,j,k,nbl)
		Visc_Gflux(i,j,k,nbl,3) =(Jx(i,j,k,nbl)*tau_zx+Jy(i,j,k,nbl)*tau_yz + Jz(i,j,k,nbl)*tau_zz)/Jac(i,j,k,nbl)*(1-f2D)
		Visc_Gflux(i,j,k,nbl,4) =(Jx(i,j,k,nbl)*bx    +Jy(i,j,k,nbl)*by     + Jz(i,j,k,nbl)*bz    )/Jac(i,j,k,nbl)
		
		Visc_Hflux(i,j,k,nbl,1) =(Kx(i,j,k,nbl)*tau_xx+ Ky(i,j,k,nbl)*tau_xy+ Kz(i,j,k,nbl)*tau_zx)/Jac(i,j,k,nbl)
		Visc_Hflux(i,j,k,nbl,2) =(Kx(i,j,k,nbl)*tau_xy+ Ky(i,j,k,nbl)*tau_yy+ Kz(i,j,k,nbl)*tau_yz)/Jac(i,j,k,nbl)
		Visc_Hflux(i,j,k,nbl,3) =(Kx(i,j,k,nbl)*tau_zx+ Ky(i,j,k,nbl)*tau_yz+ Kz(i,j,k,nbl)*tau_zz)/Jac(i,j,k,nbl)
		Visc_Hflux(i,j,k,nbl,4) =(Kx(i,j,k,nbl)*bx    + Ky(i,j,k,nbl)*by    + Kz(i,j,k,nbl)*bz    )/Jac(i,j,k,nbl)
	
	ENDDO
	ENDDO
	ENDDO
	ENDDO

<some operations with Visc_Fflux, Visc_Gflux, Visc_Hflux, Total_Flux
......
......>

!$acc end data

Are these allocatable, automatic, or fixed-size arrays? Allocatables and automatics are allocated at run time, and since you're using "managed", they would be handled by CUDA Unified Memory. Since these are write-only arrays, it shouldn't be copying data to the device, but if it did, you would see this time included in the kernel itself. Though, UM may be copying the data back to the host at the end, which would account for the extra time there.
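One way to avoid the per-call allocation entirely is to make the arrays module-level allocatables, allocate them once at startup, and create them on the device once with an unstructured data region. A minimal sketch (the module and subroutine names are made up; not tested against your code):

```fortran
module work_arrays
   implicit none
   real(8), allocatable :: Visc_Fflux(:,:,:,:,:), Visc_Gflux(:,:,:,:,:)
   real(8), allocatable :: Visc_Hflux(:,:,:,:,:), Total_Flux(:,:,:,:,:)
contains
   subroutine init_work_arrays(NI, NJ, NK, nblocks)
      integer, intent(in) :: NI, NJ, NK, nblocks
      ! Allocate once on the host, then create the device copies once.
      allocate(Visc_Fflux(NI,NJ,NK,nblocks,4), Visc_Gflux(NI,NJ,NK,nblocks,4), &
               Visc_Hflux(NI,NJ,NK,nblocks,4), Total_Flux(NI,NJ,NK,nblocks,4))
      !$acc enter data create(Visc_Fflux,Visc_Gflux,Visc_Hflux,Total_Flux)
   end subroutine init_work_arrays

   subroutine free_work_arrays()
      ! Matching teardown at program end.
      !$acc exit data delete(Visc_Fflux,Visc_Gflux,Visc_Hflux,Total_Flux)
      deallocate(Visc_Fflux, Visc_Gflux, Visc_Hflux, Total_Flux)
   end subroutine free_work_arrays
end module work_arrays
```

The structured "!$acc data create(...)" in the subroutine can then go away, with "default(present)" on the compute construct picking up the device copies, so no allocation/deallocation happens per call.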

Are you able to run your code without "-gpu=managed"?

You might also want to run the code through the Nsight-Systems profiler to see where/if the data movement is occurring.

I think this may also be because of the large number of large 5D arrays that are present in it (Ix, Qp_I, Jx, Qp_J, Kx, ...). Is the latency in data reading from RAM the cause?

You'll want to look at an Nsight-Compute profile to better understand what's going on. But my guess is that you're not getting good caching. There's re-use, but it's spread across different lines, so values may be getting evicted before they're used again.

No idea if this will help, but I'd try adding 12 scalars, assigning them to the various array elements at the top of the "i" loop. Normally adding scalars may slow things down given the increase in register usage, but in this case it might actually reduce register usage, since the code doesn't need to recompute the offsets into the arrays each time it fetches memory.
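Something along these lines (a fragment sketch; the scalar names are made up, and only the first derivative line and one flux line are shown):

```fortran
!$acc loop vector
DO i = 1, NI(nbl)
    ! Hoist the repeatedly-used metric terms into scalars once per iteration,
    ! so the 5D array offsets are computed only once per value fetched.
    s_Ix = Ix(i,j,k,nbl); s_Iy = Iy(i,j,k,nbl); s_Iz = Iz(i,j,k,nbl)
    s_Jx = Jx(i,j,k,nbl); s_Jy = Jy(i,j,k,nbl); s_Jz = Jz(i,j,k,nbl)
    s_Kx = Kx(i,j,k,nbl); s_Ky = Ky(i,j,k,nbl); s_Kz = Kz(i,j,k,nbl)
    s_Jac = Jac(i,j,k,nbl)

    rx = s_Ix*Qp_I(i,j,k,nbl,1) + s_Jx*Qp_J(i,j,k,nbl,1) + s_Kx*Qp_K(i,j,k,nbl,1)
    ! ... same substitution in the remaining derivative lines ...

    Visc_Fflux(i,j,k,nbl,1) = (s_Ix*tau_xx + s_Iy*tau_xy + s_Iz*tau_zx)/s_Jac
    ! ... and in the remaining flux lines ...
ENDDO
```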

Nsight-Compute will give you the number of registers used, as well as the caching throughput and occupancy.


The size of these arrays is fixed; they are 5-dimensional arrays with dimensions NI x NJ x NK x nblocks x 4.

I think you are correct, because I don't see any data movement in the profiler output. Please see the attached .qdrep file if you think it may help. I ran 6 iterations while generating it. The execution on the GPU starts around 73.25 s, after which everything should be carried out on the GPU only. Please note that the kernel corresponding to this subroutine is named 'viscous_fluxes_3665_gpu' in the .qdrep file, and it took the highest amount of time. report1.qdrep (10.1 MB)

Unfortunately I don't have access to Nsight Compute at the moment, but I will use it when I get access.

I did try this and it did help a little, but the times are still high... Maybe there is a better way to handle these lines.

No, I could not, but that may be because I forgot to copy some variables to the device before launching the kernels.

The size of these arrays is fixed; they are 5-dimensional arrays with dimensions NI x NJ x NK x nblocks x 4.

These are automatics, since you're using variables to define the size; hence the arrays are implicitly allocated upon each entry into the subroutine. Fixed-size arrays would need to use parameters or literal values.
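For illustration, a minimal sketch of the distinction (hypothetical subroutine):

```fortran
subroutine demo(n)
   implicit none
   integer, intent(in) :: n
   integer, parameter  :: m = 100
   real(8) :: a(n)   ! automatic: n is a run-time value, so "a" is
                     ! (re)allocated on every entry into the subroutine
   real(8) :: b(m)   ! fixed size: m is a compile-time constant,
                     ! so no per-call allocation is needed
end subroutine demo
```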

I think you are correct, because I don't see any data movement in the profiler output.

The profile does say that unified memory is at 0%, but I'm not 100% sure about that. The initial kernel launches are significantly longer than subsequent calls, leading me to believe that there might be some data movement in there. But I'm not sure. I took a quick stab at trying to fix your data directives and add update directives, but it was going to take longer than I had available. Though, presuming a full run does more than 6 iterations, this cost may amortize away, making the fix more effort than it's worth.

Please note that the kernel corresponding to this subroutine is named 'viscous_fluxes_3665_gpu'

Note: since the "Main.f90" you sent has only 2000 lines, I'm assuming that you didn't send the full application, and that this maps to the VISCOUS_FLUXES routine around line 1031.

Unfortunately I don't have access to Nsight Compute at the moment,

We ship the command-line profiler, 'ncu', with the HPC SDK, so you can use it there. Though I prefer the GUI when looking at results.

I did try this and it did help a little, but the times are still high... Maybe there is a better way to handle these lines.

This might be my miscommunication. I meant the arrays where you re-use the values, like "Ix", "Qp_I", and "Jac". It looks like you added "vars1-12", which just get used to set "Visc_[F|G|H]flux". Also, it looks like you made an error where an ENDDO got moved up, so the Visc arrays aren't getting set.

Here are my modifications, which get about a 10x speed-up. Though I don't know how to check for correct answers, so please verify before actually using this version.

Main.f90 (51.7 KB)

Though, this loop isn't too time-consuming. The loop above this one takes significantly more time. While it has non-stride-1 accesses, so there's not much room for improvement, I suspect UM is copying in the "Qp" array.

Looking at the output from our runtime profiler (enabled by setting the environment variable "NVCOMPILER_ACC_TIME=1"), there's a huge difference between the max and min kernel times, indicating to me that there's data movement the first time it's called.

/proj/scratch/mcolgrove/tmp/test2/Main.f90
  viscous_fluxes  NVIDIA  devicenum=0
    time(us): 111,739
    1052: compute region reached 6 times
        1052: kernel launched 6 times
            grid: [65535]  block: [128]
             device time(us): total=107,766 max=85,825 min=4,207 avg=17,961
            elapsed time(us): total=107,948 max=85,878 min=4,234 avg=17,991
    1052: data region reached 12 times
    1090: data region reached 12 times
    1094: compute region reached 6 times
        1094: kernel launched 6 times
            grid: [15045]  block: [128]
             device time(us): total=596 max=101 min=99 avg=99
            elapsed time(us): total=707 max=123 min=116 avg=117

Some difference is expected, but this is quite a lot. Granted, I haven't confirmed that UM is indeed the issue; I just suspect it. Again, hopefully this gets amortized as you run more iterations.


You are correct. It runs for thousands of iterations.

Oh, cool. Thanks.

I tried your modifications, and sorry, I don't see any improvement.

Besides that, when I declared Visc_Fflux, Visc_Gflux, Visc_Hflux, and Total_Flux as global variables, I saw the great speed-up I was expecting, which I think is one possible solution to my original post (though it consumes a lot of RAM). The kernel runs much faster now. I would like to understand the problem with using automatics (particularly at high NI, NJ, and NK) and whether there are better solutions I can try. Thanks! report1.qdrep (9.8 MB)