Error in computed solution (giving NaN values) while using collapse directive in OpenACC


I am new to OpenACC and am trying to parallelize a large Fortran 90 code. One of the code blocks in my program has the following nested DO loops, which compute the array ‘Fflux_iph’; I am parallelizing them with the collapse directive. The code compiles fine and runs fine too, but ‘Fflux_iph’ eventually contains NaN values. Is there any problem with the following code? The same code runs fine in a serial run.

compiler: pgfortran
flags: -acc -O3 -ta=tesla -Minfo=accel

Input variables: nblocks = 1, nprims = 5, NI(1) = 300, NJ(1) = 300, NK(1) = 300


double precision, dimension(0:NImax, NJmax, NKmax, nblocks, nconserv) :: Fflux_iph
double precision :: Rho_L, Rho_R, Ux_L, Ux_R, Uy_L, Uy_R, Uz_L, Uz_R, P_L, P_R, C_L, C_R, E_L, E_R, h_L, h_R, E_local
double precision :: U_contra_L, U_contra_R, S_L, S_R, S
double precision, dimension(nconserv) :: F_L, F_R
double precision :: sqrt_rho_L, sqrt_rho_R, Rho, divisor, u_average, v_average
double precision :: w_average, h_average, uvw_average,ucontra_ave, C_average
	DO nbl = 1,nblocks
	!$acc parallel loop collapse(3)
	DO k = 1,NK(nbl)
	DO j = 1,NJ(nbl)
	DO i = 0,NI(nbl)
		!Local variables
		Rho_L = Qp_iphL(i,j,k,nbl,1)
		Rho_R = Qp_iphR(i,j,k,nbl,1)
		Ux_L = Qp_iphL(i,j,k,nbl,2)
		Ux_R = Qp_iphR(i,j,k,nbl,2)
		Uy_L = Qp_iphL(i,j,k,nbl,3)
		Uy_R = Qp_iphR(i,j,k,nbl,3)
		Uz_L = Qp_iphL(i,j,k,nbl,4)
		Uz_R = Qp_iphR(i,j,k,nbl,4)
		P_L = Qp_iphL(i,j,k,nbl,5)
		P_R = Qp_iphR(i,j,k,nbl,5)
		C_L = Sqrt(Gamma*P_L/Rho_L*(Ix_iph(i,j,k,nbl)**2+Iy_iph(i,j,k,nbl)**2+Iz_iph(i,j,k,nbl)**2))
		C_R = Sqrt(Gamma*P_R/Rho_R*(Ix_iph(i,j,k,nbl)**2+Iy_iph(i,j,k,nbl)**2+Iz_iph(i,j,k,nbl)**2))
		E_L = P_L/(Gamma-1.d0) + Rho_L*(Ux_L**2+Uy_L**2+Uz_L**2)/2.d0
		E_R = P_R/(Gamma-1.d0) + Rho_R*(Ux_R**2+Uy_R**2+Uz_R**2)/2.d0
		h_L = (E_L + P_L)/Rho_L
		h_R = (E_R + P_R)/Rho_R
		U_contra_L = Ux_L*Ix_iph(i,j,k,nbl) + Uy_L*Iy_iph(i,j,k,nbl) + Uz_L*Iz_iph(i,j,k,nbl)
		U_contra_R = Ux_R*Ix_iph(i,j,k,nbl) + Uy_R*Iy_iph(i,j,k,nbl) + Uz_R*Iz_iph(i,j,k,nbl)
		! Roe averaging
		sqrt_rho_L = sqrt(Rho_L)
		sqrt_rho_R = sqrt(Rho_R)
		Rho = sqrt(Rho_R/Rho_L)*Rho_L
		divisor	   = 1.d0/(sqrt_rho_R+sqrt_rho_L)
		h_average   = ((h_L*sqrt_rho_L) + (h_R*sqrt_rho_R))*divisor
		u_average 	= ((Ux_L*sqrt_rho_L) + (Ux_R*sqrt_rho_R))*divisor
		v_average 	= ((Uy_L*sqrt_rho_L) + (Uy_R*sqrt_rho_R))*divisor
		w_average 	= ((Uz_L*sqrt_rho_L) + (Uz_R*sqrt_rho_R))*divisor
		ucontra_ave	= u_average*Ix_iph(i,j,k,nbl) + v_average*Iy_iph(i,j,k,nbl) + w_average*Iz_iph(i,j,k,nbl)
		uvw_average = 0.5d0 * (u_average**2.d0 + v_average**2.d0 + w_average**2.d0)
		C_average 	= sqrt((gamma-1.d0)*(h_average - uvw_average))*SQRT(Ix_iph(i,j,k,nbl)**2+Iy_iph(i,j,k,nbl)**2+Iz_iph(i,j,k,nbl)**2)
		S = ABS(ucontra_ave + C_average)
		!Step3: Estimate fluxes
		F_L(1) = (rho_L*U_contra_L) /Jac_iph(i,j,k,nbl)
		F_L(2) = (rho_L*U_contra_L*Ux_L + P_L*Ix_iph(i,j,k,nbl)) /Jac_iph(i,j,k,nbl)
		F_L(3) = (rho_L*U_contra_L*Uy_L + P_L*Iy_iph(i,j,k,nbl)) /Jac_iph(i,j,k,nbl)
		F_L(4) = (rho_L*U_contra_L*Uz_L + P_L*Iz_iph(i,j,k,nbl)) /Jac_iph(i,j,k,nbl)
		F_L(5) = (E_L + P_L) * U_contra_L /Jac_iph(i,j,k,nbl)
		F_R(1) = (rho_R*U_contra_R) /Jac_iph(i,j,k,nbl)
		F_R(2) = (rho_R*U_contra_R*Ux_R + P_R*Ix_iph(i,j,k,nbl)) /Jac_iph(i,j,k,nbl)
		F_R(3) = (rho_R*U_contra_R*Uy_R + P_R*Iy_iph(i,j,k,nbl)) /Jac_iph(i,j,k,nbl)
		F_R(4) = (rho_R*U_contra_R*Uz_R + P_R*Iz_iph(i,j,k,nbl)) /Jac_iph(i,j,k,nbl)
		F_R(5) = (E_R + P_R) * U_contra_R /Jac_iph(i,j,k,nbl)
		Fflux_iph(i,j,k,nbl,1) = 0.5d0*(F_L(1)+F_R(1)) - 0.5d0*S*(Rho_R - Rho_L)/Jac_iph(i,j,k,nbl)
		Fflux_iph(i,j,k,nbl,2) = 0.5d0*(F_L(2)+F_R(2)) - 0.5d0*S*(Rho_R*Ux_R - Rho_L*Ux_L)/Jac_iph(i,j,k,nbl)
		Fflux_iph(i,j,k,nbl,3) = 0.5d0*(F_L(3)+F_R(3)) - 0.5d0*S*(Rho_R*Uy_R - Rho_L*Uy_L)/Jac_iph(i,j,k,nbl)
		Fflux_iph(i,j,k,nbl,4) = (0.5d0*(F_L(4)+F_R(4)) - 0.5d0*S*(Rho_R*Uz_R - Rho_L*Uz_L)/Jac_iph(i,j,k,nbl))*(1-f2D)
		Fflux_iph(i,j,k,nbl,5) = 0.5d0*(F_L(5)+F_R(5)) - 0.5d0*S*(E_R - E_L)/Jac_iph(i,j,k,nbl)
	END DO
	END DO
	END DO
	!$acc end parallel loop
	END DO

Hi hemanthgrylls,

While scalars are private by default, arrays are shared. Hence the most likely cause is that “F_L” and “F_R” are shared, so collisions occur between threads. Note that if you look at the compiler feedback messages (-Minfo=accel), you’ll most likely see a message that these arrays are implicitly being copied to the device.

To fix, add these to a “private” clause on your parallel loop:

DO nbl = 1,nblocks
!$acc parallel loop collapse(3) private(F_L,F_R)
DO k = 1,NK(nbl)
DO j = 1,NJ(nbl)
DO i = 0,NI(nbl)

Note that I also responded to your StackOverflow post. You should also be able to parallelize the “nbl” loop if you use multiple levels of parallelism:

!$acc parallel loop
DO nbl = 1,nblocks
!$acc loop collapse(3) private(F_L,F_R)
DO k = 1,NK(nbl)
DO j = 1,NJ(nbl)
DO i = 0,NI(nbl)

The difference is the outer loop will be parallelized across the gangs (CUDA blocks) and the inner loops parallelized across the vectors (CUDA threads).
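The gang/vector mapping described above can also be written out explicitly. A sketch (loop bodies elided; array and bound names are those from the snippet earlier in the thread):

```fortran
!$acc parallel loop gang
DO nbl = 1,nblocks
   !$acc loop vector collapse(3) private(F_L,F_R)
   DO k = 1,NK(nbl)
      DO j = 1,NJ(nbl)
         DO i = 0,NI(nbl)
            ! ... flux computation as above ...
         END DO
      END DO
   END DO
END DO
```

With the implicit form, the compiler chooses an equivalent schedule; the explicit clauses just make the gang (CUDA block) vs. vector (CUDA thread) split visible and tunable.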

Hope this helps,


Hi Mat,

It’s working fine now! I did see the implicit data copy messages you mentioned via the ‘-Minfo=accel’ flag.

1098, Generating Tesla code
1099, !$acc loop gang, vector(128) collapse(3) ! blockidx%x threadidx%x
1100,   ! blockidx%x threadidx%x collapsed
1101,   ! blockidx%x threadidx%x collapsed
1098, Generating implicit copyin(iz_iph(0:ni,1:nj,1:nk,nbl)) [if not already present]
      Generating implicit copyout(fflux_iph(:ni,:nj,:nk,nbl,:5)) [if not already present]
      Generating implicit copyin(ix_iph(0:ni,1:nj,1:nk,nbl),iy_iph(0:ni,1:nj,1:nk,nbl),qp_iphr(0:ni,1:nj,1:nk,nbl,1:5),qp_iphl(0:ni,1:nj,1:nk,nbl,1:5),nk(nbl),jac_iph(0:ni,1:nj,1:nk,nbl),ni(nbl),nj(nbl)) [if not already present]

Thanks so much for the quick response.

I would like to ask a follow-up question. When I used these directives exactly as shown in your answer, I did not see any performance gain despite the massive scope for parallelism. I am guessing it is because of the lack of data-transfer management between host and device. Do you think that is the reason?

Assuming that this is an extended version of the program you posted at: OpenACC: Best way to parallelize nested DO loops with data dependency between loops?

It probably has some impact, but not much. With CUDA Unified Memory (-gpu=managed), the data is only copied when it’s changed on either the host or the device, so in this case it only gets copied the first time through the timestep loop. You can hoist the data movement to before the timestep loop using explicit data regions so it doesn’t get included in your timer, but the overall time would be about the same.
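Hoisting the data movement with an explicit data region might look like the following sketch. The timestep loop structure and the name “nsteps” are assumptions; the array names are those from the kernel above:

```fortran
! Explicit data region: arrays are copied once, before the timestep loop,
! and stay resident on the device for all iterations.
!$acc data copyin(Qp_iphL, Qp_iphR, Ix_iph, Iy_iph, Iz_iph, Jac_iph, NI, NJ, NK) &
!$acc      copyout(Fflux_iph)
DO istep = 1, nsteps          ! hypothetical timestep loop
   ! ... flux kernels run on the device; no per-step host/device copies ...
END DO
!$acc end data
```

Any array updated on the host inside the loop would additionally need an `!$acc update device(...)` before the kernel that reads it.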

More likely the problem is the same as in your other program: “nblocks” is only one, so there’s not enough work for the GPU. I used an nblocks size of 128 and see significant speed-up, though I don’t know if that’s a reasonable size.


Is it always like, the outer loops will be parallelized across gangs and the inner loops with vectors?

The loop schedules do need to form a hierarchy, gang->worker->vector, but levels can be combined on the same loop. Sequential loops (‘seq’) can be interspersed.
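A small sketch of both points, combining gang and vector on one loop nest with a sequential loop interspersed (loop bounds and the inner body are placeholders):

```fortran
!$acc parallel loop gang vector collapse(3)   ! gang and vector combined on the same loop
DO k = 1, nk
   DO j = 1, nj
      DO i = 1, ni
         !$acc loop seq                       ! sequential loop inside the parallel nest
         DO m = 1, 5
            ! ... per-component work ...
         END DO
      END DO
   END DO
END DO
```

The only constraint is the ordering: once a level (e.g. vector) has been used on an outer loop, an inner loop cannot use a higher level (e.g. gang).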
