Hi,
I work in the Virginia Tech AOE department, where we are applying OpenACC acceleration to some Fortran CFD codes. We have encountered what appears to be a bug in the PGI 14.1 Fortran OpenACC compiler: incorrect CUDA code is generated for scalar kernels regions. I have constructed a simple example code that reproduces the error:
https://drive.google.com/file/d/0B1fu3KCwysj1TGJlNS0wZ2JMMEU/edit?usp=sharing
As the example shows, certain operations appear to be incorrectly moved or optimized away. The effect is to turn a scalar kernels region like this excerpt:
!$acc kernels copy(soln(:,:,:),res(:,:,:))
i=2
j=2
local1=soln(i,j,1) + soln(i+1,j+1,1) + soln(i-1,j-1,1) + &
soln(i-1,j+1,1) + soln(i+1,j-1,1)
local2=soln(i,j,2) + soln(i+1,j+1,2) + soln(i-1,j-1,2) + &
soln(i-1,j+1,2) + soln(i+1,j-1,2)
local3=soln(i,j,3) + soln(i+1,j+1,3) + soln(i-1,j-1,3) + &
soln(i-1,j+1,3) + soln(i+1,j-1,3)
!NOTE: R1 is re-computed for each i,j value
local4 = soln(i,j,2)**2 + soln(i,j,3)**2 + add_val
R1 = local4*5.9_dp
res(i,j,1) = R1
R2 = soln(i,j,2) + local2 + local3
res(i,j,2) = R2
R3 = soln(i,j,3) + local2 + local3
res(i,j,3) = R3
i=2
j=y_nodes-1
local1=soln(i,j,1) + soln(i+1,j+1,1) + soln(i-1,j-1,1) + &
soln(i-1,j+1,1) + soln(i+1,j-1,1)
local2=soln(i,j,2) + soln(i+1,j+1,2) + soln(i-1,j-1,2) + &
soln(i-1,j+1,2) + soln(i+1,j-1,2)
local3=soln(i,j,3) + soln(i+1,j+1,3) + soln(i-1,j-1,3) + &
soln(i-1,j+1,3) + soln(i+1,j-1,3)
local4 = soln(i,j,2)**2 + soln(i,j,3)**2 + add_val
R1 = local4*5.9_dp
res(i,j,1) = R1
R2 = soln(i,j,2) + local2 + local3
res(i,j,2) = R2
R3 = soln(i,j,3) + local2 + local3
res(i,j,3) = R3
i=x_nodes-1
j=2
local1=soln(i,j,1) + soln(i+1,j+1,1) + soln(i-1,j-1,1) + &
soln(i-1,j+1,1) + soln(i+1,j-1,1)
local2=soln(i,j,2) + soln(i+1,j+1,2) + soln(i-1,j-1,2) + &
soln(i-1,j+1,2) + soln(i+1,j-1,2)
local3=soln(i,j,3) + soln(i+1,j+1,3) + soln(i-1,j-1,3) + &
soln(i-1,j+1,3) + soln(i+1,j-1,3)
local4 = soln(i,j,2)**2 + soln(i,j,3)**2 + add_val
R1 = local4*5.9_dp
res(i,j,1) = R1
R2 = soln(i,j,2) + local2 + local3
res(i,j,2) = R2
R3 = soln(i,j,3) + local2 + local3
res(i,j,3) = R3
...[etc]...
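into this, where local4 and R1 are computed only at the first point and the stale R1 is then reused at the later points: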
!$acc kernels copy(soln(:,:,:),res(:,:,:))
i=2
j=2
local1=soln(i,j,1) + soln(i+1,j+1,1) + soln(i-1,j-1,1) + &
soln(i-1,j+1,1) + soln(i+1,j-1,1)
local2=soln(i,j,2) + soln(i+1,j+1,2) + soln(i-1,j-1,2) + &
soln(i-1,j+1,2) + soln(i+1,j-1,2)
local3=soln(i,j,3) + soln(i+1,j+1,3) + soln(i-1,j-1,3) + &
soln(i-1,j+1,3) + soln(i+1,j-1,3)
!NOTE: R1 is assigned only once and re-used, even as i,j change
local4 = soln(i,j,2)**2 + soln(i,j,3)**2 + add_val
R1 = local4*5.9_dp !computed only once for all i,j
res(i,j,1) = R1
R2 = soln(i,j,2) + local2 + local3
res(i,j,2) = R2
R3 = soln(i,j,3) + local2 + local3
res(i,j,3) = R3
i=2
j=y_nodes-1
local1=soln(i,j,1) + soln(i+1,j+1,1) + soln(i-1,j-1,1) + &
soln(i-1,j+1,1) + soln(i+1,j-1,1)
local2=soln(i,j,2) + soln(i+1,j+1,2) + soln(i-1,j-1,2) + &
soln(i-1,j+1,2) + soln(i+1,j-1,2)
local3=soln(i,j,3) + soln(i+1,j+1,3) + soln(i-1,j-1,3) + &
soln(i-1,j+1,3) + soln(i+1,j-1,3)
res(i,j,1) = R1
R2 = soln(i,j,2) + local2 + local3
res(i,j,2) = R2
R3 = soln(i,j,3) + local2 + local3
res(i,j,3) = R3
i=x_nodes-1
j=2
local1=soln(i,j,1) + soln(i+1,j+1,1) + soln(i-1,j-1,1) + &
soln(i-1,j+1,1) + soln(i+1,j-1,1)
local2=soln(i,j,2) + soln(i+1,j+1,2) + soln(i-1,j-1,2) + &
soln(i-1,j+1,2) + soln(i+1,j-1,2)
local3=soln(i,j,3) + soln(i+1,j+1,3) + soln(i-1,j-1,3) + &
soln(i-1,j+1,3) + soln(i+1,j-1,3)
res(i,j,1) = R1
R2 = soln(i,j,2) + local2 + local3
res(i,j,2) = R2
R3 = soln(i,j,3) + local2 + local3
res(i,j,3) = R3
...[etc]...
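One way to confirm this behavior is to recompute R1 on the host for each point and compare it with what the kernels region wrote into res. The following is only a minimal sketch of such a check, not the linked reproducer: the program name, array sizes, fill values, and tolerance are all illustrative assumptions, and the region is cut down to two points, so it may or may not trigger the same code generation as the full example.

```fortran
! Hypothetical check program; all names, sizes and values here are
! illustrative assumptions, not taken from the linked reproducer.
program check_scalar_kernels
  implicit none
  integer, parameter :: dp = selected_real_kind(15, 307)
  integer, parameter :: x_nodes = 8, y_nodes = 8
  real(dp), parameter :: add_val = 1.0_dp
  real(dp) :: soln(x_nodes, y_nodes, 3), res(x_nodes, y_nodes, 3)
  real(dp) :: local4, R1, expected, tol
  integer :: i, j, k, n
  integer :: pts(2, 2)

  ! Fill soln with a different value at every node so a stale R1 is visible.
  do k = 1, 3
    do j = 1, y_nodes
      do i = 1, x_nodes
        soln(i, j, k) = real(i + 10*j + 100*k, dp)
      end do
    end do
  end do
  res = 0.0_dp

  ! Scalar kernels region: R1 should be recomputed after i,j change.
  !$acc kernels copy(soln(:,:,:), res(:,:,:))
  i = 2
  j = 2
  local4 = soln(i,j,2)**2 + soln(i,j,3)**2 + add_val
  R1 = local4*5.9_dp
  res(i,j,1) = R1
  i = 2
  j = y_nodes - 1
  local4 = soln(i,j,2)**2 + soln(i,j,3)**2 + add_val
  R1 = local4*5.9_dp
  res(i,j,1) = R1
  !$acc end kernels

  ! Host-side reference for the same two points.
  pts(:, 1) = (/ 2, 2 /)
  pts(:, 2) = (/ 2, y_nodes - 1 /)
  do n = 1, 2
    i = pts(1, n)
    j = pts(2, n)
    expected = (soln(i,j,2)**2 + soln(i,j,3)**2 + add_val)*5.9_dp
    tol = 1.0e-12_dp*abs(expected)
    if (abs(res(i,j,1) - expected) > tol) then
      print *, 'MISMATCH at', i, j, res(i,j,1), expected
    else
      print *, 'ok at', i, j
    end if
  end do
end program check_scalar_kernels
```

Built with OpenACC enabled (for example `pgfortran -acc -Minfo=accel`), a mismatch at the second point would indicate that R1 was not recomputed. Built without `-acc`, the directives are treated as comments and the same source serves as a host-only baseline.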
We have developed workarounds, but wanted to bring this possible bug to your attention.
Thank you,
Brent