PGI 14.1 Fortran/acc bug report

Hi,

I am working in the Virginia Tech AOE dept., where we are currently trying to apply OpenACC acceleration to some Fortran CFD codes. We have encountered what appears to be a bug in the PGI 14.1 Fortran/acc compiler, in which incorrect CUDA code is generated for scalar (sequential) code inside “kernels” regions. I have constructed a simple example code that reproduces the error:

https://drive.google.com/file/d/0B1fu3KCwysj1TGJlNS0wZ2JMMEU/edit?usp=sharing

As evident in the example, the problem appears to be that certain operations are incorrectly moved or optimized away. The effect is to turn a scalar kernel such as this excerpt:

  !$acc kernels copy(soln(:,:,:),res(:,:,:))
  i=2
  j=2
       local1=soln(i,j,1) + soln(i+1,j+1,1) + soln(i-1,j-1,1) + &
                soln(i-1,j+1,1) + soln(i+1,j-1,1)
       local2=soln(i,j,2) + soln(i+1,j+1,2) + soln(i-1,j-1,2) + &
                soln(i-1,j+1,2) + soln(i+1,j-1,2)
       local3=soln(i,j,3) + soln(i+1,j+1,3) + soln(i-1,j-1,3) + &
                soln(i-1,j+1,3) + soln(i+1,j-1,3)

       !NOTE: R1 is re-computed for each i,j value

       local4 = soln(i,j,2)**2 + soln(i,j,3)**2 + add_val
       R1 = local4*5.9_dp

       res(i,j,1) = R1
       R2 = soln(i,j,2) + local2 + local3
       res(i,j,2) = R2
       R3 = soln(i,j,3) + local2 + local3
       res(i,j,3) = R3
  
  i=2
  j=y_nodes-1
       local1=soln(i,j,1) + soln(i+1,j+1,1) + soln(i-1,j-1,1) + &
                soln(i-1,j+1,1) + soln(i+1,j-1,1)
       local2=soln(i,j,2) + soln(i+1,j+1,2) + soln(i-1,j-1,2) + &
                soln(i-1,j+1,2) + soln(i+1,j-1,2)
       local3=soln(i,j,3) + soln(i+1,j+1,3) + soln(i-1,j-1,3) + &
                soln(i-1,j+1,3) + soln(i+1,j-1,3)

       local4 = soln(i,j,2)**2 + soln(i,j,3)**2 + add_val
       R1 = local4*5.9_dp

       res(i,j,1) = R1
       R2 = soln(i,j,2) + local2 + local3
       res(i,j,2) = R2
       R3 = soln(i,j,3) + local2 + local3
       res(i,j,3) = R3
  
  i=x_nodes-1
  j=2
       local1=soln(i,j,1) + soln(i+1,j+1,1) + soln(i-1,j-1,1) + &
                soln(i-1,j+1,1) + soln(i+1,j-1,1)
       local2=soln(i,j,2) + soln(i+1,j+1,2) + soln(i-1,j-1,2) + &
                soln(i-1,j+1,2) + soln(i+1,j-1,2)
       local3=soln(i,j,3) + soln(i+1,j+1,3) + soln(i-1,j-1,3) + &
                soln(i-1,j+1,3) + soln(i+1,j-1,3)

       local4 = soln(i,j,2)**2 + soln(i,j,3)**2 + add_val
       R1 = local4*5.9_dp

       res(i,j,1) = R1
       R2 = soln(i,j,2) + local2 + local3
       res(i,j,2) = R2
       R3 = soln(i,j,3) + local2 + local3
       res(i,j,3) = R3

       ...[etc]...

Into this:

  !$acc kernels copy(soln(:,:,:),res(:,:,:))
  i=2
  j=2
       local1=soln(i,j,1) + soln(i+1,j+1,1) + soln(i-1,j-1,1) + &
                soln(i-1,j+1,1) + soln(i+1,j-1,1)
       local2=soln(i,j,2) + soln(i+1,j+1,2) + soln(i-1,j-1,2) + &
                soln(i-1,j+1,2) + soln(i+1,j-1,2)
       local3=soln(i,j,3) + soln(i+1,j+1,3) + soln(i-1,j-1,3) + &
                soln(i-1,j+1,3) + soln(i+1,j-1,3)

       !NOTE: R1 is assigned only once and re-used, even as i,j change

       local4 = soln(i,j,2)**2 + soln(i,j,3)**2 + add_val    
       R1 = local4*5.9_dp   !computed only once for all i,j

       res(i,j,1) = R1
       R2 = soln(i,j,2) + local2 + local3
       res(i,j,2) = R2
       R3 = soln(i,j,3) + local2 + local3
       res(i,j,3) = R3
  
  i=2
  j=y_nodes-1
       local1=soln(i,j,1) + soln(i+1,j+1,1) + soln(i-1,j-1,1) + &
                soln(i-1,j+1,1) + soln(i+1,j-1,1)
       local2=soln(i,j,2) + soln(i+1,j+1,2) + soln(i-1,j-1,2) + &
                soln(i-1,j+1,2) + soln(i+1,j-1,2)
       local3=soln(i,j,3) + soln(i+1,j+1,3) + soln(i-1,j-1,3) + &
                soln(i-1,j+1,3) + soln(i+1,j-1,3)

       res(i,j,1) = R1
       R2 = soln(i,j,2) + local2 + local3
       res(i,j,2) = R2
       R3 = soln(i,j,3) + local2 + local3
       res(i,j,3) = R3
  
  i=x_nodes-1
  j=2
       local1=soln(i,j,1) + soln(i+1,j+1,1) + soln(i-1,j-1,1) + &
                soln(i-1,j+1,1) + soln(i+1,j-1,1)
       local2=soln(i,j,2) + soln(i+1,j+1,2) + soln(i-1,j-1,2) + &
                soln(i-1,j+1,2) + soln(i+1,j-1,2)
       local3=soln(i,j,3) + soln(i+1,j+1,3) + soln(i-1,j-1,3) + &
                soln(i-1,j+1,3) + soln(i+1,j-1,3)

       res(i,j,1) = R1
       R2 = soln(i,j,2) + local2 + local3
       res(i,j,2) = R2
       R3 = soln(i,j,3) + local2 + local3
       res(i,j,3) = R3

       ...[etc]...

We have developed workarounds, but wanted to bring this possible bug to your attention.

Thank you,
Brent

Hi Brent,

Thanks for the example. I’ll pass it on to engineering (TPR#19997).

I agree that there’s an overly aggressive optimization being performed when using the “kernels” construct. That said, this is a case where I’d normally use the “parallel” construct instead, since “kernels” is intended for tightly nested loops while “parallel” is better suited to less-structured code such as a sequential region. Granted, “kernels” shouldn’t give you wrong answers, but changing it to “parallel” will work as expected.
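
For instance (a sketch only, with just the first corner shown, and assuming the arrays are made present by a data region, which I get to next), the serial block would be wrapped like this:

  !$acc parallel
  i=2
  j=2
       local4 = soln(i,j,2)**2 + soln(i,j,3)**2 + add_val
       R1 = local4*5.9_dp    ! recomputed for this corner, as intended
       res(i,j,1) = R1
       ...[remaining corners as above]...
  !$acc end parallel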

In addition to using “parallel” for sequential regions, I’d also recommend you use a data region to span all of your compute regions. As written, you have a lot of extra data movement.
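
Schematically, the structure I’m suggesting looks like this (a sketch; the loop and corner code itself is unchanged from your example):

  !$acc data copy(soln(:,:,:),res(:,:,:))

  ! parallel loop nests
  !$acc kernels
       ...[loops]...
  !$acc end kernels

  ! sequential corner updates
  !$acc parallel
       ...[corner code]...
  !$acc end parallel

  ! remaining loop nests
  !$acc kernels
       ...[etc]...
  !$acc end kernels

  !$acc end data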

Here’s a diff with my changes:

% diff -i scalar_test_ERROR.f90 scalar_test.f90
60c60,61
<   !$acc kernels copy(soln(:,:,:),res(:,:,:))
---
>   !$acc data copy(soln(:,:,:),res(:,:,:))
>   !$acc kernels
85c86
<   !$acc kernels copy(soln(:,:,:),res(:,:,:))
---
>   !$acc parallel
157c158
<   !$acc end kernels
---
>   !$acc end parallel
159c160
<   !$acc kernels copy(soln(:,:,:),res(:,:,:))
---
>   !$acc kernels
181c182
<
---
>   !$acc end data
% pgf90 -acc -Minfo=accel scalar_test.f90; a.out
kernel_test:
      0, Generating copy(soln(:,:,:))
         Generating copy(res(:,:,:))
     61, Generating Tesla code
     62, Loop is parallelizable
     63, Loop is parallelizable
         Accelerator kernel generated
         62, !$acc loop gang ! blockidx%y
         63, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
     86, Accelerator kernel generated
         Generating Tesla code
    160, Generating Tesla code
    161, Loop is parallelizable
    162, Loop is parallelizable
         Accelerator kernel generated
        161, !$acc loop gang ! blockidx%y
        162, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
Corner (1,1) =                1.903
Corner (1,y_nodes) =                1.903
Corner (x_nodes,1) =               45.439
Corner (x_nodes,y_nodes) =               45.439

Hope this helps,
Mat

Thanks for the info. I did not even think to try an “acc parallel” region for code that was supposed to execute as a serial kernel.

(In the real code we do use a data region to avoid extra data movement; I was sloppy in this example.)

-Brent

TPR 19997 - "UF: Using “kernels” region over sequential code gives wrong answers"

has been corrected in the current 14.6 release.

Thanks for the original report.

regards,
dave