Version upgrade causes code failure

Hi,

I have recently upgraded from PGI Fortran version 15.3 to version 16.4.

The code that ran well under 15.3 has significant issues when compiled with 16.4.

I assume this is down to a coding error that I have made which was tolerated under the previous version.

I have pasted a subroutine below which is intended to estimate a timestep value. This provided the correct value when compiled using 15.3 but gives dt=0 with 16.4.

Can you advise on any apparent sources of this issue?

I would be very grateful for any advice.

Tim.

      subroutine calc_timestep()

      use memory_allocation

      real*4    Fa
      real*4    ax,ay,az
      real*4    dotval
      real*4    dt_f,dt_cv
      real*4    t1

      CFL_number = 0.3

      dt_f       = 1000.0
      dt_cv      = 1000.0

! ----------------------------------------------------------------------

      !$acc parallel loop private(xcell,ycell,zcell,p_j,j,dx,dy,dz,r, &
      !$acc                 ax,ay,az,Fa)                              &
      !$acc                 reduction(min:dt_f)                       &
      !$acc                 reduction(min:dt_cv)                      &
      !$acc                 reduction(max:t1)
      do i = 1,num_particles

       xcell = ZONE_ID(1,i)
       ycell = ZONE_ID(2,i)
       zcell = ZONE_ID(3,i)

       ax = ACCEL(1,i)
       ay = ACCEL(2,i)
       az = ACCEL(3,i)

       ax = 0.00
       ay = 0.00
       az = 0.00

       t1 = 0.0

       Fa = max(sqrt((ax*ax)+(ay*ay)+(az*az)),0.001) ! dont let Fa=0

       dt_f = min(dt_f,sqrt(h/Fa))    ! because you divide by it here...

       do p_j = 1,NPIZPL(xcell,ycell,zcell)

        j = ZPLIST(xcell,ycell,zcell,p_j)

        if(i.ne.j) then

         dx = position(1,i) - position(1,j)
         dy = position(2,i) - position(2,j)
         dz = position(3,i) - position(3,j)

         r  = sqrt((dx*dx)+(dy*dy)+(dz*dz))

         if(r.gt.h10.and.r.le.h2) then

          du      = velocity(1,i) - velocity(1,j)
          dv      = velocity(2,i) - velocity(2,j)
          dw      = velocity(3,i) - velocity(3,j)

          dotval  = (du*dx) + (dv*dy) + (dw*dz)

          t1      = max(t1,abs((h*dotval)/(r*r)))

         endif ! r le 2 h

        endif ! i ne j

       enddo ! j loop

       dt_cv = min(dt_cv,(h/(real_speed_c + t1)))

      enddo ! i loop

      !$acc wait

      dt = CFL_number * min(dt_f,dt_cv)

Hi Tim,

One issue I see is the max reduction of “t1”. The reduction clause is on the outer “i” loop, which tells the compiler to reduce “t1” across all iterations of “i”. However, it looks like “t1” should be private to each iteration of “i”, with the max taken only across the “p_j” loop, which I assume is executed sequentially.

Note that I think the “wait” directive and the scalars in the “private” clause are extraneous here. Scalars are private by default and will be declared as local variables in the generated device routine. However, when you put them in a “private” clause, the compiler creates an array of scalars in global memory, which can be slower. There are cases where scalars do need to go in a “private” clause, but I recommend doing so only when required, not as the first option.
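As a rough illustration of the two points above, here is a minimal, self-contained sketch (not the original routine). The program name, the array “v”, and the values of “h” and “speed_c” are made-up stand-ins. Only “dt_cv” is reduced across the outer loop, “t1” is reset per iteration and left out of every clause, and the inner loop is marked “seq” on the assumption that it should run sequentially. It relies on the default scalar handling described above; if your compiler does not privatize “t1”, adding private(t1) would be the fallback.

```fortran
program t1_scope_sketch
  implicit none
  integer, parameter :: n = 4      ! hypothetical particle count
  integer :: i, j
  real :: h, speed_c, t1, dt_cv
  real :: v(n)                     ! stand-in for the velocity data

  h       = 0.05
  speed_c = 10.0
  v       = (/ 1.0, -2.0, 0.5, 3.0 /)
  dt_cv   = 1000.0

  ! Only dt_cv is reduced across the outer loop. t1 and the loop
  ! indices are scalars and are not listed in a private clause.
  !$acc parallel loop reduction(min:dt_cv) copyin(v)
  do i = 1, n
     t1 = 0.0                      ! reset for each outer iteration
     ! Inner loop runs sequentially within each outer iteration.
     !$acc loop seq
     do j = 1, n
        if (i /= j) t1 = max(t1, abs(v(i) - v(j)))
     end do
     dt_cv = min(dt_cv, h / (speed_c + t1))
  end do

  print *, 'dt_cv = ', dt_cv
end program t1_scope_sketch
```

Built without -acc, the directives are treated as comments, so the host result can be used as a reference when comparing against the accelerated build.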

If the “max” reduction of “t1” isn’t the issue, can you please post a reproducer or send one to PGI Customer Service (trs@pgroup.com)? Alternatively, post the compiler feedback messages (-Minfo=accel) from 15.3 and 16.4 so I can see if the compiler is doing anything different.

  • Mat

Hi Mat,

Many thanks for your help. You are right regarding the placement of the t1 reduction - it is not the prime cause of the issue so I will look at this later.

I have developed a much simpler bit of code that illustrates the issue.

      subroutine calc_timestep()
      
      use PGI_test_module  
      
      
      real*4 	Fa,dt_f      
      
      dt_f       = 1000.0
       
! ----------------------------------------------------------------------  

      !$acc parallel loop reduction(min:dt_f)              
      
      do i = 1,1 
       
       Fa = max(0.000,0.001) 
       
       dt_f = min(dt_f,sqrt(0.05/Fa)) 
      
      enddo ! i loop  
      
      print*,' dt_f   = ',dt_f
      
      stop
      
      return
      end



      
! --------------------------------------------------------------------   
      
!      PGI test module

! --------------------------------------------------------------------   
      
      module PGI_test_module 
           
!   ------------------------------------------------------- 
            
      integer 	i
 
!   ------------------------------------------------------- 

      end module PGI_test_module

If I compile this CPU only / without ACC then this gives the correct result (7.07). If I compile with ACC and comment out the module, I also get the correct result.

If I compile with ACC and uncomment the module then I get dt_f = 1000.0 suggesting that the loop is skipped or the host value for dt_f is not updated.

Any idea why a change of version would cause this behaviour?

Hi Tim,

Thanks for the example. I was able to recreate the error here and added a problem report (TPR#22514). Looks like a new regression in 16.4.

Interestingly, it only occurs when the index variable “i” is in the module, which is probably why we didn’t catch it.

The work-around is to put “dt_f” in a copy clause:

% cat test.f90

      module PGI_test_module
       integer    i
      end module PGI_test_module

      subroutine calc_timestep()
       use PGI_test_module
       real*4    Fa,dt_f
       dt_f       = 1000.0

#ifdef PGI_WA
       !$acc parallel loop reduction(min:dt_f) copy(dt_f)
#else
       !$acc parallel loop reduction(min:dt_f)
#endif
       do i = 1,1
        Fa = max(0.000,0.001)
        dt_f = min(dt_f,sqrt(0.05/Fa))
       enddo ! i loop

       print*,' dt_f   = ',dt_f
       return
       end

       program foo
         call calc_timestep()
       end program foo
% pgf90 -acc -Minfo=accel -Mpreprocess test.f90 -V16.4 ; a.out
calc_timestep:
     14, Accelerator kernel generated
         Generating Tesla code
         14, Generating reduction(min:dt_f)
         16, !$acc loop gang ! blockidx%x threadidx%x
  dt_f   =     1000.000
% pgf90 -acc -Minfo=accel -Mpreprocess test.f90 -V16.4 -DPGI_WA ; a.out
calc_timestep:
     12, Generating copy(dt_f)
         Accelerator kernel generated
         Generating Tesla code
         12, Generating reduction(min:dt_f)
         16, !$acc loop gang ! blockidx%x threadidx%x
  dt_f   =     7.071068
  • Mat

Hi Mat,

Many thanks for your help with this.

Tim.

Hi Mat,

Just a quick follow up on this. I have been looking at my placement of scalars into a private clause. I have looked into the effect this has on performance and have been trying to determine when, where, and why data copies occur. I was hoping you could clarify a couple of queries I have?

My general code structure is :

Allocate memory and initialise arrays
Transfer arrays from host to device with an ‘enter data copyin’
Loop
Call numerous subroutines
End loop

The subroutines within the loop contain acc parallel loops which need to access the arrays - which are present_or_copyin by default - and scalars - which are private by default?

I have noticed that some of these parallel loops are performing copyin and copyout operations at each loop iteration. It was my intent that there should be no data transfer during the iterative process as all data should be resident on the device. Is there a way to determine which variables/arrays are being transferred?

I am particularly intrigued by the copyout. I can sort of understand the copyin - there is some data I have overlooked that is required - but am a bit lost on the copyout transfer.

Is there a compiler command that would give more information? Is there anything obvious that I am doing wrong?

Any advice would be gratefully received.

Tim.

Hi Tim,

First, take a look at the compiler feedback messages (-Minfo=accel) for each of the parallel loops. If you’ve missed putting an array in a data directive, the compiler will automatically copy it to and/or from the device depending on its use. If you’re unsure how to read the compiler feedback, post the output and I can walk you through it.
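One way to make a missed array show up is to name the arrays in a “present” clause on each compute region, so that an array which is not already on the device causes a runtime error instead of an implicit copy. The sketch below assumes a structure like the one you describe. The module name “field_data”, the arrays “pos” and “vel”, and the routine “advance” are hypothetical.

```fortran
module field_data
  implicit none
  real, allocatable :: pos(:), vel(:)   ! hypothetical device-resident arrays
contains
  subroutine advance(n, dt)
    integer, intent(in) :: n
    real,    intent(in) :: dt
    integer :: i
    ! present(...) asserts that both arrays are already on the device,
    ! so no copyin/copyout is generated for them at this loop.
    !$acc parallel loop present(pos, vel)
    do i = 1, n
       pos(i) = pos(i) + dt * vel(i)
    end do
  end subroutine advance
end module field_data

program resident_data_sketch
  use field_data
  implicit none
  integer, parameter :: n = 1000
  integer :: step

  allocate(pos(n), vel(n))
  pos = 0.0
  vel = 1.0

  ! Transfer once, before the iterative loop.
  !$acc enter data copyin(pos, vel)

  do step = 1, 10
     call advance(n, 0.01)
  end do

  ! Bring back only what the host needs, then release device memory.
  !$acc exit data copyout(pos) delete(vel)

  print *, 'pos(1) = ', pos(1)
  deallocate(pos, vel)
end program resident_data_sketch
```

With this layout, any transfer still reported inside the iteration loop should come from something other than the listed arrays, which narrows down what to look for in the feedback messages.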

When you run your code, you can set the environment variable “PGI_ACC_NOTIFY=2”. This will show you each time data is moved to/from the device. There’s also “PGI_ACC_DEBUG=1” which dumps more detailed information about every kernel launch and copy if you need more information.

Does your code use reductions? If so, this could be the source of your copyout since the result of the reduction is implicitly copied back at the end of the compute region. However, if you put the reduction variable in a data directive, then you will control when the data is copied back (either via the copyout clause or an update directive).
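A minimal sketch of that second option, assuming the behaviour described above where a reduction variable placed in a data region stays on the device until you ask for it. The program name, the array “fa”, and the step count are made up for illustration. Please check the -Minfo=accel output and PGI_ACC_NOTIFY to confirm what your compiler version actually does.

```fortran
program reduction_update_sketch
  implicit none
  integer, parameter :: n = 8      ! hypothetical problem size
  integer :: i, step
  real :: dt_f
  real :: fa(n)                    ! stand-in for per-particle values

  do i = 1, n
     fa(i) = 0.001 * real(i)
  end do
  dt_f = 1000.0

  ! dt_f is placed in the data region, so its device copy persists
  ! across the compute regions below.
  !$acc data copyin(fa, dt_f)
  do step = 1, 3
     !$acc parallel loop reduction(min:dt_f) present(fa, dt_f)
     do i = 1, n
        dt_f = min(dt_f, sqrt(0.05 / fa(i)))
     end do
  end do

  ! Copy the result back once, at the point the host needs it.
  !$acc update host(dt_f)
  !$acc end data

  print *, 'dt_f = ', dt_f
end program reduction_update_sketch
```

The trade-off is that the host copy of “dt_f” is stale between the compute region and the update directive, so any host code that reads it in the loop needs its own update.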

Note that there will almost always be some data that needs to be copied over to the device. We wrap up the arguments to the device kernel into a struct and then copy the struct to the device before launching the kernel.

  • Mat

This problem has been fixed in the 16.5 release, currently available.

thanks,
dave