Nvfortran+openacc bug/feature related to the acc loop range

Hi,
In our application, the acc loop range is not fixed. To avoid copying the loop range from the CPU to the GPU, I tried to calculate it on the GPU with ‘acc serial’. This is the background for the following demo code.

I found that the ‘!$acc parallel loop’ line produces correct results without ‘collapse’. With ‘collapse’, all the elements of the array ‘a’ are still zero. Is this a bug or a ‘feature’ of OpenACC? Could you explain what is going on?

Thanks!

program main                                                                                 
                                                                                             
  call sub1                                                                                  
                                                                                             
contains                                                                                     
  subroutine sub1()                                                                          
    real, dimension(3, 3, 3):: a                                                             
    !$acc declare create(a)                                                                  
                                                                                             
    integer:: imax = 0, jmax = 0, kmax = 0                                                   
    !$acc declare create(imax, jmax,  kmax)                                                  
                                                                                             
    integer:: i, j, k                                                                        
    !----------------------------------                                                      
                                                                                             
    a = 0                                                                                    
    !$acc update device(a)                                                                   
                                                                                             
    !$acc serial present(imax)                                                               
    imax = 2                                                                                 
    jmax = 2                                                                                 
    kmax = 2                                                                                 
    !$acc end serial                                                                         
                                                                                             
    ! The following loop produces expected results without 'collapse'                        
    !$acc parallel loop collapse(3) present(a, imax, jmax, kmax)                             
    do i = 1, imax                                                                           
       do j = 1, jmax                                                                        
          do k = 1, kmax                                                                     
             a(i,j,k) = -2                                                                   
          end do                                                                             
       end do                                                                                
    end do                                                                                   
                                                                                             
    !$acc update host(a)                                                                     
    write(*,*)'sub1 a = ',a                                                                  
  end subroutine sub1                                                                        
end program main  

Hi yuxichen,

What’s happening is that the values of the loop-bound variables are taken from the host copies, since they essentially define the schedule used when launching the kernel. Here, you’ve updated the device copies of the variables but not the host copies.

That said, managing scalars is not needed in most cases, including this one. By default scalars are firstprivate, so when not using collapse, they are passed in to the kernel and stored as local variables in registers. If you manage them yourself, they are stored in global memory, which could cause a slight slow-down.

Here’s a working version of your code:

 % cat test.f90
program main

  call sub1

contains
  subroutine sub1()
    real, dimension(3, 3, 3):: a
    !$acc declare create(a)

    integer:: imax = 0, jmax = 0, kmax = 0

    integer:: i, j, k
    !----------------------------------

    a = 0
    !$acc update device(a)

    imax = 2
    jmax = 2
    kmax = 2

    ! The following loop produces expected results without 'collapse'
    !$acc parallel loop collapse(3) present(a)
    do i = 1, imax
       do j = 1, jmax
          do k = 1, kmax
             a(i,j,k) = -2
          end do
       end do
    end do

    !$acc update host(a)
    write(*,*)'sub1 a = ',a
  end subroutine sub1
end program main
% nvfortran -acc -Minfo=accel test.f90; a.out
sub1:
      8, Generating create(a(:,:,:)) [if not already present]
     16, Generating update device(a(:,:,:))
     23, Generating present(a(:,:,:))
         Generating Tesla code
         24, !$acc loop gang collapse(3) ! blockidx%x
         25,   ! blockidx%x collapsed
         26,   ! blockidx%x collapsed
     32, Generating update self(a(:,:,:))
 sub1 a =    -2.000000       -2.000000        0.000000       -2.000000
   -2.000000        0.000000        0.000000        0.000000
    0.000000       -2.000000       -2.000000        0.000000
   -2.000000       -2.000000        0.000000        0.000000
    0.000000        0.000000        0.000000        0.000000
    0.000000        0.000000        0.000000        0.000000
    0.000000        0.000000        0.000000
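
If the bounds really do have to be computed on the device, the explanation above suggests one more option: copy them back with an `update host` before the collapsed loop, so that the host copies used to set up the launch match the device copies. The following is only a sketch of that idea, based on the original demo and not verified in this thread. The single added directive is marked in a comment; everything else follows the original code. Note that it trades the host-to-device copy for a small device-to-host copy of three integers.

```fortran
program main
  implicit none

  call sub1

contains
  subroutine sub1()
    real, dimension(3, 3, 3):: a
    !$acc declare create(a)

    integer:: imax = 0, jmax = 0, kmax = 0
    !$acc declare create(imax, jmax, kmax)

    integer:: i, j, k

    a = 0
    !$acc update device(a)

    ! Compute the loop bounds on the device, as in the original demo
    !$acc serial present(imax, jmax, kmax)
    imax = 2
    jmax = 2
    kmax = 2
    !$acc end serial

    ! Added line (assumption): bring the device values back so the host
    ! copies, which are read when the kernel is launched, are up to date
    !$acc update host(imax, jmax, kmax)

    !$acc parallel loop collapse(3) present(a, imax, jmax, kmax)
    do i = 1, imax
       do j = 1, jmax
          do k = 1, kmax
             a(i,j,k) = -2
          end do
       end do
    end do

    !$acc update host(a)
    write(*,*)'sub1 a = ',a
  end subroutine sub1
end program main
```

It can be built the same way as the version above, with `nvfortran -acc -Minfo=accel`.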

-Mat


Hi Mat,

Thanks for explaining!

I just want to confirm, in your version, will the loop bounds (imax,jmax,kmax) still be passed to GPU when launching the kernel? Does it mean both GPU and CPU need to know the loop bounds? Thanks!

Yes, at least in some form. Without collapse, scalars are firstprivate by default. With collapse, the loop bound becomes the product of the bounds of the collapsed loops.
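
To make the "product of the bounds" point concrete, here is a hand-collapsed version of the same loop nest. It is a conceptual sketch only, not necessarily the code nvfortran generates; the names `n` and `idx` are made up for the illustration. The single trip count `n` is computed from `imax`, `jmax` and `kmax` on the host before the loop starts, which is why the host values of the bounds matter.

```fortran
program manual_collapse
  implicit none
  real :: a(3, 3, 3)
  integer :: imax, jmax, kmax
  integer :: n, idx, i, j, k

  imax = 2
  jmax = 2
  kmax = 2
  a = 0

  ! One trip count for the whole nest, evaluated on the host
  n = imax * jmax * kmax

  !$acc parallel loop copy(a) private(i, j, k)
  do idx = 0, n - 1
     ! Recover the original indices; k varies fastest, as in the nested loops
     k = mod(idx, kmax) + 1
     j = mod(idx / kmax, jmax) + 1
     i = idx / (kmax * jmax) + 1
     a(i, j, k) = -2
  end do

  write(*,*) 'trip count   = ', n
  write(*,*) 'elements set = ', count(a == -2)
end program manual_collapse
```

With all three bounds equal to 2, both numbers printed are 8. If the bounds were still 0 when `n` is computed, the loop would run zero times, which matches the all-zero array in the original post.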

Does it mean both GPU and CPU need to know the loop bounds?

Sorry, I’m not clear on what you’re asking. The code does need to include the loop bounds no matter the target, but that’s inherent in the Fortran code.

-Mat