Keeping data on GPU while looping and calling subroutines

Hi,

I would like to put data on the GPU-memory once at the beginning of the program, and use this data (which is never modified) within a loop calling a subroutine.
Here is a trivial example:

subroutine accumulateTrigo(a, size, sum)
    integer :: ii,jj, size
    real, dimension(size) :: a
    real :: sum

!$acc data region copyin(a)
    do jj=1,500
    sum=0.0 
!$acc region
      do ii=1,size
            sum = sum + sin(a(ii)) ** 2 + cos(a(ii)) ** 2
      enddo
!$acc end region      
    enddo
!$acc end data region
    print *, "sum = ", sum
    return
end subroutine



                        
program main
    real, dimension(100000) ::  X
    integer :: Xsize,m,i,k,c1,c2    
    real :: lastSum
    
    Xsize = 100000    
    m = 5           ! m calls to subroutine accumulateTrigo
  
! GPU initialization
#ifdef _ACCEL
    call acc_init( acc_device_nvidia )
#endif    

! initialization of array X
    do i = 1,Xsize
        X(i) = (i*2.0)
    enddo

! computations on GPU    
    call system_clock( count=c1 )
    do k= 1, m      
        call accumulateTrigo(X, Xsize, lastSum)
    enddo
    
    print *, "LAST = ", lastSum
    call system_clock( count=c2 )
    print *, (c2-c1)/1000.0, ' milliseconds'
end program

This works perfectly, nevertheless, array X is copied 5 times to the GPU (at the data region entry, each time the subroutine is called). But X is never modified, so I’d like to copy it only once.

I try to use the “!$acc mirror” and “!$acc reflected” directives, but I cannot compile nor it gives me segfault. I probably miss something obvious, since I am not a FORTRAN expert and get confused with that “dummy array” thing. Thus, I’d really appreciate any help

Please note that I read this post to inline the subroutine, but when I try to put a region around the k loop in the main program, I get the same ACON unsupported operation described here

Thanks a lot
Nicolas

Hi Nicolas,

I try to use the “!$acc mirror” and “!$acc reflected” directives, but I cannot compile nor it gives me segfault. I probably miss something obvious, since I am not a FORTRAN expert and get confused with that “dummy array” thing. Thus, I’d really appreciate any help

Using mirror or reflected is the way to do this. The only thing you’re missing is an F90 interface. The easiest thing to do is put your routine in a module where an implicit interface is created. Though, you can also add an explicit interface. Examples for both are bellow. Note that I changed the module version a bit to show how mirror works, but you could use reflected instead.

  • Mat

Using reflected:

% cat reflect.f90 

subroutine accumulateTrigo(a, size, sum)
    integer :: ii,jj, size
    real, dimension(size) :: a
    real :: sum
!$acc reflected (a)

    do jj=1,500
    sum=0.0
!$acc region
      do ii=1,size
            sum = sum + sin(a(ii)) ** 2 + cos(a(ii)) ** 2
      enddo
!$acc end region     
    enddo
    print *, "sum = ", sum
    return
end subroutine

                       
program main
    real, dimension(100000) ::  X
    integer :: Xsize,m,i,k,c1,c2   
    real :: lastSum
  
interface 
  subroutine accumulateTrigo(a, size, sum)
    integer :: size
    real, dimension(size) :: a
    real :: sum
!$acc reflected (a)
  end subroutine accumulateTrigo
end interface 
 
    Xsize = 100000   
    m = 5           ! m calls to subroutine accumulateTrigo
 
! GPU initialization
#ifdef _ACCEL
    call acc_init( acc_device_nvidia )
#endif   

! initialization of array X
    do i = 1,Xsize
        X(i) = (i*2.0)
    enddo

!$acc data region copyin(X)
! computations on GPU   
    call system_clock( count=c1 )
    do k= 1, m     
        call accumulateTrigo(X, Xsize, lastSum)
    enddo
!$acc end data region
   
    print *, "LAST = ", lastSum
    call system_clock( count=c2 )
    print *, (c2-c1)/1000.0, ' milliseconds'
end program 
% pgfortran reflect.f90 -Mpreprocess -ta=nvidia -Minfo=accel -fast
accumulatetrigo:
      6, Generating reflected(a(:))
     10, Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
     11, Loop is parallelizable
         Accelerator kernel generated
         11, !$acc do parallel, vector(256) ! blockidx%x threadidx%x
             CC 1.0 : 13 registers; 1080 shared, 132 constant, 28 local memory bytes; 66% occupancy
             CC 2.0 : 15 registers; 1032 shared, 140 constant, 4 local memory bytes; 100% occupancy
         12, Sum reduction generated for sum
main:
     48, Generating copyin(x(:))
% a.out
 sum =     100000.0    
 sum =     100000.0    
 sum =     100000.0    
 sum =     100000.0    
 sum =     100000.0    
 LAST =     100000.0    
    2011.558      milliseconds

Using Mirror:

% cat mirror.f90 
module mymod

!MEC Change this to be an allocatable array so Mirror can be used.
    real, allocatable, dimension(:) ::  X
!$acc mirror (X)

contains

! Since X is a module array, no need to pass it in
subroutine accumulateTrigo(size, sum)
    integer :: ii,jj, size
    real :: sum

    do jj=1,500
    sum=0.0
!$acc region
      do ii=1,size
            sum = sum + sin(X(ii)) ** 2 + cos(X(ii)) ** 2
      enddo
!$acc end region     
    enddo
    print *, "sum = ", sum
    return
end subroutine

end module mymod
                       
program main
    use mymod
    integer :: Xsize,m,i,k,c1,c2   
    real :: lastSum

! GPU initialization
#ifdef _ACCEL
    call acc_init( acc_device_nvidia )
#endif   
   
    Xsize = 100000  
!MEC X needs to be allocated. One copy is done on the device
!MEC and one on the host.  No data movement is done! 
    allocate(X(Xsize))
    m = 5           ! m calls to subroutine accumulateTrigo
 
! initialization of array X
!MEC Either put this into  a compute region
!$acc region do
    do i = 1,Xsize
        X(i) = (i*2.0)
    enddo
! 
!MEC or use the update clause to syncronize the host and device copies.
!acc update device(X) 

! computations on GPU   
    call system_clock( count=c1 )
    do k= 1, m     
        call accumulateTrigo(Xsize, lastSum)
    enddo
   
    print *, "LAST = ", lastSum
    call system_clock( count=c2 )
    print *, (c2-c1)/1000.0, ' milliseconds'
end program 
% pgf90 mirror.f90 -Mpreprocess -ta=nvidia -Minfo=accel -fast
accumulatetrigo:
     16, Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
     17, Loop is parallelizable
         Accelerator kernel generated
         17, !$acc do parallel, vector(256) ! blockidx%x threadidx%x
             CC 1.0 : 13 registers; 1088 shared, 132 constant, 28 local memory bytes; 66% occupancy
             CC 2.0 : 15 registers; 1032 shared, 148 constant, 4 local memory bytes; 100% occupancy
         18, Sum reduction generated for sum
main:
     46, Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
     47, Loop is parallelizable
         Accelerator kernel generated
         47, !$acc do parallel, vector(256) ! blockidx%x threadidx%x
             CC 1.0 : 3 registers; 48 shared, 8 constant, 0 local memory bytes; 100% occupancy
             CC 2.0 : 6 registers; 8 shared, 56 constant, 0 local memory bytes; 100% occupancy
% a.out
 sum =     100000.0    
 sum =     100000.0    
 sum =     100000.0    
 sum =     100000.0    
 sum =     100000.0    
 LAST =     100000.0    
    1960.875      milliseconds

This is very interesting!

What is the difference between mirror and reflected? When would you use one over the other?

What is the difference between mirror and reflected?

“Mirror” mirrors the allocation status between the device and host of allocatable arrays. It also creates an implicit data region having the same scope as the array.

reflected is an attribute applied to a dummy argument indicating that it already has a device copy. reflected must contained within a higher level explicit or implicit data region.

When would you use one over the other?

Mirror gives a more ‘global’ view of your data but requires direct knowledge of variable name (this is why I changed the subroutine in the mirror.f90 example from using “a” to “X”). Reflected works on smaller scopes but allows for different device arrays to used.

The two ideas can work together. For example, let’s modify the reflect.f90 example to have different “mirror” arrays passed to the subroutine.

% cat reflect2.f90 
module mymod

    real, allocatable, dimension(:) ::  X,Y
!$acc mirror(X,Y)
!MEC This creates an implicit data region for X, Y within the same scope.

contains

subroutine accumulateTrigo(a, size, sum)
    integer :: ii,jj, size
    real, dimension(size) :: a
    real :: sum
!$acc reflected (a)

    do jj=1,500
    sum=0.0
!$acc region
      do ii=1,size
            sum = sum + sin(a(ii)) ** 2 + cos(a(ii)) ** 2
      enddo
!$acc end region     
    enddo
    print *, "sum = ", sum
    return
end subroutine

end module mymod
                       
program main
    use mymod
    integer :: Xsize,Ysize,m,i,k,c1,c2   
    real :: lastXSum, lastYSum 
    Xsize = 100000   
    Ysize = 160000   
    m = 5           ! m calls to subroutine accumulateTrigo
 
! GPU initialization
#ifdef _ACCEL
    call acc_init( acc_device_nvidia )
#endif   

    allocate(X(Xsize),Y(Ysize))

! initialization of array X
    do i = 1,Xsize
        X(i) = (i*2.0)
    enddo
    do i = 1,Ysize
        Y(i) = (i*2.0)
    enddo

!$acc update device(X,Y)

! computations on GPU   
    call system_clock( count=c1 )
    do k= 1, m     
        call accumulateTrigo(X, Xsize, lastXSum)
        call accumulateTrigo(Y, Ysize, lastYSum)
    enddo
   
    print *, "LAST = ", lastXSum, lastYSum
    call system_clock( count=c2 )
    print *, (c2-c1)/1000.0, ' milliseconds'
end program 
% pgf90 -ta=nvidia -Mpreprocess reflect2.f90 -Minfo=accel
accumulatetrigo:
     13, Generating reflected(a(:))
     17, Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
     18, Loop is parallelizable
         Accelerator kernel generated
         18, !$acc do parallel, vector(256) ! blockidx%x threadidx%x
             CC 1.0 : 13 registers; 1080 shared, 132 constant, 28 local memory bytes; 66% occupancy
             CC 2.0 : 15 registers; 1032 shared, 140 constant, 4 local memory bytes; 100% occupancy
         19, Sum reduction generated for sum
main:
     52, Generating update device(y(:))
         Generating update device(x(:))
% a.out
 sum =     100000.0    
 sum =     160000.0    
 sum =     100000.0    
 sum =     160000.0    
 sum =     100000.0    
 sum =     160000.0    
 sum =     100000.0    
 sum =     160000.0    
 sum =     100000.0    
 sum =     160000.0    
 LAST =     100000.0        160000.0    
    4220.299      milliseconds

Can the interface be between two subprograms? Or is it limited to a main program and a subprogram?

Can the interface be between two subprograms?

Yes.

Hi Mat,

Sorry for coming back to this post.

The interface and module stuff is new to me.

Could you show me how the mirrored example would look like without the use of a module (using an explicit interface with the subroutine).

And lets say that the call tree goes through an intermediate subroutine before the subroutine containing the gpu kernel:

program main > subroutine intermediate > subroutine accumulateTrigo

would i need to have an interface between main and intermediate and an interface between intermediate and accumulateTrigo?

Thanks,
Sarom

Here you go. Using the initial reflected example as the base, I modified X to be allocatable and then added the mirror directive. mirror creates an implicit data region with the same scope as the variable, hence I removed the original explicit data region. I also put X’s initialisation loop into an compute region so I didn’t need to copy the data (otherwise you need to use an update directive to get the data over to the GPU).

would i need to have an interface between main and intermediate and an interface between intermediate and accumulateTrigo?

Yes and I updated the example (reflect3.f90) to reflect this. However, it would be uncommon to do this. More likely you would move these routines into a module where an implicit interface is created or create a module that contains nothing but an interface (reflect4.f90). The C equivalent would be a header file with prototype functions.

Much of your challenge with GAMESS will be porting it to F90. But, at least in my opinion, F90 is a much better language than F77 and well worth the effort.

  • Mat
% cat reflect3.f90

subroutine accumulateTrigo(a, size, sum)
    integer :: ii,jj, size
    real, dimension(size) :: a
    real :: sum
!$acc reflected (a)
    do jj=1,500
    sum=0.0
!$acc region
      do ii=1,size
            sum = sum + sin(a(ii)) ** 2 + cos(a(ii)) ** 2
      enddo
!$acc end region     
    enddo
    return
end subroutine


subroutine intermediate(a, size, sum)
    integer ::  size
    real, dimension(size) :: a
    real :: sum
!$acc reflected (a)

interface
  subroutine accumulateTrigo(a, size, sum)
    integer :: size
    real, dimension(size) :: a
    real :: sum
!$acc reflected (a)
  end subroutine accumulateTrigo
end interface

    print *, 'INTER size=', size 
    call accumulateTrigo(a, size, sum)
    print *, 'INTER sum=', sum

end subroutine intermediate
                       
program main
    real, allocatable, dimension(:) ::  X
    integer :: Xsize,m,i,k,c1,c2   
    real :: lastSum
!$acc mirror (X)
 
interface
  subroutine intermediate(a, size, sum)
    integer :: size
    real, dimension(size) :: a
    real :: sum
!$acc reflected (a)
  end subroutine intermediate
end interface
 
    Xsize = 100000 
    allocate(X(Xsize))  
    m = 5           ! m calls to subroutine accumulateTrigo
 
! GPU initialization
#ifdef _ACCEL
    call acc_init( acc_device_nvidia )
#endif   

! initialization of array X
!$acc region do
    do i = 1,Xsize
        X(i) = (i*2.0)
    enddo

! computations on GPU   
    call system_clock( count=c1 )
    do k= 1, m     
        call intermediate(X, Xsize, lastSum)
    enddo
   
    print *, "LAST = ", lastSum
    call system_clock( count=c2 )
    print *, (c2-c1)/1000.0, ' milliseconds'
end program



% cat reflect4.f90

module myinter

interface
  subroutine accumulateTrigo(a, size, sum)
    integer :: size
    real, dimension(size) :: a
    real :: sum
!$acc reflected (a)
  end subroutine accumulateTrigo

  subroutine intermediate(a, size, sum)
    integer :: size
    real, dimension(size) :: a
    real :: sum
!$acc reflected (a)
  end subroutine intermediate

end interface

end module myinter


subroutine accumulateTrigo(a, size, sum)
    integer :: ii,jj, size
    real, dimension(size) :: a
    real :: sum
!$acc reflected (a)
    do jj=1,500
    sum=0.0
!$acc region
      do ii=1,size
            sum = sum + sin(a(ii)) ** 2 + cos(a(ii)) ** 2
      enddo
!$acc end region     
    enddo
    return
end subroutine


subroutine intermediate(a, size, sum)
    use myinter
    integer ::  size
    real, dimension(size) :: a
    real :: sum
!$acc reflected (a)

    print *, 'INTER size=', size 
    call accumulateTrigo(a, size, sum)
    print *, 'INTER sum=', sum

end subroutine intermediate
                       
program main
    use myinter
    real, allocatable, dimension(:) ::  X
    integer :: Xsize,m,i,k,c1,c2   
    real :: lastSum
!$acc mirror (X)
 
    Xsize = 100000 
    allocate(X(Xsize))  
    m = 5           ! m calls to subroutine accumulateTrigo
 
! GPU initialization
#ifdef _ACCEL
    call acc_init( acc_device_nvidia )
#endif   

! initialization of array X
!$acc region do
    do i = 1,Xsize
        X(i) = (i*2.0)
    enddo

! computations on GPU   
    call system_clock( count=c1 )
    do k= 1, m     
        call intermediate(X, Xsize, lastSum)
    enddo
   
    print *, "LAST = ", lastSum
    call system_clock( count=c2 )
    print *, (c2-c1)/1000.0, ' milliseconds'
end program

Thanks Mat.

You are right.

Utilizing mirrored and reflected requires replacing those X() arrays with allocatable arrays.

We also lack a Fortran 90 expert in the group. So getting a handle of the ‘interface’ and ‘module’ constructs will be interesting.

Which brings up my next question.

How do I properly treat allocatable arrays with an interface?

Utilizing mirrored and reflected requires replacing those X() arrays with allocatable arrays.

Mirror is for use with allocatable arrays since it “mirrors” the allocation status of the array on both the host and device. However, reflected can be used with static arrays (See the first reflected.f90 example).

How do I properly treat allocatable arrays with an interface?

Use assumed-shape array syntax. The size of “a” will be determined at runtime, while it’s shape (type, rank, kind) is known at compile time.

interface
  subroutine accumulateTrigo(a, size, sum)
    integer :: size
    real, dimension(:) :: a
    real :: sum
!$acc reflected (a)
  end subroutine accumulateTrigo
end interface

Hi Mat,

I was able to get reflected to work in GAMESS.

However, I would like to make use of mirror.

If you don’t mind, could you rewrite your mirror.f90 example without using modules.

I don’t want to start using the module construct yet until I have a better understanding of it.

I am comfortable with interfaces.

Many thanks again,
Sarom

Hi Sarom,

I was able to get reflected to work in GAMESS.

Excellent.

If you don’t mind, could you rewrite your mirror.f90 example without using modules.

Already done. That 's the ‘reflect3.f90’ example.

I don’t want to start using the module construct yet until I have a better understanding of it.
I am comfortable with interfaces.

You’ll find using modules is a lot easier (a least a lot less typing) then writing our explicit interfaces for every subroutine. Though, I certainly understand.

Best Regards,
Mat

Hi Mat,

I am trying to implement the mirror directive in GAMESS but I get a warning of:
Invalid accelerator data region: not a single-entry single-exit region

I suspected this to be due to the presence of an ENTRY statement in the subroutine. So, I modified the sample code from this thread to include an ENTRY statement and was able to reproduce the warning message.

Is there a work around for mirror directives and the presence of the ENTRY statement?

module myinter

interface
  subroutine accumulateTrigo(a, size, sum)
    integer :: size
    real, dimension(size) :: a
    real :: sum
!$acc reflected (a)
  end subroutine accumulateTrigo

  subroutine intermediate(a, size, sum)
    integer :: size
    real, dimension(size) :: a
    real :: sum
!$acc reflected (a)
  end subroutine intermediate
end interface
end module myinter


subroutine accumulateTrigo(a, size, sum)
    integer :: ii,jj, size
    real, dimension(size) :: a
    real :: sum
!$acc reflected (a)
    do jj=1,500
    sum=0.0
!$acc region
      do ii=1,size
            sum = sum + sin(a(ii)) ** 2 + cos(a(ii)) ** 2
      enddo
!$acc end region     
    enddo
    return
end subroutine


subroutine intermediate(a, size, sum)
    use myinter
    integer ::  size
    real, dimension(size) :: a
    real :: sum
!$acc reflected (a)

    print *, 'INTER size=', size
    call accumulateTrigo(a, size, sum)
    print *, 'INTER sum=', sum

end subroutine intermediate
                       
subroutine mirror
    use myinter
    real, allocatable, dimension(:) ::  X
    integer :: Xsize,m,i,k,c1,c2   
    real :: lastSum
!$acc mirror (X)
    
    Xsize = 100000

    entry mirror_jump(dummy_arg)

    allocate(X(Xsize)) 
    m = 5           ! m calls to subroutine accumulateTrigo
 
! GPU initialization
#ifdef _ACCEL
    call acc_init( acc_device_nvidia )
#endif   

! initialization of array X
    do i = 1,Xsize
        X(i) = (i*2.0)
    enddo
!$acc update device(X)

! computations on GPU   
    call system_clock( count=c1 )
    do k= 1, m     
        call intermediate(X, Xsize, lastSum)
    enddo

    print *, "LAST = ", lastSum
    call system_clock( count=c2 )
    print *, (c2-c1)/1000.0, ' milliseconds'
end subroutine

program main
    call mirror
end program

Here is a reflected version that works with an ENTRY statement.

module myinter

interface
  subroutine accumulateTrigo(a, size, sum)
    integer :: size
    real, dimension(size) :: a
    real :: sum
!$acc reflected (a)
  end subroutine accumulateTrigo

  subroutine intermediate(a, size, sum)
    integer :: size
    real, dimension(size) :: a
    real :: sum
!$acc reflected (a)
  end subroutine intermediate
end interface
end module myinter


subroutine accumulateTrigo(a, size, sum)
    integer :: ii,jj, size
    real, dimension(size) :: a
    real :: sum
!$acc reflected (a)
    do jj=1,500
    sum=0.0
!$acc region
      do ii=1,size
            sum = sum + sin(a(ii)) ** 2 + cos(a(ii)) ** 2
      enddo
!$acc end region     
    enddo
    return
end subroutine


subroutine intermediate(a, size, sum)
    use myinter
    integer ::  size
    real, dimension(size) :: a
    real :: sum
!$acc reflected (a)

    print *, 'INTER size=', size
    call accumulateTrigo(a, size, sum)
    print *, 'INTER sum=', sum

end subroutine intermediate
                       
subroutine reflected(Xsize)
    use myinter
    integer :: Xsize,m,i,k,c1,c2
    real :: lastSum
    real, dimension(:) ::  X(Xsize)
    
    entry mirror_jump(dummy_args)
    
    m = 5           ! m calls to subroutine accumulateTrigo
 
! GPU initialization
#ifdef _ACCEL
    call acc_init( acc_device_nvidia )
#endif   

! initialization of array X
    do i = 1,Xsize
        X(i) = (i*2.0)
    enddo
!$acc data region copyin(X)

! computations on GPU   
    call system_clock( count=c1 )
    do k= 1, m     
        call intermediate(X, Xsize, lastSum)
    enddo
!$acc end data region
    
    print *, "LAST = ", lastSum
    call system_clock( count=c2 )
    print *, (c2-c1)/1000.0, ' milliseconds'
end subroutine

program main
    call reflected(100000)
end program

Hi Sarom,

Is there a work around for mirror directives and the presence of the ENTRY statement?

Data and compute regions must have a single entry and exit point. So the work around in the mirror case is to move the data region so it’s declared after the entry statement.

The directive “!$acc mirror(X)” creates an implicit data region having the same scope as the X array. However, “mirror” can also be a clause in an explicit data region.

    entry mirror_jump(dummy_arg)

! GPU initialization
#ifdef _ACCEL
    call acc_init( acc_device_nvidia )
#endif   

!$acc data region mirror (X)
    allocate(X(Xsize))
    m = 5           ! m calls to subroutine accumulateTrigo

Though, there’s no real advantage of using a mirror clause here. The same thing could be expressed with the local clause depending on when the X array is allocated.

    entry mirror_jump(dummy_arg)

    allocate(X(Xsize))

! GPU initialization
#ifdef _ACCEL
    call acc_init( acc_device_nvidia )
#endif   

!$acc data region local (X)
    m = 5           ! m calls to subroutine accumulateTrigo
  • Mat