Example of using PGI Accelerator code and CUDA Fortran

Hello,
I’m trying to compile the code below
with the command
pgfortran -Mcuda -ta=nvidia main.f90
and I obtain:
/tmp/pgfortranCWpgu3AOGo5Q.s: Assembler messages:
/tmp/pgfortranCWpgu3AOGo5Q.s:739: Error: symbol `.STATICS3’ is already defined
Can you help me?
Thank you,
Fedele Stabile

This is the example code I prepared:

program main

use cudafor

implicit none
integer, parameter :: n1=10, nlev=60
real, dimension(n1,nlev) :: a
integer :: loop
!$acc mirror(a)

a=0.1

!$acc update device(a)
call compute(n1,nlev)
!$acc update host(a)

print*, sum(a)

end program main

subroutine compute

real, dimension(n1,nlev) :: a
real, dimension(n1,nlev), device :: adev
integer :: i,k
!$acc mirror(a)

!$acc region
adev = 0.0
do i=1,nvec
do k=1,nlev
adev(i,k)=a(i,k)*a(i,k)
end do
end do
a = adev
!$acc end region
end subroutine compute

Hi Fedele.Stabile,

While your program has a number of issues, the “.STATICS3” error does appear to be a compiler issue when a program using ACC data regions outside of a module is compiled with -Mcuda. I have written a report (TPR#18533) and sent it to our compiler engineers for further investigation.

Note that your code has a number of errors: the loop bounds are wrong (it uses nvec instead of n1), no arguments are passed to “compute”, mirror is used for the dummy argument “a” (it should be reflected), an explicit interface to “compute” is missing, and “adev” is unnecessary since “a” is mirrored.

Here’s the corrected code; however, the “.STATICS3” error will persist if you add -Mcuda.

% cat test.f90
program main

implicit none
integer, parameter :: n1=10, nlev=60
real, dimension(n1,nlev) :: a
integer :: loop
!$acc mirror(a)

interface
  subroutine compute (n1,nlev,a)
     integer :: n1, nlev
     real, dimension(n1,nlev) :: a
!$acc reflected(a)
  end subroutine compute
end interface

a=0.1

!$acc update device(a)
call compute(n1,nlev,a)
!$acc update host(a)

print*, sum(a)

end program main

subroutine compute (n1,nlev,a)
integer :: n1, nlev
real, dimension(n1,nlev) :: a
integer :: i,k
!$acc reflected(a)

!$acc region
do i=1,n1
do k=1,nlev
a(i,k)=a(i,k)*a(i,k)
end do
end do
!$acc end region
end subroutine compute
% pgfortran test.f90 -V12.3 -Minfo=accel -ta=nvidia
main:
      7, Generating local(a(:,:))
     19, Generating update device(a(:,:))
     21, Generating update host(a(:,:))
compute:
     31, Generating reflected(a(:,:))
     33, Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
     34, Loop is parallelizable
     35, Loop is parallelizable
         Accelerator kernel generated
         34, !$acc do parallel, vector(16) ! blockidx%x threadidx%x
         35, !$acc do parallel, vector(16) ! blockidx%y threadidx%y
             CC 1.0 : 6 registers; 48 shared, 8 constant, 0 local memory bytes; 100% occupancy
             CC 2.0 : 8 registers; 8 shared, 56 constant, 0 local memory bytes; 100% occupancy
% a.out
    6.000042

Moving “compute” into a module will work around the “.STATICS3” issue.

% cat test2.f90 
module foo

contains

subroutine compute (n1,nlev,a)
integer :: n1, nlev
real, dimension(n1,nlev) :: a
integer :: i,k
!$acc reflected(a)

!$acc region
do i=1,n1
do k=1,nlev
a(i,k)=a(i,k)*a(i,k)
end do
end do
!$acc end region
end subroutine compute

end module foo

program main

use foo
implicit none
integer, parameter :: n1=10, nlev=60
real, dimension(n1,nlev) :: a
integer :: loop
!$acc mirror(a)

a=0.1

!$acc update device(a)
call compute(n1,nlev,a)
!$acc update host(a)

print*, sum(a)

end program main

% pgfortran test2.f90 -V12.3 -Minfo=accel -ta=nvidia -Mcuda
compute:
      9, Generating reflected(a(:,:))
     12, Loop is parallelizable
     13, Loop is parallelizable
         Accelerator kernel generated
         12, !$acc do parallel, vector(16) ! blockidx%x threadidx%x
         13, !$acc do parallel, vector(16) ! blockidx%y threadidx%y
main:
     29, Generating local(a(:,:))
     33, Generating update device(a(:,:))
     35, Generating update host(a(:,:))
% a.out
    6.000042

Thanks!
Mat

Hello,
thank you for your answer.
I suppose I have to upgrade my PGI 12.1 version to 12.3.
But my question was about the use of device variables,
using an example code (badly written, I apologize for this):
in the compute subroutine I declared
real, dimension(n1,nlev), device :: adev
but I’m not able to compile it.
Another question:
can you explain the use of the reflected clause in conjunction with mirror?


Thank you,
Fedele

Can you help me find the error in this code?
Thank you,
Fedele

module foo
implicit none
integer, parameter :: n1=10, nlev=60
real, dimension(n1,nlev) :: a

!$acc mirror(a)

contains
subroutine compute
integer :: i,k

!$acc region
do i=1,n1
do k=1,nlev
a(i,k)=a(i,k)*a(i,k)
end do
end do
!$acc end region
end subroutine compute

end module foo

program main

use foo

a=1.0

!$acc update device(a)

call compute

!$acc update host(a)

print*, sum(a)

end program main

This is, I think, the right code:
no errors, correct use of the shared module array a
and the device array adev.

Compiled with
pgfortran -Mcuda -ta=nvidia -Minfo=accel main.f90

Can you have a look for other hidden mistakes?

Thank you,
Fedele

module foo
implicit none
integer, parameter :: n1=200, nlev=200
real, dimension(:,:), allocatable :: a

!$acc mirror(a)

contains

subroutine compute

real, dimension(n1,nlev), device :: adev
integer :: i,k

!$acc region local (adev)
do i=1,n1
do k=1,nlev
adev(i,k)=a(i,k)*a(i,k)
!a(i,k)=a(i,k)*a(i,k)
end do
end do
a = adev
!$acc end region
end subroutine compute

end module foo

program main

use foo

allocate (a(n1,nlev))
a=1.0
!$acc update device(a)
call compute
!$acc update host(a)
print*, sum(a)
deallocate(a)
end program main

Hi Fedele,

Yes, the mistake in the other program was that, for module data, mirror can only be applied to allocatable arrays.

While your new program is correct, the CUDA Fortran device array, adev, is redundant. The mirror clause has already created a device array for “a”, and the “update” directive is copying the data back and forth. So you’re duplicating effort by using adev and making your program less portable.
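
For illustration, a minimal sketch of “compute” without adev, inside the same module as above, working directly on the mirrored array (essentially the line you left commented out):

subroutine compute

integer :: i,k

! "a" is mirrored at module scope, so the kernel reads and writes
! its device copy directly; no temporary device array is needed
!$acc region
do i=1,n1
do k=1,nlev
a(i,k)=a(i,k)*a(i,k)
end do
end do
!$acc end region
end subroutine compute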

  • Mat

The purpose of my code is to test the usability of an array stored on the device and the use of the mirror clause.
My real code has a lot of parallelizable loops in different subroutines. I’m trying to reduce data movement by using either local temporary variables stored on the GPU or the save feature of common variables.
I have two questions:

  1. If the compiler doesn’t introduce a copy of the variables entering a region defined with
    !$acc region
    is there real movement of data?
  2. For a complex loop to be executed in parallel on a GPU, I’ve noticed that it is better to split the loop into many “region” constructs: what is the impact in terms of performance?

Thank you,
Fedele

  1. If the compiler doesn’t introduce a copy of the variables entering a region defined with
    !$acc region
    is there real movement of data?

The compiler will copy all necessary data to the device when a compute region begins, and back from the device when the compute region ends. It is able to recognise cases where the data only needs to be copied in one direction. Using the “copy”, “copyout”, “copyin”, or “local” clause only overrides the compiler’s default. Review the compiler feedback messages (-Minfo=accel); they will tell you when and how data is being transferred.

Note that scalar variables are privatised by default (i.e. each thread has its own copy of the scalar variable), while arrays are shared by default.
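
As a small illustration of these defaults and overrides (a standalone sketch, not from your code; the arrays b and c are made up):

program xfer
implicit none
integer, parameter :: n = 1000
real, dimension(n) :: b, c
integer :: i

b = 2.0
! copyin: b is only read on the device; copyout: c is only written,
! so each array is transferred in one direction only; the loop
! index i is a scalar and is privatised automatically
!$acc region copyin(b) copyout(c)
do i=1,n
c(i) = 2.0*b(i)
end do
!$acc end region
print*, sum(c)
end program xfer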

  2. For a complex loop to be executed in parallel on a GPU, I’ve noticed that it is better to split the loop into many “region” constructs: what is the impact in terms of performance?

Provided you use data regions to handle the sharing of common device data, there isn’t a performance impact. Without data regions, though, data would be copied back and forth at each compute region, severely impacting performance.
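
For example, here is a standalone sketch (a toy program, not from your code) of two compute regions wrapped in a data region, so “a” is transferred once at the data region boundaries instead of at every compute region:

program split
implicit none
integer, parameter :: n = 1000
real, dimension(n) :: a
integer :: i

a = 1.0
! "a" is copied to the device once here and back once at the end
!$acc data region copy(a)
!$acc region
do i=1,n
a(i) = a(i) + 1.0
end do
!$acc end region
! no host/device traffic between the two compute regions
!$acc region
do i=1,n
a(i) = a(i)*2.0
end do
!$acc end region
!$acc end data region
print*, sum(a)
end program split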

  • Mat

I’m instructing the compiler to report timing information
(-ta=nvidia,time),
and I obtain this for one of my subroutines:

avanf_e
  423: region entered 500 times
      time(us): total=146853 init=86 region=146767
                kernels=33382 data=0
      w/o init: total=146767 max=403 min=289 avg=293
  428: kernel launched 500 times
      grid: [2]  block: [32]
      time(us): total=5032 max=16 min=9 avg=10
  459: kernel launched 500 times
      grid: [50x8]  block: [8x8]
      time(us): total=13500 max=31 min=26 avg=27
  473: kernel launched 500 times
      grid: [2]  block: [32]
      time(us): total=5548 max=15 min=11 avg=11
  495: kernel launched 500 times
      grid: [4x26]  block: [16x16]
      time(us): total=9302 max=23 min=17 avg=18

Can you help me understand the meaning of the quantities
region=146767 and kernels=33382?
I supposed that region = kernels + data,
but this is not the case here.

Another question: I arranged the arrays with the mirror and local clauses so that the compiler doesn’t signal any movement of variables (no copy, copyin, or copyout): does this mean that there is no data movement? (The output shown for subroutine avanf_e prints data=0, which seems to indicate this.)
Last question: is it possible to define a data region in the main program and an acc region inside the subroutines?

Thank you for all,
Fedele

I supposed that region = kernels + data

Nope, region = CPU time + kernels + data. There is CPU overhead in setting up and launching kernels, as well as host-side data movement (if any). In your profile, 146767 - 33382 - 0 = 113385 us is host-side overhead across the 500 region entries, i.e. roughly 227 us per entry.

does this mean that there is no data movement?

Correct, and the profile information confirms that no data movement occurred.

is it possible to define a data region in the main program and an acc region inside the subroutines?

Yes. In Fortran, data regions can span subroutine boundaries using the “reflected” directive. Also, the “mirror” directive allows module allocatable arrays to be declared on the device with the same scope and lifetime as the host variable.

Take a look at this post: “Keeping data on GPU while looping and calling subroutines”, where I wrote up several examples on using reflected and mirror.
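
To sketch the idea (a toy example along the lines of test2.f90 above): the data region in the main program keeps “a” resident on the device across the call, and “reflected” tells “compute” that the device copy already exists:

module foo

contains

subroutine compute (n,a)
integer :: n
real, dimension(n) :: a
integer :: i
!$acc reflected(a)

! this compute region uses the device copy created by the
! caller's data region; no transfer happens at the call boundary
!$acc region
do i=1,n
a(i)=a(i)*a(i)
end do
!$acc end region
end subroutine compute

end module foo

program main

use foo
implicit none
integer, parameter :: n=1000
real, dimension(n) :: a

a=0.1
! "a" is copied in here and copied back at "end data region"
!$acc data region copy(a)
call compute(n,a)
!$acc end data region

print*, sum(a)

end program main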

Hope this helps,
Mat

Thank you for your help,
I understand that performance is poor if my code generates many accelerator regions in sequence during the main loop of the program.
So I decided to insert only one accelerator region that includes the loop, and to instruct the compiler to inline all the subroutines called during the cycle …
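
Presumably the build line will then look something like this (the -Minline option spelling is my assumption; I still have to check the exact inlining options in the PGI documentation):

% pgfortran -Mcuda -ta=nvidia -Minfo=accel -Minline main.f90
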
I’m studying the problem

Fedele