Example of using PGI Accelerator code and CUDA Fortran

Hello,
I’m trying to compile the code below
with the command
pgfortran -Mcuda -ta=nvidia main.f90
and I obtain:
/tmp/pgfortranCWpgu3AOGo5Q.s: Assembler messages:
/tmp/pgfortranCWpgu3AOGo5Q.s:739: Error: symbol `.STATICS3’ is already defined
Can you help me?
Thank you,
Fedele Stabile

This is the example code I prepared:

program main

use cudafor

implicit none
integer, parameter :: n1=10, nlev=60
real, dimension(n1,nlev) :: a
integer :: loop
!$acc mirror(a)

a=0.1

!$acc update device(a)
call compute(n1,nlev)
!$acc update host(a)

print*, sum(a)

end program main

subroutine compute

real, dimension(n1,nlev) :: a
real, dimension(n1,nlev), device :: adev
integer :: i,k
!$acc mirror(a)

!$acc region
adev = 0.0
do i=1,nvec
do k=1,nlev
adev(i,k)=a(i,k)*a(i,k)
end do
end do
a = adev
!$acc end region
end subroutine compute

Hi Fedele.Stabile,

While your program has a number of issues, the “.STATICS3” error does appear to be a compiler issue when a program using ACC data regions outside of a module is compiled with -Mcuda. I have written a report (TPR#18533) and sent it to our compiler engineers for further investigation.

Note that your code has a number of errors: the loop bounds are wrong (it uses nvec instead of n1), no arguments are passed to “compute”, mirror is used for the dummy argument “a” (it should be reflected), an explicit interface to “compute” is missing, and “adev” is unnecessary since “a” is mirrored.

Here’s the corrected code; however, the “.STATICS3” error will persist if you add -Mcuda.

% cat test.f90
program main

implicit none
integer, parameter :: n1=10, nlev=60
real, dimension(n1,nlev) :: a
integer :: loop
!$acc mirror(a)

interface
  subroutine compute (n1,nlev,a)
     integer :: n1, nlev
     real, dimension(n1,nlev) :: a
!$acc reflected(a)
  end subroutine compute
end interface

a=0.1

!$acc update device(a)
call compute(n1,nlev,a)
!$acc update host(a)

print*, sum(a)

end program main

subroutine compute (n1,nlev,a)
integer :: n1, nlev
real, dimension(n1,nlev) :: a
integer :: i,k
!$acc reflected(a)

!$acc region
do i=1,n1
do k=1,nlev
a(i,k)=a(i,k)*a(i,k)
end do
end do
!$acc end region
end subroutine compute
% pgfortran test.f90 -V12.3 -Minfo=accel -ta=nvidia
main:
      7, Generating local(a(:,:))
     19, Generating update device(a(:,:))
     21, Generating update host(a(:,:))
compute:
     31, Generating reflected(a(:,:))
     33, Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
     34, Loop is parallelizable
     35, Loop is parallelizable
         Accelerator kernel generated
         34, !$acc do parallel, vector(16) ! blockidx%x threadidx%x
         35, !$acc do parallel, vector(16) ! blockidx%y threadidx%y
             CC 1.0 : 6 registers; 48 shared, 8 constant, 0 local memory bytes; 100% occupancy
             CC 2.0 : 8 registers; 8 shared, 56 constant, 0 local memory bytes; 100% occupancy
% a.out
    6.000042

Moving “compute” into a module will work around the “.STATICS3” issue.

% cat test2.f90 
module foo

contains

subroutine compute (n1,nlev,a)
integer :: n1, nlev
real, dimension(n1,nlev) :: a
integer :: i,k
!$acc reflected(a)

!$acc region
do i=1,n1
do k=1,nlev
a(i,k)=a(i,k)*a(i,k)
end do
end do
!$acc end region
end subroutine compute

end module foo

program main

use foo
implicit none
integer, parameter :: n1=10, nlev=60
real, dimension(n1,nlev) :: a
integer :: loop
!$acc mirror(a)

a=0.1

!$acc update device(a)
call compute(n1,nlev,a)
!$acc update host(a)

print*, sum(a)

end program main

% pgfortran test2.f90 -V12.3 -Minfo=accel -ta=nvidia -Mcuda
compute:
      9, Generating reflected(a(:,:))
     12, Loop is parallelizable
     13, Loop is parallelizable
         Accelerator kernel generated
         12, !$acc do parallel, vector(16) ! blockidx%x threadidx%x
         13, !$acc do parallel, vector(16) ! blockidx%y threadidx%y
main:
     29, Generating local(a(:,:))
     33, Generating update device(a(:,:))
     35, Generating update host(a(:,:))
% a.out
    6.000042

Thanks!
Mat

Hello,
thank you for your answer.
I suppose I have to upgrade my PGI 12.1 version to 12.3.
But my question was about the use of device variables,
using an example code (badly written, I apologize for this):
in the compute subroutine I declared
real, dimension(n1,nlev), device :: adev
but I’m not able to compile it.
Another question:
can you explain the use of the reflected clause in conjunction with mirror?


Thank you,
Fedele

Can you help me find the error in this code?
Thank you,
Fedele

module foo
implicit none
integer, parameter :: n1=10, nlev=60
real, dimension(n1,nlev) :: a

!$acc mirror(a)

contains
subroutine compute
integer :: i,k

!$acc region
do i=1,n1
do k=1,nlev
a(i,k)=a(i,k)*a(i,k)
end do
end do
!$acc end region
end subroutine compute

end module foo

program main

use foo

a=1.0

!$acc update device(a)

call compute

!$acc update host(a)

print*, sum(a)

end program main

This is, I think, the right code:
no errors, correct use of the shared module array a
and the device array adev.

Compiled with
pgfortran -Mcuda -ta=nvidia -Minfo=accel main.f90

Can you have a look for other hidden mistakes?

Thank you,
Fedele

module foo
implicit none
integer, parameter :: n1=200, nlev=200
real, dimension(:,:), allocatable :: a

!$acc mirror(a)

contains

subroutine compute

real, dimension(n1,nlev), device :: adev
integer :: i,k

!$acc region local (adev)
do i=1,n1
do k=1,nlev
adev(i,k)=a(i,k)*a(i,k)
!a(i,k)=a(i,k)*a(i,k)
end do
end do
a = adev
!$acc end region
end subroutine compute

end module foo

program main

use foo

allocate (a(n1,nlev))
a=1.0
!$acc update device(a)
call compute
!$acc update host(a)
print*, sum(a)
deallocate(a)
end program main

Hi Fedele,

Yes, the mistake in the other program was that, for module data, mirror can only be applied to allocatable arrays.

While your new program is correct, the CUDA Fortran device array, adev, is redundant. The mirror clause has already created a device array for “a”, and the “update” directive is copying the data back and forth. So you’re duplicating effort by using adev and making your program less portable.
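
For illustration, a minimal sketch of “compute” without adev, inside the same module as above, working directly on the mirrored array (essentially the line you left commented out):

subroutine compute

integer :: i,k

! "a" is mirrored at module scope, so the kernel reads and writes
! its device copy directly; no temporary device array is needed
!$acc region
do i=1,n1
do k=1,nlev
a(i,k)=a(i,k)*a(i,k)
end do
end do
!$acc end region
end subroutine compute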

  • Mat

The purpose of my code is to test the usability of an array stored on the device and the use of the mirror clause.
My real code has a lot of parallelizable loops in different subroutines. I’m trying to reduce data movement by using either local temporary variables stored on the GPU or the save feature of common variables.
I have two questions:

  1. If the compiler doesn’t introduce a copy of the variables entering a region defined with
    !$acc region
    is there real movement of data?
  2. For a complex loop to be executed in parallel on a GPU, I’ve noticed that it is better to split the loop into many “region” constructs: what is the impact in terms of performance?

Thank you,
Fedele

  1. If the compiler doesn’t introduce a copy of the variables entering a region defined with
    !$acc region
    is there real movement of data?

The compiler will copy all necessary data to the device when a compute region begins, and back from the device when the compute region ends. It is able to recognise cases where the data only needs to be copied in one direction. Using the “copy”, “copyout”, “copyin”, or “local” clause only overrides the compiler’s default. Review the compiler feedback messages (-Minfo=accel); they will tell you when and how data is being transferred.

Note that scalar variables are privatised by default (i.e. each thread has its own copy of the scalar variable), while arrays are shared by default.
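
As a small illustration of these defaults and overrides (a standalone sketch, not from your code; the arrays b and c are made up):

program xfer
implicit none
integer, parameter :: n = 1000
real, dimension(n) :: b, c
integer :: i

b = 2.0
! copyin: b is only read on the device; copyout: c is only written,
! so each array is transferred in one direction only; the loop
! index i is a scalar and is privatised automatically
!$acc region copyin(b) copyout(c)
do i=1,n
c(i) = 2.0*b(i)
end do
!$acc end region
print*, sum(c)
end program xfer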

  2. For a complex loop to be executed in parallel on a GPU, I’ve noticed that it is better to split the loop into many “region” constructs: what is the impact in terms of performance?

Provided you use data regions to handle the sharing of common device data, there isn’t a performance impact. Without data regions, though, data would be copied back and forth at each compute region, severely impacting performance.
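
For example, here is a standalone sketch (a toy program, not from your code) of two compute regions wrapped in a data region, so “a” is transferred once at the data region boundaries instead of at every compute region:

program split
implicit none
integer, parameter :: n = 1000
real, dimension(n) :: a
integer :: i

a = 1.0
! "a" is copied to the device once here and back once at the end
!$acc data region copy(a)
!$acc region
do i=1,n
a(i) = a(i) + 1.0
end do
!$acc end region
! no host/device traffic between the two compute regions
!$acc region
do i=1,n
a(i) = a(i)*2.0
end do
!$acc end region
!$acc end data region
print*, sum(a)
end program split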

  • Mat

I’m instructing the compiler to report timing information
(-ta=nvidia,time),
and I obtain this for one of my subroutines:

avanf_e
  423: region entered 500 times
      time(us): total=146853 init=86 region=146767
                kernels=33382 data=0
      w/o init: total=146767 max=403 min=289 avg=293
  428: kernel launched 500 times
      grid: [2]  block: [32]
      time(us): total=5032 max=16 min=9 avg=10
  459: kernel launched 500 times
      grid: [50x8]  block: [8x8]
      time(us): total=13500 max=31 min=26 avg=27
  473: kernel launched 500 times
      grid: [2]  block: [32]
      time(us): total=5548 max=15 min=11 avg=11
  495: kernel launched 500 times
      grid: [4x26]  block: [16x16]
      time(us): total=9302 max=23 min=17 avg=18

Can you help me understand the meaning of the quantities
region=146767 and kernels=33382?
I supposed that region = kernels + data,
but this is not the case here.

Another question: I arranged the arrays with the mirror and local clauses so that the compiler doesn’t signal any movement of variables (no copy, copyin, or copyout): does this mean that there is no data movement? (The output shown for subroutine avanf_e prints data=0, which seems to indicate this.)
Last question: is it possible to define a data region in the main program and an acc region inside the subroutines?

Thank you for all,
Fedele

I supposed that region = kernels + data

Nope, region = CPU time + kernels + data. There is CPU overhead in setting up and launching kernels, as well as host-side data movement (if any). In your profile, 146767 - 33382 - 0 = 113385 us is host-side overhead across the 500 region entries, i.e. roughly 227 us per entry.

does this mean that there is no data movement?

Correct, and the profile information confirms that no data movement occurred.

is it possible to define a data region in the main program and an acc region inside the subroutines?

Yes. In Fortran, data regions can span subroutine boundaries using the “reflected” directive. Also, the “mirror” directive allows module allocatable arrays to be declared on the device with the same scope and lifetime as the host variable.

Take a look at this post: “Keeping data on GPU while looping and calling subroutines”, where I wrote up several examples on using reflected and mirror.
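
To sketch the idea (a toy example along the lines of test2.f90 above): the data region in the main program keeps “a” resident on the device across the call, and “reflected” tells “compute” that the device copy already exists:

module foo

contains

subroutine compute (n,a)
integer :: n
real, dimension(n) :: a
integer :: i
!$acc reflected(a)

! this compute region uses the device copy created by the
! caller's data region; no transfer happens at the call boundary
!$acc region
do i=1,n
a(i)=a(i)*a(i)
end do
!$acc end region
end subroutine compute

end module foo

program main

use foo
implicit none
integer, parameter :: n=1000
real, dimension(n) :: a

a=0.1
! "a" is copied in here and copied back at "end data region"
!$acc data region copy(a)
call compute(n,a)
!$acc end data region

print*, sum(a)

end program main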

Hope this helps,
Mat

Thank you for your help,
I understand that performance is poor if my code generates many accelerator regions in sequence during the main loop of the program.
So I decided to insert only one accelerator region that includes the loop, and to instruct the compiler to inline all the subroutines called during the cycle …
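
Presumably the build line will then look something like this (the -Minline option spelling is my assumption; I still have to check the exact inlining options in the PGI documentation):

% pgfortran -Mcuda -ta=nvidia -Minfo=accel -Minline main.f90
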
I’m studying the problem

Fedele