Declarative data directive error in PGI Fortran 10

I believe the full PGI Accelerator spec v1.0 has been implemented in PGI Fortran 10. However, when I use a declarative data directive in my program, I get a compile error.
Code:

       program test
          real, dimension(10,10) :: a, b
          integer :: i
!$ACC copyin(a), local(b)             
          i = 19
          a = 1.1
          b = 3.0
        end program

Compile ERROR:

PGF90-S-0034-Syntax error at or near COPYIN (main.f: 4)
0 inform, 0 warnings, 1 severes, 0 fatal for test

I get the same error if I use the directive inside a subroutine.

Tuan.

Hi Tuan,

The copyin and local clauses need to be part of a region or data region directive. For example:

       program test 
          real, dimension(10,10) :: a, b 
          integer :: i 
!$ACC data region copyin(a), local(b)              
          i = 19 
          a = 1.1 
          b = 3.0 
!$ACC end data region
        end program

Hope this helps,
Mat

Hi Mat,
Thanks for the prompt response. Could you please explain section 2.6 (Declarative Data Directives) of the “PGI Fortran & C Accelerator Programming Model” document to me? In that section, the syntax is

!$acc declclause [,declclause]...

where declclause can be copy(list), copyin(list), etc.

As described in the document, programmers don’t have to define a region explicitly, since it is implicitly defined as the whole subprogram unit.

Tuan.

Hi Tuan,

The design document is a bit ahead of the implementation. Declarative data directives are still under development but should be available early next year.

Thanks,
Mat

Dear Mat, are the declarative data directives still not working?

I tried the following with PGI compiler 10.9 and get this warning:
[rambo@superbeast working-fortran-example-with-gpu]$ mpif90 -fast -ta=nvidia,time -Minfo=all,accel -mcmodel=medium -Minline mm.f

PGF90-W-0155-Unrecognized ACC directive: declclausecopyin (mm.f: 13)


The sample Fortran code I am testing with is below:

       program main
         call MM()
         call MM()
         end program main


         subroutine MM ()
         integer dim1, dim2, dim3
         parameter (dim1 = 10000, dim2 = 10000, dim3 = 10000)
         double precision A(dim1, dim2), B(dim2, dim3), C(dim1, dim3)
         real start, finish
!$acc declclause copyin(A) copyin(B)

      call srand(86456)
      do i = 1, dim1
        do j = 1, dim2
          A(i, j) = rand()
        enddo
      enddo
      do i = 1, dim2
        do j = 1, dim3
          B(i, j) = rand()
        enddo
      enddo

      call cpu_time(start)

!$acc data region copyout(C)

!$acc region
        do j = 1, dim3
        do i = 1, dim1
          C(i, j) = 0
        enddo
        do k = 1, dim2
          do i = 1, dim1
            C(i, j) = C(i, j) + A(i, k)*B(k, j)
          enddo
        enddo
       enddo
!$acc end region

!$acc end data region

      call cpu_time(finish)
      print *,'time for C(',dim1,',',dim3,') = A(',dim1,',',dim2,') B(',
     1dim2,',',dim3,') is',finish - start,' s'

      end subroutine MM

Thanks for your help.

Hi sindimo,

Implicit data regions will be available in the upcoming 2011 (11.0) release.

Note that ‘declclause’ in the documentation is meant to be a placeholder for one of the clauses (copy, copyin, etc.), not a clause itself. Hence, change:

!$acc declclause copyin(A) copyin(B)

to

!$acc copyin(A) copyin(B)

However, since the copy to the device happens at the line where you placed the directive, you are actually copying junk values to the device. Instead, you need A and B to be copied after they are initialized. The simplest way to do this is to add copyin(A,B) to the explicit data region (this works in 10.9 as well). To use an implicit data region, you would need to declare A and B as local and then use the updatein clause to copy the data to the device (see the example below).
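For the first option, the only change to your original mm.f would be on the data region directive; a minimal sketch (the loops stay exactly as in your code):

!$acc data region copyin(A,B) copyout(C)
!$acc region
!       ... matrix-multiply loops as before ...
!$acc end region
!$acc end data region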

Finally, since the total size of your arrays is larger than 2GB, you will need to compile with the “-mcmodel=medium” flag.

Hope this helps,
Mat

% cat decl.f90

       program main
         call MM()
         call MM()
         end program main


         subroutine MM ()
         integer dim1, dim2, dim3
         parameter (dim1 = 10000, dim2 = 10000, dim3 = 10000)
         double precision A(dim1, dim2), B(dim2, dim3), C(dim1, dim3)
         real start, finish
!$acc local(A,B) 

      call srand(86456)
      do i = 1, dim1
        do j = 1, dim2
          A(i, j) = rand()
        enddo
      enddo
      do i = 1, dim2
        do j = 1, dim3
          B(i, j) = rand()
        enddo
      enddo

!$acc updatein(A,B)

      call cpu_time(start)

!$acc data region copyout(C)
! in 10.9 and earlier use 
!  acc data region copyin(A,B), copyout(C)

!$acc region
        do j = 1, dim3
        do i = 1, dim1
          C(i, j) = 0
        enddo
        do k = 1, dim2
          do i = 1, dim1
            C(i, j) = C(i, j) + A(i, k)*B(k, j)
          enddo
        enddo
       enddo
!$acc end region

!$acc end data region

     call cpu_time(finish)
     print *,'time for C(',dim1,',',dim3,') = A(',dim1,',',dim2,') B(', &
      dim2,',',dim3,') is',finish - start,' s'

      end subroutine MM

% pgf90 -ta=nvidia -Minfo=accel decl.f90 -V11.0 -mcmodel=medium
mm:
     13, Generating local(b(:,:))
         Generating local(a(:,:))
     27, Generating !$acc update device(b(:,:))
         Generating !$acc update device(a(:,:))
     31, Generating copyout(c(:,:))
     33, Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
     34, Loop is parallelizable
     35, Loop is parallelizable
         Accelerator kernel generated
         34, !$acc do parallel, vector(16) ! blockidx%y threadidx%y
         35, !$acc do parallel, vector(16) ! blockidx%x threadidx%x
             CC 1.3 : 8 registers; 32 shared, 8 constant, 0 local memory bytes; 100% occupancy
             CC 2.0 : 10 registers; 8 shared, 40 constant, 0 local memory bytes; 100% occupancy
     38, Loop carried reuse of 'c' prevents parallelization
     39, Loop is parallelizable
         Accelerator kernel generated
         34, !$acc do parallel, vector(16) ! blockidx%y threadidx%y
         38, !$acc do seq(16)
             Cached references to size [16x16] block of 'a'
             Cached references to size [16x16] block of 'b'
         39, !$acc do parallel, vector(16) ! blockidx%x threadidx%x
             Using register for 'c'
             CC 1.3 : 26 registers; 4400 shared, 32 constant, 0 local memory bytes; 50% occupancy
             CC 2.0 : 40 registers; 4360 shared, 60 constant, 0 local memory bytes; 50% occupancy
% a.out
 time for C(        10000 ,        10000 ) = A(        10000 ,        10000 
 ) B(        10000 ,        10000 ) is    47.59940      s
 time for C(        10000 ,        10000 ) = A(        10000 ,        10000 
 ) B(        10000 ,        10000 ) is    47.10842      s

Dear Mat, thanks for your quick response and clarifications.

One more side question, please: how were you able to determine that the total size of the arrays is larger than 2GB? Can the PGI compiler point that out during compilation?

This would be helpful since we are trying to migrate one of our programs to run on GPUs, and our goal in experimenting with the data regions is to minimize data movement between the CPU and GPU as much as possible. We are seeing a bottleneck in the PCIe communication between the two: data movement takes much more time than the actual kernel processing, which causes significant overhead.

The code I posted earlier is just a simple matrix multiplication that mimics a portion of the actual program we are trying to port to GPUs using the PGI directives; the array sizes in the actual program might be even larger than these.

If you have any suggestions or hints from your experience on reducing data movement, please feel free to share them.

Many thanks for your help.

Hi Mohamad Sindi,

One more side question, please: how were you able to determine that the total size of the arrays is larger than 2GB? Can the PGI compiler point that out during compilation?

The first time I tried to compile your program I got a “relocation truncated to fit: R_X86_64_PC32 against symbol …” error. That typically means your static data size is greater than 2GB. I then looked at the code and found you have three 10000x10000 double precision arrays: 3 × 10000 × 10000 × 8 bytes is about 2.4 × 10^9 bytes, roughly 2.2GB.

If you have any suggestions or hints from your experience on reducing data movement, please feel free to share them.

The “reflected” and “mirror” directives will be available in the 11.0 release. Reflected will allow you to have data regions that span subroutine calls, thus letting you keep more data on the device for larger portions of your code.

Mirror applies to module allocatable arrays. It mirrors the allocation status of the array on the host and the GPU. The caveat is that you will need to manage the data movement yourself using the update directives, and be careful to understand which copy of the array, host or device, you are working on.
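A rough sketch of how mirror might be used, based on the description above (illustrative only: the module and array names are made up, and I am assuming the same updatein spelling used earlier in this thread):

      module fields
        double precision, allocatable :: A(:,:)
!$acc mirror(A)
      end module fields

      program use_mirror
      use fields
      allocate(A(1000,1000))     ! allocation status is mirrored on host and device
      A = 1.0d0                  ! fill the host copy
!$acc updatein(A)                ! push the host values to the mirrored device copy
      end program use_mirror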

Other things:

  • Have a basic understanding of the GPU architecture; PGI’s introductory article on the Accelerator programming model is a good place to start.
  • Get good at reading the -Minfo=accel messages the compiler prints during compilation. They hold valuable clues on ways to improve performance.
  • Specifically, look for “non-stride-1” messages. These mean that the GPU threads are not accessing global data sequentially. Data movement between the device’s global and local memory can matter more for performance than optimizing the host/GPU data movement (see the short sketch after this list).
  • Compile with basic profiling enabled during development, i.e. “-ta=nvidia,time”. This will highlight the performance bottlenecks and show the actual cost of moving data to the GPU.
  • Make sure your algorithm is data parallel and can utilize tens of thousands of threads. Without this you may still be able to run your code on a GPU, but the performance will most likely be poor.
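
To illustrate the stride-1 point (a generic example, not taken from the code in this thread): Fortran stores arrays column-major, so the inner loop should run over the first subscript.

      program stride_demo
      integer, parameter :: n = 1000
      double precision :: a(n,n), b(n,n), c(n,n)
      integer :: i, j
      a = 1.0d0
      b = 2.0d0
      do j = 1, n              ! outer loop over the second (column) index
        do i = 1, n            ! inner loop over the first index: stride-1 access
          c(i,j) = a(i,j) + b(i,j)
        enddo
      enddo
      print *, c(1,1)
      end program stride_demo

Swapping the loop order (inner loop over j) would access a, b and c with a stride of n elements and typically produces the “non-stride-1” messages.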

Please feel free to ask specific questions as you encounter them. While the PGI Accelerator model does make GPU programming easier, GPU programming is still hard (at least at first).

Mat

Thanks Mat for your valuable feedback!

Dear Mat, I hope you’re doing well and happy holidays.

It seems that PGI 11 was released on Dec 22, so I have it installed now and am very eager to test the new “reflected” feature.

As you mentioned previously, reflected should allow us to have data regions that span subroutine calls, thus letting us keep more data on the device for a larger portion of the code. This should also help avoid having to do manual inlining.

I have the program below, in which I try to multiply matrices A and B several times while loading them into the GPU’s memory only once. The ultimate goal is to reduce data copying between the CPU and GPU using “reflected”.

I am just trying to figure out how to do it in a simple program before implementing it in my actual application.

I tried following an example you posted earlier:
http://www.pgroup.com/userforum/viewtopic.php?t=2202&postdays=0&postorder=asc&start=10

However, I get the error below regarding “EventSynchronize” when I run my program:

[sindimo@slcb100 working-fortran-example-with-gpu]$ /usr/local/pgi11/linux86-64/11.0/bin/pgfortran -fast -ta=nvidia,time -Minfo=all,accel -mcmodel=medium -Minline reflected.f
main:
     12, Loop not vectorized/parallelized: contains call
     17, Loop not vectorized/parallelized: contains call
     23, Loop not vectorized/parallelized: contains call
     24, Generating copyout(c(:,:))
         Generating copyin(b(:,:))
         Generating copyin(a(:,:))
mm:
     40, Generating reflected(b(:,:))
         Generating reflected(a(:,:))
     45, Generating copyout(c(1:10000,1:10000))
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
     46, Loop is parallelizable
     47, Loop is parallelizable
         Accelerator kernel generated
         46, !$acc do parallel, vector(16) ! blockidx%y threadidx%y
         47, !$acc do parallel, vector(16) ! blockidx%x threadidx%x
             CC 1.3 : 8 registers; 32 shared, 8 constant, 0 local memory bytes; 100% occupancy
             CC 2.0 : 10 registers; 8 shared, 40 constant, 0 local memory bytes; 100% occupancy
     50, Loop carried reuse of 'c' prevents parallelization
     51, Loop is parallelizable
         Accelerator kernel generated
         46, !$acc do parallel, vector(16) ! blockidx%y threadidx%y
         50, !$acc do seq(16)
             Cached references to size [16x16] block of 'a'
             Cached references to size [16x16] block of 'b'
         51, !$acc do parallel, vector(16) ! blockidx%x threadidx%x
             Using register for 'c'
             CC 1.3 : 25 registers; 4400 shared, 24 constant, 0 local memory bytes; 50% occupancy
             CC 2.0 : 35 registers; 4360 shared, 60 constant, 0 local memory bytes; 50% occupancy

[sindimo@slcb100 working-fortran-example-with-gpu]$ ./a.out
call to EventSynchronize returned error 700: Launch failed
CUDA driver version: 3010

Accelerator Kernel Timing data
reflected.f
  mm
    45: region entered 1 time
        time(us): init=1
        47: kernel launched 1 times
            grid: [625x625]  block: [16x16]
            time(us): total=16549 max=16549 min=16549 avg=16549
        51: kernel launched 1 times
            grid: [625x625]  block: [16x16]
            time(us): total=0 max=0 min=0 avg=0
reflected.f
  main
    24: region entered 1 time
        time(us): init=1361723
                  data=566929

If I comment out the data region and the reflected directive, it works fine:

[sindimo@slcb100 working-fortran-example-with-gpu]$ ./a.out
 time for C(        10000 ,        10000 ) = A(        10000 ,        10000 
 ) B(        10000 ,        10000 ) is    38.48516      s
 time for C(        10000 ,        10000 ) = A(        10000 ,        10000 
 ) B(        10000 ,        10000 ) is    36.61761      s
 time for C(        10000 ,        10000 ) = A(        10000 ,        10000 
 ) B(        10000 ,        10000 ) is    36.61651      s

Accelerator Kernel Timing data
reflected.f
  main
    25: region entered 3 times
        time(us): total=111719220 init=1517525 region=110201695
                  kernels=108132453 data=2010386
        w/o init: total=110201695 max=36967588 min=36616499 avg=36733898
        25: kernel launched 6 times
            grid: [625x625]  block: [16x16]
            time(us): total=108132453 max=36028300 min=16542 avg=18022075

This is the code; I am not sure why it is not working when using the reflected directive:

[sindimo@slcb100 working-fortran-example-with-gpu]$ cat reflected.f 
         program main

         use accel_lib

         integer dim1, dim2, dim3
         parameter (dim1 = 10000, dim2 = 10000, dim3 = 10000)
         double precision A(dim1, dim2), B(dim2, dim3), C(dim1, dim3)

              !populate 2 random matrices
              call srand(86456)
                do i = 1, dim1
                do j = 1, dim2
                  A(i, j) = rand()
               enddo
               enddo
               do i = 1, dim2
               do j = 1, dim3
               B(i, j) = rand()
               enddo
               enddo

           !Trying to multiply the 2 matrices several times (only load them once into the GPU memory)
!$acc data region copyin(A,B) copyout(C)
           do i = 1, 3
             call MM(A,B,C)
           enddo
!$acc end data region

         end program main


         subroutine MM (A,B,C) 
         integer dim1, dim2, dim3
         parameter (dim1 = 10000, dim2 = 10000, dim3 = 10000)
         double precision A(dim1, dim2), B(dim2, dim3), C(dim1, dim3)
         real start, finish

!$acc reflected(A,B)     

      call cpu_time(start)


!$acc region
        do j = 1, dim3
        do i = 1, dim1
          C(i, j) = 0
        enddo
        do k = 1, dim2
          do i = 1, dim1
            C(i, j) = C(i, j) + A(i, k)*B(k, j)
          enddo
        enddo
       enddo
!$acc end region


      call cpu_time(finish)

      print *,'time for C(',dim1,',',dim3,') = A(',dim1,',',dim2,') B(',
     1dim2,',',dim3,') is',finish - start,' s'
     
      end subroutine MM

Many thanks for your help, I really appreciate it.

Mohamad Sindi

OK, I think I figured out what the issue was and got it working.

I had to make the subroutine part of a module and then use that module in the main program.

This is a run comparison between using “reflected” and not (matrices A and B get multiplied 10 times):

#With reflected (data movement is around 3.5 seconds)

[sindimo@slcb100 working-fortran-example-with-gpu]$ ./a.out
 time for C(        10000 ,        10000 ) = A(        10000 ,        10000 
 ) B(        10000 ,        10000 ) is    36.60642      s
 time for C(        10000 ,        10000 ) = A(        10000 ,        10000 
 ) B(        10000 ,        10000 ) is    36.32226      s
 time for C(        10000 ,        10000 ) = A(        10000 ,        10000 
 ) B(        10000 ,        10000 ) is    36.33978      s
 time for C(        10000 ,        10000 ) = A(        10000 ,        10000 
 ) B(        10000 ,        10000 ) is    36.31944      s
 time for C(        10000 ,        10000 ) = A(        10000 ,        10000 
 ) B(        10000 ,        10000 ) is    36.32152      s
 time for C(        10000 ,        10000 ) = A(        10000 ,        10000 
 ) B(        10000 ,        10000 ) is    36.34007      s
 time for C(        10000 ,        10000 ) = A(        10000 ,        10000 
 ) B(        10000 ,        10000 ) is    36.32129      s
 time for C(        10000 ,        10000 ) = A(        10000 ,        10000 
 ) B(        10000 ,        10000 ) is    36.32207      s
 time for C(        10000 ,        10000 ) = A(        10000 ,        10000 
 ) B(        10000 ,        10000 ) is    36.33026      s
 time for C(        10000 ,        10000 ) = A(        10000 ,        10000 
 ) B(        10000 ,        10000 ) is    36.32959      s

Accelerator Kernel Timing data
reflected.f
  mm
    46: region entered 10 times
        time(us): total=363552644 init=4 region=363552640
                  kernels=360440121 data=3037902
        w/o init: total=363552640 max=36606408 min=36319440 avg=36355264
        48: kernel launched 10 times
            grid: [625x625]  block: [16x16]
            time(us): total=165389 max=16550 min=16533 avg=16538
        52: kernel launched 10 times
            grid: [625x625]  block: [16x16]
            time(us): total=360274732 max=36028468 min=36025824 avg=36027473
reflected.f
  main
    23: region entered 1 time
        time(us): total=365187413 init=1079992 region=364107421
                  data=541007
        w/o init: total=364107421 max=364107421 min=364107421 avg=364107421

#Without reflected (data movement is around 8.5 seconds)

[sindimo@slcb100 working-fortran-example-with-gpu]$ ./a.out
 time for C(        10000 ,        10000 ) = A(        10000 ,        10000 
 ) B(        10000 ,        10000 ) is    38.23182      s
 time for C(        10000 ,        10000 ) = A(        10000 ,        10000 
 ) B(        10000 ,        10000 ) is    36.88260      s
 time for C(        10000 ,        10000 ) = A(        10000 ,        10000 
 ) B(        10000 ,        10000 ) is    36.89074      s
 time for C(        10000 ,        10000 ) = A(        10000 ,        10000 
 ) B(        10000 ,        10000 ) is    36.88908      s
 time for C(        10000 ,        10000 ) = A(        10000 ,        10000 
 ) B(        10000 ,        10000 ) is    36.87273      s
 time for C(        10000 ,        10000 ) = A(        10000 ,        10000 
 ) B(        10000 ,        10000 ) is    36.89082      s
 time for C(        10000 ,        10000 ) = A(        10000 ,        10000 
 ) B(        10000 ,        10000 ) is    36.89038      s
 time for C(        10000 ,        10000 ) = A(        10000 ,        10000 
 ) B(        10000 ,        10000 ) is    36.89151      s
 time for C(        10000 ,        10000 ) = A(        10000 ,        10000 
 ) B(        10000 ,        10000 ) is    36.89142      s
 time for C(        10000 ,        10000 ) = A(        10000 ,        10000 
 ) B(        10000 ,        10000 ) is    36.88925      s

Accelerator Kernel Timing data
reflected.f
  main
    25: region entered 10 times
        time(us): total=370220259 init=1085643 region=369134616
                  kernels=360431990 data=8507312
        w/o init: total=369134616 max=37146134 min=36872727 avg=36913461
        25: kernel launched 20 times
            grid: [625x625]  block: [16x16]
            time(us): total=360431990 max=36027676 min=16539 avg=18021599

So 3.5 vs. 8.5 seconds is around a 58% cut in data-movement time.

Here’s the code if anyone else is interested to look at it:

[sindimo@slcb100 working-fortran-example-with-gpu]$ cat reflected.f 
         program main
         use myModule
         use accel_lib

         integer dim1, dim2, dim3, seed
         parameter (dim1 = 10000, dim2 = 10000, dim3 = 10000)
         double precision A(dim1, dim2), B(dim2, dim3), C(dim1, dim3)
         
              !populate 2 random matrices
                seed=7654321
                do i = 1, dim1
                do j = 1, dim2
                  A(i, j) = ran(seed)
               enddo
               enddo
               do i = 1, dim2
               do j = 1, dim3
               B(i, j) = ran(seed)
               enddo
               enddo

           !Trying to multiply the 2 matrices several times (only load them once into the GPU memory)
!$acc data region copyin(A,B) 
           do i = 1, 10
             call MM(A,B,C)
           enddo
!$acc end data region

         end program main


         module myModule
         contains
         subroutine MM (X,Y,Z) 
         integer dim1, dim2, dim3
         parameter (dim1 = 10000, dim2 = 10000, dim3 = 10000)
         double precision X(dim1, dim2), Y(dim2, dim3), Z(dim1, dim3)
         real start, finish

!$acc reflected(X,Y)     

      call cpu_time(start)


!$acc region
        do j = 1, dim3
        do i = 1, dim1
          Z(i, j) = 0
        enddo
        do k = 1, dim2
          do i = 1, dim1
            Z(i, j) = Z(i, j) + X(i, k)*Y(k, j)
          enddo
        enddo
       enddo
!$acc end region


      call cpu_time(finish)

      print *,'time for C(',dim1,',',dim3,') = A(',dim1,',',dim2,') B(',
     1dim2,',',dim3,') is',finish - start,' s'
     
      end subroutine MM
      end module myModule

I hope others find this useful.

Mohamad Sindi