Matrix reduction using CUDA Fortran and GPU

Hi all,

I have this short program which reduces a matrix A(nx,ny) to a scalar ak:

integer :: nx,ny,i,j,ak
integer, allocatable, dimension (:,:) :: A

nx = 306
ny = 306
ak = 0

allocate (a(nx,ny))
A(1:nx,1:ny) = 2

do i = 1, nx
do j = 1, ny
ak = ak + A(i,j)
enddo
enddo

What is the easiest way to reduce A on the GPU?

thanks,
Dolf

Hi Dolf,

Just put an OpenACC kernels directive on the outer loop. The compiler is smart enough to auto-detect reductions and generate the appropriate code. If you want to make it explicit, you can also use the “reduction” clause.

% cat reduce.f90 
integer :: nx,ny,i,j,ak
integer, allocatable, dimension (:,:) :: A

nx = 306
ny = 306
ak = 0

allocate (a(nx,ny))
A(1:nx,1:ny) = 2

!$acc kernels loop
do i = 1, nx
do j = 1, ny
ak = ak + A(i,j)
enddo
enddo 

end
% pgf90 -acc -Minfo=accel reduce.f90 
MAIN:
     11, Generating present_or_copyin(a(1:306,1:306))
         Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
     12, Loop is parallelizable
     13, Loop is parallelizable
         Accelerator kernel generated
         12, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
         13, !$acc loop gang ! blockidx%y
             CC 1.0 : 11 registers; 48 shared, 36 constant, 0 local memory bytes
             CC 2.0 : 16 registers; 0 shared, 64 constant, 0 local memory bytes
         14, Sum reduction generated for ak
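For reference, the explicit form of the same loop, using the reduction clause Mat mentions instead of relying on auto-detection, would look like this:

```fortran
! Explicit reduction clause; the compiler would otherwise
! auto-detect the sum reduction on ak, as shown above.
!$acc kernels loop reduction(+:ak)
do i = 1, nx
  do j = 1, ny
    ak = ak + A(i,j)
  enddo
enddo
```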

Hope this helps,
Mat

Hi Mat,

This is very helpful, but where can I read up on that? I have not seen those clauses in any book or online forum.

many thanks.
Dolf

You can find information on OpenACC either at the OpenACC Community or on PGI’s OpenACC page. Also, there are many posts about OpenACC in this forum and several articles in the PGInsider Newsletter.

  • Mat

Awesome!

I just went to OpenACC and downloaded the quick reference. Does the PGI compiler have openacc_lib.h?

If I want to include acc clauses, what do I need to do first to compile the project?

Dolf

Does the PGI compiler have openacc_lib.h?

Yes, but it’s a C header file. For Fortran, you would use the openacc_lib module, i.e. “use openacc_lib”.

What do I need to do first to compile the project?

You need to add the ACC directives around the loop(s) you wish to accelerate and then compile with “-acc -Minfo=accel”.

This article walks through the basic steps of porting a code using directives. Account Login | PGI. I wrote it using the PGI Accelerator Model syntax. But since OpenACC is based in large part on the PGI Accelerator Model the process is the same.

  • Mat

When I put “use openacc_lib” in the program, it does not compile; it says it is unable to open MODULE file openacc_lib.mod.

What directory do I need to add to my VS 2010 project’s “include directory”?

Dolf

My fault. It’s “use openacc”, not “use openacc_lib”.

  • Mat

Still not able to open module openacc.mod.

I am using Visual Studio 2010 with PGI compiler 12. What should I configure to make it find the openacc module?

thanks,
Dolf

Which version are you using? OpenACC was added in the 12.6 release.

  • Mat

What I did: I added the flags you used on the command line to the Visual Studio project properties. It gave me the same parallel messages and compiled successfully, but I had to remove “use openacc”.

Is that the correct way of doing it?

Also, I cannot get parallelization for a (3000,3000) matrix. Do you know why?


Dolf

Hi Dolf,

Within PVF, you’d set the properties option “Fortran->Target Accelerator->Target NVIDIA Accelerator” to ‘Yes’. From the command line, you would compile using the flag “-acc” and/or “-ta=nvidia”. Either is fine.

but I had to remove “use openacc”

This doesn’t make sense. This module is in the standard PGI include directory, so it should be found by the compiler:

PGI$ ls /c/PROGRA~1/PGI/win64/12.10/include/openacc.mod
/c/PROGRA~1/PGI/win64/12.10/include/openacc.mod



I cannot get parallelization for a (3000,3000) matrix. Do you know why?

No, the size shouldn’t prevent parallelization. Can you post an example?

  • Mat

Yeah, sure, here is the program:

!
! Fortran Console Application
! Generated by PGI Visual Fortran(R)
! 12/10/2012 3:02:36 PM
!

program openacc1

!use openacc
implicit none

integer :: nx,ny,i,j,ak
integer, allocatable, dimension (:,:) :: A
integer :: start_time(8), end_time(8)
CHARACTER (LEN = 12) REAL_CLOCK (3)
CALL DATE_AND_TIME (REAL_CLOCK (1), REAL_CLOCK (2),&
REAL_CLOCK (3), start_time )

nx = 3000
ny = 3000
ak = 0

allocate (a(nx,ny))
A(1:nx,1:ny) = 2
!$acc kernels loop
do i = 1, nx
do j = 1, ny
ak = ak + A(i,j)
enddo
enddo
write(*,*) 'ak = ', ak
write(*,*)
CALL DATE_AND_TIME (REAL_CLOCK (1), REAL_CLOCK (2),&
REAL_CLOCK (3), end_time )
write(*,10) 'PROGRAM STARTED AT: ', START_TIME(5), START_TIME(6),&
START_TIME(7), START_TIME(8)
write(*,15) 'PROGRAM ENDED AT: ', end_time(5), end_time(6), &
end_time(7),end_time(8)
continue
deallocate(a)
10 format(1X, A, I2.2, ':', I2.2, ':', I2.2, ':', I3.3)
15 format(1X, A, I2.2, ':', I2.2, ':', I2.2, ':', I3.3)

end program openacc1


Fortran->target accelerators->Target NVIDIA Accelerators = yes
Fortran->Command Line → -acc -Minfo=accel

The output after I compile (no “use openacc”):

------ Rebuild All started: Project: OpenACC1, Configuration: Release x64 ------
Deleting intermediate and output files for project ‘OpenACC1’, configuration ‘Release’
Compiling Project …
OpenACC1.f90
openacc1:
25, Generating copyin(a(1:3000,1:3000))
Generating compute capability 1.0 binary
Generating compute capability 2.0 binary
26, Loop is parallelizable
27, Loop is parallelizable
Accelerator kernel generated
26, !$acc loop gang, vector(8) ! blockidx%x threadidx%x
27, !$acc loop gang, vector(8) ! blockidx%y threadidx%y
CC 1.0 : 8 registers; 304 shared, 32 constant, 0 local memory bytes; 66% occupancy
CC 2.0 : 10 registers; 264 shared, 64 constant, 0 local memory bytes; 33% occupancy
28, Sum reduction generated for ak
Linking…
OpenACC1 build succeeded.

Build log was saved at “file://D:\Cuda Dev\OpenACC1\x64\Release\BuildLog.htm”

========== Rebuild All: 1 succeeded, 0 failed, 0 skipped ==========

Still can’t use “use openacc”; don’t know why.

Output when I execute the code (using !$acc before the do loops):
ak = 18000000

PROGRAM STARTED AT: 11:56:30:675
PROGRAM ENDED AT: 11:56:30:776
Press any key to continue . . .

time of execution = 101 msec

Output when I execute the code (without !$acc before the do loops):
ak = 18000000

PROGRAM STARTED AT: 11:58:14:167
PROGRAM ENDED AT: 11:58:14:179
Press any key to continue . . .
time of execution = 12 msec !!

So this means the time using the parallel acc loops is longer than using the CPU for the loops! Does that make sense to you? I think I am doing something wrong here.

Dolf

still can’t use “use openacc”, don’t know why.

Sorry, I don’t know either. It works fine for me.

this means the time using parallel acc loops is longer than using the CPU for the loops! Does that make sense to you?

It makes perfect sense. Accelerator code does have some overhead in launching kernels, initializing the device, and copying data to the device. Also, performing a parallel reduction requires a partial reduction step followed by a second kernel launch to perform the final reduction.

Now, if I change your code so these loops are executed 10,000 times, the overhead gets amortized across the multiple invocations and we begin to see speed-up. Note that the call to foo is needed, or else the compiler will optimize away the outer iteration loop. Also, this example was run on Linux, but the results should be similar on Windows.

% cat test.f90 
program openacc1

use openacc
implicit none

integer :: nx,ny,i,j,ak,it,ak2,foo
integer, allocatable, dimension (:,:) :: A
integer :: start_time(8), end_time(8)
CHARACTER (LEN = 12) REAL_CLOCK (3)
CALL DATE_AND_TIME (REAL_CLOCK (1), REAL_CLOCK (2),&
REAL_CLOCK (3), start_time )

nx = 3000
ny = 3000
ak = 0

allocate (a(nx,ny))
A(1:nx,1:ny) = 2
!$acc data copyin(A)
do it=1,10000
!$acc kernels loop
do i = 1, nx
do j = 1, ny
ak = ak + A(i,j)
enddo
enddo
ak2 = foo(ak)
enddo
!$acc end data

write(*,*) 'ak = ' ,ak2
write(*,*)
CALL DATE_AND_TIME (REAL_CLOCK (1), REAL_CLOCK (2),&
REAL_CLOCK (3), end_time )
write(*,10) 'PROGRAM STARTED AT: ', START_TIME(5), START_TIME(6),&
START_TIME(7), START_TIME(8)
write(*,15) 'PROGRAM ENDED AT: ', end_time(5), end_time(6), &
end_time(7),end_time(8)
continue
deallocate(a)
10 format(1X, A, I2.2, ':', I2.2, ':', I2.2, ':', I3.3)
15 format(1X, A, I2.2, ':', I2.2, ':', I2.2, ':', I3.3)

end program openacc1 

function foo(ak)
  integer foo
  integer ak, tmp
  tmp=ak
  ak = 0
  foo=tmp
end function foo

% pgf90 test.f90 -fast -o cpu.out 
% pgf90 test.f90 -fast -acc -o gpu.out
% time cpu.out
 ak =      18000000
 
 PROGRAM STARTED AT: 13:03:18:798
 PROGRAM ENDED AT: 13:03:53:390
34.302u 0.066s 0:34.59 99.3%	0+0k 0+0io 0pf+0w
% setenv PGI_ACC_TIME 1
% time gpu.out
 ak =      18000000
 
 PROGRAM STARTED AT: 13:03:58:855
 PROGRAM ENDED AT: 13:04:12:457

Accelerator Kernel Timing data
  openacc1
    27: region entered 10000 times
        time(us): total=13,511,799 init=549 region=13,511,250
                  kernels=10,656,934
        w/o init: total=13,511,250 max=110,737 min=1,317 avg=1,351
        29: kernel launched 10000 times
            grid: [24x3000]  block: [128]
            time(us): total=9,698,070 max=1,463 min=966 avg=969
        30: kernel launched 10000 times
            grid: [1]  block: [256]
            time(us): total=958,864 max=400 min=95 avg=95
test.f90
  openacc1
    25: region entered 1 time
        time(us): total=13,593,179 init=69,314 region=13,523,865
                  data=9,027
        w/o init: total=13,523,865 max=13,523,865 min=13,523,865 avg=13,523,865
5.495u 7.962s 0:13.64 98.6%	0+0k 0+56io 0pf+0w

Granted, the speed-up here is relatively small, but as you increase the amount of computation, you typically increase the speed-up.

  • Mat

I see now!
I am using PGI 12.3 only; could this be the problem?

How can I compile with acc using 12.3, since acc was only added in 12.6 and above?

thanks,
Dolf

I am using PGI 12.3 only; could this be the problem?

Yep, that’s it. 12.3 did have a Linux Beta version of OpenACC, but 12.6 was the first full release and the first release to support Windows.

How can I compile with acc using 12.3, since acc was only added in 12.6 and above?

You can use the PGI Accelerator Model instead. The syntax is mostly the same:

% cat test.f90 
!
! Fortran Console Application
! Generated by PGI Visual Fortran(R)
! 12/10/2012 3:02:36 PM
!

program pgiacc1

use accel_lib
implicit none

integer :: nx,ny,i,j,ak,it,ak2,foo
integer, allocatable, dimension (:,:) :: A
integer :: start_time(8), end_time(8)
CHARACTER (LEN = 12) REAL_CLOCK (3)
CALL DATE_AND_TIME (REAL_CLOCK (1), REAL_CLOCK (2),&
REAL_CLOCK (3), start_time )

nx = 3000
ny = 3000
ak = 0

allocate (a(nx,ny))
A(1:nx,1:ny) = 2
!$acc data region copyin(A)
do it=1,10000
!$acc region  
do i = 1, nx
do j = 1, ny
ak = ak + A(i,j)
enddo
enddo
!$acc end region
ak2 = foo(ak)
enddo
!$acc end data region

write(*,*) 'ak = ' ,ak2
write(*,*)
CALL DATE_AND_TIME (REAL_CLOCK (1), REAL_CLOCK (2),&
REAL_CLOCK (3), end_time )
write(*,10) 'PROGRAM STARTED AT: ', START_TIME(5), START_TIME(6),&
START_TIME(7), START_TIME(8)
write(*,15) 'PROGRAM ENDED AT: ', end_time(5), end_time(6), &
end_time(7),end_time(8)
continue
deallocate(a)
10 format(1X, A, I2.2, ':', I2.2, ':', I2.2, ':', I3.3)
15 format(1X, A, I2.2, ':', I2.2, ':', I2.2, ':', I3.3)

end program pgiacc1 

function foo(ak)
  integer foo
  integer ak, tmp
  tmp=ak
  ak = 0
  foo=tmp
end function foo

It works fine for me now, but I am still not impressed with the speed.
If I have to do it using kernel subroutines (no acc), how would I do the reduction using this method?

Also, I want to compile my code to run on a Linux machine. Is the PGI Fortran compiler free for Linux?
I just downloaded the tar.gz file (Fortran Workstation for Linux). Is it a straightforward installation process, or do I have to do something special?

thanks.
Dolf

It works fine for me now, but I am still not impressed with the speed.

A reduction is not the best algorithm for an accelerator. You can get some speed-up if the problem is large enough, but not a huge amount. Using a reduction in combination with a larger data-parallel algorithm is usually fine.
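As a sketch of what that combination might look like (a hypothetical example, not taken from Dolf’s code): do the bulk data-parallel work and the reduction inside one data region, so the array stays resident on the device between the two kernels:

```fortran
!$acc data copy(A)
! Bulk data-parallel work dominates the runtime...
!$acc kernels loop
do i = 1, nx
  do j = 1, ny
    A(i,j) = A(i,j)*A(i,j) + 1
  enddo
enddo
! ...so the cost of the reduction that follows is amortized,
! and A never makes a round trip back to the host in between.
!$acc kernels loop reduction(+:ak)
do i = 1, nx
  do j = 1, ny
    ak = ak + A(i,j)
  enddo
enddo
!$acc end data
```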

If I have to do it using kernel subroutines (no acc), how would I do the reduction using this method?

Doing this algorithm in CUDA is fairly difficult. I should say it’s easy to do poorly, but hard to do well.

I have an example of a simple (i.e. poor) partial sum reduction in this article: Account Login | PGI

A better one can be found in the source code from this article: Account Login | PGI
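For a flavor of what the CUDA Fortran version involves, here is a minimal partial-sum kernel sketch in the spirit of the simple version above (the module and kernel names are made up for illustration; see the articles for complete, tuned implementations):

```fortran
module reduce_mod
  use cudafor
contains
  ! Each thread block reduces its portion of a(1:n) into
  ! partial(blockIdx%x); the host then sums the small partial array.
  attributes(global) subroutine partial_sum(a, partial, n)
    integer, device :: a(*), partial(*)
    integer, value  :: n
    integer, shared :: cache(256)   ! assumes blockDim%x == 256
    integer :: i, tid, stride

    tid = threadIdx%x
    cache(tid) = 0
    i = (blockIdx%x - 1)*blockDim%x + tid
    do while (i <= n)               ! grid-stride loop over the input
      cache(tid) = cache(tid) + a(i)
      i = i + blockDim%x*gridDim%x
    end do
    call syncthreads()

    stride = blockDim%x/2           ! tree reduction in shared memory
    do while (stride >= 1)
      if (tid <= stride) cache(tid) = cache(tid) + cache(tid + stride)
      call syncthreads()
      stride = stride/2
    end do
    if (tid == 1) partial(blockIdx%x) = cache(1)
  end subroutine partial_sum
end module reduce_mod
```

The host would launch this with something like “call partial_sum<<<nblocks,256>>>(a_d, partial_d, n)”, copy partial back, and finish the sum on the CPU. This is the “easy to do poorly” version: a well-tuned reduction also deals with warp-level unrolling and avoiding shared-memory bank conflicts.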

Also, I want to compile my code to run on a Linux machine. Is the PGI Fortran compiler free for Linux?

No. We do offer cross-platform licenses; however, it looks like your license is for Windows only. Please contact PGI Sales (sales@pgroup.com) for more information on upgrading your license.

I just downloaded the tar.gz file (Fortran Workstation for Linux). Is it a straightforward installation process, or do I have to do something special?

It’s fairly simple; the only complexity is getting the license manager running. See the PGI Installation Guide for details.

  • Mat

Hi Mat,

I have installed PGI for Linux, created trial license keys at pgroup.com as the installation document requested, and turned the license server on, but I still can’t compile a simple Fortran code using pgf90 acc1.f90 -o acc.exe.
It gives me the following error:
[root@cmlds5 bin]# pgf90 acc1.f90 -o acc.exe
pgf90-Error-Please run makelocalrc to complete your installation

Anything I missed?

pgf90-Error-Please run makelocalrc to complete your installation

Anything I missed?

This means that your installation failed. Which OS are you using and were there any errors during installation?

  • Mat