Performance of PGI OpenACC directives

sjz · January 9, 2013, 4:48pm

We are trying to use the PGI Fortran compiler with OpenACC
to port climate and weather physics models. We cannot get the time spent in a region using OpenACC directives
below 0.1 second, even with a simple vector-add code and small arrays. Is this the typical time needed to set up
communication between the CPU and the GPU?

Thanks,

sjz

Here is the hardware specification for a node:

2 Hex-core 2.8 GHz Intel Xeon Westmere Processors (4 flops per clock)
48 GB of memory per node
2 NVidia M2070 GPUs each connected through a dedicated x16 PCIe Gen2 connection
Interconnect: Infiniband QDR

Here are the compilation and running environment and commands

module load comp/pgi-12.4.0
module load other/mpi/openmpi/1.4.5-pgi-12.4.0

pgf90 -o vecadd_openacc -acc -ta=nvidia,fastmath vecadd_openacc.F90

pgcudainit &

Here are the performance numbers:

szhou@warp01a002:~/test_gpu/acc_cuda_fortran/acc> ./vecadd_openacc 1000
110842 microseconds on gpu
2 microseconds on host
0 errors found
szhou@warp01a002:~/test_gpu/acc_cuda_fortran/acc> ./vecadd_openacc 10000
105249 microseconds on gpu
22 microseconds on host
0 errors found
szhou@warp01a002:~/test_gpu/acc_cuda_fortran/acc> ./vecadd_openacc 100000
110235 microseconds on gpu
232 microseconds on host
0 errors found
szhou@warp01a002:~/test_gpu/acc_cuda_fortran/acc> ./vecadd_openacc 1000000
110693 microseconds on gpu
2206 microseconds on host
0 errors found

Here are the source codes:

szhou@discover25:~/test_gpu/acc_cuda_fortran/acc> cat vecadd_openacc.F90
module vecaddmod

  implicit none

contains

  subroutine vecaddgpu( r, a, b, n )
    real, dimension(:) :: r, a, b
    integer :: n
    integer :: i

    !$acc kernels do copyin(a(1:n),b(1:n)) copyout(r(1:n)) gang vector(256)
    do i = 1, n
      r(i) = a(i) + b(i)
      ! r(i) = sqrt(a(i)) + b(i)*a(i)+sin(a(i))*exp(-b(i))*sqrt(a(i)+b(i))
    enddo

  end subroutine

  subroutine vecaddcpu( r, a, b, n )
    real, dimension(:) :: r, a, b
    integer :: n
    integer :: i

    do i = 1, n
      r(i) = a(i) + b(i)
      ! r(i) = sqrt(a(i)) + b(i)*a(i)+sin(a(i))*exp(-b(i))*sqrt(a(i)+b(i))
    enddo

  end subroutine

end module

program main

  use vecaddmod

  implicit none

  integer :: n, i, errs, argcount
  integer :: cpu_s, cpu_e, gpu_s, gpu_e
  real, dimension(:), allocatable :: a, b, r, e
  character*10 :: arg1

  argcount = command_argument_count()
  n = 1000000 ! default value
  if( argcount >= 1 )then
    call get_command_argument( 1, arg1 )
    read( arg1, '(i)' ) n
    if( n <= 0 ) n = 100000
  endif

  allocate( a(n), b(n), r(n), e(n) )
  do i = 1, n
    a(i) = i
    b(i) = 1000*i
  enddo

  ! compute on the GPU
  call system_clock (count=gpu_s)
  call vecaddgpu( r, a, b, n )
  call system_clock (count=gpu_e)

  ! compute on the host to compare
  !
  call system_clock (count=cpu_s)
  call vecaddcpu( e, a, b, n )
  call system_clock (count=cpu_e)

  print *, gpu_e - gpu_s, ' microseconds on gpu'
  print *, cpu_e - cpu_s, ' microseconds on host'

  ! compare results
  errs = 0
  do i = 1, n
    if( abs((r(i) - e(i))/ e(i)) > 1.1 )then
      errs = errs + 1
    endif
  enddo

  print *, errs, ' errors found'
  if( errs ) call exit(errs)

end program


MatColgrove · January 9, 2013, 6:01pm

Hi sjz,

Is this a typical time needed for setting up the communication between cpu and GPU?

Typically there is a ~1 second per device warm-up cost on Linux, but this can be removed by running pgcudainit to hold open the devices (which you use here).

Next there is ~0.1 second cost to establish a context between the host and the device.

Finally, there is some overhead in copying the kernel code itself over to the device, as well as any arguments. This cost varies depending upon the kernel.

What you can do here is call “acc_init” before your timers to remove the initialization time. It’s still part of your overall time, but hopefully in a larger application this overhead would be meaningless.

Hope this helps,
Mat

% cat vecadd_openacc.F90 
module vecaddmod

implicit none

contains

subroutine vecaddgpu( r, a, b, n )
real, dimension(:) :: r, a, b
integer :: n
integer :: i

!$acc kernels do copyin(a(1:n),b(1:n)) copyout(r(1:n)) gang vector(256)
do i = 1, n
r(i) = a(i) + b(i)
! r(i) = sqrt(a(i)) + b(i)*a(i)+sin(a(i))*exp(-b(i))*sqrt(a(i)+b(i))
enddo

end subroutine

subroutine vecaddcpu( r, a, b, n )
real, dimension(:) :: r, a, b
integer :: n
integer :: i

do i = 1, n
r(i) = a(i) + b(i)
! r(i) = sqrt(a(i)) + b(i)*a(i)+sin(a(i))*exp(-b(i))*sqrt(a(i)+b(i))
enddo

end subroutine

end module

program main

use vecaddmod
use openacc

implicit none

integer :: n, i, errs, argcount
integer :: cpu_s, cpu_e, gpu_s, gpu_e
real, dimension(:), allocatable :: a, b, r, e
character*10 :: arg1

argcount = command_argument_count()
n = 1000000 ! default value
if( argcount >= 1 )then
call get_command_argument( 1, arg1 )
read( arg1, '(i)' ) n
if( n <= 0 ) n = 100000
endif

call acc_init(acc_get_device_type())

allocate( a(n), b(n), r(n), e(n) )
do i = 1, n
a(i) = i
b(i) = 1000*i
enddo

! compute on the GPU
call system_clock (count=gpu_s)
call vecaddgpu( r, a, b, n )
call system_clock (count=gpu_e)

! compute on the host to compare
!
call system_clock (count=cpu_s)
call vecaddcpu( e, a, b, n )
call system_clock (count=cpu_e)

print *, gpu_e - gpu_s, ' microseconds on gpu'
print *, cpu_e - cpu_s, ' microseconds on host'

! compare results
errs = 0
do i = 1, n
if( abs((r(i) - e(i))/ e(i)) > 1.1 )then
errs = errs + 1
endif
enddo

print *, errs, ' errors found'
if( errs ) call exit(errs)

end program
% pgf90 -acc -ta=nvidia,4.2 -Minfo=accel vecadd_openacc.F90 -o vecadd_openacc
vecaddgpu:
     12, Generating present_or_copyout(r(:n))
         Generating present_or_copyin(b(:n))
         Generating present_or_copyin(a(:n))
         Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
     13, Loop is parallelizable
         Accelerator kernel generated
         13, !$acc loop gang, vector(256) ! blockidx%x threadidx%x
             CC 1.0 : 9 registers; 64 shared, 0 constant, 0 local memory bytes
             CC 2.0 : 14 registers; 0 shared, 80 constant, 0 local memory bytes
% setenv PGI_ACC_TIME 1
% ./vecadd_openacc 100000
         8169  microseconds on gpu
          195  microseconds on host
            0  errors found

Accelerator Kernel Timing data
vecadd_openacc.F90
  vecaddgpu
    12: region entered 1 time
        time(us): total=8,165
                  kernels=33 data=2,598
        13: kernel launched 1 times
            grid: [391]  block: [256]
            time(us): total=33 max=33 min=33 avg=33
acc_init.c
  acc_init
    50: region entered 1 time
        time(us): init=101,173
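
A side note on the timers: the listings above print raw system_clock ticks and label them as microseconds, which is only accurate when the clock happens to tick one million times per second. The following is a minimal sketch, using only standard Fortran, of how the tick rate could be queried and the elapsed time converted explicitly. The program name tick_units and the dummy workload are made up for illustration and are not part of the original code.

```fortran
program tick_units

  implicit none

  ! 64-bit integer kind so the tick counter does not wrap quickly
  integer, parameter :: i8 = selected_int_kind(18)

  integer(i8) :: t0, t1, rate
  integer :: i
  double precision :: x, elapsed_us

  ! count_rate returns the number of clock ticks per second
  call system_clock( count_rate=rate )

  call system_clock( count=t0 )
  ! dummy workload (illustrative only)
  x = 0.0d0
  do i = 1, 1000000
    x = x + sqrt( dble(i) )
  enddo
  call system_clock( count=t1 )

  ! convert ticks to microseconds using the queried rate
  elapsed_us = dble(t1 - t0) * 1.0d6 / dble(rate)

  print *, 'ticks per second     :', rate
  print *, 'elapsed microseconds :', elapsed_us
  print *, 'checksum             :', x

end program
```

If the reported tick rate is 1000000, the raw differences printed by the vecadd listings are already microseconds; otherwise they need the conversion shown here.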

sjz · January 10, 2013, 6:58pm

Hi, Mat:

Hi sjz,
Quote:
Is this a typical time needed for setting up the communication between cpu and GPU?
Typically there is a ~1 second per device warm-up cost on Linux, but this can be removed by running pgcudainit to hold open the devices (which you use here).

----> We did this.

Next there is ~0.1 second cost to establish a context between the host and the device.

Finally, there is some overhead in copying the kernel code itself over to the device, as well as any arguments. This cost varies depending upon the kernel.

----> Are these two costs incurred each time this kernel is called? Would they be smaller if we used CUDA Fortran directly? Thanks, SJZ

What you can do here is call “acc_init” before your timers to remove the initialization time. It’s still part of your overall time, but hopefully in a larger application this overhead would be meaningless.

Hope this helps,
Mat

MatColgrove · January 10, 2013, 7:15pm

----> Are these two costs incurred each time this kernel is called? Would they be smaller if we used CUDA Fortran directly?

For the copying of kernels, if the kernel is called multiple times in succession, then the cost to copy the kernel to the device occurs only once. However, if there are many other kernels in between calls, then there is the potential that the kernel code needs to be copied over again.

Arguments will be copied over each time but until the size of your argument list grows to >256 bytes on older devices or >1024 bytes on newer (this is a CUDA limit), it will have very little impact. In order to support larger argument lists, we will wrap the arguments up into a single struct, copy the struct to the device, and then pass a pointer to the struct as an argument. This can have some impact on performance.

One thing to keep in mind is, yes, there is some overhead here but it really is quite small (10-100us). In my opinion, if you are writing kernels where this overhead greatly impacts your performance, then you may not want to be putting these algorithms on an accelerator. Not every algorithm works well on an accelerator.

We only use vecadd in our examples because it’s easy to illustrate the mechanics of OpenACC, but it isn’t really a good algorithm for an accelerator since there’s not enough computation to make it worthwhile.

Mat

sjz · January 11, 2013, 6:35pm

"In order to support larger argument lists, we will wrap the arguments up into a single struct, copy the struct to the device, and then pass a pointer to the struct as an argument. This can have some impact on performance. "

Do you have sample code for this trick?

Thanks

MatColgrove · January 11, 2013, 6:56pm

Do you have sample code for this trick?

Not off hand. This is all done “under the hood” using CUDA and not something exposed at the user level.

Mat

sjz · January 14, 2013, 11:30pm

Hi,

I tried vector addition in PGI CUDA Fortran and saw ~0.1 second of overhead on the first call. The calls after that first one did not show the ~0.1 second cost. So this kernel setup overhead on the first call applies to CUDA Fortran as well as OpenACC. Is that right?

SJZ

MatColgrove · January 14, 2013, 11:56pm

So this kernel setup overhead for the first call is true for cuda fortran as well as openacc. Is that right?

Correct. This is the cost to create the device context which would be the same for CUDA and OpenACC.
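
One way to see this for yourself is to time each call separately. The following is a minimal sketch, assuming the same vector-add loop as earlier in the thread and no prior acc_init call, so that any one-time context creation would be expected to land in the first timed call. The program name first_call_cost and the constants n and ncalls are made up for illustration. Built without -acc, the directive is treated as a comment and the loop simply runs on the host.

```fortran
program first_call_cost

  implicit none

  integer, parameter :: i8 = selected_int_kind(18)
  integer, parameter :: n = 100000      ! array size (illustrative)
  integer, parameter :: ncalls = 5      ! number of timed calls (illustrative)

  real, dimension(:), allocatable :: a, b, r
  integer(i8) :: t0, t1, rate
  integer :: i, k

  allocate( a(n), b(n), r(n) )
  do i = 1, n
    a(i) = i
    b(i) = 1000*i
  enddo

  ! ticks per second, used to convert to microseconds
  call system_clock( count_rate=rate )

  ! time every call on its own; only the first one should carry
  ! one-time setup cost when built for an accelerator
  do k = 1, ncalls
    call system_clock( count=t0 )
    call vecadd( r, a, b, n )
    call system_clock( count=t1 )
    print *, 'call', k, ':', (t1 - t0)*1000000_i8/rate, ' microseconds'
  enddo

contains

  subroutine vecadd( r, a, b, n )
    integer :: n
    real, dimension(n) :: r, a, b
    integer :: i

    !$acc kernels loop copyin(a(1:n),b(1:n)) copyout(r(1:n))
    do i = 1, n
      r(i) = a(i) + b(i)
    enddo

  end subroutine

end program
```

Each call still copies a and b to the device and r back, so the later calls measure data transfer plus kernel time, not kernel time alone.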

Mat

hana · March 6, 2013, 5:07pm

The question is regarding the following comment in this post:

“For the copying of kernels, if the kernel is called multiple times in succession, then the cost to copy the kernel to the device occurs only once. However, if there are many other kernels in between calls, then there is the potential that the kernel code needs to be copied over again.”

Is there any document that provides more detail on the behavior of kernel initializations and how they are copied/treated by CPU/GPU? “many other kernels in between calls”, how many? Can the user control the code and the environment such that the kernel copies reside on the GPU as long as necessary? Is the GPU cache/shared memory holding these copies? What does the architecture look like? To what extent can this environment (and kernel copying) be controlled?

Thanks in advance!

MatColgrove · March 6, 2013, 6:58pm

Is there any document that provides more detail on the behavior of kernel initializations and how they are copied/treated by CPU/GPU?

Nothing from us since they can’t be controlled by the user and it can change depending upon the target device and the underlying tools being used.

“many other kernels in between calls”, how many?

It’s my understanding that if the kernel is in the device’s kernel queue then it doesn’t need to be reinitialized. Though, the length of the queue will vary by device. The arguments to the kernel will still need to be copied over to the device, it’s just the kernel binary itself doesn’t need to be copied.

Can user control the code and the environment such that the kernel copies can reside on GPU as long as necessary?

Not that I’m aware of. There might be something in CUDA to have the kernel be persistent, but I’m not sure.

Is the GPU cache/shared memory holding these copies?

No.

What does the architecture look like?

Which architecture? This article is a few years old, but gives a high-level view of Fermi and Tesla.

To what extent can this environment (and Kernel Copying) be controllable?

The end user can control the size and number of kernels created as well as if the kernels are launched asynchronously. However, you have no control over how kernels are copied to the device.
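
As an illustration of the asynchronous launch mentioned above, here is a minimal sketch using the OpenACC async clause and wait directive. The program name async_sketch, the queue number 1 and the host-side loop are assumptions chosen for the example, not something taken from the original code. Built without -acc, the directives are ignored and both loops run on the host in order.

```fortran
program async_sketch

  implicit none

  integer, parameter :: n = 100000      ! array size (illustrative)
  real, dimension(:), allocatable :: a, b, r, e
  integer :: i, j, errs

  allocate( a(n), b(n), r(n), e(n) )
  do i = 1, n
    a(i) = i
    b(i) = 1000*i
  enddo

  ! launch the kernel on async queue 1; the host does not wait here
  !$acc kernels loop async(1) copyin(a,b) copyout(r)
  do i = 1, n
    r(i) = a(i) + b(i)
  enddo

  ! host work that can overlap with the device work
  do j = 1, n
    e(j) = a(j) + b(j)
  enddo

  ! block until everything queued on async queue 1 has finished,
  ! including the copy of r back to the host
  !$acc wait(1)

  ! r must not be read on the host before the wait above
  errs = count( r /= e )
  print *, errs, ' errors found'

end program
```

This controls when the host waits for the kernel, not how the kernel binary reaches the device, which stays up to the runtime.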

Mat
