We are trying to use the PGI Fortran compiler with OpenACC to port climate and weather physics models. We cannot get the time spent in an OpenACC region below 0.1 second, even with a simple vector-add code and small arrays. Is this a typical time for setting up communication between the CPU and the GPU?

Thanks,

sjz

Here is the hardware specification for a node:

2 hex-core 2.8 GHz Intel Xeon Westmere processors (4 flops per clock)

48 GB of memory per node

2 NVIDIA M2070 GPUs, each connected through a dedicated x16 PCIe Gen2 connection

Interconnect: InfiniBand QDR

Here are the build and run environment and the commands used:

module load comp/pgi-12.4.0

module load other/mpi/openmpi/1.4.5-pgi-12.4.0

pgf90 -o vecadd_openacc -acc -ta=nvidia,fastmath vecadd_openacc.F90

pgcudainit &

Here are the performance numbers:

szhou@warp01a002:~/test_gpu/acc_cuda_fortran/acc> ./vecadd_openacc 1000

110842 microseconds on gpu

2 microseconds on host

0 errors found

szhou@warp01a002:~/test_gpu/acc_cuda_fortran/acc> ./vecadd_openacc 10000

105249 microseconds on gpu

22 microseconds on host

0 errors found

szhou@warp01a002:~/test_gpu/acc_cuda_fortran/acc> ./vecadd_openacc 100000

110235 microseconds on gpu

232 microseconds on host

0 errors found

szhou@warp01a002:~/test_gpu/acc_cuda_fortran/acc> ./vecadd_openacc 1000000

110693 microseconds on gpu

2206 microseconds on host

0 errors found

Here is the source code:

szhou@discover25:~/test_gpu/acc_cuda_fortran/acc> cat vecadd_openacc.F90

module vecaddmod

implicit none

contains

subroutine vecaddgpu( r, a, b, n )

real, dimension(:) :: r, a, b

integer :: n

integer :: i

!$acc kernels do copyin(a(1:n),b(1:n)) copyout(r(1:n)) gang vector(256)

do i = 1, n

r(i) = a(i) + b(i)

! r(i) = sqrt(a(i)) + b(i)*a(i)+sin(a(i))*exp(-b(i))*sqrt(a(i)+b(i))

enddo

end subroutine

subroutine vecaddcpu( r, a, b, n )

real, dimension(:) :: r, a, b

integer :: n

integer :: i

do i = 1, n

r(i) = a(i) + b(i)

! r(i) = sqrt(a(i)) + b(i)*a(i)+sin(a(i))*exp(-b(i))*sqrt(a(i)+b(i))

enddo

end subroutine

end module

program main

use vecaddmod

implicit none

integer :: n, i, errs, argcount

integer :: cpu_s, cpu_e, gpu_s, gpu_e

real, dimension(:), allocatable :: a, b, r, e

character*10 :: arg1

argcount = command_argument_count()

n = 1000000 ! default value

if( argcount >= 1 )then

call get_command_argument( 1, arg1 )

read( arg1, '(i)' ) n

if( n <= 0 ) n = 100000

endif

allocate( a(n), b(n), r(n), e(n) )

do i = 1, n

a(i) = i

b(i) = 1000*i

enddo

! compute on the GPU

call system_clock (count=gpu_s)

call vecaddgpu( r, a, b, n )

call system_clock (count=gpu_e)

! compute on the host to compare

!

call system_clock (count=cpu_s)

call vecaddcpu( e, a, b, n )

call system_clock (count=cpu_e)

print *, gpu_e - gpu_s, ' microseconds on gpu'

print *, cpu_e - cpu_s, ' microseconds on host'

! compare results

errs = 0

do i = 1, n

if( abs((r(i) - e(i))/ e(i)) > 1.1 )then

errs = errs + 1

endif

enddo

print *, errs, ' errors found'

if( errs /= 0 ) call exit(errs)

end program
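One note on the timings above: the program prints raw `system_clock` counts and labels them as microseconds, which only holds if the clock ticks one million times per second. The following is a minimal sketch of how the counts could be converted explicitly using `count_rate`. The program name `clock_units` and the dummy loop are illustrative and not part of the original test.

```fortran
program clock_units
  implicit none
  integer :: t0, t1, rate, i
  real :: x

  ! Ask the runtime how many clock ticks make up one second.
  call system_clock( count_rate=rate )

  call system_clock( count=t0 )
  ! Dummy work standing in for the timed region (illustrative only).
  x = 0.0
  do i = 1, 1000000
    x = x + real(i)
  enddo
  call system_clock( count=t1 )

  print *, 'ticks per second: ', rate
  ! Convert ticks to microseconds regardless of the clock resolution.
  print *, real(t1 - t0) / real(rate) * 1.0e6, ' microseconds'
  ! Print x so the compiler cannot discard the loop.
  print *, 'checksum: ', x
end program
```

If `rate` comes back as 1000000, the numbers reported above are already in microseconds.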
