We are trying to use the PGI Fortran compiler with OpenACC
to port climate and weather physics models. However, we cannot get the time spent in a region with OpenACC directives
below 0.1 second, even for a simple vector-add code with small arrays. Is this the typical time needed to set up
communication between the CPU and the GPU?
Thanks,
sjz
Here is the hardware specification for a node:
2 hex-core 2.8 GHz Intel Xeon Westmere processors (4 flops per clock)
48 GB of memory per node
2 NVIDIA M2070 GPUs, each connected through a dedicated x16 PCIe Gen2 connection
Interconnect: InfiniBand QDR
Here are the compilation and runtime environment and the commands used:
module load comp/pgi-12.4.0
module load other/mpi/openmpi/1.4.5-pgi-12.4.0
pgf90 -o vecadd_openacc -acc -ta=nvidia,fastmath vecadd_openacc.F90
pgcudainit &
Here are the performance numbers:
szhou@warp01a002:~/test_gpu/acc_cuda_fortran/acc> ./vecadd_openacc 1000
110842 microseconds on gpu
2 microseconds on host
0 errors found
szhou@warp01a002:~/test_gpu/acc_cuda_fortran/acc> ./vecadd_openacc 10000
105249 microseconds on gpu
22 microseconds on host
0 errors found
szhou@warp01a002:~/test_gpu/acc_cuda_fortran/acc> ./vecadd_openacc 100000
110235 microseconds on gpu
232 microseconds on host
0 errors found
szhou@warp01a002:~/test_gpu/acc_cuda_fortran/acc> ./vecadd_openacc 1000000
110693 microseconds on gpu
2206 microseconds on host
0 errors found
Here is the source code:
szhou@discover25:~/test_gpu/acc_cuda_fortran/acc> cat vecadd_openacc.F90
module vecaddmod
implicit none
contains
subroutine vecaddgpu( r, a, b, n )
real, dimension(:) :: r, a, b
integer :: n
integer :: i
!$acc kernels do copyin(a(1:n),b(1:n)) copyout(r(1:n)) gang vector(256)
do i = 1, n
r(i) = a(i) + b(i)
! r(i) = sqrt(a(i)) + b(i)*a(i)+sin(a(i))*exp(-b(i))*sqrt(a(i)+b(i))
enddo
end subroutine
subroutine vecaddcpu( r, a, b, n )
real, dimension(:) :: r, a, b
integer :: n
integer :: i
do i = 1, n
r(i) = a(i) + b(i)
! r(i) = sqrt(a(i)) + b(i)*a(i)+sin(a(i))*exp(-b(i))*sqrt(a(i)+b(i))
enddo
end subroutine
end module
program main
use vecaddmod
implicit none
integer :: n, i, errs, argcount
integer :: cpu_s, cpu_e, gpu_s, gpu_e
real, dimension(:), allocatable :: a, b, r, e
character*10 :: arg1
argcount = command_argument_count()
n = 1000000 ! default value
if( argcount >= 1 )then
call get_command_argument( 1, arg1 )
read( arg1, '(i)' ) n
if( n <= 0 ) n = 100000
endif
allocate( a(n), b(n), r(n), e(n) )
do i = 1, n
a(i) = i
b(i) = 1000*i
enddo
! compute on the GPU
call system_clock (count=gpu_s)
call vecaddgpu( r, a, b, n )
call system_clock (count=gpu_e)
! compute on the host to compare
!
call system_clock (count=cpu_s)
call vecaddcpu( e, a, b, n )
call system_clock (count=cpu_e)
print *, gpu_e - gpu_s, ' microseconds on gpu'
print *, cpu_e - cpu_s, ' microseconds on host'
! compare results
errs = 0
do i = 1, n
if( abs((r(i) - e(i))/ e(i)) > 1.1 )then
errs = errs + 1
endif
enddo
print *, errs, ' errors found'
if( errs /= 0 ) call exit(errs)
end program
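One way to check whether the roughly 0.1 second is a one-time cost or is paid on every accelerated region would be to time two consecutive calls to the same loop. The following is only a minimal sketch of that test, not something we have measured. The names vecadd_timing, vecadd_acc and time_twice are made up for illustration, and the idea that the first call absorbs any one-time setup is an assumption the test would confirm or rule out. It also uses count_rate to convert clock ticks to microseconds instead of assuming the tick length.

```fortran
! Sketch only: time two consecutive calls to the same accelerated loop.
! Module, subroutine and program names are hypothetical.
module vecadd_timing
  implicit none
contains
  subroutine vecadd_acc( r, a, b, n )
    integer, intent(in) :: n
    real, intent(in)  :: a(n), b(n)
    real, intent(out) :: r(n)
    integer :: i
    ! The directive below is an ordinary comment when OpenACC is not enabled.
    !$acc kernels loop copyin(a(1:n),b(1:n)) copyout(r(1:n))
    do i = 1, n
      r(i) = a(i) + b(i)
    enddo
  end subroutine
end module

program time_twice
  use vecadd_timing
  implicit none
  integer, parameter :: n = 100000
  integer :: t0, t1, t2, rate, i
  real, allocatable :: a(:), b(:), r(:)

  allocate( a(n), b(n), r(n) )
  do i = 1, n
    a(i) = i
    b(i) = 1000*i
  enddo

  call system_clock( count=t0, count_rate=rate )
  call vecadd_acc( r, a, b, n )   ! first call: may include one-time setup (assumption)
  call system_clock( count=t1 )
  call vecadd_acc( r, a, b, n )   ! second call: same work, repeated
  call system_clock( count=t2 )

  ! Convert clock ticks to microseconds using the reported tick rate.
  print *, real(t1 - t0) * 1.0e6 / real(rate), ' microseconds, first call'
  print *, real(t2 - t1) * 1.0e6 / real(rate), ' microseconds, second call'
end program
```

If the second call came out far shorter than the first, that would point to a one-time startup cost rather than a per-region cost. If both calls took about 0.1 second, the overhead would be paid on every region.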