Poor performance of OpenACC code compared to serial code

I am a novice in OpenACC Fortran programming, using PGI 17.4 Community Edition. Following Michael Wolfe's slides “OpenACC for Fortran programmers”, I wrote a serial code and an OpenACC code as follows:

The serial code:

program sequential_code
  implicit none
  integer, parameter              :: dp = selected_real_kind(15,307)
  real, dimension(:), allocatable :: a, b
  real(dp)                        :: start_t, end_t
  integer, parameter              :: n = 1000000


  call cpu_time(start_t)
  call random_seed
  allocate(a(n), b(n))
  call random_number(a)
  call process(a, b, n)
  deallocate(a, b)
  call cpu_time(end_t)

  write(*,20) end_t-start_t
  20 format('Total elapsed time is ', f10.5, ' seconds.')

  contains
  subroutine process( a, b, n )
    integer, intent(in)    :: n
    real, intent(inout)    :: a(n), b(n)
    integer                :: i

    do i = 1, n
        b(i) = exp(sin(a(i)))
    enddo
  end subroutine process
end program sequential_code

The OpenACC code:

program OpenACC_code
  implicit none
  integer, parameter              :: dp = selected_real_kind(15,307)
  real, dimension(:), allocatable :: a, b
  real(dp)                        :: start_t, end_t
  integer, parameter              :: n = 1000000


  call cpu_time(start_t)
  call random_seed
  allocate(a(n), b(n))
  call random_number(a)
  
  !$acc data copy(a,b)
  call process(a, b, n)
  !$acc end data
  
  deallocate(a, b)
  call cpu_time(end_t)

  write(*,20) end_t-start_t
  20 format('Total elapsed time is ', f10.5, ' seconds.')

  contains
  subroutine process( a, b, n )
    integer, intent(in)    :: n
    real, intent(inout)    :: a(n), b(n)
    integer                :: i

    !$acc parallel loop
    do i = 1, n
        b(i) = exp(sin(a(i)))
    enddo

  end subroutine process
end program OpenACC_code

Below are the command lines and output for the serial code and the OpenACC code:

The serial code:
pgf90 -o sequential_code.exe sequential_code.f90
./sequential_code.exe
Total elapsed time is 0.09600 seconds.

The OpenACC code:
export PGI_ACC_NOTIFY=1
pgf90 -acc -ta=tesla -o OpenACC_code.exe OpenACC_code.f90
./OpenACC_code.exe
launch CUDA kernel file=C:\Users\HP\Downloads\FORTRAN CODES\CUDA and OpenACC\OpenACC\OpenACC_code.f90 function=process line=30 device=0 threadid=1 num_gangs=7813 num_workers=1 vector_length=128 grid=7813 block=128
Total elapsed time is 0.13400 seconds.

My question is: what makes the OpenACC code slower than the serial code?

Thank you in advance.

I added a do loop to the code block between the two calls to cpu_time, so that the computation runs 100 times. With that change, the OpenACC code ran eight times faster than the serial code. I am impressed.
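For readers who want to try this, here is a minimal sketch of such a change. The modified program was not posted, so the loop placement (around the call to process only), the names nrep and k, and the copyin/copyout clauses are assumptions, not the original code.

```fortran
program openacc_repeat
  implicit none
  integer, parameter              :: dp = selected_real_kind(15,307)
  integer, parameter              :: n = 1000000
  integer, parameter              :: nrep = 100   ! assumed repeat count
  real, dimension(:), allocatable :: a, b
  real(dp)                        :: start_t, end_t
  integer                         :: k

  call cpu_time(start_t)
  call random_seed
  allocate(a(n), b(n))
  call random_number(a)

  ! Keep a and b resident on the device for all repetitions, so the
  ! host/device transfers happen once instead of nrep times.
  !$acc data copyin(a) copyout(b)
  do k = 1, nrep
    call process(a, b, n)
  enddo
  !$acc end data

  call cpu_time(end_t)
  write(*,20) end_t-start_t
  20 format('Total elapsed time is ', f10.5, ' seconds.')
  deallocate(a, b)

contains
  subroutine process(a, b, n)
    integer, intent(in) :: n
    real, intent(in)    :: a(n)
    real, intent(out)   :: b(n)
    integer             :: i

    !$acc parallel loop
    do i = 1, n
      b(i) = exp(sin(a(i)))
    enddo
  end subroutine process
end program openacc_repeat
```

Compiled without -acc, the directives are treated as comments, so the same source can serve as the serial baseline for the comparison.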

Hi Truong Dang,

What you’re seeing in the first example is the overhead of initializing the GPU, which can take between 0.5 and 1 second. Since your problem is so small, this overhead dominates the overall time. As you add more compute on the device, the overhead is amortized, which is why you start to see a speed-up.
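One way to check this on your own system is to pay the start-up cost explicitly, before the timed region. The following is only a sketch, not taken from the program above: it assumes the compiler provides the standard openacc module with acc_init and the acc_device_nvidia device type (as PGI does when building with -acc), and the program and variable names are made up for illustration.

```fortran
program openacc_init_timing
  use openacc   ! assumed available: acc_init, acc_device_nvidia
  implicit none
  integer, parameter :: dp = selected_real_kind(15,307)
  integer, parameter :: i8 = selected_int_kind(18)
  integer, parameter :: n = 1000000
  real, allocatable  :: a(:), b(:)
  integer(i8)        :: t0, t1, t2, rate
  integer            :: i

  allocate(a(n), b(n))
  call random_number(a)

  call system_clock(t0, rate)
  call acc_init(acc_device_nvidia)   ! one-time device start-up happens here
  call system_clock(t1)

  ! Data movement and kernel only; the device is already initialized.
  !$acc parallel loop copyin(a) copyout(b)
  do i = 1, n
    b(i) = exp(sin(a(i)))
  enddo
  call system_clock(t2)

  write(*,'(a,f10.5,a)') 'Device initialization:    ', &
       real(t1-t0, dp)/real(rate, dp), ' seconds.'
  write(*,'(a,f10.5,a)') 'Data transfer and kernel: ', &
       real(t2-t1, dp)/real(rate, dp), ' seconds.'
  deallocate(a, b)
end program openacc_init_timing
```

This uses system_clock rather than cpu_time so that the reported numbers are wall-clock time, which is usually the quantity of interest when part of the work runs on the device.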

-Mat

Hi Mat,

Thank you very much for your explanation.

Truong.