Why is my OpenACC code slower than OpenMP?

Hi Everyone,

I am a newbie to accelerator programming. I ran into a problem when comparing the execution time of a simple one-dimensional vector addition (saxpy) accelerated with OpenMP against the same computation with OpenACC. To my surprise, the OpenMP version is far faster than the OpenACC one, no matter how big the array is. With the array size set to 2**26, OpenMP takes 73 ms while OpenACC needs 396 ms to complete the same computation.

Can anyone tell me what is wrong with my code? The code I used for this experiment is attached below.

Thanks,
Li

    subroutine saxpy_openmp(n,a,x,y)
    implicit none
    integer :: n,i
    real, intent(in) :: x(n),a
    real, intent(inout) :: y(n)
    !$omp parallel do
      do i=1,n
        y(i)=a*x(i)+y(i)     
      enddo
    !$omp end parallel do
    end subroutine saxpy_openmp
    
    subroutine saxpy(n,a,x,y)
    implicit none
    integer :: n, i
    real, intent(in) :: x(n), a
    real, intent(inout) :: y(n)
      do i=1,n
        y(i)=a*x(i)+y(i)    
      enddo
    end subroutine saxpy

    subroutine saxpy_openacc(m,a,x,y1)
    implicit none
    integer :: m, i
    real :: x(m), a
    real :: y1(m)
    !$acc kernels loop present(x,y1)
      do i=1,m
        y1(i)=a*x(i)+y1(i)    
      enddo 

    end subroutine saxpy_openacc    
    
    program p 
     use lapack95 
     use blas95
     use omp_lib
     use accel_lib
     implicit none
     integer, parameter :: m=2**26 !don't set the power of 2 to exceed 26
     real :: x(m),y1(m),y2(m),y3(m)
     integer :: r1,r0
     integer :: i,j
        
     do i=1,m
      y1(i)=1.0    
      y2(i)=1.0
      y3(i)=1.0
      x(i)=1.0
     enddo 
     
     call system_clock(r0)
     call saxpy_openmp(m,2.0,x,y2)
     call system_clock(r1)     
     print*,' time: ',r1-r0
     do i=1,10
       print*,y2(i)      
     enddo
     
     call system_clock(r0)
     call saxpy(m,2.0,x,y3)
     call system_clock(r1)
     print*,' time: ',r1-r0
     do i=1,10
       print*,y3(i)      
     enddo   
     
     call acc_init( acc_device_nvidia )
     call system_clock(r0)   
     !$acc data copy(x(:),y1(:))
     call saxpy_openacc(m,2.0,x,y1)
     !$acc end data
     call system_clock(r1)
     print*,' time: ',r1-r0
     do i=1,10
       print*,y1(i)      
     enddo     

    end program



-g -Bstatic -Mbackslash -mp -acc -I"C:\Program Files (x86)\Intel\Composer XE 2013\mkl\include" -I"C:\Program Files (x86)\Intel\Composer XE 2013\mkl\interfaces\lapack95\lapack95\include\intel64\lp64" -I"C:\Program Files (x86)\Intel\Composer XE 2013\mkl\interfaces\blas95\lib95\include\intel64\lp64" -I"c:\program files\pgi\win64\12.10\include" -I"C:\Program Files\PGI\Microsoft Open Tools 10\include" -I"C:\Program Files\PGI\Microsoft Open Tools 10\PlatformSDK\include" -I"C:\Program Files\PGI\win64\2012\cuda\4.2\include" -fastsse -Mipa=fast,inline -tp=bulldozer-64 -ta=nvidia,nowait,host -Minform=warn -Minfo=accel

Hi catfishwolf,

This is not too surprising. The problem here is the time spent on device data allocation, deallocation, and movement. While the compute time itself drops considerably on the GPU, the data overhead dominates the overall time.
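
If you want to see where the time goes, one option is to split the timing around the data region. This is just a rough sketch against your original single-call version, meant to sit in the main program; the timer variables t0 through t3 are made up, and since the compute construct has no async clause the kernel has finished before the following system_clock call returns:

     integer :: t0, t1, t2, t3
     call acc_init( acc_device_nvidia )    ! keep device start-up out of the timing
     call system_clock(t0)
     !$acc data copyin(x(:)), copy(y1(:))
     call system_clock(t1)                 ! t1-t0: device allocation + host-to-device copies
     call saxpy_openacc(m,2.0,x,y1)
     call system_clock(t2)                 ! t2-t1: kernel launch + compute
     !$acc end data
     call system_clock(t3)                 ! t3-t2: device-to-host copy of y1 + deallocation
     print*,' copyin: ',t1-t0,' compute: ',t2-t1,' copyout: ',t3-t2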

While you’ll often see saxpy used as an OpenACC example, it’s actually not a great benchmark for performance, since there isn’t enough computation to justify the data costs. If I modify your example so that each routine is executed many times (I’m using 100 below), you’ll see the GPU give some speed-up.

   % cat test.f90 
    subroutine saxpy_openmp(n,a,x,y)
    implicit none
    integer :: n,i
    real, intent(in) :: x(n),a
    real, intent(inout) :: y(n)
    !$omp parallel do
      do i=1,n
        y(i)=a*x(i)+y(i)     
      enddo
    !$omp end parallel do
    end subroutine saxpy_openmp
   
    subroutine saxpy(n,a,x,y)
    implicit none
    integer :: n, i
    real, intent(in) :: x(n), a
    real, intent(inout) :: y(n)
      do i=1,n
        y(i)=a*x(i)+y(i)   
      enddo
    end subroutine saxpy

    subroutine saxpy_openacc(m,a,x,y1)
    implicit none
    integer :: m, i
    real :: x(m), a
    real :: y1(m)
    !$acc kernels loop present(x,y1) 
      do i=1,m
        y1(i)=a*x(i)+y1(i)   
      enddo

    end subroutine saxpy_openacc   
   
    program p
!     use lapack95
!     use blas95
     use omp_lib
     use accel_lib
     implicit none
     integer,parameter :: m=2**26 !don't set the power of 2 to exceed 26
     real :: x(m),y1(m),y2(m),y3(m)
     integer :: r1,r0
     integer :: i,j, iter
       
     do i=1,m
      y1(i)=1.0   
      y2(i)=1.0
      y3(i)=1.0
      x(i)=1.0
     enddo
     
     call system_clock(r0)
     do iter=1,100
     call saxpy_openmp(m,2.0,x,y2)
     enddo
     call system_clock(r1)     
     print*,' time: ',r1-r0
     do i=1,10
       print*,y2(i)     
     enddo
     
     call system_clock(r0)
     do iter=1,100
     call saxpy(m,2.0,x,y3)
     enddo
     call system_clock(r1)
     print*,' time: ',r1-r0
     do i=1,10
       print*,y3(i)     
     enddo   
     
     call acc_init( acc_device_nvidia )
     call system_clock(r0)   
     !$acc data copyin(x(:)), copy(y1(:))
     do iter=1,100
     call saxpy_openacc(m,2.0,x,y1)
     enddo
     !$acc end data
     call system_clock(r1)
     print*,' time: ',r1-r0
     do i=1,10
       print*,y1(i)     
     enddo     

    end program 
% pgf90 -acc -Minfo=accel -fast test.f90 -V13.7 -mp ; a.out
saxpy_openacc:
     28, Generating present(x(:))
         Generating present(y1(:))
         Generating NVIDIA code
         Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
         Generating compute capability 3.0 binary
     29, Loop is parallelizable
         Accelerator kernel generated
         29, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
p:
     75, Generating copy(y1(:))
         Generating copyin(x(:))
  time:       2796251
    201.0000    
    201.0000    
    201.0000    
    201.0000    
    201.0000    
    201.0000    
    201.0000    
    201.0000    
    201.0000    
    201.0000    
  time:       4135116
    201.0000    
    201.0000    
    201.0000    
    201.0000    
    201.0000    
    201.0000    
    201.0000    
    201.0000    
    201.0000    
    201.0000    
  time:       1254468
    201.0000    
    201.0000    
    201.0000    
    201.0000    
    201.0000    
    201.0000    
    201.0000    
    201.0000    
    201.0000    
    201.0000

Hi mkcolg,

I felt relieved after seeing your result, but I still get a different outcome: the OpenACC part takes more than 10 seconds to complete. Does it have anything to do with my Visual Studio environment settings (Dropbox - Error), or with my hardware (CPU: AMD FX-4100 @ 4.0 GHz, GPU: GeForce GT 610)?

Li

  time:       7451000
    201.0000
    201.0000
    201.0000
    201.0000
    201.0000
    201.0000
    201.0000
    201.0000
    201.0000
    201.0000
  time:       7490001
    201.0000
    201.0000
    201.0000
    201.0000
    201.0000
    201.0000
    201.0000
    201.0000
    201.0000
    201.0000
  time:      10263000
    201.0000
    201.0000
    201.0000
    201.0000
    201.0000
    201.0000
    201.0000
    201.0000
    201.0000
    201.0000
Press any key to continue . . .



-g -Bstatic -Mbackslash -mp -acc -fastsse -Mipa=fast,inline -tp=bulldozer-64 -ta=nvidia,nowait,host -Minform=warn -Minfo=accel

Hi Li,

A GT 610 is a fairly weak card, which most likely accounts for the difference. I’m running a Tesla M2090 with 512 cores at 1301 MHz, versus your GT 610 with 48 cores at 810 MHz. Also, my card’s memory is GDDR5 versus your DDR3.
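
As a rough back-of-the-envelope check (the bandwidth figures are approximate, from the published specs): each saxpy sweep reads x and y and writes y, so it touches about 3 * 2**26 * 4 bytes, roughly 0.75 GB, and 100 sweeps move on the order of 75 GB through GPU memory. At the GT 610's ~14 GB/s that alone is several seconds of kernel time, while at the M2090's ~177 GB/s it is well under a second, which is in the same ballpark as the gap between our two OpenACC timings.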

Your card is fine for development, but you’ll want to move to a Tesla card for production runs.

  • Mat