Hi everyone,
While exploring PGI Accelerator programming, I noticed that a program I wrote performs better on the CPU than on the GPU.
program Acc_Jacobi_Relax
   use accel_lib
   real, dimension(:,:,:), allocatable :: AN, AS, AE, AW, AP, PHI, PHIN, PHINK
   integer ni, nj, mba
   integer i, j, m, n
   integer t1, t2, GPUtime, rate
   ni = 300
   nj = 300
   mba = 300
   ! Allocate matrices
   allocate(AN(ni,nj,mba), AS(ni,nj,mba), AE(ni,nj,mba), AW(ni,nj,mba), &
            AP(ni,nj,mba), PHI(ni,nj,mba), PHIN(ni,nj,mba), PHINK(ni,nj,mba))
   ! Place numbers in the matrices
   do m = 1, mba
      do j = 1, nj
         do i = 1, ni
            AN(i,j,m) = 1
            AS(i,j,m) = 1
            AE(i,j,m) = 1
            AW(i,j,m) = 1
            AP(i,j,m) = 1
            PHI(i,j,m) = 1
            PHIN(i,j,m) = 1
            PHINK(i,j,m) = 0
         enddo
      enddo
   enddo
   ! Initialize the GPU
   call acc_init(acc_device_nvidia)
   ! Report which GPU is in use
   n = acc_get_device_num(acc_device_nvidia)
   print *, 'device number', n
   ! Accelerate the Jacobi calculation
   call system_clock( count=t1 )
!$acc region
! Alternatives I tried (note: without the "$" after "!", the lines
! below are ordinary comments and the compiler ignores them):
!acc do parallel
!acc region do
!acc do vector
   do m = 1, mba
      do j = 2, nj-1
         do i = 2, ni-1
            PHINK(i,j,m) = AN(i,j,m) * PHI(i,j+1,m) &
                         + AS(i,j,m) * PHI(i,j-1,m) &
                         + AE(i,j,m) * PHI(i+1,j,m) &
                         + AW(i,j,m) * PHI(i-1,j,m) &
                         + AP(i,j,m) * PHI(i,j,m)
         enddo
      enddo
   enddo
!$acc end region
   call system_clock( count=t2, count_rate=rate )
   GPUtime = t2 - t1
   ! system_clock counts are in ticks of count_rate, not necessarily
   ! microseconds, so convert before printing
   print *, 'GPU execution time: ', real(GPUtime) / real(rate), ' seconds'
   deallocate(AN, AS, AE, AW, AP, PHI, PHIN, PHINK)
end program Acc_Jacobi_Relax
Usually, unless I add the -O2 flag, the CPU code takes about the same number of microseconds to execute as the GPU code (around 600,000 usec).
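One thing I have been wondering about but have not tried yet, so treat this as an untested sketch: since the region copies all the arrays to the device every time it executes, my understanding is that a PGI Accelerator data region could hoist the transfers out of the compute region, so the copy cost and the kernel cost can be separated. Something like:

```fortran
! Untested sketch: hoist the host-to-device copies out of the compute
! region with a data region, so the timed region measures mostly the
! kernel rather than the array transfers.
!$acc data region copyin(AN, AS, AE, AW, AP, PHI) copyout(PHINK)
!$acc region
do m = 1, mba
   do j = 2, nj-1
      do i = 2, ni-1
         PHINK(i,j,m) = AN(i,j,m) * PHI(i,j+1,m) &
                      + AS(i,j,m) * PHI(i,j-1,m) &
                      + AE(i,j,m) * PHI(i+1,j,m) &
                      + AW(i,j,m) * PHI(i-1,j,m) &
                      + AP(i,j,m) * PHI(i,j,m)
      enddo
   enddo
enddo
!$acc end region
!$acc end data region
```

If the copies dominate, I would expect the -Minfo=accel output to report the generated transfers for each array as well.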
The way I compile it is:
pgfortran -ta=nvidia -Minfo=accel -fast -o AccRegion_Jacobi_Relaxation.x AccRegion_Jacobi_Relaxation.f90
As you may have noticed, I modeled my program on the sample codes f1, f2, and f3. I have also tried different directives with no luck, and I am wondering whether the way I measured the times is correct (should I try the approach used in the Monte Carlo example?). If you have any suggestions, please let me know; I am only a student, after all. Thank you!
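In case it helps anyone reproduce the comparison, here is a trimmed, self-contained version of the CPU-only timing I have in mind (the conversion through count_rate is the part I am unsure about; the program name and layout are mine):

```fortran
program cpu_jacobi_timing
   ! Trimmed sketch of the CPU-only comparison: the same stencil, timed
   ! with system_clock. Counts are converted to seconds via count_rate
   ! because the tick length is processor dependent.
   implicit none
   integer, parameter :: ni = 300, nj = 300, mba = 300
   real, allocatable :: AN(:,:,:), AS(:,:,:), AE(:,:,:), AW(:,:,:), &
                        AP(:,:,:), PHI(:,:,:), PHINK(:,:,:)
   integer :: i, j, m, t1, t2, rate

   allocate(AN(ni,nj,mba), AS(ni,nj,mba), AE(ni,nj,mba), AW(ni,nj,mba), &
            AP(ni,nj,mba), PHI(ni,nj,mba), PHINK(ni,nj,mba))
   AN = 1; AS = 1; AE = 1; AW = 1; AP = 1; PHI = 1; PHINK = 0

   call system_clock(count=t1)
   do m = 1, mba
      do j = 2, nj-1
         do i = 2, ni-1
            PHINK(i,j,m) = AN(i,j,m)*PHI(i,j+1,m) + AS(i,j,m)*PHI(i,j-1,m) &
                         + AE(i,j,m)*PHI(i+1,j,m) + AW(i,j,m)*PHI(i-1,j,m) &
                         + AP(i,j,m)*PHI(i,j,m)
         enddo
      enddo
   enddo
   call system_clock(count=t2, count_rate=rate)
   print *, 'CPU loop time:', real(t2 - t1) / real(rate), 'seconds'
   deallocate(AN, AS, AE, AW, AP, PHI, PHINK)
end program cpu_jacobi_timing
```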
-Chris