# Help making code perform better using GPU rather than CPU

Hi everyone,

While exploring PGI accelerator programming, I noticed that a program I wrote performed better on the CPU than on the GPU.

``````
program Acc_Jacobi_Relax

use accel_lib
implicit none

real, dimension(:,:,:), allocatable :: AN, AS, AE, AW, AP, PHI, PHIN, PHINK
integer :: ni, nj, mba
integer :: i, j, m, n
integer :: t1, t2, GPUtime

ni = 300
nj = 300
mba = 300

! Allocate matrices
allocate(AN(ni,nj,mba), AS(ni,nj,mba), AE(ni,nj,mba), AW(ni,nj,mba), &
         AP(ni,nj,mba), PHI(ni,nj,mba), PHIN(ni,nj,mba), PHINK(ni,nj,mba))

! Place numbers in matrices
do m = 1, mba
   do j = 1, nj
      do i = 1, ni
         AN(i,j,m) = 1
         AS(i,j,m) = 1
         AE(i,j,m) = 1
         AW(i,j,m) = 1
         AP(i,j,m) = 1
         PHI(i,j,m) = 1
         PHIN(i,j,m) = 1
         PHINK(i,j,m) = 0
      enddo
   enddo
enddo

! Initialize GPU
call acc_init(acc_device_nvidia)

! Tell me which GPU I use
n = acc_get_device_num(acc_device_nvidia)
print *, 'device number', n

! Accelerate Jacobi Calculation
call system_clock( count=t1 )

!$acc region

! Other directives I tried, with no improvement:
! !$acc do parallel
! !$acc region do
! !$acc do vector

do m = 1, mba
   do j = 2, nj-1
      do i = 2, ni-1
         PHINK(i,j,m) = AN(i,j,m) * PHI(i,j+1,m) &
                      + AS(i,j,m) * PHI(i,j-1,m) &
                      + AE(i,j,m) * PHI(i+1,j,m) &
                      + AW(i,j,m) * PHI(i-1,j,m) &
                      + AP(i,j,m) * PHI(i,j,m)
      enddo
   enddo
enddo
!$acc end region

call system_clock( count=t2 )

! Note: this assumes the clock ticks in microseconds; strictly,
! the unit depends on count_rate.
GPUtime = t2 - t1
print *, 'GPU execution time:  ', GPUtime, 'microseconds'

deallocate(AN, AS, AE, AW, AP, PHI, PHIN, PHINK)

end program Acc_Jacobi_Relax
``````

Usually, unless I add the -O2 flag, the CPU code takes about the same number of microseconds to execute as the GPU code (around 600,000 usec).

The way I compile it is:

``````
pgfortran -ta=nvidia -Minfo=accel -fast -o AccRegion_Jacobi_Relaxation.x AccRegion_Jacobi_Relaxation.f90
``````

As you may have noticed, I modeled my program after the sample codes f1, f2, and f3. I have also tried different directives with no luck, and I am wondering whether the way I measured the times is correct (should I try the way they do it in the Monte Carlo example?). If you have any suggestions, please let me know; I am only a student, after all. Thank you!

-Chris

Hi Chris,

For the accelerator model, you can get basic profiling information using the flag “-ta=nvidia,time”. From this I can see that your biggest bottleneck is the data transfer cost:

``````
% jac_cpu.out
GPU execution time:          84599 microseconds
% jac_gpu.out
device number            0
GPU execution time:         408143 microseconds

Accelerator Kernel Timing data
/tmp/qa/jac.f90
acc_jacobi_relax
45: region entered 1 time
time(us): total=408138
kernels=23339 data=360361
53: kernel launched 1 times
grid: [150x38]  block: [16x8x2]
time(us): total=23339 max=23339 min=23339 avg=23339
acc_init.c
acc_init
41: region entered 1 time
time(us): init=123314
``````

The kernel time is roughly a quarter of the CPU time (about 23 ms versus 85 ms on the CPU), but the data transfer takes about 360 ms. Let’s look at the -Minfo messages and see how the compiler is copying the data:

``````
acc_jacobi_relax:
45, Generating copyout(phink(2:299,2:299,1:300))
Generating copyin(an(2:299,2:299,1:300))
Generating copyin(as(2:299,2:299,1:300))
Generating copyin(phi(1:300,1:300,1:300))
Generating copyin(ae(2:299,2:299,1:300))
Generating copyin(aw(2:299,2:299,1:300))
Generating copyin(ap(2:299,2:299,1:300))
Generating compute capability 1.0 binary
Generating compute capability 1.3 binary
``````

The compiler must be conservative, so it copies only the smallest amount of data needed to the GPU. Since you don’t use all of your arrays’ data, only the used sections are being copied. If you have enough memory available on the GPU, though, it’s often better to copy an entire array than a section: an entire array can be copied in a single DMA transfer, versus the many DMA transfers it takes to copy array sections.

We can fix this by adding ‘copyin’ and ‘copyout’ clauses to tell the compiler to copy the whole arrays.

``````
!$acc region copyin(AN,AS,AE,AW,AP,PHI), copyout(PHINK)
``````
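For illustration, here is a compact, self-contained sketch of the same Jacobi region with the explicit data clauses applied; the array sizes are reduced just to keep the example short, and the structure otherwise follows your program:

```fortran
! Sketch: Jacobi stencil region with whole-array data clauses,
! so each array moves to/from the GPU in a single DMA transfer.
program jacobi_whole_array_copy
  use accel_lib
  implicit none
  integer, parameter :: ni = 64, nj = 64, mba = 64
  real, dimension(ni,nj,mba) :: AN, AS, AE, AW, AP, PHI, PHINK
  integer :: i, j, m

  ! All-ones coefficients and field, as in the original test program
  AN = 1; AS = 1; AE = 1; AW = 1; AP = 1; PHI = 1; PHINK = 0

  !$acc region copyin(AN,AS,AE,AW,AP,PHI), copyout(PHINK)
  do m = 1, mba
     do j = 2, nj-1
        do i = 2, ni-1
           PHINK(i,j,m) = AN(i,j,m)*PHI(i,j+1,m) + AS(i,j,m)*PHI(i,j-1,m) &
                        + AE(i,j,m)*PHI(i+1,j,m) + AW(i,j,m)*PHI(i-1,j,m) &
                        + AP(i,j,m)*PHI(i,j,m)
        enddo
     enddo
  enddo
  !$acc end region

  print *, 'PHINK(2,2,1) =', PHINK(2,2,1)   ! 5.0 with all-ones inputs
end program jacobi_whole_array_copy
```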

Our new data transfer time is about 0.14 seconds.

``````
% jac_gpu2.out
device number            0
GPU execution time:         192631 microseconds

Accelerator Kernel Timing data
/tmp/qa/jac.f90
acc_jacobi_relax
45: region entered 1 time
time(us): total=192629 init=1 region=192628
kernels=40144 data=143868
w/o init: total=192628 max=192628 min=192628 avg=192628
53: kernel launched 1 times
grid: [150x38]  block: [16x8x2]
time(us): total=40144 max=40144 min=40144 avg=40144
acc_init.c
acc_init
41: region entered 1 time
time(us): init=85270
``````

Even if we got the compute kernel time down to zero, the overall time would still be longer than the CPU’s. Hence, the next step would be either to give the kernel more work (such as more time steps or more stencils) or to stop, since the code isn’t worth accelerating as-is.
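Also, on your timing question: system_clock does not necessarily tick in microseconds; the unit is implementation dependent. A safer pattern is to query count_rate and convert. A minimal sketch (with a dummy workload standing in for the timed region):

```fortran
! Sketch: portable elapsed-time measurement with system_clock.
! The tick unit varies between compilers, so query count_rate
! instead of assuming microseconds.
program timing_sketch
  implicit none
  integer :: t1, t2, rate, i
  real :: elapsed_us, x

  call system_clock(count_rate=rate)   ! ticks per second
  call system_clock(count=t1)

  x = 0.0                              ! dummy work to time
  do i = 1, 1000000
     x = x + sqrt(real(i))
  end do

  call system_clock(count=t2)
  elapsed_us = real(t2 - t1) / real(rate) * 1.0e6
  print *, 'elapsed:', elapsed_us, 'microseconds'
end program timing_sketch
```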

You might want to take a look at the Himeno benchmark results using the PGI accelerator directives (http://www.pgroup.com/lit/samples/accel_files/himeno.tar, http://www.pgroup.com/resources/accel_files/index.htm). Himeno uses a 19-point stencil Jacobi solver and might give you some ideas.

- Mat

Hi Mat,

I am still studying the program for learning purposes. Unfortunately, I do not seem to get the same results you did unless I run the program a couple of times. Also, I would like to ask about something strange that happens when I run the sample code f3.f90.

The machine reports a segmentation fault, and when I call mydevice = acc_get_device_num(acc_device_nvidia) I receive a value of -1. Trying to set the device with acc_set_device was of no use either. Do you think there might be a compatibility problem with the motherboard?

I also found a post with code similar to mine, and the results reported there are quite different as well:
http://www.pgroup.com/userforum/viewtopic.php?p=6811&sid=0398e5bb00739956d8e5a5bf339d6f6c

2846987 microseconds on GPU
45473 microseconds on host

I am starting to suspect there may be a compatibility issue with the motherboard (we recently acquired two GPUs for it, and I am now using a Tesla C2050). Please let me know your thoughts, and thank you again for helping me. You are awesome! d(^^)b

-Chris

Hi Chris,

> Also, I would like to ask about something strange that happens when I run the sample code f3.f90.

Sorry about that. We accidentally broke this example in the 10.6 release; it will be fixed in August’s 10.8 release. I don’t think this bug should impact your code, though.

> Unfortunately, I do not seem to get the same results you did unless I run the program a couple of times.

Check the output from the ‘pgaccelinfo’ utility and compare your bandwidth with the output from my Fermi card. Depending upon the motherboard, adding an additional card can sometimes cut the bandwidth in half or even to a quarter. If this is the case, try taking one of the cards out and see if your bandwidth returns.

- Mat
``````
% pgaccelinfo
CUDA Driver Version:           3000

Device Number:                 0
Device Name:                   Tesla C2050
Device Revision Number:        2.0
Global Memory Size:            2817720320
Number of Multiprocessors:     14
Number of Cores:               448
Concurrent Copy and Execution: Yes
Total Constant Memory:         65536
Total Shared Memory per Block: 49152
Registers per Block:           32768
Warp Size:                     32
Maximum Block Dimensions:      1024, 1024, 64
Maximum Grid Dimensions:       65535 x 65535 x 1
Maximum Memory Pitch:          2147483647B
Texture Alignment:             512B
Clock Rate:                    1147 MHz
Initialization time:           2699045 microseconds
Current free memory:           2758803456
Upload time (4MB):              966 microseconds ( 735 microseconds pinned)
Upload bandwidth:              4341 MB/sec (5706 MB/sec pinned)
``````

> you can get basic profiling information using the flag “-ta=nvidia,time”
>
> ``````
> Accelerator Kernel Timing data
> /tmp/qa/jac.f90
> acc_jacobi_relax
> 45: region entered 1 time
> time(us): total=192629 init=1 region=192628
> kernels=40144 data=143868
> w/o init: total=192628 max=192628 min=192628 avg=192628
> 53: kernel launched 1 times
> grid: [150x38] block: [16x8x2]
> time(us): total=40144 max=40144 min=40144 avg=40144
> acc_init.c
> acc_init
> 41: region entered 1 time
> time(us): init=85270
> ``````

Hi Mat,

Is there a way in CUDA Fortran to get similar benchmark information?

Thanks,
Tuan

Hi,

You can use pgcollect to get the profile information and use pgprof to display it.

Here is an example:

``````
% pgcollect -cuda=gmem executable_name
% pgprof -exe executable_name
``````

The information displayed will be just a bit different.

Hope this helps.
Hongyon