Help making code perform better on the GPU rather than the CPU

Hi everyone,

While exploring PGI Accelerator programming, I noticed that a program I wrote performs better on the CPU than on the GPU.

program  Acc_Jacobi_Relax

   use accel_lib

   real,dimension(:,:,:),allocatable :: AN, AS, AE, AW, AP, PHI, PHIN, PHINK
   integer ni, nj, mba
   integer i, j, m, n
   integer t1, t2, GPUtime

   ni = 300
   nj = 300
   mba = 300

! Allocate matrices
   allocate(AN(ni,nj,mba), AS(ni,nj,mba), AE(ni,nj,mba), AW(ni,nj,mba), &
            AP(ni,nj,mba), PHI(ni,nj,mba), PHIN(ni,nj,mba), PHINK(ni,nj,mba))

! Place numbers in matrices
do m = 1, mba
  do j = 1, nj
    do i = 1, ni
      AN(i,j,m) = 1
      AS(i,j,m) = 1
      AE(i,j,m) = 1
      AW(i,j,m) = 1
      AP(i,j,m) = 1
      PHI(i,j,m) = 1
      PHIN(i,j,m) = 1
      PHINK(i,j,m) = 0
    enddo
  enddo
enddo


  ! Initialize GPU
   call acc_init(acc_device_nvidia)

  ! Tell me which GPU I use
  n = acc_get_device_num(acc_device_nvidia)
  print *,'device number', n

  ! Accelerate Jacobi Calculation
  call system_clock( count=t1 )

!$acc region

!acc do parallel
!acc region do
!acc do vector

 do m = 1, mba
    do j = 2, nj-1
       do i = 2, ni-1
           PHINK(i, j, m) = AN(i,j,m) * PHI(i,j+1,m)&
                          + AS(i,j,m) * PHI(i,j-1,m)&
                          + AE(i,j,m) * PHI(i+1,j,m)&
                          + AW(i,j,m) * PHI(i-1,j,m)&
                          + AP(i,j,m) * PHI(i,j,m)

        enddo
    enddo
 enddo
 !$acc end region

  call system_clock( count=t2 )


GPUtime = t2 - t1
print *, 'GPU execution time:  ', GPUtime, 'microseconds'

deallocate(AN, AS, AE, AW, AP, PHI, PHIN, PHINK)

end program Acc_Jacobi_Relax

Usually, unless I add the -O2 flag, the CPU code takes about the same number of microseconds to execute as the GPU code (around 600,000 usec).
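In case it matters, here is a small standalone sketch of how I understand system_clock timing (using count_rate to convert ticks to seconds, which I think is the portable way, though I am not certain the way I report microseconds in my program is right):

program time_sketch
   implicit none
   integer :: t1, t2, rate
   real :: seconds

   call system_clock( count_rate=rate )   ! clock ticks per second on this system
   call system_clock( count=t1 )
   ! ... the work being timed would go here ...
   call system_clock( count=t2 )

   seconds = real(t2 - t1) / real(rate)   ! elapsed wall-clock time in seconds
   print *, 'elapsed time:', seconds, 'seconds'
end program time_sketch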

The way I compile it is:
pgfortran -ta=nvidia -Minfo=accel -fast -o AccRegion_Jacobi_Relaxation.x AccRegion_Jacobi_Relaxation.f90

As you may have noticed, I modeled my program after the sample codes f1, f2, and f3. I have also tried different directives with no luck, and I am wondering whether the way I measure the times is correct (should I try the way it is done in the Monte Carlo example?). If you have any suggestions, please let me know; I am only a student, after all. Thank you!

-Chris

Hi Chris,

For the accelerator model, you can get basic profiling information using the flag “-ta=nvidia,time”. From this I can see that your biggest bottleneck is the data transfer cost:

% jac_cpu.out
 GPU execution time:          84599 microseconds
% jac_gpu.out
 device number            0
 GPU execution time:         408143 microseconds

Accelerator Kernel Timing data
/tmp/qa/jac.f90
  acc_jacobi_relax
    45: region entered 1 time
        time(us): total=408138
                  kernels=23339 data=360361
        53: kernel launched 1 times
            grid: [150x38]  block: [16x8x2]
            time(us): total=23339 max=23339 min=23339 avg=23339
acc_init.c
  acc_init
    41: region entered 1 time
        time(us): init=123314

The kernel time is roughly a quarter of the CPU time (about 0.02 seconds versus 0.08 seconds on the CPU), but the data transfer time is about 0.36 seconds. Let's look at the -Minfo messages and see how the compiler is copying the data:

acc_jacobi_relax:
     45, Generating copyout(phink(2:299,2:299,1:300))
         Generating copyin(an(2:299,2:299,1:300))
         Generating copyin(as(2:299,2:299,1:300))
         Generating copyin(phi(1:300,1:300,1:300))
         Generating copyin(ae(2:299,2:299,1:300))
         Generating copyin(aw(2:299,2:299,1:300))
         Generating copyin(ap(2:299,2:299,1:300))
         Generating compute capability 1.0 binary
         Generating compute capability 1.3 binary

The compiler must be conservative, so it copies the smallest amount of data it can to the GPU. Since you don't use all of your arrays' elements, only the used sections are being copied. If you have the memory available on the GPU, it's often better to copy an entire array than a section: a whole array can be copied in a single DMA transfer, whereas copying array sections takes many DMA transfers.
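To put a number on it (a rough back-of-the-envelope estimate on my part, assuming default 4-byte reals):

   7 arrays x 300 x 300 x 300 elements x 4 bytes = about 750 MB

so roughly three quarters of a gigabyte has to cross the PCIe bus every time the region is entered, which is why the data time above dwarfs the 23 ms kernel.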

We can fix this by adding 'copyin' and 'copyout' clauses to the region directive, telling the compiler to copy the whole arrays:

!$acc region copyin(AN,AS,AE,AW,AP,PHI), copyout(PHINK)

Our new data transfer time is about 0.14 seconds.

% jac_gpu2.out
 device number            0
 GPU execution time:         192631 microseconds

Accelerator Kernel Timing data
/tmp/qa/jac.f90
  acc_jacobi_relax
    45: region entered 1 time
        time(us): total=192629 init=1 region=192628
                  kernels=40144 data=143868
        w/o init: total=192628 max=192628 min=192628 avg=192628
        53: kernel launched 1 times
            grid: [150x38]  block: [16x8x2]
            time(us): total=40144 max=40144 min=40144 avg=40144
acc_init.c
  acc_init
    41: region entered 1 time
        time(us): init=85270

Even if we got the compute kernel time down to zero, the overall time would still be longer than the CPU's. Hence, the next step would be either to give the kernel more work (such as more time steps or more stencils) or to stop, since as it stands the code isn't worth accelerating.
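For example, if you iterate the Jacobi sweep, a data region lets you pay the copy cost once and then reuse the device copies across all of the sweeps. Here is a rough sketch of the idea (assuming your compiler release supports the data region directive and its copy/local clauses; 'nsweeps', the loop variable 'it', and the PHI/PHINK copy-back are just for illustration and would need to be declared and adapted for your code):

! Sketch: copy the coefficient arrays to the GPU once, then run many
! Jacobi sweeps against the device copies.
!$acc data region copyin(AN,AS,AE,AW,AP), copy(PHI), local(PHINK)
do it = 1, nsweeps

!$acc region
   do m = 1, mba
      do j = 2, nj-1
         do i = 2, ni-1
            PHINK(i,j,m) = AN(i,j,m) * PHI(i,j+1,m) &
                         + AS(i,j,m) * PHI(i,j-1,m) &
                         + AE(i,j,m) * PHI(i+1,j,m) &
                         + AW(i,j,m) * PHI(i-1,j,m) &
                         + AP(i,j,m) * PHI(i,j,m)
         enddo
      enddo
   enddo
   ! use the new iterate as the input to the next sweep
   do m = 1, mba
      do j = 2, nj-1
         do i = 2, ni-1
            PHI(i,j,m) = PHINK(i,j,m)
         enddo
      enddo
   enddo
!$acc end region

enddo
!$acc end data region

This way the data transfer is paid once rather than once per sweep.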

You might want to take a look at the Himeno benchmark results using the PGI Accelerator directives (http://www.pgroup.com/lit/samples/accel_files/himeno.tar, http://www.pgroup.com/resources/accel_files/index.htm). Himeno uses a 19-point stencil Jacobi solver and might give you some ideas.

  • Mat

Hi Mat,

I am still studying the program for learning purposes. Unfortunately, I do not seem to get the same results as you got unless I run the program a couple of times. Also, I would like to ask about something strange that is going on when I run the sample code f3.f90.

The machine reports a segmentation fault, and when I call mydevice = acc_get_device_num(acc_device_nvidia) I get a value of -1. Trying to set the device with acc_set_device was of no use either. Do you think there might be a compatibility problem with the motherboard?
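For what it is worth, this is roughly the check I run to see what the runtime reports (a small sketch; acc_get_num_devices is my guess at the right query routine, so please correct me if there is a better way):

program check_device
   use accel_lib
   integer :: ndev, mydevice

   ! how many NVIDIA devices does the runtime see?
   ndev = acc_get_num_devices(acc_device_nvidia)
   print *, 'NVIDIA devices found:', ndev

   call acc_init(acc_device_nvidia)

   ! this is the call that returns -1 for me
   mydevice = acc_get_device_num(acc_device_nvidia)
   print *, 'current device number:', mydevice
end program check_device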

I also found a thread with code similar to mine, and the results reported there are quite different as well:
http://www.pgroup.com/userforum/viewtopic.php?p=6811&sid=0398e5bb00739956d8e5a5bf339d6f6c

2846987 microseconds on GPU
45473 microseconds on host

I am starting to suspect there may be a compatibility issue with the motherboard (we recently acquired two GPUs for it, and I am now using a Tesla C2050). Please let me know your thoughts, and thank you again for helping me. You are awesome! d(^^)b

-Chris

Hi Chris,

"Also, I would like to ask about something strange that is going on when I run the sample code f3.f90."

Sorry about that. We accidentally broke this example in the 10.6 release. It will be fixed in August's 10.8 release. That said, I don't think this bug should impact your code.


"Unfortunately, I do not seem to get the same results as you got unless I run the program a couple of times."

Check the output from the 'pgaccelinfo' utility and compare your bandwidth with the output from my Fermi card below. Depending upon the motherboard, adding an additional card can sometimes cut the bandwidth in half or even to a quarter. If this is the case, try taking one of the cards out and see if your bandwidth comes back.

  • Mat
% pgaccelinfo
CUDA Driver Version:           3000

Device Number:                 0
Device Name:                   Tesla C2050
Device Revision Number:        2.0
Global Memory Size:            2817720320
Number of Multiprocessors:     14
Number of Cores:               448
Concurrent Copy and Execution: Yes
Total Constant Memory:         65536
Total Shared Memory per Block: 49152
Registers per Block:           32768
Warp Size:                     32
Maximum Threads per Block:     1024
Maximum Block Dimensions:      1024, 1024, 64
Maximum Grid Dimensions:       65535 x 65535 x 1
Maximum Memory Pitch:          2147483647B
Texture Alignment:             512B
Clock Rate:                    1147 MHz
Initialization time:           2699045 microseconds
Current free memory:           2758803456
Upload time (4MB):              966 microseconds ( 735 ms pinned)
Download time:                 1197 microseconds ( 692 ms pinned)
Upload bandwidth:              4341 MB/sec (5706 MB/sec pinned)
Download bandwidth:            3504 MB/sec (6061 MB/sec pinned)

"you can get basic profiling information using the flag '-ta=nvidia,time'"


Hi Mat,
Is there a way in CUDA Fortran to get similar benchmark information?

Thanks,
Tuan

Hi,

You can use pgcollect to collect the profiling information and pgprof to display it.

Here is an example:

%pgcollect -cuda=gmem executable_name
%pgprof -exe executable_name

The information it gives will be a little different.
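If you want timings from inside the program itself, CUDA events should also work in CUDA Fortran. Here is a rough sketch (the kernel launch is omitted, and you should check the cudafor documentation for the exact interfaces in your release):

program event_timing
   use cudafor
   implicit none
   type(cudaEvent) :: tstart, tstop
   real :: ms
   integer :: istat

   istat = cudaEventCreate(tstart)
   istat = cudaEventCreate(tstop)

   istat = cudaEventRecord(tstart, 0)
   ! ... launch your kernel(s) here ...
   istat = cudaEventRecord(tstop, 0)
   istat = cudaEventSynchronize(tstop)

   istat = cudaEventElapsedTime(ms, tstart, tstop)   ! elapsed time in milliseconds
   print *, 'GPU time:', ms, 'ms'

   istat = cudaEventDestroy(tstart)
   istat = cudaEventDestroy(tstop)
end program event_timing

Compile it with the -Mcuda flag.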

Hope this helps.
Hongyon