How to make loops with reduction faster?

I have some loops that need an OpenACC loop reduction. The code runs a lot faster on a CPU than on a GPU:

wall time on a single CPU
convergencet= 0.6248478889465332 secs

wall time on a GPU
convergencet= 3.379598140716553 secs

The codelet for convergencet is shown below. The loop is 128 x 128. How can I make the loop run faster on the GPU?


!$acc kernels loop present(atp, atw, ate, ats, atn, bt, t) &
!$acc reduction(+:local_sum_num, local_sum_den)
do j = 2, 129
   do i = 2, 129
      temp1 = atp(i,j)*t(i,j)
      temp2 = atw(i,j)*t(i-1,j) + ate(i,j)*t(i+1,j) + &
              ats(i,j)*t(i,j-1) + atn(i,j)*t(i,j+1)
      local_sum_num = local_sum_num + dabs(temp1 - temp2 - bt(i,j))
      local_sum_den = local_sum_den + dabs(temp1)
   end do
end do
!$acc end kernels


Thanks,

Ping

Hi Ping,

What does the compiler feedback say? (i.e., compile with -Minfo=accel.)
Profiler? (i.e., set the environment variable PGI_ACC_TIME=1 and point LD_LIBRARY_PATH at the PGI runtime libraries.)

Things to look for are the schedule used, whether caching is used (try with and without -ta=nvidia,nocache), and whether device initialization is dominating your time.
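Something like this, for example (illustrative commands; adjust the flags and the install path to your setup):

pgfortran -acc -ta=nvidia -Minfo=accel convergencet.f90 -o convergencet   # prints the accelerator feedback at compile time
export PGI_ACC_TIME=1                                                     # enables the built-in accelerator profiler
export LD_LIBRARY_PATH=/opt/pgi/linux86-64/13.2/lib:$LD_LIBRARY_PATH      # example PGI runtime path
./convergencet                                                            # profile summary prints at program exit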

  • Mat

Mat,

I tried with both the 12.8 and 13.2 compilers. The code generated by the new compiler is much faster, but it is still three times slower than the CPU code. Execution times are listed below, all measured in seconds. For the accelerator code, no time is spent in data movement.

CPU          12.8         13.2
0.62484789   4.13616442   1.87525368

Compiler output from v12.8

convergencet.f90:
convergencet:
12, Accelerator kernel generated
12, CC 1.3 : 28 registers; 32 shared, 284 constant, 0 local memory bytes
CC 2.0 : 27 registers; 0 shared, 308 constant, 0 local memory bytes
13, !$acc loop gang ! blockidx%x
15, !$acc loop vector(256) ! threadidx%x
12, Generating present(t(:,:))
Generating present(bt(:,:))
Generating present(atn(:,:))
Generating present(ats(:,:))
Generating present(ate(:,:))
Generating present(atw(:,:))
Generating present(atp(:,:))
Generating compute capability 1.3 binary
Generating compute capability 2.0 binary
15, Loop is parallelizable


Compiler output from v13.2

convergencet.f90:
convergencet:
12, Generating present(t(:,:))
Generating present(bt(:,:))
Generating present(atn(:,:))
Generating present(ats(:,:))
Generating present(ate(:,:))
Generating present(atw(:,:))
Generating present(atp(:,:))
Accelerator kernel generated
13, !$acc loop gang ! blockidx%x
15, !$acc loop vector(256) ! threadidx%x
12, Generating NVIDIA code
Generating compute capability 1.3 binary
Generating compute capability 2.0 binary
Generating compute capability 3.0 binary
15, Loop is parallelizable

Profiler information:

=========Code 12.8=========
convergencet convergencet
12: region entered 10000 times
time(us): total=4,110,304 init=904 region=4,109,400
kernels=399,804
w/o init: total=4,109,400 max=865 min=404 avg=410
12: kernel launched 10000 times
grid: [128] block: [256]
time(us): total=309,510 max=37 min=28 avg=30
13: kernel launched 10000 times
grid: [2] block: [256]
time(us): total=90,294 max=15 min=8 avg=9

==========Code 13.2==========
convergencet NVIDIA devicenum=0
time(us): 393,772
12: kernel launched 10000 times
grid: [128] block: [256]
device time(us): total=315,369 max=59 min=29 avg=31
elapsed time(us): total=712,665 max=207 min=68 avg=71
12: reduction kernel launched 10000 times
grid: [2] block: [256]
device time(us): total=78,403 max=22 min=7 avg=7
elapsed time(us): total=464,016 max=129 min=45 avg=46


I have a few questions.

  1. Compiler 12.8 generates information about register, constant, shared-memory, and local-memory usage. Why does this information disappear in v13.2? Can this information help optimize the code?

  2. The actual kernel time for the code compiled by 12.8 is 0.3998 seconds; however, the region time is 4.1 seconds, which equals my measured time. Where has the majority of the time been spent, besides the kernels?

  3. The actual kernel time for the code compiled by 13.2 is 0.3938 seconds, and the total elapsed time is 1.1767 seconds. This differs from my measured 1.8752 seconds. Why do they differ so much? Again, where has the time (1.1767 - 0.3938) been spent?

Thanks,

Ping

  1. Compiler 12.8 generates information about register, constant, shared-memory, and local-memory usage. Why does this information disappear in v13.2? Can this information help optimize the code?

Actually, I’m not sure why the ptxas info was removed; I’ll need to ask. My guess is that it’s because we’re moving towards multiple device types, but I’ll need to double-check.

  2. The actual kernel time for the code compiled by 12.8 is 0.3998 seconds; however, the region time is 4.1 seconds, which equals my measured time. Where has the majority of the time been spent, besides the kernels?

It was a performance bug in the 12.x compilers when a reduction was present: the runtime was taking too long to set up the reduction code. It was fixed in the 13.x compilers.

  3. The actual kernel time for the code compiled by 13.2 is 0.3938 seconds, and the total elapsed time is 1.1767 seconds. This differs from my measured 1.8752 seconds. Why do they differ so much? Again, where has the time (1.1767 - 0.3938) been spent?

Most likely the overhead of launching the kernels and setting up the reduction. While the performance of this code is much improved, there is still some overhead involved.
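To put rough numbers on it from your 13.2 profile: the elapsed totals are 712,665 + 464,016 = 1,176,681 us (your 1.1767 seconds), while the device totals are 315,369 + 78,403 = 393,772 us (your 0.3938 seconds). The difference, about 0.78 seconds over 10,000 iterations, works out to roughly 78 us of launch and reduction set-up overhead per iteration. The remaining 1.8752 - 1.1767 = 0.6985 seconds is presumably host-side time (timing calls, synchronization, and other work outside the region) that the profiler doesn't attribute to the kernels.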

One thing I do notice is that twice as many threads are launched as are needed: the schedule is 128x256, but your loops are only 128x128. I'd try explicitly setting the schedule to reduce the number of threads; see the sketch below. You may want to experiment with different combinations.
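Something like this, for example (an untested sketch; the vector width of 128 is my choice, and the declarations and initialization of local_sum_num and local_sum_den are as in your original code):

!$acc kernels loop gang present(atp, atw, ate, ats, atn, bt, t) &
!$acc reduction(+:local_sum_num, local_sum_den)
do j = 2, 129
   ! 128-wide vector matches the inner trip count, so no threads sit idle
   !$acc loop vector(128) reduction(+:local_sum_num, local_sum_den)
   do i = 2, 129
      temp1 = atp(i,j)*t(i,j)
      temp2 = atw(i,j)*t(i-1,j) + ate(i,j)*t(i+1,j) + &
              ats(i,j)*t(i,j-1) + atn(i,j)*t(i,j+1)
      local_sum_num = local_sum_num + dabs(temp1 - temp2 - bt(i,j))
      local_sum_den = local_sum_den + dabs(temp1)
   end do
end do
!$acc end kernels

This gives 128 gangs of 128 threads; other widths (64, 256) are worth profiling too, since for a loop this small the launch overhead discussed above may still dominate.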

  • Mat