Help getting better performance with OpenACC

Hi,

Porting the code I’m working on to OpenACC was a bit tricky, but I’m now confident that it works correctly with OpenACC on our P100 card. The question now is how to get better performance.

The code in question is at: https://filenurse.com/download/5fe52487fabb829eef7b8f75d2f85403.html

To compile both the regular and the OpenACC version I do:
make clean ; make OACC=0 ; make clean ; make OACC=1

And to run the code in both versions:
./rii_acc.x > rii.out.acc ; ./rii_cpu.x > rii.out.cpu

The maximum relative error on my machine, run like that, is ~1e-9 (as reported by numdiff: numdiff rii.out.acc rii.out.cpu -q -S), so that looks good.

But the performance is not very good at the moment. There are two ‘heavy’ routines, riiparts and rii_matrix. The code loops 100 times, calling both routines, and on my machine each iteration of the CPU version (one execution of both routines) takes around 42 ms.

In the OpenACC version these two routines end up being three kernels: riiparts_173_gpu, rii_matrix_252_gpu and rii_matrix_294_gpu. Looking at the start times of the kernel executions, one execution of the three kernels takes around 2.78 ms.

So that is around 15x speedup, which doesn’t look bad, but a colleague working with a similar routine (though less tuned for the CPU) was getting around 80x speedup. At the same time, the profiler (pgprof) tells me that the occupancy was very low: 22.7% for riiparts_173_gpu, 4.9% for rii_matrix_252_gpu and 6.9% for rii_matrix_294_gpu, so I guess there is room for improvement.

At the moment I’m mostly changing the numbers of gangs, workers and vectors more or less at random, without a clear understanding of what I should do.

Can you give me some pointers to reading material or similar to better understand how I should distribute the threads on the GPU to reach better occupancy (and thus hopefully better performance)? [If you actually play with the code and get better execution numbers, then that gives you bonus points :-) ]

Many thanks,
AdV

Hi AdV,

I think the biggest issue with occupancy is that there’s too little work to keep the GPU busy.

First, I made a few scheduling adjustments. For the riiparts_173_gpu region, I only scheduled the outer two loops and didn’t fix the number of workers or the vector length:

    !$acc parallel present(dx,dy,u,ys,p1_r,p1_i,p2_r,p2_i)
    !$acc loop gang vector collapse(2)
    do i=1,nf
       do ip=1,nf

On my P100, the overall time for this kernel went from ~0.0314 seconds to ~0.01864 seconds.
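For reference, here is a minimal self-contained sketch of that schedule. It is not the real riiparts code: the array `a`, the `copyout` clause and the loop body are placeholders invented for illustration, and only the `gang vector collapse(2)` schedule without fixed worker or vector sizes comes from the snippet above.

```fortran
! Sketch only: the array, data clause and loop body are placeholders,
! not the real riiparts code.
program collapse_sketch
   implicit none
   integer, parameter :: nf = 101      ! frequency points, as in the test case
   real(8) :: a(nf,nf)
   integer :: i, ip

   ! No num_workers or vector_length clause: the compiler chooses them.
   !$acc parallel copyout(a)
   !$acc loop gang vector collapse(2)
   do i = 1, nf
      do ip = 1, nf
         a(ip,i) = dble(i) + dble(ip)  ! placeholder work
      end do
   end do
   !$acc end parallel

   print *, 'iterations exposed to the GPU:', nf*nf
   print *, 'checksum:', sum(a)
end program collapse_sketch
```

Without OpenACC enabled the directives are treated as comments, so the same file also builds as a plain serial program for comparison.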

I couldn’t find a better schedule for rii_matrix_252_gpu, but I did remove the fixed gang size. I know 101 matches the input size, but it is rather small for a GPU.

For rii_matrix_294_gpu, instead of vectorizing only the innermost loop, I collapsed and vectorized the two innermost loops:

    !$acc kernels present(rii_r,rii_i)
    !$acc loop gang collapse(2)
    do k=kmin,kmax
       do kp=kmin,kmax
          km = MIN(k,kp)
          do q=-km,km
             hnt=2.d0*pi*lf*gu*dble(q)/Aul
             hnt2=hnt*hnt
             hr=1.d0/(1.d0+hnt2)     !Real part Hanle term
             hi=-hnt/(1.d0+hnt2)     !Imaginary part Hanle term

             !$acc loop vector(128) private(tr,ti) collapse(2)
             do i=1,nf
                do ip=1,nf
                   call comprod(rii_r(k,kp,q,i,ip),rii_i(k,kp,q,i,ip),hr,hi,tr,ti)
                   rii_r(k,kp,q,i,ip)=tr
                   rii_i(k,kp,q,i,ip)=ti
                end do
             end do
          end do
       end do
    end do

This helped a bit, going from 0.083 to 0.0667 seconds.

The theoretical occupancy for the kernels is 34% for 173 and 64% for the other two. But, like you, I only saw between 6% and 9% achieved occupancy. The problem here is that the warps are stalled waiting for memory and there simply aren’t enough warps in total to hide the memory latency.

If I increase the number of frequency points from 101 to 501, then the achieved occupancy goes to 34% for 173 and 64% for 252, matching the theoretical occupancy. 294 remained at 6%, but that’s because kmin=0 and kmax=2, so you only have a maximum of 9 gangs.
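A rough way to see why the problem size matters is to compare the parallel iterations each kernel exposes with the number of threads the GPU can keep resident. The sketch below does that arithmetic. The hardware figures (56 SMs and 2048 resident threads per SM for a P100) are assumptions, not numbers taken from the profiler, and the resulting ratio is only a back-of-envelope indicator, not the achieved occupancy that pgprof reports.

```fortran
! Back-of-envelope estimate; hardware figures are assumed, not measured.
program work_estimate
   implicit none
   integer, parameter :: n_sm = 56              ! assumed SM count of a P100
   integer, parameter :: threads_per_sm = 2048  ! assumed resident threads per SM
   integer, parameter :: kmin = 0, kmax = 2     ! values quoted in the thread
   integer, parameter :: vlen = 128             ! vector length used for kernel 294
   integer :: nfs(2), n, resident, gangs

   nfs = (/ 101, 501 /)                         ! frequency points tried
   resident = n_sm * threads_per_sm
   print *, 'resident thread capacity:', resident

   ! Collapsed i,ip loops expose nf*nf independent iterations.
   do n = 1, 2
      print *, 'nf =', nfs(n), ' iterations:', nfs(n)**2, &
               ' fraction of capacity:', real(nfs(n)**2) / real(resident)
   end do

   ! Kernel 294 gangs over the collapsed k,kp loops only.
   gangs = (kmax - kmin + 1)**2
   print *, 'kernel 294 gangs:', gangs, ' threads in flight:', gangs*vlen
end program work_estimate
```

Under these assumptions, 101 frequency points expose 10201 iterations against a capacity of 114688 resident threads, while 501 points expose 251001, and the 9 gangs of kernel 294 stay far below the number of SMs whatever the value of nf.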

I’m not sure that the larger number of frequency points is valid, nor did I check for correct answers. I was just checking whether the main problem with occupancy was indeed the problem size.

-Mat

Hi Mat.

Thanks for the help. Since we have to perform MANY of these calculations, I’m going to modify the code so that several of them can run concurrently on the GPU, which I hope will provide more work and increase the GPU occupancy.
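One possible shape for that change, as a sketch only: add a batch index over independent calculations and collapse it together with the frequency loops, so a single kernel launch sees `nbatch` times more iterations. The name `nbatch`, its value, the array `a` and the loop body are hypothetical placeholders, not taken from the real code.

```fortran
! Sketch only: nbatch, the array and the loop body are hypothetical.
program batch_sketch
   implicit none
   integer, parameter :: nf = 101     ! frequency points per calculation
   integer, parameter :: nbatch = 32  ! hypothetical number of independent calculations
   real(8) :: a(nf,nf,nbatch)
   integer :: i, ip, ib

   ! One launch covers every calculation in the batch.
   !$acc parallel copyout(a)
   !$acc loop gang vector collapse(3)
   do ib = 1, nbatch
      do i = 1, nf
         do ip = 1, nf
            a(ip,i,ib) = dble(ib) * (dble(i) + dble(ip))  ! placeholder work
         end do
      end do
   end do
   !$acc end parallel

   print *, 'iterations in one kernel launch:', nf*nf*nbatch
   print *, 'checksum:', sum(a)
end program batch_sketch
```

Launching separate kernels on different queues with the `async` clause would be another way to overlap independent calculations; which of the two suits the real code better depends on how its data is laid out.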

[I will likely come back with more questions during the week…]

Many thanks,
AdV