OpenACC: Fine-tuning accelerator performance

Mat, can you comment?

I’ve read quite a few blog entries and understand the PGI compilers are quite good at allocating gangs/workers/vectors for target accelerators, so there might not be much else that can be done.

My program does not have a lot of parallelism; I’ve needed to speed up the loops with PARALLEL LOOP directives. Profiling the CPU-only version pointed to four subroutines as the cycle hogs, so I’m focusing on them.

I am working on a Quadro GP100. pgaccelinfo tells me that it has a max threads per block of 1024, max block dimensions of 1024 x 1024 x 64, and max grid dimensions of 2147483647 x 65535 x 65535. I think that’s what’s relevant for gang/worker/vector determination. Can you explain that better for me? Also, what are the limits for gang/worker/vector on this accelerator?

The program is running only slightly better than 1/3 as fast as its CPU-only version. That’s not what I expected.

Hi Richard,

First, in case you don’t know: when targeting an NVIDIA device with OpenACC, we map a gang to a CUDA block, a worker to the y-dimension of the thread block (threadIdx.y), and a vector to the x-dimension (threadIdx.x).
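As a quick sketch (the function and variable names here are just placeholders, not from your code), a typical loop nest with this schedule looks like:

// Sketch: how an OpenACC loop nest typically maps onto CUDA when
// targeting an NVIDIA GPU.
void scale(int n, int m, float a[n][m])
{
    #pragma acc parallel loop gang   // gang   -> CUDA thread block (blockIdx.x)
    for (int i = 0; i < n; ++i) {
        #pragma acc loop vector      // vector -> threadIdx.x within the block
        for (int j = 0; j < m; ++j) {
            a[i][j] *= 2.0f;
        }
    }
    // A "worker" loop between the gang and vector loops would map to threadIdx.y.
}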

The grid dimension (the number of blocks) really doesn’t matter here since it’s more about enumeration, and it’s rare that the compiler will use the y or z dimensions. You just want enough blocks to fill the streaming multiprocessors (SMs) with work. If I remember correctly, Pascal has a max of 16 blocks per SM. I don’t know offhand how many SMs are on a GP100, but “pgaccelinfo”/“nvaccelinfo” will tell you how many you have.

What does matter is occupancy, which is basically the percentage of threads able to run concurrently relative to the maximum allowed on the device. High occupancy does not guarantee optimal performance, but low occupancy can hurt. 50-75% occupancy is usually acceptable.

At 16 blocks per SM and 2048 total threads per SM, this means you want a minimum of 2048/16 = 128 threads per block. On Volta and Ampere the number of blocks per SM is 32, with 2048 threads per SM, so the minimum threads per block is 64. By default, the compiler will typically use a vector_length of 128 (i.e., 128 threads per block).
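To make the arithmetic concrete, here’s a small sketch; the per-SM limits below are assumptions for a Pascal-class part like yours, so check pgaccelinfo/nvaccelinfo for the actual values:

#include <stdio.h>

int main(void)
{
    // Assumed per-SM limits for a Pascal-class GPU:
    int max_threads_per_sm = 2048;
    int max_blocks_per_sm  = 16;      // 32 on Volta/Ampere

    // Smallest block size that can still fill an SM with threads:
    printf("min threads per block for full occupancy: %d\n",
           max_threads_per_sm / max_blocks_per_sm);          // 2048/16 = 128

    // Theoretical occupancy with the default vector_length(128):
    int threads_per_block = 128;
    int resident = max_blocks_per_sm * threads_per_block;
    if (resident > max_threads_per_sm)
        resident = max_threads_per_sm;
    printf("theoretical occupancy: %d%%\n",
           100 * resident / max_threads_per_sm);             // 100%
    return 0;
}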

The other things that affect occupancy are register usage and software-managed cache size. Unless you’re using gang-private variables or the OpenACC “cache” directive, we’re safe to ignore the cache size limitation. For registers, there are 65536 per SM, so at 2048 threads each thread can use a max of 32 registers. Increasing the number of registers per thread lowers the maximum number of resident threads, thus lowering the occupancy. You can see the number of registers in use by adding the “-gpu=ptxinfo” flag, or it can be viewed in an Nsight Compute profile. Note we don’t document the ptxinfo flag; it’s primarily used for internal debugging.

Unfortunately, you don’t have much control over the number of registers being used since register allocation is done by PTXAS (the device assembler), though it’s largely determined by the number of local variables in the kernels. So the best way to reduce register usage is to use fewer local variables. There’s also the flag “-gpu=maxregcount:n”, which tells PTXAS to use no more than “n” registers, but this causes more spilling, where variables that would normally be put in a register are instead put in L2 cache or global memory. Unless you’re very close to a threshold (like 33), the increased spilling will often offset any gains from the increased occupancy.

The CUDA Occupancy Calculator (in the CUDA Toolkit documentation) is useful for computing the theoretical occupancy. Note that the actual occupancy may be lower due to warp stalls. To see the actual occupancy, you’ll need to use the Nsight Compute profiler.

Another important aspect is ensuring that the vector and/or worker loop accesses the stride-1 dimension (the contiguous memory) of your array. For C/C++ this is the rows (the last dimension); in Fortran it is the columns (the first dimension). Since threads in a warp (a group of 32 threads) execute in SIMT (single instruction, multiple threads) fashion, when one thread accesses memory, all 32 do as well. If that memory is contiguous, it can be brought into cache together. If each thread accesses non-contiguous memory, each thread may have to wait while the others fetch their memory.
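Here’s a sketch of the difference in C (placeholder names; the same idea applies to Fortran with the dimensions swapped):

// Coalesced vs. strided access from an OpenACC vector loop.
void copy_matrix(int n, float a[n][n], const float b[n][n])
{
    // Good: the vector loop runs over the last (contiguous) C dimension,
    // so the 32 threads of a warp load adjacent elements together.
    #pragma acc parallel loop gang
    for (int i = 0; i < n; ++i) {
        #pragma acc loop vector
        for (int j = 0; j < n; ++j)
            a[i][j] = b[i][j];
    }

    // Bad: the vector loop strides over the first dimension, so each thread
    // in a warp touches memory n floats apart and the loads can't coalesce.
    #pragma acc parallel loop gang
    for (int j = 0; j < n; ++j) {
        #pragma acc loop vector
        for (int i = 0; i < n; ++i)
            a[i][j] = b[i][j];
    }
}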

You can try adjusting the block size via the vector_length and num_workers clauses, but in general the default vector_length of 128 works well. Typically I only advise using larger sizes when there’s software caching, so that more threads can take advantage of the block’s shared memory cache. Smaller values are useful when an inner loop that accesses the stride-1 dimension is itself smaller than the default. The worker clause is rarely used, but it can help when the vector length is lowered, to make up for the fewer threads. For example, something like:

bounds1 = ...some big value..;
bounds2 = 32;
#pragma acc parallel loop gang worker vector_length(32)
for (int i = 0; i < bounds1; ++i) {
    #pragma acc loop vector
    for (int j = 0; j < bounds2; ++j) {
        arr[i][j] = ....
    }
}

Now, if your offloaded loops simply don’t have a lot of parallelism (e.g., the loop trip count is too small), there’s not much you can do other than use a bigger problem. However, if the compute regions are not dependent on each other, you can use “async(<qid>)” with different queue ids so the kernels can run concurrently on the device.
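For example, something along these lines (the names and loop bodies are just placeholders; the queue ids are arbitrary):

// Two independent compute regions launched on different async queues so
// they can overlap on the device.
void two_independent_kernels(int n, float *restrict a, float *restrict b)
{
    #pragma acc parallel loop async(1)    // queue 1
    for (int i = 0; i < n; ++i)
        a[i] = 2.0f * a[i];

    #pragma acc parallel loop async(2)    // queue 2, no dependence on queue 1
    for (int i = 0; i < n; ++i)
        b[i] = b[i] + 1.0f;

    #pragma acc wait                      // block until both queues finish
}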

If you want to provide an example of what you’re doing, I can offer specific suggestions on how to better tune it.

-Mat

Let me unpack a few tidbits you wrote: First, it appears that the GP100 has 56 processors. Second, by the grid dimensions you are saying that the GP100 has (x,y) mapped to (1024,64)? Or is it the other way around? And also, you indicate blocks per SM. Is that the accelerator’s “Warp Size” (32 for my GPU)?

It will take time to digest what you’ve written. Thanks for taking the time to explain this.

I’m not 100% clear on what you’re asking, but assuming it’s about the block size: a 1024x64 block would actually be invalid, since the product of the threads from each dimension can’t exceed 1024.

vector => CUDA block x dimension
worker => CUDA block y dimension

So having an OpenACC “vector_length(32) num_workers(4)” would be a 32x4 block dimension, 128 threads total.
Most kernels only need “vector”, and I typically only use worker for special cases (like a smaller inner vector loop).
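A sketch of that 32x4 case (placeholder names):

// num_workers(4) x vector_length(32) gives a (32,4,1) CUDA block,
// i.e. 128 threads per block.
void worker_vector(int n, float a[n][32])
{
    #pragma acc parallel loop gang worker num_workers(4) vector_length(32)
    for (int i = 0; i < n; ++i) {       // gang -> block, worker -> threadIdx.y
        #pragma acc loop vector
        for (int j = 0; j < 32; ++j)    // vector -> threadIdx.x
            a[i][j] += 1.0f;
    }
}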

And also, you indicate blocks per SM. Is that the accelerator’s “Warp Size” (32 for my GPU)?

No, a warp is something separate. It’s a group of 32 threads that execute in SIMT fashion. A block’s threads are grouped into warps, and a block can contain multiple warps. So a block size of 128x1 would have 4 warps.

This gets a bit more complex as of Volta, in that you can use cooperative groups instead of warps, but for simplicity let’s think of a warp as 32 threads grouped together during execution.

And also, you indicate blocks per SM.

Correct, though the 2048 threads per SM limit is more relevant. Blocks are just how you group those threads together. You could have 2 blocks with 1024 threads each, or 16 blocks with 128 threads each (or 32 blocks with 64 threads each on Volta or Ampere). I only mention blocks per SM to point out the minimum number of threads per block needed to get 100% theoretical occupancy. You could use 32 threads per block, but then the maximum number of threads running on an SM would be 16 x 32 = 512 on Pascal, or 32 x 32 = 1024 on Volta or Ampere.

Perhaps I’m too much in the weeds on this one. Here is a loop in my program, where the loop trip counts are quite large (up to 5.7 million):

integer i, ii, il, ir, j, n, je, jo

!$acc parallel loop
do ii = 1, vector(j)            ! line 10200
   i  = vector2(ii,j)
   il = jei + jomatrix(i,j)
   ir = joi + jematrix(i,j)

   do n = 1, max                ! line 10208
      A(n,ii) = B(n,il)
   end do

   do n = 1, max                ! line 10215
      C(n,ii) = B(n,ir)
   end do
end do

And the compiler spits out the following:
10199, Generating Tesla code
       10200, !$acc loop gang        ! blockidx%x
       10208, !$acc loop vector(128) ! threadidx%x
       10215, !$acc loop vector(128) ! threadidx%x
10208, Loop is parallelizable
10215, Loop is parallelizable

And when I run the program, I get:
launch CUDA kernel file=… function=… line=10199 device=0 threadid=1 num_gangs=1024 num_workers=1 vector_length=128 grid=1024 block=128

Hence, from what you’ve written, I have 128 vectors; that means I have 8 workers and that I’m using all the gangs on this loop. So it seems that I can only increase the number of workers for the loop (to a max of 8). Right?

In the generated code there’s only 1 worker, but yes, you could use up to 8, though it may or may not benefit performance. The benefit might be that you then get stride-1 access on “vector2” across the workers; however, the compiler would need to add an extra barrier before the first vector loop, which might offset any gains. You’d also be making the block (gang) size bigger, so that only 2 blocks can be run per SM versus the current schedule where 16 blocks per SM are used.

OpenACC does make it easy to experiment, so you can try the following to see if it helps. You’ll want to set the schedule (i.e., the gang/worker/vector clauses) explicitly to override the compiler’s defaults. You can also try experimenting with various values for num_workers and vector_length. My guess is it won’t help or hurt much in this case since the loop bounds are large, but it’s worth spending a bit of time experimenting. I’ll go through a similar process for an hour or two when I port codes, but more often than not, the compiler defaults work fine.

integer i, ii, il, ir, j, n, je, jo
…
!$acc parallel loop gang worker num_workers(8) vector_length(128)
do ii = 1, vector(j)            ! line 10200
   i  = vector2(ii,j)
   il = jei + jomatrix(i,j)
   ir = joi + jematrix(i,j)

!$acc loop vector
   do n = 1, max                ! line 10208
      A(n,ii) = B(n,il)
   end do

!$acc loop vector
   do n = 1, max                ! line 10215
      C(n,ii) = B(n,ir)
   end do
end do