Hi Richard,
First, in case you don’t know, when targeting an NVIDIA device with OpenACC we map a gang to a CUDA thread block, a worker to the y-dimension of the thread block (threadIdx.y), and a vector to the x-dimension (threadIdx.x).
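As a quick sketch of what that mapping looks like on a made-up triple-nested loop (the array and bounds names here are just placeholders):

// gang   -> CUDA thread block
// worker -> threadIdx.y
// vector -> threadIdx.x
#pragma acc parallel loop gang
for (int i = 0; i < ni; ++i) {
    #pragma acc loop worker
    for (int j = 0; j < nj; ++j) {
        #pragma acc loop vector
        for (int k = 0; k < nk; ++k) {
            a[i][j][k] = b[i][j][k] + 1.0;
        }
    }
}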
The grid dimension (the number of blocks) doesn’t really matter here since it’s more about enumeration, and it’s rare that the compiler will use the y or z grid dimensions. You just want enough blocks to fill the streaming multiprocessors (SMs) with work. If I remember correctly, Pascal has a max of 16 resident blocks per SM. I don’t know offhand how many SMs are on a GP100, but “pgaccelinfo”/“nvaccelinfo” will tell you how many you have.
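For example, just running:

nvaccelinfo

prints the properties of each attached device, including the multiprocessor count (pgaccelinfo is the same utility under the older PGI naming).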
What does matter is occupancy, which is basically the percentage of threads that can be resident on an SM relative to the maximum the device allows. High occupancy does not guarantee optimal performance, but low occupancy can hurt. 50-75% occupancy is usually acceptable.
At 16 blocks per SM and 2048 total threads per SM, this means you want a minimum of 128 threads per block. On Volta and Ampere the limit is 32 blocks per SM, still with 2048 threads per SM, so the minimum is 64 threads per block. By default, the compiler will typically use a vector_length of 128 (i.e. 128 threads per block).
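To make the arithmetic concrete with the numbers above: with the default vector_length of 128, it takes 2048 / 128 = 16 resident blocks to fill the SM, which matches the 16-block limit exactly, so you can reach 100% theoretical occupancy (registers permitting). With 64-thread blocks you’d need 32 resident blocks, but the 16-block limit caps you at 16 * 64 = 1024 threads, i.e. 50% occupancy.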
The other things that affect occupancy are register usage and software-managed cache (shared memory) size. Unless you’re using gang-private variables or the OpenACC “cache” directive, we’re safe to ignore the cache size limitation. For registers, there are 65536 per SM, so at 2048 threads each thread can use a max of 32 registers. Using more registers per thread lowers the maximum number of resident threads, thus lowering the occupancy. You can see the number of registers in use by adding the “-gpu=ptxinfo” flag, or it can be viewed in an Nsight Compute profile. Note that we don’t document the ptxinfo flag; it’s primarily used for internal debugging.
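For example, with the NVHPC compilers something like the following (the file name is just a placeholder) will have PTXAS report the per-kernel register counts at compile time:

nvc++ -acc -gpu=ptxinfo myprog.cpp

As a rough worked example using the numbers above: a kernel needing 40 registers per thread can only fit roughly 65536 / (40 * 128) ≈ 12 blocks of 128 threads on an SM, i.e. 1536 threads, or 75% theoretical occupancy. (The allocator rounds register usage, so treat this as an approximation.)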
Unfortunately, you don’t have much control over the number of registers being used since register allocation is done by PTXAS (the device assembler), though it’s largely determined by the number of local variables in the kernels. So the best way to reduce register usage is to use fewer local variables. There’s also the flag “-gpu=maxregcount:<n>”, which tells PTXAS to use no more than “n” registers, but this causes more spilling, where variables that would normally be kept in registers are instead placed in L2 cache or global memory. Unless you’re very close to a threshold (like 33), the increased spilling will often offset any gains from the increased occupancy.
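For example (again with a placeholder file name), to cap the kernels at 32 registers per thread:

nvc++ -acc -gpu=maxregcount:32 myprog.cpp

Then re-check ptxinfo or the profiler to make sure the extra spilling didn’t cost more than the occupancy you gained.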
The CUDA Occupancy Calculator (CUDA Occupancy Calculator :: CUDA Toolkit Documentation) is useful for computing the theoretical occupancy. Note that the achieved occupancy may be lower due to warp stalls. To see the achieved occupancy, you’ll need to use the Nsight Compute profiler.
Another important aspect is to ensure that the vector and/or worker loop is accessing the stride-1 dimension (the contiguous memory) of your array. In C/C++ this is your rows (the last index); in Fortran it’s your columns (the first index). Since threads in a warp (groups of 32 threads) execute in SIMT fashion (single instruction, multiple threads), when one thread accesses memory, all 32 do as well. If that memory is contiguous, it can be brought into cache together. If each thread is accessing non-contiguous memory, each thread may have to wait while the others fetch theirs.
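As a sketch in C (the array names and sizes are made up), you want the vector loop running over the last index:

// consecutive vector threads (threadIdx.x) get consecutive 'j' values,
// so they touch contiguous memory and the loads/stores can coalesce
#pragma acc parallel loop gang
for (int i = 0; i < n; ++i) {
    #pragma acc loop vector
    for (int j = 0; j < m; ++j) {
        a[i][j] = 2.0 * b[i][j];
    }
}
// if the vector loop ran over 'i' instead, consecutive threads would be
// 'm' elements apart and each warp access would span many cache lines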
You can try adjusting the block size via the vector_length and num_workers clauses, but in general the default vector_length of 128 works well. Typically I only advise using larger sizes when there’s software caching, so more threads can take advantage of the block’s shared memory. Smaller values are useful when the inner loop which accesses the stride-1 dimension is itself smaller than the default. The worker clause is rarely used, but it can help when the vector length is lowered, to make up for the fewer threads per block. For example, something like:
bounds1 = ...some big value..;
bounds2 = 32;
#pragma acc parallel loop gang worker vector_length(32)
for (int i = 0; i < bounds1; ++i) {
    #pragma acc loop vector
    for (int j = 0; j < bounds2; ++j) {
        arr[i][j] = ....
    }
}
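If you want to explicitly make up for the shorter vector, add num_workers as well; the 4 below is just for illustration (4 workers x 32 vector lanes = 128 threads per block):

#pragma acc parallel loop gang worker num_workers(4) vector_length(32)
for (int i = 0; i < bounds1; ++i) {
    #pragma acc loop vector
    for (int j = 0; j < bounds2; ++j) {
        arr[i][j] = ....
    }
}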
Now if your offloaded loops simply don’t have a lot of parallelism (e.g. the loop trip counts are too small), there’s not much you can do other than use a bigger problem. However, if the compute regions are not dependent on each other, you can use “async(<qid>)” with different queue ids so the kernels can run concurrently on the device.
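A minimal sketch of the async idea (the loops and arrays are hypothetical; the point is just the different queue ids):

#pragma acc parallel loop async(1)
for (int i = 0; i < n; ++i) a[i] += 1.0;

#pragma acc parallel loop async(2)
for (int j = 0; j < m; ++j) b[j] *= 2.0;

#pragma acc wait   // wait for both queues before using the results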
If you want to provide an example of what you’re doing, I can offer specific suggestions on how to better tune it.
-Mat