Since the code I am trying to optimise is quite lengthy, I will ask my question in a somewhat abstract form. I have the following OpenACC kernel:
!$acc kernels
!$acc loop gang independent private(...)
do nchloop = 1, nchansmax
   call get_vmat(...)
end do
!$acc end kernels
where nchansmax=20706 and the subroutine get_vmat is declared `!$acc routine vector`. When I profile this code with Nsight Compute, I get a runtime of 9.73 sec with grid size 20706 and block size 128; in other words, each gang computes one nchloop iteration. However, if I limit nchloop to a single iteration, say
do nchloop=20403, 20403
the kernel runs in just 380.75 msec with grid size 1 and block size 128. I tried other values of nchloop, but all of them complete in about the same time, roughly 400 ms. I don't understand why I get 9.73 sec when I run all 20706 iterations but only about 380 msec when I run any single iteration separately.
Under the “Launch Configuration” section in the profile, how many “waves per SM” are listed?
From the “Occupancy” section, what are the theoretical and achieved occupancy, and how many registers are used?
What GPU are you using?
Each SM can run up to 2048 threads at 100% occupancy, so with 128 threads per block that's 16 concurrent blocks per SM. An A100 has 108 SMs for a max of 1728 concurrent blocks; a V100 has 80 SMs, so a max of 1280.
Most likely the extra time is because all the blocks can't be running at the same time, so multiple “waves” need to be issued.
You can restrict the number of gangs (aka CUDA blocks) by setting `num_gangs` to the number of blocks that fit into a single wave, but each gang would then execute multiple iterations of the loop, so you'd be unlikely to see any improvement.
Personally, I very rarely try to optimize this and instead look at how the data is being accessed (vector loops should access arrays along the stride-1 dimension), register usage (which affects occupancy), and maximizing the speed-of-light (SOL) metrics.