In my ongoing effort to optimize my integer (modular) multiplication kernel, I am profiling it with Nsight Compute. As my kernel uses around 80 registers per thread, occupancy is pretty low. Lowering register usage would probably be the way to go, but I have not been able to do so without a major performance impact in this scenario.
However, if I understand the benchmark results correctly, the main reason for the subpar performance is the large number of stalls with the “No instruction” reason. I attached Nsight Compute's “Details” page as an image below. Nsight suggests:
[Warning] On average each warp of this kernel spends 4.0 cycles being stalled due to not having the next instruction fetched yet. This represents about 59.4% of the total average of 6.7 cycles between issuing two instructions. A high number of warps not having an instruction fetched is typical for very short kernels with less than one full wave of work in the grid. Excessively jumping across large blocks of assembly code can also lead to more warps stalled for this reason.
Well, this does not really apply in this case: the kernel is quite large (in code size) and not short in any respect (more than 500M cycles). Also, there are no branches or jumps in the relevant code; it is completely unrolled, straight-line code.
I thought tracking down where the instruction stalls occur would be helpful, so I looked at the generated SASS and found that a “no_inst” stall happens every 0x100 = 256 bytes.
Are instructions fetched 256 bytes at a time, and fetched again only when the queue runs empty?
And if so, looking at the SASS code, this translates into around 25 instructions, which seems very few to me; I would imagine most kernels have more than that in their “hot” portion…
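As a rough sanity check on that figure, the number of instructions in a 256-byte window depends on the SASS encoding of the target architecture. The sketch below assumes the commonly described encodings, which are not confirmed for the GPU in question: 8-byte instructions with one 8-byte control word per three instructions (Maxwell/Pascal style), and 16-byte instructions (Volta and newer). The function name and its parameters are made up for illustration.

```python
def instructions_per_window(window_bytes, word_bytes, instrs_per_bundle, words_per_bundle):
    """Count real instructions in a code window of window_bytes.

    A bundle consists of words_per_bundle encoded words, of which
    instrs_per_bundle are real instructions; the rest are assumed to be
    scheduling control words.
    """
    bundle_bytes = word_bytes * words_per_bundle
    full_bundles = window_bytes // bundle_bytes
    return full_bundles * instrs_per_bundle


# Assumed Maxwell/Pascal-style encoding: 3 instructions + 1 control word, 8 bytes each.
maxwell_pascal = instructions_per_window(256, 8, 3, 4)

# Assumed Volta-and-newer encoding: every 16-byte word is one instruction.
volta_plus = instructions_per_window(256, 16, 1, 1)

print(maxwell_pascal, volta_plus)
```

Under these assumptions a 256-byte window holds 24 instructions with the older encoding and 16 with the newer one. The first value is close to the roughly 25 instructions seen between stalls, which would fit one stall per 256-byte window, but that is an inference and not something the profiler reports.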
The quote above suggests that there might not be enough “waves”. If I increase the batch workload for the kernel, I can increase the number of “Waves per SM” (e.g. to 10); however, the “no_inst” stalls decrease only slightly. Playing around with more or fewer threads per block also does not change anything significantly.
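For reference, the relation between register usage, resident blocks, and waves can be estimated with a back-of-the-envelope sketch. Only the 80 registers per thread come from the profile above. The 64K-register file per SM, the block size of 256 threads, the 80 SMs, and the grid of 2400 blocks are assumptions chosen for illustration, and the estimate ignores register allocation granularity, shared memory, and the hardware caps on warps and blocks per SM.

```python
def register_limited_blocks_per_sm(regs_per_thread, threads_per_block, regs_per_sm=65536):
    """Upper bound on resident blocks per SM from register pressure alone."""
    return regs_per_sm // (regs_per_thread * threads_per_block)


def waves_per_sm(grid_blocks, sm_count, blocks_per_sm):
    """Number of full rounds of resident blocks needed to process the grid."""
    return grid_blocks / (sm_count * blocks_per_sm)


# 80 registers per thread is from the profile; 256 threads per block is hypothetical.
blocks = register_limited_blocks_per_sm(regs_per_thread=80, threads_per_block=256)

# Hypothetical device with 80 SMs and a hypothetical grid of 2400 blocks.
waves = waves_per_sm(grid_blocks=2400, sm_count=80, blocks_per_sm=blocks)

print(blocks, waves)
```

With these numbers, registers allow 3 resident blocks per SM, and a grid of 2400 blocks corresponds to 10 waves. A precise figure for a real device would come from the profiler or the CUDA occupancy API, not from this estimate.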
Am I missing something totally different here?
I would be happy for any pointers or directions for any sort of optimization. Thanks in advance!