"no instruction" stalls every 256 bytes of the binary code

In my ongoing effort to optimize my integer (modular) multiplication kernel, I am benchmarking with Nsight Compute. As my kernel uses around 80 registers, occupancy is pretty low. While lowering register usage would probably be the way to go, I have not been able to do so without major performance impact in this scenario.

However, if I understand the benchmark results correctly, the main reason for the subpar performance is a large number of stalls with the “No instruction” reason. I attached Nsight’s “Details” page as an image below. Nsight suggests:

[Warning] On average each warp of this kernel spends 4.0 cycles being stalled due to not having the next instruction fetched yet. This represents about 59.4% of the total average of 6.7 cycles between issuing two instructions. A high number of warps not having an instruction fetched is typical for very short kernels with less than one full wave of work in the grid. Excessively jumping across large blocks of assembly code can also lead to more warps stalled for this reason.

Well, this does not really apply in this case: the kernel is quite large (in code size) and not short by any measure (more than 500M cycles). Also, there are no branches or jumps in the relevant code; it is completely unrolled, straight-line code.

I thought tracking down the positions of the instruction stalls would be helpful, so I looked at the generated SASS and found that a “no_inst” stall happens every 0x100 = 256 bytes.

Are instructions fetched 256 bytes at a time, and fetched again when the queue runs empty?
If so, looking at the SASS code, 256 bytes translates into around 25 instructions, which seems like very few; I would imagine most kernels have more than that in their “hot” portion…

The above quote suggests that there might not be enough “waves”. If I increase the batch workload for the kernel, I can increase the number of “Waves per SM” (e.g. to 10); however, the “no_inst” stalls decrease only slightly. Playing around with more or fewer threads per block does not change anything significantly either.

Am I missing something else entirely here?

I would be grateful for any pointers or directions for any sort of optimization. Thanks in advance!

I also found the following notice in the CUDA Profiler User’s Guide for NVVP (https://docs.nvidia.com/cuda/profiler-users-guide/index.html#warp-state-nvvp):
[i]
Stalled for instruction fetch - The next instruction was not yet available.
To reduce instruction fetch stalls:

  • If large loops have been unrolled in the kernel, try reducing them.
  • If the kernel contains many calls to small functions, try inlining more of them with the __inline__ or __forceinline__ qualifiers. Conversely, if inlining many functions or large functions, try __noinline__ to disable inlining of those functions.
  • For very short kernels, consider fusing them into a single kernel.
  • If blocks with fewer threads are used, consider using fewer blocks of more threads. Occasional calls to __syncthreads() will then keep the warps in sync which may improve instruction cache hit rate.

[/i]

Well, the first bullet point might address my issue, but if I force my outermost loop to not be unrolled (#pragma unroll 1), performance drops to around 10% of the unrolled version (which the compiler prefers without any pragmas anyway).

Adding a few __syncthreads() calls after (short) divergent code pieces elsewhere and in front of the long sequence of unrolled code sounded promising, as the threads’ execution would be resynchronized, but in fact it did not change performance at all.

Thanks again for any pointers!

You could try other unroll factors.

Empirically, when no unroll pragma is given, the compiler seems to pick a factor of 4 for small loop bodies. You could try a few other values around that, e.g. #pragma unroll 8.
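For illustration (a hypothetical small loop, not your actual code), the factor goes right after the pragma; in CUDA, a bare #pragma unroll requests full unrolling and #pragma unroll 1 disables it:

```c
#define NLIMBS 6

/* Hypothetical limb loop; in CUDA device code the pragma applies to the
   immediately following loop. Host compilers that don't know the pragma
   simply ignore it (with a warning), so this also builds as plain C. */
unsigned int sum_limbs(const unsigned int limbs[NLIMBS])
{
    unsigned int s = 0;
    #pragma unroll 4   /* try factors around the compiler's default */
    for (int i = 0; i < NLIMBS; ++i)
        s += limbs[i];
    return s;
}
```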

How large is the SASS code of your kernel? Are you running other kernels at the same time?

As tera said: inspect the SASS code with cuobjdump and observe the impact of reducing (not necessarily eliminating!) the amount of loop unrolling and function inlining. Please be aware that such code changes can reduce the efficiency of your code and lead to a bottom-line performance drop.

How big is the performance impact of the instruction fetch stalls, precisely? I have never observed a case where it was more than 3%, i.e. barely above my heuristic measurement-noise level (2%). That doesn’t mean cases with a larger impact couldn’t exist.

The underlying root cause is a combination of large GPU instructions (plus control block overhead), small GPU instruction caches, and total lack of branch prediction logic. This is a trade-off: it helps to keep the GPU hardware simple and small, which in turn allows the number of functional units and computational throughput to be maximized.

Loops don’t have to get very large to encounter this issue; a few hundred instructions is typically all it takes.

[Later:] I notice belatedly that OP states that the kernel in question is straight-line code, in which case the stalls are presumably just a function of the code being streamed through the instruction cache, and the occasional hiccups are caused by artifacts of the instruction cache replacement policy.

To be perfectly honest: I wouldn’t worry about this if it were my code.

Only after Norbert’s addition about straight-line code did I notice the huge addresses in the SASS code.
Under these circumstances I think Nvidia’s GPU architecture is actually doing amazingly well, given that each cache line fetched into the instruction cache can only be reused by 24 other warps, without any opportunity to amortize the cost amongst multiple loop iterations.

This seems to fit well with my own observations that instruction cache misses hardly ever seemed to be of concern. It certainly looks like Nvidia have implemented some kind of prefetching, as otherwise these good numbers could not be explained.*


  • Indeed it shouldn’t be too difficult to predict the best prefetching candidates for straight line code…

That GPUs use instruction prefetch is pretty much a given; even the lowly 8086 did that in 1978 (six-byte prefetch queue, IIRC). But as far as I am aware, there are no smarts to the GPU prefetching, in particular since there is no branch prediction. So instruction prefetching presumably happens strictly sequentially, in ascending address order.

I therefore assume that the stalls observed are indicative of instruction cache or TLB activity. I don’t think anything has been published on GPU instruction caches, so one could speculate about cache line length or use of sectoring and how that might play into the OP’s observations. My first thought was that the regular hiccups (stalls) have to do with crossing page boundaries, but since I would assume page size to be 4K, that doesn’t seem to fit the reported pattern.

As for the kernel functionality consisting of large integer multiplies, my observation for Kepler was that such code would consist almost entirely of IMAD operations and would consequently saturate the 32-bit IMAD units almost perfectly (think 98% efficiency vs. theoretical), and I would expect the same for the most recent GPU architectures, which have brought back (both to my surprise and my delight) 32-bit integer multipliers.

The tricky architectures are those where one is forced to use XMAD, which is what OP seems to be using based on their earlier posts. There are published results for that, and I also posted some code for medium-width multiplies in these forums some while ago that I seem to recall had similar performance. I think it will be difficult to multiply much more efficiently than those solutions (because non-XMAD “helper instructions” are needed), but if OP had found a way, I would love to see the results :-)

Thanks very much, you two! I’ll try to answer your questions first:

The SASS code is pretty massive: the last instruction’s address reported by cuobjdump is 0x1deb8 in the unrolled version, so I think the kernel is a bit larger than 122 KB.

I normally use streams in this application to hide CPU-side computation latency, but usually only one kernel runs at a time (per nvvp’s timeline). If I switch to a single stream, the “no_inst” latency stays the same, too.

Well, all those loops iterate over the limbs of a multi-precision integer; in this example case a 192-bit number with 6 32-bit limbs, where 6 is a compile-time constant. As (for carry-propagation reasons) the first limb is treated separately, there are only 5 iterations. I guess the compiler figures unrolling those completely is more efficient. I tried forcing smaller unroll factors, but as performance dropped to around 25%, I did not even check the profiler for the instruction stalls.
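For readers following along, here is a plain-C stand-in for the kind of limb loop described (a hypothetical sketch; the actual kernel uses inline PTX with carry-chain instructions): the first limb is peeled off so there is no incoming carry, leaving NLIMBS - 1 = 5 iterations for the compiler to unroll.

```c
#include <stdint.h>

#define NLIMBS 6

/* r = a * b, where a has NLIMBS 32-bit limbs (little-endian) and b is a
   single 32-bit word; r receives NLIMBS + 1 limbs. Plain-C stand-in for
   the device code discussed above, not the actual kernel. */
void mul_n_by_1(uint32_t r[NLIMBS + 1], const uint32_t a[NLIMBS], uint32_t b)
{
    /* first limb peeled off: no incoming carry yet */
    uint64_t p = (uint64_t)a[0] * b;
    r[0] = (uint32_t)p;
    uint32_t carry = (uint32_t)(p >> 32);

    /* remaining 5 iterations: the loop the compiler fully unrolls */
    for (int i = 1; i < NLIMBS; ++i) {
        p = (uint64_t)a[i] * b + carry;
        r[i] = (uint32_t)p;
        carry = (uint32_t)(p >> 32);
    }
    r[NLIMBS] = carry;
}
```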

The same goes for reducing function inlining: using __noinline__ on the only (and again quite large) multiplication function results in a massive performance drop. The “functions” you see in the code I posted above are just macros for inline PTX assembly.

Well, this is a very good question! ;)

From the profiling results I posted in the OP, I thought this would be the major restriction. The kernel seems to run smoothly (no long data stalls or data-dependency issues), except for the instruction stalls. Honestly, I have no idea how to measure the actual impact of these stalls.

Well, after some more digging, I found this (probably outdated) paper investigating the Tesla architecture, “Demystifying GPU Microarchitecture through Microbenchmarking” (http://www.stuffedcow.net/files/gpuarch-ispass2010.pdf), which reports a 256-byte line size for the L1 instruction cache.

If the instruction cache line size has not changed since then, this fits the pattern and would confirm njuffa’s idea that this is an artifact of cache replacement.

Yes, you are completely right; for the moment I am stuck with the Pascal architecture, which does not feature 32-bit multipliers. Replacing the one-year-old Tesla P100 with a V100 is quite… ehm… financially challenging. ;)

I did implement an alternative version based on your (njuffa’s) post about multi-precision multiplication on Maxwell/Pascal (and this paper: https://ieeexplore.ieee.org/abstract/document/7563271) that uses the XMAD-and-shift approach. However, as I posted here (https://devtalk.nvidia.com/default/topic/1047245/cuda-programming-and-performance/generating-xmad-x-cc-by-ptx/), the newer versions (> CUDA 8) of ptxas apparently do not generate the optimal XMAD sequence for the Pascal architecture.

Using CUDA 10 for both variants (32-bit MAD vs. 16-bit XMAD), the 32-bit version (with ptxas’s “emulation”) is about twice as fast.

Also, compiling my 16-bit XMAD-based test case (generating long chains of XMADs plus shifts) with CUDA 8 yields about the same performance as the CUDA 10 “emulated 32-bit” version. I had hoped to gain some performance, but maybe the newer compiler is smart enough to generate something similarly fast.
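For readers unfamiliar with the XMAD approach discussed here: functionally, a 32-bit low multiply decomposes into 16-bit half products, which is roughly what ptxas expands 32-bit multiplies into on Maxwell/Pascal. The plain-C model below mirrors only the arithmetic, not the actual SASS instruction sequence:

```c
#include <stdint.h>

/* Functional model of the 16x16-bit building block: a*b (mod 2^32)
   assembled from three half-word products, a rough stand-in for the
   XMAD sequence ptxas emits on Maxwell/Pascal. The ah*bh term drops
   out because it only contributes above bit 31. */
uint32_t mul_lo_via_halves(uint32_t a, uint32_t b)
{
    uint32_t al = a & 0xffffu, ah = a >> 16;
    uint32_t bl = b & 0xffffu, bh = b >> 16;
    uint32_t cross = al * bh + ah * bl;   /* two cross products        */
    return al * bl + (cross << 16);       /* merge, wrapping mod 2^32  */
}
```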

Bottom line: so far, on a Tesla P100, I am quite far from reproducing the multi-precision integer multiplication throughput found in the literature for older platforms with far fewer cores (think GTX 580). And I find it hard to believe that this is actually the hardware’s fault; I rather suspect that my code is not optimized correctly.

Well, there is the paper “Utilizing the Double-Precision Floating-Point Computing Power of GPUs for RSA Acceleration” (https://www.hindawi.com/journals/scn/2017/3508786/), whose authors claim to have achieved pretty good results. They use the 64-bit FP path for 52-bit integer multiplication. However, in my case that would mean a fairly large overhaul of the software, which I am trying to avoid for now. I suppose I will eventually give that approach a try, and I’ll be sure to report back here.

Thank you two for looking at this; in case you have any ideas for optimization approaches, I am grateful to hear them.

Using a high-radix representation and the floating-point unit for long-integer multiplications is an old trick in cryptographic computations. I seem to recall it was used for crypto libraries on SPARC-based Sun machines around the year 2000.

For example, use a ‘float’ to store each digit and ‘double’ multiplication to compute partial products, then accumulate columns of partial products in ‘double’. With FMA available, one should be able to use ‘double’ for the “digits” directly. For modular arithmetic, I sketched some floating-point based code on Stack Overflow years ago: https://stackoverflow.com/questions/12252826/modular-arithmetic-on-the-gpu
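A minimal host-side model of that digit scheme (a sketch, assuming 26-bit digits so that every partial product and column sum stays within the 53-bit exact-integer range of ‘double’; the __int128 type is a compiler extension used here only to check the 104-bit result):

```c
#include <stdint.h>

typedef unsigned __int128 u128;

/* Multiply two 52-bit integers exactly using double-precision arithmetic,
   with each operand split into two 26-bit "digits". Each partial product
   is <= 52 bits and each column sum <= 53 bits, so every double holds an
   exact integer. Host-side sketch of the high-radix trick, not GPU code. */
u128 mul52_via_double(uint64_t a, uint64_t b)
{
    double a0 = (double)(a & 0x3ffffffu), a1 = (double)(a >> 26);
    double b0 = (double)(b & 0x3ffffffu), b1 = (double)(b >> 26);

    double c0 = a0 * b0;            /* column 0: < 2^52, exact        */
    double c1 = a0 * b1 + a1 * b0;  /* column 1: < 2^53, still exact  */
    double c2 = a1 * b1;            /* column 2: < 2^52, exact        */

    /* recombine the columns into the full 104-bit product */
    return ((u128)(uint64_t)c2 << 52)
         + ((u128)(uint64_t)c1 << 26)
         +  (u128)(uint64_t)c0;
}
```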

An obvious practical issue is that relatively few people have access to a GPU with fast double-precision throughput, so for code that needs mass-market appeal, it’s not really an option.

As for the compiler changes you observed: since you use a Pascal-family GPU, you could always go back to using CUDA 8 if that offers better performance. I still use CUDA 8 with a Pascal-class GPU and it works as well as ever :-)

I can only speculate why code generation changed. I am aware that there have been recurring bugs in PTXAS’s handling of carry propagation, found by people who use medium-sized integer math in prime-number searches. So the code-generation changes you observe post CUDA 8 may be related to bug fixes for that. Handling of the carry bit can be performance-limiting in a variety of ways (both in hardware and in software: additional data dependencies, renaming limitations), so maybe the newer code generator employs carry-free alternatives?

The code changes could also be plain old performance regressions, which often occur in compilers (not just those for CUDA) with regard to specific applications. But if I recall correctly, you reported that the performance of your kernel(s) increased post CUDA 8.0, meaning there was no regression.