Finding cause of "No Instruction" stall and optimizing for it

Hi,
I am working on a simple MLP implementation in HLSL for Vulkan. I am optimizing for maximum performance and use Nsight Graphics for shader profiling. An improved version of the shader moves all parameters of the MLP into shared memory for higher throughput. While the previous version was bottlenecked by memory throughput and latency (most likely latency), this one is stalling on instruction fetches (the “No Instruction” stall in Nsight).

The picture above shows the simple (naive) matrix multiplication code. In this test, I disabled loop unrolling to ensure that the loops would not emit too many instructions. Even with no unrolling, the code is still severely bottlenecked.

I could only find sparse literature on the topic, and I am trying to understand the possible causes of the instruction cache stall. Any blogs or documents describing the nature of this problem in detail are highly welcome. I am certain that it would help to see lower-level statistics of the shader, such as the number of SASS instructions or the specific instructions emitted from the SPIR-V. Just looking at the shader code does not provide useful insight into this problem.
One cause I could find is that there might not be enough warps to cover the fetch latency. However, I am dispatching ~65,000 threads in groups of 32, and I would think that this gives the warp scheduler enough freedom.

Can I show the raw shader instructions in Nsight Graphics like it is possible in Nsight compute for CUDA?
And what possible ways are there to debug the instruction fetch bottleneck?

Hello,
Thank you for using Nsight Graphics and for your question on Shader Profiling. I will connect with the engineering team and get back to you with a response.
Regards,

Hi Friedrich,

I would recommend: First, try to confirm whether the SMs (shader processors) are near 100% pipeline utilization. If so, then the shader is throughput-bound, not latency-bound, and it would be safe to ignore the “no instruction” stall reason. Or in other words, there may be a sufficient number of issuing (“selected”) warps to hide the latency of the “no instruction” warps.

One easy way to confirm this is via GPU Trace. In the below example, you can see it’s issue limited (yellow at 100%), with a mix of ALU (int32) and FMA (fp32).

When throughput bound, usually the goal is to offload the limiting pipeline by adjusting the instruction mix. Or, you may pat yourself on the back for achieving peak throughput :-)

Other possibilities, since instruction cache (I$) contention is a “global” problem:

  • Another portion of the shader may consist of a large # of instructions.
  • Multiple shaders may be executing concurrently.

“I am dispatching ~65,000 threads in groups of 32”

If the shader is latency-limited (not throughput limited) the following advice will be more relevant.
A group size of 32-threads limits occupancy to 16 warps-per-SM on most GPUs; that’s due to the HW limit of 16 groups-per-SM at a time.
To overcome this, it’s generally advisable to launch at least 64 threads per group, yielding 32 warps-per-SM (on Turing that can reach 100% warp occupancy; on Ampere it’s 66% while yielding Turing-equivalent perf).
The rationale for higher occupancy is to increase the chance of issuing an instruction each cycle. If the shader is already throughput-bound, increasing occupancy will not be useful, and may even be counter-productive.
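To make the arithmetic above concrete, here is a minimal sketch in Python. The limit of 16 groups-per-SM comes from the explanation above. The per-SM warp limits (32 on Turing, 48 on Ampere) are assumptions taken from NVIDIA's published compute capability tables, and the function names are purely illustrative. The sketch ignores register and shared-memory limits, which can lower occupancy further (relevant here because the MLP parameters live in shared memory), so the occupancy calculator in the Shader Profiler remains the authoritative source.

```python
import math

WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def warps_per_sm(group_size, max_groups_per_sm, max_warps_per_sm):
    """Resident warps per SM, considering only group-slot and warp-slot limits."""
    warps_per_group = math.ceil(group_size / WARP_SIZE)
    # Only whole groups can be resident: bounded by the group slots and by
    # how many groups fit into the available warp slots.
    groups = min(max_groups_per_sm, max_warps_per_sm // warps_per_group)
    return groups * warps_per_group


def occupancy(group_size, max_groups_per_sm, max_warps_per_sm):
    """Fraction of the SM's warp slots that can be occupied."""
    resident = warps_per_sm(group_size, max_groups_per_sm, max_warps_per_sm)
    return resident / max_warps_per_sm


# Assumed hardware limits: (groups per SM, warps per SM).
GPUS = {"Turing": (16, 32), "Ampere": (16, 48)}

for name, (max_groups, max_warps) in GPUS.items():
    for group_size in (32, 64):
        occ = occupancy(group_size, max_groups, max_warps)
        print(f"{name}: {group_size} threads/group -> {occ:.0%} warp occupancy")
```

Under these assumptions, 32 threads per group is capped at 16 resident warps on both architectures, and moving to 64 threads per group doubles that to 32, which matches the 100% (Turing) and 66% (Ampere) figures above.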

The Shader Profiler has a built-in occupancy calculator in the Summary tab. If you hover over the # Warp value, I suspect you’ll see something like this, but with a HW CTA Slot Limiter = 16. (HW CTA Slot == group)
image

“the specific instructions emitted from the SPIR-V”

This is available in the Source view
image

“Can I show the raw shader instructions in Nsight Graphics like it is possible in Nsight compute for CUDA?”

This is only available in the Pro version. Please email NsightGraphics@nvidia.com with your name + the name of your business entity, and we’ll get the conversation started.

Additional Material
The following are not exactly specific to your problem, but hopefully you find them interesting.

Best,
Avinash

Hi Avinash,

Thank you very much for taking the time to answer my question. I have marked your response as the solution, because increasing the group size as you suggested already doubled the performance. I highly appreciate the additional material you added.
