Finding cause of "No Instruction" stall and optimizing for it

FriedrichS · December 7, 2022, 9:32pm

Hi,
I am working on a simple MLP implementation in HLSL for Vulkan. I am optimizing for maximum performance and use Nsight Graphics for shader profiling. An improved version of the shader moves all parameters of the MLP into shared memory for higher throughput. While the previous version was bottlenecked by memory throughput and latency (most likely latency), this one is stalling on instruction fetches (the “No Instruction” stall reason in Nsight).

The picture above shows the simple (naive) matrix multiplication code. In this test, I disabled loop unrolling to ensure that the loops would not emit too many instructions. Even with no unrolling, the code is still severely bottlenecked.

I could only find sparse literature on the topic, and I am trying to understand the possible causes of the instruction cache stall. Any blogs or documents describing the nature of this problem in detail are highly welcome. I am certain that it would help to see lower-level statistics of the shader, such as the number of SASS instructions or the specific instructions emitted from the SPIR-V. Just looking at the shader code does not provide useful insight into this problem.
One cause I could find is that there might not be enough warps to cover the fetch latency. However, I am dispatching ~65,000 threads in groups of 32, and I would think that this gives enough freedom to the warp scheduler.

Can I show the raw shader instructions in Nsight Graphics, as is possible in Nsight Compute for CUDA?
And what possible ways are there to debug the instruction fetch bottleneck?

dwoods · December 15, 2022, 12:17am

Hello,
Thank you for using Nsight Graphics and for your question on Shader Profiling. I will connect with the engineering team and get back to you with a response.
Regards,

abaliga · December 15, 2022, 3:18am

Hi Friedrich,

First, I would recommend confirming whether the SMs (shader processors) are near 100% pipeline utilization. If so, the shader is throughput-bound, not latency-bound, and it is safe to ignore the “no instruction” stall reason. In other words, there may be enough issuing (“selected”) warps to hide the latency of the “no instruction” warps.

One easy way to confirm this is via GPU Trace. In the below example, you can see it’s issue limited (yellow at 100%), with a mix of ALU (int32) and FMA (fp32).

When throughput bound, usually the goal is to offload the limiting pipeline by adjusting the instruction mix. Or, you may pat yourself on the back for achieving peak throughput :-)

Other possibilities, since I$ contention is a “global” problem:

Another portion of the shader may consist of a large # of instructions.
Multiple shaders may be executing concurrently.

> I am dispatching ~65,000 threads in groups of 32

If the shader is latency-limited (not throughput-limited), the following advice will be more relevant.
A group size of 32 threads limits occupancy to 16 warps per SM on most GPUs, because of the HW limit of 16 resident groups per SM.
To overcome this, it’s generally advisable to launch at least 64 threads per group, yielding 32 warps per SM. On Turing that can reach 100% warp occupancy; on Ampere it’s 66% while yielding Turing-equivalent perf.
The rationale for higher occupancy is to increase the chance of issuing an instruction each cycle. If the shader is already throughput-bound, increasing occupancy will not be useful, and may even be counter-productive.
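
The occupancy arithmetic above can be checked with a small sketch. This is not the Shader Profiler's occupancy calculator; it models only the group-slot limit and the warp limit, and ignores register and shared-memory limits, which can lower occupancy further. The function names are hypothetical, and the per-SM warp limits (32 for Turing, 48 for Ampere) are assumptions inferred from the 100% and 66% figures quoted above.

```python
import math

WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def warps_per_sm(threads_per_group, group_slots_per_sm=16, max_warps_per_sm=32):
    """Resident warps per SM, limited by group slots and the warp limit only.

    Assumption: registers and shared memory are not the limiter.
    """
    warps_per_group = math.ceil(threads_per_group / WARP_SIZE)
    # Groups that fit: bounded by the slot limit and by the warp limit.
    resident_groups = min(group_slots_per_sm, max_warps_per_sm // warps_per_group)
    return resident_groups * warps_per_group


def occupancy(threads_per_group, group_slots_per_sm=16, max_warps_per_sm=32):
    """Fraction of the SM's warp slots that are occupied."""
    resident = warps_per_sm(threads_per_group, group_slots_per_sm, max_warps_per_sm)
    return resident / max_warps_per_sm


if __name__ == "__main__":
    # Assumed warp limits per SM: 32 (Turing), 48 (Ampere).
    for name, limit in (("Turing", 32), ("Ampere", 48)):
        for group_size in (32, 64):
            w = warps_per_sm(group_size, max_warps_per_sm=limit)
            occ = occupancy(group_size, max_warps_per_sm=limit)
            print(f"{name}: {group_size} threads/group -> {w} warps/SM ({occ:.0%})")
```

Under these assumptions, a group size of 32 leaves the SM at 16 warps regardless of the warp limit, because the 16 group slots run out first; doubling the group size doubles the resident warps.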

The Shader Profiler has a built-in occupancy calculator in the Summary tab. If you hover over the # Warp value, I suspect you’ll see something like this, but with a HW CTA Slot Limiter = 16. (HW CTA Slot == group)

> the specific instructions emitted from the SPIR-V

This is available in the Source view.

> Can I show the raw shader instructions in Nsight Graphics, as is possible in Nsight Compute for CUDA?

This is only available in the Pro version. Please email NsightGraphics@nvidia.com with your name + the name of your business entity, and we’ll get the conversation started.

Additional Material
The following are not exactly specific to your problem, but hopefully you find them interesting.

The Peak-Performance-Percentage Analysis Method for Optimizing Any GPU Workload | NVIDIA Technical Blog
Kernel Profiling Guide :: Nsight Compute Documentation
This GTC presentation discusses both the compute pipe and ray tracing: Getting Started with Ray Tracing Graphics Tools | NVIDIA On-Demand

Best,
Avinash

FriedrichS · December 15, 2022, 12:40pm

Hi Avinash,

Thank you very much for taking the time to answer my question. I have marked your response as the solution, because increasing the group size as you suggested already doubled the performance. I highly appreciate the additional material you added.

