I am working on a simple MLP implementation in HLSL for Vulkan. I am optimizing for maximum performance and use Nsight Graphics for shader profiling. An improved version of the shader moves all MLP parameters into shared memory for higher throughput. The previous version was bottlenecked by memory throughput and latency (most likely latency), whereas this one stalls on instruction fetches (the “No Instruction” stall in Nsight).
The picture above shows the simple (naive) matrix multiplication code. In this test, I disabled loop unrolling to ensure that the loops would not emit too many instructions. Even with no unrolling, the code is still severely bottlenecked.
I could find very little literature on the topic, and I am trying to understand the possible causes of the instruction cache stall. Any blogs or documents describing this problem in detail are highly welcome. I am certain that it would help to see lower-level statistics of the shader, such as the number of SASS instructions or the specific instructions generated from the SPIR-V. Just looking at the shader code does not provide useful insight into this problem.
One cause I could find is that there might not be enough warps to cover the fetch latency. However, I am dispatching ~65,000 threads in groups of 32 (one warp per group), and I would expect this to give the warp scheduler enough freedom.
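As a rough sanity check on that reasoning, the sketch below counts the warps in such a dispatch and estimates how many of them could be resident on one SM at a time. The per-SM limits and the shared-memory footprint per group are placeholder assumptions chosen for illustration, not values measured from this shader or GPU, and the helper names are hypothetical. Register pressure and the per-SM workgroup limit are ignored.

```python
import math


def warps_in_dispatch(total_threads: int, group_size: int, warp_size: int = 32) -> int:
    """Total warps launched; each workgroup is rounded up to whole warps."""
    groups = math.ceil(total_threads / group_size)
    warps_per_group = math.ceil(group_size / warp_size)
    return groups * warps_per_group


def resident_warps_per_sm(
    group_size: int,
    shared_bytes_per_group: int,
    max_warps_per_sm: int,
    shared_bytes_per_sm: int,
    warp_size: int = 32,
) -> int:
    """Upper bound on warps resident on one SM, limited by the warp slots
    and by the shared memory each workgroup reserves."""
    warps_per_group = math.ceil(group_size / warp_size)
    groups_by_warps = max_warps_per_sm // warps_per_group
    if shared_bytes_per_group > 0:
        groups_by_shared = shared_bytes_per_sm // shared_bytes_per_group
    else:
        groups_by_shared = groups_by_warps  # shared memory is not a limit
    return min(groups_by_warps, groups_by_shared) * warps_per_group


# ~65,000 threads in groups of 32, as in the post.
total_warps = warps_in_dispatch(65_000, 32)

# ASSUMED numbers: 16 KiB of shared memory per group, 48 warp slots
# and 64 KiB of shared memory per SM.
resident = resident_warps_per_sm(32, 16 * 1024, 48, 64 * 1024)

print(total_warps, resident)
```

Under these assumed numbers the dispatch contains a couple of thousand warps, but only a handful fit on an SM at once, so the total thread count alone does not show that the scheduler has enough warps to hide fetch latency.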
Can I view the raw shader instructions in Nsight Graphics, as is possible in Nsight Compute for CUDA?
And what possible ways are there to debug the instruction fetch bottleneck?