CUDA kernel "Stall No Instruction" is very high in Nsight Compute

I am trying to tune a conv kernel that uses Tensor Cores on Jetson Xavier, but Nsight Compute shows the kernel's "Stall No Instruction" metric is very high. When I remove the store-global (STG) operation, "Stall No Instruction" drops to a low value.
The kernel is 4928 instructions long, which is about 77 KB (Volta encodes each instruction in 16 bytes, so 4928 × 16 B ≈ 77 KB).


The blue trace is the original conv, and the purple one is the conv with the store-global operator removed. So I am confused: why does the STG operator affect "Stall No Instruction"?
conv_origin.nsight-cuprof-report (2.5 MB)
conv_merge.nsight-cuprof-report (2.7 MB)

The primary issue is that there is only 1 warp per SM sub-partition (SMSP). In this configuration the warp scheduler has no ability to cover latency by switching warps. The warp occupancy limit comes from the 20,480 bytes of shared memory per thread block combined with only 1 warp per thread block. If you could fit 2 warps per SM sub-partition, the performance would likely double given the current math throughput and memory throughput.
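A quick way to confirm this limit is to ask the runtime how many blocks fit per SM given the shared-memory request. This is a minimal sketch, not code from the report; `conv_kernel` is a hypothetical stand-in for the real kernel, and I am assuming the 20,480 B is dynamic shared memory:

```cpp
// Sketch: query occupancy at 1 warp/block + 20,480 B shared memory per block.
// conv_kernel is a hypothetical stand-in for the poster's actual kernel.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void conv_kernel(float* out) { /* hypothetical kernel body */ }

int main() {
    const int    threadsPerBlock = 32;     // 1 warp per block, as in the report
    const size_t smemPerBlock    = 20480;  // 20,480 B shared memory per block

    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, conv_kernel, threadsPerBlock, smemPerBlock);

    // Volta SMs have 4 sub-partitions, so warps/SMSP = blocks * warps/block / 4.
    printf("blocks/SM: %d -> warps/SMSP: %.2f\n",
           blocksPerSM, blocksPerSM * (threadsPerBlock / 32) / 4.0f);
    return 0;
}
```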

Each kernel has >4800 instructions, which far exceeds the capacity of the instruction cache and the second-level constant cache. Since the 4 resident warps are in different blocks, they execute at very different locations in the shader, putting very heavy stress on the i-cache. For most kernels the target occupancy results in multiple warps accessing each instruction cache line; in these kernels I suspect each cache line is only accessed 1-2x due to thrashing. I think the stalls on the STG instructions are actually helping gang the warps together, keeping them at nearby program locations so they share i-cache lines.

If I owned this, my first step would be to get to 2 warps per SMSP to allow for some latency hiding; that would also fully use the register file.
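One way to get there, sketched under my own assumptions (the kernel name, tile split, and sizes below are illustrative, not the poster's code): put 2 warps in each block so the 20,480 B tile is amortized across both, and declare the launch shape with `__launch_bounds__` so register allocation does not silently cap occupancy further:

```cpp
// Sketch only: 2 warps share one block (and its single 20,480 B tile).
__global__ void __launch_bounds__(64)         // 64 threads = 2 warps per block
conv_kernel_2warp(const float* in, float* out) {
    extern __shared__ float tile[];           // still one 20,480 B allocation
    const int warp_id = threadIdx.x / 32;     // 0 or 1
    const int lane    = threadIdx.x % 32;

    // Each warp cooperatively stages its half of the tile (2560 floats).
    for (int i = lane; i < 2560; i += 32)
        tile[warp_id * 2560 + i] = in[blockIdx.x * 5120 + warp_id * 2560 + i];
    __syncthreads();

    // ... Tensor Core math on the tile would go here ...
    out[blockIdx.x * blockDim.x + threadIdx.x] = tile[threadIdx.x];
}
```

Launched as `conv_kernel_2warp<<<grid, 64, 20480>>>(in, out)`, 4 resident blocks (4 × 20,480 B ≈ 80 KB of the ~96 KB of shared memory per Volta SM) would already give 2 warps per SMSP.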

If the kernel has an unrolled loop, I would try to reduce the unroll factor to cut demand on the instruction cache. You want to get the kernel under 4K instructions, but ideally I would shoot for fewer than 1K.
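A bounded `#pragma unroll N` is the usual lever for this. A minimal sketch with illustrative numbers (this is not the poster's inner loop):

```cpp
// Illustrative only: cap the unroll factor so the compiled kernel stays
// well under the i-cache capacity instead of fully unrolling the loop.
__global__ void conv_inner(const float* a, const float* b, float* out, int K) {
    float acc = 0.0f;
    #pragma unroll 4                      // bounded unroll (was: full unroll)
    for (int k = 0; k < K; ++k) {
        acc += a[k * blockDim.x + threadIdx.x] * b[k];  // stand-in inner product
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = acc;
}
```

The trade-off is a little loop overhead per iteration in exchange for a code footprint that all 4 warps can keep resident in the i-cache at once.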