How to know if my kernel is pipeline parallel using Nsight Compute

A warp reports “Stall Wait” when it is stalled between two instructions with a data dependency that has a statically defined latency (a fixed number of cycles).

EX 1. No Dependency
FADD R0, R1, R2
FMUL R3, R4, R5

There is no data dependency between the FADD and FMUL instructions. Assuming the two instructions are in the same cache line and there are no other data dependencies on R4 or R5, the warp may either issue the two instructions back to back or report one of the following stall reasons: math throttle, not selected, or dispatch stall.

EX 2. Register Dependency
FADD R0, R1, R2
FMUL R3, R0, R5

There is a data dependency between the FADD output register R0 and the FMUL input register R0. The warp will report Stall Wait for a fixed number of cycles based upon the depth of the FMA pipeline. After the fixed number of cycles the warp is eligible to issue, but may report one of the following stall reasons: math throttle, not selected, or dispatch stall.

EX 3. Variable Latency Dependency
LDG.E R0, [R2]
FMUL R4, R0, R5

There is a data dependency between the LDG return register R0 and the FMUL input register R0. The warp may report Stall Wait for 1-2 cycles before reporting Stall Long Scoreboard. When the LDG writes back R0 to the register file, the warp is eligible to issue the FMUL but may report one of the following stall reasons: math throttle, not selected, or dispatch stall.
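As an illustration (not from the original post), here is hypothetical CUDA source that typically compiles to a SASS sequence like EX 3: a global load followed by a multiply that consumes the loaded value. The kernel name and parameters are made up for the sketch.

```cuda
// Hypothetical sketch: a global load (LDG) whose result feeds an FMUL.
// Until the load writes back, the dependent multiply cannot issue and the
// warp reports Stall Long Scoreboard (after a brief Stall Wait).
__global__ void load_then_mul(float *out, const float *in, float s)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = in[i];   // compiles to something like: LDG.E R0, [R2]
    out[i] = v * s;    // FMUL R4, R0, R5 - depends on the load return
}
```

Inspecting this kernel with `cuobjdump -sass` (or the Nsight Compute Source page) shows the load/multiply dependency directly in the SASS.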

  1. Is it because the kernel is not pipeline parallel?

No. Wait is reported when there is a fixed latency dependency between two instructions. See EX 2 above. Depending on the pipeline, another instruction may still be able to issue to it during the wait. For example, if the issue rate of the target pipeline is 1/2 instruction/cycle and the dependent latency is 6 cycles, then it is possible that 2 other instructions could be issued to the same pipeline. If these instructions are from the same warp then this is instruction level parallelism. If the instructions are from different warps then this is the benefit of using multiple warps to hide latency.
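The instruction-level-parallelism case can be sketched in CUDA source. This is a hypothetical example (kernel name and unroll factor chosen for illustration): several independent accumulator chains give the scheduler other instructions from the same warp to issue while each chain waits out its own fixed FMA-pipeline latency.

```cuda
// Hypothetical sketch: four independent accumulators. The four FADDs in each
// iteration have no dependency on each other, so while one chain sits in
// Stall Wait the scheduler can issue the others - instruction level
// parallelism hiding the fixed pipeline latency.
__global__ void ilp4_sum(float *out, const float *in, int n)
{
    float a0 = 0.f, a1 = 0.f, a2 = 0.f, a3 = 0.f;
    for (int i = 0; i < n; i += 4) {
        a0 += in[i + 0];
        a1 += in[i + 1];
        a2 += in[i + 2];
        a3 += in[i + 3];
    }
    out[0] = a0 + a1 + a2 + a3; // combine the independent chains at the end
}
```

Compared with a single-accumulator loop, this version exposes more independent work per warp, so fewer warps are needed to keep the FMA pipeline busy.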

  2. Can I measure the wait time between two stages?

If by stage you mean the latency between two instructions, then the answer is no; there is no tool to measure the cycles between two instructions. This can be determined by writing microbenchmarks. Papers characterizing instruction latencies have been published for numerous NVIDIA GPU architectures.
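A minimal microbenchmark sketch (hypothetical, not from the original post) uses `clock64()` around a chain of dependent instructions and divides by the chain length. The kernel name, iteration count, and multiplier constant are all illustrative; results should be sanity-checked against the generated SASS.

```cuda
// Hypothetical microbenchmark sketch: approximate the dependent-issue latency
// of FMUL by timing a chain of multiplies where each result feeds the next.
// Check the SASS (cuobjdump -sass) to confirm the chain was not folded away.
__global__ void fmul_latency(float *out, long long *cycles, float x)
{
    float v = x;                      // runtime input prevents constant folding
    long long start = clock64();
    #pragma unroll
    for (int i = 0; i < 256; ++i)
        v = v * 1.000001f;            // dependent chain: each FMUL waits on the last
    long long stop = clock64();
    *out = v;                         // keep the result live so it is not eliminated
    *cycles = (stop - start) / 256;   // approximate cycles per dependent FMUL
}
```

Launching this with a single thread gives a rough per-instruction latency; overheads from the `clock64()` reads themselves can be estimated by timing an empty region and subtracting.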

  3. Can I find out which step causes it? For example: the L2 needs to wait for data from DRAM before transferring it to the L1?

The best method to determine where stalls are occurring is to use the Nsight Compute Source View and look at the SASS view. The GTC 2023 presentation Become Faster in Writing Performant CUDA Kernels using the Source Page in Nsight Compute provides a good introduction to the view.

The Source View can help you track down what operations have the most stalls and the stall reasons. There is not a good method to determine if a memory operation hit or missed in the L1 or L2 cache; however, you can determine the magnitude of the stall.
