How to tell if my kernel is pipeline parallel with Nsight Compute

In my opinion, for a big kernel, the kernel time includes both compute and memory. When I use fp16 the kernel is memory bound, but both compute and memory throughput are below 50%, and the "GPU Speed Of Light Throughput" section tells me it may be a "Latency Issue".
The "Warp State Statistics" section shows the top state is "Stall Wait",
and "This stall type represents about 43.8% of the total average of 7.5 cycles between issuing two instructions." I want to know:

  1. Is it because the kernel is not pipeline parallel?
  2. Can I measure the wait time between two stages?
  3. Can I find out which step causes it? For example: does L2 need to wait for data from DRAM before transferring to L1?

A warp reports "Stall Wait" when it stalls between two instructions with a data dependency that has a statically defined number of cycles.

EX 1. No Dependency
FADD R0, R1, R2
FMUL R3, R4, R5

There is no data dependency between the FADD and the FMUL. Assuming the two instructions are in the same cache line and there are no other data dependencies on R4 or R5, then between the two instructions the warp may either issue back to back or report one of the following stall reasons: math throttle, not selected, or dispatch stall.

EX 2. Register Dependency
FADD R0, R1, R2
FMUL R3, R0, R5

There is a data dependency between the FADD output register R0 and the FMUL input register R0. The warp will report Stall Wait for a fixed number of cycles based upon the depth of the FMA pipeline. After the fixed number of cycles the warp is eligible to issue but may report one of the following stall reasons: math throttle, not selected, or dispatch stall.

EX 3. Variable Latency Dependency
LDG.E R0, [R2]
FMUL R4, R0, R5

There is a data dependency between the LDG return register R0 and the FMUL input register R0. The warp may report Stall Wait for 1-2 cycles before reporting Stall Long Scoreboard. When the LDG writes back R0 to the register file, the warp is eligible to issue the FMUL but may report one of the following stall reasons: math throttle, not selected, or dispatch stall.

  1. Is it because the kernel is not pipeline parallel?

No. Wait is reported when there is a fixed-latency dependency between two instructions. See EX 2 above. Depending on the pipeline, other instructions may still be able to issue to it during the wait. For example, if the issue rate of the target pipeline is 1/2 instruction/cycle and the dependent latency is 6 cycles, then it is possible that 2 other instructions could be issued to the same pipeline. If those instructions are from the same warp then this is instruction level parallelism. If the instructions are from different warps then this is the benefit of using multiple warps to hide latency.
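To see the instruction level parallelism case at the CUDA source level, here is a minimal sketch (the kernel and variable names are mine, purely illustrative, and the exact SASS depends on the compiler): the first kernel is one dependent FMA chain, so the warp must report Stall Wait between every pair of instructions, while the second interleaves four independent chains so the warp usually has an eligible instruction while each chain waits out the FMA latency.

__global__ void dependent_chain(float *out, float x)
{
    float a = x;
    for (int i = 0; i < 256; ++i)
        a = a * 1.0001f + 0.5f;        // each FMA depends on the previous one
    out[threadIdx.x] = a;
}

__global__ void independent_chains(float *out, float x)
{
    float a = x, b = x + 1.f, c = x + 2.f, d = x + 3.f;
    for (int i = 0; i < 64; ++i) {     // same total FMA count, 4-way ILP
        a = a * 1.0001f + 0.5f;
        b = b * 1.0001f + 0.5f;
        c = c * 1.0001f + 0.5f;
        d = d * 1.0001f + 0.5f;
    }
    out[threadIdx.x] = a + b + c + d;
}

Profiling both with Nsight Compute should show a noticeably lower Stall Wait fraction for the second kernel, even though the arithmetic work is the same.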

  2. Can I measure the wait time between two stages?

If by stage you mean the latency between two instructions, then the answer is no; there is no tool to measure the cycles between two instructions. This can be determined by writing microbenchmarks. Papers doing this have been published for numerous NVIDIA GPU architectures.
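As a starting point, here is a rough sketch of the usual microbenchmark approach (everything here is my own illustration, not a tool feature): time a long chain of dependent FMAs with the clock64() intrinsic and divide by the chain length. Loop overhead, clock read overhead, and compiler scheduling all add noise, so published microbenchmark papers use much longer chains and verify the generated SASS.

__global__ void fma_latency(float *out, long long *cycles, float x)
{
    const int N = 1024;
    float a = x;
    long long start = clock64();
    #pragma unroll 16
    for (int i = 0; i < N; ++i)
        a = a * 1.0001f + 0.5f;        // dependent chain: nothing hides the latency
    long long stop = clock64();
    if (threadIdx.x == 0) {
        *cycles = (stop - start) / N;  // approximate cycles per dependent FMA
        *out = a;                      // keeps the chain from being optimized away
    }
}

Launching this with a single warp (<<<1, 32>>>) gives an approximate per-instruction latency for the FMA pipeline.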

  3. Can I find out which step causes it? For example: does L2 need to wait for data from DRAM before transferring to L1?

The best method to determine where stalls are occurring is to use the Nsight Compute Source View and look at the SASS view. The GTC 2023 presentation "Become Faster in Writing Performant CUDA Kernels using the Source Page in Nsight Compute" provides a good introduction to the view.

The Source View can help you track down what operations have the most stalls and the stall reasons. There is not a good method to determine if a memory operation hit or missed in the L1 or L2 cache; however, you can determine the magnitude of the stall.


Thanks, I got the idea of Stall Wait. The examples are very clear. I checked my kernel, and it may not be my barrier.

So I have another set of questions. : )
I tested a memory-bound matmul kernel (fp16).
In the memory chart I see four important memory links: DRAM_L2, L2_DRAM, L2_L1, and L1_L2,
and some other links such as SharedMemory_Shared, L1_Global, and Shared_Kernel. I want to know:

  1. The kernel is memory bound, but how can I know which memory link causes the bound? i.e. whether L2-to-DRAM or L2-to-L1 causes the overall memory bound. Is it the peak %?
  2. Suppose the bottleneck is at L2 to L1 (or L2 to shared memory), it moves 100MB of data, and the bandwidth is 1GB/s, so the kernel time should be nearly 0.1s or just a little more, but the real time is 0.8s or more. The roofline chart treats the kernel as memory bound, but is it really memory bound? Shouldn't this situation be attributed to a latency bottleneck?

If you collect a full Nsight Compute report and expand the GPU Speed of Light Throughput section, there is a Memory Throughput Breakdown table that is sorted from highest to lowest. Each name is prefixed by L1, L2, or DRAM. The throughput in bytes per second is not the only criterion, as a program could also be request limited.

Determining if an application is latency bound is difficult. A kernel is generally marked as latency bound if neither Compute Throughput nor Memory Throughput is high.

  2. Suppose the bottleneck is at L2 to L1 (or L2 to shared memory), it moves 100MB of data, and the bandwidth is 1GB/s, so the kernel time should be nearly 0.1s or just a little more, but the real time is 0.8s or more. The roofline chart treats the kernel as memory bound, but is it really memory bound? Shouldn't this situation be attributed to a latency bottleneck?

There is insufficient information in the question. Depending on the GPU it is hard to exceed 80% or 90% of peak for parts of the memory system. It is possible for a program to be both memory and latency bound; see the worked numbers after the list below.

  • If a kernel is memory bound and the memory bandwidth is increased in the critical path then the grid should run faster.
  • If a kernel is latency bound and the memory bandwidth is increased in the critical path then the grid may not run faster if the latency is not decreased. Latency bound kernels often need to be fixed by increasing parallelism, improving cache hits, increasing occupancy to hide latency, etc.
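To make the distinction concrete with the numbers from the question (taking the 1GB/s figure as the assumed peak of the L2-to-L1 link): moving 100MB at the full 1GB/s would take 100MB / 1GB/s = 0.1s. If the kernel actually takes 0.8s, the achieved bandwidth is only 100MB / 0.8s ≈ 125MB/s, about 12.5% of that peak. A kernel that waits on memory while the link runs far below its peak is the latency bound case in the second bullet; a kernel that holds the link near its peak for the whole run is the memory bound case in the first.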

Thanks for your reply.

  1. I reprofiled my code and see that the top entry in the Memory Throughput Breakdown is L1: Data Pipe Lsu Wavefronts [%], but I don't actually understand what it means; I will read the doc again.
  2. As you mentioned:
  • If a kernel is memory bound and the memory bandwidth is increased in the critical path then the grid should run faster.
  • If a kernel is latency bound and the memory bandwidth is increased in the critical path then the grid may not run faster if the latency is not decreased. Latency bound kernels often need to be fixed by increasing parallelism, improving cache hits, increasing occupancy to hide latency, etc.

Can I test my kernel by limiting or changing the bandwidth (e.g. limiting L2 to 1GB/s)? There seems to be no such option in nvidia-smi.

  1. I reprofiled my code and see that the top entry in the Memory Throughput Breakdown is L1: Data Pipe Lsu Wavefronts [%], but I don't actually understand what it means; I will read the doc again.

The L1: Data Pipe Lsu Wavefronts metric (l1tex__data_pipe_lsu_wavefront) is the percent of cycles that the L1 SRAM was accessed for global, local, shared, or surface write operations.

If all threads access the same address, then wavefronts will be 1 for the instruction.
If each thread accesses an address with a 128B stride, then wavefronts will be 32 (1 per thread) for the instruction.

Vector operations (64-bit, 128-bit) or operations with address divergence will increase the number of L1 data accesses.
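As a concrete illustration of the two extremes above, here is a minimal CUDA sketch (my own example, assuming 4-byte stores and 32-thread warps): the first kernel writes one 128B sector per warp per store (1 wavefront), while the second touches 32 different sectors (32 wavefronts), which inflates the wavefront metric above.

__global__ void coalesced_store(float *out, float v)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = v;                        // consecutive 4B addresses: 1 wavefront per warp
}

__global__ void strided_store(float *out, float v)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i * 32] = v;                   // 128B stride between threads: 32 wavefronts per warp
}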

  2. … Can I test my kernel by limiting or changing the bandwidth (e.g. limiting L2 to 1GB/s)? There seems to be no such option in nvidia-smi.

No tool exists that can limit the memory system bandwidth or easily limit a specific section (e.g. L1 vs. L2 vs. DRAM). nvidia-smi can be used to change the memory clock on GDDR*-based systems, but there is not a good way to perform these studies.

Thank you for your guidance !