How to keep the float pipe busy?

I observed that there is a 2-cycle stall between two float operations, e.g., 2 stall cycles between two FFMAs. Does this suggest that I need more warps to hide the latency?

On CC 7.* GPUs the warp scheduler can issue an instruction from a fully active warp to the FP32 pipe every 2 cycles. Dependent instruction latency is 4 cycles. The diagrams below show 1 SM warp scheduler.

WARP STATE - This is the scheduler state of the warp. This is a small subset of the states reported by the current CUDA profiling tools.
S - Selected
N - Not selected - the warp was eligible to issue but the scheduler selected a different warp
W - Wait - the instruction specified that the warp wait a number of cycles to resolve an execution dependency

FP32 Issue (Warp ID)
# - Warp ID
- - No issue

INDEPENDENT INSTRUCTION ISSUE

Warp ID \ cycles      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9
Warp 0                S W S W S W S W S W S W S W S W S W S W

FP32 Issue (Warp ID)  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0     pipe_cycles_active = 100%

DEPENDENT INSTRUCTION ISSUE (1 WARP)

Warp ID \ cycles      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9
Warp 0                S W W W S W W W S W W W S W W W S W W W

FP32 Issue (Warp ID)  0 0 - - 0 0 - - 0 0 - - 0 0 - - 0 0 - -     pipe_cycles_active = 50%

DEPENDENT INSTRUCTION ISSUE (2 WARPS)

Warp ID \ cycles      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9
Warp 0                S W W W S W W W S W W W S W W W S W W W
Warp 1                N N S W W W S W W W S W W W S W W W S W

FP32 Issue (Warp ID)  0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1     pipe_cycles_active = 100%

If a warp has sufficient instruction-level parallelism then 1 warp/scheduler can fully utilize the FP32 pipe.

If a warp has a chain of dependent instructions then 1 warp/scheduler can only utilize 50% of the FP32 pipe, so a minimum of 2 warps per scheduler is required.
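
The same contrast can be written at the source level. A minimal sketch (kernel names are illustrative, not from any particular code base): the first kernel is the dependent-issue case, the second exposes enough independent FFMAs for one warp to keep the pipe busy.

__global__ void dependent_chain(float *out, float a, float b, int iters)
{
    // Every fmaf depends on the previous one through acc, so one warp can
    // only issue an FFMA roughly every 4 cycles (dependent-issue case above).
    float acc = 1.0f;
    for (int i = 0; i < iters; ++i)
        acc = fmaf(a, acc, b);
    out[blockIdx.x * blockDim.x + threadIdx.x] = acc;
}

__global__ void independent_chains(float *out, float a, float b, int iters)
{
    // Four accumulators expose four independent FFMAs per iteration, so one
    // warp can issue to the FP32 pipe every 2 cycles (independent-issue case).
    float acc0 = 1.0f, acc1 = 2.0f, acc2 = 3.0f, acc3 = 4.0f;
    for (int i = 0; i < iters; ++i) {
        acc0 = fmaf(a, acc0, b);
        acc1 = fmaf(a, acc1, b);
        acc2 = fmaf(a, acc2, b);
        acc3 = fmaf(a, acc3, b);
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = acc0 + acc1 + acc2 + acc3;
}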

A kernel may issue a long sequence of independent math instructions; during such a section 1 warp can fully utilize the math pipe. In most kernels, however, the warp is also performing other operations such as loading data, storing data, or calculating an address. If the compiler can interleave the loads and stores with FP32 instructions then it is possible for a single warp to still drive 100% activity on the FP32 pipe, but it is highly unlikely that a single warp has sufficient independent math to hide all latency, so multiple warps are useful to hide latency.

The CUDA profilers have counters to show pipe activity and instruction issue activity. The CUDA profilers can also collect this information using statistical sampling of the program counter and show it next to the disassembly or rolled up to source code.
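
As a rough sketch of the kind of interleaving meant above (written by hand here; the compiler often performs similar scheduling on its own, and the kernel name is just illustrative), the next iteration's load can be issued early so FFMAs execute while it is in flight:

__global__ void interleaved(const float *in, float *out, int n, float a, float b)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    if (i >= n) return;

    float cur = in[i];                      // load for the current iteration
    for (; i + stride < n; i += stride) {
        float next = in[i + stride];        // issue next iteration's load early...
        float r = fmaf(a, cur, b);          // ...so these FFMAs can issue while it is in flight
        r = fmaf(a, r, b);
        r = fmaf(a, r, b);
        out[i] = r;
        cur = next;
    }
    float r = fmaf(a, cur, b);              // handle the final element
    r = fmaf(a, r, b);
    r = fmaf(a, r, b);
    out[i] = r;
}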

It should be noted that, for reasons that differ somewhat between GPU generations, running with just one warp per thread block is generally not advisable on any GPU architecture currently supported by CUDA (compute capability >= 3.0).

A good rule of thumb across GPU generations is to initially aim for a thread count per block that is a multiple of 32 (i.e., an integer multiple of the warp size) and falls between 128 and 256, with adjustments made according to the needs of the use case at hand. One would generally want to run at least two such thread blocks per SM.
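
For example, a simple starting point (the kernel and names here are purely illustrative) could look like the following, to be tuned from there:

__global__ void myKernel(float *data, int n)    // placeholder kernel body
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f;
}

void launchExample(float *d_data, int n)
{
    const int threadsPerBlock = 256;   // multiple of 32, in the 128-256 range
    // Enough blocks to cover the data; for reasonably large n this also
    // yields several blocks per SM.
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    myKernel<<<blocks, threadsPerBlock>>>(d_data, n);
}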

As @njuffa implied, the minimum is often not enough to hide latency. Guidelines:

  • Threads per thread block should be a multiple of WARP_SIZE (32).
  • >= 2 warps per thread block are necessary to achieve > 50% occupancy on most GPUs (the maximum number of thread blocks per SM is 1/2 the maximum number of warps per SM).
  • Multiple blocks per SM will reduce tail effects, avoid instruction issue stalls when warps hit barriers, and distribute instruction types over time, increasing the number of eligible warps.

When developing a kernel, try to keep the thread assignment flexible so it is easy to vary block and grid dimensions.
For most kernels the optimal occupancy is in the 50-70% range. For Turing the number tends to be higher due to the reduced maximum number of warps per SM.
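
A quick way to see where a given block size lands before profiling is the occupancy API (the kernel below is just a stand-in):

#include <cstdio>

__global__ void exampleKernel(float *data, int n)   // stand-in kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = fmaf(data[i], 2.0f, 1.0f);
}

void reportTheoreticalOccupancy()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    const int blockSize = 256;
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, exampleKernel, blockSize, 0);

    int activeWarps = blocksPerSM * blockSize / 32;
    int maxWarps    = prop.maxThreadsPerMultiProcessor / 32;
    printf("theoretical occupancy: %.0f%%\n", 100.0 * activeWarps / maxWarps);
}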

Hi @Greg@NV and @njuffa. Thanks for your help.

I’m using an RTX GPU.

I have another question.

I also observed that there is a 4 cycle stall between two consecutive load/store instructions on a Turing GPU. But from the Turing whitepaper there are only 4 LD/ST units per processing block, so I expected an 8 cycle stall (32 (warp size) / 4 (LD/ST units) = 8) between two consecutive load/store instructions.

How should I explain the 4 cycle stall?

There is a difference between the burst instruction issue rate and the sustained instruction issue rate. The burst rate is 1 LSU instruction per SM sub-partition per cycle. The sustained peak rate is 0.5 LSU instructions per SM per cycle.

Got it. Really helpful.