How to keep the float pipe busy?

I observed that there is a 2-cycle stall between two float operations, e.g., 2 stall cycles between two FFMAs. Does this suggest that I need more warps to hide the latency?

On CC 7.* GPUs the warp scheduler can issue an instruction from a fully active warp to the FP32 pipe every 2 cycles. Dependent instruction latency is 4 cycles. The diagrams below show 1 SM warp scheduler.

WARP STATE - This is the scheduler state of the warp. This is a small subset of the states reported by the current CUDA profiling tools.
S - Selected
N - Not selected - the warp was eligible to issue but the scheduler selected a different warp
W - Wait - the instruction specified that the warp wait a number of cycles to resolve an execution dependency

FP32 Issue (Warp ID)
# - Warp ID
- - No issue

INDEPENDENT INSTRUCTION ISSUE

Warp ID \ cycles      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9
Warp 0                S W S W S W S W S W S W S W S W S W S W

FP32 Issue (Warp ID)  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0     pipe_cycles_active = 100%

DEPENDENT INSTRUCTION ISSUE (1 WARP)

Warp ID \ cycles      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9
Warp 0                S W W W S W W W S W W W S W W W S W W W

FP32 Issue (Warp ID)  0 0 - - 0 0 - - 0 0 - - 0 0 - - 0 0 - -     pipe_cycles_active = 50%

DEPENDENT INSTRUCTION ISSUE (2 WARPS)

Warp ID \ cycles      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9
Warp 0                S W W W S W W W S W W W S W W W S W W W
Warp 1                N N S W W W S W W W S W W W S W W W S W

FP32 Issue (Warp ID)  0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1     pipe_cycles_active = 100%

If a warp has sufficient instruction-level parallelism then 1 warp/scheduler can fully utilize the FP32 pipe.

If a warp has a chain of dependent instructions then 1 warp/scheduler can only utilize 50% of the FP32 pipe, so a minimum of 2 warps per scheduler is required.
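
The same contrast can be written at the source level. A minimal sketch (kernel names are illustrative, not from any particular code base): the first kernel is the dependent-issue case, the second exposes enough independent FFMAs for one warp to keep the pipe busy.

__global__ void dependent_chain(float *out, float a, float b, int iters)
{
    // Every fmaf depends on the previous one through acc, so one warp can
    // only issue an FFMA roughly every 4 cycles (dependent-issue case above).
    float acc = 1.0f;
    for (int i = 0; i < iters; ++i)
        acc = fmaf(a, acc, b);
    out[blockIdx.x * blockDim.x + threadIdx.x] = acc;
}

__global__ void independent_chains(float *out, float a, float b, int iters)
{
    // Four accumulators expose four independent FFMAs per iteration, so one
    // warp can issue to the FP32 pipe every 2 cycles (independent-issue case).
    float acc0 = 1.0f, acc1 = 2.0f, acc2 = 3.0f, acc3 = 4.0f;
    for (int i = 0; i < iters; ++i) {
        acc0 = fmaf(a, acc0, b);
        acc1 = fmaf(a, acc1, b);
        acc2 = fmaf(a, acc2, b);
        acc3 = fmaf(a, acc3, b);
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = acc0 + acc1 + acc2 + acc3;
}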

A kernel may issue a long sequence of independent math instructions; during such a section 1 warp can fully utilize the math pipe. In most kernels, however, the warp is also performing other operations such as loading data, storing data, or calculating an address. If the compiler can interleave the loads and stores with FP32 instructions then it is possible for a single warp to still drive 100% activity on the FP32 pipe, but it is highly unlikely that a single warp has sufficient independent math to hide all latency, so multiple warps are useful to hide latency.

The CUDA profilers have counters to show pipe activity and instruction issue activity. The CUDA profilers can also collect this information using statistical sampling of the program counter and show it next to the disassembly or rolled up to source code.
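
As a rough sketch of the kind of interleaving meant above (written by hand here; the compiler often performs similar scheduling on its own, and the kernel name is just illustrative), the next iteration's load can be issued early so FFMAs execute while it is in flight:

__global__ void interleaved(const float *in, float *out, int n, float a, float b)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    if (i >= n) return;

    float cur = in[i];                      // load for the current iteration
    for (; i + stride < n; i += stride) {
        float next = in[i + stride];        // issue next iteration's load early...
        float r = fmaf(a, cur, b);          // ...so these FFMAs can issue while it is in flight
        r = fmaf(a, r, b);
        r = fmaf(a, r, b);
        out[i] = r;
        cur = next;
    }
    float r = fmaf(a, cur, b);              // handle the final element
    r = fmaf(a, r, b);
    r = fmaf(a, r, b);
    out[i] = r;
}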

It should be noted that, for reasons that differ somewhat between GPU generations, running with just one warp per thread block is generally not advisable on any GPU architecture currently supported by CUDA (compute capability >= 3.0).

A good rule of thumb across GPU generations is to initially aim for a thread count per block that is a multiple of 32 (i.e., an integer multiple of the warp size) and falls between 128 and 256, with adjustments made according to the needs of the use case at hand. One would generally want to run at least two such thread blocks per SM.
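
For example, a simple starting point (the kernel and names here are purely illustrative) could look like the following, to be tuned from there:

__global__ void myKernel(float *data, int n)    // placeholder kernel body
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f;
}

void launchExample(float *d_data, int n)
{
    const int threadsPerBlock = 256;   // multiple of 32, in the 128-256 range
    // Enough blocks to cover the data; for reasonably large n this also
    // yields several blocks per SM.
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    myKernel<<<blocks, threadsPerBlock>>>(d_data, n);
}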

As @njuffa implied, the minimum is often not enough to hide latency. Guidelines:

  • Threads per thread block should be a multiple of WARP_SIZE (32).
  • >= 2 warps per thread block are necessary to achieve > 50% occupancy on most GPUs (the maximum number of thread blocks per SM is 1/2 the maximum number of warps per SM).
  • Multiple blocks per SM will reduce tail effects, avoid instruction issue stalls when warps hit barriers, and distribute instruction types over time, increasing the number of eligible warps.

When developing a kernel, try to keep the thread assignment flexible so it is easy to vary block and grid dimensions.
For most kernels the optimal occupancy is in the 50-70% range. For Turing the number tends to be higher due to the reduced maximum number of warps per SM.
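
A quick way to see where a given block size lands before profiling is the occupancy API (the kernel below is just a stand-in):

#include <cstdio>

__global__ void exampleKernel(float *data, int n)   // stand-in kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = fmaf(data[i], 2.0f, 1.0f);
}

void reportTheoreticalOccupancy()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    const int blockSize = 256;
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, exampleKernel, blockSize, 0);

    int activeWarps = blocksPerSM * blockSize / 32;
    int maxWarps    = prop.maxThreadsPerMultiProcessor / 32;
    printf("theoretical occupancy: %.0f%%\n", 100.0 * activeWarps / maxWarps);
}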

Hi @Greg@NV and @njuffa. Thanks for your help.

I’m using an RTX GPU.

I have another question.

I also observed that there is a 4 cycle stall between two consecutive load/store instructions on a Turing GPU. But from the Turing whitepaper there are only 4 LD/ST units per processing block, so I expected an 8 cycle stall (32 (warp size) / 4 (LD/ST units) = 8) between two consecutive load/store instructions.

How should I explain the 4 cycle stall?

There is a difference between the burst instruction issue rate and the sustained instruction issue rate. The burst rate is 1 LSU instruction per SM sub-partition per cycle. The sustained peak rate is 0.5 LSU instructions per SM per cycle.

Got it. Really helpful.