Increasing number of active warps per scheduler

asandip785 · November 18, 2021, 11:31pm

I am using Nsight Compute to debug and optimize my kernels. Under Scheduler Statistics, I get the following message:

“Every scheduler is capable of issuing one instruction per cycle, but for this kernel each scheduler only issues an instruction every 16.4 cycles. This might leave hardware resources underutilized and may lead to less optimal performance. Out of the maximum of 16 warps per scheduler, this kernel allocates an average of 3.24 active warps per scheduler, but only an average of 0.07 warps were eligible per cycle. Eligible warps are the subset of active warps that are ready to issue their next instruction. Every cycle with no eligible warp results in no instruction being issued and the issue slot remains unused. To increase the number of eligible warps either increase the number of active warps or reduce the time the active warps are stalled.”

How do I increase the number of active warps?

Robert_Crovella · November 18, 2021, 11:37pm

Increase the total number of warps in your grid. This question is also partly about occupancy, and an occupancy analysis of your code cannot be done from the profiler message alone.

Make sure your kernel grid is large enough to saturate the GPU. Find out how many SMs are in your GPU (deviceQuery) and how many warps can be resident on each SM (programming guide, table 14 or 15). The product of those two values is the minimum total number of warps needed to drive the number reported by the profiler higher.
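As a rough sketch of that arithmetic: the minimum warp count is the SM count times the resident-warp limit per SM. The function names and the figures in the usage note (80 SMs, 64 warps per SM) are hypothetical and not taken from this thread. On a real device the two inputs can be read with `cudaGetDeviceProperties` (`multiProcessorCount`, and `maxThreadsPerMultiProcessor` divided by the warp size).

```cpp
#include <cassert>

// Minimum number of warps the grid must supply so that every SM could be
// filled to its resident-warp limit. Both inputs are device properties.
long long min_warps_to_fill_gpu(int sm_count, int max_warps_per_sm) {
    assert(sm_count > 0 && max_warps_per_sm > 0);
    return static_cast<long long>(sm_count) * max_warps_per_sm;
}

// Number of thread blocks needed to supply at least `warps` warps,
// given the block size in threads (a warp is 32 threads).
long long blocks_for_warps(long long warps, int threads_per_block) {
    const int warp_size = 32;
    assert(warps > 0 && threads_per_block > 0);
    // A partially filled warp still occupies a full warp slot, so round up.
    const int warps_per_block = (threads_per_block + warp_size - 1) / warp_size;
    return (warps + warps_per_block - 1) / warps_per_block;
}
```

With the hypothetical 80 SMs and 64 warps per SM, `min_warps_to_fill_gpu(80, 64)` gives 5120 warps, and `blocks_for_warps(5120, 128)` gives 1280 blocks of 128 threads. This is only a lower bound on grid size; whether those warps can actually be resident at once depends on occupancy.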
Beyond that, if the number is still low, you have to find out why from an occupancy perspective. Learn to use the occupancy calculator spreadsheet. For example, if your kernel has a threadblock size of 128 (4 warps) and also uses 48 KB of shared memory per block, and only one such block fits in the SM's shared memory, you will be limited to 4 warps per SM. Let’s say your SM has 2 warp schedulers. Then you would be limited to 2 warps per scheduler. This is just a made-up example.
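The made-up example can be worked through in a few lines. This is a simplified sketch, not the occupancy calculator itself: it considers only the shared-memory and warp-count limits and ignores registers and the per-SM block limit. The struct, the function name, and the assumed SM limits (48 KB of shared memory and 64 resident warps per SM) are hypothetical. For a real kernel, `cudaOccupancyMaxActiveBlocksPerMultiprocessor` reports the blocks-per-SM figure directly.

```cpp
#include <algorithm>
#include <cassert>

struct OccupancyEstimate {
    int blocks_per_sm;        // resident blocks per SM
    int warps_per_sm;         // resident warps per SM
    int warps_per_scheduler;  // assuming an even split across schedulers
};

// Estimate resident warps from two limits only: shared memory per SM and
// the maximum number of resident warps per SM.
OccupancyEstimate estimate_occupancy(int threads_per_block,
                                     int smem_per_block_bytes,
                                     int smem_per_sm_bytes,
                                     int max_warps_per_sm,
                                     int schedulers_per_sm) {
    const int warp_size = 32;
    assert(threads_per_block > 0 && smem_per_block_bytes > 0);
    assert(schedulers_per_sm > 0);
    const int warps_per_block = (threads_per_block + warp_size - 1) / warp_size;
    const int limit_by_smem = smem_per_sm_bytes / smem_per_block_bytes;
    const int limit_by_warps = max_warps_per_sm / warps_per_block;
    const int blocks = std::min(limit_by_smem, limit_by_warps);
    const int warps = blocks * warps_per_block;
    return {blocks, warps, warps / schedulers_per_sm};
}
```

Under those assumptions, `estimate_occupancy(128, 48 * 1024, 48 * 1024, 64, 2)` yields 1 block per SM, 4 warps per SM, and 2 warps per scheduler, matching the numbers in the example. Halving the shared memory per block to 24 KB would double all three.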

By the way, possibly the more serious issue here is eligible warps. That very low number means that most of the time, your warps are stalled. That is a latency issue.
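To put a number on how much of the time the issue slots sit idle, the profiler's "issues an instruction every N cycles" figure can be inverted. A minimal sketch, with a hypothetical function name:

```cpp
#include <cassert>

// Fraction of cycles in which a scheduler issues an instruction, given the
// average number of cycles between issues reported by the profiler.
double issue_slot_utilization(double cycles_per_issue) {
    // A scheduler issues at most one instruction per cycle.
    assert(cycles_per_issue >= 1.0);
    return 1.0 / cycles_per_issue;
}
```

For the 16.4 cycles reported above, `issue_slot_utilization(16.4)` is about 0.061, so each scheduler issues in roughly 6% of its cycles and is idle for the rest.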

njuffa · November 18, 2021, 11:50pm

Some possible causes are low compute density, heavy use of synchronization primitives, and long dependency chains, especially those involving low-throughput instructions. This blog post might be helpful (I didn’t read it, I only scanned it; there might be a more up-to-date one):

asandip785 · January 7, 2022, 1:18am

Thank you. I played with launch configurations and looked at the occupancy calculator as suggested. Here are the updated results:

The theoretical active warps per SM is 48 and the achieved active warps per SM is 45.02. Does that mean that the number of active warps per SM is close to the maximum?

I continue to get a similar message:
“Every scheduler is capable of issuing one instruction per cycle, but for this kernel each scheduler only issues an instruction every 88.9 cycles. This might leave hardware resources underutilized and may lead to less optimal performance. Out of the maximum of 16 warps per scheduler, this kernel allocates an average of 11.95 active warps per scheduler, but only an average of 0.02 warps were eligible per cycle. Eligible warps are the subset of active warps that are ready to issue their next instruction. Every cycle with no eligible warp results in no instruction being issued and the issue slot remains unused. To increase the number of eligible warps either increase the number of active warps or reduce the time the active warps are stalled.”

Is the number of active warps per SM different from the number of active warps per scheduler? If yes, how do I increase the number of active warps per scheduler?

Robert_Crovella · January 7, 2022, 3:38pm

Yes, I would say 45 is close to the max limit of 48.

Yes, the two numbers are different. You may wish to study the architecture whitepaper for the GPU you are running on. You can find this with a simple Google search. A modern GPU SM usually has at least two warp schedulers. In modern GPUs, warps are assigned to each of these schedulers when the threadblock is assigned to the SM by the block scheduler. The division of warps between schedulers is unspecified, but fixed once made. So the number of active warps per SM is the sum of the number of active warps for each of the schedulers in that SM.
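The relationship between the two numbers can be sketched as a sum and an average. The function names are hypothetical, and the figure of 4 schedulers per SM is an assumption: the thread does not state it, but it is consistent with the reported 48 theoretical warps per SM and roughly 12 active warps per scheduler.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Active warps on one SM: the sum of the active warps on each of its schedulers.
int active_warps_per_sm(const std::vector<int>& warps_per_scheduler) {
    return std::accumulate(warps_per_scheduler.begin(),
                           warps_per_scheduler.end(), 0);
}

// Average active warps per scheduler for a given per-SM warp count.
double average_warps_per_scheduler(int warps_per_sm, int schedulers_per_sm) {
    assert(schedulers_per_sm > 0);
    return static_cast<double>(warps_per_sm) / schedulers_per_sm;
}
```

With four schedulers holding 12 warps each, `active_warps_per_sm({12, 12, 12, 12})` is 48, and `average_warps_per_scheduler(48, 4)` is 12.0, close to the 11.95 active warps per scheduler in the profiler message.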

The way to increase the number of active warps, whether per scheduler or per SM, is to maximize occupancy. This topic is covered extensively in a variety of web resources, and CUDA includes an occupancy calculator spreadsheet which will be instructive.
