I am using Nsight Compute to debug and optimize my kernels. Under Scheduler Statistics, I get the following message:
“Every scheduler is capable of issuing one instruction per cycle, but for this kernel each scheduler only issues an instruction every 16.4 cycles. This might leave hardware resources underutilized and may lead to less optimal performance. Out of the maximum of 16 warps per scheduler, this kernel allocates an average of 3.24 active warps per scheduler, but only an average of 0.07 warps were eligible per cycle. Eligible warps are the subset of active warps that are ready to issue their next instruction. Every cycle with no eligible warp results in no instruction being issued and the issue slot remains unused. To increase the number of eligible warps either increase the number of active warps or reduce the time the active warps are stalled.”
Increase the total number of warps in your grid. This question is also partly about occupancy, and an occupancy analysis of your code cannot be done from the profiler message alone.
Make sure your kernel grid is large enough to saturate the GPU. Find out how many SMs are in your GPU (deviceQuery) and how many warps can be resident on each SM (programming guide, table 14 or 15). The product of those two numbers is the minimum total number of warps your grid needs in order to drive the active-warps figure reported by the profiler higher.
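As a rough illustration of that sizing rule, here is a minimal C++ sketch. The function names are made up for this example, and the SM count and per-SM warp limit are inputs you would look up yourself (deviceQuery and the programming guide table); nothing here queries the GPU.

```cpp
#include <cstdint>

// Minimum number of warps the grid must supply so that every SM can be
// filled to its resident-warp limit. Both inputs are looked up elsewhere:
// the SM count from deviceQuery, the per-SM warp limit from the
// programming guide table for your compute capability.
constexpr std::int64_t min_warps_to_fill_gpu(std::int64_t sm_count,
                                             std::int64_t max_warps_per_sm) {
    return sm_count * max_warps_per_sm;
}

// Number of thread blocks needed to supply that many warps, rounding up.
// Assumes a warp size of 32 threads.
constexpr std::int64_t min_blocks_to_fill_gpu(std::int64_t sm_count,
                                              std::int64_t max_warps_per_sm,
                                              std::int64_t threads_per_block) {
    const std::int64_t warps_per_block = (threads_per_block + 31) / 32;
    const std::int64_t warps = min_warps_to_fill_gpu(sm_count, max_warps_per_sm);
    return (warps + warps_per_block - 1) / warps_per_block;
}
```

With made-up figures of 80 SMs and 64 resident warps per SM, this gives 5120 warps, or 1280 blocks of 128 threads. This is only a lower bound on grid size; whether those warps can actually all be resident depends on occupancy limits such as registers and shared memory.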
Beyond that, if the number is still low, you have to find out why from an occupancy perspective. Learn to use the occupancy calculator spreadsheet. For example, suppose your kernel has a threadblock size of 128 and also uses 48 KB of shared memory per block. On an SM that offers only 48 KB of shared memory, just one such block can be resident at a time, and one block of 128 threads is 4 warps, so you will be limited to 4 warps per SM. Let’s say your SM has 2 warp schedulers. Then you would be limited to 2 warps per scheduler. This is just a made-up example.
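The arithmetic in that made-up example can be sketched in C++ as follows. The `SmLimits` struct and its values are hypothetical, not the limits of any particular GPU, and the sketch deliberately ignores register pressure and allocation granularity, which the occupancy calculator does account for.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical per-SM limits for illustration; look up the real values
// for your GPU in the programming guide.
struct SmLimits {
    std::int64_t max_warps;     // resident warp limit per SM
    std::int64_t max_blocks;    // resident block limit per SM
    std::int64_t shared_bytes;  // shared memory available per SM
    std::int64_t schedulers;    // warp schedulers per SM
};

// Resident warps per SM for a given block size and shared memory use per
// block. Only the warp, block, and shared memory limits are considered.
inline std::int64_t resident_warps_per_sm(const SmLimits& sm,
                                          std::int64_t threads_per_block,
                                          std::int64_t shared_bytes_per_block) {
    const std::int64_t warps_per_block = (threads_per_block + 31) / 32;
    const std::int64_t blocks_by_warps = sm.max_warps / warps_per_block;
    const std::int64_t blocks_by_shared =
        shared_bytes_per_block > 0 ? sm.shared_bytes / shared_bytes_per_block
                                   : sm.max_blocks;
    const std::int64_t blocks =
        std::min({blocks_by_warps, blocks_by_shared, sm.max_blocks});
    return blocks * warps_per_block;
}

// Average warps per scheduler if resident warps are spread evenly.
inline double warps_per_scheduler(const SmLimits& sm,
                                  std::int64_t resident_warps) {
    return static_cast<double>(resident_warps) /
           static_cast<double>(sm.schedulers);
}
```

With `SmLimits{64, 16, 48 * 1024, 2}`, a 128-thread block using 48 KB of shared memory yields 4 resident warps per SM and 2 per scheduler, matching the example. Cutting shared memory use to 16 KB per block would allow 3 blocks, or 12 warps, under the same assumed limits.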
By the way, possibly the more serious issue here is eligible warps. That very low number means that most of the time, your warps are stalled. That is a latency issue.
Some possible causes are low compute density, heavy use of synchronization primitives, and long dependency chains, especially those involving low-throughput instructions. This blog post might be helpful (I didn’t read it, I only scanned it; there might be a more up-to-date one):
The theoretical active warps per SM is 48 and the achieved active warps per SM is 45.02. Does that mean that the number of active warps per SM is close to the maximum?
I continue to get a similar message:
“Every scheduler is capable of issuing one instruction per cycle, but for this kernel each scheduler only issues an instruction every 88.9 cycles. This might leave hardware resources underutilized and may lead to less optimal performance. Out of the maximum of 16 warps per scheduler, this kernel allocates an average of 11.95 active warps per scheduler, but only an average of 0.02 warps were eligible per cycle. Eligible warps are the subset of active warps that are ready to issue their next instruction. Every cycle with no eligible warp results in no instruction being issued and the issue slot remains unused. To increase the number of eligible warps either increase the number of active warps or reduce the time the active warps are stalled.”
Is the number of active warps per SM different from the number of active warps per scheduler? If yes, how do I increase the number of active warps per scheduler?
Yes, I would say 45 is close to the max limit of 48.
Yes. You may wish to study the architecture whitepaper for the GPU you are running on. You can find this with a simple Google search. A modern GPU SM usually has at least two warp schedulers. In modern GPUs, warps are assigned to these schedulers when the threadblock is assigned to the SM by the block scheduler. How the warps are divided among the schedulers is unspecified, but once a warp is assigned to a scheduler, that assignment is fixed. So the number of active warps per SM is the sum of the number of active warps for each of the schedulers in that SM.
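To make the per-SM versus per-scheduler relationship concrete, here is a small C++ sketch. The function names are invented for this example, and the figure of four schedulers per SM used below is an assumption; check the whitepaper for your GPU.

```cpp
#include <numeric>
#include <vector>

// Active warps per SM is the sum of the active warps held by each of
// that SM's schedulers.
inline double active_warps_per_sm(const std::vector<double>& per_scheduler) {
    return std::accumulate(per_scheduler.begin(), per_scheduler.end(), 0.0);
}

// Per-scheduler share if the SM's warps were divided evenly. The actual
// division is unspecified, so treat this as an average.
inline double even_share_per_scheduler(double warps_per_sm, int schedulers) {
    return warps_per_sm / static_cast<double>(schedulers);
}

// Fraction of issue slots used when a scheduler issues one instruction
// every `cycles_per_issue` cycles.
inline double issue_slot_utilization(double cycles_per_issue) {
    return 1.0 / cycles_per_issue;
}
```

Assuming four schedulers per SM, a theoretical 48 warps per SM works out to 12 per scheduler, which is in line with the 11.95 active warps per scheduler in the message above. Issuing once every 88.9 cycles means each scheduler uses only a little over 1% of its issue slots, which is why stall time, not the warp count, looks like the main problem here.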
The way to increase the number of active warps, whether per scheduler or per SM, is to maximize occupancy. This topic is covered extensively in a variety of web resources, and CUDA includes an occupancy calculator spreadsheet which will be instructive.