Why is the number of issued warps sometimes smaller than the number of eligible warps?

As far as I understand, once there are one or more eligible warps, the scheduler can issue at least one warp. What prevents the scheduler from issuing warps when there are eligible warps?

Thanks

To some degree, this may depend on the GPU you are running on. Recent GPUs partition warps among warp schedulers. If you have 16 available warps and 4 warp schedulers, then each warp scheduler may be “responsible” for 4 warps. A warp scheduler can usually issue at most 2 instructions per clock cycle, both from a single warp. Therefore, if the 4 warps assigned to warp scheduler 0 are all eligible, and none of the 12 warps assigned to warp schedulers 1, 2, and 3 are eligible, you will have 4 out of 16 warps eligible but only 1 or 2 instructions issued (i.e. one issued warp) in that clock cycle, in that SM.

I imagine there may be other possible reasons/examples, as well. For example, suppose all warps are eligible in the above example. You have 16 eligible warps, but only 4 issued warps, in a particular cycle.
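To make the two cases above concrete, here is a minimal Python sketch that counts eligible and issued warps for one cycle on one SM. The function name and the contiguous warp-to-scheduler mapping are assumptions made for illustration; real hardware may assign warps to schedulers differently, and the model ignores dual issue because it counts issued warps, not instructions.

```python
def count_eligible_and_issued(eligible, num_schedulers=4):
    """Return (eligible_warps, issued_warps) for one cycle on one SM.

    `eligible` holds one boolean per warp. Assumption for this sketch:
    warps are split into equal contiguous groups, one group per scheduler,
    and each scheduler issues from at most one warp per cycle.
    """
    warps_per_scheduler = len(eligible) // num_schedulers
    issued = 0
    for s in range(num_schedulers):
        group = eligible[s * warps_per_scheduler:(s + 1) * warps_per_scheduler]
        # A scheduler issues one warp if any of its own warps is eligible.
        if any(group):
            issued += 1
    return sum(eligible), issued


# Only the 4 warps assigned to scheduler 0 are eligible.
one_busy_scheduler = [True] * 4 + [False] * 12
print(count_eligible_and_issued(one_busy_scheduler))  # (4, 1)

# Every warp on every scheduler is eligible.
all_eligible = [True] * 16
print(count_eligible_and_issued(all_eligible))  # (16, 4)
```

In both cases the issued count is capped by the number of schedulers that have at least one eligible warp, not by the total number of eligible warps.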

That is reasonable. Thank you, Robert.

The number of issued warps is always less than or equal to the number of eligible warps, as issued warps are a subset of eligible warps. From a data collection standpoint, these two counters may be collected on separate passes, so there is a small chance that this condition does not hold if the tool has to replay to collect the counters.

When a warp is launched, it is assigned to an SM sub-partition (warp scheduler). The warp will remain on that SM sub-partition until it completes. In the case of instruction-level preemption, the warp will be saved and restored to the same SM sub-partition. The only exception is CDP (CUDA Dynamic Parallelism) preemption, which is implemented in software.

On each cycle, the warp scheduler scans the active warps for eligible warps (warps that are not stalled) and selects one warp to issue. The micro-scheduler may issue 1 or 2 instructions from that warp; the number depends on the architecture.

Nsight Compute metrics are collected at the SM sub-partition (smsp) level. Other tools collect at the SM sub-partition level and display at the SM level.

At the SM sub-partition level there can be 1 to MAX_WARPS_PER_SUBPARTITION active warps. This maximum varies from 8 to 32 on recent hardware. An active warp is either eligible or stalled. Only 1 eligible warp can be selected each cycle.
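The per-cycle behavior of a single sub-partition can be sketched as a toy simulation. Everything here is a simplification and not how the hardware is implemented: the function name, the fixed `issue_latency` after each issue, and the lowest-index-first selection policy are all assumptions chosen only to show how the active, eligible, and issued counts relate.

```python
def simulate_scheduler(stall_cycles, issue_latency, num_cycles):
    """Simulate one warp scheduler; return per-cycle (active, eligible, issued).

    stall_cycles: remaining stall cycles for each active warp (0 = eligible).
    issue_latency: hypothetical fixed number of cycles a warp stalls after
        it issues. Real stall reasons and durations vary widely.
    """
    stalls = list(stall_cycles)
    samples = []
    for _ in range(num_cycles):
        # Scan the active warps for those that are not stalled.
        eligible = [w for w, s in enumerate(stalls) if s == 0]
        # Assumed selection policy: lowest warp index first.
        picked = eligible[0] if eligible else None
        samples.append((len(stalls), len(eligible), 0 if picked is None else 1))
        # Advance time: each stalled warp moves one cycle closer to eligible.
        stalls = [max(s - 1, 0) for s in stalls]
        if picked is not None:
            stalls[picked] = issue_latency
    return samples


# Four active warps, all eligible at the start, 2-cycle stall after each issue.
for active, eligible, issued in simulate_scheduler([0, 0, 0, 0], 2, 4):
    print(active, eligible, issued)
# 4 4 1
# 4 3 1
# 4 2 1
# 4 2 1
```

Every cycle has more than one eligible warp, yet the issued count never exceeds 1, which is the gap between the two counters at the sub-partition level.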

If the raw counters are rolled up to the SM level, then the worst case is that only 1 sub-partition has warps, so the number of eligible warps per SM could be 16 while only 1 is selected. The best case is that there is at least 1 eligible warp per sub-partition each cycle, so all schedulers can issue an instruction each cycle.

https://docs.nvidia.com/nsight-visual-studio-edition/Nsight_Visual_Studio_Edition_User_Guide.htm#Analysis/Report/CudaExperiments/KernelLevel/IssueEfficiency.htm

The GTC2018 talk S9345 - CUDA Kernel Profiling Using NVIDIA Nsight Compute by Magnus Strengert has good coverage of this topic. The slides and recording will be available later this year.

Very helpful. Thank you for the information, Magnus. :)