Eligible/Stalled warps


I am working through an example to understand how to improve performance by increasing the number of eligible warps. Here is the code:

// LAUNCHBOUNDS(n) is presumably a macro wrapping __launch_bounds__(n).
__global__ void LAUNCHBOUNDS(1024) kernel()
{
	float test = 1.0f;
}

int main ()
{
	kernel<<<102400 / 1024, 1024>>>();
}

Profiling with Nsight Compute shows a very high number of stalled warps at launch, as you can see in the next figure:

Do you know if there is something wrong with my settings?


You would want to inspect the Warp State Statistics section next, to identify why those warps are stalled and are not eligible. This section will show you the individual warp stall reasons that were found, described in detail here: https://docs.nvidia.com/nsight-compute/NsightCompute/index.html#statistical-sampler

You can also check the “Sampling Data” metrics on the Source page to see where those stalls occur in the code, even though for your trivial code that might not provide too much additional insight. For larger codes, it can be very valuable.

The kernel has no side effects, so it will be a null kernel. The kernel will have only 3 instructions.

MOV R1, c[0x0][0x28] ; set up stack pointer

If you look at the Warp State Statistics (using metric names) you will find the following:

smsp__average_warps_issue_stalled_barrier_per_issue_active.ratio [inst]             0.00
smsp__average_warps_issue_stalled_dispatch_stall_per_issue_active.ratio [inst]      0.00
smsp__average_warps_issue_stalled_drain_per_issue_active.ratio [inst]               3.40
smsp__average_warps_issue_stalled_imc_miss_per_issue_active.ratio [inst]           36.67
smsp__average_warps_issue_stalled_lg_throttle_per_issue_active.ratio [inst]         0.00
smsp__average_warps_issue_stalled_long_scoreboard_per_issue_active.ratio [inst]     0.00
smsp__average_warps_issue_stalled_math_pipe_throttle_per_issue_active.ratio [inst]  0.13
smsp__average_warps_issue_stalled_membar_per_issue_active.ratio [inst]              0.00
smsp__average_warps_issue_stalled_mio_throttle_per_issue_active.ratio [inst]        0.12
smsp__average_warps_issue_stalled_misc_per_issue_active.ratio [inst]                0.00
smsp__average_warps_issue_stalled_no_instruction_per_issue_active.ratio [inst]      6.22
smsp__average_warps_issue_stalled_not_selected_per_issue_active.ratio [inst]        0.55
smsp__average_warps_issue_stalled_selected_per_issue_active.ratio [inst]            1.00
smsp__average_warps_issue_stalled_short_scoreboard_per_issue_active.ratio [inst]    0.00
smsp__average_warps_issue_stalled_sleeping_per_issue_active.ratio [inst]            0.00
smsp__average_warps_issue_stalled_tex_throttle_per_issue_active.ratio [inst]        0.00
smsp__average_warps_issue_stalled_wait_per_issue_active.ratio [inst]                3.51

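The average warp spent the majority of its time stalled on the initial imc_miss (a miss in the constant cache) and on no_instruction (waiting to fetch an instruction). Together these two reasons account for 42.89 of the 51.60 total across the stall reasons listed above, about 83%.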

Basically, this test measures the overhead of launching a warp, missing in the constant cache, missing in the instruction cache, and exiting.
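If you want the profile to reflect more than launch overhead, the kernel needs an observable side effect so that it is not reduced to a null kernel. Here is a minimal sketch of one way to do that, assuming a store to global memory is acceptable for your experiment. The kernel name `kernel_with_store` and the buffer `out` are hypothetical and not part of your original code, and I wrote `__launch_bounds__` directly instead of your `LAUNCHBOUNDS` macro:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical variant of the original kernel. The store to global memory is
// a side effect, so the compiler has to keep the computation and the store.
__global__ void __launch_bounds__(1024) kernel_with_store(float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float test = 1.0f;
    out[i] = test + static_cast<float>(i);
}

int main()
{
    const int n = 102400;  // same total thread count as the original launch

    float *out = nullptr;  // hypothetical output buffer, one float per thread
    if (cudaMalloc(&out, n * sizeof(float)) != cudaSuccess) {
        std::fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    // 100 blocks of 1024 threads, so every thread has a valid slot in out.
    kernel_with_store<<<n / 1024, 1024>>>(out);

    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        std::fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        cudaFree(out);
        return 1;
    }

    // Read one element back to confirm the kernel actually wrote its result.
    float last = 0.0f;
    cudaMemcpy(&last, out + (n - 1), sizeof(float), cudaMemcpyDeviceToHost);
    std::printf("out[%d] = %.1f\n", n - 1, last);

    cudaFree(out);
    return 0;
}
```

Even this version does very little work per thread, so launch and fetch overhead will likely still be a large share of the stall reasons. It only guarantees that the kernel contains real instructions to sample.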