Hi all,
I am playing around with some metrics in NCU and I am particular looking at the number of instructions that a kernel can issue in every cycle.
My goal is to write a kernel, that emits a high amount of instructions every cycle without doing any meaningful work, i.e ideally the kernel doesn’t do any arithmetic, logical or data movement instructions such that it doesn’t show high pipe utilization across any of the pipelines.
The metrics that I’m mainly tracking currently in NCU for this purpose are:
- Pipe Utilizations in the Compute Workload Analysis section
- Executed IPC elapsed (sm__inst_executed.avg.per_cycle_elapsed)
- Executed IPC active (sm__inst_executed.avg.per_cycle_active)
- Issued IPC active (sm__inst_issued.avg.per_cycle_active)
- Issued Warp per Scheduler (smsp__issue_active.avg.per_cycle_active)
- stalls in the Warp State Statistics section
My kernel currently looks as follows. When looking at the compiled SASS code in NCU, the loop is completely unrolled
__global__ void null_kernel() {
#pragma unroll
for (int i = 0; i < 1024; ++i) {
asm volatile("nanosleep.u32 1;");
}
}
I’m having trouble getting close to issuing a maximum of 4 instructions/cycle/SM or issuing 1 instruction/cycle/SMSP. Instead when running this code on a H100 with the following launch configurations <<<132,1024>>> and <<<246,1024>>>, I keep seeing values very similar to the following (I don’t manage to get beyond them)
- sm__inst_executed.avg.per_cycle_elapsed: 0.24
- sm__inst_executed.avg.per_cycle_active: 0.25
- sm__inst_issued.avg.per_cycle_active: 0.5
- smsp__issue_active.avg.per_cycle_active: 0.12
I’m trying to understand why the kernel cannot issue more instructions, given that they are independent from each other. In terms of latency of the instruction, according to PTX ISA 8.5, a thread is guaranteed to wake up latest after 2*t, which given an SM frequency of 1.08Ghz should not take more than 3 cycles in this concrete case, but the instruction seems to take much longer.
I’m seeing a very high amount of “Stall Branch Resolved” reported for each of the nanosleep instructions, on average 51.58 cycles when launching with 132 thread blocks and 63.03 cycles when launching with 246 thread blocks.
So my questions boil down to:
- Why do the nanosleep instructions emit such a high amount of “Stall Branch Resolved” and what do they actually mean in this case?
- Is it physically even possible to emit close to 4 nanosleep instructions/cycle/SM or is there some limit on how many nanosleep instructions can be issued on average every cycle?
- My initial idea was to, instead of using nanosleep, use “NOP” instructions for not doing any meaningful work but still issuing an instruction. However I have not found a way to introduce them, given that they only seem to exist at the SASS level but not at the PTX level. Is there a way to introduce them or do you see a better way of writing a kernel that issues a high amount of instructions without saturating any of the pipelines?
Thanks a lot in advance ;)