Getting close to max IPC with nanosleep

Hi all,

I am playing around with some metrics in NCU and I am particularly looking at the number of instructions that a kernel can issue every cycle.
My goal is to write a kernel that issues a high number of instructions per cycle without doing any meaningful work, i.e. ideally the kernel doesn’t execute any arithmetic, logic, or data-movement instructions, so that it doesn’t show high utilization across any of the pipelines.

The metrics that I’m mainly tracking currently in NCU for this purpose are:

  • Pipe Utilizations in the Compute Workload Analysis section
  • Executed IPC elapsed (sm__inst_executed.avg.per_cycle_elapsed)
  • Executed IPC active (sm__inst_executed.avg.per_cycle_active)
  • Issued IPC active (sm__inst_issued.avg.per_cycle_active)
  • Issued Warp per Scheduler (smsp__issue_active.avg.per_cycle_active)
  • stalls in the Warp State Statistics section
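For context, these metrics can also be collected outside the GUI with the NCU command line (the application name here is a placeholder, not from my actual setup):

```shell
# Collect the issue/execute IPC metrics directly with the ncu CLI
ncu --metrics sm__inst_executed.avg.per_cycle_elapsed,sm__inst_executed.avg.per_cycle_active,sm__inst_issued.avg.per_cycle_active,smsp__issue_active.avg.per_cycle_active ./my_app
```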

My kernel currently looks as follows. When looking at the compiled SASS code in NCU, the loop is completely unrolled.

__global__ void null_kernel() {
    #pragma unroll
    for (int i = 0; i < 1024; ++i) {
        // Sleep ~1 ns; after unrolling this becomes 1024 independent
        // NANOSLEEP instructions in SASS.
        asm volatile("nanosleep.u32 1;");
    }
}

I’m having trouble getting close to the maximum of 4 issued instructions/cycle/SM (i.e. 1 instruction/cycle/SMSP). When running this code on an H100 with the launch configurations <<<132,1024>>> and <<<246,1024>>>, I keep seeing values very similar to the following and don’t manage to get beyond them:

  • sm__inst_executed.avg.per_cycle_elapsed: 0.24
  • sm__inst_executed.avg.per_cycle_active: 0.25
  • sm__inst_issued.avg.per_cycle_active: 0.5
  • smsp__issue_active.avg.per_cycle_active: 0.12

I’m trying to understand why the kernel cannot issue more instructions, given that they are independent from each other. In terms of latency, according to PTX ISA 8.5 a thread is guaranteed to wake up after at most 2*t. Given an SM frequency of 1.08 GHz, sleeping for t = 1 ns should therefore not take more than 3 cycles (2 ns ≈ 2.16 cycles), but the instruction seems to take much longer.
I’m seeing a very high number of “Stall Branch Resolved” cycles reported for each of the nanosleep instructions: on average 51.58 cycles when launching with 132 thread blocks and 63.03 cycles when launching with 246 thread blocks.

So my questions boil down to:

  • Why do the nanosleep instructions cause such a high number of “Stall Branch Resolved” cycles, and what do these stalls actually mean in this case?
  • Is it physically even possible to issue close to 4 nanosleep instructions/cycle/SM, or is there some limit on how many nanosleep instructions can be issued on average every cycle?
  • My initial idea was to use “NOP” instructions instead of nanosleep, to issue instructions without doing any meaningful work. However, I have not found a way to emit them, since they only seem to exist at the SASS level, not at the PTX level. Is there a way to introduce them, or do you see a better way of writing a kernel that issues a high number of instructions without saturating any of the pipelines?

Thanks a lot in advance ;)

NANOSLEEP is not a full-speed instruction. It is executed on an SM unit shared by the 4 sub-partitions. NANOSLEEP requires the branch unit to elect which threads in the warp will run next, and this election shows up as Stall Branch Resolved. NANOSLEEP is not designed for extremely low cycle-count sleeps; it is designed to remove the warp from the scheduler for many tens to hundreds of nanoseconds.

Maximum IPC can be reached using one of the following techniques:

  1. interleave FP32 adds and INT adds, or
  2. a series of FP32 instructions on GA10x+ (2x FFMA), or
  3. a PTX instruction such as pmevent (PTX ISA 8.5) or __prof_trigger (CUDA C++ Programming Guide).

If using FP32 and INT, try to reduce instruction dependencies or launch more than 1 warp per sub-partition.
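A minimal sketch of option 3, analogous to the null_kernel above (the kernel name and unroll count here are illustrative, not a definitive implementation):

```cuda
__global__ void pmevent_kernel() {
    #pragma unroll
    for (int i = 0; i < 1024; ++i) {
        // pmevent raises performance-monitor event 0 without touching the
        // arithmetic, logic, or LSU pipes; __prof_trigger(0) is the
        // CUDA C++ equivalent of this inline PTX.
        asm volatile("pmevent 0;");
    }
}
```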

Thanks a lot Greg, the pmevent option did the trick!