I launched a null kernel on an H800 (CUDA 12.8) with a grid size and block size that I pass on the command line, and I profiled it with ncu:
ncu --clock-control none --metrics smsp__cycles_elapsed.max,smsp__cycles_active.max ./null_kernel 132 384
The results are as follows:
Launching kernel with grid_size=132, block_size=384
Total threads: 50688
[1504] null_kernel@127.0.0.1
null_kernel() (132, 1, 1)x(384, 1, 1), Context 1, Stream 7, Device 0, CC 9.0
Warning: Data collection happened without fixed GPU frequencies. Profiling results may be inconsistent.
Section: Command line profiler metrics
1013 cycles is longer than I would expect. An H800 has fewer than 132 SMs (114 SMs?), so a subset of SMs will receive 2 blocks. Each block is 384 threads, which is 12 warps, or 3 warps per SMSP.
1 warp per SM will cause the IDC to miss on LDC R1, c[0x0][0x28]. All warps will, however, be back-pressured on that load until the miss returns. I would expect the second wave to be able to execute immediately.
My best guess is that a subset of SMs is getting 2 CTAs launched, which increases the number by quite a bit. It would be interesting to look at smsp__cycles_active.min and smsp__cycles_active.avg to see how they differ.
1013 seems too high to me. Are you sure you are not building with -G? That adds a lot of extra instructions to guarantee everything completes and is flushed to memory, which could easily increase latency by several hundred cycles.
I am profiling the null kernel because I want to measure the warp scheduling overhead, and I want to use smsp__cycles_active to represent that overhead. Is this correct?
The H800 has 132 SMs, the same as the H100, so num_waves = 1. I also did not use the -G option.
ncu --set full -o null -f ./null_kernel 132 384
And here is the ncu report.
In this report, smsp__cycles_active.max = 774 and smsp__cycles_active.min = 447, while smsp__cycles_elapsed.max = 4066. Does this result make sense to you?
Another command: ncu --clock-control none --metrics smsp__cycles_elapsed,smsp__cycles_active,gpc__cycles_elapsed.avg.per_second ./null_kernel 132 384
I assume that by warp schedule overhead you mean thread block (CTA = Cooperative Thread Array) launch and completion overhead. If you launched CTAs of 1 warp and forced only 1 CTA per SM, you could potentially measure CTA launch and warp launch latency. However, you cannot easily measure this as smsp__cycles_elapsed - smsp__cycles_active in the way you are doing without the raw timestamps or a per-SMSP-instance counter, which is not exposed.
I would also look at smsp__warps_launched.{min, max} to make sure each SM is truly getting the same amount of work, so that smsp__cycles_active.max is not significantly higher because 2 waves are launching on 1 SM.
The warp launch rate on an SM is 1 warp/cycle. The expensive operation is the CTA launch. The CWD (Compute Work Distributor) rate on pre-GH100 GPUs is 1 CTA/cycle for an idle GPU and 0.5 CTAs/cycle for a non-idle GPU. On GH100 the launch rate for the initial wave increases by 2-3x. The impact at the per-SM level also depends on the amount of shared memory and other resources.