Nsight Compute and NVBit differences

Hi,
I have noticed that for ResNet50, the kernels profiled with Nsight Compute are different from what I see in the output of NVBit. I have attached the commands and outputs. As you can see, the profiler shows the following order:

Starting warmup. Running for a minimum of 5 seconds.
==PROF== Profiling "convActPoolKernelV2" - 0 (1/78825): 0%....50%....100% - 1 pass
==PROF== Profiling "res2_sm_80_kernel" - 1 (2/78825): 0%....50%....100% - 1 pass
==PROF== Profiling "sm80_xmma_fprop_implicit_gemm..." - 2 (3/78825): 0%....50%....100% - 1 pass
==PROF== Profiling "sm80_xmma_fprop_implicit_gemm..." - 3 (4/78825): 0%....50%....100% - 1 pass
==PROF== Profiling "conv_kernel_128_sb" - 4 (5/78825): 0%....50%....100% - 1 pass
==PROF== Profiling "sm80_xmma_fprop_implicit_gemm..." - 5 (6/78825): 0%....50%....100% - 1 pass

While NVBit shows the following order:

kernel 0 - void nvinfer1::rt::cuda::(anonymous namespace)::convActPoolKernelV2<16, 240, 4>(void*, void*, void*, void*, int, int, nvinfer1::rt::cuda::firstLayerMaxpoolPadding) - #thread-blocks 14336,  kernel instructions 524986368, total instructions 524986368
kernel 1 - sm80_xmma_fprop_implicit_gemm_interleaved_i8i8_i8i32_f32_nchw_vect_c_32kcrs_vect_c_32_nchw_vect_c_32_tilesize64x192x64_stage3_warpsize2x2x1_g1_tensor16x8x32_simple_t1r1s1_execute_kernel_trt - #thread-blocks 50176,  kernel instructions 315506688, total instructions 840493056
kernel 2 - sm80_xmma_fprop_implicit_gemm_interleaved_i8i8_i8i32_f32_nchw_vect_c_32kcrs_vect_c_32_nchw_vect_c_32_tilesize96x128x64_stage3_warpsize2x2x1_g1_tensor16x8x32_t1r3s3_execute_kernel_trt - #thread-blocks 8363,  kernel instructions 117316132, total instructions 957809188
kernel 3 - cask_plugin_trt::xmma_trt::conv_kernel_128_sb(cask_plugin_trt::xmma_trt::geometry_t, void*, void*, void*, void*, void*, float) - #thread-blocks 68,  kernel instructions 88130916, total instructions 1045940104
kernel 4 - sm80_xmma_fprop_implicit_gemm_interleaved_i8i8_i8i32_f32_nchw_vect_c_32kcrs_vect_c_32_nchw_vect_c_32_tilesize96x64x64_stage3_warpsize2x2x1_g1_tensor16x8x32_t1r1s1_execute_kernel_trt - #thread-blocks 66904,  kernel instructions 384028576, total instructions 1429968680
kernel 5 - sm80_xmma_fprop_implicit_gemm_interleaved_i8i8_i8i32_f32_nchw_vect_c_32kcrs_vect_c_32_nchw_vect_c_32_tilesize96x64x64_stage3_warpsize2x2x1_g1_tensor16x8x32_simple_t1r1s1_execute_kernel_trt - #thread-blocks 16726,  kernel instructions 90721824, total instructions 1520690504

I am using the Docker image provided in inference_2.0/closed/NVIDIA. The hardware is an RTX 3080 with CUDA 11.6.
I ran both tests after running make prebuild, without exiting and re-entering the Docker container, so the Docker image and code are the same in both cases.
Any idea what could cause this?

nvbit.txt (19.3 KB)
compute.txt (3.7 KB)

It appears that the main difference is the existence of a res2_sm_80_kernel in the Nsight Compute profile that doesn’t exist in the NVBit output. Is that correct? I’m not sure what that kernel does; do you have any idea? One thing to try is running an Nsight Systems profile. If the kernel shows up there, then it may be a question for the NVBit team. Without knowing what that kernel does and whether it’s supposed to be executing, it’s hard to say why they are different.
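If you want to rule out NVBit silently skipping that launch before going to the NVBit team, one option is to log every launch the tool intercepts and look for res2_sm_80_kernel. The following is only a minimal sketch modeled on the launch callback in NVBit's example tools (the callback signature, cbid names, and the cuLaunchKernel_params field are assumptions carried over from those examples, not from your tool):

// kernel_log.cu -- minimal NVBit tool sketch: print every kernel launch seen.
// Assumes the standard NVBit headers and the callback/param names used in the
// NVBit example tools (nvbit.h / nvbit_tool.h, cuLaunchKernel_params::f).
#include <cstdio>
#include "nvbit.h"
#include "nvbit_tool.h"

static int launch_id = 0;

void nvbit_at_cuda_event(CUcontext ctx, int is_exit, nvbit_api_cuda_t cbid,
                         const char *name, void *params, CUresult *pStatus) {
    // Only look at kernel launches, on entry (before the kernel runs).
    if (!is_exit &&
        (cbid == API_CUDA_cuLaunchKernel ||
         cbid == API_CUDA_cuLaunchKernel_ptsz)) {
        cuLaunchKernel_params *p = (cuLaunchKernel_params *)params;
        printf("launch %d - %s\n", launch_id++,
               nvbit_get_func_name(ctx, p->f));
    }
}

Note that launches made through other driver entry points (for example CUDA graph launches) would need their own cbid cases, so a kernel missing from a log like this isn’t necessarily missing from the GPU.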

I tried Nsight Systems and it also shows res2_sm_80_kernel.

Running [/opt/nvidia/nsight-systems/2022.2.1/target-linux-x64/reports/gpukernsum.py report1.sqlite]... 

 Time (%)  Total Time (ns)  Instances  Avg (ns)   Med (ns)   Min (ns)  Max (ns)  StdDev (ns)                                                  Name                                                
 --------  ---------------  ---------  ---------  ---------  --------  --------  -----------  ----------------------------------------------------------------------------------------------------
     20.9        811730962        800  1014663.7  1005076.0    950692   1519559      39240.6  sm80_xmma_fprop_implicit_gemm_interleaved_i8i8_i8i32_f32_nchw_vect_c_32kcrs_vect_c_32_nchw_vect_c_3…
     18.0        698975641         80  8737195.5  8755240.5   8416295   9622443     188208.6  res2_sm_80_kernel(uint4 *, uint4 *, uint2 *, uint4 *, float *, int)                                 
     12.5        485340060        320  1516687.7  1415606.5   1340262   1878760     185298.1  sm80_xmma_fprop_implicit_gemm_interleaved_i8i8_i8i32_f32_nchw_vect_c_32kcrs_vect_c_32_nchw_vect_c_3…
      8.6        332134484        320  1037920.3  1049477.0    977156   1374534      41047.1  sm80_xmma_fprop_implicit_gemm_interleaved_i8i8_i8i32_f32_nchw_vect_c_32kcrs_vect_c_32_nchw_vect_c_3…
      7.7        297900583        400   744751.5   741875.5    728035   1197285      27103.3  sm80_xmma_fprop_implicit_gemm_interleaved_i8i8_i8i32_f32_nchw_vect_c_32kcrs_vect_c_32_nchw_vect_c_3…
      6.3        244757400        480   509911.3   437522.0    413985   1252517     163791.8  cask_plugin_trt::xmma_trt::conv_kernel_1024_sb_relu(cask_plugin_trt::xmma_trt::geometry_t, void *, …
      4.8        186826419        240   778443.4   771252.0    731971    819716      17659.5  sm80_xmma_fprop_implicit_gemm_interleaved_i8i8_i8i32_f32_nchw_vect_c_32kcrs_vect_c_32_nchw_vect_c_3…
      4.4        171321127         81  2115075.6  2114250.0   2030377   2147946      19767.2  void nvinfer1::rt::cuda::<unnamed>::co
...

Apart from this kernel that doesn’t exist in the NVBit output, the order of the other kernels is also different. Do you have any idea about that? I don’t know whether the two tools see the same streaming/ordering behavior.

Nsight Compute will serialize the kernels, i.e. force them to run one after another. I don’t know how the ResNet harness is queueing up the work, but if it isn’t deterministic, you could certainly see different kernel orderings depending on how they get scheduled to the GPU. For ResNet specifics, you may need to check with that team or do some more digging into the source if you have it available.
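To make the scheduling point concrete, here is a toy, self-contained CUDA sketch (nothing to do with the TensorRT harness) that enqueues two kernels on independent streams. Nothing in it pins down which kernel starts first on the GPU, so a tool that records the natural execution order can legitimately see different orderings from run to run, whereas a serializing profiler like Nsight Compute forces one kernel at a time:

// streams.cu -- toy example: two kernels on independent streams.
// The host-side launch order does not dictate the device-side execution
// order, so non-serializing tools can observe different orderings per run.
#include <cstdio>

__global__ void kernelA(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

__global__ void kernelB(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *a, *b;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Enqueued back to back on different streams: either kernel may reach
    // the GPU first, and they may even run concurrently.
    kernelA<<<(n + 255) / 256, 256, 0, s1>>>(a, n);
    kernelB<<<(n + 255) / 256, 256, 0, s2>>>(b, n);

    cudaDeviceSynchronize();
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(a);
    cudaFree(b);
    printf("done\n");
    return 0;
}

If the MLPerf harness enqueues work on more than one stream (or uses CUDA graphs), the same effect would apply, which could account for the reordering you see between the serialized Nsight Compute run and the NVBit/Nsight Systems runs.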