Hi
I have noticed that for ResNet50, the kernel order reported by Nsight Compute differs from what I see in the output of NVBit. I have attached the commands and outputs. As you can see, the profiler shows the following order:
Starting warmup. Running for a minimum of 5 seconds.
==PROF== Profiling "convActPoolKernelV2" - 0 (1/78825): 0%....50%....100% - 1 pass
==PROF== Profiling "res2_sm_80_kernel" - 1 (2/78825): 0%....50%....100% - 1 pass
==PROF== Profiling "sm80_xmma_fprop_implicit_gemm..." - 2 (3/78825): 0%....50%....100% - 1 pass
==PROF== Profiling "sm80_xmma_fprop_implicit_gemm..." - 3 (4/78825): 0%....50%....100% - 1 pass
==PROF== Profiling "conv_kernel_128_sb" - 4 (5/78825): 0%....50%....100% - 1 pass
==PROF== Profiling "sm80_xmma_fprop_implicit_gemm..." - 5 (6/78825): 0%....50%....100% - 1 pass
NVBit, however, shows the following order:
kernel 0 - void nvinfer1::rt::cuda::(anonymous namespace)::convActPoolKernelV2<16, 240, 4>(void*, void*, void*, void*, int, int, nvinfer1::rt::cuda::firstLayerMaxpoolPadding) - #thread-blocks 14336, kernel instructions 524986368, total instructions 524986368
kernel 1 - sm80_xmma_fprop_implicit_gemm_interleaved_i8i8_i8i32_f32_nchw_vect_c_32kcrs_vect_c_32_nchw_vect_c_32_tilesize64x192x64_stage3_warpsize2x2x1_g1_tensor16x8x32_simple_t1r1s1_execute_kernel_trt - #thread-blocks 50176, kernel instructions 315506688, total instructions 840493056
kernel 2 - sm80_xmma_fprop_implicit_gemm_interleaved_i8i8_i8i32_f32_nchw_vect_c_32kcrs_vect_c_32_nchw_vect_c_32_tilesize96x128x64_stage3_warpsize2x2x1_g1_tensor16x8x32_t1r3s3_execute_kernel_trt - #thread-blocks 8363, kernel instructions 117316132, total instructions 957809188
kernel 3 - cask_plugin_trt::xmma_trt::conv_kernel_128_sb(cask_plugin_trt::xmma_trt::geometry_t, void*, void*, void*, void*, void*, float) - #thread-blocks 68, kernel instructions 88130916, total instructions 1045940104
kernel 4 - sm80_xmma_fprop_implicit_gemm_interleaved_i8i8_i8i32_f32_nchw_vect_c_32kcrs_vect_c_32_nchw_vect_c_32_tilesize96x64x64_stage3_warpsize2x2x1_g1_tensor16x8x32_t1r1s1_execute_kernel_trt - #thread-blocks 66904, kernel instructions 384028576, total instructions 1429968680
kernel 5 - sm80_xmma_fprop_implicit_gemm_interleaved_i8i8_i8i32_f32_nchw_vect_c_32kcrs_vect_c_32_nchw_vect_c_32_tilesize96x64x64_stage3_warpsize2x2x1_g1_tensor16x8x32_simple_t1r1s1_execute_kernel_trt - #thread-blocks 16726, kernel instructions 90721824, total instructions 1520690504
I am using the Docker image provided in inference_2.0/closed/NVIDIA. The hardware is an RTX 3080 with CUDA 11.6.
I ran both tests after running make prebuild, in the same session; I did not exit and re-enter the Docker container, so the Docker image and code are identical in both cases.
Any idea what might cause this difference?
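In case it helps narrow things down, one way I can cross-check individual launches is to pin Nsight Compute to a single kernel instance and compare it against NVBit's launch index. This is only a sketch of the stock ncu CLI options (my actual harness command is in the attached compute.txt, and "./harness ..." below is a placeholder for it):

```shell
# Skip the first 2 kernel launches and profile only the 3rd one,
# to line it up with "kernel 2" in the NVBit output:
ncu --launch-skip 2 --launch-count 1 ./harness ...

# Or restrict profiling to a specific kernel by (mangled) name:
ncu --kernel-name regex:conv_kernel_128_sb ./harness ...
```

If the kernel selected this way still does not match the NVBit numbering, that would confirm the two tools really do see different launch orders rather than just printing them differently.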
nvbit.txt (19.3 KB)
compute.txt (3.7 KB)