Hi
I have noticed that for ResNet50, the kernel order reported by Nsight Compute differs from what I see in the output of NVBit. I have attached the commands and outputs. As you can see, the profiler shows the following order:
Starting warmup. Running for a minimum of 5 seconds.
==PROF== Profiling "convActPoolKernelV2" - 0 (1/78825): 0%....50%....100% - 1 pass
==PROF== Profiling "res2_sm_80_kernel" - 1 (2/78825): 0%....50%....100% - 1 pass
==PROF== Profiling "sm80_xmma_fprop_implicit_gemm..." - 2 (3/78825): 0%....50%....100% - 1 pass
==PROF== Profiling "sm80_xmma_fprop_implicit_gemm..." - 3 (4/78825): 0%....50%....100% - 1 pass
==PROF== Profiling "conv_kernel_128_sb" - 4 (5/78825): 0%....50%....100% - 1 pass
==PROF== Profiling "sm80_xmma_fprop_implicit_gemm..." - 5 (6/78825): 0%....50%....100% - 1 pass
NVBit, however, shows the following order:
kernel 0 - void nvinfer1::rt::cuda::(anonymous namespace)::convActPoolKernelV2<16, 240, 4>(void*, void*, void*, void*, int, int, nvinfer1::rt::cuda::firstLayerMaxpoolPadding) - #thread-blocks 14336, kernel instructions 524986368, total instructions 524986368
kernel 1 - sm80_xmma_fprop_implicit_gemm_interleaved_i8i8_i8i32_f32_nchw_vect_c_32kcrs_vect_c_32_nchw_vect_c_32_tilesize64x192x64_stage3_warpsize2x2x1_g1_tensor16x8x32_simple_t1r1s1_execute_kernel_trt - #thread-blocks 50176, kernel instructions 315506688, total instructions 840493056
kernel 2 - sm80_xmma_fprop_implicit_gemm_interleaved_i8i8_i8i32_f32_nchw_vect_c_32kcrs_vect_c_32_nchw_vect_c_32_tilesize96x128x64_stage3_warpsize2x2x1_g1_tensor16x8x32_t1r3s3_execute_kernel_trt - #thread-blocks 8363, kernel instructions 117316132, total instructions 957809188
kernel 3 - cask_plugin_trt::xmma_trt::conv_kernel_128_sb(cask_plugin_trt::xmma_trt::geometry_t, void*, void*, void*, void*, void*, float) - #thread-blocks 68, kernel instructions 88130916, total instructions 1045940104
kernel 4 - sm80_xmma_fprop_implicit_gemm_interleaved_i8i8_i8i32_f32_nchw_vect_c_32kcrs_vect_c_32_nchw_vect_c_32_tilesize96x64x64_stage3_warpsize2x2x1_g1_tensor16x8x32_t1r1s1_execute_kernel_trt - #thread-blocks 66904, kernel instructions 384028576, total instructions 1429968680
kernel 5 - sm80_xmma_fprop_implicit_gemm_interleaved_i8i8_i8i32_f32_nchw_vect_c_32kcrs_vect_c_32_nchw_vect_c_32_tilesize96x64x64_stage3_warpsize2x2x1_g1_tensor16x8x32_simple_t1r1s1_execute_kernel_trt - #thread-blocks 16726, kernel instructions 90721824, total instructions 1520690504
I am using the Docker image provided in inference_2.0/closed/NVIDIA. The hardware is an RTX 3080 with CUDA 11.6.
I ran both tests after running make prebuild, in the same session; I did not exit and re-enter the Docker container, so the Docker image and code are identical in both cases.
Any idea what might cause this difference?
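In case it helps narrow things down, one way I can cross-check individual launches is to pin Nsight Compute to a single kernel instance and compare it against NVBit's launch index. This is only a sketch of the stock ncu CLI options (my actual harness command is in the attached compute.txt, and "./harness ..." below is a placeholder for it):

```shell
# Skip the first 2 kernel launches and profile only the 3rd one,
# to line it up with "kernel 2" in the NVBit output:
ncu --launch-skip 2 --launch-count 1 ./harness ...

# Or restrict profiling to a specific kernel by (mangled) name:
ncu --kernel-name regex:conv_kernel_128_sb ./harness ...
```

If the kernel selected this way still does not match the NVBit numbering, that would confirm the two tools really do see different launch orders rather than just printing them differently.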
nvbit.txt (19.3 KB)
compute.txt (3.7 KB)