Hi,
Not yet.
Since our internal benchmarks do not show the regression between JetPack 4 and JetPack 5, we are trying to understand the implementation differences between PyTorch 1.8 and 2.1 to see what might cause the issue.
So far, we have found that the ‘where’ operation is the main reason for the perf drop (in the post-processing case).
The kernel takes around 25 ms on JetPack 4 but around 31 ms on JetPack 5.
We are now building PyTorch 1.8 on JetPack 5 to see whether the same perf drop reproduces with the identical kernel.
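For reference, below is a minimal sketch of the kind of standalone timing loop we run. The tensor shapes and dtypes are illustrative assumptions, not the exact configuration in run_benchmark_standalone.py:

import time
import torch

device = torch.device("cuda")
# Illustrative inputs; the real benchmark uses the model's post-processing tensors.
cond = torch.rand(8, 3, 1024, 1024, device=device) < 0.5
a = torch.rand(8, 3, 1024, 1024, device=device)
b = torch.rand(8, 3, 1024, 1024, device=device)

# Warm-up launch so allocation and first-launch overhead are excluded from the timing.
torch.where(cond, a, b)
torch.cuda.synchronize()

n = 100
start = time.time()
for _ in range(n):
    out = torch.where(cond, a, b)
torch.cuda.synchronize()
elapsed = time.time() - start
print(f"gpu_post_process_time: {elapsed} seconds for {n} executions, {elapsed / n} for 1")

An extra warm-up call like this would also account for the 101 kernel instances reported by the profilers against 100 timed executions.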
JetPack 4
$ sudo /usr/local/cuda-10.2/bin/nvprof python3 run_benchmark_standalone.py
==14210== NVPROF is profiling process 14210, command: python3 run_benchmark_standalone.py
==14210== Warning: Unified Memory Profiling is not supported on the underlying platform. System requirements for unified memory can be found at: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-requirements
gpu_post_process_time: 2.603844404220581 seconds for 100 executions, 0.02603844404220581 for 1
==14210== Profiling application: python3 run_benchmark_standalone.py
==14210== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 99.37% 2.53165s 101 25.066ms 24.942ms 26.711ms _ZN2at6native27unrolled_elementwise_kernelIZZZNS0_83_GLOBAL__N__59_tmpxft_00007230_00000000_8_TensorCompare_compute_72_cpp1_ii_da58410617where_kernel_implERNS_14TensorIteratorEN3c1010ScalarTypeEENKUlvE_clEvENKUlvE6_clEvEUlbffE_NS_6detail5ArrayIPcLi4EEE16OffsetCalculatorILi3EjESE_ILi1EjENS0_6memory15LoadWithoutCastENSH_16StoreWithoutCastEEEviT_T0_T1_T2_T3_T4_
0.17% 4.4158ms 101 43.721us 43.490us 44.642us void at::native::_GLOBAL__N__63_tmpxft_000008d0_00000000_8_UpSampleNearest2d_compute_72_cpp1_ii_f539c38f::upsample_nearest2d_out_frame<c10::Half, float>(c10::Half const *, at::native::_GLOBAL__N__63_tmpxft_000008d0_00000000_8_UpSampleNearest2d_compute_72_cpp1_ii_f539c38f::upsample_nearest2d_out_frame<c10::Half, float>*, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, float, float)
0.17% 4.3586ms 101 43.154us 42.050us 46.306us void at::native::vectorized_elementwise_kernel<int=4, at::native::MulScalarFunctor<float, float>, at::detail::Array<char*, int=2>>(int, float, float)
0.17% 4.2543ms 101 42.121us 40.898us 50.498us _ZN2at6native27unrolled_elementwise_kernelIZZZNS0_21copy_device_to_deviceERNS_14TensorIteratorEbENKUlvE0_clEvENKUlvE6_clEvEUlfE_NS_6detail5ArrayIPcLi2EEE23TrivialOffsetCalculatorILi1EjESC_NS0_6memory12LoadWithCastILi1EEENSD_13StoreWithCastEEEviT_T0_T1_T2_T3_T4_
0.12% 3.0912ms 101 30.606us 29.761us 32.385us void at::native::vectorized_elementwise_kernel<int=4, at::native::BUnaryFunctor<at::native::CompareLTFunctor<float>>, at::detail::Array<char*, int=2>>(int, float, at::native::CompareLTFunctor<float>)
0.00% 28.481us 1 28.481us 28.481us 28.481us void at::native::vectorized_elementwise_kernel<int=4, at::native::FillFunctor<float>, at::detail::Array<char*, int=1>>(int, float, at::native::FillFunctor<float>)
0.00% 3.1690us 1 3.1690us 3.1690us 3.1690us void at::native::vectorized_elementwise_kernel<int=4, at::native::FillFunctor<c10::Half>, at::detail::Array<char*, int=1>>(int, c10::Half, at::native::FillFunctor<c10::Half>)
API calls: 77.15% 8.66442s 3 2.88814s 329.35us 8.58146s cudaMalloc
22.46% 2.52256s 101 24.976ms 24.758ms 26.537ms cudaStreamSynchronize
0.26% 28.697ms 507 56.600us 37.185us 205.16us cudaLaunchKernel
0.12% 13.345ms 5374 2.4830us 1.3120us 83.938us cudaGetDevice
0.01% 711.92us 611 1.1650us 576ns 28.513us cudaGetLastError
0.00% 224.04us 97 2.3090us 1.0880us 28.832us cuDeviceGetAttribute
0.00% 68.258us 1 68.258us 68.258us 68.258us cudaGetDeviceProperties
0.00% 31.840us 2 15.920us 2.4640us 29.376us cuDeviceGet
0.00% 14.496us 2 7.2480us 4.1920us 10.304us cudaSetDevice
0.00% 9.0240us 1 9.0240us 9.0240us 9.0240us cuDeviceTotalMem
0.00% 8.6400us 3 2.8800us 1.7920us 4.1280us cuDeviceGetCount
0.00% 3.8720us 2 1.9360us 1.2800us 2.5920us cudaGetDeviceCount
0.00% 2.0810us 1 2.0810us 2.0810us 2.0810us cuDeviceGetName
0.00% 1.5360us 1 1.5360us 1.5360us 1.5360us cuDeviceGetUuid
JetPack 5
Time Total Time Instances Avg Med Min Max StdDev Name
99.5% 3.147 s 101 31.160 ms 31.154 ms 31.125 ms 31.410 ms 31.412 μs void at::native::elementwise_kernel<(int)128, (int)2, void at::native::gpu_kernel_impl<at::native::<unnamed>::where_kernel_impl(at::TensorIterator &)::[lambda() (instance 1)]::operator ()() const::[lambda() (instance 7)]::operator ()() const::[lambda(bool, float, float) (instance 1)]>(at::TensorIteratorBase &, const T1 &)::[lambda(int) (instance 1)]>(int, T3)
0.1% 4.406 ms 101 43.621 μs 43.424 μs 42.432 μs 48.000 μs 911 ns void at::native::vectorized_elementwise_kernel<(int)4, at::native::AUnaryFunctor<float, float, float, at::native::binary_internal::MulFunctor<float>>, at::detail::Array<char *, (int)2>>(int, T2, T3)
0.1% 4.312 ms 101 42.692 μs 42.624 μs 42.496 μs 45.152 μs 334 ns void at::native::<unnamed>::upsample_nearest2d_out_frame<c10::Half, &at::native::nearest_neighbor_compute_source_index>(const T1 *, T1 *, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, float, float)
0.1% 4.178 ms 101 41.368 μs 41.152 μs 39.936 μs 52.064 μs 1.473 μs void at::native::unrolled_elementwise_kernel<at::native::direct_copy_kernel_cuda(at::TensorIteratorBase &)::[lambda() (instance 2)]::operator ()() const::[lambda() (instance 7)]::operator ()() const::[lambda(float) (instance 1)], at::detail::Array<char *, (int)2>, TrivialOffsetCalculator<(int)1, unsigned int>, TrivialOffsetCalculator<(int)1, unsigned int>, at::native::memory::LoadWithCast<(int)1>, at::native::memory::StoreWithCast<(int)1>>(int, T1, T2, T3, T4, T5, T6)
0.1% 3.118 ms 101 30.873 μs 30.816 μs 30.240 μs 32.384 μs 427 ns void at::native::vectorized_elementwise_kernel<(int)4, void at::native::compare_scalar_kernel<float>(at::TensorIteratorBase &, at::native::<unnamed>::OpType, T1)::[lambda(float) (instance 1)], at::detail::Array<char *, (int)2>>(int, T2, T3)
0.0% 27.584 μs 1 27.584 μs 27.584 μs 27.584 μs 27.584 μs 0 ns void at::native::vectorized_elementwise_kernel<(int)4, at::native::FillFunctor<float>, at::detail::Array<char *, (int)1>>(int, T2, T3)
0.0% 4.544 μs 1 4.544 μs 4.544 μs 4.544 μs 4.544 μs 0 ns void at::native::vectorized_elementwise_kernel<(int)4, at::native::FillFunctor<c10::Half>, at::detail::Array<char *, (int)1>>(int, T2, T3)
Thanks.