I tried to load a YOLOv3 model implemented in TensorFlow 1.13.1 on a Jetson Nano. Loading the model and running inference on the very first image takes about 30-40 seconds; after that, inference speed seems fine.
I am aware that TensorFlow initializes lazily, so the first inference is slower than the following ones. My main problem, however, is reducing the loading time.
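To tell loading time apart from the lazy first-inference cost, each phase can be timed on its own. The following is only a minimal sketch: `FakeModel` and `load_model` are hypothetical stand-ins (not TensorFlow APIs) that mimic a model doing one-off setup on its first call, and would be replaced by the real graph loading and `session.run` calls.

```python
import time


def timed(label, fn, *args, **kwargs):
    """Run fn, print its wall-clock time, and return (result, seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.3f} s")
    return result, elapsed


class FakeModel:
    """Hypothetical stand-in for a real model with lazy one-off setup."""

    def __init__(self):
        self._ready = False

    def predict(self, image):
        if not self._ready:
            time.sleep(0.05)  # pretend lazy initialisation on first use
            self._ready = True
        time.sleep(0.005)  # pretend steady-state inference
        return []


def load_model():
    """Hypothetical loader; a real one would parse the frozen graph."""
    return FakeModel()


model, t_load = timed("load", load_model)
_, t_first = timed("first inference", model.predict, None)
_, t_second = timed("second inference", model.predict, None)
```

With the real model substituted in, the three printed numbers would show how much of the 30-40 seconds belongs to loading and how much to the first inference.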
I noticed that the memory cache architecture of the ARM-based Jetson system might differ from that of an x86 system;
see Section 3.2, "Pinned Memory", of CUDA for Tegra :: CUDA Toolkit Documentation.
I also did some profiling:
[1] Bandwidth, measured with /usr/local/cuda/samples/1_Utilities/bandwidthTest/
[CUDA Bandwidth Test] - Starting...
Running on...
Device 0: NVIDIA Tegra X1
Quick Mode
Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 10027.2
Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 10232.8
Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 16295.2
Result = PASS
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
[2] nvprof result. The file is very long, so I am pasting only part of it.
Please note that this part caught my attention:
API calls: 46.93% 14.7844s 8 1.84805s 191.36us 14.7818s cudaStreamCreateWithFlags
23.42% 7.37703s 3 2.45901s 19.740us 7.37699s cudaFree
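To make these two rows easier to compare, they can be parsed into seconds. This is a rough sketch that assumes the nvprof column order shown in the header further down (Time(%), Time, Calls, Avg, Min, Max, Name); the helper names `to_seconds` and `parse_api_row` are made up for illustration.

```python
import re

UNIT_TO_SECONDS = {"s": 1.0, "ms": 1e-3, "us": 1e-6, "ns": 1e-9}


def to_seconds(text):
    """Convert an nvprof duration such as '14.7844s' or '191.36us' to seconds."""
    match = re.fullmatch(r"([0-9.]+)(ns|us|ms|s)", text)
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    value, unit = match.groups()
    return float(value) * UNIT_TO_SECONDS[unit]


def parse_api_row(line):
    """Parse one API-call row: Time(%) Time Calls Avg Min Max Name."""
    fields = line.replace("API calls:", "").split()
    percent, total, calls, _avg, _min, longest = fields[:6]
    return {
        "name": " ".join(fields[6:]),
        "percent": float(percent.rstrip("%")),
        "total_s": to_seconds(total),
        "calls": int(calls),
        "max_s": to_seconds(longest),
    }


# The two rows quoted above, copied from the nvprof output.
rows = [
    "API calls: 46.93% 14.7844s 8 1.84805s 191.36us 14.7818s cudaStreamCreateWithFlags",
    "23.42% 7.37703s 3 2.45901s 19.740us 7.37699s cudaFree",
]
parsed = [parse_api_row(row) for row in rows]
for row in parsed:
    share = row["max_s"] / row["total_s"]
    print(f"{row['name']}: {row['total_s']:.2f} s total, "
          f"slowest call is {share:.1%} of that")

combined = sum(row["total_s"] for row in parsed)
print(f"combined: {combined:.2f} s")
```

In both rows a single call accounts for almost the entire total (14.7818 s of 14.7844 s, and 7.37699 s of 7.37703 s), so the cost appears to be a one-off event rather than something spread evenly over all calls.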
2019-07-31 20:37:02.020447: W tensorflow/core/platform/profile_utils/cpu_utils.cc:98] Failed to find bogomips in /proc/cpuinfo; cannot determine CPU frequency
2019-07-31 20:37:02.021248: I tensorflow/compiler/xla/service/service.cc:161] XLA service 0x11462190 executing computations on platform Host. Devices:
2019-07-31 20:37:02.021320: I tensorflow/compiler/xla/service/service.cc:168] StreamExecutor device (0): <undefined>, <undefined>
==26697== NVPROF is profiling process 26697, command: python3 use_frozen_pb_cv.py
==26697== Warning: Unified Memory Profiling is not supported on the underlying platform. System requirements for unified memory can be found at: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-requirements
2019-07-31 20:37:02.627295: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:965] ARM64 does not support NUMA - returning NUMA node zero
2019-07-31 20:37:02.627651: I tensorflow/compiler/xla/service/service.cc:161] XLA service 0x11565da0 executing computations on platform CUDA. Devices:
2019-07-31 20:37:02.627707: I tensorflow/compiler/xla/service/service.cc:168] StreamExecutor device (0): NVIDIA Tegra X1, Compute Capability 5.3
2019-07-31 20:37:02.628831: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: NVIDIA Tegra X1 major: 5 minor: 3 memoryClockRate(GHz): 0.9216
pciBusID: 0000:00:00.0
totalMemory: 3.87GiB freeMemory: 1.97GiB
2019-07-31 20:37:02.628887: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-07-31 20:37:10.006752: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-07-31 20:37:10.006847: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0
2019-07-31 20:37:10.006881: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N
2019-07-31 20:37:10.007092: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 1057 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X1, pci bus id: 0000:00:00.0, compute capability: 5.3)
2019-07-31 20:37:37.301632: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.18GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-07-31 20:37:37.437082: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.71GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-07-31 20:37:37.467467: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.77GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-07-31 20:37:39.393261: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.17GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-07-31 20:37:39.405155: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.74GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-07-31 20:37:39.423244: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.76GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-07-31 20:37:39.679544: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.25GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-07-31 20:37:39.698698: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.25GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-07-31 20:37:39.723154: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.50GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-07-31 20:37:39.753625: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.32GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
==26697== Profiling application: python3 use_frozen_pb_cv.py
==26697== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 14.80% 1.14248s 400 2.8562ms 735.64us 24.469ms maxwell_gcgemm_64x64_nt
9.40% 726.05ms 511 1.4208ms 474.33us 2.5696ms maxwell_scudnn_128x128_relu_interior_nn
9.29% 717.27ms 320 2.2415ms 394.96us 6.9945ms void tensorflow::DepthwiseConv2dGPUKernelNCHW<float, int=3, int=3, int=1>(tensorflow::DepthwiseArgs, float const *, float const , tensorflow::DepthwiseArgs*, int)
7.97% 615.23ms 3 205.08ms 152.71ms 262.61ms maxwell_cgemm_64x64_tn
6.63% 512.26ms 24 21.344ms 55.730us 102.38ms void transpose_readWrite_alignment_kernel<float2, float2, int=1, bool=0, int=6, int=4, int=4>(cublasTransposeParams<float2>, float2 const *, float2*, float2 const *)
6.34% 489.82ms 4 122.46ms 65.600ms 176.88ms maxwell_cgemm_32x64_tn
4.98% 384.68ms 1296 296.82us 17.032us 3.3148ms void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=2, int=1, int>, int=16, Eigen::MakePointer>, Eigen::TensorCwiseBinaryOp<Eigen::internal::scalar_product_op<float, float>, Eigen::TensorMap<Eigen::Tensor<float const , int=2, int=1, int>, int=16, Eigen::MakePointer> const , Eigen::TensorBroadcastingOp<Eigen::array<long, unsigned long=2> const , Eigen::TensorMap<Eigen::Tensor<float const , int=2, int=1, int>, int=16, Eigen::MakePointer> const > const > const > const , Eigen::GpuDevice>, int>(float, int=2)
4.88% 376.55ms 1136 331.47us 15.104us 3.3492ms void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=2, int=1, int>, int=16, Eigen::MakePointer>, Eigen::TensorCwiseBinaryOp<Eigen::internal::scalar_sum_op<float, float>, Eigen::TensorMap<Eigen::Tensor<float const , int=2, int=1, int>, int=16, Eigen::MakePointer> const , Eigen::TensorBroadcastingOp<Eigen::array<long, unsigned long=2> const , Eigen::TensorMap<Eigen::Tensor<float const , int=2, int=1, int>, int=16, Eigen::MakePointer> const > const > const > const , Eigen::GpuDevice>, int>(float, int=2)
4.43% 341.70ms 1024 333.69us 32.241us 3.3214ms void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=1, int=1, long>, int=16, Eigen::MakePointer>, Eigen::TensorCwiseBinaryOp<Eigen::internal::scalar_min_op<float, float>, Eigen::TensorCwiseBinaryOp<Eigen::internal::scalar_max_op<float const , float const >, Eigen::TensorMap<Eigen::Tensor<float const , int=1, int=1, long>, int=16, Eigen::MakePointer> const , Eigen::TensorCwiseNullaryOp<Eigen::internal::scalar_constant_op<float const >, Eigen::TensorMap<Eigen::Tensor<float const , int=1, int=1, long>, int=16, Eigen::MakePointer> const > const > const , Eigen::TensorCwiseNullaryOp<Eigen::internal::scalar_constant_op<float>, Eigen::TensorCwiseBinaryOp<Eigen::internal::scalar_max_op<float const , float const >, Eigen::TensorMap<Eigen::Tensor<float const , int=1, int=1, long>, int=16, Eigen::MakePointer> const , Eigen::TensorCwiseNullaryOp<Eigen::internal::scalar_constant_op<float const >, Eigen::TensorMap<Eigen::Tensor<float const , int=1, int=1, long>, int=16, Eigen::MakePointer> const > const > const > const > const > const , Eigen::GpuDevice>, long>(float, int=1)
3.63% 280.59ms 8 35.074ms 390.74us 85.445ms void DSE::regular_fft_pad<int=0, int=1, int=128, int=16, int=32, int=1, float, float, float2>(float2*, float*, int, int3, float*, int, float*, float*, int, int, int, int, int, bool)
2.76% 212.90ms 188 1.1324ms 235.01us 9.3788ms maxwell_scudnn_128x32_relu_interior_nn
2.58% 199.04ms 8 24.880ms 274.90us 62.037ms void DSE::vector_fft<int=0, int=1, int=128, int=8, int=8, int=1, float, float, float2>(float2*, float2, int, int3, float2*)
2.42% 186.82ms 199 938.78us 215.16us 2.1786ms maxwell_scudnn_128x64_relu_interior_nn
1.88% 144.79ms 64 2.2623ms 741.68us 4.7555ms void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=3, int=1, int>, int=16, Eigen::MakePointer>, Eigen::TensorPaddingOp<Eigen::array<Eigen::IndexPair<int>, unsigned long=3> const , Eigen::TensorMap<Eigen::Tensor<float const , int=3, int=1, int>, int=16, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, int>(float, int=3)
1.87% 144.51ms 928 155.72us 3.3340us 972.00us void tensorflow::functor::ShuffleInTensor3Simple<float, int=2, int=1, int=0, bool=0>(int, float const *, tensorflow::functor::Dimension<int=3>, tensorflow::functor::ShuffleInTensor3Simple<float, int=2, int=1, int=0, bool=0>*)
1.87% 144.35ms 3 48.116ms 20.803ms 82.021ms void fermiPlusCgemmLDS128_batched<bool=1, bool=0, bool=0, bool=0, int=4, int=4, int=4, int=3, int=3, bool=1, bool=0>(float2* const *, float2* const *, float2* const *, float2*, float2 const *, float2 const *, int, int, int, int, int, int, __int64, __int64, __int64, float2 const *, float2 const *, float2, float2, int)
1.85% 142.58ms 8 17.822ms 153.55us 44.904ms void fft2d_r2c_64x64<float>(float2*, float const *, int, int, int, int, int, int, int, int)
1.70% 131.07ms 2 65.535ms 32.642ms 98.428ms void fft2d_r2c_32x32<float, bool=0, unsigned int=1, bool=0>(float2*, float const *, int, int, int, int, int, int, int, int, int, cudnn::reduced_divisor, bool, int2, int, int)
1.59% 122.43ms 31 3.9493ms 77.865us 21.442ms void fft1d_r2c_32<float, float, float2, bool=1, bool=0>(float2*, float const *, int, int3, int3, int2, int2)
1.21% 93.137ms 50 1.8627ms 183.60us 17.884ms void cudnn::detail::explicit_convolve_sgemm<float, int, int=128, int=5, int=5, int=3, int=3, int=3, int=0, bool=1>(int, int, int, float const *, int, float const , int, cudnn::detail::explicit_convolve_sgemm<float, int, int=128, int=5, int=5, int=3, int=3, int=3, int=0, bool=1>*, kernel_conv_params, int, int, float, float, int, float const *, float const *)
0.82% 63.314ms 82 772.12us 65.887us 4.2735ms void im2col4d_kernel<float, int>(im2col4d_params, cudnnConvolutionStruct, cudnnTensor4dStruct, float const *, float*, int)
0.72% 55.459ms 539 102.89us 208ns 3.6507ms [CUDA memcpy HtoD]
0.62% 47.976ms 27 1.7769ms 376.99us 7.3388ms void cudnn::detail::explicit_convolve_sgemm<float, int, int=1024, int=5, int=5, int=3, int=3, int=3, int=0, bool=1>(int, int, int, float const *, int, float const , int, cudnn::detail::explicit_convolve_sgemm<float, int, int=1024, int=5, int=5, int=3, int=3, int=3, int=0, bool=1>*, kernel_conv_params, int, int, float, float, int, float const *, float const *)
0.54% 41.491ms 16 2.5932ms 1.5372ms 18.355ms void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=4, int=1, int>, int=16, Eigen::MakePointer>, Eigen::TensorPaddingOp<Eigen::array<Eigen::IndexPair<int>, unsigned long=4> const , Eigen::TensorMap<Eigen::Tensor<float const , int=4, int=1, int>, int=16, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, int>(float, int=4)
0.52% 40.179ms 29 1.3855ms 359.54us 15.518ms void cudnn::detail::implicit_convolve_sgemm<float, float, int=1024, int=5, int=5, int=3, int=3, int=3, int=1, bool=1, bool=0, bool=1>(int, int, int, float const *, int, float*, cudnn::detail::implicit_convolve_sgemm<float, float, int=1024, int=5, int=5, int=3, int=3, int=3, int=1, bool=1, bool=0, bool=1>*, kernel_conv_params, int, float, float, int, float, float, int, int)
0.52% 40.128ms 16 2.5080ms 1.4894ms 17.731ms void tensorflow::functor::SwapDimension1And2InTensor3UsingTiles<unsigned int, int=1024, int=1024, int=2, bool=0>(unsigned int const *, tensorflow::functor::Dimension<int=3>, tensorflow::functor::SwapDimension1And2InTensor3UsingTiles<unsigned int, int=1024, int=1024, int=2, bool=0>*)
0.45% 34.444ms 68 506.53us 104.01us 908.41us void fft1d_r2c_32<float, float, float2, bool=0, bool=0>(float2*, float const *, int, int3, int3, int2, int2)
0.35% 27.400ms 96 285.42us 189.12us 383.03us void tensorflow::DepthwiseConv2dGPUKernelNCHWSmall<float, tensorflow::DepthwiseConv2dDirection, int=3, int=3, int=4, bool=0, float>(tensorflow::DepthwiseArgs, float const *, float const , tensorflow::DepthwiseArgs*)
0.29% 22.267ms 29 767.83us 349.43us 2.4565ms maxwell_gcgemm_32x32_nt
0.28% 21.435ms 32 669.83us 441.73us 898.67us void tensorflow::_GLOBAL__N__79_tmpxft_00006606_00000000_8_resize_nearest_neighbor_op_gpu_cu_compute_72_cpp1_ii_9d63fafd::ResizeNearestNeighborNHWC<float, bool=0>(int, float const *, int, int, int, int, int, float, float, tensorflow::_GLOBAL__N__79_tmpxft_00006606_00000000_8_resize_nearest_neighbor_op_gpu_cu_compute_72_cpp1_ii_9d63fafd::ResizeNearestNeighborNHWC<float, bool=0>*)
0.26% 20.017ms 13 1.5397ms 319.02us 3.7690ms void cudnn::detail::implicit_convolve_sgemm<float, float, int=128, int=5, int=5, int=3, int=3, int=3, int=1, bool=1, bool=0, bool=1>(int, int, int, float const *, int, float*, cudnn::detail::implicit_convolve_sgemm<float, float, int=128, int=5, int=5, int=3, int=3, int=3, int=1, bool=1, bool=0, bool=1>*, kernel_conv_params, int, float, float, int, float, float, int, int)
0.24% 18.156ms 4 4.5389ms 983.04us 9.5538ms void fft1d_c2r_256<float2, float, float, bool=0, bool=1, bool=0, bool=0>(float*, float2 const *, int3, int3, int2, int, float, float, float*, float*)
0.22% 16.957ms 5 3.3914ms 1.6574ms 6.8495ms void cudnn::detail::implicit_convolve_sgemm<float, float, int=128, int=6, int=7, int=3, int=3, int=5, int=1, bool=1, bool=0, bool=1>(int, int, int, float const *, int, float*, cudnn::detail::implicit_convolve_sgemm<float, float, int=128, int=6, int=7, int=3, int=3, int=5, int=1, bool=1, bool=0, bool=1>*, kernel_conv_params, int, float, float, int, float, float, int, int)
0.18% 13.713ms 112 122.43us 8.1780us 371.78us void tensorflow::functor::SwapDimension1And2InTensor3UsingTiles<unsigned int, int=256, int=32, int=32, bool=0>(unsigned int const *, tensorflow::functor::Dimension<int=3>, tensorflow::functor::SwapDimension1And2InTensor3UsingTiles<unsigned int, int=256, int=32, int=32, bool=0>*)
0.17% 13.500ms 160 84.376us 26.354us 304.07us void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=1, int=1, long>, int=16, Eigen::MakePointer>, Eigen::TensorCwiseBinaryOp<Eigen::internal::scalar_sum_op<float const , float const >, Eigen::TensorMap<Eigen::Tensor<float const , int=1, int=1, long>, int=16, Eigen::MakePointer> const , Eigen::TensorMap<Eigen::Tensor<float const , int=1, int=1, long>, int=16, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, long>(float, int=1)
0.17% 12.962ms 361 35.906us 20.989us 132.87us void fft2d_c2r_32x32<float, bool=1, bool=0, unsigned int=0, bool=0, bool=0>(float*, float2 const *, int, int, int, int, int, int, int, int, int, float, float, cudnn::reduced_divisor, bool, float*, float*, int2, int, int)
0.16% 12.312ms 4 3.0779ms 1.3521ms 5.9912ms maxwell_gcgemm_64x32_nt
0.16% 12.184ms 4 3.0459ms 1.6120ms 4.8880ms void fft1d_r2c_256<float, float, float2, bool=0, bool=0>(float2*, float const *, int3, int3, int2, int2)
0.13% 10.225ms 160 63.905us 17.239us 227.51us void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=2, int=1, int>, int=16, Eigen::MakePointer>, Eigen::TensorCwiseBinaryOp<Eigen::internal::scalar_sum_op<float, float>, Eigen::TensorBroadcastingOp<Eigen::array<long, unsigned long=2> const , Eigen::TensorMap<Eigen::Tensor<float const , int=2, int=1, int>, int=16, Eigen::MakePointer> const > const , Eigen::TensorMap<Eigen::Tensor<float const , int=2, int=1, int>, int=16, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, int>(float, int=2)
0.13% 9.7821ms 64 152.85us 63.804us 315.42us [CUDA memcpy DtoD]
0.12% 9.0018ms 5 1.8004ms 851.79us 2.6088ms void cudnn::detail::explicit_convolve_sgemm<float, int, int=128, int=6, int=7, int=3, int=3, int=5, int=0, bool=1>(int, int, int, float const *, int, float const , int, cudnn::detail::explicit_convolve_sgemm<float, int, int=128, int=6, int=7, int=3, int=3, int=5, int=0, bool=1>*, kernel_conv_params, int, int, float, float, int, float const *, float const *)
0.11% 8.7578ms 68 128.79us 19.583us 355.58us void fft1d_c2r_32<float2, float, float, bool=0, bool=1, bool=0, bool=0>(float*, float2 const *, int, int3, int3, int2, int, float, float, float*, float*)
0.11% 8.2866ms 363 22.828us 15.834us 702.36us void fft2d_r2c_32x32<float, bool=0, unsigned int=0, bool=0>(float2*, float const *, int, int, int, int, int, int, int, int, int, cudnn::reduced_divisor, bool, int2, int, int)
0.08% 6.3695ms 3 2.1232ms 1.1449ms 2.7957ms void cudnn::detail::implicit_convolve_sgemm<float, float, int=512, int=6, int=8, int=3, int=3, int=5, int=1, bool=1, bool=0, bool=1>(int, int, int, float const *, int, float*, cudnn::detail::implicit_convolve_sgemm<float, float, int=512, int=6, int=8, int=3, int=3, int=5, int=1, bool=1, bool=0, bool=1>*, kernel_conv_params, int, float, float, int, float, float, int, int)
0.08% 6.1736ms 4 1.5434ms 660.95us 4.0799ms void DSE::regular_fft_clip<int=1, int=2, int=128, int=16, int=32, int=1, float, float, float2>(float*, float2*, int, int3, float2*, int, float2*, float2*, int, int, int, int, int, float, float, bool, int, float, float)
0.08% 5.8314ms 898 6.4930us 1.5620us 437.77us cudnn::maxwell::gemm::computeOffsetsKernel(cudnn::maxwell::gemm::ComputeOffsetsParams)
0.06% 4.5674ms 6 761.24us 281.41us 2.0896ms void flip_filter<float, float>(float*, float const *, int, int, int, int)
0.05% 3.7718ms 192 19.644us 2.5010us 67.553us void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=5, int=1, int>, int=16, Eigen::MakePointer>, Eigen::TensorSlicingOp<Eigen::DSizes<int, int=5> const , Eigen::DSizes<int, int=5> const , Eigen::TensorMap<Eigen::Tensor<float const , int=5, int=1, int>, int=16, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, int>(float, int=5)
0.05% 3.5457ms 192 18.467us 2.7610us 56.669us void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorSlicingOp<Eigen::array<int, unsigned long=2> const , Eigen::array<int, unsigned long=2> const , Eigen::TensorMap<Eigen::Tensor<float, int=2, int=1, int>, int=16, Eigen::MakePointer>>, Eigen::TensorMap<Eigen::Tensor<float const , int=2, int=1, int>, int=16, Eigen::MakePointer> const > const , Eigen::GpuDevice>, int>(int, unsigned long=2)
0.04% 3.0087ms 48 62.680us 11.146us 148.91us void tensorflow::BiasNCHWKernel<float>(int, float const *, float const , tensorflow::BiasNCHWKernel<float>*, int, int)
0.03% 2.6555ms 928 2.8610us 1.5110us 22.605us void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=1, int=1, int>, int=16, Eigen::MakePointer>, Eigen::TensorCwiseBinaryOp<Eigen::internal::scalar_product_op<float, float>, Eigen::TensorMap<Eigen::Tensor<float const , int=1, int=1, int>, int=16, Eigen::MakePointer> const , Eigen::TensorMap<Eigen::Tensor<float const , int=1, int=1, long>, int=16, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, long>(float, int=1)
0.03% 2.4404ms 4 610.09us 280.74us 1.5840ms void DSE::vector_fft<int=1, int=2, int=128, int=8, int=8, int=1, float, float, float2>(float2*, float2, int, int3, float2*)
0.02% 1.8090ms 48 37.688us 6.8750us 83.388us void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<int, int=5, int=1, int>, int=16, Eigen::MakePointer>, Eigen::TensorBroadcastingOp<Eigen::array<int, unsigned long=5> const , Eigen::TensorMap<Eigen::Tensor<int const , int=5, int=1, int>, int=16, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, int>(int, int=5)
0.02% 1.7707ms 512 3.4580us 1.5620us 22.241us void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=1, int=1, int>, int=16, Eigen::MakePointer>, Eigen::TensorCwiseBinaryOp<Eigen::internal::scalar_difference_op<float, float>, Eigen::TensorMap<Eigen::Tensor<float const , int=1, int=1, int>, int=16, Eigen::MakePointer> const , Eigen::TensorMap<Eigen::Tensor<float const , int=1, int=1, long>, int=16, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, long>(float, int=1)
0.02% 1.5792ms 512 3.0840us 1.5620us 21.667us void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=1, int=1, int>, int=16, Eigen::MakePointer>, Eigen::TensorCwiseUnaryOp<Eigen::internal::scalar_right<float, float, Eigen::internal::scalar_sum_op<float, float>>, Eigen::TensorMap<Eigen::Tensor<float const , int=1, int=1, int>, int=16, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, int>(float, int=1)
0.02% 1.5748ms 208 7.5700us 417ns 67.761us [CUDA memcpy DtoH]
0.02% 1.3160ms 4 329.00us 103.60us 857.62us void fft2d_c2r_64x64<float, bool=0>(float*, float2*, int, int, int, int, int, int, int, int, int, int, float, float, int, float*, float*)
0.02% 1.1924ms 464 2.5690us 1.3020us 18.593us void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=1, int=1, int>, int=16, Eigen::MakePointer>, Eigen::TensorCwiseUnaryOp<Eigen::internal::scalar_rsqrt_op<float>, Eigen::TensorMap<Eigen::Tensor<float const , int=1, int=1, int>, int=16, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, int>(float, int=1)
0.01% 1.0475ms 10 104.75us 13.022us 206.10us compute_gemm_pointers(float2**, float2 const *, int, float2 const *, int, float2 const *, int, int)
0.01% 995.65us 96 10.371us 2.5000us 35.991us void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=1, int=1, int>, int=16, Eigen::MakePointer>, Eigen::TensorCwiseUnaryOp<Eigen::internal::scalar_logistic_op<float>, Eigen::TensorMap<Eigen::Tensor<float const , int=1, int=1, int>, int=16, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, int>(float, int=1)
0.01% 646.21us 96 6.7310us 1.9800us 15.885us void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=1, int=1, int>, int=16, Eigen::MakePointer>, Eigen::TensorCwiseUnaryOp<Eigen::internal::scalar_exp_op<float>, Eigen::TensorMap<Eigen::Tensor<float const , int=1, int=1, int>, int=16, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, int>(float, int=1)
0.01% 646.11us 2 323.05us 173.39us 472.72us void fft1d_r2c_256<float, float, float2, bool=1, bool=0>(float2*, float const *, int3, int3, int2, int2)
0.01% 536.31us 96 5.5860us 1.6670us 16.146us void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=1, int=1, int>, int=16, Eigen::MakePointer>, Eigen::TensorCwiseUnaryOp<Eigen::internal::scalar_right<float, float, Eigen::internal::scalar_product_op<float, float>>, Eigen::TensorMap<Eigen::Tensor<float const , int=1, int=1, int>, int=16, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, int>(float, int=1)
0.01% 400.48us 96 4.1710us 1.8750us 7.4480us void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<int, int=2, int=1, int>, int=16, Eigen::MakePointer>, Eigen::TensorBroadcastingOp<Eigen::array<int, unsigned long=2> const , Eigen::TensorMap<Eigen::Tensor<int const , int=2, int=1, int>, int=16, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, int>(int, int=2)
0.00% 379.80us 48 7.9120us 2.3950us 22.344us void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=1, int=1, int>, int=16, Eigen::MakePointer>, Eigen::TensorCwiseBinaryOp<Eigen::internal::scalar_sum_op<float, float>, Eigen::TensorMap<Eigen::Tensor<float const , int=1, int=1, int>, int=16, Eigen::MakePointer> const , Eigen::TensorMap<Eigen::Tensor<float const , int=1, int=1, long>, int=16, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, long>(float, int=1)
0.00% 367.61us 1 367.61us 367.61us 367.61us void fft2d_r2c_32x32<float, bool=0, unsigned int=1, bool=1>(float2*, float const *, int, int, int, int, int, int, int, int, int, cudnn::reduced_divisor, bool, int2, int, int)
0.00% 248.91us 48 5.1850us 1.7710us 15.053us void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=1, int=1, long>, int=16, Eigen::MakePointer>, Eigen::TensorConversionOp<float, Eigen::TensorMap<Eigen::Tensor<int const , int=1, int=1, long>, int=16, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, long>(float, int=1)
0.00% 174.80us 2 87.398us 37.501us 137.30us void fft2d_c2r_32x32<float, bool=0, bool=0, unsigned int=1, bool=0, bool=0>(float*, float2 const *, int, int, int, int, int, int, int, int, int, float, float, cudnn::reduced_divisor, bool, float*, float*, int2, int, int)
0.00% 95.838us 48 1.9960us 1.3540us 3.0200us void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<double, int=1, int=1, int>, int=16, Eigen::MakePointer>, Eigen::TensorCwiseUnaryOp<Eigen::internal::scalar_right<double, double, Eigen::internal::scalar_product_op<double, double>>, Eigen::TensorMap<Eigen::Tensor<double const , int=1, int=1, int>, int=16, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, int>(double, int=1)
0.00% 92.350us 48 1.9230us 1.4060us 2.9690us void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<double, int=1, int=1, long>, int=16, Eigen::MakePointer>, Eigen::TensorConversionOp<double, Eigen::TensorMap<Eigen::Tensor<int const , int=1, int=1, long>, int=16, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, long>(double, int=1)
0.00% 73.961us 48 1.5400us 1.1980us 2.3960us void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<int, int=1, int=1, long>, int=16, Eigen::MakePointer>, Eigen::TensorConversionOp<int, Eigen::TensorMap<Eigen::Tensor<double const , int=1, int=1, long>, int=16, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, long>(int, int=1)
0.00% 54.479us 16 3.4040us 2.2400us 17.968us void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<bool, int=1, int=1, int>, int=16, Eigen::MakePointer>, Eigen::TensorCwiseUnaryOp<Eigen::internal::scalar_right<bool, bool, Eigen::internal::scalar_boolean_and_op>, Eigen::TensorMap<Eigen::Tensor<bool const , int=1, int=1, int>, int=16, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, int>(bool, int=1)
0.00% 22.863us 5 4.5720us 2.2390us 10.729us [CUDA memset]
API calls: 46.93% 14.7844s 8 1.84805s 191.36us 14.7818s cudaStreamCreateWithFlags
23.42% 7.37703s 3 2.45901s 19.740us 7.37699s cudaFree
13.05% 4.11199s 146 28.164ms 111.20us 884.36ms cuEventSynchronize
7.05% 2.22204s 12331 180.20us 36.771us 115.99ms cudaLaunchKernel
2.77% 872.64ms 1 872.64ms 872.64ms 872.64ms cuMemAlloc
1.34% 423.70ms 162761 2.6030us 1.2500us 971.79us cuEventQuery
1.33% 419.13ms 1 419.13ms 419.13ms 419.13ms cuDevicePrimaryCtxRetain
0.74% 231.70ms 538 430.66us 34.793us 63.535ms cuMemcpyHtoDAsync
0.71% 223.51ms 64 3.4924ms 57.762us 88.971ms cudaMemcpyAsync
0.56% 176.36ms 1816 97.116us 2.5520us 45.091ms cuEventRecord
0.51% 159.34ms 1309 121.73us 2.0310us 82.706ms cudaDeviceGetAttribute
0.25% 77.792ms 1 77.792ms 77.792ms 77.792ms cudaMemcpy
0.21% 65.420ms 1578 41.457us 3.8540us 39.495ms cudaEventRecord
0.21% 65.361ms 214 305.42us 10.990us 28.245ms cudaBindTexture
0.18% 57.078ms 301 189.63us 1.3550us 56.209ms cuEventDestroy
0.15% 48.475ms 208 233.05us 33.855us 9.0938ms cuMemcpyDtoHAsync
0.13% 40.079ms 1 40.079ms 40.079ms 40.079ms cudaDeviceGetStreamPriorityRange
0.13% 39.604ms 320 123.76us 21.980us 2.6420ms cudaFuncGetAttributes
0.06% 19.482ms 312 62.442us 2.0310us 17.310ms cuEventCreate
0.05% 14.711ms 323 45.544us 4.8430us 12.027ms cudaGetDevice
0.05% 14.314ms 2 7.1571ms 3.4380us 14.311ms cudaGetDeviceCount
0.03% 8.7023ms 7 1.2432ms 33.334us 6.3506ms cudaMalloc
0.02% 7.7651ms 3 2.5884ms 1.5164ms 4.1124ms cuMemHostAlloc
0.02% 6.4986ms 4 1.6246ms 23.803us 6.4107ms cudaMemsetAsync
0.02% 6.2439ms 416 15.009us 5.0520us 536.31us cudaOccupancyMaxActiveBlocksPerMultiprocessorWithFlags
0.02% 6.0182ms 1015 5.9290us 3.7500us 116.93us cudaStreamWaitEvent
0.02% 4.8808ms 4357 1.1200us 469ns 177.97us cudaGetLastError
0.02% 4.8564ms 214 22.693us 3.5940us 224.38us cudaUnbindTexture
0.02% 4.7371ms 4 1.1843ms 123.39us 4.0659ms cudaStreamCreateWithPriority
0.01% 4.3089ms 762 5.6540us 2.3960us 213.18us cuStreamWaitEvent
0.00% 1.5101ms 1 1.5101ms 1.5101ms 1.5101ms cudaHostAlloc
0.00% 1.0841ms 11 98.552us 16.407us 723.14us cuStreamCreate
0.00% 743.09us 17 43.711us 17.970us 306.99us cuCtxSynchronize
0.00% 720.12us 146 4.9320us 2.9170us 128.13us cuEventElapsedTime
0.00% 367.35us 34 10.804us 8.1260us 41.668us cudaEventCreate
0.00% 348.75us 32 10.898us 8.0210us 37.709us cudaStreamQuery
0.00% 346.62us 202 1.7150us 886ns 42.813us cuDeviceGetAttribute
0.00% 268.81us 34 7.9060us 5.6250us 27.084us cudaEventDestroy
0.00% 253.34us 3 84.445us 77.761us 94.273us cudaGetDeviceProperties
0.00% 249.07us 28 8.8950us 4.3750us 40.521us cudaEventCreateWithFlags
0.00% 154.01us 1 154.01us 154.01us 154.01us cuMemsetD32
0.00% 74.115us 7 10.587us 4.1670us 17.135us cuCtxSetCurrent
0.00% 34.011us 1 34.011us 34.011us 34.011us cuDeviceGetProperties
0.00% 30.105us 2 15.052us 11.927us 18.178us cuMemGetInfo
0.00% 28.646us 11 2.6040us 1.3020us 5.4690us cuDeviceGetCount
0.00% 27.448us 3 9.1490us 6.9270us 11.719us cuDeviceTotalMem
0.00% 15.729us 1 15.729us 15.729us 15.729us cudaHostGetDevicePointer
0.00% 12.239us 2 6.1190us 5.8330us 6.4060us cudaSetDevice
0.00% 11.979us 2 5.9890us 3.3850us 8.5940us cuInit
0.00% 11.613us 4 2.9030us 1.7180us 4.9480us cuDeviceGet
0.00% 11.562us 3 3.8540us 1.0410us 8.3860us cuDriverGetVersion
0.00% 6.5620us 3 2.1870us 1.8230us 2.7080us cuDeviceGetName
0.00% 5.0520us 1 5.0520us 5.0520us 5.0520us cuDeviceGetPCIBusId
0.00% 3.7500us 1 3.7500us 3.7500us 3.7500us cuDevicePrimaryCtxGetState
0.00% 2.7090us 2 1.3540us 1.3540us 1.3550us cuDeviceGetUuid
0.00% 2.7080us 1 2.7080us 2.7080us 2.7080us cuCtxGetCurrent
0.00% 2.4480us 1 2.4480us 2.4480us 2.4480us cuDeviceComputeCapability
Could anyone tell me what can be done to reduce the loading time? Isn't CUDA memory shared with CPU memory on the Jetson Nano board?
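For reference, the pauses in the TensorFlow log above can be located by comparing consecutive timestamps. The sketch below assumes every log line starts with a 26-character timestamp in the format seen in the log; `largest_gaps` is an illustrative helper, and the sample messages are trimmed for brevity.

```python
from datetime import datetime

STAMP_FORMAT = "%Y-%m-%d %H:%M:%S.%f"
STAMP_LENGTH = 26  # e.g. "2019-07-31 20:37:02.628887"


def largest_gaps(log_lines, top=2):
    """Return the `top` largest pauses as (seconds, before_msg, after_msg)."""
    entries = []
    for line in log_lines:
        try:
            stamp = datetime.strptime(line[:STAMP_LENGTH], STAMP_FORMAT)
        except ValueError:
            continue  # skip lines without a leading timestamp
        entries.append((stamp, line[STAMP_LENGTH:].lstrip(": ")))
    gaps = [
        ((after[0] - before[0]).total_seconds(), before[1], after[1])
        for before, after in zip(entries, entries[1:])
    ]
    return sorted(gaps, key=lambda gap: gap[0], reverse=True)[:top]


# Timestamps taken from the log above; message text trimmed for brevity.
sample = [
    "2019-07-31 20:37:02.628887: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512]",
    "2019-07-31 20:37:10.006752: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984]",
    "2019-07-31 20:37:10.007092: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115]",
    "2019-07-31 20:37:37.301632: W tensorflow/core/common_runtime/bfc_allocator.cc:211]",
]
gaps = largest_gaps(sample)
for seconds, before, after in gaps:
    print(f"{seconds:.3f} s between '{before}' and '{after}'")
```

On these four lines the two largest pauses are about 27.3 s (after the TensorFlow device is created and before the first allocator warning) and about 7.4 s (after "Adding visible gpu devices"). The second is close to the cudaFree total reported by nvprof, although that match is only suggestive.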