Jetson Nano takes 30-40 seconds to load a TensorFlow YOLOv3 model

I tried to load a YOLOv3 model implemented in TensorFlow (1.13.1) on a Jetson Nano, but it takes about 30-40 seconds to load the model and run inference on the very first image. After that, inference speed seems fine.

I understand that TensorFlow loads lazily, so some of the work is deferred until the first image inference. My main problem, though, is to reduce this loading time.
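To see how the 30-40 seconds split between loading and the lazy first inference, each phase can be timed on its own. This is only a minimal sketch: `load_model` and `run_inference` are hypothetical stand-ins for importing the frozen graph and calling `session.run`, not functions from the actual script or from TensorFlow.

```python
import time


def time_phases(load_model, run_inference, warm_runs=3):
    """Time model loading, the first inference, and later inferences separately.

    load_model and run_inference are hypothetical callables standing in for
    the frozen-graph import and session.run; they are not TensorFlow APIs.
    """
    timings = {}

    start = time.perf_counter()
    model = load_model()
    timings["load"] = time.perf_counter() - start

    # The first call pays any one-time initialization cost.
    start = time.perf_counter()
    run_inference(model)
    timings["first_inference"] = time.perf_counter() - start

    # Later calls show the steady-state cost per inference.
    start = time.perf_counter()
    for _ in range(warm_runs):
        run_inference(model)
    timings["steady_inference"] = (time.perf_counter() - start) / warm_runs

    return timings
```

If `first_inference` is far larger than `steady_inference`, most of the delay is one-time initialization rather than graph loading.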

I noticed that the memory cache architecture of an ARM Jetson system might differ from that of an x86 system;
see Section 3.2, Pinned Memory, of CUDA for Tegra :: CUDA Toolkit Documentation.

I also did some profiling:

[1] Bandwidth, measured with /usr/local/cuda/samples/1_Utilities/bandwidthTest/

[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: NVIDIA Tegra X1
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes) Bandwidth(MB/s)
   33554432   10027.2

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes) Bandwidth(MB/s)
   33554432   10232.8

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes) Bandwidth(MB/s)
   33554432   16295.2

Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

[2] nvprof result. The file is very long, so I only copied part of it.
Please note the part that interested me:
API calls: 46.93% 14.7844s 8 1.84805s 191.36us 14.7818s cudaStreamCreateWithFlags
23.42% 7.37703s 3 2.45901s 19.740us 7.37699s cudaFree
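
The share of these two calls can be totalled directly from the nvprof summary. A minimal sketch, assuming the column order of the two lines above (percent, total time, calls, avg, min, max, name); `to_seconds` and `parse_api_line` are helper names made up for this illustration.

```python
UNIT_SECONDS = {"ns": 1e-9, "us": 1e-6, "ms": 1e-3, "s": 1.0}


def to_seconds(text):
    """Convert an nvprof duration such as '191.36us' or '14.7844s' to seconds."""
    # Check the two-letter suffixes first so 'ms' is not mistaken for 's'.
    for unit in ("ns", "us", "ms", "s"):
        if text.endswith(unit):
            return float(text[: -len(unit)]) * UNIT_SECONDS[unit]
    raise ValueError("unrecognized duration: " + text)


def parse_api_line(line):
    """Parse one summary row: percent, total, calls, avg, min, max, name."""
    fields = line.replace("API calls:", "").split()
    percent, total, calls, _avg, _min, longest = fields[:6]
    return {
        "name": fields[6],
        "percent": float(percent.rstrip("%")),
        "total_s": to_seconds(total),
        "calls": int(calls),
        "max_s": to_seconds(longest),
    }


rows = [
    parse_api_line("API calls: 46.93% 14.7844s 8 1.84805s 191.36us 14.7818s cudaStreamCreateWithFlags"),
    parse_api_line("23.42% 7.37703s 3 2.45901s 19.740us 7.37699s cudaFree"),
]
for row in rows:
    # Compare the total time with the time spent in the single slowest call.
    print(row["name"], round(row["total_s"], 2), "s total,",
          round(row["max_s"], 2), "s in the slowest call")
print("combined:", round(sum(r["total_s"] for r in rows), 2), "s,",
      round(sum(r["percent"] for r in rows), 2), "% of API time")
```

Together the two calls take about 22.2 s, roughly 70% of the API time, and in each case a single call (14.78 s and 7.38 s) accounts for almost all of it.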

2019-07-31 20:37:02.020447: W tensorflow/core/platform/profile_utils/cpu_utils.cc:98] Failed to find bogomips in /proc/cpuinfo; cannot determine CPU frequency
2019-07-31 20:37:02.021248: I tensorflow/compiler/xla/service/service.cc:161] XLA service 0x11462190 executing computations on platform Host. Devices:
2019-07-31 20:37:02.021320: I tensorflow/compiler/xla/service/service.cc:168]   StreamExecutor device (0): <undefined>, <undefined>
==26697== NVPROF is profiling process 26697, command: python3 use_frozen_pb_cv.py
==26697== Warning: Unified Memory Profiling is not supported on the underlying platform. System requirements for unified memory can be found at: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-requirements
2019-07-31 20:37:02.627295: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:965] ARM64 does not support NUMA - returning NUMA node zero
2019-07-31 20:37:02.627651: I tensorflow/compiler/xla/service/service.cc:161] XLA service 0x11565da0 executing computations on platform CUDA. Devices:
2019-07-31 20:37:02.627707: I tensorflow/compiler/xla/service/service.cc:168]   StreamExecutor device (0): NVIDIA Tegra X1, Compute Capability 5.3
2019-07-31 20:37:02.628831: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties: 
name: NVIDIA Tegra X1 major: 5 minor: 3 memoryClockRate(GHz): 0.9216
pciBusID: 0000:00:00.0
totalMemory: 3.87GiB freeMemory: 1.97GiB
2019-07-31 20:37:02.628887: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-07-31 20:37:10.006752: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-07-31 20:37:10.006847: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 
2019-07-31 20:37:10.006881: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N 
2019-07-31 20:37:10.007092: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 1057 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X1, pci bus id: 0000:00:00.0, compute capability: 5.3)
2019-07-31 20:37:37.301632: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.18GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-07-31 20:37:37.437082: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.71GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-07-31 20:37:37.467467: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.77GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-07-31 20:37:39.393261: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.17GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-07-31 20:37:39.405155: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.74GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-07-31 20:37:39.423244: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.76GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-07-31 20:37:39.679544: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.25GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-07-31 20:37:39.698698: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.25GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-07-31 20:37:39.723154: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.50GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-07-31 20:37:39.753625: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.32GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
==26697== Profiling application: python3 use_frozen_pb_cv.py
==26697== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   14.80%  1.14248s       400  2.8562ms  735.64us  24.469ms  maxwell_gcgemm_64x64_nt
                    9.40%  726.05ms       511  1.4208ms  474.33us  2.5696ms  maxwell_scudnn_128x128_relu_interior_nn
                    9.29%  717.27ms       320  2.2415ms  394.96us  6.9945ms  void tensorflow::DepthwiseConv2dGPUKernelNCHW<float, int=3, int=3, int=1>(tensorflow::DepthwiseArgs, float const *, float const , tensorflow::DepthwiseArgs*, int)
                    7.97%  615.23ms         3  205.08ms  152.71ms  262.61ms  maxwell_cgemm_64x64_tn
                    6.63%  512.26ms        24  21.344ms  55.730us  102.38ms  void transpose_readWrite_alignment_kernel<float2, float2, int=1, bool=0, int=6, int=4, int=4>(cublasTransposeParams<float2>, float2 const *, float2*, float2 const *)
                    6.34%  489.82ms         4  122.46ms  65.600ms  176.88ms  maxwell_cgemm_32x64_tn
                    4.98%  384.68ms      1296  296.82us  17.032us  3.3148ms  void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=2, int=1, int>, int=16, Eigen::MakePointer>, Eigen::TensorCwiseBinaryOp<Eigen::internal::scalar_product_op<float, float>, Eigen::TensorMap<Eigen::Tensor<float const , int=2, int=1, int>, int=16, Eigen::MakePointer> const , Eigen::TensorBroadcastingOp<Eigen::array<long, unsigned long=2> const , Eigen::TensorMap<Eigen::Tensor<float const , int=2, int=1, int>, int=16, Eigen::MakePointer> const > const > const > const , Eigen::GpuDevice>, int>(float, int=2)
                    4.88%  376.55ms      1136  331.47us  15.104us  3.3492ms  void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=2, int=1, int>, int=16, Eigen::MakePointer>, Eigen::TensorCwiseBinaryOp<Eigen::internal::scalar_sum_op<float, float>, Eigen::TensorMap<Eigen::Tensor<float const , int=2, int=1, int>, int=16, Eigen::MakePointer> const , Eigen::TensorBroadcastingOp<Eigen::array<long, unsigned long=2> const , Eigen::TensorMap<Eigen::Tensor<float const , int=2, int=1, int>, int=16, Eigen::MakePointer> const > const > const > const , Eigen::GpuDevice>, int>(float, int=2)
                    4.43%  341.70ms      1024  333.69us  32.241us  3.3214ms  void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=1, int=1, long>, int=16, Eigen::MakePointer>, Eigen::TensorCwiseBinaryOp<Eigen::internal::scalar_min_op<float, float>, Eigen::TensorCwiseBinaryOp<Eigen::internal::scalar_max_op<float const , float const >, Eigen::TensorMap<Eigen::Tensor<float const , int=1, int=1, long>, int=16, Eigen::MakePointer> const , Eigen::TensorCwiseNullaryOp<Eigen::internal::scalar_constant_op<float const >, Eigen::TensorMap<Eigen::Tensor<float const , int=1, int=1, long>, int=16, Eigen::MakePointer> const > const > const , Eigen::TensorCwiseNullaryOp<Eigen::internal::scalar_constant_op<float>, Eigen::TensorCwiseBinaryOp<Eigen::internal::scalar_max_op<float const , float const >, Eigen::TensorMap<Eigen::Tensor<float const , int=1, int=1, long>, int=16, Eigen::MakePointer> const , Eigen::TensorCwiseNullaryOp<Eigen::internal::scalar_constant_op<float const >, Eigen::TensorMap<Eigen::Tensor<float const , int=1, int=1, long>, int=16, Eigen::MakePointer> const > const > const > const > const > const , Eigen::GpuDevice>, long>(float, int=1)
                    3.63%  280.59ms         8  35.074ms  390.74us  85.445ms  void DSE::regular_fft_pad<int=0, int=1, int=128, int=16, int=32, int=1, float, float, float2>(float2*, float*, int, int3, float*, int, float*, float*, int, int, int, int, int, bool)
                    2.76%  212.90ms       188  1.1324ms  235.01us  9.3788ms  maxwell_scudnn_128x32_relu_interior_nn
                    2.58%  199.04ms         8  24.880ms  274.90us  62.037ms  void DSE::vector_fft<int=0, int=1, int=128, int=8, int=8, int=1, float, float, float2>(float2*, float2, int, int3, float2*)
                    2.42%  186.82ms       199  938.78us  215.16us  2.1786ms  maxwell_scudnn_128x64_relu_interior_nn
                    1.88%  144.79ms        64  2.2623ms  741.68us  4.7555ms  void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=3, int=1, int>, int=16, Eigen::MakePointer>, Eigen::TensorPaddingOp<Eigen::array<Eigen::IndexPair<int>, unsigned long=3> const , Eigen::TensorMap<Eigen::Tensor<float const , int=3, int=1, int>, int=16, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, int>(float, int=3)
                    1.87%  144.51ms       928  155.72us  3.3340us  972.00us  void tensorflow::functor::ShuffleInTensor3Simple<float, int=2, int=1, int=0, bool=0>(int, float const *, tensorflow::functor::Dimension<int=3>, tensorflow::functor::ShuffleInTensor3Simple<float, int=2, int=1, int=0, bool=0>*)
                    1.87%  144.35ms         3  48.116ms  20.803ms  82.021ms  void fermiPlusCgemmLDS128_batched<bool=1, bool=0, bool=0, bool=0, int=4, int=4, int=4, int=3, int=3, bool=1, bool=0>(float2* const *, float2* const *, float2* const *, float2*, float2 const *, float2 const *, int, int, int, int, int, int, __int64, __int64, __int64, float2 const *, float2 const *, float2, float2, int)
                    1.85%  142.58ms         8  17.822ms  153.55us  44.904ms  void fft2d_r2c_64x64<float>(float2*, float const *, int, int, int, int, int, int, int, int)
                    1.70%  131.07ms         2  65.535ms  32.642ms  98.428ms  void fft2d_r2c_32x32<float, bool=0, unsigned int=1, bool=0>(float2*, float const *, int, int, int, int, int, int, int, int, int, cudnn::reduced_divisor, bool, int2, int, int)
                    1.59%  122.43ms        31  3.9493ms  77.865us  21.442ms  void fft1d_r2c_32<float, float, float2, bool=1, bool=0>(float2*, float const *, int, int3, int3, int2, int2)
                    1.21%  93.137ms        50  1.8627ms  183.60us  17.884ms  void cudnn::detail::explicit_convolve_sgemm<float, int, int=128, int=5, int=5, int=3, int=3, int=3, int=0, bool=1>(int, int, int, float const *, int, float const , int, cudnn::detail::explicit_convolve_sgemm<float, int, int=128, int=5, int=5, int=3, int=3, int=3, int=0, bool=1>*, kernel_conv_params, int, int, float, float, int, float const *, float const *)
                    0.82%  63.314ms        82  772.12us  65.887us  4.2735ms  void im2col4d_kernel<float, int>(im2col4d_params, cudnnConvolutionStruct, cudnnTensor4dStruct, float const *, float*, int)
                    0.72%  55.459ms       539  102.89us     208ns  3.6507ms  [CUDA memcpy HtoD]
                    0.62%  47.976ms        27  1.7769ms  376.99us  7.3388ms  void cudnn::detail::explicit_convolve_sgemm<float, int, int=1024, int=5, int=5, int=3, int=3, int=3, int=0, bool=1>(int, int, int, float const *, int, float const , int, cudnn::detail::explicit_convolve_sgemm<float, int, int=1024, int=5, int=5, int=3, int=3, int=3, int=0, bool=1>*, kernel_conv_params, int, int, float, float, int, float const *, float const *)
                    0.54%  41.491ms        16  2.5932ms  1.5372ms  18.355ms  void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=4, int=1, int>, int=16, Eigen::MakePointer>, Eigen::TensorPaddingOp<Eigen::array<Eigen::IndexPair<int>, unsigned long=4> const , Eigen::TensorMap<Eigen::Tensor<float const , int=4, int=1, int>, int=16, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, int>(float, int=4)
                    0.52%  40.179ms        29  1.3855ms  359.54us  15.518ms  void cudnn::detail::implicit_convolve_sgemm<float, float, int=1024, int=5, int=5, int=3, int=3, int=3, int=1, bool=1, bool=0, bool=1>(int, int, int, float const *, int, float*, cudnn::detail::implicit_convolve_sgemm<float, float, int=1024, int=5, int=5, int=3, int=3, int=3, int=1, bool=1, bool=0, bool=1>*, kernel_conv_params, int, float, float, int, float, float, int, int)
                    0.52%  40.128ms        16  2.5080ms  1.4894ms  17.731ms  void tensorflow::functor::SwapDimension1And2InTensor3UsingTiles<unsigned int, int=1024, int=1024, int=2, bool=0>(unsigned int const *, tensorflow::functor::Dimension<int=3>, tensorflow::functor::SwapDimension1And2InTensor3UsingTiles<unsigned int, int=1024, int=1024, int=2, bool=0>*)
                    0.45%  34.444ms        68  506.53us  104.01us  908.41us  void fft1d_r2c_32<float, float, float2, bool=0, bool=0>(float2*, float const *, int, int3, int3, int2, int2)
                    0.35%  27.400ms        96  285.42us  189.12us  383.03us  void tensorflow::DepthwiseConv2dGPUKernelNCHWSmall<float, tensorflow::DepthwiseConv2dDirection, int=3, int=3, int=4, bool=0, float>(tensorflow::DepthwiseArgs, float const *, float const , tensorflow::DepthwiseArgs*)
                    0.29%  22.267ms        29  767.83us  349.43us  2.4565ms  maxwell_gcgemm_32x32_nt
                    0.28%  21.435ms        32  669.83us  441.73us  898.67us  void tensorflow::_GLOBAL__N__79_tmpxft_00006606_00000000_8_resize_nearest_neighbor_op_gpu_cu_compute_72_cpp1_ii_9d63fafd::ResizeNearestNeighborNHWC<float, bool=0>(int, float const *, int, int, int, int, int, float, float, tensorflow::_GLOBAL__N__79_tmpxft_00006606_00000000_8_resize_nearest_neighbor_op_gpu_cu_compute_72_cpp1_ii_9d63fafd::ResizeNearestNeighborNHWC<float, bool=0>*)
                    0.26%  20.017ms        13  1.5397ms  319.02us  3.7690ms  void cudnn::detail::implicit_convolve_sgemm<float, float, int=128, int=5, int=5, int=3, int=3, int=3, int=1, bool=1, bool=0, bool=1>(int, int, int, float const *, int, float*, cudnn::detail::implicit_convolve_sgemm<float, float, int=128, int=5, int=5, int=3, int=3, int=3, int=1, bool=1, bool=0, bool=1>*, kernel_conv_params, int, float, float, int, float, float, int, int)
                    0.24%  18.156ms         4  4.5389ms  983.04us  9.5538ms  void fft1d_c2r_256<float2, float, float, bool=0, bool=1, bool=0, bool=0>(float*, float2 const *, int3, int3, int2, int, float, float, float*, float*)
                    0.22%  16.957ms         5  3.3914ms  1.6574ms  6.8495ms  void cudnn::detail::implicit_convolve_sgemm<float, float, int=128, int=6, int=7, int=3, int=3, int=5, int=1, bool=1, bool=0, bool=1>(int, int, int, float const *, int, float*, cudnn::detail::implicit_convolve_sgemm<float, float, int=128, int=6, int=7, int=3, int=3, int=5, int=1, bool=1, bool=0, bool=1>*, kernel_conv_params, int, float, float, int, float, float, int, int)
                    0.18%  13.713ms       112  122.43us  8.1780us  371.78us  void tensorflow::functor::SwapDimension1And2InTensor3UsingTiles<unsigned int, int=256, int=32, int=32, bool=0>(unsigned int const *, tensorflow::functor::Dimension<int=3>, tensorflow::functor::SwapDimension1And2InTensor3UsingTiles<unsigned int, int=256, int=32, int=32, bool=0>*)
                    0.17%  13.500ms       160  84.376us  26.354us  304.07us  void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=1, int=1, long>, int=16, Eigen::MakePointer>, Eigen::TensorCwiseBinaryOp<Eigen::internal::scalar_sum_op<float const , float const >, Eigen::TensorMap<Eigen::Tensor<float const , int=1, int=1, long>, int=16, Eigen::MakePointer> const , Eigen::TensorMap<Eigen::Tensor<float const , int=1, int=1, long>, int=16, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, long>(float, int=1)
                    0.17%  12.962ms       361  35.906us  20.989us  132.87us  void fft2d_c2r_32x32<float, bool=1, bool=0, unsigned int=0, bool=0, bool=0>(float*, float2 const *, int, int, int, int, int, int, int, int, int, float, float, cudnn::reduced_divisor, bool, float*, float*, int2, int, int)
                    0.16%  12.312ms         4  3.0779ms  1.3521ms  5.9912ms  maxwell_gcgemm_64x32_nt
                    0.16%  12.184ms         4  3.0459ms  1.6120ms  4.8880ms  void fft1d_r2c_256<float, float, float2, bool=0, bool=0>(float2*, float const *, int3, int3, int2, int2)
                    0.13%  10.225ms       160  63.905us  17.239us  227.51us  void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=2, int=1, int>, int=16, Eigen::MakePointer>, Eigen::TensorCwiseBinaryOp<Eigen::internal::scalar_sum_op<float, float>, Eigen::TensorBroadcastingOp<Eigen::array<long, unsigned long=2> const , Eigen::TensorMap<Eigen::Tensor<float const , int=2, int=1, int>, int=16, Eigen::MakePointer> const > const , Eigen::TensorMap<Eigen::Tensor<float const , int=2, int=1, int>, int=16, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, int>(float, int=2)
                    0.13%  9.7821ms        64  152.85us  63.804us  315.42us  [CUDA memcpy DtoD]
                    0.12%  9.0018ms         5  1.8004ms  851.79us  2.6088ms  void cudnn::detail::explicit_convolve_sgemm<float, int, int=128, int=6, int=7, int=3, int=3, int=5, int=0, bool=1>(int, int, int, float const *, int, float const , int, cudnn::detail::explicit_convolve_sgemm<float, int, int=128, int=6, int=7, int=3, int=3, int=5, int=0, bool=1>*, kernel_conv_params, int, int, float, float, int, float const *, float const *)
                    0.11%  8.7578ms        68  128.79us  19.583us  355.58us  void fft1d_c2r_32<float2, float, float, bool=0, bool=1, bool=0, bool=0>(float*, float2 const *, int, int3, int3, int2, int, float, float, float*, float*)
                    0.11%  8.2866ms       363  22.828us  15.834us  702.36us  void fft2d_r2c_32x32<float, bool=0, unsigned int=0, bool=0>(float2*, float const *, int, int, int, int, int, int, int, int, int, cudnn::reduced_divisor, bool, int2, int, int)
                    0.08%  6.3695ms         3  2.1232ms  1.1449ms  2.7957ms  void cudnn::detail::implicit_convolve_sgemm<float, float, int=512, int=6, int=8, int=3, int=3, int=5, int=1, bool=1, bool=0, bool=1>(int, int, int, float const *, int, float*, cudnn::detail::implicit_convolve_sgemm<float, float, int=512, int=6, int=8, int=3, int=3, int=5, int=1, bool=1, bool=0, bool=1>*, kernel_conv_params, int, float, float, int, float, float, int, int)
                    0.08%  6.1736ms         4  1.5434ms  660.95us  4.0799ms  void DSE::regular_fft_clip<int=1, int=2, int=128, int=16, int=32, int=1, float, float, float2>(float*, float2*, int, int3, float2*, int, float2*, float2*, int, int, int, int, int, float, float, bool, int, float, float)
                    0.08%  5.8314ms       898  6.4930us  1.5620us  437.77us  cudnn::maxwell::gemm::computeOffsetsKernel(cudnn::maxwell::gemm::ComputeOffsetsParams)
                    0.06%  4.5674ms         6  761.24us  281.41us  2.0896ms  void flip_filter<float, float>(float*, float const *, int, int, int, int)
                    0.05%  3.7718ms       192  19.644us  2.5010us  67.553us  void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=5, int=1, int>, int=16, Eigen::MakePointer>, Eigen::TensorSlicingOp<Eigen::DSizes<int, int=5> const , Eigen::DSizes<int, int=5> const , Eigen::TensorMap<Eigen::Tensor<float const , int=5, int=1, int>, int=16, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, int>(float, int=5)
                    0.05%  3.5457ms       192  18.467us  2.7610us  56.669us  void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorSlicingOp<Eigen::array<int, unsigned long=2> const , Eigen::array<int, unsigned long=2> const , Eigen::TensorMap<Eigen::Tensor<float, int=2, int=1, int>, int=16, Eigen::MakePointer>>, Eigen::TensorMap<Eigen::Tensor<float const , int=2, int=1, int>, int=16, Eigen::MakePointer> const > const , Eigen::GpuDevice>, int>(int, unsigned long=2)
                    0.04%  3.0087ms        48  62.680us  11.146us  148.91us  void tensorflow::BiasNCHWKernel<float>(int, float const *, float const , tensorflow::BiasNCHWKernel<float>*, int, int)
                    0.03%  2.6555ms       928  2.8610us  1.5110us  22.605us  void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=1, int=1, int>, int=16, Eigen::MakePointer>, Eigen::TensorCwiseBinaryOp<Eigen::internal::scalar_product_op<float, float>, Eigen::TensorMap<Eigen::Tensor<float const , int=1, int=1, int>, int=16, Eigen::MakePointer> const , Eigen::TensorMap<Eigen::Tensor<float const , int=1, int=1, long>, int=16, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, long>(float, int=1)
                    0.03%  2.4404ms         4  610.09us  280.74us  1.5840ms  void DSE::vector_fft<int=1, int=2, int=128, int=8, int=8, int=1, float, float, float2>(float2*, float2, int, int3, float2*)
                    0.02%  1.8090ms        48  37.688us  6.8750us  83.388us  void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<int, int=5, int=1, int>, int=16, Eigen::MakePointer>, Eigen::TensorBroadcastingOp<Eigen::array<int, unsigned long=5> const , Eigen::TensorMap<Eigen::Tensor<int const , int=5, int=1, int>, int=16, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, int>(int, int=5)
                    0.02%  1.7707ms       512  3.4580us  1.5620us  22.241us  void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=1, int=1, int>, int=16, Eigen::MakePointer>, Eigen::TensorCwiseBinaryOp<Eigen::internal::scalar_difference_op<float, float>, Eigen::TensorMap<Eigen::Tensor<float const , int=1, int=1, int>, int=16, Eigen::MakePointer> const , Eigen::TensorMap<Eigen::Tensor<float const , int=1, int=1, long>, int=16, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, long>(float, int=1)
                    0.02%  1.5792ms       512  3.0840us  1.5620us  21.667us  void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=1, int=1, int>, int=16, Eigen::MakePointer>, Eigen::TensorCwiseUnaryOp<Eigen::internal::scalar_right<float, float, Eigen::internal::scalar_sum_op<float, float>>, Eigen::TensorMap<Eigen::Tensor<float const , int=1, int=1, int>, int=16, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, int>(float, int=1)
                    0.02%  1.5748ms       208  7.5700us     417ns  67.761us  [CUDA memcpy DtoH]
                    0.02%  1.3160ms         4  329.00us  103.60us  857.62us  void fft2d_c2r_64x64<float, bool=0>(float*, float2*, int, int, int, int, int, int, int, int, int, int, float, float, int, float*, float*)
                    0.02%  1.1924ms       464  2.5690us  1.3020us  18.593us  void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=1, int=1, int>, int=16, Eigen::MakePointer>, Eigen::TensorCwiseUnaryOp<Eigen::internal::scalar_rsqrt_op<float>, Eigen::TensorMap<Eigen::Tensor<float const , int=1, int=1, int>, int=16, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, int>(float, int=1)
                    0.01%  1.0475ms        10  104.75us  13.022us  206.10us  compute_gemm_pointers(float2**, float2 const *, int, float2 const *, int, float2 const *, int, int)
                    0.01%  995.65us        96  10.371us  2.5000us  35.991us  void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=1, int=1, int>, int=16, Eigen::MakePointer>, Eigen::TensorCwiseUnaryOp<Eigen::internal::scalar_logistic_op<float>, Eigen::TensorMap<Eigen::Tensor<float const , int=1, int=1, int>, int=16, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, int>(float, int=1)
                    0.01%  646.21us        96  6.7310us  1.9800us  15.885us  void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=1, int=1, int>, int=16, Eigen::MakePointer>, Eigen::TensorCwiseUnaryOp<Eigen::internal::scalar_exp_op<float>, Eigen::TensorMap<Eigen::Tensor<float const , int=1, int=1, int>, int=16, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, int>(float, int=1)
                    0.01%  646.11us         2  323.05us  173.39us  472.72us  void fft1d_r2c_256<float, float, float2, bool=1, bool=0>(float2*, float const *, int3, int3, int2, int2)
                    0.01%  536.31us        96  5.5860us  1.6670us  16.146us  void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=1, int=1, int>, int=16, Eigen::MakePointer>, Eigen::TensorCwiseUnaryOp<Eigen::internal::scalar_right<float, float, Eigen::internal::scalar_product_op<float, float>>, Eigen::TensorMap<Eigen::Tensor<float const , int=1, int=1, int>, int=16, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, int>(float, int=1)
                    0.01%  400.48us        96  4.1710us  1.8750us  7.4480us  void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<int, int=2, int=1, int>, int=16, Eigen::MakePointer>, Eigen::TensorBroadcastingOp<Eigen::array<int, unsigned long=2> const , Eigen::TensorMap<Eigen::Tensor<int const , int=2, int=1, int>, int=16, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, int>(int, int=2)
                    0.00%  379.80us        48  7.9120us  2.3950us  22.344us  void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=1, int=1, int>, int=16, Eigen::MakePointer>, Eigen::TensorCwiseBinaryOp<Eigen::internal::scalar_sum_op<float, float>, Eigen::TensorMap<Eigen::Tensor<float const , int=1, int=1, int>, int=16, Eigen::MakePointer> const , Eigen::TensorMap<Eigen::Tensor<float const , int=1, int=1, long>, int=16, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, long>(float, int=1)
                    0.00%  367.61us         1  367.61us  367.61us  367.61us  void fft2d_r2c_32x32<float, bool=0, unsigned int=1, bool=1>(float2*, float const *, int, int, int, int, int, int, int, int, int, cudnn::reduced_divisor, bool, int2, int, int)
                    0.00%  248.91us        48  5.1850us  1.7710us  15.053us  void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=1, int=1, long>, int=16, Eigen::MakePointer>, Eigen::TensorConversionOp<float, Eigen::TensorMap<Eigen::Tensor<int const , int=1, int=1, long>, int=16, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, long>(float, int=1)
                    0.00%  174.80us         2  87.398us  37.501us  137.30us  void fft2d_c2r_32x32<float, bool=0, bool=0, unsigned int=1, bool=0, bool=0>(float*, float2 const *, int, int, int, int, int, int, int, int, int, float, float, cudnn::reduced_divisor, bool, float*, float*, int2, int, int)
                    0.00%  95.838us        48  1.9960us  1.3540us  3.0200us  void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<double, int=1, int=1, int>, int=16, Eigen::MakePointer>, Eigen::TensorCwiseUnaryOp<Eigen::internal::scalar_right<double, double, Eigen::internal::scalar_product_op<double, double>>, Eigen::TensorMap<Eigen::Tensor<double const , int=1, int=1, int>, int=16, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, int>(double, int=1)
                    0.00%  92.350us        48  1.9230us  1.4060us  2.9690us  void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<double, int=1, int=1, long>, int=16, Eigen::MakePointer>, Eigen::TensorConversionOp<double, Eigen::TensorMap<Eigen::Tensor<int const , int=1, int=1, long>, int=16, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, long>(double, int=1)
                    0.00%  73.961us        48  1.5400us  1.1980us  2.3960us  void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<int, int=1, int=1, long>, int=16, Eigen::MakePointer>, Eigen::TensorConversionOp<int, Eigen::TensorMap<Eigen::Tensor<double const , int=1, int=1, long>, int=16, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, long>(int, int=1)
                    0.00%  54.479us        16  3.4040us  2.2400us  17.968us  void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<bool, int=1, int=1, int>, int=16, Eigen::MakePointer>, Eigen::TensorCwiseUnaryOp<Eigen::internal::scalar_right<bool, bool, Eigen::internal::scalar_boolean_and_op>, Eigen::TensorMap<Eigen::Tensor<bool const , int=1, int=1, int>, int=16, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, int>(bool, int=1)
                    0.00%  22.863us         5  4.5720us  2.2390us  10.729us  [CUDA memset]
      API calls:   46.93%  14.7844s         8  1.84805s  191.36us  14.7818s  cudaStreamCreateWithFlags
                   23.42%  7.37703s         3  2.45901s  19.740us  7.37699s  cudaFree
                   13.05%  4.11199s       146  28.164ms  111.20us  884.36ms  cuEventSynchronize
                    7.05%  2.22204s     12331  180.20us  36.771us  115.99ms  cudaLaunchKernel
                    2.77%  872.64ms         1  872.64ms  872.64ms  872.64ms  cuMemAlloc
                    1.34%  423.70ms    162761  2.6030us  1.2500us  971.79us  cuEventQuery
                    1.33%  419.13ms         1  419.13ms  419.13ms  419.13ms  cuDevicePrimaryCtxRetain
                    0.74%  231.70ms       538  430.66us  34.793us  63.535ms  cuMemcpyHtoDAsync
                    0.71%  223.51ms        64  3.4924ms  57.762us  88.971ms  cudaMemcpyAsync
                    0.56%  176.36ms      1816  97.116us  2.5520us  45.091ms  cuEventRecord
                    0.51%  159.34ms      1309  121.73us  2.0310us  82.706ms  cudaDeviceGetAttribute
                    0.25%  77.792ms         1  77.792ms  77.792ms  77.792ms  cudaMemcpy
                    0.21%  65.420ms      1578  41.457us  3.8540us  39.495ms  cudaEventRecord
                    0.21%  65.361ms       214  305.42us  10.990us  28.245ms  cudaBindTexture
                    0.18%  57.078ms       301  189.63us  1.3550us  56.209ms  cuEventDestroy
                    0.15%  48.475ms       208  233.05us  33.855us  9.0938ms  cuMemcpyDtoHAsync
                    0.13%  40.079ms         1  40.079ms  40.079ms  40.079ms  cudaDeviceGetStreamPriorityRange
                    0.13%  39.604ms       320  123.76us  21.980us  2.6420ms  cudaFuncGetAttributes
                    0.06%  19.482ms       312  62.442us  2.0310us  17.310ms  cuEventCreate
                    0.05%  14.711ms       323  45.544us  4.8430us  12.027ms  cudaGetDevice
                    0.05%  14.314ms         2  7.1571ms  3.4380us  14.311ms  cudaGetDeviceCount
                    0.03%  8.7023ms         7  1.2432ms  33.334us  6.3506ms  cudaMalloc
                    0.02%  7.7651ms         3  2.5884ms  1.5164ms  4.1124ms  cuMemHostAlloc
                    0.02%  6.4986ms         4  1.6246ms  23.803us  6.4107ms  cudaMemsetAsync
                    0.02%  6.2439ms       416  15.009us  5.0520us  536.31us  cudaOccupancyMaxActiveBlocksPerMultiprocessorWithFlags
                    0.02%  6.0182ms      1015  5.9290us  3.7500us  116.93us  cudaStreamWaitEvent
                    0.02%  4.8808ms      4357  1.1200us     469ns  177.97us  cudaGetLastError
                    0.02%  4.8564ms       214  22.693us  3.5940us  224.38us  cudaUnbindTexture
                    0.02%  4.7371ms         4  1.1843ms  123.39us  4.0659ms  cudaStreamCreateWithPriority
                    0.01%  4.3089ms       762  5.6540us  2.3960us  213.18us  cuStreamWaitEvent
                    0.00%  1.5101ms         1  1.5101ms  1.5101ms  1.5101ms  cudaHostAlloc
                    0.00%  1.0841ms        11  98.552us  16.407us  723.14us  cuStreamCreate
                    0.00%  743.09us        17  43.711us  17.970us  306.99us  cuCtxSynchronize
                    0.00%  720.12us       146  4.9320us  2.9170us  128.13us  cuEventElapsedTime
                    0.00%  367.35us        34  10.804us  8.1260us  41.668us  cudaEventCreate
                    0.00%  348.75us        32  10.898us  8.0210us  37.709us  cudaStreamQuery
                    0.00%  346.62us       202  1.7150us     886ns  42.813us  cuDeviceGetAttribute
                    0.00%  268.81us        34  7.9060us  5.6250us  27.084us  cudaEventDestroy
                    0.00%  253.34us         3  84.445us  77.761us  94.273us  cudaGetDeviceProperties
                    0.00%  249.07us        28  8.8950us  4.3750us  40.521us  cudaEventCreateWithFlags
                    0.00%  154.01us         1  154.01us  154.01us  154.01us  cuMemsetD32
                    0.00%  74.115us         7  10.587us  4.1670us  17.135us  cuCtxSetCurrent
                    0.00%  34.011us         1  34.011us  34.011us  34.011us  cuDeviceGetProperties
                    0.00%  30.105us         2  15.052us  11.927us  18.178us  cuMemGetInfo
                    0.00%  28.646us        11  2.6040us  1.3020us  5.4690us  cuDeviceGetCount
                    0.00%  27.448us         3  9.1490us  6.9270us  11.719us  cuDeviceTotalMem
                    0.00%  15.729us         1  15.729us  15.729us  15.729us  cudaHostGetDevicePointer
                    0.00%  12.239us         2  6.1190us  5.8330us  6.4060us  cudaSetDevice
                    0.00%  11.979us         2  5.9890us  3.3850us  8.5940us  cuInit
                    0.00%  11.613us         4  2.9030us  1.7180us  4.9480us  cuDeviceGet
                    0.00%  11.562us         3  3.8540us  1.0410us  8.3860us  cuDriverGetVersion
                    0.00%  6.5620us         3  2.1870us  1.8230us  2.7080us  cuDeviceGetName
                    0.00%  5.0520us         1  5.0520us  5.0520us  5.0520us  cuDeviceGetPCIBusId
                    0.00%  3.7500us         1  3.7500us  3.7500us  3.7500us  cuDevicePrimaryCtxGetState
                    0.00%  2.7090us         2  1.3540us  1.3540us  1.3550us  cuDeviceGetUuid
                    0.00%  2.7080us         1  2.7080us  2.7080us  2.7080us  cuCtxGetCurrent
                    0.00%  2.4480us         1  2.4480us  2.4480us  2.4480us  cuDeviceComputeCapability

Could anyone tell me what can be done to reduce the loading time? Isn’t CUDA memory shared with CPU memory on the Jetson Nano board?
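To put the two highlighted rows in perspective, here is a minimal Python sketch that re-reads the top four "API calls" rows of the nvprof summary above. The helper names (`to_seconds`, `parse_rows`) are made up for illustration and are not part of nvprof; only the numbers come from the profile.

```python
import re

# Top four rows copied from the "API calls" section of the nvprof output above.
# Columns: share, total time, calls, average, min, max, API name.
NVPROF_ROWS = """\
46.93%  14.7844s         8  1.84805s  191.36us  14.7818s  cudaStreamCreateWithFlags
23.42%  7.37703s         3  2.45901s  19.740us  7.37699s  cudaFree
13.05%  4.11199s       146  28.164ms  111.20us  884.36ms  cuEventSynchronize
 7.05%  2.22204s     12331  180.20us  36.771us  115.99ms  cudaLaunchKernel
"""

UNIT_SECONDS = {"s": 1.0, "ms": 1e-3, "us": 1e-6, "ns": 1e-9}


def to_seconds(value):
    """Convert an nvprof duration such as '28.164ms' to seconds."""
    number, unit = re.fullmatch(r"([0-9.]+)(ns|us|ms|s)", value).groups()
    return float(number) * UNIT_SECONDS[unit]


def parse_rows(text):
    """Parse summary rows into dicts (hypothetical helper, not part of nvprof)."""
    rows = []
    for line in text.splitlines():
        share, total, calls, _avg, _min, longest, name = line.split()
        rows.append({
            "name": name,
            "share": float(share.rstrip("%")),
            "total": to_seconds(total),
            "calls": int(calls),
            "max": to_seconds(longest),
        })
    return rows


rows = parse_rows(NVPROF_ROWS)
top_two = rows[:2]
top_two_share = sum(r["share"] for r in top_two)
top_two_seconds = sum(r["total"] for r in top_two)

print(f"top two API calls: {top_two_share:.2f}% of API time, {top_two_seconds:.2f}s")
for r in rows:
    # Fraction of the row's total time that was spent in its single slowest call.
    fraction = r["max"] / r["total"]
    print(f"{r['name']}: {r['calls']} calls, slowest call = {fraction:.1%} of total")
```

Together, cudaStreamCreateWithFlags and cudaFree account for 70.35% of the API time (about 22.2 s), and in both rows a single call makes up more than 99% of the row total. That looks like a one-time cost rather than per-call overhead, although the summary alone does not show what happens inside those two calls.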

Hi,

It’s known that TensorFlow is not the optimal solution on the Jetson platform.
Do you have any dependency on TensorFlow?

If not, it’s recommended to use pure TensorRT for better performance.
You can find the YOLO sample in our DeepStream SDK:
https://developer.nvidia.com/deepstream-sdk
/opt/nvidia/deepstream/deepstream-4.0/sources/objectDetector_Yolo/

Thanks.

Hi AastaLLL,

Thanks for your advice. I deployed DeepStream 4.0 (DS-4) on the Jetson Nano.

However, it is much slower than TensorFlow-GPU. It takes 20+ minutes to load YOLOv3, and it only reaches 1.8 FPS, whereas TensorFlow-GPU reaches 3-4 FPS with a 20-30 second bootstrap time. (About the FPS, I guess DS-4 uses 1080p input while TensorFlow-GPU uses 480p input.)

I used the following commands to install and run it; please correct me if anything is wrong or unexpected.

cd /opt/nvidia/deepstream/deepstream-4.0/sources/objectDetector_Yolo
./prebuild.sh
export CUDA_VER=10.0
make -C nvdsinfer_custom_impl_Yolo
deepstream-app -c deepstream_app_config_yoloV3.txt

I noticed that building the TensorRT engine takes a lot of time. Here is some of the output:

Output blob names :
yolo_83
yolo_95
yolo_107
Total number of layers: 257
Total number of layers on DLA: 0
Building the TensorRT Engine...
Building complete!
0:23:01.939275664  9751     0x257cb960 INFO                 nvinfer gstnvinfer.cpp:519:gst_nvinfer_logger:<primary_gie_classifier> NvDsInferContext[UID 1]:generateTRTModel(): Storing the serialized cuda engine to file at /opt/nvidia/deepstream/deepstream-4.0/sources/objectDetector_Yolo/model_b1_fp16.engine
Deserialize yoloLayerV3 plugin: yolo_83
Deserialize yoloLayerV3 plugin: yolo_95
Deserialize yoloLayerV3 plugin: yolo_107

Runtime commands:
	h: Print this help
	q: Quit

	p: Pause
	r: Resume

NOTE: To expand a source in the 2D tiled display and view object details, left-click on the source.
      To go back to the tiled display, right-click anywhere on the window.

Please let me know if any actions can be taken to reduce the time consumption.
Otherwise, DS-4 or TensorRT doesn’t give any advantage over TensorFlow-GPU.

Thanks.

Hi arthur

The model engine only needs to be built the first time that the sample is run. If you set the ‘model-engine-file’ parameter in deepstream_app_config_yoloV3.txt to the path of the previously built model engine, it will reload it instead of generating a new one every time.

This won’t help the slow FPS, but at least it shortens the load time.

I’ve found that I have to do this for all the DeepStream samples on the Jetson Nano because the config files all point to ‘int8’ engine files, but as the Nano doesn’t support INT8, all the generated engines are named ‘fp16’.
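As a rough sketch of that config change, the Python snippet below rewrites the ‘model-engine-file’ key in a config text so that it points at the engine path reported in the log above. The helper `set_engine_file`, the `[primary-gie]` group name and the sample config text are assumptions for illustration, not a copy of the shipped deepstream_app_config_yoloV3.txt; check your own file for the group that actually holds the key.

```python
# Engine path as reported by the "Storing the serialized cuda engine" log line above.
ENGINE = ("/opt/nvidia/deepstream/deepstream-4.0/sources/"
          "objectDetector_Yolo/model_b1_fp16.engine")

# Illustrative config text only; the real file has more groups and keys.
SAMPLE_CONFIG = """\
[primary-gie]
enable=1
model-engine-file=model_b1_int8.engine
config-file=config_infer_primary_yoloV3.txt

[tracker]
enable=0
"""


def set_engine_file(config_text, engine_path, section="primary-gie"):
    """Return config_text with model-engine-file set inside [section].

    Hypothetical helper: edits line by line so that comments and key order
    are preserved. An existing key (active or commented out with '#') is
    replaced; if the section has no such key, one is appended to it.
    """
    key = "model-engine-file"
    new_line = f"{key}={engine_path}"
    out, in_section, done = [], False, False
    for line in config_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("[") and stripped.endswith("]"):
            if in_section and not done:
                # Leaving the target section without having seen the key.
                out.append(new_line)
                done = True
            in_section = stripped[1:-1] == section
        elif in_section and not done and \
                stripped.lstrip("#").strip().startswith(key + "="):
            line = new_line
            done = True
        out.append(line)
    if in_section and not done:
        out.append(new_line)
    return "\n".join(out) + "\n"


updated_config = set_engine_file(SAMPLE_CONFIG, ENGINE)
print(updated_config)
```

Editing the file by hand works just as well; the point is only that the key should name the fp16 engine that was actually generated rather than the int8 file name.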

Hi, arthur

This is not what we expect.
You should get much better performance with DeepStream than with TensorFlow.

Would you mind helping us check the following?
1. Does the GPU utilization of DeepStream already reach 99%?
2. Do TensorFlow and DeepStream use the same YOLO model?
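For the first check, one possible approach is to read the GPU load from tegrastats while the pipeline is running. The Python sketch below pulls the GR3D_FREQ percentage out of tegrastats-style lines. The sample lines are made up for illustration and are not measurements from this thread, and the helper name `gpu_load_percent` is hypothetical.

```python
import re

# Matches the GPU load field, e.g. "GR3D_FREQ 99%" or "GR3D_FREQ 99%@921".
GR3D_PATTERN = re.compile(r"GR3D_FREQ (\d+)%")


def gpu_load_percent(line):
    """Return the GPU load in percent from one tegrastats line, or None."""
    match = GR3D_PATTERN.search(line)
    return int(match.group(1)) if match else None


# Made-up lines in the general shape of tegrastats output (NOT real data).
SAMPLE_LINES = [
    "RAM 2100/3956MB CPU [40%@1428,35%@1428,30%@1428,38%@1428] EMC_FREQ 0% GR3D_FREQ 99%@921",
    "RAM 2101/3956MB CPU [42%@1428,33%@1428,31%@1428,36%@1428] EMC_FREQ 0% GR3D_FREQ 87%@921",
    "RAM 2101/3956MB CPU [41%@1428,34%@1428,29%@1428,37%@1428] EMC_FREQ 0%",
]

# Keep only the lines that actually carry a GPU load field.
loads = [load for load in map(gpu_load_percent, SAMPLE_LINES) if load is not None]
average_load = sum(loads) / len(loads)
print(f"samples: {loads}, average GPU load: {average_load:.1f}%")
```

If the average stays well below 99% while deepstream-app is running, the GPU is probably not the bottleneck and the rest of the pipeline deserves a closer look.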

Thanks.

Hi, Arthur

Has your problem been solved now?