Slow model loading on a Jetson AGX Xavier with TensorFlow 2.5.0

Hi there,

I’ve used nvcr.io/nvidia/l4t-tensorflow:r32.6.1-tf2.5-py3 as a base and updated to tensorflow-2.5.0+nv21.6-cp36-cp36m-linux_aarch64 inside the container. I then downloaded this model http://download.tensorflow.org/models/object_detection/tf2/20200711/faster_rcnn_resnet50_v1_800x1333_coco17_gpu-8.tar.gz and used the following code to load it:

import tensorflow as tf
import time

load_start = time.time()
with tf.device("GPU:0"):
    # using the CPU instead of the GPU doesn't change anything
    model = tf.saved_model.load('/sig/models/523368bcf91411eba96d0242ac110002/saved_model')
    print(f"Loading took {time.time() - load_start}s")

This is the output:

root@8b121c2c4b52:~# python3 test_tf.py
2021-08-10 10:23:41.298229: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.10.2
2021-08-10 10:23:47.536714: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
2021-08-10 10:23:47.574417: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1001] ARM64 does not support NUMA - returning NUMA node zero
2021-08-10 10:23:47.574641: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1734] Found device 0 with properties:
pciBusID: 0000:00:00.0 name: Xavier computeCapability: 7.2
coreClock: 1.377GHz coreCount: 8 deviceMemorySize: 31.18GiB deviceMemoryBandwidth: 82.08GiB/s
2021-08-10 10:23:47.574747: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.10.2
2021-08-10 10:23:47.638952: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.10
2021-08-10 10:23:47.639613: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.10
2021-08-10 10:23:47.669661: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcufft.so.10
2021-08-10 10:23:47.705917: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcurand.so.10
2021-08-10 10:23:47.754288: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusolver.so.10
2021-08-10 10:23:47.781275: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusparse.so.10
2021-08-10 10:23:47.783776: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.8
2021-08-10 10:23:47.784057: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1001] ARM64 does not support NUMA - returning NUMA node zero
2021-08-10 10:23:47.784366: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1001] ARM64 does not support NUMA - returning NUMA node zero
2021-08-10 10:23:47.784501: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1872] Adding visible gpu devices: 0
2021-08-10 10:23:47.792187: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1001] ARM64 does not support NUMA - returning NUMA node zero
2021-08-10 10:23:47.792479: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1734] Found device 0 with properties:
pciBusID: 0000:00:00.0 name: Xavier computeCapability: 7.2
coreClock: 1.377GHz coreCount: 8 deviceMemorySize: 31.18GiB deviceMemoryBandwidth: 82.08GiB/s
2021-08-10 10:23:47.792711: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1001] ARM64 does not support NUMA - returning NUMA node zero
2021-08-10 10:23:47.792890: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1001] ARM64 does not support NUMA - returning NUMA node zero
2021-08-10 10:23:47.793019: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1872] Adding visible gpu devices: 0
2021-08-10 10:23:47.793267: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.10.2
2021-08-10 10:23:50.274390: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-08-10 10:23:50.274489: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264]      0
2021-08-10 10:23:50.274526: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0:   N
2021-08-10 10:23:50.274911: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1001] ARM64 does not support NUMA - returning NUMA node zero
2021-08-10 10:23:50.275160: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1001] ARM64 does not support NUMA - returning NUMA node zero
2021-08-10 10:23:50.275354: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1001] ARM64 does not support NUMA - returning NUMA node zero
2021-08-10 10:23:50.275510: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 27510 MB memory) -> physical GPU (device: 0, name: Xavier, pci bus id: 0000:00:00.0, compute capability: 7.2)
Loading took 71.70220994949341s

So, as you can see, loading takes ~72 seconds, which feels far too long. On my Ubuntu machine with a five-year-old i7 and a GTX 1060, loading takes 6 seconds. Looking at the hardware alone, there shouldn't be a >10x increase in loading time. Or would that be plausible?

The Ubuntu machine uses CUDA 11.x, while the NVIDIA TF 2.5.0 build links against CUDA 10.x. Could that be the reason?

Why is the most recent NVIDIA TF build still linked to CUDA 10.x instead of 11.x?

Thanks in advance

It looks like you are not running in MAXN mode (nvpmodel 0) with jetson_clocks enabled.

The Jetson is running in MAXN mode. What do you mean by “with jetson_clocks enabled”? This is the output of jetson_clocks --show:

SOC family:tegra194  Machine:Jetson-AGX
Online CPUs: 0-7
cpu0: Online=1 Governor=schedutil MinFreq=2265600 MaxFreq=2265600 CurrentFreq=2265600 IdleStates: C1=0 c6=0
cpu1: Online=1 Governor=schedutil MinFreq=2265600 MaxFreq=2265600 CurrentFreq=2265600 IdleStates: C1=0 c6=0
cpu2: Online=1 Governor=schedutil MinFreq=2265600 MaxFreq=2265600 CurrentFreq=2265600 IdleStates: C1=0 c6=0
cpu3: Online=1 Governor=schedutil MinFreq=2265600 MaxFreq=2265600 CurrentFreq=2265600 IdleStates: C1=0 c6=0
cpu4: Online=1 Governor=schedutil MinFreq=2265600 MaxFreq=2265600 CurrentFreq=2265600 IdleStates: C1=0 c6=0
cpu5: Online=1 Governor=schedutil MinFreq=2265600 MaxFreq=2265600 CurrentFreq=2265600 IdleStates: C1=0 c6=0
cpu6: Online=1 Governor=schedutil MinFreq=2265600 MaxFreq=2265600 CurrentFreq=2265600 IdleStates: C1=0 c6=0
cpu7: Online=1 Governor=schedutil MinFreq=2265600 MaxFreq=2265600 CurrentFreq=2265600 IdleStates: C1=0 c6=0
GPU MinFreq=1377000000 MaxFreq=1377000000 CurrentFreq=1377000000
EMC MinFreq=204000000 MaxFreq=2133000000 CurrentFreq=2133000000 FreqOverride=1
Fan: PWM=0
NV Power Mode: MAXN

Hi,

CUDA 11 will be supported in the next major release.
You can find some details below:

May I know which JetPack version you use on Xavier?
Is it also 4.6?

Also, have you tested the loading time on the previous TensorFlow Docker image?
Is this issue specific to JetPack 4.6?

Thanks.

I’m sorry, I’m actually using nvcr.io/nvidia/l4t-tensorflow:r32.5.0-tf2.3-py3, which is the correct image for JetPack 4.5, and I’m on JetPack 4.5. I’ve also tested it with TF 2.3.1, which was pre-installed in the container, with the same result.

CUDA doesn’t seem to play a role at all: if I do this in a fresh container without CUDA, I get error messages that the CUDA libraries couldn’t be loaded, and loading still takes ~72s.

@AastaLLL any ideas?

Hi,

Would you mind sharing the TensorFlow log with us first?
For example, you should see it try to load the CUDA library, as below:

>>> import tensorflow
2021-08-31 05:35:08.231756: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.2
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.

Thanks.

Hi,

that log is in the original post. But as I said in my post from Aug 12th, CUDA doesn’t seem to matter at all: the time it takes to load the model doesn’t change whether CUDA is installed or not.

Hi,

We have tested the import time of r32.6.1-tf2.5-py3.
It takes around 5s to load the TensorFlow package:

root@nvidia-desktop:/# time python3 -c "import tensorflow"
2021-09-17 07:15:08.854028: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.10.2

real    0m5.532s
user    0m5.092s
sys     0m0.376s

We are not sure if there is any difference in the power mode setting.
For reference, below are the commands we use to boost the device:

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

Thanks.

Hi,

loading the TensorFlow package is not the issue; loading a model with TensorFlow is what takes a long time. Please see the original post.

Thanks

Hi,

Sorry for missing that.

We can reproduce this issue in our environment as well.
Since the model is complicated, performance may be bound by memory capacity or bandwidth.

Have you checked with the TensorFlow team to see if any flag or API can help with this issue?

Thanks.

Hi,

I’ve found something in this forum that helped: rebuilding the protobuf Python package from source with the --cpp_implementation flag set. Loading now takes the same time as with JetPack 4.2 and TensorFlow 1.

It would be very helpful if you could include this in future JetPack releases and/or the TensorFlow packages you provide. It seems the C++ implementation used to be included in JetPack 4.2/TF1, because I didn’t have this problem there and didn’t have to build anything from source.
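For anyone hitting the same problem, a quick way to check which protobuf backend is active (this uses an internal protobuf module, so it may move between versions):

```python
# Check which protobuf backend is active. SavedModel loading parses large
# protobuf messages, so the pure-Python backend ("python") is much slower
# than the C++ one ("cpp"). api_implementation is an internal module and
# may change location in future protobuf releases.
try:
    from google.protobuf.internal import api_implementation
    print("protobuf backend:", api_implementation.Type())
except ImportError:
    print("protobuf is not installed")
```

As far as I can tell, when both backends are available, setting the environment variable PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=cpp selects the C++ one; rebuilding from source is only needed when the wheel ships without it.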

Thanks

Hi,

Thanks for the feedback.
We will share this with our internal team.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.