Slow model loading on a Jetson AGX Xavier with TensorFlow 2.5.0

Hi there,

I’ve used nvcr.io/nvidia/l4t-tensorflow:r32.6.1-tf2.5-py3 as a base and updated to tensorflow-2.5.0+nv21.6-cp36-cp36m-linux_aarch64 inside the container. I then downloaded this model http://download.tensorflow.org/models/object_detection/tf2/20200711/faster_rcnn_resnet50_v1_800x1333_coco17_gpu-8.tar.gz and used the following code to load it:

import tensorflow as tf
import time

load_start = time.time()
with tf.device("GPU:0"):
    # using the CPU instead of the GPU doesn't change anything
    model = tf.saved_model.load('/sig/models/523368bcf91411eba96d0242ac110002/saved_model')
    print(f"Loading took {time.time() - load_start}s")

This is the output:

root@8b121c2c4b52:~# python3 test_tf.py
2021-08-10 10:23:41.298229: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.10.2
2021-08-10 10:23:47.536714: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
2021-08-10 10:23:47.574417: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1001] ARM64 does not support NUMA - returning NUMA node zero
2021-08-10 10:23:47.574641: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1734] Found device 0 with properties:
pciBusID: 0000:00:00.0 name: Xavier computeCapability: 7.2
coreClock: 1.377GHz coreCount: 8 deviceMemorySize: 31.18GiB deviceMemoryBandwidth: 82.08GiB/s
2021-08-10 10:23:47.574747: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.10.2
2021-08-10 10:23:47.638952: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.10
2021-08-10 10:23:47.639613: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.10
2021-08-10 10:23:47.669661: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcufft.so.10
2021-08-10 10:23:47.705917: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcurand.so.10
2021-08-10 10:23:47.754288: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusolver.so.10
2021-08-10 10:23:47.781275: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusparse.so.10
2021-08-10 10:23:47.783776: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.8
2021-08-10 10:23:47.784057: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1001] ARM64 does not support NUMA - returning NUMA node zero
2021-08-10 10:23:47.784366: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1001] ARM64 does not support NUMA - returning NUMA node zero
2021-08-10 10:23:47.784501: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1872] Adding visible gpu devices: 0
2021-08-10 10:23:47.792187: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1001] ARM64 does not support NUMA - returning NUMA node zero
2021-08-10 10:23:47.792479: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1734] Found device 0 with properties:
pciBusID: 0000:00:00.0 name: Xavier computeCapability: 7.2
coreClock: 1.377GHz coreCount: 8 deviceMemorySize: 31.18GiB deviceMemoryBandwidth: 82.08GiB/s
2021-08-10 10:23:47.792711: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1001] ARM64 does not support NUMA - returning NUMA node zero
2021-08-10 10:23:47.792890: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1001] ARM64 does not support NUMA - returning NUMA node zero
2021-08-10 10:23:47.793019: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1872] Adding visible gpu devices: 0
2021-08-10 10:23:47.793267: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.10.2
2021-08-10 10:23:50.274390: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-08-10 10:23:50.274489: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264]      0
2021-08-10 10:23:50.274526: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0:   N
2021-08-10 10:23:50.274911: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1001] ARM64 does not support NUMA - returning NUMA node zero
2021-08-10 10:23:50.275160: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1001] ARM64 does not support NUMA - returning NUMA node zero
2021-08-10 10:23:50.275354: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1001] ARM64 does not support NUMA - returning NUMA node zero
2021-08-10 10:23:50.275510: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 27510 MB memory) -> physical GPU (device: 0, name: Xavier, pci bus id: 0000:00:00.0, compute capability: 7.2)
Loading took 71.70220994949341s

So, as you can see, loading takes ~72 seconds, which feels far too long. On my Ubuntu machine with a five-year-old i7 and a GTX 1060, loading takes 6 seconds. Looking at the hardware alone, there shouldn't be a >10x increase in loading time. Or would that be plausible?

The Ubuntu machine uses CUDA 11.x, while the NVIDIA TF 2.5.0 build links against CUDA 10.x. Could that be the reason?

Why is the most recent NVIDIA TF build still linked to CUDA 10.x instead of 11.x?

Thanks in advance

It looks like you are not running in MAXN mode (nvpmodel 0) with jetson_clocks enabled.

The Jetson is running in MAXN mode. What do you mean by “with jetson_clocks enabled”? This is the output of jetson_clocks --show:

SOC family:tegra194  Machine:Jetson-AGX
Online CPUs: 0-7
cpu0: Online=1 Governor=schedutil MinFreq=2265600 MaxFreq=2265600 CurrentFreq=2265600 IdleStates: C1=0 c6=0
cpu1: Online=1 Governor=schedutil MinFreq=2265600 MaxFreq=2265600 CurrentFreq=2265600 IdleStates: C1=0 c6=0
cpu2: Online=1 Governor=schedutil MinFreq=2265600 MaxFreq=2265600 CurrentFreq=2265600 IdleStates: C1=0 c6=0
cpu3: Online=1 Governor=schedutil MinFreq=2265600 MaxFreq=2265600 CurrentFreq=2265600 IdleStates: C1=0 c6=0
cpu4: Online=1 Governor=schedutil MinFreq=2265600 MaxFreq=2265600 CurrentFreq=2265600 IdleStates: C1=0 c6=0
cpu5: Online=1 Governor=schedutil MinFreq=2265600 MaxFreq=2265600 CurrentFreq=2265600 IdleStates: C1=0 c6=0
cpu6: Online=1 Governor=schedutil MinFreq=2265600 MaxFreq=2265600 CurrentFreq=2265600 IdleStates: C1=0 c6=0
cpu7: Online=1 Governor=schedutil MinFreq=2265600 MaxFreq=2265600 CurrentFreq=2265600 IdleStates: C1=0 c6=0
GPU MinFreq=1377000000 MaxFreq=1377000000 CurrentFreq=1377000000
EMC MinFreq=204000000 MaxFreq=2133000000 CurrentFreq=2133000000 FreqOverride=1
Fan: PWM=0
NV Power Mode: MAXN

Hi,

CUDA 11 will be supported in the next major release.
You can find some details below:

May I know which JetPack version you use on Xavier?
Is it also 4.6?

Also, have you tested the loading time on the previous TensorFlow Docker image?
Is this issue specific to JetPack 4.6?

Thanks.

I’m sorry, I’m actually using nvcr.io/nvidia/l4t-tensorflow:r32.5.0-tf2.3-py3, which is the correct image for JetPack 4.5, and I’m on JetPack 4.5. I’ve also tested it with TF 2.3.1, which was pre-installed in the container, with the same result.

CUDA doesn’t seem to play a role at all: if I do this in a fresh container without CUDA, I get error messages that the CUDA libraries couldn’t be loaded, and loading still takes ~72s.

@AastaLLL any ideas?

Hi,

Would you mind sharing the TensorFlow log with us first?
For example, you should see it try to load the CUDA library, as below:

>>> import tensorflow
2021-08-31 05:35:08.231756: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.2
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.

Thanks.

Hi,

that log is in the original post. But as I said in my post from Aug 12th, CUDA doesn’t seem to matter at all: the time it takes to load the model doesn’t change whether CUDA is installed or not.

Hi,

We have tested the import time of r32.6.1-tf2.5-py3.
It takes around 5s to load the TensorFlow package:

root@nvidia-desktop:/# time python3 -c "import tensorflow"
2021-09-17 07:15:08.854028: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.10.2

real    0m5.532s
user    0m5.092s
sys     0m0.376s

We are not sure if there is any difference in the power mode setting.
For reference, below are the commands we use to boost the device:

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

Thanks.

Hi,

loading the TensorFlow package is not the issue; loading a model with TensorFlow is what takes a long time. Please see the original post.

Thanks

Hi,

Sorry for missing that.

We can reproduce this issue in our environment as well.
Since the model is complicated, performance may be bound by memory capacity or bandwidth.

Have you checked with the TensorFlow team to see if any flag or API can help with this issue?

Thanks.

Hi,

I’ve found something in this forum that helped: rebuilding the protobuf Python package from source with the --cpp_implementation flag set. Loading now takes the same time as with JetPack 4.2 and TensorFlow 1.

It would be very helpful if you could include this in future JetPack releases and/or the TensorFlow packages you provide. It seems the C++ implementation used to be included in JetPack 4.2/TF1, because I didn’t have this problem there and didn’t have to build anything from source.
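For anyone hitting the same problem, a quick way to check which protobuf backend is active (this uses an internal protobuf module, so it may move between versions):

```python
# Check which protobuf backend is active. SavedModel loading parses large
# protobuf messages, so the pure-Python backend ("python") is much slower
# than the C++ one ("cpp"). api_implementation is an internal module and
# may change location in future protobuf releases.
try:
    from google.protobuf.internal import api_implementation
    print("protobuf backend:", api_implementation.Type())
except ImportError:
    print("protobuf is not installed")
```

As far as I can tell, when both backends are available, setting the environment variable PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=cpp selects the C++ one; rebuilding from source is only needed when the wheel ships without it.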

Thanks

Hi,

Thanks for the feedback.
We will share this with our internal team.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.