Slow model loading on a Jetson AGX Xavier with TensorFlow 2.5.0

Hi there,

I’ve used nvcr.io/nvidia/l4t-tensorflow:r32.6.1-tf2.5-py3 as a base and updated to tensorflow-2.5.0+nv21.6-cp36-cp36m-linux_aarch64 inside the container. I then downloaded this model http://download.tensorflow.org/models/object_detection/tf2/20200711/faster_rcnn_resnet50_v1_800x1333_coco17_gpu-8.tar.gz and used the following code to load it:

import tensorflow as tf
import time

load_start = time.time()
with tf.device("GPU:0"):
    # using "CPU:0" instead of "GPU:0" doesn't change the loading time
    tf.saved_model.load('/sig/models/523368bcf91411eba96d0242ac110002/saved_model')
    print(f"Loading took {time.time() - load_start}s")

This is the output:

root@8b121c2c4b52:~# python3 test_tf.py
2021-08-10 10:23:41.298229: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.10.2
2021-08-10 10:23:47.536714: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
2021-08-10 10:23:47.574417: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1001] ARM64 does not support NUMA - returning NUMA node zero
2021-08-10 10:23:47.574641: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1734] Found device 0 with properties:
pciBusID: 0000:00:00.0 name: Xavier computeCapability: 7.2
coreClock: 1.377GHz coreCount: 8 deviceMemorySize: 31.18GiB deviceMemoryBandwidth: 82.08GiB/s
2021-08-10 10:23:47.574747: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.10.2
2021-08-10 10:23:47.638952: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.10
2021-08-10 10:23:47.639613: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.10
2021-08-10 10:23:47.669661: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcufft.so.10
2021-08-10 10:23:47.705917: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcurand.so.10
2021-08-10 10:23:47.754288: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusolver.so.10
2021-08-10 10:23:47.781275: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusparse.so.10
2021-08-10 10:23:47.783776: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.8
2021-08-10 10:23:47.784057: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1001] ARM64 does not support NUMA - returning NUMA node zero
2021-08-10 10:23:47.784366: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1001] ARM64 does not support NUMA - returning NUMA node zero
2021-08-10 10:23:47.784501: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1872] Adding visible gpu devices: 0
2021-08-10 10:23:47.792187: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1001] ARM64 does not support NUMA - returning NUMA node zero
2021-08-10 10:23:47.792479: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1734] Found device 0 with properties:
pciBusID: 0000:00:00.0 name: Xavier computeCapability: 7.2
coreClock: 1.377GHz coreCount: 8 deviceMemorySize: 31.18GiB deviceMemoryBandwidth: 82.08GiB/s
2021-08-10 10:23:47.792711: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1001] ARM64 does not support NUMA - returning NUMA node zero
2021-08-10 10:23:47.792890: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1001] ARM64 does not support NUMA - returning NUMA node zero
2021-08-10 10:23:47.793019: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1872] Adding visible gpu devices: 0
2021-08-10 10:23:47.793267: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.10.2
2021-08-10 10:23:50.274390: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-08-10 10:23:50.274489: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264]      0
2021-08-10 10:23:50.274526: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0:   N
2021-08-10 10:23:50.274911: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1001] ARM64 does not support NUMA - returning NUMA node zero
2021-08-10 10:23:50.275160: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1001] ARM64 does not support NUMA - returning NUMA node zero
2021-08-10 10:23:50.275354: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1001] ARM64 does not support NUMA - returning NUMA node zero
2021-08-10 10:23:50.275510: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 27510 MB memory) -> physical GPU (device: 0, name: Xavier, pci bus id: 0000:00:00.0, compute capability: 7.2)
Loading took 71.70220994949341s

So, as you can see, loading takes ~72 seconds, which feels too long. On my Ubuntu machine with a 5-year-old i7 and a GTX 1060, loading takes 6 seconds. Looking at the hardware alone, I wouldn’t expect a >10x increase in loading time. Or would that be plausible?
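
One way to narrow down where the ~72 seconds go is to run the load call under Python's built-in profiler. This is only a sketch: profile_call and profile_saved_model_load are helper names made up for illustration, and the model path is the one from the script above.

```python
import cProfile
import io
import pstats


def profile_call(fn, *args, top=15, **kwargs):
    """Run fn under cProfile and return (result, report sorted by cumulative time)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(fn, *args, **kwargs)
    buf = io.StringIO()
    pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(top)
    return result, buf.getvalue()


def profile_saved_model_load(model_dir):
    """Hypothetical helper: profile tf.saved_model.load for the given directory."""
    # Imported here so the TensorFlow import itself is not part of the profile.
    import tensorflow as tf

    _, report = profile_call(tf.saved_model.load, model_dir)
    return report


# Usage (path taken from the script above):
# print(profile_saved_model_load(
#     "/sig/models/523368bcf91411eba96d0242ac110002/saved_model"))
```

The report lists the Python-level calls with the highest cumulative time. Time spent inside TensorFlow's native code shows up only as a single entry, so this can point to the slow step but may not break it down further.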

The Ubuntu machine uses CUDA 11.x, whereas the NVIDIA TF 2.5.0 build links against CUDA 10.x. Could that be the reason?

Why is the most recent NVIDIA TF build still linked to CUDA 10.x instead of 11.x?

Thanks in advance

Looks like you are not running in MAXN mode (nvpmodel mode 0) with jetson-clocks enabled.

The Jetson is running in MAXN mode. What do you mean by “with jetson-clocks enabled”? This is the output of jetson_clocks --show:

SOC family:tegra194  Machine:Jetson-AGX
Online CPUs: 0-7
cpu0: Online=1 Governor=schedutil MinFreq=2265600 MaxFreq=2265600 CurrentFreq=2265600 IdleStates: C1=0 c6=0
cpu1: Online=1 Governor=schedutil MinFreq=2265600 MaxFreq=2265600 CurrentFreq=2265600 IdleStates: C1=0 c6=0
cpu2: Online=1 Governor=schedutil MinFreq=2265600 MaxFreq=2265600 CurrentFreq=2265600 IdleStates: C1=0 c6=0
cpu3: Online=1 Governor=schedutil MinFreq=2265600 MaxFreq=2265600 CurrentFreq=2265600 IdleStates: C1=0 c6=0
cpu4: Online=1 Governor=schedutil MinFreq=2265600 MaxFreq=2265600 CurrentFreq=2265600 IdleStates: C1=0 c6=0
cpu5: Online=1 Governor=schedutil MinFreq=2265600 MaxFreq=2265600 CurrentFreq=2265600 IdleStates: C1=0 c6=0
cpu6: Online=1 Governor=schedutil MinFreq=2265600 MaxFreq=2265600 CurrentFreq=2265600 IdleStates: C1=0 c6=0
cpu7: Online=1 Governor=schedutil MinFreq=2265600 MaxFreq=2265600 CurrentFreq=2265600 IdleStates: C1=0 c6=0
GPU MinFreq=1377000000 MaxFreq=1377000000 CurrentFreq=1377000000
EMC MinFreq=204000000 MaxFreq=2133000000 CurrentFreq=2133000000 FreqOverride=1
Fan: PWM=0
NV Power Mode: MAXN
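
For reference, jetson_clocks pins the CPU, GPU and EMC clocks to their maximum, so output like the above can be checked mechanically. A rough sketch, assuming the line format shown here; parse_clocks and below_max are made-up helper names, and the sample lines are copied from the output above.

```python
import re

FREQ_RE = re.compile(r"(MinFreq|MaxFreq|CurrentFreq)=(\d+)")

# Three lines copied from the jetson_clocks --show output above.
SAMPLE = """\
cpu0: Online=1 Governor=schedutil MinFreq=2265600 MaxFreq=2265600 CurrentFreq=2265600 IdleStates: C1=0 c6=0
GPU MinFreq=1377000000 MaxFreq=1377000000 CurrentFreq=1377000000
EMC MinFreq=204000000 MaxFreq=2133000000 CurrentFreq=2133000000 FreqOverride=1
"""


def parse_clocks(show_output):
    """Map each unit (cpu0, GPU, EMC, ...) to its Min/Max/Current frequency."""
    clocks = {}
    for line in show_output.splitlines():
        freqs = {key: int(value) for key, value in FREQ_RE.findall(line)}
        if freqs:
            name = line.split()[0].rstrip(":")
            clocks[name] = freqs
    return clocks


def below_max(clocks):
    """Return the units whose current frequency is lower than their maximum."""
    return [
        name
        for name, freqs in clocks.items()
        if freqs.get("CurrentFreq", 0) < freqs.get("MaxFreq", 0)
    ]


print(below_max(parse_clocks(SAMPLE)))
```

An empty list means every listed unit is already running at its maximum frequency, which is what the output above shows.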

Hi,

CUDA 11 will be supported in the next major release.
You can find some details below:

May I know which JetPack version you are using on the Xavier?
Is it also version 4.6?

Also, have you tested the loading time with the previous TensorFlow container?
Is this issue specific to JetPack 4.6?

Thanks.

I’m sorry, I’m actually using nvcr.io/nvidia/l4t-tensorflow:r32.5.0-tf2.3-py3, which is the correct image for JetPack 4.5, and I’m on JetPack 4.5. I’ve also tested with TF 2.3.1, which was pre-installed in the container, with the same result.

CUDA doesn’t seem to play a role at all. If I do this in a fresh container without CUDA, I get error messages that CUDA couldn’t be loaded, and loading still takes ~72s.
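
To repeat the no-CUDA comparison without building a separate container, one option is to hide the GPU from TensorFlow through the CUDA_VISIBLE_DEVICES environment variable and time the import and the load separately. A minimal sketch; timed and load_without_gpu are helper names made up for illustration.

```python
import importlib
import os
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def load_without_gpu(model_dir):
    """Hide all GPUs from TensorFlow, then time the import and the load separately."""
    # Must be set before TensorFlow is imported for the first time in this process.
    os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
    tf, import_seconds = timed(importlib.import_module, "tensorflow")
    model, load_seconds = timed(tf.saved_model.load, model_dir)
    return model, import_seconds, load_seconds


# Usage (path taken from the original script):
# _, import_s, load_s = load_without_gpu(
#     "/sig/models/523368bcf91411eba96d0242ac110002/saved_model")
# print(f"import: {import_s:.1f}s, load: {load_s:.1f}s")
```

If the load time stays around 72s with the GPU hidden, that would support the observation that CUDA is not the bottleneck.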

@AastaLLL any ideas?

Hi,

Would you mind sharing the TensorFlow log with us first?
For example, you should see it try to load the CUDA library like below:

>>> import tensorflow
2021-08-31 05:35:08.231756: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.2
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.

Thanks.

Hi,

That log is in the original post but, as I said in my post from Aug 12th, CUDA doesn’t seem to matter at all: the time it takes to load the model doesn’t change whether CUDA is installed or not.