I am trying to run an image processing model on a Jetson Xavier running JetPack 4.3, inside a Docker container. For some reason, when the code is run on my local computer using the tensorflow/tensorflow base image, it produces results from my test image. However, when using the nvcr.io/nvidia/l4t-tensorflow:r32.4.3-tf2.2-py3 image on the Jetson, it fails to produce inference results and returns nan where the predictions should be.
I cannot run tensorflow/tensorflow on the Jetson, or nvcr.io/nvidia/l4t-tensorflow:r32.4.3-tf2.2-py3 locally, because of the underlying hardware differences between my computer and the Jetson.
If anyone has any idea what could be causing this issue, please let me know. I can provide more details if you think they are relevant. Thanks in advance!
In an attempt to diagnose the issue, I checked the library version differences between the two docker environments.

Differences (Computer / Jetson library versions):
tensorboard: 2.3.0 / 2.2.2 *

* cannot be changed on the Jetson as it is part of the base image
I used pip to update/downgrade all of the libraries on the Jetson, except those marked with a * in the list above, to match the other docker environment. This had no effect on my output.
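For reference, this is roughly how the installed versions can be dumped for comparison (a minimal sketch; run it inside each container and diff the two outputs):

import pkg_resources

# Print installed packages in pip-freeze style so the two
# containers' outputs can be diffed for version mismatches.
for dist in sorted(pkg_resources.working_set, key=lambda d: d.project_name.lower()):
    print("{}=={}".format(dist.project_name, dist.version))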
In order to troubleshoot, I ran pip install tensorflow==2.2.0 in the docker container on my computer to see if that would break the inference. It did not. I also retrained the model on my computer, this time using tensorflow==2.2.0, and it still worked on my laptop but not on the Jetson.
Both the Jetson and the docker instance on my laptop are now using tensorflow 2.2.
The input image is the same for both platforms and I am performing the same test from inside the docker containers.
The model is loading correctly, because when I print it I get:
<tensorflow.python.saved_model.load.Loader._recreate_base_user_object.<locals>._UserObject object at 0x7efb8597f0>
The images are being opened and read correctly. I printed out their numpy arrays and they look reasonable.
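Beyond eyeballing the arrays, here is the kind of quick check that can be run on both platforms to rule out a dtype or value-range mismatch (a sketch; image is the array loaded from the test image):

import numpy as np

# Compare these values between the two platforms; a uint8 vs float32
# or 0-255 vs 0-1 mismatch can silently turn into nans downstream.
arr = np.asarray(image)
print("dtype:", arr.dtype, "shape:", arr.shape)
print("min:", arr.min(), "max:", arr.max())
print("nans in input:", np.isnan(arr.astype(np.float64)).sum())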
The model is returning the output dictionary, but it has nan where the predictions should be.
The model is not optimized with TensorRT.
Model Inference Code Block:
import numpy as np
import tensorflow as tf

image = np.asarray(image)
input_tensor = tf.convert_to_tensor(image)
# The model expects a batch of images, so add an axis with tf.newaxis.
input_tensor = input_tensor[tf.newaxis, ...]
model_fn = model.signatures['serving_default']
output_dict = model_fn(input_tensor)
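To narrow down where the nans first appear on the Jetson, something like this could be run (a sketch, reusing model_fn and input_tensor from the block above; tf.debugging.enable_check_numerics requires TF >= 2.1):

import numpy as np
import tensorflow as tf

# Make TF raise at the first op that produces nan/inf instead of
# letting it propagate silently through the graph.
tf.debugging.enable_check_numerics()

output_dict = model_fn(input_tensor)
for key, value in output_dict.items():
    print(key, "contains nan:", bool(np.isnan(value.numpy()).any()))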
Random things I have noticed but probably aren’t relevant:
Running this takes forever (~7 min):
model = tf.saved_model.load(PATH)
It also produces a warning that, from what I have read, is a tf bug caused by training my own model, so I don’t think it is the issue:
WARNING:tensorflow:Importing a function (__inference_EfficientDet-D0_layer_call_and_return_conditional_losses_95612) with ops with custom gradients. Will likely fail if a gradient is requested.
WARNING:tensorflow:Importing a function (__inference_EfficientDet-D0_layer_call_and_return_conditional_losses_82391) with ops with custom gradients. Will likely fail if a gradient is requested.
Every time I use the nvcr.io/nvidia/l4t-tensorflow:r32.4.3-tf2.2-py3 image I get the TensorFlow can't-find-CUDA error. It is possible I just need to mount something into my docker container. I don’t think this is my issue because I previously had a premade tf model running just fine on this image despite the error.
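A quick sanity check for this from inside the container (a minimal sketch):

import tensorflow as tf

# If this prints an empty list, TF is running CPU-only, meaning the
# container is not seeing the Jetson's GPU/CUDA libraries.
print("TF version:", tf.__version__)
print("GPUs visible:", tf.config.list_physical_devices('GPU'))

On the Jetson this generally depends on starting the container with the NVIDIA runtime (e.g. docker run --runtime nvidia ...).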
Sorry for the long post. Just wanted to get all the info out. :)