Trying to run TensorFlow 1.15 produced graphdefs with TF2 based tensorRT but TensorRT model is not building correctly

Description

We are trying to set up a TensorRT server on our platform for real-time inference that can serve both models originally created with TensorFlow 1.15 and models created with TensorFlow 2.4.1. Since most of the TF 1.15 models were built with tf-slim, models from both versions are exported as graphdefs on our platform. Converting these to TensorRT models used to be easy, because the graphdefs could be converted directly (that pipeline was originally built with nvcr.io/nvidia/tensorflow:20.03-tf1-py3). With TF-TRT in 2.4.1, however, we first have to convert the graphdefs to SavedModels and then convert those to TensorRT models. Doing this, we face two problems:

  • Older models that go through this pipeline always return probabilities in which the first class is 1 and the rest are 0, regardless of the input image. Different images should not all produce the same output.
  • The speed of the inference has changed quite a bit. The inference time went up from 20 ms to 5 s. This is absolutely unacceptable.

Note that we have a separate branch for Keras compatibility, which lets us skip the graphdef part of the pipeline. That route would leave the older models behind, though, so we still need to fix this version of the pipeline.
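For readers following along, the graphdef to SavedModel to TF-TRT path described above can be sketched roughly as follows. This is a minimal sketch and not the contents of export_model_new.py: the function names, the tensor names passed in, the "input"/"output" signature keys, and the FP32 precision mode are all assumptions made for illustration. The conversion step needs a TensorRT-enabled TensorFlow build such as the NGC container.

```python
import tensorflow as tf


def graphdef_to_saved_model(pb_path, export_dir, input_tensor, output_tensor):
    """Wrap a frozen TF1 GraphDef in a SavedModel with one serving signature."""
    graph_def = tf.compat.v1.GraphDef()
    with tf.io.gfile.GFile(pb_path, "rb") as f:
        graph_def.ParseFromString(f.read())

    with tf.Graph().as_default() as graph:
        # name="" keeps the original tensor names instead of prefixing "import/".
        tf.import_graph_def(graph_def, name="")
        with tf.compat.v1.Session(graph=graph) as sess:
            # The "input"/"output" keys are illustrative signature names.
            tf.compat.v1.saved_model.simple_save(
                sess,
                export_dir,
                inputs={"input": graph.get_tensor_by_name(input_tensor)},
                outputs={"output": graph.get_tensor_by_name(output_tensor)},
            )


def convert_with_tftrt(saved_model_dir, trt_dir, input_fn=None):
    """Convert a SavedModel with TF-TRT and optionally pre-build the engines."""
    # Imported lazily because it only works on a TensorRT-enabled build.
    from tensorflow.python.compiler.tensorrt import trt_convert as trt

    # FP32 is an assumed precision mode, chosen to keep outputs comparable.
    params = trt.DEFAULT_TRT_CONVERSION_PARAMS._replace(precision_mode="FP32")
    converter = trt.TrtGraphConverterV2(
        input_saved_model_dir=saved_model_dir, conversion_params=params
    )
    converter.convert()
    if input_fn is not None:
        # Building here stores engines with the model instead of creating
        # them when the first request arrives.
        converter.build(input_fn=input_fn)
    converter.save(trt_dir)
```

Here `input_fn` would be a generator that yields a tuple of arrays shaped like real requests. Whether pre-building the engines changes the latency reported in this post is untested.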

Environment

TensorRT Version: Not really sure, but it’s using the TF-TRT from TF 2.4.1

GPU Type: GeForce GTX 1080 Ti

Nvidia Driver Version:

NVIDIA-SMI 460.32.03 Driver Version: 460.32.03 CUDA Version: 11.2

CUDA Version:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2015 NVIDIA Corporation
Built on #
Cuda compilation tools, release 7.5, V7.5.17

CUDNN Version:

#define CUDNN_MAJOR 7
#define CUDNN_MINOR 5
#define CUDNN_PATCHLEVEL 0

Operating System + Version:

NAME="Ubuntu"
VERSION="16.04.5 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.5 LTS"
VERSION_ID="16.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
VERSION_CODENAME=xenial
UBUNTU_CODENAME=xenial

Python Version (if applicable): 3.6 while training the model, 3.8.5 in the container
TensorFlow Version (if applicable): 1.15 and 2.4.1 for training the models (as stated above), and 2.4.0 within the container to export the model.
Baremetal or Container (if container which image + tag): nvcr.io/nvidia/tensorflow:21.03-py3

Relevant Files

Because our models are proprietary, they are not included among the relevant files. If needed, I can train an ImageNet model for a small number of iterations specifically for this post.

export_model_old.py (GitHub)

export_model_new.py (GitHub)

check_inference.py (GitHub)

output_tf1_model.txt & output_tf2_model.txt

Can’t post the links to these files since new users only get 3 links to post. I have posted the links to these two logs in one of the comments below.

Steps To Reproduce

TF 1.15 Model

Exporting the model from a checkpoint

The function export_model() from export_model_old.py was used to convert a checkpoint into a graphdef. This is often done in the cloud or on a Mac laptop.

Exporting the tensorRT model

The exported model and the checkpoints are downloaded from the cloud, which gives a directory layout like this:

<MODEL_NAME>.pb  
config.ini  
model.ckpt-100000.data-00000-of-00001  
model.ckpt-100000.index  
model.ckpt-100000.meta  

Then the docker image is run with the command:

docker run --gpus all --rm --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -eMODEL_NAME=${MODEL_NAME} -ti -v/home/trainer:/trainer ${NVIDIA_TF_IMAGE}

Once inside the docker, use these commands:

cd /trainer
apt-get update && apt-get install -y libcurl4 libcurl4-openssl-dev
export PYTHONPATH=`pwd`
mkdir /opt/tensorflow/horovod-source/.eggs/
touch /opt/tensorflow/horovod-source/.eggs/easy-install.pth
pip install tensorflow-probability==0.12.1
pip install nvidia-pyindex
pip install tritonclient
pip install <proprietary_package> --extra-index-url=proprietary_package_url.com/proprietary_package --no-cache-dir
pip install opencv-python-headless
pip install tf-slim
python ./trainer/export_model_main.py --model_dir=${MODEL_NAME} --device_query_tool ''
exit

export_model_main.py essentially just calls export_tf_trt_model(). The output of the export process is given in export_model_TF1.txt.

Running the TensorRT server (Triton)

To run the server, use this command:

docker run --gpus=1 --rm -p8000:8000 -p8001:8001 -p8002:8002 -v/home/trainer/models:/models nvcr.io/nvidia/tritonserver:21.03-py3 tritonserver --model-repository=/models --log-verbose=1
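Before sending inference requests, it can help to confirm that Triton actually loaded the model. The sketch below is not part of check_inference.py (which is not reproduced in this post). The helper names `triton_status` and `main` are hypothetical, the URL follows the -p8000:8000 mapping above, and it assumes the HTTP client was installed (for example with `pip install tritonclient[http]`).

```python
def triton_status(client, model_name):
    """Collect liveness and readiness flags from a Triton client object."""
    return {
        "server_live": client.is_server_live(),
        "server_ready": client.is_server_ready(),
        "model_ready": client.is_model_ready(model_name),
    }


def main(model_name, url="localhost:8000"):
    """Query a running Triton server; requires the tritonclient HTTP extra."""
    # Imported lazily so the helper above can be used without tritonclient.
    import tritonclient.http as httpclient

    client = httpclient.InferenceServerClient(url=url)
    status = triton_status(client, model_name)
    print(status)
    return status
```

Calling `main("<MODEL_NAME>")` on the host while the container is running would print the three flags. If `model_ready` is False, the tritonserver log produced with --log-verbose=1 is the next place to look.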

Running Inference

Running check_inference.py now gives this output:

[[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
[[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
[[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
[[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
[[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
[[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
[[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
[[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
[[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
[[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
total time is 59.1446053981781
average time is 5.708556079864502
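Every row printed above is the same one-hot vector. A small check along these lines can flag that pattern automatically when comparing exports. It is a sketch only: the function name `describe_predictions` and the tolerance are assumptions, and the sample batch simply mirrors the 15-class rows shown above.

```python
import numpy as np


def describe_predictions(probs, tol=1e-6):
    """Summarise a batch of class probabilities (rows are samples)."""
    probs = np.asarray(probs, dtype=np.float64)
    # A row is "saturated" when one class holds all of the probability mass.
    saturated = probs.max(axis=1) >= 1.0 - tol
    argmax = probs.argmax(axis=1)
    return {
        "rows_sum_to_one": bool(np.allclose(probs.sum(axis=1), 1.0, atol=1e-4)),
        "all_saturated": bool(saturated.all()),
        "all_same_class": bool((argmax == argmax[0]).all()),
        "predicted_classes": argmax.tolist(),
    }


# Two rows shaped like the output above: class 0 gets probability 1.
tf1_batch = [[1.0] + [0.0] * 14, [1.0] + [0.0] * 14]
print(describe_predictions(tf1_batch))
```

A batch where `all_saturated` and `all_same_class` are both true for varied input images points at the model or preprocessing rather than at the images themselves.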

Both problems show up here: the slow timing and the wrong probabilities. output_tf1_model.txt contains this warning:

2021-05-11 01:05:24.917187: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at trt_engine_resource_ops.cc:193 : Not found: Container TF-TRT does not exist. (Could not find resource: TF-TRT/TRTEngineOp_0_0)

I'm not really sure what this means, though.
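Taken literally, the warning says that a lookup for a resource named TF-TRT/TRTEngineOp_0_0 failed. One plausible reading, which is an assumption and not confirmed by the logs here, is that no pre-built engine was stored with the exported model. To see how often such lines occur and which engine ops they name, the logs can be scanned with a short script like the one below. The function name and both regular expressions are illustrative and written against the single line quoted above.

```python
import re

# Matches lines shaped like the OP_REQUIRES warning quoted above.
OP_REQUIRES_RE = re.compile(
    r"OP_REQUIRES failed at (?P<source>\S+) : (?P<status>[^:]+): (?P<message>.*)"
)
RESOURCE_RE = re.compile(r"Could not find resource: (?P<resource>[^)\s]+)")


def find_op_requires_failures(log_lines):
    """Return one dict per OP_REQUIRES failure found in the given lines."""
    failures = []
    for line in log_lines:
        match = OP_REQUIRES_RE.search(line)
        if not match:
            continue
        resource = RESOURCE_RE.search(match.group("message"))
        failures.append(
            {
                "source": match.group("source"),
                "status": match.group("status").strip(),
                "resource": resource.group("resource") if resource else None,
            }
        )
    return failures
```

Running it over both export logs and comparing the results would show whether the TF2 export hits the same lookup failure.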

TF 2.4.0 Model

Exporting the model from a checkpoint

The function export_model() from export_model_new.py was used to convert a checkpoint into a graphdef. This is often done in the cloud or on a Mac laptop.

Exporting the tensorRT model

The exported model and the checkpoints are downloaded from the cloud, which gives a directory layout like this:

<MODEL_NAME>.pb  
config.ini  
model.ckpt-100000.data-00000-of-00001  
model.ckpt-100000.index  
model.ckpt-100000.meta  

Then the docker image is run with the command:

docker run --gpus all --rm --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -eMODEL_NAME=${MODEL_NAME} -ti -v/home/trainer:/trainer ${NVIDIA_TF_IMAGE}

Once inside the docker, use these commands:

cd /trainer
apt-get update && apt-get install -y libcurl4 libcurl4-openssl-dev
export PYTHONPATH=`pwd`
mkdir /opt/tensorflow/horovod-source/.eggs/
touch /opt/tensorflow/horovod-source/.eggs/easy-install.pth
pip install tensorflow-probability==0.12.1
pip install nvidia-pyindex
pip install tritonclient
pip install <proprietary_package> --extra-index-url=proprietary_package_url.com/proprietary_package --no-cache-dir
pip install opencv-python-headless
pip install tf-slim
python ./trainer/export_model_main.py --model_dir=${MODEL_NAME} --device_query_tool ''
exit

export_model_main.py essentially just calls export_tf_trt_model(). The output of the export process is given in export_model_TF1.txt.

Running the TensorRT server (Triton)

To run the server, use this command:

docker run --gpus=1 --rm -p8000:8000 -p8001:8001 -p8002:8002 -v/home/trainer/models:/models nvcr.io/nvidia/tritonserver:21.03-py3 tritonserver --model-repository=/models --log-verbose=1

Running Inference

Running check_inference.py now gives this output:

[[9.24720589e-05 1.15816403e-15 1.20078304e-20 8.00195771e-37
  1.54931442e-17 1.41216694e-20 1.60215558e-12 5.36196575e-13
  6.83647333e-16 9.99874830e-01 3.26632762e-05 4.05459044e-15
  7.99138273e-12 1.80558260e-20 2.32593431e-13 2.72247086e-19
  7.35061362e-27 1.06639757e-22]
 [9.09204464e-05 1.46073500e-15 1.48219570e-20 1.41789109e-36
  1.77494695e-17 2.24178079e-20 1.89969590e-12 6.30734614e-13
  8.74303438e-16 9.99877095e-01 3.19979408e-05 4.54731501e-15
  6.33200003e-12 2.47874752e-20 2.58575171e-13 3.62880438e-19
  1.24502504e-26 1.12094206e-22]]
[[4.43831450e-05 6.40613612e-16 5.54295627e-21 5.77696659e-37
  1.23563363e-17 7.74815314e-21 9.66964356e-13 3.82282612e-13
  8.36260594e-16 9.99941707e-01 1.39719823e-05 2.45621936e-15
  2.29985085e-12 6.50754741e-21 1.21738001e-13 1.00174170e-19
  3.38641847e-27 2.97709700e-23]
 [9.09204464e-05 1.46073500e-15 1.48219570e-20 1.41789109e-36
  1.77494695e-17 2.24178079e-20 1.89969590e-12 6.30734614e-13
  8.74303438e-16 9.99877095e-01 3.19979408e-05 4.54731501e-15
  6.33200003e-12 2.47874752e-20 2.58575171e-13 3.62880438e-19
  1.24502504e-26 1.12094206e-22]]
[[7.8976111e-05 1.2423146e-15 1.5904446e-20 1.8957797e-36 1.5453941e-17
  2.3208371e-20 1.8510425e-12 7.2671161e-13 9.5343172e-16 9.9989331e-01
  2.7699223e-05 4.4301187e-15 6.8085182e-12 2.4396503e-20 1.8725756e-13
  3.5172418e-19 1.2684686e-26 9.8560913e-23]
 [6.7018766e-05 7.3002842e-16 9.4504165e-21 7.0321184e-37 1.0371708e-17
  1.0917105e-20 1.0792307e-12 4.8982888e-13 6.0529562e-16 9.9991095e-01
  2.2024951e-05 2.3674177e-15 4.4784259e-12 1.4200780e-20 1.1594869e-13
  1.7972171e-19 6.4683474e-27 5.7517628e-23]]
[[1.27087638e-04 1.18580801e-15 1.10669237e-20 7.23715502e-37
  1.57537358e-17 1.10474365e-20 1.12867528e-12 4.19154430e-13
  5.66998243e-16 9.99834895e-01 3.80308920e-05 3.10202820e-15
  4.15269685e-12 1.57932431e-20 2.03523518e-13 2.15562027e-19
  7.95262542e-27 9.37618678e-23]
 [6.25898392e-05 1.16116009e-15 1.06046477e-20 1.23950091e-36
  1.51711955e-17 1.82027330e-20 1.38282235e-12 5.10432246e-13
  7.53759003e-16 9.99913931e-01 2.34895524e-05 3.21498153e-15
  5.61361157e-12 1.49201346e-20 1.89470604e-13 3.00349591e-19
  1.02761120e-26 6.89254531e-23]]
[[9.0920446e-05 1.4607350e-15 1.4821957e-20 1.4178911e-36 1.7749537e-17
  2.2417808e-20 1.8997104e-12 6.3073104e-13 8.7430344e-16 9.9987710e-01
  3.1997875e-05 4.5472976e-15 6.3320000e-12 2.4787286e-20 2.5857517e-13
  3.6288044e-19 1.2450156e-26 1.1209421e-22]
 [9.6318086e-05 1.2741858e-15 1.5071555e-20 1.5206685e-36 1.7354825e-17
  1.7591283e-20 2.0668111e-12 6.8121496e-13 1.1242764e-15 9.9987197e-01
  3.1697695e-05 5.3538310e-15 9.2311280e-12 2.2817326e-20 2.5095944e-13
  3.0943245e-19 8.8856715e-27 1.2083465e-22]]
[[9.6318086e-05 1.2741858e-15 1.5071555e-20 1.5206685e-36 1.7354825e-17
  1.7591283e-20 2.0668111e-12 6.8121496e-13 1.1242764e-15 9.9987197e-01
  3.1697695e-05 5.3538310e-15 9.2311280e-12 2.2817326e-20 2.5095944e-13
  3.0943245e-19 8.8856715e-27 1.2083465e-22]
 [7.8976707e-05 1.2423146e-15 1.5904446e-20 1.8957797e-36 1.5453823e-17
  2.3208194e-20 1.8510356e-12 7.2671578e-13 9.5343172e-16 9.9989331e-01
  2.7699223e-05 4.4301521e-15 6.8085307e-12 2.4396597e-20 1.8725756e-13
  3.5172149e-19 1.2684783e-26 9.8560534e-23]]
[[6.75492556e-05 9.08302228e-16 1.07235993e-20 1.38627045e-36
  1.59258076e-17 1.57485899e-20 1.37643347e-12 6.64740832e-13
  9.73771099e-16 9.99906540e-01 2.58971286e-05 4.61890243e-15
  4.78584230e-12 1.28890750e-20 1.86195487e-13 2.52767370e-19
  9.11914809e-27 6.23559773e-23]
 [6.70194058e-05 7.30028422e-16 9.45041652e-21 7.03217218e-37
  1.03717877e-17 1.09171871e-20 1.07924509e-12 4.89832621e-13
  6.05300173e-16 9.99910951e-01 2.20248676e-05 2.36741770e-15
  4.47849396e-12 1.42008332e-20 1.15949784e-13 1.79723070e-19
  6.46839673e-27 5.75189409e-23]]
[[6.2589839e-05 1.1611601e-15 1.0604648e-20 1.2395009e-36 1.5171196e-17
  1.8202733e-20 1.3828224e-12 5.1043225e-13 7.5375900e-16 9.9991393e-01
  2.3489552e-05 3.2149815e-15 5.6136116e-12 1.4920135e-20 1.8947060e-13
  3.0034959e-19 1.0276112e-26 6.8925453e-23]
 [7.9832163e-05 1.0315083e-15 1.1508682e-20 1.6861944e-36 1.6996691e-17
  2.0532414e-20 1.6076384e-12 6.3466501e-13 1.0150244e-15 9.9989414e-01
  2.5957881e-05 4.4762257e-15 6.3288472e-12 1.8538668e-20 1.7562233e-13
  2.9604616e-19 1.0381739e-26 8.1428330e-23]]
[[6.25898392e-05 1.16116009e-15 1.06046477e-20 1.23950091e-36
  1.51711955e-17 1.82027330e-20 1.38282235e-12 5.10432246e-13
  7.53759003e-16 9.99913931e-01 2.34895524e-05 3.21498153e-15
  5.61361157e-12 1.49201346e-20 1.89470604e-13 3.00349591e-19
  1.02761120e-26 6.89254531e-23]
 [9.09204464e-05 1.46073500e-15 1.48219570e-20 1.41789109e-36
  1.77494695e-17 2.24178079e-20 1.89969590e-12 6.30734614e-13
  8.74303438e-16 9.99877095e-01 3.19979408e-05 4.54731501e-15
  6.33200003e-12 2.47874752e-20 2.58575171e-13 3.62880438e-19
  1.24502504e-26 1.12094206e-22]]
[[6.0081846e-05 8.8077498e-16 8.9329585e-21 1.0491477e-36 1.3709619e-17
  1.4646831e-20 1.4391729e-12 5.3806894e-13 8.8702291e-16 9.9991608e-01
  2.3818859e-05 3.9102365e-15 4.4914597e-12 1.3944417e-20 1.6928574e-13
  2.1685137e-19 6.6129894e-27 5.5876242e-23]
 [7.3959287e-05 9.7573971e-16 1.1123053e-20 1.0863227e-36 1.1953950e-17
  1.6075837e-20 1.3670584e-12 5.6232417e-13 8.1033112e-16 9.9990034e-01
  2.5762769e-05 3.4194390e-15 6.1401023e-12 1.7694487e-20 1.6305617e-13
  2.5829449e-19 7.9829935e-27 6.2002403e-23]]
total time is 58.42938160896301
average time is 5.637585234642029

While the inference results look roughly correct here, the problem is how long it takes.
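The averages above are taken over all calls. A harness that times warm-up calls separately would show whether the roughly 5.6 s is paid on every request or mostly on the first one. The sketch below is not the timing code from check_inference.py: `time_calls` is a hypothetical helper, and `infer` stands for whatever zero-argument callable sends one request.

```python
import statistics
import time


def time_calls(infer, n_calls=10, n_warmup=1):
    """Time warm-up calls and steady-state calls of `infer` separately."""
    warmup = []
    for _ in range(n_warmup):
        start = time.perf_counter()
        infer()
        warmup.append(time.perf_counter() - start)

    steady = []
    for _ in range(n_calls):
        start = time.perf_counter()
        infer()
        steady.append(time.perf_counter() - start)

    return {
        "warmup_s": warmup,
        "steady_mean_s": statistics.mean(steady),
        "steady_median_s": statistics.median(steady),
        "steady_max_s": max(steady),
    }


if __name__ == "__main__":
    # Stand-in workload; replace the lambda with the real inference request.
    print(time_calls(lambda: time.sleep(0.01), n_calls=5))
```

If the warm-up time is large and the steady-state median is small, the cost is a one-off. If every call is slow, the slowdown is in the per-request path.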

Hi,
We recommend checking the sample links below, as they might address your concern.

If the issue persists, please share the model and script so that we can try to reproduce it on our end.
Thanks!

@NVES I built all of my code using those links. I have also added my script to the post above. I will train a special imagenet tomorrow and get you a model that you can reproduce this error with.


Also for further reference, here are the two logs from the TF1 and TF2 model exports:

TF1_log
TF2_log

Hi @sharan,

Please share a model and scripts that reproduce the issue, along with the steps to reproduce it, so that we can try it on our end.

Thank you.