Graphdefs produced by TensorFlow 1.15 and 2.4.1, converted with TF2-based TF-TRT, do not run inference correctly on Triton Server

Description

We are trying to set up a TensorRT server on our platform for real-time inference that can serve both models originally created with TensorFlow 1.15 and models created with TensorFlow 2.4.1. Since the TF 1.15 models were mostly built with tf-slim, models from both versions are exported as graphdefs on our platform. Converting these to TensorRT models used to be easy, because the graphdefs could be converted directly (the original pipeline was built with nvcr.io/nvidia/tensorflow:20.03-tf1-py3). With TF-TRT in 2.4.1, however, we first have to convert them to saved-models and then convert those to TensorRT models. With this new pipeline we face two problems:

  • Existing models that go through this pipeline always predict a probability of 0 for every class, with perhaps a single class at 1. This is clearly erroneous: the probability mass should be spread across the output classes.
  • Inference has become much slower. The inference time went up from about 20 ms to an average of about 5 s per request on a first run and 0.170-0.180 s per request on subsequent runs (timings below). Our downstream systems require faster inference.
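
A quick way to tell a one-hot or all-zero output row from a properly distributed one is to summarise each row instead of printing it. The sketch below is illustrative only: describe_row is a hypothetical helper (it is not part of check_inference.py), and the tolerance is an assumption.

```python
def describe_row(probs, tol=1e-6):
    """Summarise one row of class probabilities (hypothetical helper)."""
    peak = max(probs)
    nonzero = sum(1 for p in probs if abs(p) > tol)
    return {
        "sum": sum(probs),        # close to 1.0 for a softmax output
        "max": peak,              # largest single probability
        "nonzero": nonzero,       # number of classes with non-negligible mass
        # True when exactly one class carries all of the mass
        "one_hot": nonzero == 1 and abs(peak - 1.0) <= tol,
    }
```

Applied to each of the two rows of a [2, 601] response, a row reported as one_hot (or with nonzero equal to 0) would match the symptom described in the first bullet.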

Note that there is a separate branch for Keras compatibility, which will allow us to skip the graphdef part of the pipeline. However, we would lose the existing models, so we need to fix this version of the pipeline too. No part of this question is related to the ongoing Keras efforts.

Environment

TensorRT Version: Not really sure, but it’s using the TF-TRT from TF 2.4.1 now and previously the tensorflow.python.compiler.tensorrt from TF 1.15.

GPU Type: GeForce GTX 1080 Ti

Nvidia Driver Version:

NVIDIA-SMI 465.19.01 Driver Version: 465.19.01 CUDA Version: 11.3

CUDA Version:

CUDA Version: 11.3

CUDNN Version:

#define CUDNN_MAJOR 8
#define CUDNN_MINOR 2
#define CUDNN_PATCHLEVEL 0

Operating System + Version:

NAME="Ubuntu"
VERSION="16.04.5 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.5 LTS"
VERSION_ID="16.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
VERSION_CODENAME=xenial
UBUNTU_CODENAME=xenial

Python Version (if applicable): 3.6 while training the model, 3.8.5 in the container
TensorFlow Version (if applicable): As stated above, 1.15 and 2.4.1 for training the models, and 2.4.0 within the container to export the model.
Baremetal or Container (if container which image + tag): nvcr.io/nvidia/tensorflow:21.03-tf2-py3

Relevant Files

There is an attached zipped file that unzips to the following structure:

nvidia_reproducible/
└── tf2
    ├── Pipfile
    ├── export_model.py
    ├── export_model_main.py
    └── models/
        ├── open_images_inception_V3_TF2
        │   ├── config.ini
        │   ├── model.ckpt-8000.data-00000-of-00001
        │   ├── model.ckpt-8000.index
        │   ├── model.ckpt-8000.meta
        │   └── open_images_inception_V3_TF2.pb
        └── open_images_open_images_inception_V3_TF1_100k
            ├── config.ini
            ├── model.ckpt-100000.data-00000-of-00001
            ├── model.ckpt-100000.index
            ├── model.ckpt-100000.meta
            └── open_images_open_images_inception_V3_TF1_100k.pb

Link to the compressed folder. To run the inference, the file check_inference.py is also needed. Please install the libraries imported at the top of that Python script before running it.

open_images_inception_V3_TF2 is the model that was trained exclusively in TF2 and exported using the code in the compressed folder. open_images_open_images_inception_V3_TF1_100k was trained and exported with TF1 code, but needs to run on the new TensorRT server.
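
check_inference.py is not reproduced in this post. As a rough illustration of the kind of request it sends, the sketch below builds a JSON body for Triton's HTTP endpoint /v2/models/<model>/infer. The tensor names "input" and "out" and the FP32 type are taken from the server log further down; build_infer_request is a hypothetical helper, and the real script may well use the tritonclient library rather than raw JSON.

```python
import json


def build_infer_request(batch, input_name="input", output_name="out"):
    """Build an inference request body from a nested list [N][H][W][C] of floats."""
    n = len(batch)
    h = len(batch[0])
    w = len(batch[0][0])
    c = len(batch[0][0][0])
    # Flatten in row-major order, matching the declared shape.
    flat = [v for image in batch for row in image for pixel in row for v in pixel]
    return {
        "inputs": [
            {"name": input_name, "shape": [n, h, w, c], "datatype": "FP32", "data": flat}
        ],
        "outputs": [{"name": output_name}],
    }


# Tiny stand-in batch; the models in this post expect [2, 299, 299, 3].
tiny_batch = [[[[0.0, 0.5, 1.0] for _ in range(2)] for _ in range(2)] for _ in range(2)]
body = json.dumps(build_infer_request(tiny_batch))
```

The resulting string would be sent as the POST body to http://<host>:8000/v2/models/<model_name>/infer; sending a full 299x299 batch as JSON is only meant for debugging.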

Steps To Reproduce

Pipenv Environment

A Pipfile is included in the compressed folder so that you can check the packages used in the repository. If it is not needed, please ignore this file.

Exporting the model from a checkpoint

Within tf2/ in the compressed folder, the function export_model() from export_model.py was used to convert a checkpoint into a graphdef. This is often done in the cloud or on the Ubuntu device itself. The resulting open_images_open_images_inception_V3_TF1_100k.pb, exported from the checkpoint model.ckpt-100000, is provided in the compressed folder under tf2/models/open_images_open_images_inception_V3_TF1_100k/, and likewise for open_images_inception_V3_TF2.

Exporting the TensorRT model

Then the docker image is run with the command:

docker run --gpus all --rm --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -eMODEL_NAME=open_images_open_images_inception_V3_TF1_100k -ti -v<path_to_project>/nvidia_reproducible/tf2/:/trainer nvcr.io/nvidia/tensorflow:21.03-tf2-py3

Once inside the docker, use these commands:

cd /trainer
apt-get update && apt-get install -y libcurl4 libcurl4-openssl-dev
export PYTHONPATH=`pwd`
mkdir /opt/tensorflow/horovod-source/.eggs/
touch /opt/tensorflow/horovod-source/.eggs/easy-install.pth    
pip install tensorflow-probability==0.8
pip install opencv-python-headless
pip install tensorrtserver
pip install tf_slim
pip install nvidia-pyindex
pip install tritonclient
python ./export_model_main.py --model_dir=${MODEL_NAME} --device_query_tool ''
# Now convert the TF2 model
python ./export_model_main.py --model_dir=open_images_inception_V3_TF2 --device_query_tool ''
exit
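
The internals of export_model_main.py are not shown in this post, so the following is only a minimal sketch of how a saved-model can be converted with TF-TRT in TF 2.4. It assumes FP32 precision and the [2, 299, 299, 3] input shape seen in the server logs; the function name and the directory arguments are hypothetical. Calling converter.build() with a representative input builds the TensorRT engines at export time. Whether this avoids the slow first requests reported below has not been verified here.

```python
def convert_and_prebuild(saved_model_dir, output_dir, input_shape=(2, 299, 299, 3)):
    """Convert a saved-model with TF-TRT and build engines for one input shape."""
    # Imported lazily so this file can be loaded on machines without TensorFlow.
    import numpy as np
    from tensorflow.python.compiler.tensorrt import trt_convert as trt

    # Assumption: FP32 precision, to rule out precision loss while debugging.
    params = trt.DEFAULT_TRT_CONVERSION_PARAMS._replace(precision_mode="FP32")
    converter = trt.TrtGraphConverterV2(
        input_saved_model_dir=saved_model_dir,
        conversion_params=params,
    )
    converter.convert()

    def input_fn():
        # One dummy batch with the shape the server will receive.
        yield (np.zeros(input_shape, dtype=np.float32),)

    # Build the engines now rather than on the first inference request.
    converter.build(input_fn=input_fn)
    converter.save(output_dir)


# Hypothetical usage inside the container:
# convert_and_prebuild("/trainer/models/<model>/saved_model", "/trainer/models/<model>/1/model.savedmodel")
```

Engines built this way are specific to the GPU and TensorRT version used for the build, so the export would need to run on the same kind of machine that serves the model.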

Running the TensorRT server (Triton)

To run the server, we use this command:

docker run --gpus=1 --rm -p8000:8000 -p8001:8001 -p8002:8002 -v<path_to_project>/nvidia_reproducible/tf2/models:/models nvcr.io/nvidia/tritonserver:21.03-py3 tritonserver --model-repository=/models --log-verbose=1
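
Before sending inference requests it can help to confirm that the server and the model are actually ready. The sketch below uses only the Python standard library against Triton's HTTP health endpoints on the port mapped above (8000); triton_urls and is_ok are hypothetical helper names, and localhost is an assumption about where the client runs.

```python
import urllib.error
import urllib.request


def triton_urls(model_name, host="localhost", port=8000):
    """Return the liveness, readiness and per-model readiness URLs."""
    base = f"http://{host}:{port}/v2"
    return {
        "live": f"{base}/health/live",
        "ready": f"{base}/health/ready",
        "model_ready": f"{base}/models/{model_name}/ready",
    }


def is_ok(url, timeout=2.0):
    """Return True if the endpoint answers with HTTP 200, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Covers connection errors, timeouts and non-2xx responses.
        return False


# Example (requires a running server):
# for name, url in triton_urls("open_images_open_images_inception_V3_TF1_100k").items():
#     print(name, is_ok(url))
```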

Running Inference for the TF1 model

Now if we run check_inference.py with NUM_TRIES = 10 and MODEL_NAME = 'open_images_open_images_inception_V3_TF1_100k', we get this output on the server side:

I0522 04:39:54.510930 1 tensorflow.cc:2100] model open_images_open_images_inception_V3_TF1_100k, instance open_images_open_images_inception_V3_TF1_100k_0_1, executing 1 requests
I0522 04:39:54.510965 1 tensorflow.cc:1389] TRITONBACKEND_ModelExecute: Running open_images_open_images_inception_V3_TF1_100k_0_1 with 1 requests
I0522 04:39:54.511341 1 tensorflow.cc:1617] TRITONBACKEND_ModelExecute: input 'input' is GPU tensor: false
I0522 04:39:54.518132 1 infer_response.cc:165] add response output: output: out, type: FP32, shape: [2,601]
I0522 04:39:54.518173 1 http_server.cc:1200] HTTP using buffer for: 'out', size: 4808, addr: 0x7f1f9a21dc60
I0522 04:39:54.518193 1 tensorflow.cc:1800] TRITONBACKEND_ModelExecute: output 'out' is GPU tensor: false
I0522 04:39:54.518245 1 http_server.cc:1215] HTTP release: size 4808, addr 0x7f1f9a21dc60
I0522 04:39:54.518270 1 tensorflow.cc:1858] TRITONBACKEND_ModelExecute: model open_images_open_images_inception_V3_TF1_100k_0_1 released 1 requests
I0522 04:39:54.801961 1 http_server.cc:1229] HTTP request: 2 /v2/models/open_images_open_images_inception_V3_TF1_100k/infer
I0522 04:39:54.802026 1 model_repository_manager.cc:656] GetInferenceBackend() 'open_images_open_images_inception_V3_TF1_100k' version -1
I0522 04:39:54.802044 1 model_repository_manager.cc:656] GetInferenceBackend() 'open_images_open_images_inception_V3_TF1_100k' version -1
I0522 04:39:54.898306 1 infer_request.cc:502] prepared: [0x0x7f210c009560] request id: , model: open_images_open_images_inception_V3_TF1_100k, requested version: -1, actual version: 1, flags: 0x0, correlation id: 0, batch size: 2, priority: 0, timeout (us): 0
original inputs:
[0x0x7f210c008098] input: input, type: FP32, original shape: [2,299,299,3], batch + shape: [2,299,299,3], shape: [299,299,3]
override inputs:
inputs:
[0x0x7f210c008098] input: input, type: FP32, original shape: [2,299,299,3], batch + shape: [2,299,299,3], shape: [299,299,3]
original requested outputs:
requested outputs:
out

I0522 04:39:54.898440 1 tensorflow.cc:2100] model open_images_open_images_inception_V3_TF1_100k, instance open_images_open_images_inception_V3_TF1_100k_0_1, executing 1 requests
I0522 04:39:54.898478 1 tensorflow.cc:1389] TRITONBACKEND_ModelExecute: Running open_images_open_images_inception_V3_TF1_100k_0_1 with 1 requests
I0522 04:39:54.898838 1 tensorflow.cc:1617] TRITONBACKEND_ModelExecute: input 'input' is GPU tensor: false
I0522 04:39:54.905740 1 infer_response.cc:165] add response output: output: out, type: FP32, shape: [2,601]
I0522 04:39:54.905789 1 http_server.cc:1200] HTTP using buffer for: 'out', size: 4808, addr: 0x7f1f9a21dc60
I0522 04:39:54.905807 1 tensorflow.cc:1800] TRITONBACKEND_ModelExecute: output 'out' is GPU tensor: false
I0522 04:39:54.905860 1 http_server.cc:1215] HTTP release: size 4808, addr 0x7f1f9a21dc60
I0522 04:39:54.905892 1 tensorflow.cc:1858] TRITONBACKEND_ModelExecute: model open_images_open_images_inception_V3_TF1_100k_0_1 released 1 requests

And on the side of check_inference.py, we get:

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
total time is 61.868407249450684
average time is 5.981230926513672
[20.122692584991455, 19.282654523849487, 19.213167190551758, 0.17621302604675293, 0.16901206970214844, 0.17193841934204102, 0.171095609664917, 0.16964077949523926, 0.1677391529083252, 0.16815590858459473]

The second time check_inference.py was called, we get:

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
total time is 3.9601998329162598
average time is 0.17387678623199462
[0.19607853889465332, 0.1770946979522705, 0.17362165451049805, 0.17151832580566406, 0.17002081871032715, 0.16779732704162598, 0.17086052894592285, 0.16854619979858398, 0.17175626754760742, 0.17147350311279297]
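
The average of the first run is dominated by a few very slow requests, so it is more informative to separate them from the rest. The sketch below does this for the first-run timings listed above; split_warmup is a hypothetical helper, and the cutoff of 10 times the median is an arbitrary assumption that only works while fewer than half of the requests are slow.

```python
from statistics import mean, median

# Per-request times (seconds) from the first run of check_inference.py above.
first_run = [
    20.122692584991455, 19.282654523849487, 19.213167190551758,
    0.17621302604675293, 0.16901206970214844, 0.17193841934204102,
    0.171095609664917, 0.16964077949523926, 0.1677391529083252,
    0.16815590858459473,
]


def split_warmup(latencies, factor=10.0):
    """Split latencies into slow outliers and steady-state requests."""
    cutoff = factor * median(latencies)
    slow = [t for t in latencies if t > cutoff]
    steady = [t for t in latencies if t <= cutoff]
    return slow, steady


slow, steady = split_warmup(first_run)
print(f"slow requests:   {len(slow)}, mean {mean(slow):.2f} s")
print(f"steady requests: {len(steady)}, mean {mean(steady) * 1000:.1f} ms")
```

Seen this way, the problem splits into two parts: a few requests of roughly 19-20 s each, and a steady-state latency of about 170 ms, which is still far above the 20 ms of the previous pipeline.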

Running Inference for the TF2 model

Now if we run check_inference.py with NUM_TRIES = 10 and MODEL_NAME = 'open_images_inception_V3_TF2', we get this output on the server side:

I0522 04:47:32.612510 1 grpc_server.cc:3427] New request handler for ModelStreamInferHandler, 3
I0522 04:47:32.612555 1 grpc_server.cc:2146] Thread started for ModelStreamInferHandler
I0522 04:47:32.612572 1 grpc_server.cc:3983] Started GRPCInferenceService at 0.0.0.0:8001
I0522 04:47:32.613084 1 http_server.cc:2717] Started HTTPService at 0.0.0.0:8000
I0522 04:47:32.655274 1 http_server.cc:2736] Started Metrics Service at 0.0.0.0:8002
I0522 04:49:07.003303 1 http_server.cc:1229] HTTP request: 0 /v2/health/live
I0522 04:49:07.274691 1 http_server.cc:1229] HTTP request: 2 /v2/models/open_images_open_images_inception_V3_TF1_100k/infer
I0522 04:49:07.274759 1 model_repository_manager.cc:656] GetInferenceBackend() 'open_images_open_images_inception_V3_TF1_100k' version -1
I0522 04:49:07.274778 1 model_repository_manager.cc:656] GetInferenceBackend() 'open_images_open_images_inception_V3_TF1_100k' version -1
I0522 04:49:07.372327 1 infer_request.cc:502] prepared: [0x0x7f5698003cd0] request id: , model: open_images_open_images_inception_V3_TF1_100k, requested version: -1, actual version: 1, flags: 0x0, correlation id: 0, batch size: 2, priority: 0, timeout (us): 0
original inputs:
[0x0x7f5698001f38] input: input, type: FP32, original shape: [2,299,299,3], batch + shape: [2,299,299,3], shape: [299,299,3]
override inputs:
inputs:
[0x0x7f5698001f38] input: input, type: FP32, original shape: [2,299,299,3], batch + shape: [2,299,299,3], shape: [299,299,3]
original requested outputs:
requested outputs:
out

I0522 04:49:07.372520 1 tensorflow.cc:2100] model open_images_open_images_inception_V3_TF1_100k, instance open_images_open_images_inception_V3_TF1_100k_0_2, executing 1 requests
I0522 04:49:07.372564 1 tensorflow.cc:1389] TRITONBACKEND_ModelExecute: Running open_images_open_images_inception_V3_TF1_100k_0_2 with 1 requests
I0522 04:49:07.374647 1 tensorflow.cc:1617] TRITONBACKEND_ModelExecute: input 'input' is GPU tensor: false
2021-05-22 04:49:08.245720: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:568] remapper failed: Invalid argument: Mutation::Apply error: multiple nodes with the name: 'InceptionV3/InceptionV3/Mixed_7c/Branch_3/Conv2d_0b_1x1/BatchNorm/FusedBatchNormV3/NCHWShapedOffset' exists in Mutation.
2021-05-22 04:49:08.954258: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:568] remapper failed: Invalid argument: Mutation::Apply error: multiple nodes with the name: 'InceptionV3/InceptionV3/Mixed_7c/Branch_3/Conv2d_0b_1x1/BatchNorm/FusedBatchNormV3/NCHWShapedOffset' exists in Mutation.
2021-05-22 04:49:09.254271: I tensorflow/compiler/tf2tensorrt/kernels/trt_engine_op.cc:733] Building a new TensorRT engine for PartitionedCall_1/PartitionedCall/InceptionV3/TRTEngineOp_0_0 input shapes: [[2,299,299,3]]
2021-05-22 04:49:09.254794: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libnvinfer_plugin.so.7
I0522 04:49:26.836016 1 infer_response.cc:165] add response output: output: out, type: FP32, shape: [2,601]
I0522 04:49:26.836082 1 http_server.cc:1200] HTTP using buffer for: 'out', size: 4808, addr: 0x7f565fd13ff0
I0522 04:49:26.836104 1 tensorflow.cc:1800] TRITONBACKEND_ModelExecute: output 'out' is GPU tensor: false
I0522 04:49:26.836179 1 http_server.cc:1215] HTTP release: size 4808, addr 0x7f565fd13ff0
I0522 04:49:26.836211 1 tensorflow.cc:1858] TRITONBACKEND_ModelExecute: model open_images_open_images_inception_V3_TF1_100k_0_2 released 1 requests
I0522 04:49:27.121741 1 http_server.cc:1229] HTTP request: 2 /v2/models/open_images_open_images_inception_V3_TF1_100k/infer
I0522 04:49:27.121809 1 model_repository_manager.cc:656] GetInferenceBackend() 'open_images_open_images_inception_V3_TF1_100k' version -1
I0522 04:49:27.121828 1 model_repository_manager.cc:656] GetInferenceBackend() 'open_images_open_images_inception_V3_TF1_100k' version -1
I0522 04:49:27.220424 1 infer_request.cc:502] prepared: [0x0x7f5698004c90] request id: , model: open_images_open_images_inception_V3_TF1_100k, requested version: -1, actual version: 1, flags: 0x0, correlation id: 0, batch size: 2, priority: 0, timeout (us): 0
original inputs:
[0x0x7f5698004978] input: input, type: FP32, original shape: [2,299,299,3], batch + shape: [2,299,299,3], shape: [299,299,3]
override inputs:
inputs:
[0x0x7f5698004978] input: input, type: FP32, original shape: [2,299,299,3], batch + shape: [2,299,299,3], shape: [299,299,3]
original requested outputs:
requested outputs:
out

On the side of check_inference.py, the output is:

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
total time is 61.49000334739685
average time is 5.935466170310974
[19.63507580757141, 0.17602181434631348, 19.014800310134888, 0.17351651191711426, 0.17123889923095703, 19.48599362373352, 0.17773032188415527, 0.17619085311889648, 0.17372965812683105, 0.1703639030456543]

Now if we run check_inference.py the second time, the output is:

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
total time is 3.957824230194092
average time is 0.17999637126922607
[0.23171639442443848, 0.1756727695465088, 0.17092442512512207, 0.17671537399291992, 0.17761540412902832, 0.16989755630493164, 0.1754920482635498, 0.17484045028686523, 0.1747884750366211, 0.1723008155822754]