Supported ops problem for TensorFlow-TensorRT

Hello, I got the ResNet-50 saved_model with curl -s https://storage.googleapis.com/download.tensorflow.org/models/official/20181001_resnet/savedmodels/resnet_v2_fp32_savedmodel_NHWC_jpg.tar.gz | tar --strip-components=2 -C /tmp/resnet -xvz, following the steps from https://medium.com/tensorflow/optimizing-tensorflow-serving-performance-with-nvidia-tensorrt-6d8a2347869a. Then I used nvcr.io/nvidia/tensorflow:19.03-py2 to optimize this saved_model by running (in Docker):

docker run --rm --runtime=nvidia -it --env CUDA_VISIBLE_DEVICES=2 -v /tmp:/tmp nvcr.io/nvidia/tensorflow:19.03-py2 /usr/local/bin/saved_model_cli convert --dir /tmp/resnet/1538687457 --output_dir /tmp/resnet_trt/1538687457 --tag_set serve tensorrt --precision_mode FP32 --max_batch_size 1 --is_dynamic_op True
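
For reference, the same conversion can also be done from Python with the TF-TRT API inside this container. This is only a sketch, assuming the TF 1.13-era tf.contrib.tensorrt signature that ships in 19.03-py2:

# Sketch: Python equivalent of the saved_model_cli conversion above
# (assumes TF 1.13's tf.contrib.tensorrt; paths are the ones from this thread).
import tensorflow.contrib.tensorrt as trt

trt.create_inference_graph(
    input_graph_def=None,                            # unused when converting a SavedModel
    outputs=None,
    input_saved_model_dir='/tmp/resnet/1538687457',
    input_saved_model_tags=['serve'],
    output_saved_model_dir='/tmp/resnet_trt/1538687457',
    max_batch_size=1,
    precision_mode='FP32',
    is_dynamic_op=True)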

It shows info:

2019-04-15 08:41:36.445773: I tensorflow/contrib/tensorrt/segment/segment.cc:461] There are 70 ops of 36 different types in the graph that are not converted to TensorRT: ArgMax, Exit, NextIteration, TensorArrayWriteV3, Slice, FloorDiv, Softmax, Squeeze, Pack, Range, Sub, Minimum, TensorArraySizeV3, Less, DecodeJpeg, Merge, ResizeBilinear, TensorArrayV3, TensorArrayScatterV3, Shape, Enter, NoOp, LoopCond, StridedSlice, TensorArrayReadV3, Transpose, LogicalAnd, TensorArrayGatherV3, Switch, Identity, Cast, Placeholder, Add, RealDiv, Mul, ExpandDims, (For more information see https://docs.nvidia.com/deeplearning/dgx/integrate-tf-trt/index.html#support-ops).
2019-04-15 08:41:36.496972: I tensorflow/contrib/tensorrt/convert/convert_graph.cc:928] Number of TensorRT candidate segments: 1
2019-04-15 08:41:36.910235: I tensorflow/contrib/tensorrt/convert/convert_graph.cc:1030] TensorRT node resnet_model/TRTEngineOp_0 added for segment 0 consisting of 441 nodes succeeded.
2019-04-15 08:41:37.038405: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:616] Optimization results for grappler item: tf_graph
2019-04-15 08:41:37.038483: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:618]   constant folding: Graph size after: 550 nodes (-256), 613 edges (-258), time = 617.799ms.
2019-04-15 08:41:37.038504: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:618]   layout: Graph size after: 557 nodes (7), 615 edges (2), time = 160.261ms.
2019-04-15 08:41:37.038517: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:618]   constant folding: Graph size after: 552 nodes (-5), 615 edges (0), time = 479.515ms.
2019-04-15 08:41:37.038533: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:618]   TensorRTOptimizer: Graph size after: 112 nodes (-440), 159 edges (-456), time = 824.339ms.

As you can see, some ops like Softmax, Sub, Add, Identity, Mul, Slice and others are not converted to TensorRT, yet these TensorFlow ops are listed as supported at https://docs.nvidia.com/deeplearning/sdk/tensorrt-support-matrix/index.html#supported-ops. What causes this? Can anyone give some advice?

Can an administrator move this topic to [url]https://devtalk.nvidia.com/default/board/225/container-tensorflow/[/url]?
Thanks very much~

Done!

Thanks for your question. Some TensorFlow ops can only be converted under certain circumstances, due to limitations in TensorRT. For example, since TRT does not support integer arithmetic, we cannot convert an Add, Sub, Mul, etc. that operates on integer types.
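
To illustrate the dtype restriction (a hypothetical sketch, TF 1.x graph-mode style; the placeholder names are made up):

import tensorflow as tf

a = tf.placeholder(tf.int32, shape=[None], name='a')
b = tf.placeholder(tf.int32, shape=[None], name='b')
int_add = tf.add(a, b)      # integer Add: left in TensorFlow by TF-TRT

x = tf.placeholder(tf.float32, shape=[None], name='x')
y = tf.placeholder(tf.float32, shape=[None], name='y')
float_add = tf.add(x, y)    # float Add: a candidate for a TensorRT segment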

If you are interested in seeing why specific nodes in your graph aren't converting, you can set the environment variable TF_CPP_VMODULE=segment=1 when running the conversion script. I think you will find that the core of your ResNet model is converting and only the preprocessing is incompatible with TF-TRT. You should still get a good speedup from TF-TRT.

@tmorris Thanks for your reply. Sorry, but after I run

export TF_CPP_VMODULE=segment=1

as you suggested, there are no additional logs; it still shows the same information:

There are 70 ops of 36 different types in the graph that are not converted to TensorRT: ArgMax, Exit, NextIteration, TensorArrayWriteV3, Slice, FloorDiv, Softmax, Squeeze, Pack, Range, Sub, Minimum, TensorArraySizeV3, Less, DecodeJpeg, Merge, ResizeBilinear, TensorArrayV3, TensorArrayScatterV3, Shape, Enter, NoOp, LoopCond, StridedSlice, TensorArrayReadV3, Transpose, LogicalAnd, TensorArrayGatherV3, Switch, Identity, Cast, Placeholder, Add, RealDiv, Mul, ExpandDims, (For more information see https://docs.nvidia.com/deeplearning/dgx/integrate-tf-trt/index.html#support-ops).
2019-04-15 08:41:36.496972: I tensorflow/contrib/tensorrt/convert/convert_graph.cc:928] Number of TensorRT candidate segments: 1
...

Also, what do you mean by my model operating on integer types? I do not understand that part.

Are you sure you are setting the environment variable inside docker? Your command should look like this:

docker run --rm --runtime=nvidia -it --env CUDA_VISIBLE_DEVICES=2 --env TF_CPP_VMODULE=segment=1 -v /tmp:/tmp nvcr.io/nvidia/tensorflow:19.03-py2 /usr/local/bin/saved_model_cli convert --dir /tmp/resnet/1538687457 --output_dir /tmp/resnet_trt/1538687457 --tag_set serve tensorrt --precision_mode FP32 --max_batch_size 1 --is_dynamic_op True

Based on the logs you posted, your model is already being fully converted and you should get good performance. The ops that are not converted are used for image preprocessing and are not a cause for concern. Have you measured your performance gain with TF-TRT?
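
For example, a rough client-side timing loop against TF Serving's REST endpoint could look like this (a hypothetical sketch; the port, model name, and image file are assumptions, not from your setup):

import base64, json, time
import requests

# Hypothetical setup: the converted model is served by TF Serving on
# localhost:8501 under the name 'resnet'; any local JPEG works as input.
with open('test.jpg', 'rb') as f:
    body = json.dumps({'instances': [{'b64': base64.b64encode(f.read()).decode()}]})

url = 'http://localhost:8501/v1/models/resnet:predict'
requests.post(url, data=body)                 # warm-up (TRT engine build, caches)

start = time.time()
for _ in range(100):
    requests.post(url, data=body)
print('mean latency: %.2f ms' % ((time.time() - start) / 100 * 1000.0))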

@tmorris Sorry for the late reply. I used tensorflow/serving:latest-gpu to serve the converted saved_model, which gives about a 25% speedup over the original saved_model:

model       GPU     use TRT?      inference time per image    image size
resnet-50   V100    no            10.38 / 10.49 / 10.39       (360, 480, 3)
resnet-50   V100    yes (FP32)    7.72 / 7.76 / 7.79          (360, 480, 3)
resnet-50   V100    yes (FP16)    7.19 / 7.16 / 7.21          (360, 480, 3)

Now I have two questions:

  1. The speedup from TRT is not obvious: the converted saved_model is only about 25% faster than the original TF model.
  2. Why is FP16 only slightly faster than FP32 when using TRT?