Supported ops problem for TensorFlow-TensorRT

Hello, I got the ResNet-50 saved_model with curl -s https://storage.googleapis.com/download.tensorflow.org/models/official/20181001_resnet/savedmodels/resnet_v2_fp32_savedmodel_NHWC_jpg.tar.gz | tar --strip-components=2 -C /tmp/resnet -xvz, following the steps from https://medium.com/tensorflow/optimizing-tensorflow-serving-performance-with-nvidia-tensorrt-6d8a2347869a. Then I used nvcr.io/nvidia/tensorflow:19.03-py2 to optimize this saved_model by running (in Docker):

docker run --rm --runtime=nvidia -it --env CUDA_VISIBLE_DEVICES=2 -v /tmp:/tmp nvcr.io/nvidia/tensorflow:19.03-py2 /usr/local/bin/saved_model_cli convert --dir /tmp/resnet/1538687457 --output_dir /tmp/resnet_trt/1538687457 --tag_set serve tensorrt --precision_mode FP32 --max_batch_size 1 --is_dynamic_op True
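
For reference, the same conversion can also be done from Python with the TF-TRT API inside this container. This is only a sketch, assuming the TF 1.13-era tf.contrib.tensorrt signature that ships in 19.03-py2:

# Sketch: Python equivalent of the saved_model_cli conversion above
# (assumes TF 1.13's tf.contrib.tensorrt; paths are the ones from this thread).
import tensorflow.contrib.tensorrt as trt

trt.create_inference_graph(
    input_graph_def=None,                            # unused when converting a SavedModel
    outputs=None,
    input_saved_model_dir='/tmp/resnet/1538687457',
    input_saved_model_tags=['serve'],
    output_saved_model_dir='/tmp/resnet_trt/1538687457',
    max_batch_size=1,
    precision_mode='FP32',
    is_dynamic_op=True)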

It shows info:

2019-04-15 08:41:36.445773: I tensorflow/contrib/tensorrt/segment/segment.cc:461] There are 70 ops of 36 different types in the graph that are not converted to TensorRT: ArgMax, Exit, NextIteration, TensorArrayWriteV3, Slice, FloorDiv, Softmax, Squeeze, Pack, Range, Sub, Minimum, TensorArraySizeV3, Less, DecodeJpeg, Merge, ResizeBilinear, TensorArrayV3, TensorArrayScatterV3, Shape, Enter, NoOp, LoopCond, StridedSlice, TensorArrayReadV3, Transpose, LogicalAnd, TensorArrayGatherV3, Switch, Identity, Cast, Placeholder, Add, RealDiv, Mul, ExpandDims, (For more information see https://docs.nvidia.com/deeplearning/dgx/integrate-tf-trt/index.html#support-ops).
2019-04-15 08:41:36.496972: I tensorflow/contrib/tensorrt/convert/convert_graph.cc:928] Number of TensorRT candidate segments: 1
2019-04-15 08:41:36.910235: I tensorflow/contrib/tensorrt/convert/convert_graph.cc:1030] TensorRT node resnet_model/TRTEngineOp_0 added for segment 0 consisting of 441 nodes succeeded.
2019-04-15 08:41:37.038405: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:616] Optimization results for grappler item: tf_graph
2019-04-15 08:41:37.038483: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:618]   constant folding: Graph size after: 550 nodes (-256), 613 edges (-258), time = 617.799ms.
2019-04-15 08:41:37.038504: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:618]   layout: Graph size after: 557 nodes (7), 615 edges (2), time = 160.261ms.
2019-04-15 08:41:37.038517: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:618]   constant folding: Graph size after: 552 nodes (-5), 615 edges (0), time = 479.515ms.
2019-04-15 08:41:37.038533: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:618]   TensorRTOptimizer: Graph size after: 112 nodes (-440), 159 edges (-456), time = 824.339ms.

As you can see, some ops like Softmax, Sub, Add, Identity, Mul, Slice and others are not converted to TensorRT, yet these TensorFlow ops are listed as supported at https://docs.nvidia.com/deeplearning/sdk/tensorrt-support-matrix/index.html#supported-ops. What causes this? Can anyone give some advice?

Can an administrator move this topic to [url]https://devtalk.nvidia.com/default/board/225/container-tensorflow/[/url]?
Thanks very much~

Done!

Thanks for your question. Some TensorFlow ops can only be converted under certain circumstances, due to limitations in TensorRT. For example, since TRT does not support integer arithmetic, we cannot convert an Add, Sub, Mul, etc. that operates on integer types.
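
To illustrate the dtype restriction (a hypothetical sketch, TF 1.x graph-mode style; the placeholder names are made up):

import tensorflow as tf

a = tf.placeholder(tf.int32, shape=[None], name='a')
b = tf.placeholder(tf.int32, shape=[None], name='b')
int_add = tf.add(a, b)      # integer Add: left in TensorFlow by TF-TRT

x = tf.placeholder(tf.float32, shape=[None], name='x')
y = tf.placeholder(tf.float32, shape=[None], name='y')
float_add = tf.add(x, y)    # float Add: a candidate for a TensorRT segment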

If you are interested in seeing why specific nodes in your graph aren't converting, you can set the environment variable TF_CPP_VMODULE=segment=1 when running the conversion script. I think you will find that the core of your ResNet model is converting and only the preprocessing is incompatible with TF-TRT. You should still get a good speedup from TF-TRT.

@tmorris Thanks for your reply. Sorry, but after I run

export TF_CPP_VMODULE=segment=1

as you suggested, there are no additional logs; it still shows the same information:

There are 70 ops of 36 different types in the graph that are not converted to TensorRT: ArgMax, Exit, NextIteration, TensorArrayWriteV3, Slice, FloorDiv, Softmax, Squeeze, Pack, Range, Sub, Minimum, TensorArraySizeV3, Less, DecodeJpeg, Merge, ResizeBilinear, TensorArrayV3, TensorArrayScatterV3, Shape, Enter, NoOp, LoopCond, StridedSlice, TensorArrayReadV3, Transpose, LogicalAnd, TensorArrayGatherV3, Switch, Identity, Cast, Placeholder, Add, RealDiv, Mul, ExpandDims, (For more information see https://docs.nvidia.com/deeplearning/dgx/integrate-tf-trt/index.html#support-ops).
2019-04-15 08:41:36.496972: I tensorflow/contrib/tensorrt/convert/convert_graph.cc:928] Number of TensorRT candidate segments: 1
...

Also, what do you mean by my model operating on integer types? I do not understand that part.

Are you sure you are setting the environment variable inside docker? Your command should look like this:

docker run --rm --runtime=nvidia -it --env CUDA_VISIBLE_DEVICES=2 --env TF_CPP_VMODULE=segment=1 -v /tmp:/tmp nvcr.io/nvidia/tensorflow:19.03-py2 /usr/local/bin/saved_model_cli convert --dir /tmp/resnet/1538687457 --output_dir /tmp/resnet_trt/1538687457 --tag_set serve tensorrt --precision_mode FP32 --max_batch_size 1 --is_dynamic_op True

Based on the logs you posted, your model is already being fully converted and you should get good performance. The ops that are not converted are used for image preprocessing and are not a cause for concern. Have you measured your performance gain with TF-TRT?
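
For example, a rough client-side timing loop against TF Serving's REST endpoint could look like this (a hypothetical sketch; the port, model name, and image file are assumptions, not from your setup):

import base64, json, time
import requests

# Hypothetical setup: the converted model is served by TF Serving on
# localhost:8501 under the name 'resnet'; any local JPEG works as input.
with open('test.jpg', 'rb') as f:
    body = json.dumps({'instances': [{'b64': base64.b64encode(f.read()).decode()}]})

url = 'http://localhost:8501/v1/models/resnet:predict'
requests.post(url, data=body)                 # warm-up (TRT engine build, caches)

start = time.time()
for _ in range(100):
    requests.post(url, data=body)
print('mean latency: %.2f ms' % ((time.time() - start) / 100 * 1000.0))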

@tmorris Sorry for the late reply. I used tensorflow/serving:latest-gpu to serve the converted saved_model, which gives about a 25% speedup over the original saved_model:

model       GPU     use TRT?      inference time per image    image size
resnet-50   V100    no            10.38 / 10.49 / 10.39       (360, 480, 3)
resnet-50   V100    yes (FP32)    7.72 / 7.76 / 7.79          (360, 480, 3)
resnet-50   V100    yes (FP16)    7.19 / 7.16 / 7.21          (360, 480, 3)

Now I have two questions:

  1. The speedup from TRT is not obvious: the converted saved_model is only about 25% faster than the original TF model.
  2. Why is FP16 only slightly faster than FP32 when using TRT?