Cannot convert a TensorFlow saved_model to a TensorRT-optimized saved_model

I downloaded the ResNet-50 saved_model for TensorFlow Serving, following the steps in https://medium.com/tensorflow/optimizing-tensorflow-serving-performance-with-nvidia-tensorrt-6d8a2347869a, with:

curl -s https://storage.googleapis.com/download.tensorflow.org/models/official/20181001_resnet/savedmodels/resnet_v2_fp32_savedmodel_NHWC_jpg.tar.gz | tar --strip-components=2 -C /tmp/resnet -xvz

Then I optimized this ResNet-50 saved_model with TensorRT by running (using Docker):

docker run --rm --runtime=nvidia -it \
    -v /tmp:/tmp tensorflow/tensorflow:nightly-gpu \
    /usr/local/bin/saved_model_cli convert \
    --dir /tmp/resnet/1538686847 \
    --output_dir /tmp/resnet_trt/1538686847 \
    --tag_set serve \
    tensorrt --precision_mode FP32 --max_batch_size 1 --is_dynamic_op True

It shows no errors. But the directory /tmp/resnet_trt/1538686847/variables/ is empty: it contains neither variables.data-00000-of-00001 nor variables.index. The only file under /tmp/resnet_trt/1538686847 is a frozen graph file, saved_model.pb, which is larger than the original (it looks like it contains a TensorRT engine). So did the conversion fail? How can I fix it?

Looking forward to your reply.

You need to convert your TensorFlow pb graph to UFF format first.

@alexander.spivakovsky

I read the tutorial https://docs.nvidia.com/deeplearning/dgx/integrate-tf-trt/index.html#integrate-ovr, but it does not say that the TensorFlow pb graph has to be converted to UFF format first. It seems that we just need to call the function create_inference_graph https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/compiler/tensorrt/trt_convert.py#L775 on a saved_model, pb, or ckpt.

Hi,

For future reference, tensorflow/tensorflow:nightly-gpu is not one of our containers. We can better support containers from https://ngc.nvidia.com, such as this one: https://ngc.nvidia.com/catalog/containers/nvidia:tensorflow

Thanks,
NVIDIA Enterprise Support

@NVES_R thanks for your reply. After pulling nvcr.io/nvidia/tensorflow:19.03-py2 from https://ngc.nvidia.com/catalog/containers/nvidia:tensorflow, I ran:

docker run --rm --runtime=nvidia -it \
    --env CUDA_VISIBLE_DEVICES=2 \
    -v /tmp:/tmp nvcr.io/nvidia/tensorflow:19.03-py2 \
    /usr/local/bin/saved_model_cli convert \
    --dir /tmp/resnet/1538686847 \
    --output_dir /tmp/resnet_trt/1538686847 \
    --tag_set serve \
    tensorrt --precision_mode FP32 --max_batch_size 1 --is_dynamic_op True

The log suggests that the conversion succeeded:

2019-03-26 11:43:20.122105: I tensorflow/contrib/tensorrt/convert/convert_graph.cc:928] Number of TensorRT candidate segments: 1
2019-03-26 11:43:20.299920: I tensorflow/contrib/tensorrt/convert/convert_graph.cc:1030] TensorRT node resnet_model/TRTEngineOp_0 added for segment 0 consisting of 199 nodes succeeded.
2019-03-26 11:43:20.358346: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:616] Optimization results for grappler item: tf_graph
2019-03-26 11:43:20.358418: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:618]   constant folding: Graph size after: 308 nodes (-112), 363 edges (-114), time = 313.582ms.
2019-03-26 11:43:20.358432: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:618]   layout: Graph size after: 308 nodes (0), 363 edges (0), time = 73.565ms.
2019-03-26 11:43:20.358445: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:618]   constant folding: Graph size after: 308 nodes (0), 363 edges (0), time = 127.602ms.
2019-03-26 11:43:20.358457: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:618]   TensorRTOptimizer: Graph size after: 110 nodes (-198), 157 edges (-206), time = 470.955ms.

But the directory /tmp/resnet_trt/1538686847/variables/ is still empty: it contains neither variables.data-00000-of-00001 nor variables.index. The only file under /tmp/resnet_trt/1538686847 is still a frozen graph file, saved_model.pb, which is larger than the original (it looks like it contains a TensorRT engine).
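
One rough way to check the converted directory without loading TensorFlow is to look at the files directly. The sketch below is only a heuristic: it assumes the op type name TRTEngineOp (taken from the log above) is stored as plain bytes inside saved_model.pb, and the helper name inspect_saved_model is made up for illustration.

```python
from pathlib import Path


def inspect_saved_model(model_dir):
    """Summarize a SavedModel directory without importing TensorFlow.

    Heuristic only: searches saved_model.pb for the byte string
    b"TRTEngineOp" instead of parsing the protobuf.
    """
    root = Path(model_dir)
    pb_path = root / "saved_model.pb"
    var_dir = root / "variables"
    pb_bytes = pb_path.read_bytes() if pb_path.is_file() else b""
    var_files = sorted(p.name for p in var_dir.iterdir()) if var_dir.is_dir() else []
    return {
        "pb_size_bytes": len(pb_bytes),
        "variable_files": var_files,
        "mentions_trt_engine_op": b"TRTEngineOp" in pb_bytes,
    }


if __name__ == "__main__":
    # Paths from this thread; adjust them to your own layout.
    for path in ("/tmp/resnet/1538686847", "/tmp/resnet_trt/1538686847"):
        print(path, inspect_saved_model(path))
```

If mentions_trt_engine_op is True for the converted directory and False for the original, the converter at least wrote a TensorRT node into the graph.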

Hi,

Unfortunately, this is a TensorFlow API, so TensorRT doesn’t maintain documentation on saved_model_cli. Maybe this will help: https://www.tensorflow.org/guide/saved_model

However, for what it’s worth - I followed the steps from here: https://medium.com/tensorflow/optimizing-tensorflow-serving-performance-with-nvidia-tensorrt-6d8a2347869a, and the output is still named saved_model.pb, but the size has changed and it looks like this is a TensorRT engine (just poorly named by saved_model_cli). Also, the variables directory is empty as well. So it seems like this is expected.

If you try to serve it using the steps from the article:

docker run --rm --runtime=nvidia -p 8501:8501 \
    --name tfserving_resnet \
    -v /tmp/resnet_trt:/models/resnet \
    -e MODEL_NAME=resnet \
    -t tensorflow/serving:latest-gpu &

python /tmp/resnet/resnet_client.py

It seems to be working.
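
Besides running the client, the server itself can be asked whether the model loaded. The following is a minimal sketch that assumes the REST port 8501 published by the docker command above and TensorFlow Serving's model status endpoint (GET /v1/models/<name>); the function names are illustrative, and the response shape handled here (a model_version_status list with version and state fields) should be checked against your server version.

```python
import json
from urllib.request import urlopen


def status_url(model_name, host="localhost", port=8501):
    """Build the TensorFlow Serving REST URL for a model's status."""
    return "http://{}:{}/v1/models/{}".format(host, port, model_name)


def available_versions(status):
    """Return the versions reported as AVAILABLE in a status response."""
    return [
        entry["version"]
        for entry in status.get("model_version_status", [])
        if entry.get("state") == "AVAILABLE"
    ]


def fetch_status(model_name, host="localhost", port=8501, timeout=5.0):
    """Query a running server; the serving container must already be up."""
    with urlopen(status_url(model_name, host, port), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the container running, available_versions(fetch_status("resnet")) should list the version directory name if the converted model was loaded.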

Thanks,
NVIDIA Enterprise Support

@NVES_R yes, you are right: the saved_model.pb under /tmp/resnet_trt/1538686847 is larger than the original, and it looks like it contains a TensorRT engine. I have updated my question.
Then I served it with TensorFlow Serving and ran python /tmp/resnet/resnet_client.py. It works. Thank you very much. But I am still wondering why TensorFlow Serving can serve a model that consists of only a single frozen graph file.

Hi,

I’m not too sure of TensorFlow Serving’s capabilities, but I do know that TensorRT Inference Server can serve a configurable number of instances of multiple models. I would recommend checking it out; it is also available as a container on NGC.

Thanks,
NVIDIA Enterprise Support

@NVES_R thanks. After some tests, I found a problem:
I have a saved_model of my own model that has an additional signature_def, predict_images, besides the default signature_def serving_default. I converted it using the tensorflow/tensorflow:nightly-gpu Docker image:

docker run --rm --runtime=nvidia -it \
    --env CUDA_VISIBLE_DEVICES=2 \
    -v /tmp:/tmp tensorflow/tensorflow:nightly-gpu \
    /usr/local/bin/saved_model_cli convert \
    --dir /tmp/myModel/111111 \
    --output_dir /tmp/myModel_trt/111111 \
    --tag_set serve \
    tensorrt --precision_mode FP32 --max_batch_size 1 --is_dynamic_op True

I find that the new TensorRT-optimized saved_model.pb (under /tmp/myModel_trt/111111) no longer has the additional signature_def predict_images; it has only the default signature_def serving_default. But if I run the same command with the nvcr.io/nvidia/tensorflow:19.03-py2 Docker image from NGC, the problem does not appear.
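
A quick way to compare the signatures before and after conversion is to diff what saved_model_cli show --all reports for the two directories. This is a sketch under assumptions: saved_model_cli must be on PATH (for example inside the same container), the output is assumed to contain lines of the form signature_def['name']:, and the helper names are made up for illustration.

```python
import re
import subprocess


def signature_names(show_output):
    """Extract signature_def names from `saved_model_cli show --all` text."""
    return set(re.findall(r"signature_def\['([^']+)'\]", show_output))


def show_all(saved_model_dir):
    """Run saved_model_cli (must be on PATH) and return its stdout."""
    return subprocess.check_output(
        ["saved_model_cli", "show", "--dir", saved_model_dir, "--all"],
        universal_newlines=True,
    )


def missing_signatures(original_dir, converted_dir):
    """Signatures present in the original model but absent after conversion."""
    before = signature_names(show_all(original_dir))
    after = signature_names(show_all(converted_dir))
    return before - after
```

For the model above, missing_signatures("/tmp/myModel/111111", "/tmp/myModel_trt/111111") would be expected to contain predict_images when the conversion dropped it.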

thank you very much~

Hi,

I’m sorry to hear that, but we do not support the tensorflow/tensorflow:nightly-gpu image.

If the nvcr.io/nvidia/tensorflow:19.03-py2 image is working for you, then I’d suggest using that, because we can better support that image.

Thanks,
NVIDIA Enterprise Support

@NVES_R thanks for your reply. Sorry to bother you again. I now have two more questions.
I took the saved_model from https://storage.googleapis.com/download.tensorflow.org/models/official/20181001_resnet/savedmodels/resnet_v2_fp32_savedmodel_NHWC_jpg.tar.gz and optimized it with TensorRT following the steps in https://medium.com/tensorflow/optimizing-tensorflow-serving-performance-with-nvidia-tensorrt-6d8a2347869a. Then I modified resnet_client_grpc.py https://github.com/tensorflow/serving/blob/master/tensorflow_serving/example/resnet_client_grpc.py to measure the inference time:

# added to the imports at the top of resnet_client_grpc.py
import time
import numpy as np
...
    request.model_spec.name = 'resnet'
    request.model_spec.signature_name = 'predict'
    request.inputs['image_bytes'].CopyFrom(
        tf.contrib.util.make_tensor_proto(data, shape=[1]))

    times = []
    for count in range(1000):
        start = time.time()
        result = stub.Predict(request, 30.0)  # 30 secs timeout
        end = time.time()
        times.append((end - start) * 1000.0)  # latency in ms

    avg_time = np.mean(times[100:])  # exclude the first 100 runs (warm-up)
    print("predict image cat.jpg avg_time:%f\n" % avg_time)
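
The same measurement can be wrapped in a small helper that also reports the median and a high percentile, which are less sensitive to outliers than the mean. This is a sketch, not the client's actual code: it is written for Python 3 (time.perf_counter), the name benchmark is made up, and the workload in the example is a stand-in for the stub.Predict call.

```python
import statistics
import time


def benchmark(call, runs=1000, warmup=100):
    """Time call() `runs` times and summarize post-warm-up latencies in ms."""
    if not 0 <= warmup < runs:
        raise ValueError("warmup must be non-negative and smaller than runs")
    latencies_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    steady = sorted(latencies_ms[warmup:])  # drop warm-up runs
    # Approximate 99th percentile by index into the sorted samples.
    p99_index = min(len(steady) - 1, int(0.99 * len(steady)))
    return {
        "mean_ms": statistics.mean(steady),
        "median_ms": statistics.median(steady),
        "p99_ms": steady[p99_index],
    }


if __name__ == "__main__":
    # Stand-in workload; in the client above it would be
    # lambda: stub.Predict(request, 30.0)
    print(benchmark(lambda: sum(range(1000)), runs=200, warmup=20))
```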

My avg_time results (in ms) are below:

original TensorFlow saved_model: 10.38/10.49/10.39 (ms)
TensorRT-optimized saved_model, FP32: 7.72/7.76/7.79 (ms)
TensorRT-optimized saved_model, FP16: 7.19/7.16/7.21 (ms)

The TensorRT-optimized FP32/FP16 models are clearly faster than the original TensorFlow model: the average latency is roughly 26% lower for FP32 and 31% lower for FP16.

Next I tested my own saved_model, which is based on ResNet-50 and has some up-sampling ops. You can get it from https://github.com/IvyGongoogle/trt_of_tf. I optimized it with TensorRT FP32/FP16, served it with TensorFlow Serving, and measured the inference time with the same modified resnet_client_grpc.py. The avg_time results (in ms) are below:

original TensorFlow saved_model: 48.65/48.74/48.96 (ms)
TensorRT-optimized saved_model, FP32: 47.46/47.33/47.19 (ms)
TensorRT-optimized saved_model, FP16: 43.85/43.77/43.84 (ms)
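
To put the two sets of measurements side by side, the relative latency reductions can be computed directly from the numbers above. This is just arithmetic on the posted values; the dictionary keys and the helper name are made up for illustration.

```python
from statistics import mean

# Latencies in ms, copied from the two result lists in this post.
RESULTS = {
    "resnet-50": {
        "original": [10.38, 10.49, 10.39],
        "trt_fp32": [7.72, 7.76, 7.79],
        "trt_fp16": [7.19, 7.16, 7.21],
    },
    "own model": {
        "original": [48.65, 48.74, 48.96],
        "trt_fp32": [47.46, 47.33, 47.19],
        "trt_fp16": [43.85, 43.77, 43.84],
    },
}


def latency_reduction_pct(baseline, optimized):
    """Percent drop in mean latency going from baseline to optimized."""
    return 100.0 * (mean(baseline) - mean(optimized)) / mean(baseline)


for model, runs in RESULTS.items():
    for variant in ("trt_fp32", "trt_fp16"):
        pct = latency_reduction_pct(runs["original"], runs[variant])
        print("{}: {} is {:.1f}% lower than original".format(model, variant, pct))
    fp16_vs_fp32 = latency_reduction_pct(runs["trt_fp32"], runs["trt_fp16"])
    print("{}: FP16 is {:.1f}% lower than FP32".format(model, fp16_vs_fp32))
```

This works out to about 26% (FP32) and 31% (FP16) for ResNet-50, but only about 3% (FP32) and 10% (FP16) for my own model; in both cases FP16 is only about 7% below FP32.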

First question: for my own model, sadly, the inference speed after TensorRT FP32/FP16 optimization is nearly the same as with the original TensorFlow model. What causes this? Is it that only a few ops in my model can be optimized by TensorRT, even though the model is based on ResNet-50 (for which the results above show a good speedup)?

Second question: for both the original ResNet-50 and my own model, TensorRT FP16 is only slightly faster than TensorRT FP32, but in theory shouldn't it be almost twice as fast?

Can you give me some advice? Looking forward to your reply.