Get optimized engine using TF-TRT

I’m trying to convert a TF frozen graph into a TRT graph using the TrtGraphConverter
class. My input has an undefined shape, so I have to set is_dynamic_op=True. I can see that it builds a new TensorRT engine at runtime.

My questions are:

  1. Can I get the optimized engine for a specific input shape and serialize it to disk?
  2. If I change my TF frozen graph to have a fixed shape and run the conversion with is_dynamic_op=False, the optimized engine will be built during conversion. Will this engine be serialized along with my TRT graph?
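
For context, here is roughly what my conversion call looks like (a minimal sketch; the frozen-graph path and output node names below are placeholders):

```python
# Minimal sketch of the dynamic-mode conversion described above.
# The frozen-graph path and output node names are placeholders.
import tensorflow as tf
from tensorflow.python.compiler.tensorrt import trt_convert as trt

with tf.gfile.GFile("frozen_graph.pb", "rb") as f:
    frozen_graph = tf.GraphDef()
    frozen_graph.ParseFromString(f.read())

converter = trt.TrtGraphConverter(
    input_graph_def=frozen_graph,
    nodes_blacklist=["logits"],     # output node names (placeholder)
    is_dynamic_op=True,             # input shape is undefined, engines built at runtime
    precision_mode="FP16")
trt_graph = converter.convert()     # optimized GraphDef with TRTEngineOp nodes
```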

Hi

Yes, you can generate a serialized engine file for a specified input shape by using static mode.
The argument “maximum_cached_engines” can be used to control how many engines are cached at a time.

For more info on static and dynamic mode, please refer to the link below:
https://docs.nvidia.com/deeplearning/frameworks/tf-trt-user-guide/index.html#static-dynamic-mode

Sample Example:
https://github.com/tensorflow/tensorrt/blob/master/tftrt/examples/image-classification/image_classification.py

Please refer to the link below for generating a standalone plan file and serializing the model:
https://docs.nvidia.com/deeplearning/frameworks/tf-trt-user-guide/index.html#tensorrt-plan
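
As a rough sketch of that last step (assuming a static-mode conversion has already produced a trt_graph GraphDef; the plan file names below are placeholders):

```python
# Sketch: extract the serialized engines from a static-mode TF-TRT graph.
# Assumes `trt_graph` is the GraphDef returned by converter.convert()
# with is_dynamic_op=False; output file names are placeholders.
for node in trt_graph.node:
    if node.op == "TRTEngineOp":
        # The engine is stored in the node's "serialized_segment" attribute.
        plan_path = "%s.plan" % node.name.replace("/", "_")
        with open(plan_path, "wb") as f:
            f.write(node.attr["serialized_segment"].s)
```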

Thanks

Thanks! I was able to save the optimized engine.

One more thing that is a little confusing to me in the documentation. This section (https://docs.nvidia.com/deeplearning/frameworks/tf-trt-user-guide/index.html#cache-var) talks about reusing an engine when the input batch is smaller than the engine batch size.

In this case, do I have to pad my input batch to match the engine batch size? I tried feeding the graph with smaller-batch inputs, but it raised a ValueError saying that the input shape does not match.

Sorry, my bad, it was just a typo in my testing code. There is no need to pad the input when the input batch size <= engine batch size. When the input has a larger batch size, TRT falls back to the original TF graph.

Hi,

Two possible cases:

  • is_dynamic_op is true
        In this case the batch size is determined by the input shapes during execution.
        An engine can be reused for a new input if the input batch size is less than or equal to the engine batch size and the other, non-batch dimensions match the new input.
        Otherwise, depending on the maximum_cached_engines value, either an additional new engine will be created or an existing engine will be recreated for the new input.
  • is_dynamic_op is false
        In this case the “max_batch_size” parameter defines the maximum allowed input batch size. By default, max_batch_size is 1.
        If the input batch size is less than or equal to max_batch_size and the other, non-batch dimensions match the new input, the generated engine can be reused.

In your case, could you please check the following:
    - The other, non-batch input dimensions match for the new input.
    - max_batch_size is greater than or equal to the input batch size (i.e. the engine batch size is greater than or equal to the input batch size).
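
To make the two cases concrete, a hypothetical conversion call could look like the following; the parameter values and node names are illustrative only:

```python
# Hypothetical sketch of both modes; frozen_graph and node names are placeholders.
from tensorflow.python.compiler.tensorrt import trt_convert as trt

# Static mode: engines are built at conversion time.
# max_batch_size=8 means the engine can be reused for any input whose batch
# size is <= 8, provided the non-batch dimensions match.
static_converter = trt.TrtGraphConverter(
    input_graph_def=frozen_graph,
    nodes_blacklist=["logits"],
    is_dynamic_op=False,
    max_batch_size=8)

# Dynamic mode: engines are built at runtime from the observed input shapes.
# maximum_cached_engines=3 lets TF-TRT keep up to three engines per op, so a
# new batch size can get its own engine instead of evicting the existing one.
dynamic_converter = trt.TrtGraphConverter(
    input_graph_def=frozen_graph,
    nodes_blacklist=["logits"],
    is_dynamic_op=True,
    maximum_cached_engines=3)
```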
    
Thanks

Thanks for your help!

Hi,

Related to this problem, I’m currently trying to get the following workflow working:

  1. Train a model in TensorFlow 2.0
  2. Convert a SavedModel to a serialized TensorRT engine with TF-TRT
  3. Deserialize the engine with the TensorRT API

So far, I’m stuck at step 2.

Since the training is done with TensorFlow 2.0, TrtGraphConverterV2 [1] should be used.

However, unlike TrtGraphConverter or create_inference_graph (for SavedModels trained with TF1), TrtGraphConverterV2 returns a function and not the graph itself, so serializing the model as per [2] does not work.

So, my question is then: What is the intended workflow for serializing graphs converted with TrtGraphConverterV2 into TensorRT engines that can be deserialized with the TensorRT API?

[1]: https://github.com/tensorflow/tensorflow/blob/3bdf127fbb0ee3345099393bb957f3dc12c2ea88/tensorflow/python/compiler/tensorrt/trt_convert.py#L781
[2]: https://docs.nvidia.com/deeplearning/frameworks/tf-trt-user-guide/index.html#tensorrt-plan
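
For context, a minimal sketch of the conversion path I mean (directory names are placeholders); convert() returns a function, and save() writes out a SavedModel rather than a standalone plan file:

```python
# Minimal TF2 sketch; directory names are placeholders.
from tensorflow.python.compiler.tensorrt import trt_convert as trt

params = trt.DEFAULT_TRT_CONVERSION_PARAMS._replace(precision_mode="FP16")
converter = trt.TrtGraphConverterV2(
    input_saved_model_dir="saved_model_dir",
    conversion_params=params)
func = converter.convert()                # returns a function, not a GraphDef
converter.save("tftrt_saved_model_dir")   # writes a SavedModel, not a .plan file
```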


+1

+1

+1

Is there already a fix for this? I’m struggling with the same problem.
The TF-TRT model is faster than the simply converted TRT model, but I can’t afford the RAM needed to load TF, so building the ICudaEngine from the TF-TRT model would be great.
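
For reference, once a standalone plan file is available, step 3 of the workflow above is straightforward with the TensorRT Python API; a minimal sketch, assuming the engine has already been serialized to a placeholder file my_engine.plan:

```python
# Sketch: deserialize a serialized engine (plan file) with the TensorRT API.
# "my_engine.plan" is a placeholder path.
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
with open("my_engine.plan", "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())  # ICudaEngine
context = engine.create_execution_context()             # ready for inference
```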