Lower performance with TRT than plain TF?

Description

When running a Keras RetinaNet model using TensorFlow on a Jetson Xavier NX, I get just under 1 fps.

After optimising the model with TensorRT, I get roughly 1/4 fps.

I expect higher throughput with TensorRT optimisation than without. Optimising and running on EC2 instances (G4DN and P3) shows slightly higher throughput (but still not fast enough for real-time video use).

Issues like this have previously been directed back at the repo maintainers (thread) or at C++-centric documentation (thread), but debugging these issues seems to be a generic problem, and the short section on Python doesn’t explain how to analyse the performance of TRT models invoked through Python.

Environment

TensorRT Version: 7.1.3
GPU Type: NVidia Jetson Xavier NX
Nvidia Driver Version: ???
CUDA Version: 10.2.89
CUDNN Version: 8.0.0.180
Operating System + Version: L4T from Jetpack 4.4.1
Python Version (if applicable): 3.6.9
TensorFlow Version (if applicable): 2.3.1+nv20.11
PyTorch Version (if applicable): n/a
Baremetal or Container (if container which image + tag): Baremetal (Jetson)

Relevant Files

Retinanet implementation - fizyr/keras-retinanet: Keras implementation of RetinaNet object detection. (github.com)

Steps To Reproduce

  • Clone Retinanet implementation and install dependencies
  • Train a model and save to disk
  • Convert model to TRT FP16 using tf.experimental.tensorrt.Converter and save to disk (the conversion call and the timing loop are sketched after this list)
    • Some ops are reported as not supported for conversion, but presumably these should run no slower than in pure TensorFlow
  • Build a collection of images
  • Load Retinanet model as “model” using Retinanet functions
  • For each image, record time.time(), predict on the image using model(np.expand_dims(image, axis=0)), record the new time.time() and subtract the two to get elapsed time (1/fps)
  • Load optimised model as “model” using tensorflow.python.keras.saving.save.import_model
  • For each image, record time.time(), predict on the image using model(np.expand_dims(image, axis=0)), record the new time.time() and subtract the two to get elapsed time (1/fps)
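
Roughly, the conversion and the timing loop look like the sketch below (paths, the images collection, and parameter values are placeholders rather than my exact setup):

    import time
    import numpy as np
    import tensorflow as tf

    # Conversion step (placeholder paths; FP16 as described above)
    params = tf.experimental.tensorrt.ConversionParams(precision_mode='FP16')
    converter = tf.experimental.tensorrt.Converter(
        input_saved_model_dir='retinanet_saved_model',
        conversion_params=params)
    converter.convert()
    converter.save('retinanet_trt_fp16')

    # Timing loop, run against whichever model object has been loaded
    def measure_fps(model, images):
        """Return per-image fps for a callable model and an iterable of HWC image arrays."""
        per_image_fps = []
        for image in images:
            start = time.time()
            _ = model(np.expand_dims(image, axis=0))  # add the batch dimension
            per_image_fps.append(1.0 / (time.time() - start))
        return per_image_fps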

Expected result:

  • TensorRT model is faster (probably at least 50% faster for inference)
  • There are clear, documented tools and instructions for debugging the performance from Python, so that we can understand why the model isn’t performing well and whether it is due to something simple, such as the CUDA libraries not being used (a quick sanity check for that case is sketched at the end of this post), or some other cause

Actual result:

  • TensorRT model is four times slower
  • There is no obvious way to debug TensorRT within Python
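
For reference, the quick sanity check mentioned under “Expected result” is nothing TensorRT-specific, just something along these lines:

    import tensorflow as tf

    # Confirm the TensorFlow build can see CUDA and the GPU at all
    print('Built with CUDA:', tf.test.is_built_with_cuda())
    print('Visible GPUs:', tf.config.list_physical_devices('GPU'))

    # Optionally log which device each op is placed on during inference
    tf.debugging.set_log_device_placement(True)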

Hi @sjbertram,
This issue looks like a Jetson NX issue, hence moving it to the respective forum.

Thanks!

Hi,

I assume you are using TF-TRT. Please correct me if this is not true.

Please note that TensorRT does not support all TensorFlow operations.
It’s possible that the data transfer overhead between the two frameworks outweighs the acceleration from TensorRT.

Please check how many layers are using the TensorRT implementation first.
You can find this information in the TensorFlow output log or in the document below:

Thanks.

I’m using Python and have tried tf.experimental.tensorrt.Converter and tf.python.compiler.tensorrt.trt_convert.TrtGraphConverterV2, and I don’t call anything with trtexec or other command-line tools, so I think that’s TF-TRT. But some of the documentation about how TF-TRT got merged into core TF doesn’t make that clear.
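
For completeness, the lower-level call I tried looks roughly like this (as far as I can tell, tf.experimental.tensorrt.Converter is just the public alias for the same TrtGraphConverterV2 class, so both paths should behave identically; paths are placeholders):

    from tensorflow.python.compiler.tensorrt import trt_convert as trt

    params = trt.DEFAULT_TRT_CONVERSION_PARAMS._replace(precision_mode='FP16')
    converter = trt.TrtGraphConverterV2(
        input_saved_model_dir='retinanet_saved_model',
        conversion_params=params)
    converter.convert()
    converter.save('retinanet_trt_fp16')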

When converting the model I get:

There are 70 ops of 9 different types in the graph that are not converted to TensorRT: Identity, ResizeNearestNeighbor, Reshape, Placeholder, NoOp, Pack, Shape, DataFormatVecPermute, StridedSlice, (For more information see https://docs.nvidia.com/deeplearning/frameworks/tf-trt-user-guide/index.html#supported-ops).

(Only logged at INFO level, so I can only see it if I also see lots of other TensorFlow logging)
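
For reference, surfacing that message is just a matter of raising the Python-side TensorFlow logger to INFO before running the conversion:

    import logging
    import tensorflow as tf

    # Show INFO-level messages, including the TF-TRT conversion summary
    tf.get_logger().setLevel(logging.INFO)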

One of the other models that I’m working with lists more ops but still gains performance.

By going to the “debug” section of that link (which I’ve read before but not fully understood), I got to Migrating tf.summary usage to TF 2.x | TensorBoard | TensorFlow (the link in 4.1.2.1 redirects). From there I found tf.summary.trace_on(), which is used in a TensorBoard tutorial that has a section on profiling functions without modifying them.
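
The pattern I followed is roughly the one from that tutorial (the model, image and log directory below are placeholders, and the model call is wrapped in a tf.function so the trace has a graph to capture):

    import numpy as np
    import tensorflow as tf

    writer = tf.summary.create_file_writer('logs/trt_trace')

    @tf.function
    def traced_inference(x):
        return model(x)

    tf.summary.trace_on(graph=True, profiler=True)
    _ = traced_inference(np.expand_dims(image, axis=0))  # one traced inference call
    with writer.as_default():
        tf.summary.trace_export(name='retinanet_inference', step=0,
                                profiler_outdir='logs/trt_trace')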

I’ve now got TensorBoard rendering the graph, but I can’t tell which bits are using TF-TRT and which bits are standard TF. The “main graph” is small but the “Functions” are a huge long list. All but one node (StatefulPartitionedCall) are shown in green if I use “TPU compatibility” for colouring.

Screenshot from 2020-12-09 13-04-37

Data transfer delays would make sense. I’d assumed that it would be no slower because unconverted parts would run at normal speed, but I hadn’t taken the memory copies into account.

I’ll try some different options for minimum segment sizes to see if that improves performance by making fewer transfers. But it would be useful to understand what is being run in TF-TRT and what’s just in TF so that I could have an idea of the effect that the settings are having.
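
The knob I’ll be varying is the minimum_segment_size conversion parameter; a minimal sketch (paths and the value 10 are placeholders, not tested settings):

    import tensorflow as tf

    # A larger minimum_segment_size should mean fewer, larger TensorRT engines,
    # and therefore fewer TF<->TRT hand-offs (the default is 3)
    params = tf.experimental.tensorrt.ConversionParams(
        precision_mode='FP16',
        minimum_segment_size=10)
    converter = tf.experimental.tensorrt.Converter(
        input_saved_model_dir='retinanet_saved_model',
        conversion_params=params)
    converter.convert()
    converter.save('retinanet_trt_fp16_minseg10')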

Hi,

To get more details about the placement, please check the instructions listed in the document below first:

Based on that document, you can also get the placement from TensorBoard directly.
Thanks.

Thanks for the pointer. I’ve already been working through that document, but as you can see from my previous post, my output doesn’t look like what is shown in 4.1.2.1.

You’ve specifically linked to section 4.3, which says that TRTEngineOp functions store the original native segment. But I’ve not been able to identify the optimised segments on TensorBoard! I’ve used some code like section 4.4 (but updated for TF2) to identify the node types and I can see that there are TRTEngineOps in there somewhere, but I’ve not found a way to understand where they are and what they’re optimising.

Hi,

Have you tried code similar to the below?

for node in frozen_graph.node:
    print(node.op)

Based on the document, all the output nodes run with TensorFlow except for TRTEngineOp_0, which stands for the TensorRT engine.

With this information, you can compare against the model without TensorRT acceleration to get the precise placement.

Thanks.

I must have mentioned it in my other thread, but I’ve already looked at that output. It wasn’t particularly helpful for understanding what’s going on and why it is slow, though.

I tried running that process on two graphs (one optimised and one not optimised) and comparing the output with diff. I don’t know whether the order is supposed to be consistent, but the diff showed lots of changes, and not just “these 12 lines have been replaced with TRTEngineOp_0” as I’d expect. This probably isn’t helped by the fact that Retinanet is a non-trivial model and so has lots of nodes in the network.

Also, it isn’t clear to me whether TRTEngineOp_0 blocks mean that the operation is fully optimised and taking advantage of the GPU and Tensor cores, or whether it just means that TensorRT swapped it out for its own code (which could potentially just be CUDA calculations that should be faster than Python but still not at peak optimisation).

I’ve now managed to get a different run into a different installation of TensorBoard (one on a laptop rather than the Jetson, to hopefully improve the performance of the browser rendering).

I get the following “profile” visualisation which is… not great. Am I reading it right that it’s doing a lot of Python but not much in optimised compute steps? I had read an article somewhere that said what “good” looked like, but I can’t find it at the moment.

Plotting the graph and colouring by TPU compatibility gives me a score of 99%, which suggests that most of the calls should be optimised and taking advantage of the hardware. But I’m still getting the less useful graph that I showed above with lots of small “Function” blocks listed beside it. The node search doesn’t appear to be finding any TRTEngine nodes.

Two extra points:

  1. Could this be relevant? Could it be that this particular model is running slower because TRT has ended up optimising for efficiency rather than performance? And if so, how do we tell?

When DLA is enabled on NX, the speed is slower - Jetson & Embedded Systems / Jetson Xavier NX - NVIDIA Developer Forums

  2. If I do a search in TensorBoard for nodes matching TRT (or even .*TRT.*) then nothing is listed. But if I print the node.op values then I can see some (as well as LOTS of other changes to the graph). Is that expected?

Thanks.

Hi,

1. No. TensorRT targets performance; the compiler will pick the faster implementation.
The topic you shared is about DLA, and DLA is not supported in TF-TRT.

2. If it is not easy to compare the difference between TF-only and TF-TRT directly,
you can count the number of operations to get a rough idea of the ratio of optimized nodes.

Please note that if inference frequently switches between TensorFlow and TensorRT,
the overhead may outweigh the gain from the TensorRT optimization.

Thanks.

Okay, thanks for clarifying. I’ll continue to examine the models and check out the optimisation ratio.

Is it normal to have more steps in the optimised model? I’m currently seeing a 30-50% increase in the number of ops, but with only around 10% of that being nodes that include “TRT” in the name.

And are the TRTEngineOp calls still in the node ops in TF2? If I run the following code then I find that the ops include CreateTRTResourceHandle but the TRTEngineOp is a function and isn’t suffixed with a number.

    from tensorflow.core.protobuf.saved_model_pb2 import SavedModel
    include_functions = True

    saved_model = SavedModel()
    with open("path/to/input_model", 'rb') as f:
        saved_model.ParseFromString(f.read())
    model_op_names = list()
    # Iterate over every metagraph in case there is more than one
    for meta_graph in saved_model.meta_graphs:
        # Add operations in the graph definition
        model_op_names.extend(node.op for node in meta_graph.graph_def.node)

        if include_functions:
            # Go through the functions in the graph definition
            for func in meta_graph.graph_def.library.function:
                # Add operations in each function
                model_op_names.extend(node.op for node in func.node_def)

(This is based on some code that I found that supports TF2, since “frozen graphs” don’t appear to be a thing in the same way that they used to be in TF1)
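
For what it’s worth, I’m summarising the collected ops with a Counter to get the ratios mentioned above (this just continues from the model_op_names list in the snippet):

    from collections import Counter

    op_counts = Counter(model_op_names)
    trt_ops = {op: n for op, n in op_counts.items() if 'TRT' in op}
    print('Total ops:', sum(op_counts.values()))
    print('TRT-related ops:', trt_ops)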

Thanks.

Hi,

1.
Not sure what kind of operation is added to the model, but it is possible.
When TensorRT is enabled within the TensorFlow graph, some bridging operations are required,
e.g. data type conversion, reshape, transpose, etc.

Due to this, the overhead will be high if the graph frequently switches between TensorFlow and TensorRT.

2.
The TF-TRT API for v2.0 is different from v1.x. Please see the following document for details:

Thanks.