TensorRT INT8 inference is slower than FP16 in models with conditional flow

I’m trying to implement BranchyNet on some models and testing with the CIFAR-10 dataset on the Jetson Orin Nano 8GB. Basically, I split the model into a first subgraph (common) that is executed eagerly, and at a certain point I introduce a conditional to check whether the result is good enough, in which case the model finishes early (branch1), thus saving time. If it doesn’t meet the condition, the output of the common subgraph goes through the rest of the model (branch2). I’ve been creating the models in TFLite like so:

import tensorflow as tf
from tensorflow.keras import Model

model = tf.keras.models.load_model(r"resnet8.h5")
model.trainable = False

# Common path
common = Model(inputs=model.input, outputs=model.layers[18].output)

# Conditional branches
branch1 = Model(inputs=model.layers[18].output, outputs=model.layers[-2].output)
branch2 = Model(inputs=model.layers[18].output, outputs=model.layers[-1].output)


# Custom layer to choose between branches
class ChooseBranchLayer(tf.keras.layers.Layer):
    def __init__(self):
        super(ChooseBranchLayer, self).__init__()
        self.branch1 = branch1
        self.branch2 = branch2

    def call(self, inputs):
        common_output = inputs
        output1 = self.branch1(common_output)
        condition = tf.reduce_max(output1) > 0.90
        return tf.cond(condition, lambda: output1, lambda: self.branch2(common_output))


# Input layer
inputs = tf.keras.layers.Input(shape=(32, 32, 3))

# Common output
common_output = common(inputs)

# Use the custom layer to choose output based on condition
final_output = ChooseBranchLayer()(common_output)

model_EE = tf.keras.Model(inputs=inputs, outputs=final_output)

spec = (tf.TensorSpec((None, 32, 32, 3), tf.float32, name="input"),)

converter = tf.lite.TFLiteConverter.from_keras_model(model_EE)
tflite_model = converter.convert()

interpreter = tf.lite.Interpreter(model_content=tflite_model)
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()


with open(r"EE_resnet8.tflite", 'wb') as f:
    f.write(tflite_model)
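
The decision that ChooseBranchLayer encodes can be sketched in plain Python. This is only an illustration of the control flow, not the TensorFlow graph itself; the three head functions are hypothetical stand-ins. One detail worth keeping in mind: tf.reduce_max without an axis reduces over every element, so with a batch larger than 1 the whole batch takes the same branch.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def early_exit(features: Sequence[float],
               branch1: Callable[[Sequence[float]], Vector],
               branch2: Callable[[Sequence[float]], Vector],
               threshold: float = 0.90) -> Vector:
    """Return branch1's scores if its top score clears the threshold, else branch2's."""
    scores = branch1(features)
    if max(scores) > threshold:
        return scores          # early exit: branch2 is never evaluated
    return branch2(features)   # fall through to the rest of the model


# Hypothetical stand-ins for the two heads, for illustration only.
def confident_head(_features):
    return [0.95, 0.05]


def unsure_head(_features):
    return [0.55, 0.45]


def full_head(_features):
    return [0.20, 0.80]


print(early_exit([1.0], confident_head, full_head))  # takes the early exit
print(early_exit([1.0], unsure_head, full_head))     # falls through to full_head
```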

Then I convert the model to ONNX with python -m tf2onnx.convert --tflite EE_resnet8.tflite --output EE_resnet8.onnx --opset 17. For testing, I need to know which branch the model has taken, so I tried returning two values, the output and a flag that signals which branch was taken: return tf.cond(condition, lambda: [output1,0], lambda: [self.branch2(common_output),1]). When I build the engine with that model, I get the following error:

[E] [TRT] ModelImporter.cpp:771: --- End node ---
[E] [TRT] ModelImporter.cpp:773: ERROR: ModelImporter.cpp:222 In function parseGraph:
[5] Assertion failed: (node.output().size() <= static_cast<int32_t>(outputs.size())) && "Node has more output tensors than TRT expected."
[E] Failed to parse onnx file

ONNX Runtime, however, runs fine and gives the same results as TFLite. I’ve read in the developer guide that ONNX is more flexible with the outputs of the conditional construct, but I don’t see what I am doing wrong here. A workaround I found is to concatenate the flag to the output tensor, so instead of having one vector of 10 elements with the predicted classes and one constant indicating the executed branch, I have one output of 11 elements where the last element is the flag:

class ChooseBranchLayer(tf.keras.layers.Layer):
    def __init__(self):
        super(ChooseBranchLayer, self).__init__()
        self.branch1 = branch1
        self.branch2 = branch2

    def call(self, inputs):
        common_output = inputs
        output1 = self.branch1(common_output)

        flag0 = tf.broadcast_to(tf.constant([[0.0]]), [tf.shape(output1)[0], 1])
        flag1 = tf.broadcast_to(tf.constant([[1.0]]), [tf.shape(output1)[0], 1])

        condition = tf.reduce_max(output1) > 0.9
        return tf.cond(condition, lambda: tf.concat([output1, flag0], axis=-1), lambda: tf.concat([self.branch2(common_output), flag1], axis=-1))
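
On the consumer side, the 11-element output has to be split back into class scores and the branch flag. A minimal sketch of that step in plain Python follows; split_output and NUM_CLASSES are hypothetical names, and the example row is made up rather than taken from a real run.

```python
from typing import List, Sequence, Tuple

NUM_CLASSES = 10  # CIFAR-10, as in the post


def split_output(row: Sequence[float]) -> Tuple[List[float], int]:
    """Split one 11-element output row into (class scores, branch index)."""
    if len(row) != NUM_CLASSES + 1:
        raise ValueError(f"expected {NUM_CLASSES + 1} values, got {len(row)}")
    scores = list(row[:NUM_CLASSES])
    # The flag travels as a float; round before casting in case
    # reduced precision perturbs it slightly.
    branch = int(round(row[NUM_CLASSES]))
    return scores, branch


# Hypothetical row: class 9 is confident, flag 0.0 means branch1 was taken.
example = [0.0] * 9 + [0.97] + [0.0]
scores, branch = split_output(example)
predicted = scores.index(max(scores))
print(predicted, branch)
```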

This works and produces the expected results. However, I have some problems. First, I am measuring latency, and I use CUDA graphs because the original Resnet8 has an average execution time of 0.15 ms, which is clearly an enqueue-bound workload. However, I’ve seen that it is not possible to use them here because of the conditional flow in the model. I’ve read in the blog post "Dynamic Control Flow in CUDA Graphs with Conditional Nodes" that "Beginning in CUDA 12.4, CUDA Graphs supports conditional nodes, which enable the conditional or repeated execution of portions of a graph without returning control to the CPU", so I guess that until CUDA 12.4 is available in JetPack we are out of luck.

When I measure the latencies of this BranchyNet model, it is much slower than the original model, which I assume is due to the lack of CUDA graphs. However, I don’t understand why this implementation does not scale well with model quantization. For some reason, INT8 is noticeably slower than FP16, whereas in the original model the latency is FP32 > FP16 > INT8, as expected. I’ve tested this for Resnet8, Resnet56 and Alexnet, and all of them show this problem. I have no idea why this is happening, and I would like to know whether it has to do with a poor implementation of conditional flow on my part. I used Nsight Systems to check whether the BranchyNet models were failing to use tensor cores, but they seem to be active. I have also tried changing the threshold in the conditional so that it always takes the same branch, and I get the same latency for both; sometimes the second branch (which is longer) even beats the first.

I wanted to know the cause of this unexpected behavior and possible ways to address it. Thank you in advance for your attention.

Some details of my set-up:

Jetson Orin Nano 8GB
Jetpack 6.0
TensorRT version: 8.6.2

Update:

I’ve tested the same model on a system with an A100 and the TensorRT Docker images and got some weird results.

With the docker image nvcr.io/nvidia/tensorrt:24.04-py3, which has TensorRT 8.6.3, I get slower results than with the nvcr.io/nvidia/tensorrt:24.05-py3, which has TensorRT 10.0.1.

Moreover, I was expecting to be able to use CUDA graphs, given that both versions appear to have CUDA 12.4. However, trtexec throws the following:

[I] Capturing CUDA graph for the current execution context
[05/30/2024-08:36:41] [E] Error[3]: [runner.cpp::execute::768] Error Code 3: API Usage Error (Parameter check failed at: runtime/myelin/runner.cpp::execute::768, condition: !isCapturing || isCapturable The CUDA stream is in capturing mode, but this TRT engine is not stream capturable!)
[05/30/2024-08:36:41] [W] The CUDA graph capture on the stream has failed.
[05/30/2024-08:36:41] [W] The built TensorRT engine contains operations that are not permitted under CUDA graph capture mode.
[05/30/2024-08:36:41] [W] The specified --useCudaGraph flag has been ignored. The inference will be launched without using CUDA graph launch.

Hi,

How do you check the performance between models?

Have you tried trtexec?
The binary tool has an option to enable CUDA graphs, so you can check the performance issue with CUDA graphs enabled directly.

Also, is the condition check you mentioned here a TensorRT layer, a TensorRT plugin, or CUDA code?
Thanks.

Hi @AastaLLL ,

Thank you very much for your response.

The latencies I was measuring were done using the C++ API. I’ve tested using trtexec, and in that case, I do get the expected behaviour. For example, on the Jetson Orin Nano 8GB (TRT 8.6.2), I get (running sudo nvpmodel -m 0 && sudo jetson_clocks):

                  fp16      int8      best
Branch Resnet8    0.302238  0.289783  0.337576
Resnet8           0.110275  0.10108   0.102951
Branch Resnet56   1.26184   1.10442   1.12782
Resnet56          0.694252  0.518942  0.505991

For this, I’ve used the command: trtexec --builderOptimizationLevel=99 --useCudaGraph --warmUp=500 --avgRuns=10000 --iterations=1 --onnx=model.onnx --useSpinWait --fp16. Running this same command on the system with the A100 and TRT 10.0.1, I get

                  fp16       int8       best
Branch Resnet8    0.174846   0.172852   0.175026
Resnet8           0.0790818  0.0687353  0.0697222
Branch Resnet56   0.616997   0.616511   0.617289
Resnet56          0.325581   0.268751   0.282244

However, on both systems, for the smallest BranchyNet model (from Resnet8), I get the warning “[W] * GPU compute time is unstable, with coefficient of variance = 3.26966%.” On the A100 it goes up to 29% for BranchyNet Resnet8 and 12.5% for BranchyNet Resnet56. With this much variance, the measurements for two precision modes may overlap. On both devices, I also got the error Error[3]: [runner.cpp::execute::768] Error Code 3: API Usage Error (Parameter check failed at: runtime/myelin/runner.cpp::execute::768, condition: !isCapturing || isCapturable The CUDA stream is in capturing mode, but this TRT engine is not stream capturable!). INT8 is indeed faster than FP16, but for some reason --best is slower than --int8. Also, I don’t understand why my implementation is consistently slower than trtexec and does not show the same improvement. The timings I got with my implementation are:

                  fp32      fp16      int8
Branch Resnet8    0.407942  0.396847  0.744405
Resnet8           0.212785  0.183042  0.17125
Branch Resnet56   1.71137   1.36377   1.54847
Resnet56          1.10207   0.762705  0.640057

(Note that in this case, I’ve measured fp32, fp16 and int8. I tried adding fp16 to the int8 calibrator, and indeed, it performs worse than just int8.) So now I think this points to a bad implementation on my part. I don’t know if I should open a new thread for this or share my code in this one.
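
To judge whether two precision modes really differ when trtexec reports unstable timings, one option is to look at per-run samples instead of a single average. The sketch below uses the usual definition of the coefficient of variation (standard deviation over mean); trtexec's exact formula is not shown in this thread, and the sample latencies are hypothetical values, not measurements from these runs. The separation check is a deliberately crude heuristic.

```python
import statistics
from typing import Sequence


def coefficient_of_variation(samples: Sequence[float]) -> float:
    """Sample standard deviation divided by the mean, as a percentage."""
    return 100.0 * statistics.stdev(samples) / statistics.mean(samples)


def means_are_separated(a: Sequence[float], b: Sequence[float]) -> bool:
    """Crude check: is the gap between means larger than the sum of the spreads?"""
    gap = abs(statistics.mean(a) - statistics.mean(b))
    return gap > statistics.stdev(a) + statistics.stdev(b)


# Hypothetical per-run latencies in ms, not measurements from this thread.
fp16_runs = [0.30, 0.31, 0.29, 0.33, 0.28]
int8_runs = [0.29, 0.32, 0.27, 0.30, 0.31]

print(f"fp16 CoV: {coefficient_of_variation(fp16_runs):.2f}%")
print(f"int8 CoV: {coefficient_of_variation(int8_runs):.2f}%")
print("separated:", means_are_separated(fp16_runs, int8_runs))
```

If the two sets of runs are not separated, a difference of a few percent between the fp16 and int8 averages says little about which mode is actually faster.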

Anyway, coming back to your question of whether the condition check is a TensorRT layer, a TensorRT plugin, or CUDA code: I have no idea. I’m building the model in TFLite and then converting it to ONNX. From there, I build an engine plan to run on TensorRT, and I don’t know how the conditional is converted in that case. I have tried using TensorRT Engine Explorer to see if it can clarify this a bit, and here are the results for the BranchyNet and the original Resnet8:

engine_inspect.zip (9.8 MB)

I’m not sure how to interpret those results, especially because for the BranchyNet model I got the message “Partial profiling data: The number of layers in the engine graph (26) does not match the number of layers (24) in the performance JSON. This can happen if you’re not using the first shape-profile.”, which later caused some cells to throw errors.

Thank you again for your help. Regards.

I have found a solution to my problem. Instead of creating the conditional model in TensorFlow, I built it in PyTorch from the ONNX subgraphs. To build the model, I did the following:

import tensorflow as tf
from tensorflow.keras import Model
import tf2onnx
import onnx

model = tf.keras.models.load_model(r"resnet8.h5")
model.trainable = False

# Define common model
common = Model(inputs=model.input, outputs=model.layers[18].output)

# Define branch models
branch1 = Model(inputs=model.layers[18].output, outputs=model.layers[-2].output)
branch2 = Model(inputs=model.layers[18].output, outputs=model.layers[-1].output)

# convert from TF to onnx
common_model, _ = tf2onnx.convert.from_keras(common, opset=17)
branch1_model, _ = tf2onnx.convert.from_keras(branch1, opset=17)
branch2_model, _ = tf2onnx.convert.from_keras(branch2, opset=17)

# -----  Build the model using PyTorch ----
import torch
import torch.nn as nn
import torch.onnx

from onnx2pytorch import ConvertModel

class ConditionalModel(nn.Module):
    def __init__(self, common_model, branch1_model, branch2_model):
        super().__init__()

        # Integrate ONNX models to get the subgraphs
        self.common = ConvertModel(common_model)
        self.branch1 = ConvertModel(branch1_model)
        self.branch2 = ConvertModel(branch2_model)

    def forward(self, input_tensor):
        output = self.common(input_tensor)
        branch1_output = self.branch1(output)

        # Find the maximum value in branch1_output
        max_value = torch.max(branch1_output)

        if max_value.item() > 0.9:
            return branch1_output
        else:
            return self.branch2(output)

model = ConditionalModel(common_model, branch1_model, branch2_model)

dummy_input = torch.randn(1, 32, 32, 3)  # Batch size of 1, 32x32 image, 3 channels (NHWC, as in the Keras input)
torch.onnx.export(model, dummy_input, "EE_resnet8.onnx", input_names=["input"], output_names=["output"])

To include the branch indicator, I did the same as in TensorFlow: concatenating the flag to the end of the output tensor:

if max_value.item() > 0.9:
    return torch.cat((branch1_output, torch.tensor([[0.0]])), dim=1)
else:
    return torch.cat((self.branch2(output), torch.tensor([[1.0]])), dim=1)

With this implementation, the model behaves as expected: it outperforms the original model, latency improves with quantization, and better yet, it can use CUDA Graphs. So my conclusion is that the difference comes from the ONNX graph produced via TFLite compared with the one produced via PyTorch.
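
A quick way to compare the two ONNX files is to count the operator types in each graph, in particular to confirm that the conditional survived export as an ONNX If node. This matters because, as far as I know, the TorchScript-based torch.onnx.export traces an nn.Module, so a Python if on .item() may be recorded as a single fixed path depending on the exporter; that is an assumption worth checking, not a finding from this thread. The sketch below works on anything shaped like an ONNX ModelProto (model.graph.node[i].op_type); the fake_model helper and its graphs are stand-ins for illustration.

```python
from collections import Counter
from types import SimpleNamespace


def op_histogram(model) -> Counter:
    """Count top-level operator types in an ONNX-like model."""
    return Counter(node.op_type for node in model.graph.node)


def has_conditional(model) -> bool:
    """True if the top-level graph contains an ONNX `If` node."""
    return op_histogram(model)["If"] > 0


# Stand-in graphs for illustration only. With the onnx package installed,
# pass onnx.load("EE_resnet8.onnx") instead of a fake model.
def fake_model(*op_types):
    nodes = [SimpleNamespace(op_type=t) for t in op_types]
    return SimpleNamespace(graph=SimpleNamespace(node=nodes))


with_if = fake_model("Conv", "Relu", "ReduceMax", "Greater", "If")
without_if = fake_model("Conv", "Relu", "Gemm", "Softmax")

print(has_conditional(with_if), has_conditional(without_if))
```

If the PyTorch-exported file shows no If node, the engine contains only the path taken by the dummy input, which would also explain why CUDA graph capture succeeds.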

I still have one problem, though, which is that trtexec outperforms my C++ API implementation by a big margin. My program is between 66% and 68% slower than the trtexec benchmark on the same engine. Would it be possible to share my code to see if we can figure out the origin of that discrepancy?
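
For reference, the size of the gap can be computed from the tables posted earlier in the thread. The sketch below uses the TFLite-based engines, not the new PyTorch-based one, and assumes the C++ API timings were taken on the same Jetson as the first trtexec table, which the thread does not state explicitly.

```python
# Latencies copied from the trtexec (Jetson) table and the C++ API table
# earlier in the thread; units as reported there.
trtexec = {
    ("Resnet8", "fp16"): 0.110275,
    ("Resnet8", "int8"): 0.10108,
    ("Branch Resnet8", "fp16"): 0.302238,
    ("Branch Resnet8", "int8"): 0.289783,
}
cpp_api = {
    ("Resnet8", "fp16"): 0.183042,
    ("Resnet8", "int8"): 0.17125,
    ("Branch Resnet8", "fp16"): 0.396847,
    ("Branch Resnet8", "int8"): 0.744405,
}


def slowdown_percent(mine: float, reference: float) -> float:
    """How much slower `mine` is than `reference`, in percent."""
    return 100.0 * (mine / reference - 1.0)


for key in trtexec:
    model, precision = key
    pct = slowdown_percent(cpp_api[key], trtexec[key])
    print(f"{model:15s} {precision}: {pct:6.1f}% slower than trtexec")
```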

Once again, I truly appreciate your assistance in this matter.