Does TensorRT rewrite ONNX models to NHWC?

We are training our convolutional networks with TensorFlow 2.3 and exporting our models to ONNX using keras2onnx.
A visualization of the beginning of the onnx model can be seen below.
The input is in NHWC, but since onnx uses NCHW it adds a transpose layer before the convolutions.
I would expect that tensorrt removes this transpose layer and executes the convolutions with NHWC on GPUs.
However, when profiling with trtexec it shows a PushTranspose Layer (see below) that also consumes time.

Does this mean the convolutions are actually executed in NCHW? If not, how can I find out what is going on?
I am certain that the GPU is used since I saw activity with nvidia-smi.

Command for profiling

./trtexec --onnx=<model_path.onnx> --int8 --shapes=input_1:1x704x1280x3 --exportTimes=trace.json --dumpProfile --exportProfile=prof.json

Beginning of Profile from trtexec

  { "count" : 834 }
, { "name" : "(Unnamed Layer* 0) [Constant] + (Unnamed Layer* 1) [Shuffle] + Mul input reformatter 0", "timeMs" : 21.8493, "averageMs" : 0.0261982, "percentage" : 0.929405 }
, { "name" : "(Unnamed Layer* 0) [Constant] + (Unnamed Layer* 1) [Shuffle] + Mul", "timeMs" : 19.3699, "averageMs" : 0.0232253, "percentage" : 0.823939 }
, { "name" : "PushTranspose_1162", "timeMs" : 51.4201, "averageMs" : 0.0616548, "percentage" : 2.18726 }
, { "name" : "conv2d", "timeMs" : 34.2201, "averageMs" : 0.0410313, "percentage" : 1.45563 }
, { "name" : "leaky_re_lu", "timeMs" : 16.6442, "averageMs" : 0.0199571, "percentage" : 0.707997 }
, { "name" : "conv2d_1", "timeMs" : 28.3778, "averageMs" : 0.0340262, "percentage" : 1.20711 }
, { "name" : "leaky_re_lu_1", "timeMs" : 15.0495, "averageMs" : 0.018045, "percentage" : 0.640163 }
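To put numbers on the layout-related overhead, a profile like the one above can be summarized with a short script. This is only a minimal sketch: it assumes the file written by --exportProfile is a JSON list whose first entry is the { "count" : ... } record followed by per-layer entries, as in the excerpt. The function names summarize_profile and layout_overhead are hypothetical, and matching on "reformatter" or "Transpose" in the layer name is a heuristic, not something TensorRT guarantees.

```python
import json

def summarize_profile(entries, top=3):
    """Return (name, averageMs, percentage) for the slowest layers."""
    # Skip the leading {"count": N} record, which has no "name" key.
    layers = [e for e in entries if "name" in e]
    layers.sort(key=lambda e: e["averageMs"], reverse=True)
    return [(e["name"], e["averageMs"], e["percentage"]) for e in layers[:top]]

def layout_overhead(entries, keywords=("reformatter", "Transpose")):
    """Sum the percentage of layers whose name hints at a layout change."""
    return sum(
        e["percentage"]
        for e in entries
        if any(k in e.get("name", "") for k in keywords)
    )

# Sample entries copied from the excerpt above. For a real run, use:
#   with open("prof.json") as f: entries = json.load(f)
sample = json.loads("""[
  {"count": 834},
  {"name": "(Unnamed Layer* 0) [Constant] + (Unnamed Layer* 1) [Shuffle] + Mul input reformatter 0",
   "timeMs": 21.8493, "averageMs": 0.0261982, "percentage": 0.929405},
  {"name": "(Unnamed Layer* 0) [Constant] + (Unnamed Layer* 1) [Shuffle] + Mul",
   "timeMs": 19.3699, "averageMs": 0.0232253, "percentage": 0.823939},
  {"name": "PushTranspose_1162",
   "timeMs": 51.4201, "averageMs": 0.0616548, "percentage": 2.18726},
  {"name": "conv2d",
   "timeMs": 34.2201, "averageMs": 0.0410313, "percentage": 1.45563}
]""")

for name, avg_ms, pct in summarize_profile(sample):
    print(f"{avg_ms:.4f} ms  {pct:.2f} %  {name}")
print(f"layout-related share: {layout_overhead(sample):.2f} %")
```

On these four entries alone, the transpose and the input reformatter together account for roughly 3 % of the profiled time, with PushTranspose_1162 being the slowest single layer in the excerpt.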

[Figure: start of the ONNX model, visualized with Netron]


TensorRT Version:
GPU Type: RTX 2080Ti
Nvidia Driver Version: 460
Operating System + Version: Ubuntu 18.04
Python Version (if applicable): 3.6
TensorFlow Version (if applicable): 2.3
Baremetal: Yes

Hi, could you please share the ONNX model and the script so that we can assist you better?

Meanwhile, you can try validating your model with the snippet below:

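import sys
import onnx

filename = sys.argv[1]  # path to your ONNX model
model = onnx.load(filename)
onnx.checker.check_model(model)  # raises an exception if the model is invalid

Alternatively, you can try running your model with the trtexec command.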


Hi, thanks for your reply. Is there a way to share the model privately? If so, I can provide you with an example ONNX model.

The model is valid. As I already described, I used trtexec to get the profile. I will share the model with randomly initialized weights; I just have to export it again. I will get back to you in a couple of hours.


Hi @jean.wanka,

Please send the model as an attachment in a DM.

Thank you.

Thanks, I shared the model in the DM.
If it is necessary I can also create a script that creates a reduced version of this model and uses keras2onnx to export it.

Is there any update on this?
The main point I’m trying to understand is what the engine Builder (IBuilder) does in detail and how it rewrites and optimizes the graph.

Is it able to:

  • remove layers like unnecessary transposes?
  • rewrite the graph from channel first to an equivalent channel last graph?
  • fuse layers like Convolution and BatchNorm? Where is it listed what is supported here?


Hi @jean.wanka,

ONNX GraphSurgeon can be used to modify the ONNX file manually.
In FP16, conv + LeakyReLU can be fused together. In INT8, there are some cases where we do not fuse conv and activation because of register pressure (the extra registers required would decrease occupancy).

Thank you.

Hi @spolisetty ,
thanks for the update!

One thing that is still not clear to me is the channels-first/channels-last question.
I’ve read in your documentation that channels last (NHWC) is preferred.
ONNX, however, only uses the channels-first layout. Does this mean the TensorRT engine also always uses the channels-first layout?
Is there a way to change this, or are the benefits not significant?

Any update on this would be much appreciated.


Hi @jean.wanka,

A TRT engine does not always use the channels-first layout.
The layout depends on the kernel implementation chosen for each layer. TRT always inserts a reformat layer when adjacent layers have mismatched kernel I/O formats.
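The rule can be illustrated with a toy model. This is only a sketch of the idea, not TensorRT's actual algorithm or API: each layer is reduced to a hypothetical (name, input format, output format) triple, the helper insert_reformats is made up for illustration, and the formats assigned to the layers below are assumptions rather than what the builder picked for this model.

```python
def insert_reformats(layers):
    """Build an execution plan, adding a reformat step wherever the output
    format of one layer differs from the input format of the next.

    layers: list of (name, input_format, output_format) tuples.
    """
    plan = []
    prev_out = None
    for name, fmt_in, fmt_out in layers:
        if prev_out is not None and prev_out != fmt_in:
            plan.append(f"reformat {prev_out}->{fmt_in}")
        plan.append(name)
        prev_out = fmt_out
    return plan

# Hypothetical kernel choices: the scaling layer runs in NCHW,
# while the convolution and activation kernels prefer NHWC.
layers = [
    ("scale", "NCHW", "NCHW"),
    ("conv2d", "NHWC", "NHWC"),
    ("leaky_re_lu", "NHWC", "NHWC"),
]
print(insert_reformats(layers))
# ['scale', 'reformat NCHW->NHWC', 'conv2d', 'leaky_re_lu']
```

In this picture, a reformat appears only at the boundary where formats differ. Layers that agree on a format run back to back, which is why the layout used inside the engine can differ from the NCHW layout of the ONNX graph.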

Thank you.