I’m evaluating TensorRT on a VGG-like model and my input is NCHW.
However, I noticed that TensorRT will transform my model to NHWC for faster inference.
Since my model comes from TensorFlow, can we directly use NHWC as the input so that we don’t need an input reformatter in TensorRT?
The input reformatter is very slow when the input is large:
conv1_1_input/Conv2D + (Unnamed Layer* 2) [Activation] input reformatter 0 0.55792
conv1_1_input/Conv2D + (Unnamed Layer* 2) [Activation] 0.98768
An NHWC tensor is faster than an NCHW tensor: performing a 32x32x3x3 convolution on a tensor of size 1x32x300x1680 gives:
NCHW + FP32: 3 ms on a 2070.
NHWC + FP32: 1.9 ms on a 2070.
Therefore, can we add NHWC support in TensorRT directly?
Maybe I’m missing something, but on that page I only see NHWC8. There’s also NHWC for plugins, but I didn’t see that we can directly pass an NHWC tensor to a convolution layer.
Can we go back to the original question about NHWC format support for the convolution layer, since it is faster on the latest GPUs?
In your blog post there’s nothing about the NHWC format.
In the link you provided, the input is set to NCHW as well: parser->registerInput("Input_0", DimsCHW(1, 28, 28), UffInputOrder::kNCHW);
When I look at the TensorFlow code,
it transposes the tensor from NHWC to NCHW in order to use IConvolutionLayer.
But I know that IConvolutionLayer tries to transpose it back in order to use Tensor Cores.
In that case, why not make TensorRT support NHWC for IConvolutionLayer?
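To make the cost concrete: the reformat the parser inserts is a pure index permutation over the whole tensor. Here is a minimal standalone sketch of what an NHWC-to-NCHW reformat does on the host (illustrative only; the function name is mine, and the real TensorRT reformat runs as a GPU kernel):

```cpp
#include <cstddef>
#include <vector>

// Copy an NHWC-laid-out tensor into NCHW order.
// Source index:      ((n*H + h)*W + w)*C + c
// Destination index: ((n*C + c)*H + h)*W + w
std::vector<float> nhwcToNchw(const std::vector<float>& src,
                              std::size_t N, std::size_t H,
                              std::size_t W, std::size_t C) {
    std::vector<float> dst(src.size());
    for (std::size_t n = 0; n < N; ++n)
        for (std::size_t h = 0; h < H; ++h)
            for (std::size_t w = 0; w < W; ++w)
                for (std::size_t c = 0; c < C; ++c)
                    dst[((n * C + c) * H + h) * W + w] =
                        src[((n * H + h) * W + w) * C + c];
    return dst;
}
```

Every element is touched once with a strided write, so for a 1x32x300x1680 input this is a full extra pass over the data before the convolution even starts — that is the latency showing up in the profiler line above.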
I’m wondering whether IConvolutionLayer could support NHWC input instead of NCHW input, so that we can avoid any shuffle or reformat when doing the convolution compute.
In the blog post, the shape is [1, 224, 224, 3], but if you look at the tf2onnx code it was referring to, the transpose is done when converting the TF model to ONNX.
I know the UFF parser supports kNHWC, but it transposes to NCHW before passing the tensor to IConvolutionLayer, and that is the additional latency we want to avoid. Do you see what the problem is?
TensorRT uses NCHW uniformly when defining the semantics of its operations. You can use the TensorFormat enum to gain access to TensorRT’s internal data layouts at network boundaries, which are optimized for Tensor Cores.
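For reference, a sketch of requesting one of those boundary layouts through the C++ API (builder, parser, and network setup omitted; kHWC8 is an FP16-only channel-packed NHWC format, and exact enum availability depends on your TensorRT version):

```cpp
// Sketch: ask TensorRT to accept a channel-packed NHWC (kHWC8),
// half-precision input at the network boundary, so the builder can
// elide the separate input reformat layer.
// Assumes `network` is a populated INetworkDefinition* and
// `config` is an IBuilderConfig*.
nvinfer1::ITensor* input = network->getInput(0);
input->setType(nvinfer1::DataType::kHALF);
input->setAllowedFormats(
    1U << static_cast<int>(nvinfer1::TensorFormat::kHWC8));
config->setFlag(nvinfer1::BuilderFlag::kFP16);
```

With this, the application is responsible for handing TensorRT a buffer already laid out in kHWC8, but the reformatter disappears from the engine.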