High image resolution & low number of channels -> really bad speed

Description

When using high image resolutions (e.g. 1920x1080 or 3840x2160) and a low number of channels (8 or 16), TensorRT is unexpectedly slow.

Environment

TensorRT Version: 8.4
GPU Type: 3080
Nvidia Driver Version: 516.01
CUDA Version: 11.7
CUDNN Version: 8.2.0
Operating System + Version: Windows 10 x64
Python Version (if applicable): 3.9.12
TensorFlow Version (if applicable):
PyTorch Version (if applicable): 1.11.0
Baremetal or Container (if container which image + tag): Baremetal

Relevant Files

http://madshi.net/trtBench.zip

Steps To Reproduce

You can download the zip file (link above) to run the same benchmarks I did to measure TensorRT performance. The zip file contains the full Python scripts to create the ONNX networks and to benchmark them. I have also included the ONNX network files.

I’ve tested 1920x1080 image resolution with very simple neural networks with the following structure:

  • 32 convolutional 2D layers
  • 8, 16, 24, 32 or 64 filters
  • 1x1 or 3x3 kernels
  • no other layers, no ReLUs, nothing else

Please note that the neural networks I’m using for testing produce garbage output. They are not supposed to do anything useful. Their only purpose is to test how fast TensorRT inference is with high-resolution images and a low number of channels.
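
For reference, here’s a minimal sketch of how such a test network can be built and exported to ONNX. The actual scripts in trtBench.zip may differ in details (e.g. how the first/last layers and bias are handled), and the file name below is just an example:

  # Minimal sketch: N identical conv layers, same channel count in and out,
  # no bias, no activations. The real scripts in trtBench.zip may differ.
  import torch
  import torch.nn as nn

  def make_test_net(layers=32, channels=16, kernel_size=3):
      pad = kernel_size // 2  # keep the full 1920x1080 resolution in every layer
      convs = [nn.Conv2d(channels, channels, kernel_size, padding=pad, bias=False)
               for _ in range(layers)]
      return nn.Sequential(*convs)  # no ReLUs, nothing else

  net = make_test_net(layers=32, channels=16, kernel_size=3).eval()
  dummy = torch.randn(1, 16, 1080, 1920)  # NCHW, 1080p input
  torch.onnx.export(net, dummy, "conv_16x03.onnx", opset_version=13,
                    input_names=["input"], output_names=["output"])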

Here are the results I’m getting:

# 3080 TensorRT results:
# 08x01 FP32/FP16/INT8:  8.264 ms,  3.671 ms,  4.848 ms, T(FL)OPS:  0.578,  1.301,  0.986
# 08x03 FP32/FP16/INT8: 25.107 ms,  5.657 ms,  8.883 ms, T(FL)OPS:  1.543,  6.850,  4.362
# 16x01 FP32/FP16/INT8: 13.994 ms,  7.071 ms,  5.229 ms, T(FL)OPS:  1.290,  2.552,  3.451
# 16x03 FP32/FP16/INT8: 39.289 ms,  8.801 ms, 14.854 ms, T(FL)OPS:  3.918, 17.491, 10.364
# 24x01 FP32/FP16/INT8: 19.591 ms, 10.350 ms,  6.822 ms, T(FL)OPS:  2.032,  3.847,  5.836
# 24x03 FP32/FP16/INT8: 60.204 ms, 16.179 ms, 14.940 ms, T(FL)OPS:  5.740, 21.360, 23.131
# 32x01 FP32/FP16/INT8: 25.872 ms, 14.757 ms,  7.768 ms, T(FL)OPS:  2.708,  4.748,  9.020
# 32x03 FP32/FP16/INT8: 80.142 ms, 20.639 ms, 15.075 ms, T(FL)OPS:  7.657, 29.733, 40.706
# 64x01 FP32/FP16/INT8: 13.497 ms,  8.601 ms,  4.828 ms, T(FL)OPS:  5.113,  8.023, 14.295
# 64x03 FP32/FP16/INT8: 51.950 ms, 16.296 ms,  8.234 ms, T(FL)OPS: 11.792, 37.592, 74.402

To give you a point of reference, several years ago I wrote simple Direct3D9 PixelShaders to do inference for 8 and 16 channel convolutional layers with 3x3 kernels. The PixelShaders are extremely simple, and probably not perfectly optimized. Yet they still beat TensorRT performance by quite a big margin. See here:

# 3080 Direct3D9 PixelShader results:
# 08x03 FP32:  7.264 ms, TFLOPS: 5.335
# 16x03 FP32: 29.056 ms, TFLOPS: 5.298

Now let’s look at theoretical 3080 T(FL)OPS numbers:

# Peak FP32 TFLOPS (non-Tensor): 29.8
# Peak INT32 TOPS (non-Tensor): 14.9
# Peak FP16 Tensor TFLOPS (with FP16 Accumulate): 119
# Peak FP16 Tensor TFLOPS (with FP32 Accumulate): 59.5
# Peak INT8 Tensor TOPS: 238

My understanding is that the Tensor cores do 8x8 matrix multiplications in FP16, or 16x16 matrix multiplications in INT8. So I can fully understand that INT8 performance cannot be great when using convolutional layers with only 8 channels. However, convolutional layers with 8 channels should be fine for FP16, and convolutional layers with 16 filters should be fine for INT8. No?
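
For the utilization percentages I mention in the list below, I’m simply dividing the measured T(FL)OPS from the results table by these peaks, using 59.5 as the FP16 peak (since TensorRT apparently accumulates in FP32) and 238 as the INT8 peak. A quick sanity check:

  # How the "% of theoretical" figures below are computed: measured T(FL)OPS
  # from the results table divided by the corresponding peak.
  cases = {
      "16x01 FP16": ( 2.552,  59.5),
      "16x03 FP16": (17.491,  59.5),
      "32x03 FP16": (29.733,  59.5),
      "64x03 INT8": (74.402, 238.0),
  }
  for name, (measured, peak) in cases.items():
      print("%s: %4.1f%% of peak" % (name, 100.0 * measured / peak))
  # -> roughly 4.3%, 29.4%, 50.0% and 31.3%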

Looking at the benchmark results, there are various things I notice:

  1. Performance with 8 channels is really bad. Using FP32, my Direct3D9 PixelShaders are almost 3.5x as fast as TensorRT!

  2. Performance with 16 channels is still relatively bad. Using FP32, my D3D9 PixelShaders are still 35% faster than TensorRT.

  3. Performance with FP16 is always better than with FP32. Ampere’s non-Tensor FP16 throughput is no higher than its FP32 throughput, and memory bandwidth alone cannot explain the speed advantage, so TensorRT is clearly using the Tensor cores for FP16 and INT8 inference.

  4. When doing 16 channels with 3x3 filters in FP16, we’re only at 30% of the theoretical performance. With 1x1 kernels, we’re only at 4.3% of the theoretical performance! Why is that? I would have thought that using 16 input and output channels for each layer should produce nearly optimal results, considering that the Tensor cores do either 8x8 multiplications (FP16) or 16x16 multiplications (INT8).

  5. When using 32 channels with 3x3 filters, we’re at 50% of the theoretical performance. That’s an improvement over 16 channels, but I’m still not sure I understand why we’re not closer to 100%. Thanks to the high image resolution, it should be a piece of cake to keep all Tensor cores busy at all times. Unless it’s all limited by memory bandwidth?

  6. INT8 results are extremely weird. OK, let’s disregard the INT8 results with 8 channels, since the Tensor cores do 16x16 multiplications in INT8. But even at 16 channels, INT8 often performs worse than FP16! How is that possible!? Only starting at 32 filters does INT8 become consistently faster than FP16. That really makes no sense to me. How can INT8 ever be slower than FP16 when using 16+ channel convolutional layers??

  7. Another INT8 weirdness is that when using 1x1 filters, INT8 beats FP16 (except when using 8 channels, but that doesn’t count). However, when using 3x3 filters, INT8 is much less effective. That doesn’t make any sense to me. Can anybody explain?

  8. Even when using 64 channels, we’re only at 31% of the theoretical performance of INT8. Why are we not closer to 100%?

  9. TensorRT seems to consume extremely high amounts of VRAM, which is very worrying for me. My Direct3D9 PixelShaders consume nearly zero extra VRAM (just a couple of MBs for the ping-pong textures). I haven’t yet tried loading multiple neural networks into TensorRT at the same time, so that I can run multiple networks at once. I wonder if the VRAM stacks up for each neural network? If so, I’d run out of VRAM very quickly. (I’ve sketched below how I plan to check this.)
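
Here’s a minimal sketch of how I plan to check this, assuming that engine.device_memory_size reports each engine’s scratch memory and that contexts which never run concurrently can share one scratch buffer via create_execution_context_without_device_memory() (I haven’t verified any of this yet):

  # Sketch (untested): measure per-engine scratch memory and share one scratch
  # buffer between engines that are never executed at the same time.
  # Assumes PyCUDA for the allocation and already-deserialized TensorRT engines.
  import pycuda.autoinit  # noqa: F401  (creates the CUDA context)
  import pycuda.driver as cuda

  def shared_contexts(engines):
      for e in engines:
          print("scratch memory: %.1f MB" % (e.device_memory_size / 1024**2))
      # one buffer, sized for the largest requirement, shared by all contexts
      # (only valid as long as the contexts never execute concurrently)
      scratch = cuda.mem_alloc(max(e.device_memory_size for e in engines))
      contexts = []
      for e in engines:
          ctx = e.create_execution_context_without_device_memory()
          ctx.device_memory = int(scratch)
          contexts.append(ctx)
      return contexts, scratch

Of course the weights are still stored separately per engine, so at best this would only help with the scratch/activation memory, not with the total engine size.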

Overall, looking at these benchmark results, my impression is that TensorRT was mainly optimized for low image resolutions with high channel and batch numbers. But it does not seem to be optimized at all for high image resolutions and low channel numbers. Is that a fair thing to say?

I would politely request that the TensorRT team invest some time in optimizing for high image resolutions and low channel numbers, with both 1x1 and 3x3 filter sizes. Also, please optimize for low VRAM consumption when loading multiple different neural networks into TensorRT, for the purpose of running multiple different networks in a row or at the same time.

FYI, this is all for the purpose of real-time video processing (during video playback). In this situation, using high channel numbers is not feasible due to the real-time constraints. Ideally I’d like to use 8, or at most 16, channels. We do have very high image resolutions, e.g. 1920x1080 for Blu-ray or even 3840x2160 for UHD Blu-ray. So could you please also optimize TensorRT for this use case?

Hi,

Could you please share the model, script, profiler, and performance output (if not shared already) so that we can help you better?

Alternatively, you can try running your model with the trtexec command.

While measuring the model performance, make sure you consider the latency and throughput of the network inference, excluding the data pre- and post-processing overhead.
Please refer to the links below for more details:
https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-803/best-practices/index.html#measure-performance

https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-803/best-practices/index.html#model-accuracy

Thanks!

I’ve already shared everything. If you have PyTorch installed, you should be able to reproduce my benchmarks with ease.

P.S: Since we have 8K TVs and computer monitors these days, the neural network may also be used to upscale video to 8K in real time. So TensorRT should also be able to handle convolutional neural networks with 8K image resolutions without consuming insane amounts of VRAM. Not sure if that has been tested/optimized yet?
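
Just to put a rough number on it: at 8K, even a single intermediate activation tensor is already around a gigabyte (assuming 16 channels stored in FP16):

  # One intermediate activation tensor at 8K (7680x4320), 16 channels, FP16:
  w, h, c, bytes_per_el = 7680, 4320, 16, 2
  print("%.2f GB per activation tensor" % (w * h * c * bytes_per_el / 1e9))  # ~1.06 GB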

Hi,

We are unable to download the resource you have shared to reproduce the issue.
Could you please share it again?

Thank you.

I’ve just tried to download the zip file I provided, and it downloads fine for me. But I’ve now also uploaded the zip to MediaFire, in case that helps:

https://www.mediafire.com/file/q3mys69cr3657ey/trtBench.zip/file

In any case, thank you for looking into this!

Thank you for sharing the issue repro resources. Our team will work on this.

Thank you very much!!

Here’s a crazy optimization idea: When dealing with large image resolutions and low channel numbers, you could try using tiles which overlap just enough to cover the receptive field of the convolutional layers. Choosing the tile size wisely might allow you to run only 1 tile at a time through all convolutional layers. Ideally, such a tile would be just big enough to keep all tensor cores busy, and just small enough to fit the whole tile into L2 cache, thus eliminating any memory bandwidth bottlenecks.

Of course fitting stuff into L2 cache will become much easier with Ada Lovelace… :-)
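
To make the idea a bit more concrete, here’s a back-of-the-envelope sketch of the tile-size math. I’m assuming the 3080’s L2 is roughly 5 MB, activations are stored in FP16 with 16 channels, two ping-pong tile buffers have to be resident at the same time, and each of the 32 3x3 layers adds 1 pixel of halo on every side (very rough, since in reality the halo shrinks layer by layer):

  # Rough tile sizing (assumptions as above; the numbers are illustrative only).
  L2_BYTES = 5 * 1024 * 1024   # RTX 3080 L2 cache, approximately
  CHANNELS = 16
  BYTES_EL = 2                 # FP16 activations
  LAYERS   = 32
  HALO     = LAYERS * 1        # each 3x3 conv adds 1 px of overlap per side

  def fits(side):
      padded = side + 2 * HALO
      # two square ping-pong tile buffers (including the halo) must fit in L2
      return 2 * padded * padded * CHANNELS * BYTES_EL <= L2_BYTES

  side = 1
  while fits(side + 1):
      side += 1

  padded = side + 2 * HALO
  print("max tile: %dx%d px (padded to %dx%d), %.0f%% useful pixels"
        % (side, side, padded, padded, 100.0 * side * side / (padded * padded)))

With these assumptions the sketch ends up at a tile of roughly 220x220 pixels, with the 32-pixel halo costing around 40% of the work, which is exactly why a much larger L2 would make this more attractive.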

P.S: While we’re at it, here are 2 more optimizations that I’m using in my Direct3D9 pixel shaders, which probably contributed to them beating TensorRT in speed:

  1. I store the output of convolutional layers in FP16. I think TensorRT always uses FP32? It might be worth adding an option to use FP16 everywhere, including storage and accumulation (since the consumer Ampere Tensor cores are twice as fast with FP16 accumulation). For my CNNs, it does not seem to hurt accuracy at all.

  2. I’m using constant registers for the weights. I’m not sure if the Tensor cores are able to read constant registers, so this might not be technically possible for you to do. But if you can use constant registers somehow, you might be able to reduce memory bandwidth consumption even more. Of course this will only work for CNNs with really small channel numbers and/or filter sizes because otherwise you’ll run out of available constant registers very quickly.

By default, TRT always assumes that the network inputs and outputs are in FP32 and in LINEAR format. You can override the “dtype” property of the input/output tensors to the data type you want. For example, this code snippet sets the input/output to FP16 or INT8, depending on which precision is used to build the engine:

  if fp16:
    print("- %2d layers, %2d channels, %dx%d kernel, FP16" % (layers, channels, kernelSize, kernelSize))
    config.set_flag(trt.BuilderFlag.FP16)
    network.get_input(0).dtype = trt.DataType.HALF
    network.get_input(0).allowed_formats = 1 << int(trt.TensorFormat.HWC8)
    network.get_output(0).dtype = trt.DataType.HALF
    network.get_output(0).allowed_formats = 1 << int(trt.TensorFormat.HWC8)
  elif int8:
    print("- %2d layers, %2d channels, %dx%d kernel, INT8" % (layers, channels, kernelSize, kernelSize))
    for i in range(network.num_layers):
      layer = network.get_layer(i)
      if (layer.type != trt.LayerType.CONSTANT) and (layer.type != trt.LayerType.CONCATENATION) and (layer.type != trt.LayerType.SHAPE) and (layer.type != trt.LayerType.GATHER):
        layer.precision = trt.DataType.INT8
        for j in range(layer.num_outputs):
          output = layer.get_output(j)
          if output.is_execution_tensor:
            layer.set_output_type(j, trt.DataType.INT8)
    # set dummy dynamic ranges so that no INT8 calibrator is needed
    # (fine here, since the benchmark networks produce garbage output anyway)
    for i in range(network.num_inputs):
      input = network.get_input(i)
      input.set_dynamic_range(0, 50)
    for i in range(network.num_layers):
      layer = network.get_layer(i)
      for j in range(layer.num_outputs):
        output = layer.get_output(j)
        output.set_dynamic_range(0, 50)
    config.set_flag(trt.BuilderFlag.INT8)
    network.get_input(0).dtype = trt.DataType.INT8
    network.get_input(0).allowed_formats = 1 << int(trt.TensorFormat.CHW32)
    network.get_output(0).dtype = trt.DataType.INT8
    network.get_output(0).allowed_formats = 1 << int(trt.TensorFormat.CHW32)
  else:
    print("- %2d layers, %2d channels, %dx%d kernel, FP32" % (layers, channels, kernelSize, kernelSize))
    network.get_input(0).allowed_formats = 1 << int(trt.TensorFormat.HWC)
    network.get_output(0).allowed_formats = 1 << int(trt.TensorFormat.HWC)

Thank you very much! I’ve tried this on my PC, and there seems to be a small speed improvement, but it’s really modest. E.g. 16x03 goes from 8.8 ms down to about 8.3 ms, and 32x03 goes from 20.6 ms to 19.7 ms. That’s only about 5% faster, while in theory FP16 accumulation should be twice as fast as FP32 accumulation on a 3080.

Can you comment on why we’re so far away from the theoretical numbers? What is the bottleneck? Is it memory bandwidth? But if so, shouldn’t going from FP32 storage & accumulation to FP16, or from FP16 to INT8, help a lot more than it does?
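
To make the memory bandwidth question concrete, here’s my back-of-the-envelope estimate. I’m assuming that TensorRT writes every intermediate activation out to DRAM between layers, which I don’t know for sure (layer fusion or cache tiling would change the picture):

  # Back-of-the-envelope: DRAM traffic for activations alone, assuming every
  # intermediate result makes a full round trip through device memory.
  W, H      = 1920, 1080
  CHANNELS  = 16
  LAYERS    = 32
  BYTES_EL  = 2          # FP16 storage
  BANDWIDTH = 760e9      # RTX 3080: ~760 GB/s

  act_bytes = W * H * CHANNELS * BYTES_EL    # one activation tensor, ~66 MB
  traffic   = LAYERS * 2 * act_bytes         # read input + write output per layer
  min_time  = traffic / BANDWIDTH

  print("total activation traffic: %.2f GB" % (traffic / 1e9))      # ~4.25 GB
  print("bandwidth-only lower bound: %.2f ms" % (min_time * 1e3))   # ~5.6 ms

If that estimate is anywhere near reality, a large part of the measured 8.8 ms for the 16x03 FP16 case could simply be activation traffic, which would also explain why faster math alone doesn’t help as much as expected.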

The most important convolutions for me are 3x3 filters with either 16 or 32 channels. Do you see any hope for improving those, when using high image resolutions?

And if I may ask: Do you think Hopper’s TMA (Tensor Memory Accelerator) will help a lot getting nearer to the theoretical performance limits? Or do you think the TMA will only help a little bit?