Description
When using high image resolutions (e.g. 1920x1080 or 3840x2160) and a low number of channels (8 or 16), TensorRT inference is unexpectedly slow.
Environment
TensorRT Version: 8.4
GPU Type: 3080
Nvidia Driver Version: 516.01
CUDA Version: 11.7
CUDNN Version: 8.2.0
Operating System + Version: Windows 10 x64
Python Version (if applicable): 3.9.12
TensorFlow Version (if applicable):
PyTorch Version (if applicable): 1.11.0
Baremetal or Container (if container which image + tag): Baremetal
Relevant Files
http://madshi.net/trtBench.zip
Steps To Reproduce
You can download the zip file (link above) to run the same benchmarks I did to measure TensorRT performance. It contains the full Python scripts to create the ONNX networks and to benchmark them, as well as the exported ONNX network files.
I’ve tested 1920x1080 image resolution with very simple neural networks with the following structure:
- 32 convolutional 2D layers
- 8, 16, 24, 32 or 64 filters
- 1x1 or 3x3 kernels
- nothing else: no ReLUs, no other operations
Please note that the neural networks I’m using for testing produce garbage output. They are not supposed to do anything useful. Their only purpose is to test how fast TensorRT is when doing inference on high-resolution images with a low number of channels.
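For illustration, here is a minimal sketch of how such a test network can be built and exported. The actual scripts are in the zip file above; the function and file names here are just placeholders:

```python
import torch
import torch.nn as nn

# Minimal reconstruction of the test networks described above:
# 32 stacked Conv2d layers, constant channel count, no activations.
def make_test_net(channels=16, kernel_size=3, layers=32):
    pad = kernel_size // 2  # "same" padding, so the 1920x1080 resolution is preserved
    convs = [nn.Conv2d(channels, channels, kernel_size, padding=pad)
             for _ in range(layers)]
    return nn.Sequential(*convs)

net = make_test_net(channels=16, kernel_size=3).eval()
dummy = torch.randn(1, 16, 1080, 1920)  # NCHW, 1080p, 16 channels
torch.onnx.export(net, dummy, "conv_16x03.onnx", opset_version=13)
```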
Here are the results I’m getting:
# 3080 TensorRT results:
# 08x01 FP32/FP16/INT8: 8.264 ms, 3.671 ms, 4.848 ms, T(FL)OPS: 0.578, 1.301, 0.986
# 08x03 FP32/FP16/INT8: 25.107 ms, 5.657 ms, 8.883 ms, T(FL)OPS: 1.543, 6.850, 4.362
# 16x01 FP32/FP16/INT8: 13.994 ms, 7.071 ms, 5.229 ms, T(FL)OPS: 1.290, 2.552, 3.451
# 16x03 FP32/FP16/INT8: 39.289 ms, 8.801 ms, 14.854 ms, T(FL)OPS: 3.918, 17.491, 10.364
# 24x01 FP32/FP16/INT8: 19.591 ms, 10.350 ms, 6.822 ms, T(FL)OPS: 2.032, 3.847, 5.836
# 24x03 FP32/FP16/INT8: 60.204 ms, 16.179 ms, 14.940 ms, T(FL)OPS: 5.740, 21.360, 23.131
# 32x01 FP32/FP16/INT8: 25.872 ms, 14.757 ms, 7.768 ms, T(FL)OPS: 2.708, 4.748, 9.020
# 32x03 FP32/FP16/INT8: 80.142 ms, 20.639 ms, 15.075 ms, T(FL)OPS: 7.657, 29.733, 40.706
# 64x01 FP32/FP16/INT8: 13.497 ms, 8.601 ms, 4.828 ms, T(FL)OPS: 5.113, 8.023, 14.295
# 64x03 FP32/FP16/INT8: 51.950 ms, 16.296 ms, 8.234 ms, T(FL)OPS: 11.792, 37.592, 74.402
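In case it helps to interpret the table: the T(FL)OPS column is essentially the op count of one forward pass divided by the measured time, counting one op per multiply-accumulate (bias included). A small sketch of that calculation, reproducing e.g. the 24x03 FP32 row at 1920x1080:

```python
# Effective throughput = ops per forward pass / measured time.
# Ops per pixel per layer = out_ch * (in_ch * k * k + 1)  (one op per MAC, plus bias)
def effective_tops(width, height, channels, kernel, layers, time_ms):
    ops_per_pixel = channels * (channels * kernel * kernel + 1)
    total_ops = ops_per_pixel * width * height * layers
    return total_ops / (time_ms * 1e-3) / 1e12

# Example: 24 channels, 3x3 kernels, 32 layers, 60.204 ms (the 24x03 FP32 row)
print(effective_tops(1920, 1080, 24, 3, 32, 60.204))  # ~5.74
```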
To give you a point of reference: several years ago I wrote simple Direct3D9 PixelShaders to do inference for 8- and 16-channel convolutional layers with 3x3 kernels. The PixelShaders are extremely simple and probably not perfectly optimized, yet they still beat TensorRT by quite a big margin. See here:
# 3080 Direct3D9 PixelShader results:
# 08x03 FP32: 7.264 ms, TFLOPS: 5.335
# 16x03 FP32: 29.056 ms, TFLOPS: 5.298
Now let’s look at theoretical 3080 T(FL)OPS numbers:
# Peak FP32 TFLOPS (non-Tensor): 29.8
# Peak INT32 TOPS (non-Tensor): 14.9
# Peak FP16 Tensor TFLOPS (with FP16 Accumulate): 119
# Peak FP16 Tensor TFLOPS (with FP32 Accumulate): 59.5
# Peak INT8 Tensor TOPS: 238
My understanding is that the Tensor cores do 8x8 matrix multiplications in FP16, or 16x16 matrix multiplications in INT8. So I can fully understand that INT8 performance cannot be great when using convolutional layers with only 8 channels. However, convolutional layers with 8 channels should be fine for FP16, and convolutional layers with 16 filters should be fine for INT8. No?
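To make that more concrete, here is the GEMM shape that each of these convolutions corresponds to in the usual im2col view. This is just my mental model, not knowledge of how TensorRT actually lowers these convolutions, but it shows that the reduction and output dimensions are tiny compared to the number of pixels:

```python
# Implied GEMM shape for one conv layer in the im2col view:
#   [H*W, in_ch*k*k] x [in_ch*k*k, out_ch] -> [H*W, out_ch]
def implied_gemm_shape(width, height, in_ch, out_ch, kernel):
    m = width * height            # one row per output pixel
    k = in_ch * kernel * kernel   # reduction dimension
    n = out_ch
    return m, k, n

for ch, ks in [(8, 1), (8, 3), (16, 1), (16, 3), (32, 3)]:
    print(f"{ch} ch, {ks}x{ks}:", implied_gemm_shape(1920, 1080, ch, ch, ks))
# e.g. 16 ch, 1x1 -> (2073600, 16, 16)   (K and N are tiny, M is huge)
#      16 ch, 3x3 -> (2073600, 144, 16)
```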
Looking at the benchmark results, there are various things I notice:
- Performance with 8 channels is really bad. Using FP32, my Direct3D9 PixelShaders are almost 3.5x as fast as TensorRT!
- Performance with 16 channels is still relatively bad. Using FP32, my D3D9 PixelShaders are still 35% faster than TensorRT.
- Performance with FP16 is always better than with FP32, so TensorRT is clearly using the Tensor cores for FP16 and INT8 inference: Ampere’s non-Tensor throughput is no higher for FP16 than for FP32, and memory bandwidth alone cannot explain the speed advantage of FP16.
- When doing 16 channels with 3x3 kernels in FP16, we’re only at 30% of the theoretical performance. With 1x1 kernels, we’re only at 4.3% of the theoretical performance! Why is that? I would have thought that using 16 input and 16 output channels for each layer should produce nearly optimal results, considering that Tensor cores do either 8x8 multiplications (FP16) or 16x16 multiplications (INT8).
- When using 32 channels with 3x3 kernels, we’re at 50% of the theoretical performance. That’s an improvement over 16 channels, but I’m still not sure I understand why we’re not closer to 100%. Thanks to the high image resolution, it should be easy to keep all Tensor cores busy at all times. Unless it’s all limited by memory bandwidth? (A rough bandwidth estimate is sketched after this list.)
- The INT8 results are extremely weird. OK, let’s disregard the INT8 results with 8 channels, since Tensor cores do 16x16 multiplications in INT8. But even at 16 channels, INT8 often performs worse than FP16! How is that possible? Only starting at 32 filters does INT8 become consistently faster than FP16. That really makes no sense to me. How can INT8 ever be slower than FP16 when using convolutional layers with 16+ channels?
- Another INT8 weirdness: when using 1x1 kernels, INT8 beats FP16 (except with 8 channels, but that doesn’t count). However, when using 3x3 kernels, INT8 is much less effective. That doesn’t make any sense to me. Can anybody explain?
- Even when using 64 channels, we’re only at 31% of the theoretical performance of INT8. Why are we not closer to 100%?
- TensorRT seems to consume extremely high amounts of VRAM, which is very worrying for me. My Direct3D9 PixelShaders consume nearly zero extra VRAM (just a couple of MBs for the ping-pong textures). I haven’t yet tried loading multiple neural networks into TensorRT at the same time, so that I can run multiple networks at once. I wonder whether the VRAM consumption stacks up for each neural network? If so, I’d run out of VRAM very quickly.
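Regarding the memory bandwidth question above: here is a rough lower-bound estimate of the activation traffic, assuming every layer reads its full input from VRAM and writes its full output back exactly once (no fusion, no caching), and assuming roughly 760 GB/s peak bandwidth for the 3080. Both assumptions are simplifications, so treat this only as a ballpark figure:

```python
# Rough lower bound on the time needed just to move the activations,
# assuming each layer reads its input and writes its output to VRAM once.
def min_bandwidth_time_ms(width, height, channels, layers,
                          bytes_per_value=2,   # FP16
                          peak_gbps=760.0):    # assumed 3080 peak bandwidth
    per_layer = width * height * channels * bytes_per_value * 2  # read + write
    total_bytes = per_layer * layers
    return total_bytes / (peak_gbps * 1e9) * 1e3

# Example: 32 channels, FP16, 1080p, 32 layers
print(min_bandwidth_time_ms(1920, 1080, 32, 32))  # ~11.2 ms
```

If that estimate is roughly right, memory traffic alone would account for a significant part of the measured 20.6 ms in the 32x03 FP16 case, and since the traffic is independent of the kernel size, the 1x1 runs in particular could be largely bandwidth-bound. That might explain some of the numbers above, but it doesn’t change the overall picture.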
Overall, looking at these benchmark results, my impression is that TensorRT was mainly optimized for low image resolutions with high channel and batch counts, but not for high image resolutions with low channel counts. Is that a fair thing to say?
I would politely request that the TensorRT team invest some time in optimizing for high image resolutions and low channel counts, with both 1x1 and 3x3 kernel sizes. Also, please optimize for low VRAM consumption when loading multiple different neural networks into TensorRT, for the purpose of running several different networks in a row or at the same time.
FYI, this is all for the purpose of real-time video processing (during video playback). In this situation, using high channel counts is not feasible due to the real-time constraints; ideally I’d like to use 8 or at most 16 channels. But we do have very high image resolutions, e.g. 1920x1080 for Blu-ray or even 3840x2160 for UHD Blu-ray. So could you please also optimize TensorRT for this use case?