Hi there,
I am running into an issue where a conv2d layer is not using Tensor Cores for some configurations of dilation/padding.
For certain input sizes the layer uses a Tensor Core cuDNN implementation, but for others it does not. Are there known limitations/rules that should be followed to guarantee Tensor Cores are used every time for a 2D convolution, regardless of input size and dilation/padding parameters?
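For context, I am checking which kernels get picked by building with a verbose logger and looking at the tactic/kernel names in the log, roughly like this (a minimal sketch; the logger class name is just illustrative):

```cpp
#include <NvInfer.h>
#include <cstdio>

// Minimal sketch: pass every message (including kVERBOSE) through, so the
// build log shows which tactics/kernels are selected for each layer.
class VerboseLogger : public nvinfer1::ILogger {
public:
    void log(Severity severity, const char* msg) noexcept override {
        std::printf("%s\n", msg);
    }
};

int main() {
    VerboseLogger logger;
    nvinfer1::IBuilder* builder = nvinfer1::createInferBuilder(logger);
    // ... build the network and engine as usual; the verbose output lists the
    // convolution kernels chosen (e.g. turing_h1688* vs implicit_convolve_sgemm).
    builder->destroy();
    return 0;
}
```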
o Linux distro and version: Debian 9
o GPU type: RTX 2080 Ti
o Nvidia driver version: 410.93
o CUDA version: 10.0
o CUDNN version: 7.6.5
o Python version [if using Python]: N/A (using C++)
o TensorFlow and PyTorch version: N/A
o TensorRT version: 7.0.0.11
Thanks in advance for your help!
Hi,
For best practices, please refer to the link below:
https://docs.nvidia.com/deeplearning/sdk/tensorrt-best-practices/index.html#optimize-layer
If possible, please share your model and configuration settings so that we can help better.
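Also note that Tensor Cores are only used for reduced precision (FP16/INT8), so please make sure the corresponding builder flag is set. A minimal sketch with the TensorRT 7 C++ API, assuming you already have an IBuilder and IBuilderConfig:

```cpp
#include <NvInfer.h>

// Minimal sketch: enable FP16 so Tensor Core kernels become eligible.
// "builder" and "config" are assumed to already exist in your code.
void enableFp16(nvinfer1::IBuilder* builder, nvinfer1::IBuilderConfig* config) {
    if (builder->platformHasFastFp16()) {
        config->setFlag(nvinfer1::BuilderFlag::kFP16);
    }
}
```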
Thanks
Thanks.
I have three parallel 2d conv nodes. They all have the same input size and same output size.
The difference is that one has (6,6) padding, another (12,12), and another (18,18). The kernel shape is (3,3) for all of them.
The one with (6,6) always runs on Tensor Cores regardless of input size.
The one with (12,12) runs on Tensor Cores for some input sizes.
The one with (18,18) never runs on Tensor Cores.
My guess is that there is some rule that only triggers Tensor Core usage when the input/work/output size is a multiple of 8, or something like that, but I would like to know what the rules actually are instead of guessing.
Thanks for your help
As mentioned in the link above:
Tensor dimensions (or the number of input and output channels for FullyConnected layer) of multiples of 32 tend to have the best performance for FP16 and INT8 inference because of the utilization of Tensor Cores if the hardware supports them.
Tensor Core kernels for FP16 data require the striding between data rows to be a multiple of 8 data elements. For example, a MatrixMultiply that is M x K times K x N requires M, K, and N to be multiples of 8 to use Tensor Core optimized kernels.
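As a rough standalone illustration of that alignment rule (the numbers below are arbitrary examples, not taken from your model):

```cpp
#include <cstdio>

// Round a dimension up to the next multiple of 8 (the FP16 Tensor Core
// alignment mentioned above). The values below are arbitrary examples.
static int roundUpTo8(int v) { return (v + 7) / 8 * 8; }

int main() {
    const int dims[] = {2048, 256, 257};
    for (int d : dims) {
        std::printf("%4d -> %4d (%s)\n", d, roundUpTo8(d),
                    d % 8 == 0 ? "already aligned" : "would need padding");
    }
    return 0;
}
```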
Can you share more details regarding stride, dilation, and in/out channels? Also, if possible, please share the model and script file so we can reproduce the issue.
Thanks
Sorry, I am not allowed to share the model, but it is DeepLab with an atrous classifier.
The convolutions that are giving me trouble are the atrous convolutions in the classifier.
Conv 1:
In channels: 2048
Out channels: 256
Kernel: (3,3)
Stride: (1,1)
Pads: (6,6)
Dilation: (6,6)

Conv 2:
In channels: 2048
Out channels: 256
Kernel: (3,3)
Stride: (1,1)
Pads: (12,12)
Dilation: (12,12)

Conv 3:
In channels: 2048
Out channels: 256
Kernel: (3,3)
Stride: (1,1)
Pads: (18,18)
Dilation: (18,18)
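For reference, this is roughly how I add those branches with the TensorRT C++ API (a simplified sketch; the function name and the dummy weight buffers are placeholders, not my actual code):

```cpp
#include <NvInfer.h>
#include <vector>

using namespace nvinfer1;

// Simplified sketch of one atrous branch: 3x3 conv, 2048 -> 256 channels,
// stride 1, pad == dilation == rate, so the spatial size is preserved.
IConvolutionLayer* addAtrousBranch(INetworkDefinition& net, ITensor& input,
                                   int rate, const std::vector<float>& w,
                                   const std::vector<float>& b) {
    Weights kernel{DataType::kFLOAT, w.data(), static_cast<int64_t>(w.size())};
    Weights bias{DataType::kFLOAT, b.data(), static_cast<int64_t>(b.size())};
    IConvolutionLayer* conv = net.addConvolutionNd(input, 256, Dims2{3, 3}, kernel, bias);
    conv->setStrideNd(Dims2{1, 1});
    conv->setPaddingNd(Dims2{rate, rate});
    conv->setDilationNd(Dims2{rate, rate});
    return conv;
}
```

The three branches are created with rate = 6, 12 and 18 on the same input tensor, and only the rate-6 one consistently picks a Tensor Core kernel.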
The problem is that the optimizer cannot find an optimized Tensor Core implementation for them and falls back to implicit_convolve_sgemm, instead of one of the turing_h1688 kernels used by the other ‘regular’ convolutions.
Thanks, SunilJB
Hi @ricardo10silva,
Some additional fixes were included in the latest release. Could you please try the latest TensorRT and let us know if the issue persists?
Thanks