Hello,
I have a model built around 3D operations; the main layer is Conv3d. With nvprof, 90% of the inference time is spent in cudnn::detail::implicit_convolveND_sgemm<float …>. After switching to FP16 there is little performance improvement; the inference time is now dominated by cudnn::detail::implicit_convolveND_sgemm<__half …>.
TensorRT version is 7, running on an RTX 2080 Ti. Any suggestions? Thanks
Yes, TRT supports 3D conv layers. Speed depends on many parameters, such as GPU type.
Kernel selection depends on the layer parameters; we have fast kernels for some commonly used configurations, such as a 3×3×3 filter size.
Other configurations fall back to a general default kernel implementation, which might be slow.
I am using the latest CUDA/cuDNN/TRT versions: CUDA 10.2, cuDNN 7.6.5, TRT 7.0.0.11.
I analyzed the network timing with nvprof again. The time is mainly concentrated in a Conv3d layer with a 3×3×3 filter size, 32 groups, stride 1 or 2, and padding 1. The input and output shapes are e.g. 128x8x28x28. With FP16 this layer runs implicit_convolveND_sgemm 32 times, and my model contains many such layers (33).
I tested the time consumption of part of the network: 2.5 ms with FP32, 2.8 ms with FP16.
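For reference, the spatial shapes above are consistent with the standard convolution output-size formula, out = floor((in + 2·pad − kernel) / stride) + 1. A quick sketch (plain Python, shape arithmetic only) showing why a 3×3×3 filter with padding 1 preserves the size at stride 1 and halves it at stride 2:

```python
def conv_out_size(n, k, s, p):
    # Standard convolution output-size formula:
    # out = floor((n + 2*p - k) / s) + 1
    return (n + 2 * p - k) // s + 1

# 3x3x3 filter, padding 1, as in the layer described above:
# stride 1 preserves the spatial size, stride 2 roughly halves it.
print(conv_out_size(28, 3, 1, 1))  # 28 -> 28 at stride 1
print(conv_out_size(28, 3, 2, 1))  # 28 -> 14 at stride 2
print(conv_out_size(8, 3, 1, 1))   # depth 8 -> 8 at stride 1
```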
Hi,
A dedicated kernel for 3D grouped convolution is currently not supported in TRT 7.
In TRT 7 we split the grouped convolution and call a kernel for each group. In your case, with 32 groups, the convolution runs 32 times, which might be what is causing the performance drop.
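Conceptually, this per-group splitting works as sketched below. This is an illustrative NumPy sketch, not TRT source code: for simplicity the per-group "conv" is a pointwise (1×1×1) convolution, i.e. a matmul over the channel dimension, but the channel bookkeeping (each of the G groups sees C_in/G input channels and produces C_out/G output channels, one kernel launch per group) is the same idea:

```python
import numpy as np

def conv_pointwise(x, w):
    # x: (C_in, D, H, W), w: (C_out, C_in) -> (C_out, D, H, W)
    # Stand-in for a single per-group convolution kernel call.
    return np.einsum('oc,cdhw->odhw', w, x)

def grouped_conv_split(x, w, groups):
    # x: (C_in, D, H, W); w: (C_out, C_in // groups), the usual
    # grouped-conv weight layout. Emulates splitting one grouped
    # conv into `groups` independent convolutions.
    c_in, c_out = x.shape[0], w.shape[0]
    cg_in, cg_out = c_in // groups, c_out // groups
    outs = []
    for g in range(groups):  # one kernel launch per group
        xg = x[g * cg_in:(g + 1) * cg_in]      # this group's input channels
        wg = w[g * cg_out:(g + 1) * cg_out]    # this group's filters
        outs.append(conv_pointwise(xg, wg))
    return np.concatenate(outs, axis=0)
```

With 32 groups this loop issues 32 separate launches, matching the 32 implicit_convolveND_sgemm invocations seen in the nvprof trace; a fused grouped-conv kernel would do the same work in one launch.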
I tested the 3D grouped convolution with cuDNN 7.6.5. Under nvprof, the implicit_convolveND_sgemm kernel still runs 32 times. Does cuDNN support a dedicated kernel for 3D grouped convolution?
That link isn’t up to date, though. There is currently INT8 support for 3D convolutions, but not according to your link, so I’m not sure whether I should take it as authoritative on Tensor Core support for 3D grouped convolutions. In fact, I’ve opened an issue suggesting there is none (No speedup from Tensor cores on 3d architecture with groupped convolutions · Issue #1198 · NVIDIA/TensorRT · GitHub), even though Tensor Core kernels do exist for 3D convolution.