TensorRT 3 grouped deconvolution slower than non-grouped

With TensorRT 3.0.4, cuDNN 7, and CUDA 9, I’ve found that a model using grouped deconvolutions is about twice as slow as the same model with non-grouped deconvolutions. I originally trained my model with grouped deconvolutions in MXNet, then padded the weights with zeros to express the same operation as non-grouped deconvolutions, and found that the non-grouped model ran twice as fast. Is this expected? I assumed the grouped version would need significantly fewer operations, since I’m only using these layers to bilinearly upsample my feature maps.
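For anyone trying to reproduce the comparison, here is a minimal NumPy sketch of the zero-padding step described above. It is not the original poster's script. It assumes the grouped weights use the layout (in_channels, out_channels / num_group, kH, kW), which is my understanding of MXNet's Deconvolution operator, and the function names are made up for illustration. The naive reference deconvolution (no padding) is only there to check that the expanded weights give the same output.

```python
import numpy as np

def expand_grouped_deconv_weight(w_grouped, num_group):
    """Zero-pad grouped deconvolution weights into dense (non-grouped) weights.

    Assumed layout: w_grouped is (C_in, C_out // num_group, kH, kW).
    Returns an array of shape (C_in, C_out, kH, kW).
    """
    c_in, c_out_per_group, kh, kw = w_grouped.shape
    if c_in % num_group != 0:
        raise ValueError("in_channels must be divisible by num_group")
    c_in_per_group = c_in // num_group
    w_dense = np.zeros((c_in, c_out_per_group * num_group, kh, kw),
                       dtype=w_grouped.dtype)
    for g in range(num_group):
        i0, i1 = g * c_in_per_group, (g + 1) * c_in_per_group
        o0, o1 = g * c_out_per_group, (g + 1) * c_out_per_group
        # Input channels of group g only feed output channels of group g;
        # every other (input, output) pair keeps a zero kernel.
        w_dense[i0:i1, o0:o1] = w_grouped[i0:i1]
    return w_dense

def deconv2d(x, w, stride):
    """Naive non-grouped transposed convolution without padding.

    x: (C_in, H, W), w: (C_in, C_out, kH, kW).
    """
    c_in, h, width = x.shape
    _, c_out, kh, kw = w.shape
    out = np.zeros((c_out, (h - 1) * stride + kh, (width - 1) * stride + kw),
                   dtype=x.dtype)
    for i in range(c_in):
        for r in range(h):
            for c in range(width):
                # Each input pixel scatters a scaled copy of its kernels.
                out[:, r * stride:r * stride + kh,
                    c * stride:c * stride + kw] += x[i, r, c] * w[i]
    return out

def grouped_deconv2d(x, w_grouped, stride, num_group):
    """Reference grouped deconvolution: run each group separately, then concatenate."""
    c_in_per_group = x.shape[0] // num_group
    outs = []
    for g in range(num_group):
        sl = slice(g * c_in_per_group, (g + 1) * c_in_per_group)
        outs.append(deconv2d(x[sl], w_grouped[sl], stride))
    return np.concatenate(outs, axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((8, 5, 5))
    w_grouped = rng.standard_normal((8, 1, 4, 4))  # one group per channel
    w_dense = expand_grouped_deconv_weight(w_grouped, num_group=8)
    same = np.allclose(deconv2d(x, w_dense, 2),
                       grouped_deconv2d(x, w_grouped, 2, 8))
    print("outputs match:", same)
    print("dense / grouped weight count:", w_dense.size // w_grouped.size)
```

The dense weights hold num_group times as many values as the grouped ones, and all the extra values are zeros, so the non-grouped layer does num_group times the multiply-accumulates for the same result. That is why the slowdown of the grouped version is surprising.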

I found another forum post indicating that grouped deconvolution in TensorRT is implemented as one kernel invocation per feature channel. In my case that amounts to several hundred kernel invocations per layer instead of one. Is there a timeline from NVIDIA for a fix?

I have the same issue as moodie. Would also appreciate a fix for this.

We created a new “Deep Learning Training and Inference” section in Devtalk to improve the experience for deep learning, accelerated computing, and HPC users:
https://devtalk.nvidia.com/default/board/301/deep-learning-training-and-inference-/

We are moving active deep learning threads to the new section.

URLs for topics will not change with the re-categorization, so your bookmarks and links will continue to work as before.

-Siddharth

Please file a bug here: https://developer.nvidia.com/nvidia-developer-program
Please include the steps/files used to reproduce the problem along with the output of infer_device.