Poor group convolution performance in fp16

Grouped convolutions are required for MobileNet. I understand that cuDNN has added support for fast fp16 grouped convolutions. Unfortunately, my benchmarks seem to contradict this.

I am running a convolution with N = 1, C = 64, H = 56, W = 56, K = 64, groups = 64 (depthwise convolution)

On the Jetson Nano, with the cuDNN that ships with JetPack, I get a runtime of 1.5 ms for HALF on the regular non-grouped convolution, which is slower than the 1.0 ms I get for FLOAT (which raises its own question: when is half faster than float?)

For the grouped convolution, HALF gives me a runtime of 10.4 ms, while FLOAT gives 12 ms. I don't believe these numbers are consistent with the official MobileNet-SSD inference benchmarks. What's going on here? I am using the following to set up my grouped convolution:

int group_count = 64;
cudnnSetConvolutionGroupCount(conv_desc, group_count);

I set the in-channel dimension of the filter descriptor to 1 (i.e., C / group_count).

Is cuDNN slower than TensorRT?

I think I figured out the issue, though I would appreciate confirmation from an NVIDIA employee.

NCHW is not supported for grouped convolution, or at least not for the fast path; you have to use NHWC.


Data format might matter.

Also, please note that the speedup of HALF over FLOAT differs from layer to layer; it depends on whether a given layer can be effectively accelerated with half-precision data.

Our benchmark results for MobileNet-SSD take all of the layers used into account.