Grouped convolutions are required for MobileNet. I understand that cuDNN now includes support for fast fp16 grouped convolutions. Unfortunately, my benchmarks seem to contradict this.
I am running a convolution with N = 1, C = 64, H = 56, W = 56, K = 64, groups = 64 (i.e., a depthwise convolution).
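For concreteness, my descriptor setup looks roughly like the sketch below (variable names like x_desc and w_desc are mine, and the 3×3 kernel size is an assumption for illustration; the FLOAT runs use CUDNN_DATA_FLOAT in place of CUDNN_DATA_HALF):

// Input tensor: NCHW, N=1, C=64, H=56, W=56, fp16.
cudnnTensorDescriptor_t x_desc;
cudnnCreateTensorDescriptor(&x_desc);
cudnnSetTensor4dDescriptor(x_desc, CUDNN_TENSOR_NCHW, CUDNN_DATA_HALF,
                           /*n=*/1, /*c=*/64, /*h=*/56, /*w=*/56);

// Depthwise filter: K x (C / groups) x R x S = 64 x 1 x 3 x 3.
cudnnFilterDescriptor_t w_desc;
cudnnCreateFilterDescriptor(&w_desc);
cudnnSetFilter4dDescriptor(w_desc, CUDNN_DATA_HALF, CUDNN_TENSOR_NCHW,
                           /*k=*/64, /*c=*/1, /*r=*/3, /*s=*/3);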
On the Jetson Nano, with the cuDNN that ships with JetPack, the regular non-grouped convolution takes 1.5 ms with HALF, which is slower than the 1.0 ms I get with FLOAT (which raises its own question: when is half faster than float?).
For the grouped convolution, HALF gives me a runtime of 10.4 ms… while FLOAT gives 12 ms. That seems impossible: the depthwise convolution does roughly 1/64th of the multiply-accumulates of the dense convolution above yet runs about 7–12× slower, and the official inference benchmarks with MobileNet-SSD imply far better performance. What’s going on here? I am using the following to set my grouped convolutions:
int group_count = 64;
cudnnSetConvolutionGroupCount(conv_desc, group_count);
I also set the input-channel dimension of the filter descriptor to 1 (C / group_count), as grouped convolutions require.
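To complete the picture, the rest of my setup is roughly the following (again a sketch continuing the descriptors above; the padding, stride, and compute type are assumptions for illustration, and handle is a cudnnHandle_t created earlier):

// conv_desc referenced above: 3x3, pad 1, stride 1, dilation 1 (assumed).
// Compute type CUDNN_DATA_HALF requests true fp16 math; CUDNN_DATA_FLOAT
// would request fp16 storage with fp32 accumulation instead.
cudnnConvolutionDescriptor_t conv_desc;
cudnnCreateConvolutionDescriptor(&conv_desc);
cudnnSetConvolution2dDescriptor(conv_desc,
                                /*pad_h=*/1, /*pad_w=*/1,
                                /*stride_h=*/1, /*stride_w=*/1,
                                /*dilation_h=*/1, /*dilation_w=*/1,
                                CUDNN_CROSS_CORRELATION, CUDNN_DATA_HALF);
// ... then cudnnSetConvolutionGroupCount(conv_desc, group_count) as shown above.

// With pad 1 / stride 1 the output shape matches the input.
cudnnTensorDescriptor_t y_desc;
cudnnCreateTensorDescriptor(&y_desc);
cudnnSetTensor4dDescriptor(y_desc, CUDNN_TENSOR_NCHW, CUDNN_DATA_HALF,
                           1, 64, 56, 56);

// Ask cuDNN to benchmark its available algorithms and report the fastest
// for this configuration before timing the forward pass.
cudnnConvolutionFwdAlgoPerf_t perf;
int returned = 0;
cudnnFindConvolutionForwardAlgorithm(handle, x_desc, w_desc, conv_desc,
                                     y_desc, /*requestedAlgoCount=*/1,
                                     &returned, &perf);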
Is cuDNN slower than TensorRT?