Algorithm recommended by cudnnFindConvolutionForwardAlgorithm fails to run when output channels are not a multiple of 8

Hi,

I noticed what seems to be a bug in cuDNN. When running on a Volta card with CUDNN_DATA_HALF and CUDNN_TENSOR_NHWC, cudnnFindConvolutionForwardAlgorithm returns algo = 1, or algo = 0 with Tensor Cores. However, the returned algorithm fails to run if the number of output channels is not a multiple of 8.

My configuration is cuDNN 7.0.5 with CUDA 9.0 on 64-bit Linux.

I can give more information if necessary.

I believe cudnnFindConvolutionForwardAlgorithm should only return algorithms that actually run for the given configuration.

I have since run more experiments and found that it also fails when convolution groups are used.

These are the default parameters for my convolution:
batch_size = 256;
inputs = 128;
outputs = 128;
groups = 1;
height = 19;
width = 19;
kernel_height = 3;
kernel_width = 3;
padding = 1;
dilation = 1;
data_type = CUDNN_DATA_HALF;
tensor_format = CUDNN_TENSOR_NHWC;
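
For reference, here is how the output spatial size follows from these parameters. This is only a sketch of the usual convolution arithmetic: the stride is not listed above, so a stride of 1 is assumed, and conv_output_dim is a hypothetical helper name, not a cuDNN function.

```cpp
#include <cassert>

// Output extent of a convolution along one spatial axis.
// Assumption: `stride` is not given in the parameter list above; callers
// reproducing this setup would presumably pass 1.
int conv_output_dim(int input, int kernel, int padding, int dilation, int stride) {
    assert(input > 0 && kernel > 0 && dilation > 0 && stride > 0);
    // Extent covered by the kernel once dilation is applied.
    const int effective_kernel = (kernel - 1) * dilation + 1;
    return (input + 2 * padding - effective_kernel) / stride + 1;
}
```

Under that stride assumption, conv_output_dim(19, 3, 1, 1, 1) gives 19 for both height and width, so the spatial size is unchanged and only the channel count differs between the experiments.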

If I set outputs = 1 or outputs = 9, cudnnFindConvolutionForwardAlgorithm suggests an algorithm for which cudnnGetConvolutionForwardWorkspaceSize fails with CUDNN_STATUS_NOT_SUPPORTED. If I set outputs = 8, everything works.
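
As a stopgap, a caller could check the channel count before trusting the suggested algorithm. The sketch below assumes that "multiple of 8" really is the requirement, which is only inferred from the three values tested above. The helper names and the constant are hypothetical and not part of cuDNN.

```cpp
#include <cassert>

// Assumption: 8 is the channel alignment suggested by the observations above
// (1 and 9 fail, 8 works). It is not taken from a cuDNN header.
constexpr int kAssumedChannelAlignment = 8;

// True if the channel count meets the assumed alignment.
bool is_channel_count_aligned(int channels) {
    assert(channels > 0);
    return channels % kAssumedChannelAlignment == 0;
}

// Smallest aligned channel count that is >= channels.
int padded_channel_count(int channels) {
    assert(channels > 0);
    return (channels + kAssumedChannelAlignment - 1)
           / kAssumedChannelAlignment * kAssumedChannelAlignment;
}
```

With this check, outputs = 1 and outputs = 9 would be flagged and outputs = 8 or 128 would pass. Padding the output channels up to the next aligned value might work around the failure, but I have not verified that.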

If I set groups = 4, the algorithm recommended by cudnnFindConvolutionForwardAlgorithm always fails, regardless of the number of outputs.

It seems that groups only work with CUDNN_TENSOR_NCHW.
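
In the meantime, one way to cope is to stop taking the first suggestion blindly and walk the ranked results until one passes the workspace-size query. The sketch below shows only that selection logic, without linking against cuDNN. Candidate and first_usable_algo are hypothetical names. In real code each entry would presumably be a cudnnConvolutionFwdAlgoPerf_t (its algo and status fields), and the check would call cudnnGetConvolutionForwardWorkspaceSize and compare the result with CUDNN_STATUS_SUCCESS.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical stand-in for one entry of the ranked results.
struct Candidate {
    int algo;      // algorithm identifier (non-negative)
    bool find_ok;  // true if the find call reported success for this entry
};

// Returns the first candidate, in ranked order, that also passes the
// caller-supplied workspace check, or -1 if none does.
int first_usable_algo(const std::vector<Candidate>& ranked,
                      const std::function<bool(int)>& workspace_query_ok) {
    assert(workspace_query_ok && "a workspace check must be supplied");
    for (const Candidate& c : ranked) {
        if (!c.find_ok) {
            continue;  // the find call already rejected this entry
        }
        if (workspace_query_ok(c.algo)) {
            return c.algo;
        }
    }
    return -1;
}
```

This does not fix the underlying problem, since the ranking still contains entries that cannot run, but it avoids failing on the first one.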