I noticed what seems to be a bug in cuDNN. When running on a Volta card with CUDNN_DATA_HALF and CUDNN_TENSOR_NHWC, cudnnFindConvolutionForwardAlgorithm returns algo = 1, or algo = 0 with tensor cores. However, the convolution then fails to run with the returned algorithm if the number of output channels is not a multiple of 8.
My configuration is cuDNN 7.0.5 with CUDA 9.0 on 64-bit Linux.
I can give more information if necessary.
I believe cudnnFindConvolutionForwardAlgorithm should only return algorithms that actually work for the given tensor configuration.
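In the meantime, one workaround I can think of is to check the output channel count before relying on the returned algorithm and, where the model allows it, pad it up to the next multiple of 8. Below is a minimal sketch in plain C. The helper names and the TENSOR_OP_CHANNEL_ALIGNMENT constant are my own and not part of cuDNN, and the value 8 is only inferred from the failure described above, so this is an assumption rather than a confirmed fix.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed alignment, inferred from the observed failure; not a cuDNN constant. */
enum { TENSOR_OP_CHANNEL_ALIGNMENT = 8 };

/* Returns true when the channel count is a multiple of the assumed alignment. */
static bool channels_are_aligned(int channels)
{
    assert(channels > 0);
    return channels % TENSOR_OP_CHANNEL_ALIGNMENT == 0;
}

/* Rounds a channel count up to the next multiple of the assumed alignment. */
static int pad_channels(int channels)
{
    assert(channels > 0);
    const int a = TENSOR_OP_CHANNEL_ALIGNMENT;
    return (channels + a - 1) / a * a;
}
```

For example, pad_channels(10) gives 16, so a layer with 10 output channels would be allocated with 16 and the 6 extra channels ignored afterwards.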