How to get better conv performance with cuDNN?

Hi guys,

I’m wondering how to get better performance for FP16 forward convolution in cuDNN.

CUDA Version: 12.0
Device: A10
cuDNN version: 8.7
Docker environment:
Nvidia Driver version: 525.105.17
Torch Version: 1.14.0a0+44dac51

What I have tried:

  1. Ran an algorithm search with cudnnFindConvolutionForwardAlgorithm; it turns out the Tensor Core algorithm (CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM) with math type CUDNN_TENSOR_OP_MATH is the fastest
  2. Switched the compute type between float32 and float16; for some reason float32 as the compute type is faster
  3. Changed the tensor format from NCHW to NHWC
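For reference, the three steps above can be sketched roughly like this with the cuDNN 8.x API (a minimal sketch, not my full code; the padding/dilation of 1 and same-size output are assumptions, since they aren't part of my descriptor list below):

```cpp
// Sketch: NHWC fp16 tensors, float32 compute type, Tensor Core math,
// then an exhaustive forward-algorithm search. Error checking omitted.
#include <cudnn.h>
#include <cstdio>

int main() {
    cudnnHandle_t handle;
    cudnnCreate(&handle);

    cudnnTensorDescriptor_t x, y;
    cudnnFilterDescriptor_t w;
    cudnnConvolutionDescriptor_t conv;
    cudnnCreateTensorDescriptor(&x);
    cudnnCreateTensorDescriptor(&y);
    cudnnCreateFilterDescriptor(&w);
    cudnnCreateConvolutionDescriptor(&conv);

    // Step 3: NHWC layout with fp16 data (n=2, c=640, h=w=64, 3x3 filter).
    cudnnSetTensor4dDescriptor(x, CUDNN_TENSOR_NHWC, CUDNN_DATA_HALF, 2, 640, 64, 64);
    cudnnSetTensor4dDescriptor(y, CUDNN_TENSOR_NHWC, CUDNN_DATA_HALF, 2, 640, 64, 64);
    cudnnSetFilter4dDescriptor(w, CUDNN_DATA_HALF, CUDNN_TENSOR_NHWC, 640, 640, 3, 3);

    // Step 2: float32 compute type; pad=1, stride=1, dilation=1 assumed.
    cudnnSetConvolution2dDescriptor(conv, 1, 1, 1, 1, 1, 1,
                                    CUDNN_CROSS_CORRELATION, CUDNN_DATA_FLOAT);
    // Step 1: allow Tensor Core math.
    cudnnSetConvolutionMathType(conv, CUDNN_TENSOR_OP_MATH);

    // Exhaustive search; perf results come back sorted by time.
    int returned = 0;
    cudnnConvolutionFwdAlgoPerf_t perf[8];
    cudnnFindConvolutionForwardAlgorithm(handle, x, w, conv, y, 8, &returned, perf);
    for (int i = 0; i < returned; ++i)
        printf("algo %d: %f ms\n", perf[i].algo, perf[i].time);

    cudnnDestroyConvolutionDescriptor(conv);
    cudnnDestroyFilterDescriptor(w);
    cudnnDestroyTensorDescriptor(y);
    cudnnDestroyTensorDescriptor(x);
    cudnnDestroy(handle);
    return 0;
}
```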

And finally with these params:
n: 2
h: 64
w: 64
in channel: 640
out channel: 640
kernel size: (3,3)
stride: (1,1)

I got a GPU time cost of 0.442778 ms, which is indeed faster than torch (0.67805 ms).

Then I tried TensorRT by running trtexec --onnx=conv.onnx --fp16

It turns out the time cost can be as low as 0.363779 ms.

I believe TensorRT also uses cuDNN for conv, so there must be something else I can do to speed up my conv code.

Here is my code that does conv2dWithBias and the perf check. Just changing the suffix from .txt to .cu should make it compile:
test_cudnn_cu.txt (18.4 KB)

Here is my Python script that does torch conv speed test and generates conv.onnx
test_cudnn_py.txt (980 Bytes)


Can you try the latest version of cuDNN and check the performance? Also, it is not necessarily true that TensorRT uses cuDNN under the hood; you can confirm by looking at the kernels in Nsight Systems.