How to get better conv performance with cuDNN?

Hi guys,

I’m wondering how to get better performance for FP16 forward convolution in cuDNN.

CUDA Version: 12.0
Device: A10
cuDNN version: 8.7
Docker environment:
Nvidia Driver version: 525.105.17
Torch Version: 1.14.0a0+44dac51

What I have tried:

  1. Ran an algorithm search with cudnnFindConvolutionForwardAlgorithm; it turns out the Tensor Core algorithm (CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM) with math type CUDNN_TENSOR_OP_MATH is the fastest
  2. Switched the compute type between float32 and float16; for some reason float32 as the compute type is faster
  3. Changed the tensor format from NCHW to NHWC
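For reference, the three steps above can be sketched roughly like this with the cuDNN 8.x API (a minimal sketch, not my full code; the padding/dilation of 1 and same-size output are assumptions, since they aren't part of my descriptor list below):

```cpp
// Sketch: NHWC fp16 tensors, float32 compute type, Tensor Core math,
// then an exhaustive forward-algorithm search. Error checking omitted.
#include <cudnn.h>
#include <cstdio>

int main() {
    cudnnHandle_t handle;
    cudnnCreate(&handle);

    cudnnTensorDescriptor_t x, y;
    cudnnFilterDescriptor_t w;
    cudnnConvolutionDescriptor_t conv;
    cudnnCreateTensorDescriptor(&x);
    cudnnCreateTensorDescriptor(&y);
    cudnnCreateFilterDescriptor(&w);
    cudnnCreateConvolutionDescriptor(&conv);

    // Step 3: NHWC layout with fp16 data (n=2, c=640, h=w=64, 3x3 filter).
    cudnnSetTensor4dDescriptor(x, CUDNN_TENSOR_NHWC, CUDNN_DATA_HALF, 2, 640, 64, 64);
    cudnnSetTensor4dDescriptor(y, CUDNN_TENSOR_NHWC, CUDNN_DATA_HALF, 2, 640, 64, 64);
    cudnnSetFilter4dDescriptor(w, CUDNN_DATA_HALF, CUDNN_TENSOR_NHWC, 640, 640, 3, 3);

    // Step 2: float32 compute type; pad=1, stride=1, dilation=1 assumed.
    cudnnSetConvolution2dDescriptor(conv, 1, 1, 1, 1, 1, 1,
                                    CUDNN_CROSS_CORRELATION, CUDNN_DATA_FLOAT);
    // Step 1: allow Tensor Core math.
    cudnnSetConvolutionMathType(conv, CUDNN_TENSOR_OP_MATH);

    // Exhaustive search; perf results come back sorted by time.
    int returned = 0;
    cudnnConvolutionFwdAlgoPerf_t perf[8];
    cudnnFindConvolutionForwardAlgorithm(handle, x, w, conv, y, 8, &returned, perf);
    for (int i = 0; i < returned; ++i)
        printf("algo %d: %f ms\n", perf[i].algo, perf[i].time);

    cudnnDestroyConvolutionDescriptor(conv);
    cudnnDestroyFilterDescriptor(w);
    cudnnDestroyTensorDescriptor(y);
    cudnnDestroyTensorDescriptor(x);
    cudnnDestroy(handle);
    return 0;
}
```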

And finally with these params:
n: 2
h: 64
w: 64
in channel: 640
out channel: 640
kernel size: (3,3)
stride: (1,1)

I got a GPU time cost of 0.442778 ms, which is indeed faster than torch (0.67805 ms).

Then I tried TensorRT by running trtexec --onnx=conv.onnx --fp16

It turns out the time cost can be as low as 0.363779 ms.

I believe TensorRT also uses cuDNN for conv, so there must be something else I can do to speed up my conv code.

Here is my code that does conv2dWithBias and the perf check. Just changing the suffix from .txt to .cu should make it compile:
test_cudnn_cu.txt (18.4 KB)

Here is my Python script that does torch conv speed test and generates conv.onnx
test_cudnn_py.txt (980 Bytes)


Can you try the latest version of cuDNN and check the performance? Also, it is not necessarily true that TensorRT uses cuDNN under the hood; you can confirm by looking at the kernels in Nsight Systems.