cuDNN convolution performance (FP32, FP16, INT8) on the Jetson Xavier

Device

  • Jetson Xavier
  • D2 default power/clock

SW

  • CUDA : 11.4
  • cuDNN : 8.3.2.49

In my test environment, the cudnnConvolutionForward function is fastest in FP32, not in FP16 or INT8.
Below are the convolution settings for each mode.

FP32
checkCUDNN(cudnnSetTensor4dDescriptor(inTensorDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, batch_count, in_channel, in_height, in_width));
checkCUDNN(cudnnSetFilter4dDescriptor(filterDesc, CUDNN_DATA_FLOAT, CUDNN_TENSOR_NCHW, out_channel, in_channel, filter_height, filter_width));
checkCUDNN(cudnnSetConvolution2dDescriptor(convDesc, padding_h, padding_w, stride_vertical, stride_horizontal, 1, 1, CUDNN_CROSS_CORRELATION, CUDNN_DATA_FLOAT));

FP16
checkCUDNN(cudnnSetTensor4dDescriptor(inTensorDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_HALF, batch_count, in_channel, in_height, in_width));
checkCUDNN(cudnnSetFilter4dDescriptor(filterDesc, CUDNN_DATA_HALF, CUDNN_TENSOR_NCHW, out_channel, in_channel, filter_height, filter_width));
checkCUDNN(cudnnSetConvolution2dDescriptor(convDesc, padding_h, padding_w, stride_vertical, stride_horizontal, 1, 1, CUDNN_CROSS_CORRELATION, CUDNN_DATA_HALF));
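
For the pseudo-FP16 number in the Performance section below, the usual cuDNN setup is the same descriptors but with the compute type left at FP32 (half tensors, float accumulation); a rough sketch of that variant:

checkCUDNN(cudnnSetConvolution2dDescriptor(convDesc, padding_h, padding_w, stride_vertical, stride_horizontal, 1, 1, CUDNN_CROSS_CORRELATION, CUDNN_DATA_FLOAT)); // pseudo-FP16: HALF tensors, FLOAT compute/accumulation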

INT8
checkCUDNN(cudnnSetTensor4dDescriptor(inTensorDesc, CUDNN_TENSOR_NHWC, CUDNN_DATA_INT8, batch_count, in_channel, in_height, in_width));
checkCUDNN(cudnnSetFilter4dDescriptor(filterDesc, CUDNN_DATA_INT8, CUDNN_TENSOR_NHWC, out_channel, in_channel, filter_height, filter_width));
checkCUDNN(cudnnSetConvolution2dDescriptor(convDesc, padding_h, padding_w, stride_vertical, stride_horizontal, 1, 1, CUDNN_CONVOLUTION, CUDNN_DATA_INT32));

Common
int in_channel = 3; // 4 used for INT8 (the cuDNN INT8 paths require a channel count that is a multiple of 4)
int in_height = 1208;
int in_width = 1920;
int batch_count = 1;
int filter_width = 5;
int filter_height = 5;
int out_channel = 32;
int padding_w = 2;
int padding_h = 2;
int stride_horizontal = 2;
int stride_vertical = 2;
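
With these parameters, the output size can be checked with cudnnGetConvolution2dForwardOutputDim; for this configuration it works out to 1 x 32 x 604 x 960, from floor((in + 2*pad - filter)/stride) + 1. A small check, using the descriptors above:

int out_n, out_c, out_h, out_w;
checkCUDNN(cudnnGetConvolution2dForwardOutputDim(convDesc, inTensorDesc, filterDesc, &out_n, &out_c, &out_h, &out_w));
// out_h = (1208 + 2*2 - 5)/2 + 1 = 604, out_w = (1920 + 2*2 - 5)/2 + 1 = 960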

Performance
FP32 : 3.00 ms
FP16 : 7.08 ms (pseudo-FP16 : 2.75 ms)
INT8 : 9.75 ms
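
The timing code is not included above; a typical way to collect per-call numbers like these is with CUDA events around cudnnConvolutionForward, averaged over many iterations. A minimal sketch, assuming handle, algo, the device buffers, and the workspace are set up elsewhere:

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start);
checkCUDNN(cudnnConvolutionForward(handle, &alpha, inTensorDesc, d_in, filterDesc, d_filter, convDesc, algo, d_workspace, workspace_bytes, &beta, outTensorDesc, d_out));
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float elapsed_ms = 0.0f;
cudaEventElapsedTime(&elapsed_ms, start, stop); // single-call latency; average over many calls in practice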

Could you please check if I have set the parameters correctly?

Profiling with nvprof confirmed that volta_scudnn_128x32_relu_small_nn_v1 is called for the FP32 convolution, while void cudnn::detail::convolve_common_engine_int8_NHWC is called for the INT8 convolution. I’m guessing this is the cause of the performance difference. I expected a volta_icudnn kernel to be called for the INT8 convolution. How can I compute an INT8 convolution using volta_icudnn?

FP32
GPU activities: 98.16% 3.90047s 1300 3.0004ms 2.9940ms 3.0165ms volta_scudnn_128x32_relu_small_nn_v1

INT8
GPU activities: 99.74% 15.8828s 1302 12.199ms 12.195ms 12.206ms void cudnn::detail::convolve_common_engine_int8_NHWC<char, char, int=128, int=5, int=6, int=3, int=3, int=4, bool=0, bool=0, bool=0, bool=0, bool=0>(int, int, int, char const *, char const *, int, char*, conv_kernel_common_params, int, float, float, int, char*, char const * const *)

Hi

Looks like you’re using the CUDNN_TENSOR_NHWC layout for INT8. With NHWC, the cuDNN heuristics are already selecting the optimal kernel for that layout, convolve_common_engine_int8_NHWC, as shown in the profile results.

To use the volta_icudnn kernels, change the layout to CUDNN_TENSOR_NCHW_VECT_C; it should outperform the NHWC kernel in this case.
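
A minimal sketch of the descriptor changes for that path, reusing the variable names from the post above (note that CUDNN_TENSOR_NCHW_VECT_C requires the CUDNN_DATA_INT8x4 type, a channel count that is a multiple of 4, and data reordered into the vectorized layout, e.g. with cudnnTransformTensor):

checkCUDNN(cudnnSetTensor4dDescriptor(inTensorDesc, CUDNN_TENSOR_NCHW_VECT_C, CUDNN_DATA_INT8x4, batch_count, in_channel, in_height, in_width)); // in_channel must be a multiple of 4
checkCUDNN(cudnnSetFilter4dDescriptor(filterDesc, CUDNN_DATA_INT8x4, CUDNN_TENSOR_NCHW_VECT_C, out_channel, in_channel, filter_height, filter_width));
checkCUDNN(cudnnSetConvolution2dDescriptor(convDesc, padding_h, padding_w, stride_vertical, stride_horizontal, 1, 1, CUDNN_CROSS_CORRELATION, CUDNN_DATA_INT32));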

Thank you.
