cuDNN convolution performance (FP32, FP16, INT8) on the Jetson Xavier

Device

  • Jetson Xavier
  • D2 default power/clock

SW

  • CUDA : 11.4
  • cuDNN : 8.3.2.49

In my test environment, the cudnnConvolutionForward function is fastest in FP32, not in FP16 or INT8.
Below are the convolution settings for each mode.

FP32
checkCUDNN(cudnnSetTensor4dDescriptor(inTensorDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, batch_count, in_channel, in_height, in_width));
checkCUDNN(cudnnSetFilter4dDescriptor(filterDesc, CUDNN_DATA_FLOAT, CUDNN_TENSOR_NCHW, out_channel, in_channel, filter_height, filter_width));
checkCUDNN(cudnnSetConvolution2dDescriptor(convDesc, padding_h, padding_w, stride_vertical, stride_horizontal, 1, 1, CUDNN_CROSS_CORRELATION, CUDNN_DATA_FLOAT));

FP16
checkCUDNN(cudnnSetTensor4dDescriptor(inTensorDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_HALF, batch_count, in_channel, in_height, in_width));
checkCUDNN(cudnnSetFilter4dDescriptor(filterDesc, CUDNN_DATA_HALF, CUDNN_TENSOR_NCHW, out_channel, in_channel, filter_height, filter_width));
checkCUDNN(cudnnSetConvolution2dDescriptor(convDesc, padding_h, padding_w, stride_vertical, stride_horizontal, 1, 1, CUDNN_CROSS_CORRELATION, CUDNN_DATA_HALF));
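
For the pseudo-FP16 number in the Performance section below, the usual cuDNN setup is the same descriptors but with the compute type left at FP32 (half tensors, float accumulation); a rough sketch of that variant:

checkCUDNN(cudnnSetConvolution2dDescriptor(convDesc, padding_h, padding_w, stride_vertical, stride_horizontal, 1, 1, CUDNN_CROSS_CORRELATION, CUDNN_DATA_FLOAT)); // pseudo-FP16: HALF tensors, FLOAT compute/accumulation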

INT8
checkCUDNN(cudnnSetTensor4dDescriptor(inTensorDesc, CUDNN_TENSOR_NHWC, CUDNN_DATA_INT8, batch_count, in_channel, in_height, in_width));
checkCUDNN(cudnnSetFilter4dDescriptor(filterDesc, CUDNN_DATA_INT8, CUDNN_TENSOR_NHWC, out_channel, in_channel, filter_height, filter_width));
checkCUDNN(cudnnSetConvolution2dDescriptor(convDesc, padding_h, padding_w, stride_vertical, stride_horizontal, 1, 1, CUDNN_CONVOLUTION, CUDNN_DATA_INT32));

Common
int in_channel = 3; // 4 used for INT8 (the cuDNN INT8 paths require a channel count that is a multiple of 4)
int in_height = 1208;
int in_width = 1920;
int batch_count = 1;
int filter_width = 5;
int filter_height = 5;
int out_channel = 32;
int padding_w = 2;
int padding_h = 2;
int stride_horizontal = 2;
int stride_vertical = 2;
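
With these parameters, the output size can be checked with cudnnGetConvolution2dForwardOutputDim; for this configuration it works out to 1 x 32 x 604 x 960, from floor((in + 2*pad - filter)/stride) + 1. A small check, using the descriptors above:

int out_n, out_c, out_h, out_w;
checkCUDNN(cudnnGetConvolution2dForwardOutputDim(convDesc, inTensorDesc, filterDesc, &out_n, &out_c, &out_h, &out_w));
// out_h = (1208 + 2*2 - 5)/2 + 1 = 604, out_w = (1920 + 2*2 - 5)/2 + 1 = 960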

Performance
FP32 : 3.00 ms
FP16 : 7.08 ms (pseudo-FP16 : 2.75 ms)
INT8 : 9.75 ms
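
The timing code is not included above; a typical way to collect per-call numbers like these is with CUDA events around cudnnConvolutionForward, averaged over many iterations. A minimal sketch, assuming handle, algo, the device buffers, and the workspace are set up elsewhere:

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start);
checkCUDNN(cudnnConvolutionForward(handle, &alpha, inTensorDesc, d_in, filterDesc, d_filter, convDesc, algo, d_workspace, workspace_bytes, &beta, outTensorDesc, d_out));
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float elapsed_ms = 0.0f;
cudaEventElapsedTime(&elapsed_ms, start, stop); // single-call latency; average over many calls in practice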

Could you please check if I have set the parameters correctly?

Profiling with nvprof confirmed that volta_scudnn_128x32_relu_small_nn_v1 is called for the FP32 convolution, while void cudnn::detail::convolve_common_engine_int8_NHWC is called for the INT8 convolution. I’m guessing this is the cause of the performance difference. I expected a volta_icudnn kernel to be called for the INT8 convolution. How can I compute an INT8 convolution using volta_icudnn?

FP32
GPU activities: 98.16% 3.90047s 1300 3.0004ms 2.9940ms 3.0165ms volta_scudnn_128x32_relu_small_nn_v1

INT8
GPU activities: 99.74% 15.8828s 1302 12.199ms 12.195ms 12.206ms void cudnn::detail::convolve_common_engine_int8_NHWC<char, char, int=128, int=5, int=6, int=3, int=3, int=4, bool=0, bool=0, bool=0, bool=0, bool=0>(int, int, int, char const *, char const *, int, char*, conv_kernel_common_params, int, float, float, int, char*, char const * const *)

Hi

Looks like you’re using the CUDNN_TENSOR_NHWC layout for INT8. With NHWC, the cuDNN heuristics are already selecting the optimal kernel for that layout, convolve_common_engine_int8_NHWC, as shown in the profile results.

To use the volta_icudnn kernels, change the layout to CUDNN_TENSOR_NCHW_VECT_C; it should outperform the NHWC kernel in this case.
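
A minimal sketch of the descriptor changes for that path, reusing the variable names from the post above (note that CUDNN_TENSOR_NCHW_VECT_C requires the CUDNN_DATA_INT8x4 type, a channel count that is a multiple of 4, and data reordered into the vectorized layout, e.g. with cudnnTransformTensor):

checkCUDNN(cudnnSetTensor4dDescriptor(inTensorDesc, CUDNN_TENSOR_NCHW_VECT_C, CUDNN_DATA_INT8x4, batch_count, in_channel, in_height, in_width)); // in_channel must be a multiple of 4
checkCUDNN(cudnnSetFilter4dDescriptor(filterDesc, CUDNN_DATA_INT8x4, CUDNN_TENSOR_NCHW_VECT_C, out_channel, in_channel, filter_height, filter_width));
checkCUDNN(cudnnSetConvolution2dDescriptor(convDesc, padding_h, padding_w, stride_vertical, stride_horizontal, 1, 1, CUDNN_CROSS_CORRELATION, CUDNN_DATA_INT32));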

Thank you.
