Device
- Jetson Xavier
- D2 default power/clock
SW
- CUDA : 11.4
- cuDNN : 8.3.2.49
In my test environment, cudnnConvolutionForward runs fastest in FP32, not in FP16 or INT8 as I expected.
Below are the convolution settings for each mode.
FP32
checkCUDNN(cudnnSetTensor4dDescriptor(inTensorDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, batch_count, in_channel, in_height, in_width));
checkCUDNN(cudnnSetFilter4dDescriptor(filterDesc, CUDNN_DATA_FLOAT, CUDNN_TENSOR_NCHW, out_channel, in_channel, filter_height, filter_width));
checkCUDNN(cudnnSetConvolution2dDescriptor(convDesc, padding_h, padding_w, stride_vertical, stride_horizontal, 1, 1, CUDNN_CROSS_CORRELATION, CUDNN_DATA_FLOAT));
FP16
checkCUDNN(cudnnSetTensor4dDescriptor(inTensorDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_HALF, batch_count, in_channel, in_height, in_width));
checkCUDNN(cudnnSetFilter4dDescriptor(filterDesc, CUDNN_DATA_HALF, CUDNN_TENSOR_NCHW, out_channel, in_channel, filter_height, filter_width));
checkCUDNN(cudnnSetConvolution2dDescriptor(convDesc, padding_h, padding_w, stride_vertical, stride_horizontal, 1, 1, CUDNN_CROSS_CORRELATION, CUDNN_DATA_HALF));
INT8
checkCUDNN(cudnnSetTensor4dDescriptor(inTensorDesc, CUDNN_TENSOR_NHWC, CUDNN_DATA_INT8, batch_count, in_channel, in_height, in_width));
checkCUDNN(cudnnSetFilter4dDescriptor(filterDesc, CUDNN_DATA_INT8, CUDNN_TENSOR_NHWC, out_channel, in_channel, filter_height, filter_width));
checkCUDNN(cudnnSetConvolution2dDescriptor(convDesc, padding_h, padding_w, stride_vertical, stride_horizontal, 1, 1, CUDNN_CONVOLUTION, CUDNN_DATA_INT32));
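The INT8 path above uses NHWC and 4 input channels instead of 3, which matches the cuDNN requirement that INT8 feature maps have a channel count that is a multiple of 4. A minimal sketch of how a 3-channel NHWC buffer could be zero-padded to 4 channels on the host follows. The helper names round_up and pad_channels_nhwc are hypothetical, and this is an assumption about the data preparation, not code from the original test.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Round a channel count up to the next multiple of `align`
// (align = 4 for the INT8 configuration discussed here).
int round_up(int value, int align) {
    assert(align > 0);
    return (value + align - 1) / align * align;
}

// Hypothetical helper: copy an NHWC int8 tensor into a new buffer that has
// `c_padded` channels per pixel, zero-filling the extra channels.
std::vector<int8_t> pad_channels_nhwc(const std::vector<int8_t>& src,
                                      int n, int h, int w, int c, int c_padded) {
    assert(c_padded >= c);
    const std::size_t pixels = static_cast<std::size_t>(n) * h * w;
    assert(src.size() == pixels * c);
    std::vector<int8_t> dst(pixels * c_padded, 0);
    for (std::size_t p = 0; p < pixels; ++p) {
        for (int ch = 0; ch < c; ++ch) {
            dst[p * c_padded + ch] = src[p * c + ch];
        }
    }
    return dst;
}
```

With c = 3, round_up(3, 4) gives 4, so each pixel gains one zero channel. The filter weights would need the same padding along the input-channel axis.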
Common
int in_channel = 3; // 4 is used for INT8
int in_height = 1208;
int in_width = 1920;
int batch_count = 1;
int filter_width = 5;
int filter_height = 5;
int out_channel = 32;
int padding_w = 2;
int padding_h = 2;
int stride_horizontal = 2;
int stride_vertical = 2;
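For anyone checking the parameters, the output size implied by these values can be worked out by hand. The sketch below applies the output-dimension formula given in the cuDNN documentation for cudnnGetConvolution2dForwardOutputDim. The function name conv_out_dim is hypothetical, and dilation is assumed to be 1, as in the descriptors above.

```cpp
#include <cassert>

// Output extent of a convolution along one axis:
//   out = 1 + (in + 2*pad - ((filter - 1) * dilation + 1)) / stride
// Integer division truncates, which matches the documented formula.
int conv_out_dim(int in, int pad, int filter, int stride, int dilation = 1) {
    assert(stride > 0 && dilation > 0);
    return 1 + (in + 2 * pad - ((filter - 1) * dilation + 1)) / stride;
}
```

With in_height = 1208, in_width = 1920, a 5x5 filter, padding 2 and stride 2, conv_out_dim gives 604 for the height and 960 for the width, so the output tensor should be described as 1 x 32 x 604 x 960.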
Performance
FP32 : 3.00 ms
FP16 : 7.08 ms (pseudo : 2.75 ms)
INT8 : 9.75 ms
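To put these timings in perspective, they can be converted into an effective throughput. The sketch below is a rough estimate only: it assumes a 604 x 960 output (which follows from the standard output-size formula with the padding and stride listed under Common), counts one multiply-accumulate as two operations, and ignores memory traffic. The function names are hypothetical.

```cpp
#include <cassert>

// Multiply-accumulates for one forward pass: N * K * C * outH * outW * R * S.
long long conv_macs(int n, int k, int c, int out_h, int out_w, int r, int s) {
    return 1LL * n * k * c * out_h * out_w * r * s;
}

// Effective throughput in giga-operations per second,
// counting one multiply-accumulate as two operations.
double effective_gops(long long macs, double elapsed_ms) {
    assert(elapsed_ms > 0.0);
    return 2.0 * static_cast<double>(macs) / (elapsed_ms * 1.0e6);
}
```

Under those assumptions the layer is about 1.39 G multiply-accumulates with 3 input channels, which works out to roughly 928 GFLOP/s for FP32 at 3.00 ms, 393 GFLOP/s for FP16 at 7.08 ms, and 1012 GFLOP/s for the pseudo FP16 run at 2.75 ms. With 4 input channels, the INT8 run at 9.75 ms comes to roughly 381 GOP/s.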
Could you please check if I have set the parameters correctly?