Hi!
We are seeing a performance issue with int8 convolutions on a GTX 1080 Ti with the new cuDNN. From the internal cuDNN logs, cudnnConvolutionBiasActivationForward and cudnnConvolutionForward in cuDNN 8.2.4 internally call backendExecute with engine0. If we use the frontend API (via the wrapper over the backend), our convolution problem has several compatible engines on the GTX 1080 Ti (engine0, engine28, engine43), and engine28 and engine43 are much faster than engine0. On a Tesla T4 the legacy API chooses engine28 for cudnnConvolutionForward, as expected. Is there any way to force the legacy API functions to benchmark the available engines, or any other way to change the default engine? I also tried changing the algo, but it does not affect the execution time.
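For reference, this is roughly how we pin a specific engine by its global index through the cudnn-frontend wrapper (which is how we measured engine28/engine43 in the first place). A minimal sketch, assuming an already-built `handle`, operation graph `opGraph`, and variant pack `variantPack`; the helper name `run_with_engine` is ours, and error handling is omitted:

```cpp
#include <cudnn.h>
#include <cudnn_frontend.h>

// Sketch only: execute a finalized operation graph on one specific engine,
// selected by its global engine index (e.g. 28), instead of the default.
cudnnStatus_t run_with_engine(cudnnHandle_t handle,
                              cudnn_frontend::OperationGraph_v8 &opGraph,
                              cudnn_frontend::VariantPack_v8 &variantPack,
                              int64_t engine_idx /* e.g. 28 */) {
    // Build the engine descriptor for the requested global index.
    auto engine = cudnn_frontend::EngineBuilder_v8()
                      .setGlobalEngineIdx(engine_idx)
                      .setOperationGraph(opGraph)
                      .build();

    // Wrap it in an engine config (no knobs tuned here).
    auto engineConfig = cudnn_frontend::EngineConfigBuilder_v8()
                            .setEngine(engine)
                            .build();

    // Finalize an execution plan and run it via the backend.
    auto plan = cudnn_frontend::ExecutionPlanBuilder_v8()
                    .setHandle(handle)
                    .setEngineConfig(engineConfig, opGraph.getTag())
                    .build();

    return cudnnBackendExecute(handle, plan.get_raw_desc(),
                               variantPack.get_raw_desc());
}
```

This works for the frontend path; our question is whether the legacy entry points can be steered the same way.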
Under nvprof with cuDNN 7.5 we can see that cuDNN uses an optimized kernel for NCHW_VECT_C/INT8x4.
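For context, this is the kind of legacy-API descriptor setup that reaches the NCHW_VECT_C/INT8x4 path. A hedged sketch with illustrative dimensions (not an exact copy of our benchmark shapes), using the standard cuDNN descriptor calls, error handling omitted:

```cpp
#include <cudnn.h>

// Sketch only: legacy-API descriptor setup for the int8x4 / NCHW_VECT_C path.
// Dimensions are illustrative; int8 convolutions accumulate in INT32.
void setup_int8x4_conv(cudnnTensorDescriptor_t x, cudnnFilterDescriptor_t w,
                       cudnnConvolutionDescriptor_t conv,
                       cudnnTensorDescriptor_t y) {
    // Input: N=8, C=32, H=128, W=128, with 4 channels packed per element.
    cudnnSetTensor4dDescriptor(x, CUDNN_TENSOR_NCHW_VECT_C,
                               CUDNN_DATA_INT8x4, 8, 32, 128, 128);
    // Filter: K=32 output channels, C=32 input channels, 3x3 kernel.
    cudnnSetFilter4dDescriptor(w, CUDNN_DATA_INT8x4,
                               CUDNN_TENSOR_NCHW_VECT_C, 32, 32, 3, 3);
    // 3x3 conv, pad 1, stride 1, dilation 1, INT32 compute type.
    cudnnSetConvolution2dDescriptor(conv, 1, 1, 1, 1, 1, 1,
                                    CUDNN_CROSS_CORRELATION, CUDNN_DATA_INT32);
    // Output: same spatial size with pad 1.
    cudnnSetTensor4dDescriptor(y, CUDNN_TENSOR_NCHW_VECT_C,
                               CUDNN_DATA_INT8x4, 8, 32, 128, 128);
}
```

With these descriptors, cuDNN 7.5 picks the fast int8x4 kernel, while 8.2.4 apparently falls back to engine0.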
Here are the relative execution times of the same code under cuDNN 8.2.4 and cuDNN 7.5:
input 8x32x128x128, strides 1,1, dilation 1,1
filter 32x8x3x3
cuDNN 7.5:
| ns/op | op/s | err% | total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
| 10,762,681.08 | 92.91 | 1.7% | 1.32 | `NHWC_INT8_[IMPLICIT_PRECOMP_GEMM]_cudnnV7`
| 217,448.95 | 4,598.78 | 0.6% | 0.04 | `NCHW_VECT_C_INT8x4_[IMPLICIT_PRECOMP_GEMM]_cudnnV7`
| 468,830.60 | 2,132.97 | 0.5% | 0.06 | `NCHW_FP_[IMPLICIT_GEMM]_cudnnV7`
cuDNN 8.2.4:
| ns/op | op/s | err% | total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
| 10,676,776.25 | 93.66 | 2.2% | 1.31 | `NHWC_INT8_[IMPLICIT_PRECOMP_GEMM]_cudnnV7`
| 1,372,497.50 | 728.60 | 1.6% | 0.16 | `NCHW_VECT_C_INT8x4_[IMPLICIT_PRECOMP_GEMM]_cudnnV7`
| 399,049.75 | 2,505.95 | 0.4% | 0.06 | `NCHW_FP_[IMPLICIT_GEMM]_cudnnV7`