Hi!
We are seeing a performance issue with int8 convolutions on a GTX 1080 Ti with the new cuDNN. From the internal cuDNN logs, cudnnConvolutionBiasActivationForward and cudnnConvolutionForward in cuDNN 8.2.4 internally call backendExecute with engine0. If we use the frontend API (via the wrapper over the backend), our convolution problem has several compatible engines on the GTX 1080 Ti (engine0, engine28, engine43), and engine28 and engine43 are much faster than engine0. On a Tesla T4 the legacy API chooses engine28 for cudnnConvolutionForward, as expected. Is there any way to force the legacy API functions to benchmark the available engines, or any other way to change the default engine? I also tried changing the algo, but it does not affect the execution time.
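For reference, this is roughly how we pin a specific engine by its global index through the cudnn-frontend wrapper (which is how we measured engine28/engine43 in the first place). A minimal sketch, assuming an already-built `handle`, operation graph `opGraph`, and variant pack `variantPack`; the helper name `run_with_engine` is ours, and error handling is omitted:

```cpp
#include <cudnn.h>
#include <cudnn_frontend.h>

// Sketch only: execute a finalized operation graph on one specific engine,
// selected by its global engine index (e.g. 28), instead of the default.
cudnnStatus_t run_with_engine(cudnnHandle_t handle,
                              cudnn_frontend::OperationGraph_v8 &opGraph,
                              cudnn_frontend::VariantPack_v8 &variantPack,
                              int64_t engine_idx /* e.g. 28 */) {
    // Build the engine descriptor for the requested global index.
    auto engine = cudnn_frontend::EngineBuilder_v8()
                      .setGlobalEngineIdx(engine_idx)
                      .setOperationGraph(opGraph)
                      .build();

    // Wrap it in an engine config (no knobs tuned here).
    auto engineConfig = cudnn_frontend::EngineConfigBuilder_v8()
                            .setEngine(engine)
                            .build();

    // Finalize an execution plan and run it via the backend.
    auto plan = cudnn_frontend::ExecutionPlanBuilder_v8()
                    .setHandle(handle)
                    .setEngineConfig(engineConfig, opGraph.getTag())
                    .build();

    return cudnnBackendExecute(handle, plan.get_raw_desc(),
                               variantPack.get_raw_desc());
}
```

This works for the frontend path; our question is whether the legacy entry points can be steered the same way.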
Under nvprof with cuDNN 7.5 we can see that cuDNN uses an optimized kernel for NCHW_VECT_C/INT8x4.
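For context, this is the kind of legacy-API descriptor setup that reaches the NCHW_VECT_C/INT8x4 path. A hedged sketch with illustrative dimensions (not an exact copy of our benchmark shapes), using the standard cuDNN descriptor calls, error handling omitted:

```cpp
#include <cudnn.h>

// Sketch only: legacy-API descriptor setup for the int8x4 / NCHW_VECT_C path.
// Dimensions are illustrative; int8 convolutions accumulate in INT32.
void setup_int8x4_conv(cudnnTensorDescriptor_t x, cudnnFilterDescriptor_t w,
                       cudnnConvolutionDescriptor_t conv,
                       cudnnTensorDescriptor_t y) {
    // Input: N=8, C=32, H=128, W=128, with 4 channels packed per element.
    cudnnSetTensor4dDescriptor(x, CUDNN_TENSOR_NCHW_VECT_C,
                               CUDNN_DATA_INT8x4, 8, 32, 128, 128);
    // Filter: K=32 output channels, C=32 input channels, 3x3 kernel.
    cudnnSetFilter4dDescriptor(w, CUDNN_DATA_INT8x4,
                               CUDNN_TENSOR_NCHW_VECT_C, 32, 32, 3, 3);
    // 3x3 conv, pad 1, stride 1, dilation 1, INT32 compute type.
    cudnnSetConvolution2dDescriptor(conv, 1, 1, 1, 1, 1, 1,
                                    CUDNN_CROSS_CORRELATION, CUDNN_DATA_INT32);
    // Output: same spatial size with pad 1.
    cudnnSetTensor4dDescriptor(y, CUDNN_TENSOR_NCHW_VECT_C,
                               CUDNN_DATA_INT8x4, 8, 32, 128, 128);
}
```

With these descriptors, cuDNN 7.5 picks the fast int8x4 kernel, while 8.2.4 apparently falls back to engine0.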
Here are the relative execution times of the same code under cuDNN 8.2.4 and cuDNN 7.5:
input 8x32x128x128, strides 1,1, dilation 1,1
filter 32x8x3x3
cuDNN 7.5:
| ns/op | op/s | err% | total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
| 10,762,681.08 | 92.91 | 1.7% | 1.32 | `NHWC_INT8_[IMPLICIT_PRECOMP_GEMM]_cudnnV7`
| 217,448.95 | 4,598.78 | 0.6% | 0.04 | `NCHW_VECT_C_INT8x4_[IMPLICIT_PRECOMP_GEMM]_cudnnV7`
| 468,830.60 | 2,132.97 | 0.5% | 0.06 | `NCHW_FP_[IMPLICIT_GEMM]_cudnnV7`
cuDNN 8.2.4:
| ns/op | op/s | err% | total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
| 10,676,776.25 | 93.66 | 2.2% | 1.31 | `NHWC_INT8_[IMPLICIT_PRECOMP_GEMM]_cudnnV7`
| 1,372,497.50 | 728.60 | 1.6% | 0.16 | `NCHW_VECT_C_INT8x4_[IMPLICIT_PRECOMP_GEMM]_cudnnV7`
| 399,049.75 | 2,505.95 | 0.4% | 0.06 | `NCHW_FP_[IMPLICIT_GEMM]_cudnnV7`