INT8 is ~30% slower than FP16 in cudnn_samples_v8/conv_sample

CUDA 11.2 + cuDNN 8.1.1
GPU: RTX 4000
OS: Ubuntu 20.04

FP16:

./conv_sample -mathType1 -filterFormat1 -n32 -c32 -h300 -w300 -k32 -r3 -s3 -pad_h1 -pad_w1 -u1 -v1 -b -x

Executing: conv_sample -mathType1 -filterFormat1 -n32 -c32 -h300 -w300 -k32 -r3 -s3 -pad_h1 -pad_w1 -u1 -v1 -b -x
Using format CUDNN_TENSOR_NHWC (for INT8x4 and INT8x32 tests use CUDNN_TENSOR_NCHW_VECT_C)
Testing single precision
====USER DIMENSIONS====
input dims are 32, 32, 300, 300
filter dims are 32, 32, 3, 3
output dims are 32, 32, 300, 300
====PADDING DIMENSIONS====
padded input dims are 32, 32, 300, 300
padded filter dims are 32, 32, 3, 3
padded output dims are 32, 32, 300, 300
Testing conv
^^^^ CUDA : elapsed = 0.353999 sec,  
Testing half precision (math in single precision)
====USER DIMENSIONS====
input dims are 32, 32, 300, 300
filter dims are 32, 32, 3, 3
output dims are 32, 32, 300, 300
====PADDING DIMENSIONS====
padded input dims are 32, 32, 300, 300
padded filter dims are 32, 32, 3, 3
padded output dims are 32, 32, 300, 300
Testing conv
^^^^ CUDA : elapsed = 0.00474882 sec,

INT8:

./conv_sample -mathType1 -filterFormat2 -dataType2 -n32 -c32 -h300 -w300 -k32 -r3 -s3 -pad_h1 -pad_w1 -u1 -v1 -b -x
Executing: conv_sample -mathType1 -filterFormat2 -dataType2 -n32 -c32 -h300 -w300 -k32 -r3 -s3 -pad_h1 -pad_w1 -u1 -v1 -b -x
Using format CUDNN_TENSOR_NCHW_VECT_C (for single and double precision tests use a different format)
Testing int8x4 (math in int32)
====USER DIMENSIONS====
input dims are 32, 32, 300, 300
filter dims are 32, 32, 3, 3
output dims are 32, 32, 300, 300
====PADDING DIMENSIONS====
padded input dims are 32, 32, 300, 300
padded filter dims are 32, 32, 3, 3
padded output dims are 32, 32, 300, 300
Testing conv
^^^^ CUDA : elapsed = 0.327955 sec,  
Testing int8x32 (math in int32)
====USER DIMENSIONS====
input dims are 32, 32, 300, 300
filter dims are 32, 32, 3, 3
output dims are 32, 32, 300, 300
====PADDING DIMENSIONS====
padded input dims are 32, 32, 300, 300
padded filter dims are 32, 32, 3, 3
padded output dims are 32, 32, 300, 300
Testing conv
^^^^ CUDA : elapsed = 0.00628805 sec,
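
Comparing the two tensor-core-eligible runs above: 0.00628805 / 0.00474882 ≈ 1.32, so the int8x32 convolution takes roughly 30% longer than the half-precision one (that is where the figure in the title comes from).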

I believe the RTX 4000 has INT8 Tensor Cores, so why is INT8 slower than FP16?
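
For reference, a quick way to confirm the Tensor Core generation is to query the compute capability with the CUDA runtime API. A minimal sketch (the file name check_cc.cu is made up), assuming only the CUDA toolkit is installed:

// check_cc.cu - print the device's compute capability.
// Hypothetical build line: nvcc -o check_cc check_cc.cu
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop{};
    if (cudaGetDeviceProperties(&prop, /*device=*/0) != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceProperties failed\n");
        return 1;
    }
    // A Turing Quadro RTX 4000 reports 7.5, a generation that does have
    // INT8 Tensor Cores (the IMMA path used by the int8x32 test).
    std::printf("%s: compute capability %d.%d\n",
                prop.name, prop.major, prop.minor);
    return 0;
}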

ping

@shshao Yes! It's a known issue in PyTorch.
For the RTX 4000 you should use CUDA 11.8 (official support) and cuDNN 8.7!
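
After upgrading, it's also worth confirming which versions the sample actually links against. A minimal sketch (the file name versions.cpp is made up, and it assumes the cuDNN headers and library are on the include/link paths) that prints the compile-time and runtime versions:

// versions.cpp - report the CUDA runtime and cuDNN versions in use.
// Hypothetical build line: nvcc -o versions versions.cpp -lcudnn
#include <cstdio>
#include <cuda_runtime.h>
#include <cudnn.h>

int main() {
    int runtime = 0;
    cudaRuntimeGetVersion(&runtime);  // e.g. 11080 for CUDA 11.8

    // Header versions come from compile time, the others from the libraries
    // actually loaded; a mismatch usually means an older install is still
    // first on the library path.
    std::printf("CUDA  header %d, runtime %d\n", CUDART_VERSION, runtime);
    std::printf("cuDNN header %d, runtime %zu\n", CUDNN_VERSION, cudnnGetVersion());
    return 0;
}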

Thank you so much! Would you mind giving more details for future reference, such as which GPUs this happens on and how CUDA 11.8 + cuDNN 8.7 fixed it?